# Sections 7-10 Gaps Design: Monitoring, TLS, Logging, Ping/Pong
**Date:** 2026-02-23

**Scope:** Implement the remaining gaps in `differences.md` sections 7 (Monitoring), 8 (TLS), 9 (Logging), and 10 (Ping/Pong)

**Goal:** Go parity for all features within scope

---
## Section 7: Monitoring
### 7a. `/subz` Endpoint
Replace the empty stub with a full `SubszHandler`.

**Models:**

- `Subsz`: response envelope with `Id`, `Now`, `SublistStats`, `Total`, `Offset`, `Limit`, `Subs[]`
- `SubszOptions`: `Offset`, `Limit`, `Subscriptions` (bool; include per-subscription detail), `Account` (filter), `Test` (literal subject filter)
- Reuse the existing `SubDetail` from Connz

**Algorithm:**

1. Iterate over all accounts (or only the one named by the `Account` param)
2. Collect all subscriptions from each account's `SubList`
3. If a `Test` subject is provided, keep only the subscriptions that would receive a message published to that literal subject (using `SubjectMatch.MatchLiteral()`)
4. Apply pagination (offset/limit)
5. If `Subscriptions` is true, include the `SubDetail[]` array
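
Steps 1, 3, and 4 can be sketched as a filter-then-page query. This is a minimal sketch, assuming standard NATS subject semantics (`*` matches exactly one token, `>` matches one or more trailing tokens). `SubInfo` and `SubszQuery` are hypothetical, and `MatchLiteral` here only stands in for `SubjectMatch.MatchLiteral()`, whose real signature may differ. The sketch reports `Total` as the filtered count before paging.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical flattened view of one subscription; the real SubList entries differ.
public sealed record SubInfo(string Account, string Subject, string? Queue);

public static class SubszQuery
{
    // Stand-in for SubjectMatch.MatchLiteral(): true if a message published on the
    // literal subject would reach a subscription on `pattern`.
    public static bool MatchLiteral(string literal, string pattern)
    {
        string[] lit = literal.Split('.');
        string[] pat = pattern.Split('.');
        for (int i = 0; i < pat.Length; i++)
        {
            if (pat[i] == ">")
                return lit.Length > i;           // '>' needs at least one remaining token
            if (i >= lit.Length)
                return false;                    // pattern is longer than the subject
            if (pat[i] != "*" && pat[i] != lit[i])
                return false;                    // literal token mismatch
        }
        return lit.Length == pat.Length;         // no leftover subject tokens
    }

    // Account filter, test-subject filter, then offset/limit paging.
    // Limit defaults are assumed to be applied by the caller.
    public static (int Total, List<SubInfo> Page) Run(
        IEnumerable<SubInfo> subs, string? account, string? test, int offset, int limit)
    {
        List<SubInfo> filtered = subs
            .Where(s => account is null || s.Account == account)
            .Where(s => test is null || MatchLiteral(test, s.Subject))
            .ToList();

        List<SubInfo> page = filtered.Skip(Math.Max(0, offset)).Take(Math.Max(0, limit)).ToList();
        return (filtered.Count, page);
    }
}
```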

**SubList stats:** add a `Stats()` method to `SubList` that returns `SublistStats` (subscription count, cache size, inserts, removes, matches, cache hits).
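
A sketch of the counters that could back `Stats()`, assuming hypothetical `SublistCounters`/`SublistStats` types updated with `Interlocked` on the insert, remove, and match paths. Go's `SublistStats` reports a cache hit rate rather than a raw hit count, so the snapshot derives one (hits divided by matches) alongside the raw counter.

```csharp
using System;
using System.Threading;

// Snapshot returned by a hypothetical SubList.Stats(); names loosely follow Go's SublistStats.
public sealed record SublistStats(
    long NumSubs, long NumCache, long NumInserts, long NumRemoves,
    long NumMatches, long CacheHits, double CacheHitRate);

// Counters a SubList could own; each hot-path event bumps one counter atomically.
public sealed class SublistCounters
{
    private long _subs, _inserts, _removes, _matches, _cacheHits;

    public void OnInsert() { Interlocked.Increment(ref _inserts); Interlocked.Increment(ref _subs); }

    public void OnRemove() { Interlocked.Increment(ref _removes); Interlocked.Decrement(ref _subs); }

    public void OnMatch(bool cacheHit)
    {
        Interlocked.Increment(ref _matches);
        if (cacheHit) Interlocked.Increment(ref _cacheHits);
    }

    // The cache size is passed in because the match cache itself lives in SubList.
    // Counters are read one at a time, so a snapshot under load is only approximately consistent.
    public SublistStats Snapshot(long cacheSize)
    {
        long matches = Interlocked.Read(ref _matches);
        long hits = Interlocked.Read(ref _cacheHits);
        double hitRate = matches > 0 ? Math.Min(1.0, (double)hits / matches) : 0.0;
        return new SublistStats(
            Interlocked.Read(ref _subs), cacheSize,
            Interlocked.Read(ref _inserts), Interlocked.Read(ref _removes),
            matches, hits, hitRate);
    }
}
```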

**Files:** New `Monitoring/SubszHandler.cs`, `Monitoring/Subsz.cs`. Modify `MonitorServer.cs`, `SubList.cs`.
### 7b. Connz `ByStop` / `ByReason` Sorting
Add two missing sort options for closed-connection queries.

- Add `ByStop` and `ByReason` to the `SortOpt` enum
- Parse `sort=stop` and `sort=reason` query params
- Validate: these sorts are valid only with `state=closed`; return an error for any other state
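
The parse-and-validate step might look like the sketch below. `ConnState`, the trimmed `SortOpt` enum, and returning an error string are assumptions for illustration; the real handler's error type and full option list are not specified here.

```csharp
using System;

public enum ConnState { Open, Closed, All }

// Subset of sort options for illustration; the real SortOpt enum has more members.
public enum SortOpt { ByCid, ByStart, BySubs, ByStop, ByReason }

public static class ConnzSort
{
    // Parses the `sort` query value and rejects closed-only sorts for other states.
    // Returns null on success, otherwise an error message for the HTTP response.
    public static string? TryParse(string? value, ConnState state, out SortOpt sort)
    {
        sort = SortOpt.ByCid; // default when `sort` is absent
        switch (value)
        {
            case null:
            case "":
                break;
            case "cid": sort = SortOpt.ByCid; break;
            case "start": sort = SortOpt.ByStart; break;
            case "subs": sort = SortOpt.BySubs; break;
            case "stop": sort = SortOpt.ByStop; break;
            case "reason": sort = SortOpt.ByReason; break;
            default: return $"invalid sorting option: {value}";
        }

        if ((sort == SortOpt.ByStop || sort == SortOpt.ByReason) && state != ConnState.Closed)
            return $"sort by {value} is only valid for closed connections";

        return null;
    }
}
```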
### 7c. Connz State Filtering & Closed Connections
Track closed connections and support state-based filtering.

**Closed connection tracking:**

- `ClosedClient` record: `Cid`, `Ip`, `Port`, `Start`, `Stop`, `Reason`, `Name`, `Lang`, `Version`, `InMsgs`, `OutMsgs`, `InBytes`, `OutBytes`, `NumSubs`, `Rtt`, `TlsVersion`, `TlsCipherSuite`
- `ConcurrentQueue<ClosedClient>` on `NatsServer`, capped at 10,000 entries (`ConcurrentQueue` has no capacity limit, so evict the oldest entries after each enqueue)
- Populate in `RemoveClient()` from the client's state before disposal
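
A minimal sketch of the capped history, assuming a trimmed `ClosedClient` record with only four of the fields listed above. The cap is enforced on insert by dequeuing the oldest entries; under concurrent inserts the count can briefly overshoot or dip by an entry, which is usually acceptable for monitoring data. `ClosedClientHistory` is a hypothetical name.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Trimmed ClosedClient (a subset of the fields listed above, for illustration only).
public sealed record ClosedClient(ulong Cid, DateTime Start, DateTime Stop, string Reason);

// Hypothetical bounded history of closed connections, oldest first.
public sealed class ClosedClientHistory
{
    private readonly ConcurrentQueue<ClosedClient> _queue = new();
    private readonly int _capacity;

    public ClosedClientHistory(int capacity = 10_000) => _capacity = capacity;

    public int Count => _queue.Count;

    // Called from RemoveClient() once the client's state has been copied.
    public void Add(ClosedClient client)
    {
        _queue.Enqueue(client);
        // Evict the oldest entries until back at the cap. Concurrent callers can race here,
        // so the count may briefly sit an entry or two above or below the cap.
        while (_queue.Count > _capacity && _queue.TryDequeue(out _)) { }
    }

    // Point-in-time copy for ConnzHandler; ConcurrentQueue enumerates in FIFO order.
    public IReadOnlyList<ClosedClient> Snapshot() => _queue.ToArray();
}
```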

**State filter:**

- Parse the `state=open|closed|all` query param
- `open` (default): current live connections only
- `closed`: only entries from the closed connections list
- `all`: merge both
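
The state filter can be sketched as a small parse-and-select step, assuming open and closed connections have already been projected to a common hypothetical `ConnRow` shape before sorting and paging. Rejecting an unknown `state` value with an exception is this sketch's choice, not something the design specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum ConnState { Open, Closed, All }

// Hypothetical common projection of an open or closed connection.
public sealed record ConnRow(ulong Cid, bool IsClosed);

public static class ConnzState
{
    // Maps the `state` query value; open is the default when absent.
    public static ConnState Parse(string? value) => value switch
    {
        null or "" or "open" => ConnState.Open,
        "closed" => ConnState.Closed,
        "all" => ConnState.All,
        _ => throw new ArgumentException($"invalid state: {value}"),
    };

    // Selects the rows to report for the requested state; All merges both sources.
    public static List<ConnRow> Select(ConnState state, IEnumerable<ConnRow> open, IEnumerable<ConnRow> closed) =>
        state switch
        {
            ConnState.Open => open.ToList(),
            ConnState.Closed => closed.ToList(),
            _ => open.Concat(closed).ToList(),
        };
}
```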

**Files:** Modify `NatsServer.cs`, `ConnzHandler.cs`; new `Monitoring/ClosedClient.cs`.
### 7d. Varz Slow Consumer Stats
Already at parity. `SlowConsumersStats` is populated from `ServerStats` counters. No changes needed.

---
## Section 8: TLS
### 8a. TLS Rate Limiting
Already implemented via `TlsRateLimiter` (a semaphore replenished by a periodic refill timer) and wired into `AcceptClientAsync`. Only a unit test is needed.
### 8b. TLS Cert-to-User Mapping (TlsMap)
Full DN parsing using the .NET built-in `X500DistinguishedName`.

**New `TlsMapAuthenticator`:**

- Implements `IAuthenticator`
- Receives the list of configured `User` objects
- On `Authenticate()`:
  1. Extract the `X509Certificate2` from the auth context (passed from `TlsConnectionState`)
  2. Parse the subject DN via `cert.SubjectName` (`X500DistinguishedName`)
  3. Build a normalized DN string from the RDN components
  4. Try an exact DN match against the user map (key = DN string)
  5. If there is no exact match, try a CN-only match
  6. Return an `AuthResult` with the matched user's permissions
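
Steps 3 to 5 can be sketched as a pure lookup, which keeps the matching logic testable without real certificates. This assumes the `(type, value)` RDN pairs were already extracted from `cert.SubjectName` (on .NET 7+, `X500DistinguishedName.EnumerateRelativeDistinguishedNames()` can supply them) in the same order used by the configured user names. `TlsMapLookup` and its normalization rules (uppercase attribute types, trimmed values, `,` separator, no escaped commas) are assumptions, not Go's exact RFC 2253 handling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper behind TlsMapAuthenticator: maps a certificate subject to a configured user name.
public sealed class TlsMapLookup
{
    private readonly HashSet<string> _users;

    public TlsMapLookup(IEnumerable<string> configuredUserNames) =>
        _users = new HashSet<string>(configuredUserNames.Select(NormalizeConfigured), StringComparer.Ordinal);

    // Step 3: normalized DN from (type, value) RDN components, e.g. ("cn", " alice ") -> "CN=alice".
    public static string Normalize(IEnumerable<(string Type, string Value)> rdns) =>
        string.Join(",", rdns.Select(r => $"{r.Type.Trim().ToUpperInvariant()}={r.Value.Trim()}"));

    // Steps 4-5: exact DN match first, then CN-only fallback. Returns null when nothing matches.
    public string? Match(IReadOnlyList<(string Type, string Value)> rdns)
    {
        string dn = Normalize(rdns);
        if (_users.Contains(dn))
            return dn;

        foreach (var (type, value) in rdns)
        {
            if (type.Trim().Equals("CN", StringComparison.OrdinalIgnoreCase) && _users.Contains(value.Trim()))
                return value.Trim();
        }
        return null;
    }

    // Config entries are either full DNs ("CN=alice, O=Acme") or bare CNs ("alice").
    // Naive split on ',' and '='; escaped commas inside values are not handled in this sketch.
    private static string NormalizeConfigured(string name)
    {
        if (!name.Contains('='))
            return name.Trim();
        var parts = name.Split(',')
            .Select(p => p.Split('=', 2))
            .Select(kv => (kv[0], kv.Length > 1 ? kv[1] : ""));
        return Normalize(parts);
    }
}
```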

**Auth context extension:**

- Add `X509Certificate2? ClientCertificate` to `ClientAuthContext`
- Pass the certificate from `TlsConnectionState` in `ProcessConnectAsync`

**AuthService integration:**

- When `options.TlsMap && options.TlsVerify`, add `TlsMapAuthenticator` to the authenticator chain
- TlsMap auth runs before the other authenticators (cert-based auth takes priority)

**Files:** New `Auth/TlsMapAuthenticator.cs`. Modify `Auth/AuthService.cs`, `Auth/ClientAuthContext.cs`, `NatsClient.cs`.

---
## Section 9: Logging
### 9a. File Logging with Rotation
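**New options on `NatsOptions`:**

- `LogFile` (string?): path to the log file
- `LogSizeLimit` (long): file size in bytes before rotation (0 = unlimited)
- `LogMaxFiles` (int): maximum number of retained rotated files (0 = unlimited)

**CLI flags:** `--log_file`, `--log_size_limit`, `--log_max_files`

**Serilog config:** When `LogFile` is set, add `WriteTo.File()` with `fileSizeLimitBytes`, `retainedFileCountLimit`, and `rollOnFileSizeLimit: true`. Without `rollOnFileSizeLimit`, Serilog stops writing to the file once the size limit is reached instead of rotating. Serilog uses `null` (not 0) for "unlimited" and defaults to a 1 GB size limit and 31 retained files, so map 0 to `null` explicitly.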
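
Because Serilog uses `null` rather than 0 for "unlimited", the options need translating before they reach `WriteTo.File()`. A minimal sketch of that translation, using a hypothetical `FileSinkArgs` record whose fields would be passed as `fileSizeLimitBytes`, `retainedFileCountLimit`, and `rollOnFileSizeLimit`; it avoids referencing Serilog so it stays self-contained. Serilog counts the active file toward `retainedFileCountLimit`, so whether `LogMaxFiles` needs adjusting for Go parity is left open here.

```csharp
using System;

// Hypothetical bundle of arguments for Serilog's WriteTo.File(); null means "unlimited" there.
public sealed record FileSinkArgs(string Path, long? FileSizeLimitBytes, int? RetainedFileCountLimit, bool RollOnFileSizeLimit);

public static class FileLogOptions
{
    // Returns null when file logging is disabled (no LogFile configured).
    public static FileSinkArgs? ToSinkArgs(string? logFile, long logSizeLimit, int logMaxFiles)
    {
        if (string.IsNullOrEmpty(logFile))
            return null;

        long? sizeLimit = logSizeLimit > 0 ? logSizeLimit : (long?)null; // 0 = unlimited
        int? maxFiles = logMaxFiles > 0 ? logMaxFiles : (int?)null;      // 0 = unlimited (passed through as-is otherwise)
        // Roll only when a size limit exists; without one there is nothing to trigger rotation.
        return new FileSinkArgs(logFile, sizeLimit, maxFiles, RollOnFileSizeLimit: sizeLimit is not null);
    }
}
```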
### 9b. Debug/Trace Modes
**New options on `NatsOptions`:**

- `Debug` (bool): enable debug-level logging
- `Trace` (bool): enable trace (verbose-level) logging

**CLI flags:** `-D` (debug), `-V` (trace), `-DV` (both). Do not alias trace to `-T`: in Go, `-T` is the short form of `--logtime` (see 9d).

**Serilog config:**

- Default: `MinimumLevel.Information()`
- `-D`: `MinimumLevel.Debug()`
- `-V`: `MinimumLevel.Verbose()` (also when combined with `-D`)
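
Resolving the two flags to one minimum level is a single rule: trace wins, since `Verbose` also admits `Debug` events. One parity caveat: Go checks debug and trace independently, so `-V` alone prints trace lines without debug lines, which a single Serilog minimum level cannot express without an extra filter. The sketch uses a local `MinLevel` enum named after Serilog's `LogEventLevel` members so it compiles without Serilog, plus a hypothetical parser for just `-D`, `-V`, and `-DV`.

```csharp
using System;
using System.Collections.Generic;

// Mirrors the Serilog LogEventLevel members used here (Serilog itself is not referenced).
public enum MinLevel { Verbose, Debug, Information }

public static class LogLevelFlags
{
    // Hypothetical parser for the short flags -D, -V, and the combined -DV; other args are ignored.
    public static (bool Debug, bool Trace) Parse(IEnumerable<string> args)
    {
        bool debug = false, trace = false;
        foreach (var arg in args)
        {
            switch (arg)
            {
                case "-D": debug = true; break;
                case "-V": trace = true; break;
                case "-DV": debug = true; trace = true; break;
            }
        }
        return (debug, trace);
    }

    // Trace wins because Verbose also admits Debug events.
    public static MinLevel Resolve(bool debug, bool trace) =>
        trace ? MinLevel.Verbose : debug ? MinLevel.Debug : MinLevel.Information;
}
```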
### 9c. Color Output
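Auto-detect a TTY by checking whether the stream the console sink writes to is redirected. Go logs to stderr, so for parity write console logs to stderr (Serilog's console sink writes to stdout unless `standardErrorFromLevel` is set) and check `Console.IsErrorRedirected`; `Console.IsOutputRedirected` only applies if logs stay on stdout.

- TTY: use `Serilog.Sinks.Console` with `AnsiConsoleTheme.Code`
- Non-TTY: use `ConsoleTheme.None`

This matches Go's behavior of disabling color when stderr is not a terminal.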
### 9d. Timestamp Format Control
**New options on `NatsOptions`:**

- `Logtime` (bool, default true): include timestamps
- `LogtimeUTC` (bool, default false): render timestamps in UTC

**CLI flags:** `--logtime` (true/false), `--logtime_utc`

**Output template adjustment:**

- With timestamps: `[{Timestamp:yyyy/MM/dd HH:mm:ss.ffffff} {Level:u3}] {Message:lj}{NewLine}{Exception}`
- Without timestamps: `[{Level:u3}] {Message:lj}{NewLine}{Exception}`
- UTC: `{Timestamp}` is a local `DateTimeOffset`, and a formatting culture cannot change its time zone. Instead, add a UTC timestamp property with a custom enricher and reference that property in the template (or use a `Serilog.Expressions` template with `UtcDateTime(@t)`).
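
A sketch of the template selection, assuming a custom enricher (a Serilog `ILogEventEnricher`) adds a UTC timestamp under the hypothetical property name `UtcTimestamp` when `LogtimeUTC` is set. Note that `{Level:u3}` renders Serilog's `Verbose` level as `VRB`, while Go labels trace lines `[TRC]`, so exact trace-line parity would also need a custom level label.

```csharp
using System;

public static class LogTemplates
{
    private const string Tail = "{Level:u3}] {Message:lj}{NewLine}{Exception}";
    private const string TimeFormat = "yyyy/MM/dd HH:mm:ss.ffffff";

    // logtime=false drops the timestamp; logtimeUtc switches to the enricher-provided
    // property (hypothetical name "UtcTimestamp") instead of Serilog's local {Timestamp}.
    public static string Build(bool logtime, bool logtimeUtc)
    {
        if (!logtime)
            return "[" + Tail;
        string property = logtimeUtc ? "UtcTimestamp" : "Timestamp";
        return "[{" + property + ":" + TimeFormat + "} " + Tail;
    }
}
```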
### 9e. Log Reopening (SIGUSR1)
When file logging is configured:

- A SIGUSR1 handler calls `ReOpenLogFile()` on the server
- `ReOpenLogFile()` flushes and closes the current Serilog logger, then creates a new one with the same config
- This lets external log rotation tools (e.g., logrotate) rename the file and have the server reopen a fresh one at the original path
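
.NET's `PosixSignal` enum has no `SIGUSR1` member, but `PosixSignalRegistration.Create` (.NET 6+) accepts raw signal numbers cast to `PosixSignal`; SIGUSR1 is 10 on Linux and 30 on macOS, and Windows has no equivalent. The sketch shows that registration plus the reopen step for a plain file writer. `ReopenableLogFile` is a hypothetical stand-in for rebuilding the Serilog logger, which in the server would be `Log.CloseAndFlush()` followed by creating a new logger from the same configuration. Keep a reference to the returned registration for the server's lifetime; disposing it removes the handler.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical stand-in for the server's file logger: a writer that can be reopened in place.
public sealed class ReopenableLogFile : IDisposable
{
    private readonly string _path;
    private readonly object _gate = new();
    private StreamWriter _writer;

    public ReopenableLogFile(string path)
    {
        _path = path;
        _writer = Open(path);
    }

    // FileShare.Delete lets rotation tools rename the file while it is open (matters on Windows).
    private static StreamWriter Open(string path) =>
        new(new FileStream(path, FileMode.Append, FileAccess.Write, FileShare.ReadWrite | FileShare.Delete)) { AutoFlush = true };

    public void WriteLine(string line)
    {
        lock (_gate) _writer.WriteLine(line);
    }

    // Equivalent of ReOpenLogFile(): flush and close the old handle, then open the path again.
    // After logrotate renames the file, this creates a fresh file at the original path.
    public void Reopen()
    {
        lock (_gate)
        {
            _writer.Flush();
            _writer.Dispose();
            _writer = Open(_path);
        }
    }

    // Registers SIGUSR1 on Linux/macOS; returns null where the signal does not exist.
    public PosixSignalRegistration? RegisterSigusr1()
    {
        int? signo = OperatingSystem.IsLinux() ? 10 : OperatingSystem.IsMacOS() ? 30 : (int?)null;
        if (signo is null)
            return null;
        return PosixSignalRegistration.Create((PosixSignal)signo.Value, ctx =>
        {
            ctx.Cancel = true; // skip the default action, which for SIGUSR1 terminates the process
            Reopen();
        });
    }

    public void Dispose()
    {
        lock (_gate) _writer.Dispose();
    }
}
```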

**Files:** Modify `NatsOptions.cs`, `Program.cs`, `NatsServer.cs`.

---
## Section 10: Ping/Pong
### 10a. RTT Tracking
**New fields on `NatsClient`:**

- `_rttStartTicks` (long): UTC ticks when the PING was sent
- `_rtt` (long): computed RTT in ticks
- `Rtt` property (TimeSpan): computed from `_rtt`

**Logic:**

- In `RunPingTimerAsync`, before writing PING: `_rttStartTicks = DateTime.UtcNow.Ticks`
- In the `DispatchCommandAsync` PONG handler: `_rtt = DateTime.UtcNow.Ticks - _rttStartTicks`, clamped to a minimum of 1 tick by a `computeRTT()` helper (mirroring Go's `computeRTT()`, which clamps to 1 ns), so coarse clock granularity on Windows never yields a zero RTT
- `_rtt` is written by the read loop and read by monitoring, so access it with `Interlocked`/`Volatile`
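
A sketch of the RTT state with one deliberate swap: it uses `Stopwatch.GetTimestamp()` (a monotonic clock) instead of `DateTime.UtcNow.Ticks`, so wall-clock adjustments cannot produce negative or inflated RTTs. `RttTracker` is a hypothetical name; the fields are accessed with `Interlocked` because the read loop writes them while monitoring reads them.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical holder for the RTT state described in 10a.
public sealed class RttTracker
{
    private long _rttStartTimestamp; // Stopwatch timestamp of the outstanding PING; 0 = none
    private long _rttTicks;          // last measured RTT in TimeSpan ticks (100 ns units)

    // Call right before writing PING.
    public void MarkPingSent() => Interlocked.Exchange(ref _rttStartTimestamp, Stopwatch.GetTimestamp());

    // Call from the PONG handler. Returns false if no PING was outstanding.
    public bool OnPong()
    {
        long start = Interlocked.Exchange(ref _rttStartTimestamp, 0);
        if (start == 0)
            return false;
        Interlocked.Exchange(ref _rttTicks, ComputeRttTicks(Stopwatch.GetTimestamp() - start));
        return true;
    }

    // TimeSpan.Zero until the first PONG has been measured.
    public TimeSpan Rtt => TimeSpan.FromTicks(Interlocked.Read(ref _rttTicks));

    // Converts Stopwatch units to TimeSpan ticks without overflow and clamps to at least
    // 1 tick, the analogue of Go's computeRTT() clamping to 1 ns.
    public static long ComputeRttTicks(long stopwatchElapsed)
    {
        long freq = Stopwatch.Frequency;
        long ticks = stopwatchElapsed / freq * TimeSpan.TicksPerSecond
                   + stopwatchElapsed % freq * TimeSpan.TicksPerSecond / freq;
        return Math.Max(1, ticks);
    }
}
```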

**Monitoring exposure:**

- Populate `ConnInfo.Rtt` as a formatted string in Go's `time.Duration` style (e.g., `"1.234ms"`)
- Add a `ByRtt` sort option to Connz
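
To match Go's output, `ConnInfo.Rtt` can be rendered with the same rules as Go's `time.Duration.String()`: below one second, the largest fitting unit (`ns`, `µs`, `ms`) with trailing zeros trimmed; seconds below one minute; and `h`/`m`/`s` components from one minute up. A sketch under those rules; `GoDuration` is a hypothetical helper, and since `TimeSpan` has 100 ns resolution, sub-microsecond values are always multiples of 100 ns.

```csharp
using System;
using System.Globalization;

public static class GoDuration
{
    private const long NsPerMinute = 60_000_000_000;
    private const long NsPerHour = 3_600_000_000_000;

    // Renders a non-negative TimeSpan the way Go's time.Duration.String() does, e.g.
    // "0s", "700ns", "12.3µs", "1.2345ms", "2.5s", "1m30s", "1h0m0s".
    // Negative values are out of scope (RTTs are clamped to >= 1 tick), as are spans
    // beyond Go's Duration range (~292 years), where Ticks * 100 would overflow.
    public static string Format(TimeSpan d)
    {
        long ns = d.Ticks * 100; // TimeSpan resolution is 100 ns
        if (ns == 0) return "0s";
        if (ns < 1_000) return ns.ToString(CultureInfo.InvariantCulture) + "ns";
        if (ns < 1_000_000) return Scaled(ns, 1_000m) + "\u00B5s";
        if (ns < 1_000_000_000) return Scaled(ns, 1_000_000m) + "ms";
        if (ns < NsPerMinute) return Scaled(ns, 1_000_000_000m) + "s";

        long hours = ns / NsPerHour;
        long minutes = ns % NsPerHour / NsPerMinute;
        string seconds = Scaled(ns % NsPerMinute, 1_000_000_000m);
        string prefix = hours > 0 ? hours.ToString(CultureInfo.InvariantCulture) + "h" : "";
        return prefix + minutes.ToString(CultureInfo.InvariantCulture) + "m" + seconds + "s";
    }

    // Exact decimal division, then trailing zeros trimmed as Go does.
    private static string Scaled(long ns, decimal unit) =>
        (ns / unit).ToString("0.#########", CultureInfo.InvariantCulture);
}
```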
### 10b. RTT-Based First PING Delay
**New state on `NatsClient`:**

- A `_firstPongSent` flag in `ClientFlags`

**Logic in `RunPingTimerAsync`:**

- Before sending the first PING, check `_firstPongSent || timeSinceStart > 2 seconds`
- If neither condition holds, skip this PING cycle
- Set `_firstPongSent` when the server sends its first PONG in reply to the client's PING after CONNECT (in the PING handler, as Go's `processPing()` does)
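
The gate reduces to one predicate. A sketch with the 2-second grace period as a named constant and the clock passed in for testability; `FirstPingGate` and its parameters are hypothetical names.

```csharp
using System;

public static class FirstPingGate
{
    // Grace period from the design: after this, send PINGs even without a first PONG.
    public static readonly TimeSpan MaxWaitForFirstPong = TimeSpan.FromSeconds(2);

    // True when the ping timer may send its PING this cycle.
    public static bool ShouldSendPing(bool firstPongSent, DateTime connectedAtUtc, DateTime nowUtc) =>
        firstPongSent || nowUtc - connectedAtUtc > MaxWaitForFirstPong;
}
```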

This prevents the server from sending PING (for RTT) before the client has had a chance to respond to the initial INFO with CONNECT+PING.
### 10c. Stale Connection Stats
**New model:**

- `StaleConnectionStats`: `Clients`, `Routes`, `Gateways`, `Leafs` (matching Go)

**ServerStats extension:**

- Add `StaleConnectionClients`, `StaleConnectionRoutes`, `StaleConnectionGateways`, and `StaleConnectionLeafs` counters
- In `MarkClosed(StaleConnection)`, increment the counter for the connection's kind

**Varz exposure:**

- Add a `StaleConnectionStats` field to `Varz`
- Populate it from the `ServerStats` counters
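
A sketch of the per-kind counters and the `Varz` snapshot, assuming a hypothetical `ConnKind` enum and plain `Interlocked` counters; the real `ServerStats` may store its counters differently.

```csharp
using System;
using System.Threading;

// Hypothetical connection kinds; only these four feed StaleConnectionStats.
public enum ConnKind { Client, Router, Gateway, Leaf }

// Shape of the Varz field; names match the model above.
public sealed record StaleConnectionStats(long Clients, long Routes, long Gateways, long Leafs);

public sealed class StaleCounters
{
    private long _clients, _routes, _gateways, _leafs;

    // Called from MarkClosed() when the close reason is StaleConnection.
    public void Increment(ConnKind kind)
    {
        switch (kind)
        {
            case ConnKind.Client: Interlocked.Increment(ref _clients); break;
            case ConnKind.Router: Interlocked.Increment(ref _routes); break;
            case ConnKind.Gateway: Interlocked.Increment(ref _gateways); break;
            case ConnKind.Leaf: Interlocked.Increment(ref _leafs); break;
            default: throw new ArgumentOutOfRangeException(nameof(kind));
        }
    }

    // Copied into Varz by VarzHandler.
    public StaleConnectionStats Snapshot() => new(
        Interlocked.Read(ref _clients), Interlocked.Read(ref _routes),
        Interlocked.Read(ref _gateways), Interlocked.Read(ref _leafs));
}
```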

**Files:** Modify `NatsClient.cs`, `ServerStats.cs`, `Varz.cs`, `VarzHandler.cs`, `Connz.cs`, `ConnzHandler.cs`.

---
## Test Coverage
Each section includes unit tests:

| Feature | Test file | Tests |
|---------|-----------|-------|
| Subz endpoint | `SubszHandlerTests.cs` | Empty response, with subs, account filter, test subject filter, pagination |
| Connz closed state | `ConnzHandlerTests.cs` | `state=closed`, `ByStop` sort, `ByReason` sort, validation errors |
| TLS rate limiter | `TlsRateLimiterTests.cs` | Rate enforcement, refill behavior |
| TlsMap auth | `TlsMapAuthenticatorTests.cs` | DN matching, CN fallback, no match |
| File logging | `LoggingTests.cs` | File creation, rotation on size limit |
| RTT tracking | `ClientTests.cs` | RTT computed on PONG, exposed in connz, `ByRtt` sort |
| First PING delay | `ClientTests.cs` | PING delayed until first PONG or 2s |
| Stale stats | `ServerTests.cs` | Stale counters incremented, exposed in varz |

---
## Parallelization Strategy
These work streams are independent enough to be developed by parallel subagents:

1. **Monitoring stream** (7a, 7b, 7c): `SubszHandler` + Connz closed connections + state filter
2. **TLS stream** (8b): `TlsMapAuthenticator`
3. **Logging stream** (9a-9e): all logging improvements
4. **Ping/Pong stream** (10a-10c): RTT tracking + first PING delay + stale stats

The streams mostly touch different files, but the file lists above overlap in a few places: `NatsServer.cs` (streams 1 and 3), `NatsClient.cs` (streams 2 and 4), and `ConnzHandler.cs` (streams 1 and 4). `NatsOptions.cs` also gains new options for logging and ping/pong, and `ClosedClient.Rtt` (7c) depends on the `Rtt` property from 10a. For each shared file, land one stream's change first and have the others build on it.