Files
natsdotnet/docs/plans/2026-02-23-sections-7-10-gaps-design.md
2026-02-23 00:17:35 -05:00

227 lines
8.5 KiB
Markdown

# Sections 7-10 Gaps Design: Monitoring, TLS, Logging, Ping/Pong
**Date:** 2026-02-23
**Scope:** Implement remaining gaps in differences.md sections 7 (Monitoring), 8 (TLS), 9 (Logging), 10 (Ping/Pong)
**Goal:** Go parity for all features within scope
---
## Section 7: Monitoring
### 7a. `/subz` Endpoint
Replace the empty stub with a full `SubszHandler`.
**Models:**
- `Subsz` — response envelope: `Id`, `Now`, `SublistStats`, `Total`, `Offset`, `Limit`, `Subs[]`
- `SubszOptions``Offset`, `Limit`, `Subscriptions` (bool for detail), `Account` (filter), `Test` (literal subject filter)
- Reuse existing `SubDetail` from Connz
**Algorithm:**
1. Iterate all accounts (or filter by `Account` param)
2. Collect all subscriptions from each account's SubList
3. If `Test` subject provided, filter using `SubjectMatch.MatchLiteral()` to only return subs that would receive that message
4. Apply pagination (offset/limit)
5. If `Subscriptions` is true, include `SubDetail[]` array
**SubList stats** — add a `Stats()` method to `SubList` returning `SublistStats` (count, cache size, inserts, removes, matches, cache hits).
**Files:** New `Monitoring/SubszHandler.cs`, `Monitoring/Subsz.cs`. Modify `MonitorServer.cs`, `SubList.cs`.
### 7b. Connz `ByStop` / `ByReason` Sorting
Add two missing sort options for closed connection queries.
- Add `ByStop` and `ByReason` to `SortOpt` enum
- Parse `sort=stop` and `sort=reason` in query params
- Validate: these sorts only work with `state=closed` — return error if used with open connections
### 7c. Connz State Filtering & Closed Connections
Track closed connections and support state-based filtering.
**Closed connection tracking:**
- `ClosedClient` record: `Cid`, `Ip`, `Port`, `Start`, `Stop`, `Reason`, `Name`, `Lang`, `Version`, `InMsgs`, `OutMsgs`, `InBytes`, `OutBytes`, `NumSubs`, `Rtt`, `TlsVersion`, `TlsCipherSuite`
- `ConcurrentQueue<ClosedClient>` on `NatsServer` (capped at 10,000 entries)
- Populate in `RemoveClient()` from client state before disposal
**State filter:**
- Parse `state=open|closed|all` query param
- `open` (default): current live connections only
- `closed`: only from closed connections list
- `all`: merge both
**Files:** Modify `NatsServer.cs`, `ConnzHandler.cs`, new `Monitoring/ClosedClient.cs`.
### 7d. Varz Slow Consumer Stats
Already at parity. `SlowConsumersStats` is populated from `ServerStats` counters. No changes needed.
---
## Section 8: TLS
### 8a. TLS Rate Limiting
Already implemented via `TlsRateLimiter` (semaphore + periodic refill timer). Wired into `AcceptClientAsync`. Only a unit test needed.
### 8b. TLS Cert-to-User Mapping (TlsMap)
Full DN parsing using .NET built-in `X500DistinguishedName`.
**New `TlsMapAuthenticator`:**
- Implements `IAuthenticator`
- Receives the list of configured `User` objects
- On `Authenticate()`:
1. Extract `X509Certificate2` from auth context (passed from `TlsConnectionState`)
2. Parse subject DN via `cert.SubjectName` (`X500DistinguishedName`)
3. Build normalized DN string from RDN components
4. Try exact DN match against user map (key = DN string)
5. If no exact match, try CN-only match
6. Return `AuthResult` with matched user's permissions
**Auth context extension:**
- Add `X509Certificate2? ClientCertificate` to `ClientAuthContext`
- Pass certificate from `TlsConnectionState` in `ProcessConnectAsync`
**AuthService integration:**
- When `options.TlsMap && options.TlsVerify`, add `TlsMapAuthenticator` to authenticator chain
- TlsMap auth runs before other authenticators (cert-based auth takes priority)
**Files:** New `Auth/TlsMapAuthenticator.cs`. Modify `Auth/AuthService.cs`, `Auth/ClientAuthContext.cs`, `NatsClient.cs`.
---
## Section 9: Logging
### 9a. File Logging with Rotation
**New options on `NatsOptions`:**
- `LogFile` (string?) — path to log file
- `LogSizeLimit` (long) — file size in bytes before rotation (0 = unlimited)
- `LogMaxFiles` (int) — max retained rotated files (0 = unlimited)
**CLI flags:** `--log_file`, `--log_size_limit`, `--log_max_files`
**Serilog config:** Add `WriteTo.File()` with `fileSizeLimitBytes` and `retainedFileCountLimit` when `LogFile` is set.
### 9b. Debug/Trace Modes
**New options on `NatsOptions`:**
- `Debug` (bool) — enable debug-level logging
- `Trace` (bool) — enable trace/verbose-level logging
**CLI flags:** `-D` (debug), `-V` or `-T` (trace), `-DV` (both)
**Serilog config:**
- Default: `MinimumLevel.Information()`
- `-D`: `MinimumLevel.Debug()`
- `-V`/`-T`: `MinimumLevel.Verbose()`
### 9c. Color Output
Auto-detect TTY via `Console.IsOutputRedirected`.
- TTY: use `Serilog.Sinks.Console` with `AnsiConsoleTheme.Code`
- Non-TTY: use `ConsoleTheme.None`
Matches Go's behavior of disabling color when stderr is not a terminal.
### 9d. Timestamp Format Control
**New options on `NatsOptions`:**
- `Logtime` (bool, default true) — include timestamps
- `LogtimeUTC` (bool, default false) — use UTC format
**CLI flags:** `--logtime` (true/false), `--logtime_utc`
**Output template adjustment:**
- With timestamps: `[{Timestamp:yyyy/MM/dd HH:mm:ss.ffffff} {Level:u3}] {Message:lj}{NewLine}{Exception}`
- Without timestamps: `[{Level:u3}] {Message:lj}{NewLine}{Exception}`
- UTC: set `Serilog.Formatting` culture to UTC
### 9e. Log Reopening (SIGUSR1)
When file logging is configured:
- SIGUSR1 handler calls `ReOpenLogFile()` on the server
- `ReOpenLogFile()` flushes and closes current Serilog logger, creates new one with same config
- This enables external log rotation tools (logrotate)
**Files:** Modify `NatsOptions.cs`, `Program.cs`, `NatsServer.cs`.
---
## Section 10: Ping/Pong
### 10a. RTT Tracking
**New fields on `NatsClient`:**
- `_rttStartTicks` (long) — UTC ticks when PING sent
- `_rtt` (long) — computed RTT in ticks
- `Rtt` property (TimeSpan) — computed from `_rtt`
**Logic:**
- In `RunPingTimerAsync`, before writing PING: `_rttStartTicks = DateTime.UtcNow.Ticks`
- In `DispatchCommandAsync` PONG handler: compute `_rtt = DateTime.UtcNow.Ticks - _rttStartTicks` (min 1 tick)
- `computeRTT()` helper ensures minimum 1 tick (handles clock granularity on Windows)
**Monitoring exposure:**
- Populate `ConnInfo.Rtt` as formatted string (e.g., `"1.234ms"`)
- Add `ByRtt` sort option to Connz
### 10b. RTT-Based First PING Delay
**New state on `NatsClient`:**
- `_firstPongSent` flag in `ClientFlags`
**Logic in `RunPingTimerAsync`:**
- Before first PING, check: `_firstPongSent || timeSinceStart > 2 seconds`
- If neither condition met, skip this PING cycle
- Set `_firstPongSent` on first PONG after CONNECT (in PONG handler)
This prevents the server from sending PING (for RTT) before the client has had a chance to respond to the initial INFO with CONNECT+PING.
### 10c. Stale Connection Stats
**New model:**
- `StaleConnectionStats``Clients`, `Routes`, `Gateways`, `Leafs` (matching Go)
**ServerStats extension:**
- Add `StaleConnectionClients`, `StaleConnectionRoutes`, etc. fields
- Increment in `MarkClosed(StaleConnection)` based on connection kind
**Varz exposure:**
- Add `StaleConnectionStats` field to `Varz`
- Populate from `ServerStats` counters
**Files:** Modify `NatsClient.cs`, `ServerStats.cs`, `Varz.cs`, `VarzHandler.cs`, `Connz.cs`, `ConnzHandler.cs`.
---
## Test Coverage
Each section includes unit tests:
| Feature | Test File | Tests |
|---------|-----------|-------|
| Subz endpoint | SubszHandlerTests.cs | Empty response, with subs, account filter, test subject filter, pagination |
| Connz closed state | ConnzHandlerTests.cs | State=closed, ByStop sort, ByReason sort, validation errors |
| TLS rate limiter | TlsRateLimiterTests.cs | Rate enforcement, refill behavior |
| TlsMap auth | TlsMapAuthenticatorTests.cs | DN matching, CN fallback, no match |
| File logging | LoggingTests.cs | File creation, rotation on size limit |
| RTT tracking | ClientTests.cs | RTT computed on PONG, exposed in connz, ByRtt sort |
| First PING delay | ClientTests.cs | PING delayed until first PONG or 2s |
| Stale stats | ServerTests.cs | Stale counters incremented, exposed in varz |
---
## Parallelization Strategy
These work streams are independent and can be developed by parallel subagents:
1. **Monitoring stream** (7a, 7b, 7c): SubszHandler + Connz closed connections + state filter
2. **TLS stream** (8b): TlsMapAuthenticator
3. **Logging stream** (9a-9e): All logging improvements
4. **Ping/Pong stream** (10a-10c): RTT tracking + first PING delay + stale stats
Streams 1-4 touch different files with minimal overlap. The only shared touch point is `NatsOptions.cs` (new options for logging and ping/pong), which can be handled by one stream first and the others will build on it.