docs: add design for monitoring HTTP and TLS support

Covers /varz, /connz endpoints via Kestrel Minimal APIs,
full TLS support with four modes (none/required/first/mixed),
cert pinning, rate limiting, and testing strategy.
Joseph Doherty
2026-02-22 21:33:24 -05:00
parent 16b8f9e2e2
commit 8ee5a7f97b

@@ -0,0 +1,243 @@
# Monitoring HTTP & TLS Support Design
**Date:** 2026-02-22
**Scope:** Port monitoring endpoints (`/varz`, `/connz`) and full TLS support from Go NATS server
**Go Reference:** `golang/nats-server/server/monitor.go`, `server.go` (TLS), `client.go` (TLS), `opts.go`
## Overview
Two features ported from Go NATS:
1. **Monitoring HTTP**: Kestrel Minimal API embedded in `NatsServer`, serving `/varz`, `/connz`, `/healthz`, and stub endpoints. JSON output follows the Go schema exactly, for tooling compatibility.
2. **TLS Support**: `SslStream` wrapping with four modes: no TLS, TLS required, TLS-first, and mixed TLS/plaintext. Includes certificate pinning, client cert verification, and rate limiting.
## 1. Server-Level Stats Aggregation
New `ServerStats` class with atomic counters, replacing the need to sum across all clients on each `/varz` request.
### ServerStats Fields
```csharp
// src/NATS.Server/ServerStats.cs
using System.Collections.Concurrent;

// Counters are public fields (not properties) so they can be passed by ref
// to Interlocked.Increment / Interlocked.Add.
public sealed class ServerStats
{
    public long InMsgs;
    public long OutMsgs;
    public long InBytes;
    public long OutBytes;
    public long TotalConnections;
    public long SlowConsumers;
    public long StaleConnections;
    public long Stalls;
    public long SlowConsumerClients;
    public long SlowConsumerRoutes;
    public long SlowConsumerLeafs;
    public long SlowConsumerGateways;

    // HTTP request counts reported in the /varz response.
    public readonly ConcurrentDictionary<string, long> HttpReqStats = new();
}
```
### Integration Points
- `NatsServer` owns a `ServerStats` instance, passes it to each `NatsClient`
- `NatsClient.ProcessPub` increments server-level `InMsgs`/`InBytes` alongside client-level counters
- `NatsClient.SendMessageAsync` increments server-level `OutMsgs`/`OutBytes`
- Accept loop increments `TotalConnections`
- `NatsServer.StartTime` field added (set once at startup)
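
The integration points above could be sketched as follows. This is a minimal illustration, assuming plain `Interlocked` operations on the counter fields; `StatsSketch`, `StatsRecorder`, and `RecordInbound` are hypothetical names used only here, not part of the design.

```csharp
using System;
using System.Threading;

// Trimmed-down stand-in for ServerStats; only two counters are shown.
public sealed class StatsSketch
{
    public long InMsgs;
    public long InBytes;
}

public static class StatsRecorder
{
    // Hypothetical helper: what ProcessPub might do for each inbound message,
    // updating the server-level and client-level counters together.
    public static void RecordInbound(StatsSketch server, StatsSketch client, int payloadBytes)
    {
        Interlocked.Increment(ref server.InMsgs);
        Interlocked.Add(ref server.InBytes, payloadBytes);
        Interlocked.Increment(ref client.InMsgs);
        Interlocked.Add(ref client.InBytes, payloadBytes);
    }

    // Readers such as the /varz handler can use Interlocked.Read so that
    // 64-bit values are not torn on 32-bit runtimes.
    public static long ReadInMsgs(StatsSketch stats) => Interlocked.Read(ref stats.InMsgs);
}
```

With this shape, `/varz` reads a handful of fields instead of iterating over every connected client.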
## 2. Monitoring HTTP Endpoints
### HTTP Stack
Kestrel Minimal APIs via `FrameworkReference` to `Microsoft.AspNetCore.App`. No NuGet packages needed.
### Endpoints
| Path | Handler | Description |
|------|---------|-------------|
| `/` | `HandleRoot` | Links to all endpoints |
| `/varz` | `HandleVarz` | Server stats and config |
| `/connz` | `HandleConnz` | Connection info (paginated) |
| `/healthz` | `HandleHealthz` | Health check (200 OK) |
| `/routez` | stub | Returns `{}` |
| `/gatewayz` | stub | Returns `{}` |
| `/leafz` | stub | Returns `{}` |
| `/subz` | stub | Returns `{}` |
| `/accountz` | stub | Returns `{}` |
| `/jsz` | stub | Returns `{}` |

All paths support an optional base path prefix via the `MonitorBasePath` config option.
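
One way the base path prefix might be applied is sketched below. `MonitorPaths`, `Normalize`, and `Combine` are hypothetical helpers, and the normalization rules (trim whitespace and slashes, always emit one leading slash) are an assumption rather than something the design specifies.

```csharp
using System;

public static class MonitorPaths
{
    // Returns "" for a missing base path, otherwise "/segment" with no trailing slash.
    public static string Normalize(string? basePath)
    {
        if (string.IsNullOrWhiteSpace(basePath)) return "";
        var trimmed = basePath.Trim().Trim('/');
        return trimmed.Length == 0 ? "" : "/" + trimmed;
    }

    // Joins the normalized base path with an endpoint such as "/varz".
    public static string Combine(string? basePath, string endpoint)
        => Normalize(basePath) + "/" + endpoint.TrimStart('/');
}
```

Each route could then be registered with the combined path, for example `app.MapGet(MonitorPaths.Combine(options.MonitorBasePath, "/varz"), ...)`, where `options` stands for the `NatsOptions` instance.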
### Configuration
```csharp
// Added to NatsOptions
public int MonitorPort { get; set; } // 0 = disabled, CLI: -m
public string MonitorHost { get; set; } = "0.0.0.0";
public string? MonitorBasePath { get; set; }
public int MonitorHttpsPort { get; set; } // 0 = disabled
```
### Varz Model
Exact Go JSON field names. All fields from Go's `Varz` struct including nested config structs (`ClusterOptsVarz`, `GatewayOptsVarz`, `LeafNodeOptsVarz`, `MqttOptsVarz`, `WebsocketOptsVarz`, `JetStreamVarz`). Nested structs return defaults/zeros until those subsystems are ported.
Key field categories: identification, network config, security/limits, timing/lifecycle, runtime metrics (mem, CPU, cores), connection stats, message stats, health counters, subsystem configs, HTTP request stats.
### Connz Model
Paginated connection list with query parameter support:
- `sort` — sort field (cid, bytes_to, msgs_to, etc.)
- `subs` / `subs=detail` — include subscription lists
- `offset` / `limit` — pagination (default limit 1024)
- `state` — filter open/closed/all
- `auth` — include usernames

`ConnInfo` includes all Go fields: cid, kind, ip, port, start, last_activity, rtt, uptime, idle, pending, msg/byte stats, subscription count, client name/lang/version, TLS version/cipher, account.
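
The `sort`, `offset`, and `limit` handling could look roughly like this. It is a sketch under assumptions: `ConnSketch` and `ConnzPaging` are hypothetical, only two sort keys are shown, and the descending order for `bytes_to` is an assumed choice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down connection record; the real ConnInfo has many more fields.
public sealed record ConnSketch(ulong Cid, long OutBytes);

public static class ConnzPaging
{
    public const int DefaultLimit = 1024;

    // Sorts, then applies offset/limit. Total is the count before pagination.
    public static (IReadOnlyList<ConnSketch> Page, int Total) Apply(
        IEnumerable<ConnSketch> conns, string? sort, int offset, int limit)
    {
        var all = conns.ToList();
        IEnumerable<ConnSketch> ordered = sort switch
        {
            "bytes_to" => all.OrderByDescending(c => c.OutBytes), // assumed: largest first
            _ => all.OrderBy(c => c.Cid),                         // default: by cid
        };
        if (offset < 0) offset = 0;
        if (limit <= 0) limit = DefaultLimit;
        var page = ordered.Skip(offset).Take(limit).ToList();
        return (page, all.Count);
    }
}
```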
### Concurrency
- `HandleVarz` acquires a `SemaphoreSlim(1,1)` to serialize JSON building (matches Go's `varzMu`)
- `HandleConnz` snapshots `_clients.Values.ToArray()` to avoid holding the dictionary during serialization
- CPU percentage sampled via `Process.TotalProcessorTime` delta, cached for 1 second
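
The cached CPU sampling might be implemented along these lines. `CpuSampler` is a hypothetical name, and reporting the value as a percentage of a single core (so it can exceed 100 on multi-core machines) is an assumption.

```csharp
using System;
using System.Diagnostics;

public sealed class CpuSampler
{
    private readonly object _gate = new();
    private TimeSpan _lastCpu;
    private DateTime _lastSample;
    private double _cached;

    public CpuSampler()
    {
        using var process = Process.GetCurrentProcess();
        _lastCpu = process.TotalProcessorTime;
        _lastSample = DateTime.UtcNow;
    }

    // Returns the cached value if less than one second has passed since the last sample.
    public double GetPercent()
    {
        lock (_gate)
        {
            var now = DateTime.UtcNow;
            var elapsed = now - _lastSample;
            if (elapsed < TimeSpan.FromSeconds(1)) return _cached;

            using var process = Process.GetCurrentProcess();
            var cpu = process.TotalProcessorTime;
            // CPU time consumed over wall-clock time elapsed, as a percentage of one core.
            _cached = (cpu - _lastCpu).TotalMilliseconds / elapsed.TotalMilliseconds * 100.0;
            _lastCpu = cpu;
            _lastSample = now;
            return _cached;
        }
    }
}
```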
### NatsClient Additions for ConnInfo
```csharp
public DateTime StartTime { get; } // set in constructor
public DateTime LastActivity; // updated on every command dispatch
public string? RemoteIp { get; } // from socket.RemoteEndPoint
public int RemotePort { get; } // from socket.RemoteEndPoint
```
## 3. TLS Support
### Configuration
```csharp
// Added to NatsOptions
public string? TlsCert { get; set; }
public string? TlsKey { get; set; }
public string? TlsCaCert { get; set; }
public bool TlsVerify { get; set; }
public bool TlsMap { get; set; }
public double TlsTimeout { get; set; } = 2.0;
public bool TlsHandshakeFirst { get; set; }
public TimeSpan TlsHandshakeFirstFallback { get; set; } = TimeSpan.FromMilliseconds(50);
public bool AllowNonTls { get; set; }
public long TlsRateLimit { get; set; }
public HashSet<string>? TlsPinnedCerts { get; set; }
public SslProtocols TlsMinVersion { get; set; } = SslProtocols.Tls12;
```
CLI args: `--tls`, `--tlscert`, `--tlskey`, `--tlscacert`, `--tlsverify`
### INFO Message Changes
Three new fields on `ServerInfo`: `tls_required`, `tls_verify`, `tls_available`.
- `tls_required = (TlsConfig != null && !AllowNonTls)`
- `tls_verify = (TlsConfig != null && TlsVerify)`
- `tls_available = (TlsConfig != null && AllowNonTls)`
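
The three expressions above translate directly into code. In this sketch `TlsInfoFlags` and `TlsInfo.Compute` are hypothetical names, and `tlsConfigured` stands in for the `TlsConfig != null` check.

```csharp
using System;

public readonly record struct TlsInfoFlags(bool TlsRequired, bool TlsVerify, bool TlsAvailable);

public static class TlsInfo
{
    // Derives the INFO fields from the server's TLS configuration.
    public static TlsInfoFlags Compute(bool tlsConfigured, bool allowNonTls, bool tlsVerify) =>
        new(
            TlsRequired: tlsConfigured && !allowNonTls,
            TlsVerify: tlsConfigured && tlsVerify,
            TlsAvailable: tlsConfigured && allowNonTls);
}
```

Note that `tls_required` and `tls_available` are mutually exclusive: with TLS configured, exactly one of them is true.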
### Four TLS Modes
**Mode 1: No TLS.** Current behavior, unchanged.

**Mode 2: TLS Required.** Send INFO with `tls_required=true`. The client initiates TLS; the server detects the 0x16 byte (the TLS handshake record type), performs the `SslStream` handshake, validates pinned certs, and continues the protocol over the encrypted stream.

**Mode 3: TLS First.** Do NOT send INFO; wait up to `TlsHandshakeFirstFallback` (default 50 ms) for data. If a 0x16 byte arrives, perform the TLS handshake, then send INFO over the encrypted stream. On timeout or a non-TLS byte, fall back to the Mode 2 flow.

**Mode 4: Mixed.** Send INFO with `tls_available=true`, then peek at the first byte: 0x16 → TLS handshake; anything else → continue in plaintext.
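
The mode selection and first-byte decision could be captured in two small pure functions, sketched below. The enums and the `TlsModes` class are hypothetical, and giving `TlsHandshakeFirst` precedence over `AllowNonTls` is an assumption the design does not state.

```csharp
using System;

public enum TlsMode { None, Required, First, Mixed }

public enum FirstByteAction { Handshake, Plaintext, Reject, SendInfoThenRequireTls }

public static class TlsModes
{
    // First byte of a TLS record carrying a handshake message (ClientHello).
    public const byte TlsHandshakeRecord = 0x16;

    public static TlsMode Select(bool tlsConfigured, bool handshakeFirst, bool allowNonTls)
    {
        if (!tlsConfigured) return TlsMode.None;
        if (handshakeFirst) return TlsMode.First; // assumed precedence
        return allowNonTls ? TlsMode.Mixed : TlsMode.Required;
    }

    // Decides what to do once the first client byte has been peeked.
    public static FirstByteAction Classify(TlsMode mode, byte first)
    {
        bool isTls = first == TlsHandshakeRecord;
        return mode switch
        {
            TlsMode.None => FirstByteAction.Plaintext,
            TlsMode.Mixed => isTls ? FirstByteAction.Handshake : FirstByteAction.Plaintext,
            TlsMode.First => isTls ? FirstByteAction.Handshake : FirstByteAction.SendInfoThenRequireTls,
            _ => isTls ? FirstByteAction.Handshake : FirstByteAction.Reject,
        };
    }
}
```

Keeping these decisions free of I/O makes them easy to unit test separately from the socket handling.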
### Key Components
**`TlsHelper`** — static class for cert loading (`X509Certificate2` from PEM/PFX), CA cert loading, building `SslServerAuthenticationOptions`, pinned cert validation (SHA256 of SubjectPublicKeyInfo).
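
The pinned-cert check might look like the following sketch. `PinnedCerts` is a hypothetical name, and storing the pins as lowercase hex strings is an assumption about the format of `TlsPinnedCerts`.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

public static class PinnedCerts
{
    // SHA-256 over the DER-encoded SubjectPublicKeyInfo, as lowercase hex.
    public static string Fingerprint(X509Certificate2 cert)
    {
        byte[] spki = cert.PublicKey.ExportSubjectPublicKeyInfo();
        return Convert.ToHexString(SHA256.HashData(spki)).ToLowerInvariant();
    }

    // Assumes the pinned set already holds lowercase hex fingerprints.
    public static bool IsPinned(X509Certificate2 cert, ISet<string> pinned)
        => pinned.Contains(Fingerprint(cert));
}
```

Because the hash covers the public key rather than the whole certificate, a renewed certificate that reuses the same key pair still matches its pin.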
**`TlsConnectionWrapper`** — per-connection negotiation state machine. Takes socket + options, returns `(Stream stream, bool infoAlreadySent)`. Handles peek logic, timeout, handshake, cert validation.
**`PeekableStream`** — wraps `NetworkStream`, buffers peeked bytes, replays them on first `ReadAsync`. Required so `SslStream.AuthenticateAsServerAsync` sees the full TLS ClientHello including the peeked byte.
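
A minimal sketch of the peek-and-replay idea follows. `PeekableStreamSketch` is hypothetical and simplified: it assumes `PeekAsync` is called at most once, before any read.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public sealed class PeekableStreamSketch : Stream
{
    private readonly Stream _inner;
    private byte[] _peeked = Array.Empty<byte>();
    private int _replayed; // how many peeked bytes have been handed back so far

    public PeekableStreamSketch(Stream inner) => _inner = inner;

    // Reads up to `count` bytes from the inner stream and keeps them for replay.
    public async Task<byte[]> PeekAsync(int count, CancellationToken ct = default)
    {
        var buffer = new byte[count];
        int n = await _inner.ReadAsync(buffer.AsMemory(0, count), ct);
        _peeked = buffer[..n];
        _replayed = 0;
        return _peeked;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int remaining = _peeked.Length - _replayed;
        if (remaining == 0) return _inner.Read(buffer, offset, count);
        int n = Math.Min(remaining, count);
        Array.Copy(_peeked, _replayed, buffer, offset, n);
        _replayed += n;
        return n;
    }

    public override async ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken ct = default)
    {
        int remaining = _peeked.Length - _replayed;
        if (remaining == 0) return await _inner.ReadAsync(buffer, ct);
        int n = Math.Min(remaining, buffer.Length);
        _peeked.AsMemory(_replayed, n).CopyTo(buffer);
        _replayed += n;
        return n;
    }

    public override void Write(byte[] buffer, int offset, int count) => _inner.Write(buffer, offset, count);
    public override void Flush() => _inner.Flush();
    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => _inner.CanWrite;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}
```

The replay serves only the buffered bytes on the first read and leaves the rest to later reads, which is acceptable because stream readers must already handle short reads.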
**`TlsRateLimiter`** — token-bucket rate limiter. Refills `TlsRateLimit` tokens per second. `WaitAsync` blocks if no tokens. Only applies to TLS handshakes, not plain connections.
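
One possible shape for the token bucket is sketched here, assuming a `SemaphoreSlim` for the tokens and a one-second `Timer` for the refill. `TokenBucketSketch` is a hypothetical name, and it takes an `int` rate even though `TlsRateLimit` is declared as `long`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class TokenBucketSketch : IDisposable
{
    private readonly SemaphoreSlim _tokens;
    private readonly Timer _refill;
    private readonly int _perSecond;

    public TokenBucketSketch(int perSecond)
    {
        _perSecond = perSecond;
        _tokens = new SemaphoreSlim(perSecond, perSecond);
        _refill = new Timer(_ => Refill(), null, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(1));
    }

    // Tops the bucket back up to its capacity once per second.
    private void Refill()
    {
        try
        {
            int missing = _perSecond - _tokens.CurrentCount;
            if (missing > 0) _tokens.Release(missing);
        }
        catch (SemaphoreFullException) { /* overlapping refill already filled the bucket */ }
        catch (ObjectDisposedException) { /* timer fired during shutdown */ }
    }

    // Completes immediately when a token is available; otherwise waits for a refill.
    public Task WaitAsync(CancellationToken ct = default) => _tokens.WaitAsync(ct);

    public void Dispose()
    {
        _refill.Dispose();
        _tokens.Dispose();
    }
}
```

Waiting callers keep their TCP connection open until a token arrives, which matches the hold-open (rather than reject) behavior of the rate limiter.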
**`TlsConnectionState`** — post-handshake record: `TlsVersion`, `CipherSuite`, `PeerCert`. Stored on `NatsClient` for `/connz` reporting.
### NatsClient Changes
Constructor takes `Stream` instead of building `NetworkStream` internally. TLS negotiation happens before `NatsClient` is constructed. `NatsClient` receives the already-negotiated stream and `TlsConnectionState`.
### Accept Loop Changes
```
Accept socket
→ Increment TotalConnections
→ Rate limit check (if TLS configured)
→ TlsConnectionWrapper.NegotiateAsync (returns stream + infoAlreadySent)
→ Extract TlsConnectionState from SslStream if applicable
→ Construct NatsClient with stream + tlsState
→ client.InfoAlreadySent flag set if TLS-first sent INFO during negotiation
→ RunClientAsync
```
## 4. File Layout
```
src/NATS.Server/
  ServerStats.cs
  Monitoring/
    MonitorServer.cs          # Kestrel host, route registration
    Varz.cs                   # Varz + nested config structs
    Connz.cs                  # Connz, ConnInfo, ConnzOptions, SubDetail
    VarzHandler.cs            # Snapshot logic, CPU/mem sampling
    ConnzHandler.cs           # Query param parsing, sort, pagination
  Tls/
    TlsHelper.cs              # Cert loading, auth options builder
    TlsConnectionWrapper.cs   # Per-connection TLS negotiation
    TlsConnectionState.cs     # Post-handshake state record
    TlsRateLimiter.cs         # Token-bucket rate limiter
    PeekableStream.cs         # Buffered-peek stream wrapper
```
### Package Dependencies
- `FrameworkReference` to `Microsoft.AspNetCore.App` in `NATS.Server.csproj` (for Kestrel)
- No new NuGet packages — `SslStream`, `X509Certificate2`, `SslServerAuthenticationOptions` all in `System.Net.Security`
- Tests use `HttpClient` (built-in) and `CertificateRequest` (built-in) for self-signed test certs
## 5. Testing Strategy
### Monitoring Tests (`MonitorTests.cs`)
- `/varz` returns correct server identity, config limits, zero stats on fresh server
- After pub/sub traffic: message/byte counters are accurate
- `/connz` pagination: `?limit=2&offset=0` with 5 clients returns 2, total=5
- `/connz?sort=bytes_to` ordering
- `/connz?subs=true` includes subscription subjects
- `/healthz` returns 200
- HTTP request stats tracked in `/varz` response
### TLS Tests (`TlsTests.cs`)
Self-signed certs generated in-memory via `CertificateRequest` + `RSA.Create()`.
- Basic TLS: server cert, client connects with SslStream, pub/sub works
- TLS Required: plaintext client rejected
- TLS Verify: valid client cert succeeds, wrong cert fails
- Mixed mode: TLS and plaintext clients coexist
- TLS First: immediate TLS handshake without reading INFO first
- TLS First fallback: slow client gets INFO sent, normal negotiation
- Certificate pinning: matching cert accepted, non-matching rejected
- Rate limiting: rapid connections throttled
- TLS timeout: incomplete handshake closed after configured timeout
- Integration: NATS.Client.Core NuGet client works over TLS
- Monitoring: `/connz` shows `tls_version` and `tls_cipher_suite`
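
The in-memory test certificate generation could be sketched as below. `TestCerts` is a hypothetical helper, and the key size, validity window, and extensions are illustrative choices rather than requirements from the design.

```csharp
using System;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

public static class TestCerts
{
    // Builds a short-lived self-signed certificate whose private key is attached.
    public static X509Certificate2 CreateSelfSigned(string dnsName)
    {
        using var rsa = RSA.Create(2048);
        var request = new CertificateRequest(
            $"CN={dnsName}", rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);

        // Mark as an end-entity certificate and add the name as a SAN entry.
        request.CertificateExtensions.Add(new X509BasicConstraintsExtension(false, false, 0, false));
        var san = new SubjectAlternativeNameBuilder();
        san.AddDnsName(dnsName);
        request.CertificateExtensions.Add(san.Build());

        var now = DateTimeOffset.UtcNow;
        return request.CreateSelfSigned(now.AddMinutes(-5), now.AddDays(1));
    }
}
```

A test might call `TestCerts.CreateSelfSigned("localhost")` once per fixture and use the result as the server certificate. A second call produces an unrelated certificate, which suits the "wrong cert fails" and "non-matching pin rejected" cases.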
## 6. Error Handling
- **TLS handshake failures** are non-fatal: log warning, close socket, increment counter
- **Mixed mode byte detection**: 0x16 → TLS, printable ASCII → plain, connection close → clean disconnect
- **Rate limiter**: holds TCP connection open until token available (not rejected)
- **Monitoring concurrency**: `varzMu` semaphore serializes `/varz`, client snapshot for `/connz`
- **CPU sampling**: cached 1 second to avoid overhead on rapid polls
- **Graceful shutdown**: `MonitorServer.DisposeAsync()` stops Kestrel, rate limiter disposes timer, in-flight handshakes cancelled via CancellationToken