From 8ee5a7f97baee4f871b65eabcf59e6b3b77e8f0c Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Sun, 22 Feb 2026 21:33:24 -0500
Subject: [PATCH] docs: add design for monitoring HTTP and TLS support

Covers /varz, /connz endpoints via Kestrel Minimal APIs, full TLS support
with four modes (none/required/first/mixed), cert pinning, rate limiting,
and testing strategy.
---
 .../plans/2026-02-22-monitoring-tls-design.md | 243 ++++++++++++++++++
 1 file changed, 243 insertions(+)
 create mode 100644 docs/plans/2026-02-22-monitoring-tls-design.md

diff --git a/docs/plans/2026-02-22-monitoring-tls-design.md b/docs/plans/2026-02-22-monitoring-tls-design.md
new file mode 100644
index 0000000..748abc7
--- /dev/null
+++ b/docs/plans/2026-02-22-monitoring-tls-design.md
@@ -0,0 +1,243 @@
+# Monitoring HTTP & TLS Support Design
+
+**Date:** 2026-02-22
+**Scope:** Port monitoring endpoints (`/varz`, `/connz`) and full TLS support from Go NATS server
+**Go Reference:** `golang/nats-server/server/monitor.go`, `server.go` (TLS), `client.go` (TLS), `opts.go`
+
+## Overview
+
+Two features ported from Go NATS:
+
+1. **Monitoring HTTP** — Kestrel Minimal API embedded in `NatsServer`, serving `/varz`, `/connz`, `/healthz` and stub endpoints. Exact Go JSON schema for tooling compatibility.
+2. **TLS Support** — `SslStream` wrapping with four modes: no TLS, TLS required, TLS-first, and mixed TLS/plaintext. Certificate pinning, client cert verification, rate limiting.
+
+## 1. Server-Level Stats Aggregation
+
+New `ServerStats` class with atomic counters, replacing the need to sum across all clients on each `/varz` request.
+
+### ServerStats Fields
+
+```csharp
+// src/NATS.Server/ServerStats.cs
+public sealed class ServerStats
+{
+    public long InMsgs;
+    public long OutMsgs;
+    public long InBytes;
+    public long OutBytes;
+    public long TotalConnections;
+    public long SlowConsumers;
+    public long StaleConnections;
+    public long Stalls;
+    public long SlowConsumerClients;
+    public long SlowConsumerRoutes;
+    public long SlowConsumerLeafs;
+    public long SlowConsumerGateways;
+    public readonly ConcurrentDictionary<string, long> HttpReqStats = new();
+}
+```
+
+### Integration Points
+
+- `NatsServer` owns a `ServerStats` instance, passes it to each `NatsClient`
+- `NatsClient.ProcessPub` increments server-level `InMsgs`/`InBytes` alongside client-level counters
+- `NatsClient.SendMessageAsync` increments server-level `OutMsgs`/`OutBytes`
+- Accept loop increments `TotalConnections`
+- `NatsServer.StartTime` field added (set once at startup)
+
+## 2. Monitoring HTTP Endpoints
+
+### HTTP Stack
+
+Kestrel Minimal APIs via `FrameworkReference` to `Microsoft.AspNetCore.App`. No NuGet packages needed.
+
+### Endpoints
+
+| Path | Handler | Description |
+|------|---------|-------------|
+| `/` | `HandleRoot` | Links to all endpoints |
+| `/varz` | `HandleVarz` | Server stats and config |
+| `/connz` | `HandleConnz` | Connection info (paginated) |
+| `/healthz` | `HandleHealthz` | Health check (200 OK) |
+| `/routez` | stub | Returns `{}` |
+| `/gatewayz` | stub | Returns `{}` |
+| `/leafz` | stub | Returns `{}` |
+| `/subz` | stub | Returns `{}` |
+| `/accountz` | stub | Returns `{}` |
+| `/jsz` | stub | Returns `{}` |
+
+All paths support optional base path prefix via `MonitorBasePath` config.
+
+### Configuration
+
+```csharp
+// Added to NatsOptions
+public int MonitorPort { get; set; }              // 0 = disabled, CLI: -m
+public string MonitorHost { get; set; } = "0.0.0.0";
+public string? MonitorBasePath { get; set; }
+public int MonitorHttpsPort { get; set; }         // 0 = disabled
+```
+
+### Varz Model
+
+Exact Go JSON field names. All fields from Go's `Varz` struct including nested config structs (`ClusterOptsVarz`, `GatewayOptsVarz`, `LeafNodeOptsVarz`, `MqttOptsVarz`, `WebsocketOptsVarz`, `JetStreamVarz`). Nested structs return defaults/zeros until those subsystems are ported.
+
+Key field categories: identification, network config, security/limits, timing/lifecycle, runtime metrics (mem, CPU, cores), connection stats, message stats, health counters, subsystem configs, HTTP request stats.
+
+### Connz Model
+
+Paginated connection list with query parameter support:
+
+- `sort` — sort field (cid, bytes_to, msgs_to, etc.)
+- `subs` / `subs=detail` — include subscription lists
+- `offset` / `limit` — pagination (default limit 1024)
+- `state` — filter open/closed/all
+- `auth` — include usernames
+
+`ConnInfo` includes all Go fields: cid, kind, ip, port, start, last_activity, rtt, uptime, idle, pending, msg/byte stats, subscription count, client name/lang/version, TLS version/cipher, account.
+
+### Concurrency
+
+- `HandleVarz` acquires a `SemaphoreSlim(1,1)` to serialize JSON building (matches Go's `varzMu`)
+- `HandleConnz` snapshots `_clients.Values.ToArray()` to avoid holding the dictionary during serialization
+- CPU percentage sampled via `Process.TotalProcessorTime` delta, cached for 1 second
+
+### NatsClient Additions for ConnInfo
+
+```csharp
+public DateTime StartTime { get; }     // set in constructor
+public DateTime LastActivity;          // updated on every command dispatch
+public string? RemoteIp { get; }       // from socket.RemoteEndPoint
+public int RemotePort { get; }         // from socket.RemoteEndPoint
+```
+
+## 3. TLS Support
+
+### Configuration
+
+```csharp
+// Added to NatsOptions
+public string? TlsCert { get; set; }
+public string? TlsKey { get; set; }
+public string? TlsCaCert { get; set; }
+public bool TlsVerify { get; set; }
+public bool TlsMap { get; set; }
+public double TlsTimeout { get; set; } = 2.0;
+public bool TlsHandshakeFirst { get; set; }
+public TimeSpan TlsHandshakeFirstFallback { get; set; } = TimeSpan.FromMilliseconds(50);
+public bool AllowNonTls { get; set; }
+public long TlsRateLimit { get; set; }
+public HashSet<string>? TlsPinnedCerts { get; set; }
+public SslProtocols TlsMinVersion { get; set; } = SslProtocols.Tls12;
+```
+
+CLI args: `--tls`, `--tlscert`, `--tlskey`, `--tlscacert`, `--tlsverify`
+
+### INFO Message Changes
+
+Three new fields on `ServerInfo`: `tls_required`, `tls_verify`, `tls_available`.
+
+- `tls_required = (TlsConfig != null && !AllowNonTls)`
+- `tls_verify = (TlsConfig != null && TlsVerify)`
+- `tls_available = (TlsConfig != null && AllowNonTls)`
+
+### Four TLS Modes
+
+**Mode 1: No TLS** — current behavior, unchanged.
+
+**Mode 2: TLS Required** — send INFO with `tls_required=true`, client initiates TLS, server detects 0x16 byte, performs `SslStream` handshake, validates pinned certs, continues protocol over encrypted stream.
+
+**Mode 3: TLS First** — do NOT send INFO, wait up to 50ms for data. If 0x16 byte arrives: TLS handshake then send INFO over encrypted stream. If timeout or non-TLS byte: fallback to Mode 2 flow.
+
+**Mode 4: Mixed** — send INFO with `tls_available=true`, peek first byte. 0x16 → TLS handshake. Other → continue plaintext.
+
+### Key Components
+
+**`TlsHelper`** — static class for cert loading (`X509Certificate2` from PEM/PFX), CA cert loading, building `SslServerAuthenticationOptions`, pinned cert validation (SHA256 of SubjectPublicKeyInfo).
+
+**`TlsConnectionWrapper`** — per-connection negotiation state machine. Takes socket + options, returns `(Stream stream, bool infoAlreadySent)`. Handles peek logic, timeout, handshake, cert validation.
+
+**`PeekableStream`** — wraps `NetworkStream`, buffers peeked bytes, replays them on first `ReadAsync`. Required so `SslStream.AuthenticateAsServerAsync` sees the full TLS ClientHello including the peeked byte.
+
+**`TlsRateLimiter`** — token-bucket rate limiter. Refills `TlsRateLimit` tokens per second. `WaitAsync` blocks if no tokens. Only applies to TLS handshakes, not plain connections.
+
+**`TlsConnectionState`** — post-handshake record: `TlsVersion`, `CipherSuite`, `PeerCert`. Stored on `NatsClient` for `/connz` reporting.
+
+### NatsClient Changes
+
+Constructor takes `Stream` instead of building `NetworkStream` internally. TLS negotiation happens before `NatsClient` is constructed. `NatsClient` receives the already-negotiated stream and `TlsConnectionState`.
+
+### Accept Loop Changes
+
+```
+Accept socket
+  → Increment TotalConnections
+  → Rate limit check (if TLS configured)
+  → TlsConnectionWrapper.NegotiateAsync (returns stream + infoAlreadySent)
+  → Extract TlsConnectionState from SslStream if applicable
+  → Construct NatsClient with stream + tlsState
+  → client.InfoAlreadySent flag set if TLS-first sent INFO during negotiation
+  → RunClientAsync
+```
+
+## 4. File Layout
+
+```
+src/NATS.Server/
+  ServerStats.cs
+  Monitoring/
+    MonitorServer.cs          # Kestrel host, route registration
+    Varz.cs                   # Varz + nested config structs
+    Connz.cs                  # Connz, ConnInfo, ConnzOptions, SubDetail
+    VarzHandler.cs            # Snapshot logic, CPU/mem sampling
+    ConnzHandler.cs           # Query param parsing, sort, pagination
+  Tls/
+    TlsHelper.cs              # Cert loading, auth options builder
+    TlsConnectionWrapper.cs   # Per-connection TLS negotiation
+    TlsConnectionState.cs     # Post-handshake state record
+    TlsRateLimiter.cs         # Token-bucket rate limiter
+    PeekableStream.cs         # Buffered-peek stream wrapper
+```
+
+### Package Dependencies
+
+- `FrameworkReference` to `Microsoft.AspNetCore.App` in `NATS.Server.csproj` (for Kestrel)
+- No new NuGet packages — `SslStream`, `X509Certificate2`, `SslServerAuthenticationOptions` all in `System.Net.Security`
+- Tests use `HttpClient` (built-in) and `CertificateRequest` (built-in) for self-signed test certs
+
+## 5. Testing Strategy
+
+### Monitoring Tests (`MonitorTests.cs`)
+
+- `/varz` returns correct server identity, config limits, zero stats on fresh server
+- After pub/sub traffic: message/byte counters are accurate
+- `/connz` pagination: `?limit=2&offset=0` with 5 clients returns 2, total=5
+- `/connz?sort=bytes_to` ordering
+- `/connz?subs=true` includes subscription subjects
+- `/healthz` returns 200
+- HTTP request stats tracked in `/varz` response
+
+### TLS Tests (`TlsTests.cs`)
+
+Self-signed certs generated in-memory via `CertificateRequest` + `RSA.Create()`.
+
+- Basic TLS: server cert, client connects with SslStream, pub/sub works
+- TLS Required: plaintext client rejected
+- TLS Verify: valid client cert succeeds, wrong cert fails
+- Mixed mode: TLS and plaintext clients coexist
+- TLS First: immediate TLS handshake without reading INFO first
+- TLS First fallback: slow client gets INFO sent, normal negotiation
+- Certificate pinning: matching cert accepted, non-matching rejected
+- Rate limiting: rapid connections throttled
+- TLS timeout: incomplete handshake closed after configured timeout
+- Integration: NATS.Client.Core NuGet client works over TLS
+- Monitoring: `/connz` shows `tls_version` and `tls_cipher_suite`
+
+## 6. Error Handling
+
+- **TLS handshake failures** are non-fatal: log warning, close socket, increment counter
+- **Mixed mode byte detection**: 0x16 → TLS, printable ASCII → plain, connection close → clean disconnect
+- **Rate limiter**: holds TCP connection open until token available (not rejected)
+- **Monitoring concurrency**: `varzMu` semaphore serializes `/varz`, client snapshot for `/connz`
+- **CPU sampling**: cached 1 second to avoid overhead on rapid polls
+- **Graceful shutdown**: `MonitorServer.DisposeAsync()` stops Kestrel, rate limiter disposes timer, in-flight handshakes cancelled via CancellationToken
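
The "atomic counters" idea from the Server-Level Stats Aggregation section could be sketched roughly as follows. This is only an illustration: `ServerStatsSketch` is a hypothetical, trimmed stand-in for the planned `ServerStats` class, and the parallel loop simulates client read loops; the design itself does not say which `Interlocked` calls the real code uses.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical, trimmed stand-in for the ServerStats class in the design.
public sealed class ServerStatsSketch
{
    public long InMsgs;
    public long InBytes;
}

public static class Program
{
    public static void Main()
    {
        var stats = new ServerStatsSketch();

        // Simulate 8 client read loops, each publishing 1,000 messages of 16 bytes.
        Parallel.For(0, 8, _ =>
        {
            for (var i = 0; i < 1000; i++)
            {
                // Lock-free updates of the shared server-level counters.
                Interlocked.Increment(ref stats.InMsgs);
                Interlocked.Add(ref stats.InBytes, 16);
            }
        });

        // A /varz-style reader can take the totals without touching any client.
        long msgs = Interlocked.Read(ref stats.InMsgs);
        long bytes = Interlocked.Read(ref stats.InBytes);
        Console.WriteLine($"in_msgs={msgs} in_bytes={bytes}");
        // prints "in_msgs=8000 in_bytes=128000"
    }
}
```

Keeping the counters as public fields (rather than properties) is what allows them to be passed by `ref` to `Interlocked`.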
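
The peek-and-replay behavior described for `PeekableStream` under Key Components might look something like the sketch below. It is an assumption-laden illustration, not the planned implementation: the class name `PeekableStreamSketch`, the `PeekAsync` signature, and the single-peek restriction are all hypothetical, and a `MemoryStream` stands in for the `NetworkStream`.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: buffers peeked bytes and replays them on the next read.
public sealed class PeekableStreamSketch : Stream
{
    private readonly Stream _inner;
    private byte[] _peeked = Array.Empty<byte>();
    private int _peekedOffset;

    public PeekableStreamSketch(Stream inner) => _inner = inner;

    // Reads up to `count` bytes and keeps them for replay (call once, before any read).
    public async Task<byte[]> PeekAsync(int count, CancellationToken ct = default)
    {
        var buffer = new byte[count];
        var read = await _inner.ReadAsync(buffer.AsMemory(0, count), ct);
        _peeked = buffer[..read];
        _peekedOffset = 0;
        return _peeked;
    }

    // Copies any not-yet-replayed peeked bytes into the caller's buffer.
    private int Replay(Span<byte> destination)
    {
        var n = Math.Min(destination.Length, _peeked.Length - _peekedOffset);
        _peeked.AsSpan(_peekedOffset, n).CopyTo(destination);
        _peekedOffset += n;
        return n;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        var replayed = Replay(buffer.AsSpan(offset, count));
        return replayed > 0 ? replayed : _inner.Read(buffer, offset, count);
    }

    public override ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default)
    {
        var replayed = Replay(buffer.Span);
        return replayed > 0 ? new ValueTask<int>(replayed) : _inner.ReadAsync(buffer, cancellationToken);
    }

    public override void Write(byte[] buffer, int offset, int count) => _inner.Write(buffer, offset, count);
    public override void Flush() => _inner.Flush();
    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => _inner.CanWrite;
    public override long Length => throw new NotSupportedException();
    public override long Position { get => throw new NotSupportedException(); set => throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}

public static class Program
{
    public static async Task Main()
    {
        // 0x16 is the TLS handshake record type; the remaining bytes are filler.
        var wire = new MemoryStream(new byte[] { 0x16, 0x03, 0x01, 0x00, 0x05 });
        var stream = new PeekableStreamSketch(wire);

        var first = await stream.PeekAsync(1);
        Console.WriteLine($"looks like TLS: {first.Length == 1 && first[0] == 0x16}");

        // A later reader (SslStream in the design) still sees the peeked byte.
        var all = new byte[5];
        var total = 0;
        int n;
        while (total < all.Length && (n = await stream.ReadAsync(all.AsMemory(total))) > 0)
            total += n;
        Console.WriteLine(BitConverter.ToString(all, 0, total)); // prints "16-03-01-00-05"
    }
}
```

Overriding the `Memory<byte>` overload of `ReadAsync` matters here because the base `Stream` implementation would not route through the replay buffer in the way this sketch intends.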
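
For the certificate pinning check (SHA256 of SubjectPublicKeyInfo) and the in-memory self-signed test certs mentioned in the Testing Strategy, a minimal sketch could be the following. The helper names `SpkiSha256` and `CreateSelfSigned`, the lowercase-hex encoding of the pin, and the certificate subject names are assumptions made for illustration; `PublicKey.ExportSubjectPublicKeyInfo` requires .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

public static class Program
{
    // Hypothetical helper: lowercase hex SHA-256 of the DER-encoded SubjectPublicKeyInfo.
    public static string SpkiSha256(X509Certificate2 cert)
    {
        byte[] spki = cert.PublicKey.ExportSubjectPublicKeyInfo();
        return Convert.ToHexString(SHA256.HashData(spki)).ToLowerInvariant();
    }

    // Hypothetical helper: short-lived self-signed cert, generated purely in memory.
    public static X509Certificate2 CreateSelfSigned(string commonName)
    {
        using var rsa = RSA.Create(2048);
        var request = new CertificateRequest(
            $"CN={commonName}", rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
        return request.CreateSelfSigned(
            DateTimeOffset.UtcNow.AddMinutes(-5), DateTimeOffset.UtcNow.AddDays(1));
    }

    public static void Main()
    {
        using var trusted = CreateSelfSigned("trusted-client");
        using var other = CreateSelfSigned("other-client");

        // Stand-in for the TlsPinnedCerts option: a set of allowed SPKI fingerprints.
        var pinned = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { SpkiSha256(trusted) };

        Console.WriteLine($"trusted accepted: {pinned.Contains(SpkiSha256(trusted))}"); // True
        Console.WriteLine($"other accepted: {pinned.Contains(SpkiSha256(other))}");     // False
    }
}
```

Hashing the public key rather than the whole certificate means a pin would survive certificate renewal as long as the key pair is reused.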