docs: add design for monitoring HTTP and TLS support
Covers /varz, /connz endpoints via Kestrel Minimal APIs, full TLS support with four modes (none/required/first/mixed), cert pinning, rate limiting, and testing strategy.
243
docs/plans/2026-02-22-monitoring-tls-design.md
Normal file
@@ -0,0 +1,243 @@
# Monitoring HTTP & TLS Support Design

**Date:** 2026-02-22
**Scope:** Port monitoring endpoints (`/varz`, `/connz`) and full TLS support from Go NATS server
**Go Reference:** `golang/nats-server/server/monitor.go`, `server.go` (TLS), `client.go` (TLS), `opts.go`

## Overview

Two features ported from Go NATS:

1. **Monitoring HTTP** — Kestrel Minimal API embedded in `NatsServer`, serving `/varz`, `/connz`, `/healthz` and stub endpoints. Exact Go JSON schema for tooling compatibility.
2. **TLS Support** — `SslStream` wrapping with four modes: no TLS, TLS required, TLS-first, and mixed TLS/plaintext. Certificate pinning, client cert verification, rate limiting.

## 1. Server-Level Stats Aggregation

New `ServerStats` class with atomic counters, replacing the need to sum across all clients on each `/varz` request.

### ServerStats Fields

```csharp
// src/NATS.Server/ServerStats.cs
using System.Collections.Concurrent;

public sealed class ServerStats
{
    // Public fields rather than properties so callers can pass them
    // by ref to Interlocked.Increment / Interlocked.Add.
    public long InMsgs;
    public long OutMsgs;
    public long InBytes;
    public long OutBytes;
    public long TotalConnections;
    public long SlowConsumers;
    public long StaleConnections;
    public long Stalls;
    public long SlowConsumerClients;
    public long SlowConsumerRoutes;
    public long SlowConsumerLeafs;
    public long SlowConsumerGateways;

    // Per-path request counts for the monitoring endpoints.
    public readonly ConcurrentDictionary<string, long> HttpReqStats = new();
}
```

### Integration Points

- `NatsServer` owns a `ServerStats` instance, passes it to each `NatsClient`
- `NatsClient.ProcessPub` increments server-level `InMsgs`/`InBytes` alongside client-level counters
- `NatsClient.SendMessageAsync` increments server-level `OutMsgs`/`OutBytes`
- Accept loop increments `TotalConnections`
- `NatsServer.StartTime` field added (set once at startup)

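The counter updates listed above could look roughly like the following. This is a minimal sketch, not the planned implementation: `ServerStatsSketch`, `RecordInbound`, and `Snapshot` are hypothetical names, and only two of the counters are shown.

```csharp
using System.Threading;

// Trimmed stand-in for ServerStats with just the counters used below.
public sealed class ServerStatsSketch
{
    public long InMsgs;
    public long InBytes;
}

public static class StatsUpdatesSketch
{
    // Hypothetical helper: what ProcessPub might call once per published message.
    public static void RecordInbound(ServerStatsSketch stats, int payloadBytes)
    {
        Interlocked.Increment(ref stats.InMsgs);
        Interlocked.Add(ref stats.InBytes, payloadBytes);
    }

    // Interlocked.Read avoids torn 64-bit reads on 32-bit platforms.
    public static (long Msgs, long Bytes) Snapshot(ServerStatsSketch stats) =>
        (Interlocked.Read(ref stats.InMsgs), Interlocked.Read(ref stats.InBytes));
}
```

Each counter is updated independently, so a snapshot taken under load may pair an `InMsgs` value with a slightly older `InBytes`. For monitoring output that is usually acceptable.
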
## 2. Monitoring HTTP Endpoints

### HTTP Stack

Kestrel Minimal APIs via `FrameworkReference` to `Microsoft.AspNetCore.App`. No NuGet packages needed.

### Endpoints

| Path | Handler | Description |
|------|---------|-------------|
| `/` | `HandleRoot` | Links to all endpoints |
| `/varz` | `HandleVarz` | Server stats and config |
| `/connz` | `HandleConnz` | Connection info (paginated) |
| `/healthz` | `HandleHealthz` | Health check (200 OK) |
| `/routez` | stub | Returns `{}` |
| `/gatewayz` | stub | Returns `{}` |
| `/leafz` | stub | Returns `{}` |
| `/subz` | stub | Returns `{}` |
| `/accountz` | stub | Returns `{}` |
| `/jsz` | stub | Returns `{}` |

All paths support optional base path prefix via `MonitorBasePath` config.

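One way the base path could be applied to the endpoint table is sketched below. It uses only the BCL and leaves out the Kestrel registration itself. `MonitorRoutesSketch` and its members are hypothetical, and mapping the root endpoint to the bare prefix (`/nats` rather than `/nats/`) is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MonitorRoutesSketch
{
    // Endpoint paths from the table, relative to the base path.
    private static readonly string[] Paths =
    {
        "/", "/varz", "/connz", "/healthz", "/routez",
        "/gatewayz", "/leafz", "/subz", "/accountz", "/jsz",
    };

    // Turns null, "", "/", "nats" or "/nats/" into "" or "/nats".
    public static string NormalizeBasePath(string? basePath)
    {
        var trimmed = (basePath ?? string.Empty).Trim('/');
        return trimmed.Length == 0 ? string.Empty : "/" + trimmed;
    }

    // Returns the full route patterns to register, e.g. "/nats/varz".
    public static IReadOnlyList<string> BuildRoutes(string? basePath)
    {
        var prefix = NormalizeBasePath(basePath);
        return Paths
            .Select(p => p == "/" && prefix.Length > 0 ? prefix : prefix + p)
            .ToList();
    }
}
```

Each resulting pattern would then be handed to the Minimal API route registration in `MonitorServer`.
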
### Configuration

```csharp
// Added to NatsOptions
public int MonitorPort { get; set; }        // 0 = disabled, CLI: -m
public string MonitorHost { get; set; } = "0.0.0.0";
public string? MonitorBasePath { get; set; }
public int MonitorHttpsPort { get; set; }   // 0 = disabled
```

### Varz Model

Exact Go JSON field names. All fields from Go's `Varz` struct including nested config structs (`ClusterOptsVarz`, `GatewayOptsVarz`, `LeafNodeOptsVarz`, `MqttOptsVarz`, `WebsocketOptsVarz`, `JetStreamVarz`). Nested structs return defaults/zeros until those subsystems are ported.

Key field categories: identification, network config, security/limits, timing/lifecycle, runtime metrics (mem, CPU, cores), connection stats, message stats, health counters, subsystem configs, HTTP request stats.

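Matching Go's JSON field names could be done with `System.Text.Json` attributes, as in this small sketch. `VarzSketch` is a hypothetical subset of the model; the real `Varz` type would carry every field from the Go struct.

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical subset of the Varz model, showing only the naming approach.
public sealed class VarzSketch
{
    [JsonPropertyName("server_id")] public string Id { get; set; } = "";
    [JsonPropertyName("connections")] public int Connections { get; set; }
    [JsonPropertyName("total_connections")] public ulong TotalConnections { get; set; }
    [JsonPropertyName("in_msgs")] public long InMsgs { get; set; }
    [JsonPropertyName("out_msgs")] public long OutMsgs { get; set; }
    [JsonPropertyName("in_bytes")] public long InBytes { get; set; }
    [JsonPropertyName("out_bytes")] public long OutBytes { get; set; }
    [JsonPropertyName("slow_consumers")] public long SlowConsumers { get; set; }
}

public static class VarzJsonSketch
{
    // Default serializer options emit compact JSON using the attribute names.
    public static string Serialize(VarzSketch varz) => JsonSerializer.Serialize(varz);
}
```

Explicit attributes keep the wire names independent of C# naming conventions, which matters when existing NATS tooling parses the output.
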
### Connz Model

Paginated connection list with query parameter support:

- `sort` — sort field (cid, bytes_to, msgs_to, etc.)
- `subs` / `subs=detail` — include subscription lists
- `offset` / `limit` — pagination (default limit 1024)
- `state` — filter open/closed/all
- `auth` — include usernames

`ConnInfo` includes all Go fields: cid, kind, ip, port, start, last_activity, rtt, uptime, idle, pending, msg/byte stats, subscription count, client name/lang/version, TLS version/cipher, account.

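The `sort`, `offset`, and `limit` handling could be sketched as below. `ConnInfoSketch`, `ConnzPage`, and `ConnzPagingSketch` are hypothetical. Descending order for the byte and message sorts, and treating a `limit` of 0 as the default, are assumptions that should be checked against the Go handler.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, heavily trimmed ConnInfo.
public sealed record ConnInfoSketch(ulong Cid, long OutBytes, long OutMsgs);

public sealed record ConnzPage(int Total, int Offset, int Limit, IReadOnlyList<ConnInfoSketch> Conns);

public static class ConnzPagingSketch
{
    public const int DefaultLimit = 1024;

    public static ConnzPage Apply(
        IEnumerable<ConnInfoSketch> snapshot,
        IReadOnlyDictionary<string, string> query)
    {
        int offset = ParseNonNegative(query, "offset", 0);
        int limit = ParseNonNegative(query, "limit", DefaultLimit);
        if (limit == 0) limit = DefaultLimit; // assumption: 0 means "use the default"
        query.TryGetValue("sort", out var sort);

        // Unknown or missing sort keys fall back to ordering by cid.
        IEnumerable<ConnInfoSketch> ordered = sort switch
        {
            "bytes_to" => snapshot.OrderByDescending(c => c.OutBytes),
            "msgs_to" => snapshot.OrderByDescending(c => c.OutMsgs),
            _ => snapshot.OrderBy(c => c.Cid),
        };

        var all = ordered.ToList();
        return new ConnzPage(all.Count, offset, limit, all.Skip(offset).Take(limit).ToList());
    }

    private static int ParseNonNegative(
        IReadOnlyDictionary<string, string> query, string key, int fallback) =>
        query.TryGetValue(key, out var raw) && int.TryParse(raw, out var n) && n >= 0
            ? n
            : fallback;
}
```

`Total` is reported before paging is applied, so `?limit=2&offset=0` with five clients yields two entries and a total of five.
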
### Concurrency

- `HandleVarz` acquires a `SemaphoreSlim(1,1)` to serialize JSON building (matches Go's `varzMu`)
- `HandleConnz` snapshots `_clients.Values.ToArray()` to avoid holding the dictionary during serialization
- CPU percentage sampled via `Process.TotalProcessorTime` delta, cached for 1 second

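The CPU sampling could be sketched as follows. `CpuSamplerSketch` is a hypothetical name. The value is expressed as a percentage of one core, so it can exceed 100 on multi-core machines; dividing by `Environment.ProcessorCount` would normalize it. Which convention matches the Go output is left as an assumption to verify.

```csharp
using System;
using System.Diagnostics;

public sealed class CpuSamplerSketch
{
    private static readonly TimeSpan CacheWindow = TimeSpan.FromSeconds(1);

    private readonly object _gate = new();
    private readonly Stopwatch _clock = Stopwatch.StartNew();
    private TimeSpan _lastWall;                  // wall-clock time of the last sample
    private TimeSpan _lastCpu = ReadCpuTime();   // CPU time at the last sample
    private double _cachedPercent;

    // Returns the cached value unless at least one second has passed
    // since the previous sample.
    public double GetCpuPercent()
    {
        lock (_gate)
        {
            var wall = _clock.Elapsed;
            var elapsed = wall - _lastWall;
            if (elapsed < CacheWindow) return _cachedPercent;

            var cpu = ReadCpuTime();
            _cachedPercent = (cpu - _lastCpu).TotalMilliseconds / elapsed.TotalMilliseconds * 100.0;
            _lastCpu = cpu;
            _lastWall = wall;
            return _cachedPercent;
        }
    }

    private static TimeSpan ReadCpuTime()
    {
        using var process = Process.GetCurrentProcess();
        return process.TotalProcessorTime;
    }
}
```

The first call within a second of construction returns 0, since no delta is available yet.
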
### NatsClient Additions for ConnInfo

```csharp
public DateTime StartTime { get; }   // set in constructor
public DateTime LastActivity;        // updated on every command dispatch
public string? RemoteIp { get; }     // from socket.RemoteEndPoint
public int RemotePort { get; }       // from socket.RemoteEndPoint
```

## 3. TLS Support

### Configuration

```csharp
// Added to NatsOptions
public string? TlsCert { get; set; }
public string? TlsKey { get; set; }
public string? TlsCaCert { get; set; }
public bool TlsVerify { get; set; }
public bool TlsMap { get; set; }
public double TlsTimeout { get; set; } = 2.0;   // handshake timeout, in seconds
public bool TlsHandshakeFirst { get; set; }
public TimeSpan TlsHandshakeFirstFallback { get; set; } = TimeSpan.FromMilliseconds(50);
public bool AllowNonTls { get; set; }
public long TlsRateLimit { get; set; }          // TLS handshakes per second
public HashSet<string>? TlsPinnedCerts { get; set; }
public SslProtocols TlsMinVersion { get; set; } = SslProtocols.Tls12;
```

CLI args: `--tls`, `--tlscert`, `--tlskey`, `--tlscacert`, `--tlsverify`

### INFO Message Changes

Three new fields on `ServerInfo`: `tls_required`, `tls_verify`, `tls_available`.

- `tls_required = (TlsConfig != null && !AllowNonTls)`
- `tls_verify = (TlsConfig != null && TlsVerify)`
- `tls_available = (TlsConfig != null && AllowNonTls)`

### Four TLS Modes

**Mode 1: No TLS** — current behavior, unchanged.

**Mode 2: TLS Required** — send INFO with `tls_required=true`, client initiates TLS, server detects the 0x16 byte (the TLS handshake record type), performs `SslStream` handshake, validates pinned certs, continues protocol over encrypted stream.

**Mode 3: TLS First** — do NOT send INFO, wait up to `TlsHandshakeFirstFallback` (default 50ms) for data. If a 0x16 byte arrives: TLS handshake, then send INFO over the encrypted stream. If the wait times out or a non-TLS byte arrives: fall back to the Mode 2 flow.

**Mode 4: Mixed** — send INFO with `tls_available=true`, peek first byte. 0x16 → TLS handshake. Other → continue plaintext.

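The four modes reduce to a small decision over the configured mode and the first byte seen on the socket. The sketch below is illustrative only: the enum and method names are hypothetical, and the precedence given to `AllowNonTls` over `TlsHandshakeFirst` when both are set is an assumption.

```csharp
public enum TlsMode { None, Required, First, Mixed }

public enum FirstByteAction
{
    ContinuePlaintext,
    StartTlsHandshake,
    SendInfoThenExpectTls,
    Reject,
}

public static class TlsModeSketch
{
    // First byte of a TLS record carrying a handshake message (ClientHello).
    public const byte TlsHandshakeRecord = 0x16;

    public static TlsMode Select(bool tlsConfigured, bool handshakeFirst, bool allowNonTls)
    {
        if (!tlsConfigured) return TlsMode.None;
        if (allowNonTls) return TlsMode.Mixed; // assumption: mixed wins over TLS-first
        return handshakeFirst ? TlsMode.First : TlsMode.Required;
    }

    // firstByte is null when the TLS-first peek window elapsed without data.
    public static FirstByteAction Decide(TlsMode mode, byte? firstByte)
    {
        bool looksLikeTls = firstByte == TlsHandshakeRecord;
        return mode switch
        {
            TlsMode.None => FirstByteAction.ContinuePlaintext,
            TlsMode.Required => looksLikeTls ? FirstByteAction.StartTlsHandshake : FirstByteAction.Reject,
            TlsMode.First => looksLikeTls ? FirstByteAction.StartTlsHandshake : FirstByteAction.SendInfoThenExpectTls,
            TlsMode.Mixed => looksLikeTls ? FirstByteAction.StartTlsHandshake : FirstByteAction.ContinuePlaintext,
            _ => FirstByteAction.Reject,
        };
    }
}
```

Keeping the decision in a pure function like this would let the mode matrix be unit tested without opening sockets.
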
### Key Components

**`TlsHelper`** — static class for cert loading (`X509Certificate2` from PEM/PFX), CA cert loading, building `SslServerAuthenticationOptions`, pinned cert validation (SHA256 of SubjectPublicKeyInfo).

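The pinned-cert check in `TlsHelper` could be sketched as below. `PinnedCertSketch` is a hypothetical name, and the sketch assumes pins are stored as lowercase hex strings. `PublicKey.ExportSubjectPublicKeyInfo()` requires .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

public static class PinnedCertSketch
{
    // Lowercase hex SHA-256 of the certificate's DER-encoded SubjectPublicKeyInfo.
    public static string SpkiSha256Hex(X509Certificate2 cert)
    {
        byte[] spki = cert.PublicKey.ExportSubjectPublicKeyInfo();
        return Convert.ToHexString(SHA256.HashData(spki)).ToLowerInvariant();
    }

    // Assumes the configured pins are already lowercase hex.
    public static bool IsPinned(X509Certificate2? peerCert, HashSet<string> pins) =>
        peerCert is not null && pins.Contains(SpkiSha256Hex(peerCert));
}
```

Hashing the public key rather than the whole certificate means a pin survives certificate renewal as long as the key pair is reused.
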
**`TlsConnectionWrapper`** — per-connection negotiation state machine. Takes socket + options, returns `(Stream stream, bool infoAlreadySent)`. Handles peek logic, timeout, handshake, cert validation.

**`PeekableStream`** — wraps `NetworkStream`, buffers peeked bytes, replays them on first `ReadAsync`. Required so `SslStream.AuthenticateAsServerAsync` sees the full TLS ClientHello including the peeked byte.

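A minimal sketch of the peek-and-replay idea follows. `PeekableStreamSketch` is hypothetical and simplified: it wraps any `Stream` rather than `NetworkStream` specifically, and it supports a single `PeekAsync` call made before any read.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public sealed class PeekableStreamSketch : Stream
{
    private readonly Stream _inner;
    private byte[] _peeked = Array.Empty<byte>();
    private int _replayed; // number of peeked bytes already handed back

    public PeekableStreamSketch(Stream inner) => _inner = inner;

    // Reads up to `count` bytes and remembers them so the next read replays them.
    public async Task<byte[]> PeekAsync(int count, CancellationToken ct = default)
    {
        var buffer = new byte[count];
        int n = await _inner.ReadAsync(buffer.AsMemory(0, count), ct);
        _peeked = buffer[..n];
        _replayed = 0;
        return _peeked;
    }

    // Copies any not-yet-replayed peeked bytes into the destination.
    private int Replay(Span<byte> destination)
    {
        int n = Math.Min(destination.Length, _peeked.Length - _replayed);
        _peeked.AsSpan(_replayed, n).CopyTo(destination);
        _replayed += n;
        return n;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int n = Replay(buffer.AsSpan(offset, count));
        return n > 0 ? n : _inner.Read(buffer, offset, count);
    }

    public override ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken ct = default)
    {
        int n = Replay(buffer.Span);
        return n > 0 ? new ValueTask<int>(n) : _inner.ReadAsync(buffer, ct);
    }

    public override Task<int> ReadAsync(byte[] buffer, int offset, int count, CancellationToken ct) =>
        ReadAsync(buffer.AsMemory(offset, count), ct).AsTask();

    // Writes and flushes pass straight through to the wrapped stream.
    public override void Write(byte[] buffer, int offset, int count) => _inner.Write(buffer, offset, count);
    public override ValueTask WriteAsync(ReadOnlyMemory<byte> buffer, CancellationToken ct = default) =>
        _inner.WriteAsync(buffer, ct);
    public override void Flush() => _inner.Flush();
    public override Task FlushAsync(CancellationToken ct) => _inner.FlushAsync(ct);

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => _inner.CanWrite;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}
```

A read that starts inside the peeked bytes returns only those bytes. Callers such as `SslStream` already loop on short reads, so the next call picks up from the wrapped stream.
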
**`TlsRateLimiter`** — token-bucket rate limiter. Refills `TlsRateLimit` tokens per second. `WaitAsync` blocks if no tokens. Only applies to TLS handshakes, not plain connections.

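One possible shape for the token bucket, built on `SemaphoreSlim` and a one-second `Timer`, is sketched here. `TlsRateLimiterSketch` is hypothetical, and taking the rate as an `int` (while `TlsRateLimit` is a `long`) is a simplification.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class TlsRateLimiterSketch : IDisposable
{
    private readonly int _perSecond;
    private readonly SemaphoreSlim _tokens;
    private readonly Timer _refill;

    public TlsRateLimiterSketch(int perSecond)
    {
        if (perSecond <= 0) throw new ArgumentOutOfRangeException(nameof(perSecond));
        _perSecond = perSecond;
        _tokens = new SemaphoreSlim(perSecond, perSecond); // bucket starts full
        _refill = new Timer(_ => Refill(), null, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(1));
    }

    // Completes immediately while tokens remain, otherwise waits for a refill.
    public Task WaitAsync(CancellationToken ct = default) => _tokens.WaitAsync(ct);

    // Tops the bucket back up to its capacity once per second.
    private void Refill()
    {
        try
        {
            int missing = _perSecond - _tokens.CurrentCount;
            if (missing > 0) _tokens.Release(missing);
        }
        catch (SemaphoreFullException) { /* overlapping refill; bucket already full */ }
        catch (ObjectDisposedException) { /* timer fired during shutdown */ }
    }

    public void Dispose()
    {
        _refill.Dispose();
        _tokens.Dispose();
    }
}
```

Because waiters queue on the semaphore instead of being refused, a burst of connections is delayed rather than dropped.
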
**`TlsConnectionState`** — post-handshake record: `TlsVersion`, `CipherSuite`, `PeerCert`. Stored on `NatsClient` for `/connz` reporting.

### NatsClient Changes

Constructor takes `Stream` instead of building `NetworkStream` internally. TLS negotiation happens before `NatsClient` is constructed. `NatsClient` receives the already-negotiated stream and `TlsConnectionState`.

### Accept Loop Changes

```
Accept socket
  → Increment TotalConnections
  → Rate limit check (if TLS configured)
  → TlsConnectionWrapper.NegotiateAsync (returns stream + infoAlreadySent)
  → Extract TlsConnectionState from SslStream if applicable
  → Construct NatsClient with stream + tlsState
  → client.InfoAlreadySent flag set if TLS-first sent INFO during negotiation
  → RunClientAsync
```

## 4. File Layout

```
src/NATS.Server/
  ServerStats.cs
  Monitoring/
    MonitorServer.cs          # Kestrel host, route registration
    Varz.cs                   # Varz + nested config structs
    Connz.cs                  # Connz, ConnInfo, ConnzOptions, SubDetail
    VarzHandler.cs            # Snapshot logic, CPU/mem sampling
    ConnzHandler.cs           # Query param parsing, sort, pagination
  Tls/
    TlsHelper.cs              # Cert loading, auth options builder
    TlsConnectionWrapper.cs   # Per-connection TLS negotiation
    TlsConnectionState.cs     # Post-handshake state record
    TlsRateLimiter.cs         # Token-bucket rate limiter
    PeekableStream.cs         # Buffered-peek stream wrapper
```

### Package Dependencies

- `FrameworkReference` to `Microsoft.AspNetCore.App` in `NATS.Server.csproj` (for Kestrel)
- No new NuGet packages — `SslStream` and `SslServerAuthenticationOptions` are in `System.Net.Security`, `X509Certificate2` is in `System.Security.Cryptography.X509Certificates`; both ship with the base class library
- Tests use `HttpClient` (built-in) and `CertificateRequest` (built-in) for self-signed test certs

## 5. Testing Strategy

### Monitoring Tests (`MonitorTests.cs`)

- `/varz` returns correct server identity, config limits, zero stats on fresh server
- After pub/sub traffic: message/byte counters are accurate
- `/connz` pagination: `?limit=2&offset=0` with 5 clients returns 2, total=5
- `/connz?sort=bytes_to` ordering
- `/connz?subs=true` includes subscription subjects
- `/healthz` returns 200
- HTTP request stats tracked in `/varz` response

### TLS Tests (`TlsTests.cs`)

Self-signed certs generated in-memory via `CertificateRequest` + `RSA.Create()`.

- Basic TLS: server cert, client connects with SslStream, pub/sub works
- TLS Required: plaintext client rejected
- TLS Verify: valid client cert succeeds, wrong cert fails
- Mixed mode: TLS and plaintext clients coexist
- TLS First: immediate TLS handshake without reading INFO first
- TLS First fallback: slow client gets INFO sent, normal negotiation
- Certificate pinning: matching cert accepted, non-matching rejected
- Rate limiting: rapid connections throttled
- TLS timeout: incomplete handshake closed after configured timeout
- Integration: NATS.Client.Core NuGet client works over TLS
- Monitoring: `/connz` shows `tls_version` and `tls_cipher_suite`

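The in-memory test certificate mentioned above could be produced roughly like this. `TestCertsSketch` is a hypothetical helper; the key size, validity window, and subject alternative name are illustrative choices.

```csharp
using System;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

public static class TestCertsSketch
{
    public static X509Certificate2 CreateSelfSigned(string commonName = "localhost")
    {
        // Not disposed here so the returned certificate can keep using the key.
        var rsa = RSA.Create(2048);
        var request = new CertificateRequest(
            $"CN={commonName}", rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);

        // TLS clients match the host name against the SAN, not the CN.
        var san = new SubjectAlternativeNameBuilder();
        san.AddDnsName(commonName);
        request.CertificateExtensions.Add(san.Build());

        var now = DateTimeOffset.UtcNow;
        return request.CreateSelfSigned(now.AddMinutes(-5), now.AddDays(1));
    }
}
```

On Windows, `SslStream` may not accept a certificate whose private key exists only in memory. If that turns out to apply, a commonly used workaround is to export the certificate as PFX and load it again before handing it to the server.
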
## 6. Error Handling

- **TLS handshake failures** are non-fatal: log warning, close socket, increment counter
- **Mixed mode byte detection**: 0x16 → TLS, printable ASCII → plain, connection close → clean disconnect
- **Rate limiter**: holds TCP connection open until token available (not rejected)
- **Monitoring concurrency**: `varzMu` semaphore serializes `/varz`, client snapshot for `/connz`
- **CPU sampling**: cached 1 second to avoid overhead on rapid polls
- **Graceful shutdown**: `MonitorServer.DisposeAsync()` stops Kestrel, rate limiter disposes timer, in-flight handshakes cancelled via CancellationToken

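The handshake timeout and shutdown cancellation could share one linked token, as in this sketch. `HandshakeTimeoutSketch` and `TryHandshakeAsync` are hypothetical. The `handshake` delegate stands in for the real `SslStream.AuthenticateAsServerAsync` call, and only cancellation is handled here; authentication or I/O failures would still need their own handling.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class HandshakeTimeoutSketch
{
    // Returns false when the handshake was cancelled, either by the timeout
    // or by server shutdown. The caller then logs a warning and closes the socket.
    public static async Task<bool> TryHandshakeAsync(
        Func<CancellationToken, Task> handshake,
        double timeoutSeconds,
        CancellationToken shutdown)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(shutdown);
        cts.CancelAfter(TimeSpan.FromSeconds(timeoutSeconds));
        try
        {
            await handshake(cts.Token);
            return true;
        }
        catch (OperationCanceledException)
        {
            return false;
        }
    }
}
```

With this shape, `TlsTimeout` and graceful shutdown both surface as the same cancellation path, so the accept loop needs only one failure branch for them.
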