diff --git a/docs/plans/2026-02-22-core-lifecycle-design.md b/docs/plans/2026-02-22-core-lifecycle-design.md
new file mode 100644
index 0000000..fc6818b
--- /dev/null
+++ b/docs/plans/2026-02-22-core-lifecycle-design.md
@@ -0,0 +1,139 @@
+# Core Server Lifecycle — Design
+
+Implements all gaps from section 1 of `differences.md` (Core Server Lifecycle).
+
+Reference: `golang/nats-server/server/server.go`, `client.go`, `signal.go`
+
+## Components
+
+### 1. ClosedState Enum & Close Reason Tracking
+
+New file `src/NATS.Server/ClosedState.cs` — full Go enum (37 values from `client.go:188-228`).
+
+- `NatsClient` gets `CloseReason` property, `MarkClosed(ClosedState)` method
+- Close reason set in `RunAsync` finally blocks based on exception type
+- Error-related reasons (ReadError, WriteError, TLSHandshakeError, and the slow-consumer reasons listed in section 6) skip flush on close
+- `NatsServer.RemoveClient` logs close reason via structured logging
+
+### 2. Accept Loop Exponential Backoff
+
+Port Go's `acceptError` pattern from `server.go:4607-4627`.
+
+- Constants: `AcceptMinSleep = 10ms`, `AcceptMaxSleep = 1s`
+- On `SocketException`: sleep `tmpDelay`, double it, cap at 1s
+- On success: reset to 10ms
+- During sleep: check `_quitCts` to abort if shutting down
+- Non-temporary errors break the loop
+
+### 3. Ephemeral Port (port=0)
+
+After `_listener.Bind()` + `Listen()`, resolve actual port:
+
+```csharp
+if (_options.Port == 0)
+{
+    var actualPort = ((IPEndPoint)_listener.LocalEndPoint!).Port;
+    _options.Port = actualPort;
+    _serverInfo.Port = actualPort;
+}
+```
+
+Add public `Port` property on `NatsServer` exposing the resolved port.
+
+### 4. Graceful Shutdown with WaitForShutdown
+
+New fields on `NatsServer`:
+- `_shutdown` (volatile bool)
+- `_shutdownComplete` (TaskCompletionSource)
+- `_quitCts` (CancellationTokenSource) — internal shutdown signal
+
+`ShutdownAsync()` sequence:
+1. Guard: if already shutting down, return
+2. Set `_shutdown = true`, cancel `_quitCts`
+3. Close `_listener` (stops accept loop)
+4. Close all client connections with `ServerShutdown` reason
+5. Wait for active client tasks to drain
+6. Stop monitor server
+7. Signal `_shutdownComplete`
+
+`WaitForShutdown()`: blocks on `_shutdownComplete.Task`.
+
+`Dispose()`: calls `ShutdownAsync` synchronously if not already shut down.
+
+### 5. Task Tracking
+
+Track active client tasks for clean shutdown:
+- `_activeClientCount` (int, Interlocked)
+- `_allClientsExited` (TaskCompletionSource, signaled when count hits 0 during shutdown)
+- Increment in `AcceptClientAsync`, decrement in `RunClientAsync` finally block
+- `ShutdownAsync` waits on `_allClientsExited` with timeout
+
+### 6. Flush Pending Data Before Close
+
+`NatsClient.FlushAndCloseAsync(bool minimalFlush)`:
+- If the close reason is not a skip-flush reason: flush stream with 100ms write deadline
+- Close socket
+
+`MarkClosed(ClosedState)` sets skip-flush flag for: ReadError, WriteError, SlowConsumerPendingBytes, SlowConsumerWriteDeadline, TLSHandshakeError.
+
+### 7. Lame Duck Mode
+
+New options: `LameDuckDuration` (default 2min), `LameDuckGracePeriod` (default 10s).
+
+`LameDuckShutdownAsync()`:
+1. Set `_lameDuckMode = true`
+2. Close listener (stop new connections)
+3. Wait `LameDuckGracePeriod` (10s default) for clients to drain naturally
+4. Stagger-close remaining clients over `LameDuckDuration - LameDuckGracePeriod`
+   - Sleep interval = remaining duration / client count (min 1ms, max 1s)
+   - Randomize slightly to avoid reconnect storms
+5. Call `ShutdownAsync()` for final cleanup
+
+Accept loop: on error, if `_lameDuckMode`, exit cleanly.
+
+### 8. PID File & Ports File
+
+New options: `PidFile` (string?), `PortsFileDir` (string?).
+
+- PID file: `File.WriteAllText(pidFile, Process.GetCurrentProcess().Id.ToString())`
+- Ports file: JSON with `{ "client": port, "monitor": monitorPort }` written to `{dir}/{exe}_{pid}.ports`
+
+Written at startup, deleted at shutdown.
+
+### 9. Signal Handling
+
+In `Program.cs`, use `PosixSignalRegistration` (.NET 6+). `SIGUSR1` and `SIGUSR2` are not named `PosixSignal` members, so register them by casting the raw signal number (Unix only):
+
+- `SIGTERM` → `server.ShutdownAsync()` then exit
+- `SIGUSR2` → `server.LameDuckShutdownAsync()`
+- `SIGUSR1` → log "log reopen not yet supported"
+- `SIGHUP` → log "config reload not yet supported"
+
+Keep existing Ctrl+C handler (SIGINT).
+
+### 10. Server Identity NKey (Stub)
+
+Generate Ed25519 key pair at construction. Store as `ServerNKey` (public) and `_serverSeed` (private). Not used in protocol yet — placeholder for future cluster identity.
+
+### 11. System Account (Stub)
+
+Create `$SYS` account in `_accounts` at construction. Expose as `SystemAccount` property. No internal subscriptions yet.
+
+### 12. Config File & Profiling (Stubs)
+
+- `NatsOptions.ConfigFile` — if set, log warning "config file parsing not yet supported"
+- `NatsOptions.ProfPort` — if set, log warning "profiling endpoint not yet supported"
+- `Program.cs`: add `-c` CLI flag
+
+## Testing
+
+- Accept loop backoff: mock socket that throws N times, verify delays
+- Ephemeral port: start server with port=0, verify resolved port > 0
+- Graceful shutdown: start server, connect clients, call ShutdownAsync, verify all disconnected
+- WaitForShutdown: verify it blocks until shutdown completes
+- Close reason tracking: verify correct ClosedState for auth timeout, max connections, stale connection
+- Lame duck mode: start server, connect clients, trigger lame duck, verify staggered closure
+- PID file: start server with PidFile option, verify file contents, verify deleted on shutdown
+- Ports file: start server with PortsFileDir, verify JSON contents
+- Flush before close: verify data is flushed before socket close during shutdown
+- System account: verify $SYS account exists after construction