docs: add core server lifecycle design for section 1 gaps
Covers ClosedState enum, accept loop backoff, ephemeral port, graceful shutdown, lame duck mode, PID/ports files, signal handling, and stub components.
139
docs/plans/2026-02-22-core-lifecycle-design.md
Normal file
@@ -0,0 +1,139 @@
# Core Server Lifecycle: Design

Implements all gaps from section 1 of `differences.md` (Core Server Lifecycle).

Reference: `golang/nats-server/server/server.go`, `client.go`, `signal.go`

## Components

### 1. ClosedState Enum & Close Reason Tracking

New file `src/NATS.Server/ClosedState.cs`: the full Go enum (37 values from `client.go:188-228`).

- `NatsClient` gets `CloseReason` property, `MarkClosed(ClosedState)` method
- Close reason set in `RunAsync` finally blocks based on exception type
- Error-related reasons (ReadError, WriteError, TLSHandshakeError, and the slow-consumer reasons listed in section 6) skip flush on close
- `NatsServer.RemoveClient` logs close reason via structured logging
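
A minimal sketch of the close-reason bookkeeping, assuming the first recorded reason should win (so a later `ServerShutdown` does not overwrite an earlier `ReadError`). The enum is abbreviated, and `None` and `CloseTracker` are hypothetical names used only for illustration; in the design the property and method live on `NatsClient`.

```csharp
using System.Threading;

// Abbreviated for illustration; the real enum mirrors every value in Go's client.go.
public enum ClosedState
{
    None = 0,           // hypothetical sentinel: no close reason recorded yet
    ClientClosed,
    ReadError,
    WriteError,
    TLSHandshakeError,
    ServerShutdown,
}

public sealed class CloseTracker
{
    private int _reason; // holds a ClosedState value; 0 means "not closed"

    public ClosedState CloseReason => (ClosedState)Volatile.Read(ref _reason);

    // Records the reason only if none was set before.
    // Returns true when this call was the one that recorded it.
    public bool MarkClosed(ClosedState reason) =>
        Interlocked.CompareExchange(ref _reason, (int)reason, (int)ClosedState.None)
            == (int)ClosedState.None;
}
```

Using a compare-and-swap keeps `MarkClosed` safe when the read loop and the shutdown path race to close the same client.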

### 2. Accept Loop Exponential Backoff

Port Go's `acceptError` pattern from `server.go:4607-4627`.

- Constants: `AcceptMinSleep = 10ms`, `AcceptMaxSleep = 1s`
- On `SocketException`: sleep `tmpDelay`, double it, cap at 1s
- On success: reset to 10ms
- During sleep: check `_quitCts` to abort if shutting down
- Non-temporary errors break the loop
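
The delay progression above can be sketched as a pure function plus a cancellable sleep. `AcceptBackoff`, `Next`, and `SleepAsync` are hypothetical names; only the two constants come from the design.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class AcceptBackoff
{
    public static readonly TimeSpan AcceptMinSleep = TimeSpan.FromMilliseconds(10);
    public static readonly TimeSpan AcceptMaxSleep = TimeSpan.FromSeconds(1);

    // Delay to use after one more consecutive failure: double, capped at the max.
    public static TimeSpan Next(TimeSpan current)
    {
        var doubled = current + current;
        return doubled > AcceptMaxSleep ? AcceptMaxSleep : doubled;
    }

    // Sleeps for the delay; returns false if shutdown was requested meanwhile,
    // in which case the accept loop should exit instead of retrying.
    public static async Task<bool> SleepAsync(TimeSpan delay, CancellationToken quit)
    {
        try
        {
            await Task.Delay(delay, quit);
            return true;
        }
        catch (OperationCanceledException)
        {
            return false;
        }
    }
}
```

Starting from 10ms, consecutive failures sleep 10ms, 20ms, 40ms, and so on up to 640ms, after which every further failure sleeps the 1s cap. A successful accept resets the delay to `AcceptMinSleep`.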

### 3. Ephemeral Port (port=0)

After `_listener.Bind()` + `Listen()`, resolve actual port:

```csharp
if (_options.Port == 0)
{
    // Port 0 asks the OS to pick a free port; read back what it assigned.
    var actualPort = ((IPEndPoint)_listener.LocalEndPoint!).Port;
    _options.Port = actualPort;
    _serverInfo.Port = actualPort;
}
```

Add public `Port` property on `NatsServer` exposing the resolved port.

### 4. Graceful Shutdown with WaitForShutdown

New fields on `NatsServer`:
- `_shutdown` (volatile bool)
- `_shutdownComplete` (TaskCompletionSource)
- `_quitCts` (CancellationTokenSource): internal shutdown signal

`ShutdownAsync()` sequence:
1. Guard: if already shutting down, return
2. Set `_shutdown = true`, cancel `_quitCts`
3. Close `_listener` (stops accept loop)
4. Close all client connections with `ServerShutdown` reason
5. Wait for active client tasks to drain
6. Stop monitor server
7. Signal `_shutdownComplete`

`WaitForShutdown()`: blocks on `_shutdownComplete.Task`.

`Dispose()`: calls `ShutdownAsync` synchronously if not already shut down.
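
The guard and completion signal can be sketched as follows. This is an illustration under assumptions, not the final `NatsServer` code: `ShutdownCoordinator` and `CleanupRuns` are hypothetical, and the sketch swaps the `volatile bool` for `Interlocked.Exchange` on an `int`, because a plain check-then-set on a bool would let two concurrent callers both pass the guard. The non-generic `TaskCompletionSource` assumes .NET 5 or later.

```csharp
using System.Threading;
using System.Threading.Tasks;

public sealed class ShutdownCoordinator
{
    private int _shutdown; // 0 = running, 1 = shutdown started
    private readonly CancellationTokenSource _quitCts = new();
    private readonly TaskCompletionSource _shutdownComplete =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    public CancellationToken QuitToken => _quitCts.Token;

    // Counts how often the cleanup body ran; only here to show idempotence.
    public int CleanupRuns { get; private set; }

    public async Task ShutdownAsync()
    {
        // Steps 1-2: only the first caller gets past the guard.
        if (Interlocked.Exchange(ref _shutdown, 1) == 1) return;
        _quitCts.Cancel();

        // Steps 3-6 (close listener, close clients, drain, stop monitor) go here.
        await Task.Yield();
        CleanupRuns++;

        // Step 7: release anyone blocked in WaitForShutdown.
        _shutdownComplete.TrySetResult();
    }

    public void WaitForShutdown() => _shutdownComplete.Task.GetAwaiter().GetResult();
}
```

Note that a second caller returns immediately without waiting for the first to finish; callers that need to block until cleanup is done should use `WaitForShutdown()`.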

### 5. Task Tracking

Track active client tasks for clean shutdown:
- `_activeClientCount` (int, Interlocked)
- `_allClientsExited` (TaskCompletionSource, signaled when count hits 0 during shutdown)
- Increment in `AcceptClientAsync`, decrement in `RunClientAsync` finally block
- `ShutdownAsync` waits on `_allClientsExited` with timeout
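
A sketch of the counter and drain signal described above. `ClientTracker`, its method names, and the `_shuttingDown` flag are hypothetical; the field names `_activeClientCount` and `_allClientsExited` follow the design. Both sides use `Interlocked` operations so that either the last exiting client or the waiting shutdown path is guaranteed to observe the other.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class ClientTracker
{
    private int _activeClientCount;
    private int _shuttingDown; // 0 = running, 1 = shutdown is waiting for drain
    private readonly TaskCompletionSource _allClientsExited =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    // Called when a client is accepted.
    public void ClientStarted() => Interlocked.Increment(ref _activeClientCount);

    // Called from the client task's finally block.
    public void ClientExited()
    {
        if (Interlocked.Decrement(ref _activeClientCount) == 0
            && Volatile.Read(ref _shuttingDown) == 1)
        {
            _allClientsExited.TrySetResult();
        }
    }

    // Returns true if all clients exited before the timeout elapsed.
    public async Task<bool> WaitForDrainAsync(TimeSpan timeout)
    {
        Interlocked.Exchange(ref _shuttingDown, 1);
        if (Volatile.Read(ref _activeClientCount) == 0)
        {
            _allClientsExited.TrySetResult(); // nothing to wait for
        }

        var finished = await Task.WhenAny(_allClientsExited.Task, Task.Delay(timeout));
        return finished == _allClientsExited.Task;
    }
}
```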

### 6. Flush Pending Data Before Close

`NatsClient.FlushAndCloseAsync(bool minimalFlush)`:
- If not skip-flush reason: flush stream with 100ms write deadline
- Close socket

`MarkClosed(ClosedState)` sets skip-flush flag for: ReadError, WriteError, SlowConsumerPendingBytes, SlowConsumerWriteDeadline, TLSHandshakeError.
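
The skip-flush rule and the bounded flush might look like the sketch below. `CloseFlush` and both helper signatures are hypothetical and differ from the planned `FlushAndCloseAsync(bool minimalFlush)`; the enum is abbreviated; and a `CancellationTokenSource` timeout stands in for the 100ms write deadline, which is an assumption about how the deadline would be enforced.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Abbreviated for illustration.
public enum ClosedState
{
    ClientClosed = 1,
    ReadError,
    WriteError,
    SlowConsumerPendingBytes,
    SlowConsumerWriteDeadline,
    TLSHandshakeError,
    ServerShutdown,
}

public static class CloseFlush
{
    private static readonly TimeSpan FlushDeadline = TimeSpan.FromMilliseconds(100);

    // The five reasons for which the connection is already unusable or too slow.
    public static bool SkipsFlush(ClosedState reason) =>
        reason is ClosedState.ReadError
            or ClosedState.WriteError
            or ClosedState.SlowConsumerPendingBytes
            or ClosedState.SlowConsumerWriteDeadline
            or ClosedState.TLSHandshakeError;

    // Flushes within the deadline unless told to skip, then always closes the stream.
    public static async Task FlushAndCloseAsync(Stream stream, bool skipFlush)
    {
        try
        {
            if (!skipFlush)
            {
                using var deadline = new CancellationTokenSource(FlushDeadline);
                await stream.FlushAsync(deadline.Token);
            }
        }
        catch (Exception ex) when (ex is OperationCanceledException or IOException)
        {
            // A failed or timed-out flush must not prevent the close below.
        }
        finally
        {
            stream.Dispose();
        }
    }
}
```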

### 7. Lame Duck Mode

New options: `LameDuckDuration` (default 2min), `LameDuckGracePeriod` (default 10s).

`LameDuckShutdownAsync()`:
1. Set `_lameDuckMode = true`
2. Close listener (stop new connections)
3. Wait `LameDuckGracePeriod` (10s default) for clients to drain naturally
4. Stagger-close remaining clients over `LameDuckDuration - GracePeriod`
   - Sleep interval = remaining duration / client count (min 1ms, max 1s)
   - Randomize slightly to avoid reconnect storms
5. Call `ShutdownAsync()` for final cleanup

Accept loop: on error, if `_lameDuckMode`, exit cleanly.
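
The stagger interval from step 4 can be worked through in a small helper. `LameDuck`, `CloseInterval`, and `WithJitter` are hypothetical names, and the jitter range (between half the interval and the full interval) is an assumption chosen for illustration; the design only says to randomize slightly.

```csharp
using System;

public static class LameDuck
{
    private static readonly TimeSpan MinInterval = TimeSpan.FromMilliseconds(1);
    private static readonly TimeSpan MaxInterval = TimeSpan.FromSeconds(1);

    // Sleep between client closes: (duration - grace) / clients, clamped to [1ms, 1s].
    public static TimeSpan CloseInterval(TimeSpan lameDuckDuration, TimeSpan gracePeriod, int clientCount)
    {
        if (clientCount <= 0) return TimeSpan.Zero; // nothing to close

        var remaining = lameDuckDuration - gracePeriod;
        if (remaining < TimeSpan.Zero) remaining = TimeSpan.Zero;

        var interval = TimeSpan.FromTicks(remaining.Ticks / clientCount);
        if (interval < MinInterval) return MinInterval;
        if (interval > MaxInterval) return MaxInterval;
        return interval;
    }

    // Assumed jitter: scale the interval by a random factor in [0.5, 1.0).
    public static TimeSpan WithJitter(TimeSpan interval, Random rng)
    {
        var factor = 0.5 + (rng.NextDouble() * 0.5);
        return TimeSpan.FromTicks((long)(interval.Ticks * factor));
    }
}
```

With the defaults (2min duration, 10s grace), 1,000 clients are closed roughly 110ms apart; 10 clients hit the 1s cap, and a million clients hit the 1ms floor.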

### 8. PID File & Ports File

New options: `PidFile` (string?), `PortsFileDir` (string?).

- PID file: `File.WriteAllText(pidFile, Process.GetCurrentProcess().Id.ToString())`
- Ports file: JSON with `{ "client": port, "monitor": monitorPort }` written to `{dir}/{exe}_{pid}.ports`

Written at startup, deleted at shutdown.
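
A sketch of the two writers and the shutdown cleanup. `RuntimeFiles` and its method names are hypothetical, and using `Process.ProcessName` for the `{exe}` part of the file name is an assumption.

```csharp
using System.Diagnostics;
using System.IO;
using System.Text.Json;

public static class RuntimeFiles
{
    // Writes the current process ID as plain text.
    public static void WritePidFile(string pidFile)
    {
        using var process = Process.GetCurrentProcess();
        File.WriteAllText(pidFile, process.Id.ToString());
    }

    // Writes {dir}/{exe}_{pid}.ports and returns the path so it can be deleted at shutdown.
    public static string WritePortsFile(string dir, int clientPort, int monitorPort)
    {
        using var process = Process.GetCurrentProcess();
        var path = Path.Combine(dir, $"{process.ProcessName}_{process.Id}.ports");
        var json = JsonSerializer.Serialize(new { client = clientPort, monitor = monitorPort });
        File.WriteAllText(path, json);
        return path;
    }

    // Shutdown cleanup. File.Delete is a no-op when the file is already gone,
    // as long as its directory still exists.
    public static void DeleteIfSet(string path)
    {
        if (!string.IsNullOrEmpty(path)) File.Delete(path);
    }
}
```

Returning the ports-file path from the writer avoids recomputing the name at shutdown, when the process name or directory option might be harder to reach.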

### 9. Signal Handling

In `Program.cs`, use `PosixSignalRegistration` (.NET 6+):

- `SIGTERM` → `server.ShutdownAsync()` then exit
- `SIGUSR2` → `server.LameDuckShutdownAsync()`
- `SIGUSR1` → log "log reopen not yet supported"
- `SIGHUP` → log "config reload not yet supported"

The `PosixSignal` enum has no named members for `SIGUSR1` or `SIGUSR2`, so those two are registered by casting the platform's raw signal number to `PosixSignal` (Unix only).

Keep existing Ctrl+C handler (SIGINT).
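
A sketch of the registrations, under several assumptions: `SignalWiring` and its delegate parameters are hypothetical stand-ins for the real server calls, the raw numbers 10 and 12 are the `SIGUSR1`/`SIGUSR2` values on common Linux architectures (other platforms differ), and every handler sets `Cancel = true` so the runtime does not apply the signal's default action after the handler returns.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

public static class SignalWiring
{
    // Raw Linux values (x86/ARM); macOS and some other platforms use different numbers.
    private const int LinuxSigUsr1 = 10;
    private const int LinuxSigUsr2 = 12;

    // Returns the registrations; keep them alive and dispose them at exit.
    public static List<PosixSignalRegistration> Register(
        Action shutdown, Action lameDuck, Action<string> log)
    {
        var registrations = new List<PosixSignalRegistration>
        {
            PosixSignalRegistration.Create(PosixSignal.SIGTERM, ctx =>
            {
                ctx.Cancel = true; // we exit ourselves after a graceful shutdown
                shutdown();
            }),
            PosixSignalRegistration.Create(PosixSignal.SIGHUP, ctx =>
            {
                ctx.Cancel = true;
                log("config reload not yet supported");
            }),
        };

        if (OperatingSystem.IsLinux())
        {
            registrations.Add(PosixSignalRegistration.Create((PosixSignal)LinuxSigUsr2, ctx =>
            {
                ctx.Cancel = true;
                lameDuck();
            }));
            registrations.Add(PosixSignalRegistration.Create((PosixSignal)LinuxSigUsr1, ctx =>
            {
                ctx.Cancel = true;
                log("log reopen not yet supported");
            }));
        }

        return registrations;
    }
}
```

In `Program.cs` the `shutdown` and `lameDuck` delegates would wrap `server.ShutdownAsync()` and `server.LameDuckShutdownAsync()`; the registrations must stay referenced for the lifetime of the process, since disposing one removes its handler.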

### 10. Server Identity NKey (Stub)

Generate Ed25519 key pair at construction. Store as `ServerNKey` (public) and `_serverSeed` (private). Not used in protocol yet; placeholder for future cluster identity.

### 11. System Account (Stub)

Create `$SYS` account in `_accounts` at construction. Expose as `SystemAccount` property. No internal subscriptions yet.

### 12. Config File & Profiling (Stubs)

- `NatsOptions.ConfigFile`: if set, log warning "config file parsing not yet supported"
- `NatsOptions.ProfPort`: if set, log warning "profiling endpoint not yet supported"
- `Program.cs`: add `-c` CLI flag

## Testing

- Accept loop backoff: mock socket that throws N times, verify delays
- Ephemeral port: start server with port=0, verify resolved port > 0
- Graceful shutdown: start server, connect clients, call ShutdownAsync, verify all disconnected
- WaitForShutdown: verify it blocks until shutdown completes
- Close reason tracking: verify correct ClosedState for auth timeout, max connections, stale connection
- Lame duck mode: start server, connect clients, trigger lame duck, verify staggered closure
- PID file: start server with PidFile option, verify file contents, verify deleted on shutdown
- Ports file: start server with PortsFileDir, verify JSON contents
- Flush before close: verify data is flushed before socket close during shutdown
- System account: verify $SYS account exists after construction