# Design: Production Gap Closure (Tier 1 + Tier 2)

**Date**: 2026-02-25
**Scope**: 15 gaps across FileStore, RAFT, JetStream Cluster, Consumer/Stream, Client, MQTT, Config, Networking
**Approach**: Bottom-up by dependency (Phase 1→2→3→4)
**Codex review**: Verified 2026-02-25 — feedback incorporated

## Overview

This design covers all CRITICAL and HIGH severity gaps identified in the fresh scan of `docs/structuregaps.md` (2026-02-24). The implementation follows a bottom-up dependency approach: foundational layers (storage, consensus) are completed before higher-level features (cluster coordination, consumers, client protocol).

### Codex Feedback (Incorporated)

1. RAFT WAL is the hardest dependency — kept first in Phase 1
2. Meta snapshot encode/decode moved earlier in Phase 2 (before monitor loops)
3. Ack/NAK processing moved before/alongside the delivery loop — foundational to correctness
4. Exit gates added per phase with measurable criteria
5. Binary formats treated as versioned contracts with golden tests
6. Feature flags recommended for major subsystems

---

## Phase 1: FileStore + RAFT Foundation

**Dependencies**: None (pure infrastructure)
**Exit gate**: All `IStreamStore` methods implemented, FileStore round-trips encrypted + compressed blocks, RAFT persists and recovers state across restarts, joint consensus passes membership change tests.

### 1A: FileStore Block Lifecycle & Encryption/Compression

**Goal**: Wire `AeadEncryptor` and `S2Codec` into block I/O; add automatic rotation and crash recovery.

**MsgBlock.cs changes:**

- New fields: `byte[]? _aek` (per-block AEAD key), `bool _compressed`, `byte[] _lastChecksum`
- `Seal()` — marks the block read-only, releases the write handle
- `EncryptOnDisk(AeadEncryptor, key)` — encrypts block data at rest
- `DecryptOnRead(AeadEncryptor, key)` — decrypts block data on load
- `CompressRecord(S2Codec, payload)` → `byte[]` — per-message compression
- `DecompressRecord(S2Codec, data)` → `byte[]` — per-message decompression
- `ComputeChecksum(data)` → 8-byte highway hash
- `ValidateChecksum(data, expected)` → `bool`

**FileStore.cs changes:**

- `AppendAsync()` → triggers `RotateBlock()` when `_activeBlock.BytesWritten >= _options.BlockSize`
- `RotateBlock()` — seals the active block, creates a new one with `_nextBlockId++`
- Enhanced `RecoverBlocks()` — rebuild the TTL wheel, scan tombstones, validate checksums
- `RecoverTtlState()` — scan all blocks, re-register unexpired messages in `_ttlWheel`
- `WriteStateAtomically(path, data)` — temp file + `File.Move(temp, target, overwrite: true)`
- `GenEncryptionKeys()` — per-block key derivation from the stream encryption key
- `RecoverAek(blockId)` — recover the per-block key from the metadata file

**New file: `HighwayHash.cs`** — .NET highway hash implementation or NuGet wrapper.

**Go reference**: `filestore.go` lines 816–1407 (encryption), 1754–2401 (recovery), 4443–4477 (cache), 5783–5842 (flusher).
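To make the atomic state write concrete, here is a minimal sketch of the `WriteStateAtomically` helper described above (class name and exact shape are illustrative, not the final API):

```csharp
using System.IO;

public static class FileStoreIo
{
    // Write-then-rename: readers either see the old state file or the new one,
    // never a torn write. Assumes temp and target live on the same volume,
    // where File.Move is an atomic rename.
    public static void WriteStateAtomically(string path, byte[] data)
    {
        string temp = path + ".tmp";
        using (var fs = new FileStream(temp, FileMode.Create, FileAccess.Write))
        {
            fs.Write(data, 0, data.Length);
            fs.Flush(flushToDisk: true); // fsync the temp file before the rename
        }
        File.Move(temp, path, overwrite: true);
    }
}
```

Crash recovery can then treat any leftover `*.tmp` file at startup as garbage to delete.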
### 1B: Missing IStreamStore Methods

Implement the 15+ methods currently throwing `NotSupportedException`:

| Method | Strategy | Go ref |
|--------|----------|--------|
| `StoreRawMsg` | Append with caller seq/ts, bypass auto-increment | 4759 |
| `LoadNextMsgMulti` | Multi-filter scan across blocks | 8426 |
| `LoadPrevMsg` | Reverse scan from start sequence | 8580 |
| `LoadPrevMsgMulti` | Reverse multi-filter scan | 8634 |
| `NumPendingMulti` | Count with filter set | 4051 |
| `EncodedStreamState` | Binary serialize (versioned format) | 10599 |
| `Utilization` | bytes_used / bytes_allocated | 8758 |
| `FlushAllPending` | Force-flush active block to disk | — |
| `RegisterStorageUpdates` | Callback for change notifications | 4412 |
| `RegisterStorageRemoveMsg` | Callback for removal events | 4424 |
| `RegisterProcessJetStreamMsg` | Callback for JS processing | 4431 |
| `ResetState` | Clear caches post-NRG catchup | 8788 |
| `UpdateConfig` | Apply block size, retention, limits | 655 |
| `Delete(inline)` | Stop + delete all block files | — |
| `Stop()` | Flush + close all FDs | — |

### 1C: RAFT Persistent WAL

**RaftLog.cs changes:**

- Replace the `List`-backed log with a binary WAL file
- Record format: `[4-byte length][entry bytes]` (length-prefixed)
- Entry serialization: `[8-byte index][4-byte term][payload]`
- `Append()` → append to the WAL file, increment the in-memory index
- `SyncAsync()` → `FileStream.FlushAsync()` for durability
- `LoadAsync()` → scan the WAL, rebuild the in-memory index, validate
- `Compact(upToIndex)` → rewrite the WAL from `upToIndex + 1` onward
- Version header at the start of the WAL file for forward compatibility

**RaftNode.cs changes:**

- Wire `_persistDirectory` into disk operations
- Persist term/vote in `meta.json`: `{ "term": N, "votedFor": "id" }`
- Startup: load WAL → load snapshot → apply entries after the snapshot index
- `PersistTermAsync()` — called on term/vote changes

**Go reference**: `raft.go` lines 1000–1100 (snapshot streaming), WAL patterns throughout.
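The WAL record framing above can be sketched as follows. This is a minimal illustration of the `[4-byte length][8-byte index][4-byte term][payload]` layout; the little-endian choice and method names are assumptions, not part of the design:

```csharp
using System;
using System.Buffers.Binary;

public static class WalCodec
{
    // Frame: [4-byte length][8-byte index][4-byte term][payload], little-endian.
    public static byte[] EncodeEntry(ulong index, uint term, ReadOnlySpan<byte> payload)
    {
        int entryLen = 8 + 4 + payload.Length;   // index + term + payload
        var buf = new byte[4 + entryLen];        // length prefix + entry bytes
        BinaryPrimitives.WriteInt32LittleEndian(buf, entryLen);
        BinaryPrimitives.WriteUInt64LittleEndian(buf.AsSpan(4), index);
        BinaryPrimitives.WriteUInt32LittleEndian(buf.AsSpan(12), term);
        payload.CopyTo(buf.AsSpan(16));
        return buf;
    }

    public static (ulong Index, uint Term, byte[] Payload) DecodeEntry(ReadOnlySpan<byte> frame)
    {
        int entryLen = BinaryPrimitives.ReadInt32LittleEndian(frame);
        ulong index = BinaryPrimitives.ReadUInt64LittleEndian(frame.Slice(4));
        uint term = BinaryPrimitives.ReadUInt32LittleEndian(frame.Slice(12));
        return (index, term, frame.Slice(16, entryLen - 12).ToArray());
    }
}
```

`LoadAsync()` can then walk the file by reading a 4-byte length followed by that many entry bytes, treating a short read at the tail as a torn write from a crash and truncating there.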
### 1D: RAFT Joint Consensus

**RaftNode.cs changes:**

- New field: `HashSet<string>? _jointNewMembers` — Cnew during the transition
- `ProposeAddPeer(peerId)`:
  1. Propose the joint entry (Cold ∪ Cnew) to the log
  2. Wait for commit
  3. Propose the final Cnew entry
  4. Wait for commit
- `ProposeRemovePeer(peerId)` — same two-phase pattern
- `CalculateQuorum()`:
  - Normal: `majority(_members)`
  - Joint: require a majority in **both** Cold and Cnew, counted separately per configuration
- `ApplyMembershipChange(entry)`:
  - Joint entry: set `_jointNewMembers`
  - Final entry: replace `_members` with Cnew, clear `_jointNewMembers`

**Go reference**: `raft.go` membership change handling; joint consensus as described in the Raft paper's membership-change section.

---

## Phase 2: JetStream Cluster Coordination

**Dependencies**: Phase 1 (RAFT persistence, FileStore)
**Exit gate**: Meta-group processes stream/consumer assignments via RAFT, snapshots encode/decode round-trip, leadership transitions are handled correctly, orphan detection works.

### 2A: Meta Snapshot Encoding/Decoding (moved earlier per Codex)

**New file: `MetaSnapshotCodec.cs`**

- `Encode(assignments)` → `byte[]` — serializes the account → stream → assignment maps
  - Uses S2 compression for wire efficiency
  - Versioned format header for compatibility
- `Decode(byte[])` → assignment maps
- Golden tests with known-good snapshots

### 2B: Cluster Monitoring Loop

**New class: `JetStreamClusterMonitor`** (implements `IHostedService`)

- Long-running `ExecuteAsync()` loop consuming committed entries from the meta RAFT node's channel
- `ApplyMetaEntry(entry)` — dispatch by op type:
  - `assignStreamOp` → `ProcessStreamAssignment()`
  - `assignConsumerOp` → `ProcessConsumerAssignment()`
  - `removeStreamOp` → `ProcessStreamRemoval()`
  - `removeConsumerOp` → `ProcessConsumerRemoval()`
  - `updateStreamOp` → `ProcessUpdateStreamAssignment()`
  - `EntrySnapshot` → `ApplyMetaSnapshot()`
  - `EntryAddPeer` / `EntryRemovePeer` → peer management
- Periodic snapshot: 1-minute interval, dropping to 15s on errors
- Orphan check: 30s post-recovery scan
- Recovery state machine: deferred add/remove during catchup

### 2C: Stream/Consumer Assignment Processing

**Enhance `JetStreamMetaGroup.cs`:**

- `ProcessStreamAssignment(sa)` — validate config, check account limits, create the stream on this node if it is a member of the replica group, respond to the inflight request
- `ProcessConsumerAssignment(ca)` — validate, create the consumer, bind it to its stream
- `ProcessStreamRemoval(sa)` — clean up local resources, delete the RAFT group
- `ProcessConsumerRemoval(ca)` — clean up consumer state
- `ProcessUpdateStreamAssignment(sa)` — apply config changes, handle replica scaling

### 2D: Inflight Tracking Enhancement

**Change the inflight maps from:**

```csharp
ConcurrentDictionary<string, StreamAssignment> _inflightStreams;
```

**To:**

```csharp
record InflightInfo(int OpsCount, bool Deleted, StreamAssignment? Assignment);

// account → stream → info
ConcurrentDictionary<string, ConcurrentDictionary<string, InflightInfo>> _inflightStreams;
```

- `TrackInflightStreamProposal(account, stream, assignment)` — increment ops, store the assignment
- `RemoveInflightStreamProposal(account, stream)` — decrement, clear when zero
- Same pattern for consumers

### 2E: Leadership Transition

**Add to `JetStreamClusterMonitor`:**

- `ProcessLeaderChange(isLeader)`:
  - Clear all inflight proposal maps
  - Becoming leader: start update subscriptions, publish the domain leader advisory
  - Losing leadership: stop update subscriptions
- `ProcessStreamLeaderChange(stream, isLeader)`:
  - Leader: start the monitor, set sync subjects, begin replication
  - Follower: stop local delivery
- `ProcessConsumerLeaderChange(consumer, isLeader)`:
  - Leader: start the delivery engine
  - Follower: stop delivery

### 2F: Per-Stream/Consumer Monitor Loops

**New class: `StreamMonitor`** — per-stream background task:

- Snapshot timing: 2-minute interval plus random jitter
- State change detection via hash comparison
- Recovery timeout: escalate after N failures
- Compaction trigger after snapshot

**New class: `ConsumerMonitor`** — per-consumer background task:

- Hash-based state change detection
- Timeout tracking for stalled consumers
- Migration detection (R1 → R3 scale-up)

---

## Phase 3: Consumer + Stream Engines

**Dependencies**: Phase 1 (storage), Phase 2 (cluster coordination)
**Exit gate**: Consumer delivery loop correctly delivers messages with redelivery, pull requests expire and respect batch/maxBytes, purge with subject filtering works, mirror/source retry recovers from failures.

### 3A: Ack/NAK Processing (moved before the delivery loop per Codex)

**Enhance `AckProcessor.cs`:**

- Background subscription consuming from the consumer's ack subject
- Parse ack types:
  - `+ACK` — acknowledge the message, remove it from pending
  - `-NAK` — negative ack; optional `{delay}` payload for the redelivery schedule
  - `-NAK {"delay": "2s", "backoff": [1,2,4]}` — with backoff schedule
  - `+TERM` — terminate: remove from pending, never redeliver
  - `+WPI` — work in progress: extend the ack wait deadline
- Track `deliveryCount` per message sequence
- Max deliveries enforcement: publish an advisory when exceeded

**Enhance `RedeliveryTracker.cs`:**

- Replace the internals with a `PriorityQueue<TElement, TPriority>` (min-heap by deadline)
- `Schedule(seq, delay)` — insert with a deadline
- `TryGetNextReady(now)` — returns a sequence past its deadline, or null
- `BackoffSchedule` support: delay array indexed by delivery count
- Rate limiting: configurable max redeliveries per tick

### 3B: Consumer Delivery Loop

**New class: `ConsumerDeliveryLoop`** — shared between push and pull:

- Background `Task` running while the consumer is active
- Main loop:
  1. Check the redelivery queue first (priority over new messages)
  2. Poll `IStreamStore.LoadNextMsg()` for the next matching message
  3. Check `MaxAckPending` — block while pending >= limit
  4. Deliver the message (push: to the deliver subject; pull: to the request's reply subject)
  5. Register it in the pending map with an ack deadline
  6. Update stats (delivered count, bytes)

**Push mode integration:**

- Delivers to `DeliverSubject` directly
- Idle heartbeat: sent when no messages arrive for the `IdleHeartbeat` duration
- Flow control: `Nats-Consumer-Stalled` header when pending exceeds the threshold

**Pull mode integration:**

- Consumes from the `WaitingRequestQueue`
- Delivers to the request's `ReplyTo` subject
- Respects `Batch`, `MaxBytes`, and `NoWait` per request

### 3C: Pull Request Pipeline

**New class: `WaitingRequestQueue`:**

```csharp
record PullRequest(string ReplyTo, int Batch, long MaxBytes, DateTimeOffset Expires, bool NoWait, string? PinId);
```

- Internal `LinkedList<PullRequest>` ordered by arrival
- `Enqueue(request)` — adds the request, starts its expiry timer
- `TryDequeue()` — returns the next request with remaining capacity
- `RemoveExpired(now)` — periodic cleanup
- `Count`, `IsEmpty` properties

### 3D: Priority Group Pinning

**Enhance `PriorityGroupManager.cs`:**

- `AssignPinId()` — generates a NUID string, starts the TTL timer
- `ValidatePinId(pinId)` — checks validity
- `UnassignPinId()` — clears the pin, publishes an advisory
- `PinnedTtl` — configurable timeout
- Integrates with the `WaitingRequestQueue` for pin-aware dispatch

### 3E: Consumer Pause/Resume

**Add to `ConsumerManager.cs`:**

- `PauseUntil` field on the consumer state
- `PauseAsync(until)`:
  - Set the deadline, stop the delivery loop
  - Publish `$JS.EVENT.ADVISORY.CONSUMER.PAUSED`
  - Start the auto-resume timer
- `ResumeAsync()`:
  - Clear the deadline, restart the delivery loop
  - Publish the resume advisory

### 3F: Mirror/Source Retry

**Enhance `MirrorCoordinator.cs`:**

- Retry loop: exponential backoff 1s → 30s with jitter
- `SetupMirrorConsumer()` — creates an ephemeral consumer on the source via the JetStream API
- Sequence tracking: `_sourceSeq` (last seen), `_destSeq` (last stored)
- Gap detection: if the source sequence jumps, log an advisory
- `SetMirrorErr(err)` — publishes an error-state advisory

**Enhance `SourceCoordinator.cs`:**

- Same retry pattern
- Per-source consumer setup/teardown
- Subject filter + transform before storing
- Dedup window: dictionary keyed by `Nats-Msg-Id`, with time-based pruning

### 3G: Stream Purge with Filtering

**Enhance `StreamManager.cs`:**

- `PurgeEx(subject, seq, keep)`:
  - `subject` filter: iterate messages, purge only those matching the subject pattern
  - `seq` range: purge messages with seq < the specified value
  - `keep`: retain the N most recent messages (purge all but the last N)
- After purge: cascade to consumers (clear pending state for purged sequences)

### 3H: Interest Retention

**New class: `InterestRetentionPolicy`:**

- Tracks which consumers have interest in each message
- `RegisterInterest(consumerName, filterSubject)` — consumer subscribes
- `AcknowledgeDelivery(consumerName, seq)` — consumer acked
- `ShouldRetain(seq)` — returns `false` once all interested consumers have acked
- Periodic scan for removable messages

---

## Phase 4: Client Performance, MQTT, Config, Networking

**Dependencies**: Phase 1 (storage for MQTT), Phase 3 (consumers for MQTT)
**Exit gate**: Client flush coalescing measurably reduces syscalls under load, MQTT sessions survive a server restart, SIGHUP triggers config reload, implicit routes auto-discover.

### 4A: Client Flush Coalescing

**Enhance `NatsClient.cs`:**

- New fields: `int _flushSignalsPending`, `const int MaxFlushPending = 10`
- New field: `AsyncAutoResetEvent _flushSignal` — the write loop waits on this
- `QueueOutbound()`: after writing to the channel, increment `_flushSignalsPending` and signal
- `RunWriteLoopAsync()`: if `_flushSignalsPending < MaxFlushPending && _pendingBytes < MaxBufSize`, wait for the signal (coalesces flushes)
- `NatsServer.FlushClients(HashSet<NatsClient>)` — signals all pending clients

### 4B: Client Stall Gate

**Enhance `NatsClient.cs`:**

- New field: `SemaphoreSlim? _stallGate`
- In `QueueOutbound()`:
  - If `_pendingBytes > _maxPending * 3 / 4` and `_stallGate == null`: create `new SemaphoreSlim(0, 1)`
  - If `_stallGate != null`: `await _stallGate.WaitAsync(timeout)`
- In `RunWriteLoopAsync()`:
  - After a successful flush below the threshold: `_stallGate?.Release()`, then set `_stallGate = null`
- Per-kind handling: CLIENT → close on slow consumer; ROUTER/GATEWAY/LEAF → mark the flag, keep the connection alive

### 4C: Write Timeout Recovery

**Enhance `RunWriteLoopAsync()`:**

- Track `bytesWritten` and `bytesAttempted` per flush cycle
- On write timeout:
  - `Kind == CLIENT` → close immediately
  - `bytesWritten > 0` (partial flush) → mark `isSlowConsumer`, continue
  - `bytesWritten == 0` → close
- New enum: `WriteTimeoutPolicy { Close, TcpFlush }`
- Route/gateway/leaf connections use `TcpFlush` by default

### 4D: MQTT JetStream Persistence

**Enhance the MQTT subsystem.**

**Streams per account:**

- `$MQTT_msgs` — message persistence (subject: `$MQTT.msgs.>`)
- `$MQTT_rmsgs` — retained messages (subject: `$MQTT.rmsgs.>`)
- `$MQTT_sess` — session state (subject: `$MQTT.sess.>`)

**MqttSessionStore.cs changes:**

- `ConnectAsync(clientId, cleanSession)`:
  - `cleanSession=false`: query `$MQTT_sess` for an existing session and restore it
  - `cleanSession=true`: delete any existing session from the stream
- `SaveSessionAsync()` — persist the current state to `$MQTT_sess`
- `DisconnectAsync(abnormal)`:
  - If abnormal and a will is configured: publish the will message
  - If `cleanSession=false`: save the session

**MqttRetainedStore.cs changes:**

- Backed by the `$MQTT_rmsgs` stream
- `SetRetained(topic, payload)` — store/update a retained message
- `GetRetained(topicFilter)` — query matching retained messages
- On SUBSCRIBE: deliver matching retained messages

**QoS tracking:**

- QoS 1 pending acks stored in the session stream
- On reconnect: redeliver unacked QoS 1 messages
- QoS 2 state machine backed by the session stream

### 4E: SIGHUP Config Reload

**New file: `SignalHandler.cs`** in `Configuration/`:

```csharp
public static class SignalHandler
{
    private static PosixSignalRegistration? _sighup;

    public static void Register(ConfigReloader reloader, NatsServer server)
    {
        // Keep the registration alive; if it is collected, the handler is unregistered.
        _sighup = PosixSignalRegistration.Create(PosixSignal.SIGHUP, ctx =>
        {
            ctx.Cancel = true; // suppress the default terminate-on-SIGHUP behavior
            _ = reloader.ReloadAsync(server);
        });
    }
}
```

**Enhance `ConfigReloader.cs`:**

- New `ReloadAsync(NatsServer server)`:
  - Re-parse the config file
  - `Diff(old, new)` produces the change list
  - Apply reloadable changes: logging, auth, TLS, limits
  - Reject non-reloadable changes with a warning
- Auth change propagation: iterate clients, re-evaluate, close any now-unauthorized connections
- TLS reload: `server.ReloadTlsCertificate(newCert)`

### 4F: Implicit Route/Gateway Discovery

**Enhance `RouteManager.cs`:**

- `ProcessImplicitRoute(serverInfo)`:
  - Extract `connect_urls` from INFO
  - For each unknown URL: create a solicited connection
  - Track URLs in the `_discoveredRoutes` set to avoid duplicates
- `ForwardNewRouteInfoToKnownServers(newPeer)`:
  - Send an updated INFO to all existing routes
  - Include the new peer in `connect_urls`

**Enhance `GatewayManager.cs`:**

- `ProcessImplicitGateway(gatewayInfo)`:
  - Extract gateway URLs from INFO
  - Create an outbound connection to each discovered gateway
- `GossipGatewaysToInboundGateway(conn)`:
  - Share known gateways with new inbound connections

---

## Test Strategy

### Per-Phase Test Requirements

**Phase 1 tests:**

- FileStore: encrypted block round-trip, compressed block round-trip, crash recovery with TTL, checksum validation on corrupt data, atomic write failure recovery
- IStreamStore: each missing method gets at least 2 test cases
- RAFT WAL: persist → restart → verify state, concurrent appends, compaction
- Joint consensus: add a peer during normal operation, remove a peer, add a peer during a leader change

**Phase 2 tests:**

- Snapshot encode → decode round-trip (golden test with known data)
- Monitor loop processes stream/consumer assignment entries
- Inflight tracking prevents duplicate proposals
- Leadership transition clears state correctly

**Phase 3 tests:**

- Consumer delivery loop delivers messages in order and respects MaxAckPending
- Pull request expiry, batch/maxBytes enforcement
- NAK with delay schedules redelivery at the correct time
- Pause/resume stops and restarts delivery
- Mirror retry recovers after a source failure
- Purge with a subject filter removes only matching messages

**Phase 4 tests:**

- Flush coalescing: measure syscall count reduction under load
- Stall gate: producer blocks when the client is slow
- MQTT session restore after restart
- SIGHUP triggers reload; auth changes disconnect unauthorized clients
- Implicit route discovery creates connections to unknown peers

### Parity DB Updates

When porting Go tests, update `docs/test_parity.db`:

```sql
UPDATE go_tests
SET status = 'mapped', dotnet_test = 'TestName', dotnet_file = 'File.cs'
WHERE go_test = 'GoTestName';
```

---

## Risk Mitigation

1. **RAFT WAL correctness**: fsync on every commit, validate on load, golden tests with known-good WAL files
2. **Joint consensus edge cases**: test split-brain scenarios with simulated partitions
3. **Snapshot format drift**: version header, backward compatibility checks, golden tests
4. **Async loop races**: structured concurrency (CancellationToken chains), careful locking around the inflight maps
5. **FileStore crash recovery**: intentional corruption tests (truncated blocks, zero-filled regions)
6. **MQTT persistence migration**: feature flag to toggle JetStream backing vs. in-memory

---

## Phase Dependencies

```
Phase 1A (FileStore blocks) ─────────────────┐
Phase 1B (IStreamStore methods) ─────────────┤
Phase 1C (RAFT WAL) ─────────────────────────┼─→ Phase 2 (JetStream Cluster) ─→ Phase 3 (Consumer/Stream)
Phase 1D (RAFT joint consensus) ─────────────┘                                          │
                                                                                        ↓
Phase 4A-C (Client perf) ───────────────────────────────────────────── independent ─────┤
Phase 4D (MQTT persistence) ──────────────────── depends on Phase 1 + Phase 3 ──────────┤
Phase 4E (SIGHUP reload) ──────────────────────────────────────────── independent ──────┤
Phase 4F (Route/Gateway discovery) ─────────────────────────────────── independent ─────┘
```

Phase 4A-C, 4E, and 4F are independent and can be parallelized with Phase 3. Phase 4D requires Phase 1 (storage) and Phase 3 (consumers) to be complete.
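As a closing illustration of one of the Phase 4 items, the 4B stall gate could take roughly this shape (a sketch only; the class, method names, and the lazily created one-shot semaphore are assumptions beyond what the design specifies):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class StallGate
{
    private SemaphoreSlim? _gate;

    // Called from QueueOutbound(): once pending bytes cross 3/4 of the limit,
    // the producer parks here until the write loop drains (or the wait times out).
    public Task WaitIfStalledAsync(long pendingBytes, long maxPending, TimeSpan timeout)
    {
        if (pendingBytes > maxPending * 3 / 4)
            _gate ??= new SemaphoreSlim(0, 1);
        var gate = _gate;
        return gate is null ? Task.CompletedTask : gate.WaitAsync(timeout);
    }

    // Called from RunWriteLoopAsync() after a flush brings pending below the threshold.
    public void Release()
    {
        var gate = Interlocked.Exchange(ref _gate, null);
        gate?.Release();
    }
}
```

Clearing the field on release means the next stall allocates a fresh semaphore, which keeps the "create on threshold, discard on drain" lifecycle from 4B explicit.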