# Design: Production Gap Closure (Tier 1 + Tier 2)

**Date**: 2026-02-25
**Scope**: 15 gaps across FileStore, RAFT, JetStream Cluster, Consumer/Stream, Client, MQTT, Config, Networking
**Approach**: Bottom-up by dependency (Phase 1→2→3→4)
**Codex review**: Verified 2026-02-25 — feedback incorporated

## Overview

This design covers all CRITICAL and HIGH severity gaps identified in the fresh scan of `docs/structuregaps.md` (2026-02-24). The implementation follows a bottom-up dependency approach: foundational layers (storage, consensus) are completed before higher-level features (cluster coordination, consumers, client protocol).

### Codex Feedback (Incorporated)

1. RAFT WAL is the hardest dependency — kept first in Phase 1
2. Meta snapshot encode/decode moved earlier in Phase 2 (before monitor loops)
3. Ack/NAK processing moved before/alongside the delivery loop — foundational to correctness
4. Exit gates added per phase with measurable criteria
5. Binary formats treated as versioned contracts with golden tests
6. Feature flags recommended for major subsystems

---

## Phase 1: FileStore + RAFT Foundation

**Dependencies**: None (pure infrastructure)
**Exit gate**: All `IStreamStore` methods implemented, FileStore round-trips encrypted + compressed blocks, RAFT persists and recovers state across restarts, joint consensus passes membership change tests.

### 1A: FileStore Block Lifecycle & Encryption/Compression

**Goal**: Wire `AeadEncryptor` and `S2Codec` into block I/O; add automatic rotation and crash recovery.

**MsgBlock.cs changes:**

- New fields: `byte[]? _aek` (per-block AEAD key), `bool _compressed`, `byte[] _lastChecksum`
- `Seal()` — marks the block read-only, releases the write handle
- `EncryptOnDisk(AeadEncryptor, key)` — encrypts block data at rest
- `DecryptOnRead(AeadEncryptor, key)` — decrypts block data on load
- `CompressRecord(S2Codec, payload)` → `byte[]` — per-message compression
- `DecompressRecord(S2Codec, data)` → `byte[]` — per-message decompression
- `ComputeChecksum(data)` → 8-byte highway hash
- `ValidateChecksum(data, expected)` → `bool`

**FileStore.cs changes:**

- `AppendAsync()` → triggers `RotateBlock()` when `_activeBlock.BytesWritten >= _options.BlockSize`
- `RotateBlock()` — seals the active block, creates a new one with `_nextBlockId++`
- Enhanced `RecoverBlocks()` — rebuild the TTL wheel, scan tombstones, validate checksums
- `RecoverTtlState()` — scan all blocks, re-register unexpired messages in `_ttlWheel`
- `WriteStateAtomically(path, data)` — temp file + `File.Move(temp, target, overwrite: true)`
- `GenEncryptionKeys()` — per-block key derivation from the stream encryption key
- `RecoverAek(blockId)` — recover the per-block key from the metadata file

**New file: `HighwayHash.cs`** — .NET highway hash implementation or NuGet wrapper.

**Go reference**: `filestore.go` lines 816–1407 (encryption), 1754–2401 (recovery), 4443–4477 (cache), 5783–5842 (flusher).
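To make the atomic state write concrete, here is a minimal sketch of the `WriteStateAtomically` helper described above (class name and exact shape are illustrative, not the final API):

```csharp
using System.IO;

public static class FileStoreIo
{
    // Write-then-rename: readers either see the old state file or the new one,
    // never a torn write. Assumes temp and target live on the same volume,
    // where File.Move is an atomic rename.
    public static void WriteStateAtomically(string path, byte[] data)
    {
        string temp = path + ".tmp";
        using (var fs = new FileStream(temp, FileMode.Create, FileAccess.Write))
        {
            fs.Write(data, 0, data.Length);
            fs.Flush(flushToDisk: true); // fsync the temp file before the rename
        }
        File.Move(temp, path, overwrite: true);
    }
}
```

Crash recovery can then treat any leftover `*.tmp` file at startup as garbage to delete.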
### 1B: Missing IStreamStore Methods

Implement the 15+ methods currently throwing `NotSupportedException`:

| Method | Strategy | Go ref |
|--------|----------|--------|
| `StoreRawMsg` | Append with caller seq/ts, bypass auto-increment | 4759 |
| `LoadNextMsgMulti` | Multi-filter scan across blocks | 8426 |
| `LoadPrevMsg` | Reverse scan from start sequence | 8580 |
| `LoadPrevMsgMulti` | Reverse multi-filter scan | 8634 |
| `NumPendingMulti` | Count with filter set | 4051 |
| `EncodedStreamState` | Binary serialize (versioned format) | 10599 |
| `Utilization` | bytes_used / bytes_allocated | 8758 |
| `FlushAllPending` | Force-flush active block to disk | — |
| `RegisterStorageUpdates` | Callback for change notifications | 4412 |
| `RegisterStorageRemoveMsg` | Callback for removal events | 4424 |
| `RegisterProcessJetStreamMsg` | Callback for JS processing | 4431 |
| `ResetState` | Clear caches post-NRG catchup | 8788 |
| `UpdateConfig` | Apply block size, retention, limits | 655 |
| `Delete(inline)` | Stop + delete all block files | — |
| `Stop()` | Flush + close all FDs | — |

### 1C: RAFT Persistent WAL

**RaftLog.cs changes:**

- Replace the `List`-backed log with a binary WAL file
- Record format: `[4-byte length][entry bytes]` (length-prefixed)
- Entry serialization: `[8-byte index][4-byte term][payload]`
- `Append()` → append to the WAL file, increment the in-memory index
- `SyncAsync()` → `FileStream.FlushAsync()` for durability
- `LoadAsync()` → scan the WAL, rebuild the in-memory index, validate
- `Compact(upToIndex)` → rewrite the WAL from `upToIndex + 1` onward
- Version header at the start of the WAL file for forward compatibility

**RaftNode.cs changes:**

- Wire `_persistDirectory` into disk operations
- Persist term/vote in `meta.json`: `{ "term": N, "votedFor": "id" }`
- Startup: load WAL → load snapshot → apply entries after the snapshot index
- `PersistTermAsync()` — called on term/vote changes

**Go reference**: `raft.go` lines 1000–1100 (snapshot streaming), WAL patterns throughout.
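The WAL record framing above can be sketched as follows. This is a minimal illustration of the `[4-byte length][8-byte index][4-byte term][payload]` layout; the little-endian choice and method names are assumptions, not part of the design:

```csharp
using System;
using System.Buffers.Binary;

public static class WalCodec
{
    // Frame: [4-byte length][8-byte index][4-byte term][payload], little-endian.
    public static byte[] EncodeEntry(ulong index, uint term, ReadOnlySpan<byte> payload)
    {
        int entryLen = 8 + 4 + payload.Length;   // index + term + payload
        var buf = new byte[4 + entryLen];        // length prefix + entry bytes
        BinaryPrimitives.WriteInt32LittleEndian(buf, entryLen);
        BinaryPrimitives.WriteUInt64LittleEndian(buf.AsSpan(4), index);
        BinaryPrimitives.WriteUInt32LittleEndian(buf.AsSpan(12), term);
        payload.CopyTo(buf.AsSpan(16));
        return buf;
    }

    public static (ulong Index, uint Term, byte[] Payload) DecodeEntry(ReadOnlySpan<byte> frame)
    {
        int entryLen = BinaryPrimitives.ReadInt32LittleEndian(frame);
        ulong index = BinaryPrimitives.ReadUInt64LittleEndian(frame.Slice(4));
        uint term = BinaryPrimitives.ReadUInt32LittleEndian(frame.Slice(12));
        return (index, term, frame.Slice(16, entryLen - 12).ToArray());
    }
}
```

`LoadAsync()` can then walk the file by reading a 4-byte length followed by that many entry bytes, treating a short read at the tail as a torn write from a crash and truncating there.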
### 1D: RAFT Joint Consensus

**RaftNode.cs changes:**

- New field: `HashSet<string>? _jointNewMembers` — Cnew during the transition
- `ProposeAddPeer(peerId)`:
  1. Propose the joint entry (Cold ∪ Cnew) to the log
  2. Wait for commit
  3. Propose the final Cnew entry
  4. Wait for commit
- `ProposeRemovePeer(peerId)` — same two-phase pattern
- `CalculateQuorum()`:
  - Normal: `majority(_members)`
  - Joint: require a majority in **both** Cold and Cnew, counted separately per configuration
- `ApplyMembershipChange(entry)`:
  - Joint entry: set `_jointNewMembers`
  - Final entry: replace `_members` with Cnew, clear `_jointNewMembers`

**Go reference**: `raft.go` membership change handling; joint consensus as described in the Raft paper's membership-change section.

---

## Phase 2: JetStream Cluster Coordination

**Dependencies**: Phase 1 (RAFT persistence, FileStore)
**Exit gate**: Meta-group processes stream/consumer assignments via RAFT, snapshots encode/decode round-trip, leadership transitions are handled correctly, orphan detection works.

### 2A: Meta Snapshot Encoding/Decoding (moved earlier per Codex)

**New file: `MetaSnapshotCodec.cs`**

- `Encode(assignments)` → `byte[]` — serializes the account → stream → assignment maps
  - Uses S2 compression for wire efficiency
  - Versioned format header for compatibility
- `Decode(byte[])` → assignment maps
- Golden tests with known-good snapshots

### 2B: Cluster Monitoring Loop

**New class: `JetStreamClusterMonitor`** (implements `IHostedService`)

- Long-running `ExecuteAsync()` loop consuming committed entries from the meta RAFT node's channel
- `ApplyMetaEntry(entry)` — dispatch by op type:
  - `assignStreamOp` → `ProcessStreamAssignment()`
  - `assignConsumerOp` → `ProcessConsumerAssignment()`
  - `removeStreamOp` → `ProcessStreamRemoval()`
  - `removeConsumerOp` → `ProcessConsumerRemoval()`
  - `updateStreamOp` → `ProcessUpdateStreamAssignment()`
  - `EntrySnapshot` → `ApplyMetaSnapshot()`
  - `EntryAddPeer` / `EntryRemovePeer` → peer management
- Periodic snapshot: 1-minute interval, dropping to 15s on errors
- Orphan check: 30s post-recovery scan
- Recovery state machine: deferred add/remove during catchup

### 2C: Stream/Consumer Assignment Processing

**Enhance `JetStreamMetaGroup.cs`:**

- `ProcessStreamAssignment(sa)` — validate config, check account limits, create the stream on this node if it is a member of the replica group, respond to the inflight request
- `ProcessConsumerAssignment(ca)` — validate, create the consumer, bind it to its stream
- `ProcessStreamRemoval(sa)` — clean up local resources, delete the RAFT group
- `ProcessConsumerRemoval(ca)` — clean up consumer state
- `ProcessUpdateStreamAssignment(sa)` — apply config changes, handle replica scaling

### 2D: Inflight Tracking Enhancement

**Change the inflight maps from:**

```csharp
ConcurrentDictionary<string, StreamAssignment> _inflightStreams;
```

**To:**

```csharp
record InflightInfo(int OpsCount, bool Deleted, StreamAssignment? Assignment);

// account → stream → info
ConcurrentDictionary<string, ConcurrentDictionary<string, InflightInfo>> _inflightStreams;
```

- `TrackInflightStreamProposal(account, stream, assignment)` — increment ops, store the assignment
- `RemoveInflightStreamProposal(account, stream)` — decrement, clear when zero
- Same pattern for consumers

### 2E: Leadership Transition

**Add to `JetStreamClusterMonitor`:**

- `ProcessLeaderChange(isLeader)`:
  - Clear all inflight proposal maps
  - Becoming leader: start update subscriptions, publish the domain leader advisory
  - Losing leadership: stop update subscriptions
- `ProcessStreamLeaderChange(stream, isLeader)`:
  - Leader: start the monitor, set sync subjects, begin replication
  - Follower: stop local delivery
- `ProcessConsumerLeaderChange(consumer, isLeader)`:
  - Leader: start the delivery engine
  - Follower: stop delivery

### 2F: Per-Stream/Consumer Monitor Loops

**New class: `StreamMonitor`** — per-stream background task:

- Snapshot timing: 2-minute interval plus random jitter
- State change detection via hash comparison
- Recovery timeout: escalate after N failures
- Compaction trigger after snapshot

**New class: `ConsumerMonitor`** — per-consumer background task:

- Hash-based state change detection
- Timeout tracking for stalled consumers
- Migration detection (R1 → R3 scale-up)

---

## Phase 3: Consumer + Stream Engines

**Dependencies**: Phase 1 (storage), Phase 2 (cluster coordination)
**Exit gate**: Consumer delivery loop correctly delivers messages with redelivery, pull requests expire and respect batch/maxBytes, purge with subject filtering works, mirror/source retry recovers from failures.

### 3A: Ack/NAK Processing (moved before the delivery loop per Codex)

**Enhance `AckProcessor.cs`:**

- Background subscription consuming from the consumer's ack subject
- Parse ack types:
  - `+ACK` — acknowledge the message, remove it from pending
  - `-NAK` — negative ack; optional `{delay}` payload for the redelivery schedule
  - `-NAK {"delay": "2s", "backoff": [1,2,4]}` — with backoff schedule
  - `+TERM` — terminate: remove from pending, never redeliver
  - `+WPI` — work in progress: extend the ack wait deadline
- Track `deliveryCount` per message sequence
- Max deliveries enforcement: publish an advisory when exceeded

**Enhance `RedeliveryTracker.cs`:**

- Replace the internals with a `PriorityQueue<TElement, TPriority>` (min-heap by deadline)
- `Schedule(seq, delay)` — insert with a deadline
- `TryGetNextReady(now)` — returns a sequence past its deadline, or null
- `BackoffSchedule` support: delay array indexed by delivery count
- Rate limiting: configurable max redeliveries per tick

### 3B: Consumer Delivery Loop

**New class: `ConsumerDeliveryLoop`** — shared between push and pull:

- Background `Task` running while the consumer is active
- Main loop:
  1. Check the redelivery queue first (priority over new messages)
  2. Poll `IStreamStore.LoadNextMsg()` for the next matching message
  3. Check `MaxAckPending` — block while pending >= limit
  4. Deliver the message (push: to the deliver subject; pull: to the request's reply subject)
  5. Register it in the pending map with an ack deadline
  6. Update stats (delivered count, bytes)

**Push mode integration:**

- Delivers to `DeliverSubject` directly
- Idle heartbeat: sent when no messages arrive for the `IdleHeartbeat` duration
- Flow control: `Nats-Consumer-Stalled` header when pending exceeds the threshold

**Pull mode integration:**

- Consumes from the `WaitingRequestQueue`
- Delivers to the request's `ReplyTo` subject
- Respects `Batch`, `MaxBytes`, and `NoWait` per request

### 3C: Pull Request Pipeline

**New class: `WaitingRequestQueue`:**

```csharp
record PullRequest(string ReplyTo, int Batch, long MaxBytes, DateTimeOffset Expires, bool NoWait, string? PinId);
```

- Internal `LinkedList<PullRequest>` ordered by arrival
- `Enqueue(request)` — adds the request, starts its expiry timer
- `TryDequeue()` — returns the next request with remaining capacity
- `RemoveExpired(now)` — periodic cleanup
- `Count`, `IsEmpty` properties

### 3D: Priority Group Pinning

**Enhance `PriorityGroupManager.cs`:**

- `AssignPinId()` — generates a NUID string, starts the TTL timer
- `ValidatePinId(pinId)` — checks validity
- `UnassignPinId()` — clears the pin, publishes an advisory
- `PinnedTtl` — configurable timeout
- Integrates with the `WaitingRequestQueue` for pin-aware dispatch

### 3E: Consumer Pause/Resume

**Add to `ConsumerManager.cs`:**

- `PauseUntil` field on the consumer state
- `PauseAsync(until)`:
  - Set the deadline, stop the delivery loop
  - Publish `$JS.EVENT.ADVISORY.CONSUMER.PAUSED`
  - Start the auto-resume timer
- `ResumeAsync()`:
  - Clear the deadline, restart the delivery loop
  - Publish the resume advisory

### 3F: Mirror/Source Retry

**Enhance `MirrorCoordinator.cs`:**

- Retry loop: exponential backoff 1s → 30s with jitter
- `SetupMirrorConsumer()` — creates an ephemeral consumer on the source via the JetStream API
- Sequence tracking: `_sourceSeq` (last seen), `_destSeq` (last stored)
- Gap detection: if the source sequence jumps, log an advisory
- `SetMirrorErr(err)` — publishes an error-state advisory

**Enhance `SourceCoordinator.cs`:**

- Same retry pattern
- Per-source consumer setup/teardown
- Subject filter + transform before storing
- Dedup window: dictionary keyed by `Nats-Msg-Id`, with time-based pruning

### 3G: Stream Purge with Filtering

**Enhance `StreamManager.cs`:**

- `PurgeEx(subject, seq, keep)`:
  - `subject` filter: iterate messages, purge only those matching the subject pattern
  - `seq` range: purge messages with seq < the specified value
  - `keep`: retain the N most recent messages (purge all but the last N)
- After purge: cascade to consumers (clear pending state for purged sequences)

### 3H: Interest Retention

**New class: `InterestRetentionPolicy`:**

- Tracks which consumers have interest in each message
- `RegisterInterest(consumerName, filterSubject)` — consumer subscribes
- `AcknowledgeDelivery(consumerName, seq)` — consumer acked
- `ShouldRetain(seq)` — returns `false` once all interested consumers have acked
- Periodic scan for removable messages

---

## Phase 4: Client Performance, MQTT, Config, Networking

**Dependencies**: Phase 1 (storage for MQTT), Phase 3 (consumers for MQTT)
**Exit gate**: Client flush coalescing measurably reduces syscalls under load, MQTT sessions survive a server restart, SIGHUP triggers config reload, implicit routes auto-discover.

### 4A: Client Flush Coalescing

**Enhance `NatsClient.cs`:**

- New fields: `int _flushSignalsPending`, `const int MaxFlushPending = 10`
- New field: `AsyncAutoResetEvent _flushSignal` — the write loop waits on this
- `QueueOutbound()`: after writing to the channel, increment `_flushSignalsPending` and signal
- `RunWriteLoopAsync()`: if `_flushSignalsPending < MaxFlushPending && _pendingBytes < MaxBufSize`, wait for the signal (coalesces flushes)
- `NatsServer.FlushClients(HashSet<NatsClient>)` — signals all pending clients

### 4B: Client Stall Gate

**Enhance `NatsClient.cs`:**

- New field: `SemaphoreSlim? _stallGate`
- In `QueueOutbound()`:
  - If `_pendingBytes > _maxPending * 3 / 4` and `_stallGate == null`: create `new SemaphoreSlim(0, 1)`
  - If `_stallGate != null`: `await _stallGate.WaitAsync(timeout)`
- In `RunWriteLoopAsync()`:
  - After a successful flush below the threshold: `_stallGate?.Release()`, then set `_stallGate = null`
- Per-kind handling: CLIENT → close on slow consumer; ROUTER/GATEWAY/LEAF → mark the flag, keep the connection alive

### 4C: Write Timeout Recovery

**Enhance `RunWriteLoopAsync()`:**

- Track `bytesWritten` and `bytesAttempted` per flush cycle
- On write timeout:
  - `Kind == CLIENT` → close immediately
  - `bytesWritten > 0` (partial flush) → mark `isSlowConsumer`, continue
  - `bytesWritten == 0` → close
- New enum: `WriteTimeoutPolicy { Close, TcpFlush }`
- Route/gateway/leaf connections use `TcpFlush` by default

### 4D: MQTT JetStream Persistence

**Enhance the MQTT subsystem.**

**Streams per account:**

- `$MQTT_msgs` — message persistence (subject: `$MQTT.msgs.>`)
- `$MQTT_rmsgs` — retained messages (subject: `$MQTT.rmsgs.>`)
- `$MQTT_sess` — session state (subject: `$MQTT.sess.>`)

**MqttSessionStore.cs changes:**

- `ConnectAsync(clientId, cleanSession)`:
  - `cleanSession=false`: query `$MQTT_sess` for an existing session and restore it
  - `cleanSession=true`: delete any existing session from the stream
- `SaveSessionAsync()` — persist the current state to `$MQTT_sess`
- `DisconnectAsync(abnormal)`:
  - If abnormal and a will is configured: publish the will message
  - If `cleanSession=false`: save the session

**MqttRetainedStore.cs changes:**

- Backed by the `$MQTT_rmsgs` stream
- `SetRetained(topic, payload)` — store/update a retained message
- `GetRetained(topicFilter)` — query matching retained messages
- On SUBSCRIBE: deliver matching retained messages

**QoS tracking:**

- QoS 1 pending acks stored in the session stream
- On reconnect: redeliver unacked QoS 1 messages
- QoS 2 state machine backed by the session stream

### 4E: SIGHUP Config Reload

**New file: `SignalHandler.cs`** in `Configuration/`:

```csharp
public static class SignalHandler
{
    private static PosixSignalRegistration? _sighup;

    public static void Register(ConfigReloader reloader, NatsServer server)
    {
        // Keep the registration alive; if it is collected, the handler is unregistered.
        _sighup = PosixSignalRegistration.Create(PosixSignal.SIGHUP, ctx =>
        {
            ctx.Cancel = true; // suppress the default terminate-on-SIGHUP behavior
            _ = reloader.ReloadAsync(server);
        });
    }
}
```

**Enhance `ConfigReloader.cs`:**

- New `ReloadAsync(NatsServer server)`:
  - Re-parse the config file
  - `Diff(old, new)` produces the change list
  - Apply reloadable changes: logging, auth, TLS, limits
  - Reject non-reloadable changes with a warning
- Auth change propagation: iterate clients, re-evaluate, close any now-unauthorized connections
- TLS reload: `server.ReloadTlsCertificate(newCert)`

### 4F: Implicit Route/Gateway Discovery

**Enhance `RouteManager.cs`:**

- `ProcessImplicitRoute(serverInfo)`:
  - Extract `connect_urls` from INFO
  - For each unknown URL: create a solicited connection
  - Track URLs in the `_discoveredRoutes` set to avoid duplicates
- `ForwardNewRouteInfoToKnownServers(newPeer)`:
  - Send an updated INFO to all existing routes
  - Include the new peer in `connect_urls`

**Enhance `GatewayManager.cs`:**

- `ProcessImplicitGateway(gatewayInfo)`:
  - Extract gateway URLs from INFO
  - Create an outbound connection to each discovered gateway
- `GossipGatewaysToInboundGateway(conn)`:
  - Share known gateways with new inbound connections

---

## Test Strategy

### Per-Phase Test Requirements

**Phase 1 tests:**

- FileStore: encrypted block round-trip, compressed block round-trip, crash recovery with TTL, checksum validation on corrupt data, atomic write failure recovery
- IStreamStore: each missing method gets at least 2 test cases
- RAFT WAL: persist → restart → verify state, concurrent appends, compaction
- Joint consensus: add a peer during normal operation, remove a peer, add a peer during a leader change

**Phase 2 tests:**

- Snapshot encode → decode round-trip (golden test with known data)
- Monitor loop processes stream/consumer assignment entries
- Inflight tracking prevents duplicate proposals
- Leadership transition clears state correctly

**Phase 3 tests:**

- Consumer delivery loop delivers messages in order and respects MaxAckPending
- Pull request expiry, batch/maxBytes enforcement
- NAK with delay schedules redelivery at the correct time
- Pause/resume stops and restarts delivery
- Mirror retry recovers after a source failure
- Purge with a subject filter removes only matching messages

**Phase 4 tests:**

- Flush coalescing: measure syscall count reduction under load
- Stall gate: producer blocks when the client is slow
- MQTT session restore after restart
- SIGHUP triggers reload; auth changes disconnect unauthorized clients
- Implicit route discovery creates connections to unknown peers

### Parity DB Updates

When porting Go tests, update `docs/test_parity.db`:

```sql
UPDATE go_tests
SET status = 'mapped', dotnet_test = 'TestName', dotnet_file = 'File.cs'
WHERE go_test = 'GoTestName';
```

---

## Risk Mitigation

1. **RAFT WAL correctness**: fsync on every commit, validate on load, golden tests with known-good WAL files
2. **Joint consensus edge cases**: test split-brain scenarios with simulated partitions
3. **Snapshot format drift**: version header, backward compatibility checks, golden tests
4. **Async loop races**: structured concurrency (CancellationToken chains), careful locking around the inflight maps
5. **FileStore crash recovery**: intentional corruption tests (truncated blocks, zero-filled regions)
6. **MQTT persistence migration**: feature flag to toggle JetStream backing vs. in-memory

---

## Phase Dependencies

```
Phase 1A (FileStore blocks) ─────────────────┐
Phase 1B (IStreamStore methods) ─────────────┤
Phase 1C (RAFT WAL) ─────────────────────────┼─→ Phase 2 (JetStream Cluster) ─→ Phase 3 (Consumer/Stream)
Phase 1D (RAFT joint consensus) ─────────────┘                                          │
                                                                                        ↓
Phase 4A-C (Client perf) ───────────────────────────────────────────── independent ─────┤
Phase 4D (MQTT persistence) ──────────────────── depends on Phase 1 + Phase 3 ──────────┤
Phase 4E (SIGHUP reload) ──────────────────────────────────────────── independent ──────┤
Phase 4F (Route/Gateway discovery) ─────────────────────────────────── independent ─────┘
```

Phase 4A-C, 4E, and 4F are independent and can be parallelized with Phase 3. Phase 4D requires Phase 1 (storage) and Phase 3 (consumers) to be complete.
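As a closing illustration of one of the Phase 4 items, the 4B stall gate could take roughly this shape (a sketch only; the class, method names, and the lazily created one-shot semaphore are assumptions beyond what the design specifies):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class StallGate
{
    private SemaphoreSlim? _gate;

    // Called from QueueOutbound(): once pending bytes cross 3/4 of the limit,
    // the producer parks here until the write loop drains (or the wait times out).
    public Task WaitIfStalledAsync(long pendingBytes, long maxPending, TimeSpan timeout)
    {
        if (pendingBytes > maxPending * 3 / 4)
            _gate ??= new SemaphoreSlim(0, 1);
        var gate = _gate;
        return gate is null ? Task.CompletedTask : gate.WaitAsync(timeout);
    }

    // Called from RunWriteLoopAsync() after a flush brings pending below the threshold.
    public void Release()
    {
        var gate = Interlocked.Exchange(ref _gate, null);
        gate?.Release();
    }
}
```

Clearing the field on release means the next stall allocates a fresh semaphore, which keeps the "create on threshold, discard on drain" lifecycle from 4B explicit.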