Design: Production Gap Closure (Tier 1 + Tier 2)
Date: 2026-02-25
Scope: 15 gaps across FileStore, RAFT, JetStream Cluster, Consumer/Stream, Client, MQTT, Config, Networking
Approach: Bottom-up by dependency (Phase 1 → 2 → 3 → 4)
Codex review: Verified 2026-02-25 — feedback incorporated
Overview
This design covers all CRITICAL and HIGH severity gaps identified in the fresh scan of docs/structuregaps.md (2026-02-24). The implementation follows a bottom-up dependency approach where foundational layers (storage, consensus) are completed before higher-level features (cluster coordination, consumers, client protocol).
Codex Feedback (Incorporated)
- RAFT WAL is the hardest dependency — kept first in Phase 1
- Meta snapshot encode/decode moved earlier in Phase 2 (before monitor loops)
- Ack/NAK processing moved before/alongside delivery loop — foundational to correctness
- Exit gates added per phase with measurable criteria
- Binary formats treated as versioned contracts with golden tests
- Feature flags for major subsystems recommended
Phase 1: FileStore + RAFT Foundation
Dependencies: None (pure infrastructure)
Exit gate: All IStreamStore methods implemented, FileStore round-trips encrypted+compressed blocks, RAFT persists and recovers state across restarts, joint consensus passes membership change tests.
1A: FileStore Block Lifecycle & Encryption/Compression
Goal: Wire AeadEncryptor and S2Codec into block I/O, add automatic rotation, crash recovery.
MsgBlock.cs changes:
- New fields: `byte[]? _aek` (per-block AEAD key), `bool _compressed`, `byte[] _lastChecksum`
- `Seal()` — marks block read-only, releases write handle
- `EncryptOnDisk(AeadEncryptor, key)` — encrypts block data at rest
- `DecryptOnRead(AeadEncryptor, key)` — decrypts block data on load
- `CompressRecord(S2Codec, payload)` → `byte[]` — per-message compression
- `DecompressRecord(S2Codec, data)` → `byte[]` — per-message decompression
- `ComputeChecksum(data)` → `byte[8]` — highway hash
- `ValidateChecksum(data, expected)` → `bool`
FileStore.cs changes:
- `AppendAsync()` → triggers `RotateBlock()` when `_activeBlock.BytesWritten >= _options.BlockSize`
- `RotateBlock()` — seals active block, creates new with `_nextBlockId++`
- Enhanced `RecoverBlocks()` — rebuild TTL wheel, scan tombstones, validate checksums
- `RecoverTtlState()` — scan all blocks, re-register unexpired messages in `_ttlWheel`
- `WriteStateAtomically(path, data)` — temp file + `File.Move(temp, target, overwrite: true)`
- `GenEncryptionKeys()` → per-block key derivation from stream encryption key
- `RecoverAek(blockId)` → recover per-block key from metadata file
New file: HighwayHash.cs — .NET highway hash implementation or NuGet wrapper.
Go reference: filestore.go lines 816–1407 (encryption), 1754–2401 (recovery), 4443–4477 (cache), 5783–5842 (flusher).
1B: Missing IStreamStore Methods
Implement the 15+ methods currently throwing `NotSupportedException`:

| Method | Strategy | Go ref |
|---|---|---|
| `StoreRawMsg` | Append with caller seq/ts, bypass auto-increment | 4759 |
| `LoadNextMsgMulti` | Multi-filter scan across blocks | 8426 |
| `LoadPrevMsg` | Reverse scan from start sequence | 8580 |
| `LoadPrevMsgMulti` | Reverse multi-filter scan | 8634 |
| `NumPendingMulti` | Count with filter set | 4051 |
| `EncodedStreamState` | Binary serialize (versioned format) | 10599 |
| `Utilization` | bytes_used / bytes_allocated | 8758 |
| `FlushAllPending` | Force-flush active block to disk | — |
| `RegisterStorageUpdates` | Callback for change notifications | 4412 |
| `RegisterStorageRemoveMsg` | Callback for removal events | 4424 |
| `RegisterProcessJetStreamMsg` | Callback for JS processing | 4431 |
| `ResetState` | Clear caches post-NRG catchup | 8788 |
| `UpdateConfig` | Apply block size, retention, limits | 655 |
| `Delete` (inline) | Stop + delete all block files | — |
| `Stop()` | Flush + close all FDs | — |
1C: RAFT Persistent WAL
RaftLog.cs changes:
- Replace `List<RaftLogEntry>` backing with a binary WAL file
- Record format: `[4-byte length][entry bytes]` (length-prefixed)
- Entry serialization: `[8-byte index][4-byte term][payload]`
- `Append()` → append to WAL file, increment in-memory index
- `SyncAsync()` → `FileStream.FlushAsync()` for durability
- `LoadAsync()` → scan WAL, rebuild in-memory index, validate
- `Compact(upToIndex)` → rewrite WAL from `upToIndex + 1` onward
- Version header at WAL file start for forward compatibility
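The record framing above round-trips as follows. A sketch (little-endian byte order and the helper names are assumptions, not part of the design; the decoder returns bytes consumed so a recovery scan can walk the WAL record by record):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeEntry frames a log entry as [4-byte length][8-byte index][4-byte term][payload].
func encodeEntry(index uint64, term uint32, payload []byte) []byte {
	body := make([]byte, 12+len(payload))
	binary.LittleEndian.PutUint64(body[0:8], index)
	binary.LittleEndian.PutUint32(body[8:12], term)
	copy(body[12:], payload)
	rec := make([]byte, 4+len(body))
	binary.LittleEndian.PutUint32(rec[0:4], uint32(len(body)))
	copy(rec[4:], body)
	return rec
}

// decodeEntry reverses encodeEntry and reports how many bytes it consumed.
// A truncated tail record (partial write before a crash) surfaces as an error.
func decodeEntry(buf []byte) (index uint64, term uint32, payload []byte, n int, err error) {
	if len(buf) < 4 {
		return 0, 0, nil, 0, fmt.Errorf("short record header")
	}
	size := binary.LittleEndian.Uint32(buf[0:4])
	if size < 12 || uint32(len(buf)-4) < size {
		return 0, 0, nil, 0, fmt.Errorf("truncated record")
	}
	body := buf[4 : 4+size]
	index = binary.LittleEndian.Uint64(body[0:8])
	term = binary.LittleEndian.Uint32(body[8:12])
	payload = body[12:]
	return index, term, payload, int(4 + size), nil
}

func main() {
	rec := encodeEntry(42, 7, []byte("set x=1"))
	idx, term, payload, n, _ := decodeEntry(rec)
	fmt.Println(idx, term, string(payload), n == len(rec))
}
```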
RaftNode.cs changes:
- Wire `_persistDirectory` into disk operations
- Persist term/vote in `meta.json`: `{ "term": N, "votedFor": "id" }`
- Startup: load WAL → load snapshot → apply entries after snapshot index
- `PersistTermAsync()` — called on term/vote changes
Go reference: raft.go lines 1000–1100 (snapshot streaming), WAL patterns throughout.
1D: RAFT Joint Consensus
RaftNode.cs changes:
- New field: `HashSet<string>? _jointNewMembers` — Cnew during transition
- `ProposeAddPeer(peerId)`:
  - Propose joint entry (Cold ∪ Cnew) to log
  - Wait for commit
  - Propose final Cnew entry
  - Wait for commit
- `ProposeRemovePeer(peerId)` — same two-phase pattern
- `CalculateQuorum()`:
  - Normal: `majority(_members)`
  - Joint: `max(majority(Cold), majority(Cnew))`
- `ApplyMembershipChange(entry)`:
  - If joint entry: set `_jointNewMembers`
  - If final entry: replace `_members` with Cnew, clear `_jointNewMembers`
Go reference: raft.go Section 4 (Raft paper), membership change handling.
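The joint-configuration rule means an entry commits only when it has a majority of Cold and, during a transition, a majority of Cnew as well. A sketch of that commit check (counting votes against each set explicitly, which is the property the combined quorum size above approximates; names are illustrative):

```go
package main

import "fmt"

// majorityOf reports whether the votes include a majority of the given set.
func majorityOf(votes, set map[string]bool) bool {
	n := 0
	for id := range set {
		if votes[id] {
			n++
		}
	}
	return n >= len(set)/2+1
}

// committed reports whether an entry has the votes required by the current
// configuration: a majority of the members normally, plus a majority of
// Cnew while a joint configuration is in effect (cnew != nil).
func committed(votes, cold, cnew map[string]bool) bool {
	if !majorityOf(votes, cold) {
		return false
	}
	if cnew != nil && !majorityOf(votes, cnew) {
		return false
	}
	return true
}

func main() {
	cold := map[string]bool{"n1": true, "n2": true, "n3": true}
	cnew := map[string]bool{"n2": true, "n3": true, "n4": true, "n5": true}
	votes := map[string]bool{"n2": true, "n3": true}
	// A majority of Cold alone is not enough during the transition.
	fmt.Println(committed(votes, cold, cnew))
	votes["n4"] = true
	fmt.Println(committed(votes, cold, cnew))
}
```

This dual-majority check is what prevents split-brain during membership change: no decision can be made by Cold alone or Cnew alone while both are live.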
Phase 2: JetStream Cluster Coordination
Dependencies: Phase 1 (RAFT persistence, FileStore)
Exit gate: Meta-group processes stream/consumer assignments via RAFT, snapshots encode/decode round-trip, leadership transitions handled correctly, orphan detection works.
2A: Meta Snapshot Encoding/Decoding (moved earlier per Codex)
New file: MetaSnapshotCodec.cs
- `Encode(Dictionary<string, Dictionary<string, StreamAssignment>> assignments)` → `byte[]`
  - Uses S2 compression for wire efficiency
  - Versioned format header for compatibility
- `Decode(byte[])` → assignment maps
- Golden tests with known-good snapshots
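The version-header pattern can be sketched as follows (JSON stands in for the binary assignment encoding and S2 compression to keep the example short; the one-byte version field is the part that matters):

```go
package main

import (
	"encoding/json"
	"fmt"
)

const snapshotVersion byte = 1

// encodeSnapshot prefixes the serialized assignment map with a one-byte
// format version so future decoders can dispatch on it.
func encodeSnapshot(assignments map[string]map[string]string) ([]byte, error) {
	body, err := json.Marshal(assignments)
	if err != nil {
		return nil, err
	}
	return append([]byte{snapshotVersion}, body...), nil
}

// decodeSnapshot validates the version byte before touching the body,
// so a format bump fails loudly instead of mis-parsing.
func decodeSnapshot(buf []byte) (map[string]map[string]string, error) {
	if len(buf) == 0 {
		return nil, fmt.Errorf("empty snapshot")
	}
	if buf[0] != snapshotVersion {
		return nil, fmt.Errorf("unsupported snapshot version %d", buf[0])
	}
	var out map[string]map[string]string
	if err := json.Unmarshal(buf[1:], &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	in := map[string]map[string]string{"$G": {"ORDERS": "R3"}}
	buf, _ := encodeSnapshot(in)
	out, _ := decodeSnapshot(buf)
	fmt.Println(out["$G"]["ORDERS"])
}
```

Golden tests then pin known-good encoded bytes per version so format drift is caught at build time.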
2B: Cluster Monitoring Loop
New class: JetStreamClusterMonitor (implements IHostedService)
- Long-running `ExecuteAsync()` loop consuming `Channel<RaftLogEntry>` from the meta RAFT node
- `ApplyMetaEntry(entry)` → dispatch by op type:
  - `assignStreamOp` → `ProcessStreamAssignment()`
  - `assignConsumerOp` → `ProcessConsumerAssignment()`
  - `removeStreamOp` → `ProcessStreamRemoval()`
  - `removeConsumerOp` → `ProcessConsumerRemoval()`
  - `updateStreamOp` → `ProcessUpdateStreamAssignment()`
  - `EntrySnapshot` → `ApplyMetaSnapshot()`
  - `EntryAddPeer` / `EntryRemovePeer` → peer management
- Periodic snapshot: 1-minute interval, 15s on errors
- Orphan check: 30s post-recovery scan
- Recovery state machine: deferred add/remove during catchup
2C: Stream/Consumer Assignment Processing
Enhance JetStreamMetaGroup.cs:
- `ProcessStreamAssignment(sa)` — validate config, check account limits, create the stream on this node if it is a member of the replica group, respond to the inflight request
- `ProcessConsumerAssignment(ca)` — validate, create consumer, bind to stream
- `ProcessStreamRemoval(sa)` — clean up local resources, delete RAFT group
- `ProcessConsumerRemoval(ca)` — clean up consumer state
- `ProcessUpdateStreamAssignment(sa)` — apply config changes, handle replica scaling
2D: Inflight Tracking Enhancement
Change inflight maps from:

```csharp
ConcurrentDictionary<string, string> _inflightStreams;
```

To:

```csharp
record InflightInfo(int OpsCount, bool Deleted, StreamAssignment? Assignment);

ConcurrentDictionary<string, Dictionary<string, InflightInfo>> _inflightStreams; // account → stream → info
```

- `TrackInflightStreamProposal(account, stream, assignment)` — increment ops count, store assignment
- `RemoveInflightStreamProposal(account, stream)` — decrement, clear when zero
- Same pattern for consumers
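The ops-count bookkeeping reduces to a refcounted two-level map: track increments, remove decrements, and the entry disappears once nothing is in flight. A sketch (plain Go maps stand in for the concurrent dictionaries; names are illustrative):

```go
package main

import "fmt"

type inflightInfo struct {
	OpsCount int
	Deleted  bool
}

// inflight tracks account → stream → in-progress proposal counts.
type inflight map[string]map[string]*inflightInfo

// track increments the ops count for a stream proposal, creating the
// entry on first use.
func (f inflight) track(account, stream string) {
	streams, ok := f[account]
	if !ok {
		streams = map[string]*inflightInfo{}
		f[account] = streams
	}
	info, ok := streams[stream]
	if !ok {
		info = &inflightInfo{}
		streams[stream] = info
	}
	info.OpsCount++
}

// remove decrements the ops count and clears the entry once no
// proposals remain in flight, so duplicate-proposal checks stay exact.
func (f inflight) remove(account, stream string) {
	streams, ok := f[account]
	if !ok {
		return
	}
	info, ok := streams[stream]
	if !ok {
		return
	}
	info.OpsCount--
	if info.OpsCount <= 0 {
		delete(streams, stream)
	}
}

func main() {
	f := inflight{}
	f.track("$G", "ORDERS")
	f.track("$G", "ORDERS")
	f.remove("$G", "ORDERS")
	fmt.Println(f["$G"]["ORDERS"].OpsCount)
}
```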
2E: Leadership Transition
Add to JetStreamClusterMonitor:
- `ProcessLeaderChange(isLeader)`:
  - Clear all inflight proposal maps
  - If becoming leader: start update subscriptions, publish domain leader advisory
  - If losing leadership: stop update subscriptions
- `ProcessStreamLeaderChange(stream, isLeader)`:
  - Leader: start monitor, set sync subjects, begin replication
  - Follower: stop local delivery
- `ProcessConsumerLeaderChange(consumer, isLeader)`:
  - Leader: start delivery engine
  - Follower: stop delivery
2F: Per-Stream/Consumer Monitor Loops
New class: StreamMonitor — per-stream background task:
- Snapshot timing: 2-minute interval + random jitter
- State change detection via hash comparison
- Recovery timeout: escalate after N failures
- Compaction trigger after snapshot
New class: ConsumerMonitor — per-consumer background task:
- Hash-based state change detection
- Timeout tracking for stalled consumers
- Migration detection (R1 → R3 scale-up)
Phase 3: Consumer + Stream Engines
Dependencies: Phase 1 (storage), Phase 2 (cluster coordination)
Exit gate: Consumer delivery loop correctly delivers messages with redelivery, pull requests expire and respect batch/maxBytes, purge with subject filtering works, mirror/source retry recovers from failures.
3A: Ack/NAK Processing (moved before delivery loop per Codex)
Enhance AckProcessor.cs:
- Background subscription consuming from the consumer's ack subject
- Parse ack types:
  - `+ACK` — acknowledge message, remove from pending
  - `-NAK` — negative ack, optional `{delay}` for redelivery schedule
  - `-NAK {"delay": "2s", "backoff": [1,2,4]}` — with backoff schedule
  - `+TERM` — terminate, remove from pending, no redelivery
  - `+WPI` — work in progress, extend ack wait deadline
- Track `deliveryCount` per message sequence
- Max deliveries enforcement: advisory on exceed
Enhance RedeliveryTracker.cs:
- Replace internals with `PriorityQueue<ulong, DateTimeOffset>` (min-heap by deadline)
- `Schedule(seq, delay)` — insert with deadline
- `TryGetNextReady(now)` → sequence past deadline, or null
- `BackoffSchedule` support: delay array indexed by delivery count
- Rate limiting: configurable max redeliveries per tick
3B: Consumer Delivery Loop
New class: ConsumerDeliveryLoop — shared between push and pull:
- Background `Task` running while the consumer is active
- Main loop:
  - Check redelivery queue first (priority over new messages)
  - Poll `IStreamStore.LoadNextMsg()` for the next matching message
  - Check `MaxAckPending` — block if pending >= limit
  - Deliver message (push: to deliver subject; pull: to request reply subject)
  - Register in pending map with ack deadline
  - Update stats (delivered count, bytes)

Push mode integration:
- Delivers to `DeliverSubject` directly
- Idle heartbeat: sent when no messages arrive for `IdleHeartbeat` duration
- Flow control: `Nats-Consumer-Stalled` header when pending > threshold

Pull mode integration:
- Consumes from `WaitingRequestQueue`
- Delivers to the request's `ReplyTo` subject
- Respects `Batch`, `MaxBytes`, `NoWait` per request
3C: Pull Request Pipeline
New class: WaitingRequestQueue:
```csharp
record PullRequest(string ReplyTo, int Batch, long MaxBytes, DateTimeOffset Expires, bool NoWait, string? PinId);
```

- Internal `LinkedList<PullRequest>` ordered by arrival
- `Enqueue(request)` — adds, starts expiry timer
- `TryDequeue()` → next request with remaining capacity
- `RemoveExpired(now)` — periodic cleanup
- `Count`, `IsEmpty` properties
3D: Priority Group Pinning
Enhance PriorityGroupManager.cs:
- `AssignPinId()` → generates NUID string, starts TTL timer
- `ValidatePinId(pinId)` → checks validity
- `UnassignPinId()` → clears, publishes advisory
- `PinnedTtl` configurable timeout
- Integrates with `WaitingRequestQueue` for pin-aware dispatch
3E: Consumer Pause/Resume
Add to ConsumerManager.cs:
- `PauseUntil` field on consumer state
- `PauseAsync(until)`:
  - Set deadline, stop delivery loop
  - Publish `$JS.EVENT.ADVISORY.CONSUMER.PAUSED`
  - Start auto-resume timer
- `ResumeAsync()`:
  - Clear deadline, restart delivery loop
  - Publish resume advisory
3F: Mirror/Source Retry
Enhance MirrorCoordinator.cs:
- Retry loop: exponential backoff 1s → 30s with jitter
- `SetupMirrorConsumer()` → create ephemeral consumer on the source via the JetStream API
- Sequence tracking: `_sourceSeq` (last seen), `_destSeq` (last stored)
- Gap detection: if the source sequence jumps, log advisory
- `SetMirrorErr(err)` → publishes error state advisory
Enhance SourceCoordinator.cs:
- Same retry pattern
- Per-source consumer setup/teardown
- Subject filter + transform before storing
- Dedup window: `Dictionary<string, DateTimeOffset>` keyed by `Nats-Msg-Id`
- Time-based pruning of dedup window
3G: Stream Purge with Filtering
Enhance StreamManager.cs:
- `PurgeEx(subject, seq, keep)`:
  - `subject` filter: iterate messages, purge only those matching the subject pattern
  - `seq` range: purge messages with seq < specified
  - `keep`: retain the N most recent (purge all but the last N)
- After purge: cascade to consumers (clear pending for purged sequences)
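The `keep` and `subject` variants reduce to simple retention rules over the ordered message sequence; a sketch (a literal subject match stands in for NATS wildcard pattern matching, and an in-memory slice stands in for block iteration):

```go
package main

import "fmt"

type storedMsg struct {
	Seq     uint64
	Subject string
}

// purgeKeep retains only the n most recent messages (the "keep"
// variant of PurgeEx); msgs is assumed ordered by sequence.
func purgeKeep(msgs []storedMsg, keep int) []storedMsg {
	if keep >= len(msgs) {
		return msgs
	}
	return msgs[len(msgs)-keep:]
}

// purgeSubject removes messages on the given literal subject; a real
// implementation matches '*'/'>' wildcard patterns instead.
func purgeSubject(msgs []storedMsg, subject string) []storedMsg {
	out := make([]storedMsg, 0, len(msgs))
	for _, m := range msgs {
		if m.Subject != subject {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	msgs := []storedMsg{{1, "orders.new"}, {2, "orders.done"}, {3, "orders.new"}, {4, "orders.cancel"}}
	fmt.Println(len(purgeKeep(msgs, 2)))                // 2
	fmt.Println(len(purgeSubject(msgs, "orders.new"))) // 2
}
```

The consumer cascade then clears pending state for exactly the sequences these rules removed.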
3H: Interest Retention
New class: InterestRetentionPolicy:
- Track which consumers have interest in each message
- `RegisterInterest(consumerName, filterSubject)` — consumer subscribes
- `AcknowledgeDelivery(consumerName, seq)` — consumer acked
- `ShouldRetain(seq)` → `false` when all interested consumers have acked
- Periodic scan for removable messages
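The `ShouldRetain` semantics — a message becomes removable once every interested consumer has acked it — can be sketched as follows (filter subjects are ignored here for brevity; a real implementation only marks consumers whose filter matches the message subject):

```go
package main

import "fmt"

// interestState tracks, per message sequence, which registered
// consumers still owe an ack.
type interestState struct {
	consumers map[string]bool            // registered consumers
	unacked   map[uint64]map[string]bool // seq → consumers yet to ack
}

func newInterestState() *interestState {
	return &interestState{consumers: map[string]bool{}, unacked: map[uint64]map[string]bool{}}
}

func (s *interestState) registerInterest(consumer string) { s.consumers[consumer] = true }

// onStore marks every registered consumer as owing an ack for seq.
func (s *interestState) onStore(seq uint64) {
	owed := map[string]bool{}
	for c := range s.consumers {
		owed[c] = true
	}
	s.unacked[seq] = owed
}

// acknowledge records an ack from one consumer for one sequence.
func (s *interestState) acknowledge(consumer string, seq uint64) {
	delete(s.unacked[seq], consumer)
}

// shouldRetain reports whether any interested consumer still owes an ack.
func (s *interestState) shouldRetain(seq uint64) bool {
	return len(s.unacked[seq]) > 0
}

func main() {
	s := newInterestState()
	s.registerInterest("C1")
	s.registerInterest("C2")
	s.onStore(1)
	s.acknowledge("C1", 1)
	fmt.Println(s.shouldRetain(1)) // true: C2 has not acked yet
	s.acknowledge("C2", 1)
	fmt.Println(s.shouldRetain(1)) // false: all interest satisfied
}
```

The periodic scan then deletes every sequence for which `shouldRetain` is false.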
Phase 4: Client Performance, MQTT, Config, Networking
Dependencies: Phase 1 (storage for MQTT), Phase 3 (consumers for MQTT)
Exit gate: Client flush coalescing measurably reduces syscalls under load, MQTT sessions survive server restart, SIGHUP triggers config reload, implicit routes auto-discover.
4A: Client Flush Coalescing
Enhance NatsClient.cs:
- New fields: `int _flushSignalsPending`, `const int MaxFlushPending = 10`
- New field: `AsyncAutoResetEvent _flushSignal` — the write loop waits on this
- `QueueOutbound()`: after writing to the channel, increment `_flushSignalsPending`, signal
- `RunWriteLoopAsync()`: if `_flushSignalsPending < MaxFlushPending && _pendingBytes < MaxBufSize`, wait for the signal (coalesces)
- `NatsServer.FlushClients(HashSet<NatsClient>)` — signals all pending clients
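The coalescing idea — drain all pending flush signals, then write once — can be sketched with a channel standing in for `AsyncAutoResetEvent` (the drain-then-flush shape is illustrative; the .NET version additionally bounds coalescing by `MaxFlushPending` and `MaxBufSize`):

```go
package main

import "fmt"

// coalesce waits for at least one flush signal, drains every other
// signal that is already pending, and then performs a single flush —
// so N producer signals can collapse into one write syscall.
func coalesce(signals chan struct{}, flush func()) int {
	flushes := 0
	for range signals { // block until at least one signal arrives
		draining := true
		for draining {
			select {
			case _, ok := <-signals:
				draining = ok // stop if the channel closed
			default:
				draining = false // nothing else pending right now
			}
		}
		flush()
		flushes++
	}
	return flushes
}

func main() {
	signals := make(chan struct{}, 16)
	for i := 0; i < 10; i++ { // ten rapid-fire publishes, each signaling
		signals <- struct{}{}
	}
	close(signals)
	total := coalesce(signals, func() {})
	fmt.Println(total) // 10 signals coalesced into 1 flush
}
```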
4B: Client Stall Gate
Enhance NatsClient.cs:
- New field: `SemaphoreSlim? _stallGate`
- In `QueueOutbound()`:
  - If `_pendingBytes > _maxPending * 3 / 4` and `_stallGate == null`: create `new SemaphoreSlim(0, 1)`
  - If `_stallGate != null`: `await _stallGate.WaitAsync(timeout)`
- In `RunWriteLoopAsync()`:
  - After a successful flush below the threshold: `_stallGate?.Release()`, set `_stallGate = null`
- Per-kind: CLIENT → close on slow consumer; ROUTER/GATEWAY/LEAF → mark flag, keep alive
4C: Write Timeout Recovery
Enhance RunWriteLoopAsync():
- Track `bytesWritten` and `bytesAttempted` per flush cycle
- On write timeout:
  - If `Kind == CLIENT` → close immediately
  - If `bytesWritten > 0` (partial flush) → mark `isSlowConsumer`, continue
  - If `bytesWritten == 0` → close
- New enum: `WriteTimeoutPolicy { Close, TcpFlush }`
- Route/gateway/leaf connections use `TcpFlush` by default
4D: MQTT JetStream Persistence
Enhance MQTT subsystem:
Streams per account:
- `$MQTT_msgs` — message persistence (subject: `$MQTT.msgs.>`)
- `$MQTT_rmsgs` — retained messages (subject: `$MQTT.rmsgs.>`)
- `$MQTT_sess` — session state (subject: `$MQTT.sess.>`)
MqttSessionStore.cs changes:
- `ConnectAsync(clientId, cleanSession)`:
  - If `cleanSession=false`: query `$MQTT_sess` for an existing session, restore it
  - If `cleanSession=true`: delete any existing session from the stream
- `SaveSessionAsync()` — persist current state to `$MQTT_sess`
- `DisconnectAsync(abnormal)`:
  - If abnormal and a will is configured: publish the will message
  - If `cleanSession=false`: save session
MqttRetainedStore.cs changes:
- Backed by the `$MQTT_rmsgs` stream
- `SetRetained(topic, payload)` — store/update retained message
- `GetRetained(topicFilter)` — query matching retained messages
- On SUBSCRIBE: deliver matching retained messages
QoS tracking:
- QoS 1 pending acks stored in session stream
- On reconnect: redeliver unacked QoS 1 messages
- QoS 2 state machine backed by session stream
4E: SIGHUP Config Reload
New file: SignalHandler.cs in Configuration/:
```csharp
public static class SignalHandler
{
    public static void Register(ConfigReloader reloader, NatsServer server)
    {
        PosixSignalRegistration.Create(PosixSignal.SIGHUP, ctx =>
        {
            _ = reloader.ReloadAsync(server);
        });
    }
}
```
Enhance ConfigReloader.cs:
- New `ReloadAsync(NatsServer server)`:
  - Re-parse config file
  - `Diff(old, new)` for change list
  - Apply reloadable changes: logging, auth, TLS, limits
  - Reject non-reloadable changes with a warning
- Auth change propagation: iterate clients, re-evaluate, close unauthorized
- TLS reload: `server.ReloadTlsCertificate(newCert)`
4F: Implicit Route/Gateway Discovery
Enhance RouteManager.cs:
- `ProcessImplicitRoute(serverInfo)`:
  - Extract `connect_urls` from INFO
  - For each unknown URL: create a solicited connection
  - Track in `_discoveredRoutes` set to avoid duplicates
- `ForwardNewRouteInfoToKnownServers(newPeer)`:
  - Send updated INFO to all existing routes
  - Include the new peer in `connect_urls`
Enhance GatewayManager.cs:
- `ProcessImplicitGateway(gatewayInfo)`:
  - Extract gateway URLs from INFO
  - Create an outbound connection to the discovered gateway
- `GossipGatewaysToInboundGateway(conn)`:
  - Share known gateways with new inbound connections
Test Strategy
Per-Phase Test Requirements
Phase 1 tests:
- FileStore: encrypted block round-trip, compressed block round-trip, crash recovery with TTL, checksum validation on corrupt data, atomic write failure recovery
- IStreamStore: each missing method gets at least 2 test cases
- RAFT WAL: persist → restart → verify state, concurrent appends, compaction
- Joint consensus: add peer during normal operation, remove peer, add during leader change
Phase 2 tests:
- Snapshot encode → decode round-trip (golden test with known data)
- Monitor loop processes stream/consumer assignment entries
- Inflight tracking prevents duplicate proposals
- Leadership transition clears state correctly
Phase 3 tests:
- Consumer delivery loop delivers messages in order, respects MaxAckPending
- Pull request expiry, batch/maxBytes enforcement
- NAK with delay schedules redelivery at correct time
- Pause/resume stops and restarts delivery
- Mirror retry recovers after source failure
- Purge with subject filter only removes matching
Phase 4 tests:
- Flush coalescing: measure syscall count reduction under load
- Stall gate: producer blocks when client is slow
- MQTT session restore after restart
- SIGHUP triggers reload, auth changes disconnect unauthorized clients
- Implicit route discovery creates connections to unknown peers
Parity DB Updates
When porting Go tests, update docs/test_parity.db:
```sql
UPDATE go_tests SET status='mapped', dotnet_test='TestName', dotnet_file='File.cs'
WHERE go_test='GoTestName';
```
Risk Mitigation
- RAFT WAL correctness: Use fsync on every commit, validate on load, golden tests with known-good WAL files
- Joint consensus edge cases: Test split-brain scenarios with simulated partitions
- Snapshot format drift: Version header, backward compatibility checks, golden tests
- Async loop races: Use structured concurrency (CancellationToken chains), careful locking around inflight maps
- FileStore crash recovery: Intentional corruption tests (truncated blocks, zero-filled regions)
- MQTT persistence migration: Feature flag to toggle JetStream backing vs in-memory
Phase Dependencies
```
Phase 1A (FileStore blocks) ────────┐
Phase 1B (IStreamStore methods) ────┤
Phase 1C (RAFT WAL) ────────────────┼─→ Phase 2 (JetStream Cluster) ─→ Phase 3 (Consumer/Stream)
Phase 1D (RAFT joint consensus) ────┘                                          │
                                                                               ↓
Phase 4A-C (Client perf) ───────────────────────────── independent ───────────┤
Phase 4D (MQTT persistence) ─────────── depends on Phase 1 + Phase 3 ─────────┤
Phase 4E (SIGHUP reload) ───────────────────────────── independent ───────────┤
Phase 4F (Route/Gateway discovery) ─────────────────── independent ───────────┘
```
Phase 4A-C, 4E, and 4F are independent and can be parallelized with Phase 3. Phase 4D requires Phase 1 (storage) and Phase 3 (consumers) to be complete.