From f03e72654f922e02ac893854145a12c2d95e6d13 Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Wed, 25 Feb 2026 00:09:54 -0500 Subject: [PATCH] docs: fresh gap analysis and production gaps design MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Rewrote structuregaps.md with function-by-function gap analysis comparing current .NET implementation against Go reference. Added design doc for Tier 1+2 gap closure across 4 phases: FileStore+RAFT → JetStream Cluster → Consumer/Stream → Client/MQTT/Config. --- .../2026-02-25-production-gaps-design.md | 494 +++++++ docs/structuregaps.md | 1152 ++++++++++++----- 2 files changed, 1309 insertions(+), 337 deletions(-) create mode 100644 docs/plans/2026-02-25-production-gaps-design.md diff --git a/docs/plans/2026-02-25-production-gaps-design.md b/docs/plans/2026-02-25-production-gaps-design.md new file mode 100644 index 0000000..d272498 --- /dev/null +++ b/docs/plans/2026-02-25-production-gaps-design.md @@ -0,0 +1,494 @@ +# Design: Production Gap Closure (Tier 1 + Tier 2) + +**Date**: 2026-02-25 +**Scope**: 15 gaps across FileStore, RAFT, JetStream Cluster, Consumer/Stream, Client, MQTT, Config, Networking +**Approach**: Bottom-up by dependency (Phase 1→2→3→4) +**Codex review**: Verified 2026-02-25 — feedback incorporated + +## Overview + +This design covers all CRITICAL and HIGH severity gaps identified in the fresh scan of `docs/structuregaps.md` (2026-02-24). The implementation follows a bottom-up dependency approach where foundational layers (storage, consensus) are completed before higher-level features (cluster coordination, consumers, client protocol). + +### Codex Feedback (Incorporated) + +1. RAFT WAL is the hardest dependency — kept first in Phase 1 +2. Meta snapshot encode/decode moved earlier in Phase 2 (before monitor loops) +3. Ack/NAK processing moved before/alongside delivery loop — foundational to correctness +4. Exit gates added per phase with measurable criteria +5. 
Binary formats treated as versioned contracts with golden tests +6. Feature flags for major subsystems recommended + +--- + +## Phase 1: FileStore + RAFT Foundation + +**Dependencies**: None (pure infrastructure) +**Exit gate**: All IStreamStore methods implemented, FileStore round-trips encrypted+compressed blocks, RAFT persists and recovers state across restarts, joint consensus passes membership change tests. + +### 1A: FileStore Block Lifecycle & Encryption/Compression + +**Goal**: Wire AeadEncryptor and S2Codec into block I/O, add automatic rotation, crash recovery. + +**MsgBlock.cs changes:** +- New fields: `byte[]? _aek` (per-block AEAD key), `bool _compressed`, `byte[] _lastChecksum` +- `Seal()` — marks block read-only, releases write handle +- `EncryptOnDisk(AeadEncryptor, key)` — encrypts block data at rest +- `DecryptOnRead(AeadEncryptor, key)` — decrypts block data on load +- `CompressRecord(S2Codec, payload)` → `byte[]` — per-message compression +- `DecompressRecord(S2Codec, data)` → `byte[]` — per-message decompression +- `ComputeChecksum(data)` → `byte[8]` — highway hash +- `ValidateChecksum(data, expected)` → `bool` + +**FileStore.cs changes:** +- `AppendAsync()` → triggers `RotateBlock()` when `_activeBlock.BytesWritten >= _options.BlockSize` +- `RotateBlock()` — seals active block, creates new with `_nextBlockId++` +- Enhanced `RecoverBlocks()` — rebuild TTL wheel, scan tombstones, validate checksums +- `RecoverTtlState()` — scan all blocks, re-register unexpired messages in `_ttlWheel` +- `WriteStateAtomically(path, data)` — temp file + `File.Move(temp, target, overwrite: true)` +- `GenEncryptionKeys()` → per-block key derivation from stream encryption key +- `RecoverAek(blockId)` → recover per-block key from metadata file + +**New file: `HighwayHash.cs`** — .NET highway hash implementation or NuGet wrapper. + +**Go reference**: `filestore.go` lines 816–1407 (encryption), 1754–2401 (recovery), 4443–4477 (cache), 5783–5842 (flusher). 
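+
+The append/rotate path above can be sketched as follows. This is a sketch only: `AppendAsync`, `RotateBlock`, `Seal`, `BytesWritten`, `_options.BlockSize`, and `_nextBlockId` are the names proposed above, while `WriteAsync` and `CreateForWrite` are assumed shapes; real code also needs locking around the active block and a flush on seal.
+
+```csharp
+// Sketch — MsgBlock and _options are the types proposed above, not existing APIs.
+public async ValueTask AppendAsync(ReadOnlyMemory<byte> record, CancellationToken ct = default)
+{
+    if (_activeBlock.BytesWritten >= _options.BlockSize)
+        RotateBlock();                        // seal the full block, open a fresh one
+    await _activeBlock.WriteAsync(record, ct);
+}
+
+private void RotateBlock()
+{
+    _activeBlock.Seal();                      // read-only from here on; write handle released
+    _activeBlock = MsgBlock.CreateForWrite(_directory, _nextBlockId++);
+}
+```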
+ +### 1B: Missing IStreamStore Methods + +Implement 15+ methods currently throwing `NotSupportedException`: + +| Method | Strategy | Go ref | +|--------|----------|--------| +| `StoreRawMsg` | Append with caller seq/ts, bypass auto-increment | 4759 | +| `LoadNextMsgMulti` | Multi-filter scan across blocks | 8426 | +| `LoadPrevMsg` | Reverse scan from start sequence | 8580 | +| `LoadPrevMsgMulti` | Reverse multi-filter scan | 8634 | +| `NumPendingMulti` | Count with filter set | 4051 | +| `EncodedStreamState` | Binary serialize (versioned format) | 10599 | +| `Utilization` | bytes_used / bytes_allocated | 8758 | +| `FlushAllPending` | Force-flush active block to disk | — | +| `RegisterStorageUpdates` | Callback for change notifications | 4412 | +| `RegisterStorageRemoveMsg` | Callback for removal events | 4424 | +| `RegisterProcessJetStreamMsg` | Callback for JS processing | 4431 | +| `ResetState` | Clear caches post-NRG catchup | 8788 | +| `UpdateConfig` | Apply block size, retention, limits | 655 | +| `Delete(inline)` | Stop + delete all block files | — | +| `Stop()` | Flush + close all FDs | — | + +### 1C: RAFT Persistent WAL + +**RaftLog.cs changes:** +- Replace `List` backing with binary WAL file +- Record format: `[4-byte length][entry bytes]` (length-prefixed) +- Entry serialization: `[8-byte index][4-byte term][payload]` +- `Append()` → append to WAL file, increment in-memory index +- `SyncAsync()` → `FileStream.FlushAsync()` for durability +- `LoadAsync()` → scan WAL, rebuild in-memory index, validate +- `Compact(upToIndex)` → rewrite WAL from upToIndex+1 onward +- Version header at WAL file start for forward compatibility + +**RaftNode.cs changes:** +- Wire `_persistDirectory` into disk operations +- Persist term/vote in `meta.json`: `{ "term": N, "votedFor": "id" }` +- Startup: load WAL → load snapshot → apply entries after snapshot index +- `PersistTermAsync()` — called on term/vote changes + +**Go reference**: `raft.go` lines 1000–1100 (snapshot 
streaming), WAL patterns throughout.
+
+### 1D: RAFT Joint Consensus
+
+**RaftNode.cs changes:**
+- New field: `HashSet<string>? _jointNewMembers` — Cnew during transition
+- `ProposeAddPeer(peerId)`:
+  1. Propose joint entry (Cold ∪ Cnew) to log
+  2. Wait for commit
+  3. Propose final Cnew entry
+  4. Wait for commit
+- `ProposeRemovePeer(peerId)` — same two-phase pattern
+- `CalculateQuorum()`:
+  - Normal: `majority(_members)`
+  - Joint: `max(majority(Cold), majority(Cnew))`
+- `ApplyMembershipChange(entry)`:
+  - If joint entry: set `_jointNewMembers`
+  - If final entry: replace `_members` with Cnew, clear `_jointNewMembers`
+
+**Go reference**: `raft.go` Section 4 (Raft paper), membership change handling.
+
+---
+
+## Phase 2: JetStream Cluster Coordination
+
+**Dependencies**: Phase 1 (RAFT persistence, FileStore)
+**Exit gate**: Meta-group processes stream/consumer assignments via RAFT, snapshots encode/decode round-trip, leadership transitions handled correctly, orphan detection works.
+
+### 2A: Meta Snapshot Encoding/Decoding (moved earlier per Codex)
+
+**New file: `MetaSnapshotCodec.cs`**
+- `Encode(Dictionary<string, Dictionary<string, StreamAssignment>> assignments)` → `byte[]`
+- Uses S2 compression for wire efficiency
+- Versioned format header for compatibility
+- `Decode(byte[])` → assignment maps
+- Golden tests with known-good snapshots
+
+### 2B: Cluster Monitoring Loop
+
+**New class: `JetStreamClusterMonitor`** (extends `BackgroundService`)
+- Long-running `ExecuteAsync()` loop consuming `Channel` from meta RAFT node
+- `ApplyMetaEntry(entry)` → dispatch by op type:
+  - `assignStreamOp` → `ProcessStreamAssignment()`
+  - `assignConsumerOp` → `ProcessConsumerAssignment()`
+  - `removeStreamOp` → `ProcessStreamRemoval()`
+  - `removeConsumerOp` → `ProcessConsumerRemoval()`
+  - `updateStreamOp` → `ProcessUpdateStreamAssignment()`
+  - `EntrySnapshot` → `ApplyMetaSnapshot()`
+  - `EntryAddPeer`/`EntryRemovePeer` → peer management
+- Periodic snapshot: 1-minute interval, 15s on errors
+- Orphan check: 30s 
post-recovery scan
+- Recovery state machine: deferred add/remove during catchup
+
+### 2C: Stream/Consumer Assignment Processing
+
+**Enhance `JetStreamMetaGroup.cs`:**
+- `ProcessStreamAssignment(sa)` — validate config, check account limits, create stream on this node if member of replica group, respond to inflight request
+- `ProcessConsumerAssignment(ca)` — validate, create consumer, bind to stream
+- `ProcessStreamRemoval(sa)` — cleanup local resources, delete RAFT group
+- `ProcessConsumerRemoval(ca)` — cleanup consumer state
+- `ProcessUpdateStreamAssignment(sa)` — apply config changes, handle replica scaling
+
+### 2D: Inflight Tracking Enhancement
+
+**Change inflight maps from:**
+```csharp
+ConcurrentDictionary _inflightStreams
+```
+**To:**
+```csharp
+record InflightInfo(int OpsCount, bool Deleted, StreamAssignment? Assignment);
+ConcurrentDictionary<string, ConcurrentDictionary<string, InflightInfo>> _inflightStreams; // account → stream → info
+```
+- `TrackInflightStreamProposal(account, stream, assignment)` — increment ops, store
+- `RemoveInflightStreamProposal(account, stream)` — decrement, clear when zero
+- Same pattern for consumers
+
+### 2E: Leadership Transition
+
+**Add to `JetStreamClusterMonitor`:**
+- `ProcessLeaderChange(isLeader)`:
+  - Clear all inflight proposal maps
+  - If becoming leader: start update subscriptions, publish domain leader advisory
+  - If losing leadership: stop update subscriptions
+- `ProcessStreamLeaderChange(stream, isLeader)`:
+  - Leader: start monitor, set sync subjects, begin replication
+  - Follower: stop local delivery
+- `ProcessConsumerLeaderChange(consumer, isLeader)`:
+  - Leader: start delivery engine
+  - Follower: stop delivery
+
+### 2F: Per-Stream/Consumer Monitor Loops
+
+**New class: `StreamMonitor`** — per-stream background task:
+- Snapshot timing: 2-minute interval + random jitter
+- State change detection via hash comparison
+- Recovery timeout: escalate after N failures
+- Compaction trigger after snapshot
+
+**New class: `ConsumerMonitor`** — 
per-consumer background task: +- Hash-based state change detection +- Timeout tracking for stalled consumers +- Migration detection (R1 → R3 scale-up) + +--- + +## Phase 3: Consumer + Stream Engines + +**Dependencies**: Phase 1 (storage), Phase 2 (cluster coordination) +**Exit gate**: Consumer delivery loop correctly delivers messages with redelivery, pull requests expire and respect batch/maxBytes, purge with subject filtering works, mirror/source retry recovers from failures. + +### 3A: Ack/NAK Processing (moved before delivery loop per Codex) + +**Enhance `AckProcessor.cs`:** +- Background subscription consuming from consumer's ack subject +- Parse ack types: + - `+ACK` — acknowledge message, remove from pending + - `-NAK` — negative ack, optional `{delay}` for redelivery schedule + - `-NAK {"delay": "2s", "backoff": [1,2,4]}` — with backoff schedule + - `+TERM` — terminate, remove from pending, no redeliver + - `+WPI` — work in progress, extend ack wait deadline +- Track `deliveryCount` per message sequence +- Max deliveries enforcement: advisory on exceed + +**Enhance `RedeliveryTracker.cs`:** +- Replace internals with `PriorityQueue` (min-heap by deadline) +- `Schedule(seq, delay)` — insert with deadline +- `TryGetNextReady(now)` → sequence past deadline, or null +- `BackoffSchedule` support: delay array indexed by delivery count +- Rate limiting: configurable max redeliveries per tick + +### 3B: Consumer Delivery Loop + +**New class: `ConsumerDeliveryLoop`** — shared between push and pull: +- Background `Task` running while consumer is active +- Main loop: + 1. Check redelivery queue first (priority over new messages) + 2. Poll `IStreamStore.LoadNextMsg()` for next matching message + 3. Check `MaxAckPending` — block if pending >= limit + 4. Deliver message (push: to deliver subject; pull: to request reply subject) + 5. Register in pending map with ack deadline + 6. 
Update stats (delivered count, bytes) + +**Push mode integration:** +- Delivers to `DeliverSubject` directly +- Idle heartbeat: sends when no messages for `IdleHeartbeat` duration +- Flow control: `Nats-Consumer-Stalled` header when pending > threshold + +**Pull mode integration:** +- Consumes from `WaitingRequestQueue` +- Delivers to request's `ReplyTo` subject +- Respects `Batch`, `MaxBytes`, `NoWait` per request + +### 3C: Pull Request Pipeline + +**New class: `WaitingRequestQueue`:** +```csharp +record PullRequest(string ReplyTo, int Batch, long MaxBytes, DateTimeOffset Expires, bool NoWait, string? PinId); +``` +- Internal `LinkedList` ordered by arrival +- `Enqueue(request)` — adds, starts expiry timer +- `TryDequeue()` → next request with remaining capacity +- `RemoveExpired(now)` — periodic cleanup +- `Count`, `IsEmpty` properties + +### 3D: Priority Group Pinning + +**Enhance `PriorityGroupManager.cs`:** +- `AssignPinId()` → generates NUID string, starts TTL timer +- `ValidatePinId(pinId)` → checks validity +- `UnassignPinId()` → clears, publishes advisory +- `PinnedTtl` configurable timeout +- Integrates with `WaitingRequestQueue` for pin-aware dispatch + +### 3E: Consumer Pause/Resume + +**Add to `ConsumerManager.cs`:** +- `PauseUntil` field on consumer state +- `PauseAsync(until)`: + - Set deadline, stop delivery loop + - Publish `$JS.EVENT.ADVISORY.CONSUMER.PAUSED` + - Start auto-resume timer +- `ResumeAsync()`: + - Clear deadline, restart delivery loop + - Publish resume advisory + +### 3F: Mirror/Source Retry + +**Enhance `MirrorCoordinator.cs`:** +- Retry loop: exponential backoff 1s → 30s with jitter +- `SetupMirrorConsumer()` → create ephemeral consumer on source via JetStream API +- Sequence tracking: `_sourceSeq` (last seen), `_destSeq` (last stored) +- Gap detection: if source sequence jumps, log advisory +- `SetMirrorErr(err)` → publishes error state advisory + +**Enhance `SourceCoordinator.cs`:** +- Same retry pattern +- Per-source consumer 
setup/teardown +- Subject filter + transform before storing +- Dedup window: `Dictionary` keyed by Nats-Msg-Id +- Time-based pruning of dedup window + +### 3G: Stream Purge with Filtering + +**Enhance `StreamManager.cs`:** +- `PurgeEx(subject, seq, keep)`: + - `subject` filter: iterate messages, purge only matching subject pattern + - `seq` range: purge messages with seq < specified + - `keep`: retain N most recent (purge all but last N) +- After purge: cascade to consumers (clear pending for purged sequences) + +### 3H: Interest Retention + +**New class: `InterestRetentionPolicy`:** +- Track which consumers have interest in each message +- `RegisterInterest(consumerName, filterSubject)` — consumer subscribes +- `AcknowledgeDelivery(consumerName, seq)` — consumer acked +- `ShouldRetain(seq)` → `false` when all interested consumers have acked +- Periodic scan for removable messages + +--- + +## Phase 4: Client Performance, MQTT, Config, Networking + +**Dependencies**: Phase 1 (storage for MQTT), Phase 3 (consumer for MQTT) +**Exit gate**: Client flush coalescing measurably reduces syscalls under load, MQTT sessions survive server restart, SIGHUP triggers config reload, implicit routes auto-discover. + +### 4A: Client Flush Coalescing + +**Enhance `NatsClient.cs`:** +- New fields: `int _flushSignalsPending`, `const int MaxFlushPending = 10` +- New field: `AsyncAutoResetEvent _flushSignal` — write loop waits on this +- `QueueOutbound()`: after writing to channel, increment `_flushSignalsPending`, signal +- `RunWriteLoopAsync()`: if `_flushSignalsPending < MaxFlushPending && _pendingBytes < MaxBufSize`, wait for signal (coalesces) +- `NatsServer.FlushClients(HashSet)` — signals all pending clients + +### 4B: Client Stall Gate + +**Enhance `NatsClient.cs`:** +- New field: `SemaphoreSlim? 
_stallGate` +- In `QueueOutbound()`: + - If `_pendingBytes > _maxPending * 3 / 4` and `_stallGate == null`: create `new SemaphoreSlim(0, 1)` + - If `_stallGate != null`: `await _stallGate.WaitAsync(timeout)` +- In `RunWriteLoopAsync()`: + - After successful flush below threshold: `_stallGate?.Release()`, set `_stallGate = null` +- Per-kind: CLIENT → close on slow consumer; ROUTER/GATEWAY/LEAF → mark flag, keep alive + +### 4C: Write Timeout Recovery + +**Enhance `RunWriteLoopAsync()`:** +- Track `bytesWritten` and `bytesAttempted` per flush cycle +- On write timeout: + - If `Kind == CLIENT` → close immediately + - If `bytesWritten > 0` (partial flush) → mark `isSlowConsumer`, continue + - If `bytesWritten == 0` → close +- New enum: `WriteTimeoutPolicy { Close, TcpFlush }` +- Route/gateway/leaf use `TcpFlush` by default + +### 4D: MQTT JetStream Persistence + +**Enhance MQTT subsystem:** + +**Streams per account:** +- `$MQTT_msgs` — message persistence (subject: `$MQTT.msgs.>`) +- `$MQTT_rmsgs` — retained messages (subject: `$MQTT.rmsgs.>`) +- `$MQTT_sess` — session state (subject: `$MQTT.sess.>`) + +**MqttSessionStore.cs changes:** +- `ConnectAsync(clientId, cleanSession)`: + - If `cleanSession=false`: query `$MQTT_sess` for existing session, restore + - If `cleanSession=true`: delete existing session from stream +- `SaveSessionAsync()` — persist current state to `$MQTT_sess` +- `DisconnectAsync(abnormal)`: + - If abnormal and will configured: publish will message + - If `cleanSession=false`: save session + +**MqttRetainedStore.cs changes:** +- Backed by `$MQTT_rmsgs` stream +- `SetRetained(topic, payload)` — store/update retained message +- `GetRetained(topicFilter)` — query matching retained messages +- On SUBSCRIBE: deliver matching retained messages + +**QoS tracking:** +- QoS 1 pending acks stored in session stream +- On reconnect: redeliver unacked QoS 1 messages +- QoS 2 state machine backed by session stream + +### 4E: SIGHUP Config Reload + +**New file: 
`SignalHandler.cs`** in Configuration/:
+```csharp
+using System.Runtime.InteropServices;
+
+public static class SignalHandler
+{
+    // Keep the registration referenced; disposing (or collecting) it unregisters the handler.
+    private static PosixSignalRegistration? _sighup;
+
+    public static void Register(ConfigReloader reloader, NatsServer server)
+    {
+        _sighup = PosixSignalRegistration.Create(PosixSignal.SIGHUP, ctx =>
+        {
+            _ = reloader.ReloadAsync(server);
+        });
+    }
+}
+```
+
+**Enhance `ConfigReloader.cs`:**
+- New `ReloadAsync(NatsServer server)`:
+  - Re-parse config file
+  - `Diff(old, new)` for change list
+  - Apply reloadable changes: logging, auth, TLS, limits
+  - Reject non-reloadable with warning
+  - Auth change propagation: iterate clients, re-evaluate, close unauthorized
+- TLS reload: `server.ReloadTlsCertificate(newCert)`
+
+### 4F: Implicit Route/Gateway Discovery
+
+**Enhance `RouteManager.cs`:**
+- `ProcessImplicitRoute(serverInfo)`:
+  - Extract `connect_urls` from INFO
+  - For each unknown URL: create solicited connection
+  - Track in `_discoveredRoutes` set to avoid duplicates
+- `ForwardNewRouteInfoToKnownServers(newPeer)`:
+  - Send updated INFO to all existing routes
+  - Include new peer in `connect_urls`
+
+**Enhance `GatewayManager.cs`:**
+- `ProcessImplicitGateway(gatewayInfo)`:
+  - Extract gateway URLs from INFO
+  - Create outbound connection to discovered gateway
+- `GossipGatewaysToInboundGateway(conn)`:
+  - Share known gateways with new inbound connections
+
+---
+
+## Test Strategy
+
+### Per-Phase Test Requirements
+
+**Phase 1 tests:**
+- FileStore: encrypted block round-trip, compressed block round-trip, crash recovery with TTL, checksum validation on corrupt data, atomic write failure recovery
+- IStreamStore: each missing method gets at least 2 test cases
+- RAFT WAL: persist → restart → verify state, concurrent appends, compaction
+- Joint consensus: add peer during normal operation, remove peer, add during leader change
+
+**Phase 2 tests:**
+- Snapshot encode → decode round-trip (golden test with known data)
+- Monitor loop processes stream/consumer assignment entries
+- Inflight tracking prevents duplicate proposals
+- 
Leadership transition clears state correctly + +**Phase 3 tests:** +- Consumer delivery loop delivers messages in order, respects MaxAckPending +- Pull request expiry, batch/maxBytes enforcement +- NAK with delay schedules redelivery at correct time +- Pause/resume stops and restarts delivery +- Mirror retry recovers after source failure +- Purge with subject filter only removes matching + +**Phase 4 tests:** +- Flush coalescing: measure syscall count reduction under load +- Stall gate: producer blocks when client is slow +- MQTT session restore after restart +- SIGHUP triggers reload, auth changes disconnect unauthorized clients +- Implicit route discovery creates connections to unknown peers + +### Parity DB Updates + +When porting Go tests, update `docs/test_parity.db`: +```sql +UPDATE go_tests SET status='mapped', dotnet_test='TestName', dotnet_file='File.cs' +WHERE go_test='GoTestName'; +``` + +--- + +## Risk Mitigation + +1. **RAFT WAL correctness**: Use fsync on every commit, validate on load, golden tests with known-good WAL files +2. **Joint consensus edge cases**: Test split-brain scenarios with simulated partitions +3. **Snapshot format drift**: Version header, backward compatibility checks, golden tests +4. **Async loop races**: Use structured concurrency (CancellationToken chains), careful locking around inflight maps +5. **FileStore crash recovery**: Intentional corruption tests (truncated blocks, zero-filled regions) +6. 
**MQTT persistence migration**: Feature flag to toggle JetStream backing vs in-memory
+
+---
+
+## Phase Dependencies
+
+```
+Phase 1A (FileStore blocks) ─────────────────┐
+Phase 1B (IStreamStore methods) ─────────────┤
+Phase 1C (RAFT WAL) ─────────────────────────┼─→ Phase 2 (JetStream Cluster) ─→ Phase 3 (Consumer/Stream)
+Phase 1D (RAFT joint consensus) ─────────────┘                                          │
+                                                                                        ↓
+Phase 4A-C (Client perf) ───────────────────────────────────────────── independent ─────┤
+Phase 4D (MQTT persistence) ──────────────────── depends on Phase 1 + Phase 3 ──────────┤
+Phase 4E (SIGHUP reload) ──────────────────────────────────────────── independent ──────┤
+Phase 4F (Route/Gateway discovery) ─────────────────────────────────── independent ─────┘
+```
+
+Phase 4A-C, 4E, and 4F are independent and can be parallelized with Phase 3.
+Phase 4D requires Phase 1 (storage) and Phase 3 (consumers) to be complete.
diff --git a/docs/structuregaps.md b/docs/structuregaps.md
index 4861190..d8858ce 100644
--- a/docs/structuregaps.md
+++ b/docs/structuregaps.md
@@ -1,470 +1,948 @@
-# Implementation Gaps: Go NATS Server vs .NET Port
+# Implementation Gaps: Go NATS Server vs .NET Port
+
+**Last updated:** 2026-02-24 (fresh scan)
 
 ## Overview
 
 | Metric | Go Server | .NET Port | Ratio |
 |--------|-----------|-----------|-------|
-| Source lines | ~130K (109 files) | ~27K (192 files) | 4.8x |
+| Source lines | ~130K (109 files) | ~38.8K (208 files) | 3.4x |
 | Test functions | 2,937 `func Test*` | 3,168 `[Fact]`/`[Theory]` | 0.93x |
-| Tests passing | 2,937 | 3,501 (incl. parameterized) | 1.19x |
+| Tests mapped | 2,646 (90.1%) | — | — |
-| Largest source file | filestore.go (12,593 lines) | NatsServer.cs (1,739 lines) | 7.2x |
+| Largest source file | filestore.go (12,593 lines) | NatsServer.cs (1,883 lines) | 6.7x |
 
-The .NET port has more test methods than the Go server but covers less behavioral surface area. 
Many .NET tests are unit tests against stubs, mock coordinators, or simulated cluster fixtures, while Go tests are integration tests that start real servers, establish real TCP connections, and exercise full protocol paths.
-
-The .NET codebase is split across more files (192 vs 109) following .NET conventions of one-class-per-file, but each file is substantially smaller. The 4.8x line count gap understates the functional gap because the Go code is denser (less boilerplate per line of logic than C#).
+The .NET port has grown significantly from the prior baseline (~27K→~38.8K lines), and all 15 originally identified gap areas have seen substantial implementation growth. However, many specific features within each subsystem remain unimplemented. This document catalogs the remaining gaps based on a function-by-function comparison.
 
 ## Gap Severity Legend
 
 - **CRITICAL**: Core functionality missing, blocks production use
 - **HIGH**: Important feature incomplete, limits capabilities significantly
 - **MEDIUM**: Feature gap exists but workarounds are possible or feature is less commonly used
+- **LOW**: Nice-to-have, minor optimization or edge-case handling
 
 ---
 
 ## Gap 1: FileStore Block Management (CRITICAL)
 
-**Go reference**: `filestore.go` -- 12,593 lines, ~300 functions
-**NET implementation**: `FileStore.cs` -- 607 lines
-**Gap factor**: 20.8x
+**Go reference**: `filestore.go` — 12,593 lines
+**NET implementation**: `FileStore.cs` (1,633) + `MsgBlock.cs` (630) + supporting files
+**Current .NET total**: ~5,196 lines in Storage/
+**Gap factor**: 2.4x (down from 20.8x)
 
-The Go FileStore is the most complex single file in the server. It implements a block-based persistent storage engine with S2 compression, AEAD encryption (ChaCha20-Poly1305 / AES-GCM), crash recovery, and TTL scheduling. The .NET FileStore is a basic JSONL append-only store that satisfies the `IStreamStore` interface but lacks the performance and durability characteristics required for production use. 
+The .NET FileStore has grown substantially with block-based storage, but the block lifecycle, encryption/compression integration, crash recovery, and integrity checking remain incomplete. -**Missing features**: -- Block-based I/O: Go organizes messages into 65KB+ blocks with per-block indexes, allowing efficient range reads and compaction. The .NET store appends to a single file. -- Encryption key management: `genEncryptionKeys`, `recoverAEK`, per-block encryption with key rotation. The .NET `AeadEncryptor` exists (165 lines) but is not integrated into the file store's block lifecycle. -- S2 compression on write / decompression on read: The .NET `S2Codec` exists (111 lines) but is not wired into the storage path. -- Message block state recovery: Go recovers block state from corrupted or partially-written files on startup, rebuilding indexes from raw data. The .NET store has no recovery path. -- Tombstone and deletion tracking: Go tracks deleted messages as tombstones within blocks, allowing sparse sequence sets. The .NET store does not support sparse sequences. -- TTL and message scheduling recovery: Go uses time hash wheels to schedule message expiration and recovers pending expirations on restart. -- Checksum validation: Go computes checksums per message and per block for data integrity verification. -- Atomic file overwrites: Go uses rename-based atomic writes for crash safety. The .NET store does not. -- Multi-block write cache: Go maintains an in-memory write cache with configurable block count limits. 
+### 1.1 Block Rotation & Lifecycle (CRITICAL) -**Related .NET files**: `FileStore.cs` (607), `IStreamStore.cs` (172), `AeadEncryptor.cs` (165), `S2Codec.cs` (111), `StreamState.cs` (78), `MemStore.cs` (160) -**Total .NET lines in storage**: ~1,293 -**Go equivalent**: `filestore.go` (12,593) + `memstore.go` (2,700) + `store.go` (791) = 16,084 lines +**Missing Go functions:** +- `initMsgBlock()` (line 1046) — initializes new block with metadata +- `recoverMsgBlock()` (line 1148) — recovers block from disk with integrity checks +- `newMsgBlockForWrite()` (line 4485) — creates writable block with locking +- `spinUpFlushLoop()` + `flushLoop()` (lines 5783–5842) — background flusher goroutine +- `enableForWriting()` (line 6622) — opens FD for active block with gating semaphore (`dios`) +- `closeFDs()` / `closeFDsLocked()` (lines 6843–6849) — explicit FD management +- `shouldCompactInline()` (line 5553) — inline compaction heuristics +- `shouldCompactSync()` (line 5561) — sync-time compaction heuristics + +**.NET status**: Basic block creation/recovery exists but no automatic rotation on size thresholds, no block sealing, no per-block metadata files, no background flusher. + +### 1.2 Encryption Integration into Block I/O (CRITICAL) + +Go encrypts block files on disk using per-block AEAD keys. The .NET `AeadEncryptor` exists as a standalone utility but is NOT wired into the block write/read path. 
+ +**Missing Go functions:** +- `genEncryptionKeys()` (line 816) — generates AEAD + stream cipher per block +- `recoverAEK()` (line 878) — recovers encryption key from metadata +- `setupAEK()` (line 907) — sets up encryption for the stream +- `loadEncryptionForMsgBlock()` (line 1078) — loads per-block encryption state +- `checkAndLoadEncryption()` (line 1068) — verifies encrypted block integrity +- `ensureLastChecksumLoaded()` (line 1139) — validates block checksum +- `convertCipher()` (line 1318) — re-encrypts block with new cipher +- `convertToEncrypted()` (line 1407) — converts plaintext block to encrypted + +**Impact**: Go provides end-to-end encryption of block files on disk. .NET stores block files in plaintext. + +### 1.3 S2 Compression in Block Path (CRITICAL) + +Go compresses payloads at the message level during block write and decompresses on read. .NET's `S2Codec` is standalone only. + +**Missing Go functions:** +- `setupWriteCache()` (line 4443) — initializes compression-aware write cache +- Block-level compression/decompression during loadMsgs/flushPendingMsgs + +**Impact**: Go block files are compressed (smaller disk footprint). .NET block files are uncompressed. 
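+
+The per-message scheme can be as small as a one-byte flag so incompressible payloads are stored raw. A sketch — the method names are illustrative and `S2Codec`'s `Compress`/`Decompress` signatures are assumed:
+
+```csharp
+// Sketch: _s2 is the existing standalone S2Codec; API shape assumed.
+byte[] EncodeRecord(byte[] payload)
+{
+    byte[] packed = _s2.Compress(payload);
+    return packed.Length < payload.Length
+        ? [1, .. packed]        // flag 1: S2-compressed
+        : [0, .. payload];      // flag 0: stored raw (compression did not help)
+}
+
+byte[] DecodeRecord(byte[] stored) =>
+    stored[0] == 1 ? _s2.Decompress(stored[1..]) : stored[1..];
+```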
+ +### 1.4 Crash Recovery & Index Rebuilding (CRITICAL) + +**Missing Go functions:** +- `recoverFullState()` (line 1754) — full recovery from blocks with lost-data tracking +- `recoverTTLState()` (line 2042) — recovers TTL wheel state post-crash +- `recoverMsgSchedulingState()` (line 2123) — recovers scheduled message delivery post-crash +- `recoverMsgs()` (line 2263) — scans all blocks and rebuilds message index +- `rebuildState()` (line 1446) — per-block state reconstruction +- `rebuildStateFromBufLocked()` (line 1493) — parses raw block buffer to rebuild state +- `expireMsgsOnRecover()` (line 2401) — enforces MaxAge on recovery +- `writeTTLState()` (line 10833) — persists TTL state to disk + +**.NET status**: `FileStore.RecoverBlocks()` exists but is minimal — no TTL wheel recovery, no scheduling recovery, no lost-data tracking, no tombstone cleanup. + +### 1.5 Checksum Validation (HIGH) + +Go uses highway hash for per-message checksums and validates integrity on read. + +**Missing:** +- `msgBlock.lchk[8]byte` — last checksum buffer +- `msgBlock.hh *highwayhash.Digest64` — highway hasher +- `msgBlock.lastChecksum()` (line 2204) — returns last computed checksum +- Message checksum validation during `msgFromBufEx()` (line ~8180) + +**.NET status**: `MessageRecord` has no checksum field. No integrity checks on read. + +### 1.6 Atomic File Overwrites (HIGH) + +Go uses temp file + rename for crash-safe state updates. + +**Missing:** +- `_writeFullState()` (line 10599) — atomic state serialization with temp file + rename +- State file write protection with mutual exclusion (`wfsmu`/`wfsrun`) + +**.NET status**: Writes directly without atomic guarantees. 
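+
+The temp-file + rename pattern is a small, self-contained fix; a sketch (on the same volume, `File.Move` with `overwrite: true` replaces the target in one step, so readers see either the old or the new state, never a torn write):
+
+```csharp
+static void WriteStateAtomically(string path, byte[] data)
+{
+    string temp = path + ".tmp";
+    using (var fs = new FileStream(temp, FileMode.Create, FileAccess.Write))
+    {
+        fs.Write(data);
+        fs.Flush(flushToDisk: true);          // fsync before publishing the rename
+    }
+    File.Move(temp, path, overwrite: true);   // atomic replace of the state file
+}
+```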
+ +### 1.7 Tombstone & Deletion Tracking (HIGH) + +**Missing:** +- `removeMsg()` (line 5267) — removes message with optional secure erase +- `removeMsgFromBlock()` (line 5307) — removes from specific block +- `eraseMsg()` (line 5890) — overwrites deleted message with random data (secure erase) +- Sparse `avl.SequenceSet` for deletion tracking (Go uses AVL; .NET uses `HashSet`) +- Tombstone record persistence (deleted messages lost on crash in .NET) + +### 1.8 Multi-Block Write Cache (MEDIUM) + +**Missing Go functions:** +- `setupWriteCache()` (line 4443) — sets up write cache +- `finishedWithCache()` (line 4477) — marks cache complete +- `expireCache()` / `expireCacheLocked()` (line 6148+) — expires idle cache +- `resetCacheExpireTimer()` / `startCacheExpireTimer()` — manages cache timeout +- Elastic reference machinery (line 6704, `ecache.Strengthen()` / `WeakenAfter()`) +- `flushPendingMsgs()` / `flushPendingMsgsLocked()` (line 7560+) — background flush + +**.NET status**: Synchronous flush on rotation; no background flusher, no cache expiration. 
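+
+A background flusher in the spirit of Go's `flushLoop` could be sketched with a channel (`MsgBlock.FlushAsync` is an assumed method; a single reader keeps flushes ordered and off the append hot path):
+
+```csharp
+using System.Threading.Channels;
+
+sealed class BlockFlusher
+{
+    private readonly Channel<MsgBlock> _dirty =
+        Channel.CreateUnbounded<MsgBlock>(new UnboundedChannelOptions { SingleReader = true });
+
+    public void MarkDirty(MsgBlock block) => _dirty.Writer.TryWrite(block);
+
+    public async Task RunAsync(CancellationToken ct)
+    {
+        await foreach (var block in _dirty.Reader.ReadAllAsync(ct))
+            await block.FlushAsync(ct);       // coalesced background writes
+    }
+}
+```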
+ +### 1.9 Missing IStreamStore Interface Methods (CRITICAL) + +| Method | Go Equivalent | .NET Status | +|--------|---------------|-------------| +| `StoreRawMsg` | filestore.go:4759 | Missing | +| `LoadNextMsgMulti` | filestore.go:8426 | Missing | +| `LoadPrevMsg` | filestore.go:8580 | Missing | +| `LoadPrevMsgMulti` | filestore.go:8634 | Missing | +| `NumPendingMulti` | filestore.go:4051 | Missing | +| `SyncDeleted` | — | Missing | +| `EncodedStreamState` | filestore.go:10599 | Missing | +| `Utilization` | filestore.go:8758 | Missing | +| `FlushAllPending` | — | Missing | +| `RegisterStorageUpdates` | filestore.go:4412 | Missing | +| `RegisterStorageRemoveMsg` | filestore.go:4424 | Missing | +| `RegisterProcessJetStreamMsg` | filestore.go:4431 | Missing | +| `ResetState` | filestore.go:8788 | Missing | +| `UpdateConfig` | filestore.go:655 | Stubbed | +| `Delete(bool)` | — | Incomplete | +| `Stop()` | — | Incomplete | + +### 1.10 Query/Filter Operations (MEDIUM) + +Go uses trie-based subject indexing (`SubjectTree[T]`) for O(log n) lookups. .NET uses linear LINQ scans. + +**Missing:** +- `FilteredState()` (line 3191) — optimized filtered state with caching +- `checkSkipFirstBlock()` family (lines 3241–3287) +- `numFiltered*()` family (lines 3308–3413) +- `LoadMsg()` (line 8308) — block-aware message loading --- ## Gap 2: JetStream Cluster Coordination (CRITICAL) -**Go reference**: `jetstream_cluster.go` -- 10,887 lines -**NET implementation**: `JetStreamMetaGroup.cs` -- 51 lines -**Gap factor**: 213x +**Go reference**: `jetstream_cluster.go` — 10,887 lines +**NET implementation**: `JetStreamMetaGroup.cs` (454) + `StreamReplicaGroup.cs` (300) + supporting files +**Current .NET total**: ~2,787 lines in Cluster/ + Raft/ +**Gap factor**: ~3.9x +**Missing functions**: ~84 -This is the largest single-file gap in the project. The Go `jetstream_cluster.go` implements the meta-controller that coordinates stream and consumer placement across a RAFT cluster. 
The .NET implementation is a 51-line stub. +The .NET implementation provides a skeleton suitable for unit testing but lacks the real cluster coordination machinery. The Go implementation is deeply integrated with RAFT, NATS subscriptions for inter-node communication, and background monitoring goroutines. -**Missing features**: -- `streamAssignment` and `consumerAssignment` tracking with RAFT proposal workflow -- Inflight request deduplication: `inflightStreamInfo`, `inflightConsumerInfo` prevent duplicate operations during leader transitions -- Peer remove and stream move operations with data rebalancing -- Leader-based stream and consumer placement with topology awareness (unique nodes, tags, clusters) -- Assignment validation: checking resource limits, storage availability, and replica count constraints before accepting -- Replica group management: `StreamReplicaGroup.cs` exists (91 lines) but lacks the coordination logic -- Step-down and leadership transfer for maintenance operations -- Cross-cluster gateway awareness for super-cluster stream placement +### 2.1 Cluster Monitoring Loop (CRITICAL) -**Related .NET files**: `JetStreamMetaGroup.cs` (51), `StreamReplicaGroup.cs` (91), `JetStreamService.cs` (148) -**Total .NET lines in cluster coordination**: ~290 +The main orchestration loop that drives cluster state transitions. 
+ +**Missing Go functions:** +- `monitorCluster()` (lines 1455–1825, 370 lines) — processes metadata changes from RAFT ApplyQ +- `checkForOrphans()` (lines 1378–1432) — identifies unmatched streams/consumers after recovery +- `getOrphans()` (lines 1433–1454) — scans all accounts for orphaned assets +- `checkClusterSize()` (lines 1828–1869) — detects mixed-mode clusters, adjusts bootstrap size +- `isStreamCurrent()`, `isStreamHealthy()`, `isConsumerHealthy()` (lines 582–679) +- `subjectsOverlap()` (lines 751–767) — prevents subject collisions +- Recovery state machine (`recoveryUpdates` struct, lines 1332–1366) + +### 2.2 Stream & Consumer Assignment Processing (CRITICAL) + +**Missing Go functions:** +- `processStreamAssignment()` (lines 4541–4647, 107 lines) +- `processUpdateStreamAssignment()` (lines 4650–4768) +- `processStreamRemoval()` (lines 5216–5265) +- `processConsumerAssignment()` (lines 5363–5542, 180 lines) +- `processConsumerRemoval()` (lines 5543–5599) +- `processClusterCreateStream()` (lines 4944–5215, 272 lines) +- `processClusterDeleteStream()` (lines 5266–5362, 97 lines) +- `processClusterCreateConsumer()` (lines 5600–5855, 256 lines) +- `processClusterDeleteConsumer()` (lines 5856–5925) +- `processClusterUpdateStream()` (lines 4798–4943, 146 lines) + +### 2.3 Inflight Request Deduplication (HIGH) + +**Missing:** +- `trackInflightStreamProposal()` (lines 1193–1210) +- `removeInflightStreamProposal()` (lines 1214–1230) +- `trackInflightConsumerProposal()` (lines 1234–1257) +- `removeInflightConsumerProposal()` (lines 1260–1278) +- Full inflight tracking with ops count, deleted flag, and assignment capture + +### 2.4 Peer Management & Stream Moves (HIGH) + +**Missing:** +- `processAddPeer()` (lines 2290–2340) +- `processRemovePeer()` (lines 2342–2393) +- `removePeerFromStream()` / `removePeerFromStreamLocked()` (lines 2396–2439) +- `remapStreamAssignment()` (lines 7077–7111) +- Stream move operations in `jsClusteredStreamUpdateRequest()` (lines 
7757+) + +### 2.5 Leadership Transition (HIGH) + +**Missing:** +- `processLeaderChange()` (lines 7001–7074) — meta-level leader change +- `processStreamLeaderChange()` (lines 4262–4458, 197 lines) — stream-level +- `processConsumerLeaderChange()` (lines 6622–6793, 172 lines) — consumer-level +- Step-down mechanisms: `JetStreamStepdownStream()`, `JetStreamStepdownConsumer()` + +### 2.6 Snapshot & State Recovery (HIGH) + +**Missing:** +- `metaSnapshot()` (line 1890) — triggers meta snapshot +- `encodeMetaSnapshot()` (lines 2075–2145) — binary+S2 compression +- `decodeMetaSnapshot()` (lines 2031–2074) — reverse of encode +- `applyMetaSnapshot()` (lines 1897–2030, 134 lines) — computes deltas and applies +- `collectStreamAndConsumerChanges()` (lines 2146–2240) +- Per-stream `monitorStream()` (lines 2895–3700, 800+ lines) +- Per-consumer `monitorConsumer()` (lines 6081–6350, 270+ lines) + +### 2.7 Entry Application Pipeline (HIGH) + +**Missing:** +- `applyMetaEntries()` (lines 2474–2609, 136 lines) — handles all entry types +- `applyStreamEntries()` (lines 3645–4067, 423 lines) — per-stream entries +- `applyStreamMsgOp()` (lines 4068–4261, 194 lines) — per-message ops +- `applyConsumerEntries()` (lines 6351–6621, 271 lines) — per-consumer entries + +### 2.8 Topology-Aware Placement (MEDIUM) + +**Missing from PlacementEngine (currently 80 lines vs Go's 312 lines for `selectPeerGroup()`):** +- Unique tag enforcement (`JetStreamUniqueTag`) +- HA asset limits per peer +- Dynamic available storage calculation +- Tag inclusion/exclusion with prefix handling +- Weighted node selection based on availability +- Mixed-mode detection +- `tieredStreamAndReservationCount()` (lines 7524–7546) +- `createGroupForConsumer()` (lines 8783–8923, 141 lines) +- `jsClusteredStreamLimitsCheck()` (lines 7599–7618) + +### 2.9 RAFT Group Creation & Lifecycle (MEDIUM) + +**Missing:** +- `createRaftGroup()` (lines 2659–2789, 131 lines) +- `raftGroup.isMember()` (lines 2611–2621) +- 
`raftGroup.setPreferred()` (lines 2623–2656) + +### 2.10 Assignment Encoding/Decoding (MEDIUM) + +**Missing (no serialization for RAFT persistence):** +- `encodeAddStreamAssignment()`, `encodeUpdateStreamAssignment()`, `encodeDeleteStreamAssignment()` +- `decodeStreamAssignment()`, `decodeStreamAssignmentConfig()` +- `encodeAddConsumerAssignment()`, `encodeDeleteConsumerAssignment()` +- `encodeAddConsumerAssignmentCompressed()`, `decodeConsumerAssignment()` +- `decodeConsumerAssignmentCompressed()`, `decodeConsumerAssignmentConfig()` + +### 2.11 Unsupported Asset Handling (LOW) + +**Missing:** +- `unsupportedStreamAssignment` type (lines 186–247) — graceful handling of version-incompatible streams +- `unsupportedConsumerAssignment` type (lines 268–330) + +### 2.12 Clustered API Handlers (MEDIUM) + +**Missing:** +- `jsClusteredStreamRequest()` (lines 7620–7701, 82 lines) +- `jsClusteredStreamUpdateRequest()` (lines 7757–8265, 509 lines) +- System-level subscriptions for result processing, peer removal, leadership step-down --- -## Gap 3: Stream Mirrors, Sources & Subject Transforms (HIGH) +## Gap 3: Consumer Delivery Engines (HIGH) -**Go reference**: `stream.go` -- 8,072 lines, 80+ functions -**NET implementation**: `StreamManager.cs` (436) + `MirrorCoordinator.cs` (22) + `SourceCoordinator.cs` (36) -**Gap factor**: 16.3x +**Go reference**: `consumer.go` — 6,715 lines, 209 functions +**NET implementation**: `PushConsumerEngine.cs` (264) + `PullConsumerEngine.cs` (411) + `AckProcessor.cs` (225) + `PriorityGroupManager.cs` (102) + `RedeliveryTracker.cs` (92) + `ConsumerManager.cs` (198) +**Current .NET total**: ~1,292 lines +**Gap factor**: 5.2x -The Go `stream.go` handles the full lifecycle of streams including mirror synchronization, source consumption, deduplication, subject transforms, and retention policy enforcement. The .NET `StreamManager` covers basic CRUD and message storage/retrieval but the mirror and source coordinators are stubs. 
+### 3.1 Core Message Delivery Loop (CRITICAL) -**Missing features**: -- `processMirrorMsgs`: continuous mirror synchronization loop that pulls messages from the source stream and applies them locally -- `setupMirrorConsumer` / `setupSourceConsumer`: creation of ephemeral internal consumers that track position in the source -- Source and mirror retry logic with exponential backoff and jitter -- Flow control reply handling for pull-based source consumption -- Subject transforms with regex matching (Go `subject_transform.go` is 688 lines; .NET `SubjectTransform.cs` is 708 lines -- this is actually well-ported) -- Deduplication window maintenance with `Nats-Msg-Id` header tracking -- Mirror/source error state tracking, health reporting, and automatic recovery -- Stream assignment awareness for determining whether a stream should be active on this node in cluster mode -- Purge operations with subject filtering and sequence-based purge -- Stream snapshot and restore for backup/migration +**Missing**: `loopAndGatherMsgs()` (Go lines ~1400–1700) — the main consumer dispatch loop that: +- Polls stream store for new messages matching filter +- Applies redelivery logic +- Handles num_pending calculations +- Manages delivery interest tracking +- Responds to stream updates -**Related .NET files**: `StreamManager.cs` (436), `MirrorCoordinator.cs` (22), `SourceCoordinator.cs` (36), `SubjectTransform.cs` (708), `StreamApiHandlers.cs` (351) -**Total .NET lines in stream management**: ~1,553 +### 3.2 Pull Request Pipeline (CRITICAL) + +**Missing**: `processNextMsgRequest()` (Go lines ~4276–4450) — full pull request handler with: +- Waiting request queue management (`waiting` list with priority) +- Max bytes accumulation and flow control +- Request expiry handling +- Batch size enforcement +- Pin ID assignment from priority groups + +### 3.3 Inbound Ack/NAK Processing Loop (HIGH) + +**Missing**: `processInboundAcks()` (Go line ~4854) — background goroutine that: +- Reads from ack 
subject subscription +- Parses `+ACK`, `-NAK`, `+TERM`, `+WPI` frames +- Dispatches to appropriate handlers +- Manages ack deadlines and redelivery scheduling + +### 3.4 Redelivery Scheduler (HIGH) + +**Missing**: Redelivery queue (`rdq`) as a min-heap/priority queue. `RedeliveryTracker` only tracks deadlines, doesn't: +- Order redeliveries by deadline +- Batch redelivery dispatches efficiently +- Support per-sequence redelivery state +- Handle redelivery rate limiting + +### 3.5 Idle Heartbeat & Flow Control (HIGH) + +**Missing**: +- `sendIdleHeartbeat()` (Go line ~5222) — heartbeat with pending counts and stall headers +- `sendFlowControl()` (Go line ~5495) — dedicated flow control handler +- Flow control reply generation with pending counts + +### 3.6 Priority Group Pinning (HIGH) + +`PriorityGroupManager` exists (102 lines) but missing: +- Pin ID (NUID) generation and assignment (`setPinnedTimer`, `assignNewPinId`, `unassignPinId`) +- Pin timeout timers (`pinnedTtl`) +- `Nats-Pin-Id` header response +- Advisory messages on pin/unpin events +- Priority group state persistence + +### 3.7 Pause/Resume State Management (HIGH) + +**Missing**: +- `PauseUntil` deadline tracking +- Pause expiry timer +- Pause advisory generation (`$JS.EVENT.ADVISORY.CONSUMER.PAUSED`) +- Resume/unpause event generation +- Pause state in consumer info response + +### 3.8 Delivery Interest Tracking (HIGH) + +**Missing**: Dynamic delivery interest monitoring: +- Subject interest change tracking +- Gateway interest checks +- Subscribe/unsubscribe tracking for delivery subject +- Interest-driven consumer cleanup (`deleteNotActive`) + +### 3.9 Max Deliveries Enforcement (MEDIUM) + +`AckProcessor` checks basic threshold but missing: +- Advisory event generation on exceeding max delivers +- Per-delivery-count policy selection +- `NotifyDeliveryExceeded` advisory + +### 3.10 Filter Subject Skip Tracking (MEDIUM) + +**Missing**: Efficient filter matching with compiled regex, skip list tracking, 
`UpdateSkipped` state updates. + +### 3.11 Sample/Observe Mode (MEDIUM) + +**Missing**: Sample frequency parsing (`"1%"`), stochastic sampling, latency measurement, latency advisory generation. + +### 3.12 Reset to Sequence (MEDIUM) + +**Missing**: `processResetReq()` (Go line ~4241) — consumer state reset to specific sequence. + +### 3.13 Rate Limiting (MEDIUM) + +`PushConsumerEngine` has basic rate limiting but missing accurate token bucket, dynamic updates, per-message delay calculation. + +### 3.14 Cluster-Aware Pending Requests (LOW) + +**Missing**: Pull request proposals to RAFT, cluster-wide pending request tracking. --- -## Gap 4: Consumer Delivery Engines (HIGH) +## Gap 4: Stream Mirrors, Sources & Lifecycle (HIGH) -**Go reference**: `consumer.go` -- 6,715 lines, 130+ functions -**NET implementation**: `PushConsumerEngine.cs` (67) + `PullConsumerEngine.cs` (169) + `ConsumerManager.cs` (198) + `AckProcessor.cs` (72) -**Gap factor**: 13.3x +**Go reference**: `stream.go` — 8,072 lines, 193 functions +**NET implementation**: `StreamManager.cs` (644) + `MirrorCoordinator.cs` (364) + `SourceCoordinator.cs` (470) +**Current .NET total**: ~1,478 lines +**Gap factor**: 5.5x -The Go consumer engine is a full state machine handling delivery, acknowledgment tracking, redelivery, rate limiting, and priority group management. The .NET engines are minimal wrappers that deliver messages but lack the stateful tracking required for production consumer behavior. 
+### 4.1 Mirror Consumer Setup & Retry (HIGH) -**Missing features**: -- Priority group pinning and flow: `setPinnedTimer`, `assignNewPinId` for sticky consumer assignment within priority groups -- Pending request queue with waiting consumer pool management for pull consumers -- NAK and redelivery tracking with configurable exponential backoff schedules -- Pause state management with advisory event publication -- Redelivery rate limiting per consumer to prevent thundering herd on redelivery -- Filter subject skip tracking for filtered consumers (skipping messages that do not match the filter without counting them against pending limits) -- Idle heartbeat generation and flow control acknowledgment -- Reset to sequence: allowing a consumer to rewind to a specific stream sequence -- Delivery interest tracking: distinguishing local vs gateway vs leaf node delivery targets -- Max-deliveries enforcement with configurable drop/reject/dead-letter policies -- Sample/observe mode for latency tracking (recording deliver-to-ack latency percentiles) -- Backoff schedules: per-message redelivery delay arrays -- Consumer pause/resume with coordinated cluster state +`MirrorCoordinator` (364 lines) exists but missing: +- Complete mirror consumer API request generation +- Consumer configuration with proper defaults +- Subject transforms for mirror +- Filter subject application +- OptStartSeq/OptStartTime handling +- Mirror error state tracking (`setMirrorErr`) +- Scheduled retry with exponential backoff +- Mirror health checking -**Related .NET files**: `PushConsumerEngine.cs` (67), `PullConsumerEngine.cs` (169), `ConsumerManager.cs` (198), `AckProcessor.cs` (72), `ConsumerApiHandlers.cs` (307), `ConsumerState.cs` (41), `ConsumerConfig.cs` (37), `IConsumerStore.cs` (56) -**Total .NET lines in consumer subsystem**: ~947 +### 4.2 Mirror Message Processing (HIGH) + +**Missing from MirrorCoordinator:** +- Gap detection in mirror stream (deleted/expired messages) +- Sequence alignment 
(`sseq`/`dseq` tracking) +- Mirror-specific handling for out-of-order arrival +- Mirror error advisory generation + +### 4.3 Source Consumer Setup (HIGH) + +`SourceCoordinator` (470 lines) exists but missing: +- Complete source consumer API request generation +- Subject filter configuration +- Subject transform setup +- Account isolation verification +- Flow control configuration +- Starting sequence selection logic + +### 4.4 Deduplication Window Management (HIGH) + +`SourceCoordinator` has `_dedupWindow` but missing: +- Time-based window pruning +- Memory-efficient cleanup +- Statistics reporting (`DeduplicatedCount`) +- `Nats-Msg-Id` header extraction integration + +### 4.5 Purge with Subject Filtering (HIGH) + +**Missing**: +- Subject-specific purge (`preq.Subject`) +- Sequence range purge (`preq.Sequence`) +- Keep-N functionality (`preq.Keep`) +- Consumer purge cascading based on filter overlap + +### 4.6 Interest Retention Enforcement (HIGH) + +**Missing**: +- Interest-based message cleanup (`checkInterestState`/`noInterest`) +- Per-consumer interest tracking +- `noInterestWithSubject` for filtered consumers +- Orphan message cleanup + +### 4.7 Stream Snapshot & Restore (MEDIUM) + +`StreamSnapshotService` is a 10-line stub. 
Missing: +- Snapshot deadline enforcement +- Consumer inclusion/exclusion +- TAR + S2 compression/decompression +- Snapshot validation +- Partial restore recovery + +### 4.8 Stream Config Update Validation (MEDIUM) + +**Missing**: +- Subjects overlap checking +- Mirror/source config validation +- Subjects modification handling +- Template update propagation +- Discard policy enforcement + +### 4.9 Source Retry & Health (MEDIUM) + +**Missing**: +- `retrySourceConsumerAtSeq()` — retry with exponential backoff +- `cancelSourceConsumer()` — source consumer cleanup +- `retryDisconnectedSyncConsumers()` — periodic health check +- `setupSourceConsumers()` — bulk initialization +- `stopSourceConsumers()` — coordinated shutdown + +### 4.10 Source/Mirror Info Reporting (LOW) + +**Missing**: Source/mirror lag, error state, consumer info for monitoring responses. --- ## Gap 5: Client Protocol Handling (HIGH) -**Go reference**: `client.go` -- 6,716 lines, 162+ functions -**NET implementation**: `NatsClient.cs` -- 924 lines +**Go reference**: `client.go` — 6,716 lines, 162+ functions +**NET implementation**: `NatsClient.cs` (924 lines) **Gap factor**: 7.3x -The Go client handler manages per-connection state including authentication, permissions, protocol parsing dispatch, write buffering, and trace logging. The .NET `NatsClient` handles the core read/write loops and subscription tracking but lacks the adaptive performance tuning and advanced protocol features. +### 5.1 Flush Coalescing (HIGH) -**Missing features**: -- Adaptive read buffer tuning: Go dynamically resizes read buffers from 512 bytes to 65KB based on message throughput. The .NET port has `AdaptiveReadBuffer` tests but the implementation depth is unclear. 
-- Dynamic write buffer pooling: Go pools write buffers and uses flush coalescing to reduce syscalls under load -- Per-client trace level with contextual logging: Go supports enabling protocol-level tracing per client connection -- Message trace with direction (in/out) and byte-level logging for debugging -- Permission caching with LRU and TTL: Go caches permission check results to avoid re-evaluating on every publish. The .NET `PermissionLruCache` exists in tests but integration depth is limited. -- Client kind differentiation: Go distinguishes CLIENT, ROUTER, GATEWAY, LEAF, and SYSTEM client types with different protocol handling paths. The .NET port has `ClientKind` enum and routing tests but the protocol dispatch is not fully differentiated. -- Write timeout handling with partial flush recovery -- Nonce-based TLS authentication flows -- Slow consumer detection and eviction with configurable thresholds -- Maximum control line enforcement (4096 bytes in Go) -- MQTT and WebSocket protocol upgrade paths from the base client +**Missing**: Flush signal pending counter (`fsp`), `maxFlushPending=10`, `pcd` map of pending clients for broadcast-style flush signaling. Go allows multiple producers to queue data, then ONE flush signals the writeLoop. .NET uses independent channels per client. -**Related .NET files**: `NatsClient.cs` (924), `NatsParser.cs` (495), `NatsProtocol.cs` (estimated), `NatsServer.cs` (1,739 -- includes accept loop and server lifecycle) +### 5.2 Stall Gate Backpressure (HIGH) + +**Missing**: `c.out.stc` channel that blocks producers when client is 75% full. No producer rate limiting in .NET. 
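The stall gate in §5.2 is mechanically simple: a producer that pushes the outbound buffer past the high-water mark is handed a channel to wait on, and the write loop closes that channel once after draining, releasing every stalled producer together. A simplified Go sketch (type and method names are illustrative; the real gate is manipulated under the client mutex):

```go
package main

import "fmt"

// stallGate sketches the c.out.stc mechanism in simplified form: once the
// outbound buffer crosses the 75% high-water mark, producers get a channel
// to park on; the write loop closes it after draining, waking all of them.
type stallGate struct {
	pending, limit int
	stc            chan struct{}
}

// add queues n bytes; it returns a non-nil channel to wait on when the
// buffer is now at or above 75% of its limit.
func (g *stallGate) add(n int) <-chan struct{} {
	g.pending += n
	if g.pending >= g.limit*3/4 {
		if g.stc == nil {
			g.stc = make(chan struct{})
		}
		return g.stc
	}
	return nil
}

// drain simulates the write loop flushing n bytes, releasing stalled
// producers once pending falls back under the threshold.
func (g *stallGate) drain(n int) {
	g.pending -= n
	if g.stc != nil && g.pending < g.limit*3/4 {
		close(g.stc) // one close wakes every stalled producer
		g.stc = nil
	}
}

func main() {
	g := &stallGate{limit: 100}
	fmt.Println(g.add(50) == nil) // 50% full: no stall
	stall := g.add(30)            // 80% full: producer must park
	fmt.Println(stall != nil)
	released := make(chan struct{})
	go func() { <-stall; close(released) }()
	g.drain(40) // flush drops us to 40%: gate opens
	<-released
	fmt.Println("producer released")
}
```

Closing a channel is the natural Go broadcast primitive here, which is why there is no per-producer bookkeeping; a .NET port would likely reach for `TaskCompletionSource` to get the same wake-all semantics.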
+
+### 5.3 Write Timeout with Partial Flush Recovery (HIGH)
+
+**Missing**:
+- Partial flush recovery logic (`written > 0` → retry vs close)
+- `WriteTimeoutPolicy` enum (`Close`, `TcpFlush`)
+- Different handling per client kind (routes can recover, clients cannot)
+- .NET immediately closes on timeout; Go allows ROUTER/GATEWAY/LEAF to continue
+
+### 5.4 Per-Account Subscription Result Cache (HIGH)
+
+**Missing**: `readCache` struct with per-account results cache, route targets, statistical counters. Go caches Sublist match results per account on the route/gateway inbound path (maxPerAccountCacheSize=8192).
+
+### 5.5 Slow Consumer Stall Gate (MEDIUM)
+
+**Missing**: Full stall gate mechanics: the channel that blocks producers at the 75% threshold, per-kind slow consumer statistics (route, gateway, leaf), and account-level slow consumer tracking.
+
+### 5.6 Dynamic Write Buffer Pooling (MEDIUM)
+
+`OutboundBufferPool` exists but lacks flush coalescing integration and the `pcd` broadcast optimization.
+
+### 5.7 Per-Client Trace Level (MEDIUM)
+
+`SetTraceMode()` exists but lacks message delivery tracing (`traceMsgDelivery`) for routed messages and per-client echo support.
+
+### 5.8 Subscribe Permission Caching (MEDIUM)
+
+`PermissionLruCache` caches PUB permissions but NOT SUB. Go caches per-account results with genid-based invalidation.
+
+### 5.9 Internal Client Kinds (LOW)
+
+`ClientKind` enum supports CLIENT, ROUTER, GATEWAY, LEAF but is missing the SYSTEM, JETSTREAM, and ACCOUNT kinds and the `isInternalClient()` predicate.
+
+### 5.10 Adaptive Read Buffer Short-Read Counter (LOW)
+
+`AdaptiveReadBuffer` implements 2x grow/shrink but lacks Go's `srs` (short reads) counter for more precise shrink decisions.
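The `srs` counter in §5.10 can be sketched in a few lines: grow eagerly on a full read, but shrink only after a streak of short reads, so a single small frame does not throw away a warmed-up buffer. The streak length below is illustrative, not the nats-server constant:

```go
package main

import "fmt"

const (
	minBufSize     = 512
	maxBufSize     = 65536
	shortsToShrink = 3 // illustrative streak length, not the real constant
)

// readBuf sketches adaptive sizing with a short-read streak counter:
// double when a read fills the buffer, halve only after several
// consecutive short reads.
type readBuf struct {
	size int
	srs  int // short-read streak
}

// observe records the byte count of one read and returns the new size.
func (b *readBuf) observe(n int) int {
	switch {
	case n >= b.size && b.size < maxBufSize: // buffer filled: grow
		b.srs = 0
		b.size *= 2
	case n < b.size/2: // short read: shrink only after a streak
		if b.srs++; b.srs >= shortsToShrink && b.size > minBufSize {
			b.size /= 2
			b.srs = 0
		}
	default: // normal read: reset the streak
		b.srs = 0
	}
	return b.size
}

func main() {
	b := &readBuf{size: minBufSize}
	fmt.Println(b.observe(512))                 // full read grows to 1024
	fmt.Println(b.observe(100), b.observe(100)) // two short reads: no shrink yet
	fmt.Println(b.observe(100))                 // third short read shrinks to 512
}
```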
--- ## Gap 6: MQTT Protocol (HIGH) -**Go reference**: `mqtt.go` -- 5,882 lines -**NET implementation**: `Mqtt/` folder -- 539 lines (5 files) -**Gap factor**: 10.9x +**Go reference**: `mqtt.go` — 5,882 lines +**NET implementation**: `Mqtt/` — 1,238 lines (8 files) +**Gap factor**: 4.7x -The Go MQTT implementation provides full MQTT v3.1.1 compatibility with JetStream-backed session persistence. The .NET implementation has a protocol parser skeleton and connection handler but lacks the session management, QoS tracking, and JetStream integration. +### 6.1 JetStream-Backed Session Persistence (CRITICAL) -**Missing features**: -- MQTT client session creation and restoration with persistent ClientID mapping -- Will message handling: delivering a pre-configured message when a client disconnects unexpectedly -- QoS 1 and QoS 2 tracking with packet ID mapping and retry logic -- Retained message storage backed by a JetStream stream (per-account) -- MQTT session stream persistence: each MQTT session's state is stored in a dedicated JetStream consumer -- MQTT subscription wildcard translation: converting MQTT `+` and `#` wildcards to NATS `*` and `>` equivalents -- Session flapper detection: identifying clients that connect/disconnect rapidly and applying backoff -- MaxAckPending enforcement for QoS 1 flow control -- CONNECT packet validation with protocol version negotiation -- MQTT-to-NATS subject mapping with topic separator translation (`/` to `.`) +**Missing**: Go creates dedicated JetStream streams for: +- `$MQTT_msgs` — message persistence +- `$MQTT_rmsgs` — retained messages +- `$MQTT_sess` — session state +- `$MQTT_qos2in` — QoS 2 incoming tracking +- `$MQTT_out` — QoS 1/2 outgoing -**Related .NET files**: `MqttListener.cs` (166), `MqttProtocolParser.cs` (141), `MqttConnection.cs` (131), `MqttPacketReader.cs` (63), `MqttPacketWriter.cs` (38) +.NET MQTT is entirely in-memory. Sessions are lost on restart. 
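All of the `$MQTT_*` streams above are keyed by subjects derived from MQTT topics, so the topic-to-subject translation (`/` to `.`, `+` to `*`, `#` to `>`) sits underneath the whole persistence design. A Go sketch of the forward direction, ignoring the escaping of literal `.` characters and empty levels that a real converter must handle:

```go
package main

import (
	"fmt"
	"strings"
)

// mqttSubToNATS translates an MQTT subscription filter into a NATS
// subject: '/' becomes '.', '+' becomes '*', '#' becomes '>'. Sketch
// only; a real converter must also escape '.' inside topic levels and
// reject invalid filters (e.g. '#' anywhere but the last level).
func mqttSubToNATS(filter string) string {
	levels := strings.Split(filter, "/")
	for i, l := range levels {
		switch l {
		case "+":
			levels[i] = "*"
		case "#":
			levels[i] = ">"
		}
	}
	return strings.Join(levels, ".")
}

func main() {
	fmt.Println(mqttSubToNATS("sensors/+/temp")) // sensors.*.temp
	fmt.Println(mqttSubToNATS("sensors/#"))      // sensors.>
}
```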
+
+### 6.2 Will Message Delivery (HIGH)
+
+`MqttSessionStore` stores will metadata (`WillTopic`, `WillPayload`, `WillQoS`, `WillRetain`) but does NOT publish the will message on abnormal client disconnection.
+
+### 6.3 QoS 1/2 Tracking (HIGH)
+
+`MqttQoS2StateMachine` exists for QoS 2 state transitions but is missing:
+- JetStream-backed QoS 1 acknowledgment tracking
+- Durable redelivery of unacked QoS 1/2 messages
+- PUBREL delivery stream for QoS 2 phase 2
+
+### 6.4 MaxAckPending Enforcement (HIGH)
+
+**Missing**: Backpressure for QoS 1 accumulation, per-subscription limits, and a config reload hook.
+
+### 6.5 Retained Message Delivery on Subscribe (MEDIUM)
+
+`MqttRetainedStore` exists but is neither JetStream-backed nor integrated with subscription handling; retained messages aren't delivered on SUBSCRIBE.
+
+### 6.6 Session Flapper Detection (LOW)
+
+Partially implemented in `MqttSessionStore` (the basic structure is present), but its integration into the connect/disconnect path is unverified.

---

-## Gap 7: JetStream API Layer (HIGH)
+## Gap 7: JetStream API Layer (MEDIUM)

-**Go reference**: `jetstream_api.go` -- 5,165 lines + `jetstream.go` -- 2,866 lines
-**NET implementation**: `JetStreamApiRouter.cs` (117) + `StreamApiHandlers.cs` (351) + `ConsumerApiHandlers.cs` (307) + `DirectApiHandlers.cs` (61) + `AccountApiHandlers.cs` (16) + `AccountControlApiHandlers.cs` (34) + `JetStreamService.cs` (148)
-**Gap factor**: 7.7x
+**Go reference**: `jetstream_api.go` (5,165) + `jetstream.go` (2,866) = 8,031 lines
+**NET implementation**: ~1,374 lines across 11 handler files
+**Gap factor**: 5.8x

-The Go JetStream API layer handles all `$JS.API.*` subject requests with request validation, leader forwarding, and response serialization. The .NET implementation covers the basic API surface but lacks leader forwarding, cluster-aware request routing, and several API endpoints.
+### 7.1 Leader Forwarding (HIGH) -**Missing features**: -- Leader forwarding: API requests that arrive at a non-leader node must be forwarded to the current leader -- Stream/consumer info caching with generation-based invalidation -- API rate limiting and request deduplication -- Snapshot and restore API endpoints -- Stream purge with subject filter and keep options -- Consumer pause/resume API -- Advisory event publication for API operations -- JetStream account resource tracking (storage used, streams count, consumers count) +API requests arriving at non-leader nodes must be forwarded to the current leader. Not implemented. -**Related .NET total**: ~1,034 lines vs Go total: ~8,031 lines +### 7.2 Clustered API Handlers (HIGH) + +**Missing**: +- `jsClusteredStreamRequest()` — stream create in clustered mode +- `jsClusteredStreamUpdateRequest()` — update/delete/move/scale (509 lines in Go) +- Cluster-aware consumer create/delete handlers + +### 7.3 API Rate Limiting & Deduplication (MEDIUM) + +No request deduplication or rate limiting. + +### 7.4 Snapshot & Restore API (MEDIUM) + +No snapshot/restore API endpoints. + +### 7.5 Consumer Pause/Resume API (MEDIUM) + +No consumer pause/resume API endpoint. + +### 7.6 Advisory Event Publication (LOW) + +No advisory events for API operations. --- -## Gap 8: RAFT Consensus (MEDIUM) +## Gap 8: RAFT Consensus (CRITICAL for clustering) -**Go reference**: `raft.go` -- 5,037 lines -**NET implementation**: `Raft/` folder -- 1,136 lines (13 files) -**Gap factor**: 4.4x +**Go reference**: `raft.go` — 5,037 lines +**NET implementation**: `Raft/` — 1,838 lines (17 files) +**Gap factor**: 2.7x -The .NET RAFT implementation covers the core protocol: leader election, log replication, and basic snapshotting. The gap is in operational features needed for production cluster management. 
+
+### 8.1 Persistent Log Storage (CRITICAL)

-**Missing features**:
-- `InstallSnapshot` with partial state application and streaming transfer
-- `CreateSnapshotCheckpoint` for coordinated snapshots across multiple RAFT groups
-- Membership change proposals: `ProposeAddPeer`, `ProposeRemovePeer` with joint consensus
-- Campaign timeout management with randomized election delays
-- Healthy node detection: classifying nodes as current, catching-up, or leaderless
-- Applied/processed entry tracking for flow control between the RAFT log and the state machine
-- Compaction: truncating the log after a snapshot is taken
-- Pre-vote protocol to prevent disruptive elections from partitioned nodes
+.NET RAFT uses an in-memory log only. RAFT state is not persisted across restarts.

-**Related .NET files**: `RaftWireFormat.cs` (430), `NatsRaftTransport.cs` (201), `RaftNode.cs` (189), `RaftTransport.cs` (64), `RaftLog.cs` (62), `RaftSubjects.cs` (53), `RaftSnapshotStore.cs` (41), `RaftReplicator.cs` (39), and 5 smaller files
+### 8.2 Joint Consensus for Membership Changes (CRITICAL)
+
+The current single-membership-change approach is unsafe for multi-node clusters. Go implements safe addition/removal per Section 4 of the RAFT paper.
+
+### 8.3 InstallSnapshot Streaming (HIGH)
+
+`RaftSnapshot` is fully loaded into memory. Go streams snapshots in chunks with progress tracking.
+
+### 8.4 Leadership Transfer (MEDIUM)
+
+No graceful leader step-down mechanism for maintenance operations.
+
+### 8.5 Log Compaction (MEDIUM)
+
+Basic compaction exists, but there are no configurable retention policies.
+
+### 8.6 Quorum Check Before Proposing (MEDIUM)
+
+The node may propose entries even when a quorum is unreachable.
+
+### 8.7 Read-Only Query Optimization (LOW)
+
+All queries append log entries; there is no ReadIndex optimization for read-only commands.
+
+### 8.8 Election Timeout Jitter (LOW)
+
+Election timeouts may not carry sufficient randomized jitter to prevent split votes.
--- ## Gap 9: Account Management & Multi-Tenancy (MEDIUM) -**Go reference**: `accounts.go` -- 4,774 lines, 172+ functions -**NET implementation**: `Account.cs` (189) + `AuthService.cs` (172) + related auth files -**Gap factor**: 13x (comparing `accounts.go` to `Account.cs`) +**Go reference**: `accounts.go` — 4,774 lines, 172+ functions +**NET implementation**: `Account.cs` (280) + auth files +**Current .NET total**: ~1,266 lines +**Gap factor**: 3.8x -The Go account system provides full multi-tenant isolation where each account has its own `Sublist`, client set, and subject namespace. The .NET implementation has basic account configuration and authentication but lacks the runtime isolation and import/export machinery. +### 9.1 Service Export Latency Tracking (MEDIUM) -**Missing features**: -- Service and stream export whitelist enforcement -- Service import with weighted destination selection for load balancing -- Export authorization checks with account-level permissions -- Cycle detection for service import chains (preventing A imports from B imports from A) -- Response tracking for request-reply latency measurement -- Service latency metrics: p50, p90, p99 percentile tracking per service export -- Account-level JetStream resource limits: max storage, max streams, max consumers per account -- Leaf node cluster registration per account -- Client and connection tracking per account with eviction on limit exceeded -- Weighted subject mappings for traffic shaping -- System account with internal subscription handling +**Missing**: `TrackServiceExport()`, `UnTrackServiceExport()`, latency sampling, p50/p90/p99 tracking, `sendLatencyResult()`, `sendBadRequestTrackingLatency()`, `sendReplyInterestLostTrackLatency()`. 
-**Related .NET files**: `Account.cs` (189), `AuthService.cs` (172), `JwtAuthenticator.cs` (180), `AccountClaims.cs` (107), `AccountResolver.cs` (65), `ExportAuth.cs` (25), and 10+ smaller auth files -**Total .NET lines in auth/account**: ~1,266 +### 9.2 Service Export Response Threshold (MEDIUM) + +**Missing**: `ServiceExportResponseThreshold()`, `SetServiceExportResponseThreshold()` for SLA limits. + +### 9.3 Stream Import Cycle Detection (MEDIUM) + +Only service import cycles detected. **Missing**: `streamImportFormsCycle()`, `checkStreamImportsForCycles()`. + +### 9.4 Wildcard Service Exports (MEDIUM) + +**Missing**: `getWildcardServiceExport()` — cannot define wildcard exports like `svc.*`. + +### 9.5 Account Expiration & TTL (MEDIUM) + +**Missing**: `IsExpired()`, `expiredTimeout()`, `setExpirationTimer()` — expired accounts won't auto-cleanup. + +### 9.6 Account Claim Hot-Reload (MEDIUM) + +**Missing**: `UpdateAccountClaims()`, `updateAccountClaimsWithRefresh()` — account changes may require restart. + +### 9.7 Service/Stream Activation Expiration (LOW) + +**Missing**: JWT activation claim expiry enforcement. + +### 9.8 User NKey Revocation (LOW) + +**Missing**: `checkUserRevoked()` — cannot revoke individual users without re-deploying. + +### 9.9 Response Service Import (LOW) + +**Missing**: Reverse response mapping for cross-account request-reply (`addReverseRespMapEntry`, `checkForReverseEntries`). + +### 9.10 Service Import Shadowing Detection (LOW) + +**Missing**: `serviceImportShadowed()`. 
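The missing stream-cycle check in §9.3 is a depth-first walk that only has to track the current import path. A Go sketch over account names alone (the real check also intersects the imported subjects):

```go
package main

import "fmt"

// importsFormCycle reports whether following stream imports from start
// ever revisits an account on the current path, the essence of a
// streamImportFormsCycle()-style check. graph maps an account to the
// accounts it imports streams from. Names are illustrative.
func importsFormCycle(graph map[string][]string, start string) bool {
	seen := map[string]bool{start: true}
	var walk func(acc string) bool
	walk = func(acc string) bool {
		for _, from := range graph[acc] {
			if seen[from] {
				return true
			}
			seen[from] = true
			if walk(from) {
				return true
			}
			delete(seen, from) // backtrack: only the current path matters
		}
		return false
	}
	return walk(start)
}

func main() {
	g := map[string][]string{"A": {"B"}, "B": {"C"}, "C": {"A"}}
	fmt.Println(importsFormCycle(g, "A"))                               // cycle A->B->C->A
	fmt.Println(importsFormCycle(map[string][]string{"A": {"B"}}, "A")) // no cycle
}
```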
--- ## Gap 10: Monitoring & Events (MEDIUM) **Go reference**: `monitor.go` (4,240) + `events.go` (3,334) + `msgtrace.go` (799) = 8,373 lines -**NET implementation**: `Monitoring/` (1,698) + `Events/` (686) + `MessageTraceContext.cs` (22) = 2,406 lines -**Gap factor**: 3.5x +**NET implementation**: `Monitoring/` (1,698) + `Events/` (960) + `MessageTraceContext.cs` = ~2,658 lines +**Gap factor**: 3.1x -The monitoring and events subsystem is one of the closer areas in terms of coverage. The .NET port implements `/varz`, `/connz`, `/routez`, `/subsz`, `/leafz`, `/gatewayz`, `/jsz`, `/accountz`, `/stacksz`, `/healthz`, and `/pprof` endpoints. The main gaps are in event detail and message tracing depth. +### 10.1 Closed Connections Ring Buffer (HIGH) -**Missing features**: -- Full system event payloads: Go publishes detailed JSON events for connect, disconnect, auth errors, slow consumers. The .NET events exist but payload fields may be incomplete. -- Message trace propagation: Go traces messages through the full delivery pipeline (publish, route, gateway, leaf, deliver) with byte-level detail. The .NET `MessageTraceContext` is 22 lines. -- Closed connection tracking: Go maintains a ring buffer of recently closed connections with disconnect reasons for `/connz` queries. -- Account-scoped monitoring: `/connz?acc=ACCOUNT` filtering -- Sort options for monitoring endpoints (by bytes, messages, subs, etc.) +**Missing**: Ring buffer for recently closed connections with disconnect reasons. `/connz?state=closed` returns empty. + +### 10.2 Account-Scoped Filtering (MEDIUM) + +**Missing**: `/connz?acc=ACCOUNT` filtering. + +### 10.3 Sort Options (MEDIUM) + +**Missing**: `ConnzOptions.SortBy` (by bytes, msgs, uptime, etc.). + +### 10.4 Message Trace Propagation (MEDIUM) + +`MessageTraceContext` exists but trace data not propagated across servers in events. 
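The ring buffer in §10.1 is the easy part; the harder part is populating the disconnect reason consistently (§10.7). A Go sketch with illustrative field names:

```go
package main

import "fmt"

// closedConn is one entry for /connz?state=closed; field names are
// illustrative, not the monitoring schema.
type closedConn struct {
	CID    uint64
	Reason string
}

// closedRing is a fixed-capacity ring buffer of recently closed
// connections; the oldest entries are overwritten first.
type closedRing struct {
	buf  []closedConn
	next int
	full bool
}

func newClosedRing(capacity int) *closedRing {
	return &closedRing{buf: make([]closedConn, capacity)}
}

func (r *closedRing) add(c closedConn) {
	r.buf[r.next] = c
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}

// closed returns the retained entries, oldest first.
func (r *closedRing) closed() []closedConn {
	if !r.full {
		return append([]closedConn(nil), r.buf[:r.next]...)
	}
	return append(append([]closedConn(nil), r.buf[r.next:]...), r.buf[:r.next]...)
}

func main() {
	r := newClosedRing(3)
	for cid := uint64(1); cid <= 5; cid++ {
		r.add(closedConn{CID: cid, Reason: "client closed"})
	}
	// capacity 3, five additions: CIDs 1 and 2 were overwritten
	fmt.Println(r.closed())
}
```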
+ +### 10.5 Auth Error Events (MEDIUM) + +**Missing**: `sendAuthErrorEvent()`, `sendAccountAuthErrorEvent()` — cannot monitor auth failures. + +### 10.6 Full System Event Payloads (MEDIUM) + +Event types defined but payloads may be incomplete (server info, client info fields). + +### 10.7 Closed Connection Reason Tracking (MEDIUM) + +`ClosedClient.Reason` exists but not populated consistently. + +### 10.8 Remote Server Events (LOW) + +**Missing**: `remoteServerShutdown()`, `remoteServerUpdate()`, `leafNodeConnected()` — limited cluster member visibility. + +### 10.9 Event Compression (LOW) + +**Missing**: `getAcceptEncoding()` — system events consume more bandwidth. + +### 10.10 OCSP Peer Events (LOW) + +**Missing**: `sendOCSPPeerRejectEvent()`, `sendOCSPPeerChainlinkInvalidEvent()`. --- ## Gap 11: Gateway Bridging (MEDIUM) -**Go reference**: `gateway.go` -- 3,426 lines, 79+ functions -**NET implementation**: `GatewayManager.cs` (225) + `GatewayConnection.cs` (242) + `ReplyMapper.cs` (39) + `GatewayOptions.cs` (9) -**Gap factor**: 6.7x +**Go reference**: `gateway.go` — 3,426 lines +**NET implementation**: `GatewayManager.cs` (225) + `GatewayConnection.cs` (259) + `GatewayInterestTracker.cs` (190) + `ReplyMapper.cs` (174) +**Current .NET total**: ~848 lines +**Gap factor**: 4.0x -Gateways bridge separate NATS clusters. The .NET implementation handles basic gateway connections and message forwarding but lacks the optimization modes and cross-cluster isolation features. 
+### 11.1 Implicit Gateway Discovery (HIGH) -**Missing features**: -- Interest-only mode optimization: after initial flooding, gateways switch to sending only messages for subjects that have known subscribers in the remote cluster -- Account-specific gateway routes: isolating gateway traffic per account -- Reply mapper (`_GR_.` prefix) for cross-cluster request-reply isolation (stub exists at 39 lines) -- Inbound gateway subscription interest tracking -- Outbound connection pooling (Go uses 3 connections per gateway peer by default) -- Gateway TLS handshake and mutual authentication -- Message trace propagation through gateways -- Gateway reconnection with exponential backoff +**Missing**: `processImplicitGateway()`, `gossipGatewaysToInboundGateway()`, `forwardNewGatewayToLocalCluster()`. All gateways must be explicitly configured. + +### 11.2 Gateway Reconnection with Backoff (MEDIUM) + +**Missing**: `solicitGateways()`, `reconnectGateway()`, `solicitGateway()` with structured delay/jitter. + +### 11.3 Account-Specific Gateway Routes (MEDIUM) + +**Missing**: `sendAccountSubsToGateway()` — routes all subscriptions without per-account isolation. + +### 11.4 Queue Group Propagation (MEDIUM) + +**Missing**: `sendQueueSubsToGateway()` — queue-based load balancing may not work across gateways. + +### 11.5 Reply Subject Mapping Cache (MEDIUM) + +**Missing**: `trackGWReply()`, `startGWReplyMapExpiration()` — dynamic reply tracking with TTL. Basic `_GR_` prefix exists but no caching. + +### 11.6 Gateway Command Protocol (LOW) + +**Missing**: `gatewayCmdGossip`, `gatewayCmdAllSubsStart`, `gatewayCmdAllSubsComplete` — no bulk subscription synchronization. + +### 11.7 Connection Registration (LOW) + +**Missing**: `registerInboundGatewayConnection()`, `registerOutboundGatewayConnection()`, `getRemoteGateway()` — minimal gateway registry. 
--- ## Gap 12: Leaf Node Connections (MEDIUM) -**Go reference**: `leafnode.go` -- 3,470 lines -**NET implementation**: `LeafNodeManager.cs` (213) + `LeafConnection.cs` (242) + `LeafLoopDetector.cs` (35) + `LeafHubSpokeMapper.cs` (30) -**Gap factor**: 6.7x +**Go reference**: `leafnode.go` — 3,470 lines +**NET implementation**: `LeafNodeManager.cs` (328) + `LeafConnection.cs` (284) + `LeafHubSpokeMapper.cs` (121) + `LeafLoopDetector.cs` (35) +**Current .NET total**: ~768 lines +**Gap factor**: 4.5x -Leaf nodes provide hub-and-spoke topology for edge deployments. The .NET implementation handles basic leaf connections and subscription propagation but lacks the full feature set. +### 12.1 TLS Certificate Hot-Reload (MEDIUM) -**Missing features**: -- Solicited leaf connection management with retry and reconnect logic -- Hub-spoke subject filtering: only propagating subjects that have active subscriptions -- Leaf node JetStream domain awareness -- Loop detection refinement: the basic detector exists (35 lines) but the Go implementation is more nuanced with `$LDS.` prefix handling -- Account-scoped leaf connections -- Leaf node compression negotiation (S2) -- Dynamic subscription interest updates after initial connect +**Missing**: `updateRemoteLeafNodesTLSConfig()` — TLS cert changes require server restart. -**Related .NET files**: `LeafNodeManager.cs` (213), `LeafConnection.cs` (242), `LeafLoopDetector.cs` (35), `LeafHubSpokeMapper.cs` (30), `LeafNodeOptions.cs` (8), `LeafzHandler.cs` (21) -**Total .NET lines**: ~549 +### 12.2 Permission & Account Syncing (MEDIUM) + +**Missing**: `sendPermsAndAccountInfo()`, `initLeafNodeSmapAndSendSubs()` — permission changes on hub may not propagate. + +### 12.3 Leaf Connection State Validation (MEDIUM) + +**Missing**: `remoteLeafNodeStillValid()` — no validation of remote config on reconnect. + +### 12.4 JetStream Migration Checks (LOW) + +**Missing**: `checkJetStreamMigrate()` — cannot safely migrate leaf JetStream domains. 
+ +### 12.5 Leaf Node WebSocket Support (LOW) + +**Missing**: Go accepts WebSocket connections for leaf nodes. + +### 12.6 Leaf Cluster Registration (LOW) + +**Missing**: `registerLeafNodeCluster()`, `hasLeafNodeCluster()` — cannot query leaf topology. + +### 12.7 Connection Disable Flag (LOW) + +**Missing**: `isLeafConnectDisabled()` — cannot selectively disable leaf solicitation. --- ## Gap 13: Route Clustering (MEDIUM) -**Go reference**: `route.go` -- 3,314 lines -**NET implementation**: `RouteManager.cs` (269) + `RouteConnection.cs` (289) + `RouteCompressionCodec.cs` (26) -**Gap factor**: 5.7x +**Go reference**: `route.go` — 3,314 lines +**NET implementation**: `RouteManager.cs` (324) + `RouteConnection.cs` (296) + `RouteCompressionCodec.cs` (135) +**Current .NET total**: ~755 lines +**Gap factor**: 4.4x -Route clustering provides full-mesh connectivity between NATS servers. The .NET implementation handles basic route connections, subscription propagation, and message forwarding. +### 13.1 Implicit Route Discovery (HIGH) -**Missing features**: -- Route pooling: Go uses configurable connection pools (default 3) per route peer. The .NET implementation uses single connections. -- Account-specific dedicated routes -- Route compression with S2 (codec stub exists at 26 lines) -- Solicited route connection management with discovery -- Route permission enforcement -- Dynamic route addition and removal without restart -- Gossip-based topology discovery +**Missing**: `processImplicitRoute()`, `forwardNewRouteInfoToKnownServers()`. Routes must be explicitly configured. -**Related .NET files**: `RouteManager.cs` (269), `RouteConnection.cs` (289), `RouteCompressionCodec.cs` (26), `RouteCompression.cs` (7), `RoutezHandler.cs` (21) -**Total .NET lines**: ~612 +### 13.2 Account-Specific Dedicated Routes (MEDIUM) + +**Missing**: Per-account route pinning for traffic isolation. 
+ +### 13.3 Route Pool Size Negotiation (MEDIUM) + +**Missing**: Handling heterogeneous clusters with different pool sizes. + +### 13.4 Route Hash Storage (LOW) + +**Missing**: `storeRouteByHash()`, `getRouteByHash()` for efficient route lookup. + +### 13.5 Cluster Split Handling (LOW) + +**Missing**: `removeAllRoutesExcept()`, `removeRoute()` — no partition handling. + +### 13.6 No-Pool Route Fallback (LOW) + +**Missing**: Interoperability with older servers without pool support. --- ## Gap 14: Configuration & Hot Reload (MEDIUM) **Go reference**: `opts.go` (6,435) + `reload.go` (2,653) = 9,088 lines -**NET implementation**: `ConfigProcessor.cs` (1,023) + `NatsConfLexer.cs` (1,503) + `NatsConfParser.cs` (421) + `ConfigReloader.cs` (395) = 3,342 lines -**Gap factor**: 2.7x +**NET implementation**: `ConfigProcessor.cs` (1,438) + `NatsConfLexer.cs` (1,503) + `NatsConfParser.cs` (421) + `ConfigReloader.cs` (526) +**Current .NET total**: ~3,888 lines +**Gap factor**: 2.3x -Configuration parsing is one of the closer areas. The .NET config parser handles the NATS config file format including includes, variables, and nested blocks. The gap is primarily in hot reload scope and CLI option coverage. +### 14.1 SIGHUP Signal Handling (HIGH) -**Missing features**: -- Signal handling (SIGHUP) for reload trigger on Unix systems -- Cluster permissions reloading without restart -- Auth change propagation to existing connections (disconnect clients that no longer have valid credentials) -- Logger reconfiguration without restart -- TLS certificate reloading (for certificate rotation) -- JetStream configuration changes (storage directory, limits) -- Route pool size changes at runtime -- Account list updates with connection cleanup for removed accounts -- Many CLI flags that Go supports are not parsed in the .NET host +**Missing**: Unix signal handler for config reload without restart. 
+ +### 14.2 Auth Change Propagation (MEDIUM) + +**Missing**: When users/nkeys change, propagate to existing connections. + +### 14.3 TLS Certificate Reload (MEDIUM) + +**Missing**: Reload certificates for new connections without restart. + +### 14.4 Cluster Config Hot Reload (MEDIUM) + +**Missing**: Add/remove route URLs, gateway URLs, leaf node URLs at runtime. + +### 14.5 Logging Level Changes (LOW) + +**Missing**: Debug/trace/logtime changes at runtime. + +### 14.6 JetStream Config Changes (LOW) + +**Missing**: JetStream option reload. --- -## Gap 15: WebSocket Support (MEDIUM) +## Gap 15: WebSocket Support (LOW) -**Go reference**: `websocket.go` -- 1,550 lines -**NET implementation**: `WebSocket/` folder -- 1,210 lines (7 files) -**Gap factor**: 1.3x +**Go reference**: `websocket.go` — 1,550 lines +**NET implementation**: `WebSocket/` — 1,453 lines (7 files) +**Gap factor**: 1.1x -WebSocket support is one of the most complete areas in the .NET port. The basic WebSocket upgrade, frame reading/writing, and compression are implemented. +WebSocket support is the most complete subsystem. Compression negotiation (permessage-deflate), JWT auth through WebSocket, and origin checking are all implemented. -**Missing features**: -- WebSocket-specific TLS configuration -- Origin checking refinement (basic checker exists at 81 lines) -- WebSocket compression negotiation (permessage-deflate parameters) -- JWT authentication through WebSocket connect +### 15.1 WebSocket-Specific TLS (LOW) + +Minor TLS configuration differences. + +--- + +## Priority Summary + +### Tier 1: Production Blockers (CRITICAL) + +1. **FileStore encryption/compression/recovery** (Gap 1) — blocks data security and durability +2. **FileStore missing interface methods** (Gap 1.9) — blocks API completeness +3. **RAFT persistent log** (Gap 8.1) — blocks cluster persistence +4. **RAFT joint consensus** (Gap 8.2) — blocks safe membership changes +5. 
**JetStream cluster monitoring loop** (Gap 2.1) — blocks distributed JetStream + +### Tier 2: Capability Limiters (HIGH) + +6. **Consumer delivery loop** (Gap 3.1) — core consumer functionality +7. **Consumer pull request pipeline** (Gap 3.2) — pull consumer correctness +8. **Consumer ack/redelivery** (Gaps 3.3–3.4) — consumer reliability +9. **Stream mirror/source retry** (Gaps 4.1–4.3) — replication reliability +10. **Stream purge with filtering** (Gap 4.5) — operational requirement +11. **Client flush coalescing** (Gap 5.1) — performance +12. **Client stall gate** (Gap 5.2) — backpressure +13. **MQTT JetStream persistence** (Gap 6.1) — MQTT durability +14. **SIGHUP config reload** (Gap 14.1) — operational requirement +15. **Implicit route/gateway discovery** (Gaps 11.1, 13.1) — cluster usability + +### Tier 3: Important Features (MEDIUM) + +16–40: Account management, monitoring, gateway details, leaf node features, consumer advanced features, etc. + +### Tier 4: Nice-to-Have (LOW) + +41+: Internal client kinds, election jitter, event compression, etc. --- ## Test Coverage Gap Summary -The table below maps Go test files to their .NET equivalents. "Go Tests" counts `func Test*` functions. ".NET Tests" counts `[Fact]` and `[Theory]` attributes across all related .NET test files for that subsystem. 
+All 2,937 Go tests have been categorized: +- **2,646 mapped** (90.1%) — have corresponding .NET tests +- **272 not_applicable** (9.3%) — platform-specific or irrelevant +- **19 skipped** (0.6%) — intentionally deferred +- **0 unmapped** — none remaining -| Go Test Area | Go Tests | .NET Tests | Delta | Notes | -|---|---:|---:|---:|---| -| `jetstream_test.go` | 312 | ~350 | +38 | .NET tests are shallower; many test API routing not behavior | -| `filestore_test.go` | 232 | 199 | -33 | .NET lacks block management, recovery, encryption integration tests | -| `jetstream_consumer_test.go` | 160 | 116 | -44 | Missing priority groups, backoff, pause, rate limiting tests | -| `jetstream_cluster_1_test.go` | 151 | 542 (all cluster) | -- | .NET has more cluster tests but uses simulated meta-group, not real clustering | -| `jetstream_cluster_2_test.go` | 123 | (included above) | -- | | -| `jetstream_cluster_3_test.go` | 97 | (included above) | -- | | -| `jetstream_cluster_4_test.go` | 85 | (included above) | -- | | -| `jetstream_cluster_long_test.go` | 7 | (included above) | -- | | -| `jetstream_super_cluster_test.go` | 47 | ~20 | -27 | .NET cross-cluster tests exist but are thin | -| `mqtt_test.go` | 123 | 127 | +4 | .NET tests cover parsing and topic mapping but not session persistence | -| `leafnode_test.go` | 110 | 93 | -17 | .NET lacks solicited connection and JS domain tests | -| `raft_test.go` | 104 | 205 | +101 | .NET has extensive RAFT tests but against simulated transport | -| `monitor_test.go` | 100 | 136 | +36 | .NET monitoring is well-covered | -| `norace_1_test.go` | 100 | ~55 | -45 | Go norace tests are stress/concurrency; .NET stress tests are lighter | -| `norace_2_test.go` | 41 | (included above) | -- | | -| `jwt_test.go` | 88 | 88 | 0 | Good parity in JWT test count | -| `gateway_test.go` | 88 | 115 | +27 | .NET gateway tests cover config and basic forwarding well | -| `opts_test.go` | 86 | 225 | +139 | .NET has extensive config tests due to lexer/parser 
unit tests | -| `client_test.go` | 82 | 110 | +28 | .NET client tests are unit-level; Go tests are integration | -| `reload_test.go` | 73 | (included in config) | -- | | -| `routes_test.go` | 70 | 91 | +21 | .NET route tests cover handshake and subscription propagation | -| `sublist_test.go` | 65 | 40 | -25 | Go sublist tests include benchmarks and concurrent stress | -| `websocket_test.go` | 61 | 66 | +5 | Good parity | -| `events_test.go` | 51 | 53 | +2 | Good parity in count | -| `accounts_test.go` | 64 | 162 | +98 | .NET account tests cover auth mechanisms broadly but not import/export depth | -| `server_test.go` | 42 | 25 | -17 | Go server tests exercise full lifecycle | -| `msgtrace_test.go` | 33 | ~23 | -10 | .NET message trace tests exist but trace propagation is stubbed | -| `auth_callout_test.go` | 31 | ~15 | -16 | Partial coverage | -| `jetstream_batching_test.go` | 29 | ~10 | -19 | Batching semantics partially tested | -| `client_proxyproto_test.go` | 23 | ~5 | -18 | Proxy protocol minimally tested | -| `dirstore_test.go` | 19 | 0 | -19 | No directory store equivalent | -| `signal_test.go` | 19 | 0 | -19 | No signal handling in .NET | -| `jetstream_versioning_test.go` | 18 | ~5 | -13 | Version negotiation minimally tested | -| `jetstream_jwt_test.go` | 18 | ~10 | -8 | Partial JWT-JetStream integration | -| `parser_test.go` | 17 | 17 | 0 | Good parity | -| `store_test.go` | 17 | ~20 | +3 | Store interface tests | -| `jetstream_leafnode_test.go` | 13 | ~8 | -5 | Leaf-JetStream interaction partially covered | -| `split_test.go` | 12 | 0 | -12 | No split test equivalent | -| `auth_test.go` | 12 | (included in accounts) | -- | | -| `leafnode_proxy_test.go` | 9 | 0 | -9 | No proxy leaf node tests | -| `ipqueue_test.go` | 9 | 0 | -9 | No IP queue equivalent | -| `util_test.go` | 7 | 0 | -7 | Utility functions tested inline | -| `log_test.go` | 6 | ~3 | -3 | Basic logging tests | -| `nkey_test.go` | 5 | ~8 | +3 | Good parity | -| `subject_transform_test.go` | 4 
| 53 | +49 | .NET has extensive transform tests | -| `errors_test.go` | 2 | ~4 | +2 | | -| **Subtree tests** | | | | | -| `stree/stree_test.go` | ~20 | ~25 | +5 | Subject tree tests | -| `gsl/gsl_test.go` | ~15 | ~20 | +5 | Generic subject list tests | -| `thw/thw_test.go` | ~10 | ~15 | +5 | Time hash wheel tests | -| `avl/` (in pse/) | ~5 | ~30 | +25 | Sequence set tests | -| **TOTAL** | **2,937** | **3,168** | **+231** | .NET has 8% more test methods | - ---- - -## Architectural Observations - -### 1. Cluster Testing is Simulated, Not Distributed - -The .NET test suite uses `JetStreamClusterFixture` which simulates a multi-node cluster within a single process using `JetStreamMetaGroup` stubs. Go tests start real separate server processes, establish real TCP route connections, and verify behavior through actual network I/O. This means the .NET cluster tests validate API contracts and state machine transitions but do not test network partitions, connection failures, leader elections under load, or real RAFT consensus. - -### 2. FileStore is Functionally a MemStore with Disk Persistence - -The .NET `FileStore` appends JSONL records to disk but does not implement the block-based architecture that gives the Go FileStore its performance characteristics. There is no block indexing, no write cache, no compression-on-write, and no encryption-at-rest integration into the storage path. The existing `AeadEncryptor` and `S2Codec` are tested in isolation but not composed into the file store. - -### 3. Consumer Engines are Delivery Wrappers, Not State Machines - -The Go consumer engine maintains complex state: pending messages, redelivery queues, ack tracking with sequence gaps, flow control counters, and rate limiters. The .NET push and pull consumer engines (67 and 169 lines respectively) handle message delivery but delegate most state management to simpler structures. Features like NAK-with-delay, max-deliveries enforcement, and priority group pinning are not implemented. 
- -### 4. Test Depth vs Test Count - -The .NET test suite has 3,168 test methods compared to Go's 2,937 test functions, but this comparison is misleading: -- Many .NET tests are `[Theory]` with `[InlineData]` that test the same code path with different inputs (e.g., config parsing with various values) -- Go test functions frequently contain subtests (`t.Run(...)`) that are not counted in the function count but execute many more scenarios -- Go tests start real servers and make real TCP connections; .NET tests often use mocked interfaces or in-process fixtures -- Go's `norace` tests (141 functions) are specifically designed to catch concurrency bugs under stress; .NET stress tests exist but are lighter - -### 5. Protocol Compatibility Surface - -The .NET server can interoperate with standard NATS clients for basic pub/sub, subscribe, unsubscribe, and request-reply. The protocol parser (`NatsParser.cs`, 495 lines) handles the core text protocol. However, the following protocol extensions are incomplete: -- `HMSG`/`HPUB` (headers): parsed but header propagation through routes/gateways is incomplete -- Route protocol (`RS+`/`RS-`/`RMSG`): basic forwarding works but pool awareness and account scoping are missing -- Gateway protocol: basic message forwarding but interest-only mode optimization is absent -- Leaf node protocol: subscription propagation works but JetStream domain headers are not forwarded - -### 6. Configuration Parsing is Relatively Complete - -The NATS config file parser (`NatsConfLexer.cs` at 1,503 lines + `NatsConfParser.cs` at 421 lines) is one of the more complete subsystems. It handles the custom NATS config format including nested blocks, includes, variable interpolation, and most configuration directives. The gap is primarily in the number of configuration options that are parsed but not acted upon at runtime. - -### 7. 
Internal Data Structures are Well-Ported - -The internal data structure libraries have good parity: -- `SequenceSet.cs` (777 lines) -- AVL-based sparse sequence tracking -- `GenericSubjectList.cs` (650 lines) -- optimized trie for subject matching -- `SubjectTree/` (1,265 lines across Nodes.cs + SubjectTree.cs + Parts.cs) -- subject-aware tree -- `HashWheel.cs` (414 lines) -- time hash wheel for TTL scheduling -- `SubList.cs` (984 lines) vs Go `sublist.go` (1,728 lines) -- trie-based subscription matching - -These are foundational and well-tested, which is positive for building the higher-level features on top of them. - -### 8. Authentication Breadth vs Depth - -The .NET port supports multiple authentication mechanisms (username/password, token, NKey, JWT, TLS map, external auth callout, proxy auth) which is good breadth. However, each mechanism is thinner than its Go equivalent. JWT authentication in particular lacks the full claims validation, account resolution with caching, and operator trust chain verification that the Go implementation provides. +The .NET test suite (3,168 test methods) achieves good coverage breadth but many tests validate API contracts against stubs/mocks rather than testing actual distributed behavior. The gap is in test *depth*, not test *count*.