natsdotnet/docs/plans/2026-02-24-structuregaps-full-parity-design.md
Joseph Doherty f1e42f1b5f docs: add full Go parity design for all 15 structure gaps
5-track parallel architecture (Storage, Consensus, Protocol,
Networking, Services) covering all CRITICAL/HIGH/MEDIUM gaps
identified in structuregaps.md. Feature-first approach with
test_parity.db updates. Targets ~1,194 additional Go test mappings.
2026-02-24 11:57:15 -05:00


Full Go Parity: All 15 Structure Gaps — Design Document

For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:writing-plans to create the implementation plan from this design.

Goal: Port all missing functionality identified in docs/structuregaps.md (15 gaps, CRITICAL through MEDIUM) from the Go NATS server to the .NET port, achieving full behavioral parity. Update docs/test_parity.db as each Go test is ported.

Architecture: 5 parallel implementation tracks organized by dependency. Tracks A (Storage), D (Networking), and E (Services) are independent and start immediately. Track C (Protocol) depends on Track A. Track B (Consensus) depends on Tracks A + C. Each track builds features first, then ports corresponding Go tests.

Approach: Feature-first — build each missing feature, then port its Go tests as validation. Bottom-up dependency ordering ensures foundations are solid before integration.

Estimated impact: ~1,194 additional Go tests mapped (859 → ~2,053, from 29% to ~70%).


Track A: Storage (FileStore Block Management)

Gap #1 — CRITICAL, 20.8x gap factor
Go: filestore.go (12,593 lines) | .NET: FileStore.cs (607 lines)
Dependencies: None (starts immediately)
Tests to port: ~159 from filestore_test.go

Features

  1. Message Blocks — 65KB+ blocks with per-block index files. Header: magic, version, first/last sequence, message count, byte count. New block on size limit.
  2. Block Index — Per-block index mapping sequence number → (offset, length). Enables O(1) lookups.
  3. S2 Compression Integration — Wire existing S2Codec.cs into block writes (compress on flush) and reads (decompress on load).
  4. AEAD Encryption Integration — Wire existing AeadEncryptor.cs into block lifecycle. Per-block encryption keys with rotation on seal.
  5. Crash Recovery — Scan block directory on startup, validate checksums, rebuild indexes from raw data.
  6. Tombstone/Deletion Tracking — Sparse sequence sets (using SequenceSet.cs) for deleted messages. Purge by subject and sequence range.
  7. Write Cache — In-memory buffer for active (unsealed) block. Configurable max block count.
  8. Atomic File Operations — Write-to-temp + rename for crash-safe block sealing.
  9. TTL Scheduling Recovery — Reconstruct pending TTL expirations from blocks on restart, register with HashWheel.
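The per-block index (feature 2) is the piece that makes sealed-block lookups O(1). A minimal Go sketch of the idea follows; the types and method names here are hypothetical illustrations, not the actual MessageBlock/BlockIndex API:

```go
package main

import "fmt"

// entry records where one message lives inside a block (offset and
// length within the block file). Layout here is illustrative only.
type entry struct {
	offset uint32
	length uint32
}

// blockIndex maps a sequence number to its (offset, length) pair.
// Because sequences within a block are contiguous, a dense slice
// indexed by seq-firstSeq gives O(1) lookups.
type blockIndex struct {
	firstSeq uint64
	entries  []entry
}

// add appends the next sequential message's location.
func (bi *blockIndex) add(seq uint64, off, ln uint32) {
	if len(bi.entries) == 0 {
		bi.firstSeq = seq
	}
	bi.entries = append(bi.entries, entry{off, ln})
}

// lookup returns the location of seq, or false if it is outside
// this block's range.
func (bi *blockIndex) lookup(seq uint64) (entry, bool) {
	if len(bi.entries) == 0 || seq < bi.firstSeq ||
		seq >= bi.firstSeq+uint64(len(bi.entries)) {
		return entry{}, false
	}
	return bi.entries[seq-bi.firstSeq], true
}

func main() {
	bi := &blockIndex{}
	bi.add(100, 0, 64)
	bi.add(101, 64, 128)
	e, ok := bi.lookup(101)
	fmt.Println(ok, e.offset, e.length) // true 64 128
}
```

A dense slice works because each block owns a contiguous sequence range; deleted sequences would be handled by the tombstone sets from feature 6, not by holes in the index.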

Key Files

| Action  | File                                                | Notes                    |
|---------|-----------------------------------------------------|--------------------------|
| Rewrite | src/NATS.Server/JetStream/Storage/FileStore.cs      | Block-based architecture |
| Create  | src/NATS.Server/JetStream/Storage/MessageBlock.cs   | Block abstraction        |
| Create  | src/NATS.Server/JetStream/Storage/BlockIndex.cs     | Per-block index          |
| Modify  | src/NATS.Server/JetStream/Storage/S2Codec.cs        | Wire into block lifecycle |
| Modify  | src/NATS.Server/JetStream/Storage/AeadEncryptor.cs  | Per-block key management |
| Tests   | tests/.../JetStream/Storage/FileStoreBlockTests.cs  | New + expanded           |

Track B: Consensus (RAFT + JetStream Cluster)

Gap #8 — MEDIUM, 4.4x | Gap #2 — CRITICAL, 213x
Go: raft.go (5,037) + jetstream_cluster.go (10,887) | .NET: 1,136 + 51 lines
Dependencies: Tracks A and C must merge first
Tests to port: ~85 raft + ~358 cluster + ~47 super-cluster = ~490

Phase B1: RAFT Enhancements

  1. InstallSnapshot — Chunked streaming snapshot transfer. Follower receives chunks, applies partial state, catches up from log.
  2. Membership Changes — ProposeAddPeer/ProposeRemovePeer with single-server changes (matching Go's approach).
  3. Pre-vote Protocol — Candidate must get pre-vote approval before incrementing term. Prevents disruptive elections from partitioned nodes.
  4. Log Compaction — Truncate RAFT log after snapshot. Track last applied index.
  5. Healthy Node Classification — Current/catching-up/leaderless states.
  6. Campaign Timeout Management — Randomized election delays.
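The pre-vote rule (item 3) can be sketched in a few lines of Go. This is a simplified model for illustration, not the RaftNode.cs implementation: a peer grants a pre-vote only if it has no live leader and the candidate's log is at least as current as its own, and the candidate bumps its term only on a majority of grants:

```go
package main

import "fmt"

// peer models the state a voter consults when asked for a pre-vote
// (simplified: real Raft also compares last log *term*, not just index).
type peer struct {
	lastLogIndex uint64
	leaderAlive  bool // heard from a leader within the election timeout
}

// grantPreVote: grant only if no live leader and the candidate's log
// is at least as up to date as ours.
func (p peer) grantPreVote(candidateLastIndex uint64) bool {
	return !p.leaderAlive && candidateLastIndex >= p.lastLogIndex
}

// runPreVote returns true when a majority of the full cluster
// (peers plus the candidate itself) grants — the only case in which
// the candidate may increment its term and campaign for real.
func runPreVote(peers []peer, candidateLastIndex uint64) bool {
	granted := 1 // the candidate votes for itself
	for _, p := range peers {
		if p.grantPreVote(candidateLastIndex) {
			granted++
		}
	}
	return granted > (len(peers)+1)/2
}

func main() {
	// A partitioned node rejoining cannot disrupt: peers that still
	// hear the leader refuse the pre-vote, so no term inflation.
	peers := []peer{
		{lastLogIndex: 5},
		{lastLogIndex: 5},
		{lastLogIndex: 9, leaderAlive: true},
		{lastLogIndex: 5},
	}
	fmt.Println(runPreVote(peers, 6)) // true
}
```

Because a failed pre-vote never increments the term, a flapping or partitioned node cannot force the healthy quorum through repeated elections.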

Phase B2: JetStream Cluster Coordination

  1. Assignment Tracking — StreamAssignment and ConsumerAssignment types via RAFT proposals. Records: stream config, replica group, placement constraints.
  2. RAFT Proposal Workflow — Leader validates → proposes to meta-group → on commit, all nodes apply → assigned nodes start stream/consumer.
  3. Placement Engine — Unique nodes, tag matching, cluster affinity. Expands AssetPlacementPlanner.cs.
  4. Inflight Deduplication — Track pending proposals to prevent duplicates during leader transitions.
  5. Peer Remove & Stream Move — Data rebalancing when a peer is removed.
  6. Step-down & Leadership Transfer — Graceful leader handoff.
  7. Per-Stream RAFT Groups — Separate RAFT group per stream for message replication.

Key Files

| Action  | File                                            | Notes                                   |
|---------|-------------------------------------------------|-----------------------------------------|
| Modify  | src/NATS.Server/Raft/RaftNode.cs                | Snapshot, membership, pre-vote, compaction |
| Create  | src/NATS.Server/Raft/RaftSnapshot.cs            | Streaming snapshot                      |
| Create  | src/NATS.Server/Raft/RaftMembership.cs          | Peer add/remove                         |
| Rewrite | src/.../JetStream/Cluster/JetStreamMetaGroup.cs | 51 → ~2,000+ lines                      |
| Create  | src/.../JetStream/Cluster/StreamAssignment.cs   | Assignment type                         |
| Create  | src/.../JetStream/Cluster/ConsumerAssignment.cs | Assignment type                         |
| Create  | src/.../JetStream/Cluster/PlacementEngine.cs    | Topology-aware placement                |
| Modify  | src/.../JetStream/Cluster/StreamReplicaGroup.cs | Coordination logic                      |

Track C: Protocol (Client, Consumer, JetStream API, Mirrors/Sources)

Gaps #5, #4, #7, #3 — all HIGH
Dependencies: C1/C3 independent; C2 needs C3; C4 needs Track A
Tests to port: ~43 client + ~134 consumer + ~184 jetstream + ~30 mirror = ~391

C1: Client Protocol Handling (Gap #5, 7.3x)

  1. Adaptive read buffer tuning (512→65536 based on throughput)
  2. Write buffer pooling with flush coalescing
  3. Per-client trace level
  4. Full CLIENT/ROUTER/GATEWAY/LEAF/SYSTEM protocol dispatch
  5. Slow consumer detection and eviction
  6. Max control line enforcement (4096 bytes)
  7. Write timeout with partial flush recovery

C2: Consumer Delivery Engines (Gap #4, 13.3x)

  1. NAK and redelivery tracking with exponential backoff schedules
  2. Pending request queue for pull consumers with flow control
  3. Max-deliveries enforcement (drop/reject/dead-letter)
  4. Priority group pinning (sticky consumer assignment)
  5. Idle heartbeat generation
  6. Pause/resume state with advisory events
  7. Filter subject skip tracking
  8. Per-message redelivery delay arrays (backoff schedules)

C3: JetStream API Layer (Gap #7, 7.7x)

  1. Leader forwarding for non-leader API requests
  2. Stream/consumer info caching with generation invalidation
  3. Snapshot/restore API endpoints
  4. Purge with subject filter, keep-N, sequence-based
  5. Consumer pause/resume API
  6. Advisory event publication for API operations
  7. Account resource tracking (storage, streams, consumers)

C4: Stream Mirrors, Sources & Transforms (Gap #3, 16.3x)

  1. Mirror synchronization loop (continuous pull, apply locally)
  2. Source/mirror ephemeral consumer setup with position tracking
  3. Retry with exponential backoff and jitter
  4. Deduplication window (Nats-Msg-Id header tracking)
  5. Purge operations (subject filter, sequence-based, keep-N)
  6. Stream snapshot and restore

Key Files

| Action  | File                                | Notes                                  |
|---------|-------------------------------------|----------------------------------------|
| Modify  | NatsClient.cs                       | Adaptive buffers, slow consumer, trace |
| Modify  | PushConsumerEngine.cs               | Major expansion                        |
| Modify  | PullConsumerEngine.cs               | Major expansion                        |
| Create  | .../Consumers/RedeliveryTracker.cs  | NAK/redelivery state                   |
| Create  | .../Consumers/PriorityGroupManager.cs | Priority pinning                     |
| Modify  | JetStreamApiRouter.cs               | Leader forwarding                      |
| Modify  | StreamApiHandlers.cs                | Purge, snapshot                        |
| Modify  | ConsumerApiHandlers.cs              | Pause/resume                           |
| Rewrite | MirrorCoordinator.cs                | 22 → ~500+ lines                       |
| Rewrite | SourceCoordinator.cs                | 36 → ~500+ lines                       |

Track D: Networking (Gateway, Leaf Node, Routes)

Gaps #11, #12, #13 — all MEDIUM
Dependencies: None (starts immediately)
Tests to port: ~61 gateway + ~59 leafnode + ~39 routes = ~159

D1: Gateway Bridging (Gap #11, 6.7x)

  1. Interest-only mode (flood → interest switch)
  2. Account-specific gateway routes
  3. Reply mapper expansion (_GR_. prefix)
  4. Outbound connection pooling (default 3)
  5. Gateway TLS mutual auth
  6. Message trace through gateways
  7. Reconnection with exponential backoff

D2: Leaf Node Connections (Gap #12, 6.7x)

  1. Solicited leaf connection management with retry/reconnect
  2. Hub-spoke subject filtering
  3. JetStream domain awareness
  4. Account-scoped leaf connections
  5. Leaf compression negotiation (S2)
  6. Dynamic subscription interest updates
  7. Loop detection refinement ($LDS. prefix)

D3: Route Clustering (Gap #13, 5.7x)

  1. Route pooling (configurable, default 3)
  2. Account-specific dedicated routes
  3. Route compression (wire RouteCompressionCodec.cs)
  4. Solicited route connections with discovery
  5. Route permission enforcement
  6. Dynamic route add/remove without restart
  7. Gossip-based topology discovery
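Route pooling (items 1 and 2) needs a stable rule assigning each account to one of the pooled connections, so both sides of a route agree on which connection carries an account's traffic. A hedged sketch of one such rule, a hash of the account name modulo the pool size (the actual assignment logic lives in the server; this only illustrates the idea):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// poolIndex picks which pooled route connection carries traffic for an
// account: a stable FNV-1a hash of the account name modulo the pool
// size, so the same account always maps to the same connection.
func poolIndex(account string, poolSize int) int {
	h := fnv.New32a()
	h.Write([]byte(account))
	return int(h.Sum32()) % poolSize
}

func main() {
	// With the default pool size of 3, accounts spread across the
	// pool but each account is pinned to a single index.
	for _, acc := range []string{"ACME", "GLOBEX", "ACME"} {
		fmt.Println(acc, poolIndex(acc, 3))
	}
}
```

Accounts configured with dedicated routes (item 2) would bypass this hash entirely and own their own connection.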

Key Files

| Action | File                                   | Notes                    |
|--------|----------------------------------------|--------------------------|
| Modify | GatewayManager.cs, GatewayConnection.cs | Interest-only, pooling  |
| Create | GatewayInterestTracker.cs              | Interest tracking        |
| Modify | ReplyMapper.cs                         | Full _GR_. handling      |
| Modify | LeafNodeManager.cs, LeafConnection.cs  | Solicited, JetStream     |
| Modify | LeafLoopDetector.cs                    | $LDS. refinement         |
| Modify | RouteManager.cs, RouteConnection.cs    | Pooling, permissions     |
| Create | RoutePool.cs                           | Connection pool          |
| Modify | RouteCompressionCodec.cs               | Wire into connection     |

Track E: Services (MQTT, Accounts, Config, WebSocket, Monitoring)

Gaps #6 (HIGH), #9, #10, #14, #15 (MEDIUM)
Dependencies: None (starts immediately)
Tests to port: ~59 mqtt + ~34 accounts + ~105 config + ~53 websocket + ~118 monitoring = ~369

E1: MQTT Protocol (Gap #6, 10.9x)

  1. Session persistence with JetStream-backed ClientID mapping
  2. Will message handling
  3. QoS 1/2 tracking with packet ID mapping and retry
  4. Retained messages (per-account JetStream stream)
  5. MQTT wildcard translation (+ → *, # → >, / → .)
  6. Session flapper detection with backoff
  7. MaxAckPending enforcement
  8. CONNECT packet validation and version negotiation
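The wildcard translation of item 5 is mechanical enough to sketch directly. A simplified Go version, assuming well-formed topic filters (the real server also has to handle topics that themselves contain '.' or empty levels, which this sketch ignores):

```go
package main

import (
	"fmt"
	"strings"
)

// mqttToNATS translates an MQTT topic filter into a NATS subject:
// the '+' single-level wildcard becomes '*', the '#' multi-level
// wildcard becomes '>', and '/' level separators become '.'.
func mqttToNATS(filter string) string {
	levels := strings.Split(filter, "/")
	for i, l := range levels {
		switch l {
		case "+":
			levels[i] = "*"
		case "#":
			levels[i] = ">" // MQTT only allows '#' as the final level
		}
	}
	return strings.Join(levels, ".")
}

func main() {
	fmt.Println(mqttToNATS("sensors/+/temp")) // sensors.*.temp
	fmt.Println(mqttToNATS("sensors/#"))      // sensors.>
}
```

Both wildcard pairs have matching semantics ('+'/'*' match exactly one level, '#'/'>' match the rest of the subject), which is what makes MQTT subscriptions expressible as NATS subscriptions at all.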

E2: Account Management (Gap #9, 13x)

  1. Service/stream export whitelist enforcement
  2. Service import with weighted destination selection
  3. Cycle detection for import chains
  4. Response tracking (request-reply latency)
  5. Account-level JetStream limits
  6. Client tracking per account with eviction
  7. Weighted subject mappings for traffic shaping
  8. System account with $SYS.> handling
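Cycle detection for import chains (item 3) is a path-based graph walk: follow each account's imports and reject if the walk returns to an account already on the current path. A minimal Go sketch, where `imports` maps an account name to the accounts it imports from (data model invented for illustration):

```go
package main

import "fmt"

// hasImportCycle reports whether following imports from start ever
// revisits an account already on the current DFS path — exactly the
// condition an import-chain validator must reject.
func hasImportCycle(imports map[string][]string, start string) bool {
	onPath := map[string]bool{}
	var visit func(acc string) bool
	visit = func(acc string) bool {
		if onPath[acc] {
			return true // back-edge: cycle found
		}
		onPath[acc] = true
		for _, next := range imports[acc] {
			if visit(next) {
				return true
			}
		}
		onPath[acc] = false // unwind: acc is off the current path
		return false
	}
	return visit(start)
}

func main() {
	cyclic := map[string][]string{
		"A": {"B"},
		"B": {"C"},
		"C": {"A"}, // closes the loop
	}
	fmt.Println(hasImportCycle(cyclic, "A"))                          // true
	fmt.Println(hasImportCycle(map[string][]string{"A": {"B"}}, "A")) // false
}
```

Tracking membership of the current path (rather than a global visited set) is what distinguishes a true cycle from two chains that merely share a downstream account.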

E3: Configuration & Hot Reload (Gap #14, 2.7x)

  1. SIGHUP signal handling (PosixSignalRegistration)
  2. Auth change propagation (disconnect invalidated clients)
  3. TLS certificate reloading for rotation
  4. JetStream config changes at runtime
  5. Logger reconfiguration without restart
  6. Account list updates with connection cleanup

E4: WebSocket Support (Gap #15, 1.3x)

  1. WebSocket-specific TLS configuration
  2. Origin checking refinement
  3. permessage-deflate compression negotiation
  4. JWT auth through WebSocket upgrade

E5: Monitoring & Events (Gap #10, 3.5x)

  1. Full system event payloads (connect/disconnect/auth)
  2. Message trace propagation through full pipeline
  3. Closed connection tracking (ring buffer for /connz)
  4. Account-scoped monitoring (/connz?acc=ACCOUNT)
  5. Sort options for monitoring endpoints
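The closed-connection tracking in item 3 is a fixed-size ring buffer: once full, each new record overwrites the oldest, so /connz can report the most recent N closures without unbounded memory. A hedged Go sketch (record type reduced to a string for brevity):

```go
package main

import "fmt"

// closedRing keeps the last N closed-connection records; older
// entries are overwritten once the buffer wraps.
type closedRing struct {
	buf  []string
	next int  // slot the next record will be written to
	full bool // true once the buffer has wrapped at least once
}

func newClosedRing(n int) *closedRing {
	return &closedRing{buf: make([]string, n)}
}

func (r *closedRing) add(rec string) {
	r.buf[r.next] = rec
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}

// snapshot returns the retained records oldest-first, which is the
// order a monitoring endpoint would render them in.
func (r *closedRing) snapshot() []string {
	if !r.full {
		return append([]string(nil), r.buf[:r.next]...)
	}
	out := append([]string(nil), r.buf[r.next:]...)
	return append(out, r.buf[:r.next]...)
}

func main() {
	r := newClosedRing(3)
	for _, c := range []string{"c1", "c2", "c3", "c4"} {
		r.add(c)
	}
	fmt.Println(r.snapshot()) // [c2 c3 c4]: c1 was overwritten
}
```

The same structure serves the account-scoped view (item 4) by filtering the snapshot on the record's account field before rendering.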

Key Files

| Action | File                        | Notes                               |
|--------|-----------------------------|-------------------------------------|
| Modify | All Mqtt/ files             | Major expansion                     |
| Modify | Account.cs, AuthService.cs  | Import/export, limits               |
| Create | AccountImportExport.cs      | Import/export logic                 |
| Create | AccountLimits.cs            | Per-account JetStream limits        |
| Modify | ConfigReloader.cs           | Signal handling, auth propagation   |
| Modify | WebSocket/ files            | TLS, compression, JWT               |
| Modify | Monitoring handlers         | Events, trace, connz                |
| Modify | MessageTraceContext.cs      | 22 → ~200+ lines                    |

DB Update Protocol

For every Go test ported:

UPDATE go_tests
SET status='mapped',
    dotnet_test='<DotNetTestMethodName>',
    dotnet_file='<DotNetTestFile.cs>',
    notes='Ported from <GoFunctionName> in <go_file>:<line>'
WHERE go_file='<go_test_file>' AND go_test='<GoTestName>';

For Go tests that cannot be ported (e.g., signal_test.go on .NET):

UPDATE go_tests
SET status='not_applicable',
    notes='<reason: e.g., Unix signal handling not applicable to .NET>'
WHERE go_file='<go_test_file>' AND go_test='<GoTestName>';

Batch DB updates at the end of each sub-phase to avoid per-test overhead.


Execution Order

Week 1-2: Tracks A, D, E start in parallel (3 worktrees)
Week 2-3: Track C starts (after Track A merges for C4)
Week 3-4: Track B starts (after Tracks A + C merge)

Merge order: A → D → E → C → B → main

Task Dependencies

| Task | Track         | Blocked By          |
|------|---------------|---------------------|
| #3   | A: Storage    | (none)              |
| #6   | D: Networking | (none)              |
| #7   | E: Services   | (none)              |
| #5   | C: Protocol   | #3 (for C4 mirrors) |
| #4   | B: Consensus  | #3, #5              |

Success Criteria

  • All 15 gaps from structuregaps.md addressed
  • ~1,194 additional Go tests mapped in test_parity.db
  • Mapped ratio: 29% → ~70%
  • All new tests passing (dotnet test green)
  • Feature-first: each feature validated by its corresponding Go tests