natsdotnet/docs/plans/2026-02-24-full-production-parity-design.md
Joseph Doherty d445a9fae1 docs: add full production parity design
6-wave implementation plan covering RAFT consensus, FileStore block
engine, internal data structures, JetStream clustering, and remaining
subsystem test suites. Targets ~1,160 new tests for ~75% Go parity.
2026-02-23 20:31:57 -05:00

# Full Production Parity Design
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans to implement this plan task-by-task.

**Goal:** Close all remaining gaps between the Go NATS server and the .NET port, in both implementation code and test coverage, to reach full production parity.

**Current state:** 1,081 tests passing. Core pub/sub, JetStream basics, MQTT packet parsing, and JWT claims are ported. Three major implementation gaps remain: RAFT consensus, the FileStore block engine, and internal data structures (AVL, subject tree, GSL, time hash wheel).

**Approach:** 6-wave, slice-by-slice TDD, ordered by dependency. Each wave builds on the prior wave's production code and tests. Independent subsystems within a wave run as parallel subagents.

---
## Gap Analysis Summary
### Implementation Gaps
| Gap | Go Source | .NET Status | Impact |
|-----|-----------|-------------|--------|
| RAFT consensus | `server/raft.go` (5,800 lines) | Missing entirely | Blocks clustered JetStream |
| FileStore block engine | `server/filestore.go` (337KB) | Flat JSONL stub | Blocks persistent JetStream |
| Internal data structures | `server/avl/`, `server/stree/`, `server/gsl/`, `server/thw/` | Missing entirely | Blocks FileStore + RAFT |
### Test Coverage Gap
- Go server tests: ~2,937 test functions
- .NET tests: 1,081 (~36.8% of the Go test count)
- Gap: ~1,856 tests across all subsystems
---
## Wave 1: Inventory + Scaffolding
**Purpose:** Establish project structure, create stub files, set up namespaces.

**Deliverables:**

- Namespace scaffolding: `NATS.Server.Internal.Avl`, `NATS.Server.Internal.SubjectTree`, `NATS.Server.Internal.Gsl`, `NATS.Server.Internal.TimeHashWheel`
- Stub interfaces for FileStore block engine
- Stub interfaces for RAFT node, log, transport
- Test project directory structure for all new subsystems

**Tests:** 0 (scaffolding only)

---
## Wave 2: Internal Data Structures
**Purpose:** Port Go's internal data structures that FileStore and RAFT depend on.
### AVL Tree (`server/avl/`)
- Sparse sequence set backed by AVL-balanced binary tree
- Used for JetStream ack tracking (consumer pending sets)
- Key operations: `Insert`, `Delete`, `Contains`, `Range`, `Size`
- Go reference: `server/avl/seqset.go`
- Port as `NATS.Server.Internal.Avl.SequenceSet`
- ~15 tests from Go's `TestSequenceSet*`
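
To make the "sparse" part of the sequence set concrete, here is a minimal Go sketch of the operations listed above. It is an illustration under stated assumptions, not the code in `server/avl/seqset.go`: a plain map keyed by bucket base stands in for the AVL tree, and `bucketSpan`, `locate`, and the type layout are hypothetical names chosen for the example.

```go
package main

import "fmt"

// bucketSpan is how many consecutive sequences one bucket covers (assumed value).
const bucketSpan = 64

// SequenceSet is a sparse set of uint64 sequence numbers.
// Assumption: the map of buckets stands in for the AVL tree of nodes.
type SequenceSet struct {
	buckets map[uint64]uint64 // bucket base -> bitmask of present offsets
	size    int
}

func NewSequenceSet() *SequenceSet {
	return &SequenceSet{buckets: make(map[uint64]uint64)}
}

// locate splits a sequence into its bucket base and its bit within the bucket.
func locate(seq uint64) (base, mask uint64) {
	return seq - seq%bucketSpan, uint64(1) << (seq % bucketSpan)
}

// Insert adds seq and reports whether it was newly added.
func (s *SequenceSet) Insert(seq uint64) bool {
	base, mask := locate(seq)
	if s.buckets[base]&mask != 0 {
		return false
	}
	s.buckets[base] |= mask
	s.size++
	return true
}

// Contains reports whether seq is in the set.
func (s *SequenceSet) Contains(seq uint64) bool {
	base, mask := locate(seq)
	return s.buckets[base]&mask != 0
}

// Delete removes seq and drops the bucket once it becomes empty.
func (s *SequenceSet) Delete(seq uint64) bool {
	base, mask := locate(seq)
	bitsSet, ok := s.buckets[base]
	if !ok || bitsSet&mask == 0 {
		return false
	}
	bitsSet &^= mask
	if bitsSet == 0 {
		delete(s.buckets, base)
	} else {
		s.buckets[base] = bitsSet
	}
	s.size--
	return true
}

// Size returns the number of sequences in the set.
func (s *SequenceSet) Size() int { return s.size }

func main() {
	s := NewSequenceSet()
	s.Insert(1)
	s.Insert(2)
	s.Insert(1_000_000) // far away: costs one extra bucket, not a dense range
	s.Delete(2)
	fmt.Println(s.Size(), s.Contains(1), s.Contains(2), len(s.buckets))
}
```

The point of the bucketed layout is that memory grows with the number of occupied buckets rather than with the distance between the smallest and largest sequence, which suits pending sets with large gaps.
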
### Subject Tree (`server/stree/`)
- Trie for per-subject state in streams (sequence tracking, last-by-subject)
- Supports wildcard iteration (`*`, `>`)
- Go reference: `server/stree/stree.go`
- Port as `NATS.Server.Internal.SubjectTree.SubjectTree<T>`
- ~15 tests from Go's `TestSubjectTree*`
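
The wildcard iteration above relies on NATS subject semantics: `*` matches exactly one token and `>` matches one or more trailing tokens. The Go sketch below shows that rule applied to per-subject state. It is a simplification: a flat map with a linear scan stands in for the trie, and `matchesFilter`, `matchAll`, and `subjectState` are hypothetical names, not the `server/stree` API.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// matchesFilter reports whether a literal subject matches a filter that may
// contain "*" (exactly one token) and ">" (one or more trailing tokens).
func matchesFilter(subject, filter string) bool {
	st := strings.Split(subject, ".")
	ft := strings.Split(filter, ".")
	for i, f := range ft {
		if f == ">" {
			// ">" must be the last filter token and needs a token to consume.
			return i == len(ft)-1 && i < len(st)
		}
		if i >= len(st) {
			return false
		}
		if f != "*" && f != st[i] {
			return false
		}
	}
	return len(st) == len(ft)
}

// subjectState is a hypothetical per-subject record (first/last sequence, count).
type subjectState struct{ First, Last, Msgs uint64 }

// matchAll returns the stored subjects matching filter, sorted for stable output.
// Assumption: a real subject tree walks trie nodes instead of scanning every key.
func matchAll(states map[string]subjectState, filter string) []string {
	var out []string
	for subj := range states {
		if matchesFilter(subj, filter) {
			out = append(out, subj)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	states := map[string]subjectState{
		"orders.eu.new":       {First: 1, Last: 9, Msgs: 3},
		"orders.us.new":       {First: 2, Last: 8, Msgs: 2},
		"orders.eu.cancelled": {First: 4, Last: 4, Msgs: 1},
	}
	fmt.Println(matchAll(states, "orders.*.new"))
	fmt.Println(matchAll(states, "orders.eu.>"))
}
```

A function like `matchesFilter` is also a convenient oracle for the ported tests: a trie-based `SubjectTree<T>` should return the same subjects as this brute-force scan for any filter.
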
### Generic Subject List (`server/gsl/`)
- Optimized trie for subscription matching (alternative to SubList for specific paths)
- Go reference: `server/gsl/gsl.go`
- Port as `NATS.Server.Internal.Gsl.GenericSubjectList<T>`
- ~15 tests from Go's `TestGSL*`
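
The subscription-matching direction is the reverse of the subject tree: the stored keys may contain wildcards and the query is a literal subject. A minimal Go sketch of that lookup follows. It assumes a simple token trie with no caching or removal, and `SubjectList`, `Insert`, and `Match` are illustrative names rather than the `server/gsl` API.

```go
package main

import (
	"fmt"
	"strings"
)

// node is one trie level; children are keyed by token, including "*" and ">".
type node[T any] struct {
	children map[string]*node[T]
	values   []T
}

func newNode[T any]() *node[T] {
	return &node[T]{children: map[string]*node[T]{}}
}

// SubjectList maps (possibly wildcarded) subscription subjects to values.
type SubjectList[T any] struct{ root *node[T] }

func NewSubjectList[T any]() *SubjectList[T] {
	return &SubjectList[T]{root: newNode[T]()}
}

// Insert stores value under a subscription subject such as "orders.*.new".
func (l *SubjectList[T]) Insert(subject string, value T) {
	n := l.root
	for _, tok := range strings.Split(subject, ".") {
		child, ok := n.children[tok]
		if !ok {
			child = newNode[T]()
			n.children[tok] = child
		}
		n = child
	}
	n.values = append(n.values, value)
}

// Match returns every value whose subscription matches the literal subject.
func (l *SubjectList[T]) Match(subject string) []T {
	var out []T
	match(l.root, strings.Split(subject, "."), &out)
	return out
}

func match[T any](n *node[T], toks []string, out *[]T) {
	if len(toks) == 0 {
		*out = append(*out, n.values...)
		return
	}
	// ">" consumes all remaining tokens; at least one remains at this point.
	if fwc, ok := n.children[">"]; ok {
		*out = append(*out, fwc.values...)
	}
	// "*" consumes exactly the current token.
	if pwc, ok := n.children["*"]; ok {
		match(pwc, toks[1:], out)
	}
	// A literal child consumes the current token only on an exact match.
	if tok := toks[0]; tok != "*" && tok != ">" {
		if lit, ok := n.children[tok]; ok {
			match(lit, toks[1:], out)
		}
	}
}

func main() {
	l := NewSubjectList[string]()
	l.Insert("orders.>", "audit")
	l.Insert("orders.*.new", "fulfilment")
	l.Insert("orders.eu.new", "eu-desk")
	l.Insert("billing.>", "billing")
	fmt.Println(l.Match("orders.eu.new"))
}
```

Each lookup visits at most three children per level (`>`, `*`, and the literal token), so cost depends on subject depth and wildcard fan-out rather than on the total number of subscriptions.
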
### Time Hash Wheel (`server/thw/`)
- Efficient TTL expiration using hash wheel (O(1) insert/cancel, O(bucket) tick)
- Used for message expiry in MemStore and FileStore
- Go reference: `server/thw/thw.go`
- Port as `NATS.Server.Internal.TimeHashWheel.TimeHashWheel<T>`
- ~15 tests from Go's `TestTimeHashWheel*`
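
The complexity claims above follow from hashing each expiry time into a fixed ring of slots. The Go sketch below illustrates the idea with a deliberately tiny wheel. It is not the `server/thw` implementation: `tickNanos`, `wheelSize`, and the method names are assumptions, and it expects non-negative timestamps that are not already in the past when added.

```go
package main

import "fmt"

const (
	tickNanos = int64(1_000_000_000) // one slot per second (assumed granularity)
	wheelSize = 8                    // deliberately tiny for the example
)

// HashWheel tracks sequence -> expiry, bucketed by expiry tick modulo wheelSize.
type HashWheel struct {
	slots [wheelSize]map[uint64]int64 // seq -> absolute expiry in nanoseconds
	last  int64                       // last tick processed by Expire
	count int
}

func slotFor(expires int64) int {
	return int((expires / tickNanos) % wheelSize)
}

// Add schedules seq to expire at the given absolute time. O(1).
func (w *HashWheel) Add(seq uint64, expires int64) {
	i := slotFor(expires)
	if w.slots[i] == nil {
		w.slots[i] = make(map[uint64]int64)
	}
	if _, ok := w.slots[i][seq]; !ok {
		w.count++
	}
	w.slots[i][seq] = expires
}

// Remove cancels a scheduled expiry. O(1); the caller supplies the expiry
// so the slot can be found without a secondary index.
func (w *HashWheel) Remove(seq uint64, expires int64) bool {
	i := slotFor(expires)
	if _, ok := w.slots[i][seq]; !ok {
		return false
	}
	delete(w.slots[i], seq)
	w.count--
	return true
}

// Expire calls fn for every entry due at or before now and removes it.
// Only slots for ticks since the previous call are visited, capped at one lap.
func (w *HashWheel) Expire(now int64, fn func(seq uint64)) {
	nowTick := now / tickNanos
	start := w.last
	if nowTick-start >= wheelSize {
		start = nowTick - wheelSize + 1 // one full lap already covers every slot
	}
	for t := start; t <= nowTick; t++ {
		slot := w.slots[t%wheelSize]
		for seq, exp := range slot {
			if exp <= now { // entries a lap or more ahead share the slot but stay
				delete(slot, seq)
				w.count--
				fn(seq)
			}
		}
	}
	w.last = nowTick
}

// Len returns the number of scheduled entries.
func (w *HashWheel) Len() int { return w.count }

func main() {
	w := &HashWheel{}
	w.Add(1, 2*tickNanos)
	w.Add(2, 3*tickNanos)
	w.Add(3, 11*tickNanos) // same slot as seq 2, one lap later
	w.Remove(1, 2*tickNanos)
	w.Expire(3*tickNanos+tickNanos/2, func(seq uint64) { fmt.Println("expired", seq) })
	fmt.Println("remaining", w.Len())
}
```

Requiring the expiry on `Remove` keeps cancellation O(1) without a second map from sequence to slot; a port could equally store that mapping if callers cannot supply the original expiry.
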

**Total tests:** ~60

---
## Wave 3: FileStore Block Engine
**Purpose:** Replace the flat JSONL FileStore stub with Go-compatible block-based storage.
### Design Decisions
- **Behavioral equivalence** — same 64MB block boundaries and semantics, not byte-level Go file compatibility
- **Block format:** Each block is a separate file containing sequential messages with headers
- **Compression:** S2 (Snappy variant) per-block, using IronSnappy or equivalent .NET library
- **Encryption:** AES-GCM per-block (matching Go's encryption support)
- **Recovery:** Block-level recovery on startup (scan for valid messages, rebuild index)
### Components
1. **Block Manager** — manages block files, rotation at 64MB, compaction
2. **Message Encoding** — per-message header (sequence, timestamp, subject, data length) + payload
3. **Index Layer** — in-memory index mapping sequence → block + offset
4. **Subject Index** — per-subject first/last sequence tracking using SubjectTree (Wave 2)
5. **Purge/Compact** — subject-based purge, sequence-based purge, compaction
6. **Recovery** — startup block scanning, index rebuild
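
Components 2, 3, and 6 fit together as encode, index, and rescan. The Go sketch below shows one way they could interact. The record layout is hypothetical (little-endian, fixed 22-byte header) and is not the Go server's on-disk format, which this plan does not aim to reproduce byte-for-byte; `appendRecord`, `readRecord`, and `rebuildIndex` are illustrative names.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Hypothetical record layout (little-endian), NOT the Go on-disk format:
//   seq uint64 | timestamp int64 | subjLen uint16 | dataLen uint32 | subject | data
const headerSize = 8 + 8 + 2 + 4

type Record struct {
	Seq       uint64
	Timestamp int64
	Subject   string
	Data      []byte
}

// appendRecord appends the encoded record to block and returns the grown block.
func appendRecord(block []byte, r Record) []byte {
	var hdr [headerSize]byte
	binary.LittleEndian.PutUint64(hdr[0:], r.Seq)
	binary.LittleEndian.PutUint64(hdr[8:], uint64(r.Timestamp))
	binary.LittleEndian.PutUint16(hdr[16:], uint16(len(r.Subject)))
	binary.LittleEndian.PutUint32(hdr[18:], uint32(len(r.Data)))
	block = append(block, hdr[:]...)
	block = append(block, r.Subject...)
	return append(block, r.Data...)
}

// readRecord decodes the record starting at off and returns the next offset.
func readRecord(block []byte, off int) (Record, int, error) {
	if off < 0 || len(block)-off < headerSize {
		return Record{}, 0, errors.New("truncated header")
	}
	h := block[off:]
	subjLen := int(binary.LittleEndian.Uint16(h[16:]))
	dataLen := int(binary.LittleEndian.Uint32(h[18:]))
	end := headerSize + subjLen + dataLen
	if len(h) < end {
		return Record{}, 0, errors.New("truncated body")
	}
	r := Record{
		Seq:       binary.LittleEndian.Uint64(h[0:]),
		Timestamp: int64(binary.LittleEndian.Uint64(h[8:])),
		Subject:   string(h[headerSize : headerSize+subjLen]),
		Data:      append([]byte(nil), h[headerSize+subjLen:end]...),
	}
	return r, off + end, nil
}

// rebuildIndex scans a block and maps sequence -> byte offset, stopping at the
// first record that fails to decode (the recovery step in miniature).
func rebuildIndex(block []byte) map[uint64]int {
	idx := make(map[uint64]int)
	for off := 0; off < len(block); {
		r, next, err := readRecord(block, off)
		if err != nil {
			break
		}
		idx[r.Seq] = off
		off = next
	}
	return idx
}

func main() {
	var block []byte
	block = appendRecord(block, Record{Seq: 1, Timestamp: 100, Subject: "orders.new", Data: []byte("a")})
	block = appendRecord(block, Record{Seq: 2, Timestamp: 200, Subject: "orders.new", Data: []byte("bc")})
	fmt.Println(rebuildIndex(block))
	// A torn final write leaves only the intact prefix indexed.
	fmt.Println(rebuildIndex(block[:len(block)-1]))
}
```

A production format would also need a per-record checksum so that recovery can tell a torn write from valid data; the sketch only detects truncation. Block rotation would wrap `appendRecord` with a size check against the 64MB limit.
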
### Go Reference Files
- `server/filestore.go` — main implementation
- `server/filestore_test.go` — test suite

**Total tests:** ~80 (store/load, block rotation, compression, encryption, purge, recovery, subject filtering)

---
## Wave 4: RAFT Consensus
**Purpose:** Faithful behavioral port of Go's RAFT implementation for clustered JetStream.
### Design Decisions
- **Faithful Go port** — not a third-party RAFT library; port Go's `raft.go` directly
- **Same state machine semantics** — leader election, log replication, snapshots, membership changes
- **Transport abstraction** — pluggable transport (in-process for tests, TCP for production)
### Components
1. **RAFT Node** — state machine (Follower → Candidate → Leader), term/vote tracking
2. **Log Storage** — append-only log with compaction, backed by FileStore blocks (Wave 3)
3. **Election** — randomized timeout, RequestVote RPC, majority quorum
4. **Log Replication** — AppendEntries RPC, leader → follower catch-up, conflict resolution
5. **Snapshots** — periodic state snapshots, snapshot transfer to lagging followers
6. **Membership Changes** — joint consensus for adding/removing nodes
7. **Transport** — RPC abstraction with in-process and TCP implementations
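
The node and election components reduce to a few rules that the election tests will exercise: majority quorum, one vote per term, stepping down on a higher term, and refusing candidates with a stale log. The Go sketch below shows those rules as described in the Raft algorithm in general. It is not a transcription of `server/raft.go`: RPCs are plain method calls, timeouts are omitted, and all type and method names are illustrative.

```go
package main

import "fmt"

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// Node holds the minimum state needed to decide votes.
type Node struct {
	ID           string
	State        State
	Term         uint64
	VotedFor     string
	LastLogTerm  uint64
	LastLogIndex uint64
}

type VoteRequest struct {
	Term         uint64
	Candidate    string
	LastLogTerm  uint64
	LastLogIndex uint64
}

// quorum is the majority size for a cluster of n voters.
func quorum(n int) int { return n/2 + 1 }

// HandleVoteRequest applies the generic Raft vote rules.
func (n *Node) HandleVoteRequest(req VoteRequest) bool {
	if req.Term < n.Term {
		return false // stale candidate
	}
	if req.Term > n.Term {
		// A higher term always forces a step-down and clears the old vote.
		n.Term, n.VotedFor, n.State = req.Term, "", Follower
	}
	upToDate := req.LastLogTerm > n.LastLogTerm ||
		(req.LastLogTerm == n.LastLogTerm && req.LastLogIndex >= n.LastLogIndex)
	if (n.VotedFor == "" || n.VotedFor == req.Candidate) && upToDate {
		n.VotedFor = req.Candidate
		return true
	}
	return false
}

// StartElection runs one election round against in-process peers.
// Simplification: vote replies carry no term, so a losing candidate simply
// stays Candidate until a (not modelled) timeout or leader contact.
func (n *Node) StartElection(peers []*Node) bool {
	n.State, n.Term, n.VotedFor = Candidate, n.Term+1, n.ID
	req := VoteRequest{Term: n.Term, Candidate: n.ID, LastLogTerm: n.LastLogTerm, LastLogIndex: n.LastLogIndex}
	votes := 1 // own vote
	for _, p := range peers {
		if p.HandleVoteRequest(req) {
			votes++
		}
	}
	if votes >= quorum(len(peers)+1) {
		n.State = Leader
		return true
	}
	return false
}

func main() {
	a, b, c := &Node{ID: "a"}, &Node{ID: "b"}, &Node{ID: "c"}
	fmt.Println("a elected:", a.StartElection([]*Node{b, c}), "term", a.Term)
	fmt.Println("b elected:", b.StartElection([]*Node{a, c}), "term", b.Term)
	fmt.Println("a is still leader:", a.State == Leader)
}
```

Calling peers directly like this mirrors the in-process transport proposed for tests: election logic can be verified deterministically before any TCP transport exists.
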
### Go Reference Files
- `server/raft.go` — main implementation (5,800 lines)
- `server/raft_test.go` — test suite

**Total tests:** ~70 (election, log replication, snapshots, membership, split-brain, network partition simulation)

---
## Wave 5: JetStream Clustering + Concurrency
**Purpose:** Wire RAFT into JetStream for clustered operation; add NORACE concurrency tests.
### Components
1. **Meta-Controller**: cluster-wide RAFT group for stream/consumer placement
   - Ports Go's `jetStreamCluster` struct
   - Routes `$JS.API.*` requests through meta-group leader
   - Tests from Go's `TestJetStreamClusterCreate`, `TestJetStreamClusterStreamLeaderStepDown`
2. **Per-Stream RAFT Groups**: each R>1 stream gets its own RAFT group
   - Leader accepts publishes, proposes entries, followers apply
   - Tests: create R3 stream, publish, verify all replicas, step down, verify new leader
3. **Per-Consumer RAFT Groups**: consumer ack state replicated via RAFT
   - Tests: ack on leader, verify ack floor propagation, consumer failover
4. **NORACE Concurrency Suite**: tests from Go's `norace_test.go` (which Go builds only when the race detector is off, via the `!race` build tag) ported to `Task.WhenAll` patterns
   - Concurrent pub/sub on same stream
   - Concurrent consumer creates
   - Concurrent stream purge during publish
### Go Reference Files
- `server/jetstream_cluster.go`, `server/jetstream_cluster_test.go`
- `server/norace_test.go`

**Total tests:** ~100

---
## Wave 6: Remaining Subsystem Test Suites
**Purpose:** Port remaining Go test functions across all subsystems not covered by Waves 2-5.
### Subsystems
| Subsystem | Go Tests | Existing .NET | Gap | Files |
|-----------|----------|---------------|-----|-------|
| Config reload | ~92 | 3 | ~89 | `Configuration/` |
| MQTT bridge | ~123 | 50 | ~73 | `Mqtt/` |
| Leaf nodes | ~110 | 2 | ~108 | `LeafNodes/` |
| Accounts/auth | ~64 | 15 | ~49 | `Accounts/` |
| Gateway | ~87 | 2 | ~85 | `Gateways/` |
| Routes | ~73 | 2 | ~71 | `Routes/` |
| Monitoring | ~45 | 7 | ~38 | `Monitoring/` |
| Client protocol | ~120 | 30 | ~90 | root test dir |
| JetStream API | ~200 | 20 | ~180 | `JetStream/` |
### Approach
- Each subsystem is an independent parallel subagent task
- Tests organized by .NET namespace matching existing conventions
- Each test file has header comment mapping to Go source test function names
- Self-contained test helpers duplicated per file (no shared TestHelpers)
- Gate verification between subsystem batches

**Total tests:** ~780-850

---
## Dependency Graph
```
Wave 1 (Scaffolding) ──┬──► Wave 2 (Data Structures) ──► Wave 3 (FileStore) ──► Wave 4 (RAFT) ──► Wave 5 (Clustering)
                       └──► Wave 6 (Subsystem Suites) [parallel, independent of Waves 2-5]
```
Wave 6 subsystems are mutually independent and can execute in parallel. Waves 2-5 are sequential.

---
## Estimated Totals
| Metric | Value |
|--------|-------|
| New implementation code | ~15,000-20,000 lines |
| New test code | ~12,000-15,000 lines |
| New tests | ~1,160 |
| Final test count | ~2,241 |
| Final Go parity | ~75% of Go test functions |
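
The test totals can be cross-checked against the per-wave estimates. The short Go program below sums the wave figures stated in this plan and derives the parity ratios from the two counts in the gap analysis (1,081 existing .NET tests, ~2,937 Go test functions). The helper names are illustrative. The waves sum to between 1,090 and 1,160 new tests, so the ~1,160 and ~2,241 figures in the table correspond to the upper end of the Wave 6 estimate.

```go
package main

import "fmt"

// testRange is an estimated test count; lo == hi when the plan gives one number.
type testRange struct{ lo, hi int }

// waveTests holds the per-wave estimates from this plan, Wave 1 first.
var waveTests = []testRange{{0, 0}, {60, 60}, {80, 80}, {70, 70}, {100, 100}, {780, 850}}

// sumWaves adds up the lower and upper bounds across all waves.
func sumWaves(waves []testRange) (lo, hi int) {
	for _, w := range waves {
		lo += w.lo
		hi += w.hi
	}
	return lo, hi
}

// parityPercent is the .NET test count as a percentage of the Go test count.
func parityPercent(tests, goTests int) float64 {
	return 100 * float64(tests) / float64(goTests)
}

func main() {
	const existing, goTests = 1081, 2937
	lo, hi := sumWaves(waveTests)
	fmt.Printf("new tests:   %d-%d\n", lo, hi)
	fmt.Printf("final count: %d-%d\n", existing+lo, existing+hi)
	fmt.Printf("parity:      %.1f%% now, %.1f%%-%.1f%% after\n",
		parityPercent(existing, goTests),
		parityPercent(existing+lo, goTests),
		parityPercent(existing+hi, goTests))
}
```
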
## Key Conventions
- xUnit 3 + Shouldly assertions (never `Assert.*`)
- NSubstitute for mocking
- Go reference comments on each ported test: `// Go: TestFunctionName server/file.go:line`
- Self-contained helpers per test file
- C# 14 idioms: primary constructors, collection expressions, file-scoped namespaces
- TDD: write failing test first, then minimal implementation
- Gated commits between waves