Batch 30 Raft Part 1 Design
Date: 2026-02-27
Batch: 30 (Raft Part 1)
Scope: 85 features + 414 unit tests
Dependencies: batches 4 (Logging), 18 (Server Core)
Go source: golang/nats-server/server/raft.go
Problem
Batch 30 is the first major Raft execution tranche and includes foundational node bootstrap, election/follower/leader loops (through runCatchup), append-entry/vote encoding helpers, and server-level Raft-node registration and lookup methods. The mapped test surface is very large (414 tests) and includes both direct Raft tests and broad JetStream cluster regressions.
The design goal is to make Batch 30 executable in a deterministic, evidence-driven way without repeating the placeholder-test/stub drift seen in earlier backlog files.
Context Findings
Required command results
- batch show 30 --db porting.db
  - Status: pending
  - Features: 85 (all currently deferred)
  - Tests: 414 (all currently deferred)
  - Depends on: 4, 18
- batch list --db porting.db
  - Batch 30 is ordered before Batch 31 (Raft Part 2) and is a dependency anchor for JetStream cluster batches.
- report summary --db porting.db
  - Overall progress: 1924/6942 (27.7%)
  - Deferred backlog remains dominant, so verification rigor is mandatory.
Feature-map and codebase findings
- Batch 30 feature IDs are concentrated in raft.go line ranges 137-3237 plus package/helper functions in 4354-4753.
- Existing .NET Raft baseline exists in dotnet/src/ZB.MOM.NatsNet.Server/JetStream/RaftTypes.cs, but many mapped Batch 30 method targets are still missing or only partially approximated.
- Server-level mapped methods (BootstrapRaftNode, InitRaftNode, StartRaftNode, RegisterRaftNode, etc.) do not currently exist as first-class NatsServer methods.
Test-map findings (critical)
- Batch 30 tests are highly skewed:
  - 366 tests map to feature 2683 (raft.shutdown)
  - 3 tests map to feature 2715 (appendEntry.encode)
  - remaining tests map to external feature IDs (45 tests)
- Only 19 tests come from server/raft_test.go; the rest are mostly JetStream cluster/supercluster/concurrency/MQTT regression tests that rely on raft behavior transitively.
- Existing ImplBacklog/*.Impltests.cs files contain many superficial placeholder-style tests and cannot be treated as verification evidence.
Approaches
Approach A: Full-surface implementation (85 features + all 414 tests in one pass)
- Pros: maximal immediate tracker movement.
- Cons: extremely high risk, failures are hard to trace to a specific change, and a relapse into stub/fake-pass tests is likely.
Approach B (Recommended): Feature-first Raft core in five groups, then two-tier test strategy
- Tier 1: implement and verify all 85 Batch 30 features in grouped, dependency-ordered slices.
- Tier 2: port direct Raft unit tests first, then process broader mapped regression tests class-by-class with explicit defer rules when runtime infra is missing.
- Pros: coherent sequencing, auditable status transitions, supports strict anti-stub controls.
- Cons: more checkpoints and status operations.
Approach C: Infra-first (build cluster harness before feature completion)
- Pros: can unlock larger integration tests earlier.
- Cons: violates YAGNI for this batch and delays core feature parity.
Decision: Approach B.
Proposed Design
1. Architecture and file strategy
- Keep Raft model/codec/state-machine logic centered in JetStream/RaftTypes.cs, but split into partial files when method count becomes unreviewable (for example: lifecycle, codecs, follower/leader loops).
- Add explicit NatsServer Raft integration surface in a dedicated partial (NatsServer.Raft.cs) instead of reflection-only lifecycle hooks.
- Preserve existing locking/concurrency primitives already used in repo (ReaderWriterLockSlim, Interlocked, Channel<T>, IpQueue<T>), mapping Go intent rather than line-by-line syntax.
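As an illustration of "mapping Go intent", the following is a minimal Go sketch of a group-keyed node registry guarded by a read/write lock. The names `registry`, `node`, `register`, `lookup`, and `unregister` are hypothetical and are not taken from raft.go or the .NET code; the sketch only shows the locking shape that a server-level registration and lookup surface would need to preserve.

```go
package main

import (
	"fmt"
	"sync"
)

// node is a stand-in for a Raft node; the real type is far richer.
type node struct{ group string }

// registry is a hypothetical group-name -> node map guarded by an RWMutex.
type registry struct {
	mu    sync.RWMutex
	nodes map[string]*node
}

func newRegistry() *registry {
	return &registry{nodes: make(map[string]*node)}
}

// register stores n under its group and reports whether the slot was free.
func (r *registry) register(n *node) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, exists := r.nodes[n.group]; exists {
		return false
	}
	r.nodes[n.group] = n
	return true
}

// lookup returns the node for group, or nil if none is registered.
func (r *registry) lookup(group string) *node {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.nodes[group]
}

// unregister removes the node for group, if present.
func (r *registry) unregister(group string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.nodes, group)
}

func main() {
	r := newRegistry()
	fmt.Println(r.register(&node{group: "meta"})) // first registration succeeds
	fmt.Println(r.register(&node{group: "meta"})) // duplicate is rejected
	fmt.Println(r.lookup("meta") != nil)
	r.unregister("meta")
	fmt.Println(r.lookup("meta") != nil)
}
```

In the .NET port, the read/write split in this sketch would most naturally correspond to ReaderWriterLockSlim, which the repo already uses.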
2. Feature grouping (max ~20 per group)
- Group A (15): bootstrap/init/server registry and early apply/snapshot prep
  2599,2600,2601,2602,2603,2607,2608,2609,2610,2611,2612,2613,2614,2615,2629
- Group B (14): snapshot lifecycle + leader-state helpers/campaign hooks
  2634,2637,2639,2645,2646,2647,2651,2652,2653,2659,2663,2664,2665,2674
- Group C (20): node runtime loop and follower pipeline scaffolding
  2683,2684,2685,2686,2687,2688,2689,2690,2691,2692,2693,2694,2695,2696,2697,2698,2701,2702,2703,2704
- Group D (19): entry construction/encoding + membership-change handlers
  2705,2706,2707,2708,2709,2710,2711,2712,2714,2715,2716,2717,2718,2719,2720,2721,2722,2723,2724
- Group E (17): leader loop/catchup + peer-state and vote persistence helpers
  2725,2727,2728,2729,2731,2732,2757,2762,2763,2764,2770,2771,2773,2774,2775,2781,2782
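The grouping can be sanity-checked mechanically. The following is a minimal Go sketch that records the five groups and confirms that they add up to the 85 features in scope with no ID listed twice. `featureGroups` and `checkGroups` are hypothetical names for illustration and are not part of the porting tooling.

```go
package main

import "fmt"

// featureGroups mirrors the Group A-E ID lists from this design.
// The variable name and layout are illustrative assumptions.
var featureGroups = map[string][]int{
	"A": {2599, 2600, 2601, 2602, 2603, 2607, 2608, 2609, 2610, 2611, 2612, 2613, 2614, 2615, 2629},
	"B": {2634, 2637, 2639, 2645, 2646, 2647, 2651, 2652, 2653, 2659, 2663, 2664, 2665, 2674},
	"C": {2683, 2684, 2685, 2686, 2687, 2688, 2689, 2690, 2691, 2692, 2693, 2694, 2695, 2696, 2697, 2698, 2701, 2702, 2703, 2704},
	"D": {2705, 2706, 2707, 2708, 2709, 2710, 2711, 2712, 2714, 2715, 2716, 2717, 2718, 2719, 2720, 2721, 2722, 2723, 2724},
	"E": {2725, 2727, 2728, 2729, 2731, 2732, 2757, 2762, 2763, 2764, 2770, 2771, 2773, 2774, 2775, 2781, 2782},
}

// checkGroups returns the total ID count and any ID listed more than once.
func checkGroups(groups map[string][]int) (total int, dups []int) {
	seen := make(map[int]bool)
	for _, ids := range groups {
		for _, id := range ids {
			if seen[id] {
				dups = append(dups, id)
			}
			seen[id] = true
			total++
		}
	}
	return total, dups
}

func main() {
	total, dups := checkGroups(featureGroups)
	fmt.Printf("total=%d duplicates=%v\n", total, dups)
}
```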
3. Test design
- Tier T1 (Raft-direct tests, highest value first): server/raft_test.go mapped IDs (2618,2619,2621,2623,2625,2632,2633,2639,2642,2675,2700,2701,2706,2707,2708,2709,2710,2711,2713).
- Tier T2 (transitive regression tests): remaining mapped classes (JetStreamClusterTests*, JetStreamSuperClusterTests, ConcurrencyTests*, MqttHandlerTests, etc.) processed only with real behavioral assertions; otherwise explicitly deferred with reason.
- Existing placeholder tests are not accepted as evidence; they must be replaced or remain deferred.
4. Verification model
- Enforce per-feature and per-test loops with mandatory build/test gates before any status promotions.
- Enforce max 15 IDs per feature/test batch-update command.
- Require checkpoint protocol between tasks (stub scan, build, targeted tests, full unit test, then status updates).
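The 15-ID cap means that any group larger than 15 (Group C has 20 IDs, Group D has 19, Group E has 17) needs more than one status update. The following is a minimal Go sketch of that chunking; `chunkIDs` and `maxIDsPerUpdate` are hypothetical names and do not reflect the actual porting CLI.

```go
package main

import "fmt"

// maxIDsPerUpdate is the per-command cap stated in the verification model.
const maxIDsPerUpdate = 15

// chunkIDs splits ids into consecutive slices of at most size elements.
// It is a hypothetical helper, not part of the porting CLI.
func chunkIDs(ids []int, size int) [][]int {
	if size <= 0 {
		return nil
	}
	var chunks [][]int
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		chunks = append(chunks, ids[start:end])
	}
	return chunks
}

func main() {
	// Group C from the feature grouping: 20 IDs.
	groupC := []int{
		2683, 2684, 2685, 2686, 2687, 2688, 2689, 2690, 2691, 2692,
		2693, 2694, 2695, 2696, 2697, 2698, 2701, 2702, 2703, 2704,
	}
	for i, c := range chunkIDs(groupC, maxIDsPerUpdate) {
		fmt.Printf("update %d: %d IDs %v\n", i+1, len(c), c)
	}
}
```

Each resulting chunk would be promoted only after the checkpoint steps for its IDs have passed.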
5. Error handling and stuck policy
- If a feature/test is blocked by missing infrastructure, missing prerequisite behavior, or non-deterministic harness gaps:
  - do not leave placeholders,
  - do not force fake-pass tests,
  - mark deferred with explicit blocker reason,
  - continue with next unblocked ID in current group.
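The stuck policy amounts to a small decision rule per ID. The following is a minimal Go sketch of it, assuming a hypothetical `Item` record and `decide` function. The fallback reason for items that have neither a blocker nor evidence is an assumption added for completeness; it is not stated in this design.

```go
package main

import "fmt"

// Status values as named in this design.
const (
	StatusVerified = "verified"
	StatusDeferred = "deferred"
)

// Item is a hypothetical record for one feature or test ID.
type Item struct {
	ID          int
	HasEvidence bool   // real build/test evidence exists
	Blocker     string // non-empty when infra or prerequisites are missing
}

// decide applies the stuck policy: blocked items are deferred with their
// blocker reason, items with evidence are verified, and anything else stays
// deferred (assumed fallback) instead of being promoted via a placeholder.
func decide(it Item) (status, reason string) {
	switch {
	case it.Blocker != "":
		return StatusDeferred, it.Blocker
	case it.HasEvidence:
		return StatusVerified, ""
	default:
		return StatusDeferred, "no verification evidence yet"
	}
}

func main() {
	group := []Item{
		{ID: 2683, Blocker: "cluster harness missing"}, // illustrative blocker text
		{ID: 2684, HasEvidence: true},
		{ID: 2685},
	}
	// A blocked ID does not stop the loop; processing continues with the next ID.
	for _, it := range group {
		status, reason := decide(it)
		fmt.Printf("%d -> %s %q\n", it.ID, status, reason)
	}
}
```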
Risks and Mitigations
- Risk: raft.shutdown mapping concentration (366 tests) causes noisy/non-actionable test wave.
  Mitigation: process tests by class and behavioral relevance; require per-test evidence and deferred reasons for infra-blocked cases.
- Risk: large RaftTypes.cs file becomes unreviewable.
  Mitigation: allow controlled partial-file split while preserving namespaces and API shape.
- Risk: false progress via stubbed placeholders in ImplBacklog.
  Mitigation: mandatory anti-stub scans and hard promotion gates.
Success Criteria
- All 85 features are either verified with evidence or deferred with specific blocker reason (no placeholders).
- Batch 30 mapped tests are processed under the same rule: verified only with real execution evidence, otherwise deferred with explicit reason.
- No new stub/fake-pass patterns in touched production or test files.
- The implementation plan includes strict verification and anti-stub guardrails adapted to both features and tests.
Non-Goals
- Executing implementation in this design document.
- Completing downstream Batch 31+ raft/clustering behavior that is outside Batch 30 feature IDs.
- Building new distributed integration infrastructure beyond what is needed for deterministic unit-level verification.