Generated design docs and implementation plans via Codex for:
- Batch 31: Raft Part 2
- Batch 32: JS Cluster Meta
- Batch 33: JS Cluster Streams
- Batch 34: JS Cluster Consumers
- Batch 35: JS Cluster Remaining
- Batch 36: Stream Lifecycle

All plans include mandatory verification protocol and anti-stub guardrails. Updated batches.md with file paths and planned status.
Batch 31 Raft Part 2 Design
Date: 2026-02-27
Batch: 31 (Raft Part 2)
Scope: 53 features + 19 unit tests
Dependencies: batch 30 (Raft Part 1)
Go source: golang/nats-server/server/raft.go
Problem
Batch 31 covers the second Raft tranche in raft.go (roughly lines 3239-5038), focused on catchup/snapshot transfer, append-entry processing, WAL consistency, quorum tracking, vote request/response handling, and leadership state transitions. The mapped test set (19 tests) is concentrated on candidate/leader transitions, quorum correctness, membership-change edge cases, and snapshot/catchup behavior.
The design goal is to produce an execution-ready plan that enforces evidence-based status changes and prevents placeholder drift across both production features and tests.
Context Findings
Required command results
- batch show 31 --db porting.db
  - Status: pending
  - Features: 53 (currently deferred)
  - Tests: 19 (currently deferred)
  - Depends on: 30
  - Go file: server/raft.go
- batch list --db porting.db
  - Batch 31 is directly gated by Batch 30 and itself gates Batch 32 (JS Cluster Meta).
- report summary --db porting.db
  - Overall progress: 1924/6942 (27.7%)
  - Deferred backlog remains large; verification discipline is required.
Feature and source mapping findings
- Batch 31 feature IDs map in order to raft.go methods from:
  - sendSnapshotToFollower through updateLeader (2733-2750)
  - processAppendEntry through setWriteErrLocked (2751-2777)
  - isClosed through switchToLeader (2778-2796)
- Existing .NET Raft surface is in: dotnet/src/ZB.MOM.NatsNet.Server/JetStream/RaftTypes.cs
- Current comments in RaftTypes.cs still describe algorithm methods as stubbed; Batch 31 must replace those gaps with concrete behavior and tests.
Test mapping findings
- All 19 mapped tests are from server/raft_test.go and map to RaftNodeTests methods.
- dotnet/tests/ZB.MOM.NatsNet.Server.Tests/ImplBacklog/RaftNodeTests.Impltests.cs does not currently exist, so Batch 31 planning should include creating it.
- The mapped tests are behavior-heavy; they cannot be verified using placeholder assertions.
Approaches
Approach A: Monolithic implementation of all 53 features and 19 tests in one pass
- Pros: single sweep.
- Cons: high regression risk, weak traceability, hard to isolate failures.
Approach B (Recommended): Three feature groups (<=20 each) plus two test waves
- Features are implemented in ordered method clusters, each with strict gates before status updates.
- Tests are ported in two behavioral waves (state/quorum first, then snapshot/membership edge cases).
- Pros: bounded scope, better failure isolation, cleaner status evidence.
- Cons: more checkpoint overhead.
Approach C: Test-first across all 19 tests, then fill feature gaps
- Pros: quickly exposes missing behavior.
- Cons: expensive thrash because many tests depend on broad feature slices.
Decision: Approach B.
Proposed Design
1. Architecture and file strategy
- Keep Raft runtime behavior in JetStream/RaftTypes.cs, with optional split into partials if file size hurts reviewability:
  - RaftTypes.Catchup.cs
  - RaftTypes.AppendProcessing.cs
  - RaftTypes.Elections.cs
- Keep test implementation in dedicated mapped backlog file: dotnet/tests/ZB.MOM.NatsNet.Server.Tests/ImplBacklog/RaftNodeTests.Impltests.cs
- Reuse existing support types (IpQueue<T>, Channel<T>, lock + Interlocked) and avoid introducing new infra unless required for deterministic testability.
2. Feature slicing (max ~20 per group)
- Feature Group A (18): catchup/snapshot/commit foundations
  - 2733, 2734, 2735, 2736, 2737, 2738, 2739, 2740, 2741, 2742, 2743, 2744, 2745, 2746, 2747, 2748, 2749, 2750
- Feature Group B (18): append-entry processing and peer/WAL state
  - 2751, 2752, 2753, 2754, 2755, 2756, 2758, 2759, 2760, 2761, 2765, 2766, 2767, 2768, 2769, 2772, 2776, 2777
- Feature Group C (17): vote/RPC/state transitions
  - 2778, 2779, 2780, 2783, 2784, 2785, 2786, 2787, 2788, 2789, 2790, 2791, 2792, 2793, 2794, 2795, 2796
3. Test slicing
- Test Wave T1 (10): state/quorum/election behavior
  - 2626, 2629, 2635, 2636, 2663, 2664, 2667, 2687, 2690, 2692
- Test Wave T2 (9): snapshot/catchup/membership-vote edge cases
  - 2650, 2651, 2693, 2694, 2702, 2704, 2705, 2712, 2714
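The slicing above can be sanity-checked mechanically before any work starts. The following is a minimal sketch, not part of the porting tooling: the `check_slices` helper and its names are hypothetical, and only the ID lists are taken from this document. It checks that no slice exceeds the ~20-item cap, that no ID is assigned twice, and that the totals match the batch scope (53 features, 19 tests).

```python
# IDs copied from the feature groups and test waves listed in this document.
FEATURE_GROUPS = {
    "A": list(range(2733, 2751)),
    "B": [2751, 2752, 2753, 2754, 2755, 2756, 2758, 2759, 2760, 2761,
          2765, 2766, 2767, 2768, 2769, 2772, 2776, 2777],
    "C": [2778, 2779, 2780, *range(2783, 2797)],
}
TEST_WAVES = {
    "T1": [2626, 2629, 2635, 2636, 2663, 2664, 2667, 2687, 2690, 2692],
    "T2": [2650, 2651, 2693, 2694, 2702, 2704, 2705, 2712, 2714],
}


def check_slices(slices, expected_total, max_per_slice=20):
    """Return a list of problems; an empty list means the slicing is consistent."""
    problems = []
    seen = {}  # item ID -> name of the slice that first claimed it
    for name, ids in slices.items():
        if len(ids) > max_per_slice:
            problems.append(f"{name} has {len(ids)} IDs (limit {max_per_slice})")
        for item_id in ids:
            if item_id in seen:
                problems.append(
                    f"ID {item_id} appears in both {seen[item_id]} and {name}"
                )
            seen[item_id] = name
    if len(seen) != expected_total:
        problems.append(f"expected {expected_total} unique IDs, found {len(seen)}")
    return problems


if __name__ == "__main__":
    print(check_slices(FEATURE_GROUPS, 53))  # []
    print(check_slices(TEST_WAVES, 19))      # []
```

Running a check like this at preflight keeps the group sizes and the declared scope from drifting apart if IDs are later moved between groups.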
4. Verification model
- Enforce per-feature and per-test loops (red/green + stub scan + build/test gates).
- Enforce status-update chunking (<=15 IDs per feature/test batch-update).
- Enforce checkpoint protocol after every group/wave before proceeding.
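As an illustration of the chunking rule, here is a small sketch of splitting one group's IDs into status-update chunks of at most 15. The `chunk_ids` helper is hypothetical and does not invoke the porting CLI; it only shows how the 18 IDs of Feature Group A would fall into two updates.

```python
def chunk_ids(ids, max_per_update=15):
    """Split IDs into consecutive chunks of at most max_per_update items."""
    if max_per_update < 1:
        raise ValueError("max_per_update must be at least 1")
    return [ids[i:i + max_per_update] for i in range(0, len(ids), max_per_update)]


if __name__ == "__main__":
    group_a = list(range(2733, 2751))  # Feature Group A: 18 IDs
    for chunk in chunk_ids(group_a):
        # Each chunk would be passed to one status update; the first holds
        # 15 IDs (2733-2747), the second the remaining 3 (2748-2750).
        print(len(chunk), ",".join(str(i) for i in chunk))
```

Keeping each update small means a failed gate invalidates at most 15 status changes rather than a whole group.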
5. Stuck-item policy
- A blocked item is not left as pseudo-implemented.
- If blocked, set deferred immediately with explicit reason via --override, then continue with next unblocked ID.
Risks and Mitigations
- Risk: Batch 30 dependency incomplete blocks execution.
  Mitigation: preflight dependency gate is mandatory; no Batch 31 status updates until Batch 30 is complete/ready.
- Risk: Large method processAppendEntry causes hidden regressions.
  Mitigation: isolate with focused tests per behavior branch plus class-level gates.
- Risk: fake progress via placeholder methods/tests.
  Mitigation: mandatory anti-stub scans and hard promotion gates.
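An anti-stub scan can be as simple as a line-by-line pattern match over touched files. The sketch below assumes a particular set of placeholder patterns (a thrown `NotImplementedException`, TODO/FIXME/STUB comments, and a trivially true `Assert.True(true)`); this document does not specify the actual patterns, so treat the list as an assumption to be replaced by whatever the verification protocol defines.

```python
import re

# Assumed placeholder patterns for C# sources; adjust to the real protocol.
STUB_PATTERNS = [
    re.compile(r"throw\s+new\s+NotImplementedException"),
    re.compile(r"//\s*(TODO|FIXME|STUB)\b", re.IGNORECASE),
    re.compile(r"Assert\.True\(\s*true\s*\)"),
]


def scan_for_stubs(source):
    """Return (line_number, stripped_line) for every line matching a stub pattern."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in STUB_PATTERNS):
            hits.append((line_no, line.strip()))
    return hits


if __name__ == "__main__":
    sample = (
        "public void SwitchToLeader()\n"
        "{\n"
        "    throw new NotImplementedException();\n"
        "}\n"
    )
    for line_no, text in scan_for_stubs(sample):
        print(f"line {line_no}: {text}")
```

A non-empty result for any touched file would block promotion of the corresponding feature or test IDs.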
Success Criteria
- All 53 features are either verified with evidence or deferred with explicit blocker reason.
- All 19 tests are either verified with execution evidence or deferred with explicit blocker reason.
- No placeholder/stub patterns in touched production or test code.
- Batch-completion readiness is auditable through build/test outputs and chunked status updates.
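The first two criteria amount to a simple invariant over every tracked item, which can be expressed as an audit function. The sketch below is an assumption about how such a check might look: the `Item` record and its field names are hypothetical and are not taken from the porting.db schema.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical view of one tracked feature or test."""
    item_id: int
    status: str         # e.g. "verified" or "deferred"
    evidence: str = ""  # build/test output reference for verified items
    reason: str = ""    # explicit blocker reason for deferred items


def audit(items):
    """Return IDs that are neither verified with evidence nor deferred with a reason."""
    failures = []
    for item in items:
        if item.status == "verified" and item.evidence.strip():
            continue
        if item.status == "deferred" and item.reason.strip():
            continue
        failures.append(item.item_id)
    return failures


if __name__ == "__main__":
    items = [
        Item(2733, "verified", evidence="test run output attached"),
        Item(2734, "deferred", reason="blocked on Batch 30 type"),
        Item(2735, "verified"),  # no evidence recorded
    ]
    print(audit(items))  # [2735]
```

An empty result across all 53 features and 19 tests would correspond to the batch meeting the first two success criteria.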
Non-Goals
- Executing implementation in this design doc.
- Implementing Batch 32+ scope.
- Building new distributed integration infrastructure beyond deterministic unit-level needs.