natsnet/docs/plans/2026-02-27-batch-31-raft-part-2-design.md
Joseph Doherty f8dce79ac0 Add batch plans for batches 31-36 (rounds 16-18)
Generated design docs and implementation plans via Codex for:
- Batch 31: Raft Part 2
- Batch 32: JS Cluster Meta
- Batch 33: JS Cluster Streams
- Batch 34: JS Cluster Consumers
- Batch 35: JS Cluster Remaining
- Batch 36: Stream Lifecycle

All plans include mandatory verification protocol and anti-stub guardrails.
Updated batches.md with file paths and planned status.
2026-02-27 17:01:31 -05:00

Batch 31 Raft Part 2 Design

Date: 2026-02-27
Batch: 31 (Raft Part 2)
Scope: 53 features + 19 unit tests
Dependencies: batch 30 (Raft Part 1)
Go source: golang/nats-server/server/raft.go

Problem

Batch 31 covers the second Raft tranche in raft.go (roughly lines 3239-5038), focused on catchup/snapshot transfer, append-entry processing, WAL consistency, quorum tracking, vote request/response handling, and leadership state transitions. The mapped test set (19 tests) is concentrated on candidate/leader transitions, quorum correctness, membership-change edge cases, and snapshot/catchup behavior.

The design goal is to produce an execution-ready plan that enforces evidence-based status changes and prevents placeholder drift across both production features and tests.

Context Findings

Required command results

  • batch show 31 --db porting.db
    • Status: pending
    • Features: 53 (currently deferred)
    • Tests: 19 (currently deferred)
    • Depends on: 30
    • Go file: server/raft.go
  • batch list --db porting.db
    • Batch 31 is directly gated by Batch 30 and itself gates Batch 32 (JS Cluster Meta).
  • report summary --db porting.db
    • Overall progress: 1924/6942 (27.7%)
    • Deferred backlog remains large; verification discipline is required.

Feature and source mapping findings

  • Batch 31 feature IDs map, in source order, to these raft.go method ranges:
    • sendSnapshotToFollower through updateLeader (2733-2750)
    • processAppendEntry through setWriteErrLocked (2751-2777)
    • isClosed through switchToLeader (2778-2796)
  • Existing .NET Raft surface is in:
    • dotnet/src/ZB.MOM.NatsNet.Server/JetStream/RaftTypes.cs
  • Current comments in RaftTypes.cs still describe algorithm methods as stubbed; Batch 31 must replace those gaps with concrete behavior and tests.

Test mapping findings

  • All 19 mapped tests are from server/raft_test.go and map to RaftNodeTests methods.
  • dotnet/tests/ZB.MOM.NatsNet.Server.Tests/ImplBacklog/RaftNodeTests.Impltests.cs does not currently exist, so Batch 31 planning should include creating it.
  • The mapped tests are behavior-heavy; they cannot be verified using placeholder assertions.

Approaches

Approach A: Monolithic implementation of all 53 features and 19 tests in one pass

  • Pros: single sweep.
  • Cons: high regression risk, weak traceability, hard to isolate failures.

Approach B: Ordered feature groups plus two test waves

  • Features are implemented in ordered method clusters, each with strict gates before status updates.
  • Tests are ported in two behavioral waves (state/quorum first, then snapshot/membership edge cases).
  • Pros: bounded scope, better failure isolation, cleaner status evidence.
  • Cons: more checkpoint overhead.

Approach C: Test-first across all 19 tests, then fill feature gaps

  • Pros: quickly exposes missing behavior.
  • Cons: expensive thrash because many tests depend on broad feature slices.

Decision: Approach B.

Proposed Design

1. Architecture and file strategy

  • Keep Raft runtime behavior in JetStream/RaftTypes.cs, optionally split into partial-class files if file size hurts reviewability:
    • RaftTypes.Catchup.cs
    • RaftTypes.AppendProcessing.cs
    • RaftTypes.Elections.cs
  • Keep test implementation in dedicated mapped backlog file:
    • dotnet/tests/ZB.MOM.NatsNet.Server.Tests/ImplBacklog/RaftNodeTests.Impltests.cs
  • Reuse existing support types (IpQueue<T>, Channel<T>, lock + Interlocked) and avoid introducing new infra unless required for deterministic testability.

2. Feature slicing (max ~20 per group)

  • Feature Group A (18): catchup/snapshot/commit foundations
    2733,2734,2735,2736,2737,2738,2739,2740,2741,2742,2743,2744,2745,2746,2747,2748,2749,2750
  • Feature Group B (18): append-entry processing and peer/WAL state
    2751,2752,2753,2754,2755,2756,2758,2759,2760,2761,2765,2766,2767,2768,2769,2772,2776,2777
  • Feature Group C (17): vote/RPC/state transitions
    2778,2779,2780,2783,2784,2785,2786,2787,2788,2789,2790,2791,2792,2793,2794,2795,2796
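
The grouping above can be sanity-checked mechanically before any work starts. The following is a minimal sketch in Go, not part of the porting tooling: the name checkPartition and the hard-coded map are illustrative assumptions, and only the feature IDs themselves come from this document. It confirms that no group exceeds the stated cap and that no ID is assigned to two groups.

```go
package main

import "fmt"

// featureGroups holds the feature IDs copied from groups A, B and C above.
var featureGroups = map[string][]int{
	"A": {2733, 2734, 2735, 2736, 2737, 2738, 2739, 2740, 2741,
		2742, 2743, 2744, 2745, 2746, 2747, 2748, 2749, 2750},
	"B": {2751, 2752, 2753, 2754, 2755, 2756, 2758, 2759, 2760,
		2761, 2765, 2766, 2767, 2768, 2769, 2772, 2776, 2777},
	"C": {2778, 2779, 2780, 2783, 2784, 2785, 2786, 2787, 2788,
		2789, 2790, 2791, 2792, 2793, 2794, 2795, 2796},
}

// checkPartition (hypothetical helper) returns the total number of IDs.
// It fails if a group is larger than maxPerGroup or if an ID appears in
// more than one group.
func checkPartition(groups map[string][]int, maxPerGroup int) (int, error) {
	seen := make(map[int]string) // feature ID -> group that claimed it
	total := 0
	for name, ids := range groups {
		if len(ids) > maxPerGroup {
			return 0, fmt.Errorf("group %s has %d IDs, limit is %d", name, len(ids), maxPerGroup)
		}
		for _, id := range ids {
			if prev, dup := seen[id]; dup {
				return 0, fmt.Errorf("feature %d is in both group %s and group %s", id, prev, name)
			}
			seen[id] = name
			total++
		}
	}
	return total, nil
}
```

Calling checkPartition(featureGroups, 20) should return a total equal to the batch scope of 53 features with no error; any other result means the slicing has drifted from the batch definition.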

3. Test slicing

  • Test Wave T1 (10): state/quorum/election behavior
    2626,2629,2635,2636,2663,2664,2667,2687,2690,2692
  • Test Wave T2 (9): snapshot/catchup/membership-vote edge cases
    2650,2651,2693,2694,2702,2704,2705,2712,2714

4. Verification model

  • Enforce per-feature and per-test loops (red/green + stub scan + build/test gates).
  • Enforce status-update chunking (<=15 IDs per feature/test batch-update).
  • Enforce checkpoint protocol after every group/wave before proceeding.
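
The chunking rule can be illustrated with a short Go sketch. It is only an assumption about how a helper script might split IDs before issuing status updates; chunkIDs is a hypothetical name, and the sketch deliberately does not model the tracker CLI itself.

```go
package main

// chunkIDs (hypothetical helper) splits ids into consecutive slices of at
// most size elements, preserving order. A non-positive size returns nil so
// the loop below can never spin forever.
func chunkIDs(ids []int, size int) [][]int {
	if size <= 0 {
		return nil
	}
	var chunks [][]int
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids) // last chunk may be shorter
		}
		chunks = append(chunks, ids[start:end])
	}
	return chunks
}
```

With a limit of 15, an 18-feature group is split into one chunk of 15 IDs and one of 3, so each group needs at least two status updates.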

5. Stuck-item policy

  • A blocked item is not left as pseudo-implemented.
  • If blocked, set deferred immediately with explicit reason via --override, then continue with next unblocked ID.

Risks and Mitigations

  • Risk: Batch 30 dependency incomplete blocks execution.
    Mitigation: preflight dependency gate is mandatory; no Batch 31 status updates until Batch 30 is complete/ready.
  • Risk: Large method processAppendEntry causes hidden regressions.
    Mitigation: isolate with focused tests per behavior branch plus class-level gates.
  • Risk: fake progress via placeholder methods/tests.
    Mitigation: mandatory anti-stub scans and hard promotion gates.
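
As an illustration of the anti-stub mitigation, a scan can be as simple as a line-by-line substring search. The Go sketch below is an assumption about what such a scan might look like, not the actual protocol: the marker list and the names Finding and scanForStubs are hypothetical, and a real scan would use whatever patterns the implementation plan mandates.

```go
package main

import (
	"bufio"
	"strings"
)

// stubMarkers are illustrative placeholder patterns only (assumed, not
// taken from the implementation plan).
var stubMarkers = []string{
	"NotImplementedException",
	"Assert.True(true)",
	"// TODO",
}

// Finding records one suspicious line.
type Finding struct {
	Line   int    // 1-based line number
	Marker string // marker that matched
	Text   string // trimmed source line
}

// scanForStubs reports every line of source that contains a stub marker.
// Each line is reported at most once, for the first marker that matches.
func scanForStubs(source string) []Finding {
	var findings []Finding
	sc := bufio.NewScanner(strings.NewReader(source))
	for n := 1; sc.Scan(); n++ {
		line := sc.Text()
		for _, m := range stubMarkers {
			if strings.Contains(line, m) {
				findings = append(findings, Finding{Line: n, Marker: m, Text: strings.TrimSpace(line)})
				break
			}
		}
	}
	return findings
}
```

A promotion gate built on this idea would refuse to mark an item verified while scanForStubs returns any finding for the files that item touched.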

Success Criteria

  • All 53 features are either verified with evidence or deferred with explicit blocker reason.
  • All 19 tests are either verified with execution evidence or deferred with explicit blocker reason.
  • No placeholder/stub patterns in touched production or test code.
  • Batch-completion readiness is auditable through build/test outputs and chunked status updates.
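
The first two criteria amount to a simple invariant over every tracked item, which the following Go sketch makes explicit. The Item type, its field names, and auditItems are hypothetical; they do not reflect the schema of porting.db.

```go
package main

import "fmt"

// Status is an illustrative subset of tracker states.
type Status string

const (
	Verified Status = "verified"
	Deferred Status = "deferred"
)

// Item is a hypothetical record for one feature or test.
type Item struct {
	ID       int
	Status   Status
	Evidence string // pointer to build/test output (assumed field)
	Reason   string // explicit blocker reason (assumed field)
}

// auditItems returns one error for every item that is neither verified
// with evidence nor deferred with a blocker reason.
func auditItems(items []Item) []error {
	var problems []error
	for _, it := range items {
		switch {
		case it.Status == Verified && it.Evidence != "":
			// acceptable: verified and backed by evidence
		case it.Status == Deferred && it.Reason != "":
			// acceptable: deferred with an explicit reason
		default:
			problems = append(problems,
				fmt.Errorf("item %d: status %q lacks evidence or blocker reason", it.ID, it.Status))
		}
	}
	return problems
}
```

Under this sketch the batch is ready for completion only when auditItems returns an empty list for all 53 features and all 19 tests.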

Non-Goals

  • Executing implementation in this design doc.
  • Implementing Batch 32+ scope.
  • Building new distributed integration infrastructure beyond deterministic unit-level needs.