natsnet/docs/plans/2026-02-27-batch-30-raft-part-1-design.md
Joseph Doherty c05d93618e Add batch plans for batches 23-30 (rounds 12-15)
Generated design docs and implementation plans via Codex for:
- Batch 23: Routes
- Batch 24: Leaf Nodes
- Batch 25: Gateways
- Batch 26: WebSocket
- Batch 27: JetStream Core
- Batch 28: JetStream API
- Batch 29: JetStream Batching
- Batch 30: Raft Part 1

All plans include mandatory verification protocol and anti-stub guardrails.
Updated batches.md with file paths and planned status.
2026-02-27 16:33:10 -05:00

Batch 30 Raft Part 1 Design

Date: 2026-02-27
Batch: 30 (Raft Part 1)
Scope: 85 features + 414 unit tests
Dependencies: batches 4 (Logging), 18 (Server Core)
Go source: golang/nats-server/server/raft.go

Problem

Batch 30 is the first major Raft execution tranche and includes foundational node bootstrap, election/follower/leader loops (through runCatchup), append-entry/vote encoding helpers, and server-level Raft-node registration and lookup methods. The mapped test surface is very large (414 tests) and includes both direct Raft tests and broad JetStream cluster regressions.

The design goal is to make Batch 30 executable in a deterministic, evidence-driven way without repeating the placeholder-test/stub drift seen in earlier backlog files.

Context Findings

Required command results

  • batch show 30 --db porting.db
    • Status: pending
    • Features: 85 (all currently deferred)
    • Tests: 414 (all currently deferred)
    • Depends on: 4,18
  • batch list --db porting.db
    • Batch 30 is ordered before Batch 31 (Raft Part 2) and is a dependency anchor for JetStream cluster batches.
  • report summary --db porting.db
    • Overall progress: 1924/6942 (27.7%)
    • Deferred backlog remains dominant, so verification rigor is mandatory.

Feature-map and codebase findings

  • Batch 30 feature IDs are concentrated in raft.go line ranges 137-3237 plus package/helper functions in 4354-4753.
  • Existing .NET Raft baseline exists in dotnet/src/ZB.MOM.NatsNet.Server/JetStream/RaftTypes.cs, but many mapped Batch 30 method targets are still missing or only partially approximated.
  • Server-level mapped methods (BootstrapRaftNode, InitRaftNode, StartRaftNode, RegisterRaftNode, etc.) do not currently exist as first-class NatsServer methods.

Test-map findings (critical)

  • Batch 30 tests are highly skewed:
    • 366 tests map to feature 2683 (raft.shutdown)
    • 3 tests map to feature 2715 (appendEntry.encode)
    • the remaining 45 tests map to external feature IDs
  • Only 19 tests come from server/raft_test.go; the rest are mostly JetStream cluster/supercluster/concurrency/MQTT regression tests that rely on raft behavior transitively.
  • Existing ImplBacklog/*.Impltests.cs files contain many superficial placeholder-style tests and cannot be treated as verification evidence.
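
The skew described above can be checked with a few lines of arithmetic. This is a sketch only: the counts are copied from the findings, while the bucket labels are illustrative and not tracker field names.

```python
# Test counts taken from the test-map findings; the labels are illustrative.
test_buckets = {
    "feature 2683 (raft.shutdown)": 366,
    "feature 2715 (appendEntry.encode)": 3,
    "external feature IDs": 45,
}

total = sum(test_buckets.values())

# Share of the batch's mapped tests that falls into each bucket, in percent.
shares = {
    name: round(100 * count / total, 1)
    for name, count in test_buckets.items()
}

for name, pct in shares.items():
    print(f"{name}: {test_buckets[name]} tests ({pct}%)")
print(f"total: {total}")
```

The three buckets add up to the 414 mapped tests, and close to nine in ten of them hang off a single feature, which is why the test wave has to be organized by class and behavior rather than by feature ID.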

Approaches

Approach A: Full-surface implementation (85 features + all 414 tests in one pass)

  • Pros: maximal immediate tracker movement.
  • Cons: extremely high risk, weak causality, and likely stub/fake-pass relapse.

Approach B: Two-tier execution (grouped features first, then tiered tests)

  • Tier 1: implement and verify all 85 Batch 30 features in grouped, dependency-ordered slices.
  • Tier 2: port direct Raft unit tests first, then process broader mapped regression tests class-by-class with explicit defer rules when runtime infra is missing.
  • Pros: coherent sequencing, auditable status transitions, supports strict anti-stub controls.
  • Cons: more checkpoints and status operations.

Approach C: Infra-first (build cluster harness before feature completion)

  • Pros: can unlock larger integration tests earlier.
  • Cons: violates YAGNI for this batch and delays core feature parity.

Decision: Approach B.

Proposed Design

1. Architecture and file strategy

  • Keep Raft model/codec/state-machine logic centered in JetStream/RaftTypes.cs, but split into partial files when method count becomes unreviewable (for example: lifecycle, codecs, follower/leader loops).
  • Add explicit NatsServer Raft integration surface in a dedicated partial (NatsServer.Raft.cs) instead of reflection-only lifecycle hooks.
  • Preserve existing locking/concurrency primitives already used in repo (ReaderWriterLockSlim, Interlocked, Channel<T>, IpQueue<T>), mapping Go intent rather than line-by-line syntax.

2. Feature grouping (max ~20 per group)

  • Group A (15): bootstrap/init/server registry and early apply/snapshot prep
    2599,2600,2601,2602,2603,2607,2608,2609,2610,2611,2612,2613,2614,2615,2629
  • Group B (14): snapshot lifecycle + leader-state helpers/campaign hooks
    2634,2637,2639,2645,2646,2647,2651,2652,2653,2659,2663,2664,2665,2674
  • Group C (20): node runtime loop and follower pipeline scaffolding
    2683,2684,2685,2686,2687,2688,2689,2690,2691,2692,2693,2694,2695,2696,2697,2698,2701,2702,2703,2704
  • Group D (19): entry construction/encoding + membership-change handlers
    2705,2706,2707,2708,2709,2710,2711,2712,2714,2715,2716,2717,2718,2719,2720,2721,2722,2723,2724
  • Group E (17): leader loop/catchup + peer-state and vote persistence helpers
    2725,2727,2728,2729,2731,2732,2757,2762,2763,2764,2770,2771,2773,2774,2775,2781,2782
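
The grouping above can be sanity-checked mechanically before any work starts. The following is a minimal sketch, assuming the IDs are kept in a plain dictionary; `validate_groups` and the constant names are hypothetical helpers, not part of the porting tooling.

```python
# Feature IDs copied from Groups A-E above.
GROUPS = {
    "A": [2599, 2600, 2601, 2602, 2603, 2607, 2608, 2609, 2610, 2611,
          2612, 2613, 2614, 2615, 2629],
    "B": [2634, 2637, 2639, 2645, 2646, 2647, 2651, 2652, 2653, 2659,
          2663, 2664, 2665, 2674],
    "C": [2683, 2684, 2685, 2686, 2687, 2688, 2689, 2690, 2691, 2692,
          2693, 2694, 2695, 2696, 2697, 2698, 2701, 2702, 2703, 2704],
    "D": [2705, 2706, 2707, 2708, 2709, 2710, 2711, 2712, 2714, 2715,
          2716, 2717, 2718, 2719, 2720, 2721, 2722, 2723, 2724],
    "E": [2725, 2727, 2728, 2729, 2731, 2732, 2757, 2762, 2763, 2764,
          2770, 2771, 2773, 2774, 2775, 2781, 2782],
}

MAX_GROUP_SIZE = 20   # "max ~20 per group"
EXPECTED_TOTAL = 85   # batch scope: 85 features


def validate_groups(groups, max_size=MAX_GROUP_SIZE, expected_total=EXPECTED_TOTAL):
    """Return a list of problems; an empty list means the grouping is consistent."""
    problems = []
    owner = {}  # feature ID -> name of the group that first claimed it
    for name, ids in groups.items():
        if len(ids) > max_size:
            problems.append(f"group {name} has {len(ids)} IDs (max {max_size})")
        for feature_id in ids:
            if feature_id in owner:
                problems.append(
                    f"feature {feature_id} is in both group {owner[feature_id]} and group {name}"
                )
            else:
                owner[feature_id] = name
    if len(owner) != expected_total:
        problems.append(f"{len(owner)} unique IDs, expected {expected_total}")
    return problems


sizes = {name: len(ids) for name, ids in GROUPS.items()}
print(sizes)
print(validate_groups(GROUPS) or "grouping is consistent")
```

Run against the lists above, the check confirms the stated group sizes (15, 14, 20, 19, 17), no ID assigned twice, and a total of 85.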

3. Test design

  • Tier T1 (Raft-direct tests, highest value first): server/raft_test.go mapped IDs (2618,2619,2621,2623,2625,2632,2633,2639,2642,2675,2700,2701,2706,2707,2708,2709,2710,2711,2713).
  • Tier T2 (transitive regression tests): remaining mapped classes (JetStreamClusterTests*, JetStreamSuperClusterTests, ConcurrencyTests*, MqttHandlerTests, etc.) processed only with real behavioral assertions; otherwise explicitly deferred with reason.
  • Existing placeholder tests are not accepted as evidence; they must be replaced or remain deferred.

4. Verification model

  • Enforce per-feature and per-test loops with mandatory build/test gates before any status promotions.
  • Enforce max 15 IDs per feature/test batch-update command.
  • Require checkpoint protocol between tasks (stub scan, build, targeted tests, full unit test, then status updates).
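
The 15-ID limit and the checkpoint ordering can be expressed as two small helpers. This is a sketch under stated assumptions: `chunk_ids` and `first_failed_gate` are hypothetical names, the gate labels simply mirror the checkpoint list above, and no real tracker command is invoked.

```python
from typing import Dict, Iterator, List, Optional, Sequence

MAX_IDS_PER_UPDATE = 15  # limit stated in the verification model

# Gates that must pass, in order, before any status update is issued.
GATES = ("stub scan", "build", "targeted tests", "full unit test")


def chunk_ids(ids: Sequence[int], size: int = MAX_IDS_PER_UPDATE) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` IDs, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def first_failed_gate(outcomes: Dict[str, bool]) -> Optional[str]:
    """Return the first gate that did not pass, or None if promotion may proceed.

    A gate that is missing from `outcomes` counts as not passed.
    """
    for gate in GATES:
        if not outcomes.get(gate, False):
            return gate
    return None


# Group C has 20 feature IDs, so it needs two update commands.
group_c = list(range(2683, 2699)) + list(range(2701, 2705))
chunks = list(chunk_ids(group_c))
for chunk in chunks:
    print(len(chunk), ",".join(str(i) for i in chunk))
```

With this split, Group C is promoted in one update of 15 IDs and one of 5, and an update is only issued when `first_failed_gate` returns `None` for the checkpoint that precedes it.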

5. Error handling and stuck policy

  • If a feature/test is blocked by missing infrastructure, missing prerequisite behavior, or non-deterministic harness gaps:
    • do not leave placeholders,
    • do not force fake-pass tests,
    • mark it deferred with an explicit blocker reason,
    • continue with the next unblocked ID in the current group.
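
The stuck policy amounts to a simple per-item decision. The sketch below is illustrative only: `WorkItem`, `process_group`, the status strings, and the sample IDs and blocker text are all assumptions made for the example, not tracker data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class WorkItem:
    item_id: int
    blocker: Optional[str] = None  # None means the item is not blocked
    verified: bool = False         # True only after real build/test evidence


def process_group(items: List[WorkItem]) -> List[Tuple[int, str, str]]:
    """Return (id, status, reason) updates for one group, following the stuck policy."""
    updates = []
    for item in items:
        if item.blocker is not None:
            # A deferral without a reason is rejected rather than silently accepted.
            if not item.blocker.strip():
                raise ValueError(f"item {item.item_id}: deferral needs an explicit blocker reason")
            updates.append((item.item_id, "deferred", item.blocker))
            continue  # move on to the next unblocked ID in the group
        if item.verified:
            updates.append((item.item_id, "verified", ""))
        # Unblocked but unverified items get no status change: no placeholder, no fake pass.
    return updates


# Hypothetical IDs and blocker text, for illustration only.
sample = [
    WorkItem(1, verified=True),
    WorkItem(2, blocker="missing cluster harness"),
    WorkItem(3),
]
updates = process_group(sample)
print(updates)
```

The point of the shape is that a blocked item never stops the group, and an item with neither evidence nor a stated blocker produces no status change at all.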

Risks and Mitigations

  • Risk: the concentration of test mappings on raft.shutdown (366 tests) produces a noisy, non-actionable test wave.
    Mitigation: process tests by class and behavioral relevance; require per-test evidence and deferred reasons for infra-blocked cases.
  • Risk: RaftTypes.cs grows too large to review.
    Mitigation: allow controlled partial-file split while preserving namespaces and API shape.
  • Risk: false progress via stubbed placeholders in ImplBacklog.
    Mitigation: mandatory anti-stub scans and hard promotion gates.

Success Criteria

  • All 85 features are either verified with evidence or deferred with a specific blocker reason (no placeholders).
  • Batch 30 mapped tests are processed under the same rule: verified only with real execution evidence, otherwise deferred with explicit reason.
  • No new stub/fake-pass patterns in touched production or test files.
  • The implementation plan includes strict verification and anti-stub guardrails adapted to both features and tests.

Non-Goals

  • Executing the implementation as part of this design document.
  • Completing downstream Batch 31+ raft/clustering behavior that is outside Batch 30 feature IDs.
  • Building new distributed integration infrastructure beyond what is needed for deterministic unit-level verification.