natsnet/docs/plans/2026-02-27-batch-30-raft-part-1-design.md
Joseph Doherty c05d93618e Add batch plans for batches 23-30 (rounds 12-15)
Generated design docs and implementation plans via Codex for:
- Batch 23: Routes
- Batch 24: Leaf Nodes
- Batch 25: Gateways
- Batch 26: WebSocket
- Batch 27: JetStream Core
- Batch 28: JetStream API
- Batch 29: JetStream Batching
- Batch 30: Raft Part 1

All plans include mandatory verification protocol and anti-stub guardrails.
Updated batches.md with file paths and planned status.
2026-02-27 16:33:10 -05:00


# Batch 30 Raft Part 1 Design
**Date:** 2026-02-27
**Batch:** 30 (`Raft Part 1`)
**Scope:** 85 features + 414 unit tests
**Dependencies:** batches `4` (Logging), `18` (Server Core)
**Go source:** `golang/nats-server/server/raft.go`
## Problem
Batch 30 is the first major Raft execution tranche. It covers foundational node bootstrap, the election/follower/leader loops (through `runCatchup`), append-entry and vote encoding helpers, and server-level Raft-node registration and lookup methods. The mapped test surface is large (414 tests) and spans both direct Raft tests and broad JetStream cluster regressions.
The design goal is to make Batch 30 executable in a deterministic, evidence-driven way without repeating the placeholder-test/stub drift seen in earlier backlog files.
## Context Findings
### Required command results
- `batch show 30 --db porting.db`
  - Status: `pending`
  - Features: `85` (all currently `deferred`)
  - Tests: `414` (all currently `deferred`)
  - Depends on: `4,18`
- `batch list --db porting.db`
  - Batch 30 is ordered before Batch 31 (`Raft Part 2`) and is a dependency anchor for JetStream cluster batches.
- `report summary --db porting.db`
  - Overall progress: `1924/6942 (27.7%)`
  - Deferred backlog remains dominant, so verification rigor is mandatory.
### Feature-map and codebase findings
- Batch 30 feature IDs are concentrated in `raft.go` line ranges `137-3237` plus package/helper functions in `4354-4753`.
- Existing .NET Raft baseline exists in `dotnet/src/ZB.MOM.NatsNet.Server/JetStream/RaftTypes.cs`, but many mapped Batch 30 method targets are still missing or only partially approximated.
- Server-level mapped methods (`BootstrapRaftNode`, `InitRaftNode`, `StartRaftNode`, `RegisterRaftNode`, etc.) do not currently exist as first-class `NatsServer` methods.
### Test-map findings (critical)
- Batch 30 tests are highly skewed:
  - `366` tests map to feature `2683` (`raft.shutdown`)
  - `3` tests map to feature `2715` (`appendEntry.encode`)
  - the remaining `45` tests map to external feature IDs
- Only `19` tests come from `server/raft_test.go`; the rest are mostly JetStream cluster/supercluster/concurrency/MQTT regression tests that rely on raft behavior transitively.
- Existing `ImplBacklog/*.Impltests.cs` files contain many superficial placeholder-style tests and cannot be treated as verification evidence.
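
The skew described above can be checked with simple arithmetic. The following is a minimal sketch that restates the three counts from this section and confirms they add up to the 414 mapped tests; the type and function names (`testBucket`, `batch30Buckets`, `totalTests`) are illustrative only and do not exist in the repository.

```go
package main

import "fmt"

// testBucket records how many Batch 30 tests map to one feature bucket.
// The type is hypothetical and exists only for this illustration.
type testBucket struct {
	label string
	count int
}

// batch30Buckets restates the counts listed in the test-map findings.
func batch30Buckets() []testBucket {
	return []testBucket{
		{"feature 2683 (raft.shutdown)", 366},
		{"feature 2715 (appendEntry.encode)", 3},
		{"external feature IDs", 45},
	}
}

// totalTests sums the per-bucket counts.
func totalTests(buckets []testBucket) int {
	total := 0
	for _, b := range buckets {
		total += b.count
	}
	return total
}

func main() {
	buckets := batch30Buckets()
	total := totalTests(buckets)
	for _, b := range buckets {
		share := 100 * float64(b.count) / float64(total)
		fmt.Printf("%-34s %3d tests (%.1f%%)\n", b.label, b.count, share)
	}
	fmt.Println("total mapped tests:", total)
}
```

Roughly 88% of the mapped tests hang off a single feature, which is why the test strategy below processes them by class rather than by feature ID.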
## Approaches
### Approach A: Full-surface implementation (85 features + all 414 tests in one pass)
- Pros: maximal immediate tracker movement.
- Cons: extremely high risk, failures are hard to trace to a specific change, and a relapse into stubs and fake-pass tests is likely.
### Approach B (Recommended): Feature-first Raft core in five groups, then two-tier test strategy
- Features first: implement and verify all 85 Batch 30 features in grouped, dependency-ordered slices.
- Tests second, in two tiers: port direct Raft unit tests first, then process the broader mapped regression tests class by class, with explicit defer rules when runtime infra is missing.
- Pros: coherent sequencing, auditable status transitions, supports strict anti-stub controls.
- Cons: more checkpoints and status operations.
### Approach C: Infra-first (build cluster harness before feature completion)
- Pros: can unlock larger integration tests earlier.
- Cons: violates YAGNI for this batch and delays core feature parity.
**Decision:** Approach B.
## Proposed Design
### 1. Architecture and file strategy
- Keep Raft model/codec/state-machine logic centered in `JetStream/RaftTypes.cs`, but split into partial files when method count becomes unreviewable (for example: lifecycle, codecs, follower/leader loops).
- Add explicit `NatsServer` Raft integration surface in a dedicated partial (`NatsServer.Raft.cs`) instead of reflection-only lifecycle hooks.
- Preserve existing locking/concurrency primitives already used in repo (`ReaderWriterLockSlim`, `Interlocked`, `Channel<T>`, `IpQueue<T>`), mapping Go intent rather than line-by-line syntax.
### 2. Feature grouping (max ~20 per group)
- **Group A (15):** bootstrap/init/server registry and early apply/snapshot prep
  `2599,2600,2601,2602,2603,2607,2608,2609,2610,2611,2612,2613,2614,2615,2629`
- **Group B (14):** snapshot lifecycle + leader-state helpers/campaign hooks
  `2634,2637,2639,2645,2646,2647,2651,2652,2653,2659,2663,2664,2665,2674`
- **Group C (20):** node runtime loop and follower pipeline scaffolding
  `2683,2684,2685,2686,2687,2688,2689,2690,2691,2692,2693,2694,2695,2696,2697,2698,2701,2702,2703,2704`
- **Group D (19):** entry construction/encoding + membership-change handlers
  `2705,2706,2707,2708,2709,2710,2711,2712,2714,2715,2716,2717,2718,2719,2720,2721,2722,2723,2724`
- **Group E (17):** leader loop/catchup + peer-state and vote persistence helpers
  `2725,2727,2728,2729,2731,2732,2757,2762,2763,2764,2770,2771,2773,2774,2775,2781,2782`
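
The grouping above can be sanity-checked mechanically. The sketch below restates the five ID lists and verifies three properties: no group exceeds the per-group cap, no feature ID appears twice, and the groups together cover all 85 features. It treats the "~20" cap as a hard limit of 20, which is an assumption, and `featureGroup`, `batch30Groups`, and `validateGroups` are hypothetical names, not part of the porting tooling.

```go
package main

import "fmt"

// featureGroup is one implementation slice from the feature grouping.
// The type is hypothetical and exists only for this illustration.
type featureGroup struct {
	name string
	ids  []int
}

// batch30Groups restates the five groups and their feature IDs.
func batch30Groups() []featureGroup {
	return []featureGroup{
		{"A", []int{2599, 2600, 2601, 2602, 2603, 2607, 2608, 2609, 2610, 2611, 2612, 2613, 2614, 2615, 2629}},
		{"B", []int{2634, 2637, 2639, 2645, 2646, 2647, 2651, 2652, 2653, 2659, 2663, 2664, 2665, 2674}},
		{"C", []int{2683, 2684, 2685, 2686, 2687, 2688, 2689, 2690, 2691, 2692, 2693, 2694, 2695, 2696, 2697, 2698, 2701, 2702, 2703, 2704}},
		{"D", []int{2705, 2706, 2707, 2708, 2709, 2710, 2711, 2712, 2714, 2715, 2716, 2717, 2718, 2719, 2720, 2721, 2722, 2723, 2724}},
		{"E", []int{2725, 2727, 2728, 2729, 2731, 2732, 2757, 2762, 2763, 2764, 2770, 2771, 2773, 2774, 2775, 2781, 2782}},
	}
}

// validateGroups returns the total feature count, or an error if a group
// exceeds maxPerGroup or a feature ID appears in more than one place.
func validateGroups(groups []featureGroup, maxPerGroup int) (int, error) {
	seen := make(map[int]string)
	total := 0
	for _, g := range groups {
		if len(g.ids) > maxPerGroup {
			return 0, fmt.Errorf("group %s has %d IDs, max is %d", g.name, len(g.ids), maxPerGroup)
		}
		for _, id := range g.ids {
			if prev, dup := seen[id]; dup {
				return 0, fmt.Errorf("feature %d appears in group %s and group %s", id, prev, g.name)
			}
			seen[id] = g.name
		}
		total += len(g.ids)
	}
	return total, nil
}

func main() {
	total, err := validateGroups(batch30Groups(), 20)
	if err != nil {
		fmt.Println("grouping invalid:", err)
		return
	}
	fmt.Println("features covered:", total)
}
```

A check of this kind is cheap to rerun whenever IDs move between groups, so a dropped or duplicated feature is caught before any status update.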
### 3. Test design
- **Tier T1 (Raft-direct tests, highest value first):** `server/raft_test.go` mapped IDs (`2618,2619,2621,2623,2625,2632,2633,2639,2642,2675,2700,2701,2706,2707,2708,2709,2710,2711,2713`).
- **Tier T2 (transitive regression tests):** remaining mapped classes (`JetStreamClusterTests*`, `JetStreamSuperClusterTests`, `ConcurrencyTests*`, `MqttHandlerTests`, etc.) processed only with real behavioral assertions; otherwise explicitly deferred with reason.
- Existing placeholder tests are not accepted as evidence; they must be replaced or remain deferred.
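
The tier ordering can be sketched as a simple sort key: any test ID in the `server/raft_test.go` list above is T1, and everything else is T2. The helper names (`tierOf`, `orderByTier`) are hypothetical, and the T2 IDs used in `main` are placeholders rather than real mapped test IDs.

```go
package main

import (
	"fmt"
	"sort"
)

// t1IDs holds the test IDs this design maps to server/raft_test.go.
var t1IDs = []int{
	2618, 2619, 2621, 2623, 2625, 2632, 2633, 2639, 2642, 2675,
	2700, 2701, 2706, 2707, 2708, 2709, 2710, 2711, 2713,
}

// tierOf returns 1 for Raft-direct tests and 2 for every other mapped test.
func tierOf(id int, t1 map[int]bool) int {
	if t1[id] {
		return 1
	}
	return 2
}

// orderByTier returns a copy of ids with T1 tests first; within a tier,
// IDs are sorted in ascending order. The input slice is not modified.
func orderByTier(ids []int) []int {
	t1 := make(map[int]bool, len(t1IDs))
	for _, id := range t1IDs {
		t1[id] = true
	}
	out := append([]int(nil), ids...)
	sort.SliceStable(out, func(i, j int) bool {
		ti, tj := tierOf(out[i], t1), tierOf(out[j], t1)
		if ti != tj {
			return ti < tj
		}
		return out[i] < out[j]
	})
	return out
}

func main() {
	// 1001 and 1002 are hypothetical stand-ins for T2 test IDs.
	fmt.Println(orderByTier([]int{1002, 2713, 1001, 2618}))
}
```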
### 4. Verification model
- Enforce per-feature and per-test loops with mandatory build/test gates before any status promotions.
- Enforce max `15` IDs per `feature/test batch-update` command.
- Require checkpoint protocol between tasks (stub scan, build, targeted tests, full unit-test run, then status updates).
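
The two mechanical rules above, ordered gates and the 15-ID cap, can be sketched as follows. This is an illustration under assumptions: the gates are modelled as plain functions that return an error, the names `gate`, `runCheckpoint`, and `chunkIDs` are hypothetical, and the real gates would invoke the project's actual scan, build, and test steps.

```go
package main

import (
	"errors"
	"fmt"
)

// maxIDsPerUpdate is the cap on IDs per batch-update command.
const maxIDsPerUpdate = 15

// errSimulated stands in for a real gate failure in this sketch.
var errSimulated = errors.New("simulated failure")

// chunkIDs splits ids into consecutive slices of at most size elements.
func chunkIDs(ids []int, size int) [][]int {
	if size <= 0 {
		return nil
	}
	var chunks [][]int
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		chunks = append(chunks, ids[start:end])
	}
	return chunks
}

// gate is one named checkpoint step; run returns nil when the step passes.
type gate struct {
	name string
	run  func() error
}

// runCheckpoint executes gates in order and stops at the first failure,
// so later gates and any status update are skipped when a gate fails.
func runCheckpoint(gates []gate) error {
	for _, g := range gates {
		if err := g.run(); err != nil {
			return fmt.Errorf("gate %q failed: %w", g.name, err)
		}
	}
	return nil
}

func main() {
	pass := func() error { return nil }
	gates := []gate{
		{"stub scan", pass},
		{"build", pass},
		{"targeted tests", func() error { return errSimulated }},
		{"full unit test", pass},
	}
	if err := runCheckpoint(gates); err != nil {
		fmt.Println("no status updates:", err)
	}

	// Group C has 20 feature IDs, so it needs more than one update command.
	groupC := make([]int, 0, 20)
	for id := 2683; id <= 2698; id++ {
		groupC = append(groupC, id)
	}
	groupC = append(groupC, 2701, 2702, 2703, 2704)
	for i, c := range chunkIDs(groupC, maxIDsPerUpdate) {
		fmt.Printf("update %d: %d IDs\n", i+1, len(c))
	}
}
```

Stopping at the first failing gate keeps the promotion rule simple: a status update is only reachable when every earlier step has passed.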
### 5. Error handling and stuck policy
- If a feature/test is blocked by missing infrastructure, missing prerequisite behavior, or non-deterministic harness gaps:
  - do not leave placeholders,
  - do not force fake-pass tests,
  - mark `deferred` with explicit blocker reason,
  - continue with next unblocked ID in current group.
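
The stuck policy amounts to a small decision rule per ID. The sketch below is one way to express it, assuming each item carries either verification evidence or a blocker reason; an item with neither is rejected rather than promoted, which mirrors the "no placeholders" rule. The `workItem` fields, the function names, and the sample evidence and blocker strings are all hypothetical.

```go
package main

import "fmt"

// workItem is a feature or test being processed. Field names are
// illustrative and do not reflect the porting database schema.
type workItem struct {
	id       int
	evidence string // reference to real execution evidence, if any
	blocker  string // explicit reason the item cannot be completed now
}

// resolveStatus applies the stuck policy to one item. A blocker always
// wins, so a blocked item is deferred even if partial evidence exists.
func resolveStatus(it workItem) (string, error) {
	switch {
	case it.blocker != "":
		return "deferred", nil
	case it.evidence != "":
		return "verified", nil
	default:
		return "", fmt.Errorf("item %d has neither evidence nor a blocker reason", it.id)
	}
}

// processGroup resolves every item in order and continues past items that
// cannot be resolved, collecting their errors instead of stopping.
func processGroup(items []workItem) (map[int]string, []error) {
	statuses := make(map[int]string, len(items))
	var errs []error
	for _, it := range items {
		status, err := resolveStatus(it)
		if err != nil {
			errs = append(errs, err)
			continue
		}
		statuses[it.id] = status
	}
	return statuses, errs
}

func main() {
	items := []workItem{
		{id: 2683, evidence: "targeted tests passed"},
		{id: 2684, blocker: "needs cluster harness"},
		{id: 2685}, // neither evidence nor blocker: must not be promoted
	}
	statuses, errs := processGroup(items)
	fmt.Println("statuses:", statuses)
	fmt.Println("rejected:", errs)
}
```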
## Risks and Mitigations
- **Risk:** `raft.shutdown` mapping concentration (366 tests) causes a noisy, non-actionable test wave.
  **Mitigation:** process tests by class and behavioral relevance; require per-test evidence and deferred reasons for infra-blocked cases.
- **Risk:** the large `RaftTypes.cs` file becomes unreviewable.
  **Mitigation:** allow controlled partial-file split while preserving namespaces and API shape.
- **Risk:** false progress via stubbed placeholders in `ImplBacklog`.
  **Mitigation:** mandatory anti-stub scans and hard promotion gates.
## Success Criteria
- All 85 features are either `verified` with evidence or `deferred` with specific blocker reason (no placeholders).
- Batch 30 mapped tests are processed under the same rule: verified only with real execution evidence, otherwise deferred with explicit reason.
- No new stub/fake-pass patterns in touched production or test files.
- The implementation plan includes strict verification and anti-stub guardrails adapted to both features and tests.
## Non-Goals
- Executing implementation in this design document.
- Completing downstream Batch 31+ raft/clustering behavior that is outside Batch 30 feature IDs.
- Building new distributed integration infrastructure beyond what is needed for deterministic unit-level verification.