# Batch 30 Raft Part 1 Design

**Date:** 2026-02-27
**Batch:** 30 (`Raft Part 1`)
**Scope:** 85 features + 414 unit tests
**Dependencies:** batches `4` (Logging), `18` (Server Core)
**Go source:** `golang/nats-server/server/raft.go`

## Problem

Batch 30 is the first major Raft execution tranche and includes foundational node bootstrap, election/follower/leader loops (through `runCatchup`), append-entry/vote encoding helpers, and server-level Raft-node registration and lookup methods. The mapped test surface is very large (414 tests) and includes both direct Raft tests and broad JetStream cluster regressions.

The design goal is to make Batch 30 executable in a deterministic, evidence-driven way without repeating the placeholder-test/stub drift seen in earlier backlog files.

## Context Findings

### Required command results

- `batch show 30 --db porting.db`
  - Status: `pending`
  - Features: `85` (all currently `deferred`)
  - Tests: `414` (all currently `deferred`)
  - Depends on: `4,18`
- `batch list --db porting.db`
  - Batch 30 is ordered before Batch 31 (`Raft Part 2`) and is a dependency anchor for JetStream cluster batches.
- `report summary --db porting.db`
  - Overall progress: `1924/6942 (27.7%)`
  - Deferred backlog remains dominant, so verification rigor is mandatory.

### Feature-map and codebase findings

- Batch 30 feature IDs are concentrated in `raft.go` line ranges `137-3237` plus package/helper functions in `4354-4753`.
- A .NET Raft baseline exists in `dotnet/src/ZB.MOM.NatsNet.Server/JetStream/RaftTypes.cs`, but many mapped Batch 30 method targets are still missing or only partially approximated.
- Server-level mapped methods (`BootstrapRaftNode`, `InitRaftNode`, `StartRaftNode`, `RegisterRaftNode`, etc.) do not currently exist as first-class `NatsServer` methods.

### Test-map findings (critical)

- Batch 30 tests are highly skewed:
  - `366` tests map to feature `2683` (`raft.shutdown`)
  - `3` tests map to feature `2715` (`appendEntry.encode`)
  - remaining tests map to external feature IDs (45 tests)
- Only `19` tests come from `server/raft_test.go`; the rest are mostly JetStream cluster/supercluster/concurrency/MQTT regression tests that rely on raft behavior transitively.
- Existing `ImplBacklog/*.Impltests.cs` files contain many superficial placeholder-style tests and cannot be treated as verification evidence.

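
The skew described above can be checked with a short tally. The following is a minimal Go sketch, not part of the porting tooling: the `testBucket` type and the `sum` and `share` helpers are hypothetical names, and the counts are copied from the findings in this section.

```go
package main

import "fmt"

// testBucket is a hypothetical record for one slice of the Batch 30 test map.
type testBucket struct {
	label string
	count int
}

// sum adds up the counts of all buckets.
func sum(buckets []testBucket) int {
	total := 0
	for _, b := range buckets {
		total += b.count
	}
	return total
}

// share returns count as a percentage of total.
func share(count, total int) float64 {
	return 100 * float64(count) / float64(total)
}

func main() {
	// Counts copied from the test-map findings.
	buckets := []testBucket{
		{"feature 2683 (raft.shutdown)", 366},
		{"feature 2715 (appendEntry.encode)", 3},
		{"external feature IDs", 45},
	}
	total := sum(buckets)
	for _, b := range buckets {
		fmt.Printf("%-36s %3d  %5.1f%%\n", b.label, b.count, share(b.count, total))
	}

	direct := 19 // tests that come from server/raft_test.go
	fmt.Printf("total=%d direct=%d transitive=%d\n", total, direct, total-direct)
}
```

The three buckets add up to the 414 mapped tests, and the 19 direct Raft tests leave 395 that exercise Raft only transitively, which is why the test work is split into tiers later in this design.
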
## Approaches

### Approach A: Full-surface implementation (85 features + all 414 tests in one pass)

- Pros: maximal immediate tracker movement.
- Cons: extremely high risk, weak causality, and likely stub/fake-pass relapse.

### Approach B (Recommended): Feature-first Raft core in five groups, then two-tier test strategy

- Tier 1: implement and verify all 85 Batch 30 features in grouped, dependency-ordered slices.
- Tier 2: port direct Raft unit tests first, then process broader mapped regression tests class-by-class with explicit defer rules when runtime infra is missing.
- Pros: coherent sequencing, auditable status transitions, supports strict anti-stub controls.
- Cons: more checkpoints and status operations.

### Approach C: Infra-first (build cluster harness before feature completion)

- Pros: can unlock larger integration tests earlier.
- Cons: violates YAGNI for this batch and delays core feature parity.

**Decision:** Approach B.

## Proposed Design

### 1. Architecture and file strategy

- Keep Raft model/codec/state-machine logic centered in `JetStream/RaftTypes.cs`, but split into partial files when method count becomes unreviewable (for example: lifecycle, codecs, follower/leader loops).
- Add explicit `NatsServer` Raft integration surface in a dedicated partial (`NatsServer.Raft.cs`) instead of reflection-only lifecycle hooks.
- Preserve existing locking/concurrency primitives already used in repo (`ReaderWriterLockSlim`, `Interlocked`, `Channel<T>`, `IpQueue<T>`), mapping Go intent rather than line-by-line syntax.

### 2. Feature grouping (max ~20 per group)

- **Group A (15):** bootstrap/init/server registry and early apply/snapshot prep
  `2599,2600,2601,2602,2603,2607,2608,2609,2610,2611,2612,2613,2614,2615,2629`
- **Group B (14):** snapshot lifecycle + leader-state helpers/campaign hooks
  `2634,2637,2639,2645,2646,2647,2651,2652,2653,2659,2663,2664,2665,2674`
- **Group C (20):** node runtime loop and follower pipeline scaffolding
  `2683,2684,2685,2686,2687,2688,2689,2690,2691,2692,2693,2694,2695,2696,2697,2698,2701,2702,2703,2704`
- **Group D (19):** entry construction/encoding + membership-change handlers
  `2705,2706,2707,2708,2709,2710,2711,2712,2714,2715,2716,2717,2718,2719,2720,2721,2722,2723,2724`
- **Group E (17):** leader loop/catchup + peer-state and vote persistence helpers
  `2725,2727,2728,2729,2731,2732,2757,2762,2763,2764,2770,2771,2773,2774,2775,2781,2782`

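
Because the verification model caps each `batch-update` command at `15` IDs, groups larger than that need more than one status update. The Go sketch below shows one way to split a group into compliant chunks and to sanity-check the group sizes. The `chunk` and `commandsNeeded` helpers are hypothetical names introduced for illustration, not part of the porting tool.

```go
package main

import "fmt"

// maxIDsPerUpdate is the per-command cap stated in the verification model.
const maxIDsPerUpdate = 15

// chunk splits ids into consecutive slices of at most size elements.
// It returns nil when size is not positive.
func chunk(ids []int, size int) [][]int {
	if size <= 0 {
		return nil
	}
	var out [][]int
	for len(ids) > size {
		out = append(out, ids[:size])
		ids = ids[size:]
	}
	if len(ids) > 0 {
		out = append(out, ids)
	}
	return out
}

// commandsNeeded returns how many updates a group of n IDs requires.
func commandsNeeded(n, size int) int {
	return (n + size - 1) / size
}

func main() {
	// Group C feature IDs, copied from the grouping list.
	groupC := []int{
		2683, 2684, 2685, 2686, 2687, 2688, 2689, 2690, 2691, 2692,
		2693, 2694, 2695, 2696, 2697, 2698, 2701, 2702, 2703, 2704,
	}
	for i, c := range chunk(groupC, maxIDsPerUpdate) {
		fmt.Printf("update %d: %d IDs (%d..%d)\n", i+1, len(c), c[0], c[len(c)-1])
	}

	// Group sizes A through E, as stated in the group headings.
	sizes := []int{15, 14, 20, 19, 17}
	total, commands := 0, 0
	for _, n := range sizes {
		total += n
		commands += commandsNeeded(n, maxIDsPerUpdate)
	}
	fmt.Printf("features=%d updates=%d\n", total, commands)
}
```

Under these assumptions, Groups A and B fit in one update each, while Groups C, D, and E need two each. The five group sizes add up to the 85 features in scope.
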
### 3. Test design

- **Tier T1 (Raft-direct tests, highest value first):** `server/raft_test.go` mapped IDs (`2618,2619,2621,2623,2625,2632,2633,2639,2642,2675,2700,2701,2706,2707,2708,2709,2710,2711,2713`).
- **Tier T2 (transitive regression tests):** remaining mapped classes (`JetStreamClusterTests*`, `JetStreamSuperClusterTests`, `ConcurrencyTests*`, `MqttHandlerTests`, etc.) processed only with real behavioral assertions; otherwise explicitly deferred with reason.
- Existing placeholder tests are not accepted as evidence; they must be replaced or remain deferred.

### 4. Verification model

- Enforce per-feature and per-test loops with mandatory build/test gates before any status promotions.
- Enforce max `15` IDs per `feature/test batch-update` command.
- Require checkpoint protocol between tasks (stub scan, build, targeted tests, full unit test run, then status updates).

### 5. Error handling and stuck policy

- If a feature/test is blocked by missing infrastructure, missing prerequisite behavior, or non-deterministic harness gaps:
  - do not leave placeholders,
  - do not force fake-pass tests,
  - mark `deferred` with explicit blocker reason,
  - continue with next unblocked ID in current group.

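
The stuck policy amounts to a small decision rule applied to each ID in a group. The Go sketch below illustrates it under stated assumptions: the `item` and `outcome` types, the `unchanged` status for items that have neither evidence nor a blocker, and the example blocker text are all hypothetical and do not come from the tracker.

```go
package main

import "fmt"

// item is a hypothetical work record for one feature or test ID.
type item struct {
	id       int
	evidence bool   // true when real build/test evidence exists
	blocker  string // non-empty when infrastructure or prerequisites are missing
}

// outcome is the status that would be written back to the tracker.
type outcome struct {
	id     int
	status string
	reason string
}

// resolve applies the stuck policy to one item. Blocked items are deferred
// with their reason, items with evidence are verified, and anything else is
// left unpromoted rather than stubbed.
func resolve(it item) outcome {
	switch {
	case it.blocker != "":
		return outcome{it.id, "deferred", it.blocker}
	case it.evidence:
		return outcome{it.id, "verified", ""}
	default:
		return outcome{it.id, "unchanged", "no evidence yet"}
	}
}

// processGroup resolves every item in order; a blocked item never stops the loop.
func processGroup(items []item) []outcome {
	out := make([]outcome, 0, len(items))
	for _, it := range items {
		out = append(out, resolve(it))
	}
	return out
}

func main() {
	// Illustrative inputs only; the blocker text is made up.
	group := []item{
		{id: 2683, evidence: true},
		{id: 2684, blocker: "needs multi-node cluster harness"},
		{id: 2685, evidence: true},
	}
	for _, o := range processGroup(group) {
		fmt.Printf("%d %s %s\n", o.id, o.status, o.reason)
	}
}
```

In this sketch a blocker takes precedence over evidence, so an item is never promoted while a known gap remains. That ordering is a design assumption rather than something this document mandates.
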
## Risks and Mitigations

- **Risk:** `raft.shutdown` mapping concentration (366 tests) causes a noisy, non-actionable test wave.
  **Mitigation:** process tests by class and behavioral relevance; require per-test evidence and deferred reasons for infra-blocked cases.
- **Risk:** large `RaftTypes.cs` file becomes unreviewable.
  **Mitigation:** allow controlled partial-file split while preserving namespaces and API shape.
- **Risk:** false progress via stubbed placeholders in `ImplBacklog`.
  **Mitigation:** mandatory anti-stub scans and hard promotion gates.

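
An anti-stub scan of the kind named in the last mitigation could be as simple as a line-by-line pattern match over touched files. The Go sketch below shows the idea. The three patterns are illustrative assumptions about what a stub or fake-pass test might look like in C# sources, not the project's actual scan rules, and `scanForStubs` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// stubPatterns are illustrative assumptions, not the project's real rules.
var stubPatterns = []*regexp.Regexp{
	regexp.MustCompile(`throw\s+new\s+NotImplementedException`),
	regexp.MustCompile(`Assert\.True\(\s*true\s*\)`),
	regexp.MustCompile(`(?i)//\s*TODO`),
}

// finding records one suspicious line (1-based line number and trimmed text).
type finding struct {
	line int
	text string
}

// scanForStubs returns every line of src that matches at least one pattern.
func scanForStubs(src string) []finding {
	var out []finding
	for i, line := range strings.Split(src, "\n") {
		for _, p := range stubPatterns {
			if p.MatchString(line) {
				out = append(out, finding{i + 1, strings.TrimSpace(line)})
				break // report each line once
			}
		}
	}
	return out
}

func main() {
	// Hypothetical C# test body used only to exercise the scanner.
	src := "public void Shutdown_ShouldStop()\n{\n    Assert.True(true);\n}\n"
	for _, f := range scanForStubs(src) {
		fmt.Printf("line %d: %s\n", f.line, f.text)
	}
}
```

A scan like this only catches textual patterns, so it complements the requirement for real behavioral assertions and per-test evidence and does not replace it.
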
## Success Criteria

- All 85 features are either `verified` with evidence or `deferred` with specific blocker reason (no placeholders).
- Batch 30 mapped tests are processed under the same rule: verified only with real execution evidence, otherwise deferred with explicit reason.
- No new stub/fake-pass patterns in touched production or test files.
- The implementation plan includes strict verification and anti-stub guardrails adapted to both features and tests.

## Non-Goals

- Executing the implementation as part of this design document.
- Completing downstream Batch 31+ raft/clustering behavior that is outside Batch 30 feature IDs.
- Building new distributed integration infrastructure beyond what is needed for deterministic unit-level verification.