natsnet/docs/plans/2026-02-27-batch-31-raft-part-2-design.md

# Batch 31 Raft Part 2 Design

**Date:** 2026-02-27
**Batch:** 31 (`Raft Part 2`)
**Scope:** 53 features + 19 unit tests
**Dependencies:** batch `30` (`Raft Part 1`)
**Go source:** `golang/nats-server/server/raft.go`

## Problem

Batch 31 covers the second Raft tranche in `raft.go` (roughly lines `3239-5038`), focused on catchup/snapshot transfer, append-entry processing, WAL consistency, quorum tracking, vote request/response handling, and leadership state transitions. The mapped test set (19 tests) is concentrated on candidate/leader transitions, quorum correctness, membership-change edge cases, and snapshot/catchup behavior.

The design goal is to produce an execution-ready plan that enforces evidence-based status changes and prevents placeholder drift across both production features and tests.

## Context Findings

### Required command results

- `batch show 31 --db porting.db`
  - Status: `pending`
  - Features: `53` (currently `deferred`)
  - Tests: `19` (currently `deferred`)
  - Depends on: `30`
  - Go file: `server/raft.go`
- `batch list --db porting.db`
  - Batch 31 is directly gated by Batch 30 and itself gates Batch 32 (`JS Cluster Meta`).
- `report summary --db porting.db`
  - Overall progress: `1924/6942 (27.7%)`
  - Deferred backlog remains large; verification discipline is required.

### Feature and source mapping findings

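- Batch 31 feature IDs map in order to `raft.go` methods. The ranges below are feature-ID spans, not source line numbers, and the second and third spans skip some IDs (for example `2757` and `2781`):
  - `sendSnapshotToFollower` through `updateLeader` (IDs `2733-2750`)
  - `processAppendEntry` through `setWriteErrLocked` (IDs `2751-2777`)
  - `isClosed` through `switchToLeader` (IDs `2778-2796`)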
- Existing .NET Raft surface is in:
  - `dotnet/src/ZB.MOM.NatsNet.Server/JetStream/RaftTypes.cs`
- Current comments in `RaftTypes.cs` still describe algorithm methods as stubbed; Batch 31 must replace those gaps with concrete behavior and tests.

### Test mapping findings

- All 19 mapped tests are from `server/raft_test.go` and map to `RaftNodeTests` methods.
- `dotnet/tests/ZB.MOM.NatsNet.Server.Tests/ImplBacklog/RaftNodeTests.Impltests.cs` does not currently exist, so Batch 31 planning should include creating it.
- The mapped tests are behavior-heavy; they cannot be verified using placeholder assertions.

## Approaches

### Approach A: Monolithic implementation of all 53 features and 19 tests in one pass

- Pros: single sweep.
- Cons: high regression risk, weak traceability, hard to isolate failures.

### Approach B (Recommended): Three feature groups (<=20 each) plus two test waves

- Features are implemented in ordered method clusters, each with strict gates before status updates.
- Tests are ported in two behavioral waves (state/quorum first, then snapshot/membership edge cases).
- Pros: bounded scope, better failure isolation, cleaner status evidence.
- Cons: more checkpoint overhead.

### Approach C: Test-first across all 19 tests, then fill feature gaps

- Pros: quickly exposes missing behavior.
- Cons: expensive thrash because many tests depend on broad feature slices.

**Decision:** Approach B.

## Proposed Design

### 1. Architecture and file strategy

- Keep Raft runtime behavior in `JetStream/RaftTypes.cs`, optionally split into partial-class files if file size hurts reviewability:
  - `RaftTypes.Catchup.cs`
  - `RaftTypes.AppendProcessing.cs`
  - `RaftTypes.Elections.cs`
- Keep the test implementation in the dedicated mapped backlog file:
  - `dotnet/tests/ZB.MOM.NatsNet.Server.Tests/ImplBacklog/RaftNodeTests.Impltests.cs`
- Reuse existing support types (`IpQueue<T>`, `Channel<T>`, lock + `Interlocked`) and avoid introducing new infra unless required for deterministic testability.

### 2. Feature slicing (max 20 per group)

- **Feature Group A (18): catchup/snapshot/commit foundations**
  `2733,2734,2735,2736,2737,2738,2739,2740,2741,2742,2743,2744,2745,2746,2747,2748,2749,2750`
- **Feature Group B (18): append-entry processing and peer/WAL state**
  `2751,2752,2753,2754,2755,2756,2758,2759,2760,2761,2765,2766,2767,2768,2769,2772,2776,2777`
- **Feature Group C (17): vote/RPC/state transitions**
  `2778,2779,2780,2783,2784,2785,2786,2787,2788,2789,2790,2791,2792,2793,2794,2795,2796`
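The slicing above can be sanity-checked mechanically. The following is a minimal Go sketch, assuming the three ID lists are copied verbatim from this section; `checkGroups` and `idRange` are hypothetical helper names, not part of the porting tooling. It confirms that no group exceeds 20 IDs, that no ID is assigned twice, and that the groups add up to the 53 features in scope.

```go
package main

import "fmt"

// idRange returns the inclusive sequence lo..hi.
func idRange(lo, hi int) []int {
	ids := make([]int, 0, hi-lo+1)
	for id := lo; id <= hi; id++ {
		ids = append(ids, id)
	}
	return ids
}

// featureGroups mirrors the three feature ID lists in this section.
var featureGroups = map[string][]int{
	"A": idRange(2733, 2750),
	"B": {2751, 2752, 2753, 2754, 2755, 2756, 2758, 2759, 2760,
		2761, 2765, 2766, 2767, 2768, 2769, 2772, 2776, 2777},
	"C": append([]int{2778, 2779, 2780}, idRange(2783, 2796)...),
}

// checkGroups returns the total number of IDs, or an error if a group
// exceeds maxPerGroup or an ID appears more than once.
func checkGroups(groups map[string][]int, maxPerGroup int) (int, error) {
	seen := make(map[int]string)
	total := 0
	for name, ids := range groups {
		if len(ids) > maxPerGroup {
			return 0, fmt.Errorf("group %s has %d IDs, limit is %d", name, len(ids), maxPerGroup)
		}
		for _, id := range ids {
			if prev, dup := seen[id]; dup {
				return 0, fmt.Errorf("ID %d appears in groups %s and %s", id, prev, name)
			}
			seen[id] = name
			total++
		}
	}
	return total, nil
}

func main() {
	total, err := checkGroups(featureGroups, 20)
	if err != nil {
		fmt.Println("invalid slicing:", err)
		return
	}
	fmt.Println("total features:", total) // prints "total features: 53"
}
```

The same check applies unchanged to the test waves if their ID lists are passed in place of `featureGroups`.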

### 3. Test slicing

- **Test Wave T1 (10): state/quorum/election behavior**
  `2626,2629,2635,2636,2663,2664,2667,2687,2690,2692`
- **Test Wave T2 (9): snapshot/catchup/membership-vote edge cases**
  `2650,2651,2693,2694,2702,2704,2705,2712,2714`

### 4. Verification model

- Enforce per-feature and per-test loops (red/green + stub scan + build/test gates).
- Enforce status-update chunking (`<=15` IDs per `feature/test batch-update`), so each feature group (17-18 IDs) needs at least two updates.
- Enforce checkpoint protocol after every group/wave before proceeding.
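The chunking rule can be illustrated with a short Go sketch. It assumes only the `<=15` limit stated above; `chunkIDs` is a hypothetical helper, and the sketch deliberately prints the chunks rather than guessing at the exact command-line form of the status update.

```go
package main

import "fmt"

// chunkIDs splits ids into consecutive slices of at most size elements.
// It returns nil when size is not positive.
func chunkIDs(ids []int, size int) [][]int {
	if size <= 0 {
		return nil
	}
	var chunks [][]int
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		chunks = append(chunks, ids[start:end])
	}
	return chunks
}

func main() {
	// Feature Group C from the slicing section (17 IDs).
	groupC := []int{2778, 2779, 2780, 2783, 2784, 2785, 2786, 2787, 2788,
		2789, 2790, 2791, 2792, 2793, 2794, 2795, 2796}

	// With a limit of 15, Group C needs two status updates: 15 IDs, then 2.
	for i, chunk := range chunkIDs(groupC, 15) {
		fmt.Printf("update %d: %d IDs %v\n", i+1, len(chunk), chunk)
	}
}
```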

### 5. Stuck-item policy

- A blocked item must never be left in a pseudo-implemented state.
- If blocked, set `deferred` immediately with explicit reason via `--override`, then continue with next unblocked ID.
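The policy reduces to a small decision rule, sketched here in Go. The `item` type, its fields, and `resolveStatus` are hypothetical names introduced for illustration; only the `verified` and `deferred` status values come from this plan. An item with neither passing evidence nor a blocker reason is rejected rather than promoted.

```go
package main

import (
	"errors"
	"fmt"
)

// item is a hypothetical record for one feature or test ID.
type item struct {
	ID             int
	EvidencePassed bool   // build/test gates passed for this ID
	BlockerReason  string // non-empty when the item is blocked
}

var errUnresolved = errors.New("item has neither passing evidence nor a blocker reason")

// resolveStatus returns the only status an item may be moved to.
func resolveStatus(it item) (string, error) {
	switch {
	case it.EvidencePassed:
		return "verified", nil
	case it.BlockerReason != "":
		return "deferred", nil
	default:
		// Neither evidence nor reason: refuse to change status.
		return "", errUnresolved
	}
}

func main() {
	items := []item{
		{ID: 2751, EvidencePassed: true},
		{ID: 2752, BlockerReason: "hypothetical: needs an unported Batch 30 type"},
		{ID: 2753},
	}
	for _, it := range items {
		status, err := resolveStatus(it)
		if err != nil {
			fmt.Printf("%d: cannot update status: %v\n", it.ID, err)
			continue
		}
		fmt.Printf("%d: %s\n", it.ID, status)
	}
}
```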

## Risks and Mitigations

- **Risk:** An incomplete Batch 30 dependency blocks execution.
  **Mitigation:** preflight dependency gate is mandatory; no Batch 31 status updates until Batch 30 is complete/ready.
- **Risk:** The large `processAppendEntry` method causes hidden regressions.
  **Mitigation:** isolate with focused tests per behavior branch plus class-level gates.
- **Risk:** Fake progress via placeholder methods/tests.
  **Mitigation:** mandatory anti-stub scans and hard promotion gates.
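An anti-stub scan of the kind named in the last mitigation could look roughly like the Go sketch below. The marker list is an assumption chosen for illustration, since this plan does not specify the exact patterns; `scanForStubs` and `finding` are hypothetical names. The sketch reports each line of a source text that contains a marker.

```go
package main

import (
	"fmt"
	"strings"
)

// stubMarkers is an assumed, illustrative list of placeholder patterns.
// The real scan patterns are not specified in this document.
var stubMarkers = []string{"NotImplementedException", "TODO", "Assert.True(true)"}

// finding records one marker hit and its 1-based line number.
type finding struct {
	Line   int
	Marker string
}

// scanForStubs returns every (line, marker) pair found in src.
func scanForStubs(src string) []finding {
	var findings []finding
	for i, text := range strings.Split(src, "\n") {
		for _, marker := range stubMarkers {
			if strings.Contains(text, marker) {
				findings = append(findings, finding{Line: i + 1, Marker: marker})
			}
		}
	}
	return findings
}

func main() {
	// A made-up C# fragment standing in for a touched production file.
	sample := "public void SwitchToLeader()\n{\n    throw new NotImplementedException();\n}\n"

	hits := scanForStubs(sample)
	for _, h := range hits {
		fmt.Printf("line %d: found %q\n", h.Line, h.Marker)
	}
	if len(hits) > 0 {
		fmt.Println("promotion gate: FAIL")
	}
}
```

A plain substring match will also flag markers inside comments or strings, which is the conservative choice for a promotion gate; a stricter tool could parse the source instead.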

## Success Criteria

- All 53 features are either `verified` with evidence or `deferred` with explicit blocker reason.
- All 19 tests are either `verified` with execution evidence or `deferred` with explicit blocker reason.
- No placeholder/stub patterns in touched production or test code.
- Batch-completion readiness is auditable through build/test outputs and chunked status updates.

## Non-Goals

- Executing the implementation as part of this design doc.
- Implementing Batch 32+ scope.
- Building new distributed integration infrastructure beyond deterministic unit-level needs.