natsnet/docs/plans/2026-02-27-batch-12-filestore-recovery-plan.md
Joseph Doherty f0455a1e45 Add batch plans for batches 6-7, 9-12, 16-17 (rounds 4-7)
Generated design docs and implementation plans via Codex for:
- Batch 6: Opts package-level functions
- Batch 7: Opts class methods + Reload
- Batch 9: Auth, DirStore, OCSP foundations
- Batch 10: OCSP Cache + JS Events
- Batch 11: FileStore Init
- Batch 12: FileStore Recovery
- Batch 16: Client Core (first half)
- Batch 17: Client Core (second half)

All plans include mandatory verification protocol and anti-stub guardrails.
Updated batches.md with file paths and planned status.
2026-02-27 14:56:19 -05:00


# Batch 12 FileStore Recovery Implementation Plan
> **For Codex:** REQUIRED SUB-SKILL: Use `executeplan` to implement this plan task-by-task.
**Goal:** Implement and verify all Batch 12 FileStore Recovery features from `server/filestore.go` with no stub logic and evidence-backed status transitions.
**Architecture:** Execute Batch 12 in two vertical feature groups (5 + 3). Implement recovery logic directly in `JetStream/FileStore.cs`, touching supporting JetStream types only when required. After each group, run strict stub scans, build, and related test gates before any status updates.
**Tech Stack:** .NET 10, C# latest, xUnit 3, Shouldly, NSubstitute, PortTracker CLI, SQLite (`porting.db`)
**Design doc:** `docs/plans/2026-02-27-batch-12-filestore-recovery-design.md`
---
## Batch Inputs
- Batch: `12` (`FileStore Recovery`)
- Depends on: Batch `11`
- Features: `8`
- Tests: `0` (batch-owned), with known related reverse dependencies:
  - test `#519` (`FileStoreRecoverFullStateDetectCorruptState_ShouldSucceed`)
  - test `#545` (`FileStoreNoPanicOnRecoverTTLWithCorruptBlocks_ShouldSucceed`)
- Go source scope: `golang/nats-server/server/filestore.go` lines ~1708-2580
Feature groups (max ~20 features each):
- **Group 1 (5):** `987,988,991,992,993`
- **Group 2 (3):** `995,996,997`
---
## MANDATORY VERIFICATION PROTOCOL
> **NON-NEGOTIABLE:** Every task and every status update in this plan must follow this protocol.
### Per-Feature Verification Loop (REQUIRED for every feature ID)
For each feature ID in the active group:
1. Read feature mapping and exact Go intent:
```bash
/usr/local/share/dotnet/dotnet run --project tools/NatsNet.PortTracker -- feature show <FEATURE_ID> --db porting.db
```
2. Read corresponding Go method span in `golang/nats-server/server/filestore.go`.
3. Implement minimal real C# behavior (no placeholders).
4. Build immediately:
```bash
/usr/local/share/dotnet/dotnet build dotnet/
```
5. Run related tests for the touched behavior (see Test Gate below).
6. Record evidence (command + summary output) before adding the ID to status-update candidates.
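Step 6 can be made mechanical with a small wrapper. The following is a minimal sketch and not part of PortTracker: the function name `run_with_evidence`, the `EVIDENCE_DIR` variable, and the log layout are assumptions made for illustration. Only the `/tmp/batch12-evidence/` default comes from this plan.

```bash
# Hypothetical helper: run a command, store its full output as evidence,
# and print a one-line summary. Names and log layout are assumptions.
EVIDENCE_DIR="${EVIDENCE_DIR:-/tmp/batch12-evidence}"

run_with_evidence() {
  local label="$1"
  shift
  mkdir -p "$EVIDENCE_DIR"
  local log="$EVIDENCE_DIR/$label.log"
  local status=0
  # First line of the log records the exact command that was run.
  printf '$ %s\n' "$*" > "$log"
  # Capture stdout and stderr; remember the exit code without aborting.
  "$@" >> "$log" 2>&1 || status=$?
  echo "exit=$status" >> "$log"
  echo "[$label] exit=$status log=$log"
  return "$status"
}
```

For example, `run_with_evidence 991-build /usr/local/share/dotnet/dotnet build dotnet/` would write `991-build.log` under the evidence directory and return the build's exit code.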
### Stub Detection Check (REQUIRED after each feature group)
Run all scans below. Any match is a hard blocker:
```bash
# Production placeholder detection
rg -n "NotImplementedException|TODO|PLACEHOLDER" \
  dotnet/src/ZB.MOM.NatsNet.Server/JetStream -g '*.cs'

# Empty method bodies (FileStore recovery surface)
rg -n "^\s*(public|private|internal|protected).*(Warn|Debug|RecoverFullState|RecoverTTLState|RecoverMsgSchedulingState|CleanupOldMeta|RecoverMsgs|ExpireMsgsOnRecover)\s*\([^)]*\)\s*\{\s*\}$" \
  dotnet/src/ZB.MOM.NatsNet.Server/JetStream/FileStore.cs

# Test placeholders in directly related classes
rg -n "NotImplementedException|Assert\.True\(true\)|Assert\.Pass|// TODO|// PLACEHOLDER" \
  dotnet/tests/ZB.MOM.NatsNet.Server.Tests/JetStream/JetStreamFileStoreTests.cs \
  dotnet/tests/ZB.MOM.NatsNet.Server.Tests/ImplBacklog/JetStreamFileStoreTests.Impltests.cs
```
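The "any match is a hard blocker" rule can be enforced by turning a scan into a pass/fail gate. A minimal sketch, assuming a hypothetical `stub_scan` function; it uses `grep -rnE` in place of `rg` so that it runs without ripgrep installed, and it covers only the production placeholder scan:

```bash
# Hypothetical gate: exit non-zero when a placeholder marker appears in any
# .cs file under the given directory. grep stands in for rg here.
stub_scan() {
  local dir="$1"
  local hits
  if [ ! -d "$dir" ]; then
    echo "ERROR: not a directory: $dir"
    return 2
  fi
  # grep exits 1 when nothing matches; that is the success case for this gate.
  hits="$(grep -rnE --include='*.cs' 'NotImplementedException|TODO|PLACEHOLDER' "$dir" || true)"
  if [ -n "$hits" ]; then
    echo "BLOCKED: stub markers found"
    echo "$hits"
    return 1
  fi
  echo "OK: no stub markers"
}
```

Calling `stub_scan dotnet/src/ZB.MOM.NatsNet.Server/JetStream` before a status update would stop the cycle on the first placeholder rather than relying on someone reading the scan output.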
### Build Gate (REQUIRED after each feature group)
This must pass before status updates and before moving to next group:
```bash
/usr/local/share/dotnet/dotnet build dotnet/
```
### Test Gate (REQUIRED before marking features `verified`)
All related tests must pass. Run at least:
```bash
# Existing JetStream FileStore coverage
/usr/local/share/dotnet/dotnet test dotnet/tests/ZB.MOM.NatsNet.Server.Tests/ \
  --filter "FullyQualifiedName~ZB.MOM.NatsNet.Server.Tests.JetStream.JetStreamFileStoreTests" \
  --verbosity normal

# Backlog coverage for FileStore implementation
/usr/local/share/dotnet/dotnet test dotnet/tests/ZB.MOM.NatsNet.Server.Tests/ \
  --filter "FullyQualifiedName~ZB.MOM.NatsNet.Server.Tests.ImplBacklog.JetStreamFileStoreTests" \
  --verbosity normal

# Feature-linked methods from reverse dependencies
/usr/local/share/dotnet/dotnet test dotnet/tests/ZB.MOM.NatsNet.Server.Tests/ \
  --filter "FullyQualifiedName~FileStoreRecoverFullStateDetectCorruptState|FullyQualifiedName~FileStoreNoPanicOnRecoverTTLWithCorruptBlocks" \
  --verbosity normal
```
Gate rule:
- If related tests run and pass, the feature is eligible for `verified`.
- If related tests are unavailable or not yet implemented (0 discovered), the feature may be set to `complete` only, with an explicit note explaining why `verified` is deferred.
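The gate rule reduces to a decision on two numbers: how many related tests were discovered and how many failed. A minimal sketch, assuming a hypothetical `gate_status` helper; the `blocked` label is illustrative only and is not a PortTracker status:

```bash
# Hypothetical decision helper for the gate rule.
# Arguments: <tests discovered> <tests failed>
gate_status() {
  local discovered="$1"
  local failed="$2"
  if [ "$discovered" -eq 0 ]; then
    # Nothing ran: stop at complete and record why verified is deferred.
    echo "complete"
  elif [ "$failed" -eq 0 ]; then
    echo "verified"
  else
    # Failing related tests: no promotion until they pass.
    echo "blocked"
  fi
}
```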
### Status Update Protocol (REQUIRED)
- Use at most **15 IDs** per `feature batch-update` call.
- Required status progression: `deferred -> stub -> complete -> verified`.
- Do not mark `verified` without evidence from Build Gate + Test Gate.
- Keep an evidence log folder (example: `/tmp/batch12-evidence/`) with per-group command outputs.
Examples:
```bash
# Move active group to stub before editing
/usr/local/share/dotnet/dotnet run --project tools/NatsNet.PortTracker -- \
  feature batch-update --ids "987,988,991,992,993" --set-status stub --db porting.db --execute

# Move group to complete after successful implementation + build/test evidence
/usr/local/share/dotnet/dotnet run --project tools/NatsNet.PortTracker -- \
  feature batch-update --ids "987,988,991,992,993" --set-status complete --db porting.db --execute
```
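Neither group in this batch exceeds the 15-ID limit, but the limit is easy to violate when the same commands are reused for a larger batch. A minimal sketch of splitting a comma-separated ID list into compliant chunks, assuming a hypothetical `chunk_ids` helper (not a PortTracker command):

```bash
# Hypothetical helper: split a comma-separated ID list into lines of at most
# N IDs (default 15), one line per batch-update call.
chunk_ids() {
  local max="${2:-15}"
  printf '%s\n' "$1" | tr ',' '\n' | awk -v max="$max" '
    {
      # Start a new chunk without a separator; otherwise join with a comma.
      printf "%s%s", ((NR % max == 1 || max == 1) ? "" : ","), $0
      if (NR % max == 0) printf "\n"
    }
    END { if (NR % max != 0) printf "\n" }'
}
```

Each output line can then be passed as the `--ids` value of one `feature batch-update` call.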
### Checkpoint Protocol Between Tasks (REQUIRED)
At each group boundary:
1. Full build:
```bash
/usr/local/share/dotnet/dotnet build dotnet/
```
2. Full unit test sweep (not just filtered):
```bash
/usr/local/share/dotnet/dotnet test dotnet/tests/ZB.MOM.NatsNet.Server.Tests/ --verbosity normal
```
3. Commit checkpoint before next task:
```bash
git add dotnet/src/ZB.MOM.NatsNet.Server/JetStream \
  dotnet/tests/ZB.MOM.NatsNet.Server.Tests \
  porting.db
git commit -m "feat(batch12): complete group <N> filestore recovery"
```
---
## ANTI-STUB GUARDRAILS (NON-NEGOTIABLE)
### Forbidden Patterns
The following are forbidden in Batch 12 feature or related test code:
- `throw new NotImplementedException(...)`
- Empty recovery method bodies (`{ }`)
- `// TODO` or `// PLACEHOLDER` in implemented recovery methods
- Fake test pass patterns (`Assert.True(true)`, `Assert.Pass()`, assertion-only smoke checks that do not exercise production behavior)
- Swallowing corruption/IO errors silently instead of preserving Go intent
### Hard Limits
- Max ~20 features per implementation group (fixed here as 5 and 3)
- Max 15 feature IDs per status-update command
- One feature group per verification/update cycle
- Zero stub-scan matches before `complete` or `verified` transitions
- No `verified` transition without explicit Build Gate + Test Gate evidence
### If You Get Stuck (MANDATORY)
1. Do **not** add a stub, placeholder, or no-op workaround.
2. Mark only blocked feature IDs as `deferred` with a concrete reason.
3. Continue with remaining IDs in the group.
4. Record blocker details in evidence log and PortTracker override reason.
Example:
```bash
/usr/local/share/dotnet/dotnet run --project tools/NatsNet.PortTracker -- \
  feature update <ID> --status deferred --db porting.db \
  --override "blocked: <specific technical reason>"
```
---
### Task 1: Batch Start and Group 1 Staging
**Files:**
- Modify: `porting.db`
- Create: `/tmp/batch12-evidence/` (evidence logs)
**Step 1: Confirm current batch state**
Run:
```bash
/usr/local/share/dotnet/dotnet run --project tools/NatsNet.PortTracker -- batch show 12 --db porting.db
/usr/local/share/dotnet/dotnet run --project tools/NatsNet.PortTracker -- batch list --db porting.db
/usr/local/share/dotnet/dotnet run --project tools/NatsNet.PortTracker -- report summary --db porting.db
```
Expected: Batch 12 is pending, depends on Batch 11, and lists 8 features and 0 tests.
**Step 2: Start batch**
Run:
```bash
/usr/local/share/dotnet/dotnet run --project tools/NatsNet.PortTracker -- batch start 12 --db porting.db
```
Expected: batch marked in-progress.
**Step 3: Stage Group 1 IDs to `stub`**
Run:
```bash
/usr/local/share/dotnet/dotnet run --project tools/NatsNet.PortTracker -- \
  feature batch-update --ids "987,988,991,992,993" --set-status stub --db porting.db --execute
```
Expected: only Group 1 IDs set to `stub`.
**Step 4: Commit checkpoint**
```bash
git add porting.db
git commit -m "chore(batch12): start batch and stage group1 recovery ids"
```
### Task 2: Implement Group 1 Recovery Features (5 IDs)
**Files:**
- Modify: `dotnet/src/ZB.MOM.NatsNet.Server/JetStream/FileStore.cs`
- Modify (if needed): `dotnet/src/ZB.MOM.NatsNet.Server/JetStream/MessageBlock.cs`
- Modify (if needed): `dotnet/src/ZB.MOM.NatsNet.Server/JetStream/FileStoreTypes.cs`
**Feature IDs:** `987,988,991,992,993`
**Step 1: Implement logging helpers**
- ID `987` (`Warn`) and ID `988` (`Debug`): prefix each message with FileStore context, and do nothing when no logger/server is available.
**Step 2: Implement full-state recovery**
- ID `991` (`RecoverFullState`): stream state file load, length/checksum validation, decode, stale/corrupt fallback signaling.
**Step 3: Implement TTL and schedule recovery**
- ID `992` (`RecoverTTLState`)
- ID `993` (`RecoverMsgSchedulingState`)
- For both, include the linear-scan fallback over recovered message blocks for the case where the persisted state is stale.
**Step 4: Run mandatory verification protocol for Group 1**
- Per-feature loop for all 5 IDs.
- Stub Detection Check.
- Build Gate.
- Test Gate.
**Step 5: Status updates (chunk <=15)**
- Set Group 1 IDs to `complete` after successful evidence.
- Promote to `verified` only if Test Gate evidence is sufficient for each feature.
### Task 3: Group 1 Checkpoint
**Files:**
- Modify: `porting.db`
**Step 1: Run Checkpoint Protocol**
- Full build + full unit tests.
**Step 2: Commit**
```bash
git add dotnet/src/ZB.MOM.NatsNet.Server/JetStream \
  dotnet/tests/ZB.MOM.NatsNet.Server.Tests \
  porting.db
git commit -m "feat(batch12): complete group1 filestore recovery"
```
### Task 4: Implement Group 2 Recovery Features (3 IDs)
**Files:**
- Modify: `dotnet/src/ZB.MOM.NatsNet.Server/JetStream/FileStore.cs`
- Modify (if needed): `dotnet/src/ZB.MOM.NatsNet.Server/JetStream/MessageBlock.cs`
- Modify (if needed): `dotnet/src/ZB.MOM.NatsNet.Server/JetStream/FileStoreTypes.cs`
**Feature IDs:** `995,996,997`
**Step 1: Implement metadata cleanup**
- ID `995` (`CleanupOldMeta`): remove stale metadata file types in message directory safely.
**Step 2: Implement ordered message block recovery**
- ID `996` (`RecoverMsgs`): enumerate/sort blocks, recover block state, reconcile stream accounting, prune orphan keys.
**Step 3: Implement startup expiration path**
- ID `997` (`ExpireMsgsOnRecover`): max-age pass at startup, per-subject updates, empty-block cleanup, tombstone continuity.
**Step 4: Run mandatory verification protocol for Group 2**
- Per-feature loop for all 3 IDs.
- Stub Detection Check.
- Build Gate.
- Test Gate.
**Step 5: Status updates (chunk <=15)**
- Set Group 2 IDs to `complete`, then `verified` only when test evidence criteria are met.
### Task 5: Group 2 Checkpoint and Batch Closure
**Files:**
- Modify: `porting.db`
- Generate: `reports/current.md`
**Step 1: Final gates**
Run:
```bash
/usr/local/share/dotnet/dotnet build dotnet/
/usr/local/share/dotnet/dotnet test dotnet/tests/ZB.MOM.NatsNet.Server.Tests/ --verbosity normal
/usr/local/share/dotnet/dotnet test dotnet/tests/ZB.MOM.NatsNet.Server.IntegrationTests/ --verbosity normal
```
Expected: zero failures in executed suites.
**Step 2: Verify batch status and unblocked work**
Run:
```bash
/usr/local/share/dotnet/dotnet run --project tools/NatsNet.PortTracker -- batch show 12 --db porting.db
/usr/local/share/dotnet/dotnet run --project tools/NatsNet.PortTracker -- dependency ready --db porting.db
/usr/local/share/dotnet/dotnet run --project tools/NatsNet.PortTracker -- report summary --db porting.db
```
**Step 3: Complete batch**
Run:
```bash
/usr/local/share/dotnet/dotnet run --project tools/NatsNet.PortTracker -- batch complete 12 --db porting.db
```
Expected: completion succeeds only if all items meet allowed terminal states.
**Step 4: Generate report + commit**
```bash
./reports/generate-report.sh
git add dotnet/src/ZB.MOM.NatsNet.Server/JetStream \
  dotnet/tests/ZB.MOM.NatsNet.Server.Tests \
  porting.db reports/
git commit -m "feat(batch12): complete filestore recovery"
```