natsnet/docs/plans/2026-02-27-batch-12-filestore-recovery-design.md
Joseph Doherty f0455a1e45 Add batch plans for batches 6-7, 9-12, 16-17 (rounds 4-7)
Generated design docs and implementation plans via Codex for:
- Batch 6: Opts package-level functions
- Batch 7: Opts class methods + Reload
- Batch 9: Auth, DirStore, OCSP foundations
- Batch 10: OCSP Cache + JS Events
- Batch 11: FileStore Init
- Batch 12: FileStore Recovery
- Batch 16: Client Core (first half)
- Batch 17: Client Core (second half)

All plans include mandatory verification protocol and anti-stub guardrails.
Updated batches.md with file paths and planned status.
2026-02-27 14:56:19 -05:00

Batch 12 FileStore Recovery Design

Date: 2026-02-27
Batch: 12 (FileStore Recovery)
Dependency: Batch 11 (FileStore Init)
Scope: 8 features + 0 tests mapped to golang/nats-server/server/filestore.go


1. Context and Constraints

  • PortTracker reports Batch 12 as pending, dependent on Batch 11, with 8 deferred features and no batch-owned tests.
  • Batch 12 methods are all in the FileStore recovery path:
    • warn (ID 987)
    • debug (ID 988)
    • recoverFullState (ID 991)
    • recoverTTLState (ID 992)
    • recoverMsgSchedulingState (ID 993)
    • cleanupOldMeta (ID 995)
    • recoverMsgs (ID 996)
    • expireMsgsOnRecover (ID 997)
  • Current .NET JetStreamFileStore is still a mostly delegated shell; these recovery methods are not implemented yet.
  • Even though Batch 12 has 0 mapped tests, reverse dependencies exist for at least:
    • test #519 (RecoverFullState corruption detection)
    • test #545 (RecoverTTLState corrupt block resilience)
  • Planning-only session: produce design and implementation plan documents; do not implement feature code.

2. Success Criteria

  • Implement all 8 Batch 12 features with behavioral parity to the Go intent for the recovery flows in filestore.go.
  • Introduce recovery logic without stubs/placeholders in either production or related test files.
  • Verify each feature using a strict per-feature loop: Go-source read, C# implementation, build, related tests.
  • Use evidence-backed PortTracker status updates in small chunks and keep dependency/order integrity.

3. Brainstormed Approach Options

Option A: Group methods by lifecycle stage, then run full build/test/status gates between groups

  • Pros: low integration risk, easy isolation of regressions, aligns with mandatory checkpointing.
  • Cons: more build/test cycles.

Option B: Implement all 8 methods in one pass, verify once at the end

  • Pros: fastest coding flow initially.
  • Cons: high blast radius if behavior diverges; poor auditability for status updates.

Option C: Logging + cleanup first, then all recovery methods together

  • Pros: simple early wins on low-complexity methods.
  • Cons: still delays hardest-state validation and can hide ordering bugs until late.

Recommendation

Use Option A. Recovery behavior is stateful and failure-prone; bounded groups plus hard gates are the safest path.

4. Proposed Design

4.1 Feature Groups (max ~20 each)

  • Group 1 (5 features): 987, 988, 991, 992, 993
    • Logging wrappers and persisted state/TTL/scheduling recovery.
  • Group 2 (3 features): 995, 996, 997
    • Metadata cleanup, message block recovery sweep, startup age-expiration pass.

4.2 Primary Code Boundaries

  • Production:
    • dotnet/src/ZB.MOM.NatsNet.Server/JetStream/FileStore.cs (main implementation surface)
    • dotnet/src/ZB.MOM.NatsNet.Server/JetStream/MessageBlock.cs (state and helper interactions used by recovery logic)
    • dotnet/src/ZB.MOM.NatsNet.Server/JetStream/FileStoreTypes.cs (supporting types/constants as needed)
  • Related tests and verification targets:
    • dotnet/tests/ZB.MOM.NatsNet.Server.Tests/JetStream/JetStreamFileStoreTests.cs
    • dotnet/tests/ZB.MOM.NatsNet.Server.Tests/ImplBacklog/JetStreamFileStoreTests.Impltests.cs

4.3 Recovery Flow to Port

  1. Observability wrappers (warn, debug): no-op when server/logger is absent; prefix context consistently.
  2. Full state recovery (recoverFullState):
    • Read state file, validate minimum length and checksum, decrypt when configured.
    • Decode stream + subject + block state, detect corruption/outdated state, and force rebuild fallback conditions.
  3. TTL and scheduling recovery (recoverTTLState, recoverMsgSchedulingState):
    • Decode persisted wheels/schedules.
    • If the persisted state is stale, scan linearly from the recovered sequence through the message blocks to rebuild runtime structures.
  4. Metadata cleanup (cleanupOldMeta): remove stale .idx/.fss files in message directory.
  5. Message recovery (recoverMsgs): enumerate blocks in order, recover each block, reconcile stream first/last/msgs/bytes, prune orphan key files.
  6. Startup age expiration (expireMsgsOnRecover): safely expire messages and blocks that have exceeded the maximum age, preserve tombstone behavior, and recompute top-level state.

4.4 Error Handling and Safety

  • A corrupt or mismatched full state must fail closed and trigger the rebuild path instead of continuing with partial state.
  • IO permissions/corruption conditions should be surfaced with actionable context (not swallowed).
  • Lock ordering must remain consistent with existing FileStore concurrency patterns to avoid recovery-time deadlocks.
  • A feature must not be marked verified on build-only evidence.
  • Each feature needs:
    • Go-source intent review
    • build pass
    • related test execution pass (or explicit deferred reason if test infra is unavailable)
  • If related tests for a feature are still deferred or unported, the feature may be moved to complete but not to verified, unless equivalent evidence exists and is documented.

5. Risks and Mitigations

  • Risk: recovery logic complexity leads to placeholder shortcuts.
    Mitigation: mandatory stub scans after every group for both source and tests.
  • Risk: subtle state-accounting regressions (first/last/msgs/bytes).
    Mitigation: group gates require targeted tests plus JetStream store regression filters.
  • Risk: aggressive status updates without proof.
    Mitigation: cap status update chunks to 15 IDs and require command-output evidence per chunk.
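
The chunk cap in the last mitigation amounts to splitting a list of feature IDs into slices of at most 15 before each status update. A minimal Go sketch, where `chunkIDs` is a hypothetical helper and not part of PortTracker:

```go
package main

// chunkIDs splits ids into consecutive slices of at most size elements.
// It returns nil when size is not positive or ids is empty.
func chunkIDs(ids []int, size int) [][]int {
	if size <= 0 {
		return nil
	}
	var out [][]int
	for len(ids) > size {
		out = append(out, ids[:size])
		ids = ids[size:]
	}
	if len(ids) > 0 {
		out = append(out, ids)
	}
	return out
}
```

With the 8 Batch 12 IDs and a cap of 15, this yields a single chunk, so the cap only matters if the update is combined with IDs from other batches.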

6. Design Outcome

Batch 12 should be executed as two recovery-focused feature groups with strict evidence gates between them. This design preserves dependency discipline, prevents stub creep, and keeps feature status updates auditable against build/test outputs.