# Batch 12 FileStore Recovery Design

**Date:** 2026-02-27
**Batch:** 12 (`FileStore Recovery`)
**Dependency:** Batch 11 (`FileStore Init`)
**Scope:** 8 features + 0 tests mapped to `golang/nats-server/server/filestore.go`

---

## 1. Context and Constraints

- PortTracker reports Batch 12 as `pending`, dependent on Batch 11, with 8 deferred features and no batch-owned tests.
- Batch 12 methods are all in the FileStore recovery path:
  - `warn` (ID 987)
  - `debug` (ID 988)
  - `recoverFullState` (ID 991)
  - `recoverTTLState` (ID 992)
  - `recoverMsgSchedulingState` (ID 993)
  - `cleanupOldMeta` (ID 995)
  - `recoverMsgs` (ID 996)
  - `expireMsgsOnRecover` (ID 997)
- Current .NET `JetStreamFileStore` is still a mostly delegated shell; these recovery methods are not implemented yet.
- Even though Batch 12 has 0 mapped tests, reverse dependencies exist for at least:
  - test `#519` (`RecoverFullState` corruption detection)
  - test `#545` (`RecoverTTLState` corrupt block resilience)
- Planning-only session: produce design and implementation plan documents; do not implement feature code.

## 2. Success Criteria

- Implement all 8 Batch 12 features with behavior parity to Go intent for recovery flows in `filestore.go`.
- Introduce recovery logic without stubs/placeholders in either production or related test files.
- Verify each feature using a strict per-feature loop: Go-source read, C# implementation, build, related tests.
- Use evidence-backed PortTracker status updates in small chunks and keep dependency/order integrity.

## 3. Brainstormed Approach Options

### Option A (Recommended): Two vertical recovery groups with strict gates

Group methods by lifecycle stage, then run full build/test/status gates between groups.

- Pros: low integration risk, easy isolation of regressions, aligns with mandatory checkpointing.
- Cons: more build/test cycles.

### Option B: Implement all 8 methods in one pass, verify once at the end

- Pros: fastest coding flow initially.
- Cons: high blast radius if behavior diverges; poor auditability for status updates.

### Option C: Logging + cleanup first, then all recovery methods together

- Pros: simple early wins on low-complexity methods.
- Cons: still delays hardest-state validation and can hide ordering bugs until late.

### Recommendation

Use **Option A**. Recovery behavior is stateful and failure-prone; bounded groups plus hard gates are the safest path.

## 4. Proposed Design

### 4.1 Feature Groups (max ~20 each)

- **Group 1 (5 features):** `987, 988, 991, 992, 993`
  - Logging wrappers and persisted state/TTL/scheduling recovery.
- **Group 2 (3 features):** `995, 996, 997`
  - Metadata cleanup, message block recovery sweep, startup age-expiration pass.

### 4.2 Primary Code Boundaries

- Production:
  - `dotnet/src/ZB.MOM.NatsNet.Server/JetStream/FileStore.cs` (main implementation surface)
  - `dotnet/src/ZB.MOM.NatsNet.Server/JetStream/MessageBlock.cs` (state and helper interactions used by recovery logic)
  - `dotnet/src/ZB.MOM.NatsNet.Server/JetStream/FileStoreTypes.cs` (supporting types/constants as needed)
- Related tests and verification targets:
  - `dotnet/tests/ZB.MOM.NatsNet.Server.Tests/JetStream/JetStreamFileStoreTests.cs`
  - `dotnet/tests/ZB.MOM.NatsNet.Server.Tests/ImplBacklog/JetStreamFileStoreTests.Impltests.cs`

### 4.3 Recovery Flow to Port

1. **Observability wrappers** (`warn`, `debug`): no-op when server/logger is absent; prefix context consistently.
2. **Full state recovery** (`recoverFullState`):
   - Read state file, validate minimum length and checksum, decrypt when configured.
   - Decode stream + subject + block state, detect corruption/outdated state, and force rebuild fallback conditions.
3. **TTL and scheduling recovery** (`recoverTTLState`, `recoverMsgSchedulingState`):
   - Decode persisted wheels/schedules.
   - If stale, linear scan from recovered sequence through message blocks to rebuild runtime structures.
4. **Metadata cleanup** (`cleanupOldMeta`): remove stale `.idx`/`.fss` files in message directory.
5. **Message recovery** (`recoverMsgs`): enumerate blocks in order, recover each block, reconcile stream first/last/msgs/bytes, prune orphan key files.
6. **Startup age expiration** (`expireMsgsOnRecover`): expire max-age-out messages/blocks safely, preserve tombstone behavior, and recompute top-level state.

### 4.4 Error Handling and Safety

- Corrupt/full-state mismatch must fail closed and trigger rebuild path instead of continuing with partial state.
- IO permissions/corruption conditions should be surfaced with actionable context (not swallowed).
- Lock ordering must remain consistent with existing FileStore concurrency patterns to avoid recovery-time deadlocks.

### 4.5 Verification Design (Feature + Related Tests)

- Feature verification is not allowed on build-only evidence.
- Each feature needs:
  - Go-source intent review
  - build pass
  - related test execution pass (or explicit deferred reason if test infra is unavailable)
- If related tests for a feature are still deferred/unported, feature can be moved to `complete` but not `verified` unless equivalent evidence exists and is documented.

## 5. Risks and Mitigations

- **Risk:** recovery logic complexity leads to placeholder shortcuts.
  **Mitigation:** mandatory stub scans after every group for both source and tests.
- **Risk:** subtle state-accounting regressions (first/last/msgs/bytes).
  **Mitigation:** group gates require targeted tests plus JetStream store regression filters.
- **Risk:** aggressive status updates without proof.
  **Mitigation:** cap status update chunks to 15 IDs and require command-output evidence per chunk.

## 6. Design Outcome

Batch 12 should be executed as two recovery-focused feature groups with strict evidence gates between them. This design preserves dependency discipline, prevents stub creep, and keeps feature status updates auditable against build/test outputs.