# Batch 12 FileStore Recovery Design

**Date:** 2026-02-27
**Batch:** 12 (`FileStore Recovery`)
**Dependency:** Batch 11 (`FileStore Init`)
**Scope:** 8 features + 0 tests mapped to `golang/nats-server/server/filestore.go`

---

## 1. Context and Constraints

- PortTracker reports Batch 12 as `pending`, dependent on Batch 11, with 8 deferred features and no batch-owned tests.
- Batch 12 methods are all in the FileStore recovery path:
  - `warn` (ID 987)
  - `debug` (ID 988)
  - `recoverFullState` (ID 991)
  - `recoverTTLState` (ID 992)
  - `recoverMsgSchedulingState` (ID 993)
  - `cleanupOldMeta` (ID 995)
  - `recoverMsgs` (ID 996)
  - `expireMsgsOnRecover` (ID 997)
- Current .NET `JetStreamFileStore` is still a mostly delegated shell; these recovery methods are not implemented yet.
- Even though Batch 12 has 0 mapped tests, reverse dependencies exist for at least:
  - test `#519` (`RecoverFullState` corruption detection)
  - test `#545` (`RecoverTTLState` corrupt block resilience)
- Planning-only session: produce design and implementation plan documents; do not implement feature code.

## 2. Success Criteria

- Implement all 8 Batch 12 features with behavior parity to Go intent for recovery flows in `filestore.go`.
- Introduce recovery logic without stubs/placeholders in either production or related test files.
- Verify each feature using a strict per-feature loop: Go-source read, C# implementation, build, related tests.
- Use evidence-backed PortTracker status updates in small chunks and keep dependency/order integrity.

## 3. Brainstormed Approach Options

### Option A (Recommended): Two vertical recovery groups with strict gates

Group methods by lifecycle stage, then run full build/test/status gates between groups.

- Pros: low integration risk, easy isolation of regressions, aligns with mandatory checkpointing.
- Cons: more build/test cycles.

### Option B: Implement all 8 methods in one pass, verify once at the end

- Pros: fastest coding flow initially.
- Cons: high blast radius if behavior diverges; poor auditability for status updates.

### Option C: Logging + cleanup first, then all recovery methods together

- Pros: simple early wins on low-complexity methods.
- Cons: still delays hardest-state validation and can hide ordering bugs until late.

### Recommendation

Use **Option A**. Recovery behavior is stateful and failure-prone; bounded groups plus hard gates are the safest path.

## 4. Proposed Design

### 4.1 Feature Groups (max ~20 each)

- **Group 1 (5 features):** `987, 988, 991, 992, 993`
  - Logging wrappers and persisted state/TTL/scheduling recovery.
- **Group 2 (3 features):** `995, 996, 997`
  - Metadata cleanup, message block recovery sweep, startup age-expiration pass.

### 4.2 Primary Code Boundaries

- Production:
  - `dotnet/src/ZB.MOM.NatsNet.Server/JetStream/FileStore.cs` (main implementation surface)
  - `dotnet/src/ZB.MOM.NatsNet.Server/JetStream/MessageBlock.cs` (state and helper interactions used by recovery logic)
  - `dotnet/src/ZB.MOM.NatsNet.Server/JetStream/FileStoreTypes.cs` (supporting types/constants as needed)
- Related tests and verification targets:
  - `dotnet/tests/ZB.MOM.NatsNet.Server.Tests/JetStream/JetStreamFileStoreTests.cs`
  - `dotnet/tests/ZB.MOM.NatsNet.Server.Tests/ImplBacklog/JetStreamFileStoreTests.Impltests.cs`

### 4.3 Recovery Flow to Port

1. **Observability wrappers** (`warn`, `debug`): no-op when server/logger is absent; prefix context consistently.
2. **Full state recovery** (`recoverFullState`):
   - Read state file, validate minimum length and checksum, decrypt when configured.
   - Decode stream + subject + block state, detect corrupt or outdated state, and fall back to a forced rebuild when either is found.
3. **TTL and scheduling recovery** (`recoverTTLState`, `recoverMsgSchedulingState`):
   - Decode persisted wheels/schedules.
   - If stale, scan linearly from the recovered sequence through the message blocks to rebuild runtime structures.
4. **Metadata cleanup** (`cleanupOldMeta`): remove stale `.idx`/`.fss` files in the message directory.
5. **Message recovery** (`recoverMsgs`): enumerate blocks in order, recover each block, reconcile stream first/last/msgs/bytes, prune orphan key files.
6. **Startup age expiration** (`expireMsgsOnRecover`): safely expire messages/blocks that have exceeded max age, preserve tombstone behavior, and recompute top-level state.

### 4.4 Error Handling and Safety

- A corrupt or mismatched full state must fail closed and trigger the rebuild path instead of continuing with partial state.
- I/O permission and corruption errors should be surfaced with actionable context (not swallowed).
- Lock ordering must remain consistent with existing FileStore concurrency patterns to avoid recovery-time deadlocks.

### 4.5 Verification Design (Feature + Related Tests)

- A feature cannot be verified on build-only evidence.
- Each feature needs:
  - Go-source intent review
  - build pass
  - related test execution pass (or explicit deferred reason if test infra is unavailable)
- If related tests for a feature are still deferred/unported, the feature can be moved to `complete` but not `verified` unless equivalent evidence exists and is documented.

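
These rules amount to a small decision function over the evidence collected for a feature. The sketch below is one possible reading, with hypothetical names (`evidence`, `statusFor`); it assumes that Go-source review and a passing build are prerequisites for either status, which the list implies but does not state outright.

```go
package main

// evidence is a hypothetical record of what was observed for one feature.
type evidence struct {
	goSourceReviewed             bool
	buildPassed                  bool
	relatedTestsPassed           bool
	equivalentEvidenceDocumented bool
}

// statusFor returns the strongest status the evidence supports.
// ok is false when neither `complete` nor `verified` may be claimed.
func statusFor(e evidence) (status string, ok bool) {
	if !e.goSourceReviewed || !e.buildPassed {
		return "", false
	}
	if e.relatedTestsPassed || e.equivalentEvidenceDocumented {
		return "verified", true
	}
	// Build-only evidence: the feature may be complete, never verified.
	return "complete", true
}
```

For a feature whose related tests are still unported, `statusFor` yields `complete` until either the tests pass or equivalent evidence is documented.
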
## 5. Risks and Mitigations

- **Risk:** recovery logic complexity leads to placeholder shortcuts.
  **Mitigation:** mandatory stub scans after every group for both source and tests.
- **Risk:** subtle state-accounting regressions (first/last/msgs/bytes).
  **Mitigation:** group gates require targeted tests plus JetStream store regression filters.
- **Risk:** aggressive status updates without proof.
  **Mitigation:** cap status update chunks to 15 IDs and require command-output evidence per chunk.

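
The 15-ID cap on status updates is a plain chunking rule. A minimal Go sketch follows; `chunkIDs` and `maxStatusChunk` are hypothetical names, and how each chunk is submitted to PortTracker is left out because the tracker's interface is not described here.

```go
package main

// maxStatusChunk is the cap on feature IDs per status update.
const maxStatusChunk = 15

// chunkIDs splits ids into consecutive slices of at most size elements,
// preserving order. It returns nil for a non-positive size.
func chunkIDs(ids []int, size int) [][]int {
	if size <= 0 {
		return nil
	}
	var chunks [][]int
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		chunks = append(chunks, ids[start:end])
	}
	return chunks
}
```

Batch 12 has only 8 feature IDs, so each group fits in a single chunk; the cap matters for larger batches, where every chunk needs its own command-output evidence.
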
## 6. Design Outcome

Batch 12 should be executed as two recovery-focused feature groups with strict evidence gates between them. This design preserves dependency discipline, prevents stub creep, and keeps feature status updates auditable against build/test outputs.