docs: add optimization planning documents
docs/plans/2026-03-13-optimizations_filestore-plan.md
@@ -0,0 +1,331 @@
# FileStore Payload And Index Optimization Implementation Plan

> **For Codex:** REQUIRED SUB-SKILLS: Use `using-git-worktrees` to create an isolated workspace before Task 1, then use `executeplan` to implement this plan task-by-task. After verification is complete, merge the finished branch back into `main`.

**Goal:** Reduce JetStream FileStore memory churn and repeated full scans by tightening payload ownership, splitting compact metadata from large payload buffers, and replacing LINQ-based maintenance work with explicit indexes and loops.

**Architecture:** Start by freezing current behavior across `AppendAsync`, `StoreMsg`, retention, snapshots, and recovery. Then introduce compact metadata/index structures, remove avoidable duplicate payload buffers, replace repeated `_messages` scans with maintained indexes, and finish by updating recovery/snapshot paths plus benchmark coverage.

**Tech Stack:** .NET 10, C#, JetStream storage stack, `ReadOnlyMemory<byte>`, pooled buffers where safe, xUnit, existing JetStream benchmark harness.

---

## Scope Anchors
- Primary source: `src/NATS.Server/JetStream/Storage/FileStore.cs`
- Supporting sources:
  - `src/NATS.Server/JetStream/Storage/MsgBlock.cs`
  - `src/NATS.Server/JetStream/Storage/StoredMessage.cs`
  - `src/NATS.Server/JetStream/Storage/MessageRecord.cs`
- Existing contract tests:
  - `tests/NATS.Server.JetStream.Tests/StreamStoreContractTests.cs`
  - `tests/NATS.Server.JetStream.Tests/JetStream/Storage/StoreInterfaceTests.cs`
- Existing FileStore coverage:
  - `tests/NATS.Server.JetStream.Tests/FileStoreTests.cs`
  - `tests/NATS.Server.JetStream.Tests/JetStreamStoreIndexTests.cs`
  - `tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreCompressionTests.cs`
  - `tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreCrashRecoveryTests.cs`
  - `tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreTombstoneTests.cs`
- Documentation to update: `Documentation/JetStream/Overview.md`
- Benchmark project: `tests/NATS.Server.Benchmark.Tests/NATS.Server.Benchmark.Tests.csproj`
- Benchmark comparison doc: `benchmarks_comparison.md`

## Task 0: Create an isolated git worktree and verify the baseline

**Files:**
- Modify: `.gitignore` only if the chosen local worktree directory is not already ignored

**Step 1: Choose the worktree location using the repo convention**
- Check for an existing `.worktrees/` directory first, then `worktrees/`.
- If neither exists, check repo guidance before creating one.
- Prefer a project-local `.worktrees/` directory when available.

**Step 2: Verify the worktree directory is ignored before creating anything**
- Run:
```bash
git check-ignore -q .worktrees || git check-ignore -q worktrees
```
- Expected: one configured worktree directory is ignored.
- If neither directory is ignored, add the chosen directory to `.gitignore`, commit that change on `main`, and then continue.

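Steps 1 and 2 can be folded into one helper. The following is a minimal sketch, not something that exists in the repository: the function name `choose_worktree_dir` is hypothetical, and falling back to `.worktrees` when neither directory exists is an assumption that should still be confirmed against repo guidance.

```bash
#!/usr/bin/env bash
# Pick the worktree parent directory (Step 1) and confirm it is ignored (Step 2).
# Prints the chosen directory on success; returns 1 if it is not ignored.
choose_worktree_dir() {
  local dir
  if [ -d .worktrees ]; then
    dir=.worktrees
  elif [ -d worktrees ]; then
    dir=worktrees
  else
    # ASSUMPTION: default to the project-local convention when nothing exists yet.
    dir=.worktrees
  fi

  # check-ignore exits 0 only when an ignore rule matches the path;
  # any non-zero status (not ignored, or a git error) is treated as a failure.
  if git check-ignore -q "$dir"; then
    printf '%s\n' "$dir"
  else
    printf 'not ignored: %s (add it to .gitignore first)\n' "$dir" >&2
    return 1
  fi
}
```

Run from the repository root, `choose_worktree_dir` prints the directory to pass to `git worktree add`, or fails so that the `.gitignore` fix can be made before anything is created.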
**Step 3: Create a dedicated branch and worktree for this plan**
- Run:
```bash
git worktree add .worktrees/filestore-payload-index-optimization -b codex/filestore-payload-index-optimization
```
- Expected: a new isolated checkout exists at `.worktrees/filestore-payload-index-optimization`.

**Step 4: Move into the worktree and verify the starting baseline**
- Run:
```bash
cd .worktrees/filestore-payload-index-optimization
dotnet test tests/NATS.Server.JetStream.Tests/NATS.Server.JetStream.Tests.csproj -c Release
```
- Expected: PASS before implementation starts.
- If the baseline fails, stop and resolve whether to proceed before changing FileStore code.

**Step 5: Commit only the worktree bootstrap change if one was required**
- Run only if `.gitignore` had to change:
```bash
git add .gitignore
git commit -m "chore: ignore local worktree directory"
```

## Task 1: Freeze store behavior and add scan/ownership regression tests

**Files:**
- Modify: `tests/NATS.Server.JetStream.Tests/JetStreamStoreIndexTests.cs`
- Modify: `tests/NATS.Server.JetStream.Tests/JetStream/Storage/StoreInterfaceTests.cs`
- Create: `tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreOptimizationGuardTests.cs`

**Step 1: Add failing tests for the targeted optimization boundaries**
- Cover:
  - `AppendAsync` retaining logical payload behavior
  - `StoreMsg` with headers + payload
  - `LoadLastBySubjectAsync`
  - `TrimToMaxMessages`
  - `PurgeEx`
  - snapshot/recovery round-trips

**Step 2: Add tests that lock first/last sequence bookkeeping**
- Ensure `_firstSeq`, `_last`, and subject-last lookup behavior remain correct after removes, purges, compaction, and recovery.

**Step 3: Run focused JetStream tests to prove the new tests fail first**
- Run: `dotnet test tests/NATS.Server.JetStream.Tests/NATS.Server.JetStream.Tests.csproj --filter "FullyQualifiedName~FileStoreOptimizationGuardTests|FullyQualifiedName~JetStreamStoreIndexTests|FullyQualifiedName~StoreInterfaceTests" -c Release`
- Expected: FAIL only in the newly added optimization-guard tests.

**Step 4: Commit the failing-test baseline**
- Run:
```bash
git add tests/NATS.Server.JetStream.Tests/JetStreamStoreIndexTests.cs tests/NATS.Server.JetStream.Tests/JetStream/Storage/StoreInterfaceTests.cs tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreOptimizationGuardTests.cs
git commit -m "test: lock FileStore optimization boundaries"
```

## Task 2: Introduce compact metadata/index types and remove full-scan bookkeeping

**Files:**
- Create: `src/NATS.Server/JetStream/Storage/StoredMessageIndex.cs`
- Modify: `src/NATS.Server/JetStream/Storage/FileStore.cs`
- Modify: `src/NATS.Server/JetStream/Storage/StoredMessage.cs`

**Step 1: Split compact indexing metadata from payload-bearing message objects**
- Add a small immutable metadata/index type that tracks at least:
  - sequence
  - subject
  - logical payload length
  - timestamp
  - subject-local links or last-seen markers if needed

**Step 2: Replace repeated `Min()` / `Max()` / full-value scans with maintained state**
- Maintain first live sequence, last live sequence, and last-by-subject values incrementally rather than recomputing them with LINQ.

**Step 3: Run targeted index tests**
- Run: `dotnet test tests/NATS.Server.JetStream.Tests/NATS.Server.JetStream.Tests.csproj --filter "FullyQualifiedName~JetStreamStoreIndexTests|FullyQualifiedName~FileStoreOptimizationGuardTests" -c Release`
- Expected: PASS.

**Step 4: Commit the metadata/index layer**
- Run:
```bash
git add src/NATS.Server/JetStream/Storage/StoredMessageIndex.cs src/NATS.Server/JetStream/Storage/FileStore.cs src/NATS.Server/JetStream/Storage/StoredMessage.cs
git commit -m "perf: add compact FileStore index metadata"
```

## Task 3: Remove duplicate payload ownership in append and store paths

**Files:**
- Modify: `src/NATS.Server/JetStream/Storage/FileStore.cs`
- Modify: `src/NATS.Server/JetStream/Storage/MsgBlock.cs`
- Modify: `src/NATS.Server/JetStream/Storage/MessageRecord.cs`
- Modify: `tests/NATS.Server.JetStream.Tests/FileStoreTests.cs`

**Step 1: Rework `AppendAsync` and `StoreMsg` payload flow**
- Stop eagerly keeping both a transformed persisted payload and a second fully duplicated managed payload when the same buffer/view can safely back both responsibilities.
- Keep correctness for compression, encryption, and header-bearing records explicit.

**Step 2: Remove concatenated header+payload arrays where possible**
- Let record encoding paths consume header and payload spans directly instead of always building `combined = new byte[...]`.
- Leave a copy in place only where the persistence or recovery contract actually requires one.

**Step 3: Run targeted persistence tests**
- Run: `dotnet test tests/NATS.Server.JetStream.Tests/NATS.Server.JetStream.Tests.csproj --filter "FullyQualifiedName~FileStoreTests|FullyQualifiedName~FileStoreCompressionTests|FullyQualifiedName~FileStoreEncryptionTests" -c Release`
- Expected: PASS.

**Step 4: Commit the payload-ownership refactor**
- Run:
```bash
git add src/NATS.Server/JetStream/Storage/FileStore.cs src/NATS.Server/JetStream/Storage/MsgBlock.cs src/NATS.Server/JetStream/Storage/MessageRecord.cs tests/NATS.Server.JetStream.Tests/FileStoreTests.cs
git commit -m "perf: reduce FileStore duplicate payload buffers"
```

## Task 4: Replace LINQ-heavy maintenance operations with explicit indexed paths

**Files:**
- Modify: `src/NATS.Server/JetStream/Storage/FileStore.cs`
- Modify: `tests/NATS.Server.JetStream.Tests/JetStream/Storage/StoreInterfaceTests.cs`
- Modify: `tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreCrashRecoveryTests.cs`
- Modify: `tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreTombstoneTests.cs`

**Step 1: Rewrite hot maintenance methods**
- Replace LINQ-based implementations in:
  - `LoadLastBySubjectAsync`
  - `TrimToMaxMessages`
  - `PurgeEx`
  - snapshot/recovery recomputation paths
- Use explicit loops and maintained indexes first; only add more elaborate per-subject structures if profiling still demands them.

**Step 2: Preserve recovery and tombstone correctness**
- Verify delete markers, TTL rebuilds, compaction, and sequence-gap handling still match the current parity tests.

**Step 3: Run targeted JetStream storage suites**
- Run: `dotnet test tests/NATS.Server.JetStream.Tests/NATS.Server.JetStream.Tests.csproj --filter "FullyQualifiedName~StoreInterfaceTests|FullyQualifiedName~FileStoreCrashRecoveryTests|FullyQualifiedName~FileStoreTombstoneTests" -c Release`
- Expected: PASS.

**Step 4: Commit the maintenance-path rewrite**
- Run:
```bash
git add src/NATS.Server/JetStream/Storage/FileStore.cs tests/NATS.Server.JetStream.Tests/JetStream/Storage/StoreInterfaceTests.cs tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreCrashRecoveryTests.cs tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreTombstoneTests.cs
git commit -m "perf: replace FileStore full scans with indexed loops"
```

## Task 5: Add benchmark coverage, update docs, and run full verification

**Files:**
- Create: `tests/NATS.Server.Benchmark.Tests/JetStream/FileStoreAppendBenchmarks.cs`
- Modify: `Documentation/JetStream/Overview.md`

**Step 1: Add FileStore-focused benchmarks**
- Cover:
  - append throughput
  - sync publish
  - load-last-by-subject
  - purge/trim maintenance overhead
- Record allocation deltas before/after.

**Step 2: Update JetStream documentation**
- Document how FileStore now separates metadata/index concerns from payload storage and where copies still remain by design.

**Step 3: Run full verification**
- Run: `dotnet test tests/NATS.Server.JetStream.Tests/NATS.Server.JetStream.Tests.csproj -c Release`
- Run: `dotnet test tests/NATS.Server.Benchmark.Tests/NATS.Server.Benchmark.Tests.csproj --filter "FullyQualifiedName~FileStore|FullyQualifiedName~SyncPublish|FullyQualifiedName~AsyncPublish" -c Release`
- Expected: PASS; benchmark output shows fewer allocations in append-heavy scenarios.

**Step 4: Commit docs and benchmarks**
- Run:
```bash
git add tests/NATS.Server.Benchmark.Tests/JetStream/FileStoreAppendBenchmarks.cs Documentation/JetStream/Overview.md
git commit -m "docs: record FileStore payload and index strategy"
```

## Task 6: Merge the verified worktree branch back into `main`

**Files:**
- No source-file changes expected unless the merge surfaces conflicts that require a follow-up fix

**Step 1: Confirm the worktree branch is clean and fully verified**
- Re-run the Task 5 verification commands in the worktree if anything changed after the final commit.
- Run:
```bash
git status --short
```
- Expected: no uncommitted changes.

**Step 2: Update `main` before merging**
- From the primary checkout, run:
```bash
git switch main
git pull --ff-only
```
- Expected: local `main` matches the latest remote state.

**Step 3: Merge the finished branch back to `main`**
- Run:
```bash
git merge --ff-only codex/filestore-payload-index-optimization
```
- Expected: `main` fast-forwards to include the completed FileStore optimization commits.
- If fast-forward is not possible, rebase `codex/filestore-payload-index-optimization` onto `main`, re-run verification, and then repeat this step.

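The clean-tree check and the fast-forward-only merge from Steps 1 and 3 can be combined into one guarded helper. This is a hedged sketch under stated assumptions: `merge_verified_branch` is a hypothetical name, the function is run from the primary checkout, and it deliberately leaves out the `git pull --ff-only` from Step 2 because that depends on the remote configuration.

```bash
#!/usr/bin/env bash
# Fast-forward main to the given topic branch without ever creating a merge commit.
# Returns non-zero and leaves main untouched when the branch must be rebased first.
merge_verified_branch() {
  local branch=$1

  # A dirty checkout means the last verification run no longer describes the tree.
  if [ -n "$(git status --porcelain)" ]; then
    echo "uncommitted changes present; commit or discard them first" >&2
    return 1
  fi

  git switch main || return 1

  # --ff-only refuses to merge when main has commits the topic branch lacks.
  if ! git merge --ff-only "$branch"; then
    echo "cannot fast-forward: rebase $branch onto main, re-run verification, then retry" >&2
    return 1
  fi
}
```

For this plan the call would be `merge_verified_branch codex/filestore-payload-index-optimization`; a failure maps directly onto the rebase-and-repeat instruction in Step 3.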
**Step 4: Confirm `main` still passes after the merge**
- Run:
```bash
dotnet test tests/NATS.Server.JetStream.Tests/NATS.Server.JetStream.Tests.csproj -c Release
```
- Expected: PASS on `main`.

**Step 5: Remove the temporary worktree after merge confirmation**
- Run:
```bash
git worktree remove .worktrees/filestore-payload-index-optimization
git branch -d codex/filestore-payload-index-optimization
```
- Expected: the temporary checkout is removed and the topic branch is no longer needed locally.

## Task 7: Run the benchmark suite per the benchmark README and update the comparison document

**Files:**
- Modify: `benchmarks_comparison.md`
- Reference: `tests/NATS.Server.Benchmark.Tests/README.md`

**Step 1: Run the full benchmark suite with the README-prescribed command**
- From `main` after Task 6 succeeds, run:
```bash
dotnet test tests/NATS.Server.Benchmark.Tests \
  --filter "Category=Benchmark" \
  -v normal \
  --logger "console;verbosity=detailed" 2>&1 | tee /tmp/bench-output.txt
```
- Expected: the benchmark suite completes and writes comparison blocks to `/tmp/bench-output.txt`.

**Step 2: Extract the benchmark results from the captured output**
- Review the `Standard Output Messages` sections in `/tmp/bench-output.txt`.
- Capture the updated values for:
  - core pub/sub throughput
  - request/reply latency
  - JetStream sync publish
  - JetStream async file publish
  - ordered consumer throughput
  - durable consumer fetch throughput

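Reviewing the captured log by hand is tedious, so the `Standard Output Messages` sections from Step 2 can be pulled out with a small filter. The sketch below rests on an assumption about the log layout that should be checked against a real run: each section is taken to end at the next line that starts with `Passed`, `Failed`, or `Skipped`. The function name `extract_bench_output` is hypothetical.

```bash
#!/usr/bin/env bash
# Print only the non-blank lines that follow each "Standard Output Messages:" header.
# ASSUMPTION: a section ends at the next "Passed"/"Failed"/"Skipped" result line.
extract_bench_output() {
  awk '
    /Standard Output Messages:/        { capture = 1; next }  # start after the header
    /^[ \t]*(Passed|Failed|Skipped) /  { capture = 0 }        # next test result ends it
    capture && NF                      { print }              # skip blank lines
  ' "${1:-/tmp/bench-output.txt}"
}
```

Calling `extract_bench_output` with no argument reads `/tmp/bench-output.txt`, the file written by the Step 1 command; pass another path to inspect a saved run.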
**Step 3: Update `benchmarks_comparison.md`**
- Update:
  - the benchmark run date on the first line
  - environment details if they changed
  - all affected tables with the new msg/s, MB/s, ratio, and latency values
  - the Summary and Key Observations text if the new ratios materially change the assessment

**Step 4: Verify the comparison document changes are the only remaining edits**
- Run:
```bash
git status --short
```
- Expected: only `benchmarks_comparison.md` is modified at this point unless the benchmark run surfaced a legitimate follow-up issue to capture separately.

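The Step 4 expectation can also be checked mechanically rather than by eye. The following is a minimal sketch with a hypothetical function name, `only_comparison_doc_changed`; it treats untracked files as changes too, which is an assumption about how strict the check should be.

```bash
#!/usr/bin/env bash
# Succeed only when benchmarks_comparison.md is the single changed path in the tree.
only_comparison_doc_changed() {
  local changed
  # Porcelain lines look like "XY path"; drop the two status characters and the space.
  changed=$(git status --porcelain | sed 's/^...//')
  [ "$changed" = "benchmarks_comparison.md" ]
}
```

A non-zero result means either nothing has changed yet or something else was touched, in which case the extra paths should be reviewed before committing.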
**Step 5: Commit the benchmark comparison refresh**
- Run:
```bash
git add benchmarks_comparison.md
git commit -m "docs: update benchmark comparison after FileStore optimization"
```

## Completion Checklist
- [ ] Implementation started from an isolated git worktree on `codex/filestore-payload-index-optimization`.
- [ ] `AppendAsync` and `StoreMsg` avoid unnecessary duplicate payload ownership.
- [ ] `LoadLastBySubjectAsync`, `TrimToMaxMessages`, and `PurgeEx` no longer rely on repeated LINQ full scans.
- [ ] First/last/live-sequence bookkeeping is maintained incrementally.
- [ ] JetStream storage, recovery, compression, encryption, and tombstone tests remain green.
- [ ] FileStore-focused benchmark coverage exists in `tests/NATS.Server.Benchmark.Tests/JetStream/`.
- [ ] `Documentation/JetStream/Overview.md` explains the updated storage/index model.
- [ ] Verified work has been merged back into `main` and the temporary worktree has been removed.
- [ ] Full benchmark suite has been run from `main` using the command in `tests/NATS.Server.Benchmark.Tests/README.md`.
- [ ] `benchmarks_comparison.md` has been updated to reflect the new benchmark results.

## Concise Execution Checklist For The Current Codebase
- [ ] Create `codex/filestore-payload-index-optimization` in `.worktrees/filestore-payload-index-optimization` and verify `tests/NATS.Server.JetStream.Tests/NATS.Server.JetStream.Tests.csproj` passes before changes.
- [ ] Add optimization-guard coverage in `tests/NATS.Server.JetStream.Tests/JetStreamStoreIndexTests.cs`, `tests/NATS.Server.JetStream.Tests/JetStream/Storage/StoreInterfaceTests.cs`, and new `tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreOptimizationGuardTests.cs`.
- [ ] Rework the current FileStore hot paths in `src/NATS.Server/JetStream/Storage/FileStore.cs`: `AppendAsync`, `LoadLastBySubjectAsync`, `TrimToMaxMessages`, `StoreMsg`, and `PurgeEx`.
- [ ] Introduce compact FileStore indexing metadata in new `src/NATS.Server/JetStream/Storage/StoredMessageIndex.cs` and adjust `src/NATS.Server/JetStream/Storage/StoredMessage.cs` accordingly.
- [ ] Remove avoidable payload duplication in `src/NATS.Server/JetStream/Storage/FileStore.cs`, `src/NATS.Server/JetStream/Storage/MsgBlock.cs`, and `src/NATS.Server/JetStream/Storage/MessageRecord.cs`.
- [ ] Keep JetStream storage parity green by re-running the existing storage-focused suites under `tests/NATS.Server.JetStream.Tests/JetStream/Storage/`, especially compression, crash recovery, tombstones, and store interface coverage.
- [ ] Add FileStore benchmark coverage alongside the existing JetStream benchmark classes in `tests/NATS.Server.Benchmark.Tests/JetStream/`.
- [ ] Update `Documentation/JetStream/Overview.md` to describe the new payload/index split and the remaining intentional copy boundaries.
- [ ] Merge the verified topic branch back into `main`, re-run JetStream tests on `main`, then remove the temporary worktree.
- [ ] Run the full benchmark suite exactly as documented in `tests/NATS.Server.Benchmark.Tests/README.md` and update `benchmarks_comparison.md` with the new measurements.