natsdotnet/docs/plans/2026-03-13-optimizations_filestore-plan.md
2026-03-13 10:19:56 -04:00

FileStore Payload And Index Optimization Implementation Plan

For Codex: REQUIRED SUB-SKILLS: Use using-git-worktrees to create an isolated workspace before Task 1, then use executeplan to implement this plan task-by-task. After verification is complete, merge the finished branch back into main.

Goal: Reduce JetStream FileStore memory churn and repeated full scans by tightening payload ownership, splitting compact metadata from large payload buffers, and replacing LINQ-based maintenance work with explicit indexes and loops.

Architecture: Start by freezing current behavior across AppendAsync, StoreMsg, retention, snapshots, and recovery. Then introduce compact metadata/index structures, remove avoidable duplicate payload buffers, replace repeated _messages scans with maintained indexes, and finish by updating recovery/snapshot paths plus benchmark coverage.

Tech Stack: .NET 10, C#, JetStream storage stack, ReadOnlyMemory<byte>, pooled buffers where safe, xUnit, existing JetStream benchmark harness.


Scope Anchors

  • Primary source: src/NATS.Server/JetStream/Storage/FileStore.cs
  • Supporting sources:
    • src/NATS.Server/JetStream/Storage/MsgBlock.cs
    • src/NATS.Server/JetStream/Storage/StoredMessage.cs
    • src/NATS.Server/JetStream/Storage/MessageRecord.cs
  • Existing contract tests:
    • tests/NATS.Server.JetStream.Tests/StreamStoreContractTests.cs
    • tests/NATS.Server.JetStream.Tests/JetStream/Storage/StoreInterfaceTests.cs
  • Existing FileStore coverage:
    • tests/NATS.Server.JetStream.Tests/FileStoreTests.cs
    • tests/NATS.Server.JetStream.Tests/JetStreamStoreIndexTests.cs
    • tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreCompressionTests.cs
    • tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreCrashRecoveryTests.cs
    • tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreTombstoneTests.cs
  • Documentation to update: Documentation/JetStream/Overview.md
  • Benchmark project: tests/NATS.Server.Benchmark.Tests/NATS.Server.Benchmark.Tests.csproj
  • Benchmark comparison doc: benchmarks_comparison.md

Task 0: Create an isolated git worktree and verify the baseline

Files:

  • Modify: .gitignore only if the chosen local worktree directory is not already ignored

Step 1: Choose the worktree location using the repo convention

  • Check for an existing .worktrees/ directory first, then worktrees/.
  • If neither exists, check repo guidance before creating one.
  • Prefer a project-local .worktrees/ directory when available.

Step 2: Verify the worktree directory is ignored before creating anything

  • Run:
git check-ignore -q .worktrees || git check-ignore -q worktrees
  • Expected: one configured worktree directory is ignored.
  • If neither directory is ignored, add the chosen directory to .gitignore, commit that change on main, and then continue.

Step 3: Create a dedicated branch and worktree for this plan

  • Run:
git worktree add .worktrees/filestore-payload-index-optimization -b codex/filestore-payload-index-optimization
  • Expected: a new isolated checkout exists at .worktrees/filestore-payload-index-optimization.

Step 4: Move into the worktree and verify the starting baseline

  • Run:
cd .worktrees/filestore-payload-index-optimization
dotnet test tests/NATS.Server.JetStream.Tests/NATS.Server.JetStream.Tests.csproj -c Release
  • Expected: PASS before implementation starts.
  • If the baseline fails, stop and decide whether to proceed before changing any FileStore code.

Step 5: Commit only the worktree bootstrap change if one was required

  • Run only if .gitignore had to change and that change has not already been committed in Step 2:
git add .gitignore
git commit -m "chore: ignore local worktree directory"

Task 1: Freeze store behavior and add scan/ownership regression tests

Files:

  • Modify: tests/NATS.Server.JetStream.Tests/JetStreamStoreIndexTests.cs
  • Modify: tests/NATS.Server.JetStream.Tests/JetStream/Storage/StoreInterfaceTests.cs
  • Create: tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreOptimizationGuardTests.cs

Step 1: Add failing tests for the targeted optimization boundaries

  • Cover:
    • AppendAsync retaining logical payload behavior
    • StoreMsg with headers + payload
    • LoadLastBySubjectAsync
    • TrimToMaxMessages
    • PurgeEx
    • snapshot/recovery round-trips

Step 2: Add tests that lock first/last sequence bookkeeping

  • Ensure _firstSeq, _last, and subject-last lookup behavior remain correct after removes, purges, compaction, and recovery.

Step 3: Run focused JetStream tests to prove the new tests fail first

  • Run: dotnet test tests/NATS.Server.JetStream.Tests/NATS.Server.JetStream.Tests.csproj --filter "FullyQualifiedName~FileStoreOptimizationGuardTests|FullyQualifiedName~JetStreamStoreIndexTests|FullyQualifiedName~StoreInterfaceTests" -c Release
  • Expected: FAIL only in the newly added optimization-guard tests.

Step 4: Commit the failing-test baseline

  • Run:
git add tests/NATS.Server.JetStream.Tests/JetStreamStoreIndexTests.cs tests/NATS.Server.JetStream.Tests/JetStream/Storage/StoreInterfaceTests.cs tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreOptimizationGuardTests.cs
git commit -m "test: lock FileStore optimization boundaries"

Task 2: Introduce compact metadata/index types and remove full-scan bookkeeping

Files:

  • Create: src/NATS.Server/JetStream/Storage/StoredMessageIndex.cs
  • Modify: src/NATS.Server/JetStream/Storage/FileStore.cs
  • Modify: src/NATS.Server/JetStream/Storage/StoredMessage.cs

Step 1: Split compact indexing metadata from payload-bearing message objects

  • Add a small immutable metadata/index type that tracks at least:
    • sequence
    • subject
    • logical payload length
    • timestamp
    • subject-local links or last-seen markers if needed
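
The field list above could be sketched as a small immutable value type. This is only an illustration: the name StoredMessageIndex comes from the planned file, but every member name, the nanosecond timestamp unit, and the "0 means none" convention for the per-subject link are assumptions rather than the actual FileStore design.

```csharp
using System;

// Hypothetical shape for the compact index entry described in Step 1.
// It carries no payload bytes, only what bookkeeping needs.
public readonly record struct StoredMessageIndex(
    ulong Sequence,                    // stream sequence of the message
    string Subject,                    // subject the message was stored under
    int PayloadLength,                 // logical (untransformed) payload length
    long TimestampUnixNanos,           // assumed unit: nanoseconds since the Unix epoch
    ulong PreviousSequenceForSubject)  // assumed link; 0 = no earlier message on this subject
{
    // True when this is the first message seen on its subject.
    public bool IsFirstForSubject => PreviousSequenceForSubject == 0;
}
```

A record struct gives value equality and non-destructive updates (`entry with { PayloadLength = 256 }`) without a heap allocation per entry, which fits the goal of keeping index data separate from large payload buffers.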

Step 2: Replace repeated Min() / Max() / full-value scans with maintained state

  • Maintain first live sequence, last live sequence, and last-by-subject values incrementally rather than recomputing them with LINQ.
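
One way to picture the incremental bookkeeping is the sketch below. LiveSequenceTracker and all of its members are hypothetical names, not FileStore's real fields, and the sketch assumes sequences start at 1 and only ever increase. A removal repairs state with bounded forward or backward walks instead of recomputing Min() or Max() over every message.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: keeps first/last live sequence and last-by-subject
// current on every add and remove, with no LINQ scans.
public sealed class LiveSequenceTracker
{
    private readonly Dictionary<ulong, string> _subjectBySeq = new();
    private readonly Dictionary<string, ulong> _lastBySubject = new(StringComparer.Ordinal);
    private ulong _highWater; // highest sequence ever added

    public ulong FirstSeq { get; private set; } // 0 when there are no live messages
    public ulong LastSeq { get; private set; }  // 0 when there are no live messages
    public int Count => _subjectBySeq.Count;

    public void Add(ulong seq, string subject)
    {
        // Assumption: sequences start at 1 and are strictly increasing.
        if (seq <= _highWater) throw new ArgumentOutOfRangeException(nameof(seq));
        _highWater = seq;
        _subjectBySeq.Add(seq, subject);
        _lastBySubject[subject] = seq;
        if (_subjectBySeq.Count == 1) FirstSeq = seq;
        LastSeq = seq;
    }

    public bool TryGetLastBySubject(string subject, out ulong seq) =>
        _lastBySubject.TryGetValue(subject, out seq);

    public bool Remove(ulong seq)
    {
        if (!_subjectBySeq.Remove(seq, out var subject)) return false;

        // Only the subject's newest message needs repair: walk back to the
        // previous live message on that subject, bounded by the old FirstSeq.
        if (_lastBySubject[subject] == seq)
        {
            _lastBySubject.Remove(subject);
            for (var s = seq - 1; s >= FirstSeq; s--)
            {
                if (_subjectBySeq.TryGetValue(s, out var other) && other == subject)
                {
                    _lastBySubject[subject] = s;
                    break;
                }
            }
        }

        if (_subjectBySeq.Count == 0)
        {
            FirstSeq = 0;
            LastSeq = 0;
            return true;
        }

        // At least one live message remains, so both walks terminate.
        if (seq == FirstSeq)
        {
            var s = seq + 1;
            while (!_subjectBySeq.ContainsKey(s)) s++;
            FirstSeq = s;
        }
        if (seq == LastSeq)
        {
            var s = seq - 1;
            while (!_subjectBySeq.ContainsKey(s)) s--;
            LastSeq = s;
        }
        return true;
    }
}
```

The backward walk for last-by-subject is deliberately simple. If profiling shows it matters, per-subject links of the kind listed in Step 1 would be the natural next step.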

Step 3: Run targeted index tests

  • Run: dotnet test tests/NATS.Server.JetStream.Tests/NATS.Server.JetStream.Tests.csproj --filter "FullyQualifiedName~JetStreamStoreIndexTests|FullyQualifiedName~FileStoreOptimizationGuardTests" -c Release
  • Expected: PASS.

Step 4: Commit the metadata/index layer

  • Run:
git add src/NATS.Server/JetStream/Storage/StoredMessageIndex.cs src/NATS.Server/JetStream/Storage/FileStore.cs src/NATS.Server/JetStream/Storage/StoredMessage.cs
git commit -m "perf: add compact FileStore index metadata"

Task 3: Remove duplicate payload ownership in append and store paths

Files:

  • Modify: src/NATS.Server/JetStream/Storage/FileStore.cs
  • Modify: src/NATS.Server/JetStream/Storage/MsgBlock.cs
  • Modify: src/NATS.Server/JetStream/Storage/MessageRecord.cs
  • Modify: tests/NATS.Server.JetStream.Tests/FileStoreTests.cs

Step 1: Rework AppendAsync and StoreMsg payload flow

  • Stop eagerly keeping both the transformed (persisted) payload and a second, fully duplicated managed copy when a single buffer or view can safely serve both roles.
  • Keep correctness for compression, encryption, and header-bearing records explicit.
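
As a rough illustration of one buffer backing both responsibilities, the sketch below exposes headers and payload as views over a single ReadOnlyMemory<byte> instead of holding separate copies. BufferedMessage is a hypothetical type. The sketch only applies when the stored bytes equal the logical bytes (no compression or encryption) and when the caller does not mutate the buffer afterwards; both conditions are assumptions that the real code would have to check explicitly.

```csharp
using System;

// Hypothetical message shape: one owned buffer, two zero-copy views.
public sealed class BufferedMessage
{
    private readonly ReadOnlyMemory<byte> _buffer; // headers followed by payload
    private readonly int _headerLength;

    public BufferedMessage(ReadOnlyMemory<byte> buffer, int headerLength)
    {
        if (headerLength < 0 || headerLength > buffer.Length)
            throw new ArgumentOutOfRangeException(nameof(headerLength));
        _buffer = buffer;
        _headerLength = headerLength;
    }

    // Slicing a ReadOnlyMemory<byte> does not copy; both views share _buffer.
    public ReadOnlyMemory<byte> Headers => _buffer.Slice(0, _headerLength);
    public ReadOnlyMemory<byte> Payload => _buffer.Slice(_headerLength);

    // Logical payload length, available without touching the bytes.
    public int PayloadLength => _buffer.Length - _headerLength;
}
```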

Step 2: Remove concatenated header+payload arrays where possible

  • Let record encoding paths consume header and payload spans directly instead of always building combined = new byte[...].
  • Leave a copy in place only where the persistence or recovery contract actually requires one.
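
The span-based encoding idea might look like the following sketch, which writes headers and payload straight into the writer's buffer rather than first building a combined array. The record layout here (two little-endian int32 lengths followed by the bytes) is invented for illustration and is not FileStore's on-disk format; RecordEncoder is a hypothetical name.

```csharp
using System;
using System.Buffers;
using System.Buffers.Binary;

public static class RecordEncoder
{
    // Hypothetical layout: [int32 headerLen][int32 payloadLen][headers][payload].
    // Returns the number of bytes written.
    public static int Encode(
        IBufferWriter<byte> writer,
        ReadOnlySpan<byte> headers,
        ReadOnlySpan<byte> payload)
    {
        var total = checked(8 + headers.Length + payload.Length);

        // Ask the writer for space once; no intermediate combined array.
        var dest = writer.GetSpan(total);
        BinaryPrimitives.WriteInt32LittleEndian(dest, headers.Length);
        BinaryPrimitives.WriteInt32LittleEndian(dest.Slice(4), payload.Length);
        headers.CopyTo(dest.Slice(8));
        payload.CopyTo(dest.Slice(8 + headers.Length));

        writer.Advance(total);
        return total;
    }
}
```

The only copies left are the two CopyTo calls into the destination, which is the copy the persistence path needs anyway.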

Step 3: Run targeted persistence tests

  • Run: dotnet test tests/NATS.Server.JetStream.Tests/NATS.Server.JetStream.Tests.csproj --filter "FullyQualifiedName~FileStoreTests|FullyQualifiedName~FileStoreCompressionTests|FullyQualifiedName~FileStoreEncryptionTests" -c Release
  • Expected: PASS.

Step 4: Commit the payload-ownership refactor

  • Run:
git add src/NATS.Server/JetStream/Storage/FileStore.cs src/NATS.Server/JetStream/Storage/MsgBlock.cs src/NATS.Server/JetStream/Storage/MessageRecord.cs tests/NATS.Server.JetStream.Tests/FileStoreTests.cs
git commit -m "perf: reduce FileStore duplicate payload buffers"

Task 4: Replace LINQ-heavy maintenance operations with explicit indexed paths

Files:

  • Modify: src/NATS.Server/JetStream/Storage/FileStore.cs
  • Modify: tests/NATS.Server.JetStream.Tests/JetStream/Storage/StoreInterfaceTests.cs
  • Modify: tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreCrashRecoveryTests.cs
  • Modify: tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreTombstoneTests.cs

Step 1: Rewrite hot maintenance methods

  • Replace LINQ-based implementations in:
    • LoadLastBySubjectAsync
    • TrimToMaxMessages
    • PurgeEx
    • snapshot/recovery recomputation paths
  • Use explicit loops and maintained indexes first; only add more elaborate per-subject structures if profiling still demands them.
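
To make the "explicit loops and maintained indexes" direction concrete, here is a minimal sketch of a trim that walks forward from the known first sequence instead of calling Min() before every removal. TrimSketch, the dictionary-of-subjects representation, and the parameter names are all assumptions for illustration; the real TrimToMaxMessages will also have to handle tombstones and block state.

```csharp
using System;
using System.Collections.Generic;

public static class TrimSketch
{
    // Removes the oldest live entries until at most maxMessages remain.
    // Returns the new first live sequence, or 0 if nothing is left.
    public static ulong TrimToMax(
        Dictionary<ulong, string> live, // hypothetical: sequence -> subject
        ulong firstSeq,
        ulong lastSeq,
        int maxMessages)
    {
        if (maxMessages < 0) throw new ArgumentOutOfRangeException(nameof(maxMessages));

        var seq = firstSeq;
        // Single forward pass; Remove is a no-op on sequence gaps.
        while (live.Count > maxMessages && seq <= lastSeq)
        {
            live.Remove(seq);
            seq++;
        }

        // Skip any gap so the returned value points at a live message.
        while (seq <= lastSeq && !live.ContainsKey(seq)) seq++;
        return live.Count > 0 && seq <= lastSeq ? seq : 0;
    }
}
```

A LINQ version that calls `live.Keys.Min()` before each removal rescans every key each time, so trimming k messages from n costs on the order of n times k; the loop above touches each sequence in the trimmed range once.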

Step 2: Preserve recovery and tombstone correctness

  • Verify delete markers, TTL rebuilds, compaction, and sequence-gap handling still match the current parity tests.

Step 3: Run targeted JetStream storage suites

  • Run: dotnet test tests/NATS.Server.JetStream.Tests/NATS.Server.JetStream.Tests.csproj --filter "FullyQualifiedName~StoreInterfaceTests|FullyQualifiedName~FileStoreCrashRecoveryTests|FullyQualifiedName~FileStoreTombstoneTests" -c Release
  • Expected: PASS.

Step 4: Commit the maintenance-path rewrite

  • Run:
git add src/NATS.Server/JetStream/Storage/FileStore.cs tests/NATS.Server.JetStream.Tests/JetStream/Storage/StoreInterfaceTests.cs tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreCrashRecoveryTests.cs tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreTombstoneTests.cs
git commit -m "perf: replace FileStore full scans with indexed loops"

Task 5: Add benchmark coverage, update docs, and run full verification

Files:

  • Create: tests/NATS.Server.Benchmark.Tests/JetStream/FileStoreAppendBenchmarks.cs
  • Modify: Documentation/JetStream/Overview.md

Step 1: Add FileStore-focused benchmarks

  • Cover:
    • append throughput
    • sync publish
    • load-last-by-subject
    • purge/trim maintenance overhead
  • Record allocation deltas before/after.
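
For the allocation deltas, one lightweight option is to sample the runtime's per-thread allocation counter around the code under test. The helper below is a sketch under the assumption that the measured action runs synchronously on the calling thread; AllocationProbe is a hypothetical name, and the existing benchmark harness may already provide its own measurement.

```csharp
using System;

public static class AllocationProbe
{
    // Returns the bytes allocated on the current thread while running
    // `action` `iterations` times. Only meaningful for synchronous work
    // that stays on this thread.
    public static long MeasureAllocatedBytes(Action action, int iterations = 1)
    {
        if (iterations < 1) throw new ArgumentOutOfRangeException(nameof(iterations));

        action(); // warm-up so JIT and one-time initialization are not counted

        var before = GC.GetAllocatedBytesForCurrentThread();
        for (var i = 0; i < iterations; i++) action();
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }
}
```

Running the same probe against the old and new append paths, with identical payload sizes and iteration counts, gives a before/after figure that can be recorded alongside the throughput numbers. Work that continues on other threads after an await is not captured by this counter.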

Step 2: Update JetStream documentation

  • Document how FileStore now separates metadata/index concerns from payload storage and where copies still remain by design.

Step 3: Run full verification

  • Run: dotnet test tests/NATS.Server.JetStream.Tests/NATS.Server.JetStream.Tests.csproj -c Release
  • Run: dotnet test tests/NATS.Server.Benchmark.Tests/NATS.Server.Benchmark.Tests.csproj --filter "FullyQualifiedName~FileStore|FullyQualifiedName~SyncPublish|FullyQualifiedName~AsyncPublish" -c Release
  • Expected: PASS; benchmark output shows fewer allocations in append-heavy scenarios than the pre-change baseline recorded in Step 1.

Step 4: Commit docs and benchmarks

  • Run:
git add tests/NATS.Server.Benchmark.Tests/JetStream/FileStoreAppendBenchmarks.cs Documentation/JetStream/Overview.md
git commit -m "docs: record FileStore payload and index strategy"

Task 6: Merge the verified worktree branch back into main

Files:

  • No source-file changes expected unless the merge surfaces conflicts that require a follow-up fix

Step 1: Confirm the worktree branch is clean and fully verified

  • Re-run the Task 5 verification commands in the worktree if anything changed after the final commit.
  • Run:
git status --short
  • Expected: no uncommitted changes.

Step 2: Update main before merging

  • From the primary checkout, run:
git switch main
git pull --ff-only
  • Expected: local main matches the latest remote state.

Step 3: Merge the finished branch back to main

  • Run:
git merge --ff-only codex/filestore-payload-index-optimization
  • Expected: main fast-forwards to include the completed FileStore optimization commits.
  • If fast-forward is not possible, rebase codex/filestore-payload-index-optimization onto main, re-run verification, and then repeat this step.

Step 4: Confirm main still passes after the merge

  • Run:
dotnet test tests/NATS.Server.JetStream.Tests/NATS.Server.JetStream.Tests.csproj -c Release
  • Expected: PASS on main.

Step 5: Remove the temporary worktree after merge confirmation

  • Run:
git worktree remove .worktrees/filestore-payload-index-optimization
git branch -d codex/filestore-payload-index-optimization
  • Expected: the temporary checkout is removed and the topic branch is no longer needed locally.

Task 7: Run the benchmark suite per the benchmark README and update the comparison document

Files:

  • Modify: benchmarks_comparison.md
  • Reference: tests/NATS.Server.Benchmark.Tests/README.md

Step 1: Run the full benchmark suite with the README-prescribed command

  • From main after Task 6 succeeds, run:
dotnet test tests/NATS.Server.Benchmark.Tests \
  --filter "Category=Benchmark" \
  -v normal \
  --logger "console;verbosity=detailed" 2>&1 | tee /tmp/bench-output.txt
  • Expected: the benchmark suite completes and writes comparison blocks to /tmp/bench-output.txt.

Step 2: Extract the benchmark results from the captured output

  • Review the Standard Output Messages sections in /tmp/bench-output.txt.
  • Capture the updated values for:
    • core pub/sub throughput
    • request/reply latency
    • JetStream sync publish
    • JetStream async file publish
    • ordered consumer throughput
    • durable consumer fetch throughput

Step 3: Update benchmarks_comparison.md

  • Update:
    • the benchmark run date on the first line
    • environment details if they changed
    • all affected tables with the new msg/s, MB/s, ratio, and latency values
    • the Summary and Key Observations text if the new ratios materially change the assessment

Step 4: Verify the comparison document changes are the only remaining edits

  • Run:
git status --short
  • Expected: only benchmarks_comparison.md is modified at this point unless the benchmark run surfaced a legitimate follow-up issue to capture separately.

Step 5: Commit the benchmark comparison refresh

  • Run:
git add benchmarks_comparison.md
git commit -m "docs: update benchmark comparison after FileStore optimization"

Completion Checklist

  • Implementation started from an isolated git worktree on codex/filestore-payload-index-optimization.
  • AppendAsync and StoreMsg avoid unnecessary duplicate payload ownership.
  • LoadLastBySubjectAsync, TrimToMaxMessages, and PurgeEx no longer rely on repeated LINQ full scans.
  • First/last/live-sequence bookkeeping is maintained incrementally.
  • JetStream storage, recovery, compression, encryption, and tombstone tests remain green.
  • FileStore-focused benchmark coverage exists in tests/NATS.Server.Benchmark.Tests/JetStream/.
  • Documentation/JetStream/Overview.md explains the updated storage/index model.
  • Verified work has been merged back into main and the temporary worktree has been removed.
  • Full benchmark suite has been run from main using the command in tests/NATS.Server.Benchmark.Tests/README.md.
  • benchmarks_comparison.md has been updated to reflect the new benchmark results.

Concise Execution Checklist For The Current Codebase

  • Create codex/filestore-payload-index-optimization in .worktrees/filestore-payload-index-optimization and verify tests/NATS.Server.JetStream.Tests/NATS.Server.JetStream.Tests.csproj passes before changes.
  • Add optimization-guard coverage in tests/NATS.Server.JetStream.Tests/JetStreamStoreIndexTests.cs, tests/NATS.Server.JetStream.Tests/JetStream/Storage/StoreInterfaceTests.cs, and new tests/NATS.Server.JetStream.Tests/JetStream/Storage/FileStoreOptimizationGuardTests.cs.
  • Rework the current FileStore hot paths in src/NATS.Server/JetStream/Storage/FileStore.cs: AppendAsync, LoadLastBySubjectAsync, TrimToMaxMessages, StoreMsg, and PurgeEx.
  • Introduce compact FileStore indexing metadata in new src/NATS.Server/JetStream/Storage/StoredMessageIndex.cs and adjust src/NATS.Server/JetStream/Storage/StoredMessage.cs accordingly.
  • Remove avoidable payload duplication in src/NATS.Server/JetStream/Storage/FileStore.cs, src/NATS.Server/JetStream/Storage/MsgBlock.cs, and src/NATS.Server/JetStream/Storage/MessageRecord.cs.
  • Keep JetStream storage parity green by re-running the existing storage-focused suites under tests/NATS.Server.JetStream.Tests/JetStream/Storage/, especially compression, crash recovery, tombstones, and store interface coverage.
  • Add FileStore benchmark coverage alongside the existing JetStream benchmark classes in tests/NATS.Server.Benchmark.Tests/JetStream/.
  • Update Documentation/JetStream/Overview.md to describe the new payload/index split and the remaining intentional copy boundaries.
  • Merge the verified topic branch back into main, re-run JetStream tests on main, then remove the temporary worktree.
  • Run the full benchmark suite exactly as documented in tests/NATS.Server.Benchmark.Tests/README.md and update benchmarks_comparison.md with the new measurements.