docs: add JetStream perf investigation notes and test status tracking
Add detailed analysis of the 1,200x JetStream file publish gap identifying the bottleneck in the outbound write path (not FileStore). Add tests.md tracking skipped/failing test status across Core and JetStream suites.
@@ -92,6 +92,57 @@ Benchmark run: 2026-03-13. Both servers running on the same machine, tested with
1. **Pub-only and request/reply are within striking distance** (0.6x–0.85x), suggesting the core message path is reasonably well ported.
2. **Small-payload pub/sub and fan-out are 5x slower** (0.18x ratio). The bottleneck is likely in the subscription dispatch / message delivery hot path — the `SubList.Match()` → `MSG` write loop.
3. **JetStream file store async publish is 1,200x slower than Go** — it remains at ~174 msg/s despite FileStore-level optimizations (buffered writes with a background flush loop, O(1) state tracking, eliminating redundant per-publish work); see [investigation notes](#jetstream-async-file-publish-investigation) below.
4. **JetStream consumption** (durable fetch) is 8x slower than Go. Ordered consumers don't work yet.
5. The multi-pub/sub result showing .NET faster is likely a measurement artifact from the small message count (2,000 per publisher) — not representative at scale.
---
## JetStream Async File Publish Investigation
The async file store publish benchmark publishes 5,000 128-byte messages in batches of 100 to a `Retention=Limits`, `Storage=File`, `MaxMsgs=10_000_000` stream (no MaxAge, no MaxMsgsPer). Go achieves **210,387 msg/s**; .NET achieves **174 msg/s** — a **1,208x** gap.
The JetStream sync memory store benchmark achieves **0.82x** parity, confirming the bottleneck is specific to the file-store async publish path.
### What was optimized (FileStore layer)
Five root causes were identified and fixed in the FileStore/StreamManager layer:
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 1 | **Per-message synchronous disk I/O** — `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in MsgBlock + background flush loop in FileStore (Go's `flushLoop` pattern: coalesce at 16KB or 8ms) | Eliminates per-message syscall overhead |
| 2 | **O(n) `GetStateAsync` per publish** — `_messages.Keys.Min()` and `_messages.Values.Sum()` on every publish for MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1) | Eliminates O(n) scan per publish |
| 3 | **Unnecessary `LoadAsync` after every append** — `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates redundant disk read per publish |
| 4 | **Redundant `PruneExpiredMessages` per publish** — called before every publish even when `MaxAge=0`, and again inside `EnforceRuntimePolicies` | Guarded with `MaxAgeMs > 0` check; removed the pre-publish call (the background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | **`PrunePerSubject` loading all messages per publish** — `EnforceRuntimePolicies` → `PrunePerSubject` called `ListAsync().GroupBy()` even when `MaxMsgsPer=0` | Guarded with `MaxMsgsPer > 0` check | Eliminates O(n) scan per publish |
Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.
### Why the benchmark didn't improve
After all FileStore-level optimizations, the benchmark remained at ~174 msg/s. The bottleneck is **upstream of the storage layer** in the E2E network/protocol processing path.
**Important context from Go source verification:** Go also processes JetStream messages inline on the read goroutine — `processInbound` → `processInboundClientMsg` calls `processJetStreamMsg` synchronously (no channel handoff). `processJetStreamMsg` takes `mset.mu` and calls `store.StoreMsg()` inline (`server/stream.go:5436–6136`). The `pcd` field is `map[*client]struct{}` for deferred outbound flush bookkeeping (`server/client.go:291`), not a channel.
So Go faces the same serial read→process constraint per connection — the 1,200x gap cannot be explained by Go offloading JetStream to another goroutine (it doesn't). The actual differences are:
1. **Write coalescing in the file store** — Go's `writeMsgRecordLocked` appends to `mb.cache.buf` (an in-memory byte slice) and defers disk I/O to a background `flushLoop` goroutine that coalesces at 16KB or 8ms (`server/filestore.go:328, 5796, 5841`). Our .NET port now matches this pattern, but there may be differences in how efficiently the flush loop runs (Task scheduling overhead vs goroutine scheduling).
2. **Coalescing write loop for outbound data** — Go has a dedicated `writeLoop` goroutine per connection that waits on `c.out.sg` (`sync.Cond`, `server/client.go:355, 1274`). Outbound data accumulates in `out.nb` (`net.Buffers`) and is flushed in batches via `net.Buffers.WriteTo` up to `nbMaxVectorSize` buffers (`server/client.go:1615`). The .NET server writes ack responses individually per message — no outbound batching.
3. **Per-message overhead in the .NET protocol path** — The .NET `NatsClient.ProcessInboundAsync` calls `TryCaptureJetStreamPublish` via `.GetAwaiter().GetResult()`, blocking the read loop Task. While Go also processes inline, Go's goroutine scheduler is cheaper for this pattern — goroutines that block on mutex or I/O yield efficiently to the runtime scheduler, whereas .NET's `Task` + `GetAwaiter().GetResult()` on an async context can cause thread pool starvation or synchronization overhead.
4. **AsyncFlush configuration** — Go's file store respects `fcfg.AsyncFlush` (`server/filestore.go:456`). When `AsyncFlush=true` (the default for streams), `writeMsgRecordLocked` does NOT flush synchronously (`server/filestore.go:6803`). When `AsyncFlush=false`, it flushes inline after each write. The .NET benchmark may be triggering synchronous flushes unintentionally.
### What would actually fix it
The fix requires changes to the outbound write path and careful profiling, not further FileStore tuning:
| Change | Description | Go Reference |
|--------|-------------|-------------|
| **Coalescing write loop** | Add a dedicated outbound write loop per connection that batches acks/MSGs using `net.Buffers`-style vectored I/O, woken by a `sync.Cond`-equivalent signal | `server/client.go:1274, 1615` — `writeLoop` with `out.sg` (`sync.Cond`) and `out.nb` (`net.Buffers`) |
| **Eliminate sync-over-async** | Replace `.GetAwaiter().GetResult()` calls in the read loop with true async/await or a synchronous-only code path to avoid thread pool overhead | N/A — architectural difference |
| **Profile Task scheduling** | The background flush loop uses `Task.Delay(1)` for coalescing waits; this may have higher latency than Go's `time.Sleep(1ms)` due to Task scheduler granularity | `server/filestore.go:5841` — `time.Sleep` in `flushLoop` |
| **Verify AsyncFlush is enabled** | Ensure the benchmark stream config sets `AsyncFlush=true` so the file store uses buffered writes rather than synchronous per-message flushes | `server/filestore.go:456` — `fs.fip = !fcfg.AsyncFlush` |
The coalescing write loop is likely the highest-impact change — it explains both the JetStream ack throughput gap and the 0.18x gap in pub/sub (small payload) and fan-out benchmarks.
tests.md (new file, 32 lines)
@@ -0,0 +1,32 @@
# Test Status
## Failing Tests
No known failing tests.
## Skipped Tests
### NATS.Server.Core.Tests (6 skipped)
| Test | Reason |
|------|--------|
| `ClientProtocolGoParityTests.Client_rejects_NRG_subjects_for_non_system_users` | Not yet implemented |
| `ClientProtocolGoParityTests.Header_stripped_for_non_header_subscriber` | Not yet implemented |
| `ClientProtocolGoParityTests.Header_stripped_for_non_header_queue_subscriber` | Not yet implemented |
| `SubjectTreeTests.TestSubjectTreeMatchAllPerf` | Performance-only |
| `SubjectTreeTests.TestSubjectTreeIterPerf` | Performance-only |
| `SubjectTreeTests.TestSubjectTreeGSLIntersection` | Not yet implemented |
### NATS.Server.JetStream.Tests (9 skipped)
| Test | Reason |
|------|--------|
| `MirrorSourceGoParityTests.Mirror_consumer_fails_after_restart_and_recovers` | Requires real server restart with FileStore persistence |
| `MirrorSourceGoParityTests.Remove_external_source_stops_forwarding` | Requires multi-server leaf node topology |
| `MirrorSourceGoParityTests.Work_queue_source_recovers_after_restart` | Requires real server restart |
| `MirrorSourceGoParityTests.Work_queue_source_naming_recovers_after_restart` | Requires real server restart |
| `MirrorSourceGoParityTests.Stream_update_with_external_source_works` | Requires multi-server leaf node topology |
| `FileStoreRecovery2Tests.WriteFailures_FullDisk_SkipEnvironmentSpecific` | Environment-specific (requires Docker tmpfs) |
| `FileStoreRecovery2Tests.PartialIndexes_StoreAndLoadAfterCacheExpiry` | Obsolete in Go upstream |
| `FileStoreRecovery2Tests.WriteFullState_HighSubjectCardinality` | Performance benchmark only |
| `FileStoreRecovery2Tests.Corruption_HbitSanityCheck` | Go-internal wire format (hbit/cbit) |