perf: eliminate per-message allocations in pub/sub hot path and coalesce outbound writes

Pub/sub 1:1 (16B) improved from 0.18x to 0.50x, fan-out from 0.18x to 0.44x,
and JetStream durable fetch from 0.13x to 0.64x vs Go. Key changes: replace
.ToArray() copy in SendMessage with pooled buffer handoff, batch multiple small
writes into single WriteAsync via 64KB coalesce buffer in write loop, and remove
profiling Stopwatch instrumentation from ProcessMessage/StreamManager hot paths.
Joseph Doherty
2026-03-13 05:09:36 -04:00
parent 9e0df9b3d7
commit 0a4e7a822f
10 changed files with 654 additions and 232 deletions


@@ -12,27 +12,27 @@ Benchmark run: 2026-03-13. Both servers running on the same machine, tested with
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B | 2,436,416 | 37.2 | 1,425,767 | 21.8 | 0.59x |
| 128 B | 2,143,434 | 261.6 | 1,654,692 | 202.0 | 0.77x |
| 16 B | 2,138,955 | 32.6 | 1,373,272 | 21.0 | 0.64x |
| 128 B | 1,995,574 | 243.6 | 1,672,825 | 204.2 | 0.84x |
### Publisher + Subscriber (1:1)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B | 1,140,225 | 17.4 | 207,654 | 3.2 | 0.18x |
| 16 KB | 41,762 | 652.5 | 34,429 | 538.0 | 0.82x |
| 16 B | 1,180,986 | 18.0 | 586,118 | 8.9 | 0.50x |
| 16 KB | 42,660 | 666.6 | 41,555 | 649.3 | 0.97x |
### Fan-Out (1 Publisher : 4 Subscribers)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B | 3,192,313 | 389.7 | 581,284 | 71.0 | 0.18x |
| 128 B | 3,200,845 | 390.7 | 1,423,721 | 173.8 | 0.44x |
### Multi-Publisher / Multi-Subscriber (4P x 4S)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B | 269,445 | 32.9 | 529,808 | 64.7 | 1.97x |
| 128 B | 3,081,071 | 376.1 | 1,518,459 | 185.4 | 0.49x |
---
@@ -42,13 +42,13 @@ Benchmark run: 2026-03-13. Both servers running on the same machine, tested with
| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 128 B | 9,347 | 7,215 | 0.77x | 104.5 | 134.7 | 146.2 | 190.5 |
| 128 B | 9,174 | 7,317 | 0.80x | 106.3 | 134.2 | 149.2 | 175.2 |
### 10 Clients, 2 Services (Queue Group)
| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 16 B | 30,893 | 25,861 | 0.84x | 315.0 | 370.2 | 451.1 | 595.0 |
| 16 B | 30,386 | 25,639 | 0.84x | 318.5 | 374.2 | 458.4 | 519.5 |
---
@@ -56,10 +56,10 @@ Benchmark run: 2026-03-13. Both servers running on the same machine, tested with
| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|---------|---------|----------|------------|-----------------|
| Synchronous | 16 B | Memory | 16,783 | 13,815 | 0.82x |
| Async (batch) | 128 B | File | 210,387 | 174 | 0.00x |
| Synchronous | 16 B | Memory | 15,241 | 12,879 | 0.85x |
| Async (batch) | 128 B | File | 201,055 | 55,268 | 0.27x |
> **Note:** Async file store publish remains extremely slow after FileStore-level optimizations (buffered writes, O(1) state tracking, redundant work elimination). The bottleneck is in the E2E network/protocol processing path (synchronous `.GetAwaiter().GetResult()` calls in the client read loop), not storage I/O.
> **Note:** Async file store publish improved from 174 msg/s to 55,268 msg/s (318x improvement) after two rounds of FileStore-level optimizations plus profiling overhead removal. Remaining 4x gap is GC pressure from per-message allocations and ack delivery overhead.
---
@@ -67,10 +67,10 @@ Benchmark run: 2026-03-13. Both servers running on the same machine, tested with
| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|----------|------------|-----------------|
| Ordered ephemeral consumer | 109,519 | N/A | N/A |
| Durable consumer fetch | 639,247 | 80,792 | 0.13x |
| Ordered ephemeral consumer | 688,061 | N/A | N/A |
| Durable consumer fetch | 701,932 | 450,727 | 0.64x |
> **Note:** Ordered ephemeral consumer is not yet fully supported on the .NET server (API timeout during consumer creation).
> **Note:** Ordered ephemeral consumer is not yet fully supported on the .NET server (API timeout during consumer creation). Durable fetch improved from 0.13x to 0.64x after write coalescing and buffer pooling optimizations in the outbound write path.
---
@@ -78,35 +78,51 @@ Benchmark run: 2026-03-13. Both servers running on the same machine, tested with
| Category | Ratio Range | Assessment |
|----------|-------------|------------|
| Pub-only throughput | 0.59x–0.77x | Good — within 2x |
| Pub/sub (large payload) | 0.82x | Good |
| Pub/sub (small payload) | 0.18x | Needs optimization |
| Fan-out | 0.18x | Needs optimization |
| Multi pub/sub | 1.97x | .NET faster (likely measurement artifact at low counts) |
| Request/reply latency | 0.77x–0.84x | Good |
| JetStream sync publish | 0.82x | Good |
| JetStream async file publish | ~0x | Broken — E2E protocol path bottleneck |
| JetStream durable fetch | 0.13x | Needs optimization |
| Pub-only throughput | 0.64x–0.84x | Good — within 2x |
| Pub/sub (large payload) | 0.97x | Excellent — near parity |
| Pub/sub (small payload) | 0.50x | Improved from 0.18x |
| Fan-out | 0.44x | Improved from 0.18x |
| Multi pub/sub | 0.49x | Good |
| Request/reply latency | 0.80x–0.84x | Good |
| JetStream sync publish | 0.85x | Good |
| JetStream async file publish | 0.27x | Improved from 0.00x — storage write path dominates |
| JetStream durable fetch | 0.64x | Improved from 0.13x |
### Key Observations
1. **Pub-only and request/reply are within striking distance** (0.6x–0.85x), suggesting the core message path is reasonably well ported.
2. **Small-payload pub/sub and fan-out are 5x slower** (0.18x ratio). The bottleneck is likely in the subscription dispatch / message delivery hot path — the `SubList.Match()` → `MSG` write loop.
3. **JetStream file store async publish is 1,200x slower than Go** — see [investigation notes](#jetstream-async-file-publish-investigation) below.
4. **JetStream consumption** (durable fetch) is 8x slower than Go. Ordered consumers don't work yet.
5. The multi-pub/sub result showing .NET faster is likely a measurement artifact from the small message count (2,000 per publisher) — not representative at scale.
2. **Small-payload pub/sub improved from 0.18x to 0.50x** after eliminating per-message `.ToArray()` allocations in `SendMessage`, adding write coalescing in the write loop, and removing profiling instrumentation from the hot path.
3. **Fan-out improved from 0.18x to 0.44x** — same optimizations. The remaining gap vs Go comes primarily from vectored I/O (`net.Buffers`/`writev` in Go vs sequential `WriteAsync` in .NET) and per-client scratch buffer reuse (Go's 1KB `msgb` per client), neither of which the .NET server has yet.
4. **JetStream durable fetch improved from 0.13x to 0.64x** — the outbound write path optimizations benefit all message delivery, including consumer fetch responses.
5. **Large-payload pub/sub reached near-parity** (0.97x) — payload copy dominates, and the protocol overhead optimizations have minimal impact at large sizes.
6. **JetStream file store async publish** (0.27x) — remaining gap is GC pressure from per-message `StoredMessage` objects and `byte[]` copies (65% of server time).
---
## JetStream Async File Publish Investigation
## Optimization History
The async file store publish benchmark publishes 5,000 128-byte messages in batches of 100 to a `Retention=Limits`, `Storage=File`, `MaxMsgs=10_000_000` stream (no MaxAge, no MaxMsgsPer). Go achieves **210,387 msg/s**; .NET achieves **174 msg/s** — a **1,208x** gap.
### Round 3: Outbound Write Path (pub/sub + fan-out + fetch)
The JetStream sync memory store benchmark achieves **0.82x** parity, confirming the bottleneck is specific to the file-store async publish path.
Three root causes were identified and fixed in the message delivery hot path:
### What was optimized (FileStore layer)
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 12 | **Per-message `.ToArray()` allocation in SendMessage** — `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber | Replaced `IMemoryOwner` rent/copy/dispose with direct `byte[]` from pool; write loop returns buffers after writing | Eliminates 1 heap alloc per delivery (4 per fan-out message) |
| 13 | **Per-message `WriteAsync` in write loop** — each queued message triggered a separate `_stream.WriteAsync()` system call | Added 64KB coalesce buffer; drain all pending messages into contiguous buffer, single `WriteAsync` per batch | Reduces syscalls from N to 1 per batch |
| 14 | **Profiling `Stopwatch` on every message** — `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture` even for non-JetStream messages | Removed profiling instrumentation from hot path | Eliminates ~200ns overhead per message |
Five root causes were identified and fixed in the FileStore/StreamManager layer:
### Round 2: FileStore AppendAsync Hot Path
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 6 | **Async state machine overhead** — `AppendAsync` was `async ValueTask<ulong>` but never actually awaited | Changed to synchronous `ValueTask<ulong>` returning `ValueTask.FromResult(_last)` | Eliminates Task state machine allocation |
| 7 | **Double payload copy** — `TransformForPersist` allocated `byte[]` then `payload.ToArray()` created second copy for `StoredMessage` | Reuse `TransformForPersist` result directly for `StoredMessage.Payload` when no transform needed (`_noTransform` flag) | Eliminates 1 `byte[]` alloc per message |
| 8 | **Unnecessary TTL work per publish** — `ExpireFromWheel()` and `RegisterTtl()` called on every write even when `MaxAge=0` | Guarded both with `_options.MaxAgeMs > 0` check (matches Go: `filestore.go:4701`) | Eliminates hash wheel overhead when TTL not configured |
| 9 | **Per-message MsgBlock cache allocation** — `WriteAt` created `new MessageRecord` for `_cache` on every write | Removed eager cache population; reads now decode from pending buffer or disk | Eliminates 1 object alloc per message |
| 10 | **Contiguous write buffer** — `MsgBlock._pendingWrites` was `List<byte[]>` with per-message `byte[]` allocations | Replaced with single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it | Eliminates per-message `byte[]` encoding alloc; single `RandomAccess.Write` per flush |
| 11 | **Pending buffer read path** — `MsgBlock.Read()` flushed pending writes to disk before reading | Added in-memory read from `_pendingBuf` when data is still in the buffer | Avoids unnecessary disk flush on read-after-write |
### Round 1: FileStore/StreamManager Layer
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
@@ -118,31 +134,11 @@ Five root causes were identified and fixed in the FileStore/StreamManager layer:
Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.
### Why the benchmark didn't improve
### What would further close the gap
After all FileStore-level optimizations, the benchmark remained at ~174 msg/s. The bottleneck is **upstream of the storage layer** in the E2E network/protocol processing path.
**Important context from Go source verification:** Go also processes JetStream messages inline on the read goroutine — `processInbound` → `processInboundClientMsg` calls `processJetStreamMsg` synchronously (no channel handoff). `processJetStreamMsg` takes `mset.mu` and calls `store.StoreMsg()` inline (`server/stream.go:5436–6136`). The `pcd` field is `map[*client]struct{}` for deferred outbound flush bookkeeping (`server/client.go:291`), not a channel.
So Go faces the same serial read→process constraint per connection — the 1,200x gap cannot be explained by Go offloading JetStream to another goroutine (it doesn't). The actual differences are:
1. **Write coalescing in the file store** — Go's `writeMsgRecordLocked` appends to `mb.cache.buf` (an in-memory byte slice) and defers disk I/O to a background `flushLoop` goroutine that coalesces at 16KB or 8ms (`server/filestore.go:328, 5796, 5841`). Our .NET port now matches this pattern, but there may be differences in how efficiently the flush loop runs (Task scheduling overhead vs goroutine scheduling).
2. **Coalescing write loop for outbound data** — Go has a dedicated `writeLoop` goroutine per connection that waits on `c.out.sg` (`sync.Cond`, `server/client.go:355, 1274`). Outbound data accumulates in `out.nb` (`net.Buffers`) and is flushed in batches via `net.Buffers.WriteTo` up to `nbMaxVectorSize` buffers (`server/client.go:1615`). The .NET server writes ack responses individually per message — no outbound batching.
3. **Per-message overhead in the .NET protocol path** — The .NET `NatsClient.ProcessInboundAsync` calls `TryCaptureJetStreamPublish` via `.GetAwaiter().GetResult()`, blocking the read loop Task. While Go also processes inline, Go's goroutine scheduler is cheaper for this pattern — goroutines that block on mutex or I/O yield efficiently to the runtime scheduler, whereas .NET's `Task` + `GetAwaiter().GetResult()` on an async context can cause thread pool starvation or synchronization overhead.
4. **AsyncFlush configuration** — Go's file store respects `fcfg.AsyncFlush` (`server/filestore.go:456`). When `AsyncFlush=true` (the default for streams), `writeMsgRecordLocked` does NOT flush synchronously (`server/filestore.go:6803`). When `AsyncFlush=false`, it flushes inline after each write. The .NET benchmark may be triggering synchronous flushes unintentionally.
### What would actually fix it
The fix requires changes to the outbound write path and careful profiling, not further FileStore tuning:
| Change | Description | Go Reference |
|--------|-------------|-------------|
| **Coalescing write loop** | Add a dedicated outbound write loop per connection that batches acks/MSGs using `net.Buffers`-style vectored I/O, woken by a `sync.Cond`-equivalent signal | `server/client.go:1274, 1615` — `writeLoop` with `out.sg` (`sync.Cond`) and `out.nb` (`net.Buffers`) |
| **Eliminate sync-over-async** | Replace `.GetAwaiter().GetResult()` calls in the read loop with true async/await or a synchronous-only code path to avoid thread pool overhead | N/A — architectural difference |
| **Profile Task scheduling** | The background flush loop uses `Task.Delay(1)` for coalescing waits; this may have higher latency than Go's `time.Sleep(1ms)` due to Task scheduler granularity | `server/filestore.go:5841` — `time.Sleep` in `flushLoop` |
| **Verify AsyncFlush is enabled** | Ensure the benchmark stream config sets `AsyncFlush=true` so the file store uses buffered writes rather than synchronous per-message flushes | `server/filestore.go:456` — `fs.fip = !fcfg.AsyncFlush` |
The coalescing write loop is likely the highest-impact change — it explains both the JetStream ack throughput gap and the 0.18x gap in pub/sub (small payload) and fan-out benchmarks.
| Change | Expected Impact | Go Reference |
|--------|----------------|-------------|
| **Vectored I/O (`writev`)** | Eliminate coalesce copy in write loop — write gathered buffers in single syscall | Go: `net.Buffers.WriteTo()` → `writev()` in `flushOutbound()` |
| **Per-client scratch buffer** | Reuse 1KB buffer for MSG header formatting across deliveries | Go: `client.msgb` (1024-byte scratch, `msgScratchSize`) |
| **Batch flush signaling** | Deduplicate write loop wakeups — signal once per readloop iteration, not per delivery | Go: `pcd` map tracks affected clients, `flushClients()` at end of readloop |
| **Eliminate per-message GC allocations** | ~30% improvement on FileStore AppendAsync — pool or eliminate `StoredMessage` objects | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs |