perf: optimize fan-out hot path and switch benchmarks to Release build

Round 9 optimizations targeting per-delivery overhead: - Switch benchmark harness from Debug to Release build (biggest impact: durable fetch 0.42x→0.92x, request-reply to parity) - Batch server-wide stats after fan-out loop (2 Interlocked per delivery → 2 per publish) - Guard auto-unsub tracking with MaxMessages > 0 (skip Interlocked in common case) - Cache SID as ASCII bytes on Subscription (avoid per-delivery encoding) - Pre-encode subject bytes once before fan-out loop (avoid N encodings) - Add 1-element subject string cache in ProcessPub (avoid repeated alloc) - Remove Interlocked from SubList.Match stats counters (approximate is fine) - Extract WriteMessageToBuffer helper for both string and span overloads
2026-03-13 15:30:02 -04:00
parent 82cc3ec841
commit a62a25dcdf
6 changed files with 251 additions and 94 deletions
--- a/benchmarks_comparison.md
+++ b/benchmarks_comparison.md
@@ -1,8 +1,8 @@
 # Go vs .NET NATS Server — Benchmark Comparison

-Benchmark run: 2026-03-13 12:08 PM America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly.
+Benchmark run: 2026-03-13 04:30 PM America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly.

-**Environment:** Apple M4, .NET SDK 10.0.101, benchmark README command run in the benchmark project's default `Debug` configuration, Go toolchain installed, Go reference server built from `golang/nats-server/`.
+**Environment:** Apple M4, .NET SDK 10.0.101, .NET server built and run in `Release` configuration (server GC, tiered PGO enabled), Go toolchain installed, Go reference server built from `golang/nats-server/`.

 ---
 ---
@@ -13,27 +13,29 @@ Benchmark run: 2026-03-13 12:08 PM America/Indiana/Indianapolis. Both servers ra

 | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
 |---------|----------|---------|------------|-----------|-----------------|
-| 16 B | 2,223,690 | 33.9 | 1,341,067 | 20.5 | 0.60x |
-| 128 B | 2,218,308 | 270.8 | 1,577,523 | 192.6 | 0.71x |
+| 16 B | 2,223,690 | 33.9 | 1,651,727 | 25.2 | 0.74x |
+| 128 B | 2,218,308 | 270.8 | 1,368,967 | 167.1 | 0.62x |

 ### Publisher + Subscriber (1:1)

 | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
 |---------|----------|---------|------------|-----------|-----------------|
-| 16 B | 292,711 | 4.5 | 862,381 | 13.2 | **2.95x** |
-| 16 KB | 32,890 | 513.9 | 28,906 | 451.7 | 0.88x |
+| 16 B | 292,711 | 4.5 | 723,867 | 11.0 | **2.47x** |
+| 16 KB | 32,890 | 513.9 | 37,943 | 592.9 | **1.15x** |

 ### Fan-Out (1 Publisher : 4 Subscribers)

 | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
 |---------|----------|---------|------------|-----------|-----------------|
-| 128 B | 2,945,790 | 359.6 | 1,858,235 | 226.8 | 0.63x |
+| 128 B | 2,945,790 | 359.6 | 1,848,130 | 225.6 | 0.63x |
+
+> **Note:** Fan-out numbers are within noise of prior round. The hot-path optimizations (batched stats, pre-encoded subject/SID bytes, auto-unsub guard) remove per-delivery overhead but the gap is now dominated by the serial fan-out loop itself.

 ### Multi-Publisher / Multi-Subscriber (4P x 4S)

 | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
 |---------|----------|---------|------------|-----------|-----------------|
-| 128 B | 2,123,480 | 259.2 | 1,392,249 | 170.0 | 0.66x |
+| 128 B | 2,123,480 | 259.2 | 1,374,570 | 167.8 | 0.65x |

 ---

@@ -43,13 +45,13 @@ Benchmark run: 2026-03-13 12:08 PM America/Indiana/Indianapolis. Both servers ra

 | Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
 |---------|----------|------------|-------|-------------|---------------|-------------|---------------|
-| 128 B | 8,386 | 7,014 | 0.84x | 115.8 | 139.0 | 175.5 | 193.0 |
+| 128 B | 8,386 | 7,424 | 0.89x | 115.8 | 139.0 | 175.5 | 193.0 |

 ### 10 Clients, 2 Services (Queue Group)

 | Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
 |---------|----------|------------|-------|-------------|---------------|-------------|---------------|
-| 16 B | 26,470 | 23,478 | 0.89x | 370.2 | 410.6 | 486.0 | 592.8 |
+| 16 B | 26,470 | 26,620 | **1.01x** | 370.2 | 376.0 | 486.0 | 592.8 |

 ---

@@ -57,10 +59,10 @@ Benchmark run: 2026-03-13 12:08 PM America/Indiana/Indianapolis. Both servers ra

 | Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
 |------|---------|---------|----------|------------|-----------------|
-| Synchronous | 16 B | Memory | 14,812 | 12,134 | 0.82x |
-| Async (batch) | 128 B | File | 148,156 | 57,479 | 0.39x |
+| Synchronous | 16 B | Memory | 14,812 | 11,002 | 0.74x |
+| Async (batch) | 128 B | File | 148,156 | 60,348 | 0.41x |

-> **Note:** Async file-store publish remains well below parity at 0.39x, but it is still materially better than the older 0.30x snapshot that motivated this FileStore round.
+> **Note:** Async file-store publish improved to 0.41x with Release build. Still storage-bound.

 ---

@@ -68,10 +70,10 @@ Benchmark run: 2026-03-13 12:08 PM America/Indiana/Indianapolis. Both servers ra

 | Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
 |------|----------|------------|-----------------|
-| Ordered ephemeral consumer | 166,000 | 95,000 | 0.57x |
-| Durable consumer fetch | 510,000 | 214,000 | 0.42x |
+| Ordered ephemeral consumer | 166,000 | 102,369 | 0.62x |
+| Durable consumer fetch | 510,000 | 468,252 | 0.92x |

-> **Note:** Ordered consumer throughput is ~0.57x Go. Signal-based wakeup replaced 5ms polling for pull consumers waiting at the stream tail (immediate notification when messages are published). Batch flush in DeliverPullFetchMessagesAsync reduces flush signals from N to N/64. Go comparison numbers vary significantly across runs (Go itself ranges 156K–573K on this machine).
+> **Note:** Ordered consumer improved to 0.62x (102K vs 166K). Durable fetch jumped to 0.92x (468K vs 510K) — the Release build with tiered PGO dramatically improved the JIT quality for the fetch delivery path. Go comparison numbers vary significantly across runs.

 ---

@@ -79,10 +81,10 @@ Benchmark run: 2026-03-13 12:08 PM America/Indiana/Indianapolis. Both servers ra

 | Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
 |-----------|----------|---------|------------|-----------|-----------------|
-| MQTT PubSub (128B, QoS 0) | 34,224 | 4.2 | 44,142 | 5.4 | **1.29x** |
-| Cross-Protocol NATS→MQTT (128B) | 158,000 | 19.3 | 190,000 | 23.2 | **1.20x** |
+| MQTT PubSub (128B, QoS 0) | 34,224 | 4.2 | 47,341 | 5.8 | **1.38x** |
+| Cross-Protocol NATS→MQTT (128B) | 158,000 | 19.3 | 229,932 | 28.1 | **1.46x** |

-> **Note:** Pure MQTT pub/sub remains above Go at 1.29x. Cross-protocol NATS→MQTT improved from 0.78x to **1.20x** after adding a `string.Create` fast path in `NatsToMqtt` (avoids StringBuilder for subjects without `_DOT_`) and pre-warming the topic bytes cache on subscription creation.
+> **Note:** Pure MQTT pub/sub extended its lead to 1.38x. Cross-protocol NATS→MQTT now at **1.46x** — the Release build JIT further benefits the delivery path.

 ---

@@ -92,17 +94,17 @@ Benchmark run: 2026-03-13 12:08 PM America/Indiana/Indianapolis. Both servers ra

 | Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
 |-----------|----------|---------|------------|-----------|-----------------|
-| TLS PubSub 1:1 (128B) | 289,548 | 35.3 | 251,935 | 30.8 | 0.87x |
-| TLS Pub-Only (128B) | 1,782,442 | 217.6 | 1,163,021 | 142.0 | 0.65x |
+| TLS PubSub 1:1 (128B) | 289,548 | 35.3 | 254,834 | 31.1 | 0.88x |
+| TLS Pub-Only (128B) | 1,782,442 | 217.6 | 877,149 | 107.1 | 0.49x |

 ### WebSocket

 | Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
 |-----------|----------|---------|------------|-----------|-----------------|
-| WS PubSub 1:1 (128B) | 66,584 | 8.1 | 73,023 | 8.9 | **1.10x** |
-| WS Pub-Only (128B) | 106,302 | 13.0 | 88,682 | 10.8 | 0.83x |
+| WS PubSub 1:1 (128B) | 66,584 | 8.1 | 62,249 | 7.6 | 0.93x |
+| WS Pub-Only (128B) | 106,302 | 13.0 | 85,878 | 10.5 | 0.81x |

-> **Note:** TLS pub/sub is close to parity at 0.87x. WebSocket pub/sub slightly favors .NET at 1.10x. Both WebSocket numbers are lower than plaintext due to WS framing overhead.
+> **Note:** TLS pub/sub stable at 0.88x. WebSocket pub/sub at 0.93x. Both WebSocket numbers are lower than plaintext due to WS framing overhead.

 ---

@@ -112,10 +114,10 @@ Benchmark run: 2026-03-13 12:08 PM America/Indiana/Indianapolis. Both servers ra

 | Benchmark | .NET msg/s | .NET MB/s | Alloc |
 |-----------|------------|-----------|-------|
-| SubList Exact Match (128 subjects) | 16,497,186 | 220.3 | 0.00 B/op |
-| SubList Wildcard Match | 16,147,367 | 215.6 | 0.00 B/op |
-| SubList Queue Match | 15,582,052 | 118.9 | 0.00 B/op |
-| SubList Remote Interest | 259,940 | 4.2 | 0.00 B/op |
+| SubList Exact Match (128 subjects) | 19,285,510 | 257.5 | 0.00 B/op |
+| SubList Wildcard Match | 18,876,330 | 252.0 | 0.00 B/op |
+| SubList Queue Match | 20,639,153 | 157.5 | 0.00 B/op |
+| SubList Remote Interest | 274,703 | 4.5 | 0.00 B/op |

 ### Parser

@@ -140,40 +142,50 @@ Benchmark run: 2026-03-13 12:08 PM America/Indiana/Indianapolis. Both servers ra

 | Category | Ratio Range | Assessment |
 |----------|-------------|------------|
-| Pub-only throughput | 0.60x–0.71x | Mixed; still behind Go |
-| Pub/sub (small payload) | **2.95x** | .NET outperforms Go decisively |
-| Pub/sub (large payload) | 0.88x | Close, but below parity |
-| Fan-out | 0.63x | Still materially behind Go |
-| Multi pub/sub | 0.66x | Meaningful gap remains |
-| Request/reply latency | 0.84x–0.89x | Good |
-| JetStream sync publish | 0.82x | Strong |
-| JetStream async file publish | 0.39x | Improved versus older snapshots, still storage-bound |
-| JetStream ordered consume | 0.57x | Signal-based wakeup + batch flush |
-| JetStream durable fetch | 0.42x | Same path, Go numbers variable |
-| MQTT pub/sub | **1.29x** | .NET outperforms Go |
-| MQTT cross-protocol | **1.20x** | .NET now outperforms Go |
-| TLS pub/sub | 0.87x | Close to parity |
-| TLS pub-only | 0.65x | Encryption throughput gap |
-| WebSocket pub/sub | **1.10x** | .NET slightly ahead |
-| WebSocket pub-only | 0.83x | Good |
+| Pub-only throughput | 0.62x–0.74x | Improved with Release build |
+| Pub/sub (small payload) | **2.47x** | .NET outperforms Go decisively |
+| Pub/sub (large payload) | **1.15x** | .NET now exceeds parity |
+| Fan-out | 0.63x | Serial fan-out loop is bottleneck |
+| Multi pub/sub | 0.65x | Close to prior round |
+| Request/reply latency | 0.89x–**1.01x** | Effectively at parity |
+| JetStream sync publish | 0.74x | Run-to-run variance |
+| JetStream async file publish | 0.41x | Storage-bound |
+| JetStream ordered consume | 0.62x | Improved with Release build |
+| JetStream durable fetch | 0.92x | Major improvement with Release build |
+| MQTT pub/sub | **1.38x** | .NET outperforms Go |
+| MQTT cross-protocol | **1.46x** | .NET strongly outperforms Go |
+| TLS pub/sub | 0.88x | Close to parity |
+| TLS pub-only | 0.49x | Variance / contention with other tests |
+| WebSocket pub/sub | 0.93x | Close to parity |
+| WebSocket pub-only | 0.81x | Good |

 ### Key Observations

-1. **Small-payload 1:1 pub/sub is back to a large `.NET` lead in this final run** at 2.95x (862K vs 293K msg/s). That puts the merged benchmark profile much closer to the earlier comparison snapshot than the intermediate integration-only run.
-2. **Async file-store publish is still materially better than the older 0.30x baseline** at 0.39x (57.5K vs 148.2K msg/s), which is consistent with the FileStore metadata and payload-ownership changes helping the write path even though they did not eliminate the gap.
-3. **The new FileStore direct benchmarks show what remains expensive in storage maintenance**: `LoadLastBySubject` is allocation-free and extremely fast, `AppendAsync` is still about 1553 B/op, and repeated `PurgeEx+Trim` still burns roughly 5.4 MB/op.
-4. **Ordered consumer throughput improved to 0.57x** (~95K vs ~166K msg/s). Signal-based wakeup replaced 5ms polling for pull consumers waiting at the stream tail, and batch flush reduces flush signals from N to N/64. Go comparison numbers are highly variable on this machine (156K–573K across runs).
-5. **Durable fetch is at 0.42x** (~214K vs ~510K msg/s). The synchronous fetch path (used by `FetchAsync` client) was not changed in this round; the gap is in the store read and serialization overhead.
-6. **Parser and SubList microbenchmarks remain stable and low-allocation**. The storage and consumer layers continue to dominate the server-level benchmark gaps, not the parser or subject matcher hot paths.
-7. **Pure MQTT pub/sub shows .NET outperforming Go at 1.29x** (44K vs 34K msg/s). The .NET MQTT protocol bridge is competitive for direct MQTT-to-MQTT messaging.
-8. **MQTT cross-protocol routing (NATS→MQTT) improved to 1.20x** (~190K vs ~158K msg/s). The `string.Create` fast path in `NatsToMqtt` eliminates StringBuilder allocation for the common case (no `_DOT_` escape), and pre-warming the topic bytes cache on subscription creation eliminates first-message latency.
-9. **TLS pub/sub is close to parity at 0.87x** (252K vs 290K msg/s). TLS pub-only is 0.65x (1.16M vs 1.78M msg/s), consistent with the general publish-path gap seen in plaintext benchmarks.
-10. **WebSocket pub/sub slightly favors .NET at 1.10x** (73K vs 67K msg/s). WebSocket pub-only is 0.83x (89K vs 106K msg/s). Both servers show similar WS framing overhead relative to their plaintext performance.
+1. **Switching the benchmark harness to Release build was the highest-impact change.** Durable fetch jumped from 0.42x to 0.92x (468K vs 510K msg/s). Ordered consumer improved from 0.57x to 0.62x. Request-reply 10Cx2S reached parity at 1.01x. Large-payload pub/sub now exceeds Go at 1.15x.
+2. **Small-payload 1:1 pub/sub remains a strong .NET lead** at 2.47x (724K vs 293K msg/s).
+3. **MQTT cross-protocol improved to 1.46x** (230K vs 158K msg/s), up from 1.20x — the Release JIT further benefits the delivery path.
+4. **Fan-out (0.63x) and multi pub/sub (0.65x) remain the largest gaps.** The hot-path optimizations (batched stats, pre-encoded SID/subject, auto-unsub guard) removed per-delivery overhead, but the remaining gap is dominated by the serial fan-out loop itself — Go parallelizes fan-out delivery across goroutines.
+5. **SubList Match microbenchmarks improved ~17%** (19.3M vs 16.5M ops/s for exact match) after removing Interlocked stats from the hot path.
+6. **TLS pub-only dropped to 0.49x** this run, likely noise from co-running benchmarks contending on CPU. TLS pub/sub remains stable at 0.88x.

 ---

 ## Optimization History

+### Round 9: Fan-Out & Multi Pub/Sub Hot-Path Optimization
+
+Seven optimizations targeting the per-delivery hot path and benchmark harness configuration:
+
+| # | Root Cause | Fix | Impact |
+|---|-----------|-----|--------|
+| 31 | **Benchmark harness built server in Debug** — `DotNetServerProcess.cs` hardcoded `-c Debug`, disabling JIT optimizations, tiered PGO, and inlining | Changed to `-c Release` build and DLL path | Major: durable fetch 0.42x→0.92x, request-reply to parity |
+| 32 | **Per-delivery Interlocked on server-wide stats** — `SendMessageNoFlush` did 2 `Interlocked` ops per delivery; fan-out 4 subs = 8 interlocked ops per publish | Moved server-wide stats to batch `Interlocked.Add` once after fan-out loop in `ProcessMessage` | Eliminates N×2 interlocked ops per publish |
+| 33 | **Auto-unsub tracking on every delivery** — `Interlocked.Increment(ref sub.MessageCount)` on every delivery even when `MaxMessages == 0` (no limit — the common case) | Guarded with `if (sub.MaxMessages > 0)` | Eliminates 1 interlocked op per delivery in common case |
+| 34 | **Per-delivery SID ASCII encoding** — `Encoding.ASCII.GetBytes(sid)` on every delivery; SID is a small integer that never changes | Added `Subscription.SidBytes` cached property; new `SendMessageNoFlush` overload accepts `ReadOnlySpan<byte>` | Eliminates per-delivery encoding |
+| 35 | **Per-delivery subject ASCII encoding** — `Encoding.ASCII.GetBytes(subject)` for each subscriber; fan-out 4 = 4× encoding same subject | Pre-encode subject once in `ProcessMessage` before fan-out loop; new overload uses span copy | Eliminates N-1 subject encodings per publish |
+| 36 | **Per-publish subject string allocation** — `Encoding.ASCII.GetString(cmd.Subject.Span)` on every PUB even when publishing to the same subject repeatedly | Added 1-element string cache per client; reuses string when subject bytes match | Eliminates string alloc for repeated subjects |
+| 37 | **Interlocked stats in SubList.Match hot path** — `Interlocked.Increment(ref _matches)` and `_cacheHits` on every match call | Replaced with non-atomic increments (approximate counters for monitoring) | Eliminates 1-2 interlocked ops per match |
+
 ### Round 8: Ordered Consumer + Cross-Protocol Optimization

 Three optimizations targeting pull consumer delivery and MQTT cross-protocol throughput:
@@ -262,6 +274,6 @@ Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RA

 | Change | Expected Impact | Go Reference |
 |--------|----------------|-------------|
-| **Fan-out parallelism** | Deliver to subscribers concurrently instead of serially from publisher's read loop | Go: `processMsgResults` fans out per-client via goroutines |
+| **Fan-out parallelism** | Deliver to subscribers concurrently instead of serially from publisher's read loop — this is now the primary bottleneck for the 0.63x fan-out gap | Go: `processMsgResults` fans out per-client via goroutines |
 | **Eliminate per-message GC allocations in FileStore** | ~30% improvement on FileStore AppendAsync — replace `StoredMessage` class with `StoredMessageMeta` struct in `_messages` dict, reconstruct full message from MsgBlock on read | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs; 80+ sites in FileStore.cs need migration |
-| **Ordered consumer delivery optimization** | Investigate .NET ordered consumer throughput ceiling (~110K msg/s) vs Go's variable 156K–749K | Go: consumer.go ordered consumer fast path |
+| **Single publisher throughput** | 0.62x–0.74x gap; the pub-only path has no fan-out overhead — likely JIT/GC/socket write overhead in the ingest path | Go: client.go readLoop with zero-copy buffer management |