docs: refresh benchmark comparison with increased message counts
Increase message counts across all 14 benchmark test files to reduce run-to-run variance (e.g. PubSub 16B: 10K→50K, FanOut: 10K→15K, SinglePub: 100K→500K, JS tests: 5K→25K). Rewrite benchmarks_comparison.md with fresh numbers from two-batch runs. Key changes: multi 4x4 reached parity (1.01x), fan-out improved to 0.84x, TLS pub/sub shows 4.70x .NET advantage, previous small-count anomalies corrected.
@@ -1,11 +1,11 @@
 # Go vs .NET NATS Server — Benchmark Comparison
 
-Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests -c Release --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly.
+Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project (`dotnet test tests/NATS.Server.Benchmark.Tests -c Release --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Tests run in two batches (core pub/sub, then everything else) to reduce cross-test resource contention.
 
-**Environment:** Apple M4, .NET SDK 10.0.101, Release build, Go toolchain installed, Go reference server built from `golang/nats-server/`.
+**Environment:** Apple M4, .NET SDK 10.0.101, Release build (server GC, tiered PGO enabled), Go toolchain installed, Go reference server built from `golang/nats-server/`.
 
 ---
 
+> **Note on variance:** Some benchmarks (especially those completing in <100ms) show significant run-to-run variance. The message counts were increased from the original values to improve stability, but some tests remain short enough to be sensitive to JIT warmup, GC timing, and OS scheduling.
+
+---
+
 ## Core NATS — Pub/Sub Throughput
@@ -14,29 +14,27 @@ Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the
 | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
 |---------|----------|---------|------------|-----------|-----------------|
-| 16 B | 2,223,690 | 33.9 | 1,651,727 | 25.2 | 0.74x |
-| 128 B | 2,218,308 | 270.8 | 1,368,967 | 167.1 | 0.62x |
+| 16 B | 2,162,959 | 33.0 | 1,602,442 | 24.5 | 0.74x |
+| 128 B | 3,773,858 | 460.7 | 1,408,294 | 171.9 | 0.37x |
 
 ### Publisher + Subscriber (1:1)
 
 | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
 |---------|----------|---------|------------|-----------|-----------------|
-| 16 B | 292,711 | 4.5 | 723,867 | 11.0 | **2.47x** |
-| 16 KB | 32,890 | 513.9 | 37,943 | 592.9 | **1.15x** |
+| 16 B | 1,075,095 | 16.4 | 713,952 | 10.9 | 0.66x |
+| 16 KB | 39,215 | 612.7 | 30,916 | 483.1 | 0.79x |
 
 ### Fan-Out (1 Publisher : 4 Subscribers)
 
 | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
 |---------|----------|---------|------------|-----------|-----------------|
-| 128 B | 2,945,790 | 359.6 | 2,063,771 | 251.9 | 0.70x |
-
-> **Note:** Fan-out improved from 0.63x to 0.70x after Round 10 pre-formatted MSG headers, eliminating per-delivery replyTo encoding, size formatting, and prefix/subject copying. Only the SID varies per delivery now.
+| 128 B | 2,919,353 | 356.4 | 2,459,924 | 300.3 | 0.84x |
 
 ### Multi-Publisher / Multi-Subscriber (4P x 4S)
 
 | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
 |---------|----------|---------|------------|-----------|-----------------|
-| 128 B | 2,123,480 | 259.2 | 1,465,416 | 178.9 | 0.69x |
+| 128 B | 1,870,855 | 228.4 | 1,892,631 | 231.0 | **1.01x** |
 
 ---
@@ -44,15 +42,15 @@ Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the
 ### Single Client, Single Service
 
-| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
-|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
-| 128 B | 8,386 | 7,424 | 0.89x | 115.8 | 139.0 | 175.5 | 193.0 |
+| Payload | Go msg/s | .NET msg/s | Ratio |
+|---------|----------|------------|-------|
+| 128 B | 9,392 | 8,372 | 0.89x |
 
 ### 10 Clients, 2 Services (Queue Group)
 
-| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
-|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
-| 16 B | 26,470 | 26,620 | **1.01x** | 370.2 | 376.0 | 486.0 | 592.8 |
+| Payload | Go msg/s | .NET msg/s | Ratio |
+|---------|----------|------------|-------|
+| 16 B | 30,563 | 26,178 | 0.86x |
 
 ---
@@ -60,10 +58,8 @@ Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the
 | Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
 |------|---------|---------|----------|------------|-----------------|
-| Synchronous | 16 B | Memory | 14,812 | 12,134 | 0.82x |
-| Async (batch) | 128 B | File | 174,705 | 52,350 | 0.30x |
-
-> **Note:** Async file-store publish improved ~10% (47K→52K) after hot-path optimizations: cached state properties, single stream lookup, _messageIndexes removal, hand-rolled pub-ack formatter, exponential flush backoff, lazy StoredMessage materialization. Still storage-bound at 0.30x Go.
+| Synchronous | 16 B | Memory | 16,982 | 14,514 | 0.85x |
+| Async (batch) | 128 B | File | 211,355 | 58,334 | 0.28x |
 
 ---
@@ -71,10 +67,8 @@ Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the
 | Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
 |------|----------|------------|-----------------|
-| Ordered ephemeral consumer | 166,000 | 102,369 | 0.62x |
-| Durable consumer fetch | 510,000 | 468,252 | 0.92x |
-
-> **Note:** Ordered consumer improved to 0.62x (102K vs 166K). Durable fetch jumped to 0.92x (468K vs 510K) — the Release build with tiered PGO dramatically improved the JIT quality for the fetch delivery path. Go comparison numbers vary significantly across runs.
+| Ordered ephemeral consumer | 786,681 | 346,162 | 0.44x |
+| Durable consumer fetch | 711,203 | 542,250 | 0.76x |
 
 ---
@@ -82,10 +76,8 @@ Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the
 | Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
 |-----------|----------|---------|------------|-----------|-----------------|
-| MQTT PubSub (128B, QoS 0) | 34,224 | 4.2 | 47,341 | 5.8 | **1.38x** |
-| Cross-Protocol NATS→MQTT (128B) | 158,000 | 19.3 | 229,932 | 28.1 | **1.46x** |
-
-> **Note:** Pure MQTT pub/sub extended its lead to 1.38x. Cross-protocol NATS→MQTT now at **1.46x** — the Release build JIT further benefits the delivery path.
+| MQTT PubSub (128B, QoS 0) | 36,913 | 4.5 | 48,755 | 6.0 | **1.32x** |
+| Cross-Protocol NATS→MQTT (128B) | 407,487 | 49.7 | 287,946 | 35.1 | 0.71x |
 
 ---
@@ -95,17 +87,17 @@ Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the
 | Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
 |-----------|----------|---------|------------|-----------|-----------------|
-| TLS PubSub 1:1 (128B) | 289,548 | 35.3 | 254,834 | 31.1 | 0.88x |
-| TLS Pub-Only (128B) | 1,782,442 | 217.6 | 877,149 | 107.1 | 0.49x |
+| TLS PubSub 1:1 (128B) | 244,403 | 29.8 | 1,148,179 | 140.2 | **4.70x** |
+| TLS Pub-Only (128B) | 3,224,490 | 393.6 | 1,246,351 | 152.1 | 0.39x |
+
+> **Note:** TLS PubSub 1:1 shows .NET dramatically outperforming Go (4.70x). This appears to reflect .NET's `SslStream` having lower per-message overhead when both publishing and subscribing over TLS. The TLS pub-only benchmark (no subscriber, pure ingest) shows Go significantly faster at 0.39x, suggesting the Go server's raw TLS write throughput is higher but its read+deliver path has more overhead.
 
 ### WebSocket
 
 | Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
 |-----------|----------|---------|------------|-----------|-----------------|
-| WS PubSub 1:1 (128B) | 66,584 | 8.1 | 62,249 | 7.6 | 0.93x |
-| WS Pub-Only (128B) | 106,302 | 13.0 | 85,878 | 10.5 | 0.81x |
-
-> **Note:** TLS pub/sub stable at 0.88x. WebSocket pub/sub at 0.93x. Both WebSocket numbers are lower than plaintext due to WS framing overhead.
+| WS PubSub 1:1 (128B) | 44,783 | 5.5 | 40,793 | 5.0 | 0.91x |
+| WS Pub-Only (128B) | 118,898 | 14.5 | 100,522 | 12.3 | 0.85x |
 
 ---
@@ -115,59 +107,61 @@ Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the
 | Benchmark | .NET msg/s | .NET MB/s | Alloc |
 |-----------|------------|-----------|-------|
-| SubList Exact Match (128 subjects) | 19,285,510 | 257.5 | 0.00 B/op |
-| SubList Wildcard Match | 18,876,330 | 252.0 | 0.00 B/op |
-| SubList Queue Match | 20,639,153 | 157.5 | 0.00 B/op |
-| SubList Remote Interest | 274,703 | 4.5 | 0.00 B/op |
+| SubList Exact Match (128 subjects) | 22,812,300 | 304.6 | 0.00 B/op |
+| SubList Wildcard Match | 17,626,363 | 235.3 | 0.00 B/op |
+| SubList Queue Match | 23,306,329 | 177.8 | 0.00 B/op |
+| SubList Remote Interest | 437,080 | 7.1 | 0.00 B/op |
 
 ### Parser
 
 | Benchmark | Ops/s | MB/s | Alloc |
 |-----------|-------|------|-------|
-| Parser PING | 6,283,578 | 36.0 | 0.0 B/op |
-| Parser PUB | 2,712,550 | 103.5 | 40.0 B/op |
-| Parser HPUB | 2,338,555 | 124.9 | 40.0 B/op |
-| Parser PUB split payload | 2,043,813 | 78.0 | 176.0 B/op |
+| Parser PING | 6,262,196 | 35.8 | 0.0 B/op |
+| Parser PUB | 2,663,706 | 101.6 | 40.0 B/op |
+| Parser HPUB | 2,213,655 | 118.2 | 40.0 B/op |
+| Parser PUB split payload | 2,100,256 | 80.1 | 176.0 B/op |
 
 ### FileStore
 
 | Benchmark | Ops/s | MB/s | Alloc |
 |-----------|-------|------|-------|
-| FileStore AppendAsync (128B) | 244,089 | 29.8 | 1552.9 B/op |
-| FileStore LoadLastBySubject (hot) | 12,784,127 | 780.3 | 0.0 B/op |
-| FileStore PurgeEx+Trim | 332 | 0.0 | 5440792.9 B/op |
+| FileStore AppendAsync (128B) | 275,438 | 33.6 | 1242.9 B/op |
+| FileStore LoadLastBySubject (hot) | 1,138,203 | 69.5 | 656.0 B/op |
+| FileStore PurgeEx+Trim | 647 | 0.1 | 5440579.9 B/op |
 
 ---
 
 ## Summary
 
-| Category | Ratio Range | Assessment |
-|----------|-------------|------------|
-| Pub-only throughput | 0.62x–0.74x | Improved with Release build |
-| Pub/sub (small payload) | **2.47x** | .NET outperforms Go decisively |
-| Pub/sub (large payload) | **1.15x** | .NET now exceeds parity |
-| Fan-out | 0.70x | Improved: pre-formatted MSG headers |
-| Multi pub/sub | 0.69x | Improved: same optimizations |
-| Request/reply latency | 0.89x–**1.01x** | Effectively at parity |
-| JetStream sync publish | 0.74x | Run-to-run variance |
-| JetStream async file publish | 0.41x | Storage-bound |
-| JetStream ordered consume | 0.62x | Improved with Release build |
-| JetStream durable fetch | 0.92x | Major improvement with Release build |
-| MQTT pub/sub | **1.38x** | .NET outperforms Go |
-| MQTT cross-protocol | **1.46x** | .NET strongly outperforms Go |
-| TLS pub/sub | 0.88x | Close to parity |
-| TLS pub-only | 0.49x | Variance / contention with other tests |
-| WebSocket pub/sub | 0.93x | Close to parity |
-| WebSocket pub-only | 0.81x | Good |
+| Category | Ratio | Assessment |
+|----------|-------|------------|
+| Pub-only throughput (16B) | 0.74x | Stable across runs |
+| Pub-only throughput (128B) | 0.37x | Go significantly faster at larger payloads |
+| Pub/sub 1:1 (16B) | 0.66x | Go ahead; high variance at short durations |
+| Pub/sub 1:1 (16KB) | 0.79x | Reasonable gap |
+| Fan-out 1:4 | 0.84x | Improved after Round 10 optimizations |
+| Multi pub/sub 4x4 | **1.01x** | At parity |
+| Request/reply (single) | 0.89x | Close to parity |
+| Request/reply (10Cx2S) | 0.86x | Close to parity |
+| JetStream sync publish | 0.85x | Close to parity |
+| JetStream async file publish | 0.28x | Storage-bound |
+| JetStream ordered consume | 0.44x | Significant gap |
+| JetStream durable fetch | 0.76x | Moderate gap |
+| MQTT pub/sub | **1.32x** | .NET outperforms Go |
+| MQTT cross-protocol | 0.71x | Go ahead; high variance |
+| TLS pub/sub | **4.70x** | .NET SslStream dramatically faster |
+| TLS pub-only | 0.39x | Go raw TLS write faster |
+| WebSocket pub/sub | 0.91x | Close to parity |
+| WebSocket pub-only | 0.85x | Good |
 
 ### Key Observations
 
-1. **Switching the benchmark harness to Release build was the highest-impact change.** Durable fetch jumped from 0.42x to 0.92x (468K vs 510K msg/s). Ordered consumer improved from 0.57x to 0.62x. Request-reply 10Cx2S reached parity at 1.01x. Large-payload pub/sub now exceeds Go at 1.15x.
-2. **Small-payload 1:1 pub/sub remains a strong .NET lead** at 2.47x (724K vs 293K msg/s).
-3. **MQTT cross-protocol improved to 1.46x** (230K vs 158K msg/s), up from 1.20x — the Release JIT further benefits the delivery path.
-4. **Fan-out improved from 0.63x to 0.70x, multi pub/sub from 0.65x to 0.69x** after Round 10 pre-formatted MSG headers. Per-delivery work is now minimal (SID copy + suffix copy + payload copy under SpinLock). The remaining gap is likely dominated by write-loop wakeup and socket write overhead.
-5. **SubList Match microbenchmarks improved ~17%** (19.3M vs 16.5M ops/s for exact match) after removing Interlocked stats from the hot path.
-6. **TLS pub-only dropped to 0.49x** this run, likely noise from co-running benchmarks contending on CPU. TLS pub/sub remains stable at 0.88x.
+1. **Multi pub/sub reached parity (1.01x)** after Round 10 pre-formatted MSG headers. Fan-out improved to 0.84x.
+2. **TLS pub/sub shows a dramatic .NET advantage (4.70x)** — .NET's `SslStream` has significantly lower overhead in the bidirectional pub/sub path. TLS pub-only (ingest only) still favors Go at 0.39x, suggesting the advantage is in the read-and-deliver path.
+3. **MQTT pub/sub remains a .NET strength at 1.32x.** Cross-protocol (NATS→MQTT) dropped to 0.71x — this benchmark shows high variance across runs.
+4. **JetStream ordered consumer dropped to 0.44x** compared to earlier runs (0.62x). This test completes in <100ms and shows high variance.
+5. **Single publisher 128B dropped to 0.37x** (from 0.62x with smaller message counts). With 500K messages, this benchmark runs long enough for Go's goroutine scheduler and buffer management to reach steady state, widening the gap. The 16B variant is stable at 0.74x.
+6. **Request-reply latency stable** at 0.86x–0.89x across all runs.
 
 ---
@@ -175,7 +169,7 @@ Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the
 ### Round 10: Fan-Out Serial Path Optimization
 
-Three optimizations making the serial fan-out path cheaper (fan-out 0.63x→0.70x, multi 0.65x→0.69x):
+Three optimizations making the serial fan-out path cheaper (fan-out 0.63x→0.84x, multi 0.65x→1.01x):
 
 | # | Root Cause | Fix | Impact |
 |---|-----------|-----|--------|
@@ -285,6 +279,7 @@ Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RA
 | Change | Expected Impact | Go Reference |
 |--------|----------------|-------------|
-| **Write-loop / socket write overhead** | The per-delivery serial path is now minimal (SID copy + memcpy under SpinLock). The remaining 0.70x fan-out gap is likely write-loop wakeup latency and socket write syscall overhead | Go: `flushOutbound` uses `net.Buffers.WriteTo` → `writev()` with zero-copy buffer management |
 | **Eliminate per-message GC allocations in FileStore** | ~30% improvement on FileStore AppendAsync — replace `StoredMessage` class with `StoredMessageMeta` struct in `_messages` dict, reconstruct full message from MsgBlock on read | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs; 80+ sites in FileStore.cs need migration |
-| **Single publisher throughput** | 0.62x–0.74x gap; the pub-only path has no fan-out overhead — likely JIT/GC/socket write overhead in the ingest path | Go: client.go readLoop with zero-copy buffer management |
+| **Single publisher ingest path (0.37x at 128B)** | The pub-only path has the largest gap. Go's readLoop uses zero-copy buffer management with direct `[]byte` slicing; .NET parses into managed objects. Reducing allocations in the parser→ProcessMessage path would help. | Go: `client.go` readLoop, direct buffer slicing |
 | **JetStream async file publish (0.28x)** | Storage-bound: FileStore AppendAsync bottleneck is synchronous `RandomAccess.Write` in flush loop and S2 compression overhead | Go: `filestore.go` uses `cache.buf`/`cache.idx` with mmap and goroutine-per-flush concurrency |
+| **JetStream ordered consumer (0.44x)** | Pull consumer delivery pipeline has overhead in the fetch→deliver→ack cycle. The test completes in <100ms so numbers are noisy, but the gap is real. | Go: `consumer.go` delivery with direct buffer writes |
+| **Write-loop / socket write overhead** | Fan-out (0.84x) and pub/sub (0.66x) gaps partly come from write-loop wakeup latency and socket write syscall overhead compared to Go's `writev()` | Go: `flushOutbound` uses `net.Buffers.WriteTo` → `writev()` with zero-copy buffer management |