From fb0d31c6153b12c45f7020c9d696d0f2476f1a1f Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Fri, 13 Mar 2026 10:18:52 -0400 Subject: [PATCH] docs: refresh benchmark comparison after SubList optimization --- benchmarks_comparison.md | 123 +++++++++++++++++---------------------- 1 file changed, 54 insertions(+), 69 deletions(-) diff --git a/benchmarks_comparison.md b/benchmarks_comparison.md index 1c98912..b821999 100644 --- a/benchmarks_comparison.md +++ b/benchmarks_comparison.md @@ -1,47 +1,10 @@ # Go vs .NET NATS Server — Benchmark Comparison -Benchmark run: 2026-03-13 10:06 AM America/Indiana/Indianapolis. The latest refresh used the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`) and completed successfully as a `.NET`-only run. The Go/.NET comparison tables below remain the last Go-capable comparison baseline. +Benchmark run: 2026-03-13 10:16 AM America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly. -**Environment:** Apple M4, .NET SDK 10.0.101, README benchmark command run in the benchmark project's default `Debug` configuration, Go toolchain installed but the current full-suite run emitted only `.NET` result blocks. +**Environment:** Apple M4, .NET SDK 10.0.101, benchmark README command run in the benchmark project's default `Debug` configuration, Go toolchain installed, Go reference server built from `golang/nats-server/`. --- - -## Latest README Run (.NET only) - -The current refresh came from `/tmp/bench-output.txt` using the benchmark project README workflow. 
Because the run did not emit any Go comparison blocks, the values below are the latest `.NET`-only numbers from that run, and the historical Go/.NET comparison tables are preserved below instead of being overwritten with mixed-source ratios. - -### Core and JetStream - -| Benchmark | .NET msg/s | .NET MB/s | Notes | -|-----------|------------|-----------|-------| -| Single Publisher (16B) | 1,392,442 | 21.2 | README full-suite run | -| Single Publisher (128B) | 1,491,226 | 182.0 | README full-suite run | -| PubSub 1:1 (16B) | 717,731 | 11.0 | README full-suite run | -| PubSub 1:1 (16KB) | 28,450 | 444.5 | README full-suite run | -| Fan-Out 1:4 (128B) | 1,451,748 | 177.2 | README full-suite run | -| Multi 4Px4S (128B) | 244,878 | 29.9 | README full-suite run | -| Request-Reply Single (128B) | 6,840 | 0.8 | P50 142.5 us, P99 203.9 us | -| Request-Reply 10Cx2S (16B) | 22,844 | 0.3 | P50 421.1 us, P99 602.1 us | -| JS Sync Publish (16B Memory) | 12,619 | 0.2 | README full-suite run | -| JS Async Publish (128B File) | 46,631 | 5.7 | README full-suite run | -| JS Ordered Consumer (128B) | 108,057 | 13.2 | README full-suite run | -| JS Durable Fetch (128B) | 490,090 | 59.8 | README full-suite run | - -### Parser Microbenchmarks - -| Benchmark | Ops/s | MB/s | Alloc | -|-----------|-------|------|-------| -| Parser PING | 5,756,370 | 32.9 | 0.0 B/op | -| Parser PUB | 2,537,973 | 96.8 | 40.0 B/op | -| Parser HPUB | 2,298,811 | 122.8 | 40.0 B/op | -| Parser PUB split payload | 2,049,535 | 78.2 | 176.0 B/op | - -### Current Run Highlights - -1. The parser microbenchmarks show the hot path is already at zero allocation for `PING`, with contiguous `PUB` and `HPUB` still paying a small fixed cost for retained field copies. -2. Split-payload `PUB` remains meaningfully more allocation-heavy than contiguous `PUB` because the parser must preserve unread payload state across reads and then materialize contiguous memory at the current client boundary. -3. 
The README-driven suite was a `.NET`-only refresh, so the comparative Go/.NET ratios below should still be treated as the last Go-capable baseline rather than current same-run ratios. - --- ## Core NATS — Pub/Sub Throughput @@ -50,27 +13,27 @@ The current refresh came from `/tmp/bench-output.txt` using the benchmark projec | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) | |---------|----------|---------|------------|-----------|-----------------| -| 16 B | 2,252,242 | 34.4 | 1,610,807 | 24.6 | 0.72x | -| 128 B | 2,199,267 | 268.5 | 1,661,014 | 202.8 | 0.76x | +| 16 B | 2,258,647 | 34.5 | 1,275,230 | 19.5 | 0.56x | +| 128 B | 2,251,274 | 274.8 | 1,661,668 | 202.8 | 0.74x | ### Publisher + Subscriber (1:1) | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) | |---------|----------|---------|------------|-----------|-----------------| -| 16 B | 313,790 | 4.8 | 909,298 | 13.9 | **2.90x** | -| 16 KB | 41,153 | 643.0 | 38,287 | 598.2 | 0.93x | +| 16 B | 296,374 | 4.5 | 875,105 | 13.4 | **2.95x** | +| 16 KB | 32,111 | 501.7 | 30,030 | 469.2 | 0.94x | ### Fan-Out (1 Publisher : 4 Subscribers) | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) | |---------|----------|---------|------------|-----------|-----------------| -| 128 B | 3,217,684 | 392.8 | 1,817,860 | 221.9 | 0.57x | +| 128 B | 2,387,889 | 291.5 | 1,780,888 | 217.4 | 0.75x | ### Multi-Publisher / Multi-Subscriber (4P x 4S) | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) | |---------|----------|---------|------------|-----------|-----------------| -| 128 B | 2,101,337 | 256.5 | 1,527,330 | 186.4 | 0.73x | +| 128 B | 1,079,112 | 131.7 | 953,596 | 116.4 | 0.88x | --- @@ -80,13 +43,13 @@ The current refresh came from `/tmp/bench-output.txt` using the benchmark projec | Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) | 
|---------|----------|------------|-------|-------------|---------------|-------------|---------------| -| 128 B | 9,450 | 7,662 | 0.81x | 103.2 | 128.9 | 145.6 | 170.8 | +| 128 B | 8,506 | 7,182 | 0.84x | 114.9 | 135.2 | 161.2 | 189.8 | ### 10 Clients, 2 Services (Queue Group) | Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) | |---------|----------|------------|-------|-------------|---------------|-------------|---------------| -| 16 B | 31,094 | 26,144 | 0.84x | 316.9 | 368.7 | 439.2 | 559.7 | +| 16 B | 26,610 | 22,533 | 0.85x | 367.7 | 425.3 | 487.4 | 622.5 | --- @@ -94,10 +57,10 @@ The current refresh came from `/tmp/bench-output.txt` using the benchmark projec | Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) | |------|---------|---------|----------|------------|-----------------| -| Synchronous | 16 B | Memory | 17,533 | 14,373 | 0.82x | -| Async (batch) | 128 B | File | 198,237 | 60,416 | 0.30x | +| Synchronous | 16 B | Memory | 13,756 | 9,954 | 0.72x | +| Async (batch) | 128 B | File | 171,761 | 50,711 | 0.30x | -> **Note:** Async file store publish improved from 174 msg/s to 60K msg/s (347x improvement) after two rounds of FileStore-level optimizations plus profiling overhead removal. Remaining 3.3x gap is GC pressure from per-message allocations. +> **Note:** Async file-store publish remains the largest JetStream gap at 0.30x. The bottleneck is still the storage write path and the remaining managed allocation pressure around persisted message state. 
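The note above attributes part of the remaining async file-store gap to per-message allocation pressure on the persistence path. A common mitigation is to recycle write buffers through a pool so the hot persist loop stops producing garbage per message. The sketch below illustrates the general technique in Go (the comparison's reference-server language); the names `bufPool` and `persist` are hypothetical, and this is not code from either server.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool recycles write buffers so the hot persist path does not
// allocate a fresh buffer for every stored message. This is a sketch
// of the pooling technique only, not actual server code.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// persist builds one message record into a pooled buffer and returns
// the record length (a stand-in for the actual disk write).
func persist(subject string, payload []byte) int {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	buf.WriteString(subject)
	buf.WriteByte(' ')
	buf.Write(payload)
	n := buf.Len()
	bufPool.Put(buf) // return the buffer instead of letting it become garbage
	return n
}

func main() {
	total := 0
	for i := 0; i < 1000; i++ {
		total += persist("orders.created", []byte("payload-128B..."))
	}
	fmt.Println("bytes written:", total) // prints "bytes written: 30000"
}
```

Under a profiler, steady-state runs of this loop show essentially no per-message heap growth, which is the property the 0.30x gap analysis points at.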
--- @@ -105,10 +68,32 @@ The current refresh came from `/tmp/bench-output.txt` using the benchmark projec | Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) | |------|----------|------------|-----------------| -| Ordered ephemeral consumer | 748,671 | 114,021 | 0.15x | -| Durable consumer fetch | 662,471 | 488,520 | 0.74x | +| Ordered ephemeral consumer | 135,704 | 107,168 | 0.79x | +| Durable consumer fetch | 533,441 | 375,652 | 0.70x | -> **Note:** Durable fetch improved from 0.13x → 0.60x → **0.74x** after Round 6 optimizations (batch flush, ackReply stack formatting, cached CompiledFilter, pooled fetch list). Ordered consumer ratio dropped due to Go benchmark improvement (748K vs 156K in earlier runs); .NET throughput is stable at ~110K msg/s. +> **Note:** Ordered-consumer results in this run are much closer to parity than earlier snapshots. That suggests prior Go-side variance was material; `.NET` throughput is still clustered around ~107K msg/s. + +--- + +## Hot Path Microbenchmarks (.NET only) + +### SubList + +| Benchmark | .NET msg/s | .NET MB/s | Alloc | +|-----------|------------|-----------|-------| +| SubList Exact Match (128 subjects) | 17,746,607 | 236.9 | 0.00 B/op | +| SubList Wildcard Match | 18,811,278 | 251.2 | 0.00 B/op | +| SubList Queue Match | 20,624,510 | 157.4 | 0.00 B/op | +| SubList Remote Interest | 264,725 | 4.3 | 0.00 B/op | + +### Parser + +| Benchmark | Ops/s | MB/s | Alloc | +|-----------|-------|------|-------| +| Parser PING | 5,598,176 | 32.0 | 0.0 B/op | +| Parser PUB | 2,701,645 | 103.1 | 40.0 B/op | +| Parser HPUB | 2,177,745 | 116.3 | 40.0 B/op | +| Parser PUB split payload | 1,702,439 | 64.9 | 176.0 B/op | --- @@ -116,25 +101,25 @@ The current refresh came from `/tmp/bench-output.txt` using the benchmark projec | Category | Ratio Range | Assessment | |----------|-------------|------------| -| Pub-only throughput | 0.72x–0.76x | Good — within 2x | -| Pub/sub (small payload) | **2.90x** | .NET outperforms Go — direct buffer 
path eliminates all per-message overhead | -| Pub/sub (large payload) | 0.93x | Near parity | -| Fan-out | 0.57x | Improved from 0.18x → 0.44x → 0.66x; batch flush applied but serial delivery remains | -| Multi pub/sub | 0.73x | Improved from 0.49x → 0.84x; variance from system load | -| Request/reply latency | 0.81x–0.84x | Good — improved from 0.77x | -| JetStream sync publish | 0.82x | Good | -| JetStream async file publish | 0.30x | Improved from 0.00x — storage write path dominates | -| JetStream ordered consume | 0.15x | .NET stable ~110K; Go variance high (156K–749K) | -| JetStream durable fetch | **0.74x** | **Improved from 0.60x** — batch flush + ackReply optimization | +| Pub-only throughput | 0.56x–0.74x | Mixed — 128 B is solid, 16 B still trails materially | +| Pub/sub (small payload) | **2.95x** | .NET outperforms Go decisively | +| Pub/sub (large payload) | 0.94x | Near parity | +| Fan-out | 0.75x | Good improvement; still limited by serial delivery | +| Multi pub/sub | 0.88x | Close to parity in this run | +| Request/reply latency | 0.84x–0.85x | Good | +| JetStream sync publish | 0.72x | Good | +| JetStream async file publish | 0.30x | Storage write path still dominates | +| JetStream ordered consume | 0.79x | Much closer to parity in this run | +| JetStream durable fetch | 0.70x | Good | ### Key Observations -1. **Small-payload 1:1 pub/sub outperforms Go by ~3x** (909K vs 314K msg/s). The per-client direct write buffer with `stackalloc` header formatting eliminates all per-message heap allocations and channel overhead. -2. **Durable consumer fetch improved to 0.74x** (489K vs 662K msg/s) — Round 6 batch flush signaling and `string.Create`-based ack reply formatting reduced per-message overhead significantly. -3. **Fan-out holds at ~0.57x** despite batch flush optimization. The remaining gap is goroutine-level parallelism (Go fans out per-client via goroutines; .NET delivers serially). 
The batch flush reduces wakeup overhead but doesn't add concurrency.
-4. **Request/reply improved to 0.81x–0.84x** — deferred flush benefits single-message delivery paths too.
-5. **JetStream file store async publish: 0.30x** — remaining gap is GC pressure from per-message `StoredMessage` objects and `byte[]` copies (Change 2 deferred due to scope: 80+ sites in FileStore.cs need migration).
-6. **JetStream ordered consumer: 0.15x** — ratio drop is due to Go benchmark variance (749K in this run vs 156K previously); .NET throughput stable at ~110K msg/s. Further investigation needed for the Go variability.
+1. **Small-payload 1:1 pub/sub still beats Go by ~3x** (875K vs 296K msg/s). The direct write path continues to pay off when message fanout is simple and payloads are tiny.
+2. **Fan-out and multi pub/sub both improved in this run** to 0.75x and 0.88x respectively. The remaining gap is still consistent with Go's more naturally parallel fanout model.
+3. **Ordered consumer moved up to 0.79x** (107K vs 136K msg/s). That is materially stronger than in earlier runs and suggests that previous Go-side variance was distorting the comparison more than the `.NET` consumer path itself was.
+4. **Durable fetch remains solid at 0.70x**. The Round 6 fetch-path work is still holding, but there is room left in consumer dispatch and storage reads.
+5. **Async file-store publish is still the largest server-level gap at 0.30x**. The storage layer remains the highest-value runtime target after parser and SubList hot-path cleanup.
+6. **The new SubList microbenchmarks show effectively zero temporary allocation per operation** for exact, wildcard, queue, and remote-interest lookups in the current implementation. The contiguous parser hot paths also remain small and stable, while split-payload `PUB` still pays a higher copy cost.

---