From fb0d31c6153b12c45f7020c9d696d0f2476f1a1f Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Fri, 13 Mar 2026 10:18:52 -0400 Subject: [PATCH] docs: refresh benchmark comparison after SubList optimization --- benchmarks_comparison.md | 123 +++++++++++++++++---------------------- 1 file changed, 54 insertions(+), 69 deletions(-) diff --git a/benchmarks_comparison.md b/benchmarks_comparison.md index 1c98912..b821999 100644 --- a/benchmarks_comparison.md +++ b/benchmarks_comparison.md @@ -1,47 +1,10 @@ # Go vs .NET NATS Server — Benchmark Comparison -Benchmark run: 2026-03-13 10:06 AM America/Indiana/Indianapolis. The latest refresh used the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`) and completed successfully as a `.NET`-only run. The Go/.NET comparison tables below remain the last Go-capable comparison baseline. +Benchmark run: 2026-03-13 10:16 AM America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly. -**Environment:** Apple M4, .NET SDK 10.0.101, README benchmark command run in the benchmark project's default `Debug` configuration, Go toolchain installed but the current full-suite run emitted only `.NET` result blocks. +**Environment:** Apple M4, .NET SDK 10.0.101, benchmark README command run in the benchmark project's default `Debug` configuration, Go toolchain installed, Go reference server built from `golang/nats-server/`. --- - -## Latest README Run (.NET only) - -The current refresh came from `/tmp/bench-output.txt` using the benchmark project README workflow. 
Because the run did not emit any Go comparison blocks, the values below are the latest `.NET`-only numbers from that run, and the historical Go/.NET comparison tables are preserved below instead of being overwritten with mixed-source ratios. - -### Core and JetStream - -| Benchmark | .NET msg/s | .NET MB/s | Notes | -|-----------|------------|-----------|-------| -| Single Publisher (16B) | 1,392,442 | 21.2 | README full-suite run | -| Single Publisher (128B) | 1,491,226 | 182.0 | README full-suite run | -| PubSub 1:1 (16B) | 717,731 | 11.0 | README full-suite run | -| PubSub 1:1 (16KB) | 28,450 | 444.5 | README full-suite run | -| Fan-Out 1:4 (128B) | 1,451,748 | 177.2 | README full-suite run | -| Multi 4Px4S (128B) | 244,878 | 29.9 | README full-suite run | -| Request-Reply Single (128B) | 6,840 | 0.8 | P50 142.5 us, P99 203.9 us | -| Request-Reply 10Cx2S (16B) | 22,844 | 0.3 | P50 421.1 us, P99 602.1 us | -| JS Sync Publish (16B Memory) | 12,619 | 0.2 | README full-suite run | -| JS Async Publish (128B File) | 46,631 | 5.7 | README full-suite run | -| JS Ordered Consumer (128B) | 108,057 | 13.2 | README full-suite run | -| JS Durable Fetch (128B) | 490,090 | 59.8 | README full-suite run | - -### Parser Microbenchmarks - -| Benchmark | Ops/s | MB/s | Alloc | -|-----------|-------|------|-------| -| Parser PING | 5,756,370 | 32.9 | 0.0 B/op | -| Parser PUB | 2,537,973 | 96.8 | 40.0 B/op | -| Parser HPUB | 2,298,811 | 122.8 | 40.0 B/op | -| Parser PUB split payload | 2,049,535 | 78.2 | 176.0 B/op | - -### Current Run Highlights - -1. The parser microbenchmarks show the hot path is already at zero allocation for `PING`, with contiguous `PUB` and `HPUB` still paying a small fixed cost for retained field copies. -2. Split-payload `PUB` remains meaningfully more allocation-heavy than contiguous `PUB` because the parser must preserve unread payload state across reads and then materialize contiguous memory at the current client boundary. -3. 
The README-driven suite was a `.NET`-only refresh, so the comparative Go/.NET ratios below should still be treated as the last Go-capable baseline rather than current same-run ratios. - --- ## Core NATS — Pub/Sub Throughput @@ -50,27 +13,27 @@ The current refresh came from `/tmp/bench-output.txt` using the benchmark projec | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) | |---------|----------|---------|------------|-----------|-----------------| -| 16 B | 2,252,242 | 34.4 | 1,610,807 | 24.6 | 0.72x | -| 128 B | 2,199,267 | 268.5 | 1,661,014 | 202.8 | 0.76x | +| 16 B | 2,258,647 | 34.5 | 1,275,230 | 19.5 | 0.56x | +| 128 B | 2,251,274 | 274.8 | 1,661,668 | 202.8 | 0.74x | ### Publisher + Subscriber (1:1) | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) | |---------|----------|---------|------------|-----------|-----------------| -| 16 B | 313,790 | 4.8 | 909,298 | 13.9 | **2.90x** | -| 16 KB | 41,153 | 643.0 | 38,287 | 598.2 | 0.93x | +| 16 B | 296,374 | 4.5 | 875,105 | 13.4 | **2.95x** | +| 16 KB | 32,111 | 501.7 | 30,030 | 469.2 | 0.94x | ### Fan-Out (1 Publisher : 4 Subscribers) | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) | |---------|----------|---------|------------|-----------|-----------------| -| 128 B | 3,217,684 | 392.8 | 1,817,860 | 221.9 | 0.57x | +| 128 B | 2,387,889 | 291.5 | 1,780,888 | 217.4 | 0.75x | ### Multi-Publisher / Multi-Subscriber (4P x 4S) | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) | |---------|----------|---------|------------|-----------|-----------------| -| 128 B | 2,101,337 | 256.5 | 1,527,330 | 186.4 | 0.73x | +| 128 B | 1,079,112 | 131.7 | 953,596 | 116.4 | 0.88x | --- @@ -80,13 +43,13 @@ The current refresh came from `/tmp/bench-output.txt` using the benchmark projec | Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) | 
|---------|----------|------------|-------|-------------|---------------|-------------|---------------| -| 128 B | 9,450 | 7,662 | 0.81x | 103.2 | 128.9 | 145.6 | 170.8 | +| 128 B | 8,506 | 7,182 | 0.84x | 114.9 | 135.2 | 161.2 | 189.8 | ### 10 Clients, 2 Services (Queue Group) | Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) | |---------|----------|------------|-------|-------------|---------------|-------------|---------------| -| 16 B | 31,094 | 26,144 | 0.84x | 316.9 | 368.7 | 439.2 | 559.7 | +| 16 B | 26,610 | 22,533 | 0.85x | 367.7 | 425.3 | 487.4 | 622.5 | --- @@ -94,10 +57,10 @@ The current refresh came from `/tmp/bench-output.txt` using the benchmark projec | Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) | |------|---------|---------|----------|------------|-----------------| -| Synchronous | 16 B | Memory | 17,533 | 14,373 | 0.82x | -| Async (batch) | 128 B | File | 198,237 | 60,416 | 0.30x | +| Synchronous | 16 B | Memory | 13,756 | 9,954 | 0.72x | +| Async (batch) | 128 B | File | 171,761 | 50,711 | 0.30x | -> **Note:** Async file store publish improved from 174 msg/s to 60K msg/s (347x improvement) after two rounds of FileStore-level optimizations plus profiling overhead removal. Remaining 3.3x gap is GC pressure from per-message allocations. +> **Note:** Async file-store publish remains the largest JetStream gap at 0.30x. The bottleneck is still the storage write path and the remaining managed allocation pressure around persisted message state. 
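The note above attributes part of the remaining async file-store gap to per-message allocation pressure on the persistence path. A common mitigation is to recycle write buffers through a pool so the hot persist loop stops producing garbage per message. The sketch below illustrates the general technique in Go (the comparison's reference-server language); the names `bufPool` and `persist` are hypothetical, and this is not code from either server.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool recycles write buffers so the hot persist path does not
// allocate a fresh buffer for every stored message. This is a sketch
// of the pooling technique only, not actual server code.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// persist builds one message record into a pooled buffer and returns
// the record length (a stand-in for the actual disk write).
func persist(subject string, payload []byte) int {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	buf.WriteString(subject)
	buf.WriteByte(' ')
	buf.Write(payload)
	n := buf.Len()
	bufPool.Put(buf) // return the buffer instead of letting it become garbage
	return n
}

func main() {
	total := 0
	for i := 0; i < 1000; i++ {
		total += persist("orders.created", []byte("payload-128B..."))
	}
	fmt.Println("bytes written:", total) // prints "bytes written: 30000"
}
```

Under a profiler, steady-state runs of this loop show essentially no per-message heap growth, which is the property the 0.30x gap analysis points at.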
--- @@ -105,10 +68,32 @@ The current refresh came from `/tmp/bench-output.txt` using the benchmark projec | Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) | |------|----------|------------|-----------------| -| Ordered ephemeral consumer | 748,671 | 114,021 | 0.15x | -| Durable consumer fetch | 662,471 | 488,520 | 0.74x | +| Ordered ephemeral consumer | 135,704 | 107,168 | 0.79x | +| Durable consumer fetch | 533,441 | 375,652 | 0.70x | -> **Note:** Durable fetch improved from 0.13x → 0.60x → **0.74x** after Round 6 optimizations (batch flush, ackReply stack formatting, cached CompiledFilter, pooled fetch list). Ordered consumer ratio dropped due to Go benchmark improvement (748K vs 156K in earlier runs); .NET throughput is stable at ~110K msg/s. +> **Note:** Ordered-consumer results in this run are much closer to parity than earlier snapshots. That suggests prior Go-side variance was material; `.NET` throughput is still clustered around ~107K msg/s. + +--- + +## Hot Path Microbenchmarks (.NET only) + +### SubList + +| Benchmark | .NET msg/s | .NET MB/s | Alloc | +|-----------|------------|-----------|-------| +| SubList Exact Match (128 subjects) | 17,746,607 | 236.9 | 0.00 B/op | +| SubList Wildcard Match | 18,811,278 | 251.2 | 0.00 B/op | +| SubList Queue Match | 20,624,510 | 157.4 | 0.00 B/op | +| SubList Remote Interest | 264,725 | 4.3 | 0.00 B/op | + +### Parser + +| Benchmark | Ops/s | MB/s | Alloc | +|-----------|-------|------|-------| +| Parser PING | 5,598,176 | 32.0 | 0.0 B/op | +| Parser PUB | 2,701,645 | 103.1 | 40.0 B/op | +| Parser HPUB | 2,177,745 | 116.3 | 40.0 B/op | +| Parser PUB split payload | 1,702,439 | 64.9 | 176.0 B/op | --- @@ -116,25 +101,25 @@ The current refresh came from `/tmp/bench-output.txt` using the benchmark projec | Category | Ratio Range | Assessment | |----------|-------------|------------| -| Pub-only throughput | 0.72x–0.76x | Good — within 2x | -| Pub/sub (small payload) | **2.90x** | .NET outperforms Go — direct buffer 
path eliminates all per-message overhead | -| Pub/sub (large payload) | 0.93x | Near parity | -| Fan-out | 0.57x | Improved from 0.18x → 0.44x → 0.66x; batch flush applied but serial delivery remains | -| Multi pub/sub | 0.73x | Improved from 0.49x → 0.84x; variance from system load | -| Request/reply latency | 0.81x–0.84x | Good — improved from 0.77x | -| JetStream sync publish | 0.82x | Good | -| JetStream async file publish | 0.30x | Improved from 0.00x — storage write path dominates | -| JetStream ordered consume | 0.15x | .NET stable ~110K; Go variance high (156K–749K) | -| JetStream durable fetch | **0.74x** | **Improved from 0.60x** — batch flush + ackReply optimization | +| Pub-only throughput | 0.56x–0.74x | Mixed — 128 B is solid, 16 B still trails materially | +| Pub/sub (small payload) | **2.95x** | .NET outperforms Go decisively | +| Pub/sub (large payload) | 0.94x | Near parity | +| Fan-out | 0.75x | Good improvement; still limited by serial delivery | +| Multi pub/sub | 0.88x | Close to parity in this run | +| Request/reply latency | 0.84x–0.85x | Good | +| JetStream sync publish | 0.72x | Good | +| JetStream async file publish | 0.30x | Storage write path still dominates | +| JetStream ordered consume | 0.79x | Much closer to parity in this run | +| JetStream durable fetch | 0.70x | Good | ### Key Observations -1. **Small-payload 1:1 pub/sub outperforms Go by ~3x** (909K vs 314K msg/s). The per-client direct write buffer with `stackalloc` header formatting eliminates all per-message heap allocations and channel overhead. -2. **Durable consumer fetch improved to 0.74x** (489K vs 662K msg/s) — Round 6 batch flush signaling and `string.Create`-based ack reply formatting reduced per-message overhead significantly. -3. **Fan-out holds at ~0.57x** despite batch flush optimization. The remaining gap is goroutine-level parallelism (Go fans out per-client via goroutines; .NET delivers serially). 
The batch flush reduces wakeup overhead but doesn't add concurrency.
-4. **Request/reply improved to 0.81x–0.84x** — deferred flush benefits single-message delivery paths too.
-5. **JetStream file store async publish: 0.30x** — remaining gap is GC pressure from per-message `StoredMessage` objects and `byte[]` copies (Change 2 deferred due to scope: 80+ sites in FileStore.cs need migration).
-6. **JetStream ordered consumer: 0.15x** — ratio drop is due to Go benchmark variance (749K in this run vs 156K previously); .NET throughput stable at ~110K msg/s. Further investigation needed for the Go variability.
+1. **Small-payload 1:1 pub/sub still beats Go by ~3x** (875K vs 296K msg/s). The direct write path continues to pay off when message fanout is simple and payloads are tiny.
+2. **Fan-out and multi pub/sub both improved in this run** to 0.75x and 0.88x respectively. The remaining gap is still consistent with Go's more naturally parallel fanout model.
+3. **Ordered consumer moved up to 0.79x** (107K vs 136K msg/s). That is materially stronger than in earlier runs and suggests that previous Go-side variance was distorting the comparison more than the `.NET` consumer path itself was.
+4. **Durable fetch remains solid at 0.70x**. The Round 6 fetch-path work is still holding, but there is room left in consumer dispatch and storage reads.
+5. **Async file-store publish is still the largest server-level gap at 0.30x**. The storage layer remains the highest-value runtime target after parser and SubList hot-path cleanup.
+6. **The new SubList microbenchmarks show effectively zero temporary allocation per operation** for exact, wildcard, queue, and remote-interest lookups in the current implementation. The contiguous parser hot paths also remain small and stable, while split-payload `PUB` still pays a higher copy cost.

---