docs: refresh benchmark comparison after SubList optimization
@@ -1,47 +1,10 @@
# Go vs .NET NATS Server — Benchmark Comparison

Benchmark run: 2026-03-13 10:06 AM America/Indiana/Indianapolis. The latest refresh used the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`) and completed successfully as a `.NET`-only run. The Go/.NET comparison tables below remain the last Go-capable comparison baseline.

Benchmark run: 2026-03-13 10:16 AM America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly.

**Environment:** Apple M4, .NET SDK 10.0.101, README benchmark command run in the benchmark project's default `Debug` configuration, Go toolchain installed but the current full-suite run emitted only `.NET` result blocks.

**Environment:** Apple M4, .NET SDK 10.0.101, benchmark README command run in the benchmark project's default `Debug` configuration, Go toolchain installed, Go reference server built from `golang/nats-server/`.

---
## Latest README Run (.NET only)

The current refresh came from `/tmp/bench-output.txt` using the benchmark project README workflow. Because the run did not emit any Go comparison blocks, the values below are the latest `.NET`-only numbers from that run, and the historical Go/.NET comparison tables are preserved below instead of being overwritten with mixed-source ratios.

### Core and JetStream

| Benchmark | .NET msg/s | .NET MB/s | Notes |
|-----------|------------|-----------|-------|
| Single Publisher (16B) | 1,392,442 | 21.2 | README full-suite run |
| Single Publisher (128B) | 1,491,226 | 182.0 | README full-suite run |
| PubSub 1:1 (16B) | 717,731 | 11.0 | README full-suite run |
| PubSub 1:1 (16KB) | 28,450 | 444.5 | README full-suite run |
| Fan-Out 1:4 (128B) | 1,451,748 | 177.2 | README full-suite run |
| Multi 4Px4S (128B) | 244,878 | 29.9 | README full-suite run |
| Request-Reply Single (128B) | 6,840 | 0.8 | P50 142.5 us, P99 203.9 us |
| Request-Reply 10Cx2S (16B) | 22,844 | 0.3 | P50 421.1 us, P99 602.1 us |
| JS Sync Publish (16B Memory) | 12,619 | 0.2 | README full-suite run |
| JS Async Publish (128B File) | 46,631 | 5.7 | README full-suite run |
| JS Ordered Consumer (128B) | 108,057 | 13.2 | README full-suite run |
| JS Durable Fetch (128B) | 490,090 | 59.8 | README full-suite run |

### Parser Microbenchmarks

| Benchmark | Ops/s | MB/s | Alloc |
|-----------|-------|------|-------|
| Parser PING | 5,756,370 | 32.9 | 0.0 B/op |
| Parser PUB | 2,537,973 | 96.8 | 40.0 B/op |
| Parser HPUB | 2,298,811 | 122.8 | 40.0 B/op |
| Parser PUB split payload | 2,049,535 | 78.2 | 176.0 B/op |

### Current Run Highlights

1. The parser microbenchmarks show the hot path is already at zero allocation for `PING`, with contiguous `PUB` and `HPUB` still paying a small fixed cost for retained field copies.
2. Split-payload `PUB` remains meaningfully more allocation-heavy than contiguous `PUB` because the parser must preserve unread payload state across reads and then materialize contiguous memory at the current client boundary.
3. The README-driven suite was a `.NET`-only refresh, so the comparative Go/.NET ratios below should still be treated as the last Go-capable baseline rather than current same-run ratios.

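The split-payload cost in point 2 can be sketched as follows. This is an illustrative Go sketch, not the actual server parser — `payloadParser`, `feed`, and the field names are invented for the example — but it shows why a payload that arrives whole can be sliced in place while a split payload forces a retained copy:

```go
package main

import "fmt"

// payloadParser is a minimal sketch (not the real parser) of the split-payload
// cost: a payload that arrives in one read can be sliced from the read buffer
// with no copy, but a payload split across reads must be accumulated in a
// scratch buffer and materialized contiguously.
type payloadParser struct {
	need    int    // payload bytes still expected
	scratch []byte // only used on the split path
}

// feed consumes one read's worth of bytes and returns the completed payload,
// or nil if more reads are needed.
func (p *payloadParser) feed(buf []byte) []byte {
	if p.scratch == nil && len(buf) >= p.need {
		payload := buf[:p.need] // contiguous fast path: zero-copy slice
		p.need = 0
		return payload
	}
	// Split path: retain unread payload state across reads (the extra B/op).
	p.scratch = append(p.scratch, buf...)
	p.need -= len(buf)
	if p.need <= 0 {
		return p.scratch
	}
	return nil
}

func main() {
	fast := &payloadParser{need: 5}
	fmt.Printf("%s\n", fast.feed([]byte("hello"))) // whole payload, no copy

	slow := &payloadParser{need: 5}
	slow.feed([]byte("hel"))                     // partial read: buffered
	fmt.Printf("%s\n", slow.feed([]byte("lo"))) // completed from scratch buffer
}
```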
---

## Core NATS — Pub/Sub Throughput

@@ -50,27 +13,27 @@ The current refresh came from `/tmp/bench-output.txt` using the benchmark projec

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B | 2,252,242 | 34.4 | 1,610,807 | 24.6 | 0.72x |
| 128 B | 2,199,267 | 268.5 | 1,661,014 | 202.8 | 0.76x |
| 16 B | 2,258,647 | 34.5 | 1,275,230 | 19.5 | 0.56x |
| 128 B | 2,251,274 | 274.8 | 1,661,668 | 202.8 | 0.74x |

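For reference, the MB/s columns throughout these tables follow directly from msg/s × payload size, using 1 MB = 1,048,576 bytes (this convention reproduces the table values; e.g. 2,252,242 msg/s × 16 B ≈ 34.4 MB/s). A quick sketch for sanity-checking rows:

```go
package main

import "fmt"

// mbPerSec converts a message rate and payload size into the MB/s figure
// used in the tables above (1 MB = 1024 * 1024 bytes).
func mbPerSec(msgsPerSec float64, payloadBytes int) float64 {
	return msgsPerSec * float64(payloadBytes) / (1024 * 1024)
}

func main() {
	// Go pub-only row: 2,252,242 msg/s at 16 B.
	fmt.Printf("%.1f MB/s\n", mbPerSec(2252242, 16)) // 34.4 MB/s
	// .NET pub-only row: 1,661,014 msg/s at 128 B.
	fmt.Printf("%.1f MB/s\n", mbPerSec(1661014, 128)) // 202.8 MB/s
}
```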
### Publisher + Subscriber (1:1)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B | 313,790 | 4.8 | 909,298 | 13.9 | **2.90x** |
| 16 KB | 41,153 | 643.0 | 38,287 | 598.2 | 0.93x |
| 16 B | 296,374 | 4.5 | 875,105 | 13.4 | **2.95x** |
| 16 KB | 32,111 | 501.7 | 30,030 | 469.2 | 0.94x |

### Fan-Out (1 Publisher : 4 Subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B | 3,217,684 | 392.8 | 1,817,860 | 221.9 | 0.57x |
| 128 B | 2,387,889 | 291.5 | 1,780,888 | 217.4 | 0.75x |

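The fan-out gap is attributed elsewhere in this document to delivery concurrency: Go fans out per client via goroutines, while the .NET path delivers serially. A hedged sketch of the two shapes — `deliverSerial` and `deliverParallel` are invented names, and neither is actual server code:

```go
package main

import (
	"fmt"
	"sync"
)

// deliverSerial writes a message to every subscriber from one goroutine,
// the shape attributed to the current .NET fan-out path.
func deliverSerial(subs []chan []byte, msg []byte) {
	for _, s := range subs {
		s <- msg
	}
}

// deliverParallel fans out with one goroutine per subscriber, roughly the
// per-client concurrency attributed to the Go server.
func deliverParallel(subs []chan []byte, msg []byte) {
	var wg sync.WaitGroup
	for _, s := range subs {
		wg.Add(1)
		go func(c chan<- []byte) {
			defer wg.Done()
			c <- msg
		}(s)
	}
	wg.Wait()
}

func main() {
	subs := make([]chan []byte, 4)
	for i := range subs {
		subs[i] = make(chan []byte, 1) // buffered so delivery never blocks
	}
	deliverParallel(subs, []byte("tick"))
	for _, s := range subs {
		fmt.Printf("%s ", <-s)
	}
	fmt.Println()
}
```

Batch flushing reduces wakeup overhead in the serial model but does not add concurrency, which is consistent with the ratio staying below parity here.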
### Multi-Publisher / Multi-Subscriber (4P x 4S)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B | 2,101,337 | 256.5 | 1,527,330 | 186.4 | 0.73x |
| 128 B | 1,079,112 | 131.7 | 953,596 | 116.4 | 0.88x |

---
@@ -80,13 +43,13 @@ The current refresh came from `/tmp/bench-output.txt` using the benchmark projec

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 128 B | 9,450 | 7,662 | 0.81x | 103.2 | 128.9 | 145.6 | 170.8 |
| 128 B | 8,506 | 7,182 | 0.84x | 114.9 | 135.2 | 161.2 | 189.8 |

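The P50/P99 columns can be reproduced from raw round-trip samples with a nearest-rank percentile. A sketch of one common definition — the actual benchmark harness may compute percentiles differently, and the sample data below is invented:

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank percentile (p in 0–100) of the
// recorded latencies, the usual way P50/P99 columns are built.
func percentile(latenciesUs []float64, p float64) float64 {
	sorted := append([]float64(nil), latenciesUs...)
	sort.Float64s(sorted)
	rank := int(p/100*float64(len(sorted))+0.5) - 1
	if rank < 0 {
		rank = 0
	}
	if rank >= len(sorted) {
		rank = len(sorted) - 1
	}
	return sorted[rank]
}

func main() {
	// Invented round-trip samples in microseconds; note how a single
	// outlier dominates P99 but leaves P50 untouched.
	samples := []float64{100, 110, 120, 130, 140, 150, 160, 170, 180, 500}
	fmt.Println(percentile(samples, 50), percentile(samples, 99)) // 140 500
}
```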
### 10 Clients, 2 Services (Queue Group)

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 16 B | 31,094 | 26,144 | 0.84x | 316.9 | 368.7 | 439.2 | 559.7 |
| 16 B | 26,610 | 22,533 | 0.85x | 367.7 | 425.3 | 487.4 | 622.5 |

---
@@ -94,10 +57,10 @@ The current refresh came from `/tmp/bench-output.txt` using the benchmark projec

| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|---------|---------|----------|------------|-----------------|
| Synchronous | 16 B | Memory | 17,533 | 14,373 | 0.82x |
| Async (batch) | 128 B | File | 198,237 | 60,416 | 0.30x |
| Synchronous | 16 B | Memory | 13,756 | 9,954 | 0.72x |
| Async (batch) | 128 B | File | 171,761 | 50,711 | 0.30x |

> **Note:** Async file store publish improved from 174 msg/s to 60K msg/s (a 347x improvement) after two rounds of FileStore-level optimizations plus profiling-overhead removal. The remaining 3.3x gap is GC pressure from per-message allocations.

> **Note:** Async file-store publish remains the largest JetStream gap at 0.30x. The bottleneck is still the storage write path and the remaining managed allocation pressure around persisted message state.

---
@@ -105,10 +68,32 @@ The current refresh came from `/tmp/bench-output.txt` using the benchmark projec

| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|----------|------------|-----------------|
| Ordered ephemeral consumer | 748,671 | 114,021 | 0.15x |
| Durable consumer fetch | 662,471 | 488,520 | 0.74x |
| Ordered ephemeral consumer | 135,704 | 107,168 | 0.79x |
| Durable consumer fetch | 533,441 | 375,652 | 0.70x |

> **Note:** Durable fetch improved from 0.13x → 0.60x → **0.74x** after Round 6 optimizations (batch flush, ackReply stack formatting, cached CompiledFilter, pooled fetch list). The ordered-consumer ratio dropped because the Go baseline jumped (749K msg/s in this run vs 156K in earlier runs); .NET throughput is stable at ~110K msg/s.

> **Note:** Ordered-consumer results in this run are much closer to parity than earlier snapshots, which suggests prior Go-side variance was material; `.NET` throughput is still clustered around ~107K msg/s.

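The "batch flush" part of the Round 6 work can be sketched as arming the flusher once per batch rather than signaling it per message. This is an illustrative sketch with invented names (`pendingWriter`, `queue`, `flush`), not the server's actual flusher:

```go
package main

import "fmt"

// pendingWriter sketches batch-flush signaling: deliveries append to a
// pending buffer, and only the first append in a batch arms a flush, so a
// batch of N messages wakes the flusher once instead of N times.
type pendingWriter struct {
	pending []byte
	armed   bool
	wakeups int // counts flusher signals, for illustration
}

func (w *pendingWriter) queue(msg []byte) {
	w.pending = append(w.pending, msg...)
	if !w.armed {
		w.armed = true
		w.wakeups++ // signal the flusher exactly once per batch
	}
}

// flush drains the pending bytes and re-arms for the next batch.
func (w *pendingWriter) flush() []byte {
	out := w.pending
	w.pending = w.pending[:0]
	w.armed = false
	return out
}

func main() {
	w := &pendingWriter{}
	for i := 0; i < 100; i++ {
		w.queue([]byte("m"))
	}
	fmt.Println(len(w.flush()), w.wakeups) // 100 1
}
```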
---

## Hot Path Microbenchmarks (.NET only)

### SubList

| Benchmark | .NET msg/s | .NET MB/s | Alloc |
|-----------|------------|-----------|-------|
| SubList Exact Match (128 subjects) | 17,746,607 | 236.9 | 0.00 B/op |
| SubList Wildcard Match | 18,811,278 | 251.2 | 0.00 B/op |
| SubList Queue Match | 20,624,510 | 157.4 | 0.00 B/op |
| SubList Remote Interest | 264,725 | 4.3 | 0.00 B/op |

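A SubList answers subject-match queries under NATS wildcard semantics: subjects split into tokens on `.`, `*` matches exactly one token, and `>` matches one or more trailing tokens. The minimal linear matcher below shows just the semantics; the actual SubList reaches the rates above with a token trie plus a per-subject result cache, not a per-lookup scan:

```go
package main

import (
	"fmt"
	"strings"
)

// subjectMatches reports whether a subscription pattern matches a published
// subject under NATS semantics: '.'-separated tokens, '*' matches exactly
// one token, '>' matches one or more trailing tokens.
func subjectMatches(pattern, subject string) bool {
	pt := strings.Split(pattern, ".")
	st := strings.Split(subject, ".")
	for i, tok := range pt {
		if tok == ">" {
			return i < len(st) // '>' must cover at least one token
		}
		if i >= len(st) {
			return false // subject ran out of tokens
		}
		if tok != "*" && tok != st[i] {
			return false // literal token mismatch
		}
	}
	return len(pt) == len(st) // no leftover subject tokens
}

func main() {
	fmt.Println(subjectMatches("foo.*.baz", "foo.bar.baz")) // true
	fmt.Println(subjectMatches("foo.>", "foo.bar.baz"))     // true
	fmt.Println(subjectMatches("foo.*", "foo.bar.baz"))     // false
}
```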
### Parser

| Benchmark | Ops/s | MB/s | Alloc |
|-----------|-------|------|-------|
| Parser PING | 5,598,176 | 32.0 | 0.0 B/op |
| Parser PUB | 2,701,645 | 103.1 | 40.0 B/op |
| Parser HPUB | 2,177,745 | 116.3 | 40.0 B/op |
| Parser PUB split payload | 1,702,439 | 64.9 | 176.0 B/op |

---
@@ -116,25 +101,25 @@ The current refresh came from `/tmp/bench-output.txt` using the benchmark projec

| Category | Ratio Range | Assessment |
|----------|-------------|------------|
| Pub-only throughput | 0.72x–0.76x | Good — within 2x |
| Pub/sub (small payload) | **2.90x** | .NET outperforms Go — direct buffer path eliminates all per-message overhead |
| Pub/sub (large payload) | 0.93x | Near parity |
| Fan-out | 0.57x | Improved from 0.18x → 0.44x → 0.66x; batch flush applied but serial delivery remains |
| Multi pub/sub | 0.73x | Improved from 0.49x → 0.84x; variance from system load |
| Request/reply latency | 0.81x–0.84x | Good — improved from 0.77x |
| JetStream sync publish | 0.82x | Good |
| JetStream async file publish | 0.30x | Improved from 0.00x — storage write path dominates |
| JetStream ordered consume | 0.15x | .NET stable ~110K; Go variance high (156K–749K) |
| JetStream durable fetch | **0.74x** | **Improved from 0.60x** — batch flush + ackReply optimization |
| Pub-only throughput | 0.56x–0.74x | Mixed — 128 B is solid, 16 B still trails materially |
| Pub/sub (small payload) | **2.95x** | .NET outperforms Go decisively |
| Pub/sub (large payload) | 0.94x | Near parity |
| Fan-out | 0.75x | Good improvement; still limited by serial delivery |
| Multi pub/sub | 0.88x | Close to parity in this run |
| Request/reply latency | 0.84x–0.85x | Good |
| JetStream sync publish | 0.72x | Good |
| JetStream async file publish | 0.30x | Storage write path still dominates |
| JetStream ordered consume | 0.79x | Much closer to parity in this run |
| JetStream durable fetch | 0.70x | Good |

### Key Observations

1. **Small-payload 1:1 pub/sub outperforms Go by ~3x** (909K vs 314K msg/s). The per-client direct write buffer with `stackalloc` header formatting eliminates all per-message heap allocations and channel overhead.
2. **Durable consumer fetch improved to 0.74x** (489K vs 662K msg/s) — Round 6 batch flush signaling and `string.Create`-based ack reply formatting reduced per-message overhead significantly.
3. **Fan-out holds at ~0.57x** despite the batch flush optimization. The remaining gap is goroutine-level parallelism (Go fans out per client via goroutines; .NET delivers serially). The batch flush reduces wakeup overhead but doesn't add concurrency.
4. **Request/reply improved to 0.81x–0.84x** — deferred flush benefits single-message delivery paths too.
5. **JetStream file store async publish: 0.30x** — the remaining gap is GC pressure from per-message `StoredMessage` objects and `byte[]` copies (Change 2 deferred due to scope: 80+ sites in FileStore.cs need migration).
6. **JetStream ordered consumer: 0.15x** — the ratio drop is due to Go benchmark variance (749K in this run vs 156K previously); .NET throughput is stable at ~110K msg/s. Further investigation is needed into the Go variability.

1. **Small-payload 1:1 pub/sub still beats Go by ~3x** (875K vs 296K msg/s). The direct write path continues to pay off when message fanout is simple and payloads are tiny.
2. **Fan-out and multi pub/sub both improved in this run**, to 0.75x and 0.88x respectively. The remaining gap is still consistent with Go's more naturally parallel fanout model.
3. **Ordered consumer moved up to 0.79x** (107K vs 136K msg/s). That is materially stronger than earlier runs and suggests previous Go-side variance was distorting the comparison more than the `.NET` consumer path itself.
4. **Durable fetch remains solid at 0.70x**. The Round 6 fetch-path work is still holding, but there is room left in consumer dispatch and storage reads.
5. **Async file-store publish is still the largest server-level gap at 0.30x**. The storage layer remains the highest-value runtime target after parser and SubList hot-path cleanup.
6. **The new SubList microbenchmarks show effectively zero temporary allocation per operation** for exact, wildcard, queue, and remote-interest lookups in the current implementation. Parser contiguous hot paths also remain small and stable, while split-payload `PUB` still pays a higher copy cost.

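Observation 1 in both lists credits allocation-free header formatting (`stackalloc` in the .NET server). A Go rendering of the same idea — formatting into a caller-owned, reused buffer so steady-state delivery allocates nothing per message. `appendMsgHeader` is an invented name for illustration, not server code:

```go
package main

import (
	"fmt"
	"strconv"
)

// appendMsgHeader formats a client MSG header into a caller-owned buffer
// using append/AppendInt, so a delivery loop that reuses one buffer incurs
// no per-message heap allocation — the Go analogue of stackalloc formatting.
func appendMsgHeader(buf []byte, subject string, sid int64, size int) []byte {
	buf = append(buf, "MSG "...)
	buf = append(buf, subject...)
	buf = append(buf, ' ')
	buf = strconv.AppendInt(buf, sid, 10)
	buf = append(buf, ' ')
	buf = strconv.AppendInt(buf, int64(size), 10)
	return append(buf, '\r', '\n')
}

func main() {
	scratch := make([]byte, 0, 64) // reused across messages
	hdr := appendMsgHeader(scratch[:0], "orders.created", 7, 128)
	fmt.Printf("%q\n", hdr) // "MSG orders.created 7 128\r\n"
}
```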
---