Go vs .NET NATS Server — Benchmark Comparison
Benchmark run: 2026-03-13 10:06 AM America/Indiana/Indianapolis. The latest refresh used the benchmark project README command (dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed") and completed successfully as a .NET-only run. The Go/.NET comparison tables below remain the last Go-capable comparison baseline.
Environment: Apple M4, .NET SDK 10.0.101, README benchmark command run in the benchmark project's default Debug configuration, Go toolchain installed but the current full-suite run emitted only .NET result blocks.
Latest README Run (.NET only)
The current refresh came from /tmp/bench-output.txt using the benchmark project README workflow. Because the run did not emit any Go comparison blocks, the values below are the latest .NET-only numbers from that run, and the historical Go/.NET comparison tables are preserved below instead of being overwritten with mixed-source ratios.
Core and JetStream
| Benchmark | .NET msg/s | .NET MB/s | Notes |
| --- | --- | --- | --- |
| Single Publisher (16B) | 1,392,442 | 21.2 | README full-suite run |
| Single Publisher (128B) | 1,491,226 | 182.0 | README full-suite run |
| PubSub 1:1 (16B) | 717,731 | 11.0 | README full-suite run |
| PubSub 1:1 (16KB) | 28,450 | 444.5 | README full-suite run |
| Fan-Out 1:4 (128B) | 1,451,748 | 177.2 | README full-suite run |
| Multi 4Px4S (128B) | 244,878 | 29.9 | README full-suite run |
| Request-Reply Single (128B) | 6,840 | 0.8 | P50 142.5 us, P99 203.9 us |
| Request-Reply 10Cx2S (16B) | 22,844 | 0.3 | P50 421.1 us, P99 602.1 us |
| JS Sync Publish (16B Memory) | 12,619 | 0.2 | README full-suite run |
| JS Async Publish (128B File) | 46,631 | 5.7 | README full-suite run |
| JS Ordered Consumer (128B) | 108,057 | 13.2 | README full-suite run |
| JS Durable Fetch (128B) | 490,090 | 59.8 | README full-suite run |
Parser Microbenchmarks
| Benchmark | Ops/s | MB/s | Alloc |
| --- | --- | --- | --- |
| Parser PING | 5,756,370 | 32.9 | 0.0 B/op |
| Parser PUB | 2,537,973 | 96.8 | 40.0 B/op |
| Parser HPUB | 2,298,811 | 122.8 | 40.0 B/op |
| Parser PUB split payload | 2,049,535 | 78.2 | 176.0 B/op |
Current Run Highlights
- The parser microbenchmarks show the hot path is already at zero allocation for PING, with contiguous PUB and HPUB still paying a small fixed cost for retained field copies.
- Split-payload PUB remains meaningfully more allocation-heavy than contiguous PUB because the parser must preserve unread payload state across reads and then materialize contiguous memory at the current client boundary.
- The README-driven suite was a .NET-only refresh, so the comparative Go/.NET ratios below should still be treated as the last Go-capable baseline rather than current same-run ratios.
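To make the split-payload cost concrete, here is a minimal Go sketch (illustrative only, not the benchmarked .NET parser; all names are hypothetical) of why a PUB payload that spans two socket reads forces extra retained allocations: the unread prefix must be copied out of the transient read buffer so it survives until the remainder arrives.

```go
package main

import (
	"bytes"
	"fmt"
)

// parser sketches split-payload handling: a contiguous payload can be
// sliced straight out of the read buffer, but a split payload must be
// copied into retained memory (the extra bytes/op seen in the benchmark).
type parser struct {
	need    int          // payload bytes still expected
	partial bytes.Buffer // retained copy of the split prefix
}

// feed consumes one read's worth of payload bytes; it returns the complete
// payload once all bytes have arrived, or nil if more reads are needed.
func (p *parser) feed(chunk []byte) []byte {
	if len(chunk) >= p.need && p.partial.Len() == 0 {
		// Fast path: payload is contiguous in this read, no copy needed.
		payload := chunk[:p.need]
		p.need = 0
		return payload
	}
	// Slow path: materialize the split payload in retained memory.
	n := len(chunk)
	if n > p.need {
		n = p.need
	}
	p.partial.Write(chunk[:n])
	p.need -= n
	if p.need == 0 {
		return p.partial.Bytes()
	}
	return nil
}

func main() {
	p := &parser{need: 11}
	first := p.feed([]byte("hello ")) // only 6 of 11 bytes have arrived
	fmt.Println(first == nil)         // true
	second := p.feed([]byte("world"))
	fmt.Println(string(second)) // hello world
}
```

The fast path is what the contiguous PUB benchmark exercises; the slow path is the 176 B/op case.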
Core NATS — Pub/Sub Throughput
Single Publisher (no subscribers)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
| --- | --- | --- | --- | --- | --- |
| 16 B | 2,252,242 | 34.4 | 1,610,807 | 24.6 | 0.72x |
| 128 B | 2,199,267 | 268.5 | 1,661,014 | 202.8 | 0.76x |
Publisher + Subscriber (1:1)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
| --- | --- | --- | --- | --- | --- |
| 16 B | 313,790 | 4.8 | 909,298 | 13.9 | 2.90x |
| 16 KB | 41,153 | 643.0 | 38,287 | 598.2 | 0.93x |
Fan-Out (1 Publisher : 4 Subscribers)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
| --- | --- | --- | --- | --- | --- |
| 128 B | 3,217,684 | 392.8 | 1,817,860 | 221.9 | 0.57x |
Multi-Publisher / Multi-Subscriber (4P x 4S)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
| --- | --- | --- | --- | --- | --- |
| 128 B | 2,101,337 | 256.5 | 1,527,330 | 186.4 | 0.73x |
Core NATS — Request/Reply Latency
Single Client, Single Service
| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 128 B | 9,450 | 7,662 | 0.81x | 103.2 | 128.9 | 145.6 | 170.8 |
10 Clients, 2 Services (Queue Group)
| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 16 B | 31,094 | 26,144 | 0.84x | 316.9 | 368.7 | 439.2 | 559.7 |
JetStream — Publication
| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
| --- | --- | --- | --- | --- | --- |
| Synchronous | 16 B | Memory | 17,533 | 14,373 | 0.82x |
| Async (batch) | 128 B | File | 198,237 | 60,416 | 0.30x |
Note: Async file store publish improved from 174 msg/s to 60K msg/s (347x improvement) after two rounds of FileStore-level optimizations plus profiling overhead removal. Remaining 3.3x gap is GC pressure from per-message allocations.
JetStream — Consumption
| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
| --- | --- | --- | --- |
| Ordered ephemeral consumer | 748,671 | 114,021 | 0.15x |
| Durable consumer fetch | 662,471 | 488,520 | 0.74x |
Note: Durable fetch improved from 0.13x → 0.60x → 0.74x after Round 6 optimizations (batch flush, ackReply stack formatting, cached CompiledFilter, pooled fetch list). Ordered consumer ratio dropped due to Go benchmark improvement (748K vs 156K in earlier runs); .NET throughput is stable at ~110K msg/s.
Summary
| Category | Ratio Range | Assessment |
| --- | --- | --- |
| Pub-only throughput | 0.72x–0.76x | Good — within 2x |
| Pub/sub (small payload) | 2.90x | .NET outperforms Go — direct buffer path eliminates all per-message overhead |
| Pub/sub (large payload) | 0.93x | Near parity |
| Fan-out | 0.57x | Improved from 0.18x → 0.44x → 0.66x; batch flush applied but serial delivery remains |
| Multi pub/sub | 0.73x | Improved from 0.49x → 0.84x; variance from system load |
| Request/reply latency | 0.81x–0.84x | Good — improved from 0.77x |
| JetStream sync publish | 0.82x | Good |
| JetStream async file publish | 0.30x | Improved from 0.00x — storage write path dominates |
| JetStream ordered consume | 0.15x | .NET stable ~110K; Go variance high (156K–749K) |
| JetStream durable fetch | 0.74x | Improved from 0.60x — batch flush + ackReply optimization |
Key Observations
- Small-payload 1:1 pub/sub outperforms Go by ~3x (909K vs 314K msg/s). The per-client direct write buffer with stackalloc header formatting eliminates all per-message heap allocations and channel overhead.
- Durable consumer fetch improved to 0.74x (489K vs 662K msg/s) — Round 6 batch flush signaling and string.Create-based ack reply formatting reduced per-message overhead significantly.
- Fan-out holds at ~0.57x despite batch flush optimization. The remaining gap is goroutine-level parallelism (Go fans out per-client via goroutines; .NET delivers serially). The batch flush reduces wakeup overhead but doesn't add concurrency.
- Request/reply improved to 0.81x–0.84x — deferred flush benefits single-message delivery paths too.
- JetStream file store async publish: 0.30x — remaining gap is GC pressure from per-message StoredMessage objects and byte[] copies (Change 2 deferred due to scope: 80+ sites in FileStore.cs need migration).
- JetStream ordered consumer: 0.15x — ratio drop is due to Go benchmark variance (749K in this run vs 156K previously); .NET throughput stable at ~110K msg/s. Further investigation needed for the Go variability.
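The "one flush signal per unique client" idea behind the batch flush can be sketched in Go (this mirrors the pcd pattern the server references; the types and method names here are illustrative, not the actual implementation):

```go
package main

import "fmt"

// natsClient stands in for a connected client; signals counts how many
// flush wakeups its write loop received.
type natsClient struct {
	name    string
	signals int
}

func (c *natsClient) sendNoFlush(msg []byte) { /* append msg to c's outbound buffer */ }
func (c *natsClient) signalFlush()           { c.signals++ }

// fanOut delivers one message to every matching subscription, collecting
// the unique clients it touched, then signals each client exactly once.
// Without this, each delivery would wake the client's write loop.
func fanOut(msg []byte, subs []*natsClient) {
	touched := make(map[*natsClient]struct{}, len(subs))
	for _, c := range subs {
		c.sendNoFlush(msg)
		touched[c] = struct{}{}
	}
	for c := range touched { // N deliveries, unique-client-count signals
		c.signalFlush()
	}
}

func main() {
	a, b := &natsClient{name: "a"}, &natsClient{name: "b"}
	subs := []*natsClient{a, b, a, a} // client a has three matching subscriptions
	fanOut([]byte("MSG x 1 2\r\nhi\r\n"), subs)
	fmt.Println(a.signals, b.signals) // 1 1
}
```

This reduces wakeups, but as noted above it does not add delivery concurrency, which is why the fan-out ratio still trails Go.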
Optimization History
Round 6: Batch Flush Signaling + Fetch Optimizations
Four optimizations targeting fan-out and consumer fetch hot paths:
| # | Root Cause | Fix | Impact |
| --- | --- | --- | --- |
| 20 | Per-subscriber flush signal in fan-out — each SendMessage called _flushSignal.Writer.TryWrite(0) independently; for 1:4 fan-out, 4 channel writes + 4 write-loop wakeups per published message | Split SendMessage into SendMessageNoFlush + SignalFlush; ProcessMessage collects unique clients in [ThreadStatic] HashSet<INatsClient> (Go's pcd pattern), one flush signal per unique client after fan-out | Reduces channel writes from N to unique-client-count per publish |
| 21 | Per-fetch CompiledFilter allocation — CompiledFilter.FromConfig(consumer.Config) called on every fetch request, allocating a new filter object each time | Cached CompiledFilter on ConsumerHandle with staleness detection (reference + value check on filter config fields); reused across fetches | Eliminates per-fetch filter allocation |
| 22 | Per-message string interpolation in ack reply — $"$JS.ACK.{stream}.{consumer}.1.{seq}.{deliverySeq}.{ts}.{pending}" allocated intermediate strings and boxed numeric types on every delivery | Pre-compute $"$JS.ACK.{stream}.{consumer}.1." prefix before loop; use stackalloc char[] + TryFormat for numeric suffix — zero intermediate allocations | Eliminates 4+ string allocs per delivered message |
| 23 | Per-fetch List<StoredMessage> allocation — new List<StoredMessage>(batch) allocated on every FetchAsync call | [ThreadStatic] reusable list with .Clear() + capacity growth; PullFetchBatch snapshots via .ToArray() for safe handoff | Eliminates per-fetch list allocation |
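Fix #22 (prefix hoisting plus allocation-free numeric formatting) translates naturally to Go for illustration; the prefix and the reused buffer below are hypothetical stand-ins, not the actual server code:

```go
package main

import (
	"fmt"
	"strconv"
)

// ackReply builds a $JS.ACK reply subject by appending numeric fields to a
// prefix that was computed once, outside the delivery loop. Reusing buf and
// using strconv.AppendUint avoids per-message string interpolation, the Go
// analog of the stackalloc char[] + TryFormat fix.
func ackReply(buf []byte, prefix string, seq, dseq, ts, pending uint64) []byte {
	buf = append(buf[:0], prefix...)
	buf = strconv.AppendUint(buf, seq, 10)
	buf = append(buf, '.')
	buf = strconv.AppendUint(buf, dseq, 10)
	buf = append(buf, '.')
	buf = strconv.AppendUint(buf, ts, 10)
	buf = append(buf, '.')
	buf = strconv.AppendUint(buf, pending, 10)
	return buf
}

func main() {
	prefix := "$JS.ACK.ORDERS.pull-1.1." // hoisted out of the delivery loop
	buf := make([]byte, 0, 64)           // reused across messages
	for seq := uint64(1); seq <= 3; seq++ {
		buf = ackReply(buf, prefix, seq, seq, 1700000000, 3-seq)
		fmt.Println(string(buf))
	}
}
```

After the first message the buffer is at capacity, so subsequent replies involve no allocation at all.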
Round 5: Non-blocking ConsumeAsync (ordered + durable consumers)
One root cause was identified and fixed in the MSG.NEXT request handling path:
| # | Root Cause | Fix | Impact |
| --- | --- | --- | --- |
| 19 | Synchronous blocking in DeliverPullFetchMessages — FetchAsync(...).GetAwaiter().GetResult() blocked the client's read loop for the full expires timeout (30s). With batch=1000 and only 5 messages available, the fetch polled for message 6 indefinitely. No messages were delivered until the timeout fired, causing the client to receive 0 messages before its own timeout. | Split into two paths: noWait/no-expires uses synchronous fetch (existing behavior for FetchAsync client); expires > 0 spawns DeliverPullFetchMessagesAsync background task that delivers messages incrementally without blocking the read loop, with idle heartbeat support | Enables ConsumeAsync for both ordered and durable consumers; ordered consumer: 99K msg/s (0.64x Go) |
Round 4: Per-Client Direct Write Buffer (pub/sub + fan-out + multi pub/sub)
Four optimizations were implemented in the message delivery hot path:
| # | Root Cause | Fix | Impact |
| --- | --- | --- | --- |
| 15 | Per-message channel overhead — each SendMessage call went through Channel<OutboundData>.TryWrite, incurring lock contention and memory barriers | Replaced channel-based message delivery with per-client _directBuf byte array under SpinLock; messages written directly to contiguous buffer | Eliminates channel overhead per delivery |
| 16 | Per-message heap allocation for MSG header — _outboundBufferPool.RentBuffer() allocated a pooled byte[] for each MSG header | Replaced with stackalloc byte[512] — MSG header formatted entirely on the stack, then copied into _directBuf | Zero heap allocations per delivery |
| 17 | Per-message socket write — write loop issued one SendAsync per channel item, even with coalescing | Double-buffer swap: write loop swaps _directBuf ↔ _writeBuf under SpinLock, then writes the entire batch in a single SendAsync; zero allocation on swap | Single syscall per batch, zero-copy buffer reuse |
| 18 | Separate wake channels — SendMessage and WriteProtocol used different signaling paths | Unified on _flushSignal channel (bounded capacity 1, DropWrite); both paths signal the same channel, write loop drains both _directBuf and _outbound on each wake | Single wait point, no missed wakes |
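The double-buffer swap of fix #17 is compact enough to sketch in Go (illustrative names; a mutex stands in for the SpinLock, and the actual socket write is elided):

```go
package main

import (
	"fmt"
	"sync"
)

// client holds two buffers: publishers append into directBuf under the
// lock, and the write loop swaps directBuf with writeBuf so it can issue
// one socket write for the whole batch while publishers keep filling the
// other buffer. No allocation happens on the swap; the buffers alternate.
type client struct {
	mu        sync.Mutex
	directBuf []byte // filled by send callers
	writeBuf  []byte // drained by the write loop
}

func (c *client) send(msg []byte) {
	c.mu.Lock()
	c.directBuf = append(c.directBuf, msg...)
	c.mu.Unlock()
}

// swap exchanges the two buffers and returns the batch to write; the
// returned slice's capacity is reused on the next swap.
func (c *client) swap() []byte {
	c.mu.Lock()
	batch := c.directBuf
	c.directBuf = c.writeBuf[:0]
	c.writeBuf = batch
	c.mu.Unlock()
	return batch
}

func main() {
	c := &client{}
	c.send([]byte("MSG a 1 2\r\nhi\r\n"))
	c.send([]byte("MSG b 1 3\r\nhey\r\n"))
	batch := c.swap() // one socket write would now cover both messages
	fmt.Printf("batch of %d bytes\n", len(batch))
}
```

The write loop would issue its single write on `batch` outside the lock, which is what keeps publishers from stalling behind the syscall.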
Round 3: Outbound Write Path (pub/sub + fan-out + fetch)
Three root causes were identified and fixed in the message delivery hot path:
| # | Root Cause | Fix | Impact |
| --- | --- | --- | --- |
| 12 | Per-message .ToArray() allocation in SendMessage — owner.Memory[..pos].ToArray() created a new byte[] for every MSG delivered to every subscriber | Replaced IMemoryOwner rent/copy/dispose with direct byte[] from pool; write loop returns buffers after writing | Eliminates 1 heap alloc per delivery (4 per fan-out message) |
| 13 | Per-message WriteAsync in write loop — each queued message triggered a separate _stream.WriteAsync() system call | Added 64KB coalesce buffer; drain all pending messages into contiguous buffer, single WriteAsync per batch | Reduces syscalls from N to 1 per batch |
| 14 | Profiling Stopwatch on every message — Stopwatch.StartNew() ran unconditionally in ProcessMessage and StreamManager.Capture even for non-JetStream messages | Removed profiling instrumentation from hot path | Eliminates ~200ns overhead per message |
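The coalescing drain of fix #13 can be sketched in Go (the channel stands in for the outbound queue and the buffer for the 64KB coalesce buffer; names are illustrative):

```go
package main

import "fmt"

// drainBatch pulls everything currently queued into one contiguous buffer
// so the caller can issue a single socket write per batch instead of one
// write per message. The non-blocking select empties the queue without
// waiting for new messages.
func drainBatch(pending chan []byte, buf []byte) []byte {
	buf = buf[:0]
	for {
		select {
		case msg := <-pending:
			buf = append(buf, msg...)
		default:
			return buf // queue empty: flush what we coalesced
		}
	}
}

func main() {
	pending := make(chan []byte, 16)
	pending <- []byte("PING\r\n")
	pending <- []byte("PONG\r\n")
	buf := make([]byte, 0, 64*1024) // reused coalesce buffer, one per write loop
	batch := drainBatch(pending, buf)
	fmt.Printf("one write of %d bytes instead of 2\n", len(batch))
}
```

A real write loop would block on a wake signal first, then drain and write; only the drain step is shown here.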
Round 2: FileStore AppendAsync Hot Path
| # | Root Cause | Fix | Impact |
| --- | --- | --- | --- |
| 6 | Async state machine overhead — AppendAsync was async ValueTask<ulong> but never actually awaited | Changed to synchronous ValueTask<ulong> returning ValueTask.FromResult(_last) | Eliminates Task state machine allocation |
| 7 | Double payload copy — TransformForPersist allocated byte[] then payload.ToArray() created second copy for StoredMessage | Reuse TransformForPersist result directly for StoredMessage.Payload when no transform needed (_noTransform flag) | Eliminates 1 byte[] alloc per message |
| 8 | Unnecessary TTL work per publish — ExpireFromWheel() and RegisterTtl() called on every write even when MaxAge=0 | Guarded both with _options.MaxAgeMs > 0 check (matches Go: filestore.go:4701) | Eliminates hash wheel overhead when TTL not configured |
| 9 | Per-message MsgBlock cache allocation — WriteAt created new MessageRecord for _cache on every write | Removed eager cache population; reads now decode from pending buffer or disk | Eliminates 1 object alloc per message |
| 10 | Contiguous write buffer — MsgBlock._pendingWrites was List<byte[]> with per-message byte[] allocations | Replaced with single contiguous _pendingBuf byte array; MessageRecord.EncodeTo writes directly into it | Eliminates per-message byte[] encoding alloc; single RandomAccess.Write per flush |
| 11 | Pending buffer read path — MsgBlock.Read() flushed pending writes to disk before reading | Added in-memory read from _pendingBuf when data is still in the buffer | Avoids unnecessary disk flush on read-after-write |
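Fixes #10 and #11 combine into one pattern: encode records into a single contiguous pending buffer and serve read-after-write from it. A simplified Go sketch (the record layout and field names are placeholders, not the real MsgBlock):

```go
package main

import "fmt"

// msgBlock keeps unflushed records in one contiguous buffer (fix #10) and
// remembers each record's offset so reads can be served from memory before
// any disk flush happens (fix #11).
type msgBlock struct {
	pendingBuf []byte         // contiguous encode target, one disk write per flush
	offsets    map[uint64]int // seq -> record offset in pendingBuf
	lens       map[uint64]int // seq -> record length
}

func newMsgBlock() *msgBlock {
	return &msgBlock{offsets: map[uint64]int{}, lens: map[uint64]int{}}
}

// append encodes a record directly into the pending buffer; no per-message
// byte slice is allocated.
func (b *msgBlock) append(seq uint64, rec []byte) {
	b.offsets[seq] = len(b.pendingBuf)
	b.lens[seq] = len(rec)
	b.pendingBuf = append(b.pendingBuf, rec...)
}

// read serves from the in-memory pending buffer when possible; a real
// implementation would fall back to disk once the buffer has been flushed.
func (b *msgBlock) read(seq uint64) ([]byte, bool) {
	off, ok := b.offsets[seq]
	if !ok {
		return nil, false
	}
	return b.pendingBuf[off : off+b.lens[seq]], true
}

func main() {
	b := newMsgBlock()
	b.append(1, []byte("rec-1"))
	b.append(2, []byte("rec-2"))
	if rec, ok := b.read(2); ok {
		fmt.Println(string(rec)) // read-after-write without a disk flush
	}
}
```

Flushing the whole `pendingBuf` in one call is what collapses N per-message writes into a single write per flush.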
Round 1: FileStore/StreamManager Layer
| # | Root Cause | Fix | Impact |
| --- | --- | --- | --- |
| 1 | Per-message synchronous disk I/O — MsgBlock.WriteAt() called RandomAccess.Write() on every message | Added write buffering in MsgBlock + background flush loop in FileStore (Go's flushLoop pattern: coalesce 16KB or 8ms) | Eliminates per-message syscall overhead |
| 2 | O(n) GetStateAsync per publish — _messages.Keys.Min() and _messages.Values.Sum() on every publish for MaxMsgs/MaxBytes checks | Added incremental _messageCount, _totalBytes, _firstSeq fields updated in all mutation paths; GetStateAsync is now O(1) | Eliminates O(n) scan per publish |
| 3 | Unnecessary LoadAsync after every append — StreamManager.Capture reloaded the just-stored message even when no mirrors/sources were configured | Made LoadAsync conditional on mirror/source replication being configured | Eliminates redundant disk read per publish |
| 4 | Redundant PruneExpiredMessages per publish — called before every publish even when MaxAge=0, and again inside EnforceRuntimePolicies | Guarded with MaxAgeMs > 0 check; removed the pre-publish call (background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | PrunePerSubject loading all messages per publish — EnforceRuntimePolicies → PrunePerSubject called ListAsync().GroupBy() even when MaxMsgsPer=0 | Guarded with MaxMsgsPer > 0 check | Eliminates O(n) scan per publish |
Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.
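The incremental counters behind fix #2 are worth a concrete sketch: instead of scanning every stored message on each publish, the store maintains running state updated in each mutation path, making limit checks O(1). This Go version is illustrative only (field names are hypothetical analogs of _messageCount, _totalBytes, _firstSeq):

```go
package main

import "fmt"

// streamState carries the incremental counters so MaxMsgs/MaxBytes checks
// never scan the message map.
type streamState struct {
	msgCount   uint64
	totalBytes uint64
	firstSeq   uint64
	lastSeq    uint64
}

// add is called in the store's append path.
func (s *streamState) add(size uint64) {
	s.lastSeq++
	if s.msgCount == 0 {
		s.firstSeq = s.lastSeq
	}
	s.msgCount++
	s.totalBytes += size
}

// removeFirst is called when the oldest message is evicted.
func (s *streamState) removeFirst(size uint64) {
	s.msgCount--
	s.totalBytes -= size
	s.firstSeq++
}

// overLimit is the O(1) check that used to require Keys.Min()/Values.Sum().
func (s *streamState) overLimit(maxMsgs, maxBytes uint64) bool {
	return s.msgCount > maxMsgs || s.totalBytes > maxBytes
}

func main() {
	var s streamState
	for i := 0; i < 5; i++ {
		s.add(128)
		for s.overLimit(3, 1<<20) {
			s.removeFirst(128) // enforce MaxMsgs=3 without a scan
		}
	}
	fmt.Println(s.msgCount, s.firstSeq, s.lastSeq) // 3 3 5
}
```

Every mutation path must update these fields consistently, which is the main cost of the optimization: one missed path and the cached state drifts from the real message set.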
What would further close the gap
| Change | Expected Impact | Go Reference |
| --- | --- | --- |
| Fan-out parallelism | Deliver to subscribers concurrently instead of serially from publisher's read loop | Go: processMsgResults fans out per-client via goroutines |
| Eliminate per-message GC allocations in FileStore | ~30% improvement on FileStore AppendAsync — replace StoredMessage class with StoredMessageMeta struct in _messages dict, reconstruct full message from MsgBlock on read | Go stores in cache.buf/cache.idx with zero per-message allocs; 80+ sites in FileStore.cs need migration |
| Ordered consumer delivery optimization | Investigate .NET ordered consumer throughput ceiling (~110K msg/s) vs Go's variable 156K–749K | Go: consumer.go ordered consumer fast path |