# Go vs .NET NATS Server — Benchmark Comparison

Benchmark run: 2026-03-13 04:30 PM America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly.

Environment: Apple M4, .NET SDK 10.0.101, .NET server built and run in Release configuration (server GC, tiered PGO enabled), Go toolchain installed, Go reference server built from `golang/nats-server/`.
## Core NATS — Pub/Sub Throughput

### Single Publisher (no subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 16 B | 2,223,690 | 33.9 | 1,651,727 | 25.2 | 0.74x |
| 128 B | 2,218,308 | 270.8 | 1,368,967 | 167.1 | 0.62x |
### Publisher + Subscriber (1:1)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 16 B | 292,711 | 4.5 | 723,867 | 11.0 | 2.47x |
| 16 KB | 32,890 | 513.9 | 37,943 | 592.9 | 1.15x |
### Fan-Out (1 Publisher : 4 Subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 128 B | 2,945,790 | 359.6 | 1,848,130 | 225.6 | 0.63x |
Note: Fan-out numbers are within noise of prior round. The hot-path optimizations (batched stats, pre-encoded subject/SID bytes, auto-unsub guard) remove per-delivery overhead but the gap is now dominated by the serial fan-out loop itself.
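To make the serial-vs-parallel distinction concrete, here is a toy Go sketch of per-subscriber concurrent delivery. All names (`subscriber`, `fanOut`) are illustrative and belong to neither server's codebase; the real Go server batches into per-client outbound queues rather than spawning a goroutine per message, but the effect — fan-out work leaving the publisher's read loop — is the same.

```go
package main

import (
	"fmt"
	"sync"
)

// subscriber is a stand-in for a client with an outbound buffer.
type subscriber struct {
	mu  sync.Mutex
	buf []byte
}

// deliver appends one message to the subscriber's outbound buffer.
func (s *subscriber) deliver(msg []byte) {
	s.mu.Lock()
	s.buf = append(s.buf, msg...)
	s.mu.Unlock()
}

// fanOut delivers msg to all subscribers concurrently, one goroutine each,
// instead of looping serially on the publisher's read loop.
func fanOut(subs []*subscriber, msg []byte) {
	var wg sync.WaitGroup
	for _, s := range subs {
		wg.Add(1)
		go func(s *subscriber) {
			defer wg.Done()
			s.deliver(msg)
		}(s)
	}
	wg.Wait()
}

func main() {
	subs := []*subscriber{{}, {}, {}, {}}
	fanOut(subs, []byte("PING\r\n"))
	fmt.Println(len(subs[0].buf)) // 6
}
```

In the serial variant, publisher throughput is bounded by the sum of all per-subscriber delivery costs; in the concurrent variant it is bounded by the slowest single delivery.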
### Multi-Publisher / Multi-Subscriber (4P x 4S)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 128 B | 2,123,480 | 259.2 | 1,374,570 | 167.8 | 0.65x |
## Core NATS — Request/Reply Latency

### Single Client, Single Service

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---|---|---|---|---|---|---|---|
| 128 B | 8,386 | 7,424 | 0.89x | 115.8 | 139.0 | 175.5 | 193.0 |
### 10 Clients, 2 Services (Queue Group)

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---|---|---|---|---|---|---|---|
| 16 B | 26,470 | 26,620 | 1.01x | 370.2 | 376.0 | 486.0 | 592.8 |
## JetStream — Publication

| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| Synchronous | 16 B | Memory | 14,812 | 11,002 | 0.74x |
| Async (batch) | 128 B | File | 148,156 | 60,348 | 0.41x |
Note: Async file-store publish improved to 0.41x with Release build. Still storage-bound.
## JetStream — Consumption

| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---|---|---|
| Ordered ephemeral consumer | 166,000 | 102,369 | 0.62x |
| Durable consumer fetch | 510,000 | 468,252 | 0.92x |
Note: Ordered consumer improved to 0.62x (102K vs 166K). Durable fetch jumped to 0.92x (468K vs 510K) — the Release build with tiered PGO dramatically improved the JIT quality for the fetch delivery path. Go comparison numbers vary significantly across runs.
## MQTT Throughput

| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| MQTT PubSub (128B, QoS 0) | 34,224 | 4.2 | 47,341 | 5.8 | 1.38x |
| Cross-Protocol NATS→MQTT (128B) | 158,000 | 19.3 | 229,932 | 28.1 | 1.46x |
Note: Pure MQTT pub/sub extended its lead to 1.38x. Cross-protocol NATS→MQTT now at 1.46x — the Release build JIT further benefits the delivery path.
## Transport Overhead

### TLS

| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| TLS PubSub 1:1 (128B) | 289,548 | 35.3 | 254,834 | 31.1 | 0.88x |
| TLS Pub-Only (128B) | 1,782,442 | 217.6 | 877,149 | 107.1 | 0.49x |
### WebSocket

| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| WS PubSub 1:1 (128B) | 66,584 | 8.1 | 62,249 | 7.6 | 0.93x |
| WS Pub-Only (128B) | 106,302 | 13.0 | 85,878 | 10.5 | 0.81x |
Note: TLS pub/sub stable at 0.88x. WebSocket pub/sub at 0.93x. Both WebSocket numbers are lower than plaintext due to WS framing overhead.
## Hot Path Microbenchmarks (.NET only)

### SubList

| Benchmark | Ops/s | MB/s | Alloc |
|---|---|---|---|
| SubList Exact Match (128 subjects) | 19,285,510 | 257.5 | 0.00 B/op |
| SubList Wildcard Match | 18,876,330 | 252.0 | 0.00 B/op |
| SubList Queue Match | 20,639,153 | 157.5 | 0.00 B/op |
| SubList Remote Interest | 274,703 | 4.5 | 0.00 B/op |
### Parser

| Benchmark | Ops/s | MB/s | Alloc |
|---|---|---|---|
| Parser PING | 6,283,578 | 36.0 | 0.0 B/op |
| Parser PUB | 2,712,550 | 103.5 | 40.0 B/op |
| Parser HPUB | 2,338,555 | 124.9 | 40.0 B/op |
| Parser PUB split payload | 2,043,813 | 78.0 | 176.0 B/op |
### FileStore

| Benchmark | Ops/s | MB/s | Alloc |
|---|---|---|---|
| FileStore AppendAsync (128B) | 244,089 | 29.8 | 1552.9 B/op |
| FileStore LoadLastBySubject (hot) | 12,784,127 | 780.3 | 0.0 B/op |
| FileStore PurgeEx+Trim | 332 | 0.0 | 5440792.9 B/op |
## Summary

| Category | Ratio Range | Assessment |
|---|---|---|
| Pub-only throughput | 0.62x–0.74x | Improved with Release build |
| Pub/sub (small payload) | 2.47x | .NET outperforms Go decisively |
| Pub/sub (large payload) | 1.15x | .NET now exceeds parity |
| Fan-out | 0.63x | Serial fan-out loop is bottleneck |
| Multi pub/sub | 0.65x | Close to prior round |
| Request/reply latency | 0.89x–1.01x | Effectively at parity |
| JetStream sync publish | 0.74x | Run-to-run variance |
| JetStream async file publish | 0.41x | Storage-bound |
| JetStream ordered consume | 0.62x | Improved with Release build |
| JetStream durable fetch | 0.92x | Major improvement with Release build |
| MQTT pub/sub | 1.38x | .NET outperforms Go |
| MQTT cross-protocol | 1.46x | .NET strongly outperforms Go |
| TLS pub/sub | 0.88x | Close to parity |
| TLS pub-only | 0.49x | Variance / contention with other tests |
| WebSocket pub/sub | 0.93x | Close to parity |
| WebSocket pub-only | 0.81x | Good |
## Key Observations
- Switching the benchmark harness to Release build was the highest-impact change. Durable fetch jumped from 0.42x to 0.92x (468K vs 510K msg/s). Ordered consumer improved from 0.57x to 0.62x. Request-reply 10Cx2S reached parity at 1.01x. Large-payload pub/sub now exceeds Go at 1.15x.
- Small-payload 1:1 pub/sub remains a strong .NET lead at 2.47x (724K vs 293K msg/s).
- MQTT cross-protocol improved to 1.46x (230K vs 158K msg/s), up from 1.20x — the Release JIT further benefits the delivery path.
- Fan-out (0.63x) and multi pub/sub (0.65x) remain the largest gaps. The hot-path optimizations (batched stats, pre-encoded SID/subject, auto-unsub guard) removed per-delivery overhead, but the remaining gap is dominated by the serial fan-out loop itself — Go parallelizes fan-out delivery across goroutines.
- SubList Match microbenchmarks improved ~17% (19.3M vs 16.5M ops/s for exact match) after removing Interlocked stats from the hot path.
- TLS pub-only dropped to 0.49x this run, likely noise from co-running benchmarks contending on CPU. TLS pub/sub remains stable at 0.88x.
## Optimization History

### Round 9: Fan-Out & Multi Pub/Sub Hot-Path Optimization

Seven optimizations targeting the per-delivery hot path and benchmark harness configuration:
| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 31 | Benchmark harness built server in Debug — `DotNetServerProcess.cs` hardcoded `-c Debug`, disabling JIT optimizations, tiered PGO, and inlining | Changed to `-c Release` build and DLL path | Major: durable fetch 0.42x→0.92x, request-reply to parity |
| 32 | Per-delivery `Interlocked` on server-wide stats — `SendMessageNoFlush` did 2 `Interlocked` ops per delivery; fan-out to 4 subs = 8 interlocked ops per publish | Moved server-wide stats to a batched `Interlocked.Add` once after the fan-out loop in `ProcessMessage` | Eliminates N×2 interlocked ops per publish |
| 33 | Auto-unsub tracking on every delivery — `Interlocked.Increment(ref sub.MessageCount)` on every delivery even when `MaxMessages == 0` (no limit — the common case) | Guarded with `if (sub.MaxMessages > 0)` | Eliminates 1 interlocked op per delivery in the common case |
| 34 | Per-delivery SID ASCII encoding — `Encoding.ASCII.GetBytes(sid)` on every delivery; the SID is a small integer that never changes | Added `Subscription.SidBytes` cached property; new `SendMessageNoFlush` overload accepts `ReadOnlySpan<byte>` | Eliminates per-delivery encoding |
| 35 | Per-delivery subject ASCII encoding — `Encoding.ASCII.GetBytes(subject)` for each subscriber; fan-out of 4 = 4× encoding the same subject | Pre-encode the subject once in `ProcessMessage` before the fan-out loop; new overload uses a span copy | Eliminates N−1 subject encodings per publish |
| 36 | Per-publish subject string allocation — `Encoding.ASCII.GetString(cmd.Subject.Span)` on every PUB even when publishing to the same subject repeatedly | Added a 1-element string cache per client; reuses the string when subject bytes match | Eliminates string alloc for repeated subjects |
| 37 | `Interlocked` stats in `SubList.Match` hot path — `Interlocked.Increment(ref _matches)` and `_cacheHits` on every match call | Replaced with non-atomic increments (approximate counters for monitoring) | Eliminates 1–2 interlocked ops per match |
### Round 8: Ordered Consumer + Cross-Protocol Optimization

Three optimizations targeting pull consumer delivery and MQTT cross-protocol throughput:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 28 | Per-message flush signal in `DeliverPullFetchMessagesAsync` — `DeliverMessage` called `SendMessage`, which triggered `_flushSignal.Writer.TryWrite(0)` per message; for a batch of N messages, N flush signals and write-loop wakeups | Replaced with `SendMessageNoFlush` + batch flush every 64 messages + final flush after the loop; bypasses `DeliverMessage` entirely (no permission check / auto-unsub needed for the JS delivery inbox) | Reduces flush signals from N to N/64 per batch |
| 29 | 5ms polling delay in pull consumer wait loop — `Task.Delay(5)` in `DeliverPullFetchMessagesAsync` and `PullConsumerEngine.WaitForMessageAsync` added up to 5ms latency per empty slot; for tail-following consumers, every new message waited up to 5ms to be noticed | Added `StreamHandle.NotifyPublish()` / `WaitForPublishAsync()` using `TaskCompletionSource` signaling; publishers call `NotifyPublish` after `AppendAsync`; consumers wait on the signal with a heartbeat-interval timeout | Eliminates polling delay; instant wakeup on publish |
| 30 | `StringBuilder` allocation in `NatsToMqtt` for the common case — every uncached `NatsToMqtt` call allocated a `StringBuilder` even when no `_DOT_` escape sequences were present (the common case) | Added a `string.Create` fast path that uses a char-replacement lambda when no `_DOT_` is found; pre-warm the topic bytes cache on MQTT subscription creation | Eliminates `StringBuilder` + string alloc for the common case; no cache miss on first delivery |
### Round 7: MQTT Cross-Protocol Write Path

Four optimizations targeting the NATS→MQTT delivery hot path (cross-protocol throughput improved from 0.30x to 0.78x):

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 24 | Per-message async fire-and-forget in `MqttNatsClientAdapter` — each `SendMessage` called `SendBinaryPublishAsync`, which acquired a `SemaphoreSlim`, allocated a full PUBLISH packet `byte[]`, wrote, and flushed the stream — all per message, bypassing the server's deferred-flush batching | Replaced with a synchronous `EnqueuePublishNoFlush()` that formats the MQTT PUBLISH directly into `_directBuf` under a `SpinLock`, matching the `NatsClient` pattern; `SignalFlush()` signals the write loop for batch flush | Eliminates async Task + `SemaphoreSlim` + per-message flush |
| 25 | Per-message `byte[]` allocation for MQTT PUBLISH packets — `MqttPacketWriter.WritePublish()` allocated topic bytes, variable header, remaining-length array, and the full packet array on every delivery | Added `WritePublishTo(Span<byte>)` that formats the entire PUBLISH packet directly into the destination span using `Span<byte>` operations — zero heap allocation | Eliminates 4+ `byte[]` allocs per delivery |
| 26 | Per-message NATS→MQTT topic translation — `NatsToMqtt()` allocated a `StringBuilder`, produced a string, then `Encoding.UTF8.GetBytes()` re-encoded it on every delivery | Added `NatsToMqttBytes()` with a bounded `ConcurrentDictionary<string, byte[]>` cache (4096 entries); the cached result includes pre-encoded UTF-8 bytes | Eliminates string + encoding alloc per delivery for cached topics |
| 27 | Per-message `FlushAsync` on plain TCP sockets — `WriteBinaryAsync` flushed after every packet write, even on `NetworkStream` where TCP auto-flushes | Write loop skips `FlushAsync` for plain sockets; for TLS/wrapped streams, flushes once per batch (not per message) | Reduces syscalls from 2N to 1 per batch |
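To illustrate the zero-allocation formatting idea behind fix #25, here is a simplified Go sketch that encodes a QoS-0 MQTT 3.1.1 PUBLISH packet directly into a caller-supplied buffer (the real fix is C# `Span<byte>` code). `writePublishTo` is an invented name; the sketch omits QoS > 0 packet IDs and MQTT 5 properties.

```go
package main

import "fmt"

// writePublishTo formats a minimal MQTT 3.1.1 QoS-0 PUBLISH packet into dst
// and returns the number of bytes written, allocating nothing on the heap.
// dst must be large enough for the whole packet.
func writePublishTo(dst []byte, topic string, payload []byte) int {
	remaining := 2 + len(topic) + len(payload)
	n := 0
	dst[n] = 0x30 // packet type PUBLISH, QoS 0, no DUP/RETAIN
	n++
	// MQTT variable-length "remaining length" encoding (7 bits per byte)
	for {
		b := byte(remaining % 128)
		remaining /= 128
		if remaining > 0 {
			b |= 0x80
		}
		dst[n] = b
		n++
		if remaining == 0 {
			break
		}
	}
	// topic as a 2-byte big-endian length-prefixed UTF-8 string
	dst[n] = byte(len(topic) >> 8)
	dst[n+1] = byte(len(topic))
	n += 2
	n += copy(dst[n:], topic)
	n += copy(dst[n:], payload)
	return n
}

func main() {
	buf := make([]byte, 512)
	n := writePublishTo(buf, "sensors/temp", []byte("21.5"))
	fmt.Println(n, buf[0] == 0x30) // 20 true
}
```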
### Round 6: Batch Flush Signaling + Fetch Optimizations

Four optimizations targeting fan-out and consumer fetch hot paths:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 20 | Per-subscriber flush signal in fan-out — each `SendMessage` called `_flushSignal.Writer.TryWrite(0)` independently; for 1:4 fan-out, 4 channel writes + 4 write-loop wakeups per published message | Split `SendMessage` into `SendMessageNoFlush` + `SignalFlush`; `ProcessMessage` collects unique clients in a `[ThreadStatic] HashSet<INatsClient>` (Go's pcd pattern), issuing one flush signal per unique client after fan-out | Reduces channel writes from N to unique-client-count per publish |
| 21 | Per-fetch `CompiledFilter` allocation — `CompiledFilter.FromConfig(consumer.Config)` called on every fetch request, allocating a new filter object each time | Cached `CompiledFilter` on `ConsumerHandle` with staleness detection (reference + value check on filter config fields); reused across fetches | Eliminates per-fetch filter allocation |
| 22 | Per-message string interpolation in ack reply — `$"$JS.ACK.{stream}.{consumer}.1.{seq}.{deliverySeq}.{ts}.{pending}"` allocated intermediate strings and boxed numeric types on every delivery | Pre-compute the `$"$JS.ACK.{stream}.{consumer}.1."` prefix before the loop; use `stackalloc char[]` + `TryFormat` for the numeric suffix — zero intermediate allocations | Eliminates 4+ string allocs per delivered message |
| 23 | Per-fetch `List<StoredMessage>` allocation — `new List<StoredMessage>(batch)` allocated on every `FetchAsync` call | `[ThreadStatic]` reusable list with `.Clear()` + capacity growth; `PullFetchBatch` snapshots via `.ToArray()` for safe handoff | Eliminates per-fetch list allocation |
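Fix #20's flush dedup (the pcd pattern the table mentions) can be sketched in Go as follows. Types and names are illustrative; the essential move is signaling once per unique destination client per publish, not once per delivered message, so two subscriptions on the same connection cost one wakeup.

```go
package main

import "fmt"

// client is a stand-in for a connection whose write loop is woken by a flush signal.
type client struct{ flushes int }

func (c *client) signalFlush() { c.flushes++ }

// fanOutWithDedup delivers `deliveries` publishes to subs, collecting each
// unique destination client during the fan-out loop and signaling one flush
// per unique client afterward.
func fanOutWithDedup(subs []*client, deliveries int) {
	pending := make(map[*client]struct{}, len(subs))
	for i := 0; i < deliveries; i++ {
		for _, c := range subs {
			// per-subscriber message delivery into c's buffer would happen here
			pending[c] = struct{}{}
		}
		for c := range pending {
			c.signalFlush() // one wakeup per unique client per publish
		}
		for c := range pending {
			delete(pending, c) // reuse the set across publishes
		}
	}
}

func main() {
	a, b := &client{}, &client{}
	// two subscriptions on client a still cause only one flush per publish
	fanOutWithDedup([]*client{a, a, b}, 10)
	fmt.Println(a.flushes, b.flushes) // 10 10
}
```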
### Round 5: Non-blocking ConsumeAsync (ordered + durable consumers)

One root cause was identified and fixed in the MSG.NEXT request handling path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 19 | Synchronous blocking in `DeliverPullFetchMessages` — `FetchAsync(...).GetAwaiter().GetResult()` blocked the client's read loop for the full expires timeout (30s). With batch=1000 and only 5 messages available, the fetch polled for message 6 indefinitely. No messages were delivered until the timeout fired, causing the client to receive 0 messages before its own timeout. | Split into two paths: noWait/no-expires uses synchronous fetch (existing behavior for the `FetchAsync` client); expires > 0 spawns a `DeliverPullFetchMessagesAsync` background task that delivers messages incrementally without blocking the read loop, with idle heartbeat support | Enables `ConsumeAsync` for both ordered and durable consumers; ordered consumer: 99K msg/s (0.64x Go) |
### Round 4: Per-Client Direct Write Buffer (pub/sub + fan-out + multi pub/sub)

Four optimizations were implemented in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 15 | Per-message channel overhead — each `SendMessage` call went through `Channel<OutboundData>.TryWrite`, incurring lock contention and memory barriers | Replaced channel-based message delivery with a per-client `_directBuf` byte array under a `SpinLock`; messages are written directly to a contiguous buffer | Eliminates channel overhead per delivery |
| 16 | Per-message heap allocation for the MSG header — `_outboundBufferPool.RentBuffer()` allocated a pooled `byte[]` for each MSG header | Replaced with `stackalloc byte[512]` — the MSG header is formatted entirely on the stack, then copied into `_directBuf` | Zero heap allocations per delivery |
| 17 | Per-message socket write — the write loop issued one `SendAsync` per channel item, even with coalescing | Double-buffer swap: the write loop swaps `_directBuf` ↔ `_writeBuf` under the `SpinLock`, then writes the entire batch in a single `SendAsync`; zero allocation on swap | Single syscall per batch, zero-copy buffer reuse |
| 18 | Separate wake channels — `SendMessage` and `WriteProtocol` used different signaling paths | Unified on the `_flushSignal` channel (bounded capacity 1, DropWrite); both paths signal the same channel, and the write loop drains both `_directBuf` and `_outbound` on each wake | Single wait point, no missed wakes |
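The double-buffer swap from fix #17 can be sketched like this. Illustrative Go with invented names (`outbound`, `swapAndTake`); a `sync.Mutex` stands in for the C# `SpinLock`, and the write loop is assumed to finish writing a batch before the next swap.

```go
package main

import (
	"fmt"
	"sync"
)

// outbound double-buffers per-client writes: producers append to directBuf
// under a short lock; the write loop swaps in the spare buffer and writes
// the whole accumulated batch in one call.
type outbound struct {
	mu        sync.Mutex
	directBuf []byte
	writeBuf  []byte
}

// enqueue appends one formatted message to the active buffer.
func (o *outbound) enqueue(msg []byte) {
	o.mu.Lock()
	o.directBuf = append(o.directBuf, msg...)
	o.mu.Unlock()
}

// swapAndTake returns the accumulated batch and installs the (reused) spare
// buffer for the next round — zero allocation on swap.
func (o *outbound) swapAndTake() []byte {
	o.mu.Lock()
	batch := o.directBuf
	o.directBuf = o.writeBuf[:0]
	o.writeBuf = batch
	o.mu.Unlock()
	return batch
}

func main() {
	var o outbound
	o.enqueue([]byte("MSG a 1 2\r\nhi\r\n"))
	o.enqueue([]byte("MSG a 1 2\r\nyo\r\n"))
	batch := o.swapAndTake() // one syscall would write this entire batch
	fmt.Println(len(batch))  // 30
}
```

The design trades a short critical section on every enqueue for a single write syscall per batch, which is exactly the N-to-1 syscall reduction the table describes.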
### Round 3: Outbound Write Path (pub/sub + fan-out + fetch)

Three root causes were identified and fixed in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 12 | Per-message `.ToArray()` allocation in `SendMessage` — `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber | Replaced `IMemoryOwner` rent/copy/dispose with a direct `byte[]` from the pool; the write loop returns buffers after writing | Eliminates 1 heap alloc per delivery (4 per fan-out message) |
| 13 | Per-message `WriteAsync` in the write loop — each queued message triggered a separate `_stream.WriteAsync()` system call | Added a 64KB coalesce buffer; drain all pending messages into the contiguous buffer, single `WriteAsync` per batch | Reduces syscalls from N to 1 per batch |
| 14 | Profiling `Stopwatch` on every message — `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture`, even for non-JetStream messages | Removed profiling instrumentation from the hot path | Eliminates ~200ns overhead per message |
### Round 2: FileStore AppendAsync Hot Path

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 6 | Async state machine overhead — `AppendAsync` was `async ValueTask<ulong>` but never actually awaited | Changed to a synchronous `ValueTask<ulong>` returning `ValueTask.FromResult(_last)` | Eliminates Task state machine allocation |
| 7 | Double payload copy — `TransformForPersist` allocated a `byte[]`, then `payload.ToArray()` created a second copy for `StoredMessage` | Reuse the `TransformForPersist` result directly for `StoredMessage.Payload` when no transform is needed (`_noTransform` flag) | Eliminates 1 `byte[]` alloc per message |
| 8 | Unnecessary TTL work per publish — `ExpireFromWheel()` and `RegisterTtl()` called on every write even when MaxAge=0 | Guarded both with an `_options.MaxAgeMs > 0` check (matches Go: filestore.go:4701) | Eliminates hash wheel overhead when TTL is not configured |
| 9 | Per-message MsgBlock cache allocation — `WriteAt` created a new `MessageRecord` for `_cache` on every write | Removed eager cache population; reads now decode from the pending buffer or disk | Eliminates 1 object alloc per message |
| 10 | Contiguous write buffer — `MsgBlock._pendingWrites` was a `List<byte[]>` with per-message `byte[]` allocations | Replaced with a single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it | Eliminates per-message `byte[]` encoding alloc; single `RandomAccess.Write` per flush |
| 11 | Pending buffer read path — `MsgBlock.Read()` flushed pending writes to disk before reading | Added in-memory read from `_pendingBuf` when the data is still in the buffer | Avoids unnecessary disk flush on read-after-write |
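Fixes #10 and Round 1's flush-loop coalescing share one idea: encode records into a single growing buffer and flush it in one write once a size threshold is crossed. A Go sketch with an invented record layout (8-byte sequence + 4-byte length + payload) and the 16KB threshold the Round 1 description mentions; nothing here is the server's actual format.

```go
package main

import "fmt"

// flushThreshold mirrors the 16KB coalescing target described in Round 1.
const flushThreshold = 16 * 1024

// msgBlock accumulates encoded records in one contiguous buffer instead of
// allocating a byte slice per message.
type msgBlock struct {
	pendingBuf []byte
	flushes    int
}

// append encodes one record directly into pendingBuf and flushes when the
// buffer crosses the threshold.
func (b *msgBlock) append(seq uint64, payload []byte) {
	var hdr [12]byte
	for i := 0; i < 8; i++ {
		hdr[i] = byte(seq >> (56 - 8*i)) // big-endian sequence
	}
	n := uint32(len(payload))
	hdr[8], hdr[9], hdr[10], hdr[11] = byte(n>>24), byte(n>>16), byte(n>>8), byte(n)
	b.pendingBuf = append(b.pendingBuf, hdr[:]...)
	b.pendingBuf = append(b.pendingBuf, payload...)
	if len(b.pendingBuf) >= flushThreshold {
		b.flush()
	}
}

// flush writes the whole pending batch in one call (a single pwrite in the
// real store) and resets the buffer for reuse.
func (b *msgBlock) flush() {
	if len(b.pendingBuf) == 0 {
		return
	}
	b.flushes++
	b.pendingBuf = b.pendingBuf[:0]
}

func main() {
	var b msgBlock
	payload := make([]byte, 128)
	for i := uint64(1); i <= 1000; i++ {
		b.append(i, payload) // 140 bytes per record
	}
	b.flush()
	fmt.Println(b.flushes) // 9: eight threshold flushes plus the final one
}
```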
### Round 1: FileStore/StreamManager Layer

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 1 | Per-message synchronous disk I/O — `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in `MsgBlock` + a background flush loop in `FileStore` (Go's flushLoop pattern: coalesce 16KB or 8ms) | Eliminates per-message syscall overhead |
| 2 | O(n) `GetStateAsync` per publish — `_messages.Keys.Min()` and `_messages.Values.Sum()` on every publish for MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1) | Eliminates O(n) scan per publish |
| 3 | Unnecessary `LoadAsync` after every append — `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates redundant disk read per publish |
| 4 | Redundant `PruneExpiredMessages` per publish — called before every publish even when MaxAge=0, and again inside `EnforceRuntimePolicies` | Guarded with a `MaxAgeMs > 0` check; removed the pre-publish call (the background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | `PrunePerSubject` loading all messages per publish — `EnforceRuntimePolicies` → `PrunePerSubject` called `ListAsync().GroupBy()` even when MaxMsgsPer=0 | Guarded with a `MaxMsgsPer > 0` check | Eliminates O(n) scan per publish |
Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.
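Fix #2's incremental-counter idea is simple enough to sketch directly. Illustrative Go with invented field names echoing the description (`messageCount`, `totalBytes`, `firstSeq`); every mutation path updates the counters so limit checks never scan the message map.

```go
package main

import "fmt"

// streamState keeps O(1) running totals for MaxMsgs/MaxBytes enforcement,
// replacing per-publish Min()/Sum() scans over all stored messages.
type streamState struct {
	messageCount uint64
	totalBytes   uint64
	firstSeq     uint64
	lastSeq      uint64
}

// onAppend records one stored message of the given size.
func (s *streamState) onAppend(size int) {
	if s.messageCount == 0 {
		s.firstSeq = s.lastSeq + 1
	}
	s.lastSeq++
	s.messageCount++
	s.totalBytes += uint64(size)
}

// onRemoveFirst records eviction of the oldest message (e.g. a MaxMsgs limit).
func (s *streamState) onRemoveFirst(size int) {
	s.firstSeq++
	s.messageCount--
	s.totalBytes -= uint64(size)
}

func main() {
	var s streamState
	for i := 0; i < 3; i++ {
		s.onAppend(128)
	}
	s.onRemoveFirst(128)
	fmt.Println(s.messageCount, s.totalBytes, s.firstSeq, s.lastSeq) // 2 256 2 3
}
```

The cost moves from O(n) per publish to O(1) per mutation, at the price of keeping every removal path honest about updating the counters.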
## What would further close the gap

| Change | Expected Impact | Go Reference |
|---|---|---|
| Fan-out parallelism | Deliver to subscribers concurrently instead of serially from the publisher's read loop — this is now the primary bottleneck behind the 0.63x fan-out gap | Go: `processMsgResults` fans out per-client via goroutines |
| Eliminate per-message GC allocations in FileStore | ~30% improvement on FileStore AppendAsync — replace the `StoredMessage` class with a `StoredMessageMeta` struct in the `_messages` dict, reconstructing the full message from `MsgBlock` on read | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs; 80+ sites in FileStore.cs need migration |
| Single publisher throughput | 0.62x–0.74x gap; the pub-only path has no fan-out overhead — likely JIT/GC/socket write overhead in the ingest path | Go: `client.go` readLoop with zero-copy buffer management |