natsdotnet/benchmarks_comparison.md
Joseph Doherty 1d4b87e5f9 docs: refresh benchmark comparison with increased message counts
Increase message counts across all 14 benchmark test files to reduce
run-to-run variance (e.g. PubSub 16B: 10K→50K, FanOut: 10K→15K,
SinglePub: 100K→500K, JS tests: 5K→25K). Rewrite benchmarks_comparison.md
with fresh numbers from two-batch runs. Key changes: multi 4x4 reached
parity (1.01x), fan-out improved to 0.84x, TLS pub/sub shows 4.70x .NET
advantage, previous small-count anomalies corrected.
2026-03-13 17:52:03 -04:00


Go vs .NET NATS Server — Benchmark Comparison

Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the same machine and were driven by the benchmark project (`dotnet test tests/NATS.Server.Benchmark.Tests -c Release --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Tests were run in two batches (core pub/sub, then everything else) to reduce cross-test resource contention.

Environment: Apple M4, .NET SDK 10.0.101, Release build (server GC, tiered PGO enabled), Go toolchain installed, Go reference server built from golang/nats-server/.

Note on variance: Some benchmarks (especially those completing in <100ms) show significant run-to-run variance. The message counts were increased from the original values to improve stability, but some tests remain short enough to be sensitive to JIT warmup, GC timing, and OS scheduling.


Core NATS — Pub/Sub Throughput

Single Publisher (no subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 16 B | 2,162,959 | 33.0 | 1,602,442 | 24.5 | 0.74x |
| 128 B | 3,773,858 | 460.7 | 1,408,294 | 171.9 | 0.37x |
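
The MB/s and Ratio columns in these tables can be recomputed from the msg/s columns. The sketch below assumes 1 MB = 1,048,576 bytes, which is an inference from the published values rather than something the report states; the function names are illustrative only.

```go
package main

import "fmt"

// mbPerSec converts a message rate and payload size into MB/s.
// Assumption: 1 MB = 1,048,576 bytes (inferred from the table values).
func mbPerSec(msgsPerSec, payloadBytes int) float64 {
	return float64(msgsPerSec) * float64(payloadBytes) / (1024 * 1024)
}

// ratio returns the .NET rate divided by the Go rate.
func ratio(dotnet, golang int) float64 {
	return float64(dotnet) / float64(golang)
}

func main() {
	// Values taken from the 16 B single-publisher row.
	fmt.Printf("Go:    %.1f MB/s\n", mbPerSec(2162959, 16))
	fmt.Printf(".NET:  %.1f MB/s\n", mbPerSec(1602442, 16))
	fmt.Printf("Ratio: %.2fx\n", ratio(1602442, 2162959))
}
```

A ratio below 1.00x means the Go server handled more messages per second in that test; above 1.00x means the .NET server did.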

Publisher + Subscriber (1:1)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 16 B | 1,075,095 | 16.4 | 713,952 | 10.9 | 0.66x |
| 16 KB | 39,215 | 612.7 | 30,916 | 483.1 | 0.79x |

Fan-Out (1 Publisher : 4 Subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 128 B | 2,919,353 | 356.4 | 2,459,924 | 300.3 | 0.84x |

Multi-Publisher / Multi-Subscriber (4P x 4S)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 128 B | 1,870,855 | 228.4 | 1,892,631 | 231.0 | 1.01x |

Core NATS — Request/Reply Latency

Single Client, Single Service

| Payload | Go msg/s | .NET msg/s | Ratio |
|---|---|---|---|
| 128 B | 9,392 | 8,372 | 0.89x |

10 Clients, 2 Services (Queue Group)

| Payload | Go msg/s | .NET msg/s | Ratio |
|---|---|---|---|
| 16 B | 30,563 | 26,178 | 0.86x |

JetStream — Publication

| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| Synchronous | 16 B | Memory | 16,982 | 14,514 | 0.85x |
| Async (batch) | 128 B | File | 211,355 | 58,334 | 0.28x |

JetStream — Consumption

| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---|---|---|
| Ordered ephemeral consumer | 786,681 | 346,162 | 0.44x |
| Durable consumer fetch | 711,203 | 542,250 | 0.76x |

MQTT Throughput

| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| MQTT PubSub (128B, QoS 0) | 36,913 | 4.5 | 48,755 | 6.0 | 1.32x |
| Cross-Protocol NATS→MQTT (128B) | 407,487 | 49.7 | 287,946 | 35.1 | 0.71x |

Transport Overhead

TLS

| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| TLS PubSub 1:1 (128B) | 244,403 | 29.8 | 1,148,179 | 140.2 | 4.70x |
| TLS Pub-Only (128B) | 3,224,490 | 393.6 | 1,246,351 | 152.1 | 0.39x |

Note: TLS PubSub 1:1 shows .NET dramatically outperforming Go (4.70x). This appears to reflect .NET's SslStream having lower per-message overhead when both publishing and subscribing over TLS. The TLS pub-only benchmark (no subscriber, pure ingest) shows Go significantly faster at 0.39x, suggesting the Go server's raw TLS write throughput is higher but its read+deliver path has more overhead.

WebSocket

| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| WS PubSub 1:1 (128B) | 44,783 | 5.5 | 40,793 | 5.0 | 0.91x |
| WS Pub-Only (128B) | 118,898 | 14.5 | 100,522 | 12.3 | 0.85x |

Hot Path Microbenchmarks (.NET only)

SubList

| Benchmark | .NET msg/s | .NET MB/s | Alloc |
|---|---|---|---|
| SubList Exact Match (128 subjects) | 22,812,300 | 304.6 | 0.00 B/op |
| SubList Wildcard Match | 17,626,363 | 235.3 | 0.00 B/op |
| SubList Queue Match | 23,306,329 | 177.8 | 0.00 B/op |
| SubList Remote Interest | 437,080 | 7.1 | 0.00 B/op |

Parser

| Benchmark | Ops/s | MB/s | Alloc |
|---|---|---|---|
| Parser PING | 6,262,196 | 35.8 | 0.0 B/op |
| Parser PUB | 2,663,706 | 101.6 | 40.0 B/op |
| Parser HPUB | 2,213,655 | 118.2 | 40.0 B/op |
| Parser PUB split payload | 2,100,256 | 80.1 | 176.0 B/op |

FileStore

| Benchmark | Ops/s | MB/s | Alloc |
|---|---|---|---|
| FileStore AppendAsync (128B) | 275,438 | 33.6 | 1242.9 B/op |
| FileStore LoadLastBySubject (hot) | 1,138,203 | 69.5 | 656.0 B/op |
| FileStore PurgeEx+Trim | 647 | 0.1 | 5440579.9 B/op |

Summary

| Category | Ratio | Assessment |
|---|---|---|
| Pub-only throughput (16B) | 0.74x | Stable across runs |
| Pub-only throughput (128B) | 0.37x | Go significantly faster at larger payloads |
| Pub/sub 1:1 (16B) | 0.66x | Go ahead; high variance at short durations |
| Pub/sub 1:1 (16KB) | 0.79x | Reasonable gap |
| Fan-out 1:4 | 0.84x | Improved after Round 10 optimizations |
| Multi pub/sub 4x4 | 1.01x | At parity |
| Request/reply (single) | 0.89x | Close to parity |
| Request/reply (10Cx2S) | 0.86x | Close to parity |
| JetStream sync publish | 0.85x | Close to parity |
| JetStream async file publish | 0.28x | Storage-bound |
| JetStream ordered consume | 0.44x | Significant gap |
| JetStream durable fetch | 0.76x | Moderate gap |
| MQTT pub/sub | 1.32x | .NET outperforms Go |
| MQTT cross-protocol | 0.71x | Go ahead; high variance |
| TLS pub/sub | 4.70x | .NET SslStream dramatically faster |
| TLS pub-only | 0.39x | Go raw TLS write faster |
| WebSocket pub/sub | 0.91x | Close to parity |
| WebSocket pub-only | 0.85x | Good |

Key Observations

  1. Multi pub/sub reached parity (1.01x) after Round 10 pre-formatted MSG headers. Fan-out improved to 0.84x.
  2. TLS pub/sub shows a dramatic .NET advantage (4.70x) — .NET's SslStream has significantly lower overhead in the bidirectional pub/sub path. TLS pub-only (ingest only) still favors Go at 0.39x, suggesting the advantage is in the read-and-deliver path.
  3. MQTT pub/sub remains a .NET strength at 1.32x. Cross-protocol (NATS→MQTT) dropped to 0.71x — this benchmark shows high variance across runs.
  4. JetStream ordered consumer dropped to 0.44x compared to earlier runs (0.62x). This test completes in <100ms and shows high variance.
  5. Single publisher 128B dropped to 0.37x (from 0.62x with smaller message counts). With 500K messages, this benchmark runs long enough for Go's goroutine scheduler and buffer management to reach steady state, widening the gap. The 16B variant is stable at 0.74x.
  6. Request-reply latency is stable at 0.86x to 0.89x across all runs.

Optimization History

Round 10: Fan-Out Serial Path Optimization

Three optimizations making the serial fan-out path cheaper (fan-out 0.63x→0.84x, multi 0.65x→1.01x):

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 38 | **Per-delivery MSG header re-formatting**: `SendMessageNoFlush` independently formats the entire MSG header line (prefix, subject copy, replyTo encoding, size formatting, CRLF) for every subscriber, but only the SID varies per delivery | Pre-build prefix (`MSG subject `) and suffix (` [reply] sizes\r\n`) once per publish; new `SendMessagePreformatted` writes prefix+sid+suffix directly into `_directBuf`: zero encoding, pure memory copies | Eliminates per-delivery replyTo encoding, size formatting, prefix/subject copying |
| 39 | **Queue-group round-robin burns 2 Interlocked ops**: `Interlocked.Increment(ref OutMsgs)` + `Interlocked.Decrement(ref OutMsgs)` per queue group just to pick an index | Replaced with non-atomic `uint QueueRoundRobin++`; safe because `ProcessMessage` runs single-threaded per publisher connection (the read loop) | Eliminates 2 interlocked ops per queue group per publish |
| 40 | **`HashSet<INatsClient>` pcd overhead**: hash computation + bucket lookup per `Add` for small fan-out counts (4 subscribers) | Replaced with `[ThreadStatic] INatsClient[]` + linear scan; O(n) but n≤16, faster than hash for small counts | Eliminates hash computation and internal array overhead |
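
The pre-formatted header idea in #38 can be sketched as follows. This is a Go illustration of the technique, not the .NET server's `SendMessagePreformatted`; `msgTemplate` and its methods are hypothetical names. It relies only on the NATS wire format `MSG <subject> <sid> [reply-to] <#bytes>\r\n`, where the SID is the only field that differs between subscribers.

```go
package main

import (
	"fmt"
	"strconv"
)

// msgTemplate holds the parts of a MSG header that are identical for
// every subscriber of one published message (hypothetical type).
type msgTemplate struct {
	prefix []byte // "MSG <subject> "
	suffix []byte // " [reply-to ]<size>\r\n"
}

// newMsgTemplate builds prefix and suffix once per publish.
func newMsgTemplate(subject, reply string, size int) msgTemplate {
	prefix := append([]byte("MSG "), subject...)
	prefix = append(prefix, ' ')

	suffix := []byte{' '}
	if reply != "" {
		suffix = append(suffix, reply...)
		suffix = append(suffix, ' ')
	}
	suffix = strconv.AppendInt(suffix, int64(size), 10)
	suffix = append(suffix, '\r', '\n')
	return msgTemplate{prefix: prefix, suffix: suffix}
}

// appendHeader copies prefix + sid + suffix into dst; no formatting or
// encoding happens per delivery, only byte copies.
func (t msgTemplate) appendHeader(dst, sid []byte) []byte {
	dst = append(dst, t.prefix...)
	dst = append(dst, sid...)
	return append(dst, t.suffix...)
}

func main() {
	t := newMsgTemplate("orders.new", "", 128)
	buf := make([]byte, 0, 256)
	for _, sid := range []string{"1", "2", "3", "4"} {
		buf = t.appendHeader(buf[:0], []byte(sid))
		fmt.Printf("%q\n", buf)
	}
}
```

The per-subscriber cost drops to three appends into a reused buffer, which is the property the Fix column describes.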

Round 9: Fan-Out & Multi Pub/Sub Hot-Path Optimization

Seven optimizations targeting the per-delivery hot path and benchmark harness configuration:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 31 | **Benchmark harness built server in Debug**: `DotNetServerProcess.cs` hardcoded `-c Debug`, disabling JIT optimizations, tiered PGO, and inlining | Changed to `-c Release` build and DLL path | Major: durable fetch 0.42x→0.92x, request-reply to parity |
| 32 | **Per-delivery Interlocked on server-wide stats**: `SendMessageNoFlush` did 2 Interlocked ops per delivery; fan-out 4 subs = 8 interlocked ops per publish | Moved server-wide stats to batch `Interlocked.Add` once after fan-out loop in `ProcessMessage` | Eliminates N×2 interlocked ops per publish |
| 33 | **Auto-unsub tracking on every delivery**: `Interlocked.Increment(ref sub.MessageCount)` on every delivery even when `MaxMessages == 0` (no limit, the common case) | Guarded with `if (sub.MaxMessages > 0)` | Eliminates 1 interlocked op per delivery in common case |
| 34 | **Per-delivery SID ASCII encoding**: `Encoding.ASCII.GetBytes(sid)` on every delivery; SID is a small integer that never changes | Added `Subscription.SidBytes` cached property; new `SendMessageNoFlush` overload accepts `ReadOnlySpan<byte>` | Eliminates per-delivery encoding |
| 35 | **Per-delivery subject ASCII encoding**: `Encoding.ASCII.GetBytes(subject)` for each subscriber; fan-out 4 = 4× encoding same subject | Pre-encode subject once in `ProcessMessage` before fan-out loop; new overload uses span copy | Eliminates N-1 subject encodings per publish |
| 36 | **Per-publish subject string allocation**: `Encoding.ASCII.GetString(cmd.Subject.Span)` on every PUB even when publishing to the same subject repeatedly | Added 1-element string cache per client; reuses string when subject bytes match | Eliminates string alloc for repeated subjects |
| 37 | **Interlocked stats in SubList.Match hot path**: `Interlocked.Increment(ref _matches)` and `_cacheHits` on every match call | Replaced with non-atomic increments (approximate counters for monitoring) | Eliminates 1-2 interlocked ops per match |
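
The batching in #32 amounts to accumulating counts in local variables during the fan-out loop and publishing them with one atomic add per counter afterwards. A minimal Go sketch of that pattern follows; `serverStats` and `deliverAll` are hypothetical names, and the real .NET code uses `Interlocked.Add` rather than Go's `sync/atomic`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// serverStats stands in for server-wide counters shared by all connections.
type serverStats struct {
	outMsgs  atomic.Int64
	outBytes atomic.Int64
}

// deliverAll simulates one publish fanned out to n subscribers. Counts are
// kept in plain locals inside the loop and flushed with two atomic adds
// at the end, instead of two atomic ops per delivery.
func deliverAll(stats *serverStats, subscribers, payloadLen int) {
	var msgs, size int64
	for i := 0; i < subscribers; i++ {
		// The per-subscriber write would happen here.
		msgs++
		size += int64(payloadLen)
	}
	stats.outMsgs.Add(msgs)
	stats.outBytes.Add(size)
}

func main() {
	var stats serverStats
	deliverAll(&stats, 4, 128)
	fmt.Println(stats.outMsgs.Load(), stats.outBytes.Load())
}
```

With 4 subscribers this performs 2 atomic operations per publish instead of 8, matching the N×2 reduction in the Impact column.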

Round 8: Ordered Consumer + Cross-Protocol Optimization

Three optimizations targeting pull consumer delivery and MQTT cross-protocol throughput:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 28 | **Per-message flush signal in DeliverPullFetchMessagesAsync**: `DeliverMessage` called `SendMessage`, which triggered `_flushSignal.Writer.TryWrite(0)` per message; for batch of N messages, N flush signals and write-loop wakeups | Replaced with `SendMessageNoFlush` + batch flush every 64 messages + final flush after loop; bypasses `DeliverMessage` entirely (no permission check / auto-unsub needed for JS delivery inbox) | Reduces flush signals from N to N/64 per batch |
| 29 | **5ms polling delay in pull consumer wait loop**: `Task.Delay(5)` in `DeliverPullFetchMessagesAsync` and `PullConsumerEngine.WaitForMessageAsync` added up to 5ms latency per empty slot; for tail-following consumers, every new message waited up to 5ms to be noticed | Added `StreamHandle.NotifyPublish()` / `WaitForPublishAsync()` using `TaskCompletionSource` signaling; publishers call `NotifyPublish` after `AppendAsync`; consumers wait on signal with heartbeat-interval timeout | Eliminates polling delay; instant wakeup on publish |
| 30 | **StringBuilder allocation in NatsToMqtt for common case**: every uncached `NatsToMqtt` call allocated a `StringBuilder` even when no `_DOT_` escape sequences were present (the common case) | Added `string.Create` fast path that uses char replacement lambda when no `_DOT_` found; pre-warm topic bytes cache on MQTT subscription creation | Eliminates StringBuilder + string alloc for common case; no cache miss on first delivery |
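
The switch from polling to signaling in #29 can be illustrated in Go with a channel that is closed on each publish. This is a sketch of the general notify/wait pattern, not a port of `NotifyPublish` / `WaitForPublishAsync`; `publishSignal` and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// publishSignal lets consumers sleep until the next publish instead of
// re-checking on a fixed delay.
type publishSignal struct {
	mu sync.Mutex
	ch chan struct{}
}

func newPublishSignal() *publishSignal {
	return &publishSignal{ch: make(chan struct{})}
}

// notify wakes every current waiter by closing the channel, then arms a
// fresh channel for the next round.
func (s *publishSignal) notify() {
	s.mu.Lock()
	close(s.ch)
	s.ch = make(chan struct{})
	s.mu.Unlock()
}

// waiter returns the channel that the next notify will close.
func (s *publishSignal) waiter() <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.ch
}

// wait blocks until ch is closed or the timeout elapses. It reports
// whether the wakeup came from a publish.
func wait(ch <-chan struct{}, timeout time.Duration) bool {
	select {
	case <-ch:
		return true
	case <-time.After(timeout):
		return false
	}
}

// demo returns (wokenWhileIdle, wokenAfterPublish).
func demo() (bool, bool) {
	s := newPublishSignal()
	idle := wait(s.waiter(), 10*time.Millisecond) // nothing published yet

	ch := s.waiter() // grab the channel before the publish can happen
	go s.notify()
	woken := wait(ch, time.Second)
	return idle, woken
}

func main() {
	idle, woken := demo()
	fmt.Println(idle, woken)
}
```

A consumer should obtain the waiter channel before checking the stream for new messages; otherwise a publish that lands between the check and the wait would be missed until the timeout.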

Round 7: MQTT Cross-Protocol Write Path

Four optimizations targeting the NATS→MQTT delivery hot path (cross-protocol throughput improved from 0.30x to 0.78x):

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 24 | **Per-message async fire-and-forget in MqttNatsClientAdapter**: each `SendMessage` called `SendBinaryPublishAsync`, which acquired a `SemaphoreSlim`, allocated a full PUBLISH packet `byte[]`, wrote, and flushed the stream, all per message, bypassing the server's deferred-flush batching | Replaced with synchronous `EnqueuePublishNoFlush()` that formats MQTT PUBLISH directly into `_directBuf` under SpinLock, matching the `NatsClient` pattern; `SignalFlush()` signals the write loop for batch flush | Eliminates async Task + SemaphoreSlim + per-message flush |
| 25 | **Per-message byte[] allocation for MQTT PUBLISH packets**: `MqttPacketWriter.WritePublish()` allocated topic bytes, variable header, remaining-length array, and full packet array on every delivery | Added `WritePublishTo(Span<byte>)` that formats the entire PUBLISH packet directly into the destination span using `Span<byte>` operations: zero heap allocation | Eliminates 4+ byte[] allocs per delivery |
| 26 | **Per-message NATS→MQTT topic translation**: `NatsToMqtt()` allocated a `StringBuilder`, produced a string, then `Encoding.UTF8.GetBytes()` re-encoded it on every delivery | Added `NatsToMqttBytes()` with bounded `ConcurrentDictionary<string, byte[]>` cache (4096 entries); cached result includes pre-encoded UTF-8 bytes | Eliminates string + encoding alloc per delivery for cached topics |
| 27 | **Per-message FlushAsync on plain TCP sockets**: `WriteBinaryAsync` flushed after every packet write, even on `NetworkStream` where TCP auto-flushes | Write loop skips `FlushAsync` for plain sockets; for TLS/wrapped streams, flushes once per batch (not per message) | Reduces syscalls from 2N to 1 per batch |
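
To make #25 concrete, here is a Go sketch of encoding an MQTT 3.1.1 QoS 0 PUBLISH packet straight into a caller-supplied buffer. It follows the MQTT wire format (fixed header byte, variable-length remaining length, length-prefixed topic, payload) and is only an analogue of `WritePublishTo(Span<byte>)`; the function name and return convention are assumptions.

```go
package main

import "fmt"

// writePublishTo encodes a QoS 0 PUBLISH packet into dst without
// allocating. It returns the number of bytes written, or 0 if the packet
// cannot be encoded or dst is too small.
func writePublishTo(dst []byte, topic string, payload []byte) int {
	if len(topic) == 0 || len(topic) > 0xFFFF {
		return 0
	}
	remaining := 2 + len(topic) + len(payload)
	if remaining > 268435455 { // MQTT maximum remaining length
		return 0
	}
	// The remaining length uses 7 bits per byte, so count its bytes first.
	lenBytes := 1
	for r := remaining; r >= 128; r >>= 7 {
		lenBytes++
	}
	if len(dst) < 1+lenBytes+remaining {
		return 0
	}

	dst[0] = 0x30 // PUBLISH, QoS 0, DUP=0, RETAIN=0
	i := 1
	r := remaining
	for {
		b := byte(r & 0x7F)
		r >>= 7
		if r > 0 {
			b |= 0x80 // continuation bit: more length bytes follow
		}
		dst[i] = b
		i++
		if r == 0 {
			break
		}
	}
	dst[i] = byte(len(topic) >> 8) // topic length, big-endian
	dst[i+1] = byte(len(topic))
	i += 2
	i += copy(dst[i:], topic)
	i += copy(dst[i:], payload)
	return i
}

func main() {
	buf := make([]byte, 64)
	n := writePublishTo(buf, "a/b", []byte("hi"))
	fmt.Printf("%d bytes: % x\n", n, buf[:n])
}
```

Because the caller owns `dst`, the same buffer can be reused for every delivery, which is where the allocation savings come from.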

Round 6: Batch Flush Signaling + Fetch Optimizations

Four optimizations targeting fan-out and consumer fetch hot paths:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 20 | **Per-subscriber flush signal in fan-out**: each `SendMessage` called `_flushSignal.Writer.TryWrite(0)` independently; for 1:4 fan-out, 4 channel writes + 4 write-loop wakeups per published message | Split `SendMessage` into `SendMessageNoFlush` + `SignalFlush`; `ProcessMessage` collects unique clients in `[ThreadStatic] HashSet<INatsClient>` (Go's pcd pattern), one flush signal per unique client after fan-out | Reduces channel writes from N to unique-client-count per publish |
| 21 | **Per-fetch CompiledFilter allocation**: `CompiledFilter.FromConfig(consumer.Config)` called on every fetch request, allocating a new filter object each time | Cached `CompiledFilter` on `ConsumerHandle` with staleness detection (reference + value check on filter config fields); reused across fetches | Eliminates per-fetch filter allocation |
| 22 | **Per-message string interpolation in ack reply**: `$"$JS.ACK.{stream}.{consumer}.1.{seq}.{deliverySeq}.{ts}.{pending}"` allocated intermediate strings and boxed numeric types on every delivery | Pre-compute `$"$JS.ACK.{stream}.{consumer}.1."` prefix before loop; use `stackalloc char[]` + `TryFormat` for numeric suffix: zero intermediate allocations | Eliminates 4+ string allocs per delivered message |
| 23 | **Per-fetch List<StoredMessage> allocation**: `new List<StoredMessage>(batch)` allocated on every `FetchAsync` call | `[ThreadStatic]` reusable list with `.Clear()` + capacity growth; `PullFetchBatch` snapshots via `.ToArray()` for safe handoff | Eliminates per-fetch list allocation |
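
The ack-subject change in #22 is a prefix-plus-numeric-suffix pattern. The Go sketch below mirrors it with `strconv.Append*` into a reused buffer; the subject layout comes from the table above, while `appendAckSubject` and the sample stream and consumer names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// appendAckSubject appends prefix + "seq.deliverySeq.ts.pending" to dst.
// The prefix ("$JS.ACK.<stream>.<consumer>.1.") is built once per batch;
// only the numeric fields are formatted per message.
func appendAckSubject(dst []byte, prefix string, seq, deliverySeq uint64, ts int64, pending uint64) []byte {
	dst = append(dst, prefix...)
	dst = strconv.AppendUint(dst, seq, 10)
	dst = append(dst, '.')
	dst = strconv.AppendUint(dst, deliverySeq, 10)
	dst = append(dst, '.')
	dst = strconv.AppendInt(dst, ts, 10)
	dst = append(dst, '.')
	return strconv.AppendUint(dst, pending, 10)
}

func main() {
	// Built once, before the delivery loop.
	prefix := "$JS.ACK." + "ORDERS" + "." + "worker" + ".1."

	buf := make([]byte, 0, 128) // reused for every message in the batch
	for seq := uint64(1); seq <= 3; seq++ {
		buf = appendAckSubject(buf[:0], prefix, seq, seq, 1700000000, 3-seq)
		fmt.Println(string(buf))
	}
}
```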

Round 5: Non-blocking ConsumeAsync (ordered + durable consumers)

One root cause was identified and fixed in the MSG.NEXT request handling path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 19 | **Synchronous blocking in DeliverPullFetchMessages**: `FetchAsync(...).GetAwaiter().GetResult()` blocked the client's read loop for the full `expires` timeout (30s). With batch=1000 and only 5 messages available, the fetch kept polling for message 6 until that timeout fired. No messages were delivered before then, so the client received 0 messages before its own timeout. | Split into two paths: noWait/no-expires uses synchronous fetch (existing behavior for `FetchAsync` client); `expires > 0` spawns a `DeliverPullFetchMessagesAsync` background task that delivers messages incrementally without blocking the read loop, with idle heartbeat support | Enables `ConsumeAsync` for both ordered and durable consumers; ordered consumer: 99K msg/s (0.64x Go) |

Round 4: Per-Client Direct Write Buffer (pub/sub + fan-out + multi pub/sub)

Four optimizations were implemented in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 15 | **Per-message channel overhead**: each `SendMessage` call went through `Channel<OutboundData>.TryWrite`, incurring lock contention and memory barriers | Replaced channel-based message delivery with per-client `_directBuf` byte array under SpinLock; messages written directly to contiguous buffer | Eliminates channel overhead per delivery |
| 16 | **Per-message heap allocation for MSG header**: `_outboundBufferPool.RentBuffer()` allocated a pooled `byte[]` for each MSG header | Replaced with `stackalloc byte[512]`; MSG header formatted entirely on the stack, then copied into `_directBuf` | Zero heap allocations per delivery |
| 17 | **Per-message socket write**: write loop issued one `SendAsync` per channel item, even with coalescing | Double-buffer swap: write loop swaps `_directBuf` ↔ `_writeBuf` under SpinLock, then writes the entire batch in a single `SendAsync`; zero allocation on swap | Single syscall per batch, zero-copy buffer reuse |
| 18 | **Separate wake channels**: `SendMessage` and `WriteProtocol` used different signaling paths | Unified on `_flushSignal` channel (bounded capacity 1, DropWrite); both paths signal the same channel, write loop drains both `_directBuf` and `_outbound` on each wake | Single wait point, no missed wakes |
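
The double-buffer swap (#17) and the capacity-1 flush signal (#18) fit together as sketched below. This is a Go analogue using a mutex and a buffered channel, not the server's SpinLock-based implementation; `outbound`, `countingWriter`, and `demo` are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

type outbound struct {
	mu     sync.Mutex
	direct []byte        // producers append here under the lock
	write  []byte        // owned by the write loop between swaps
	signal chan struct{} // capacity 1: at most one pending wakeup
}

func newOutbound() *outbound {
	return &outbound{signal: make(chan struct{}, 1)}
}

// enqueue appends a message to the direct buffer without signaling.
func (o *outbound) enqueue(msg []byte) {
	o.mu.Lock()
	o.direct = append(o.direct, msg...)
	o.mu.Unlock()
}

// signalFlush requests a flush; the request is dropped if one is pending.
func (o *outbound) signalFlush() {
	select {
	case o.signal <- struct{}{}:
	default:
	}
}

// flush swaps the two buffers under the lock, then writes the whole batch
// with a single Write call outside the lock.
func (o *outbound) flush(w io.Writer) (int, error) {
	o.mu.Lock()
	o.direct, o.write = o.write[:0], o.direct
	o.mu.Unlock()
	if len(o.write) == 0 {
		return 0, nil
	}
	return w.Write(o.write)
}

// countingWriter records how many Write calls reach the sink.
type countingWriter struct {
	bytes.Buffer
	calls int
}

func (c *countingWriter) Write(p []byte) (int, error) {
	c.calls++
	return c.Buffer.Write(p)
}

// demo enqueues three messages and flushes once.
func demo() (calls, pendingSignals int, out string) {
	o := newOutbound()
	w := &countingWriter{}
	for _, m := range []string{"a", "b", "c"} {
		o.enqueue([]byte(m))
		o.signalFlush()
	}
	pendingSignals = len(o.signal)
	<-o.signal // the write loop wakes once
	if _, err := o.flush(w); err != nil {
		panic(err)
	}
	return w.calls, pendingSignals, w.String()
}

func main() {
	fmt.Println(demo())
}
```

Three enqueues collapse into one pending signal and one write, and both buffers keep their capacity across swaps, so steady-state flushing allocates nothing.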

Round 3: Outbound Write Path (pub/sub + fan-out + fetch)

Three root causes were identified and fixed in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 12 | **Per-message .ToArray() allocation in SendMessage**: `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber | Replaced `IMemoryOwner` rent/copy/dispose with direct `byte[]` from pool; write loop returns buffers after writing | Eliminates 1 heap alloc per delivery (4 per fan-out message) |
| 13 | **Per-message WriteAsync in write loop**: each queued message triggered a separate `_stream.WriteAsync()` system call | Added 64KB coalesce buffer; drain all pending messages into contiguous buffer, single `WriteAsync` per batch | Reduces syscalls from N to 1 per batch |
| 14 | **Profiling Stopwatch on every message**: `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture` even for non-JetStream messages | Removed profiling instrumentation from hot path | Eliminates ~200ns overhead per message |

Round 2: FileStore AppendAsync Hot Path

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 6 | **Async state machine overhead**: `AppendAsync` was `async ValueTask<ulong>` but never actually awaited | Changed to synchronous `ValueTask<ulong>` returning `ValueTask.FromResult(_last)` | Eliminates Task state machine allocation |
| 7 | **Double payload copy**: `TransformForPersist` allocated `byte[]`, then `payload.ToArray()` created second copy for `StoredMessage` | Reuse `TransformForPersist` result directly for `StoredMessage.Payload` when no transform needed (`_noTransform` flag) | Eliminates 1 byte[] alloc per message |
| 8 | **Unnecessary TTL work per publish**: `ExpireFromWheel()` and `RegisterTtl()` called on every write even when MaxAge=0 | Guarded both with `_options.MaxAgeMs > 0` check (matches Go: filestore.go:4701) | Eliminates hash wheel overhead when TTL not configured |
| 9 | **Per-message MsgBlock cache allocation**: `WriteAt` created new `MessageRecord` for `_cache` on every write | Removed eager cache population; reads now decode from pending buffer or disk | Eliminates 1 object alloc per message |
| 10 | **Contiguous write buffer**: `MsgBlock._pendingWrites` was `List<byte[]>` with per-message `byte[]` allocations | Replaced with single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it | Eliminates per-message byte[] encoding alloc; single `RandomAccess.Write` per flush |
| 11 | **Pending buffer read path**: `MsgBlock.Read()` flushed pending writes to disk before reading | Added in-memory read from `_pendingBuf` when data is still in the buffer | Avoids unnecessary disk flush on read-after-write |

Round 1: FileStore/StreamManager Layer

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 1 | **Per-message synchronous disk I/O**: `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in `MsgBlock` + background flush loop in `FileStore` (Go's flushLoop pattern: coalesce 16KB or 8ms) | Eliminates per-message syscall overhead |
| 2 | **O(n) GetStateAsync per publish**: `_messages.Keys.Min()` and `_messages.Values.Sum()` on every publish for MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1) | Eliminates O(n) scan per publish |
| 3 | **Unnecessary LoadAsync after every append**: `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates redundant disk read per publish |
| 4 | **Redundant PruneExpiredMessages per publish**: called before every publish even when MaxAge=0, and again inside `EnforceRuntimePolicies` | Guarded with `MaxAgeMs > 0` check; removed the pre-publish call (background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | **PrunePerSubject loading all messages per publish**: `EnforceRuntimePolicies` → `PrunePerSubject` called `ListAsync().GroupBy()` even when MaxMsgsPer=0 | Guarded with `MaxMsgsPer > 0` check | Eliminates O(n) scan per publish |
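
The incremental bookkeeping from #2 can be sketched as a store that updates its count, byte total, and first sequence on every mutation, so reading state never scans the message map. This is an illustrative Go model with hypothetical names (`memStore`, `state`), not the FileStore code.

```go
package main

import "fmt"

// memStore keeps running totals alongside the message map.
type memStore struct {
	msgs       map[uint64][]byte
	count      uint64
	totalBytes uint64
	firstSeq   uint64
	lastSeq    uint64
}

func newMemStore() *memStore {
	return &memStore{msgs: make(map[uint64][]byte)}
}

// append stores a payload and updates the totals in O(1).
func (s *memStore) append(payload []byte) uint64 {
	s.lastSeq++
	s.msgs[s.lastSeq] = payload
	s.count++
	s.totalBytes += uint64(len(payload))
	if s.count == 1 {
		s.firstSeq = s.lastSeq
	}
	return s.lastSeq
}

// remove deletes a message and keeps the totals consistent. firstSeq only
// moves forward when the removed message was the first one.
func (s *memStore) remove(seq uint64) bool {
	p, ok := s.msgs[seq]
	if !ok {
		return false
	}
	delete(s.msgs, seq)
	s.count--
	s.totalBytes -= uint64(len(p))
	switch {
	case s.count == 0:
		s.firstSeq = 0
	case seq == s.firstSeq:
		// Skip over gaps left by earlier removals; this terminates because
		// at least one message with a sequence <= lastSeq remains.
		for {
			s.firstSeq++
			if _, ok := s.msgs[s.firstSeq]; ok {
				break
			}
		}
	}
	return true
}

// state reads the cached totals without scanning the map.
func (s *memStore) state() (count, totalBytes, firstSeq uint64) {
	return s.count, s.totalBytes, s.firstSeq
}

func main() {
	s := newMemStore()
	s.append([]byte("a"))
	s.append([]byte("bb"))
	s.append([]byte("ccc"))
	s.remove(1)
	fmt.Println(s.state())
}
```

The cost is that every mutation path has to maintain the counters, which is why the Fix column stresses "updated in all mutation paths".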

Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.

What would further close the gap

| Change | Expected Impact | Go Reference |
|---|---|---|
| Single publisher ingest path (0.37x at 128B) | The pub-only path has the largest gap among the core NATS benchmarks. Go's readLoop uses zero-copy buffer management with direct `[]byte` slicing; .NET parses into managed objects. Reducing allocations in the parser→`ProcessMessage` path would help. | Go: client.go readLoop, direct buffer slicing |
| JetStream async file publish (0.28x) | Storage-bound: the FileStore `AppendAsync` bottleneck is the synchronous `RandomAccess.Write` in the flush loop plus S2 compression overhead | Go: filestore.go uses cache.buf/cache.idx with mmap and goroutine-per-flush concurrency |
| JetStream ordered consumer (0.44x) | Pull consumer delivery pipeline has overhead in the fetch→deliver→ack cycle. The test completes in <100ms so numbers are noisy, but the gap is real. | Go: consumer.go delivery with direct buffer writes |
| Write-loop / socket write overhead | Fan-out (0.84x) and pub/sub (0.66x) gaps partly come from write-loop wakeup latency and socket write syscall overhead compared to Go's `writev()` | Go: flushOutbound uses `net.Buffers.WriteTo` → `writev()` with zero-copy buffer management |
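
For readers unfamiliar with the Go side of the last row, the sketch below shows the `net.Buffers.WriteTo` call shape. It writes into an in-memory buffer so it runs anywhere; in that case `WriteTo` simply issues one `Write` per slice. The Go `net` package documents that the call may be optimized into a batch write such as `writev` for certain connection types. `writeBatch` and `demoBatch` are hypothetical helpers, not code from the Go server.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net"
)

// writeBatch hands several byte slices to the writer as one net.Buffers
// batch, without first copying them into a single contiguous buffer.
func writeBatch(w io.Writer, parts ...[]byte) (int64, error) {
	bufs := net.Buffers(parts)
	return bufs.WriteTo(w)
}

// demoBatch writes a MSG header, payload, and trailing CRLF as one batch.
func demoBatch() (int64, string) {
	var sink bytes.Buffer
	n, err := writeBatch(&sink,
		[]byte("MSG a 1 2\r\n"), // header
		[]byte("hi"),            // payload
		[]byte("\r\n"),          // trailer
	)
	if err != nil {
		panic(err)
	}
	return n, sink.String()
}

func main() {
	n, out := demoBatch()
	fmt.Printf("%d bytes: %q\n", n, out)
}
```

Note that `WriteTo` consumes the `net.Buffers` value as it writes, so the slice of buffers should be rebuilt for each batch.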