
# Go vs .NET NATS Server — Benchmark Comparison

Benchmark run: 2026-03-13 (America/Indiana/Indianapolis). Both servers ran on the same machine using the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests -c Release --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly.

Environment: Apple M4, .NET SDK 10.0.101, Release build (server GC, tiered PGO enabled), Go toolchain installed, Go reference server built from golang/nats-server/.



## Core NATS — Pub/Sub Throughput

### Single Publisher (no subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 16 B | 2,223,690 | 33.9 | 1,651,727 | 25.2 | 0.74x |
| 128 B | 2,218,308 | 270.8 | 1,368,967 | 167.1 | 0.62x |

### Publisher + Subscriber (1:1)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 16 B | 292,711 | 4.5 | 723,867 | 11.0 | 2.47x |
| 16 KB | 32,890 | 513.9 | 37,943 | 592.9 | 1.15x |

### Fan-Out (1 Publisher : 4 Subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 128 B | 2,945,790 | 359.6 | 2,063,771 | 251.9 | 0.70x |

Note: Fan-out improved from 0.63x to 0.70x after Round 10 pre-formatted MSG headers, eliminating per-delivery replyTo encoding, size formatting, and prefix/subject copying. Only the SID varies per delivery now.
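The pre-formatting technique is easy to see in isolation. Below is an illustrative Go sketch (names like `preformat` and `deliver` are invented here, not the server's API): the invariant prefix and suffix of the MSG protocol line are built once per publish, so each delivery reduces to three plain appends.

```go
package main

import (
	"fmt"
	"strconv"
)

// preformatted holds the parts of a MSG protocol line that are identical
// for every subscriber of one publish:
//   MSG <subject> <sid> [reply-to] <#bytes>\r\n
// Only <sid> varies per delivery.
type preformatted struct {
	prefix []byte // "MSG <subject> "
	suffix []byte // " [reply-to] <#bytes>\r\n"
}

func preformat(subject, reply string, size int) preformatted {
	var p preformatted
	p.prefix = append(p.prefix, "MSG "...)
	p.prefix = append(p.prefix, subject...)
	p.prefix = append(p.prefix, ' ')
	p.suffix = append(p.suffix, ' ')
	if reply != "" {
		p.suffix = append(p.suffix, reply...)
		p.suffix = append(p.suffix, ' ')
	}
	p.suffix = strconv.AppendInt(p.suffix, int64(size), 10)
	p.suffix = append(p.suffix, "\r\n"...)
	return p
}

// deliver appends prefix+sid+suffix to dst: pure memory copies,
// no encoding or number formatting on the per-delivery path.
func (p preformatted) deliver(dst []byte, sid string) []byte {
	dst = append(dst, p.prefix...)
	dst = append(dst, sid...)
	dst = append(dst, p.suffix...)
	return dst
}

func main() {
	p := preformat("orders.created", "", 5)
	for _, sid := range []string{"1", "2", "3", "4"} {
		fmt.Printf("%q\n", p.deliver(nil, sid))
	}
}
```

The real implementation additionally writes straight into the client's direct buffer rather than returning a slice, but the shape of the work per delivery is the same.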

### Multi-Publisher / Multi-Subscriber (4P x 4S)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 128 B | 2,123,480 | 259.2 | 1,465,416 | 178.9 | 0.69x |

## Core NATS — Request/Reply Latency

### Single Client, Single Service

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---|---|---|---|---|---|---|---|
| 128 B | 8,386 | 7,424 | 0.89x | 115.8 | 139.0 | 175.5 | 193.0 |

### 10 Clients, 2 Services (Queue Group)

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---|---|---|---|---|---|---|---|
| 16 B | 26,470 | 26,620 | 1.01x | 370.2 | 376.0 | 486.0 | 592.8 |

## JetStream — Publication

| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| Synchronous | 16 B | Memory | 14,812 | 12,134 | 0.82x |
| Async (batch) | 128 B | File | 174,705 | 52,350 | 0.30x |

Note: Async file-store publish improved ~10% (47K→52K) after hot-path optimizations: cached state properties, single stream lookup, _messageIndexes removal, hand-rolled pub-ack formatter, exponential flush backoff, lazy StoredMessage materialization. Still storage-bound at 0.30x Go.


## JetStream — Consumption

| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---|---|---|
| Ordered ephemeral consumer | 166,000 | 102,369 | 0.62x |
| Durable consumer fetch | 510,000 | 468,252 | 0.92x |

Note: Ordered consumer improved to 0.62x (102K vs 166K). Durable fetch jumped to 0.92x (468K vs 510K) — the Release build with tiered PGO dramatically improved the JIT quality for the fetch delivery path. Go comparison numbers vary significantly across runs.


## MQTT Throughput

| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| MQTT PubSub (128B, QoS 0) | 34,224 | 4.2 | 47,341 | 5.8 | 1.38x |
| Cross-Protocol NATS→MQTT (128B) | 158,000 | 19.3 | 229,932 | 28.1 | 1.46x |

Note: Pure MQTT pub/sub extended its lead to 1.38x. Cross-protocol NATS→MQTT now at 1.46x — the Release build JIT further benefits the delivery path.


## Transport Overhead

### TLS

| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| TLS PubSub 1:1 (128B) | 289,548 | 35.3 | 254,834 | 31.1 | 0.88x |
| TLS Pub-Only (128B) | 1,782,442 | 217.6 | 877,149 | 107.1 | 0.49x |

### WebSocket

| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| WS PubSub 1:1 (128B) | 66,584 | 8.1 | 62,249 | 7.6 | 0.93x |
| WS Pub-Only (128B) | 106,302 | 13.0 | 85,878 | 10.5 | 0.81x |

Note: TLS pub/sub stable at 0.88x. WebSocket pub/sub at 0.93x. Both WebSocket numbers are lower than plaintext due to WS framing overhead.


## Hot Path Microbenchmarks (.NET only)

### SubList

| Benchmark | .NET msg/s | .NET MB/s | Alloc |
|---|---|---|---|
| SubList Exact Match (128 subjects) | 19,285,510 | 257.5 | 0.00 B/op |
| SubList Wildcard Match | 18,876,330 | 252.0 | 0.00 B/op |
| SubList Queue Match | 20,639,153 | 157.5 | 0.00 B/op |
| SubList Remote Interest | 274,703 | 4.5 | 0.00 B/op |

### Parser

| Benchmark | Ops/s | MB/s | Alloc |
|---|---|---|---|
| Parser PING | 6,283,578 | 36.0 | 0.0 B/op |
| Parser PUB | 2,712,550 | 103.5 | 40.0 B/op |
| Parser HPUB | 2,338,555 | 124.9 | 40.0 B/op |
| Parser PUB split payload | 2,043,813 | 78.0 | 176.0 B/op |

### FileStore

| Benchmark | Ops/s | MB/s | Alloc |
|---|---|---|---|
| FileStore AppendAsync (128B) | 244,089 | 29.8 | 1552.9 B/op |
| FileStore LoadLastBySubject (hot) | 12,784,127 | 780.3 | 0.0 B/op |
| FileStore PurgeEx+Trim | 332 | 0.0 | 5440792.9 B/op |

## Summary

| Category | Ratio Range | Assessment |
|---|---|---|
| Pub-only throughput | 0.62x-0.74x | Improved with Release build |
| Pub/sub (small payload) | 2.47x | .NET outperforms Go decisively |
| Pub/sub (large payload) | 1.15x | .NET now exceeds parity |
| Fan-out | 0.70x | Improved: pre-formatted MSG headers |
| Multi pub/sub | 0.69x | Improved: same optimizations |
| Request/reply latency | 0.89x-1.01x | Effectively at parity |
| JetStream sync publish | 0.82x | Run-to-run variance |
| JetStream async file publish | 0.30x | Storage-bound |
| JetStream ordered consume | 0.62x | Improved with Release build |
| JetStream durable fetch | 0.92x | Major improvement with Release build |
| MQTT pub/sub | 1.38x | .NET outperforms Go |
| MQTT cross-protocol | 1.46x | .NET strongly outperforms Go |
| TLS pub/sub | 0.88x | Close to parity |
| TLS pub-only | 0.49x | Variance / contention with other tests |
| WebSocket pub/sub | 0.93x | Close to parity |
| WebSocket pub-only | 0.81x | Good |

## Key Observations

  1. Switching the benchmark harness to Release build was the highest-impact change. Durable fetch jumped from 0.42x to 0.92x (468K vs 510K msg/s). Ordered consumer improved from 0.57x to 0.62x. Request-reply 10Cx2S reached parity at 1.01x. Large-payload pub/sub now exceeds Go at 1.15x.
  2. Small-payload 1:1 pub/sub remains a strong .NET lead at 2.47x (724K vs 293K msg/s).
  3. MQTT cross-protocol improved to 1.46x (230K vs 158K msg/s), up from 1.20x — the Release JIT further benefits the delivery path.
  4. Fan-out improved from 0.63x to 0.70x, multi pub/sub from 0.65x to 0.69x after Round 10 pre-formatted MSG headers. Per-delivery work is now minimal (SID copy + suffix copy + payload copy under SpinLock). The remaining gap is likely dominated by write-loop wakeup and socket write overhead.
  5. SubList Match microbenchmarks improved ~17% (19.3M vs 16.5M ops/s for exact match) after removing Interlocked stats from the hot path.
  6. TLS pub-only dropped to 0.49x this run, likely noise from co-running benchmarks contending on CPU. TLS pub/sub remains stable at 0.88x.

## Optimization History

### Round 10: Fan-Out Serial Path Optimization

Three optimizations making the serial fan-out path cheaper (fan-out 0.63x→0.70x, multi 0.65x→0.69x):

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 38 | Per-delivery MSG header re-formatting: `SendMessageNoFlush` independently formats the entire MSG header line (prefix, subject copy, replyTo encoding, size formatting, CRLF) for every subscriber — but only the SID varies per delivery | Pre-build the prefix (`MSG <subject> `) and suffix (` [reply] <size>\r\n`) once per publish; new `SendMessagePreformatted` writes prefix+sid+suffix directly into `_directBuf` — zero encoding, pure memory copies | Eliminates per-delivery replyTo encoding, size formatting, and prefix/subject copying |
| 39 | Queue-group round-robin burns 2 interlocked ops: `Interlocked.Increment(ref OutMsgs)` + `Interlocked.Decrement(ref OutMsgs)` per queue group just to pick an index | Replaced with non-atomic `uint QueueRoundRobin++` — safe because `ProcessMessage` runs single-threaded per publisher connection (the read loop) | Eliminates 2 interlocked ops per queue group per publish |
| 40 | `HashSet<INatsClient>` pcd overhead: hash computation + bucket lookup per `Add` for small fan-out counts (4 subscribers) | Replaced with `[ThreadStatic]` `INatsClient[]` + linear scan; O(n) but n≤16, faster than hashing for small counts | Eliminates hash computation and internal array overhead |
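The array-plus-linear-scan trade in item 40 is easy to demonstrate in isolation. A minimal Go sketch (illustrative types — `int` client IDs stand in for `INatsClient` references):

```go
package main

import "fmt"

// addUnique records c in the pending-flush list if it is not already
// there. For n <= 16 a linear scan over a flat slice beats a hash set:
// no hashing, no bucket indirection, cache-friendly sequential reads.
func addUnique(pcd []int, c int) []int {
	for _, existing := range pcd {
		if existing == c {
			return pcd // already queued for a flush signal
		}
	}
	return append(pcd, c)
}

func main() {
	// Simulate one publish fanning out to subscriptions owned by
	// clients 1, 2, 2, 3 (client 2 has two matching subscriptions).
	pcd := make([]int, 0, 16)
	for _, client := range []int{1, 2, 2, 3} {
		pcd = addUnique(pcd, client)
	}
	fmt.Println(pcd) // each unique client gets exactly one flush signal
}
```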

### Round 9: Fan-Out & Multi Pub/Sub Hot-Path Optimization

Seven optimizations targeting the per-delivery hot path and benchmark harness configuration:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 31 | Benchmark harness built server in Debug: `DotNetServerProcess.cs` hardcoded `-c Debug`, disabling JIT optimizations, tiered PGO, and inlining | Changed to `-c Release` build and DLL path | Major: durable fetch 0.42x→0.92x, request-reply to parity |
| 32 | Per-delivery Interlocked on server-wide stats: `SendMessageNoFlush` did 2 Interlocked ops per delivery; fan-out 4 subs = 8 interlocked ops per publish | Moved server-wide stats to a batch `Interlocked.Add` once after the fan-out loop in `ProcessMessage` | Eliminates N×2 interlocked ops per publish |
| 33 | Auto-unsub tracking on every delivery: `Interlocked.Increment(ref sub.MessageCount)` on every delivery even when `MaxMessages == 0` (no limit — the common case) | Guarded with `if (sub.MaxMessages > 0)` | Eliminates 1 interlocked op per delivery in the common case |
| 34 | Per-delivery SID ASCII encoding: `Encoding.ASCII.GetBytes(sid)` on every delivery; the SID is a small integer that never changes | Added `Subscription.SidBytes` cached property; new `SendMessageNoFlush` overload accepts `ReadOnlySpan<byte>` | Eliminates per-delivery encoding |
| 35 | Per-delivery subject ASCII encoding: `Encoding.ASCII.GetBytes(subject)` for each subscriber; fan-out 4 = 4× encoding of the same subject | Pre-encode the subject once in `ProcessMessage` before the fan-out loop; the new overload uses a span copy | Eliminates N-1 subject encodings per publish |
| 36 | Per-publish subject string allocation: `Encoding.ASCII.GetString(cmd.Subject.Span)` on every PUB even when publishing to the same subject repeatedly | Added a 1-element string cache per client; reuses the string when the subject bytes match | Eliminates string alloc for repeated subjects |
| 37 | Interlocked stats in the `SubList.Match` hot path: `Interlocked.Increment(ref _matches)` and `_cacheHits` on every match call | Replaced with non-atomic increments (approximate counters for monitoring) | Eliminates 1-2 interlocked ops per match |
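Item 36's one-element cache can be sketched as follows (illustrative Go; the real code interns the ASCII subject bytes from the PUB line into a .NET string):

```go
package main

import (
	"bytes"
	"fmt"
)

// subjectCache is a 1-element cache: benchmarks (and many real apps)
// publish to the same subject repeatedly, so remembering the last
// subject's raw bytes avoids materializing an identical string per PUB.
type subjectCache struct {
	raw []byte
	str string
}

func (c *subjectCache) intern(subject []byte) string {
	if bytes.Equal(c.raw, subject) {
		return c.str // hit: no allocation
	}
	c.raw = append(c.raw[:0], subject...) // miss: remember the bytes
	c.str = string(subject)               // and materialize once
	return c.str
}

func main() {
	var c subjectCache
	fmt.Println(c.intern([]byte("orders.created")))
	fmt.Println(c.intern([]byte("orders.created"))) // served from cache
	fmt.Println(c.intern([]byte("orders.deleted"))) // evicts and replaces
}
```

A single slot is enough here because the hot benchmark workload publishes to one subject; a real workload with alternating subjects would thrash it, which is the usual caveat with 1-element caches.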

### Round 8: Ordered Consumer + Cross-Protocol Optimization

Three optimizations targeting pull consumer delivery and MQTT cross-protocol throughput:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 28 | Per-message flush signal in `DeliverPullFetchMessagesAsync`: `DeliverMessage` called `SendMessage`, which triggered `_flushSignal.Writer.TryWrite(0)` per message; for a batch of N messages, N flush signals and write-loop wakeups | Replaced with `SendMessageNoFlush` + batch flush every 64 messages + final flush after the loop; bypasses `DeliverMessage` entirely (no permission check / auto-unsub needed for the JS delivery inbox) | Reduces flush signals from N to N/64 per batch |
| 29 | 5ms polling delay in the pull consumer wait loop: `Task.Delay(5)` in `DeliverPullFetchMessagesAsync` and `PullConsumerEngine.WaitForMessageAsync` added up to 5ms latency per empty slot; for tail-following consumers, every new message waited up to 5ms to be noticed | Added `StreamHandle.NotifyPublish()` / `WaitForPublishAsync()` using `TaskCompletionSource` signaling; publishers call `NotifyPublish` after `AppendAsync`; consumers wait on the signal with a heartbeat-interval timeout | Eliminates the polling delay; instant wakeup on publish |
| 30 | `StringBuilder` allocation in `NatsToMqtt` for the common case — every uncached `NatsToMqtt` call allocated a `StringBuilder` even when no `_DOT_` escape sequences were present (the common case) | Added a `string.Create` fast path that uses a char-replacement lambda when no `_DOT_` is found; pre-warm the topic bytes cache on MQTT subscription creation | Eliminates `StringBuilder` + string alloc for the common case; no cache miss on first delivery |

### Round 7: MQTT Cross-Protocol Write Path

Four optimizations targeting the NATS→MQTT delivery hot path (cross-protocol throughput improved from 0.30x to 0.78x):

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 24 | Per-message async fire-and-forget in `MqttNatsClientAdapter` — each `SendMessage` called `SendBinaryPublishAsync`, which acquired a `SemaphoreSlim`, allocated a full PUBLISH packet `byte[]`, wrote, and flushed the stream — all per message, bypassing the server's deferred-flush batching | Replaced with a synchronous `EnqueuePublishNoFlush()` that formats the MQTT PUBLISH directly into `_directBuf` under `SpinLock`, matching the `NatsClient` pattern; `SignalFlush()` signals the write loop for a batch flush | Eliminates async Task + `SemaphoreSlim` + per-message flush |
| 25 | Per-message `byte[]` allocation for MQTT PUBLISH packets: `MqttPacketWriter.WritePublish()` allocated topic bytes, variable header, remaining-length array, and the full packet array on every delivery | Added `WritePublishTo(Span<byte>)` that formats the entire PUBLISH packet directly into the destination span using `Span<byte>` operations — zero heap allocation | Eliminates 4+ `byte[]` allocs per delivery |
| 26 | Per-message NATS→MQTT topic translation: `NatsToMqtt()` allocated a `StringBuilder`, produced a string, then `Encoding.UTF8.GetBytes()` re-encoded it on every delivery | Added `NatsToMqttBytes()` with a bounded `ConcurrentDictionary<string, byte[]>` cache (4096 entries); the cached result includes pre-encoded UTF-8 bytes | Eliminates string + encoding alloc per delivery for cached topics |
| 27 | Per-message `FlushAsync` on plain TCP sockets: `WriteBinaryAsync` flushed after every packet write, even on `NetworkStream` where TCP auto-flushes | Write loop skips `FlushAsync` for plain sockets; for TLS/wrapped streams, flushes once per batch (not per message) | Reduces syscalls from 2N to 1 per batch |
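One concrete piece of the zero-allocation packet writer in item 25 is the MQTT "remaining length" field: a base-128 varint (MQTT 3.1.1, section 2.2.3) that the old path encoded into a freshly allocated array. A sketch of encoding it into a caller-provided buffer instead:

```go
package main

import "fmt"

// encodeRemainingLength writes MQTT's variable-length "remaining
// length" field (base-128, MSB is the continuation bit, max 4 bytes)
// into dst and returns how many bytes were written. Writing into a
// caller-provided buffer avoids a per-packet allocation.
func encodeRemainingLength(dst []byte, n int) int {
	i := 0
	for {
		b := byte(n % 128)
		n /= 128
		if n > 0 {
			b |= 0x80 // more length bytes follow
		}
		dst[i] = b
		i++
		if n == 0 {
			return i
		}
	}
}

func main() {
	buf := make([]byte, 4)
	// 321 encodes as 0xC1 0x02 (the worked example in the MQTT spec).
	n := encodeRemainingLength(buf, 321)
	fmt.Printf("% X\n", buf[:n])
}
```

`WritePublishTo(Span<byte>)` applies the same idea to every field of the packet: fixed header, topic length + bytes, and payload all land in one destination span.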

### Round 6: Batch Flush Signaling + Fetch Optimizations

Four optimizations targeting fan-out and consumer fetch hot paths:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 20 | Per-subscriber flush signal in fan-out — each `SendMessage` called `_flushSignal.Writer.TryWrite(0)` independently; for 1:4 fan-out, 4 channel writes + 4 write-loop wakeups per published message | Split `SendMessage` into `SendMessageNoFlush` + `SignalFlush`; `ProcessMessage` collects unique clients in a `[ThreadStatic]` `HashSet<INatsClient>` (Go's pcd pattern), one flush signal per unique client after fan-out | Reduces channel writes from N to unique-client-count per publish |
| 21 | Per-fetch `CompiledFilter` allocation: `CompiledFilter.FromConfig(consumer.Config)` called on every fetch request, allocating a new filter object each time | Cached `CompiledFilter` on `ConsumerHandle` with staleness detection (reference + value check on filter config fields); reused across fetches | Eliminates per-fetch filter allocation |
| 22 | Per-message string interpolation in the ack reply: `$"$JS.ACK.{stream}.{consumer}.1.{seq}.{deliverySeq}.{ts}.{pending}"` allocated intermediate strings and boxed numeric types on every delivery | Pre-compute the `$"$JS.ACK.{stream}.{consumer}.1."` prefix before the loop; use `stackalloc char[]` + `TryFormat` for the numeric suffix — zero intermediate allocations | Eliminates 4+ string allocs per delivered message |
| 23 | Per-fetch `List<StoredMessage>` allocation: `new List<StoredMessage>(batch)` allocated on every `FetchAsync` call | `[ThreadStatic]` reusable list with `.Clear()` + capacity growth; `PullFetchBatch` snapshots via `.ToArray()` for safe handoff | Eliminates per-fetch list allocation |
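Item 22's allocation-free ack subject can be mirrored in Go with `strconv.AppendUint` against a precomputed prefix (names illustrative; the .NET code uses `stackalloc` + `TryFormat` for the same effect):

```go
package main

import (
	"fmt"
	"strconv"
)

// ackSubject builds "$JS.ACK.<stream>.<consumer>.1.<seq>.<dseq>.<ts>.<pending>"
// by appending numeric fields to a prefix computed once per delivery loop.
// No intermediate strings, no boxing: just appends into a reused buffer.
func ackSubject(buf []byte, prefix string, seq, dseq, ts, pending uint64) []byte {
	buf = append(buf[:0], prefix...)
	buf = strconv.AppendUint(buf, seq, 10)
	buf = append(buf, '.')
	buf = strconv.AppendUint(buf, dseq, 10)
	buf = append(buf, '.')
	buf = strconv.AppendUint(buf, ts, 10)
	buf = append(buf, '.')
	buf = strconv.AppendUint(buf, pending, 10)
	return buf
}

func main() {
	prefix := "$JS.ACK.ORDERS.worker.1." // computed once, before the loop
	var buf []byte
	for seq := uint64(1); seq <= 3; seq++ {
		buf = ackSubject(buf, prefix, seq, seq, 1700000000, 3-seq)
		fmt.Println(string(buf))
	}
}
```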

### Round 5: Non-blocking ConsumeAsync (ordered + durable consumers)

One root cause was identified and fixed in the MSG.NEXT request handling path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 19 | Synchronous blocking in `DeliverPullFetchMessages`: `FetchAsync(...).GetAwaiter().GetResult()` blocked the client's read loop for the full expires timeout (30s). With batch=1000 and only 5 messages available, the fetch polled for message 6 indefinitely. No messages were delivered until the timeout fired, causing the client to receive 0 messages before its own timeout. | Split into two paths: noWait/no-expires uses a synchronous fetch (existing behavior for the `FetchAsync` client); expires > 0 spawns a `DeliverPullFetchMessagesAsync` background task that delivers messages incrementally without blocking the read loop, with idle heartbeat support | Enables `ConsumeAsync` for both ordered and durable consumers; ordered consumer: 99K msg/s (0.64x Go) |

### Round 4: Per-Client Direct Write Buffer (pub/sub + fan-out + multi pub/sub)

Four optimizations were implemented in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 15 | Per-message channel overhead — each `SendMessage` call went through `Channel<OutboundData>.TryWrite`, incurring lock contention and memory barriers | Replaced channel-based message delivery with a per-client `_directBuf` byte array under `SpinLock`; messages written directly to a contiguous buffer | Eliminates channel overhead per delivery |
| 16 | Per-message heap allocation for the MSG header: `_outboundBufferPool.RentBuffer()` allocated a pooled `byte[]` for each MSG header | Replaced with `stackalloc byte[512]` — the MSG header is formatted entirely on the stack, then copied into `_directBuf` | Zero heap allocations per delivery |
| 17 | Per-message socket write — the write loop issued one `SendAsync` per channel item, even with coalescing | Double-buffer swap: the write loop swaps `_directBuf`/`_writeBuf` under `SpinLock`, then writes the entire batch in a single `SendAsync`; zero allocation on swap | Single syscall per batch, zero-copy buffer reuse |
| 18 | Separate wake channels: `SendMessage` and `WriteProtocol` used different signaling paths | Unified on the `_flushSignal` channel (bounded capacity 1, DropWrite); both paths signal the same channel, and the write loop drains both `_directBuf` and `_outbound` on each wake | Single wait point, no missed wakes |

### Round 3: Outbound Write Path (pub/sub + fan-out + fetch)

Three root causes were identified and fixed in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 12 | Per-message `.ToArray()` allocation in `SendMessage`: `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber | Replaced `IMemoryOwner` rent/copy/dispose with a direct `byte[]` from the pool; the write loop returns buffers after writing | Eliminates 1 heap alloc per delivery (4 per fan-out message) |
| 13 | Per-message `WriteAsync` in the write loop — each queued message triggered a separate `_stream.WriteAsync()` system call | Added a 64KB coalesce buffer; drain all pending messages into a contiguous buffer, single `WriteAsync` per batch | Reduces syscalls from N to 1 per batch |
| 14 | Profiling Stopwatch on every message: `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture`, even for non-JetStream messages | Removed profiling instrumentation from the hot path | Eliminates ~200ns overhead per message |

### Round 2: FileStore AppendAsync Hot Path

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 6 | Async state machine overhead: `AppendAsync` was `async ValueTask<ulong>` but never actually awaited | Changed to a synchronous `ValueTask<ulong>` returning `ValueTask.FromResult(_last)` | Eliminates Task state machine allocation |
| 7 | Double payload copy: `TransformForPersist` allocated a `byte[]`, then `payload.ToArray()` created a second copy for `StoredMessage` | Reuse the `TransformForPersist` result directly for `StoredMessage.Payload` when no transform is needed (`_noTransform` flag) | Eliminates 1 `byte[]` alloc per message |
| 8 | Unnecessary TTL work per publish: `ExpireFromWheel()` and `RegisterTtl()` called on every write even when MaxAge=0 | Guarded both with an `_options.MaxAgeMs > 0` check (matches Go: filestore.go:4701) | Eliminates hash wheel overhead when TTL is not configured |
| 9 | Per-message MsgBlock cache allocation: `WriteAt` created a new `MessageRecord` for `_cache` on every write | Removed eager cache population; reads now decode from the pending buffer or disk | Eliminates 1 object alloc per message |
| 10 | Contiguous write buffer: `MsgBlock._pendingWrites` was a `List<byte[]>` with per-message `byte[]` allocations | Replaced with a single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it | Eliminates per-message `byte[]` encoding alloc; single `RandomAccess.Write` per flush |
| 11 | Pending buffer read path: `MsgBlock.Read()` flushed pending writes to disk before reading | Added in-memory read from `_pendingBuf` when the data is still in the buffer | Avoids an unnecessary disk flush on read-after-write |
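Item 11's read-after-write path can be sketched with a tiny model of the block: bytes either live in the flushed ("on-disk") region or in the pending buffer, and a read checks the pending region first. This is illustrative Go, not the real MsgBlock record layout:

```go
package main

import "fmt"

// msgBlock models the two regions a record can live in: bytes already
// flushed to disk, and bytes still sitting in the pending write buffer.
type msgBlock struct {
	flushed []byte // stand-in for the on-disk file
	pending []byte // contiguous pending-write buffer
}

// appendRecord buffers the record and returns its logical offset.
func (b *msgBlock) appendRecord(rec []byte) (offset int) {
	offset = len(b.flushed) + len(b.pending)
	b.pending = append(b.pending, rec...)
	return offset
}

// flush moves pending bytes to the "disk" region (one write per flush).
func (b *msgBlock) flush() {
	b.flushed = append(b.flushed, b.pending...)
	b.pending = b.pending[:0]
}

// read serves a record from the pending buffer when it has not been
// flushed yet, instead of forcing a flush plus a disk read. (Sketch
// assumes a record never straddles the two regions.)
func (b *msgBlock) read(off, n int) []byte {
	if off >= len(b.flushed) {
		p := off - len(b.flushed)
		return b.pending[p : p+n]
	}
	return b.flushed[off : off+n]
}

func main() {
	var b msgBlock
	off := b.appendRecord([]byte("record-1"))
	fmt.Println(string(b.read(off, 8))) // served from memory, no flush
	b.flush()
	fmt.Println(string(b.read(off, 8))) // now served from "disk"
}
```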

### Round 1: FileStore/StreamManager Layer

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 1 | Per-message synchronous disk I/O: `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in `MsgBlock` + background flush loop in `FileStore` (Go's flushLoop pattern: coalesce 16KB or 8ms) | Eliminates per-message syscall overhead |
| 2 | O(n) `GetStateAsync` per publish: `_messages.Keys.Min()` and `_messages.Values.Sum()` on every publish for MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1) | Eliminates O(n) scan per publish |
| 3 | Unnecessary `LoadAsync` after every append: `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates redundant disk read per publish |
| 4 | Redundant `PruneExpiredMessages` per publish — called before every publish even when MaxAge=0, and again inside `EnforceRuntimePolicies` | Guarded with a `MaxAgeMs > 0` check; removed the pre-publish call (the background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | `PrunePerSubject` loading all messages per publish: `EnforceRuntimePolicies` called `PrunePerSubject`, which ran `ListAsync().GroupBy()` even when MaxMsgsPer=0 | Guarded with a `MaxMsgsPer > 0` check | Eliminates O(n) scan per publish |
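Item 2's incremental accounting amounts to updating a few counters in every mutation path so limit checks never scan the message map. In sketch form (illustrative Go; the real fields live on the FileStore):

```go
package main

import "fmt"

// streamState tracks exactly what MaxMsgs/MaxBytes enforcement needs,
// updated incrementally on each append/remove so queries are O(1)
// instead of Min()/Sum() scans over every stored message.
type streamState struct {
	msgs     uint64
	bytes    uint64
	firstSeq uint64
	lastSeq  uint64
}

func (s *streamState) onAppend(seq uint64, size int) {
	if s.msgs == 0 {
		s.firstSeq = seq
	}
	s.msgs++
	s.bytes += uint64(size)
	s.lastSeq = seq
}

// onRemoveFirst models limit enforcement evicting the oldest message.
func (s *streamState) onRemoveFirst(size int) {
	s.msgs--
	s.bytes -= uint64(size)
	s.firstSeq++
}

func main() {
	var s streamState
	for seq := uint64(1); seq <= 3; seq++ {
		s.onAppend(seq, 128)
	}
	s.onRemoveFirst(128) // e.g. MaxMsgs=2 evicts seq 1
	fmt.Printf("msgs=%d bytes=%d first=%d last=%d\n",
		s.msgs, s.bytes, s.firstSeq, s.lastSeq)
}
```

The cost of this approach is discipline, not cycles: every mutation path (append, prune, purge, TTL expiry) must keep the counters in sync, which is why the fix touched "all mutation paths".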

Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.

## What would further close the gap

| Change | Expected Impact | Go Reference |
|---|---|---|
| Write-loop / socket write overhead | The per-delivery serial path is now minimal (SID copy + memcpy under SpinLock); the remaining 0.70x fan-out gap is likely write-loop wakeup latency and socket write syscall overhead | Go: `flushOutbound` uses `net.Buffers.WriteTo` → `writev()` with zero-copy buffer management |
| Eliminate per-message GC allocations in FileStore | ~30% improvement on FileStore `AppendAsync` — replace the `StoredMessage` class with a `StoredMessageMeta` struct in the `_messages` dict, reconstruct the full message from `MsgBlock` on read | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs; 80+ sites in FileStore.cs need migration |
| Single publisher throughput | 0.62x-0.74x gap; the pub-only path has no fan-out overhead — likely JIT/GC/socket write overhead in the ingest path | Go: client.go readLoop with zero-copy buffer management |