Joseph Doherty 1d4b87e5f9 docs: refresh benchmark comparison with increased message counts

Increase message counts across all 14 benchmark test files to reduce
run-to-run variance (e.g. PubSub 16B: 10K→50K, FanOut: 10K→15K,
SinglePub: 100K→500K, JS tests: 5K→25K). Rewrite benchmarks_comparison.md
with fresh numbers from two-batch runs. Key changes: multi 4x4 reached
parity (1.01x), fan-out improved to 0.84x, TLS pub/sub shows 4.70x .NET
advantage, previous small-count anomalies corrected.

2026-03-13 17:52:03 -04:00

Go vs .NET NATS Server — Benchmark Comparison

Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project (`dotnet test tests/NATS.Server.Benchmark.Tests -c Release --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Tests ran in two batches (core pub/sub, then everything else) to reduce cross-test resource contention.

Environment: Apple M4, .NET SDK 10.0.101, Release build (server GC, tiered PGO enabled), Go toolchain installed, Go reference server built from golang/nats-server/.

Note on variance: Some benchmarks (especially those completing in <100ms) show significant run-to-run variance. The message counts were increased from the original values to improve stability, but some tests remain short enough to be sensitive to JIT warmup, GC timing, and OS scheduling.

Core NATS — Pub/Sub Throughput

Single Publisher (no subscribers)

Payload	Go msg/s	Go MB/s	.NET msg/s	.NET MB/s	Ratio (.NET/Go)
16 B	2,162,959	33.0	1,602,442	24.5	0.74x
128 B	3,773,858	460.7	1,408,294	171.9	0.37x
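
The derived columns in these tables can be reproduced from the msg/s figures and the payload size. The sketch below assumes that "MB" means 2^20 bytes, which is an inference from the published numbers rather than something the benchmark harness states; the function names are illustrative only.

```go
package main

import "fmt"

// mibPerSec converts a message rate and payload size into MiB/s.
// Assumption: the tables' "MB" means 2^20 bytes; this matches the published rows.
func mibPerSec(msgsPerSec, payloadBytes int64) float64 {
	return float64(msgsPerSec*payloadBytes) / (1 << 20)
}

// ratio returns the .NET/Go throughput ratio shown in the last column.
func ratio(dotnetMsgs, goMsgs int64) float64 {
	return float64(dotnetMsgs) / float64(goMsgs)
}

// formatRow renders the derived columns with the same precision as the tables.
func formatRow(goMsgs, dotnetMsgs, payloadBytes int64) string {
	return fmt.Sprintf("%.1f MB/s vs %.1f MB/s, %.2fx",
		mibPerSec(goMsgs, payloadBytes),
		mibPerSec(dotnetMsgs, payloadBytes),
		ratio(dotnetMsgs, goMsgs))
}

func main() {
	// 16 B row of the single-publisher table.
	fmt.Println(formatRow(2162959, 1602442, 16))
	// 128 B row of the same table.
	fmt.Println(formatRow(3773858, 1408294, 128))
}
```

A ratio below 1.00x means the Go server is faster; above 1.00x means the .NET server is faster.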

Publisher + Subscriber (1:1)

Payload	Go msg/s	Go MB/s	.NET msg/s	.NET MB/s	Ratio (.NET/Go)
16 B	1,075,095	16.4	713,952	10.9	0.66x
16 KB	39,215	612.7	30,916	483.1	0.79x

Fan-Out (1 Publisher : 4 Subscribers)

Payload	Go msg/s	Go MB/s	.NET msg/s	.NET MB/s	Ratio (.NET/Go)
128 B	2,919,353	356.4	2,459,924	300.3	0.84x

Multi-Publisher / Multi-Subscriber (4P x 4S)

Payload	Go msg/s	Go MB/s	.NET msg/s	.NET MB/s	Ratio (.NET/Go)
128 B	1,870,855	228.4	1,892,631	231.0	1.01x

Core NATS — Request/Reply Latency

Single Client, Single Service

Payload	Go msg/s	.NET msg/s	Ratio
128 B	9,392	8,372	0.89x

10 Clients, 2 Services (Queue Group)

Payload	Go msg/s	.NET msg/s	Ratio
16 B	30,563	26,178	0.86x

JetStream — Publication

Mode	Payload	Storage	Go msg/s	.NET msg/s	Ratio (.NET/Go)
Synchronous	16 B	Memory	16,982	14,514	0.85x
Async (batch)	128 B	File	211,355	58,334	0.28x

JetStream — Consumption

Mode	Go msg/s	.NET msg/s	Ratio (.NET/Go)
Ordered ephemeral consumer	786,681	346,162	0.44x
Durable consumer fetch	711,203	542,250	0.76x

MQTT Throughput

Benchmark	Go msg/s	Go MB/s	.NET msg/s	.NET MB/s	Ratio (.NET/Go)
MQTT PubSub (128B, QoS 0)	36,913	4.5	48,755	6.0	1.32x
Cross-Protocol NATS→MQTT (128B)	407,487	49.7	287,946	35.1	0.71x

Transport Overhead

TLS

Benchmark	Go msg/s	Go MB/s	.NET msg/s	.NET MB/s	Ratio (.NET/Go)
TLS PubSub 1:1 (128B)	244,403	29.8	1,148,179	140.2	4.70x
TLS Pub-Only (128B)	3,224,490	393.6	1,246,351	152.1	0.39x

Note: TLS PubSub 1:1 shows .NET dramatically outperforming Go (4.70x). This appears to reflect .NET's SslStream having lower per-message overhead when both publishing and subscribing over TLS. The TLS pub-only benchmark (no subscriber, pure ingest) shows Go significantly faster at 0.39x, suggesting the Go server's raw TLS ingest throughput is higher but its read+deliver path has more overhead.

WebSocket

Benchmark	Go msg/s	Go MB/s	.NET msg/s	.NET MB/s	Ratio (.NET/Go)
WS PubSub 1:1 (128B)	44,783	5.5	40,793	5.0	0.91x
WS Pub-Only (128B)	118,898	14.5	100,522	12.3	0.85x

Hot Path Microbenchmarks (.NET only)

SubList

Benchmark	.NET msg/s	.NET MB/s
SubList Exact Match (128 subjects)	22,812,300	304.6
SubList Wildcard Match	17,626,363	235.3
SubList Queue Match	23,306,329	177.8
SubList Remote Interest	437,080	7.1

Parser

Benchmark	Ops/s	MB/s	Alloc
Parser PING	6,262,196	35.8	0.0 B/op
Parser PUB	2,663,706	101.6	40.0 B/op
Parser HPUB	2,213,655	118.2	40.0 B/op
Parser PUB split payload	2,100,256	80.1	176.0 B/op

FileStore

Benchmark	Ops/s	MB/s	Alloc
FileStore AppendAsync (128B)	275,438	33.6	1242.9 B/op
FileStore LoadLastBySubject (hot)	1,138,203	69.5	656.0 B/op
FileStore PurgeEx+Trim	647	0.1	5440579.9 B/op

Summary

Category	Ratio	Assessment
Pub-only throughput (16B)	0.74x	Stable across runs
Pub-only throughput (128B)	0.37x	Go significantly faster at larger payloads
Pub/sub 1:1 (16B)	0.66x	Go ahead; high variance at short durations
Pub/sub 1:1 (16KB)	0.79x	Reasonable gap
Fan-out 1:4	0.84x	Improved after Round 10 optimizations
Multi pub/sub 4x4	1.01x	At parity
Request/reply (single)	0.89x	Close to parity
Request/reply (10Cx2S)	0.86x	Close to parity
JetStream sync publish	0.85x	Close to parity
JetStream async file publish	0.28x	Storage-bound
JetStream ordered consume	0.44x	Significant gap
JetStream durable fetch	0.76x	Moderate gap
MQTT pub/sub	1.32x	.NET outperforms Go
MQTT cross-protocol	0.71x	Go ahead; high variance
TLS pub/sub	4.70x	.NET SslStream dramatically faster
TLS pub-only	0.39x	Go raw TLS ingest faster
WebSocket pub/sub	0.91x	Close to parity
WebSocket pub-only	0.85x	Close to parity

Key Observations

Multi pub/sub reached parity (1.01x) after Round 10 pre-formatted MSG headers. Fan-out improved to 0.84x.
TLS pub/sub shows a dramatic .NET advantage (4.70x) — .NET's SslStream has significantly lower overhead in the bidirectional pub/sub path. TLS pub-only (ingest only) still favors Go at 0.39x, suggesting the advantage is in the read-and-deliver path.
MQTT pub/sub remains a .NET strength at 1.32x. Cross-protocol (NATS→MQTT) dropped to 0.71x — this benchmark shows high variance across runs.
JetStream ordered consumer dropped to 0.44x compared to earlier runs (0.62x). This test completes in <100ms and shows high variance.
Single publisher 128B dropped to 0.37x (from 0.62x with smaller message counts). With 500K messages, this benchmark runs long enough for Go's goroutine scheduler and buffer management to reach steady state, widening the gap. The 16B variant is stable at 0.74x.
Request/reply throughput is stable at 0.86x–0.89x across all runs.

Optimization History

Round 10: Fan-Out Serial Path Optimization

Three optimizations making the serial fan-out path cheaper (fan-out 0.63x→0.84x, multi 0.65x→1.01x):

#	Root Cause	Fix	Impact
38	Per-delivery MSG header re-formatting — `SendMessageNoFlush` independently formats the entire MSG header line (prefix, subject copy, replyTo encoding, size formatting, CRLF) for every subscriber — but only the SID varies per delivery	Pre-build prefix (`MSG subject` ) and suffix ( `[reply] sizes\r\n`) once per publish; new `SendMessagePreformatted` writes prefix+sid+suffix directly into `_directBuf` — zero encoding, pure memory copies	Eliminates per-delivery replyTo encoding, size formatting, prefix/subject copying
39	Queue-group round-robin burns 2 Interlocked ops — `Interlocked.Increment(ref OutMsgs)` + `Interlocked.Decrement(ref OutMsgs)` per queue group just to pick an index	Replaced with non-atomic `uint QueueRoundRobin++` — safe because ProcessMessage runs single-threaded per publisher connection (the read loop)	Eliminates 2 interlocked ops per queue group per publish
40	`HashSet<INatsClient>` pcd overhead — hash computation + bucket lookup per Add for small fan-out counts (4 subscribers)	Replaced with `[ThreadStatic] INatsClient[]` + linear scan; O(n) but n≤16, faster than hash for small counts	Eliminates hash computation and internal array overhead
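
The pre-formatted header idea from item 38 can be sketched in Go as follows. This is not the .NET server's `SendMessagePreformatted`; it only illustrates the technique, and `buildMsgParts` and `appendDelivery` are hypothetical names. The wire format used is the standard NATS `MSG <subject> <sid> [reply-to] <#bytes>\r\n` line.

```go
package main

import (
	"fmt"
	"strconv"
)

// buildMsgParts formats the SID-independent pieces of a NATS MSG line once
// per publish. Wire format: MSG <subject> <sid> [reply-to] <#bytes>\r\n
func buildMsgParts(subject, reply string, payloadLen int) (prefix, suffix []byte) {
	prefix = append(prefix, "MSG "...)
	prefix = append(prefix, subject...)
	prefix = append(prefix, ' ')

	suffix = append(suffix, ' ')
	if reply != "" {
		suffix = append(suffix, reply...)
		suffix = append(suffix, ' ')
	}
	suffix = strconv.AppendInt(suffix, int64(payloadLen), 10)
	suffix = append(suffix, '\r', '\n')
	return prefix, suffix
}

// appendDelivery writes one delivery using only memory copies:
// prefix + sid + suffix + payload + CRLF.
func appendDelivery(dst, prefix, sid, suffix, payload []byte) []byte {
	dst = append(dst, prefix...)
	dst = append(dst, sid...)
	dst = append(dst, suffix...)
	dst = append(dst, payload...)
	return append(dst, '\r', '\n')
}

func main() {
	payload := []byte("hello")
	// Built once per publish, reused for every subscriber.
	prefix, suffix := buildMsgParts("orders.new", "", len(payload))

	var out []byte
	for _, sid := range [][]byte{[]byte("1"), []byte("2")} {
		out = appendDelivery(out, prefix, sid, suffix, payload)
	}
	fmt.Printf("%q\n", out)
}
```

Only the SID bytes differ between deliveries, so the per-subscriber cost reduces to a handful of copies.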

Round 9: Fan-Out & Multi Pub/Sub Hot-Path Optimization

Seven optimizations targeting the per-delivery hot path and benchmark harness configuration:

#	Root Cause	Fix	Impact
31	Benchmark harness built server in Debug — `DotNetServerProcess.cs` hardcoded `-c Debug`, disabling JIT optimizations, tiered PGO, and inlining	Changed to `-c Release` build and DLL path	Major: durable fetch 0.42x→0.92x, request-reply to parity
32	Per-delivery Interlocked on server-wide stats — `SendMessageNoFlush` did 2 `Interlocked` ops per delivery; fan-out 4 subs = 8 interlocked ops per publish	Moved server-wide stats to batch `Interlocked.Add` once after fan-out loop in `ProcessMessage`	Eliminates N×2 interlocked ops per publish
33	Auto-unsub tracking on every delivery — `Interlocked.Increment(ref sub.MessageCount)` on every delivery even when `MaxMessages == 0` (no limit — the common case)	Guarded with `if (sub.MaxMessages > 0)`	Eliminates 1 interlocked op per delivery in common case
34	Per-delivery SID ASCII encoding — `Encoding.ASCII.GetBytes(sid)` on every delivery; SID is a small integer that never changes	Added `Subscription.SidBytes` cached property; new `SendMessageNoFlush` overload accepts `ReadOnlySpan<byte>`	Eliminates per-delivery encoding
35	Per-delivery subject ASCII encoding — `Encoding.ASCII.GetBytes(subject)` for each subscriber; fan-out 4 = 4× encoding same subject	Pre-encode subject once in `ProcessMessage` before fan-out loop; new overload uses span copy	Eliminates N-1 subject encodings per publish
36	Per-publish subject string allocation — `Encoding.ASCII.GetString(cmd.Subject.Span)` on every PUB even when publishing to the same subject repeatedly	Added 1-element string cache per client; reuses string when subject bytes match	Eliminates string alloc for repeated subjects
37	Interlocked stats in SubList.Match hot path — `Interlocked.Increment(ref _matches)` and `_cacheHits` on every match call	Replaced with non-atomic increments (approximate counters for monitoring)	Eliminates 1-2 interlocked ops per match
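
Item 32 (batching the server-wide stats update) follows a general pattern: count in plain locals inside the fan-out loop, then publish the totals with one atomic add per counter. A minimal Go sketch of that pattern, with hypothetical type and function names that do not correspond to the .NET code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// serverStats stands in for server-wide counters shared by all connections.
type serverStats struct {
	outMsgs  atomic.Int64
	outBytes atomic.Int64
}

// snapshot reads both counters.
func (s *serverStats) snapshot() (msgs, nbytes int64) {
	return s.outMsgs.Load(), s.outBytes.Load()
}

// subscriber is a hypothetical stand-in for a delivery target.
type subscriber struct{ delivered int }

func (s *subscriber) deliver(payload []byte) { s.delivered++ }

// fanOut delivers payload to every subscriber and updates the shared
// counters once, instead of two atomic operations per delivery.
func fanOut(stats *serverStats, subs []*subscriber, payload []byte) {
	var msgs, nbytes int64 // plain locals: no atomics inside the loop
	for _, s := range subs {
		s.deliver(payload)
		msgs++
		nbytes += int64(len(payload))
	}
	stats.outMsgs.Add(msgs)
	stats.outBytes.Add(nbytes)
}

func main() {
	var stats serverStats
	subs := []*subscriber{{}, {}, {}, {}}
	fanOut(&stats, subs, make([]byte, 128))
	msgs, nbytes := stats.snapshot()
	fmt.Println(msgs, nbytes)
}
```

With N subscribers this replaces 2N atomic operations per publish with 2.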

Round 8: Ordered Consumer + Cross-Protocol Optimization

Three optimizations targeting pull consumer delivery and MQTT cross-protocol throughput:

#	Root Cause	Fix	Impact
28	Per-message flush signal in DeliverPullFetchMessagesAsync — `DeliverMessage` called `SendMessage` which triggered `_flushSignal.Writer.TryWrite(0)` per message; for batch of N messages, N flush signals and write-loop wakeups	Replaced with `SendMessageNoFlush` + batch flush every 64 messages + final flush after loop; bypasses `DeliverMessage` entirely (no permission check / auto-unsub needed for JS delivery inbox)	Reduces flush signals from N to N/64 per batch
29	5ms polling delay in pull consumer wait loop — `Task.Delay(5)` in `DeliverPullFetchMessagesAsync` and `PullConsumerEngine.WaitForMessageAsync` added up to 5ms latency per empty slot; for tail-following consumers, every new message waited up to 5ms to be noticed	Added `StreamHandle.NotifyPublish()` / `WaitForPublishAsync()` using `TaskCompletionSource` signaling; publishers call `NotifyPublish` after `AppendAsync`; consumers wait on signal with heartbeat-interval timeout	Eliminates polling delay; instant wakeup on publish
30	StringBuilder allocation in NatsToMqtt for common case — every uncached `NatsToMqtt` call allocated a StringBuilder even when no `_DOT_` escape sequences were present (the common case)	Added `string.Create` fast path that uses char replacement lambda when no `_DOT_` found; pre-warm topic bytes cache on MQTT subscription creation	Eliminates StringBuilder + string alloc for common case; no cache miss on first delivery
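
Item 29 replaces a fixed polling delay with a wake-up signal. The Go sketch below shows the same idea with a channel that is closed on publish; it is an illustration under assumptions, not a port of `StreamHandle.NotifyPublish()` / `WaitForPublishAsync()`, and all names are hypothetical. A consumer should take the current channel before it checks the store so that a publish arriving in between is not missed.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// publishSignal wakes waiting consumers when a message is appended.
type publishSignal struct {
	mu sync.Mutex
	ch chan struct{}
}

func newPublishSignal() *publishSignal {
	return &publishSignal{ch: make(chan struct{})}
}

// current returns the channel that the next notify call will close.
func (s *publishSignal) current() <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.ch
}

// notify wakes every waiter and arms a fresh channel for the next round.
func (s *publishSignal) notify() {
	s.mu.Lock()
	close(s.ch)
	s.ch = make(chan struct{})
	s.mu.Unlock()
}

// waitOn blocks until ch is closed or timeoutMs elapses. The timeout plays
// the role of the heartbeat interval; it reports whether a publish arrived.
func waitOn(ch <-chan struct{}, timeoutMs int) bool {
	timer := time.NewTimer(time.Duration(timeoutMs) * time.Millisecond)
	defer timer.Stop()
	select {
	case <-ch:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	sig := newPublishSignal()
	ch := sig.current() // taken before checking for messages
	go func() {
		time.Sleep(10 * time.Millisecond) // simulated publisher
		sig.notify()
	}()
	fmt.Println("woken by publish:", waitOn(ch, 1000))
	fmt.Println("woken without publish:", waitOn(sig.current(), 20))
}
```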

Round 7: MQTT Cross-Protocol Write Path

Four optimizations targeting the NATS→MQTT delivery hot path (cross-protocol throughput improved from 0.30x to 0.78x):

#	Root Cause	Fix	Impact
24	Per-message async fire-and-forget in MqttNatsClientAdapter — each `SendMessage` called `SendBinaryPublishAsync` which acquired a `SemaphoreSlim`, allocated a full PUBLISH packet `byte[]`, wrote, and flushed the stream — all per message, bypassing the server's deferred-flush batching	Replaced with synchronous `EnqueuePublishNoFlush()` that formats MQTT PUBLISH directly into `_directBuf` under SpinLock, matching the NatsClient pattern; `SignalFlush()` signals the write loop for batch flush	Eliminates async Task + SemaphoreSlim + per-message flush
25	Per-message `byte[]` allocation for MQTT PUBLISH packets — `MqttPacketWriter.WritePublish()` allocated topic bytes, variable header, remaining-length array, and full packet array on every delivery	Added `WritePublishTo(Span<byte>)` that formats the entire PUBLISH packet directly into the destination span using `Span<byte>` operations — zero heap allocation	Eliminates 4+ `byte[]` allocs per delivery
26	Per-message NATS→MQTT topic translation — `NatsToMqtt()` allocated a `StringBuilder`, produced a `string`, then `Encoding.UTF8.GetBytes()` re-encoded it on every delivery	Added `NatsToMqttBytes()` with bounded `ConcurrentDictionary<string, byte[]>` cache (4096 entries); cached result includes pre-encoded UTF-8 bytes	Eliminates string + encoding alloc per delivery for cached topics
27	Per-message `FlushAsync` on plain TCP sockets — `WriteBinaryAsync` flushed after every packet write, even on `NetworkStream` where TCP auto-flushes	Write loop skips `FlushAsync` for plain sockets; for TLS/wrapped streams, flushes once per batch (not per message)	Reduces syscalls from 2N to 1 per batch
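
Item 25 formats the PUBLISH packet directly into a destination buffer. A Go sketch of that approach for the simplest case (MQTT 3.1.1, QoS 0, no DUP or RETAIN flags) is shown here; `writePublishTo` is a hypothetical name and this is not the .NET `WritePublishTo(Span<byte>)` implementation.

```go
package main

import "fmt"

// writePublishTo encodes an MQTT 3.1.1 PUBLISH packet (QoS 0, no flags)
// into dst without allocating. It returns the number of bytes written,
// or false if the packet is too large or dst is too small.
func writePublishTo(dst []byte, topic string, payload []byte) (int, bool) {
	if len(topic) > 0xFFFF {
		return 0, false
	}
	// Remaining length = 2-byte topic length + topic + payload.
	remaining := 2 + len(topic) + len(payload)
	if remaining > 268435455 { // largest value the 4-byte varint can hold
		return 0, false
	}
	lenBytes := 1
	for v := remaining; v >= 128; v >>= 7 {
		lenBytes++
	}
	if len(dst) < 1+lenBytes+remaining {
		return 0, false
	}

	n := 0
	dst[n] = 0x30 // packet type PUBLISH (3) in the high nibble, flags 0
	n++
	v := remaining
	for { // variable-length encoding: 7 bits per byte, high bit = "more"
		b := byte(v & 0x7F)
		v >>= 7
		if v > 0 {
			b |= 0x80
		}
		dst[n] = b
		n++
		if v == 0 {
			break
		}
	}
	dst[n] = byte(len(topic) >> 8) // topic length, big-endian
	dst[n+1] = byte(len(topic))
	n += 2
	n += copy(dst[n:], topic)
	n += copy(dst[n:], payload)
	return n, true
}

func main() {
	buf := make([]byte, 64)
	n, ok := writePublishTo(buf, "a/b", []byte("hi"))
	fmt.Printf("%v % x\n", ok, buf[:n])
}
```

Because the caller supplies the buffer, the packet can be written straight into a per-client outbound buffer and flushed with the rest of the batch.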

Round 6: Batch Flush Signaling + Fetch Optimizations

Four optimizations targeting fan-out and consumer fetch hot paths:

#	Root Cause	Fix	Impact
20	Per-subscriber flush signal in fan-out — each `SendMessage` called `_flushSignal.Writer.TryWrite(0)` independently; for 1:4 fan-out, 4 channel writes + 4 write-loop wakeups per published message	Split `SendMessage` into `SendMessageNoFlush` + `SignalFlush`; `ProcessMessage` collects unique clients in `[ThreadStatic] HashSet<INatsClient>` (Go's `pcd` pattern), one flush signal per unique client after fan-out	Reduces channel writes from N to unique-client-count per publish
21	Per-fetch `CompiledFilter` allocation — `CompiledFilter.FromConfig(consumer.Config)` called on every fetch request, allocating a new filter object each time	Cached `CompiledFilter` on `ConsumerHandle` with staleness detection (reference + value check on filter config fields); reused across fetches	Eliminates per-fetch filter allocation
22	Per-message string interpolation in ack reply — `$"$JS.ACK.{stream}.{consumer}.1.{seq}.{deliverySeq}.{ts}.{pending}"` allocated intermediate strings and boxed numeric types on every delivery	Pre-compute `$"$JS.ACK.{stream}.{consumer}.1."` prefix before loop; use `stackalloc char[]` + `TryFormat` for numeric suffix — zero intermediate allocations	Eliminates 4+ string allocs per delivered message
23	Per-fetch `List<StoredMessage>` allocation — `new List<StoredMessage>(batch)` allocated on every `FetchAsync` call	`[ThreadStatic]` reusable list with `.Clear()` + capacity growth; `PullFetchBatch` snapshots via `.ToArray()` for safe handoff	Eliminates per-fetch list allocation
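
Item 22 avoids per-message string interpolation by computing the constant part of the ack reply subject once and appending only the numeric fields per message. A Go sketch of the same technique, using the subject layout quoted in the table; the function names and the reused buffer are assumptions for illustration:

```go
package main

import (
	"fmt"
	"strconv"
)

// ackPrefix builds the per-consumer part of the ack reply subject once,
// before the delivery loop.
func ackPrefix(stream, consumer string) []byte {
	return []byte("$JS.ACK." + stream + "." + consumer + ".1.")
}

// appendAck rebuilds the full subject into buf, reusing its capacity:
// prefix + seq.deliverySeq.ts.pending
func appendAck(buf, prefix []byte, seq, deliverySeq uint64, ts int64, pending uint64) []byte {
	buf = append(buf[:0], prefix...)
	buf = strconv.AppendUint(buf, seq, 10)
	buf = append(buf, '.')
	buf = strconv.AppendUint(buf, deliverySeq, 10)
	buf = append(buf, '.')
	buf = strconv.AppendInt(buf, ts, 10)
	buf = append(buf, '.')
	return strconv.AppendUint(buf, pending, 10)
}

func main() {
	prefix := ackPrefix("ORDERS", "c1")
	buf := make([]byte, 0, 128) // reused across the whole batch
	for seq := uint64(10); seq < 12; seq++ {
		buf = appendAck(buf, prefix, seq, seq-7, 1700000000, 12-seq)
		fmt.Println(string(buf))
	}
}
```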

Round 5: Non-blocking ConsumeAsync (ordered + durable consumers)

One root cause was identified and fixed in the MSG.NEXT request handling path:

#	Root Cause	Fix	Impact
19	Synchronous blocking in DeliverPullFetchMessages — `FetchAsync(...).GetAwaiter().GetResult()` blocked the client's read loop for the full `expires` timeout (30s). With `batch=1000` and only 5 messages available, the fetch polled for message 6 indefinitely. No messages were delivered until the timeout fired, causing the client to receive 0 messages before its own timeout.	Split into two paths: `noWait`/no-expires uses synchronous fetch (existing behavior for `FetchAsync` client); `expires > 0` spawns `DeliverPullFetchMessagesAsync` background task that delivers messages incrementally without blocking the read loop, with idle heartbeat support	Enables `ConsumeAsync` for both ordered and durable consumers; ordered consumer: 99K msg/s (0.64x Go)

Round 4: Per-Client Direct Write Buffer (pub/sub + fan-out + multi pub/sub)

Four optimizations were implemented in the message delivery hot path:

#	Root Cause	Fix	Impact
15	Per-message channel overhead — each `SendMessage` call went through `Channel<OutboundData>.TryWrite`, incurring lock contention and memory barriers	Replaced channel-based message delivery with per-client `_directBuf` byte array under `SpinLock`; messages written directly to contiguous buffer	Eliminates channel overhead per delivery
16	Per-message heap allocation for MSG header — `_outboundBufferPool.RentBuffer()` allocated a pooled `byte[]` for each MSG header	Replaced with `stackalloc byte[512]` — MSG header formatted entirely on the stack, then copied into `_directBuf`	Zero heap allocations per delivery
17	Per-message socket write — write loop issued one `SendAsync` per channel item, even with coalescing	Double-buffer swap: write loop swaps `_directBuf` ↔ `_writeBuf` under `SpinLock`, then writes the entire batch in a single `SendAsync`; zero allocation on swap	Single syscall per batch, zero-copy buffer reuse
18	Separate wake channels — `SendMessage` and `WriteProtocol` used different signaling paths	Unified on `_flushSignal` channel (bounded capacity 1, DropWrite); both paths signal the same channel, write loop drains both `_directBuf` and `_outbound` on each wake	Single wait point, no missed wakes
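
Items 17 and 18 combine a double-buffer swap with a capacity-1 wake-up channel that drops redundant signals. The Go sketch below illustrates both under simplifying assumptions: it uses a `sync.Mutex` where the source describes a `SpinLock`, and every name is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// outbound is a per-client write buffer with a single wake-up channel.
type outbound struct {
	mu        sync.Mutex // the source uses a SpinLock; a mutex keeps the sketch simple
	directBuf []byte     // producers append here
	writeBuf  []byte     // owned by the write loop between swaps
	flush     chan struct{}
}

func newOutbound() *outbound {
	// Capacity 1: at most one pending wake-up, extra signals are dropped.
	return &outbound{flush: make(chan struct{}, 1)}
}

// enqueue copies msg into the contiguous buffer without signaling.
func (o *outbound) enqueue(msg []byte) {
	o.mu.Lock()
	o.directBuf = append(o.directBuf, msg...)
	o.mu.Unlock()
}

// signalFlush wakes the write loop unless a wake-up is already pending.
func (o *outbound) signalFlush() {
	select {
	case o.flush <- struct{}{}:
	default:
	}
}

// swap hands the accumulated batch to the write loop and gives producers
// the previously written buffer to refill. The returned slice is only
// valid until the next call to swap.
func (o *outbound) swap() []byte {
	o.mu.Lock()
	batch := o.directBuf
	o.directBuf, o.writeBuf = o.writeBuf[:0], batch
	o.mu.Unlock()
	return batch
}

func main() {
	o := newOutbound()
	o.enqueue([]byte("MSG a 1 2\r\nhi\r\n"))
	o.enqueue([]byte("MSG a 2 2\r\nhi\r\n"))
	o.signalFlush()
	o.signalFlush() // dropped: a wake-up is already pending

	<-o.flush // write loop wakes once
	fmt.Printf("%q\n", o.swap())
}
```

The write loop would pass the returned batch to a single socket write, so the number of syscalls tracks the number of wake-ups rather than the number of messages.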

Round 3: Outbound Write Path (pub/sub + fan-out + fetch)

Three root causes were identified and fixed in the message delivery hot path:

#	Root Cause	Fix	Impact
12	Per-message `.ToArray()` allocation in SendMessage — `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber	Replaced `IMemoryOwner` rent/copy/dispose with direct `byte[]` from pool; write loop returns buffers after writing	Eliminates 1 heap alloc per delivery (4 per fan-out message)
13	Per-message `WriteAsync` in write loop — each queued message triggered a separate `_stream.WriteAsync()` system call	Added 64KB coalesce buffer; drain all pending messages into contiguous buffer, single `WriteAsync` per batch	Reduces syscalls from N to 1 per batch
14	Profiling `Stopwatch` on every message — `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture` even for non-JetStream messages	Removed profiling instrumentation from hot path	Eliminates ~200ns overhead per message

Round 2: FileStore AppendAsync Hot Path

#	Root Cause	Fix	Impact
6	Async state machine overhead — `AppendAsync` was `async ValueTask<ulong>` but never actually awaited	Changed to synchronous `ValueTask<ulong>` returning `ValueTask.FromResult(_last)`	Eliminates Task state machine allocation
7	Double payload copy — `TransformForPersist` allocated `byte[]` then `payload.ToArray()` created second copy for `StoredMessage`	Reuse `TransformForPersist` result directly for `StoredMessage.Payload` when no transform needed (`_noTransform` flag)	Eliminates 1 `byte[]` alloc per message
8	Unnecessary TTL work per publish — `ExpireFromWheel()` and `RegisterTtl()` called on every write even when `MaxAge=0`	Guarded both with `_options.MaxAgeMs > 0` check (matches Go: `filestore.go:4701`)	Eliminates hash wheel overhead when TTL not configured
9	Per-message MsgBlock cache allocation — `WriteAt` created `new MessageRecord` for `_cache` on every write	Removed eager cache population; reads now decode from pending buffer or disk	Eliminates 1 object alloc per message
10	Contiguous write buffer — `MsgBlock._pendingWrites` was `List<byte[]>` with per-message `byte[]` allocations	Replaced with single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it	Eliminates per-message `byte[]` encoding alloc; single `RandomAccess.Write` per flush
11	Pending buffer read path — `MsgBlock.Read()` flushed pending writes to disk before reading	Added in-memory read from `_pendingBuf` when data is still in the buffer	Avoids unnecessary disk flush on read-after-write

Round 1: FileStore/StreamManager Layer

#	Root Cause	Fix	Impact
1	Per-message synchronous disk I/O — `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message	Added write buffering in MsgBlock + background flush loop in FileStore (Go's `flushLoop` pattern: coalesce 16KB or 8ms)	Eliminates per-message syscall overhead
2	O(n) `GetStateAsync` per publish — `_messages.Keys.Min()` and `_messages.Values.Sum()` on every publish for MaxMsgs/MaxBytes checks	Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1)	Eliminates O(n) scan per publish
3	Unnecessary `LoadAsync` after every append — `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured	Made `LoadAsync` conditional on mirror/source replication being configured	Eliminates redundant disk read per publish
4	Redundant `PruneExpiredMessages` per publish — called before every publish even when `MaxAge=0`, and again inside `EnforceRuntimePolicies`	Guarded with `MaxAgeMs > 0` check; removed the pre-publish call (background expiry timer handles it)	Eliminates O(n) scan per publish
5	`PrunePerSubject` loading all messages per publish: `EnforceRuntimePolicies` → `PrunePerSubject` called `ListAsync().GroupBy()` even when `MaxMsgsPer=0`	Guarded with `MaxMsgsPer > 0` check	Eliminates O(n) scan per publish
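
Item 2 replaces a per-publish scan with counters that are updated on every mutation. A minimal Go sketch of that bookkeeping is shown below. It assumes sequences are assigned in increasing order, and its names are hypothetical rather than taken from the .NET `_messageCount`, `_totalBytes`, `_firstSeq` implementation.

```go
package main

import "fmt"

// streamState keeps running totals so limit checks never rescan the store.
type streamState struct {
	sizes      map[uint64]uint64 // sequence -> stored size in bytes
	count      uint64
	totalBytes uint64
	firstSeq   uint64 // 0 when the store is empty
}

func newStreamState() *streamState {
	return &streamState{sizes: make(map[uint64]uint64)}
}

// add records an append. Assumes seq is larger than any earlier sequence.
func (s *streamState) add(seq, size uint64) {
	s.sizes[seq] = size
	s.count++
	s.totalBytes += size
	if s.count == 1 {
		s.firstSeq = seq
	}
}

// remove updates the totals for a deleted message.
func (s *streamState) remove(seq uint64) {
	size, ok := s.sizes[seq]
	if !ok {
		return
	}
	delete(s.sizes, seq)
	s.count--
	s.totalBytes -= size
	if s.count == 0 {
		s.firstSeq = 0
		return
	}
	if seq == s.firstSeq {
		// Only removing the head moves firstSeq; skip gaps left by
		// earlier deletes. A live message exists because count > 0.
		for {
			s.firstSeq++
			if _, live := s.sizes[s.firstSeq]; live {
				break
			}
		}
	}
}

func main() {
	s := newStreamState()
	s.add(1, 100)
	s.add(2, 50)
	s.add(3, 25)
	s.remove(1)
	fmt.Println(s.count, s.totalBytes, s.firstSeq)
}
```

Reading the state is then a constant-time field access, at the cost of keeping every mutation path in sync with the counters.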

Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.

What would further close the gap

Change	Expected Impact	Go Reference
Single publisher ingest path (0.37x at 128B)	The pub-only path has the largest gap. Go's readLoop uses zero-copy buffer management with direct `[]byte` slicing; .NET parses into managed objects. Reducing allocations in the parser→ProcessMessage path would help.	Go: `client.go` readLoop, direct buffer slicing
JetStream async file publish (0.28x)	Storage-bound: FileStore AppendAsync bottleneck is synchronous `RandomAccess.Write` in flush loop and S2 compression overhead	Go: `filestore.go` uses `cache.buf`/`cache.idx` with mmap and goroutine-per-flush concurrency
JetStream ordered consumer (0.44x)	Pull consumer delivery pipeline has overhead in the fetch→deliver→ack cycle. The test completes in <100ms so numbers are noisy, but the gap is real.	Go: `consumer.go` delivery with direct buffer writes
Write-loop / socket write overhead	Fan-out (0.84x) and pub/sub (0.66x) gaps partly come from write-loop wakeup latency and socket write syscall overhead compared to Go's `writev()`	Go: `flushOutbound` uses `net.Buffers.WriteTo` → `writev()` with zero-copy buffer management
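
For context on the last row, Go's standard library exposes vectored writes through `net.Buffers`. The sketch below only demonstrates that API; it is not the `flushOutbound` code from the Go server, and the helper names are hypothetical. The `net` package documents that `WriteTo` may be turned into an OS-level batch write such as `writev` for certain connection types; with the in-memory sink used here it falls back to sequential writes.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net"
)

// flushBatch writes all queued chunks with one WriteTo call. On some
// connection types the net package performs this as a single vectored
// write. WriteTo consumes the buffers, so chunks must not be reused
// afterwards.
func flushBatch(w io.Writer, chunks [][]byte) (int64, error) {
	bufs := net.Buffers(chunks)
	return bufs.WriteTo(w)
}

// flushToString runs flushBatch against an in-memory sink so the result
// can be inspected without a real socket.
func flushToString(chunks [][]byte) (string, int64, error) {
	var sink bytes.Buffer
	n, err := flushBatch(&sink, chunks)
	return sink.String(), n, err
}

func main() {
	out, n, err := flushToString([][]byte{
		[]byte("MSG a 1 2\r\n"),
		[]byte("hi\r\n"),
	})
	fmt.Printf("%q %d %v\n", out, n, err)
}
```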
