# Go vs .NET NATS Server — Benchmark Comparison

Benchmark run: 2026-03-13 04:30 PM America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly.

Environment: Apple M4, .NET SDK 10.0.101, .NET server built and run in Release configuration (server GC, tiered PGO enabled), Go toolchain installed, Go reference server built from `golang/nats-server/`.
## Core NATS — Pub/Sub Throughput

### Single Publisher (no subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 16 B | 2,223,690 | 33.9 | 1,651,727 | 25.2 | 0.74x |
| 128 B | 2,218,308 | 270.8 | 1,368,967 | 167.1 | 0.62x |
### Publisher + Subscriber (1:1)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 16 B | 292,711 | 4.5 | 723,867 | 11.0 | 2.47x |
| 16 KB | 32,890 | 513.9 | 37,943 | 592.9 | 1.15x |
### Fan-Out (1 Publisher : 4 Subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 128 B | 2,945,790 | 359.6 | 1,848,130 | 225.6 | 0.63x |
Note: Fan-out numbers are within noise of prior round. The hot-path optimizations (batched stats, pre-encoded subject/SID bytes, auto-unsub guard) remove per-delivery overhead but the gap is now dominated by the serial fan-out loop itself.
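To make the serial-vs-parallel distinction concrete, here is a toy Go sketch of per-subscriber concurrent delivery. All names (`subscriber`, `fanOut`) are illustrative and belong to neither server's codebase; the real Go server batches into per-client outbound queues rather than spawning a goroutine per message, but the effect — fan-out work leaving the publisher's read loop — is the same.

```go
package main

import (
	"fmt"
	"sync"
)

// subscriber is a stand-in for a client with an outbound buffer.
type subscriber struct {
	mu  sync.Mutex
	buf []byte
}

// deliver appends one message to the subscriber's outbound buffer.
func (s *subscriber) deliver(msg []byte) {
	s.mu.Lock()
	s.buf = append(s.buf, msg...)
	s.mu.Unlock()
}

// fanOut delivers msg to all subscribers concurrently, one goroutine each,
// instead of looping serially on the publisher's read loop.
func fanOut(subs []*subscriber, msg []byte) {
	var wg sync.WaitGroup
	for _, s := range subs {
		wg.Add(1)
		go func(s *subscriber) {
			defer wg.Done()
			s.deliver(msg)
		}(s)
	}
	wg.Wait()
}

func main() {
	subs := []*subscriber{{}, {}, {}, {}}
	fanOut(subs, []byte("PING\r\n"))
	fmt.Println(len(subs[0].buf)) // 6
}
```

In the serial variant, publisher throughput is bounded by the sum of all per-subscriber delivery costs; in the concurrent variant it is bounded by the slowest single delivery.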
### Multi-Publisher / Multi-Subscriber (4P x 4S)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 128 B | 2,123,480 | 259.2 | 1,374,570 | 167.8 | 0.65x |
## Core NATS — Request/Reply Latency

### Single Client, Single Service

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---|---|---|---|---|---|---|---|
| 128 B | 8,386 | 7,424 | 0.89x | 115.8 | 139.0 | 175.5 | 193.0 |
### 10 Clients, 2 Services (Queue Group)

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---|---|---|---|---|---|---|---|
| 16 B | 26,470 | 26,620 | 1.01x | 370.2 | 376.0 | 486.0 | 592.8 |
## JetStream — Publication

| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| Synchronous | 16 B | Memory | 14,812 | 11,002 | 0.74x |
| Async (batch) | 128 B | File | 148,156 | 60,348 | 0.41x |
Note: Async file-store publish improved to 0.41x with Release build. Still storage-bound.
## JetStream — Consumption

| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---|---|---|
| Ordered ephemeral consumer | 166,000 | 102,369 | 0.62x |
| Durable consumer fetch | 510,000 | 468,252 | 0.92x |
Note: Ordered consumer improved to 0.62x (102K vs 166K). Durable fetch jumped to 0.92x (468K vs 510K) — the Release build with tiered PGO dramatically improved the JIT quality for the fetch delivery path. Go comparison numbers vary significantly across runs.
## MQTT Throughput

| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| MQTT PubSub (128B, QoS 0) | 34,224 | 4.2 | 47,341 | 5.8 | 1.38x |
| Cross-Protocol NATS→MQTT (128B) | 158,000 | 19.3 | 229,932 | 28.1 | 1.46x |
Note: Pure MQTT pub/sub extended its lead to 1.38x. Cross-protocol NATS→MQTT now at 1.46x — the Release build JIT further benefits the delivery path.
## Transport Overhead

### TLS

| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| TLS PubSub 1:1 (128B) | 289,548 | 35.3 | 254,834 | 31.1 | 0.88x |
| TLS Pub-Only (128B) | 1,782,442 | 217.6 | 877,149 | 107.1 | 0.49x |
### WebSocket

| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| WS PubSub 1:1 (128B) | 66,584 | 8.1 | 62,249 | 7.6 | 0.93x |
| WS Pub-Only (128B) | 106,302 | 13.0 | 85,878 | 10.5 | 0.81x |
Note: TLS pub/sub stable at 0.88x. WebSocket pub/sub at 0.93x. Both WebSocket numbers are lower than plaintext due to WS framing overhead.
## Hot Path Microbenchmarks (.NET only)

### SubList

| Benchmark | Ops/s | MB/s | Alloc |
|---|---|---|---|
| SubList Exact Match (128 subjects) | 19,285,510 | 257.5 | 0.00 B/op |
| SubList Wildcard Match | 18,876,330 | 252.0 | 0.00 B/op |
| SubList Queue Match | 20,639,153 | 157.5 | 0.00 B/op |
| SubList Remote Interest | 274,703 | 4.5 | 0.00 B/op |
### Parser

| Benchmark | Ops/s | MB/s | Alloc |
|---|---|---|---|
| Parser PING | 6,283,578 | 36.0 | 0.0 B/op |
| Parser PUB | 2,712,550 | 103.5 | 40.0 B/op |
| Parser HPUB | 2,338,555 | 124.9 | 40.0 B/op |
| Parser PUB split payload | 2,043,813 | 78.0 | 176.0 B/op |
### FileStore

| Benchmark | Ops/s | MB/s | Alloc |
|---|---|---|---|
| FileStore AppendAsync (128B) | 244,089 | 29.8 | 1552.9 B/op |
| FileStore LoadLastBySubject (hot) | 12,784,127 | 780.3 | 0.0 B/op |
| FileStore PurgeEx+Trim | 332 | 0.0 | 5440792.9 B/op |
## Summary

| Category | Ratio Range | Assessment |
|---|---|---|
| Pub-only throughput | 0.62x–0.74x | Improved with Release build |
| Pub/sub (small payload) | 2.47x | .NET outperforms Go decisively |
| Pub/sub (large payload) | 1.15x | .NET now exceeds parity |
| Fan-out | 0.63x | Serial fan-out loop is bottleneck |
| Multi pub/sub | 0.65x | Close to prior round |
| Request/reply latency | 0.89x–1.01x | Effectively at parity |
| JetStream sync publish | 0.74x | Run-to-run variance |
| JetStream async file publish | 0.41x | Storage-bound |
| JetStream ordered consume | 0.62x | Improved with Release build |
| JetStream durable fetch | 0.92x | Major improvement with Release build |
| MQTT pub/sub | 1.38x | .NET outperforms Go |
| MQTT cross-protocol | 1.46x | .NET strongly outperforms Go |
| TLS pub/sub | 0.88x | Close to parity |
| TLS pub-only | 0.49x | Variance / contention with other tests |
| WebSocket pub/sub | 0.93x | Close to parity |
| WebSocket pub-only | 0.81x | Good |
## Key Observations
- Switching the benchmark harness to Release build was the highest-impact change. Durable fetch jumped from 0.42x to 0.92x (468K vs 510K msg/s). Ordered consumer improved from 0.57x to 0.62x. Request-reply 10Cx2S reached parity at 1.01x. Large-payload pub/sub now exceeds Go at 1.15x.
- Small-payload 1:1 pub/sub remains a strong .NET lead at 2.47x (724K vs 293K msg/s).
- MQTT cross-protocol improved to 1.46x (230K vs 158K msg/s), up from 1.20x — the Release JIT further benefits the delivery path.
- Fan-out (0.63x) and multi pub/sub (0.65x) remain the largest gaps. The hot-path optimizations (batched stats, pre-encoded SID/subject, auto-unsub guard) removed per-delivery overhead, but the remaining gap is dominated by the serial fan-out loop itself — Go parallelizes fan-out delivery across goroutines.
- SubList Match microbenchmarks improved ~17% (19.3M vs 16.5M ops/s for exact match) after removing Interlocked stats from the hot path.
- TLS pub-only dropped to 0.49x this run, likely noise from co-running benchmarks contending on CPU. TLS pub/sub remains stable at 0.88x.
## Optimization History

### Round 9: Fan-Out & Multi Pub/Sub Hot-Path Optimization

Seven optimizations targeting the per-delivery hot path and benchmark harness configuration:
| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 31 | Benchmark harness built server in Debug — `DotNetServerProcess.cs` hardcoded `-c Debug`, disabling JIT optimizations, tiered PGO, and inlining | Changed to `-c Release` build and DLL path | Major: durable fetch 0.42x→0.92x, request-reply to parity |
| 32 | Per-delivery `Interlocked` on server-wide stats — `SendMessageNoFlush` did 2 `Interlocked` ops per delivery; fan-out to 4 subs = 8 interlocked ops per publish | Moved server-wide stats to a batched `Interlocked.Add` once after the fan-out loop in `ProcessMessage` | Eliminates N×2 interlocked ops per publish |
| 33 | Auto-unsub tracking on every delivery — `Interlocked.Increment(ref sub.MessageCount)` on every delivery even when `MaxMessages == 0` (no limit — the common case) | Guarded with `if (sub.MaxMessages > 0)` | Eliminates 1 interlocked op per delivery in the common case |
| 34 | Per-delivery SID ASCII encoding — `Encoding.ASCII.GetBytes(sid)` on every delivery; the SID is a small integer that never changes | Added `Subscription.SidBytes` cached property; new `SendMessageNoFlush` overload accepts `ReadOnlySpan<byte>` | Eliminates per-delivery encoding |
| 35 | Per-delivery subject ASCII encoding — `Encoding.ASCII.GetBytes(subject)` for each subscriber; fan-out of 4 = 4× encoding the same subject | Pre-encode the subject once in `ProcessMessage` before the fan-out loop; new overload uses a span copy | Eliminates N−1 subject encodings per publish |
| 36 | Per-publish subject string allocation — `Encoding.ASCII.GetString(cmd.Subject.Span)` on every PUB even when publishing to the same subject repeatedly | Added a 1-element string cache per client; reuses the string when subject bytes match | Eliminates string alloc for repeated subjects |
| 37 | `Interlocked` stats in `SubList.Match` hot path — `Interlocked.Increment(ref _matches)` and `_cacheHits` on every match call | Replaced with non-atomic increments (approximate counters for monitoring) | Eliminates 1–2 interlocked ops per match |
### Round 8: Ordered Consumer + Cross-Protocol Optimization

Three optimizations targeting pull consumer delivery and MQTT cross-protocol throughput:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 28 | Per-message flush signal in `DeliverPullFetchMessagesAsync` — `DeliverMessage` called `SendMessage`, which triggered `_flushSignal.Writer.TryWrite(0)` per message; for a batch of N messages, N flush signals and write-loop wakeups | Replaced with `SendMessageNoFlush` + batch flush every 64 messages + final flush after the loop; bypasses `DeliverMessage` entirely (no permission check / auto-unsub needed for the JS delivery inbox) | Reduces flush signals from N to N/64 per batch |
| 29 | 5ms polling delay in pull consumer wait loop — `Task.Delay(5)` in `DeliverPullFetchMessagesAsync` and `PullConsumerEngine.WaitForMessageAsync` added up to 5ms latency per empty slot; for tail-following consumers, every new message waited up to 5ms to be noticed | Added `StreamHandle.NotifyPublish()` / `WaitForPublishAsync()` using `TaskCompletionSource` signaling; publishers call `NotifyPublish` after `AppendAsync`; consumers wait on the signal with a heartbeat-interval timeout | Eliminates polling delay; instant wakeup on publish |
| 30 | `StringBuilder` allocation in `NatsToMqtt` for the common case — every uncached `NatsToMqtt` call allocated a `StringBuilder` even when no `_DOT_` escape sequences were present (the common case) | Added a `string.Create` fast path that uses a char-replacement lambda when no `_DOT_` is found; pre-warm the topic bytes cache on MQTT subscription creation | Eliminates `StringBuilder` + string alloc for the common case; no cache miss on first delivery |
### Round 7: MQTT Cross-Protocol Write Path

Four optimizations targeting the NATS→MQTT delivery hot path (cross-protocol throughput improved from 0.30x to 0.78x):

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 24 | Per-message async fire-and-forget in `MqttNatsClientAdapter` — each `SendMessage` called `SendBinaryPublishAsync`, which acquired a `SemaphoreSlim`, allocated a full PUBLISH packet `byte[]`, wrote, and flushed the stream — all per message, bypassing the server's deferred-flush batching | Replaced with a synchronous `EnqueuePublishNoFlush()` that formats the MQTT PUBLISH directly into `_directBuf` under a `SpinLock`, matching the `NatsClient` pattern; `SignalFlush()` signals the write loop for batch flush | Eliminates async Task + `SemaphoreSlim` + per-message flush |
| 25 | Per-message `byte[]` allocation for MQTT PUBLISH packets — `MqttPacketWriter.WritePublish()` allocated topic bytes, variable header, remaining-length array, and the full packet array on every delivery | Added `WritePublishTo(Span<byte>)` that formats the entire PUBLISH packet directly into the destination span using `Span<byte>` operations — zero heap allocation | Eliminates 4+ `byte[]` allocs per delivery |
| 26 | Per-message NATS→MQTT topic translation — `NatsToMqtt()` allocated a `StringBuilder`, produced a string, then `Encoding.UTF8.GetBytes()` re-encoded it on every delivery | Added `NatsToMqttBytes()` with a bounded `ConcurrentDictionary<string, byte[]>` cache (4096 entries); the cached result includes pre-encoded UTF-8 bytes | Eliminates string + encoding alloc per delivery for cached topics |
| 27 | Per-message `FlushAsync` on plain TCP sockets — `WriteBinaryAsync` flushed after every packet write, even on `NetworkStream` where TCP auto-flushes | Write loop skips `FlushAsync` for plain sockets; for TLS/wrapped streams, flushes once per batch (not per message) | Reduces syscalls from 2N to 1 per batch |
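To illustrate the zero-allocation formatting idea behind fix #25, here is a simplified Go sketch that encodes a QoS-0 MQTT 3.1.1 PUBLISH packet directly into a caller-supplied buffer (the real fix is C# `Span<byte>` code). `writePublishTo` is an invented name; the sketch omits QoS > 0 packet IDs and MQTT 5 properties.

```go
package main

import "fmt"

// writePublishTo formats a minimal MQTT 3.1.1 QoS-0 PUBLISH packet into dst
// and returns the number of bytes written, allocating nothing on the heap.
// dst must be large enough for the whole packet.
func writePublishTo(dst []byte, topic string, payload []byte) int {
	remaining := 2 + len(topic) + len(payload)
	n := 0
	dst[n] = 0x30 // packet type PUBLISH, QoS 0, no DUP/RETAIN
	n++
	// MQTT variable-length "remaining length" encoding (7 bits per byte)
	for {
		b := byte(remaining % 128)
		remaining /= 128
		if remaining > 0 {
			b |= 0x80
		}
		dst[n] = b
		n++
		if remaining == 0 {
			break
		}
	}
	// topic as a 2-byte big-endian length-prefixed UTF-8 string
	dst[n] = byte(len(topic) >> 8)
	dst[n+1] = byte(len(topic))
	n += 2
	n += copy(dst[n:], topic)
	n += copy(dst[n:], payload)
	return n
}

func main() {
	buf := make([]byte, 512)
	n := writePublishTo(buf, "sensors/temp", []byte("21.5"))
	fmt.Println(n, buf[0] == 0x30) // 20 true
}
```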
### Round 6: Batch Flush Signaling + Fetch Optimizations

Four optimizations targeting fan-out and consumer fetch hot paths:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 20 | Per-subscriber flush signal in fan-out — each `SendMessage` called `_flushSignal.Writer.TryWrite(0)` independently; for 1:4 fan-out, 4 channel writes + 4 write-loop wakeups per published message | Split `SendMessage` into `SendMessageNoFlush` + `SignalFlush`; `ProcessMessage` collects unique clients in a `[ThreadStatic] HashSet<INatsClient>` (Go's pcd pattern), issuing one flush signal per unique client after fan-out | Reduces channel writes from N to unique-client-count per publish |
| 21 | Per-fetch `CompiledFilter` allocation — `CompiledFilter.FromConfig(consumer.Config)` called on every fetch request, allocating a new filter object each time | Cached `CompiledFilter` on `ConsumerHandle` with staleness detection (reference + value check on filter config fields); reused across fetches | Eliminates per-fetch filter allocation |
| 22 | Per-message string interpolation in ack reply — `$"$JS.ACK.{stream}.{consumer}.1.{seq}.{deliverySeq}.{ts}.{pending}"` allocated intermediate strings and boxed numeric types on every delivery | Pre-compute the `$"$JS.ACK.{stream}.{consumer}.1."` prefix before the loop; use `stackalloc char[]` + `TryFormat` for the numeric suffix — zero intermediate allocations | Eliminates 4+ string allocs per delivered message |
| 23 | Per-fetch `List<StoredMessage>` allocation — `new List<StoredMessage>(batch)` allocated on every `FetchAsync` call | `[ThreadStatic]` reusable list with `.Clear()` + capacity growth; `PullFetchBatch` snapshots via `.ToArray()` for safe handoff | Eliminates per-fetch list allocation |
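Fix #20's flush dedup (the pcd pattern the table mentions) can be sketched in Go as follows. Types and names are illustrative; the essential move is signaling once per unique destination client per publish, not once per delivered message, so two subscriptions on the same connection cost one wakeup.

```go
package main

import "fmt"

// client is a stand-in for a connection whose write loop is woken by a flush signal.
type client struct{ flushes int }

func (c *client) signalFlush() { c.flushes++ }

// fanOutWithDedup delivers `deliveries` publishes to subs, collecting each
// unique destination client during the fan-out loop and signaling one flush
// per unique client afterward.
func fanOutWithDedup(subs []*client, deliveries int) {
	pending := make(map[*client]struct{}, len(subs))
	for i := 0; i < deliveries; i++ {
		for _, c := range subs {
			// per-subscriber message delivery into c's buffer would happen here
			pending[c] = struct{}{}
		}
		for c := range pending {
			c.signalFlush() // one wakeup per unique client per publish
		}
		for c := range pending {
			delete(pending, c) // reuse the set across publishes
		}
	}
}

func main() {
	a, b := &client{}, &client{}
	// two subscriptions on client a still cause only one flush per publish
	fanOutWithDedup([]*client{a, a, b}, 10)
	fmt.Println(a.flushes, b.flushes) // 10 10
}
```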
### Round 5: Non-blocking ConsumeAsync (ordered + durable consumers)

One root cause was identified and fixed in the MSG.NEXT request handling path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 19 | Synchronous blocking in `DeliverPullFetchMessages` — `FetchAsync(...).GetAwaiter().GetResult()` blocked the client's read loop for the full expires timeout (30s). With batch=1000 and only 5 messages available, the fetch polled for message 6 indefinitely. No messages were delivered until the timeout fired, causing the client to receive 0 messages before its own timeout. | Split into two paths: noWait/no-expires uses synchronous fetch (existing behavior for the `FetchAsync` client); expires > 0 spawns a `DeliverPullFetchMessagesAsync` background task that delivers messages incrementally without blocking the read loop, with idle heartbeat support | Enables `ConsumeAsync` for both ordered and durable consumers; ordered consumer: 99K msg/s (0.64x Go) |
### Round 4: Per-Client Direct Write Buffer (pub/sub + fan-out + multi pub/sub)

Four optimizations were implemented in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 15 | Per-message channel overhead — each `SendMessage` call went through `Channel<OutboundData>.TryWrite`, incurring lock contention and memory barriers | Replaced channel-based message delivery with a per-client `_directBuf` byte array under a `SpinLock`; messages are written directly to a contiguous buffer | Eliminates channel overhead per delivery |
| 16 | Per-message heap allocation for the MSG header — `_outboundBufferPool.RentBuffer()` allocated a pooled `byte[]` for each MSG header | Replaced with `stackalloc byte[512]` — the MSG header is formatted entirely on the stack, then copied into `_directBuf` | Zero heap allocations per delivery |
| 17 | Per-message socket write — the write loop issued one `SendAsync` per channel item, even with coalescing | Double-buffer swap: the write loop swaps `_directBuf` ↔ `_writeBuf` under the `SpinLock`, then writes the entire batch in a single `SendAsync`; zero allocation on swap | Single syscall per batch, zero-copy buffer reuse |
| 18 | Separate wake channels — `SendMessage` and `WriteProtocol` used different signaling paths | Unified on the `_flushSignal` channel (bounded capacity 1, DropWrite); both paths signal the same channel, and the write loop drains both `_directBuf` and `_outbound` on each wake | Single wait point, no missed wakes |
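The double-buffer swap from fix #17 can be sketched like this. Illustrative Go with invented names (`outbound`, `swapAndTake`); a `sync.Mutex` stands in for the C# `SpinLock`, and the write loop is assumed to finish writing a batch before the next swap.

```go
package main

import (
	"fmt"
	"sync"
)

// outbound double-buffers per-client writes: producers append to directBuf
// under a short lock; the write loop swaps in the spare buffer and writes
// the whole accumulated batch in one call.
type outbound struct {
	mu        sync.Mutex
	directBuf []byte
	writeBuf  []byte
}

// enqueue appends one formatted message to the active buffer.
func (o *outbound) enqueue(msg []byte) {
	o.mu.Lock()
	o.directBuf = append(o.directBuf, msg...)
	o.mu.Unlock()
}

// swapAndTake returns the accumulated batch and installs the (reused) spare
// buffer for the next round — zero allocation on swap.
func (o *outbound) swapAndTake() []byte {
	o.mu.Lock()
	batch := o.directBuf
	o.directBuf = o.writeBuf[:0]
	o.writeBuf = batch
	o.mu.Unlock()
	return batch
}

func main() {
	var o outbound
	o.enqueue([]byte("MSG a 1 2\r\nhi\r\n"))
	o.enqueue([]byte("MSG a 1 2\r\nyo\r\n"))
	batch := o.swapAndTake() // one syscall would write this entire batch
	fmt.Println(len(batch))  // 30
}
```

The design trades a short critical section on every enqueue for a single write syscall per batch, which is exactly the N-to-1 syscall reduction the table describes.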
### Round 3: Outbound Write Path (pub/sub + fan-out + fetch)

Three root causes were identified and fixed in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 12 | Per-message `.ToArray()` allocation in `SendMessage` — `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber | Replaced `IMemoryOwner` rent/copy/dispose with a direct `byte[]` from the pool; the write loop returns buffers after writing | Eliminates 1 heap alloc per delivery (4 per fan-out message) |
| 13 | Per-message `WriteAsync` in the write loop — each queued message triggered a separate `_stream.WriteAsync()` system call | Added a 64KB coalesce buffer; drain all pending messages into the contiguous buffer, single `WriteAsync` per batch | Reduces syscalls from N to 1 per batch |
| 14 | Profiling `Stopwatch` on every message — `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture`, even for non-JetStream messages | Removed profiling instrumentation from the hot path | Eliminates ~200ns overhead per message |
### Round 2: FileStore AppendAsync Hot Path

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 6 | Async state machine overhead — `AppendAsync` was `async ValueTask<ulong>` but never actually awaited | Changed to a synchronous `ValueTask<ulong>` returning `ValueTask.FromResult(_last)` | Eliminates Task state machine allocation |
| 7 | Double payload copy — `TransformForPersist` allocated a `byte[]`, then `payload.ToArray()` created a second copy for `StoredMessage` | Reuse the `TransformForPersist` result directly for `StoredMessage.Payload` when no transform is needed (`_noTransform` flag) | Eliminates 1 `byte[]` alloc per message |
| 8 | Unnecessary TTL work per publish — `ExpireFromWheel()` and `RegisterTtl()` called on every write even when MaxAge=0 | Guarded both with an `_options.MaxAgeMs > 0` check (matches Go: filestore.go:4701) | Eliminates hash wheel overhead when TTL is not configured |
| 9 | Per-message MsgBlock cache allocation — `WriteAt` created a new `MessageRecord` for `_cache` on every write | Removed eager cache population; reads now decode from the pending buffer or disk | Eliminates 1 object alloc per message |
| 10 | Contiguous write buffer — `MsgBlock._pendingWrites` was a `List<byte[]>` with per-message `byte[]` allocations | Replaced with a single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it | Eliminates per-message `byte[]` encoding alloc; single `RandomAccess.Write` per flush |
| 11 | Pending buffer read path — `MsgBlock.Read()` flushed pending writes to disk before reading | Added in-memory read from `_pendingBuf` when the data is still in the buffer | Avoids unnecessary disk flush on read-after-write |
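Fixes #10 and Round 1's flush-loop coalescing share one idea: encode records into a single growing buffer and flush it in one write once a size threshold is crossed. A Go sketch with an invented record layout (8-byte sequence + 4-byte length + payload) and the 16KB threshold the Round 1 description mentions; nothing here is the server's actual format.

```go
package main

import "fmt"

// flushThreshold mirrors the 16KB coalescing target described in Round 1.
const flushThreshold = 16 * 1024

// msgBlock accumulates encoded records in one contiguous buffer instead of
// allocating a byte slice per message.
type msgBlock struct {
	pendingBuf []byte
	flushes    int
}

// append encodes one record directly into pendingBuf and flushes when the
// buffer crosses the threshold.
func (b *msgBlock) append(seq uint64, payload []byte) {
	var hdr [12]byte
	for i := 0; i < 8; i++ {
		hdr[i] = byte(seq >> (56 - 8*i)) // big-endian sequence
	}
	n := uint32(len(payload))
	hdr[8], hdr[9], hdr[10], hdr[11] = byte(n>>24), byte(n>>16), byte(n>>8), byte(n)
	b.pendingBuf = append(b.pendingBuf, hdr[:]...)
	b.pendingBuf = append(b.pendingBuf, payload...)
	if len(b.pendingBuf) >= flushThreshold {
		b.flush()
	}
}

// flush writes the whole pending batch in one call (a single pwrite in the
// real store) and resets the buffer for reuse.
func (b *msgBlock) flush() {
	if len(b.pendingBuf) == 0 {
		return
	}
	b.flushes++
	b.pendingBuf = b.pendingBuf[:0]
}

func main() {
	var b msgBlock
	payload := make([]byte, 128)
	for i := uint64(1); i <= 1000; i++ {
		b.append(i, payload) // 140 bytes per record
	}
	b.flush()
	fmt.Println(b.flushes) // 9: eight threshold flushes plus the final one
}
```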
### Round 1: FileStore/StreamManager Layer

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 1 | Per-message synchronous disk I/O — `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in `MsgBlock` + a background flush loop in `FileStore` (Go's flushLoop pattern: coalesce 16KB or 8ms) | Eliminates per-message syscall overhead |
| 2 | O(n) `GetStateAsync` per publish — `_messages.Keys.Min()` and `_messages.Values.Sum()` on every publish for MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1) | Eliminates O(n) scan per publish |
| 3 | Unnecessary `LoadAsync` after every append — `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates redundant disk read per publish |
| 4 | Redundant `PruneExpiredMessages` per publish — called before every publish even when MaxAge=0, and again inside `EnforceRuntimePolicies` | Guarded with a `MaxAgeMs > 0` check; removed the pre-publish call (the background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | `PrunePerSubject` loading all messages per publish — `EnforceRuntimePolicies` → `PrunePerSubject` called `ListAsync().GroupBy()` even when MaxMsgsPer=0 | Guarded with a `MaxMsgsPer > 0` check | Eliminates O(n) scan per publish |
Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.
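Fix #2's incremental-counter idea is simple enough to sketch directly. Illustrative Go with invented field names echoing the description (`messageCount`, `totalBytes`, `firstSeq`); every mutation path updates the counters so limit checks never scan the message map.

```go
package main

import "fmt"

// streamState keeps O(1) running totals for MaxMsgs/MaxBytes enforcement,
// replacing per-publish Min()/Sum() scans over all stored messages.
type streamState struct {
	messageCount uint64
	totalBytes   uint64
	firstSeq     uint64
	lastSeq      uint64
}

// onAppend records one stored message of the given size.
func (s *streamState) onAppend(size int) {
	if s.messageCount == 0 {
		s.firstSeq = s.lastSeq + 1
	}
	s.lastSeq++
	s.messageCount++
	s.totalBytes += uint64(size)
}

// onRemoveFirst records eviction of the oldest message (e.g. a MaxMsgs limit).
func (s *streamState) onRemoveFirst(size int) {
	s.firstSeq++
	s.messageCount--
	s.totalBytes -= uint64(size)
}

func main() {
	var s streamState
	for i := 0; i < 3; i++ {
		s.onAppend(128)
	}
	s.onRemoveFirst(128)
	fmt.Println(s.messageCount, s.totalBytes, s.firstSeq, s.lastSeq) // 2 256 2 3
}
```

The cost moves from O(n) per publish to O(1) per mutation, at the price of keeping every removal path honest about updating the counters.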
## What would further close the gap

| Change | Expected Impact | Go Reference |
|---|---|---|
| Fan-out parallelism | Deliver to subscribers concurrently instead of serially from the publisher's read loop — this is now the primary bottleneck behind the 0.63x fan-out gap | Go: `processMsgResults` fans out per-client via goroutines |
| Eliminate per-message GC allocations in FileStore | ~30% improvement on FileStore AppendAsync — replace the `StoredMessage` class with a `StoredMessageMeta` struct in the `_messages` dict, reconstructing the full message from `MsgBlock` on read | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs; 80+ sites in FileStore.cs need migration |
| Single publisher throughput | 0.62x–0.74x gap; the pub-only path has no fan-out overhead — likely JIT/GC/socket write overhead in the ingest path | Go: `client.go` readLoop with zero-copy buffer management |