Go vs .NET NATS Server — Benchmark Comparison

Benchmark run: 2026-03-13 10:16 AM America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly.

Environment: Apple M4, .NET SDK 10.0.101, benchmark README command run in the benchmark project's default Debug configuration, Go toolchain installed, Go reference server built from golang/nats-server/.



Core NATS — Pub/Sub Throughput

Single Publisher (no subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
| --- | --- | --- | --- | --- | --- |
| 16 B | 2,258,647 | 34.5 | 1,275,230 | 19.5 | 0.56x |
| 128 B | 2,251,274 | 274.8 | 1,661,668 | 202.8 | 0.74x |

Publisher + Subscriber (1:1)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
| --- | --- | --- | --- | --- | --- |
| 16 B | 296,374 | 4.5 | 875,105 | 13.4 | 2.95x |
| 16 KB | 32,111 | 501.7 | 30,030 | 469.2 | 0.94x |

Fan-Out (1 Publisher : 4 Subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
| --- | --- | --- | --- | --- | --- |
| 128 B | 2,387,889 | 291.5 | 1,780,888 | 217.4 | 0.75x |

Multi-Publisher / Multi-Subscriber (4P x 4S)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
| --- | --- | --- | --- | --- | --- |
| 128 B | 1,079,112 | 131.7 | 953,596 | 116.4 | 0.88x |

Core NATS — Request/Reply Latency

Single Client, Single Service

| Payload | Go msg/s | .NET msg/s | Ratio (.NET/Go) | Go P50 (µs) | .NET P50 (µs) | Go P99 (µs) | .NET P99 (µs) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 128 B | 8,506 | 7,182 | 0.84x | 114.9 | 135.2 | 161.2 | 189.8 |

10 Clients, 2 Services (Queue Group)

| Payload | Go msg/s | .NET msg/s | Ratio (.NET/Go) | Go P50 (µs) | .NET P50 (µs) | Go P99 (µs) | .NET P99 (µs) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 16 B | 26,610 | 22,533 | 0.85x | 367.7 | 425.3 | 487.4 | 622.5 |

JetStream — Publication

| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
| --- | --- | --- | --- | --- | --- |
| Synchronous | 16 B | Memory | 13,756 | 9,954 | 0.72x |
| Async (batch) | 128 B | File | 171,761 | 50,711 | 0.30x |

Note: Async file-store publish remains the largest JetStream gap at 0.30x. The bottleneck is still the storage write path and the remaining managed allocation pressure around persisted message state.


JetStream — Consumption

| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
| --- | --- | --- | --- |
| Ordered ephemeral consumer | 135,704 | 107,168 | 0.79x |
| Durable consumer fetch | 533,441 | 375,652 | 0.70x |

Note: Ordered-consumer results in this run are much closer to parity than earlier snapshots. That suggests prior Go-side variance was material; .NET throughput is still clustered around ~107K msg/s.


Hot Path Microbenchmarks (.NET only)

SubList

| Benchmark | .NET msg/s | .NET MB/s | Alloc |
| --- | --- | --- | --- |
| SubList Exact Match (128 subjects) | 17,746,607 | 236.9 | 0.00 B/op |
| SubList Wildcard Match | 18,811,278 | 251.2 | 0.00 B/op |
| SubList Queue Match | 20,624,510 | 157.4 | 0.00 B/op |
| SubList Remote Interest | 264,725 | 4.3 | 0.00 B/op |

Parser

| Benchmark | Ops/s | MB/s | Alloc |
| --- | --- | --- | --- |
| Parser PING | 5,598,176 | 32.0 | 0.0 B/op |
| Parser PUB | 2,701,645 | 103.1 | 40.0 B/op |
| Parser HPUB | 2,177,745 | 116.3 | 40.0 B/op |
| Parser PUB split payload | 1,702,439 | 64.9 | 176.0 B/op |

Summary

| Category | Ratio Range | Assessment |
| --- | --- | --- |
| Pub-only throughput | 0.56x–0.74x | Mixed: 128 B is solid, 16 B still trails materially |
| Pub/sub (small payload) | 2.95x | .NET outperforms Go decisively |
| Pub/sub (large payload) | 0.94x | Near parity |
| Fan-out | 0.75x | Good improvement; still limited by serial delivery |
| Multi pub/sub | 0.88x | Close to parity in this run |
| Request/reply latency | 0.84x–0.85x | Good |
| JetStream sync publish | 0.72x | Good |
| JetStream async file publish | 0.30x | Storage write path still dominates |
| JetStream ordered consume | 0.79x | Much closer to parity in this run |
| JetStream durable fetch | 0.70x | Good |

Key Observations

  1. Small-payload 1:1 pub/sub still beats Go by ~3x (875K vs 296K msg/s). The direct write path continues to pay off when message fan-out is simple and payloads are tiny.
  2. Fan-out and multi pub/sub both improved in this run, to 0.75x and 0.88x respectively. The remaining gap is still consistent with Go's more naturally parallel fan-out model.
  3. Ordered consumer moved up to 0.79x (107K vs 136K msg/s). That is materially stronger than earlier runs and suggests previous Go-side variance was distorting the comparison more than the .NET consumer path itself.
  4. Durable fetch remains solid at 0.70x. The Round 6 fetch-path work is still holding, but there is room left in consumer dispatch and storage reads.
  5. Async file-store publish is still the largest server-level gap at 0.30x. The storage layer remains the highest-value runtime target after parser and SubList hot-path cleanup.
  6. The new SubList microbenchmarks show effectively zero temporary allocation per operation for exact, wildcard, queue, and remote-interest lookups in the current implementation. Parser contiguous hot paths also remain small and stable, while split-payload PUB still pays a higher copy cost.

Optimization History

Round 6: Batch Flush Signaling + Fetch Optimizations

Four optimizations targeting fan-out and consumer fetch hot paths:

| # | Root Cause | Fix | Impact |
| --- | --- | --- | --- |
| 20 | Per-subscriber flush signal in fan-out: each `SendMessage` called `_flushSignal.Writer.TryWrite(0)` independently; for 1:4 fan-out, 4 channel writes + 4 write-loop wakeups per published message | Split `SendMessage` into `SendMessageNoFlush` + `SignalFlush`; `ProcessMessage` collects unique clients in `[ThreadStatic] HashSet<INatsClient>` (Go's `pcd` pattern), one flush signal per unique client after fan-out | Reduces channel writes from N to unique-client-count per publish |
| 21 | Per-fetch `CompiledFilter` allocation: `CompiledFilter.FromConfig(consumer.Config)` called on every fetch request, allocating a new filter object each time | Cached `CompiledFilter` on `ConsumerHandle` with staleness detection (reference + value check on filter config fields); reused across fetches | Eliminates per-fetch filter allocation |
| 22 | Per-message string interpolation in ack reply: `$"$JS.ACK.{stream}.{consumer}.1.{seq}.{deliverySeq}.{ts}.{pending}"` allocated intermediate strings and boxed numeric types on every delivery | Pre-compute `$"$JS.ACK.{stream}.{consumer}.1."` prefix before loop; use `stackalloc char[]` + `TryFormat` for numeric suffix, with zero intermediate allocations | Eliminates 4+ string allocs per delivered message |
| 23 | Per-fetch `List<StoredMessage>` allocation: `new List<StoredMessage>(batch)` allocated on every `FetchAsync` call | `[ThreadStatic]` reusable list with `.Clear()` + capacity growth; `PullFetchBatch` snapshots via `.ToArray()` for safe handoff | Eliminates per-fetch list allocation |
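
The batched flush signaling in row 20 can be illustrated with a small sketch. It is written in Go rather than C#, and the `client` type, its fields, and `deliver` are hypothetical stand-ins, not the server's actual types. The idea shown is only the deduplication step: queue the message for every matching subscription, then signal each distinct client once.

```go
package main

import "fmt"

// client is a hypothetical stand-in for a connected subscriber.
type client struct {
	id      int
	buf     [][]byte // queued outbound messages
	flushes int      // number of flush signals received
}

// enqueue queues a message without waking the write loop.
func (c *client) enqueue(msg []byte) { c.buf = append(c.buf, msg) }

// signalFlush stands in for waking the client's write loop.
func (c *client) signalFlush() { c.flushes++ }

// deliver queues msg once per matching subscription, then signals every
// distinct client exactly once. The pending set is passed in so it can be
// reused across calls, similar in spirit to a reusable per-thread set.
func deliver(subs []*client, msg []byte, pending map[*client]struct{}) {
	for _, c := range subs {
		c.enqueue(msg)
		pending[c] = struct{}{}
	}
	for c := range pending {
		c.signalFlush()
		delete(pending, c) // leave the set empty for the next publish
	}
}

func main() {
	a, b := &client{id: 1}, &client{id: 2}
	pending := make(map[*client]struct{})

	// Client a holds three matching subscriptions, client b holds one.
	deliver([]*client{a, a, b, a}, []byte("hello"), pending)

	fmt.Println(len(a.buf), a.flushes) // prints "3 1"
	fmt.Println(len(b.buf), b.flushes) // prints "1 1"
}
```

Without the pending set, the same publish would have produced four flush signals instead of two.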

Round 5: Non-blocking ConsumeAsync (ordered + durable consumers)

One root cause was identified and fixed in the MSG.NEXT request handling path:

| # | Root Cause | Fix | Impact |
| --- | --- | --- | --- |
| 19 | Synchronous blocking in `DeliverPullFetchMessages`: `FetchAsync(...).GetAwaiter().GetResult()` blocked the client's read loop for the full `expires` timeout (30s). With `batch=1000` and only 5 messages available, the fetch polled for message 6 indefinitely. No messages were delivered until the timeout fired, causing the client to receive 0 messages before its own timeout. | Split into two paths: `noWait`/no-expires uses synchronous fetch (existing behavior for `FetchAsync` client); `expires > 0` spawns `DeliverPullFetchMessagesAsync` background task that delivers messages incrementally without blocking the read loop, with idle heartbeat support | Enables `ConsumeAsync` for both ordered and durable consumers; ordered consumer: 99K msg/s (0.64x Go) |

Round 4: Per-Client Direct Write Buffer (pub/sub + fan-out + multi pub/sub)

Four optimizations were implemented in the message delivery hot path:

| # | Root Cause | Fix | Impact |
| --- | --- | --- | --- |
| 15 | Per-message channel overhead: each `SendMessage` call went through `Channel<OutboundData>.TryWrite`, incurring lock contention and memory barriers | Replaced channel-based message delivery with per-client `_directBuf` byte array under `SpinLock`; messages written directly to contiguous buffer | Eliminates channel overhead per delivery |
| 16 | Per-message heap allocation for MSG header: `_outboundBufferPool.RentBuffer()` allocated a pooled `byte[]` for each MSG header | Replaced with `stackalloc byte[512]`: MSG header formatted entirely on the stack, then copied into `_directBuf` | Zero heap allocations per delivery |
| 17 | Per-message socket write: write loop issued one `SendAsync` per channel item, even with coalescing | Double-buffer swap: write loop swaps `_directBuf` ↔ `_writeBuf` under `SpinLock`, then writes the entire batch in a single `SendAsync`; zero allocation on swap | Single syscall per batch, zero-copy buffer reuse |
| 18 | Separate wake channels: `SendMessage` and `WriteProtocol` used different signaling paths | Unified on `_flushSignal` channel (bounded capacity 1, `DropWrite`); both paths signal the same channel, write loop drains both `_directBuf` and `_outbound` on each wake | Single wait point, no missed wakes |
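
The double-buffer swap in row 17 is easier to follow in code. The sketch below is in Go, not C#, and is only an illustration under stated assumptions: `outbound`, `send`, and `flush` are hypothetical names, and a `sync.Mutex` stands in for the `SpinLock` the table describes.

```go
package main

import (
	"fmt"
	"sync"
)

// outbound is a hypothetical per-client output queue built on two buffers.
type outbound struct {
	mu       sync.Mutex // stands in for the SpinLock (assumption)
	direct   []byte     // producers append framed messages here
	writeBuf []byte     // owned by the write loop between swaps
}

// send appends one already-framed message under the lock.
func (o *outbound) send(msg []byte) {
	o.mu.Lock()
	o.direct = append(o.direct, msg...)
	o.mu.Unlock()
}

// flush swaps the two buffers under the lock, then hands the whole batch to
// write in a single call made outside the lock. It returns the batch size.
// The write callback must not retain the slice after it returns.
func (o *outbound) flush(write func([]byte)) int {
	o.mu.Lock()
	// The previous batch buffer is truncated and becomes the producer buffer,
	// so its capacity is reused instead of being reallocated.
	o.direct, o.writeBuf = o.writeBuf[:0], o.direct
	o.mu.Unlock()

	if len(o.writeBuf) == 0 {
		return 0
	}
	write(o.writeBuf)
	return len(o.writeBuf)
}

func main() {
	o := &outbound{}
	o.send([]byte("MSG a\r\n"))
	o.send([]byte("MSG b\r\n"))
	o.send([]byte("MSG c\r\n"))

	writes := 0
	n := o.flush(func(batch []byte) { writes++ })
	fmt.Println(writes, n) // prints "1 21"
}
```

Producers only hold the lock for the duration of an append, and the write loop only holds it for the swap, which is why the batch can be written without blocking new sends.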

Round 3: Outbound Write Path (pub/sub + fan-out + fetch)

Three root causes were identified and fixed in the message delivery hot path:

| # | Root Cause | Fix | Impact |
| --- | --- | --- | --- |
| 12 | Per-message `.ToArray()` allocation in `SendMessage`: `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber | Replaced `IMemoryOwner` rent/copy/dispose with direct `byte[]` from pool; write loop returns buffers after writing | Eliminates 1 heap alloc per delivery (4 per fan-out message) |
| 13 | Per-message `WriteAsync` in write loop: each queued message triggered a separate `_stream.WriteAsync()` system call | Added 64KB coalesce buffer; drain all pending messages into contiguous buffer, single `WriteAsync` per batch | Reduces syscalls from N to 1 per batch |
| 14 | Profiling `Stopwatch` on every message: `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture` even for non-JetStream messages | Removed profiling instrumentation from hot path | Eliminates ~200ns overhead per message |

Round 2: FileStore AppendAsync Hot Path

| # | Root Cause | Fix | Impact |
| --- | --- | --- | --- |
| 6 | Async state machine overhead: `AppendAsync` was `async ValueTask<ulong>` but never actually awaited | Changed to synchronous `ValueTask<ulong>` returning `ValueTask.FromResult(_last)` | Eliminates Task state machine allocation |
| 7 | Double payload copy: `TransformForPersist` allocated `byte[]` then `payload.ToArray()` created second copy for `StoredMessage` | Reuse `TransformForPersist` result directly for `StoredMessage.Payload` when no transform needed (`_noTransform` flag) | Eliminates 1 `byte[]` alloc per message |
| 8 | Unnecessary TTL work per publish: `ExpireFromWheel()` and `RegisterTtl()` called on every write even when `MaxAge=0` | Guarded both with `_options.MaxAgeMs > 0` check (matches Go: `filestore.go:4701`) | Eliminates hash wheel overhead when TTL not configured |
| 9 | Per-message MsgBlock cache allocation: `WriteAt` created new `MessageRecord` for `_cache` on every write | Removed eager cache population; reads now decode from pending buffer or disk | Eliminates 1 object alloc per message |
| 10 | Contiguous write buffer: `MsgBlock._pendingWrites` was `List<byte[]>` with per-message `byte[]` allocations | Replaced with single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it | Eliminates per-message `byte[]` encoding alloc; single `RandomAccess.Write` per flush |
| 11 | Pending buffer read path: `MsgBlock.Read()` flushed pending writes to disk before reading | Added in-memory read from `_pendingBuf` when data is still in the buffer | Avoids unnecessary disk flush on read-after-write |

Round 1: FileStore/StreamManager Layer

| # | Root Cause | Fix | Impact |
| --- | --- | --- | --- |
| 1 | Per-message synchronous disk I/O: `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in `MsgBlock` + background flush loop in `FileStore` (Go's `flushLoop` pattern: coalesce 16KB or 8ms) | Eliminates per-message syscall overhead |
| 2 | O(n) `GetStateAsync` per publish: `_messages.Keys.Min()` and `_messages.Values.Sum()` on every publish for MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1) | Eliminates O(n) scan per publish |
| 3 | Unnecessary `LoadAsync` after every append: `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates redundant disk read per publish |
| 4 | Redundant `PruneExpiredMessages` per publish: called before every publish even when `MaxAge=0`, and again inside `EnforceRuntimePolicies` | Guarded with `MaxAgeMs > 0` check; removed the pre-publish call (background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | `PrunePerSubject` loading all messages per publish: `EnforceRuntimePolicies` → `PrunePerSubject` called `ListAsync().GroupBy()` even when `MaxMsgsPer=0` | Guarded with `MaxMsgsPer > 0` check | Eliminates O(n) scan per publish |

Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.
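
The incremental-state fix in row 2 replaces a scan over all stored messages with counters maintained on every mutation. A minimal Go sketch of that idea follows; the `store` type and its methods are hypothetical, and it assumes messages are only ever removed from the front, so sequences stay contiguous.

```go
package main

import "fmt"

// store is a hypothetical in-memory message store keyed by sequence number.
type store struct {
	msgs       map[uint64][]byte
	count      uint64 // number of stored messages, updated on every mutation
	totalBytes uint64 // sum of payload sizes, updated on every mutation
	firstSeq   uint64 // oldest stored sequence, 0 when empty
	lastSeq    uint64 // most recently assigned sequence
}

func newStore() *store { return &store{msgs: make(map[uint64][]byte)} }

// append stores a payload and updates the counters in O(1).
func (s *store) append(payload []byte) uint64 {
	s.lastSeq++
	s.msgs[s.lastSeq] = payload
	s.count++
	s.totalBytes += uint64(len(payload))
	if s.count == 1 {
		s.firstSeq = s.lastSeq
	}
	return s.lastSeq
}

// removeFirst drops the oldest message and updates the counters in O(1).
// Assumption: no interior deletes, so the next sequence is firstSeq + 1.
func (s *store) removeFirst() {
	if s.count == 0 {
		return
	}
	s.totalBytes -= uint64(len(s.msgs[s.firstSeq]))
	delete(s.msgs, s.firstSeq)
	s.count--
	if s.count == 0 {
		s.firstSeq = 0
		return
	}
	s.firstSeq++
}

// enforceMaxMsgs trims the store using the counters, never scanning msgs.
func (s *store) enforceMaxMsgs(limit uint64) {
	for s.count > limit {
		s.removeFirst()
	}
}

func main() {
	s := newStore()
	for _, p := range []string{"aa", "bbb", "c"} {
		s.append([]byte(p))
		s.enforceMaxMsgs(2)
	}
	fmt.Println(s.count, s.totalBytes, s.firstSeq) // prints "2 4 2"
}
```

The cost is that every mutation path has to keep the counters in sync, which is the "updated in all mutation paths" condition noted in the table.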

What would further close the gap

| Change | Expected Impact | Go Reference |
| --- | --- | --- |
| Fan-out parallelism | Deliver to subscribers concurrently instead of serially from publisher's read loop | Go: `processMsgResults` fans out per-client via goroutines |
| Eliminate per-message GC allocations in FileStore | ~30% improvement on FileStore `AppendAsync`: replace `StoredMessage` class with `StoredMessageMeta` struct in `_messages` dict, reconstruct full message from `MsgBlock` on read | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs; 80+ sites in `FileStore.cs` need migration |
| Ordered consumer delivery optimization | Investigate .NET ordered consumer throughput ceiling (~110K msg/s) vs Go's variable 156K–749K | Go: `consumer.go` ordered consumer fast path |