natsdotnet/benchmarks_comparison.md
Joseph Doherty 0a4e7a822f perf: eliminate per-message allocations in pub/sub hot path and coalesce outbound writes
Pub/sub 1:1 (16B) improved from 0.18x to 0.50x, fan-out from 0.18x to 0.44x,
and JetStream durable fetch from 0.13x to 0.64x vs Go. Key changes: replace
.ToArray() copy in SendMessage with pooled buffer handoff, batch multiple small
writes into single WriteAsync via 64KB coalesce buffer in write loop, and remove
profiling Stopwatch instrumentation from ProcessMessage/StreamManager hot paths.
2026-03-13 05:09:36 -04:00


# Go vs .NET NATS Server — Benchmark Comparison

Benchmark run: 2026-03-13. Both servers running on the same machine, tested with identical NATS.Client.Core workloads. Test parallelization disabled to avoid resource contention.

Environment: Apple M4, .NET 10, Go nats-server (latest from golang/nats-server/).


## Core NATS — Pub/Sub Throughput

### Single Publisher (no subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 16 B | 2,138,955 | 32.6 | 1,373,272 | 21.0 | 0.64x |
| 128 B | 1,995,574 | 243.6 | 1,672,825 | 204.2 | 0.84x |

### Publisher + Subscriber (1:1)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 16 B | 1,180,986 | 18.0 | 586,118 | 8.9 | 0.50x |
| 16 KB | 42,660 | 666.6 | 41,555 | 649.3 | 0.97x |

### Fan-Out (1 Publisher : 4 Subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 128 B | 3,200,845 | 390.7 | 1,423,721 | 173.8 | 0.44x |

### Multi-Publisher / Multi-Subscriber (4P x 4S)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 128 B | 3,081,071 | 376.1 | 1,518,459 | 185.4 | 0.49x |

## Core NATS — Request/Reply Latency

### Single Client, Single Service

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (µs) | .NET P50 (µs) | Go P99 (µs) | .NET P99 (µs) |
|---|---|---|---|---|---|---|---|
| 128 B | 9,174 | 7,317 | 0.80x | 106.3 | 134.2 | 149.2 | 175.2 |

### 10 Clients, 2 Services (Queue Group)

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (µs) | .NET P50 (µs) | Go P99 (µs) | .NET P99 (µs) |
|---|---|---|---|---|---|---|---|
| 16 B | 30,386 | 25,639 | 0.84x | 318.5 | 374.2 | 458.4 | 519.5 |

## JetStream — Publication

| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| Synchronous | 16 B | Memory | 15,241 | 12,879 | 0.85x |
| Async (batch) | 128 B | File | 201,055 | 55,268 | 0.27x |

Note: Async file store publish improved from 174 msg/s to 55,268 msg/s (~318x) after two rounds of FileStore-level optimizations plus removal of profiling overhead. The remaining ~4x gap is driven by GC pressure from per-message allocations and by ack delivery overhead.


## JetStream — Consumption

| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---|---|---|
| Ordered ephemeral consumer | 688,061 | N/A | N/A |
| Durable consumer fetch | 701,932 | 450,727 | 0.64x |

Note: Ordered ephemeral consumer is not yet fully supported on the .NET server (API timeout during consumer creation). Durable fetch improved from 0.13x to 0.64x after write coalescing and buffer pooling optimizations in the outbound write path.


## Summary

| Category | Ratio Range | Assessment |
|---|---|---|
| Pub-only throughput | 0.64x–0.84x | Good — within 2x |
| Pub/sub (large payload) | 0.97x | Excellent — near parity |
| Pub/sub (small payload) | 0.50x | Improved from 0.18x |
| Fan-out | 0.44x | Improved from 0.18x |
| Multi pub/sub | 0.49x | Good |
| Request/reply latency | 0.80x–0.84x | Good |
| JetStream sync publish | 0.85x | Good |
| JetStream async file publish | 0.27x | Improved from 0.00x — storage write path dominates |
| JetStream durable fetch | 0.64x | Improved from 0.13x |

## Key Observations

  1. Pub-only and request/reply are within striking distance (0.6x–0.85x), suggesting the core message path is reasonably well ported.
  2. Small-payload pub/sub improved from 0.18x to 0.50x after eliminating per-message .ToArray() allocations in SendMessage, adding write coalescing in the write loop, and removing profiling instrumentation from the hot path.
  3. Fan-out improved from 0.18x to 0.44x — same optimizations. The remaining gap vs Go is primarily vectored I/O (net.Buffers/writev in Go vs sequential WriteAsync in .NET) and per-client scratch buffer reuse (Go's 1KB msgb per client).
  4. JetStream durable fetch improved from 0.13x to 0.64x — the outbound write path optimizations benefit all message delivery, including consumer fetch responses.
  5. Large-payload pub/sub reached near-parity (0.97x) — payload copy dominates, and the protocol overhead optimizations have minimal impact at large sizes.
  6. JetStream file store async publish (0.27x) — remaining gap is GC pressure from per-message StoredMessage objects and byte[] copies (65% of server time).
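
The per-client scratch buffer mentioned in observation 3 is simple enough to sketch. Below is a minimal Go version of the pattern: only `client.msgb` and `msgScratchSize` are real nats-server names; `formatMsgHeader` and the header construction are illustrative, not the actual server code.

```go
package main

import (
	"fmt"
	"strconv"
)

// client carries a reusable scratch buffer for MSG header formatting,
// mirroring Go nats-server's per-client msgb (msgScratchSize bytes).
// Reusing it across deliveries makes header formatting allocation-free.
type client struct {
	msgb [1024]byte
}

// formatMsgHeader renders "MSG <subject> <sid> <size>\r\n" into the scratch
// buffer; the returned slice is only valid until the next call.
func (c *client) formatMsgHeader(subject string, sid, size int) []byte {
	b := c.msgb[:0] // slice over the array: appends write in place, no alloc
	b = append(b, "MSG "...)
	b = append(b, subject...)
	b = append(b, ' ')
	b = strconv.AppendInt(b, int64(sid), 10)
	b = append(b, ' ')
	b = strconv.AppendInt(b, int64(size), 10)
	b = append(b, '\r', '\n')
	return b
}

func main() {
	var c client
	fmt.Printf("%q\n", c.formatMsgHeader("orders.new", 7, 16)) // "MSG orders.new 7 16\r\n"
}
```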

## Optimization History

### Round 3: Outbound Write Path (pub/sub + fan-out + fetch)

Three root causes were identified and fixed in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 12 | Per-message `.ToArray()` allocation in `SendMessage`: `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber | Replaced `IMemoryOwner` rent/copy/dispose with direct `byte[]` from pool; write loop returns buffers after writing | Eliminates 1 heap alloc per delivery (4 per fan-out message) |
| 13 | Per-message `WriteAsync` in write loop: each queued message triggered a separate `_stream.WriteAsync()` system call | Added 64KB coalesce buffer; drain all pending messages into a contiguous buffer, single `WriteAsync` per batch | Reduces syscalls from N to 1 per batch |
| 14 | Profiling `Stopwatch` on every message: `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture`, even for non-JetStream messages | Removed profiling instrumentation from hot path | Eliminates ~200ns overhead per message |
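
The drain-and-coalesce step in fix 13 can be sketched in a few lines. This is a hedged Go illustration of the technique, not the server's code: `coalesceBatch` and the frame contents are invented for the example.

```go
package main

import "fmt"

// coalesceBatch drains queued frames into one contiguous buffer so the
// writer can issue a single write per batch instead of one per message.
// Frames that do not fit in the remaining space are left for the next batch.
func coalesceBatch(pending [][]byte, buf []byte) (n int, consumed int) {
	for _, frame := range pending {
		if n+len(frame) > len(buf) {
			break // buffer full: flush now, pick this frame up next round
		}
		n += copy(buf[n:], frame)
		consumed++
	}
	return n, consumed
}

func main() {
	buf := make([]byte, 64*1024) // 64KB coalesce buffer, as in the fix
	frames := [][]byte{
		[]byte("MSG orders.a 1 2\r\nhi\r\n"),
		[]byte("MSG orders.b 2 2\r\nok\r\n"),
	}
	n, consumed := coalesceBatch(frames, buf)
	fmt.Println(n, consumed) // 44 2 — both frames in one write
}
```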

### Round 2: FileStore AppendAsync Hot Path

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 6 | Async state machine overhead: `AppendAsync` was `async ValueTask<ulong>` but never actually awaited | Changed to synchronous `ValueTask<ulong>` returning `ValueTask.FromResult(_last)` | Eliminates Task state machine allocation |
| 7 | Double payload copy: `TransformForPersist` allocated a `byte[]`, then `payload.ToArray()` created a second copy for `StoredMessage` | Reuse `TransformForPersist` result directly for `StoredMessage.Payload` when no transform needed (`_noTransform` flag) | Eliminates 1 `byte[]` alloc per message |
| 8 | Unnecessary TTL work per publish: `ExpireFromWheel()` and `RegisterTtl()` called on every write even when MaxAge=0 | Guarded both with `_options.MaxAgeMs > 0` check (matches Go: filestore.go:4701) | Eliminates hash wheel overhead when TTL not configured |
| 9 | Per-message MsgBlock cache allocation: `WriteAt` created a new `MessageRecord` for `_cache` on every write | Removed eager cache population; reads now decode from pending buffer or disk | Eliminates 1 object alloc per message |
| 10 | Contiguous write buffer: `MsgBlock._pendingWrites` was `List<byte[]>` with per-message `byte[]` allocations | Replaced with single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it | Eliminates per-message `byte[]` encoding alloc; single `RandomAccess.Write` per flush |
| 11 | Pending buffer read path: `MsgBlock.Read()` flushed pending writes to disk before reading | Added in-memory read from `_pendingBuf` when data is still in the buffer | Avoids unnecessary disk flush on read-after-write |
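
Fix 10's contiguous pending buffer amounts to encoding records in place and flushing once. A rough Go sketch of the idea: the 12-byte length-prefixed record layout here is invented for illustration and is not the real file format.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// pendingBuf models the contiguous pending-write buffer: each record is
// encoded directly into one growing []byte, so a flush is a single write
// of the whole buffer rather than N per-message writes.
type pendingBuf struct {
	buf []byte
}

// append encodes seq + payload length + payload straight into the buffer,
// with no intermediate per-message []byte allocation.
func (p *pendingBuf) append(seq uint64, payload []byte) {
	var hdr [12]byte
	binary.LittleEndian.PutUint64(hdr[0:8], seq)
	binary.LittleEndian.PutUint32(hdr[8:12], uint32(len(payload)))
	p.buf = append(p.buf, hdr[:]...)
	p.buf = append(p.buf, payload...)
}

// flush would hand p.buf to one RandomAccess.Write-style call; here it
// just reports the byte count and resets the buffer for reuse.
func (p *pendingBuf) flush() int {
	n := len(p.buf)
	p.buf = p.buf[:0]
	return n
}

func main() {
	var p pendingBuf
	p.append(1, []byte("hello"))
	p.append(2, []byte("world"))
	fmt.Println(p.flush()) // 34: two records, one write
}
```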

### Round 1: FileStore/StreamManager Layer

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 1 | Per-message synchronous disk I/O: `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in MsgBlock + background flush loop in FileStore (Go's `flushLoop` pattern: coalesce 16KB or 8ms) | Eliminates per-message syscall overhead |
| 2 | O(n) `GetStateAsync` per publish: `_messages.Keys.Min()` and `_messages.Values.Sum()` on every publish for MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1) | Eliminates O(n) scan per publish |
| 3 | Unnecessary `LoadAsync` after every append: `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates redundant disk read per publish |
| 4 | Redundant `PruneExpiredMessages` per publish: called before every publish even when MaxAge=0, and again inside `EnforceRuntimePolicies` | Guarded with MaxAgeMs > 0 check; removed the pre-publish call (background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | `PrunePerSubject` loading all messages per publish: `EnforceRuntimePolicies` → `PrunePerSubject` called `ListAsync().GroupBy()` even when MaxMsgsPer=0 | Guarded with MaxMsgsPer > 0 check | Eliminates O(n) scan per publish |

Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.
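
Fix 2's incremental bookkeeping is worth a small sketch. A Go illustration with invented names shaped after the `_messageCount`/`_totalBytes`/`_firstSeq` fields; the real store updates these in every mutation path.

```go
package main

import "fmt"

// streamState keeps count, total bytes, and first/last sequence as
// incrementally maintained fields, so per-publish limit checks are O(1)
// instead of re-scanning all stored messages (the original Min()/Sum()).
type streamState struct {
	count    uint64
	bytes    uint64
	firstSeq uint64
	lastSeq  uint64
}

// store records one appended message.
func (s *streamState) store(size uint64) {
	s.lastSeq++
	if s.count == 0 {
		s.firstSeq = s.lastSeq
	}
	s.count++
	s.bytes += size
}

// removeFirst models MaxMsgs/MaxBytes enforcement: drop the oldest message
// and advance firstSeq without rescanning the store.
func (s *streamState) removeFirst(size uint64) {
	s.count--
	s.bytes -= size
	s.firstSeq++
}

func main() {
	var s streamState
	s.store(10)
	s.store(20)
	s.removeFirst(10)
	fmt.Println(s.count, s.bytes, s.firstSeq) // 1 20 2
}
```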

## What would further close the gap

| Change | Expected Impact | Go Reference |
|---|---|---|
| Vectored I/O (writev) | Eliminate coalesce copy in write loop: write gathered buffers in a single syscall | Go: `net.Buffers.WriteTo()` → `writev()` in `flushOutbound()` |
| Per-client scratch buffer | Reuse 1KB buffer for MSG header formatting across deliveries | Go: `client.msgb` (1024-byte scratch, `msgScratchSize`) |
| Batch flush signaling | Deduplicate write loop wakeups: signal once per readloop iteration, not per delivery | Go: `pcd` map tracks affected clients, `flushClients()` at end of readloop |
| Eliminate per-message GC allocations | ~30% improvement on FileStore `AppendAsync`: pool or eliminate `StoredMessage` objects | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs |