natsdotnet/benchmarks_comparison.md

# Go vs .NET NATS Server — Benchmark Comparison

Benchmark run: 2026-03-13 10:06 AM America/Indiana/Indianapolis. The latest refresh used the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`) and completed successfully as a .NET-only run. The Go/.NET comparison tables below remain the last Go-capable comparison baseline.

Environment: Apple M4, .NET SDK 10.0.101, README benchmark command run in the benchmark project's default Debug configuration, Go toolchain installed but the current full-suite run emitted only .NET result blocks.


## Latest README Run (.NET only)

The current refresh came from `/tmp/bench-output.txt` using the benchmark project README workflow. Because the run did not emit any Go comparison blocks, the values below are the latest .NET-only numbers from that run, and the historical Go/.NET comparison tables are preserved below instead of being overwritten with mixed-source ratios.

### Core and JetStream

| Benchmark | .NET msg/s | .NET MB/s | Notes |
|---|---:|---:|---|
| Single Publisher (16B) | 1,392,442 | 21.2 | README full-suite run |
| Single Publisher (128B) | 1,491,226 | 182.0 | README full-suite run |
| PubSub 1:1 (16B) | 717,731 | 11.0 | README full-suite run |
| PubSub 1:1 (16KB) | 28,450 | 444.5 | README full-suite run |
| Fan-Out 1:4 (128B) | 1,451,748 | 177.2 | README full-suite run |
| Multi 4Px4S (128B) | 244,878 | 29.9 | README full-suite run |
| Request-Reply Single (128B) | 6,840 | 0.8 | P50 142.5 us, P99 203.9 us |
| Request-Reply 10Cx2S (16B) | 22,844 | 0.3 | P50 421.1 us, P99 602.1 us |
| JS Sync Publish (16B Memory) | 12,619 | 0.2 | README full-suite run |
| JS Async Publish (128B File) | 46,631 | 5.7 | README full-suite run |
| JS Ordered Consumer (128B) | 108,057 | 13.2 | README full-suite run |
| JS Durable Fetch (128B) | 490,090 | 59.8 | README full-suite run |

### Parser Microbenchmarks

| Benchmark | Ops/s | MB/s | Alloc |
|---|---:|---:|---:|
| Parser PING | 5,756,370 | 32.9 | 0.0 B/op |
| Parser PUB | 2,537,973 | 96.8 | 40.0 B/op |
| Parser HPUB | 2,298,811 | 122.8 | 40.0 B/op |
| Parser PUB split payload | 2,049,535 | 78.2 | 176.0 B/op |

## Current Run Highlights

  1. The parser microbenchmarks show the hot path is already at zero allocation for PING, with contiguous PUB and HPUB still paying a small fixed cost for retained field copies.
  2. Split-payload PUB remains meaningfully more allocation-heavy than contiguous PUB because the parser must preserve unread payload state across reads and then materialize contiguous memory at the current client boundary.
  3. The README-driven suite was a .NET-only refresh, so the comparative Go/.NET ratios below should still be treated as the last Go-capable baseline rather than current same-run ratios.
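Highlight 2's split-payload cost can be illustrated with a minimal Go sketch of the two parser paths: a contiguous fast path that slices the read buffer directly, and a slow path that must allocate to retain partial payload bytes across socket reads. All names here are illustrative stand-ins, not the actual parser types.

```go
package main

import "fmt"

// payloadReader models the contiguous vs split-payload paths described
// above. pending is the retained state that costs the extra allocation.
type payloadReader struct {
	pending []byte // partial payload carried across reads
	need    int    // bytes still missing
}

// feed consumes one read's worth of bytes and returns the complete
// payload once all bytes have arrived, or nil if more reads are needed.
func (p *payloadReader) feed(buf []byte, size int) []byte {
	if p.pending == nil && len(buf) >= size {
		// Fast path: payload is contiguous in this read buffer,
		// so the parser can slice it with no copy.
		return buf[:size]
	}
	if p.pending == nil {
		// Slow path: payload split across reads, so the parser must
		// allocate and copy to materialize contiguous memory later.
		p.pending = make([]byte, 0, size)
		p.need = size
	}
	n := len(buf)
	if n > p.need {
		n = p.need
	}
	p.pending = append(p.pending, buf[:n]...)
	p.need -= n
	if p.need == 0 {
		out := p.pending
		p.pending = nil
		return out
	}
	return nil
}

func main() {
	var r payloadReader
	// Payload "hello" arrives split across two socket reads.
	if out := r.feed([]byte("hel"), 5); out != nil {
		panic("incomplete payload returned early")
	}
	out := r.feed([]byte("lo"), 5)
	fmt.Println(string(out)) // the materialized contiguous payload
}
```

The fast path explains why contiguous PUB stays at a small fixed 40 B/op while the split case pays for the retained buffer.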

## Core NATS — Pub/Sub Throughput

### Single Publisher (no subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---:|---:|---:|---:|---:|
| 16 B | 2,252,242 | 34.4 | 1,610,807 | 24.6 | 0.72x |
| 128 B | 2,199,267 | 268.5 | 1,661,014 | 202.8 | 0.76x |

### Publisher + Subscriber (1:1)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---:|---:|---:|---:|---:|
| 16 B | 313,790 | 4.8 | 909,298 | 13.9 | 2.90x |
| 16 KB | 41,153 | 643.0 | 38,287 | 598.2 | 0.93x |

### Fan-Out (1 Publisher : 4 Subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---:|---:|---:|---:|---:|
| 128 B | 3,217,684 | 392.8 | 1,817,860 | 221.9 | 0.57x |

### Multi-Publisher / Multi-Subscriber (4P x 4S)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---:|---:|---:|---:|---:|
| 128 B | 2,101,337 | 256.5 | 1,527,330 | 186.4 | 0.73x |

## Core NATS — Request/Reply Latency

### Single Client, Single Service

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---|---:|---:|---:|---:|---:|---:|---:|
| 128 B | 9,450 | 7,662 | 0.81x | 103.2 | 128.9 | 145.6 | 170.8 |

### 10 Clients, 2 Services (Queue Group)

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---|---:|---:|---:|---:|---:|---:|---:|
| 16 B | 31,094 | 26,144 | 0.84x | 316.9 | 368.7 | 439.2 | 559.7 |

## JetStream — Publication

| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---|---|---:|---:|---:|
| Synchronous | 16 B | Memory | 17,533 | 14,373 | 0.82x |
| Async (batch) | 128 B | File | 198,237 | 60,416 | 0.30x |

Note: Async file store publish improved from 174 msg/s to 60K msg/s (a 347x improvement) after two rounds of FileStore-level optimizations plus removal of profiling overhead. The remaining 3.3x gap is GC pressure from per-message allocations.


## JetStream — Consumption

| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---:|---:|---:|
| Ordered ephemeral consumer | 748,671 | 114,021 | 0.15x |
| Durable consumer fetch | 662,471 | 488,520 | 0.74x |

Note: Durable fetch improved from 0.13x → 0.60x → 0.74x after Round 6 optimizations (batch flush, ackReply stack formatting, cached CompiledFilter, pooled fetch list). Ordered consumer ratio dropped due to Go benchmark improvement (748K vs 156K in earlier runs); .NET throughput is stable at ~110K msg/s.


## Summary

| Category | Ratio Range | Assessment |
|---|---|---|
| Pub-only throughput | 0.72x–0.76x | Good — within 2x |
| Pub/sub (small payload) | 2.90x | .NET outperforms Go — direct buffer path eliminates all per-message overhead |
| Pub/sub (large payload) | 0.93x | Near parity |
| Fan-out | 0.57x | Improved from 0.18x → 0.44x → 0.66x; batch flush applied but serial delivery remains |
| Multi pub/sub | 0.73x | Improved from 0.49x → 0.84x; variance from system load |
| Request/reply latency | 0.81x–0.84x | Good — improved from 0.77x |
| JetStream sync publish | 0.82x | Good |
| JetStream async file publish | 0.30x | Improved from 0.00x — storage write path dominates |
| JetStream ordered consume | 0.15x | .NET stable at ~110K; Go variance high (156K–749K) |
| JetStream durable fetch | 0.74x | Improved from 0.60x — batch flush + ackReply optimization |

## Key Observations

  1. Small-payload 1:1 pub/sub outperforms Go by ~3x (909K vs 314K msg/s). The per-client direct write buffer with stackalloc header formatting eliminates all per-message heap allocations and channel overhead.
  2. Durable consumer fetch improved to 0.74x (489K vs 662K msg/s) — Round 6 batch flush signaling and string.Create-based ack reply formatting reduced per-message overhead significantly.
  3. Fan-out holds at ~0.57x despite batch flush optimization. The remaining gap is goroutine-level parallelism (Go fans out per-client via goroutines; .NET delivers serially). The batch flush reduces wakeup overhead but doesn't add concurrency.
  4. Request/reply improved to 0.81x–0.84x — deferred flush benefits single-message delivery paths too.
  5. JetStream file store async publish: 0.30x — remaining gap is GC pressure from per-message StoredMessage objects and byte[] copies (Change 2 deferred due to scope: 80+ sites in FileStore.cs need migration).
  6. JetStream ordered consumer: 0.15x — ratio drop is due to Go benchmark variance (749K in this run vs 156K previously); .NET throughput stable at ~110K msg/s. Further investigation needed for the Go variability.
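The zero-allocation header path behind observation 1 follows the same shape as Go's append-based formatting. A minimal sketch of formatting the MSG protocol line into a caller-owned scratch buffer with no intermediate strings (the .NET version uses `stackalloc` plus `TryFormat`; `appendMsgHeader` is an illustrative stand-in, not the server's code):

```go
package main

import (
	"fmt"
	"strconv"
)

// appendMsgHeader formats a core NATS MSG protocol line
// ("MSG <subject> <sid> <size>\r\n") into dst without any
// intermediate string allocations: every piece is appended
// directly into the reused scratch buffer.
func appendMsgHeader(dst []byte, subject string, sid, size int64) []byte {
	dst = append(dst, "MSG "...)
	dst = append(dst, subject...)
	dst = append(dst, ' ')
	dst = strconv.AppendInt(dst, sid, 10)
	dst = append(dst, ' ')
	dst = strconv.AppendInt(dst, size, 10)
	dst = append(dst, '\r', '\n')
	return dst
}

func main() {
	scratch := make([]byte, 0, 512) // reused across deliveries
	scratch = appendMsgHeader(scratch[:0], "orders.created", 7, 128)
	fmt.Printf("%q\n", scratch)
}
```

Because the buffer is reused per delivery, steady-state formatting allocates nothing, which is what removes the per-message heap traffic on the small-payload path.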

## Optimization History

### Round 6: Batch Flush Signaling + Fetch Optimizations

Four optimizations targeting fan-out and consumer fetch hot paths:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 20 | **Per-subscriber flush signal in fan-out** — each `SendMessage` called `_flushSignal.Writer.TryWrite(0)` independently; for 1:4 fan-out, 4 channel writes + 4 write-loop wakeups per published message | Split `SendMessage` into `SendMessageNoFlush` + `SignalFlush`; `ProcessMessage` collects unique clients in a `[ThreadStatic]` `HashSet<INatsClient>` (Go's `pcd` pattern), one flush signal per unique client after fan-out | Reduces channel writes from N to unique-client count per publish |
| 21 | **Per-fetch `CompiledFilter` allocation** — `CompiledFilter.FromConfig(consumer.Config)` was called on every fetch request, allocating a new filter object each time | Cached `CompiledFilter` on `ConsumerHandle` with staleness detection (reference + value check on filter config fields); reused across fetches | Eliminates per-fetch filter allocation |
| 22 | **Per-message string interpolation in ack reply** — `$"$JS.ACK.{stream}.{consumer}.1.{seq}.{deliverySeq}.{ts}.{pending}"` allocated intermediate strings and boxed numeric types on every delivery | Pre-compute the `$"$JS.ACK.{stream}.{consumer}.1."` prefix before the loop; use `stackalloc char[]` + `TryFormat` for the numeric suffix — zero intermediate allocations | Eliminates 4+ string allocs per delivered message |
| 23 | **Per-fetch `List<StoredMessage>` allocation** — `new List<StoredMessage>(batch)` was allocated on every `FetchAsync` call | `[ThreadStatic]` reusable list with `.Clear()` + capacity growth; `PullFetchBatch` snapshots via `.ToArray()` for safe handoff | Eliminates per-fetch list allocation |
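Fix #20's unique-client signaling can be sketched in Go, since it mirrors the reference implementation's `pcd` (per-delivery client set) pattern: buffer every delivery without flushing, then wake each touched client exactly once. `client`, `deliverFanOut`, and `signalFlush` below are illustrative stand-ins, not actual server types.

```go
package main

import "fmt"

// client is a stand-in for a connected subscriber; signalled counts
// flush wakeups for the demo.
type client struct {
	id        int
	signalled int
}

func (c *client) signalFlush() { c.signalled++ }

// deliverFanOut buffers one message to every subscriber without
// flushing, collects unique clients, then signals each client's flush
// exactly once — the "one wakeup per unique client" pattern of fix #20.
// The touched set is reused across publishes, like the [ThreadStatic]
// HashSet in the .NET version.
func deliverFanOut(subs []*client, touched map[*client]struct{}) {
	for c := range touched { // clear the reused set
		delete(touched, c)
	}
	for _, c := range subs {
		// sendMessageNoFlush(c, msg) would append the bytes here.
		touched[c] = struct{}{}
	}
	for c := range touched {
		c.signalFlush()
	}
}

func main() {
	a, b := &client{id: 1}, &client{id: 2}
	// Client a has two overlapping subscriptions, so it appears twice,
	// but it still receives only one flush signal.
	subs := []*client{a, a, b}
	touched := make(map[*client]struct{})
	deliverFanOut(subs, touched)
	fmt.Println(a.signalled, b.signalled)
}
```

The win is purely in wakeup count: delivery work is unchanged, but the write loop is signalled once per unique client instead of once per subscription match.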

### Round 5: Non-blocking ConsumeAsync (ordered + durable consumers)

One root cause was identified and fixed in the MSG.NEXT request handling path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 19 | **Synchronous blocking in `DeliverPullFetchMessages`** — `FetchAsync(...).GetAwaiter().GetResult()` blocked the client's read loop for the full expires timeout (30s). With batch=1000 and only 5 messages available, the fetch polled for message 6 indefinitely; no messages were delivered until the timeout fired, so the client received 0 messages before its own timeout. | Split into two paths: noWait/no-expires uses synchronous fetch (existing behavior for the `FetchAsync` client); expires > 0 spawns a `DeliverPullFetchMessagesAsync` background task that delivers messages incrementally without blocking the read loop, with idle heartbeat support | Enables `ConsumeAsync` for both ordered and durable consumers; ordered consumer: 99K msg/s (0.64x Go) |

### Round 4: Per-Client Direct Write Buffer (pub/sub + fan-out + multi pub/sub)

Four optimizations were implemented in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 15 | **Per-message channel overhead** — each `SendMessage` call went through `Channel<OutboundData>.TryWrite`, incurring lock contention and memory barriers | Replaced channel-based message delivery with a per-client `_directBuf` byte array under `SpinLock`; messages are written directly to a contiguous buffer | Eliminates channel overhead per delivery |
| 16 | **Per-message heap allocation for MSG header** — `_outboundBufferPool.RentBuffer()` allocated a pooled `byte[]` for each MSG header | Replaced with `stackalloc byte[512]` — the MSG header is formatted entirely on the stack, then copied into `_directBuf` | Zero heap allocations per delivery |
| 17 | **Per-message socket write** — the write loop issued one `SendAsync` per channel item, even with coalescing | Double-buffer swap: the write loop swaps `_directBuf` and `_writeBuf` under `SpinLock`, then writes the entire batch in a single `SendAsync`; zero allocation on swap | Single syscall per batch, zero-copy buffer reuse |
| 18 | **Separate wake channels** — `SendMessage` and `WriteProtocol` used different signaling paths | Unified on the `_flushSignal` channel (bounded capacity 1, `DropWrite`); both paths signal the same channel, and the write loop drains both `_directBuf` and `_outbound` on each wake | Single wait point, no missed wakes |

### Round 3: Outbound Write Path (pub/sub + fan-out + fetch)

Three root causes were identified and fixed in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 12 | **Per-message `.ToArray()` allocation in `SendMessage`** — `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber | Replaced `IMemoryOwner` rent/copy/dispose with a direct `byte[]` from the pool; the write loop returns buffers after writing | Eliminates 1 heap alloc per delivery (4 per fan-out message) |
| 13 | **Per-message `WriteAsync` in write loop** — each queued message triggered a separate `_stream.WriteAsync()` system call | Added a 64KB coalesce buffer; drain all pending messages into the contiguous buffer, single `WriteAsync` per batch | Reduces syscalls from N to 1 per batch |
| 14 | **Profiling `Stopwatch` on every message** — `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture`, even for non-JetStream messages | Removed profiling instrumentation from the hot path | Eliminates ~200ns overhead per message |

### Round 2: FileStore AppendAsync Hot Path

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 6 | **Async state machine overhead** — `AppendAsync` was `async ValueTask<ulong>` but never actually awaited | Changed to a synchronous `ValueTask<ulong>` returning `ValueTask.FromResult(_last)` | Eliminates Task state machine allocation |
| 7 | **Double payload copy** — `TransformForPersist` allocated a `byte[]`, then `payload.ToArray()` created a second copy for `StoredMessage` | Reuse the `TransformForPersist` result directly for `StoredMessage.Payload` when no transform is needed (`_noTransform` flag) | Eliminates 1 `byte[]` alloc per message |
| 8 | **Unnecessary TTL work per publish** — `ExpireFromWheel()` and `RegisterTtl()` were called on every write even when `MaxAge=0` | Guarded both with a `_options.MaxAgeMs > 0` check (matches Go: filestore.go:4701) | Eliminates hash wheel overhead when TTL is not configured |
| 9 | **Per-message MsgBlock cache allocation** — `WriteAt` created a new `MessageRecord` for `_cache` on every write | Removed eager cache population; reads now decode from the pending buffer or disk | Eliminates 1 object alloc per message |
| 10 | **Contiguous write buffer** — `MsgBlock._pendingWrites` was a `List<byte[]>` with per-message `byte[]` allocations | Replaced with a single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it | Eliminates per-message `byte[]` encoding alloc; single `RandomAccess.Write` per flush |
| 11 | **Pending buffer read path** — `MsgBlock.Read()` flushed pending writes to disk before reading | Added an in-memory read from `_pendingBuf` when the data is still in the buffer | Avoids an unnecessary disk flush on read-after-write |
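Fix #10's contiguous pending buffer amounts to encoding each record in place and flushing one region per write call. A Go sketch with an illustrative record layout (4-byte length prefix plus payload; the real block format differs):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// pendingBuf sketches fix #10: message records are encoded directly
// into one contiguous buffer instead of a list of per-message byte
// slices, so a flush is a single write of the whole region.
type pendingBuf struct {
	buf []byte
}

// appendRecord encodes a [len(uint32) | payload] record in place,
// with no per-message slice allocation once capacity is warm.
func (p *pendingBuf) appendRecord(payload []byte) {
	p.buf = binary.BigEndian.AppendUint32(p.buf, uint32(len(payload)))
	p.buf = append(p.buf, payload...)
}

// flush returns the contiguous region that would go to disk in one
// positional write (RandomAccess.Write / pwrite), then resets the
// buffer for reuse. The caller must consume the batch before the
// next append, as with the double-buffer handoff elsewhere.
func (p *pendingBuf) flush() []byte {
	out := p.buf
	p.buf = p.buf[:0]
	return out
}

func main() {
	var p pendingBuf
	p.appendRecord([]byte("order-1"))
	p.appendRecord([]byte("order-2"))
	batch := p.flush()
	fmt.Println(len(batch)) // two records in one contiguous write
}
```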

### Round 1: FileStore/StreamManager Layer

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 1 | **Per-message synchronous disk I/O** — `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in `MsgBlock` + a background flush loop in `FileStore` (Go's flushLoop pattern: coalesce 16KB or 8ms) | Eliminates per-message syscall overhead |
| 2 | **O(n) `GetStateAsync` per publish** — `_messages.Keys.Min()` and `_messages.Values.Sum()` ran on every publish for MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, and `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1) | Eliminates O(n) scan per publish |
| 3 | **Unnecessary `LoadAsync` after every append** — `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates redundant disk read per publish |
| 4 | **Redundant `PruneExpiredMessages` per publish** — called before every publish even when `MaxAge=0`, and again inside `EnforceRuntimePolicies` | Guarded with a `MaxAgeMs > 0` check; removed the pre-publish call (the background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | **`PrunePerSubject` loading all messages per publish** — `PrunePerSubject` (called from `EnforceRuntimePolicies`) invoked `ListAsync().GroupBy()` even when `MaxMsgsPer=0` | Guarded with a `MaxMsgsPer > 0` check | Eliminates O(n) scan per publish |
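Fix #2's O(1) state comes from maintaining count, total bytes, and first sequence incrementally in every mutation path rather than re-deriving them with Min()/Sum() scans. A minimal Go sketch with illustrative names:

```go
package main

import "fmt"

// streamState sketches fix #2: instead of scanning all messages for
// count/bytes/first-sequence on every publish, the store maintains
// them incrementally in every mutation path.
type streamState struct {
	msgs       map[uint64][]byte
	count      int
	totalBytes int
	firstSeq   uint64
	lastSeq    uint64
}

func newStreamState() *streamState {
	return &streamState{msgs: map[uint64][]byte{}, firstSeq: 1}
}

// store appends a message and updates the counters in O(1).
func (s *streamState) store(payload []byte) {
	s.lastSeq++
	s.msgs[s.lastSeq] = payload
	s.count++
	s.totalBytes += len(payload)
}

// removeFirst models MaxMsgs enforcement: an O(1) state update
// instead of re-deriving Min()/Sum() over the whole map.
func (s *streamState) removeFirst() {
	if m, ok := s.msgs[s.firstSeq]; ok {
		s.totalBytes -= len(m)
		s.count--
		delete(s.msgs, s.firstSeq)
	}
	s.firstSeq++
}

func main() {
	s := newStreamState()
	s.store([]byte("aaaa"))
	s.store([]byte("bb"))
	s.removeFirst() // limits check reads counters, never scans
	fmt.Println(s.count, s.totalBytes, s.firstSeq)
}
```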

Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.

## What would further close the gap

| Change | Expected Impact | Go Reference |
|---|---|---|
| Fan-out parallelism | Deliver to subscribers concurrently instead of serially from the publisher's read loop | Go's `processMsgResults` fans out per-client via goroutines |
| Eliminate per-message GC allocations in FileStore | ~30% improvement on FileStore `AppendAsync` — replace the `StoredMessage` class with a `StoredMessageMeta` struct in the `_messages` dict, reconstruct the full message from `MsgBlock` on read | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs; 80+ sites in FileStore.cs need migration |
| Ordered consumer delivery optimization | Investigate the .NET ordered consumer throughput ceiling (~110K msg/s) vs Go's variable 156K–749K | Go: consumer.go ordered consumer fast path |