
Joseph Doherty fb0d31c615 docs: refresh benchmark comparison after SubList optimization

2026-03-13 10:18:52 -04:00


Go vs .NET NATS Server — Benchmark Comparison

Benchmark run: 2026-03-13 10:16 AM America/Indiana/Indianapolis. Both servers ran on the same machine, driven by the command from the benchmark project README: `dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`. Test parallelization remained disabled inside the benchmark assembly.

Environment: Apple M4; .NET SDK 10.0.101; benchmarks run via the README command in the benchmark project's default Debug configuration; Go toolchain installed; Go reference server built from golang/nats-server/.

Core NATS — Pub/Sub Throughput

Single Publisher (no subscribers)

Payload	Go msg/s	Go MB/s	.NET msg/s	.NET MB/s	Ratio (.NET/Go)
16 B	2,258,647	34.5	1,275,230	19.5	0.56x
128 B	2,251,274	274.8	1,661,668	202.8	0.74x
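
The MB/s and Ratio columns in these tables can be recomputed from the msg/s figures. The sketch below shows the arithmetic. `ThroughputMath` is a hypothetical helper, not part of the benchmark project, and the 1,048,576-byte megabyte is an inference from the published numbers (for example, 32,111 msg/s at 16 KB gives 501.7 only with that divisor), not a documented choice.

```csharp
using System;

// Hypothetical helper for checking the table arithmetic; not part of the benchmark project.
public static class ThroughputMath
{
    // Assumption: "MB" in the tables means 1024 * 1024 bytes (inferred from the published values).
    private const double BytesPerMb = 1024.0 * 1024.0;

    // Payload throughput in MB/s for a given message rate and payload size.
    public static double MegabytesPerSecond(long messagesPerSecond, int payloadBytes)
        => messagesPerSecond * (double)payloadBytes / BytesPerMb;

    // The "Ratio (.NET/Go)" column: values above 1.0 mean .NET was faster.
    public static double Ratio(long dotnetMessagesPerSecond, long goMessagesPerSecond)
        => (double)dotnetMessagesPerSecond / goMessagesPerSecond;
}
```

For the 16 B row, `ThroughputMath.MegabytesPerSecond(2_258_647, 16)` comes out at about 34.46, which rounds to the 34.5 shown, and `ThroughputMath.Ratio(1_275_230, 2_258_647)` is about 0.565, shown as 0.56x.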

Publisher + Subscriber (1:1)

Payload	Go msg/s	Go MB/s	.NET msg/s	.NET MB/s	Ratio (.NET/Go)
16 B	296,374	4.5	875,105	13.4	2.95x
16 KB	32,111	501.7	30,030	469.2	0.94x

Fan-Out (1 Publisher : 4 Subscribers)

Payload	Go msg/s	Go MB/s	.NET msg/s	.NET MB/s	Ratio (.NET/Go)
128 B	2,387,889	291.5	1,780,888	217.4	0.75x

Multi-Publisher / Multi-Subscriber (4P x 4S)

Payload	Go msg/s	Go MB/s	.NET msg/s	.NET MB/s	Ratio (.NET/Go)
128 B	1,079,112	131.7	953,596	116.4	0.88x

Core NATS — Request/Reply Latency

Single Client, Single Service

Payload	Go msg/s	.NET msg/s	Ratio	Go P50 (us)	.NET P50 (us)	Go P99 (us)	.NET P99 (us)
128 B	8,506	7,182	0.84x	114.9	135.2	161.2	189.8

10 Clients, 2 Services (Queue Group)

Payload	Go msg/s	.NET msg/s	Ratio	Go P50 (us)	.NET P50 (us)	Go P99 (us)	.NET P99 (us)
16 B	26,610	22,533	0.85x	367.7	425.3	487.4	622.5

JetStream — Publication

Mode	Payload	Storage	Go msg/s	.NET msg/s	Ratio (.NET/Go)
Synchronous	16 B	Memory	13,756	9,954	0.72x
Async (batch)	128 B	File	171,761	50,711	0.30x

Note: Async file-store publish remains the largest JetStream gap at 0.30x. The bottleneck is still the storage write path and the remaining managed allocation pressure around persisted message state.

JetStream — Consumption

Mode	Go msg/s	.NET msg/s	Ratio (.NET/Go)
Ordered ephemeral consumer	135,704	107,168	0.79x
Durable consumer fetch	533,441	375,652	0.70x

Note: Ordered-consumer results in this run are much closer to parity than in earlier snapshots. That suggests prior Go-side variance was material; .NET throughput is still clustered around 107K msg/s.

Hot Path Microbenchmarks (.NET only)

SubList

Benchmark	.NET msg/s	.NET MB/s
SubList Exact Match (128 subjects)	17,746,607	236.9
SubList Wildcard Match	18,811,278	251.2
SubList Queue Match	20,624,510	157.4
SubList Remote Interest	264,725	4.3

Parser

Benchmark	Ops/s	MB/s	Alloc
Parser PING	5,598,176	32.0	0.0 B/op
Parser PUB	2,701,645	103.1	40.0 B/op
Parser HPUB	2,177,745	116.3	40.0 B/op
Parser PUB split payload	1,702,439	64.9	176.0 B/op

Summary

Category	Ratio Range	Assessment
Pub-only throughput	0.56x–0.74x	Mixed — 128 B is solid, 16 B still trails materially
Pub/sub (small payload)	2.95x	.NET outperforms Go decisively
Pub/sub (large payload)	0.94x	Near parity
Fan-out	0.75x	Good improvement; still limited by serial delivery
Multi pub/sub	0.88x	Close to parity in this run
Request/reply latency	0.84x–0.85x	Good
JetStream sync publish	0.72x	Good
JetStream async file publish	0.30x	Storage write path still dominates
JetStream ordered consume	0.79x	Much closer to parity in this run
JetStream durable fetch	0.70x	Good

Key Observations

Small-payload 1:1 pub/sub still beats Go by ~3x (875K vs 296K msg/s). The direct write path continues to pay off when message fan-out is simple and payloads are tiny.
Fan-out and multi pub/sub both improved in this run, to 0.75x and 0.88x respectively. The remaining gap is still consistent with Go's more naturally parallel fan-out model.
Ordered consumer moved up to 0.79x (107K vs 136K msg/s). That is materially stronger than earlier runs and suggests previous Go-side variance was distorting the comparison more than the .NET consumer path itself.
Durable fetch remains solid at 0.70x. The Round 6 fetch-path work is still holding, but there is room left in consumer dispatch and storage reads.
Async file-store publish is still the largest server-level gap at 0.30x. The storage layer remains the highest-value runtime target after parser and SubList hot-path cleanup.
The new SubList microbenchmarks show effectively zero temporary allocation per operation for exact, wildcard, queue, and remote-interest lookups in the current implementation. Parser contiguous hot paths also remain small and stable, while split-payload PUB still pays a higher copy cost.

Optimization History

Round 6: Batch Flush Signaling + Fetch Optimizations

Four optimizations targeting fan-out and consumer fetch hot paths:

#	Root Cause	Fix	Impact
20	Per-subscriber flush signal in fan-out — each `SendMessage` called `_flushSignal.Writer.TryWrite(0)` independently; for 1:4 fan-out, 4 channel writes + 4 write-loop wakeups per published message	Split `SendMessage` into `SendMessageNoFlush` + `SignalFlush`; `ProcessMessage` collects unique clients in `[ThreadStatic] HashSet<INatsClient>` (Go's `pcd` pattern), one flush signal per unique client after fan-out	Reduces channel writes from N to unique-client-count per publish
21	Per-fetch `CompiledFilter` allocation — `CompiledFilter.FromConfig(consumer.Config)` called on every fetch request, allocating a new filter object each time	Cached `CompiledFilter` on `ConsumerHandle` with staleness detection (reference + value check on filter config fields); reused across fetches	Eliminates per-fetch filter allocation
22	Per-message string interpolation in ack reply — `$"$JS.ACK.{stream}.{consumer}.1.{seq}.{deliverySeq}.{ts}.{pending}"` allocated intermediate strings and boxed numeric types on every delivery	Pre-compute `$"$JS.ACK.{stream}.{consumer}.1."` prefix before loop; use `stackalloc char[]` + `TryFormat` for numeric suffix — zero intermediate allocations	Eliminates 4+ string allocs per delivered message
23	Per-fetch `List<StoredMessage>` allocation — `new List<StoredMessage>(batch)` allocated on every `FetchAsync` call	`[ThreadStatic]` reusable list with `.Clear()` + capacity growth; `PullFetchBatch` snapshots via `.ToArray()` for safe handoff	Eliminates per-fetch list allocation
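
As an illustration of item 20, here is a minimal sketch of batched flush signaling. The method names `SendMessageNoFlush` and `SignalFlush` and the `[ThreadStatic] HashSet` come from the table above. `IFlushableClient`, `FanOut`, and `Deliver` are hypothetical stand-ins for the server's real types, whose exact shapes are not shown in this document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the server's client abstraction.
public interface IFlushableClient
{
    void SendMessageNoFlush(ReadOnlySpan<byte> message); // queue bytes only, no wakeup
    void SignalFlush();                                    // wake the client's write loop
}

public static class FanOut
{
    // Per-thread scratch set, reused across publishes so fan-out itself does not allocate.
    [ThreadStatic] private static HashSet<IFlushableClient> t_pending;

    public static void Deliver(IReadOnlyList<IFlushableClient> subscriptions, ReadOnlySpan<byte> message)
    {
        var pending = t_pending ??= new HashSet<IFlushableClient>();
        pending.Clear();

        for (int i = 0; i < subscriptions.Count; i++)
        {
            var client = subscriptions[i];
            client.SendMessageNoFlush(message);
            pending.Add(client); // several subscriptions on one client collapse to one entry
        }

        // One flush signal per unique client, after all messages are queued.
        foreach (var client in pending)
            client.SignalFlush();

        pending.Clear(); // drop references so the scratch set does not keep clients alive
    }
}
```

With four subscriptions spread over two connections, this issues four queue operations but only two flush signals.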

Round 5: Non-blocking ConsumeAsync (ordered + durable consumers)

One root cause was identified and fixed in the MSG.NEXT request handling path:

#	Root Cause	Fix	Impact
19	Synchronous blocking in `DeliverPullFetchMessages`: `FetchAsync(...).GetAwaiter().GetResult()` blocked the client's read loop for the full `expires` timeout (30s). With `batch=1000` and only 5 messages available, the fetch kept polling for message 6 until the timeout fired. Nothing was delivered before then, so the client reached its own timeout having received 0 messages.	Split into two paths: `noWait`/no-expires uses synchronous fetch (existing behavior for `FetchAsync` client); `expires > 0` spawns a `DeliverPullFetchMessagesAsync` background task that delivers messages incrementally without blocking the read loop, with idle heartbeat support	Enables `ConsumeAsync` for both ordered and durable consumers; ordered consumer: 99K msg/s (0.64x Go)
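
The `expires > 0` path can be sketched as follows. This is not the server's `DeliverPullFetchMessagesAsync`. `PullFetch`, the `ChannelReader<string>` message source, and the `deliver` callback are all hypothetical, and idle heartbeats and the synchronous `noWait` path are left out.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical sketch of incremental delivery for a pull request with an expiry.
public static class PullFetch
{
    // Delivers up to `batch` messages as they become available and stops at `expires`.
    // Returns how many messages were delivered.
    public static async Task<int> DeliverAsync(
        ChannelReader<string> store, int batch, TimeSpan expires, Action<string> deliver)
    {
        using var cts = new CancellationTokenSource(expires);
        int delivered = 0;
        try
        {
            while (delivered < batch)
            {
                string msg = await store.ReadAsync(cts.Token).ConfigureAwait(false);
                deliver(msg); // each message goes out as soon as it is read
                delivered++;
            }
        }
        catch (OperationCanceledException) { /* expires elapsed with a partial batch */ }
        catch (ChannelClosedException) { /* the source completed */ }
        return delivered;
    }

    // Read-loop entry point: returns immediately instead of calling GetAwaiter().GetResult().
    public static Task<int> Start(
        ChannelReader<string> store, int batch, TimeSpan expires, Action<string> deliver)
        => Task.Run(() => DeliverAsync(store, batch, expires, deliver));
}
```

With `batch=1000` and five messages queued, the five messages are handed to `deliver` right away and the task completes when `expires` elapses, rather than holding everything back until the timeout.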

Round 4: Per-Client Direct Write Buffer (pub/sub + fan-out + multi pub/sub)

Four optimizations were implemented in the message delivery hot path:

#	Root Cause	Fix	Impact
15	Per-message channel overhead — each `SendMessage` call went through `Channel<OutboundData>.TryWrite`, incurring lock contention and memory barriers	Replaced channel-based message delivery with per-client `_directBuf` byte array under `SpinLock`; messages written directly to contiguous buffer	Eliminates channel overhead per delivery
16	Per-message heap allocation for MSG header — `_outboundBufferPool.RentBuffer()` allocated a pooled `byte[]` for each MSG header	Replaced with `stackalloc byte[512]` — MSG header formatted entirely on the stack, then copied into `_directBuf`	Zero heap allocations per delivery
17	Per-message socket write — write loop issued one `SendAsync` per channel item, even with coalescing	Double-buffer swap: write loop swaps `_directBuf` ↔ `_writeBuf` under `SpinLock`, then writes the entire batch in a single `SendAsync`; zero allocation on swap	Single syscall per batch, zero-copy buffer reuse
18	Separate wake channels — `SendMessage` and `WriteProtocol` used different signaling paths	Unified on `_flushSignal` channel (bounded capacity 1, DropWrite); both paths signal the same channel, write loop drains both `_directBuf` and `_outbound` on each wake	Single wait point, no missed wakes
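
Items 15 and 17 together amount to a double-buffered write path. The sketch below assumes a single write loop per client and shows only the buffer handling. The field names `_directBuf` and `_writeBuf` follow the table, while `DirectWriteBuffer`, `Append`, `SwapForWrite`, and the buffer sizes are hypothetical.

```csharp
using System;
using System.Threading;

// Hypothetical sketch of a per-client direct buffer with a double-buffer swap.
public sealed class DirectWriteBuffer
{
    private SpinLock _lock = new SpinLock(enableThreadOwnerTracking: false);
    private byte[] _directBuf = new byte[64 * 1024]; // producers append here
    private byte[] _writeBuf = new byte[64 * 1024];  // owned by the write loop
    private int _directLen;

    // Producer side: copy one already-formatted frame into the shared buffer.
    public void Append(ReadOnlySpan<byte> frame)
    {
        bool taken = false;
        try
        {
            _lock.Enter(ref taken);
            if (_directLen + frame.Length > _directBuf.Length)
                Array.Resize(ref _directBuf, Math.Max(_directBuf.Length * 2, _directLen + frame.Length));
            frame.CopyTo(_directBuf.AsSpan(_directLen));
            _directLen += frame.Length;
        }
        finally { if (taken) _lock.Exit(); }
    }

    // Write-loop side: swap the two arrays under the lock and return the filled batch.
    // The returned memory is only valid until the next SwapForWrite call.
    public ReadOnlyMemory<byte> SwapForWrite()
    {
        bool taken = false;
        int len;
        try
        {
            _lock.Enter(ref taken);
            (_directBuf, _writeBuf) = (_writeBuf, _directBuf);
            len = _directLen;
            _directLen = 0;
        }
        finally { if (taken) _lock.Exit(); }
        return _writeBuf.AsMemory(0, len);
    }
}
```

The write loop would pass the returned memory to a single socket send. Because the swap only exchanges two array references, the lock is held briefly and no per-batch allocation is needed once the buffers have grown to their working size.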

Round 3: Outbound Write Path (pub/sub + fan-out + fetch)

Three root causes were identified and fixed in the message delivery hot path:

#	Root Cause	Fix	Impact
12	Per-message `.ToArray()` allocation in SendMessage — `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber	Replaced `IMemoryOwner` rent/copy/dispose with direct `byte[]` from pool; write loop returns buffers after writing	Eliminates 1 heap alloc per delivery (4 per fan-out message)
13	Per-message `WriteAsync` in write loop — each queued message triggered a separate `_stream.WriteAsync()` system call	Added 64KB coalesce buffer; drain all pending messages into contiguous buffer, single `WriteAsync` per batch	Reduces syscalls from N to 1 per batch
14	Profiling `Stopwatch` on every message — `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture` even for non-JetStream messages	Removed profiling instrumentation from hot path	Eliminates ~200ns overhead per message
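
Item 13's coalescing step might look like the sketch below, which drains whatever is queued into one buffer and writes it once. `CoalescingWriter` and `FlushPendingAsync` are hypothetical names, the `ChannelReader<byte[]>` queue is an assumption about how pending messages are held, and the real write loop's signaling and buffer pooling are omitted.

```csharp
using System;
using System.IO;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical sketch of draining queued messages into one coalesce buffer.
public static class CoalescingWriter
{
    public const int CoalesceSize = 64 * 1024;

    // Copies everything currently queued into `buffer` and writes it with as few
    // WriteAsync calls as possible. Returns the number of messages written.
    public static async Task<int> FlushPendingAsync(
        ChannelReader<byte[]> pending, Stream stream, byte[] buffer)
    {
        int used = 0, count = 0;
        while (pending.TryRead(out var msg))
        {
            if (used + msg.Length > buffer.Length)
            {
                // Buffer is full: write out what has accumulated before continuing.
                if (used > 0)
                {
                    await stream.WriteAsync(buffer.AsMemory(0, used)).ConfigureAwait(false);
                    used = 0;
                }
                // A message larger than the whole buffer is written on its own.
                if (msg.Length > buffer.Length)
                {
                    await stream.WriteAsync(msg).ConfigureAwait(false);
                    count++;
                    continue;
                }
            }
            msg.CopyTo(buffer, used);
            used += msg.Length;
            count++;
        }
        if (used > 0)
            await stream.WriteAsync(buffer.AsMemory(0, used)).ConfigureAwait(false);
        return count;
    }
}
```

When the queued messages fit in the buffer, N pending messages result in one `WriteAsync` call instead of N.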

Round 2: FileStore AppendAsync Hot Path

#	Root Cause	Fix	Impact
6	Async state machine overhead — `AppendAsync` was `async ValueTask<ulong>` but never actually awaited	Changed to synchronous `ValueTask<ulong>` returning `ValueTask.FromResult(_last)`	Eliminates Task state machine allocation
7	Double payload copy — `TransformForPersist` allocated `byte[]` then `payload.ToArray()` created second copy for `StoredMessage`	Reuse `TransformForPersist` result directly for `StoredMessage.Payload` when no transform needed (`_noTransform` flag)	Eliminates 1 `byte[]` alloc per message
8	Unnecessary TTL work per publish — `ExpireFromWheel()` and `RegisterTtl()` called on every write even when `MaxAge=0`	Guarded both with `_options.MaxAgeMs > 0` check (matches Go: `filestore.go:4701`)	Eliminates hash wheel overhead when TTL not configured
9	Per-message MsgBlock cache allocation — `WriteAt` created `new MessageRecord` for `_cache` on every write	Removed eager cache population; reads now decode from pending buffer or disk	Eliminates 1 object alloc per message
10	Contiguous write buffer — `MsgBlock._pendingWrites` was `List<byte[]>` with per-message `byte[]` allocations	Replaced with single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it	Eliminates per-message `byte[]` encoding alloc; single `RandomAccess.Write` per flush
11	Pending buffer read path — `MsgBlock.Read()` flushed pending writes to disk before reading	Added in-memory read from `_pendingBuf` when data is still in the buffer	Avoids unnecessary disk flush on read-after-write
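
Items 10 and 11 can be pictured with a small sketch: records are appended to one contiguous pending buffer, a read-after-write is served from that buffer, and a flush issues one write. `PendingBlock` and its members are hypothetical, the sketch is single-threaded, it assumes a seekable stream and .NET 7+ for `Stream.ReadExactly`, and it ignores the real `MessageRecord` encoding.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of a block with a contiguous pending-write buffer.
public sealed class PendingBlock
{
    private byte[] _pendingBuf = new byte[16 * 1024];
    private int _pendingLen;
    private long _flushedBytes; // bytes already written to the stream
    private ulong _last;
    private readonly Dictionary<ulong, (long Offset, int Length)> _index = new();
    private readonly Stream _file;

    public PendingBlock(Stream file) => _file = file;

    // Appends the payload to the pending buffer; no per-message array is allocated.
    public ulong Append(ReadOnlySpan<byte> payload)
    {
        if (_pendingLen + payload.Length > _pendingBuf.Length)
            Array.Resize(ref _pendingBuf, Math.Max(_pendingBuf.Length * 2, _pendingLen + payload.Length));
        payload.CopyTo(_pendingBuf.AsSpan(_pendingLen));
        _index[++_last] = (_flushedBytes + _pendingLen, payload.Length);
        _pendingLen += payload.Length;
        return _last;
    }

    // Serves unflushed records from memory instead of forcing a flush first.
    public byte[] Read(ulong seq)
    {
        var (offset, length) = _index[seq];
        if (offset >= _flushedBytes)
            return _pendingBuf.AsSpan((int)(offset - _flushedBytes), length).ToArray();

        var result = new byte[length];
        _file.Position = offset;
        _file.ReadExactly(result);
        return result;
    }

    // Writes everything pending in a single call.
    public void Flush()
    {
        _file.Position = _flushedBytes;
        _file.Write(_pendingBuf, 0, _pendingLen);
        _flushedBytes += _pendingLen;
        _pendingLen = 0;
    }
}
```

A record appended a moment ago is returned by `Read` without touching the stream, and the same record is read back from the stream once `Flush` has run.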

Round 1: FileStore/StreamManager Layer

#	Root Cause	Fix	Impact
1	Per-message synchronous disk I/O — `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message	Added write buffering in MsgBlock + background flush loop in FileStore (Go's `flushLoop` pattern: coalesce 16KB or 8ms)	Eliminates per-message syscall overhead
2	O(n) `GetStateAsync` per publish — `_messages.Keys.Min()` and `_messages.Values.Sum()` on every publish for MaxMsgs/MaxBytes checks	Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1)	Eliminates O(n) scan per publish
3	Unnecessary `LoadAsync` after every append — `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured	Made `LoadAsync` conditional on mirror/source replication being configured	Eliminates redundant disk read per publish
4	Redundant `PruneExpiredMessages` per publish — called before every publish even when `MaxAge=0`, and again inside `EnforceRuntimePolicies`	Guarded with `MaxAgeMs > 0` check; removed the pre-publish call (background expiry timer handles it)	Eliminates O(n) scan per publish
5	`PrunePerSubject` loading all messages per publish: `EnforceRuntimePolicies` → `PrunePerSubject` called `ListAsync().GroupBy()` even when `MaxMsgsPer=0`	Guarded with `MaxMsgsPer > 0` check	Eliminates O(n) scan per publish
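
Item 2's incremental bookkeeping can be sketched as below. The field names `_messageCount`, `_totalBytes`, and `_firstSeq` come from the table. `IncrementalState` and its methods are hypothetical, and the real store has more mutation paths than the two shown here.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of O(1) stream state kept up to date on every mutation.
public sealed class IncrementalState
{
    private readonly Dictionary<ulong, byte[]> _messages = new();
    private ulong _firstSeq, _lastSeq;
    private long _messageCount, _totalBytes;

    public ulong Append(byte[] payload)
    {
        _messages[++_lastSeq] = payload;
        if (_messageCount == 0) _firstSeq = _lastSeq;
        _messageCount++;
        _totalBytes += payload.Length;
        return _lastSeq;
    }

    public bool Remove(ulong seq)
    {
        if (!_messages.Remove(seq, out var payload)) return false;
        _messageCount--;
        _totalBytes -= payload.Length;
        if (_messageCount == 0)
            _firstSeq = 0;
        else if (seq == _firstSeq)
            while (!_messages.ContainsKey(++_firstSeq)) { } // skip over removed sequences
        return true;
    }

    // No Keys.Min() or Values.Sum(): the answer is read straight from the counters.
    public (long Messages, long Bytes, ulong FirstSeq, ulong LastSeq) GetState()
        => (_messageCount, _totalBytes, _firstSeq, _lastSeq);
}
```

`GetState` is constant time. The gap-skipping loop in `Remove` is not strictly O(1) for a single call, but each sequence number is skipped at most once over the life of the stream.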

Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.

What would further close the gap

Change	Expected Impact	Go Reference
Fan-out parallelism	Deliver to subscribers concurrently instead of serially from publisher's read loop	Go: `processMsgResults` fans out per-client via goroutines
Eliminate per-message GC allocations in FileStore	~30% improvement on FileStore AppendAsync — replace `StoredMessage` class with `StoredMessageMeta` struct in `_messages` dict, reconstruct full message from MsgBlock on read	Go stores in `cache.buf`/`cache.idx` with zero per-message allocs; 80+ sites in FileStore.cs need migration
Ordered consumer delivery optimization	Investigate .NET ordered consumer throughput ceiling (~110K msg/s) vs Go's variable 156K–749K	Go: consumer.go ordered consumer fast path
