natsdotnet/benchmarks_comparison.md

# Go vs .NET NATS Server — Benchmark Comparison

Benchmark run: 2026-03-13 10:06 AM America/Indiana/Indianapolis. The latest refresh used the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`) and completed successfully as a .NET-only run. The Go/.NET comparison tables below remain the last Go-capable comparison baseline.

Environment: Apple M4, .NET SDK 10.0.101, README benchmark command run in the benchmark project's default Debug configuration, Go toolchain installed but the current full-suite run emitted only .NET result blocks.


## Latest README Run (.NET only)

The current refresh came from `/tmp/bench-output.txt` using the benchmark project README workflow. Because the run did not emit any Go comparison blocks, the values below are the latest .NET-only numbers from that run, and the historical Go/.NET comparison tables are preserved below instead of being overwritten with mixed-source ratios.

### Core and JetStream

| Benchmark | .NET msg/s | .NET MB/s | Notes |
|---|---:|---:|---|
| Single Publisher (16B) | 1,392,442 | 21.2 | README full-suite run |
| Single Publisher (128B) | 1,491,226 | 182.0 | README full-suite run |
| PubSub 1:1 (16B) | 717,731 | 11.0 | README full-suite run |
| PubSub 1:1 (16KB) | 28,450 | 444.5 | README full-suite run |
| Fan-Out 1:4 (128B) | 1,451,748 | 177.2 | README full-suite run |
| Multi 4Px4S (128B) | 244,878 | 29.9 | README full-suite run |
| Request-Reply Single (128B) | 6,840 | 0.8 | P50 142.5 us, P99 203.9 us |
| Request-Reply 10Cx2S (16B) | 22,844 | 0.3 | P50 421.1 us, P99 602.1 us |
| JS Sync Publish (16B Memory) | 12,619 | 0.2 | README full-suite run |
| JS Async Publish (128B File) | 46,631 | 5.7 | README full-suite run |
| JS Ordered Consumer (128B) | 108,057 | 13.2 | README full-suite run |
| JS Durable Fetch (128B) | 490,090 | 59.8 | README full-suite run |

### Parser Microbenchmarks

| Benchmark | Ops/s | MB/s | Alloc |
|---|---:|---:|---:|
| Parser PING | 5,756,370 | 32.9 | 0.0 B/op |
| Parser PUB | 2,537,973 | 96.8 | 40.0 B/op |
| Parser HPUB | 2,298,811 | 122.8 | 40.0 B/op |
| Parser PUB split payload | 2,049,535 | 78.2 | 176.0 B/op |

## Current Run Highlights

  1. The parser microbenchmarks show the hot path is already at zero allocation for PING, with contiguous PUB and HPUB still paying a small fixed cost for retained field copies.
  2. Split-payload PUB remains meaningfully more allocation-heavy than contiguous PUB because the parser must preserve unread payload state across reads and then materialize contiguous memory at the current client boundary.
  3. The README-driven suite was a .NET-only refresh, so the comparative Go/.NET ratios below should still be treated as the last Go-capable baseline rather than current same-run ratios.
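Highlight 2's split-payload cost can be illustrated with a minimal Go sketch of the two parser paths: a contiguous fast path that slices the read buffer directly, and a slow path that must allocate to retain partial payload bytes across socket reads. All names here are illustrative stand-ins, not the actual parser types.

```go
package main

import "fmt"

// payloadReader models the contiguous vs split-payload paths described
// above. pending is the retained state that costs the extra allocation.
type payloadReader struct {
	pending []byte // partial payload carried across reads
	need    int    // bytes still missing
}

// feed consumes one read's worth of bytes and returns the complete
// payload once all bytes have arrived, or nil if more reads are needed.
func (p *payloadReader) feed(buf []byte, size int) []byte {
	if p.pending == nil && len(buf) >= size {
		// Fast path: payload is contiguous in this read buffer,
		// so the parser can slice it with no copy.
		return buf[:size]
	}
	if p.pending == nil {
		// Slow path: payload split across reads, so the parser must
		// allocate and copy to materialize contiguous memory later.
		p.pending = make([]byte, 0, size)
		p.need = size
	}
	n := len(buf)
	if n > p.need {
		n = p.need
	}
	p.pending = append(p.pending, buf[:n]...)
	p.need -= n
	if p.need == 0 {
		out := p.pending
		p.pending = nil
		return out
	}
	return nil
}

func main() {
	var r payloadReader
	// Payload "hello" arrives split across two socket reads.
	if out := r.feed([]byte("hel"), 5); out != nil {
		panic("incomplete payload returned early")
	}
	out := r.feed([]byte("lo"), 5)
	fmt.Println(string(out)) // the materialized contiguous payload
}
```

The fast path explains why contiguous PUB stays at a small fixed 40 B/op while the split case pays for the retained buffer.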

## Core NATS — Pub/Sub Throughput

### Single Publisher (no subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---:|---:|---:|---:|---:|
| 16 B | 2,252,242 | 34.4 | 1,610,807 | 24.6 | 0.72x |
| 128 B | 2,199,267 | 268.5 | 1,661,014 | 202.8 | 0.76x |

### Publisher + Subscriber (1:1)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---:|---:|---:|---:|---:|
| 16 B | 313,790 | 4.8 | 909,298 | 13.9 | 2.90x |
| 16 KB | 41,153 | 643.0 | 38,287 | 598.2 | 0.93x |

### Fan-Out (1 Publisher : 4 Subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---:|---:|---:|---:|---:|
| 128 B | 3,217,684 | 392.8 | 1,817,860 | 221.9 | 0.57x |

### Multi-Publisher / Multi-Subscriber (4P x 4S)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---:|---:|---:|---:|---:|
| 128 B | 2,101,337 | 256.5 | 1,527,330 | 186.4 | 0.73x |

## Core NATS — Request/Reply Latency

### Single Client, Single Service

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---|---:|---:|---:|---:|---:|---:|---:|
| 128 B | 9,450 | 7,662 | 0.81x | 103.2 | 128.9 | 145.6 | 170.8 |

### 10 Clients, 2 Services (Queue Group)

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---|---:|---:|---:|---:|---:|---:|---:|
| 16 B | 31,094 | 26,144 | 0.84x | 316.9 | 368.7 | 439.2 | 559.7 |

## JetStream — Publication

| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---|---|---:|---:|---:|
| Synchronous | 16 B | Memory | 17,533 | 14,373 | 0.82x |
| Async (batch) | 128 B | File | 198,237 | 60,416 | 0.30x |

Note: Async file store publish improved from 174 msg/s to 60K msg/s (a 347x improvement) after two rounds of FileStore-level optimizations plus removal of profiling overhead. The remaining 3.3x gap is GC pressure from per-message allocations.


## JetStream — Consumption

| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---:|---:|---:|
| Ordered ephemeral consumer | 748,671 | 114,021 | 0.15x |
| Durable consumer fetch | 662,471 | 488,520 | 0.74x |

Note: Durable fetch improved from 0.13x → 0.60x → 0.74x after Round 6 optimizations (batch flush, ackReply stack formatting, cached CompiledFilter, pooled fetch list). Ordered consumer ratio dropped due to Go benchmark improvement (748K vs 156K in earlier runs); .NET throughput is stable at ~110K msg/s.


## Summary

| Category | Ratio Range | Assessment |
|---|---|---|
| Pub-only throughput | 0.72x–0.76x | Good — within 2x |
| Pub/sub (small payload) | 2.90x | .NET outperforms Go — direct buffer path eliminates all per-message overhead |
| Pub/sub (large payload) | 0.93x | Near parity |
| Fan-out | 0.57x | Improved from 0.18x → 0.44x → 0.66x; batch flush applied but serial delivery remains |
| Multi pub/sub | 0.73x | Improved from 0.49x → 0.84x; variance from system load |
| Request/reply latency | 0.81x–0.84x | Good — improved from 0.77x |
| JetStream sync publish | 0.82x | Good |
| JetStream async file publish | 0.30x | Improved from 0.00x — storage write path dominates |
| JetStream ordered consume | 0.15x | .NET stable at ~110K; Go variance high (156K–749K) |
| JetStream durable fetch | 0.74x | Improved from 0.60x — batch flush + ackReply optimization |

## Key Observations

  1. Small-payload 1:1 pub/sub outperforms Go by ~3x (909K vs 314K msg/s). The per-client direct write buffer with stackalloc header formatting eliminates all per-message heap allocations and channel overhead.
  2. Durable consumer fetch improved to 0.74x (489K vs 662K msg/s) — Round 6 batch flush signaling and string.Create-based ack reply formatting reduced per-message overhead significantly.
  3. Fan-out holds at ~0.57x despite batch flush optimization. The remaining gap is goroutine-level parallelism (Go fans out per-client via goroutines; .NET delivers serially). The batch flush reduces wakeup overhead but doesn't add concurrency.
  4. Request/reply improved to 0.81x–0.84x — deferred flush benefits single-message delivery paths too.
  5. JetStream file store async publish: 0.30x — remaining gap is GC pressure from per-message StoredMessage objects and byte[] copies (Change 2 deferred due to scope: 80+ sites in FileStore.cs need migration).
  6. JetStream ordered consumer: 0.15x — ratio drop is due to Go benchmark variance (749K in this run vs 156K previously); .NET throughput stable at ~110K msg/s. Further investigation needed for the Go variability.
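The zero-allocation header path behind observation 1 follows the same shape as Go's append-based formatting. A minimal sketch of formatting the MSG protocol line into a caller-owned scratch buffer with no intermediate strings (the .NET version uses `stackalloc` plus `TryFormat`; `appendMsgHeader` is an illustrative stand-in, not the server's code):

```go
package main

import (
	"fmt"
	"strconv"
)

// appendMsgHeader formats a core NATS MSG protocol line
// ("MSG <subject> <sid> <size>\r\n") into dst without any
// intermediate string allocations: every piece is appended
// directly into the reused scratch buffer.
func appendMsgHeader(dst []byte, subject string, sid, size int64) []byte {
	dst = append(dst, "MSG "...)
	dst = append(dst, subject...)
	dst = append(dst, ' ')
	dst = strconv.AppendInt(dst, sid, 10)
	dst = append(dst, ' ')
	dst = strconv.AppendInt(dst, size, 10)
	dst = append(dst, '\r', '\n')
	return dst
}

func main() {
	scratch := make([]byte, 0, 512) // reused across deliveries
	scratch = appendMsgHeader(scratch[:0], "orders.created", 7, 128)
	fmt.Printf("%q\n", scratch)
}
```

Because the buffer is reused per delivery, steady-state formatting allocates nothing, which is what removes the per-message heap traffic on the small-payload path.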

## Optimization History

### Round 6: Batch Flush Signaling + Fetch Optimizations

Four optimizations targeting fan-out and consumer fetch hot paths:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 20 | **Per-subscriber flush signal in fan-out** — each `SendMessage` called `_flushSignal.Writer.TryWrite(0)` independently; for 1:4 fan-out, 4 channel writes + 4 write-loop wakeups per published message | Split `SendMessage` into `SendMessageNoFlush` + `SignalFlush`; `ProcessMessage` collects unique clients in a `[ThreadStatic]` `HashSet<INatsClient>` (Go's `pcd` pattern), one flush signal per unique client after fan-out | Reduces channel writes from N to unique-client count per publish |
| 21 | **Per-fetch `CompiledFilter` allocation** — `CompiledFilter.FromConfig(consumer.Config)` was called on every fetch request, allocating a new filter object each time | Cached `CompiledFilter` on `ConsumerHandle` with staleness detection (reference + value check on filter config fields); reused across fetches | Eliminates per-fetch filter allocation |
| 22 | **Per-message string interpolation in ack reply** — `$"$JS.ACK.{stream}.{consumer}.1.{seq}.{deliverySeq}.{ts}.{pending}"` allocated intermediate strings and boxed numeric types on every delivery | Pre-compute the `$"$JS.ACK.{stream}.{consumer}.1."` prefix before the loop; use `stackalloc char[]` + `TryFormat` for the numeric suffix — zero intermediate allocations | Eliminates 4+ string allocs per delivered message |
| 23 | **Per-fetch `List<StoredMessage>` allocation** — `new List<StoredMessage>(batch)` was allocated on every `FetchAsync` call | `[ThreadStatic]` reusable list with `.Clear()` + capacity growth; `PullFetchBatch` snapshots via `.ToArray()` for safe handoff | Eliminates per-fetch list allocation |
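Fix #20's unique-client signaling can be sketched in Go, since it mirrors the reference implementation's `pcd` (per-delivery client set) pattern: buffer every delivery without flushing, then wake each touched client exactly once. `client`, `deliverFanOut`, and `signalFlush` below are illustrative stand-ins, not actual server types.

```go
package main

import "fmt"

// client is a stand-in for a connected subscriber; signalled counts
// flush wakeups for the demo.
type client struct {
	id        int
	signalled int
}

func (c *client) signalFlush() { c.signalled++ }

// deliverFanOut buffers one message to every subscriber without
// flushing, collects unique clients, then signals each client's flush
// exactly once — the "one wakeup per unique client" pattern of fix #20.
// The touched set is reused across publishes, like the [ThreadStatic]
// HashSet in the .NET version.
func deliverFanOut(subs []*client, touched map[*client]struct{}) {
	for c := range touched { // clear the reused set
		delete(touched, c)
	}
	for _, c := range subs {
		// sendMessageNoFlush(c, msg) would append the bytes here.
		touched[c] = struct{}{}
	}
	for c := range touched {
		c.signalFlush()
	}
}

func main() {
	a, b := &client{id: 1}, &client{id: 2}
	// Client a has two overlapping subscriptions, so it appears twice,
	// but it still receives only one flush signal.
	subs := []*client{a, a, b}
	touched := make(map[*client]struct{})
	deliverFanOut(subs, touched)
	fmt.Println(a.signalled, b.signalled)
}
```

The win is purely in wakeup count: delivery work is unchanged, but the write loop is signalled once per unique client instead of once per subscription match.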

### Round 5: Non-blocking ConsumeAsync (ordered + durable consumers)

One root cause was identified and fixed in the MSG.NEXT request handling path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 19 | **Synchronous blocking in `DeliverPullFetchMessages`** — `FetchAsync(...).GetAwaiter().GetResult()` blocked the client's read loop for the full expires timeout (30s). With batch=1000 and only 5 messages available, the fetch polled for message 6 indefinitely; no messages were delivered until the timeout fired, so the client received 0 messages before its own timeout. | Split into two paths: noWait/no-expires uses synchronous fetch (existing behavior for the `FetchAsync` client); expires > 0 spawns a `DeliverPullFetchMessagesAsync` background task that delivers messages incrementally without blocking the read loop, with idle heartbeat support | Enables `ConsumeAsync` for both ordered and durable consumers; ordered consumer: 99K msg/s (0.64x Go) |

### Round 4: Per-Client Direct Write Buffer (pub/sub + fan-out + multi pub/sub)

Four optimizations were implemented in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 15 | **Per-message channel overhead** — each `SendMessage` call went through `Channel<OutboundData>.TryWrite`, incurring lock contention and memory barriers | Replaced channel-based message delivery with a per-client `_directBuf` byte array under `SpinLock`; messages are written directly to a contiguous buffer | Eliminates channel overhead per delivery |
| 16 | **Per-message heap allocation for MSG header** — `_outboundBufferPool.RentBuffer()` allocated a pooled `byte[]` for each MSG header | Replaced with `stackalloc byte[512]` — the MSG header is formatted entirely on the stack, then copied into `_directBuf` | Zero heap allocations per delivery |
| 17 | **Per-message socket write** — the write loop issued one `SendAsync` per channel item, even with coalescing | Double-buffer swap: the write loop swaps `_directBuf` and `_writeBuf` under `SpinLock`, then writes the entire batch in a single `SendAsync`; zero allocation on swap | Single syscall per batch, zero-copy buffer reuse |
| 18 | **Separate wake channels** — `SendMessage` and `WriteProtocol` used different signaling paths | Unified on the `_flushSignal` channel (bounded capacity 1, `DropWrite`); both paths signal the same channel, and the write loop drains both `_directBuf` and `_outbound` on each wake | Single wait point, no missed wakes |

### Round 3: Outbound Write Path (pub/sub + fan-out + fetch)

Three root causes were identified and fixed in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 12 | **Per-message `.ToArray()` allocation in `SendMessage`** — `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber | Replaced `IMemoryOwner` rent/copy/dispose with a direct `byte[]` from the pool; the write loop returns buffers after writing | Eliminates 1 heap alloc per delivery (4 per fan-out message) |
| 13 | **Per-message `WriteAsync` in write loop** — each queued message triggered a separate `_stream.WriteAsync()` system call | Added a 64KB coalesce buffer; drain all pending messages into the contiguous buffer, single `WriteAsync` per batch | Reduces syscalls from N to 1 per batch |
| 14 | **Profiling `Stopwatch` on every message** — `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture`, even for non-JetStream messages | Removed profiling instrumentation from the hot path | Eliminates ~200ns overhead per message |

### Round 2: FileStore AppendAsync Hot Path

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 6 | **Async state machine overhead** — `AppendAsync` was `async ValueTask<ulong>` but never actually awaited | Changed to a synchronous `ValueTask<ulong>` returning `ValueTask.FromResult(_last)` | Eliminates Task state machine allocation |
| 7 | **Double payload copy** — `TransformForPersist` allocated a `byte[]`, then `payload.ToArray()` created a second copy for `StoredMessage` | Reuse the `TransformForPersist` result directly for `StoredMessage.Payload` when no transform is needed (`_noTransform` flag) | Eliminates 1 `byte[]` alloc per message |
| 8 | **Unnecessary TTL work per publish** — `ExpireFromWheel()` and `RegisterTtl()` were called on every write even when `MaxAge=0` | Guarded both with a `_options.MaxAgeMs > 0` check (matches Go: filestore.go:4701) | Eliminates hash wheel overhead when TTL is not configured |
| 9 | **Per-message MsgBlock cache allocation** — `WriteAt` created a new `MessageRecord` for `_cache` on every write | Removed eager cache population; reads now decode from the pending buffer or disk | Eliminates 1 object alloc per message |
| 10 | **Contiguous write buffer** — `MsgBlock._pendingWrites` was a `List<byte[]>` with per-message `byte[]` allocations | Replaced with a single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it | Eliminates per-message `byte[]` encoding alloc; single `RandomAccess.Write` per flush |
| 11 | **Pending buffer read path** — `MsgBlock.Read()` flushed pending writes to disk before reading | Added an in-memory read from `_pendingBuf` when the data is still in the buffer | Avoids an unnecessary disk flush on read-after-write |
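Fix #10's contiguous pending buffer amounts to encoding each record in place and flushing one region per write call. A Go sketch with an illustrative record layout (4-byte length prefix plus payload; the real block format differs):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// pendingBuf sketches fix #10: message records are encoded directly
// into one contiguous buffer instead of a list of per-message byte
// slices, so a flush is a single write of the whole region.
type pendingBuf struct {
	buf []byte
}

// appendRecord encodes a [len(uint32) | payload] record in place,
// with no per-message slice allocation once capacity is warm.
func (p *pendingBuf) appendRecord(payload []byte) {
	p.buf = binary.BigEndian.AppendUint32(p.buf, uint32(len(payload)))
	p.buf = append(p.buf, payload...)
}

// flush returns the contiguous region that would go to disk in one
// positional write (RandomAccess.Write / pwrite), then resets the
// buffer for reuse. The caller must consume the batch before the
// next append, as with the double-buffer handoff elsewhere.
func (p *pendingBuf) flush() []byte {
	out := p.buf
	p.buf = p.buf[:0]
	return out
}

func main() {
	var p pendingBuf
	p.appendRecord([]byte("order-1"))
	p.appendRecord([]byte("order-2"))
	batch := p.flush()
	fmt.Println(len(batch)) // two records in one contiguous write
}
```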

### Round 1: FileStore/StreamManager Layer

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 1 | **Per-message synchronous disk I/O** — `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in `MsgBlock` + a background flush loop in `FileStore` (Go's flushLoop pattern: coalesce 16KB or 8ms) | Eliminates per-message syscall overhead |
| 2 | **O(n) `GetStateAsync` per publish** — `_messages.Keys.Min()` and `_messages.Values.Sum()` ran on every publish for MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, and `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1) | Eliminates O(n) scan per publish |
| 3 | **Unnecessary `LoadAsync` after every append** — `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates redundant disk read per publish |
| 4 | **Redundant `PruneExpiredMessages` per publish** — called before every publish even when `MaxAge=0`, and again inside `EnforceRuntimePolicies` | Guarded with a `MaxAgeMs > 0` check; removed the pre-publish call (the background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | **`PrunePerSubject` loading all messages per publish** — `PrunePerSubject` (called from `EnforceRuntimePolicies`) invoked `ListAsync().GroupBy()` even when `MaxMsgsPer=0` | Guarded with a `MaxMsgsPer > 0` check | Eliminates O(n) scan per publish |
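Fix #2's O(1) state comes from maintaining count, total bytes, and first sequence incrementally in every mutation path rather than re-deriving them with Min()/Sum() scans. A minimal Go sketch with illustrative names:

```go
package main

import "fmt"

// streamState sketches fix #2: instead of scanning all messages for
// count/bytes/first-sequence on every publish, the store maintains
// them incrementally in every mutation path.
type streamState struct {
	msgs       map[uint64][]byte
	count      int
	totalBytes int
	firstSeq   uint64
	lastSeq    uint64
}

func newStreamState() *streamState {
	return &streamState{msgs: map[uint64][]byte{}, firstSeq: 1}
}

// store appends a message and updates the counters in O(1).
func (s *streamState) store(payload []byte) {
	s.lastSeq++
	s.msgs[s.lastSeq] = payload
	s.count++
	s.totalBytes += len(payload)
}

// removeFirst models MaxMsgs enforcement: an O(1) state update
// instead of re-deriving Min()/Sum() over the whole map.
func (s *streamState) removeFirst() {
	if m, ok := s.msgs[s.firstSeq]; ok {
		s.totalBytes -= len(m)
		s.count--
		delete(s.msgs, s.firstSeq)
	}
	s.firstSeq++
}

func main() {
	s := newStreamState()
	s.store([]byte("aaaa"))
	s.store([]byte("bb"))
	s.removeFirst() // limits check reads counters, never scans
	fmt.Println(s.count, s.totalBytes, s.firstSeq)
}
```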

Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.

## What would further close the gap

| Change | Expected Impact | Go Reference |
|---|---|---|
| Fan-out parallelism | Deliver to subscribers concurrently instead of serially from the publisher's read loop | Go's `processMsgResults` fans out per-client via goroutines |
| Eliminate per-message GC allocations in FileStore | ~30% improvement on FileStore `AppendAsync` — replace the `StoredMessage` class with a `StoredMessageMeta` struct in the `_messages` dict, reconstruct the full message from `MsgBlock` on read | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs; 80+ sites in FileStore.cs need migration |
| Ordered consumer delivery optimization | Investigate the .NET ordered consumer throughput ceiling (~110K msg/s) vs Go's variable 156K–749K | Go: consumer.go ordered consumer fast path |