natsdotnet/benchmarks_comparison.md
Joseph Doherty 0a4e7a822f perf: eliminate per-message allocations in pub/sub hot path and coalesce outbound writes
Pub/sub 1:1 (16B) improved from 0.18x to 0.50x, fan-out from 0.18x to 0.44x,
and JetStream durable fetch from 0.13x to 0.64x vs Go. Key changes: replace
.ToArray() copy in SendMessage with pooled buffer handoff, batch multiple small
writes into single WriteAsync via 64KB coalesce buffer in write loop, and remove
profiling Stopwatch instrumentation from ProcessMessage/StreamManager hot paths.
2026-03-13 05:09:36 -04:00


# Go vs .NET NATS Server — Benchmark Comparison

Benchmark run: 2026-03-13. Both servers running on the same machine, tested with identical NATS.Client.Core workloads. Test parallelization disabled to avoid resource contention.

Environment: Apple M4, .NET 10, Go nats-server (latest from golang/nats-server/).


## Core NATS — Pub/Sub Throughput

### Single Publisher (no subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 16 B | 2,138,955 | 32.6 | 1,373,272 | 21.0 | 0.64x |
| 128 B | 1,995,574 | 243.6 | 1,672,825 | 204.2 | 0.84x |

### Publisher + Subscriber (1:1)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 16 B | 1,180,986 | 18.0 | 586,118 | 8.9 | 0.50x |
| 16 KB | 42,660 | 666.6 | 41,555 | 649.3 | 0.97x |

### Fan-Out (1 Publisher : 4 Subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 128 B | 3,200,845 | 390.7 | 1,423,721 | 173.8 | 0.44x |

### Multi-Publisher / Multi-Subscriber (4P x 4S)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 128 B | 3,081,071 | 376.1 | 1,518,459 | 185.4 | 0.49x |

## Core NATS — Request/Reply Latency

### Single Client, Single Service

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (µs) | .NET P50 (µs) | Go P99 (µs) | .NET P99 (µs) |
|---|---|---|---|---|---|---|---|
| 128 B | 9,174 | 7,317 | 0.80x | 106.3 | 134.2 | 149.2 | 175.2 |

### 10 Clients, 2 Services (Queue Group)

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (µs) | .NET P50 (µs) | Go P99 (µs) | .NET P99 (µs) |
|---|---|---|---|---|---|---|---|
| 16 B | 30,386 | 25,639 | 0.84x | 318.5 | 374.2 | 458.4 | 519.5 |

## JetStream — Publication

| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| Synchronous | 16 B | Memory | 15,241 | 12,879 | 0.85x |
| Async (batch) | 128 B | File | 201,055 | 55,268 | 0.27x |

Note: Async file store publish improved from 174 msg/s to 55,268 msg/s (~318x) after two rounds of FileStore-level optimizations plus removal of profiling overhead. The remaining ~4x gap is driven by GC pressure from per-message allocations and by ack delivery overhead.


## JetStream — Consumption

| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---|---|---|
| Ordered ephemeral consumer | 688,061 | N/A | N/A |
| Durable consumer fetch | 701,932 | 450,727 | 0.64x |

Note: Ordered ephemeral consumer is not yet fully supported on the .NET server (API timeout during consumer creation). Durable fetch improved from 0.13x to 0.64x after write coalescing and buffer pooling optimizations in the outbound write path.


## Summary

| Category | Ratio Range | Assessment |
|---|---|---|
| Pub-only throughput | 0.64x–0.84x | Good — within 2x |
| Pub/sub (large payload) | 0.97x | Excellent — near parity |
| Pub/sub (small payload) | 0.50x | Improved from 0.18x |
| Fan-out | 0.44x | Improved from 0.18x |
| Multi pub/sub | 0.49x | Good |
| Request/reply latency | 0.80x–0.84x | Good |
| JetStream sync publish | 0.85x | Good |
| JetStream async file publish | 0.27x | Improved from 0.00x — storage write path dominates |
| JetStream durable fetch | 0.64x | Improved from 0.13x |

## Key Observations

  1. Pub-only and request/reply are within striking distance (0.6x–0.85x), suggesting the core message path is reasonably well ported.
  2. Small-payload pub/sub improved from 0.18x to 0.50x after eliminating per-message .ToArray() allocations in SendMessage, adding write coalescing in the write loop, and removing profiling instrumentation from the hot path.
  3. Fan-out improved from 0.18x to 0.44x — same optimizations. The remaining gap vs Go is primarily vectored I/O (net.Buffers/writev in Go vs sequential WriteAsync in .NET) and per-client scratch buffer reuse (Go's 1KB msgb per client).
  4. JetStream durable fetch improved from 0.13x to 0.64x — the outbound write path optimizations benefit all message delivery, including consumer fetch responses.
  5. Large-payload pub/sub reached near-parity (0.97x) — payload copy dominates, and the protocol overhead optimizations have minimal impact at large sizes.
  6. JetStream file store async publish (0.27x) — remaining gap is GC pressure from per-message StoredMessage objects and byte[] copies (65% of server time).
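
The per-client scratch buffer mentioned in observation 3 is simple enough to sketch. Below is a minimal Go version of the pattern: only `client.msgb` and `msgScratchSize` are real nats-server names; `formatMsgHeader` and the header construction are illustrative, not the actual server code.

```go
package main

import (
	"fmt"
	"strconv"
)

// client carries a reusable scratch buffer for MSG header formatting,
// mirroring Go nats-server's per-client msgb (msgScratchSize bytes).
// Reusing it across deliveries makes header formatting allocation-free.
type client struct {
	msgb [1024]byte
}

// formatMsgHeader renders "MSG <subject> <sid> <size>\r\n" into the scratch
// buffer; the returned slice is only valid until the next call.
func (c *client) formatMsgHeader(subject string, sid, size int) []byte {
	b := c.msgb[:0] // slice over the array: appends write in place, no alloc
	b = append(b, "MSG "...)
	b = append(b, subject...)
	b = append(b, ' ')
	b = strconv.AppendInt(b, int64(sid), 10)
	b = append(b, ' ')
	b = strconv.AppendInt(b, int64(size), 10)
	b = append(b, '\r', '\n')
	return b
}

func main() {
	var c client
	fmt.Printf("%q\n", c.formatMsgHeader("orders.new", 7, 16)) // "MSG orders.new 7 16\r\n"
}
```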

## Optimization History

### Round 3: Outbound Write Path (pub/sub + fan-out + fetch)

Three root causes were identified and fixed in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 12 | Per-message `.ToArray()` allocation in `SendMessage`: `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber | Replaced `IMemoryOwner` rent/copy/dispose with direct `byte[]` from pool; write loop returns buffers after writing | Eliminates 1 heap alloc per delivery (4 per fan-out message) |
| 13 | Per-message `WriteAsync` in write loop: each queued message triggered a separate `_stream.WriteAsync()` system call | Added 64KB coalesce buffer; drain all pending messages into a contiguous buffer, single `WriteAsync` per batch | Reduces syscalls from N to 1 per batch |
| 14 | Profiling `Stopwatch` on every message: `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture`, even for non-JetStream messages | Removed profiling instrumentation from hot path | Eliminates ~200ns overhead per message |
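
The drain-and-coalesce step in fix 13 can be sketched in a few lines. This is a hedged Go illustration of the technique, not the server's code: `coalesceBatch` and the frame contents are invented for the example.

```go
package main

import "fmt"

// coalesceBatch drains queued frames into one contiguous buffer so the
// writer can issue a single write per batch instead of one per message.
// Frames that do not fit in the remaining space are left for the next batch.
func coalesceBatch(pending [][]byte, buf []byte) (n int, consumed int) {
	for _, frame := range pending {
		if n+len(frame) > len(buf) {
			break // buffer full: flush now, pick this frame up next round
		}
		n += copy(buf[n:], frame)
		consumed++
	}
	return n, consumed
}

func main() {
	buf := make([]byte, 64*1024) // 64KB coalesce buffer, as in the fix
	frames := [][]byte{
		[]byte("MSG orders.a 1 2\r\nhi\r\n"),
		[]byte("MSG orders.b 2 2\r\nok\r\n"),
	}
	n, consumed := coalesceBatch(frames, buf)
	fmt.Println(n, consumed) // 44 2 — both frames in one write
}
```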

### Round 2: FileStore AppendAsync Hot Path

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 6 | Async state machine overhead: `AppendAsync` was `async ValueTask<ulong>` but never actually awaited | Changed to synchronous `ValueTask<ulong>` returning `ValueTask.FromResult(_last)` | Eliminates Task state machine allocation |
| 7 | Double payload copy: `TransformForPersist` allocated a `byte[]`, then `payload.ToArray()` created a second copy for `StoredMessage` | Reuse `TransformForPersist` result directly for `StoredMessage.Payload` when no transform needed (`_noTransform` flag) | Eliminates 1 `byte[]` alloc per message |
| 8 | Unnecessary TTL work per publish: `ExpireFromWheel()` and `RegisterTtl()` called on every write even when MaxAge=0 | Guarded both with `_options.MaxAgeMs > 0` check (matches Go: filestore.go:4701) | Eliminates hash wheel overhead when TTL not configured |
| 9 | Per-message MsgBlock cache allocation: `WriteAt` created a new `MessageRecord` for `_cache` on every write | Removed eager cache population; reads now decode from pending buffer or disk | Eliminates 1 object alloc per message |
| 10 | Contiguous write buffer: `MsgBlock._pendingWrites` was `List<byte[]>` with per-message `byte[]` allocations | Replaced with single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it | Eliminates per-message `byte[]` encoding alloc; single `RandomAccess.Write` per flush |
| 11 | Pending buffer read path: `MsgBlock.Read()` flushed pending writes to disk before reading | Added in-memory read from `_pendingBuf` when data is still in the buffer | Avoids unnecessary disk flush on read-after-write |
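
Fix 10's contiguous pending buffer amounts to encoding records in place and flushing once. A rough Go sketch of the idea: the 12-byte length-prefixed record layout here is invented for illustration and is not the real file format.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// pendingBuf models the contiguous pending-write buffer: each record is
// encoded directly into one growing []byte, so a flush is a single write
// of the whole buffer rather than N per-message writes.
type pendingBuf struct {
	buf []byte
}

// append encodes seq + payload length + payload straight into the buffer,
// with no intermediate per-message []byte allocation.
func (p *pendingBuf) append(seq uint64, payload []byte) {
	var hdr [12]byte
	binary.LittleEndian.PutUint64(hdr[0:8], seq)
	binary.LittleEndian.PutUint32(hdr[8:12], uint32(len(payload)))
	p.buf = append(p.buf, hdr[:]...)
	p.buf = append(p.buf, payload...)
}

// flush would hand p.buf to one RandomAccess.Write-style call; here it
// just reports the byte count and resets the buffer for reuse.
func (p *pendingBuf) flush() int {
	n := len(p.buf)
	p.buf = p.buf[:0]
	return n
}

func main() {
	var p pendingBuf
	p.append(1, []byte("hello"))
	p.append(2, []byte("world"))
	fmt.Println(p.flush()) // 34: two records, one write
}
```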

### Round 1: FileStore/StreamManager Layer

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 1 | Per-message synchronous disk I/O: `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in MsgBlock + background flush loop in FileStore (Go's `flushLoop` pattern: coalesce 16KB or 8ms) | Eliminates per-message syscall overhead |
| 2 | O(n) `GetStateAsync` per publish: `_messages.Keys.Min()` and `_messages.Values.Sum()` on every publish for MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1) | Eliminates O(n) scan per publish |
| 3 | Unnecessary `LoadAsync` after every append: `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates redundant disk read per publish |
| 4 | Redundant `PruneExpiredMessages` per publish: called before every publish even when MaxAge=0, and again inside `EnforceRuntimePolicies` | Guarded with MaxAgeMs > 0 check; removed the pre-publish call (background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | `PrunePerSubject` loading all messages per publish: `EnforceRuntimePolicies` → `PrunePerSubject` called `ListAsync().GroupBy()` even when MaxMsgsPer=0 | Guarded with MaxMsgsPer > 0 check | Eliminates O(n) scan per publish |

Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.
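
Fix 2's incremental bookkeeping is worth a small sketch. A Go illustration with invented names shaped after the `_messageCount`/`_totalBytes`/`_firstSeq` fields; the real store updates these in every mutation path.

```go
package main

import "fmt"

// streamState keeps count, total bytes, and first/last sequence as
// incrementally maintained fields, so per-publish limit checks are O(1)
// instead of re-scanning all stored messages (the original Min()/Sum()).
type streamState struct {
	count    uint64
	bytes    uint64
	firstSeq uint64
	lastSeq  uint64
}

// store records one appended message.
func (s *streamState) store(size uint64) {
	s.lastSeq++
	if s.count == 0 {
		s.firstSeq = s.lastSeq
	}
	s.count++
	s.bytes += size
}

// removeFirst models MaxMsgs/MaxBytes enforcement: drop the oldest message
// and advance firstSeq without rescanning the store.
func (s *streamState) removeFirst(size uint64) {
	s.count--
	s.bytes -= size
	s.firstSeq++
}

func main() {
	var s streamState
	s.store(10)
	s.store(20)
	s.removeFirst(10)
	fmt.Println(s.count, s.bytes, s.firstSeq) // 1 20 2
}
```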

## What would further close the gap

| Change | Expected Impact | Go Reference |
|---|---|---|
| Vectored I/O (writev) | Eliminate coalesce copy in write loop: write gathered buffers in a single syscall | Go: `net.Buffers.WriteTo()` → `writev()` in `flushOutbound()` |
| Per-client scratch buffer | Reuse 1KB buffer for MSG header formatting across deliveries | Go: `client.msgb` (1024-byte scratch, `msgScratchSize`) |
| Batch flush signaling | Deduplicate write loop wakeups: signal once per readloop iteration, not per delivery | Go: `pcd` map tracks affected clients, `flushClients()` at end of readloop |
| Eliminate per-message GC allocations | ~30% improvement on FileStore `AppendAsync`: pool or eliminate `StoredMessage` objects | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs |