# Go vs .NET NATS Server — Benchmark Comparison
Benchmark run: 2026-03-13. Both servers running on the same machine, tested with identical NATS.Client.Core workloads. Test parallelization disabled to avoid resource contention.
Environment: Apple M4, .NET 10, Go nats-server (latest from golang/nats-server/).
## Core NATS — Pub/Sub Throughput

### Single Publisher (no subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 16 B | 2,138,955 | 32.6 | 1,373,272 | 21.0 | 0.64x |
| 128 B | 1,995,574 | 243.6 | 1,672,825 | 204.2 | 0.84x |
### Publisher + Subscriber (1:1)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 16 B | 1,180,986 | 18.0 | 586,118 | 8.9 | 0.50x |
| 16 KB | 42,660 | 666.6 | 41,555 | 649.3 | 0.97x |
### Fan-Out (1 Publisher : 4 Subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 128 B | 3,200,845 | 390.7 | 1,423,721 | 173.8 | 0.44x |
### Multi-Publisher / Multi-Subscriber (4P x 4S)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| 128 B | 3,081,071 | 376.1 | 1,518,459 | 185.4 | 0.49x |
## Core NATS — Request/Reply Latency

### Single Client, Single Service

| Payload | Go msg/s | .NET msg/s | Ratio (.NET/Go) | Go P50 (µs) | .NET P50 (µs) | Go P99 (µs) | .NET P99 (µs) |
|---|---|---|---|---|---|---|---|
| 128 B | 9,174 | 7,317 | 0.80x | 106.3 | 134.2 | 149.2 | 175.2 |
### 10 Clients, 2 Services (Queue Group)

| Payload | Go msg/s | .NET msg/s | Ratio (.NET/Go) | Go P50 (µs) | .NET P50 (µs) | Go P99 (µs) | .NET P99 (µs) |
|---|---|---|---|---|---|---|---|
| 16 B | 30,386 | 25,639 | 0.84x | 318.5 | 374.2 | 458.4 | 519.5 |
## JetStream — Publication

| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---|---|---|---|---|
| Synchronous | 16 B | Memory | 15,241 | 12,879 | 0.85x |
| Async (batch) | 128 B | File | 201,055 | 55,268 | 0.27x |
Note: Async file store publish improved from 174 msg/s to 55,268 msg/s (318x improvement) after two rounds of FileStore-level optimizations plus profiling overhead removal. Remaining 4x gap is GC pressure from per-message allocations and ack delivery overhead.
## JetStream — Consumption

| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|---|---|---|---|
| Ordered ephemeral consumer | 688,061 | N/A | N/A |
| Durable consumer fetch | 701,932 | 450,727 | 0.64x |
Note: Ordered ephemeral consumer is not yet fully supported on the .NET server (API timeout during consumer creation). Durable fetch improved from 0.13x to 0.64x after write coalescing and buffer pooling optimizations in the outbound write path.
## Summary

| Category | Ratio Range | Assessment |
|---|---|---|
| Pub-only throughput | 0.64x–0.84x | Good — within 2x |
| Pub/sub (large payload) | 0.97x | Excellent — near parity |
| Pub/sub (small payload) | 0.50x | Improved from 0.18x |
| Fan-out | 0.44x | Improved from 0.18x |
| Multi pub/sub | 0.49x | Good |
| Request/reply latency | 0.80x–0.84x | Good |
| JetStream sync publish | 0.85x | Good |
| JetStream async file publish | 0.27x | Improved from 0.00x — storage write path dominates |
| JetStream durable fetch | 0.64x | Improved from 0.13x |
## Key Observations

- Pub-only and request/reply are within striking distance (0.64x–0.85x), suggesting the core message path is reasonably well ported.
- Small-payload pub/sub improved from 0.18x to 0.50x after eliminating per-message .ToArray() allocations in SendMessage, adding write coalescing in the write loop, and removing profiling instrumentation from the hot path.
- Fan-out improved from 0.18x to 0.44x with the same optimizations. The remaining gap vs Go is primarily vectored I/O (net.Buffers/writev in Go vs sequential WriteAsync in .NET) and per-client scratch buffer reuse (Go's 1 KB msgb per client).
- JetStream durable fetch improved from 0.13x to 0.64x — the outbound write path optimizations benefit all message delivery, including consumer fetch responses.
- Large-payload pub/sub reached near parity (0.97x) — payload copying dominates, so the protocol-overhead optimizations have minimal impact at large sizes.
- JetStream file store async publish (0.27x) — the remaining gap is GC pressure from per-message StoredMessage objects and byte[] copies (65% of server time).
## Optimization History

### Round 3: Outbound Write Path (pub/sub + fan-out + fetch)

Three root causes were identified and fixed in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 12 | Per-message .ToArray() allocation in SendMessage — owner.Memory[..pos].ToArray() created a new byte[] for every MSG delivered to every subscriber | Replaced IMemoryOwner rent/copy/dispose with a direct byte[] from the pool; the write loop returns buffers after writing | Eliminates 1 heap allocation per delivery (4 per fan-out message) |
| 13 | Per-message WriteAsync in the write loop — each queued message triggered a separate _stream.WriteAsync() system call | Added a 64 KB coalesce buffer; drain all pending messages into the contiguous buffer, then issue a single WriteAsync per batch | Reduces syscalls from N to 1 per batch |
| 14 | Profiling Stopwatch on every message — Stopwatch.StartNew() ran unconditionally in ProcessMessage and StreamManager.Capture, even for non-JetStream messages | Removed profiling instrumentation from the hot path | Eliminates ~200 ns overhead per message |
### Round 2: FileStore AppendAsync Hot Path

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 6 | Async state machine overhead — AppendAsync was async ValueTask&lt;ulong&gt; but never actually awaited | Changed to a synchronous ValueTask&lt;ulong&gt; returning ValueTask.FromResult(_last) | Eliminates the async state machine allocation |
| 7 | Double payload copy — TransformForPersist allocated a byte[], then payload.ToArray() created a second copy for StoredMessage | Reuse the TransformForPersist result directly for StoredMessage.Payload when no transform is needed (_noTransform flag) | Eliminates 1 byte[] allocation per message |
| 8 | Unnecessary TTL work per publish — ExpireFromWheel() and RegisterTtl() were called on every write even when MaxAge=0 | Guarded both with an _options.MaxAgeMs &gt; 0 check (matches Go: filestore.go:4701) | Eliminates hash wheel overhead when TTL is not configured |
| 9 | Per-message MsgBlock cache allocation — WriteAt created a new MessageRecord for _cache on every write | Removed eager cache population; reads now decode from the pending buffer or disk | Eliminates 1 object allocation per message |
| 10 | Fragmented write buffer — MsgBlock._pendingWrites was a List&lt;byte[]&gt; with per-message byte[] allocations | Replaced with a single contiguous _pendingBuf byte array; MessageRecord.EncodeTo writes directly into it | Eliminates the per-message byte[] encoding allocation; single RandomAccess.Write per flush |
| 11 | Pending buffer read path — MsgBlock.Read() flushed pending writes to disk before reading | Added an in-memory read from _pendingBuf when the data is still in the buffer | Avoids an unnecessary disk flush on read-after-write |
### Round 1: FileStore/StreamManager Layer

| # | Root Cause | Fix | Impact |
|---|---|---|---|
| 1 | Per-message synchronous disk I/O — MsgBlock.WriteAt() called RandomAccess.Write() on every message | Added write buffering in MsgBlock plus a background flush loop in FileStore (Go's flushLoop pattern: coalesce at 16 KB or 8 ms) | Eliminates per-message syscall overhead |
| 2 | O(n) GetStateAsync per publish — _messages.Keys.Min() and _messages.Values.Sum() ran on every publish for the MaxMsgs/MaxBytes checks | Added incremental _messageCount, _totalBytes, and _firstSeq fields updated in all mutation paths; GetStateAsync is now O(1) | Eliminates an O(n) scan per publish |
| 3 | Unnecessary LoadAsync after every append — StreamManager.Capture reloaded the just-stored message even when no mirrors/sources were configured | Made LoadAsync conditional on mirror/source replication being configured | Eliminates a redundant disk read per publish |
| 4 | Redundant PruneExpiredMessages per publish — called before every publish even when MaxAge=0, and again inside EnforceRuntimePolicies | Guarded with a MaxAgeMs &gt; 0 check; removed the pre-publish call (the background expiry timer handles it) | Eliminates an O(n) scan per publish |
| 5 | PrunePerSubject loading all messages per publish — EnforceRuntimePolicies → PrunePerSubject called ListAsync().GroupBy() even when MaxMsgsPer=0 | Guarded with a MaxMsgsPer &gt; 0 check | Eliminates an O(n) scan per publish |
Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.
## What would further close the gap

| Change | Expected Impact | Go Reference |
|---|---|---|
| Vectored I/O (writev) | Eliminate the coalesce copy in the write loop — write gathered buffers in a single syscall | net.Buffers.WriteTo() → writev() in flushOutbound() |
| Per-client scratch buffer | Reuse a 1 KB buffer for MSG header formatting across deliveries | client.msgb (1024-byte scratch, msgScratchSize) |
| Batch flush signaling | Deduplicate write loop wakeups — signal once per readloop iteration, not per delivery | pcd map tracks affected clients; flushClients() at end of readloop |
| Eliminate per-message GC allocations | ~30% improvement on FileStore AppendAsync — pool or eliminate StoredMessage objects | Go stores records in cache.buf/cache.idx with zero per-message allocs |