# Go vs .NET NATS Server — Benchmark Comparison

Benchmark run: 2026-03-13 11:37 AM America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly.

**Environment:** Apple M4, .NET SDK 10.0.101, benchmark README command run in the benchmark project's default `Debug` configuration, Go toolchain installed, Go reference server built from `golang/nats-server/`.

---

## Core NATS — Pub/Sub Throughput

### Single Publisher (no subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B | 2,837,040 | 43.3 | 1,856,572 | 28.3 | 0.65x |
| 128 B | 2,778,511 | 339.2 | 1,542,298 | 188.3 | 0.56x |

### Publisher + Subscriber (1:1)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B | 1,442,273 | 22.0 | 888,155 | 13.6 | 0.62x |
| 16 KB | 33,013 | 515.8 | 31,068 | 485.4 | 0.94x |

### Fan-Out (1 Publisher : 4 Subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B | 2,981,804 | 364.0 | 1,729,483 | 211.1 | 0.58x |

### Multi-Publisher / Multi-Subscriber (4P x 4S)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B | 1,567,030 | 191.3 | 1,371,131 | 167.4 | 0.87x |

---

## Core NATS — Request/Reply Latency

### Single Client, Single Service

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (µs) | .NET P50 (µs) | Go P99 (µs) | .NET P99 (µs) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 128 B | 8,316 | 7,128 | 0.86x | 116.7 | 136.4 | 165.8 | 203.5 |

### 10 Clients, 2 Services (Queue Group)

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (µs) | .NET P50 (µs) | Go P99 (µs) | .NET P99 (µs) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 16 B | 26,409 | 23,024 | 0.87x | 369.2 | 416.5 | 527.5 | 603.8 |

---

## JetStream — Publication

| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|---------|---------|----------|------------|-----------------|
| Synchronous | 16 B | Memory | 13,090 | 9,368 | 0.72x |
| Async (batch) | 128 B | File | 132,869 | 54,750 | 0.41x |

> **Note:** Async file-store publish improved to 0.41x in this run, but the storage write path is still the largest publication gap after the FileStore changes.

---

## JetStream — Consumption

| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|----------|------------|-----------------|
| Ordered ephemeral consumer | 564,226 | 62,192 | 0.11x |
| Durable consumer fetch | 478,634 | 317,563 | 0.66x |

> **Note:** Ordered-consumer throughput regressed materially in this run. The merged FileStore work helped publish and subject-lookup paths, but ordered consumption remains the clearest JetStream hotspot after this round.
---

## Hot Path Microbenchmarks (.NET only)

Allocation figures are bytes allocated per operation; a sketch of how numbers like these are captured follows the key observations.

### SubList

| Benchmark | Ops/s | MB/s | Alloc |
|-----------|-------|------|-------|
| SubList Exact Match (128 subjects) | 18,472,815 | 246.6 | 0.00 B/op |
| SubList Wildcard Match | 18,647,671 | 249.0 | 0.00 B/op |
| SubList Queue Match | 19,313,073 | 147.3 | 0.00 B/op |
| SubList Remote Interest | 270,082 | 4.4 | 0.00 B/op |

### Parser

| Benchmark | Ops/s | MB/s | Alloc |
|-----------|-------|------|-------|
| Parser PING | 5,765,742 | 33.0 | 0.0 B/op |
| Parser PUB | 2,542,120 | 97.0 | 40.0 B/op |
| Parser HPUB | 2,151,468 | 114.9 | 40.0 B/op |
| Parser PUB split payload | 1,876,479 | 71.6 | 176.0 B/op |

### FileStore

| Benchmark | Ops/s | MB/s | Alloc |
|-----------|-------|------|-------|
| FileStore AppendAsync (128B) | 250,964 | 30.6 | 1550.9 B/op |
| FileStore LoadLastBySubject (hot) | 12,057,199 | 735.9 | 0.0 B/op |
| FileStore PurgeEx+Trim | 328 | 0.0 | 5440792.9 B/op |

---

## Summary

| Category | Ratio Range | Assessment |
|----------|-------------|------------|
| Pub-only throughput | 0.56x–0.65x | Still behind Go on both payload sizes |
| Pub/sub (small payload) | 0.62x | Regression versus the prior run; no longer ahead of Go |
| Pub/sub (large payload) | 0.94x | Near parity |
| Fan-out | 0.58x | Fan-out remains materially behind Go |
| Multi pub/sub | 0.87x | Close to parity |
| Request/reply latency | 0.86x–0.87x | Good |
| JetStream sync publish | 0.72x | Good |
| JetStream async file publish | 0.41x | Improved, but still storage-bound |
| JetStream ordered consume | 0.11x | Major regression / highest-priority JetStream gap |
| JetStream durable fetch | 0.66x | Good, but slightly down from the prior run |

### Key Observations

1. **Async file-store publish improved from the prior 0.30x snapshot to 0.41x** (54.8K vs 132.9K msg/s). That is directionally consistent with the FileStore metadata and payload-ownership work landing in this round.
2. **The new FileStore direct benchmarks show the shape of the remaining storage cost clearly**: hot last-by-subject lookup is effectively allocation-free and very fast, append still allocates around 1,551 B/op, and repeated `PurgeEx+Trim` remains extremely allocation-heavy at roughly 5.4 MB/op.
3. **Ordered consumer throughput is now the dominant JetStream problem at 0.11x** (62K vs 564K msg/s). Whatever helped the publish and fetch paths did not carry over to ordered-consumer delivery in this run.
4. **Core pub/sub no longer shows the earlier small-payload outlier win over Go.** 1:1 16 B came in at 0.62x, fan-out at 0.58x, and multi pub/sub at 0.87x, which is a much more uniform profile.
5. **Durable fetch remains respectable at 0.66x**, but it is slightly softer than the last snapshot and still trails Go by a meaningful margin on the same merged build.
6. **SubList and parser microbenchmarks remain strong and stable.** Exact, wildcard, queue, and remote-interest lookups still allocate essentially nothing, and the parser's contiguous-input hot paths remain cheap relative to the FileStore and consumer-path costs.
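The allocation numbers in the tables above are per-operation figures of the kind BenchmarkDotNet's `[MemoryDiagnoser]` reports. A minimal sketch of how a hot-path benchmark of this shape can be captured is below; the `SubList` stand-in type and its `Match` signature are assumptions for illustration, not the project's actual API.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Hypothetical stand-in for the server's subject-matching structure;
// the real SubList type and its Match signature may differ.
public sealed class SubList
{
    public bool Match(string subject) => subject.Length != 0; // placeholder logic
}

[MemoryDiagnoser] // adds the Alloc (B/op) column reported in the tables above
public class SubListBenchmarks
{
    private readonly SubList _subList = new();

    [Benchmark]
    public bool ExactMatch() => _subList.Match("orders.created.eu");
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<SubListBenchmarks>();
}
```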
---

## Optimization History

### Round 6: Batch Flush Signaling + Fetch Optimizations

Four optimizations targeting fan-out and consumer fetch hot paths:

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 20 | **Per-subscriber flush signal in fan-out** — each `SendMessage` called `_flushSignal.Writer.TryWrite(0)` independently; for 1:4 fan-out, 4 channel writes + 4 write-loop wakeups per published message | Split `SendMessage` into `SendMessageNoFlush` + `SignalFlush`; `ProcessMessage` collects unique clients in a `[ThreadStatic] HashSet` (Go's `pcd` pattern), then sends one flush signal per unique client after fan-out (see the fan-out sketch below) | Reduces channel writes from N to unique-client-count per publish |
| 21 | **Per-fetch `CompiledFilter` allocation** — `CompiledFilter.FromConfig(consumer.Config)` was called on every fetch request, allocating a new filter object each time | Cached `CompiledFilter` on `ConsumerHandle` with staleness detection (reference + value check on filter config fields); reused across fetches | Eliminates per-fetch filter allocation |
| 22 | **Per-message string interpolation in ack reply** — `$"$JS.ACK.{stream}.{consumer}.1.{seq}.{deliverySeq}.{ts}.{pending}"` allocated intermediate strings and boxed numeric types on every delivery | Pre-compute the `$"$JS.ACK.{stream}.{consumer}.1."` prefix before the loop; use `stackalloc char[]` + `TryFormat` for the numeric suffix — zero intermediate allocations (see the ack-formatting sketch below) | Eliminates 4+ string allocs per delivered message |
| 23 | **Per-fetch `List` allocation** — `new List(batch)` was allocated on every `FetchAsync` call | `[ThreadStatic]` reusable list with `.Clear()` + capacity growth; `PullFetchBatch` snapshots via `.ToArray()` for safe handoff | Eliminates per-fetch list allocation |

### Round 5: Non-blocking ConsumeAsync (ordered + durable consumers)

One root cause was identified and fixed in the MSG.NEXT request handling path:

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 19 | **Synchronous blocking in DeliverPullFetchMessages** — `FetchAsync(...).GetAwaiter().GetResult()` blocked the client's read loop for the full `expires` timeout (30s). With `batch=1000` and only 5 messages available, the fetch polled for message 6 indefinitely; no messages were delivered until the timeout fired, so the client received 0 messages before its own timeout | Split into two paths: `noWait`/no-expires uses synchronous fetch (existing behavior for the `FetchAsync` client); `expires > 0` spawns a `DeliverPullFetchMessagesAsync` background task that delivers messages incrementally without blocking the read loop, with idle heartbeat support (see the background-delivery sketch below) | Enables `ConsumeAsync` for both ordered and durable consumers; ordered consumer: 99K msg/s (0.64x Go) as measured in that round |
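To make fix #20 concrete, the sketch below shows the one-signal-per-unique-client pattern under simplifying assumptions: the `Client` type, `SendMessageNoFlush`, and `SignalFlush` here are illustrative stand-ins for the server's real delivery plumbing.

```csharp
using System;
using System.Collections.Generic;

public sealed class Client
{
    public void SendMessageNoFlush(ReadOnlyMemory<byte> frame) { /* append to the client's buffer only */ }
    public void SignalFlush() { /* one wakeup for the client's write loop */ }
}

public static class FanOutDelivery
{
    // Reused per publishing thread, mirroring Go's per-client-delivery (pcd) set;
    // avoids allocating a fresh set for every published message.
    [ThreadStatic] private static HashSet<Client>? _pcd;

    public static void ProcessMessage(
        ReadOnlyMemory<byte> frame,
        IReadOnlyList<(Client Client, string Sid)> matches)
    {
        var pcd = _pcd ??= new HashSet<Client>();
        pcd.Clear();

        // Phase 1: buffer the message for every matching subscription.
        foreach (var (client, _) in matches)
        {
            client.SendMessageNoFlush(frame);
            pcd.Add(client); // dedupe: one client may own several matching subs
        }

        // Phase 2: one flush signal per unique client, not one per subscription.
        foreach (var client in pcd)
            client.SignalFlush();
    }
}
```

For 1:4 fan-out to four distinct clients this still signals four times, but when one client holds several matching subscriptions the wakeups collapse to one, which is the N-to-unique-client reduction the table describes.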
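Fix #22 can be sketched as follows. The prefix pre-computation and the `stackalloc` + `TryFormat` suffix follow the table's description; the helper names and the 96-char buffer size are assumptions.

```csharp
using System;

public static class AckSubject
{
    // Computed once per delivery loop, not once per message.
    public static string Prefix(string stream, string consumer) =>
        $"$JS.ACK.{stream}.{consumer}.1.";

    // Appends "<seq>.<deliverySeq>.<ts>.<pending>" to the cached prefix without
    // intermediate strings or boxing; only the final subject string is allocated.
    public static string Format(string prefix, ulong seq, ulong deliverySeq, long ts, long pending)
    {
        Span<char> buf = stackalloc char[96]; // four numeric fields + three dots fit easily
        int pos = 0;
        seq.TryFormat(buf[pos..], out int written); pos += written;
        buf[pos++] = '.';
        deliverySeq.TryFormat(buf[pos..], out written); pos += written;
        buf[pos++] = '.';
        ts.TryFormat(buf[pos..], out written); pos += written;
        buf[pos++] = '.';
        pending.TryFormat(buf[pos..], out written); pos += written;
        return string.Concat(prefix, buf[..pos]); // span-based Concat, no boxing
    }
}
```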
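For fix #19, the sketch below shows the shape of moving an `expires > 0` pull fetch off the read loop onto a background task. The message-source and delivery delegates are hypothetical stand-ins for the consumer plumbing, and the real path waits on store signals and emits idle heartbeats rather than polling.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class PullFetchDelivery
{
    private readonly Func<CancellationToken, ValueTask<byte[]?>> _tryNext; // stand-in message source
    private readonly Action<byte[]> _deliver;                              // stand-in client write

    public PullFetchDelivery(Func<CancellationToken, ValueTask<byte[]?>> tryNext, Action<byte[]> deliver)
    {
        _tryNext = tryNext;
        _deliver = deliver;
    }

    // Called from the client's read loop for MSG.NEXT with expires > 0.
    // Returns immediately so the read loop keeps processing protocol traffic.
    public void Start(int batch, TimeSpan expires)
    {
        _ = Task.Run(async () =>
        {
            using var cts = new CancellationTokenSource(expires);
            int delivered = 0;
            try
            {
                while (delivered < batch)
                {
                    byte[]? msg = await _tryNext(cts.Token);
                    if (msg is null)
                    {
                        // Nothing available yet; a short delay keeps the sketch
                        // simple where the server would await a store signal.
                        await Task.Delay(1, cts.Token);
                        continue;
                    }
                    _deliver(msg);
                    delivered++;
                }
            }
            catch (OperationCanceledException)
            {
                // expires elapsed: the partial batch already delivered stands.
            }
        });
    }
}
```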
### Round 4: Per-Client Direct Write Buffer (pub/sub + fan-out + multi pub/sub)

Four optimizations were implemented in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 15 | **Per-message channel overhead** — each `SendMessage` call went through `Channel.TryWrite`, incurring lock contention and memory barriers | Replaced channel-based message delivery with a per-client `_directBuf` byte array under a `SpinLock`; messages are written directly into a contiguous buffer (see the double-buffer sketch below) | Eliminates channel overhead per delivery |
| 16 | **Per-message heap allocation for MSG header** — `_outboundBufferPool.RentBuffer()` allocated a pooled `byte[]` for each MSG header | Replaced with `stackalloc byte[512]` — the MSG header is formatted entirely on the stack, then copied into `_directBuf` | Zero heap allocations per delivery |
| 17 | **Per-message socket write** — the write loop issued one `SendAsync` per channel item, even with coalescing | Double-buffer swap: the write loop swaps `_directBuf` ↔ `_writeBuf` under the `SpinLock`, then writes the entire batch in a single `SendAsync`; zero allocation on swap (see the double-buffer sketch below) | Single syscall per batch, zero-copy buffer reuse |
| 18 | **Separate wake channels** — `SendMessage` and `WriteProtocol` used different signaling paths | Unified on the `_flushSignal` channel (bounded capacity 1, DropWrite); both paths signal the same channel, and the write loop drains both `_directBuf` and `_outbound` on each wake (see the flush-signal sketch below) | Single wait point, no missed wakes |

### Round 3: Outbound Write Path (pub/sub + fan-out + fetch)

Three root causes were identified and fixed in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 12 | **Per-message `.ToArray()` allocation in SendMessage** — `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber | Replaced `IMemoryOwner` rent/copy/dispose with a direct `byte[]` from the pool; the write loop returns buffers after writing | Eliminates 1 heap alloc per delivery (4 per fan-out message) |
| 13 | **Per-message `WriteAsync` in write loop** — each queued message triggered a separate `_stream.WriteAsync()` system call | Added a 64KB coalesce buffer; all pending messages are drained into the contiguous buffer, then written with a single `WriteAsync` per batch | Reduces syscalls from N to 1 per batch |
| 14 | **Profiling `Stopwatch` on every message** — `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture`, even for non-JetStream messages | Removed profiling instrumentation from the hot path | Eliminates ~200ns overhead per message |
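The double-buffer scheme from fixes #15 and #17 can be sketched as below; the buffer sizing, growth policy, and single-flusher assumption are simplifications of the real per-client write path.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

public sealed class DirectWriteBuffer
{
    private byte[] _directBuf = new byte[64 * 1024]; // producers append here
    private byte[] _writeBuf  = new byte[64 * 1024]; // write loop drains this one
    private int _directLen;
    private SpinLock _lock = new(enableThreadOwnerTracking: false);

    // Hot path: copy an already-formatted MSG frame into the shared buffer.
    public void Append(ReadOnlySpan<byte> frame)
    {
        bool taken = false;
        _lock.Enter(ref taken);
        try
        {
            if (_directLen + frame.Length > _directBuf.Length)
                Array.Resize(ref _directBuf, Math.Max(_directBuf.Length * 2, _directLen + frame.Length));
            frame.CopyTo(_directBuf.AsSpan(_directLen));
            _directLen += frame.Length;
        }
        finally
        {
            if (taken) _lock.Exit();
        }
    }

    // Write loop: swap the buffers under the lock (no copy), then issue one
    // socket write for the whole batch outside the lock.
    public async ValueTask FlushAsync(Socket socket)
    {
        byte[] batch;
        int len;
        bool taken = false;
        _lock.Enter(ref taken);
        try
        {
            if (_directLen == 0) return;
            len = _directLen;
            _directLen = 0;
            (_directBuf, _writeBuf) = (_writeBuf, _directBuf); // zero-allocation swap
            batch = _writeBuf; // now holds the filled buffer
        }
        finally
        {
            if (taken) _lock.Exit();
        }
        await socket.SendAsync(new ReadOnlyMemory<byte>(batch, 0, len), SocketFlags.None);
    }
}
```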
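Fix #18's single wake point is small enough to show in full. A bounded channel of capacity 1 with `DropWrite` makes `TryWrite` a cheap, never-blocking, self-coalescing signal; the drain delegate stands in for the real loop that empties both `_directBuf` and `_outbound`.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class FlushSignal
{
    // Capacity 1 + DropWrite collapses any burst of signals into at most one
    // pending item, so the write loop wakes once per burst and TryWrite never blocks.
    private readonly Channel<int> _flushSignal = Channel.CreateBounded<int>(
        new BoundedChannelOptions(1)
        {
            FullMode = BoundedChannelFullMode.DropWrite,
            SingleReader = true,
        });

    // Called by both the SendMessage and WriteProtocol paths.
    public void Signal() => _flushSignal.Writer.TryWrite(0);

    // Write loop: wait for a signal, consume it, then drain everything pending.
    public async Task RunWriteLoopAsync(Func<ValueTask> drainAll, CancellationToken ct)
    {
        while (await _flushSignal.Reader.WaitToReadAsync(ct))
        {
            _flushSignal.Reader.TryRead(out _); // consume the coalesced signal
            await drainAll();
        }
    }
}
```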
### Round 2: FileStore AppendAsync Hot Path

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 6 | **Async state machine overhead** — `AppendAsync` was declared `async ValueTask` but contained no real `await` | Changed to a synchronous `ValueTask` returning `ValueTask.FromResult(_last)` | Eliminates Task state machine allocation |
| 7 | **Double payload copy** — `TransformForPersist` allocated a `byte[]`, then `payload.ToArray()` created a second copy for `StoredMessage` | Reuse the `TransformForPersist` result directly for `StoredMessage.Payload` when no transform is needed (`_noTransform` flag) | Eliminates 1 `byte[]` alloc per message |
| 8 | **Unnecessary TTL work per publish** — `ExpireFromWheel()` and `RegisterTtl()` were called on every write even when `MaxAge=0` | Guarded both with a `_options.MaxAgeMs > 0` check (matches Go: `filestore.go:4701`) | Eliminates hash wheel overhead when TTL is not configured |
| 9 | **Per-message MsgBlock cache allocation** — `WriteAt` created a `new MessageRecord` for `_cache` on every write | Removed eager cache population; reads now decode from the pending buffer or disk | Eliminates 1 object alloc per message |
| 10 | **Non-contiguous write buffer** — `MsgBlock._pendingWrites` was a `List` with per-message `byte[]` allocations | Replaced with a single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it | Eliminates the per-message `byte[]` encoding alloc; single `RandomAccess.Write` per flush |
| 11 | **Pending buffer read path** — `MsgBlock.Read()` flushed pending writes to disk before reading | Added an in-memory read from `_pendingBuf` when the data is still in the buffer | Avoids an unnecessary disk flush on read-after-write |

### Round 1: FileStore/StreamManager Layer

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 1 | **Per-message synchronous disk I/O** — `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in MsgBlock + a background flush loop in FileStore (Go's `flushLoop` pattern: coalesce 16KB or 8ms; see the flush-loop sketch below) | Eliminates per-message syscall overhead |
| 2 | **O(n) `GetStateAsync` per publish** — `_messages.Keys.Min()` and `_messages.Values.Sum()` ran on every publish for MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1) (see the state-tracking sketch below) | Eliminates O(n) scan per publish |
| 3 | **Unnecessary `LoadAsync` after every append** — `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates a redundant disk read per publish |
| 4 | **Redundant `PruneExpiredMessages` per publish** — called before every publish even when `MaxAge=0`, and again inside `EnforceRuntimePolicies` | Guarded with a `MaxAgeMs > 0` check; removed the pre-publish call (the background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | **`PrunePerSubject` loading all messages per publish** — `EnforceRuntimePolicies` → `PrunePerSubject` called `ListAsync().GroupBy()` even when `MaxMsgsPer=0` | Guarded with a `MaxMsgsPer > 0` check | Eliminates O(n) scan per publish |

Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.
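Round 1's fix #1 follows Go's `flushLoop` shape. A sketch of the coalescing policy (flush at 16KB or every 8ms) is below; the pending-bytes and flush delegates are hypothetical stand-ins for the MsgBlock buffer and its `RandomAccess.Write` flush.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class FlushLoop
{
    private const int FlushBytes = 16 * 1024;                                      // size trigger
    private static readonly TimeSpan FlushInterval = TimeSpan.FromMilliseconds(8); // time trigger

    private readonly Func<int> _pendingBytes; // stand-in: bytes buffered in the MsgBlock
    private readonly Func<ValueTask> _flush;  // stand-in: one RandomAccess.Write of the pending buffer

    public FlushLoop(Func<int> pendingBytes, Func<ValueTask> flush)
    {
        _pendingBytes = pendingBytes;
        _flush = flush;
    }

    // Background loop: coalesce many small appends into one disk write by
    // flushing anything pending every 8ms.
    public async Task RunAsync(CancellationToken ct)
    {
        using var timer = new PeriodicTimer(FlushInterval);
        while (await timer.WaitForNextTickAsync(ct))
        {
            if (_pendingBytes() > 0)
                await _flush();
        }
    }

    // Called from the append path: force an early flush once the size trigger hits.
    public ValueTask MaybeFlushAsync() =>
        _pendingBytes() >= FlushBytes ? _flush() : ValueTask.CompletedTask;
}
```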
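Fix #2's O(1) state is just disciplined bookkeeping. The field names below come from the table above; the interlocked-counter locking model is an assumption for the sketch.

```csharp
using System.Threading;

public sealed class StreamStateTracker
{
    // Updated incrementally in every mutation path, so MaxMsgs/MaxBytes checks
    // on the publish path no longer scan the message dictionary.
    private long _messageCount;
    private long _totalBytes;
    private ulong _firstSeq = 1;

    public void OnStore(int sizeBytes)
    {
        Interlocked.Increment(ref _messageCount);
        Interlocked.Add(ref _totalBytes, sizeBytes);
    }

    public void OnRemoveFirst(ulong removedSeq, int sizeBytes)
    {
        Interlocked.Decrement(ref _messageCount);
        Interlocked.Add(ref _totalBytes, -sizeBytes);
        _firstSeq = removedSeq + 1; // the head only advances forward
    }

    // O(1) snapshot, replacing _messages.Keys.Min() / _messages.Values.Sum().
    public (long Count, long Bytes, ulong FirstSeq) GetState() =>
        (Interlocked.Read(ref _messageCount), Interlocked.Read(ref _totalBytes), _firstSeq);
}
```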
### What would further close the gap

| Change | Expected Impact | Go Reference |
|--------|-----------------|--------------|
| **Fan-out parallelism** | Deliver to subscribers concurrently instead of serially from the publisher's read loop | Go: `processMsgResults` fans out per-client via goroutines |
| **Eliminate per-message GC allocations in FileStore** | ~30% improvement on FileStore AppendAsync — replace the `StoredMessage` class with a `StoredMessageMeta` struct in the `_messages` dict and reconstruct the full message from the MsgBlock on read (see the sketch below) | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs; 80+ sites in FileStore.cs need migration |
| **Ordered consumer delivery optimization** | Investigate the .NET ordered consumer throughput ceiling (~110K msg/s in prior runs) vs Go's variable 156K–749K msg/s | Go: `consumer.go` ordered consumer fast path |
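As a sketch of the struct migration suggested in the second row, per-message metadata could live inline in the dictionary while the payload bytes stay in the owning MsgBlock. The field layout below is an assumption; only the `StoredMessageMeta` name and the `_messages` dictionary come from the table.

```csharp
// Hypothetical inline metadata: a Dictionary<ulong, StoredMessageMeta> stores
// these structs directly in its entry array, so tracking a message costs zero
// heap objects, versus one allocation per message for a StoredMessage class.
public readonly struct StoredMessageMeta
{
    public readonly long Timestamp;   // store time, for MaxAge checks
    public readonly int BlockIndex;   // which MsgBlock holds the encoded record
    public readonly int BlockOffset;  // byte offset of the record in that block
    public readonly int Size;         // encoded record length

    public StoredMessageMeta(long timestamp, int blockIndex, int blockOffset, int size)
    {
        Timestamp = timestamp;
        BlockIndex = blockIndex;
        BlockOffset = blockOffset;
        Size = size;
    }
}
```

Reads would then decode the full message from the block (or its pending buffer), which is the same read path fix #11 already added.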