natsdotnet/benchmarks_comparison.md

# Go vs .NET NATS Server — Benchmark Comparison

Benchmark run: 2026-03-13. Both servers ran on the same machine and were tested with identical NATS.Client.Core workloads. Test parallelization was disabled to avoid resource contention. The best of 3 runs is reported.

**Environment:** Apple M4, .NET 10, Go nats-server (latest from `golang/nats-server/`).

---

## Core NATS — Pub/Sub Throughput

### Single Publisher (no subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B | 2,252,242 | 34.4 | 1,610,807 | 24.6 | 0.72x |
| 128 B | 2,199,267 | 268.5 | 1,661,014 | 202.8 | 0.76x |
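
The derived columns in these tables can be reproduced from the raw message rates. The sketch below assumes that MB/s means msg/s × payload bytes / 1,048,576 and that the ratio is .NET msg/s divided by Go msg/s; both assumptions match the rows above (for example, 2,252,242 msg/s at 16 B gives 34.4 MB/s). `BenchMath` is a hypothetical helper, not part of the benchmark suite.

```csharp
using System;

// Hypothetical helper for checking the derived table columns.
public static class BenchMath
{
    // Converts a message rate to MB/s, assuming 1 MB = 1,048,576 bytes.
    public static double MegabytesPerSecond(long msgsPerSecond, int payloadBytes) =>
        msgsPerSecond * (double)payloadBytes / (1024 * 1024);

    // .NET throughput relative to Go; values above 1.0 mean .NET is faster.
    public static double Ratio(long dotnetMsgsPerSecond, long goMsgsPerSecond) =>
        (double)dotnetMsgsPerSecond / goMsgsPerSecond;
}
```

Calling `BenchMath.Ratio(909_298, 313_790)` yields roughly 2.90, the figure reported for 1:1 pub/sub with 16 B payloads.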

### Publisher + Subscriber (1:1)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B | 313,790 | 4.8 | 909,298 | 13.9 | **2.90x** |
| 16 KB | 41,153 | 643.0 | 38,287 | 598.2 | 0.93x |

### Fan-Out (1 Publisher : 4 Subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B | 3,217,684 | 392.8 | 1,817,860 | 221.9 | 0.57x |

### Multi-Publisher / Multi-Subscriber (4P x 4S)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B | 2,101,337 | 256.5 | 1,527,330 | 186.4 | 0.73x |

---

## Core NATS — Request/Reply Latency

### Single Client, Single Service

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 128 B | 9,450 | 7,662 | 0.81x | 103.2 | 128.9 | 145.6 | 170.8 |

### 10 Clients, 2 Services (Queue Group)

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 16 B | 31,094 | 26,144 | 0.84x | 316.9 | 368.7 | 439.2 | 559.7 |
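
The throughput and latency columns can be cross-checked with Little's law. The sketch assumes each client keeps exactly one request in flight (a closed loop) and uses P50 as a stand-in for mean latency, so the result is only approximate. `LatencyMath` is a hypothetical helper.

```csharp
// Hypothetical helper: closed-loop throughput estimate from latency.
public static class LatencyMath
{
    // Little's law: throughput = requests in flight / latency.
    // Assumes one outstanding request per client.
    public static double ExpectedRequestsPerSecond(int clients, double latencyMicroseconds) =>
        clients * 1_000_000.0 / latencyMicroseconds;
}
```

For the Go rows this gives about 9,690 req/s for one client at 103.2 µs and about 31,556 req/s for ten clients at 316.9 µs, close to the measured 9,450 and 31,094.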

---

## JetStream — Publication

| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|---------|---------|----------|------------|-----------------|
| Synchronous | 16 B | Memory | 17,533 | 14,373 | 0.82x |
| Async (batch) | 128 B | File | 198,237 | 60,416 | 0.30x |

> **Note:** Async file store publish improved from 174 msg/s to 60K msg/s (347x improvement) after two rounds of FileStore-level optimizations plus profiling overhead removal. Remaining 3.3x gap is GC pressure from per-message allocations.

---

## JetStream — Consumption

| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|----------|------------|-----------------|
| Ordered ephemeral consumer | 748,671 | 114,021 | 0.15x |
| Durable consumer fetch | 662,471 | 488,520 | 0.74x |

> **Note:** Durable fetch improved from 0.13x → 0.60x → **0.74x** after Round 6 optimizations (batch flush, ackReply stack formatting, cached CompiledFilter, pooled fetch list). The ordered consumer ratio dropped because the Go result was much higher in this run (749K vs 156K msg/s in earlier runs); .NET throughput is stable at ~110K msg/s.
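
The pooled fetch list mentioned in the note can be sketched as a `[ThreadStatic]` scratch list that is cleared and refilled on each fetch, then snapshotted for handoff. `FetchScratch` and the use of `string` as the message type are illustrative assumptions; the real code works with `StoredMessage`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical sketch of a per-thread reusable fetch list.
public static class FetchScratch
{
    // One list per thread, so no locking is needed and capacity is retained.
    [ThreadStatic] private static List<string>? _batch;

    // Collects up to max items, then snapshots them so the caller
    // never holds a reference to the shared scratch list.
    public static string[] Collect(IEnumerable<string> source, int max)
    {
        var list = _batch ??= new List<string>();
        list.Clear();
        foreach (var item in source)
        {
            if (list.Count >= max) break;
            list.Add(item);
        }
        return list.ToArray();
    }
}
```

The snapshot array is still allocated on every call; what the reuse avoids is reallocating and regrowing the list itself.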

---

## Summary

| Category | Ratio Range | Assessment |
|----------|-------------|------------|
| Pub-only throughput | 0.72x–0.76x | Good — within 2x |
| Pub/sub (small payload) | **2.90x** | .NET outperforms Go; the direct buffer path eliminates per-message heap allocations and channel overhead |
| Pub/sub (large payload) | 0.93x | Near parity |
| Fan-out | 0.57x | Earlier runs: 0.18x → 0.44x → 0.66x; 0.57x in this run. Batch flush applied, but delivery remains serial |
| Multi pub/sub | 0.73x | Earlier runs: 0.49x → 0.84x; 0.73x in this run, with the variance attributed to system load |
| Request/reply latency | 0.81x–0.84x | Good — improved from 0.77x |
| JetStream sync publish | 0.82x | Good |
| JetStream async file publish | 0.30x | Improved from 0.00x — storage write path dominates |
| JetStream ordered consume | 0.15x | .NET stable ~110K; Go variance high (156K–749K) |
| JetStream durable fetch | **0.74x** | **Improved from 0.60x** — batch flush + ackReply optimization |

### Key Observations

1. **Small-payload 1:1 pub/sub outperforms Go by ~3x** (909K vs 314K msg/s). The per-client direct write buffer with `stackalloc` header formatting eliminates all per-message heap allocations and channel overhead.
2. **Durable consumer fetch improved to 0.74x** (489K vs 662K msg/s) — Round 6 batch flush signaling and `string.Create`-based ack reply formatting reduced per-message overhead significantly.
3. **Fan-out holds at ~0.57x** despite batch flush optimization. The remaining gap is goroutine-level parallelism (Go fans out per-client via goroutines; .NET delivers serially). The batch flush reduces wakeup overhead but doesn't add concurrency.
4. **Request/reply improved to 0.81x–0.84x** — deferred flush benefits single-message delivery paths too.
5. **JetStream file store async publish: 0.30x** — remaining gap is GC pressure from per-message `StoredMessage` objects and `byte[]` copies (Change 2 deferred due to scope: 80+ sites in FileStore.cs need migration).
6. **JetStream ordered consumer: 0.15x** — ratio drop is due to Go benchmark variance (749K in this run vs 156K previously); .NET throughput stable at ~110K msg/s. Further investigation needed for the Go variability.
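
The batch flush signaling referred to in observations 2 and 3 can be sketched as follows: buffer the message for every subscription first, then wake each distinct client once. `SketchClient` and `FanOut` are hypothetical stand-ins; only the method names `SendMessageNoFlush` and `SignalFlush` come from this document.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a connected client; counters replace real I/O.
public sealed class SketchClient
{
    public int Buffered;      // messages appended but not yet flushed
    public int FlushSignals;  // number of write-loop wakeups requested

    public void SendMessageNoFlush(ReadOnlySpan<byte> payload) => Buffered++;
    public void SignalFlush() => FlushSignals++;
}

public static class FanOut
{
    // Per-thread set of clients touched by the current publish.
    [ThreadStatic] private static HashSet<SketchClient>? _pending;

    // Delivers to every subscription, then signals each distinct client once.
    public static void Deliver(IReadOnlyList<SketchClient> subscriptions, ReadOnlySpan<byte> payload)
    {
        var pending = _pending ??= new HashSet<SketchClient>();
        pending.Clear();
        for (int i = 0; i < subscriptions.Count; i++)
        {
            subscriptions[i].SendMessageNoFlush(payload);
            pending.Add(subscriptions[i]);
        }
        foreach (var client in pending) client.SignalFlush();
        pending.Clear();
    }
}
```

Delivery in this sketch is still a serial loop, which is consistent with observation 3: deduplicating the wakeups lowers signaling cost but adds no concurrency.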

---

## Optimization History

### Round 6: Batch Flush Signaling + Fetch Optimizations

Four optimizations targeting fan-out and consumer fetch hot paths:

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 20 | **Per-subscriber flush signal in fan-out** — each `SendMessage` called `_flushSignal.Writer.TryWrite(0)` independently; for 1:4 fan-out, 4 channel writes + 4 write-loop wakeups per published message | Split `SendMessage` into `SendMessageNoFlush` + `SignalFlush`; `ProcessMessage` collects unique clients in `[ThreadStatic] HashSet<INatsClient>` (Go's `pcd` pattern), one flush signal per unique client after fan-out | Reduces channel writes from N to unique-client-count per publish |
| 21 | **Per-fetch `CompiledFilter` allocation** — `CompiledFilter.FromConfig(consumer.Config)` called on every fetch request, allocating a new filter object each time | Cached `CompiledFilter` on `ConsumerHandle` with staleness detection (reference + value check on filter config fields); reused across fetches | Eliminates per-fetch filter allocation |
| 22 | **Per-message string interpolation in ack reply** — `$"$JS.ACK.{stream}.{consumer}.1.{seq}.{deliverySeq}.{ts}.{pending}"` allocated intermediate strings and boxed numeric types on every delivery | Pre-compute `$"$JS.ACK.{stream}.{consumer}.1."` prefix before loop; use `stackalloc char[]` + `TryFormat` for numeric suffix — zero intermediate allocations | Eliminates 4+ string allocs per delivered message |
| 23 | **Per-fetch `List<StoredMessage>` allocation** — `new List<StoredMessage>(batch)` allocated on every `FetchAsync` call | `[ThreadStatic]` reusable list with `.Clear()` + capacity growth; `PullFetchBatch` snapshots via `.ToArray()` for safe handoff | Eliminates per-fetch list allocation |
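
Fix 22 can be illustrated with a minimal sketch that precomputes the prefix and formats the numeric suffix on the stack. `AckReply`, its method names, and the parameter types are assumptions made for illustration; the subject layout follows the interpolated string shown in the table.

```csharp
using System;
using System.Globalization;

// Hypothetical sketch of allocation-light ack subject formatting.
public static class AckReply
{
    // Built once per fetch, outside the delivery loop.
    public static string Prefix(string stream, string consumer) =>
        "$JS.ACK." + stream + "." + consumer + ".1.";

    // Formats seq.deliverySeq.timestamp.pending into a stack buffer.
    public static string Format(string prefix, ulong seq, ulong deliverySeq, long timestamp, ulong pending)
    {
        // Four numbers of at most 20 chars each plus three dots fit in 83 chars.
        Span<char> suffix = stackalloc char[96];
        var inv = CultureInfo.InvariantCulture;
        int pos = 0;

        seq.TryFormat(suffix.Slice(pos), out int n, default, inv); pos += n;
        suffix[pos++] = '.';
        deliverySeq.TryFormat(suffix.Slice(pos), out n, default, inv); pos += n;
        suffix[pos++] = '.';
        timestamp.TryFormat(suffix.Slice(pos), out n, default, inv); pos += n;
        suffix[pos++] = '.';
        pending.TryFormat(suffix.Slice(pos), out n, default, inv); pos += n;

        // The final subject string is the only heap allocation.
        ReadOnlySpan<char> tail = suffix.Slice(0, pos);
        return string.Concat(prefix.AsSpan(), tail);
    }
}
```

With `Prefix("ORDERS", "worker")` and the values 42, 7, 1700000000, 3, this produces `$JS.ACK.ORDERS.worker.1.42.7.1700000000.3`.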

### Round 5: Non-blocking ConsumeAsync (ordered + durable consumers)

One root cause was identified and fixed in the MSG.NEXT request handling path:

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 19 | **Synchronous blocking in DeliverPullFetchMessages** — `FetchAsync(...).GetAwaiter().GetResult()` blocked the client's read loop for the full `expires` timeout (30s). With `batch=1000` and only 5 messages available, the fetch kept polling for message 6 until `expires` elapsed. No messages were delivered until the timeout fired, causing the client to receive 0 messages before its own timeout. | Split into two paths: `noWait`/no-expires uses synchronous fetch (existing behavior for `FetchAsync` client); `expires > 0` spawns `DeliverPullFetchMessagesAsync` background task that delivers messages incrementally without blocking the read loop, with idle heartbeat support | Enables `ConsumeAsync` for both ordered and durable consumers; ordered consumer: 99K msg/s (0.64x Go) |
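
The shape of the Round 5 fix can be sketched as below: the read loop starts a background task and returns immediately, while the task delivers whatever is available and stops at the batch size or when `expires` elapses. `PullDelivery` is hypothetical, polling with `Task.Delay` is a simplification (a real implementation would more likely wait on a signal), and idle heartbeats are omitted.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of non-blocking pull delivery.
public sealed class PullDelivery
{
    private readonly ConcurrentQueue<string> _stored = new();

    // Messages handed to the client, in delivery order.
    public ConcurrentQueue<string> Delivered { get; } = new();

    public void Store(string msg) => _stored.Enqueue(msg);

    // Called from the read loop: returns at once, delivery runs in the background.
    public Task StartFetch(int batch, TimeSpan expires)
    {
        return Task.Run(async () =>
        {
            using var cts = new CancellationTokenSource(expires);
            int sent = 0;
            while (sent < batch && !cts.IsCancellationRequested)
            {
                if (_stored.TryDequeue(out var msg))
                {
                    Delivered.Enqueue(msg); // delivered incrementally, not at timeout
                    sent++;
                }
                else
                {
                    try { await Task.Delay(1, cts.Token); }
                    catch (OperationCanceledException) { break; }
                }
            }
        });
    }
}
```

With five stored messages and `batch = 1000`, the five messages are delivered right away and the task ends when `expires` elapses, instead of nothing being delivered until the timeout.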

### Round 4: Per-Client Direct Write Buffer (pub/sub + fan-out + multi pub/sub)

Four optimizations were implemented in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 15 | **Per-message channel overhead** — each `SendMessage` call went through `Channel<OutboundData>.TryWrite`, incurring lock contention and memory barriers | Replaced channel-based message delivery with per-client `_directBuf` byte array under `SpinLock`; messages written directly to contiguous buffer | Eliminates channel overhead per delivery |
| 16 | **Per-message heap allocation for MSG header** — `_outboundBufferPool.RentBuffer()` allocated a pooled `byte[]` for each MSG header | Replaced with `stackalloc byte[512]` — MSG header formatted entirely on the stack, then copied into `_directBuf` | Zero heap allocations per delivery |
| 17 | **Per-message socket write** — write loop issued one `SendAsync` per channel item, even with coalescing | Double-buffer swap: write loop swaps `_directBuf` ↔ `_writeBuf` under `SpinLock`, then writes the entire batch in a single `SendAsync`; zero allocation on swap | Single syscall per batch, zero-copy buffer reuse |
| 18 | **Separate wake channels** — `SendMessage` and `WriteProtocol` used different signaling paths | Unified on `_flushSignal` channel (bounded capacity 1, DropWrite); both paths signal the same channel, write loop drains both `_directBuf` and `_outbound` on each wake | Single wait point, no missed wakes |
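
The double-buffer swap from fix 17 can be sketched as follows. This version uses a plain `lock` instead of `SpinLock` to stay short, assumes a single write loop, and uses the hypothetical class name `DirectWriteBuffer`; the field names mirror `_directBuf` and `_writeBuf` from the table.

```csharp
using System;

// Hypothetical sketch of a per-client double buffer.
public sealed class DirectWriteBuffer
{
    private readonly object _gate = new();
    private byte[] _directBuf = new byte[64 * 1024]; // producers append here
    private byte[] _writeBuf = new byte[64 * 1024];  // write loop sends from here
    private int _directLen;

    // Producer side: copy one framed message into the active buffer.
    public void Append(ReadOnlySpan<byte> frame)
    {
        lock (_gate)
        {
            if (_directLen + frame.Length > _directBuf.Length)
                Array.Resize(ref _directBuf, Math.Max(_directBuf.Length * 2, _directLen + frame.Length));
            frame.CopyTo(_directBuf.AsSpan(_directLen));
            _directLen += frame.Length;
        }
    }

    // Write-loop side: swap the buffers and return everything queued so far.
    // The returned memory is valid only until the next call to SwapForWrite.
    public ReadOnlyMemory<byte> SwapForWrite()
    {
        lock (_gate)
        {
            (_directBuf, _writeBuf) = (_writeBuf, _directBuf);
            int len = _directLen;
            _directLen = 0;
            return _writeBuf.AsMemory(0, len);
        }
    }
}
```

The write loop would pass the returned batch to a single send call, so producers keep appending to the other buffer while the socket write is in progress.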

### Round 3: Outbound Write Path (pub/sub + fan-out + fetch)

Three root causes were identified and fixed in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 12 | **Per-message `.ToArray()` allocation in SendMessage** — `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber | Replaced `IMemoryOwner` rent/copy/dispose with direct `byte[]` from pool; write loop returns buffers after writing | Eliminates 1 heap alloc per delivery (4 per fan-out message) |
| 13 | **Per-message `WriteAsync` in write loop** — each queued message triggered a separate `_stream.WriteAsync()` system call | Added 64KB coalesce buffer; drain all pending messages into contiguous buffer, single `WriteAsync` per batch | Reduces syscalls from N to 1 per batch |
| 14 | **Profiling `Stopwatch` on every message** — `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture` even for non-JetStream messages | Removed profiling instrumentation from hot path | Eliminates ~200ns overhead per message |
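
Fix 12 replaces a per-delivery `.ToArray()` copy with a pooled buffer that the write loop hands back after writing. The sketch uses `ArrayPool<byte>.Shared` as a stand-in for the server's own outbound pool, which is an assumption; `PooledFrame` is a hypothetical name.

```csharp
using System;
using System.Buffers;

// Hypothetical sketch of rent, copy, write, return.
public static class PooledFrame
{
    // Copies the frame into a rented array. The array may be larger than
    // the frame, so the valid length is returned alongside it.
    public static (byte[] Buffer, int Length) Rent(ReadOnlySpan<byte> frame)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(frame.Length);
        frame.CopyTo(buffer);
        return (buffer, frame.Length);
    }

    // Called by the write loop once the bytes have been written.
    public static void Return(byte[] buffer) => ArrayPool<byte>.Shared.Return(buffer);
}
```

The caller must not touch the array after returning it, since the pool may hand it to another delivery.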

### Round 2: FileStore AppendAsync Hot Path

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 6 | **Async state machine overhead** — `AppendAsync` was `async ValueTask<ulong>` but never actually awaited | Changed to synchronous `ValueTask<ulong>` returning `ValueTask.FromResult(_last)` | Eliminates Task state machine allocation |
| 7 | **Double payload copy** — `TransformForPersist` allocated `byte[]` then `payload.ToArray()` created second copy for `StoredMessage` | Reuse `TransformForPersist` result directly for `StoredMessage.Payload` when no transform needed (`_noTransform` flag) | Eliminates 1 `byte[]` alloc per message |
| 8 | **Unnecessary TTL work per publish** — `ExpireFromWheel()` and `RegisterTtl()` called on every write even when `MaxAge=0` | Guarded both with `_options.MaxAgeMs > 0` check (matches Go: `filestore.go:4701`) | Eliminates hash wheel overhead when TTL not configured |
| 9 | **Per-message MsgBlock cache allocation** — `WriteAt` created `new MessageRecord` for `_cache` on every write | Removed eager cache population; reads now decode from pending buffer or disk | Eliminates 1 object alloc per message |
| 10 | **Per-message pending-write buffers** — `MsgBlock._pendingWrites` was `List<byte[]>` with per-message `byte[]` allocations | Replaced with single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it | Eliminates per-message `byte[]` encoding alloc; single `RandomAccess.Write` per flush |
| 11 | **Flush-before-read in `MsgBlock.Read()`** — `MsgBlock.Read()` flushed pending writes to disk before reading | Added in-memory read from `_pendingBuf` when data is still in the buffer | Avoids unnecessary disk flush on read-after-write |
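
Fix 6 can be shown in a few lines: when a method never awaits, dropping the `async` keyword and wrapping the result directly avoids the compiler-generated state machine. `AppendOnlyStore` is a hypothetical class; only `AppendAsync`, `_last`, and `ValueTask.FromResult` come from the table.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch of a synchronous ValueTask-returning append.
public sealed class AppendOnlyStore
{
    private ulong _last;

    // No async keyword: the work completes synchronously,
    // so the sequence number is wrapped in a completed ValueTask.
    public ValueTask<ulong> AppendAsync(ReadOnlyMemory<byte> payload)
    {
        _last++;
        return ValueTask.FromResult(_last);
    }
}
```

Callers can still `await` the result; it is already complete, so the await continues inline without scheduling a continuation.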

### Round 1: FileStore/StreamManager Layer

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 1 | **Per-message synchronous disk I/O** — `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in MsgBlock + background flush loop in FileStore (Go's `flushLoop` pattern: coalesce 16KB or 8ms) | Eliminates per-message syscall overhead |
| 2 | **O(n) `GetStateAsync` per publish** — `_messages.Keys.Min()` and `_messages.Values.Sum()` on every publish for MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1) | Eliminates O(n) scan per publish |
| 3 | **Unnecessary `LoadAsync` after every append** — `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates redundant disk read per publish |
| 4 | **Redundant `PruneExpiredMessages` per publish** — called before every publish even when `MaxAge=0`, and again inside `EnforceRuntimePolicies` | Guarded with `MaxAgeMs > 0` check; removed the pre-publish call (background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | **`PrunePerSubject` loading all messages per publish** — `EnforceRuntimePolicies` → `PrunePerSubject` called `ListAsync().GroupBy()` even when `MaxMsgsPer=0` | Guarded with `MaxMsgsPer > 0` check | Eliminates O(n) scan per publish |

Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.
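
Fix 2 replaces per-publish scans with counters maintained on every mutation. The sketch assumes messages are only removed from the front (oldest first), which keeps the first-sequence update O(1); the real store has more mutation paths. `StreamStateTracker` is a hypothetical name, while the field names follow the table.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of incrementally maintained stream state.
public sealed class StreamStateTracker
{
    private readonly Queue<(ulong Seq, int Size)> _messages = new();
    private ulong _messageCount;
    private ulong _totalBytes;
    private ulong _firstSeq;

    // Update the counters as part of the append itself.
    public void OnAppend(ulong seq, int size)
    {
        if (_messages.Count == 0) _firstSeq = seq;
        _messages.Enqueue((seq, size));
        _messageCount++;
        _totalBytes += (ulong)size;
    }

    // Evict the oldest message and adjust the counters.
    public void RemoveOldest()
    {
        if (_messages.Count == 0) return;
        var removed = _messages.Dequeue();
        _messageCount--;
        _totalBytes -= (ulong)removed.Size;
        _firstSeq = _messages.Count > 0 ? _messages.Peek().Seq : 0;
    }

    // O(1): no Min() or Sum() over the whole message set.
    public (ulong Count, ulong Bytes, ulong FirstSeq) GetState() =>
        (_messageCount, _totalBytes, _firstSeq);
}
```

A limit check such as MaxMsgs or MaxBytes then reads three fields instead of scanning every stored message on each publish.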

### What would further close the gap

| Change | Expected Impact | Go Reference |
|--------|----------------|-------------|
| **Fan-out parallelism** | Deliver to subscribers concurrently instead of serially from publisher's read loop | Go: `processMsgResults` fans out per-client via goroutines |
| **Eliminate per-message GC allocations in FileStore** | ~30% improvement on FileStore AppendAsync — replace `StoredMessage` class with `StoredMessageMeta` struct in `_messages` dict, reconstruct full message from MsgBlock on read | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs; 80+ sites in FileStore.cs need migration |
| **Ordered consumer delivery optimization** | Investigate .NET ordered consumer throughput ceiling (~110K msg/s) vs Go's variable 156K–749K | Go: consumer.go ordered consumer fast path |
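
The second row proposes keeping only small value-type metadata per message and reconstructing the payload from block storage on read. A minimal sketch of that idea follows; the fields of `StoredMessageMeta`, the `MetaIndex` class, and the single growable block are all assumptions made for illustration, not the planned design.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical layout: where a message's bytes live inside the block.
public readonly record struct StoredMessageMeta(int Offset, int Length);

// Hypothetical index: struct values in the dictionary, payload bytes in one buffer.
public sealed class MetaIndex
{
    private readonly Dictionary<ulong, StoredMessageMeta> _messages = new();
    private byte[] _block = new byte[4096]; // stand-in for MsgBlock storage
    private int _used;

    // Appends the payload to the block and records only its location.
    public void Append(ulong seq, ReadOnlySpan<byte> payload)
    {
        if (_used + payload.Length > _block.Length)
            Array.Resize(ref _block, Math.Max(_block.Length * 2, _used + payload.Length));
        payload.CopyTo(_block.AsSpan(_used));
        _messages[seq] = new StoredMessageMeta(_used, payload.Length);
        _used += payload.Length;
    }

    // Reconstructs a view over the payload without allocating a message object.
    public bool TryRead(ulong seq, out ReadOnlyMemory<byte> payload)
    {
        if (_messages.TryGetValue(seq, out var meta))
        {
            payload = _block.AsMemory(meta.Offset, meta.Length);
            return true;
        }
        payload = default;
        return false;
    }
}
```

Because the dictionary stores structs inline, an append creates no per-message object for the garbage collector to track. A view returned by `TryRead` may refer to an old array once the block has been resized, so it should be consumed before further appends.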