Pub/sub 1:1 (16B) improved from 0.18x to 0.50x, fan-out from 0.18x to 0.44x, and JetStream durable fetch from 0.13x to 0.64x vs Go. Key changes: replace .ToArray() copy in SendMessage with pooled buffer handoff, batch multiple small writes into single WriteAsync via 64KB coalesce buffer in write loop, and remove profiling Stopwatch instrumentation from ProcessMessage/StreamManager hot paths.
# Go vs .NET NATS Server — Benchmark Comparison
Benchmark run: 2026-03-13. Both servers ran on the same machine and were exercised with identical NATS.Client.Core workloads; test parallelization was disabled to avoid resource contention.

**Environment:** Apple M4, .NET 10, Go nats-server (latest from `golang/nats-server/`).

---

## Core NATS — Pub/Sub Throughput

### Single Publisher (no subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B | 2,138,955 | 32.6 | 1,373,272 | 21.0 | 0.64x |
| 128 B | 1,995,574 | 243.6 | 1,672,825 | 204.2 | 0.84x |
### Publisher + Subscriber (1:1)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B | 1,180,986 | 18.0 | 586,118 | 8.9 | 0.50x |
| 16 KB | 42,660 | 666.6 | 41,555 | 649.3 | 0.97x |
### Fan-Out (1 Publisher : 4 Subscribers)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B | 3,200,845 | 390.7 | 1,423,721 | 173.8 | 0.44x |

### Multi-Publisher / Multi-Subscriber (4P x 4S)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B | 3,081,071 | 376.1 | 1,518,459 | 185.4 | 0.49x |

---
## Core NATS — Request/Reply Latency
### Single Client, Single Service

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (µs) | .NET P50 (µs) | Go P99 (µs) | .NET P99 (µs) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 128 B | 9,174 | 7,317 | 0.80x | 106.3 | 134.2 | 149.2 | 175.2 |

### 10 Clients, 2 Services (Queue Group)

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (µs) | .NET P50 (µs) | Go P99 (µs) | .NET P99 (µs) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 16 B | 30,386 | 25,639 | 0.84x | 318.5 | 374.2 | 458.4 | 519.5 |
---
## JetStream — Publication

| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|---------|---------|----------|------------|-----------------|
| Synchronous | 16 B | Memory | 15,241 | 12,879 | 0.85x |
| Async (batch) | 128 B | File | 201,055 | 55,268 | 0.27x |

> **Note:** Async file-store publish improved from 174 msg/s to 55,268 msg/s (a ~318x improvement) after two rounds of FileStore-level optimizations plus removal of profiling overhead. The remaining ~4x gap is GC pressure from per-message allocations and ack delivery overhead.
---
## JetStream — Consumption

| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|----------|------------|-----------------|
| Ordered ephemeral consumer | 688,061 | N/A | N/A |
| Durable consumer fetch | 701,932 | 450,727 | 0.64x |

> **Note:** The ordered ephemeral consumer is not yet fully supported on the .NET server (the API times out during consumer creation). Durable fetch improved from 0.13x to 0.64x after the write-coalescing and buffer-pooling optimizations in the outbound write path.
---
## Summary

| Category | Ratio Range | Assessment |
|----------|-------------|------------|
| Pub-only throughput | 0.64x–0.84x | Good — within 2x |
| Pub/sub (large payload) | 0.97x | Excellent — near parity |
| Pub/sub (small payload) | 0.50x | Improved from 0.18x |
| Fan-out | 0.44x | Improved from 0.18x |
| Multi pub/sub | 0.49x | Good |
| Request/reply latency | 0.80x–0.84x | Good |
| JetStream sync publish | 0.85x | Good |
| JetStream async file publish | 0.27x | Improved from 0.00x — storage write path dominates |
| JetStream durable fetch | 0.64x | Improved from 0.13x |
### Key Observations
1. **Pub-only and request/reply are within striking distance** (0.6x–0.85x), suggesting the core message path is reasonably well ported.
2. **Small-payload pub/sub improved from 0.18x to 0.50x** after eliminating per-message `.ToArray()` allocations in `SendMessage`, adding write coalescing in the write loop, and removing profiling instrumentation from the hot path.
3. **Fan-out improved from 0.18x to 0.44x** with the same optimizations. The remaining gap vs Go is primarily vectored I/O (`net.Buffers`/`writev` in Go vs sequential `WriteAsync` in .NET) and per-client scratch-buffer reuse (Go's 1KB `msgb` per client).
4. **JetStream durable fetch improved from 0.13x to 0.64x** because the outbound write-path optimizations benefit all message delivery, including consumer fetch responses.
5. **Large-payload pub/sub reached near-parity** (0.97x): payload copying dominates at this size, so protocol-overhead optimizations have little effect.
6. **JetStream file-store async publish remains at 0.27x.** The remaining gap is GC pressure from per-message `StoredMessage` objects and `byte[]` copies (65% of server time).
---
## Optimization History
### Round 3: Outbound Write Path (pub/sub + fan-out + fetch)

Three root causes were identified and fixed in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 12 | **Per-message `.ToArray()` allocation in SendMessage** — `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber | Replaced `IMemoryOwner` rent/copy/dispose with direct `byte[]` from pool; write loop returns buffers after writing | Eliminates 1 heap alloc per delivery (4 per fan-out message) |
| 13 | **Per-message `WriteAsync` in write loop** — each queued message triggered a separate `_stream.WriteAsync()` system call | Added 64KB coalesce buffer; drain all pending messages into contiguous buffer, single `WriteAsync` per batch | Reduces syscalls from N to 1 per batch |
| 14 | **Profiling `Stopwatch` on every message** — `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture`, even for non-JetStream messages | Removed profiling instrumentation from the hot path | Eliminates ~200ns overhead per message |
### Round 2: FileStore AppendAsync Hot Path
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 6 | **Async state machine overhead** — `AppendAsync` was `async ValueTask<ulong>` but never actually awaited | Changed to a synchronous `ValueTask<ulong>` returning `ValueTask.FromResult(_last)` | Eliminates Task state-machine allocation |
| 7 | **Double payload copy** — `TransformForPersist` allocated a `byte[]`, then `payload.ToArray()` created a second copy for `StoredMessage` | Reuse the `TransformForPersist` result directly for `StoredMessage.Payload` when no transform is needed (`_noTransform` flag) | Eliminates 1 `byte[]` alloc per message |
| 8 | **Unnecessary TTL work per publish** — `ExpireFromWheel()` and `RegisterTtl()` called on every write even when `MaxAge=0` | Guarded both with an `_options.MaxAgeMs > 0` check (matches Go: `filestore.go:4701`) | Eliminates hash-wheel overhead when TTL is not configured |
| 9 | **Per-message MsgBlock cache allocation** — `WriteAt` created a `new MessageRecord` for `_cache` on every write | Removed eager cache population; reads now decode from the pending buffer or disk | Eliminates 1 object alloc per message |
| 10 | **Per-message write buffers** — `MsgBlock._pendingWrites` was a `List<byte[]>` with per-message `byte[]` allocations | Replaced with a single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it | Eliminates per-message `byte[]` encoding alloc; single `RandomAccess.Write` per flush |
| 11 | **Pending-buffer read path** — `MsgBlock.Read()` flushed pending writes to disk before reading | Added in-memory read from `_pendingBuf` when the data is still buffered | Avoids an unnecessary disk flush on read-after-write |
### Round 1: FileStore/StreamManager Layer
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 1 | **Per-message synchronous disk I/O** — `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in MsgBlock plus a background flush loop in FileStore (Go's `flushLoop` pattern: coalesce 16KB or 8ms) | Eliminates per-message syscall overhead |
| 2 | **O(n) `GetStateAsync` per publish** — `_messages.Keys.Min()` and `_messages.Values.Sum()` on every publish for MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1) | Eliminates O(n) scan per publish |
| 3 | **Unnecessary `LoadAsync` after every append** — `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates a redundant disk read per publish |
| 4 | **Redundant `PruneExpiredMessages` per publish** — called before every publish even when `MaxAge=0`, and again inside `EnforceRuntimePolicies` | Guarded with a `MaxAgeMs > 0` check; removed the pre-publish call (the background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | **`PrunePerSubject` loading all messages per publish** — `EnforceRuntimePolicies` → `PrunePerSubject` called `ListAsync().GroupBy()` even when `MaxMsgsPer=0` | Guarded with a `MaxMsgsPer > 0` check | Eliminates O(n) scan per publish |
Additional fixes: a SHA256 envelope bypass for unencrypted/uncompressed stores, and a RAFT propose skip for single-replica streams.
### What would further close the gap
| Change | Expected Impact | Go Reference |
|--------|----------------|-------------|
| **Vectored I/O (`writev`)** | Eliminate the coalesce copy in the write loop — write gathered buffers in a single syscall | Go: `net.Buffers.WriteTo()` → `writev()` in `flushOutbound()` |
| **Per-client scratch buffer** | Reuse a 1KB buffer for MSG header formatting across deliveries | Go: `client.msgb` (1024-byte scratch, `msgScratchSize`) |
| **Batch flush signaling** | Deduplicate write-loop wakeups — signal once per readloop iteration, not per delivery | Go: `pcd` map tracks affected clients, `flushClients()` at end of readloop |
| **Eliminate per-message GC allocations** | ~30% improvement on FileStore AppendAsync — pool or eliminate `StoredMessage` objects | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs |