# Go vs .NET NATS Server — Benchmark Comparison
Benchmark run: 2026-03-13. Both servers running on the same machine, tested with identical NATS.Client.Core workloads. Test parallelization disabled to avoid resource contention.
**Environment:** Apple M4, .NET 10, Go nats-server (latest from `golang/nats-server/`).
---
## Core NATS — Pub/Sub Throughput
### Single Publisher (no subscribers)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B | 2,436,416 | 37.2 | 1,425,767 | 21.8 | 0.59x |
| 128 B | 2,143,434 | 261.6 | 1,654,692 | 202.0 | 0.77x |
### Publisher + Subscriber (1:1)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B | 1,140,225 | 17.4 | 207,654 | 3.2 | 0.18x |
| 16 KB | 41,762 | 652.5 | 34,429 | 538.0 | 0.82x |
### Fan-Out (1 Publisher : 4 Subscribers)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B | 3,192,313 | 389.7 | 581,284 | 71.0 | 0.18x |
### Multi-Publisher / Multi-Subscriber (4P x 4S)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B | 269,445 | 32.9 | 529,808 | 64.7 | 1.97x |
---
## Core NATS — Request/Reply Latency
### Single Client, Single Service
| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 128 B | 9,347 | 7,215 | 0.77x | 104.5 | 134.7 | 146.2 | 190.5 |
### 10 Clients, 2 Services (Queue Group)
| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 16 B | 30,893 | 25,861 | 0.84x | 315.0 | 370.2 | 451.1 | 595.0 |
---
## JetStream — Publication
| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|---------|---------|----------|------------|-----------------|
| Synchronous | 16 B | Memory | 16,783 | 13,815 | 0.82x |
| Async (batch) | 128 B | File | 210,387 | 174 | 0.0008x |
> **Note:** Async file store publish remains extremely slow after FileStore-level optimizations (buffered writes, O(1) state tracking, redundant work elimination). The bottleneck is in the E2E network/protocol processing path (synchronous `.GetAwaiter().GetResult()` calls in the client read loop), not storage I/O.
---
## JetStream — Consumption
| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|----------|------------|-----------------|
| Ordered ephemeral consumer | 109,519 | N/A | N/A |
| Durable consumer fetch | 639,247 | 80,792 | 0.13x |
> **Note:** Ordered ephemeral consumer is not yet fully supported on the .NET server (API timeout during consumer creation).
---
## Summary
| Category | Ratio Range | Assessment |
|----------|-------------|------------|
| Pub-only throughput | 0.59x–0.77x | Good — within 2x |
| Pub/sub (large payload) | 0.82x | Good |
| Pub/sub (small payload) | 0.18x | Needs optimization |
| Fan-out | 0.18x | Needs optimization |
| Multi pub/sub | 1.97x | .NET faster (likely measurement artifact at low counts) |
| Request/reply latency | 0.77x–0.84x | Good |
| JetStream sync publish | 0.82x | Good |
| JetStream async file publish | ~0x | Broken — E2E protocol path bottleneck |
| JetStream durable fetch | 0.13x | Needs optimization |
### Key Observations
1. **Pub-only and request/reply are within striking distance** (0.6x–0.85x), suggesting the core message path is reasonably well ported.
2. **Small-payload pub/sub and fan-out are 5x slower** (0.18x ratio). The bottleneck is likely in the subscription dispatch / message delivery hot path — the `SubList.Match()` → `MSG` write loop.
3. **JetStream file store async publish is 1,200x slower than Go** — see [investigation notes](#jetstream-async-file-publish-investigation) below.
4. **JetStream consumption** (durable fetch) is 8x slower than Go. Ordered consumers don't work yet.
5. The multi-pub/sub result showing .NET faster is likely a measurement artifact from the small message count (2,000 per publisher) — not representative at scale.
---
## JetStream Async File Publish Investigation
The async file store publish benchmark publishes 5,000 128-byte messages in batches of 100 to a `Retention=Limits`, `Storage=File`, `MaxMsgs=10_000_000` stream (no MaxAge, no MaxMsgsPer). Go achieves **210,387 msg/s**; .NET achieves **174 msg/s** — a **1,208x** gap.
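Converting these rates to per-message cost makes the scale of the gap concrete (a quick arithmetic check, not a new measurement):

```go
package main

import "fmt"

func main() {
	const goRate = 210387.0 // msg/s, Go async file publish
	const netRate = 174.0   // msg/s, .NET async file publish

	fmt.Printf("ratio: %.0fx\n", goRate/netRate)  // ~1209x
	fmt.Printf("Go:   %.1f us/msg\n", 1e6/goRate) // ~4.8 us
	fmt.Printf(".NET: %.2f ms/msg\n", 1e3/netRate) // ~5.75 ms
}
```

Several milliseconds per 128-byte message is timer/scheduling scale rather than disk-throughput scale, which is consistent with the per-message blocking identified below rather than slow storage I/O.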
The JetStream sync memory store benchmark achieves **0.82x** parity, confirming the bottleneck is specific to the file-store async publish path.
### What was optimized (FileStore layer)
Five root causes were identified and fixed in the FileStore/StreamManager layer:
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 1 | **Per-message synchronous disk I/O** — `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in MsgBlock + background flush loop in FileStore (Go's `flushLoop` pattern: coalesce 16KB or 8ms) | Eliminates per-message syscall overhead |
| 2 | **O(n) `GetStateAsync` per publish** — `_messages.Keys.Min()` and `_messages.Values.Sum()` on every publish for MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1) | Eliminates O(n) scan per publish |
| 3 | **Unnecessary `LoadAsync` after every append** — `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates redundant disk read per publish |
| 4 | **Redundant `PruneExpiredMessages` per publish** — called before every publish even when `MaxAge=0`, and again inside `EnforceRuntimePolicies` | Guarded with `MaxAgeMs > 0` check; removed the pre-publish call (background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | **`PrunePerSubject` loading all messages per publish** — `EnforceRuntimePolicies` → `PrunePerSubject` called `ListAsync().GroupBy()` even when `MaxMsgsPer=0` | Guarded with `MaxMsgsPer > 0` check | Eliminates O(n) scan per publish |
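The incremental bookkeeping behind fix #2 can be sketched as follows. `streamState` and its methods are illustrative names, not the actual StreamManager/FileStore types; the point is that every mutation updates counters in O(1), so limit checks never scan the message map:

```go
package main

import "fmt"

// streamState sketches O(1) state tracking: counts and byte totals
// are maintained on every mutation instead of being recomputed with
// Min()/Sum() scans over all stored messages on each publish.
type streamState struct {
	msgs       map[uint64][]byte // seq -> payload
	firstSeq   uint64
	lastSeq    uint64
	msgCount   uint64
	totalBytes uint64
}

func newStreamState() *streamState {
	return &streamState{msgs: make(map[uint64][]byte), firstSeq: 1}
}

// Append stores a message and updates counters in O(1).
func (s *streamState) Append(payload []byte) {
	s.lastSeq++
	s.msgs[s.lastSeq] = payload
	s.msgCount++
	s.totalBytes += uint64(len(payload))
}

// RemoveFirst drops the oldest message (e.g. MaxMsgs enforcement),
// again touching only the counters, never scanning.
func (s *streamState) RemoveFirst() {
	if payload, ok := s.msgs[s.firstSeq]; ok {
		delete(s.msgs, s.firstSeq)
		s.msgCount--
		s.totalBytes -= uint64(len(payload))
	}
	s.firstSeq++
}

// State answers MaxMsgs/MaxBytes checks in O(1).
func (s *streamState) State() (count, bytes uint64) {
	return s.msgCount, s.totalBytes
}

func main() {
	st := newStreamState()
	for i := 0; i < 3; i++ {
		st.Append(make([]byte, 128))
	}
	st.RemoveFirst()
	c, b := st.State()
	fmt.Println(c, b) // 2 256
}
```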
Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.
### Why the benchmark didn't improve
After all FileStore-level optimizations, the benchmark remained at ~174 msg/s. The bottleneck is **upstream of the storage layer** in the E2E network/protocol processing path.
**Important context from Go source verification:** Go also processes JetStream messages inline on the read goroutine — `processInbound` → `processInboundClientMsg` calls `processJetStreamMsg` synchronously (no channel handoff). `processJetStreamMsg` takes `mset.mu` and calls `store.StoreMsg()` inline (`server/stream.go:5436–6136`). The `pcd` field is `map[*client]struct{}` for deferred outbound flush bookkeeping (`server/client.go:291`), not a channel.
So Go faces the same serial read→process constraint per connection — the 1,200x gap cannot be explained by Go offloading JetStream to another goroutine (it doesn't). The actual differences are:
1. **Write coalescing in the file store** — Go's `writeMsgRecordLocked` appends to `mb.cache.buf` (an in-memory byte slice) and defers disk I/O to a background `flushLoop` goroutine that coalesces at 16KB or 8ms (`server/filestore.go:328, 5796, 5841`). Our .NET port now matches this pattern, but there may be differences in how efficiently the flush loop runs (Task scheduling overhead vs goroutine scheduling).
2. **Coalescing write loop for outbound data** — Go has a dedicated `writeLoop` goroutine per connection that waits on `c.out.sg` (`sync.Cond`, `server/client.go:355, 1274`). Outbound data accumulates in `out.nb` (`net.Buffers`) and is flushed in batches via `net.Buffers.WriteTo` up to `nbMaxVectorSize` buffers (`server/client.go:1615`). The .NET server writes ack responses individually per message — no outbound batching.
3. **Per-message overhead in the .NET protocol path** — The .NET `NatsClient.ProcessInboundAsync` calls `TryCaptureJetStreamPublish` via `.GetAwaiter().GetResult()`, blocking the read loop Task. While Go also processes inline, Go's goroutine scheduler is cheaper for this pattern — goroutines that block on mutex or I/O yield efficiently to the runtime scheduler, whereas .NET's `Task` + `GetAwaiter().GetResult()` on an async context can cause thread pool starvation or synchronization overhead.
4. **AsyncFlush configuration** — Go's file store respects `fcfg.AsyncFlush` (`server/filestore.go:456`). When `AsyncFlush=true` (the default for streams), `writeMsgRecordLocked` does NOT flush synchronously (`server/filestore.go:6803`). When `AsyncFlush=false`, it flushes inline after each write. The .NET benchmark may be triggering synchronous flushes unintentionally.
### What would actually fix it
The fix requires changes to the outbound write path and careful profiling, not further FileStore tuning:
| Change | Description | Go Reference |
|--------|-------------|-------------|
| **Coalescing write loop** | Add a dedicated outbound write loop per connection that batches acks/MSGs using `net.Buffers`-style vectored I/O, woken by a `sync.Cond`-equivalent signal | `server/client.go:1274, 1615` — `writeLoop` with `out.sg` (`sync.Cond`) and `out.nb` (`net.Buffers`) |
| **Eliminate sync-over-async** | Replace `.GetAwaiter().GetResult()` calls in the read loop with true async/await or a synchronous-only code path to avoid thread pool overhead | N/A — architectural difference |
| **Profile Task scheduling** | The background flush loop uses `Task.Delay(1)` for coalescing waits; this may have higher latency than Go's `time.Sleep(1ms)` due to Task scheduler granularity | `server/filestore.go:5841` — `time.Sleep` in `flushLoop` |
| **Verify AsyncFlush is enabled** | Ensure the benchmark stream config sets `AsyncFlush=true` so the file store uses buffered writes rather than synchronous per-message flushes | `server/filestore.go:456` — `fs.fip = !fcfg.AsyncFlush` |
The coalescing write loop is likely the highest-impact change — it explains both the JetStream ack throughput gap and the 0.18x gap in pub/sub (small payload) and fan-out benchmarks.