# Go vs .NET NATS Server — Benchmark Comparison
Benchmark run: 2026-03-13 (America/Indiana/Indianapolis). Both servers ran on the same machine using the benchmark project's README command (`dotnet test tests/NATS.Server.Benchmark.Tests -c Release --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly.
**Environment:** Apple M4, .NET SDK 10.0.101, Release build, Go toolchain installed, Go reference server built from `golang/nats-server/`.
---
## Core NATS — Pub/Sub Throughput
### Single Publisher (no subscribers)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B | 2,223,690 | 33.9 | 1,341,067 | 20.5 | 0.60x |
| 128 B | 2,218,308 | 270.8 | 1,577,523 | 192.6 | 0.71x |
### Publisher + Subscriber (1:1)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B | 292,711 | 4.5 | 862,381 | 13.2 | **2.95x** |
| 16 KB | 32,890 | 513.9 | 28,906 | 451.7 | 0.88x |
### Fan-Out (1 Publisher : 4 Subscribers)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B | 2,945,790 | 359.6 | 1,858,235 | 226.8 | 0.63x |
### Multi-Publisher / Multi-Subscriber (4P x 4S)
| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B | 2,123,480 | 259.2 | 1,392,249 | 170.0 | 0.66x |
---
## Core NATS — Request/Reply Latency
### Single Client, Single Service
| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 128 B | 8,386 | 7,014 | 0.84x | 115.8 | 139.0 | 175.5 | 193.0 |
### 10 Clients, 2 Services (Queue Group)
| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 16 B | 26,470 | 23,478 | 0.89x | 370.2 | 410.6 | 486.0 | 592.8 |
---
## JetStream — Publication
| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|---------|---------|----------|------------|-----------------|
| Synchronous | 16 B | Memory | 14,812 | 12,134 | 0.82x |
| Async (batch) | 128 B | File | 174,705 | 52,350 | 0.30x |
> **Note:** Async file-store publish improved ~10% (47K→52K msg/s) after hot-path optimizations: cached state properties, a single stream lookup per publish, removal of `_messageIndexes`, a hand-rolled pub-ack formatter, exponential flush backoff, and lazy `StoredMessage` materialization. Still storage-bound at 0.30x of Go.
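To make one of the items above concrete (the hand-rolled pub-ack formatter), here is a minimal sketch, assuming the ack payload is the usual small JSON document of the form `{"stream":"<name>","seq":<n>}`: the bytes are composed directly into a caller-supplied buffer with `Utf8Formatter`, so no JSON serializer or intermediate strings are involved. `PubAckFormatter` and its members are illustrative names, not the server's actual API.

```csharp
using System;
using System.Buffers.Text;

// Illustrative sketch only: composes a JetStream-style pub-ack payload
// ({"stream":"<name>","seq":<n>}) directly into a destination buffer.
// Assumes stream names need no JSON escaping; the real formatter may emit more fields.
internal static class PubAckFormatter
{
    private static ReadOnlySpan<byte> StreamField => "{\"stream\":\""u8;
    private static ReadOnlySpan<byte> SeqField => "\",\"seq\":"u8;

    public static int Format(Span<byte> destination, ReadOnlySpan<byte> streamNameUtf8, ulong sequence)
    {
        int pos = 0;
        StreamField.CopyTo(destination);           pos += StreamField.Length;
        streamNameUtf8.CopyTo(destination[pos..]); pos += streamNameUtf8.Length;
        SeqField.CopyTo(destination[pos..]);       pos += SeqField.Length;
        Utf8Formatter.TryFormat(sequence, destination[pos..], out int written);
        pos += written;
        destination[pos++] = (byte)'}';
        return pos; // number of bytes written
    }
}
```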
---
## JetStream — Consumption
| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|----------|------------|-----------------|
| Ordered ephemeral consumer | 166,000 | 95,000 | 0.57x |
| Durable consumer fetch | 510,000 | 214,000 | 0.42x |
> **Note:** Ordered consumer throughput is ~0.57x Go. Signal-based wakeup replaced 5ms polling for pull consumers waiting at the stream tail (immediate notification when messages are published). Batch flush in `DeliverPullFetchMessagesAsync` reduces flush signals from N to N/64. Go comparison numbers vary significantly across runs (Go itself ranges 156K–573K on this machine).
---
## MQTT Throughput
| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|-----------|----------|---------|------------|-----------|-----------------|
| MQTT PubSub (128B, QoS 0) | 34,224 | 4.2 | 44,142 | 5.4 | **1.29x** |
| Cross-Protocol NATS→MQTT (128B) | 158,000 | 19.3 | 190,000 | 23.2 | **1.20x** |
> **Note:** Pure MQTT pub/sub remains above Go at 1.29x. Cross-protocol NATS→MQTT improved from 0.78x to **1.20x** after adding a `string.Create` fast path in `NatsToMqtt` (avoids StringBuilder for subjects without `_DOT_`) and pre-warming the topic bytes cache on subscription creation.
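A minimal sketch of the `string.Create` fast path described above, assuming the bridge maps NATS `.` separators to MQTT `/` and only needs the slower StringBuilder path when the subject contains the `_DOT_` escape; the method names and the exact mapping are assumptions, not the bridge's actual code.

```csharp
using System;

// Sketch: fast path avoids StringBuilder when no "_DOT_" escape is present (the common case).
// Names and mapping rules are illustrative.
internal static class MqttTopicTranslator
{
    public static string NatsToMqtt(string natsSubject)
    {
        if (!natsSubject.Contains("_DOT_", StringComparison.Ordinal))
        {
            // Fast path: same length, single string allocation, per-char replacement.
            return string.Create(natsSubject.Length, natsSubject, static (dest, src) =>
            {
                for (int i = 0; i < src.Length; i++)
                    dest[i] = src[i] == '.' ? '/' : src[i];
            });
        }

        return NatsToMqttSlow(natsSubject); // StringBuilder path handles _DOT_ unescaping
    }

    private static string NatsToMqttSlow(string natsSubject) =>
        throw new NotImplementedException("StringBuilder-based path elided in this sketch.");
}
```

`string.Create` runs the replacement in place over the destination span, so the common case costs exactly one string allocation.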
---
## Transport Overhead
### TLS
| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|-----------|----------|---------|------------|-----------|-----------------|
| TLS PubSub 1:1 (128B) | 289,548 | 35.3 | 251,935 | 30.8 | 0.87x |
| TLS Pub-Only (128B) | 1,782,442 | 217.6 | 1,163,021 | 142.0 | 0.65x |
### WebSocket
| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|-----------|----------|---------|------------|-----------|-----------------|
| WS PubSub 1:1 (128B) | 66,584 | 8.1 | 73,023 | 8.9 | **1.10x** |
| WS Pub-Only (128B) | 106,302 | 13.0 | 88,682 | 10.8 | 0.83x |
> **Note:** TLS pub/sub is close to parity at 0.87x. WebSocket pub/sub slightly favors .NET at 1.10x. Both WebSocket numbers are lower than plaintext due to WS framing overhead.
---
## Hot Path Microbenchmarks (.NET only)
### SubList
| Benchmark | .NET msg/s | .NET MB/s | Alloc |
|-----------|------------|-----------|-------|
| SubList Exact Match (128 subjects) | 16,497,186 | 220.3 | 0.00 B/op |
| SubList Wildcard Match | 16,147,367 | 215.6 | 0.00 B/op |
| SubList Queue Match | 15,582,052 | 118.9 | 0.00 B/op |
| SubList Remote Interest | 259,940 | 4.2 | 0.00 B/op |
### Parser
| Benchmark | Ops/s | MB/s | Alloc |
|-----------|-------|------|-------|
| Parser PING | 6,283,578 | 36.0 | 0.0 B/op |
| Parser PUB | 2,712,550 | 103.5 | 40.0 B/op |
| Parser HPUB | 2,338,555 | 124.9 | 40.0 B/op |
| Parser PUB split payload | 2,043,813 | 78.0 | 176.0 B/op |
### FileStore
| Benchmark | Ops/s | MB/s | Alloc |
|-----------|-------|------|-------|
| FileStore AppendAsync (128B) | 244,089 | 29.8 | 1552.9 B/op |
| FileStore LoadLastBySubject (hot) | 12,784,127 | 780.3 | 0.0 B/op |
| FileStore PurgeEx+Trim | 332 | 0.0 | 5440792.9 B/op |
---
## Summary
| Category | Ratio Range | Assessment |
|----------|-------------|------------|
| Pub-only throughput | 0.60x–0.71x | Mixed; still behind Go |
| Pub/sub (small payload) | **2.95x** | .NET outperforms Go decisively |
| Pub/sub (large payload) | 0.88x | Close, but below parity |
| Fan-out | 0.63x | Still materially behind Go |
| Multi pub/sub | 0.66x | Meaningful gap remains |
| Request/reply latency | 0.84x–0.89x | Good |
| JetStream sync publish | 0.82x | Strong |
| JetStream async file publish | 0.39x | Improved versus older snapshots, still storage-bound |
| JetStream ordered consume | 0.57x | Signal-based wakeup + batch flush |
| JetStream durable fetch | 0.42x | Same path, Go numbers variable |
| MQTT pub/sub | **1.29x** | .NET outperforms Go |
| MQTT cross-protocol | **1.20x** | .NET now outperforms Go |
| TLS pub/sub | 0.87x | Close to parity |
| TLS pub-only | 0.65x | Encryption throughput gap |
| WebSocket pub/sub | **1.10x** | .NET slightly ahead |
| WebSocket pub-only | 0.83x | Good |
### Key Observations
1. **Small-payload 1:1 pub/sub is back to a large .NET lead in this final run** at 2.95x (862K vs 293K msg/s). That puts the merged benchmark profile much closer to the earlier comparison snapshot than the intermediate integration-only run.
2. **Async file-store publish is still materially better than the older 0.30x baseline** at 0.39x (57.5K vs 148.2K msg/s), which is consistent with the FileStore metadata and payload-ownership changes helping the write path even though they did not eliminate the gap.
3. **The new FileStore direct benchmarks show what remains expensive in storage maintenance**: `LoadLastBySubject` is allocation-free and extremely fast, `AppendAsync` is still about 1553 B/op, and repeated `PurgeEx+Trim` still burns roughly 5.4 MB/op.
4. **Ordered consumer throughput improved to 0.57x** (~95K vs ~166K msg/s). Signal-based wakeup replaced 5ms polling for pull consumers waiting at the stream tail, and batch flush reduces flush signals from N to N/64. Go comparison numbers are highly variable on this machine (156K–573K across runs).
5. **Durable fetch is at 0.42x** (~214K vs ~510K msg/s). The synchronous fetch path (used by `FetchAsync` client) was not changed in this round; the gap is in the store read and serialization overhead.
6. **Parser and SubList microbenchmarks remain stable and low-allocation**. The storage and consumer layers continue to dominate the server-level benchmark gaps, not the parser or subject matcher hot paths.
7. **Pure MQTT pub/sub shows .NET outperforming Go at 1.29x** (44K vs 34K msg/s). The .NET MQTT protocol bridge is competitive for direct MQTT-to-MQTT messaging.
8. **MQTT cross-protocol routing (NATS→MQTT) improved to 1.20x** (~190K vs ~158K msg/s). The `string.Create` fast path in `NatsToMqtt` eliminates StringBuilder allocation for the common case (no `_DOT_` escape), and pre-warming the topic bytes cache on subscription creation eliminates first-message latency.
9. **TLS pub/sub is close to parity at 0.87x** (252K vs 290K msg/s). TLS pub-only is 0.65x (1.16M vs 1.78M msg/s), consistent with the general publish-path gap seen in plaintext benchmarks.
10. **WebSocket pub/sub slightly favors .NET at 1.10x** (73K vs 67K msg/s). WebSocket pub-only is 0.83x (89K vs 106K msg/s). Both servers show similar WS framing overhead relative to their plaintext performance.
---
## Optimization History
### Round 8: Ordered Consumer + Cross-Protocol Optimization
Three optimizations targeting pull consumer delivery and MQTT cross-protocol throughput:
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 28 | **Per-message flush signal in DeliverPullFetchMessagesAsync** — `DeliverMessage` called `SendMessage`, which triggered `_flushSignal.Writer.TryWrite(0)` per message; for a batch of N messages, that meant N flush signals and N write-loop wakeups | Replaced with `SendMessageNoFlush` + batch flush every 64 messages + final flush after the loop; bypasses `DeliverMessage` entirely (no permission check / auto-unsub needed for the JS delivery inbox) | Reduces flush signals from N to N/64 per batch |
| 29 | **5ms polling delay in pull consumer wait loop** — `Task.Delay(5)` in `DeliverPullFetchMessagesAsync` and `PullConsumerEngine.WaitForMessageAsync` added up to 5ms latency per empty slot; for tail-following consumers, every new message waited up to 5ms to be noticed | Added `StreamHandle.NotifyPublish()` / `WaitForPublishAsync()` using `TaskCompletionSource` signaling (sketched after this table); publishers call `NotifyPublish` after `AppendAsync`; consumers wait on the signal with a heartbeat-interval timeout | Eliminates polling delay; instant wakeup on publish |
| 30 | **StringBuilder allocation in NatsToMqtt for common case** — every uncached `NatsToMqtt` call allocated a StringBuilder even when no `_DOT_` escape sequences were present (the common case) | Added a `string.Create` fast path that uses a char-replacement lambda when no `_DOT_` is found; pre-warm the topic bytes cache on MQTT subscription creation | Eliminates StringBuilder + string alloc for the common case; no cache miss on first delivery |
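The wakeup half of fix #29 can be sketched as a swap-on-signal `TaskCompletionSource`. The `StreamHandle.NotifyPublish` / `WaitForPublishAsync` names follow the table; the body is an assumption about how such signaling is typically wired, not the server's actual implementation.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch of signal-based wakeup: publishers call NotifyPublish() after a successful
// append; tail-following pull consumers re-check the store and then await
// WaitForPublishAsync() instead of polling with Task.Delay(5).
internal sealed class StreamHandle
{
    private TaskCompletionSource _publishSignal =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    // Publish path: complete the current signal and install a fresh one for the next wait.
    public void NotifyPublish()
    {
        var signal = Interlocked.Exchange(
            ref _publishSignal,
            new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously));
        signal.TrySetResult();
    }

    // Consumer path: callers must re-check the store for new messages *before* waiting,
    // otherwise a publish that raced ahead of this call is not noticed until the timeout.
    public async Task<bool> WaitForPublishAsync(TimeSpan heartbeatInterval, CancellationToken ct)
    {
        Task signal = Volatile.Read(ref _publishSignal).Task;
        Task completed = await Task.WhenAny(signal, Task.Delay(heartbeatInterval, ct)).ConfigureAwait(false);
        return completed == signal; // true: a publish arrived; false: send an idle heartbeat
    }
}
```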
### Round 7: MQTT Cross-Protocol Write Path
Four optimizations targeting the NATS→MQTT delivery hot path (cross-protocol throughput improved from 0.30x to 0.78x):
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 24 | **Per-message async fire-and-forget in MqttNatsClientAdapter** — each `SendMessage` called `SendBinaryPublishAsync` which acquired a `SemaphoreSlim`, allocated a full PUBLISH packet `byte[]`, wrote, and flushed the stream — all per message, bypassing the server's deferred-flush batching | Replaced with synchronous `EnqueuePublishNoFlush()` that formats MQTT PUBLISH directly into `_directBuf` under SpinLock, matching the NatsClient pattern; `SignalFlush()` signals the write loop for batch flush | Eliminates async Task + SemaphoreSlim + per-message flush |
| 25 | **Per-message `byte[]` allocation for MQTT PUBLISH packets** — `MqttPacketWriter.WritePublish()` allocated topic bytes, a variable header, a remaining-length array, and the full packet array on every delivery | Added `WritePublishTo(Span<byte>)` that formats the entire PUBLISH packet directly into the destination span using `Span<byte>` operations, with zero heap allocation (sketched after this table) | Eliminates 4+ `byte[]` allocs per delivery |
| 26 | **Per-message NATS→MQTT topic translation** — `NatsToMqtt()` allocated a `StringBuilder`, produced a `string`, then `Encoding.UTF8.GetBytes()` re-encoded it on every delivery | Added `NatsToMqttBytes()` with a bounded `ConcurrentDictionary<string, byte[]>` cache (4096 entries); the cached result includes pre-encoded UTF-8 bytes | Eliminates string + encoding alloc per delivery for cached topics |
| 27 | **Per-message `FlushAsync` on plain TCP sockets** — `WriteBinaryAsync` flushed after every packet write, even on `NetworkStream` where TCP auto-flushes | Write loop skips `FlushAsync` for plain sockets; for TLS/wrapped streams, it flushes once per batch (not per message) | Reduces syscalls from 2N to 1 per batch |
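Fix #25's span-based writer can be sketched as below for the QoS 0 case: an MQTT 3.1.1 PUBLISH is a fixed header byte, a variable-length remaining-length field, a 2-byte big-endian topic length, the topic bytes, and the payload, so the whole packet can be formatted into one destination span. `WritePublishTo` matches the name in the table, but the body here is a simplified assumption (QoS 0 only, no DUP/RETAIN, no packet id).

```csharp
using System;
using System.Buffers.Binary;

// Sketch: allocation-free QoS 0 MQTT 3.1.1 PUBLISH formatting into a caller-supplied span.
internal static class MqttPacketWriter
{
    public static int WritePublishTo(Span<byte> dest, ReadOnlySpan<byte> topicUtf8, ReadOnlySpan<byte> payload)
    {
        int remainingLength = 2 + topicUtf8.Length + payload.Length; // topic length field + topic + payload

        int pos = 0;
        dest[pos++] = 0x30;                                  // PUBLISH, DUP=0, QoS=0, RETAIN=0
        pos += WriteRemainingLength(dest[pos..], remainingLength);
        BinaryPrimitives.WriteUInt16BigEndian(dest[pos..], (ushort)topicUtf8.Length);
        pos += 2;
        topicUtf8.CopyTo(dest[pos..]); pos += topicUtf8.Length;
        payload.CopyTo(dest[pos..]);   pos += payload.Length;
        return pos;                                          // total packet size
    }

    // MQTT variable-length integer: 7 bits per byte, high bit set when more bytes follow.
    private static int WriteRemainingLength(Span<byte> dest, int value)
    {
        int count = 0;
        do
        {
            int digit = value % 128;
            value /= 128;
            dest[count++] = (byte)(value > 0 ? digit | 0x80 : digit);
        } while (value > 0);
        return count;
    }
}
```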
### Round 6: Batch Flush Signaling + Fetch Optimizations
Four optimizations targeting fan-out and consumer fetch hot paths:
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 20 | **Per-subscriber flush signal in fan-out** — each `SendMessage` called `_flushSignal.Writer.TryWrite(0)` independently; for 1:4 fan-out, 4 channel writes + 4 write-loop wakeups per published message | Split `SendMessage` into `SendMessageNoFlush` + `SignalFlush`; `ProcessMessage` collects unique clients in `[ThreadStatic] HashSet<INatsClient>` (Go's `pcd` pattern), one flush signal per unique client after fan-out | Reduces channel writes from N to unique-client-count per publish |
| 21 | **Per-fetch `CompiledFilter` allocation** — `CompiledFilter.FromConfig(consumer.Config)` was called on every fetch request, allocating a new filter object each time | Cached the `CompiledFilter` on `ConsumerHandle` with staleness detection (reference + value check on the filter config fields); reused across fetches | Eliminates per-fetch filter allocation |
| 22 | **Per-message string interpolation in ack reply** — `$"$JS.ACK.{stream}.{consumer}.1.{seq}.{deliverySeq}.{ts}.{pending}"` allocated intermediate strings and boxed numeric types on every delivery | Pre-compute the `$"$JS.ACK.{stream}.{consumer}.1."` prefix before the loop; use `stackalloc char[]` + `TryFormat` for the numeric suffix (sketched after this table), with zero intermediate allocations | Eliminates 4+ string allocs per delivered message |
| 23 | **Per-fetch `List<StoredMessage>` allocation** — `new List<StoredMessage>(batch)` was allocated on every `FetchAsync` call | `[ThreadStatic]` reusable list with `.Clear()` + capacity growth; `PullFetchBatch` snapshots via `.ToArray()` for safe handoff | Eliminates per-fetch list allocation |
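Fix #22's suffix formatting can be sketched like this: the per-delivery numbers are written into a `stackalloc` char buffer with `TryFormat` and joined to the pre-computed prefix in a single final allocation. The helper below is illustrative; only the general `stackalloc` + `TryFormat` shape is taken from the table.

```csharp
using System;

// Sketch: builds "$JS.ACK.{stream}.{consumer}.1.{seq}.{deliverySeq}.{ts}.{pending}"
// from a pre-computed prefix plus a stack-formatted numeric suffix.
internal static class AckSubjectFormatter
{
    public static string Build(string prefix, ulong streamSeq, ulong deliverySeq, ulong timestampNs, ulong pending)
    {
        Span<char> suffix = stackalloc char[96]; // room for four 64-bit numbers plus separators
        int pos = 0;

        AppendNumber(suffix, ref pos, streamSeq);
        suffix[pos++] = '.';
        AppendNumber(suffix, ref pos, deliverySeq);
        suffix[pos++] = '.';
        AppendNumber(suffix, ref pos, timestampNs);
        suffix[pos++] = '.';
        AppendNumber(suffix, ref pos, pending);

        return string.Concat(prefix.AsSpan(), suffix[..pos]); // single final allocation
    }

    private static void AppendNumber(Span<char> dest, ref int pos, ulong value)
    {
        value.TryFormat(dest[pos..], out int written);
        pos += written;
    }
}
```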
### Round 5: Non-blocking ConsumeAsync (ordered + durable consumers)
One root cause was identified and fixed in the MSG.NEXT request handling path:
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 19 | **Synchronous blocking in DeliverPullFetchMessages** — `FetchAsync(...).GetAwaiter().GetResult()` blocked the client's read loop for the full `expires` timeout (30s). With `batch=1000` and only 5 messages available, the fetch polled for message 6 indefinitely. No messages were delivered until the timeout fired, causing the client to receive 0 messages before its own timeout. | Split into two paths: `noWait`/no-expires uses synchronous fetch (existing behavior for the `FetchAsync` client); `expires > 0` spawns a `DeliverPullFetchMessagesAsync` background task that delivers messages incrementally without blocking the read loop, with idle heartbeat support | Enables `ConsumeAsync` for both ordered and durable consumers; ordered consumer: 99K msg/s (0.64x Go) |
### Round 4: Per-Client Direct Write Buffer (pub/sub + fan-out + multi pub/sub)
Four optimizations were implemented in the message delivery hot path:
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 15 | **Per-message channel overhead** — each `SendMessage` call went through `Channel<OutboundData>.TryWrite`, incurring lock contention and memory barriers | Replaced channel-based message delivery with per-client `_directBuf` byte array under `SpinLock`; messages written directly to contiguous buffer | Eliminates channel overhead per delivery |
| 16 | **Per-message heap allocation for MSG header** — `_outboundBufferPool.RentBuffer()` allocated a pooled `byte[]` for each MSG header | Replaced with `stackalloc byte[512]` — MSG header formatted entirely on the stack, then copied into `_directBuf` | Zero heap allocations per delivery |
| 17 | **Per-message socket write** — write loop issued one `SendAsync` per channel item, even with coalescing | Double-buffer swap: write loop swaps `_directBuf` ↔ `_writeBuf` under `SpinLock`, then writes the entire batch in a single `SendAsync` (sketched after this table); zero allocation on swap | Single syscall per batch, zero-copy buffer reuse |
| 18 | **Separate wake channels** — `SendMessage` and `WriteProtocol` used different signaling paths | Unified on `_flushSignal` channel (bounded capacity 1, DropWrite); both paths signal the same channel, write loop drains both `_directBuf` and `_outbound` on each wake | Single wait point, no missed wakes |
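Fixes #15-#17 combine into a per-client structure along the lines of the sketch below: delivery threads append pre-formatted frames into `_directBuf` under a `SpinLock`, and the write loop swaps in `_writeBuf` so each batch goes out in one `SendAsync`. Field and method names mirror the table; sizing, growth, and error handling are simplified assumptions.

```csharp
using System;
using System.Threading;

// Sketch of the per-client direct write buffer with double-buffer swap.
internal sealed class DirectWriteBuffer
{
    private byte[] _directBuf = new byte[64 * 1024]; // filled by delivery threads
    private byte[] _writeBuf = new byte[64 * 1024];  // drained by the single write loop
    private int _directLen;
    private SpinLock _lock = new(enableThreadOwnerTracking: false);

    // Delivery hot path: frame is already formatted (header built with stackalloc, per fix #16).
    public void Enqueue(ReadOnlySpan<byte> frame)
    {
        bool taken = false;
        try
        {
            _lock.Enter(ref taken);
            if (_directLen + frame.Length > _directBuf.Length)
                Array.Resize(ref _directBuf, Math.Max(_directBuf.Length * 2, _directLen + frame.Length));
            frame.CopyTo(_directBuf.AsSpan(_directLen));
            _directLen += frame.Length;
        }
        finally
        {
            if (taken) _lock.Exit();
        }
    }

    // Write loop: swap buffers under the lock and return the filled batch for one SendAsync.
    public ReadOnlyMemory<byte> SwapForWrite()
    {
        bool taken = false;
        try
        {
            _lock.Enter(ref taken);
            (_directBuf, _writeBuf) = (_writeBuf, _directBuf);
            int len = _directLen;
            _directLen = 0;
            return _writeBuf.AsMemory(0, len);
        }
        finally
        {
            if (taken) _lock.Exit();
        }
    }
}
```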
### Round 3: Outbound Write Path (pub/sub + fan-out + fetch)
Three root causes were identified and fixed in the message delivery hot path:
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 12 | **Per-message `.ToArray()` allocation in SendMessage** — `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber | Replaced `IMemoryOwner` rent/copy/dispose with direct `byte[]` from pool; write loop returns buffers after writing | Eliminates 1 heap alloc per delivery (4 per fan-out message) |
| 13 | **Per-message `WriteAsync` in write loop** — each queued message triggered a separate `_stream.WriteAsync()` system call | Added 64KB coalesce buffer; drain all pending messages into contiguous buffer, single `WriteAsync` per batch | Reduces syscalls from N to 1 per batch |
| 14 | **Profiling `Stopwatch` on every message** — `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture`, even for non-JetStream messages | Removed profiling instrumentation from hot path | Eliminates ~200ns overhead per message |
### Round 2: FileStore AppendAsync Hot Path
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 6 | **Async state machine overhead** — `AppendAsync` was `async ValueTask<ulong>` but never actually awaited | Changed to synchronous `ValueTask<ulong>` returning `ValueTask.FromResult(_last)` | Eliminates Task state machine allocation |
| 7 | **Double payload copy** — `TransformForPersist` allocated a `byte[]`, then `payload.ToArray()` created a second copy for `StoredMessage` | Reuse `TransformForPersist` result directly for `StoredMessage.Payload` when no transform is needed (`_noTransform` flag) | Eliminates 1 `byte[]` alloc per message |
| 8 | **Unnecessary TTL work per publish** — `ExpireFromWheel()` and `RegisterTtl()` were called on every write even when `MaxAge=0` | Guarded both with a `_options.MaxAgeMs > 0` check (matches Go: `filestore.go:4701`) | Eliminates hash wheel overhead when TTL not configured |
| 9 | **Per-message MsgBlock cache allocation** — `WriteAt` created a `new MessageRecord` for `_cache` on every write | Removed eager cache population; reads now decode from the pending buffer or disk | Eliminates 1 object alloc per message |
| 10 | **Per-message `byte[]` allocations in the write buffer** — `MsgBlock._pendingWrites` was a `List<byte[]>` with a per-message `byte[]` allocation | Replaced with a single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it (sketched after this table) | Eliminates per-message `byte[]` encoding alloc; single `RandomAccess.Write` per flush |
| 11 | **Read-after-write forced a disk flush** — `MsgBlock.Read()` flushed pending writes to disk before reading | Added an in-memory read from `_pendingBuf` when the data is still in the buffer | Avoids unnecessary disk flush on read-after-write |
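A rough sketch of the contiguous pending buffer from fix #10, under the assumption that records are encoded straight into the buffer and the background flush loop writes the whole region with one `RandomAccess.Write`; the type and its members are illustrative, not FileStore's actual shape.

```csharp
using System;
using System.IO;
using Microsoft.Win32.SafeHandles;

// Sketch: one growable pending buffer per message block, flushed with a single write.
internal sealed class PendingWriteBuffer
{
    private readonly SafeFileHandle _file;
    private byte[] _pendingBuf = new byte[16 * 1024];
    private int _pendingLen;
    private long _fileOffset; // where the pending region starts in the block file

    public PendingWriteBuffer(SafeFileHandle file, long fileOffset)
    {
        _file = file;
        _fileOffset = fileOffset;
    }

    // EncodeTo-style append: the caller formats the record directly into this span.
    public Span<byte> GetSpan(int sizeHint)
    {
        if (_pendingLen + sizeHint > _pendingBuf.Length)
            Array.Resize(ref _pendingBuf, Math.Max(_pendingBuf.Length * 2, _pendingLen + sizeHint));
        return _pendingBuf.AsSpan(_pendingLen, sizeHint);
    }

    public void Advance(int written) => _pendingLen += written;

    // Called from the background flush loop (coalescing ~16KB or ~8ms, per Round 1).
    public void Flush()
    {
        if (_pendingLen == 0) return;
        RandomAccess.Write(_file, _pendingBuf.AsSpan(0, _pendingLen), _fileOffset);
        _fileOffset += _pendingLen;
        _pendingLen = 0;
    }
}
```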
### Round 1: FileStore/StreamManager Layer
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 1 | **Per-message synchronous disk I/O** — `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in MsgBlock + a background flush loop in FileStore (Go's `flushLoop` pattern: coalesce 16KB or 8ms) | Eliminates per-message syscall overhead |
| 2 | **O(n) `GetStateAsync` per publish** — `_messages.Keys.Min()` and `_messages.Values.Sum()` ran on every publish for the MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths (sketched below); `GetStateAsync` is now O(1) | Eliminates O(n) scan per publish |
| 3 | **Unnecessary `LoadAsync` after every append** — `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates redundant disk read per publish |
| 4 | **Redundant `PruneExpiredMessages` per publish** — called before every publish even when `MaxAge=0`, and again inside `EnforceRuntimePolicies` | Guarded with `MaxAgeMs > 0` check; removed the pre-publish call (background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | **`PrunePerSubject` loading all messages per publish** — `EnforceRuntimePolicies` → `PrunePerSubject` called `ListAsync().GroupBy()` even when `MaxMsgsPer=0` | Guarded with `MaxMsgsPer > 0` check | Eliminates O(n) scan per publish |
Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.
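Fix #2's incremental state can be sketched as a handful of counters that every mutation path keeps in sync, so the per-publish limit checks become plain field reads. The class and member names below are illustrative; `_firstSeq` is assumed to be updated under the store's existing lock.

```csharp
using System.Threading;

// Sketch: O(1) stream state instead of Keys.Min() / Values.Sum() scans per publish.
internal sealed class StreamStoreCounters
{
    private long _messageCount;
    private long _totalBytes;
    private ulong _firstSeq = 1;

    // Append path keeps the counters in sync.
    public void OnAppend(int payloadLength)
    {
        Interlocked.Increment(ref _messageCount);
        Interlocked.Add(ref _totalBytes, payloadLength);
    }

    // Removal of the oldest message (MaxMsgs/MaxBytes enforcement, expiry, purge).
    public void OnRemoveFirst(ulong removedSeq, int payloadLength)
    {
        Interlocked.Decrement(ref _messageCount);
        Interlocked.Add(ref _totalBytes, -payloadLength);
        _firstSeq = removedSeq + 1; // next live sequence; assumed to run under the store lock
    }

    // GetStateAsync-style read: no dictionary scan, just the counters.
    public (long Messages, long Bytes, ulong FirstSeq) Snapshot() =>
        (Interlocked.Read(ref _messageCount), Interlocked.Read(ref _totalBytes), _firstSeq);
}
```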
### What would further close the gap
| Change | Expected Impact | Go Reference |
|--------|----------------|-------------|
| **Fan-out parallelism** | Deliver to subscribers concurrently instead of serially from publisher's read loop | Go: `processMsgResults` fans out per-client via goroutines |
| **Eliminate per-message GC allocations in FileStore** | ~30% improvement on FileStore AppendAsync — replace `StoredMessage` class with `StoredMessageMeta` struct in `_messages` dict, reconstruct full message from MsgBlock on read | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs; 80+ sites in FileStore.cs need migration |
| **Ordered consumer delivery optimization** | Investigate the .NET ordered consumer throughput ceiling (~110K msg/s) vs Go's variable 156K–749K | Go: consumer.go ordered consumer fast path |