Three optimizations making the serial fan-out path cheaper (fan-out 0.63x→0.70x,
multi pub/sub 0.65x→0.69x):
1. Pre-format MSG prefix ("MSG subject ") and suffix (" [reply] sizes\r\n") once
per publish. New SendMessagePreformatted writes prefix+sid+suffix directly into
_directBuf — zero encoding, pure memory copies. Only SID varies per delivery.
2. Replace queue-group round-robin Interlocked.Increment/Decrement with non-atomic
uint QueueRoundRobin++ (safe: ProcessMessage runs single-threaded per connection).
3. Replace HashSet<INatsClient> pcd with ThreadStatic INatsClient[] + linear scan.
O(n) but n≤16; faster than hash for small fan-out counts.
# Go vs .NET NATS Server — Benchmark Comparison
Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests -c Release --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly.

**Environment:** Apple M4, .NET SDK 10.0.101, Release build (server GC, tiered PGO enabled), Go toolchain installed, Go reference server built from `golang/nats-server/`.

---
## Core NATS — Pub/Sub Throughput
### Single Publisher (no subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B | 2,223,690 | 33.9 | 1,651,727 | 25.2 | 0.74x |
| 128 B | 2,218,308 | 270.8 | 1,368,967 | 167.1 | 0.62x |
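
The derived columns in these tables can be reproduced from the raw message rates. The sketch below assumes the MB/s column is msg/s × payload bytes ÷ 2^20, which is an inference from the numbers in this table rather than something the benchmark README states. `BenchmarkMath` and its method names are hypothetical.

```csharp
using System;

// Hypothetical helper for reading the tables; not part of the benchmark project.
public static class BenchmarkMath
{
    // Throughput column: messages per second times payload size, divided by 2^20.
    // (Assumption: the "MB/s" column is really MiB/s; the table values fit this.)
    public static double MegabytesPerSecond(double msgPerSec, int payloadBytes)
        => msgPerSec * payloadBytes / (1024.0 * 1024.0);

    // Ratio column: .NET rate divided by Go rate. Values above 1.0 mean .NET is faster.
    public static double Ratio(double dotnetMsgPerSec, double goMsgPerSec)
        => dotnetMsgPerSec / goMsgPerSec;
}
```

For the 16 B row, `MegabytesPerSecond(2_223_690, 16)` is roughly 33.9 and `Ratio(1_651_727, 2_223_690)` is roughly 0.74, which matches the table.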
### Publisher + Subscriber (1:1)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B | 292,711 | 4.5 | 723,867 | 11.0 | **2.47x** |
| 16 KB | 32,890 | 513.9 | 37,943 | 592.9 | **1.15x** |

### Fan-Out (1 Publisher : 4 Subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B | 2,945,790 | 359.6 | 2,063,771 | 251.9 | 0.70x |

> **Note:** Fan-out improved from 0.63x to 0.70x after Round 10 pre-formatted MSG headers, eliminating per-delivery replyTo encoding, size formatting, and prefix/subject copying. Only the SID varies per delivery now.

### Multi-Publisher / Multi-Subscriber (4P x 4S)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B | 2,123,480 | 259.2 | 1,465,416 | 178.9 | 0.69x |

---
## Core NATS — Request/Reply Latency
### Single Client, Single Service

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 128 B | 8,386 | 7,424 | 0.89x | 115.8 | 139.0 | 175.5 | 193.0 |

### 10 Clients, 2 Services (Queue Group)

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (us) | .NET P50 (us) | Go P99 (us) | .NET P99 (us) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 16 B | 26,470 | 26,620 | **1.01x** | 370.2 | 376.0 | 486.0 | 592.8 |

---
## JetStream — Publication
| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|---------|---------|----------|------------|-----------------|
| Synchronous | 16 B | Memory | 14,812 | 12,134 | 0.82x |
| Async (batch) | 128 B | File | 174,705 | 52,350 | 0.30x |

> **Note:** Async file-store publish improved ~10% (47K→52K) after hot-path optimizations: cached state properties, single stream lookup, _messageIndexes removal, hand-rolled pub-ack formatter, exponential flush backoff, lazy StoredMessage materialization. Still storage-bound at 0.30x Go.

---
## JetStream — Consumption
| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|----------|------------|-----------------|
| Ordered ephemeral consumer | 166,000 | 102,369 | 0.62x |
| Durable consumer fetch | 510,000 | 468,252 | 0.92x |

> **Note:** Ordered consumer improved to 0.62x (102K vs 166K). Durable fetch jumped to 0.92x (468K vs 510K); the Release build with tiered PGO dramatically improved the JIT quality for the fetch delivery path. Go comparison numbers vary significantly across runs.

---
## MQTT Throughput
| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|-----------|----------|---------|------------|-----------|-----------------|
| MQTT PubSub (128B, QoS 0) | 34,224 | 4.2 | 47,341 | 5.8 | **1.38x** |
| Cross-Protocol NATS→MQTT (128B) | 158,000 | 19.3 | 229,932 | 28.1 | **1.46x** |

> **Note:** Pure MQTT pub/sub extended its lead to 1.38x. Cross-protocol NATS→MQTT is now at **1.46x**; the Release build JIT further benefits the delivery path.

---
## Transport Overhead
### TLS

| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|-----------|----------|---------|------------|-----------|-----------------|
| TLS PubSub 1:1 (128B) | 289,548 | 35.3 | 254,834 | 31.1 | 0.88x |
| TLS Pub-Only (128B) | 1,782,442 | 217.6 | 877,149 | 107.1 | 0.49x |

### WebSocket

| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|-----------|----------|---------|------------|-----------|-----------------|
| WS PubSub 1:1 (128B) | 66,584 | 8.1 | 62,249 | 7.6 | 0.93x |
| WS Pub-Only (128B) | 106,302 | 13.0 | 85,878 | 10.5 | 0.81x |

> **Note:** TLS pub/sub stable at 0.88x. WebSocket pub/sub at 0.93x. Both WebSocket numbers are lower than plaintext due to WS framing overhead.

---
## Hot Path Microbenchmarks (.NET only)
### SubList

| Benchmark | .NET msg/s | .NET MB/s | Alloc |
|-----------|------------|-----------|-------|
| SubList Exact Match (128 subjects) | 19,285,510 | 257.5 | 0.00 B/op |
| SubList Wildcard Match | 18,876,330 | 252.0 | 0.00 B/op |
| SubList Queue Match | 20,639,153 | 157.5 | 0.00 B/op |
| SubList Remote Interest | 274,703 | 4.5 | 0.00 B/op |

### Parser

| Benchmark | Ops/s | MB/s | Alloc |
|-----------|-------|------|-------|
| Parser PING | 6,283,578 | 36.0 | 0.0 B/op |
| Parser PUB | 2,712,550 | 103.5 | 40.0 B/op |
| Parser HPUB | 2,338,555 | 124.9 | 40.0 B/op |
| Parser PUB split payload | 2,043,813 | 78.0 | 176.0 B/op |

### FileStore

| Benchmark | Ops/s | MB/s | Alloc |
|-----------|-------|------|-------|
| FileStore AppendAsync (128B) | 244,089 | 29.8 | 1552.9 B/op |
| FileStore LoadLastBySubject (hot) | 12,784,127 | 780.3 | 0.0 B/op |
| FileStore PurgeEx+Trim | 332 | 0.0 | 5440792.9 B/op |

---
## Summary
| Category | Ratio Range | Assessment |
|----------|-------------|------------|
| Pub-only throughput | 0.62x–0.74x | Improved with Release build |
| Pub/sub (small payload) | **2.47x** | .NET outperforms Go decisively |
| Pub/sub (large payload) | **1.15x** | .NET now exceeds parity |
| Fan-out | 0.70x | Improved: pre-formatted MSG headers |
| Multi pub/sub | 0.69x | Improved: same optimizations |
| Request/reply latency | 0.89x–**1.01x** | Effectively at parity |
| JetStream sync publish | 0.82x | Run-to-run variance |
| JetStream async file publish | 0.30x | Storage-bound |
| JetStream ordered consume | 0.62x | Improved with Release build |
| JetStream durable fetch | 0.92x | Major improvement with Release build |
| MQTT pub/sub | **1.38x** | .NET outperforms Go |
| MQTT cross-protocol | **1.46x** | .NET strongly outperforms Go |
| TLS pub/sub | 0.88x | Close to parity |
| TLS pub-only | 0.49x | Variance / contention with other tests |
| WebSocket pub/sub | 0.93x | Close to parity |
| WebSocket pub-only | 0.81x | Good |
### Key Observations
1. **Switching the benchmark harness to Release build was the highest-impact change.** Durable fetch jumped from 0.42x to 0.92x (468K vs 510K msg/s). Ordered consumer improved from 0.57x to 0.62x. Request-reply 10Cx2S reached parity at 1.01x. Large-payload pub/sub now exceeds Go at 1.15x.
2. **Small-payload 1:1 pub/sub remains a strong .NET lead** at 2.47x (724K vs 293K msg/s).
3. **MQTT cross-protocol improved to 1.46x** (230K vs 158K msg/s), up from 1.20x; the Release JIT further benefits the delivery path.
4. **Fan-out improved from 0.63x to 0.70x, multi pub/sub from 0.65x to 0.69x** after Round 10 pre-formatted MSG headers. Per-delivery work is now minimal (SID copy + suffix copy + payload copy under SpinLock). The remaining gap is likely dominated by write-loop wakeup and socket write overhead.
5. **SubList Match microbenchmarks improved ~17%** (19.3M vs 16.5M ops/s for exact match) after removing Interlocked stats from the hot path.
6. **TLS pub-only dropped to 0.49x** this run, likely noise from co-running benchmarks contending on CPU. TLS pub/sub remains stable at 0.88x.

---
## Optimization History
### Round 10: Fan-Out Serial Path Optimization

Three optimizations making the serial fan-out path cheaper (fan-out 0.63x→0.70x, multi 0.65x→0.69x):

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 38 | **Per-delivery MSG header re-formatting** — `SendMessageNoFlush` independently formats the entire MSG header line (prefix, subject copy, replyTo encoding, size formatting, CRLF) for every subscriber — but only the SID varies per delivery | Pre-build prefix (`MSG subject `) and suffix (` [reply] sizes\r\n`) once per publish; new `SendMessagePreformatted` writes prefix+sid+suffix directly into `_directBuf` — zero encoding, pure memory copies | Eliminates per-delivery replyTo encoding, size formatting, prefix/subject copying |
| 39 | **Queue-group round-robin burns 2 Interlocked ops** — `Interlocked.Increment(ref OutMsgs)` + `Interlocked.Decrement(ref OutMsgs)` per queue group just to pick an index | Replaced with non-atomic `uint QueueRoundRobin++` — safe because ProcessMessage runs single-threaded per publisher connection (the read loop) | Eliminates 2 interlocked ops per queue group per publish |
| 40 | **`HashSet<INatsClient>` pcd overhead** — hash computation + bucket lookup per Add for small fan-out counts (4 subscribers) | Replaced with `[ThreadStatic] INatsClient[]` + linear scan; O(n) but n≤16, faster than hash for small counts | Eliminates hash computation and internal array overhead |
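
Row 38 can be pictured with a small sketch. This is not the server's `SendMessagePreformatted`: the class name, method names, and the caller-supplied `dest` buffer are assumptions made for illustration. It relies only on the NATS `MSG <subject> <sid> [reply-to] <#bytes>` header layout, where the SID is the one field that differs between subscribers.

```csharp
using System;
using System.Text;

// Illustrative only: names and buffer handling are assumptions, not the server's code.
public static class PreformattedMsg
{
    // Built once per publish: "MSG <subject> ".
    public static byte[] BuildPrefix(string subject)
        => Encoding.ASCII.GetBytes("MSG " + subject + " ");

    // Built once per publish: " [reply] <size>\r\n".
    public static byte[] BuildSuffix(string? replyTo, int payloadLength)
        => Encoding.ASCII.GetBytes(
            (replyTo is null ? " " : " " + replyTo + " ") + payloadLength + "\r\n");

    // Per delivery: three span copies for the header, then payload and trailing CRLF.
    // Returns the number of bytes written; the caller must size dest appropriately.
    public static int Write(Span<byte> dest, ReadOnlySpan<byte> prefix,
        ReadOnlySpan<byte> sid, ReadOnlySpan<byte> suffix, ReadOnlySpan<byte> payload)
    {
        int pos = 0;
        prefix.CopyTo(dest.Slice(pos)); pos += prefix.Length;
        sid.CopyTo(dest.Slice(pos)); pos += sid.Length;
        suffix.CopyTo(dest.Slice(pos)); pos += suffix.Length;
        payload.CopyTo(dest.Slice(pos)); pos += payload.Length;
        dest[pos++] = (byte)'\r';
        dest[pos++] = (byte)'\n';
        return pos;
    }
}
```

With subject `foo`, SID `7`, no reply subject, and payload `hello`, `Write` produces `MSG foo 7 5\r\nhello\r\n`. In a fan-out loop only the `sid` argument would change between calls.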
### Round 9: Fan-Out & Multi Pub/Sub Hot-Path Optimization

Seven optimizations targeting the per-delivery hot path and benchmark harness configuration:

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 31 | **Benchmark harness built server in Debug** — `DotNetServerProcess.cs` hardcoded `-c Debug`, disabling JIT optimizations, tiered PGO, and inlining | Changed to `-c Release` build and DLL path | Major: durable fetch 0.42x→0.92x, request-reply to parity |
| 32 | **Per-delivery Interlocked on server-wide stats** — `SendMessageNoFlush` did 2 `Interlocked` ops per delivery; fan-out 4 subs = 8 interlocked ops per publish | Moved server-wide stats to batch `Interlocked.Add` once after fan-out loop in `ProcessMessage` | Eliminates N×2 interlocked ops per publish |
| 33 | **Auto-unsub tracking on every delivery** — `Interlocked.Increment(ref sub.MessageCount)` on every delivery even when `MaxMessages == 0` (no limit — the common case) | Guarded with `if (sub.MaxMessages > 0)` | Eliminates 1 interlocked op per delivery in common case |
| 34 | **Per-delivery SID ASCII encoding** — `Encoding.ASCII.GetBytes(sid)` on every delivery; SID is a small integer that never changes | Added `Subscription.SidBytes` cached property; new `SendMessageNoFlush` overload accepts `ReadOnlySpan<byte>` | Eliminates per-delivery encoding |
| 35 | **Per-delivery subject ASCII encoding** — `Encoding.ASCII.GetBytes(subject)` for each subscriber; fan-out 4 = 4× encoding same subject | Pre-encode subject once in `ProcessMessage` before fan-out loop; new overload uses span copy | Eliminates N-1 subject encodings per publish |
| 36 | **Per-publish subject string allocation** — `Encoding.ASCII.GetString(cmd.Subject.Span)` on every PUB even when publishing to the same subject repeatedly | Added 1-element string cache per client; reuses string when subject bytes match | Eliminates string alloc for repeated subjects |
| 37 | **Interlocked stats in SubList.Match hot path** — `Interlocked.Increment(ref _matches)` and `_cacheHits` on every match call | Replaced with non-atomic increments (approximate counters for monitoring) | Eliminates 1-2 interlocked ops per match |
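
The batching idea in row 32 could look roughly like the following. It is a sketch under assumptions: `ServerStats`, `FanOut`, and the choice of a message counter plus a byte counter are hypothetical stand-ins for whatever the server actually tracks.

```csharp
using System.Threading;

// Hypothetical server-wide counters shared by all connections.
public sealed class ServerStats
{
    private long _outMsgs;
    private long _outBytes;

    public long OutMsgs => Interlocked.Read(ref _outMsgs);
    public long OutBytes => Interlocked.Read(ref _outBytes);

    // Two atomic adds per publish, regardless of how many subscribers were reached.
    public void AddBatch(int deliveries, long bytes)
    {
        Interlocked.Add(ref _outMsgs, deliveries);
        Interlocked.Add(ref _outBytes, bytes);
    }
}

public static class FanOut
{
    // Accumulates counts in plain locals during the loop, then publishes them once.
    public static void Publish(ServerStats stats, int subscriberCount, int payloadLength)
    {
        int delivered = 0;
        long bytes = 0;
        for (int i = 0; i < subscriberCount; i++)
        {
            // ... the per-subscriber buffer write would happen here ...
            delivered++;
            bytes += payloadLength;
        }
        stats.AddBatch(delivered, bytes);
    }
}
```

For a 1:4 fan-out this turns eight interlocked operations per publish into two, which is the N×2 reduction the table describes.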
### Round 8: Ordered Consumer + Cross-Protocol Optimization

Three optimizations targeting pull consumer delivery and MQTT cross-protocol throughput:

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 28 | **Per-message flush signal in DeliverPullFetchMessagesAsync** — `DeliverMessage` called `SendMessage` which triggered `_flushSignal.Writer.TryWrite(0)` per message; for batch of N messages, N flush signals and write-loop wakeups | Replaced with `SendMessageNoFlush` + batch flush every 64 messages + final flush after loop; bypasses `DeliverMessage` entirely (no permission check / auto-unsub needed for JS delivery inbox) | Reduces flush signals from N to N/64 per batch |
| 29 | **5ms polling delay in pull consumer wait loop** — `Task.Delay(5)` in `DeliverPullFetchMessagesAsync` and `PullConsumerEngine.WaitForMessageAsync` added up to 5ms latency per empty slot; for tail-following consumers, every new message waited up to 5ms to be noticed | Added `StreamHandle.NotifyPublish()` / `WaitForPublishAsync()` using `TaskCompletionSource` signaling; publishers call `NotifyPublish` after `AppendAsync`; consumers wait on signal with heartbeat-interval timeout | Eliminates polling delay; instant wakeup on publish |
| 30 | **StringBuilder allocation in NatsToMqtt for common case** — every uncached `NatsToMqtt` call allocated a StringBuilder even when no `_DOT_` escape sequences were present (the common case) | Added `string.Create` fast path that uses char replacement lambda when no `_DOT_` found; pre-warm topic bytes cache on MQTT subscription creation | Eliminates StringBuilder + string alloc for common case; no cache miss on first delivery |
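
The signaling scheme in row 29 replaces a polling delay with a wakeup. A minimal sketch of that pattern follows, assuming a standalone `PublishSignal` class (hypothetical; the table places these methods on `StreamHandle`) and a timeout supplied by the caller.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of a publish signal; the real StreamHandle may differ.
public sealed class PublishSignal
{
    private TaskCompletionSource _tcs =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    // Publisher side: swap in a fresh source, then complete the old one so that
    // every consumer currently waiting on it wakes up.
    public void NotifyPublish()
    {
        var previous = Interlocked.Exchange(ref _tcs,
            new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously));
        previous.TrySetResult();
    }

    // Consumer side: returns true if a publish arrived before the timeout.
    public async Task<bool> WaitForPublishAsync(TimeSpan timeout)
    {
        var signal = Volatile.Read(ref _tcs).Task;
        var finished = await Task.WhenAny(signal, Task.Delay(timeout)).ConfigureAwait(false);
        return finished == signal;
    }
}
```

One caveat of this simplified version: a consumer that checks the store and then calls `WaitForPublishAsync` can miss a publish that lands in between, so the timeout (the heartbeat interval in the table's description) bounds the worst-case delay.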
### Round 7: MQTT Cross-Protocol Write Path

Four optimizations targeting the NATS→MQTT delivery hot path (cross-protocol throughput improved from 0.30x to 0.78x):

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 24 | **Per-message async fire-and-forget in MqttNatsClientAdapter** — each `SendMessage` called `SendBinaryPublishAsync` which acquired a `SemaphoreSlim`, allocated a full PUBLISH packet `byte[]`, wrote, and flushed the stream — all per message, bypassing the server's deferred-flush batching | Replaced with synchronous `EnqueuePublishNoFlush()` that formats MQTT PUBLISH directly into `_directBuf` under SpinLock, matching the NatsClient pattern; `SignalFlush()` signals the write loop for batch flush | Eliminates async Task + SemaphoreSlim + per-message flush |
| 25 | **Per-message `byte[]` allocation for MQTT PUBLISH packets** — `MqttPacketWriter.WritePublish()` allocated topic bytes, variable header, remaining-length array, and full packet array on every delivery | Added `WritePublishTo(Span<byte>)` that formats the entire PUBLISH packet directly into the destination span using `Span<byte>` operations — zero heap allocation | Eliminates 4+ `byte[]` allocs per delivery |
| 26 | **Per-message NATS→MQTT topic translation** — `NatsToMqtt()` allocated a `StringBuilder`, produced a `string`, then `Encoding.UTF8.GetBytes()` re-encoded it on every delivery | Added `NatsToMqttBytes()` with bounded `ConcurrentDictionary<string, byte[]>` cache (4096 entries); cached result includes pre-encoded UTF-8 bytes | Eliminates string + encoding alloc per delivery for cached topics |
| 27 | **Per-message `FlushAsync` on plain TCP sockets** — `WriteBinaryAsync` flushed after every packet write, even on `NetworkStream` where TCP auto-flushes | Write loop skips `FlushAsync` for plain sockets; for TLS/wrapped streams, flushes once per batch (not per message) | Reduces syscalls from 2N to 1 per batch |
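
To make row 25 concrete, here is a sketch of formatting an MQTT PUBLISH packet straight into a span. It assumes QoS 0 (so no packet identifier), a topic that is already UTF-8 encoded and shorter than 65,536 bytes, and a destination the caller has sized correctly. The signature is hypothetical and only borrows the `WritePublishTo` name from the table.

```csharp
using System;
using System.Buffers.Binary;

// Illustrative QoS 0 encoder; not the server's MqttPacketWriter.
public static class MqttPublishSketch
{
    // Writes fixed header, remaining length, topic, and payload into dest.
    // Returns the number of bytes written. No heap allocation takes place.
    public static int WritePublishTo(Span<byte> dest,
        ReadOnlySpan<byte> topicUtf8, ReadOnlySpan<byte> payload)
    {
        int remaining = 2 + topicUtf8.Length + payload.Length;
        int pos = 0;
        dest[pos++] = 0x30; // PUBLISH, QoS 0, DUP and RETAIN clear

        // Remaining Length: 7 bits per byte, high bit set while more bytes follow.
        do
        {
            byte b = (byte)(remaining % 128);
            remaining /= 128;
            if (remaining > 0) b |= 0x80;
            dest[pos++] = b;
        } while (remaining > 0);

        // Topic is length-prefixed with a big-endian 16-bit value.
        BinaryPrimitives.WriteUInt16BigEndian(dest.Slice(pos), (ushort)topicUtf8.Length);
        pos += 2;
        topicUtf8.CopyTo(dest.Slice(pos)); pos += topicUtf8.Length;
        payload.CopyTo(dest.Slice(pos)); pos += payload.Length;
        return pos;
    }
}
```

For topic `a/b` and payload `hi` this writes nine bytes: `30 07 00 03 61 2F 62 68 69`.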
### Round 6: Batch Flush Signaling + Fetch Optimizations

Four optimizations targeting fan-out and consumer fetch hot paths:

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 20 | **Per-subscriber flush signal in fan-out** — each `SendMessage` called `_flushSignal.Writer.TryWrite(0)` independently; for 1:4 fan-out, 4 channel writes + 4 write-loop wakeups per published message | Split `SendMessage` into `SendMessageNoFlush` + `SignalFlush`; `ProcessMessage` collects unique clients in `[ThreadStatic] HashSet<INatsClient>` (Go's `pcd` pattern), one flush signal per unique client after fan-out | Reduces channel writes from N to unique-client-count per publish |
| 21 | **Per-fetch `CompiledFilter` allocation** — `CompiledFilter.FromConfig(consumer.Config)` called on every fetch request, allocating a new filter object each time | Cached `CompiledFilter` on `ConsumerHandle` with staleness detection (reference + value check on filter config fields); reused across fetches | Eliminates per-fetch filter allocation |
| 22 | **Per-message string interpolation in ack reply** — `$"$JS.ACK.{stream}.{consumer}.1.{seq}.{deliverySeq}.{ts}.{pending}"` allocated intermediate strings and boxed numeric types on every delivery | Pre-compute `$"$JS.ACK.{stream}.{consumer}.1."` prefix before loop; use `stackalloc char[]` + `TryFormat` for numeric suffix — zero intermediate allocations | Eliminates 4+ string allocs per delivered message |
| 23 | **Per-fetch `List<StoredMessage>` allocation** — `new List<StoredMessage>(batch)` allocated on every `FetchAsync` call | `[ThreadStatic]` reusable list with `.Clear()` + capacity growth; `PullFetchBatch` snapshots via `.ToArray()` for safe handoff | Eliminates per-fetch list allocation |
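
Row 22's ack-reply formatting can be sketched as below. The class name, parameter names, and the final `string.Concat` are assumptions; the server may well write the result into a byte buffer instead. The sketch shows the numeric suffix being formatted into stack memory with `TryFormat`, so the only allocation is the final string.

```csharp
using System;

// Illustrative only; assumes the prefix "$JS.ACK.<stream>.<consumer>.1." was
// computed once before the delivery loop.
public static class AckReplySketch
{
    public static string Build(string prefix, ulong streamSeq, ulong deliverySeq,
        long timestampNanos, ulong pending)
    {
        // Four numbers of at most 20 characters each plus three dots fit in 96 chars.
        Span<char> buf = stackalloc char[96];
        int pos = 0, n;

        streamSeq.TryFormat(buf.Slice(pos), out n); pos += n;
        buf[pos++] = '.';
        deliverySeq.TryFormat(buf.Slice(pos), out n); pos += n;
        buf[pos++] = '.';
        timestampNanos.TryFormat(buf.Slice(pos), out n); pos += n;
        buf[pos++] = '.';
        pending.TryFormat(buf.Slice(pos), out n); pos += n;

        // Single allocation: prefix and stack-formatted suffix joined into one string.
        ReadOnlySpan<char> suffix = buf.Slice(0, pos);
        return string.Concat(prefix.AsSpan(), suffix);
    }
}
```

Compared with string interpolation, this avoids boxing the numeric arguments and creating intermediate strings for each segment.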
### Round 5: Non-blocking ConsumeAsync (ordered + durable consumers)

One root cause was identified and fixed in the MSG.NEXT request handling path:

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 19 | **Synchronous blocking in DeliverPullFetchMessages** — `FetchAsync(...).GetAwaiter().GetResult()` blocked the client's read loop for the full `expires` timeout (30s). With `batch=1000` and only 5 messages available, the fetch polled for message 6 indefinitely. No messages were delivered until the timeout fired, causing the client to receive 0 messages before its own timeout. | Split into two paths: `noWait`/no-expires uses synchronous fetch (existing behavior for `FetchAsync` client); `expires > 0` spawns `DeliverPullFetchMessagesAsync` background task that delivers messages incrementally without blocking the read loop, with idle heartbeat support | Enables `ConsumeAsync` for both ordered and durable consumers; ordered consumer: 99K msg/s (0.64x Go) |
### Round 4: Per-Client Direct Write Buffer (pub/sub + fan-out + multi pub/sub)

Four optimizations were implemented in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 15 | **Per-message channel overhead** — each `SendMessage` call went through `Channel<OutboundData>.TryWrite`, incurring lock contention and memory barriers | Replaced channel-based message delivery with per-client `_directBuf` byte array under `SpinLock`; messages written directly to contiguous buffer | Eliminates channel overhead per delivery |
| 16 | **Per-message heap allocation for MSG header** — `_outboundBufferPool.RentBuffer()` allocated a pooled `byte[]` for each MSG header | Replaced with `stackalloc byte[512]` — MSG header formatted entirely on the stack, then copied into `_directBuf` | Zero heap allocations per delivery |
| 17 | **Per-message socket write** — write loop issued one `SendAsync` per channel item, even with coalescing | Double-buffer swap: write loop swaps `_directBuf` ↔ `_writeBuf` under `SpinLock`, then writes the entire batch in a single `SendAsync`; zero allocation on swap | Single syscall per batch, zero-copy buffer reuse |
| 18 | **Separate wake channels** — `SendMessage` and `WriteProtocol` used different signaling paths | Unified on `_flushSignal` channel (bounded capacity 1, DropWrite); both paths signal the same channel, write loop drains both `_directBuf` and `_outbound` on each wake | Single wait point, no missed wakes |
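
Rows 15 and 17 describe a direct buffer guarded by a `SpinLock` and swapped with a second buffer by the write loop. The following is a minimal sketch of that double-buffer pattern. The `DirectBuffer` class, the fixed 64 KiB capacity, and the "return false when full" policy are assumptions; the field names echo the table, but this is not the server's implementation.

```csharp
using System;
using System.Threading;

// Illustrative double buffer: many producers append, one write loop swaps and drains.
public sealed class DirectBuffer
{
    private SpinLock _lock = new SpinLock(enableThreadOwnerTracking: false);
    private byte[] _directBuf = new byte[64 * 1024];
    private byte[] _writeBuf = new byte[64 * 1024];
    private int _directLen;

    // Producer side: copy bytes into the direct buffer under the lock.
    // Returns false when the data does not fit (a real server would grow or flush).
    public bool TryAppend(ReadOnlySpan<byte> data)
    {
        bool taken = false;
        try
        {
            _lock.Enter(ref taken);
            if (_directLen + data.Length > _directBuf.Length) return false;
            data.CopyTo(_directBuf.AsSpan(_directLen));
            _directLen += data.Length;
            return true;
        }
        finally
        {
            if (taken) _lock.Exit();
        }
    }

    // Write-loop side: swap the two arrays under the lock and hand back the filled
    // batch. The returned memory is valid until the next Swap call.
    public ReadOnlyMemory<byte> Swap()
    {
        bool taken = false;
        int len;
        try
        {
            _lock.Enter(ref taken);
            (_directBuf, _writeBuf) = (_writeBuf, _directBuf);
            len = _directLen;
            _directLen = 0;
        }
        finally
        {
            if (taken) _lock.Exit();
        }
        return new ReadOnlyMemory<byte>(_writeBuf, 0, len);
    }
}
```

The lock is held only for a copy or a reference swap, so the socket write on the returned batch happens outside it. This sketch assumes a single write-loop thread calls `Swap`.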
### Round 3: Outbound Write Path (pub/sub + fan-out + fetch)

Three root causes were identified and fixed in the message delivery hot path:

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 12 | **Per-message `.ToArray()` allocation in SendMessage** — `owner.Memory[..pos].ToArray()` created a new `byte[]` for every MSG delivered to every subscriber | Replaced `IMemoryOwner` rent/copy/dispose with direct `byte[]` from pool; write loop returns buffers after writing | Eliminates 1 heap alloc per delivery (4 per fan-out message) |
| 13 | **Per-message `WriteAsync` in write loop** — each queued message triggered a separate `_stream.WriteAsync()` system call | Added 64KB coalesce buffer; drain all pending messages into contiguous buffer, single `WriteAsync` per batch | Reduces syscalls from N to 1 per batch |
| 14 | **Profiling `Stopwatch` on every message** — `Stopwatch.StartNew()` ran unconditionally in `ProcessMessage` and `StreamManager.Capture` even for non-JetStream messages | Removed profiling instrumentation from hot path | Eliminates ~200ns overhead per message |
### Round 2: FileStore AppendAsync Hot Path

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 6 | **Async state machine overhead** — `AppendAsync` was `async ValueTask<ulong>` but never actually awaited | Changed to synchronous `ValueTask<ulong>` returning `ValueTask.FromResult(_last)` | Eliminates Task state machine allocation |
| 7 | **Double payload copy** — `TransformForPersist` allocated `byte[]` then `payload.ToArray()` created second copy for `StoredMessage` | Reuse `TransformForPersist` result directly for `StoredMessage.Payload` when no transform needed (`_noTransform` flag) | Eliminates 1 `byte[]` alloc per message |
| 8 | **Unnecessary TTL work per publish** — `ExpireFromWheel()` and `RegisterTtl()` called on every write even when `MaxAge=0` | Guarded both with `_options.MaxAgeMs > 0` check (matches Go: `filestore.go:4701`) | Eliminates hash wheel overhead when TTL not configured |
| 9 | **Per-message MsgBlock cache allocation** — `WriteAt` created `new MessageRecord` for `_cache` on every write | Removed eager cache population; reads now decode from pending buffer or disk | Eliminates 1 object alloc per message |
| 10 | **Contiguous write buffer** — `MsgBlock._pendingWrites` was `List<byte[]>` with per-message `byte[]` allocations | Replaced with single contiguous `_pendingBuf` byte array; `MessageRecord.EncodeTo` writes directly into it | Eliminates per-message `byte[]` encoding alloc; single `RandomAccess.Write` per flush |
| 11 | **Pending buffer read path** — `MsgBlock.Read()` flushed pending writes to disk before reading | Added in-memory read from `_pendingBuf` when data is still in the buffer | Avoids unnecessary disk flush on read-after-write |
### Round 1: FileStore/StreamManager Layer

| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 1 | **Per-message synchronous disk I/O** — `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in MsgBlock + background flush loop in FileStore (Go's `flushLoop` pattern: coalesce 16KB or 8ms) | Eliminates per-message syscall overhead |
| 2 | **O(n) `GetStateAsync` per publish** — `_messages.Keys.Min()` and `_messages.Values.Sum()` on every publish for MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1) | Eliminates O(n) scan per publish |
| 3 | **Unnecessary `LoadAsync` after every append** — `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates redundant disk read per publish |
| 4 | **Redundant `PruneExpiredMessages` per publish** — called before every publish even when `MaxAge=0`, and again inside `EnforceRuntimePolicies` | Guarded with `MaxAgeMs > 0` check; removed the pre-publish call (background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | **`PrunePerSubject` loading all messages per publish**: `EnforceRuntimePolicies` → `PrunePerSubject` called `ListAsync().GroupBy()` even when `MaxMsgsPer=0` | Guarded with `MaxMsgsPer > 0` check | Eliminates O(n) scan per publish |

Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.

### What would further close the gap

| Change | Expected Impact | Go Reference |
|--------|----------------|-------------|
| **Write-loop / socket write overhead** | The per-delivery serial path is now minimal (SID copy + memcpy under SpinLock). The remaining 0.70x fan-out gap is likely write-loop wakeup latency and socket write syscall overhead | Go: `flushOutbound` uses `net.Buffers.WriteTo` → `writev()` with zero-copy buffer management |
| **Eliminate per-message GC allocations in FileStore** | ~30% improvement on FileStore AppendAsync — replace `StoredMessage` class with `StoredMessageMeta` struct in `_messages` dict, reconstruct full message from MsgBlock on read | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs; 80+ sites in FileStore.cs need migration |
| **Single publisher throughput** | 0.62x–0.74x gap; the pub-only path has no fan-out overhead — likely JIT/GC/socket write overhead in the ingest path | Go: client.go readLoop with zero-copy buffer management |