perf: optimize fan-out serial path — pre-formatted MSG headers, non-atomic RR, linear pcd

Three optimizations that make the serial fan-out path cheaper (fan-out 0.63x→0.70x, multi pub/sub 0.65x→0.69x):

1. Pre-format the MSG prefix ("MSG subject ") and suffix (" [reply] sizes\r\n") once per publish. The new SendMessagePreformatted writes prefix+sid+suffix directly into _directBuf: zero encoding, pure memory copies. Only the SID varies per delivery.
2. Replace the queue-group round-robin Interlocked.Increment/Decrement pair with a non-atomic uint QueueRoundRobin++. This is safe because ProcessMessage runs single-threaded per publisher connection (the read loop).
3. Replace the HashSet<INatsClient> pcd with a ThreadStatic INatsClient[] plus a linear scan. The scan is O(n), but n≤16, so it is faster than hashing for small fan-out counts.
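
The first optimization can be illustrated with a minimal sketch. This is not the server's actual code: `PreformattedMsgSketch`, `BuildPrefix`, `BuildSuffix`, and `WriteMessage` are hypothetical names, and a plain `Span<byte>` stands in for `_directBuf`. It assumes the standard NATS frame layout `MSG <subject> <sid> [reply-to] <#bytes>\r\n<payload>\r\n`, and it omits locking and buffer growth.

```csharp
#nullable enable
using System;
using System.Text;

// Hypothetical sketch; names are illustrative, not the server's real API.
public static class PreformattedMsgSketch
{
    // Built once per publish: "MSG <subject> ".
    public static byte[] BuildPrefix(string subject) =>
        Encoding.ASCII.GetBytes($"MSG {subject} ");

    // Built once per publish: " [reply-to ]<payload size>\r\n".
    public static byte[] BuildSuffix(string? replyTo, int payloadLength) =>
        Encoding.ASCII.GetBytes(replyTo is null
            ? $" {payloadLength}\r\n"
            : $" {replyTo} {payloadLength}\r\n");

    // Called once per subscriber: only memory copies, no text encoding.
    // Returns the number of bytes written into dest.
    public static int WriteMessage(
        Span<byte> dest,
        ReadOnlySpan<byte> prefix,
        ReadOnlySpan<byte> sid,      // pre-encoded subscription id
        ReadOnlySpan<byte> suffix,
        ReadOnlySpan<byte> payload)
    {
        int pos = 0;
        prefix.CopyTo(dest.Slice(pos));  pos += prefix.Length;
        sid.CopyTo(dest.Slice(pos));     pos += sid.Length;
        suffix.CopyTo(dest.Slice(pos));  pos += suffix.Length;
        payload.CopyTo(dest.Slice(pos)); pos += payload.Length;
        dest[pos++] = (byte)'\r';        // frame terminator after the payload
        dest[pos++] = (byte)'\n';
        return pos;
    }
}
```

In this sketch, a publish to subject `foo` with payload `hi` and no reply subject builds the prefix and suffix once, then calls `WriteMessage` per subscriber with that subscriber's SID bytes. For SID `7` the resulting frame is `MSG foo 7 2\r\nhi\r\n`.
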
2026-03-13 16:23:18 -04:00
parent 23543b2ba8
commit 0e5ce4ed9b
3 changed files with 201 additions and 47 deletions
--- a/benchmarks_comparison.md
+++ b/benchmarks_comparison.md
@@ -28,15 +28,15 @@ Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the

 | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
 |---------|----------|---------|------------|-----------|-----------------|
-| 128 B | 2,945,790 | 359.6 | 1,848,130 | 225.6 | 0.63x |
+| 128 B | 2,945,790 | 359.6 | 2,063,771 | 251.9 | 0.70x |

-> **Note:** Fan-out numbers are within noise of prior round. The hot-path optimizations (batched stats, pre-encoded subject/SID bytes, auto-unsub guard) remove per-delivery overhead but the gap is now dominated by the serial fan-out loop itself.
+> **Note:** Fan-out improved from 0.63x to 0.70x after Round 10 introduced pre-formatted MSG headers, which eliminate per-delivery replyTo encoding, size formatting, and prefix/subject copying. Only the SID now varies per delivery.

 ### Multi-Publisher / Multi-Subscriber (4P x 4S)

 | Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
 |---------|----------|---------|------------|-----------|-----------------|
-| 128 B | 2,123,480 | 259.2 | 1,374,570 | 167.8 | 0.65x |
+| 128 B | 2,123,480 | 259.2 | 1,465,416 | 178.9 | 0.69x |

 ---

@@ -146,8 +146,8 @@ Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the
 | Pub-only throughput | 0.62x–0.74x | Improved with Release build |
 | Pub/sub (small payload) | **2.47x** | .NET outperforms Go decisively |
 | Pub/sub (large payload) | **1.15x** | .NET now exceeds parity |
-| Fan-out | 0.63x | Serial fan-out loop is bottleneck |
-| Multi pub/sub | 0.65x | Close to prior round |
+| Fan-out | 0.70x | Improved: pre-formatted MSG headers |
+| Multi pub/sub | 0.69x | Improved: same optimizations |
 | Request/reply latency | 0.89x–**1.01x** | Effectively at parity |
 | JetStream sync publish | 0.74x | Run-to-run variance |
 | JetStream async file publish | 0.41x | Storage-bound |
@@ -165,7 +165,7 @@ Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the
 1. **Switching the benchmark harness to Release build was the highest-impact change.** Durable fetch jumped from 0.42x to 0.92x (468K vs 510K msg/s). Ordered consumer improved from 0.57x to 0.62x. Request-reply 10Cx2S reached parity at 1.01x. Large-payload pub/sub now exceeds Go at 1.15x.
 2. **Small-payload 1:1 pub/sub remains a strong .NET lead** at 2.47x (724K vs 293K msg/s).
 3. **MQTT cross-protocol improved to 1.46x** (230K vs 158K msg/s), up from 1.20x — the Release JIT further benefits the delivery path.
-4. **Fan-out (0.63x) and multi pub/sub (0.65x) remain the largest gaps.** The hot-path optimizations (batched stats, pre-encoded SID/subject, auto-unsub guard) removed per-delivery overhead, but the remaining gap is dominated by the serial fan-out loop itself — Go parallelizes fan-out delivery across goroutines.
+4. **Fan-out improved from 0.63x to 0.70x and multi pub/sub from 0.65x to 0.69x** after Round 10 introduced pre-formatted MSG headers. Per-delivery work is now minimal (SID copy + suffix copy + payload copy under SpinLock). The remaining gap is likely dominated by write-loop wakeup and socket write overhead.
 5. **SubList Match microbenchmarks improved ~17%** (19.3M vs 16.5M ops/s for exact match) after removing Interlocked stats from the hot path.
 6. **TLS pub-only dropped to 0.49x** this run, likely noise from co-running benchmarks contending on CPU. TLS pub/sub remains stable at 0.88x.

@@ -173,6 +173,16 @@ Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the

 ## Optimization History

+### Round 10: Fan-Out Serial Path Optimization
+
+Three optimizations making the serial fan-out path cheaper (fan-out 0.63x→0.70x, multi 0.65x→0.69x):
+
+| # | Root Cause | Fix | Impact |
+|---|-----------|-----|--------|
+| 38 | **Per-delivery MSG header re-formatting** — `SendMessageNoFlush` independently formats the entire MSG header line (prefix, subject copy, replyTo encoding, size formatting, CRLF) for every subscriber — but only the SID varies per delivery | Pre-build prefix (`MSG subject `) and suffix (` [reply] sizes\r\n`) once per publish; new `SendMessagePreformatted` writes prefix+sid+suffix directly into `_directBuf` — zero encoding, pure memory copies | Eliminates per-delivery replyTo encoding, size formatting, prefix/subject copying |
+| 39 | **Queue-group round-robin burns 2 Interlocked ops** — `Interlocked.Increment(ref OutMsgs)` + `Interlocked.Decrement(ref OutMsgs)` per queue group just to pick an index | Replaced with non-atomic `uint QueueRoundRobin++` — safe because ProcessMessage runs single-threaded per publisher connection (the read loop) | Eliminates 2 interlocked ops per queue group per publish |
+| 40 | **`HashSet<INatsClient>` pcd overhead** — hash computation + bucket lookup per Add for small fan-out counts (4 subscribers) | Replaced with `[ThreadStatic] INatsClient[]` + linear scan; O(n) but n≤16, faster than hash for small counts | Eliminates hash computation and internal array overhead |
+
 ### Round 9: Fan-Out & Multi Pub/Sub Hot-Path Optimization

 Seven optimizations targeting the per-delivery hot path and benchmark harness configuration:
@@ -275,6 +285,6 @@ Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RA

 | Change | Expected Impact | Go Reference |
 |--------|----------------|-------------|
-| **Fan-out parallelism** | Deliver to subscribers concurrently instead of serially from publisher's read loop — this is now the primary bottleneck for the 0.63x fan-out gap | Go: `processMsgResults` fans out per-client via goroutines |
+| **Write-loop / socket write overhead** | The per-delivery serial path is now minimal (SID copy + memcpy under SpinLock). The remaining 0.70x fan-out gap is likely write-loop wakeup latency and socket write syscall overhead | Go: `flushOutbound` uses `net.Buffers.WriteTo` → `writev()` with zero-copy buffer management |
 | **Eliminate per-message GC allocations in FileStore** | ~30% improvement on FileStore AppendAsync — replace `StoredMessage` class with `StoredMessageMeta` struct in `_messages` dict, reconstruct full message from MsgBlock on read | Go stores in `cache.buf`/`cache.idx` with zero per-message allocs; 80+ sites in FileStore.cs need migration |
 | **Single publisher throughput** | 0.62x–0.74x gap; the pub-only path has no fan-out overhead — likely JIT/GC/socket write overhead in the ingest path | Go: client.go readLoop with zero-copy buffer management |