perf: optimize MQTT cross-protocol path (0.30x → 0.78x Go)

Replace per-message async fire-and-forget with direct-buffer write loop
mirroring NatsClient pattern: SpinLock-guarded buffer append, double-
buffer swap, single WriteAsync per batch.

- MqttConnection: add _directBuf/_writeBuf + RunMqttWriteLoopAsync
- MqttConnection: add EnqueuePublishNoFlush (zero-alloc PUBLISH format)
- MqttPacketWriter: add WritePublishTo(Span<byte>) + MeasurePublish
- MqttTopicMapper: add NatsToMqttBytes with bounded ConcurrentDictionary
- MqttNatsClientAdapter: synchronous SendMessageNoFlush + SignalFlush
- Skip FlushAsync on plain TCP sockets (NetworkStream flush is a no-op; writes go straight to the socket)
Joseph Doherty
2026-03-13 14:25:13 -04:00
parent 699449da6a
commit 11e01b9026
14 changed files with 1113 additions and 10 deletions


@@ -1,6 +1,6 @@
# Go vs .NET NATS Server — Benchmark Comparison
-Benchmark run: 2026-03-13 11:41 AM America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly.
+Benchmark run: 2026-03-13 12:08 PM America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly.
**Environment:** Apple M4, .NET SDK 10.0.101, benchmark README command run in the benchmark project's default `Debug` configuration, Go toolchain installed, Go reference server built from `golang/nats-server/`.
@@ -75,6 +75,37 @@ Benchmark run: 2026-03-13 11:41 AM America/Indiana/Indianapolis. Both servers ra
---
## MQTT Throughput
| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|-----------|----------|---------|------------|-----------|-----------------|
| MQTT PubSub (128B, QoS 0) | 34,224 | 4.2 | 44,142 | 5.4 | **1.29x** |
| Cross-Protocol NATS→MQTT (128B) | 118,322 | 14.4 | 92,485 | 11.3 | 0.78x |
> **Note:** Pure MQTT pub/sub remains above Go at 1.29x. Cross-protocol NATS→MQTT improved from 0.30x to 0.78x after adopting direct-buffer write loop + zero-alloc PUBLISH formatting + topic cache (matching the NatsClient batching pattern). The remaining gap is likely due to Go's writev() scatter-gather and goroutine-level parallelism in message routing.
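The direct-buffer write loop mentioned in the note can be sketched as follows. This is a minimal Python model of the pattern only (lock-guarded append, double-buffer swap, one write per batch); names such as `enqueue_no_flush` and `signal_flush` echo the commit message, but the real implementation is C# with a SpinLock and `WriteAsync`, so everything here is illustrative.

```python
import threading

class BatchWriter:
    """Sketch of the direct-buffer write pattern: producers append
    pre-formatted frames under a lock, and a dedicated write loop swaps
    in a spare buffer and issues a single write per batch."""

    def __init__(self, transport_write):
        self._lock = threading.Lock()       # stands in for the C# SpinLock
        self._active = bytearray()          # _directBuf analogue
        self._spare = bytearray()           # _writeBuf analogue
        self._flush_pending = threading.Event()
        self._write = transport_write       # e.g. socket.sendall

    def enqueue_no_flush(self, frame: bytes) -> None:
        # Append without flushing; cheap enough for the delivery hot path.
        with self._lock:
            self._active += frame

    def signal_flush(self) -> None:
        # Wake the write loop; repeated signals coalesce into one flush.
        self._flush_pending.set()

    def flush_once(self) -> int:
        # Swap buffers under the lock, then write the whole batch outside
        # it, so producers never block behind the transport write.
        self._flush_pending.clear()
        with self._lock:
            self._active, self._spare = self._spare, self._active
            batch = self._spare
        if batch:
            self._write(bytes(batch))
        n = len(batch)
        batch.clear()
        return n
```

In the server, one such loop runs per connection and waits on the flush signal; the key property is that N enqueued messages produce one transport write, not N.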
---
## Transport Overhead
### TLS
| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|-----------|----------|---------|------------|-----------|-----------------|
| TLS PubSub 1:1 (128B) | 289,548 | 35.3 | 251,935 | 30.8 | 0.87x |
| TLS Pub-Only (128B) | 1,782,442 | 217.6 | 1,163,021 | 142.0 | 0.65x |
### WebSocket
| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|-----------|----------|---------|------------|-----------|-----------------|
| WS PubSub 1:1 (128B) | 66,584 | 8.1 | 73,023 | 8.9 | **1.10x** |
| WS Pub-Only (128B) | 106,302 | 13.0 | 88,682 | 10.8 | 0.83x |
> **Note:** TLS pub/sub is close to parity at 0.87x. WebSocket pub/sub slightly favors .NET at 1.10x. Both WebSocket numbers are lower than plaintext due to WS framing overhead.
---
## Hot Path Microbenchmarks (.NET only)
### SubList
@@ -119,6 +150,12 @@ Benchmark run: 2026-03-13 11:41 AM America/Indiana/Indianapolis. Both servers ra
| JetStream async file publish | 0.39x | Improved versus older snapshots, still storage-bound |
| JetStream ordered consume | 0.18x | Highest-priority JetStream gap |
| JetStream durable fetch | 0.56x | Regressed from prior snapshot |
| MQTT pub/sub | **1.29x** | .NET outperforms Go |
| MQTT cross-protocol | 0.78x | Improved from 0.30x via direct-buffer write loop |
| TLS pub/sub | 0.87x | Close to parity |
| TLS pub-only | 0.65x | Encryption throughput gap |
| WebSocket pub/sub | **1.10x** | .NET slightly ahead |
| WebSocket pub-only | 0.83x | Modest gap, consistent with the plaintext publish path |
### Key Observations
@@ -128,11 +165,26 @@ Benchmark run: 2026-03-13 11:41 AM America/Indiana/Indianapolis. Both servers ra
4. **Ordered consumer throughput remains the largest JetStream gap at 0.18x** (102K vs 573K msg/s). That is better than the intermediate 0.11x run, but it is still the clearest post-FileStore optimization target.
5. **Durable fetch regressed to 0.56x in the final run**, which keeps consumer delivery and storage-read coordination in the top tier of remaining work even after the FileStore changes.
6. **Parser and SubList microbenchmarks remain stable and low-allocation**. The storage and consumer layers continue to dominate the server-level benchmark gaps, not the parser or subject matcher hot paths.
7. **Pure MQTT pub/sub shows .NET outperforming Go at 1.29x** (44K vs 34K msg/s). The .NET MQTT protocol bridge is competitive for direct MQTT-to-MQTT messaging.
8. **MQTT cross-protocol routing (NATS→MQTT) improved from 0.30x to 0.78x** (92K vs 118K msg/s) after adopting the same direct-buffer write loop pattern used by NatsClient: SpinLock-guarded buffer append, double-buffer swap, single write per batch, plus zero-alloc MQTT PUBLISH formatting and cached topic-to-bytes translation.
9. **TLS pub/sub is close to parity at 0.87x** (252K vs 290K msg/s). TLS pub-only is 0.65x (1.16M vs 1.78M msg/s), consistent with the general publish-path gap seen in plaintext benchmarks.
10. **WebSocket pub/sub slightly favors .NET at 1.10x** (73K vs 67K msg/s). WebSocket pub-only is 0.83x (89K vs 106K msg/s). Both servers show similar WS framing overhead relative to their plaintext performance.
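Observation 8 mentions cached topic-to-bytes translation. A minimal Python sketch of that idea, assuming the usual NATS-dot to MQTT-slash token mapping and a simple insert-until-full size bound — the function names echo `NatsToMqttBytes`, but the mapping details and eviction policy of the actual `MqttTopicMapper` are not reproduced here:

```python
def nats_to_mqtt(subject: str) -> str:
    """Translate a concrete NATS subject to an MQTT topic.

    Assumes the usual token mapping: NATS separates tokens with '.',
    MQTT with '/'. Wildcards do not occur in published subjects."""
    return subject.replace(".", "/")

_MAX_CACHE = 4096          # bound mirrors the 4096-entry cache in Round 7
_topic_cache: dict = {}    # subject -> pre-encoded UTF-8 topic bytes

def nats_to_mqtt_bytes(subject: str) -> bytes:
    """Cached translation that also pre-encodes the topic to UTF-8, so the
    hot path pays neither the string build nor the re-encode per delivery."""
    cached = _topic_cache.get(subject)
    if cached is not None:
        return cached
    encoded = nats_to_mqtt(subject).encode("utf-8")
    if len(_topic_cache) < _MAX_CACHE:  # simple bound: stop inserting when full
        _topic_cache[subject] = encoded
    return encoded
```

Because delivered subjects are typically a small, stable set, the cache hit rate is high and the per-delivery cost drops to a dictionary lookup.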
---
## Optimization History
### Round 7: MQTT Cross-Protocol Write Path
Four optimizations targeting the NATS→MQTT delivery hot path (cross-protocol throughput improved from 0.30x to 0.78x):
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 24 | **Per-message async fire-and-forget in MqttNatsClientAdapter** — each `SendMessage` called `SendBinaryPublishAsync` which acquired a `SemaphoreSlim`, allocated a full PUBLISH packet `byte[]`, wrote, and flushed the stream — all per message, bypassing the server's deferred-flush batching | Replaced with synchronous `EnqueuePublishNoFlush()` that formats MQTT PUBLISH directly into `_directBuf` under SpinLock, matching the NatsClient pattern; `SignalFlush()` signals the write loop for batch flush | Eliminates async Task + SemaphoreSlim + per-message flush |
| 25 | **Per-message `byte[]` allocation for MQTT PUBLISH packets** — `MqttPacketWriter.WritePublish()` allocated topic bytes, variable header, remaining-length array, and full packet array on every delivery | Added `WritePublishTo(Span<byte>)` that formats the entire PUBLISH packet directly into the destination span using `Span<byte>` operations — zero heap allocation | Eliminates 4+ `byte[]` allocs per delivery |
| 26 | **Per-message NATS→MQTT topic translation** — `NatsToMqtt()` allocated a `StringBuilder`, produced a `string`, then `Encoding.UTF8.GetBytes()` re-encoded it on every delivery | Added `NatsToMqttBytes()` with bounded `ConcurrentDictionary<string, byte[]>` cache (4096 entries); cached result includes pre-encoded UTF-8 bytes | Eliminates string + encoding alloc per delivery for cached topics |
| 27 | **Per-message `FlushAsync` on plain TCP sockets** — `WriteBinaryAsync` flushed after every packet write, even on `NetworkStream`, where flush is a no-op because writes go straight to the socket | Write loop skips `FlushAsync` for plain sockets; for TLS/wrapped streams, it flushes once per batch (not per message) | Reduces syscalls from 2N to 1 per batch |
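Row 25's zero-allocation formatter writes the MQTT 3.1.1 QoS 0 PUBLISH wire format (one fixed-header byte, a remaining-length varint, a 2-byte topic length, topic, payload) straight into a caller-provided buffer. The Python sketch below illustrates that wire format; the function names echo `MeasurePublish`/`WritePublishTo` from the table, but this is not the C# implementation:

```python
def _remaining_length_size(n: int) -> int:
    """Size in bytes of MQTT's base-128 remaining-length varint."""
    size = 1
    while n >= 128:
        n //= 128
        size += 1
    return size

def measure_publish(topic: bytes, payload: bytes) -> int:
    """Total size of a QoS 0 PUBLISH, so the caller can reserve buffer
    space before formatting (MeasurePublish analogue)."""
    remaining = 2 + len(topic) + len(payload)  # topic length field + topic + payload
    return 1 + _remaining_length_size(remaining) + remaining

def write_publish_to(buf: bytearray, offset: int,
                     topic: bytes, payload: bytes) -> int:
    """Format a QoS 0 MQTT 3.1.1 PUBLISH directly into buf at offset and
    return the new offset; no intermediate arrays are built, which is the
    point of the span-based WritePublishTo."""
    remaining = 2 + len(topic) + len(payload)
    buf[offset] = 0x30                      # PUBLISH, DUP=0, QoS=0, RETAIN=0
    offset += 1
    while True:                             # remaining-length varint, 7 bits per byte
        byte, remaining = remaining % 128, remaining // 128
        buf[offset] = byte | 0x80 if remaining else byte
        offset += 1
        if not remaining:
            break
    buf[offset:offset + 2] = len(topic).to_bytes(2, "big")
    offset += 2
    buf[offset:offset + len(topic)] = topic
    offset += len(topic)
    buf[offset:offset + len(payload)] = payload
    return offset + len(payload)
```

At QoS 0 there is no packet identifier, so the variable header is just the topic; that is what makes a fully precomputable, single-pass format possible on the delivery path.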
### Round 6: Batch Flush Signaling + Fetch Optimizations
Four optimizations targeting fan-out and consumer fetch hot paths: