perf: optimize MQTT cross-protocol path (0.30x → 0.78x Go)

Replace per-message async fire-and-forget with direct-buffer write loop
mirroring NatsClient pattern: SpinLock-guarded buffer append, double-
buffer swap, single WriteAsync per batch.

- MqttConnection: add _directBuf/_writeBuf + RunMqttWriteLoopAsync
- MqttConnection: add EnqueuePublishNoFlush (zero-alloc PUBLISH format)
- MqttPacketWriter: add WritePublishTo(Span<byte>) + MeasurePublish
- MqttTopicMapper: add NatsToMqttBytes with bounded ConcurrentDictionary
- MqttNatsClientAdapter: synchronous SendMessageNoFlush + SignalFlush
- Skip FlushAsync on plain TCP sockets (NetworkStream flush is a no-op; writes go straight to the socket)
Joseph Doherty
2026-03-13 14:25:13 -04:00
parent 699449da6a
commit 11e01b9026
14 changed files with 1113 additions and 10 deletions


@@ -1,6 +1,6 @@
# Go vs .NET NATS Server — Benchmark Comparison
-Benchmark run: 2026-03-13 11:41 AM America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly.
+Benchmark run: 2026-03-13 12:08 PM America/Indiana/Indianapolis. Both servers ran on the same machine using the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`). Test parallelization remained disabled inside the benchmark assembly.
**Environment:** Apple M4, .NET SDK 10.0.101, benchmark README command run in the benchmark project's default `Debug` configuration, Go toolchain installed, Go reference server built from `golang/nats-server/`.
@@ -75,6 +75,37 @@ Benchmark run: 2026-03-13 11:41 AM America/Indiana/Indianapolis. Both servers ra
---
## MQTT Throughput
| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|-----------|----------|---------|------------|-----------|-----------------|
| MQTT PubSub (128B, QoS 0) | 34,224 | 4.2 | 44,142 | 5.4 | **1.29x** |
| Cross-Protocol NATS→MQTT (128B) | 118,322 | 14.4 | 92,485 | 11.3 | 0.78x |
> **Note:** Pure MQTT pub/sub remains above Go at 1.29x. Cross-protocol NATS→MQTT improved from 0.30x to 0.78x after adopting direct-buffer write loop + zero-alloc PUBLISH formatting + topic cache (matching the NatsClient batching pattern). The remaining gap is likely due to Go's writev() scatter-gather and goroutine-level parallelism in message routing.
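The direct-buffer write loop mentioned in the note can be sketched as follows. This is a minimal Python model of the pattern only (lock-guarded append, double-buffer swap, one write per batch); names such as `enqueue_no_flush` and `signal_flush` echo the commit message, but the real implementation is C# with a SpinLock and `WriteAsync`, so everything here is illustrative.

```python
import threading

class BatchWriter:
    """Sketch of the direct-buffer write pattern: producers append
    pre-formatted frames under a lock, and a dedicated write loop swaps
    in a spare buffer and issues a single write per batch."""

    def __init__(self, transport_write):
        self._lock = threading.Lock()       # stands in for the C# SpinLock
        self._active = bytearray()          # _directBuf analogue
        self._spare = bytearray()           # _writeBuf analogue
        self._flush_pending = threading.Event()
        self._write = transport_write       # e.g. socket.sendall

    def enqueue_no_flush(self, frame: bytes) -> None:
        # Append without flushing; cheap enough for the delivery hot path.
        with self._lock:
            self._active += frame

    def signal_flush(self) -> None:
        # Wake the write loop; repeated signals coalesce into one flush.
        self._flush_pending.set()

    def flush_once(self) -> int:
        # Swap buffers under the lock, then write the whole batch outside
        # it, so producers never block behind the transport write.
        self._flush_pending.clear()
        with self._lock:
            self._active, self._spare = self._spare, self._active
            batch = self._spare
        if batch:
            self._write(bytes(batch))
        n = len(batch)
        batch.clear()
        return n
```

In the server, one such loop runs per connection and waits on the flush signal; the key property is that N enqueued messages produce one transport write, not N.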
---
## Transport Overhead
### TLS
| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|-----------|----------|---------|------------|-----------|-----------------|
| TLS PubSub 1:1 (128B) | 289,548 | 35.3 | 251,935 | 30.8 | 0.87x |
| TLS Pub-Only (128B) | 1,782,442 | 217.6 | 1,163,021 | 142.0 | 0.65x |
### WebSocket
| Benchmark | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|-----------|----------|---------|------------|-----------|-----------------|
| WS PubSub 1:1 (128B) | 66,584 | 8.1 | 73,023 | 8.9 | **1.10x** |
| WS Pub-Only (128B) | 106,302 | 13.0 | 88,682 | 10.8 | 0.83x |
> **Note:** TLS pub/sub is close to parity at 0.87x. WebSocket pub/sub slightly favors .NET at 1.10x. Both WebSocket numbers are lower than plaintext due to WS framing overhead.
---
## Hot Path Microbenchmarks (.NET only)
### SubList
@@ -119,6 +150,12 @@ Benchmark run: 2026-03-13 11:41 AM America/Indiana/Indianapolis. Both servers ra
| JetStream async file publish | 0.39x | Improved versus older snapshots, still storage-bound |
| JetStream ordered consume | 0.18x | Highest-priority JetStream gap |
| JetStream durable fetch | 0.56x | Regressed from prior snapshot |
| MQTT pub/sub | **1.29x** | .NET outperforms Go |
| MQTT cross-protocol | 0.78x | Improved from 0.30x via direct-buffer write loop |
| TLS pub/sub | 0.87x | Close to parity |
| TLS pub-only | 0.65x | Encryption throughput gap |
| WebSocket pub/sub | **1.10x** | .NET slightly ahead |
| WebSocket pub-only | 0.83x | Modest gap, consistent with the plaintext publish path |
### Key Observations
@@ -128,11 +165,26 @@ Benchmark run: 2026-03-13 11:41 AM America/Indiana/Indianapolis. Both servers ra
4. **Ordered consumer throughput remains the largest JetStream gap at 0.18x** (102K vs 573K msg/s). That is better than the intermediate 0.11x run, but it is still the clearest post-FileStore optimization target.
5. **Durable fetch regressed to 0.56x in the final run**, which keeps consumer delivery and storage-read coordination in the top tier of remaining work even after the FileStore changes.
6. **Parser and SubList microbenchmarks remain stable and low-allocation**. The storage and consumer layers continue to dominate the server-level benchmark gaps, not the parser or subject matcher hot paths.
7. **Pure MQTT pub/sub shows .NET outperforming Go at 1.29x** (44K vs 34K msg/s). The .NET MQTT protocol bridge is competitive for direct MQTT-to-MQTT messaging.
8. **MQTT cross-protocol routing (NATS→MQTT) improved from 0.30x to 0.78x** (92K vs 118K msg/s) after adopting the same direct-buffer write loop pattern used by NatsClient: SpinLock-guarded buffer append, double-buffer swap, single write per batch, plus zero-alloc MQTT PUBLISH formatting and cached topic-to-bytes translation.
9. **TLS pub/sub is close to parity at 0.87x** (252K vs 290K msg/s). TLS pub-only is 0.65x (1.16M vs 1.78M msg/s), consistent with the general publish-path gap seen in plaintext benchmarks.
10. **WebSocket pub/sub slightly favors .NET at 1.10x** (73K vs 67K msg/s). WebSocket pub-only is 0.83x (89K vs 106K msg/s). Both servers show similar WS framing overhead relative to their plaintext performance.
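Observation 8 mentions cached topic-to-bytes translation. A minimal Python sketch of that idea, assuming the usual NATS-dot to MQTT-slash token mapping and a simple insert-until-full size bound — the function names echo `NatsToMqttBytes`, but the mapping details and eviction policy of the actual `MqttTopicMapper` are not reproduced here:

```python
def nats_to_mqtt(subject: str) -> str:
    """Translate a concrete NATS subject to an MQTT topic.

    Assumes the usual token mapping: NATS separates tokens with '.',
    MQTT with '/'. Wildcards do not occur in published subjects."""
    return subject.replace(".", "/")

_MAX_CACHE = 4096          # bound mirrors the 4096-entry cache in Round 7
_topic_cache: dict = {}    # subject -> pre-encoded UTF-8 topic bytes

def nats_to_mqtt_bytes(subject: str) -> bytes:
    """Cached translation that also pre-encodes the topic to UTF-8, so the
    hot path pays neither the string build nor the re-encode per delivery."""
    cached = _topic_cache.get(subject)
    if cached is not None:
        return cached
    encoded = nats_to_mqtt(subject).encode("utf-8")
    if len(_topic_cache) < _MAX_CACHE:  # simple bound: stop inserting when full
        _topic_cache[subject] = encoded
    return encoded
```

Because delivered subjects are typically a small, stable set, the cache hit rate is high and the per-delivery cost drops to a dictionary lookup.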
---
## Optimization History
### Round 7: MQTT Cross-Protocol Write Path
Four optimizations targeting the NATS→MQTT delivery hot path (cross-protocol throughput improved from 0.30x to 0.78x):
| # | Root Cause | Fix | Impact |
|---|-----------|-----|--------|
| 24 | **Per-message async fire-and-forget in MqttNatsClientAdapter** — each `SendMessage` called `SendBinaryPublishAsync` which acquired a `SemaphoreSlim`, allocated a full PUBLISH packet `byte[]`, wrote, and flushed the stream — all per message, bypassing the server's deferred-flush batching | Replaced with synchronous `EnqueuePublishNoFlush()` that formats MQTT PUBLISH directly into `_directBuf` under SpinLock, matching the NatsClient pattern; `SignalFlush()` signals the write loop for batch flush | Eliminates async Task + SemaphoreSlim + per-message flush |
| 25 | **Per-message `byte[]` allocation for MQTT PUBLISH packets** — `MqttPacketWriter.WritePublish()` allocated topic bytes, variable header, remaining-length array, and full packet array on every delivery | Added `WritePublishTo(Span<byte>)` that formats the entire PUBLISH packet directly into the destination span using `Span<byte>` operations — zero heap allocation | Eliminates 4+ `byte[]` allocs per delivery |
| 26 | **Per-message NATS→MQTT topic translation** — `NatsToMqtt()` allocated a `StringBuilder`, produced a `string`, then `Encoding.UTF8.GetBytes()` re-encoded it on every delivery | Added `NatsToMqttBytes()` with bounded `ConcurrentDictionary<string, byte[]>` cache (4096 entries); cached result includes pre-encoded UTF-8 bytes | Eliminates string + encoding alloc per delivery for cached topics |
| 27 | **Per-message `FlushAsync` on plain TCP sockets** — `WriteBinaryAsync` flushed after every packet write, even on `NetworkStream`, where flush is a no-op because writes go straight to the socket | Write loop skips `FlushAsync` for plain sockets; for TLS/wrapped streams, it flushes once per batch (not per message) | Reduces syscalls from 2N to 1 per batch |
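Row 25's zero-allocation formatter writes the MQTT 3.1.1 QoS 0 PUBLISH wire format (one fixed-header byte, a remaining-length varint, a 2-byte topic length, topic, payload) straight into a caller-provided buffer. The Python sketch below illustrates that wire format; the function names echo `MeasurePublish`/`WritePublishTo` from the table, but this is not the C# implementation:

```python
def _remaining_length_size(n: int) -> int:
    """Size in bytes of MQTT's base-128 remaining-length varint."""
    size = 1
    while n >= 128:
        n //= 128
        size += 1
    return size

def measure_publish(topic: bytes, payload: bytes) -> int:
    """Total size of a QoS 0 PUBLISH, so the caller can reserve buffer
    space before formatting (MeasurePublish analogue)."""
    remaining = 2 + len(topic) + len(payload)  # topic length field + topic + payload
    return 1 + _remaining_length_size(remaining) + remaining

def write_publish_to(buf: bytearray, offset: int,
                     topic: bytes, payload: bytes) -> int:
    """Format a QoS 0 MQTT 3.1.1 PUBLISH directly into buf at offset and
    return the new offset; no intermediate arrays are built, which is the
    point of the span-based WritePublishTo."""
    remaining = 2 + len(topic) + len(payload)
    buf[offset] = 0x30                      # PUBLISH, DUP=0, QoS=0, RETAIN=0
    offset += 1
    while True:                             # remaining-length varint, 7 bits per byte
        byte, remaining = remaining % 128, remaining // 128
        buf[offset] = byte | 0x80 if remaining else byte
        offset += 1
        if not remaining:
            break
    buf[offset:offset + 2] = len(topic).to_bytes(2, "big")
    offset += 2
    buf[offset:offset + len(topic)] = topic
    offset += len(topic)
    buf[offset:offset + len(payload)] = payload
    return offset + len(payload)
```

At QoS 0 there is no packet identifier, so the variable header is just the topic; that is what makes a fully precomputable, single-pass format possible on the delivery path.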
### Round 6: Batch Flush Signaling + Fetch Optimizations
Four optimizations targeting fan-out and consumer fetch hot paths: