Improve source XML docs and refresh profiling artifacts

This captures the iterative CommentChecker cleanup plus updated snapshot/report outputs used to validate and benchmark the latest JetStream and transport work.
This commit is contained in:
Joseph Doherty
2026-03-14 03:13:17 -04:00
parent 56c773dc71
commit ba0d65317a
76 changed files with 3058 additions and 29987 deletions


@@ -59,7 +59,7 @@ Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the
| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|---------|---------|----------|------------|-----------------|
| Synchronous | 16 B | Memory | 16,982 | 14,514 | 0.85x |
-| Async (batch) | 128 B | File | 211,355 | 58,334 | 0.28x |
+| Async (batch) | 128 B | File | 174,421 | 85,394 | 0.49x |
---
@@ -144,7 +144,7 @@ Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the
| Request/reply (single) | 0.89x | Close to parity |
| Request/reply (10Cx2S) | 0.86x | Close to parity |
| JetStream sync publish | 0.85x | Close to parity |
-| JetStream async file publish | 0.28x | Storage-bound |
+| JetStream async file publish | 0.49x | Improved after double-buffer + deferred fsync |
| JetStream ordered consume | 0.44x | Significant gap |
| JetStream durable fetch | 0.76x | Moderate gap |
| MQTT pub/sub | **1.32x** | .NET outperforms Go |
@@ -157,16 +157,26 @@ Benchmark run: 2026-03-13 America/Indiana/Indianapolis. Both servers ran on the
### Key Observations
1. **Multi pub/sub reached parity (1.01x)** after Round 10 pre-formatted MSG headers. Fan-out improved to 0.84x.
-2. **TLS pub/sub shows a dramatic .NET advantage (4.70x)** — .NET's `SslStream` has significantly lower overhead in the bidirectional pub/sub path. TLS pub-only (ingest only) still favors Go at 0.39x, suggesting the advantage is in the read-and-deliver path.
-3. **MQTT pub/sub remains a .NET strength at 1.32x.** Cross-protocol (NATS→MQTT) dropped to 0.71x — this benchmark shows high variance across runs.
-4. **JetStream ordered consumer dropped to 0.44x** compared to earlier runs (0.62x). This test completes in <100ms and shows high variance.
-5. **Single publisher 128B dropped to 0.37x** (from 0.62x with smaller message counts). With 500K messages, this benchmark runs long enough for Go's goroutine scheduler and buffer management to reach steady state, widening the gap. The 16B variant is stable at 0.74x.
-6. **Request-reply latency stable** at 0.86x–0.89x across all runs.
+2. **JetStream async file publish improved to 0.49x** (from 0.28x) after Round 11 double-buffer + deferred fsync optimizations — a 75% improvement.
+3. **TLS pub/sub shows a dramatic .NET advantage (4.70x)** — .NET's `SslStream` has significantly lower overhead in the bidirectional pub/sub path. TLS pub-only (ingest only) still favors Go at 0.39x, suggesting the advantage is in the read-and-deliver path.
+4. **MQTT pub/sub remains a .NET strength at 1.32x.** Cross-protocol (NATS→MQTT) dropped to 0.71x — this benchmark shows high variance across runs.
+5. **JetStream ordered consumer dropped to 0.44x** compared to earlier runs (0.62x). This test completes in <100ms and shows high variance.
+6. **Single publisher 128B dropped to 0.37x** (from 0.62x with smaller message counts). With 500K messages, this benchmark runs long enough for Go's goroutine scheduler and buffer management to reach steady state, widening the gap. The 16B variant is stable at 0.74x.
+7. **Request-reply latency stable** at 0.86x–0.89x across all runs.
---
## Optimization History
+### Round 11: JetStream FileStore Double-Buffer + Deferred Fsync
+Two optimizations targeting the JetStream async file publish hot path (0.28x→0.49x, 75% improvement):
+| # | Root Cause | Fix | Impact |
+|---|-----------|-----|--------|
+| 41 | **Lock contention between WriteAt and FlushPending**: `MsgBlock.FlushPending()` held the write lock for the entire `RandomAccess.Write` call, blocking `WriteAt` (publish path) during disk I/O | Double-buffer: swap `_pendingBuf` ↔ `_flushBuf` under the write lock, then write the old buffer to disk outside the lock using a separate `_flushLock`; the publish path blocks only during the buffer pointer swap, not disk I/O | Eliminates write-lock contention during disk I/O |
+| 42 | **Synchronous fsync on publish path**: `RotateBlock()` called `FlushToDisk()`, which did `fsync` synchronously (1,557ms per profile), blocking the publish hot path on every block rotation | Deferred fsync: `RotateBlock` enqueues completed blocks into `ConcurrentQueue<MsgBlock> _needSyncBlocks`; a background `FlushLoopAsync` drains the queue via `DrainSyncQueue()`, calling `Flush()` (fsync) off the publish path — matches Go's `needSync` flag + background goroutine pattern | Moves fsync entirely off the publish hot path |
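Both Round 11 fixes follow the Go filestore pattern the table references. A minimal sketch of that pattern (illustrative names and an in-memory "disk"; not the actual nats-server or .NET port code):

```go
package main

import (
	"fmt"
	"sync"
)

// msgBlock sketches the double-buffer + deferred-fsync structure.
type msgBlock struct {
	mu         sync.Mutex // publish path: guards pendingBuf only
	flushMu    sync.Mutex // I/O path: guards flushBuf and disk
	pendingBuf []byte
	flushBuf   []byte
	disk       []byte // stands in for the block file
	synced     bool
}

// writeAt is the publish hot path; it never waits on disk I/O.
func (mb *msgBlock) writeAt(p []byte) {
	mb.mu.Lock()
	mb.pendingBuf = append(mb.pendingBuf, p...)
	mb.mu.Unlock()
}

// flushPending swaps the buffers under mu (publishers block only for
// the pointer swap), then writes the old buffer to "disk" under flushMu.
func (mb *msgBlock) flushPending() {
	mb.mu.Lock()
	mb.pendingBuf, mb.flushBuf = mb.flushBuf[:0], mb.pendingBuf
	mb.mu.Unlock()

	mb.flushMu.Lock()
	mb.disk = append(mb.disk, mb.flushBuf...)
	mb.flushMu.Unlock()
}

// drainSyncQueue plays the role of the background flush loop: rotated
// blocks are queued and synced off the publish path (Go's needSync flag
// plus background goroutine pattern).
func drainSyncQueue(q chan *msgBlock) {
	for mb := range q {
		mb.flushMu.Lock()
		mb.synced = true // stands in for fsync
		mb.flushMu.Unlock()
	}
}

func main() {
	mb := &msgBlock{}
	mb.writeAt([]byte("msg1"))
	mb.writeAt([]byte("msg2"))
	mb.flushPending()

	q := make(chan *msgBlock, 1)
	done := make(chan struct{})
	go func() { drainSyncQueue(q); close(done) }()
	q <- mb
	close(q)
	<-done
	fmt.Println(string(mb.disk), mb.synced)
}
```

The key design point is the split lock: `mu` protects only the pending buffer touched by publishers, while `flushMu` serializes disk writes, so a slow write or fsync no longer stalls `writeAt`.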
### Round 10: Fan-Out Serial Path Optimization
Three optimizations making the serial fan-out path cheaper (fan-out 0.63x→0.84x, multi 0.65x→1.01x):
@@ -280,6 +290,6 @@ Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RA
| Change | Expected Impact | Go Reference |
|--------|----------------|-------------|
| **Single publisher ingest path (0.37x at 128B)** | The pub-only path has the largest gap. Go's readLoop uses zero-copy buffer management with direct `[]byte` slicing; .NET parses into managed objects. Reducing allocations in the parser→ProcessMessage path would help. | Go: `client.go` readLoop, direct buffer slicing |
-| **JetStream async file publish (0.28x)** | Storage-bound: FileStore AppendAsync bottleneck is synchronous `RandomAccess.Write` in flush loop and S2 compression overhead | Go: `filestore.go` uses `cache.buf`/`cache.idx` with mmap and goroutine-per-flush concurrency |
+| **JetStream async file publish (0.49x)** | After double-buffer + deferred fsync, remaining gap is likely write coalescing and S2 compression overhead | Go: `filestore.go` uses `cache.buf`/`cache.idx` with mmap and goroutine-per-flush concurrency |
| **JetStream ordered consumer (0.44x)** | Pull consumer delivery pipeline has overhead in the fetch→deliver→ack cycle. The test completes in <100ms so numbers are noisy, but the gap is real. | Go: `consumer.go` delivery with direct buffer writes |
| **Write-loop / socket write overhead** | Fan-out (0.84x) and pub/sub (0.66x) gaps partly come from write-loop wakeup latency and socket write syscall overhead compared to Go's `writev()` | Go: `flushOutbound` uses `net.Buffers.WriteTo` → `writev()` with zero-copy buffer management |