docs: record parser hot-path allocation strategy

2026-03-13 10:08:20 -04:00
parent 6cf11969f5
commit a3b34fb16d
5 changed files with 235 additions and 8 deletions
--- a/Documentation/Protocol/Parser.md
+++ b/Documentation/Protocol/Parser.md
@@ -24,9 +24,30 @@ public enum CommandType
 }
 ```

+### ParsedCommandView
+
+`ParsedCommandView` is the byte-first parser result used on the hot path. It keeps protocol fields in byte-oriented storage and exposes the payload as a `ReadOnlySequence<byte>` so single-segment bodies can flow through without an unconditional copy.
+
+```csharp
+public readonly struct ParsedCommandView
+{
+    public CommandType Type { get; init; }
+    public string? Operation { get; init; }
+    public ReadOnlyMemory<byte> Subject { get; init; }
+    public ReadOnlyMemory<byte> ReplyTo { get; init; }
+    public ReadOnlyMemory<byte> Queue { get; init; }
+    public ReadOnlyMemory<byte> Sid { get; init; }
+    public int MaxMessages { get; init; }
+    public int HeaderSize { get; init; }
+    public ReadOnlySequence<byte> Payload { get; init; }
+}
+```
+
+`Subject`, `ReplyTo`, `Queue`, and `Sid` remain ASCII-encoded bytes until a caller explicitly materializes them. `Payload` stays sequence-backed until a caller asks for contiguous memory.
+
 ### ParsedCommand

-`ParsedCommand` is a `readonly struct` that carries the result of a successful parse. Using a struct avoids a heap allocation per command on the fast path.
+`ParsedCommand` remains the compatibility shape for existing consumers. `TryParse` now delegates through `TryParseView` and materializes strings and contiguous payload memory in one adapter step instead of during every parse branch.

 ```csharp
 public readonly struct ParsedCommand
@@ -46,15 +67,21 @@ public readonly struct ParsedCommand

 Fields that do not apply to a given command type are left at their default values (`null` for strings, `0` for integers). `MaxMessages` uses `-1` as a sentinel meaning "unset" (relevant for UNSUB with no max). `HeaderSize` is set for HPUB/HMSG; `-1` indicates no headers. `Payload` carries the raw body bytes for PUB/HPUB, and the raw JSON bytes for CONNECT/INFO.

-## TryParse
+## TryParseView and TryParse

-`TryParse` is the main entry point. It is called by the read loop after each `PipeReader.ReadAsync` completes.
+`TryParseView` is the byte-oriented parser entry point. It is called by hot-path consumers such as `NatsClient.ProcessCommandsAsync` when they want to defer materialization.
+
+```csharp
+internal bool TryParseView(ref ReadOnlySequence<byte> buffer, out ParsedCommandView command)
+```
+
+`TryParse` remains the compatibility entry point for existing call sites and tests:

 ```csharp
 public bool TryParse(ref ReadOnlySequence<byte> buffer, out ParsedCommand command)
 ```

-The method returns `true` and advances `buffer` past the consumed bytes when a complete command is available. It returns `false` — leaving `buffer` unchanged — when more data is needed. The caller must call `TryParse` in a loop until it returns `false`, then call `PipeReader.AdvanceTo` to signal how far the buffer was consumed.
+Both methods return `true` and advance `buffer` past the consumed bytes when a complete command is available. They return `false` — leaving `buffer` unchanged — when more data is needed. The caller must call the parser in a loop until it returns `false`, then call `PipeReader.AdvanceTo` to signal how far the buffer was consumed.

 If the parser detects a malformed command it throws `ProtocolViolationException`, which the read loop catches to close the connection.

@@ -158,9 +185,9 @@ The two-character pairs are: `p+i` = PING, `p+o` = PONG, `p+u` = PUB, `h+p` = HP

 PUB and HPUB require a payload body that follows the control line. The parser handles split reads — where the TCP segment boundary falls inside the payload — through an `_awaitingPayload` state flag.

-**Phase 1 — control line:** The parser reads the control line up to `\r\n`, extracts the subject, optional reply-to, and payload size(s), then stores these in private fields (`_pendingSubject`, `_pendingReplyTo`, `_expectedPayloadSize`, `_pendingHeaderSize`, `_pendingType`) and sets `_awaitingPayload = true`. It then immediately calls `TryReadPayload` to attempt phase 2.
+**Phase 1 — control line:** The parser reads the control line up to `\r\n`, extracts the subject, optional reply-to, and payload size(s), then stores these in private fields (`_pendingSubject`, `_pendingReplyTo`, `_expectedPayloadSize`, `_pendingHeaderSize`, `_pendingType`) and sets `_awaitingPayload = true`. The pending subject and reply values are held as byte-oriented state, not strings. It then immediately calls `TryReadPayload` to attempt phase 2.

-**Phase 2 — payload read:** `TryReadPayload` checks whether `buffer.Length >= _expectedPayloadSize + 2` (the `+ 2` accounts for the trailing `\r\n`). If enough data is present, the payload bytes are copied to a new `byte[]`, the trailing `\r\n` is verified, the `ParsedCommand` is constructed, and `_awaitingPayload` is reset to `false`. If not enough data is present, `TryReadPayload` returns `false` and `_awaitingPayload` remains `true`.
+**Phase 2 — payload read:** `TryReadPayload` checks whether `buffer.Length >= _expectedPayloadSize + 2` (the `+ 2` accounts for the trailing `\r\n`). If enough data is present, the parser slices the payload as a `ReadOnlySequence<byte>`, verifies the trailing `\r\n`, constructs a `ParsedCommandView`, and resets `_awaitingPayload` to `false`. If not enough data is present, `TryReadPayload` returns `false` and `_awaitingPayload` remains `true`.

 On the next call to `TryParse`, the check at the top of the method routes straight to `TryReadPayload` without re-parsing the control line:

@@ -171,6 +198,20 @@ if (_awaitingPayload)

 This means the parser correctly handles payloads that arrive across multiple `PipeReader.ReadAsync` completions without buffering the control line a second time.

+## Materialization Boundaries
+
+The parser now has explicit materialization boundaries:
+
+- `TryParseView` keeps payloads sequence-backed and leaves token fields as bytes.
+- `ParsedCommandView.Materialize()` converts byte fields to strings and converts multi-segment payloads to a standalone `byte[]`.
+- `NatsClient` consumes `ParsedCommandView` directly for the `PUB` and `HPUB` hot path, only decoding subject and reply strings at the routing and permission-check boundary.
+- `CONNECT` and `INFO` now keep their JSON payload as a slice of the original control-line sequence until a consumer explicitly materializes it.
+
+Payload copying is still intentional in two places:
+
+- when a multi-segment payload must become contiguous for a consumer using `ReadOnlyMemory<byte>`
+- when compatibility callers continue to use `TryParse` and require a materialized `ParsedCommand`
+
 ## Zero-Allocation Argument Splitting

 `SplitArgs` splits the argument portion of a control line into token ranges without allocating. The caller `stackalloc`s a `Span<Range>` sized to the maximum expected argument count for the command, then passes it to `SplitArgs`:
--- a/benchmarks_comparison.md
+++ b/benchmarks_comparison.md
@@ -1,8 +1,46 @@
 # Go vs .NET NATS Server — Benchmark Comparison

-Benchmark run: 2026-03-13. Both servers running on the same machine, tested with identical NATS.Client.Core workloads. Test parallelization disabled to avoid resource contention. Best-of-3 runs reported.
+Benchmark run: 2026-03-13 10:06 AM America/Indiana/Indianapolis. The latest refresh used the benchmark project README command (`dotnet test tests/NATS.Server.Benchmark.Tests --filter "Category=Benchmark" -v normal --logger "console;verbosity=detailed"`) and completed successfully as a `.NET`-only run. The Go/.NET comparison tables below remain the last Go-capable comparison baseline.

-**Environment:** Apple M4, .NET 10, Go nats-server (latest from `golang/nats-server/`).
+**Environment:** Apple M4, .NET SDK 10.0.101. The README benchmark command ran in the benchmark project's default `Debug` configuration. The Go toolchain was installed, but the current full-suite run emitted only `.NET` result blocks.
+
+---
+
+## Latest README Run (.NET only)
+
+The current refresh came from `/tmp/bench-output.txt` using the benchmark project README workflow. Because the run did not emit any Go comparison blocks, the values below are the latest `.NET`-only numbers from that run, and the historical Go/.NET comparison tables are preserved below instead of being overwritten with mixed-source ratios.
+
+### Core and JetStream
+
+| Benchmark | .NET msg/s | .NET MB/s | Notes |
+|-----------|------------|-----------|-------|
+| Single Publisher (16B) | 1,392,442 | 21.2 | README full-suite run |
+| Single Publisher (128B) | 1,491,226 | 182.0 | README full-suite run |
+| PubSub 1:1 (16B) | 717,731 | 11.0 | README full-suite run |
+| PubSub 1:1 (16KB) | 28,450 | 444.5 | README full-suite run |
+| Fan-Out 1:4 (128B) | 1,451,748 | 177.2 | README full-suite run |
+| Multi 4Px4S (128B) | 244,878 | 29.9 | README full-suite run |
+| Request-Reply Single (128B) | 6,840 | 0.8 | P50 142.5 us, P99 203.9 us |
+| Request-Reply 10Cx2S (16B) | 22,844 | 0.3 | P50 421.1 us, P99 602.1 us |
+| JS Sync Publish (16B Memory) | 12,619 | 0.2 | README full-suite run |
+| JS Async Publish (128B File) | 46,631 | 5.7 | README full-suite run |
+| JS Ordered Consumer (128B) | 108,057 | 13.2 | README full-suite run |
+| JS Durable Fetch (128B) | 490,090 | 59.8 | README full-suite run |
+
+### Parser Microbenchmarks
+
+| Benchmark | Ops/s | MB/s | Alloc |
+|-----------|-------|------|-------|
+| Parser PING | 5,756,370 | 32.9 | 0.0 B/op |
+| Parser PUB | 2,537,973 | 96.8 | 40.0 B/op |
+| Parser HPUB | 2,298,811 | 122.8 | 40.0 B/op |
+| Parser PUB split payload | 2,049,535 | 78.2 | 176.0 B/op |
+
+### Current Run Highlights
+
+1. The parser microbenchmarks show the hot path is already at zero allocation for `PING`, while contiguous `PUB` and `HPUB` each still pay a fixed 40 B/op for retained field copies.
+2. Split-payload `PUB` allocates 176 B/op, compared with 40 B/op for contiguous `PUB`, because the parser must preserve unread payload state across reads and then materialize contiguous memory at the current client boundary.
+3. The README-driven suite was a `.NET`-only refresh, so the comparative Go/.NET ratios below should still be treated as the last Go-capable baseline rather than current same-run ratios.

 ---

--- a/src/NATS.Server/NATS.Server.csproj
+++ b/src/NATS.Server/NATS.Server.csproj
@@ -10,6 +10,7 @@
    <InternalsVisibleTo Include="NATS.Server.Monitoring.Tests" />
    <InternalsVisibleTo Include="NATS.Server.Auth.Tests" />
    <InternalsVisibleTo Include="NATS.Server.JetStream.Tests" />
+    <InternalsVisibleTo Include="NATS.Server.Benchmark.Tests" />
  </ItemGroup>
  <ItemGroup>
    <FrameworkReference Include="Microsoft.AspNetCore.App" />
--- a/tests/NATS.Server.Benchmark.Tests/NATS.Server.Benchmark.Tests.csproj
+++ b/tests/NATS.Server.Benchmark.Tests/NATS.Server.Benchmark.Tests.csproj
@@ -23,4 +23,8 @@
    <Content Include="xunit.runner.json" CopyToOutputDirectory="PreserveNewest" />
  </ItemGroup>

+  <ItemGroup>
+    <ProjectReference Include="..\..\src\NATS.Server\NATS.Server.csproj" />
+  </ItemGroup>
+
 </Project>
--- a/tests/NATS.Server.Benchmark.Tests/Protocol/ParserHotPathBenchmarks.cs
+++ b/tests/NATS.Server.Benchmark.Tests/Protocol/ParserHotPathBenchmarks.cs
@@ -0,0 +1,143 @@
+using System.Buffers;
+using System.Diagnostics;
+using System.Text;
+using NATS.Server.Protocol;
+using Xunit.Abstractions;
+
+namespace NATS.Server.Benchmark.Tests.Protocol;
+
+public class ParserHotPathBenchmarks(ITestOutputHelper output)
+{
+    [Fact]
+    [Trait("Category", "Benchmark")]
+    public void Parser_PING_Throughput()
+    {
+        var payload = "PING\r\n"u8.ToArray();
+        MeasureSingleChunk("Parser PING", payload, iterations: 500_000);
+    }
+
+    [Fact]
+    [Trait("Category", "Benchmark")]
+    public void Parser_PUB_Throughput()
+    {
+        var payload = "PUB bench.subject 16\r\n0123456789ABCDEF\r\n"u8.ToArray();
+        MeasureSingleChunk("Parser PUB", payload, iterations: 250_000);
+    }
+
+    [Fact]
+    [Trait("Category", "Benchmark")]
+    public void Parser_HPUB_Throughput()
+    {
+        var payload = "HPUB bench.subject 12 28\r\nNATS/1.0\r\n\r\n0123456789ABCDEF\r\n"u8.ToArray();
+        MeasureSingleChunk("Parser HPUB", payload, iterations: 200_000);
+    }
+
+    [Fact]
+    [Trait("Category", "Benchmark")]
+    public void Parser_PUB_SplitPayload_Throughput()
+    {
+        var firstChunk = "PUB bench.subject 16\r\n01234567"u8.ToArray();
+        var secondChunk = "89ABCDEF\r\n"u8.ToArray();
+        MeasureSplitPayload("Parser PUB split payload", firstChunk, secondChunk, iterations: 200_000);
+    }
+
+    private void MeasureSingleChunk(string name, byte[] commandBytes, int iterations)
+    {
+        GC.Collect();
+        GC.WaitForPendingFinalizers();
+        GC.Collect();
+
+        var parser = new NatsParser();
+        var totalBytes = (long)commandBytes.Length * iterations;
+        var beforeAlloc = GC.GetAllocatedBytesForCurrentThread();
+        var stopwatch = Stopwatch.StartNew();
+
+        for (var i = 0; i < iterations; i++)
+        {
+            ReadOnlySequence<byte> buffer = new(commandBytes);
+            if (!parser.TryParseView(ref buffer, out var command))
+                throw new InvalidOperationException($"{name} did not produce a parsed command.");
+
+            if (command.Type is CommandType.Pub or CommandType.HPub)
+            {
+                var payload = command.GetPayloadMemory();
+                if (payload.IsEmpty)
+                    throw new InvalidOperationException($"{name} produced an empty payload unexpectedly.");
+            }
+        }
+
+        stopwatch.Stop();
+        var allocatedBytes = GC.GetAllocatedBytesForCurrentThread() - beforeAlloc;
+        WriteResult(name, iterations, totalBytes, stopwatch.Elapsed, allocatedBytes);
+    }
+
+    private void MeasureSplitPayload(string name, byte[] firstChunkBytes, byte[] secondChunkBytes, int iterations)
+    {
+        GC.Collect();
+        GC.WaitForPendingFinalizers();
+        GC.Collect();
+
+        var parser = new NatsParser();
+        var totalBytes = (long)(firstChunkBytes.Length + secondChunkBytes.Length) * iterations;
+        var beforeAlloc = GC.GetAllocatedBytesForCurrentThread();
+        var stopwatch = Stopwatch.StartNew();
+
+        for (var i = 0; i < iterations; i++)
+        {
+            ReadOnlySequence<byte> firstChunk = new(firstChunkBytes);
+            if (parser.TryParseView(ref firstChunk, out _))
+                throw new InvalidOperationException($"{name} should wait for the second payload chunk.");
+
+            ReadOnlySequence<byte> secondChunk = CreateSequence(firstChunk.First, secondChunkBytes);
+            if (!parser.TryParseView(ref secondChunk, out var command))
+                throw new InvalidOperationException($"{name} did not complete after the second payload chunk.");
+
+            if (command.GetPayloadMemory().Length != 16)
+                throw new InvalidOperationException($"{name} produced the wrong payload length.");
+        }
+
+        stopwatch.Stop();
+        var allocatedBytes = GC.GetAllocatedBytesForCurrentThread() - beforeAlloc;
+        WriteResult(name, iterations, totalBytes, stopwatch.Elapsed, allocatedBytes);
+    }
+
+    private void WriteResult(string name, int iterations, long totalBytes, TimeSpan elapsed, long allocatedBytes)
+    {
+        var operationsPerSecond = iterations / elapsed.TotalSeconds;
+        var megabytesPerSecond = totalBytes / elapsed.TotalSeconds / (1024.0 * 1024.0);
+        var bytesPerOperation = allocatedBytes / (double)iterations;
+
+        output.WriteLine($"=== {name} ===");
+        output.WriteLine($"Ops:      {operationsPerSecond:N0} ops/s");
+        output.WriteLine($"Data:     {megabytesPerSecond:F1} MB/s");
+        output.WriteLine($"Alloc:    {bytesPerOperation:F1} B/op");
+        output.WriteLine($"Elapsed:  {elapsed.TotalMilliseconds:F0} ms");
+        output.WriteLine("");
+    }
+
+    private static ReadOnlySequence<byte> CreateSequence(ReadOnlyMemory<byte> remainingBytes, byte[] secondChunk)
+    {
+        var first = new BufferSegment(remainingBytes);
+        var second = first.Append(secondChunk);
+        return new ReadOnlySequence<byte>(first, 0, second, second.Memory.Length);
+    }
+
+    private sealed class BufferSegment : ReadOnlySequenceSegment<byte>
+    {
+        public BufferSegment(ReadOnlyMemory<byte> memory)
+        {
+            Memory = memory;
+        }
+
+        public BufferSegment Append(ReadOnlyMemory<byte> memory)
+        {
+            var next = new BufferSegment(memory)
+            {
+                RunningIndex = RunningIndex + Memory.Length,
+            };
+
+            Next = next;
+            return next;
+        }
+    }
+}