`natsdotnet/benchmarks_comparison.md`, commit 9e0df9b3d7 by Joseph Doherty, 2026-03-13 ("docs: add JetStream perf investigation notes and test status tracking"). The commit adds detailed analysis of the 1,200x JetStream file publish gap, identifying the bottleneck in the outbound write path (not FileStore), and adds tests.md tracking skipped/failing test status across the Core and JetStream suites.


# Go vs .NET NATS Server — Benchmark Comparison

Benchmark run: 2026-03-13. Both servers ran on the same machine and were tested with identical NATS.Client.Core workloads; test parallelization was disabled to avoid resource contention.

Environment: Apple M4, .NET 10, Go nats-server (latest from golang/nats-server/).


## Core NATS — Pub/Sub Throughput

### Single Publisher (no subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B    | 2,436,416 | 37.2   | 1,425,767  | 21.8      | 0.59x |
| 128 B   | 2,143,434 | 261.6  | 1,654,692  | 202.0     | 0.77x |

### Publisher + Subscriber (1:1)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 16 B    | 1,140,225 | 17.4   | 207,654    | 3.2       | 0.18x |
| 16 KB   | 41,762    | 652.5  | 34,429     | 538.0     | 0.82x |

### Fan-Out (1 Publisher : 4 Subscribers)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B   | 3,192,313 | 389.7  | 581,284    | 71.0      | 0.18x |

### Multi-Publisher / Multi-Subscriber (4P x 4S)

| Payload | Go msg/s | Go MB/s | .NET msg/s | .NET MB/s | Ratio (.NET/Go) |
|---------|----------|---------|------------|-----------|-----------------|
| 128 B   | 269,445   | 32.9   | 529,808    | 64.7      | 1.97x |

## Core NATS — Request/Reply Latency

### Single Client, Single Service

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (µs) | .NET P50 (µs) | Go P99 (µs) | .NET P99 (µs) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 128 B   | 9,347    | 7,215      | 0.77x | 104.5       | 134.7         | 146.2       | 190.5 |

### 10 Clients, 2 Services (Queue Group)

| Payload | Go msg/s | .NET msg/s | Ratio | Go P50 (µs) | .NET P50 (µs) | Go P99 (µs) | .NET P99 (µs) |
|---------|----------|------------|-------|-------------|---------------|-------------|---------------|
| 16 B    | 30,893   | 25,861     | 0.84x | 315.0       | 370.2         | 451.1       | 595.0 |

## JetStream — Publication

| Mode | Payload | Storage | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|---------|---------|----------|------------|-----------------|
| Synchronous   | 16 B  | Memory | 16,783  | 13,815 | 0.82x |
| Async (batch) | 128 B | File   | 210,387 | 174    | 0.00x |

Note: Async file store publish remains extremely slow even after FileStore-level optimizations (buffered writes, O(1) state tracking, redundant-work elimination). The bottleneck is in the E2E network/protocol processing path (synchronous `.GetAwaiter().GetResult()` calls in the client read loop), not storage I/O.


## JetStream — Consumption

| Mode | Go msg/s | .NET msg/s | Ratio (.NET/Go) |
|------|----------|------------|-----------------|
| Ordered ephemeral consumer | 109,519 | N/A    | N/A   |
| Durable consumer fetch     | 639,247 | 80,792 | 0.13x |

Note: Ordered ephemeral consumer is not yet fully supported on the .NET server (API timeout during consumer creation).


## Summary

| Category | Ratio Range | Assessment |
|----------|-------------|------------|
| Pub-only throughput | 0.59x–0.77x | Good — within 2x |
| Pub/sub (large payload) | 0.82x | Good |
| Pub/sub (small payload) | 0.18x | Needs optimization |
| Fan-out | 0.18x | Needs optimization |
| Multi pub/sub | 1.97x | .NET faster (likely measurement artifact at low counts) |
| Request/reply latency | 0.77x–0.84x | Good |
| JetStream sync publish | 0.82x | Good |
| JetStream async file publish | ~0x | Broken — E2E protocol path bottleneck |
| JetStream durable fetch | 0.13x | Needs optimization |

### Key Observations

1. Pub-only and request/reply are within striking distance (0.6x–0.85x), suggesting the core message path is reasonably well ported.
2. Small-payload pub/sub and fan-out are ~5x slower (0.18x ratio). The bottleneck is likely in the subscription dispatch / message delivery hot path — the `SubList.Match()` → MSG write loop.
3. JetStream file store async publish is ~1,200x slower than Go — see investigation notes below.
4. JetStream consumption (durable fetch) is ~8x slower than Go. Ordered consumers don't work yet.
5. The multi-pub/sub result showing .NET faster is likely a measurement artifact from the small message count (2,000 per publisher) — not representative at scale.

## JetStream Async File Publish Investigation

The async file store publish benchmark publishes 5,000 128-byte messages in batches of 100 to a `Retention=Limits`, `Storage=File`, `MaxMsgs=10_000_000` stream (no MaxAge, no MaxMsgsPer). Go achieves 210,387 msg/s; .NET achieves 174 msg/s — a 1,208x gap.

The JetStream sync memory store benchmark achieves 0.82x parity, confirming the bottleneck is specific to the file-store async publish path.

### What was optimized (FileStore layer)

Five root causes were identified and fixed in the FileStore/StreamManager layer:

| # | Root Cause | Fix | Impact |
|---|------------|-----|--------|
| 1 | Per-message synchronous disk I/O: `MsgBlock.WriteAt()` called `RandomAccess.Write()` on every message | Added write buffering in `MsgBlock` + background flush loop in `FileStore` (Go's `flushLoop` pattern: coalesce at 16 KB or 8 ms) | Eliminates per-message syscall overhead |
| 2 | O(n) `GetStateAsync` per publish: `_messages.Keys.Min()` and `_messages.Values.Sum()` on every publish for MaxMsgs/MaxBytes checks | Added incremental `_messageCount`, `_totalBytes`, `_firstSeq` fields updated in all mutation paths; `GetStateAsync` is now O(1) | Eliminates O(n) scan per publish |
| 3 | Unnecessary `LoadAsync` after every append: `StreamManager.Capture` reloaded the just-stored message even when no mirrors/sources were configured | Made `LoadAsync` conditional on mirror/source replication being configured | Eliminates redundant disk read per publish |
| 4 | Redundant `PruneExpiredMessages` per publish: called before every publish even when `MaxAge=0`, and again inside `EnforceRuntimePolicies` | Guarded with `MaxAgeMs > 0` check; removed the pre-publish call (the background expiry timer handles it) | Eliminates O(n) scan per publish |
| 5 | `PrunePerSubject` loading all messages per publish: `EnforceRuntimePolicies` → `PrunePerSubject` called `ListAsync().GroupBy()` even when `MaxMsgsPer=0` | Guarded with `MaxMsgsPer > 0` check | Eliminates O(n) scan per publish |
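The O(1) state-tracking fix (root cause 2 above) can be sketched as follows. This is a sketch in Go for consistency with the source references in this document; the counter names are illustrative analogs of the `_messageCount`/`_totalBytes`/`_firstSeq` fields, not the actual .NET implementation.

```go
package main

import "fmt"

// streamState maintains counters incrementally on every mutation, so that a
// state query never scans the message map (the old path did the equivalent
// of Keys.Min() and Values.Sum() on every publish).
type streamState struct {
	msgs       map[uint64][]byte
	msgCount   uint64
	totalBytes uint64
	firstSeq   uint64
	lastSeq    uint64
}

func newStreamState() *streamState {
	return &streamState{msgs: map[uint64][]byte{}, firstSeq: 1}
}

// store updates count/bytes/lastSeq in O(1) alongside the insert.
func (s *streamState) store(data []byte) uint64 {
	s.lastSeq++
	s.msgs[s.lastSeq] = data
	s.msgCount++
	s.totalBytes += uint64(len(data))
	return s.lastSeq
}

// removeFirst evicts the oldest message, keeping the counters in sync.
func (s *streamState) removeFirst() {
	if data, ok := s.msgs[s.firstSeq]; ok {
		delete(s.msgs, s.firstSeq)
		s.msgCount--
		s.totalBytes -= uint64(len(data))
	}
	s.firstSeq++
}

// state is O(1): it only reads the maintained counters.
func (s *streamState) state() (count, size, first uint64) {
	return s.msgCount, s.totalBytes, s.firstSeq
}

func main() {
	s := newStreamState()
	s.store([]byte("abcd"))
	s.store([]byte("ef"))
	s.removeFirst()
	c, b, f := s.state()
	fmt.Println(c, b, f) // 1 2 2
}
```

The invariant is that every mutation path updates all three counters; miss one path and the cheap `state()` silently drifts from the truth, which is why the fix touched "all mutation paths".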

Additional fixes: SHA256 envelope bypass for unencrypted/uncompressed stores, RAFT propose skip for single-replica streams.

### Why the benchmark didn't improve

After all FileStore-level optimizations, the benchmark remained at ~174 msg/s. The bottleneck is upstream of the storage layer in the E2E network/protocol processing path.

Important context from Go source verification: Go also processes JetStream messages inline on the read goroutine — `processInbound` → `processInboundClientMsg` calls `processJetStreamMsg` synchronously (no channel handoff). `processJetStreamMsg` takes `mset.mu` and calls `store.StoreMsg()` inline (server/stream.go:5436–6136). The `pcd` field is a `map[*client]struct{}` for deferred outbound flush bookkeeping (server/client.go:291), not a channel.
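A minimal model of that pcd bookkeeping (queue outbound bytes inline, mark the client in a set, then flush each marked client once per parse pass) can be sketched as below. All types are illustrative, not the Go server's actual structures.

```go
package main

import (
	"bytes"
	"fmt"
)

// client buffers queued outbound data; conn stands in for the socket.
type client struct {
	pending bytes.Buffer
	conn    bytes.Buffer
	flushes int
}

func (c *client) queue(b []byte) { c.pending.Write(b) }

// flushOutbound pushes everything pending to the connection in one write.
func (c *client) flushOutbound() {
	c.conn.Write(c.pending.Bytes())
	c.pending.Reset()
	c.flushes++
}

func main() {
	c := &client{}
	pcd := map[*client]struct{}{} // pending-client set, cleared per pass

	// Inline processing of a batch of inbound messages: queue each ack and
	// mark the client instead of writing to the socket immediately.
	for i := 0; i < 100; i++ {
		c.queue([]byte("+OK\r\n"))
		pcd[c] = struct{}{}
	}
	// One flush per marked client at the end of the parse pass.
	for cl := range pcd {
		cl.flushOutbound()
		delete(pcd, cl)
	}
	fmt.Println(c.flushes, c.conn.Len()) // 1 500
}
```

The point of the set is deduplication: however many messages a pass produces for a client, that client is flushed once.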

So Go faces the same serial read→process constraint per connection — the 1,200x gap cannot be explained by Go offloading JetStream to another goroutine (it doesn't). The actual differences are:

1. **Write coalescing in the file store** — Go's `writeMsgRecordLocked` appends to `mb.cache.buf` (an in-memory byte slice) and defers disk I/O to a background `flushLoop` goroutine that coalesces at 16 KB or 8 ms (server/filestore.go:328, 5796, 5841). Our .NET port now matches this pattern, but there may be differences in how efficiently the flush loop runs (Task scheduling overhead vs goroutine scheduling).

2. **Coalescing write loop for outbound data** — Go has a dedicated `writeLoop` goroutine per connection that waits on `c.out.sg` (a `sync.Cond`; server/client.go:355, 1274). Outbound data accumulates in `out.nb` (`net.Buffers`) and is flushed in batches via `net.Buffers.WriteTo`, up to `nbMaxVectorSize` buffers (server/client.go:1615). The .NET server writes ack responses individually per message — no outbound batching.

3. **Per-message overhead in the .NET protocol path** — The .NET `NatsClient.ProcessInboundAsync` calls `TryCaptureJetStreamPublish` via `.GetAwaiter().GetResult()`, blocking the read-loop Task. While Go also processes inline, Go's goroutine scheduler is cheaper for this pattern — goroutines that block on a mutex or I/O yield efficiently to the runtime scheduler, whereas .NET's Task + `GetAwaiter().GetResult()` on an async context can cause thread pool starvation or synchronization overhead.

4. **AsyncFlush configuration** — Go's file store respects `fcfg.AsyncFlush` (server/filestore.go:456). When `AsyncFlush=true` (the default for streams), `writeMsgRecordLocked` does NOT flush synchronously (server/filestore.go:6803). When `AsyncFlush=false`, it flushes inline after each write. The .NET benchmark may be triggering synchronous flushes unintentionally.
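The 16 KB / 8 ms coalescing pattern from point 1 can be sketched with stdlib primitives. The thresholds match the ones quoted above; everything else (type names, the buffer-reset stand-in for a disk write) is illustrative, not the actual `flushLoop` implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	flushSize     = 16 * 1024             // size threshold: 16 KiB
	flushInterval = 8 * time.Millisecond  // time threshold: 8 ms
)

// coalescingWriter accumulates records in memory; flushes happen when the
// buffer reaches flushSize, or on the flushLoop timer for partial buffers.
type coalescingWriter struct {
	mu      sync.Mutex
	buf     []byte
	flushes int
	done    chan struct{}
}

func (w *coalescingWriter) write(p []byte) {
	w.mu.Lock()
	w.buf = append(w.buf, p...)
	full := len(w.buf) >= flushSize
	w.mu.Unlock()
	if full {
		w.flush() // size threshold reached: flush immediately
	}
}

func (w *coalescingWriter) flush() {
	w.mu.Lock()
	if len(w.buf) > 0 {
		w.buf = w.buf[:0] // real code would issue one large disk write here
		w.flushes++
	}
	w.mu.Unlock()
}

func (w *coalescingWriter) flushCount() int {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.flushes
}

// flushLoop covers the time threshold for partially filled buffers.
func (w *coalescingWriter) flushLoop() {
	t := time.NewTicker(flushInterval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			w.flush()
		case <-w.done:
			return
		}
	}
}

func main() {
	w := &coalescingWriter{done: make(chan struct{})}
	go w.flushLoop()
	for i := 0; i < 200; i++ {
		w.write(make([]byte, 128)) // 25,600 B total: one size-triggered flush
	}
	time.Sleep(20 * time.Millisecond) // let the timer flush the remainder
	close(w.done)
	fmt.Println(w.flushCount() >= 2)
}
```

Two hundred 128-byte records collapse into a couple of flushes instead of two hundred syscalls, which is the whole point of the pattern.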

### What would actually fix it

The fix requires changes to the outbound write path and careful profiling, not further FileStore tuning:

| Change | Description | Go Reference |
|--------|-------------|--------------|
| Coalescing write loop | Add a dedicated outbound write loop per connection that batches acks/MSGs using `net.Buffers`-style vectored I/O, woken by a `sync.Cond`-equivalent signal | server/client.go:1274, 1615 — `writeLoop` with `out.sg` (`sync.Cond`) and `out.nb` (`net.Buffers`) |
| Eliminate sync-over-async | Replace `.GetAwaiter().GetResult()` calls in the read loop with true async/await or a synchronous-only code path to avoid thread pool overhead | N/A — architectural difference |
| Profile Task scheduling | The background flush loop uses `Task.Delay(1)` for coalescing waits; this may have higher latency than Go's `time.Sleep(1ms)` due to Task scheduler granularity | server/filestore.go:5841 — `time.Sleep` in `flushLoop` |
| Verify AsyncFlush is enabled | Ensure the benchmark stream config sets `AsyncFlush=true` so the file store uses buffered writes rather than synchronous per-message flushes | server/filestore.go:456 — `fs.fip = !fcfg.AsyncFlush` |
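The AsyncFlush / fip ("flush in place") gating referenced above can be sketched as follows. The `fip` name mirrors the Go source cited in this document; the surrounding types and counters are illustrative only.

```go
package main

import "fmt"

// fileStore models the gating: fip is the inverse of AsyncFlush. With fip
// set, every record write flushes synchronously (the slow path); without it,
// a background flush loop (not shown) handles disk I/O in batches.
type fileStore struct {
	fip         bool // fip = !AsyncFlush
	buf         []byte
	syncFlushes int
}

func newFileStore(asyncFlush bool) *fileStore {
	return &fileStore{fip: !asyncFlush}
}

func (fs *fileStore) writeMsgRecord(rec []byte) {
	fs.buf = append(fs.buf, rec...)
	if fs.fip {
		fs.flush() // synchronous per-message flush
	}
}

func (fs *fileStore) flush() {
	if len(fs.buf) > 0 {
		fs.buf = fs.buf[:0] // real code writes the buffer to disk here
		fs.syncFlushes++
	}
}

func main() {
	fast := newFileStore(true)  // AsyncFlush=true: buffered writes
	slow := newFileStore(false) // AsyncFlush=false: flush per message
	for i := 0; i < 1000; i++ {
		fast.writeMsgRecord([]byte("rec"))
		slow.writeMsgRecord([]byte("rec"))
	}
	fmt.Println(fast.syncFlushes, slow.syncFlushes) // 0 1000
}
```

If the .NET benchmark accidentally runs in the `fip` configuration, it pays one flush per message regardless of any buffering added elsewhere — hence the "verify it is enabled" action item.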

The coalescing write loop is likely the highest-impact change — it explains both the JetStream ack throughput gap and the 0.18x gap in pub/sub (small payload) and fan-out benchmarks.
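A stdlib-only sketch of such a coalescing write loop, modeled on the `writeLoop` / `out.sg` / `out.nb` design cited above: producers append buffers and signal a `sync.Cond`; the loop drains everything pending and issues one vectored write via `net.Buffers.WriteTo` (which uses writev on a real `*net.TCPConn`). An in-memory sink stands in for the socket so the sketch runs without a network; all names are illustrative.

```go
package main

import (
	"bytes"
	"fmt"
	"net"
	"sync"
)

type outbound struct {
	mu     sync.Mutex
	sg     *sync.Cond  // signal: data pending or connection closing
	nb     net.Buffers // pending outbound buffers
	closed bool
	sink   bytes.Buffer // stands in for the TCP connection
	writes int          // number of vectored flushes issued
}

func newOutbound() *outbound {
	o := &outbound{}
	o.sg = sync.NewCond(&o.mu)
	return o
}

// enqueue is what the read loop calls per ack/MSG: append and signal.
func (o *outbound) enqueue(b []byte) {
	o.mu.Lock()
	o.nb = append(o.nb, b)
	o.sg.Signal()
	o.mu.Unlock()
}

func (o *outbound) close() {
	o.mu.Lock()
	o.closed = true
	o.sg.Signal()
	o.mu.Unlock()
}

// writeLoop drains all pending buffers per wakeup and writes them as one
// batch, so many enqueues collapse into few actual writes.
func (o *outbound) writeLoop(done chan<- struct{}) {
	for {
		o.mu.Lock()
		for len(o.nb) == 0 && !o.closed {
			o.sg.Wait()
		}
		nb := o.nb
		o.nb = nil
		closed := o.closed
		o.mu.Unlock()
		if len(nb) > 0 {
			nb.WriteTo(&o.sink) // one vectored write for the whole batch
			o.writes++
		}
		if closed {
			close(done)
			return
		}
	}
}

func main() {
	o := newOutbound()
	done := make(chan struct{})
	go o.writeLoop(done)
	for i := 0; i < 100; i++ {
		o.enqueue([]byte("+OK\r\n")) // 100 acks, far fewer vectored writes
	}
	o.close()
	<-done
	fmt.Println(o.sink.Len()) // 500
}
```

This is the batching the .NET server currently lacks: with per-message writes, 100 acks cost 100 socket writes; here they cost however many wakeups the loop happened to take.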