mxaccess/design/M6-bench-baseline.md
Joseph Doherty 71c69b80c6 [F38] mxaccess-codec: counting-allocator bench harness + R12 baseline
Hand-rolled GlobalAlloc wrapper around System that tracks allocs +
bytes + deallocs via two atomics. Each scenario runs 10k iterations
after a 1k warm-up; output is a markdown table with allocs/op,
bytes/op, deallocs/op.

Why hand-rolled (not dhat/criterion): R12 gates on a single number
("< 5 allocs/write"). dhat is heap-profiling-oriented (call-stack
attribution, JSON snapshots); criterion measures wall-clock latency
which is reported-but-not-gated per 60-roadmap.md:104. A 50-line
GlobalAlloc + atomic counters is the simplest thing that answers
the gate.

Run: `cargo bench -p mxaccess-codec`

Baseline numbers (release, Windows x64):
- Bool write:    1.00 allocs/op
- Int32 write:   2.00 allocs/op
- Float32 write: 2.00 allocs/op
- Float64 write: 2.00 allocs/op
- String write:  4.00 allocs/op (5-char string)
- Handle from_names: 2.00 allocs/op
- DataUpdate decode: 1.00 allocs/op

R12's < 5 allocs/write target is **already met** across the proven
matrix without any zero-copy work. The bench gates on this — any
write_message::encode scenario at >= 5 allocs/op exits the harness
with code 1.

Companion: `design/M6-bench-baseline.md` documents the numbers,
explains the per-scenario breakdown, and tightens F39's scope from
"hit the target" to "nice-to-have optimisations" (BytesMut output
buffer, name-signature cache, session-level scratch pool).

Workspace: 759 tests still pass; clippy --benches clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 04:45:33 -04:00


M6 — mxaccess-codec allocation-count baseline

Source: cargo bench -p mxaccess-codec (commit recording this file). Harness: crates/mxaccess-codec/benches/alloc_count.rs — a thin GlobalAlloc wrapper that increments two atomics on every alloc / dealloc call, then runs each scenario for 10k iterations after a 1k-iteration warm-up.
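The harness pattern described above can be sketched roughly as follows. This is a minimal illustration, not the contents of alloc_count.rs: the names `CountingAlloc`, `ALLOCS`, `DEALLOCS`, and `measure` are hypothetical, and byte tracking is left out for brevity.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical counter names; the real harness may use different ones.
static ALLOCS: AtomicU64 = AtomicU64::new(0);
static DEALLOCS: AtomicU64 = AtomicU64::new(0);

/// Forwards every request to `System` and counts the calls.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        DEALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `warmup` uncounted iterations, then returns
/// (allocs/op, deallocs/op) averaged over `iters` counted iterations.
fn measure<F: FnMut()>(warmup: u64, iters: u64, mut op: F) -> (f64, f64) {
    for _ in 0..warmup {
        op();
    }
    let a0 = ALLOCS.load(Ordering::Relaxed);
    let d0 = DEALLOCS.load(Ordering::Relaxed);
    for _ in 0..iters {
        op();
    }
    let allocs = ALLOCS.load(Ordering::Relaxed) - a0;
    let deallocs = DEALLOCS.load(Ordering::Relaxed) - d0;
    (allocs as f64 / iters as f64, deallocs as f64 / iters as f64)
}

fn main() {
    // Stand-in workload: one heap buffer per iteration.
    let (allocs, deallocs) = measure(1_000, 10_000, || {
        let v: Vec<u8> = Vec::with_capacity(16);
        black_box(v);
    });
    println!("allocs/op = {allocs:.2}, deallocs/op = {deallocs:.2}");
}
```

Because the counters are process-wide, any allocation made by another thread during the measured loop would also be counted; a single-threaded bench binary avoids that.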

Target (per 70-risks-and-open-questions.md R12)

Aim for < 5 allocations per write at steady state.

The bench gates on this: any write_message::encode scenario at ≥ 5 allocs/op causes the binary to exit with code 1.
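One way such a gate could be written is sketched below. The function `gate_failures` and the constant `MAX_ALLOCS_PER_WRITE` are hypothetical names chosen for illustration; only the threshold, the scenario-name prefix, and the exit code come from the text above.

```rust
/// R12 threshold from the text: a write scenario must stay below 5 allocs/op.
const MAX_ALLOCS_PER_WRITE: f64 = 5.0;

/// Returns the names of write scenarios that breach the gate.
/// Non-write scenarios are reported by the bench but not gated.
fn gate_failures<'a>(results: &[(&'a str, f64)]) -> Vec<&'a str> {
    results
        .iter()
        .filter(|(name, allocs)| {
            name.starts_with("write_message::encode") && *allocs >= MAX_ALLOCS_PER_WRITE
        })
        .map(|(name, _)| *name)
        .collect()
}

fn main() {
    // Sample values taken from the baseline table in this document.
    let results = [
        ("write_message::encode (Int32)", 2.00),
        ("write_message::encode (String, 5 chars)", 4.00),
        ("MxReferenceHandle::from_names", 2.00),
    ];
    let failed = gate_failures(&results);
    for name in &failed {
        eprintln!("R12 gate failed: {name}");
    }
    if !failed.is_empty() {
        std::process::exit(1);
    }
    println!("R12 gate passed");
}
```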

Baseline (release profile, Windows x64)

| scenario | iters | allocs/op | bytes/op | deallocs/op |
| --- | ---: | ---: | ---: | ---: |
| write_message::encode (Int32) | 10,000 | 2.00 | 44 | 2.00 |
| write_message::encode (Float32) | 10,000 | 2.00 | 44 | 2.00 |
| write_message::encode (Float64) | 10,000 | 2.00 | 52 | 2.00 |
| write_message::encode (Boolean) | 10,000 | 1.00 | 37 | 1.00 |
| write_message::encode (String, 5 chars) | 10,000 | 4.00 | 92 | 4.00 |
| MxReferenceHandle::from_names | 10,000 | 2.00 | 22 | 2.00 |
| NmxSubscriptionMessage::parse_inner (DataUpdate, Int32) | 10,000 | 1.00 | 72 | 1.00 |

Read

R12's < 5 allocs/write target is already met across the proven matrix:

  • Scalar writes sit at 1 to 2 allocs/op: Boolean takes 1, while Int32, Float32, and Float64 take 2. The two allocs come from (1) the encoder's Vec<u8> output buffer and (2) an internal scratch buffer in the value-encode path.
  • String writes hit 4 allocs/op (output buffer, UTF-16LE conversion buffer, the inner-length wrapper, and one more downstream).
  • MxReferenceHandle::from_names allocates twice (one per compute_name_signature call — UTF-16LE buffer for each name).
  • NmxSubscriptionMessage::parse_inner allocates once for the records: Vec<NmxSubscriptionRecord> collection.
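To illustrate why a UTF-16LE conversion costs one allocation per name, here is a minimal sketch. It is not the crate's compute_name_signature; `utf16le_bytes` is a hypothetical helper and the tag name "Tag.PV" is made up for the example.

```rust
/// Converts `name` to UTF-16LE bytes with a single up-front allocation.
fn utf16le_bytes(name: &str) -> Vec<u8> {
    // The UTF-16LE encoding is never longer than twice the UTF-8 length,
    // so this capacity is enough and the buffer never regrows.
    let mut out = Vec::with_capacity(name.len() * 2);
    for unit in name.encode_utf16() {
        out.extend_from_slice(&unit.to_le_bytes());
    }
    out
}

fn main() {
    let bytes = utf16le_bytes("Tag.PV");
    println!("{} bytes: {:02x?}", bytes.len(), bytes);
}
```

Calling a helper like this once for each of two names would account for two allocations, which is consistent with the from_names row in the table.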

Implications for F39

F39 (zero-copy pass) was scoped as the work to hit the R12 target. With the target already met, F39's scope tightens to:

  • Move the encoder's output buffer to bytes::BytesMut so consumers can split it without copying. Doesn't reduce alloc count but improves downstream zero-copy on the wire-write path.
  • Cache the per-handle UTF-16LE name conversion (the two compute_name_signature allocs per from_names) inside MxReferenceHandle if the same name is registered repeatedly.
  • Pool the per-frame scratch buffer at the session level so the per-write count drops from 2 → 1 on hot paths.

These are nice-to-have optimisations rather than R12 blockers.
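The scratch-pool idea can be sketched with the standard library alone. Everything here is an assumption for illustration: the `Session` type, its `encode_i32` method, and the little-endian payload layout are hypothetical and do not describe the crate's API or wire format.

```rust
/// Hypothetical session-level holder for a reusable scratch buffer.
struct Session {
    scratch: Vec<u8>,
}

impl Session {
    fn new() -> Self {
        // One allocation at session start, amortised over all later writes.
        Session { scratch: Vec::with_capacity(64) }
    }

    /// Encodes an i32 payload into the reused scratch buffer and
    /// returns a view of the encoded bytes.
    fn encode_i32(&mut self, value: i32) -> &[u8] {
        // `clear` keeps the capacity, so steady-state calls do not allocate.
        self.scratch.clear();
        self.scratch.extend_from_slice(&value.to_le_bytes());
        &self.scratch
    }
}

fn main() {
    let mut session = Session::new();
    let first = session.encode_i32(1).as_ptr();
    let second = session.encode_i32(2).as_ptr();
    // Same backing storage on both calls: the buffer was reused.
    println!("buffer reused: {}", first == second);
}
```

The trade-off is that the returned slice borrows the session, so a caller has to finish with one encoded frame before starting the next.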

Reproducing

cd rust
cargo bench -p mxaccess-codec

Numbers are deterministic for a given release-profile build on a given host. Some drift across hosts is expected: the warm-up and black_box hints keep iteration counts stable, but they do not control the underlying allocator's small-alloc fast-path heuristics.