Joseph Doherty a0fa5bedfd [F52.2] mxaccess-codec: thread-local name-signature cache
Adds a thread-local `HashMap<String, u16>` cache inside
`compute_name_signature`. Repeated calls with the same name (the hot
path inside `MxReferenceHandle::from_names`) skip the `to_lowercase`
allocation and the CRC-16/IBM walk entirely. Bounded at 1024 entries
per thread; on overflow the cache is cleared rather than evicted LRU
— any sane workload re-fills only the names it actively uses.

`MxReferenceHandle::from_names` drops from 2 → 0 allocs/op once warm
(bench delta in `design/M6-bench-baseline.md` § F52.2). Cold-path
behaviour is unchanged: first call with a new name still pays the
`to_lowercase` + cache-key `String` allocations.

Two new tests pin the cache: cache-hit returns the same value as
cold-compute, and cache overflow doesn't break correctness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 22:50:07 -04:00

# M6 — `mxaccess-codec` allocation-count baseline
Source: `cargo bench -p mxaccess-codec` (commit recording this file).
Harness: `crates/mxaccess-codec/benches/alloc_count.rs` — a thin
`GlobalAlloc` wrapper that increments two atomics on every `alloc` /
`dealloc` call, then runs each scenario for 10k iterations after a
1k-iteration warm-up.
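
The counting wrapper described above can be sketched as follows. This is a minimal illustration, not the contents of `alloc_count.rs`: the names `CountingAlloc`, `ALLOCS`, `DEALLOCS` and `allocs_per_op` are hypothetical, and byte accounting is left out.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical names; the real harness may differ.
pub static ALLOCS: AtomicUsize = AtomicUsize::new(0);
pub static DEALLOCS: AtomicUsize = AtomicUsize::new(0);

/// Thin wrapper over the system allocator that counts every call.
pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        DEALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` for `warmup` untimed iterations, then returns the mean
/// number of allocations per call over `iters` measured iterations.
pub fn allocs_per_op<F: FnMut()>(warmup: usize, iters: usize, mut f: F) -> f64 {
    for _ in 0..warmup {
        f();
    }
    let before = ALLOCS.load(Ordering::Relaxed);
    for _ in 0..iters {
        f();
    }
    let after = ALLOCS.load(Ordering::Relaxed);
    (after - before) as f64 / iters as f64
}
```

The counters are process-global, so any allocation made by another thread during the measured loop is included in the result. A single-threaded bench binary avoids that noise.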
## Target (per `70-risks-and-open-questions.md` R12)
> Aim for < 5 allocations per write at steady state.
The bench gates on this: any `write_message::encode` scenario at
≥ 5 allocs/op causes the binary to exit with code 1.
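
A gate of this kind might look like the sketch below. The function names and the `(scenario, allocs/op)` tuple shape are assumptions for illustration; matching scenarios by the `write_message::encode` name prefix is also an assumption, and it means `encode_to_bytes_mut` rows are gated too.

```rust
/// R12 budget: a gated scenario must stay strictly below this.
pub const ALLOC_BUDGET: f64 = 5.0;

/// Returns the names of gated scenarios that are at or over budget.
/// Only `write_message::encode*` scenarios are gated (assumed prefix match).
pub fn over_budget<'a>(results: &[(&'a str, f64)]) -> Vec<&'a str> {
    results
        .iter()
        .filter(|(name, allocs)| {
            name.starts_with("write_message::encode") && *allocs >= ALLOC_BUDGET
        })
        .map(|(name, _)| *name)
        .collect()
}

/// Process exit code for the bench binary: 0 when every gated
/// scenario is under budget, 1 otherwise.
pub fn gate_exit_code(results: &[(&str, f64)]) -> i32 {
    if over_budget(results).is_empty() {
        0
    } else {
        1
    }
}
```

A bench `main` would then finish with something like `std::process::exit(gate_exit_code(&results))` so that CI fails on a regression.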
## Baseline (release profile, Windows x64)
| scenario | iters | allocs/op | bytes/op | deallocs/op |
|------------------------------------------------|--------:|----------:|---------:|------------:|
| `write_message::encode` (Int32) | 10,000 | 2.00 | 44 | 2.00 |
| `write_message::encode` (Float32) | 10,000 | 2.00 | 44 | 2.00 |
| `write_message::encode` (Float64) | 10,000 | 2.00 | 52 | 2.00 |
| `write_message::encode` (Boolean) | 10,000 | 1.00 | 37 | 1.00 |
| `write_message::encode` (String, 5 chars) | 10,000 | 4.00 | 92 | 4.00 |
| `write_message::encode_to_bytes_mut` (Int32) | 10,000 | 2.00 | 44 | 2.00 |
| `MxReferenceHandle::from_names` (F52.2) | 10,000 | 0.00 | 0 | 0.00 |
| `NmxSubscriptionMessage::parse_inner` (DataUpdate, Int32) | 10,000 | 1.00 | 72 | 1.00 |
## Read
R12's < 5 allocs/write target is **already met** across the proven matrix:
- Scalar writes (Bool, Int32, Float32, Float64) sit at 1–2 allocs/op
(1 for Boolean, 2 for the others). The two allocs come from (1) the
encoder's `Vec<u8>` output buffer and (2) an internal scratch buffer
in the value-encode path.
- String writes hit 4 allocs/op (output buffer, UTF-16LE conversion
buffer, the inner-length wrapper, and one more downstream).
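- `MxReferenceHandle::from_names` allocated twice before F52.2 (once per
`compute_name_signature` call, a UTF-16LE buffer for each name). With
the F52.2 cache warm it allocates nothing, which is the 0.00 the table
reports.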
- `NmxSubscriptionMessage::parse_inner` allocates once for the
`records: Vec<NmxSubscriptionRecord>` collection.
## Implications for F39
F39 (zero-copy pass) was scoped as the work to *hit* the R12 target.
With the target already met, F39's scope tightens to:
- Move the encoder's output buffer to `bytes::BytesMut` so consumers
can split it without copying. Doesn't reduce alloc count but
improves downstream zero-copy on the wire-write path.
- Cache the per-handle UTF-16LE name conversion (the two
`compute_name_signature` allocs per `from_names`) inside
`MxReferenceHandle` if the same name is registered repeatedly.
- Pool the per-frame scratch buffer at the session level so the
per-write count drops from 2 → 1 on hot paths.
These are nice-to-have optimisations rather than R12 blockers.
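
The scratch-buffer pooling idea can be sketched with a buffer owned by the session and reused across writes. Everything here is hypothetical: the `Session` type, the method names, and the length-prefixed UTF-16LE layout are stand-ins chosen for illustration, not the codec's wire format.

```rust
/// Hypothetical session that owns one reusable scratch buffer.
pub struct Session {
    scratch: Vec<u8>,
}

impl Session {
    pub fn new() -> Self {
        Session { scratch: Vec::new() }
    }

    /// Encodes `s` as a u32 length prefix followed by UTF-16LE bytes.
    /// The scratch buffer is cleared, not dropped, so once it has grown
    /// to the working size only the output `Vec` allocates per call.
    pub fn encode_utf16le(&mut self, s: &str) -> Vec<u8> {
        self.scratch.clear(); // keeps the existing capacity
        for unit in s.encode_utf16() {
            self.scratch.extend_from_slice(&unit.to_le_bytes());
        }
        let mut out = Vec::with_capacity(4 + self.scratch.len());
        out.extend_from_slice(&(self.scratch.len() as u32).to_le_bytes());
        out.extend_from_slice(&self.scratch);
        out
    }

    /// Current capacity of the pooled scratch buffer.
    pub fn scratch_capacity(&self) -> usize {
        self.scratch.capacity()
    }
}
```

The trade-off is that the encode path now needs `&mut` access to session state, which is why this is scoped at the session level and not inside a free function.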
## F52 deltas
F52 split the three F39 sub-tasks into their own commits. Each
optimisation lands with a before/after row in this section.
### F52.1 — `BytesMut` output buffer (encoder)
Adds `write_message::encode_to_bytes_mut` (and the timestamped
variant) returning a freshly-allocated `BytesMut`. Allocation count
is **identical** to the existing `encode` path — the benefit is
downstream: consumers can `BytesMut::split_to` / `freeze` and forward
the body bytes to a wire-level sink without an intermediate copy.
| scenario | before (allocs/op) | after (allocs/op) |
|----------------------------------------------|-------------------:|------------------:|
| `write_message::encode` (Int32) | 2.00 | 2.00 |
| `write_message::encode_to_bytes_mut` (Int32) | — | 2.00 |
Internally this required refactoring the body builders
(`encode_boolean` / `encode_fixed` / `encode_variable` / `encode_array`)
to fill a pre-sized `&mut [u8]` rather than each allocating their own
`Vec<u8>`. The dispatcher computes the body size up front via small
`*_body_size` helpers and resizes the destination buffer (Vec or
BytesMut) once. This is also the prerequisite refactor for F52.3.
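
The size-then-fill pattern described above can be illustrated with `std` types only. The `Value` enum, the helper names and the byte layout are assumptions made for this sketch; the real builders and `*_body_size` helpers are not reproduced here.

```rust
/// Hypothetical value type; the layout below is illustrative only.
pub enum Value {
    Boolean(bool),
    Int32(i32),
    Float64(f64),
}

/// Body size computed up front, so the destination is resized once.
pub fn body_size(v: &Value) -> usize {
    match v {
        Value::Boolean(_) => 1,
        Value::Int32(_) => 4,
        Value::Float64(_) => 8,
    }
}

/// Fills a pre-sized slice; allocates nothing itself.
/// `dst.len()` must equal `body_size(v)`.
pub fn fill_body(v: &Value, dst: &mut [u8]) {
    match v {
        Value::Boolean(b) => dst[0] = *b as u8,
        Value::Int32(i) => dst.copy_from_slice(&i.to_le_bytes()),
        Value::Float64(f) => dst.copy_from_slice(&f.to_le_bytes()),
    }
}

/// Dispatcher: one resize of `out`, then header and body are written
/// in place into the newly added region.
pub fn encode_into(header: &[u8], v: &Value, out: &mut Vec<u8>) {
    let start = out.len();
    out.resize(start + header.len() + body_size(v), 0);
    let (head, body) = out[start..].split_at_mut(header.len());
    head.copy_from_slice(header);
    fill_body(v, body);
}
```

Because the builders only see `&mut [u8]`, the same code can target a `Vec<u8>` or any other buffer that exposes a mutable byte slice.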
### F52.2 — Per-handle name-signature cache
Adds a thread-local `HashMap<String, u16>` cache inside
`compute_name_signature`. The cache is per thread, not stored inside
`MxReferenceHandle` as the F39 bullet proposed. Repeated calls with the
same name (the hot path inside `MxReferenceHandle::from_names` when
handles are constructed many times) skip the `to_lowercase` allocation
and the CRC-16/IBM walk entirely. The cache is capped at 1024 entries
per thread; on overflow it is cleared outright rather than evicted LRU.
| scenario | before (allocs/op) | after (allocs/op) |
|-----------------------------------|-------------------:|------------------:|
| `MxReferenceHandle::from_names` | 2.00 | 0.00 |
Cold-path calls (the first call with a new name) still pay the
`to_lowercase` and cache-key `String` allocations; the cache only helps
on repeats. The 1k-iteration warm-up in the F38 harness is enough to
prime the cache, so the measurement loop sees only cache hits.
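
A minimal sketch of a thread-local cache with clear-on-overflow, under stated assumptions: the function names are hypothetical, the checksum uses the CRC-16/ARC parameters (reflected polynomial `0xA001`, zero initial value) as a stand-in for "CRC-16/IBM", and hashing the lowercased UTF-8 bytes is a guess at the input encoding.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

const CACHE_CAP: usize = 1024;

thread_local! {
    // One cache per thread, so no locking is needed.
    static SIG_CACHE: RefCell<HashMap<String, u16>> = RefCell::new(HashMap::new());
}

/// Bitwise CRC-16/ARC (assumed variant).
pub fn crc16_arc(data: &[u8]) -> u16 {
    let mut crc: u16 = 0;
    for &byte in data {
        crc ^= byte as u16;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0xA001 } else { crc >> 1 };
        }
    }
    crc
}

/// Cold path: allocates the lowercased copy, then hashes it.
pub fn name_signature_uncached(name: &str) -> u16 {
    crc16_arc(name.to_lowercase().as_bytes())
}

/// Warm path: a hit looks up by `&str` and allocates nothing.
pub fn name_signature_cached(name: &str) -> u16 {
    SIG_CACHE.with(|cell| {
        let mut cache = cell.borrow_mut();
        if let Some(&sig) = cache.get(name) {
            return sig;
        }
        let sig = name_signature_uncached(name);
        if cache.len() >= CACHE_CAP {
            cache.clear(); // drop everything; no LRU bookkeeping
        }
        cache.insert(name.to_owned(), sig); // cache-key allocation
        sig
    })
}

/// Number of entries in the calling thread's cache.
pub fn cache_len() -> usize {
    SIG_CACHE.with(|cell| cell.borrow().len())
}
```

In this sketch the key is the name as passed in, so `Tag` and `tag` occupy separate entries that map to the same signature. Keying on the lowercased form would merge them but would bring the `to_lowercase` allocation back onto the hit path.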
## Reproducing
```powershell
cd rust
cargo bench -p mxaccess-codec
```
Numbers are deterministic for a given release-profile build on a given
host. Drift across hosts is expected: the warm-up and `black_box` hints
keep iteration counts stable, but they do not control the underlying
allocator's small-alloc fast-path heuristics.