mxaccess/design/M6-bench-baseline.md
Joseph Doherty ceeaeefa71
[F52.3] mxaccess-codec: caller-supplied scratch buffer for write encoder
Adds `write_message::encode_into_bytes_mut` (and the timestamped
variant) which writes the encoded body into a caller-supplied
`BytesMut`. The buffer is cleared and resized in place each call;
once it has grown to the largest body the session will produce, it
allocates nothing further.

A session that holds a single `BytesMut` and reuses it across writes:

  - Int32 / Float32 / Float64: 2 → 1 allocs/op
    (only the `encode_scalar_value` scratch `Vec<u8>` remains)
  - Boolean: 1 → 0 allocs/op
    (no per-value scratch — the literal payload is a stack `[u8; 4]`)

Bench delta in `design/M6-bench-baseline.md` § F52.3. The
`encode_scalar_value` Vec is the remaining 1 alloc/op for fixed-width
scalars; eliminating it would require inlining the LE-bytes write
into the body slice (left for a follow-up since the F52 spec only
asks for 2 → 1).

Resolves F52 (all three optimisations landed: 4e76b44 F52.1,
a0fa5be F52.2, this commit F52.3). Existing `encode` / `encode_to_bytes_mut`
public surface unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 22:53:07 -04:00

# M6 — `mxaccess-codec` allocation-count baseline
Source: `cargo bench -p mxaccess-codec` (commit recording this file).
Harness: `crates/mxaccess-codec/benches/alloc_count.rs` — a thin
`GlobalAlloc` wrapper that increments two atomics on every `alloc` /
`dealloc` call, then runs each scenario for 10k iterations after a
1k-iteration warm-up.
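
A minimal sketch of what such a counting harness could look like, using only the standard library. The names `CountingAlloc`, `ALLOCS`, `DEALLOCS`, and `measure` are illustrative assumptions, not the identifiers used in `alloc_count.rs`, and the bytes/op column is not modelled here.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide counters bumped by the allocator wrapper below.
static ALLOCS: AtomicU64 = AtomicU64::new(0);
static DEALLOCS: AtomicU64 = AtomicU64::new(0);

/// Forwards to the system allocator and counts every call (hypothetical name).
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        DEALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` for `warmup` unmeasured iterations, then `iters` measured ones,
/// and returns (allocs/op, deallocs/op) for the measured loop only.
fn measure<F: FnMut()>(warmup: u64, iters: u64, mut f: F) -> (f64, f64) {
    for _ in 0..warmup {
        f();
    }
    let allocs_before = ALLOCS.load(Ordering::Relaxed);
    let deallocs_before = DEALLOCS.load(Ordering::Relaxed);
    for _ in 0..iters {
        f();
    }
    let allocs = ALLOCS.load(Ordering::Relaxed) - allocs_before;
    let deallocs = DEALLOCS.load(Ordering::Relaxed) - deallocs_before;
    (allocs as f64 / iters as f64, deallocs as f64 / iters as f64)
}
```

Under these assumptions a scenario would be measured as `measure(1_000, 10_000, || { /* encode one message */ })`. The counters are process-wide, so allocations made by other threads during the measured loop would also be counted.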
## Target (per `70-risks-and-open-questions.md` R12)
> Aim for < 5 allocations per write at steady state.
The bench gates on this: any `write_message::encode` scenario at
≥ 5 allocs/op causes the binary to exit with code 1.
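
The gate could be expressed roughly as below. This is a sketch under assumptions: `ScenarioResult` and `gate_exit_code` are hypothetical names, and matching scenarios by a `write_message::encode` name prefix is a guess at how the bench selects the rows it gates on.

```rust
/// One measured scenario: its label and steady-state allocs/op (hypothetical type).
struct ScenarioResult {
    name: &'static str,
    allocs_per_op: f64,
}

/// R12 budget: a write must stay strictly below this many allocations.
const MAX_ALLOCS_PER_WRITE: f64 = 5.0;

/// Returns the exit code the bench binary would use: 1 if any
/// `write_message::encode*` scenario is at or above the budget, else 0.
fn gate_exit_code(results: &[ScenarioResult]) -> i32 {
    let failed = results.iter().any(|r| {
        r.name.starts_with("write_message::encode")
            && r.allocs_per_op >= MAX_ALLOCS_PER_WRITE
    });
    if failed {
        1
    } else {
        0
    }
}
```

A bench `main` would then finish with `std::process::exit(gate_exit_code(&results))`.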
## Baseline (release profile, Windows x64)
| scenario | iters | allocs/op | bytes/op | deallocs/op |
|------------------------------------------------|--------:|----------:|---------:|------------:|
| `write_message::encode` (Int32) | 10,000 | 2.00 | 44 | 2.00 |
| `write_message::encode` (Float32) | 10,000 | 2.00 | 44 | 2.00 |
| `write_message::encode` (Float64) | 10,000 | 2.00 | 52 | 2.00 |
| `write_message::encode` (Boolean) | 10,000 | 1.00 | 37 | 1.00 |
| `write_message::encode` (String, 5 chars) | 10,000 | 4.00 | 92 | 4.00 |
| `write_message::encode_to_bytes_mut` (Int32) | 10,000 | 2.00 | 44 | 2.00 |
| `encode_into_bytes_mut` (Int32, pooled, F52.3) | 10,000 | 1.00 | 4 | 1.00 |
| `encode_into_bytes_mut` (Bool, pooled, F52.3) | 10,000 | 0.00 | 0 | 0.00 |
| `MxReferenceHandle::from_names` (F52.2) | 10,000 | 0.00 | 0 | 0.00 |
| `NmxSubscriptionMessage::parse_inner` (DataUpdate, Int32) | 10,000 | 1.00 | 72 | 1.00 |
## Read
R12's < 5 allocs/write target is **already met** across the proven matrix:
- Scalar writes sit at 1 to 2 allocs/op: 1 for Boolean, 2 for Int32,
Float32, and Float64. The two allocs come from (1) the encoder's
`Vec<u8>` output buffer and (2) an internal scratch buffer in the
value-encode path; Boolean skips the second.
- String writes hit 4 allocs/op (output buffer, UTF-16LE conversion
buffer, the inner-length wrapper, and one more downstream).
- `MxReferenceHandle::from_names` allocated twice before F52.2 (one per
`compute_name_signature` call, a UTF-16LE buffer for each name); the
0.00 row in the table above is the post-F52.2 figure.
- `NmxSubscriptionMessage::parse_inner` allocates once for the
`records: Vec<NmxSubscriptionRecord>` collection.
## Implications for F39
F39 (zero-copy pass) was scoped as the work to *hit* the R12 target.
With the target already met, F39's scope tightens to:
- Move the encoder's output buffer to `bytes::BytesMut` so consumers
can split it without copying. Doesn't reduce alloc count but
improves downstream zero-copy on the wire-write path.
- Cache the per-handle UTF-16LE name conversion (the two
`compute_name_signature` allocs per `from_names`) inside
`MxReferenceHandle` if the same name is registered repeatedly.
- Pool the per-frame scratch buffer at the session level so the
per-write count drops from 2 → 1 on hot paths.
These are nice-to-have optimisations rather than R12 blockers.
## F52 deltas
F52 split the three F39 sub-tasks into their own commits. Each
optimisation lands with a before/after row in this section.
### F52.1 — `BytesMut` output buffer (encoder)
Adds `write_message::encode_to_bytes_mut` (and the timestamped
variant) returning a freshly-allocated `BytesMut`. Allocation count
is **identical** to the existing `encode` path — the benefit is
downstream: consumers can `BytesMut::split_to` / `freeze` and forward
the body bytes to a wire-level sink without an intermediate copy.
| scenario | before (allocs/op) | after (allocs/op) |
|----------------------------------------------|-------------------:|------------------:|
| `write_message::encode` (Int32) | 2.00 | 2.00 |
| `write_message::encode_to_bytes_mut` (Int32) | — | 2.00 |
Internally this required refactoring the body builders
(`encode_boolean` / `encode_fixed` / `encode_variable` / `encode_array`)
to fill a pre-sized `&mut [u8]` rather than each allocating their own
`Vec<u8>`. The dispatcher computes the body size up front via small
`*_body_size` helpers and resizes the destination buffer (Vec or
BytesMut) once. This is also the prerequisite refactor for F52.3.
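
The size-then-fill shape described above can be sketched as follows. The `Scalar` enum, the helper names, and the payload-only body layout are illustrative assumptions; the real builders and wire format are not reproduced here, and a plain `Vec<u8>` stands in for either destination type.

```rust
/// Hypothetical value type standing in for the crate's real value enum.
enum Scalar {
    Int32(i32),
    Float64(f64),
}

/// Size helper: how many body bytes this value needs (payload only here).
fn scalar_body_size(v: &Scalar) -> usize {
    match v {
        Scalar::Int32(_) => 4,
        Scalar::Float64(_) => 8,
    }
}

/// Body builder: fills a slice the caller has already sized; it never allocates.
fn encode_scalar_into(v: &Scalar, dst: &mut [u8]) {
    match v {
        Scalar::Int32(x) => dst.copy_from_slice(&x.to_le_bytes()),
        Scalar::Float64(x) => dst.copy_from_slice(&x.to_le_bytes()),
    }
}

/// Dispatcher: compute the size once, size the destination once, fill in place.
fn encode(v: &Scalar) -> Vec<u8> {
    let mut out = vec![0u8; scalar_body_size(v)];
    encode_scalar_into(v, &mut out);
    out
}
```

Because the builder only sees a `&mut [u8]`, the same function can serve any destination that can hand out a mutable byte slice of the right length.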
### F52.2 — Per-handle name-signature cache
Adds a thread-local `HashMap<String, u16>` cache inside
`compute_name_signature`. Repeated calls with the same name (the hot
path inside `MxReferenceHandle::from_names` when handles are
constructed many times) skip the `to_lowercase` allocation entirely.
Capped at 1024 entries; on overflow the thread's cache is cleared.
| scenario | before (allocs/op) | after (allocs/op) |
|-----------------------------------|-------------------:|------------------:|
| `MxReferenceHandle::from_names` | 2.00 | 0.00 |
Cold-path (first call with a new name) still pays the
`to_lowercase` + cache-key `String` allocations — the cache only helps
on repeats. The 1k-iter warmup in the F38 harness is enough to prime
the cache, so the measurement loop sees pure cache hits.
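
A minimal sketch of a thread-local cache with the clear-on-overflow policy described above. `slow_signature` is a made-up stand-in, not the real signature algorithm, and `cached_signature`, `cache_len`, and keying on the raw (not lowercased) name are assumptions for illustration.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// Entry cap; on overflow the whole per-thread map is cleared.
const CACHE_CAP: usize = 1024;

thread_local! {
    static SIG_CACHE: RefCell<HashMap<String, u16>> = RefCell::new(HashMap::new());
}

/// Stand-in for the uncached computation (NOT the real algorithm).
/// `to_lowercase` allocates a new `String` on every call.
fn slow_signature(name: &str) -> u16 {
    name.to_lowercase()
        .bytes()
        .fold(0u16, |acc, b| acc.wrapping_mul(31).wrapping_add(b as u16))
}

/// Cached lookup: a hit borrows the `&str` key and allocates nothing.
fn cached_signature(name: &str) -> u16 {
    SIG_CACHE.with(|cell| {
        let mut cache = cell.borrow_mut();
        if let Some(&sig) = cache.get(name) {
            return sig;
        }
        // Cold path: pay for the computation and the owned cache key.
        let sig = slow_signature(name);
        if cache.len() >= CACHE_CAP {
            cache.clear();
        }
        cache.insert(name.to_owned(), sig);
        sig
    })
}

/// Number of entries in the current thread's cache (for inspection).
fn cache_len() -> usize {
    SIG_CACHE.with(|cell| cell.borrow().len())
}
```

Clearing the whole map on overflow keeps the hit path free of eviction bookkeeping, at the cost of an occasional burst of cold-path misses.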
### F52.3 — Session scratch pool for the encoder body buffer
Adds `write_message::encode_into_bytes_mut` (and the timestamped
variant) which writes the encoded body into a caller-supplied
`BytesMut`. The buffer is cleared and resized in place each call;
once it has grown to the largest body the session will produce, it
allocates nothing further.
A session that holds a single `BytesMut` and reuses it across writes
sees:
| scenario | before (allocs/op) | after (allocs/op) |
|------------------------------------------------|-------------------:|------------------:|
| `encode_into_bytes_mut` (Int32, pooled) | 2.00 | 1.00 |
| `encode_into_bytes_mut` (Boolean, pooled) | 1.00 | 0.00 |
The remaining `1.00` for Int32 is the `encode_scalar_value` scratch
`Vec<u8>`. Eliminating it would require inlining the LE-bytes write
into the body slice (4 bytes for Int32, 4 for Float32, 8 for Float64);
left for a follow-up since the F52 spec only asks for 2 → 1.
Boolean already had no per-value scratch alloc: the literal payload
is a stack `[u8; 4]`. Pooling the body buffer drops it to 0 allocs/op
at steady state, the cleanest result in the matrix.
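
The reuse pattern can be sketched with a standard-library `Vec<u8>` standing in for `BytesMut` (which offers `clear` and `resize` methods of the same names). `Session`, `write_i32`, and `encode_i32_into` are hypothetical names, and the payload-only 4-byte body is illustrative. The sketch also shows the inline little-endian write that the text above leaves for a follow-up, so it goes one step beyond what F52.3 itself landed.

```rust
/// Writes an Int32 body into a caller-owned buffer, reusing its capacity.
/// The 4-byte payload-only layout is illustrative, not the real wire format.
fn encode_i32_into(value: i32, buf: &mut Vec<u8>) {
    buf.clear(); // length goes to 0, capacity is kept
    buf.resize(4, 0); // allocates only if the capacity is too small
    // Inline LE write straight into the body slice: no per-value scratch Vec.
    buf[..4].copy_from_slice(&value.to_le_bytes());
}

/// Hypothetical session that owns one scratch buffer for all writes.
struct Session {
    scratch: Vec<u8>,
}

impl Session {
    fn new() -> Self {
        Session { scratch: Vec::new() }
    }

    /// Encodes into the pooled buffer and returns a view of the body bytes.
    fn write_i32(&mut self, value: i32) -> &[u8] {
        encode_i32_into(value, &mut self.scratch);
        &self.scratch
    }
}
```

After the first call has grown the buffer, later calls of the same or smaller size reuse the existing allocation, which is the steady state the table measures.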
## Reproducing
```powershell
cd rust
cargo bench -p mxaccess-codec
```
Numbers are deterministic for a given release-profile build on a given
host. Drift across hosts is expected: the warm-up and `black_box` hints
keep iteration counts stable, but they do not control the underlying
allocator's small-alloc fast-path heuristics.