# M6: `mxaccess-codec` allocation-count baseline

Source: `cargo bench -p mxaccess-codec` (commit recording this file).

Harness: `crates/mxaccess-codec/benches/alloc_count.rs`, a thin `GlobalAlloc` wrapper that increments two atomics on every `alloc` / `dealloc` call, then runs each scenario for 10k iterations after a 1k-iteration warm-up.

## Target (per `70-risks-and-open-questions.md` R12)

> Aim for < 5 allocations per write at steady state.

The bench gates on this: any `write_message::encode` scenario at ≥ 5 allocs/op causes the binary to exit with code 1.

## Baseline (release profile, Windows x64)

| scenario                                                  |   iters | allocs/op | bytes/op | deallocs/op |
|-----------------------------------------------------------|--------:|----------:|---------:|------------:|
| `write_message::encode` (Int32)                           |  10,000 |      2.00 |       44 |        2.00 |
| `write_message::encode` (Float32)                         |  10,000 |      2.00 |       44 |        2.00 |
| `write_message::encode` (Float64)                         |  10,000 |      2.00 |       52 |        2.00 |
| `write_message::encode` (Boolean)                         |  10,000 |      1.00 |       37 |        1.00 |
| `write_message::encode` (String, 5 chars)                 |  10,000 |      4.00 |       92 |        4.00 |
| `write_message::encode_to_bytes_mut` (Int32)              |  10,000 |      2.00 |       44 |        2.00 |
| `encode_into_bytes_mut` (Int32, pooled, F52.3)            |  10,000 |      1.00 |        4 |        1.00 |
| `encode_into_bytes_mut` (Bool, pooled, F52.3)             |  10,000 |      0.00 |        0 |        0.00 |
| `MxReferenceHandle::from_names` (F52.2)                   |  10,000 |      0.00 |        0 |        0.00 |
| `NmxSubscriptionMessage::parse_inner` (DataUpdate, Int32) |  10,000 |      1.00 |       72 |        1.00 |

## Read

R12's < 5 allocs/write target is **already met** across the proven matrix:

- Scalar writes (Bool, Int32, Float32, Float64) sit at 1–2 allocs/op. The two allocs come from (1) the encoder's `Vec` output buffer and (2) an internal scratch buffer in the value-encode path.
- String writes hit 4 allocs/op (output buffer, UTF-16LE conversion buffer, the inner-length wrapper, and one more downstream).
- `MxReferenceHandle::from_names` allocated twice before F52.2 (one per `compute_name_signature` call, a UTF-16LE buffer for each name); the baseline table shows the post-F52.2 figure of 0.00.
- `NmxSubscriptionMessage::parse_inner` allocates once for the `records: Vec` collection.

## Implications for F39

F39 (zero-copy pass) was scoped as the work to *hit* the R12 target. With the target already met, F39's scope tightens to:

- Move the encoder's output buffer to `bytes::BytesMut` so consumers can split it without copying. This doesn't reduce the alloc count but improves downstream zero-copy on the wire-write path.
- Cache the per-handle UTF-16LE name conversion (the two `compute_name_signature` allocs per `from_names`) inside `MxReferenceHandle` if the same name is registered repeatedly.
- Pool the per-frame scratch buffer at the session level so the per-write count drops from 2 → 1 on hot paths.

These are nice-to-have optimisations rather than R12 blockers.

## F52 deltas

F52 split the three F39 sub-tasks into their own commits. Each optimisation lands with a before/after row in this section.

### F52.1: `BytesMut` output buffer (encoder)

Adds `write_message::encode_to_bytes_mut` (and the timestamped variant) returning a freshly-allocated `BytesMut`. Allocation count is **identical** to the existing `encode` path. The benefit is downstream: consumers can `BytesMut::split_to` / `freeze` and forward the body bytes to a wire-level sink without an intermediate copy.

| scenario                                     | before (allocs/op) | after (allocs/op) |
|----------------------------------------------|-------------------:|------------------:|
| `write_message::encode` (Int32)              |               2.00 |              2.00 |
| `write_message::encode_to_bytes_mut` (Int32) |                n/a |              2.00 |

Internally this required refactoring the body builders (`encode_boolean` / `encode_fixed` / `encode_variable` / `encode_array`) to fill a pre-sized `&mut [u8]` rather than each allocating their own `Vec`.
The dispatcher computes the body size up front via small `*_body_size` helpers and resizes the destination buffer (`Vec` or `BytesMut`) once. This is also the prerequisite refactor for F52.3.

### F52.2: Per-handle name-signature cache

Adds a thread-local `HashMap` cache inside `compute_name_signature`. Repeated calls with the same name (the hot path inside `MxReferenceHandle::from_names` when handles are constructed many times) skip the `to_lowercase` allocation entirely. Capped at 1024 entries; on overflow the thread's cache is cleared.

| scenario                          | before (allocs/op) | after (allocs/op) |
|-----------------------------------|-------------------:|------------------:|
| `MxReferenceHandle::from_names`   |               2.00 |              0.00 |

Cold-path (first call with a new name) still pays the `to_lowercase` + cache-key `String` allocations; the cache only helps on repeats. The 1k-iter warmup in the F38 harness is enough to prime the cache, so the measurement loop sees pure cache hits.

### F52.3: Session scratch pool for the encoder body buffer

Adds `write_message::encode_into_bytes_mut` (and the timestamped variant) which writes the encoded body into a caller-supplied `BytesMut`. The buffer is cleared and resized in place each call; once it has grown to the largest body the session will produce, it allocates nothing further.

A session that holds a single `BytesMut` and reuses it across writes sees:

| scenario                                       | before (allocs/op) | after (allocs/op) |
|------------------------------------------------|-------------------:|------------------:|
| `encode_into_bytes_mut` (Int32, pooled)        |               2.00 |              1.00 |
| `encode_into_bytes_mut` (Boolean, pooled)      |               1.00 |              0.00 |

The remaining `1.00` for Int32 is the `encode_scalar_value` scratch `Vec`. Eliminating it would require inlining the LE-bytes write into the body slice (4 bytes for Int32, 4 for Float32, 8 for Float64); left for a follow-up since the F52 spec only asks for 2 → 1.

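
As a minimal sketch of the two ideas in this section (reusing one buffer across writes, and writing the little-endian value bytes straight into the body slice instead of going through a scratch `Vec`), the following uses a plain `Vec<u8>` as a stand-in for `BytesMut` so that it builds with the standard library alone. The function name `encode_i32_into` and the `header` parameter are hypothetical; the real body layout is not reproduced here.

```rust
/// Hypothetical pooled encoder: `buf` plays the role of the session's
/// reusable `BytesMut`, `header` stands in for whatever precedes the value.
fn encode_i32_into(buf: &mut Vec<u8>, header: &[u8], value: i32) {
    // Clear and resize in place: once `buf` has reached this size,
    // neither call needs to touch the allocator again.
    buf.clear();
    buf.resize(header.len() + 4, 0);

    // Split the pre-sized body into header and value regions.
    let (head, tail) = buf.split_at_mut(header.len());
    head.copy_from_slice(header);

    // `to_le_bytes` returns a stack `[u8; 4]`, so no scratch `Vec` is needed.
    tail.copy_from_slice(&value.to_le_bytes());
}
```

After the first call has grown the buffer, later calls with bodies of the same or smaller size reuse the existing capacity, which is the behaviour the pooled rows in the table above rely on.
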
Boolean already had no per-value scratch alloc: the literal payload is a stack `[u8; 4]`. Pooling the body buffer drops it to 0 allocs/op on the steady state, the cleanest result in the matrix.

## Reproducing

```powershell
cd rust
cargo bench -p mxaccess-codec
```

Numbers are deterministic per release-profile build on a given host. Numeric drift across hosts is expected (the warm-up + `black_box` hints keep iteration counts stable, not the underlying allocator's small-alloc fast-path heuristics).
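
For readers without access to the harness, the counting approach described at the top of this file can be approximated with standard-library Rust. This is a sketch under stated assumptions, not the contents of `alloc_count.rs`: the names `CountingAlloc`, `allocs_per_op`, and `passes_gate` are hypothetical, and it tracks call counts only, not bytes.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Two global counters, one per allocator entry point.
static ALLOCS: AtomicUsize = AtomicUsize::new(0);
static DEALLOCS: AtomicUsize = AtomicUsize::new(0);

/// Forwards every request to the system allocator and counts the call.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        DEALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `warmup` uncounted iterations, then `iters` counted ones, and
/// returns (allocs/op, deallocs/op) for the counted part.
fn allocs_per_op<F: FnMut()>(warmup: usize, iters: usize, mut f: F) -> (f64, f64) {
    for _ in 0..warmup {
        f();
    }
    let a0 = ALLOCS.load(Ordering::Relaxed);
    let d0 = DEALLOCS.load(Ordering::Relaxed);
    for _ in 0..iters {
        f();
    }
    let allocs = ALLOCS.load(Ordering::Relaxed) - a0;
    let deallocs = DEALLOCS.load(Ordering::Relaxed) - d0;
    (allocs as f64 / iters as f64, deallocs as f64 / iters as f64)
}

/// Mirrors the R12 gate described above: a scenario fails at >= 5 allocs/op.
fn passes_gate(allocs_per_op: f64) -> bool {
    allocs_per_op < 5.0
}
```

Calling `allocs_per_op(1_000, 10_000, || { /* scenario */ })` mirrors the 1k warm-up and 10k measured iterations used here. Because the counters are process-wide, the measurement is only meaningful while no other thread is allocating.
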