ceeaeefa71
Adds `write_message::encode_into_bytes_mut` (and the timestamped
variant) which writes the encoded body into a caller-supplied
`BytesMut`. The buffer is cleared and resized in place each call;
once it has grown to the largest body the session will produce, it
allocates nothing further.

A session that holds a single `BytesMut` and reuses it across writes:
- Int32 / Float32 / Float64: 2 → 1 allocs/op
  (only the `encode_scalar_value` scratch `Vec<u8>` remains)
- Boolean: 1 → 0 allocs/op
  (no per-value scratch; the literal payload is a stack `[u8; 4]`)

Bench delta in `design/M6-bench-baseline.md` § F52.3. The
`encode_scalar_value` Vec is the remaining 1 alloc/op for fixed-width
scalars; eliminating it would require inlining the LE-bytes write
into the body slice (left for a follow-up since the F52 spec only
asks for 2 → 1).

Resolves F52 (all three optimisations landed: 4e76b44 F52.1,
a0fa5be F52.2, this commit F52.3). Existing `encode` / `encode_to_bytes_mut`
public surface unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# M6 — `mxaccess-codec` allocation-count baseline

Source: `cargo bench -p mxaccess-codec` (commit recording this file).
Harness: `crates/mxaccess-codec/benches/alloc_count.rs`, a thin
`GlobalAlloc` wrapper that increments two atomics on every `alloc` /
`dealloc` call, then runs each scenario for 10k iterations after a
1k-iteration warm-up.
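
The harness shape described above can be sketched roughly as follows. This is a minimal illustration, not the contents of `alloc_count.rs`: the names `CountingAlloc`, `ALLOCS`, `DEALLOCS` and `measure` are assumptions made for the example, and only the standard library is used.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

// Counters bumped by the wrapper below (illustrative names).
static ALLOCS: AtomicU64 = AtomicU64::new(0);
static DEALLOCS: AtomicU64 = AtomicU64::new(0);

/// Thin wrapper: forwards to the system allocator and counts each call.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        DEALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` for `warmup` uncounted iterations, then `iters` counted ones,
/// and returns (allocs/op, deallocs/op).
fn measure<F: FnMut()>(warmup: u64, iters: u64, mut f: F) -> (f64, f64) {
    for _ in 0..warmup {
        f();
    }
    let a0 = ALLOCS.load(Ordering::Relaxed);
    let d0 = DEALLOCS.load(Ordering::Relaxed);
    for _ in 0..iters {
        f();
    }
    let allocs = ALLOCS.load(Ordering::Relaxed) - a0;
    let deallocs = DEALLOCS.load(Ordering::Relaxed) - d0;
    (allocs as f64 / iters as f64, deallocs as f64 / iters as f64)
}
```

A scenario would then be measured with something like `measure(1_000, 10_000, || { /* encode one message */ })`. The warm-up matters because caches and reusable buffers reach their steady state before counting starts.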

## Target (per `70-risks-and-open-questions.md` R12)

> Aim for < 5 allocations per write at steady state.

The bench gates on this: any `write_message::encode` scenario at
≥ 5 allocs/op causes the binary to exit with code 1.
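
A gate of this kind can be expressed as a small pure function. The sketch below is hypothetical: the `Scenario` struct, the constant name, and the prefix match on the scenario name are assumptions for illustration, not the bench's actual code.

```rust
/// One measured scenario (field names are illustrative).
struct Scenario {
    name: &'static str,
    allocs_per_op: f64,
}

/// R12 threshold: a write must stay below 5 allocations.
const MAX_ALLOCS_PER_WRITE: f64 = 5.0;

/// Returns the process exit code: 1 if any `write_message::encode`
/// scenario reaches the threshold, 0 otherwise.
fn gate_exit_code(results: &[Scenario]) -> i32 {
    let failed = results
        .iter()
        // Assumption: gated scenarios are picked by name prefix.
        .filter(|s| s.name.starts_with("write_message::encode"))
        .any(|s| s.allocs_per_op >= MAX_ALLOCS_PER_WRITE);
    if failed {
        1
    } else {
        0
    }
}
```

The bench binary would finish with `std::process::exit(gate_exit_code(&results))`, so a regression shows up as a failing command rather than only as a changed number in a table.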

## Baseline (release profile, Windows x64)

| scenario                                       |   iters | allocs/op | bytes/op | deallocs/op |
|------------------------------------------------|--------:|----------:|---------:|------------:|
| `write_message::encode` (Int32)                |  10,000 |      2.00 |       44 |        2.00 |
| `write_message::encode` (Float32)              |  10,000 |      2.00 |       44 |        2.00 |
| `write_message::encode` (Float64)              |  10,000 |      2.00 |       52 |        2.00 |
| `write_message::encode` (Boolean)              |  10,000 |      1.00 |       37 |        1.00 |
| `write_message::encode` (String, 5 chars)      |  10,000 |      4.00 |       92 |        4.00 |
| `write_message::encode_to_bytes_mut` (Int32)   |  10,000 |      2.00 |       44 |        2.00 |
| `encode_into_bytes_mut` (Int32, pooled, F52.3) |  10,000 |      1.00 |        4 |        1.00 |
| `encode_into_bytes_mut` (Bool, pooled, F52.3)  |  10,000 |      0.00 |        0 |        0.00 |
| `MxReferenceHandle::from_names` (F52.2)        |  10,000 |      0.00 |        0 |        0.00 |
| `NmxSubscriptionMessage::parse_inner`          |  10,000 |      1.00 |       72 |        1.00 |
| (DataUpdate, Int32)                            |         |           |          |             |

## Read

R12's < 5 allocs/write target is **already met** across the proven matrix:

- Scalar writes (Bool, Int32, Float32, Float64) sit at 1–2 allocs/op.
  The two allocs come from (1) the encoder's `Vec<u8>` output buffer
  and (2) an internal scratch buffer in the value-encode path.
- String writes hit 4 allocs/op (output buffer, UTF-16LE conversion
  buffer, the inner-length wrapper, and one more downstream).
- `MxReferenceHandle::from_names` allocated twice before F52.2 (one per
  `compute_name_signature` call: a UTF-16LE buffer for each name). The
  0.00 in the table above is the post-F52.2 figure, measured on cache
  hits.
- `NmxSubscriptionMessage::parse_inner` allocates once for the
  `records: Vec<NmxSubscriptionRecord>` collection.

## Implications for F39

F39 (zero-copy pass) was scoped as the work to *hit* the R12 target.
With the target already met, F39's scope tightens to:

- Move the encoder's output buffer to `bytes::BytesMut` so consumers
  can split it without copying. Doesn't reduce alloc count but
  improves downstream zero-copy on the wire-write path.
- Cache the per-handle UTF-16LE name conversion (the two
  `compute_name_signature` allocs per `from_names`) inside
  `MxReferenceHandle` if the same name is registered repeatedly.
- Pool the per-frame scratch buffer at the session level so the
  per-write count drops from 2 → 1 on hot paths.

These are nice-to-have optimisations rather than R12 blockers.

## F52 deltas

F52 split the three F39 sub-tasks into their own commits. Each
optimisation lands with a before/after row in this section.

### F52.1 — `BytesMut` output buffer (encoder)

Adds `write_message::encode_to_bytes_mut` (and the timestamped
variant) returning a freshly-allocated `BytesMut`. Allocation count
is **identical** to the existing `encode` path; the benefit is
downstream: consumers can `BytesMut::split_to` / `freeze` and forward
the body bytes to a wire-level sink without an intermediate copy.

| scenario                                     | before (allocs/op) | after (allocs/op) |
|----------------------------------------------|-------------------:|------------------:|
| `write_message::encode` (Int32)              |               2.00 |              2.00 |
| `write_message::encode_to_bytes_mut` (Int32) |                  — |              2.00 |

Internally this required refactoring the body builders
(`encode_boolean` / `encode_fixed` / `encode_variable` / `encode_array`)
to fill a pre-sized `&mut [u8]` rather than each allocating their own
`Vec<u8>`. The dispatcher computes the body size up front via small
`*_body_size` helpers and resizes the destination buffer (`Vec` or
`BytesMut`) once. This is also the prerequisite refactor for F52.3.
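
The size-then-fill pattern from the paragraph above might look like the sketch below. The body layout (one tag byte followed by the little-endian payload) is invented for the example and is not the codec's wire format; `Scalar`, `fixed_body_size`, `fill_fixed` and `encode_sketch` are likewise illustrative names.

```rust
/// Hypothetical layout for this sketch only: 1 tag byte + LE payload.
const TAG_LEN: usize = 1;

enum Scalar {
    Int32(i32),
    Float64(f64),
}

/// Size helper: known up front, so the destination is resized exactly once.
fn fixed_body_size(v: &Scalar) -> usize {
    TAG_LEN
        + match v {
            Scalar::Int32(_) => 4,
            Scalar::Float64(_) => 8,
        }
}

/// Body builder: fills a pre-sized slice instead of allocating its own Vec.
fn fill_fixed(v: &Scalar, body: &mut [u8]) {
    assert_eq!(body.len(), fixed_body_size(v));
    match v {
        Scalar::Int32(x) => {
            body[0] = 0x01;
            body[1..].copy_from_slice(&x.to_le_bytes());
        }
        Scalar::Float64(x) => {
            body[0] = 0x02;
            body[1..].copy_from_slice(&x.to_le_bytes());
        }
    }
}

/// Dispatcher: one sizing step, one allocation, then fill in place.
fn encode_sketch(v: &Scalar) -> Vec<u8> {
    let mut out = vec![0u8; fixed_body_size(v)];
    fill_fixed(v, &mut out);
    out
}
```

Because the builder only sees a `&mut [u8]`, the same function can serve a `Vec<u8>` destination, a `BytesMut` destination, or a buffer the caller already owns.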

### F52.2 — Per-handle name-signature cache

Adds a thread-local `HashMap<String, u16>` cache inside
`compute_name_signature`. Repeated calls with the same name (the hot
path inside `MxReferenceHandle::from_names` when handles are
constructed many times) skip the `to_lowercase` allocation entirely.
Capped at 1024 entries; on overflow the thread's cache is cleared.

| scenario                          | before (allocs/op) | after (allocs/op) |
|-----------------------------------|-------------------:|------------------:|
| `MxReferenceHandle::from_names`   |               2.00 |              0.00 |

Cold-path (first call with a new name) still pays the
`to_lowercase` + cache-key `String` allocations; the cache only helps
on repeats. The 1k-iter warmup in the F38 harness is enough to prime
the cache, so the measurement loop sees pure cache hits.
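
A cache with these properties (thread-local, keyed by name, capped at 1024 entries, cleared on overflow) could be sketched as below. The hash in `compute_uncached` is a placeholder and is not the project's signature algorithm; `SIG_CACHE`, `cached_signature` and `cache_len` are names made up for the example.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

const CACHE_CAP: usize = 1024;

thread_local! {
    static SIG_CACHE: RefCell<HashMap<String, u16>> = RefCell::new(HashMap::new());
}

/// Placeholder computation (NOT the real signature): lowercases the
/// name, which allocates, then folds its UTF-16 code units.
fn compute_uncached(name: &str) -> u16 {
    name.to_lowercase()
        .encode_utf16()
        .fold(0u16, |acc, unit| acc.rotate_left(5) ^ unit)
}

fn cached_signature(name: &str) -> u16 {
    SIG_CACHE.with(|cell| {
        let mut cache = cell.borrow_mut();
        // Hit: lookup by `&str`, so no key String and no lowercase copy.
        if let Some(&sig) = cache.get(name) {
            return sig;
        }
        // Miss: pay for the computation and the owned cache key.
        let sig = compute_uncached(name);
        if cache.len() >= CACHE_CAP {
            cache.clear(); // overflow policy: drop everything on this thread
        }
        cache.insert(name.to_owned(), sig);
        sig
    })
}

/// Test helper: number of entries cached on the current thread.
fn cache_len() -> usize {
    SIG_CACHE.with(|cell| cell.borrow().len())
}
```

Clearing the whole map on overflow is cruder than an LRU policy, but it keeps the hit path to a single hash lookup and bounds memory per thread.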

### F52.3 — Session scratch pool for the encoder body buffer

Adds `write_message::encode_into_bytes_mut` (and the timestamped
variant) which writes the encoded body into a caller-supplied
`BytesMut`. The buffer is cleared and resized in place each call;
once it has grown to the largest body the session will produce, it
allocates nothing further.
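
The clear-and-resize reuse described above can be illustrated with a standard-library stand-in. In this sketch `Vec<u8>` plays the role of `BytesMut` (both provide `clear` and `resize`), and `encode_into`, `Session` and its `scratch` field are hypothetical names; the body is simply the payload bytes, not the real message layout.

```rust
/// Writes a body into a caller-supplied buffer.
/// Assumption: `Vec<u8>` stands in for `BytesMut`.
fn encode_into(payload: &[u8], buf: &mut Vec<u8>) {
    buf.clear(); // drop old contents, keep the capacity
    buf.resize(payload.len(), 0); // allocates only if capacity is too small
    buf.copy_from_slice(payload);
}

/// A session that owns one scratch buffer and reuses it for every write.
struct Session {
    scratch: Vec<u8>,
}

impl Session {
    fn new() -> Self {
        Session { scratch: Vec::new() }
    }

    /// Encodes into the pooled buffer and lends the bytes to the caller.
    fn write(&mut self, payload: &[u8]) -> &[u8] {
        encode_into(payload, &mut self.scratch);
        &self.scratch
    }
}
```

After the first write of the largest body, later writes of equal or smaller bodies fit in the existing capacity, which is why the pooled rows in the table only count allocations made outside the body buffer.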

A session that holds a single `BytesMut` and reuses it across writes
sees:

| scenario                                       | before (allocs/op) | after (allocs/op) |
|------------------------------------------------|-------------------:|------------------:|
| `encode_into_bytes_mut` (Int32, pooled)        |               2.00 |              1.00 |
| `encode_into_bytes_mut` (Boolean, pooled)      |               1.00 |              0.00 |

The remaining `1.00` for Int32 is the `encode_scalar_value` scratch
`Vec<u8>`. Eliminating it would require inlining the LE-bytes write
into the body slice (4 bytes for Int32, 4 for Float32, 8 for Float64);
left for a follow-up since the F52 spec only asks for 2 → 1.
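
To make the follow-up concrete, the sketch below contrasts a scratch-`Vec` path with an inlined little-endian write. Both functions are hypothetical stand-ins (`scalar_to_scratch` is not the real `encode_scalar_value`), and the `offset` parameter is an assumption about how the value lands inside the body.

```rust
/// Stand-in for a value encoder that returns a heap-allocated scratch Vec.
fn scalar_to_scratch(v: i32) -> Vec<u8> {
    v.to_le_bytes().to_vec() // one allocation per value
}

/// Current shape as described: encode into scratch, then copy into the body.
fn fill_via_scratch(v: i32, body: &mut [u8], offset: usize) {
    let scratch = scalar_to_scratch(v);
    body[offset..offset + scratch.len()].copy_from_slice(&scratch);
}

/// Follow-up shape: write the LE bytes straight into the body slice.
/// `to_le_bytes` returns a stack array, so nothing is allocated here.
fn fill_inline(v: i32, body: &mut [u8], offset: usize) {
    body[offset..offset + 4].copy_from_slice(&v.to_le_bytes());
}
```

The two paths produce identical bytes; the inlined one just skips the intermediate heap buffer, which is the allocation still counted in the Int32 row.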

Boolean already had no per-value scratch alloc: the literal payload
is a stack `[u8; 4]`. Pooling the body buffer drops it to 0 allocs/op
at steady state, the cleanest result in the matrix.

## Reproducing

```powershell
cd rust
cargo bench -p mxaccess-codec
```

Numbers are deterministic per release-profile build on a given host.
Numeric drift across hosts is expected (the warm-up + `black_box` hints
keep iteration counts stable, not the underlying allocator's
small-alloc fast-path heuristics).