dohertj2/mxaccess


Joseph Doherty ceeaeefa71


[F52.3] mxaccess-codec: caller-supplied scratch buffer for write encoder

Adds `write_message::encode_into_bytes_mut` (and the timestamped
variant) which writes the encoded body into a caller-supplied
`BytesMut`. The buffer is cleared and resized in place each call;
once it has grown to the largest body the session will produce, it
allocates nothing further.

A session that holds a single `BytesMut` and reuses it across writes:

  - Int32 / Float32 / Float64: 2 → 1 allocs/op
    (only the `encode_scalar_value` scratch `Vec<u8>` remains)
  - Boolean: 1 → 0 allocs/op
    (no per-value scratch — the literal payload is a stack `[u8; 4]`)

Bench delta in `design/M6-bench-baseline.md` § F52.3. The
`encode_scalar_value` Vec is the remaining 1 alloc/op for fixed-width
scalars; eliminating it would require inlining the LE-bytes write
into the body slice (left for a follow-up since the F52 spec only
asks for 2 → 1).

Resolves F52 (all three optimisations landed: 4e76b44 F52.1,
a0fa5be F52.2, this commit F52.3). Existing `encode` / `encode_to_bytes_mut`
public surface unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-06 22:53:07 -04:00


M6 — `mxaccess-codec` allocation-count baseline

Source: `cargo bench -p mxaccess-codec` (commit recording this file). Harness: `crates/mxaccess-codec/benches/alloc_count.rs`, a thin `GlobalAlloc` wrapper that increments two atomics on every alloc / dealloc call, then runs each scenario for 10k iterations after a 1k-iteration warm-up.
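As a rough illustration of the harness described above, a counting allocator can be sketched as follows. All names here (`CountingAlloc`, `ALLOCS`, `DEALLOCS`, `allocs_per_op`) are assumptions made for this sketch; the real code in `alloc_count.rs` may be structured differently and also tracks bytes/op, which this sketch omits.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical counters; process-wide, so other threads also contribute.
static ALLOCS: AtomicU64 = AtomicU64::new(0);
static DEALLOCS: AtomicU64 = AtomicU64::new(0);

/// Thin wrapper over the system allocator that counts every call.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        DEALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` for `warmup` uncounted iterations, then `iters` counted ones,
/// and returns (allocs/op, deallocs/op) over the counted window.
fn allocs_per_op<F: FnMut()>(warmup: u64, iters: u64, mut f: F) -> (f64, f64) {
    for _ in 0..warmup {
        f();
    }
    let a0 = ALLOCS.load(Ordering::Relaxed);
    let d0 = DEALLOCS.load(Ordering::Relaxed);
    for _ in 0..iters {
        f();
    }
    let allocs = ALLOCS.load(Ordering::Relaxed) - a0;
    let deallocs = DEALLOCS.load(Ordering::Relaxed) - d0;
    (allocs as f64 / iters as f64, deallocs as f64 / iters as f64)
}

/// Example scenario: one heap allocation and one free per iteration.
fn one_alloc_scenario() {
    let v: Vec<u8> = black_box(Vec::with_capacity(64));
    drop(v);
}
```

Calling `allocs_per_op(1_000, 10_000, one_alloc_scenario)` mirrors the 1k warm-up and 10k measured iterations quoted above. `black_box` keeps the optimiser from removing the allocation under measurement.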

Target (per `70-risks-and-open-questions.md` R12)

Aim for < 5 allocations per write at steady state.

The bench gates on this: any write_message::encode scenario at ≥ 5 allocs/op causes the binary to exit with code 1.
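The gate can be pictured as a small check over the measured rows. This is a minimal sketch, not the harness's actual code: `Row`, `gate_exit_code`, and the prefix match used to select gated scenarios are all assumptions.

```rust
/// One measured scenario: its label and its allocs/op (hypothetical type).
struct Row<'a> {
    scenario: &'a str,
    allocs_per_op: f64,
}

/// Returns the process exit code: 1 if any gated scenario is at or above
/// `limit` allocs/op, otherwise 0.
fn gate_exit_code(rows: &[Row<'_>], limit: f64) -> i32 {
    let mut code = 0;
    for row in rows {
        // Assumption: gated scenarios are identified by this label prefix.
        let gated = row.scenario.starts_with("write_message::encode");
        if gated && row.allocs_per_op >= limit {
            eprintln!(
                "FAIL {}: {:.2} allocs/op (target < {})",
                row.scenario, row.allocs_per_op, limit
            );
            code = 1;
        }
    }
    code
}
```

A bench binary would finish with something like `std::process::exit(gate_exit_code(&rows, 5.0))`, so a regression shows up as a failed CI step.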

Baseline (release profile, Windows x64)

| scenario | iters | allocs/op | bytes/op | deallocs/op |
| --- | --- | --- | --- | --- |
| `write_message::encode` (Int32) | 10,000 | 2.00 | 44 | 2.00 |
| `write_message::encode` (Float32) | 10,000 | 2.00 | 44 | 2.00 |
| `write_message::encode` (Float64) | 10,000 | 2.00 | 52 | 2.00 |
| `write_message::encode` (Boolean) | 10,000 | 1.00 | 37 | 1.00 |
| `write_message::encode` (String, 5 chars) | 10,000 | 4.00 | 92 | 4.00 |
| `write_message::encode_to_bytes_mut` (Int32) | 10,000 | 2.00 | 44 | 2.00 |
| `encode_into_bytes_mut` (Int32, pooled, F52.3) | 10,000 | 1.00 | 4 | 1.00 |
| `encode_into_bytes_mut` (Bool, pooled, F52.3) | 10,000 | 0.00 | 0 | 0.00 |
| `MxReferenceHandle::from_names` (F52.2) | 10,000 | 0.00 | 0 | 0.00 |
| `NmxSubscriptionMessage::parse_inner` (DataUpdate, Int32) | 10,000 | 1.00 | 72 | 1.00 |

Read

R12's < 5 allocs/write target is already met across the proven matrix:

- Scalar writes (Bool, Int32, Float32, Float64) sit at 1–2 allocs/op. The two allocs come from (1) the encoder's `Vec<u8>` output buffer and (2) an internal scratch buffer in the value-encode path. Boolean pays only the first.
- String writes hit 4 allocs/op (output buffer, UTF-16LE conversion buffer, the inner-length wrapper, and one more downstream).
- `MxReferenceHandle::from_names` allocated twice before the F52.2 cache (one per `compute_name_signature` call: a UTF-16LE buffer for each name). The 0.00 row in the table above is the post-F52.2 steady state.
- `NmxSubscriptionMessage::parse_inner` allocates once for the `records: Vec<NmxSubscriptionRecord>` collection.

Implications for F39

F39 (zero-copy pass) was scoped as the work to hit the R12 target. With the target already met, F39's scope tightens to:

- Move the encoder's output buffer to `bytes::BytesMut` so consumers can split it without copying. Doesn't reduce alloc count but improves downstream zero-copy on the wire-write path.
- Cache the per-handle UTF-16LE name conversion (the two `compute_name_signature` allocs per `from_names`) inside `MxReferenceHandle` if the same name is registered repeatedly.
- Pool the per-frame scratch buffer at the session level so the per-write count drops from 2 → 1 on hot paths.

These are nice-to-have optimisations rather than R12 blockers.

F52 deltas

F52 split the three F39 sub-tasks into their own commits. Each optimisation lands with a before/after row in this section.

F52.1 — `BytesMut` output buffer (encoder)

Adds write_message::encode_to_bytes_mut (and the timestamped variant) returning a freshly-allocated BytesMut. Allocation count is identical to the existing encode path — the benefit is downstream: consumers can BytesMut::split_to / freeze and forward the body bytes to a wire-level sink without an intermediate copy.

| scenario | before (allocs/op) | after (allocs/op) |
| --- | --- | --- |
| `write_message::encode` (Int32) | 2.00 | 2.00 |
| `write_message::encode_to_bytes_mut` (Int32) | n/a | 2.00 |

Internally this required refactoring the body builders (`encode_boolean` / `encode_fixed` / `encode_variable` / `encode_array`) to fill a pre-sized `&mut [u8]` rather than each allocating their own `Vec<u8>`. The dispatcher computes the body size up front via small `*_body_size` helpers and resizes the destination buffer (`Vec` or `BytesMut`) once. This is also the prerequisite refactor for F52.3.
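The size-then-fill pattern can be sketched as below. The body layout (a one-byte tag followed by the little-endian payload), the tag value, and the exact function signatures are invented for illustration; this file does not show the real wire format or the real builders' signatures.

```rust
// Hypothetical layout, for illustration only: [tag: u8][payload: LE bytes].
const TAG_INT32: u8 = 0x01; // made-up tag value

/// Size helper: bytes needed for a fixed-width body with this payload length.
fn fixed_body_size(payload_len: usize) -> usize {
    1 + payload_len
}

/// Body builder: fills a slice the caller has already sized. It performs no
/// allocation of its own and panics if `dst` has the wrong length.
fn encode_fixed(dst: &mut [u8], tag: u8, payload: &[u8]) {
    dst[0] = tag;
    dst[1..].copy_from_slice(payload);
}

/// Dispatcher: compute the size up front, size the destination once,
/// then let the builder fill it in place.
fn encode_int32(value: i32) -> Vec<u8> {
    let payload = value.to_le_bytes();
    let mut out = vec![0u8; fixed_body_size(payload.len())];
    encode_fixed(&mut out, TAG_INT32, &payload);
    out
}
```

Because the builder only sees a `&mut [u8]`, the same code can fill a `Vec<u8>`, a `BytesMut`, or a pooled buffer, which is what makes this refactor a prerequisite for F52.3.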

F52.2 — Per-handle name-signature cache

Adds a thread-local `HashMap<String, u16>` cache inside `compute_name_signature`. Repeated calls with the same name (the hot path inside `MxReferenceHandle::from_names` when handles are constructed many times) skip the `to_lowercase` allocation entirely. Capped at 1024 entries; on overflow the thread's cache is cleared.

| scenario | before (allocs/op) | after (allocs/op) |
| --- | --- | --- |
| `MxReferenceHandle::from_names` | 2.00 | 0.00 |

Cold-path (first call with a new name) still pays the to_lowercase + cache-key String allocations — the cache only helps on repeats. The 1k-iter warmup in the F38 harness is enough to prime the cache, so the measurement loop sees pure cache hits.
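A minimal sketch of such a capped thread-local cache follows. `slow_signature` is a stand-in (a wrapping byte sum of the lowercased name) so the example compiles; it is not the real signature algorithm. `cached_signature` and `cache_len` are likewise hypothetical names.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

const CACHE_CAP: usize = 1024; // cap stated in the text

thread_local! {
    static SIG_CACHE: RefCell<HashMap<String, u16>> = RefCell::new(HashMap::new());
}

/// Stand-in for the uncached computation; allocates via `to_lowercase`.
fn slow_signature(name: &str) -> u16 {
    name.to_lowercase()
        .bytes()
        .fold(0u16, |acc, b| acc.wrapping_add(u16::from(b)))
}

/// Cached wrapper: repeat calls with the same name skip `slow_signature`.
fn cached_signature(name: &str) -> u16 {
    SIG_CACHE.with(|cell| {
        let mut cache = cell.borrow_mut();
        // Hit path: lookup by `&str`, so no `String` is built.
        if let Some(&sig) = cache.get(name) {
            return sig;
        }
        // Cold path: compute, then pay for the owned cache key.
        let sig = slow_signature(name);
        if cache.len() >= CACHE_CAP {
            cache.clear(); // overflow policy: drop everything, start over
        }
        cache.insert(name.to_owned(), sig);
        sig
    })
}

/// Number of entries in the current thread's cache.
fn cache_len() -> usize {
    SIG_CACHE.with(|cell| cell.borrow().len())
}
```

Clearing the whole map on overflow is cruder than LRU eviction but needs no extra bookkeeping on the hit path, which is the path the benchmark measures.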

F52.3 — Session scratch pool for the encoder body buffer

Adds write_message::encode_into_bytes_mut (and the timestamped variant) which writes the encoded body into a caller-supplied BytesMut. The buffer is cleared and resized in place each call; once it has grown to the largest body the session will produce, it allocates nothing further.

A session that holds a single BytesMut and reuses it across writes sees:

| scenario | before (allocs/op) | after (allocs/op) |
| --- | --- | --- |
| `encode_into_bytes_mut` (Int32, pooled) | 2.00 | 1.00 |
| `encode_into_bytes_mut` (Boolean, pooled) | 1.00 | 0.00 |

The remaining 1.00 for Int32 is the encode_scalar_value scratch Vec<u8>. Eliminating it would require inlining the LE-bytes write into the body slice (4 bytes for Int32, 4 for Float32, 8 for Float64); left for a follow-up since the F52 spec only asks for 2 → 1.

Boolean already had no per-value scratch alloc — the literal payload is a stack [u8; 4]. Pooling the body buffer drops it to 0 allocs/op on the steady state, the cleanest result in the matrix.
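The reuse pattern can be sketched with std types only. Here `Vec<u8>` stands in for `bytes::BytesMut`, the four-byte body is invented, and `encode_i32_into` and `Session` are hypothetical names. The sketch also writes the LE bytes straight into the body slice, which is the follow-up described above rather than what F52.3 landed.

```rust
/// Encodes `value` into a caller-supplied buffer, reusing its capacity.
fn encode_i32_into(buf: &mut Vec<u8>, value: i32) {
    buf.clear(); // drops old contents, keeps the allocation
    buf.resize(4, 0); // allocates only while capacity is still below 4
    // Write the little-endian bytes directly into the body slice, so no
    // per-value scratch Vec is needed.
    buf[..4].copy_from_slice(&value.to_le_bytes());
}

/// A session-like owner holding one scratch buffer for all of its writes.
struct Session {
    scratch: Vec<u8>,
}

impl Session {
    fn new() -> Self {
        Session { scratch: Vec::new() }
    }

    /// Encodes into the pooled buffer and returns a view of the body.
    fn write(&mut self, value: i32) -> &[u8] {
        encode_i32_into(&mut self.scratch, value);
        &self.scratch
    }
}
```

After the first `write`, the buffer's capacity and backing pointer stay fixed for bodies of the same or smaller size, so later writes on that session do not touch the allocator for the body buffer.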

Reproducing

cd rust
cargo bench -p mxaccess-codec

Numbers are deterministic per release-profile build on a given host. Numeric drift across hosts is expected (the warm-up + black_box hints keep iteration counts stable, not the underlying allocator's small-alloc fast-path heuristics).
