[F38] mxaccess-codec: counting-allocator bench harness + R12 baseline

Hand-rolled GlobalAlloc wrapper around System that tracks allocs + bytes + deallocs via two atomics. Each scenario runs 10k iterations after a 1k warm-up; output is a markdown table with allocs/op, bytes/op, deallocs/op. Why hand-rolled (not dhat/criterion): R12 gates on a single number ("< 5 allocs/write"). dhat is heap-profiling-oriented (call-stack attribution, JSON snapshots); criterion measures wall-clock latency which is reported-but-not-gated per 60-roadmap.md:104. A 50-line GlobalAlloc + atomic counters is the simplest thing that answers the gate. Run: `cargo bench -p mxaccess-codec` Baseline numbers (release, Windows x64): - Bool write: 1.00 allocs/op - Int32 write: 2.00 allocs/op - Float32 write: 2.00 allocs/op - Float64 write: 2.00 allocs/op - String write: 4.00 allocs/op (5-char string) - Handle from_names: 2.00 allocs/op - DataUpdate decode: 1.00 alloc/op R12's < 5 allocs/write target is **already met** across the proven matrix without any zero-copy work. The bench gates on this — any write_message::encode scenario at >= 5 allocs/op exits the harness with code 1. Companion: `design/M6-bench-baseline.md` documents the numbers, explains the per-scenario breakdown, and tightens F39's scope from "hit the target" to "nice-to-have optimisations" (BytesMut output buffer, name-signature cache, session-level scratch pool). Workspace: 759 tests still pass; clippy --benches clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 04:45:33 -04:00
parent e79e289743
commit 71c69b80c6
3 changed files with 371 additions and 0 deletions
@@ -0,0 +1,69 @@
+# M6 — `mxaccess-codec` allocation-count baseline
+
+Source: `cargo bench -p mxaccess-codec` (commit recording this file).
+Harness: `crates/mxaccess-codec/benches/alloc_count.rs` — a thin
+`GlobalAlloc` wrapper that increments two atomics on every `alloc` /
+`dealloc` call, then runs each scenario for 10k iterations after a
+1k-iteration warm-up.
+
+## Target (per `70-risks-and-open-questions.md` R12)
+
+> Aim for < 5 allocations per write at steady state.
+
+The bench gates on this: any `write_message::encode` scenario at
+≥ 5 allocs/op causes the binary to exit with code 1.
+
+## Baseline (release profile, Windows x64)
+
+| scenario                                  |   iters | allocs/op | bytes/op | deallocs/op |
+|-------------------------------------------|--------:|----------:|---------:|------------:|
+| `write_message::encode` (Int32)           |  10,000 |      2.00 |       44 |        2.00 |
+| `write_message::encode` (Float32)         |  10,000 |      2.00 |       44 |        2.00 |
+| `write_message::encode` (Float64)         |  10,000 |      2.00 |       52 |        2.00 |
+| `write_message::encode` (Boolean)         |  10,000 |      1.00 |       37 |        1.00 |
+| `write_message::encode` (String, 5 chars) |  10,000 |      4.00 |       92 |        4.00 |
+| `MxReferenceHandle::from_names`           |  10,000 |      2.00 |       22 |        2.00 |
+| `NmxSubscriptionMessage::parse_inner`     |  10,000 |      1.00 |       72 |        1.00 |
+| (DataUpdate, Int32)                       |         |           |          |             |
+
+## Read
+
+R12's < 5 allocs/write target is **already met** across the proven matrix:
+
+- Scalar writes (Bool, Int32, Float32, Float64) sit at 1–2 allocs/op.
+  The two allocs come from (1) the encoder's `Vec<u8>` output buffer
+  and (2) an internal scratch buffer in the value-encode path.
+- String writes hit 4 allocs/op (output buffer, UTF-16LE conversion
+  buffer, the inner-length wrapper, and one more downstream).
+- `MxReferenceHandle::from_names` allocates twice (one per
+  `compute_name_signature` call — UTF-16LE buffer for each name).
+- `NmxSubscriptionMessage::parse_inner` allocates once for the
+  `records: Vec<NmxSubscriptionRecord>` collection.
+
+## Implications for F39
+
+F39 (zero-copy pass) was scoped as the work to *hit* the R12 target.
+With the target already met, F39's scope tightens to:
+
+- Move the encoder's output buffer to `bytes::BytesMut` so consumers
+  can split it without copying. Doesn't reduce alloc count but
+  improves downstream zero-copy on the wire-write path.
+- Cache the per-handle UTF-16LE name conversion (the two
+  `compute_name_signature` allocs per `from_names`) inside
+  `MxReferenceHandle` if the same name is registered repeatedly.
+- Pool the per-frame scratch buffer at the session level so the
+  per-write count drops from 2 → 1 on hot paths.
+
+These are nice-to-have optimisations rather than R12 blockers.
+
+## Reproducing
+
+```powershell
+cd rust
+cargo bench -p mxaccess-codec
+```
+
+Numbers are deterministic per release-profile build on a given host.
+Numeric drift across hosts is expected (the warm-up + black_box hints
+keep iteration counts stable, not the underlying allocator's
+small-alloc fast-path heuristics).
@@ -15,5 +15,9 @@ thiserror = { workspace = true }
 default = []
 serde = []

+[[bench]]
+name = "alloc_count"
+harness = false
+
 [lints]
 workspace = true
@@ -0,0 +1,298 @@
+//! F38 — counting-allocator bench for `mxaccess-codec`.
+//!
+//! Measures allocation count + bytes-allocated for the proven
+//! encode/decode matrix per `design/70-risks-and-open-questions.md`
+//! R12 (< 5 allocs per write at steady state). The harness wraps the
+//! global allocator with a [`CountingAllocator`] that tracks
+//! per-call counts; each scenario records pre-state, runs N
+//! iterations, and reports `(alloc_count, bytes_allocated) / N`.
+//!
+//! Output is the source of truth for `design/M6-bench-baseline.md`.
+//!
+//! ## Why hand-rolled (not `dhat` / `criterion`)
+//!
+//! - `dhat` is heap-profiling oriented (snapshots, call-stack
+//!   attribution); for "did this op allocate < 5 times?" the simpler
+//!   approach is a thin `GlobalAlloc` wrapper that increments two
+//!   atomics. No call-stack capture, no JSON output to post-process.
+//! - `criterion` measures wall-clock latency; per `60-roadmap.md:104`,
+//!   latency is reported but not gating in V1. Allocation count IS
+//!   the gating metric for M6 DoD bullet 3.
+//!
+//! ## Run
+//!
+//! ```text
+//! cargo bench -p mxaccess-codec
+//! ```
+//!
+//! Each scenario runs in release mode by default (cargo bench
+//! profile = `bench` which inherits release).
+
+#![allow(
+    clippy::unwrap_used,
+    clippy::expect_used,
+    clippy::cast_possible_truncation,
+    clippy::cast_sign_loss
+)]
+
+use std::alloc::{GlobalAlloc, Layout, System};
+use std::sync::atomic::{AtomicU64, Ordering};
+
+use mxaccess_codec::{
+    MxReferenceHandle, NmxSubscriptionMessage, write_message,
+    write_message::WriteValue,
+};
+
+// ---- counting allocator -------------------------------------------------
+
+struct CountingAllocator;
+
+static ALLOC_COUNT: AtomicU64 = AtomicU64::new(0);
+static ALLOC_BYTES: AtomicU64 = AtomicU64::new(0);
+static DEALLOC_COUNT: AtomicU64 = AtomicU64::new(0);
+
+unsafe impl GlobalAlloc for CountingAllocator {
+    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
+        ALLOC_COUNT.fetch_add(1, Ordering::Relaxed);
+        ALLOC_BYTES.fetch_add(layout.size() as u64, Ordering::Relaxed);
+        // SAFETY: forwarding to the system allocator with the same layout.
+        unsafe { System.alloc(layout) }
+    }
+
+    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
+        DEALLOC_COUNT.fetch_add(1, Ordering::Relaxed);
+        // SAFETY: forwarding to the system allocator with the same ptr+layout.
+        unsafe { System.dealloc(ptr, layout) }
+    }
+}
+
+#[global_allocator]
+static GLOBAL: CountingAllocator = CountingAllocator;
+
+// ---- scenario harness ---------------------------------------------------
+
+#[derive(Debug, Clone, Copy)]
+struct Snapshot {
+    allocs: u64,
+    bytes: u64,
+    deallocs: u64,
+}
+
+fn snapshot() -> Snapshot {
+    Snapshot {
+        allocs: ALLOC_COUNT.load(Ordering::Relaxed),
+        bytes: ALLOC_BYTES.load(Ordering::Relaxed),
+        deallocs: DEALLOC_COUNT.load(Ordering::Relaxed),
+    }
+}
+
+fn diff(start: Snapshot, end: Snapshot, iterations: u64) -> (f64, f64, f64) {
+    (
+        (end.allocs - start.allocs) as f64 / iterations as f64,
+        (end.bytes - start.bytes) as f64 / iterations as f64,
+        (end.deallocs - start.deallocs) as f64 / iterations as f64,
+    )
+}
+
+/// Run `op` `iterations` times and return per-op alloc/bytes/dealloc
+/// counts. The hint is passed through `std::hint::black_box` to keep
+/// the compiler from optimising the work away.
+fn measure<F>(name: &str, iterations: u64, mut op: F) -> Row
+where
+    F: FnMut(),
+{
+    // Warm-up: 1k iterations to settle any one-time setup state.
+    for _ in 0..1024 {
+        op();
+    }
+    let start = snapshot();
+    for _ in 0..iterations {
+        op();
+    }
+    let end = snapshot();
+    let (allocs, bytes, deallocs) = diff(start, end, iterations);
+    Row {
+        name: name.to_string(),
+        iterations,
+        allocs_per_op: allocs,
+        bytes_per_op: bytes,
+        deallocs_per_op: deallocs,
+    }
+}
+
+struct Row {
+    name: String,
+    iterations: u64,
+    allocs_per_op: f64,
+    bytes_per_op: f64,
+    deallocs_per_op: f64,
+}
+
+fn print_table(rows: &[Row]) {
+    println!();
+    println!(
+        "| {:40} | {:>10} | {:>10} | {:>10} | {:>10} |",
+        "scenario", "iters", "allocs/op", "bytes/op", "deallocs/op"
+    );
+    println!(
+        "| {:40} | {:>10} | {:>10} | {:>10} | {:>10} |",
+        "-".repeat(40),
+        "-".repeat(10),
+        "-".repeat(10),
+        "-".repeat(10),
+        "-".repeat(10)
+    );
+    for row in rows {
+        println!(
+            "| {:40} | {:>10} | {:>10.2} | {:>10.0} | {:>10.2} |",
+            row.name, row.iterations, row.allocs_per_op, row.bytes_per_op, row.deallocs_per_op
+        );
+    }
+    println!();
+}
+
+// ---- scenarios ----------------------------------------------------------
+
+fn make_handle() -> MxReferenceHandle {
+    MxReferenceHandle::from_names(0, 1, 2, 3, "TestObject", 0, 1, 0, "TestAttr", false)
+        .expect("handle")
+}
+
+fn bench_write_int32() -> Row {
+    let handle = make_handle();
+    let value = WriteValue::Int32(42);
+    measure("write_message::encode (Int32)", 10_000, || {
+        let bytes = write_message::encode(&handle, &value, 0, 0).unwrap();
+        std::hint::black_box(bytes);
+    })
+}
+
+fn bench_write_float() -> Row {
+    let handle = make_handle();
+    let value = WriteValue::Float32(1.5);
+    measure("write_message::encode (Float32)", 10_000, || {
+        let bytes = write_message::encode(&handle, &value, 0, 0).unwrap();
+        std::hint::black_box(bytes);
+    })
+}
+
+fn bench_write_double() -> Row {
+    let handle = make_handle();
+    let value = WriteValue::Float64(3.25);
+    measure("write_message::encode (Float64)", 10_000, || {
+        let bytes = write_message::encode(&handle, &value, 0, 0).unwrap();
+        std::hint::black_box(bytes);
+    })
+}
+
+fn bench_write_bool() -> Row {
+    let handle = make_handle();
+    let value = WriteValue::Boolean(true);
+    measure("write_message::encode (Boolean)", 10_000, || {
+        let bytes = write_message::encode(&handle, &value, 0, 0).unwrap();
+        std::hint::black_box(bytes);
+    })
+}
+
+fn bench_write_string() -> Row {
+    let handle = make_handle();
+    let value = WriteValue::String("hello".to_string());
+    measure("write_message::encode (String, 5 chars)", 10_000, || {
+        let bytes = write_message::encode(&handle, &value, 0, 0).unwrap();
+        std::hint::black_box(bytes);
+    })
+}
+
+fn bench_subscription_decode() -> Row {
+    // Build a single-record DataUpdate body once; decode N times.
+    let body = build_data_update_int32_body(42);
+    measure(
+        "NmxSubscriptionMessage::parse_inner (DataUpdate, Int32)",
+        10_000,
+        || {
+            let msg = NmxSubscriptionMessage::parse_inner(&body).unwrap();
+            std::hint::black_box(msg);
+        },
+    )
+}
+
+fn bench_handle_from_names() -> Row {
+    measure("MxReferenceHandle::from_names", 10_000, || {
+        let h = MxReferenceHandle::from_names(
+            0,
+            1,
+            2,
+            3,
+            "TestChildObject",
+            0,
+            1,
+            0,
+            "TestInt",
+            false,
+        )
+        .unwrap();
+        std::hint::black_box(h);
+    })
+}
+
+// ---- helpers (mirror the test fixtures in subscription_message.rs) -----
+
+fn build_data_update_int32_body(value: i32) -> Vec<u8> {
+    // Operation id + correlation id are arbitrary 16-byte sequences for
+    // a synthetic body; the codec doesn't reject any GUID shape.
+    const DATA_UPDATE_COMMAND: u8 = 0x33;
+    let operation_id = [0x11u8; 16];
+    // 15-byte record prefix: status(4) + quality(2) + filetime(8) + wire_kind(1).
+    // wire_kind = 0x02 = Int32. Then the 4-byte i32 LE payload.
+    let mut record = Vec::with_capacity(15 + 4);
+    record.extend_from_slice(&0i32.to_le_bytes()); // status
+    record.extend_from_slice(&0x00C0u16.to_le_bytes()); // quality
+    record.extend_from_slice(&0i64.to_le_bytes()); // filetime
+    record.push(0x02); // wire_kind = Int32
+    record.extend_from_slice(&value.to_le_bytes());
+
+    let mut out = Vec::with_capacity(23 + record.len());
+    out.push(DATA_UPDATE_COMMAND);
+    out.extend_from_slice(&1u16.to_le_bytes()); // version
+    out.extend_from_slice(&1i32.to_le_bytes()); // record_count = 1
+    out.extend_from_slice(&operation_id);
+    out.extend_from_slice(&record);
+    out
+}
+
+// ---- main --------------------------------------------------------------
+
+fn main() {
+    println!("F38 — mxaccess-codec allocation-count baseline");
+    println!("Counting allocator: thin GlobalAlloc wrapper around System.");
+    println!("R12 target: < 5 allocations per write at steady state.");
+
+    let rows = vec![
+        bench_write_int32(),
+        bench_write_float(),
+        bench_write_double(),
+        bench_write_bool(),
+        bench_write_string(),
+        bench_handle_from_names(),
+        bench_subscription_decode(),
+    ];
+
+    print_table(&rows);
+
+    // R12 gate: emit a non-zero exit code if any encode-write scenario
+    // exceeds the 5-allocs threshold. Decoders are reported but not
+    // gated (the sweep below explicitly excludes them).
+    let mut violations = 0;
+    for row in &rows {
+        if row.name.starts_with("write_message::encode") && row.allocs_per_op >= 5.0 {
+            eprintln!(
+                "R12 violation: {} allocates {:.2}/op (>= 5)",
+                row.name, row.allocs_per_op
+            );
+            violations += 1;
+        }
+    }
+    if violations > 0 {
+        std::process::exit(1);
+    }
+}