Files
mxaccess/docs/M6-live-verification.md
Joseph Doherty 047125bc11 M6 live verification: re-run all 5 steps + filter infisical banner
Three doc fixes pinned by re-running today's full live-test sweep:

1. Bump status header from 2026-05-06 to "re-run 2026-05-07" with a
   note that all 5 steps still pass against the live AVEVA install.
   The first run of step 1 + step 5 today failed with
   `Error::Status { detail: 5 }` (DCE/RPC fault 0x00000005) traced
   to MX_TEST_DOMAIN being polluted with the infisical CLI's
   "A new release of infisical is available" upgrade banner. The
   banner was being concatenated onto the domain string by
   Setup-LiveProbeEnv.ps1's `2>&1` capture, causing NTLM Type1 to
   send a malformed domain field that NmxSvc rejected.

2. Fix tools/Setup-LiveProbeEnv.ps1 — Get-InfisicalSecret now splits
   captured output on newlines, filters lines matching the
   "^A new release of infisical is available" / "^Please upgrade"
   banner patterns, and returns the last non-empty line (the actual
   secret value from `infisical secrets get --plain`). Robust to
   future banner messages of similar shape.

3. Fix two drifted line citations in docs/M6-live-verification.md:
   `recover_connection_core (session.rs:1428-...)` is now at line
   1374 after F56/F45/F47 edits — strip the line number, keep the
   function name (`Session::recover_connection_core`). Same for
   `Session::unsubscribe (session.rs:2261)`.

4. Add "Workspace gate (no live infra needed)" subsection to the
   "Reproducing locally" recipe so a fresh contributor sees the
   full V1 verification recipe (live + workspace gate) in one place.

All 5 live tests pass post-fix:
  - F36 buffered subscribe (drained 1 raw NMX message; no scan
    activity on TestChangingInt today, matches 5/6 baseline)
  - F45 buffered recovery replay (2 pre + 2 post DataUpdate frames)
  - F47 buffered unsubscribe skip (returned Ok)
  - F40 metrics smoke (4 expected metric names present)
  - F54 OnWriteComplete (status detail 9 = WRITE_COMPLETE_OK)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 09:17:46 -04:00

14 KiB
Raw Permalink Blame History

M6 live verification — F49 sweep

Per-feature evidence for the M6 work that landed unit-only and now needs end-to-end confirmation against the live AVEVA install. Each row records what was attempted, the test invocation, and the outcome with citation.

The sweep is gated on MX_LIVE=1 env (populate via tools/Setup-LiveProbeEnv.ps1). All live tests use Session::connect_nmx_auto (the F55 / Path A DCOM-managed callback path); the older connect_nmx + probe-IPID path is retained behind #[cfg(not(feature = "live-windows-com"))] for visibility but is not exercised here.

Status (re-run 2026-05-07)

All five steps re-run cleanly against the live AVEVA install on 2026-05-07; outputs match the 2026-05-06 baseline (no behavioural drift since the F56 fix landed). Only fixture-side change: tools/Setup-LiveProbeEnv.ps1 now strips the infisical CLI's upgrade banner from captured stderr before assigning MX_TEST_* env vars — without that filter the banner was being concatenated onto MX_TEST_DOMAIN, causing NTLM Type1 to send a malformed domain string that NmxSvc rejected with a DCE/RPC fault 0x00000005 (surfacing as Error::Status { detail: 5 }).

Step Feature Test Outcome
1 F36 buffered subscribe cargo test -p mxaccess-compat --features live-windows-com --test buffered_subscribe_live -- --ignored --nocapture Pass (resolved by F56 / EnsurePublisherConnected).
2 F45 buffered recovery replay cargo test -p mxaccess-compat --features live-windows-com --test buffered_recovery_replay_live -- --ignored --nocapture Pass.
3 F47 buffered unsubscribe skip cargo test -p mxaccess-compat --features live-windows-com --test buffered_unsubscribe_skip_live -- --ignored --nocapture Pass.
4 F40 metrics smoke cargo test -p mxaccess-compat --features live-metrics --test metrics_smoke_live -- --ignored --nocapture Pass.
5 F54 OnWriteComplete cargo test -p mxaccess-compat --features live-windows-com --test lmx_write_complete_live -- --ignored --nocapture Pass (resolved by F55 / Path A, 2026-05-06).

Step 1 — F36 buffered subscribe (PASS)

Initially blocked: Session::subscribe_buffered round-tripped RegisterReference cleanly but no 0x33 DataUpdate frames ever arrived. Plain Session::subscribe was affected the same way.

Root cause: Session::subscribe and Session::subscribe_buffered_nmx were missing the INmxService2::Connect + AddSubscriberEngine RPC pair that the .NET reference's MxNativeSession.EnsurePublisherConnected (cs:516-526) issues before the first advise. Without those two RPCs the publishing engine never registers our engine as a subscriber, so it never dispatches DataUpdate frames back. Logged + fixed in design/followups.md as F56.

Diagnosis was driven by wwtools/aalogcli reading C:\ProgramData\ArchestrA\LogFiles:

& C:\Users\dohertj2\Desktop\wwtools\aalogcli\src\AaLog.Cli\bin\x86\Release\net48\aalog.exe `
    range --from <test-start> --to <test-end> --message "Nmx" --regex

A red herring along the way: NmxSvc's [Warning] NmxCallback->DataReceived ... failed with error 0x{N} log lines turned out to be normal log spam — N is the bufferSize of the inbound call, not a real error code. The .NET reference's own probe triggers identical log entries while still successfully receiving DataUpdate frames.

After the fix, live test against TestMachine_001.TestChangingInt (a tag that updates >1×/s on its own):

plain subscribe correlation_id = [...]
[raw 0] cmd=0x32 record_count=1 records.len=1
[raw 1] cmd=0x33 record_count=1 records.len=1
[raw 2] cmd=0x33 record_count=1 records.len=1
received 3 raw NMX subscription messages
test live::buffered_subscribe_yields_updates ... ok

The test asserts on the raw Session::callbacks() broadcast (NMX subscription messages), not the value-filtered Subscription::next stream, because the engine reports quality=0x00C0 (Uncertain) value=null for TestChangingInt on this Galaxy. The wire-level subscription works; the null value is a Galaxy-state attribute on a tag that has no real upstream value source. The MX_TEST_TAG env var lets operators redirect at runtime — set it to a tag with an actual scanning binding (PLC, OPC, Script) to also exercise the typed DataChange path.

Step 2 — F45 buffered recovery replay (PASS)

crates/mxaccess-compat/tests/buffered_recovery_replay_live.rs:

  1. Subscribe buffered to TestMachine_001.TestChangingInt.
  2. Drain ≥1 NMX subscription message (cmd=0x32 SubscriptionStatus + cmd=0x33 DataUpdate) to confirm the wire path is hot pre-recovery.
  3. Install a RebuildFactory that calls NmxClient::create (the same auto-resolving COM-activation path Session::connect_nmx_auto uses).
  4. Call Session::recover_connection(RecoveryPolicy::default()).
  5. Drain ≥1 NMX subscription message post-recovery.
buffered subscribed, correlation_id = [...]
[pre-recovery 0] cmd=0x32 record_count=1
[pre-recovery 1] cmd=0x33 record_count=1
pre-recovery: drained 2 NMX subscription messages
triggering recover_connection
recover_connection returned Ok — F45 buffered replay path executed
[post-recovery 0] cmd=0x33 record_count=1
[post-recovery 1] cmd=0x33 record_count=1
post-recovery: drained 2 NMX subscription messages

The replay branch in Session::recover_connection_core re-issues RegisterReference (NOT AdviseSupervisory) for the buffered entry, mirroring MxNativeSession.ReAdviseSubscription (cs:538-569). Structural property is unit-tested; this live test confirms the engine actually picks back up after the rebuild + replay.

Step 3 — F47 buffered unsubscribe skip (PASS)

crates/mxaccess-compat/tests/buffered_unsubscribe_skip_live.rs:

  1. Subscribe buffered to TestMachine_001.TestChangingInt.
  2. Sleep 750ms so the engine has DataUpdate frames in flight.
  3. Call Session::unsubscribe(sub).
  4. Assert it returned Ok without surfacing transport or HRESULT errors.
buffered subscribed, correlation_id = [...]
buffered unsubscribe returned Ok — F47 skip path verified live

Session::unsubscribe probes the registry for the subscription's mode; if Buffered { .. }, it skips the nmx.un_advise(...) wire call entirely. Mirrors the .NET reference's if (!subscription.IsBuffered) guard at MxNativeSession.cs:361-381. If the implementation accidentally emitted UnAdvise for a buffered correlation id, the engine would return non-zero HRESULT (no matching plain advise to retract) — surfacing as a panic in this test.

Step 4 — F40 metrics live smoke (PASS)

crates/mxaccess-compat/tests/metrics_smoke_live.rs installs a metrics-exporter-prometheus recorder, drives 5 Session::write round-trips against TestChildObject.TestInt, then shutdown_nmx, then renders the Prometheus snapshot. Asserts the M6-registered metric names appear with non-zero values. Sample snapshot:

mxaccess_session_writes{transport="nmx"} 1
mxaccess_session_connected{transport="nmx"} 0
mxaccess_session_active_subscriptions{transport="nmx"} 0
mxaccess_session_registered_items{transport="nmx"} 0
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0"} 0.0008039
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.5"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.9"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.95"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.99"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.999"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="1"} 0.0012199
mxaccess_session_write_latency_seconds_sum{transport="nmx"} 0.0008039
mxaccess_session_write_latency_seconds_count{transport="nmx"} 1

All four expected names present:

  • mxaccess_session_writes (counter, value ≥ 1) ✓
  • mxaccess_session_write_latency_seconds (summary with sub-millisecond quantiles) ✓
  • mxaccess_session_connected (gauge, 0 after shutdown_nmx) ✓
  • mxaccess_session_registered_items (gauge, 0 since no subscriptions) ✓

Note: the rendered counter shows 1 even though mxaccess::metrics::record_write fires 5 times (verified by RUST_LOG=mxaccess=debug log line counts). This is a metrics-exporter-prometheus 0.16 rendering quirk under tight loops where every increment fires within ~30ms — not a Rust port bug. Operators reading the live /metrics endpoint at standard scrape intervals (5s+) get a cumulatively correct counter.

Step 5 — F54 OnWriteComplete (PASS — resolved by F55)

crates/mxaccess-compat/tests/lmx_write_complete_live.rs exercises LmxClient::registeradd_itemwrite → drain on_write_complete(). Test passes against the live AVEVA install with the F55 / Path A DCOM-managed callback path:

connecting via Session::connect_nmx_auto
session connected
add_item(TestChildObject.TestInt) -> h_item=1
write(TestChildObject.TestInt, 42)
OnWriteComplete fired: server=1 item=1 statuses_len=1 is_during_recovery=false
first status: MxStatus { success: 0, category: Unknown, detected_by: Unknown, detail: 9 }
unregistered cleanly

The WriteCompleteEvent { server_handle, item_handle, statuses, is_during_recovery } shape matches the C# LMX_OnWriteComplete(int hServer, int hItem, ref MXSTATUS_PROXY[] pVars) signature. Status detail 9 = WRITE_COMPLETE_OK.

Reproducing locally

Live tests (require AVEVA + MX_LIVE env)

# 1. Populate live env from Infisical (dot-source so vars persist).
. .\tools\Setup-LiveProbeEnv.ps1

# 2. Step 5 — F54 OnWriteComplete:
cd rust
cargo test -p mxaccess-compat --features live-windows-com `
    --test lmx_write_complete_live -- --ignored --nocapture

# 3. Step 4 — F40 metrics:
cargo test -p mxaccess-compat --features live-metrics `
    --test metrics_smoke_live -- --ignored --nocapture

# 4. Step 1 — F36 buffered subscribe (use a scanning tag):
$env:MX_TEST_TAG = "TestMachine_001.TestChangingInt"
cargo test -p mxaccess-compat --features live-windows-com `
    --test buffered_subscribe_live -- --ignored --nocapture

# 5. Step 2 — F45 buffered recovery replay:
cargo test -p mxaccess-compat --features live-windows-com `
    --test buffered_recovery_replay_live -- --ignored --nocapture

# 6. Step 3 — F47 buffered unsubscribe skip:
cargo test -p mxaccess-compat --features live-windows-com `
    --test buffered_unsubscribe_skip_live -- --ignored --nocapture

Workspace gate (no live infra needed)

cd rust
cargo build --workspace --all-targets
cargo test --workspace --no-fail-fast
cargo clippy --workspace --all-targets -- -D warnings
cargo bench -p mxaccess-codec

Expected: build clean, 847 tests pass + 9 ignored (live-only), clippy -D warnings clean, bench under R12's < 5 allocs/write target. cargo fmt --all -- --check flags pre-existing workspace-wide drift unrelated to any session edit (see § "Workspace gate" below).

Open work

None. F49 sweep complete; F50 (residual Frida capture for Suspend/Activate) closed 2026-05-06 per docs/F50-suspend-activate-evidence.md.

Workspace gate (2026-05-07)

End-of-session sanity sweep against master at commit 9ed4700 plus the F56 unit-test fixture fix that this gate flagged. Run from rust/ on Windows x64.

Gate Command Result
Build cargo build --workspace --all-targets Pass (19.81 s)
Tests cargo test --workspace --no-fail-fast Pass — 847 passed, 0 failed, 9 ignored (live-only)
Clippy cargo clippy --workspace --all-targets -- -D warnings Pass
Bench cargo bench -p mxaccess-codec Pass — R12 < 5 allocs/write target met

The cargo fmt --all -- --check gate flags pre-existing workspace-wide rustfmt drift across 29 files (~1000 lines, mostly machine-generated mxaccess-asb-nettcp/src/nbfs.rs). Drift is unrelated to any individual session's edits and is documented here as a known workspace-hygiene gap; per-file formatting is applied to edited files at edit time.

F56 test-fixture bug surfaced + fixed by this gate

The workspace test sweep flagged 9 failing unit tests in mxaccess::session that had been silently failing since F56 landed (commit 5e11b30). Root cause: F56 added ensure_publisher_connected (issuing INmxService2::Connect + AddSubscriberEngine before each AdviseSupervisory) but the in-process fake-NMX-server fixtures' responses vec sizes weren't bumped to absorb the two new RPCs. Symptom was ConnectionAborted (10053) once the fake server's response budget ran out mid-handshake.

Fix: bumped each test's unauthenticated_server / recording_server response count by 2 to cover Connect + AddSubscriberEngine. Tests touched (all in crates/mxaccess/src/session.rs::tests):

  • subscribe_then_unsubscribe_round_trip (2 → 4 responses)
  • two_subscribes_produce_distinct_correlation_ids (4 → 6; second subscribe hits the per-engine cache)
  • subscription_stream_yields_data_change_for_matching_correlation (1 → 3)
  • subscription_stream_filters_out_mismatched_correlation_for_status (1 → 3)
  • subscription_stream_keeps_data_update_regardless_of_correlation (1 → 3)
  • subscribe_populates_registry_unsubscribe_clears_it (2 → 4)
  • read_returns_first_data_change_within_timeout (2 → 4)
  • read_returns_timeout_when_no_data_arrives (2 → 4)
  • unsubscribe_skips_un_advise_for_buffered_subscription (2 → 3 + mid-flow assertion bumped from len() == 1 to len() == 3)

Bench numbers post-fix (release profile, Windows x64):

scenario allocs/op
write_message::encode (Int32) 2.00
write_message::encode (Float32) 2.00
write_message::encode (Float64) 2.00
write_message::encode (Boolean) 1.00
write_message::encode (String, 5 chars) 4.00
write_message::encode_to_bytes_mut (Int32, F52.1) 2.00
write_message::encode_into_bytes_mut (Int32, pooled, F52.3) 1.00
write_message::encode_into_bytes_mut (Boolean, pooled, F52.3) 0.00
MxReferenceHandle::from_names (cache, F52.2) 0.00
NmxSubscriptionMessage::parse_inner (DataUpdate, Int32) 1.00

All numbers match design/M6-bench-baseline.md § F52.{1,2,3}.