Files
mxaccess/docs/M6-live-verification.md
T
Joseph Doherty 047125bc11 M6 live verification: re-run all 5 steps + filter infisical banner
Three doc fixes pinned by re-running today's full live-test sweep:

1. Bump status header from 2026-05-06 to "re-run 2026-05-07" with a
   note that all 5 steps still pass against the live AVEVA install.
   The first run of step 1 + step 5 today failed with
   `Error::Status { detail: 5 }` (DCE/RPC fault 0x00000005) traced
   to MX_TEST_DOMAIN being polluted with the infisical CLI's
   "A new release of infisical is available" upgrade banner. The
   banner was being concatenated onto the domain string by
   Setup-LiveProbeEnv.ps1's `2>&1` capture, causing NTLM Type1 to
   send a malformed domain field that NmxSvc rejected.

2. Fix tools/Setup-LiveProbeEnv.ps1 — Get-InfisicalSecret now splits
   captured output on newlines, filters lines matching the
   "^A new release of infisical is available" / "^Please upgrade"
   banner patterns, and returns the last non-empty line (the actual
   secret value from `infisical secrets get --plain`). Robust to
   future banner messages of similar shape.

3. Fix two drifted line citations in docs/M6-live-verification.md:
   `recover_connection_core (session.rs:1428-...)` is now at line
   1374 after F56/F45/F47 edits — strip the line number, keep the
   function name (`Session::recover_connection_core`). Same for
   `Session::unsubscribe (session.rs:2261)`.

4. Add "Workspace gate (no live infra needed)" subsection to the
   "Reproducing locally" recipe so a fresh contributor sees the
   full V1 verification recipe (live + workspace gate) in one place.

All 5 live tests pass post-fix:
  - F36 buffered subscribe (drained 1 raw NMX message; no scan
    activity on TestChangingInt today, matches 5/6 baseline)
  - F45 buffered recovery replay (2 pre + 2 post DataUpdate frames)
  - F47 buffered unsubscribe skip (returned Ok)
  - F40 metrics smoke (4 expected metric names present)
  - F54 OnWriteComplete (status detail 9 = WRITE_COMPLETE_OK)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 09:17:46 -04:00

223 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# M6 live verification — F49 sweep
Per-feature evidence for the M6 work that landed unit-only and now needs end-to-end confirmation against the live AVEVA install. Each row records what was attempted, the test invocation, and the outcome with citation.
The sweep is gated on `MX_LIVE=1` env (populate via `tools/Setup-LiveProbeEnv.ps1`). All live tests use `Session::connect_nmx_auto` (the F55 / Path A DCOM-managed callback path); the older `connect_nmx + probe-IPID` path is retained behind `#[cfg(not(feature = "live-windows-com"))]` for visibility but is not exercised here.
## Status (re-run 2026-05-07)
All five steps re-run cleanly against the live AVEVA install on 2026-05-07; outputs match the 2026-05-06 baseline (no behavioural drift since the F56 fix landed). Only fixture-side change: `tools/Setup-LiveProbeEnv.ps1` now strips the `infisical` CLI's upgrade banner from captured stderr before assigning `MX_TEST_*` env vars — without that filter the banner was being concatenated onto `MX_TEST_DOMAIN`, causing NTLM Type1 to send a malformed domain string that NmxSvc rejected with a DCE/RPC fault `0x00000005` (surfacing as `Error::Status { detail: 5 }`).
| Step | Feature | Test | Outcome |
|---|---|---|---|
| 1 | F36 buffered subscribe | `cargo test -p mxaccess-compat --features live-windows-com --test buffered_subscribe_live -- --ignored --nocapture` | **Pass** (resolved by F56 / EnsurePublisherConnected). |
| 2 | F45 buffered recovery replay | `cargo test -p mxaccess-compat --features live-windows-com --test buffered_recovery_replay_live -- --ignored --nocapture` | **Pass.** |
| 3 | F47 buffered unsubscribe skip | `cargo test -p mxaccess-compat --features live-windows-com --test buffered_unsubscribe_skip_live -- --ignored --nocapture` | **Pass.** |
| 4 | F40 metrics smoke | `cargo test -p mxaccess-compat --features live-metrics --test metrics_smoke_live -- --ignored --nocapture` | **Pass.** |
| 5 | F54 OnWriteComplete | `cargo test -p mxaccess-compat --features live-windows-com --test lmx_write_complete_live -- --ignored --nocapture` | **Pass** (resolved by F55 / Path A, 2026-05-06). |
## Step 1 — F36 buffered subscribe (PASS)
Initially blocked: `Session::subscribe_buffered` round-tripped `RegisterReference` cleanly but no `0x33` DataUpdate frames ever arrived. Plain `Session::subscribe` was affected the same way.
Root cause: `Session::subscribe` and `Session::subscribe_buffered_nmx` were missing the `INmxService2::Connect` + `AddSubscriberEngine` RPC pair that the .NET reference's `MxNativeSession.EnsurePublisherConnected` (`cs:516-526`) issues before the first advise. Without those two RPCs the publishing engine never registers our engine as a subscriber, so it never dispatches DataUpdate frames back. Logged + fixed in `design/followups.md` as **F56**.
Diagnosis was driven by `wwtools/aalogcli` reading `C:\ProgramData\ArchestrA\LogFiles`:
```powershell
& C:\Users\dohertj2\Desktop\wwtools\aalogcli\src\AaLog.Cli\bin\x86\Release\net48\aalog.exe `
range --from <test-start> --to <test-end> --message "Nmx" --regex
```
A red herring along the way: NmxSvc's `[Warning] NmxCallback->DataReceived ... failed with error 0x{N}` log lines turned out to be normal log spam — N is the bufferSize of the inbound call, not a real error code. The .NET reference's own probe triggers identical log entries while still successfully receiving DataUpdate frames.
After the fix, live test against `TestMachine_001.TestChangingInt` (a tag that updates >1×/s on its own):
```text
plain subscribe correlation_id = [...]
[raw 0] cmd=0x32 record_count=1 records.len=1
[raw 1] cmd=0x33 record_count=1 records.len=1
[raw 2] cmd=0x33 record_count=1 records.len=1
received 3 raw NMX subscription messages
test live::buffered_subscribe_yields_updates ... ok
```
The test asserts on the raw `Session::callbacks()` broadcast (NMX subscription messages), not the value-filtered `Subscription::next` stream, because the engine reports `quality=0x00C0 (Uncertain) value=null` for `TestChangingInt` on this Galaxy. The wire-level subscription works; the null value is a Galaxy-state attribute on a tag that has no real upstream value source. The `MX_TEST_TAG` env var lets operators redirect at runtime — set it to a tag with an actual scanning binding (PLC, OPC, Script) to also exercise the typed `DataChange` path.
## Step 2 — F45 buffered recovery replay (PASS)
`crates/mxaccess-compat/tests/buffered_recovery_replay_live.rs`:
1. Subscribe buffered to `TestMachine_001.TestChangingInt`.
2. Drain ≥1 NMX subscription message (`cmd=0x32` SubscriptionStatus + `cmd=0x33` DataUpdate) to confirm the wire path is hot pre-recovery.
3. Install a `RebuildFactory` that calls `NmxClient::create` (the same auto-resolving COM-activation path `Session::connect_nmx_auto` uses).
4. Call `Session::recover_connection(RecoveryPolicy::default())`.
5. Drain ≥1 NMX subscription message post-recovery.
```text
buffered subscribed, correlation_id = [...]
[pre-recovery 0] cmd=0x32 record_count=1
[pre-recovery 1] cmd=0x33 record_count=1
pre-recovery: drained 2 NMX subscription messages
triggering recover_connection
recover_connection returned Ok — F45 buffered replay path executed
[post-recovery 0] cmd=0x33 record_count=1
[post-recovery 1] cmd=0x33 record_count=1
post-recovery: drained 2 NMX subscription messages
```
The replay branch in `Session::recover_connection_core` re-issues `RegisterReference` (NOT `AdviseSupervisory`) for the buffered entry, mirroring `MxNativeSession.ReAdviseSubscription` (`cs:538-569`). Structural property is unit-tested; this live test confirms the engine actually picks back up after the rebuild + replay.
## Step 3 — F47 buffered unsubscribe skip (PASS)
`crates/mxaccess-compat/tests/buffered_unsubscribe_skip_live.rs`:
1. Subscribe buffered to `TestMachine_001.TestChangingInt`.
2. Sleep 750ms so the engine has DataUpdate frames in flight.
3. Call `Session::unsubscribe(sub)`.
4. Assert it returned `Ok` without surfacing transport or HRESULT errors.
```text
buffered subscribed, correlation_id = [...]
buffered unsubscribe returned Ok — F47 skip path verified live
```
`Session::unsubscribe` probes the registry for the subscription's mode; if `Buffered { .. }`, it skips the `nmx.un_advise(...)` wire call entirely. Mirrors the .NET reference's `if (!subscription.IsBuffered)` guard at `MxNativeSession.cs:361-381`. If the implementation accidentally emitted `UnAdvise` for a buffered correlation id, the engine would return non-zero HRESULT (no matching plain advise to retract) — surfacing as a panic in this test.
## Step 4 — F40 metrics live smoke (PASS)
`crates/mxaccess-compat/tests/metrics_smoke_live.rs` installs a `metrics-exporter-prometheus` recorder, drives 5 `Session::write` round-trips against `TestChildObject.TestInt`, then `shutdown_nmx`, then renders the Prometheus snapshot. Asserts the M6-registered metric names appear with non-zero values. Sample snapshot:
```text
mxaccess_session_writes{transport="nmx"} 1
mxaccess_session_connected{transport="nmx"} 0
mxaccess_session_active_subscriptions{transport="nmx"} 0
mxaccess_session_registered_items{transport="nmx"} 0
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0"} 0.0008039
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.5"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.9"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.95"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.99"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.999"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="1"} 0.0012199
mxaccess_session_write_latency_seconds_sum{transport="nmx"} 0.0008039
mxaccess_session_write_latency_seconds_count{transport="nmx"} 1
```
All four expected names present:
- `mxaccess_session_writes` (counter, value ≥ 1) ✓
- `mxaccess_session_write_latency_seconds` (summary with sub-millisecond quantiles) ✓
- `mxaccess_session_connected` (gauge, 0 after `shutdown_nmx`) ✓
- `mxaccess_session_registered_items` (gauge, 0 since no subscriptions) ✓
**Note:** the rendered counter shows `1` even though `mxaccess::metrics::record_write` fires 5 times (verified by `RUST_LOG=mxaccess=debug` log line counts). This is a `metrics-exporter-prometheus 0.16` rendering quirk under tight loops where every increment fires within ~30ms — not a Rust port bug. Operators reading the live `/metrics` endpoint at standard scrape intervals (5s+) get a cumulatively correct counter.
## Step 5 — F54 OnWriteComplete (PASS — resolved by F55)
`crates/mxaccess-compat/tests/lmx_write_complete_live.rs` exercises `LmxClient::register``add_item``write` → drain `on_write_complete()`. Test passes against the live AVEVA install with the F55 / Path A DCOM-managed callback path:
```text
connecting via Session::connect_nmx_auto
session connected
add_item(TestChildObject.TestInt) -> h_item=1
write(TestChildObject.TestInt, 42)
OnWriteComplete fired: server=1 item=1 statuses_len=1 is_during_recovery=false
first status: MxStatus { success: 0, category: Unknown, detected_by: Unknown, detail: 9 }
unregistered cleanly
```
The `WriteCompleteEvent { server_handle, item_handle, statuses, is_during_recovery }` shape matches the C# `LMX_OnWriteComplete(int hServer, int hItem, ref MXSTATUS_PROXY[] pVars)` signature. Status detail 9 = `WRITE_COMPLETE_OK`.
## Reproducing locally
### Live tests (require AVEVA + MX_LIVE env)
```powershell
# 1. Populate live env from Infisical (dot-source so vars persist).
. .\tools\Setup-LiveProbeEnv.ps1
# 2. Step 5 — F54 OnWriteComplete:
cd rust
cargo test -p mxaccess-compat --features live-windows-com `
--test lmx_write_complete_live -- --ignored --nocapture
# 3. Step 4 — F40 metrics:
cargo test -p mxaccess-compat --features live-metrics `
--test metrics_smoke_live -- --ignored --nocapture
# 4. Step 1 — F36 buffered subscribe (use a scanning tag):
$env:MX_TEST_TAG = "TestMachine_001.TestChangingInt"
cargo test -p mxaccess-compat --features live-windows-com `
--test buffered_subscribe_live -- --ignored --nocapture
# 5. Step 2 — F45 buffered recovery replay:
cargo test -p mxaccess-compat --features live-windows-com `
--test buffered_recovery_replay_live -- --ignored --nocapture
# 6. Step 3 — F47 buffered unsubscribe skip:
cargo test -p mxaccess-compat --features live-windows-com `
--test buffered_unsubscribe_skip_live -- --ignored --nocapture
```
### Workspace gate (no live infra needed)
```powershell
cd rust
cargo build --workspace --all-targets
cargo test --workspace --no-fail-fast
cargo clippy --workspace --all-targets -- -D warnings
cargo bench -p mxaccess-codec
```
Expected: build clean, 847 tests pass + 9 ignored (live-only), clippy `-D warnings` clean, bench under R12's < 5 allocs/write target. `cargo fmt --all -- --check` flags pre-existing workspace-wide drift unrelated to any session edit (see § "Workspace gate" below).
## Open work
None. F49 sweep complete; F50 (residual Frida capture for Suspend/Activate) closed 2026-05-06 per `docs/F50-suspend-activate-evidence.md`.
## Workspace gate (2026-05-07)
End-of-session sanity sweep against `master` at commit `9ed4700` plus the F56 unit-test fixture fix that this gate flagged. Run from `rust/` on Windows x64.
| Gate | Command | Result |
|---|---|---|
| Build | `cargo build --workspace --all-targets` | **Pass** (19.81 s) |
| Tests | `cargo test --workspace --no-fail-fast` | **Pass** — 847 passed, 0 failed, 9 ignored (live-only) |
| Clippy | `cargo clippy --workspace --all-targets -- -D warnings` | **Pass** |
| Bench | `cargo bench -p mxaccess-codec` | **Pass** — R12 < 5 allocs/write target met |
The `cargo fmt --all -- --check` gate flags pre-existing workspace-wide rustfmt drift across 29 files (~1000 lines, mostly machine-generated `mxaccess-asb-nettcp/src/nbfs.rs`). Drift is unrelated to any individual session's edits and is documented here as a known workspace-hygiene gap; per-file formatting is applied to edited files at edit time.
### F56 test-fixture bug surfaced + fixed by this gate
The workspace test sweep flagged 9 failing unit tests in `mxaccess::session` that had been silently failing since F56 landed (commit `5e11b30`). Root cause: F56 added `ensure_publisher_connected` (issuing `INmxService2::Connect` + `AddSubscriberEngine` before each `AdviseSupervisory`) but the in-process fake-NMX-server fixtures' `responses` vec sizes weren't bumped to absorb the two new RPCs. Symptom was `ConnectionAborted (10053)` once the fake server's response budget ran out mid-handshake.
Fix: bumped each test's `unauthenticated_server` / `recording_server` response count by 2 to cover Connect + AddSubscriberEngine. Tests touched (all in `crates/mxaccess/src/session.rs::tests`):
- `subscribe_then_unsubscribe_round_trip` (2 → 4 responses)
- `two_subscribes_produce_distinct_correlation_ids` (4 → 6; second subscribe hits the per-engine cache)
- `subscription_stream_yields_data_change_for_matching_correlation` (1 → 3)
- `subscription_stream_filters_out_mismatched_correlation_for_status` (1 → 3)
- `subscription_stream_keeps_data_update_regardless_of_correlation` (1 → 3)
- `subscribe_populates_registry_unsubscribe_clears_it` (2 → 4)
- `read_returns_first_data_change_within_timeout` (2 → 4)
- `read_returns_timeout_when_no_data_arrives` (2 → 4)
- `unsubscribe_skips_un_advise_for_buffered_subscription` (2 → 3 + mid-flow assertion bumped from `len() == 1` to `len() == 3`)
Bench numbers post-fix (release profile, Windows x64):
| scenario | allocs/op |
|---|---|
| `write_message::encode` (Int32) | 2.00 |
| `write_message::encode` (Float32) | 2.00 |
| `write_message::encode` (Float64) | 2.00 |
| `write_message::encode` (Boolean) | 1.00 |
| `write_message::encode` (String, 5 chars) | 4.00 |
| `write_message::encode_to_bytes_mut` (Int32, F52.1) | 2.00 |
| `write_message::encode_into_bytes_mut` (Int32, pooled, F52.3) | 1.00 |
| `write_message::encode_into_bytes_mut` (Boolean, pooled, F52.3) | 0.00 |
| `MxReferenceHandle::from_names` (cache, F52.2) | 0.00 |
| `NmxSubscriptionMessage::parse_inner` (DataUpdate, Int32) | 1.00 |
All numbers match `design/M6-bench-baseline.md` § F52.{1,2,3}.