047125bc11
Three doc fixes pinned by re-running today's full live-test sweep:
1. Bump status header from 2026-05-06 to "re-run 2026-05-07" with a
note that all 5 steps still pass against the live AVEVA install.
The first run of step 1 + step 5 today failed with
`Error::Status { detail: 5 }` (DCE/RPC fault 0x00000005) traced
to MX_TEST_DOMAIN being polluted with the infisical CLI's
"A new release of infisical is available" upgrade banner. The
banner was being concatenated onto the domain string by
Setup-LiveProbeEnv.ps1's `2>&1` capture, causing NTLM Type1 to
send a malformed domain field that NmxSvc rejected.
2. Fix tools/Setup-LiveProbeEnv.ps1 — Get-InfisicalSecret now splits
captured output on newlines, filters lines matching the
"^A new release of infisical is available" / "^Please upgrade"
banner patterns, and returns the last non-empty line (the actual
secret value from `infisical secrets get --plain`). Robust to
future banner messages of similar shape.
3. Fix two drifted line citations in docs/M6-live-verification.md:
`recover_connection_core (session.rs:1428-...)` is now at line
1374 after F56/F45/F47 edits — strip the line number, keep the
function name (`Session::recover_connection_core`). Same for
`Session::unsubscribe (session.rs:2261)`.
4. Add "Workspace gate (no live infra needed)" subsection to the
"Reproducing locally" recipe so a fresh contributor sees the
full V1 verification recipe (live + workspace gate) in one place.
All 5 live tests pass post-fix:
- F36 buffered subscribe (drained 1 raw NMX message; no scan
activity on TestChangingInt today, matches 5/6 baseline)
- F45 buffered recovery replay (2 pre + 2 post DataUpdate frames)
- F47 buffered unsubscribe skip (returned Ok)
- F40 metrics smoke (4 expected metric names present)
- F54 OnWriteComplete (status detail 9 = WRITE_COMPLETE_OK)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
223 lines
14 KiB
Markdown
223 lines
14 KiB
Markdown
# M6 live verification — F49 sweep
|
||
|
||
Per-feature evidence for the M6 work that landed unit-only and now needs end-to-end confirmation against the live AVEVA install. Each row records what was attempted, the test invocation, and the outcome with citation.
|
||
|
||
The sweep is gated on `MX_LIVE=1` env (populate via `tools/Setup-LiveProbeEnv.ps1`). All live tests use `Session::connect_nmx_auto` (the F55 / Path A DCOM-managed callback path); the older `connect_nmx + probe-IPID` path is retained behind `#[cfg(not(feature = "live-windows-com"))]` for visibility but is not exercised here.
|
||
|
||
## Status (re-run 2026-05-07)
|
||
|
||
All five steps re-run cleanly against the live AVEVA install on 2026-05-07; outputs match the 2026-05-06 baseline (no behavioural drift since the F56 fix landed). Only fixture-side change: `tools/Setup-LiveProbeEnv.ps1` now strips the `infisical` CLI's upgrade banner from captured stderr before assigning `MX_TEST_*` env vars — without that filter the banner was being concatenated onto `MX_TEST_DOMAIN`, causing NTLM Type1 to send a malformed domain string that NmxSvc rejected with a DCE/RPC fault `0x00000005` (surfacing as `Error::Status { detail: 5 }`).
|
||
|
||
| Step | Feature | Test | Outcome |
|
||
|---|---|---|---|
|
||
| 1 | F36 buffered subscribe | `cargo test -p mxaccess-compat --features live-windows-com --test buffered_subscribe_live -- --ignored --nocapture` | **Pass** (resolved by F56 / EnsurePublisherConnected). |
|
||
| 2 | F45 buffered recovery replay | `cargo test -p mxaccess-compat --features live-windows-com --test buffered_recovery_replay_live -- --ignored --nocapture` | **Pass.** |
|
||
| 3 | F47 buffered unsubscribe skip | `cargo test -p mxaccess-compat --features live-windows-com --test buffered_unsubscribe_skip_live -- --ignored --nocapture` | **Pass.** |
|
||
| 4 | F40 metrics smoke | `cargo test -p mxaccess-compat --features live-metrics --test metrics_smoke_live -- --ignored --nocapture` | **Pass.** |
|
||
| 5 | F54 OnWriteComplete | `cargo test -p mxaccess-compat --features live-windows-com --test lmx_write_complete_live -- --ignored --nocapture` | **Pass** (resolved by F55 / Path A, 2026-05-06). |
|
||
|
||
## Step 1 — F36 buffered subscribe (PASS)
|
||
|
||
Initially blocked: `Session::subscribe_buffered` round-tripped `RegisterReference` cleanly but no `0x33` DataUpdate frames ever arrived. Plain `Session::subscribe` was affected the same way.
|
||
|
||
Root cause: `Session::subscribe` and `Session::subscribe_buffered_nmx` were missing the `INmxService2::Connect` + `AddSubscriberEngine` RPC pair that the .NET reference's `MxNativeSession.EnsurePublisherConnected` (`cs:516-526`) issues before the first advise. Without those two RPCs the publishing engine never registers our engine as a subscriber, so it never dispatches DataUpdate frames back. Logged + fixed in `design/followups.md` as **F56**.
|
||
|
||
Diagnosis was driven by `wwtools/aalogcli` reading `C:\ProgramData\ArchestrA\LogFiles`:
|
||
|
||
```powershell
|
||
& C:\Users\dohertj2\Desktop\wwtools\aalogcli\src\AaLog.Cli\bin\x86\Release\net48\aalog.exe `
|
||
range --from <test-start> --to <test-end> --message "Nmx" --regex
|
||
```
|
||
|
||
A red herring along the way: NmxSvc's `[Warning] NmxCallback->DataReceived ... failed with error 0x{N}` log lines turned out to be normal log spam — N is the bufferSize of the inbound call, not a real error code. The .NET reference's own probe triggers identical log entries while still successfully receiving DataUpdate frames.
|
||
|
||
After the fix, live test against `TestMachine_001.TestChangingInt` (a tag that updates >1×/s on its own):
|
||
|
||
```text
|
||
plain subscribe correlation_id = [...]
|
||
[raw 0] cmd=0x32 record_count=1 records.len=1
|
||
[raw 1] cmd=0x33 record_count=1 records.len=1
|
||
[raw 2] cmd=0x33 record_count=1 records.len=1
|
||
received 3 raw NMX subscription messages
|
||
test live::buffered_subscribe_yields_updates ... ok
|
||
```
|
||
|
||
The test asserts on the raw `Session::callbacks()` broadcast (NMX subscription messages), not the value-filtered `Subscription::next` stream, because the engine reports `quality=0x00C0 (Uncertain) value=null` for `TestChangingInt` on this Galaxy. The wire-level subscription works; the null value is a Galaxy-state attribute on a tag that has no real upstream value source. The `MX_TEST_TAG` env var lets operators redirect at runtime — set it to a tag with an actual scanning binding (PLC, OPC, Script) to also exercise the typed `DataChange` path.
|
||
|
||
## Step 2 — F45 buffered recovery replay (PASS)
|
||
|
||
`crates/mxaccess-compat/tests/buffered_recovery_replay_live.rs`:
|
||
|
||
1. Subscribe buffered to `TestMachine_001.TestChangingInt`.
|
||
2. Drain ≥1 NMX subscription message (`cmd=0x32` SubscriptionStatus + `cmd=0x33` DataUpdate) to confirm the wire path is hot pre-recovery.
|
||
3. Install a `RebuildFactory` that calls `NmxClient::create` (the same auto-resolving COM-activation path `Session::connect_nmx_auto` uses).
|
||
4. Call `Session::recover_connection(RecoveryPolicy::default())`.
|
||
5. Drain ≥1 NMX subscription message post-recovery.
|
||
|
||
```text
|
||
buffered subscribed, correlation_id = [...]
|
||
[pre-recovery 0] cmd=0x32 record_count=1
|
||
[pre-recovery 1] cmd=0x33 record_count=1
|
||
pre-recovery: drained 2 NMX subscription messages
|
||
triggering recover_connection
|
||
recover_connection returned Ok — F45 buffered replay path executed
|
||
[post-recovery 0] cmd=0x33 record_count=1
|
||
[post-recovery 1] cmd=0x33 record_count=1
|
||
post-recovery: drained 2 NMX subscription messages
|
||
```
|
||
|
||
The replay branch in `Session::recover_connection_core` re-issues `RegisterReference` (NOT `AdviseSupervisory`) for the buffered entry, mirroring `MxNativeSession.ReAdviseSubscription` (`cs:538-569`). Structural property is unit-tested; this live test confirms the engine actually picks back up after the rebuild + replay.
|
||
|
||
## Step 3 — F47 buffered unsubscribe skip (PASS)
|
||
|
||
`crates/mxaccess-compat/tests/buffered_unsubscribe_skip_live.rs`:
|
||
|
||
1. Subscribe buffered to `TestMachine_001.TestChangingInt`.
|
||
2. Sleep 750ms so the engine has DataUpdate frames in flight.
|
||
3. Call `Session::unsubscribe(sub)`.
|
||
4. Assert it returned `Ok` without surfacing transport or HRESULT errors.
|
||
|
||
```text
|
||
buffered subscribed, correlation_id = [...]
|
||
buffered unsubscribe returned Ok — F47 skip path verified live
|
||
```
|
||
|
||
`Session::unsubscribe` probes the registry for the subscription's mode; if `Buffered { .. }`, it skips the `nmx.un_advise(...)` wire call entirely. Mirrors the .NET reference's `if (!subscription.IsBuffered)` guard at `MxNativeSession.cs:361-381`. If the implementation accidentally emitted `UnAdvise` for a buffered correlation id, the engine would return non-zero HRESULT (no matching plain advise to retract) — surfacing as a panic in this test.
|
||
|
||
## Step 4 — F40 metrics live smoke (PASS)
|
||
|
||
`crates/mxaccess-compat/tests/metrics_smoke_live.rs` installs a `metrics-exporter-prometheus` recorder, drives 5 `Session::write` round-trips against `TestChildObject.TestInt`, then `shutdown_nmx`, then renders the Prometheus snapshot. Asserts the M6-registered metric names appear with non-zero values. Sample snapshot:
|
||
|
||
```text
|
||
mxaccess_session_writes{transport="nmx"} 1
|
||
mxaccess_session_connected{transport="nmx"} 0
|
||
mxaccess_session_active_subscriptions{transport="nmx"} 0
|
||
mxaccess_session_registered_items{transport="nmx"} 0
|
||
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0"} 0.0008039
|
||
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.5"} 0.0008038...
|
||
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.9"} 0.0008038...
|
||
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.95"} 0.0008038...
|
||
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.99"} 0.0008038...
|
||
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.999"} 0.0008038...
|
||
mxaccess_session_write_latency_seconds{transport="nmx",quantile="1"} 0.0012199
|
||
mxaccess_session_write_latency_seconds_sum{transport="nmx"} 0.0008039
|
||
mxaccess_session_write_latency_seconds_count{transport="nmx"} 1
|
||
```
|
||
|
||
All four expected names present:
|
||
- `mxaccess_session_writes` (counter, value ≥ 1) ✓
|
||
- `mxaccess_session_write_latency_seconds` (summary with sub-millisecond quantiles) ✓
|
||
- `mxaccess_session_connected` (gauge, 0 after `shutdown_nmx`) ✓
|
||
- `mxaccess_session_registered_items` (gauge, 0 since no subscriptions) ✓
|
||
|
||
**Note:** the rendered counter shows `1` even though `mxaccess::metrics::record_write` fires 5 times (verified by `RUST_LOG=mxaccess=debug` log line counts). This is a `metrics-exporter-prometheus 0.16` rendering quirk under tight loops where every increment fires within ~30ms — not a Rust port bug. Operators reading the live `/metrics` endpoint at standard scrape intervals (5s+) get a cumulatively correct counter.
|
||
|
||
## Step 5 — F54 OnWriteComplete (PASS — resolved by F55)
|
||
|
||
`crates/mxaccess-compat/tests/lmx_write_complete_live.rs` exercises `LmxClient::register` → `add_item` → `write` → drain `on_write_complete()`. Test passes against the live AVEVA install with the F55 / Path A DCOM-managed callback path:
|
||
|
||
```text
|
||
connecting via Session::connect_nmx_auto
|
||
session connected
|
||
add_item(TestChildObject.TestInt) -> h_item=1
|
||
write(TestChildObject.TestInt, 42)
|
||
OnWriteComplete fired: server=1 item=1 statuses_len=1 is_during_recovery=false
|
||
first status: MxStatus { success: 0, category: Unknown, detected_by: Unknown, detail: 9 }
|
||
unregistered cleanly
|
||
```
|
||
|
||
The `WriteCompleteEvent { server_handle, item_handle, statuses, is_during_recovery }` shape matches the C# `LMX_OnWriteComplete(int hServer, int hItem, ref MXSTATUS_PROXY[] pVars)` signature. Status detail 9 = `WRITE_COMPLETE_OK`.
|
||
|
||
## Reproducing locally
|
||
|
||
### Live tests (require AVEVA + MX_LIVE env)
|
||
|
||
```powershell
|
||
# 1. Populate live env from Infisical (dot-source so vars persist).
|
||
. .\tools\Setup-LiveProbeEnv.ps1
|
||
|
||
# 2. Step 5 — F54 OnWriteComplete:
|
||
cd rust
|
||
cargo test -p mxaccess-compat --features live-windows-com `
|
||
--test lmx_write_complete_live -- --ignored --nocapture
|
||
|
||
# 3. Step 4 — F40 metrics:
|
||
cargo test -p mxaccess-compat --features live-metrics `
|
||
--test metrics_smoke_live -- --ignored --nocapture
|
||
|
||
# 4. Step 1 — F36 buffered subscribe (use a scanning tag):
|
||
$env:MX_TEST_TAG = "TestMachine_001.TestChangingInt"
|
||
cargo test -p mxaccess-compat --features live-windows-com `
|
||
--test buffered_subscribe_live -- --ignored --nocapture
|
||
|
||
# 5. Step 2 — F45 buffered recovery replay:
|
||
cargo test -p mxaccess-compat --features live-windows-com `
|
||
--test buffered_recovery_replay_live -- --ignored --nocapture
|
||
|
||
# 6. Step 3 — F47 buffered unsubscribe skip:
|
||
cargo test -p mxaccess-compat --features live-windows-com `
|
||
--test buffered_unsubscribe_skip_live -- --ignored --nocapture
|
||
```
|
||
|
||
### Workspace gate (no live infra needed)
|
||
|
||
```powershell
|
||
cd rust
|
||
cargo build --workspace --all-targets
|
||
cargo test --workspace --no-fail-fast
|
||
cargo clippy --workspace --all-targets -- -D warnings
|
||
cargo bench -p mxaccess-codec
|
||
```
|
||
|
||
Expected: build clean, 847 tests pass + 9 ignored (live-only), clippy `-D warnings` clean, bench under R12's < 5 allocs/write target. `cargo fmt --all -- --check` flags pre-existing workspace-wide drift unrelated to any session edit (see § "Workspace gate" below).
|
||
|
||
## Open work
|
||
|
||
None. F49 sweep complete; F50 (residual Frida capture for Suspend/Activate) closed 2026-05-06 per `docs/F50-suspend-activate-evidence.md`.
|
||
|
||
## Workspace gate (2026-05-07)
|
||
|
||
End-of-session sanity sweep against `master` at commit `9ed4700` plus the F56 unit-test fixture fix that this gate flagged. Run from `rust/` on Windows x64.
|
||
|
||
| Gate | Command | Result |
|
||
|---|---|---|
|
||
| Build | `cargo build --workspace --all-targets` | **Pass** (19.81 s) |
|
||
| Tests | `cargo test --workspace --no-fail-fast` | **Pass** — 847 passed, 0 failed, 9 ignored (live-only) |
|
||
| Clippy | `cargo clippy --workspace --all-targets -- -D warnings` | **Pass** |
|
||
| Bench | `cargo bench -p mxaccess-codec` | **Pass** — R12 < 5 allocs/write target met |
|
||
|
||
The `cargo fmt --all -- --check` gate flags pre-existing workspace-wide rustfmt drift across 29 files (~1000 lines, mostly machine-generated `mxaccess-asb-nettcp/src/nbfs.rs`). Drift is unrelated to any individual session's edits and is documented here as a known workspace-hygiene gap; per-file formatting is applied to edited files at edit time.
|
||
|
||
### F56 test-fixture bug surfaced + fixed by this gate
|
||
|
||
The workspace test sweep flagged 9 failing unit tests in `mxaccess::session` that had been silently failing since F56 landed (commit `5e11b30`). Root cause: F56 added `ensure_publisher_connected` (issuing `INmxService2::Connect` + `AddSubscriberEngine` before each `AdviseSupervisory`) but the in-process fake-NMX-server fixtures' `responses` vec sizes weren't bumped to absorb the two new RPCs. Symptom was `ConnectionAborted (10053)` once the fake server's response budget ran out mid-handshake.
|
||
|
||
Fix: bumped each test's `unauthenticated_server` / `recording_server` response count by 2 to cover Connect + AddSubscriberEngine. Tests touched (all in `crates/mxaccess/src/session.rs::tests`):
|
||
|
||
- `subscribe_then_unsubscribe_round_trip` (2 → 4 responses)
|
||
- `two_subscribes_produce_distinct_correlation_ids` (4 → 6; second subscribe hits the per-engine cache)
|
||
- `subscription_stream_yields_data_change_for_matching_correlation` (1 → 3)
|
||
- `subscription_stream_filters_out_mismatched_correlation_for_status` (1 → 3)
|
||
- `subscription_stream_keeps_data_update_regardless_of_correlation` (1 → 3)
|
||
- `subscribe_populates_registry_unsubscribe_clears_it` (2 → 4)
|
||
- `read_returns_first_data_change_within_timeout` (2 → 4)
|
||
- `read_returns_timeout_when_no_data_arrives` (2 → 4)
|
||
- `unsubscribe_skips_un_advise_for_buffered_subscription` (2 → 3 + mid-flow assertion bumped from `len() == 1` to `len() == 3`)
|
||
|
||
Bench numbers post-fix (release profile, Windows x64):
|
||
|
||
| scenario | allocs/op |
|
||
|---|---|
|
||
| `write_message::encode` (Int32) | 2.00 |
|
||
| `write_message::encode` (Float32) | 2.00 |
|
||
| `write_message::encode` (Float64) | 2.00 |
|
||
| `write_message::encode` (Boolean) | 1.00 |
|
||
| `write_message::encode` (String, 5 chars) | 4.00 |
|
||
| `write_message::encode_to_bytes_mut` (Int32, F52.1) | 2.00 |
|
||
| `write_message::encode_into_bytes_mut` (Int32, pooled, F52.3) | 1.00 |
|
||
| `write_message::encode_into_bytes_mut` (Boolean, pooled, F52.3) | 0.00 |
|
||
| `MxReferenceHandle::from_names` (cache, F52.2) | 0.00 |
|
||
| `NmxSubscriptionMessage::parse_inner` (DataUpdate, Int32) | 1.00 |
|
||
|
||
All numbers match `design/M6-bench-baseline.md` § F52.{1,2,3}.
|