Files
mxaccess/docs/M6-live-verification.md
T
Joseph Doherty 9ed4700eb4 docs: audit pass — fix stale F-number references
Walked all 18 docs/*.md for stale followup references and outdated
TODO markers. Two real fixes:

docs/M6-buffered-evidence.md:
- Three references to "F45" for the LMX-proxy Suspend/Activate
  Frida instrumentation were stale. That work was actually filed
  as F46 when the followups list got renumbered (F45 was reassigned
  to "Recovery replay should re-issue RegisterReference for
  buffered subscriptions"). F46 landed in commit 808fea1, and the
  follow-up live capture landed as F50 in commit 349e217.
- Updated all three references to point at F46 + F50 + the
  resolution evidence in docs/F50-suspend-activate-evidence.md.
- Renamed the "Sub-followup filed: F45" section to
  "Sub-followup F46 — RESOLVED 2026-05-06" with the verdict from
  the live capture.

docs/M6-live-verification.md:
- "Open work" section listed F50 as a residual gap. F50 closed
  2026-05-06 per docs/F50-suspend-activate-evidence.md. Updated
  to "None. F49 sweep complete; F50 closed".

Other docs scanned, no real staleness:
- Capture-Run-2026-04-25.md, Current-Sprint-State.md,
  DotNet10-Native-Library-Plan.md — historical snapshot docs,
  intentionally pinned to their dates.
- ASB-Native-Integration-Decision.md, MxNativeSession-API.md,
  NMX-COM-Contracts.md, MXAccess-* — describe the .NET reference's
  state; "not yet" wording reflects the .NET planning context, not
  the Rust port's current state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 04:32:28 -04:00

161 lines
10 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# M6 live verification — F49 sweep
Per-feature evidence for the M6 work that landed unit-only and now needs end-to-end confirmation against the live AVEVA install. Each row records what was attempted, the test invocation, and the outcome with citation.
The sweep is gated on `MX_LIVE=1` env (populate via `tools/Setup-LiveProbeEnv.ps1`). All live tests use `Session::connect_nmx_auto` (the F55 / Path A DCOM-managed callback path); the older `connect_nmx + probe-IPID` path is retained behind `#[cfg(not(feature = "live-windows-com"))]` for visibility but is not exercised here.
## Status (2026-05-06)
| Step | Feature | Test | Outcome |
|---|---|---|---|
| 1 | F36 buffered subscribe | `cargo test -p mxaccess-compat --features live-windows-com --test buffered_subscribe_live -- --ignored --nocapture` | **Pass** (resolved by F56 / EnsurePublisherConnected). |
| 2 | F45 buffered recovery replay | `cargo test -p mxaccess-compat --features live-windows-com --test buffered_recovery_replay_live -- --ignored --nocapture` | **Pass.** |
| 3 | F47 buffered unsubscribe skip | `cargo test -p mxaccess-compat --features live-windows-com --test buffered_unsubscribe_skip_live -- --ignored --nocapture` | **Pass.** |
| 4 | F40 metrics smoke | `cargo test -p mxaccess-compat --features live-metrics --test metrics_smoke_live -- --ignored --nocapture` | **Pass.** |
| 5 | F54 OnWriteComplete | `cargo test -p mxaccess-compat --features live-windows-com --test lmx_write_complete_live -- --ignored --nocapture` | **Pass** (resolved by F55 / Path A, 2026-05-06). |
## Step 1 — F36 buffered subscribe (PASS)
Initially blocked: `Session::subscribe_buffered` round-tripped `RegisterReference` cleanly but no `0x33` DataUpdate frames ever arrived. Plain `Session::subscribe` was affected the same way.
Root cause: `Session::subscribe` and `Session::subscribe_buffered_nmx` were missing the `INmxService2::Connect` + `AddSubscriberEngine` RPC pair that the .NET reference's `MxNativeSession.EnsurePublisherConnected` (`cs:516-526`) issues before the first advise. Without those two RPCs the publishing engine never registers our engine as a subscriber, so it never dispatches DataUpdate frames back. Logged + fixed in `design/followups.md` as **F56**.
Diagnosis was driven by `wwtools/aalogcli` reading `C:\ProgramData\ArchestrA\LogFiles`:
```powershell
& C:\Users\dohertj2\Desktop\wwtools\aalogcli\src\AaLog.Cli\bin\x86\Release\net48\aalog.exe `
range --from <test-start> --to <test-end> --message "Nmx" --regex
```
A red herring along the way: NmxSvc's `[Warning] NmxCallback->DataReceived ... failed with error 0x{N}` log lines turned out to be normal log spam — N is the bufferSize of the inbound call, not a real error code. The .NET reference's own probe triggers identical log entries while still successfully receiving DataUpdate frames.
After the fix, live test against `TestMachine_001.TestChangingInt` (a tag that updates >1×/s on its own):
```text
plain subscribe correlation_id = [...]
[raw 0] cmd=0x32 record_count=1 records.len=1
[raw 1] cmd=0x33 record_count=1 records.len=1
[raw 2] cmd=0x33 record_count=1 records.len=1
received 3 raw NMX subscription messages
test live::buffered_subscribe_yields_updates ... ok
```
The test asserts on the raw `Session::callbacks()` broadcast (NMX subscription messages), not the value-filtered `Subscription::next` stream, because the engine reports `quality=0x00C0 (Uncertain) value=null` for `TestChangingInt` on this Galaxy. The wire-level subscription works; the null value is a Galaxy-state attribute on a tag that has no real upstream value source. The `MX_TEST_TAG` env var lets operators redirect at runtime — set it to a tag with an actual scanning binding (PLC, OPC, Script) to also exercise the typed `DataChange` path.
## Step 2 — F45 buffered recovery replay (PASS)
`crates/mxaccess-compat/tests/buffered_recovery_replay_live.rs`:
1. Subscribe buffered to `TestMachine_001.TestChangingInt`.
2. Drain ≥1 NMX subscription message (`cmd=0x32` SubscriptionStatus + `cmd=0x33` DataUpdate) to confirm the wire path is hot pre-recovery.
3. Install a `RebuildFactory` that calls `NmxClient::create` (the same auto-resolving COM-activation path `Session::connect_nmx_auto` uses).
4. Call `Session::recover_connection(RecoveryPolicy::default())`.
5. Drain ≥1 NMX subscription message post-recovery.
```text
buffered subscribed, correlation_id = [...]
[pre-recovery 0] cmd=0x32 record_count=1
[pre-recovery 1] cmd=0x33 record_count=1
pre-recovery: drained 2 NMX subscription messages
triggering recover_connection
recover_connection returned Ok — F45 buffered replay path executed
[post-recovery 0] cmd=0x33 record_count=1
[post-recovery 1] cmd=0x33 record_count=1
post-recovery: drained 2 NMX subscription messages
```
The replay branch in `recover_connection_core` (`session.rs:1428-...`) re-issues `RegisterReference` (NOT `AdviseSupervisory`) for the buffered entry, mirroring `MxNativeSession.ReAdviseSubscription` (`cs:538-569`). Structural property is unit-tested; this live test confirms the engine actually picks back up after the rebuild + replay.
## Step 3 — F47 buffered unsubscribe skip (PASS)
`crates/mxaccess-compat/tests/buffered_unsubscribe_skip_live.rs`:
1. Subscribe buffered to `TestMachine_001.TestChangingInt`.
2. Sleep 750ms so the engine has DataUpdate frames in flight.
3. Call `Session::unsubscribe(sub)`.
4. Assert it returned `Ok` without surfacing transport or HRESULT errors.
```text
buffered subscribed, correlation_id = [...]
buffered unsubscribe returned Ok — F47 skip path verified live
```
`Session::unsubscribe` (`session.rs:2261`) probes the registry for the subscription's mode; if `Buffered { .. }`, it skips the `nmx.un_advise(...)` wire call entirely. Mirrors the .NET reference's `if (!subscription.IsBuffered)` guard at `MxNativeSession.cs:361-381`. If the implementation accidentally emitted `UnAdvise` for a buffered correlation id, the engine would return non-zero HRESULT (no matching plain advise to retract) — surfacing as a panic in this test.
## Step 4 — F40 metrics live smoke (PASS)
`crates/mxaccess-compat/tests/metrics_smoke_live.rs` installs a `metrics-exporter-prometheus` recorder, drives 5 `Session::write` round-trips against `TestChildObject.TestInt`, then `shutdown_nmx`, then renders the Prometheus snapshot. Asserts the M6-registered metric names appear with non-zero values. Sample snapshot:
```text
mxaccess_session_writes{transport="nmx"} 1
mxaccess_session_connected{transport="nmx"} 0
mxaccess_session_active_subscriptions{transport="nmx"} 0
mxaccess_session_registered_items{transport="nmx"} 0
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0"} 0.0008039
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.5"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.9"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.95"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.99"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.999"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="1"} 0.0012199
mxaccess_session_write_latency_seconds_sum{transport="nmx"} 0.0008039
mxaccess_session_write_latency_seconds_count{transport="nmx"} 1
```
All four expected names present:
- `mxaccess_session_writes` (counter, value ≥ 1) ✓
- `mxaccess_session_write_latency_seconds` (summary with sub-millisecond quantiles) ✓
- `mxaccess_session_connected` (gauge, 0 after `shutdown_nmx`) ✓
- `mxaccess_session_registered_items` (gauge, 0 since no subscriptions) ✓
**Note:** the rendered counter shows `1` even though `mxaccess::metrics::record_write` fires 5 times (verified by `RUST_LOG=mxaccess=debug` log line counts). This is a `metrics-exporter-prometheus 0.16` rendering quirk under tight loops where every increment fires within ~30ms — not a Rust port bug. Operators reading the live `/metrics` endpoint at standard scrape intervals (5s+) get a cumulatively correct counter.
## Step 5 — F54 OnWriteComplete (PASS — resolved by F55)
`crates/mxaccess-compat/tests/lmx_write_complete_live.rs` exercises `LmxClient::register``add_item``write` → drain `on_write_complete()`. Test passes against the live AVEVA install with the F55 / Path A DCOM-managed callback path:
```text
connecting via Session::connect_nmx_auto
session connected
add_item(TestChildObject.TestInt) -> h_item=1
write(TestChildObject.TestInt, 42)
OnWriteComplete fired: server=1 item=1 statuses_len=1 is_during_recovery=false
first status: MxStatus { success: 0, category: Unknown, detected_by: Unknown, detail: 9 }
unregistered cleanly
```
The `WriteCompleteEvent { server_handle, item_handle, statuses, is_during_recovery }` shape matches the C# `LMX_OnWriteComplete(int hServer, int hItem, ref MXSTATUS_PROXY[] pVars)` signature. Status detail 9 = `WRITE_COMPLETE_OK`.
## Reproducing locally
```powershell
# 1. Populate live env from Infisical (dot-source so vars persist).
. .\tools\Setup-LiveProbeEnv.ps1
# 2. Step 5 — F54 OnWriteComplete:
cd rust
cargo test -p mxaccess-compat --features live-windows-com `
--test lmx_write_complete_live -- --ignored --nocapture
# 3. Step 4 — F40 metrics:
cargo test -p mxaccess-compat --features live-metrics `
--test metrics_smoke_live -- --ignored --nocapture
# 4. Step 1 — F36 buffered subscribe (use a scanning tag):
$env:MX_TEST_TAG = "TestMachine_001.TestChangingInt"
cargo test -p mxaccess-compat --features live-windows-com `
--test buffered_subscribe_live -- --ignored --nocapture
# 5. Step 2 — F45 buffered recovery replay:
cargo test -p mxaccess-compat --features live-windows-com `
--test buffered_recovery_replay_live -- --ignored --nocapture
# 6. Step 3 — F47 buffered unsubscribe skip:
cargo test -p mxaccess-compat --features live-windows-com `
--test buffered_unsubscribe_skip_live -- --ignored --nocapture
```
## Open work
None. F49 sweep complete; F50 (residual Frida capture for Suspend/Activate) closed 2026-05-06 per `docs/F50-suspend-activate-evidence.md`.