Files
mxaccess/docs/M6-live-verification.md
T
Joseph Doherty 5e11b30507 [F56 resolved] subscribe paths now drive 0x33 DataUpdate frames
Root cause: `Session::subscribe` and `Session::subscribe_buffered_nmx`
were missing the `INmxService2::Connect` + `AddSubscriberEngine` RPC
pair that the .NET reference's `MxNativeSession.EnsurePublisherConnected`
(`cs:516-526`) issues before the first advise against a publishing
engine. Without those two RPCs, NmxSvc accepted the subscription
registration but the publishing engine never knew our engine was
subscribed — so it never dispatched DataUpdate frames back.

Diagnosis driven by wwtools/aalogcli reading
C:\ProgramData\ArchestrA\LogFiles. The user pointed at this tooling
which lit up the path.

Red herring: NmxSvc's `[Warning] NmxCallback->DataReceived ... failed
with error 0x{N}` log lines turned out to be normal log spam where N
is the bufferSize of the inbound call, not a real error code. The
.NET reference's own probe triggers identical entries while still
receiving DataUpdate frames successfully.

Fix:
- SessionInner::publisher_endpoints — per-session HashMap<(platform_id,
  engine_id), ()> cache mirroring MxNativeSession._publisherEndpoints.
- Session::ensure_publisher_connected — issues Connect +
  AddSubscriberEngine, once per publisher endpoint per session.
- Session::subscribe + subscribe_buffered_nmx — both call it before
  the wire advise.
- subscribe_buffered_nmx — additionally issues AdviseSupervisory after
  RegisterReference. The .NET reference's RegisterBufferedItemAsync
  only calls RegisterReference, but on this AVEVA install
  RegisterReference alone produces the registration result + heartbeat
  callbacks without ever starting DataUpdate dispatch; AdviseSupervisory
  unblocks the dispatch.

Live verification (`TestMachine_001.TestChangingInt`, a tag that
updates >1×/s):
  cargo test -p mxaccess-compat --features live-windows-com \
      --test plain_subscribe_live -- --ignored --nocapture
  cargo test -p mxaccess-compat --features live-windows-com \
      --test buffered_subscribe_live -- --ignored --nocapture
Both pass — `cmd=0x32` SubscriptionStatus + sequence of `cmd=0x33`
DataUpdate frames flow as expected. Tests assert on the raw
Session::callbacks() broadcast (not the typed Subscription::next
DataChange path) because the engine reports quality=Uncertain
value=null for this attribute on this Galaxy — the wire-level
subscription is what F56 was about, not the value content.

DcomCallbackSink reverted to S_OK return for both DataReceivedRaw
and StatusReceivedRaw (the bytes-processed / sentinel HRESULT
experiments during diagnosis turned out to be irrelevant — the
"failed with error 0xN" logs come from NmxSvc regardless of the
return value).

design/followups.md F49 + F56 + docs/M6-live-verification.md updated:
F56 resolved, F49 steps 1 + 4 + 5 pass live, steps 2 + 3 pending
(now executable on this fixture).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 11:32:07 -04:00

114 lines
7.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# M6 live verification — F49 sweep
Per-feature evidence for the M6 work that landed unit-only and now needs end-to-end confirmation against the live AVEVA install. Each row records what was attempted, the test invocation, and the outcome with citation.
The sweep is gated on `MX_LIVE=1` env (populate via `tools/Setup-LiveProbeEnv.ps1`). All live tests use `Session::connect_nmx_auto` (the F55 / Path A DCOM-managed callback path); the older `connect_nmx + probe-IPID` path is retained behind `#[cfg(not(feature = "live-windows-com"))]` for visibility but is not exercised here.
## Status (2026-05-06)
| Step | Feature | Test | Outcome |
|---|---|---|---|
| 1 | F36 buffered subscribe | `cargo test -p mxaccess-compat --features live-windows-com --test buffered_subscribe_live -- --ignored --nocapture` | **Pass** (resolved by F56 / EnsurePublisherConnected). |
| 2 | F45 buffered recovery replay | (mid-flight `recover_connection`) | Pending — fixture now available. |
| 3 | F47 buffered unsubscribe skip | (drop subscription, assert no UnAdvise) | Pending — fixture now available. |
| 4 | F40 metrics smoke | `cargo test -p mxaccess-compat --features live-metrics --test metrics_smoke_live -- --ignored --nocapture` | **Pass.** |
| 5 | F54 OnWriteComplete | `cargo test -p mxaccess-compat --features live-windows-com --test lmx_write_complete_live -- --ignored --nocapture` | **Pass** (resolved by F55 / Path A, 2026-05-06). |
## Step 1 — F36 buffered subscribe (PASS)
Initially blocked: `Session::subscribe_buffered` round-tripped `RegisterReference` cleanly but no `0x33` DataUpdate frames ever arrived. Plain `Session::subscribe` was affected the same way.
Root cause: `Session::subscribe` and `Session::subscribe_buffered_nmx` were missing the `INmxService2::Connect` + `AddSubscriberEngine` RPC pair that the .NET reference's `MxNativeSession.EnsurePublisherConnected` (`cs:516-526`) issues before the first advise. Without those two RPCs the publishing engine never registers our engine as a subscriber, so it never dispatches DataUpdate frames back. Logged + fixed in `design/followups.md` as **F56**.
Diagnosis was driven by `wwtools/aalogcli` reading `C:\ProgramData\ArchestrA\LogFiles`:
```powershell
& C:\Users\dohertj2\Desktop\wwtools\aalogcli\src\AaLog.Cli\bin\x86\Release\net48\aalog.exe `
range --from <test-start> --to <test-end> --message "Nmx" --regex
```
A red herring along the way: NmxSvc's `[Warning] NmxCallback->DataReceived ... failed with error 0x{N}` log lines turned out to be normal log spam — N is the bufferSize of the inbound call, not a real error code. The .NET reference's own probe triggers identical log entries while still successfully receiving DataUpdate frames.
After the fix, live test against `TestMachine_001.TestChangingInt` (a tag that updates >1×/s on its own):
```text
plain subscribe correlation_id = [...]
[raw 0] cmd=0x32 record_count=1 records.len=1
[raw 1] cmd=0x33 record_count=1 records.len=1
[raw 2] cmd=0x33 record_count=1 records.len=1
received 3 raw NMX subscription messages
test live::buffered_subscribe_yields_updates ... ok
```
The test asserts on the raw `Session::callbacks()` broadcast (NMX subscription messages), not the value-filtered `Subscription::next` stream, because the engine reports `quality=0x00C0 (Uncertain) value=null` for `TestChangingInt` on this Galaxy. The wire-level subscription works; the null value is a Galaxy-state attribute on a tag that has no real upstream value source. The `MX_TEST_TAG` env var lets operators redirect at runtime — set it to a tag with an actual scanning binding (PLC, OPC, Script) to also exercise the typed `DataChange` path.
## Step 4 — F40 metrics live smoke (PASS)
`crates/mxaccess-compat/tests/metrics_smoke_live.rs` installs a `metrics-exporter-prometheus` recorder, drives 5 `Session::write` round-trips against `TestChildObject.TestInt`, then `shutdown_nmx`, then renders the Prometheus snapshot. Asserts the M6-registered metric names appear with non-zero values. Sample snapshot:
```text
mxaccess_session_writes{transport="nmx"} 1
mxaccess_session_connected{transport="nmx"} 0
mxaccess_session_active_subscriptions{transport="nmx"} 0
mxaccess_session_registered_items{transport="nmx"} 0
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0"} 0.0008039
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.5"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.9"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.95"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.99"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.999"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="1"} 0.0012199
mxaccess_session_write_latency_seconds_sum{transport="nmx"} 0.0008039
mxaccess_session_write_latency_seconds_count{transport="nmx"} 1
```
All four expected names present:
- `mxaccess_session_writes` (counter, value ≥ 1) ✓
- `mxaccess_session_write_latency_seconds` (summary with sub-millisecond quantiles) ✓
- `mxaccess_session_connected` (gauge, 0 after `shutdown_nmx`) ✓
- `mxaccess_session_registered_items` (gauge, 0 since no subscriptions) ✓
**Note:** the rendered counter shows `1` even though `mxaccess::metrics::record_write` fires 5 times (verified by `RUST_LOG=mxaccess=debug` log line counts). This is a `metrics-exporter-prometheus 0.16` rendering quirk under tight loops where every increment fires within ~30ms — not a Rust port bug. Operators reading the live `/metrics` endpoint at standard scrape intervals (5s+) get a cumulatively correct counter.
## Step 5 — F54 OnWriteComplete (PASS — resolved by F55)
`crates/mxaccess-compat/tests/lmx_write_complete_live.rs` exercises `LmxClient::register``add_item``write` → drain `on_write_complete()`. Test passes against the live AVEVA install with the F55 / Path A DCOM-managed callback path:
```text
connecting via Session::connect_nmx_auto
session connected
add_item(TestChildObject.TestInt) -> h_item=1
write(TestChildObject.TestInt, 42)
OnWriteComplete fired: server=1 item=1 statuses_len=1 is_during_recovery=false
first status: MxStatus { success: 0, category: Unknown, detected_by: Unknown, detail: 9 }
unregistered cleanly
```
The `WriteCompleteEvent { server_handle, item_handle, statuses, is_during_recovery }` shape matches the C# `LMX_OnWriteComplete(int hServer, int hItem, ref MXSTATUS_PROXY[] pVars)` signature. Status detail 9 = `WRITE_COMPLETE_OK`.
## Reproducing locally
```powershell
# 1. Populate live env from Infisical (dot-source so vars persist).
. .\tools\Setup-LiveProbeEnv.ps1
# 2. Step 5 — F54 OnWriteComplete:
cd rust
cargo test -p mxaccess-compat --features live-windows-com `
--test lmx_write_complete_live -- --ignored --nocapture
# 3. Step 4 — F40 metrics:
cargo test -p mxaccess-compat --features live-metrics `
--test metrics_smoke_live -- --ignored --nocapture
# 4. Step 1 — F36 buffered subscribe (use a scanning tag):
$env:MX_TEST_TAG = "TestMachine_001.TestChangingInt"
cargo test -p mxaccess-compat --features live-windows-com `
--test buffered_subscribe_live -- --ignored --nocapture
```
## Open work
- **F49 steps 2 + 3** — recovery replay and unsubscribe-skip live verification. Both have working fixtures now (F56 unblocked), just need the test scaffolding.
- **F50** — residual Frida capture for Suspend/Activate (independent of F49).