Files
mxaccess/docs/M6-live-verification.md
T
Joseph Doherty 5e11b30507 [F56 resolved] subscribe paths now drive 0x33 DataUpdate frames
Root cause: `Session::subscribe` and `Session::subscribe_buffered_nmx`
were missing the `INmxService2::Connect` + `AddSubscriberEngine` RPC
pair that the .NET reference's `MxNativeSession.EnsurePublisherConnected`
(`cs:516-526`) issues before the first advise against a publishing
engine. Without those two RPCs, NmxSvc accepted the subscription
registration but the publishing engine never knew our engine was
subscribed — so it never dispatched DataUpdate frames back.

Diagnosis driven by wwtools/aalogcli reading
C:\ProgramData\ArchestrA\LogFiles. The user pointed at this tooling
which lit up the path.

Red herring: NmxSvc's `[Warning] NmxCallback->DataReceived ... failed
with error 0x{N}` log lines turned out to be normal log spam where N
is the bufferSize of the inbound call, not a real error code. The
.NET reference's own probe triggers identical entries while still
receiving DataUpdate frames successfully.

Fix:
- SessionInner::publisher_endpoints — per-session HashMap<(platform_id,
  engine_id), ()> cache mirroring MxNativeSession._publisherEndpoints.
- Session::ensure_publisher_connected — issues Connect +
  AddSubscriberEngine, once per publisher endpoint per session.
- Session::subscribe + subscribe_buffered_nmx — both call it before
  the wire advise.
- subscribe_buffered_nmx — additionally issues AdviseSupervisory after
  RegisterReference. The .NET reference's RegisterBufferedItemAsync
  only calls RegisterReference, but on this AVEVA install
  RegisterReference alone produces the registration result + heartbeat
  callbacks without ever starting DataUpdate dispatch; AdviseSupervisory
  unblocks the dispatch.

Live verification (`TestMachine_001.TestChangingInt`, a tag that
updates >1×/s):
  cargo test -p mxaccess-compat --features live-windows-com \
      --test plain_subscribe_live -- --ignored --nocapture
  cargo test -p mxaccess-compat --features live-windows-com \
      --test buffered_subscribe_live -- --ignored --nocapture
Both pass — `cmd=0x32` SubscriptionStatus + sequence of `cmd=0x33`
DataUpdate frames flow as expected. Tests assert on the raw
Session::callbacks() broadcast (not the typed Subscription::next
DataChange path) because the engine reports quality=Uncertain
value=null for this attribute on this Galaxy — the wire-level
subscription is what F56 was about, not the value content.

DcomCallbackSink reverted to S_OK return for both DataReceivedRaw
and StatusReceivedRaw (the bytes-processed / sentinel HRESULT
experiments during diagnosis turned out to be irrelevant — the
"failed with error 0xN" logs come from NmxSvc regardless of the
return value).

design/followups.md F49 + F56 + docs/M6-live-verification.md updated:
F56 resolved, F49 steps 1 + 4 + 5 pass live, steps 2 + 3 pending
(now executable on this fixture).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 11:32:07 -04:00

7.4 KiB
Raw Blame History

M6 live verification — F49 sweep

Per-feature evidence for the M6 work that landed unit-only and now needs end-to-end confirmation against the live AVEVA install. Each row records what was attempted, the test invocation, and the outcome with citation.

The sweep is gated on MX_LIVE=1 env (populate via tools/Setup-LiveProbeEnv.ps1). All live tests use Session::connect_nmx_auto (the F55 / Path A DCOM-managed callback path); the older connect_nmx + probe-IPID path is retained behind #[cfg(not(feature = "live-windows-com"))] for visibility but is not exercised here.

Status (2026-05-06)

Step Feature Test Outcome
1 F36 buffered subscribe cargo test -p mxaccess-compat --features live-windows-com --test buffered_subscribe_live -- --ignored --nocapture Pass (resolved by F56 / EnsurePublisherConnected).
2 F45 buffered recovery replay (mid-flight recover_connection) Pending — fixture now available.
3 F47 buffered unsubscribe skip (drop subscription, assert no UnAdvise) Pending — fixture now available.
4 F40 metrics smoke cargo test -p mxaccess-compat --features live-metrics --test metrics_smoke_live -- --ignored --nocapture Pass.
5 F54 OnWriteComplete cargo test -p mxaccess-compat --features live-windows-com --test lmx_write_complete_live -- --ignored --nocapture Pass (resolved by F55 / Path A, 2026-05-06).

Step 1 — F36 buffered subscribe (PASS)

Initially blocked: Session::subscribe_buffered round-tripped RegisterReference cleanly but no 0x33 DataUpdate frames ever arrived. Plain Session::subscribe was affected the same way.

Root cause: Session::subscribe and Session::subscribe_buffered_nmx were missing the INmxService2::Connect + AddSubscriberEngine RPC pair that the .NET reference's MxNativeSession.EnsurePublisherConnected (cs:516-526) issues before the first advise. Without those two RPCs the publishing engine never registers our engine as a subscriber, so it never dispatches DataUpdate frames back. Logged + fixed in design/followups.md as F56.

Diagnosis was driven by wwtools/aalogcli reading C:\ProgramData\ArchestrA\LogFiles:

& C:\Users\dohertj2\Desktop\wwtools\aalogcli\src\AaLog.Cli\bin\x86\Release\net48\aalog.exe `
    range --from <test-start> --to <test-end> --message "Nmx" --regex

A red herring along the way: NmxSvc's [Warning] NmxCallback->DataReceived ... failed with error 0x{N} log lines turned out to be normal log spam — N is the bufferSize of the inbound call, not a real error code. The .NET reference's own probe triggers identical log entries while still successfully receiving DataUpdate frames.

After the fix, live test against TestMachine_001.TestChangingInt (a tag that updates >1×/s on its own):

plain subscribe correlation_id = [...]
[raw 0] cmd=0x32 record_count=1 records.len=1
[raw 1] cmd=0x33 record_count=1 records.len=1
[raw 2] cmd=0x33 record_count=1 records.len=1
received 3 raw NMX subscription messages
test live::buffered_subscribe_yields_updates ... ok

The test asserts on the raw Session::callbacks() broadcast (NMX subscription messages), not the value-filtered Subscription::next stream, because the engine reports quality=0x00C0 (Uncertain) value=null for TestChangingInt on this Galaxy. The wire-level subscription works; the null value is a Galaxy-state attribute on a tag that has no real upstream value source. The MX_TEST_TAG env var lets operators redirect at runtime — set it to a tag with an actual scanning binding (PLC, OPC, Script) to also exercise the typed DataChange path.

Step 4 — F40 metrics live smoke (PASS)

crates/mxaccess-compat/tests/metrics_smoke_live.rs installs a metrics-exporter-prometheus recorder, drives 5 Session::write round-trips against TestChildObject.TestInt, then shutdown_nmx, then renders the Prometheus snapshot. Asserts the M6-registered metric names appear with non-zero values. Sample snapshot:

mxaccess_session_writes{transport="nmx"} 1
mxaccess_session_connected{transport="nmx"} 0
mxaccess_session_active_subscriptions{transport="nmx"} 0
mxaccess_session_registered_items{transport="nmx"} 0
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0"} 0.0008039
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.5"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.9"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.95"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.99"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="0.999"} 0.0008038...
mxaccess_session_write_latency_seconds{transport="nmx",quantile="1"} 0.0012199
mxaccess_session_write_latency_seconds_sum{transport="nmx"} 0.0008039
mxaccess_session_write_latency_seconds_count{transport="nmx"} 1

All four expected names present:

  • mxaccess_session_writes (counter, value ≥ 1) ✓
  • mxaccess_session_write_latency_seconds (summary with sub-millisecond quantiles) ✓
  • mxaccess_session_connected (gauge, 0 after shutdown_nmx) ✓
  • mxaccess_session_registered_items (gauge, 0 since no subscriptions) ✓

Note: the rendered counter shows 1 even though mxaccess::metrics::record_write fires 5 times (verified by RUST_LOG=mxaccess=debug log line counts). This is a metrics-exporter-prometheus 0.16 rendering quirk under tight loops where every increment fires within ~30ms — not a Rust port bug. Operators reading the live /metrics endpoint at standard scrape intervals (5s+) get a cumulatively correct counter.

Step 5 — F54 OnWriteComplete (PASS — resolved by F55)

crates/mxaccess-compat/tests/lmx_write_complete_live.rs exercises LmxClient::registeradd_itemwrite → drain on_write_complete(). Test passes against the live AVEVA install with the F55 / Path A DCOM-managed callback path:

connecting via Session::connect_nmx_auto
session connected
add_item(TestChildObject.TestInt) -> h_item=1
write(TestChildObject.TestInt, 42)
OnWriteComplete fired: server=1 item=1 statuses_len=1 is_during_recovery=false
first status: MxStatus { success: 0, category: Unknown, detected_by: Unknown, detail: 9 }
unregistered cleanly

The WriteCompleteEvent { server_handle, item_handle, statuses, is_during_recovery } shape matches the C# LMX_OnWriteComplete(int hServer, int hItem, ref MXSTATUS_PROXY[] pVars) signature. Status detail 9 = WRITE_COMPLETE_OK.

Reproducing locally

# 1. Populate live env from Infisical (dot-source so vars persist).
. .\tools\Setup-LiveProbeEnv.ps1

# 2. Step 5 — F54 OnWriteComplete:
cd rust
cargo test -p mxaccess-compat --features live-windows-com `
    --test lmx_write_complete_live -- --ignored --nocapture

# 3. Step 4 — F40 metrics:
cargo test -p mxaccess-compat --features live-metrics `
    --test metrics_smoke_live -- --ignored --nocapture

# 4. Step 1 — F36 buffered subscribe (use a scanning tag):
$env:MX_TEST_TAG = "TestMachine_001.TestChangingInt"
cargo test -p mxaccess-compat --features live-windows-com `
    --test buffered_subscribe_live -- --ignored --nocapture

Open work

  • F49 steps 2 + 3 — recovery replay and unsubscribe-skip live verification. Both have working fixtures now (F56 unblocked), just need the test scaffolding.
  • F50 — residual Frida capture for Suspend/Activate (independent of F49).