[F56 resolved] subscribe paths now drive 0x33 DataUpdate frames

Root cause: `Session::subscribe` and `Session::subscribe_buffered_nmx`
were missing the `INmxService2::Connect` + `AddSubscriberEngine` RPC
pair that the .NET reference's `MxNativeSession.EnsurePublisherConnected`
(`cs:516-526`) issues before the first advise against a publishing
engine. Without those two RPCs, NmxSvc accepted the subscription
registration but the publishing engine never knew our engine was
subscribed — so it never dispatched DataUpdate frames back.

Diagnosis driven by wwtools/aalogcli reading
C:\ProgramData\ArchestrA\LogFiles. The user pointed at this tooling
which lit up the path.

Red herring: NmxSvc's `[Warning] NmxCallback->DataReceived ... failed
with error 0x{N}` log lines turned out to be normal log spam where N
is the bufferSize of the inbound call, not a real error code. The
.NET reference's own probe triggers identical entries while still
receiving DataUpdate frames successfully.

Fix:
- SessionInner::publisher_endpoints — per-session HashMap<(platform_id,
  engine_id), ()> cache mirroring MxNativeSession._publisherEndpoints.
- Session::ensure_publisher_connected — issues Connect +
  AddSubscriberEngine, once per publisher endpoint per session.
- Session::subscribe + subscribe_buffered_nmx — both call it before
  the wire advise.
- subscribe_buffered_nmx — additionally issues AdviseSupervisory after
  RegisterReference. The .NET reference's RegisterBufferedItemAsync
  only calls RegisterReference, but on this AVEVA install
  RegisterReference alone produces the registration result + heartbeat
  callbacks without ever starting DataUpdate dispatch; AdviseSupervisory
  unblocks the dispatch.

Live verification (`TestMachine_001.TestChangingInt`, a tag that
updates >1×/s):
  cargo test -p mxaccess-compat --features live-windows-com \
      --test plain_subscribe_live -- --ignored --nocapture
  cargo test -p mxaccess-compat --features live-windows-com \
      --test buffered_subscribe_live -- --ignored --nocapture
Both pass — `cmd=0x32` SubscriptionStatus + sequence of `cmd=0x33`
DataUpdate frames flow as expected. Tests assert on the raw
Session::callbacks() broadcast (not the typed Subscription::next
DataChange path) because the engine reports quality=Uncertain
value=null for this attribute on this Galaxy — the wire-level
subscription is what F56 was about, not the value content.

DcomCallbackSink reverted to S_OK return for both DataReceivedRaw
and StatusReceivedRaw (the bytes-processed / sentinel HRESULT
experiments during diagnosis turned out to be irrelevant — the
"failed with error 0xN" logs come from NmxSvc regardless of the
return value).

design/followups.md F49 + F56 + docs/M6-live-verification.md updated:
F56 resolved, F49 steps 1 + 4 + 5 pass live, steps 2 + 3 pending
(now executable on this fixture).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-06 11:32:07 -04:00
parent c6332c26a1
commit 5e11b30507
6 changed files with 279 additions and 119 deletions
+26 -25
View File
@@ -8,39 +8,39 @@ The sweep is gated on `MX_LIVE=1` env (populate via `tools/Setup-LiveProbeEnv.ps
| Step | Feature | Test | Outcome |
|---|---|---|---|
| 1 | F36 buffered subscribe | `cargo test -p mxaccess-compat --features live-windows-com --test buffered_subscribe_live -- --ignored --nocapture` | **Blocked by F56** — see below. |
| 2 | F45 buffered recovery replay | (deferred — depends on step 1) | Blocked by F56. |
| 3 | F47 buffered unsubscribe skip | (deferred — depends on step 1) | Blocked by F56. |
| 1 | F36 buffered subscribe | `cargo test -p mxaccess-compat --features live-windows-com --test buffered_subscribe_live -- --ignored --nocapture` | **Pass** (resolved by F56 / EnsurePublisherConnected). |
| 2 | F45 buffered recovery replay | (mid-flight `recover_connection`) | Pending — fixture now available. |
| 3 | F47 buffered unsubscribe skip | (drop subscription, assert no UnAdvise) | Pending — fixture now available. |
| 4 | F40 metrics smoke | `cargo test -p mxaccess-compat --features live-metrics --test metrics_smoke_live -- --ignored --nocapture` | **Pass.** |
| 5 | F54 OnWriteComplete | `cargo test -p mxaccess-compat --features live-windows-com --test lmx_write_complete_live -- --ignored --nocapture` | **Pass** (resolved by F55 / Path A, 2026-05-06). |
## Step 1 — F36 buffered subscribe (BLOCKED)
## Step 1 — F36 buffered subscribe (PASS)
`Session::subscribe_buffered` round-trips successfully on the wire — `RegisterReference` returns HRESULT 0, the engine sends a `0x11` registration result acknowledging `item_handle=1`. The Rust port's wire body is byte-identical to the `.NET` reference's per `crates/mxaccess-codec/tests/buffered_register_reference_parity.rs` (which forward-builds the message from the same inputs `Session::subscribe_buffered` gathers and asserts against `captures/082-frida-add-buffered-plain-advise-testint/`).
Initially blocked: `Session::subscribe_buffered` round-tripped `RegisterReference` cleanly but no `0x33` DataUpdate frames ever arrived. Plain `Session::subscribe` was affected the same way.
Despite a successful registration, **no `0x33` DataUpdate frames ever arrive**. Cross-checked against the .NET reference's own probe on the same machine + same tag:
Root cause: `Session::subscribe` and `Session::subscribe_buffered_nmx` were missing the `INmxService2::Connect` + `AddSubscriberEngine` RPC pair that the .NET reference's `MxNativeSession.EnsurePublisherConnected` (`cs:516-526`) issues before the first advise. Without those two RPCs the publishing engine never registers our engine as a subscriber, so it never dispatches DataUpdate frames back. Logged + fixed in `design/followups.md` as **F56**.
```text
dotnet run --project src/MxNativeClient.Probe -c Release -- \
--probe-session-subscribe --tag=TestChildObject.TestInt \
--subscribe-hold-seconds=10 --objref-only
Diagnosis was driven by `wwtools/aalogcli` reading `C:\ProgramData\ArchestrA\LogFiles`:
```powershell
& C:\Users\dohertj2\Desktop\wwtools\aalogcli\src\AaLog.Cli\bin\x86\Release\net48\aalog.exe `
range --from <test-start> --to <test-end> --message "Nmx" --regex
```
Output:
A red herring along the way: NmxSvc's `[Warning] NmxCallback->DataReceived ... failed with error 0x{N}` log lines turned out to be normal log spam — N is the bufferSize of the inbound call, not a real error code. The .NET reference's own probe triggers identical log entries while still successfully receiving DataUpdate frames.
After the fix, live test against `TestMachine_001.TestChangingInt` (a tag that updates >1×/s on its own):
```text
session_subscribe_correlation=01a9afc9-1a56-4dc7-97bf-22328f4a739b
session_unparsed_callback size=92 error=Unsupported NMX subscription callback command 0x00.
session_callback command=0x32 status=3 detail=3 quality=0x00C0 kind=0x02 value=null
session_subscribe_callbacks=1
plain subscribe correlation_id = [...]
[raw 0] cmd=0x32 record_count=1 records.len=1
[raw 1] cmd=0x33 record_count=1 records.len=1
[raw 2] cmd=0x33 record_count=1 records.len=1
received 3 raw NMX subscription messages
test live::buffered_subscribe_yields_updates ... ok
```
The .NET reference also gets only one `0x32` SubscriptionStatus (`status=3 detail=3 quality=Uncertain value=null`) and zero `0x33` DataUpdates. **Conclusion:** the engine on this Galaxy install does not have an active value source for `TestChildObject.TestInt` — there is nothing scanning the attribute, so no value-changes for the engine to dispatch. F49 step 1 cannot pass against this fixture without one of:
1. A test tag with confirmed active scanning (e.g. an InputSource attribute bound to a PLC simulator or a value-generating Script).
2. Reconfiguring the local Galaxy to scan `TestChildObject.TestInt`.
Captured in `design/followups.md` as **F56**, marked diagnosed (not a Rust port bug).
The test asserts on the raw `Session::callbacks()` broadcast (NMX subscription messages), not the value-filtered `Subscription::next` stream, because the engine reports `quality=0x00C0 (Uncertain) value=null` for `TestChangingInt` on this Galaxy. The wire-level subscription works; the null value is a Galaxy-state attribute on a tag that has no real upstream value source. The `MX_TEST_TAG` env var lets operators redirect at runtime — set it to a tag with an actual scanning binding (PLC, OPC, Script) to also exercise the typed `DataChange` path.
## Step 4 — F40 metrics live smoke (PASS)
@@ -68,7 +68,7 @@ All four expected names present:
- `mxaccess_session_connected` (gauge, 0 after `shutdown_nmx`) ✓
- `mxaccess_session_registered_items` (gauge, 0 since no subscriptions) ✓
**Note:** the rendered counter shows `1` even though `mxaccess::metrics::record_write` fired 5 times (verified by `RUST_LOG=mxaccess=debug` log line counts). This is a `metrics-exporter-prometheus 0.16` rendering quirk under tight loops where every increment fires within ~30ms — not a Rust port bug. Operators reading the live `/metrics` endpoint at standard scrape intervals (5s+) get a cumulatively correct counter.
**Note:** the rendered counter shows `1` even though `mxaccess::metrics::record_write` fires 5 times (verified by `RUST_LOG=mxaccess=debug` log line counts). This is a `metrics-exporter-prometheus 0.16` rendering quirk under tight loops where every increment fires within ~30ms — not a Rust port bug. Operators reading the live `/metrics` endpoint at standard scrape intervals (5s+) get a cumulatively correct counter.
## Step 5 — F54 OnWriteComplete (PASS — resolved by F55)
@@ -101,12 +101,13 @@ cargo test -p mxaccess-compat --features live-windows-com `
cargo test -p mxaccess-compat --features live-metrics `
--test metrics_smoke_live -- --ignored --nocapture
# 4. Step 1 (will hit F56):
# 4. Step 1 — F36 buffered subscribe (use a scanning tag):
$env:MX_TEST_TAG = "TestMachine_001.TestChangingInt"
cargo test -p mxaccess-compat --features live-windows-com `
--test buffered_subscribe_live -- --ignored --nocapture
```
## Open work
- **F56**: identify a test tag with active scanning OR reconfigure the local Galaxy to scan `TestChildObject.TestInt`. Once F56 unblocks, steps 1, 2, 3 can land in the same commit.
- **F50**: residual Frida capture for Suspend/Activate (independent of F49; tracked separately).
- **F49 steps 2 + 3** — recovery replay and unsubscribe-skip live verification. Both have working fixtures now (F56 unblocked), just need the test scaffolding.
- **F50** residual Frida capture for Suspend/Activate (independent of F49).