followups: trim F56's stale pre-resolution analysis
F56's body had a "Resolved 2026-05-06" header followed by ~40 lines of pre-resolution debugging analysis that contradicted the resolution: "Likely revised root cause" pointing at DCOM sink IID mismatches, "But zero 0x33 DataUpdate frames ever arrive", "Action items for whoever picks F56 up", "Definition of done", "Resolves when" — all written before the actual root cause (missing EnsurePublisherConnected round-trip) was identified. Trim to: status + actual root cause + fix that landed + live verification + the codec fixes that also landed independently. The dead-end debugging branches are preserved in this file's git history for archeology; F56 body now reads as a coherent closeout. Also fixed line 108's "See Resolved section below for the full closeout" pointer — the closeout *is* the body; F56 was never moved to Resolved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
+5
-44
@@ -105,9 +105,9 @@ Most of those are protocol-internal types (struct fields, enum variants on wire-
|
||||
**Resolves when:** the lint is on and the workspace doc build is warning-clean with it.
|
||||
|
||||
### F56 — `subscribe` / `subscribe_buffered` complete on the wire but never receive `0x33` DataUpdate frames
|
||||
**Status:** **Resolved 2026-05-06.** See Resolved section below for the full closeout.
|
||||
**Status:** **Resolved 2026-05-06.**
|
||||
|
||||
Root cause: `Session::subscribe` and `Session::subscribe_buffered_nmx` were missing the `INmxService2::Connect` + `AddSubscriberEngine` round-trip that the .NET reference's `MxNativeSession.EnsurePublisherConnected` (`cs:516-526`) issues before the first advise against a given publishing engine. Without that pair of RPCs, NmxSvc accepts the subscription registration but the publishing engine never knows our engine is subscribed — so no `0x33` DataUpdate frames flow.
|
||||
**Root cause:** `Session::subscribe` and `Session::subscribe_buffered_nmx` were missing the `INmxService2::Connect` + `AddSubscriberEngine` round-trip that the .NET reference's `MxNativeSession.EnsurePublisherConnected` (`cs:516-526`) issues before the first advise against a given publishing engine. Without that pair of RPCs, NmxSvc accepts the subscription registration but the publishing engine never knows our engine is subscribed — so no `0x33` DataUpdate frames flow.
|
||||
|
||||
Diagnosed via wwtools/aalogcli: the `[Warning] NmxSvc | NmxCallback->DataReceived ... failed with error 0x{N}` log lines turned out to be NmxSvc's normal log spam where N is the bufferSize, NOT an actual error — the .NET reference's own probe triggers identical entries while still receiving `0x33` DataUpdate frames successfully. The real issue was that those frames never started being sent in the first place.
|
||||
|
||||
@@ -123,53 +123,14 @@ Live verification passes for both paths against `TestMachine_001.TestChangingInt
|
||||
|
||||
Both tests assert on the raw `Session::callbacks()` broadcast (NMX subscription messages) rather than the typed `Subscription::next` (DataChange) path because `TestChangingInt` on this Galaxy is configured with `quality=0x00C0 (Uncertain) value=null`, so the typed path filters every record. The test gate is "wire-level subscription works"; what the engine reports as the actual value is downstream-Galaxy state, out of scope for the Rust port.
|
||||
|
||||
Real codec fixes ALSO landed in this session as part of F56 investigation (independent from the resolution above):
|
||||
- `NmxSubscriptionMessage::try_parse_process_data_received_body` — peels the `ProcessDataReceived` envelope before calling `parse_inner`. The router previously called `parse_inner` directly on wire bytes, which would have silently dropped any `0x33` even if one arrived.
|
||||
- `NmxReferenceRegistrationResultMessage::try_parse_process_data_received_body` + router branch — drops `0x11` registration-result frames cleanly.
|
||||
- `Session::subscribe_buffered_nmx` — split-form (object, attribute) wire body + per-session monotonic `item_handle` counter (mirrors `MxNativeCompatibilityServer.AddBufferedItemAsync`'s `_nextItemHandle++`).
|
||||
|
||||
**Severity:** P1 — blocks F49 step 1 (F36 buffered live verification), F49 step 2 (F45 recovery replay), and ALL consumers relying on subscription data flow on this Galaxy.
|
||||
|
||||
**Updated 2026-05-06.** Initial diagnosis suspected a buffered-specific wire-body gap; ruled out:
|
||||
- Wire body proven byte-identical to the .NET reference's by `crates/mxaccess-codec/tests/buffered_register_reference_parity.rs` (which forward-builds the message from `Session::subscribe_buffered`'s inputs and compares against `captures/082-frida-add-buffered-plain-advise-testint/`).
|
||||
- Test now uses real Galaxy DB metadata via `mxaccess_galaxy::SqlTagResolver` (engine_id=2, attribute_id=155, etc.) instead of the hardcoded StaticResolver shim.
|
||||
- Item-handle, item_definition, item_context all switched to the .NET-reference split form (handle=1 + per-session counter, definition="<attr>.property(buffer)", context="<object_tag>").
|
||||
|
||||
**Plain subscribe also fails.** Added `crates/mxaccess-compat/tests/plain_subscribe_live.rs` driving `Session::subscribe` (NOT buffered). Same symptom: `AdviseSupervisory` returns HRESULT 0, the engine acks every write with a 51-byte op-status frame, but no `0x33` DataUpdate ever arrives. So this is **not buffered-specific** — the entire inbound DataUpdate path is silent on this machine.
|
||||
|
||||
**Likely revised root cause:**
|
||||
- The engine generates `0x33` DataUpdate frames into a different transport channel than the one our DCOM sink listens on. The .NET reference's `INmxSvcCallback` has two opnums — `DataReceivedRaw` (3) and `StatusReceivedRaw` (4). We only ever observe opnum=3 callbacks. If the engine routes data updates through a different IID or different DCOM stub on this install, our sink never sees them.
|
||||
- Alternatively, the engine on this Galaxy install is configured such that local Object scanning is disabled / the deployed objects aren't actively producing value-change events. The `OnWriteComplete` round-trip works (proves write-path + callback-path); a passive subscription doesn't produce updates if no source changes the value.
|
||||
|
||||
**Action items (for whoever picks F56 up):**
|
||||
1. Compare the **C# DcomCallbackSink** (`src/MxNativeClient/NmxCallbackSink.cs`) to the Rust port's `mxaccess-callback::dcom_sink` — verify it implements **only** `INmxSvcCallback` and that the IID + vtable layout match. There may be a third method or a sibling interface (e.g. `INmxSvcCallback2`) that the engine also calls into for high-cadence DataUpdate dispatch.
|
||||
2. Try the same live test against a tag that has known active scanning (e.g. a bound-to-PLC InputSource attribute) — rule out static-UDA hypothesis.
|
||||
3. Run `MxNativeClient.Probe --probe-session-subscribe --tag=TestChildObject.TestInt --subscribe-hold-seconds=30` (the .NET reference's working live probe) and confirm `0x33` DataUpdates fire on THIS machine. If they do, capture the wire bytes via Frida and diff against the Rust port's exact body.
|
||||
|
||||
**What landed in this session (real router/codec fixes, NOT F56-resolving):**
|
||||
**Codec fixes** that ALSO landed in this session as part of the F56 investigation (independent from the resolution above; would have been needed even after the Connect/AddSubscriberEngine fix to make the inbound path readable):
|
||||
- `NmxSubscriptionMessage::try_parse_process_data_received_body` — peels the `ProcessDataReceived` envelope before calling `parse_inner`. The router previously called `parse_inner` directly on wire bytes, which would have silently dropped any `0x33` even if one arrived.
|
||||
- `NmxReferenceRegistrationResultMessage::try_parse_process_data_received_body` + router branch — drops `0x11` registration-result frames cleanly instead of logging "unexpected opcode 0x11".
|
||||
- `Session::subscribe_buffered_nmx` — split-form (object, attribute) wire body + per-session monotonic `item_handle` counter (mirrors `MxNativeCompatibilityServer.AddBufferedItemAsync`'s `_nextItemHandle++`).
|
||||
**Source:** F49 step 1 live attempt 2026-05-06. Test `cargo test -p mxaccess-compat --features live-windows-com --test buffered_subscribe_live -- --ignored --nocapture` (added in this session) connects via `Session::connect_nmx_auto` (F55-proven), issues `subscribe_buffered(TestChildObject.TestInt, 1000ms)` against the live engine, and runs a background writer at 500ms cadence. RegisterReference returns HRESULT 0; the engine then fires:
|
||||
- One 46-byte heartbeat envelope (header-only, empty inner)
|
||||
- One 51-byte op-status frame for the `RegisterReference` completion
|
||||
- One 87-byte `0x11` `NmxReferenceRegistrationResultMessage` carrying the assigned `item_handle`
|
||||
- One 51-byte op-status frame **per write** (60 frames over 30s — perfectly clocked to the writer cadence)
|
||||
|
||||
But **zero `0x33` `DataUpdate` frames** ever arrive — verified end-to-end via `RUST_LOG=trace mxaccess_callback=trace`. The .NET reference's `MxNativeSession.SubscribeBufferedAsync` does deliver DataUpdates against the same engine + same tag (per F36 wave 1 evidence at `captures/094-frida-buffered-separate-writer/`), so this is a Rust-port-specific gap.
|
||||
**Severity:** P1 — blocked F49 step 1 (F36 buffered live verification), F49 step 2 (F45 recovery replay), and all consumers relying on subscription data flow on this Galaxy. Now unblocked.
|
||||
|
||||
**Likely causes (in priority order):**
|
||||
1. The `NmxReferenceRegistrationMessage` body the Rust port sends differs in some field from the .NET reference's. Specifically: `subscribe: true` is set, but other fields (e.g. `item_handle = 0`, `reserved_*`, `source_galaxy_id`) may need different values to trigger DataUpdate dispatch. **Action**: capture the wire bytes from the Rust port's RegisterReference and diff against `captures/094-frida-buffered-separate-writer/` per-byte.
|
||||
2. Some additional client-side step is required after RegisterReference — e.g. an ACK of the `0x11` registration result via the assigned `item_handle`, or a separate RPC the .NET reference dispatches that we miss. The F36 wave 1 evidence said no `SetBufferedUpdateInterval` is sent, but there may be another op. **Action**: capture .NET reference's outbound calls during `subscribe-buffered` end-to-end and compare to ours.
|
||||
3. The `0x11` registration-result body might carry a status code we should be checking (see `NmxReferenceRegistrationResultMessage::status_category` / `status_detail`). If non-zero, the engine may have rejected the subscription silently. **Action**: log the parsed `0x11` body and check the status fields.
|
||||
|
||||
**What's already wired (this session):** `NmxSubscriptionMessage::try_parse_process_data_received_body` (envelope-peeling helper) was added — the previous router called `parse_inner` directly on wire bytes and would have silently dropped any `0x33` that did arrive. This was a real bug fix; without it F56 would have stayed invisible. Same for `NmxReferenceRegistrationResultMessage::try_parse_process_data_received_body` + the `0x11` path in the router.
|
||||
|
||||
**Does not affect:** `Session::write` round-trip (proven by F55 live test); plain `Session::subscribe` (not yet live-tested but uses `AdviseSupervisory` not `RegisterReference`).
|
||||
|
||||
**Definition of done:** F49 step 1 passes — `cargo test -p mxaccess-compat --features live-windows-com --test buffered_subscribe_live -- --ignored --nocapture` reports at least 3 `DataChange` arrivals at the configured cadence, with monotonically-increasing values matching the writer.
|
||||
|
||||
**Resolves when:** the missing field / step / status check is identified, the fix lands in `Session::subscribe_buffered_nmx` (or upstream), and the live test passes.
|
||||
**Source:** F49 step 1 live attempt 2026-05-06. The pre-resolution debugging analysis (initial buffered-only hypothesis ruled out via byte-identical parity test → "plain subscribe also fails too" → revised hypothesis around DCOM sink IID / vtable mismatch and disabled object scanning → final landing on the missing `EnsurePublisherConnected` round-trip) is preserved in this file's git history. Run `git log -p design/followups.md` around 2026-05-06 / 2026-05-07 if the dead-end branches are needed for future archeology.
|
||||
|
||||
### F55 — Hand-rolled callback exporter rejected by `RegisterEngine2` on this AVEVA install
|
||||
**Status:** Resolved 2026-05-06 by Path A (DCOM-managed `INmxSvcCallback` sink in `mxaccess-callback::dcom_sink`, wired into `Session::from_nmx_client` behind the `windows-com` feature). Live test `cargo test -p mxaccess-compat --features live-windows-com --test lmx_write_complete_live -- --ignored --nocapture` passes end-to-end: RegisterEngine2 succeeds, write round-trips, OnWriteComplete fires with status from the wire. The hand-rolled `CallbackExporter` is retained for unit tests that exercise the exporter against an in-process fake NMX peer.
|
||||
|
||||
Reference in New Issue
Block a user