[F56] subscribe / subscribe_buffered: split-form wire body + diagnose Galaxy fixture gap

Three real fixes + one architectural diagnosis:

1. Session::subscribe_buffered_nmx now sends the .NET-reference split
   form on the wire:
     item_definition = "<attr>.property(buffer)"   (was: full reference)
     item_context    = "<object_tag_name>"          (was: empty)
     item_handle     = SessionInner::next_item_handle.fetch_add(1)
                       (was: hardcoded 0)
   Verified byte-identical against captures/082 + 094 by the existing
   buffered_register_reference_parity unit tests. The
   item_handle counter mirrors MxNativeCompatibilityServer's
   _nextItemHandle++ at MxNativeSession.cs:613.

2. New live tests:
   - tests/buffered_subscribe_live.rs (F49 step 1) — uses real Galaxy
     metadata via SqlTagResolver + connect_nmx_auto, drives a
     background writer at 500ms cadence to force value-changes,
     drains DataChange events from Subscription.
   - tests/plain_subscribe_live.rs — same harness over plain
     Session::subscribe (NOT buffered), used to isolate whether
     "no DataUpdate" is buffered-specific (it's not — both fail).

   Both pull tracing-subscriber as a dev-dep so `RUST_LOG=trace`
   surfaces dcom_sink + router activity.

3. mxaccess-galaxy/sql_resolver.rs: drop the inner-attribute
   `#![cfg(feature = "galaxy-resolver")]` — the module-level cfg on
   `pub mod sql_resolver` in lib.rs already handles this and Rust
   1.85's clippy::duplicated_attributes lint flagged the duplicate
   once mxaccess-compat dev-deps activated the feature.

4. F56 finding (diagnosis, NOT a bug fix): the engine on this Galaxy
   install does not have an active value for TestChildObject.TestInt.
   Confirmed by running the .NET reference's own probe:

     dotnet run --project src/MxNativeClient.Probe -c Release \
       -- --probe-session-subscribe --tag=TestChildObject.TestInt \
       --subscribe-hold-seconds=10

   ...returns ONE 0x32 SubscriptionStatus (status=3 detail=3
   quality=0x00C0 Uncertain value=null) and zero 0x33 DataUpdates —
   matching the Rust port's symptom exactly. Not a Rust port bug,
   not a wire-byte gap. F49 steps 1-3 need either an actively-
   scanned tag or local Galaxy reconfiguration to scan
   TestChildObject.TestInt.

Workspace tests + clippy clean under both feature configurations.
F56 entry in design/followups.md updated with the full diagnostic
chain so future-me / future-collaborators can pick it up without
re-tracing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-06 10:27:08 -04:00
parent af15fe7587
commit df3457c54a
7 changed files with 276 additions and 82 deletions
+27 -2
@@ -102,8 +102,33 @@ Between each publish: wait for the crate to be indexed before the next one's `ca
**Resolves when:** the lint is on and the workspace doc build is warning-clean with it.
### F56 — Buffered subscribe completes RegisterReference but receives no `0x33` DataUpdate frames
**Severity:** P1 — blocks F49 step 1 (F36 buffered live verification) and any consumer relying on `Session::subscribe_buffered` to surface value changes.
### F56 — `subscribe` / `subscribe_buffered` complete on the wire but never receive `0x33` DataUpdate frames
**Status:** Diagnosed 2026-05-06 as a **test-fixture issue, not a Rust port bug**. The .NET reference's own `MxNativeClient.Probe --probe-session-subscribe --tag=TestChildObject.TestInt` returns a single `0x32` SubscriptionStatus with `status=3 detail=3 quality=0x00C0 (Uncertain) value=null` and zero `0x33` DataUpdates — same observation as the Rust port's `subscribe` / `subscribe_buffered` paths. The engine on this Galaxy install does not have a live value for `TestChildObject.TestInt`; nothing is scanning that attribute, so there are no value-changes for the engine to dispatch. F49 steps 1-3 need either (a) a different test tag with active scanning, or (b) configuring the local Galaxy to scan TestChildObject.TestInt before live verification can pass.
Real codec fixes still landed in this session (envelope-peeling for `NmxSubscriptionMessage` + `0x11` registration-result path + split-form RegisterReference body + per-session item_handle counter); they were necessary preconditions for F49 step 1 even if the test fixture blocks the actual pass criterion.
**Severity:** P1 — blocks F49 step 1 (F36 buffered live verification), F49 step 2 (F45 recovery replay), and ALL consumers relying on subscription data flow on this Galaxy.
**Updated 2026-05-06.** Initial diagnosis suspected a buffered-specific wire-body gap; ruled out:
- Wire body proven byte-identical to the .NET reference's by `crates/mxaccess-codec/tests/buffered_register_reference_parity.rs` (which forward-builds the message from `Session::subscribe_buffered`'s inputs and compares against `captures/082-frida-add-buffered-plain-advise-testint/`).
- Test now uses real Galaxy DB metadata via `mxaccess_galaxy::SqlTagResolver` (engine_id=2, attribute_id=155, etc.) instead of the hardcoded StaticResolver shim.
- Item-handle, item_definition, item_context all switched to the .NET-reference split form (handle=1 + per-session counter, definition="<attr>.property(buffer)", context="<object_tag>").
**Plain subscribe also fails.** Added `crates/mxaccess-compat/tests/plain_subscribe_live.rs` driving `Session::subscribe` (NOT buffered). Same symptom: `AdviseSupervisory` returns HRESULT 0, the engine acks every write with a 51-byte op-status frame, but no `0x33` DataUpdate ever arrives. So this is **not buffered-specific** — the entire inbound DataUpdate path is silent on this machine.
**Revised candidate root causes:**
- The engine generates `0x33` DataUpdate frames into a different transport channel than the one our DCOM sink listens on. The .NET reference's `INmxSvcCallback` has two opnums — `DataReceivedRaw` (3) and `StatusReceivedRaw` (4). We only ever observe opnum=3 callbacks. If the engine routes data updates through a different IID or different DCOM stub on this install, our sink never sees them.
- Alternatively, the engine on this Galaxy install is configured such that local Object scanning is disabled / the deployed objects aren't actively producing value-change events. The `OnWriteComplete` round-trip works (proves write-path + callback-path); a passive subscription doesn't produce updates if no source changes the value.
**Action items (for whoever picks F56 up):**
1. Compare the **C# DcomCallbackSink** (`src/MxNativeClient/NmxCallbackSink.cs`) to the Rust port's `mxaccess-callback::dcom_sink` — verify it implements **only** `INmxSvcCallback` and that the IID + vtable layout match. There may be a third method or a sibling interface (e.g. `INmxSvcCallback2`) that the engine also calls into for high-cadence DataUpdate dispatch.
2. Try the same live test against a tag that has known active scanning (e.g. a bound-to-PLC InputSource attribute) to rule out the static-UDA hypothesis.
3. Run `MxNativeClient.Probe --probe-session-subscribe --tag=TestChildObject.TestInt --subscribe-hold-seconds=30` (the .NET reference's working live probe) and confirm `0x33` DataUpdates fire on THIS machine. If they do, capture the wire bytes via Frida and diff against the Rust port's exact body.
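Action items 2 and 3 share one pass criterion: at least one `0x33` DataUpdate must arrive during the hold window. The sketch below makes that criterion explicit by tallying the opcodes seen during a run. It is a hypothetical helper, not part of the crate: `FrameTally`, `Verdict`, `tally`, and `verdict` are illustrative names, and the caller is assumed to have already extracted one opcode byte per observed frame (how the opcode is located inside a frame is not modelled here).

```rust
// Opcodes as named in this entry.
const OP_REGISTRATION_RESULT: u8 = 0x11;
const OP_SUBSCRIPTION_STATUS: u8 = 0x32;
const OP_DATA_UPDATE: u8 = 0x33;

/// Hypothetical per-run counters for the frames observed during the hold window.
#[derive(Debug, Default, PartialEq)]
struct FrameTally {
    registration_results: usize,
    subscription_status: usize,
    data_updates: usize,
    other: usize,
}

/// Hypothetical three-way outcome for a live subscribe run.
#[derive(Debug, PartialEq)]
enum Verdict {
    /// At least one 0x33 arrived: the subscription data path works.
    DataFlowing,
    /// Only 0x32 status frames arrived: the F56 symptom.
    StatusOnly,
    /// Neither 0x32 nor 0x33 arrived.
    Silent,
}

/// Count frames by opcode. Input is one opcode byte per observed frame.
fn tally(opcodes: &[u8]) -> FrameTally {
    let mut t = FrameTally::default();
    for &op in opcodes {
        match op {
            OP_REGISTRATION_RESULT => t.registration_results += 1,
            OP_SUBSCRIPTION_STATUS => t.subscription_status += 1,
            OP_DATA_UPDATE => t.data_updates += 1,
            _ => t.other += 1,
        }
    }
    t
}

/// Reduce a tally to the pass criterion used by the action items.
fn verdict(t: &FrameTally) -> Verdict {
    if t.data_updates > 0 {
        Verdict::DataFlowing
    } else if t.subscription_status > 0 {
        Verdict::StatusOnly
    } else {
        Verdict::Silent
    }
}

fn main() {
    // The symptom described above: one registration result, one status frame, no data.
    let observed = [OP_REGISTRATION_RESULT, OP_SUBSCRIPTION_STATUS];
    let t = tally(&observed);
    println!("{t:?} -> {:?}", verdict(&t));
}
```

Running the same tally against both the Rust port and the .NET probe on the same tag would let the two runs be compared by verdict rather than by reading raw trace logs.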
**What landed in this session (real router/codec fixes, NOT F56-resolving):**
- `NmxSubscriptionMessage::try_parse_process_data_received_body` — peels the `ProcessDataReceived` envelope before calling `parse_inner`. The router previously called `parse_inner` directly on wire bytes, which would have silently dropped any `0x33` even if one arrived.
- `NmxReferenceRegistrationResultMessage::try_parse_process_data_received_body` + router branch — drops `0x11` registration-result frames cleanly instead of logging "unexpected opcode 0x11".
- `Session::subscribe_buffered_nmx` — split-form (object, attribute) wire body + per-session monotonic `item_handle` counter (mirrors `MxNativeCompatibilityServer.AddBufferedItemAsync`'s `_nextItemHandle++`).
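A minimal sketch of the split form and the per-session handle counter described in the last bullet. It is an illustration under stated assumptions, not the crate's code: `SplitReference`, `HandleCounter`, and `split_buffered` are hypothetical names, the object tag is assumed to be everything before the first `.` in the reference, and the counter is assumed to start at 1 (per the "handle=1" note earlier in this entry).

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical stand-in for the three RegisterReference fields named above.
#[derive(Debug, PartialEq)]
struct SplitReference {
    item_definition: String,
    item_context: String,
    item_handle: u32,
}

/// Hypothetical per-session state; only the handle counter is modelled.
struct HandleCounter {
    next_item_handle: AtomicU32,
}

impl HandleCounter {
    fn new() -> Self {
        // Assumption: handles start at 1.
        Self { next_item_handle: AtomicU32::new(1) }
    }

    /// Split "Object.Attr" into the buffered split form and assign a fresh handle.
    /// Returns None when the reference has no '.' or an empty half.
    fn split_buffered(&self, full_reference: &str) -> Option<SplitReference> {
        // Assumption: the object tag is everything before the first '.'.
        let (object_tag, attribute) = full_reference.split_once('.')?;
        if object_tag.is_empty() || attribute.is_empty() {
            return None;
        }
        Some(SplitReference {
            item_definition: format!("{attribute}.property(buffer)"),
            item_context: object_tag.to_string(),
            // fetch_add returns the previous value, so the first call yields 1.
            item_handle: self.next_item_handle.fetch_add(1, Ordering::Relaxed),
        })
    }
}

fn main() {
    let session = HandleCounter::new();
    let first = session.split_buffered("TestChildObject.TestInt").unwrap();
    println!("{first:?}");
}
```

The handle is taken only after the reference has been validated, so a malformed reference does not consume a handle value in this sketch.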
**Source:** F49 step 1 live attempt 2026-05-06. Test `cargo test -p mxaccess-compat --features live-windows-com --test buffered_subscribe_live -- --ignored --nocapture` (added in this session) connects via `Session::connect_nmx_auto` (F55-proven), issues `subscribe_buffered(TestChildObject.TestInt, 1000ms)` against the live engine, and runs a background writer at 500ms cadence. RegisterReference returns HRESULT 0; the engine then fires:
- One 46-byte heartbeat envelope (header-only, empty inner)
- One 51-byte op-status frame for the `RegisterReference` completion