alarms-over-gateway: full pipeline (wnwrap consumer + dispatcher + IPC + auto-subscribe + ack-by-name + live smoke) #118
Summary
Reflection probe of the deployed aaAlarmManagedClient.dll (v1.0.7368.41290) on 2026-05-01 confirmed AlarmClient exposes zero public events. The PR A.5 architecture that AlarmClientConsumer is built on (managed-event surface, no message pump) does not hold against this assembly.
Practical impact
- AlarmClientConsumer.AlarmRecordReceived has no production caller. RaiseAlarmRecordReceived is internal for tests and never invoked at runtime.
- Subscribe(...) returns OK from RegisterConsumer + Subscribe but no notifications reach the consumer at runtime because no window is attached.
- MX_EVENT_FAMILY_ON_ALARM_TRANSITION family is reserved on the wire but cannot carry events until A.2 lands a real WM_APP pump.
- AcknowledgeByGuid and SnapshotActiveAlarms are pull-style and remain correct.
How the path actually works
RegisterConsumer(hWnd, ...) takes a window handle because WM_APP messaging is the actual notification mechanism: AVEVA's alarm provider WM_APP-pokes the registered window, then GetStatistics + GetAlarmExtendedRec pull the change set on each poke. This matches the original plan banner's Track A design; the optimistic xmldoc on PR A.5 was wrong against the deployed assembly.
Changes
- docs/AlarmClientDiscovery.md (new): reflection probe summary, full AlarmClient method list, open questions for A.2 implementation (WM_APP message ID, wParam/lParam semantics, STA affinity, subscription scope).
- AlarmClientConsumer.cs xmldoc: replaced the inaccurate "managed event surface" claim with the WM_APP finding; flagged AlarmRecordReceived as unreachable in production until the WM_APP pump lands. Inline comment in Subscribe updated to match.
- MxAccessAlarmEventSink.cs xmldoc: replaced the "verify on dev rig" hedge with the resolved finding; expanded the open-questions list so the next A.2 PR knows what the dev-rig probe needs to answer.
Code-only no-op for the worker; worker builds clean (dotnet build src/MxGateway.Worker: 0 errors / 0 warnings).
Why this is a doc-only PR
The actual A.2 wiring (hidden message-only window, WindowProc, runtime probe for the WM_APP message ID) is substantial work that benefits from the open questions being settled first. Splitting the discovery record from the implementation keeps the WM_APP pump PR focused on code rather than re-deriving why the previous design didn't work.
Tracked in the lmxopcua-side issue #420.
Test plan
- dotnet build src/MxGateway.Worker clean.
- MxGateway.Worker.Tests unaffected (the discovery probe stays Skip-gated).
- docs/AlarmClientDiscovery.md reads top-to-bottom; method list matches the probe output.
🤖 Generated with Claude Code

Added MxGateway.Worker.Tests/AlarmClientWmProbeTests.cs as a Skip-gated runtime probe. Run on the dev rig 2026-05-01 against the live AVEVA install (Galaxy reachable, no manual alarm fired).

Findings:
- RegisterConsumer(hWnd, ...) and Subscribe("\Galaxy!", ...) both return 0 (success). Calls are valid against the deployed assembly.
- A registered-message-class WM (ID 0xC275 in this OS session) fires every ~1 second after Subscribe completes. Constant wParam=0x1100, constant lParam=0x079E46D8; looks like a heartbeat / keepalive, not a per-change notification.
- Critically, this WM is delivered to AVEVA's own internal window (hwnd=0x18032E), NOT to the consumer hWnd we registered. The consumer window receives only the standard WM_CREATE / WM_DESTROY sequence; no AVEVA traffic in between.

This invalidates the WM_APP-pump design previously documented. The hWnd parameter to RegisterConsumer appears to be a registration identity only; AVEVA's notification path runs entirely against AVEVA's own internal window.

Two viable A.2 designs replace the previous one:
1. Polling. Call GetStatistics on a 500ms timer in the worker's STA and react to whatever change set it reports. No window plumbing needed. Latency floor = poll period. Matches AVEVA's own internal heartbeat cadence.
2. Hook AVEVA's internal window. Discover AVEVA's own hwnd, SetWindowSubclass on it, intercept WM 0xC275 on AVEVA's thread. Higher fidelity, lower latency, but invasive and fragile across AVEVA upgrades; likely a non-starter.

Recommendation in docs/AlarmClientDiscovery.md is option 1 (polling) unless a follow-up probe with a real fired alarm shows AVEVA does post change-specific WMs to a different hWnd.

Open follow-up probes documented:
- Fire a real Galaxy alarm during pump and check whether WM 0xC275 cadence changes or GetStatistics returns non-empty arrays.
- GetStatistics threading affinity test.
- Hook AVEVA's internal window 0x18032E.
- Decompile aaAlarmManagedClient IL for RegisterConsumer to find whether WNAL_Register's callback surface is wrapped.

Test project changes:
- Added Reference to aaAlarmManagedClient + IAlarmMgrDataProvider (Private=true so the DLL gets copied into bin for test load).
- Test-suite-wide: 127 real tests still pass; both alarm-related Skip-gated tests skip cleanly.

Code change to the probe is additive; the worker is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

InitializeConsumer was the missing call. Adding it before RegisterConsumer makes the \Galaxy! provider appear in GetProviders (status 0 -> 100 within 500ms). Without Initialize, GetProviders returns an empty list even though everything else returns rc=0 (success).

Probe trace 2026-05-01:
  InitializeConsumer -> 0
  RegisterConsumer -> 0
  GetProviders [after Register] -> count=0 list=[]
  Subscribe('\Galaxy!') -> 0
  GetProviders [after Subscribe] -> count=1 list=[ 0 \Galaxy!]
  GetProviders [poll #1] -> count=1 list=[100 \Galaxy!]

Despite the provider being at "100% query complete" for the entire 60s window, GetStatistics continued to report total=0 active=0 codes=[7] -- no alarm transitions reached the consumer even with a System Platform script flipping TestMachine_001.TestAlarm001 every 10s during the run.

So the consumer chain works end-to-end. What's missing is alarm traffic from the producer side. The next discriminator is whether ObjectViewer (or another live consumer) sees the alarm fire while the script runs.

API-ordering bug fix to apply to PR A.5's AlarmClientConsumer regardless of how A.2 lands: AlarmClientConsumer.Subscribe should call InitializeConsumer before RegisterConsumer (it currently omits Initialize entirely, which means the provider chain is never visible from the worker either). That fixes a fundamental bug independent of the polling-vs-callback question.

Probe changes:
- Added InitializeConsumer call before RegisterConsumer.
- Added LogProviders helper that logs only on change; called after Register, after Subscribe, and on every poll. Easier to spot when the provider chain transitions from empty to populated.
- Restored Skip-gating after run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces NotWiredAlarmRpcDispatcher in DI with a production implementation that issues the new MxCommandKind.{AcknowledgeAlarm, QueryActiveAlarms} commands across the IPC and unwraps the resulting MxCommandReply into the public RPC types.

QueryActiveAlarms is fully wired: builds the QueryActiveAlarmsCommand (forwarding alarm_filter_prefix), invokes it on the resolved GatewaySession's worker client, and yields each ActiveAlarmSnapshot from the QueryActiveAlarmsReplyPayload as the RPC stream. Worker failures + missing sessions yield an empty stream, matching the ConditionRefresh contract clients already speak to.

AcknowledgeAlarm is partially wired: the public RPC takes AlarmFullReference (Provider!Group.Tag), but the worker's wnwrap consumer acks by GUID. Strategy:
- If AlarmFullReference parses as a canonical GUID, forward it directly through MxCommandKind.AcknowledgeAlarm. Native status flows back via MxCommandReply.Hresult and the dedicated AcknowledgeAlarmReplyPayload.NativeStatus.
- Otherwise, return InvalidRequest with a clear diagnostic naming the follow-up: reference→GUID lookup needs a worker-side AlarmAckByName command wrapping wwAlarmConsumerClass.AlarmAckByName.

DI: SessionServiceCollectionExtensions registers WorkerAlarmRpcDispatcher as the default IAlarmRpcDispatcher; MxAccessGatewayService picks it up via constructor injection. NotWiredAlarmRpcDispatcher is retained for test fixtures that want the no-side-effect fake.

Tests: 7 new unit tests cover session-not-found short-circuit, GUID-vs-reference branching, native-status propagation, worker MxaccessFailure diagnostic propagation, and snapshot-stream yielding. Server test suite total: 288 pass / 0 fail.
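The GUID-vs-reference strategy described for AcknowledgeAlarm in this commit can be sketched roughly as follows. This is a minimal illustration, not the dispatcher's actual code: AckOutcome, AckDecision, and AckRouting are hypothetical names, and treating "canonical GUID" as the hyphenated "D" format is an assumption.

```csharp
#nullable enable
using System;

// Hypothetical stand-ins for the gateway's real types; names are illustrative only.
public enum AckOutcome { Forwarded, InvalidRequest }

public sealed record AckDecision(AckOutcome Outcome, Guid? AlarmGuid, string? Diagnostic);

public static class AckRouting
{
    // Decide how to treat an AlarmFullReference at this stage of the pipeline:
    // GUID-shaped input is forwarded, anything else is rejected with a diagnostic.
    public static AckDecision Decide(string alarmFullReference)
    {
        // Assumption: "canonical" means the 8-4-4-4-12 hyphenated form ("D").
        if (Guid.TryParseExact(alarmFullReference, "D", out var guid))
        {
            return new AckDecision(AckOutcome.Forwarded, guid, null);
        }

        return new AckDecision(
            AckOutcome.InvalidRequest,
            null,
            $"'{alarmFullReference}' is not a GUID; ack-by-reference needs a worker-side AlarmAckByName command.");
    }
}
```

With this shape, a Provider!Group.Tag reference never reaches the worker, so the diagnostic is produced gateway-side rather than as a native status code.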
Solution builds clean.

End-to-end alarms-over-gateway pipeline status:
- consumer → sink → queue (A.2 + A.3 in-process slice)
- worker IPC commands (A.3 worker slice)
- gateway dispatcher (this slice)

Remaining for full E2E:
- Auto-issue SubscribeAlarms on session open (or add a public SubscribeAlarms RPC). Without this trigger the consumer never starts and Acknowledge/Query return "not subscribed".
- AlarmAckByName worker command for ack-by-reference.
- End-to-end live test against the dev rig.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the missing trigger that activates the worker's wnwrap consumer. Without this, every session opened in OK state but the consumer never started, so AcknowledgeAlarm/QueryActiveAlarms returned "alarm consumer not configured" forever.

New AlarmsOptions config block (under MxGateway:Alarms):
- Enabled (default false): gates the auto-subscribe path so existing deployments without alarm configuration are unaffected.
- SubscriptionExpression: explicit AVEVA expression like \<machine>\Galaxy!<area>.
- DefaultArea: fallback used when SubscriptionExpression is empty; composes \$(MachineName)\Galaxy!$(DefaultArea).
- RequireSubscribeOnOpen (default false): when true, an auto-subscribe failure faults the session; when false, the failure is logged and the session stays Ready (data subscriptions keep working, alarms return "not subscribed" until the operator retries).

SessionManager.OpenSessionAsync gains a TryAutoSubscribeAlarmsAsync hook that runs after MarkReady. Skips when alarms are disabled; otherwise builds a SubscribeAlarmsCommand, invokes it on the session's worker client, and either logs the resulting status or escalates per RequireSubscribeOnOpen. SessionManagerException is the failure mode for the strict path so callers in MxAccessGatewayService surface it as session-open-failed.
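As a rough illustration of the option semantics above, the expression fallback and the soft-vs-strict decision might look like this. The property names and defaults come from the commit message; the class layout and the AlarmSubscription helper (ResolveExpression, ShouldFaultSession) are hypothetical and are not taken from the actual SessionManager code.

```csharp
#nullable enable
using System;

// Illustrative shape only; the real AlarmsOptions class may differ.
public sealed class AlarmsOptions
{
    public bool Enabled { get; set; } = false;
    public string? SubscriptionExpression { get; set; }
    public string? DefaultArea { get; set; }
    public bool RequireSubscribeOnOpen { get; set; } = false;
}

public static class AlarmSubscription
{
    // Returns the expression to subscribe with, or null when alarms are
    // disabled or neither SubscriptionExpression nor DefaultArea is set.
    public static string? ResolveExpression(AlarmsOptions options, string machineName)
    {
        if (!options.Enabled) return null;

        // An explicit expression always wins over the DefaultArea fallback.
        if (!string.IsNullOrWhiteSpace(options.SubscriptionExpression))
            return options.SubscriptionExpression;

        if (!string.IsNullOrWhiteSpace(options.DefaultArea))
            return $@"\{machineName}\Galaxy!{options.DefaultArea}";

        return null;
    }

    // Strict mode faults the session on a failed subscribe; soft mode only logs.
    public static bool ShouldFaultSession(AlarmsOptions options, bool subscribeSucceeded)
        => options.Enabled && options.RequireSubscribeOnOpen && !subscribeSucceeded;
}
```

In this sketch a null result from ResolveExpression while Enabled is true corresponds to the missing-config case; the caller would treat it as a subscribe failure and pass false to ShouldFaultSession.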

Tests: 7 new unit tests cover the disabled lane, expression-driven subscribe, DefaultArea fallback, success path, soft-failure (require off), strict-failure (require on), and missing-config-strict-throw. Server suite total: 295 pass / 0 fail. Solution builds clean.

End-to-end alarms-over-gateway path is now live (with config). Open a session against a gateway with Alarms.Enabled=true + a valid SubscriptionExpression; the worker's wnwrap consumer auto-subscribes; QueryActiveAlarms streams snapshots; AcknowledgeAlarm acks by GUID. Reference→GUID resolution (AlarmAckByName worker command) and the live dev-rig smoke test remain follow-ups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Skip-gated AlarmsLiveSmokeTests.Alarms_full_pipeline_round_trip ran against the dev rig with the flip script firing TestMachine_001.TestAlarm001 every 10s.

Verified:
- Subscribe + 1st PollOnce yield real transition events
- Field-by-field decode correct (provider, group, tag, severity, UTC timestamp, comment, type)
- SnapshotActiveAlarms reflects current state
- AcknowledgeByName(real identity) -> rc=0
- Pipeline keeps streaming transitions on the 10s cadence post-ack

Three production quirks surfaced and were fixed in WnWrapAlarmConsumer:
1. SetXmlAlarmQuery is mandatory for reads. Skipping it (per the earlier discovery-doc recommendation) makes the first GetXmlCurrentAlarms2 fail with E_FAIL. The doc's claim that the call is unnecessary because the round-trip echo is mangled was wrong; mangled echo or not, the call is required.
2. SetXmlAlarmQuery breaks AlarmAckByName on the same consumer instance (returns -55). Workaround: provision a parallel "ack-only" wnwrap consumer that runs Initialize → Register → Subscribe via the v1-prefixed methods, no SetXmlAlarmQuery. Production WnWrapAlarmConsumer now holds two COM clients; AcknowledgeByName always dispatches through the ack-only one.
3. AlarmAckByName has v2 (8-arg) and v1 (6-arg) overloads. The v2 8-arg overload returns -55 on this AVEVA build (apparently a stub); the v1 6-arg overload works. Production now calls the 6-arg overload, discarding the proto's operator_domain and operator_full_name fields. The proto contract keeps both for forward-compat if AVEVA fixes the v2 method.

Bonus finding (not fixed here): AlarmAckByGUID throws NotImplementedException on wnwrap. Reference→GUID lookup that we initially planned to plumb is therefore not viable; all acks must go through AlarmAckByName. WorkerAlarmRpcDispatcher.AcknowledgeAsync already routes references through the by-name path, so this only affects the GUID-input branch (which the worker tries first if the input parses as a GUID; that branch will surface NotImplementedException as MxaccessFailure if a client supplies one).

Threading caveat: wnwrap is ThreadingModel=Apartment, so the consumer's internal Timer (firing on threadpool threads) blocks on cross-apartment marshaling without an STA message pump. The smoke test sidesteps this with pollIntervalMilliseconds=0 (Timer disabled) + manual PollOnce calls from the test STA. Production hosting will route polls through the worker's StaRuntime in a follow-up; PollOnce is now public so the wire-up is straightforward.

Test counts after this slice:
  Worker: 195 pass / 4 skipped (live probes incl. new live smoke) / 1 pre-existing structure-fail (untouched)
  Server: 308 pass / 0 fail
Solution builds clean.

docs/AlarmClientDiscovery.md "Live smoke-test discoveries" section records all five findings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Title changed from "docs: AlarmClient public surface — managed-event premise wrong, WM_APP required" to "alarms-over-gateway: full pipeline (wnwrap consumer + dispatcher + IPC + auto-subscribe + ack-by-name + live smoke)"
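
The two-client workaround from the live smoke commit (quirk 2) could be sketched as below. Everything here is hypothetical scaffolding: IAlarmComClient, DualClientAlarmConsumer, and the three-parameter AckByName signature are illustrative and do not reproduce the real wnwrap overloads, whose argument lists are not shown in this PR. RecordingClient is a test double included only so the sketch runs without COM.

```csharp
#nullable enable
using System;

// Hypothetical abstraction over one wnwrap COM client.
public interface IAlarmComClient
{
    string ReadCurrentAlarmsXml();
    int AckByName(string alarmName, string comment, string operatorName);
}

// Holds two clients: one configured for reads, one reserved for acks.
public sealed class DualClientAlarmConsumer
{
    private readonly IAlarmComClient _reader;  // has the XML query set; reads only
    private readonly IAlarmComClient _ackOnly; // never has the query set; acks only

    public DualClientAlarmConsumer(IAlarmComClient reader, IAlarmComClient ackOnly)
    {
        _reader = reader ?? throw new ArgumentNullException(nameof(reader));
        _ackOnly = ackOnly ?? throw new ArgumentNullException(nameof(ackOnly));
    }

    public string PollOnce() => _reader.ReadCurrentAlarmsXml();

    // Acks never touch the reader, whose ack path returned -55 in the smoke test.
    public int AcknowledgeByName(string alarmName, string comment, string operatorName)
        => _ackOnly.AckByName(alarmName, comment, operatorName);
}

// In-memory test double; counts ack calls and returns a fixed status.
public sealed class RecordingClient : IAlarmComClient
{
    private readonly int _ackStatus;
    public int AckCalls { get; private set; }

    public RecordingClient(int ackStatus) => _ackStatus = ackStatus;

    public string ReadCurrentAlarmsXml() => "<alarms/>";

    public int AckByName(string alarmName, string comment, string operatorName)
    {
        AckCalls++;
        return _ackStatus;
    }
}
```

Keeping the split inside one consumer class means callers see a single AcknowledgeByName entry point and cannot accidentally ack through the query-configured client.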