From 864b9f4bd338cd52ccf26695cd4e5bfdf0e42401 Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Thu, 21 May 2026 17:04:29 -0400 Subject: [PATCH] Remove the AlarmClientDiscovery probe log MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Delete docs/AlarmClientDiscovery.md — an archival AVEVA alarm-consumer investigation log whose durable findings now live in the alarm worker/monitor code. Drop the now-dangling links from Grpc.md and GatewayConfiguration.md. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/AlarmClientDiscovery.md | 828 ----------------------------------- docs/GatewayConfiguration.md | 4 +- docs/Grpc.md | 2 +- 3 files changed, 2 insertions(+), 832 deletions(-) delete mode 100644 docs/AlarmClientDiscovery.md diff --git a/docs/AlarmClientDiscovery.md b/docs/AlarmClientDiscovery.md deleted file mode 100644 index 5ec9829..0000000 --- a/docs/AlarmClientDiscovery.md +++ /dev/null @@ -1,828 +0,0 @@ -# aaAlarmManagedClient discovery — public surface, 2026-05-01 - -Result of running -`MxGateway.Worker.Tests.AlarmClientDiscoveryTests.DumpAlarmClientPublicSurface` -against the deployed AVEVA assembly: - -- File: - `C:\Program Files (x86)\ArchestrA\Framework\Bin\ViewAppFramework\Content\MA\aaAlarmManagedClient.dll` -- Assembly identity: `aaAlarmManagedClient, Version=1.0.7368.41290, - Culture=neutral, PublicKeyToken=7ebd82b507d9e10c` - -## Public types - -- `aaAlarmManagedClient.AlarmClient` (class) -- `aaAlarmManagedClient.PriorityData` (class) - -That's the entire exported surface — two types, no interfaces, no -delegates. - -## `AlarmClient` events - -**None.** The class has no public events at all. The reflection probe's -`GetEvents(BindingFlags.Public | Instance | Static)` returned an empty -list. 
- -## `AlarmClient` methods (relevant subset) - -- **Lifecycle:** - `RegisterConsumer(int hWnd, string szProductName, string - szApplicationName, string szVersion, bool bRetainHiddenAlarms) → int`, - `DeregisterConsumer() → int`, - `InitializeConsumer(string szApplicationName) → int`, - `UninitializeConsumer() → int`, - `Dispose()`. -- **Subscription:** - `Subscribe(string szSubscription, short wFromPri, short wToPri, - eQueryType QueryType, eSortFlags SortFlags, eAlarmFilterState - FilterMask, eAlarmFilterState FilterSpecification) → int`. -- **Change enumeration (pull on poke):** - `GetStatistics(out int lPercentQuery, out int lTotalAlarms, out int - lActiveAlarms, out int lSuppressedAlarms, out int lSuppressedFilters, - out int lNewAlarms, out int lChangesCount, out int[] ChangeCodes, - out int[] ChangePos, out int[] hAlarm) → int`. -- **Record fetch:** - `GetAlarmExtendedRec(int lIndex, out AlarmRecord almRec) → int`, - `GetAlarmExtendedRec2(...)`, - `GetHighPriAlarm(out AlarmRecord almRec) → int`. -- **Selection model** (used by ack-selected-* family): - `DeselectAll`, `SelectAlaramEntry(short select, int from, int to)`, - `SelectByGUID(Guid)`, `SelectAlarmCount(int from, int to)`. -- **Acknowledge:** - `AlarmAckByGUID(Guid alarmGuid, string ackComment, string ackOprName, - string ackOprNode, string ackOprDomain, string ackOprFullName) → int` - is the per-alarm full-fidelity native ack. - `AlarmAckSelected(string ackComment, string ackOprName, string - ackOprNode, string ackOprDomain, string ackOprFullName) → int` - acks whatever the selection model currently has selected. - Several `AckSelected*Group/Tag/Priority/All/Visible*Alarms_Ex(...)` - variants exist for bulk ack scoped to a group / tag / priority range. -- **Suppress / shelve:** `SupressSelected*` and `ShelveSelected*` - families plus `DoAlarmShelveAction(...)`. Out of scope for the v1 - alarm path. 
-- **Snapshot/filter** (`SF*` prefix): `SFSetSortA / SFSetFilterA / - SFCreateSnapshot / SFGetListCount / SFDeleteSnapshot / SFRefreshAlarm / - SFGetStatistics`. Snapshot-style query API, distinct from the - consumer-subscription path. Not currently used. - -## What this means - -The architecture comment on -`src/MxGateway.Worker/MxAccess/AlarmClientConsumer.cs` (PR A.5) is -**wrong against this deployed assembly**: - -> "The AVEVA alarm-manager surface (`IAlarmMgrDataProvider`) exposes -> the events we need as plain .NET events — no Windows message pump -> required." - -There is no managed event surface. `AlarmClient.RegisterConsumer` -takes an `hWnd` because **WM_APP messaging is the actual notification -mechanism**: AVEVA's alarm provider WM_APP-pokes the registered window, -and the consumer is expected to call `GetStatistics` on each poke to -pull `ChangeCodes` / `ChangePos` / `hAlarm` arrays, then -`GetAlarmExtendedRec(pos, …)` per index to fetch each changed record. - -`AlarmClientConsumer.AlarmRecordReceived` has no production callers as -a result — `RaiseAlarmRecordReceived` is `internal` for tests and -never gets invoked at runtime. Until A.2 lands a WM_APP pump, -`MX_EVENT_FAMILY_ON_ALARM_TRANSITION` cannot carry events. - -## Live runtime probe — 2026-05-01 - -`MxGateway.Worker.Tests.AlarmClientWmProbeTests.ProbeAlarmClientWmMessages` -is a Skip-gated runtime probe that creates a real message-only -window, calls `AlarmClient.RegisterConsumer(hWnd, …)` + -`Subscribe(@"\Galaxy!", …)`, and pumps for 20s while logging every -window message that arrives. Run results below — this turned the -"WM_APP pump" design assumption upside down. - -**`RegisterConsumer` and `Subscribe` both returned 0 (success).** The -calls are valid against the deployed assembly; no parameter pinning -needed. 
- -**A registered-message-class WM (ID `0xC275` in this OS session) -fired every ~1s after `Subscribe` completed.** Constant -`wParam = 0x00001100`, constant `lParam = 0x079E46D8` (looks like a -stable pointer into AVEVA-internal state) for all 20 hits. The -constant payload across hits with no Galaxy alarm being fired -suggests this is a **heartbeat/keepalive**, not a per-change -notification. - -**Critically: this WM is delivered to AVEVA's own internal window -(`hwnd=0x18032E`) — NOT to the consumer's `hWnd` we passed in.** The -consumer window's `WndProc` received only the standard creation -sequence (`WM_GETMINMAXINFO`, `WM_NCCREATE`, `WM_NCCALCSIZE`, -`WM_CREATE`) and the destruction sequence (`WM_NCDESTROY`, -`WM_DESTROY`, `WM_NCCALCSIZE`) — nothing in between. AVEVA's -notification path runs entirely against AVEVA's internal window; -it never forwards to the user-supplied hWnd. - -The message ID itself is dynamic (a `RegisterWindowMessage` -allocation in the >= 0xC000 range), so it cannot be hard-coded — -each consumer process must call `RegisterWindowMessage` with the -correct *string* and use whatever ID the OS returns. - -## What this means for A.2 - -The "WM_APP pump on the user hWnd" design — what the original plan -banner described and what the previous version of this doc -recommended — does not match how AVEVA actually delivers -notifications. The hWnd parameter to `RegisterConsumer` does not -appear to receive any of AVEVA's alarm traffic; it's likely used -only as a registration identity (and perhaps as a parent for modal -dialogs). - -Two viable A.2 designs given the probe data: - -1. **Polling.** Just call `GetStatistics` on a timer (e.g. every - 500ms in the worker's STA) and react to the change set it - reports. No window plumbing needed. Trade-off: latency floor = - poll period; modest CPU floor because the call is cheap. Matches - the heartbeat-style WM 0xC275 semantics — AVEVA itself runs a - poll loop internally. -2. 
**Hook AVEVA's internal window.** Discover AVEVA's own window - (`hwnd=0x18032E` in the probe), `SetWindowsHookEx` or - `SetWindowSubclass` on it, and intercept WM 0xC275 on AVEVA's - thread. Higher fidelity, near-zero latency, but invasive, - fragile across AVEVA upgrades, and requires running on the same - process / thread as the AVEVA window. Probably a non-starter - without further AVEVA documentation. - -**Recommendation:** the polling path (option 1) is cheaper to -implement, more robust against AVEVA-internal change, and -acceptable for a typical alarm cadence. The worker's existing STA -already provides a thread-affinitized timer surface. The unanswered -question is whether `GetStatistics` can be safely called outside -AVEVA's own message-pump thread — confirmable by extending the -probe to fire `GetStatistics` on its own thread and check the -result. - -## Alarm-provider visibility — third probe run, 2026-05-01 - -Extended the probe to call `AlarmClient.GetProviders` after -`RegisterConsumer`. Result on this rig: - -``` -GetProviders -> rc=0 count=0 list=[] -``` - -**Zero alarm providers visible to the consumer process.** This -explains every preceding probe run: no providers means no alarm -events, regardless of how many times any value (including a -bool with an `$Alarm` extension) flips. `Subscribe(@"\Galaxy!")` -returns 0 (success) but matches nothing because the alarm-manager -chain that provides the matching feed doesn't expose any provider -to this consumer. - -A System Platform script flipping `TestMachine_001.TestAlarm001` -every 10s during this probe run produced no observable -`GetStatistics` transitions, no `positions[]` / `handles[]` -entries, no change in any field — confirms the silence is not -about subscription-scope / message-pump but about provider -absence. - -### Possible causes - -1. 
**No `$Alarm` extension on the test bool.** If - `TestMachine_001.TestAlarm001` is a regular UDA without a - `BoolAlarm` extension wired to it, flipping the value just - writes a new value — no alarm fires. -2. **Alarm manager service not running.** AVEVA's `aaAlarmMgr` - (or the equivalent on this rig's Platform version) needs to - be running for providers to register. -3. **Process security context.** A consumer running under a - normal user account may not see providers that registered - under `LocalSystem` / a Platform service identity. The - gateway-worker installation runs under a service account - that may have access where `dotnet test` doesn't. - -## InitializeConsumer required — fourth probe run, 2026-05-01 - -Adding `InitializeConsumer("AlarmProbe.Tests")` before -`RegisterConsumer` made `\Galaxy!` appear in `GetProviders` -(count=1, status 0 → 100 within 500ms). So #2 and #3 above are -NOT the cause — the consumer can see the alarm provider once it -calls Initialize. That's a missing API-call ordering, not a -permission or service issue. - -``` -InitializeConsumer -> 0 -RegisterConsumer -> 0 -GetProviders [after Register] -> rc=0 count=0 list=[] -Subscribe('\Galaxy!') -> 0 -GetProviders [after Subscribe] -> rc=0 count=1 list=[ 0 \Galaxy!] -GetProviders [poll #1] -> rc=0 count=1 list=[100 \Galaxy!] -``` - -Despite the provider being visible at "100% query complete" for -the entire 60s window, `GetStatistics` continued to report -`total=0 active=0 codes=[7]` — no alarm transitions reached the -consumer even with a System Platform script flipping the test -boolean every 10s during the run. - -That isolates the remaining unknown to whether the test bool's -alarm extension is actually generating MxAccess alarm-provider -events when its value flips. The probe has confirmed every link -in the consumer chain works (Initialize → Register → Subscribe → -provider visible at 100%) — what's missing is alarm traffic from -the producer side. 
ObjectViewer or another live consumer running -alongside the script is the next discriminator: does it visibly -see the alarm fire? - -API-ordering finding: `InitializeConsumer` MUST precede -`RegisterConsumer` (or at least, must be called before -`GetProviders` returns anything). PR A.5's `AlarmClientConsumer` -omits `InitializeConsumer` entirely — that's a bug fix to apply -even before A.2 lands, since without it the provider chain never -becomes visible. - -## Subscribe-parameter sweep — fifth probe run, 2026-05-01 - -Even with `InitializeConsumer` + provider visible at status 100, -no alarm transitions arrived during a 60s window with the user's -script flipping the test bool every 10s. Tried: - -- `qtSummary` and `qtHistory` (the only `eQueryType` values). -- Priority 1..999 and 0..32767. -- `eAlarmFilterState.asNone` and `asAlarmActiveNow` for both - `FilterMask` and `FilterSpecification`. - -`eAlarmFilterState` is single-state-valued (asNone=0, -asAlarmActiveNow=1, asAlarmAcked=2, asShelved=3), not flag bits. -None of these knobs surfaced any alarm activity. - -User confirmation 2026-05-01: the test bool does have a -`BoolAlarm` extension on it; in `aaObjectViewer` the -`$Alarm.InAlarm` sub-attribute flips true/false in lockstep with -the script's writes. So the alarm extension is **evaluating** -its condition, just not visibly producing transitions on the -`aaAlarmManagedClient` consumer stream. - -## Multi-channel + multi-subscription probe — sixth run, 2026-05-01 - -Extended the probe to try every consumer-side approach in -parallel: - -- **Subscription expressions** (sequential): `\Galaxy!`, - `\Galaxy!*`, `\\Galaxy!`, `\Galaxy!TestArea`, `\\.\Galaxy!`. - All Subscribe calls returned rc=0; the last one - (`\\.\Galaxy!`) is reflected in `GetProviders` (count=1). -- **Read channels** polled at 500ms cadence: `GetStatistics`, - `GetHighPriAlarm`, `SFCreateSnapshot` + `SFGetStatistics`. 
-- **Filter+sort**: priority 0..32767, `qtSummary`, - state=`asAlarmActiveNow`, sort=`sfReturnNewestFirst`. -- **AlarmRecord init** (worked around `Not a valid Win32 - FileTime` exception): all DateTime fields pre-set to FILETIME - epoch (1601-01-01 UTC) before the call, since - `default(DateTime)` is outside FILETIME range and trips the - interop marshaler. - -Result of the 60s run with `TestMachine_001.TestAlarm001` being -flipped every 10s: - -``` -Subscribe('\Galaxy!') -> 0 -Subscribe('\Galaxy!*') -> 0 -Subscribe('\\Galaxy!') -> 0 -Subscribe('\Galaxy!TestArea') -> 0 -Subscribe('\\.\Galaxy!') -> 0 -GetProviders [after Subscribe-multi] -> count=1 list=[ 0 \\.\Galaxy!] -GetStatistics #1: total=0 active=0 changes=1 codes=[7] positions=[] handles=[] -GetHighPriAlarm #1: rc=0 { } -SF channel #1: SFCreate=0 numAlarms=0 SFStats=0 unackRet=0 unackAlm=0 ackAlm=0 others=0 events=0 idxNewest=-1 -``` - -**No further "(changed)" entries for the entire 60s window.** -Every read API returned the same empty result on every poll. - -User confirms the alarm IS firing — `aaObjectViewer` sees -`$Alarm.InAlarm` flip in lockstep with the script. Historian -records exist (per user — needs verification by querying the -historian directly). - -## Conclusion of consumer-side probing - -`aaAlarmManagedClient.AlarmClient` is **not** the receive -surface AVEVA's alarm pipeline routes to in this Galaxy -configuration. The consumer chain is verified end-to-end: - -- `InitializeConsumer` + `RegisterConsumer` + `Subscribe` all - succeed (rc=0). -- `GetProviders` finds `\Galaxy!` once Initialize is called. -- All read APIs (`GetStatistics`, `GetHighPriAlarm`, - `SFCreateSnapshot`/`SFGetStatistics`) return empty even with - every documented filter combination. -- The consumer's hWnd receives zero AVEVA messages between - `WM_CREATE` and `WM_DESTROY`; AVEVA's traffic goes to its own - internal hwnd. - -The next investigation directions are not consumer-side: - -1. 
**Inspect `aaObjectViewer`'s alarm SDK** to see what library - it uses to read alarms. If different from - `aaAlarmManagedClient`, switch the worker over. -2. **Query the historian directly** (`aahEventStorage` / - `aahEventSvc`) to confirm alarms are recorded — and use the - same path for v2 alarm capture. -3. **Inspect AVEVA's alarm-routing config** for this Galaxy in - System Platform IDE — area assignments, alarm provider - bindings, "publish alarm events to" settings on the platform. - -For A.2 implementation: the `aaAlarmManagedClient` path the -gateway-worker is currently architected around may be a -dead-end on customer Galaxies configured this way. If the -alarms truly only flow through the historian event-storage path, -A.2 needs to consume from `aahEventStorage` instead — a -fundamental architecture pivot. - -## BREAKTHROUGH — seventh probe run, 2026-05-01 - -Two changes finally produced a signal: - -1. **Subscription scope:** `\\\Galaxy!` is the - canonical AlarmClient subscription format (per ArchestrA Alarm - Client docs at `archestra6.rssing.com/chan-12008125/article13.html`): - `\\Node\Provider!Area!Filter`, where Node is the *machine* name, - Provider is **literally `Galaxy`**, and Area is a hosted area - object. For this rig (`\\DESKTOP-6JL3KKO\Galaxy!DEV`) the DEV - area — the platform's primary area — is the right scope. Earlier - `\Galaxy!`, `\Galaxy!TestArea`, `\\.\Galaxy!`, etc., all returned - rc=0 but matched no traffic — they were not the canonical form. -2. **`InitializeConsumer` before `RegisterConsumer`** — already - discovered earlier; bug-fix for PR A.5's `AlarmClientConsumer`. 
- -With both in place, `GetHighPriAlarm` returned a record on every -poll for 60s straight (117/117 calls), but threw -`ArgumentOutOfRangeException: Not a valid Win32 FileTime` instead -of returning successfully — the AlarmRecord struct contains five -DateTime fields (`ar_Time`, `ar_OrigTime`, `ar_AckTime`, -`ar_RtnTime`, `ar_SubTime`) and AVEVA writes sentinel/invalid -FILETIME values for unset ones (e.g., `ar_AckTime` for an -unacknowledged alarm). The .NET interop that AVEVA ships -(`aaAlarmManagedClient.dll`) auto-converts FILETIME→DateTime and -rejects out-of-range values. - -`GetStatistics` continues to report `total=0 active=0` even with -GetHighPriAlarm returning records — those two API surfaces have -genuinely different views in AVEVA's data model. - -So: **alarms flow through `aaAlarmManagedClient.AlarmClient` once -the subscription expression is canonical**. The blocking issue is -extracting the payload past the .NET interop's DateTime -auto-marshaling. - -## Remaining work to capture alarm payloads - -Define a custom COM interop that uses `long` (FILETIME-as-int64) -instead of `DateTime` for the timestamp fields. Approach options: - -1. **Patch the AVEVA-shipped `aaAlarmManagedClient.dll`** — ildasm - the assembly, replace `DateTime` with `long` on AlarmRecord's - timestamp fields, ilasm back. Brittle across AVEVA upgrades. -2. **Write our own `[ComImport]` interface** — declare - `IRawAlarmConsumer` ourselves with safe-blittable types, - discover the underlying COM IID (via reflection on - `AlarmClient`'s `[Guid]` attribute), and `(IRawAlarmConsumer) - alarmClient` cast. Cleaner; requires the IID. -3. **Use `IDispatch` late binding** — dispatch-Invoke bypasses - strong-typed marshaling. Verbose but doesn't need IIDs. - -For PR A.2's worker integration, option 2 is the least -disruptive. Once the interop is custom, `AlarmClient.Subscribe` + -`GetHighPriAlarm` + `GetAlarmExtendedRec` form a viable -polling-style alarm consumer. 
- -**REVISED 2026-05-01 — option 1 not directly applicable.** -Reflection on `aaAlarmManagedClient.AlarmClient` shows it -implements only `IDisposable` (no `[ComImport]` interface, no -class GUID). It has a single field `CwwAlarmConsumer* -m_almUnmanaged` — meaning `AlarmClient` is a **C++/CLI managed -wrapper around a native C++ class**, NOT a COM-interop class. -The DateTime conversion happens inside the AVEVA wrapper's IL, -not at a .NET-to-COM marshaling boundary. There is no separate -COM interface IID we can QI to. - -Revised approach options: - -A. **Switch to `wnwrapConsumer.dll`** — a separate standalone - COM library AVEVA ships at - `C:\Program Files (x86)\Common Files\ArchestrA\wnwrapConsumer.dll` - exposing `WNWRAPCONSUMERLib.wwAlarmConsumerClass` with - `SetXmlAlarmQuery` / `GetXmlCurrentAlarms`. XML-string output - bypasses FILETIME marshaling entirely. -B. **Patch `aaAlarmManagedClient.dll` IL** — wrap the unsafe - `DateTime.FromFileTime` calls with a safe variant. Direct - fix but modifies a vendor binary. -C. **Reflect into `m_almUnmanaged` and call native vtable** — - get the IntPtr, walk the MSVC C++ vtable, call - `__thiscall` methods via `Marshal.GetDelegateForFunctionPointer`. - Doable but requires reverse-engineering the C++ class layout. - -Option A is the best fit: real COM-based, self-contained in -our code, conventional production-grade approach (the WIN-911 -consumer pattern referenced in AVEVA support forums uses it). - -The polling-vs-WM_APP-callback question from earlier is now -moot: `GetStatistics`'s `positions[]/handles[]` arrays remained -empty even when alarms were demonstrably present. The active -read API for current alarms is `GetHighPriAlarm`, not -`GetStatistics`'s change array. - -### Implications for A.2 implementation - -The A.2 PR's value is unmeasurable until at least one alarm -provider is visible. 
The choice between polling-via-`GetStatistics` -and the callback path can only be decided by observing what -populates first when a real alarm fires. Without a provider, -both paths return the same "nothing happening" answer. - -Until that's resolved, A.2 implementation work is genuinely -blocked on a dev-rig configuration issue — not on architectural -choice or code structure. - -## GetStatistics polling — second probe run, 2026-05-01 - -Extended the probe to call `GetStatistics` every ~2s alongside the -WM logger. Key findings: - -- **`GetStatistics` is safely callable from the same thread that - did `RegisterConsumer` + `Subscribe`.** Every poll returned rc=0 - with no exceptions over 9 polls / 20s window. -- **The deployed Galaxy currently has zero active alarms.** Every - poll reported `total=0 active=0 suppressed=0 newAlarms=0`. The - `positions[]` and `handles[]` arrays were empty. -- **`changes=1 codes=[7]` was constant across all polls**, matching - the constant 1 Hz WM 0xC275 cadence. Code 7 is consistent with a - "heartbeat / subscription healthy" sentinel — same semantics as - the WM but reported through the pull-side API. -- `percent=100` (query-complete percentage) was constant — the - subscription is steady-state. - -This confirms the polling design (option 1 in the previous section) -is mechanically viable. The remaining open question is whether -`GetStatistics` populates `positions[] / handles[]` with real -entries when an alarm transition actually fires — proving that -requires firing an alarm. - -## Open follow-up probes - -Each can be added to `AlarmClientWmProbeTests` as a separate -Skip-gated test: - -1. **Fire a real Galaxy alarm during the pump window.** The cleanest - programmatic trigger is an MxAccess write that flips a - `$Alarm`-extended boolean to true (alarm in) and back to false - (alarm out). 
Pinning the exact tag reference is pending — needs - either a documented test-fixture tag or an interactive selection - in System Platform IDE. Once the trigger fires, this resolves - whether AVEVA's pulled change set arrives via `GetStatistics` - `positions[] / handles[]` (per-change polling works) or only via - the AVEVA-internal window (callback path needed). -2. **Hook AVEVA's internal window** to log what WMs it actually - processes — only relevant if probe 1 shows `GetStatistics` does - NOT report per-change activity. -3. **Decompile `aaAlarmManagedClient.dll`'s IL** for the - `RegisterConsumer` method to find what `RegisterWindowMessage` - string is used and whether there's a callback-registration - surface on `WNAL_Register` that the managed client wraps. The - alarmlst.dll strings (`WNAL_CallBack`, "Invalid callbacks" error) - suggest the underlying C API takes callbacks, but the managed - wrapper exposes none of them. - -PR A.5's `Subscribe` / `AcknowledgeByGuid` / `SnapshotActiveAlarms` -are correct — they're pull-style and don't depend on the -notification mechanism. - -## Option A — captured, 2026-05-01 - -`wnwrapConsumer.dll` (`C:\Program Files (x86)\Common Files\ -ArchestrA\wnwrapConsumer.dll`) hosts the standalone COM class -`WNWRAPCONSUMERLib.wwAlarmConsumerClass`. Type library imports -cleanly via `tlbimp` (output stored under `mxaccessgw/lib/ -Interop.WNWRAPCONSUMERLib.dll`). The COM class is registered in -`HKLM:\SOFTWARE\WOW6432Node\Classes\CLSID\ -{7AB52E5F-36B2-4A30-AE46-952A746F667C}` with `ThreadingModel= -Apartment` — `new wwAlarmConsumerClass()` succeeds via -`CoCreateInstance`. - -The probe `MxGateway.Worker.Tests/WnWrapConsumerProbeTests.cs` -(Skip-gated, archival) drove the captured run. Lifecycle: - -1. `new wwAlarmConsumerClass()` — instantiated. -2. `InitializeConsumer("MxGatewayProbe.WnWrap")` -> 0. -3. `RegisterConsumer(hWnd: 0, productName, applicationName, - version)` -> 0. 
**Note:** wnwrap's `RegisterConsumer` is - 4-arg (no `bRetainHiddenAlarms`); `aaAlarmManagedClient`'s - is 5-arg. Different surface. -4. `Subscribe(@"\\\Galaxy!DEV", priLow=1, priHigh=999, - qtSummary, sfReturnNewestFirst, asAlarmActiveNow, - asAlarmActiveNow)` -> 0. Same canonical scope that worked - for `aaAlarmManagedClient`. -5. `SetXmlAlarmQuery(...)` was called too but the round-trip - `GetXmlAlarmQuery` returned a mangled echo (NODE became - `DESKTOP-6JL3KKO\Galaxy!DEV`, PROVIDER became `Galaxy!DEV`, - ALARM_STATE shortened to `All`, DISPLAY_MODE truncated to - `Sum`). The XML-query path looks broken in this build; rely - on `Subscribe` for the filter and skip `SetXmlAlarmQuery` in - production. Confirming "Subscribe alone is sufficient" is - one follow-up probe (call `Subscribe` and read XML, no - `SetXmlAlarmQuery`) — out of scope for the breakthrough run - but easy to verify. - -### Captured XML (60 polls over 30s, 500ms cadence) - -`GetXmlCurrentAlarms2(maxAlmCnt: 100, out vartCurrentXmlAlarms)` -returned BSTR XML cleanly on every call — 60/60 ok, zero throws. -`GetXmlCurrentAlarms` (the v1 method) returned identical content -on the same cadence; either method is viable. - -Empty state: - -```xml - -``` - -With alarm active (`UNACK_ALM`, value=true after the flip -script set the bool true): - -```xml - - - - BCC4705395424D65BDAABCDEA6A32A73 - 2026/5/1 - - 240 - 0 - DESKTOP-6JL3KKO - Galaxy - TestArea - TestMachine_001.TestAlarm001 - DSC - true - true - 500 - UNACK_ALM - - - Test alarm #1 - - -``` - -After the script set the bool false (`UNACK_RTN`, value=false): - -```xml - - - - BCC4705395424D65BDAABCDEA6A32A73 - 2026/5/1 - - ... - false - UNACK_RTN - ... - - -``` - -The 10s cadence between transitions matches the System Platform -script's flip frequency exactly. 
**GUID is stable across the -in→out cycle** (`BCC4705…` carried through both states), so the -XML stream represents the alarm record's lifecycle, not separate -event records — this is "current alarms snapshot," not -"transition stream." For an OPC UA `AlarmConditionService` -adapter this is fine: condition-state changes per-snapshot is -the supported model. - -`STATE` enum values observed: `UNACK_RTN` (the alarm has -returned to normal but is unacknowledged — i.e., visible in the -"current alarms" list because operator hasn't acked it yet) and -`UNACK_ALM` (the alarm is currently active and unacknowledged). -The other states from `eAlmState` (`ACK_RTN`, `ACK_ALM`) would -appear when an ack is performed — `wwAlarmConsumerClass.AlarmAckByGUID` -is the method to call. - -### `GetStatistics` AV — unrelated quirk - -Every `GetStatistics` call threw `AccessViolationException` in -the probe. Cause: the wnwrap interop signature uses `IntPtr` for -the three array out-parameters (`pChangeCode`, `pChangePos`, -`phAlarm`); passing `IntPtr.Zero` is wrong — the COM impl is -writing into the buffer pointer without null-checking. Pre- -allocate three int-arrays and pass pinned pointers (or use -`Marshal.AllocCoTaskMem`) to fix. Not required for the -production path — the XML methods give us everything we need. - -### Implications for PR A.2 worker integration - -Replacing `aaAlarmManagedClient.AlarmClient` with -`WNWRAPCONSUMERLib.wwAlarmConsumerClass` in the worker's -alarm-consumer surface unblocks A.2 fully. Outline: - -1. **Reference path:** drop `aaAlarmManagedClient.dll` reference - from `MxGateway.Worker.csproj`; add `Interop.WNWRAPCONSUMERLib.dll` - reference from `mxaccessgw/lib/`. (Or commit the interop dll - in-tree under `lib/` and reference relatively.) -2. **`AlarmClientConsumer` → `WnWrapAlarmConsumer`:** rewrite - the consumer wrapper to: - - `new wwAlarmConsumerClass()` on the worker's STA thread. 
-   - `InitializeConsumer(applicationName)` then
-     `RegisterConsumer(hWnd: 0, …)`.
-   - `Subscribe(@"\\\Galaxy!", …)` per configured
-     area. The `` and `` are configurable (default
-     `Environment.MachineName` + the platform's primary area).
-   - Poll `GetXmlCurrentAlarms2(maxAlmCnt, out xml)` on a
-     timer (500ms-1s cadence is comfortable). Parse XML
-     payload; diff against the previous snapshot (keyed by
-     `GUID`); emit `MX_EVENT_FAMILY_ON_ALARM_TRANSITION`
-     events for added/changed/removed records.
-   - `AlarmAckByGUID(VBGUID, comment, oprName, node, domain,
-     fullName)` for client-driven acknowledgements (matches
-     PR A.5's `AlarmAckCommand` payload).
-   - Lifecycle teardown: `DeregisterConsumer` +
-     `UninitializeConsumer` + `Marshal.FinalReleaseComObject`.
-3. **Conversion layer:** map XML record fields to
-   `MxAlarmConditionRecord` proto:
-   - `GUID` → `condition_id` (canonicalize the no-dashes hex
-     to a UUID string).
-   - `STATE` enum → `inAlarm` + `acked` booleans
-     (`UNACK_ALM` → in_alarm=true, acked=false;
-     `UNACK_RTN` → in_alarm=false, acked=false;
-     `ACK_ALM` → in_alarm=true, acked=true;
-     `ACK_RTN` → in_alarm=false, acked=true).
-   - `DATE + TIME + GMTOFFSET + DSTADJUST` → reassemble UTC
-     timestamp; matches the worker's existing `Timestamp`
-     wire format.
-   - `PRIORITY` → severity (already 1-1000-ish range).
-   - `TAGNAME` → reference; `PROVIDER_NAME` + `GROUP` for
-     scope metadata.
-4. **PR A.5 fix carry-over:** `InitializeConsumer` MUST be
-   called before `RegisterConsumer` (rediscovered with
-   `aaAlarmManagedClient`, also true here). The existing
-   `AlarmClientConsumer` skips Initialize entirely; the new
-   `WnWrapAlarmConsumer` includes it from day one.
-5. **Test reuse:** PR A.5's snapshot/ack contract tests can
-   stay — they don't touch the underlying COM API. Add a new
-   integration test against the wnwrap surface (live-AVEVA-only,
-   Skip-gated like the probe).
-
-### Settled API-ordering and surface knowledge
-
-- `InitializeConsumer` first, then `RegisterConsumer` — both
-  on `aaAlarmManagedClient.AlarmClient` and
-  `wwAlarmConsumerClass`.
-- `RegisterConsumer` arity differs:
-  `aaAlarmManagedClient.AlarmClient.RegisterConsumer(hWnd,
-  product, app, version, bRetainHiddenAlarms)` — 5 args;
-  `wwAlarmConsumerClass.RegisterConsumer(hWnd, product, app,
-  version)` — 4 args. The wnwrap class has no
-  `bRetainHiddenAlarms` parameter at all.
-- Subscription expression format: `\\\Galaxy!`
-  (literal `Galaxy` provider) for both libraries.
-- Native ack: `AlarmAckByGUID(VBGUID guid, comment, oprName,
-  node, domain, fullName)` on the v2 surface; ID 5-arg
-  variant on the legacy `IwwAlarmConsumer` interface.
-
-These findings retire the open follow-up probes from the
-"polling-vs-pump" debate above — `wwAlarmConsumerClass` plus
-poll-on-timer is the implementation.
-
-## Live smoke-test discoveries — 2026-05-01
-
-The Skip-gated `AlarmsLiveSmokeTests.Alarms_full_pipeline_round_trip`
-ran the full
-`WnWrapAlarmConsumer` + `AlarmDispatcher` + `MxAccessAlarmEventSink`
-pipeline against the dev rig with the flip script running. End-to-end
-verified: 6 real transitions captured on the 10s cadence, ack-by-name
-returned rc=0, pipeline stayed healthy through 5 more transitions
-afterwards. Three production-relevant quirks surfaced and were fixed
-in the consumer:
-
-### 1. `SetXmlAlarmQuery` is mandatory for reads despite the mangled echo
-
-Without `SetXmlAlarmQuery`, the first `GetXmlCurrentAlarms2` call
-fails with `E_FAIL` (HRESULT `0x80004005`). The discovery doc above
-flagged the round-trip echo as mangled and recommended skipping the
-call — that recommendation is **wrong**. The echo *is* mangled (AVEVA
-parses NODE/PROVIDER/ALARM_STATE/DISPLAY_MODE incorrectly), but the
-call itself is required as some kind of subscription enabler. Even
-the Subscribe call setting the actual filter doesn't avoid the need
-for `SetXmlAlarmQuery`.
-
-`WnWrapAlarmConsumer.ComposeXmlAlarmQuery(subscription)` decomposes
-the canonical `\\\Galaxy!` form into the XML's
-NODE/PROVIDER/GROUP fields. Mangled or not, the call enables reads.
-
-### 2. Two consumers required: read-side vs. ack-side
-
-`SetXmlAlarmQuery` enables reads but **breaks `AlarmAckByName` on
-the same consumer instance**. With SetXml applied, AlarmAckByName
-returns -55 even with valid name+provider+group+operator. Without
-SetXml, AlarmAckByName succeeds with rc=0.
-
-The production consumer therefore provisions **two** wnwrap COM
-instances:
-- Primary consumer (`client`): runs full lifecycle including
-  `SetXmlAlarmQuery` for `GetXmlCurrentAlarms2` polls.
-- Ack-only consumer (`ackClient`): runs Initialize → Register →
-  Subscribe via the v1-prefixed methods, **no SetXmlAlarmQuery**.
-  All `AcknowledgeByName` calls dispatch through this instance.
-
-Both consumers subscribe to the same expression. Disposal cleans up
-both via a shared `ReleaseConsumerCom` helper.
-
-### 3. `AlarmAckByName` v2 8-arg vs. v1 6-arg
-
-`wwAlarmConsumerClass` exposes two `AlarmAckByName` overloads:
-- `IwwAlarmConsumer2` v2: 8 args (`name, provider, group, comment,
-  oprName, node, domainName, oprFullName`).
-- `IwwAlarmConsumer` v1: 6 args (no domain, no full-name).
-
-The v2 8-arg method returns -55 on this AVEVA build regardless of
-operator-identity inputs — looks like a stub. The v1 6-arg method
-works. Production `WnWrapAlarmConsumer.AcknowledgeByName` calls the
-6-arg overload and discards the proto's `domain` + `full_name` fields.
-The proto contract keeps the 8 fields for forward compatibility if
-AVEVA fixes the v2 method later.
-
-### 4. `AlarmAckByGUID` is not implemented
-
-The v2 `AlarmAckByGUID(VBGUID, …)` throws `NotImplementedException`
-(COM `E_NOTIMPL`) on `wwAlarmConsumerClass` against this AVEVA
-build. The reference→GUID lookup that we initially planned to wire
-through `AlarmAckByGUID` is therefore not viable on wnwrap; all acks
-must go through `AlarmAckByName`.
-
-The proto `AcknowledgeAlarmCommand` (GUID-based) and the worker's
-`MxAccessCommandExecutor.ExecuteAcknowledgeAlarm` switch arm remain
-in the codebase for the forward-compat shape, but the gateway-side
-`WorkerAlarmRpcDispatcher.AcknowledgeAsync` now always routes through
-`AcknowledgeAlarmByName` when the public RPC supplies a recognizable
-`Provider!Group.Tag` reference.
-
-**Command/reply payload reuse.** `MxCommand.payload` has a dedicated
-`acknowledge_alarm_by_name_command` field, but `MxCommandReply.payload`
-intentionally has **no** by-name-specific case. The by-name ack carries
-no outcome detail beyond the native return code, so the worker's
-`ExecuteAcknowledgeAlarmByName` sets the same `acknowledge_alarm`
-(`AcknowledgeAlarmReplyPayload`) reply case used by the GUID arm, with
-`native_status` = the `AlarmAckByName` return code (also echoed into the
-top-level `MxCommandReply.hresult`). Reply consumers must dispatch on
-`MxCommandReply.kind` (`MX_COMMAND_KIND_ACKNOWLEDGE_ALARM` vs.
-`MX_COMMAND_KIND_ACKNOWLEDGE_ALARM_BY_NAME`), not on the payload oneof
-case, to distinguish the two acks. `WorkerAlarmRpcDispatcher` reads only
-the top-level `hresult`/`protocol_status`, so it handles both arms
-without unpacking the payload.
-
-**Worker `native_status` → public `AcknowledgeAlarmReply` mapping.** The
-worker carries the ack outcome as a single `int32`
-(`AcknowledgeAlarmReplyPayload.native_status`, the `AlarmAckByName` /
-`AlarmAckByGUID` return code; `0` = success), also mirrored into the
-worker `MxCommandReply.hresult`. The public `AcknowledgeAlarmReply` has
-two outcome-shaped fields, but only one is populated:
-
-- `AcknowledgeAlarmReply.hresult` — `WorkerAlarmRpcDispatcher` copies the
-  worker's `MxCommandReply.hresult` (the native return code) into this
-  field. **This is the authoritative ack-outcome field**; `0` means the
-  ack succeeded. It is absent only when the worker reply omitted the
-  value, which is a protocol violation surfaced in `protocol_status`.
-- `AcknowledgeAlarmReply.status` (`MxStatusProxy`) — the worker by-name /
-  by-GUID ack path produces only the `int32` return code, never a
-  populated `MXSTATUS_PROXY` struct, so `WorkerAlarmRpcDispatcher` leaves
-  this field **unset on every reply**. It is reserved for a future
-  structured view of the ack outcome. Clients must not depend on it.
-
-Client authors should therefore branch on `protocol_status` first (for
-transport/session-level failures) and then on `hresult` (`0` = ack
-accepted by MXAccess) — never on `status`.
-
-### 5. STA / threading — production fix needed
-
-The wnwrap COM is `ThreadingModel=Apartment`. The consumer's
-internal `Timer` fires on threadpool threads and would block forever
-on cross-apartment marshaling unless the host STA pumps Win32
-messages. The smoke test sidesteps this by setting
-`pollIntervalMilliseconds=0` (Timer disabled) and driving `PollOnce`
-manually from the test's STA. Production hosting will route polls
-through the worker's `StaRuntime` in a follow-up — the consumer's
-`PollOnce` is `public` and idempotent so the wire-up is mechanical.
-
-### Capture summary
-
-```
-Transition: kind=Clear ref='Galaxy!TestArea.TestMachine_001.TestAlarm001' …
-Transition: kind=Raise ref='Galaxy!TestArea.TestMachine_001.TestAlarm001' …
-SnapshotActiveAlarms count=1
-  active: ref='Galaxy!TestArea.TestMachine_001.TestAlarm001' state=Active
-AcknowledgeByName(real identity) -> rc=0
-Post-ack transition: kind=Clear …
-+1: kind=Raise … (10s after ack)
-+2: kind=Clear … (20s)
-+3: kind=Raise … (30s)
-+4: kind=Clear … (40s)
-```
-
-10s cadence held throughout; full proto fields populated correctly;
-ack registered server-side without errors.
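
Reviewer note: the conversion-layer rules in the deleted document (the `STATE` to `in_alarm`/`acked` mapping and the no-dashes `GUID` canonicalization), together with its poll-and-diff step keyed by `GUID`, can be sketched as follows. This is an illustrative Python sketch only, not the worker's code: the names `state_to_flags`, `canonicalize_guid`, and `diff_snapshots` are hypothetical, and it assumes the XML `GUID` field is 32 hex digits already in display order and that each snapshot is a plain dictionary keyed by that GUID.

```python
import uuid

# STATE value -> (in_alarm, acked), copied from the mapping listed in the
# conversion-layer notes. The table name itself is hypothetical.
STATE_FLAGS = {
    "UNACK_ALM": (True, False),
    "UNACK_RTN": (False, False),
    "ACK_ALM": (True, True),
    "ACK_RTN": (False, True),
}


def state_to_flags(state):
    """Return (in_alarm, acked) for a STATE string; KeyError on unknown values."""
    return STATE_FLAGS[state]


def canonicalize_guid(raw):
    """Turn a 32-digit no-dashes hex GUID into the dashed, lowercase UUID form.

    Assumption: the hex digits are already in display order (no byte swapping).
    """
    return str(uuid.UUID(hex=raw))


def diff_snapshots(previous, current):
    """Diff two {guid: record} snapshots.

    Returns three sorted lists of GUIDs: added, changed, removed.
    """
    added = sorted(g for g in current if g not in previous)
    removed = sorted(g for g in previous if g not in current)
    changed = sorted(
        g for g in current if g in previous and current[g] != previous[g]
    )
    return added, changed, removed
```

Keying the diff on the GUID rather than on the tag name follows the document's "keyed by `GUID`" note; whether a removed record should be reported as a transition at all is a policy choice this sketch leaves to the caller.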
diff --git a/docs/GatewayConfiguration.md b/docs/GatewayConfiguration.md
index 1bb21bc..1602d18 100644
--- a/docs/GatewayConfiguration.md
+++ b/docs/GatewayConfiguration.md
@@ -184,9 +184,7 @@ behavior.
 | `MxGateway:Alarms:ReconcileIntervalSeconds` | `30` | How often the monitor reconciles its in-process alarm cache against the worker's authoritative active-alarm snapshot, catching transitions the live poll-and-diff feed missed. Floored at 5 seconds. |

 The alarm monitor is independent of client sessions: `AcknowledgeAlarm` and
-`StreamAlarms` are session-less RPCs served by the monitor. See
-[Alarm Client Discovery](./AlarmClientDiscovery.md) for the AVEVA consumer
-surface the monitor's worker session drives.
+`StreamAlarms` are session-less RPCs served by the monitor.

 ## Related Documentation

diff --git a/docs/Grpc.md b/docs/Grpc.md
index 126361a..f601c38 100644
--- a/docs/Grpc.md
+++ b/docs/Grpc.md
@@ -88,7 +88,7 @@ Carrying the enqueue timestamp into the worker layer is what lets queue-wait tim

 ### `AcknowledgeAlarm`

-`AcknowledgeAlarm` is a unary, **session-less** RPC that acknowledges a single alarm. The handler validates `alarm_full_reference` inline (it does not run through `MxAccessGrpcRequestValidator`) and delegates to `IGatewayAlarmService.AcknowledgeAsync`. The always-on `GatewayAlarmMonitor` routes the ack over its own gateway-managed worker session — clients no longer open a session to acknowledge an alarm. A reference that parses as a canonical GUID forwards to `AcknowledgeAlarmCommand`; a `Provider!Group.Tag` reference forwards to `AcknowledgeAlarmByNameCommand`. The alarm contract and the central monitor are documented in [Alarm Client Discovery](./AlarmClientDiscovery.md).
+`AcknowledgeAlarm` is a unary, **session-less** RPC that acknowledges a single alarm. The handler validates `alarm_full_reference` inline (it does not run through `MxAccessGrpcRequestValidator`) and delegates to `IGatewayAlarmService.AcknowledgeAsync`. The always-on `GatewayAlarmMonitor` routes the ack over its own gateway-managed worker session — clients no longer open a session to acknowledge an alarm. A reference that parses as a canonical GUID forwards to `AcknowledgeAlarmCommand`; a `Provider!Group.Tag` reference forwards to `AcknowledgeAlarmByNameCommand`.

 ### `StreamAlarms`