mxaccessgw/docs/AlarmClientDiscovery.md
Joseph Doherty e541339c07 docs(audit): apply per-cluster judgment fixes across living docs
Resolve audit findings: correct WorkerEnvelope proto/route/metric/session
facts; rewrite auth (ZB.MOM.WW.Auth migration), dashboard (ZB.MOM.WW.Theme),
and StyleGuide (foreign-project copy-paste); document alarm subsystem, Ldap
options, and gateway alarm broker; fix client CLI flags and package paths.
2026-06-03 16:01:28 -04:00

aaAlarmManagedClient discovery — public surface, 2026-05-01

Result of running ZB.MOM.WW.MxGateway.Worker.Tests.AlarmClientDiscoveryTests.DumpAlarmClientPublicSurface against the deployed AVEVA assembly:

  • File: C:\Program Files (x86)\ArchestrA\Framework\Bin\ViewAppFramework\Content\MA\aaAlarmManagedClient.dll
  • Assembly identity: aaAlarmManagedClient, Version=1.0.7368.41290, Culture=neutral, PublicKeyToken=7ebd82b507d9e10c

Public types

  • aaAlarmManagedClient.AlarmClient (class)
  • aaAlarmManagedClient.PriorityData (class)

That's the entire exported surface — two types, no interfaces, no delegates.

AlarmClient events

None. The class has no public events at all. The reflection probe's GetEvents(BindingFlags.Public | Instance | Static) returned an empty list.

AlarmClient methods (relevant subset)

  • Lifecycle: RegisterConsumer(int hWnd, string szProductName, string szApplicationName, string szVersion, bool bRetainHiddenAlarms) → int, DeregisterConsumer() → int, InitializeConsumer(string szApplicationName) → int, UninitializeConsumer() → int, Dispose().
  • Subscription: Subscribe(string szSubscription, short wFromPri, short wToPri, eQueryType QueryType, eSortFlags SortFlags, eAlarmFilterState FilterMask, eAlarmFilterState FilterSpecification) → int.
  • Change enumeration (pull on poke): GetStatistics(out int lPercentQuery, out int lTotalAlarms, out int lActiveAlarms, out int lSuppressedAlarms, out int lSuppressedFilters, out int lNewAlarms, out int lChangesCount, out int[] ChangeCodes, out int[] ChangePos, out int[] hAlarm) → int.
  • Record fetch: GetAlarmExtendedRec(int lIndex, out AlarmRecord almRec) → int, GetAlarmExtendedRec2(...), GetHighPriAlarm(out AlarmRecord almRec) → int.
  • Selection model (used by ack-selected-* family): DeselectAll, SelectAlaramEntry(short select, int from, int to), SelectByGUID(Guid), SelectAlarmCount(int from, int to).
  • Acknowledge: AlarmAckByGUID(Guid alarmGuid, string ackComment, string ackOprName, string ackOprNode, string ackOprDomain, string ackOprFullName) → int is the per-alarm full-fidelity native ack. AlarmAckSelected(string ackComment, string ackOprName, string ackOprNode, string ackOprDomain, string ackOprFullName) → int acks whatever the selection model currently has selected. Several AckSelected*Group/Tag/Priority/All/Visible*Alarms_Ex(...) variants exist for bulk ack scoped to a group / tag / priority range.
  • Suppress / shelve: SupressSelected* and ShelveSelected* families plus DoAlarmShelveAction(...). Out of scope for the v1 alarm path.
  • Snapshot/filter (SF* prefix): SFSetSortA / SFSetFilterA / SFCreateSnapshot / SFGetListCount / SFDeleteSnapshot / SFRefreshAlarm / SFGetStatistics. Snapshot-style query API, distinct from the consumer-subscription path. Not currently used.

What this means

Historical note (current as built). This discovery record predates the as-built alarm path. The AlarmClientConsumer.cs file referenced below was retired; the production consumer is src/ZB.MOM.WW.MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs (driven by the wwAlarmConsumerClass COM surface — see Option A below). The current public RPC surface and broker architecture are summarized in Current alarm path (as built) at the end of this document; the sections in between are kept as a discovery record.

The architecture comment on the (now-retired) AlarmClientConsumer.cs (PR A.5) was wrong against this deployed assembly:

"The AVEVA alarm-manager surface (IAlarmMgrDataProvider) exposes the events we need as plain .NET events — no Windows message pump required."

There is no managed event surface. AlarmClient.RegisterConsumer takes an hWnd because WM_APP messaging is the actual notification mechanism: AVEVA's alarm provider WM_APP-pokes the registered window, and the consumer is expected to call GetStatistics on each poke to pull ChangeCodes / ChangePos / hAlarm arrays, then GetAlarmExtendedRec(pos, …) per index to fetch each changed record.

AlarmClientConsumer.AlarmRecordReceived has no production callers as a result — RaiseAlarmRecordReceived is internal for tests and never gets invoked at runtime. Until A.2 lands a WM_APP pump, MX_EVENT_FAMILY_ON_ALARM_TRANSITION cannot carry events.

Live runtime probe — 2026-05-01

ZB.MOM.WW.MxGateway.Worker.Tests.AlarmClientWmProbeTests.ProbeAlarmClientWmMessages is a Skip-gated runtime probe that creates a real message-only window, calls AlarmClient.RegisterConsumer(hWnd, …) + Subscribe(@"\Galaxy!", …), and pumps for 20s while logging every window message that arrives. Run results below — this turned the "WM_APP pump" design assumption upside down.

RegisterConsumer and Subscribe both returned 0 (success). The calls are valid against the deployed assembly; no parameter pinning needed.

A registered-message-class WM (ID 0xC275 in this OS session) fired every ~1s after Subscribe completed. Constant wParam = 0x00001100, constant lParam = 0x079E46D8 (looks like a stable pointer into AVEVA-internal state) for all 20 hits. The constant payload across hits with no Galaxy alarm being fired suggests this is a heartbeat/keepalive, not a per-change notification.

Critically: this WM is delivered to AVEVA's own internal window (hwnd=0x18032E) — NOT to the consumer's hWnd we passed in. The consumer window's WndProc received only the standard creation sequence (WM_GETMINMAXINFO, WM_NCCREATE, WM_NCCALCSIZE, WM_CREATE) and the destruction sequence (WM_NCDESTROY, WM_DESTROY, WM_NCCALCSIZE) — nothing in between. AVEVA's notification path runs entirely against AVEVA's internal window; it never forwards to the user-supplied hWnd.

The message ID itself is dynamic (a RegisterWindowMessage allocation in the >= 0xC000 range), so it cannot be hard-coded — each consumer process must call RegisterWindowMessage with the correct string and use whatever ID the OS returns.

What this means for A.2

The "WM_APP pump on the user hWnd" design — what the original plan banner described and what the previous version of this doc recommended — does not match how AVEVA actually delivers notifications. The hWnd parameter to RegisterConsumer does not appear to receive any of AVEVA's alarm traffic; it's likely used only as a registration identity (and perhaps as a parent for modal dialogs).

Two viable A.2 designs given the probe data:

  1. Polling. Just call GetStatistics on a timer (e.g. every 500ms in the worker's STA) and react to the change set it reports. No window plumbing needed. Trade-off: latency floor = poll period; modest CPU floor because the call is cheap. Matches the heartbeat-style WM 0xC275 semantics — AVEVA itself runs a poll loop internally.
  2. Hook AVEVA's internal window. Discover AVEVA's own window (hwnd=0x18032E in the probe), SetWindowsHookEx or SetWindowSubclass on it, and intercept WM 0xC275 on AVEVA's thread. Higher fidelity, near-zero latency, but invasive, fragile across AVEVA upgrades, and requires running on the same process / thread as the AVEVA window. Probably a non-starter without further AVEVA documentation.

Recommendation: the polling path (option 1) is cheaper to implement, more robust against AVEVA-internal change, and acceptable for a typical alarm cadence. The worker's existing STA already provides a thread-affinitized timer surface. The unanswered question is whether GetStatistics can be safely called outside AVEVA's own message-pump thread — confirmable by extending the probe to fire GetStatistics on its own thread and check the result.

Alarm-provider visibility — third probe run, 2026-05-01

Extended the probe to call AlarmClient.GetProviders after RegisterConsumer. Result on this rig:

GetProviders -> rc=0 count=0 list=[]

Zero alarm providers visible to the consumer process. This explains every preceding probe run: no providers means no alarm events, regardless of how many times any value (including a bool with an $Alarm extension) flips. Subscribe(@"\Galaxy!") returns 0 (success) but matches nothing because the alarm-manager chain that provides the matching feed doesn't expose any provider to this consumer.

A System Platform script flipping TestMachine_001.TestAlarm001 every 10s during this probe run produced no observable GetStatistics transitions, no positions[] / handles[] entries, and no change in any field, which confirms the silence is not about subscription scope or the message pump but about provider absence.

Possible causes

  1. No $Alarm extension on the test bool. If TestMachine_001.TestAlarm001 is a regular UDA without a BoolAlarm extension wired to it, flipping the value just writes a new value — no alarm fires.
  2. Alarm manager service not running. AVEVA's aaAlarmMgr (or the equivalent on this rig's Platform version) needs to be running for providers to register.
  3. Process security context. A consumer running under a normal user account may not see providers that registered under LocalSystem / a Platform service identity. The gateway-worker installation runs under a service account that may have access where dotnet test doesn't.

InitializeConsumer required — fourth probe run, 2026-05-01

Adding InitializeConsumer("AlarmProbe.Tests") before RegisterConsumer made \Galaxy! appear in GetProviders (count=1, status 0 → 100 within 500ms). So #2 and #3 above are NOT the cause — the consumer can see the alarm provider once it calls Initialize. That's a missing API-call ordering, not a permission or service issue.

InitializeConsumer -> 0
RegisterConsumer -> 0
GetProviders [after Register] -> rc=0 count=0 list=[]
Subscribe('\Galaxy!') -> 0
GetProviders [after Subscribe] -> rc=0 count=1 list=[  0 \Galaxy!]
GetProviders [poll #1] -> rc=0 count=1 list=[100 \Galaxy!]

Despite the provider being visible at "100% query complete" for the entire 60s window, GetStatistics continued to report total=0 active=0 codes=[7] — no alarm transitions reached the consumer even with a System Platform script flipping the test boolean every 10s during the run.

That isolates the remaining unknown to whether the test bool's alarm extension is actually generating MxAccess alarm-provider events when its value flips. The probe has confirmed every link in the consumer chain works (Initialize → Register → Subscribe → provider visible at 100%) — what's missing is alarm traffic from the producer side. ObjectViewer or another live consumer running alongside the script is the next discriminator: does it visibly see the alarm fire?

API-ordering finding: InitializeConsumer MUST precede RegisterConsumer (or at least, must be called before GetProviders returns anything). PR A.5's AlarmClientConsumer omits InitializeConsumer entirely — that's a bug fix to apply even before A.2 lands, since without it the provider chain never becomes visible.

Subscribe-parameter sweep — fifth probe run, 2026-05-01

Even with InitializeConsumer + provider visible at status 100, no alarm transitions arrived during a 60s window with the user's script flipping the test bool every 10s. Tried:

  • qtSummary and qtHistory (the only eQueryType values).
  • Priority 1..999 and 0..32767.
  • eAlarmFilterState.asNone and asAlarmActiveNow for both FilterMask and FilterSpecification.

eAlarmFilterState is single-state-valued (asNone=0, asAlarmActiveNow=1, asAlarmAcked=2, asShelved=3), not flag bits. None of these knobs surfaced any alarm activity.
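
The single-valued reading of eAlarmFilterState can be illustrated with a short sketch. It is Python for illustration only: the member names and values are the ones listed above, while the enum class and the describe helper are hypothetical and not part of the AVEVA API.

```python
from enum import IntEnum


class AlarmFilterState(IntEnum):
    """Illustrative mirror of eAlarmFilterState (values as observed above)."""
    asNone = 0
    asAlarmActiveNow = 1
    asAlarmAcked = 2
    asShelved = 3


def describe(raw: int) -> str:
    """Interpret a raw value as exactly one state, never as OR-ed bits."""
    return AlarmFilterState(raw).name


# Under a flag-bit reading, 3 would decode as asAlarmActiveNow | asAlarmAcked.
# Under the single-state reading it names its own member.
print(describe(3))
```

This is why combining states with a bitwise OR in FilterMask or FilterSpecification would select a different single state rather than widen the filter.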

User confirmation 2026-05-01: the test bool does have a BoolAlarm extension on it; in aaObjectViewer the $Alarm.InAlarm sub-attribute flips true/false in lockstep with the script's writes. So the alarm extension is evaluating its condition, just not visibly producing transitions on the aaAlarmManagedClient consumer stream.

Multi-channel + multi-subscription probe — sixth run, 2026-05-01

Extended the probe to try every consumer-side approach in parallel:

  • Subscription expressions (sequential): \Galaxy!, \Galaxy!*, \\Galaxy!, \Galaxy!TestArea, \\.\Galaxy!. All Subscribe calls returned rc=0; the last one (\\.\Galaxy!) is reflected in GetProviders (count=1).
  • Read channels polled at 500ms cadence: GetStatistics, GetHighPriAlarm, SFCreateSnapshot + SFGetStatistics.
  • Filter+sort: priority 0..32767, qtSummary, state=asAlarmActiveNow, sort=sfReturnNewestFirst.
  • AlarmRecord init (worked around Not a valid Win32 FileTime exception): all DateTime fields pre-set to FILETIME epoch (1601-01-01 UTC) before the call, since default(DateTime) is outside FILETIME range and trips the interop marshaler.
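
The range problem behind the AlarmRecord workaround can be checked with a few lines of arithmetic. The sketch below is Python rather than the worker's .NET code, and to_filetime / from_filetime are hypothetical helper names; it relies only on the FILETIME definition (100 ns ticks since 1601-01-01 UTC).

```python
from datetime import datetime, timedelta, timezone

# FILETIME counts 100-nanosecond ticks since 1601-01-01 00:00:00 UTC.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

# default(DateTime) in .NET is 0001-01-01 00:00:00, well before that epoch.
DOTNET_DEFAULT = datetime(1, 1, 1, tzinfo=timezone.utc)


def to_filetime(moment: datetime) -> int:
    """Return the tick count relative to the FILETIME epoch (may be negative)."""
    return ((moment - FILETIME_EPOCH) // timedelta(microseconds=1)) * 10


def from_filetime(ticks: int) -> datetime:
    """Convert ticks to a UTC datetime, rejecting values before the epoch."""
    if ticks < 0:
        raise ValueError("not a valid FILETIME: %d" % ticks)
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


print(to_filetime(FILETIME_EPOCH))  # 0, the smallest representable value
print(to_filetime(DOTNET_DEFAULT))  # negative, so it cannot be marshaled
```

Pre-setting every DateTime field to the epoch therefore gives the marshaler the smallest value it can represent.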

Result of the 60s run with TestMachine_001.TestAlarm001 being flipped every 10s:

Subscribe('\Galaxy!') -> 0
Subscribe('\Galaxy!*') -> 0
Subscribe('\\Galaxy!') -> 0
Subscribe('\Galaxy!TestArea') -> 0
Subscribe('\\.\Galaxy!') -> 0
GetProviders [after Subscribe-multi] -> count=1 list=[  0 \\.\Galaxy!]
GetStatistics #1: total=0 active=0 changes=1 codes=[7] positions=[] handles=[]
GetHighPriAlarm #1: rc=0 {  }
SF channel #1: SFCreate=0 numAlarms=0 SFStats=0 unackRet=0 unackAlm=0 ackAlm=0 others=0 events=0 idxNewest=-1

No further "(changed)" entries for the entire 60s window. Every read API returned the same empty result on every poll.

User confirms the alarm IS firing — aaObjectViewer sees $Alarm.InAlarm flip in lockstep with the script. Historian records exist (per user — needs verification by querying the historian directly).

Conclusion of consumer-side probing

aaAlarmManagedClient.AlarmClient is not the receive surface AVEVA's alarm pipeline routes to in this Galaxy configuration. The consumer chain is verified end-to-end:

  • InitializeConsumer + RegisterConsumer + Subscribe all succeed (rc=0).
  • GetProviders finds \Galaxy! once Initialize is called.
  • All read APIs (GetStatistics, GetHighPriAlarm, SFCreateSnapshot/SFGetStatistics) return empty even with every documented filter combination.
  • The consumer's hWnd receives zero AVEVA messages between WM_CREATE and WM_DESTROY; AVEVA's traffic goes to its own internal hwnd.

The next investigation directions are not consumer-side:

  1. Inspect aaObjectViewer's alarm SDK to see what library it uses to read alarms. If different from aaAlarmManagedClient, switch the worker over.
  2. Query the historian directly (aahEventStorage / aahEventSvc) to confirm alarms are recorded — and use the same path for v2 alarm capture.
  3. Inspect AVEVA's alarm-routing config for this Galaxy in System Platform IDE — area assignments, alarm provider bindings, "publish alarm events to" settings on the platform.

For A.2 implementation: the aaAlarmManagedClient path the gateway-worker is currently architected around may be a dead-end on customer Galaxies configured this way. If the alarms truly only flow through the historian event-storage path, A.2 needs to consume from aahEventStorage instead — a fundamental architecture pivot.

BREAKTHROUGH — seventh probe run, 2026-05-01

Two changes finally produced a signal:

  1. Subscription scope: \\<MachineName>\Galaxy!<TopArea> is the canonical AlarmClient subscription format (per ArchestrA Alarm Client docs at archestra6.rssing.com/chan-12008125/article13.html): \\Node\Provider!Area!Filter, where Node is the machine name, Provider is literally Galaxy, and Area is a hosted area object. For this rig (\\DESKTOP-6JL3KKO\Galaxy!DEV) the DEV area — the platform's primary area — is the right scope. Earlier \Galaxy!, \Galaxy!TestArea, \\.\Galaxy!, etc., all returned rc=0 but matched no traffic — they were not the canonical form.
  2. InitializeConsumer before RegisterConsumer — already discovered earlier; bug-fix for PR A.5's AlarmClientConsumer.
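
The canonical expression shape can be sketched as a small validating parser. This is Python for illustration; Subscription and parse_subscription are hypothetical names, and the optional trailing filter segment follows the \\Node\Provider!Area!Filter layout quoted above rather than anything verified on the rig.

```python
from typing import NamedTuple, Optional


class Subscription(NamedTuple):
    node: str
    provider: str
    area: str
    filter_expr: Optional[str]


def parse_subscription(expr: str) -> Subscription:
    """Split \\\\Node\\Provider!Area[!Filter] into parts; reject other shapes."""
    if not expr.startswith("\\\\"):
        raise ValueError("missing \\\\Node prefix: %r" % expr)
    node, sep, rest = expr[2:].partition("\\")
    if not node or not sep:
        raise ValueError("missing \\Provider segment: %r" % expr)
    parts = rest.split("!")
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError("expected Provider!Area[!Filter]: %r" % expr)
    filter_expr = parts[2] if len(parts) == 3 else None
    return Subscription(node, parts[0], parts[1], filter_expr)


print(parse_subscription(r"\\DESKTOP-6JL3KKO\Galaxy!DEV"))
```

Rejecting the earlier forms (\Galaxy!, \\Galaxy!, \\.\Galaxy!) up front is a design choice of this sketch: Subscribe itself returned rc=0 for all of them, so a silent mismatch is otherwise easy to miss.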

With both in place, GetHighPriAlarm hit a record on every poll for 60s straight (117/117 calls), but every call threw ArgumentOutOfRangeException: Not a valid Win32 FileTime instead of returning successfully. The AlarmRecord struct contains five DateTime fields (ar_Time, ar_OrigTime, ar_AckTime, ar_RtnTime, ar_SubTime), and AVEVA writes sentinel/invalid FILETIME values for unset ones (e.g., ar_AckTime for an unacknowledged alarm). The .NET interop that AVEVA ships (aaAlarmManagedClient.dll) auto-converts FILETIME→DateTime and rejects out-of-range values.

GetStatistics continues to report total=0 active=0 even with GetHighPriAlarm returning records — those two API surfaces have genuinely different views in AVEVA's data model.

So: alarms flow through aaAlarmManagedClient.AlarmClient once the subscription expression is canonical. The blocking issue is extracting the payload past the .NET interop's DateTime auto-marshaling.

Remaining work to capture alarm payloads

Define a custom COM interop that uses long (FILETIME-as-int64) instead of DateTime for the timestamp fields. Approach options:

  1. Patch the AVEVA-shipped aaAlarmManagedClient.dll — ildasm the assembly, replace DateTime with long on AlarmRecord's timestamp fields, ilasm back. Brittle across AVEVA upgrades.
  2. Write our own [ComImport] interface — declare IRawAlarmConsumer ourselves with safe-blittable types, discover the underlying COM IID (via reflection on AlarmClient's [Guid] attribute), and (IRawAlarmConsumer) alarmClient cast. Cleaner; requires the IID.
  3. Use IDispatch late binding — dispatch-Invoke bypasses strong-typed marshaling. Verbose but doesn't need IIDs.

For PR A.2's worker integration, option 2 is the least disruptive. Once the interop is custom, AlarmClient.Subscribe + GetHighPriAlarm + GetAlarmExtendedRec form a viable polling-style alarm consumer.

REVISED 2026-05-01: option 2 is not directly applicable. Reflection on aaAlarmManagedClient.AlarmClient shows it implements only IDisposable (no [ComImport] interface, no class GUID). It has a single field, CwwAlarmConsumer* m_almUnmanaged, meaning AlarmClient is a C++/CLI managed wrapper around a native C++ class, NOT a COM-interop class. The DateTime conversion happens inside the AVEVA wrapper's IL, not at a .NET-to-COM marshaling boundary. There is no separate COM interface IID we can QI to.

Revised approach options:

A. Switch to wnwrapConsumer.dll: a separate standalone COM library AVEVA ships at C:\Program Files (x86)\Common Files\ArchestrA\wnwrapConsumer.dll exposing WNWRAPCONSUMERLib.wwAlarmConsumerClass with SetXmlAlarmQuery / GetXmlCurrentAlarms. XML-string output bypasses FILETIME marshaling entirely.

B. Patch aaAlarmManagedClient.dll IL: wrap the unsafe DateTime.FromFileTime calls with a safe variant. Direct fix but modifies a vendor binary.

C. Reflect into m_almUnmanaged and call the native vtable: get the IntPtr, walk the MSVC C++ vtable, call __thiscall methods via Marshal.GetDelegateForFunctionPointer. Doable but requires reverse-engineering the C++ class layout.

Option A is the best fit: real COM-based, self-contained in our code, conventional production-grade approach (the WIN-911 consumer pattern referenced in AVEVA support forums uses it).

The polling-vs-WM_APP-callback question from earlier is now moot: GetStatistics's positions[]/handles[] arrays remained empty even when alarms were demonstrably present. The active read API for current alarms is GetHighPriAlarm, not GetStatistics's change array.

Implications for A.2 implementation

The A.2 PR's value is unmeasurable until at least one alarm provider is visible. The choice between polling-via-GetStatistics and the callback path can only be decided by observing what populates first when a real alarm fires. Without a provider, both paths return the same "nothing happening" answer.

Until that's resolved, A.2 implementation work is genuinely blocked on a dev-rig configuration issue — not on architectural choice or code structure.

GetStatistics polling — second probe run, 2026-05-01

Extended the probe to call GetStatistics every ~2s alongside the WM logger. Key findings:

  • GetStatistics is safely callable from the same thread that did RegisterConsumer + Subscribe. Every poll returned rc=0 with no exceptions over 9 polls / 20s window.
  • The deployed Galaxy currently has zero active alarms. Every poll reported total=0 active=0 suppressed=0 newAlarms=0. The positions[] and handles[] arrays were empty.
  • changes=1 codes=[7] was constant across all polls, matching the constant 1 Hz WM 0xC275 cadence. Code 7 is consistent with a "heartbeat / subscription healthy" sentinel — same semantics as the WM but reported through the pull-side API.
  • percent=100 (query-complete percentage) was constant — the subscription is steady-state.

This confirms the polling design (option 1 in the previous section) is mechanically viable. The remaining open question is whether GetStatistics populates positions[] / handles[] with real entries when an alarm transition actually fires — proving that requires firing an alarm.

Open follow-up probes

Each can be added to AlarmClientWmProbeTests as a separate Skip-gated test:

  1. Fire a real Galaxy alarm during the pump window. The cleanest programmatic trigger is an MxAccess write that flips a $Alarm-extended boolean to true (alarm in) and back to false (alarm out). Pinning the exact tag reference is pending — needs either a documented test-fixture tag or an interactive selection in System Platform IDE. Once the trigger fires, this resolves whether AVEVA's pulled change set arrives via GetStatistics positions[] / handles[] (per-change polling works) or only via the AVEVA-internal window (callback path needed).
  2. Hook AVEVA's internal window to log what WMs it actually processes — only relevant if probe 1 shows GetStatistics does NOT report per-change activity.
  3. Decompile aaAlarmManagedClient.dll's IL for the RegisterConsumer method to find what RegisterWindowMessage string is used and whether there's a callback-registration surface on WNAL_Register that the managed client wraps. The alarmlst.dll strings (WNAL_CallBack, "Invalid callbacks" error) suggest the underlying C API takes callbacks, but the managed wrapper exposes none of them.

PR A.5's Subscribe / AcknowledgeByGuid / SnapshotActiveAlarms are correct — they're pull-style and don't depend on the notification mechanism.

Option A — captured, 2026-05-01

wnwrapConsumer.dll (C:\Program Files (x86)\Common Files\ArchestrA\wnwrapConsumer.dll) hosts the standalone COM class WNWRAPCONSUMERLib.wwAlarmConsumerClass. Type library imports cleanly via tlbimp (output stored under mxaccessgw/lib/Interop.WNWRAPCONSUMERLib.dll). The COM class is registered in HKLM:\SOFTWARE\WOW6432Node\Classes\CLSID\{7AB52E5F-36B2-4A30-AE46-952A746F667C} with ThreadingModel=Apartment; new wwAlarmConsumerClass() succeeds via CoCreateInstance.

The probe ZB.MOM.WW.MxGateway.Worker.Tests/WnWrapConsumerProbeTests.cs (Skip-gated, archival) drove the captured run. Lifecycle:

  1. new wwAlarmConsumerClass() — instantiated.
  2. InitializeConsumer("MxGatewayProbe.WnWrap") -> 0.
  3. RegisterConsumer(hWnd: 0, productName, applicationName, version) -> 0. Note: wnwrap's RegisterConsumer is 4-arg (no bRetainHiddenAlarms); aaAlarmManagedClient's is 5-arg. Different surface.
  4. Subscribe(@"\\<machine>\Galaxy!DEV", priLow=1, priHigh=999, qtSummary, sfReturnNewestFirst, asAlarmActiveNow, asAlarmActiveNow) -> 0. Same canonical scope that worked for aaAlarmManagedClient.
  5. SetXmlAlarmQuery(...) was called too but the round-trip GetXmlAlarmQuery returned a mangled echo (NODE became DESKTOP-6JL3KKO\Galaxy!DEV, PROVIDER became Galaxy!DEV, ALARM_STATE shortened to All, DISPLAY_MODE truncated to Sum). The XML-query path looks broken in this build; rely on Subscribe for the filter and skip SetXmlAlarmQuery in production. Confirming "Subscribe alone is sufficient" is one follow-up probe (call Subscribe and read XML, no SetXmlAlarmQuery) — out of scope for the breakthrough run but easy to verify.

Captured XML (60 polls over 30s, 500ms cadence)

GetXmlCurrentAlarms2(maxAlmCnt: 100, out vartCurrentXmlAlarms) returned BSTR XML cleanly on every call — 60/60 ok, zero throws. GetXmlCurrentAlarms (the v1 method) returned identical content on the same cadence; either method is viable.

Empty state:

<?xml version="1.0"?><ALARM_RECORDS COUNT="0"></ALARM_RECORDS>

With alarm active (UNACK_ALM, value=true after the flip script set the bool true):

<?xml version="1.0"?>
<ALARM_RECORDS COUNT="1">
  <ALARM>
    <GUID>BCC4705395424D65BDAABCDEA6A32A73</GUID>
    <DATE>2026/5/1</DATE>
    <TIME>13:26:14.709</TIME>
    <GMTOFFSET>240</GMTOFFSET>
    <DSTADJUST>0</DSTADJUST>
    <PROVIDER_NODE>DESKTOP-6JL3KKO</PROVIDER_NODE>
    <PROVIDER_NAME>Galaxy</PROVIDER_NAME>
    <GROUP>TestArea</GROUP>
    <TAGNAME>TestMachine_001.TestAlarm001</TAGNAME>
    <TYPE>DSC</TYPE>
    <VALUE>true</VALUE>
    <LIMIT>true</LIMIT>
    <PRIORITY>500</PRIORITY>
    <STATE>UNACK_ALM</STATE>
    <OPERATOR_NODE></OPERATOR_NODE>
    <OPERATOR_NAME></OPERATOR_NAME>
    <ALARM_COMMENT>Test alarm #1</ALARM_COMMENT>
  </ALARM>
</ALARM_RECORDS>

After the script set the bool false (UNACK_RTN, value=false):

<?xml version="1.0"?>
<ALARM_RECORDS COUNT="1">
  <ALARM>
    <GUID>BCC4705395424D65BDAABCDEA6A32A73</GUID>
    <DATE>2026/5/1</DATE>
    <TIME>13:26:24.710</TIME>
    ...
    <VALUE>false</VALUE>
    <STATE>UNACK_RTN</STATE>
    ...
  </ALARM>
</ALARM_RECORDS>

The 10s cadence between transitions matches the System Platform script's flip frequency exactly. GUID is stable across the in→out cycle (BCC4705… carried through both states), so the XML stream represents the alarm record's lifecycle, not separate event records: this is a "current alarms snapshot," not a "transition stream." For an OPC UA AlarmConditionService adapter this is fine, because deriving condition-state changes from successive snapshots is the supported model.
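
Reading that payload as a snapshot keyed by GUID can be sketched as follows. This is a Python illustration, not the worker's parser; parse_alarm_records is a hypothetical name, the element names are the ones in the captured XML, and the sample strings are trimmed to a few of the fields shown above.

```python
import xml.etree.ElementTree as ET
from typing import Dict

EMPTY = '<?xml version="1.0"?><ALARM_RECORDS COUNT="0"></ALARM_RECORDS>'
ACTIVE = (
    '<?xml version="1.0"?><ALARM_RECORDS COUNT="1"><ALARM>'
    "<GUID>BCC4705395424D65BDAABCDEA6A32A73</GUID>"
    "<TAGNAME>TestMachine_001.TestAlarm001</TAGNAME>"
    "<VALUE>true</VALUE><STATE>UNACK_ALM</STATE>"
    "<OPERATOR_NODE></OPERATOR_NODE>"
    "</ALARM></ALARM_RECORDS>"
)


def parse_alarm_records(xml_text: str) -> Dict[str, Dict[str, str]]:
    """Return {GUID: {element name: text}} for every ALARM in the payload."""
    root = ET.fromstring(xml_text.strip())
    snapshot: Dict[str, Dict[str, str]] = {}
    for alarm in root.findall("ALARM"):
        # Empty elements such as <OPERATOR_NODE></OPERATOR_NODE> become "".
        fields = {child.tag: (child.text or "") for child in alarm}
        snapshot[fields["GUID"]] = fields
    return snapshot


snapshot = parse_alarm_records(ACTIVE)
print(snapshot["BCC4705395424D65BDAABCDEA6A32A73"]["STATE"])
```

Because the GUID is stable across the in→out cycle, the same key carries UNACK_ALM in one poll and UNACK_RTN in a later one.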

STATE enum values observed: UNACK_RTN (the alarm has returned to normal but is unacknowledged — i.e., visible in the "current alarms" list because operator hasn't acked it yet) and UNACK_ALM (the alarm is currently active and unacknowledged). The other states from eAlmState (ACK_RTN, ACK_ALM) would appear when an ack is performed.

Forward reference / superseded: an earlier draft named wwAlarmConsumerClass.AlarmAckByGUID as the ack method. That call turned out to be E_NOTIMPL on this AVEVA build (see AlarmAckByGUID is not implemented below). The as-built ack path is the v1 6-arg AlarmAckByName on a dedicated ack-only consumer instance. Do not wire acks through AlarmAckByGUID.

GetStatistics AV — unrelated quirk

Every GetStatistics call threw AccessViolationException in the probe. Cause: the wnwrap interop signature uses IntPtr for the three array out-parameters (pChangeCode, pChangePos, phAlarm); passing IntPtr.Zero is wrong because the COM implementation writes into the buffer pointer without null-checking. Pre-allocate three int arrays and pass pinned pointers (or use Marshal.AllocCoTaskMem) to fix. Not required for the production path, since the XML methods give us everything we need.

Implications for PR A.2 worker integration

Replacing aaAlarmManagedClient.AlarmClient with WNWRAPCONSUMERLib.wwAlarmConsumerClass in the worker's alarm-consumer surface unblocks A.2 fully. Outline:

  1. Reference path: drop aaAlarmManagedClient.dll reference from ZB.MOM.WW.MxGateway.Worker.csproj; add Interop.WNWRAPCONSUMERLib.dll reference from mxaccessgw/lib/. (Or commit the interop dll in-tree under lib/ and reference relatively.)
  2. AlarmClientConsumer → WnWrapAlarmConsumer: rewrite the consumer wrapper to:
    • new wwAlarmConsumerClass() on the worker's STA thread.
    • InitializeConsumer(applicationName) then RegisterConsumer(hWnd: 0, …).
    • Subscribe(@"\\<node>\Galaxy!<area>", …) per configured area. The <node> and <area> are configurable (default Environment.MachineName + the platform's primary area).
    • Poll GetXmlCurrentAlarms2(maxAlmCnt, out xml) on a timer (500ms-1s cadence is comfortable). Parse XML payload; diff against the previous snapshot (keyed by GUID); emit MX_EVENT_FAMILY_ON_ALARM_TRANSITION events for added/changed/removed records.
    • Client-driven acknowledgements. (This draft named AlarmAckByGUID and an AlarmAckCommand payload; as built the ack proto is AcknowledgeAlarmCommand / AcknowledgeAlarmByNameCommand, the consumer interface method is AcknowledgeByGuid / AcknowledgeByName, and the GUID path is E_NOTIMPL so only the by-name path runs; see AlarmAckByGUID is not implemented.)
    • Lifecycle teardown: DeregisterConsumer + UninitializeConsumer + Marshal.FinalReleaseComObject.
  3. Conversion layer: map XML record fields to the alarm proto:
    • GUID and PROVIDER_NAME!GROUP.TAGNAME → alarm_full_reference (there is no condition_id field; the public RPC and worker carry the reference as alarm_full_reference, either a canonical GUID or Provider!Group.Tag).
    • STATE → AlarmConditionState on ActiveAlarmSnapshot.current_state (this draft used inAlarm + acked booleans, which the proto does not have). As built, the snapshot state collapses to three values: UNACK_ALM → Active; ACK_ALM → ActiveAcked; UNACK_RTN and ACK_RTN both → Inactive (a returned-to-normal alarm is no longer "active"). For the live transition feed the STATE instead drives an AlarmTransitionKind (Raise / Acknowledge / Clear).
    • DATE + TIME + GMTOFFSET + DSTADJUST → reassemble UTC timestamp; matches the worker's existing Timestamp wire format.
    • PRIORITY → severity (already 1-1000-ish range).
    • TAGNAME → reference; PROVIDER_NAME + GROUP for scope metadata.
  4. PR A.5 fix carry-over: InitializeConsumer MUST be called before RegisterConsumer (rediscovered with aaAlarmManagedClient, also true here). The existing AlarmClientConsumer skips Initialize entirely; the new WnWrapAlarmConsumer includes it from day one.
  5. Test reuse: the snapshot/ack contract tests stayed — they don't touch the underlying COM API. As built, the alarm tests live under src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/ (AlarmDispatcherTests, AlarmRecordTransitionMapperTests, AlarmCommandHandlerTests, AlarmCommandExecutorTests, WnWrapAlarmConsumerXmlTests), with the live-AVEVA-only round-trip in src/ZB.MOM.WW.MxGateway.Worker.Tests/Probes/AlarmsLiveSmokeTests.cs (Skip-gated like the probe).
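
The poll-and-diff step and the STATE collapse from the outline can be sketched together. Python is used for illustration only; diff_snapshots and CONDITION_STATE are hypothetical names, snapshots are plain dicts keyed by GUID, and the added / changed / removed labels are placeholders rather than the proto's AlarmTransitionKind values.

```python
from typing import Dict, List, Tuple

Snapshot = Dict[str, Dict[str, str]]

# STATE → snapshot condition state, as described in the conversion layer.
CONDITION_STATE = {
    "UNACK_ALM": "Active",
    "ACK_ALM": "ActiveAcked",
    "UNACK_RTN": "Inactive",
    "ACK_RTN": "Inactive",
}


def diff_snapshots(previous: Snapshot, current: Snapshot) -> List[Tuple[str, str]]:
    """Compare two GUID-keyed snapshots and list (change, guid) pairs."""
    changes: List[Tuple[str, str]] = []
    for guid, record in current.items():
        old = previous.get(guid)
        if old is None:
            changes.append(("added", guid))
        elif old != record:
            changes.append(("changed", guid))
    for guid in previous:
        if guid not in current:
            changes.append(("removed", guid))
    return changes


GUID = "BCC4705395424D65BDAABCDEA6A32A73"
poll_1: Snapshot = {}
poll_2: Snapshot = {GUID: {"STATE": "UNACK_ALM", "VALUE": "true"}}
poll_3: Snapshot = {GUID: {"STATE": "UNACK_RTN", "VALUE": "false"}}

for before, after in ((poll_1, poll_2), (poll_2, poll_3), (poll_3, poll_1)):
    for change, guid in diff_snapshots(before, after):
        state = after.get(guid, {}).get("STATE")
        print(change, guid, CONDITION_STATE.get(state))
```

Comparing whole records keeps the sketch short; a real consumer would likely compare only the fields that define a transition so that cosmetic field changes do not emit events.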

Settled API-ordering and surface knowledge

  • InitializeConsumer first, then RegisterConsumer — both on aaAlarmManagedClient.AlarmClient and wwAlarmConsumerClass.
  • RegisterConsumer arity differs: aaAlarmManagedClient.AlarmClient.RegisterConsumer(hWnd, product, app, version, bRetainHiddenAlarms) — 5 args; wwAlarmConsumerClass.RegisterConsumer(hWnd, product, app, version) — 4 args. The wnwrap class has no bRetainHiddenAlarms parameter at all.
  • Subscription expression format: \\<machine>\Galaxy!<area> (literal Galaxy provider) for both libraries.
  • Native ack: AlarmAckByGUID(VBGUID guid, comment, oprName, node, domain, fullName) on the v2 surface; ID 5-arg variant on the legacy IwwAlarmConsumer interface.

These findings retire the open follow-up probes from the "polling-vs-pump" debate above — wwAlarmConsumerClass plus poll-on-timer is the implementation.

Live smoke-test discoveries — 2026-05-01

The Skip-gated AlarmsLiveSmokeTests.Alarms_full_pipeline_round_trip ran the full WnWrapAlarmConsumer + AlarmDispatcher + MxAccessAlarmEventSink pipeline against the dev rig with the flip script running. End-to-end verified: 6 real transitions captured on the 10s cadence, ack-by-name returned rc=0, pipeline stayed healthy through 5 more transitions afterwards. Four production-relevant quirks surfaced; the first three were fixed in the consumer, and the fourth shaped the ack routing:

1. SetXmlAlarmQuery is mandatory for reads despite the mangled echo

Without SetXmlAlarmQuery, the first GetXmlCurrentAlarms2 call fails with E_FAIL (HRESULT 0x80004005). The discovery doc above flagged the round-trip echo as mangled and recommended skipping the call — that recommendation is wrong. The echo is mangled (AVEVA parses NODE/PROVIDER/ALARM_STATE/DISPLAY_MODE incorrectly), but the call itself is required as some kind of subscription enabler. Even the Subscribe call setting the actual filter doesn't avoid the need for SetXmlAlarmQuery.

WnWrapAlarmConsumer.ComposeXmlAlarmQuery(subscription) decomposes the canonical \\<machine>\Galaxy!<area> form into the XML's NODE/PROVIDER/GROUP fields. Mangled or not, the call enables reads.
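The decomposition step can be illustrated with a minimal, language-neutral sketch in Python. It only shows how the canonical expression splits into the three parts that feed NODE, PROVIDER, and GROUP. The function and type names are hypothetical, the validation rules are assumptions, and the XML layout that ComposeXmlAlarmQuery actually produces is not reproduced here.

```python
from typing import NamedTuple


class SubscriptionParts(NamedTuple):
    node: str      # machine name, feeds the NODE field
    provider: str  # "Galaxy" in the documented form, feeds PROVIDER
    group: str     # area name, feeds GROUP


def split_subscription(expression: str) -> SubscriptionParts:
    # Hypothetical helper: split a <machine>/<provider>/<area> expression.
    if not expression.startswith("\\\\"):
        raise ValueError("expected a leading double backslash and machine name")
    # Drop the two leading backslashes, then split machine from the rest.
    node, sep, rest = expression[2:].partition("\\")
    # The first '!' separates the provider from the area.
    provider, bang, group = rest.partition("!")
    if not (node and sep and provider and bang and group):
        raise ValueError("expression is missing machine, provider, or area")
    return SubscriptionParts(node, provider, group)
```

For example, `split_subscription(r"\\HOST01\Galaxy!TestArea")` yields node `HOST01`, provider `Galaxy`, and group `TestArea` (the host name is made up for the example).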

2. Two consumers required: read-side vs. ack-side

SetXmlAlarmQuery enables reads but breaks AlarmAckByName on the same consumer instance. With SetXmlAlarmQuery applied, AlarmAckByName returns -55 even with a valid name, provider, group, and operator. Without it, AlarmAckByName succeeds with rc=0.

The production consumer therefore provisions two wnwrap COM instances:

  • Primary consumer (client): runs full lifecycle including SetXmlAlarmQuery for GetXmlCurrentAlarms2 polls.
  • Ack-only consumer (ackClient): runs Initialize → Register → Subscribe via the v1-prefixed methods, no SetXmlAlarmQuery. All AcknowledgeByName calls dispatch through this instance.

Both consumers subscribe to the same expression. Disposal cleans up both via a shared ReleaseConsumerCom helper.

3. AlarmAckByName v2 8-arg vs. v1 6-arg

wwAlarmConsumerClass exposes two AlarmAckByName overloads:

  • IwwAlarmConsumer2 v2: 8 args (name, provider, group, comment, oprName, node, domainName, oprFullName).
  • IwwAlarmConsumer v1: 6 args (no domain, no full-name).

The v2 8-arg method returns -55 on this AVEVA build regardless of operator-identity inputs, which suggests it is a stub. The v1 6-arg method works. Production WnWrapAlarmConsumer.AcknowledgeByName calls the 6-arg overload and discards the proto's domain + full_name fields. The proto contract keeps the 8 fields for forward compatibility in case AVEVA fixes the v2 method later.

4. AlarmAckByGUID is not implemented

The v2 AlarmAckByGUID(VBGUID, …) throws NotImplementedException (COM E_NOTIMPL) on wwAlarmConsumerClass against this AVEVA build. The reference→GUID lookup that we initially planned to wire through AlarmAckByGUID is therefore not viable on wnwrap; only the by-name path actually succeeds.

Routing as built (and the GUID hazard). The gateway-side router is GatewayAlarmMonitor.BuildAcknowledgeCommand (there is no WorkerAlarmRpcDispatcher type). Routing is conditional on the reference shape, not unconditional:

  • A reference that Guid.TryParse accepts is built into MxCommandKind.AcknowledgeAlarm / AcknowledgeAlarmCommand — the GUID path, which the worker dispatches to AlarmAckByGUID.
  • A Provider!Group.Tag reference (parsed by GatewayAlarmMonitor.TryParseAlarmReference) is built into MxCommandKind.AcknowledgeAlarmByName / AcknowledgeAlarmByNameCommand — the by-name path, which is the only one that succeeds on this build.
  • Anything else fails with an alarm_full_reference parse error before any worker call.

The GUID arm is still dispatched unguarded: the proto AcknowledgeAlarmCommand and the worker's MxAccessCommandExecutor.ExecuteAcknowledgeAlarm switch arm remain in the codebase for forward compatibility, and BuildAcknowledgeCommand routes a GUID-shaped reference straight to them. On the deployed wnwrap build that path hits the E_NOTIMPL AlarmAckByGUID and surfaces a COMException rather than acknowledging. Practical guidance: acknowledge with the Provider!Group.Tag reference (the same form the transition feed emits in alarm_full_reference), not a raw GUID, until the GUID arm is either guarded or AVEVA implements AlarmAckByGUID.

5. STA / threading

The wnwrap COM is ThreadingModel=Apartment, so every consumer call (Subscribe, PollOnce, the AcknowledgeBy* methods) must run on the STA that created the COM instance. As built, WnWrapAlarmConsumer owns no internal timer and takes no pollIntervalMilliseconds parameter — an earlier draft described a self-driven Timer that would have blocked on cross-apartment marshaling, but that design was dropped. Instead PollOnce() is a public, idempotent method the host drives on the worker's STA (via StaRuntime.InvokeAsync(() => consumer.PollOnce())); the poll cadence lives in the host, not the consumer. Each PollOnce reads GetXmlCurrentAlarms2, diffs against the previous snapshot, and emits transition events.
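The thread-affinity rule can be illustrated by analogy. Python has no COM apartments, so the sketch below uses a single-worker executor as a stand-in for the worker's STA and for StaRuntime.InvokeAsync. FakeConsumer, poll_once, and host_poll are hypothetical names, and the owner-thread check is an assumption added to make the constraint visible. It is not how wnwrap enforces its apartment.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeConsumer:
    """Hypothetical thread-affine consumer; not the wnwrap API."""

    def __init__(self) -> None:
        # Remember the thread that created the instance.
        self.owner_thread = threading.get_ident()
        self.polls = 0

    def poll_once(self) -> int:
        # Reject calls that arrive on any other thread.
        if threading.get_ident() != self.owner_thread:
            raise RuntimeError("poll_once called off the owning thread")
        self.polls += 1
        return self.polls


# A one-worker executor runs every submitted call on the same thread.
sta = ThreadPoolExecutor(max_workers=1)

# Create the consumer on the owning thread, as the COM instance must be.
consumer = sta.submit(FakeConsumer).result()


def host_poll() -> int:
    # The host owns the cadence and marshals each poll to the owning thread.
    return sta.submit(consumer.poll_once).result()
```

The design point mirrored here is that the consumer has no timer of its own: whoever calls `host_poll` decides the cadence, and every call lands on the one thread that created the consumer.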

Capture summary

Transition: kind=Clear  ref='Galaxy!TestArea.TestMachine_001.TestAlarm001' …
Transition: kind=Raise  ref='Galaxy!TestArea.TestMachine_001.TestAlarm001' …
SnapshotActiveAlarms count=1
  active: ref='Galaxy!TestArea.TestMachine_001.TestAlarm001' state=Active
AcknowledgeByName(real identity) -> rc=0
Post-ack transition: kind=Clear …
+1: kind=Raise … (10s after ack)
+2: kind=Clear … (20s)
+3: kind=Raise … (30s)
+4: kind=Clear … (40s)

10s cadence held throughout; full proto fields populated correctly; ack registered server-side without errors.

Current alarm path (as built)

The sections above are a discovery record. This section summarizes the path that actually ships, grounded in the current code. For the proto shapes see Contracts; for the server handlers see gRPC; for configuration see Gateway Configuration.

Public RPCs and configuration

Alarms are exposed through three session-less RPCs on MxAccessGateway: AcknowledgeAlarm, StreamAlarms, and QueryActiveAlarms. No client opens a worker session to use them. They are gated by MxGateway:Alarms:*:

  • MxGateway:Alarms:Enabled (default false) turns the whole subsystem on.
  • MxGateway:Alarms:SubscriptionExpression is the canonical \\<machine>\Galaxy!<area> subscription; when empty, the monitor falls back to \\<MachineName>\Galaxy!<DefaultArea> from MxGateway:Alarms:DefaultArea. If Enabled is true and both SubscriptionExpression and DefaultArea are empty, the monitor faults with a configuration diagnostic.
  • MxGateway:Alarms:ReconcileIntervalSeconds (default 30, floored at 5) sets the reconcile cadence below.
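The fallback and floor rules in the list above can be sketched as follows. This is an illustration in Python, not the gateway's C# code. The function names are hypothetical, "empty" is treated as the empty string, and returning None stands in for the configuration fault.

```python
from typing import Optional

MIN_RECONCILE_SECONDS = 5  # floor stated in the option description


def resolve_subscription(expression: str, default_area: str,
                         machine_name: str) -> Optional[str]:
    # An explicit SubscriptionExpression always wins.
    if expression:
        return expression
    # Otherwise build the canonical form from the machine name and DefaultArea.
    if default_area:
        return "\\\\" + machine_name + "\\Galaxy!" + default_area
    # Both empty: the caller should treat this as a configuration fault.
    return None


def effective_reconcile_seconds(configured: int) -> int:
    # Values below the floor are raised to it.
    return max(MIN_RECONCILE_SECONDS, configured)
```

With an empty expression, `resolve_subscription("", "TestArea", "HOST01")` builds the canonical form from the fallback values, and `effective_reconcile_seconds(1)` is raised to 5.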

The always-on GatewayAlarmMonitor broker

GatewayAlarmMonitor (src/ZB.MOM.WW.MxGateway.Server/Alarms/GatewayAlarmMonitor.cs) is registered by AddGatewayAlarms as a singleton, as the IGatewayAlarmService, and as a hosted BackgroundService. When Enabled, it:

  1. Opens one gateway-managed worker session dedicated to alarms (client name gateway-alarm-monitor, backend Galaxy), after a brief startup grace so worker launching and orphan cleanup settle.
  2. Subscribes that session to the resolved subscription expression and feeds an in-process active-alarm cache (Dictionary<reference, ActiveAlarmSnapshot>) from the session's transition events.
  3. Fans the feed out to any number of StreamAlarms subscribers — clients never open their own session. The session is transparently re-opened with a 5-second backoff if the worker faults.

AlarmFeedMessage stream protocol

StreamAsync (behind StreamAlarms) emits, in order:

  1. one AlarmFeedMessage { active_alarm } per currently-cached alarm matching the optional alarm_filter_prefix,
  2. a single AlarmFeedMessage { snapshot_complete = true } sentinel,
  3. then one AlarmFeedMessage { transition } per live change.

The subscriber is registered under the monitor lock before the snapshot is taken, so no transition can slip between the snapshot and the live tail. QueryActiveAlarms reuses the same cache but emits only the active_alarm snapshots and completes — no sentinel, no transitions.
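The ordering guarantee can be sketched with a toy broker in Python. Everything here is a simplified assumption: messages are plain tuples rather than AlarmFeedMessage, the cache update rule is reduced to add-on-Raise and remove-on-Clear, and the sketch does not filter live transitions by prefix (whether the real feed does is not covered here). The point is only that registration and snapshot happen under one lock.

```python
import threading
from queue import Queue
from typing import Dict, List


class Broker:
    """Toy broker; hypothetical, not GatewayAlarmMonitor."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._cache: Dict[str, str] = {}
        self._subscribers: List[Queue] = []

    def apply_transition(self, kind: str, reference: str) -> None:
        with self._lock:
            # Simplified cache rule, assumed for the sketch.
            if kind == "Clear":
                self._cache.pop(reference, None)
            else:
                self._cache[reference] = kind
            for q in self._subscribers:
                q.put(("transition", (kind, reference)))

    def subscribe(self, prefix: str = "") -> Queue:
        q: Queue = Queue()
        with self._lock:
            # Register first, so later transitions reach this queue...
            self._subscribers.append(q)
            # ...then emit the snapshot and the sentinel while still locked.
            for reference in self._cache:
                if reference.startswith(prefix):
                    q.put(("active_alarm", reference))
            q.put(("snapshot_complete", True))
        return q
```

Because `apply_transition` takes the same lock, a transition is either already reflected in the snapshot or queued after the sentinel, never lost between the two.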

Reconcile loop

A PeriodicTimer runs ReconcileAsync every max(5, ReconcileIntervalSeconds) seconds. It pulls the worker's authoritative active-alarm snapshot and replaces the cache, broadcasting a synthetic Clear transition for any cached alarm the snapshot no longer contains and a synthetic Raise for any alarm the snapshot adds. This catches transitions the live poll-and-diff feed missed (e.g. across a transport blip). A failed reconcile pass logs at Debug and keeps the current cache.
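A minimal sketch of one reconcile pass, written in Python for illustration. The function name and the dict-of-strings cache are assumptions; the real cache holds ActiveAlarmSnapshot values. Only the two synthetic transitions described above are produced, so a state change on an alarm present in both maps is not reported by this sketch.

```python
from typing import Dict, List, Tuple


def reconcile(cache: Dict[str, str],
              snapshot: Dict[str, str]) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    transitions: List[Tuple[str, str]] = []
    # Cached alarms missing from the authoritative snapshot get a synthetic Clear.
    for reference in cache:
        if reference not in snapshot:
            transitions.append(("Clear", reference))
    # Alarms the snapshot adds get a synthetic Raise.
    for reference in snapshot:
        if reference not in cache:
            transitions.append(("Raise", reference))
    # The snapshot replaces the cache wholesale.
    return dict(snapshot), transitions
```

A failed pass would simply skip the call and keep the existing cache, which matches the log-and-continue behaviour described above.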

Subscriber backpressure

Each subscriber gets a bounded channel of 2048 messages (SubscriberQueueCapacity). When Broadcast cannot write to a subscriber (its channel is full), that subscriber is completed with an error and dropped — the error message tells the client to reconnect to re-snapshot. Backpressure from one slow consumer never blocks the broker or other subscribers.
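The drop-on-full policy can be sketched in Python with a bounded queue standing in for the bounded channel. The Subscriber class, the error text, and the dict of subscribers are hypothetical; only the capacity of 2048 and the drop behaviour come from the description above.

```python
import queue
from typing import Dict, Optional

SUBSCRIBER_QUEUE_CAPACITY = 2048


class Subscriber:
    """Hypothetical subscriber record with a bounded queue."""

    def __init__(self, capacity: int = SUBSCRIBER_QUEUE_CAPACITY) -> None:
        self.channel: queue.Queue = queue.Queue(maxsize=capacity)
        self.error: Optional[str] = None


def broadcast(subscribers: Dict[int, Subscriber], message: object) -> None:
    # Iterate over a copy so slow subscribers can be removed while looping.
    for sub_id, sub in list(subscribers.items()):
        try:
            # Non-blocking write: the broker never waits on a consumer.
            sub.channel.put_nowait(message)
        except queue.Full:
            # Complete the slow subscriber with an error and drop it.
            sub.error = "queue full; reconnect to re-snapshot"
            del subscribers[sub_id]
```

Using a non-blocking write is what keeps one slow consumer from stalling the others: the cost of falling behind is paid by that consumer alone, as a reconnect.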

Snapshot state collapse

ActiveAlarmSnapshot.current_state carries only three AlarmConditionState values, so the four AVEVA STATEs collapse: UNACK_ALM → Active, ACK_ALM → ActiveAcked, and both UNACK_RTN and ACK_RTN → Inactive (AlarmDispatcher). A returned-to-normal alarm is reported as Inactive in a snapshot even though it is still listed because it is unacknowledged. The live transition feed instead reports AlarmTransitionKind (Raise / Acknowledge / Clear).
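The collapse amounts to a four-to-three lookup, shown here as a Python table for illustration. The string values stand in for the proto enum members, and the names of the table and helper are hypothetical.

```python
# Four AVEVA STATE strings collapse onto three condition states.
AVEVA_TO_CONDITION_STATE = {
    "UNACK_ALM": "Active",
    "ACK_ALM": "ActiveAcked",
    "UNACK_RTN": "Inactive",  # still listed because it is unacknowledged
    "ACK_RTN": "Inactive",
}


def collapse_state(aveva_state: str) -> str:
    # Raises KeyError for a STATE outside the four documented values.
    return AVEVA_TO_CONDITION_STATE[aveva_state]
```

The mapping is lossy by design: a snapshot reader cannot tell UNACK_RTN from ACK_RTN.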

alarm_full_reference parse contract

AcknowledgeAlarm accepts either form in alarm_full_reference (GatewayAlarmMonitor.BuildAcknowledgeCommand):

  • a canonical GUID (Guid.TryParse) → GUID ack path (AcknowledgeAlarmCommand), which on the deployed wnwrap build hits the E_NOTIMPL AlarmAckByGUID — see AlarmAckByGUID is not implemented;
  • a Provider!Group.Tag reference (TryParseAlarmReference: first ! splits provider from Group.Tag, the first . after the ! splits group from tag) → by-name ack path (AcknowledgeAlarmByNameCommand), the path that works;
  • anything else → a parse error before any worker call.

The transition feed emits the Provider!Group.Tag form in alarm_full_reference, so echoing that value back into AcknowledgeAlarm takes the working by-name path.
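The parse contract can be sketched in Python as follows. This is an illustration, not the gateway's code: uuid.UUID stands in for Guid.TryParse and does not accept exactly the same set of formats, rejecting empty segments is an assumption, and the function names and returned tuples are hypothetical.

```python
import uuid
from typing import Optional, Tuple


def try_parse_alarm_reference(reference: str) -> Optional[Tuple[str, str, str]]:
    # The first '!' splits the provider from Group.Tag.
    provider, bang, rest = reference.partition("!")
    # The first '.' after the '!' splits the group from the tag.
    group, dot, tag = rest.partition(".")
    if not (provider and bang and group and dot and tag):
        return None
    return provider, group, tag


def route_ack(reference: str) -> Tuple[str, object]:
    # GUID-shaped references take the GUID path (E_NOTIMPL on the deployed build).
    try:
        return "AcknowledgeAlarm", uuid.UUID(reference)
    except ValueError:
        pass
    # Provider!Group.Tag references take the by-name path, the one that works.
    parts = try_parse_alarm_reference(reference)
    if parts is None:
        raise ValueError("alarm_full_reference is neither a GUID nor Provider!Group.Tag")
    return "AcknowledgeAlarmByName", parts
```

Note that the tag keeps any further dots: `Galaxy!TestArea.TestMachine_001.TestAlarm001` parses to provider `Galaxy`, group `TestArea`, and tag `TestMachine_001.TestAlarm001`.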

Reserved / unused

AlarmTransitionKind.RETRIGGER is defined in the proto but is not currently produced — the transition mapper emits only Raise / Acknowledge / Clear. It is reserved for a future "re-raise from a previously cleared condition" distinction.