lmxopcua/docs/plans/alarms-worker-wiring-plan.md
Joseph Doherty cd2306db66 feat(historian-sidecar): live aahClientManaged alarm-event write path (C.1)
SdkAlarmHistorianWriteBackend.WriteBatchAsync replaces the RetryPlease
placeholder with the real entry point — HistorianAccess.AddStreamedValue
(HistorianEvent, out HistorianAccessError) in aahClientManaged, pinned by
decompiling the installed SDK.

The write path opens its own ReadOnly=false connection: the query-side
HistorianDataSource opens ReadOnly sessions and AddStreamedValue fails on
those with WriteToReadOnlyFile. IHistorianConnectionFactory gains a readOnly
parameter (default true, query path unchanged); BuildConnectionArgs is
extracted as a pure helper. HistorianClusterEndpointPicker is shared for
node failover; connection-class errors abort the batch as RetryPlease and
reset the connection, malformed-input codes map to PermanentFail.

Tests: connection-unavailable batch deferral, ClassifyOutcome error-code
table, BuildConnectionArgs read-vs-write shaping (80 pass, 2 rig-skipped).
Live_* round-trip tests stay Skip-gated for the D.1 rollout smoke.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:08:32 -04:00


Alarms Worker Wiring Plan

Context: The alarms-over-gateway epic shipped 19 PRs across the lmxopcua and mxaccessgw repos (merged 2026-04-30). Contracts are live; the sub-attribute fallback path keeps Galaxy alarms functional today. Four items remain as inert scaffolds gated on a dev-rig finding. This document is the focused implementation plan for those four items only.

Do not duplicate docs/plans/alarms-over-gateway.md — that document is the full historical record of all 19 PRs. This document covers only what is still to be done and exactly what blocks each item.

This work lives in the mxaccessgw sibling repo at C:\Users\dohertj2\Desktop\mxaccessgw\ — not in this (lmxopcua) repo, except where lmxopcua changes are noted explicitly.


Dev-rig finding that blocks everything (2026-04-30)

During PR A.2 work the following was discovered on the dev box:

The MXAccess COM Toolkit at C:\Program Files (x86)\ArchestrA\Framework\Bin\ArchestrA.MXAccess.dll exposes no alarm-event family — only OnDataChange, OnWriteComplete, OperationComplete, OnBufferedDataChange.

AVEVA's aaAlarmManagedClient / ArchestrAAlarmsAndEvents.SDK assemblies are x64-only and incompatible with the worker's x86 net48 bitness.

The architectural decision required before any of A.2, A.3/A.4, C.1 can ship:

Either accept the value-driven sub-attribute path as the production architecture (operator-comment fidelity is the only v1 regression), or add an x64 alarm-helper sub-process alongside the x86 worker.

Resolution drives the implementation shape of every item below. The plan presented here assumes the x64 alarm-helper sub-process route (the higher parity option), but notes the sub-attribute-only exit at each step.
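
The bitness constraint behind this decision can be made explicit at start-up. The following is a minimal sketch, assuming a hypothetical `BitnessGuard` helper (not an existing mxaccessgw type): it compares an assembly's known bitness with `Environment.Is64BitProcess`, so a mismatch fails with a readable message rather than a `BadImageFormatException` from the loader. AnyCPU assemblies are deliberately not modelled.

```csharp
using System;

// Hypothetical helper for illustration; not part of the mxaccessgw codebase.
public static class BitnessGuard
{
    // Pure comparison so it can be unit-tested without changing process bitness.
    public static bool CanLoad(bool assemblyIsX64, bool processIs64Bit)
    {
        return assemblyIsX64 == processIs64Bit;
    }

    // Throws before the assembly load is attempted when the bitness differs.
    public static void EnsureCanLoad(string assemblyName, bool assemblyIsX64)
    {
        bool processIs64Bit = Environment.Is64BitProcess;
        if (CanLoad(assemblyIsX64, processIs64Bit))
        {
            return;
        }

        string assemblyBits = assemblyIsX64 ? "x64" : "x86";
        string processBits = processIs64Bit ? "x64" : "x86";
        throw new InvalidOperationException(
            assemblyName + " is " + assemblyBits + " but this process is " + processBits + ".");
    }
}
```

Under this sketch the x86 worker would call `BitnessGuard.EnsureCanLoad("aaAlarmManagedClient.dll", assemblyIsX64: true)` and fail fast, which is the situation the x64 alarm-helper route is meant to avoid.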


Discovered AVEVA API surface

Before implementing, verify the following against the AVEVA SDK actually installed on the dev box and in the mxaccessgw worker's deployment folder:

| Assembly | Bitness | Likely location | Key types |
|---|---|---|---|
| ArchestrA.MXAccess.dll | x86 | C:\Program Files (x86)\ArchestrA\Framework\Bin\ | IMxAlarmEventSink, MxAlarmEventArgs (confirm these exist at the actual version) |
| aaAlarmManagedClient.dll | x64 | C:\Program Files\ArchestrA\Framework\Bin\ | AlarmClient, IAlarmConsumer, AlarmEventArgs |
| ArchestrAAlarmsAndEvents.SDK.dll | x64 | Same or Historian SDK folder | AlarmHistorianWriter (not present in the installed SDK; see Item C.1), GetAlarmExtendedRec |

The AVEVA MXAccess Toolkit reference in the mxaccessgw repo (gateway.md) is the canonical API doc for the gateway worker's side. The alarm-client API is documented separately; verify the following call shapes during PR A.2:

| Operation | Likely API | Notes |
|---|---|---|
| Subscribe to alarm events | AlarmClient.RegisterConsumer(IAlarmConsumer) + AlarmClient.Subscribe(filterSpec) | Confirm exact method signatures against the SDK version on the dev box |
| Receive alarm event | IAlarmConsumer.OnAlarmEvent(AlarmEventArgs) callback | Field set: alarm name, source, type, transition kind, severity, timestamps, operator fields |
| Acknowledge alarm | AlarmClient.AcknowledgeAlarm(alarmRef, comment, userPrincipal) or equivalent | Confirm whether this is synchronous or returns a status |
| Query active alarms | AlarmClient.GetAlarmExtendedRec(filter) or GetActiveAlarms() | Returns current active set for ConditionRefresh |
| Get statistics | AlarmClient.GetStatistics() | Optional; useful for worker health checks |

Record the exact method signatures against the installed SDK before starting A.2 — the proto field set in OnAlarmTransitionEvent must match the SDK's actual payload.


Dependency order

A.2 (worker: AlarmClient subscription)
  └─► A.3 (gateway: dispatch OnAlarmTransition + AcknowledgeAlarm RPC handler)
        └─► A.4 (gateway: QueryActiveAlarms RPC handler)
              └─► lmxopcua B.2 (GalaxyDriver IAlarmSource live)
                    └─► C.1 (sidecar: AahClientManagedAlarmEventWriter live)
                          └─► D.1 (smoke artifact captured)

A.2 is the single blocking item. All subsequent items unblock serially once A.2 delivers alarm events through the channel.


Item A.2 — Worker: subscribe to MxAccess alarm event source

Repo: mxaccessgw, src\MxGateway.Worker\ (net48, x86)

What it needs:

The worker must subscribe to AVEVA's alarm events and fan them into the same bounded channel the data-change pump uses, translating each MxAccess alarm event into a WorkerEvent proto with family MX_EVENT_FAMILY_ON_ALARM_TRANSITION (defined in PR A.1, already merged).

Architectural choice determines the implementation path:

Option X1 — aaAlarmManagedClient in a new x64 alarm-helper process

Add a second worker-mode sub-process (MxGateway.AlarmWorker, net8.0 x64) alongside the existing x86 worker. The AlarmWorker:

  1. Loads aaAlarmManagedClient.dll (x64) on startup.
  2. Calls AlarmClient.RegisterConsumer with a WorkerAlarmConsumer sink.
  3. Calls AlarmClient.Subscribe with a session-level filter (all alarms for the session's Galaxy scope).
  4. Translates each IAlarmConsumer.OnAlarmEvent callback into a protobuf WorkerEvent (family ON_ALARM_TRANSITION) and writes it to an IPC channel readable by the gateway server-side multiplexer.
  5. Handles session lifecycle: re-subscribes after reconnect; unsubscribes on session close.

IPC from AlarmWorker to gateway: the simplest option is a named pipe, or an in-process queue if the AlarmWorker is hosted inside the gateway process as a separate IHostedService.
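
Step 4 of Option X1 (translate a callback, write it to a bounded channel) can be sketched as follows. This is an illustration under assumptions, not the worker's actual code: `AlarmCallback` and `WorkerEventSketch` are hypothetical stand-ins for the SDK payload and the WorkerEvent proto message, and the field set shown is a subset. It uses `System.Threading.Channels` and C# records, so it assumes the net8.0 AlarmWorker target.

```csharp
using System;
using System.Threading.Channels;

// Hypothetical stand-in for the SDK callback payload (real field set to be confirmed in A.2).
public sealed record AlarmCallback(string AlarmRef, string Transition, int Severity, DateTime TimestampUtc);

// Hypothetical stand-in for the WorkerEvent proto message.
public sealed record WorkerEventSketch(string Family, string AlarmRef, string Transition, int Severity, DateTime TimestampUtc);

public sealed class AlarmFanIn
{
    public const string Family = "MX_EVENT_FAMILY_ON_ALARM_TRANSITION";

    private readonly Channel<WorkerEventSketch> _channel;

    public AlarmFanIn(int capacity)
    {
        // In Wait mode, TryWrite returns false when the channel is full.
        _channel = Channel.CreateBounded<WorkerEventSketch>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.Wait,
            SingleReader = true
        });
    }

    public ChannelReader<WorkerEventSketch> Reader => _channel.Reader;

    // Intended to be called from the SDK callback thread, so it never blocks.
    // Returns false when the event could not be queued; the caller decides what to do.
    public bool OnAlarmEvent(AlarmCallback e)
    {
        var workerEvent = new WorkerEventSketch(Family, e.AlarmRef, e.Transition, e.Severity, e.TimestampUtc);
        return _channel.Writer.TryWrite(workerEvent);
    }

    // Called on session close so the reader can drain and finish.
    public void Complete() => _channel.Writer.Complete();
}
```

What happens when `OnAlarmEvent` returns false (drop, count, or escalate) is left open here; it should follow whatever back-pressure policy the existing data-change pump already applies to the shared channel.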

Option X2 — Accept sub-attribute fallback as production (no A.2 work)

If the architectural decision is to accept the sub-attribute path as permanent:

  • MxAccessAlarmEventSink.Attach() in the worker remains a no-op (as currently coded with the architectural comment).
  • The MX_EVENT_FAMILY_ON_ALARM_TRANSITION proto family stays defined but the gateway never emits events on it.
  • lmxopcua's GalaxyDriver does not implement IAlarmSource for the native path; the value-driven sub-attribute path remains the production path.
  • The only regression vs. v1 is operator-comment fidelity on Galaxy alarms.
  • C.1 is still needed if scripted-alarm historian write-back is required.

What blocks it: the architectural decision above. Once made, A.2 becomes a 2–3 day implementation task (sub-process plumbing + proto translation + unit tests for the consumer sink cancellation behaviour).

Tests to write (when A.2 proceeds):

  • WorkerAlarmConsumerTests — fake IAlarmConsumer source emits canned transitions; assert each produces the correct WorkerEvent body shape.
  • Cancellation/session-close test — closing the session unsubscribes from the AlarmClient cleanly (no leaked IAlarmConsumer reference if the worker is recycled mid-session).
  • Re-subscribe-after-reconnect test — ReconnectSupervisor triggers a reconnect; assert the alarm consumer re-attaches to the new session.
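
The cancellation/session-close test above could take roughly this shape. It is a sketch only: `IAlarmFeed`, `FakeAlarmFeed`, and `AlarmSession` are hypothetical names introduced for illustration, standing in for whatever abstraction ends up wrapping AlarmClient once the real signatures are confirmed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical abstraction over the SDK's alarm client.
public interface IAlarmFeed
{
    void Subscribe(Action<string> onAlarm);
    void Unsubscribe(Action<string> onAlarm);
}

// Test double: emits canned alarm references to whoever is subscribed.
public sealed class FakeAlarmFeed : IAlarmFeed
{
    private readonly List<Action<string>> _handlers = new List<Action<string>>();

    public int HandlerCount { get { return _handlers.Count; } }

    public void Subscribe(Action<string> onAlarm) { _handlers.Add(onAlarm); }

    public void Unsubscribe(Action<string> onAlarm) { _handlers.Remove(onAlarm); }

    public void Emit(string alarmRef)
    {
        // Copy first so a handler may unsubscribe during dispatch.
        foreach (var handler in _handlers.ToArray())
        {
            handler(alarmRef);
        }
    }
}

// Session wrapper: subscribes on construction, unsubscribes exactly once on Dispose.
public sealed class AlarmSession : IDisposable
{
    private readonly IAlarmFeed _feed;
    private readonly Action<string> _handler;
    private bool _disposed;

    public List<string> Received { get; } = new List<string>();

    public AlarmSession(IAlarmFeed feed)
    {
        _feed = feed;
        _handler = alarmRef => Received.Add(alarmRef);
        _feed.Subscribe(_handler);
    }

    public void Dispose()
    {
        if (_disposed) return;
        _disposed = true;
        _feed.Unsubscribe(_handler);
    }
}
```

A test would emit one alarm, dispose the session, emit another, and then assert that `Received` holds only the first alarm and `HandlerCount` is zero, which is the "no leaked consumer reference" condition.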

Item A.3 / A.4 — Gateway: dispatch and RPC handlers

Repo: mxaccessgw, src\MxGateway.Server\

Depends on: A.2 delivering WorkerEvent bodies with family MX_EVENT_FAMILY_ON_ALARM_TRANSITION.

What it needs:

A.3 — Dispatch + AcknowledgeAlarm

  1. The session-level event multiplexer (Sessions\SessionEventStream.cs or equivalent — verify name in the mxaccessgw repo) must recognise the new WorkerEvent body and forward it as an MxEvent with family MX_EVENT_FAMILY_ON_ALARM_TRANSITION to every StreamEvents subscriber for that session.

  2. New RPC handler AcknowledgeAlarm builds an AlarmAcknowledgeCommand worker command and forwards it to the alarm-helper process (Option X1) or the worker's MxAccess session (Option X2 if MxAccess exposes ack). Maps the reply status to AcknowledgeAlarmReply.MxStatusProxy.

  3. Authorization: new API scope invoke:alarm-ack on the API key. Keys without it receive PERMISSION_DENIED. Follow the existing scope-check pattern used by invoke:write.
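
The scope check in step 3 can be illustrated with a small pure function. This is a sketch under assumptions: `AlarmScopes` and `ScopeCheckResult` are hypothetical names, and the real handler should reuse the existing invoke:write scope-check pattern and map a denial to the gRPC PERMISSION_DENIED status.

```csharp
using System.Collections.Generic;

public enum ScopeCheckResult { Ok, PermissionDenied }

// Hypothetical helper; scope strings are the ones named in this plan.
public static class AlarmScopes
{
    public const string AlarmAck = "invoke:alarm-ack";
    public const string AlarmQuery = "invoke:alarm-query";

    // A key passes only if it carries the exact scope the RPC requires.
    public static ScopeCheckResult Check(ICollection<string> keyScopes, string requiredScope)
    {
        return keyScopes.Contains(requiredScope)
            ? ScopeCheckResult.Ok
            : ScopeCheckResult.PermissionDenied;
    }
}
```

Keeping the check free of I/O means the A.3 auth test can exercise it with a plain `HashSet<string>` of scopes, without standing up the gateway.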

A.4 — QueryActiveAlarms

  1. New RPC handler QueryActiveAlarms calls AlarmClient.GetAlarmExtendedRec (or GetActiveAlarms — confirm the method name during implementation) on the alarm-helper process, batches results into ActiveAlarmSnapshot proto messages, and streams them back to the caller.

  2. New API scope invoke:alarm-query (separate from ack so read-only clients can refresh without ack rights).

What blocks A.3/A.4: A.2 must deliver WorkerEvent bodies on the channel. A.3/A.4 are pure dispatch wiring once the events arrive.

Tests to write:

  • A.3 dispatch test — fake worker emits an AlarmTransition event; assert the gateway forwards it on the StreamEvents channel of every subscribed session (mirrors existing OnDataChange dispatch tests).
  • A.3 AcknowledgeAlarm auth test — existing key without invoke:alarm-ack scope returns PERMISSION_DENIED.
  • A.4 pagination test — synthetic active-alarm set of 0 / 1 / 100 entries; assert each streams back as separate ActiveAlarmSnapshot messages.
  • Integration (parity rig — requires dev box with AVEVA platform): trigger a real Galaxy alarm, call QueryActiveAlarms, assert the alarm appears in the stream; call AcknowledgeAlarm, assert the alarm transitions to ActiveAcked and an Acknowledge transition event appears on StreamEvents.

Item C.1 — Historian sidecar: AahClientManagedAlarmEventWriter

Repo: lmxopcua, src\Drivers\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\

Depends on: Architectural decision (the sidecar uses aahClientManaged x64, which is not bitness-constrained like the worker). C.1 is independently unblockable from A.2 if the goal is to wire up the scripted-alarm historian path.

Current state (DONE — code):

C.1 shipped. SdkAlarmHistorianWriteBackend.WriteBatchAsync writes through the real SDK entry point — HistorianAccess.AddStreamedValue(HistorianEvent, out HistorianAccessError) in aahClientManaged — pinned 2026-05-18 by decompiling the installed SDK. Program.cs and Install-Services.ps1 were already wired in the PR C.1 scaffolding. Two corrections to the assumptions this doc was written under:

  • There is no ArchestrAAlarmsAndEvents.SDK writer. That assembly (ArchestrAAlarmsAndEvents.SDK.Common.dll, the only one installed) is a WCF query-proxy base — no AlarmHistorianWriter type. The write path is the aahClientManaged HistorianAccess surface.
  • The write path needs its own connection. The query-side HistorianDataSource opens ReadOnly sessions; AddStreamedValue on a read-only session fails with WriteToReadOnlyFile. SdkAlarmHistorianWriteBackend opens a dedicated ReadOnly=false connection and shares only HistorianClusterEndpointPicker (not the connection object).
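
The read-versus-write split can be pictured as a pure helper whose `readOnly` parameter defaults to true, so the query path is unchanged and only the alarm writer opts out. The sketch below is an assumption-laden illustration: `ConnectionArgsSketch` and its properties are hypothetical, and the real argument type and property names must come from the installed aahClientManaged.

```csharp
// Hypothetical argument bag; not the SDK's connection-args type.
public sealed class ConnectionArgsSketch
{
    public string ServerName { get; set; }
    public bool ReadOnly { get; set; }
}

public static class ConnectionArgsBuilder
{
    // Pure: no I/O and no SDK calls, so read-vs-write shaping can be asserted directly.
    // readOnly defaults to true so existing query-side callers keep their behaviour.
    public static ConnectionArgsSketch BuildConnectionArgs(string serverName, bool readOnly = true)
    {
        return new ConnectionArgsSketch
        {
            ServerName = serverName,
            ReadOnly = readOnly
        };
    }
}
```

Under this shape the query side calls `BuildConnectionArgs(node)` and the write backend calls `BuildConnectionArgs(node, readOnly: false)`, each on its own connection.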

What it needed (all done):

  1. SdkAlarmHistorianWriteBackend builds a HistorianEvent per AlarmHistorianEventDto, calls AddStreamedValue, and maps HistorianAccessError.ErrorValue codes through AahClientManagedAlarmEventWriter.MapOutcome (Ack / PermanentFail / RetryPlease). HistorianClusterEndpointPicker drives multi-node failover.
  2. Program.cs: BuildAlarmWriter() constructs the backend gated behind OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED.
  3. Install-Services.ps1 — env var present in the install-time block.
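
The outcome mapping in item 1 can be sketched as below. The real mapping lives in AahClientManagedAlarmEventWriter.MapOutcome and works on HistorianAccessError.ErrorValue codes; `ErrorKindSketch` is a hypothetical coarse grouping used only for illustration, and routing unclassified errors to RetryPlease is an assumption of this sketch rather than documented behaviour. Resetting the connection after a connection-class error is not shown.

```csharp
using System.Collections.Generic;

public enum WriteOutcome { Ack, PermanentFail, RetryPlease }

// Hypothetical grouping of SDK error codes; the real codes come from the installed SDK.
public enum ErrorKindSketch { None, ConnectionLost, MalformedInput, Other }

public static class OutcomeSketch
{
    public static WriteOutcome Classify(ErrorKindSketch kind)
    {
        switch (kind)
        {
            case ErrorKindSketch.None:
                return WriteOutcome.Ack;
            case ErrorKindSketch.MalformedInput:
                return WriteOutcome.PermanentFail; // retrying bad input cannot succeed
            default:
                return WriteOutcome.RetryPlease;   // assumption: anything else is retried
        }
    }

    // Returns one outcome per input row, in input order.
    // A connection-class error aborts the batch: that row and all later rows are deferred.
    public static List<WriteOutcome> ClassifyBatch(IReadOnlyList<ErrorKindSketch> perRow)
    {
        var outcomes = new List<WriteOutcome>(perRow.Count);
        for (int i = 0; i < perRow.Count; i++)
        {
            if (perRow[i] == ErrorKindSketch.ConnectionLost)
            {
                while (outcomes.Count < perRow.Count)
                {
                    outcomes.Add(WriteOutcome.RetryPlease);
                }
                break;
            }
            outcomes.Add(Classify(perRow[i]));
        }
        return outcomes;
    }
}
```

The returned list is always the same length as the input, which keeps the per-row outcome list parallel to the batch.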

What remains for C.1: only the live-rig write smoke — the Live_* tests in SdkAlarmHistorianWriteBackendTests stay Skip-gated until D.1 confirms a round-trip against a real AVEVA Historian, including the exact mandatory HistorianEvent field set.

Tests to write:

  • Outcome-mapping table: every MxStatus on alarm-write → expected HistorianWriteOutcome.
  • Batch test: 1 / 100 / 1000 events through a fake aahClientManaged writer; assert per-row outcome list parallel to input order.
  • Cluster failover: primary Historian node returns BadCommunicationError; picker rotates to secondary; eventual success.
  • Program.cs seam: assert handler constructed with alarm writer when env var enabled; without it when disabled.
  • Live integration (parity rig): write a synthetic alarm event through the IPC; query it back via ReadEvents; assert round-trip fidelity.
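
For the Program.cs seam test, it helps if the env-var gate is a pure function that a test can call with literal values. A minimal sketch follows, assuming a hypothetical `AlarmWriteGate` helper. The set of accepted values ("1" or "true", case-insensitive) is an assumption; the real parsing rule should be read from BuildAlarmWriter().

```csharp
using System;

// Hypothetical helper; only the variable name comes from this plan.
public static class AlarmWriteGate
{
    public const string EnvVar = "OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED";

    // Assumption: "1" or "true" (any casing, surrounding whitespace ignored) enables the writer.
    // Missing, empty, or any other value leaves it disabled.
    public static bool IsEnabled(string rawValue)
    {
        if (string.IsNullOrWhiteSpace(rawValue))
        {
            return false;
        }

        string value = rawValue.Trim();
        return value == "1" || string.Equals(value, "true", StringComparison.OrdinalIgnoreCase);
    }

    // Thin wrapper that reads the process environment.
    public static bool IsEnabledFromEnvironment()
    {
        return IsEnabled(Environment.GetEnvironmentVariable(EnvVar));
    }
}
```

Defaulting to disabled keeps the write path off unless an operator has explicitly opted in at install time.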

Item D.1 — Smoke artifact

Repo: lmxopcua (deployment refresh) + mxaccessgw (rig verification)

Depends on: A.2, A.3, A.4, and C.1 all passing on the dev rig with a live Galaxy and live Historian.

Current state: The deployment script Refresh-Services.ps1 (task D.1) has shipped as PR #417 (merged 2026-04-30). What was NOT captured at that time was a smoke artifact — a log snippet or test output confirming that:

  1. An alarm transition event from a live Galaxy alarm reaches lmxopcua's AlarmConditionService via the new IAlarmSource path (not the fallback).
  2. A scripted-alarm historian write-back reaches AVEVA Historian via the sidecar IAlarmEventWriter.

What it needs:

Once A.2, A.3, C.1 are wired on the parity rig:

  1. Deploy the updated mxaccessgw (with A.2 / A.3 / A.4 changes).

  2. Deploy the updated sidecar (with C.1 changes).

  3. Run Refresh-Services.ps1 to confirm clean service restarts.

  4. Trigger a Galaxy alarm (e.g. set an AnalogLimitAlarm attribute out of range in Galaxy IDE).

  5. Observe the lmxopcua OPC UA alarm surface via the Client CLI:

    dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
        alarms -u opc.tcp://localhost:4840 --subscribe
    

    Pass: the alarm condition appears on the OPC UA A&E surface within 2 × publishing interval.

  6. Trigger a scripted alarm via the lmxopcua ScriptedAlarmEngine (or an OPC UA method call if one is wired).

  7. Confirm in the AVEVA Historian that the scripted alarm event is stored (query via the Historian client or HistorianWatch tool).

  8. Capture log snippets:

    • mxaccessgw log: [INF] AlarmTransition dispatched sessionId=<> alarmRef=<>
    • lmxopcua log: [INF] AlarmConditionService: IAlarmSource event alarmRef=<> origin=Driver
    • Sidecar log: [INF] AahClientManagedAlarmEventWriter: Wrote <n> alarm events
  9. Commit the log snippets as docs/plans/alarms-d1-smoke-artifact.md (a new doc, not this one).

What blocks D.1: all of A.2, A.3, C.1, plus the operator decision on the x64 alarm-helper architecture (or explicit acceptance of the sub-attribute fallback as production).


Summary of blocks

| Item | Blocked by | Estimated effort once unblocked |
|---|---|---|
| A.2 | Architectural decision (x64 alarm-helper vs. sub-attribute fallback as production) | 2–3 days implementation; 1 day tests |
| A.3 | A.2 delivering WorkerEvent bodies | 1–2 days |
| A.4 | A.2 (active-alarm query needs AlarmClient session) | 1 day |
| C.1 | aahClientManaged SDK access (available on dev box); NOT blocked by A.2 | 1–2 days (code shipped; live-rig smoke pending D.1) |
| D.1 | A.2 + A.3 + C.1 all passing on parity rig | 0.5 day (smoke + artifact capture) |

C.1 can proceed in parallel with A.2 / A.3 since the sidecar's aahClientManaged is x64 and does not share the worker bitness constraint.


What this plan does NOT cover

  • The value-driven sub-attribute fallback path — already shipped and functional (not being changed).
  • Track B (lmxopcua EventPump, GalaxyDriver IAlarmSource re-implementation) and Track E (client SDK surface refresh) from the alarms-over-gateway plan — those are in lmxopcua and depend on A.3 being live; they follow naturally once A.3 ships.
  • Galaxy-native alarm historian path — System Platform's own HistorizeToAveva toggle on the Galaxy template; not in scope.
  • Alarm ACL / role-grant surface — already shipped in Phase 6.2.