lmxopcua/docs/AlarmTracking.md
Joseph Doherty 6208304a44
docs(historian): HistorizeToAveva opt-out semantics + config knobs + startup validation
2026-06-11 13:24:46 -04:00

Alarm tracking — v2 final architecture

This document describes how OtOpcUa surfaces alarms to OPC UA Part 9 clients after the alarms-over-gateway epic (docs/plans/alarms-over-gateway.md) landed. The v1 architecture (Galaxy.Host's COM-side GalaxyAlarmTracker) is preserved at docs/v1/AlarmTracking.md for historical reference.

Three alarm sources, one OPC UA Part 9 surface

| Source | Driver capability | Path |
|---|---|---|
| Galaxy MxAccess (driver-native) | GalaxyDriver : IAlarmSource | gateway → worker → MxAccess alarm sink → MX_EVENT_FAMILY_ON_ALARM_TRANSITION → EventPump → driver OnAlarmEvent → AlarmConditionService |
| Galaxy sub-attribute fallback | IWritable writes to $Alarm* sub-attributes | gateway data subscription → driver OnDataChange → DriverNodeManager ConditionSink → AlarmConditionService |
| Scripted alarms | Phase7Composer | server-side script evaluator → ScriptedAlarmActor transitions → HistorianAdapterActor → IAlarmHistorianSink |

All three converge on the alarm-state actor — in v2 the OPC UA Part 9 state machine lives inside ScriptedAlarmActor (src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmActor.cs), which dispatches transitions to the OPC UA condition node managers. Driver-native transitions take precedence over sub-attribute synthesis when both arrive for the same condition — the dedup logic prefers the richer driver-native record because it carries the full operator + raise-time + category metadata that the value-driven path collapses.

Galaxy driver path (driver-native)

Restored in PR B.2 of the epic. GalaxyDriver implements IAlarmSource with these surfaces:

  • SubscribeAlarmsAsync(sourceNodeIds) → returns a sentinel handle. The driver doesn't multiplex per source-node-id today; every active handle observes the gateway's alarm-event stream. The server-side AlarmConditionService filters by source-node before raising the OPC UA condition.
  • UnsubscribeAlarmsAsync(handle) → symmetric handle removal.
  • AcknowledgeAsync(requests) → routes one gateway RPC per acknowledgement through IGalaxyAlarmAcknowledger. Production uses GatewayGalaxyAlarmAcknowledger calling MxGatewayClient.AcknowledgeAlarmAsync (PR E.2 SDK method).
  • OnAlarmEvent → bridges EventPump.OnAlarmTransition (PR B.1) onto AlarmEventArgs. Suppressed when no alarm subscription is active so untracked transitions don't leak through.
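
The subscription behaviour described above can be sketched in a few lines. This is illustrative Python, not the project's C#: `AlarmSourceSketch`, `AlarmTransition`, and `on_gateway_transition` are hypothetical names, and the real driver's signatures are not shown in this document. The sketch only demonstrates two points from the list: handles are not multiplexed per source node, and transitions are suppressed while no handle is active.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class AlarmTransition:
    source_node_id: str
    kind: str  # "Raise", "Acknowledge", "Clear", or "Retrigger"


class AlarmSourceSketch:
    """Hypothetical stand-in for the driver's alarm surface."""

    def __init__(self) -> None:
        self._handles: Set[int] = set()
        self._next_handle = itertools.count(1)
        self.listeners: List[Callable[[AlarmTransition], None]] = []

    def subscribe_alarms(self, source_node_ids: List[str]) -> int:
        # The node ids are accepted but not used for filtering here:
        # every active handle observes the whole alarm-event stream.
        handle = next(self._next_handle)
        self._handles.add(handle)
        return handle

    def unsubscribe_alarms(self, handle: int) -> None:
        self._handles.discard(handle)

    def on_gateway_transition(self, transition: AlarmTransition) -> bool:
        # Suppress the event entirely when nobody has subscribed.
        if not self._handles:
            return False
        for listener in self.listeners:
            listener(transition)
        return True
```

Filtering by source node would then happen in whatever consumes `listeners`, mirroring the server-side filter the first bullet attributes to AlarmConditionService.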

The proto contract carries the rich payload — alarm full reference, source-object reference, alarm-type-name, transition kind (Raise / Acknowledge / Clear / Retrigger), severity (raw MxAccess scale), original raise timestamp, transition timestamp, operator user, operator comment, alarm category, description. MxAccessSeverityMapper (PR B.1) translates the raw severity onto the four-bucket AlarmSeverity ladder — boundaries match v1's GalaxyAlarmTracker so customers see no surprise re-classification.

The richer fields surface on Core.Abstractions.AlarmEventArgs via the optional properties added in PR E.7 (OperatorComment, OriginalRaiseTimestampUtc, AlarmCategory). Consumers that don't need them are unaffected; consumers that do (Client.UI, Client.CLI verbose mode) read the new fields when present.
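
As a rough illustration of "optional properties that older consumers can ignore", the sketch below models the three fields named above as optional attributes. It is Python rather than the project's C#, and everything except the three field names (here in snake_case) is an assumption: the class name, the `severity` and `message` fields, and the `format_verbose` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class AlarmEventArgsSketch:
    # Hypothetical baseline fields.
    source_node_id: str
    severity: str
    message: str
    # Optional richer fields; None means "the source did not supply it".
    operator_comment: Optional[str] = None
    original_raise_timestamp_utc: Optional[datetime] = None
    alarm_category: Optional[str] = None


def format_verbose(e: AlarmEventArgsSketch) -> str:
    """Render an event, appending the richer fields only when present."""
    parts = [f"{e.source_node_id} [{e.severity}] {e.message}"]
    if e.operator_comment is not None:
        parts.append(f"comment={e.operator_comment}")
    if e.original_raise_timestamp_utc is not None:
        parts.append(f"raised={e.original_raise_timestamp_utc.isoformat()}")
    if e.alarm_category is not None:
        parts.append(f"category={e.alarm_category}")
    return " ".join(parts)
```

A consumer that only reads the baseline fields keeps working unchanged, which is the compatibility property the paragraph describes.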

Galaxy sub-attribute fallback

For Galaxy templates without $Alarm* extensions, the value-driven path stays in place: DriverNodeManager registers an AlarmConditionState for each Galaxy variable that exposes alarm-bearing sub-attributes (InAlarm, Acked, Priority, Description), subscribes to those sub-attributes, and synthesizes Part 9 transitions when their values change. This path was the only Galaxy alarm path between PR 7.2 and the alarms-over-gateway epic; it remains the fallback today.

When both paths report the same condition, AlarmConditionService.AlarmConditionState keeps the driver-native record and discards the duplicate sub-attribute synthesis. Driver-native transitions are richer (carry operator comment + original raise time) and arrive lower-latency (no publishing-interval delay on the sub-attribute reads), so they win the dedup.
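
One possible reading of that dedup rule, sketched in Python for illustration: once a condition has been reported by the driver-native path, sub-attribute syntheses for the same condition are discarded. The document does not say how "the same condition" is detected or over what window, so the `condition_id` key and the "native seen once, native wins from then on" policy are assumptions, as are all names here.

```python
from dataclasses import dataclass
from typing import Optional, Set

DRIVER_NATIVE = "driver-native"
SUB_ATTRIBUTE = "sub-attribute"


@dataclass(frozen=True)
class Transition:
    condition_id: str  # hypothetical dedup key
    kind: str
    origin: str  # DRIVER_NATIVE or SUB_ATTRIBUTE
    operator_comment: Optional[str] = None


class DedupSketch:
    """Keeps driver-native records and drops duplicate syntheses."""

    def __init__(self) -> None:
        self._native_conditions: Set[str] = set()

    def accept(self, t: Transition) -> bool:
        if t.origin == DRIVER_NATIVE:
            # Remember that the richer path covers this condition.
            self._native_conditions.add(t.condition_id)
            return True
        # Sub-attribute synthesis is only a fallback.
        return t.condition_id not in self._native_conditions
```

Conditions that never produce a driver-native transition keep flowing through the fallback path untouched.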

Acknowledge routing — Galaxy / driver alarms

DriverNodeManager picks the acknowledger when registering each condition (PR B.3 logic):

  • Driver implements IAlarmSource → DriverAlarmSourceAcknowledger routes the operator comment through IAlarmSource.AcknowledgeAsync via the existing AlarmSurfaceInvoker (Phase 6.1 resilience pipeline; no-retry per decision #143). End-to-end operator-comment fidelity is preserved.
  • Driver doesn't implement IAlarmSource → DriverWritableAcknowledger writes the comment into the AckMsgWriteRef sub-attribute via IWritable.WriteAsync. Same resilience pipeline; collapses the comment into a single string write at the wire level.
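
The selection rule in the two bullets above amounts to a capability check at registration time. The Python sketch below illustrates it with structural protocols; the two acknowledger class names come from the text, while the method shapes, the `pick_acknowledger` helper, and the error raised for a driver with neither capability are assumptions. The resilience pipeline is omitted.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class AlarmSource(Protocol):
    def acknowledge(self, condition_id: str, comment: str) -> None: ...


@runtime_checkable
class Writable(Protocol):
    def write(self, ref: str, value: str) -> None: ...


class DriverAlarmSourceAcknowledger:
    """Forwards the comment through the driver's native ack call."""

    def __init__(self, driver: AlarmSource) -> None:
        self._driver = driver

    def acknowledge(self, condition_id: str, comment: str) -> None:
        self._driver.acknowledge(condition_id, comment)


class DriverWritableAcknowledger:
    """Collapses the ack into one string write to the ack sub-attribute."""

    def __init__(self, driver: Writable, ack_msg_write_ref: str) -> None:
        self._driver = driver
        self._ref = ack_msg_write_ref

    def acknowledge(self, condition_id: str, comment: str) -> None:
        self._driver.write(self._ref, comment)


def pick_acknowledger(driver: object, ack_msg_write_ref: str):
    # Prefer the native alarm surface; fall back to a plain write.
    if isinstance(driver, AlarmSource):
        return DriverAlarmSourceAcknowledger(driver)
    if isinstance(driver, Writable):
        return DriverWritableAcknowledger(driver, ack_msg_write_ref)
    raise TypeError("driver supports neither alarm acks nor writes")
```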

The OPC UA Part 9 AlarmConditionState.OnAcknowledge delegate already validates the session's AlarmAck role before dispatching, so the gateway-side ack RPC only sees authenticated, authorised calls.

Inbound operator ack/shelve — scripted alarms

Scripted alarms use a separate inbound path that converges on the alarm-commands DPS topic. Two surfaces route onto this topic:

OPC UA Part 9 method path (external OPC UA clients)

OtOpcUaNodeManager wires the Part 9 condition methods (Acknowledge / Confirm / AddComment / OneShotShelve / TimedShelve / Unshelve) on each scripted-alarm AlarmConditionState node. Every call is gated on the AlarmAck LDAP role — fail-closed: sessions with no role or without AlarmAck group membership receive BadUserAccessDenied immediately. The LDAP-resolved role set is carried past OpcUaApplicationHost by RoleCarryingUserIdentity (a UserIdentity subclass), making it readable inside the method handler at dispatch time.

On allow, the handler publishes a Commons.OpcUa.AlarmCommand onto the alarm-commands DPS topic. The node manager is Akka-free; the dispatch action is a settable Action<AlarmCommand> injected at boot by the hosted service.
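
A minimal Python sketch of the fail-closed gate plus the injected dispatch action follows. It is illustrative only: the status values are plain strings rather than OPC UA status codes, the `AlarmCommand` fields are guesses at what such a command would carry, and the no-op default for `dispatch` is an assumption about what happens before the hosted service injects the real action.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional

GOOD = "Good"
BAD_USER_ACCESS_DENIED = "BadUserAccessDenied"


@dataclass(frozen=True)
class AlarmCommand:
    alarm_id: str
    operation: str  # e.g. "Acknowledge", "TimedShelve"
    operator: str
    comment: Optional[str] = None


@dataclass(frozen=True)
class Session:
    user: str
    roles: Optional[FrozenSet[str]]  # None when no roles were resolved


class ConditionMethodHandler:
    def __init__(self) -> None:
        # Settable dispatch action; no-op until injected at boot.
        self.dispatch: Callable[[AlarmCommand], None] = lambda cmd: None

    def handle(self, session: Session, alarm_id: str, operation: str,
               comment: Optional[str] = None) -> str:
        # Fail closed: a missing role set and a missing AlarmAck role
        # are both rejected before anything is published.
        if not session.roles or "AlarmAck" not in session.roles:
            return BAD_USER_ACCESS_DENIED
        self.dispatch(AlarmCommand(alarm_id, operation, session.user, comment))
        return GOOD
```

Keeping `dispatch` a plain callable is what lets the handler stay free of any messaging dependency, which matches the "Akka-free" note above.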

OnTimedUnshelve (the SDK's automatic unshelve timer) bypasses the operator gate — it is system-initiated.

WriteAlarmCondition fires the Part 9 condition event only when the incoming state differs from the node's current live state (delta-gate), preventing the double-emit that would otherwise occur when the SDK auto-applies the acked state and the engine re-projection fires a duplicate event immediately after.
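
The delta-gate can be illustrated with a small Python sketch. The state fields and class names are hypothetical; the point is only that a write equal to the node's current live state produces no event.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ConditionState:
    active: bool
    acked: bool
    shelved: bool = False


class ConditionNodeSketch:
    def __init__(self, state: ConditionState) -> None:
        self.state = state
        self.events: List[ConditionState] = []

    def write_alarm_condition(self, incoming: ConditionState) -> bool:
        # Delta-gate: identical state means nothing to report.
        if incoming == self.state:
            return False
        self.state = incoming
        self.events.append(incoming)
        return True
```

If the SDK has already moved the node to the acked state, the engine's follow-up projection of that same state compares equal and emits nothing, so the client sees a single event.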

AdminUI path

The /alerts page shows per-row Acknowledge / Shelve / Unshelve buttons gated by the DriverOperator AdminUI policy. These route through the AdminOperationsActor cluster singleton (AcknowledgeAlarmCommand / ShelveAlarmCommand), which publishes onto the same alarm-commands topic. The singleton handles cross-node routing — the command always reaches the driver-role node owning the engine regardless of which AdminUI instance the operator is on.

ScriptedAlarmHostActor dispatch

ScriptedAlarmHostActor subscribes to the alarm-commands topic, ownership-filters each command (each node only acts on its own alarms), and dispatches to the matching ScriptedAlarmEngine operation (AcknowledgeAsync / ConfirmAsync / OneShotShelveAsync / TimedShelveAsync / UnshelveAsync / EnableAsync / DisableAsync / AddCommentAsync). The engine's existing OnEvent callback handles the OPC UA node update — no explicit re-projection is required.
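
The ownership filter and operation dispatch can be sketched as a lookup table, shown here in synchronous Python for illustration. `EngineSketch` and `HostSketch` are hypothetical stand-ins, the engine methods just record which operation ran, and raising on an unknown operation is an assumption about error handling.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class AlarmCommand:
    alarm_id: str
    operation: str
    operator: str


class EngineSketch:
    """Records calls; stands in for the real engine operations."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, str, str]] = []

    def _record(self, op: str, cmd: AlarmCommand) -> None:
        self.log.append((op, cmd.alarm_id, cmd.operator))

    def acknowledge(self, cmd): self._record("AcknowledgeAsync", cmd)
    def confirm(self, cmd): self._record("ConfirmAsync", cmd)
    def one_shot_shelve(self, cmd): self._record("OneShotShelveAsync", cmd)
    def timed_shelve(self, cmd): self._record("TimedShelveAsync", cmd)
    def unshelve(self, cmd): self._record("UnshelveAsync", cmd)
    def enable(self, cmd): self._record("EnableAsync", cmd)
    def disable(self, cmd): self._record("DisableAsync", cmd)
    def add_comment(self, cmd): self._record("AddCommentAsync", cmd)


class HostSketch:
    def __init__(self, engine: EngineSketch, owned_alarm_ids: Iterable[str]) -> None:
        self._owned = set(owned_alarm_ids)
        self._ops: Dict[str, Callable[[AlarmCommand], None]] = {
            "Acknowledge": engine.acknowledge,
            "Confirm": engine.confirm,
            "OneShotShelve": engine.one_shot_shelve,
            "TimedShelve": engine.timed_shelve,
            "Unshelve": engine.unshelve,
            "Enable": engine.enable,
            "Disable": engine.disable,
            "AddComment": engine.add_comment,
        }

    def on_command(self, cmd: AlarmCommand) -> bool:
        # Ownership filter: ignore commands for alarms owned elsewhere.
        if cmd.alarm_id not in self._owned:
            return False
        handler = self._ops.get(cmd.operation)
        if handler is None:
            raise ValueError(f"unknown operation: {cmd.operation}")
        handler(cmd)
        return True
```

Because every node receives every command from the topic, the ownership check is what keeps a command from being applied on more than one node.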

The AdminUI /alerts Shelve flow was live-verified on docker-dev 2026-06-11: singleton → topic → host actor → engine → "Shelved" status reflected on /alerts with the operator identity threaded through.

Redundancy deduplication

Under warm/hot redundancy, both cluster nodes run ScriptedAlarmHostActor and the scripted-alarm engine. To prevent duplicate /alerts rows and duplicate historian writes, alarm transition publication to the alerts topic and HistorianAdapterActor historization are Primary-gated: only the node whose RedundancyRole is Primary publishes externally. OPC UA condition-node writes and inbound ack/shelve processing remain ungated on both nodes so the secondary stays warm for failover. See Redundancy.md §Primary-gated alarm emission and historization.
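
The gating split can be sketched as follows in Python. This is a simplification under stated assumptions: the role strings are illustrative, lists stand in for the alerts topic and the historian, and historization is shown inline even though in the real system it happens in a separate actor subscribed to the alerts topic.

```python
from typing import List


class RedundantNodeSketch:
    def __init__(self, role: str) -> None:
        self.role = role  # "Primary" or, in this sketch, "Secondary"
        self.condition_writes: List[str] = []

    def on_transition(self, transition: str,
                      alerts_topic: List[str],
                      historian: List[str]) -> None:
        # Ungated: every node keeps its OPC UA condition nodes current
        # so it can take over without a cold start.
        self.condition_writes.append(transition)
        # Gated: only the Primary emits externally.
        if self.role == "Primary":
            alerts_topic.append(transition)
            historian.append(transition)
```

Running the same transition through a Primary and a Secondary instance yields two condition writes but a single alerts row and a single historian record.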

Historian write-back (non-Galaxy alarms)

Scripted alarms (and any future non-Galaxy IAlarmSource like AB CIP ALMD) route to AVEVA Historian via the Wonderware sidecar:

  • IAlarmHistorianSink is the DI-registered intake contract. The default binding is NullAlarmHistorianSink (registered in ServiceCollectionExtensions.AddOtOpcUaRuntime). Production deployments override it with SqliteStoreAndForwardSink wrapping WonderwareHistorianClient (the AVEVA Historian sidecar IPC client) — see ServiceHosting.md for the sidecar setup.
  • SqliteStoreAndForwardSink queues each transition to a local SQLite database and drains in the background via an IAlarmHistorianWriter. The durability guarantee is bounded: the queue capacity defaults to 1,000,000 rows; under a sustained historian outage, older non-dead-lettered rows are evicted (oldest first) to make room for new events. The HistorianSinkStatus.EvictedCount counter surfaces lifetime eviction events so operators can detect silent data loss without log scraping. The drain cadence, queue capacity, and dead-letter retention are tunable via the AlarmHistorian config section (DrainIntervalSeconds, Capacity, DeadLetterRetentionDays); AlarmHistorianOptions.Validate() logs a startup warning for an empty SharedSecret, a relative DatabasePath, or a non-positive knob.
  • HistorianAdapterActor (src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs) subscribes to the cluster alerts DPS topic, translates each AlarmTransitionEvent → AlarmHistorianEvent, and calls EnqueueAsync fire-and-forget. The durable write is gated two ways: (1) Primary-gated: only the Primary node historizes, giving exactly-once writes across a redundant pair; (2) per-alarm: a transition whose HistorizeToAveva is false is skipped (the flag rides on AlarmTransitionEvent as a nullable bool; missing/null/true historize, only an explicit false suppresses, so a cross-version rolling restart defaults to historizing rather than dropping an audit row). Neither gate touches the live alerts publish, so the /alerts UI always sees every transition. See AlarmHistorian.md §Configuration for the AlarmHistorian appsettings section that enables the real sink.
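
Two behaviours from the list above lend themselves to a short sketch: the nullable HistorizeToAveva opt-out and the bounded queue with oldest-first eviction. The Python below is an in-memory illustration, not the SQLite-backed sink. It assumes the capacity applies to pending (non-dead-lettered) rows only, which the document does not state, and all names other than the eviction counter's meaning are hypothetical.

```python
from collections import deque
from typing import Deque, Optional


def should_historize(historize_to_aveva: Optional[bool]) -> bool:
    # Missing, None, and True all historize; only an explicit
    # False opts the transition out.
    return historize_to_aveva is not False


class StoreAndForwardSketch:
    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.pending: Deque[str] = deque()
        self.evicted_count = 0  # lifetime evictions

    def enqueue(self, event: str) -> None:
        # Evict the oldest pending rows to make room for the new event.
        while len(self.pending) >= self.capacity:
            self.pending.popleft()
            self.evicted_count += 1
        self.pending.append(event)
```

A non-zero `evicted_count` in this model means some transitions never reached the historian, which is the condition the EvictedCount counter is meant to expose.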

Galaxy-native alarms with $Alarm* extensions reach AVEVA Historian directly via System Platform's HistorizeToAveva toggle on the alarm primitive — no involvement from OtOpcUa. This sidecar path is exclusively for non-Galaxy alarm producers.

Cross-references