Alarm tracking — v2 final architecture
This document describes how OtOpcUa surfaces alarms to OPC UA Part 9
clients after the alarms-over-gateway epic
(docs/plans/alarms-over-gateway.md)
landed. The v1 architecture (Galaxy.Host's COM-side GalaxyAlarmTracker)
is preserved at docs/v1/AlarmTracking.md for
historical reference.
Three alarm sources, one OPC UA Part 9 surface
| Source | Driver capability | Path |
|---|---|---|
| Galaxy MxAccess (driver-native) | GalaxyDriver : IAlarmSource | gateway → worker → MxAccess alarm sink → MX_EVENT_FAMILY_ON_ALARM_TRANSITION → EventPump → driver OnAlarmEvent → AlarmConditionService |
| Galaxy sub-attribute fallback | IWritable writes to $Alarm* sub-attributes | gateway data subscription → driver OnDataChange → DriverNodeManager ConditionSink → AlarmConditionService |
| Scripted alarms | Phase7Composer | server-side script evaluator → ScriptedAlarmActor transitions → HistorianAdapterActor → IAlarmHistorianSink |
All three converge on the alarm-state actor — in v2 the OPC UA Part 9 state
machine lives inside ScriptedAlarmActor
(src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmActor.cs),
which dispatches transitions to the OPC UA condition node managers. Driver-native transitions take
precedence over sub-attribute synthesis when both arrive for the same
condition — the dedup logic prefers the richer driver-native record
because it carries the full operator + raise-time + category metadata
that the value-driven path collapses.
Galaxy driver path (driver-native)
Restored in PR B.2 of the epic. GalaxyDriver implements
IAlarmSource with these surfaces:
- SubscribeAlarmsAsync(sourceNodeIds) → returns a sentinel handle. The driver doesn't multiplex per source-node-id today; every active handle observes the gateway's alarm-event stream. The server-side AlarmConditionService filters by source-node before raising the OPC UA condition.
- UnsubscribeAlarmsAsync(handle) → symmetric handle removal.
- AcknowledgeAsync(requests) → routes one gateway RPC per acknowledgement through IGalaxyAlarmAcknowledger. Production uses GatewayGalaxyAlarmAcknowledger calling MxGatewayClient.AcknowledgeAlarmAsync (PR E.2 SDK method).
- OnAlarmEvent → bridges EventPump.OnAlarmTransition (PR B.1) onto AlarmEventArgs. Suppressed when no alarm subscription is active so untracked transitions don't leak through.
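The subscription behaviour above can be illustrated with a minimal sketch. It is written in Python rather than the project's C#, and every name in it (AlarmDriverSketch, make_server_filter, the dictionary-shaped transition) is hypothetical; it only models the three stated rules: handles are not multiplexed per source node, events are suppressed while no handle is active, and source-node filtering happens on the server side.

```python
import itertools


class AlarmDriverSketch:
    """Hypothetical stand-in for the driver's alarm-subscription surface."""

    def __init__(self, on_alarm_event):
        self._on_alarm_event = on_alarm_event
        self._handles = set()
        self._ids = itertools.count(1)

    def subscribe_alarms(self, source_node_ids):
        # The requested node ids are deliberately unused: every active
        # handle observes the same alarm-event stream.
        handle = next(self._ids)
        self._handles.add(handle)
        return handle

    def unsubscribe_alarms(self, handle):
        self._handles.discard(handle)

    def on_alarm_transition(self, transition):
        # Suppress transitions while no alarm subscription is active.
        if not self._handles:
            return
        self._on_alarm_event(transition)


def make_server_filter(wanted_sources, raise_condition):
    """Server-side filter: only wanted source nodes raise a condition."""
    def handler(transition):
        if transition["source_node_id"] in wanted_sources:
            raise_condition(transition)
    return handler


raised = []
driver = AlarmDriverSketch(make_server_filter({"Tank1"}, raised.append))
driver.on_alarm_transition({"source_node_id": "Tank1", "kind": "Raise"})  # no handle yet
handle = driver.subscribe_alarms(["Tank1"])
driver.on_alarm_transition({"source_node_id": "Tank1", "kind": "Raise"})  # raised
driver.on_alarm_transition({"source_node_id": "Pump7", "kind": "Raise"})  # filtered out
driver.unsubscribe_alarms(handle)
driver.on_alarm_transition({"source_node_id": "Tank1", "kind": "Clear"})  # suppressed again
```

In this run only the second call reaches raise_condition; the other three are dropped by the suppression rule or by the server-side filter.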
The proto contract carries the rich payload — alarm full reference,
source-object reference, alarm-type-name, transition kind (Raise /
Acknowledge / Clear / Retrigger), severity (raw MxAccess scale),
original raise timestamp, transition timestamp, operator user,
operator comment, alarm category, description. MxAccessSeverityMapper
(PR B.1) translates the raw severity onto the four-bucket
AlarmSeverity ladder — boundaries match v1's GalaxyAlarmTracker
so customers see no surprise re-classification.
The richer fields surface on Core.Abstractions.AlarmEventArgs via
the optional properties added in PR E.7 (OperatorComment,
OriginalRaiseTimestampUtc, AlarmCategory). Consumers that don't
need them are unaffected; consumers that do (Client.UI, Client.CLI
verbose mode) read the new fields when present.
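As a rough illustration of the optional-field pattern, the sketch below uses Python in place of the project's C#. Only the three optional property names come from the text above; AlarmEventSketch, its condition_id and message fields, and format_verbose are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AlarmEventSketch:
    # Hypothetical required fields, present on every event.
    condition_id: str
    message: str
    # Optional rich fields: absent (None) on the value-driven path.
    operator_comment: Optional[str] = None
    original_raise_timestamp_utc: Optional[datetime] = None
    alarm_category: Optional[str] = None


def format_verbose(evt):
    """Render an event, appending the rich fields only when present."""
    parts = [f"{evt.condition_id}: {evt.message}"]
    if evt.operator_comment is not None:
        parts.append(f"comment={evt.operator_comment}")
    if evt.original_raise_timestamp_utc is not None:
        parts.append(f"raised={evt.original_raise_timestamp_utc.isoformat()}")
    if evt.alarm_category is not None:
        parts.append(f"category={evt.alarm_category}")
    return " | ".join(parts)


plain = AlarmEventSketch("Tank1.Hi", "High level")
rich = AlarmEventSketch(
    "Tank1.Hi",
    "High level",
    operator_comment="checked valve",
    original_raise_timestamp_utc=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
    alarm_category="Process",
)
```

A consumer that ignores the optional fields can construct and read plain events exactly as before, which is the compatibility property the paragraph describes.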
Galaxy sub-attribute fallback
For Galaxy templates without $Alarm* extensions, the value-driven
path stays in place: DriverNodeManager registers an
AlarmConditionState per Galaxy variable that carries alarm
sub-attributes (InAlarm, Acked, Priority, Description),
subscribes to those sub-attributes, and synthesizes Part 9 transitions
when the values change. This path operated as the only Galaxy alarm
path between PR 7.2 and the alarms-over-gateway epic; it remains the
fallback today.
When both paths report the same condition,
AlarmConditionService.AlarmConditionState keeps the
driver-native record and discards the duplicate sub-attribute
synthesis. Driver-native transitions are richer (they carry the operator
comment and original raise time) and arrive with lower latency (no
publishing-interval delay on the sub-attribute reads), so they win
the dedup.
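The preference rule can be sketched as follows, in Python rather than the project's C#. The source does not say how two reports are matched as "the same condition", so this sketch assumes a duplicate is a repeat of the last transition kind for a given condition id; DedupSketch, Transition, and Origin are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    DRIVER_NATIVE = "driver-native"
    SUB_ATTRIBUTE = "sub-attribute"


@dataclass(frozen=True)
class Transition:
    condition_id: str
    kind: str  # e.g. "Raise", "Acknowledge", "Clear"
    origin: Origin


class DedupSketch:
    """Keeps the driver-native record when both paths report a transition."""

    def __init__(self):
        self._last = {}  # condition_id -> last accepted Transition

    def offer(self, t):
        """Return t if it should be kept, or None if it is a duplicate."""
        last = self._last.get(t.condition_id)
        is_duplicate = last is not None and last.kind == t.kind
        upgrades = (
            is_duplicate
            and last.origin is Origin.SUB_ATTRIBUTE
            and t.origin is Origin.DRIVER_NATIVE
        )
        if is_duplicate and not upgrades:
            return None  # same transition already recorded; discard
        self._last[t.condition_id] = t
        return t


dedup = DedupSketch()
native_raise = Transition("Tank1.Hi", "Raise", Origin.DRIVER_NATIVE)
synth_raise = Transition("Tank1.Hi", "Raise", Origin.SUB_ATTRIBUTE)
kept_first = dedup.offer(native_raise)   # kept
kept_second = dedup.offer(synth_raise)   # discarded as a duplicate
```

When the synthesized record happens to arrive first, the later driver-native record replaces it, so the richer metadata wins regardless of arrival order.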
Acknowledge routing — Galaxy / driver alarms
DriverNodeManager picks the acknowledger when registering each
condition (PR B.3 logic):
- Driver implements IAlarmSource → DriverAlarmSourceAcknowledger routes the operator comment through IAlarmSource.AcknowledgeAsync via the existing AlarmSurfaceInvoker (Phase 6.1 resilience pipeline; no-retry per decision #143). End-to-end operator-comment fidelity is preserved.
- Driver doesn't implement IAlarmSource → DriverWritableAcknowledger writes the comment into the AckMsgWriteRef sub-attribute via IWritable.WriteAsync. Same resilience pipeline; collapses the comment into a single string write at the wire level.
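The capability-based selection can be sketched like this, in Python rather than the project's C#. The two acknowledger class names mirror the text; the Protocol definitions, method signatures, pick_acknowledger, and the fake drivers are hypothetical, and the resilience pipeline is left out.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class AlarmSource(Protocol):
    def acknowledge(self, condition_id: str, comment: str) -> None: ...


@runtime_checkable
class Writable(Protocol):
    def write(self, ref: str, value: str) -> None: ...


class DriverAlarmSourceAcknowledger:
    """Routes the ack, comment included, through the driver's alarm API."""

    def __init__(self, driver):
        self._driver = driver

    def acknowledge(self, condition_id, comment):
        self._driver.acknowledge(condition_id, comment)


class DriverWritableAcknowledger:
    """Falls back to a single string write of the comment."""

    def __init__(self, driver, ack_msg_write_ref):
        self._driver = driver
        self._ref = ack_msg_write_ref

    def acknowledge(self, condition_id, comment):
        self._driver.write(self._ref, comment)


def pick_acknowledger(driver, ack_msg_write_ref):
    # Prefer the alarm-source path; it preserves the full ack request.
    if isinstance(driver, AlarmSource):
        return DriverAlarmSourceAcknowledger(driver)
    if isinstance(driver, Writable):
        return DriverWritableAcknowledger(driver, ack_msg_write_ref)
    raise TypeError("driver supports neither acknowledgement path")


class FakeNativeDriver:
    def __init__(self):
        self.acks = []

    def acknowledge(self, condition_id, comment):
        self.acks.append((condition_id, comment))


class FakeWritableDriver:
    def __init__(self):
        self.writes = []

    def write(self, ref, value):
        self.writes.append((ref, value))
```

The choice is made once, at condition registration, so the acknowledge handler itself never needs to inspect the driver's capabilities.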
The OPC UA Part 9 AlarmConditionState.OnAcknowledge delegate
already validates the session's AlarmAck role before dispatching,
so the gateway-side ack RPC only sees authenticated, authorised
calls.
Inbound operator ack/shelve — scripted alarms
Scripted alarms use a separate inbound path that converges on the
alarm-commands DPS topic. Two surfaces route onto this topic:
OPC UA Part 9 method path (external OPC UA clients)
OtOpcUaNodeManager wires the Part 9 condition methods (Acknowledge /
Confirm / AddComment / OneShotShelve / TimedShelve / Unshelve) on each
scripted-alarm AlarmConditionState node. Every call is gated on the
AlarmAck LDAP role — fail-closed: sessions with no role or without
AlarmAck group membership receive BadUserAccessDenied immediately.
The LDAP-resolved role set is carried past OpcUaApplicationHost by
RoleCarryingUserIdentity (a UserIdentity subclass), making it
readable inside the method handler at dispatch time.
On allow, the handler publishes a Commons.OpcUa.AlarmCommand onto the
alarm-commands DPS topic. The node manager is Akka-free; the dispatch
action is a settable Action<AlarmCommand> injected at boot by the
hosted service.
OnTimedUnshelve (the SDK's automatic unshelve timer) bypasses the
operator gate — it is system-initiated.
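A minimal sketch of the fail-closed gate and the injected dispatch action, in Python rather than the project's C#. ConditionMethodsSketch, AlarmCommandSketch, and the string status codes are hypothetical. In particular, the source states only that the timed unshelve bypasses the operator gate; publishing a command for it is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional

GOOD = "Good"
BAD_USER_ACCESS_DENIED = "BadUserAccessDenied"


@dataclass(frozen=True)
class AlarmCommandSketch:
    alarm_id: str
    operation: str
    user: Optional[str]
    comment: str = ""


class ConditionMethodsSketch:
    def __init__(self):
        # Settable dispatch action; a no-op until the host injects one.
        self.dispatch = lambda command: None

    def handle_operator_call(self, alarm_id, operation, user, roles, comment=""):
        # Fail-closed: no role set, or a role set without AlarmAck, is denied.
        if not roles or "AlarmAck" not in roles:
            return BAD_USER_ACCESS_DENIED
        self.dispatch(AlarmCommandSketch(alarm_id, operation, user, comment))
        return GOOD

    def handle_timed_unshelve(self, alarm_id):
        # System-initiated: no session and therefore no role check.
        self.dispatch(AlarmCommandSketch(alarm_id, "Unshelve", user=None))
        return GOOD


published = []
methods = ConditionMethodsSketch()
methods.dispatch = published.append  # injected at boot by the host

denied = methods.handle_operator_call("A1", "Acknowledge", "eve", frozenset())
allowed = methods.handle_operator_call(
    "A1", "Acknowledge", "bob", frozenset({"AlarmAck"}), "ok"
)
timer = methods.handle_timed_unshelve("A1")
```

Keeping dispatch as a plain callable is what lets the node manager stay free of any messaging dependency: it never needs to know that the action publishes onto a topic.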
WriteAlarmCondition fires the Part 9 condition event only when the
incoming state differs from the node's current live state (delta-gate),
preventing the double-emit that would otherwise occur when the SDK
auto-applies the acked state and the engine re-projection fires a
duplicate event immediately after.
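The delta-gate can be sketched as a simple equality check before emitting, shown here in Python rather than the project's C#. ConditionSnapshot, its fields, and ConditionNodeSketch are hypothetical; the real node state carries more than three flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionSnapshot:
    active: bool
    acked: bool
    shelved: bool = False


class ConditionNodeSketch:
    def __init__(self, initial, emit):
        self.state = initial
        self._emit = emit

    def write_alarm_condition(self, incoming):
        """Apply incoming state; emit an event only if something changed."""
        if incoming == self.state:
            return False  # delta-gate: nothing changed, no event
        self.state = incoming
        self._emit(incoming)
        return True


events = []
node = ConditionNodeSketch(ConditionSnapshot(active=True, acked=False), events.append)

# The SDK auto-applies the acked state on the node (its own event is not modeled here).
node.state = ConditionSnapshot(active=True, acked=True)

# The engine re-projection then arrives with the same state: gated, no event.
reprojected = node.write_alarm_condition(ConditionSnapshot(active=True, acked=True))

# A genuine change still fires.
cleared = node.write_alarm_condition(ConditionSnapshot(active=False, acked=True))
```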
AdminUI path
The /alerts page shows per-row Acknowledge / Shelve / Unshelve
buttons gated by the DriverOperator AdminUI policy. These route
through the AdminOperationsActor cluster singleton
(AcknowledgeAlarmCommand / ShelveAlarmCommand), which publishes onto
the same alarm-commands topic. The singleton handles cross-node
routing — the command always reaches the driver-role node owning the
engine regardless of which AdminUI instance the operator is on.
ScriptedAlarmHostActor dispatch
ScriptedAlarmHostActor subscribes to the alarm-commands topic,
ownership-filters each command (each node only acts on its own alarms),
and dispatches to the matching ScriptedAlarmEngine operation
(AcknowledgeAsync / ConfirmAsync / OneShotShelveAsync /
TimedShelveAsync / UnshelveAsync / EnableAsync / DisableAsync /
AddCommentAsync). The engine's existing OnEvent callback handles
the OPC UA node update — no explicit re-projection is required.
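The ownership filter and operation dispatch can be sketched as a lookup table, shown here in Python rather than the project's C#. HostSketch, the dictionary-shaped command, and the recording engine operations are hypothetical; only the eight operation names are taken from the text.

```python
from functools import partial

OPERATIONS = (
    "Acknowledge", "Confirm", "OneShotShelve", "TimedShelve",
    "Unshelve", "Enable", "Disable", "AddComment",
)


class HostSketch:
    def __init__(self, owned_alarm_ids, engine_ops):
        self._owned = set(owned_alarm_ids)
        self._engine_ops = engine_ops  # operation name -> callable(alarm_id, user)

    def on_command(self, command):
        """Return True if this node acted on the command."""
        # Ownership filter: ignore commands for alarms hosted elsewhere.
        if command["alarm_id"] not in self._owned:
            return False
        operation = self._engine_ops.get(command["operation"])
        if operation is None:
            raise ValueError(f"unknown operation: {command['operation']}")
        operation(command["alarm_id"], command.get("user"))
        return True


calls = []


def record(name, alarm_id, user):
    calls.append((name, alarm_id, user))


# Stand-in engine: each operation just records that it was invoked.
engine_ops = {name: partial(record, name) for name in OPERATIONS}
host = HostSketch({"A1", "A2"}, engine_ops)

handled = host.on_command({"alarm_id": "A1", "operation": "TimedShelve", "user": "bob"})
ignored = host.on_command({"alarm_id": "B9", "operation": "Acknowledge", "user": "bob"})
```

Because every node receives every command from the topic, the ownership filter is what keeps exactly one engine acting on each alarm.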
The AdminUI /alerts Shelve flow was live-verified on docker-dev
2026-06-11: singleton → topic → host actor → engine → "Shelved" status
reflected on /alerts with the operator identity threaded through.
Redundancy deduplication
Under warm/hot redundancy, both cluster nodes run ScriptedAlarmHostActor and the scripted-alarm engine. To prevent duplicate /alerts rows and duplicate historian writes, alarm transition publication to the alerts topic and HistorianAdapterActor historization are Primary-gated: only the node whose RedundancyRole is Primary publishes externally. OPC UA condition-node writes and inbound ack/shelve processing remain ungated on both nodes so the secondary stays warm for failover. See Redundancy.md §Primary-gated alarm emission and historization.
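The split between gated and ungated work can be sketched as follows, in Python rather than the project's C#. NodeSketch and its list-based topic and historian stand-ins are hypothetical, and the sketch collapses the alerts publish and the historian write into one gated branch for brevity.

```python
class NodeSketch:
    def __init__(self, role, alerts_topic, historian_rows):
        self.role = role  # "Primary" or "Secondary"
        self.condition_nodes = {}
        self._alerts_topic = alerts_topic
        self._historian_rows = historian_rows

    def on_transition(self, alarm_id, state):
        # Ungated: both nodes keep their OPC UA condition nodes current,
        # so the secondary is warm when it takes over.
        self.condition_nodes[alarm_id] = state
        # Primary-gated: only one node publishes externally.
        if self.role != "Primary":
            return
        self._alerts_topic.append((alarm_id, state))
        self._historian_rows.append((alarm_id, state))


alerts, historian = [], []
primary = NodeSketch("Primary", alerts, historian)
secondary = NodeSketch("Secondary", alerts, historian)

for node in (primary, secondary):
    node.on_transition("A1", "Active")

# Failover: the secondary is promoted and starts publishing.
secondary.role = "Primary"
secondary.on_transition("A1", "Cleared")
```

After the first transition both nodes hold the same condition state, yet the alerts list and the historian list each contain a single row.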
Historian write-back (non-Galaxy alarms)
Scripted alarms (and any future non-Galaxy IAlarmSource like
AB CIP ALMD) route to AVEVA Historian via the Wonderware sidecar:
- IAlarmHistorianSink is the DI-registered intake contract. The default binding is NullAlarmHistorianSink (registered in ServiceCollectionExtensions.AddOtOpcUaRuntime). Production deployments override it with SqliteStoreAndForwardSink wrapping WonderwareHistorianClient (the AVEVA Historian sidecar IPC client); see ServiceHosting.md for the sidecar setup.
- SqliteStoreAndForwardSink queues each transition to a local SQLite database and drains in the background via an IAlarmHistorianWriter. The durability guarantee is bounded: the queue capacity defaults to 1,000,000 rows; under a sustained historian outage, older non-dead-lettered rows are evicted (oldest first) to make room for new events. The HistorianSinkStatus.EvictedCount counter surfaces lifetime eviction events so operators can detect silent data loss without log scraping. The drain cadence, queue capacity, and dead-letter retention are tunable via the AlarmHistorian config section (DrainIntervalSeconds, Capacity, DeadLetterRetentionDays); AlarmHistorianOptions.Validate() logs a startup warning for an empty SharedSecret, a relative DatabasePath, or a non-positive knob.
- HistorianAdapterActor (src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs) subscribes to the cluster alerts DPS topic, translates each AlarmTransitionEvent → AlarmHistorianEvent, and calls EnqueueAsync fire-and-forget. The durable write is gated two ways: (1) Primary-gated: only the Primary node historizes, giving exactly-once writes across a redundant pair; (2) per-alarm: a transition whose HistorizeToAveva is false is skipped (the flag rides on AlarmTransitionEvent as a nullable bool; missing/null/true historize, only an explicit false suppresses, so a cross-version rolling restart defaults to historizing rather than dropping an audit row). Neither gate touches the live alerts publish; the /alerts UI always sees every transition. See AlarmHistorian.md §Configuration for the AlarmHistorian appsettings section that enables the real sink.
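Two of the rules above, the tri-state historize flag and the bounded queue with oldest-first eviction, can be sketched in Python rather than the project's C#. This is an in-memory stand-in: it does not use SQLite, it does not model dead-lettering, and should_historize and BoundedQueueSketch are hypothetical names.

```python
from collections import deque


def should_historize(is_primary, historize_to_aveva):
    """Both gates: Primary only, and only an explicit False suppresses."""
    return is_primary and historize_to_aveva is not False


class BoundedQueueSketch:
    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._rows = deque()
        self.evicted_count = 0  # lifetime evictions, for operators

    def enqueue(self, event):
        # At capacity, evict the oldest row to make room for the new one.
        if len(self._rows) >= self._capacity:
            self._rows.popleft()
            self.evicted_count += 1
        self._rows.append(event)

    def drain(self, write):
        """Write rows oldest first; stop at the first failed write."""
        drained = 0
        while self._rows:
            if not write(self._rows[0]):
                break  # historian unavailable; keep the row for next time
            self._rows.popleft()
            drained += 1
        return drained


queue = BoundedQueueSketch(capacity=3)
for n in range(5):  # historian is down, so nothing drains in between
    queue.enqueue(n)

written = []
failed_drain = queue.drain(lambda row: False)
ok_drain = queue.drain(lambda row: written.append(row) or True)
```

With a capacity of 3 and five queued events, the two oldest rows are evicted and counted; that counter is the only trace of the loss, which is why the text stresses surfacing it.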
Galaxy-native alarms with $Alarm* extensions reach AVEVA Historian
directly via System Platform's HistorizeToAveva toggle on the
alarm primitive — no involvement from OtOpcUa. This sidecar path is
exclusively for non-Galaxy alarm producers.
Cross-references
- Plan: docs/plans/alarms-over-gateway.md
- v1 archive: docs/v1/AlarmTracking.md
- Galaxy driver: docs/drivers/Galaxy.md
- Phase 7 scripting + alarming: docs/v2/implementation/phase-7-scripting-and-alarming.md
- Security + ACL: docs/security.md