# Alarm tracking — v2 final architecture

This document describes how OtOpcUa surfaces alarms to OPC UA Part 9 clients after the **alarms-over-gateway** epic ([docs/plans/alarms-over-gateway.md](plans/alarms-over-gateway.md)) landed. The v1 architecture (Galaxy.Host's COM-side `GalaxyAlarmTracker`) is preserved at [docs/v1/AlarmTracking.md](v1/AlarmTracking.md) for historical reference.

## Three alarm sources, one OPC UA Part 9 surface

| Source | Driver capability | Path |
|----------------------------------|--------------------------|------|
| **Galaxy MxAccess (driver-native)** | `GalaxyDriver : IAlarmSource` | gateway → worker → MxAccess alarm sink → `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` → `EventPump` → driver `OnAlarmEvent` → `AlarmConditionService` |
| **Galaxy sub-attribute fallback** | `IWritable` writes to `$Alarm*` sub-attributes | gateway data subscription → driver `OnDataChange` → `DriverNodeManager` ConditionSink → `AlarmConditionService` |
| **Scripted alarms** | `Phase7Composer` | server-side script evaluator → `ScriptedAlarmActor` transitions → `HistorianAdapterActor` → `IAlarmHistorianSink` |

All three converge on the alarm-state actor — in v2 the OPC UA Part 9 state machine lives inside `ScriptedAlarmActor` (`src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmActor.cs`), which dispatches transitions to the OPC UA condition node managers. Driver-native transitions take precedence over sub-attribute synthesis when both arrive for the same condition — the dedup logic prefers the richer driver-native record because it carries the full operator + raise-time + category metadata that the value-driven path collapses.

## Galaxy driver path (driver-native)

Restored in PR B.2 of the epic. `GalaxyDriver` implements `IAlarmSource` with these surfaces:

- `SubscribeAlarmsAsync(sourceNodeIds)` → returns a sentinel handle. The driver doesn't multiplex per source-node-id today; every active handle observes the gateway's alarm-event stream. The server-side `AlarmConditionService` filters by source-node before raising the OPC UA condition.
- `UnsubscribeAlarmsAsync(handle)` → symmetric handle removal.
- `AcknowledgeAsync(requests)` → routes one gateway RPC per acknowledgement through `IGalaxyAlarmAcknowledger`. Production uses `GatewayGalaxyAlarmAcknowledger` calling `MxGatewayClient.AcknowledgeAlarmAsync` (PR E.2 SDK method).
- `OnAlarmEvent` → bridges `EventPump.OnAlarmTransition` (PR B.1) onto `AlarmEventArgs`. Suppressed when no alarm subscription is active so untracked transitions don't leak through.

The proto contract carries the rich payload — alarm full reference, source-object reference, alarm-type-name, transition kind (Raise / Acknowledge / Clear / Retrigger), severity (raw MxAccess scale), original raise timestamp, transition timestamp, operator user, operator comment, alarm category, description. `MxAccessSeverityMapper` (PR B.1) translates the raw severity onto the four-bucket `AlarmSeverity` ladder — boundaries match v1's `GalaxyAlarmTracker` so customers see no surprise re-classification.

The richer fields surface on `Core.Abstractions.AlarmEventArgs` via the optional properties added in PR E.7 (`OperatorComment`, `OriginalRaiseTimestampUtc`, `AlarmCategory`). Consumers that don't need them are unaffected; consumers that do (Client.UI, Client.CLI verbose mode) read the new fields when present.

## Galaxy sub-attribute fallback

For Galaxy templates without `$Alarm*` extensions, the value-driven path stays in place: `DriverNodeManager` registers an `AlarmConditionState` per Galaxy variable that carries alarm-bearing sub-attributes (`InAlarm`, `Acked`, `Priority`, `Description`), subscribes to those sub-attributes, and synthesizes Part 9 transitions when the values change. This path operated as the only Galaxy alarm path between PR 7.2 and the alarms-over-gateway epic; it remains the fallback today.
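
As a rough illustration of the value-driven synthesis described above, the sketch below derives a transition from two consecutive snapshots of the `InAlarm` and `Acked` sub-attributes. It is written in Python for brevity and is not the `DriverNodeManager` code: `SubAttributeSnapshot`, `synthesize_transition`, and the order in which the checks run are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Transition(Enum):
    # Transition kinds named in this document (Retrigger omitted for brevity).
    RAISE = "Raise"
    ACKNOWLEDGE = "Acknowledge"
    CLEAR = "Clear"


@dataclass(frozen=True)
class SubAttributeSnapshot:
    # Hypothetical holder for the last values seen on the sub-attributes.
    in_alarm: bool
    acked: bool


def synthesize_transition(
    prev: SubAttributeSnapshot, curr: SubAttributeSnapshot
) -> Optional[Transition]:
    """Compare two snapshots and name the transition, or None if nothing changed."""
    # Assumption: an InAlarm edge takes priority over an Acked edge.
    if curr.in_alarm and not prev.in_alarm:
        return Transition.RAISE
    if prev.in_alarm and not curr.in_alarm:
        return Transition.CLEAR
    if curr.acked and not prev.acked:
        return Transition.ACKNOWLEDGE
    return None
```

Note what the sketch cannot produce: operator comment, original raise time, and category are absent from the inputs, which is the metadata the value-driven path collapses.
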
When both paths report the same condition, `AlarmConditionService.AlarmConditionState` keeps the driver-native record and discards the duplicate sub-attribute synthesis. Driver-native transitions are richer (they carry operator comment + original raise time) and arrive with lower latency (no publishing-interval delay on the sub-attribute reads), so they win the dedup.

## Acknowledge routing — Galaxy / driver alarms

`DriverNodeManager` picks the acknowledger when registering each condition (PR B.3 logic):

- Driver implements `IAlarmSource` → `DriverAlarmSourceAcknowledger` routes the operator comment through `IAlarmSource.AcknowledgeAsync` via the existing `AlarmSurfaceInvoker` (Phase 6.1 resilience pipeline; no-retry per decision #143). End-to-end operator-comment fidelity is preserved.
- Driver doesn't implement `IAlarmSource` → `DriverWritableAcknowledger` writes the comment into the `AckMsgWriteRef` sub-attribute via `IWritable.WriteAsync`. Same resilience pipeline; collapses the comment into a single string write at the wire level.

The OPC UA Part 9 `AlarmConditionState.OnAcknowledge` delegate already validates the session's `AlarmAck` role before dispatching, so the gateway-side ack RPC only sees authenticated, authorised calls.

## Inbound operator ack/shelve — scripted alarms

Scripted alarms use a separate inbound path that converges on the `alarm-commands` DPS topic. Two surfaces route onto this topic:

### OPC UA Part 9 method path (external OPC UA clients)

`OtOpcUaNodeManager` wires the Part 9 condition methods (Acknowledge / Confirm / AddComment / OneShotShelve / TimedShelve / Unshelve) on each scripted-alarm `AlarmConditionState` node. Every call is **gated on the `AlarmAck` LDAP role** — fail-closed: sessions with no role or without `AlarmAck` group membership receive `BadUserAccessDenied` immediately.
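
The fail-closed gate can be pictured with a small Python sketch. It is illustrative only and not the node manager's handler: the function name `check_alarm_method_access` and the use of plain strings for status names are assumptions. The role name `AlarmAck` and the `BadUserAccessDenied` result come from the text above.

```python
from typing import Iterable, Optional

BAD_USER_ACCESS_DENIED = "BadUserAccessDenied"
GOOD = "Good"
REQUIRED_ROLE = "AlarmAck"


def check_alarm_method_access(roles: Optional[Iterable[str]]) -> str:
    """Return a status name; access is denied unless the role is positively present."""
    # No role information at all: deny rather than fall through to an allow.
    if roles is None:
        return BAD_USER_ACCESS_DENIED
    # Roles resolved, but the required one is missing (covers the empty set too).
    if REQUIRED_ROLE not in set(roles):
        return BAD_USER_ACCESS_DENIED
    return GOOD
```

The design point is that every path other than an explicit match returns the denial, so a session whose roles failed to resolve is treated the same as one that lacks the role.
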
The LDAP-resolved role set is carried past `OpcUaApplicationHost` by `RoleCarryingUserIdentity` (a `UserIdentity` subclass), making it readable inside the method handler at dispatch time. On allow, the handler publishes a `Commons.OpcUa.AlarmCommand` onto the `alarm-commands` DPS topic. The node manager is Akka-free; the dispatch action is a settable `Action` injected at boot by the hosted service.

`OnTimedUnshelve` (the SDK's automatic unshelve timer) bypasses the operator gate — it is system-initiated.

`WriteAlarmCondition` fires the Part 9 condition event only when the incoming state differs from the node's current live state (delta-gate), preventing the double-emit that would otherwise occur when the SDK auto-applies the acked state and the engine re-projection fires a duplicate event immediately after.

### AdminUI path

The `/alerts` page shows per-row **Acknowledge / Shelve / Unshelve** buttons gated by the `DriverOperator` AdminUI policy. These route through the `AdminOperationsActor` cluster singleton (`AcknowledgeAlarmCommand` / `ShelveAlarmCommand`), which publishes onto the same `alarm-commands` topic. The singleton handles cross-node routing — the command always reaches the driver-role node owning the engine regardless of which AdminUI instance the operator is on.

### ScriptedAlarmHostActor dispatch

`ScriptedAlarmHostActor` subscribes to the `alarm-commands` topic, ownership-filters each command (each node only acts on its own alarms), and dispatches to the matching `ScriptedAlarmEngine` operation (`AcknowledgeAsync` / `ConfirmAsync` / `OneShotShelveAsync` / `TimedShelveAsync` / `UnshelveAsync` / `EnableAsync` / `DisableAsync` / `AddCommentAsync`). The engine's existing `OnEvent` callback handles the OPC UA node update — no explicit re-projection is required.
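
The ownership filter plus dispatch step can be sketched as follows. This is a simplified Python illustration, not the actor's code: `AlarmCommandSketch` and its fields, the `operations` lookup table, and the boolean return value are all hypothetical, since the real `Commons.OpcUa.AlarmCommand` shape is not shown in this document.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass(frozen=True)
class AlarmCommandSketch:
    # Hypothetical fields; stands in for the real command message.
    alarm_id: str
    kind: str  # e.g. "Acknowledge", "TimedShelve", "Unshelve"
    user: str
    comment: str = ""


def dispatch_alarm_command(
    command: AlarmCommandSketch,
    owned_alarm_ids: Set[str],
    operations: Dict[str, Callable[[AlarmCommandSketch], None]],
) -> bool:
    """Run the matching engine operation if this node owns the alarm."""
    # Ownership filter: every node sees the topic, only the owner acts.
    if command.alarm_id not in owned_alarm_ids:
        return False
    handler = operations.get(command.kind)
    if handler is None:
        # Unknown command kind: ignore rather than guess an operation.
        return False
    handler(command)
    return True
```

A node that owns alarm `a1` would call `dispatch_alarm_command(cmd, {"a1"}, {"Acknowledge": engine_ack})`, where `engine_ack` is whatever callable wraps the engine operation. Commands for alarms owned elsewhere return `False` without side effects.
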
The AdminUI `/alerts` Shelve flow was live-verified on docker-dev 2026-06-11: singleton → topic → host actor → engine → "Shelved" status reflected on `/alerts` with the operator identity threaded through.

## Historian write-back (non-Galaxy alarms)

Scripted alarms (and any future non-Galaxy `IAlarmSource` like AB CIP ALMD) route to AVEVA Historian via the Wonderware sidecar:

- `IAlarmHistorianSink` is the DI-registered intake contract. The default binding is `NullAlarmHistorianSink` (registered in `ServiceCollectionExtensions.AddOtOpcUaRuntime`). Production deployments override it with `SqliteStoreAndForwardSink` wrapping `WonderwareHistorianClient` (the AVEVA Historian sidecar IPC client) — see [ServiceHosting.md](ServiceHosting.md) for the sidecar setup.
- `SqliteStoreAndForwardSink` queues each transition to a local SQLite database and drains in the background via an `IAlarmHistorianWriter`. **The durability guarantee is bounded**: the queue capacity defaults to 1,000,000 rows; under a sustained historian outage, older non-dead-lettered rows are evicted (oldest first) to make room for new events. The `HistorianSinkStatus.EvictedCount` counter surfaces lifetime eviction events so operators can detect silent data loss without log scraping.
- `HistorianAdapterActor` (`src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs`) bridges Akka cluster messages from `ScriptedAlarmActor` into the sink's `EnqueueAsync`; fire-and-forget so the actor loop is never blocked on historian reachability.

Galaxy-native alarms with `$Alarm*` extensions reach AVEVA Historian directly via System Platform's `HistorizeToAveva` toggle on the alarm primitive — no involvement from OtOpcUa. This sidecar path is exclusively for non-Galaxy alarm producers.
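
To make the bounded durability guarantee of the store-and-forward queue concrete, here is a minimal in-memory Python sketch of oldest-first eviction with a lifetime counter. It is an assumption-laden model rather than `SqliteStoreAndForwardSink` itself: a `deque` stands in for the SQLite table, dead-lettered rows are not modelled, and `BoundedAlarmQueue` with its members are hypothetical names.

```python
from collections import deque
from typing import Deque, List


class BoundedAlarmQueue:
    """In-memory stand-in for a capacity-limited store-and-forward queue."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._rows: Deque[str] = deque()
        # Lifetime number of rows dropped to make room (never reset).
        self.evicted_count = 0

    def enqueue(self, row: str) -> None:
        # Evict from the oldest end until there is room for the new row.
        while len(self._rows) >= self._capacity:
            self._rows.popleft()
            self.evicted_count += 1
        self._rows.append(row)

    def pending(self) -> List[str]:
        # Rows still waiting to be drained, oldest first.
        return list(self._rows)
```

With a capacity of 2, enqueueing three rows leaves the two newest pending and sets `evicted_count` to 1. A non-zero counter during a historian outage is the signal that older transitions never reached the historian.
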
## Cross-references

- Plan: [docs/plans/alarms-over-gateway.md](plans/alarms-over-gateway.md)
- v1 archive: [docs/v1/AlarmTracking.md](v1/AlarmTracking.md)
- Galaxy driver: [docs/drivers/Galaxy.md](drivers/Galaxy.md)
- Phase 7 scripting + alarming: [docs/v2/implementation/phase-7-scripting-and-alarming.md](v2/implementation/phase-7-scripting-and-alarming.md)
- Security + ACL: [docs/security.md](security.md)