Files
lmxopcua/docs/AlarmTracking.md
Joseph Doherty 21e0fdd4cd Docs audit — fill gaps so the top-level docs/ reference matches shipped code
Audit of docs/ against src/ surfaced shipped features without current-reference
coverage (FOCAS CLI, Core.Scripting+VirtualTags, Core.ScriptedAlarms,
Core.AlarmHistorian), an out-of-date driver count + capability matrix, ADR-002's
virtual-tag dispatch not reflected in data-path docs, broken cross-references,
and OpcUaServerReqs declaring OPC-020..022 that were never scoped. This commit
closes all of those so operators + integrators can stay inside docs/ without
falling back to v2/implementation/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 09:42:42 -04:00

129 lines
12 KiB
Markdown

# Alarm Tracking
Alarm surfacing is an optional driver capability exposed via `IAlarmSource` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs`). Drivers whose backends have an alarm concept implement it — today: Galaxy (MXAccess alarms), FOCAS (CNC alarms), OPC UA Client (A&C events from the upstream server). Modbus / S7 / AB CIP / AB Legacy / TwinCAT do not implement the interface and the feature is simply absent from their subtrees.
## IAlarmSource surface
```csharp
Task<IAlarmSubscriptionHandle> SubscribeAlarmsAsync(
IReadOnlyList<string> sourceNodeIds, CancellationToken cancellationToken);
Task UnsubscribeAlarmsAsync(IAlarmSubscriptionHandle handle, CancellationToken cancellationToken);
Task AcknowledgeAsync(IReadOnlyList<AlarmAcknowledgeRequest> acknowledgements,
CancellationToken cancellationToken);
event EventHandler<AlarmEventArgs>? OnAlarmEvent;
```
The driver fires `OnAlarmEvent` for every transition (`Active`, `Acknowledged`, `Inactive`) with an `AlarmEventArgs` carrying the source node id, condition id, alarm type, message, severity (`AlarmSeverity` enum), and source timestamp.
## AlarmSurfaceInvoker
`AlarmSurfaceInvoker` (`src/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs`) wraps the three mutating surfaces through `CapabilityInvoker`:
- `SubscribeAlarmsAsync` / `UnsubscribeAlarmsAsync` run through the `DriverCapability.AlarmSubscribe` pipeline — retries apply under the tier configuration.
- `AcknowledgeAsync` runs through `DriverCapability.AlarmAcknowledge` which does NOT retry per decision #143. A timed-out ack may have already registered at the plant floor; replay would silently double-acknowledge.
Multi-host fan-out: when the driver implements `IPerCallHostResolver`, each source node id is resolved individually and batches are grouped by host so a dead PLC inside a multi-device driver doesn't poison sibling breakers. Single-host drivers fall back to `IDriver.DriverInstanceId` as the pipeline-key host.
## Condition-node creation via CapturingBuilder
Alarm-condition nodes are materialized at address-space build time. During `GenericDriverNodeManager.BuildAddressSpaceAsync` the builder is wrapped in a `CapturingBuilder` that observes every `Variable()` call. When a driver calls `IVariableHandle.MarkAsAlarmCondition(AlarmConditionInfo)` on a returned handle, the server-side `DriverNodeManager.VariableHandle` creates a sibling `AlarmConditionState` node and returns an `IAlarmConditionSink`. The wrapper stores the sink in `_alarmSinks` keyed by the variable's full reference, then `GenericDriverNodeManager` registers a forwarder on `IAlarmSource.OnAlarmEvent` that routes each push to the matching sink by `SourceNodeId`. Unknown source ids are dropped silently — they may belong to another driver.
The `AlarmConditionState` layout matches OPC UA Part 9:
- `SourceNode` → the originating variable
- `SourceName` / `ConditionName` → from `AlarmConditionInfo.SourceName`
- Initial state: enabled, inactive, acknowledged, severity per `InitialSeverity`, retain false
- `HasCondition` references wire the source variable ↔ the condition node bidirectionally
Drivers flag alarm-bearing variables at discovery time via `DriverAttributeInfo.IsAlarm = true`. The Galaxy driver, for example, sets this on attributes that have an `AlarmExtension` primitive in the Galaxy repository DB; FOCAS sets it on the CNC alarm register.
## State transitions
`ConditionSink.OnTransition` runs under the node manager's `Lock` and maps the `AlarmEventArgs.AlarmType` string to Part 9 state:
| AlarmType | Action |
|---|---|
| `Active` | `SetActiveState(true)`, `SetAcknowledgedState(false)`, `Retain = true` |
| `Acknowledged` | `SetAcknowledgedState(true)` |
| `Inactive` | `SetActiveState(false)`; `Retain = false` once both inactive and acknowledged |
Severity is remapped: `AlarmSeverity.Low/Medium/High/Critical` → OPC UA numeric 250 / 500 / 700 / 900. `Message.Value` is set from `AlarmEventArgs.Message` on every transition. `ClearChangeMasks(true)` and `ReportEvent(condition)` fire the OPC UA event notification for clients subscribed to any ancestor notifier.
## Acknowledge dispatch
Alarm acknowledgement initiated by an OPC UA client flows:
1. The SDK invokes the `AlarmConditionState.OnAcknowledge` method delegate.
2. The handler checks the session's roles for `AlarmAck` — drivers never see a request the session wasn't entitled to make.
3. `AlarmSurfaceInvoker.AcknowledgeAsync` is called with the source / condition / comment tuple. The invoker groups by host and runs each batch through the no-retry `AlarmAcknowledge` pipeline.
Drivers return normally for success or throw to signal the ack failed at the backend.
## EventNotifier propagation
Drivers that want hierarchical alarm subscriptions propagate `EventNotifier.SubscribeToEvents` up the containment chain during discovery — the Galaxy driver flips the flag on every ancestor of an alarm-bearing object up to the driver root, mirroring v1 behavior. Clients subscribed at the driver root, a mid-level folder, or the `Objects/` root see alarm events from every descendant with an `AlarmConditionState` sibling. The driver-root `FolderState` is created in `DriverNodeManager.CreateAddressSpace` with `EventNotifier = SubscribeToEvents | HistoryRead` so alarm event subscriptions and alarm history both have a single natural target.
## ConditionRefresh
The OPC UA `ConditionRefresh` service queues the current state of every retained condition back to the requesting monitored items. `DriverNodeManager` iterates the node manager's `AlarmConditionState` collection and queues each condition whose `Retain.Value == true` — matching the Part 9 requirement.
## Alarm historian sink
Distinct from the live `IAlarmSource` stream and the Part 9 `AlarmConditionState` materialization above, qualifying alarm transitions are **also** persisted to a durable event log for downstream AVEVA Historian ingestion. This is a separate subsystem from the `IHistoryProvider` capability used by `HistoryReadEvents` (see [HistoricalDataAccess.md](HistoricalDataAccess.md#alarm-event-history-vs-ihistoryprovider)): the sink is a *producer* path (server → Historian) that runs independently of any client HistoryRead call.
### `IAlarmHistorianSink`
`src/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs` defines the intake contract:
```csharp
Task EnqueueAsync(AlarmHistorianEvent evt, CancellationToken cancellationToken);
HistorianSinkStatus GetStatus();
```
`EnqueueAsync` is fire-and-forget from the producer's perspective — it must never block the emitting thread. The event payload (`AlarmHistorianEvent` — same file) is source-agnostic: `AlarmId`, `EquipmentPath`, `AlarmName`, `AlarmTypeName` (Part 9 subtype name), `Severity`, `EventKind` (free-form transition string — `Activated` / `Cleared` / `Acknowledged` / `Confirmed` / `Shelved` / …), `Message`, `User`, `Comment`, `TimestampUtc`.
The sink scope is defined to span every alarm source (plan decision #15: scripted, Galaxy-native, AB CIP ALMD, any future `IAlarmSource`), gated per-alarm by a `HistorizeToAveva` toggle on the producer. Today only `Phase7EngineComposer.RouteToHistorianAsync` (`src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7EngineComposer.cs`) is wired — it subscribes to `ScriptedAlarmEngine.OnEvent` and marshals each emission into `AlarmHistorianEvent`. Galaxy-native alarms continue to reach AVEVA Historian via the driver's direct `aahClientManaged` path and do not flow through the sink; the AB CIP ALMD path remains unwired pending a producer-side integration.
### `SqliteStoreAndForwardSink`
Default production implementation (`src/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs`). A local SQLite queue absorbs every `EnqueueAsync` synchronously; a background `Timer` drains batches asynchronously to an `IAlarmHistorianWriter` so operator actions are never blocked on historian reachability.
Queue schema (single table `Queue`): `RowId PK autoincrement`, `AlarmId`, `EnqueuedUtc`, `PayloadJson` (serialized `AlarmHistorianEvent`), `AttemptCount`, `LastAttemptUtc`, `LastError`, `DeadLettered` (bool), plus `IX_Queue_Drain (DeadLettered, RowId)`. Default capacity `1_000_000` non-dead-lettered rows; oldest rows evict with a WARN log past the cap.
Drain cadence: `StartDrainLoop(tickInterval)` arms a periodic timer. `DrainOnceAsync` reads up to `batchSize` rows (default 100) in `RowId` order and forwards them through `IAlarmHistorianWriter.WriteBatchAsync`, which returns one `HistorianWriteOutcome` per row:
| Outcome | Action |
|---|---|
| `Ack` | Row deleted. |
| `PermanentFail` | Row flipped to `DeadLettered = 1` with reason. Peers in the batch retry independently. |
| `RetryPlease` | `AttemptCount` bumped; row stays queued. Drain worker enters `BackingOff`. |
Writer-side exceptions treat the whole batch as `RetryPlease`.
Backoff ladder on `RetryPlease` (hard-coded): 1s → 2s → 5s → 15s → 60s cap. Reset to 0 on any batch with no retries. `CurrentBackoff` exposes the current step for instrumentation; the drain timer itself fires on `tickInterval`, so the ladder governs write cadence rather than timer period.
Dead-letter retention defaults to 30 days (plan decision #21). `PurgeAgedDeadLetters` runs each drain pass and deletes rows whose `LastAttemptUtc` is past the cutoff. `RetryDeadLettered()` is an operator action that clears `DeadLettered` + resets `AttemptCount` on every dead-lettered row so they rejoin the main queue.
### Composition and writer resolution
`Phase7Composer.ResolveHistorianSink` (`src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7Composer.cs`) scans the registered drivers for one that implements `IAlarmHistorianWriter`. Today that is `GalaxyProxyDriver` via `GalaxyHistorianWriter` (`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/Ipc/GalaxyHistorianWriter.cs`), which forwards batches over the Galaxy.Host pipe to the `aahClientManaged` alarm schema. When a writer is found, a `SqliteStoreAndForwardSink` is instantiated against `%ProgramData%/OtOpcUa/alarm-historian-queue.db` with a 2 s drain tick and the writer attached. When no driver provides a writer the fallback is the DI-registered `NullAlarmHistorianSink` (`src/ZB.MOM.WW.OtOpcUa.Server/Program.cs`), which silently discards and reports `HistorianDrainState.Disabled`.
### Status and observability
`GetStatus()` returns `HistorianSinkStatus(QueueDepth, DeadLetterDepth, LastDrainUtc, LastSuccessUtc, LastError, DrainState)` — two `COUNT(*)` scalars plus last-drain telemetry. `DrainState` is one of `Disabled` / `Idle` / `Draining` / `BackingOff`.
The Admin UI `/alarms/historian` page surfaces this through `HistorianDiagnosticsService` (`src/ZB.MOM.WW.OtOpcUa.Admin/Services/HistorianDiagnosticsService.cs`), which also exposes `TryRetryDeadLettered` — it calls through to `SqliteStoreAndForwardSink.RetryDeadLettered` when the live sink is the SQLite implementation and returns 0 otherwise.
## Key source files
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs` — capability contract + `AlarmEventArgs`
- `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs` — per-host fan-out + no-retry ack
- `src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs``CapturingBuilder` + alarm forwarder
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs``VariableHandle.MarkAsAlarmCondition` + `ConditionSink`
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/Alarms/GalaxyAlarmTracker.cs` — Galaxy-specific alarm-event production
- `src/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs` — historian sink intake contract + `AlarmHistorianEvent` + `HistorianSinkStatus` + `IAlarmHistorianWriter`
- `src/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs` — durable queue + drain worker + backoff ladder + dead-letter retention
- `src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7EngineComposer.cs``RouteToHistorianAsync` wires scripted-alarm emissions into the sink
- `src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7Composer.cs``ResolveHistorianSink` selects `SqliteStoreAndForwardSink` vs `NullAlarmHistorianSink`
- `src/ZB.MOM.WW.OtOpcUa.Admin/Services/HistorianDiagnosticsService.cs` — Admin UI `/alarms/historian` status + retry-dead-lettered operator action