Audit of docs/ against src/ surfaced shipped features without current-reference coverage (FOCAS CLI, Core.Scripting+VirtualTags, Core.ScriptedAlarms, Core.AlarmHistorian), an out-of-date driver count + capability matrix, ADR-002's virtual-tag dispatch not reflected in data-path docs, broken cross-references, and OpcUaServerReqs declaring OPC-020..022 that were never scoped. This commit closes all of those so operators + integrators can stay inside docs/ without falling back to v2/implementation/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
129 lines
12 KiB
Markdown
129 lines
12 KiB
Markdown
# Alarm Tracking
|
|
|
|
Alarm surfacing is an optional driver capability exposed via `IAlarmSource` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs`). Drivers whose backends have an alarm concept implement it — today: Galaxy (MXAccess alarms), FOCAS (CNC alarms), OPC UA Client (A&C events from the upstream server). Modbus / S7 / AB CIP / AB Legacy / TwinCAT do not implement the interface and the feature is simply absent from their subtrees.
|
|
|
|
## IAlarmSource surface
|
|
|
|
```csharp
|
|
Task<IAlarmSubscriptionHandle> SubscribeAlarmsAsync(
|
|
IReadOnlyList<string> sourceNodeIds, CancellationToken cancellationToken);
|
|
Task UnsubscribeAlarmsAsync(IAlarmSubscriptionHandle handle, CancellationToken cancellationToken);
|
|
Task AcknowledgeAsync(IReadOnlyList<AlarmAcknowledgeRequest> acknowledgements,
|
|
CancellationToken cancellationToken);
|
|
event EventHandler<AlarmEventArgs>? OnAlarmEvent;
|
|
```
|
|
|
|
The driver fires `OnAlarmEvent` for every transition (`Active`, `Acknowledged`, `Inactive`) with an `AlarmEventArgs` carrying the source node id, condition id, alarm type, message, severity (`AlarmSeverity` enum), and source timestamp.
|
|
|
|
## AlarmSurfaceInvoker
|
|
|
|
`AlarmSurfaceInvoker` (`src/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs`) wraps the three mutating surfaces through `CapabilityInvoker`:
|
|
|
|
- `SubscribeAlarmsAsync` / `UnsubscribeAlarmsAsync` run through the `DriverCapability.AlarmSubscribe` pipeline — retries apply under the tier configuration.
|
|
- `AcknowledgeAsync` runs through `DriverCapability.AlarmAcknowledge` which does NOT retry per decision #143. A timed-out ack may have already registered at the plant floor; replay would silently double-acknowledge.
|
|
|
|
Multi-host fan-out: when the driver implements `IPerCallHostResolver`, each source node id is resolved individually and batches are grouped by host so a dead PLC inside a multi-device driver doesn't poison sibling breakers. Single-host drivers fall back to `IDriver.DriverInstanceId` as the pipeline-key host.
|
|
|
|
## Condition-node creation via CapturingBuilder
|
|
|
|
Alarm-condition nodes are materialized at address-space build time. During `GenericDriverNodeManager.BuildAddressSpaceAsync` the builder is wrapped in a `CapturingBuilder` that observes every `Variable()` call. When a driver calls `IVariableHandle.MarkAsAlarmCondition(AlarmConditionInfo)` on a returned handle, the server-side `DriverNodeManager.VariableHandle` creates a sibling `AlarmConditionState` node and returns an `IAlarmConditionSink`. The wrapper stores the sink in `_alarmSinks` keyed by the variable's full reference, then `GenericDriverNodeManager` registers a forwarder on `IAlarmSource.OnAlarmEvent` that routes each push to the matching sink by `SourceNodeId`. Unknown source ids are dropped silently — they may belong to another driver.
|
|
|
|
The `AlarmConditionState` layout matches OPC UA Part 9:
|
|
|
|
- `SourceNode` → the originating variable
|
|
- `SourceName` / `ConditionName` → from `AlarmConditionInfo.SourceName`
|
|
- Initial state: enabled, inactive, acknowledged, severity per `InitialSeverity`, retain false
|
|
- `HasCondition` references wire the source variable ↔ the condition node bidirectionally
|
|
|
|
Drivers flag alarm-bearing variables at discovery time via `DriverAttributeInfo.IsAlarm = true`. The Galaxy driver, for example, sets this on attributes that have an `AlarmExtension` primitive in the Galaxy repository DB; FOCAS sets it on the CNC alarm register.
|
|
|
|
## State transitions
|
|
|
|
`ConditionSink.OnTransition` runs under the node manager's `Lock` and maps the `AlarmEventArgs.AlarmType` string to Part 9 state:
|
|
|
|
| AlarmType | Action |
|
|
|---|---|
|
|
| `Active` | `SetActiveState(true)`, `SetAcknowledgedState(false)`, `Retain = true` |
|
|
| `Acknowledged` | `SetAcknowledgedState(true)` |
|
|
| `Inactive` | `SetActiveState(false)`; `Retain = false` once both inactive and acknowledged |
|
|
|
|
Severity is remapped: `AlarmSeverity.Low/Medium/High/Critical` → OPC UA numeric 250 / 500 / 700 / 900. `Message.Value` is set from `AlarmEventArgs.Message` on every transition. `ClearChangeMasks(true)` and `ReportEvent(condition)` fire the OPC UA event notification for clients subscribed to any ancestor notifier.
|
|
|
|
## Acknowledge dispatch
|
|
|
|
Alarm acknowledgement initiated by an OPC UA client flows:
|
|
|
|
1. The SDK invokes the `AlarmConditionState.OnAcknowledge` method delegate.
|
|
2. The handler checks the session's roles for `AlarmAck` — drivers never see a request the session wasn't entitled to make.
|
|
3. `AlarmSurfaceInvoker.AcknowledgeAsync` is called with the source / condition / comment tuple. The invoker groups by host and runs each batch through the no-retry `AlarmAcknowledge` pipeline.
|
|
|
|
Drivers return normally for success or throw to signal the ack failed at the backend.
|
|
|
|
## EventNotifier propagation
|
|
|
|
Drivers that want hierarchical alarm subscriptions propagate `EventNotifier.SubscribeToEvents` up the containment chain during discovery — the Galaxy driver flips the flag on every ancestor of an alarm-bearing object up to the driver root, mirroring v1 behavior. Clients subscribed at the driver root, a mid-level folder, or the `Objects/` root see alarm events from every descendant with an `AlarmConditionState` sibling. The driver-root `FolderState` is created in `DriverNodeManager.CreateAddressSpace` with `EventNotifier = SubscribeToEvents | HistoryRead` so alarm event subscriptions and alarm history both have a single natural target.
|
|
|
|
## ConditionRefresh
|
|
|
|
The OPC UA `ConditionRefresh` service queues the current state of every retained condition back to the requesting monitored items. `DriverNodeManager` iterates the node manager's `AlarmConditionState` collection and queues each condition whose `Retain.Value == true` — matching the Part 9 requirement.
|
|
|
|
## Alarm historian sink
|
|
|
|
Distinct from the live `IAlarmSource` stream and the Part 9 `AlarmConditionState` materialization above, qualifying alarm transitions are **also** persisted to a durable event log for downstream AVEVA Historian ingestion. This is a separate subsystem from the `IHistoryProvider` capability used by `HistoryReadEvents` (see [HistoricalDataAccess.md](HistoricalDataAccess.md#alarm-event-history-vs-ihistoryprovider)): the sink is a *producer* path (server → Historian) that runs independently of any client HistoryRead call.
|
|
|
|
### `IAlarmHistorianSink`
|
|
|
|
`src/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs` defines the intake contract:
|
|
|
|
```csharp
|
|
Task EnqueueAsync(AlarmHistorianEvent evt, CancellationToken cancellationToken);
|
|
HistorianSinkStatus GetStatus();
|
|
```
|
|
|
|
`EnqueueAsync` is fire-and-forget from the producer's perspective — it must never block the emitting thread. The event payload (`AlarmHistorianEvent` — same file) is source-agnostic: `AlarmId`, `EquipmentPath`, `AlarmName`, `AlarmTypeName` (Part 9 subtype name), `Severity`, `EventKind` (free-form transition string — `Activated` / `Cleared` / `Acknowledged` / `Confirmed` / `Shelved` / …), `Message`, `User`, `Comment`, `TimestampUtc`.
|
|
|
|
The sink scope is defined to span every alarm source (plan decision #15: scripted, Galaxy-native, AB CIP ALMD, any future `IAlarmSource`), gated per-alarm by a `HistorizeToAveva` toggle on the producer. Today only `Phase7EngineComposer.RouteToHistorianAsync` (`src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7EngineComposer.cs`) is wired — it subscribes to `ScriptedAlarmEngine.OnEvent` and marshals each emission into `AlarmHistorianEvent`. Galaxy-native alarms continue to reach AVEVA Historian via the driver's direct `aahClientManaged` path and do not flow through the sink; the AB CIP ALMD path remains unwired pending a producer-side integration.
|
|
|
|
### `SqliteStoreAndForwardSink`
|
|
|
|
Default production implementation (`src/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs`). A local SQLite queue absorbs every `EnqueueAsync` synchronously; a background `Timer` drains batches asynchronously to an `IAlarmHistorianWriter` so operator actions are never blocked on historian reachability.
|
|
|
|
Queue schema (single table `Queue`): `RowId PK autoincrement`, `AlarmId`, `EnqueuedUtc`, `PayloadJson` (serialized `AlarmHistorianEvent`), `AttemptCount`, `LastAttemptUtc`, `LastError`, `DeadLettered` (bool), plus `IX_Queue_Drain (DeadLettered, RowId)`. Default capacity `1_000_000` non-dead-lettered rows; oldest rows evict with a WARN log past the cap.
|
|
|
|
Drain cadence: `StartDrainLoop(tickInterval)` arms a periodic timer. `DrainOnceAsync` reads up to `batchSize` rows (default 100) in `RowId` order and forwards them through `IAlarmHistorianWriter.WriteBatchAsync`, which returns one `HistorianWriteOutcome` per row:
|
|
|
|
| Outcome | Action |
|
|
|---|---|
|
|
| `Ack` | Row deleted. |
|
|
| `PermanentFail` | Row flipped to `DeadLettered = 1` with reason. Peers in the batch retry independently. |
|
|
| `RetryPlease` | `AttemptCount` bumped; row stays queued. Drain worker enters `BackingOff`. |
|
|
|
|
Writer-side exceptions treat the whole batch as `RetryPlease`.
|
|
|
|
Backoff ladder on `RetryPlease` (hard-coded): 1s → 2s → 5s → 15s → 60s cap. Reset to 0 on any batch with no retries. `CurrentBackoff` exposes the current step for instrumentation; the drain timer itself fires on `tickInterval`, so the ladder governs write cadence rather than timer period.
|
|
|
|
Dead-letter retention defaults to 30 days (plan decision #21). `PurgeAgedDeadLetters` runs each drain pass and deletes rows whose `LastAttemptUtc` is past the cutoff. `RetryDeadLettered()` is an operator action that clears `DeadLettered` + resets `AttemptCount` on every dead-lettered row so they rejoin the main queue.
|
|
|
|
### Composition and writer resolution
|
|
|
|
`Phase7Composer.ResolveHistorianSink` (`src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7Composer.cs`) scans the registered drivers for one that implements `IAlarmHistorianWriter`. Today that is `GalaxyProxyDriver` via `GalaxyHistorianWriter` (`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/Ipc/GalaxyHistorianWriter.cs`), which forwards batches over the Galaxy.Host pipe to the `aahClientManaged` alarm schema. When a writer is found, a `SqliteStoreAndForwardSink` is instantiated against `%ProgramData%/OtOpcUa/alarm-historian-queue.db` with a 2 s drain tick and the writer attached. When no driver provides a writer the fallback is the DI-registered `NullAlarmHistorianSink` (`src/ZB.MOM.WW.OtOpcUa.Server/Program.cs`), which silently discards and reports `HistorianDrainState.Disabled`.
|
|
|
|
### Status and observability
|
|
|
|
`GetStatus()` returns `HistorianSinkStatus(QueueDepth, DeadLetterDepth, LastDrainUtc, LastSuccessUtc, LastError, DrainState)` — two `COUNT(*)` scalars plus last-drain telemetry. `DrainState` is one of `Disabled` / `Idle` / `Draining` / `BackingOff`.
|
|
|
|
The Admin UI `/alarms/historian` page surfaces this through `HistorianDiagnosticsService` (`src/ZB.MOM.WW.OtOpcUa.Admin/Services/HistorianDiagnosticsService.cs`), which also exposes `TryRetryDeadLettered` — it calls through to `SqliteStoreAndForwardSink.RetryDeadLettered` when the live sink is the SQLite implementation and returns 0 otherwise.
|
|
|
|
## Key source files
|
|
|
|
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs` — capability contract + `AlarmEventArgs`
|
|
- `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs` — per-host fan-out + no-retry ack
|
|
- `src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs` — `CapturingBuilder` + alarm forwarder
|
|
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs` — `VariableHandle.MarkAsAlarmCondition` + `ConditionSink`
|
|
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/Alarms/GalaxyAlarmTracker.cs` — Galaxy-specific alarm-event production
|
|
- `src/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs` — historian sink intake contract + `AlarmHistorianEvent` + `HistorianSinkStatus` + `IAlarmHistorianWriter`
|
|
- `src/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs` — durable queue + drain worker + backoff ladder + dead-letter retention
|
|
- `src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7EngineComposer.cs` — `RouteToHistorianAsync` wires scripted-alarm emissions into the sink
|
|
- `src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7Composer.cs` — `ResolveHistorianSink` selects `SqliteStoreAndForwardSink` vs `NullAlarmHistorianSink`
|
|
- `src/ZB.MOM.WW.OtOpcUa.Admin/Services/HistorianDiagnosticsService.cs` — Admin UI `/alarms/historian` status + retry-dead-lettered operator action
|