lmxopcua/docs/AlarmTracking.md

Alarm Tracking

Alarm surfacing is an optional driver capability exposed via IAlarmSource (src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs). Drivers whose backends have an alarm concept implement it — today: Galaxy (MXAccess alarms), FOCAS (CNC alarms), OPC UA Client (A&C events from the upstream server). Modbus / S7 / AB CIP / AB Legacy / TwinCAT do not implement the interface and the feature is simply absent from their subtrees.

IAlarmSource surface

Task<IAlarmSubscriptionHandle> SubscribeAlarmsAsync(
    IReadOnlyList<string> sourceNodeIds, CancellationToken cancellationToken);
Task UnsubscribeAlarmsAsync(IAlarmSubscriptionHandle handle, CancellationToken cancellationToken);
Task AcknowledgeAsync(IReadOnlyList<AlarmAcknowledgeRequest> acknowledgements,
    CancellationToken cancellationToken);
event EventHandler<AlarmEventArgs>? OnAlarmEvent;

The driver fires OnAlarmEvent for every transition (Active, Acknowledged, Inactive) with an AlarmEventArgs carrying the source node id, condition id, alarm type, message, severity (AlarmSeverity enum), and source timestamp.

AlarmSurfaceInvoker

AlarmSurfaceInvoker (src/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs) routes the three mutating calls through CapabilityInvoker:

  • SubscribeAlarmsAsync / UnsubscribeAlarmsAsync run through the DriverCapability.AlarmSubscribe pipeline — retries apply under the tier configuration.
  • AcknowledgeAsync runs through DriverCapability.AlarmAcknowledge which does NOT retry per decision #143. A timed-out ack may have already registered at the plant floor; replay would silently double-acknowledge.

Multi-host fan-out: when the driver implements IPerCallHostResolver, each source node id is resolved individually and batches are grouped by host so a dead PLC inside a multi-device driver doesn't poison sibling breakers. Single-host drivers fall back to IDriver.DriverInstanceId as the pipeline-key host.
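
The grouping step can be sketched as follows. This is a minimal illustration rather than the AlarmSurfaceInvoker source: HostGrouping, GroupByHost, and the resolveHost delegate are hypothetical stand-ins, since the actual IPerCallHostResolver signature is not shown in this document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the codebase: groups source node ids by
// the host that serves them so each batch can run through its own pipeline.
public static class HostGrouping
{
    public static IReadOnlyDictionary<string, IReadOnlyList<string>> GroupByHost(
        IReadOnlyList<string> sourceNodeIds,
        Func<string, string> resolveHost,   // stand-in for a per-call host resolver; null for single-host drivers
        string driverInstanceId)            // fallback pipeline key
    {
        return sourceNodeIds
            .GroupBy(id => resolveHost == null ? driverInstanceId : resolveHost(id))
            .ToDictionary(g => g.Key, g => (IReadOnlyList<string>)g.ToList());
    }
}
```

With a resolver that maps "plc1/a" and "plc1/c" to plc1 and "plc2/b" to plc2, the result has two batches, so a failure on plc2 only affects the plc2 batch.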

Condition-node creation via CapturingBuilder

Alarm-condition nodes are materialized at address-space build time. During GenericDriverNodeManager.BuildAddressSpaceAsync the builder is wrapped in a CapturingBuilder that observes every Variable() call. When a driver calls IVariableHandle.MarkAsAlarmCondition(AlarmConditionInfo) on a returned handle, the server-side DriverNodeManager.VariableHandle creates a sibling AlarmConditionState node and returns an IAlarmConditionSink. The wrapper stores the sink in _alarmSinks keyed by the variable's full reference, then GenericDriverNodeManager registers a forwarder on IAlarmSource.OnAlarmEvent that routes each push to the matching sink by SourceNodeId. Unknown source ids are dropped silently — they may belong to another driver.
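
The routing described above amounts to a dictionary lookup keyed by the variable's full reference. The sketch below assumes simplified shapes: AlarmPush and SinkRouter are hypothetical names, and a plain delegate stands in for IAlarmConditionSink, whose members are not listed here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down event payload; the real AlarmEventArgs carries more fields.
public sealed record AlarmPush(string SourceNodeId, string AlarmType, string Message);

// Hypothetical router: a delegate stands in for the condition sink.
public sealed class SinkRouter
{
    private readonly Dictionary<string, Action<AlarmPush>> _alarmSinks =
        new Dictionary<string, Action<AlarmPush>>(StringComparer.Ordinal);

    // Called at build time when a variable is marked as an alarm condition.
    public void Register(string fullReference, Action<AlarmPush> sink)
    {
        _alarmSinks[fullReference] = sink;
    }

    // Forwarder body: route by SourceNodeId and silently drop unknown ids,
    // because they may belong to another driver.
    public bool Forward(AlarmPush push)
    {
        if (!_alarmSinks.TryGetValue(push.SourceNodeId, out var sink)) return false;
        sink(push);
        return true;
    }
}
```

The boolean return exists only to make the drop observable in this sketch; the text above describes the real forwarder as dropping unknown ids silently.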

The AlarmConditionState layout matches OPC UA Part 9:

  • SourceNode → the originating variable
  • SourceName / ConditionName → from AlarmConditionInfo.SourceName
  • Initial state: enabled, inactive, acknowledged, severity per InitialSeverity, retain false
  • HasCondition references wire the source variable ↔ the condition node bidirectionally

Drivers flag alarm-bearing variables at discovery time via DriverAttributeInfo.IsAlarm = true. The Galaxy driver, for example, sets this on attributes that have an AlarmExtension primitive in the Galaxy repository DB; FOCAS sets it on the CNC alarm register.

State transitions

ConditionSink.OnTransition runs under the node manager's Lock and maps the AlarmEventArgs.AlarmType string to Part 9 state:

| AlarmType | Action |
| --- | --- |
| Active | SetActiveState(true), SetAcknowledgedState(false), Retain = true |
| Acknowledged | SetAcknowledgedState(true) |
| Inactive | SetActiveState(false); Retain = false once both inactive and acknowledged |

Severity is remapped: AlarmSeverity.Low/Medium/High/Critical → OPC UA numeric 250 / 500 / 700 / 900. Message.Value is set from AlarmEventArgs.Message on every transition. ClearChangeMasks(true) and ReportEvent(condition) fire the OPC UA event notification for clients subscribed to any ancestor notifier.
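
A compact way to read the transition rules and the severity remap together is the sketch below. ConditionSnapshot and TransitionMapper are hypothetical plain-data stand-ins for the SDK's AlarmConditionState, and the AlarmSeverity member names are assumed from the mapping above.

```csharp
using System;

public enum AlarmSeverity { Low, Medium, High, Critical } // member names assumed

// Hypothetical stand-in for the Part 9 condition node, starting in the
// documented initial state: inactive, acknowledged, retain false.
public sealed class ConditionSnapshot
{
    public bool Active { get; set; }
    public bool Acknowledged { get; set; } = true;
    public bool Retain { get; set; }
    public int Severity { get; set; }
    public string Message { get; set; } = "";
}

public static class TransitionMapper
{
    public static int ToOpcUaSeverity(AlarmSeverity severity)
    {
        switch (severity)
        {
            case AlarmSeverity.Low: return 250;
            case AlarmSeverity.Medium: return 500;
            case AlarmSeverity.High: return 700;
            case AlarmSeverity.Critical: return 900;
            default: throw new ArgumentOutOfRangeException(nameof(severity));
        }
    }

    public static void Apply(ConditionSnapshot c, string alarmType, AlarmSeverity severity, string message)
    {
        switch (alarmType)
        {
            case "Active":
                c.Active = true; c.Acknowledged = false; c.Retain = true;
                break;
            case "Acknowledged":
                c.Acknowledged = true;
                break;
            case "Inactive":
                c.Active = false;
                // Retain drops only once the condition is both inactive and acknowledged.
                if (c.Acknowledged) c.Retain = false;
                break;
        }
        c.Severity = ToOpcUaSeverity(severity);
        c.Message = message; // refreshed on every transition
    }
}
```

The sketch follows the table literally, so an alarm that clears before it is acknowledged keeps Retain = true here; whether the real ConditionSink also drops Retain on a later acknowledge is not stated in this document.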

Acknowledge dispatch

An alarm acknowledgement initiated by an OPC UA client flows as follows:

  1. The SDK invokes the AlarmConditionState.OnAcknowledge method delegate.
  2. The handler checks the session's roles for AlarmAck — drivers never see a request the session wasn't entitled to make.
  3. AlarmSurfaceInvoker.AcknowledgeAsync is called with the source / condition / comment tuple. The invoker groups by host and runs each batch through the no-retry AlarmAcknowledge pipeline.

Drivers return normally for success or throw to signal the ack failed at the backend.
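
The order of checks in the steps above can be sketched as below. AckGate is a hypothetical name, representing the AlarmAck role as a string is an assumption, and the delegate stands in for the call into AlarmSurfaceInvoker.AcknowledgeAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical gate, not the server's handler: shows the order of checks only.
public static class AckGate
{
    public const string RequiredRole = "AlarmAck"; // string form is an assumption

    // Returns false when the session lacks the role, so the driver is never called.
    // Otherwise makes exactly one attempt; a backend failure surfaces as an exception.
    public static async Task<bool> TryAcknowledgeAsync(
        IEnumerable<string> sessionRoles,
        Func<CancellationToken, Task> acknowledgeOnce,
        CancellationToken cancellationToken)
    {
        if (!sessionRoles.Contains(RequiredRole)) return false;
        await acknowledgeOnce(cancellationToken).ConfigureAwait(false);
        return true;
    }
}
```

There is deliberately no retry loop around acknowledgeOnce: an exception propagates to the caller instead of triggering a replay that could double-acknowledge.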

EventNotifier propagation

Drivers that want hierarchical alarm subscriptions propagate EventNotifier.SubscribeToEvents up the containment chain during discovery — the Galaxy driver flips the flag on every ancestor of an alarm-bearing object up to the driver root, mirroring v1 behavior. Clients subscribed at the driver root, a mid-level folder, or the Objects/ root see alarm events from every descendant with an AlarmConditionState sibling. The driver-root FolderState is created in DriverNodeManager.CreateAddressSpace with EventNotifier = SubscribeToEvents | HistoryRead so alarm event subscriptions and alarm history both have a single natural target.
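
The ancestor walk can be illustrated with a small tree. TreeNode, NotifierBits, and NotifierPropagation are hypothetical types for this sketch; the real server works with the OPC UA SDK's node state classes. The bit values follow the OPC UA EventNotifier attribute (SubscribeToEvents is bit 0, HistoryRead is bit 2).

```csharp
using System;

[Flags]
public enum NotifierBits : byte { None = 0, SubscribeToEvents = 1, HistoryRead = 4 }

// Hypothetical tree node used only for this illustration.
public sealed class TreeNode
{
    public TreeNode(string name, TreeNode parent)
    {
        Name = name;
        Parent = parent;
    }

    public string Name { get; }
    public TreeNode Parent { get; }
    public NotifierBits EventNotifier { get; set; }
}

public static class NotifierPropagation
{
    // Sets SubscribeToEvents on every ancestor of the alarm-bearing node,
    // stopping once the driver root has been marked.
    public static void MarkAncestors(TreeNode alarmBearing, TreeNode driverRoot)
    {
        for (var node = alarmBearing.Parent; node != null; node = node.Parent)
        {
            node.EventNotifier |= NotifierBits.SubscribeToEvents;
            if (ReferenceEquals(node, driverRoot)) break;
        }
    }
}
```

Because the flag is OR-ed in, a driver root that already carries HistoryRead keeps that bit after the walk.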

ConditionRefresh

The OPC UA ConditionRefresh method queues the current state of every retained condition back to the requesting monitored items. DriverNodeManager iterates the node manager's AlarmConditionState collection and queues each condition whose Retain.Value == true, matching the Part 9 requirement.

Alarm historian sink

Distinct from the live IAlarmSource stream and the Part 9 AlarmConditionState materialization above, qualifying alarm transitions are also persisted to a durable event log for downstream AVEVA Historian ingestion. This is a separate subsystem from the IHistoryProvider capability used by HistoryReadEvents (see HistoricalDataAccess.md): the sink is a producer path (server → Historian) that runs independently of any client HistoryRead call.

IAlarmHistorianSink

src/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs defines the intake contract:

Task EnqueueAsync(AlarmHistorianEvent evt, CancellationToken cancellationToken);
HistorianSinkStatus GetStatus();

EnqueueAsync is fire-and-forget from the producer's perspective — it must never block the emitting thread. The event payload (AlarmHistorianEvent — same file) is source-agnostic: AlarmId, EquipmentPath, AlarmName, AlarmTypeName (Part 9 subtype name), Severity, EventKind (free-form transition string — Activated / Cleared / Acknowledged / Confirmed / Shelved / …), Message, User, Comment, TimestampUtc.

The sink scope is defined to span every alarm source (plan decision #15: scripted, Galaxy-native, AB CIP ALMD, any future IAlarmSource), gated per-alarm by a HistorizeToAveva toggle on the producer. Today only Phase7EngineComposer.RouteToHistorianAsync (src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7EngineComposer.cs) is wired — it subscribes to ScriptedAlarmEngine.OnEvent and marshals each emission into AlarmHistorianEvent. Galaxy-native alarms continue to reach AVEVA Historian via the driver's direct aahClientManaged path and do not flow through the sink; the AB CIP ALMD path remains unwired pending a producer-side integration.

SqliteStoreAndForwardSink

Default production implementation (src/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs). A local SQLite queue absorbs every EnqueueAsync synchronously; a background Timer drains batches asynchronously to an IAlarmHistorianWriter so operator actions are never blocked on historian reachability.

Queue schema (single table Queue): RowId PK autoincrement, AlarmId, EnqueuedUtc, PayloadJson (serialized AlarmHistorianEvent), AttemptCount, LastAttemptUtc, LastError, DeadLettered (bool), plus IX_Queue_Drain (DeadLettered, RowId). Default capacity 1_000_000 non-dead-lettered rows; oldest rows evict with a WARN log past the cap.

Drain cadence: StartDrainLoop(tickInterval) arms a periodic timer. DrainOnceAsync reads up to batchSize rows (default 100) in RowId order and forwards them through IAlarmHistorianWriter.WriteBatchAsync, which returns one HistorianWriteOutcome per row:

| Outcome | Action |
| --- | --- |
| Ack | Row deleted. |
| PermanentFail | Row flipped to DeadLettered = 1 with reason. Peers in the batch retry independently. |
| RetryPlease | AttemptCount bumped; row stays queued. Drain worker enters BackingOff. |

Writer-side exceptions treat the whole batch as RetryPlease.
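
The per-row handling can be sketched against an in-memory list. QueueRow and DrainOutcomes are hypothetical stand-ins for the SQLite-backed rows and drain worker; only the HistorianWriteOutcome member names come from the outcomes listed above.

```csharp
using System;
using System.Collections.Generic;

public enum HistorianWriteOutcome { Ack, PermanentFail, RetryPlease }

// Hypothetical in-memory stand-in for one row of the SQLite queue.
public sealed class QueueRow
{
    public long RowId { get; set; }
    public int AttemptCount { get; set; }
    public bool DeadLettered { get; set; }
}

public static class DrainOutcomes
{
    // Applies one outcome per row. Returns true when any row asked for a retry,
    // which is the signal for the drain worker to back off.
    public static bool Apply(
        List<QueueRow> queue,
        IReadOnlyList<QueueRow> batch,
        IReadOnlyList<HistorianWriteOutcome> outcomes)
    {
        if (batch.Count != outcomes.Count)
            throw new ArgumentException("one outcome per row expected");

        var anyRetry = false;
        for (var i = 0; i < batch.Count; i++)
        {
            switch (outcomes[i])
            {
                case HistorianWriteOutcome.Ack:
                    queue.Remove(batch[i]);          // delivered, row deleted
                    break;
                case HistorianWriteOutcome.PermanentFail:
                    batch[i].DeadLettered = true;    // parked for operator review
                    break;
                case HistorianWriteOutcome.RetryPlease:
                    batch[i].AttemptCount++;         // stays queued for the next pass
                    anyRetry = true;
                    break;
            }
        }
        return anyRetry;
    }
}
```

A writer exception would be handled by the caller as if every row in the batch had returned RetryPlease.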

Backoff ladder on RetryPlease (hard-coded): 1s → 2s → 5s → 15s → 60s cap. The ladder resets to 0 after any batch that produces no retries. CurrentBackoff exposes the current step for instrumentation; the drain timer itself fires on tickInterval, so the ladder governs write cadence rather than timer period.
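
A minimal sketch of such a ladder follows. BackoffLadder and OnBatchCompleted are hypothetical names; only CurrentBackoff and the step values are taken from the description above.

```csharp
using System;

// Hypothetical sketch of the ladder, not the sink's actual implementation.
public sealed class BackoffLadder
{
    private static readonly TimeSpan[] Steps =
    {
        TimeSpan.FromSeconds(1),
        TimeSpan.FromSeconds(2),
        TimeSpan.FromSeconds(5),
        TimeSpan.FromSeconds(15),
        TimeSpan.FromSeconds(60),
    };

    private int _step = -1; // -1 means "not backing off"

    public TimeSpan CurrentBackoff => _step < 0 ? TimeSpan.Zero : Steps[_step];

    // Climb one rung when the batch had any retry (capped at the last rung);
    // drop back to zero as soon as a batch completes without retries.
    public void OnBatchCompleted(bool anyRetry)
    {
        _step = anyRetry ? Math.Min(_step + 1, Steps.Length - 1) : -1;
    }
}
```

Since the timer keeps firing on tickInterval, a drain pass in this model would compare the time since the last write attempt against CurrentBackoff and skip the write while the backoff has not elapsed.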

Dead-letter retention defaults to 30 days (plan decision #21). PurgeAgedDeadLetters runs each drain pass and deletes rows whose LastAttemptUtc is past the cutoff. RetryDeadLettered() is an operator action that clears DeadLettered + resets AttemptCount on every dead-lettered row so they rejoin the main queue.

Composition and writer resolution

Phase7Composer.ResolveHistorianSink (src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7Composer.cs) scans the registered drivers for one that implements IAlarmHistorianWriter. Today that is GalaxyProxyDriver via GalaxyHistorianWriter (src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/Ipc/GalaxyHistorianWriter.cs), which forwards batches over the Galaxy.Host pipe to the aahClientManaged alarm schema. When a writer is found, a SqliteStoreAndForwardSink is instantiated against %ProgramData%/OtOpcUa/alarm-historian-queue.db with a 2 s drain tick and the writer attached. When no driver provides a writer the fallback is the DI-registered NullAlarmHistorianSink (src/ZB.MOM.WW.OtOpcUa.Server/Program.cs), which silently discards and reports HistorianDrainState.Disabled.
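
The selection logic reduces to "first driver that is also a writer, else the null sink". The sketch below uses hypothetical marker types (every name ending in Like, plus the two sample drivers) in place of the real interfaces, and leaves out the database path and drain tick.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical marker shapes; the real interfaces carry members omitted here.
public interface IDriverLike { }
public interface IWriterLike { }
public interface ISinkLike { bool Discards { get; } }

public sealed class StoreAndForwardSinkLike : ISinkLike
{
    public StoreAndForwardSinkLike(IWriterLike writer) { Writer = writer; }
    public IWriterLike Writer { get; }
    public bool Discards => false;
}

public sealed class NullSinkLike : ISinkLike
{
    public bool Discards => true;
}

// Sample drivers for demonstration only.
public sealed class PlainDriver : IDriverLike { }
public sealed class WriterDriver : IDriverLike, IWriterLike { }

public static class SinkResolution
{
    public static ISinkLike Resolve(IEnumerable<IDriverLike> drivers)
    {
        // Pick the first registered driver that also implements the writer contract.
        var writer = drivers.OfType<IWriterLike>().FirstOrDefault();
        if (writer == null) return new NullSinkLike();
        return new StoreAndForwardSinkLike(writer);
    }
}
```

Resolving against a list that contains only PlainDriver instances yields the discarding sink; adding a WriterDriver switches the result to the store-and-forward sink with that driver attached.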

Status and observability

GetStatus() returns HistorianSinkStatus(QueueDepth, DeadLetterDepth, LastDrainUtc, LastSuccessUtc, LastError, DrainState) — two COUNT(*) scalars plus last-drain telemetry. DrainState is one of Disabled / Idle / Draining / BackingOff.

The Admin UI /alarms/historian page surfaces this through HistorianDiagnosticsService (src/ZB.MOM.WW.OtOpcUa.Admin/Services/HistorianDiagnosticsService.cs), which also exposes TryRetryDeadLettered — it calls through to SqliteStoreAndForwardSink.RetryDeadLettered when the live sink is the SQLite implementation and returns 0 otherwise.

Key source files

  • src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs — capability contract + AlarmEventArgs
  • src/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs — per-host fan-out + no-retry ack
  • src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs — CapturingBuilder + alarm forwarder
  • src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs — VariableHandle.MarkAsAlarmCondition + ConditionSink
  • src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/Alarms/GalaxyAlarmTracker.cs — Galaxy-specific alarm-event production
  • src/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs — historian sink intake contract + AlarmHistorianEvent + HistorianSinkStatus + IAlarmHistorianWriter
  • src/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs — durable queue + drain worker + backoff ladder + dead-letter retention
  • src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7EngineComposer.cs — RouteToHistorianAsync wires scripted-alarm emissions into the sink
  • src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7Composer.cs — ResolveHistorianSink selects SqliteStoreAndForwardSink vs NullAlarmHistorianSink
  • src/ZB.MOM.WW.OtOpcUa.Admin/Services/HistorianDiagnosticsService.cs — Admin UI /alarms/historian status + retry-dead-lettered operator action