Alarm Historian — store-and-forward SQLite sink

Reference for ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian (src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/), the durable local queue that historizes alarm transitions to AVEVA Historian without ever blocking the alarm engine or operator actions.

This is the sink mechanics doc. For how the three alarm sources converge on the OPC UA Part 9 surface and which alarms route here, see AlarmTracking.md. For the historian client that drains this queue, see DriverLifecycle.md and ServiceHosting.md.


Why store-and-forward

Scripted alarms (and any future non-Galaxy IAlarmSource, e.g. AB CIP ALMD) must reach AVEVA Historian, but the historian sidecar can be slow, busy, or disconnected. The sink decouples the alarm engine from historian reachability: every qualifying transition is committed to a local SQLite queue first, and a background drain worker forwards rows to the historian on a backoff-aware cadence. Operator acks and alarm-state transitions are never blocked waiting on the historian.

Galaxy-native alarms with $Alarm* extensions reach AVEVA Historian directly via System Platform's HistorizeToAveva toggle — they do not flow through this sink. This path is exclusively for non-Galaxy alarm producers.


Contracts

All in IAlarmHistorianSink.cs unless noted.

  • IAlarmHistorianSink — the intake contract. EnqueueAsync(evt, ct) durably enqueues an event and returns as soon as the queue row is committed (fire-and-forget from the engine's perspective; the sink must not block the emitting thread). GetStatus() returns a HistorianSinkStatus snapshot.
  • NullAlarmHistorianSink — the no-op default for tests and deployments that don't historize alarms. It is the default DI binding (registered in the Runtime's AddOtOpcUaRuntime); production overrides it with SqliteStoreAndForwardSink.
  • AlarmHistorianEvent (AlarmHistorianEvent.cs) — the source-agnostic event record: AlarmId, EquipmentPath (UNS path, doubles as Historian's SourceNode), AlarmName, AlarmTypeName (Part 9 subtype), Severity, EventKind (free-form transition string — "Activated"/"Cleared"/"Acknowledged"/etc.), Message, User, Comment, TimestampUtc.
  • IAlarmHistorianWriter — what the drain worker delegates writes to. WriteBatchAsync(batch, ct) returns one HistorianWriteOutcome per event, in order. Production binds this to WonderwareHistorianClient (the AVEVA Historian sidecar IPC client).
  • HistorianWriteOutcome — per-event drain result: Ack (persisted, remove from queue), RetryPlease (transient failure — leave queued, retry after backoff), PermanentFail (malformed/unrecoverable — move to dead-letter).
  • HistorianSinkStatus — diagnostic snapshot surfaced to the AdminUI and /healthz: QueueDepth, DeadLetterDepth, LastDrainUtc, LastSuccessUtc, LastError, DrainState, and EvictedCount.
  • HistorianDrainState: Disabled / Idle / Draining / BackingOff.
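To make the shape of these contracts concrete, here is a minimal Python sketch of the event record, the three write outcomes, and the "one outcome per event, in order" rule. It is an illustration only, not the C# definitions: the field types (for example an integer severity and optional user/comment), the `AckAllWriter` stand-in, and the `check_cardinality` helper are assumptions that do not come from the source.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional, Sequence


class HistorianWriteOutcome(Enum):
    ACK = "Ack"                       # persisted; remove from queue
    RETRY_PLEASE = "RetryPlease"      # transient failure; leave queued
    PERMANENT_FAIL = "PermanentFail"  # unrecoverable; move to dead-letter


@dataclass(frozen=True)
class AlarmHistorianEvent:
    # Field names mirror the record described above; the types are assumed.
    alarm_id: str
    equipment_path: str
    alarm_name: str
    alarm_type_name: str
    severity: int
    event_kind: str
    message: str
    user: Optional[str]
    comment: Optional[str]
    timestamp_utc: datetime


class AckAllWriter:
    """Hypothetical stand-in for an IAlarmHistorianWriter: acks every event."""

    def write_batch(
        self, batch: Sequence[AlarmHistorianEvent]
    ) -> List[HistorianWriteOutcome]:
        return [HistorianWriteOutcome.ACK for _ in batch]


def check_cardinality(batch: Sequence[AlarmHistorianEvent],
                      outcomes: Sequence[HistorianWriteOutcome]) -> bool:
    """A writer must return exactly one outcome per event."""
    return len(outcomes) == len(batch)
```

A writer that returns too few or too many outcomes fails `check_cardinality`, which is the condition the drain worker treats as a failure.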

SqliteStoreAndForwardSink

SqliteStoreAndForwardSink.cs is the production IAlarmHistorianSink. Its constructor takes a SQLite database path, an IAlarmHistorianWriter, and a logger, plus four optional parameters: batchSize (default 100), capacity (default 1,000,000), deadLetterRetention (default 30 days), and a test clock.

Queue table

The sink owns one SQLite table (created on construction, WAL journal mode):

CREATE TABLE Queue (
    RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId        TEXT    NOT NULL,
    EnqueuedUtc    TEXT    NOT NULL,
    PayloadJson    TEXT    NOT NULL,   -- JSON-serialized AlarmHistorianEvent
    AttemptCount   INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT    NULL,
    LastError      TEXT    NULL,
    DeadLettered   INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IX_Queue_Drain ON Queue (DeadLettered, RowId);

EnqueueAsync does a single INSERT on the hot path. To avoid a SELECT COUNT(*) on every enqueue, the sink keeps an in-memory non-dead-lettered row counter (seeded at startup, kept current by every mutation, and re-synced from storage every 10,000 enqueues to defend against drift). SQLite writer contention is handled via PRAGMA busy_timeout=5000 + WAL so an enqueue/drain collision waits out the file lock instead of failing fast.
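The enqueue path described above can be sketched in Python with the standard `sqlite3` module. This is a hedged illustration of the technique (WAL plus `busy_timeout`, one INSERT per enqueue, an in-memory live-row counter with a periodic re-sync), not the C# implementation; the class name `QueueSketch` and its method names are hypothetical.

```python
import json
import sqlite3
from datetime import datetime, timezone

QUEUE_DDL = """
CREATE TABLE IF NOT EXISTS Queue (
    RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId        TEXT    NOT NULL,
    EnqueuedUtc    TEXT    NOT NULL,
    PayloadJson    TEXT    NOT NULL,
    AttemptCount   INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT    NULL,
    LastError      TEXT    NULL,
    DeadLettered   INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IF NOT EXISTS IX_Queue_Drain ON Queue (DeadLettered, RowId);
"""

RESYNC_EVERY = 10_000  # re-count from storage after this many enqueues


class QueueSketch:
    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        # WAL lets a reader and a writer coexist; busy_timeout makes a
        # contended writer wait up to 5 s instead of failing immediately.
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute("PRAGMA busy_timeout=5000")
        self.conn.executescript(QUEUE_DDL)
        self.live_rows = self.count_live()  # seeded once at startup
        self.enqueues_since_resync = 0

    def count_live(self) -> int:
        row = self.conn.execute(
            "SELECT COUNT(*) FROM Queue WHERE DeadLettered = 0"
        ).fetchone()
        return row[0]

    def enqueue(self, alarm_id: str, payload: dict) -> None:
        # Hot path: a single INSERT, committed before returning.
        with self.conn:
            self.conn.execute(
                "INSERT INTO Queue (AlarmId, EnqueuedUtc, PayloadJson) "
                "VALUES (?, ?, ?)",
                (alarm_id, datetime.now(timezone.utc).isoformat(),
                 json.dumps(payload)),
            )
        self.live_rows += 1
        self.enqueues_since_resync += 1
        if self.enqueues_since_resync >= RESYNC_EVERY:
            # Periodic re-sync defends against counter drift.
            self.live_rows = self.count_live()
            self.enqueues_since_resync = 0
```

In this sketch the counter only tracks enqueues; a fuller version would also adjust it when the drain deletes or dead-letters rows, as the text requires of "every mutation".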

Drain worker

StartDrainLoop(tickInterval) starts a self-rescheduling one-shot System.Threading.Timer (not started automatically — tests drive DrainOnceAsync deterministically). Each tick:

  1. Purges aged dead-lettered rows past the retention window.
  2. Reads up to batchSize non-dead-lettered rows in RowId order.
  3. Dead-letters rows with un-deserializable payloads immediately (by their own RowId) so they can't stall the queue head.
  4. Hands the remaining batch to IAlarmHistorianWriter.WriteBatchAsync and applies each outcome in one transaction: Ack deletes the row, PermanentFail flips its DeadLettered flag, RetryPlease bumps its attempt count and leaves it queued.
  5. Re-arms the timer's next due-time to max(tickInterval, currentBackoff).
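Steps 2 to 4 of the tick can be sketched as a single Python function over the same table. This is an illustration under stated assumptions, not the C# drain worker: `drain_once` and `write_batch` are hypothetical names, outcomes are plain strings, the 'corrupt payload' error text is invented, and the dead-letter purge (step 1) and timer re-arm (step 5) are left out.

```python
import json
import sqlite3

QUEUE_DDL = """
CREATE TABLE Queue (
    RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId        TEXT    NOT NULL,
    EnqueuedUtc    TEXT    NOT NULL,
    PayloadJson    TEXT    NOT NULL,
    AttemptCount   INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT    NULL,
    LastError      TEXT    NULL,
    DeadLettered   INTEGER NOT NULL DEFAULT 0
);
"""


def drain_once(conn: sqlite3.Connection, write_batch, batch_size: int = 100):
    # Step 2: read up to batch_size live rows in RowId order.
    rows = conn.execute(
        "SELECT RowId, PayloadJson FROM Queue "
        "WHERE DeadLettered = 0 ORDER BY RowId LIMIT ?",
        (batch_size,),
    ).fetchall()

    # Step 3: dead-letter rows whose payload cannot be deserialized.
    batch, corrupt = [], []
    for row_id, payload in rows:
        try:
            batch.append((row_id, json.loads(payload)))
        except json.JSONDecodeError:
            corrupt.append((row_id,))
    with conn:
        conn.executemany(
            "UPDATE Queue SET DeadLettered = 1, "
            "LastError = 'corrupt payload' WHERE RowId = ?",
            corrupt,
        )
    if not batch:
        return []

    # Step 4: hand the batch to the writer, then apply every outcome
    # inside one transaction.
    outcomes = write_batch([event for _, event in batch])
    if len(outcomes) != len(batch):
        raise ValueError("writer cardinality violation")
    with conn:
        for (row_id, _), outcome in zip(batch, outcomes):
            if outcome == "Ack":
                conn.execute("DELETE FROM Queue WHERE RowId = ?", (row_id,))
            elif outcome == "PermanentFail":
                conn.execute(
                    "UPDATE Queue SET DeadLettered = 1 WHERE RowId = ?",
                    (row_id,),
                )
            else:  # "RetryPlease": stay queued, count the attempt
                conn.execute(
                    "UPDATE Queue SET AttemptCount = AttemptCount + 1 "
                    "WHERE RowId = ?",
                    (row_id,),
                )
    return outcomes
```

Because the corrupt row is flagged by its own RowId before the writer is called, a bad payload at the head of the queue never holds back the valid rows behind it.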

Backoff ladder (applied to the timer's next due-time, so a historian outage genuinely slows the drain cadence): 1s → 2s → 5s → 15s → 60s cap. Any RetryPlease outcome — or a writer exception, or a writer cardinality violation (outcome count ≠ event count) — bumps the backoff and sets DrainState = BackingOff; a clean batch resets it. The async-void timer callback is fully guarded: a fault is logged and recorded into GetStatus() rather than lost as an unobserved task exception.
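The ladder and the max(tickInterval, currentBackoff) rule can be sketched as a small Python state holder. The `Backoff` class and its method names are hypothetical, and the sketch assumes the first failed batch lands on the 1 s rung, which the source does not state explicitly.

```python
BACKOFF_LADDER_S = (1, 2, 5, 15, 60)  # seconds; the last rung is the cap


class Backoff:
    def __init__(self) -> None:
        self.step = -1  # -1 means "not backing off"

    def record(self, clean: bool) -> None:
        """Call once per drained batch."""
        if clean:
            self.step = -1  # a clean batch resets the ladder
        else:
            # RetryPlease, writer exception, or cardinality violation:
            # climb one rung, never past the cap.
            self.step = min(self.step + 1, len(BACKOFF_LADDER_S) - 1)

    @property
    def current_s(self) -> float:
        return 0 if self.step < 0 else BACKOFF_LADDER_S[self.step]

    def next_due_s(self, tick_interval_s: float) -> float:
        """Next timer due-time: never sooner than the normal tick."""
        return max(tick_interval_s, self.current_s)
```

With a 2 s tick interval, the first two failures leave the cadence at 2 s (the backoff of 1 s or 2 s does not exceed the tick); the third failure stretches it to 5 s, and repeated failures settle at 60 s until a clean batch resets it.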

Durability bound (important)

The durability guarantee is bounded by capacity (default 1,000,000 rows). When the non-dead-lettered queue reaches capacity, EnqueueAsync evicts the oldest non-dead-lettered rows (oldest RowId first) to make room, logs a WARN, and increments HistorianSinkStatus.EvictedCount. Under a sustained historian outage, accepted alarm events can therefore be dropped before delivery. A non-zero EvictedCount is a data-loss signal that requires operator attention — it surfaces silent loss without log scraping.
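The eviction rule can be sketched in Python as follows. This is an assumption-laden illustration, not the C# code: `enqueue_bounded` is a hypothetical name, the caller passes in the live-row count, and doing the eviction and the insert in one transaction is a choice made for this sketch.

```python
import sqlite3
from datetime import datetime, timezone

QUEUE_DDL = """
CREATE TABLE Queue (
    RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId        TEXT    NOT NULL,
    EnqueuedUtc    TEXT    NOT NULL,
    PayloadJson    TEXT    NOT NULL,
    AttemptCount   INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT    NULL,
    LastError      TEXT    NULL,
    DeadLettered   INTEGER NOT NULL DEFAULT 0
);
"""


def enqueue_bounded(conn: sqlite3.Connection, alarm_id: str,
                    payload_json: str, live_rows: int, capacity: int):
    """Insert one row, evicting the oldest live rows if the queue is full.

    Returns (new_live_rows, evicted_now).
    """
    overflow = live_rows + 1 - capacity
    evicted = 0
    with conn:
        if overflow > 0:
            # Oldest RowId first; dead-lettered rows are never evicted here.
            evicted = conn.execute(
                "DELETE FROM Queue WHERE RowId IN ("
                " SELECT RowId FROM Queue WHERE DeadLettered = 0"
                " ORDER BY RowId LIMIT ?)",
                (overflow,),
            ).rowcount
        conn.execute(
            "INSERT INTO Queue (AlarmId, EnqueuedUtc, PayloadJson) "
            "VALUES (?, ?, ?)",
            (alarm_id, datetime.now(timezone.utc).isoformat(), payload_json),
        )
    return live_rows - evicted + 1, evicted
```

The caller would add each returned `evicted_now` to a running total, which plays the role of EvictedCount: any non-zero total means accepted events were dropped before delivery.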

Dead-letter + operator recovery

PermanentFail and corrupt-payload rows are retained in-place with DeadLettered = 1 for the retention window (default 30 days) so operators can inspect them before the sweeper purges them. RetryDeadLettered() is the operator action (from the AdminUI) that clears the dead-letter flag and attempt count on every dead-lettered row, returning them to the regular queue with a fresh backoff.
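Both dead-letter operations reduce to one statement each against the Queue table, sketched here in Python. The function names are hypothetical, and measuring retention from EnqueuedUtc is an assumption: the source does not say which timestamp the sweeper compares against.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

QUEUE_DDL = """
CREATE TABLE Queue (
    RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId        TEXT    NOT NULL,
    EnqueuedUtc    TEXT    NOT NULL,
    PayloadJson    TEXT    NOT NULL,
    AttemptCount   INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT    NULL,
    LastError      TEXT    NULL,
    DeadLettered   INTEGER NOT NULL DEFAULT 0
);
"""


def purge_dead_letters(conn: sqlite3.Connection, now: datetime,
                       retention: timedelta = timedelta(days=30)) -> int:
    """Delete dead-lettered rows older than the retention window.

    Assumes EnqueuedUtc holds ISO-8601 UTC text in one consistent format,
    so string comparison matches chronological order.
    """
    cutoff = (now - retention).isoformat(timespec="seconds")
    with conn:
        return conn.execute(
            "DELETE FROM Queue WHERE DeadLettered = 1 AND EnqueuedUtc < ?",
            (cutoff,),
        ).rowcount


def retry_dead_lettered(conn: sqlite3.Connection) -> int:
    """Return every dead-lettered row to the regular queue."""
    with conn:
        return conn.execute(
            "UPDATE Queue SET DeadLettered = 0, AttemptCount = 0 "
            "WHERE DeadLettered = 1"
        ).rowcount
```

Requeued rows keep their original RowId, so they sit ahead of newer rows in drain order and are retried first on the next tick.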


Runtime wiring

Production routes alarm transitions through the Akka cluster. The HistorianAdapterActor (Runtime/Historian/HistorianAdapterActor.cs) bridges messages from the scripted-alarm actor into the sink's EnqueueAsync, fire-and-forget so the actor loop is never blocked on historian reachability. The WonderwareHistorianClient is the IAlarmHistorianWriter the drain worker delegates to. See ServiceHosting.md for the sidecar setup.


See also

  • AlarmTracking.md — the three alarm sources and the OPC UA Part 9 surface; which alarms route to this sink.
  • DriverLifecycle.md: IHistorianDataSource (the historian read surface; this page covers the write path) and the WonderwareHistorianClient.
  • ScriptedAlarms.md — the scripted-alarm engine that emits most events into this sink.
  • ServiceHosting.md — the optional Wonderware historian sidecar.