lmxopcua/docs/AlarmHistorian.md
Joseph Doherty 1be06502c7
fix(historian): correct AlarmHistorian config-key refs in docs + install (review)
2026-06-12 12:25:13 -04:00

Alarm Historian — store-and-forward SQLite sink

Reference for ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian (src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/), the durable local queue that historizes alarm transitions to AVEVA Historian without ever blocking the alarm engine or operator actions.

This is the sink mechanics doc. For how the three alarm sources converge on the OPC UA Part 9 surface and which alarms route here, see AlarmTracking.md. For the historian client that drains this queue, see DriverLifecycle.md and ServiceHosting.md.


Why store-and-forward

Scripted alarms (and any future non-Galaxy IAlarmSource, e.g. AB CIP ALMD) must reach AVEVA Historian, but the historian sidecar can be slow, busy, or disconnected. The sink decouples the alarm engine from historian reachability: every qualifying transition is committed to a local SQLite queue first, and a background drain worker forwards rows to the historian on a backoff-aware cadence. Operator acks and alarm-state transitions are never blocked waiting on the historian.

Galaxy-native alarms with $Alarm* extensions reach AVEVA Historian directly via System Platform's HistorizeToAveva toggle — they do not flow through this sink. This path is exclusively for non-Galaxy alarm producers.


Contracts

All in IAlarmHistorianSink.cs unless noted.

  • IAlarmHistorianSink — the intake contract. EnqueueAsync(evt, ct) durably enqueues an event and returns as soon as the queue row is committed (fire-and-forget from the engine's perspective; the sink must not block the emitting thread). GetStatus() returns a HistorianSinkStatus snapshot.
  • NullAlarmHistorianSink — the no-op default for tests and deployments that don't historize alarms. It is the default DI binding (registered in the Runtime's AddOtOpcUaRuntime); production overrides it with SqliteStoreAndForwardSink.
  • AlarmHistorianEvent (AlarmHistorianEvent.cs) — the source-agnostic event record: AlarmId, EquipmentPath (UNS path, doubles as Historian's SourceNode), AlarmName, AlarmTypeName (Part 9 subtype), Severity, EventKind (free-form transition string — "Activated"/"Cleared"/"Acknowledged"/etc.), Message, User, Comment, TimestampUtc.
  • IAlarmHistorianWriter — what the drain worker delegates writes to. WriteBatchAsync(batch, ct) returns one HistorianWriteOutcome per event, in order. Production binds this to WonderwareHistorianClient (the AVEVA Historian sidecar IPC client).
  • HistorianWriteOutcome — per-event drain result: Ack (persisted, remove from queue), RetryPlease (transient failure — leave queued, retry after backoff), PermanentFail (malformed/unrecoverable — move to dead-letter).
  • HistorianSinkStatus — diagnostic snapshot surfaced to the AdminUI and /healthz: QueueDepth, DeadLetterDepth, LastDrainUtc, LastSuccessUtc, LastError, DrainState, and EvictedCount.
  • HistorianDrainState: Disabled / Idle / Draining / BackingOff.
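To make the shapes above concrete, here is a minimal sketch of the contracts. It is written in Python rather than the project's C#, so it is an illustration only: the member names follow the list above, but the snake_case spellings, the Python types (for example a numeric severity), and the status the null sink reports are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional, Protocol, Sequence


class HistorianWriteOutcome(Enum):
    ACK = "Ack"                        # persisted, remove from queue
    RETRY_PLEASE = "RetryPlease"       # transient failure, leave queued
    PERMANENT_FAIL = "PermanentFail"   # unrecoverable, move to dead-letter


class HistorianDrainState(Enum):
    DISABLED = "Disabled"
    IDLE = "Idle"
    DRAINING = "Draining"
    BACKING_OFF = "BackingOff"


@dataclass(frozen=True)
class AlarmHistorianEvent:
    alarm_id: str
    equipment_path: str        # UNS path, doubles as Historian's SourceNode
    alarm_name: str
    alarm_type_name: str       # Part 9 subtype
    severity: int              # assumption: numeric severity
    event_kind: str            # free-form, e.g. "Activated" / "Cleared"
    message: str
    user: Optional[str]
    comment: Optional[str]
    timestamp_utc: datetime


@dataclass(frozen=True)
class HistorianSinkStatus:
    queue_depth: int = 0
    dead_letter_depth: int = 0
    last_drain_utc: Optional[datetime] = None
    last_success_utc: Optional[datetime] = None
    last_error: Optional[str] = None
    drain_state: HistorianDrainState = HistorianDrainState.DISABLED
    evicted_count: int = 0


class AlarmHistorianWriter(Protocol):
    def write_batch(self, batch: Sequence[AlarmHistorianEvent]) -> List[HistorianWriteOutcome]:
        """Return exactly one outcome per event, in the same order."""
        ...


class NullAlarmHistorianSink:
    """No-op sink: accepts every event and stores nothing."""

    def enqueue(self, evt: AlarmHistorianEvent) -> None:
        return None

    def get_status(self) -> HistorianSinkStatus:
        # Assumption: the null sink reports an empty, Disabled status.
        return HistorianSinkStatus()
```

The one-outcome-per-event rule on the writer is what lets the drain worker pair each outcome with its queue row by position.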

SqliteStoreAndForwardSink

SqliteStoreAndForwardSink.cs is the production IAlarmHistorianSink. Construction takes a SQLite database path, an IAlarmHistorianWriter, a logger, and optional batchSize (default 100), capacity (default 1,000,000), deadLetterRetention (default 30 days), and a test clock.

Queue table

The sink owns one SQLite table (created on construction, WAL journal mode):

CREATE TABLE Queue (
    RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId        TEXT    NOT NULL,
    EnqueuedUtc    TEXT    NOT NULL,
    PayloadJson    TEXT    NOT NULL,   -- JSON-serialized AlarmHistorianEvent
    AttemptCount   INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT    NULL,
    LastError      TEXT    NULL,
    DeadLettered   INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IX_Queue_Drain ON Queue (DeadLettered, RowId);

EnqueueAsync does a single INSERT on the hot path. To avoid a SELECT COUNT(*) on every enqueue, the sink keeps an in-memory non-dead-lettered row counter (seeded at startup, kept current by every mutation, and re-synced from storage every 10,000 enqueues to defend against drift). SQLite writer contention is handled via PRAGMA busy_timeout=5000 + WAL so an enqueue/drain collision waits out the file lock instead of failing fast.
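The enqueue path described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not the C# implementation: the class name QueueSketch, its method names, and the exact point at which the re-sync fires are hypothetical.

```python
import json
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS Queue (
    RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId        TEXT    NOT NULL,
    EnqueuedUtc    TEXT    NOT NULL,
    PayloadJson    TEXT    NOT NULL,
    AttemptCount   INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT    NULL,
    LastError      TEXT    NULL,
    DeadLettered   INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IF NOT EXISTS IX_Queue_Drain ON Queue (DeadLettered, RowId);
"""


class QueueSketch:
    RESYNC_EVERY = 10_000  # re-count from storage every N enqueues

    def __init__(self, path: str):
        self.conn = sqlite3.connect(path)
        # WAL plus a busy timeout: a writer collision waits instead of failing fast.
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute("PRAGMA busy_timeout=5000")
        self.conn.executescript(SCHEMA)
        self.live_rows = self._count_live()      # seeded once at startup
        self.enqueues_since_resync = 0

    def _count_live(self) -> int:
        sql = "SELECT COUNT(*) FROM Queue WHERE DeadLettered = 0"
        return self.conn.execute(sql).fetchone()[0]

    def enqueue(self, alarm_id: str, payload: dict) -> None:
        # Hot path: one INSERT, no COUNT(*).
        with self.conn:
            self.conn.execute(
                "INSERT INTO Queue (AlarmId, EnqueuedUtc, PayloadJson) VALUES (?, ?, ?)",
                (alarm_id, datetime.now(timezone.utc).isoformat(), json.dumps(payload)),
            )
        self.live_rows += 1
        self.enqueues_since_resync += 1
        if self.enqueues_since_resync >= self.RESYNC_EVERY:
            # Periodic re-sync defends the counter against drift.
            self.live_rows = self._count_live()
            self.enqueues_since_resync = 0
```

The counter trades a small risk of drift for a cheaper hot path; the periodic re-count bounds how long any drift can persist.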

Drain worker

StartDrainLoop(tickInterval) starts a self-rescheduling one-shot System.Threading.Timer (not started automatically — tests drive DrainOnceAsync deterministically). Each tick:

  1. Purges aged dead-lettered rows past the retention window.
  2. Reads up to batchSize non-dead-lettered rows in RowId order.
  3. Rows with un-deserializable payloads are dead-lettered immediately (by their own RowId) so they can't stall the queue head.
  4. The remaining batch is handed to IAlarmHistorianWriter.WriteBatchAsync, and each outcome is applied in one transaction: Ack deletes the row, PermanentFail flips its DeadLettered flag, RetryPlease bumps its attempt count and leaves it queued.
  5. The timer re-arms its next due-time to max(tickInterval, currentBackoff).
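Steps 2 to 4 of the tick can be sketched as follows, again in Python with the stdlib sqlite3 module rather than the project's C#. The function name drain_once, the string outcomes, and the boolean return value are assumptions for illustration; the retention purge (step 1) and the timer re-arm (step 5) are left out.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS Queue (
    RowId INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId TEXT NOT NULL, EnqueuedUtc TEXT NOT NULL, PayloadJson TEXT NOT NULL,
    AttemptCount INTEGER NOT NULL DEFAULT 0, LastAttemptUtc TEXT NULL,
    LastError TEXT NULL, DeadLettered INTEGER NOT NULL DEFAULT 0
);
"""


def drain_once(conn, write_batch, batch_size=100):
    """Run one drain pass. Returns True for a clean batch, False if backoff is needed."""
    rows = conn.execute(
        "SELECT RowId, PayloadJson FROM Queue WHERE DeadLettered = 0 "
        "ORDER BY RowId LIMIT ?", (batch_size,)).fetchall()

    good, corrupt = [], []
    for row_id, payload in rows:
        try:
            good.append((row_id, json.loads(payload)))
        except json.JSONDecodeError:
            corrupt.append((row_id,))

    # Corrupt payloads are dead-lettered by their own RowId so they cannot stall the head.
    with conn:
        conn.executemany(
            "UPDATE Queue SET DeadLettered = 1, LastError = 'corrupt payload' "
            "WHERE RowId = ?", corrupt)
    if not good:
        return True

    try:
        outcomes = write_batch([evt for _, evt in good])
    except Exception:
        return False            # writer exception: rows stay queued
    if len(outcomes) != len(good):
        return False            # cardinality violation: rows stay queued

    clean = True
    with conn:                  # apply every outcome in one transaction
        for (row_id, _), outcome in zip(good, outcomes):
            if outcome == "Ack":
                conn.execute("DELETE FROM Queue WHERE RowId = ?", (row_id,))
            elif outcome == "PermanentFail":
                conn.execute("UPDATE Queue SET DeadLettered = 1 WHERE RowId = ?", (row_id,))
            else:               # RetryPlease
                conn.execute(
                    "UPDATE Queue SET AttemptCount = AttemptCount + 1 WHERE RowId = ?",
                    (row_id,))
                clean = False
    return clean
```

A caller would treat a False return as the signal to lengthen the next due-time.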

Backoff ladder (applied to the timer's next due-time, so a historian outage genuinely slows the drain cadence): 1s → 2s → 5s → 15s → 60s cap. Any RetryPlease outcome — or a writer exception, or a writer cardinality violation (outcome count ≠ event count) — bumps the backoff and sets DrainState = BackingOff; a clean batch resets it. The async-void timer callback is fully guarded: a fault is logged and recorded into GetStatus() rather than lost as an unobserved task exception.
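A small Python sketch of the ladder and the re-arm rule from step 5. The class name Backoff and its methods are hypothetical, and the sketch assumes the first failure lands on the 1s rung.

```python
BACKOFF_LADDER_SECONDS = [1, 2, 5, 15, 60]  # last rung is the cap


class Backoff:
    def __init__(self):
        self.step = -1  # -1 means "not backing off"

    def bump(self):
        # Climb one rung, staying on the last rung once the cap is reached.
        self.step = min(self.step + 1, len(BACKOFF_LADDER_SECONDS) - 1)

    def reset(self):
        # A clean batch clears the backoff entirely.
        self.step = -1

    @property
    def current(self):
        return 0 if self.step < 0 else BACKOFF_LADDER_SECONDS[self.step]

    def next_due(self, tick_interval):
        # The timer re-arms to max(tickInterval, currentBackoff).
        return max(tick_interval, self.current)
```

With a 2s tick interval, the first two rungs (1s and 2s) do not change the cadence; the drain only slows once the backoff exceeds the tick.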

Durability bound (important)

The durability guarantee is bounded by capacity (default 1,000,000 rows). When the non-dead-lettered queue reaches capacity, EnqueueAsync evicts the oldest non-dead-lettered rows (oldest RowId first) to make room, logs a WARN, and increments HistorianSinkStatus.EvictedCount. Under a sustained historian outage, accepted alarm events can therefore be dropped before delivery. A non-zero EvictedCount is a data-loss signal that requires operator attention; because it is part of the status snapshot, otherwise-silent loss is visible without log scraping.
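The eviction rule can be sketched in Python as below. This is an assumption-laden illustration: enqueue_bounded and the status dictionary are hypothetical, and the sketch counts rows with COUNT(*) for brevity, whereas the real sink is described as using an in-memory counter.

```python
import logging
import sqlite3

log = logging.getLogger("alarm-historian-sketch")

SCHEMA = """
CREATE TABLE IF NOT EXISTS Queue (
    RowId INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId TEXT NOT NULL, EnqueuedUtc TEXT NOT NULL, PayloadJson TEXT NOT NULL,
    AttemptCount INTEGER NOT NULL DEFAULT 0, LastAttemptUtc TEXT NULL,
    LastError TEXT NULL, DeadLettered INTEGER NOT NULL DEFAULT 0
);
"""


def enqueue_bounded(conn, capacity, alarm_id, enqueued_utc, payload_json, status):
    """Insert one row, evicting the oldest live rows first if the queue is full."""
    with conn:
        live = conn.execute(
            "SELECT COUNT(*) FROM Queue WHERE DeadLettered = 0").fetchone()[0]
        overflow = live + 1 - capacity
        if overflow > 0:
            # Oldest RowId first; dead-lettered rows are never evicted here.
            cur = conn.execute(
                "DELETE FROM Queue WHERE RowId IN ("
                " SELECT RowId FROM Queue WHERE DeadLettered = 0"
                " ORDER BY RowId LIMIT ?)", (overflow,))
            status["EvictedCount"] += cur.rowcount
            log.warning("queue at capacity %d, evicted %d row(s)", capacity, cur.rowcount)
        conn.execute(
            "INSERT INTO Queue (AlarmId, EnqueuedUtc, PayloadJson) VALUES (?, ?, ?)",
            (alarm_id, enqueued_utc, payload_json))
```

An operator-facing check then reduces to alerting whenever EvictedCount is greater than zero.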

Dead-letter + operator recovery

PermanentFail and corrupt-payload rows are retained in-place with DeadLettered = 1 for the retention window (default 30 days) so operators can inspect them before the sweeper purges them. RetryDeadLettered() is the operator action (from the AdminUI) that clears the dead-letter flag and attempt count on every dead-lettered row, returning them to the regular queue with a fresh backoff.
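Both dead-letter operations reduce to single statements against the Queue table. The Python sketch below shows one way they could look. The function names are hypothetical, and keying the purge on EnqueuedUtc is an assumption, since the text does not say which timestamp the retention window is measured from.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS Queue (
    RowId INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId TEXT NOT NULL, EnqueuedUtc TEXT NOT NULL, PayloadJson TEXT NOT NULL,
    AttemptCount INTEGER NOT NULL DEFAULT 0, LastAttemptUtc TEXT NULL,
    LastError TEXT NULL, DeadLettered INTEGER NOT NULL DEFAULT 0
);
"""


def retry_dead_lettered(conn) -> int:
    """Return every dead-lettered row to the regular queue; returns rows revived."""
    with conn:
        cur = conn.execute(
            "UPDATE Queue SET DeadLettered = 0, AttemptCount = 0 WHERE DeadLettered = 1")
    return cur.rowcount


def purge_aged_dead_letters(conn, now, retention=timedelta(days=30)) -> int:
    """Delete dead-lettered rows older than the retention window; returns rows purged."""
    # Assumption: age is measured from EnqueuedUtc, stored as ISO-8601 UTC text,
    # so string comparison orders the timestamps chronologically.
    cutoff = (now - retention).isoformat()
    with conn:
        cur = conn.execute(
            "DELETE FROM Queue WHERE DeadLettered = 1 AND EnqueuedUtc < ?", (cutoff,))
    return cur.rowcount
```

Because revived rows keep their original RowId, they rejoin the drain order at their old position rather than at the tail.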


Runtime wiring

HistorianAdapterActor (Runtime/Historian/HistorianAdapterActor.cs) subscribes to the cluster alerts DPS topic and translates each AlarmTransitionEvent into an AlarmHistorianEvent, then calls IAlarmHistorianSink.EnqueueAsync fire-and-forget so the actor loop is never blocked on historian reachability. The actor is Primary-gated: only the node whose RedundancyRole is Primary historizes, giving exactly-once writes across a redundant pair. AlarmTransitionEvent carries AlarmTypeName (the Part 9 subtype string) and Comment (the operator comment from the originating ack/shelve command) that populate the corresponding fields of AlarmHistorianEvent. WonderwareHistorianClient is the IAlarmHistorianWriter the drain worker delegates to. See ServiceHosting.md for the sidecar setup.
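The Primary gate and the fire-and-forget hand-off can be illustrated with a small asyncio sketch in Python. It is not the actor implementation: on_transition, RecordingSink, the dictionary-shaped events, and the pending-task set are all hypothetical stand-ins.

```python
import asyncio
import logging

log = logging.getLogger("historian-adapter-sketch")


class RecordingSink:
    """Stand-in sink that just remembers what it was given."""

    def __init__(self):
        self.events = []

    async def enqueue(self, evt):
        self.events.append(evt)


def _log_failure(task):
    # Surface enqueue faults instead of losing them as unobserved exceptions.
    if not task.cancelled() and task.exception() is not None:
        log.error("enqueue failed: %r", task.exception())


def on_transition(transition, sink, is_primary, pending):
    """Translate one alarm transition and hand it to the sink without awaiting."""
    if not is_primary:
        return  # only the Primary node historizes
    evt = {
        "AlarmId": transition["AlarmId"],
        "EventKind": transition["EventKind"],
        "AlarmTypeName": transition.get("AlarmTypeName"),
        "Comment": transition.get("Comment"),
    }
    task = asyncio.create_task(sink.enqueue(evt))  # fire-and-forget
    pending.add(task)                              # keep a reference until done
    task.add_done_callback(pending.discard)
    task.add_done_callback(_log_failure)
```

on_transition must be called from code running inside an event loop, since asyncio.create_task needs a running loop.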

Scope: scripted alarms only. Galaxy-native alarms historize via System Platform's HistorizeToAveva toggle (not this actor); AB CIP ALMD alarms are not yet on the alerts topic (future work).

Configuration

The real sink is opt-in via the AlarmHistorian section of appsettings.json. When Enabled is false (the default), AddAlarmHistorian registers NullAlarmHistorianSink and the feature is dormant. When Enabled is true, AddAlarmHistorian constructs SqliteStoreAndForwardSink and registers WonderwareHistorianClient as the IAlarmHistorianWriter.

{
  "AlarmHistorian": {
    "Enabled": true,
    "DatabasePath": "C:\\ProgramData\\OtOpcUa\\alarmhistorian.db",
    "SharedSecret": "<token from historian sidecar config>",
    "BatchSize": 100
  },
  "Historian": {
    "Wonderware": {
      "Host": "localhost",
      "Port": 32569,
      "UseTls": false,
      "ServerCertThumbprint": ""
    }
  }
}
| Key | Type | Default | Description |
|---|---|---|---|
| Enabled | bool | false | Enable the real SQLite + Wonderware sink. false → NullAlarmHistorianSink. |
| DatabasePath | string | | Absolute path to the SQLite queue file. Created on first use (WAL mode). Required when Enabled. |
| SharedSecret | string | | Shared secret token the sidecar expects on every connection. Required when Enabled. |
| BatchSize | int | 100 | Max rows per drain cycle handed to IAlarmHistorianWriter.WriteBatchAsync. |
| AlarmHistorian:Host | string | localhost | DNS name or IP of the machine running the historian sidecar. |
| AlarmHistorian:Port | int | 32569 | TCP port the sidecar listens on (OTOPCUA_HISTORIAN_TCP_PORT). |
| AlarmHistorian:UseTls | bool | false | Wrap the TCP stream in TLS before the Hello handshake. |
| AlarmHistorian:ServerCertThumbprint | string | | Optional SHA-1 thumbprint to pin the sidecar's TLS server certificate. Leave empty to use normal CA-chain validation. |
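The "Required when Enabled" rules lend themselves to a startup check. The Python sketch below shows the idea against a parsed appsettings.json; validate_alarm_historian is a hypothetical helper, not part of the project, and the positive-BatchSize check is an added assumption.

```python
import json


def validate_alarm_historian(config: dict) -> list:
    """Return a list of problems with the AlarmHistorian section (empty if none)."""
    section = config.get("AlarmHistorian", {})
    if not section.get("Enabled", False):
        return []  # disabled is the default; nothing else is required
    errors = []
    for key in ("DatabasePath", "SharedSecret"):
        if not section.get(key):
            errors.append("AlarmHistorian:%s is required when Enabled is true" % key)
    batch_size = section.get("BatchSize", 100)
    # Assumption: a batch size below 1 is never meaningful.
    if not isinstance(batch_size, int) or batch_size < 1:
        errors.append("AlarmHistorian:BatchSize must be a positive integer")
    return errors


example = json.loads("""
{
  "AlarmHistorian": {
    "Enabled": true,
    "DatabasePath": "C:\\\\ProgramData\\\\OtOpcUa\\\\alarmhistorian.db",
    "BatchSize": 100
  }
}
""")
problems = validate_alarm_historian(example)
```

Here the example omits SharedSecret, so problems holds a single message naming that key.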

Dev and docker-dev deployments leave Enabled unset (it defaults to false), so alarm transitions are not historized anywhere. Set Enabled to true only where a historian sidecar is present.


See also

  • AlarmTracking.md — the three alarm sources and the OPC UA Part 9 surface; which alarms route to this sink.
  • DriverLifecycle.md: IHistorianDataSource (the historian read surface; this page covers the write path) and the WonderwareHistorianClient.
  • ScriptedAlarms.md — the scripted-alarm engine that emits most events into this sink.
  • ServiceHosting.md — the optional Wonderware historian sidecar.