lmxopcua/docs/AlarmHistorian.md
Joseph Doherty 2124f21ab6
docs(historian-gateway): document gateway backend, config keys, EnsureTags hook, known gates; retire Wonderware from docs
HistorianGateway is now the sole historian backend (read + alarm SendEvent +
continuous WriteLiveValues). Document the final state and retire the Wonderware
sidecar from the docs/config/labels:

- CLAUDE.md: rewrite the Historian section — ServerHistorian /
  ContinuousHistorization / AlarmHistorian config keys, the IHistorianProvisioning
  EnsureTags hook, the GatewayAlarmHistorianWriter SendEvent path + ReadEvents
  dependency on gateway RuntimeDb:EventReadsEnabled=true, gateway-side
  prerequisites (RuntimeDb flags + historian:read/write/tags:write scopes),
  migration note, and two KNOWN-LIMITATION callouts (live-validation gate +
  empty historized-ref-set recorder follow-on).
- appsettings.json: fix the stale ServerHistorian block (Host/Port/SharedSecret/
  ServerCertThumbprint -> Endpoint/ApiKey/UseTls/AllowUntrustedServerCertificate/
  CaCertificatePath/CallTimeout, keep MaxTieClusterOverfetch); add a disabled
  ContinuousHistorization block; prune the orphaned Wonderware keys from
  AlarmHistorian (keep the SQLite knobs). ApiKey env-supplied via
  ServerHistorian__ApiKey (commented; valid strict JSON via _comment keys).
- README.md + docs (Historian.md, AlarmHistorian.md, Configuration.md,
  ServiceHosting.md, DriverLifecycle.md, drivers/README.md, Uns.md, VirtualTags.md,
  AlarmTracking.md, Client.UI.md, README.md, TestConnectProbes.md): retire the
  Wonderware historian backend from current-backend descriptions; fix the stale
  ServerHistorian/AlarmHistorian config tables (now gateway shape); convert
  drivers/Historian.Wonderware.md to a retired stub pointing at the gateway.
- Source/UI labels (descriptive text only, no behavior change):
  OtOpcUaServerHostedService.cs, HistoryPaging.cs, OtOpcUaSdkServer.cs,
  HistorianAdapterActor.cs, VirtualTagModal.razor, ScriptedAlarmModal.razor,
  AlarmsHistorian.razor now name the HistorianGateway backend.

Build clean (0 errors); AdminUI.Tests green (514 passed).

Claude-Session: https://claude.ai/code/session_012SDSQ3AcaXqPcBtDESBRii
2026-06-26 19:46:27 -04:00

Alarm Historian — store-and-forward SQLite sink

Reference for ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian (src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/), the durable local queue that historizes alarm transitions to AVEVA Historian without ever blocking the alarm engine or operator actions.

This is the sink mechanics doc. For how the three alarm sources converge on the OPC UA Part 9 surface and which alarms route here, see AlarmTracking.md. For the historian client that drains this queue, see DriverLifecycle.md and ServiceHosting.md.


Why store-and-forward

Scripted alarms (and any future non-Galaxy IAlarmSource, e.g. AB CIP ALMD) must reach AVEVA Historian, but the historian gateway can be slow, busy, or disconnected. The sink decouples the alarm engine from historian reachability: every qualifying transition is committed to a local SQLite queue first, and a background drain worker forwards rows to the historian on a backoff-aware cadence. Operator acks and alarm-state transitions are never blocked waiting on the historian.

Galaxy-native alarms with $Alarm* extensions reach AVEVA Historian directly via System Platform's HistorizeToAveva toggle — they do not flow through this sink. This path is exclusively for non-Galaxy alarm producers.


Contracts

All in IAlarmHistorianSink.cs unless noted.

  • IAlarmHistorianSink — the intake contract. EnqueueAsync(evt, ct) durably enqueues an event and returns as soon as the queue row is committed (fire-and-forget from the engine's perspective; the sink must not block the emitting thread). GetStatus() returns a HistorianSinkStatus snapshot.
  • NullAlarmHistorianSink — the no-op default for tests and deployments that don't historize alarms. It is the default DI binding (registered in the Runtime's AddOtOpcUaRuntime); production overrides it with SqliteStoreAndForwardSink.
  • AlarmHistorianEvent (AlarmHistorianEvent.cs) — the source-agnostic event record: AlarmId, EquipmentPath (UNS path, doubles as Historian's SourceNode), AlarmName, AlarmTypeName (Part 9 subtype), Severity, EventKind (free-form transition string — "Activated"/"Cleared"/"Acknowledged"/etc.), Message, User, Comment, TimestampUtc.
  • IAlarmHistorianWriter — what the drain worker delegates writes to. WriteBatchAsync(batch, ct) returns one HistorianWriteOutcome per event, in order. Production binds this to GatewayAlarmHistorianWriter (the HistorianGateway SendEvent path).
  • HistorianWriteOutcome — per-event drain result: Ack (persisted, remove from queue), RetryPlease (transient failure — leave queued, retry after backoff), PermanentFail (malformed/unrecoverable — move to dead-letter).
  • HistorianSinkStatus — diagnostic snapshot surfaced to the AdminUI and /healthz: QueueDepth, DeadLetterDepth, LastDrainUtc, LastSuccessUtc, LastError, DrainState, and EvictedCount.
  • HistorianDrainState: Disabled / Idle / Draining / BackingOff.
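
The contracts above can be pictured with a small, language-neutral sketch. It is written in Python rather than the project's own language, and it is not the real type definitions: the field types (for example an integer Severity and optional User/Comment) and the AlwaysAckWriter test double are assumptions made for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional, Protocol


class HistorianWriteOutcome(Enum):
    ACK = "Ack"                       # persisted, remove from queue
    RETRY_PLEASE = "RetryPlease"      # transient failure, keep queued
    PERMANENT_FAIL = "PermanentFail"  # unrecoverable, move to dead-letter


@dataclass(frozen=True)
class AlarmHistorianEvent:
    # Field names follow the doc; the Python types are assumptions.
    alarm_id: str
    equipment_path: str
    alarm_name: str
    alarm_type_name: str
    severity: int
    event_kind: str
    message: str
    user: Optional[str]
    comment: Optional[str]
    timestamp_utc: datetime


class AlarmHistorianWriter(Protocol):
    def write_batch(
        self, batch: List[AlarmHistorianEvent]
    ) -> List[HistorianWriteOutcome]: ...


class AlwaysAckWriter:
    """Hypothetical test double: acknowledges every event, in order."""

    def write_batch(self, batch):
        return [HistorianWriteOutcome.ACK for _ in batch]


def outcomes_are_valid(batch, outcomes) -> bool:
    """One outcome per event; a count mismatch is treated as a failure."""
    return len(outcomes) == len(batch)


EXAMPLE_EVENT = AlarmHistorianEvent(
    alarm_id="alarm-1",
    equipment_path="Site/Line1/Pump",
    alarm_name="HighTemp",
    alarm_type_name="LimitAlarmType",
    severity=700,
    event_kind="Activated",
    message="Temperature above limit",
    user=None,
    comment=None,
    timestamp_utc=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
```

The one-outcome-per-event rule is what lets the drain worker pair each result with its queue row by position.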

SqliteStoreAndForwardSink

SqliteStoreAndForwardSink.cs is the production IAlarmHistorianSink. Construction takes a SQLite database path, an IAlarmHistorianWriter, a logger, and optional batchSize (default 100), capacity (default 1,000,000), deadLetterRetention (default 30 days), and a test clock.

Queue table

The sink owns one SQLite table (created on construction, WAL journal mode):

CREATE TABLE Queue (
    RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId        TEXT    NOT NULL,
    EnqueuedUtc    TEXT    NOT NULL,
    PayloadJson    TEXT    NOT NULL,   -- JSON-serialized AlarmHistorianEvent
    AttemptCount   INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT    NULL,
    LastError      TEXT    NULL,
    DeadLettered   INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IX_Queue_Drain ON Queue (DeadLettered, RowId);

EnqueueAsync does a single INSERT on the hot path. To avoid a SELECT COUNT(*) on every enqueue, the sink keeps an in-memory non-dead-lettered row counter (seeded at startup, kept current by every mutation, and re-synced from storage every 10,000 enqueues to defend against drift). SQLite writer contention is handled via PRAGMA busy_timeout=5000 + WAL so an enqueue/drain collision waits out the file lock instead of failing fast.
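
A minimal Python sketch of the enqueue hot path described above, assuming a hypothetical QueueSketch class (the real sink is SqliteStoreAndForwardSink and is not reproduced here). It shows the single INSERT, the in-memory live-row counter, and the periodic re-sync; the method and attribute names are illustrative.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS Queue (
    RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId        TEXT    NOT NULL,
    EnqueuedUtc    TEXT    NOT NULL,
    PayloadJson    TEXT    NOT NULL,
    AttemptCount   INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT    NULL,
    LastError      TEXT    NULL,
    DeadLettered   INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IF NOT EXISTS IX_Queue_Drain ON Queue (DeadLettered, RowId);
"""

RESYNC_EVERY = 10_000  # re-count from storage to defend against drift


class QueueSketch:
    """Hypothetical stand-in for the sink's enqueue side."""

    def __init__(self, path: str):
        self.conn = sqlite3.connect(path)
        # WAL + busy_timeout: a colliding writer waits instead of failing fast.
        # (An in-memory database ignores WAL; a file-backed one honours it.)
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute("PRAGMA busy_timeout=5000")
        self.conn.executescript(SCHEMA)
        self.live_rows = self._count_live()  # seeded once at startup
        self.enqueues_since_resync = 0

    def _count_live(self) -> int:
        row = self.conn.execute(
            "SELECT COUNT(*) FROM Queue WHERE DeadLettered = 0"
        ).fetchone()
        return row[0]

    def enqueue(self, alarm_id: str, enqueued_utc: str, payload: dict) -> None:
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO Queue (AlarmId, EnqueuedUtc, PayloadJson) "
                "VALUES (?, ?, ?)",
                (alarm_id, enqueued_utc, json.dumps(payload)),
            )
        self.live_rows += 1  # no SELECT COUNT(*) on the hot path
        self.enqueues_since_resync += 1
        if self.enqueues_since_resync >= RESYNC_EVERY:
            self.live_rows = self._count_live()
            self.enqueues_since_resync = 0
```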

Drain worker

StartDrainLoop(tickInterval) starts a self-rescheduling one-shot System.Threading.Timer (not started automatically — tests drive DrainOnceAsync deterministically). Each tick:

  1. Purges aged dead-lettered rows past the retention window.
  2. Reads up to batchSize non-dead-lettered rows in RowId order.
  3. Rows with un-deserializable payloads are dead-lettered immediately (by their own RowId) so they can't stall the queue head.
  4. The remaining batch is handed to IAlarmHistorianWriter.WriteBatchAsync, and each outcome is applied in one transaction: Ack deletes the row, PermanentFail flips its DeadLettered flag, and RetryPlease bumps its attempt count and leaves it queued. A row whose AttemptCount has reached the configured MaxAttempts cap (default 10) is dead-lettered automatically on the next drain tick rather than retried. This breaks infinite retry loops for poison events whose payload the historian will always reject (e.g. a malformed alarm record that triggers a permanent SDK error on every attempt). The dead-lettered row stays in the table for the configured retention window, where it can be inspected and, if appropriate, re-queued via RetryDeadLettered().
  5. The timer re-arms its next due-time to max(tickInterval, currentBackoff).
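
The tick steps above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the sink's actual code: drain_once, the string outcomes, and the LastError texts are hypothetical, and the dead-letter purge and timer re-arm (steps 1 and 5) are left out.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS Queue (
    RowId        INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId      TEXT    NOT NULL,
    EnqueuedUtc  TEXT    NOT NULL,
    PayloadJson  TEXT    NOT NULL,
    AttemptCount INTEGER NOT NULL DEFAULT 0,
    LastError    TEXT    NULL,
    DeadLettered INTEGER NOT NULL DEFAULT 0
);
"""

DEAD_LETTER = "UPDATE Queue SET DeadLettered = 1, LastError = ? WHERE RowId = ?"


def drain_once(conn, write_batch, batch_size=100, max_attempts=10) -> bool:
    """Run one drain pass; return True only if the batch was clean."""
    rows = conn.execute(
        "SELECT RowId, PayloadJson, AttemptCount FROM Queue "
        "WHERE DeadLettered = 0 ORDER BY RowId LIMIT ?",
        (batch_size,),
    ).fetchall()

    batch = []
    with conn:
        for row_id, payload, attempts in rows:
            try:
                event = json.loads(payload)
            except json.JSONDecodeError:
                # Corrupt payloads must not stall the queue head.
                conn.execute(DEAD_LETTER, ("corrupt payload", row_id))
                continue
            if attempts >= max_attempts:
                # Poison row: stop retrying once the cap is reached.
                conn.execute(DEAD_LETTER, ("max attempts reached", row_id))
                continue
            batch.append((row_id, event))

    if not batch:
        return True

    outcomes = write_batch([event for _, event in batch])
    if len(outcomes) != len(batch):
        return False  # cardinality violation: caller should back off

    clean = True
    with conn:  # apply every outcome in one transaction
        for (row_id, _), outcome in zip(batch, outcomes):
            if outcome == "Ack":
                conn.execute("DELETE FROM Queue WHERE RowId = ?", (row_id,))
            elif outcome == "PermanentFail":
                conn.execute(DEAD_LETTER, ("permanent failure", row_id))
            else:  # RetryPlease: leave queued, count the attempt
                conn.execute(
                    "UPDATE Queue SET AttemptCount = AttemptCount + 1 "
                    "WHERE RowId = ?",
                    (row_id,),
                )
                clean = False
    return clean
```

Returning a single boolean keeps the sketch small; the caller would use it to decide whether to bump or reset its backoff.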

Backoff ladder (applied to the timer's next due-time, so a historian outage genuinely slows the drain cadence): 1s → 2s → 5s → 15s → 60s cap. Any RetryPlease outcome — or a writer exception, or a writer cardinality violation (outcome count ≠ event count) — bumps the backoff and sets DrainState = BackingOff; a clean batch resets it. The async-void timer callback is fully guarded: a fault is logged and recorded into GetStatus() rather than lost as an unobserved task exception.
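
As a rough illustration of the ladder, the Python sketch below models the backoff as an index into the documented steps and computes the next due-time as max(tickInterval, currentBackoff). The Backoff class and next_due_seconds helper are hypothetical names, and the choice that the first failure lands on the 1s rung is an assumption.

```python
BACKOFF_LADDER_SECONDS = [1, 2, 5, 15, 60]  # last rung is the cap


class Backoff:
    """Hypothetical model of the drain worker's backoff state."""

    def __init__(self):
        self.index = -1  # -1 means "not backing off"

    def bump(self) -> None:
        # Climb one rung, never past the 60s cap.
        self.index = min(self.index + 1, len(BACKOFF_LADDER_SECONDS) - 1)

    def reset(self) -> None:
        # A clean batch clears the backoff entirely.
        self.index = -1

    @property
    def current_seconds(self) -> int:
        return 0 if self.index < 0 else BACKOFF_LADDER_SECONDS[self.index]


def next_due_seconds(tick_interval: int, backoff: Backoff) -> int:
    """The timer is re-armed to whichever is longer."""
    return max(tick_interval, backoff.current_seconds)
```

With a 5-second tick interval, the first three rungs do not change the cadence; the drain only slows once the backoff exceeds the configured interval.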

Durability bound (important)

The durability guarantee is bounded by capacity (default 1,000,000 rows). When the non-dead-lettered queue reaches capacity, EnqueueAsync evicts the oldest non-dead-lettered rows (oldest RowId first) to make room, logs a WARN, and increments HistorianSinkStatus.EvictedCount. Under a sustained historian outage, accepted alarm events can therefore be dropped before delivery. A non-zero EvictedCount is a data-loss signal that requires operator attention — it surfaces silent loss without log scraping.
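
The eviction rule can be sketched in a few lines of Python. This is an illustration only: evict_to_capacity is a hypothetical helper, and the exact amount of room made per enqueue (here, space for one incoming row) is an assumption rather than something the sink's source confirms.

```python
import sqlite3


def evict_to_capacity(conn: sqlite3.Connection, capacity: int) -> int:
    """Drop the oldest live rows so one more row fits; return how many."""
    live = conn.execute(
        "SELECT COUNT(*) FROM Queue WHERE DeadLettered = 0"
    ).fetchone()[0]
    overflow = live - capacity + 1  # assumption: make room for one new row
    if overflow <= 0:
        return 0
    with conn:
        cur = conn.execute(
            "DELETE FROM Queue WHERE RowId IN ("
            "  SELECT RowId FROM Queue WHERE DeadLettered = 0 "
            "  ORDER BY RowId LIMIT ?)",
            (overflow,),
        )
    # The caller would add this to EvictedCount and log a warning.
    return cur.rowcount
```

Dead-lettered rows are untouched by the eviction, so they neither count against the capacity nor get dropped by it.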

Dead-letter + operator recovery

PermanentFail and corrupt-payload rows are retained in-place with DeadLettered = 1 for the retention window (default 30 days) so operators can inspect them before the sweeper purges them. RetryDeadLettered() is the operator action (from the AdminUI) that clears the dead-letter flag and attempt count on every dead-lettered row, returning them to the regular queue with a fresh backoff.
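
In SQL terms, the two dead-letter operations might look like the Python sketch below. Both helpers are hypothetical, and the purge compares against EnqueuedUtc purely as an assumption; the doc does not say which timestamp the retention window is measured from.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def retry_dead_lettered(conn: sqlite3.Connection) -> int:
    """Return every dead-lettered row to the regular queue."""
    with conn:
        cur = conn.execute(
            "UPDATE Queue SET DeadLettered = 0, AttemptCount = 0 "
            "WHERE DeadLettered = 1"
        )
    return cur.rowcount


def purge_aged_dead_letters(conn, retention: timedelta, now=None) -> int:
    """Delete dead-lettered rows older than the retention window."""
    now = now or datetime.now(timezone.utc)
    # ISO-8601 strings in one consistent format compare chronologically.
    cutoff = (now - retention).isoformat()
    with conn:
        cur = conn.execute(
            "DELETE FROM Queue WHERE DeadLettered = 1 AND EnqueuedUtc < ?",
            (cutoff,),
        )
    return cur.rowcount
```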


Runtime wiring

HistorianAdapterActor (Runtime/Historian/HistorianAdapterActor.cs) subscribes to the cluster alerts DPS topic and translates each AlarmTransitionEvent into an AlarmHistorianEvent, then calls IAlarmHistorianSink.EnqueueAsync fire-and-forget so the actor loop is never blocked on historian reachability. The actor is Primary-gated: only the node whose RedundancyRole is Primary historizes, giving exactly-once writes across a redundant pair. AlarmTransitionEvent carries AlarmTypeName (the Part 9 subtype string) and Comment (the operator comment from the originating ack/shelve command) that populate the corresponding fields of AlarmHistorianEvent. GatewayAlarmHistorianWriter is the IAlarmHistorianWriter the drain worker delegates to (the gateway SendEvent path). See ServiceHosting.md for the (external) HistorianGateway setup.

Scope: scripted alarms only. Galaxy-native alarms historize via System Platform's HistorizeToAveva toggle (not this actor); AB CIP ALMD is not on the alerts topic (future).

Configuration

The real sink is opt-in via the AlarmHistorian section of appsettings.json. When Enabled is false (the default), AddAlarmHistorian registers NullAlarmHistorianSink and the feature is dormant. When Enabled is true, AddAlarmHistorian constructs SqliteStoreAndForwardSink and registers GatewayAlarmHistorianWriter as the IAlarmHistorianWriter. This section carries only the Enabled gate + the SQLite store-and-forward knobs — the downstream gateway connection (endpoint / key / TLS) is sourced from the ServerHistorian section (see Historian.md).

{
  "AlarmHistorian": {
    "Enabled": true,
    "DatabasePath": "C:\\ProgramData\\OtOpcUa\\alarmhistorian.db",
    "BatchSize": 100,
    "DrainIntervalSeconds": 5,
    "Capacity": 1000000,
    "DeadLetterRetentionDays": 30
  }
}

| Key | Type | Default | Description |
|---|---|---|---|
| Enabled | bool | false | Enable the SQLite store-and-forward sink (drains to the HistorianGateway SendEvent path). false → NullAlarmHistorianSink. |
| DatabasePath | string | alarm-historian.db | Path to the SQLite queue file. Created on first use (WAL mode). Set an absolute path in production. |
| BatchSize | int | 100 | Max rows per drain cycle handed to IAlarmHistorianWriter.WriteBatchAsync. |
| DrainIntervalSeconds | int | 5 | Seconds between drain-worker ticks. |
| Capacity | long | 1000000 | Max queued rows before the sink evicts the oldest (data-loss signal via EvictedCount). |
| DeadLetterRetentionDays | int | 30 | Days to retain dead-lettered rows before purge. |
| MaxAttempts | int | 10 | Maximum delivery attempts before a poison (perpetually-retrying) row is dead-lettered automatically. Must be > 0. |
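
To make the defaults concrete, here is a small Python sketch that reads the AlarmHistorian section and fills in the documented defaults. It only mirrors the keys and defaults listed on this page; AlarmHistorianOptions and load_options are hypothetical names, not the project's actual options binding.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmHistorianOptions:
    # Defaults taken from the documented configuration keys.
    enabled: bool = False
    database_path: str = "alarm-historian.db"
    batch_size: int = 100
    drain_interval_seconds: int = 5
    capacity: int = 1_000_000
    dead_letter_retention_days: int = 30
    max_attempts: int = 10


KEY_MAP = {
    "Enabled": "enabled",
    "DatabasePath": "database_path",
    "BatchSize": "batch_size",
    "DrainIntervalSeconds": "drain_interval_seconds",
    "Capacity": "capacity",
    "DeadLetterRetentionDays": "dead_letter_retention_days",
    "MaxAttempts": "max_attempts",
}


def load_options(appsettings_text: str) -> AlarmHistorianOptions:
    """Parse the AlarmHistorian section; missing keys keep their defaults."""
    section = json.loads(appsettings_text).get("AlarmHistorian", {})
    kwargs = {KEY_MAP[k]: v for k, v in section.items() if k in KEY_MAP}
    options = AlarmHistorianOptions(**kwargs)
    if options.max_attempts <= 0:
        raise ValueError("MaxAttempts must be > 0")
    return options
```

An empty or absent section yields a disabled sink, which matches the opt-in behaviour described above.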

The downstream gateway connection lives in ServerHistorian (Endpoint + env ServerHistorian__ApiKey, UseTls, CaCertificatePath); alarm-history ReadEvents additionally requires the gateway running RuntimeDb:EventReadsEnabled=true. The old Wonderware connection keys (SharedSecret / AlarmHistorian:Host/Port/UseTls/ServerCertThumbprint) were pruned.

Dev and docker-dev deployments leave Enabled unset (defaults to false), so alarm transitions are not historized there unless the sink is explicitly enabled and a HistorianGateway is configured.


See also

  • AlarmTracking.md — the three alarm sources and the OPC UA Part 9 surface; which alarms route to this sink.
  • DriverLifecycle.md: IHistorianDataSource (the historian read surface; this page covers the write path) and the GatewayHistorianDataSource.
  • ScriptedAlarms.md — the scripted-alarm engine that emits most events into this sink.
  • ServiceHosting.md — the external HistorianGateway backend.