# Alarm Historian — store-and-forward SQLite sink
Reference for `ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian`
([`src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/)),
the durable local queue that historizes alarm transitions to AVEVA Historian
without ever blocking the alarm engine or operator actions.
This is the *sink mechanics* doc. For how the three alarm sources converge on
the OPC UA Part 9 surface and which alarms route here, see
[AlarmTracking.md](AlarmTracking.md). For the historian client that drains this
queue, see [DriverLifecycle.md](DriverLifecycle.md#ihistoriandatasource--server-side-historian-read-surface)
and [ServiceHosting.md](ServiceHosting.md).

---
## Why store-and-forward
Scripted alarms (and any future non-Galaxy `IAlarmSource`, e.g. AB CIP ALMD)
must reach AVEVA Historian, but the historian sidecar can be slow, busy, or
disconnected. The sink decouples the alarm engine from historian reachability:
every qualifying transition is committed to a **local SQLite queue first**, and
a background drain worker forwards rows to the historian on a backoff-aware
cadence. Operator acks and alarm-state transitions are never blocked waiting on
the historian.
> Galaxy-native alarms with `$Alarm*` extensions reach AVEVA Historian directly
> via System Platform's `HistorizeToAveva` toggle — they do **not** flow through
> this sink. This path is exclusively for non-Galaxy alarm producers.
---
## Contracts
All in
[`IAlarmHistorianSink.cs`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs)
unless noted.
- **`IAlarmHistorianSink`** — the intake contract. `EnqueueAsync(evt, ct)`
durably enqueues an event and returns as soon as the queue row is committed
(fire-and-forget from the engine's perspective; the sink must not block the
emitting thread). `GetStatus()` returns a `HistorianSinkStatus` snapshot.
- **`NullAlarmHistorianSink`** — the no-op default for tests and deployments
that don't historize alarms. It is the default DI binding (registered in the
Runtime's `AddOtOpcUaRuntime`); production overrides it with
`SqliteStoreAndForwardSink`.
- **`AlarmHistorianEvent`**
([`AlarmHistorianEvent.cs`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/AlarmHistorianEvent.cs))
— the source-agnostic event record: `AlarmId`, `EquipmentPath` (UNS path,
doubles as Historian's SourceNode), `AlarmName`, `AlarmTypeName` (Part 9
subtype), `Severity`, `EventKind` (free-form transition string —
"Activated"/"Cleared"/"Acknowledged"/etc.), `Message`, `User`, `Comment`,
`TimestampUtc`.
- **`IAlarmHistorianWriter`** — what the drain worker delegates writes to.
`WriteBatchAsync(batch, ct)` returns one `HistorianWriteOutcome` per event,
in order. Production binds this to `WonderwareHistorianClient` (the AVEVA
Historian sidecar IPC client).
- **`HistorianWriteOutcome`** — per-event drain result: `Ack` (persisted,
remove from queue), `RetryPlease` (transient failure — leave queued, retry
after backoff), `PermanentFail` (malformed/unrecoverable — move to
dead-letter).
- **`HistorianSinkStatus`** — diagnostic snapshot surfaced to the AdminUI and
`/healthz`: `QueueDepth`, `DeadLetterDepth`, `LastDrainUtc`, `LastSuccessUtc`,
`LastError`, `DrainState`, and `EvictedCount`.
- **`HistorianDrainState`** — `Disabled` / `Idle` / `Draining` / `BackingOff`.
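The contracts above are C# types. As a language-neutral illustration, the Python sketch below mirrors the shape of the event record, the outcome enum, and the writer contract. The snake_case names, the field types (for example a numeric `severity`), and `RejectEmptyMessageWriter` are assumptions made for illustration only; they are not taken from the codebase. The point it shows is the writer rule stated above: one outcome per event, in input order.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Protocol, Sequence


class HistorianWriteOutcome(Enum):
    ACK = "Ack"                        # persisted: remove from queue
    RETRY_PLEASE = "RetryPlease"       # transient failure: leave queued, back off
    PERMANENT_FAIL = "PermanentFail"   # unrecoverable: move to dead-letter


@dataclass(frozen=True)
class AlarmHistorianEvent:
    alarm_id: str
    equipment_path: str     # UNS path, doubles as the Historian SourceNode
    alarm_name: str
    alarm_type_name: str    # Part 9 subtype
    severity: int           # assumption: numeric severity
    event_kind: str         # "Activated", "Cleared", "Acknowledged", ...
    message: str
    user: str
    comment: str
    timestamp_utc: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))


class AlarmHistorianWriter(Protocol):
    def write_batch(
        self, batch: Sequence[AlarmHistorianEvent]
    ) -> List[HistorianWriteOutcome]:
        """Return exactly one outcome per event, in the same order."""
        ...


class RejectEmptyMessageWriter:
    """Hypothetical writer: treats an empty message as malformed."""

    def write_batch(self, batch):
        return [
            HistorianWriteOutcome.ACK if evt.message
            else HistorianWriteOutcome.PERMANENT_FAIL
            for evt in batch
        ]
```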
---
## SqliteStoreAndForwardSink
[`SqliteStoreAndForwardSink.cs`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs)
is the production `IAlarmHistorianSink`. Construction takes a SQLite database
path, an `IAlarmHistorianWriter`, a logger, and optional `batchSize` (default
100), `capacity` (default 1,000,000), `deadLetterRetention` (default 30 days),
and a test clock.
### Queue table
The sink owns one SQLite table (created on construction, WAL journal mode):
```sql
CREATE TABLE Queue (
RowId INTEGER PRIMARY KEY AUTOINCREMENT,
AlarmId TEXT NOT NULL,
EnqueuedUtc TEXT NOT NULL,
PayloadJson TEXT NOT NULL, -- JSON-serialized AlarmHistorianEvent
AttemptCount INTEGER NOT NULL DEFAULT 0,
LastAttemptUtc TEXT NULL,
LastError TEXT NULL,
DeadLettered INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IX_Queue_Drain ON Queue (DeadLettered, RowId);
```
`EnqueueAsync` does a single `INSERT` on the hot path. To avoid a
`SELECT COUNT(*)` on every enqueue, the sink keeps an in-memory count of
non-dead-lettered rows: seeded at startup, updated by every mutation, and
re-synced from storage every 10,000 enqueues to guard against drift. SQLite
writer contention is handled with WAL plus `PRAGMA busy_timeout=5000`, so an
enqueue/drain collision waits up to 5 seconds for the file lock instead of
failing immediately.
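The enqueue path described above can be sketched with Python's standard `sqlite3` module. This is an illustrative sketch, not the C# implementation: the class name `QueueSketch`, its attributes, and the `IF NOT EXISTS` clauses are assumptions. It shows the single `INSERT`, the in-memory live-row counter, and the periodic re-sync.

```python
import json
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS Queue (
    RowId INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId TEXT NOT NULL,
    EnqueuedUtc TEXT NOT NULL,
    PayloadJson TEXT NOT NULL,
    AttemptCount INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT NULL,
    LastError TEXT NULL,
    DeadLettered INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IF NOT EXISTS IX_Queue_Drain ON Queue (DeadLettered, RowId);
"""

RESYNC_EVERY = 10_000  # enqueues between counter re-syncs


class QueueSketch:
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        # WAL is a no-op for in-memory databases but applies to file paths.
        self.db.execute("PRAGMA journal_mode=WAL")
        self.db.execute("PRAGMA busy_timeout=5000")
        self.db.executescript(SCHEMA)
        self.live_rows = self._count_live()  # seeded once at startup
        self._since_resync = 0

    def _count_live(self):
        return self.db.execute(
            "SELECT COUNT(*) FROM Queue WHERE DeadLettered = 0"
        ).fetchone()[0]

    def enqueue(self, alarm_id, payload):
        now = datetime.now(timezone.utc).isoformat()
        with self.db:  # commits the single INSERT
            self.db.execute(
                "INSERT INTO Queue (AlarmId, EnqueuedUtc, PayloadJson) "
                "VALUES (?, ?, ?)",
                (alarm_id, now, json.dumps(payload)),
            )
        self.live_rows += 1
        self._since_resync += 1
        if self._since_resync >= RESYNC_EVERY:
            # Defend against drift between the counter and storage.
            self.live_rows = self._count_live()
            self._since_resync = 0
```

The counter is only useful if every other mutation (drain, eviction, dead-lettering, retry) adjusts it as well; the sketch leaves those paths out.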
### Drain worker
`StartDrainLoop(tickInterval)` starts a **self-rescheduling one-shot
`System.Threading.Timer`** (not started automatically — tests drive
`DrainOnceAsync` deterministically). Each tick:
1. Purges dead-lettered rows that have aged past the retention window.
2. Reads up to `batchSize` non-dead-lettered rows in `RowId` order.
3. Dead-letters any row whose payload cannot be deserialized immediately (by
   its own `RowId`), so a corrupt row can't stall the queue head.
4. Hands the remaining batch to `IAlarmHistorianWriter.WriteBatchAsync` and
   applies each outcome in one transaction: `Ack` deletes the row,
   `PermanentFail` sets its `DeadLettered` flag, and `RetryPlease` bumps its
   attempt count and leaves it queued.
5. Re-arms the timer with a next due-time of `max(tickInterval, currentBackoff)`.
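Steps 2 to 4 above can be sketched as a single function over a `sqlite3` connection to a database that already holds the `Queue` table. This is a hedged illustration rather than the real `DrainOnceAsync`: the function name, the string outcome values, the `LastError` text, and the choice to leave rows untouched on a writer exception or cardinality violation are all assumptions. The retention purge (step 1) and timer re-arm (step 5) are omitted. The return value signals whether the batch was clean (`True`) or should bump the backoff (`False`).

```python
import json
import sqlite3


def drain_once(db: sqlite3.Connection, write_batch, batch_size=100):
    rows = db.execute(
        "SELECT RowId, PayloadJson FROM Queue "
        "WHERE DeadLettered = 0 ORDER BY RowId LIMIT ?",
        (batch_size,),
    ).fetchall()

    good, corrupt = [], []
    for row_id, payload in rows:
        try:
            good.append((row_id, json.loads(payload)))
        except json.JSONDecodeError:
            corrupt.append(row_id)

    # Corrupt rows are dead-lettered by their own RowId so they cannot
    # stall the head of the queue.
    with db:
        db.executemany(
            "UPDATE Queue SET DeadLettered = 1, "
            "LastError = 'corrupt payload' WHERE RowId = ?",
            [(row_id,) for row_id in corrupt],
        )

    if not good:
        return True

    try:
        outcomes = write_batch([evt for _, evt in good])
    except Exception:
        return False  # writer exception: back off (rows left as-is here)
    if len(outcomes) != len(good):
        return False  # cardinality violation: back off

    clean = True
    with db:  # all outcomes applied in one transaction
        for (row_id, _), outcome in zip(good, outcomes):
            if outcome == "Ack":
                db.execute("DELETE FROM Queue WHERE RowId = ?", (row_id,))
            elif outcome == "PermanentFail":
                db.execute(
                    "UPDATE Queue SET DeadLettered = 1 WHERE RowId = ?",
                    (row_id,),
                )
            else:  # "RetryPlease"
                db.execute(
                    "UPDATE Queue SET AttemptCount = AttemptCount + 1 "
                    "WHERE RowId = ?",
                    (row_id,),
                )
                clean = False
    return clean
```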
**Backoff ladder** (applied to the timer's next due-time, so a historian outage
genuinely slows the drain cadence): 1s → 2s → 5s → 15s → 60s cap. Any
`RetryPlease` outcome — or a writer exception, or a writer cardinality violation
(outcome count ≠ event count) — bumps the backoff and sets `DrainState =
BackingOff`; a clean batch resets it. The async-void timer callback is fully
guarded: a fault is logged and recorded into `GetStatus()` rather than lost as
an unobserved task exception.
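The ladder and the re-arm rule can be sketched in a few lines. The rungs and the `max(tickInterval, currentBackoff)` rule come from the text above; the class and function names are hypothetical, and two details are assumptions: that the first failure lands on the 1 s rung, and that a reset means no backoff at all (so the plain tick interval applies).

```python
BACKOFF_LADDER = [1, 2, 5, 15, 60]  # seconds; the last rung is the cap


class Backoff:
    def __init__(self):
        self._step = -1  # -1 means "not backing off"

    @property
    def current(self):
        """Current backoff in seconds (0 when not backing off)."""
        return 0 if self._step < 0 else BACKOFF_LADDER[self._step]

    def bump(self):
        """Climb one rung, staying on the cap once it is reached."""
        self._step = min(self._step + 1, len(BACKOFF_LADDER) - 1)

    def reset(self):
        """A clean batch clears the backoff."""
        self._step = -1


def next_due(tick_interval, backoff):
    """Next timer due-time in seconds: max(tickInterval, currentBackoff)."""
    return max(tick_interval, backoff.current)
```

With a 0.5 s tick interval, repeated failures yield due-times of 1, 2, 5, 15, 60, 60, ... seconds, and a single clean batch returns the cadence to 0.5 s. With a 10 s tick interval, the first three rungs have no visible effect because the tick interval dominates.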
### Durability bound (important)
**The durability guarantee is bounded by `capacity` (default 1,000,000 rows).**
When the non-dead-lettered queue reaches capacity, `EnqueueAsync` evicts the
oldest non-dead-lettered rows (oldest `RowId` first) to make room, logs a WARN,
and increments `HistorianSinkStatus.EvictedCount`. Under a sustained historian
outage, accepted alarm events can therefore be dropped before delivery. A
non-zero `EvictedCount` is a data-loss signal that requires operator attention —
it surfaces silent loss without log scraping.
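A minimal sketch of the eviction step, assuming a `sqlite3` connection to a database holding the `Queue` table and a caller that tracks the live-row count. The function name and the policy of evicting exactly enough rows to admit one incoming event are assumptions; the source states only that the oldest non-dead-lettered rows are evicted first.

```python
import sqlite3


def evict_for_capacity(db: sqlite3.Connection, live_rows: int, capacity: int) -> int:
    """Make room for one incoming row; return how many rows were evicted."""
    overflow = live_rows + 1 - capacity
    if overflow <= 0:
        return 0
    with db:
        cur = db.execute(
            "DELETE FROM Queue WHERE RowId IN ("
            "  SELECT RowId FROM Queue WHERE DeadLettered = 0"
            "  ORDER BY RowId LIMIT ?)",
            (overflow,),
        )
    # The caller would add this to EvictedCount, subtract it from the
    # in-memory live-row counter, and log a warning.
    return cur.rowcount
```

Dead-lettered rows are skipped by the `WHERE DeadLettered = 0` filter, so they neither count toward capacity nor get evicted by this path.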
### Dead-letter + operator recovery
`PermanentFail` and corrupt-payload rows are retained in-place with
`DeadLettered = 1` for the retention window (default 30 days) so operators can
inspect them before the sweeper purges them. `RetryDeadLettered()` is the
operator action (from the AdminUI) that clears the dead-letter flag and attempt
count on every dead-lettered row, returning them to the regular queue with a
fresh backoff.
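The recovery action amounts to one `UPDATE`. The sketch below assumes a `sqlite3` connection to a database holding the `Queue` table; the function name is hypothetical, and leaving `LastError` untouched is an assumption, since the source mentions only the dead-letter flag and the attempt count. The returned row count is what a caller would add back to the in-memory live-row counter.

```python
import sqlite3


def retry_dead_lettered(db: sqlite3.Connection) -> int:
    """Return every dead-lettered row to the regular queue.

    Clears the dead-letter flag and the attempt count; rows that were
    never dead-lettered keep their existing attempt count.
    """
    with db:
        cur = db.execute(
            "UPDATE Queue SET DeadLettered = 0, AttemptCount = 0 "
            "WHERE DeadLettered = 1"
        )
    return cur.rowcount  # number of rows revived
```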
---
## Runtime wiring
Production routes alarm transitions through the Akka cluster. The
`HistorianAdapterActor`
([`Runtime/Historian/HistorianAdapterActor.cs`](../src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs))
bridges messages from the scripted-alarm actor into the sink's `EnqueueAsync`,
fire-and-forget so the actor loop is never blocked on historian reachability.
The `WonderwareHistorianClient` is the `IAlarmHistorianWriter` the drain worker
delegates to. See [ServiceHosting.md](ServiceHosting.md) for the sidecar setup.

---
## See also
- [AlarmTracking.md](AlarmTracking.md) — the three alarm sources and the OPC UA
Part 9 surface; which alarms route to this sink.
- [DriverLifecycle.md](DriverLifecycle.md) — `IHistorianDataSource` (the
historian *read* surface; this page covers the *write* path) and the
`WonderwareHistorianClient`.
- [ScriptedAlarms.md](ScriptedAlarms.md) — the scripted-alarm engine that emits
most events into this sink.
- [ServiceHosting.md](ServiceHosting.md) — the optional Wonderware historian
sidecar.