# Alarm Historian — store-and-forward SQLite sink

Reference for `ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian`
([`src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/)),
the durable local queue that historizes alarm transitions to AVEVA Historian
without ever blocking the alarm engine or operator actions.

This is the *sink mechanics* doc. For how the three alarm sources converge on
the OPC UA Part 9 surface and which alarms route here, see
[AlarmTracking.md](AlarmTracking.md). For the historian client that drains this
queue, see [DriverLifecycle.md](DriverLifecycle.md#ihistoriandatasource--server-side-historian-read-surface)
and [ServiceHosting.md](ServiceHosting.md).

---

## Why store-and-forward

Scripted alarms (and any future non-Galaxy `IAlarmSource`, e.g. AB CIP ALMD)
must reach AVEVA Historian, but the historian sidecar can be slow, busy, or
disconnected. The sink decouples the alarm engine from historian reachability:
every qualifying transition is committed to a **local SQLite queue first**, and
a background drain worker forwards rows to the historian on a backoff-aware
cadence. Operator acks and alarm-state transitions are never blocked waiting on
the historian.

> Galaxy-native alarms with `$Alarm*` extensions reach AVEVA Historian directly
> via System Platform's `HistorizeToAveva` toggle; they do **not** flow through
> this sink. This path is exclusively for non-Galaxy alarm producers.

---

## Contracts

All in
[`IAlarmHistorianSink.cs`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs)
unless noted.

- **`IAlarmHistorianSink`**: the intake contract. `EnqueueAsync(evt, ct)`
  durably enqueues an event and returns as soon as the queue row is committed
  (fire-and-forget from the engine's perspective; the sink must not block the
  emitting thread). `GetStatus()` returns a `HistorianSinkStatus` snapshot.
- **`NullAlarmHistorianSink`**: the no-op default for tests and deployments
  that don't historize alarms. It is the default DI binding (registered in the
  Runtime's `AddOtOpcUaRuntime`); production overrides it with
  `SqliteStoreAndForwardSink`.
- **`AlarmHistorianEvent`**
  ([`AlarmHistorianEvent.cs`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/AlarmHistorianEvent.cs)):
  the source-agnostic event record, with fields `AlarmId`, `EquipmentPath`
  (UNS path, doubles as Historian's SourceNode), `AlarmName`, `AlarmTypeName`
  (Part 9 subtype), `Severity`, `EventKind` (free-form transition string such
  as "Activated", "Cleared", or "Acknowledged"), `Message`, `User`, `Comment`,
  and `TimestampUtc`.
- **`IAlarmHistorianWriter`**: what the drain worker delegates writes to.
  `WriteBatchAsync(batch, ct)` returns one `HistorianWriteOutcome` per event,
  in order. Production binds this to `WonderwareHistorianClient` (the AVEVA
  Historian sidecar IPC client).
- **`HistorianWriteOutcome`**: the per-event drain result. `Ack` means the
  event was persisted and the row is removed from the queue; `RetryPlease`
  means a transient failure, so the row stays queued and is retried after
  backoff; `PermanentFail` means the event is malformed or unrecoverable, so
  the row moves to dead-letter.
- **`HistorianSinkStatus`**: the diagnostic snapshot surfaced to the AdminUI and
  `/healthz`: `QueueDepth`, `DeadLetterDepth`, `LastDrainUtc`, `LastSuccessUtc`,
  `LastError`, `DrainState`, and `EvictedCount`.
- **`HistorianDrainState`**: `Disabled` / `Idle` / `Draining` / `BackingOff`.

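
As a language-neutral illustration, the writer contract above can be mirrored in a few lines of Python. This is a sketch, not the C# source: the snake_case names, the `severity` type, the optional `user`/`comment` fields, and the `RejectEmptyPathWriter` class are all assumptions made for the example. It shows the one property the drain worker relies on, namely that a writer returns exactly one outcome per event, in input order.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional, Protocol, Sequence


class HistorianWriteOutcome(Enum):
    ACK = "Ack"                      # persisted; remove from queue
    RETRY_PLEASE = "RetryPlease"     # transient failure; leave queued
    PERMANENT_FAIL = "PermanentFail" # unrecoverable; move to dead-letter


@dataclass(frozen=True)
class AlarmHistorianEvent:
    alarm_id: str
    equipment_path: str
    alarm_name: str
    alarm_type_name: str
    severity: int                    # assumption: the real type is not documented here
    event_kind: str
    message: str
    user: Optional[str]              # assumption: nullable
    comment: Optional[str]           # assumption: nullable
    timestamp_utc: datetime


class AlarmHistorianWriter(Protocol):
    def write_batch(
        self, batch: Sequence[AlarmHistorianEvent]
    ) -> List[HistorianWriteOutcome]:
        """Return one outcome per event, in the same order as `batch`."""
        ...


class RejectEmptyPathWriter:
    """Hypothetical writer: events without an equipment path can never be
    stored, so they fail permanently; everything else is acknowledged."""

    def write_batch(self, batch):
        return [
            HistorianWriteOutcome.ACK
            if evt.equipment_path
            else HistorianWriteOutcome.PERMANENT_FAIL
            for evt in batch
        ]
```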
---
## SqliteStoreAndForwardSink

[`SqliteStoreAndForwardSink.cs`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs)
is the production `IAlarmHistorianSink`. Construction takes a SQLite database
path, an `IAlarmHistorianWriter`, a logger, and optional `batchSize` (default
100), `capacity` (default 1,000,000), `deadLetterRetention` (default 30 days),
and a test clock.

### Queue table

The sink owns one SQLite table (created on construction, WAL journal mode):

```sql
CREATE TABLE Queue (
    RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId        TEXT NOT NULL,
    EnqueuedUtc    TEXT NOT NULL,
    PayloadJson    TEXT NOT NULL,   -- JSON-serialized AlarmHistorianEvent
    AttemptCount   INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT NULL,
    LastError      TEXT NULL,
    DeadLettered   INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IX_Queue_Drain ON Queue (DeadLettered, RowId);
```

`EnqueueAsync` does a single `INSERT` on the hot path. To avoid a
`SELECT COUNT(*)` on every enqueue, the sink keeps an in-memory count of
non-dead-lettered rows. The counter is seeded at startup, kept current by every
mutation, and re-synced from storage every 10,000 enqueues to defend against
drift. SQLite writer contention is handled by `PRAGMA busy_timeout=5000`
(5 seconds) together with WAL, so an enqueue/drain collision waits out the file
lock instead of failing fast.

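
The enqueue path described above can be sketched with Python's standard `sqlite3` module. This is an illustration under stated assumptions rather than the C# implementation: the `QueueSketch` class, its `live_rows` counter, and the `RESYNC_EVERY` constant are hypothetical names, and only the pragmas, the table definition, and the resync interval come from this page.

```python
import json
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS Queue (
    RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId        TEXT NOT NULL,
    EnqueuedUtc    TEXT NOT NULL,
    PayloadJson    TEXT NOT NULL,
    AttemptCount   INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT NULL,
    LastError      TEXT NULL,
    DeadLettered   INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IF NOT EXISTS IX_Queue_Drain ON Queue (DeadLettered, RowId);
"""

RESYNC_EVERY = 10_000  # enqueues between counter re-syncs


class QueueSketch:
    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        # WAL lets readers and the single writer overlap; busy_timeout makes
        # a blocked writer wait up to 5000 ms instead of failing immediately.
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute("PRAGMA busy_timeout=5000")
        self.conn.executescript(SCHEMA)
        self.live_rows = self._count_live()  # seeded once at startup
        self._since_resync = 0

    def _count_live(self):
        row = self.conn.execute(
            "SELECT COUNT(*) FROM Queue WHERE DeadLettered = 0"
        ).fetchone()
        return row[0]

    def enqueue(self, alarm_id, payload):
        # Hot path: one INSERT, committed before returning.
        with self.conn:
            self.conn.execute(
                "INSERT INTO Queue (AlarmId, EnqueuedUtc, PayloadJson) "
                "VALUES (?, ?, ?)",
                (alarm_id, datetime.now(timezone.utc).isoformat(),
                 json.dumps(payload)),
            )
        self.live_rows += 1
        self._since_resync += 1
        if self._since_resync >= RESYNC_EVERY:
            # Periodic re-sync from storage guards against counter drift.
            self.live_rows = self._count_live()
            self._since_resync = 0
```

A real sink would also adjust the counter whenever the drain worker deletes or dead-letters rows; that bookkeeping is omitted here. Note that an in-memory database (`":memory:"`) cannot use WAL, so the journal-mode pragma only takes effect for file-backed databases.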
### Drain worker

`StartDrainLoop(tickInterval)` starts a **self-rescheduling one-shot
`System.Threading.Timer`**. The loop is not started automatically, so tests can
drive `DrainOnceAsync` deterministically. Each tick:

1. Purges dead-lettered rows that have aged past the retention window.
2. Reads up to `batchSize` non-dead-lettered rows in `RowId` order.
3. Dead-letters rows with un-deserializable payloads immediately (by their own
   `RowId`) so they can't stall the queue head.
4. Hands the remaining batch to `IAlarmHistorianWriter.WriteBatchAsync` and
   applies each outcome in one transaction: `Ack` deletes the row,
   `PermanentFail` flips its `DeadLettered` flag, and `RetryPlease` bumps its
   attempt count and leaves it queued.
5. Re-arms the timer with a next due-time of `max(tickInterval, currentBackoff)`.

**Backoff ladder** (applied to the timer's next due-time, so a historian outage
genuinely slows the drain cadence): 1s → 2s → 5s → 15s → 60s cap. Any
`RetryPlease` outcome, a writer exception, or a writer cardinality violation
(outcome count ≠ event count) bumps the backoff and sets
`DrainState = BackingOff`; a clean batch resets it. The async-void timer
callback is fully guarded: a fault is logged and recorded into `GetStatus()`
rather than lost as an unobserved task exception.

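
Steps 2 to 5 and the backoff ladder can be sketched as follows. This is a minimal Python illustration, not the C# drain worker: `drain_once`, `next_due_seconds`, the plain-string outcomes, and the `write_batch` callable are hypothetical stand-ins. The sketch also assumes that a `PermanentFail` outcome does not bump the backoff and that a writer fault leaves rows untouched; the text above does not state either point. The step 1 purge is omitted.

```python
import json
import sqlite3

BACKOFF_LADDER_S = (1, 2, 5, 15, 60)  # seconds; the last rung is the cap


def drain_once(conn, write_batch, batch_size=100):
    """Run one drain pass. Returns True for a clean batch (reset backoff),
    False when the caller should bump the backoff."""
    rows = conn.execute(
        "SELECT RowId, PayloadJson FROM Queue "
        "WHERE DeadLettered = 0 ORDER BY RowId LIMIT ?",
        (batch_size,),
    ).fetchall()

    good, corrupt = [], []
    for row_id, payload in rows:
        try:
            good.append((row_id, json.loads(payload)))
        except json.JSONDecodeError:
            corrupt.append((row_id,))

    # Corrupt payloads are dead-lettered by RowId so they cannot block the head.
    with conn:
        conn.executemany(
            "UPDATE Queue SET DeadLettered = 1 WHERE RowId = ?", corrupt
        )

    if not good:
        return True
    try:
        outcomes = write_batch([evt for _, evt in good])
    except Exception:
        return False                   # writer fault: back off
    if len(outcomes) != len(good):
        return False                   # cardinality violation: back off

    clean = True
    with conn:                         # all outcomes applied in one transaction
        for (row_id, _), outcome in zip(good, outcomes):
            if outcome == "Ack":
                conn.execute("DELETE FROM Queue WHERE RowId = ?", (row_id,))
            elif outcome == "PermanentFail":
                conn.execute(
                    "UPDATE Queue SET DeadLettered = 1 WHERE RowId = ?",
                    (row_id,),
                )
            else:                      # RetryPlease
                conn.execute(
                    "UPDATE Queue SET AttemptCount = AttemptCount + 1 "
                    "WHERE RowId = ?",
                    (row_id,),
                )
                clean = False
    return clean


def next_due_seconds(tick_interval_s, backoff_level):
    """Level 0 means no backoff; level n selects rung n of the ladder,
    and levels past the last rung stay at the 60 s cap."""
    if backoff_level <= 0:
        return tick_interval_s
    rung = BACKOFF_LADDER_S[min(backoff_level, len(BACKOFF_LADDER_S)) - 1]
    return max(tick_interval_s, rung)
```

A caller would increment its backoff level whenever `drain_once` returns `False`, reset it to 0 on `True`, and pass the level to `next_due_seconds` when re-arming the timer.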
### Durability bound (important)

**The durability guarantee is bounded by `capacity` (default 1,000,000 rows).**
When the non-dead-lettered queue reaches capacity, `EnqueueAsync` evicts the
oldest non-dead-lettered rows (oldest `RowId` first) to make room, logs a WARN,
and increments `HistorianSinkStatus.EvictedCount`. Under a sustained historian
outage, accepted alarm events can therefore be dropped before delivery. A
non-zero `EvictedCount` is a data-loss signal that requires operator attention:
it exposes otherwise silent loss without log scraping.

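
One way to express the eviction rule is a single `DELETE` over the oldest live rows. The sketch below is an assumption about shape, not the actual C# code: `evict_oldest` and its parameters are hypothetical, and it frees exactly enough room for one incoming row.

```python
import sqlite3


def evict_oldest(conn, live_rows, capacity):
    """Delete the oldest non-dead-lettered rows so that one more row fits
    under `capacity`. Returns the number of rows evicted."""
    overflow = live_rows + 1 - capacity
    if overflow <= 0:
        return 0
    with conn:
        cur = conn.execute(
            "DELETE FROM Queue WHERE RowId IN ("
            "  SELECT RowId FROM Queue WHERE DeadLettered = 0"
            "  ORDER BY RowId LIMIT ?)",
            (overflow,),
        )
    return cur.rowcount
```

A caller would add the returned count to its `EvictedCount` equivalent and log a warning whenever the count is non-zero. Dead-lettered rows are excluded from the subquery, so eviction never removes rows that operators may still want to inspect.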
### Dead-letter + operator recovery

`PermanentFail` and corrupt-payload rows are retained in place with
`DeadLettered = 1` for the retention window (default 30 days) so operators can
inspect them before the sweeper purges them. `RetryDeadLettered()` is the
operator action (from the AdminUI) that clears the dead-letter flag and attempt
count on every dead-lettered row, returning them to the regular queue with a
fresh backoff.

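
In SQL terms, inspection and recovery reduce to one query and one update against the `Queue` table. The Python helpers below are a hedged sketch: `list_dead_lettered` and `retry_dead_lettered` are hypothetical names, and whether the real `RetryDeadLettered()` also clears `LastError` is not stated here, so the sketch leaves that column alone.

```python
import sqlite3


def list_dead_lettered(conn):
    """Rows an operator can inspect before the retention sweep removes them."""
    return conn.execute(
        "SELECT RowId, AlarmId, AttemptCount, LastError FROM Queue "
        "WHERE DeadLettered = 1 ORDER BY RowId"
    ).fetchall()


def retry_dead_lettered(conn):
    """Clear the dead-letter flag and attempt count on every dead-lettered
    row, returning them to the regular queue. Returns the rows revived."""
    with conn:
        cur = conn.execute(
            "UPDATE Queue SET DeadLettered = 0, AttemptCount = 0 "
            "WHERE DeadLettered = 1"
        )
    return cur.rowcount
```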
---
## Runtime wiring

Production routes alarm transitions through the Akka cluster. The
`HistorianAdapterActor`
([`Runtime/Historian/HistorianAdapterActor.cs`](../src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs))
bridges messages from the scripted-alarm actor into the sink's `EnqueueAsync`,
fire-and-forget so the actor loop is never blocked on historian reachability.
The `WonderwareHistorianClient` is the `IAlarmHistorianWriter` the drain worker
delegates to. See [ServiceHosting.md](ServiceHosting.md) for the sidecar setup.

---

## See also

- [AlarmTracking.md](AlarmTracking.md): the three alarm sources and the OPC UA
  Part 9 surface; which alarms route to this sink.
- [DriverLifecycle.md](DriverLifecycle.md): `IHistorianDataSource` (the
  historian *read* surface; this page covers the *write* path) and the
  `WonderwareHistorianClient`.
- [ScriptedAlarms.md](ScriptedAlarms.md): the scripted-alarm engine that emits
  most events into this sink.
- [ServiceHosting.md](ServiceHosting.md): the optional Wonderware historian
  sidecar.