docs(audit): G1 completeness — driver-lifecycle + alarm-historian reference pages
@@ -0,0 +1,168 @@
# Alarm Historian — store-and-forward SQLite sink

Reference for `ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian`
([`src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/)),
the durable local queue that historizes alarm transitions to AVEVA Historian
without ever blocking the alarm engine or operator actions.

This is the *sink mechanics* doc. For how the three alarm sources converge on
the OPC UA Part 9 surface and which alarms route here, see
[AlarmTracking.md](AlarmTracking.md). For the historian client that drains this
queue, see [DriverLifecycle.md](DriverLifecycle.md#ihistoriandatasource--server-side-historian-read-surface)
and [ServiceHosting.md](ServiceHosting.md).

---

## Why store-and-forward

Scripted alarms (and any future non-Galaxy `IAlarmSource`, e.g. AB CIP ALMD)
must reach AVEVA Historian, but the historian sidecar can be slow, busy, or
disconnected. The sink decouples the alarm engine from historian reachability:
every qualifying transition is committed to a **local SQLite queue first**, and
a background drain worker forwards rows to the historian on a backoff-aware
cadence. Operator acks and alarm-state transitions are never blocked waiting on
the historian.

> Galaxy-native alarms with `$Alarm*` extensions reach AVEVA Historian directly
> via System Platform's `HistorizeToAveva` toggle; they do **not** flow through
> this sink. This path is exclusively for non-Galaxy alarm producers.

---

## Contracts

All in
[`IAlarmHistorianSink.cs`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs)
unless noted.

- **`IAlarmHistorianSink`**: the intake contract. `EnqueueAsync(evt, ct)`
  durably enqueues an event and returns as soon as the queue row is committed
  (fire-and-forget from the engine's perspective; the sink must not block the
  emitting thread). `GetStatus()` returns a `HistorianSinkStatus` snapshot.
- **`NullAlarmHistorianSink`**: the no-op sink for tests and deployments
  that don't historize alarms. It is the default DI binding (registered in the
  Runtime's `AddOtOpcUaRuntime`); production overrides it with
  `SqliteStoreAndForwardSink`.
- **`AlarmHistorianEvent`**
  ([`AlarmHistorianEvent.cs`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/AlarmHistorianEvent.cs)):
  the source-agnostic event record, with fields `AlarmId`, `EquipmentPath`
  (UNS path, doubles as Historian's SourceNode), `AlarmName`, `AlarmTypeName`
  (Part 9 subtype), `Severity`, `EventKind` (free-form transition string such
  as "Activated", "Cleared", or "Acknowledged"), `Message`, `User`, `Comment`,
  and `TimestampUtc`.
- **`IAlarmHistorianWriter`**: what the drain worker delegates writes to.
  `WriteBatchAsync(batch, ct)` returns one `HistorianWriteOutcome` per event,
  in order. Production binds this to `WonderwareHistorianClient` (the AVEVA
  Historian sidecar IPC client).
- **`HistorianWriteOutcome`**: the per-event drain result, one of `Ack`
  (persisted, remove from queue), `RetryPlease` (transient failure; leave
  queued, retry after backoff), or `PermanentFail` (malformed/unrecoverable;
  move to dead-letter).
- **`HistorianSinkStatus`**: the diagnostic snapshot surfaced to the AdminUI and
  `/healthz`, carrying `QueueDepth`, `DeadLetterDepth`, `LastDrainUtc`,
  `LastSuccessUtc`, `LastError`, `DrainState`, and `EvictedCount`.
- **`HistorianDrainState`**: `Disabled` / `Idle` / `Draining` / `BackingOff`.
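
For orientation, the contracts listed above could be modeled roughly as follows. This is a Python sketch for illustration only: the real types are C# and live in the linked source files. Member names mirror this page, while the snake_case spelling, the field types, which fields are optional, the omitted cancellation token, and the `Disabled` state reported by the null sink are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional, Protocol, Sequence


class HistorianWriteOutcome(Enum):
    ACK = "Ack"                       # persisted: remove from queue
    RETRY_PLEASE = "RetryPlease"      # transient failure: leave queued
    PERMANENT_FAIL = "PermanentFail"  # unrecoverable: move to dead-letter


class HistorianDrainState(Enum):
    DISABLED = "Disabled"
    IDLE = "Idle"
    DRAINING = "Draining"
    BACKING_OFF = "BackingOff"


@dataclass(frozen=True)
class AlarmHistorianEvent:
    alarm_id: str
    equipment_path: str       # UNS path, doubles as Historian's SourceNode
    alarm_name: str
    alarm_type_name: str      # Part 9 subtype
    severity: int             # assumption: numeric severity
    event_kind: str           # free-form, e.g. "Activated"
    message: str
    user: Optional[str]       # assumption: optional
    comment: Optional[str]    # assumption: optional
    timestamp_utc: datetime


@dataclass(frozen=True)
class HistorianSinkStatus:
    queue_depth: int
    dead_letter_depth: int
    last_drain_utc: Optional[datetime]
    last_success_utc: Optional[datetime]
    last_error: Optional[str]
    drain_state: HistorianDrainState
    evicted_count: int


class AlarmHistorianSink(Protocol):
    async def enqueue_async(self, evt: AlarmHistorianEvent) -> None: ...
    def get_status(self) -> HistorianSinkStatus: ...


class AlarmHistorianWriter(Protocol):
    # Must return exactly one outcome per event, in the same order.
    async def write_batch_async(
        self, batch: Sequence[AlarmHistorianEvent]
    ) -> Sequence[HistorianWriteOutcome]: ...


class NullAlarmHistorianSink:
    """No-op sink: accepts every event and stores nothing."""

    async def enqueue_async(self, evt: AlarmHistorianEvent) -> None:
        return None

    def get_status(self) -> HistorianSinkStatus:
        # Assumption: a sink that never drains reports itself as Disabled.
        return HistorianSinkStatus(0, 0, None, None, None,
                                   HistorianDrainState.DISABLED, 0)
```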

---

## SqliteStoreAndForwardSink

[`SqliteStoreAndForwardSink.cs`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs)
is the production `IAlarmHistorianSink`. Construction takes a SQLite database
path, an `IAlarmHistorianWriter`, a logger, and optional `batchSize` (default
100), `capacity` (default 1,000,000), `deadLetterRetention` (default 30 days),
and a test clock.

### Queue table

The sink owns one SQLite table (created on construction, WAL journal mode):

```sql
CREATE TABLE Queue (
    RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId        TEXT NOT NULL,
    EnqueuedUtc    TEXT NOT NULL,
    PayloadJson    TEXT NOT NULL,   -- JSON-serialized AlarmHistorianEvent
    AttemptCount   INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT NULL,
    LastError      TEXT NULL,
    DeadLettered   INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IX_Queue_Drain ON Queue (DeadLettered, RowId);
```

`EnqueueAsync` does a single `INSERT` on the hot path. To avoid a
`SELECT COUNT(*)` on every enqueue, the sink keeps an in-memory counter of
non-dead-lettered rows (seeded at startup, kept current by every mutation, and
re-synced from storage every 10,000 enqueues to defend against drift). SQLite
writer contention is handled via `PRAGMA busy_timeout=5000` plus WAL, so an
enqueue/drain collision waits out the file lock instead of failing fast.
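
A minimal sketch of the enqueue hot path described above, written with Python's `sqlite3` purely for illustration (the real sink is C#). The class name `QueueSketch`, its members, and the `IF NOT EXISTS` guards are hypothetical; only the schema, the two pragmas, and the 10,000-enqueue re-sync cadence come from this page.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS Queue (
    RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId        TEXT NOT NULL,
    EnqueuedUtc    TEXT NOT NULL,
    PayloadJson    TEXT NOT NULL,
    AttemptCount   INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT NULL,
    LastError      TEXT NULL,
    DeadLettered   INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IF NOT EXISTS IX_Queue_Drain ON Queue (DeadLettered, RowId);
"""

RESYNC_EVERY = 10_000  # re-sync cadence stated on this page


class QueueSketch:
    """Hypothetical stand-in for the sink's enqueue side."""

    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute("PRAGMA busy_timeout=5000")  # wait up to 5 s on a lock
        self.conn.executescript(SCHEMA)
        self.live_rows = self._count_live()  # seeded once at startup
        self._since_resync = 0

    def _count_live(self) -> int:
        row = self.conn.execute(
            "SELECT COUNT(*) FROM Queue WHERE DeadLettered = 0").fetchone()
        return row[0]

    def enqueue(self, alarm_id: str, enqueued_utc: str, payload: dict) -> None:
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO Queue (AlarmId, EnqueuedUtc, PayloadJson) "
                "VALUES (?, ?, ?)",
                (alarm_id, enqueued_utc, json.dumps(payload)),
            )
        # The counter is updated in memory; no COUNT(*) on the hot path.
        self.live_rows += 1
        self._since_resync += 1
        if self._since_resync >= RESYNC_EVERY:
            self.live_rows = self._count_live()  # defend against drift
            self._since_resync = 0
```

Capacity eviction and dead-letter bookkeeping are left out here to keep the hot path visible.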

### Drain worker

`StartDrainLoop(tickInterval)` starts a **self-rescheduling one-shot
`System.Threading.Timer`** (not started automatically; tests drive
`DrainOnceAsync` deterministically). Each tick:

1. Purges aged dead-lettered rows past the retention window.
2. Reads up to `batchSize` non-dead-lettered rows in `RowId` order.
3. Dead-letters rows with un-deserializable payloads immediately (by their
   own `RowId`) so they can't stall the queue head.
4. Hands the remaining batch to `IAlarmHistorianWriter.WriteBatchAsync` and
   applies each outcome in one transaction: `Ack` deletes the row,
   `PermanentFail` flips its `DeadLettered` flag, `RetryPlease` bumps its attempt
   count and leaves it queued.
5. Re-arms the timer's next due-time to `max(tickInterval, currentBackoff)`.

**Backoff ladder** (applied to the timer's next due-time, so a historian outage
genuinely slows the drain cadence): 1s → 2s → 5s → 15s → 60s cap. Any
`RetryPlease` outcome, a writer exception, or a writer cardinality violation
(outcome count ≠ event count) bumps the backoff and sets
`DrainState = BackingOff`; a clean batch resets it. The async-void timer
callback is fully guarded: a fault is logged and recorded into `GetStatus()`
rather than lost as an unobserved task exception.
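
Step 4 and the backoff ladder can be sketched as below, again in illustrative Python rather than the sink's actual C#. The function names, the string outcome values, and the "index -1 means no backoff" representation are hypothetical. Two behaviors are assumptions rather than statements from this page: a cardinality violation leaves every row untouched, and a batch counts as clean whenever it contains no `RetryPlease`.

```python
import sqlite3

BACKOFF_LADDER_S = [1, 2, 5, 15, 60]  # seconds; the last rung is the cap


def apply_outcomes(conn: sqlite3.Connection, row_ids, outcomes) -> bool:
    """Apply one outcome per row in a single transaction.

    Returns True when the batch was clean (nothing needs a retry).
    """
    if len(outcomes) != len(row_ids):
        return False  # cardinality violation: assumed to touch nothing
    with conn:  # one transaction for the whole batch
        for row_id, outcome in zip(row_ids, outcomes):
            if outcome == "Ack":
                conn.execute("DELETE FROM Queue WHERE RowId = ?", (row_id,))
            elif outcome == "PermanentFail":
                conn.execute(
                    "UPDATE Queue SET DeadLettered = 1 WHERE RowId = ?",
                    (row_id,))
            else:  # "RetryPlease": bump the attempt count, leave it queued
                conn.execute(
                    "UPDATE Queue SET AttemptCount = AttemptCount + 1 "
                    "WHERE RowId = ?", (row_id,))
    return all(o != "RetryPlease" for o in outcomes)


def after_batch(index: int, clean: bool) -> int:
    """Return the new ladder index: reset on a clean batch, else climb one rung."""
    if clean:
        return -1  # -1 means no backoff in effect
    return min(index + 1, len(BACKOFF_LADDER_S) - 1)


def next_due_seconds(tick_interval_s: float, index: int) -> float:
    """Due-time for the one-shot timer: max(tickInterval, currentBackoff)."""
    backoff = BACKOFF_LADDER_S[index] if index >= 0 else 0
    return max(tick_interval_s, backoff)
```

Because the due-time is the larger of the two values, a tick interval longer than the current rung simply wins; the ladder only ever slows the cadence.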

### Durability bound (important)

**The durability guarantee is bounded by `capacity` (default 1,000,000 rows).**
When the non-dead-lettered queue reaches capacity, `EnqueueAsync` evicts the
oldest non-dead-lettered rows (oldest `RowId` first) to make room, logs a WARN,
and increments `HistorianSinkStatus.EvictedCount`. Under a sustained historian
outage, accepted alarm events can therefore be dropped before delivery. A
non-zero `EvictedCount` is a data-loss signal that requires operator attention:
it makes otherwise silent loss visible without log scraping.
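
The eviction rule can be illustrated with a single `DELETE`. This is a hedged Python sketch, not the sink's actual code: the helper name `evict_oldest` is hypothetical, and the exact boundary (evicting just enough rows that one more insert still fits within `capacity`) is an assumption.

```python
import sqlite3


def evict_oldest(conn: sqlite3.Connection, live_rows: int, capacity: int) -> int:
    """Delete the oldest non-dead-lettered rows so one more row fits.

    Returns the number of rows evicted; the caller would add this to
    EvictedCount and subtract it from its in-memory row counter.
    """
    overflow = live_rows + 1 - capacity
    if overflow <= 0:
        return 0
    with conn:
        cur = conn.execute(
            "DELETE FROM Queue WHERE RowId IN ("
            "  SELECT RowId FROM Queue WHERE DeadLettered = 0"
            "  ORDER BY RowId LIMIT ?)",
            (overflow,),
        )
    return cur.rowcount
```

Dead-lettered rows are skipped by the inner `SELECT`, so they never count against, or get evicted by, the capacity check in this sketch.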

### Dead-letter + operator recovery

`PermanentFail` and corrupt-payload rows are retained in-place with
`DeadLettered = 1` for the retention window (default 30 days) so operators can
inspect them before the sweeper purges them. `RetryDeadLettered()` is the
operator action (from the AdminUI) that clears the dead-letter flag and attempt
count on every dead-lettered row, returning them to the regular queue with a
fresh backoff.
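
In SQL terms, the two dead-letter operations might look like the following illustrative Python sketch. The function names are hypothetical. This page does not say which timestamp the retention sweep compares against; the sketch assumes `EnqueuedUtc` stored as ISO-8601 UTC text, so that plain string comparison orders rows chronologically.

```python
import sqlite3


def retry_dead_lettered(conn: sqlite3.Connection) -> int:
    """Clear the dead-letter flag and attempt count on every dead-lettered row."""
    with conn:
        cur = conn.execute(
            "UPDATE Queue SET DeadLettered = 0, AttemptCount = 0 "
            "WHERE DeadLettered = 1")
    return cur.rowcount


def purge_aged_dead_letters(conn: sqlite3.Connection, cutoff_utc: str) -> int:
    """Delete dead-lettered rows older than the cutoff.

    Assumption: age is measured from EnqueuedUtc (ISO-8601 text).
    """
    with conn:
        cur = conn.execute(
            "DELETE FROM Queue WHERE DeadLettered = 1 AND EnqueuedUtc < ?",
            (cutoff_utc,))
    return cur.rowcount
```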

---

## Runtime wiring

Production routes alarm transitions through the Akka cluster. The
`HistorianAdapterActor`
([`Runtime/Historian/HistorianAdapterActor.cs`](../src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs))
bridges messages from the scripted-alarm actor into the sink's `EnqueueAsync`,
fire-and-forget so the actor loop is never blocked on historian reachability.
The `WonderwareHistorianClient` is the `IAlarmHistorianWriter` the drain worker
delegates to. See [ServiceHosting.md](ServiceHosting.md) for the sidecar setup.

---

## See also

- [AlarmTracking.md](AlarmTracking.md): the three alarm sources and the OPC UA
  Part 9 surface; which alarms route to this sink.
- [DriverLifecycle.md](DriverLifecycle.md): `IHistorianDataSource` (the
  historian *read* surface; this page covers the *write* path) and the
  `WonderwareHistorianClient`.
- [ScriptedAlarms.md](ScriptedAlarms.md): the scripted-alarm engine that emits
  most events into this sink.
- [ServiceHosting.md](ServiceHosting.md): the optional Wonderware historian
  sidecar.