Joseph Doherty 2124f21ab6
2026-06-26 19:46:27 -04:00


# Alarm Historian — store-and-forward SQLite sink
Reference for `ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian`
([`src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/)),
the durable local queue that historizes alarm transitions to AVEVA Historian
without ever blocking the alarm engine or operator actions.

This is the *sink mechanics* doc. For how the three alarm sources converge on
the OPC UA Part 9 surface and which alarms route here, see
[AlarmTracking.md](AlarmTracking.md). For the historian client that drains this
queue, see [DriverLifecycle.md](DriverLifecycle.md#ihistoriandatasource--server-side-historian-read-surface)
and [ServiceHosting.md](ServiceHosting.md).

---

## Why store-and-forward
Scripted alarms (and any future non-Galaxy `IAlarmSource`, e.g. AB CIP ALMD)
must reach AVEVA Historian, but the historian gateway can be slow, busy, or
disconnected. The sink decouples the alarm engine from historian reachability:
every qualifying transition is committed to a **local SQLite queue first**, and
a background drain worker forwards rows to the historian on a backoff-aware
cadence. Operator acks and alarm-state transitions are never blocked waiting on
the historian.
> Galaxy-native alarms with `$Alarm*` extensions reach AVEVA Historian directly
> via System Platform's `HistorizeToAveva` toggle — they do **not** flow through
> this sink. This path is exclusively for non-Galaxy alarm producers.
---
## Contracts
All in
[`IAlarmHistorianSink.cs`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs)
unless noted.
- **`IAlarmHistorianSink`** — the intake contract. `EnqueueAsync(evt, ct)`
durably enqueues an event and returns as soon as the queue row is committed
(fire-and-forget from the engine's perspective; the sink must not block the
emitting thread). `GetStatus()` returns a `HistorianSinkStatus` snapshot.
- **`NullAlarmHistorianSink`** — the no-op default for tests and deployments
that don't historize alarms. It is the default DI binding (registered in the
Runtime's `AddOtOpcUaRuntime`); production overrides it with
`SqliteStoreAndForwardSink`.
- **`AlarmHistorianEvent`**
([`AlarmHistorianEvent.cs`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/AlarmHistorianEvent.cs))
— the source-agnostic event record: `AlarmId`, `EquipmentPath` (UNS path,
doubles as Historian's SourceNode), `AlarmName`, `AlarmTypeName` (Part 9
subtype), `Severity`, `EventKind` (free-form transition string —
"Activated"/"Cleared"/"Acknowledged"/etc.), `Message`, `User`, `Comment`,
`TimestampUtc`.
- **`IAlarmHistorianWriter`** — what the drain worker delegates writes to.
`WriteBatchAsync(batch, ct)` returns one `HistorianWriteOutcome` per event,
in order. Production binds this to `GatewayAlarmHistorianWriter` (the
HistorianGateway `SendEvent` path).
- **`HistorianWriteOutcome`** — per-event drain result: `Ack` (persisted,
remove from queue), `RetryPlease` (transient failure — leave queued, retry
after backoff), `PermanentFail` (malformed/unrecoverable — move to
dead-letter).
- **`HistorianSinkStatus`** — diagnostic snapshot surfaced to the AdminUI and
`/healthz`: `QueueDepth`, `DeadLetterDepth`, `LastDrainUtc`, `LastSuccessUtc`,
`LastError`, `DrainState`, and `EvictedCount`.
- **`HistorianDrainState`** — `Disabled` / `Idle` / `Draining` / `BackingOff`.
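
As a language-neutral illustration, the contract shapes listed above can be sketched in Python. This is not the C# source: the Python field types, the `AlwaysAckWriter` class, and the sample values are assumptions made for the example; only the member and outcome names come from the list above.

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional, Protocol, Sequence


class HistorianWriteOutcome(Enum):
    ACK = "Ack"                        # persisted, remove from queue
    RETRY_PLEASE = "RetryPlease"       # transient failure, leave queued
    PERMANENT_FAIL = "PermanentFail"   # unrecoverable, move to dead-letter


@dataclass(frozen=True)
class AlarmHistorianEvent:
    # Field names follow the doc; the Python types are assumed.
    alarm_id: str
    equipment_path: str
    alarm_name: str
    alarm_type_name: str
    severity: int
    event_kind: str
    message: str
    user: Optional[str]
    comment: Optional[str]
    timestamp_utc: datetime


class AlarmHistorianWriter(Protocol):
    # One outcome per event, in the same order as the batch.
    async def write_batch(
        self, batch: Sequence[AlarmHistorianEvent]
    ) -> list[HistorianWriteOutcome]: ...


class AlwaysAckWriter:
    """Hypothetical writer that acknowledges every event, in order."""

    async def write_batch(self, batch):
        return [HistorianWriteOutcome.ACK for _ in batch]


evt = AlarmHistorianEvent(
    alarm_id="a1", equipment_path="site/line1/pump", alarm_name="HighTemp",
    alarm_type_name="LimitAlarmType", severity=700, event_kind="Activated",
    message="Temperature high", user=None, comment=None,
    timestamp_utc=datetime.now(timezone.utc),
)
outcomes = asyncio.run(AlwaysAckWriter().write_batch([evt, evt]))
```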
---
## SqliteStoreAndForwardSink
[`SqliteStoreAndForwardSink.cs`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs)
is the production `IAlarmHistorianSink`. Construction takes a SQLite database
path, an `IAlarmHistorianWriter`, a logger, and optional `batchSize` (default
100), `capacity` (default 1,000,000), `deadLetterRetention` (default 30 days),
and a test clock.
### Queue table
The sink owns one SQLite table (created on construction, WAL journal mode):
```sql
CREATE TABLE Queue (
    RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId        TEXT    NOT NULL,
    EnqueuedUtc    TEXT    NOT NULL,
    PayloadJson    TEXT    NOT NULL, -- JSON-serialized AlarmHistorianEvent
    AttemptCount   INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT    NULL,
    LastError      TEXT    NULL,
    DeadLettered   INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IX_Queue_Drain ON Queue (DeadLettered, RowId);
```
`EnqueueAsync` does a single `INSERT` on the hot path. To avoid a
`SELECT COUNT(*)` on every enqueue, the sink keeps an in-memory non-dead-lettered
row counter (seeded at startup, kept current by every mutation, and re-synced
from storage every 10,000 enqueues to defend against drift). SQLite writer
contention is handled via `PRAGMA busy_timeout=5000` + WAL so an enqueue/drain
collision waits out the file lock instead of failing fast.
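
The enqueue path described above can be sketched with Python's standard `sqlite3` module. This is a minimal illustration, not the sink's actual code: `QueueSketch`, `RESYNC_EVERY`, and `live_rows` are hypothetical names, and the in-memory database stands in for the real queue file (WAL has no effect on `:memory:` databases, but the pragma is harmless there).

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS Queue (
    RowId INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId TEXT NOT NULL,
    EnqueuedUtc TEXT NOT NULL,
    PayloadJson TEXT NOT NULL,
    AttemptCount INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT NULL,
    LastError TEXT NULL,
    DeadLettered INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IF NOT EXISTS IX_Queue_Drain ON Queue (DeadLettered, RowId);
"""

RESYNC_EVERY = 10_000  # re-count from storage this often to correct drift


class QueueSketch:
    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute("PRAGMA busy_timeout=5000")
        self.conn.executescript(SCHEMA)
        self.live_rows = self._count_live()  # seeded once at startup
        self.enqueues_since_sync = 0

    def _count_live(self) -> int:
        row = self.conn.execute(
            "SELECT COUNT(*) FROM Queue WHERE DeadLettered = 0"
        ).fetchone()
        return row[0]

    def enqueue(self, alarm_id: str, enqueued_utc: str, payload: dict) -> None:
        # Hot path: a single INSERT, no COUNT(*).
        with self.conn:
            self.conn.execute(
                "INSERT INTO Queue (AlarmId, EnqueuedUtc, PayloadJson) "
                "VALUES (?, ?, ?)",
                (alarm_id, enqueued_utc, json.dumps(payload)),
            )
        self.live_rows += 1
        self.enqueues_since_sync += 1
        if self.enqueues_since_sync >= RESYNC_EVERY:
            self.live_rows = self._count_live()
            self.enqueues_since_sync = 0


q = QueueSketch(":memory:")
for i in range(3):
    q.enqueue(f"alarm-{i}", "2026-01-01T00:00:00Z", {"EventKind": "Activated"})
```

The counter trades a small risk of drift for a cheaper hot path, which is why the periodic re-sync exists.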
### Drain worker
`StartDrainLoop(tickInterval)` starts a **self-rescheduling one-shot
`System.Threading.Timer`** (not started automatically — tests drive
`DrainOnceAsync` deterministically). Each tick:
1. Purges aged dead-lettered rows past the retention window.
2. Reads up to `batchSize` non-dead-lettered rows in `RowId` order.
3. Rows with un-deserializable payloads are dead-lettered immediately (by their
own `RowId`) so they can't stall the queue head.
4. The remaining batch is handed to `IAlarmHistorianWriter.WriteBatchAsync`, and
each outcome is applied in one transaction: `Ack` deletes the row,
`PermanentFail` flips its `DeadLettered` flag, `RetryPlease` bumps its attempt
count and leaves it queued. A row whose `AttemptCount` has reached the configured
**`MaxAttempts`** cap (default 10) is dead-lettered automatically on the next drain
tick rather than retried. This breaks infinite retry loops for poison events whose
payload the historian will always reject (e.g. a malformed alarm record that triggers
a permanent SDK error on every attempt). The dead-lettered row stays in the table
for the configured retention window, where it can be inspected or re-queued with
`RetryDeadLettered()`.
5. The timer re-arms its next due-time to `max(tickInterval, currentBackoff)`.

**Backoff ladder** (applied to the timer's next due-time, so a historian outage
genuinely slows the drain cadence): 1s → 2s → 5s → 15s → 60s cap. Any
`RetryPlease` outcome — or a writer exception, or a writer cardinality violation
(outcome count ≠ event count) — bumps the backoff and sets `DrainState =
BackingOff`; a clean batch resets it. The async-void timer callback is fully
guarded: a fault is logged and recorded into `GetStatus()` rather than lost as
an unobserved task exception.
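
The ladder and the `max(tickInterval, currentBackoff)` re-arm rule can be sketched as follows. This is an illustrative model under stated assumptions, not the sink's code: `BackoffSketch` and its members are hypothetical, a batch with no `RetryPlease` outcome is treated as clean, the state after a clean batch is assumed to be `Idle`, and writer exceptions (which also bump the ladder) are not modelled.

```python
BACKOFF_LADDER_S = [1, 2, 5, 15, 60]  # ladder from the text; 60 s is the cap


class BackoffSketch:
    def __init__(self) -> None:
        self.step = -1  # -1 means "not backing off"

    @property
    def current_backoff_s(self) -> int:
        return 0 if self.step < 0 else BACKOFF_LADDER_S[self.step]

    def record_batch(self, event_count: int, outcomes: list[str]) -> str:
        """Update the ladder after one drain tick and return the drain state."""
        failed = (
            len(outcomes) != event_count   # writer cardinality violation
            or "RetryPlease" in outcomes   # any transient failure
        )
        if failed:
            # Climb one rung, staying on the last rung once the cap is reached.
            self.step = min(self.step + 1, len(BACKOFF_LADDER_S) - 1)
            return "BackingOff"
        self.step = -1  # a clean batch resets the ladder
        return "Idle"

    def next_due_s(self, tick_interval_s: int) -> int:
        # The timer never fires faster than the configured tick interval.
        return max(tick_interval_s, self.current_backoff_s)


b = BackoffSketch()
delays = []
for _ in range(6):
    b.record_batch(2, ["Ack", "RetryPlease"])
    delays.append(b.next_due_s(tick_interval_s=5))
state_after_clean = b.record_batch(2, ["Ack", "Ack"])
```

With a 5-second tick interval, the first three rungs (1s, 2s, 5s) do not slow the drain at all; the cadence only stretches once the backoff exceeds the tick interval.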
### Durability bound (important)
**The durability guarantee is bounded by `capacity` (default 1,000,000 rows).**
When the non-dead-lettered queue reaches capacity, `EnqueueAsync` evicts the
oldest non-dead-lettered rows (oldest `RowId` first) to make room, logs a WARN,
and increments `HistorianSinkStatus.EvictedCount`. Under a sustained historian
outage, accepted alarm events can therefore be dropped before delivery. A
non-zero `EvictedCount` is a data-loss signal that requires operator attention —
it surfaces silent loss without log scraping.
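
A small sketch of oldest-first eviction, using a deliberately tiny capacity so the effect is visible. The table is trimmed to the columns the example needs, `enqueue_with_eviction` and `evicted_count` are hypothetical names, and the sketch counts rows with `COUNT(*)` for brevity where the real sink keeps an in-memory counter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Queue ("
    "RowId INTEGER PRIMARY KEY AUTOINCREMENT, "
    "AlarmId TEXT NOT NULL, "
    "DeadLettered INTEGER NOT NULL DEFAULT 0)"
)

CAPACITY = 3  # tiny on purpose; the documented default is 1,000,000
evicted_count = 0


def enqueue_with_eviction(alarm_id: str) -> None:
    global evicted_count
    with conn:  # eviction and insert commit together
        live = conn.execute(
            "SELECT COUNT(*) FROM Queue WHERE DeadLettered = 0"
        ).fetchone()[0]
        overflow = live + 1 - CAPACITY
        if overflow > 0:
            # Drop the oldest live rows (lowest RowId first) to make room.
            cur = conn.execute(
                "DELETE FROM Queue WHERE RowId IN ("
                "SELECT RowId FROM Queue WHERE DeadLettered = 0 "
                "ORDER BY RowId LIMIT ?)",
                (overflow,),
            )
            evicted_count += cur.rowcount
        conn.execute("INSERT INTO Queue (AlarmId) VALUES (?)", (alarm_id,))


for i in range(5):
    enqueue_with_eviction(f"alarm-{i}")

remaining = [r[0] for r in conn.execute("SELECT AlarmId FROM Queue ORDER BY RowId")]
```

After five enqueues against a capacity of three, the two oldest events are gone and `evicted_count` is 2, which is the condition an operator would see surfaced through `EvictedCount`.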
### Dead-letter + operator recovery
`PermanentFail` rows, corrupt-payload rows, and rows that exhausted `MaxAttempts`
are retained in-place with `DeadLettered = 1` for the retention window (default
30 days) so operators can inspect them before the sweeper purges them.
`RetryDeadLettered()` is the operator action (from the AdminUI) that clears the
dead-letter flag and attempt count on every dead-lettered row, returning them to
the regular queue with a fresh backoff.

---

## Runtime wiring
`HistorianAdapterActor`
([`Runtime/Historian/HistorianAdapterActor.cs`](../src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs))
subscribes to the cluster **`alerts` DPS topic** and translates each
`AlarmTransitionEvent` into an `AlarmHistorianEvent`, then calls
`IAlarmHistorianSink.EnqueueAsync` fire-and-forget so the actor loop is never
blocked on historian reachability. The actor is **Primary-gated**: only the
node whose `RedundancyRole` is `Primary` historizes, giving exactly-once
writes across a redundant pair. `AlarmTransitionEvent` carries `AlarmTypeName`
(the Part 9 subtype string) and `Comment` (the operator comment from the
originating ack/shelve command) that populate the corresponding fields of
`AlarmHistorianEvent`. `GatewayAlarmHistorianWriter` is the `IAlarmHistorianWriter`
the drain worker delegates to (the gateway `SendEvent` path). See
[ServiceHosting.md](ServiceHosting.md) for the (external) HistorianGateway setup.

**Scope:** scripted alarms only. Galaxy-native alarms historize via System
Platform's `HistorizeToAveva` toggle (not this actor); AB CIP ALMD is not on
the `alerts` topic (future).
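
The Primary gate and the fire-and-forget hand-off can be illustrated with a small asyncio model. This is a sketch only: the real component is an actor in the C# runtime, and `AdapterSketch`, `RecordingSink`, the dictionary-shaped events, and the `"Secondary"` role name are assumptions made for the example.

```python
import asyncio


class RecordingSink:
    """Hypothetical stand-in for IAlarmHistorianSink."""

    def __init__(self) -> None:
        self.events: list[dict] = []

    async def enqueue(self, evt: dict) -> None:
        self.events.append(evt)


class AdapterSketch:
    def __init__(self, sink: RecordingSink, role: str) -> None:
        self.sink = sink
        self.role = role
        self.pending: set[asyncio.Task] = set()

    def on_transition(self, transition: dict) -> None:
        if self.role != "Primary":
            return  # only the Primary node historizes
        evt = {
            "AlarmId": transition["AlarmId"],
            "EventKind": transition["EventKind"],
            "AlarmTypeName": transition.get("AlarmTypeName"),
            "Comment": transition.get("Comment"),
        }
        # Fire-and-forget: schedule the enqueue, do not await it here.
        task = asyncio.get_running_loop().create_task(self.sink.enqueue(evt))
        self.pending.add(task)  # keep a reference so the task is not collected
        task.add_done_callback(self.pending.discard)


async def demo() -> tuple[int, int]:
    primary_sink, secondary_sink = RecordingSink(), RecordingSink()
    primary = AdapterSketch(primary_sink, "Primary")
    secondary = AdapterSketch(secondary_sink, "Secondary")
    transition = {"AlarmId": "a1", "EventKind": "Acknowledged", "Comment": "checked"}
    primary.on_transition(transition)
    secondary.on_transition(transition)
    await asyncio.gather(*primary.pending)
    return len(primary_sink.events), len(secondary_sink.events)


counts = asyncio.run(demo())
```

Both nodes see the same transition, but only the Primary's sink receives an event.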
## Configuration
The real sink is opt-in via the `AlarmHistorian` section of `appsettings.json`.
When `Enabled` is `false` (the default), `AddAlarmHistorian` registers
`NullAlarmHistorianSink` and the feature is dormant. When `Enabled` is `true`,
`AddAlarmHistorian` constructs `SqliteStoreAndForwardSink` and registers
`GatewayAlarmHistorianWriter` as the `IAlarmHistorianWriter`. This section carries
**only** the `Enabled` gate + the SQLite store-and-forward knobs — the downstream
gateway connection (endpoint / key / TLS) is sourced from the `ServerHistorian`
section (see [Historian.md](Historian.md)).
```json
{
  "AlarmHistorian": {
    "Enabled": true,
    "DatabasePath": "C:\\ProgramData\\OtOpcUa\\alarmhistorian.db",
    "BatchSize": 100,
    "DrainIntervalSeconds": 5,
    "Capacity": 1000000,
    "DeadLetterRetentionDays": 30
  }
}
```
| Key | Type | Default | Description |
|---|---|---|---|
| `Enabled` | bool | `false` | Enable the SQLite store-and-forward sink (drains to the HistorianGateway `SendEvent` path). `false` → `NullAlarmHistorianSink`. |
| `DatabasePath` | string | `alarm-historian.db` | Path to the SQLite queue file. Created on first use (WAL mode). Set an **absolute** path in production. |
| `BatchSize` | int | `100` | Max rows per drain cycle handed to `IAlarmHistorianWriter.WriteBatchAsync`. |
| `DrainIntervalSeconds` | int | `5` | Seconds between drain-worker ticks. |
| `Capacity` | long | `1000000` | Max queued rows before the sink evicts the oldest (data-loss signal via `EvictedCount`). |
| `DeadLetterRetentionDays` | int | `30` | Days to retain dead-lettered rows before purge. |
| `MaxAttempts` | int | `10` | Maximum delivery attempts before a poison (perpetually-retrying) row is dead-lettered automatically. Must be > 0. |
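
The defaults and the `Enabled` gate in the table above can be modelled as follows. This is a sketch of the documented behaviour, not the `AddAlarmHistorian` implementation: `AlarmHistorianOptions`, `load_options`, and `choose_sink` are hypothetical names, and validating `MaxAttempts` at load time is an assumption based on the "Must be > 0" note.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmHistorianOptions:
    # Defaults mirror the table above.
    enabled: bool = False
    database_path: str = "alarm-historian.db"
    batch_size: int = 100
    drain_interval_seconds: int = 5
    capacity: int = 1_000_000
    dead_letter_retention_days: int = 30
    max_attempts: int = 10


_KEYS = {
    "Enabled": "enabled",
    "DatabasePath": "database_path",
    "BatchSize": "batch_size",
    "DrainIntervalSeconds": "drain_interval_seconds",
    "Capacity": "capacity",
    "DeadLetterRetentionDays": "dead_letter_retention_days",
    "MaxAttempts": "max_attempts",
}


def load_options(appsettings_json: str) -> AlarmHistorianOptions:
    section = json.loads(appsettings_json).get("AlarmHistorian", {})
    # Unknown keys are ignored; missing keys fall back to the defaults.
    known = {_KEYS[k]: v for k, v in section.items() if k in _KEYS}
    opts = AlarmHistorianOptions(**known)
    if opts.max_attempts <= 0:
        raise ValueError("AlarmHistorian:MaxAttempts must be > 0")
    return opts


def choose_sink(opts: AlarmHistorianOptions) -> str:
    # Returns the name of the sink the doc says is registered for this gate.
    return "SqliteStoreAndForwardSink" if opts.enabled else "NullAlarmHistorianSink"


dormant = load_options("{}")
active = load_options('{"AlarmHistorian": {"Enabled": true, "BatchSize": 50}}')
```

An empty configuration yields the dormant `NullAlarmHistorianSink`; setting only `Enabled` and `BatchSize` leaves every other knob at its documented default.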
> The downstream gateway connection lives in `ServerHistorian` (`Endpoint` + env `ServerHistorian__ApiKey`,
> `UseTls`, `CaCertificatePath`); alarm-history `ReadEvents` additionally requires the gateway to run with
> `RuntimeDb:EventReadsEnabled=true`. The old Wonderware connection keys (`SharedSecret` /
> `AlarmHistorian:Host`/`Port`/`UseTls`/`ServerCertThumbprint`) were pruned.
>
> Dev and docker-dev deployments leave `Enabled` unset (defaults to `false`), so alarm transitions are not historized there until the sink is explicitly enabled against a configured HistorianGateway.
---
## See also
- [AlarmTracking.md](AlarmTracking.md) — the three alarm sources and the OPC UA
Part 9 surface; which alarms route to this sink.
- [DriverLifecycle.md](DriverLifecycle.md) — `IHistorianDataSource` (the
historian *read* surface; this page covers the *write* path) and the
`GatewayHistorianDataSource`.
- [ScriptedAlarms.md](ScriptedAlarms.md) — the scripted-alarm engine that emits
most events into this sink.
- [ServiceHosting.md](ServiceHosting.md) — the external HistorianGateway backend.