# Alarm Historian: store-and-forward SQLite sink

Reference for `ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian`
([`src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/)),
the durable local queue that historizes alarm transitions to AVEVA Historian
without ever blocking the alarm engine or operator actions.

This is the *sink mechanics* doc. For how the three alarm sources converge on
the OPC UA Part 9 surface and which alarms route here, see
[AlarmTracking.md](AlarmTracking.md). For the historian client that drains this
queue, see [DriverLifecycle.md](DriverLifecycle.md#ihistoriandatasource--server-side-historian-read-surface)
and [ServiceHosting.md](ServiceHosting.md).

---
## Why store-and-forward

Scripted alarms (and any future non-Galaxy `IAlarmSource`, e.g. AB CIP ALMD)
must reach AVEVA Historian, but the historian gateway can be slow, busy, or
disconnected. The sink decouples the alarm engine from historian reachability:
every qualifying transition is committed to a **local SQLite queue first**, and
a background drain worker forwards rows to the historian on a backoff-aware
cadence. Operator acks and alarm-state transitions are never blocked waiting on
the historian.

> Galaxy-native alarms with `$Alarm*` extensions reach AVEVA Historian directly
> via System Platform's `HistorizeToAveva` toggle; they do **not** flow through
> this sink. This path is exclusively for non-Galaxy alarm producers.

---
## Contracts

All in
[`IAlarmHistorianSink.cs`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs)
unless noted.

- **`IAlarmHistorianSink`**: the intake contract. `EnqueueAsync(evt, ct)`
  durably enqueues an event and returns as soon as the queue row is committed
  (fire-and-forget from the engine's perspective; the sink must not block the
  emitting thread). `GetStatus()` returns a `HistorianSinkStatus` snapshot.
- **`NullAlarmHistorianSink`**: the no-op default for tests and deployments
  that don't historize alarms. It is the default DI binding (registered in the
  Runtime's `AddOtOpcUaRuntime`); production overrides it with
  `SqliteStoreAndForwardSink`.
- **`AlarmHistorianEvent`**
  ([`AlarmHistorianEvent.cs`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/AlarmHistorianEvent.cs)):
  the source-agnostic event record, with fields `AlarmId`, `EquipmentPath` (UNS path,
  doubles as Historian's SourceNode), `AlarmName`, `AlarmTypeName` (Part 9
  subtype), `Severity`, `EventKind` (free-form transition string such as
  "Activated", "Cleared", or "Acknowledged"), `Message`, `User`, `Comment`,
  and `TimestampUtc`.
- **`IAlarmHistorianWriter`**: what the drain worker delegates writes to.
  `WriteBatchAsync(batch, ct)` returns one `HistorianWriteOutcome` per event,
  in order. Production binds this to `GatewayAlarmHistorianWriter` (the
  HistorianGateway `SendEvent` path).
- **`HistorianWriteOutcome`**: per-event drain result. `Ack` (persisted,
  remove from queue), `RetryPlease` (transient failure; leave queued, retry
  after backoff), `PermanentFail` (malformed/unrecoverable; move to
  dead-letter).
- **`HistorianSinkStatus`**: diagnostic snapshot surfaced to the AdminUI and
  `/healthz`: `QueueDepth`, `DeadLetterDepth`, `LastDrainUtc`, `LastSuccessUtc`,
  `LastError`, `DrainState`, and `EvictedCount`.
- **`HistorianDrainState`**: `Disabled` / `Idle` / `Draining` / `BackingOff`.

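The shape of these contracts can be pictured with a small Python transliteration. This is only a sketch under stated assumptions: the real types are C# and live in the linked source files, while the Python field types, the snake_case method names, the `AlwaysAckWriter` helper, and the status values returned by the null sink are invented here for illustration. The one rule it does take from the text is that a writer returns exactly one outcome per event, in order.

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional, Protocol, Sequence


class HistorianWriteOutcome(Enum):
    ACK = "Ack"                       # persisted; remove from queue
    RETRY_PLEASE = "RetryPlease"      # transient; leave queued, retry after backoff
    PERMANENT_FAIL = "PermanentFail"  # unrecoverable; move to dead-letter


@dataclass(frozen=True)
class AlarmHistorianEvent:
    # Field names follow the doc; the Python types are assumptions.
    alarm_id: str
    equipment_path: str
    alarm_name: str
    alarm_type_name: str
    severity: int
    event_kind: str
    message: str
    user: Optional[str]
    comment: Optional[str]
    timestamp_utc: datetime


class AlarmHistorianWriter(Protocol):
    async def write_batch(
        self, batch: Sequence[AlarmHistorianEvent]
    ) -> list[HistorianWriteOutcome]:
        """Return exactly one outcome per event, in the same order."""
        ...


class NullAlarmHistorianSink:
    """No-op sink: accepts every event and stores nothing."""

    async def enqueue(self, evt: AlarmHistorianEvent) -> None:
        return None

    def get_status(self) -> dict:
        # Status values are assumed, not taken from the C# source.
        return {"QueueDepth": 0, "DeadLetterDepth": 0, "DrainState": "Disabled"}


class AlwaysAckWriter:
    """Hypothetical test writer that acknowledges every event."""

    async def write_batch(self, batch):
        return [HistorianWriteOutcome.ACK for _ in batch]
```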
---

## SqliteStoreAndForwardSink

[`SqliteStoreAndForwardSink.cs`](../src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs)
is the production `IAlarmHistorianSink`. Construction takes a SQLite database
path, an `IAlarmHistorianWriter`, a logger, and optional `batchSize` (default
100), `capacity` (default 1,000,000), `deadLetterRetention` (default 30 days),
and a test clock.

### Queue table

The sink owns one SQLite table (created on construction, WAL journal mode):

```sql
CREATE TABLE Queue (
    RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId        TEXT NOT NULL,
    EnqueuedUtc    TEXT NOT NULL,
    PayloadJson    TEXT NOT NULL,  -- JSON-serialized AlarmHistorianEvent
    AttemptCount   INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT NULL,
    LastError      TEXT NULL,
    DeadLettered   INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IX_Queue_Drain ON Queue (DeadLettered, RowId);
```

`EnqueueAsync` does a single `INSERT` on the hot path. To avoid a
`SELECT COUNT(*)` on every enqueue, the sink keeps an in-memory non-dead-lettered
row counter (seeded at startup, kept current by every mutation, and re-synced
from storage every 10,000 enqueues to defend against drift). SQLite writer
contention is handled via `PRAGMA busy_timeout=5000` + WAL so an enqueue/drain
collision waits out the file lock instead of failing fast.

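The hot-path idea (one `INSERT`, a cached live-row count, a periodic re-sync) can be sketched with Python's standard `sqlite3` module. This is an illustration rather than the sink's actual C# code: the `QueueSketch` class, its attribute names, and the `RESYNC_EVERY` constant are hypothetical, and only the schema, the two pragmas, and the 10,000-enqueue re-sync cadence come from the text above.

```python
import json
import sqlite3
from datetime import datetime, timezone

RESYNC_EVERY = 10_000  # re-sync cadence named in the text

SCHEMA = """
CREATE TABLE IF NOT EXISTS Queue (
    RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
    AlarmId        TEXT NOT NULL,
    EnqueuedUtc    TEXT NOT NULL,
    PayloadJson    TEXT NOT NULL,
    AttemptCount   INTEGER NOT NULL DEFAULT 0,
    LastAttemptUtc TEXT NULL,
    LastError      TEXT NULL,
    DeadLettered   INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IF NOT EXISTS IX_Queue_Drain ON Queue (DeadLettered, RowId);
"""


class QueueSketch:
    def __init__(self, path: str) -> None:
        self.db = sqlite3.connect(path)
        self.db.execute("PRAGMA journal_mode=WAL")
        self.db.execute("PRAGMA busy_timeout=5000")  # wait out writer locks
        self.db.executescript(SCHEMA)
        self.enqueues_since_sync = 0
        self.live_rows = self._count_live()  # seeded once at startup

    def _count_live(self) -> int:
        row = self.db.execute(
            "SELECT COUNT(*) FROM Queue WHERE DeadLettered = 0"
        ).fetchone()
        return row[0]

    def enqueue(self, alarm_id: str, payload: dict) -> None:
        # Hot path: a single INSERT, no COUNT(*).
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO Queue (AlarmId, EnqueuedUtc, PayloadJson)"
                " VALUES (?, ?, ?)",
                (alarm_id, datetime.now(timezone.utc).isoformat(),
                 json.dumps(payload)),
            )
        self.live_rows += 1
        self.enqueues_since_sync += 1
        if self.enqueues_since_sync >= RESYNC_EVERY:
            self.live_rows = self._count_live()  # defend against drift
            self.enqueues_since_sync = 0
```

Calling `QueueSketch(":memory:")` is enough to experiment; an in-memory database ignores the WAL pragma, so a file path is needed to observe the journal mode itself.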
### Drain worker

`StartDrainLoop(tickInterval)` starts a **self-rescheduling one-shot
`System.Threading.Timer`**. The loop is not started automatically, so tests can
drive `DrainOnceAsync` deterministically. Each tick:

1. Purges aged dead-lettered rows past the retention window.
2. Reads up to `batchSize` non-dead-lettered rows in `RowId` order.
3. Rows with un-deserializable payloads are dead-lettered immediately (by their
   own `RowId`) so they can't stall the queue head.
4. The remaining batch is handed to `IAlarmHistorianWriter.WriteBatchAsync`, and
   each outcome is applied in one transaction: `Ack` deletes the row,
   `PermanentFail` flips its `DeadLettered` flag, `RetryPlease` bumps its attempt
   count and leaves it queued. A row whose `AttemptCount` has reached the configured
   **`MaxAttempts`** cap (default 10) is dead-lettered automatically on the next drain
   tick rather than retried. This breaks infinite retry loops for poison events whose
   payload the historian will always reject (e.g. a malformed alarm record that triggers
   a permanent SDK error on every attempt). The dead-lettered row stays inspectable
   for the configured retention window and can be requeued with `RetryDeadLettered()`.
5. The timer re-arms its next due-time to `max(tickInterval, currentBackoff)`.

**Backoff ladder** (applied to the timer's next due-time, so a historian outage
genuinely slows the drain cadence): 1s → 2s → 5s → 15s → 60s cap. Any
`RetryPlease` outcome, a writer exception, or a writer cardinality violation
(outcome count ≠ event count) bumps the backoff and sets
`DrainState = BackingOff`; a clean batch resets it. The async-void timer callback
is fully guarded: a fault is logged and recorded into `GetStatus()` rather than
lost as an unobserved task exception.

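The outcome handling and the backoff ladder described above can be sketched in a few lines of Python. Everything here is illustrative: `DrainSketch`, its method names, and the dict-based rows are hypothetical stand-ins for the C# drain worker, and treating a batch with no `RetryPlease` outcome as "clean" is an assumption, since the text does not say whether a `PermanentFail` also keeps the backoff.

```python
from enum import Enum

BACKOFF_LADDER = [1, 2, 5, 15, 60]  # seconds; the last rung is the cap
MAX_ATTEMPTS = 10                   # default from the text


class Outcome(Enum):
    ACK = 1
    RETRY_PLEASE = 2
    PERMANENT_FAIL = 3


class DrainSketch:
    def __init__(self, tick_interval: float) -> None:
        self.tick_interval = tick_interval
        self.backoff_index = -1  # -1 means "not backing off"
        self.state = "Idle"

    def apply(self, rows: list, outcomes: list) -> list:
        """Apply one outcome per row; return the rows that stay stored."""
        if len(outcomes) != len(rows):  # writer cardinality violation
            self._bump()
            return rows
        kept = []
        for row, outcome in zip(rows, outcomes):
            if outcome is Outcome.ACK:
                continue  # delivered: the row is deleted
            if outcome is Outcome.PERMANENT_FAIL:
                row["DeadLettered"] = 1
            else:
                row["AttemptCount"] += 1  # RetryPlease: stays queued
            kept.append(row)
        if Outcome.RETRY_PLEASE in outcomes:
            self._bump()
        else:
            self.backoff_index, self.state = -1, "Idle"  # clean batch resets
        return kept

    def dead_letter_exhausted(self, rows: list) -> None:
        """Run at the start of the next tick: cap poison rows at MAX_ATTEMPTS."""
        for row in rows:
            if not row["DeadLettered"] and row["AttemptCount"] >= MAX_ATTEMPTS:
                row["DeadLettered"] = 1

    def _bump(self) -> None:
        self.backoff_index = min(self.backoff_index + 1, len(BACKOFF_LADDER) - 1)
        self.state = "BackingOff"

    def next_due(self) -> float:
        backoff = BACKOFF_LADDER[self.backoff_index] if self.backoff_index >= 0 else 0
        return max(self.tick_interval, backoff)
```

One consequence of `max(tickInterval, currentBackoff)` shows up directly in `next_due()`: with a 5-second tick, the 1s, 2s, and 5s rungs all yield a 5-second due-time, so the drain cadence only slows once the ladder reaches 15s.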
### Durability bound (important)

**The durability guarantee is bounded by `capacity` (default 1,000,000 rows).**
When the non-dead-lettered queue reaches capacity, `EnqueueAsync` evicts the
oldest non-dead-lettered rows (oldest `RowId` first) to make room, logs a WARN,
and increments `HistorianSinkStatus.EvictedCount`. Under a sustained historian
outage, accepted alarm events can therefore be dropped before delivery. A
non-zero `EvictedCount` is a data-loss signal that requires operator attention:
it surfaces otherwise silent loss without log scraping.

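A minimal sketch of oldest-first eviction, again in Python with `sqlite3`, may make the bound concrete. The function name, the `stats` dict, and the trimmed-down table are hypothetical. For readability it also runs a `COUNT(*)` on each call, whereas the sink described above relies on its in-memory counter instead.

```python
import sqlite3


def enqueue_with_capacity(db: sqlite3.Connection, alarm_id: str,
                          payload_json: str, capacity: int, stats: dict) -> None:
    """Evict the oldest live rows if needed, then insert; count the drops."""
    with db:  # one transaction for evict + insert
        live = db.execute(
            "SELECT COUNT(*) FROM Queue WHERE DeadLettered = 0"
        ).fetchone()[0]
        overflow = live - capacity + 1  # rows to drop so the insert fits
        if overflow > 0:
            cur = db.execute(
                "DELETE FROM Queue WHERE RowId IN ("
                " SELECT RowId FROM Queue WHERE DeadLettered = 0"
                " ORDER BY RowId LIMIT ?)",
                (overflow,),
            )
            stats["EvictedCount"] += cur.rowcount  # the data-loss signal
        db.execute(
            "INSERT INTO Queue (AlarmId, EnqueuedUtc, PayloadJson)"
            " VALUES (?, datetime('now'), ?)",
            (alarm_id, payload_json),
        )


# Tiny demo with a reduced table and a capacity of 3.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE Queue ("
    " RowId INTEGER PRIMARY KEY AUTOINCREMENT,"
    " AlarmId TEXT NOT NULL,"
    " EnqueuedUtc TEXT NOT NULL,"
    " PayloadJson TEXT NOT NULL,"
    " DeadLettered INTEGER NOT NULL DEFAULT 0)"
)
stats = {"EvictedCount": 0}
for i in range(5):
    enqueue_with_capacity(db, f"alarm-{i}", "{}", capacity=3, stats=stats)
```

After the loop the queue holds the three newest rows and `stats["EvictedCount"]` is 2, which is the kind of non-zero reading an operator should treat as lost alarm history.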
### Dead-letter + operator recovery

`PermanentFail` and corrupt-payload rows are retained in-place with
`DeadLettered = 1` for the retention window (default 30 days) so operators can
inspect them before the sweeper purges them. `RetryDeadLettered()` is the
operator action (from the AdminUI) that clears the dead-letter flag and attempt
count on every dead-lettered row, returning them to the regular queue with a
fresh backoff.

---
## Runtime wiring

`HistorianAdapterActor`
([`Runtime/Historian/HistorianAdapterActor.cs`](../src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs))
subscribes to the cluster **`alerts` DPS topic** and translates each
`AlarmTransitionEvent` into an `AlarmHistorianEvent`, then calls
`IAlarmHistorianSink.EnqueueAsync` fire-and-forget so the actor loop is never
blocked on historian reachability. The actor is **Primary-gated**: only the
node whose `RedundancyRole` is `Primary` historizes, giving exactly-once
writes across a redundant pair. `AlarmTransitionEvent` carries `AlarmTypeName`
(the Part 9 subtype string) and `Comment` (the operator comment from the
originating ack/shelve command), which populate the corresponding fields of
`AlarmHistorianEvent`. `GatewayAlarmHistorianWriter` is the `IAlarmHistorianWriter`
the drain worker delegates to (the gateway `SendEvent` path). See
[ServiceHosting.md](ServiceHosting.md) for the (external) HistorianGateway setup.

**Scope:** scripted alarms only. Galaxy-native alarms historize via System
Platform's `HistorizeToAveva` toggle (not this actor); AB CIP ALMD is not yet on
the `alerts` topic (future work).

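The Primary gate and the fire-and-forget hand-off can be illustrated with a short `asyncio` sketch. None of these names come from the C# actor: `HistorianAdapterSketch`, `RecordingSink`, the dict-shaped events, the `EventKind` key on the transition, and the `"Secondary"` role string used when trying it out are all assumptions made for the example.

```python
import asyncio


class RecordingSink:
    """Hypothetical stand-in for IAlarmHistorianSink that records events."""

    def __init__(self) -> None:
        self.events = []

    async def enqueue(self, evt: dict) -> None:
        self.events.append(evt)


class HistorianAdapterSketch:
    def __init__(self, sink, role: str) -> None:
        self.sink = sink
        self.role = role
        self._pending = set()  # keeps task references alive until done

    def on_alarm_transition(self, transition: dict) -> None:
        if self.role != "Primary":
            return  # only the Primary node historizes
        evt = {
            "AlarmId": transition["AlarmId"],
            "AlarmTypeName": transition["AlarmTypeName"],
            "Comment": transition.get("Comment"),
            "EventKind": transition["EventKind"],
        }
        # Fire-and-forget: schedule the enqueue and return immediately.
        task = asyncio.create_task(self.sink.enqueue(evt))
        self._pending.add(task)
        task.add_done_callback(self._pending.discard)


async def demo():
    primary_sink, secondary_sink = RecordingSink(), RecordingSink()
    primary = HistorianAdapterSketch(primary_sink, "Primary")
    secondary = HistorianAdapterSketch(secondary_sink, "Secondary")
    transition = {"AlarmId": "a1", "AlarmTypeName": "LimitAlarmType",
                  "EventKind": "Acknowledged", "Comment": "ok"}
    primary.on_alarm_transition(transition)
    secondary.on_alarm_transition(transition)
    await asyncio.sleep(0)  # let the scheduled enqueue run
    return primary_sink.events, secondary_sink.events
```

Running `asyncio.run(demo())` shows the gate at work: the Primary's sink receives the event while the other node's sink stays empty.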
## Configuration

The real sink is opt-in via the `AlarmHistorian` section of `appsettings.json`.
When `Enabled` is `false` (the default), `AddAlarmHistorian` registers
`NullAlarmHistorianSink` and the feature is dormant. When `Enabled` is `true`,
`AddAlarmHistorian` constructs `SqliteStoreAndForwardSink` and registers
`GatewayAlarmHistorianWriter` as the `IAlarmHistorianWriter`. This section carries
**only** the `Enabled` gate + the SQLite store-and-forward knobs; the downstream
gateway connection (endpoint / key / TLS) is sourced from the `ServerHistorian`
section (see [Historian.md](Historian.md)).

```json
{
  "AlarmHistorian": {
    "Enabled": true,
    "DatabasePath": "C:\\ProgramData\\OtOpcUa\\alarmhistorian.db",
    "BatchSize": 100,
    "DrainIntervalSeconds": 5,
    "Capacity": 1000000,
    "DeadLetterRetentionDays": 30,
    "MaxAttempts": 10
  }
}
```

| Key | Type | Default | Description |
|---|---|---|---|
| `Enabled` | bool | `false` | Enable the SQLite store-and-forward sink (drains to the HistorianGateway `SendEvent` path). `false` → `NullAlarmHistorianSink`. |
| `DatabasePath` | string | `alarm-historian.db` | Path to the SQLite queue file. Created on first use (WAL mode). Set an **absolute** path in production. |
| `BatchSize` | int | `100` | Max rows per drain cycle handed to `IAlarmHistorianWriter.WriteBatchAsync`. |
| `DrainIntervalSeconds` | int | `5` | Seconds between drain-worker ticks. |
| `Capacity` | long | `1000000` | Max queued rows before the sink evicts the oldest (data-loss signal via `EvictedCount`). |
| `DeadLetterRetentionDays` | int | `30` | Days to retain dead-lettered rows before purge. |
| `MaxAttempts` | int | `10` | Maximum delivery attempts before a poison (perpetually-retrying) row is dead-lettered automatically. Must be > 0. |

> The downstream gateway connection lives in `ServerHistorian` (`Endpoint` + env `ServerHistorian__ApiKey`,
> `UseTls`, `CaCertificatePath`); alarm-history `ReadEvents` additionally requires the gateway to run with
> `RuntimeDb:EventReadsEnabled=true`. The old Wonderware connection keys (`SharedSecret` /
> `AlarmHistorian:Host`/`Port`/`UseTls`/`ServerCertThumbprint`) were pruned.

> Dev and docker-dev deployments leave `Enabled` unset (defaults to `false`), so alarm transitions are not historized unless the sink is enabled and a HistorianGateway is configured.

---
## See also

- [AlarmTracking.md](AlarmTracking.md): the three alarm sources and the OPC UA
  Part 9 surface; which alarms route to this sink.
- [DriverLifecycle.md](DriverLifecycle.md): `IHistorianDataSource` (the
  historian *read* surface; this page covers the *write* path) and the
  `GatewayHistorianDataSource`.
- [ScriptedAlarms.md](ScriptedAlarms.md): the scripted-alarm engine that emits
  most events into this sink.
- [ServiceHosting.md](ServiceHosting.md): the external HistorianGateway backend.