# Alarm Follow-ups (Round 2) — Design
**Date:** 2026-06-11

**Status:** Approved (brainstorming); ready for implementation plan.
## Goal
Resolve the two follow-ups left open by the alarm-followups work (merged to master
`521ee753`):
- **B — Historian feeder.** `HistorianAdapterActor` is Primary-gated but has **no
production feeder** (nothing produces `AlarmHistorianEvent`). Wire engine→historian so
exactly-once alarm historization actually activates.
- **A — Galaxy alarm-client reconnect.** The reviewer flagged that the session-less alarm
client (`GalaxyDriver._ownedMxClient`) is never recreated on reconnect. Investigation shows
this is **already handled**, so the work is verify + document, not new reconnect code.

Scope was decided in brainstorming. Decisions:

- **B:** the historian **subscribes to the `alerts` DPS topic** and translates (the intended
design per `ScriptedAlarmHostActor`'s own docstring), **extending `AlarmTransitionEvent`**
with the two fields it lacks (`AlarmTypeName`, `Comment`) to avoid data loss. This includes a
small **Core.ScriptedAlarms engine change** to carry `Comment` through the emission.
- **B-sink:** also wire a **config-gated real durable sink**
(`SqliteStoreAndForwardSink`→Wonderware) with a `Null` fallback.
- **A:** **verify + test + document** only (gRPC keepalive hardening is not reachable from this
repo; see below).

---
## Part B — Historian feeder (substantive; high-risk)
### The intended design (already documented)
`ScriptedAlarmHostActor`'s docstring states it explicitly: the host publishes each transition
as an `AlarmTransitionEvent` to the `alerts` topic **only**, and "deliberately does NOT also
Tell the historian adapter directly — doing so would double-historize... a direct tell would
duplicate every row." So the historian is meant to **subscribe to `alerts`** and translate.
That subscription was never implemented — `HistorianAdapterActor` has the `Receive<AlarmHistorianEvent>`
+ the Primary gate, but no `alerts` subscription and no feeder. This plan implements it.
### Data flow
```
ScriptedAlarmHostActor.OnEngineEmission   (Primary only, already gated)
  └─ Publish(AlarmTransitionEvent) → "alerts" DPS topic
       ├─ AlertSignalRBridge            (existing: live UI fan-out)
       └─ HistorianAdapterActor         (NEW: Subscribe in PreStart)
            └─ if _localRole is Primary (existing T2 gate, STILL REQUIRED)
                 └─ translate AlarmTransitionEvent → AlarmHistorianEvent
                      └─ IAlarmHistorianSink.EnqueueAsync (fire-and-forget)
```
**Why the T2 Primary gate stays load-bearing:** the Primary publishes the transition **once**,
but DistributedPubSub fans that single message to **every** node's subscribers — including
**both** central nodes' `HistorianAdapterActor`. Without the gate, both nodes' historians would
enqueue → double DB writes. The T2 gate keeps only the Primary writing → exactly-once. (The
historization is therefore co-located with the alerts emission: Primary-only on both ends.)
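
The gate's effect can be illustrated with a minimal sketch that needs no Akka at all. The types below (`NodeRole`, `AlarmTransition`, `RecordingSink`, `GatedHistorian`) are hypothetical stand-ins, not the repo's classes, and the "no role" default case is not modeled. One published message is fanned out to two subscribers, as the topic would do, and only the Primary's subscriber reaches the sink.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration; the real types live in the repo and may differ.
public enum NodeRole { Primary, Secondary }

public sealed record AlarmTransition(string AlarmId, string TransitionKind);

public sealed class RecordingSink
{
    public List<AlarmTransition> Writes { get; } = new();
    public void Enqueue(AlarmTransition t) => Writes.Add(t);
}

// Models one node's historian subscriber: it sees every published
// transition, but only enqueues while its local role is Primary.
public sealed class GatedHistorian
{
    private readonly NodeRole _localRole;
    private readonly RecordingSink _sink;

    public GatedHistorian(NodeRole localRole, RecordingSink sink)
    {
        _localRole = localRole;
        _sink = sink;
    }

    public void OnAlertsMessage(AlarmTransition t)
    {
        if (_localRole != NodeRole.Primary) return; // the gate
        _sink.Enqueue(t);
    }
}

public static class GateDemo
{
    // Delivers one published message to both nodes' subscribers and
    // returns how many rows reached the shared sink.
    public static int PublishOnce()
    {
        var sink = new RecordingSink();
        var nodes = new[]
        {
            new GatedHistorian(NodeRole.Primary, sink),
            new GatedHistorian(NodeRole.Secondary, sink),
        };
        var msg = new AlarmTransition("Tank1.HiHi", "Activated");
        foreach (var node in nodes) node.OnAlertsMessage(msg);
        return sink.Writes.Count;
    }
}
```

Removing the role check in `OnAlertsMessage` makes `PublishOnce` return 2, which is the double-write the gate exists to prevent.
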
### Components
1. **Extend `AlarmTransitionEvent`** (`Commons/Messages/Alerts/AlarmTransitionEvent.cs`) with
`string AlarmTypeName` and `string? Comment`. The change is additive; the cluster's default Akka
serializer (no Hyperion/protobuf binding) is forward/backward compatible across a rolling restart
(old nodes ignore the new fields; new nodes read old messages with the fields null/default).
Because `AlarmTypeName` is declared non-nullable but can arrive as null from an old node during
the restart window, the translation should tolerate a null value.
Existing consumers (`AlertSignalRBridge`, AdminUI `/alerts`) ignore the new fields.
2. **`ScriptedAlarmEvent`** (`Core.ScriptedAlarms/ScriptedAlarmEngine.cs`) gains a
`string? Comment` field; the engine populates it on comment-bearing transitions
(Acknowledge / Confirm / AddComment / Shelve ops already receive the operator comment —
thread it into the emitted event). `Kind` already carries the Part-9 type.
3. **`ScriptedAlarmHostActor.OnEngineEmission`** populates the two new `AlarmTransitionEvent`
fields: `AlarmTypeName = e.Kind.ToString()`, `Comment = e.Comment`. No other change to the
emit path (the Primary gate + OPC UA write stay as-is).
4. **`HistorianAdapterActor`**: `Subscribe(AlertsTopic, Self)` in PreStart (+ `SubscribeAck`
no-op); add `Receive<AlarmTransitionEvent>` that **translates** then runs the **existing
Primary-gated** enqueue. Translation:
- `Severity` int→`AlarmSeverity` by inverting `ScriptedAlarmHostActor.SeverityToInt`'s
buckets (1–250 Low, 251–500 Medium, 501–750 High, 751–1000 Critical).
- `EventKind = TransitionKind`; `AlarmTypeName`, `Comment`, the rest map 1:1.
- Keep the existing `Receive<AlarmHistorianEvent>` path too (harmless; lets a future direct
source still feed it).
5. **Config-gated durable sink** — a new `AlarmHistorian` appsettings section + an
`AddAlarmHistorian` registration (Runtime `ServiceCollectionExtensions`, called from the
Host). When the section is present: register `SqliteStoreAndForwardSink(dbPath,
new WonderwareHistorianClient(WonderwareHistorianClientOptions{PipeName, SharedSecret, …}),
logger)` as the `IAlarmHistorianSink` singleton and start its drain loop (`StartDrainLoop`);
dispose it on shutdown. When absent: keep `NullAlarmHistorianSink` (current default — so
dev/docker-dev is unaffected). `SqliteStoreAndForwardSink` takes `(string databasePath,
IAlarmHistorianWriter writer, ILogger, …)`; `WonderwareHistorianClient` implements
`IAlarmHistorianWriter` and talks the named-pipe IPC.
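
The translation in component 4 might look roughly like the sketch below. The record shapes are simplified assumptions (the `AlarmId` field and the exact constructor order are hypothetical; the real records carry more fields), and clamping out-of-range severities to the nearest bucket is also an assumption, since the design only specifies the 1 to 1000 range.

```csharp
using System;

// Hypothetical, simplified shapes for illustration only.
public enum AlarmSeverity { Low, Medium, High, Critical }

public sealed record AlarmTransitionEvent(
    string AlarmId, string TransitionKind, int Severity,
    string AlarmTypeName, string? Comment);

public sealed record AlarmHistorianEvent(
    string AlarmId, string EventKind, AlarmSeverity Severity,
    string AlarmTypeName, string? Comment);

public static class AlarmHistorianTranslator
{
    // Inverts the 1-250 / 251-500 / 501-750 / 751-1000 buckets.
    // Assumption: values outside 1..1000 clamp to the nearest bucket.
    public static AlarmSeverity ToSeverity(int severity) => severity switch
    {
        <= 250 => AlarmSeverity.Low,
        <= 500 => AlarmSeverity.Medium,
        <= 750 => AlarmSeverity.High,
        _ => AlarmSeverity.Critical,
    };

    // EventKind = TransitionKind; the remaining fields map 1:1.
    public static AlarmHistorianEvent Translate(AlarmTransitionEvent e) =>
        new(e.AlarmId, e.TransitionKind, ToSeverity(e.Severity),
            e.AlarmTypeName, e.Comment);
}
```

Keeping the mapping in a static, actor-free helper like this would let the boundary-value tests run without TestKit.
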
### Scope
Scripted alarms only — the only source on the `alerts` topic. Galaxy native alarms historize
via AVEVA System Platform's own `HistorizeToAveva` (not this sink); AB CIP ALMD alarms aren't
on `alerts` (a future addition, out of scope here).
### Tests
- TestKit (`HistorianAdapterActorTests`): a **Secondary** host receiving an `AlarmTransitionEvent`
does NOT enqueue; a **Primary** translates + enqueues (fake `IAlarmHistorianSink` recording
writes); default (no role) enqueues.
- Translation unit tests: severity int→enum buckets (boundary values), `AlarmTypeName`/`Comment`
carried, `EventKind` mapping.
- Engine: `ScriptedAlarmEvent.Comment` is populated on an ack/comment transition.
- Config-gated registration: section present → `SqliteStoreAndForwardSink` bound; absent →
`NullAlarmHistorianSink`. (xUnit + Shouldly; in-memory config.)
- No bUnit (no UI change). docker-dev live-verify is optional here (the sink stays Null on
docker-dev unless configured); the exactly-once gating is proven by TestKit.
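
The config-gated registration case in the list above comes down to one decision: section present means durable sink, section absent means Null sink. The sketch below shows only that decision, under loud assumptions: a flat dictionary with `Section:Key` names stands in for `IConfiguration`, `FakeDurableSink` stands in for `SqliteStoreAndForwardSink`, and the `DatabasePath` key name and its fallback value are hypothetical. The real registration would go through DI in `AddAlarmHistorian`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public interface IAlarmHistorianSink { }

public sealed class NullAlarmHistorianSink : IAlarmHistorianSink { }

// Stand-in for the durable sink; the real SqliteStoreAndForwardSink
// also takes a writer and a logger.
public sealed class FakeDurableSink : IAlarmHistorianSink
{
    public string DatabasePath { get; }
    public FakeDurableSink(string databasePath) => DatabasePath = databasePath;
}

public static class AlarmHistorianSinkSelector
{
    private const string Section = "AlarmHistorian";

    // Flat "Section:Key" entries mimic how configuration keys are addressed.
    public static IAlarmHistorianSink Select(IReadOnlyDictionary<string, string?> config)
    {
        bool sectionPresent = config.Keys.Any(
            k => k.StartsWith(Section + ":", StringComparison.OrdinalIgnoreCase));

        // Absent section: keep the Null sink so dev environments are unaffected.
        if (!sectionPresent) return new NullAlarmHistorianSink();

        // "DatabasePath" and the fallback file name are hypothetical.
        config.TryGetValue(Section + ":DatabasePath", out var path);
        return new FakeDurableSink(path ?? "alarm-historian.db");
    }
}
```
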
---
## Part A — Galaxy alarm-client reconnect (verify + document; low-risk)
### Finding (verified)
The gap is **already handled**:
- `GatewayGalaxyAlarmFeed.RunAsync` (`Driver.Galaxy/Runtime/GatewayGalaxyAlarmFeed.cs:101-147`)
has its own reconnect loop: on any non-cancellation stream fault it logs, waits
`_reconnectDelay` (~5s), and **re-invokes `StreamAlarmsAsync` on the same client**. There is a
passing unit test (`Reopens_stream_after_a_transport_fault`).
- gRPC.NET channels auto-reconnect the underlying HTTP/2 connection, so re-invoking the stream
(or the next unary `AcknowledgeAlarmAsync`) after a gateway restart re-establishes transparently
— the **client** does not need recreating.
- **Keepalive hardening is not reachable from this repo**: `MxGatewayClientOptions` is a NuGet
package (sibling MxAccessGateway repo) and exposes no keepalive / channel-resilience knobs.
Adding them would be a sibling-repo change — out of scope.
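
For readers unfamiliar with the pattern, a re-invoke loop of the kind described for `GatewayGalaxyAlarmFeed.RunAsync` can be sketched as follows. This is not the repo's code: `ReconnectingFeed`, `openStream`, and `onItem` are hypothetical names, logging goes to the console rather than a logger, and reopening after a clean end of stream is an assumption the finding does not confirm.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ReconnectingFeed
{
    // Re-invokes openStream (on the same client) after any non-cancellation
    // fault, waiting reconnectDelay between attempts.
    public static async Task RunAsync<T>(
        Func<CancellationToken, IAsyncEnumerable<T>> openStream,
        Action<T> onItem,
        TimeSpan reconnectDelay,
        CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            try
            {
                await foreach (var item in openStream(ct).WithCancellation(ct))
                    onItem(item);
                // Assumption: a stream that ends cleanly is also reopened.
            }
            catch (OperationCanceledException) when (ct.IsCancellationRequested)
            {
                return; // shutdown, not a fault
            }
            catch (Exception ex)
            {
                Console.Error.WriteLine($"alarm stream faulted: {ex.Message}");
            }

            try
            {
                await Task.Delay(reconnectDelay, ct);
            }
            catch (OperationCanceledException)
            {
                return; // cancelled while waiting to reconnect
            }
        }
    }
}
```

The client object never changes inside the loop; only the stream call is repeated, which is why recreating `_ownedMxClient` is unnecessary as long as the underlying channel reconnects.
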
### Decision
No production reconnect code. Instead:
- Add a focused test that the **acknowledger** (`GatewayGalaxyAlarmAcknowledger`) recovers — its
next unary call succeeds after a transient fault (the stream path is already covered by the
existing feed test). If a real recovery gap surfaces, escalate to a follow-up.
- Document the alarm-feed reconnect behaviour in `docs/drivers/Galaxy.md`: the feed's own
re-invoke loop + gRPC channel auto-reconnect handle a gateway restart; `_ownedMxClient` is
intentionally **not** recreated (keepalive hardening would require a gateway-package change).
---
## Sequencing & risk
| Item | Risk | Notes |
|---|---|---|
| B1 Extend `AlarmTransitionEvent` (+ engine `Comment`) | standard | Commons record + Core.ScriptedAlarms engine + ScriptedAlarmHostActor populate |
| B2 `HistorianAdapterActor` alerts-subscribe + translate | high-risk | redundancy gate + cluster topic + exactly-once |
| B3 Config-gated durable sink + Host wiring | standard | DI + appsettings + Sqlite/Wonderware construction + drain lifecycle |
| A Verify + document | small | acknowledger-recovery test + Galaxy.md note |

B2 depends on B1 (needs the extended event). B3 is independent of B1/B2 (DI/config). A is fully
independent. TDD where there's logic; the exactly-once dedup is proven by TestKit, not docker-dev
(the real sink is config-gated and stays Null on docker-dev).
## Hard rules
Stage by explicit path (never `git add .`); never stage `sql_login.txt` /
`src/Server/.../Host/pki/`; never echo the gateway API key into a **new** tracked file; no
force-push, no `--no-verify`; **no Configuration entity / EF migration change** (none of these
items needs one — the historian queue is a standalone SQLite file, not the Config DB). Build on a
feature branch off `master`.