From 3ad7960dc21afde768495801e9969a83164ba955 Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Thu, 11 Jun 2026 10:51:55 -0400 Subject: [PATCH] docs(design): alarm follow-ups round 2 (historian alerts-feeder + Galaxy reconnect verify) --- ...026-06-11-alarm-followups-round2-design.md | 159 ++++++++++++++++++ 1 file changed, 159 insertions(+) create mode 100644 docs/plans/2026-06-11-alarm-followups-round2-design.md diff --git a/docs/plans/2026-06-11-alarm-followups-round2-design.md b/docs/plans/2026-06-11-alarm-followups-round2-design.md new file mode 100644 index 00000000..191c5fa4 --- /dev/null +++ b/docs/plans/2026-06-11-alarm-followups-round2-design.md @@ -0,0 +1,159 @@ +# Alarm Follow-ups (Round 2) — Design + +**Date:** 2026-06-11 +**Status:** Approved (brainstorming) — ready for implementation plan. + +## Goal + +Resolve the two follow-ups left open by the alarm-followups work (merged to master +`521ee753`): + +- **B — Historian feeder.** `HistorianAdapterActor` is Primary-gated but has **no + production feeder** (nothing produces `AlarmHistorianEvent`). Wire engine→historian so + exactly-once alarm historization actually activates. +- **A — Galaxy alarm-client reconnect.** The reviewer flagged that the session-less alarm + client (`GalaxyDriver._ownedMxClient`) is never recreated on reconnect. Investigation shows + this is **already handled** — so the work is verify + document, not new reconnect code. + +Scope decided in brainstorming. Decisions: (B) historian **subscribes to the `alerts` DPS +topic** and translates (the intended design per `ScriptedAlarmHostActor`'s own docstring), +**extending `AlarmTransitionEvent`** with the two fields it lacks (`AlarmTypeName`, `Comment`) +to avoid data loss — including a small **Core.ScriptedAlarms engine change** to carry `Comment` +through the emission; (B-sink) also wire a **config-gated real durable sink** +(`SqliteStoreAndForwardSink`→Wonderware) with a `Null` fallback; (A) **verify + test + +document** only (gRPC keepalive hardening is not reachable from this repo — see below). + +--- + +## Part B — Historian feeder (substantive; high-risk) + +### The intended design (already documented) + +`ScriptedAlarmHostActor`'s docstring states it explicitly: the host publishes each transition +as an `AlarmTransitionEvent` to the `alerts` topic **only**, and "deliberately does NOT also +Tell the historian adapter directly — doing so would double-historize... a direct tell would +duplicate every row." So the historian is meant to **subscribe to `alerts`** and translate. +That subscription was never implemented — `HistorianAdapterActor` has the `Receive` ++ the Primary gate, but no `alerts` subscription and no feeder. This plan implements it. + +### Data flow + +``` +ScriptedAlarmHostActor.OnEngineEmission (Primary only — already gated) + └─ Publish(AlarmTransitionEvent) → "alerts" DPS topic + ├─ AlertSignalRBridge (existing — live UI fan-out) + └─ HistorianAdapterActor (NEW: Subscribe in PreStart) + └─ if _localRole is Primary (existing T2 gate — STILL REQUIRED) + └─ translate AlarmTransitionEvent → AlarmHistorianEvent + └─ IAlarmHistorianSink.EnqueueAsync (fire-and-forget) +``` + +**Why the T2 Primary gate stays load-bearing:** the Primary publishes the transition **once**, +but DistributedPubSub fans that single message to **every** node's subscribers — including +**both** central nodes' `HistorianAdapterActor`. Without the gate, both nodes' historians would +enqueue → double DB writes. The T2 gate keeps only the Primary writing → exactly-once. (The +historization is therefore co-located with the alerts emission: Primary-only on both ends.) + +### Components + +1. **Extend `AlarmTransitionEvent`** (`Commons/Messages/Alerts/AlarmTransitionEvent.cs`) with + `string AlarmTypeName` and `string? Comment`. Additive; the cluster's default Akka serializer + (no Hyperion/protobuf binding) is forward/backward compatible across a rolling restart + (old nodes ignore the new fields; new nodes read old messages with the fields null/default). + Existing consumers (`AlertSignalRBridge`, AdminUI `/alerts`) ignore them. +2. **`ScriptedAlarmEvent`** (`Core.ScriptedAlarms/ScriptedAlarmEngine.cs`) gains a + `string? Comment` field; the engine populates it on comment-bearing transitions + (Acknowledge / Confirm / AddComment / Shelve ops already receive the operator comment — + thread it into the emitted event). `Kind` already carries the Part-9 type. +3. **`ScriptedAlarmHostActor.OnEngineEmission`** populates the two new `AlarmTransitionEvent` + fields: `AlarmTypeName = e.Kind.ToString()`, `Comment = e.Comment`. No other change to the + emit path (the Primary gate + OPC UA write stay as-is). +4. **`HistorianAdapterActor`**: `Subscribe(AlertsTopic, Self)` in PreStart (+ `SubscribeAck` + no-op); add `Receive` that **translates** then runs the **existing + Primary-gated** enqueue. Translation: + - `Severity` int→`AlarmSeverity` by inverting `ScriptedAlarmHostActor.SeverityToInt`'s + buckets (1–250 Low, 251–500 Medium, 501–750 High, 751–1000 Critical). + - `EventKind = TransitionKind`; `AlarmTypeName`, `Comment`, the rest map 1:1. + - Keep the existing `Receive` path too (harmless; lets a future direct + source still feed it). +5. **Config-gated durable sink** — a new `AlarmHistorian` appsettings section + an + `AddAlarmHistorian` registration (Runtime `ServiceCollectionExtensions`, called from the + Host). When the section is present: register `SqliteStoreAndForwardSink(dbPath, + new WonderwareHistorianClient(WonderwareHistorianClientOptions{PipeName, SharedSecret, …}), + logger)` as the `IAlarmHistorianSink` singleton and start its drain loop (`StartDrainLoop`); + dispose it on shutdown. When absent: keep `NullAlarmHistorianSink` (current default — so + dev/docker-dev is unaffected). `SqliteStoreAndForwardSink` takes `(string databasePath, + IAlarmHistorianWriter writer, ILogger, …)`; `WonderwareHistorianClient` implements + `IAlarmHistorianWriter` and talks the named-pipe IPC. + +### Scope + +Scripted alarms only — the only source on the `alerts` topic. Galaxy native alarms historize +via AVEVA System Platform's own `HistorizeToAveva` (not this sink); AB CIP ALMD alarms aren't +on `alerts` (a future addition, out of scope here). + +### Tests + +- TestKit (`HistorianAdapterActorTests`): a **Secondary** host receiving an `AlarmTransitionEvent` + does NOT enqueue; a **Primary** translates + enqueues (fake `IAlarmHistorianSink` recording + writes); default (no role) enqueues. +- Translation unit tests: severity int→enum buckets (boundary values), `AlarmTypeName`/`Comment` + carried, `EventKind` mapping. +- Engine: `ScriptedAlarmEvent.Comment` is populated on an ack/comment transition. +- Config-gated registration: section present → `SqliteStoreAndForwardSink` bound; absent → + `NullAlarmHistorianSink`. (xUnit + Shouldly; in-memory config.) +- No bUnit (no UI change). docker-dev live-verify is optional here (the sink stays Null on + docker-dev unless configured); the exactly-once gating is proven by TestKit. + +--- + +## Part A — Galaxy alarm-client reconnect (verify + document; low-risk) + +### Finding (verified) + +The gap is **already handled**: + +- `GatewayGalaxyAlarmFeed.RunAsync` (`Driver.Galaxy/Runtime/GatewayGalaxyAlarmFeed.cs:101-147`) + has its own reconnect loop: on any non-cancellation stream fault it logs, waits + `_reconnectDelay` (~5s), and **re-invokes `StreamAlarmsAsync` on the same client**. There is a + passing unit test (`Reopens_stream_after_a_transport_fault`). +- gRPC.NET channels auto-reconnect the underlying HTTP/2 connection, so re-invoking the stream + (or the next unary `AcknowledgeAlarmAsync`) after a gateway restart re-establishes transparently + — the **client** does not need recreating. +- **Keepalive hardening is not reachable from this repo**: `MxGatewayClientOptions` is a NuGet + package (sibling MxAccessGateway repo) and exposes no keepalive / channel-resilience knobs. + Adding them would be a sibling-repo change — out of scope. + +### Decision + +No production reconnect code. Instead: + +- Add a focused test that the **acknowledger** (`GatewayGalaxyAlarmAcknowledger`) recovers — its + next unary call succeeds after a transient fault (the stream path is already covered by the + existing feed test). If a real recovery gap surfaces, escalate to a follow-up. +- Document the alarm-feed reconnect behaviour in `docs/drivers/Galaxy.md`: the feed's own + re-invoke loop + gRPC channel auto-reconnect handle a gateway restart; `_ownedMxClient` is + intentionally **not** recreated (keepalive hardening would require a gateway-package change). + +--- + +## Sequencing & risk + +| Item | Risk | Notes | +|---|---|---| +| B1 Extend `AlarmTransitionEvent` (+ engine `Comment`) | standard | Commons record + Core.ScriptedAlarms engine + ScriptedAlarmHostActor populate | +| B2 `HistorianAdapterActor` alerts-subscribe + translate | high-risk | redundancy gate + cluster topic + exactly-once | +| B3 Config-gated durable sink + Host wiring | standard | DI + appsettings + Sqlite/Wonderware construction + drain lifecycle | +| A Verify + document | small | acknowledger-recovery test + Galaxy.md note | + +B2 depends on B1 (needs the extended event). B3 is independent of B1/B2 (DI/config). A is fully +independent. TDD where there's logic; the exactly-once dedup is proven by TestKit, not docker-dev +(the real sink is config-gated and stays Null on docker-dev). + +## Hard rules + +Stage by explicit path (never `git add .`); never stage `sql_login.txt` / +`src/Server/.../Host/pki/`; never echo the gateway API key into a **new** tracked file; no +force-push, no `--no-verify`; **no Configuration entity / EF migration change** (none of these +items needs one — the historian queue is a standalone SQLite file, not the Config DB). Build on a +feature branch off `master`.