# Alarm Follow-ups (Round 2) — Design **Date:** 2026-06-11 **Status:** Approved (brainstorming) — ready for implementation plan. ## Goal Resolve the two follow-ups left open by the alarm-followups work (merged to master `521ee753`): - **B — Historian feeder.** `HistorianAdapterActor` is Primary-gated but has **no production feeder** (nothing produces `AlarmHistorianEvent`). Wire engine→historian so exactly-once alarm historization actually activates. - **A — Galaxy alarm-client reconnect.** The reviewer flagged that the session-less alarm client (`GalaxyDriver._ownedMxClient`) is never recreated on reconnect. Investigation shows this is **already handled** — so the work is verify + document, not new reconnect code. Scope decided in brainstorming. Decisions: (B) historian **subscribes to the `alerts` DPS topic** and translates (the intended design per `ScriptedAlarmHostActor`'s own docstring), **extending `AlarmTransitionEvent`** with the two fields it lacks (`AlarmTypeName`, `Comment`) to avoid data loss — including a small **Core.ScriptedAlarms engine change** to carry `Comment` through the emission; (B-sink) also wire a **config-gated real durable sink** (`SqliteStoreAndForwardSink`→Wonderware) with a `Null` fallback; (A) **verify + test + document** only (gRPC keepalive hardening is not reachable from this repo — see below). --- ## Part B — Historian feeder (substantive; high-risk) ### The intended design (already documented) `ScriptedAlarmHostActor`'s docstring states it explicitly: the host publishes each transition as an `AlarmTransitionEvent` to the `alerts` topic **only**, and "deliberately does NOT also Tell the historian adapter directly — doing so would double-historize... a direct tell would duplicate every row." So the historian is meant to **subscribe to `alerts`** and translate. That subscription was never implemented — `HistorianAdapterActor` has the `Receive` + the Primary gate, but no `alerts` subscription and no feeder. This plan implements it. ### Data flow ``` ScriptedAlarmHostActor.OnEngineEmission (Primary only — already gated) └─ Publish(AlarmTransitionEvent) → "alerts" DPS topic ├─ AlertSignalRBridge (existing — live UI fan-out) └─ HistorianAdapterActor (NEW: Subscribe in PreStart) └─ if _localRole is Primary (existing T2 gate — STILL REQUIRED) └─ translate AlarmTransitionEvent → AlarmHistorianEvent └─ IAlarmHistorianSink.EnqueueAsync (fire-and-forget) ``` **Why the T2 Primary gate stays load-bearing:** the Primary publishes the transition **once**, but DistributedPubSub fans that single message to **every** node's subscribers — including **both** central nodes' `HistorianAdapterActor`. Without the gate, both nodes' historians would enqueue → double DB writes. The T2 gate keeps only the Primary writing → exactly-once. (The historization is therefore co-located with the alerts emission: Primary-only on both ends.) ### Components 1. **Extend `AlarmTransitionEvent`** (`Commons/Messages/Alerts/AlarmTransitionEvent.cs`) with `string AlarmTypeName` and `string? Comment`. Additive; the cluster's default Akka serializer (no Hyperion/protobuf binding) is forward/backward compatible across a rolling restart (old nodes ignore the new fields; new nodes read old messages with the fields null/default). Existing consumers (`AlertSignalRBridge`, AdminUI `/alerts`) ignore them. 2. **`ScriptedAlarmEvent`** (`Core.ScriptedAlarms/ScriptedAlarmEngine.cs`) gains a `string? Comment` field; the engine populates it on comment-bearing transitions (Acknowledge / Confirm / AddComment / Shelve ops already receive the operator comment — thread it into the emitted event). `Kind` already carries the Part-9 type. 3. **`ScriptedAlarmHostActor.OnEngineEmission`** populates the two new `AlarmTransitionEvent` fields: `AlarmTypeName = e.Kind.ToString()`, `Comment = e.Comment`. No other change to the emit path (the Primary gate + OPC UA write stay as-is). 4. **`HistorianAdapterActor`**: `Subscribe(AlertsTopic, Self)` in PreStart (+ `SubscribeAck` no-op); add `Receive` that **translates** then runs the **existing Primary-gated** enqueue. Translation: - `Severity` int→`AlarmSeverity` by inverting `ScriptedAlarmHostActor.SeverityToInt`'s buckets (1–250 Low, 251–500 Medium, 501–750 High, 751–1000 Critical). - `EventKind = TransitionKind`; `AlarmTypeName`, `Comment`, the rest map 1:1. - Keep the existing `Receive` path too (harmless; lets a future direct source still feed it). 5. **Config-gated durable sink** — a new `AlarmHistorian` appsettings section + an `AddAlarmHistorian` registration (Runtime `ServiceCollectionExtensions`, called from the Host). When the section is present: register `SqliteStoreAndForwardSink(dbPath, new WonderwareHistorianClient(WonderwareHistorianClientOptions{PipeName, SharedSecret, …}), logger)` as the `IAlarmHistorianSink` singleton and start its drain loop (`StartDrainLoop`); dispose it on shutdown. When absent: keep `NullAlarmHistorianSink` (current default — so dev/docker-dev is unaffected). `SqliteStoreAndForwardSink` takes `(string databasePath, IAlarmHistorianWriter writer, ILogger, …)`; `WonderwareHistorianClient` implements `IAlarmHistorianWriter` and talks the named-pipe IPC. ### Scope Scripted alarms only — the only source on the `alerts` topic. Galaxy native alarms historize via AVEVA System Platform's own `HistorizeToAveva` (not this sink); AB CIP ALMD alarms aren't on `alerts` (a future addition, out of scope here). ### Tests - TestKit (`HistorianAdapterActorTests`): a **Secondary** host receiving an `AlarmTransitionEvent` does NOT enqueue; a **Primary** translates + enqueues (fake `IAlarmHistorianSink` recording writes); default (no role) enqueues. - Translation unit tests: severity int→enum buckets (boundary values), `AlarmTypeName`/`Comment` carried, `EventKind` mapping. - Engine: `ScriptedAlarmEvent.Comment` is populated on an ack/comment transition. - Config-gated registration: section present → `SqliteStoreAndForwardSink` bound; absent → `NullAlarmHistorianSink`. (xUnit + Shouldly; in-memory config.) - No bUnit (no UI change). docker-dev live-verify is optional here (the sink stays Null on docker-dev unless configured); the exactly-once gating is proven by TestKit. --- ## Part A — Galaxy alarm-client reconnect (verify + document; low-risk) ### Finding (verified) The gap is **already handled**: - `GatewayGalaxyAlarmFeed.RunAsync` (`Driver.Galaxy/Runtime/GatewayGalaxyAlarmFeed.cs:101-147`) has its own reconnect loop: on any non-cancellation stream fault it logs, waits `_reconnectDelay` (~5s), and **re-invokes `StreamAlarmsAsync` on the same client**. There is a passing unit test (`Reopens_stream_after_a_transport_fault`). - gRPC.NET channels auto-reconnect the underlying HTTP/2 connection, so re-invoking the stream (or the next unary `AcknowledgeAlarmAsync`) after a gateway restart re-establishes transparently — the **client** does not need recreating. - **Keepalive hardening is not reachable from this repo**: `MxGatewayClientOptions` is a NuGet package (sibling MxAccessGateway repo) and exposes no keepalive / channel-resilience knobs. Adding them would be a sibling-repo change — out of scope. ### Decision No production reconnect code. Instead: - Add a focused test that the **acknowledger** (`GatewayGalaxyAlarmAcknowledger`) recovers — its next unary call succeeds after a transient fault (the stream path is already covered by the existing feed test). If a real recovery gap surfaces, escalate to a follow-up. - Document the alarm-feed reconnect behaviour in `docs/drivers/Galaxy.md`: the feed's own re-invoke loop + gRPC channel auto-reconnect handle a gateway restart; `_ownedMxClient` is intentionally **not** recreated (keepalive hardening would require a gateway-package change). --- ## Sequencing & risk | Item | Risk | Notes | |---|---|---| | B1 Extend `AlarmTransitionEvent` (+ engine `Comment`) | standard | Commons record + Core.ScriptedAlarms engine + ScriptedAlarmHostActor populate | | B2 `HistorianAdapterActor` alerts-subscribe + translate | high-risk | redundancy gate + cluster topic + exactly-once | | B3 Config-gated durable sink + Host wiring | standard | DI + appsettings + Sqlite/Wonderware construction + drain lifecycle | | A Verify + document | small | acknowledger-recovery test + Galaxy.md note | B2 depends on B1 (needs the extended event). B3 is independent of B1/B2 (DI/config). A is fully independent. TDD where there's logic; the exactly-once dedup is proven by TestKit, not docker-dev (the real sink is config-gated and stays Null on docker-dev). ## Hard rules Stage by explicit path (never `git add .`); never stage `sql_login.txt` / `src/Server/.../Host/pki/`; never echo the gateway API key into a **new** tracked file; no force-push, no `--no-verify`; **no Configuration entity / EF migration change** (none of these items needs one — the historian queue is a standalone SQLite file, not the Config DB). Build on a feature branch off `master`.