Alarm Follow-ups (Round 2) — Design
Date: 2026-06-11 Status: Approved (brainstorming) — ready for implementation plan.
Goal
Resolve the two follow-ups left open by the alarm-followups work (merged to master
521ee753):
- B: Historian feeder. HistorianAdapterActor is Primary-gated but has no production feeder (nothing produces AlarmHistorianEvent). Wire engine→historian so exactly-once alarm historization actually activates.
- A: Galaxy alarm-client reconnect. The reviewer flagged that the session-less alarm client (GalaxyDriver._ownedMxClient) is never recreated on reconnect. Investigation shows this is already handled, so the work is verify + document, not new reconnect code.
Scope decided in brainstorming. Decisions: (B) historian subscribes to the alerts DPS
topic and translates (the intended design per ScriptedAlarmHostActor's own docstring),
extending AlarmTransitionEvent with the two fields it lacks (AlarmTypeName, Comment)
to avoid data loss — including a small Core.ScriptedAlarms engine change to carry Comment
through the emission; (B-sink) also wire a config-gated real durable sink
(SqliteStoreAndForwardSink→Wonderware) with a Null fallback; (A) verify + test +
document only (gRPC keepalive hardening is not reachable from this repo — see below).
Part B — Historian feeder (substantive; high-risk)
The intended design (already documented)
ScriptedAlarmHostActor's docstring states it explicitly: the host publishes each transition
as an AlarmTransitionEvent to the alerts topic only, and "deliberately does NOT also
Tell the historian adapter directly — doing so would double-historize... a direct tell would
duplicate every row." So the historian is meant to subscribe to alerts and translate.
That subscription was never implemented: HistorianAdapterActor has the Receive<AlarmHistorianEvent>
handler and the Primary gate, but no alerts subscription and no feeder. This plan implements it.
Data flow
    ScriptedAlarmHostActor.OnEngineEmission   (Primary only, already gated)
      └─ Publish(AlarmTransitionEvent) → "alerts" DPS topic
           ├─ AlertSignalRBridge       (existing: live UI fan-out)
           └─ HistorianAdapterActor    (NEW: Subscribe in PreStart)
                └─ if _localRole is Primary   (existing T2 gate, STILL REQUIRED)
                     └─ translate AlarmTransitionEvent → AlarmHistorianEvent
                          └─ IAlarmHistorianSink.EnqueueAsync   (fire-and-forget)
Why the T2 Primary gate stays load-bearing: the Primary publishes the transition once,
but DistributedPubSub fans that single message to every node's subscribers — including
both central nodes' HistorianAdapterActor. Without the gate, both nodes' historians would
enqueue → double DB writes. The T2 gate keeps only the Primary writing → exactly-once. (The
historization is therefore co-located with the alerts emission: Primary-only on both ends.)
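
The fan-out argument above can be illustrated with a small, language-neutral simulation (Python here for brevity; the production code is C#/Akka.NET). Every class below is a hypothetical stand-in, not the real actor or sink. The gate condition is an assumption derived from the planned tests: a node with no role still enqueues, and only a known non-Primary role drops the event.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Role(Enum):
    PRIMARY = "Primary"
    SECONDARY = "Secondary"


@dataclass
class RecordingSink:
    """Stand-in for IAlarmHistorianSink; records what would be written."""
    writes: list = field(default_factory=list)

    def enqueue(self, event):
        self.writes.append(event)


@dataclass
class HistorianAdapter:
    """Stand-in for one node's HistorianAdapterActor."""
    role: Optional[Role]
    sink: RecordingSink
    gated: bool = True

    def on_alert(self, event):
        # T2 gate (assumed shape): drop only when the role is known and not Primary.
        if self.gated and self.role is not None and self.role is not Role.PRIMARY:
            return
        self.sink.enqueue(event)


class AlertsTopic:
    """Toy pub-sub topic: one publish reaches every subscriber on every node."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, event):
        for handler in self.subscribers:
            handler(event)


def historized_rows(gated: bool) -> int:
    """Publish one transition to two nodes and count the resulting writes."""
    historian_db = RecordingSink()  # both nodes ultimately write to one historian
    topic = AlertsTopic()
    for role in (Role.PRIMARY, Role.SECONDARY):
        topic.subscribe(HistorianAdapter(role, historian_db, gated).on_alert)
    # The Primary publishes the transition exactly once.
    topic.publish({"alarm": "Tank1.HiHi", "transition": "Activated"})
    return len(historian_db.writes)


print(historized_rows(gated=True))   # 1
print(historized_rows(gated=False))  # 2
```

With the gate in place the single published transition produces one row; without it, both subscribers enqueue and the row is duplicated.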
Components
- Extend AlarmTransitionEvent (Commons/Messages/Alerts/AlarmTransitionEvent.cs) with string AlarmTypeName and string? Comment. Additive; the cluster's default Akka serializer (no Hyperion/protobuf binding) is forward/backward compatible across a rolling restart (old nodes ignore the new fields; new nodes read old messages with the fields null/default). Existing consumers (AlertSignalRBridge, AdminUI/alerts) ignore them.
- ScriptedAlarmEvent (Core.ScriptedAlarms/ScriptedAlarmEngine.cs) gains a string? Comment field; the engine populates it on comment-bearing transitions (Acknowledge / Confirm / AddComment / Shelve ops already receive the operator comment; thread it into the emitted event). Kind already carries the Part-9 type.
- ScriptedAlarmHostActor.OnEngineEmission populates the two new AlarmTransitionEvent fields: AlarmTypeName = e.Kind.ToString(), Comment = e.Comment. No other change to the emit path (the Primary gate + OPC UA write stay as-is).
- HistorianAdapterActor:
  - Subscribe(AlertsTopic, Self) in PreStart (+ SubscribeAck no-op); add Receive<AlarmTransitionEvent> that translates then runs the existing Primary-gated enqueue.
  - Translation: Severity int → AlarmSeverity by inverting ScriptedAlarmHostActor.SeverityToInt's buckets (1–250 Low, 251–500 Medium, 501–750 High, 751–1000 Critical). EventKind = TransitionKind; AlarmTypeName, Comment, and the rest map 1:1.
  - Keep the existing Receive<AlarmHistorianEvent> path too (harmless; lets a future direct source still feed it).
- Config-gated durable sink: a new AlarmHistorian appsettings section + an AddAlarmHistorian registration (RuntimeServiceCollectionExtensions, called from the Host).
  - When the section is present: register SqliteStoreAndForwardSink(dbPath, new WonderwareHistorianClient(WonderwareHistorianClientOptions{PipeName, SharedSecret, …}), logger) as the IAlarmHistorianSink singleton and start its drain loop (StartDrainLoop); dispose it on shutdown.
  - When absent: keep NullAlarmHistorianSink (current default, so dev/docker-dev is unaffected).
  - SqliteStoreAndForwardSink takes (string databasePath, IAlarmHistorianWriter writer, ILogger, …); WonderwareHistorianClient implements IAlarmHistorianWriter and talks the named-pipe IPC.
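
As a rough illustration of the translation step listed above, the sketch below inverts the four severity buckets and renames TransitionKind to EventKind. It is written in Python as a language-neutral stand-in for the C# code. The dict-based event shape, the function names, and the decision to reject out-of-range severities (rather than clamp them) are assumptions for illustration only.

```python
from enum import Enum


class AlarmSeverity(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"
    CRITICAL = "Critical"


# Inclusive upper bound of each bucket, ascending: 1-250, 251-500, 501-750, 751-1000.
_BUCKETS = (
    (250, AlarmSeverity.LOW),
    (500, AlarmSeverity.MEDIUM),
    (750, AlarmSeverity.HIGH),
    (1000, AlarmSeverity.CRITICAL),
)


def severity_from_int(value: int) -> AlarmSeverity:
    """Map a 1-1000 severity integer back to its bucket."""
    # Assumption: out-of-range input is rejected; the real code may clamp instead.
    if not 1 <= value <= 1000:
        raise ValueError(f"severity out of range: {value}")
    for upper, severity in _BUCKETS:
        if value <= upper:
            return severity
    raise AssertionError("unreachable: range was checked above")


def translate(transition: dict) -> dict:
    """Build a historian event from a transition event (hypothetical dict shape)."""
    historian = dict(transition)  # every other field maps 1:1
    historian["Severity"] = severity_from_int(transition["Severity"])
    historian["EventKind"] = historian.pop("TransitionKind")
    return historian


event = translate({
    "Severity": 751,
    "TransitionKind": "Acknowledged",
    "AlarmTypeName": "LimitAlarm",
    "Comment": "checked on site",
})
print(event["Severity"].value, event["EventKind"], event["Comment"])
```

The boundary values (250/251, 500/501, 750/751) are the cases the translation unit tests would need to pin down.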
Scope
Scripted alarms only — the only source on the alerts topic. Galaxy native alarms historize
via AVEVA System Platform's own HistorizeToAveva (not this sink); AB CIP ALMD alarms aren't
on alerts (a future addition, out of scope here).
Tests
- TestKit (HistorianAdapterActorTests): a Secondary host receiving an AlarmTransitionEvent does NOT enqueue; a Primary translates + enqueues (fake IAlarmHistorianSink recording writes); default (no role) enqueues.
- Translation unit tests: severity int→enum buckets (boundary values), AlarmTypeName/Comment carried, EventKind mapping.
- Engine: ScriptedAlarmEvent.Comment is populated on an ack/comment transition.
- Config-gated registration: section present → SqliteStoreAndForwardSink bound; absent → NullAlarmHistorianSink. (xUnit + Shouldly; in-memory config.)
- No bUnit (no UI change). docker-dev live-verify is optional here (the sink stays Null on docker-dev unless configured); the exactly-once gating is proven by TestKit.
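
The config-gated registration case in the list above boils down to a presence check on one configuration section. A minimal sketch of that selection logic, again in Python as a stand-in for the C# DI registration: the classes are hypothetical substitutes for the real sink, writer, and Null sink, and the DatabasePath key name is an assumption (PipeName and SharedSecret come from the options named in this design).

```python
from dataclasses import dataclass


class NullAlarmHistorianSink:
    """No-op sink: the default when no AlarmHistorian section is configured."""

    def enqueue(self, event):
        pass


@dataclass
class PipeWriter:
    """Stand-in for WonderwareHistorianClient (the named-pipe writer)."""
    pipe_name: str
    shared_secret: str


class StoreAndForwardSink:
    """Stand-in for SqliteStoreAndForwardSink."""

    def __init__(self, database_path: str, writer: PipeWriter):
        self.database_path = database_path
        self.writer = writer
        self.draining = False

    def start_drain_loop(self):
        # The real sink would start a background drain here; the flag only marks it.
        self.draining = True

    def dispose(self):
        self.draining = False


def build_sink(config: dict):
    """Pick the durable sink when the section is present, else the Null sink."""
    section = config.get("AlarmHistorian")
    if section is None:
        return NullAlarmHistorianSink()
    writer = PipeWriter(section["PipeName"], section["SharedSecret"])
    sink = StoreAndForwardSink(section["DatabasePath"], writer)  # key name assumed
    sink.start_drain_loop()
    return sink


configured = build_sink({"AlarmHistorian": {
    "DatabasePath": "alarm-queue.db",
    "PipeName": "historian-pipe",
    "SharedSecret": "not-a-real-secret",
}})
unconfigured = build_sink({})
print(type(configured).__name__, type(unconfigured).__name__)
```

The two branches correspond one-to-one to the two assertions the registration test needs: section present yields the durable sink with its drain loop started, section absent yields the Null sink.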
Part A — Galaxy alarm-client reconnect (verify + document; low-risk)
Finding (verified)
The gap is already handled:
- GatewayGalaxyAlarmFeed.RunAsync (Driver.Galaxy/Runtime/GatewayGalaxyAlarmFeed.cs:101-147) has its own reconnect loop: on any non-cancellation stream fault it logs, waits _reconnectDelay (~5s), and re-invokes StreamAlarmsAsync on the same client. There is a passing unit test (Reopens_stream_after_a_transport_fault).
- gRPC.NET channels auto-reconnect the underlying HTTP/2 connection, so re-invoking the stream (or the next unary AcknowledgeAlarmAsync) after a gateway restart re-establishes transparently; the client does not need recreating.
- Keepalive hardening is not reachable from this repo: MxGatewayClientOptions ships in a NuGet package (sibling MxAccessGateway repo) and exposes no keepalive / channel-resilience knobs. Adding them would be a sibling-repo change, which is out of scope.
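
The reconnect loop described in the first finding has roughly the following shape. This is a minimal asyncio sketch, not the actual GatewayGalaxyAlarmFeed code: FlakyClient, run_feed, and the alarm strings are hypothetical, and re-invoking after a stream that ends without a fault is an assumption (the finding only covers the fault case).

```python
import asyncio


class TransportFault(Exception):
    """Stand-in for a non-cancellation stream fault."""


class FlakyClient:
    """Hypothetical client: the first stream faults, the second stays open."""

    def __init__(self):
        self.calls = 0

    async def stream_alarms(self):
        self.calls += 1
        yield f"alarm-{self.calls}-a"
        if self.calls == 1:
            raise TransportFault("gateway restarted")
        yield f"alarm-{self.calls}-b"
        await asyncio.Event().wait()  # healthy stream: stay open until cancelled


async def run_feed(client, on_alarm, reconnect_delay):
    """Re-invoke the stream on the SAME client after any non-cancellation fault."""
    while True:
        try:
            async for alarm in client.stream_alarms():
                on_alarm(alarm)
        except asyncio.CancelledError:
            raise  # shutdown: propagate, never retry
        except Exception as fault:
            print(f"alarm stream faulted ({fault!r}); retrying")
        await asyncio.sleep(reconnect_delay)


async def demo():
    client = FlakyClient()
    received = []
    done = asyncio.Event()

    def on_alarm(alarm):
        received.append(alarm)
        if len(received) == 3:
            done.set()

    task = asyncio.create_task(run_feed(client, on_alarm, reconnect_delay=0.01))
    await done.wait()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return received, client.calls


received, calls = asyncio.run(demo())
print(received, calls)
```

The client object is created once and reused across both stream invocations, which mirrors why _ownedMxClient does not need recreating.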
Decision
No production reconnect code. Instead:
- Add a focused test that the acknowledger (GatewayGalaxyAlarmAcknowledger) recovers: its next unary call succeeds after a transient fault (the stream path is already covered by the existing feed test). If a real recovery gap surfaces, escalate to a follow-up.
- Document the alarm-feed reconnect behaviour in docs/drivers/Galaxy.md: the feed's own re-invoke loop + gRPC channel auto-reconnect handle a gateway restart; _ownedMxClient is intentionally not recreated (keepalive hardening would require a gateway-package change).
Sequencing & risk
| Item | Risk | Notes |
|---|---|---|
| B1 Extend AlarmTransitionEvent (+ engine Comment) | standard | Commons record + Core.ScriptedAlarms engine + ScriptedAlarmHostActor populate |
| B2 HistorianAdapterActor alerts-subscribe + translate | high-risk | redundancy gate + cluster topic + exactly-once |
| B3 Config-gated durable sink + Host wiring | standard | DI + appsettings + Sqlite/Wonderware construction + drain lifecycle |
| A Verify + document | small | acknowledger-recovery test + Galaxy.md note |
B2 depends on B1 (needs the extended event). B3 is independent of B1/B2 (DI/config). A is fully independent. TDD where there's logic; the exactly-once dedup is proven by TestKit, not docker-dev (the real sink is config-gated and stays Null on docker-dev).
Hard rules
Stage by explicit path (never git add .); never stage sql_login.txt /
src/Server/.../Host/pki/; never echo the gateway API key into a new tracked file; no
force-push, no --no-verify; no Configuration entity / EF migration change (none of these
items needs one — the historian queue is a standalone SQLite file, not the Config DB). Build on a
feature branch off master.