Files
lmxopcua/docs/plans/2026-06-11-alarm-followups-round2-design.md
T

9.1 KiB
Raw Blame History

Alarm Follow-ups (Round 2) — Design

Date: 2026-06-11 Status: Approved (brainstorming) — ready for implementation plan.

Goal

Resolve the two follow-ups left open by the alarm-followups work (merged to master 521ee753):

  • B — Historian feeder. HistorianAdapterActor is Primary-gated but has no production feeder (nothing produces AlarmHistorianEvent). Wire engine→historian so exactly-once alarm historization actually activates.
  • A — Galaxy alarm-client reconnect. The reviewer flagged that the session-less alarm client (GalaxyDriver._ownedMxClient) is never recreated on reconnect. Investigation shows this is already handled — so the work is verify + document, not new reconnect code.

Scope decided in brainstorming. Decisions: (B) historian subscribes to the alerts DPS topic and translates (the intended design per ScriptedAlarmHostActor's own docstring), extending AlarmTransitionEvent with the two fields it lacks (AlarmTypeName, Comment) to avoid data loss — including a small Core.ScriptedAlarms engine change to carry Comment through the emission; (B-sink) also wire a config-gated real durable sink (SqliteStoreAndForwardSink→Wonderware) with a Null fallback; (A) verify + test + document only (gRPC keepalive hardening is not reachable from this repo — see below).


Part B — Historian feeder (substantive; high-risk)

The intended design (already documented)

ScriptedAlarmHostActor's docstring states it explicitly: the host publishes each transition as an AlarmTransitionEvent to the alerts topic only, and "deliberately does NOT also Tell the historian adapter directly — doing so would double-historize... a direct tell would duplicate every row." So the historian is meant to subscribe to alerts and translate. That subscription was never implemented — HistorianAdapterActor has the Receive<AlarmHistorianEvent>

  • the Primary gate, but no alerts subscription and no feeder. This plan implements it.

Data flow

ScriptedAlarmHostActor.OnEngineEmission  (Primary only — already gated)
  └─ Publish(AlarmTransitionEvent) → "alerts" DPS topic
       ├─ AlertSignalRBridge          (existing — live UI fan-out)
       └─ HistorianAdapterActor       (NEW: Subscribe in PreStart)
            └─ if _localRole is Primary   (existing T2 gate — STILL REQUIRED)
                 └─ translate AlarmTransitionEvent → AlarmHistorianEvent
                      └─ IAlarmHistorianSink.EnqueueAsync  (fire-and-forget)

Why the T2 Primary gate stays load-bearing: the Primary publishes the transition once, but DistributedPubSub fans that single message to every node's subscribers — including both central nodes' HistorianAdapterActor. Without the gate, both nodes' historians would enqueue → double DB writes. The T2 gate keeps only the Primary writing → exactly-once. (The historization is therefore co-located with the alerts emission: Primary-only on both ends.)

Components

  1. Extend AlarmTransitionEvent (Commons/Messages/Alerts/AlarmTransitionEvent.cs) with string AlarmTypeName and string? Comment. Additive; the cluster's default Akka serializer (no Hyperion/protobuf binding) is forward/backward compatible across a rolling restart (old nodes ignore the new fields; new nodes read old messages with the fields null/default). Existing consumers (AlertSignalRBridge, AdminUI /alerts) ignore them.
  2. ScriptedAlarmEvent (Core.ScriptedAlarms/ScriptedAlarmEngine.cs) gains a string? Comment field; the engine populates it on comment-bearing transitions (Acknowledge / Confirm / AddComment / Shelve ops already receive the operator comment — thread it into the emitted event). Kind already carries the Part-9 type.
  3. ScriptedAlarmHostActor.OnEngineEmission populates the two new AlarmTransitionEvent fields: AlarmTypeName = e.Kind.ToString(), Comment = e.Comment. No other change to the emit path (the Primary gate + OPC UA write stay as-is).
  4. HistorianAdapterActor: Subscribe(AlertsTopic, Self) in PreStart (+ SubscribeAck no-op); add Receive<AlarmTransitionEvent> that translates then runs the existing Primary-gated enqueue. Translation:
    • Severity int→AlarmSeverity by inverting ScriptedAlarmHostActor.SeverityToInt's buckets (1250 Low, 251500 Medium, 501750 High, 7511000 Critical).
    • EventKind = TransitionKind; AlarmTypeName, Comment, the rest map 1:1.
    • Keep the existing Receive<AlarmHistorianEvent> path too (harmless; lets a future direct source still feed it).
  5. Config-gated durable sink — a new AlarmHistorian appsettings section + an AddAlarmHistorian registration (Runtime ServiceCollectionExtensions, called from the Host). When the section is present: register SqliteStoreAndForwardSink(dbPath, new WonderwareHistorianClient(WonderwareHistorianClientOptions{PipeName, SharedSecret, …}), logger) as the IAlarmHistorianSink singleton and start its drain loop (StartDrainLoop); dispose it on shutdown. When absent: keep NullAlarmHistorianSink (current default — so dev/docker-dev is unaffected). SqliteStoreAndForwardSink takes (string databasePath, IAlarmHistorianWriter writer, ILogger, …); WonderwareHistorianClient implements IAlarmHistorianWriter and talks the named-pipe IPC.

Scope

Scripted alarms only — the only source on the alerts topic. Galaxy native alarms historize via AVEVA System Platform's own HistorizeToAveva (not this sink); AB CIP ALMD alarms aren't on alerts (a future addition, out of scope here).

Tests

  • TestKit (HistorianAdapterActorTests): a Secondary host receiving an AlarmTransitionEvent does NOT enqueue; a Primary translates + enqueues (fake IAlarmHistorianSink recording writes); default (no role) enqueues.
  • Translation unit tests: severity int→enum buckets (boundary values), AlarmTypeName/Comment carried, EventKind mapping.
  • Engine: ScriptedAlarmEvent.Comment is populated on an ack/comment transition.
  • Config-gated registration: section present → SqliteStoreAndForwardSink bound; absent → NullAlarmHistorianSink. (xUnit + Shouldly; in-memory config.)
  • No bUnit (no UI change). docker-dev live-verify is optional here (the sink stays Null on docker-dev unless configured); the exactly-once gating is proven by TestKit.

Part A — Galaxy alarm-client reconnect (verify + document; low-risk)

Finding (verified)

The gap is already handled:

  • GatewayGalaxyAlarmFeed.RunAsync (Driver.Galaxy/Runtime/GatewayGalaxyAlarmFeed.cs:101-147) has its own reconnect loop: on any non-cancellation stream fault it logs, waits _reconnectDelay (~5s), and re-invokes StreamAlarmsAsync on the same client. There is a passing unit test (Reopens_stream_after_a_transport_fault).
  • gRPC.NET channels auto-reconnect the underlying HTTP/2 connection, so re-invoking the stream (or the next unary AcknowledgeAlarmAsync) after a gateway restart re-establishes transparently — the client does not need recreating.
  • Keepalive hardening is not reachable from this repo: MxGatewayClientOptions is a NuGet package (sibling MxAccessGateway repo) and exposes no keepalive / channel-resilience knobs. Adding them would be a sibling-repo change — out of scope.

Decision

No production reconnect code. Instead:

  • Add a focused test that the acknowledger (GatewayGalaxyAlarmAcknowledger) recovers — its next unary call succeeds after a transient fault (the stream path is already covered by the existing feed test). If a real recovery gap surfaces, escalate to a follow-up.
  • Document the alarm-feed reconnect behaviour in docs/drivers/Galaxy.md: the feed's own re-invoke loop + gRPC channel auto-reconnect handle a gateway restart; _ownedMxClient is intentionally not recreated (keepalive hardening would require a gateway-package change).

Sequencing & risk

Item Risk Notes
B1 Extend AlarmTransitionEvent (+ engine Comment) standard Commons record + Core.ScriptedAlarms engine + ScriptedAlarmHostActor populate
B2 HistorianAdapterActor alerts-subscribe + translate high-risk redundancy gate + cluster topic + exactly-once
B3 Config-gated durable sink + Host wiring standard DI + appsettings + Sqlite/Wonderware construction + drain lifecycle
A Verify + document small acknowledger-recovery test + Galaxy.md note

B2 depends on B1 (needs the extended event). B3 is independent of B1/B2 (DI/config). A is fully independent. TDD where there's logic; the exactly-once dedup is proven by TestKit, not docker-dev (the real sink is config-gated and stays Null on docker-dev).

Hard rules

Stage by explicit path (never git add .); never stage sql_login.txt / src/Server/.../Host/pki/; never echo the gateway API key into a new tracked file; no force-push, no --no-verify; no Configuration entity / EF migration change (none of these items needs one — the historian queue is a standalone SQLite file, not the Config DB). Build on a feature branch off master.