docs(design): alarm follow-ups round 2 (historian alerts-feeder + Galaxy reconnect verify)
This commit is contained in:
@@ -0,0 +1,159 @@
|
||||
# Alarm Follow-ups (Round 2) — Design
|
||||
|
||||
**Date:** 2026-06-11
|
||||
**Status:** Approved (brainstorming) — ready for implementation plan.
|
||||
|
||||
## Goal
|
||||
|
||||
Resolve the two follow-ups left open by the alarm-followups work (merged to master
|
||||
`521ee753`):
|
||||
|
||||
- **B — Historian feeder.** `HistorianAdapterActor` is Primary-gated but has **no
|
||||
production feeder** (nothing produces `AlarmHistorianEvent`). Wire engine→historian so
|
||||
exactly-once alarm historization actually activates.
|
||||
- **A — Galaxy alarm-client reconnect.** The reviewer flagged that the session-less alarm
|
||||
client (`GalaxyDriver._ownedMxClient`) is never recreated on reconnect. Investigation shows
|
||||
this is **already handled** — so the work is verify + document, not new reconnect code.
|
||||
|
||||
Scope decided in brainstorming. Decisions: (B) historian **subscribes to the `alerts` DPS
|
||||
topic** and translates (the intended design per `ScriptedAlarmHostActor`'s own docstring),
|
||||
**extending `AlarmTransitionEvent`** with the two fields it lacks (`AlarmTypeName`, `Comment`)
|
||||
to avoid data loss — including a small **Core.ScriptedAlarms engine change** to carry `Comment`
|
||||
through the emission; (B-sink) also wire a **config-gated real durable sink**
|
||||
(`SqliteStoreAndForwardSink`→Wonderware) with a `Null` fallback; (A) **verify + test +
|
||||
document** only (gRPC keepalive hardening is not reachable from this repo — see below).
|
||||
|
||||
---
|
||||
|
||||
## Part B — Historian feeder (substantive; high-risk)
|
||||
|
||||
### The intended design (already documented)
|
||||
|
||||
`ScriptedAlarmHostActor`'s docstring states it explicitly: the host publishes each transition
|
||||
as an `AlarmTransitionEvent` to the `alerts` topic **only**, and "deliberately does NOT also
|
||||
Tell the historian adapter directly — doing so would double-historize... a direct tell would
|
||||
duplicate every row." So the historian is meant to **subscribe to `alerts`** and translate.
|
||||
That subscription was never implemented — `HistorianAdapterActor` has the `Receive<AlarmHistorianEvent>`
|
||||
+ the Primary gate, but no `alerts` subscription and no feeder. This plan implements it.
|
||||
|
||||
### Data flow
|
||||
|
||||
```
|
||||
ScriptedAlarmHostActor.OnEngineEmission (Primary only — already gated)
|
||||
└─ Publish(AlarmTransitionEvent) → "alerts" DPS topic
|
||||
├─ AlertSignalRBridge (existing — live UI fan-out)
|
||||
└─ HistorianAdapterActor (NEW: Subscribe in PreStart)
|
||||
└─ if _localRole is Primary (existing T2 gate — STILL REQUIRED)
|
||||
└─ translate AlarmTransitionEvent → AlarmHistorianEvent
|
||||
└─ IAlarmHistorianSink.EnqueueAsync (fire-and-forget)
|
||||
```
|
||||
|
||||
**Why the T2 Primary gate stays load-bearing:** the Primary publishes the transition **once**,
|
||||
but DistributedPubSub fans that single message to **every** node's subscribers — including
|
||||
**both** central nodes' `HistorianAdapterActor`. Without the gate, both nodes' historians would
|
||||
enqueue → double DB writes. The T2 gate keeps only the Primary writing → exactly-once. (The
|
||||
historization is therefore co-located with the alerts emission: Primary-only on both ends.)
|
||||
|
||||
### Components
|
||||
|
||||
1. **Extend `AlarmTransitionEvent`** (`Commons/Messages/Alerts/AlarmTransitionEvent.cs`) with
|
||||
`string AlarmTypeName` and `string? Comment`. Additive; the cluster's default Akka serializer
|
||||
(no Hyperion/protobuf binding) is forward/backward compatible across a rolling restart
|
||||
(old nodes ignore the new fields; new nodes read old messages with the fields null/default).
|
||||
Existing consumers (`AlertSignalRBridge`, AdminUI `/alerts`) ignore them.
|
||||
2. **`ScriptedAlarmEvent`** (`Core.ScriptedAlarms/ScriptedAlarmEngine.cs`) gains a
|
||||
`string? Comment` field; the engine populates it on comment-bearing transitions
|
||||
(Acknowledge / Confirm / AddComment / Shelve ops already receive the operator comment —
|
||||
thread it into the emitted event). `Kind` already carries the Part-9 type.
|
||||
3. **`ScriptedAlarmHostActor.OnEngineEmission`** populates the two new `AlarmTransitionEvent`
|
||||
fields: `AlarmTypeName = e.Kind.ToString()`, `Comment = e.Comment`. No other change to the
|
||||
emit path (the Primary gate + OPC UA write stay as-is).
|
||||
4. **`HistorianAdapterActor`**: `Subscribe(AlertsTopic, Self)` in PreStart (+ `SubscribeAck`
|
||||
no-op); add `Receive<AlarmTransitionEvent>` that **translates** then runs the **existing
|
||||
Primary-gated** enqueue. Translation:
|
||||
- `Severity` int→`AlarmSeverity` by inverting `ScriptedAlarmHostActor.SeverityToInt`'s
|
||||
buckets (1–250 Low, 251–500 Medium, 501–750 High, 751–1000 Critical).
|
||||
- `EventKind = TransitionKind`; `AlarmTypeName`, `Comment`, the rest map 1:1.
|
||||
- Keep the existing `Receive<AlarmHistorianEvent>` path too (harmless; lets a future direct
|
||||
source still feed it).
|
||||
5. **Config-gated durable sink** — a new `AlarmHistorian` appsettings section + an
|
||||
`AddAlarmHistorian` registration (Runtime `ServiceCollectionExtensions`, called from the
|
||||
Host). When the section is present: register `SqliteStoreAndForwardSink(dbPath,
|
||||
new WonderwareHistorianClient(WonderwareHistorianClientOptions{PipeName, SharedSecret, …}),
|
||||
logger)` as the `IAlarmHistorianSink` singleton and start its drain loop (`StartDrainLoop`);
|
||||
dispose it on shutdown. When absent: keep `NullAlarmHistorianSink` (current default — so
|
||||
dev/docker-dev is unaffected). `SqliteStoreAndForwardSink` takes `(string databasePath,
|
||||
IAlarmHistorianWriter writer, ILogger, …)`; `WonderwareHistorianClient` implements
|
||||
`IAlarmHistorianWriter` and talks the named-pipe IPC.
|
||||
|
||||
### Scope
|
||||
|
||||
Scripted alarms only — the only source on the `alerts` topic. Galaxy native alarms historize
|
||||
via AVEVA System Platform's own `HistorizeToAveva` (not this sink); AB CIP ALMD alarms aren't
|
||||
on `alerts` (a future addition, out of scope here).
|
||||
|
||||
### Tests
|
||||
|
||||
- TestKit (`HistorianAdapterActorTests`): a **Secondary** host receiving an `AlarmTransitionEvent`
|
||||
does NOT enqueue; a **Primary** translates + enqueues (fake `IAlarmHistorianSink` recording
|
||||
writes); default (no role) enqueues.
|
||||
- Translation unit tests: severity int→enum buckets (boundary values), `AlarmTypeName`/`Comment`
|
||||
carried, `EventKind` mapping.
|
||||
- Engine: `ScriptedAlarmEvent.Comment` is populated on an ack/comment transition.
|
||||
- Config-gated registration: section present → `SqliteStoreAndForwardSink` bound; absent →
|
||||
`NullAlarmHistorianSink`. (xUnit + Shouldly; in-memory config.)
|
||||
- No bUnit (no UI change). docker-dev live-verify is optional here (the sink stays Null on
|
||||
docker-dev unless configured); the exactly-once gating is proven by TestKit.
|
||||
|
||||
---
|
||||
|
||||
## Part A — Galaxy alarm-client reconnect (verify + document; low-risk)
|
||||
|
||||
### Finding (verified)
|
||||
|
||||
The gap is **already handled**:
|
||||
|
||||
- `GatewayGalaxyAlarmFeed.RunAsync` (`Driver.Galaxy/Runtime/GatewayGalaxyAlarmFeed.cs:101-147`)
|
||||
has its own reconnect loop: on any non-cancellation stream fault it logs, waits
|
||||
`_reconnectDelay` (~5s), and **re-invokes `StreamAlarmsAsync` on the same client**. There is a
|
||||
passing unit test (`Reopens_stream_after_a_transport_fault`).
|
||||
- gRPC.NET channels auto-reconnect the underlying HTTP/2 connection, so re-invoking the stream
|
||||
(or the next unary `AcknowledgeAlarmAsync`) after a gateway restart re-establishes transparently
|
||||
— the **client** does not need recreating.
|
||||
- **Keepalive hardening is not reachable from this repo**: `MxGatewayClientOptions` is a NuGet
|
||||
package (sibling MxAccessGateway repo) and exposes no keepalive / channel-resilience knobs.
|
||||
Adding them would be a sibling-repo change — out of scope.
|
||||
|
||||
### Decision
|
||||
|
||||
No production reconnect code. Instead:
|
||||
|
||||
- Add a focused test that the **acknowledger** (`GatewayGalaxyAlarmAcknowledger`) recovers — its
|
||||
next unary call succeeds after a transient fault (the stream path is already covered by the
|
||||
existing feed test). If a real recovery gap surfaces, escalate to a follow-up.
|
||||
- Document the alarm-feed reconnect behaviour in `docs/drivers/Galaxy.md`: the feed's own
|
||||
re-invoke loop + gRPC channel auto-reconnect handle a gateway restart; `_ownedMxClient` is
|
||||
intentionally **not** recreated (keepalive hardening would require a gateway-package change).
|
||||
|
||||
---
|
||||
|
||||
## Sequencing & risk
|
||||
|
||||
| Item | Risk | Notes |
|
||||
|---|---|---|
|
||||
| B1 Extend `AlarmTransitionEvent` (+ engine `Comment`) | standard | Commons record + Core.ScriptedAlarms engine + ScriptedAlarmHostActor populate |
|
||||
| B2 `HistorianAdapterActor` alerts-subscribe + translate | high-risk | redundancy gate + cluster topic + exactly-once |
|
||||
| B3 Config-gated durable sink + Host wiring | standard | DI + appsettings + Sqlite/Wonderware construction + drain lifecycle |
|
||||
| A Verify + document | small | acknowledger-recovery test + Galaxy.md note |
|
||||
|
||||
B2 depends on B1 (needs the extended event). B3 is independent of B1/B2 (DI/config). A is fully
|
||||
independent. TDD where there's logic; the exactly-once dedup is proven by TestKit, not docker-dev
|
||||
(the real sink is config-gated and stays Null on docker-dev).
|
||||
|
||||
## Hard rules
|
||||
|
||||
Stage by explicit path (never `git add .`); never stage `sql_login.txt` /
|
||||
`src/Server/.../Host/pki/`; never echo the gateway API key into a **new** tracked file; no
|
||||
force-push, no `--no-verify`; **no Configuration entity / EF migration change** (none of these
|
||||
items needs one — the historian queue is a standalone SQLite file, not the Config DB). Build on a
|
||||
feature branch off `master`.
|
||||
Reference in New Issue
Block a user