docs(design): alarm ack/shelve follow-ups (redundancy double-emit, Timed-shelve UI, T21 minors, rig cleanup, Galaxy reconnect, live-pill)

Joseph Doherty
2026-06-11 08:23:58 -04:00
parent bc9843d2bd
commit bcb9f45cb3
@@ -0,0 +1,167 @@
# Alarm Ack/Shelve Follow-ups — Design
**Date:** 2026-06-11
**Status:** Approved (brainstorming) — ready for implementation plan.
## Goal
Resolve the six notes/follow-ups left open by the T17-T24 inbound-alarm-ack work
(merged to master `bc9843d2`): the redundancy **double-emit**, the deferred Timed-shelve
UI, two T21 minors, the docker-dev rig cleanup, and two pre-existing Layer-1 gaps (Galaxy
reconnect, the live-pill). Scope decided in brainstorming: **everything**, including the
pre-existing L1 gaps. Double-emit approach decided: **primary-only at the source**.
---
## 1 — Redundancy double-emit fix (core; high-risk)
### Problem (verified)
Both central nodes run roles `admin,driver` in the same MAIN cluster. Each spawns its own
`ScriptedAlarmHostActor` + `ScriptedAlarmEngine` (`DriverHostActor.SpawnScriptedAlarmHost`,
per-node, no singleton) and, for a single-cluster artifact, loads the **same** alarms
(`DeploymentArtifact.ParseComposition` returns `ClusterFilterMode.None` → unfiltered). Both
independently evaluate and both hit `ScriptedAlarmHostActor.OnEngineEmission`:
- line 261 `_publishActor.Tell(new OpcUaPublishActor.AlarmStateUpdate(...))` — writes the OPC
UA node (directed, local).
- line 278 `_mediator.Tell(new Publish(AlertsTopic /* "alerts" */, evt))` — **cluster-wide,
ungated** → every transition appears **twice** on `/alerts`. Inbound ack/shelve commands
are likewise processed by both nodes' engines.
`HistorianAdapterActor` (`Runtime/Historian/`, registered **per-node** in
`Runtime/ServiceCollectionExtensions.cs:146`, name `historian-adapter`) also consumes the
`alerts` topic, so a single publish is historized **once per node** → double DB writes in
production (invisible in docker-dev: `NullAlarmHistorianSink`).
### Existing signal to mirror
`RedundancyStateActor` (admin singleton, `ControlPlane/Redundancy/RedundancyStateActor.cs:90-114`)
publishes `RedundancyStateChanged` to the `redundancy-state` DPS topic. Per-node
`NodeRedundancyState` carries `RedundancyRole` (`Primary`/`Secondary`/`Detached`) +
`IsRoleLeaderForDriver`; **Primary = the driver-role cluster leader**. `OpcUaPublishActor`
already subscribes to this topic (`OpcUaPublishActor.cs:30,147,156,335`) to drive its
ServiceLevel — the exact pattern to copy (`Subscribe(RedundancyStateTopic, Self)` in
PreStart; `Receive<RedundancyStateChanged>`; read `msg.Nodes.FirstOrDefault(n => n.NodeId ==
localNode)?.Role`).
### Decision — gate **only the cluster-wide emission**, on `Primary`
- **Emission gate (task A1):** `ScriptedAlarmHostActor` subscribes to `redundancy-state`,
caches the local role, and **skips the `alerts` publish (line 278) when not Primary**.
**Default = emit until a `RedundancyStateChanged` says this node is Secondary/Detached**
so single-node deploys (sole node is always the driver leader = Primary) and the boot
window never drop transitions. The OPC UA node write (line 261) and **inbound command
processing stay UNGATED** — the secondary must keep its address space warm and its engine
state consistent for failover; clients only ever see the Primary via ServiceLevel, so the
secondary's node writes + its (subscriber-less) condition events are harmless.
- **Historian gate (task A2):** `HistorianAdapterActor` subscribes to `redundancy-state`,
caches the local role, and **skips the sink write when not Primary**. Gives exactly-once
historization for **all** alarm sources (native Galaxy/AB-CIP too), not just scripted.
- **Edge:** a brief failover window (old Primary gone, new not yet elected) may drop a
transition/historization — acceptable, and identical to the existing ServiceLevel handoff
behaviour.
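The default-open gate described in A1/A2 can be sketched without Akka as a small role cache. This is a minimal illustration, not the project's code: `PrimaryGate`, the `string` node id, and the two-field shape of `NodeRedundancyState` are assumptions; only the names `RedundancyRole`, `NodeRedundancyState`, and `RedundancyStateChanged` come from the text above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the project's messages (shapes are assumed).
public enum RedundancyRole { Primary, Secondary, Detached }

public sealed record NodeRedundancyState(string NodeId, RedundancyRole Role);

public sealed record RedundancyStateChanged(IReadOnlyList<NodeRedundancyState> Nodes);

// Caches the local node's role and answers "may this node emit cluster-wide?".
public sealed class PrimaryGate
{
    private readonly string _localNodeId;
    private RedundancyRole? _localRole; // null = no redundancy state seen yet

    public PrimaryGate(string localNodeId) => _localNodeId = localNodeId;

    public void OnRedundancyStateChanged(RedundancyStateChanged msg)
    {
        var local = msg.Nodes.FirstOrDefault(n => n.NodeId == _localNodeId);
        // Assumption: a snapshot that omits this node leaves the cached role untouched.
        if (local is not null) _localRole = local.Role;
    }

    // Default-open: emit until a state change says this node is Secondary or Detached.
    public bool ShouldEmit => _localRole is null or RedundancyRole.Primary;
}
```

In the actors, the `alerts` publish (A1) or the sink write (A2) would sit behind `ShouldEmit`, while the OPC UA node write and inbound command handling would not consult the gate at all.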
### Tests
TestKit: with a Secondary `RedundancyStateChanged`, `OnEngineEmission` does NOT publish to
`alerts` but DOES still `Tell` the OPC UA `AlarmStateUpdate`; with Primary (or before any
state) it publishes. Inbound `AlarmCommand` is still processed regardless of role. Historian:
a Secondary skips the sink write, a Primary writes (fake sink + role injection).
---
## 2 — Timed-shelve picker UI (small)
`Alerts.razor` currently exposes Ack / Shelve(OneShot) / Unshelve. The `Timed` backend is
fully wired + tested (`ShelveAlarmCommand{ShelveKind.Timed, UnshelveAtUtc}` →
`AdminOperationsActor` → `engine.TimedShelveAsync`). Add a small duration input (a minutes/
seconds number box, or a `datetime-local`) to the shelve control on each row and call
`ShelveAlarmAsync(kind: Timed, unshelveAtUtc: now + duration)`. Razor-only; **no backend
change**; proven by docker-dev live-verify (no bUnit). Keep the existing OneShot/Unshelve
buttons.
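The `now + duration` conversion the picker needs could look like the sketch below. `TimedShelve.ComputeUnshelveAtUtc` is a hypothetical helper, and using `DateTime` for `UnshelveAtUtc` is an assumption (the real field may be a `DateTimeOffset`). It rejects non-positive or non-finite input so a blank or zero box cannot produce an already-expired shelve.

```csharp
using System;

public static class TimedShelve
{
    // Hypothetical helper: turns the picker's minutes into an absolute UTC instant.
    public static DateTime ComputeUnshelveAtUtc(DateTime nowUtc, double minutes)
    {
        if (nowUtc.Kind != DateTimeKind.Utc)
            throw new ArgumentException("nowUtc must be a UTC timestamp.", nameof(nowUtc));
        if (double.IsNaN(minutes) || double.IsInfinity(minutes) || minutes <= 0)
            throw new ArgumentOutOfRangeException(nameof(minutes), "Duration must be positive.");
        return nowUtc.AddMinutes(minutes);
    }
}
```

The row handler would then pass the result as `unshelveAtUtc` to `ShelveAlarmAsync`; passing `DateTime.UtcNow` at click time, not at render time, keeps the duration relative to the operator's action.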
## 3 — T21 minors (small)
- **Chip auto-clear:** the `_opResult*` chip persists until the next action. Add the ~8s
auto-clear timer pattern `DriverStatusPanel.razor` already uses (a `Task.Delay` →
`InvokeAsync(StateHasChanged)` that clears the per-row result).
- **CorrelationId consistency:** `AcknowledgeAlarmCommand`/`ShelveAlarmCommand` use a bare
`Guid CorrelationId`; the project's other control-plane commands use the `CorrelationId`
wrapper type. Switch both records + `AdminOperationsClient` (`CorrelationId.NewId()`) +
the actor reply contract + tests to the wrapper for uniform correlation tracing.
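One way to structure the chip auto-clear is sketched below. `AutoClearingResult` is a hypothetical helper, not what `DriverStatusPanel.razor` actually contains; the generation counter is an added assumption so that a timer started for an older result cannot wipe a newer one. The delay is injectable purely so the behaviour can be unit-tested without waiting 8 seconds.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

// Hypothetical helper: holds one row's result text and clears it after a fixed
// lifetime unless a newer result has replaced it in the meantime.
public sealed class AutoClearingResult
{
    private readonly TimeSpan _lifetime;
    private readonly Func<TimeSpan, Task> _delay; // injectable for tests
    private int _generation;

    public AutoClearingResult(TimeSpan lifetime, Func<TimeSpan, Task>? delay = null)
    {
        _lifetime = lifetime;
        _delay = delay ?? (t => Task.Delay(t));
    }

    public string? Message { get; private set; }

    // onChanged stands in for InvokeAsync(StateHasChanged) in a Razor component.
    public async Task ShowAsync(string message, Func<Task> onChanged)
    {
        var mine = ++_generation;
        Message = message;
        await onChanged();
        await _delay(_lifetime);
        if (mine != _generation) return; // superseded: leave the newer result alone
        Message = null;
        await onChanged();
    }
}
```

In a component this might be driven as `_ = chip.ShowAsync(text, () => InvokeAsync(StateHasChanged));`. The sketch assumes all calls arrive on the renderer's synchronization context, so it takes no locks.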
## 4 — Rig cleanup (operational, last)
Delete the docker-dev seed artifacts left for live-verify: the `t12-overheat` scripted alarm,
the `SC-ba675b168a85` predicate script, the `layer0-logcheck` vtag + script; revert
filler-02's inert `cycle-time-s` logger line to `return ctx.GetTag("TestMachine_002.TestDuration").Value;`.
Redeploy (`POST /api/deployments`, `X-Api-Key: docker-dev-deploy-key`). This is a DB/redeploy
step, **not a code change**, and is done last so the rig stays available for verifying tasks 1-3 and 6.
## 5 — Galaxy reconnect recreate (high-risk)
### Problem (verified)
`GalaxyMxSession.ConnectAsync` (`Driver.Galaxy/Runtime/GalaxyMxSession.cs:58-69`) is
idempotent: `if (_session is not null) return;`. When the gateway restarts and the session
goes Faulted/NotFound, `_session` is still a non-null (dead) handle, so `ConnectAsync` is a
silent no-op. `GalaxyDriver.ReopenAsync` (`GalaxyDriver.cs:289`) calls it expecting a
reconnect → no-op; `ReconnectSupervisor.RecoveryLoopAsync`
(`Driver.Galaxy/Runtime/ReconnectSupervisor.cs:158-186`) sees reopen "succeed", proceeds to
replay (which fails on the dead session), and **loops forever** (backoff capped at 30s).
### Decision — recreate on reopen
Add a recreate path to `GalaxyMxSession` (e.g. `RecreateAsync`/`DisposeSessionForRecreationAsync`)
that disposes + nulls `_session` and `_ownedClient`, and have `ReopenAsync` call it **before**
`ConnectAsync` so a reopen always routes through the happy-path create (`OpenSessionAsync` +
`RegisterAsync`). Confirm what status/exception marks Faulted/NotFound and that the dispose is
safe (gRPC channel teardown). Keep the supervisor's backoff loop; it now actually recovers.
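The recreate-before-connect ordering can be sketched as follows. This is a simplified model under stated assumptions: `RecreatableSession`, `ISessionHandle`, and `FakeHandle` are hypothetical, the open delegate stands in for `OpenSessionAsync` + `RegisterAsync`, and `ReopenAsync` is folded into the session class here although the design places it on `GalaxyDriver`. Swallowing dispose errors is also an assumption, pending the teardown-safety check mentioned above.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

public interface ISessionHandle : IAsyncDisposable { }

// Test double used only for illustration.
public sealed class FakeHandle : ISessionHandle
{
    public bool Disposed { get; private set; }
    public ValueTask DisposeAsync() { Disposed = true; return default; }
}

public sealed class RecreatableSession
{
    private readonly Func<Task<ISessionHandle>> _open; // stands in for open + register
    private ISessionHandle? _session;

    public RecreatableSession(Func<Task<ISessionHandle>> open) => _open = open;

    public async Task ConnectAsync()
    {
        if (_session is not null) return; // idempotent, as in the existing code
        _session = await _open();
    }

    public async Task DisposeSessionForRecreationAsync()
    {
        var dead = _session;
        _session = null; // null first so a throwing dispose cannot leave a dead handle
        if (dead is null) return;
        try { await dead.DisposeAsync(); }
        catch (Exception) { /* best-effort teardown of an already-dead session */ }
    }

    public async Task ReopenAsync()
    {
        await DisposeSessionForRecreationAsync();
        await ConnectAsync(); // now always takes the create path
    }
}
```

Nulling the field before disposing means that even a failed teardown leaves the next `ConnectAsync` free to create a fresh session, which is the property the supervisor's loop depends on.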
### Tests
A reconnect test asserting that after a faulted session, the reopen path **creates a new
session** (new handle / `OpenSessionAsync` called again) rather than no-op'ing. Mirror any
existing `ReconnectSupervisor`/session tests.
## 6 — Live-pill circuit health (standard)
### Problem (verified)
`ScriptLog.razor:122` and `Alerts.razor:132` set `_connected = true` once in
`OnInitialized` and never update it; the pill markup binds to that set-once bool.
`IInProcessBroadcaster<T>` (`AdminUI/Hubs/IInProcessBroadcaster.cs`) exposes only
`Received` + `Publish` — no health signal.
### Decision
- Extend `IInProcessBroadcaster<T>` with a connection-health signal (`bool IsConnected` +
`event … ConnectionStateChanged`). The bridge actors (`ScriptLogSignalRBridge` /
`AlertSignalRBridge`) set it from their DPS-subscription health (SubscribeAck up / failure
down). The razors bind the pill to it and subscribe/unsubscribe like
`DriverStatusPanel.razor` (`OnConnectionStateChanged` → `InvokeAsync(StateHasChanged)`).
- **Dead-circuit case** (node recreate kills the server-side circuit — the component is dead
and cannot self-update its pill): this is Blazor Server's built-in reconnection concern.
**Verify the default reconnect overlay is present/visible** (it is what actually signals a
dropped circuit) rather than trying to fake liveness from a dead component. If absent, add
the standard Blazor reconnect UI.
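A minimal sketch of the health signal, kept separate from `IInProcessBroadcaster<T>` for brevity. `IConnectionHealth` and `ConnectionHealth` are hypothetical names, and `Action<bool>` as the event type is an assumption since the design leaves the delegate type open. The setter raises the event only on an actual transition, so repeated SubscribeAcks do not trigger redundant re-renders.

```csharp
#nullable enable
using System;

public interface IConnectionHealth
{
    bool IsConnected { get; }
    event Action<bool>? ConnectionStateChanged; // delegate type is an assumption
}

public sealed class ConnectionHealth : IConnectionHealth
{
    private readonly object _gate = new();
    private bool _isConnected;

    public bool IsConnected
    {
        get { lock (_gate) return _isConnected; }
    }

    public event Action<bool>? ConnectionStateChanged;

    // Called by the bridge actor: true on SubscribeAck, false on subscription failure.
    public void SetConnected(bool connected)
    {
        lock (_gate)
        {
            if (_isConnected == connected) return; // no transition, no event
            _isConnected = connected;
        }
        // Raised outside the lock so handlers cannot deadlock against the setter.
        ConnectionStateChanged?.Invoke(connected);
    }
}
```

A razor would subscribe in `OnInitialized`, call `InvokeAsync(StateHasChanged)` from the handler, and unsubscribe in `Dispose` so a disposed component is never invoked.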
### Tests
Broadcaster unit test for the new health signal + `SetConnected` propagation. Razor proven by
docker-dev live-verify (no bUnit).
---
## Sequencing & risk
| Item | Risk | Notes |
|---|---|---|
| 1 redundancy double-emit (A1 emit gate, A2 historian gate) | high-risk | independent subsystem; A1∥A2 (different files) |
| 5 Galaxy reconnect | high-risk | independent subsystem (Galaxy driver) |
| 2 Timed-shelve picker | small | razor-only, live-verify |
| 3a chip auto-clear / 3b CorrelationId | small | razor / mechanical refactor |
| 6 live-pill | standard | interface + bridge + 2 razors |
| 4 rig cleanup | trivial/operational | last |
The two high-risk items (1, 5) are in different subsystems and can run in parallel with each
other and with the UI items (2, 3, 6). Rig cleanup (4) is last. TDD where there's logic; UI
proven by docker-dev `/run` live-verify (agent drives — login disabled on docker-dev).
## Hard rules
Stage by explicit path (never `git add .`); never stage `sql_login.txt` /
`src/Server/.../Host/pki/`; never echo the gateway API key into a **new** tracked file; no
force-push, no `--no-verify`; **no Configuration entity / EF migration change** (none of these
items needs one). Build on a feature branch off `master`.