docs(design): alarm ack/shelve follow-ups (redundancy double-emit, Timed-shelve UI, T21 minors, rig cleanup, Galaxy reconnect, live-pill)

2026-06-11 08:23:58 -04:00
parent bc9843d2bd
commit bcb9f45cb3
1 changed files with 167 additions and 0 deletions
@@ -0,0 +1,167 @@
+# Alarm Ack/Shelve Follow-ups — Design
+
+**Date:** 2026-06-11
+**Status:** Approved (brainstorming) — ready for implementation plan.
+
+## Goal
+
+Resolve the six notes/follow-ups left open by the T17–T24 inbound-alarm-ack work
+(merged to master `bc9843d2`): the redundancy **double-emit**, the deferred Timed-shelve
+UI, two T21 minors, the docker-dev rig cleanup, and two pre-existing Layer-1 gaps (Galaxy
+reconnect, the live-pill). Scope decided in brainstorming: **everything**, including the
+pre-existing L1 gaps. Double-emit approach decided: **primary-only at the source**.
+
+---
+
+## 1 — Redundancy double-emit fix (core; high-risk)
+
+### Problem (verified)
+Both central nodes run roles `admin,driver` in the same MAIN cluster. Each spawns its own
+`ScriptedAlarmHostActor` + `ScriptedAlarmEngine` (`DriverHostActor.SpawnScriptedAlarmHost`,
+per-node, no singleton) and, for a single-cluster artifact, loads the **same** alarms
+(`DeploymentArtifact.ParseComposition` returns `ClusterFilterMode.None` → unfiltered). Both
+independently evaluate and both hit `ScriptedAlarmHostActor.OnEngineEmission`:
+
+- line 261 `_publishActor.Tell(new OpcUaPublishActor.AlarmStateUpdate(...))` — writes the OPC
+  UA node (directed, local).
+- line 278 `_mediator.Tell(new Publish(AlertsTopic /* "alerts" */, evt))` — **cluster-wide,
+  ungated** → every transition appears **twice** on `/alerts`. Inbound ack/shelve commands
+  are likewise processed by both nodes' engines.
+
+`HistorianAdapterActor` (`Runtime/Historian/`, registered **per-node** in
+`Runtime/ServiceCollectionExtensions.cs:146`, name `historian-adapter`) also consumes the
+`alerts` topic, so a single publish is historized **once per node** → double DB writes in
+production (invisible in docker-dev, which uses `NullAlarmHistorianSink`).
+
+### Existing signal to mirror
+`RedundancyStateActor` (admin singleton, `ControlPlane/Redundancy/RedundancyStateActor.cs:90-114`)
+publishes `RedundancyStateChanged` to the `redundancy-state` DPS topic. Per-node
+`NodeRedundancyState` carries `RedundancyRole` (`Primary`/`Secondary`/`Detached`) +
+`IsRoleLeaderForDriver`; **Primary = the driver-role cluster leader**. `OpcUaPublishActor`
+already subscribes to this topic (`OpcUaPublishActor.cs:30,147,156,335`) to drive its
+ServiceLevel — the exact pattern to copy (`Subscribe(RedundancyStateTopic, Self)` in
+PreStart; `Receive<RedundancyStateChanged>`; read `msg.Nodes.FirstOrDefault(n => n.NodeId ==
+localNode)?.Role`).
+
+### Decision — gate **only the cluster-wide emission**, on `Primary`
+- **Emission gate (task A1):** `ScriptedAlarmHostActor` subscribes to `redundancy-state`,
+  caches the local role, and **skips the `alerts` publish (line 278) when not Primary**.
+  **Default = emit until a `RedundancyStateChanged` says this node is Secondary/Detached** —
+  so single-node deploys (sole node is always the driver leader = Primary) and the boot
+  window never drop transitions. The OPC UA node write (line 261) and **inbound command
+  processing stay UNGATED** — the secondary must keep its address space warm and its engine
+  state consistent for failover; clients only ever see the Primary via ServiceLevel, so the
+  secondary's node writes + its (subscriber-less) condition events are harmless.
+- **Historian gate (task A2):** `HistorianAdapterActor` subscribes to `redundancy-state`,
+  caches the local role, and **skips the sink write when not Primary**. Gives exactly-once
+  historization for **all** alarm sources (native Galaxy/AB-CIP too), not just scripted.
+- **Edge:** a brief failover window (old Primary gone, new not yet elected) may drop a
+  transition/historization — acceptable, and identical to the existing ServiceLevel handoff
+  behaviour.
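+
+The "emit by default, suppress only once known non-Primary" rule shared by A1 and A2 can be
+sketched as a small role cache. This is a minimal illustration, not the actor code:
+`PrimaryGate`, `OnRoleObserved` and `ShouldEmit` are hypothetical names, and the enum is a
+stand-in for the real `RedundancyRole`. Keeping the last known role when the local node is
+missing from a message is also an assumption.
+
+```csharp
+using System;
+
+var gate = new PrimaryGate();
+Console.WriteLine(gate.ShouldEmit);            // True: no state seen yet (boot window)
+gate.OnRoleObserved(RedundancyRole.Secondary);
+Console.WriteLine(gate.ShouldEmit);            // False: known non-Primary
+gate.OnRoleObserved(RedundancyRole.Primary);
+Console.WriteLine(gate.ShouldEmit);            // True: promoted after failover
+
+// Hypothetical stand-in for the RedundancyRole carried by NodeRedundancyState.
+public enum RedundancyRole { Primary, Secondary, Detached }
+
+// Caches the local node's role and answers "may this node emit cluster-wide?".
+public sealed class PrimaryGate
+{
+    // null = no RedundancyStateChanged seen yet (boot window or single-node deploy).
+    private RedundancyRole? _localRole;
+
+    // Call from the RedundancyStateChanged handler with the local node's role,
+    // or null when the local node is absent from the message (assumption: keep last role).
+    public void OnRoleObserved(RedundancyRole? role)
+    {
+        if (role is not null) _localRole = role;
+    }
+
+    // Emit by default; suppress only once the node is known to be Secondary/Detached.
+    public bool ShouldEmit => _localRole is null || _localRole == RedundancyRole.Primary;
+}
+```
+
+In the actors, only the `alerts` publish (A1) and the sink write (A2) would consult
+`ShouldEmit`; the OPC UA node write and inbound command handling would not.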
+
+### Tests
+TestKit: with a Secondary `RedundancyStateChanged`, `OnEngineEmission` does NOT publish to
+`alerts` but DOES still `Tell` the OPC UA `AlarmStateUpdate`; with Primary (or before any
+state) it publishes. Inbound `AlarmCommand` is still processed regardless of role. Historian:
+a Secondary skips the sink write, a Primary writes (fake sink + role injection).
+
+---
+
+## 2 — Timed-shelve picker UI (small)
+
+`Alerts.razor` currently exposes Ack / Shelve(OneShot) / Unshelve. The `Timed` backend is
+fully wired + tested (`ShelveAlarmCommand{ShelveKind.Timed, UnshelveAtUtc}` →
+`AdminOperationsActor` → `engine.TimedShelveAsync`). Add a small duration input (a minutes/
+seconds number box, or a `datetime-local`) to the shelve control on each row and call
+`ShelveAlarmAsync(kind: Timed, unshelveAtUtc: now + duration)`. Razor-only; **no backend
+change**; proven by docker-dev live-verify (no bUnit). Keep the existing OneShot/Unshelve
+buttons.
+
+## 3 — T21 minors (small)
+
+- **Chip auto-clear:** the `_opResult*` chip persists until the next action. Add the ~8s
+  auto-clear timer pattern `DriverStatusPanel.razor` already uses (a `Task.Delay` →
+  `InvokeAsync(StateHasChanged)` that clears the per-row result).
+- **CorrelationId consistency:** `AcknowledgeAlarmCommand`/`ShelveAlarmCommand` use a bare
+  `Guid CorrelationId`; the project's other control-plane commands use the `CorrelationId`
+  wrapper type. Switch both records + `AdminOperationsClient` (`CorrelationId.NewId()`) +
+  the actor reply contract + tests to the wrapper for uniform correlation tracing.
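+
+One way the chip auto-clear could look, pulled out of the razor into a plain class so the
+logic is testable. This is a sketch under assumptions: `RowResultChips` and its members are
+hypothetical, the `notifyChanged` delegate stands in for `() => InvokeAsync(StateHasChanged)`,
+and all calls are assumed to run on the component's synchronization context. The version
+check keeps an older timer from clearing a newer result on the same row.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Threading.Tasks;
+
+public sealed class RowResultChips
+{
+    private readonly Dictionary<string, (string Text, int Version)> _results = new();
+    private readonly TimeSpan _lifetime;
+    private readonly Func<Task> _notifyChanged; // in a razor: () => InvokeAsync(StateHasChanged)
+    private int _nextVersion;
+
+    public RowResultChips(TimeSpan lifetime, Func<Task> notifyChanged)
+    {
+        _lifetime = lifetime;
+        _notifyChanged = notifyChanged;
+    }
+
+    // Current chip text for a row, or null when nothing is shown.
+    public string? Get(string rowId) =>
+        _results.TryGetValue(rowId, out var r) ? r.Text : null;
+
+    // Shows the chip, then clears it after the lifetime unless a newer result replaced it.
+    public async Task ShowAsync(string rowId, string text)
+    {
+        var version = ++_nextVersion;
+        _results[rowId] = (text, version);
+        await _notifyChanged();
+
+        await Task.Delay(_lifetime);
+
+        if (_results.TryGetValue(rowId, out var current) && current.Version == version)
+        {
+            _results.Remove(rowId);
+            await _notifyChanged();
+        }
+    }
+}
+```
+
+A real component would also need to stop the pending clear on dispose (for example with a
+`CancellationTokenSource`); that is left out here for brevity.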
+
+## 4 — Rig cleanup (operational, last)
+
+Delete the docker-dev seed artifacts left for live-verify: the `t12-overheat` scripted alarm,
+the `SC-ba675b168a85` predicate script, the `layer0-logcheck` vtag + script; revert
+filler-02's inert `cycle-time-s` logger line to `return ctx.GetTag("TestMachine_002.TestDuration").Value;`.
+Redeploy (`POST /api/deployments`, `X-Api-Key: docker-dev-deploy-key`). DB/redeploy, **not a
+code change**; done last so the rig stays available for verifying tasks 1–3 and 6.
+
+## 5 — Galaxy reconnect recreate (high-risk)
+
+### Problem (verified)
+`GalaxyMxSession.ConnectAsync` (`Driver.Galaxy/Runtime/GalaxyMxSession.cs:58-69`) is
+idempotent: `if (_session is not null) return;`. When the gateway restarts and the session
+goes Faulted/NotFound, `_session` is still a non-null (dead) handle, so `ConnectAsync` is a
+silent no-op. `GalaxyDriver.ReopenAsync` (`GalaxyDriver.cs:289`) calls it expecting a
+reconnect → no-op; `ReconnectSupervisor.RecoveryLoopAsync`
+(`Driver.Galaxy/Runtime/ReconnectSupervisor.cs:158-186`) sees reopen "succeed", proceeds to
+replay (which fails on the dead session), and **loops forever** (backoff capped at 30s).
+
+### Decision — recreate on reopen
+Add a recreate path to `GalaxyMxSession` (e.g. `RecreateAsync`/`DisposeSessionForRecreationAsync`)
+that disposes + nulls `_session` and `_ownedClient`, and have `ReopenAsync` call it **before**
+`ConnectAsync` so a reopen always routes through the happy-path create (`OpenSessionAsync` +
+`RegisterAsync`). Confirm what status/exception marks Faulted/NotFound and that the dispose is
+safe (gRPC channel teardown). Keep the supervisor's backoff loop; it now actually recovers.
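+
+The recreate-then-connect flow can be sketched as follows. This is an illustration only:
+`ISessionHandle` and `SessionHolder` are hypothetical stand-ins, the `open` delegate
+represents the `OpenSessionAsync` + `RegisterAsync` happy path, `_ownedClient` is omitted,
+and the sketch folds the driver's `ReopenAsync` into the same class. Swallowing dispose
+errors on the dead handle is an assumption that still needs the confirmation noted above.
+
+```csharp
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+
+// Hypothetical stand-in for the gateway session handle; not the real Galaxy client type.
+public interface ISessionHandle : IAsyncDisposable { }
+
+public sealed class SessionHolder
+{
+    // Stands in for the happy-path create (OpenSessionAsync + RegisterAsync).
+    private readonly Func<CancellationToken, Task<ISessionHandle>> _open;
+    private ISessionHandle? _session;
+
+    public SessionHolder(Func<CancellationToken, Task<ISessionHandle>> open) => _open = open;
+
+    // Idempotent connect, as in the current code: a non-null handle short-circuits.
+    public async Task ConnectAsync(CancellationToken ct)
+    {
+        if (_session is not null) return;
+        _session = await _open(ct);
+    }
+
+    // Drops the (possibly dead) handle so the next ConnectAsync creates a fresh one.
+    public async Task DisposeSessionForRecreationAsync()
+    {
+        var dead = _session;
+        _session = null;
+        if (dead is null) return;
+        try { await dead.DisposeAsync(); }
+        catch (Exception) { /* assumption: teardown errors on a dead session are ignorable */ }
+    }
+
+    // Reopen = recreate first, then the normal connect path.
+    public async Task ReopenAsync(CancellationToken ct)
+    {
+        await DisposeSessionForRecreationAsync();
+        await ConnectAsync(ct);
+    }
+}
+```
+
+Nulling `_session` before disposing means a throwing dispose cannot leave the dead handle
+in place and reintroduce the silent no-op.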
+
+### Tests
+A reconnect test asserting that after a faulted session, the reopen path **creates a new
+session** (new handle / `OpenSessionAsync` called again) rather than no-op'ing. Mirror any
+existing `ReconnectSupervisor`/session tests.
+
+## 6 — Live-pill circuit health (standard)
+
+### Problem (verified)
+`ScriptLog.razor:122` and `Alerts.razor:132` set `_connected = true` once in
+`OnInitialized` and never update it; the pill markup binds to that set-once bool.
+`IInProcessBroadcaster<T>` (`AdminUI/Hubs/IInProcessBroadcaster.cs`) exposes only
+`Received` + `Publish` — no health signal.
+
+### Decision
+- Extend `IInProcessBroadcaster<T>` with a connection-health signal (`bool IsConnected` +
+  `event … ConnectionStateChanged`). The bridge actors (`ScriptLogSignalRBridge` /
+  `AlertSignalRBridge`) set it from their DPS-subscription health (SubscribeAck up / failure
+  down). The razors bind the pill to it and subscribe/unsubscribe like
+  `DriverStatusPanel.razor` (`OnConnectionStateChanged` → `InvokeAsync(StateHasChanged)`).
+- **Dead-circuit case** (node recreate kills the server-side circuit — the component is dead
+  and cannot self-update its pill): this is Blazor Server's built-in reconnection concern.
+  **Verify the default reconnect overlay is present/visible** (it is what actually signals a
+  dropped circuit) rather than trying to fake liveness from a dead component. If absent, add
+  the standard Blazor reconnect UI.
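+
+A possible shape for the health signal, shown as a standalone type the broadcaster could
+implement or wrap. The names `IConnectionHealth` and `ConnectionHealth` are hypothetical;
+the event's delegate type is elided in this design, so `Action<bool>` is an assumption, as
+is starting in the disconnected state until the first SubscribeAck.
+
+```csharp
+using System;
+
+// Hypothetical shape for the signal added to the broadcaster.
+public interface IConnectionHealth
+{
+    bool IsConnected { get; }
+    event Action<bool>? ConnectionStateChanged;
+}
+
+public sealed class ConnectionHealth : IConnectionHealth
+{
+    private readonly object _lock = new();
+    private bool _isConnected; // assumption: disconnected until the bridge reports otherwise
+
+    public bool IsConnected { get { lock (_lock) return _isConnected; } }
+
+    public event Action<bool>? ConnectionStateChanged;
+
+    // Called by the bridge actor: true on SubscribeAck, false on subscription failure.
+    public void SetConnected(bool connected)
+    {
+        lock (_lock)
+        {
+            if (_isConnected == connected) return; // no event for a no-op transition
+            _isConnected = connected;
+        }
+        // Raised outside the lock so handlers cannot deadlock against SetConnected.
+        ConnectionStateChanged?.Invoke(connected);
+    }
+}
+```
+
+A razor would subscribe in `OnInitialized`, call `InvokeAsync(StateHasChanged)` from the
+handler, and unsubscribe in `Dispose`, matching the `DriverStatusPanel.razor` pattern.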
+
+### Tests
+Broadcaster unit test for the new health signal + `SetConnected` propagation. Razor proven by
+docker-dev live-verify (no bUnit).
+
+---
+
+## Sequencing & risk
+
+| Item | Risk | Notes |
+|---|---|---|
+| 1 redundancy double-emit (A1 emit gate, A2 historian gate) | high-risk | independent subsystem; A1 and A2 can run in parallel (different files) |
+| 5 Galaxy reconnect | high-risk | independent subsystem (Galaxy driver) |
+| 2 Timed-shelve picker | small | razor-only, live-verify |
+| 3a chip auto-clear / 3b CorrelationId | small | razor / mechanical refactor |
+| 6 live-pill | standard | interface + bridge + 2 razors |
+| 4 rig cleanup | trivial/operational | last |
+
+The two high-risk items (1, 5) are in different subsystems and can run in parallel with each
+other and with the UI items (2, 3, 6). Rig cleanup (4) is last. TDD where there's logic; UI
+proven by docker-dev `/run` live-verify (agent-driven; login is disabled on docker-dev).
+
+## Hard rules
+
+Stage by explicit path (never `git add .`); never stage `sql_login.txt` /
+`src/Server/.../Host/pki/`; never echo the gateway API key into a **new** tracked file; no
+force-push, no `--no-verify`; **no Configuration entity / EF migration change** (none of these
+items needs one). Build on a feature branch off `master`.