docs(design): alarm ack/shelve follow-ups (redundancy double-emit, Timed-shelve UI, T21 minors, rig cleanup, Galaxy reconnect, live-pill)

2026-06-11 08:23:58 -04:00
parent bc9843d2bd
commit bcb9f45cb3
1 changed files with 167 additions and 0 deletions
@@ -0,0 +1,167 @@
 # Alarm Ack/Shelve Follow-ups — Design
 **Date:** 2026-06-11
 **Status:** Approved (brainstorming) — ready for implementation plan.
 ## Goal
 Resolve the six notes/follow-ups left open by the T17–T24 inbound-alarm-ack work
 (merged to master `bc9843d2`): the redundancy **double-emit**, the deferred Timed-shelve
 UI, two T21 minors, the docker-dev rig cleanup, and two pre-existing Layer-1 gaps (Galaxy
 reconnect, the live-pill). Scope decided in brainstorming: **everything**, including the
 pre-existing L1 gaps. Double-emit approach decided: **primary-only at the source**.
 ---
 ## 1 — Redundancy double-emit fix (core; high-risk)
 ### Problem (verified)
 Both central nodes run roles `admin,driver` in the same MAIN cluster. Each spawns its own
 `ScriptedAlarmHostActor` + `ScriptedAlarmEngine` (`DriverHostActor.SpawnScriptedAlarmHost`,
 per-node, no singleton) and, for a single-cluster artifact, loads the **same** alarms
 (`DeploymentArtifact.ParseComposition` returns `ClusterFilterMode.None` → unfiltered). Both
 independently evaluate and both hit `ScriptedAlarmHostActor.OnEngineEmission`:
 - line 261 `_publishActor.Tell(new OpcUaPublishActor.AlarmStateUpdate(...))` — writes the OPC
  UA node (directed, local).
 - line 278 `_mediator.Tell(new Publish(AlertsTopic /* "alerts" */, evt))` — **cluster-wide,
  ungated** → every transition appears **twice** on `/alerts`. Inbound ack/shelve commands
  are likewise processed by both nodes' engines.
 `HistorianAdapterActor` (`Runtime/Historian/`, registered **per-node** in
 `Runtime/ServiceCollectionExtensions.cs:146`, name `historian-adapter`) also consumes the
 `alerts` topic, so a single publish is historized **once per node** → double DB writes in
 production (invisible in docker-dev: `NullAlarmHistorianSink`).
 ### Existing signal to mirror
 `RedundancyStateActor` (admin singleton, `ControlPlane/Redundancy/RedundancyStateActor.cs:90-114`)
 publishes `RedundancyStateChanged` to the `redundancy-state` DPS topic. Per-node
 `NodeRedundancyState` carries `RedundancyRole` (`Primary`/`Secondary`/`Detached`) +
 `IsRoleLeaderForDriver`; **Primary = the driver-role cluster leader**. `OpcUaPublishActor`
 already subscribes to this topic (`OpcUaPublishActor.cs:30,147,156,335`) to drive its
 ServiceLevel — the exact pattern to copy (`Subscribe(RedundancyStateTopic, Self)` in
 PreStart; `Receive<RedundancyStateChanged>`; read `msg.Nodes.FirstOrDefault(n => n.NodeId ==
 localNode)?.Role`).
 ### Decision — gate **only the cluster-wide emission**, on `Primary`
 - **Emission gate (task A1):** `ScriptedAlarmHostActor` subscribes to `redundancy-state`,
  caches the local role, and **skips the `alerts` publish (line 278) when not Primary**.
  **Default = emit until a `RedundancyStateChanged` says this node is Secondary/Detached** —
  so single-node deploys (sole node is always the driver leader = Primary) and the boot
  window never drop transitions. The OPC UA node write (line 261) and **inbound command
  processing stay UNGATED** — the secondary must keep its address space warm and its engine
  state consistent for failover; clients only ever see the Primary via ServiceLevel, so the
  secondary's node writes + its (subscriber-less) condition events are harmless.
 - **Historian gate (task A2):** `HistorianAdapterActor` subscribes to `redundancy-state`,
  caches the local role, and **skips the sink write when not Primary**. Gives exactly-once
  historization for **all** alarm sources (native Galaxy/AB-CIP too), not just scripted.
 - **Edge:** a brief failover window (old Primary gone, new not yet elected) may drop a
  transition/historization — acceptable, and identical to the existing ServiceLevel handoff
  behaviour.
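
 The emit-by-default gate described above can be sketched without any actor plumbing. This is a minimal illustration, not the project's implementation: `PrimaryEmissionGate` is a hypothetical helper, and the record types are simplified stand-ins for the real `RedundancyStateChanged` / `NodeRedundancyState` messages (only `Nodes`, `NodeId` and `Role` come from the design text). It assumes a snapshot that omits the local node falls back to the default of emitting, which should be checked against the real publisher.

 ```csharp
 #nullable enable
 using System.Collections.Generic;
 using System.Linq;

 // Simplified stand-ins for the real redundancy messages (assumed shapes).
 public enum RedundancyRole { Primary, Secondary, Detached }
 public sealed record NodeRedundancyState(string NodeId, RedundancyRole Role);
 public sealed record RedundancyStateChanged(IReadOnlyList<NodeRedundancyState> Nodes);

 // Hypothetical helper: caches the local role and answers
 // "may this node publish to the cluster-wide alerts topic?".
 public sealed class PrimaryEmissionGate
 {
     private readonly string _localNode;
     private RedundancyRole? _role; // null = no redundancy state seen yet

     public PrimaryEmissionGate(string localNode) => _localNode = localNode;

     // Intended to be called from the Receive<RedundancyStateChanged> handler.
     public void Apply(RedundancyStateChanged msg) =>
         _role = msg.Nodes.FirstOrDefault(n => n.NodeId == _localNode)?.Role;

     // Default is "emit": only an explicit Secondary/Detached closes the gate,
     // so single-node deploys and the boot window never drop transitions.
     public bool ShouldEmit => _role is null or RedundancyRole.Primary;
 }
 ```

 Only the cluster-wide `alerts` publish would consult `ShouldEmit`; the OPC UA node write and inbound command handling would not, matching the decision to leave them ungated.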
 ### Tests
 TestKit: with a Secondary `RedundancyStateChanged`, `OnEngineEmission` does NOT publish to
 `alerts` but DOES still `Tell` the OPC UA `AlarmStateUpdate`; with Primary (or before any
 state) it publishes. Inbound `AlarmCommand` is still processed regardless of role. Historian:
 a Secondary skips the sink write, a Primary writes (fake sink + role injection).
 ---
 ## 2 — Timed-shelve picker UI (small)
 `Alerts.razor` currently exposes Ack / Shelve(OneShot) / Unshelve. The `Timed` backend is
 fully wired + tested (`ShelveAlarmCommand{ShelveKind.Timed, UnshelveAtUtc}` →
 `AdminOperationsActor` → `engine.TimedShelveAsync`). Add a small duration input (a minutes/
 seconds number box, or a `datetime-local`) to the shelve control on each row and call
 `ShelveAlarmAsync(kind: Timed, unshelveAtUtc: now + duration)`. Razor-only; **no backend
 change**; proven by docker-dev live-verify (no bUnit). Keep the existing OneShot/Unshelve
 buttons.
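
 The `now + duration` computation is small, but it is the one piece of logic the picker adds, so a sketch may help. `TimedShelve.TryComputeUnshelveAtUtc` is a hypothetical helper, not an existing project API, and the choice of whole minutes as the input unit is an assumption (the design leaves the input shape open).

 ```csharp
 using System;

 // Hypothetical helper for the Razor code-behind; not part of the existing backend.
 public static class TimedShelve
 {
     // Returns false for input the UI should reject instead of sending a command.
     public static bool TryComputeUnshelveAtUtc(DateTime nowUtc, int minutes, out DateTime unshelveAtUtc)
     {
         unshelveAtUtc = default;
         if (nowUtc.Kind != DateTimeKind.Utc) return false; // refuse local/unspecified clocks
         if (minutes <= 0) return false;                     // deadline would already be in the past
         unshelveAtUtc = nowUtc.AddMinutes(minutes);         // AddMinutes preserves DateTimeKind.Utc
         return true;
     }
 }
 ```

 The row handler would pass `DateTime.UtcNow` and the number-box value, and call `ShelveAlarmAsync` only when the helper returns `true`.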
 ## 3 — T21 minors (small)
 - **Chip auto-clear:** the `_opResult*` chip persists until the next action. Add the ~8s
  auto-clear timer pattern `DriverStatusPanel.razor` already uses (a `Task.Delay` →
  `InvokeAsync(StateHasChanged)` that clears the per-row result).
 - **CorrelationId consistency:** `AcknowledgeAlarmCommand`/`ShelveAlarmCommand` use a bare
  `Guid CorrelationId`; the project's other control-plane commands use the `CorrelationId`
  wrapper type. Switch both records + `AdminOperationsClient` (`CorrelationId.NewId()`) +
  the actor reply contract + tests to the wrapper for uniform correlation tracing.
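
 For the chip auto-clear, one pitfall is a stale timer wiping a newer result. The sketch below shows one way to guard against that with a version counter. `AutoClearingResult` is a hypothetical class written for illustration; it is not taken from `DriverStatusPanel.razor`, whose actual pattern should be copied where it differs.

 ```csharp
 #nullable enable
 using System;
 using System.Threading.Tasks;

 // Hypothetical holder for one row's result chip.
 public sealed class AutoClearingResult
 {
     private readonly TimeSpan _lifetime;
     private int _version; // bumped on every Show so older timers can detect they are stale

     public AutoClearingResult(TimeSpan lifetime) => _lifetime = lifetime;

     public string? Message { get; private set; }

     // Shows a message now and clears it after the lifetime, unless a newer one replaced it.
     public Task ShowAsync(string message, Func<Task> onCleared)
     {
         Message = message;
         var version = ++_version;
         return ClearLaterAsync(version, onCleared);
     }

     private async Task ClearLaterAsync(int version, Func<Task> onCleared)
     {
         await Task.Delay(_lifetime);
         if (version != _version) return; // a newer result is on screen; leave it alone
         Message = null;
         await onCleared();               // e.g. () => InvokeAsync(StateHasChanged)
     }
 }
 ```

 In a component this would be constructed with roughly `TimeSpan.FromSeconds(8)` and invoked from the Ack/Shelve handlers; the sketch assumes calls arrive on the renderer's synchronization context and does no locking of its own.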
 ## 4 — Rig cleanup (operational, last)
 Delete the docker-dev seed artifacts left for live-verify: the `t12-overheat` scripted alarm,
 the `SC-ba675b168a85` predicate script, the `layer0-logcheck` vtag + script; revert
 filler-02's inert `cycle-time-s` logger line to `return ctx.GetTag("TestMachine_002.TestDuration").Value;`.
 Redeploy (`POST /api/deployments`, `X-Api-Key: docker-dev-deploy-key`). This is a DB change
 plus redeploy, **not a code change**; it runs last so the rig stays available for verifying
 items 1–3 and 6.
 ## 5 — Galaxy reconnect recreate (high-risk)
 ### Problem (verified)
 `GalaxyMxSession.ConnectAsync` (`Driver.Galaxy/Runtime/GalaxyMxSession.cs:58-69`) is
 idempotent: `if (_session is not null) return;`. When the gateway restarts and the session
 goes Faulted/NotFound, `_session` is still a non-null (dead) handle, so `ConnectAsync` is a
 silent no-op. `GalaxyDriver.ReopenAsync` (`GalaxyDriver.cs:289`) calls it expecting a
 reconnect → no-op; `ReconnectSupervisor.RecoveryLoopAsync`
 (`Driver.Galaxy/Runtime/ReconnectSupervisor.cs:158-186`) sees reopen "succeed", proceeds to
 replay (which fails on the dead session), and **loops forever** (backoff capped at 30s).
 ### Decision — recreate on reopen
 Add a recreate path to `GalaxyMxSession` (e.g. `RecreateAsync`/`DisposeSessionForRecreationAsync`)
 that disposes + nulls `_session` and `_ownedClient`, and have `ReopenAsync` call it **before**
 `ConnectAsync` so a reopen always routes through the happy-path create (`OpenSessionAsync` +
 `RegisterAsync`). Confirm what status/exception marks Faulted/NotFound and that the dispose is
 safe (gRPC channel teardown). Keep the supervisor's backoff loop; it now actually recovers.
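
 The interaction between the idempotent connect and the new recreate step can be sketched in isolation. Everything here is a simplified assumption: `RecreatableSession` folds the session and the driver's `ReopenAsync` into one hypothetical class, the `open` delegate stands in for `OpenSessionAsync` + `RegisterAsync`, and swallowing teardown errors is a placeholder until the real dispose behaviour is confirmed. `FakeHandle` exists only to exercise the sketch.

 ```csharp
 #nullable enable
 using System;
 using System.Threading.Tasks;

 public sealed class RecreatableSession
 {
     private readonly Func<Task<IAsyncDisposable>> _open; // stand-in for the create path
     private IAsyncDisposable? _session;

     public RecreatableSession(Func<Task<IAsyncDisposable>> open) => _open = open;

     // Idempotent: any non-null handle, even a dead one, short-circuits.
     public async Task ConnectAsync()
     {
         if (_session is not null) return;
         _session = await _open();
     }

     // Drops the old handle so the next ConnectAsync takes the create path.
     public async Task DisposeSessionForRecreationAsync()
     {
         var old = _session;
         _session = null;
         if (old is null) return;
         try { await old.DisposeAsync(); }
         catch (Exception) { /* assumption: teardown of a dead handle may fail and is ignored */ }
     }

     // Reopen always recreates, so it can no longer be a silent no-op.
     public async Task ReopenAsync()
     {
         await DisposeSessionForRecreationAsync();
         await ConnectAsync();
     }
 }

 // Test double used only to demonstrate the sketch.
 public sealed class FakeHandle : IAsyncDisposable
 {
     public bool Disposed { get; private set; }
     public ValueTask DisposeAsync() { Disposed = true; return default; }
 }
 ```

 With this shape, two `ConnectAsync` calls still open one session, while each `ReopenAsync` disposes the previous handle and opens a fresh one, which is the property the reconnect test is meant to assert.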
 ### Tests
 A reconnect test asserting that after a faulted session, the reopen path **creates a new
 session** (new handle / `OpenSessionAsync` called again) rather than no-op'ing. Mirror any
 existing `ReconnectSupervisor`/session tests.
 ## 6 — Live-pill circuit health (standard)
 ### Problem (verified)
 `ScriptLog.razor:122` and `Alerts.razor:132` set `_connected = true` once in
 `OnInitialized` and never update it; the pill markup binds to that set-once bool.
 `IInProcessBroadcaster<T>` (`AdminUI/Hubs/IInProcessBroadcaster.cs`) exposes only
 `Received` + `Publish` — no health signal.
 ### Decision
 - Extend `IInProcessBroadcaster<T>` with a connection-health signal (`bool IsConnected` +
  `event … ConnectionStateChanged`). The bridge actors (`ScriptLogSignalRBridge` /
  `AlertSignalRBridge`) set it from their DPS-subscription health (SubscribeAck up / failure
  down). The razors bind the pill to it and subscribe/unsubscribe like
  `DriverStatusPanel.razor` (`OnConnectionStateChanged` → `InvokeAsync(StateHasChanged)`).
 - **Dead-circuit case** (node recreate kills the server-side circuit — the component is dead
  and cannot self-update its pill): this is Blazor Server's built-in reconnection concern.
  **Verify the default reconnect overlay is present/visible** (it is what actually signals a
  dropped circuit) rather than trying to fake liveness from a dead component. If absent, add
  the standard Blazor reconnect UI.
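
 A minimal sketch of the health signal, under stated assumptions: the design elides the event's delegate type, so `Action<bool>` below is a placeholder rather than the decided signature, and `BroadcasterHealth` is a hypothetical standalone class instead of the real `IInProcessBroadcaster<T>` implementation. Starting in the disconnected state until the first `SetConnected(true)` is also an assumption.

 ```csharp
 #nullable enable
 using System;

 // Hypothetical holder for the broadcaster's connection health.
 public sealed class BroadcasterHealth
 {
     private readonly object _lock = new object();
     private bool _isConnected; // assumed to start false until the bridge reports SubscribeAck

     public bool IsConnected
     {
         get { lock (_lock) return _isConnected; }
     }

     // Placeholder delegate type; the design does not fix it.
     public event Action<bool>? ConnectionStateChanged;

     // Called by the bridge actor: true on SubscribeAck, false on subscription failure.
     public void SetConnected(bool connected)
     {
         lock (_lock)
         {
             if (_isConnected == connected) return; // no event for a repeated state
             _isConnected = connected;
         }
         ConnectionStateChanged?.Invoke(connected); // raised outside the lock
     }
 }
 ```

 A razor page would read `IsConnected` for the initial pill state, subscribe in `OnInitialized`, and unsubscribe on dispose, with the handler calling `InvokeAsync(StateHasChanged)`.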
 ### Tests
 Broadcaster unit test for the new health signal + `SetConnected` propagation. Razor proven by
 docker-dev live-verify (no bUnit).
 ---
 ## Sequencing & risk
 | Item | Risk | Notes |
 |---|---|---|
 | 1 redundancy double-emit (A1 emit gate, A2 historian gate) | high-risk | independent subsystem; A1 and A2 can proceed in parallel (different files) |
 | 5 Galaxy reconnect | high-risk | independent subsystem (Galaxy driver) |
 | 2 Timed-shelve picker | small | razor-only, live-verify |
 | 3a chip auto-clear / 3b CorrelationId | small | razor / mechanical refactor |
 | 6 live-pill | standard | interface + bridge + 2 razors |
 | 4 rig cleanup | trivial/operational | last |
 The two high-risk items (1, 5) are in different subsystems and can run in parallel with each
 other and with the UI items (2, 3, 6). Rig cleanup (4) is last. TDD where there's logic; UI
 proven by docker-dev `/run` live-verify (agent drives — login disabled on docker-dev).
 ## Hard rules
 Stage by explicit path (never `git add .`); never stage `sql_login.txt` /
 `src/Server/.../Host/pki/`; never echo the gateway API key into a **new** tracked file; no
 force-push, no `--no-verify`; **no Configuration entity / EF migration change** (none of these
 items needs one). Build on a feature branch off `master`.