# Alarm Ack/Shelve Follow-ups — Design **Date:** 2026-06-11 **Status:** Approved (brainstorming) — ready for implementation plan. ## Goal Resolve the six notes/follow-ups left open by the T17–T24 inbound-alarm-ack work (merged to master `bc9843d2`): the redundancy **double-emit**, the deferred Timed-shelve UI, two T21 minors, the docker-dev rig cleanup, and two pre-existing Layer-1 gaps (Galaxy reconnect, the live-pill). Scope decided in brainstorming: **everything**, including the pre-existing L1 gaps. Double-emit approach decided: **primary-only at the source**. --- ## 1 — Redundancy double-emit fix (core; high-risk) ### Problem (verified) Both central nodes run roles `admin,driver` in the same MAIN cluster. Each spawns its own `ScriptedAlarmHostActor` + `ScriptedAlarmEngine` (`DriverHostActor.SpawnScriptedAlarmHost`, per-node, no singleton) and, for a single-cluster artifact, loads the **same** alarms (`DeploymentArtifact.ParseComposition` returns `ClusterFilterMode.None` → unfiltered). Both independently evaluate and both hit `ScriptedAlarmHostActor.OnEngineEmission`: - line 261 `_publishActor.Tell(new OpcUaPublishActor.AlarmStateUpdate(...))` — writes the OPC UA node (directed, local). - line 278 `_mediator.Tell(new Publish(AlertsTopic /* "alerts" */, evt))` — **cluster-wide, ungated** → every transition appears **twice** on `/alerts`. Inbound ack/shelve commands are likewise processed by both nodes' engines. `HistorianAdapterActor` (`Runtime/Historian/`, registered **per-node** in `Runtime/ServiceCollectionExtensions.cs:146`, name `historian-adapter`) also consumes the `alerts` topic, so a single publish is historized **once per node** → double DB writes in production (invisible in docker-dev: `NullAlarmHistorianSink`). ### Existing signal to mirror `RedundancyStateActor` (admin singleton, `ControlPlane/Redundancy/RedundancyStateActor.cs:90-114`) publishes `RedundancyStateChanged` to the `redundancy-state` DPS topic. Per-node `NodeRedundancyState` carries `RedundancyRole` (`Primary`/`Secondary`/`Detached`) + `IsRoleLeaderForDriver`; **Primary = the driver-role cluster leader**. `OpcUaPublishActor` already subscribes to this topic (`OpcUaPublishActor.cs:30,147,156,335`) to drive its ServiceLevel — the exact pattern to copy (`Subscribe(RedundancyStateTopic, Self)` in PreStart; `Receive`; read `msg.Nodes.FirstOrDefault(n => n.NodeId == localNode)?.Role`). ### Decision — gate **only the cluster-wide emission**, on `Primary` - **Emission gate (task A1):** `ScriptedAlarmHostActor` subscribes to `redundancy-state`, caches the local role, and **skips the `alerts` publish (line 278) when not Primary**. **Default = emit until a `RedundancyStateChanged` says this node is Secondary/Detached** — so single-node deploys (sole node is always the driver leader = Primary) and the boot window never drop transitions. The OPC UA node write (line 261) and **inbound command processing stay UNGATED** — the secondary must keep its address space warm and its engine state consistent for failover; clients only ever see the Primary via ServiceLevel, so the secondary's node writes + its (subscriber-less) condition events are harmless. - **Historian gate (task A2):** `HistorianAdapterActor` subscribes to `redundancy-state`, caches the local role, and **skips the sink write when not Primary**. Gives exactly-once historization for **all** alarm sources (native Galaxy/AB-CIP too), not just scripted. - **Edge:** a brief failover window (old Primary gone, new not yet elected) may drop a transition/historization — acceptable, and identical to the existing ServiceLevel handoff behaviour. ### Tests TestKit: with a Secondary `RedundancyStateChanged`, `OnEngineEmission` does NOT publish to `alerts` but DOES still `Tell` the OPC UA `AlarmStateUpdate`; with Primary (or before any state) it publishes. Inbound `AlarmCommand` is still processed regardless of role. Historian: a Secondary skips the sink write, a Primary writes (fake sink + role injection). --- ## 2 — Timed-shelve picker UI (small) `Alerts.razor` currently exposes Ack / Shelve(OneShot) / Unshelve. The `Timed` backend is fully wired + tested (`ShelveAlarmCommand{ShelveKind.Timed, UnshelveAtUtc}` → `AdminOperationsActor` → `engine.TimedShelveAsync`). Add a small duration input (a minutes/ seconds number box, or a `datetime-local`) to the shelve control on each row and call `ShelveAlarmAsync(kind: Timed, unshelveAtUtc: now + duration)`. Razor-only; **no backend change**; proven by docker-dev live-verify (no bUnit). Keep the existing OneShot/Unshelve buttons. ## 3 — T21 minors (small) - **Chip auto-clear:** the `_opResult*` chip persists until the next action. Add the ~8s auto-clear timer pattern `DriverStatusPanel.razor` already uses (a `Task.Delay` → `InvokeAsync(StateHasChanged)` that clears the per-row result). - **CorrelationId consistency:** `AcknowledgeAlarmCommand`/`ShelveAlarmCommand` use a bare `Guid CorrelationId`; the project's other control-plane commands use the `CorrelationId` wrapper type. Switch both records + `AdminOperationsClient` (`CorrelationId.NewId()`) + the actor reply contract + tests to the wrapper for uniform correlation tracing. ## 4 — Rig cleanup (operational, last) Delete the docker-dev seed artifacts left for live-verify: the `t12-overheat` scripted alarm, the `SC-ba675b168a85` predicate script, the `layer0-logcheck` vtag + script; revert filler-02's inert `cycle-time-s` logger line to `return ctx.GetTag("TestMachine_002.TestDuration").Value;`. Redeploy (`POST /api/deployments`, `X-Api-Key: docker-dev-deploy-key`). DB/redeploy, **not a code change** — done last so the rig stays available for verifying tasks 1–3/6. ## 5 — Galaxy reconnect recreate (high-risk) ### Problem (verified) `GalaxyMxSession.ConnectAsync` (`Driver.Galaxy/Runtime/GalaxyMxSession.cs:58-69`) is idempotent: `if (_session is not null) return;`. When the gateway restarts and the session goes Faulted/NotFound, `_session` is still a non-null (dead) handle, so `ConnectAsync` is a silent no-op. `GalaxyDriver.ReopenAsync` (`GalaxyDriver.cs:289`) calls it expecting a reconnect → no-op; `ReconnectSupervisor.RecoveryLoopAsync` (`Driver.Galaxy/Runtime/ReconnectSupervisor.cs:158-186`) sees reopen "succeed", proceeds to replay (which fails on the dead session), and **loops forever** (backoff capped 30s). ### Decision — recreate on reopen Add a recreate path to `GalaxyMxSession` (e.g. `RecreateAsync`/`DisposeSessionForRecreationAsync`) that disposes + nulls `_session` and `_ownedClient`, and have `ReopenAsync` call it **before** `ConnectAsync` so a reopen always routes through the happy-path create (`OpenSessionAsync` + `RegisterAsync`). Confirm what status/exception marks Faulted/NotFound and that the dispose is safe (gRPC channel teardown). Keep the supervisor's backoff loop; it now actually recovers. ### Tests A reconnect test asserting that after a faulted session, the reopen path **creates a new session** (new handle / `OpenSessionAsync` called again) rather than no-op'ing. Mirror any existing `ReconnectSupervisor`/session tests. ## 6 — Live-pill circuit health (standard) ### Problem (verified) `ScriptLog.razor:122` and `Alerts.razor:132` set `_connected = true` once in `OnInitialized` and never update it; the pill markup binds to that set-once bool. `IInProcessBroadcaster` (`AdminUI/Hubs/IInProcessBroadcaster.cs`) exposes only `Received` + `Publish` — no health signal. ### Decision - Extend `IInProcessBroadcaster` with a connection-health signal (`bool IsConnected` + `event … ConnectionStateChanged`). The bridge actors (`ScriptLogSignalRBridge` / `AlertSignalRBridge`) set it from their DPS-subscription health (SubscribeAck up / failure down). The razors bind the pill to it and subscribe/unsubscribe like `DriverStatusPanel.razor` (`OnConnectionStateChanged` → `InvokeAsync(StateHasChanged)`). - **Dead-circuit case** (node recreate kills the server-side circuit — the component is dead and cannot self-update its pill): this is Blazor Server's built-in reconnection concern. **Verify the default reconnect overlay is present/visible** (it is what actually signals a dropped circuit) rather than trying to fake liveness from a dead component. If absent, add the standard Blazor reconnect UI. ### Tests Broadcaster unit test for the new health signal + `SetConnected` propagation. Razor proven by docker-dev live-verify (no bUnit). --- ## Sequencing & risk | Item | Risk | Notes | |---|---|---| | 1 redundancy double-emit (A1 emit gate, A2 historian gate) | high-risk | independent subsystem; A1∥A2 (different files) | | 5 Galaxy reconnect | high-risk | independent subsystem (Galaxy driver) | | 2 Timed-shelve picker | small | razor-only, live-verify | | 3a chip auto-clear / 3b CorrelationId | small | razor / mechanical refactor | | 6 live-pill | standard | interface + bridge + 2 razors | | 4 rig cleanup | trivial/operational | last | The two high-risk items (1, 5) are in different subsystems and can run in parallel with each other and with the UI items (2, 3, 6). Rig cleanup (4) is last. TDD where there's logic; UI proven by docker-dev `/run` live-verify (agent drives — login disabled on docker-dev). ## Hard rules Stage by explicit path (never `git add .`); never stage `sql_login.txt` / `src/Server/.../Host/pki/`; never echo the gateway API key into a **new** tracked file; no force-push, no `--no-verify`; **no Configuration entity / EF migration change** (none of these items needs one). Build on a feature branch off `master`.