From bcb9f45cb3a771ffc5dfb0bb20ec6bf7000a28be Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Thu, 11 Jun 2026 08:23:58 -0400 Subject: [PATCH] docs(design): alarm ack/shelve follow-ups (redundancy double-emit, Timed-shelve UI, T21 minors, rig cleanup, Galaxy reconnect, live-pill) --- .../2026-06-11-alarm-followups-design.md | 167 ++++++++++++++++++ 1 file changed, 167 insertions(+) create mode 100644 docs/plans/2026-06-11-alarm-followups-design.md diff --git a/docs/plans/2026-06-11-alarm-followups-design.md b/docs/plans/2026-06-11-alarm-followups-design.md new file mode 100644 index 00000000..b7421025 --- /dev/null +++ b/docs/plans/2026-06-11-alarm-followups-design.md @@ -0,0 +1,167 @@ +# Alarm Ack/Shelve Follow-ups — Design + +**Date:** 2026-06-11 +**Status:** Approved (brainstorming) — ready for implementation plan. + +## Goal + +Resolve the six notes/follow-ups left open by the T17–T24 inbound-alarm-ack work +(merged to master `bc9843d2`): the redundancy **double-emit**, the deferred Timed-shelve +UI, two T21 minors, the docker-dev rig cleanup, and two pre-existing Layer-1 gaps (Galaxy +reconnect, the live-pill). Scope decided in brainstorming: **everything**, including the +pre-existing L1 gaps. Double-emit approach decided: **primary-only at the source**. + +--- + +## 1 — Redundancy double-emit fix (core; high-risk) + +### Problem (verified) +Both central nodes run roles `admin,driver` in the same MAIN cluster. Each spawns its own +`ScriptedAlarmHostActor` + `ScriptedAlarmEngine` (`DriverHostActor.SpawnScriptedAlarmHost`, +per-node, no singleton) and, for a single-cluster artifact, loads the **same** alarms +(`DeploymentArtifact.ParseComposition` returns `ClusterFilterMode.None` → unfiltered). Both +independently evaluate and both hit `ScriptedAlarmHostActor.OnEngineEmission`: + +- line 261 `_publishActor.Tell(new OpcUaPublishActor.AlarmStateUpdate(...))` — writes the OPC + UA node (directed, local). +- line 278 `_mediator.Tell(new Publish(AlertsTopic /* "alerts" */, evt))` — **cluster-wide, + ungated** → every transition appears **twice** on `/alerts`. Inbound ack/shelve commands + are likewise processed by both nodes' engines. + +`HistorianAdapterActor` (`Runtime/Historian/`, registered **per-node** in +`Runtime/ServiceCollectionExtensions.cs:146`, name `historian-adapter`) also consumes the +`alerts` topic, so a single publish is historized **once per node** → double DB writes in +production (invisible in docker-dev: `NullAlarmHistorianSink`). + +### Existing signal to mirror +`RedundancyStateActor` (admin singleton, `ControlPlane/Redundancy/RedundancyStateActor.cs:90-114`) +publishes `RedundancyStateChanged` to the `redundancy-state` DPS topic. Per-node +`NodeRedundancyState` carries `RedundancyRole` (`Primary`/`Secondary`/`Detached`) + +`IsRoleLeaderForDriver`; **Primary = the driver-role cluster leader**. `OpcUaPublishActor` +already subscribes to this topic (`OpcUaPublishActor.cs:30,147,156,335`) to drive its +ServiceLevel — the exact pattern to copy (`Subscribe(RedundancyStateTopic, Self)` in +PreStart; `Receive`; read `msg.Nodes.FirstOrDefault(n => n.NodeId == +localNode)?.Role`). + +### Decision — gate **only the cluster-wide emission**, on `Primary` +- **Emission gate (task A1):** `ScriptedAlarmHostActor` subscribes to `redundancy-state`, + caches the local role, and **skips the `alerts` publish (line 278) when not Primary**. + **Default = emit until a `RedundancyStateChanged` says this node is Secondary/Detached** — + so single-node deploys (sole node is always the driver leader = Primary) and the boot + window never drop transitions. The OPC UA node write (line 261) and **inbound command + processing stay UNGATED** — the secondary must keep its address space warm and its engine + state consistent for failover; clients only ever see the Primary via ServiceLevel, so the + secondary's node writes + its (subscriber-less) condition events are harmless. +- **Historian gate (task A2):** `HistorianAdapterActor` subscribes to `redundancy-state`, + caches the local role, and **skips the sink write when not Primary**. Gives exactly-once + historization for **all** alarm sources (native Galaxy/AB-CIP too), not just scripted. +- **Edge:** a brief failover window (old Primary gone, new not yet elected) may drop a + transition/historization — acceptable, and identical to the existing ServiceLevel handoff + behaviour. + +### Tests +TestKit: with a Secondary `RedundancyStateChanged`, `OnEngineEmission` does NOT publish to +`alerts` but DOES still `Tell` the OPC UA `AlarmStateUpdate`; with Primary (or before any +state) it publishes. Inbound `AlarmCommand` is still processed regardless of role. Historian: +a Secondary skips the sink write, a Primary writes (fake sink + role injection). + +--- + +## 2 — Timed-shelve picker UI (small) + +`Alerts.razor` currently exposes Ack / Shelve(OneShot) / Unshelve. The `Timed` backend is +fully wired + tested (`ShelveAlarmCommand{ShelveKind.Timed, UnshelveAtUtc}` → +`AdminOperationsActor` → `engine.TimedShelveAsync`). Add a small duration input (a minutes/ +seconds number box, or a `datetime-local`) to the shelve control on each row and call +`ShelveAlarmAsync(kind: Timed, unshelveAtUtc: now + duration)`. Razor-only; **no backend +change**; proven by docker-dev live-verify (no bUnit). Keep the existing OneShot/Unshelve +buttons. + +## 3 — T21 minors (small) + +- **Chip auto-clear:** the `_opResult*` chip persists until the next action. Add the ~8s + auto-clear timer pattern `DriverStatusPanel.razor` already uses (a `Task.Delay` → + `InvokeAsync(StateHasChanged)` that clears the per-row result). +- **CorrelationId consistency:** `AcknowledgeAlarmCommand`/`ShelveAlarmCommand` use a bare + `Guid CorrelationId`; the project's other control-plane commands use the `CorrelationId` + wrapper type. Switch both records + `AdminOperationsClient` (`CorrelationId.NewId()`) + + the actor reply contract + tests to the wrapper for uniform correlation tracing. + +## 4 — Rig cleanup (operational, last) + +Delete the docker-dev seed artifacts left for live-verify: the `t12-overheat` scripted alarm, +the `SC-ba675b168a85` predicate script, the `layer0-logcheck` vtag + script; revert +filler-02's inert `cycle-time-s` logger line to `return ctx.GetTag("TestMachine_002.TestDuration").Value;`. +Redeploy (`POST /api/deployments`, `X-Api-Key: docker-dev-deploy-key`). DB/redeploy, **not a +code change** — done last so the rig stays available for verifying tasks 1–3/6. + +## 5 — Galaxy reconnect recreate (high-risk) + +### Problem (verified) +`GalaxyMxSession.ConnectAsync` (`Driver.Galaxy/Runtime/GalaxyMxSession.cs:58-69`) is +idempotent: `if (_session is not null) return;`. When the gateway restarts and the session +goes Faulted/NotFound, `_session` is still a non-null (dead) handle, so `ConnectAsync` is a +silent no-op. `GalaxyDriver.ReopenAsync` (`GalaxyDriver.cs:289`) calls it expecting a +reconnect → no-op; `ReconnectSupervisor.RecoveryLoopAsync` +(`Driver.Galaxy/Runtime/ReconnectSupervisor.cs:158-186`) sees reopen "succeed", proceeds to +replay (which fails on the dead session), and **loops forever** (backoff capped 30s). + +### Decision — recreate on reopen +Add a recreate path to `GalaxyMxSession` (e.g. `RecreateAsync`/`DisposeSessionForRecreationAsync`) +that disposes + nulls `_session` and `_ownedClient`, and have `ReopenAsync` call it **before** +`ConnectAsync` so a reopen always routes through the happy-path create (`OpenSessionAsync` + +`RegisterAsync`). Confirm what status/exception marks Faulted/NotFound and that the dispose is +safe (gRPC channel teardown). Keep the supervisor's backoff loop; it now actually recovers. + +### Tests +A reconnect test asserting that after a faulted session, the reopen path **creates a new +session** (new handle / `OpenSessionAsync` called again) rather than no-op'ing. Mirror any +existing `ReconnectSupervisor`/session tests. + +## 6 — Live-pill circuit health (standard) + +### Problem (verified) +`ScriptLog.razor:122` and `Alerts.razor:132` set `_connected = true` once in +`OnInitialized` and never update it; the pill markup binds to that set-once bool. +`IInProcessBroadcaster` (`AdminUI/Hubs/IInProcessBroadcaster.cs`) exposes only +`Received` + `Publish` — no health signal. + +### Decision +- Extend `IInProcessBroadcaster` with a connection-health signal (`bool IsConnected` + + `event … ConnectionStateChanged`). The bridge actors (`ScriptLogSignalRBridge` / + `AlertSignalRBridge`) set it from their DPS-subscription health (SubscribeAck up / failure + down). The razors bind the pill to it and subscribe/unsubscribe like + `DriverStatusPanel.razor` (`OnConnectionStateChanged` → `InvokeAsync(StateHasChanged)`). +- **Dead-circuit case** (node recreate kills the server-side circuit — the component is dead + and cannot self-update its pill): this is Blazor Server's built-in reconnection concern. + **Verify the default reconnect overlay is present/visible** (it is what actually signals a + dropped circuit) rather than trying to fake liveness from a dead component. If absent, add + the standard Blazor reconnect UI. + +### Tests +Broadcaster unit test for the new health signal + `SetConnected` propagation. Razor proven by +docker-dev live-verify (no bUnit). + +--- + +## Sequencing & risk + +| Item | Risk | Notes | +|---|---|---| +| 1 redundancy double-emit (A1 emit gate, A2 historian gate) | high-risk | independent subsystem; A1∥A2 (different files) | +| 5 Galaxy reconnect | high-risk | independent subsystem (Galaxy driver) | +| 2 Timed-shelve picker | small | razor-only, live-verify | +| 3a chip auto-clear / 3b CorrelationId | small | razor / mechanical refactor | +| 6 live-pill | standard | interface + bridge + 2 razors | +| 4 rig cleanup | trivial/operational | last | + +The two high-risk items (1, 5) are in different subsystems and can run in parallel with each +other and with the UI items (2, 3, 6). Rig cleanup (4) is last. TDD where there's logic; UI +proven by docker-dev `/run` live-verify (agent drives — login disabled on docker-dev). + +## Hard rules + +Stage by explicit path (never `git add .`); never stage `sql_login.txt` / +`src/Server/.../Host/pki/`; never echo the gateway API key into a **new** tracked file; no +force-push, no `--no-verify`; **no Configuration entity / EF migration change** (none of these +items needs one). Build on a feature branch off `master`.