# Alarm Ack/Shelve Follow-ups Implementation Plan > **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task. **Goal:** Resolve the six follow-ups left by the T17–T24 inbound-alarm-ack work: the redundancy double-emit, the Timed-shelve UI, two T21 minors, the docker-dev rig cleanup, and two pre-existing Layer-1 gaps (Galaxy reconnect, the live-pill). **Architecture:** The double-emit is fixed by **primary-only emission at the source** — both `ScriptedAlarmHostActor` (the `alerts` publish) and the per-node `HistorianAdapterActor` (the sink write) subscribe to the existing `redundancy-state` DPS topic, cache the local `RedundancyRole`, and act only when `Primary` (mirroring `OpcUaPublishActor`). OPC UA node writes + inbound-command processing stay ungated for warm-standby. The other five items are localized fixes (one Galaxy-driver lifecycle bug, one broadcaster-health interface extension + 2 razors, two small Alerts.razor additions, one mechanical refactor, one operational cleanup). **Tech Stack:** .NET 10, Akka.NET (cluster, DistributedPubSub, TestKit/xunit2), Blazor Server (InteractiveServer, NO bUnit), xUnit + Shouldly, OPC Foundation UA .NET Standard, Serilog. **Design of record:** `docs/plans/2026-06-11-alarm-followups-design.md` (committed master `bcb9f45c`). **Hard rules:** stage by explicit path (never `git add .`); never stage `sql_login.txt` / `src/Server/.../Host/pki/`; never echo the gateway API key into a **new** tracked file; no force-push, no `--no-verify`; **no Configuration entity / EF migration change**. Build on a feature branch off master. --- ### Task 0: Branch + baseline **Classification:** trivial **Estimated implement time:** ~1 min **Parallelizable with:** none **Files:** (none — git only) **Steps:** 1. `git checkout master && git switch -c feat/alarm-followups` (off `bcb9f45c`). 2. Confirm clean tree + green baseline build: `dotnet build ZB.MOM.WW.OtOpcUa.slnx` → 0 errors. 3. No commit (branch only). --- ### Task 1: Redundancy emit-gate in `ScriptedAlarmHostActor` (A1) **Classification:** high-risk **Estimated implement time:** ~5 min **Parallelizable with:** Task 2, Task 3, Task 4, Task 5 **Files:** - Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs` - Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ScriptedAlarms/ScriptedAlarmHostActorTests.cs` **Context:** `OnEngineEmission` (≈ line 247–279) does two things per emission: `_publishActor.Tell(AlarmStateUpdate(...))` (line 261, OPC UA node write) and `_mediator.Tell(new Publish(AlertsTopic, evt))` (line 278, cluster-wide `alerts`). Both central nodes run this → the `alerts` publish doubles. Gate **only line 278** on `Primary`. **Reference pattern to mirror** (`src/Server/ZB.MOM.WW.OtOpcUa.Runtime/OpcUa/OpcUaPublishActor.cs`): `RedundancyStateTopic = "redundancy-state"` (line 30); `Subscribe(RedundancyStateTopic, Self)` in PreStart (line 156); `Receive(HandleRedundancyStateChanged)` (line 147); `HandleRedundancyStateChanged` reads `msg.Nodes.FirstOrDefault(n => n.NodeId == localNode)?.Role` (lines 335–351). `RedundancyRole` lives in `ZB.MOM.WW.OtOpcUa.Commons.Messages.Redundancy` (`Primary`/`Secondary`/`Detached`). The host already knows its local node id (it's used elsewhere; if not, derive it the same way `OpcUaPublishActor` resolves `_localNode`). **Step 1: Failing TestKit tests** (extend the existing harness; Runtime.Tests = xunit v2 + Akka.TestKit.Xunit2): - `Emission_is_published_to_alerts_by_default_before_any_redundancy_state` — a fresh host (no `RedundancyStateChanged` yet) DOES publish the `AlarmTransitionEvent` to `alerts` (subscribe a probe to `AlertsTopic`). - `Secondary_node_suppresses_alerts_publish_but_still_writes_opcua` — after a `RedundancyStateChanged` marking the local node `Secondary`, an emission does NOT publish to `alerts` but DOES still `Tell` the `OpcUaPublishActor.AlarmStateUpdate` (probe the publish-actor seam the test already uses for T9/T19). - `Primary_node_publishes_alerts` — after a `Primary` `RedundancyStateChanged`, the emission publishes to `alerts`. - `Inbound_AlarmCommand_is_processed_regardless_of_role` — a `Secondary` host still drives the engine for an inbound `AlarmCommand` (the existing T19 ack test, asserted under a Secondary role). **Step 2:** Run them — expect FAIL (no gate yet; Secondary still publishes). **Step 3: Implement.** Add a cached `RedundancyRole? _localRole = null;` (null = unknown ⇒ treat as Primary/emit). In PreStart, `_mediator.Tell(new Subscribe(OpcUaPublishActor.RedundancyStateTopic, Self))` (reuse the const; if cross-project reference is awkward, introduce a shared `Commons` const for `"redundancy-state"` and point both at it — surface that deviation). Add `Receive` + a `SubscribeAck` no-op. In the handler, set `_localRole` from the snapshot for the local node. In `OnEngineEmission`, before line 278: ```csharp // Warm-standby dedup: only the Primary (driver-role leader) publishes the cluster-wide // transition + drives historization. Default-emit until told we are Secondary/Detached so // single-node deploys + the boot window never drop transitions. The OPC UA node write // above (warm address space) and inbound command processing stay ungated. if (_localRole is RedundancyRole.Secondary or RedundancyRole.Detached) return; _mediator.Tell(new Publish(AlertsTopic, evt)); ``` Leave line 261 (the `_publishActor.Tell`) and `OnAlarmCommand` untouched. **Step 4:** Run `dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter ScriptedAlarmHostActor` → all green. **Step 5: Commit** by explicit path (`ScriptedAlarmHostActor.cs` + the test). > High-risk: concurrency (role cached on the actor thread, read in the emission path which is also marshalled onto the actor thread — confirm `OnEngineEmission` runs on the actor thread, it does via `Self.Tell(EngineEmission)`) + redundancy semantics. Do NOT gate the OPC UA write or commands. --- ### Task 2: Redundancy historize-gate in `HistorianAdapterActor` (A2) **Classification:** high-risk **Estimated implement time:** ~5 min **Parallelizable with:** Task 1, Task 3, Task 4, Task 5 **Files:** - Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs` - Modify (only if the actor needs the local node id / mediator wired): `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ServiceCollectionExtensions.cs` (≈ line 146, where it's spawned `historian-adapter`) - Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/HistorianAdapterActorTests.cs` (create if absent; else extend the existing historian test) **Context:** `HistorianAdapterActor` is **per-node** and consumes the `alerts` topic, so one publish historizes once per node → double DB writes. Gate the sink write on `Primary`, same pattern as Task 1. **Step 1: Failing tests** (TestKit + a fake `IAlarmHistorianSink` that records writes): - `Default_before_redundancy_state_historizes` — a fresh adapter writes to the sink (treat unknown as Primary). - `Secondary_node_does_not_historize` — after a `Secondary` `RedundancyStateChanged`, an incoming alarm event is NOT written to the sink. - `Primary_node_historizes` — after `Primary`, it IS written. **Step 2:** Run — FAIL. **Step 3: Implement** the same `_localRole` cache + `redundancy-state` subscription + `Receive` as Task 1, and guard the sink write: `if (_localRole is RedundancyRole.Secondary or RedundancyRole.Detached) return;` before `_sink.Write...`. The actor will need its local node id; resolve it the way `OpcUaPublishActor`/`ServiceCollectionExtensions` already does (pass it into `Props` if not already available — check how `OpcUaPublishActor` gets `_localNode` and mirror; thread it through `HistorianAdapterActor.Props` + the registration at `ServiceCollectionExtensions.cs:146`). **Step 4:** `dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter Historian` → green. Also confirm the full Runtime.Tests still pass. **Step 5: Commit** by explicit path. > High-risk: data-historization correctness + concurrency. If wiring the node id into `Props` ripples beyond the two files, surface it before expanding. --- ### Task 3: Galaxy reconnect recreates a faulted session **Classification:** high-risk **Estimated implement time:** ~5 min **Parallelizable with:** Task 1, Task 2, Task 4, Task 5 **Files:** - Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/GalaxyMxSession.cs` (≈ line 58–69) - Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/GalaxyDriver.cs` (≈ line 289, `ReopenAsync`) - Test: the Galaxy driver test project (find `*ReconnectSupervisor*` / `*Session*` tests under `tests/Drivers/...Driver.Galaxy.Tests/`; mirror their seam) **Context:** `GalaxyMxSession.ConnectAsync` has `if (_session is not null) return;` — so when the gRPC session is Faulted/NotFound the field is still a non-null dead handle and `ConnectAsync` is a silent no-op. `GalaxyDriver.ReopenAsync` calls `ConnectAsync` expecting a reconnect → no-op → `ReconnectSupervisor.RecoveryLoopAsync` (`Runtime/ReconnectSupervisor.cs:158-186`) sees reopen "succeed", replay fails, loops forever. **Step 1: Failing test.** Assert the reconnect path **creates a new session** after a faulted one: drive `GalaxyMxSession` (or a fake `MxGatewayClient`/session seam) to a connected state, simulate fault/dispose, call the reopen/recreate path, and assert `OpenSessionAsync` + `RegisterAsync` were invoked **again** (a second create), not a no-op. Mirror the existing session/reconnect test construction (read how they fake `MxGatewayClient`/`OpenSessionAsync`). **Step 2:** Run — FAIL (current code no-ops, second create never happens). **Step 3: Implement.** Add a recreate path to `GalaxyMxSession`, e.g.: ```csharp /// Disposes the current (faulted/stale) session + owned client so the next /// rebuilds a fresh session instead of no-op'ing on the dead handle. public async Task RecreateAsync(MxGatewayClientOptions clientOptions, CancellationToken ct) { ObjectDisposedException.ThrowIf(_disposed, this); await DisposeSessionAsync().ConfigureAwait(false); // dispose+null _session, _serverHandle, _ownedClient await ConnectAsync(clientOptions, ct).ConfigureAwait(false); } ``` (Factor a private `DisposeSessionAsync()` that safely tears down the gRPC session/channel + nulls the fields — reuse whatever `DisposeAsync` already does for teardown, minus marking `_disposed`.) Then in `GalaxyDriver.ReopenAsync` call `RecreateAsync` instead of `ConnectAsync` so every reopen rebuilds. Keep the supervisor's backoff loop unchanged — it now actually recovers. **Step 4:** Run the Galaxy reconnect/session tests → green. `dotnet build` the Galaxy driver clean. **Step 5: Commit** by explicit path. > High-risk: driver session lifecycle + gRPC channel teardown + concurrency with the supervisor loop. Confirm `DisposeSessionAsync` is safe to call repeatedly and from the supervisor thread. --- ### Task 4: Broadcaster connection-health signal + bridges **Classification:** standard **Estimated implement time:** ~5 min **Parallelizable with:** Task 1, Task 2, Task 3, Task 5 **Files:** - Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Hubs/IInProcessBroadcaster.cs` - Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Hubs/InProcessBroadcaster.cs` (the impl) - Modify: the two bridge actors that publish into the broadcaster — `ScriptLogSignalRBridge` + `AlertSignalRBridge` (find under `AdminUI/` or `Runtime/`; they `Tell`/`Subscribe` the DPS topics and push to the broadcaster) - Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Client.UI.Tests/` (or wherever `InProcessBroadcaster`/broadcaster tests live — find `InProcessBroadcasterTests`) **Step 1: Failing unit test** for the new health signal: a new `InProcessBroadcaster` reports `IsConnected == false` (or a sensible default); calling `SetConnected(true)` raises `ConnectionStateChanged` with `true` and flips `IsConnected`; `SetConnected(false)` flips back + raises. **Step 2:** Run — FAIL (members don't exist). **Step 3: Implement.** Extend the interface: ```csharp bool IsConnected { get; } event Action? ConnectionStateChanged; void SetConnected(bool connected); ``` Implement in `InProcessBroadcaster` (raise only on change). Then in each bridge actor, call `_broadcaster.SetConnected(true)` when its DPS `SubscribeAck` lands (subscription live) and `SetConnected(false)` on `PostStop`/failure. (Read how the bridge subscribes; mirror the `SubscribeAck` handling already there.) Default `IsConnected`: choose `true` once subscribed; before the first ack, `false` — the razor will reconcile on the `ConnectionStateChanged` event. **Step 4:** Run the broadcaster tests → green; AdminUI builds clean (`TreatWarningsAsErrors`). **Step 5: Commit** by explicit path. --- ### Task 5: `CorrelationId` wrapper for the alarm commands (3b) **Classification:** small **Estimated implement time:** ~4 min **Parallelizable with:** Task 1, Task 2, Task 3, Task 4 **Files:** - Modify: `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Messages/Admin/AcknowledgeAlarmCommand.cs` + `ShelveAlarmCommand.cs` (+ their `*Result` records) - Modify: `src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/AdminOperations/AdminOperationsActor.cs` (the alarm handlers) - Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Clients/AdminOperationsClient.cs` - Test: `tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests/AdminOperationsActorTests.cs` (the 5 alarm tests) **Context:** the alarm commands use a bare `Guid CorrelationId`; the project's other control-plane commands (`StartDeployment`/`RestartDriver`/`ReconnectDriver`) use the `CorrelationId` wrapper type. Make them consistent. **Steps:** 1. Find the `CorrelationId` wrapper type + how `RestartDriver` uses it (`CorrelationId.NewId()`, the record field type, the reply echo). 2. Change `AcknowledgeAlarmCommand`/`ShelveAlarmCommand` (+ `*Result`) `CorrelationId` field type `Guid` → `CorrelationId`. Update `AdminOperationsClient.AcknowledgeAlarmAsync`/`ShelveAlarmAsync` to mint `CorrelationId.NewId()`. Update the `AdminOperationsActor` handlers' reply construction. 3. Update the 5 `AdminOperationsActorTests` to the wrapper type. 4. `dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests --filter AdminOperations` → green; build clean. 5. **Commit** by explicit path. --- ### Task 6: `Alerts.razor` — Timed-shelve picker + chip auto-clear + live-pill **Classification:** standard **Estimated implement time:** ~5 min **Parallelizable with:** Task 7 **Blocked by:** Task 4 (needs the broadcaster `ConnectionStateChanged`/`IsConnected`) **Files:** - Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/Alerts.razor` **Context:** all three changes are in one file → bundled so they don't contend. NO bUnit; proven by docker-dev live-verify (Task 8). **Steps (no failing test — razor; verify by build + Task 8):** 1. **Timed-shelve picker:** add a small duration input to the row's shelve control (a minutes number box is simplest, default e.g. 5). Add a "Shelve (timed)" action that calls `IAdminOperationsClient.ShelveAlarmAsync(alarmId, user, ShelveKind.Timed, unshelveAtUtc: , comment: null, ct)`. Keep the existing OneShot Shelve + Unshelve buttons. Dispose the CTS (`using var cts = …`) like the existing handlers. 2. **Chip auto-clear:** after `ShowOpResult`, start a `~8s` auto-clear (mirror `DriverStatusPanel.razor`'s timer: a `Task.Delay(8000)` continuation that, if the chip is still the same one, clears `_opResult*` + `InvokeAsync(StateHasChanged)`). Read `DriverStatusPanel.razor` for the exact pattern (cancellation on a newer action). 3. **Live-pill:** replace the set-once `_connected = true` (≈ line 132) — in `OnInitializedAsync` read `Alarms.IsConnected` and subscribe `Alarms.ConnectionStateChanged += OnConnChanged;` (handler sets `_connected` + `InvokeAsync(StateHasChanged)`); unsubscribe in `Dispose`. (`Alarms` = the injected `IInProcessBroadcaster`.) 4. `dotnet build src/Server/ZB.MOM.WW.OtOpcUa.AdminUI` → 0 warnings/errors. 5. **Commit** by explicit path. --- ### Task 7: `ScriptLog.razor` live-pill + reconnect-overlay check **Classification:** small **Estimated implement time:** ~3 min **Parallelizable with:** Task 6 **Blocked by:** Task 4 **Files:** - Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/ScriptLog.razor` (≈ line 122) - Read/verify (no edit unless missing): the Blazor host shell (`App.razor` / `_Host.cshtml` / `MainLayout`) for the default reconnect overlay (`components-reconnect-modal` / `
`) **Steps:** 1. Same live-pill fix as Task 6 step 3, against `ScriptLogs` (the injected `IInProcessBroadcaster`): read `IsConnected`, subscribe `ConnectionStateChanged`, update + `StateHasChanged`, unsubscribe in `Dispose`. 2. **Verify the dead-circuit overlay exists:** grep the host shell for the Blazor reconnect modal markup. If present, note it in the commit message (the dead-circuit case is covered by Blazor's built-in reconnection UI — the component can't self-update a dead circuit). If ABSENT, add the standard `components-reconnect-modal` markup. Do NOT fake liveness from a dead component. 3. `dotnet build src/Server/ZB.MOM.WW.OtOpcUa.AdminUI` → clean. 4. **Commit** by explicit path. --- ### Task 8: Live-verify on docker-dev **Classification:** verification **Estimated implement time:** ~ (manual) **Parallelizable with:** none **Blocked by:** Task 1, Task 2, Task 3, Task 6, Task 7 (Task 5 is internal; include it in the build) **Steps:** rebuild docker-dev central nodes on the new image (`docker compose -f docker-dev/docker-compose.yml up -d --build central-1 central-2`), then on `/alerts` (login disabled — agent drives): 1. **Double-emit fixed:** confirm each t12-overheat transition now appears **ONCE** (not twice). Cross-check both nodes' logs: only the **driver-leader (Primary)** node logs the alerts publish; the Secondary suppresses it but still writes its OPC UA node (its address space stays current). (Find the Primary via the `redundancy-state`/ServiceLevel — or just observe single rows.) 2. **Timed-shelve:** drive the new timed-shelve control → a "Shelved" transition appears (operator = `multi-role-test`); after the duration the SDK auto-unshelves (or verify the `UnshelveAtUtc` was set). 3. **Chip auto-clear:** the Ack/Shelve result chip clears after ~8s. 4. **Live-pill:** the pill shows "live" while connected; (optional) kill the bridge / restart a node and confirm the pill flips / the Blazor reconnect overlay shows. 5. **Galaxy reconnect** (best-effort, may need the gateway): if feasible, restart the Galaxy gateway and confirm the driver recreates the session + recovers without a container restart (watch the logs for a fresh `OpenSessionAsync`). If not feasible live, rely on the Task 3 unit test + note it. Defects → new fix tasks. **Agent drives; no sign-in needed (docker-dev login disabled).** --- ### Task 9: docker-dev rig cleanup **Classification:** trivial (operational) **Estimated implement time:** ~ (manual) **Parallelizable with:** none **Blocked by:** Task 8 (the rig artifacts are the live-verify vehicle) **Steps:** remove the seed artifacts left for verification, then redeploy: 1. In the docker-dev AdminUI (or via DB), delete the `t12-overheat` scripted alarm, the `SC-ba675b168a85` predicate script, and the `layer0-logcheck` vtag + script. Revert filler-02's `cycle-time-s` script to `return ctx.GetTag("TestMachine_002.TestDuration").Value;`. 2. Redeploy: `POST http://localhost:9200/api/deployments` with `X-Api-Key: docker-dev-deploy-key` → `202`. 3. Confirm `/alerts` + `/scripted-alarms` + `/scripts` no longer list the removed artifacts. No code commit (DB/deploy only). Note completion in Task 10's commit message. > Operational only — no source change. If the user wants the rig kept, skip this task and record it as deferred. --- ### Task 10: Docs + finish branch **Classification:** small **Estimated implement time:** ~4 min **Parallelizable with:** none **Blocked by:** Task 9 **Files:** `docs/Redundancy.md` (alarm-emission + historization are now Primary-gated), `docs/ScriptedAlarms.md` / `docs/AlarmTracking.md` (the redundancy dedup note), `docs/Client.CLI.md` only if affected, and a one-line note wherever the live-pill/Galaxy-reconnect behaviour is documented. Keep terse. **Steps:** 1. Update the docs above to reflect: Primary-only alarm emission + historization under redundancy; the Galaxy reconnect now recreates a faulted session; the live-pill reflects feed health. 2. Run the FULL suite: `dotnet test ZB.MOM.WW.OtOpcUa.slnx` — confirm all affected unit suites green; the only failures should be the known pre-existing macOS/integration ones (OpcUaServer.IntegrationTests PKI, Host.IntegrationTests deploy-Rejected, AbLegacy/AbCip fixtures). Verify the **Galaxy** suite is green (Task 3). 3. **Commit** docs by explicit path. 4. Run **superpowers-extended-cc:finishing-a-development-branch** (verify tests → present the 4 options → merge). --- ## Execution notes - **Parallel dispatch:** Tasks **1, 2, 3, 4, 5** are mutually parallelizable (disjoint files across Runtime/ScriptedAlarms, Runtime/Historian, Driver.Galaxy, AdminUI/Hubs+bridges, Commons+ControlPlane). Dispatch their implementers concurrently. **6 and 7** wait on **4** (broadcaster health) and are ∥ each other (Alerts.razor vs ScriptLog.razor). **8** waits on 1/2/3/6/7; **9** on 8; **10** on 9. - **One writer per file:** `Alerts.razor` is touched ONLY by Task 6 (its three changes are bundled for exactly this reason). `ScriptLog.razor` only by Task 7. - **High-risk tasks (1, 2, 3):** serial spec→code review each. **Standard (4, 6):** parallel spec+code review. **Small (5, 7):** code review only. - **Live-verify (8) is the integration checkpoint** — the single-alerts-row proof is the whole point of item 1. - TDD where there's logic (1, 2, 3, 4, 5); razor (6, 7) proven by docker-dev `/run` (login disabled, agent drives).