lmxopcua/docs/plans/2026-06-11-alarm-followups.md

# Alarm Ack/Shelve Follow-ups Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.

**Goal:** Resolve the six follow-ups left by the T17–T24 inbound-alarm-ack work: the redundancy double-emit, the Timed-shelve UI, two T21 minors, the docker-dev rig cleanup, and two pre-existing Layer-1 gaps (Galaxy reconnect, the live-pill).

**Architecture:** The double-emit is fixed by **primary-only emission at the source** — both `ScriptedAlarmHostActor` (the `alerts` publish) and the per-node `HistorianAdapterActor` (the sink write) subscribe to the existing `redundancy-state` DPS topic, cache the local `RedundancyRole`, and act only when `Primary` (mirroring `OpcUaPublishActor`). OPC UA node writes + inbound-command processing stay ungated for warm-standby. The other five items are localized fixes (one Galaxy-driver lifecycle bug, one broadcaster-health interface extension + 2 razors, two small Alerts.razor additions, one mechanical refactor, one operational cleanup).

**Tech Stack:** .NET 10, Akka.NET (cluster, DistributedPubSub, TestKit/xunit2), Blazor Server (InteractiveServer, NO bUnit), xUnit + Shouldly, OPC Foundation UA .NET Standard, Serilog.

**Design of record:** `docs/plans/2026-06-11-alarm-followups-design.md` (committed master `bcb9f45c`).

**Hard rules:** stage by explicit path (never `git add .`); never stage `sql_login.txt` / `src/Server/.../Host/pki/`; never echo the gateway API key into a **new** tracked file; no force-push, no `--no-verify`; **no Configuration entity / EF migration change**. Build on a feature branch off master.

---

### Task 0: Branch + baseline

**Classification:** trivial
**Estimated implement time:** ~1 min
**Parallelizable with:** none

**Files:** (none — git only)

**Steps:**
1. `git checkout master && git switch -c feat/alarm-followups` (off `bcb9f45c`).
2. Confirm clean tree + green baseline build: `dotnet build ZB.MOM.WW.OtOpcUa.slnx` → 0 errors.
3. No commit (branch only).

---

### Task 1: Redundancy emit-gate in `ScriptedAlarmHostActor` (A1)

**Classification:** high-risk
**Estimated implement time:** ~5 min
**Parallelizable with:** Task 2, Task 3, Task 4, Task 5

**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs`
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ScriptedAlarms/ScriptedAlarmHostActorTests.cs`

**Context:** `OnEngineEmission` (≈ line 247–279) does two things per emission: `_publishActor.Tell(AlarmStateUpdate(...))` (line 261, OPC UA node write) and `_mediator.Tell(new Publish(AlertsTopic, evt))` (line 278, cluster-wide `alerts`). Both central nodes run this → the `alerts` publish doubles. Gate **only line 278** on `Primary`.

**Reference pattern to mirror** (`src/Server/ZB.MOM.WW.OtOpcUa.Runtime/OpcUa/OpcUaPublishActor.cs`): `RedundancyStateTopic = "redundancy-state"` (line 30); `Subscribe(RedundancyStateTopic, Self)` in PreStart (line 156); `Receive<RedundancyStateChanged>(HandleRedundancyStateChanged)` (line 147); `HandleRedundancyStateChanged` reads `msg.Nodes.FirstOrDefault(n => n.NodeId == localNode)?.Role` (lines 335–351). `RedundancyRole` lives in `ZB.MOM.WW.OtOpcUa.Commons.Messages.Redundancy` (`Primary`/`Secondary`/`Detached`). The host already knows its local node id (it's used elsewhere; if not, derive it the same way `OpcUaPublishActor` resolves `_localNode`).

**Step 1: Failing TestKit tests** (extend the existing harness; Runtime.Tests = xunit v2 + Akka.TestKit.Xunit2):
- `Emission_is_published_to_alerts_by_default_before_any_redundancy_state` — a fresh host (no `RedundancyStateChanged` yet) DOES publish the `AlarmTransitionEvent` to `alerts` (subscribe a probe to `AlertsTopic`).
- `Secondary_node_suppresses_alerts_publish_but_still_writes_opcua` — after a `RedundancyStateChanged` marking the local node `Secondary`, an emission does NOT publish to `alerts` but DOES still `Tell` the `OpcUaPublishActor.AlarmStateUpdate` (probe the publish-actor seam the test already uses for T9/T19).
- `Primary_node_publishes_alerts` — after a `Primary` `RedundancyStateChanged`, the emission publishes to `alerts`.
- `Inbound_AlarmCommand_is_processed_regardless_of_role` — a `Secondary` host still drives the engine for an inbound `AlarmCommand` (the existing T19 ack test, asserted under a Secondary role).

**Step 2:** Run them — expect FAIL (no gate yet; Secondary still publishes).

**Step 3: Implement.** Add a cached `RedundancyRole? _localRole = null;` (null = unknown ⇒ treat as Primary/emit). In PreStart, `_mediator.Tell(new Subscribe(OpcUaPublishActor.RedundancyStateTopic, Self))` (reuse the const; if cross-project reference is awkward, introduce a shared `Commons` const for `"redundancy-state"` and point both at it — surface that deviation). Add `Receive<RedundancyStateChanged>` + a `SubscribeAck` no-op. In the handler, set `_localRole` from the snapshot for the local node. In `OnEngineEmission`, before line 278:
```csharp
// Warm-standby dedup: only the Primary (driver-role leader) publishes the cluster-wide
// transition + drives historization. Default-emit until told we are Secondary/Detached so
// single-node deploys + the boot window never drop transitions. The OPC UA node write
// above (warm address space) and inbound command processing stay ungated.
if (_localRole is RedundancyRole.Secondary or RedundancyRole.Detached)
    return;
_mediator.Tell(new Publish(AlertsTopic, evt));
```
Leave line 261 (the `_publishActor.Tell`) and `OnAlarmCommand` untouched.

**Step 4:** Run `dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter ScriptedAlarmHostActor` → all green.

**Step 5: Commit** by explicit path (`ScriptedAlarmHostActor.cs` + the test).

> High-risk: concurrency (role cached on the actor thread, read in the emission path which is also marshalled onto the actor thread — confirm `OnEngineEmission` runs on the actor thread, it does via `Self.Tell(EngineEmission)`) + redundancy semantics. Do NOT gate the OPC UA write or commands.

---

### Task 2: Redundancy historize-gate in `HistorianAdapterActor` (A2)

**Classification:** high-risk
**Estimated implement time:** ~5 min
**Parallelizable with:** Task 1, Task 3, Task 4, Task 5

**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs`
- Modify (only if the actor needs the local node id / mediator wired): `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ServiceCollectionExtensions.cs` (≈ line 146, where it's spawned `historian-adapter`)
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/HistorianAdapterActorTests.cs` (create if absent; else extend the existing historian test)

**Context:** `HistorianAdapterActor` is **per-node** and consumes the `alerts` topic, so one publish historizes once per node → double DB writes. Gate the sink write on `Primary`, same pattern as Task 1.

**Step 1: Failing tests** (TestKit + a fake `IAlarmHistorianSink` that records writes):
- `Default_before_redundancy_state_historizes` — a fresh adapter writes to the sink (treat unknown as Primary).
- `Secondary_node_does_not_historize` — after a `Secondary` `RedundancyStateChanged`, an incoming alarm event is NOT written to the sink.
- `Primary_node_historizes` — after `Primary`, it IS written.

**Step 2:** Run — FAIL.

**Step 3: Implement** the same `_localRole` cache + `redundancy-state` subscription + `Receive<RedundancyStateChanged>` as Task 1, and guard the sink write: `if (_localRole is RedundancyRole.Secondary or RedundancyRole.Detached) return;` before `_sink.Write...`. The actor will need its local node id; resolve it the way `OpcUaPublishActor`/`ServiceCollectionExtensions` already does (pass it into `Props` if not already available — check how `OpcUaPublishActor` gets `_localNode` and mirror; thread it through `HistorianAdapterActor.Props` + the registration at `ServiceCollectionExtensions.cs:146`).

**Step 4:** `dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter Historian` → green. Also confirm the full Runtime.Tests still pass.

**Step 5: Commit** by explicit path.

> High-risk: data-historization correctness + concurrency. If wiring the node id into `Props` ripples beyond the two files, surface it before expanding.

---

### Task 3: Galaxy reconnect recreates a faulted session

**Classification:** high-risk
**Estimated implement time:** ~5 min
**Parallelizable with:** Task 1, Task 2, Task 4, Task 5

**Files:**
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/GalaxyMxSession.cs` (≈ line 58–69)
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/GalaxyDriver.cs` (≈ line 289, `ReopenAsync`)
- Test: the Galaxy driver test project (find `*ReconnectSupervisor*` / `*Session*` tests under `tests/Drivers/...Driver.Galaxy.Tests/`; mirror their seam)

**Context:** `GalaxyMxSession.ConnectAsync` has `if (_session is not null) return;` — so when the gRPC session is Faulted/NotFound the field is still a non-null dead handle and `ConnectAsync` is a silent no-op. `GalaxyDriver.ReopenAsync` calls `ConnectAsync` expecting a reconnect → no-op → `ReconnectSupervisor.RecoveryLoopAsync` (`Runtime/ReconnectSupervisor.cs:158-186`) sees reopen "succeed", replay fails, loops forever.

**Step 1: Failing test.** Assert the reconnect path **creates a new session** after a faulted one: drive `GalaxyMxSession` (or a fake `MxGatewayClient`/session seam) to a connected state, simulate fault/dispose, call the reopen/recreate path, and assert `OpenSessionAsync` + `RegisterAsync` were invoked **again** (a second create), not a no-op. Mirror the existing session/reconnect test construction (read how they fake `MxGatewayClient`/`OpenSessionAsync`).

**Step 2:** Run — FAIL (current code no-ops, second create never happens).

**Step 3: Implement.** Add a recreate path to `GalaxyMxSession`, e.g.:
```csharp
/// <summary>Disposes the current (faulted/stale) session + owned client so the next
/// <see cref="ConnectAsync"/> rebuilds a fresh session instead of no-op'ing on the dead handle.</summary>
public async Task RecreateAsync(MxGatewayClientOptions clientOptions, CancellationToken ct)
{
    ObjectDisposedException.ThrowIf(_disposed, this);
    await DisposeSessionAsync().ConfigureAwait(false); // dispose+null _session, _serverHandle, _ownedClient
    await ConnectAsync(clientOptions, ct).ConfigureAwait(false);
}
```
(Factor a private `DisposeSessionAsync()` that safely tears down the gRPC session/channel + nulls the fields — reuse whatever `DisposeAsync` already does for teardown, minus marking `_disposed`.) Then in `GalaxyDriver.ReopenAsync` call `RecreateAsync` instead of `ConnectAsync` so every reopen rebuilds. Keep the supervisor's backoff loop unchanged — it now actually recovers.

**Step 4:** Run the Galaxy reconnect/session tests → green. `dotnet build` the Galaxy driver clean.

**Step 5: Commit** by explicit path.

> High-risk: driver session lifecycle + gRPC channel teardown + concurrency with the supervisor loop. Confirm `DisposeSessionAsync` is safe to call repeatedly and from the supervisor thread.

---

### Task 4: Broadcaster connection-health signal + bridges

**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** Task 1, Task 2, Task 3, Task 5

**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Hubs/IInProcessBroadcaster.cs`
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Hubs/InProcessBroadcaster.cs` (the impl)
- Modify: the two bridge actors that publish into the broadcaster — `ScriptLogSignalRBridge` + `AlertSignalRBridge` (find under `AdminUI/` or `Runtime/`; they `Tell`/`Subscribe` the DPS topics and push to the broadcaster)
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Client.UI.Tests/` (or wherever `InProcessBroadcaster`/broadcaster tests live — find `InProcessBroadcasterTests`)

**Step 1: Failing unit test** for the new health signal: a new `InProcessBroadcaster<T>` reports `IsConnected == false` (or a sensible default); calling `SetConnected(true)` raises `ConnectionStateChanged` with `true` and flips `IsConnected`; `SetConnected(false)` flips back + raises.

**Step 2:** Run — FAIL (members don't exist).

**Step 3: Implement.** Extend the interface:
```csharp
bool IsConnected { get; }
event Action<bool>? ConnectionStateChanged;
void SetConnected(bool connected);
```
Implement in `InProcessBroadcaster<T>` (raise only on change). Then in each bridge actor, call `_broadcaster.SetConnected(true)` when its DPS `SubscribeAck` lands (subscription live) and `SetConnected(false)` on `PostStop`/failure. (Read how the bridge subscribes; mirror the `SubscribeAck` handling already there.) Default `IsConnected`: choose `true` once subscribed; before the first ack, `false` — the razor will reconcile on the `ConnectionStateChanged` event.

**Step 4:** Run the broadcaster tests → green; AdminUI builds clean (`TreatWarningsAsErrors`).

**Step 5: Commit** by explicit path.

---

### Task 5: `CorrelationId` wrapper for the alarm commands (3b)

**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** Task 1, Task 2, Task 3, Task 4

**Files:**
- Modify: `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Messages/Admin/AcknowledgeAlarmCommand.cs` + `ShelveAlarmCommand.cs` (+ their `*Result` records)
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/AdminOperations/AdminOperationsActor.cs` (the alarm handlers)
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Clients/AdminOperationsClient.cs`
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests/AdminOperationsActorTests.cs` (the 5 alarm tests)

**Context:** the alarm commands use a bare `Guid CorrelationId`; the project's other control-plane commands (`StartDeployment`/`RestartDriver`/`ReconnectDriver`) use the `CorrelationId` wrapper type. Make them consistent.

**Steps:**
1. Find the `CorrelationId` wrapper type + how `RestartDriver` uses it (`CorrelationId.NewId()`, the record field type, the reply echo).
2. Change `AcknowledgeAlarmCommand`/`ShelveAlarmCommand` (+ `*Result`) `CorrelationId` field type `Guid` → `CorrelationId`. Update `AdminOperationsClient.AcknowledgeAlarmAsync`/`ShelveAlarmAsync` to mint `CorrelationId.NewId()`. Update the `AdminOperationsActor` handlers' reply construction.
3. Update the 5 `AdminOperationsActorTests` to the wrapper type.
4. `dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests --filter AdminOperations` → green; build clean.
5. **Commit** by explicit path.

---

### Task 6: `Alerts.razor` — Timed-shelve picker + chip auto-clear + live-pill

**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** Task 7
**Blocked by:** Task 4 (needs the broadcaster `ConnectionStateChanged`/`IsConnected`)

**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/Alerts.razor`

**Context:** all three changes are in one file → bundled so they don't contend. NO bUnit; proven by docker-dev live-verify (Task 8).

**Steps (no failing test — razor; verify by build + Task 8):**
1. **Timed-shelve picker:** add a small duration input to the row's shelve control (a minutes number box is simplest, default e.g. 5). Add a "Shelve (timed)" action that calls `IAdminOperationsClient.ShelveAlarmAsync(alarmId, user, ShelveKind.Timed, unshelveAtUtc: <computed now+duration>, comment: null, ct)`. Keep the existing OneShot Shelve + Unshelve buttons. Dispose the CTS (`using var cts = …`) like the existing handlers.
2. **Chip auto-clear:** after `ShowOpResult`, start a `~8s` auto-clear (mirror `DriverStatusPanel.razor`'s timer: a `Task.Delay(8000)` continuation that, if the chip is still the same one, clears `_opResult*` + `InvokeAsync(StateHasChanged)`). Read `DriverStatusPanel.razor` for the exact pattern (cancellation on a newer action).
3. **Live-pill:** replace the set-once `_connected = true` (≈ line 132) — in `OnInitializedAsync` read `Alarms.IsConnected` and subscribe `Alarms.ConnectionStateChanged += OnConnChanged;` (handler sets `_connected` + `InvokeAsync(StateHasChanged)`); unsubscribe in `Dispose`. (`Alarms` = the injected `IInProcessBroadcaster<AlarmTransitionEvent>`.)
4. `dotnet build src/Server/ZB.MOM.WW.OtOpcUa.AdminUI` → 0 warnings/errors.
5. **Commit** by explicit path.

---

### Task 7: `ScriptLog.razor` live-pill + reconnect-overlay check

**Classification:** small
**Estimated implement time:** ~3 min
**Parallelizable with:** Task 6
**Blocked by:** Task 4

**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/ScriptLog.razor` (≈ line 122)
- Read/verify (no edit unless missing): the Blazor host shell (`App.razor` / `_Host.cshtml` / `MainLayout`) for the default reconnect overlay (`components-reconnect-modal` / `<div id="components-reconnect-modal">`)

**Steps:**
1. Same live-pill fix as Task 6 step 3, against `ScriptLogs` (the injected `IInProcessBroadcaster<ScriptLogEntry>`): read `IsConnected`, subscribe `ConnectionStateChanged`, update + `StateHasChanged`, unsubscribe in `Dispose`.
2. **Verify the dead-circuit overlay exists:** grep the host shell for the Blazor reconnect modal markup. If present, note it in the commit message (the dead-circuit case is covered by Blazor's built-in reconnection UI — the component can't self-update a dead circuit). If ABSENT, add the standard `components-reconnect-modal` markup. Do NOT fake liveness from a dead component.
3. `dotnet build src/Server/ZB.MOM.WW.OtOpcUa.AdminUI` → clean.
4. **Commit** by explicit path.

---

### Task 8: Live-verify on docker-dev

**Classification:** verification
**Estimated implement time:** ~ (manual)
**Parallelizable with:** none
**Blocked by:** Task 1, Task 2, Task 3, Task 6, Task 7 (Task 5 is internal; include it in the build)

**Steps:** rebuild docker-dev central nodes on the new image (`docker compose -f docker-dev/docker-compose.yml up -d --build central-1 central-2`), then on `/alerts` (login disabled — agent drives):
1. **Double-emit fixed:** confirm each t12-overheat transition now appears **ONCE** (not twice). Cross-check both nodes' logs: only the **driver-leader (Primary)** node logs the alerts publish; the Secondary suppresses it but still writes its OPC UA node (its address space stays current). (Find the Primary via the `redundancy-state`/ServiceLevel — or just observe single rows.)
2. **Timed-shelve:** drive the new timed-shelve control → a "Shelved" transition appears (operator = `multi-role-test`); after the duration the SDK auto-unshelves (or verify the `UnshelveAtUtc` was set).
3. **Chip auto-clear:** the Ack/Shelve result chip clears after ~8s.
4. **Live-pill:** the pill shows "live" while connected; (optional) kill the bridge / restart a node and confirm the pill flips / the Blazor reconnect overlay shows.
5. **Galaxy reconnect** (best-effort, may need the gateway): if feasible, restart the Galaxy gateway and confirm the driver recreates the session + recovers without a container restart (watch the logs for a fresh `OpenSessionAsync`). If not feasible live, rely on the Task 3 unit test + note it.
Defects → new fix tasks. **Agent drives; no sign-in needed (docker-dev login disabled).**

---

### Task 9: docker-dev rig cleanup

**Classification:** trivial (operational)
**Estimated implement time:** ~ (manual)
**Parallelizable with:** none
**Blocked by:** Task 8 (the rig artifacts are the live-verify vehicle)

**Steps:** remove the seed artifacts left for verification, then redeploy:
1. In the docker-dev AdminUI (or via DB), delete the `t12-overheat` scripted alarm, the `SC-ba675b168a85` predicate script, and the `layer0-logcheck` vtag + script. Revert filler-02's `cycle-time-s` script to `return ctx.GetTag("TestMachine_002.TestDuration").Value;`.
2. Redeploy: `POST http://localhost:9200/api/deployments` with `X-Api-Key: docker-dev-deploy-key` → `202`.
3. Confirm `/alerts` + `/scripted-alarms` + `/scripts` no longer list the removed artifacts. No code commit (DB/deploy only). Note completion in Task 10's commit message.

> Operational only — no source change. If the user wants the rig kept, skip this task and record it as deferred.

---

### Task 10: Docs + finish branch

**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Blocked by:** Task 9

**Files:** `docs/Redundancy.md` (alarm-emission + historization are now Primary-gated), `docs/ScriptedAlarms.md` / `docs/AlarmTracking.md` (the redundancy dedup note), `docs/Client.CLI.md` only if affected, and a one-line note wherever the live-pill/Galaxy-reconnect behaviour is documented. Keep terse.

**Steps:**
1. Update the docs above to reflect: Primary-only alarm emission + historization under redundancy; the Galaxy reconnect now recreates a faulted session; the live-pill reflects feed health.
2. Run the FULL suite: `dotnet test ZB.MOM.WW.OtOpcUa.slnx` — confirm all affected unit suites green; the only failures should be the known pre-existing macOS/integration ones (OpcUaServer.IntegrationTests PKI, Host.IntegrationTests deploy-Rejected, AbLegacy/AbCip fixtures). Verify the **Galaxy** suite is green (Task 3).
3. **Commit** docs by explicit path.
4. Run **superpowers-extended-cc:finishing-a-development-branch** (verify tests → present the 4 options → merge).

---

## Execution notes

- **Parallel dispatch:** Tasks **1, 2, 3, 4, 5** are mutually parallelizable (disjoint files across Runtime/ScriptedAlarms, Runtime/Historian, Driver.Galaxy, AdminUI/Hubs+bridges, Commons+ControlPlane). Dispatch their implementers concurrently. **6 and 7** wait on **4** (broadcaster health) and are ∥ each other (Alerts.razor vs ScriptLog.razor). **8** waits on 1/2/3/6/7; **9** on 8; **10** on 9.
- **One writer per file:** `Alerts.razor` is touched ONLY by Task 6 (its three changes are bundled for exactly this reason). `ScriptLog.razor` only by Task 7.
- **High-risk tasks (1, 2, 3):** serial spec→code review each. **Standard (4, 6):** parallel spec+code review. **Small (5, 7):** code review only.
- **Live-verify (8) is the integration checkpoint** — the single-alerts-row proof is the whole point of item 1.
- TDD where there's logic (1, 2, 3, 4, 5); razor (6, 7) proven by docker-dev `/run` (login disabled, agent drives).