Files
lmxopcua/docs/plans/2026-06-11-alarm-followups.md
T
2026-06-11 08:28:37 -04:00

292 lines
22 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Alarm Ack/Shelve Follow-ups Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.
**Goal:** Resolve the six follow-ups left by the T17T24 inbound-alarm-ack work: the redundancy double-emit, the Timed-shelve UI, two T21 minors, the docker-dev rig cleanup, and two pre-existing Layer-1 gaps (Galaxy reconnect, the live-pill).
**Architecture:** The double-emit is fixed by **primary-only emission at the source** — both `ScriptedAlarmHostActor` (the `alerts` publish) and the per-node `HistorianAdapterActor` (the sink write) subscribe to the existing `redundancy-state` DPS topic, cache the local `RedundancyRole`, and act only when `Primary` (mirroring `OpcUaPublishActor`). OPC UA node writes + inbound-command processing stay ungated for warm-standby. The other five items are localized fixes (one Galaxy-driver lifecycle bug, one broadcaster-health interface extension + 2 razors, two small Alerts.razor additions, one mechanical refactor, one operational cleanup).
**Tech Stack:** .NET 10, Akka.NET (cluster, DistributedPubSub, TestKit/xunit2), Blazor Server (InteractiveServer, NO bUnit), xUnit + Shouldly, OPC Foundation UA .NET Standard, Serilog.
**Design of record:** `docs/plans/2026-06-11-alarm-followups-design.md` (committed master `bcb9f45c`).
**Hard rules:** stage by explicit path (never `git add .`); never stage `sql_login.txt` / `src/Server/.../Host/pki/`; never echo the gateway API key into a **new** tracked file; no force-push, no `--no-verify`; **no Configuration entity / EF migration change**. Build on a feature branch off master.
---
### Task 0: Branch + baseline
**Classification:** trivial
**Estimated implement time:** ~1 min
**Parallelizable with:** none
**Files:** (none — git only)
**Steps:**
1. `git checkout master && git switch -c feat/alarm-followups` (off `bcb9f45c`).
2. Confirm clean tree + green baseline build: `dotnet build ZB.MOM.WW.OtOpcUa.slnx` → 0 errors.
3. No commit (branch only).
---
### Task 1: Redundancy emit-gate in `ScriptedAlarmHostActor` (A1)
**Classification:** high-risk
**Estimated implement time:** ~5 min
**Parallelizable with:** Task 2, Task 3, Task 4, Task 5
**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs`
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ScriptedAlarms/ScriptedAlarmHostActorTests.cs`
**Context:** `OnEngineEmission` (≈ line 247279) does two things per emission: `_publishActor.Tell(AlarmStateUpdate(...))` (line 261, OPC UA node write) and `_mediator.Tell(new Publish(AlertsTopic, evt))` (line 278, cluster-wide `alerts`). Both central nodes run this → the `alerts` publish doubles. Gate **only line 278** on `Primary`.
**Reference pattern to mirror** (`src/Server/ZB.MOM.WW.OtOpcUa.Runtime/OpcUa/OpcUaPublishActor.cs`): `RedundancyStateTopic = "redundancy-state"` (line 30); `Subscribe(RedundancyStateTopic, Self)` in PreStart (line 156); `Receive<RedundancyStateChanged>(HandleRedundancyStateChanged)` (line 147); `HandleRedundancyStateChanged` reads `msg.Nodes.FirstOrDefault(n => n.NodeId == localNode)?.Role` (lines 335351). `RedundancyRole` lives in `ZB.MOM.WW.OtOpcUa.Commons.Messages.Redundancy` (`Primary`/`Secondary`/`Detached`). The host already knows its local node id (it's used elsewhere; if not, derive it the same way `OpcUaPublishActor` resolves `_localNode`).
**Step 1: Failing TestKit tests** (extend the existing harness; Runtime.Tests = xunit v2 + Akka.TestKit.Xunit2):
- `Emission_is_published_to_alerts_by_default_before_any_redundancy_state` — a fresh host (no `RedundancyStateChanged` yet) DOES publish the `AlarmTransitionEvent` to `alerts` (subscribe a probe to `AlertsTopic`).
- `Secondary_node_suppresses_alerts_publish_but_still_writes_opcua` — after a `RedundancyStateChanged` marking the local node `Secondary`, an emission does NOT publish to `alerts` but DOES still `Tell` the `OpcUaPublishActor.AlarmStateUpdate` (probe the publish-actor seam the test already uses for T9/T19).
- `Primary_node_publishes_alerts` — after a `Primary` `RedundancyStateChanged`, the emission publishes to `alerts`.
- `Inbound_AlarmCommand_is_processed_regardless_of_role` — a `Secondary` host still drives the engine for an inbound `AlarmCommand` (the existing T19 ack test, asserted under a Secondary role).
**Step 2:** Run them — expect FAIL (no gate yet; Secondary still publishes).
**Step 3: Implement.** Add a cached `RedundancyRole? _localRole = null;` (null = unknown ⇒ treat as Primary/emit). In PreStart, `_mediator.Tell(new Subscribe(OpcUaPublishActor.RedundancyStateTopic, Self))` (reuse the const; if cross-project reference is awkward, introduce a shared `Commons` const for `"redundancy-state"` and point both at it — surface that deviation). Add `Receive<RedundancyStateChanged>` + a `SubscribeAck` no-op. In the handler, set `_localRole` from the snapshot for the local node. In `OnEngineEmission`, before line 278:
```csharp
// Warm-standby dedup: only the Primary (driver-role leader) publishes the cluster-wide
// transition + drives historization. Default-emit until told we are Secondary/Detached so
// single-node deploys + the boot window never drop transitions. The OPC UA node write
// above (warm address space) and inbound command processing stay ungated.
if (_localRole is RedundancyRole.Secondary or RedundancyRole.Detached)
return;
_mediator.Tell(new Publish(AlertsTopic, evt));
```
Leave line 261 (the `_publishActor.Tell`) and `OnAlarmCommand` untouched.
**Step 4:** Run `dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter ScriptedAlarmHostActor` → all green.
**Step 5: Commit** by explicit path (`ScriptedAlarmHostActor.cs` + the test).
> High-risk: concurrency (role cached on the actor thread, read in the emission path which is also marshalled onto the actor thread — confirm `OnEngineEmission` runs on the actor thread, it does via `Self.Tell(EngineEmission)`) + redundancy semantics. Do NOT gate the OPC UA write or commands.
---
### Task 2: Redundancy historize-gate in `HistorianAdapterActor` (A2)
**Classification:** high-risk
**Estimated implement time:** ~5 min
**Parallelizable with:** Task 1, Task 3, Task 4, Task 5
**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs`
- Modify (only if the actor needs the local node id / mediator wired): `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ServiceCollectionExtensions.cs` (≈ line 146, where it's spawned `historian-adapter`)
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/HistorianAdapterActorTests.cs` (create if absent; else extend the existing historian test)
**Context:** `HistorianAdapterActor` is **per-node** and consumes the `alerts` topic, so one publish historizes once per node → double DB writes. Gate the sink write on `Primary`, same pattern as Task 1.
**Step 1: Failing tests** (TestKit + a fake `IAlarmHistorianSink` that records writes):
- `Default_before_redundancy_state_historizes` — a fresh adapter writes to the sink (treat unknown as Primary).
- `Secondary_node_does_not_historize` — after a `Secondary` `RedundancyStateChanged`, an incoming alarm event is NOT written to the sink.
- `Primary_node_historizes` — after `Primary`, it IS written.
**Step 2:** Run — FAIL.
**Step 3: Implement** the same `_localRole` cache + `redundancy-state` subscription + `Receive<RedundancyStateChanged>` as Task 1, and guard the sink write: `if (_localRole is RedundancyRole.Secondary or RedundancyRole.Detached) return;` before `_sink.Write...`. The actor will need its local node id; resolve it the way `OpcUaPublishActor`/`ServiceCollectionExtensions` already does (pass it into `Props` if not already available — check how `OpcUaPublishActor` gets `_localNode` and mirror; thread it through `HistorianAdapterActor.Props` + the registration at `ServiceCollectionExtensions.cs:146`).
**Step 4:** `dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter Historian` → green. Also confirm the full Runtime.Tests still pass.
**Step 5: Commit** by explicit path.
> High-risk: data-historization correctness + concurrency. If wiring the node id into `Props` ripples beyond the two files, surface it before expanding.
---
### Task 3: Galaxy reconnect recreates a faulted session
**Classification:** high-risk
**Estimated implement time:** ~5 min
**Parallelizable with:** Task 1, Task 2, Task 4, Task 5
**Files:**
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/GalaxyMxSession.cs` (≈ line 5869)
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/GalaxyDriver.cs` (≈ line 289, `ReopenAsync`)
- Test: the Galaxy driver test project (find `*ReconnectSupervisor*` / `*Session*` tests under `tests/Drivers/...Driver.Galaxy.Tests/`; mirror their seam)
**Context:** `GalaxyMxSession.ConnectAsync` has `if (_session is not null) return;` — so when the gRPC session is Faulted/NotFound the field is still a non-null dead handle and `ConnectAsync` is a silent no-op. `GalaxyDriver.ReopenAsync` calls `ConnectAsync` expecting a reconnect → no-op → `ReconnectSupervisor.RecoveryLoopAsync` (`Runtime/ReconnectSupervisor.cs:158-186`) sees reopen "succeed", replay fails, loops forever.
**Step 1: Failing test.** Assert the reconnect path **creates a new session** after a faulted one: drive `GalaxyMxSession` (or a fake `MxGatewayClient`/session seam) to a connected state, simulate fault/dispose, call the reopen/recreate path, and assert `OpenSessionAsync` + `RegisterAsync` were invoked **again** (a second create), not a no-op. Mirror the existing session/reconnect test construction (read how they fake `MxGatewayClient`/`OpenSessionAsync`).
**Step 2:** Run — FAIL (current code no-ops, second create never happens).
**Step 3: Implement.** Add a recreate path to `GalaxyMxSession`, e.g.:
```csharp
/// <summary>Disposes the current (faulted/stale) session + owned client so the next
/// <see cref="ConnectAsync"/> rebuilds a fresh session instead of no-op'ing on the dead handle.</summary>
public async Task RecreateAsync(MxGatewayClientOptions clientOptions, CancellationToken ct)
{
ObjectDisposedException.ThrowIf(_disposed, this);
await DisposeSessionAsync().ConfigureAwait(false); // dispose+null _session, _serverHandle, _ownedClient
await ConnectAsync(clientOptions, ct).ConfigureAwait(false);
}
```
(Factor a private `DisposeSessionAsync()` that safely tears down the gRPC session/channel + nulls the fields — reuse whatever `DisposeAsync` already does for teardown, minus marking `_disposed`.) Then in `GalaxyDriver.ReopenAsync` call `RecreateAsync` instead of `ConnectAsync` so every reopen rebuilds. Keep the supervisor's backoff loop unchanged — it now actually recovers.
**Step 4:** Run the Galaxy reconnect/session tests → green. `dotnet build` the Galaxy driver clean.
**Step 5: Commit** by explicit path.
> High-risk: driver session lifecycle + gRPC channel teardown + concurrency with the supervisor loop. Confirm `DisposeSessionAsync` is safe to call repeatedly and from the supervisor thread.
---
### Task 4: Broadcaster connection-health signal + bridges
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** Task 1, Task 2, Task 3, Task 5
**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Hubs/IInProcessBroadcaster.cs`
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Hubs/InProcessBroadcaster.cs` (the impl)
- Modify: the two bridge actors that publish into the broadcaster — `ScriptLogSignalRBridge` + `AlertSignalRBridge` (find under `AdminUI/` or `Runtime/`; they `Tell`/`Subscribe` the DPS topics and push to the broadcaster)
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.Client.UI.Tests/` (or wherever `InProcessBroadcaster`/broadcaster tests live — find `InProcessBroadcasterTests`)
**Step 1: Failing unit test** for the new health signal: a new `InProcessBroadcaster<T>` reports `IsConnected == false` (or a sensible default); calling `SetConnected(true)` raises `ConnectionStateChanged` with `true` and flips `IsConnected`; `SetConnected(false)` flips back + raises.
**Step 2:** Run — FAIL (members don't exist).
**Step 3: Implement.** Extend the interface:
```csharp
bool IsConnected { get; }
event Action<bool>? ConnectionStateChanged;
void SetConnected(bool connected);
```
Implement in `InProcessBroadcaster<T>` (raise only on change). Then in each bridge actor, call `_broadcaster.SetConnected(true)` when its DPS `SubscribeAck` lands (subscription live) and `SetConnected(false)` on `PostStop`/failure. (Read how the bridge subscribes; mirror the `SubscribeAck` handling already there.) Default `IsConnected`: choose `true` once subscribed; before the first ack, `false` — the razor will reconcile on the `ConnectionStateChanged` event.
**Step 4:** Run the broadcaster tests → green; AdminUI builds clean (`TreatWarningsAsErrors`).
**Step 5: Commit** by explicit path.
---
### Task 5: `CorrelationId` wrapper for the alarm commands (3b)
**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** Task 1, Task 2, Task 3, Task 4
**Files:**
- Modify: `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Messages/Admin/AcknowledgeAlarmCommand.cs` + `ShelveAlarmCommand.cs` (+ their `*Result` records)
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/AdminOperations/AdminOperationsActor.cs` (the alarm handlers)
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Clients/AdminOperationsClient.cs`
- Test: `tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests/AdminOperationsActorTests.cs` (the 5 alarm tests)
**Context:** the alarm commands use a bare `Guid CorrelationId`; the project's other control-plane commands (`StartDeployment`/`RestartDriver`/`ReconnectDriver`) use the `CorrelationId` wrapper type. Make them consistent.
**Steps:**
1. Find the `CorrelationId` wrapper type + how `RestartDriver` uses it (`CorrelationId.NewId()`, the record field type, the reply echo).
2. Change `AcknowledgeAlarmCommand`/`ShelveAlarmCommand` (+ `*Result`) `CorrelationId` field type `Guid``CorrelationId`. Update `AdminOperationsClient.AcknowledgeAlarmAsync`/`ShelveAlarmAsync` to mint `CorrelationId.NewId()`. Update the `AdminOperationsActor` handlers' reply construction.
3. Update the 5 `AdminOperationsActorTests` to the wrapper type.
4. `dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests --filter AdminOperations` → green; build clean.
5. **Commit** by explicit path.
---
### Task 6: `Alerts.razor` — Timed-shelve picker + chip auto-clear + live-pill
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** Task 7
**Blocked by:** Task 4 (needs the broadcaster `ConnectionStateChanged`/`IsConnected`)
**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/Alerts.razor`
**Context:** all three changes are in one file → bundled so they don't contend. NO bUnit; proven by docker-dev live-verify (Task 8).
**Steps (no failing test — razor; verify by build + Task 8):**
1. **Timed-shelve picker:** add a small duration input to the row's shelve control (a minutes number box is simplest, default e.g. 5). Add a "Shelve (timed)" action that calls `IAdminOperationsClient.ShelveAlarmAsync(alarmId, user, ShelveKind.Timed, unshelveAtUtc: <computed now+duration>, comment: null, ct)`. Keep the existing OneShot Shelve + Unshelve buttons. Dispose the CTS (`using var cts = …`) like the existing handlers.
2. **Chip auto-clear:** after `ShowOpResult`, start a `~8s` auto-clear (mirror `DriverStatusPanel.razor`'s timer: a `Task.Delay(8000)` continuation that, if the chip is still the same one, clears `_opResult*` + `InvokeAsync(StateHasChanged)`). Read `DriverStatusPanel.razor` for the exact pattern (cancellation on a newer action).
3. **Live-pill:** replace the set-once `_connected = true` (≈ line 132) — in `OnInitializedAsync` read `Alarms.IsConnected` and subscribe `Alarms.ConnectionStateChanged += OnConnChanged;` (handler sets `_connected` + `InvokeAsync(StateHasChanged)`); unsubscribe in `Dispose`. (`Alarms` = the injected `IInProcessBroadcaster<AlarmTransitionEvent>`.)
4. `dotnet build src/Server/ZB.MOM.WW.OtOpcUa.AdminUI` → 0 warnings/errors.
5. **Commit** by explicit path.
---
### Task 7: `ScriptLog.razor` live-pill + reconnect-overlay check
**Classification:** small
**Estimated implement time:** ~3 min
**Parallelizable with:** Task 6
**Blocked by:** Task 4
**Files:**
- Modify: `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/ScriptLog.razor` (≈ line 122)
- Read/verify (no edit unless missing): the Blazor host shell (`App.razor` / `_Host.cshtml` / `MainLayout`) for the default reconnect overlay (`components-reconnect-modal` / `<div id="components-reconnect-modal">`)
**Steps:**
1. Same live-pill fix as Task 6 step 3, against `ScriptLogs` (the injected `IInProcessBroadcaster<ScriptLogEntry>`): read `IsConnected`, subscribe `ConnectionStateChanged`, update + `StateHasChanged`, unsubscribe in `Dispose`.
2. **Verify the dead-circuit overlay exists:** grep the host shell for the Blazor reconnect modal markup. If present, note it in the commit message (the dead-circuit case is covered by Blazor's built-in reconnection UI — the component can't self-update a dead circuit). If ABSENT, add the standard `components-reconnect-modal` markup. Do NOT fake liveness from a dead component.
3. `dotnet build src/Server/ZB.MOM.WW.OtOpcUa.AdminUI` → clean.
4. **Commit** by explicit path.
---
### Task 8: Live-verify on docker-dev
**Classification:** verification
**Estimated implement time:** ~ (manual)
**Parallelizable with:** none
**Blocked by:** Task 1, Task 2, Task 3, Task 6, Task 7 (Task 5 is internal; include it in the build)
**Steps:** rebuild docker-dev central nodes on the new image (`docker compose -f docker-dev/docker-compose.yml up -d --build central-1 central-2`), then on `/alerts` (login disabled — agent drives):
1. **Double-emit fixed:** confirm each t12-overheat transition now appears **ONCE** (not twice). Cross-check both nodes' logs: only the **driver-leader (Primary)** node logs the alerts publish; the Secondary suppresses it but still writes its OPC UA node (its address space stays current). (Find the Primary via the `redundancy-state`/ServiceLevel — or just observe single rows.)
2. **Timed-shelve:** drive the new timed-shelve control → a "Shelved" transition appears (operator = `multi-role-test`); after the duration the SDK auto-unshelves (or verify the `UnshelveAtUtc` was set).
3. **Chip auto-clear:** the Ack/Shelve result chip clears after ~8s.
4. **Live-pill:** the pill shows "live" while connected; (optional) kill the bridge / restart a node and confirm the pill flips / the Blazor reconnect overlay shows.
5. **Galaxy reconnect** (best-effort, may need the gateway): if feasible, restart the Galaxy gateway and confirm the driver recreates the session + recovers without a container restart (watch the logs for a fresh `OpenSessionAsync`). If not feasible live, rely on the Task 3 unit test + note it.
Defects → new fix tasks. **Agent drives; no sign-in needed (docker-dev login disabled).**
---
### Task 9: docker-dev rig cleanup
**Classification:** trivial (operational)
**Estimated implement time:** ~ (manual)
**Parallelizable with:** none
**Blocked by:** Task 8 (the rig artifacts are the live-verify vehicle)
**Steps:** remove the seed artifacts left for verification, then redeploy:
1. In the docker-dev AdminUI (or via DB), delete the `t12-overheat` scripted alarm, the `SC-ba675b168a85` predicate script, and the `layer0-logcheck` vtag + script. Revert filler-02's `cycle-time-s` script to `return ctx.GetTag("TestMachine_002.TestDuration").Value;`.
2. Redeploy: `POST http://localhost:9200/api/deployments` with `X-Api-Key: docker-dev-deploy-key``202`.
3. Confirm `/alerts` + `/scripted-alarms` + `/scripts` no longer list the removed artifacts. No code commit (DB/deploy only). Note completion in Task 10's commit message.
> Operational only — no source change. If the user wants the rig kept, skip this task and record it as deferred.
---
### Task 10: Docs + finish branch
**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** none
**Blocked by:** Task 9
**Files:** `docs/Redundancy.md` (alarm-emission + historization are now Primary-gated), `docs/ScriptedAlarms.md` / `docs/AlarmTracking.md` (the redundancy dedup note), `docs/Client.CLI.md` only if affected, and a one-line note wherever the live-pill/Galaxy-reconnect behaviour is documented. Keep terse.
**Steps:**
1. Update the docs above to reflect: Primary-only alarm emission + historization under redundancy; the Galaxy reconnect now recreates a faulted session; the live-pill reflects feed health.
2. Run the FULL suite: `dotnet test ZB.MOM.WW.OtOpcUa.slnx` — confirm all affected unit suites green; the only failures should be the known pre-existing macOS/integration ones (OpcUaServer.IntegrationTests PKI, Host.IntegrationTests deploy-Rejected, AbLegacy/AbCip fixtures). Verify the **Galaxy** suite is green (Task 3).
3. **Commit** docs by explicit path.
4. Run **superpowers-extended-cc:finishing-a-development-branch** (verify tests → present the 4 options → merge).
---
## Execution notes
- **Parallel dispatch:** Tasks **1, 2, 3, 4, 5** are mutually parallelizable (disjoint files across Runtime/ScriptedAlarms, Runtime/Historian, Driver.Galaxy, AdminUI/Hubs+bridges, Commons+ControlPlane). Dispatch their implementers concurrently. **6 and 7** wait on **4** (broadcaster health) and are ∥ each other (Alerts.razor vs ScriptLog.razor). **8** waits on 1/2/3/6/7; **9** on 8; **10** on 9.
- **One writer per file:** `Alerts.razor` is touched ONLY by Task 6 (its three changes are bundled for exactly this reason). `ScriptLog.razor` only by Task 7.
- **High-risk tasks (1, 2, 3):** serial spec→code review each. **Standard (4, 6):** parallel spec+code review. **Small (5, 7):** code review only.
- **Live-verify (8) is the integration checkpoint** — the single-alerts-row proof is the whole point of item 1.
- TDD where there's logic (1, 2, 3, 4, 5); razor (6, 7) proven by docker-dev `/run` (login disabled, agent drives).