Alarm Ack/Shelve Follow-ups Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.
Goal: Resolve the six follow-ups left by the T17–T24 inbound-alarm-ack work: the redundancy double-emit, the Timed-shelve UI, two T21 minors, the docker-dev rig cleanup, and two pre-existing Layer-1 gaps (Galaxy reconnect, the live-pill).
Architecture: The double-emit is fixed by primary-only emission at the source — both ScriptedAlarmHostActor (the alerts publish) and the per-node HistorianAdapterActor (the sink write) subscribe to the existing redundancy-state DPS topic, cache the local RedundancyRole, and act only when Primary (mirroring OpcUaPublishActor). OPC UA node writes + inbound-command processing stay ungated for warm-standby. The other five items are localized fixes (one Galaxy-driver lifecycle bug, one broadcaster-health interface extension + 2 razors, two small Alerts.razor additions, one mechanical refactor, one operational cleanup).
Tech Stack: .NET 10, Akka.NET (cluster, DistributedPubSub, TestKit/xunit2), Blazor Server (InteractiveServer, NO bUnit), xUnit + Shouldly, OPC Foundation UA .NET Standard, Serilog.
Design of record: docs/plans/2026-06-11-alarm-followups-design.md (committed master bcb9f45c).
Hard rules: stage by explicit path (never git add .); never stage sql_login.txt / src/Server/.../Host/pki/; never echo the gateway API key into a new tracked file; no force-push, no --no-verify; no Configuration entity / EF migration change. Build on a feature branch off master.
Task 0: Branch + baseline
Classification: trivial Estimated implement time: ~1 min Parallelizable with: none
Files: (none — git only)
Steps:
- git checkout master && git switch -c feat/alarm-followups (off bcb9f45c).
- Confirm clean tree + green baseline build: dotnet build ZB.MOM.WW.OtOpcUa.slnx → 0 errors.
- No commit (branch only).
Task 1: Redundancy emit-gate in ScriptedAlarmHostActor (A1)
Classification: high-risk Estimated implement time: ~5 min Parallelizable with: Task 2, Task 3, Task 4, Task 5
Files:
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs
- Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ScriptedAlarms/ScriptedAlarmHostActorTests.cs
Context: OnEngineEmission (≈ line 247–279) does two things per emission: _publishActor.Tell(AlarmStateUpdate(...)) (line 261, OPC UA node write) and _mediator.Tell(new Publish(AlertsTopic, evt)) (line 278, cluster-wide alerts). Both central nodes run this → the alerts publish doubles. Gate only line 278 on Primary.
Reference pattern to mirror (src/Server/ZB.MOM.WW.OtOpcUa.Runtime/OpcUa/OpcUaPublishActor.cs): RedundancyStateTopic = "redundancy-state" (line 30); Subscribe(RedundancyStateTopic, Self) in PreStart (line 156); Receive<RedundancyStateChanged>(HandleRedundancyStateChanged) (line 147); HandleRedundancyStateChanged reads msg.Nodes.FirstOrDefault(n => n.NodeId == localNode)?.Role (lines 335–351). RedundancyRole lives in ZB.MOM.WW.OtOpcUa.Commons.Messages.Redundancy (Primary/Secondary/Detached). The host already knows its local node id (it's used elsewhere; if not, derive it the same way OpcUaPublishActor resolves _localNode).
Step 1: Failing TestKit tests (extend the existing harness; Runtime.Tests = xunit v2 + Akka.TestKit.Xunit2):
- Emission_is_published_to_alerts_by_default_before_any_redundancy_state: a fresh host (no RedundancyStateChanged yet) DOES publish the AlarmTransitionEvent to alerts (subscribe a probe to AlertsTopic).
- Secondary_node_suppresses_alerts_publish_but_still_writes_opcua: after a RedundancyStateChanged marking the local node Secondary, an emission does NOT publish to alerts but DOES still Tell the OpcUaPublishActor.AlarmStateUpdate (probe the publish-actor seam the test already uses for T9/T19).
- Primary_node_publishes_alerts: after a Primary RedundancyStateChanged, the emission publishes to alerts.
- Inbound_AlarmCommand_is_processed_regardless_of_role: a Secondary host still drives the engine for an inbound AlarmCommand (the existing T19 ack test, asserted under a Secondary role).
Step 2: Run them — expect FAIL (no gate yet; Secondary still publishes).
Step 3: Implement. Add a cached RedundancyRole? _localRole = null; (null = unknown ⇒ treat as Primary/emit). In PreStart, _mediator.Tell(new Subscribe(OpcUaPublishActor.RedundancyStateTopic, Self)) (reuse the const; if cross-project reference is awkward, introduce a shared Commons const for "redundancy-state" and point both at it — surface that deviation). Add Receive<RedundancyStateChanged> + a SubscribeAck no-op. In the handler, set _localRole from the snapshot for the local node. In OnEngineEmission, before line 278:
// Warm-standby dedup: only the Primary (driver-role leader) publishes the cluster-wide
// transition + drives historization. Default-emit until told we are Secondary/Detached so
// single-node deploys + the boot window never drop transitions. The OPC UA node write
// above (warm address space) and inbound command processing stay ungated.
if (_localRole is RedundancyRole.Secondary or RedundancyRole.Detached)
return;
_mediator.Tell(new Publish(AlertsTopic, evt));
Leave line 261 (the _publishActor.Tell) and OnAlarmCommand untouched.
Step 4: Run dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter ScriptedAlarmHostActor → all green.
Step 5: Commit by explicit path (ScriptedAlarmHostActor.cs + the test).
High-risk: concurrency (the role is cached on the actor thread and read in the emission path, which is also marshalled onto the actor thread; confirm OnEngineEmission runs on the actor thread, which it does via Self.Tell(EngineEmission)) + redundancy semantics. Do NOT gate the OPC UA write or commands.
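The role cache and the default-emit rule from Step 3 can be sketched in isolation. This is a minimal sketch with no Akka dependency, under stated assumptions: RedundancyRole, NodeRedundancyState and RedundancyStateChanged are local stand-ins whose shapes are guessed from the description above, and RedundancyEmitGate is a hypothetical helper name, not a type in the codebase. A local node missing from the snapshot is treated as unknown here, which is an assumption the plan does not settle.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;

// Local stand-ins for the Commons types; shapes are assumed for illustration.
public enum RedundancyRole { Primary, Secondary, Detached }
public sealed record NodeRedundancyState(string NodeId, RedundancyRole Role);
public sealed record RedundancyStateChanged(IReadOnlyList<NodeRedundancyState> Nodes);

// Hypothetical helper: caches the local role and answers "may this node emit?".
public sealed class RedundancyEmitGate
{
    private readonly string _localNode;
    private RedundancyRole? _localRole; // null = no snapshot seen yet (unknown)

    public RedundancyEmitGate(string localNode) => _localNode = localNode;

    // Intended to be called from the RedundancyStateChanged handler, i.e. on
    // the actor thread only; the class itself does no locking.
    public void Update(RedundancyStateChanged msg) =>
        _localRole = msg.Nodes.FirstOrDefault(n => n.NodeId == _localNode)?.Role;

    // Unknown role defaults to emit, so single-node deploys and the boot
    // window never drop transitions; only Secondary/Detached suppresses.
    public bool ShouldEmit =>
        _localRole is not (RedundancyRole.Secondary or RedundancyRole.Detached);
}
```

In the actor, the alerts publish would be wrapped in a ShouldEmit check while the OPC UA write and command handling ignore the gate entirely. The same gate shape applies to the historian sink write in Task 2.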
Task 2: Redundancy historize-gate in HistorianAdapterActor (A2)
Classification: high-risk Estimated implement time: ~5 min Parallelizable with: Task 1, Task 3, Task 4, Task 5
Files:
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs
- Modify (only if the actor needs the local node id / mediator wired): src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ServiceCollectionExtensions.cs (≈ line 146, where it's spawned as historian-adapter)
- Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/HistorianAdapterActorTests.cs (create if absent; else extend the existing historian test)
Context: HistorianAdapterActor is per-node and consumes the alerts topic, so one publish historizes once per node → double DB writes. Gate the sink write on Primary, same pattern as Task 1.
Step 1: Failing tests (TestKit + a fake IAlarmHistorianSink that records writes):
- Default_before_redundancy_state_historizes: a fresh adapter writes to the sink (treat unknown as Primary).
- Secondary_node_does_not_historize: after a Secondary RedundancyStateChanged, an incoming alarm event is NOT written to the sink.
- Primary_node_historizes: after Primary, it IS written.
Step 2: Run — FAIL.
Step 3: Implement the same _localRole cache + redundancy-state subscription + Receive<RedundancyStateChanged> as Task 1, and guard the sink write: if (_localRole is RedundancyRole.Secondary or RedundancyRole.Detached) return; before _sink.Write.... The actor will need its local node id; resolve it the way OpcUaPublishActor/ServiceCollectionExtensions already does (pass it into Props if not already available — check how OpcUaPublishActor gets _localNode and mirror; thread it through HistorianAdapterActor.Props + the registration at ServiceCollectionExtensions.cs:146).
Step 4: dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter Historian → green. Also confirm the full Runtime.Tests still pass.
Step 5: Commit by explicit path.
High-risk: data-historization correctness + concurrency. If wiring the node id into Props ripples beyond the two files, surface it before expanding.
Task 3: Galaxy reconnect recreates a faulted session
Classification: high-risk Estimated implement time: ~5 min Parallelizable with: Task 1, Task 2, Task 4, Task 5
Files:
- Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/GalaxyMxSession.cs (≈ line 58–69)
- Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/GalaxyDriver.cs (≈ line 289, ReopenAsync)
- Test: the Galaxy driver test project (find *ReconnectSupervisor* / *Session* tests under tests/Drivers/...Driver.Galaxy.Tests/; mirror their seam)
Context: GalaxyMxSession.ConnectAsync has if (_session is not null) return; — so when the gRPC session is Faulted/NotFound the field is still a non-null dead handle and ConnectAsync is a silent no-op. GalaxyDriver.ReopenAsync calls ConnectAsync expecting a reconnect → no-op → ReconnectSupervisor.RecoveryLoopAsync (Runtime/ReconnectSupervisor.cs:158-186) sees reopen "succeed", replay fails, loops forever.
Step 1: Failing test. Assert the reconnect path creates a new session after a faulted one: drive GalaxyMxSession (or a fake MxGatewayClient/session seam) to a connected state, simulate fault/dispose, call the reopen/recreate path, and assert OpenSessionAsync + RegisterAsync were invoked again (a second create), not a no-op. Mirror the existing session/reconnect test construction (read how they fake MxGatewayClient/OpenSessionAsync).
Step 2: Run — FAIL (current code no-ops, second create never happens).
Step 3: Implement. Add a recreate path to GalaxyMxSession, e.g.:
/// <summary>Disposes the current (faulted/stale) session + owned client so the next
/// <see cref="ConnectAsync"/> rebuilds a fresh session instead of no-op'ing on the dead handle.</summary>
public async Task RecreateAsync(MxGatewayClientOptions clientOptions, CancellationToken ct)
{
ObjectDisposedException.ThrowIf(_disposed, this);
await DisposeSessionAsync().ConfigureAwait(false); // dispose+null _session, _serverHandle, _ownedClient
await ConnectAsync(clientOptions, ct).ConfigureAwait(false);
}
(Factor a private DisposeSessionAsync() that safely tears down the gRPC session/channel + nulls the fields — reuse whatever DisposeAsync already does for teardown, minus marking _disposed.) Then in GalaxyDriver.ReopenAsync call RecreateAsync instead of ConnectAsync so every reopen rebuilds. Keep the supervisor's backoff loop unchanged — it now actually recovers.
Step 4: Run the Galaxy reconnect/session tests → green. dotnet build the Galaxy driver clean.
Step 5: Commit by explicit path.
High-risk: driver session lifecycle + gRPC channel teardown + concurrency with the supervisor loop. Confirm DisposeSessionAsync is safe to call repeatedly and from the supervisor thread.
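The connect/recreate split can be shown without any gRPC types. The following is a rough sketch only: ISessionHandle and RecreatableSession are hypothetical names, the open delegate stands in for the OpenSessionAsync + RegisterAsync pair, and no synchronization with the supervisor loop is attempted (that remains the open risk called out above).

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the gateway session handle.
public interface ISessionHandle : IAsyncDisposable { }

// Sketch of the connect/recreate split; not the real GalaxyMxSession.
public sealed class RecreatableSession : IAsyncDisposable
{
    private readonly Func<CancellationToken, Task<ISessionHandle>> _open;
    private ISessionHandle? _session;
    private bool _disposed;

    public RecreatableSession(Func<CancellationToken, Task<ISessionHandle>> open) => _open = open;

    // Idempotent connect: a non-null handle is assumed live, which is why a
    // faulted handle must be torn down before connecting again.
    public async Task ConnectAsync(CancellationToken ct)
    {
        ObjectDisposedException.ThrowIf(_disposed, this);
        if (_session is not null) return;
        _session = await _open(ct).ConfigureAwait(false);
    }

    // Reopen path: drop the (possibly dead) handle, then connect from scratch.
    public async Task RecreateAsync(CancellationToken ct)
    {
        ObjectDisposedException.ThrowIf(_disposed, this);
        await DisposeSessionAsync().ConfigureAwait(false);
        await ConnectAsync(ct).ConfigureAwait(false);
    }

    // Repeat-safe: the field is nulled before the old handle is disposed.
    private async Task DisposeSessionAsync()
    {
        var old = _session;
        _session = null;
        if (old is null) return;
        try { await old.DisposeAsync().ConfigureAwait(false); }
        catch (Exception) { /* a faulted handle may throw on teardown */ }
    }

    public async ValueTask DisposeAsync()
    {
        if (_disposed) return;
        _disposed = true;
        await DisposeSessionAsync().ConfigureAwait(false);
    }
}
```

A test against this shape counts invocations of the open delegate: one after ConnectAsync, still one after a second ConnectAsync, and two after RecreateAsync, which is the "second create" assertion Step 1 asks for.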
Task 4: Broadcaster connection-health signal + bridges
Classification: standard Estimated implement time: ~5 min Parallelizable with: Task 1, Task 2, Task 3, Task 5
Files:
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Hubs/IInProcessBroadcaster.cs
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Hubs/InProcessBroadcaster.cs (the impl)
- Modify: the two bridge actors that publish into the broadcaster, ScriptLogSignalRBridge + AlertSignalRBridge (find under AdminUI/ or Runtime/; they Tell/Subscribe the DPS topics and push to the broadcaster)
- Test: tests/Server/ZB.MOM.WW.OtOpcUa.Client.UI.Tests/ (or wherever InProcessBroadcaster/broadcaster tests live; find InProcessBroadcasterTests)
Step 1: Failing unit test for the new health signal: a new InProcessBroadcaster<T> reports IsConnected == false (or a sensible default); calling SetConnected(true) raises ConnectionStateChanged with true and flips IsConnected; SetConnected(false) flips back + raises.
Step 2: Run — FAIL (members don't exist).
Step 3: Implement. Extend the interface:
bool IsConnected { get; }
event Action<bool>? ConnectionStateChanged;
void SetConnected(bool connected);
Implement in InProcessBroadcaster<T> (raise only on change). Then in each bridge actor, call _broadcaster.SetConnected(true) when its DPS SubscribeAck lands (subscription live) and SetConnected(false) on PostStop/failure. (Read how the bridge subscribes; mirror the SubscribeAck handling already there.) Default IsConnected: choose true once subscribed; before the first ack, false — the razor will reconcile on the ConnectionStateChanged event.
Step 4: Run the broadcaster tests → green; AdminUI builds clean (TreatWarningsAsErrors).
Step 5: Commit by explicit path.
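The "raise only on change" behaviour from Step 3 is the part most likely to be gotten subtly wrong, so here is one possible shape for it. This is a sketch under assumptions: IConnectionHealth and ConnectionHealth are hypothetical names covering only the three new members, not the real IInProcessBroadcaster<T>, and the lock is an assumption that SetConnected may be called from a bridge actor thread while a razor component reads IsConnected.

```csharp
#nullable enable
using System;

// Hypothetical minimal slice: only the health members, not the real interface.
public interface IConnectionHealth
{
    bool IsConnected { get; }
    event Action<bool>? ConnectionStateChanged;
    void SetConnected(bool connected);
}

public sealed class ConnectionHealth : IConnectionHealth
{
    private readonly object _gate = new();
    private bool _connected; // false until the first SubscribeAck lands

    public bool IsConnected
    {
        get { lock (_gate) return _connected; }
    }

    public event Action<bool>? ConnectionStateChanged;

    public void SetConnected(bool connected)
    {
        lock (_gate)
        {
            if (_connected == connected) return; // raise only on change
            _connected = connected;
        }
        // Invoked outside the lock so a slow subscriber cannot block writers.
        ConnectionStateChanged?.Invoke(connected);
    }
}
```

A razor subscriber would read IsConnected once in OnInitializedAsync, then update its own flag from the event and unsubscribe in Dispose.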
Task 5: CorrelationId wrapper for the alarm commands (3b)
Classification: small Estimated implement time: ~4 min Parallelizable with: Task 1, Task 2, Task 3, Task 4
Files:
- Modify: src/Core/ZB.MOM.WW.OtOpcUa.Commons/Messages/Admin/AcknowledgeAlarmCommand.cs + ShelveAlarmCommand.cs (+ their *Result records)
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/AdminOperations/AdminOperationsActor.cs (the alarm handlers)
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Clients/AdminOperationsClient.cs
- Test: tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests/AdminOperationsActorTests.cs (the 5 alarm tests)
Context: the alarm commands use a bare Guid CorrelationId; the project's other control-plane commands (StartDeployment/RestartDriver/ReconnectDriver) use the CorrelationId wrapper type. Make them consistent.
Steps:
- Find the CorrelationId wrapper type + how RestartDriver uses it (CorrelationId.NewId(), the record field type, the reply echo).
- Change AcknowledgeAlarmCommand/ShelveAlarmCommand (+ *Result) CorrelationId field type Guid → CorrelationId. Update AdminOperationsClient.AcknowledgeAlarmAsync/ShelveAlarmAsync to mint CorrelationId.NewId(). Update the AdminOperationsActor handlers' reply construction.
- Update the 5 AdminOperationsActorTests to the wrapper type.
- dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests --filter AdminOperations → green; build clean.
- Commit by explicit path.
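For orientation, the Guid → wrapper change might look roughly like the following. Everything here is an assumption for illustration: the real CorrelationId type already exists in the project and may be shaped differently, and the command/result field sets shown (AlarmId, User, Success, Error) are hypothetical. Only CorrelationId.NewId() and the reply echo come from the steps above.

```csharp
#nullable enable
using System;

// Assumed shape of the wrapper; the real CorrelationId type may differ.
public readonly record struct CorrelationId(Guid Value)
{
    public static CorrelationId NewId() => new(Guid.NewGuid());
}

// Hypothetical field sets, for illustration only.
public sealed record AcknowledgeAlarmCommand(CorrelationId CorrelationId, string AlarmId, string User);
public sealed record AcknowledgeAlarmResult(CorrelationId CorrelationId, bool Success, string? Error);

public static class AlarmHandlerSketch
{
    // The reply echoes the request's id so the caller can match it up.
    public static AcknowledgeAlarmResult Handle(AcknowledgeAlarmCommand cmd) =>
        string.IsNullOrWhiteSpace(cmd.AlarmId)
            ? new AcknowledgeAlarmResult(cmd.CorrelationId, false, "AlarmId is required")
            : new AcknowledgeAlarmResult(cmd.CorrelationId, true, null);
}
```

The point of the wrapper is type safety: a bare Guid can be confused with any other Guid-typed field at a call site, whereas a distinct record struct cannot, and value equality still makes the echo assertion in the tests a plain equality check.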
Task 6: Alerts.razor — Timed-shelve picker + chip auto-clear + live-pill
Classification: standard
Estimated implement time: ~5 min
Parallelizable with: Task 7
Blocked by: Task 4 (needs the broadcaster ConnectionStateChanged/IsConnected)
Files:
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/Alerts.razor
Context: all three changes are in one file → bundled so they don't contend. NO bUnit; proven by docker-dev live-verify (Task 8).
Steps (no failing test — razor; verify by build + Task 8):
- Timed-shelve picker: add a small duration input to the row's shelve control (a minutes number box is simplest, default e.g. 5). Add a "Shelve (timed)" action that calls IAdminOperationsClient.ShelveAlarmAsync(alarmId, user, ShelveKind.Timed, unshelveAtUtc: <computed now+duration>, comment: null, ct). Keep the existing OneShot Shelve + Unshelve buttons. Dispose the CTS (using var cts = …) like the existing handlers.
- Chip auto-clear: after ShowOpResult, start a ~8s auto-clear (mirror DriverStatusPanel.razor's timer: a Task.Delay(8000) continuation that, if the chip is still the same one, clears _opResult* + InvokeAsync(StateHasChanged)). Read DriverStatusPanel.razor for the exact pattern (cancellation on a newer action).
- Live-pill: replace the set-once _connected = true (≈ line 132): in OnInitializedAsync read Alarms.IsConnected and subscribe Alarms.ConnectionStateChanged += OnConnChanged; (handler sets _connected + InvokeAsync(StateHasChanged)); unsubscribe in Dispose. (Alarms = the injected IInProcessBroadcaster<AlarmTransitionEvent>.)
- dotnet build src/Server/ZB.MOM.WW.OtOpcUa.AdminUI → 0 warnings/errors.
- Commit by explicit path.
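The chip auto-clear step hinges on "clear only if the chip is still the same one". One way to express that outside a component is sketched below. This is not the DriverStatusPanel.razor pattern itself, which should still be read for the exact form; AutoClearChip and its members are hypothetical, and the onCleared callback stands in for InvokeAsync(StateHasChanged).

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper; in the razor this logic would live in the component.
public sealed class AutoClearChip
{
    private CancellationTokenSource? _cts;

    public string? Message { get; private set; }

    // Shows a message and schedules its removal. A newer Show cancels the
    // older timer, so a stale timer can never clear a fresh chip.
    public Task Show(string message, TimeSpan lifetime, Func<Task> onCleared)
    {
        _cts?.Cancel();
        _cts?.Dispose();
        var cts = _cts = new CancellationTokenSource();
        Message = message;
        return ClearLaterAsync(cts, lifetime, onCleared);
    }

    private async Task ClearLaterAsync(CancellationTokenSource cts, TimeSpan lifetime, Func<Task> onCleared)
    {
        try { await Task.Delay(lifetime, cts.Token); }
        catch (OperationCanceledException) { return; } // superseded by a newer chip

        if (!ReferenceEquals(_cts, cts)) return; // still the same chip?
        Message = null;
        await onCleared();
    }
}
```

In the component, the lifetime would be about 8 seconds and the returned task would typically be discarded. Because Task.Delay resumes off the renderer's context, the state mutation itself may also need to move inside InvokeAsync; that detail is left to the existing pattern.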
Task 7: ScriptLog.razor live-pill + reconnect-overlay check
Classification: small Estimated implement time: ~3 min Parallelizable with: Task 6 Blocked by: Task 4
Files:
- Modify: src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/ScriptLog.razor (≈ line 122)
- Read/verify (no edit unless missing): the Blazor host shell (App.razor / _Host.cshtml / MainLayout) for the default reconnect overlay (components-reconnect-modal / <div id="components-reconnect-modal">)
Steps:
- Same live-pill fix as Task 6 step 3, against ScriptLogs (the injected IInProcessBroadcaster<ScriptLogEntry>): read IsConnected, subscribe ConnectionStateChanged, update + StateHasChanged, unsubscribe in Dispose.
- Verify the dead-circuit overlay exists: grep the host shell for the Blazor reconnect modal markup. If present, note it in the commit message (the dead-circuit case is covered by Blazor's built-in reconnection UI; the component can't self-update a dead circuit). If ABSENT, add the standard components-reconnect-modal markup. Do NOT fake liveness from a dead component.
- dotnet build src/Server/ZB.MOM.WW.OtOpcUa.AdminUI → clean.
- Commit by explicit path.
Task 8: Live-verify on docker-dev
Classification: verification Estimated implement time: ~ (manual) Parallelizable with: none Blocked by: Task 1, Task 2, Task 3, Task 6, Task 7 (Task 5 is internal; include it in the build)
Steps: rebuild docker-dev central nodes on the new image (docker compose -f docker-dev/docker-compose.yml up -d --build central-1 central-2), then on /alerts (login disabled — agent drives):
- Double-emit fixed: confirm each t12-overheat transition now appears ONCE (not twice). Cross-check both nodes' logs: only the driver-leader (Primary) node logs the alerts publish; the Secondary suppresses it but still writes its OPC UA node (its address space stays current). (Find the Primary via the redundancy-state/ServiceLevel, or just observe single rows.)
- Timed-shelve: drive the new timed-shelve control → a "Shelved" transition appears (operator = multi-role-test); after the duration the SDK auto-unshelves (or verify the UnshelveAtUtc was set).
- Chip auto-clear: the Ack/Shelve result chip clears after ~8s.
- Live-pill: the pill shows "live" while connected; (optional) kill the bridge / restart a node and confirm the pill flips / the Blazor reconnect overlay shows.
- Galaxy reconnect (best-effort, may need the gateway): if feasible, restart the Galaxy gateway and confirm the driver recreates the session + recovers without a container restart (watch the logs for a fresh OpenSessionAsync). If not feasible live, rely on the Task 3 unit test + note it.
Defects → new fix tasks. Agent drives; no sign-in needed (docker-dev login disabled).
Task 9: docker-dev rig cleanup
Classification: trivial (operational) Estimated implement time: ~ (manual) Parallelizable with: none Blocked by: Task 8 (the rig artifacts are the live-verify vehicle)
Steps: remove the seed artifacts left for verification, then redeploy:
- In the docker-dev AdminUI (or via DB), delete the t12-overheat scripted alarm, the SC-ba675b168a85 predicate script, and the layer0-logcheck vtag + script. Revert filler-02's cycle-time-s script to return ctx.GetTag("TestMachine_002.TestDuration").Value;.
- Redeploy: POST http://localhost:9200/api/deployments with X-Api-Key: docker-dev-deploy-key → 202.
- Confirm /alerts + /scripted-alarms + /scripts no longer list the removed artifacts. No code commit (DB/deploy only). Note completion in Task 10's commit message.
Operational only — no source change. If the user wants the rig kept, skip this task and record it as deferred.
Task 10: Docs + finish branch
Classification: small Estimated implement time: ~4 min Parallelizable with: none Blocked by: Task 9
Files: docs/Redundancy.md (alarm-emission + historization are now Primary-gated), docs/ScriptedAlarms.md / docs/AlarmTracking.md (the redundancy dedup note), docs/Client.CLI.md only if affected, and a one-line note wherever the live-pill/Galaxy-reconnect behaviour is documented. Keep terse.
Steps:
- Update the docs above to reflect: Primary-only alarm emission + historization under redundancy; the Galaxy reconnect now recreates a faulted session; the live-pill reflects feed health.
- Run the FULL suite: dotnet test ZB.MOM.WW.OtOpcUa.slnx and confirm all affected unit suites are green; the only failures should be the known pre-existing macOS/integration ones (OpcUaServer.IntegrationTests PKI, Host.IntegrationTests deploy-Rejected, AbLegacy/AbCip fixtures). Verify the Galaxy suite is green (Task 3).
- Commit docs by explicit path.
- Run superpowers-extended-cc:finishing-a-development-branch (verify tests → present the 4 options → merge).
Execution notes
- Parallel dispatch: Tasks 1, 2, 3, 4, 5 are mutually parallelizable (disjoint files across Runtime/ScriptedAlarms, Runtime/Historian, Driver.Galaxy, AdminUI/Hubs+bridges, Commons+ControlPlane). Dispatch their implementers concurrently. 6 and 7 wait on 4 (broadcaster health) and are ∥ each other (Alerts.razor vs ScriptLog.razor). 8 waits on 1/2/3/6/7; 9 on 8; 10 on 9.
- One writer per file: Alerts.razor is touched ONLY by Task 6 (its three changes are bundled for exactly this reason). ScriptLog.razor only by Task 7.
- High-risk tasks (1, 2, 3): serial spec→code review each. Standard (4, 6): parallel spec+code review. Small (5, 7): code review only.
- Live-verify (8) is the integration checkpoint — the single-alerts-row proof is the whole point of item 1.
- TDD where there's logic (1, 2, 3, 4, 5); razor (6, 7) proven by docker-dev /run (login disabled, agent drives).