Joseph Doherty 535787bd85 docs(plan): alarm follow-ups implementation plan (T0-T10)

2026-06-11 08:28:37 -04:00

Alarm Ack/Shelve Follow-ups Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (or subagent-driven-development) to implement this plan task-by-task.

Goal: Resolve the six follow-ups left by the T17–T24 inbound-alarm-ack work: the redundancy double-emit, the Timed-shelve UI, two T21 minors, the docker-dev rig cleanup, and two pre-existing Layer-1 gaps (Galaxy reconnect, the live-pill).

Architecture: The double-emit is fixed by primary-only emission at the source — both ScriptedAlarmHostActor (the alerts publish) and the per-node HistorianAdapterActor (the sink write) subscribe to the existing redundancy-state DPS topic, cache the local RedundancyRole, and act only when Primary (mirroring OpcUaPublishActor). OPC UA node writes + inbound-command processing stay ungated for warm-standby. The other five items are localized fixes (one Galaxy-driver lifecycle bug, one broadcaster-health interface extension + 2 razors, two small Alerts.razor additions, one mechanical refactor, one operational cleanup).

Tech Stack: .NET 10, Akka.NET (cluster, DistributedPubSub, TestKit/xunit2), Blazor Server (InteractiveServer, NO bUnit), xUnit + Shouldly, OPC Foundation UA .NET Standard, Serilog.

Design of record: docs/plans/2026-06-11-alarm-followups-design.md (committed master bcb9f45c).

Hard rules: stage by explicit path (never git add .); never stage sql_login.txt / src/Server/.../Host/pki/; never echo the gateway API key into a new tracked file; no force-push, no --no-verify; no Configuration entity / EF migration change. Build on a feature branch off master.

Task 0: Branch + baseline

Classification: trivial | Estimated implement time: ~1 min | Parallelizable with: none

Files: (none — git only)

Steps:

1. git checkout master && git switch -c feat/alarm-followups (off bcb9f45c).
2. Confirm clean tree + green baseline build: dotnet build ZB.MOM.WW.OtOpcUa.slnx → 0 errors.
3. No commit (branch only).

Task 1: Redundancy emit-gate in `ScriptedAlarmHostActor` (A1)

Classification: high-risk | Estimated implement time: ~5 min | Parallelizable with: Task 2, Task 3, Task 4, Task 5

Files:

Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ScriptedAlarms/ScriptedAlarmHostActor.cs
Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/ScriptedAlarms/ScriptedAlarmHostActorTests.cs

Context: OnEngineEmission (≈ line 247–279) does two things per emission: _publishActor.Tell(AlarmStateUpdate(...)) (line 261, OPC UA node write) and _mediator.Tell(new Publish(AlertsTopic, evt)) (line 278, cluster-wide alerts). Both central nodes run this → the alerts publish doubles. Gate only line 278 on Primary.

Reference pattern to mirror (src/Server/ZB.MOM.WW.OtOpcUa.Runtime/OpcUa/OpcUaPublishActor.cs): RedundancyStateTopic = "redundancy-state" (line 30); Subscribe(RedundancyStateTopic, Self) in PreStart (line 156); Receive<RedundancyStateChanged>(HandleRedundancyStateChanged) (line 147); HandleRedundancyStateChanged reads msg.Nodes.FirstOrDefault(n => n.NodeId == localNode)?.Role (lines 335–351). RedundancyRole lives in ZB.MOM.WW.OtOpcUa.Commons.Messages.Redundancy (Primary/Secondary/Detached). The host is expected to already hold its local node id (it is used elsewhere); if it does not, derive it the same way OpcUaPublishActor resolves _localNode.

Step 1: Failing TestKit tests (extend the existing harness; Runtime.Tests = xunit v2 + Akka.TestKit.Xunit2):

- Emission_is_published_to_alerts_by_default_before_any_redundancy_state: a fresh host (no RedundancyStateChanged yet) DOES publish the AlarmTransitionEvent to alerts (subscribe a probe to AlertsTopic).
- Secondary_node_suppresses_alerts_publish_but_still_writes_opcua: after a RedundancyStateChanged marking the local node Secondary, an emission does NOT publish to alerts but DOES still Tell the OpcUaPublishActor.AlarmStateUpdate (probe the publish-actor seam the test already uses for T9/T19).
- Primary_node_publishes_alerts: after a Primary RedundancyStateChanged, the emission publishes to alerts.
- Inbound_AlarmCommand_is_processed_regardless_of_role: a Secondary host still drives the engine for an inbound AlarmCommand (the existing T19 ack test, asserted under a Secondary role).

Step 2: Run them — expect FAIL (no gate yet; Secondary still publishes).

Step 3: Implement. Add a cached RedundancyRole? _localRole = null; (null = unknown ⇒ treat as Primary/emit). In PreStart, _mediator.Tell(new Subscribe(OpcUaPublishActor.RedundancyStateTopic, Self)) (reuse the const; if cross-project reference is awkward, introduce a shared Commons const for "redundancy-state" and point both at it — surface that deviation). Add Receive<RedundancyStateChanged> + a SubscribeAck no-op. In the handler, set _localRole from the snapshot for the local node. In OnEngineEmission, before line 278:

```csharp
// Warm-standby dedup: only the Primary (driver-role leader) publishes the cluster-wide
// transition + drives historization. Default-emit until told we are Secondary/Detached so
// single-node deploys + the boot window never drop transitions. The OPC UA node write
// above (warm address space) and inbound command processing stay ungated.
if (_localRole is RedundancyRole.Secondary or RedundancyRole.Detached)
    return;
_mediator.Tell(new Publish(AlertsTopic, evt));
```

Leave line 261 (the _publishActor.Tell) and OnAlarmCommand untouched.
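The role cache and gate from Step 3 can be modelled without Akka. The following is a minimal sketch, not the actor code: NodeRedundancy, RedundancySnapshot and AlertsEmitGate are hypothetical stand-ins for the real RedundancyStateChanged message and the actor's fields, whose exact shapes are not shown in this plan. It assumes a snapshot that omits the local node resets the cache to unknown.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins; the real types live in ZB.MOM.WW.OtOpcUa.Commons.Messages.Redundancy.
public enum RedundancyRole { Primary, Secondary, Detached }
public sealed record NodeRedundancy(string NodeId, RedundancyRole Role);
public sealed record RedundancySnapshot(IReadOnlyList<NodeRedundancy> Nodes);

public sealed class AlertsEmitGate
{
    private readonly string _localNode;
    private RedundancyRole? _localRole; // null = unknown, treated as Primary (emit)

    public AlertsEmitGate(string localNode) => _localNode = localNode;

    // Mirrors the handler: cache the role the snapshot reports for the local node.
    // Assumption: a snapshot without the local node resets the cache to unknown.
    public void OnRedundancyStateChanged(RedundancySnapshot msg) =>
        _localRole = msg.Nodes.FirstOrDefault(n => n.NodeId == _localNode)?.Role;

    // Only an explicit Secondary/Detached role suppresses the alerts publish.
    public bool ShouldPublishAlerts =>
        _localRole is not (RedundancyRole.Secondary or RedundancyRole.Detached);
}
```

Keeping "unknown" distinct from Primary (a nullable role rather than a Primary default) makes the boot-window behaviour explicit in tests: a fresh gate emits, and only a received Secondary or Detached role turns emission off.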

Step 4: Run dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter ScriptedAlarmHostActor → all green.

Step 5: Commit by explicit path (ScriptedAlarmHostActor.cs + the test).

High-risk: concurrency + redundancy semantics. The role is cached on the actor thread and read in the emission path; confirm that OnEngineEmission also runs on the actor thread (it should, because emissions are marshalled via Self.Tell(EngineEmission)). Do NOT gate the OPC UA write or commands.

Task 2: Redundancy historize-gate in `HistorianAdapterActor` (A2)

Classification: high-risk | Estimated implement time: ~5 min | Parallelizable with: Task 1, Task 3, Task 4, Task 5

Files:

Modify: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Historian/HistorianAdapterActor.cs
Modify (only if the actor needs the local node id / mediator wired): src/Server/ZB.MOM.WW.OtOpcUa.Runtime/ServiceCollectionExtensions.cs (≈ line 146, where it is spawned as historian-adapter)
Test: tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests/Historian/HistorianAdapterActorTests.cs (create if absent; else extend the existing historian test)

Context: HistorianAdapterActor is per-node and consumes the alerts topic, so one publish historizes once per node → double DB writes. Gate the sink write on Primary, same pattern as Task 1.

Step 1: Failing tests (TestKit + a fake IAlarmHistorianSink that records writes):

- Default_before_redundancy_state_historizes: a fresh adapter writes to the sink (treat unknown as Primary).
- Secondary_node_does_not_historize: after a Secondary RedundancyStateChanged, an incoming alarm event is NOT written to the sink.
- Primary_node_historizes: after Primary, it IS written.

Step 2: Run — FAIL.

Step 3: Implement the same _localRole cache + redundancy-state subscription + Receive<RedundancyStateChanged> as Task 1, and guard the sink write: if (_localRole is RedundancyRole.Secondary or RedundancyRole.Detached) return; before _sink.Write.... The actor will need its local node id; resolve it the way OpcUaPublishActor/ServiceCollectionExtensions already does (pass it into Props if not already available — check how OpcUaPublishActor gets _localNode and mirror; thread it through HistorianAdapterActor.Props + the registration at ServiceCollectionExtensions.cs:146).
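The recording fake sink from Step 1 and the guarded write from Step 3 might look like the sketch below. It is framework-free and rests on assumptions: the real IAlarmHistorianSink signature is not shown in this plan, so the single Write(string) method here is hypothetical, as are RecordingSink and HistorizeGate.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins: the real enum and sink interface are defined in the project.
public enum RedundancyRole { Primary, Secondary, Detached }
public interface IAlarmHistorianSink { void Write(string alarmEvent); }

// Test double that records every write so a test can assert on the count.
public sealed class RecordingSink : IAlarmHistorianSink
{
    public List<string> Writes { get; } = new();
    public void Write(string alarmEvent) => Writes.Add(alarmEvent);
}

// Framework-free model of the adapter's guarded write path.
public sealed class HistorizeGate
{
    private readonly IAlarmHistorianSink _sink;
    private RedundancyRole? _localRole; // null = unknown, treated as Primary (write)

    public HistorizeGate(IAlarmHistorianSink sink) => _sink = sink;

    public void SetRole(RedundancyRole role) => _localRole = role;

    public void OnAlarmEvent(string alarmEvent)
    {
        // Same guard as Task 1: only an explicit Secondary/Detached role skips the write.
        if (_localRole is RedundancyRole.Secondary or RedundancyRole.Detached)
            return;
        _sink.Write(alarmEvent);
    }
}
```

The three tests in Step 1 map onto this directly: no SetRole call, SetRole(Secondary), and SetRole(Primary), each followed by one OnAlarmEvent and a check of RecordingSink.Writes.Count.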

Step 4: dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests --filter Historian → green. Also confirm the full Runtime.Tests still pass.

Step 5: Commit by explicit path.

High-risk: data-historization correctness + concurrency. If wiring the node id into Props ripples beyond the two files, surface it before expanding.

Task 3: Galaxy reconnect recreates a faulted session

Classification: high-risk | Estimated implement time: ~5 min | Parallelizable with: Task 1, Task 2, Task 4, Task 5

Files:

Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/GalaxyMxSession.cs (≈ line 58–69)
Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/GalaxyDriver.cs (≈ line 289, ReopenAsync)
Test: the Galaxy driver test project (find *ReconnectSupervisor* / *Session* tests under tests/Drivers/...Driver.Galaxy.Tests/; mirror their seam)

Context: GalaxyMxSession.ConnectAsync has if (_session is not null) return; — so when the gRPC session is Faulted/NotFound the field is still a non-null dead handle and ConnectAsync is a silent no-op. GalaxyDriver.ReopenAsync calls ConnectAsync expecting a reconnect → no-op → ReconnectSupervisor.RecoveryLoopAsync (Runtime/ReconnectSupervisor.cs:158-186) sees reopen "succeed", replay fails, loops forever.

Step 1: Failing test. Assert the reconnect path creates a new session after a faulted one: drive GalaxyMxSession (or a fake MxGatewayClient/session seam) to a connected state, simulate fault/dispose, call the reopen/recreate path, and assert OpenSessionAsync + RegisterAsync were invoked again (a second create), not a no-op. Mirror the existing session/reconnect test construction (read how they fake MxGatewayClient/OpenSessionAsync).

Step 2: Run — FAIL (current code no-ops, second create never happens).

Step 3: Implement. Add a recreate path to GalaxyMxSession, e.g.:

```csharp
/// <summary>Disposes the current (faulted/stale) session + owned client so the next
/// <see cref="ConnectAsync"/> rebuilds a fresh session instead of no-op'ing on the dead handle.</summary>
public async Task RecreateAsync(MxGatewayClientOptions clientOptions, CancellationToken ct)
{
    ObjectDisposedException.ThrowIf(_disposed, this);
    await DisposeSessionAsync().ConfigureAwait(false); // dispose+null _session, _serverHandle, _ownedClient
    await ConnectAsync(clientOptions, ct).ConfigureAwait(false);
}
```

(Factor a private DisposeSessionAsync() that safely tears down the gRPC session/channel + nulls the fields — reuse whatever DisposeAsync already does for teardown, minus marking _disposed.) Then in GalaxyDriver.ReopenAsync call RecreateAsync instead of ConnectAsync so every reopen rebuilds. Keep the supervisor's backoff loop unchanged — it now actually recovers.
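The no-op bug and the recreate path can be reproduced in isolation. This is a sketch under assumptions, not the driver code: ISessionFactory, SessionHolder and CountingFactory are hypothetical, and the real MxGatewayClient/OpenSessionAsync/RegisterAsync seam should be read from the existing tests. Swallowing a dispose failure on a faulted session is a choice made for the sketch.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical seam standing in for the gateway client.
public interface ISessionFactory
{
    Task<IAsyncDisposable> OpenSessionAsync(CancellationToken ct);
}

public sealed class SessionHolder
{
    private readonly ISessionFactory _factory;
    private IAsyncDisposable? _session;

    public SessionHolder(ISessionFactory factory) => _factory = factory;

    // Same early return as the current code: a non-null handle short-circuits, even if it is dead.
    public async Task ConnectAsync(CancellationToken ct)
    {
        if (_session is not null) return;
        _session = await _factory.OpenSessionAsync(ct).ConfigureAwait(false);
    }

    // Tear down the stale handle first so ConnectAsync really opens a new session.
    public async Task RecreateAsync(CancellationToken ct)
    {
        await DisposeSessionAsync().ConfigureAwait(false);
        await ConnectAsync(ct).ConfigureAwait(false);
    }

    // Idempotent: the field is nulled before disposal, so a repeat call finds nothing to do.
    private async Task DisposeSessionAsync()
    {
        var stale = Interlocked.Exchange(ref _session, null);
        if (stale is null) return;
        try { await stale.DisposeAsync().ConfigureAwait(false); }
        catch (Exception) { /* sketch choice: a faulted session may throw on dispose; it is discarded anyway */ }
    }
}

// Counting fake: lets a test assert that a second create actually happened.
public sealed class CountingFactory : ISessionFactory
{
    public int Opens { get; private set; }

    public Task<IAsyncDisposable> OpenSessionAsync(CancellationToken ct)
    {
        Opens++;
        return Task.FromResult<IAsyncDisposable>(new FakeSession());
    }

    private sealed class FakeSession : IAsyncDisposable
    {
        public ValueTask DisposeAsync() => default;
    }
}
```

With this shape, two ConnectAsync calls open one session, while a RecreateAsync after them opens a second. That is the assertion Step 1 asks for.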

Step 4: Run the Galaxy reconnect/session tests → green. dotnet build the Galaxy driver clean.

Step 5: Commit by explicit path.

High-risk: driver session lifecycle + gRPC channel teardown + concurrency with the supervisor loop. Confirm DisposeSessionAsync is safe to call repeatedly and from the supervisor thread.

Task 4: Broadcaster connection-health signal + bridges

Classification: standard | Estimated implement time: ~5 min | Parallelizable with: Task 1, Task 2, Task 3, Task 5

Files:

Modify: src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Hubs/IInProcessBroadcaster.cs
Modify: src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Hubs/InProcessBroadcaster.cs (the impl)
Modify: the two bridge actors that publish into the broadcaster — ScriptLogSignalRBridge + AlertSignalRBridge (find under AdminUI/ or Runtime/; they Tell/Subscribe the DPS topics and push to the broadcaster)
Test: tests/Server/ZB.MOM.WW.OtOpcUa.Client.UI.Tests/ (or wherever InProcessBroadcaster/broadcaster tests live — find InProcessBroadcasterTests)

Step 1: Failing unit test for the new health signal: a new InProcessBroadcaster<T> reports IsConnected == false (or a sensible default); calling SetConnected(true) raises ConnectionStateChanged with true and flips IsConnected; SetConnected(false) flips back + raises.

Step 2: Run — FAIL (members don't exist).

Step 3: Implement. Extend the interface:

```csharp
bool IsConnected { get; }
event Action<bool>? ConnectionStateChanged;
void SetConnected(bool connected);
```

Implement in InProcessBroadcaster<T> (raise only on change). Then in each bridge actor, call _broadcaster.SetConnected(true) when its DPS SubscribeAck lands (subscription live) and SetConnected(false) on PostStop/failure. (Read how the bridge subscribes; mirror the SubscribeAck handling already there.) Default IsConnected: false before the first ack, true once subscribed; the razor will reconcile on the ConnectionStateChanged event.
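A raise-only-on-change implementation of the three members could be sketched as follows. ConnectionHealth is a hypothetical reduced class that carries only the health members; the real InProcessBroadcaster<T> also has its publish/subscribe API, which is not shown here. The lock is an assumption, added in case SetConnected is called from the bridge actor while a page reads IsConnected.

```csharp
#nullable enable
using System;

// Hypothetical reduced class: only the health members this task adds.
public sealed class ConnectionHealth
{
    private readonly object _gate = new();
    private bool _isConnected; // false until the first SubscribeAck

    public bool IsConnected
    {
        get { lock (_gate) return _isConnected; }
    }

    public event Action<bool>? ConnectionStateChanged;

    public void SetConnected(bool connected)
    {
        lock (_gate)
        {
            if (_isConnected == connected) return; // raise only on change
            _isConnected = connected;
        }
        // Invoked outside the lock so a slow or re-entrant subscriber cannot block other callers.
        ConnectionStateChanged?.Invoke(connected);
    }
}
```

Calling SetConnected(true) twice raises the event once, so a bridge that receives a repeated SubscribeAck does not cause redundant re-renders.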

Step 4: Run the broadcaster tests → green; AdminUI builds clean (TreatWarningsAsErrors).

Step 5: Commit by explicit path.

Task 5: `CorrelationId` wrapper for the alarm commands (3b)

Classification: small | Estimated implement time: ~4 min | Parallelizable with: Task 1, Task 2, Task 3, Task 4

Files:

Modify: src/Core/ZB.MOM.WW.OtOpcUa.Commons/Messages/Admin/AcknowledgeAlarmCommand.cs + ShelveAlarmCommand.cs (+ their *Result records)
Modify: src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane/AdminOperations/AdminOperationsActor.cs (the alarm handlers)
Modify: src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Clients/AdminOperationsClient.cs
Test: tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests/AdminOperationsActorTests.cs (the 5 alarm tests)

Context: the alarm commands use a bare Guid CorrelationId; the project's other control-plane commands (StartDeployment/RestartDriver/ReconnectDriver) use the CorrelationId wrapper type. Make them consistent.

Steps:

1. Find the CorrelationId wrapper type + how RestartDriver uses it (CorrelationId.NewId(), the record field type, the reply echo).
2. Change AcknowledgeAlarmCommand/ShelveAlarmCommand (+ *Result) CorrelationId field type Guid → CorrelationId. Update AdminOperationsClient.AcknowledgeAlarmAsync/ShelveAlarmAsync to mint CorrelationId.NewId(). Update the AdminOperationsActor handlers' reply construction.
3. Update the 5 AdminOperationsActorTests to the wrapper type.
4. dotnet test tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests --filter AdminOperations → green; build clean.
5. Commit by explicit path.
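The shape of the change in step 2 is roughly the following. Everything here is illustrative: the project's real CorrelationId type must be read from the source (step 1), and the reduced AcknowledgeAlarmCommand/AcknowledgeAlarmResult records and the AlarmHandlerSketch helper are hypothetical, trimmed to the field type change and the reply echo.

```csharp
using System;

// Hypothetical wrapper shape; only NewId() is named in this plan.
public readonly record struct CorrelationId(Guid Value)
{
    public static CorrelationId NewId() => new(Guid.NewGuid());
}

// Hypothetical, reduced command/result pair: the field type is the wrapper, not a bare Guid.
public sealed record AcknowledgeAlarmCommand(CorrelationId CorrelationId, string AlarmId, string User);
public sealed record AcknowledgeAlarmResult(CorrelationId CorrelationId, bool Success);

public static class AlarmHandlerSketch
{
    // The handler echoes the command's CorrelationId so the client can match the reply.
    public static AcknowledgeAlarmResult Handle(AcknowledgeAlarmCommand cmd) =>
        new(cmd.CorrelationId, Success: !string.IsNullOrWhiteSpace(cmd.AlarmId));
}
```

A wrapper type stops an unrelated Guid (an alarm id, say) from being passed where a correlation id is expected. The compiler rejects the mix-up, which a bare Guid field would not.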

Task 6: `Alerts.razor` — Timed-shelve picker + chip auto-clear + live-pill

Classification: standard | Estimated implement time: ~5 min | Parallelizable with: Task 7 | Blocked by: Task 4 (needs the broadcaster ConnectionStateChanged/IsConnected)

Files:

Modify: src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/Alerts.razor

Context: all three changes are in one file → bundled so they don't contend. NO bUnit; proven by docker-dev live-verify (Task 8).

Steps (no failing test — razor; verify by build + Task 8):

1. Timed-shelve picker: add a small duration input to the row's shelve control (a minutes number box is simplest, default e.g. 5). Add a "Shelve (timed)" action that calls IAdminOperationsClient.ShelveAlarmAsync(alarmId, user, ShelveKind.Timed, unshelveAtUtc: <computed now+duration>, comment: null, ct). Keep the existing OneShot Shelve + Unshelve buttons. Dispose the CTS (using var cts = …) like the existing handlers.
2. Chip auto-clear: after ShowOpResult, start a ~8s auto-clear (mirror DriverStatusPanel.razor's timer: a Task.Delay(8000) continuation that, if the chip is still the same one, clears _opResult* + InvokeAsync(StateHasChanged)). Read DriverStatusPanel.razor for the exact pattern (cancellation on a newer action).
3. Live-pill: replace the set-once _connected = true (≈ line 132). In OnInitializedAsync, read Alarms.IsConnected and subscribe Alarms.ConnectionStateChanged += OnConnChanged; (handler sets _connected + InvokeAsync(StateHasChanged)); unsubscribe in Dispose. (Alarms = the injected IInProcessBroadcaster<AlarmTransitionEvent>.)
4. dotnet build src/Server/ZB.MOM.WW.OtOpcUa.AdminUI → 0 warnings/errors.
5. Commit by explicit path.
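The chip auto-clear in step 2 comes down to "clear after a delay unless a newer result replaced it". The sketch below shows one way to do that outside a component. AutoClearingChip is a hypothetical helper and not the DriverStatusPanel.razor code, which remains the pattern to mirror. In a real page the onChanged callback would be () => InvokeAsync(StateHasChanged) and the delay about 8 seconds.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper modelling the auto-clear timer with cancellation on a newer action.
public sealed class AutoClearingChip
{
    private readonly TimeSpan _delay;
    private readonly Action _onChanged;
    private CancellationTokenSource? _cts;

    public string? Message { get; private set; }

    public AutoClearingChip(TimeSpan delay, Action onChanged)
    {
        _delay = delay;
        _onChanged = onChanged;
    }

    // Shows a result and returns the task that will clear it; a newer Show cancels the older timer.
    public Task Show(string message)
    {
        _cts?.Cancel();
        _cts?.Dispose();
        var cts = _cts = new CancellationTokenSource();
        Message = message;
        _onChanged();
        return ClearLaterAsync(cts);
    }

    private async Task ClearLaterAsync(CancellationTokenSource cts)
    {
        // cts.Token is read synchronously here, before any later Show can dispose the source.
        try { await Task.Delay(_delay, cts.Token); }
        catch (OperationCanceledException) { return; } // superseded by a newer chip
        if (!ReferenceEquals(_cts, cts)) return;       // still the same chip?
        Message = null;
        _onChanged();
    }
}
```

The reference check after the delay covers the case where the timer had already fired when a newer result arrived, so an old timer cannot clear a new chip.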

Task 7: `ScriptLog.razor` live-pill + reconnect-overlay check

Classification: small | Estimated implement time: ~3 min | Parallelizable with: Task 6 | Blocked by: Task 4

Files:

Modify: src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/ScriptLog.razor (≈ line 122)
Read/verify (no edit unless missing): the Blazor host shell (App.razor / _Host.cshtml / MainLayout) for the default reconnect overlay (components-reconnect-modal / <div id="components-reconnect-modal">)

Steps:

1. Same live-pill fix as Task 6 step 3, against ScriptLogs (the injected IInProcessBroadcaster<ScriptLogEntry>): read IsConnected, subscribe ConnectionStateChanged, update + StateHasChanged, unsubscribe in Dispose.
2. Verify the dead-circuit overlay exists: grep the host shell for the Blazor reconnect modal markup. If present, note it in the commit message (the dead-circuit case is covered by Blazor's built-in reconnection UI; the component can't self-update a dead circuit). If ABSENT, add the standard components-reconnect-modal markup. Do NOT fake liveness from a dead component.
3. dotnet build src/Server/ZB.MOM.WW.OtOpcUa.AdminUI → clean.
4. Commit by explicit path.
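The subscribe / read / unsubscribe sequence shared by both pages can be shown as a plain class. This is a sketch, not razor code-behind: IConnectionHealth is a hypothetical reduced view of the members Task 4 adds to IInProcessBroadcaster<T>, LivePill and FakeFeed are illustrative, and the render callback stands in for InvokeAsync(StateHasChanged).

```csharp
#nullable enable
using System;

// Hypothetical reduced view of the broadcaster's health members.
public interface IConnectionHealth
{
    bool IsConnected { get; }
    event Action<bool>? ConnectionStateChanged;
}

// Framework-free model of the page's live-pill state.
public sealed class LivePill : IDisposable
{
    private readonly IConnectionHealth _feed;
    private readonly Action _render;

    public bool Connected { get; private set; }

    public LivePill(IConnectionHealth feed, Action render)
    {
        _feed = feed;
        _render = render;
        // Subscribe first, then read the initial state, so a flip in between is not missed.
        _feed.ConnectionStateChanged += OnConnChanged;
        Connected = _feed.IsConnected;
    }

    private void OnConnChanged(bool connected)
    {
        Connected = connected;
        _render();
    }

    // Unsubscribe so a disposed page is no longer referenced or re-rendered by a longer-lived feed.
    public void Dispose() => _feed.ConnectionStateChanged -= OnConnChanged;
}

// Test double for the feed.
public sealed class FakeFeed : IConnectionHealth
{
    public bool IsConnected { get; private set; }
    public event Action<bool>? ConnectionStateChanged;

    public void Set(bool connected)
    {
        IsConnected = connected;
        ConnectionStateChanged?.Invoke(connected);
    }
}
```

This only covers a live circuit. As step 2 says, a dead circuit cannot update its own pill, which is why the reconnect overlay check stays a separate step.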

Task 8: Live-verify on docker-dev

Classification: verification | Estimated implement time: ~ (manual) | Parallelizable with: none | Blocked by: Task 1, Task 2, Task 3, Task 6, Task 7 (Task 5 is internal; include it in the build)

Steps: rebuild docker-dev central nodes on the new image (docker compose -f docker-dev/docker-compose.yml up -d --build central-1 central-2), then on /alerts (login disabled — agent drives):

1. Double-emit fixed: confirm each t12-overheat transition now appears ONCE (not twice). Cross-check both nodes' logs: only the driver-leader (Primary) node logs the alerts publish; the Secondary suppresses it but still writes its OPC UA node (its address space stays current). (Find the Primary via the redundancy-state/ServiceLevel, or just observe single rows.)
2. Timed-shelve: drive the new timed-shelve control → a "Shelved" transition appears (operator = multi-role-test); after the duration the SDK auto-unshelves (or verify the UnshelveAtUtc was set).
3. Chip auto-clear: the Ack/Shelve result chip clears after ~8s.
4. Live-pill: the pill shows "live" while connected; (optional) kill the bridge / restart a node and confirm the pill flips / the Blazor reconnect overlay shows.
5. Galaxy reconnect (best-effort, may need the gateway): if feasible, restart the Galaxy gateway and confirm the driver recreates the session + recovers without a container restart (watch the logs for a fresh OpenSessionAsync). If not feasible live, rely on the Task 3 unit test + note it.

Defects → new fix tasks. Agent drives; no sign-in needed (docker-dev login disabled).

Task 9: docker-dev rig cleanup

Classification: trivial (operational) | Estimated implement time: ~ (manual) | Parallelizable with: none | Blocked by: Task 8 (the rig artifacts are the live-verify vehicle)

Steps: remove the seed artifacts left for verification, then redeploy:

1. In the docker-dev AdminUI (or via DB), delete the t12-overheat scripted alarm, the SC-ba675b168a85 predicate script, and the layer0-logcheck vtag + script. Revert filler-02's cycle-time-s script to return ctx.GetTag("TestMachine_002.TestDuration").Value;.
2. Redeploy: POST http://localhost:9200/api/deployments with X-Api-Key: docker-dev-deploy-key → 202.
3. Confirm /alerts + /scripted-alarms + /scripts no longer list the removed artifacts. No code commit (DB/deploy only). Note completion in Task 10's commit message.

Operational only — no source change. If the user wants the rig kept, skip this task and record it as deferred.

Task 10: Docs + finish branch

Classification: small | Estimated implement time: ~4 min | Parallelizable with: none | Blocked by: Task 9

Files: docs/Redundancy.md (alarm-emission + historization are now Primary-gated), docs/ScriptedAlarms.md / docs/AlarmTracking.md (the redundancy dedup note), docs/Client.CLI.md only if affected, and a one-line note wherever the live-pill/Galaxy-reconnect behaviour is documented. Keep terse.

Steps:

1. Update the docs above to reflect: Primary-only alarm emission + historization under redundancy; the Galaxy reconnect now recreates a faulted session; the live-pill reflects feed health.
2. Run the FULL suite: dotnet test ZB.MOM.WW.OtOpcUa.slnx and confirm all affected unit suites green; the only failures should be the known pre-existing macOS/integration ones (OpcUaServer.IntegrationTests PKI, Host.IntegrationTests deploy-Rejected, AbLegacy/AbCip fixtures). Verify the Galaxy suite is green (Task 3).
3. Commit docs by explicit path.
4. Run superpowers-extended-cc:finishing-a-development-branch (verify tests → present the 4 options → merge).

Execution notes

- Parallel dispatch: Tasks 1, 2, 3, 4, 5 are mutually parallelizable (disjoint files across Runtime/ScriptedAlarms, Runtime/Historian, Driver.Galaxy, AdminUI/Hubs+bridges, Commons+ControlPlane). Dispatch their implementers concurrently. Tasks 6 and 7 wait on 4 (broadcaster health) and are parallel with each other (Alerts.razor vs ScriptLog.razor). 8 waits on 1/2/3/6/7; 9 on 8; 10 on 9.
- One writer per file: Alerts.razor is touched ONLY by Task 6 (its three changes are bundled for exactly this reason). ScriptLog.razor only by Task 7.
- High-risk tasks (1, 2, 3): serial spec→code review each. Standard (4, 6): parallel spec+code review. Small (5, 7): code review only.
- Live-verify (8) is the integration checkpoint; the single-alerts-row proof is the whole point of item 1.
- TDD where there's logic (1, 2, 3, 4, 5); razor (6, 7) proven by docker-dev /run (login disabled, agent drives).
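The dispatch order follows mechanically from the "Blocked by" lines. As an illustration only, the hypothetical DispatchPlan helper below groups Tasks 1-10 into waves from those edges; Task 0 is left out because it is the branch setup that precedes everything.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DispatchPlan
{
    // "Blocked by" edges as stated in the task headers; Tasks 1-5 have none.
    public static readonly IReadOnlyDictionary<int, int[]> BlockedBy = new Dictionary<int, int[]>
    {
        [1] = Array.Empty<int>(),
        [2] = Array.Empty<int>(),
        [3] = Array.Empty<int>(),
        [4] = Array.Empty<int>(),
        [5] = Array.Empty<int>(),
        [6] = new[] { 4 },
        [7] = new[] { 4 },
        [8] = new[] { 1, 2, 3, 6, 7 },
        [9] = new[] { 8 },
        [10] = new[] { 9 },
    };

    // Groups tasks into waves: a task joins the first wave after all of its blockers are done.
    public static List<List<int>> Waves()
    {
        var done = new HashSet<int>();
        var waves = new List<List<int>>();
        while (done.Count < BlockedBy.Count)
        {
            var ready = BlockedBy
                .Where(kv => !done.Contains(kv.Key) && kv.Value.All(done.Contains))
                .Select(kv => kv.Key)
                .OrderBy(id => id)
                .ToList();
            if (ready.Count == 0) throw new InvalidOperationException("Dependency cycle.");
            waves.Add(ready);
            done.UnionWith(ready);
        }
        return waves;
    }
}
```

With the edges above, Waves() yields five waves: {1, 2, 3, 4, 5}, {6, 7}, {8}, {9}, {10}. Waiting for a whole wave is stricter than the plan requires, since 6 and 7 could start as soon as Task 4 alone is done.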
