Alarm Ack/Shelve Follow-ups — Design

Date: 2026-06-11
Status: Approved (brainstorming); ready for implementation plan.

Goal

Resolve the six notes/follow-ups left open by the T17–T24 inbound-alarm-ack work (merged to master bc9843d2): the redundancy double-emit, the deferred Timed-shelve UI, two T21 minors, the docker-dev rig cleanup, and two pre-existing Layer-1 gaps (Galaxy reconnect, the live-pill). Scope decided in brainstorming: everything, including the pre-existing L1 gaps. Double-emit approach decided: primary-only at the source.


1 — Redundancy double-emit fix (core; high-risk)

Problem (verified)

Both central nodes run roles admin,driver in the same MAIN cluster. Each spawns its own ScriptedAlarmHostActor + ScriptedAlarmEngine (DriverHostActor.SpawnScriptedAlarmHost, per-node, no singleton) and, for a single-cluster artifact, loads the same alarms (DeploymentArtifact.ParseComposition returns ClusterFilterMode.None → unfiltered). Both independently evaluate and both hit ScriptedAlarmHostActor.OnEngineEmission:

  • line 261 _publishActor.Tell(new OpcUaPublishActor.AlarmStateUpdate(...)) — writes the OPC UA node (directed, local).
  • line 278 _mediator.Tell(new Publish(AlertsTopic /* "alerts" */, evt)), which is cluster-wide and ungated → every transition appears twice on /alerts. Inbound ack/shelve commands are likewise processed by both nodes' engines.

HistorianAdapterActor (Runtime/Historian/, registered per-node in Runtime/ServiceCollectionExtensions.cs:146, name historian-adapter) also consumes the alerts topic, so a single publish is historized once per node → double DB writes in production (invisible in docker-dev: NullAlarmHistorianSink).

Existing signal to mirror

RedundancyStateActor (admin singleton, ControlPlane/Redundancy/RedundancyStateActor.cs:90-114) publishes RedundancyStateChanged to the redundancy-state DPS topic. Per-node NodeRedundancyState carries RedundancyRole (Primary/Secondary/Detached) + IsRoleLeaderForDriver; Primary = the driver-role cluster leader. OpcUaPublishActor already subscribes to this topic (OpcUaPublishActor.cs:30,147,156,335) to drive its ServiceLevel — the exact pattern to copy (Subscribe(RedundancyStateTopic, Self) in PreStart; Receive<RedundancyStateChanged>; read msg.Nodes.FirstOrDefault(n => n.NodeId == localNode)?.Role).

Decision — gate only the cluster-wide emission, on Primary

  • Emission gate (task A1): ScriptedAlarmHostActor subscribes to redundancy-state, caches the local role, and skips the alerts publish (line 278) when not Primary. Default = emit until a RedundancyStateChanged says this node is Secondary/Detached — so single-node deploys (sole node is always the driver leader = Primary) and the boot window never drop transitions. The OPC UA node write (line 261) and inbound command processing stay UNGATED — the secondary must keep its address space warm and its engine state consistent for failover; clients only ever see the Primary via ServiceLevel, so the secondary's node writes + its (subscriber-less) condition events are harmless.
  • Historian gate (task A2): HistorianAdapterActor subscribes to redundancy-state, caches the local role, and skips the sink write when not Primary. Gives exactly-once historization for all alarm sources (native Galaxy/AB-CIP too), not just scripted.
  • Edge: a brief failover window (old Primary gone, new not yet elected) may drop a transition/historization — acceptable, and identical to the existing ServiceLevel handoff behaviour.
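
The gating rule in the decision above can be sketched as a small role cache, shown here without Akka so it stands alone. This is a minimal sketch under assumptions: the record shapes, the PrimaryGate name, and the choice to keep the previous role when a message omits the local node are illustrative and not taken from the codebase. Only the default-emit rule and the FirstOrDefault lookup follow the design text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Names mirror the design text; the shapes are assumed for illustration.
public enum RedundancyRole { Primary, Secondary, Detached }

public sealed record NodeRedundancyState(string NodeId, RedundancyRole Role);

public sealed record RedundancyStateChanged(IReadOnlyList<NodeRedundancyState> Nodes);

// Hypothetical helper: caches the local node's role and decides whether
// this node may perform the cluster-wide emission (or the sink write).
public sealed class PrimaryGate
{
    private readonly string _localNodeId;
    private RedundancyRole? _role; // null = no state seen yet (boot window)

    public PrimaryGate(string localNodeId) => _localNodeId = localNodeId;

    // Call from the actor's Receive<RedundancyStateChanged> handler.
    public void Apply(RedundancyStateChanged msg)
    {
        var local = msg.Nodes.FirstOrDefault(n => n.NodeId == _localNodeId);
        // Assumption: a message that omits this node leaves the cached role unchanged.
        if (local is not null) _role = local.Role;
    }

    // Default = emit: only an explicit Secondary/Detached closes the gate.
    public bool ShouldEmit => _role is null or RedundancyRole.Primary;
}
```

In the actor, only the alerts publish would be wrapped in `if (gate.ShouldEmit)`; the OPC UA node write and inbound command handling would not consult the gate at all.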

Tests

TestKit: with a Secondary RedundancyStateChanged, OnEngineEmission does NOT publish to alerts but DOES still Tell the OPC UA AlarmStateUpdate; with Primary (or before any state) it publishes. Inbound AlarmCommand is still processed regardless of role. Historian: a Secondary skips the sink write, a Primary writes (fake sink + role injection).


2 — Timed-shelve picker UI (small)

Alerts.razor currently exposes Ack / Shelve(OneShot) / Unshelve. The Timed backend is fully wired + tested (ShelveAlarmCommand{ShelveKind.Timed, UnshelveAtUtc} → AdminOperationsActor → engine.TimedShelveAsync). Add a small duration input (a minutes/seconds number box, or a datetime-local) to the shelve control on each row and call ShelveAlarmAsync(kind: Timed, unshelveAtUtc: now + duration). Razor-only; no backend change; proven by docker-dev live-verify (no bUnit). Keep the existing OneShot/Unshelve buttons.
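
The "now + duration" step is worth isolating so the razor only passes a validated UTC instant to ShelveAlarmAsync. The helper below is a hedged sketch: TimedShelve, TryComputeUnshelveAtUtc, and the 24-hour cap are hypothetical, since the design does not specify bounds for the input.

```csharp
using System;

// Hypothetical helper for the duration input; not part of the existing codebase.
public static class TimedShelve
{
    // Assumed upper bound; the design does not state one.
    public static readonly TimeSpan MaxDuration = TimeSpan.FromHours(24);

    // Turns a minutes value from the number box into the UnshelveAtUtc instant.
    // Returns false for non-positive or over-limit input instead of throwing.
    public static bool TryComputeUnshelveAtUtc(int minutes, DateTime nowUtc, out DateTime unshelveAtUtc)
    {
        if (nowUtc.Kind != DateTimeKind.Utc)
            throw new ArgumentException("nowUtc must be a UTC timestamp.", nameof(nowUtc));

        unshelveAtUtc = default;
        if (minutes <= 0 || minutes > MaxDuration.TotalMinutes) return false;

        unshelveAtUtc = nowUtc + TimeSpan.FromMinutes(minutes);
        return true;
    }
}
```

A row handler would call it with DateTime.UtcNow and show a validation message on the row when it returns false, rather than sending the command.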

3 — T21 minors (small)

  • Chip auto-clear: the _opResult* chip persists until the next action. Add the ~8s auto-clear timer pattern DriverStatusPanel.razor already uses (a Task.Delay → InvokeAsync(StateHasChanged) that clears the per-row result).
  • CorrelationId consistency: AcknowledgeAlarmCommand/ShelveAlarmCommand use a bare Guid CorrelationId; the project's other control-plane commands use the CorrelationId wrapper type. Switch both records + AdminOperationsClient (CorrelationId.NewId()) + the actor reply contract + tests to the wrapper for uniform correlation tracing.
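
One way to express the auto-clear pattern is a delay followed by a guarded clear. The sketch below is illustrative only and is not the code in DriverStatusPanel.razor: AutoClearingResult and its members are hypothetical, and the version counter is an added assumption so that an older timer cannot wipe a newer result on the same row.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical per-row result holder with a delayed clear.
public sealed class AutoClearingResult
{
    private int _version;

    public string? Message { get; private set; }

    // notifyChanged stands in for () => InvokeAsync(StateHasChanged) in a component.
    public async Task ShowAsync(string message, TimeSpan clearAfter, Func<Task> notifyChanged)
    {
        var myVersion = Interlocked.Increment(ref _version);
        Message = message;
        await notifyChanged();

        await Task.Delay(clearAfter);

        // A newer ShowAsync call owns the chip now; leave its message alone.
        if (myVersion != Volatile.Read(ref _version)) return;

        Message = null;
        await notifyChanged();
    }
}
```

In the razor this would be started without awaiting from the click handler, for example `_ = result.ShowAsync("Acked", TimeSpan.FromSeconds(8), () => InvokeAsync(StateHasChanged));`. A real component would also need to stop touching state after it is disposed.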

4 — Rig cleanup (operational, last)

Delete the docker-dev seed artifacts left for live-verify: the t12-overheat scripted alarm, the SC-ba675b168a85 predicate script, the layer0-logcheck vtag + script; revert filler-02's inert cycle-time-s logger line to return ctx.GetTag("TestMachine_002.TestDuration").Value;. Redeploy (POST /api/deployments, X-Api-Key: docker-dev-deploy-key). DB/redeploy, not a code change; done last so the rig stays available for verifying tasks 1–3/6.

5 — Galaxy reconnect recreate (high-risk)

Problem (verified)

GalaxyMxSession.ConnectAsync (Driver.Galaxy/Runtime/GalaxyMxSession.cs:58-69) is idempotent: if (_session is not null) return;. When the gateway restarts and the session goes Faulted/NotFound, _session is still a non-null (dead) handle, so ConnectAsync is a silent no-op. GalaxyDriver.ReopenAsync (GalaxyDriver.cs:289) calls it expecting a reconnect → no-op; ReconnectSupervisor.RecoveryLoopAsync (Driver.Galaxy/Runtime/ReconnectSupervisor.cs:158-186) sees reopen "succeed", proceeds to replay (which fails on the dead session), and loops forever (backoff capped 30s).

Decision — recreate on reopen

Add a recreate path to GalaxyMxSession (e.g. RecreateAsync/DisposeSessionForRecreationAsync) that disposes + nulls _session and _ownedClient, and have ReopenAsync call it before ConnectAsync so a reopen always routes through the happy-path create (OpenSessionAsync + RegisterAsync). Confirm what status/exception marks Faulted/NotFound and that the dispose is safe (gRPC channel teardown). Keep the supervisor's backoff loop; it now actually recovers.
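
The interaction between the idempotent ConnectAsync and the new recreate path can be sketched in isolation. Everything below is a stand-in under stated assumptions: ISessionHandle, FakeHandle, and MxSessionSketch are hypothetical, the open delegate represents OpenSessionAsync + RegisterAsync, and the sketch folds GalaxyDriver.ReopenAsync into the same class for brevity. Swallowing teardown errors is also an assumption that the "confirm the dispose is safe" step would need to validate.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the gateway session handle.
public interface ISessionHandle : IAsyncDisposable { }

public sealed class FakeHandle : ISessionHandle
{
    public bool Disposed { get; private set; }
    public ValueTask DisposeAsync() { Disposed = true; return default; }
}

public sealed class MxSessionSketch
{
    private readonly Func<Task<ISessionHandle>> _open; // stands in for OpenSessionAsync + RegisterAsync
    private ISessionHandle? _session;

    public MxSessionSketch(Func<Task<ISessionHandle>> open) => _open = open;

    public ISessionHandle? Current => _session;

    // Idempotent, as today: a non-null (even dead) handle makes this a no-op.
    public async Task ConnectAsync()
    {
        if (_session is not null) return;
        _session = await _open();
    }

    // New path: drop the old handle so the next ConnectAsync really creates one.
    public async Task DisposeSessionForRecreationAsync()
    {
        var dead = _session;
        _session = null;
        if (dead is null) return;
        try { await dead.DisposeAsync(); }
        catch (Exception) { /* assumed: teardown of a dead session may throw and is ignored */ }
    }

    // Reopen always routes through the create path.
    public async Task ReopenAsync()
    {
        await DisposeSessionForRecreationAsync();
        await ConnectAsync();
    }
}
```

With this shape, a reopen after a fault yields a new handle and a second call to the open delegate, which is the behaviour the reconnect test is meant to assert.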

Tests

A reconnect test asserting that after a faulted session, the reopen path creates a new session (new handle / OpenSessionAsync called again) rather than no-op'ing. Mirror any existing ReconnectSupervisor/session tests.

6 — Live-pill circuit health (standard)

Problem (verified)

ScriptLog.razor:122 and Alerts.razor:132 set _connected = true once in OnInitialized and never update it; the pill markup binds to that set-once bool. IInProcessBroadcaster<T> (AdminUI/Hubs/IInProcessBroadcaster.cs) exposes only Received + Publish — no health signal.

Decision

  • Extend IInProcessBroadcaster<T> with a connection-health signal (bool IsConnected + event … ConnectionStateChanged). The bridge actors (ScriptLogSignalRBridge / AlertSignalRBridge) set it from their DPS-subscription health (SubscribeAck up / failure down). The razors bind the pill to it and subscribe/unsubscribe like DriverStatusPanel.razor (OnConnectionStateChanged → InvokeAsync(StateHasChanged)).
  • Dead-circuit case (node recreate kills the server-side circuit — the component is dead and cannot self-update its pill): this is Blazor Server's built-in reconnection concern. Verify the default reconnect overlay is present/visible (it is what actually signals a dropped circuit) rather than trying to fake liveness from a dead component. If absent, add the standard Blazor reconnect UI.
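
A possible shape for the health signal is sketched below, kept separate from IInProcessBroadcaster<T> so it compiles on its own. The Action<bool> event type, the IConnectionHealth and ConnectionHealth names, and the initial "disconnected" state are assumptions; the design leaves the event type open and names only IsConnected, ConnectionStateChanged, and SetConnected.

```csharp
using System;

// Hypothetical health surface the broadcaster could expose.
public interface IConnectionHealth
{
    bool IsConnected { get; }
    event Action<bool>? ConnectionStateChanged; // event type is assumed
}

public sealed class ConnectionHealth : IConnectionHealth
{
    private readonly object _lock = new();
    private bool _isConnected; // assumed to start disconnected until the first SubscribeAck

    public bool IsConnected
    {
        get { lock (_lock) return _isConnected; }
    }

    public event Action<bool>? ConnectionStateChanged;

    // Called by the bridge actor: true on SubscribeAck, false on subscription failure.
    public void SetConnected(bool connected)
    {
        lock (_lock)
        {
            if (_isConnected == connected) return; // no event for a repeated state
            _isConnected = connected;
        }
        // Raised outside the lock so handlers cannot deadlock against SetConnected.
        ConnectionStateChanged?.Invoke(connected);
    }
}
```

A razor would read IsConnected for the initial pill state, subscribe in OnInitialized, and unsubscribe in Dispose, with the handler marshalling back through InvokeAsync(StateHasChanged).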

Tests

Broadcaster unit test for the new health signal + SetConnected propagation. Razor proven by docker-dev live-verify (no bUnit).


Sequencing & risk

| Item | Risk | Notes |
|---|---|---|
| 1 redundancy double-emit (A1 emit gate, A2 historian gate) | high-risk | independent subsystem; A1∥A2 (different files) |
| 5 Galaxy reconnect | high-risk | independent subsystem (Galaxy driver) |
| 2 Timed-shelve picker | small | razor-only, live-verify |
| 3a chip auto-clear / 3b CorrelationId | small | razor / mechanical refactor |
| 6 live-pill | standard | interface + bridge + 2 razors |
| 4 rig cleanup | trivial/operational | last |

The two high-risk items (1, 5) are in different subsystems and can run in parallel with each other and with the UI items (2, 3, 6). Rig cleanup (4) is last. TDD where there's logic; UI proven by docker-dev /run live-verify (agent drives — login disabled on docker-dev).

Hard rules

Stage by explicit path (never git add .); never stage sql_login.txt / src/Server/.../Host/pki/; never echo the gateway API key into a new tracked file; no force-push, no --no-verify; no Configuration entity / EF migration change (none of these items needs one). Build on a feature branch off master.