Alarm Ack/Shelve Follow-ups — Design
Date: 2026-06-11
Status: Approved (brainstorming), ready for implementation plan.
Goal
Resolve the six notes/follow-ups left open by the T17–T24 inbound-alarm-ack work
(merged to master bc9843d2): the redundancy double-emit, the deferred Timed-shelve
UI, two T21 minors, the docker-dev rig cleanup, and two pre-existing Layer-1 gaps (Galaxy
reconnect, the live-pill). Scope decided in brainstorming: everything, including the
pre-existing L1 gaps. Double-emit approach decided: primary-only at the source.
1 — Redundancy double-emit fix (core; high-risk)
Problem (verified)
Both central nodes run roles admin,driver in the same MAIN cluster. Each spawns its own
ScriptedAlarmHostActor + ScriptedAlarmEngine (DriverHostActor.SpawnScriptedAlarmHost,
per-node, no singleton) and, for a single-cluster artifact, loads the same alarms
(DeploymentArtifact.ParseComposition returns ClusterFilterMode.None → unfiltered). Both
independently evaluate and both hit ScriptedAlarmHostActor.OnEngineEmission:
- line 261: _publishActor.Tell(new OpcUaPublishActor.AlarmStateUpdate(...)) writes the OPC UA node (directed, local).
- line 278: _mediator.Tell(new Publish(AlertsTopic /* "alerts" */, evt)) is cluster-wide and ungated → every transition appears twice on /alerts.
Inbound ack/shelve commands are likewise processed by both nodes' engines.
HistorianAdapterActor (Runtime/Historian/, registered per-node in
Runtime/ServiceCollectionExtensions.cs:146, name historian-adapter) also consumes the
alerts topic, so a single publish is historized once per node → double DB writes in
production (invisible in docker-dev: NullAlarmHistorianSink).
Existing signal to mirror
RedundancyStateActor (admin singleton, ControlPlane/Redundancy/RedundancyStateActor.cs:90-114)
publishes RedundancyStateChanged to the redundancy-state DPS topic. Per-node
NodeRedundancyState carries RedundancyRole (Primary/Secondary/Detached) +
IsRoleLeaderForDriver; Primary = the driver-role cluster leader. OpcUaPublishActor
already subscribes to this topic (OpcUaPublishActor.cs:30,147,156,335) to drive its
ServiceLevel — the exact pattern to copy (Subscribe(RedundancyStateTopic, Self) in
PreStart; Receive<RedundancyStateChanged>; read msg.Nodes.FirstOrDefault(n => n.NodeId == localNode)?.Role).
Decision — gate only the cluster-wide emission, on Primary
- Emission gate (task A1): ScriptedAlarmHostActor subscribes to redundancy-state, caches the local role, and skips the alerts publish (line 278) when not Primary. Default = emit until a RedundancyStateChanged says this node is Secondary/Detached, so single-node deploys (sole node is always the driver leader = Primary) and the boot window never drop transitions. The OPC UA node write (line 261) and inbound command processing stay UNGATED: the secondary must keep its address space warm and its engine state consistent for failover; clients only ever see the Primary via ServiceLevel, so the secondary's node writes + its (subscriber-less) condition events are harmless.
- Historian gate (task A2): HistorianAdapterActor subscribes to redundancy-state, caches the local role, and skips the sink write when not Primary. Gives exactly-once historization for all alarm sources (native Galaxy/AB-CIP too), not just scripted.
- Edge: a brief failover window (old Primary gone, new not yet elected) may drop a transition/historization. This is acceptable, and identical to the existing ServiceLevel handoff behaviour.
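The gate logic shared by A1 and A2 can be sketched without any actor plumbing. This is a minimal illustration, not the actual implementation: PrimaryGate, the string NodeId, and the trimmed-down record shapes are hypothetical stand-ins; only the names RedundancyRole, NodeRedundancyState and RedundancyStateChanged come from the design. It also assumes that a state message which omits the local node leaves the cached role unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down stand-ins for the real message types.
public enum RedundancyRole { Primary, Secondary, Detached }

public sealed record NodeRedundancyState(string NodeId, RedundancyRole Role);

public sealed record RedundancyStateChanged(IReadOnlyList<NodeRedundancyState> Nodes);

// Caches the local node's role and answers "may this node emit cluster-wide?".
public sealed class PrimaryGate
{
    private readonly string _localNodeId;
    private RedundancyRole? _localRole; // null = no redundancy state seen yet

    public PrimaryGate(string localNodeId) => _localNodeId = localNodeId;

    // Would be called from the actor's Receive<RedundancyStateChanged> handler.
    public void OnRedundancyStateChanged(RedundancyStateChanged msg)
    {
        var local = msg.Nodes.FirstOrDefault(n => n.NodeId == _localNodeId);
        // Assumption: a message that omits this node keeps the last known role.
        if (local is not null) _localRole = local.Role;
    }

    // Default is "emit": only an explicit Secondary/Detached closes the gate,
    // so single-node deploys and the boot window never drop transitions.
    public bool ShouldEmit => _localRole is null or RedundancyRole.Primary;
}
```

In the actors, only the alerts publish and the sink write would sit behind ShouldEmit; the OPC UA node write and inbound command handling would not consult the gate at all.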
Tests
TestKit: with a Secondary RedundancyStateChanged, OnEngineEmission does NOT publish to
alerts but DOES still Tell the OPC UA AlarmStateUpdate; with Primary (or before any
state) it publishes. Inbound AlarmCommand is still processed regardless of role. Historian:
a Secondary skips the sink write, a Primary writes (fake sink + role injection).
2 — Timed-shelve picker UI (small)
Alerts.razor currently exposes Ack / Shelve(OneShot) / Unshelve. The Timed backend is
fully wired + tested (ShelveAlarmCommand{ShelveKind.Timed, UnshelveAtUtc} →
AdminOperationsActor → engine.TimedShelveAsync). Add a small duration input (a minutes/
seconds number box, or a datetime-local) to the shelve control on each row and call
ShelveAlarmAsync(kind: Timed, unshelveAtUtc: now + duration). Razor-only; no backend
change; proven by docker-dev live-verify (no bUnit). Keep the existing OneShot/Unshelve
buttons.
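The "now + duration" step might look like the helper below. It is only a sketch: TimedShelve and ComputeUnshelveAtUtc are hypothetical names, and it assumes UnshelveAtUtc is a DateTime and that the picker yields whole minutes, neither of which the design states.

```csharp
using System;

public static class TimedShelve
{
    // Hypothetical helper: turns the picker's minutes value into an absolute UTC deadline.
    public static DateTime ComputeUnshelveAtUtc(DateTime nowUtc, int minutes)
    {
        if (nowUtc.Kind != DateTimeKind.Utc)
            throw new ArgumentException("Expected a UTC timestamp.", nameof(nowUtc));
        if (minutes <= 0)
            throw new ArgumentOutOfRangeException(nameof(minutes), "Duration must be positive.");

        // AddMinutes preserves DateTimeKind, so the result is still UTC.
        return nowUtc.AddMinutes(minutes);
    }
}
```

Rejecting non-positive durations in the UI keeps an already-expired deadline from ever reaching the backend.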
3 — T21 minors (small)
- Chip auto-clear: the _opResult* chip persists until the next action. Add the ~8s auto-clear timer pattern DriverStatusPanel.razor already uses (a Task.Delay → InvokeAsync(StateHasChanged) that clears the per-row result).
- CorrelationId consistency: AcknowledgeAlarmCommand/ShelveAlarmCommand use a bare Guid CorrelationId; the project's other control-plane commands use the CorrelationId wrapper type. Switch both records + AdminOperationsClient (CorrelationId.NewId()) + the actor reply contract + tests to the wrapper for uniform correlation tracing.
4 — Rig cleanup (operational, last)
Delete the docker-dev seed artifacts left for live-verify: the t12-overheat scripted alarm,
the SC-ba675b168a85 predicate script, the layer0-logcheck vtag + script; revert
filler-02's inert cycle-time-s logger line to return ctx.GetTag("TestMachine_002.TestDuration").Value;.
Redeploy (POST /api/deployments, X-Api-Key: docker-dev-deploy-key). DB/redeploy, not a
code change — done last so the rig stays available for verifying tasks 1–3/6.
5 — Galaxy reconnect recreate (high-risk)
Problem (verified)
GalaxyMxSession.ConnectAsync (Driver.Galaxy/Runtime/GalaxyMxSession.cs:58-69) is
idempotent: if (_session is not null) return;. When the gateway restarts and the session
goes Faulted/NotFound, _session is still a non-null (dead) handle, so ConnectAsync is a
silent no-op. GalaxyDriver.ReopenAsync (GalaxyDriver.cs:289) calls it expecting a
reconnect → no-op; ReconnectSupervisor.RecoveryLoopAsync
(Driver.Galaxy/Runtime/ReconnectSupervisor.cs:158-186) sees reopen "succeed", proceeds to
replay (which fails on the dead session), and loops forever (backoff capped 30s).
Decision — recreate on reopen
Add a recreate path to GalaxyMxSession (e.g. RecreateAsync/DisposeSessionForRecreationAsync)
that disposes + nulls _session and _ownedClient, and have ReopenAsync call it before
ConnectAsync so a reopen always routes through the happy-path create (OpenSessionAsync +
RegisterAsync). Confirm what status/exception marks Faulted/NotFound and that the dispose is
safe (gRPC channel teardown). Keep the supervisor's backoff loop; it now actually recovers.
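The interaction between the idempotent connect and the new recreate step can be illustrated in isolation. The sketch below is not GalaxyMxSession: RecreatableSession is a hypothetical class, a single IAsyncDisposable stands in for both _session and _ownedClient, and the injected factory stands in for OpenSessionAsync + RegisterAsync. Swallowing dispose errors on the dead handle is an assumption pending the teardown check described above.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

public sealed class RecreatableSession
{
    // Stands in for the happy-path create (OpenSessionAsync + RegisterAsync).
    private readonly Func<Task<IAsyncDisposable>> _open;
    private IAsyncDisposable? _session;

    public RecreatableSession(Func<Task<IAsyncDisposable>> open) => _open = open;

    public IAsyncDisposable? Current => _session;

    // Idempotent, like the existing ConnectAsync: a non-null handle short-circuits,
    // even when that handle is already dead.
    public async Task ConnectAsync()
    {
        if (_session is not null) return;
        _session = await _open();
    }

    // Reopen path: drop the (possibly dead) handle first so that ConnectAsync
    // is forced through the create path again.
    public async Task ReopenAsync()
    {
        var dead = _session;
        _session = null;
        if (dead is not null)
        {
            try { await dead.DisposeAsync(); }
            catch (Exception) { /* assumption: teardown of a dead handle is best-effort */ }
        }
        await ConnectAsync();
    }
}
```

Nulling the field before disposing means a throwing dispose can never leave the dead handle in place, which is the state that causes the silent no-op today.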
Tests
A reconnect test asserting that after a faulted session, the reopen path creates a new
session (new handle / OpenSessionAsync called again) rather than no-op'ing. Mirror any
existing ReconnectSupervisor/session tests.
6 — Live-pill circuit health (standard)
Problem (verified)
ScriptLog.razor:122 and Alerts.razor:132 set _connected = true once in
OnInitialized and never update it; the pill markup binds to that set-once bool.
IInProcessBroadcaster<T> (AdminUI/Hubs/IInProcessBroadcaster.cs) exposes only
Received + Publish — no health signal.
Decision
- Extend IInProcessBroadcaster<T> with a connection-health signal (bool IsConnected + event … ConnectionStateChanged). The bridge actors (ScriptLogSignalRBridge/AlertSignalRBridge) set it from their DPS-subscription health (SubscribeAck up / failure down). The razors bind the pill to it and subscribe/unsubscribe like DriverStatusPanel.razor (OnConnectionStateChanged → InvokeAsync(StateHasChanged)).
- Dead-circuit case (node recreate kills the server-side circuit; the component is dead and cannot self-update its pill): this is Blazor Server's built-in reconnection concern. Verify the default reconnect overlay is present/visible (it is what actually signals a dropped circuit) rather than trying to fake liveness from a dead component. If absent, add the standard Blazor reconnect UI.
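One possible shape for the health signal is sketched below, kept separate from the broadcaster so it stands alone. IConnectionHealth and ConnectionHealth are hypothetical names, and Action<bool> is an assumed event type (the design leaves the event's delegate type open); only IsConnected, ConnectionStateChanged and SetConnected are named in this document.

```csharp
#nullable enable
using System;

// Hypothetical interface; in the real change these members would live on IInProcessBroadcaster<T>.
public interface IConnectionHealth
{
    bool IsConnected { get; }
    event Action<bool>? ConnectionStateChanged; // delegate type is an assumption
}

public sealed class ConnectionHealth : IConnectionHealth
{
    private readonly object _gate = new();
    private bool _isConnected; // starts disconnected until the first SubscribeAck

    public bool IsConnected
    {
        get { lock (_gate) return _isConnected; }
    }

    public event Action<bool>? ConnectionStateChanged;

    // Called by the bridge actor: true on SubscribeAck, false on subscription failure.
    public void SetConnected(bool connected)
    {
        lock (_gate)
        {
            if (_isConnected == connected) return; // no event for a no-op change
            _isConnected = connected;
        }
        // Raised outside the lock so a slow handler cannot block other callers.
        ConnectionStateChanged?.Invoke(connected);
    }
}
```

A razor component would subscribe in OnInitialized, call InvokeAsync(StateHasChanged) from the handler, and unsubscribe in Dispose, matching the DriverStatusPanel.razor pattern cited above.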
Tests
Broadcaster unit test for the new health signal + SetConnected propagation. Razor proven by
docker-dev live-verify (no bUnit).
Sequencing & risk
| Item | Risk | Notes |
|---|---|---|
| 1 redundancy double-emit (A1 emit gate, A2 historian gate) | high-risk | independent subsystem; A1∥A2 (different files) |
| 5 Galaxy reconnect | high-risk | independent subsystem (Galaxy driver) |
| 2 Timed-shelve picker | small | razor-only, live-verify |
| 3a chip auto-clear / 3b CorrelationId | small | razor / mechanical refactor |
| 6 live-pill | standard | interface + bridge + 2 razors |
| 4 rig cleanup | trivial/operational | last |
The two high-risk items (1, 5) are in different subsystems and can run in parallel with each
other and with the UI items (2, 3, 6). Rig cleanup (4) is last. TDD where there's logic; UI
proven by docker-dev /run live-verify (agent drives — login disabled on docker-dev).
Hard rules
Stage by explicit path (never git add .); never stage sql_login.txt /
src/Server/.../Host/pki/; never echo the gateway API key into a new tracked file; no
force-push, no --no-verify; no Configuration entity / EF migration change (none of these
items needs one). Build on a feature branch off master.