docs(design): alarm ack/shelve follow-ups (redundancy double-emit, Timed-shelve UI, T21 minors, rig cleanup, Galaxy reconnect, live-pill)
@@ -0,0 +1,167 @@

# Alarm Ack/Shelve Follow-ups — Design

**Date:** 2026-06-11
**Status:** Approved (brainstorming) — ready for implementation plan.

## Goal

Resolve the six notes/follow-ups left open by the T17–T24 inbound-alarm-ack work
(merged to master `bc9843d2`): the redundancy **double-emit**, the deferred Timed-shelve
UI, two T21 minors, the docker-dev rig cleanup, and two pre-existing Layer-1 gaps (Galaxy
reconnect, the live-pill). Scope decided in brainstorming: **everything**, including the
pre-existing L1 gaps. Double-emit approach decided: **primary-only at the source**.

---

## 1 — Redundancy double-emit fix (core; high-risk)

### Problem (verified)
Both central nodes run roles `admin,driver` in the same MAIN cluster. Each spawns its own
`ScriptedAlarmHostActor` + `ScriptedAlarmEngine` (`DriverHostActor.SpawnScriptedAlarmHost`,
per-node, no singleton) and, for a single-cluster artifact, loads the **same** alarms
(`DeploymentArtifact.ParseComposition` returns `ClusterFilterMode.None` → unfiltered). Both
independently evaluate and both hit `ScriptedAlarmHostActor.OnEngineEmission`:

- line 261 `_publishActor.Tell(new OpcUaPublishActor.AlarmStateUpdate(...))` — writes the OPC
  UA node (directed, local).
- line 278 `_mediator.Tell(new Publish(AlertsTopic /* "alerts" */, evt))` — **cluster-wide,
  ungated** → every transition appears **twice** on `/alerts`. Inbound ack/shelve commands
  are likewise processed by both nodes' engines.

`HistorianAdapterActor` (`Runtime/Historian/`, registered **per-node** in
`Runtime/ServiceCollectionExtensions.cs:146`, name `historian-adapter`) also consumes the
`alerts` topic, so a single publish is historized **once per node** → double DB writes in
production (invisible in docker-dev: `NullAlarmHistorianSink`).

### Existing signal to mirror
`RedundancyStateActor` (admin singleton, `ControlPlane/Redundancy/RedundancyStateActor.cs:90-114`)
publishes `RedundancyStateChanged` to the `redundancy-state` DPS topic. Per-node
`NodeRedundancyState` carries `RedundancyRole` (`Primary`/`Secondary`/`Detached`) +
`IsRoleLeaderForDriver`; **Primary = the driver-role cluster leader**. `OpcUaPublishActor`
already subscribes to this topic (`OpcUaPublishActor.cs:30,147,156,335`) to drive its
ServiceLevel — the exact pattern to copy (`Subscribe(RedundancyStateTopic, Self)` in
PreStart; `Receive<RedundancyStateChanged>`; read
`msg.Nodes.FirstOrDefault(n => n.NodeId == localNode)?.Role`).
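
The role lookup described above can be modelled in plain C# without Akka. This is only a sketch: the record shapes for `NodeRedundancyState` and `RedundancyStateChanged` are assumed from the member names mentioned in this section (the real types may differ, for example in the type of `NodeId`), and `LocalRole.Resolve` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed shapes, inferred from the member names used in this document.
public enum RedundancyRole { Primary, Secondary, Detached }

public sealed record NodeRedundancyState(
    string NodeId, RedundancyRole Role, bool IsRoleLeaderForDriver);

public sealed record RedundancyStateChanged(IReadOnlyList<NodeRedundancyState> Nodes);

public static class LocalRole
{
    // Hypothetical helper: returns the local node's role, or null when the
    // snapshot does not mention this node at all.
    public static RedundancyRole? Resolve(RedundancyStateChanged msg, string localNode) =>
        msg.Nodes.FirstOrDefault(n => n.NodeId == localNode)?.Role;
}
```

The nullable result matters for the gate: a missing entry is distinguishable from an explicit `Secondary` or `Detached`, so the caller can decide how to treat "unknown".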

### Decision — gate **only the cluster-wide emission**, on `Primary`
- **Emission gate (task A1):** `ScriptedAlarmHostActor` subscribes to `redundancy-state`,
  caches the local role, and **skips the `alerts` publish (line 278) when not Primary**.
  **Default = emit until a `RedundancyStateChanged` says this node is Secondary/Detached** —
  so single-node deploys (sole node is always the driver leader = Primary) and the boot
  window never drop transitions. The OPC UA node write (line 261) and **inbound command
  processing stay UNGATED** — the secondary must keep its address space warm and its engine
  state consistent for failover; clients only ever see the Primary via ServiceLevel, so the
  secondary's node writes + its (subscriber-less) condition events are harmless.
- **Historian gate (task A2):** `HistorianAdapterActor` subscribes to `redundancy-state`,
  caches the local role, and **skips the sink write when not Primary**. Gives exactly-once
  historization for **all** alarm sources (native Galaxy/AB-CIP too), not just scripted.
- **Edge:** a brief failover window (old Primary gone, new not yet elected) may drop a
  transition/historization — acceptable, and identical to the existing ServiceLevel handoff
  behaviour.
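
The gating rule above (emit by default, suppress only once told Secondary or Detached, never gate the local node write) can be sketched as a small Akka-free model. `PrimaryGate` and `EmissionModel` are hypothetical names for illustration. Treating a snapshot that omits the local node as "unknown, keep emitting" is an assumption, not something this design specifies.

```csharp
using System;
using System.Collections.Generic;

public enum RedundancyRole { Primary, Secondary, Detached }

// Hypothetical helper that both the emission gate (A1) and the historian gate (A2) could share.
public sealed class PrimaryGate
{
    // null = no role known yet (boot window, or the snapshot omitted this node).
    private RedundancyRole? _localRole;

    // Call from the RedundancyStateChanged handler with the local node's role.
    public void OnLocalRole(RedundancyRole? role) => _localRole = role;

    // Emit by default; suppress only after being told Secondary or Detached.
    public bool ShouldEmit => _localRole is null or RedundancyRole.Primary;
}

// Stand-in for OnEngineEmission: the lists replace the two Tell calls.
public sealed class EmissionModel
{
    public PrimaryGate Gate { get; } = new();
    public List<string> LocalNodeWrites { get; } = new();  // OPC UA AlarmStateUpdate (ungated)
    public List<string> AlertsPublishes { get; } = new();  // cluster-wide "alerts" publish (gated)

    public void OnEngineEmission(string transition)
    {
        LocalNodeWrites.Add(transition);
        if (Gate.ShouldEmit) AlertsPublishes.Add(transition);
    }
}
```

With this shape a Secondary keeps writing its own address space while staying silent on `alerts`, and a node that has never received a role behaves like today's code.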

### Tests
TestKit: with a Secondary `RedundancyStateChanged`, `OnEngineEmission` does NOT publish to
`alerts` but DOES still `Tell` the OPC UA `AlarmStateUpdate`; with Primary (or before any
state) it publishes. Inbound `AlarmCommand` is still processed regardless of role. Historian:
a Secondary skips the sink write, a Primary writes (fake sink + role injection).

---

## 2 — Timed-shelve picker UI (small)

`Alerts.razor` currently exposes Ack / Shelve(OneShot) / Unshelve. The `Timed` backend is
fully wired + tested (`ShelveAlarmCommand{ShelveKind.Timed, UnshelveAtUtc}` →
`AdminOperationsActor` → `engine.TimedShelveAsync`). Add a small duration input (a
minutes/seconds number box, or a `datetime-local`) to the shelve control on each row and call
`ShelveAlarmAsync(kind: Timed, unshelveAtUtc: now + duration)`. Razor-only; **no backend
change**; proven by docker-dev live-verify (no bUnit). Keep the existing OneShot/Unshelve
buttons.
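
The `now + duration` computation for the number-box variant could look like the following sketch. `TimedShelve.TryGetUnshelveAtUtc` is a hypothetical helper, and rejecting non-positive durations is an assumption about the desired validation, not a rule stated in this design.

```csharp
using System;

public static class TimedShelve
{
    // Hypothetical helper: converts the picker's minutes value into the absolute
    // UTC deadline that would be passed as unshelveAtUtc.
    // Returns false for a non-positive duration or a non-UTC "now".
    public static bool TryGetUnshelveAtUtc(int minutes, DateTime nowUtc, out DateTime unshelveAtUtc)
    {
        unshelveAtUtc = default;
        if (minutes <= 0 || nowUtc.Kind != DateTimeKind.Utc) return false;

        unshelveAtUtc = nowUtc.AddMinutes(minutes); // AddMinutes preserves DateTimeKind.Utc
        return true;
    }
}
```

A duration box sidesteps timezone handling: a `datetime-local` value carries no offset and reflects the browser's clock, so the server would need the client's timezone to convert it to UTC correctly.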

## 3 — T21 minors (small)

- **Chip auto-clear:** the `_opResult*` chip persists until the next action. Add the ~8s
  auto-clear timer pattern `DriverStatusPanel.razor` already uses (a `Task.Delay` →
  `InvokeAsync(StateHasChanged)` that clears the per-row result).
- **CorrelationId consistency:** `AcknowledgeAlarmCommand`/`ShelveAlarmCommand` use a bare
  `Guid CorrelationId`; the project's other control-plane commands use the `CorrelationId`
  wrapper type. Switch both records + `AdminOperationsClient` (`CorrelationId.NewId()`) +
  the actor reply contract + tests to the wrapper for uniform correlation tracing.
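
The chip auto-clear can be illustrated with a small holder class. This sketch is not copied from `DriverStatusPanel.razor`: `AutoClearingResult` is a hypothetical name, and cancelling the previous timer when a newer result arrives is an assumed refinement. In a component the `notifyChanged` callback would be `() => InvokeAsync(StateHasChanged)`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical holder for one row's result chip.
public sealed class AutoClearingResult : IDisposable
{
    private readonly TimeSpan _lifetime;
    private readonly Func<Task> _notifyChanged;
    private CancellationTokenSource? _cts;

    public string? Text { get; private set; }

    public AutoClearingResult(TimeSpan lifetime, Func<Task> notifyChanged)
    {
        _lifetime = lifetime;
        _notifyChanged = notifyChanged;
    }

    // Shows a result and (re)starts the clear timer; a newer result cancels the older timer.
    public void Show(string text)
    {
        _cts?.Cancel();
        _cts?.Dispose();
        var cts = _cts = new CancellationTokenSource();
        Text = text;
        _ = ClearLaterAsync(cts.Token);
    }

    private async Task ClearLaterAsync(CancellationToken token)
    {
        try
        {
            await Task.Delay(_lifetime, token);
        }
        catch (OperationCanceledException)
        {
            return; // superseded by a newer result, or disposed
        }
        if (token.IsCancellationRequested) return;
        Text = null;
        await _notifyChanged();
    }

    // Call from the component's Dispose so a pending timer never touches a dead component.
    public void Dispose()
    {
        _cts?.Cancel();
        _cts?.Dispose();
        _cts = null;
    }
}
```

The sketch assumes `Show` and the timer continuation run on the same synchronization context, as they would inside a Blazor component. On free threads the final clear would need a lock.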

## 4 — Rig cleanup (operational, last)

Delete the docker-dev seed artifacts left for live-verify: the `t12-overheat` scripted alarm,
the `SC-ba675b168a85` predicate script, the `layer0-logcheck` vtag + script; revert
filler-02's inert `cycle-time-s` logger line to `return ctx.GetTag("TestMachine_002.TestDuration").Value;`.
Redeploy (`POST /api/deployments`, `X-Api-Key: docker-dev-deploy-key`). DB/redeploy, **not a
code change** — done last so the rig stays available for verifying items 1–3 and 6.

## 5 — Galaxy reconnect recreate (high-risk)

### Problem (verified)
`GalaxyMxSession.ConnectAsync` (`Driver.Galaxy/Runtime/GalaxyMxSession.cs:58-69`) is
idempotent: `if (_session is not null) return;`. When the gateway restarts and the session
goes Faulted/NotFound, `_session` is still a non-null (dead) handle, so `ConnectAsync` is a
silent no-op. `GalaxyDriver.ReopenAsync` (`GalaxyDriver.cs:289`) calls it expecting a
reconnect → no-op; `ReconnectSupervisor.RecoveryLoopAsync`
(`Driver.Galaxy/Runtime/ReconnectSupervisor.cs:158-186`) sees reopen "succeed", proceeds to
replay (which fails on the dead session), and **loops forever** (backoff capped 30s).

### Decision — recreate on reopen
Add a recreate path to `GalaxyMxSession` (e.g. `RecreateAsync`/`DisposeSessionForRecreationAsync`)
that disposes + nulls `_session` and `_ownedClient`, and have `ReopenAsync` call it **before**
`ConnectAsync` so a reopen always routes through the happy-path create (`OpenSessionAsync` +
`RegisterAsync`). Confirm what status/exception marks Faulted/NotFound and that the dispose is
safe (gRPC channel teardown). Keep the supervisor's backoff loop; it now actually recovers.
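
A minimal model of the recreate-on-reopen decision, using a delegate in place of the real gateway calls. `ISessionHandle` and `SessionModel` are hypothetical stand-ins, not the real `GalaxyMxSession`. Swallowing an exception thrown while disposing a dead handle is an assumption that still has to be confirmed against the actual gRPC teardown behaviour.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the gateway session handle.
public interface ISessionHandle : IAsyncDisposable { }

public sealed class SessionModel
{
    private readonly Func<Task<ISessionHandle>> _open; // stands in for OpenSessionAsync + RegisterAsync
    private ISessionHandle? _session;

    public SessionModel(Func<Task<ISessionHandle>> open) => _open = open;

    // Idempotent, as today: any non-null handle short-circuits, even a dead one.
    public async Task ConnectAsync()
    {
        if (_session is not null) return;
        _session = await _open();
    }

    // Proposed recreate path: null the field first, then dispose the old handle.
    public async Task DisposeSessionForRecreationAsync()
    {
        var old = _session;
        _session = null;
        if (old is null) return;
        try { await old.DisposeAsync(); }
        catch (Exception) { /* assumption: a dead handle may throw on dispose; ignore */ }
    }

    // Reopen always routes through the create path.
    public async Task ReopenAsync()
    {
        await DisposeSessionForRecreationAsync();
        await ConnectAsync();
    }
}
```

Nulling the field before disposing means a failed dispose still leaves the model in the "no session" state, so the next `ConnectAsync` creates a fresh handle instead of short-circuiting.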

### Tests
A reconnect test asserting that after a faulted session, the reopen path **creates a new
session** (new handle / `OpenSessionAsync` called again) rather than no-op'ing. Mirror any
existing `ReconnectSupervisor`/session tests.

## 6 — Live-pill circuit health (standard)

### Problem (verified)
`ScriptLog.razor:122` and `Alerts.razor:132` set `_connected = true` once in
`OnInitialized` and never update it; the pill markup binds to that set-once bool.
`IInProcessBroadcaster<T>` (`AdminUI/Hubs/IInProcessBroadcaster.cs`) exposes only
`Received` + `Publish` — no health signal.

### Decision
- Extend `IInProcessBroadcaster<T>` with a connection-health signal (`bool IsConnected` +
  `event … ConnectionStateChanged`). The bridge actors (`ScriptLogSignalRBridge` /
  `AlertSignalRBridge`) set it from their DPS-subscription health (SubscribeAck up / failure
  down). The razors bind the pill to it and subscribe/unsubscribe like
  `DriverStatusPanel.razor` (`OnConnectionStateChanged` → `InvokeAsync(StateHasChanged)`).
- **Dead-circuit case** (node recreate kills the server-side circuit — the component is dead
  and cannot self-update its pill): this is Blazor Server's built-in reconnection concern.
  **Verify the default reconnect overlay is present/visible** (it is what actually signals a
  dropped circuit) rather than trying to fake liveness from a dead component. If absent, add
  the standard Blazor reconnect UI.
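
One possible shape for the health signal on the broadcaster, as a sketch under stated assumptions. The event's delegate type is left open in this design, so `Action<bool>` is only a placeholder. The `Received`/`Publish` signatures are guessed from their names. Starting in the disconnected state until the first `SubscribeAck` is also an assumption.

```csharp
using System;

// Sketch of the extended contract; all member signatures here are assumptions.
public interface IInProcessBroadcaster<T>
{
    event Action<T>? Received;
    void Publish(T item);

    bool IsConnected { get; }
    event Action<bool>? ConnectionStateChanged; // placeholder delegate type
}

public sealed class InProcessBroadcaster<T> : IInProcessBroadcaster<T>
{
    private readonly object _lock = new();
    private bool _connected; // assumed to start "down" until the bridge reports SubscribeAck

    public event Action<T>? Received;
    public event Action<bool>? ConnectionStateChanged;

    public bool IsConnected
    {
        get { lock (_lock) return _connected; }
    }

    public void Publish(T item) => Received?.Invoke(item);

    // Called by the bridge actor: true on SubscribeAck, false on subscription failure.
    // Raises the event only when the value actually changes.
    public void SetConnected(bool connected)
    {
        lock (_lock)
        {
            if (_connected == connected) return;
            _connected = connected;
        }
        ConnectionStateChanged?.Invoke(connected); // raised outside the lock
    }
}
```

Raising the event only on a real change keeps the razors from re-rendering on repeated acks, and invoking handlers outside the lock avoids holding it while a component calls `InvokeAsync(StateHasChanged)`.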

### Tests
Broadcaster unit test for the new health signal + `SetConnected` propagation. Razor proven by
docker-dev live-verify (no bUnit).

---

## Sequencing & risk

| Item | Risk | Notes |
|---|---|---|
| 1 redundancy double-emit (A1 emit gate, A2 historian gate) | high-risk | independent subsystem; A1∥A2 (different files) |
| 5 Galaxy reconnect | high-risk | independent subsystem (Galaxy driver) |
| 2 Timed-shelve picker | small | razor-only, live-verify |
| 3a chip auto-clear / 3b CorrelationId | small | razor / mechanical refactor |
| 6 live-pill | standard | interface + bridge + 2 razors |
| 4 rig cleanup | trivial/operational | last |

The two high-risk items (1, 5) are in different subsystems and can run in parallel with each
other and with the UI items (2, 3, 6). Rig cleanup (4) is last. TDD where there's logic; UI
proven by docker-dev `/run` live-verify (agent drives — login disabled on docker-dev).

## Hard rules

Stage by explicit path (never `git add .`); never stage `sql_login.txt` /
`src/Server/.../Host/pki/`; never echo the gateway API key into a **new** tracked file; no
force-push, no `--no-verify`; **no Configuration entity / EF migration change** (none of these
items needs one). Build on a feature branch off `master`.