docs(design): alarm ack/shelve follow-ups (redundancy double-emit, Timed-shelve UI, T21 minors, rig cleanup, Galaxy reconnect, live-pill)
@@ -0,0 +1,167 @@
# Alarm Ack/Shelve Follow-ups — Design

**Date:** 2026-06-11
**Status:** Approved (brainstorming); ready for implementation plan.

## Goal

Resolve the six notes/follow-ups left open by the T17–T24 inbound-alarm-ack work
(merged to master `bc9843d2`): the redundancy **double-emit**, the deferred Timed-shelve
UI, two T21 minors, the docker-dev rig cleanup, and two pre-existing Layer-1 gaps (Galaxy
reconnect, the live-pill). Scope decided in brainstorming: **everything**, including the
pre-existing L1 gaps. Double-emit approach decided: **primary-only at the source**.

---

## 1 — Redundancy double-emit fix (core; high-risk)

### Problem (verified)
Both central nodes run roles `admin,driver` in the same MAIN cluster. Each spawns its own
`ScriptedAlarmHostActor` + `ScriptedAlarmEngine` (`DriverHostActor.SpawnScriptedAlarmHost`,
per-node, no singleton) and, for a single-cluster artifact, loads the **same** alarms
(`DeploymentArtifact.ParseComposition` returns `ClusterFilterMode.None` → unfiltered). Both
engines evaluate independently and both hit `ScriptedAlarmHostActor.OnEngineEmission`:

- line 261 `_publishActor.Tell(new OpcUaPublishActor.AlarmStateUpdate(...))`: writes the OPC
  UA node (directed, local).
- line 278 `_mediator.Tell(new Publish(AlertsTopic /* "alerts" */, evt))`: **cluster-wide,
  ungated** → every transition appears **twice** on `/alerts`. Inbound ack/shelve commands
  are likewise processed by both nodes' engines.

`HistorianAdapterActor` (`Runtime/Historian/`, registered **per-node** in
`Runtime/ServiceCollectionExtensions.cs:146`, name `historian-adapter`) also consumes the
`alerts` topic, so even a single publish is historized **once per node** → double DB writes
in production (invisible in docker-dev, which uses `NullAlarmHistorianSink`).

### Existing signal to mirror
`RedundancyStateActor` (admin singleton, `ControlPlane/Redundancy/RedundancyStateActor.cs:90-114`)
publishes `RedundancyStateChanged` to the `redundancy-state` DPS topic. Per-node
`NodeRedundancyState` carries `RedundancyRole` (`Primary`/`Secondary`/`Detached`) +
`IsRoleLeaderForDriver`; **Primary = the driver-role cluster leader**. `OpcUaPublishActor`
already subscribes to this topic (`OpcUaPublishActor.cs:30,147,156,335`) to drive its
ServiceLevel. That is the exact pattern to copy: `Subscribe(RedundancyStateTopic, Self)` in
PreStart; `Receive<RedundancyStateChanged>`; read
`msg.Nodes.FirstOrDefault(n => n.NodeId == localNode)?.Role`.

### Decision — gate **only the cluster-wide emission**, on `Primary`
- **Emission gate (task A1):** `ScriptedAlarmHostActor` subscribes to `redundancy-state`,
  caches the local role, and **skips the `alerts` publish (line 278) when not Primary**.
  **Default = emit until a `RedundancyStateChanged` says this node is Secondary/Detached**,
  so single-node deploys (sole node is always the driver leader = Primary) and the boot
  window never drop transitions. The OPC UA node write (line 261) and **inbound command
  processing stay UNGATED**: the secondary must keep its address space warm and its engine
  state consistent for failover. Clients only ever see the Primary via ServiceLevel, so the
  secondary's node writes + its (subscriber-less) condition events are harmless.
- **Historian gate (task A2):** `HistorianAdapterActor` subscribes to `redundancy-state`,
  caches the local role, and **skips the sink write when not Primary**. This gives
  exactly-once historization for **all** alarm sources (native Galaxy/AB-CIP too), not just
  scripted.
- **Edge:** a brief failover window (old Primary gone, new not yet elected) may drop a
  transition/historization. This is acceptable, and identical to the existing ServiceLevel
  handoff behaviour.
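
The default-emit rule shared by A1 and A2 can be sketched as a small role cache. This is a minimal sketch, not the project's code: `PrimaryGate` and `ShouldEmit` are hypothetical names, and the three record/enum types are simplified stand-ins for the real redundancy contracts (the real `NodeRedundancyState` also carries `IsRoleLeaderForDriver`). Leaving the cached role unchanged when a snapshot omits the local node is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-ins for the real redundancy contracts (assumed shapes).
public enum RedundancyRole { Primary, Secondary, Detached }
public sealed record NodeRedundancyState(string NodeId, RedundancyRole Role);
public sealed record RedundancyStateChanged(IReadOnlyList<NodeRedundancyState> Nodes);

// Hypothetical helper: caches the local role and decides whether the
// cluster-wide publish (or the historian sink write) may happen on this node.
public sealed class PrimaryGate
{
    private readonly string _localNodeId;
    private RedundancyRole? _localRole; // null = no redundancy state seen yet

    public PrimaryGate(string localNodeId) => _localNodeId = localNodeId;

    // Intended to be called from the Receive<RedundancyStateChanged> handler.
    public void OnRedundancyStateChanged(RedundancyStateChanged msg)
    {
        var local = msg.Nodes.FirstOrDefault(n => n.NodeId == _localNodeId);
        // Assumption: a snapshot that omits this node leaves the cached role as-is.
        if (local is not null) _localRole = local.Role;
    }

    // Default-emit: only an explicit Secondary/Detached closes the gate, so the
    // boot window and single-node deploys never drop transitions.
    public bool ShouldEmit => _localRole is null or RedundancyRole.Primary;
}
```

In the actor, only the `alerts` publish would sit behind `if (gate.ShouldEmit)`; the `AlarmStateUpdate` tell and inbound command handling would stay outside the check.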

### Tests
TestKit: with a Secondary `RedundancyStateChanged`, `OnEngineEmission` does NOT publish to
`alerts` but DOES still `Tell` the OPC UA `AlarmStateUpdate`; with Primary (or before any
state) it publishes. Inbound `AlarmCommand` is still processed regardless of role. Historian:
a Secondary skips the sink write, a Primary writes (fake sink + role injection).

---

## 2 — Timed-shelve picker UI (small)

`Alerts.razor` currently exposes Ack / Shelve(OneShot) / Unshelve. The `Timed` backend is
fully wired + tested (`ShelveAlarmCommand{ShelveKind.Timed, UnshelveAtUtc}` →
`AdminOperationsActor` → `engine.TimedShelveAsync`). Add a small duration input (a
minutes/seconds number box, or a `datetime-local`) to the shelve control on each row and call
`ShelveAlarmAsync(kind: Timed, unshelveAtUtc: now + duration)`. Razor-only; **no backend
change**; proven by docker-dev live-verify (no bUnit). Keep the existing OneShot/Unshelve
buttons.
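
The `now + duration` computation for a minutes number box might look like the sketch below. `TimedShelve`, `TryComputeUnshelveAtUtc`, and the 7-day `MaxDuration` bound are hypothetical (the design does not specify limits), and the helper assumes the caller passes a UTC timestamp such as `DateTime.UtcNow`.

```csharp
using System;

// Hypothetical helper for the picker: turns the minutes input into UnshelveAtUtc.
public static class TimedShelve
{
    // Assumed upper bound; the design does not define one.
    public static readonly TimeSpan MaxDuration = TimeSpan.FromDays(7);

    public static bool TryComputeUnshelveAtUtc(double? minutes, DateTime nowUtc, out DateTime unshelveAtUtc)
    {
        unshelveAtUtc = default;

        // Reject empty input, NaN, non-positive values, and anything past the bound
        // (the bound check also keeps TimeSpan.FromMinutes from overflowing).
        if (minutes is not double m || double.IsNaN(m) || m <= 0 || m > MaxDuration.TotalMinutes)
            return false;

        // Assumption: nowUtc is already UTC; SpecifyKind only labels it as such.
        unshelveAtUtc = DateTime.SpecifyKind(nowUtc, DateTimeKind.Utc) + TimeSpan.FromMinutes(m);
        return true;
    }
}
```

The row handler would call `ShelveAlarmAsync` only when the helper returns `true`, and otherwise leave the row untouched or show a validation hint.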

## 3 — T21 minors (small)

- **Chip auto-clear:** the `_opResult*` chip persists until the next action. Add the ~8s
  auto-clear timer pattern `DriverStatusPanel.razor` already uses (a `Task.Delay` →
  `InvokeAsync(StateHasChanged)` that clears the per-row result).
- **CorrelationId consistency:** `AcknowledgeAlarmCommand`/`ShelveAlarmCommand` use a bare
  `Guid CorrelationId`; the project's other control-plane commands use the `CorrelationId`
  wrapper type. Switch both records + `AdminOperationsClient` (`CorrelationId.NewId()`) +
  the actor reply contract + tests to the wrapper for uniform correlation tracing.
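
For the chip auto-clear item, one way to keep an older timer from wiping a newer result is a version counter around the `Task.Delay`. This is a sketch under assumptions, not a copy of what `DriverStatusPanel.razor` does: `AutoClearingResult` and `ShowAsync` are hypothetical names, and the class is not thread-safe on its own (in a component, the continuation would run on the renderer's synchronization context).

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical per-row holder for the operation-result chip text.
public sealed class AutoClearingResult
{
    private int _version;

    public string? Message { get; private set; }

    // Shows the message, then clears it after the delay unless a newer
    // message has replaced it in the meantime.
    public async Task ShowAsync(string message, TimeSpan clearAfter, Func<Task> notifyChanged)
    {
        Message = message;
        var version = ++_version;

        await Task.Delay(clearAfter);

        if (version != _version) return; // superseded by a later ShowAsync call

        Message = null;
        await notifyChanged(); // in a component: () => InvokeAsync(StateHasChanged)
    }
}
```

A row handler would call `ShowAsync(text, TimeSpan.FromSeconds(8), () => InvokeAsync(StateHasChanged))` without awaiting it inside the click handler's critical path, so the chip appears immediately and clears later.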

## 4 — Rig cleanup (operational, last)

Delete the docker-dev seed artifacts left for live-verify: the `t12-overheat` scripted alarm,
the `SC-ba675b168a85` predicate script, the `layer0-logcheck` vtag + script; revert
filler-02's inert `cycle-time-s` logger line to `return ctx.GetTag("TestMachine_002.TestDuration").Value;`.
Redeploy (`POST /api/deployments`, `X-Api-Key: docker-dev-deploy-key`). This is a DB/redeploy
step, **not a code change**, and is done last so the rig stays available for verifying
tasks 1–3 and 6.

## 5 — Galaxy reconnect recreate (high-risk)

### Problem (verified)
`GalaxyMxSession.ConnectAsync` (`Driver.Galaxy/Runtime/GalaxyMxSession.cs:58-69`) is
idempotent: `if (_session is not null) return;`. When the gateway restarts and the session
goes Faulted/NotFound, `_session` is still a non-null (dead) handle, so `ConnectAsync` is a
silent no-op. `GalaxyDriver.ReopenAsync` (`GalaxyDriver.cs:289`) calls it expecting a
reconnect and gets the no-op; `ReconnectSupervisor.RecoveryLoopAsync`
(`Driver.Galaxy/Runtime/ReconnectSupervisor.cs:158-186`) sees reopen "succeed", proceeds to
replay (which fails on the dead session), and **loops forever** (backoff capped at 30s).

### Decision — recreate on reopen
Add a recreate path to `GalaxyMxSession` (e.g. `RecreateAsync`/`DisposeSessionForRecreationAsync`)
that disposes + nulls `_session` and `_ownedClient`, and have `ReopenAsync` call it **before**
`ConnectAsync` so a reopen always routes through the happy-path create (`OpenSessionAsync` +
`RegisterAsync`). Confirm what status/exception marks Faulted/NotFound and that the dispose is
safe (gRPC channel teardown). Keep the supervisor's backoff loop; it now actually recovers.
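
The dispose-then-connect ordering described above can be illustrated with a reduced model. Everything here is an assumption made for illustration: `RecreatableSession` stands in for `GalaxyMxSession`, the injected `openSession` delegate stands in for `OpenSessionAsync` + `RegisterAsync`, the `_ownedClient` teardown is omitted, and swallowing dispose failures is a choice the real change would need to confirm. The sketch is not synchronized; it assumes the supervisor serializes reopen attempts.

```csharp
using System;
using System.Threading.Tasks;

// Reduced, hypothetical model of the session wrapper.
public sealed class RecreatableSession
{
    private readonly Func<Task<IAsyncDisposable>> _openSession; // stand-in for the real open + register
    private IAsyncDisposable? _session;

    public RecreatableSession(Func<Task<IAsyncDisposable>> openSession) => _openSession = openSession;

    // Idempotent, as today: a non-null handle (even a dead one) short-circuits.
    public async Task ConnectAsync()
    {
        if (_session is not null) return;
        _session = await _openSession();
    }

    // Drops the current handle so the next ConnectAsync takes the create path.
    public async Task DisposeSessionForRecreationAsync()
    {
        var dead = _session;
        _session = null; // null first, so a failing dispose cannot leave the dead handle in place
        if (dead is null) return;

        try { await dead.DisposeAsync(); }
        catch (Exception) { /* assumption: best-effort teardown of an already faulted handle */ }
    }

    // Reopen always recreates: dispose first, then the normal connect.
    public async Task ReopenAsync()
    {
        await DisposeSessionForRecreationAsync();
        await ConnectAsync();
    }
}
```

With this shape the supervisor's loop is unchanged; only the meaning of a successful reopen changes from "handle exists" to "a fresh handle was created".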

### Tests
A reconnect test asserting that after a faulted session, the reopen path **creates a new
session** (new handle / `OpenSessionAsync` called again) rather than no-op'ing. Mirror any
existing `ReconnectSupervisor`/session tests.

## 6 — Live-pill circuit health (standard)

### Problem (verified)
`ScriptLog.razor:122` and `Alerts.razor:132` set `_connected = true` once in
`OnInitialized` and never update it; the pill markup binds to that set-once bool.
`IInProcessBroadcaster<T>` (`AdminUI/Hubs/IInProcessBroadcaster.cs`) exposes only
`Received` + `Publish`, with no health signal.

### Decision
- Extend `IInProcessBroadcaster<T>` with a connection-health signal (`bool IsConnected` +
  `event … ConnectionStateChanged`). The bridge actors (`ScriptLogSignalRBridge` /
  `AlertSignalRBridge`) set it from their DPS-subscription health (SubscribeAck up / failure
  down). The razors bind the pill to it and subscribe/unsubscribe like
  `DriverStatusPanel.razor` (`OnConnectionStateChanged` → `InvokeAsync(StateHasChanged)`).
- **Dead-circuit case** (node recreate kills the server-side circuit, so the component is
  dead and cannot self-update its pill): this is Blazor Server's built-in reconnection concern.
  **Verify the default reconnect overlay is present/visible** (it is what actually signals a
  dropped circuit) rather than trying to fake liveness from a dead component. If absent, add
  the standard Blazor reconnect UI.
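
The health signal from the first bullet could be factored as a small state holder that the broadcaster implementation delegates to. This is a sketch under assumptions: `ConnectionHealth` is a hypothetical class, and `Action<bool>` is an assumed delegate type, since the design leaves the event's type open. Starting in the disconnected state until the first `SetConnected(true)` is also an assumption.

```csharp
using System;

// Hypothetical holder for the broadcaster's connection-health signal.
public sealed class ConnectionHealth
{
    private readonly object _lock = new();
    private bool _isConnected; // assumption: disconnected until the bridge reports otherwise

    public bool IsConnected
    {
        get { lock (_lock) return _isConnected; }
    }

    // Assumed delegate type; raised only on an actual state change.
    public event Action<bool>? ConnectionStateChanged;

    // Called by the bridge actor: true on SubscribeAck, false on subscription failure.
    public void SetConnected(bool connected)
    {
        lock (_lock)
        {
            if (_isConnected == connected) return; // no change, no event
            _isConnected = connected;
        }

        // Raised outside the lock so handlers cannot deadlock against SetConnected.
        ConnectionStateChanged?.Invoke(connected);
    }
}
```

A page would subscribe in `OnInitialized`, call `InvokeAsync(StateHasChanged)` from the handler, and unsubscribe in `Dispose`, which matches the subscribe/unsubscribe pattern the design points to.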

### Tests
Broadcaster unit test for the new health signal + `SetConnected` propagation. Razor proven by
docker-dev live-verify (no bUnit).

---

## Sequencing & risk

| Item | Risk | Notes |
|---|---|---|
| 1 redundancy double-emit (A1 emit gate, A2 historian gate) | high-risk | independent subsystem; A1∥A2 (different files) |
| 5 Galaxy reconnect | high-risk | independent subsystem (Galaxy driver) |
| 2 Timed-shelve picker | small | razor-only, live-verify |
| 3a chip auto-clear / 3b CorrelationId | small | razor / mechanical refactor |
| 6 live-pill | standard | interface + bridge + 2 razors |
| 4 rig cleanup | trivial/operational | last |

The two high-risk items (1, 5) are in different subsystems and can run in parallel with each
other and with the UI items (2, 3, 6). Rig cleanup (4) is last. TDD where there's logic; UI
proven by docker-dev `/run` live-verify (agent drives; login is disabled on docker-dev).

## Hard rules

Stage by explicit path (never `git add .`); never stage `sql_login.txt` /
`src/Server/.../Host/pki/`; never echo the gateway API key into a **new** tracked file; no
force-push, no `--no-verify`; **no Configuration entity / EF migration change** (none of these
items needs one). Build on a feature branch off `master`.