docs: redundancy primary-only alarm emission/historization, redundancy-state NodeId+heartbeat fixes, Galaxy reconnect, live-pill

2026-06-11 10:18:25 -04:00
parent 7891e28b52
commit 521ee75355
4 changed files with 21 additions and 2 deletions
@@ -153,6 +153,10 @@ The AdminUI `/alerts` Shelve flow was live-verified on docker-dev
 2026-06-11: singleton → topic → host actor → engine → "Shelved" status
 reflected on `/alerts` with the operator identity threaded through.

+## Redundancy deduplication
+
+Under warm/hot redundancy, both cluster nodes run `ScriptedAlarmHostActor` and the scripted-alarm engine. To prevent duplicate `/alerts` rows and duplicate historian writes, alarm transition publication to the `alerts` topic and `HistorianAdapterActor` historization are **Primary-gated**: only the node whose `RedundancyRole` is `Primary` publishes externally. OPC UA condition-node writes and inbound ack/shelve processing remain ungated on both nodes so the secondary stays warm for failover. See [Redundancy.md §Primary-gated alarm emission and historization](Redundancy.md#primary-gated-alarm-emission-and-historization).
+
 ## Historian write-back (non-Galaxy alarms)

 Scripted alarms (and any future non-Galaxy `IAlarmSource` like
@@ -111,6 +111,8 @@ Both nodes share the same `ConfigDb` connection string; `Cluster.PublicHostname`

 There is no longer a `Node:NodeId` setting and no `ClusterNode.RedundancyRole` column (the V2 migration dropped it — primary/secondary is now derived from cluster role-leadership). NodeId is derived as `host:port` of the cluster `PublicHostname` (see `ClusterRoleInfo.LocalNode` for the formula).

+> **`RedundancyStateActor` NodeId consistency (fixed).** `RedundancyStateActor` now keys each node's `NodeRedundancyState` entry by the canonical `host:port` node id (via a `ToNodeId(Address)` helper mirroring `ClusterRoleInfo.ToNodeId`). Previously it keyed by `member.Address.Host` (host-only, e.g. `central-2`). Because every subscriber looks up its entry by the canonical `host:port` form, no node ever matched its own entry, and the failure was silent: all nodes stayed at the default ServiceLevel 255 and never learned their role. The fix brings `RedundancyStateActor` in line with the NodeId contract stated above. Additionally, `RedundancyStateActor` now **re-publishes the current snapshot on a periodic heartbeat (default 10 s)**, so any node that subscribes after the last topology-change publish converges within one interval (DistributedPubSub does not replay to late subscribers).
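The keying mismatch described in the note above can be illustrated with a minimal Python sketch. It is not the project's code: `to_node_id`, the host name `central-1`, the port `4053`, and the role strings are hypothetical stand-ins chosen only to show why a host-only key never matches a `host:port` lookup.

```python
from typing import Dict, Optional


def to_node_id(host: str, port: int) -> str:
    """Hypothetical analogue of the canonical host:port node id."""
    return f"{host}:{port}"


# Old behaviour: snapshot keyed by host only.
states_by_host: Dict[str, str] = {
    "central-1": "Primary",
    "central-2": "Secondary",
}

# Fixed behaviour: snapshot keyed by the canonical host:port id.
states_by_node_id: Dict[str, str] = {
    to_node_id("central-1", 4053): "Primary",
    to_node_id("central-2", 4053): "Secondary",
}

# Every subscriber looks itself up by the canonical id.
local_node_id = to_node_id("central-2", 4053)

old_lookup: Optional[str] = states_by_host.get(local_node_id)
new_lookup: Optional[str] = states_by_node_id.get(local_node_id)

print(old_lookup)  # no entry found, so the node never learns its role
print(new_lookup)  # the node finds its own entry
```

With the host-only keys the lookup returns nothing and raises no error, which is why the original defect went unnoticed.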
+
 The `ClusterNode.ServiceLevelBase` column still exists and is editable in the Admin UI (NodeEdit / Cluster Redundancy pages), but it no longer drives the runtime ServiceLevel — that value is computed from cluster role/health and published per the mapping above, independent of this stored preference.

 ### Peer URI advertising
@@ -133,6 +135,17 @@ Node A lists Node B's `ApplicationUri` and vice-versa. Validated by `DualEndpoin

 There is no operator-driven role swap during a partition. Failover is what the cluster does automatically.

+## Primary-gated alarm emission and historization
+
+Under warm/hot redundancy both cluster nodes run `ScriptedAlarmHostActor` and evaluate scripted alarms, keeping each node's address space and engine state warm for instant failover. However, to avoid duplicate rows on `/alerts` and duplicate historian writes, only the Primary node publishes externally:
+
+- **`alerts` topic emission** — `ScriptedAlarmHostActor` subscribes to the `redundancy-state` DPS topic and caches the local node's `RedundancyRole`. Each alarm transition is published to the cluster `alerts` topic **only when the node's role is `Primary`**. The default behaviour before any `redundancy-state` message arrives is to emit, so single-node deployments and the boot window never drop transitions. The OPC UA condition-node write and inbound ack/shelve command processing remain **ungated** on both nodes so the secondary is always ready to serve clients after a failover.
+- **`HistorianAdapterActor` historization** — likewise Primary-gated so alarm historization is exactly-once across all alarm sources. (This actor currently has no production feeder; the gate is a forward-looking guard.)
+
+Net effect: each alarm transition appears **once** on `/alerts` and, once `HistorianAdapterActor` has a production feeder, would be historized once rather than once per node.
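The gating rule in the bullets above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actor's implementation: `AlarmHostSketch`, its method names, and the `"Secondary"` role string are hypothetical. Only the rule itself (emit by default, publish only when `Primary`, leave the condition-node write ungated) comes from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AlarmHostSketch:
    """Hypothetical stand-in for the Primary gate; not the real actor."""

    # None until the first redundancy-state message arrives.
    cached_role: Optional[str] = None
    published: List[str] = field(default_factory=list)
    condition_writes: List[str] = field(default_factory=list)

    def on_redundancy_state(self, role: str) -> None:
        # Cache the local node's role from the redundancy-state topic.
        self.cached_role = role

    def should_publish(self) -> bool:
        # Default-emit: with no role known yet (boot window, single node)
        # transitions are still published.
        return self.cached_role is None or self.cached_role == "Primary"

    def on_transition(self, transition: str) -> None:
        # Ungated: the condition-node write happens on every node.
        self.condition_writes.append(transition)
        # Gated: only the Primary publishes to the alerts topic.
        if self.should_publish():
            self.published.append(transition)


host = AlarmHostSketch()
host.on_transition("t1")              # no role yet, so it is published
host.on_redundancy_state("Secondary")
host.on_transition("t2")              # suppressed on the secondary
host.on_redundancy_state("Primary")
host.on_transition("t3")              # published after promotion
```

Keeping the condition-node write outside the gate is what lets the secondary's state stay warm while its external output is suppressed.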
+
+See [ScriptedAlarms.md](ScriptedAlarms.md) and [AlarmTracking.md](AlarmTracking.md) for the scripted-alarm engine internals.
+
 ## Client-side failover

 The OtOpcUa Client CLI at `src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI` supports `-F` / `--failover-urls` for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See [`Client.CLI.md`](Client.CLI.md).
@@ -125,7 +125,9 @@ On allow, the handler publishes a `Commons.OpcUa.AlarmCommand` (containing comma

 ### AdminUI path

-The AdminUI `/alerts` page (`Alerts.razor`) shows per-row **Acknowledge / Shelve / Unshelve** buttons. These are gated by the `DriverOperator` AdminUI policy and routed through the `AdminOperationsActor` cluster singleton (`AcknowledgeAlarmCommand` / `ShelveAlarmCommand`), which publishes onto the same `alarm-commands` topic. Cross-node routing is handled by the cluster singleton — the command always reaches the driver-role node hosting the engine that owns the alarm regardless of which AdminUI instance the operator is connected to.
+The AdminUI `/alerts` page (`Alerts.razor`) shows per-row **Acknowledge / Shelve / Unshelve** buttons. These are gated by the `DriverOperator` AdminUI policy and routed through the `AdminOperationsActor` cluster singleton (`AcknowledgeAlarmCommand` / `ShelveAlarmCommand`), which publishes onto the same `alarm-commands` topic. Cross-node routing is handled by the cluster singleton — the command always reaches the driver-role node hosting the engine that owns the alarm regardless of which AdminUI instance the operator is connected to. The Shelve button accepts a duration (minutes) and passes it to `ShelveAlarmAsync(Timed, unshelveAtUtc)`; result chips auto-clear after ~8 s. The `/alerts` and `/script-log` live-status pills reflect real feed health driven by the SignalR bridge actors' DPS subscription state.
+
+Under redundancy, only the Primary node publishes to the `alerts` topic, so each transition appears as a single row on `/alerts`. See [Redundancy.md §Primary-gated alarm emission and historization](Redundancy.md#primary-gated-alarm-emission-and-historization).

 ### Client.CLI path

@@ -90,7 +90,7 @@ Full per-field descriptions live in `Config/GalaxyDriverOptions.cs`. The full JS

 ## Reconnect + Replay

-`ReconnectSupervisor` owns an exponential-backoff loop bounded by `Reconnect.InitialBackoffMs` / `MaxBackoffMs`. On session loss it tears down the gRPC channel, redials, and — when `ReplayOnSessionLost = true` — calls the gateway's `ReplaySubscriptions` RPC with the cached subscription set from `SubscriptionRegistry` instead of re-subscribing tag-by-tag. The gateway's worker then re-issues `AdviseSupervisory` server-side under the apartment lock.
+`ReconnectSupervisor` owns an exponential-backoff loop bounded by `Reconnect.InitialBackoffMs` / `MaxBackoffMs`. On session loss it calls `GalaxyDriver.ReopenAsync`, which invokes `GalaxyMxSession.RecreateAsync` to dispose the stale/faulted session and client before rebuilding (`OpenSessionAsync` + `RegisterAsync`). Previously `ConnectAsync` was a no-op when a stale session handle was still present, so the reopen supervisor looped forever without recovering. After a successful reopen — when `ReplayOnSessionLost = true` — the supervisor calls the gateway's `ReplaySubscriptions` RPC with the cached subscription set from `SubscriptionRegistry` instead of re-subscribing tag-by-tag. The gateway's worker then re-issues `AdviseSupervisory` server-side under the apartment lock.
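The difference between the old no-op connect and the recreate-then-connect path can be sketched as follows. This is a simplified Python model, not the driver's code: `SessionSketch`, `reopen`, the doubling factor, and the 500/4000 ms bounds are assumptions made for illustration. The source states only that the backoff is exponential and bounded by `InitialBackoffMs` / `MaxBackoffMs`.

```python
from typing import Callable, Iterator, Optional


def backoff_delays_ms(initial_ms: int, max_ms: int) -> Iterator[int]:
    """Yield capped exponential delays (doubling is an assumed factor)."""
    delay = initial_ms
    while True:
        yield delay
        delay = min(delay * 2, max_ms)


class SessionSketch:
    """Hypothetical session that starts out holding a stale handle."""

    def __init__(self) -> None:
        self.handle: Optional[str] = "stale"
        self.healthy = False

    def connect_if_absent(self) -> None:
        # Old behaviour: a no-op while any handle, even a stale one, exists.
        if self.handle is None:
            self.handle = "fresh"
            self.healthy = True

    def recreate(self) -> None:
        # Fixed behaviour: dispose the stale handle first, then rebuild.
        self.handle = None
        self.connect_if_absent()


def reopen(
    session: SessionSketch,
    use_recreate: bool,
    max_attempts: int,
    sleep: Callable[[int], None] = lambda ms: None,
) -> Optional[int]:
    """Return the attempt number that recovered, or None if none did."""
    delays = backoff_delays_ms(500, 4000)
    for attempt in range(1, max_attempts + 1):
        if use_recreate:
            session.recreate()
        else:
            session.connect_if_absent()
        if session.healthy:
            return attempt
        sleep(next(delays))
    return None


stuck = reopen(SessionSketch(), use_recreate=False, max_attempts=5)
recovered = reopen(SessionSketch(), use_recreate=True, max_attempts=5)
```

In this model the old path exhausts every attempt without recovering, while the recreate path succeeds on the first one.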

 ## Testing