docs: redundancy primary-only alarm emission/historization, redundancy-state NodeId+heartbeat fixes, Galaxy reconnect, live-pill

2026-06-11 10:18:25 -04:00
parent 7891e28b52
commit 521ee75355
4 changed files with 21 additions and 2 deletions
@@ -111,6 +111,8 @@ Both nodes share the same `ConfigDb` connection string; `Cluster.PublicHostname`

 There is no longer a `Node:NodeId` setting and no `ClusterNode.RedundancyRole` column (the V2 migration dropped it — primary/secondary is now derived from cluster role-leadership). NodeId is derived as `host:port` of the cluster `PublicHostname` (see `ClusterRoleInfo.LocalNode` for the formula).

+> **`RedundancyStateActor` NodeId consistency (fixed).** `RedundancyStateActor` now keys each node's `NodeRedundancyState` entry by the canonical `host:port` node id (via a `ToNodeId(Address)` helper mirroring `ClusterRoleInfo.ToNodeId`). Previously it keyed by `member.Address.Host` (host-only, e.g. `central-2`); since every subscriber matches by the canonical `host:port` form, the mismatch silently meant no node ever matched its own entry — all nodes stayed at the default ServiceLevel 255 and never learned their role. This fix makes `RedundancyStateActor` consistent with the stated contract above. Additionally, `RedundancyStateActor` now **re-publishes the current snapshot on a periodic heartbeat (default 10 s)** so any node that subscribes after the last topology-change publish converges within the interval (DistributedPubSub does not replay to late subscribers).
+
 The `ClusterNode.ServiceLevelBase` column still exists and is editable in the Admin UI (NodeEdit / Cluster Redundancy pages), but it no longer drives the runtime ServiceLevel — that value is computed from cluster role/health and published per the mapping above, independent of this stored preference.

 ### Peer URI advertising
@@ -133,6 +135,17 @@ Node A lists Node B's `ApplicationUri` and vice-versa. Validated by `DualEndpoin

 There is no operator-driven role swap during a partition. Failover is what the cluster does automatically.

+## Primary-gated alarm emission and historization
+
+Under warm/hot redundancy both cluster nodes run `ScriptedAlarmHostActor` and evaluate scripted alarms, keeping each node's address space and engine state warm for instant failover. However, to avoid duplicate rows on `/alerts` and duplicate historian writes, only the Primary node publishes externally:
+
+- **`alerts` topic emission** — `ScriptedAlarmHostActor` subscribes to the `redundancy-state` DPS topic and caches the local node's `RedundancyRole`. Each alarm transition is published to the cluster `alerts` topic **only when the node's role is `Primary`**. The default behaviour before any `redundancy-state` message arrives is to emit, so single-node deployments and the boot window never drop transitions. The OPC UA condition-node write and inbound ack/shelve command processing remain **ungated** on both nodes so the secondary is always ready to serve clients after a failover.
+- **`HistorianAdapterActor` historization** — likewise Primary-gated so alarm historization is exactly-once across all alarm sources. (This actor currently has no production feeder; the gate is a forward-looking guard.)
+
+Net effect: each alarm transition appears **once** on `/alerts` and would historize once, not once per node.
+
+See [ScriptedAlarms.md](ScriptedAlarms.md) and [AlarmTracking.md](AlarmTracking.md) for the scripted-alarm engine internals.
+
 ## Client-side failover

 The OtOpcUa Client CLI at `src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI` supports `-F` / `--failover-urls` for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See [`Client.CLI.md`](Client.CLI.md).