diff --git a/docs/AlarmTracking.md b/docs/AlarmTracking.md
index 46ad17a0..7c8731b6 100644
--- a/docs/AlarmTracking.md
+++ b/docs/AlarmTracking.md
@@ -153,6 +153,10 @@ The AdminUI `/alerts` Shelve flow was live-verified on docker-dev 2026-06-11:
 singleton → topic → host actor → engine → "Shelved" status reflected on
 `/alerts` with the operator identity threaded through.
 
+## Redundancy deduplication
+
+Under warm/hot redundancy, both cluster nodes run `ScriptedAlarmHostActor` and the scripted-alarm engine. To prevent duplicate `/alerts` rows and duplicate historian writes, alarm transition publication to the `alerts` topic and `HistorianAdapterActor` historization are **Primary-gated**: only the node whose `RedundancyRole` is `Primary` publishes externally. OPC UA condition-node writes and inbound ack/shelve processing remain ungated on both nodes so the secondary stays warm for failover. See [Redundancy.md §Primary-gated alarm emission and historization](Redundancy.md#primary-gated-alarm-emission-and-historization).
+
 ## Historian write-back (non-Galaxy alarms)
 
 Scripted alarms (and any future non-Galaxy `IAlarmSource` like
diff --git a/docs/Redundancy.md b/docs/Redundancy.md
index a26acab0..e27e9a23 100644
--- a/docs/Redundancy.md
+++ b/docs/Redundancy.md
@@ -111,6 +111,8 @@ Both nodes share the same `ConfigDb` connection string; `Cluster.PublicHostname`
 
 There is no longer a `Node:NodeId` setting and no `ClusterNode.RedundancyRole` column (the V2 migration dropped it — primary/secondary is now derived from cluster role-leadership). NodeId is derived as `host:port` of the cluster `PublicHostname` (see `ClusterRoleInfo.LocalNode` for the formula).
 
+> **`RedundancyStateActor` NodeId consistency (fixed).** `RedundancyStateActor` now keys each node's `NodeRedundancyState` entry by the canonical `host:port` node id (via a `ToNodeId(Address)` helper mirroring `ClusterRoleInfo.ToNodeId`). Previously it keyed by `member.Address.Host` (host-only, e.g. `central-2`); since every subscriber matches by the canonical `host:port` form, the mismatch silently meant no node ever matched its own entry — all nodes stayed at the default ServiceLevel 255 and never learned their role. This fix makes `RedundancyStateActor` consistent with the stated contract above. Additionally, `RedundancyStateActor` now **re-publishes the current snapshot on a periodic heartbeat (default 10 s)** so any node that subscribes after the last topology-change publish converges within the interval (DistributedPubSub does not replay to late subscribers).
+
 The `ClusterNode.ServiceLevelBase` column still exists and is editable in the Admin UI (NodeEdit / Cluster Redundancy pages), but it no longer drives the runtime ServiceLevel — that value is computed from cluster role/health and published per the mapping above, independent of this stored preference.
 
 ### Peer URI advertising
@@ -133,6 +135,17 @@ Node A lists Node B's `ApplicationUri` and vice-versa. Validated by `DualEndpoin
 
 There is no operator-driven role swap during a partition. Failover is what the cluster does automatically.
 
+## Primary-gated alarm emission and historization
+
+Under warm/hot redundancy both cluster nodes run `ScriptedAlarmHostActor` and evaluate scripted alarms, keeping each node's address space and engine state warm for instant failover. However, to avoid duplicate rows on `/alerts` and duplicate historian writes, only the Primary node publishes externally:
+
+- **`alerts` topic emission** — `ScriptedAlarmHostActor` subscribes to the `redundancy-state` DPS topic and caches the local node's `RedundancyRole`. Each alarm transition is published to the cluster `alerts` topic **only when the node's role is `Primary`**. The default behaviour before any `redundancy-state` message arrives is to emit, so single-node deployments and the boot window never drop transitions. The OPC UA condition-node write and inbound ack/shelve command processing remain **ungated** on both nodes so the secondary is always ready to serve clients after a failover.
+- **`HistorianAdapterActor` historization** — likewise Primary-gated so alarm historization is exactly-once across all alarm sources. (This actor currently has no production feeder; the gate is a forward-looking guard.)
+
+Net effect: each alarm transition appears **once** on `/alerts` and would historize once, not once per node.
+
+See [ScriptedAlarms.md](ScriptedAlarms.md) and [AlarmTracking.md](AlarmTracking.md) for the scripted-alarm engine internals.
+
 ## Client-side failover
 
 The OtOpcUa Client CLI at `src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI` supports `-F` / `--failover-urls` for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See [`Client.CLI.md`](Client.CLI.md).
diff --git a/docs/ScriptedAlarms.md b/docs/ScriptedAlarms.md
index 91b973b6..881d818b 100644
--- a/docs/ScriptedAlarms.md
+++ b/docs/ScriptedAlarms.md
@@ -125,7 +125,9 @@ On allow, the handler publishes a `Commons.OpcUa.AlarmCommand` (containing comma
 
 ### AdminUI path
 
-The AdminUI `/alerts` page (`Alerts.razor`) shows per-row **Acknowledge / Shelve / Unshelve** buttons. These are gated by the `DriverOperator` AdminUI policy and routed through the `AdminOperationsActor` cluster singleton (`AcknowledgeAlarmCommand` / `ShelveAlarmCommand`), which publishes onto the same `alarm-commands` topic. Cross-node routing is handled by the cluster singleton — the command always reaches the driver-role node hosting the engine that owns the alarm regardless of which AdminUI instance the operator is connected to.
+The AdminUI `/alerts` page (`Alerts.razor`) shows per-row **Acknowledge / Shelve / Unshelve** buttons. These are gated by the `DriverOperator` AdminUI policy and routed through the `AdminOperationsActor` cluster singleton (`AcknowledgeAlarmCommand` / `ShelveAlarmCommand`), which publishes onto the same `alarm-commands` topic. Cross-node routing is handled by the cluster singleton — the command always reaches the driver-role node hosting the engine that owns the alarm regardless of which AdminUI instance the operator is connected to. The Shelve button accepts a duration (minutes) and passes it to `ShelveAlarmAsync(Timed, unshelveAtUtc)`; result chips auto-clear after ~8 s. The `/alerts` and `/script-log` live-status pills reflect real feed health driven by the SignalR bridge actors' DPS subscription state.
+
+Under redundancy, the `alerts` topic rows are deduplicated to the Primary node — see [Redundancy.md §Primary-gated alarm emission and historization](Redundancy.md#primary-gated-alarm-emission-and-historization).
 
 ### Client.CLI path
 
diff --git a/docs/drivers/Galaxy.md b/docs/drivers/Galaxy.md
index 7b82d149..b0cc8054 100644
--- a/docs/drivers/Galaxy.md
+++ b/docs/drivers/Galaxy.md
@@ -90,7 +90,7 @@ Full per-field descriptions live in `Config/GalaxyDriverOptions.cs`. The full JS
 
 ## Reconnect + Replay
 
-`ReconnectSupervisor` owns an exponential-backoff loop bounded by `Reconnect.InitialBackoffMs` / `MaxBackoffMs`. On session loss it tears down the gRPC channel, redials, and — when `ReplayOnSessionLost = true` — calls the gateway's `ReplaySubscriptions` RPC with the cached subscription set from `SubscriptionRegistry` instead of re-subscribing tag-by-tag. The gateway's worker then re-issues `AdviseSupervisory` server-side under the apartment lock.
+`ReconnectSupervisor` owns an exponential-backoff loop bounded by `Reconnect.InitialBackoffMs` / `MaxBackoffMs`. On session loss it calls `GalaxyDriver.ReopenAsync`, which invokes `GalaxyMxSession.RecreateAsync` to dispose the stale/faulted session and client before rebuilding (`OpenSessionAsync` + `RegisterAsync`). Previously `ConnectAsync` was a no-op when a stale session handle was still present, so the reopen supervisor looped forever without recovering. After a successful reopen — when `ReplayOnSessionLost = true` — the supervisor calls the gateway's `ReplaySubscriptions` RPC with the cached subscription set from `SubscriptionRegistry` instead of re-subscribing tag-by-tag. The gateway's worker then re-issues `AdviseSupervisory` server-side under the apartment lock.
 
 ## Testing
 