Redundancy (v2)
Overview
OtOpcUa supports OPC UA non-transparent warm/hot redundancy. Two or more OtOpcUa.Host processes run side-by-side, share the same Config DB, and join the same Akka.NET cluster. Each process owns a distinct ApplicationUri; OPC UA clients discover both endpoints by reading Server.ServerArray (NodeId i=2254) on either node and pick one based on the ServiceLevel byte that each server publishes.
Discovery surface. The ServerArray path on the Server object is what each node populates with its own and its peers' ApplicationUris; see OpcUaApplicationHost.PopulateServerArray and the per-node PeerApplicationUris option below. The redundancy object type's ServerUriArray proper (a child of Server.ServerRedundancy) remains deferred pending an SDK object-type upgrade; clients should read Server.ServerArray for peer discovery today.
v2 change. v1's operator-managed ClusterNode.RedundancyRole column + RedundancyCoordinator / ApplyLeaseRegistry / PeerHttpProbeLoop are gone. Primary/secondary is now derived from the Akka cluster role-leader for the driver role. The operator no longer writes a role into the DB; cluster topology + health drive ServiceLevel automatically.
The runtime pieces live in:
| Component | Project | Role |
|---|---|---|
| ServiceLevelCalculator | OtOpcUa.ControlPlane.Redundancy | Pure function (NodeHealthInputs) → byte. No side effects. |
| RedundancyStateActor | OtOpcUa.ControlPlane.Redundancy | Admin-role cluster singleton; subscribes to cluster topology events, debounces 250ms, broadcasts RedundancyStateChanged on the redundancy-state DPS topic. |
| DbHealthProbeActor | OtOpcUa.Runtime.Health | Per-node; runs SELECT 1 against ConfigDb every 5s. Read by health endpoint + redundancy calc. |
| PeerOpcUaProbeActor | OtOpcUa.Runtime.Health | Per-node; pings peer opc.tcp://peer:4840 (real probe call is staged for follow-up F12). |
| ClusterRoleInfo | OtOpcUa.Cluster | Live view of cluster membership + role-leader; exposes IClusterRoleInfo to the rest of the host. |
ServiceLevel tiers (Part 5 §6.5)
ServiceLevelCalculator.Compute(NodeHealthInputs) returns a byte in 0..255 by tier:
| Tier | Byte | Condition |
|---|---|---|
| Down | 0 | Member status is not Up or Joining (leaving, removed, exiting). |
| Critically degraded | 100 | ConfigDb unreachable AND data is stale. |
| Stale | 200 | Data stale but ConfigDb reachable. |
| Healthy follower | 240 | DB ok + OPC UA probe ok + not stale. |
| Healthy leader | 250 | Healthy + this node is the driver role-leader. |
Drivers write their computed byte into the OPC UA ServiceLevel Variable on each refresh. Clients with the standard redundancy heuristic ("pick the highest ServiceLevel") therefore prefer the role-leader and fall back to a follower when the leader degrades.
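The tier table can be read as a top-down decision chain. The following is a minimal sketch of that reading, not the actual ServiceLevelCalculator source: the field names on NodeHealthInputs are hypothetical, and the handling of combinations the table does not list (ConfigDb unreachable while data is still fresh, or a failed OPC UA probe) is an assumption made only so the function is total.

```csharp
using System;

// Hypothetical input shape; the real NodeHealthInputs may carry different fields.
public sealed record NodeHealthInputs(
    bool MemberUpOrJoining,   // Akka member status is Up or Joining
    bool ConfigDbReachable,   // last SELECT 1 probe succeeded
    bool OpcUaProbeOk,        // OPC UA probe succeeded
    bool DataStale,           // driver data is older than the freshness window
    bool IsDriverRoleLeader); // this node is role-leader for the "driver" role

public static class ServiceLevelSketch
{
    // Pure function: same inputs always yield the same byte, no side effects.
    public static byte Compute(NodeHealthInputs h)
    {
        if (!h.MemberUpOrJoining) return 0;                   // Down
        if (!h.ConfigDbReachable && h.DataStale) return 100;  // Critically degraded
        if (h.DataStale) return 200;                          // Stale

        // Assumption: states the table does not name are reported as Stale (200).
        if (!h.ConfigDbReachable || !h.OpcUaProbeOk) return 200;

        return h.IsDriverRoleLeader ? (byte)250 : (byte)240;  // Healthy leader / follower
    }
}
```

Evaluating the most severe condition first keeps the tiers mutually exclusive, which is what lets a client compare two nodes' bytes directly.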
Data flow
Cluster topology event ──┐
DB health probe ─────────┤
OPC UA peer probe ───────┤
                         ▼
      RedundancyStateActor (admin singleton)
                         │ debounce 250ms
                         ▼
           DPS topic "redundancy-state"
                         │
                         ▼
          Driver nodes' OpcUaPublishActor
                         │
                         ▼
           ServiceLevelCalculator → byte
                         │
                         ▼
           OPC UA ServiceLevel Variable
The admin singleton is the cluster's only RedundancyStateActor. If the admin leader fails over, the new admin node spins up its replacement, re-subscribes to cluster events, and publishes a fresh snapshot from the current Cluster.State. There is no DB-persisted state to recover.
Configuration
Per-node identity comes from appsettings.json + the OTOPCUA_ROLES env var:
{
"Cluster": {
"Hostname": "0.0.0.0",
"Port": 4053,
"PublicHostname": "node-a.lan",
"SeedNodes": ["akka.tcp://otopcua@node-a.lan:4053"],
"Roles": ["admin", "driver"]
}
}
OTOPCUA_ROLES=admin,driver
Both nodes share the same ConfigDb connection string; Cluster.PublicHostname + Roles are what make them distinct in cluster gossip. The first node bootstraps the cluster (its address goes in SeedNodes); the second node joins via the same SeedNodes list.
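For illustration, the second node's Cluster section might look like the fragment below. The hostname node-b.lan is a hypothetical name, and reusing port 4053 is an assumption; the point is that only PublicHostname changes while SeedNodes still points at the first node.

```json
{
  "Cluster": {
    "Hostname": "0.0.0.0",
    "Port": 4053,
    "PublicHostname": "node-b.lan",
    "SeedNodes": ["akka.tcp://otopcua@node-a.lan:4053"],
    "Roles": ["admin", "driver"]
  }
}
```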
There is no longer a Node:NodeId setting, no ClusterNode.RedundancyRole, no ServiceLevelBase. NodeId is derived as host:port of the cluster PublicHostname (see ClusterRoleInfo.LocalNode for the formula).
Peer URI advertising
Each node advertises its partner via OpcUaApplicationHostOptions.PeerApplicationUris (an IList<string>, default empty). OpcUaApplicationHost.PopulateServerArray appends each configured peer URI to the SDK's IServerInternal.ServerUris string table after server startup, so that Server.ServerArray reads served by OnReadServerArray return both self + peers. Set this per-node in appsettings.json:
{
"OpcUaServer": {
"PeerApplicationUris": ["urn:node-b:OtOpcUa"]
}
}
Node A lists Node B's ApplicationUri and vice versa. This is validated by DualEndpointTests in tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests/, which boots two OpcUaApplicationHost instances on loopback and asserts that a real OPCFoundation client Session reading Server.ServerArray from Node A sees both URIs.
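The mirrored setting on Node B would then look roughly like this. The URI urn:node-a:OtOpcUa is an assumed value that follows the pattern of the example above; use whatever ApplicationUri Node A actually reports.

```json
{
  "OpcUaServer": {
    "PeerApplicationUris": ["urn:node-a:OtOpcUa"]
  }
}
```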
Split-brain
akka.conf configures Akka's split-brain resolver with active-strategy = keep-oldest, stable-after = 15s, and failure-detector.threshold = 10.0. Under a clean partition, the side that contains the oldest member stays up and the other side downs itself about 15 seconds after the membership view stabilizes. The RedundancyStateActor on the surviving partition re-computes from the post-partition Cluster.State.
There is no operator-driven role swap during a partition. Failover is what the cluster does automatically.
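As a sketch, the three settings named above sit under akka.cluster in HOCON. The nesting shown is Akka.NET's standard location for these keys; the project's actual akka.conf may arrange or supplement them differently.

```hocon
akka.cluster {
  split-brain-resolver {
    active-strategy = keep-oldest   # the partition holding the oldest member survives
    stable-after = 15s              # wait for a stable membership view before downing
  }
  failure-detector.threshold = 10.0 # phi-accrual threshold; higher tolerates more jitter
}
```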
Client-side failover
The OtOpcUa Client CLI at src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI supports -F / --failover-urls for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See Client.CLI.md.
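A client that wants to follow the "pick the highest ServiceLevel" heuristic when choosing among failover URLs could rank candidates as sketched below. This is an illustration only and is not taken from the CLI's source: EndpointStatus and PickBest are hypothetical names, and treating ServiceLevel 0 as unusable is an assumption based on the Down tier.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical pairing of an endpoint URL with the ServiceLevel last read from it.
public sealed record EndpointStatus(string EndpointUrl, byte ServiceLevel);

public static class FailoverSketch
{
    // Returns the URL with the highest non-zero ServiceLevel, or null if none is usable.
    // OrderByDescending is a stable sort, so ties keep the caller's preferred order.
    public static string? PickBest(IEnumerable<EndpointStatus> candidates) =>
        candidates
            .Where(c => c.ServiceLevel > 0)
            .OrderByDescending(c => c.ServiceLevel)
            .Select(c => c.EndpointUrl)
            .FirstOrDefault();
}
```

With one node at 250 and the other at 240, the first is selected; if the leader drops to 200 or 0, the same call returns the follower.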
Depth reference
For the full design — message contracts, tiered calculator truth table, recovery semantics — see docs/plans/2026-05-26-akka-hosting-alignment-design.md §6.