# Redundancy

## Overview
OtOpcUa supports OPC UA non-transparent warm/hot redundancy. Two (or more) OtOpcUa Server processes run side by side, share the same Config DB, the same driver backends (Galaxy ZB, MXAccess runtime, remote PLCs), and advertise the same OPC UA node tree. Each process owns a distinct `ApplicationUri`; OPC UA clients see both endpoints via the standard `ServerUriArray` and pick one based on the `ServiceLevel` that each server publishes.
The redundancy surface lives in `src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/`:

| Class | Role |
|---|---|
| `RedundancyCoordinator` | Process singleton; owns the current `RedundancyTopology` loaded from the `ClusterNode` table. `RefreshAsync` re-reads it after `sp_PublishGeneration`, so operator role swaps take effect without a process restart. A CAS-style swap (`Interlocked.Exchange`) means readers always see a coherent snapshot. |
| `RedundancyTopology` | Immutable `(ClusterId, Self, Peers, ServerUriArray, ValidityFlags)` snapshot. |
| `ApplyLeaseRegistry` | Tracks in-progress `sp_PublishGeneration` apply leases keyed on `(ConfigGenerationId, PublishRequestId)`. `await using` the disposable scope guarantees that every exit path (success, exception, cancellation) decrements the lease; a stale-lease watchdog force-closes any lease older than `ApplyMaxDuration` (default 10 minutes) so a crashed publisher can't pin the node at `PrimaryMidApply`. |
| `PeerReachabilityTracker` | Maintains last-known reachability for each peer node over two independent probes, an OPC UA ping and HTTP `/healthz`. Both must succeed for `peerReachable = true`. |
| `RecoveryStateManager` | Gates transitions out of the `Recovering*` bands; requires dwell and publish-witness satisfaction before allowing a return to nominal. |
| `ServiceLevelCalculator` | Pure function `(role, selfHealthy, peerUa, peerHttp, applyInProgress, recoveryDwellMet, topologyValid, operatorMaintenance) → byte`. |
| `RedundancyStatePublisher` | Orchestrates the inputs to the calculator, pushes the resulting byte to the OPC UA `ServiceLevel` variable via an edge-triggered `OnStateChanged` event, and fires `OnServerUriArrayChanged` when the topology's `ServerUriArray` shifts. |
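The publisher's edge-triggered contract can be illustrated with a minimal language-neutral sketch (Python here; the class and callback names are invented for the example, not the real C# `RedundancyStatePublisher`): the event fires only when the computed byte actually changes.

```python
class StatePublisherSketch:
    """Illustrative sketch of an edge-triggered ServiceLevel publisher.

    The callback fires only when the pushed byte differs from the last
    published value (edge-triggered, not level-triggered).
    """

    def __init__(self, on_state_changed):
        self._last = None                  # nothing published yet
        self._on_state_changed = on_state_changed

    def push(self, service_level: int) -> bool:
        """Publish service_level; returns True if the callback fired."""
        if service_level == self._last:
            return False                   # no edge, no event
        self._last = service_level
        self._on_state_changed(service_level)
        return True
```

Repeated pushes of the same band are therefore cheap no-ops, which matters when the calculator is re-evaluated on every probe tick.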
## Data model
Per-node redundancy state lives in the Config DB `ClusterNode` table (`src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/ClusterNode.cs`):

| Column | Role |
|---|---|
| `NodeId` | Unique node identity; matches `Node:NodeId` in the server's bootstrap `appsettings.json`. |
| `ClusterId` | Foreign key into `ServerCluster`. |
| `RedundancyRole` | `Primary`, `Secondary`, or `Standalone` (`RedundancyRole` enum in `Configuration/Enums`). |
| `ServiceLevelBase` | Per-node base value used to bias the nominal `ServiceLevel` output. |
| `ApplicationUri` | Unique-per-node OPC UA `ApplicationUri` advertised in endpoint descriptions. |

`ServerUriArray` is derived from the set of peer `ApplicationUri` values at topology-load time and republished when the topology changes.
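A minimal sketch of that derivation, assuming (since this page does not specify an ordering) that the node lists its own `ApplicationUri` first and sorts its peers for determinism; `derive_server_uri_array` is a hypothetical name, not the real method:

```python
def derive_server_uri_array(self_uri: str, peer_uris: list[str]) -> list[str]:
    """Sketch: build ServerUriArray from the node's own ApplicationUri plus
    its peers'. Self-first ordering and sorted peers are assumptions made
    for determinism, not documented behaviour."""
    return [self_uri] + sorted(u for u in peer_uris if u != self_uri)
```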
## ServiceLevel matrix

`ServiceLevelCalculator` produces one of the following bands (see the `ServiceLevelBand` enum in the same file):
| Band | Byte | Meaning |
|---|---|---|
| `Maintenance` | 0 | Operator-declared maintenance. |
| `NoData` | 1 | Self-reported unhealthy (`/healthz` fails). |
| `InvalidTopology` | 2 | More than one Primary detected; both nodes self-demote. |
| `RecoveringBackup` | 30 | Backup post-fault, dwell not met. |
| `BackupMidApply` | 50 | Backup inside a publish-apply window. |
| `IsolatedBackup` | 80 | Primary unreachable; the Backup signals "take over if asked" but does not auto-promote (non-transparent model). |
| `AuthoritativeBackup` | 100 | Backup nominal. |
| `RecoveringPrimary` | 180 | Primary post-fault, dwell not met. |
| `PrimaryMidApply` | 200 | Primary inside a publish-apply window. |
| `IsolatedPrimary` | 230 | Primary with an unreachable peer; retains authority. |
| `AuthoritativePrimary` | 255 | Primary nominal. |
The reserved bands (0 Maintenance, 1 NoData, 2 InvalidTopology) take precedence over operational states per OPC UA Part 5 §6.3.34. Operational values occupy 2..255 so spec-compliant clients that treat "<3 = unhealthy" keep working.
Standalone nodes (single-instance deployments) report AuthoritativePrimary when healthy and PrimaryMidApply during publish.
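The matrix can be read as a pure function with a fixed precedence order. The sketch below (Python, standing in for the real C# `ServiceLevelCalculator`) reproduces the table's bytes; the ordering of the operational checks (recovery dwell, then apply window, then peer reachability) is an assumption inferred from the byte values, not stated on this page.

```python
# Byte values copied from the ServiceLevel matrix above.
BANDS = {
    "Maintenance": 0, "NoData": 1, "InvalidTopology": 2,
    "RecoveringBackup": 30, "BackupMidApply": 50, "IsolatedBackup": 80,
    "AuthoritativeBackup": 100, "RecoveringPrimary": 180,
    "PrimaryMidApply": 200, "IsolatedPrimary": 230,
    "AuthoritativePrimary": 255,
}

def service_level(role, self_healthy, peer_ua, peer_http, apply_in_progress,
                  recovery_dwell_met, topology_valid,
                  operator_maintenance) -> int:
    # Reserved bands take precedence over operational states.
    if operator_maintenance:
        return BANDS["Maintenance"]
    if not self_healthy:
        return BANDS["NoData"]
    if not topology_valid:
        return BANDS["InvalidTopology"]
    if role == "Standalone":
        # Standalone reports the Primary bands (see the paragraph above).
        return BANDS["PrimaryMidApply"] if apply_in_progress \
            else BANDS["AuthoritativePrimary"]
    side = "Primary" if role == "Primary" else "Backup"
    if not recovery_dwell_met:          # post-fault, dwell not yet met
        return BANDS[f"Recovering{side}"]
    if apply_in_progress:               # inside a publish-apply window
        return BANDS[f"{side}MidApply"]
    if not (peer_ua and peer_http):     # both probes must succeed
        return BANDS[f"Isolated{side}"]
    return BANDS[f"Authoritative{side}"]
```

Note that `recovery_dwell_met` is taken as true whenever the node is not in a recovery window, so a nominal node falls through to the `Isolated*`/`Authoritative*` checks.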
## Publish fencing and split-brain prevention
Any Admin-triggered `sp_PublishGeneration` acquires an apply lease through `ApplyLeaseRegistry.BeginApplyLease`. While the lease is held:

- The calculator reports `PrimaryMidApply`/`BackupMidApply`; clients see the band shift and cut over to the unaffected peer rather than racing against a half-applied generation.
- `RedundancyCoordinator.RefreshAsync` is called at the end of the apply window, so the post-publish topology becomes visible exactly once, atomically.
- The watchdog force-closes any lease older than `ApplyMaxDuration`; a stuck publisher therefore cannot strand a node at `PrimaryMidApply`.
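The lease lifecycle can be sketched with a Python context manager standing in for the C# `await using` scope. The names, injected clock, and sweep method below are illustrative; only the 10-minute `ApplyMaxDuration` default comes from the table above.

```python
import time
from contextlib import contextmanager

class ApplyLeaseRegistrySketch:
    """Sketch of the apply-lease lifecycle: every exit path releases the
    lease, and a watchdog sweep force-closes leases older than max_duration
    so a crashed publisher cannot pin a node in a *MidApply band."""

    def __init__(self, max_duration_s: float = 600.0, clock=time.monotonic):
        self._max = max_duration_s
        self._clock = clock                     # injectable for testing
        self._leases: dict[tuple, float] = {}   # lease key -> acquired-at

    @property
    def apply_in_progress(self) -> bool:
        return bool(self._leases)

    @contextmanager
    def begin_apply_lease(self, generation_id: int, publish_request_id: int):
        key = (generation_id, publish_request_id)
        self._leases[key] = self._clock()
        try:
            yield key
        finally:
            # Runs on success, exception, and cancellation alike.
            self._leases.pop(key, None)

    def watchdog_sweep(self) -> int:
        """Force-close stale leases; returns the number closed."""
        now = self._clock()
        stale = [k for k, t in self._leases.items() if now - t > self._max]
        for k in stale:
            del self._leases[k]
        return len(stale)
```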
Because role transitions are operator-driven (write `RedundancyRole` in the Config DB and publish), the Backup never auto-promotes. An `IsolatedBackup` at 80 is the signal that the operator should intervene; auto-failover is intentionally out of scope for the non-transparent model (decision #154).
## Metrics

`RedundancyMetrics` in `src/ZB.MOM.WW.OtOpcUa.Admin/Services/RedundancyMetrics.cs` registers the `ZB.MOM.WW.OtOpcUa.Redundancy` meter on the Admin process. Instruments:
| Name | Kind | Tags | Description |
|---|---|---|---|
| `otopcua.redundancy.role_transition` | Counter | `cluster.id`, `node.id`, `from_role`, `to_role` | Incremented every time `FleetStatusPoller` observes a `RedundancyRole` change on a `ClusterNode` row. |
| `otopcua.redundancy.primary_count` | ObservableGauge | `cluster.id` | Primary-role nodes per cluster; should be exactly 1 in the nominal state. |
| `otopcua.redundancy.secondary_count` | ObservableGauge | `cluster.id` | Secondary-role nodes per cluster. |
| `otopcua.redundancy.stale_count` | ObservableGauge | `cluster.id` | Nodes whose `LastSeenAt` exceeded the stale threshold. |
Admin `Program.cs` wires OpenTelemetry to the Prometheus exporter when `Metrics:Prometheus:Enabled=true` (the default), exposing the meter under `/metrics`. The endpoint is intentionally unauthenticated; fleet conventions put it behind a reverse-proxy basic-auth gate if needed.
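The gauge inputs reduce to a pure count over `ClusterNode`-like rows. This Python sketch is illustrative (the real instruments live in the C# `RedundancyMetrics`), and the 30-second stale threshold is an invented placeholder, since the actual threshold is not given on this page.

```python
def cluster_counts(nodes, now_s, stale_threshold_s=30.0):
    """Sketch of the gauge inputs: Primary/Secondary/stale counts per
    cluster. nodes is an iterable of dicts with ClusterId, RedundancyRole,
    and LastSeenAt (seconds). Returns
    {cluster_id: (primary_count, secondary_count, stale_count)}."""
    counts = {}
    for n in nodes:
        p, s, st = counts.get(n["ClusterId"], (0, 0, 0))
        role = n["RedundancyRole"]
        p += role == "Primary"
        s += role == "Secondary"
        st += (now_s - n["LastSeenAt"]) > stale_threshold_s
        counts[n["ClusterId"]] = (p, s, st)
    return counts
```

A `primary_count` other than 1 for a cluster is the alerting condition implied by the table above.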
## Real-time notifications (Admin UI)
`FleetStatusPoller` in `src/ZB.MOM.WW.OtOpcUa.Admin/Hubs/` polls the `ClusterNode` table, records role transitions, updates `RedundancyMetrics.SetClusterCounts`, and pushes a `RoleChanged` SignalR event onto `FleetStatusHub` when a transition is observed. `RedundancyTab.razor` subscribes with `_hub.On<RoleChangedMessage>("RoleChanged", …)`, so connected Admin sessions see role swaps the moment they happen.
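The poller's transition-detection step reduces to diffing two role snapshots. A hedged sketch follows; the function name and snapshot shape are invented, and how the real `FleetStatusPoller` treats newly appearing nodes is not specified on this page (this sketch simply skips them).

```python
def detect_role_transitions(previous, current):
    """Sketch: compare two {node_id: role} snapshots and return a
    (node_id, from_role, to_role) tuple for every observed change.
    Nodes absent from the previous snapshot are skipped."""
    return [
        (node, previous[node], role)
        for node, role in current.items()
        if node in previous and previous[node] != role
    ]
```

Each returned tuple maps directly onto the `role_transition` counter tags (`node.id`, `from_role`, `to_role`) and one `RoleChanged` event.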
## Configuring a redundant pair

Redundancy is configured in the Config DB, not in `appsettings.json`. The fields that must differ between the two instances:
| Field | Location | Instance 1 | Instance 2 |
|---|---|---|---|
| `NodeId` | `Node:NodeId` in `appsettings.json` (bootstrap) | `node-a` | `node-b` |
| `ClusterNode.ApplicationUri` | Config DB | `urn:node-a:OtOpcUa` | `urn:node-b:OtOpcUa` |
| `ClusterNode.RedundancyRole` | Config DB | `Primary` | `Secondary` |
| `ClusterNode.ServiceLevelBase` | Config DB | typically 255 | typically 100 |
Shared between instances: `ClusterId`, the Config DB connection string, the published generation, cluster-level ACLs, the UNS hierarchy, and driver instances.
Role swaps, standalone promotions, and base-level adjustments all happen through the Admin UI `RedundancyTab`: the operator edits the `ClusterNode` row in a draft generation and publishes. `RedundancyCoordinator.RefreshAsync` picks up the new topology without a process restart.
## Client-side failover
The OtOpcUa Client CLI at `src/ZB.MOM.WW.OtOpcUa.Client.CLI` supports `-F` / `--failover-urls` for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See `Client.CLI.md` for the command reference.
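Client-side failover of this kind boils down to walking the endpoint list until a connection succeeds. A library-agnostic Python sketch with an injected `try_connect` (the real CLI uses its OPC UA session layer and KeepAlive monitoring, not this function):

```python
def failover_connect(urls, try_connect):
    """Sketch: try each endpoint URL in order and return (url, session)
    for the first that connects. try_connect is injected so the sketch
    stays library-agnostic; it must raise ConnectionError on failure."""
    last_error = None
    for url in urls:
        try:
            return url, try_connect(url)
        except ConnectionError as e:
            last_error = e          # remember why, then try the next peer
    raise ConnectionError(f"all endpoints failed: {urls}") from last_error
```

On a KeepAlive timeout the same walk would run again, starting from the remaining endpoints, before the subscription is recreated.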
## Server-side vs. upstream-side redundancy
The mechanics on this page describe OtOpcUa as a redundant server — two of our instances clustered behind one OPC UA address space, exposing ServerUriArray + dynamic ServiceLevel to downstream clients. The mirror-image scenario — the OPC UA Client driver consuming an upstream redundant pair — is documented separately in drivers/OpcUaClient.md § Upstream redundancy. Both rely on the same OPC UA Part 4 § 6.6.2 model (non-transparent warm/hot via RedundancySupport + ServerUriArray + ServiceLevel); they sit at opposite ends of the gateway pipeline. A deployment can wire either, both, or neither.
## Depth reference

For the full decision trail and implementation plan (topology invariants, peer-probe cadence, recovery-dwell policy, compliance-script guard against enum-value drift), see `docs/v2/plan.md` §Phase 6.3.