# Redundancy

## Overview

OtOpcUa supports OPC UA non-transparent warm/hot redundancy. Two (or more) OtOpcUa Server processes run side by side, share the same Config DB, the same driver backends (Galaxy ZB, MXAccess runtime, remote PLCs), and advertise the same OPC UA node tree. Each process owns a distinct ApplicationUri; OPC UA clients see both endpoints via the standard ServerUriArray and pick one based on the ServiceLevel that each server publishes.

The redundancy surface lives in src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/:

| Class | Role |
| --- | --- |
| RedundancyCoordinator | Process-singleton; owns the current RedundancyTopology loaded from the ClusterNode table. RefreshAsync re-reads after sp_PublishGeneration so operator role swaps take effect without a process restart. CAS-style swap (Interlocked.Exchange) means readers always see a coherent snapshot. |
| RedundancyTopology | Immutable (ClusterId, Self, Peers, ServerUriArray, ValidityFlags) snapshot. |
| ApplyLeaseRegistry | Tracks in-progress sp_PublishGeneration apply leases keyed on (ConfigGenerationId, PublishRequestId). await using the disposable scope guarantees every exit path (success / exception / cancellation) decrements the lease; a stale-lease watchdog force-closes any lease older than ApplyMaxDuration (default 10 minutes) so a crashed publisher can't pin the node at PrimaryMidApply. |
| PeerReachabilityTracker | Maintains last-known reachability for each peer node over two independent probes — OPC UA ping and HTTP /healthz. Both must succeed for peerReachable = true. |
| RecoveryStateManager | Gates transitions out of the Recovering* bands; requires dwell + publish-witness satisfaction before allowing a return to nominal. |
| ServiceLevelCalculator | Pure function (role, selfHealthy, peerUa, peerHttp, applyInProgress, recoveryDwellMet, topologyValid, operatorMaintenance) → byte. |
| RedundancyStatePublisher | Orchestrates inputs into the calculator, pushes the resulting byte to the OPC UA ServiceLevel variable via an edge-triggered OnStateChanged event, and fires OnServerUriArrayChanged when the topology's ServerUriArray shifts. |
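
The coordinator's atomic snapshot swap is worth a minimal sketch. The code below is illustrative only: the type shape, member names, and the loader are placeholders, not the actual RedundancyCoordinator / RedundancyTopology implementation.

```csharp
// Sketch of the atomic-snapshot pattern described above; placeholder names throughout.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed record RedundancyTopology(
    Guid ClusterId,
    string SelfNodeId,
    IReadOnlyList<string> ServerUriArray);

public sealed class TopologySnapshotHolder
{
    private RedundancyTopology _current =
        new(Guid.Empty, "node-a", new[] { "urn:node-a:OtOpcUa" });

    // Readers always get one complete, immutable snapshot.
    public RedundancyTopology Current => Volatile.Read(ref _current);

    public async Task RefreshAsync(CancellationToken ct = default)
    {
        RedundancyTopology next = await LoadFromClusterNodeTableAsync(ct);

        // Single atomic reference swap: a reader sees either the old snapshot
        // or the new one, never a partially updated mix.
        Interlocked.Exchange(ref _current, next);
    }

    // Placeholder for the ClusterNode query.
    private static Task<RedundancyTopology> LoadFromClusterNodeTableAsync(CancellationToken ct) =>
        Task.FromResult(new RedundancyTopology(
            Guid.NewGuid(),
            "node-a",
            new[] { "urn:node-a:OtOpcUa", "urn:node-b:OtOpcUa" }));
}
```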

## Data model

Per-node redundancy state lives in the Config DB ClusterNode table (src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/ClusterNode.cs):

| Column | Role |
| --- | --- |
| NodeId | Unique node identity; matches Node:NodeId in the server's bootstrap appsettings.json. |
| ClusterId | Foreign key into ServerCluster. |
| RedundancyRole | Primary, Secondary, or Standalone (RedundancyRole enum in Configuration/Enums). |
| ServiceLevelBase | Per-node base value used to bias nominal ServiceLevel output. |
| ApplicationUri | Unique-per-node OPC UA ApplicationUri advertised in endpoint descriptions. |

ServerUriArray is derived from the set of peer ApplicationUri values at topology-load time and republished when the topology changes.
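
As a sketch of that derivation (assumed logic, not the actual topology loader; ClusterNodeRow is a hypothetical stand-in for the ClusterNode entity):

```csharp
// Assumed derivation: ServerUriArray as the ordered, de-duplicated set of
// ApplicationUri values across the cluster's nodes.
using System.Collections.Generic;
using System.Linq;

public sealed record ClusterNodeRow(string NodeId, string ApplicationUri, string RedundancyRole);

public static class ServerUriArrayBuilder
{
    public static string[] Build(IEnumerable<ClusterNodeRow> clusterNodes) =>
        clusterNodes
            .OrderBy(n => n.NodeId)          // stable ordering on both servers
            .Select(n => n.ApplicationUri)
            .Distinct()
            .ToArray();
}
```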

## ServiceLevel matrix

ServiceLevelCalculator produces one of the following bands (see ServiceLevelBand enum in the same file):

| Band | Byte | Meaning |
| --- | --- | --- |
| Maintenance | 0 | Operator-declared maintenance. |
| NoData | 1 | Self-reported unhealthy (/healthz fails). |
| InvalidTopology | 2 | More than one Primary detected; both nodes self-demote. |
| RecoveringBackup | 30 | Backup post-fault, dwell not met. |
| BackupMidApply | 50 | Backup inside a publish-apply window. |
| IsolatedBackup | 80 | Primary unreachable; Backup says "take over if asked" — does not auto-promote (non-transparent model). |
| AuthoritativeBackup | 100 | Backup nominal. |
| RecoveringPrimary | 180 | Primary post-fault, dwell not met. |
| PrimaryMidApply | 200 | Primary inside a publish-apply window. |
| IsolatedPrimary | 230 | Primary with unreachable peer, retains authority. |
| AuthoritativePrimary | 255 | Primary nominal. |

The reserved bands (0 Maintenance, 1 NoData, 2 InvalidTopology) take precedence over operational states per OPC UA Part 5 §6.3.34. All operational bands sit above 2 (30..255 in the table above), so spec-compliant clients that treat values below 3 as unhealthy keep working.

Standalone nodes (single-instance deployments) report AuthoritativePrimary when healthy and PrimaryMidApply during publish.
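
The band values above can be read as an enum plus a precedence check. In the sketch below, the byte values mirror the table; the input list and precedence ordering of ForPrimary are an illustration, not a transcription of the real ServiceLevelCalculator.

```csharp
// Band values mirror the table above. The precedence function is an assumed
// ordering for a Primary-role node; the real calculator takes the full input
// tuple listed in the component table.
public enum ServiceLevelBand : byte
{
    Maintenance          = 0,
    NoData               = 1,
    InvalidTopology      = 2,
    RecoveringBackup     = 30,
    BackupMidApply       = 50,
    IsolatedBackup       = 80,
    AuthoritativeBackup  = 100,
    RecoveringPrimary    = 180,
    PrimaryMidApply      = 200,
    IsolatedPrimary      = 230,
    AuthoritativePrimary = 255,
}

public static class ServiceLevelSketch
{
    public static byte ForPrimary(
        bool operatorMaintenance,
        bool selfHealthy,
        bool topologyValid,
        bool applyInProgress,
        bool recoveryDwellMet,
        bool peerReachable)
    {
        // Reserved bands win first, per the precedence rule above.
        if (operatorMaintenance) return (byte)ServiceLevelBand.Maintenance;
        if (!selfHealthy)        return (byte)ServiceLevelBand.NoData;
        if (!topologyValid)      return (byte)ServiceLevelBand.InvalidTopology;

        // Operational bands (assumed ordering).
        if (applyInProgress)     return (byte)ServiceLevelBand.PrimaryMidApply;
        if (!recoveryDwellMet)   return (byte)ServiceLevelBand.RecoveringPrimary;
        if (!peerReachable)      return (byte)ServiceLevelBand.IsolatedPrimary;

        return (byte)ServiceLevelBand.AuthoritativePrimary;
    }
}
```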

## Publish fencing and split-brain prevention

Any Admin-triggered sp_PublishGeneration acquires an apply lease through ApplyLeaseRegistry.BeginApplyLease. While the lease is held:

- The calculator reports PrimaryMidApply / BackupMidApply — clients see the band shift and cut over to the unaffected peer rather than racing against a half-applied generation.
- RedundancyCoordinator.RefreshAsync is called at the end of the apply window so the post-publish topology becomes visible exactly once, atomically.
- The watchdog force-closes any lease older than ApplyMaxDuration; a stuck publisher therefore cannot strand a node at PrimaryMidApply.
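
The lease scope itself is easiest to see as code. The following is a self-contained sketch of the pattern; names and signatures are assumptions, and the real registry also keys leases on (ConfigGenerationId, PublishRequestId) and runs the ApplyMaxDuration watchdog, neither of which is reproduced here.

```csharp
// Sketch of the apply-lease scope; placeholder names throughout.
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class ApplyLeaseRegistrySketch
{
    private int _openLeases;

    // Feeds the applyInProgress input of the ServiceLevel calculation.
    public bool ApplyInProgress => Volatile.Read(ref _openLeases) > 0;

    public IAsyncDisposable BeginApplyLease(Guid configGenerationId, Guid publishRequestId)
    {
        Interlocked.Increment(ref _openLeases);
        return new Lease(this);
    }

    private void Release() => Interlocked.Decrement(ref _openLeases);

    private sealed class Lease : IAsyncDisposable
    {
        private readonly ApplyLeaseRegistrySketch _owner;
        internal Lease(ApplyLeaseRegistrySketch owner) => _owner = owner;

        // Runs on success, exception, and cancellation alike.
        public ValueTask DisposeAsync()
        {
            _owner.Release();
            return ValueTask.CompletedTask;
        }
    }
}

public static class PublishFlowSketch
{
    public static async Task ApplyGenerationAsync(
        ApplyLeaseRegistrySketch registry, Func<Task> applyGeneration)
    {
        // While this scope is open the node reports PrimaryMidApply / BackupMidApply.
        await using (registry.BeginApplyLease(Guid.NewGuid(), Guid.NewGuid()))
        {
            await applyGeneration();
        } // lease released on every exit path; the topology refresh runs after this point
    }
}
```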

Because role transitions are operator-driven (write RedundancyRole in the Config DB + publish), the Backup never auto-promotes. An IsolatedBackup at 80 is the signal that the operator should intervene; auto-failover is intentionally out of scope for the non-transparent model (decision #154).

## Metrics

RedundancyMetrics in src/ZB.MOM.WW.OtOpcUa.Admin/Services/RedundancyMetrics.cs registers the ZB.MOM.WW.OtOpcUa.Redundancy meter on the Admin process. Instruments:

| Name | Kind | Tags | Description |
| --- | --- | --- | --- |
| otopcua.redundancy.role_transition | Counter | cluster.id, node.id, from_role, to_role | Incremented every time FleetStatusPoller observes a RedundancyRole change on a ClusterNode row. |
| otopcua.redundancy.primary_count | ObservableGauge | cluster.id | Primary-role nodes per cluster — should be exactly 1 in nominal state. |
| otopcua.redundancy.secondary_count | ObservableGauge | cluster.id | Secondary-role nodes per cluster. |
| otopcua.redundancy.stale_count | ObservableGauge | cluster.id | Nodes whose LastSeenAt exceeded the stale threshold. |
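
A sketch of how these instruments map onto System.Diagnostics.Metrics follows; the meter and instrument names come from the table, but the class shape and how the counts are fed are assumptions, not the real RedundancyMetrics.

```csharp
// Sketch only: instrument names from the table above; everything else assumed.
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public sealed class RedundancyMetricsSketch
{
    private static readonly Meter Meter = new("ZB.MOM.WW.OtOpcUa.Redundancy");

    private readonly Counter<long> _roleTransitions =
        Meter.CreateCounter<long>("otopcua.redundancy.role_transition");

    private int _primaryCount;
    private int _secondaryCount;

    public RedundancyMetricsSketch()
    {
        Meter.CreateObservableGauge("otopcua.redundancy.primary_count", () => _primaryCount);
        Meter.CreateObservableGauge("otopcua.redundancy.secondary_count", () => _secondaryCount);
    }

    public void RecordRoleTransition(string clusterId, string nodeId, string fromRole, string toRole) =>
        _roleTransitions.Add(1,
            new KeyValuePair<string, object?>("cluster.id", clusterId),
            new KeyValuePair<string, object?>("node.id", nodeId),
            new KeyValuePair<string, object?>("from_role", fromRole),
            new KeyValuePair<string, object?>("to_role", toRole));

    public void SetClusterCounts(int primaries, int secondaries)
    {
        _primaryCount = primaries;
        _secondaryCount = secondaries;
    }
}
```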

Admin Program.cs wires OpenTelemetry to the Prometheus exporter when Metrics:Prometheus:Enabled=true (default), exposing the meter under /metrics. The endpoint is intentionally unauthenticated — fleet conventions put it behind a reverse-proxy basic-auth gate if needed.
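
As a minimal wiring sketch, assuming the OpenTelemetry.Extensions.Hosting and OpenTelemetry.Exporter.Prometheus.AspNetCore packages (this mirrors the behaviour described above but is not the actual Admin Program.cs):

```csharp
// Minimal Program.cs sketch for exposing the redundancy meter at /metrics.
using OpenTelemetry.Metrics;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddMeter("ZB.MOM.WW.OtOpcUa.Redundancy")   // the meter RedundancyMetrics registers
        .AddPrometheusExporter());

var app = builder.Build();

// Prometheus scrape endpoint, /metrics by default; unauthenticated, so gate it
// at a reverse proxy if the deployment requires it.
app.MapPrometheusScrapingEndpoint();

app.Run();
```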

## Real-time notifications (Admin UI)

FleetStatusPoller in src/ZB.MOM.WW.OtOpcUa.Admin/Hubs/ polls the ClusterNode table, records role transitions, updates the cluster counts via RedundancyMetrics.SetClusterCounts, and pushes a RoleChanged SignalR event onto FleetStatusHub when a transition is observed. RedundancyTab.razor subscribes with _hub.On<RoleChangedMessage>("RoleChanged", …) so connected Admin sessions see role swaps the moment they happen.
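
A sketch of the subscription side follows; the RoleChangedMessage fields are assumed, and the real Razor component additionally handles reconnects and disposal.

```csharp
// Sketch of the hub subscription described above; message fields are assumed.
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.SignalR.Client;

public sealed record RoleChangedMessage(string NodeId, string FromRole, string ToRole);

public static class FleetStatusSubscriptionSketch
{
    public static async Task<HubConnection> ConnectAsync(string fleetStatusHubUrl)
    {
        var hub = new HubConnectionBuilder()
            .WithUrl(fleetStatusHubUrl)
            .WithAutomaticReconnect()
            .Build();

        // One message per observed role transition.
        hub.On<RoleChangedMessage>("RoleChanged", msg =>
            Console.WriteLine($"{msg.NodeId}: {msg.FromRole} -> {msg.ToRole}"));

        await hub.StartAsync();
        return hub;
    }
}
```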

## Configuring a redundant pair

Redundancy is configured in the Config DB, not appsettings.json. The fields that must differ between the two instances:

| Field | Location | Instance 1 | Instance 2 |
| --- | --- | --- | --- |
| NodeId | appsettings.json Node:NodeId (bootstrap) | node-a | node-b |
| ClusterNode.ApplicationUri | Config DB | urn:node-a:OtOpcUa | urn:node-b:OtOpcUa |
| ClusterNode.RedundancyRole | Config DB | Primary | Secondary |
| ClusterNode.ServiceLevelBase | Config DB | typically 255 | typically 100 |

Shared between instances: ClusterId, Config DB connection string, published generation, cluster-level ACLs, UNS hierarchy, driver instances.
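
For the bootstrap file, a minimal sketch of the instance-1 difference, showing only the Node:NodeId key described above (any other bootstrap keys are omitted, and the surrounding structure is assumed from the Node:NodeId configuration path):

```json
{
  "Node": {
    "NodeId": "node-a"
  }
}
```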

Role swaps, stand-alone promotions, and base-level adjustments all happen through the Admin UI RedundancyTab — the operator edits the ClusterNode row in a draft generation and publishes. RedundancyCoordinator.RefreshAsync picks up the new topology without a process restart.

## Client-side failover

The OtOpcUa Client CLI at src/ZB.MOM.WW.OtOpcUa.Client.CLI supports -F / --failover-urls for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See Client.CLI.md for the command reference.

## vs. upstream-side redundancy

The mechanics on this page describe OtOpcUa as a redundant server — two of our instances clustered behind one OPC UA address space, exposing ServerUriArray + dynamic ServiceLevel to downstream clients. The mirror-image scenario — the OPC UA Client driver consuming an upstream redundant pair — is documented separately in drivers/OpcUaClient.md § Upstream redundancy. Both rely on the same OPC UA Part 4 § 6.6.2 model (non-transparent warm/hot via RedundancySupport + ServerUriArray + ServiceLevel); they sit at opposite ends of the gateway pipeline. A deployment can wire either, both, or neither.

## Depth reference

For the full decision trail and implementation plan — topology invariants, peer-probe cadence, recovery-dwell policy, compliance-script guard against enum-value drift — see docs/v2/plan.md §Phase 6.3.