lmxopcua/docs/Redundancy.md
Joseph Doherty 969b0847a1 docs: update path references for module-folder reorganization
Rewrite src/ and tests/ project paths in docs, CLAUDE.md, README.md, and
test-fixture READMEs to the new module-folder layout (Core/Server/Drivers/
Client/Tooling). References to retired v1 projects (Galaxy.Host/Proxy/Shared,
the legacy monolithic test projects) are left untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:10:29 -04:00

Redundancy

Overview

OtOpcUa supports OPC UA non-transparent warm/hot redundancy. Two or more OtOpcUa Server processes run side by side, share the same Config DB and the same driver backends (Galaxy ZB, MXAccess runtime, remote PLCs), and advertise the same OPC UA node tree. Each process owns a distinct ApplicationUri; OPC UA clients see both endpoints via the standard ServerUriArray and pick one based on the ServiceLevel that each server publishes.

The redundancy surface lives in src/Server/ZB.MOM.WW.OtOpcUa.Server/Redundancy/:

| Class | Role |
| --- | --- |
| RedundancyCoordinator | Process-singleton; owns the current RedundancyTopology loaded from the ClusterNode table. RefreshAsync re-reads after sp_PublishGeneration so operator role swaps take effect without a process restart. An atomic reference swap (Interlocked.Exchange) means readers always see a coherent snapshot. |
| RedundancyTopology | Immutable (ClusterId, Self, Peers, ServerUriArray, ValidityFlags) snapshot. |
| ApplyLeaseRegistry | Tracks in-progress sp_PublishGeneration apply leases keyed on (ConfigGenerationId, PublishRequestId). await using the disposable scope guarantees every exit path (success / exception / cancellation) decrements the lease; a stale-lease watchdog force-closes any lease older than ApplyMaxDuration (default 10 minutes) so a crashed publisher can't pin the node at PrimaryMidApply. |
| PeerReachabilityTracker | Maintains last-known reachability for each peer node over two independent probes: OPC UA ping and HTTP /healthz. Both must succeed for peerReachable = true. |
| RecoveryStateManager | Gates transitions out of the Recovering* bands; requires dwell + publish-witness satisfaction before allowing a return to nominal. |
| ServiceLevelCalculator | Pure function (role, selfHealthy, peerUa, peerHttp, applyInProgress, recoveryDwellMet, topologyValid, operatorMaintenance) → byte. |
| RedundancyStatePublisher | Orchestrates inputs into the calculator, pushes the resulting byte to the OPC UA ServiceLevel variable via an edge-triggered OnStateChanged event, and fires OnServerUriArrayChanged when the topology's ServerUriArray shifts. |
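
The snapshot swap that RedundancyCoordinator is described as performing with Interlocked.Exchange can be illustrated with a minimal sketch. The type and member names below (TopologySnapshot, CoordinatorSketch, Replace) are hypothetical stand-ins and not the project's actual classes; the only thing taken from the text is the idea of publishing an immutable snapshot through a single atomic reference swap.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Usage: a reader keeps working from the reference it grabbed; a refresh swaps in a new one.
var coordinator = new CoordinatorSketch(
    new TopologySnapshot("cluster-1", "node-a", new[] { "node-b" }));
TopologySnapshot before = coordinator.Current;
coordinator.Replace(new TopologySnapshot("cluster-1", "node-a", new[] { "node-b", "node-c" }));
Console.WriteLine($"{before.Peers.Count} -> {coordinator.Current.Peers.Count}"); // prints "1 -> 2"

// Hypothetical, trimmed-down stand-in for RedundancyTopology (immutable by construction).
public sealed record TopologySnapshot(string ClusterId, string Self, IReadOnlyList<string> Peers);

public sealed class CoordinatorSketch
{
    private TopologySnapshot _current;

    public CoordinatorSketch(TopologySnapshot initial) => _current = initial;

    // Readers take one reference and never observe a half-updated topology.
    public TopologySnapshot Current => Volatile.Read(ref _current);

    // Publishes the new snapshot in one atomic step and returns the one it replaced.
    public TopologySnapshot Replace(TopologySnapshot next) =>
        Interlocked.Exchange(ref _current, next);
}
```

Because the snapshot is never mutated after construction, a reader that captured the old reference simply finishes its work against the old topology.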

Data model

Per-node redundancy state lives in the Config DB ClusterNode table (src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Entities/ClusterNode.cs):

| Column | Role |
| --- | --- |
| NodeId | Unique node identity; matches Node:NodeId in the server's bootstrap appsettings.json. |
| ClusterId | Foreign key into ServerCluster. |
| RedundancyRole | Primary, Secondary, or Standalone (RedundancyRole enum in Configuration/Enums). |
| ServiceLevelBase | Per-node base value used to bias nominal ServiceLevel output. |
| ApplicationUri | Unique-per-node OPC UA ApplicationUri advertised in endpoint descriptions. |

ServerUriArray is derived from the set of peer ApplicationUri values at topology-load time and republished when the topology changes.

ServiceLevel matrix

ServiceLevelCalculator produces one of the following bands (see ServiceLevelBand enum in the same file):

| Band | Byte | Meaning |
| --- | --- | --- |
| Maintenance | 0 | Operator-declared maintenance. |
| NoData | 1 | Self-reported unhealthy (/healthz fails). |
| InvalidTopology | 2 | More than one Primary detected; both nodes self-demote. |
| RecoveringBackup | 30 | Backup post-fault, dwell not met. |
| BackupMidApply | 50 | Backup inside a publish-apply window. |
| IsolatedBackup | 80 | Primary unreachable; Backup says "take over if asked" but does not auto-promote (non-transparent model). |
| AuthoritativeBackup | 100 | Backup nominal. |
| RecoveringPrimary | 180 | Primary post-fault, dwell not met. |
| PrimaryMidApply | 200 | Primary inside a publish-apply window. |
| IsolatedPrimary | 230 | Primary with unreachable peer, retains authority. |
| AuthoritativePrimary | 255 | Primary nominal. |

The reserved bands (0 Maintenance, 1 NoData, 2 InvalidTopology) take precedence over operational states per OPC UA Part 5 §6.3.34. Operational values occupy 3..255 (the bands above start at 30), so spec-compliant clients that treat "<3 = unhealthy" keep working.

Standalone nodes (single-instance deployments) report AuthoritativePrimary when healthy and PrimaryMidApply during publish.
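
The band table can be read as a pure decision function. The sketch below is one plausible reading, not the project's ServiceLevelCalculator: the byte values, the input list, and the "both probes must succeed" rule come from this document, while the type names (Band, NodeRole, ServiceLevelSketch) and the exact order in which the checks are evaluated are assumptions.

```csharp
using System;

// Usage: a healthy Primary whose peer fails the HTTP probe counts as isolated.
Band band = ServiceLevelSketch.Compute(
    NodeRole.Primary, selfHealthy: true, peerUa: true, peerHttp: false,
    applyInProgress: false, recoveryDwellMet: true, topologyValid: true,
    operatorMaintenance: false);
Console.WriteLine($"{band} = {(byte)band}"); // prints "IsolatedPrimary = 230"

public enum NodeRole { Primary, Secondary, Standalone }

// Byte values copied from the band table; the enum name is a stand-in.
public enum Band : byte
{
    Maintenance = 0, NoData = 1, InvalidTopology = 2,
    RecoveringBackup = 30, BackupMidApply = 50, IsolatedBackup = 80, AuthoritativeBackup = 100,
    RecoveringPrimary = 180, PrimaryMidApply = 200, IsolatedPrimary = 230, AuthoritativePrimary = 255,
}

public static class ServiceLevelSketch
{
    public static Band Compute(
        NodeRole role, bool selfHealthy, bool peerUa, bool peerHttp,
        bool applyInProgress, bool recoveryDwellMet, bool topologyValid,
        bool operatorMaintenance)
    {
        // Reserved bands win over every operational state.
        // ASSUMPTION: the relative order of these three checks.
        if (operatorMaintenance) return Band.Maintenance;
        if (!selfHealthy) return Band.NoData;
        if (!topologyValid) return Band.InvalidTopology;

        // Standalone nodes only ever report the two Primary bands named in the text.
        if (role == NodeRole.Standalone)
            return applyInProgress ? Band.PrimaryMidApply : Band.AuthoritativePrimary;

        bool primary = role == NodeRole.Primary;
        bool peerReachable = peerUa && peerHttp; // both probes must succeed

        // ASSUMPTION: when several conditions hold, the lowest-valued band wins.
        if (!recoveryDwellMet) return primary ? Band.RecoveringPrimary : Band.RecoveringBackup;
        if (applyInProgress) return primary ? Band.PrimaryMidApply : Band.BackupMidApply;
        if (!peerReachable) return primary ? Band.IsolatedPrimary : Band.IsolatedBackup;
        return primary ? Band.AuthoritativePrimary : Band.AuthoritativeBackup;
    }
}
```

Keeping the function free of I/O and clocks means every row of the table can be covered by a plain unit test.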

Publish fencing and split-brain prevention

Any Admin-triggered sp_PublishGeneration acquires an apply lease through ApplyLeaseRegistry.BeginApplyLease. While the lease is held:

  • The calculator reports PrimaryMidApply / BackupMidApply — clients see the band shift and cut over to the unaffected peer rather than racing against a half-applied generation.
  • RedundancyCoordinator.RefreshAsync is called at the end of the apply window so the post-publish topology becomes visible exactly once, atomically.
  • The watchdog force-closes any lease older than ApplyMaxDuration; a stuck publisher therefore cannot strand a node at PrimaryMidApply.

Because role transitions are operator-driven (write RedundancyRole in the Config DB + publish), the Backup never auto-promotes. An IsolatedBackup at 80 is the signal that the operator should intervene; auto-failover is intentionally out of scope for the non-transparent model (decision #154).
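
The lease behaviour described above (a scope that always releases, plus a watchdog for stale leases) can be sketched as follows. This is an illustration under assumptions, not ApplyLeaseRegistry itself: the class name, the Begin and PruneStale members, and the key types (long, Guid) are hypothetical; only the key shape, the await using pattern, and the 10-minute default are taken from the text.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Usage: the lease is released on every exit path of the await using scope.
var registry = new LeaseRegistrySketch(maxDuration: TimeSpan.FromMinutes(10));
await using (registry.Begin(generationId: 42, publishRequestId: Guid.NewGuid()))
{
    Console.WriteLine(registry.ApplyInProgress); // prints "True"
}
Console.WriteLine(registry.ApplyInProgress);     // prints "False"

public sealed class LeaseRegistrySketch
{
    // Open leases keyed on (generation id, publish request id), valued by start time.
    private readonly ConcurrentDictionary<(long, Guid), DateTime> _open = new();
    private readonly TimeSpan _maxDuration;

    public LeaseRegistrySketch(TimeSpan maxDuration) => _maxDuration = maxDuration;

    // True while at least one apply lease is open; this would feed applyInProgress.
    public bool ApplyInProgress => !_open.IsEmpty;

    public IAsyncDisposable Begin(long generationId, Guid publishRequestId)
    {
        var key = (generationId, publishRequestId);
        _open[key] = DateTime.UtcNow;
        return new Lease(this, key);
    }

    // Watchdog pass: force-close leases older than the maximum duration.
    // Returns how many leases were closed.
    public int PruneStale(DateTime utcNow)
    {
        int closed = 0;
        foreach (var entry in _open)
        {
            if (utcNow - entry.Value > _maxDuration && _open.TryRemove(entry.Key, out _))
                closed++;
        }
        return closed;
    }

    private sealed class Lease : IAsyncDisposable
    {
        private readonly LeaseRegistrySketch _owner;
        private readonly (long, Guid) _key;

        public Lease(LeaseRegistrySketch owner, (long, Guid) key) => (_owner, _key) = (owner, key);

        // Removing an already-pruned lease is a harmless no-op.
        public ValueTask DisposeAsync()
        {
            _owner._open.TryRemove(_key, out _);
            return ValueTask.CompletedTask;
        }
    }
}
```

In a real service the watchdog pass would run on a timer; it takes the current time as a parameter here so the stale-lease path can be tested without waiting.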

Metrics

RedundancyMetrics in src/Server/ZB.MOM.WW.OtOpcUa.Admin/Services/RedundancyMetrics.cs registers the ZB.MOM.WW.OtOpcUa.Redundancy meter on the Admin process. Instruments:

| Name | Kind | Tags | Description |
| --- | --- | --- | --- |
| otopcua.redundancy.role_transition | Counter | cluster.id, node.id, from_role, to_role | Incremented every time FleetStatusPoller observes a RedundancyRole change on a ClusterNode row. |
| otopcua.redundancy.primary_count | ObservableGauge | cluster.id | Primary-role nodes per cluster; should be exactly 1 in nominal state. |
| otopcua.redundancy.secondary_count | ObservableGauge | cluster.id | Secondary-role nodes per cluster. |
| otopcua.redundancy.stale_count | ObservableGauge | cluster.id | Nodes whose LastSeenAt exceeded the stale threshold. |

Admin Program.cs wires OpenTelemetry to the Prometheus exporter when Metrics:Prometheus:Enabled=true (default), exposing the meter under /metrics. The endpoint is intentionally unauthenticated — fleet conventions put it behind a reverse-proxy basic-auth gate if needed.
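
For readers unfamiliar with System.Diagnostics.Metrics, the sketch below shows how a counter and an observable gauge with the names and tags listed above could be registered. It is not the project's RedundancyMetrics class: the class name, RecordRoleTransition, and SetPrimaryCount are hypothetical, only two of the four instruments are shown, and an in-process MeterListener stands in for the Prometheus exporter so the example has no package dependencies.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Linq;

// Usage: listen to the meter in-process and print whatever it reports.
using var metrics = new RedundancyMetricsSketch();
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, self) =>
{
    if (instrument.Meter.Name == RedundancyMetricsSketch.MeterName)
        self.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>((inst, value, tags, state) =>
    Console.WriteLine($"{inst.Name} += {value}"));
listener.SetMeasurementEventCallback<int>((inst, value, tags, state) =>
    Console.WriteLine($"{inst.Name} = {value}"));
listener.Start();

metrics.RecordRoleTransition("cluster-1", "node-a", "Primary", "Secondary");
metrics.SetPrimaryCount("cluster-1", 1);
listener.RecordObservableInstruments(); // polls the gauge callback once

public sealed class RedundancyMetricsSketch : IDisposable
{
    public const string MeterName = "ZB.MOM.WW.OtOpcUa.Redundancy";

    private readonly Meter _meter = new(MeterName);
    private readonly Counter<long> _roleTransitions;
    private readonly ConcurrentDictionary<string, int> _primaryCounts = new();

    public RedundancyMetricsSketch()
    {
        _roleTransitions = _meter.CreateCounter<long>("otopcua.redundancy.role_transition");
        _meter.CreateObservableGauge<int>("otopcua.redundancy.primary_count", ObservePrimaryCounts);
    }

    public void RecordRoleTransition(string clusterId, string nodeId, string fromRole, string toRole) =>
        _roleTransitions.Add(1,
            new KeyValuePair<string, object?>("cluster.id", clusterId),
            new KeyValuePair<string, object?>("node.id", nodeId),
            new KeyValuePair<string, object?>("from_role", fromRole),
            new KeyValuePair<string, object?>("to_role", toRole));

    public void SetPrimaryCount(string clusterId, int count) => _primaryCounts[clusterId] = count;

    // Called by the metrics pipeline on each collection; one measurement per cluster.
    private IEnumerable<Measurement<int>> ObservePrimaryCounts() =>
        _primaryCounts.Select(kv => new Measurement<int>(
            kv.Value, new KeyValuePair<string, object?>("cluster.id", kv.Key)));

    public void Dispose() => _meter.Dispose();
}
```

The gauge reads from state that the poller updates, so the value is only computed when a scrape or listener asks for it.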

Real-time notifications (Admin UI)

FleetStatusPoller in src/Server/ZB.MOM.WW.OtOpcUa.Admin/Hubs/ polls the ClusterNode table, records role transitions, updates RedundancyMetrics.SetClusterCounts, and pushes a RoleChanged SignalR event onto FleetStatusHub when a transition is observed. RedundancyTab.razor subscribes with _hub.On<RoleChangedMessage>("RoleChanged", …) so connected Admin sessions see role swaps the moment they happen.
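
The transition detection at the heart of such a poller amounts to comparing the roles seen on the previous poll with the roles seen now. The following is a minimal sketch of that comparison only; RoleChange, RoleDiffSketch, and the string-typed roles are hypothetical, and the database query and SignalR push that FleetStatusPoller performs are deliberately left out.

```csharp
using System;
using System.Collections.Generic;

// Usage: an operator has swapped the roles of node-a and node-b between two polls.
var previous = new Dictionary<string, string> { ["node-a"] = "Primary", ["node-b"] = "Secondary" };
var current = new Dictionary<string, string> { ["node-a"] = "Secondary", ["node-b"] = "Primary" };
foreach (var change in RoleDiffSketch.Detect(previous, current))
    Console.WriteLine($"{change.NodeId}: {change.FromRole} -> {change.ToRole}");

public sealed record RoleChange(string NodeId, string FromRole, string ToRole);

public static class RoleDiffSketch
{
    // Returns one entry per node whose role differs from the previous poll.
    // ASSUMPTION: nodes that are new in this poll are not reported as transitions.
    public static List<RoleChange> Detect(
        IReadOnlyDictionary<string, string> previous,
        IReadOnlyDictionary<string, string> current)
    {
        var changes = new List<RoleChange>();
        foreach (var (nodeId, role) in current)
        {
            if (previous.TryGetValue(nodeId, out var oldRole) && oldRole != role)
                changes.Add(new RoleChange(nodeId, oldRole, role));
        }
        return changes;
    }
}
```

Each detected change is what would be counted in the role_transition metric and broadcast as a RoleChanged message.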

Configuring a redundant pair

Redundancy is configured in the Config DB, not appsettings.json. The fields that must differ between the two instances:

| Field | Location | Instance 1 | Instance 2 |
| --- | --- | --- | --- |
| NodeId | appsettings.json Node:NodeId (bootstrap) | node-a | node-b |
| ClusterNode.ApplicationUri | Config DB | urn:node-a:OtOpcUa | urn:node-b:OtOpcUa |
| ClusterNode.RedundancyRole | Config DB | Primary | Secondary |
| ClusterNode.ServiceLevelBase | Config DB | typically 255 | typically 100 |

Shared between instances: ClusterId, Config DB connection string, published generation, cluster-level ACLs, UNS hierarchy, driver instances.

Role swaps, Standalone promotions, and base-level adjustments all happen through the Admin UI RedundancyTab: the operator edits the ClusterNode row in a draft generation and publishes. RedundancyCoordinator.RefreshAsync picks up the new topology without a process restart.

Client-side failover

The OtOpcUa Client CLI at src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI supports -F / --failover-urls for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See Client.CLI.md for the command reference.
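
For clients that choose an endpoint by ServiceLevel rather than by list order, the selection step could look like the sketch below. It does not describe the CLI's own logic, which this document only characterises as moving to the next available server. EndpointProbe, EndpointPicker, and the example URLs are hypothetical, and the probe results are assumed to have been read already; the "below 3 is unhealthy" cutoff follows the ServiceLevel section.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Usage: node-a reports NoData, node-b is a nominal Backup, node-c did not answer.
var probes = new[]
{
    new EndpointProbe("opc.tcp://node-a:4840", 1),
    new EndpointProbe("opc.tcp://node-b:4840", 100),
    new EndpointProbe("opc.tcp://node-c:4840", null),
};
Console.WriteLine(EndpointPicker.Pick(probes)?.Url); // prints "opc.tcp://node-b:4840"

// ServiceLevel is a byte on the wire; int? keeps the sketch simple. Null means unreachable.
public sealed record EndpointProbe(string Url, int? ServiceLevel);

public static class EndpointPicker
{
    // Values 0..2 are the reserved "unhealthy" bands and are never selected.
    private const int MinUsableServiceLevel = 3;

    // Returns the reachable endpoint with the highest usable ServiceLevel, or null if none.
    public static EndpointProbe? Pick(IEnumerable<EndpointProbe> probes) =>
        probes
            .Where(p => p.ServiceLevel.HasValue && p.ServiceLevel.Value >= MinUsableServiceLevel)
            .OrderByDescending(p => p.ServiceLevel.GetValueOrDefault())
            .FirstOrDefault();
}
```

A client would typically re-run this selection whenever its session KeepAlive fails or the ServiceLevel of its current server drops.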

Depth reference

For the full decision trail and implementation plan — topology invariants, peer-probe cadence, recovery-dwell policy, compliance-script guard against enum-value drift — see docs/v2/plan.md §Phase 6.3.