# Redundancy

## Overview

OtOpcUa supports OPC UA non-transparent warm/hot redundancy. Two (or more) OtOpcUa Server processes run side by side, share the same Config DB, the same driver backends (Galaxy ZB, MXAccess runtime, remote PLCs), and advertise the same OPC UA node tree. Each process owns a distinct ApplicationUri; OPC UA clients see both endpoints via the standard ServerUriArray and pick one based on the ServiceLevel that each server publishes.

The redundancy surface lives in src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/:

| Class | Role |
| --- | --- |
| RedundancyCoordinator | Process-singleton; owns the current RedundancyTopology loaded from the ClusterNode table. RefreshAsync re-reads after sp_PublishGeneration so operator role swaps take effect without a process restart. CAS-style swap (Interlocked.Exchange) means readers always see a coherent snapshot. |
| RedundancyTopology | Immutable (ClusterId, Self, Peers, ServerUriArray, ValidityFlags) snapshot. |
| ApplyLeaseRegistry | Tracks in-progress sp_PublishGeneration apply leases keyed on (ConfigGenerationId, PublishRequestId). await using the disposable scope guarantees every exit path (success / exception / cancellation) decrements the lease; a stale-lease watchdog force-closes any lease older than ApplyMaxDuration (default 10 minutes) so a crashed publisher can't pin the node at PrimaryMidApply. |
| PeerReachabilityTracker | Maintains last-known reachability for each peer node over two independent probes — OPC UA ping and HTTP /healthz. Both must succeed for peerReachable = true. |
| RecoveryStateManager | Gates transitions out of the Recovering* bands; requires dwell + publish-witness satisfaction before allowing a return to nominal. |
| ServiceLevelCalculator | Pure function (role, selfHealthy, peerUa, peerHttp, applyInProgress, recoveryDwellMet, topologyValid, operatorMaintenance) → byte. |
| RedundancyStatePublisher | Orchestrates inputs into the calculator, pushes the resulting byte to the OPC UA ServiceLevel variable via an edge-triggered OnStateChanged event, and fires OnServerUriArrayChanged when the topology's ServerUriArray shifts. |
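
The coordinator's atomic snapshot swap is worth a minimal sketch. The code below is illustrative only: the type shape, member names, and the loader are placeholders, not the actual RedundancyCoordinator / RedundancyTopology implementation.

```csharp
// Sketch of the atomic-snapshot pattern described above; placeholder names throughout.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed record RedundancyTopology(
    Guid ClusterId,
    string SelfNodeId,
    IReadOnlyList<string> ServerUriArray);

public sealed class TopologySnapshotHolder
{
    private RedundancyTopology _current =
        new(Guid.Empty, "node-a", new[] { "urn:node-a:OtOpcUa" });

    // Readers always get one complete, immutable snapshot.
    public RedundancyTopology Current => Volatile.Read(ref _current);

    public async Task RefreshAsync(CancellationToken ct = default)
    {
        RedundancyTopology next = await LoadFromClusterNodeTableAsync(ct);

        // Single atomic reference swap: a reader sees either the old snapshot
        // or the new one, never a partially updated mix.
        Interlocked.Exchange(ref _current, next);
    }

    // Placeholder for the ClusterNode query.
    private static Task<RedundancyTopology> LoadFromClusterNodeTableAsync(CancellationToken ct) =>
        Task.FromResult(new RedundancyTopology(
            Guid.NewGuid(),
            "node-a",
            new[] { "urn:node-a:OtOpcUa", "urn:node-b:OtOpcUa" }));
}
```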

## Data model

Per-node redundancy state lives in the Config DB ClusterNode table (src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/ClusterNode.cs):

| Column | Role |
| --- | --- |
| NodeId | Unique node identity; matches Node:NodeId in the server's bootstrap appsettings.json. |
| ClusterId | Foreign key into ServerCluster. |
| RedundancyRole | Primary, Secondary, or Standalone (RedundancyRole enum in Configuration/Enums). |
| ServiceLevelBase | Per-node base value used to bias nominal ServiceLevel output. |
| ApplicationUri | Unique-per-node OPC UA ApplicationUri advertised in endpoint descriptions. |

ServerUriArray is derived from the set of peer ApplicationUri values at topology-load time and republished when the topology changes.
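
As a sketch of that derivation (assumed logic, not the actual topology loader; ClusterNodeRow is a hypothetical stand-in for the ClusterNode entity):

```csharp
// Assumed derivation: ServerUriArray as the ordered, de-duplicated set of
// ApplicationUri values across the cluster's nodes.
using System.Collections.Generic;
using System.Linq;

public sealed record ClusterNodeRow(string NodeId, string ApplicationUri, string RedundancyRole);

public static class ServerUriArrayBuilder
{
    public static string[] Build(IEnumerable<ClusterNodeRow> clusterNodes) =>
        clusterNodes
            .OrderBy(n => n.NodeId)          // stable ordering on both servers
            .Select(n => n.ApplicationUri)
            .Distinct()
            .ToArray();
}
```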

## ServiceLevel matrix

ServiceLevelCalculator produces one of the following bands (see ServiceLevelBand enum in the same file):

| Band | Byte | Meaning |
| --- | --- | --- |
| Maintenance | 0 | Operator-declared maintenance. |
| NoData | 1 | Self-reported unhealthy (/healthz fails). |
| InvalidTopology | 2 | More than one Primary detected; both nodes self-demote. |
| RecoveringBackup | 30 | Backup post-fault, dwell not met. |
| BackupMidApply | 50 | Backup inside a publish-apply window. |
| IsolatedBackup | 80 | Primary unreachable; Backup says "take over if asked" — does not auto-promote (non-transparent model). |
| AuthoritativeBackup | 100 | Backup nominal. |
| RecoveringPrimary | 180 | Primary post-fault, dwell not met. |
| PrimaryMidApply | 200 | Primary inside a publish-apply window. |
| IsolatedPrimary | 230 | Primary with unreachable peer, retains authority. |
| AuthoritativePrimary | 255 | Primary nominal. |

The reserved bands (0 Maintenance, 1 NoData, 2 InvalidTopology) take precedence over operational states per OPC UA Part 5 §6.3.34. All operational bands sit above 2 (30..255 in the table above), so spec-compliant clients that treat values below 3 as unhealthy keep working.

Standalone nodes (single-instance deployments) report AuthoritativePrimary when healthy and PrimaryMidApply during publish.
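
The band values above can be read as an enum plus a precedence check. In the sketch below, the byte values mirror the table; the input list and precedence ordering of ForPrimary are an illustration, not a transcription of the real ServiceLevelCalculator.

```csharp
// Band values mirror the table above. The precedence function is an assumed
// ordering for a Primary-role node; the real calculator takes the full input
// tuple listed in the component table.
public enum ServiceLevelBand : byte
{
    Maintenance          = 0,
    NoData               = 1,
    InvalidTopology      = 2,
    RecoveringBackup     = 30,
    BackupMidApply       = 50,
    IsolatedBackup       = 80,
    AuthoritativeBackup  = 100,
    RecoveringPrimary    = 180,
    PrimaryMidApply      = 200,
    IsolatedPrimary      = 230,
    AuthoritativePrimary = 255,
}

public static class ServiceLevelSketch
{
    public static byte ForPrimary(
        bool operatorMaintenance,
        bool selfHealthy,
        bool topologyValid,
        bool applyInProgress,
        bool recoveryDwellMet,
        bool peerReachable)
    {
        // Reserved bands win first, per the precedence rule above.
        if (operatorMaintenance) return (byte)ServiceLevelBand.Maintenance;
        if (!selfHealthy)        return (byte)ServiceLevelBand.NoData;
        if (!topologyValid)      return (byte)ServiceLevelBand.InvalidTopology;

        // Operational bands (assumed ordering).
        if (applyInProgress)     return (byte)ServiceLevelBand.PrimaryMidApply;
        if (!recoveryDwellMet)   return (byte)ServiceLevelBand.RecoveringPrimary;
        if (!peerReachable)      return (byte)ServiceLevelBand.IsolatedPrimary;

        return (byte)ServiceLevelBand.AuthoritativePrimary;
    }
}
```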

## Publish fencing and split-brain prevention

Any Admin-triggered sp_PublishGeneration acquires an apply lease through ApplyLeaseRegistry.BeginApplyLease. While the lease is held:

- The calculator reports PrimaryMidApply / BackupMidApply — clients see the band shift and cut over to the unaffected peer rather than racing against a half-applied generation.
- RedundancyCoordinator.RefreshAsync is called at the end of the apply window so the post-publish topology becomes visible exactly once, atomically.
- The watchdog force-closes any lease older than ApplyMaxDuration; a stuck publisher therefore cannot strand a node at PrimaryMidApply.
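
The lease scope itself is easiest to see as code. The following is a self-contained sketch of the pattern; names and signatures are assumptions, and the real registry also keys leases on (ConfigGenerationId, PublishRequestId) and runs the ApplyMaxDuration watchdog, neither of which is reproduced here.

```csharp
// Sketch of the apply-lease scope; placeholder names throughout.
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class ApplyLeaseRegistrySketch
{
    private int _openLeases;

    // Feeds the applyInProgress input of the ServiceLevel calculation.
    public bool ApplyInProgress => Volatile.Read(ref _openLeases) > 0;

    public IAsyncDisposable BeginApplyLease(Guid configGenerationId, Guid publishRequestId)
    {
        Interlocked.Increment(ref _openLeases);
        return new Lease(this);
    }

    private void Release() => Interlocked.Decrement(ref _openLeases);

    private sealed class Lease : IAsyncDisposable
    {
        private readonly ApplyLeaseRegistrySketch _owner;
        internal Lease(ApplyLeaseRegistrySketch owner) => _owner = owner;

        // Runs on success, exception, and cancellation alike.
        public ValueTask DisposeAsync()
        {
            _owner.Release();
            return ValueTask.CompletedTask;
        }
    }
}

public static class PublishFlowSketch
{
    public static async Task ApplyGenerationAsync(
        ApplyLeaseRegistrySketch registry, Func<Task> applyGeneration)
    {
        // While this scope is open the node reports PrimaryMidApply / BackupMidApply.
        await using (registry.BeginApplyLease(Guid.NewGuid(), Guid.NewGuid()))
        {
            await applyGeneration();
        } // lease released on every exit path; the topology refresh runs after this point
    }
}
```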

Because role transitions are operator-driven (write RedundancyRole in the Config DB + publish), the Backup never auto-promotes. An IsolatedBackup at 80 is the signal that the operator should intervene; auto-failover is intentionally out of scope for the non-transparent model (decision #154).

## Metrics

RedundancyMetrics in src/ZB.MOM.WW.OtOpcUa.Admin/Services/RedundancyMetrics.cs registers the ZB.MOM.WW.OtOpcUa.Redundancy meter on the Admin process. Instruments:

| Name | Kind | Tags | Description |
| --- | --- | --- | --- |
| otopcua.redundancy.role_transition | Counter | cluster.id, node.id, from_role, to_role | Incremented every time FleetStatusPoller observes a RedundancyRole change on a ClusterNode row. |
| otopcua.redundancy.primary_count | ObservableGauge | cluster.id | Primary-role nodes per cluster — should be exactly 1 in nominal state. |
| otopcua.redundancy.secondary_count | ObservableGauge | cluster.id | Secondary-role nodes per cluster. |
| otopcua.redundancy.stale_count | ObservableGauge | cluster.id | Nodes whose LastSeenAt exceeded the stale threshold. |
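
A sketch of how these instruments map onto System.Diagnostics.Metrics follows; the meter and instrument names come from the table, but the class shape and how the counts are fed are assumptions, not the real RedundancyMetrics.

```csharp
// Sketch only: instrument names from the table above; everything else assumed.
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public sealed class RedundancyMetricsSketch
{
    private static readonly Meter Meter = new("ZB.MOM.WW.OtOpcUa.Redundancy");

    private readonly Counter<long> _roleTransitions =
        Meter.CreateCounter<long>("otopcua.redundancy.role_transition");

    private int _primaryCount;
    private int _secondaryCount;

    public RedundancyMetricsSketch()
    {
        Meter.CreateObservableGauge("otopcua.redundancy.primary_count", () => _primaryCount);
        Meter.CreateObservableGauge("otopcua.redundancy.secondary_count", () => _secondaryCount);
    }

    public void RecordRoleTransition(string clusterId, string nodeId, string fromRole, string toRole) =>
        _roleTransitions.Add(1,
            new KeyValuePair<string, object?>("cluster.id", clusterId),
            new KeyValuePair<string, object?>("node.id", nodeId),
            new KeyValuePair<string, object?>("from_role", fromRole),
            new KeyValuePair<string, object?>("to_role", toRole));

    public void SetClusterCounts(int primaries, int secondaries)
    {
        _primaryCount = primaries;
        _secondaryCount = secondaries;
    }
}
```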

Admin Program.cs wires OpenTelemetry to the Prometheus exporter when Metrics:Prometheus:Enabled=true (default), exposing the meter under /metrics. The endpoint is intentionally unauthenticated — fleet conventions put it behind a reverse-proxy basic-auth gate if needed.
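
As a minimal wiring sketch, assuming the OpenTelemetry.Extensions.Hosting and OpenTelemetry.Exporter.Prometheus.AspNetCore packages (this mirrors the behaviour described above but is not the actual Admin Program.cs):

```csharp
// Minimal Program.cs sketch for exposing the redundancy meter at /metrics.
using OpenTelemetry.Metrics;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddMeter("ZB.MOM.WW.OtOpcUa.Redundancy")   // the meter RedundancyMetrics registers
        .AddPrometheusExporter());

var app = builder.Build();

// Prometheus scrape endpoint, /metrics by default; unauthenticated, so gate it
// at a reverse proxy if the deployment requires it.
app.MapPrometheusScrapingEndpoint();

app.Run();
```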

## Real-time notifications (Admin UI)

FleetStatusPoller in src/ZB.MOM.WW.OtOpcUa.Admin/Hubs/ polls the ClusterNode table, records role transitions, updates the cluster counts via RedundancyMetrics.SetClusterCounts, and pushes a RoleChanged SignalR event onto FleetStatusHub when a transition is observed. RedundancyTab.razor subscribes with _hub.On<RoleChangedMessage>("RoleChanged", …) so connected Admin sessions see role swaps the moment they happen.
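
A sketch of the subscription side follows; the RoleChangedMessage fields are assumed, and the real Razor component additionally handles reconnects and disposal.

```csharp
// Sketch of the hub subscription described above; message fields are assumed.
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.SignalR.Client;

public sealed record RoleChangedMessage(string NodeId, string FromRole, string ToRole);

public static class FleetStatusSubscriptionSketch
{
    public static async Task<HubConnection> ConnectAsync(string fleetStatusHubUrl)
    {
        var hub = new HubConnectionBuilder()
            .WithUrl(fleetStatusHubUrl)
            .WithAutomaticReconnect()
            .Build();

        // One message per observed role transition.
        hub.On<RoleChangedMessage>("RoleChanged", msg =>
            Console.WriteLine($"{msg.NodeId}: {msg.FromRole} -> {msg.ToRole}"));

        await hub.StartAsync();
        return hub;
    }
}
```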

## Configuring a redundant pair

Redundancy is configured in the Config DB, not appsettings.json. The fields that must differ between the two instances:

| Field | Location | Instance 1 | Instance 2 |
| --- | --- | --- | --- |
| NodeId | appsettings.json Node:NodeId (bootstrap) | node-a | node-b |
| ClusterNode.ApplicationUri | Config DB | urn:node-a:OtOpcUa | urn:node-b:OtOpcUa |
| ClusterNode.RedundancyRole | Config DB | Primary | Secondary |
| ClusterNode.ServiceLevelBase | Config DB | typically 255 | typically 100 |

Shared between instances: ClusterId, Config DB connection string, published generation, cluster-level ACLs, UNS hierarchy, driver instances.
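
For the bootstrap file, a minimal sketch of the instance-1 difference, showing only the Node:NodeId key described above (any other bootstrap keys are omitted, and the surrounding structure is assumed from the Node:NodeId configuration path):

```json
{
  "Node": {
    "NodeId": "node-a"
  }
}
```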

Role swaps, stand-alone promotions, and base-level adjustments all happen through the Admin UI RedundancyTab — the operator edits the ClusterNode row in a draft generation and publishes. RedundancyCoordinator.RefreshAsync picks up the new topology without a process restart.

## Client-side failover

The OtOpcUa Client CLI at src/ZB.MOM.WW.OtOpcUa.Client.CLI supports -F / --failover-urls for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See Client.CLI.md for the command reference.

## vs. upstream-side redundancy

The mechanics on this page describe OtOpcUa as a redundant server — two of our instances clustered behind one OPC UA address space, exposing ServerUriArray + dynamic ServiceLevel to downstream clients. The mirror-image scenario — the OPC UA Client driver consuming an upstream redundant pair — is documented separately in drivers/OpcUaClient.md § Upstream redundancy. Both rely on the same OPC UA Part 4 § 6.6.2 model (non-transparent warm/hot via RedundancySupport + ServerUriArray + ServiceLevel); they sit at opposite ends of the gateway pipeline. A deployment can wire either, both, or neither.

## Depth reference

For the full decision trail and implementation plan — topology invariants, peer-probe cadence, recovery-dwell policy, compliance-script guard against enum-value drift — see docs/v2/plan.md §Phase 6.3.