Redundancy
Overview
OtOpcUa supports OPC UA non-transparent warm/hot redundancy. Two (or more) OtOpcUa Server processes run side-by-side, share the same Config DB, the same driver backends (Galaxy ZB, MXAccess runtime, remote PLCs), and advertise the same OPC UA node tree. Each process owns a distinct ApplicationUri; OPC UA clients see both endpoints via the standard ServerUriArray and pick one based on the ServiceLevel that each server publishes.
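As a rough illustration of the client side of this model, the sketch below picks an endpoint from the ServiceLevel values a client has read from each server. It is written in Python for brevity and is not taken from any OtOpcUa client; the function name, the URL strings, and the `minimum` cut-off of 3 (mirroring the "<3 = unhealthy" convention described later in this document) are assumptions.

```python
from typing import Mapping, Optional


def pick_endpoint(service_levels: Mapping[str, int], minimum: int = 3) -> Optional[str]:
    """Return the endpoint URL with the highest ServiceLevel, or None.

    service_levels maps a (hypothetical) endpoint URL to the byte the
    client read from that server's ServiceLevel variable. Values below
    `minimum` are treated as "do not connect" (assumption).
    """
    usable = {url: level for url, level in service_levels.items() if level >= minimum}
    if not usable:
        return None
    # On a tie, max() keeps the first endpoint in iteration order.
    return max(usable, key=lambda url: usable[url])
```

A client would typically re-read ServiceLevel periodically (or subscribe to it) and call a function like this again whenever a value changes.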
The redundancy surface lives in src/Server/ZB.MOM.WW.OtOpcUa.Server/Redundancy/:
| Class | Role |
|---|---|
| RedundancyCoordinator | Process-singleton; owns the current RedundancyTopology loaded from the ClusterNode table. RefreshAsync re-reads after sp_PublishGeneration so operator role swaps take effect without a process restart. An atomic reference swap (Interlocked.Exchange) means readers always see a coherent snapshot. |
| RedundancyTopology | Immutable (ClusterId, Self, Peers, ServerUriArray, ValidityFlags) snapshot. |
| ApplyLeaseRegistry | Tracks in-progress sp_PublishGeneration apply leases keyed on (ConfigGenerationId, PublishRequestId). await using the disposable scope guarantees every exit path (success / exception / cancellation) decrements the lease; a stale-lease watchdog force-closes any lease older than ApplyMaxDuration (default 10 minutes) so a crashed publisher can't pin the node at PrimaryMidApply. |
| PeerReachabilityTracker | Maintains last-known reachability for each peer node over two independent probes: OPC UA ping and HTTP /healthz. Both must succeed for peerReachable = true. |
| RecoveryStateManager | Gates transitions out of the Recovering* bands; requires dwell + publish-witness satisfaction before allowing a return to nominal. |
| ServiceLevelCalculator | Pure function (role, selfHealthy, peerUa, peerHttp, applyInProgress, recoveryDwellMet, topologyValid, operatorMaintenance) → byte. |
| RedundancyStatePublisher | Orchestrates inputs into the calculator, pushes the resulting byte to the OPC UA ServiceLevel variable via an edge-triggered OnStateChanged event, and fires OnServerUriArrayChanged when the topology's ServerUriArray shifts. |
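The immutable-snapshot pattern behind RedundancyCoordinator and RedundancyTopology can be sketched as follows. This is a simplified Python illustration, not the project's .NET code: the class names, the reduced field set, and the `loader` callable (standing in for a read of the ClusterNode table) are all assumptions.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class TopologySnapshot:
    """Immutable stand-in for RedundancyTopology (fields simplified)."""
    cluster_id: str
    self_node: str
    peers: Tuple[str, ...]
    server_uri_array: Tuple[str, ...]


class Coordinator:
    """Holds the current snapshot; a refresh replaces it wholesale."""

    def __init__(self, loader: Callable[[], TopologySnapshot]) -> None:
        self._loader = loader          # hypothetical: reads ClusterNode rows
        self._lock = threading.Lock()  # serialises writers only
        self._current = loader()

    @property
    def current(self) -> TopologySnapshot:
        # Readers do a single reference read; the object never mutates,
        # so they cannot observe a half-updated topology.
        return self._current

    def refresh(self) -> TopologySnapshot:
        """Load a new snapshot, swap it in, and return the previous one."""
        fresh = self._loader()         # built off to the side
        with self._lock:
            previous, self._current = self._current, fresh
        return previous
```

A reader that grabbed `coordinator.current` before a refresh keeps working against the old, still self-consistent snapshot; only later reads see the new one.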
Data model
Per-node redundancy state lives in the Config DB ClusterNode table (src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Entities/ClusterNode.cs):
| Column | Role |
|---|---|
| NodeId | Unique node identity; matches Node:NodeId in the server's bootstrap appsettings.json. |
| ClusterId | Foreign key into ServerCluster. |
| RedundancyRole | Primary, Secondary, or Standalone (RedundancyRole enum in Configuration/Enums). |
| ServiceLevelBase | Per-node base value used to bias nominal ServiceLevel output. |
| ApplicationUri | Unique-per-node OPC UA ApplicationUri advertised in endpoint descriptions. |
ServerUriArray is derived from the set of peer ApplicationUri values at topology-load time and republished when the topology changes.
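A minimal sketch of that derivation, in Python for illustration. The row type and function name are hypothetical, and two details are assumptions rather than documented behaviour: the sketch includes every node of the cluster (the local node as well as its peers) and returns the URIs in sorted order so the result is stable between loads.

```python
from typing import Iterable, List, NamedTuple


class NodeRow(NamedTuple):
    """Simplified stand-in for a ClusterNode row."""
    node_id: str
    cluster_id: str
    application_uri: str


def server_uri_array(rows: Iterable[NodeRow], cluster_id: str) -> List[str]:
    """Collect the ApplicationUri values of one cluster.

    Raises ValueError on duplicates, because ApplicationUri is
    documented as unique per node.
    """
    uris = [r.application_uri for r in rows if r.cluster_id == cluster_id]
    if len(set(uris)) != len(uris):
        raise ValueError("duplicate ApplicationUri in cluster " + cluster_id)
    return sorted(uris)
```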
ServiceLevel matrix
ServiceLevelCalculator produces one of the following bands (see ServiceLevelBand enum in the same file):
| Band | Byte | Meaning |
|---|---|---|
| Maintenance | 0 | Operator-declared maintenance. |
| NoData | 1 | Self-reported unhealthy (/healthz fails). |
| InvalidTopology | 2 | More than one Primary detected; both nodes self-demote. |
| RecoveringBackup | 30 | Backup post-fault, dwell not met. |
| BackupMidApply | 50 | Backup inside a publish-apply window. |
| IsolatedBackup | 80 | Primary unreachable; Backup says "take over if asked" but does not auto-promote (non-transparent model). |
| AuthoritativeBackup | 100 | Backup nominal. |
| RecoveringPrimary | 180 | Primary post-fault, dwell not met. |
| PrimaryMidApply | 200 | Primary inside a publish-apply window. |
| IsolatedPrimary | 230 | Primary with unreachable peer, retains authority. |
| AuthoritativePrimary | 255 | Primary nominal. |
The reserved bands (0 Maintenance, 1 NoData, 2 InvalidTopology) take precedence over operational states per OPC UA Part 5 §6.3.34. Operational values occupy 3..255 so spec-compliant clients that treat "<3 = unhealthy" keep working.
Standalone nodes (single-instance deployments) report AuthoritativePrimary when healthy and PrimaryMidApply during publish.
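The matrix can be read as a pure function from the calculator's inputs to a band. The Python sketch below is one plausible reading, not the actual ServiceLevelCalculator: the byte values and the "reserved bands first" rule come from this document, while the ordering among the reserved bands, the ordering among the operational checks (apply window, then recovery dwell, then peer isolation), and the choice to skip the isolation check for Standalone nodes are assumptions. ServiceLevelBase biasing is ignored.

```python
from enum import IntEnum


class ServiceLevelBand(IntEnum):
    MAINTENANCE = 0
    NO_DATA = 1
    INVALID_TOPOLOGY = 2
    RECOVERING_BACKUP = 30
    BACKUP_MID_APPLY = 50
    ISOLATED_BACKUP = 80
    AUTHORITATIVE_BACKUP = 100
    RECOVERING_PRIMARY = 180
    PRIMARY_MID_APPLY = 200
    ISOLATED_PRIMARY = 230
    AUTHORITATIVE_PRIMARY = 255


def compute(role: str, self_healthy: bool, peer_ua: bool, peer_http: bool,
            apply_in_progress: bool, recovery_dwell_met: bool,
            topology_valid: bool, operator_maintenance: bool) -> ServiceLevelBand:
    B = ServiceLevelBand
    # Reserved bands win over every operational state.
    if operator_maintenance:
        return B.MAINTENANCE
    if not self_healthy:
        return B.NO_DATA
    if not topology_valid:
        return B.INVALID_TOPOLOGY

    # Standalone nodes report on the Primary side of the matrix.
    primary_side = role in ("Primary", "Standalone")
    if apply_in_progress:
        return B.PRIMARY_MID_APPLY if primary_side else B.BACKUP_MID_APPLY
    if not recovery_dwell_met:
        return B.RECOVERING_PRIMARY if primary_side else B.RECOVERING_BACKUP
    # Both probes must succeed for the peer to count as reachable.
    peer_reachable = peer_ua and peer_http
    if role != "Standalone" and not peer_reachable:
        return B.ISOLATED_PRIMARY if primary_side else B.ISOLATED_BACKUP
    return B.AUTHORITATIVE_PRIMARY if primary_side else B.AUTHORITATIVE_BACKUP
```

Keeping the function free of I/O and clocks makes every row of the matrix directly checkable with a table-driven unit test.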
Publish fencing and split-brain prevention
Any Admin-triggered sp_PublishGeneration acquires an apply lease through ApplyLeaseRegistry.BeginApplyLease. While the lease is held:
- The calculator reports PrimaryMidApply / BackupMidApply; clients see the band shift and cut over to the unaffected peer rather than racing against a half-applied generation.
- RedundancyCoordinator.RefreshAsync is called at the end of the apply window so the post-publish topology becomes visible exactly once, atomically.
- The watchdog force-closes any lease older than ApplyMaxDuration; a stuck publisher therefore cannot strand a node at PrimaryMidApply.
Because role transitions are operator-driven (write RedundancyRole in the Config DB + publish), the Backup never auto-promotes. An IsolatedBackup at 80 is the signal that the operator should intervene; auto-failover is intentionally out of scope for the non-transparent model (decision #154).
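The lease behaviour described in this section (released on every exit path, force-closed when stale) can be sketched in Python as below. This is an illustration under stated assumptions, not the ApplyLeaseRegistry implementation: it uses a synchronous context manager where the real code uses an async disposable scope, it is not thread-safe, and the method names, the injectable `clock`, and `prune_stale` are hypothetical. Only the key shape and the 10-minute default come from this document.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, Tuple

LeaseKey = Tuple[int, str]  # (ConfigGenerationId, PublishRequestId)


class ApplyLeaseRegistry:
    def __init__(self, max_duration_s: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_duration_s      # ApplyMaxDuration, default 10 minutes
        self._clock = clock             # injectable so tests can fake time
        self._open: Dict[LeaseKey, float] = {}

    @property
    def apply_in_progress(self) -> bool:
        return bool(self._open)

    @contextmanager
    def begin_apply_lease(self, generation_id: int,
                          publish_request_id: str) -> Iterator[LeaseKey]:
        key = (generation_id, publish_request_id)
        self._open[key] = self._clock()
        try:
            yield key
        finally:
            # Runs on success, exception, and cancellation alike.
            self._open.pop(key, None)

    def prune_stale(self) -> int:
        """Watchdog step: drop leases older than the maximum; return how many."""
        now = self._clock()
        stale = [k for k, started in self._open.items() if now - started > self._max]
        for key in stale:
            del self._open[key]
        return len(stale)
```

A publisher would wrap the apply in `with registry.begin_apply_lease(generation_id, request_id): ...`, while a periodic timer calls `prune_stale()` to cover the case where the publisher died without leaving the block.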
Metrics
RedundancyMetrics in src/Server/ZB.MOM.WW.OtOpcUa.Admin/Services/RedundancyMetrics.cs registers the ZB.MOM.WW.OtOpcUa.Redundancy meter on the Admin process. Instruments:
| Name | Kind | Tags | Description |
|---|---|---|---|
| otopcua.redundancy.role_transition | Counter | cluster.id, node.id, from_role, to_role | Incremented every time FleetStatusPoller observes a RedundancyRole change on a ClusterNode row. |
| otopcua.redundancy.primary_count | ObservableGauge | cluster.id | Primary-role nodes per cluster; should be exactly 1 in nominal state. |
| otopcua.redundancy.secondary_count | ObservableGauge | cluster.id | Secondary-role nodes per cluster. |
| otopcua.redundancy.stale_count | ObservableGauge | cluster.id | Nodes whose LastSeenAt exceeded the stale threshold. |
Admin Program.cs wires OpenTelemetry to the Prometheus exporter when Metrics:Prometheus:Enabled=true (default), exposing the meter under /metrics. The endpoint is intentionally unauthenticated — fleet conventions put it behind a reverse-proxy basic-auth gate if needed.
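To make the three gauges concrete, the sketch below computes per-cluster counts from a list of node rows. It is an illustrative Python reduction, not the RedundancyMetrics code: the row type, the function name, and the `stale_after` parameter are assumptions (the document does not state the stale threshold), as is the choice to count a stale node toward its role gauge as well.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, NamedTuple


class NodeStatus(NamedTuple):
    """Simplified stand-in for the ClusterNode fields the gauges need."""
    cluster_id: str
    role: str               # "Primary", "Secondary" or "Standalone"
    last_seen_at: datetime


def cluster_counts(rows: Iterable[NodeStatus], now: datetime,
                   stale_after: timedelta) -> Dict[str, Dict[str, int]]:
    """Return {cluster_id: {"primary": n, "secondary": n, "stale": n}}."""
    counts: Dict[str, Dict[str, int]] = {}
    for row in rows:
        c = counts.setdefault(row.cluster_id,
                              {"primary": 0, "secondary": 0, "stale": 0})
        if row.role == "Primary":
            c["primary"] += 1
        elif row.role == "Secondary":
            c["secondary"] += 1
        # Staleness is tracked independently of role (assumption).
        if now - row.last_seen_at > stale_after:
            c["stale"] += 1
    return counts
```

An alert on `primary` differing from 1 for a cluster is the natural consumer of these numbers, given the nominal-state expectation in the table above.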
Real-time notifications (Admin UI)
FleetStatusPoller in src/Server/ZB.MOM.WW.OtOpcUa.Admin/Hubs/ polls the ClusterNode table, records role transitions, updates RedundancyMetrics.SetClusterCounts, and pushes a RoleChanged SignalR event onto FleetStatusHub when a transition is observed. RedundancyTab.razor subscribes with _hub.On<RoleChangedMessage>("RoleChanged", …) so connected Admin sessions see role swaps the moment they happen.
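The transition-detection step of such a poller amounts to diffing two polls of the ClusterNode table. The Python sketch below shows that step only; it is not the FleetStatusPoller code, the names are hypothetical, and treating a newly appeared node as "no transition" is an assumption.

```python
from typing import Dict, List, NamedTuple, Tuple

NodeKey = Tuple[str, str]  # (cluster_id, node_id)


class RoleChanged(NamedTuple):
    """Illustrative payload, loosely modelled on the RoleChanged event."""
    cluster_id: str
    node_id: str
    from_role: str
    to_role: str


def detect_role_changes(previous: Dict[NodeKey, str],
                        current: Dict[NodeKey, str]) -> List[RoleChanged]:
    """Compare two polls and list every node whose role differs."""
    changes: List[RoleChanged] = []
    for key, new_role in sorted(current.items()):
        old_role = previous.get(key)
        # A node seen for the first time has no "from" role to report.
        if old_role is not None and old_role != new_role:
            changes.append(RoleChanged(key[0], key[1], old_role, new_role))
    return changes
```

Each returned item maps onto one counter increment (tags from_role / to_role) and one pushed notification; an operator role swap on a two-node cluster therefore yields two items.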
Configuring a redundant pair
Redundancy is configured in the Config DB, not appsettings.json. The fields that must differ between the two instances:
| Field | Location | Instance 1 | Instance 2 |
|---|---|---|---|
| NodeId | appsettings.json Node:NodeId (bootstrap) | node-a | node-b |
| ClusterNode.ApplicationUri | Config DB | urn:node-a:OtOpcUa | urn:node-b:OtOpcUa |
| ClusterNode.RedundancyRole | Config DB | Primary | Secondary |
| ClusterNode.ServiceLevelBase | Config DB | typically 255 | typically 100 |
Shared between instances: ClusterId, Config DB connection string, published generation, cluster-level ACLs, UNS hierarchy, driver instances.
Role swaps, stand-alone promotions, and base-level adjustments all happen through the Admin UI RedundancyTab — the operator edits the ClusterNode row in a draft generation and publishes. RedundancyCoordinator.RefreshAsync picks up the new topology without a process restart.
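The per-instance differences and shared fields for a redundant pair lend themselves to a pre-publish sanity check. The Python sketch below is a hypothetical validator, not something the Admin UI is documented to run; the type, the function, the messages, and the `cluster-1` identifier used in the usage example are assumptions.

```python
from typing import List, NamedTuple


class NodeConfig(NamedTuple):
    node_id: str
    cluster_id: str
    application_uri: str
    redundancy_role: str
    service_level_base: int


def validate_pair(a: NodeConfig, b: NodeConfig) -> List[str]:
    """Return a list of problems; an empty list means the pair looks sane."""
    problems: List[str] = []
    if a.node_id == b.node_id:
        problems.append("NodeId must differ")
    if a.application_uri == b.application_uri:
        problems.append("ApplicationUri must differ")
    if a.cluster_id != b.cluster_id:
        problems.append("ClusterId must match")
    if sorted([a.redundancy_role, b.redundancy_role]) != ["Primary", "Secondary"]:
        problems.append("expected one Primary and one Secondary")
    return problems
```

For example, `validate_pair(NodeConfig("node-a", "cluster-1", "urn:node-a:OtOpcUa", "Primary", 255), NodeConfig("node-b", "cluster-1", "urn:node-b:OtOpcUa", "Secondary", 100))` returns an empty list.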
Client-side failover
The OtOpcUa Client CLI at src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI supports -F / --failover-urls for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See Client.CLI.md for the command reference.
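The "reconnect to the next available server" step can be sketched as a round-robin search over the failover list. This Python fragment is illustrative only and does not reflect the CLI's internals: the function name and the `probe` callable (standing in for an actual connection attempt) are hypothetical.

```python
from typing import Callable, Optional, Sequence


def next_available(urls: Sequence[str], current: str,
                   probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first URL after `current` (wrapping around) that probes OK.

    The current URL is tried last, so a server that has come back is
    still a valid fallback. Returns None when nothing answers.
    """
    start = urls.index(current) if current in urls else -1
    for offset in range(1, len(urls) + 1):
        candidate = urls[(start + offset) % len(urls)]
        if probe(candidate):
            return candidate
    return None
```

After a successful reconnect the caller still has to recreate its subscriptions on the new session, as the paragraph above notes.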
Depth reference
For the full decision trail and implementation plan — topology invariants, peer-probe cadence, recovery-dwell policy, compliance-script guard against enum-value drift — see docs/v2/plan.md §Phase 6.3.