# Redundancy

## Overview

OtOpcUa supports OPC UA **non-transparent** warm/hot redundancy. Two (or more) OtOpcUa Server processes run side by side, share the same Config DB, the same driver backends (Galaxy ZB, MXAccess runtime, remote PLCs), and advertise the same OPC UA node tree. Each process owns a distinct `ApplicationUri`; OPC UA clients see both endpoints via the standard `ServerUriArray` and pick one based on the `ServiceLevel` that each server publishes.

The redundancy surface lives in `src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/`:

| Class | Role |
|---|---|
| `RedundancyCoordinator` | Process singleton; owns the current `RedundancyTopology` loaded from the `ClusterNode` table. `RefreshAsync` re-reads after `sp_PublishGeneration` so operator role swaps take effect without a process restart. A CAS-style swap (`Interlocked.Exchange`) means readers always see a coherent snapshot. |
| `RedundancyTopology` | Immutable `(ClusterId, Self, Peers, ServerUriArray, ValidityFlags)` snapshot. |
| `ApplyLeaseRegistry` | Tracks in-progress `sp_PublishGeneration` apply leases keyed on `(ConfigGenerationId, PublishRequestId)`. `await using` the disposable scope guarantees every exit path (success / exception / cancellation) decrements the lease; a stale-lease watchdog force-closes any lease older than `ApplyMaxDuration` (default 10 minutes) so a crashed publisher cannot pin the node at `PrimaryMidApply`. |
| `PeerReachabilityTracker` | Maintains last-known reachability for each peer node over two independent probes — OPC UA ping and HTTP `/healthz`. Both must succeed for `peerReachable = true`. |
| `RecoveryStateManager` | Gates transitions out of the `Recovering*` bands; requires dwell plus publish-witness satisfaction before allowing a return to nominal. |
| `ServiceLevelCalculator` | Pure function `(role, selfHealthy, peerUa, peerHttp, applyInProgress, recoveryDwellMet, topologyValid, operatorMaintenance) → byte`. |
| `RedundancyStatePublisher` | Orchestrates inputs into the calculator, pushes the resulting byte to the OPC UA `ServiceLevel` variable via an edge-triggered `OnStateChanged` event, and fires `OnServerUriArrayChanged` when the topology's `ServerUriArray` shifts. |

## Data model

Per-node redundancy state lives in the Config DB `ClusterNode` table (`src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/ClusterNode.cs`):

| Column | Role |
|---|---|
| `NodeId` | Unique node identity; matches `Node:NodeId` in the server's bootstrap `appsettings.json`. |
| `ClusterId` | Foreign key into `ServerCluster`. |
| `RedundancyRole` | `Primary`, `Secondary`, or `Standalone` (`RedundancyRole` enum in `Configuration/Enums`). |
| `ServiceLevelBase` | Per-node base value used to bias the nominal ServiceLevel output. |
| `ApplicationUri` | Unique-per-node OPC UA ApplicationUri advertised in endpoint descriptions. |

`ServerUriArray` is derived from the set of peer `ApplicationUri` values at topology-load time and republished when the topology changes.

## ServiceLevel matrix

`ServiceLevelCalculator` produces one of the following bands (see the `ServiceLevelBand` enum in the same file):

| Band | Byte | Meaning |
|---|---|---|
| `Maintenance` | 0 | Operator-declared maintenance. |
| `NoData` | 1 | Self-reported unhealthy (`/healthz` fails). |
| `InvalidTopology` | 2 | More than one Primary detected; both nodes self-demote. |
| `RecoveringBackup` | 30 | Backup post-fault, dwell not met. |
| `BackupMidApply` | 50 | Backup inside a publish-apply window. |
| `IsolatedBackup` | 80 | Primary unreachable; Backup says "take over if asked" — does **not** auto-promote (non-transparent model). |
| `AuthoritativeBackup` | 100 | Backup nominal. |
| `RecoveringPrimary` | 180 | Primary post-fault, dwell not met. |
| `PrimaryMidApply` | 200 | Primary inside a publish-apply window. |
| `IsolatedPrimary` | 230 | Primary with unreachable peer; retains authority. |
| `AuthoritativePrimary` | 255 | Primary nominal. |
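As a rough illustration of how the matrix above is evaluated, here is a hypothetical Python rendering of the calculator's decision order. The band names and byte values come straight from the matrix; the parameter names come from the `ServiceLevelCalculator` signature; the exact precedence among the operational checks (apply window before recovery dwell before isolation) is an assumption, not a statement about the C# implementation.

```python
from enum import IntEnum

class Band(IntEnum):
    # Byte values copied from the ServiceLevel matrix.
    MAINTENANCE = 0
    NO_DATA = 1
    INVALID_TOPOLOGY = 2
    RECOVERING_BACKUP = 30
    BACKUP_MID_APPLY = 50
    ISOLATED_BACKUP = 80
    AUTHORITATIVE_BACKUP = 100
    RECOVERING_PRIMARY = 180
    PRIMARY_MID_APPLY = 200
    ISOLATED_PRIMARY = 230
    AUTHORITATIVE_PRIMARY = 255

def service_level(role, self_healthy, peer_ua, peer_http, apply_in_progress,
                  recovery_dwell_met, topology_valid, operator_maintenance):
    """Sketch of the pure function; reserved bands are checked first."""
    # Reserved bands (0 / 1 / 2) take precedence over operational states.
    if operator_maintenance:
        return Band.MAINTENANCE
    if not self_healthy:
        return Band.NO_DATA
    if not topology_valid:  # e.g. two Primaries detected -> self-demote
        return Band.INVALID_TOPOLOGY

    is_primary = role in ("Primary", "Standalone")  # Standalone uses Primary bands
    peer_reachable = peer_ua and peer_http          # both probes must succeed

    if apply_in_progress:
        return Band.PRIMARY_MID_APPLY if is_primary else Band.BACKUP_MID_APPLY
    if not recovery_dwell_met:
        return Band.RECOVERING_PRIMARY if is_primary else Band.RECOVERING_BACKUP
    if role != "Standalone" and not peer_reachable:
        return Band.ISOLATED_PRIMARY if is_primary else Band.ISOLATED_BACKUP
    return Band.AUTHORITATIVE_PRIMARY if is_primary else Band.AUTHORITATIVE_BACKUP
```

Note how a Standalone node with no peers never takes an `Isolated*` branch: when healthy it sits at 255, matching the behaviour described for single-instance deployments.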
The reserved bands (0 Maintenance, 1 NoData, 2 InvalidTopology) take precedence over operational states per OPC UA Part 5 §6.3.34. Operational values occupy 3..255 so spec-compliant clients that treat "< 3 = unhealthy" keep working. Standalone nodes (single-instance deployments) report `AuthoritativePrimary` when healthy and `PrimaryMidApply` during publish.

## Publish fencing and split-brain prevention

Any Admin-triggered `sp_PublishGeneration` acquires an apply lease through `ApplyLeaseRegistry.BeginApplyLease`. While the lease is held:

- The calculator reports `PrimaryMidApply` / `BackupMidApply` — clients see the band shift and cut over to the unaffected peer rather than racing against a half-applied generation.
- `RedundancyCoordinator.RefreshAsync` is called at the end of the apply window so the post-publish topology becomes visible exactly once, atomically.
- The watchdog force-closes any lease older than `ApplyMaxDuration`; a stuck publisher therefore cannot strand a node at `PrimaryMidApply`.

Because role transitions are **operator-driven** (write `RedundancyRole` in the Config DB, then publish), the Backup never auto-promotes. An `IsolatedBackup` at 80 is the signal that the operator should intervene; auto-failover is intentionally out of scope for the non-transparent model (decision #154).

## Metrics

`RedundancyMetrics` in `src/ZB.MOM.WW.OtOpcUa.Admin/Services/RedundancyMetrics.cs` registers the `ZB.MOM.WW.OtOpcUa.Redundancy` meter on the Admin process. Instruments:

| Name | Kind | Tags | Description |
|---|---|---|---|
| `otopcua.redundancy.role_transition` | Counter | `cluster.id`, `node.id`, `from_role`, `to_role` | Incremented every time `FleetStatusPoller` observes a `RedundancyRole` change on a `ClusterNode` row. |
| `otopcua.redundancy.primary_count` | ObservableGauge | `cluster.id` | Primary-role nodes per cluster — should be exactly 1 in the nominal state. |
| `otopcua.redundancy.secondary_count` | ObservableGauge | `cluster.id` | Secondary-role nodes per cluster. |
| `otopcua.redundancy.stale_count` | ObservableGauge | `cluster.id` | Nodes whose `LastSeenAt` exceeded the stale threshold. |

Admin `Program.cs` wires OpenTelemetry to the Prometheus exporter when `Metrics:Prometheus:Enabled=true` (the default), exposing the meter under `/metrics`. The endpoint is intentionally unauthenticated — fleet conventions put it behind a reverse-proxy basic-auth gate if needed.

## Real-time notifications (Admin UI)

`FleetStatusPoller` in `src/ZB.MOM.WW.OtOpcUa.Admin/Hubs/` polls the `ClusterNode` table, records role transitions, updates `RedundancyMetrics.SetClusterCounts`, and pushes a `RoleChanged` SignalR event onto `FleetStatusHub` when a transition is observed. `RedundancyTab.razor` subscribes with `_hub.On("RoleChanged", …)` so connected Admin sessions see role swaps the moment they happen.

## Configuring a redundant pair

Redundancy is configured **in the Config DB, not in `appsettings.json`**. The fields that must differ between the two instances:

| Field | Location | Instance 1 | Instance 2 |
|---|---|---|---|
| `NodeId` | `appsettings.json` `Node:NodeId` (bootstrap) | `node-a` | `node-b` |
| `ClusterNode.ApplicationUri` | Config DB | `urn:node-a:OtOpcUa` | `urn:node-b:OtOpcUa` |
| `ClusterNode.RedundancyRole` | Config DB | `Primary` | `Secondary` |
| `ClusterNode.ServiceLevelBase` | Config DB | typically 255 | typically 100 |

Shared between instances: `ClusterId`, the Config DB connection string, the published generation, cluster-level ACLs, the UNS hierarchy, and driver instances.

Role swaps, stand-alone promotions, and base-level adjustments all happen through the Admin UI `RedundancyTab` — the operator edits the `ClusterNode` row in a draft generation and publishes. `RedundancyCoordinator.RefreshAsync` picks up the new topology without a process restart.
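The transition detection that feeds both the `role_transition` counter and the `RoleChanged` SignalR event amounts to diffing two successive `ClusterNode` snapshots. A minimal sketch of that diff, in Python rather than C# and with hypothetical helper names (`diff_roles`, `primary_count` — not names from the codebase):

```python
def diff_roles(previous, current):
    """Compare two poll snapshots ({node_id: role}) and return the
    transitions to record, as (node_id, from_role, to_role) tuples."""
    return [(node, previous[node], role)
            for node, role in current.items()
            if node in previous and previous[node] != role]

def primary_count(current):
    """Value an ObservableGauge like otopcua.redundancy.primary_count
    would report -- exactly 1 in a nominal cluster."""
    return sum(1 for role in current.values() if role == "Primary")
```

An operator-driven role swap then shows up as two tuples in one poll cycle (the old Primary demoting, the old Secondary promoting), while `primary_count` stays at 1 before and after — a transient 0 or 2 is what the gauge is there to catch.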
## Client-side failover

The OtOpcUa Client CLI at `src/ZB.MOM.WW.OtOpcUa.Client.CLI` supports `-F` / `--failover-urls` for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See [`Client.CLI.md`](Client.CLI.md) for the command reference.

## Depth reference

For the full decision trail and implementation plan — topology invariants, peer-probe cadence, recovery-dwell policy, the compliance-script guard against enum-value drift — see `docs/v2/plan.md` §Phase 6.3.
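The client-side failover pattern — rotate to the next configured endpoint when KeepAlive lapses, then recreate the subscription — can be sketched generically. This is not the CLI's implementation; `failover_sequence`, `run`, and the injected `connect` callable are all hypothetical stand-ins for the real OPC UA session calls.

```python
import itertools

def failover_sequence(urls):
    """Round-robin over the configured failover endpoints; yields the
    next URL to try each time the current session is lost."""
    return itertools.cycle(urls)

def run(urls, connect, max_attempts):
    """Reconnect loop: try endpoints in order until one accepts a
    session (connect returns None on failure), then report which
    endpoint now hosts the recreated subscription."""
    endpoints = failover_sequence(urls)
    for _ in range(max_attempts):
        url = next(endpoints)
        session = connect(url)
        if session is not None:
            return url
    return None
```

Bounding the loop with `max_attempts` (rather than retrying forever) is one design choice among several; a real client would also back off between rotations so a fully-down pair does not get hammered.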