lmxopcua/docs/Redundancy.md
Joseph Doherty 3c3fef911c docs: v2 updates to Redundancy, ServiceHosting, security, README (Task 64)
- Redundancy.md: full rewrite — Akka-leader-driven ServiceLevel replaces
  operator-managed RedundancyRole. Documents the 5-tier ServiceLevelCalculator,
  RedundancyStateActor cluster singleton, and the DPS data flow.

- ServiceHosting.md: full rewrite — single fused OtOpcUa.Host binary with
  OTOPCUA_ROLES env gating. Documents the conditional DI graph and the new
  health endpoints (/health/ready, /health/active, /healthz).

- security.md: v2 banner at top covering path/project renames + new JWT bearer
  + DataProtection persisted to ConfigDb. Body unchanged because the 4-concern
  security model is unchanged in v2; full per-section rewrite waits for F15
  (Admin pages migration) since security.md references many pages that move.

- README.md: platform overview updated to v2 (fused Host + role gating).
2026-05-26 06:38:55 -04:00

Redundancy (v2)

Overview

OtOpcUa supports OPC UA non-transparent warm/hot redundancy. Two or more OtOpcUa.Host processes run side-by-side, share the same Config DB, and join the same Akka.NET cluster. Each process owns a distinct ApplicationUri; OPC UA clients see both endpoints via the standard ServerUriArray and pick one based on the ServiceLevel byte that each server publishes.

v2 change. v1's operator-managed ClusterNode.RedundancyRole column is gone, along with RedundancyCoordinator, ApplyLeaseRegistry, and PeerHttpProbeLoop. Primary/secondary is now derived from the Akka cluster role-leader for the driver role. The operator no longer writes a role into the DB; cluster topology and health drive ServiceLevel automatically.

The runtime pieces live in:

| Component | Project | Role |
| --- | --- | --- |
| ServiceLevelCalculator | OtOpcUa.ControlPlane.Redundancy | Pure function (NodeHealthInputs) → byte. No side effects. |
| RedundancyStateActor | OtOpcUa.ControlPlane.Redundancy | Admin-role cluster singleton; subscribes to cluster topology events, debounces 250ms, broadcasts RedundancyStateChanged on the redundancy-state DPS topic. |
| DbHealthProbeActor | OtOpcUa.Runtime.Health | Per-node; runs SELECT 1 against ConfigDb every 5s. Read by health endpoint + redundancy calc. |
| PeerOpcUaProbeActor | OtOpcUa.Runtime.Health | Per-node; pings peer opc.tcp://peer:4840 (real probe call is staged for follow-up F12). |
| ClusterRoleInfo | OtOpcUa.Cluster | Live view of cluster membership + role-leader; exposes IClusterRoleInfo to the rest of the host. |

ServiceLevel tiers (Part 5 §6.5)

ServiceLevelCalculator.Compute(NodeHealthInputs) returns a byte in 0..255 by tier:

| Tier | Byte | Condition |
| --- | --- | --- |
| Down | 0 | Member status is not Up or Joining (leaving, removed, exiting). |
| Critically degraded | 100 | ConfigDb unreachable AND data is stale. |
| Stale | 200 | Data stale but ConfigDb reachable. |
| Healthy follower | 240 | DB ok + OPC UA probe ok + not stale. |
| Healthy leader | 250 | Healthy + this node is the driver role-leader. |
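
The tier table can be read as a short chain of precedence checks. The following is a minimal sketch under stated assumptions, not the shipped calculator: the field names on NodeHealthInputs and the class name ServiceLevelSketch are hypothetical, and the table does not say what happens when data is fresh but a probe is failing, so the sketch arbitrarily falls back to the Stale byte for those cases. The design document's truth table is authoritative.

```csharp
using System;

// Hypothetical input shape; the real NodeHealthInputs fields are not shown in this document.
public sealed record NodeHealthInputs(
    bool MemberUpOrJoining,   // cluster member status is Up or Joining
    bool ConfigDbReachable,   // latest ConfigDb probe result
    bool PeerProbeOk,         // latest OPC UA peer probe result
    bool DataStale,           // driver data is considered stale
    bool IsDriverRoleLeader); // this node is role-leader for the driver role

public static class ServiceLevelSketch
{
    public const byte Down = 0;
    public const byte CriticallyDegraded = 100;
    public const byte Stale = 200;
    public const byte HealthyFollower = 240;
    public const byte HealthyLeader = 250;

    // Pure function: the same inputs always produce the same byte, with no I/O.
    public static byte Compute(NodeHealthInputs h)
    {
        if (!h.MemberUpOrJoining) return Down;

        // Stale data splits into two tiers depending on ConfigDb reachability.
        if (h.DataStale) return h.ConfigDbReachable ? Stale : CriticallyDegraded;

        if (h.ConfigDbReachable && h.PeerProbeOk)
            return h.IsDriverRoleLeader ? HealthyLeader : HealthyFollower;

        // Assumption: combinations the table does not list (fresh data but a
        // failing DB or peer probe) are reported as Stale in this sketch.
        return Stale;
    }
}
```

Because the function has no side effects, every row of the table can be checked with a plain unit test that constructs the inputs directly.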

Drivers write their computed byte into the OPC UA ServiceLevel Variable on each refresh. Clients that apply the standard redundancy heuristic ("pick the highest ServiceLevel") therefore prefer the role-leader and fall back to a follower when the leader degrades.
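
On the client side, the "pick the highest ServiceLevel" heuristic reduces to a sort over the bytes read from each server. A minimal sketch, assuming the client has already read ServiceLevel from every endpoint; ServerCandidate and FailoverChoice are hypothetical names that belong to neither OtOpcUa nor any OPC UA SDK, and skipping servers that report 0 is a choice made for this example.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical pairing of an endpoint URL with the ServiceLevel byte last read from it.
public sealed record ServerCandidate(string EndpointUrl, byte ServiceLevel);

public static class FailoverChoice
{
    // Returns the candidate with the highest ServiceLevel, skipping servers
    // that report 0 (Down). Returns null when no usable server remains.
    public static ServerCandidate? PickPreferred(IEnumerable<ServerCandidate> candidates) =>
        candidates
            .Where(c => c.ServiceLevel > 0)
            .OrderByDescending(c => c.ServiceLevel)
            .FirstOrDefault();
}
```

With one node at 250 and another at 240, the node at 250 is chosen; if it later drops to 200 or 0, the same call returns the follower.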

Data flow

Cluster topology event ──┐
DB health probe ─────────┤
OPC UA peer probe ───────┤
                          ▼
              RedundancyStateActor (admin singleton)
                          │  debounce 250ms
                          ▼
              DPS topic "redundancy-state"
                          │
                          ▼
                Driver nodes' OpcUaPublishActor
                          │
                          ▼
              ServiceLevelCalculator → byte
                          │
                          ▼
              OPC UA ServiceLevel Variable

The admin singleton is the cluster's only RedundancyStateActor. If the admin leader fails over, the new admin node spins up its replacement, re-subscribes to cluster events, and publishes a fresh snapshot from the current Cluster.State. There is no DB-persisted state to recover.
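
The 250ms debounce in the flow above means that a burst of topology or probe events produces a single broadcast once the burst has settled. The sketch below illustrates that trailing-edge behaviour in plain C# with an injected clock. It is an illustration only: Debouncer is a hypothetical class, and the real actor presumably relies on Akka's own scheduling rather than polling.

```csharp
using System;

// Trailing-edge debounce: a burst of signals yields one flush once no new
// signal has arrived for the quiet period.
public sealed class Debouncer
{
    private readonly TimeSpan _quiet;
    private DateTimeOffset? _lastSignal;

    public Debouncer(TimeSpan quiet) => _quiet = quiet;

    // Record an incoming event (for example a topology change) seen at 'now'.
    public void Signal(DateTimeOffset now) => _lastSignal = now;

    // Returns true exactly once per burst, after the quiet period has elapsed
    // since the most recent signal. Returns false while events are still arriving.
    public bool TryFlush(DateTimeOffset now)
    {
        if (_lastSignal is null || now - _lastSignal.Value < _quiet) return false;
        _lastSignal = null; // burst consumed; wait for the next signal
        return true;
    }
}
```

Passing the current time in explicitly keeps the sketch deterministic, so the timing can be tested without sleeping.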

Configuration

Per-node identity comes from appsettings.json + the OTOPCUA_ROLES env var:

{
  "Cluster": {
    "Hostname": "0.0.0.0",
    "Port": 4053,
    "PublicHostname": "node-a.lan",
    "SeedNodes": ["akka.tcp://otopcua@node-a.lan:4053"],
    "Roles": ["admin", "driver"]
  }
}

OTOPCUA_ROLES=admin,driver

Both nodes share the same ConfigDb connection string; Cluster.PublicHostname and Roles are what make them distinct in cluster gossip. The first node bootstraps the cluster (its address goes in SeedNodes); the second node joins via the same SeedNodes list.
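
As an illustration, the second node's appsettings.json would differ only in its public hostname. This fragment is a sketch under assumptions: node-b.lan is a hypothetical hostname, and the second node is assumed to carry the same two roles as the first.

```json
{
  "Cluster": {
    "Hostname": "0.0.0.0",
    "Port": 4053,
    "PublicHostname": "node-b.lan",
    "SeedNodes": ["akka.tcp://otopcua@node-a.lan:4053"],
    "Roles": ["admin", "driver"]
  }
}
```

SeedNodes still points at node-a.lan, because the seed list is what lets the joining node find the cluster that the first node bootstrapped.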

The Node:NodeId setting, the ClusterNode.RedundancyRole column, and ServiceLevelBase no longer exist. NodeId is derived as host:port from the cluster PublicHostname and Port (see ClusterRoleInfo.LocalNode for the formula).

Split-brain

akka.conf configures Akka's split-brain resolver with active-strategy = keep-oldest, stable-after = 15s, and failure-detector.threshold = 10.0. Under a clean partition, the side that contains the oldest member stays up and the other side downs itself once the partition has been stable for about 15 seconds. The RedundancyStateActor on the surviving partition recomputes from the post-partition Cluster.State.
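
For orientation, the settings named above would look roughly like this in HOCON. This is a sketch, not a copy of the project's akka.conf: it assumes the quoted threshold belongs to the cluster failure detector, and it omits the downing-provider registration that enables the resolver.

```hocon
akka.cluster {
  split-brain-resolver {
    active-strategy = keep-oldest
    stable-after = 15s
  }
  # Assumption: the threshold applies to akka.cluster.failure-detector.
  failure-detector.threshold = 10.0
}
```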

There is no operator-driven role swap during a partition. Failover is what the cluster does automatically.

Client-side failover

The OtOpcUa Client CLI at src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI supports -F / --failover-urls for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See Client.CLI.md.

Depth reference

For the full design — message contracts, tiered calculator truth table, recovery semantics — see docs/plans/2026-05-26-akka-hosting-alignment-design.md §6.