
Joseph Doherty 3c3fef911c docs: v2 updates to Redundancy, ServiceHosting, security, README (Task 64)

- Redundancy.md: full rewrite — Akka-leader-driven ServiceLevel replaces
  operator-managed RedundancyRole. Documents the 5-tier ServiceLevelCalculator,
  RedundancyStateActor cluster singleton, and the DPS data flow.

- ServiceHosting.md: full rewrite — single fused OtOpcUa.Host binary with
  OTOPCUA_ROLES env gating. Documents the conditional DI graph and the new
  health endpoints (/health/ready, /health/active, /healthz).

- security.md: v2 banner at top covering path/project renames + new JWT bearer
  + DataProtection persisted to ConfigDb. Body unchanged because the 4-concern
  security model is unchanged in v2; full per-section rewrite waits for F15
  (Admin pages migration) since security.md references many pages that move.

- README.md: platform overview updated to v2 (fused Host + role gating).

2026-05-26 06:38:55 -04:00

5.1 KiB


Redundancy (v2)

Overview

OtOpcUa supports OPC UA non-transparent warm/hot redundancy. Two or more OtOpcUa.Host processes run side-by-side, share the same Config DB, and join the same Akka.NET cluster. Each process owns a distinct ApplicationUri; OPC UA clients see both endpoints via the standard ServerUriArray and pick one based on the ServiceLevel byte that each server publishes.

v2 change. The operator-managed ClusterNode.RedundancyRole column from v1 is gone, along with RedundancyCoordinator, ApplyLeaseRegistry, and PeerHttpProbeLoop. Primary/secondary is now derived from the Akka cluster role-leader for the driver role. The operator no longer writes a role into the DB; cluster topology and health drive ServiceLevel automatically.

The runtime pieces live in:

| Component | Project | Role |
| --- | --- | --- |
| `ServiceLevelCalculator` | `OtOpcUa.ControlPlane.Redundancy` | Pure function `(NodeHealthInputs) → byte`. No side effects. |
| `RedundancyStateActor` | `OtOpcUa.ControlPlane.Redundancy` | Admin-role cluster singleton; subscribes to cluster topology events, debounces 250ms, broadcasts `RedundancyStateChanged` on the `redundancy-state` DPS topic. |
| `DbHealthProbeActor` | `OtOpcUa.Runtime.Health` | Per-node; runs `SELECT 1` against ConfigDb every 5s. Read by health endpoint + redundancy calc. |
| `PeerOpcUaProbeActor` | `OtOpcUa.Runtime.Health` | Per-node; pings peer `opc.tcp://peer:4840` (real probe call is staged for follow-up F12). |
| `ClusterRoleInfo` | `OtOpcUa.Cluster` | Live view of cluster membership + role-leader; exposes `IClusterRoleInfo` to the rest of the host. |

ServiceLevel tiers (Part 5 §6.5)

ServiceLevelCalculator.Compute(NodeHealthInputs) returns a byte in 0..255 by tier:

| Tier | Byte | Condition |
| --- | --- | --- |
| Down | 0 | Member status is not `Up` or `Joining` (leaving, removed, exiting). |
| Critically degraded | 100 | ConfigDb unreachable AND data is stale. |
| Stale | 200 | Data stale but ConfigDb reachable. |
| Healthy follower | 240 | DB ok + OPC UA probe ok + not stale. |
| Healthy leader | 250 | Healthy + this node is the `driver` role-leader. |

Drivers write their computed byte into the OPC UA ServiceLevel Variable on each refresh. Clients that apply the standard redundancy heuristic ("pick the highest ServiceLevel") therefore prefer the role-leader and fall back to a follower when the leader degrades.
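The tier table can be read as a small decision function. The following is a minimal Python sketch of that logic, not the actual C# `ServiceLevelCalculator`: the field names on `NodeHealthInputs` are hypothetical, and the fallback for combinations the table does not list (fresh data with the DB or peer probe failing) is an assumption made only so the function is total.

```python
from dataclasses import dataclass

# Tier bytes taken from the ServiceLevel tier table.
DOWN = 0
CRITICALLY_DEGRADED = 100
STALE = 200
HEALTHY_FOLLOWER = 240
HEALTHY_LEADER = 250


@dataclass(frozen=True)
class NodeHealthInputs:
    # Field names are hypothetical; the real record may differ.
    member_status: str       # Akka member status, e.g. "Up", "Joining", "Leaving"
    db_reachable: bool       # result of the ConfigDb health probe
    opcua_probe_ok: bool     # result of the peer OPC UA probe
    data_stale: bool
    is_driver_leader: bool   # this node is the `driver` role-leader


def compute_service_level(h: NodeHealthInputs) -> int:
    """Pure function: map health inputs to a ServiceLevel byte."""
    if h.member_status not in ("Up", "Joining"):
        return DOWN
    if h.data_stale:
        return STALE if h.db_reachable else CRITICALLY_DEGRADED
    if h.db_reachable and h.opcua_probe_ok:
        return HEALTHY_LEADER if h.is_driver_leader else HEALTHY_FOLLOWER
    # ASSUMPTION: the table does not cover fresh data with a failing DB or
    # peer probe; this sketch maps that case to the Stale tier.
    return STALE
```

Because the leader tier (250) is strictly above the follower tier (240), a client that takes the maximum ServiceLevel across endpoints lands on the role-leader while it is healthy.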

Data flow

Cluster topology event ──┐
DB health probe ─────────┤
OPC UA peer probe ───────┤
                          ▼
              RedundancyStateActor (admin singleton)
                          │  debounce 250ms
                          ▼
              DPS topic "redundancy-state"
                          │
                          ▼
                Driver nodes' OpcUaPublishActor
                          │
                          ▼
              ServiceLevelCalculator → byte
                          │
                          ▼
              OPC UA ServiceLevel Variable

The admin singleton is the cluster's only RedundancyStateActor. If the admin leader fails over, the new admin node spins up its replacement, re-subscribes to cluster events, and publishes a fresh snapshot from the current Cluster.State. There is no DB-persisted state to recover.
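The 250ms debounce step in the flow above means a burst of topology or health events produces one broadcast rather than many. A minimal Python sketch of that coalescing follows. It assumes a trailing-edge debounce where each new event restarts the quiet window; the source states only the 250ms figure, so the exact timer semantics and the function name are assumptions.

```python
from typing import Iterable, List


def debounce_publish_times(event_times_ms: Iterable[int], window_ms: int = 250) -> List[int]:
    """Return the times (ms) at which a snapshot would be published.

    ASSUMPTION: trailing-edge debounce. Every event restarts the quiet
    window, and a publish fires once the window elapses with no new event.
    """
    publishes: List[int] = []
    deadline = None  # time at which the pending publish would fire
    for t in sorted(event_times_ms):
        if deadline is not None and t >= deadline:
            # The quiet window elapsed before this event arrived.
            publishes.append(deadline)
        deadline = t + window_ms
    if deadline is not None:
        publishes.append(deadline)  # flush the final pending publish
    return publishes
```

For example, events at 0, 100, and 200 ms collapse into a single publish at 450 ms, and a later event at 1000 ms yields a second publish at 1250 ms.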

Configuration

Per-node identity comes from appsettings.json + the OTOPCUA_ROLES env var:

{
  "Cluster": {
    "Hostname": "0.0.0.0",
    "Port": 4053,
    "PublicHostname": "node-a.lan",
    "SeedNodes": ["akka.tcp://otopcua@node-a.lan:4053"],
    "Roles": ["admin", "driver"]
  }
}

OTOPCUA_ROLES=admin,driver

Both nodes share the same ConfigDb connection string; Cluster.PublicHostname and Roles are what make them distinct in cluster gossip. The first node bootstraps the cluster (its address goes in SeedNodes); the second node joins via the same SeedNodes list.

There is no longer a Node:NodeId setting, no ClusterNode.RedundancyRole, no ServiceLevelBase. NodeId is derived as host:port of the cluster PublicHostname (see ClusterRoleInfo.LocalNode for the formula).
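As a rough illustration of the identity rules in this section, the Python sketch below derives a `host:port` NodeId from the `Cluster` settings and splits a comma-separated `OTOPCUA_ROLES` value. Both helper names are hypothetical, and the authoritative formula remains the one in `ClusterRoleInfo.LocalNode`; how the host reconciles the env var with `Cluster.Roles` is not covered here.

```python
import json
from typing import List


def derive_node_id(cluster_cfg: dict) -> str:
    # ASSUMPTION: NodeId is PublicHostname and Port joined by a colon.
    return f"{cluster_cfg['PublicHostname']}:{cluster_cfg['Port']}"


def parse_roles(raw: str) -> List[str]:
    # Split a comma-separated role list, ignoring blanks and stray spaces.
    return [role.strip() for role in raw.split(",") if role.strip()]


appsettings = json.loads("""
{
  "Cluster": {
    "Hostname": "0.0.0.0",
    "Port": 4053,
    "PublicHostname": "node-a.lan",
    "SeedNodes": ["akka.tcp://otopcua@node-a.lan:4053"],
    "Roles": ["admin", "driver"]
  }
}
""")

node_id = derive_node_id(appsettings["Cluster"])
roles = parse_roles("admin,driver")
```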

Split-brain

akka.conf configures Akka's split-brain resolver with active-strategy = keep-oldest, stable-after = 15s, and failure-detector.threshold = 10.0. Under a clean partition, the side that contains the oldest member stays up and the other side downs itself once the partition has been stable for about 15 seconds. The RedundancyStateActor on the surviving partition re-computes from the post-partition Cluster.State.
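The core of the keep-oldest decision can be sketched in a few lines of Python. This is a simplified model, not Akka's resolver: the `Member` type and its `up_number` field are hypothetical stand-ins for cluster member age, and resolver options such as down-if-alone are deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Member:
    address: str
    up_number: int  # hypothetical age marker: lower means joined earlier


def surviving_side(side_a: List[Member], side_b: List[Member]) -> List[Member]:
    """Return the partition that keeps running under keep-oldest.

    The side holding the oldest member survives regardless of size;
    the other side is expected to down itself.
    """
    oldest = min(side_a + side_b, key=lambda m: m.up_number)
    return side_a if oldest in side_a else side_b


side_a = [Member("node-a", 1), Member("node-b", 4)]
side_b = [Member("node-c", 2), Member("node-d", 3), Member("node-e", 5)]
survivors = surviving_side(side_a, side_b)
```

In this example `side_a` survives even though it is the smaller partition, because it contains the oldest member.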

There is no operator-driven role swap during a partition. Failover is what the cluster does automatically.

Client-side failover

The OtOpcUa Client CLI at src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI supports -F / --failover-urls for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See Client.CLI.md.
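The "reconnect to the next available server" behaviour can be pictured with the Python sketch below. It is not the CLI's code: `next_available` and the `try_connect` callback are hypothetical, the wrap-around search order is an assumption, and subscription re-creation is not modelled.

```python
from typing import Callable, Optional, Sequence


def next_available(
    urls: Sequence[str],
    current: str,
    try_connect: Callable[[str], bool],
) -> Optional[str]:
    """Pick the next endpoint after `current` that accepts a connection.

    ASSUMPTION: candidates are tried in list order, wrapping around, and
    the endpoint that just failed is skipped.
    """
    start = urls.index(current) + 1 if current in urls else 0
    for offset in range(len(urls)):
        candidate = urls[(start + offset) % len(urls)]
        if candidate != current and try_connect(candidate):
            return candidate
    return None  # nothing reachable; the caller decides whether to retry


# Hypothetical endpoints; only node-b is reachable after node-a drops.
endpoints = ["opc.tcp://node-a.lan:4840", "opc.tcp://node-b.lan:4840"]
reachable = {"opc.tcp://node-b.lan:4840"}
chosen = next_available(endpoints, endpoints[0], lambda url: url in reachable)
```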

Depth reference

For the full design — message contracts, tiered calculator truth table, recovery semantics — see docs/plans/2026-05-26-akka-hosting-alignment-design.md §6.
