# Redundancy (v2)

## Overview

OtOpcUa supports OPC UA **non-transparent** warm/hot redundancy. Two or more `OtOpcUa.Host` processes run side-by-side, share the same Config DB, and join the same Akka.NET cluster. Each process owns a distinct `ApplicationUri`; OPC UA clients discover both endpoints by reading `Server.ServerArray` (NodeId `i=2254`) on either node and pick one based on the `ServiceLevel` byte that each server publishes.

> **Discovery surface.** The `ServerArray` path on the `Server` object is what each node populates with self + peer `ApplicationUri`s — see `OpcUaApplicationHost.PopulateServerArray` and the per-node `PeerApplicationUris` option below. The redundancy-object-type `ServerUriArray` proper (a child of `Server.ServerRedundancy`) remains deferred pending an SDK object-type upgrade; clients should read `Server.ServerArray` for peer discovery today.

> **v2 change.** v1's operator-managed `ClusterNode.RedundancyRole` column + `RedundancyCoordinator` / `ApplyLeaseRegistry` / `PeerHttpProbeLoop` are gone. Primary/secondary is now derived from **Akka cluster role-leader** for the `driver` role. The operator no longer writes a role into the DB; cluster topology + health drive ServiceLevel automatically.

The runtime pieces live in:

| Component | Project | Role |
|---|---|---|
| `ServiceLevelCalculator` | `OtOpcUa.ControlPlane.Redundancy` | Pure function `(NodeHealthInputs) → byte`. No side effects. |
| `RedundancyStateActor` | `OtOpcUa.ControlPlane.Redundancy` | Admin-role cluster singleton; subscribes to cluster topology events, debounces 250ms, broadcasts `RedundancyStateChanged` on the `redundancy-state` DPS topic. |
| `DbHealthProbeActor` | `OtOpcUa.Runtime.Health` | Per-node; runs `SELECT 1` against ConfigDb every 5s. Read by health endpoint + redundancy calc. |
| `PeerOpcUaProbeActor` | `OtOpcUa.Runtime.Health` | Per-node; pings peer `opc.tcp://peer:4840` (real probe call is staged for follow-up F12). |
| `ClusterRoleInfo` | `OtOpcUa.Cluster` | Live view of cluster membership + role-leader; exposes `IClusterRoleInfo` to the rest of the host. |

## ServiceLevel tiers (Part 5 §6.5)

`ServiceLevelCalculator.Compute(NodeHealthInputs)` returns a byte in 0..255 by tier:

| Tier | Byte | Condition |
|---|---|---|
| Down | 0 | Member status is not `Up` or `Joining` (leaving, removed, exiting). |
| Critically degraded | 100 | ConfigDb unreachable AND data is stale. |
| Stale | 200 | Data stale but ConfigDb reachable. |
| Healthy follower | 240 | DB ok + OPC UA probe ok + not stale. |
| Healthy leader | 250 | Healthy + this node is the `driver` role-leader. |

Driver nodes write their computed byte into the OPC UA `ServiceLevel` Variable on each refresh. Clients that use the standard redundancy heuristic ("pick the highest ServiceLevel") therefore prefer the role-leader and fall back to a follower when the leader degrades.
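
The tier table can be read as a small pure function. The sketch below is a language-neutral illustration in Python, not the C# implementation: the field names on `NodeHealthInputs` are hypothetical, and the table does not say what happens for combinations it does not list (for example fresh data with a failed peer probe), so the final fallback is a placeholder assumption. The design doc's truth table is authoritative.

```python
from dataclasses import dataclass

# Tier bytes copied from the table above.
DOWN = 0
CRITICALLY_DEGRADED = 100
STALE = 200
HEALTHY_FOLLOWER = 240
HEALTHY_LEADER = 250


@dataclass(frozen=True)
class NodeHealthInputs:
    """Hypothetical field names; the real NodeHealthInputs may differ."""
    member_status: str           # e.g. "Up", "Joining", "Leaving"
    db_reachable: bool           # result of the ConfigDb probe
    peer_probe_ok: bool          # result of the OPC UA peer probe
    data_stale: bool
    is_driver_role_leader: bool


def compute_service_level(inputs: NodeHealthInputs) -> int:
    """Map health inputs to a ServiceLevel byte, with no side effects."""
    if inputs.member_status not in ("Up", "Joining"):
        return DOWN
    if inputs.data_stale:
        return STALE if inputs.db_reachable else CRITICALLY_DEGRADED
    if inputs.db_reachable and inputs.peer_probe_ok:
        return HEALTHY_LEADER if inputs.is_driver_role_leader else HEALTHY_FOLLOWER
    # ASSUMPTION: combinations the table does not list are reported as the
    # Stale tier here, purely as a placeholder.
    return STALE


def pick_endpoint(service_levels: dict[str, int]) -> str:
    """Client heuristic: choose the endpoint with the highest ServiceLevel."""
    return max(service_levels, key=service_levels.get)
```

With a healthy leader at 250 and a healthy follower at 240, `pick_endpoint` returns the leader's URL; once the leader's data goes stale (200), the follower wins.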

## Data flow

```
Cluster topology event ──┐
DB health probe ─────────┤
OPC UA peer probe ───────┤
                         ▼
      RedundancyStateActor (admin singleton)
                         │ debounce 250ms
                         ▼
           DPS topic "redundancy-state"
                         │
                         ▼
          Driver nodes' OpcUaPublishActor
                         │
                         ▼
           ServiceLevelCalculator → byte
                         │
                         ▼
           OPC UA ServiceLevel Variable
```

The admin singleton is the cluster's only `RedundancyStateActor`. If the admin leader fails over, the new admin node spins up its replacement, re-subscribes to cluster events, and publishes a fresh snapshot from the current `Cluster.State`. There is no DB-persisted state to recover.
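
To illustrate what the 250 ms debounce buys, the sketch below coalesces a burst of topology events into a single publish. It is a minimal model in Python, not the actor's code. The source only states "debounce 250ms", so the trailing-edge semantics (publish once the input has been quiet for 250 ms) and the function name are assumptions.

```python
DEBOUNCE_MS = 250  # value taken from the diagram above


def debounce(event_times_ms: list[int], quiet_ms: int = DEBOUNCE_MS) -> list[int]:
    """Return the times (ms) at which a snapshot would be published.

    ASSUMPTION: trailing-edge debounce. One publish fires `quiet_ms`
    after the last event of each burst.
    """
    publishes: list[int] = []
    pending = None  # time of the most recent event not yet published
    for t in sorted(event_times_ms):
        if pending is not None and t - pending >= quiet_ms:
            # The quiet period elapsed before this event arrived.
            publishes.append(pending + quiet_ms)
        pending = t
    if pending is not None:
        publishes.append(pending + quiet_ms)
    return publishes
```

Three events at 0, 100, and 200 ms followed by one at 1000 ms produce two publishes under this model, at 450 ms and 1250 ms, rather than four.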

## Configuration

Per-node identity comes from `appsettings.json` + the `OTOPCUA_ROLES` env var:

```json
{
  "Cluster": {
    "Hostname": "0.0.0.0",
    "Port": 4053,
    "PublicHostname": "node-a.lan",
    "SeedNodes": ["akka.tcp://otopcua@node-a.lan:4053"],
    "Roles": ["admin", "driver"]
  }
}
```

```
OTOPCUA_ROLES=admin,driver
```

Both nodes share the same `ConfigDb` connection string; `Cluster.PublicHostname` + `Roles` are what make them distinct in cluster gossip. The first node bootstraps the cluster (its address goes in `SeedNodes`); the second node joins via the same `SeedNodes` list.

There is no longer a `Node:NodeId` setting, no `ClusterNode.RedundancyRole`, no `ServiceLevelBase`. NodeId is derived as `host:port` of the cluster `PublicHostname` (see `ClusterRoleInfo.LocalNode` for the formula).
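
As a rough illustration of the `host:port` derivation, the Python sketch below builds a node id from the `Cluster` settings shown earlier and splits an `OTOPCUA_ROLES` value. Both helper names are hypothetical, trimming whitespace around role names is an assumption, and how the env var combines with `Cluster.Roles` is not stated in this document, so the sketch does not model it. `ClusterRoleInfo.LocalNode` remains the reference for the real formula.

```python
import json

# Same values as the appsettings.json example above.
SETTINGS = json.loads("""
{
  "Cluster": {
    "Hostname": "0.0.0.0",
    "Port": 4053,
    "PublicHostname": "node-a.lan",
    "SeedNodes": ["akka.tcp://otopcua@node-a.lan:4053"],
    "Roles": ["admin", "driver"]
  }
}
""")


def local_node_id(cluster: dict) -> str:
    """Build `host:port` from the advertised PublicHostname, not the bind address."""
    return f"{cluster['PublicHostname']}:{cluster['Port']}"


def parse_roles(raw: str) -> list[str]:
    """Split a comma-separated role list; ASSUMPTION: whitespace is trimmed."""
    return [role.strip() for role in raw.split(",") if role.strip()]
```

For the settings above, `local_node_id(SETTINGS["Cluster"])` gives `node-a.lan:4053`, which is why the bind address `0.0.0.0` never shows up in the node id.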

### Peer URI advertising

Each node advertises its partner via `OpcUaApplicationHostOptions.PeerApplicationUris` (an `IList<string>`, default empty). `OpcUaApplicationHost.PopulateServerArray` appends each configured peer URI to the SDK's `IServerInternal.ServerUris` string table after server startup, so that `Server.ServerArray` reads served by `OnReadServerArray` return both self + peers. Set this per-node in `appsettings.json`:

```json
{
  "OpcUaServer": {
    "PeerApplicationUris": ["urn:node-b:OtOpcUa"]
  }
}
```

Node A lists Node B's `ApplicationUri` and vice-versa. This is validated by `DualEndpointTests` in `tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests/`, which boots two `OpcUaApplicationHost` instances on loopback and asserts that a real OPCFoundation client `Session` reading `Server.ServerArray` from Node A sees both URIs.

## Split-brain

`akka.conf` configures Akka's split-brain resolver with `active-strategy = keep-oldest`, `stable-after = 15s`, and `failure-detector.threshold = 10.0`. Under a clean partition, the side that contains the oldest member stays up and the other (younger) side downs itself about 15 seconds after membership stabilizes. The `RedundancyStateActor` on the surviving partition re-computes from the post-partition `Cluster.State`.

There is no operator-driven role swap during a partition. Failover is what the cluster does automatically.
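
The keep-oldest decision can be modelled in a few lines. The Python sketch below is a simplification for illustration only: the function name and the `join_order` input are hypothetical, and it ignores the resolver's `down-if-alone` option and the `stable-after` timing, both of which affect what a real cluster does.

```python
def surviving_side(sides: list[set[str]], join_order: dict[str, int]) -> set[str]:
    """Return the partition side that stays up under keep-oldest.

    `join_order` maps member -> join sequence (lower means older).
    Simplified model: every other side is assumed to down itself.
    """
    oldest = min(join_order, key=join_order.get)
    for side in sides:
        if oldest in side:
            return side
    raise ValueError("oldest member is not present in any side")
```

In this model, if `node-a` joined first and a partition separates `{node-a, node-c}` from `{node-b}`, the side holding `node-a` survives regardless of how the remaining members are distributed.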

## Client-side failover

The OtOpcUa Client CLI at `src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI` supports `-F` / `--failover-urls` for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See [`Client.CLI.md`](Client.CLI.md).
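
"Reconnects to the next available server" can be pictured as a rotation over the failover URL list. The Python sketch below is one plausible reading, not the CLI's code: the function name, the wrap-around order, and the injected `is_available` check are all assumptions, and the actual selection order is documented in `Client.CLI.md`.

```python
from typing import Callable, Optional


def next_endpoint(
    urls: list[str],
    current: str,
    is_available: Callable[[str], bool],
) -> Optional[str]:
    """Return the next reachable URL after `current`, wrapping around.

    ASSUMPTION: candidates are tried in list order starting after the
    current endpoint, and the current endpoint is retried last.
    """
    start = urls.index(current)
    for offset in range(1, len(urls) + 1):
        candidate = urls[(start + offset) % len(urls)]
        if is_available(candidate):
            return candidate
    return None  # nothing reachable; the caller decides when to retry
```

A caller would invoke this when KeepAlive reports a lost session, connect to the returned URL, and recreate its subscriptions there.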

## Depth reference

For the full design — message contracts, tiered calculator truth table, recovery semantics — see `docs/plans/2026-05-26-akka-hosting-alignment-design.md` §6.