docs(audit): Redundancy.md — accuracy pass
This commit is contained in:
+60
-17
@@ -6,21 +6,44 @@ OtOpcUa supports OPC UA **non-transparent** warm/hot redundancy. Two or more `Ot
|
||||
|
||||
> **Discovery surface.** The `ServerArray` path on the `Server` object is what each node populates with self + peer `ApplicationUri`s — see `OpcUaApplicationHost.PopulateServerArray` and the per-node `PeerApplicationUris` option below. The redundancy-object-type `ServerUriArray` proper (a child of `Server.ServerRedundancy`) remains deferred pending an SDK object-type upgrade; clients should read `Server.ServerArray` for peer discovery today.
|
||||
|
||||
> **v2 change.** v1's operator-managed `ClusterNode.RedundancyRole` column + `RedundancyCoordinator` / `ApplyLeaseRegistry` / `PeerHttpProbeLoop` are gone. Primary/secondary is now derived from **Akka cluster role-leader** for the `driver` role. The operator no longer writes a role into the DB; cluster topology + health drive ServiceLevel automatically.
|
||||
> **v2 change.** v1's operator-managed `ClusterNode.RedundancyRole` column + `RedundancyCoordinator` / `ApplyLeaseRegistry` / `PeerHttpProbeLoop` are gone. Primary/secondary is now derived from **Akka cluster role-leader** for the `driver` role. The operator no longer writes a role into the DB; cluster topology (specifically the `driver` role-leader) drives ServiceLevel automatically.
|
||||
|
||||
The runtime pieces live in:
|
||||
|
||||
| Component | Project | Role |
|
||||
|---|---|---|
|
||||
| `ServiceLevelCalculator` | `OtOpcUa.ControlPlane.Redundancy` | Pure function `(NodeHealthInputs) → byte`. No side effects. |
|
||||
| `RedundancyStateActor` | `OtOpcUa.ControlPlane.Redundancy` | Admin-role cluster singleton; subscribes to cluster topology events, debounces 250ms, broadcasts `RedundancyStateChanged` on the `redundancy-state` DPS topic. |
|
||||
| `DbHealthProbeActor` | `OtOpcUa.Runtime.Health` | Per-node; runs `SELECT 1` against ConfigDb every 5s. Read by health endpoint + redundancy calc. |
|
||||
| `PeerOpcUaProbeActor` | `OtOpcUa.Runtime.Health` | Per-node; pings peer `opc.tcp://peer:4840` (real probe call is staged for follow-up F12). |
|
||||
| `OpcUaPublishActor` | `OtOpcUa.Runtime.OpcUa` | Per-driver-node; subscribes to the `redundancy-state` topic, maps the local node's role to a ServiceLevel byte (see below), and forwards it to `IServiceLevelPublisher`. |
|
||||
| `IServiceLevelPublisher` / `SdkServiceLevelPublisher` | `OtOpcUa.Commons.OpcUa` / `OtOpcUa.OpcUaServer` | Writes the byte into the SDK's `Server.ServiceLevel` Variable. Production binds `DeferredServiceLevelPublisher`, which swaps in the real `SdkServiceLevelPublisher` once the SDK is up (it needs `IServerInternal`, available only after `StandardServer.Start`); until then writes route through `NullServiceLevelPublisher`. |
|
||||
| `ServiceLevelCalculator` | `OtOpcUa.ControlPlane.Redundancy` | Pure function `(NodeHealthInputs) → byte` — the fuller DB/probe-aware tiering (see truth table below). Covered by `ServiceLevelCalculatorTests`; **not yet wired into the live driver publish path**, which uses the coarse role mapping in `OpcUaPublishActor`. |
|
||||
| `DbHealthProbeActor` | `OtOpcUa.Runtime.Health` | Per-node; runs `SELECT 1` against ConfigDb every 5s. Read by health endpoint. |
|
||||
| `PeerOpcUaProbeActor` | `OtOpcUa.Runtime.Health` | Per-node; pings peer `opc.tcp://peer:4840` with a TCP connect (2s timeout) and publishes the result on the `redundancy-state` topic. A full secure-channel Hello handshake is a possible future upgrade; the TCP connect is the current real probe. |
|
||||
| `ClusterRoleInfo` | `OtOpcUa.Cluster` | Live view of cluster membership + role-leader; exposes `IClusterRoleInfo` to the rest of the host. |
|
||||
|
||||
## ServiceLevel tiers (Part 5 §6.5)
|
||||
## ServiceLevel tiers
|
||||
|
||||
`ServiceLevelCalculator.Compute(NodeHealthInputs)` returns a byte in 0..255 by tier:
|
||||
### Live driver-side mapping (current)
|
||||
|
||||
`OpcUaPublishActor.HandleRedundancyStateChanged` maps the local node's role
|
||||
(from the `RedundancyStateChanged` snapshot) to a ServiceLevel byte and forwards
|
||||
it through `IServiceLevelPublisher` to the SDK's `Server.ServiceLevel` Variable:
|
||||
|
||||
| Local role | Byte |
|
||||
|---|---|
|
||||
| `Primary` and `driver` role-leader | 240 |
|
||||
| `Primary` (not role-leader) | 200 |
|
||||
| `Secondary` | 100 |
|
||||
| `Detached` (no `driver` role) | 0 |
|
||||
|
||||
Roles come from `RedundancyStateActor.BuildSnapshot`: a node with the `driver`
|
||||
role is `Primary` when it holds the `driver` role-leader lease, otherwise
|
||||
`Secondary`; a node without the `driver` role is `Detached`.
|
||||
|
||||
### Full health-aware tiering (`ServiceLevelCalculator`)
|
||||
|
||||
`ServiceLevelCalculator.Compute(NodeHealthInputs)` is the fuller, DB/probe-aware
|
||||
calculation. It is unit-tested but **not yet on the live publish path** — the
|
||||
driver-side mapping above is what actually drives the SDK today.
|
||||
|
||||
| Tier | Byte | Condition |
|
||||
|---|---|---|
|
||||
@@ -28,16 +51,16 @@ The runtime pieces live in:
|
||||
| Critically degraded | 100 | ConfigDb unreachable AND data is stale. |
|
||||
| Stale | 200 | Data stale but ConfigDb reachable. |
|
||||
| Healthy follower | 240 | DB ok + OPC UA probe ok + not stale. |
|
||||
| Healthy leader | 250 | Healthy + this node is the `driver` role-leader. |
|
||||
| Healthy leader | 250 | Healthy follower (240) + a `+10` bonus when this node is the `driver` role-leader. |
|
||||
|
||||
Drivers write their computed byte into the OPC UA `ServiceLevel` Variable on each refresh. Clients with the standard redundancy heuristic ("pick the highest ServiceLevel") therefore prefer the role-leader and fall back to followers on its degradation.
|
||||
Either way, clients with the standard redundancy heuristic ("pick the highest
|
||||
ServiceLevel") prefer the `driver` role-leader and fall back to followers on its
|
||||
degradation.
|
||||
|
||||
## Data flow
|
||||
|
||||
```
|
||||
Cluster topology event ──┐
|
||||
DB health probe ─────────┤
|
||||
OPC UA peer probe ───────┤
|
||||
▼
|
||||
RedundancyStateActor (admin singleton)
|
||||
│ debounce 250ms
|
||||
@@ -46,14 +69,22 @@ OPC UA peer probe ───────┤
|
||||
│
|
||||
▼
|
||||
Driver nodes' OpcUaPublishActor
|
||||
│ role → byte (240/200/100/0)
|
||||
▼
|
||||
IServiceLevelPublisher (SdkServiceLevelPublisher)
|
||||
│
|
||||
▼
|
||||
ServiceLevelCalculator → byte
|
||||
│
|
||||
▼
|
||||
OPC UA ServiceLevel Variable
|
||||
OPC UA Server.ServiceLevel Variable
|
||||
```
|
||||
|
||||
Today only cluster topology drives the published ServiceLevel.
|
||||
`PeerOpcUaProbeActor` and `DbHealthProbeActor` also run per-node — the peer probe
|
||||
publishes `OpcUaProbeResult` onto the `redundancy-state` topic and the DB probe
|
||||
backs the health endpoint — but their outputs are not yet consumed by
|
||||
`RedundancyStateActor` or folded into the published byte. They are the inputs the
|
||||
fuller `ServiceLevelCalculator` truth table is designed to use once that path goes
|
||||
live.
|
||||
|
||||
The admin singleton is the cluster's only `RedundancyStateActor`. If the admin leader fails over, the new admin node spins up its replacement, re-subscribes to cluster events, and publishes a fresh snapshot from the current `Cluster.State`. There is no DB-persisted state to recover.
|
||||
|
||||
## Configuration
|
||||
@@ -78,15 +109,17 @@ OTOPCUA_ROLES=admin,driver
|
||||
|
||||
Both nodes share the same `ConfigDb` connection string; `Cluster.PublicHostname` + `Roles` are what makes them distinct in cluster gossip. The first node bootstraps the cluster (its address goes in `SeedNodes`); the second node joins via the same `SeedNodes` list.
|
||||
|
||||
There is no longer a `Node:NodeId` setting, no `ClusterNode.RedundancyRole`, no `ServiceLevelBase`. NodeId is derived as `host:port` of the cluster `PublicHostname` (see `ClusterRoleInfo.LocalNode` for the formula).
|
||||
There is no longer a `Node:NodeId` setting and no `ClusterNode.RedundancyRole` column (the V2 migration dropped it — primary/secondary is now derived from cluster role-leadership). NodeId is derived as `host:port` of the cluster `PublicHostname` (see `ClusterRoleInfo.LocalNode` for the formula).
|
||||
|
||||
The `ClusterNode.ServiceLevelBase` column still exists and is editable in the Admin UI (NodeEdit / Cluster Redundancy pages), but it no longer drives the runtime ServiceLevel — that value is computed from cluster role/health and published per the mapping above, independent of this stored preference.
|
||||
|
||||
### Peer URI advertising
|
||||
|
||||
Each node advertises its partner via `OpcUaApplicationHostOptions.PeerApplicationUris` (an `IList<string>`, default empty). `OpcUaApplicationHost.PopulateServerArray` appends each configured peer URI to the SDK's `IServerInternal.ServerUris` string table after server startup, so that `Server.ServerArray` reads served by `OnReadServerArray` return both self + peers. Set this per-node in `appsettings.json`:
|
||||
Each node advertises its partner via `OpcUaApplicationHostOptions.PeerApplicationUris` (an `IList<string>`, default empty). `OpcUaApplicationHost.PopulateServerArray` appends each configured peer URI to the SDK's `IServerInternal.ServerUris` string table after server startup, so that `Server.ServerArray` reads served by `OnReadServerArray` return both self + peers. The options bind from the `OpcUa` config section (see `Program.cs` — `AddValidatedOptions<OpcUaApplicationHostOptions>(…, "OpcUa")`). Set this per-node in `appsettings.json`:
|
||||
|
||||
```json
|
||||
{
|
||||
"OpcUaServer": {
|
||||
"OpcUa": {
|
||||
"PeerApplicationUris": ["urn:node-b:OtOpcUa"]
|
||||
}
|
||||
}
|
||||
@@ -104,6 +137,16 @@ There is no operator-driven role swap during a partition. Failover is what the c
|
||||
|
||||
The OtOpcUa Client CLI at `src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI` supports `-F` / `--failover-urls` for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See [`Client.CLI.md`](Client.CLI.md).
|
||||
|
||||
## Observability
|
||||
|
||||
`OpcUaPublishActor` emits one metric on every ServiceLevel transition (it suppresses no-op repeats of the same byte):
|
||||
|
||||
| Metric | Type | Notes |
|
||||
|---|---|---|
|
||||
| `otopcua.redundancy.service_level_change` | Counter (`{change}`) | OPC UA `Server.ServiceLevel` transitions emitted by the redundancy state. Tagged with `level` = the new byte. |
|
||||
|
||||
The meter is defined on `OtOpcUaTelemetry` (`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`); it surfaces through whatever OpenTelemetry exporter the host configures.
|
||||
|
||||
## Depth reference
|
||||
|
||||
For the full design — message contracts, tiered calculator truth table, recovery semantics — see `docs/plans/2026-05-26-akka-hosting-alignment-design.md` §6.
|
||||
|
||||
Reference in New Issue
Block a user