Auto: opcuaclient-14 — ServerUriArray redundant failover

Closes #286
Joseph Doherty
2026-04-26 10:05:05 -04:00
parent 35d733d73b
commit 705c98ad98
11 changed files with 1088 additions and 2 deletions


@@ -258,3 +258,93 @@ namespace-0 NodeId, and the original 5 ordinals stay pinned. Wire-side
behaviour against a live server is exercised by
`OpcUaClientAggregateSweepTests` (build-only scaffold pending an opc-plc
history-sim profile).

## Upstream redundancy (`ServerUriArray`)

When the upstream OPC UA server is itself a redundant pair (warm or hot per
OPC UA Part 4 §6.6.2), the driver supports **mid-session failover** driven by
the upstream's own `Server.ServerRedundancy.RedundancySupport`,
`ServerUriArray`, and `Server.ServiceLevel` nodes. This is distinct from the
static boot-time failover sweep on `EndpointUrls`: that path picks a single
survivor at session-create time, whereas this path swaps the active session
live when the upstream signals degradation and transfers subscriptions onto
the secondary so monitored-item handles stay valid.

### Configuration

| Option | Default | Notes |
| --- | --- | --- |
| `Redundancy.Enabled` | `false` | Opt-in. When `false`, the driver doesn't read `RedundancySupport` / `ServerUriArray` and doesn't subscribe to `ServiceLevel`. |
| `Redundancy.ServiceLevelThreshold` | `200` | Byte value below which the driver triggers failover. OPC UA Part 4 `ServiceLevel` sub-ranges: 200..255 = healthy, 2..199 = degraded, 1 = no data, 0 = maintenance. |
| `Redundancy.RecheckInterval` | `5s` | Lower bound between two consecutive failovers — suppresses oscillation when ServiceLevel flaps around the threshold. |
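
The way the three options interact can be sketched in a few lines. This is a language-neutral illustration in Python, not the driver's code: `RedundancyOptions` and `should_attempt_failover` are hypothetical names, and measuring `RecheckInterval` from the previous failover is an assumption based on the table's wording.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RedundancyOptions:
    """Hypothetical mirror of the three documented options and their defaults."""
    enabled: bool = False
    service_level_threshold: int = 200
    recheck_interval: timedelta = timedelta(seconds=5)


def should_attempt_failover(opts, service_level, seconds_since_last_failover):
    """Decide whether a ServiceLevel sample should start a failover attempt.

    seconds_since_last_failover is None when no failover has happened yet.
    """
    if not opts.enabled:
        return False  # opt-in: nothing happens while redundancy is disabled
    if service_level >= opts.service_level_threshold:
        return False  # at or above the threshold counts as healthy
    if seconds_since_last_failover is None:
        return True
    # Assumed debounce: wait at least recheck_interval between failovers.
    return seconds_since_last_failover >= opts.recheck_interval.total_seconds()
```

With the defaults and `enabled=True`, a sample of 199 triggers an attempt, a sample of 200 does not, and a second drop 2 seconds after a failover is suppressed.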

### Behaviour

- At session activation the driver reads
`Server.ServerRedundancy.RedundancySupport`. When `None`, the driver records
an empty peer list and the failover path becomes a no-op (`ServiceLevel`
drops are still observable via diagnostics but trigger nothing).
- When the upstream advertises `Cold` / `Warm` / `WarmActive` / `Hot`, the
driver pulls `Server.ServerRedundancy.ServerUriArray` for the peer list,
falling back to the top-level `Server.ServerArray` for legacy upstreams that
don't expose the redundancy node.
- A dedicated subscription on `Server.ServiceLevel` (publish interval 1s,
separate from the alarm + data subscriptions) drives every failover decision
via the SDK's notification path — no polling loop.
- On a drop below `ServiceLevelThreshold` the driver picks the next URI in the
peer list that isn't the active one, opens a parallel session against it,
and calls `Session.TransferSubscriptionsAsync(other, sendInitialValues:true)`
to migrate every live subscription (data + alarm + model-change +
service-level itself). On success the driver swaps `Session`, closes the
old one, and bumps `RedundancyFailoverCount`.
- On any failure (`BadSecureChannelClosed`, `BadCertificateUntrusted`,
`TransferSubscriptions` returning `false`, secondary unreachable) the driver
leaves the existing session untouched, increments
`RedundancyFailoverFailures`, and waits for the next ServiceLevel
notification. The keep-alive watchdog continues to cover full
upstream-loss scenarios.
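
The peer-list resolution and the success/failure bookkeeping described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the driver's implementation: every function name is hypothetical, `transfer` stands in for "open a parallel session and transfer subscriptions", and wrapping around to the start of the peer list is an assumption the text does not confirm.

```python
def resolve_peers(redundancy_support, server_uri_array, server_array):
    """Build the peer list: empty for None, ServerUriArray first, ServerArray as fallback."""
    if redundancy_support == "None":
        return []
    if server_uri_array:
        return list(server_uri_array)
    return list(server_array or [])


def pick_next_peer(peers, active_uri):
    """Return the next URI after active_uri that differs from it, or None."""
    if not peers:
        return None
    start = peers.index(active_uri) + 1 if active_uri in peers else 0
    for offset in range(len(peers)):
        candidate = peers[(start + offset) % len(peers)]  # assumed wrap-around
        if candidate != active_uri:
            return candidate
    return None


def attempt_failover(state, peers, transfer):
    """Try one failover; `transfer(uri)` is a hypothetical callable returning bool.

    `state` is a dict with keys active_uri, failover_count, failover_failures.
    """
    target = pick_next_peer(peers, state["active_uri"])
    if target is None:
        return False  # empty or single-entry peer list: no-op, no counter change
    try:
        ok = transfer(target)
    except Exception:
        ok = False  # e.g. secondary unreachable or certificate rejected
    if ok:
        state["active_uri"] = target
        state["failover_count"] += 1
    else:
        state["failover_failures"] += 1  # existing session is left untouched
    return ok
```

Note that a failed attempt changes only the failure counter: `active_uri` keeps pointing at the session that was live before the attempt.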

### Shared client-cert prerequisite

`TransferSubscriptionsAsync` requires the secondary's secure channel to accept
the same client certificate the primary did. Operators running heterogeneous
secondaries (different cert trust stores) will see `BadCertificateUntrusted`
on every failover attempt and the failures counter climbing. The fix is to
push the gateway driver's application-instance certificate into both
upstreams' `TrustedPeerCertificates` store before enabling redundancy. A
planned follow-up will add a fallback path that re-creates subscriptions
instead of transferring them when the secondary rejects the channel.

### Diagnostics

The `driver-diagnostics` RPC surfaces three new values: two counters via
`DriverHealth.Diagnostics` and the URI of the active upstream:
| Key | Type | Notes |
| --- | --- | --- |
| `RedundancyFailoverCount` | `double` (long-counted) | Successful mid-session swaps since driver start. |
| `RedundancyFailoverFailures` | `double` (long-counted) | Swap attempts that bailed (TransferSubscriptions false, secondary unreachable, etc.). |
| `ActiveServerUri` | string (in `OpcUaClientDiagnostics.ActiveServerUri`) | URI of the upstream the driver is currently bound to. Updates on every successful failover. |

### Forced-failover runbook

To validate the wiring against a real redundant upstream pair:
1. Confirm the upstream advertises `RedundancySupport != None` and a
non-empty `ServerUriArray`. Use the Client CLI:
`dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- redundancy -u <primary>`.
2. Set `Redundancy.Enabled = true` on the gateway's `OpcUaClient` driver
instance and restart.
3. Tail driver diagnostics:
`driver-diagnostics --instance <id>` — note `RedundancyFailoverCount = 0`
pre-test.
4. Drive a `ServiceLevel` drop on the primary. On AVEVA / KEPServer this is
typically a "force standby" Admin action; on a custom server it's a write
to the simulated ServiceLevel node.
5. Observe `RedundancyFailoverCount = 1` within `RecheckInterval` of the
drop, the gateway's `HostName` swap to the secondary URI, and downstream
reads/subscriptions continuing without interruption.
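
The check in step 5 can be scripted once the diagnostics are available as key/value snapshots. The sketch below is a Python illustration only: `failover_observed` is a hypothetical helper, and how the snapshots are obtained (for example, by parsing `driver-diagnostics` output) is left as an assumption. The keys are the ones listed in the Diagnostics table.

```python
def failover_observed(before, after, secondary_uri):
    """Compare two diagnostics snapshots (dicts) taken before and after the forced drop.

    Returns True when exactly one successful failover was counted and the
    driver now reports the secondary as its active upstream.
    """
    count_delta = after["RedundancyFailoverCount"] - before["RedundancyFailoverCount"]
    return count_delta == 1 and after["ActiveServerUri"] == secondary_uri


# Example with made-up URIs; real values come from the driver's diagnostics.
before = {"RedundancyFailoverCount": 0.0, "ActiveServerUri": "urn:example:primary"}
after = {"RedundancyFailoverCount": 1.0, "ActiveServerUri": "urn:example:secondary"}
result = failover_observed(before, after, "urn:example:secondary")
```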

For non-redundant upstreams (single-server deployments) the recommended
configuration is to leave `Redundancy.Enabled = false` and rely on
`EndpointUrls` for boot-time failover only.