Auto: opcuaclient-14 — ServerUriArray redundant failover

Closes #286
2026-04-26 10:05:05 -04:00
parent 35d733d73b
commit 705c98ad98
11 changed files with 1088 additions and 2 deletions
--- a/docs/drivers/OpcUaClient.md
+++ b/docs/drivers/OpcUaClient.md
@@ -258,3 +258,93 @@ namespace-0 NodeId, and the original 5 ordinals stay pinned. Wire-side
 behaviour against a live server is exercised by
 `OpcUaClientAggregateSweepTests` (build-only scaffold pending an opc-plc
 history-sim profile).
+
+## Upstream redundancy (`ServerUriArray`)
+
+When the upstream OPC UA server is itself a redundant pair (warm or hot per
+OPC UA Part 4 §6.6.2), the driver supports **mid-session failover** driven by
+the upstream's own `Server.ServerRedundancy.RedundancySupport` +
+`ServerUriArray` + `Server.ServiceLevel` nodes. This is distinct from the
+static boot-time failover sweep on `EndpointUrls`: that path picks a single
+survivor at session-create time, whereas this path swaps the active session
+live when the upstream signals degradation, transferring subscriptions onto
+the secondary so monitored-item handles stay valid.
+
+### Configuration
+
+| Option | Default | Notes |
+| --- | --- | --- |
+| `Redundancy.Enabled` | `false` | Opt-in. When `false`, the driver doesn't read `RedundancySupport` / `ServerUriArray` and doesn't subscribe to `ServiceLevel`. |
+| `Redundancy.ServiceLevelThreshold` | `200` | Byte value below which the driver triggers failover. OPC UA Part 4 `ServiceLevel` sub-ranges: 200..255 = healthy, 2..199 = degraded, 1 = no data, 0 = maintenance. |
+| `Redundancy.RecheckInterval` | `5s` | Lower bound between two consecutive failovers — suppresses oscillation when ServiceLevel flaps around the threshold. |
+
+### Behaviour
+
+- At session activation the driver reads
+  `Server.ServerRedundancy.RedundancySupport`. When `None`, the driver records
+  an empty peer list and the failover path becomes a no-op (`ServiceLevel`
+  drops are still observable via diagnostics but trigger nothing).
+- When the upstream advertises `Cold` / `Warm` / `Hot` / `HotAndMirrored`, the
+  driver pulls `Server.ServerRedundancy.ServerUriArray` for the peer list,
+  falling back to the top-level `Server.ServerArray` for legacy upstreams that
+  don't expose the redundancy node.
+- A dedicated subscription on `Server.ServiceLevel` (publish interval 1s,
+  separate from the alarm + data subscriptions) drives every failover decision
+  via the SDK's notification path — no polling loop.
+- On a drop below `ServiceLevelThreshold` the driver picks the next URI in the
+  peer list that isn't the active one, opens a parallel session against it,
+  and calls `Session.TransferSubscriptionsAsync(other, sendInitialValues:true)`
+  to migrate every live subscription (data + alarm + model-change +
+  service-level itself). On success the driver swaps `Session`, closes the
+  old one, and bumps `RedundancyFailoverCount`.
+- On any failure (`BadSecureChannelClosed`, `BadCertificateUntrusted`,
+  `TransferSubscriptions` returning `false`, secondary unreachable) the driver
+  leaves the existing session untouched, increments
+  `RedundancyFailoverFailures`, and waits for the next ServiceLevel
+  notification. The keep-alive watchdog continues to cover full
+  upstream-loss scenarios.
+
+### Shared client-cert prerequisite
+
+`TransferSubscriptionsAsync` requires the secondary's secure channel to accept
+the same client certificate the primary did. Operators running heterogeneous
+secondaries (different cert trust stores) will see `BadCertificateUntrusted`
+on every failover attempt and `RedundancyFailoverFailures` climbing. The fix
+is to push the gateway driver's application-instance certificate into both
+upstreams' `TrustedPeerCertificates` store before enabling redundancy. A
+planned follow-up adds a fallback path that re-creates subscriptions instead
+of transferring them when the secondary rejects the channel.
+
+### Diagnostics
+
+The `driver-diagnostics` RPC surfaces three new values, two counters and the
+active server URI, via `DriverHealth.Diagnostics`:
+
+| Key | Type | Notes |
+| --- | --- | --- |
+| `RedundancyFailoverCount` | `double` (long-counted) | Successful mid-session swaps since driver start. |
+| `RedundancyFailoverFailures` | `double` (long-counted) | Swap attempts that bailed (TransferSubscriptions false, secondary unreachable, etc.). |
+| `ActiveServerUri` | `string` (in `OpcUaClientDiagnostics.ActiveServerUri`) | URI of the upstream the driver is currently bound to. Updates on every successful failover. |
+
+### Forced-failover runbook
+
+To validate the wiring against a real redundant upstream pair:
+
+1. Confirm the upstream advertises `RedundancySupport != None` and a
+   non-empty `ServerUriArray`. Use the Client CLI:
+   `dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- redundancy -u <primary>`.
+2. Set `Redundancy.Enabled = true` on the gateway's `OpcUaClient` driver
+   instance and restart.
+3. Tail driver diagnostics:
+   `driver-diagnostics --instance <id>` — note `RedundancyFailoverCount = 0`
+   pre-test.
+4. Drive a `ServiceLevel` drop on the primary. On AVEVA / KEPServer this is
+   typically a "force standby" Admin action; on a custom server it's a write
+   to the simulated ServiceLevel node.
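+5. Observe `RedundancyFailoverCount = 1` within a few seconds of the drop
+   (the `ServiceLevel` subscription publishes every 1s; `RecheckInterval`
+   only spaces out consecutive failovers), the gateway's `HostName` swap to
+   the secondary URI, and downstream reads/subscriptions continuing without
+   interruption.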
+
+For non-redundant upstreams (single-server deployments) the recommended
+configuration is to leave `Redundancy.Enabled = false` and rely on
+`EndpointUrls` for boot-time failover only.
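
As a reading aid for the Behaviour section in the diff above, the failover decision can be modelled as a small gate: a threshold check, a `RecheckInterval` debounce, and a "next peer that isn't the active one" pick. The sketch below is a language-neutral illustration in Python, not the driver's code. `FailoverGate`, its method names, and the `urn:` URIs are hypothetical, and it assumes that "next" means rotating through the peer list and that the debounce is measured from the last successful swap.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailoverGate:
    """Hypothetical model of the failover decision; not the driver's class."""

    peers: List[str]               # from ServerUriArray (empty when RedundancySupport is None)
    active: str                    # URI of the upstream currently in use
    threshold: int = 200           # Redundancy.ServiceLevelThreshold default
    recheck_interval: float = 5.0  # Redundancy.RecheckInterval default, in seconds
    last_failover: Optional[float] = None

    def next_peer(self) -> Optional[str]:
        """Return the next peer after the active one (wrapping), or None."""
        if self.active in self.peers:
            start = self.peers.index(self.active) + 1
            ordered = self.peers[start:] + self.peers[:start]
        else:
            ordered = list(self.peers)
        candidates = [p for p in ordered if p != self.active]
        return candidates[0] if candidates else None

    def on_service_level(self, level: int, now: float) -> Optional[str]:
        """Return the URI to fail over to, or None to keep the current session."""
        if level >= self.threshold:
            return None  # healthy enough, nothing to do
        if (self.last_failover is not None
                and now - self.last_failover < self.recheck_interval):
            return None  # assumed debounce: too soon after the last swap
        return self.next_peer()

    def commit(self, target: str, now: float) -> None:
        """Record a swap; call only after the subscription transfer succeeded."""
        self.active = target
        self.last_failover = now


gate = FailoverGate(peers=["urn:primary", "urn:secondary"], active="urn:primary")
first = gate.on_service_level(150, now=100.0)    # below threshold: pick a peer
gate.commit(first, now=100.0)
flap = gate.on_service_level(150, now=102.0)     # inside the 5s window: suppressed
later = gate.on_service_level(150, now=106.0)    # window elapsed: fail back
```

Keeping `commit` separate from `on_service_level` mirrors the documented failure path: if the transfer fails, nothing is committed, the active session stays untouched, and the next `ServiceLevel` notification re-evaluates the gate.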