namespace-0 NodeId, and the original 5 ordinals stay pinned. Wire-side
behaviour against a live server is exercised by
`OpcUaClientAggregateSweepTests` (build-only scaffold pending an opc-plc
history-sim profile).

## Upstream redundancy (`ServerArray`)

When the upstream OPC UA server is itself a redundant pair (warm or hot per
OPC UA Part 4 §6.6.2), the driver supports **mid-session failover** driven by
the upstream's own `Server.ServerRedundancy.RedundancySupport`,
`ServerUriArray`, and `Server.ServiceLevel` nodes. This is distinct from the
static boot-time failover sweep on `EndpointUrls`: that path picks a single
survivor at session-create time, whereas this path swaps the active session
live when the upstream signals degradation, transferring subscriptions onto
the secondary so monitored-item handles stay valid.

### Configuration

| Option | Default | Notes |
| --- | --- | --- |
| `Redundancy.Enabled` | `false` | Opt-in. When `false`, the driver doesn't read `RedundancySupport` / `ServerUriArray` and doesn't subscribe to `ServiceLevel`. |
| `Redundancy.ServiceLevelThreshold` | `200` | Byte value below which the driver triggers failover. Per the OPC UA convention, 200..255 indicates a healthy server; values below 200 indicate a degraded or non-operational one. |
| `Redundancy.RecheckInterval` | `5s` | Lower bound between two consecutive failovers; suppresses oscillation when `ServiceLevel` flaps around the threshold. |
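
As a hedged sketch, the table above maps onto a driver-configuration fragment
like the following; the surrounding `OpcUaClient` section name and the
`TimeSpan`-style interval format are assumptions, not the driver's confirmed
schema:

```json
{
  "OpcUaClient": {
    "Redundancy": {
      "Enabled": true,
      "ServiceLevelThreshold": 200,
      "RecheckInterval": "00:00:05"
    }
  }
}
```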

### Behaviour

- At session activation the driver reads
  `Server.ServerRedundancy.RedundancySupport`. When it is `None`, the driver
  records an empty peer list and the failover path becomes a no-op
  (`ServiceLevel` drops are still observable via diagnostics but trigger
  nothing).
- When the upstream advertises `Cold` / `Warm` / `WarmActive` / `Hot`, the
  driver pulls `Server.ServerRedundancy.ServerUriArray` for the peer list,
  falling back to the top-level `Server.ServerArray` for legacy upstreams that
  don't expose the redundancy node.
- A dedicated subscription on `Server.ServiceLevel` (publish interval 1s,
  separate from the alarm and data subscriptions) drives every failover
  decision via the SDK's notification path; there is no polling loop.
- On a drop below `ServiceLevelThreshold` the driver picks the next URI in the
  peer list that isn't the active one, opens a parallel session against it,
  and calls `Session.TransferSubscriptionsAsync(other, sendInitialValues:true)`
  to migrate every live subscription (data, alarm, model-change, and the
  service-level subscription itself). On success the driver swaps `Session`,
  closes the old one, and bumps `RedundancyFailoverCount`.
- On any failure (`BadSecureChannelClosed`, `BadCertificateUntrusted`,
  `TransferSubscriptions` returning `false`, secondary unreachable) the driver
  leaves the existing session untouched, increments
  `RedundancyFailoverFailures`, and waits for the next `ServiceLevel`
  notification. The keep-alive watchdog continues to cover full
  upstream-loss scenarios.
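
The decision rules above (threshold check, `RecheckInterval` debounce,
next-peer selection) can be sketched as pure logic. `FailoverPlanner` and its
signatures are illustrative stand-ins, not the driver's actual types:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch of the failover decision described above; names are
// hypothetical, not the driver's actual types.
static class FailoverPlanner
{
    // A ServiceLevel notification triggers a failover only when the level is
    // below the threshold AND the RecheckInterval debounce has elapsed.
    public static bool ShouldFailover(byte serviceLevel, byte threshold,
        DateTime lastFailoverUtc, DateTime nowUtc, TimeSpan recheckInterval)
        => serviceLevel < threshold && nowUtc - lastFailoverUtc >= recheckInterval;

    // Pick the next URI in the peer list that isn't the active one, scanning
    // forward from the active entry and wrapping around.
    public static string NextPeer(IReadOnlyList<string> peers, string activeUri)
    {
        if (peers.Count == 0) return null;          // RedundancySupport == None
        int i = peers.ToList().IndexOf(activeUri);  // -1 when active isn't listed
        for (int step = 1; step <= peers.Count; step++)
        {
            string candidate = peers[(i + step + peers.Count) % peers.Count];
            if (candidate != activeUri) return candidate;
        }
        return null;                                // only the active URI is listed
    }
}
```

A wrap-around scan like this keeps a longer `ServerUriArray` rotating through
all peers instead of ping-ponging between the first two entries.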

### Shared client-cert prerequisite

`TransferSubscriptionsAsync` requires the secondary's secure channel to accept
the same client certificate the primary did. Operators running heterogeneous
secondaries (different certificate trust stores) will see
`BadCertificateUntrusted` on every failover attempt and the failures counter
climbing. The fix is to push the gateway driver's application-instance
certificate into both upstreams' `TrustedPeerCertificates` store before
enabling redundancy. A follow-up adds a fallback path that re-creates
subscriptions instead of transferring them when the secondary rejects the
channel.
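
One way to stage that prerequisite is to copy the driver's public
application-instance certificate (the `.der`) into each upstream's
trusted-peer directory before flipping `Redundancy.Enabled`. A minimal
sketch, assuming plain directory-backed stores; every path in the usage
comment is a hypothetical placeholder:

```csharp
using System.IO;

// Sketch: push the gateway's public certificate into both upstreams'
// TrustedPeerCertificates directories. Store locations vary per upstream.
static class CertTrust
{
    // Copies the cert into each trusted-peer directory and returns the
    // destination paths.
    public static string[] Push(string certPath, params string[] trustDirs)
    {
        var targets = new string[trustDirs.Length];
        for (int i = 0; i < trustDirs.Length; i++)
        {
            Directory.CreateDirectory(trustDirs[i]);
            targets[i] = Path.Combine(trustDirs[i], Path.GetFileName(certPath));
            File.Copy(certPath, targets[i], overwrite: true);
        }
        return targets;
    }
}

// Example (hypothetical paths):
// CertTrust.Push("pki/own/certs/gateway-driver.der",
//     "/opt/upstream-a/pki/trusted/certs",
//     "/opt/upstream-b/pki/trusted/certs");
```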

### Diagnostics

The `driver-diagnostics` RPC surfaces three new counters via
`DriverHealth.Diagnostics`:

| Key | Type | Notes |
| --- | --- | --- |
| `RedundancyFailoverCount` | `double` (long-counted) | Successful mid-session swaps since driver start. |
| `RedundancyFailoverFailures` | `double` (long-counted) | Swap attempts that bailed (`TransferSubscriptions` false, secondary unreachable, etc.). |
| `ActiveServerUri` | string (in `OpcUaClientDiagnostics.ActiveServerUri`) | URI of the upstream the driver is currently bound to. Updates on every successful failover. |
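
A consumer of these counters might watch for the failure pattern called out in
the previous section: failures climbing while successes stall. The dictionary
shape is taken from the table above; the alert rule itself is an illustrative
assumption, not driver behaviour:

```csharp
using System.Collections.Generic;

// Illustrative check over the counters surfaced by driver-diagnostics.
static class RedundancyHealth
{
    // Flag when failover attempts keep failing: the failure counter grew
    // between two samples while the success counter did not.
    public static bool FailuresClimbing(
        IReadOnlyDictionary<string, double> prev,
        IReadOnlyDictionary<string, double> curr)
        => curr["RedundancyFailoverFailures"] > prev["RedundancyFailoverFailures"]
        && curr["RedundancyFailoverCount"] == prev["RedundancyFailoverCount"];
}
```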

### Forced-failover runbook

To validate the wiring against a real redundant upstream pair:

1. Confirm the upstream advertises `RedundancySupport != None` and a
   non-empty `ServerUriArray`. Use the Client CLI:
   `dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- redundancy -u <primary>`.
2. Set `Redundancy.Enabled = true` on the gateway's `OpcUaClient` driver
   instance and restart.
3. Tail driver diagnostics with `driver-diagnostics --instance <id>` and note
   `RedundancyFailoverCount = 0` pre-test.
4. Drive a `ServiceLevel` drop on the primary. On AVEVA / KEPServer this is
   typically a "force standby" admin action; on a custom server it's a write
   to the simulated `ServiceLevel` node.
5. Observe `RedundancyFailoverCount = 1` within `RecheckInterval` of the
   drop, the gateway's `HostName` swapping to the secondary URI, and
   downstream reads/subscriptions continuing without interruption.
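
The outcome expected in steps 3 and 5 can be dry-run as a tiny simulation;
every name below is an illustrative stand-in for the driver's internals, and
the `RecheckInterval` debounce is deliberately omitted because the script
contains a single drop:

```csharp
using System.Collections.Generic;
using System.Linq;

// Dry-run of the runbook's expected outcome: one ServiceLevel drop below the
// threshold yields exactly one failover and an ActiveServerUri swap.
static class RunbookDryRun
{
    public static (int FailoverCount, string ActiveUri) Simulate(
        IEnumerable<byte> serviceLevels, IReadOnlyList<string> peers,
        byte threshold = 200)
    {
        string active = peers[0];
        int failovers = 0;
        foreach (byte level in serviceLevels)
        {
            if (level >= threshold) continue;          // healthy: nothing to do
            string next = peers.FirstOrDefault(p => p != active);
            if (next == null) continue;                // no usable peer
            active = next;                             // swap sessions
            failovers++;                               // bump RedundancyFailoverCount
        }
        return (failovers, active);
    }
}
```

With scripted levels `255, 255, 120` against a two-entry peer list, the
simulation reports one failover and the secondary as the active URI, matching
the counters the runbook expects.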

For non-redundant upstreams (single-server deployments) the recommended
configuration is to leave `Redundancy.Enabled = false` and rely on
`EndpointUrls` for boot-time failover only.