# Redundancy Interop Playbook (Phase 6.3 Stream F, task #150)

> **Scope**: manual validation that third-party OPC UA clients + AVEVA MXAccess
> observe our non-transparent redundancy signals (ServiceLevel, ServerUriArray,
> RedundancySupport) and fail over to the Backup node when the Primary drops.
>
> **Why manual**: the third-party clients named here are Windows-GUI binaries
> (UaExpert, Kepware QuickClient) or embedded inside AVEVA System Platform.
> Automating any of them into PR-CI is out of scope for v2. This playbook
> captures the minimal dev-box-plus-VM setup and the expected pass criteria so
> the work can be executed repeatably at v2 release readiness and after any
> Phase 6.3 follow-up change.

## Prerequisites

1. Two `OtOpcUa.Server` nodes in one `ServerCluster`:
   - Declared as `NodeCount = 2`, `RedundancyMode = Hot` (or `Warm`).
   - Each with a distinct `ApplicationUri` (enforced by unique index per decision #86).
   - Each node's `StaticRoutes.xml` points at the other (`ServerCluster.Node[].Host`).
2. `scripts/install/Install-Services.ps1` applied on each node so the
   `RedundancyPublisherHostedService` is running.
3. At least one `DriverInstance` with a reachable simulator or PLC so both
   servers have a non-empty address space to browse.
4. On the client host:
   - `UaExpert` ≥ 1.7 installed
   - Kepware `ClientAce QuickClient` (or equivalent); optional, for a second client
5. For the AVEVA leg: a `Galaxy.Host` running against an MXAccess deployment
   with an external OPC UA client object pointed at the cluster (not at a
   single node).
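
The cluster constraints in prerequisite 1 can be checked on paper before a run. The following is a minimal sketch in Python, not part of the OtOpcUa code base: `NodeDecl`, `check_cluster`, the field names, and the host and URI values are all hypothetical stand-ins for whatever the real `ServerCluster` declaration and `StaticRoutes.xml` contain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeDecl:
    """Hypothetical stand-in for one node of a ServerCluster declaration."""
    application_uri: str
    host: str
    route_target: str  # host that this node's static route points at (assumed field)


def check_cluster(nodes, redundancy_mode):
    """Return a list of problems; an empty list means prerequisite 1 is met."""
    problems = []
    if len(nodes) != 2:
        problems.append(f"expected 2 nodes, found {len(nodes)}")
    if redundancy_mode not in ("Hot", "Warm"):
        problems.append(f"RedundancyMode must be Hot or Warm, got {redundancy_mode!r}")
    uris = [n.application_uri for n in nodes]
    if len(set(uris)) != len(uris):
        problems.append("ApplicationUri values are not distinct")
    for n in nodes:
        # Each node's route must point at a node other than itself.
        other_hosts = {m.host for m in nodes if m is not n}
        if n.route_target not in other_hosts:
            problems.append(f"{n.host}: route does not point at the other node")
    return problems


if __name__ == "__main__":
    cluster = [
        NodeDecl("urn:node-a:OtOpcUa", "node-a", "node-b"),
        NodeDecl("urn:node-b:OtOpcUa", "node-b", "node-a"),
    ]
    print(check_cluster(cluster, "Hot") or "prerequisite 1 looks satisfied")
```

Returning a list of problems rather than raising on the first one lets a single pass report everything that needs fixing before the VMs are brought up.
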
## Expected signals on a running cluster

| Node | `ServiceLevel` | `RedundancySupport` | `ServerUriArray` |
|---|---|---|---|
| Primary, healthy, peer reachable | 200 | `Hot` (or `Warm`) | self + peer |
| Primary, mid-apply | 75 (`PrimaryMidApply`) | same | same |
| Primary, peer unreachable | 150 (`PrimaryPeerDown`) | same | same |
| Backup, healthy | 100 (`Secondary`) | same | same |
| Either, dwelling in recovery | 50 (`Recovering`) | same | same |
| Either, invariant violation (two Primary, disabled-node mismatch) | 2 (`InvalidTopology`) | same | same |

(The band constants live in `ServiceLevelCalculator.Classify`.)

## Test matrix

Each row is one manual run; pass criterion in the right column.

### Block A: UA protocol signals (UaExpert)

| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| A1 | ServiceLevel published | Connect UaExpert to Primary. Browse to `Server.ServerStatus.ServiceLevel`. | Value = 200 (or the expected Band byte per table above) |
| A2 | ServiceLevel updates on peer down | Connect to Primary. Stop Backup (`sc stop OtOpcUa`). Watch `ServiceLevel`. | Transitions 200 → 150 within ~2 s of peer probe timeout |
| A3 | RedundancySupport | Browse to `Server.ServerRedundancy.RedundancySupport`. | Value matches the declared `RedundancyMode` (Warm / Hot / None) |
| A4 | ServerUriArray (non-transparent upgrade) | Requires a redundancy-object-type upgrade follow-up. | When upgrade lands: `ServerUriArray` reports both ApplicationUris, self first |
| A5 | Mid-apply dip | On Primary trigger a `sp_PublishGeneration` apply. | `ServiceLevel` drops to 75 for the apply duration + dwell |

### Block B: Client failover

| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| B1 | UaExpert picks Primary by ServiceLevel | In UaExpert configure a Redundancy Group with both endpoint URLs. | Client picks the Primary URL (higher ServiceLevel) |
| B2 | UaExpert cuts over on Primary kill | Kill the Primary's `OtOpcUa` service. | Client session reconnects to Backup within UaExpert's reconnect timeout (default 5 s). Data-change monitored items resume. |
| B3 | UaExpert cuts back when Primary returns | Start the Primary service. Wait ≥ recovery dwell (see `RecoveryStateManager.DwellTime`). | `ServiceLevel` on returning Primary goes through 50 (Recovering) → 200; UaExpert may or may not switch back (client-policy dependent; both are accepted outcomes) |
| B4 | Kepware QuickClient failover | Repeat B1–B3 with Kepware in place of UaExpert. | Same pass criteria; establishes we're not UaExpert-specific |

### Block C: Galaxy MXAccess failover

This block validates that an AVEVA System Platform app consuming our cluster
via MXAccess tolerates a Primary drop the same way a native OPC UA client does.
The MXAccess toolkit internally wraps the OPC UA Client and does its own
redundancy negotiation; we're asserting that negotiation honors our
`ServiceLevel` signal.

| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| C1 | Galaxy binds to Primary on first connect | Bring the cluster up. Start a Galaxy `$MxAccessClient` object pointed at the cluster with both node URLs. | Galaxy reports `QUALITY = Good` + initial values from the Primary |
| C2 | Galaxy redirects on Primary drop | Stop the Primary. | Galaxy's `QUALITY` briefly goes `Uncertain`, then back to `Good`; values continue streaming from the Backup within MXAccess's `ReconnectInterval` (default 20 s) |
| C3 | Galaxy handles mid-apply dip | Trigger a generation apply on the Primary. | Galaxy continues reading; the mid-apply dip is advisory (ServiceLevel 75), not a session drop, so MXAccess should stay bound |

## Recording results

Copy the tables above into a tracking doc per run. The tracking doc shape:

```
Run date: 2026-MM-DD
Cluster:
Primary:
Backup:
Release:
A1: PASS  evidence: UaExpert screenshot uaexpert-a1.png
A2: PASS  evidence: ServiceLevel trend grafana-a2.png
…
```

One pass of every row is the acceptance criterion.
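
Checking a filled-in tracking doc against that criterion is mechanical, so it can be sketched as a small script. This is an illustration only: `summarize_run`, `EXPECTED_ROWS`, and the `deferred` parameter are hypothetical names, and any status word other than `PASS` (for example `FAIL`) is an assumption, since the template above only shows `PASS`.

```python
import re

# Row ids taken from the test matrix: A1-A5, B1-B4, C1-C3.
EXPECTED_ROWS = (
    [f"A{i}" for i in range(1, 6)]
    + [f"B{i}" for i in range(1, 5)]
    + [f"C{i}" for i in range(1, 4)]
)

# Matches result lines such as "A1: PASS  evidence: ...".
# Header lines like "Backup:" do not match because a digit must follow the letter.
ROW_RE = re.compile(r"^\s*([ABC]\d+):\s*(\S+)")


def summarize_run(text, deferred=frozenset()):
    """Report which matrix rows are missing or not passed in a tracking doc."""
    results = {}
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match:
            results[match.group(1)] = match.group(2).upper()
    required = [row for row in EXPECTED_ROWS if row not in deferred]
    missing = [row for row in required if row not in results]
    not_passed = [row for row in required if row in results and results[row] != "PASS"]
    return {
        "missing": missing,
        "not_passed": not_passed,
        "accepted": not missing and not not_passed,
    }


if __name__ == "__main__":
    sample = "Run date: 2026-MM-DD\nA1: PASS  evidence: uaexpert-a1.png\n"
    print(summarize_run(sample, deferred={"A4"}))
```

Passing `deferred={"A4"}` keeps a run from being rejected solely because A4 cannot pass yet; drop the argument once A4 is expected to pass.
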
Re-run after any Phase 6.3 follow-up ships (especially the non-transparent
redundancy-type upgrade, which flips A4 from "deferred" to "expected pass").

## Known limitations

- **A4 pending**: `Server.ServerRedundancy` on our current SDK build lands as
  the base `ServerRedundancyState`, which has no `ServerUriArray` child.
  `ServerRedundancyNodeWriter.ApplyServerUriArray` logs-and-skips until the
  redundancy-object-type upgrade follow-up lands.
- **Recovery dwell default**: `RecoveryStateManager.DwellTime` defaults to
  60 s in `Program.cs`. Adjust via future config knob if B3 takes too long to
  observe.
- **C-block external dependency**: The `Galaxy.Host` side of the redundancy
  story is largely outside our code: it is MXAccess's own client-redundancy
  policy talking to our published ServiceLevel. A negative result on C1–C3
  does not necessarily indicate an OtOpcUa bug; cross-check with UaExpert
  (Block A / B) first.

## Automation notes (why this is a playbook, not a test)

- UaExpert and Kepware binaries are closed-source Windows GUIs; they don't
  ship headless CLIs for the browse/connect/subscribe flows.
- The OPC Foundation reference SDK *can* drive every scenario, but our own
  `Driver.OpcUaClient` tests already cover that client's behaviour; Block B
  adds value specifically because these two clients have independent
  redundancy implementations we don't control.
- For the subset of scenarios that *can* be automated (the self-loopback case
  where our own `otopcua-cli` drives Primary + Backup), the existing
  `tests/ZB.MOM.WW.OtOpcUa.Server.Tests/RedundancyStatePublisherTests` +
  `ServiceLevelCalculatorTests` (unit) + `ClusterTopologyLoaderTests`
  (integration) already cover the math + data path. The wire-level assertion
  that the values actually land on the right OPC UA nodes is covered by
  `ServerRedundancyNodeWriterTests`.
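
When scoring Block A and B1 by hand, it can help to have the expected-signals table as a lookup next to the "highest ServiceLevel wins" rule that B1 expects of a client. The sketch below is illustrative Python and does not reproduce `ServiceLevelCalculator.Classify`; `BANDS`, `describe`, `pick_endpoint`, and the endpoint URLs are hypothetical, and only the byte values and band labels come from the table.

```python
# Byte values and labels copied from the "Expected signals" table.
# The table gives no band name for 200, so a descriptive label is used.
BANDS = {
    200: "Primary, healthy, peer reachable",
    150: "PrimaryPeerDown",
    100: "Secondary",
    75: "PrimaryMidApply",
    50: "Recovering",
    2: "InvalidTopology",
}


def describe(service_level):
    """Map an observed ServiceLevel byte to its band label, if it is a known band."""
    return BANDS.get(service_level, "not a band from the table")


def pick_endpoint(levels):
    """Return the endpoint URL with the highest observed ServiceLevel.

    `levels` maps endpoint URL -> ServiceLevel byte read from that server.
    """
    return max(levels, key=levels.get)


if __name__ == "__main__":
    observed = {
        "opc.tcp://node-a:4840": 200,  # hypothetical Primary endpoint
        "opc.tcp://node-b:4840": 100,  # hypothetical Backup endpoint
    }
    for url, level in observed.items():
        print(url, level, describe(level))
    print("expected B1 choice:", pick_endpoint(observed))
```

Under this rule a mid-apply Primary at 75 ranks below a healthy Backup at 100. Whether a connected client acts on that is client policy, which is why C3 only requires that the session stays up.
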