# Redundancy Interop Playbook (Phase 6.3 Stream F — task #150)

> **Scope**: manual validation that third-party OPC UA clients + AVEVA MXAccess
> observe our non-transparent redundancy signals (ServiceLevel, ServerUriArray,
> RedundancySupport) and fail over to the Backup node when the Primary drops.
>
> **Why manual**: the third-party clients named here are Windows-GUI binaries
> (UaExpert, Kepware QuickClient) or embedded inside AVEVA System Platform.
> Automating any of them into PR-CI is out of scope for v2. This playbook
> captures the minimal dev-box-plus-VM setup and the expected pass criteria so
> the work can be executed repeatably at v2 release readiness and after any
> Phase 6.3 follow-up change.

## Prerequisites

1. Two `OtOpcUa.Server` nodes in one `ServerCluster`:
   - Declared as `NodeCount = 2`, `RedundancyMode = Hot` (or `Warm`).
   - Each with a distinct `ApplicationUri` (enforced by unique index per
     decision #86).
   - Each node's `StaticRoutes.xml` points at the other (`ServerCluster.Node[].Host`).
2. `scripts/install/Install-Services.ps1` applied on each node so the
   `RedundancyPublisherHostedService` is running.
3. At least one `DriverInstance` with a reachable simulator or PLC so both
   servers have a non-empty address space to browse.
4. On the client host:
   - `UaExpert` ≥ 1.7 installed
   - Kepware `ClientAce QuickClient` (or equivalent) — optional, for a second
     client
5. For the AVEVA leg: a `Galaxy.Host` running against an MXAccess deployment
   with an external OPC UA client object pointed at the cluster (not at a
   single node).

## Expected signals on a running cluster

| Node | `ServiceLevel` | `RedundancySupport` | `ServerUriArray` |
|---|---|---|---|
| Primary, healthy, peer reachable | 200 | `Hot` (or `Warm`) | self + peer |
| Primary, mid-apply | 75 (`PrimaryMidApply`) | same | same |
| Primary, peer UNreachable | 150 (`PrimaryPeerDown`) | same | same |
| Backup, healthy | 100 (`Secondary`) | same | same |
| Either, dwelling in recovery | 50 (`Recovering`) | same | same |
| Either, invariant violation (two Primary, disabled-node mismatch) | 2 (`InvalidTopology`) | same | same |

(The band constants live in `ServiceLevelCalculator.Classify`.)

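For cross-checking an observed byte against the table during a run, the bands can be sketched as a small lookup. This is a minimal illustration, not the contents of `ServiceLevelCalculator.Classify`: the numeric values are copied from the table, but the member name `PRIMARY_HEALTHY`, the function signature, and the precedence order of the checks are all assumptions.

```python
from enum import IntEnum


class Band(IntEnum):
    """ServiceLevel bytes copied from the table above."""
    INVALID_TOPOLOGY = 2
    RECOVERING = 50
    PRIMARY_MID_APPLY = 75
    SECONDARY = 100
    PRIMARY_PEER_DOWN = 150
    PRIMARY_HEALTHY = 200  # hypothetical name; the table gives no label for 200


def classify(*, is_primary, peer_reachable=True, mid_apply=False,
             recovering=False, topology_valid=True):
    """Map a node's state to the band expected by this playbook.

    The order of the checks is an assumption made for illustration;
    the real precedence is whatever the server's calculator implements.
    """
    if not topology_valid:
        return Band.INVALID_TOPOLOGY
    if recovering:
        return Band.RECOVERING
    if not is_primary:
        return Band.SECONDARY
    if mid_apply:
        return Band.PRIMARY_MID_APPLY
    if not peer_reachable:
        return Band.PRIMARY_PEER_DOWN
    return Band.PRIMARY_HEALTHY
```

For example, `classify(is_primary=True, peer_reachable=False)` yields `Band.PRIMARY_PEER_DOWN`, the value A2 expects after the Backup is stopped.
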
## Test matrix

Each row is one manual run; the pass criterion is in the rightmost column.

### Block A — UA protocol signals (UaExpert)

| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| A1 | ServiceLevel published | Connect UaExpert to Primary. Browse to `Server.ServiceLevel`. | Value = 200 (or the expected band value per the table above) |
| A2 | ServiceLevel updates on peer down | Connect to Primary. Stop Backup (`sc stop OtOpcUa`). Watch `ServiceLevel`. | Transitions 200 → 150 within ~2 s of peer probe timeout |
| A3 | RedundancySupport | Browse to `Server.ServerRedundancy.RedundancySupport`. | Value matches the declared `RedundancyMode` (Warm / Hot / None) |
| A4 | ServerUriArray (non-transparent upgrade) | Deferred until the redundancy-object-type upgrade follow-up lands (see Known limitations). Then browse to `Server.ServerRedundancy.ServerUriArray`. | When upgrade lands: `ServerUriArray` reports both ApplicationUris, self first |
| A5 | Mid-apply dip | On Primary trigger a `sp_PublishGeneration` apply. | `ServiceLevel` drops to 75 for the apply duration + dwell |

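When A2 or A5 is judged from logged samples rather than by eye, the timing check can be sketched as below. It assumes the operator has exported `(seconds, ServiceLevel)` pairs from whatever tool recorded the trend; the function name and the sample format are hypothetical.

```python
def seconds_until_level(samples, event_time, expected_level):
    """Return the delay from event_time to the first sample showing expected_level.

    samples: iterable of (timestamp_seconds, service_level) pairs.
    Returns None if the level is never observed at or after event_time.
    """
    for timestamp, level in sorted(samples):
        if timestamp >= event_time and level == expected_level:
            return timestamp - event_time
    return None


# Hypothetical A2 trace: the peer probe times out at t = 1.0 s.
trace = [(0.0, 200), (1.0, 200), (2.5, 150), (3.0, 150)]
delay = seconds_until_level(trace, event_time=1.0, expected_level=150)
```

Here `delay` is `1.5`, which would satisfy the "within ~2 s" criterion. The result is only as precise as the sampling interval of the recorded trend.
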
### Block B — Client failover

| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| B1 | UaExpert picks Primary by ServiceLevel | In UaExpert configure a Redundancy Group with both endpoint URLs. | Client picks the Primary URL (higher ServiceLevel) |
| B2 | UaExpert cuts over on Primary kill | Kill the Primary's `OtOpcUa` service. | Client session reconnects to Backup within UaExpert's reconnect timeout (default 5 s). Data-change monitored items resume. |
| B3 | UaExpert cuts back when Primary returns | Start the Primary service. Wait ≥ recovery dwell (see `RecoveryStateManager.DwellTime`). | `ServiceLevel` on returning Primary goes through 50 (Recovering) → 200; UaExpert may or may not switch back (client-policy dependent; both are accepted outcomes) |
| B4 | Kepware QuickClient failover | Repeat B1–B3 with Kepware in place of UaExpert. | Same pass criteria; establishes we're not UaExpert-specific |

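The behaviour B1 and B2 look for can be summarised as "prefer the reachable endpoint with the highest `ServiceLevel`". The sketch below illustrates that rule only; it is not how UaExpert or Kepware implement their selection, and the endpoint URLs are placeholders.

```python
def pick_endpoint(service_levels):
    """Choose the endpoint a ServiceLevel-aware client is expected to use.

    service_levels: dict mapping endpoint URL to its ServiceLevel byte,
    or to None when the endpoint could not be reached.
    Returns the URL with the highest ServiceLevel, or None if all are down.
    """
    reachable = {url: level for url, level in service_levels.items()
                 if level is not None}
    if not reachable:
        return None
    return max(reachable, key=reachable.get)


# Placeholder URLs for illustration only.
PRIMARY = "opc.tcp://node-a:4840"
BACKUP = "opc.tcp://node-b:4840"

before_kill = pick_endpoint({PRIMARY: 200, BACKUP: 100})   # B1
after_kill = pick_endpoint({PRIMARY: None, BACKUP: 100})   # B2
```

`before_kill` is the Primary URL and `after_kill` is the Backup URL. B3 is deliberately not modelled: whether a client switches back once the Primary returns to 200 depends on client policy.
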
### Block C — Galaxy MXAccess failover

This block validates that an AVEVA System Platform app consuming our cluster
via MXAccess tolerates a Primary drop the same way a native OPC UA client does.
The MXAccess toolkit internally wraps the OPC UA Client and does its own
redundancy negotiation; we're asserting that negotiation honors our
`ServiceLevel` signal.

| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| C1 | Galaxy binds to Primary on first connect | Bring the cluster up. Start a Galaxy `$MxAccessClient` object pointed at the cluster with both node URLs. | Galaxy reports `QUALITY = Good` + initial values from the Primary |
| C2 | Galaxy redirects on Primary drop | Stop the Primary. | Galaxy's `QUALITY` briefly goes `Uncertain`, then back to `Good`; values continue streaming from the Backup within MXAccess's `ReconnectInterval` (default 20 s) |
| C3 | Galaxy handles mid-apply dip | Trigger a generation apply on the Primary. | Galaxy continues reading — the mid-apply dip is advertisory (ServiceLevel 75), not a session drop; MXAccess should stay bound |

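For C2, the gap between the quality dip and the return to `Good` can be measured from a logged quality trace. This is a rough sketch that assumes the operator has exported `(seconds, quality)` pairs with quality as a plain string; the function name and trace format are hypothetical, and nothing here calls MXAccess.

```python
def recovery_seconds(samples):
    """Return seconds from the first non-Good sample to the next Good sample.

    samples: iterable of (timestamp_seconds, quality_string) pairs.
    Returns None if quality never left Good or never came back.
    """
    dropped_at = None
    for timestamp, quality in sorted(samples):
        if dropped_at is None:
            if quality != "Good":
                dropped_at = timestamp
        elif quality == "Good":
            return timestamp - dropped_at
    return None


# Hypothetical C2 trace: Primary stopped shortly before t = 5 s.
trace = [(0, "Good"), (5, "Uncertain"), (12, "Uncertain"), (17, "Good")]
gap = recovery_seconds(trace)
```

Here `gap` is `12`, which is inside the reconnect window quoted in C2. A result of `None` on a trace that did dip means the binding never recovered and the row fails.
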
## Recording results

Copy the tables above into a tracking doc for each run. The tracking doc has this shape:

```
Run date: 2026-MM-DD
Cluster: <id>  Primary: <node>  Backup: <node>  Release: <sha>
A1: PASS  evidence: UaExpert screenshot uaexpert-a1.png
A2: PASS  evidence: ServiceLevel trend grafana-a2.png
…
```

One pass of every row is the acceptance criterion. Re-run after any Phase 6.3
follow-up ships (especially the non-transparent redundancy-type upgrade, which
flips A4 from "deferred" to "expected pass").

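A completed tracking doc can be checked against the acceptance criterion with a few lines of scripting. The sketch below assumes the `ID: STATUS` line shape shown above and treats any status other than `PASS` as not passed; the `deferred` parameter is a hypothetical convenience for rows such as A4.

```python
import re

# Matches result lines such as "A1: PASS  evidence: ..."
ROW = re.compile(r"^([ABC]\d+):\s+(\S+)")


def rows_not_passed(text, expected_rows, deferred=()):
    """Return the expected row IDs that are missing or not marked PASS.

    Rows listed in `deferred` are skipped; everything else must be PASS.
    """
    results = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            results[match.group(1)] = match.group(2)
    return [row for row in expected_rows
            if row not in deferred and results.get(row) != "PASS"]


EXPECTED = ["A1", "A2", "A3", "A4", "A5",
            "B1", "B2", "B3", "B4",
            "C1", "C2", "C3"]
```

Calling `rows_not_passed(doc_text, EXPECTED, deferred=["A4"])` returns an empty list when the run meets the criterion while A4 is still deferred. Header lines such as `Cluster:` are ignored because they do not match the row pattern.
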
## Known limitations

- **A4 pending**: `Server.ServerRedundancy` on our current SDK build lands as
  the base `ServerRedundancyState`, which has no `ServerUriArray` child.
  `ServerRedundancyNodeWriter.ApplyServerUriArray` logs-and-skips until the
  redundancy-object-type upgrade follow-up lands.
- **Recovery dwell default**: `RecoveryStateManager.DwellTime` defaults to 60 s
  in `Program.cs`. Adjust via a future config knob if B3 takes too long to
  observe.
- **C-block external dependency**: The `Galaxy.Host` side of the redundancy
  story is largely out of our code — it's MXAccess's own client-redundancy
  policy talking to our published ServiceLevel. A negative result on C1–C3
  does not necessarily indicate an OtOpcUa bug; cross-check with UaExpert
  (Block A / B) first.

## Automation notes (why this is a playbook, not a test)

- UaExpert and Kepware binaries are closed-source Windows GUIs; they don't
  ship headless CLIs for the browse/connect/subscribe flows.
- The OPC Foundation reference SDK *can* drive every scenario, but our own
  `Driver.OpcUaClient` tests already cover that client's behaviour; Block B
  adds value specifically because these two clients have independent
  redundancy implementations we don't control.
- A subset of scenarios *can* be automated: the self-loopback case where our
  own `otopcua-cli` drives Primary + Backup. For that subset, the existing
  `tests/ZB.MOM.WW.OtOpcUa.Server.Tests/RedundancyStatePublisherTests` and
  `ServiceLevelCalculatorTests` (unit) plus `ClusterTopologyLoaderTests`
  (integration) already cover the math and the data path. The wire-level
  assertion that the values actually land on the right OPC UA nodes is covered
  by `ServerRedundancyNodeWriterTests`.