Redundancy Interop Playbook (Phase 6.3 Stream F — task #150)
Scope: manual validation that third-party OPC UA clients + AVEVA MXAccess observe our non-transparent redundancy signals (ServiceLevel, ServerUriArray, RedundancySupport) and fail over to the Backup node when the Primary drops.
Why manual: the third-party clients named here are Windows-GUI binaries (UaExpert, Kepware QuickClient) or embedded inside AVEVA System Platform. Automating any of them into PR-CI is out of scope for v2. This playbook captures the minimal dev-box-plus-VM setup and the expected pass criteria so the work can be executed repeatably at v2 release readiness and after any Phase 6.3 follow-up change.
Prerequisites
- Two OtOpcUa.Server nodes in one ServerCluster:
  - Declared as NodeCount = 2, RedundancyMode = Hot (or Warm).
  - Each with a distinct ApplicationUri (enforced by unique index per decision #86).
  - Each node's StaticRoutes.xml points at the other (ServerCluster.Node[].Host).
- scripts/install/Install-Services.ps1 applied on each node so the RedundancyPublisherHostedService is running.
- At least one DriverInstance with a reachable simulator or PLC so both servers have a non-empty address space to browse.
- On the client host:
  - UaExpert ≥ 1.7 installed
  - Kepware ClientAce QuickClient (or equivalent), optional, for a second client
- For the AVEVA leg: a Galaxy.Host running against an MXAccess deployment with an external OPC UA client object pointed at the cluster (not at a single node).
Expected signals on a running cluster
| Node | ServiceLevel | RedundancySupport | ServerUriArray |
|---|---|---|---|
| Primary, healthy, peer reachable | 200 | Hot (or Warm) | self + peer |
| Primary, mid-apply | 75 (PrimaryMidApply) | same | same |
| Primary, peer unreachable | 150 (PrimaryPeerDown) | same | same |
| Backup, healthy | 100 (Secondary) | same | same |
| Either, dwelling in recovery | 50 (Recovering) | same | same |
| Either, invariant violation (two Primary, disabled-node mismatch) | 2 (InvalidTopology) | same | same |
(The band constants live in ServiceLevelCalculator.Classify.)
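The band table can be read as a small decision function. The sketch below is illustrative only: it is not the code in ServiceLevelCalculator.Classify, the member name PRIMARY_HEALTHY and the function signature are hypothetical, and the precedence between overlapping conditions (for example a Primary that is both mid-apply and cut off from its peer) is an assumption the table does not settle.

```python
from enum import IntEnum


class Band(IntEnum):
    """ServiceLevel bytes from the table; member names are illustrative."""
    INVALID_TOPOLOGY = 2
    RECOVERING = 50
    PRIMARY_MID_APPLY = 75
    SECONDARY = 100
    PRIMARY_PEER_DOWN = 150
    PRIMARY_HEALTHY = 200  # hypothetical name; the table gives only the value


def classify(is_primary, peer_reachable, mid_apply, recovering, topology_valid):
    """Map node state to a band. Assumed precedence: lowest band wins."""
    if not topology_valid:
        return Band.INVALID_TOPOLOGY
    if recovering:
        return Band.RECOVERING
    if not is_primary:
        return Band.SECONDARY
    if mid_apply:
        return Band.PRIMARY_MID_APPLY
    if not peer_reachable:
        return Band.PRIMARY_PEER_DOWN
    return Band.PRIMARY_HEALTHY
```

For example, `classify(True, False, False, False, True)` yields the 150 band that scenario A2 expects after the Backup is stopped.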
Test matrix
Each row is one manual run; pass criterion in the right column.
Block A — UA protocol signals (UaExpert)
| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| A1 | ServiceLevel published | Connect UaExpert to Primary. Browse to Server.ServerStatus.ServiceLevel. | Value = 200 (or the expected Band byte per table above) |
| A2 | ServiceLevel updates on peer down | Connect to Primary. Stop Backup (sc stop OtOpcUa). Watch ServiceLevel. | Transitions 200 → 150 within ~2 s of peer probe timeout |
| A3 | RedundancySupport | Browse to Server.ServerRedundancy.RedundancySupport. | Value matches the declared RedundancyMode (Warm / Hot / None) |
| A4 | ServerUriArray (non-transparent upgrade) | Requires a redundancy-object-type upgrade follow-up. | When upgrade lands: ServerUriArray reports both ApplicationUris, self first |
| A5 | Mid-apply dip | On Primary trigger a sp_PublishGeneration apply. | ServiceLevel drops to 75 for the apply duration + dwell |
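A2 and A5 are timing checks, so it helps to record ServiceLevel samples with timestamps rather than eyeball the value. As a minimal sketch, assuming the samples have been exported as (seconds, value) pairs by whatever tool is at hand, the hypothetical helper below measures how long after a reference event (for A2, the peer probe timeout) the expected transition was first observed.

```python
def transition_latency(samples, event_time, old, new):
    """Return seconds from event_time to the first `new` sample seen after
    at least one `old` sample, or None if the transition never shows up.

    samples: iterable of (timestamp_seconds, service_level), in time order.
    """
    seen_old = False
    for timestamp, value in samples:
        if value == old:
            seen_old = True
        elif value == new and seen_old and timestamp >= event_time:
            return timestamp - event_time
    return None


# Hypothetical A2 trace: probe timeout at t = 2.0 s, 150 first seen at t = 3.5 s.
trace = [(0.0, 200), (1.0, 200), (2.0, 200), (3.5, 150)]
latency = transition_latency(trace, event_time=2.0, old=200, new=150)
```

The measured latency can only be as fine as the sampling interval, so sample faster than the ~2 s bound being checked.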
Block B — Client failover
| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| B1 | UaExpert picks Primary by ServiceLevel | In UaExpert configure a Redundancy Group with both endpoint URLs. | Client picks the Primary URL (higher ServiceLevel) |
| B2 | UaExpert cuts over on Primary kill | Kill the Primary's OtOpcUa service. | Client session reconnects to Backup within UaExpert's reconnect timeout (default 5 s). Data-change monitored items resume. |
| B3 | UaExpert cuts back when Primary returns | Start the Primary service. Wait ≥ recovery dwell (see RecoveryStateManager.DwellTime). | ServiceLevel on returning Primary goes through 50 (Recovering) → 200; UaExpert may or may not switch back (client-policy dependent; both are accepted outcomes) |
| B4 | Kepware QuickClient failover | Repeat B1–B3 with Kepware in place of UaExpert. | Same pass criteria; establishes we're not UaExpert-specific |
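The behaviour Block B looks for can be summarised as "prefer the reachable endpoint with the highest ServiceLevel". The sketch below shows that policy in its simplest form. It is an assumption about how a well-behaved client might choose, not a description of UaExpert's or Kepware's actual implementation, and the endpoint URLs are hypothetical.

```python
def pick_endpoint(service_levels):
    """Choose the reachable endpoint with the highest ServiceLevel.

    service_levels: dict of endpoint URL -> ServiceLevel byte,
    with None for an endpoint that could not be read.
    Returns the chosen URL, or None if nothing is reachable.
    """
    reachable = {url: level for url, level in service_levels.items()
                 if level is not None}
    if not reachable:
        return None
    return max(reachable, key=reachable.get)


PRIMARY = "opc.tcp://node-a:4840"  # hypothetical URLs
BACKUP = "opc.tcp://node-b:4840"

b1 = pick_endpoint({PRIMARY: 200, BACKUP: 100})   # both healthy
b2 = pick_endpoint({PRIMARY: None, BACKUP: 100})  # Primary killed
b3 = pick_endpoint({PRIMARY: 50, BACKUP: 100})    # Primary still in recovery dwell
```

Under this policy the client stays on the Backup while the returning Primary reports 50, and would switch back only once it reports 200. A client that never switches back is equally acceptable per B3.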
Block C — Galaxy MXAccess failover
This block validates that an AVEVA System Platform app consuming our cluster
via MXAccess tolerates a Primary drop the same way a native OPC UA client does.
The MXAccess toolkit internally wraps the OPC UA Client and does its own
redundancy negotiation; we're asserting that negotiation honors our
ServiceLevel signal.
| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| C1 | Galaxy binds to Primary on first connect | Bring the cluster up. Start a Galaxy $MxAccessClient object pointed at the cluster with both node URLs. | Galaxy reports QUALITY = Good + initial values from the Primary |
| C2 | Galaxy redirects on Primary drop | Stop the Primary. | Galaxy's QUALITY briefly goes Uncertain, then back to Good; values continue streaming from the Backup within MXAccess's ReconnectInterval (default 20 s) |
| C3 | Galaxy handles mid-apply dip | Trigger a generation apply on the Primary. | Galaxy continues reading; the mid-apply dip is advertisory (ServiceLevel 75), not a session drop, so MXAccess should stay bound |
Recording results
Copy the tables above into a tracking doc per run. The tracking doc shape:
Run date: 2026-MM-DD
Cluster: <id> Primary: <node> Backup: <node> Release: <sha>
A1: PASS evidence: UaExpert screenshot uaexpert-a1.png
A2: PASS evidence: ServiceLevel trend grafana-a2.png
…
One pass of every row is the acceptance criterion. Re-run after any Phase 6.3 follow-up ships (especially the non-transparent redundancy-type upgrade, which flips A4 from "deferred" to "expected pass").
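Because acceptance means one pass of every row, a tracking doc can be checked mechanically before sign-off. The following is a minimal sketch that assumes the tracking doc keeps the "ID: STATUS evidence: ..." line shape shown above. The function name is hypothetical, and any status word other than PASS (for example a deferred A4) is reported as not passed.

```python
import re

# Row IDs taken from Blocks A, B and C of the test matrix.
EXPECTED_ROWS = ["A1", "A2", "A3", "A4", "A5",
                 "B1", "B2", "B3", "B4",
                 "C1", "C2", "C3"]
ROW_RE = re.compile(r"^([ABC]\d):\s+(\S+)")


def summarize(tracking_text):
    """Return (missing_rows, not_passed_rows) for one tracking doc."""
    results = {}
    for line in tracking_text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            results[match.group(1)] = match.group(2).upper()
    missing = [row for row in EXPECTED_ROWS if row not in results]
    not_passed = [row for row in EXPECTED_ROWS
                  if row in results and results[row] != "PASS"]
    return missing, not_passed


example = """Run date: 2026-MM-DD
Cluster: <id> Primary: <node> Backup: <node> Release: <sha>
A1: PASS evidence: UaExpert screenshot uaexpert-a1.png
A2: FAIL evidence: ServiceLevel trend grafana-a2.png
"""
missing, not_passed = summarize(example)
```

A run is complete when both returned lists are empty, apart from A4 while it remains deferred.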
Known limitations
- A4 pending: Server.ServerRedundancy on our current SDK build lands as the base ServerRedundancyState, which has no ServerUriArray child. ServerRedundancyNodeWriter.ApplyServerUriArray logs-and-skips until the redundancy-object-type upgrade follow-up lands.
- Recovery dwell default: RecoveryStateManager.DwellTime defaults to 60 s in Program.cs. Adjust via future config knob if B3 takes too long to observe.
- C-block external dependency: The Galaxy.Host side of the redundancy story is largely outside our code; it is MXAccess's own client-redundancy policy talking to our published ServiceLevel. A negative result on C1-C3 does not necessarily indicate an OtOpcUa bug; cross-check with UaExpert (Block A / B) first.
Automation notes (why this is a playbook, not a test)
- UaExpert and Kepware binaries are closed-source Windows GUIs; they don't ship headless CLIs for the browse/connect/subscribe flows.
- The OPC Foundation reference SDK can drive every scenario, but our own Driver.OpcUaClient tests already cover that client's behaviour; Block B adds value specifically because these two clients have independent redundancy implementations we don't control.
- For the subset of scenarios that can be automated (the self-loopback case where our own otopcua-cli drives Primary + Backup), the existing tests/Server/ZB.MOM.WW.OtOpcUa.Server.Tests/RedundancyStatePublisherTests + ServiceLevelCalculatorTests (unit) + ClusterTopologyLoaderTests (integration) already cover the math + data path. The wire-level assertion that the values actually land on the right OPC UA nodes is covered by ServerRedundancyNodeWriterTests.