Joseph Doherty 969b0847a1 docs: update path references for module-folder reorganization

Rewrite src/ and tests/ project paths in docs, CLAUDE.md, README.md, and
test-fixture READMEs to the new module-folder layout (Core/Server/Drivers/
Client/Tooling). References to retired v1 projects (Galaxy.Host/Proxy/Shared,
the legacy monolithic test projects) are left untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-17 02:10:29 -04:00


Redundancy Interop Playbook (Phase 6.3 Stream F — task #150)

Scope: manual validation that third-party OPC UA clients + AVEVA MXAccess observe our non-transparent redundancy signals (ServiceLevel, ServerUriArray, RedundancySupport) and fail over to the Backup node when the Primary drops.

Why manual: the third-party clients named here are Windows-GUI binaries (UaExpert, Kepware QuickClient) or embedded inside AVEVA System Platform. Automating any of them into PR-CI is out of scope for v2. This playbook captures the minimal dev-box-plus-VM setup and the expected pass criteria so the work can be executed repeatably at v2 release readiness and after any Phase 6.3 follow-up change.

Prerequisites

- Two OtOpcUa.Server nodes in one ServerCluster:
  - Declared as NodeCount = 2, RedundancyMode = Hot (or Warm).
  - Each with a distinct ApplicationUri (enforced by unique index per decision #86).
  - Each node's StaticRoutes.xml points at the other (ServerCluster.Node[].Host).
- scripts/install/Install-Services.ps1 applied on each node so the RedundancyPublisherHostedService is running.
- At least one DriverInstance with a reachable simulator or PLC so both servers have a non-empty address space to browse.
- On the client host:
  - UaExpert ≥ 1.7 installed
  - Kepware ClientAce QuickClient (or equivalent); optional, for a second client
- For the AVEVA leg: a Galaxy.Host running against an MXAccess deployment with an external OPC UA client object pointed at the cluster (not at a single node).

Expected signals on a running cluster

Node	`ServiceLevel`	`RedundancySupport`	`ServerUriArray`
Primary, healthy, peer reachable	200	`Hot` (or `Warm`)	self + peer
Primary, mid-apply	75 (`PrimaryMidApply`)	same	same
Primary, peer unreachable	150 (`PrimaryPeerDown`)	same	same
Backup, healthy	100 (`Secondary`)	same	same
Either, dwelling in recovery	50 (`Recovering`)	same	same
Either, invariant violation (two Primary, disabled-node mismatch)	2 (`InvalidTopology`)	same	same

(The band constants live in ServiceLevelCalculator.Classify.)
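
The band table can be read as a small decision function. The Python sketch below is illustrative only and is not the code in ServiceLevelCalculator.Classify. The `NodeState` fields are hypothetical, the name `PrimaryHealthy` is invented here because the table does not name the 200 band, and the precedence among overlapping conditions (invalid topology, then recovering, then Backup, then mid-apply, then peer-down) is an assumption the table does not state.

```python
from dataclasses import dataclass

# Band values copied from the table above. "PrimaryHealthy" and the field
# names below are illustrative, not identifiers from the server code.
BANDS = {
    "PrimaryHealthy": 200,
    "PrimaryPeerDown": 150,
    "Secondary": 100,
    "PrimaryMidApply": 75,
    "Recovering": 50,
    "InvalidTopology": 2,
}


@dataclass
class NodeState:
    is_primary: bool
    peer_reachable: bool = True
    mid_apply: bool = False
    recovering: bool = False
    topology_valid: bool = True


def classify(state: NodeState) -> int:
    """Return the ServiceLevel byte for a node (assumed precedence order)."""
    if not state.topology_valid:
        return BANDS["InvalidTopology"]
    if state.recovering:
        return BANDS["Recovering"]
    if not state.is_primary:
        return BANDS["Secondary"]
    if state.mid_apply:
        return BANDS["PrimaryMidApply"]
    if not state.peer_reachable:
        return BANDS["PrimaryPeerDown"]
    return BANDS["PrimaryHealthy"]
```

For example, `classify(NodeState(is_primary=True, peer_reachable=False))` yields 150, the value scenario A2 expects after the Backup is stopped.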

Test matrix

Each row is one manual run; pass criterion in the right column.

Block A — UA protocol signals (UaExpert)

#	Scenario	Procedure	Pass criterion
A1	ServiceLevel published	Connect UaExpert to Primary. Browse to `Server.ServerStatus.ServiceLevel`.	Value = 200 (or the expected Band byte per table above)
A2	ServiceLevel updates on peer down	Connect to Primary. Stop Backup (`sc stop OtOpcUa`). Watch `ServiceLevel`.	Transitions 200 → 150 within ~2 s of peer probe timeout
A3	RedundancySupport	Browse to `Server.ServerRedundancy.RedundancySupport`.	Value matches the declared `RedundancyMode` (Warm / Hot / None)
A4	ServerUriArray (non-transparent upgrade)	Requires a redundancy-object-type upgrade follow-up.	When upgrade lands: `ServerUriArray` reports both ApplicationUris, self first
A5	Mid-apply dip	On Primary trigger a `sp_PublishGeneration` apply.	`ServiceLevel` drops to 75 for the apply duration + dwell
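
If the ServiceLevel values observed during A2 are written down as timestamped samples, the "within ~2 s" criterion can be checked after the fact. The sketch below is a hypothetical helper, not part of the repo. It assumes samples are `(seconds, ServiceLevel)` pairs and that the operator supplies `event_time`, the moment the peer probe is considered to have timed out; the playbook does not define how to measure that moment.

```python
from typing import Iterable, Optional, Tuple

Sample = Tuple[float, int]  # (seconds since start of run, ServiceLevel byte)


def transition_latency(samples: Iterable[Sample], event_time: float,
                       before: int = 200, after: int = 150) -> Optional[float]:
    """Seconds from event_time until ServiceLevel first reads `after`.

    Returns None if the last sample at or before event_time was not
    `before`, or if `after` was never observed afterwards.
    """
    last_before_event = None
    for t, level in sorted(samples):
        if t <= event_time:
            last_before_event = level
        elif level == after:
            if last_before_event != before:
                return None
            return t - event_time
    return None


# Illustrative data only: peer probe judged to time out at t = 3.0 s.
observed = [(0.0, 200), (1.0, 200), (3.0, 200), (4.5, 150)]
latency = transition_latency(observed, event_time=3.0)
print(latency is not None and latency <= 2.0)
```

The same helper fits A5 by passing `after=75`, though the dwell portion of that scenario would need a separate check.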

Block B — Client failover

#	Scenario	Procedure	Pass criterion
B1	UaExpert picks Primary by ServiceLevel	In UaExpert configure a Redundancy Group with both endpoint URLs.	Client picks the Primary URL (higher ServiceLevel)
B2	UaExpert cuts over on Primary kill	Kill the Primary's `OtOpcUa` service.	Client session reconnects to Backup within UaExpert's reconnect timeout (default 5 s). Data-change monitored items resume.
B3	UaExpert cuts back when Primary returns	Start the Primary service. Wait ≥ recovery dwell (see `RecoveryStateManager.DwellTime`).	`ServiceLevel` on returning Primary goes through 50 (Recovering) → 200; UaExpert may or may not switch back (client-policy dependent; both are accepted outcomes)
B4	Kepware QuickClient failover	Repeat B1–B3 with Kepware in place of UaExpert.	Same pass criteria; establishes we're not UaExpert-specific
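
The behaviour B1 and B2 probe for is the usual non-transparent redundancy rule: read ServiceLevel from each endpoint and bind to the highest. UaExpert and Kepware implement their own policies, which we do not control, so the sketch below only states the expectation. The function name and endpoint URLs are hypothetical.

```python
from typing import Mapping, Optional


def pick_endpoint(levels: Mapping[str, Optional[int]]) -> Optional[str]:
    """Pick the endpoint URL with the highest ServiceLevel.

    `levels` maps endpoint URL -> last ServiceLevel read, or None when the
    endpoint could not be reached. Returns None if nothing is reachable.
    """
    reachable = {url: lvl for url, lvl in levels.items() if lvl is not None}
    if not reachable:
        return None
    return max(reachable, key=reachable.get)


# B1: both nodes up, the Primary advertises the higher level.
print(pick_endpoint({"opc.tcp://node-a:4840": 200, "opc.tcp://node-b:4840": 100}))
# B2: Primary killed, only the Backup answers.
print(pick_endpoint({"opc.tcp://node-a:4840": None, "opc.tcp://node-b:4840": 100}))
```

Under this rule a returning Primary that is still in the Recovering band (50) would not win over a healthy Backup (100), which is consistent with B3 accepting either outcome once the dwell ends.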

Block C — Galaxy MXAccess failover

This block validates that an AVEVA System Platform app consuming our cluster via MXAccess tolerates a Primary drop the same way a native OPC UA client does. The MXAccess toolkit internally wraps the OPC UA Client and does its own redundancy negotiation; we're asserting that negotiation honors our ServiceLevel signal.

#	Scenario	Procedure	Pass criterion
C1	Galaxy binds to Primary on first connect	Bring the cluster up. Start a Galaxy `$MxAccessClient` object pointed at the cluster with both node URLs.	Galaxy reports `QUALITY = Good` + initial values from the Primary
C2	Galaxy redirects on Primary drop	Stop the Primary.	Galaxy's `QUALITY` briefly goes `Uncertain`, then back to `Good`; values continue streaming from the Backup within MXAccess's `ReconnectInterval` (default 20 s)
C3	Galaxy handles mid-apply dip	Trigger a generation apply on the Primary.	Galaxy continues reading — the mid-apply dip is advertisory (ServiceLevel 75), not a session drop; MXAccess should stay bound

Recording results

Copy the tables above into a tracking doc per run. The tracking doc shape:

Run date: 2026-MM-DD
Cluster: <id>  Primary: <node>  Backup: <node>  Release: <sha>
A1: PASS  evidence: UaExpert screenshot uaexpert-a1.png
A2: PASS  evidence: ServiceLevel trend grafana-a2.png
…

One pass of every row is the acceptance criterion. Re-run after any Phase 6.3 follow-up ships (especially the non-transparent redundancy-type upgrade, which flips A4 from "deferred" to "expected pass").
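
A tracking doc in the shape above is easy to check mechanically. The following sketch is a hypothetical helper, not a tool shipped with the repo. It assumes each result line starts with a row ID, a colon, and a status word, treats any status other than PASS as a non-pass, and skips A4 by default because that row is deferred until the redundancy-type upgrade lands.

```python
import re

# Row IDs taken from Blocks A, B and C of this playbook.
EXPECTED_ROWS = ["A1", "A2", "A3", "A4", "A5",
                 "B1", "B2", "B3", "B4",
                 "C1", "C2", "C3"]

# Matches lines such as "A1: PASS  evidence: ..."
ROW_RE = re.compile(r"^([ABC]\d):\s+(\S+)")


def summarize(tracking_text: str, deferred=("A4",)):
    """Return (missing_rows, non_pass_rows) for one tracking doc."""
    results = {}
    for line in tracking_text.splitlines():
        m = ROW_RE.match(line.strip())
        if m:
            results[m.group(1)] = m.group(2).upper()
    required = [r for r in EXPECTED_ROWS if r not in deferred]
    missing = [r for r in required if r not in results]
    non_pass = [r for r in required if r in results and results[r] != "PASS"]
    return missing, non_pass
```

A run meets the acceptance criterion when both returned lists are empty. Once A4 is expected to pass, call `summarize(text, deferred=())`.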

Known limitations

- A4 pending: Server.ServerRedundancy on our current SDK build lands as the base ServerRedundancyState, which has no ServerUriArray child. ServerRedundancyNodeWriter.ApplyServerUriArray logs and skips until the redundancy-object-type upgrade follow-up lands.
- Recovery dwell default: RecoveryStateManager.DwellTime defaults to 60 s in Program.cs. A future config knob is intended to allow shortening it if B3 takes too long to observe.
- C-block external dependency: The Galaxy.Host side of the redundancy story is largely outside our code: it is MXAccess's own client-redundancy policy reacting to our published ServiceLevel. A negative result on C1–C3 does not necessarily indicate an OtOpcUa bug; cross-check with UaExpert (Blocks A and B) first.

Automation notes (why this is a playbook, not a test)

- UaExpert and Kepware binaries are closed-source Windows GUIs; they don't ship headless CLIs for the browse/connect/subscribe flows.
- The OPC Foundation reference SDK can drive every scenario, but our own Driver.OpcUaClient tests already cover that client's behaviour; Block B adds value specifically because these two clients have independent redundancy implementations we don't control.
- For the subset of scenarios that can be automated (the self-loopback case where our own otopcua-cli drives Primary + Backup), the existing tests/Server/ZB.MOM.WW.OtOpcUa.Server.Tests/RedundancyStatePublisherTests + ServiceLevelCalculatorTests (unit) + ClusterTopologyLoaderTests (integration) already cover the math + data path. The wire-level assertion that the values actually land on the right OPC UA nodes is covered by ServerRedundancyNodeWriterTests.
