Phase 6.3 Redundancy — Client Interop Matrix and Cutover Validation Plan
Scope: Phase 6.3 redundancy runtime core shipped (PRs #89-90, #98-99, #24-peerprobe, Stream C node wiring, Stream D lease wrap). What remains is Stream F (task #150): validating that third-party OPC UA clients honour our ServiceLevel / ServerUriArray / RedundancySupport signals and fail over correctly when the Primary drops. This document defines what is automatable as integration tests, what requires two live instances plus a real client, and a step-by-step cutover-validation runbook.

Source of truth: docs/Redundancy.md, docs/v2/redundancy-interop-playbook.md, docs/v2/implementation/phase-6-3-redundancy-runtime.md, scripts/compliance/phase-6-3-compliance.ps1.
What is already tested (no live cluster needed)
The following are covered by existing automated tests that run in ordinary
dotnet test:
| Area | Test class(es) | What it asserts |
|---|---|---|
| ServiceLevelCalculator: 8-state matrix | ServiceLevelCalculatorTests | All 10 band values; role × self-health × peer-http × peer-ua × apply × recovery × topology combinations |
| RecoveryStateManager: dwell + witness | RecoveryStateManagerTests | 60 s dwell default; premature-exit rejection; witness-required gate |
| ApplyLeaseRegistry: lease lifecycle | ApplyLeaseRegistryTests | Disposal on success / exception / cancellation; watchdog force-close at 10 min |
| ServerRedundancyNodeWriter: OPC UA variable binding | ServerRedundancyNodeWriterTests | ServiceLevel byte push; RedundancySupport enum; ServerUriArray skip-log when node absent |
| RedundancyStatePublisher: orchestration | RedundancyStatePublisherTests | Edge-triggered OnStateChanged; idempotent dedup |
| ClusterTopologyLoader | ClusterTopologyLoaderTests | Two-node seed; one-node degenerate; duplicate-URI rejection |
| DraftValidator.ValidateClusterTopology | DraftValidatorTests (8 cases) | NodeCount/mode pairs; Enabled-count vs NodeCount; multiple-Primary rejection |
Run with:
dotnet test ZB.MOM.WW.OtOpcUa.slnx --filter "FullyQualifiedName~Redundancy"
Compliance gate (every Phase 6.3 static check):
pwsh ./scripts/compliance/phase-6-3-compliance.ps1
Pass criteria: exit 0; all [PASS] lines green; [DEFERRED] lines are
known-deferred surfaces, not failures.
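The pass criteria above can be checked mechanically when the gate runs in CI. The following is a minimal Python sketch, assuming the script prints one result per line prefixed with a bracketed marker. Only [PASS] and [DEFERRED] are named in this plan; the [FAIL] marker, the line layout, and the function name are assumptions, not taken from the script.

```python
import re

# Assumed line shape: "[PASS] some check name". [FAIL] is a hypothetical
# marker for failing checks; this plan only names [PASS] and [DEFERRED].
MARKER = re.compile(r"^\s*\[(PASS|FAIL|DEFERRED)\]\s*(.*)$")


def evaluate_compliance(output: str, exit_code: int) -> dict:
    """Summarise gate output: ok only on exit 0 with no [FAIL] lines."""
    counts = {"PASS": 0, "FAIL": 0, "DEFERRED": 0}
    failed = []
    for line in output.splitlines():
        m = MARKER.match(line)
        if not m:
            continue  # banners and blank lines are ignored
        counts[m.group(1)] += 1
        if m.group(1) == "FAIL":
            failed.append(m.group(2))
    ok = exit_code == 0 and counts["FAIL"] == 0
    return {"ok": ok, "counts": counts, "failed": failed}
```

[DEFERRED] lines are counted but never block the result, which mirrors the rule that they are known-deferred surfaces rather than failures.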
What cannot be automated — requires two live instances
The scenarios below require two running OtOpcUa.Server processes in the
same ServerCluster, a real SQL Server Config DB, and at least one driver
instance with a reachable endpoint (simulator or real PLC).
Why it cannot be unit/integration-tested in-process
- UaExpert, Kepware KEPServerEX, and AVEVA OI Gateway are closed-source Windows GUI binaries with no headless CLI interface for the subscribe/browse flows.
- The AVEVA MXAccess failover leg (IAlarmSource reconnect, $MxAccessClient quality transition) involves the Galaxy runtime's own client-redundancy policy and the COM-layer session model; both live outside this repo.
- Even the automatable sub-set (our own otopcua-cli as the client) needs two distinct listening TCP endpoints; that requires two live processes, which is out of scope for dotnet test integration fixtures.
Test matrix
Prerequisites
- Two OtOpcUa.Server processes on separate Windows hosts (or separate ports on the same host for dev) sharing one Config DB (ServerCluster with NodeCount=2, RedundancyMode=Warm or Hot).
- Each node registered in ClusterNode:
  - Node A: RedundancyRole=Primary, ServiceLevelBase=255, ApplicationUri=urn:node-a:OtOpcUa
  - Node B: RedundancyRole=Secondary, ServiceLevelBase=100, ApplicationUri=urn:node-b:OtOpcUa
- PeerHttpProbeLoop and PeerUaProbeLoop HostedServices running on both nodes (registered via AddHostedService<PeerHttpProbeLoop> + AddHostedService<PeerUaProbeLoop> in Program.cs).
- At least one DriverInstance in the cluster with a reachable PLC or simulator (e.g. Modbus sim at 10.100.0.35:5020).
- Client machine with UaExpert >= 1.7 installed.
- Optional second client: Kepware KEPServerEX 6.x QuickClient or AVEVA OI Gateway 2020R2+.
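Before starting the matrix it can help to confirm that both node endpoints accept TCP connections at all. The sketch below is a hypothetical preflight helper in Python (the function names are illustrative, not part of this repo). It checks only that a TCP connect succeeds and says nothing about whether an OPC UA session can be opened.

```python
import socket


def endpoint_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def preflight(endpoints: dict) -> dict:
    """Map each node name to whether its (host, port) accepts TCP connections."""
    return {name: endpoint_reachable(h, p) for name, (h, p) in endpoints.items()}
```

A call such as `preflight({"node-a": ("<node-a-host>", 4840), "node-b": ("<node-b-host>", 4840)})` should report True for both nodes before Block A begins; the host names are placeholders.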
Block A — OPC UA protocol signals (UaExpert, no failover yet)
| ID | Scenario | Procedure | Pass criterion | Automatable? |
|---|---|---|---|---|
| A1 | ServiceLevel published on Primary | Connect UaExpert to Node A. Browse Server/ServiceLevel. | Value = 255 (AuthoritativePrimary) | No (requires UaExpert GUI) |
| A2 | ServiceLevel published on Backup | Connect UaExpert to Node B. Read same node. | Value = 100 (AuthoritativeBackup) | No |
| A3 | ServiceLevel updates when peer drops | Node A connected. Stop Node B (sc stop OtOpcUa). Watch ServiceLevel on Node A. | Transitions 255 → 230 (IsolatedPrimary) within ~6 s (3 × 2 s HTTP probe interval) | No |
| A4 | RedundancySupport | Browse Server/ServerRedundancy/RedundancySupport on either node. | Value = Warm or Hot matching the cluster RedundancyMode | No |
| A5 | ServerUriArray | Browse Server/ServerRedundancy/ServerUriArray on either node. | Array contains both ApplicationUri values; self listed first. Note: requires non-transparent redundancy-type upgrade (currently logs-and-skips; see known limitation A5 below). | No |
| A6 | Mid-apply ServiceLevel dip | Trigger a sp_PublishGeneration apply (via Admin UI draft → publish) while watching Node A ServiceLevel. | Drops to 200 (PrimaryMidApply) for the apply duration; returns to 255 after RefreshAsync. | No |
| A7 | Client.CLI reads correct ServiceLevel | dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read -u opc.tcp://<node-a>:4840 -n "i=2267" | Prints current byte value matching expected band. | Yes (scriptable with the Client CLI) |
| A8 | otopcua-cli failover reconnect | dotnet run ... -- connect -u opc.tcp://<node-a>:4840 -F opc.tcp://<node-b>:4840, then kill Node A. | CLI session reconnects to Node B within the session keep-alive timeout. | Yes (scriptable with the Client CLI) |
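When scripting rows such as A7, a small lookup from the ServiceLevel byte to its band name makes the comparison explicit. The Python sketch below lists only the bands named in this plan; the calculator tests cover more values than these, and the helper names are illustrative.

```python
# Bands named in this plan only; ServiceLevelCalculatorTests cover more values.
KNOWN_BANDS = {
    255: "AuthoritativePrimary",
    230: "IsolatedPrimary",
    200: "PrimaryMidApply",
    180: "RecoveringPrimary",
    100: "AuthoritativeBackup",
    80: "IsolatedBackup",
    1: "NoData",
}


def describe_service_level(value: int) -> str:
    """Return the band name for a ServiceLevel byte, or a marker if unlisted."""
    if not 0 <= value <= 255:
        raise ValueError("ServiceLevel is a byte (0-255)")
    return KNOWN_BANDS.get(value, "unlisted band")


def check_band(value: int, expected: str) -> bool:
    """True when the observed byte maps to the expected band name."""
    return describe_service_level(value) == expected
```

For A3, for example, `check_band(230, "IsolatedPrimary")` is the condition to poll for after stopping Node B.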
Block B — Third-party client failover
| ID | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| B1 | UaExpert picks Primary by ServiceLevel | Configure a Redundancy Group in UaExpert with both endpoint URLs. | Client connects to Node A (higher ServiceLevel) |
| B2 | UaExpert cuts over on Primary kill | Kill Node A OtOpcUa service. | Client session reconnects to Node B within UaExpert's reconnect timeout (default 5 s). Data-change monitored items resume. |
| B3 | UaExpert returns when Primary restores | Start Node A. Wait >= 60 s recovery dwell. | ServiceLevel on Node A progresses: 180 (RecoveringPrimary) → 255 (AuthoritativePrimary). UaExpert may or may not switch back (client-policy-dependent; both outcomes accepted). |
| B4 | Kepware QuickClient failover | Repeat B1–B3 with Kepware configured for the same two endpoints. | Same pass criteria; establishes no UaExpert-specific behaviour. |
| B5 | AVEVA OI Gateway | Configure OI Gateway OPC DA/UA client object against the cluster. Kill Primary. | OI Gateway data quality recovers within ReconnectInterval (default 20 s); no permanent data-loss alert. |
Block C — Galaxy MXAccess failover
This block requires a running Galaxy and $MxAccessClient object (AVEVA
System Platform installed, Galaxy deployed on dev box — see project memory
project_aveva_platform_installed.md).
| ID | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| C1 | Galaxy binds to Primary on first connect | Bring cluster up. Start a Galaxy $MxAccessClient with both node URLs configured. | Galaxy reports QUALITY = Good; initial values stream from Node A. |
| C2 | Galaxy redirects on Primary drop | Stop Node A. | Galaxy QUALITY briefly goes Uncertain, then returns to Good; values continue streaming from Node B within MXAccess's ReconnectInterval (default 20 s). |
| C3 | Galaxy tolerates mid-apply dip | Trigger generation apply on Node A. | Galaxy remains bound — mid-apply dip (200) is advisory, not a session drop. No quality interruption. |
Note: A negative result on C1–C3 does not necessarily indicate an OtOpcUa
defect. Cross-check with Block A / B first to confirm our ServiceLevel
signal is correct before debugging the MXAccess client layer.
Step-by-step cutover-validation runbook
This is the minimum procedure to satisfy the v2 GA exit criterion: "Non-transparent redundancy cutover validated with at least one production client (Ignition 8.3 recommended — see decision #85)."
Step 1 — Provision the cluster
# On the Config DB host, seed or verify cluster rows:
# ServerCluster: Id=<id>, Name="test-cluster", NodeCount=2, RedundancyMode=Warm
# ClusterNode A: NodeId="node-a", ClusterId=<id>, RedundancyRole=Primary,
# ServiceLevelBase=255, ApplicationUri="urn:node-a:OtOpcUa"
# ClusterNode B: NodeId="node-b", ClusterId=<id>, RedundancyRole=Secondary,
# ServiceLevelBase=100, ApplicationUri="urn:node-b:OtOpcUa"
Verify uniqueness constraint: no two ClusterNode rows share the same
ApplicationUri (unique index on ApplicationUri).
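The uniqueness check can also be run against an export of the ClusterNode rows before touching the database. The following Python sketch assumes each row is available as a dictionary with an ApplicationUri key; that export shape and the function name are assumptions made for illustration.

```python
from collections import Counter


def duplicate_application_uris(nodes):
    """Return ApplicationUri values that appear on more than one ClusterNode row."""
    counts = Counter(n["ApplicationUri"] for n in nodes)
    return sorted(uri for uri, c in counts.items() if c > 1)
```

An empty list means the seed rows satisfy the constraint. The comparison here is an exact string match; whether the database index treats case differently depends on its collation, which this sketch does not model.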
Step 2 — Start both server instances
On Node A host:
# appsettings.json: Node:NodeId = "node-a"
sc start OtOpcUa
On Node B host:
# appsettings.json: Node:NodeId = "node-b"
sc start OtOpcUa
Wait 10 s for HostedServices to complete first probe cycle.
Step 3 — Verify baseline ServiceLevel via Client CLI
# Node A should report 255
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
-u opc.tcp://<node-a-host>:4840 -n "i=2267"
# Node B should report 100
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
-u opc.tcp://<node-b-host>:4840 -n "i=2267"
Pass: Node A = 255, Node B = 100.
Step 4 — Verify ServerUriArray
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
-u opc.tcp://<node-a-host>:4840 -n "i=11314"
Pass: array returned contains both ApplicationUri strings. If
ServerUriArray node returns empty or an error, the non-transparent
redundancy-type upgrade follow-up is still pending (known limitation —
ServerRedundancyNodeWriter.ApplyServerUriArray logs-and-skips on the
base ServerRedundancyState object type).
Step 5 — Execute Primary kill + failover (B2 scenario)
- Connect UaExpert (or Kepware) Redundancy Group to both endpoints.
- Confirm client is subscribed to at least one variable node.
- Kill Node A: sc stop OtOpcUa on Node A host.
- Observe:
  - Node B ServiceLevel should transition: 100 (AuthoritativeBackup) → 80 (IsolatedBackup) within ~6 s.
  - Client should reconnect to Node B and resume data-change events.
- Record: time from kill to client reconnect; whether data gaps occurred.
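The two values to record in this step can be derived from a timestamped log of which node served each sample. The Python sketch below assumes the log has already been reduced to (seconds, node name or None) pairs sorted by time; that log shape and both function names are hypothetical.

```python
def failover_seconds(samples, kill_time, backup="node-b"):
    """Seconds from kill_time to the first sample served by the backup node.

    samples: iterable of (t_seconds, node_name_or_None), sorted by time.
    Returns None if the backup never served a sample after the kill.
    """
    for t, node in samples:
        if t >= kill_time and node == backup:
            return t - kill_time
    return None


def dropped_samples(samples, kill_time):
    """Count samples after the kill for which no node delivered data."""
    return sum(1 for t, node in samples if t >= kill_time and node is None)
```

The measured reconnect time is only as precise as the sampling interval, so a 1 s poll cannot distinguish a 4.2 s failover from a 4.9 s one.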
Step 6 — Verify Primary recovery (B3 scenario)
- Restart Node A: sc start OtOpcUa on Node A host.
- Observe Node A ServiceLevel progression:
  - ~0 s: 1 (NoData) briefly while HostedServices start.
  - Startup: 180 (RecoveringPrimary); recovery dwell gate active.
  - After >= 60 s dwell + one positive publish witness: 255 (AuthoritativePrimary).
- Observe Node B:
  - Returns to 100 (AuthoritativeBackup) once it sees Node A peer probe succeed.
- Record dwell duration and whether the client (UaExpert/Kepware) switches back.
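The expected progression can be verified from polled ServiceLevel readings rather than by eye. The Python sketch below assumes (seconds, service level) pairs for Node A, sorted by time and starting at the restart. The function name and the tolerance parameter are illustrative, and the observed dwell is measured from the first 180 sample, which may lag the real start of the dwell by up to one polling interval.

```python
def check_recovery(samples, dwell=60.0, tolerance=0.0):
    """Check that Node A held 180 for at least the dwell before reaching 255.

    samples: (t_seconds, service_level) pairs, sorted by time.
    tolerance: slack in seconds to absorb the polling interval.
    Returns (ok, message).
    """
    t_recovering = next((t for t, lvl in samples if lvl == 180), None)
    t_primary = next((t for t, lvl in samples if lvl == 255), None)
    if t_recovering is None or t_primary is None:
        return False, "missing 180 or 255 sample"
    if t_primary < t_recovering:
        return False, "255 seen before 180"
    held = t_primary - t_recovering
    if held + tolerance < dwell:
        return False, f"dwell too short: {held:.0f}s"
    return True, f"dwell observed: {held:.0f}s"
```

The message from a passing run can be pasted straight into the B3 row of the results block.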
Step 7 — Execute mid-apply dip (A6 scenario)
- Via Admin UI, create a trivial draft change and publish.
- Watch Node A ServiceLevel during apply.
- Expected: drops to 200 (PrimaryMidApply) for the apply duration (typically seconds); returns to 255 when GenerationRefreshHostedService releases the lease.
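The mid-apply duration for the A6 row can be extracted from the same kind of polled readings. A minimal Python sketch, assuming (seconds, service level) pairs sorted by time; the function name is illustrative.

```python
def dip_windows(samples, dip_level=200):
    """Return (start, end) windows during which ServiceLevel read dip_level.

    A window ends at the first sample that no longer reads dip_level. If the
    capture stops mid-dip, the window ends at the last dip sample seen.
    """
    windows, start, last = [], None, None
    for t, lvl in samples:
        if lvl == dip_level:
            if start is None:
                start = t
            last = t
        elif start is not None:
            windows.append((start, t))
            start = None
    if start is not None:
        windows.append((start, last))  # dip still open at end of capture
    return windows
```

Exactly one window is expected per publish; an empty result suggests either the dip was shorter than the polling interval or the lease was never taken.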
Step 8 — Record results
Copy the following block into a tracking doc:
Run date: YYYY-MM-DD
Release SHA: <git sha>
Cluster: <cluster-id> Primary: node-a Backup: node-b
Config DB: 10.100.0.35,14330
A1: [PASS/FAIL] evidence: <screenshot or CLI output>
A2: [PASS/FAIL]
A3: [PASS/FAIL] time-to-IsolatedPrimary: <N>s
A4: [PASS/FAIL]
A5: [PASS/FAIL/DEFERRED - ServerUriArray upgrade pending]
A6: [PASS/FAIL] mid-apply duration: <N>s
A7: [PASS/FAIL] CLI output attached
A8: [PASS/FAIL] CLI reconnect observed
B1: [PASS/FAIL]
B2: [PASS/FAIL] reconnect time: <N>s
B3: [PASS/FAIL] dwell observed: <N>s
B4: [PASS/FAIL] (Kepware)
B5: [PASS/FAIL] (OI Gateway — if available)
C1: [PASS/FAIL/SKIP - Galaxy not available]
C2: [PASS/FAIL/SKIP]
C3: [PASS/FAIL/SKIP]
One pass of every non-SKIP row is the v2 GA acceptance criterion.
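The acceptance rule can be applied to a filled-in results block with a short script. The Python sketch below assumes each filled row carries a single status token after the row ID, optionally in brackets, and treats an unfilled template row such as "[PASS/FAIL]" as missing. How a DEFERRED A5 row should count is a judgment call this plan does not settle, so it is exposed as a parameter; the names here are illustrative.

```python
import re

# Row IDs from the results template: A1-A8, B1-B5, C1-C3.
EXPECTED = (
    [f"A{i}" for i in range(1, 9)]
    + [f"B{i}" for i in range(1, 6)]
    + [f"C{i}" for i in range(1, 4)]
)

# Matches "A1: PASS" or "A5: [DEFERRED - ...]" but not the unfilled "[PASS/FAIL]".
ROW = re.compile(r"^\s*([ABC]\d+):\s*\[?(PASS|FAIL|SKIP|DEFERRED)(?=[\]\s]|$)")


def acceptance(block: str, allow_deferred: bool = False):
    """Return (accepted, blocking_rows) for a filled-in results block."""
    results = {}
    for line in block.splitlines():
        m = ROW.match(line)
        if m:
            results[m.group(1)] = m.group(2)
    ok_states = {"PASS", "SKIP"} | ({"DEFERRED"} if allow_deferred else set())
    # Rows that are missing, unfilled, FAIL, or (by default) DEFERRED block acceptance.
    blocking = [row for row in EXPECTED if results.get(row) not in ok_states]
    return (not blocking), blocking
```

With the default setting, a DEFERRED A5 row blocks acceptance, which follows the literal wording of the criterion.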
Known limitations
A5 — ServerUriArray node not yet writable
The OPC UA .NET Standard SDK's default Server.ServerRedundancy object is the
base ServerRedundancyState, which has no ServerUriArray child node.
ServerRedundancyNodeWriter.ApplyServerUriArray currently logs a warning and
skips. The operator obtains ServerUriArray by reading ClusterNode rows
directly until the non-transparent redundancy-type upgrade follow-up ships.
Recovery dwell is 60 s by default
RecoveryStateManager.DwellTime defaults to TimeSpan.FromSeconds(60) in
Program.cs. Step 6 of the runbook will block for at least 60 s waiting for
Node A to return to AuthoritativePrimary. This is intentional per
decision #154 (thrash prevention) — do not lower it for the test run.
IsolatedBackup (80) does not auto-promote
Per decision #154, the Backup at band 80 does not self-elevate. If the operator
needs authoritative service from Node B while Node A is down, they must write
RedundancyRole=Primary on the ClusterNode row for Node B and publish a
draft generation. The Admin UI RedundancyTab exposes this flow.
Dependency on existing tests
The cutover runbook validates the end-to-end wire path. The math and edge cases
are already locked by the unit/integration tests enumerated in the first section.
A failing runbook step that contradicts a passing unit test indicates a
deployment configuration error or an SDK version mismatch — not a logic bug.
Check PeerHttpProbeLoop logs first (look for PeerProbe Serilog events).