lmxopcua/docs/plans/phase-6-3-redundancy-interop-plan.md
Joseph Doherty 16a87b08f3 docs: add four planning runbooks for Phase 6.3 interop, v2 GA gates, live-hardware validation, and alarms worker wiring
Produces docs/plans/ entries for tasks #13, #15, #16, and #17-#20:
- phase-6-3-redundancy-interop-plan.md: automation boundary analysis,
  concrete test matrix (A/B/C blocks), and a step-by-step cutover
  runbook for the deferred Stream F client interop work
- v2-ga-lab-gates-plan.md: exact gate list with command, pass criterion,
  and owner for each of the nine v2 GA exit criteria
- live-hardware-validation-runbooks.md: one runbook per driver (FOCAS
  CNC smoke #54, AB CIP live-boot, TwinCAT wire-live) with preconditions,
  procedure, expected results, and recording template
- alarms-worker-wiring-plan.md: focused plan for A.2/A.3-A.4/C.1/D.1
  worker wiring in the mxaccessgw sibling repo, documenting the
  discovered AVEVA API surface, the architectural decision that blocks
  A.2, the dependency order, and what each item needs to unblock

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 04:53:36 -04:00


Phase 6.3 Redundancy — Client Interop Matrix and Cutover Validation Plan

Scope: Phase 6.3 redundancy runtime core shipped (PRs #89-90, #98-99, #24-peerprobe, Stream C node wiring, Stream D lease wrap). What remains is Stream F (task #150): validating that third-party OPC UA clients honour our ServiceLevel / ServerUriArray / RedundancySupport signals and fail over correctly when the Primary drops. This document defines what is automatable as integration tests, what requires two live instances plus a real client, and a step-by-step cutover-validation runbook.

Source of truth: docs/Redundancy.md, docs/v2/redundancy-interop-playbook.md, docs/v2/implementation/phase-6-3-redundancy-runtime.md, scripts/compliance/phase-6-3-compliance.ps1.

What is already tested (no live cluster needed)

The following are covered by existing automated tests that run in ordinary dotnet test:

| Area | Test class(es) | What it asserts |
|---|---|---|
| ServiceLevelCalculator (8-state matrix) | ServiceLevelCalculatorTests | All 10 band values; role × self-health × peer-http × peer-ua × apply × recovery × topology combinations |
| RecoveryStateManager (dwell + witness) | RecoveryStateManagerTests | 60 s dwell default; premature-exit rejection; witness-required gate |
| ApplyLeaseRegistry (lease lifecycle) | ApplyLeaseRegistryTests | Disposal on success / exception / cancellation; watchdog force-close at 10 min |
| ServerRedundancyNodeWriter (OPC UA variable binding) | ServerRedundancyNodeWriterTests | ServiceLevel byte push; RedundancySupport enum; ServerUriArray skip-log when node absent |
| RedundancyStatePublisher (orchestration) | RedundancyStatePublisherTests | Edge-triggered OnStateChanged; idempotent dedup |
| ClusterTopologyLoader | ClusterTopologyLoaderTests | Two-node seed; one-node degenerate; duplicate-URI rejection |
| DraftValidator.ValidateClusterTopology | DraftValidatorTests (8 cases) | NodeCount/mode pairs; Enabled-count vs NodeCount; multiple-Primary rejection |

Run with:

dotnet test ZB.MOM.WW.OtOpcUa.slnx --filter "FullyQualifiedName~Redundancy"

Compliance gate (every Phase 6.3 static check):

pwsh ./scripts/compliance/phase-6-3-compliance.ps1

Pass criteria: exit 0; all [PASS] lines green; [DEFERRED] lines are known-deferred surfaces, not failures.
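The pass criteria above can be checked mechanically when the gate runs unattended. The following is a minimal Python sketch, not part of the repo. It assumes the script prints one bracketed tag at the start of each result line; only [PASS] and [DEFERRED] are named in this plan, so the [FAIL] tag, the helper name summarize_gate, and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed line shape: "[PASS] some check name". The [FAIL] tag is a guess;
# only [PASS] and [DEFERRED] are named in this plan.
TAG = re.compile(r"^\s*\[(PASS|FAIL|DEFERRED)\]")

def summarize_gate(exit_code: int, output: str) -> tuple[bool, Counter]:
    """Return (gate_ok, tag_counts) for one compliance-script run."""
    counts = Counter()
    for line in output.splitlines():
        match = TAG.match(line)
        if match:
            counts[match.group(1)] += 1
    # The gate passes on exit 0 with no failures; deferred lines do not fail it.
    gate_ok = exit_code == 0 and counts["FAIL"] == 0
    return gate_ok, counts

# Invented sample output, only to show the call shape.
sample = "[PASS] lease registry present\n[DEFERRED] ServerUriArray upgrade\n"
print(summarize_gate(0, sample))
```

Feeding it the captured stdout and exit code of the pwsh run gives a single boolean for a CI step, while the counts keep the deferred surfaces visible rather than silently ignored.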

What cannot be automated — requires two live instances

The scenarios below require two running OtOpcUa.Server processes in the same ServerCluster, a real SQL Server Config DB, and at least one driver instance with a reachable endpoint (simulator or real PLC).

Why it cannot be unit/integration-tested in-process

  • UaExpert, Kepware KEPServerEX, and AVEVA OI Gateway are closed-source Windows GUI binaries with no headless CLI for the subscribe/browse flows.
  • The AVEVA MXAccess failover leg (IAlarmSource reconnect, $MxAccessClient quality transition) involves the Galaxy runtime's own client-redundancy policy and the COM-layer session model — both live outside this repo.
  • Even the automatable sub-set (our own otopcua-cli as the client) needs two distinct listening TCP endpoints; that requires two live processes, which is out of scope for dotnet test integration fixtures.

Test matrix

Prerequisites

  1. Two OtOpcUa.Server processes on separate Windows hosts (or separate ports on the same host for dev) sharing one Config DB (ServerCluster with NodeCount=2, RedundancyMode=Warm or Hot).
  2. Each node registered in ClusterNode:
    • Node A: RedundancyRole=Primary, ServiceLevelBase=255, ApplicationUri=urn:node-a:OtOpcUa
    • Node B: RedundancyRole=Secondary, ServiceLevelBase=100, ApplicationUri=urn:node-b:OtOpcUa
  3. PeerHttpProbeLoop and PeerUaProbeLoop HostedServices running on both nodes (registered via AddHostedService<PeerHttpProbeLoop> + AddHostedService<PeerUaProbeLoop> in Program.cs).
  4. At least one DriverInstance in the cluster with a reachable PLC or simulator (e.g. Modbus sim at 10.100.0.35:5020).
  5. Client machine with UaExpert >= 1.7 installed.
  6. Optional second client: Kepware KEPServerEX 6.x QuickClient or AVEVA OI Gateway 2020R2+.
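Before starting a run it can save time to confirm that the endpoints in the prerequisites accept TCP connections at all. The sketch below is a hedged pre-flight helper, not part of the repo: the function names and the node host names are placeholders, port 4840 is taken from the endpoint URLs used later in this plan, and the Modbus simulator address is the example from prerequisite 4. A successful TCP connect only shows that something is listening; it does not prove the OPC UA stack or the driver is healthy.

```python
import socket

def tcp_reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False

# Endpoints to probe before a run. The node host names are placeholders;
# the Modbus simulator address is the example given in prerequisite 4.
TARGETS = [
    ("node-a-host", 4840),
    ("node-b-host", 4840),
    ("10.100.0.35", 5020),
]

def unreachable(targets, timeout_s: float = 3.0):
    """Return the (host, port) pairs that refused or timed out."""
    return [t for t in targets if not tcp_reachable(*t, timeout_s=timeout_s)]
```

Calling unreachable(TARGETS) from the client machine returns an empty list when every endpoint answers; any entry in the result is worth fixing before spending time in UaExpert.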

Block A — OPC UA protocol signals (UaExpert, no failover yet)

| ID | Scenario | Procedure | Pass criterion | Automatable? |
|---|---|---|---|---|
| A1 | ServiceLevel published on Primary | Connect UaExpert to Node A. Browse Server/ServiceLevel (i=2267). | Value = 255 (AuthoritativePrimary) | No (requires UaExpert GUI) |
| A2 | ServiceLevel published on Backup | Connect UaExpert to Node B. Read same node. | Value = 100 (AuthoritativeBackup) | No |
| A3 | ServiceLevel updates when peer drops | Node A connected. Stop Node B (sc stop OtOpcUa). Watch ServiceLevel on Node A. | Transitions 255 → 230 (IsolatedPrimary) within ~6 s (3 × 2 s HTTP probe interval) | No |
| A4 | RedundancySupport | Browse Server/ServerRedundancy/RedundancySupport on either node. | Value = Warm or Hot matching the cluster RedundancyMode | No |
| A5 | ServerUriArray | Browse Server/ServerRedundancy/ServerUriArray on either node. | Array contains both ApplicationUri values; self listed first. Note: requires non-transparent redundancy-type upgrade (currently logs-and-skips; see known limitation A5 below). | No |
| A6 | Mid-apply ServiceLevel dip | Trigger a sp_PublishGeneration apply (via Admin UI draft → publish) while watching Node A ServiceLevel. | Drops to 200 (PrimaryMidApply) for the apply duration; returns to 255 after RefreshAsync. | No |
| A7 | Client.CLI reads correct ServiceLevel | dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read -u opc.tcp://<node-a>:4840 -n "i=2267" | Prints current byte value matching expected band. | Yes (scriptable with the Client CLI) |
| A8 | otopcua-cli failover reconnect | dotnet run ... -- connect -u opc.tcp://<node-a>:4840 -F opc.tcp://<node-b>:4840, then kill Node A. | CLI session reconnects to Node B within the session keep-alive timeout. | Yes (scriptable with the Client CLI) |
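When recording Block A results it helps to translate the raw ServiceLevel byte into the state name this plan uses. The lookup below is an illustrative Python sketch: it lists only the seven bands named in this document (the calculator tests mention ten band values in total, so the table is deliberately incomplete), and the helper name describe is hypothetical.

```python
# ServiceLevel bands named in this plan (byte value -> state name).
# The calculator defines more bands than these; unlisted values are reported as such.
BANDS = {
    255: "AuthoritativePrimary",
    230: "IsolatedPrimary",
    200: "PrimaryMidApply",
    180: "RecoveringPrimary",
    100: "AuthoritativeBackup",
    80: "IsolatedBackup",
    1: "NoData",
}

def describe(value: int) -> str:
    """Map a ServiceLevel byte to the state name used in this plan."""
    if not 0 <= value <= 255:
        raise ValueError(f"ServiceLevel is a byte, got {value}")
    return BANDS.get(value, "unlisted band")

for observed in (255, 230, 100):
    print(observed, describe(observed))
```

Keeping the mapping in one place means a scripted A7 check and a hand-written A1 to A6 record use the same names for the same numbers.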

Block B — Third-party client failover

| ID | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| B1 | UaExpert picks Primary by ServiceLevel | Configure a Redundancy Group in UaExpert with both endpoint URLs. | Client connects to Node A (higher ServiceLevel) |
| B2 | UaExpert cuts over on Primary kill | Kill Node A OtOpcUa service. | Client session reconnects to Node B within UaExpert's reconnect timeout (default 5 s). Data-change monitored items resume. |
| B3 | UaExpert returns when Primary restores | Start Node A. Wait >= 60 s recovery dwell. | ServiceLevel on Node A progresses: 180 (RecoveringPrimary) → 255 (AuthoritativePrimary). UaExpert may or may not switch back (client-policy-dependent; both outcomes accepted). |
| B4 | Kepware QuickClient failover | Repeat B1-B3 with Kepware configured for the same two endpoints. | Same pass criteria; confirms the behaviour is not UaExpert-specific. |
| B5 | AVEVA OI Gateway | Configure OI Gateway OPC DA/UA client object against the cluster. Kill Primary. | OI Gateway data quality recovers within ReconnectInterval (default 20 s); no permanent data-loss alert. |

Block C — Galaxy MXAccess failover

This block requires a running Galaxy and $MxAccessClient object (AVEVA System Platform installed, Galaxy deployed on dev box — see project memory project_aveva_platform_installed.md).

| ID | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| C1 | Galaxy binds to Primary on first connect | Bring cluster up. Start a Galaxy $MxAccessClient with both node URLs configured. | Galaxy reports QUALITY = Good; initial values stream from Node A. |
| C2 | Galaxy redirects on Primary drop | Stop Node A. | Galaxy QUALITY briefly goes Uncertain, then returns to Good; values continue streaming from Node B within MXAccess's ReconnectInterval (default 20 s). |
| C3 | Galaxy tolerates mid-apply dip | Trigger generation apply on Node A. | Galaxy remains bound: the mid-apply dip (200) is advisory, not a session drop. No quality interruption. |

Note: A negative result on C1-C3 does not necessarily indicate an OtOpcUa defect. Cross-check with Blocks A and B first to confirm that our ServiceLevel signal is correct before debugging the MXAccess client layer.

Step-by-step cutover-validation runbook

This is the minimum procedure to satisfy the v2 GA exit criterion: "Non-transparent redundancy cutover validated with at least one production client (Ignition 8.3 recommended — see decision #85)."

Step 1 — Provision the cluster

# On the Config DB host, seed or verify cluster rows:
# ServerCluster: Id=<id>, Name="test-cluster", NodeCount=2, RedundancyMode=Warm
# ClusterNode A: NodeId="node-a", ClusterId=<id>, RedundancyRole=Primary,
#   ServiceLevelBase=255, ApplicationUri="urn:node-a:OtOpcUa"
# ClusterNode B: NodeId="node-b", ClusterId=<id>, RedundancyRole=Secondary,
#   ServiceLevelBase=100, ApplicationUri="urn:node-b:OtOpcUa"

Verify uniqueness constraint: no two ClusterNode rows share the same ApplicationUri (unique index on ApplicationUri).
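The uniqueness check can also be run against an export of the ClusterNode rows before they are seeded. This is a minimal Python sketch under stated assumptions: the rows are represented as plain dictionaries keyed by the column names used above, the helper name is hypothetical, and URIs are compared as exact strings (the database collation may be case-insensitive, which this sketch does not model).

```python
from collections import Counter

def duplicate_application_uris(nodes: list[dict]) -> list[str]:
    """Return every ApplicationUri that appears on more than one row."""
    counts = Counter(node["ApplicationUri"] for node in nodes)
    return sorted(uri for uri, n in counts.items() if n > 1)

# The two rows from Step 1, written as dictionaries for illustration.
nodes = [
    {"NodeId": "node-a", "ApplicationUri": "urn:node-a:OtOpcUa"},
    {"NodeId": "node-b", "ApplicationUri": "urn:node-b:OtOpcUa"},
]
print(duplicate_application_uris(nodes))
```

An empty list means the seed data will not trip the unique index; any URI in the result identifies the rows to fix before the insert is attempted.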

Step 2 — Start both server instances

On Node A host:

# appsettings.json: Node:NodeId = "node-a"
# Call sc.exe explicitly: in Windows PowerShell 5.1 a bare "sc" is an alias for Set-Content.
sc.exe start OtOpcUa

On Node B host:

# appsettings.json: Node:NodeId = "node-b"
sc.exe start OtOpcUa

Wait 10 s for the HostedServices on both nodes to complete their first probe cycle.

Step 3 — Verify baseline ServiceLevel via Client CLI

# Node A should report 255
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
    -u opc.tcp://<node-a-host>:4840 -n "i=2267"

# Node B should report 100
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
    -u opc.tcp://<node-b-host>:4840 -n "i=2267"

Pass: Node A = 255, Node B = 100.
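The two reads above are the scriptable part of the baseline check. The sketch below only assembles the command lines shown in this step so they can be handed to a process runner; it does not parse the result, because the CLI's output format is not described in this plan. The function name and the host placeholders are assumptions for illustration.

```python
import shlex

CLI_PROJECT = "src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI"

def read_command(endpoint: str, node_id: str) -> list[str]:
    """Build the argv for one Client CLI read, as used in Step 3."""
    return ["dotnet", "run", "--project", CLI_PROJECT, "--",
            "read", "-u", endpoint, "-n", node_id]

# Host names are placeholders; the expected values come from the pass criterion.
EXPECTED = {"node-a-host": 255, "node-b-host": 100}

for host, level in EXPECTED.items():
    argv = read_command(f"opc.tcp://{host}:4840", "i=2267")
    print(f"expect {level}: {shlex.join(argv)}")
```

Passing the list form to subprocess.run avoids shell quoting problems with the node identifier; comparing the printed value against the expected level is left to whoever knows the CLI's output shape.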

Step 4 — Verify ServerUriArray

dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
    -u opc.tcp://<node-a-host>:4840 -n "i=2271"

Pass: the returned array contains both ApplicationUri strings. If the ServerUriArray node returns an empty value or an error, the non-transparent redundancy-type upgrade follow-up is still pending (known limitation: ServerRedundancyNodeWriter.ApplyServerUriArray logs-and-skips on the base ServerRedundancyState object type).

Step 5 — Execute Primary kill + failover (B2 scenario)

  1. Connect UaExpert (or Kepware) Redundancy Group to both endpoints.
  2. Confirm client is subscribed to at least one variable node.
  3. Kill Node A: sc stop OtOpcUa on Node A host.
  4. Observe:
    • Node B ServiceLevel should transition: 100 (AuthoritativeBackup) → 80 (IsolatedBackup) within ~6 s.
    • Client should reconnect to Node B and resume data-change events.
  5. Record: time from kill to client reconnect; whether data gaps occurred.
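The two numbers recorded in this step are simple arithmetic on timestamps. The Python sketch below shows one way to compute them. It reads the "3 × 2 s" figure from Block A as three consecutive probes at a 2 s interval, which is an interpretation rather than something this plan states outright; the function names and the example timestamps are invented.

```python
from datetime import datetime

def detection_window_s(probe_interval_s: float = 2.0, probes: int = 3) -> float:
    """Upper bound on peer-loss detection, read from '3 x 2 s' in Block A."""
    return probe_interval_s * probes

def elapsed_s(kill_ts: str, reconnect_ts: str) -> float:
    """Seconds between two ISO-8601 timestamps noted by the operator."""
    start = datetime.fromisoformat(kill_ts)
    end = datetime.fromisoformat(reconnect_ts)
    if end < start:
        raise ValueError("reconnect timestamp is earlier than the kill timestamp")
    return (end - start).total_seconds()

# Invented timestamps, only to show the call shape.
print(detection_window_s())
print(elapsed_s("2026-05-18T10:00:00", "2026-05-18T10:00:04"))
```

Recording both values side by side makes it easy to see whether a slow client reconnect was caused by late detection on Node B or by the client's own reconnect policy.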

Step 6 — Verify Primary recovery (B3 scenario)

  1. Restart Node A: sc start OtOpcUa on Node A host.
  2. Observe Node A ServiceLevel progression:
    • ~0 s: 1 (NoData) briefly while HostedServices start.
    • Startup: 180 (RecoveringPrimary) — recovery dwell gate active.
    • After >= 60 s dwell + one positive publish witness: 255 (AuthoritativePrimary).
  3. Observe Node B:
    • Returns to 100 (AuthoritativeBackup) once it sees Node A peer probe succeed.
  4. Record dwell duration and whether the client (UaExpert/Kepware) switches back.
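If Node A's ServiceLevel is polled during recovery, the dwell can be derived from the samples instead of a stopwatch. The following is a hedged Python sketch: it assumes samples are collected as time-ordered (seconds since restart, ServiceLevel) pairs, and the function name and sample data are invented. Because polling only sees a transition at the next sample, the measured dwell can differ from the true dwell by up to one polling interval in either direction.

```python
RECOVERING, PRIMARY = 180, 255

def observed_dwell_s(samples: list[tuple[float, int]]) -> float:
    """Seconds between the first 180 sample and the first 255 sample."""
    first_recovering = next((t for t, level in samples if level == RECOVERING), None)
    if first_recovering is None:
        raise ValueError("180 (RecoveringPrimary) was never observed")
    if any(level == PRIMARY and t < first_recovering for t, level in samples):
        raise ValueError("255 observed before 180; the dwell gate was not seen")
    first_primary = next((t for t, level in samples if level == PRIMARY), None)
    if first_primary is None:
        raise ValueError("node never returned to 255 (AuthoritativePrimary)")
    return first_primary - first_recovering

# Invented samples polled every 5 s after restarting Node A.
samples = [(0, 1), (5, 180), (35, 180), (65, 180), (70, 255)]
dwell = observed_dwell_s(samples)
print(dwell, dwell >= 60)
```

A result well under 60 s, beyond what the polling interval can explain, is worth raising against the dwell default rather than recording as a pass.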

Step 7 — Execute mid-apply dip (A6 scenario)

  1. Via Admin UI, create a trivial draft change and publish.
  2. Watch Node A ServiceLevel during apply.
  3. Expected: drops to 200 (PrimaryMidApply) for the apply duration (typically seconds); returns to 255 when GenerationRefreshHostedService releases the lease.

Step 8 — Record results

Copy the following block into a tracking doc:

Run date: YYYY-MM-DD
Release SHA: <git sha>
Cluster: <cluster-id>  Primary: node-a  Backup: node-b
Config DB: 10.100.0.35,14330

A1: [PASS/FAIL]  evidence: <screenshot or CLI output>
A2: [PASS/FAIL]
A3: [PASS/FAIL]  time-to-IsolatedPrimary: <N>s
A4: [PASS/FAIL]
A5: [PASS/FAIL/DEFERRED - ServerUriArray upgrade pending]
A6: [PASS/FAIL]  mid-apply duration: <N>s
A7: [PASS/FAIL]  CLI output attached
A8: [PASS/FAIL]  CLI reconnect observed
B1: [PASS/FAIL]
B2: [PASS/FAIL]  reconnect time: <N>s
B3: [PASS/FAIL]  dwell observed: <N>s
B4: [PASS/FAIL]  (Kepware)
B5: [PASS/FAIL]  (OI Gateway — if available)
C1: [PASS/FAIL/SKIP - Galaxy not available]
C2: [PASS/FAIL/SKIP]
C3: [PASS/FAIL/SKIP]

One pass of every non-SKIP row is the v2 GA acceptance criterion.
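The acceptance rule can be applied to a filled-in copy of the block with a few lines of Python. This is an illustrative sketch, not a repo tool: the function names are hypothetical, and it treats any row that is neither PASS nor SKIP (including DEFERRED and unfilled template rows such as [PASS/FAIL]) as open. Treating DEFERRED as open is a conservative assumption; this plan does not say whether a deferred A5 blocks acceptance.

```python
import re

# Matches rows such as "B2: [PASS]  reconnect time: 4s"
# or "C1: [SKIP - Galaxy not available]".
ROW = re.compile(r"^([ABC]\d+):\s*\[([^\]]*)\]")

def triage(block: str) -> dict[str, list[str]]:
    """Sort result rows into passed / skipped / open (everything else)."""
    buckets = {"passed": [], "skipped": [], "open": []}
    for line in block.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        row_id, verdict = match.groups()
        words = verdict.split()
        status = words[0] if words else ""
        if "/" in status:
            buckets["open"].append(row_id)      # template row left unfilled
        elif status == "PASS":
            buckets["passed"].append(row_id)
        elif status == "SKIP":
            buckets["skipped"].append(row_id)
        else:
            buckets["open"].append(row_id)      # FAIL, DEFERRED, or unrecognised
    return buckets

def accepted(block: str) -> bool:
    """True when at least one row passed and no row is left open."""
    buckets = triage(block)
    return bool(buckets["passed"]) and not buckets["open"]

print(triage("A1: [PASS]  evidence: cli.txt\nC1: [SKIP - Galaxy not available]"))
```

The open bucket doubles as a to-do list: every ID in it needs either a rerun or an explicit decision recorded next to the results.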

Known limitations

A5 — ServerUriArray node not yet writable

The OPC UA .NET Standard SDK's default Server.ServerRedundancy object is the base ServerRedundancyState, which has no ServerUriArray child node. ServerRedundancyNodeWriter.ApplyServerUriArray therefore logs a warning and skips the write. Until the non-transparent redundancy-type upgrade follow-up ships, operators must obtain the server URIs by reading the ClusterNode rows directly.

Recovery dwell is 60 s by default

RecoveryStateManager.DwellTime defaults to TimeSpan.FromSeconds(60) in Program.cs. Step 6 of the runbook will block for at least 60 s waiting for Node A to return to AuthoritativePrimary. This is intentional per decision #154 (thrash prevention) — do not lower it for the test run.

IsolatedBackup (80) does not auto-promote

Per decision #154, the Backup at band 80 does not self-elevate. If the operator needs authoritative service from Node B while Node A is down, they must write RedundancyRole=Primary on the ClusterNode row for Node B and publish a draft generation. The Admin UI RedundancyTab exposes this flow.

Dependency on existing tests

The cutover runbook validates the end-to-end wire path. The math and edge cases are already locked by the unit/integration tests enumerated in the first section. A failing runbook step that contradicts a passing unit test therefore most likely indicates a deployment configuration error or an SDK version mismatch rather than a logic bug. Check the PeerHttpProbeLoop logs first (look for PeerProbe Serilog events).