Joseph Doherty 16a87b08f3 docs: add four planning runbooks for Phase 6.3 interop, v2 GA gates, live-hardware validation, and alarms worker wiring

Produces docs/plans/ entries for tasks #13, #15, #16, and #17-#20:
- phase-6-3-redundancy-interop-plan.md: automation boundary analysis,
  concrete test matrix (A/B/C blocks), and a step-by-step cutover
  runbook for the deferred Stream F client interop work
- v2-ga-lab-gates-plan.md: exact gate list with command, pass criterion,
  and owner for each of the nine v2 GA exit criteria
- live-hardware-validation-runbooks.md: one runbook per driver (FOCAS
  CNC smoke #54, AB CIP live-boot, TwinCAT wire-live) with preconditions,
  procedure, expected results, and recording template
- alarms-worker-wiring-plan.md: focused plan for A.2/A.3-A.4/C.1/D.1
  worker wiring in the mxaccessgw sibling repo, documenting the
  discovered AVEVA API surface, the architectural decision that blocks
  A.2, the dependency order, and what each item needs to unblock

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 04:53:36 -04:00


Phase 6.3 Redundancy — Client Interop Matrix and Cutover Validation Plan

Scope: Phase 6.3 redundancy runtime core shipped (PRs #89-90, #98-99, #24-peerprobe, Stream C node wiring, Stream D lease wrap). What remains is Stream F (task #150): validating that third-party OPC UA clients honour our ServiceLevel / ServerUriArray / RedundancySupport signals and fail over correctly when the Primary drops. This document defines what is automatable as integration tests, what requires two live instances plus a real client, and a step-by-step cutover-validation runbook.

Source of truth: docs/Redundancy.md, docs/v2/redundancy-interop-playbook.md, docs/v2/implementation/phase-6-3-redundancy-runtime.md, scripts/compliance/phase-6-3-compliance.ps1.

What is already tested (no live cluster needed)

The following are covered by existing automated tests that run under a plain dotnet test invocation:

| Area | Test class(es) | What it asserts |
| --- | --- | --- |
| `ServiceLevelCalculator`: 8-state matrix | `ServiceLevelCalculatorTests` | All 10 band values; role × self-health × peer-http × peer-ua × apply × recovery × topology combinations |
| `RecoveryStateManager`: dwell + witness | `RecoveryStateManagerTests` | 60 s dwell default; premature-exit rejection; witness-required gate |
| `ApplyLeaseRegistry`: lease lifecycle | `ApplyLeaseRegistryTests` | Disposal on success / exception / cancellation; watchdog force-close at 10 min |
| `ServerRedundancyNodeWriter`: OPC UA variable binding | `ServerRedundancyNodeWriterTests` | `ServiceLevel` byte push; `RedundancySupport` enum; `ServerUriArray` skip-log when node absent |
| `RedundancyStatePublisher`: orchestration | `RedundancyStatePublisherTests` | Edge-triggered `OnStateChanged`; idempotent dedup |
| `ClusterTopologyLoader` | `ClusterTopologyLoaderTests` | Two-node seed; one-node degenerate; duplicate-URI rejection |
| `DraftValidator.ValidateClusterTopology` | `DraftValidatorTests` (8 cases) | NodeCount/mode pairs; Enabled-count vs NodeCount; multiple-Primary rejection |

Run with:

dotnet test ZB.MOM.WW.OtOpcUa.slnx --filter "FullyQualifiedName~Redundancy"

Compliance gate (every Phase 6.3 static check):

pwsh ./scripts/compliance/phase-6-3-compliance.ps1

Pass criteria: exit 0; all [PASS] lines green; [DEFERRED] lines are known-deferred surfaces, not failures.
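The pass rule above can be sketched as a small checker for captured script output. This is a minimal sketch, assuming the script prints one result per line with a leading bracketed tag; the `[FAIL]` tag, the sample check descriptions, and the helper names `summarize` and `gate_passed` are assumptions for illustration, not taken from the script.

```python
import re
from collections import Counter

# Assumed line format: "[PASS] <check description>".
# [PASS] and [DEFERRED] are named in this plan; [FAIL] is an assumed failure tag.
TAG_RE = re.compile(r"^\s*\[(PASS|FAIL|DEFERRED)\]")


def summarize(output: str) -> Counter:
    """Count the bracketed result tags that start each output line."""
    counts = Counter()
    for line in output.splitlines():
        match = TAG_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def gate_passed(exit_code: int, output: str) -> bool:
    """Exit code 0 and no failing lines; deferred lines do not fail the gate."""
    return exit_code == 0 and summarize(output)["FAIL"] == 0


# Illustrative output only; the descriptions are made up.
sample = "[PASS] example static check\n[DEFERRED] example deferred surface\n"
print(gate_passed(0, sample))
```

Counting tags rather than grepping for a single word keeps `[DEFERRED]` lines visible in the summary without letting them fail the gate.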

What cannot be automated — requires two live instances

The scenarios below require two running OtOpcUa.Server processes in the same ServerCluster, a real SQL Server Config DB, and at least one driver instance with a reachable endpoint (simulator or real PLC).

Why it cannot be unit/integration-tested in-process

- UaExpert, Kepware KEPServerEX, and AVEVA OI Gateway are closed-source Windows GUI binaries with no headless CLI interface for the subscribe/browse flows.
- The AVEVA MXAccess failover leg (IAlarmSource reconnect, $MxAccessClient quality transition) involves the Galaxy runtime's own client-redundancy policy and the COM-layer session model, both of which live outside this repo.
- Even the automatable subset (our own otopcua-cli as the client) needs two distinct listening TCP endpoints. That requires two live processes, which is out of scope for dotnet test integration fixtures.

Test matrix

Prerequisites

- Two OtOpcUa.Server processes on separate Windows hosts (or separate ports on the same host for dev) sharing one Config DB (ServerCluster with NodeCount=2, RedundancyMode=Warm or Hot).
- Each node registered in ClusterNode:
  - Node A: RedundancyRole=Primary, ServiceLevelBase=255, ApplicationUri=urn:node-a:OtOpcUa
  - Node B: RedundancyRole=Secondary, ServiceLevelBase=100, ApplicationUri=urn:node-b:OtOpcUa
- PeerHttpProbeLoop and PeerUaProbeLoop HostedServices running on both nodes (registered via AddHostedService<PeerHttpProbeLoop> + AddHostedService<PeerUaProbeLoop> in Program.cs).
- At least one DriverInstance in the cluster with a reachable PLC or simulator (e.g. Modbus sim at 10.100.0.35:5020).
- Client machine with UaExpert >= 1.7 installed.
- Optional second client: Kepware KEPServerEX 6.x QuickClient or AVEVA OI Gateway 2020R2+.

Block A — OPC UA protocol signals (UaExpert, no failover yet)

| ID | Scenario | Procedure | Pass criterion | Automatable? |
| --- | --- | --- | --- | --- |
| A1 | ServiceLevel published on Primary | Connect UaExpert to Node A. Browse `Server/ServiceLevel` (i=2267). | Value = 255 (`AuthoritativePrimary`) | No (requires UaExpert GUI) |
| A2 | ServiceLevel published on Backup | Connect UaExpert to Node B. Read the same node. | Value = 100 (`AuthoritativeBackup`) | No |
| A3 | ServiceLevel updates when peer drops | Node A connected. Stop Node B (`sc stop OtOpcUa`). Watch `ServiceLevel` on Node A. | Transitions 255 → 230 (`IsolatedPrimary`) within ~6 s (3 × 2 s HTTP probe interval) | No |
| A4 | RedundancySupport | Browse `Server/ServerRedundancy/RedundancySupport` on either node. | Value = `Warm` or `Hot` matching the cluster `RedundancyMode` | No |
| A5 | ServerUriArray | Browse `Server/ServerRedundancy/ServerUriArray` on either node. | Array contains both `ApplicationUri` values; self listed first. Note: requires the non-transparent redundancy-type upgrade (currently logs and skips; see known limitation A5 below). | No |
| A6 | Mid-apply ServiceLevel dip | Trigger a `sp_PublishGeneration` apply (via Admin UI draft → publish) while watching Node A `ServiceLevel`. | Drops to 200 (`PrimaryMidApply`) for the apply duration; returns to 255 after `RefreshAsync`. | No |
| A7 | Client.CLI reads correct ServiceLevel | `dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read -u opc.tcp://<node-a>:4840 -n "i=2267"` | Prints current byte value matching expected band. | Yes (scriptable with the Client CLI) |
| A8 | otopcua-cli failover reconnect | `dotnet run ... -- connect -u opc.tcp://<node-a>:4840 -F opc.tcp://<node-b>:4840`, then kill Node A. | CLI session reconnects to Node B within the session keep-alive timeout. | Yes (scriptable with the Client CLI) |
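When scripting A7, it helps to translate the raw byte into the band name the pass criterion expects. The lookup below is a minimal sketch that lists only the seven bands this plan names; the unit tests reference 10 band values, so the real calculator defines more. The helper names `describe_band` and `matches_expected` are hypothetical.

```python
# Band values and names exactly as quoted in this plan.
# The real ServiceLevelCalculator covers more bands than are listed here.
BANDS = {
    255: "AuthoritativePrimary",
    230: "IsolatedPrimary",
    200: "PrimaryMidApply",
    180: "RecoveringPrimary",
    100: "AuthoritativeBackup",
    80: "IsolatedBackup",
    1: "NoData",
}


def describe_band(value: int) -> str:
    """Return the band name for a ServiceLevel byte, or a marker for unlisted values."""
    if not 0 <= value <= 255:
        raise ValueError(f"ServiceLevel is a byte, got {value}")
    return BANDS.get(value, f"unlisted ({value})")


def matches_expected(value: int, expected_name: str) -> bool:
    """True when the byte read from the server maps to the expected band name."""
    return BANDS.get(value) == expected_name


print(describe_band(255))
print(matches_expected(100, "AuthoritativeBackup"))
```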

Block B — Third-party client failover

| ID | Scenario | Procedure | Pass criterion |
| --- | --- | --- | --- |
| B1 | UaExpert picks Primary by ServiceLevel | Configure a Redundancy Group in UaExpert with both endpoint URLs. | Client connects to Node A (higher ServiceLevel) |
| B2 | UaExpert cuts over on Primary kill | Kill Node A `OtOpcUa` service. | Client session reconnects to Node B within UaExpert's reconnect timeout (default 5 s). Data-change monitored items resume. |
| B3 | UaExpert returns when Primary restores | Start Node A. Wait >= 60 s recovery dwell. | `ServiceLevel` on Node A progresses: 180 (`RecoveringPrimary`) → 255 (`AuthoritativePrimary`). UaExpert may or may not switch back (client-policy-dependent; both outcomes accepted). |
| B4 | Kepware QuickClient failover | Repeat B1–B3 with Kepware configured for the same two endpoints. | Same pass criteria; confirms the failover behaviour is not UaExpert-specific. |
| B5 | AVEVA OI Gateway | Configure OI Gateway OPC DA/UA client object against the cluster. Kill Primary. | OI Gateway data quality recovers within `ReconnectInterval` (default 20 s); no permanent data-loss alert. |

Block C — Galaxy MXAccess failover

This block requires a running Galaxy and $MxAccessClient object (AVEVA System Platform installed, Galaxy deployed on dev box — see project memory project_aveva_platform_installed.md).

| ID | Scenario | Procedure | Pass criterion |
| --- | --- | --- | --- |
| C1 | Galaxy binds to Primary on first connect | Bring cluster up. Start a Galaxy `$MxAccessClient` with both node URLs configured. | Galaxy reports `QUALITY = Good`; initial values stream from Node A. |
| C2 | Galaxy redirects on Primary drop | Stop Node A. | Galaxy `QUALITY` briefly goes `Uncertain`, then returns to `Good`; values continue streaming from Node B within MXAccess's `ReconnectInterval` (default 20 s). |
| C3 | Galaxy tolerates mid-apply dip | Trigger generation apply on Node A. | Galaxy remains bound; the mid-apply dip (200) is advisory, not a session drop. No quality interruption. |

Note: A negative result on C1–C3 does not necessarily indicate an OtOpcUa defect. Cross-check with Block A / B first to confirm our ServiceLevel signal is correct before debugging the MXAccess client layer.

Step-by-step cutover-validation runbook

This is the minimum procedure to satisfy the v2 GA exit criterion: "Non-transparent redundancy cutover validated with at least one production client (Ignition 8.3 recommended — see decision #85)."

Step 1 — Provision the cluster

# On the Config DB host, seed or verify cluster rows:
# ServerCluster: Id=<id>, Name="test-cluster", NodeCount=2, RedundancyMode=Warm
# ClusterNode A: NodeId="node-a", ClusterId=<id>, RedundancyRole=Primary,
#   ServiceLevelBase=255, ApplicationUri="urn:node-a:OtOpcUa"
# ClusterNode B: NodeId="node-b", ClusterId=<id>, RedundancyRole=Secondary,
#   ServiceLevelBase=100, ApplicationUri="urn:node-b:OtOpcUa"

Verify uniqueness constraint: no two ClusterNode rows share the same ApplicationUri (unique index on ApplicationUri).
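Before starting the servers, the seeded rows can be sanity-checked offline. The sketch below mirrors three of the rejections the unit tests cover (duplicate URI, multiple Primary, node-count mismatch). It is an illustration only: `ClusterNodeRow` and `find_topology_problems` are hypothetical names, the Enabled column is omitted, and URIs are compared as exact strings, which is an assumption about how the unique index collates.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterNodeRow:
    """Hypothetical in-memory copy of one ClusterNode row."""
    node_id: str
    redundancy_role: str
    service_level_base: int
    application_uri: str


def find_topology_problems(rows, node_count):
    """Return human-readable problems; an empty list means the rows look consistent."""
    problems = []
    if len(rows) != node_count:
        problems.append(f"NodeCount={node_count} but {len(rows)} rows found")
    uri_counts = Counter(row.application_uri for row in rows)
    for uri, count in uri_counts.items():
        if count > 1:
            problems.append(f"duplicate ApplicationUri: {uri}")
    primaries = [row.node_id for row in rows if row.redundancy_role == "Primary"]
    if len(primaries) > 1:
        problems.append(f"multiple Primary nodes: {', '.join(primaries)}")
    return problems


nodes = [
    ClusterNodeRow("node-a", "Primary", 255, "urn:node-a:OtOpcUa"),
    ClusterNodeRow("node-b", "Secondary", 100, "urn:node-b:OtOpcUa"),
]
print(find_topology_problems(nodes, node_count=2))
```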

Step 2 — Start both server instances

On Node A host:

# appsettings.json: Node:NodeId = "node-a"
sc start OtOpcUa

On Node B host:

# appsettings.json: Node:NodeId = "node-b"
sc start OtOpcUa

Wait 10 s for the HostedServices to complete their first probe cycle.

Step 3 — Verify baseline ServiceLevel via Client CLI

# Node A should report 255
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
    -u opc.tcp://<node-a-host>:4840 -n "i=2267"

# Node B should report 100
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
    -u opc.tcp://<node-b-host>:4840 -n "i=2267"

Pass: Node A = 255, Node B = 100.

Step 4 — Verify ServerUriArray

# i=11314 is the standard NodeId of Server/ServerRedundancy/ServerUriArray
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
    -u opc.tcp://<node-a-host>:4840 -n "i=11314"

Pass: array returned contains both ApplicationUri strings. If ServerUriArray node returns empty or an error, the non-transparent redundancy-type upgrade follow-up is still pending (known limitation — ServerRedundancyNodeWriter.ApplyServerUriArray logs-and-skips on the base ServerRedundancyState object type).

Step 5 — Execute Primary kill + failover (B2 scenario)

1. Connect UaExpert (or Kepware) Redundancy Group to both endpoints.
2. Confirm the client is subscribed to at least one variable node.
3. Kill Node A: sc stop OtOpcUa on the Node A host.
4. Observe:
   - Node B ServiceLevel should transition: 100 (AuthoritativeBackup) → 80 (IsolatedBackup) within ~6 s.
   - Client should reconnect to Node B and resume data-change events.
5. Record: time from kill to client reconnect; whether data gaps occurred.
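The "time from kill" figure is easier to record consistently with a small polling helper than with a stopwatch. The sketch below is one possible shape: `read` is a hypothetical callable that returns the current ServiceLevel byte (for example, a wrapper around the Client CLI read), and the clock and sleep functions are injectable so the demo runs without waiting. The measured time is only accurate to within one poll interval.

```python
import time
from typing import Callable, Optional


def wait_for_value(
    read: Callable[[], int],
    expected: int,
    timeout_s: float = 30.0,
    poll_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll read() until it returns expected; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if read() == expected:
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_s)


class FakeClock:
    """Deterministic clock for the demo: time advances only when sleep() is called."""

    def __init__(self) -> None:
        self.now = 0.0

    def clock(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


# Simulated Node B readings: two polls at 100, then the drop to 80.
fake = FakeClock()
readings = iter([100, 100, 80])
elapsed = wait_for_value(
    lambda: next(readings), 80, timeout_s=10.0, poll_s=2.0,
    clock=fake.clock, sleep=fake.sleep,
)
print(elapsed)
```

Start the helper immediately after issuing the stop command so that the elapsed time approximates kill-to-transition.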

Step 6 — Verify Primary recovery (B3 scenario)

1. Restart Node A: sc start OtOpcUa on the Node A host.
2. Observe Node A ServiceLevel progression:
   - ~0 s: 1 (NoData) briefly while HostedServices start.
   - Startup: 180 (RecoveringPrimary) while the recovery dwell gate is active.
   - After >= 60 s dwell + one positive publish witness: 255 (AuthoritativePrimary).
3. Observe Node B:
   - Returns to 100 (AuthoritativeBackup) once it sees Node A peer probe succeed.
4. Record dwell duration and whether the client (UaExpert/Kepware) switches back.
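The exit condition for the recovering state combines two requirements: the dwell must have elapsed and a publish witness must have been seen. The function below is a simplified sketch of that gate only. It ignores the initial NoData phase and peer health, and `expected_recovering_band` is a hypothetical name, not the `RecoveryStateManager` API.

```python
DWELL_SECONDS = 60.0  # default dwell quoted in this plan


def expected_recovering_band(
    seconds_since_start: float,
    witness_seen: bool,
    dwell_s: float = DWELL_SECONDS,
) -> int:
    """Return 180 until BOTH the dwell has elapsed and a witness was seen, then 255."""
    if seconds_since_start >= dwell_s and witness_seen:
        return 255  # AuthoritativePrimary
    return 180  # RecoveringPrimary


# Neither condition alone is enough to leave the recovering band.
print(expected_recovering_band(59.0, witness_seen=True))
print(expected_recovering_band(75.0, witness_seen=False))
print(expected_recovering_band(60.0, witness_seen=True))
```

If Node A is still at 180 well past the dwell, the sketch suggests checking for a missing publish witness before suspecting the timer.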

Step 7 — Execute mid-apply dip (A6 scenario)

1. Via Admin UI, create a trivial draft change and publish.
2. Watch Node A ServiceLevel during apply.
3. Expected: drops to 200 (PrimaryMidApply) for the apply duration (typically seconds); returns to 255 when GenerationRefreshHostedService releases the lease.
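The dip-and-return behaviour follows from the lease being released on every exit path, including failures. The sketch below illustrates that pattern with a context manager for a healthy Primary only. `NodeState` and `apply_lease` are hypothetical names and do not reflect the actual `ApplyLeaseRegistry` implementation.

```python
from contextlib import contextmanager


class NodeState:
    """Toy model of a healthy Primary: 200 while any apply lease is open, else 255."""

    def __init__(self) -> None:
        self.active_leases = 0

    @property
    def service_level(self) -> int:
        return 200 if self.active_leases > 0 else 255


@contextmanager
def apply_lease(state: NodeState):
    """Hold a lease for the duration of the block; release it even if the block raises."""
    state.active_leases += 1
    try:
        yield state
    finally:
        state.active_leases -= 1


node = NodeState()
with apply_lease(node):
    print(node.service_level)  # lease held: mid-apply band
print(node.service_level)      # lease released: back to the Primary band

try:
    with apply_lease(node):
        raise RuntimeError("simulated apply failure")
except RuntimeError:
    pass
print(node.service_level)      # still released after the failure
```

A ServiceLevel that stays at 200 after the apply completes would point at a lease that was never released, which is the case the 10-minute watchdog exists to catch.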

Step 8 — Record results

Copy the following block into a tracking doc:

Run date: YYYY-MM-DD
Release SHA: <git sha>
Cluster: <cluster-id>  Primary: node-a  Backup: node-b
Config DB: 10.100.0.35,14330

A1: [PASS/FAIL]  evidence: <screenshot or CLI output>
A2: [PASS/FAIL]
A3: [PASS/FAIL]  time-to-IsolatedPrimary: <N>s
A4: [PASS/FAIL]
A5: [PASS/FAIL/DEFERRED - ServerUriArray upgrade pending]
A6: [PASS/FAIL]  mid-apply duration: <N>s
A7: [PASS/FAIL]  CLI output attached
A8: [PASS/FAIL]  CLI reconnect observed
B1: [PASS/FAIL]
B2: [PASS/FAIL]  reconnect time: <N>s
B3: [PASS/FAIL]  dwell observed: <N>s
B4: [PASS/FAIL]  (Kepware)
B5: [PASS/FAIL]  (OI Gateway — if available)
C1: [PASS/FAIL/SKIP - Galaxy not available]
C2: [PASS/FAIL/SKIP]
C3: [PASS/FAIL/SKIP]

One pass of every non-SKIP row is the v2 GA acceptance criterion.
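Once the block is filled in, the acceptance rule can be checked mechanically. The sketch below assumes each row has been edited down to a single status such as `[PASS]`, `[FAIL]`, `[SKIP - reason]`, or `[DEFERRED - reason]`. It reads the rule strictly, so `DEFERRED` and unfilled rows count as not passed, which is an interpretation rather than something the plan states. All helper names are hypothetical.

```python
import re

# Row ids from the template: A1-A8, B1-B5, C1-C3.
EXPECTED_IDS = (
    [f"A{i}" for i in range(1, 9)]
    + [f"B{i}" for i in range(1, 6)]
    + [f"C{i}" for i in range(1, 4)]
)

# Matches "A1: [PASS]" or "C1: [SKIP - reason]"; an unedited "[PASS/FAIL]" does not match.
ROW_RE = re.compile(r"^([ABC]\d):\s*\[(PASS|FAIL|SKIP|DEFERRED)(?:\s*-[^\]]*)?\]")


def parse_results(text: str) -> dict:
    """Map row id to status for every filled-in row in the results block."""
    results = {}
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            results[match.group(1)] = match.group(2)
    return results


def unmet_rows(results: dict) -> list:
    """Rows that are neither PASS nor SKIP (missing and DEFERRED rows included)."""
    return [rid for rid in EXPECTED_IDS if results.get(rid) not in ("PASS", "SKIP")]


def accepted(results: dict) -> bool:
    """True when every non-SKIP row is recorded as PASS."""
    return not unmet_rows(results)


filled = "\n".join(f"{rid}: [PASS]" for rid in EXPECTED_IDS[:13])
filled += "\nC1: [SKIP - Galaxy not available]\nC2: [SKIP]\nC3: [SKIP]"
print(accepted(parse_results(filled)))
```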

Known limitations

A5 — ServerUriArray node not yet writable

The OPC UA .NET Standard SDK's default Server.ServerRedundancy object is the base ServerRedundancyState, which has no ServerUriArray child node. ServerRedundancyNodeWriter.ApplyServerUriArray currently logs a warning and skips the write. Until the non-transparent redundancy-type upgrade follow-up ships, operators must obtain the server URIs by reading the ClusterNode rows directly.

Recovery dwell is 60 s by default

RecoveryStateManager.DwellTime defaults to TimeSpan.FromSeconds(60) in Program.cs. Step 6 of the runbook will block for at least 60 s waiting for Node A to return to AuthoritativePrimary. This is intentional per decision #154 (thrash prevention) — do not lower it for the test run.

IsolatedBackup (80) does not auto-promote

Per decision #154, the Backup at band 80 does not self-elevate. If the operator needs authoritative service from Node B while Node A is down, they must write RedundancyRole=Primary on the ClusterNode row for Node B and publish a draft generation. The Admin UI RedundancyTab exposes this flow.

Dependency on existing tests

The cutover runbook validates the end-to-end wire path. The math and edge cases are already locked by the unit/integration tests enumerated in the first section. A failing runbook step that contradicts a passing unit test indicates a deployment configuration error or an SDK version mismatch — not a logic bug. Check PeerHttpProbeLoop logs first (look for PeerProbe Serilog events).
