lmxopcua/docs/plans/phase-6-3-redundancy-interop-plan.md
Joseph Doherty 16a87b08f3 docs: add four planning runbooks for Phase 6.3 interop, v2 GA gates, live-hardware validation, and alarms worker wiring
Produces docs/plans/ entries for tasks #13, #15, #16, and #17-#20:
- phase-6-3-redundancy-interop-plan.md: automation boundary analysis,
  concrete test matrix (A/B/C blocks), and a step-by-step cutover
  runbook for the deferred Stream F client interop work
- v2-ga-lab-gates-plan.md: exact gate list with command, pass criterion,
  and owner for each of the nine v2 GA exit criteria
- live-hardware-validation-runbooks.md: one runbook per driver (FOCAS
  CNC smoke #54, AB CIP live-boot, TwinCAT wire-live) with preconditions,
  procedure, expected results, and recording template
- alarms-worker-wiring-plan.md: focused plan for A.2/A.3-A.4/C.1/D.1
  worker wiring in the mxaccessgw sibling repo, documenting the
  discovered AVEVA API surface, the architectural decision that blocks
  A.2, the dependency order, and what each item needs to unblock

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 04:53:36 -04:00

# Phase 6.3 Redundancy — Client Interop Matrix and Cutover Validation Plan
> **Scope**: Phase 6.3 redundancy runtime core shipped (PRs #89-90, #98-99,
> #24-peerprobe, Stream C node wiring, Stream D lease wrap). What remains is
> Stream F (task #150): validating that third-party OPC UA clients honour
> our `ServiceLevel` / `ServerUriArray` / `RedundancySupport` signals and
> fail over correctly when the Primary drops. This document defines what is
> automatable as integration tests, what requires two live instances plus a
> real client, and a step-by-step cutover-validation runbook.
>
> **Source of truth**: `docs/Redundancy.md`, `docs/v2/redundancy-interop-playbook.md`,
> `docs/v2/implementation/phase-6-3-redundancy-runtime.md`,
> `scripts/compliance/phase-6-3-compliance.ps1`.
## What is already tested (no live cluster needed)
The following areas are covered by existing automated tests that run under a
plain `dotnet test`:
| Area | Test class(es) | What it asserts |
|---|---|---|
| `ServiceLevelCalculator` — 8-state matrix | `ServiceLevelCalculatorTests` | All 10 band values; role × self-health × peer-http × peer-ua × apply × recovery × topology combinations |
| `RecoveryStateManager` — dwell + witness | `RecoveryStateManagerTests` | 60 s dwell default; premature-exit rejection; witness-required gate |
| `ApplyLeaseRegistry` — lease lifecycle | `ApplyLeaseRegistryTests` | Disposal on success / exception / cancellation; watchdog force-close at 10 min |
| `ServerRedundancyNodeWriter` — OPC UA variable binding | `ServerRedundancyNodeWriterTests` | `ServiceLevel` byte push; `RedundancySupport` enum; `ServerUriArray` skip-log when node absent |
| `RedundancyStatePublisher` — orchestration | `RedundancyStatePublisherTests` | Edge-triggered `OnStateChanged`; idempotent dedup |
| `ClusterTopologyLoader` | `ClusterTopologyLoaderTests` | Two-node seed; one-node degenerate; duplicate-URI rejection |
| `DraftValidator.ValidateClusterTopology` | `DraftValidatorTests` (8 cases) | NodeCount/mode pairs; Enabled-count vs NodeCount; multiple-Primary rejection |
Run with:
```powershell
dotnet test ZB.MOM.WW.OtOpcUa.slnx --filter "FullyQualifiedName~Redundancy"
```
Compliance gate (every Phase 6.3 static check):
```powershell
pwsh ./scripts/compliance/phase-6-3-compliance.ps1
```
Pass criteria: exit 0; all `[PASS]` lines green; `[DEFERRED]` lines are
known-deferred surfaces, not failures.
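
The pass criteria above can be applied mechanically to a captured run. The following is a minimal sketch, assuming the script prints one bracketed status tag per check line; the `[FAIL]` tag and the helper name `evaluate_compliance` are assumptions for illustration, not taken from the script itself.

```python
import re

# Assumed line shape: a bracketed status tag somewhere on the line.
# [PASS] and [DEFERRED] come from this plan; [FAIL] is an assumption.
STATUS_TAG = re.compile(r"\[(PASS|FAIL|DEFERRED)\]")


def evaluate_compliance(exit_code: int, output: str) -> dict:
    """Apply the stated pass criteria to a captured compliance run."""
    counts = {"PASS": 0, "FAIL": 0, "DEFERRED": 0}
    for line in output.splitlines():
        match = STATUS_TAG.search(line)
        if match:
            counts[match.group(1)] += 1
    # The gate passes on exit 0 with no FAIL lines; DEFERRED lines are tolerated.
    passed = exit_code == 0 and counts["FAIL"] == 0
    return {"passed": passed, **counts}
```

Feeding it the process exit code and the captured stdout gives a single `passed` flag plus per-tag counts, which is convenient when the gate runs in CI.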
## What cannot be automated — requires two live instances
The scenarios below require two running `OtOpcUa.Server` processes in the
same `ServerCluster`, a real SQL Server Config DB, and at least one driver
instance with a reachable endpoint (simulator or real PLC).
### Why it cannot be unit/integration-tested in-process
- UaExpert, Kepware KEPServerEX, and AVEVA OI Gateway are closed-source
  Windows GUI binaries with no headless command-line interface for the
  subscribe/browse flows.
- The AVEVA MXAccess failover leg (`IAlarmSource` reconnect, `$MxAccessClient`
  quality transition) involves the Galaxy runtime's own client-redundancy
  policy and the COM-layer session model; both live outside this repo.
- Even the automatable subset (our own `otopcua-cli` as the client) needs
  two distinct listening TCP endpoints; that requires two live processes,
  which is out of scope for `dotnet test` integration fixtures.
## Test matrix
### Prerequisites
1. Two `OtOpcUa.Server` processes on separate Windows hosts (or separate
   ports on the same host for dev) sharing one Config DB (`ServerCluster`
   with `NodeCount=2`, `RedundancyMode=Warm` or `Hot`).
2. Each node registered in `ClusterNode`:
   - Node A: `RedundancyRole=Primary`, `ServiceLevelBase=255`,
     `ApplicationUri=urn:node-a:OtOpcUa`
   - Node B: `RedundancyRole=Secondary`, `ServiceLevelBase=100`,
     `ApplicationUri=urn:node-b:OtOpcUa`
3. `PeerHttpProbeLoop` and `PeerUaProbeLoop` HostedServices running on both
   nodes (registered via `AddHostedService<PeerHttpProbeLoop>` +
   `AddHostedService<PeerUaProbeLoop>` in `Program.cs`).
4. At least one `DriverInstance` in the cluster with a reachable PLC or
   simulator (e.g. Modbus sim at `10.100.0.35:5020`).
5. Client machine with UaExpert >= 1.7 installed.
6. Optional second client: Kepware KEPServerEX 6.x QuickClient or AVEVA
   OI Gateway 2020R2+.
### Block A — OPC UA protocol signals (UaExpert, no failover yet)
| ID | Scenario | Procedure | Pass criterion | Automatable? |
|----|----------|-----------|----------------|--------------|
| A1 | ServiceLevel published on Primary | Connect UaExpert to Node A. Browse `Server/ServiceLevel` (`i=2267`; the variable sits directly under the `Server` object, not under `ServerStatus`). | Value = 255 (`AuthoritativePrimary`) | No (requires UaExpert GUI) |
| A2 | ServiceLevel published on Backup | Connect UaExpert to Node B. Read same node. | Value = 100 (`AuthoritativeBackup`) | No |
| A3 | ServiceLevel updates when peer drops | Node A connected. Stop Node B (`sc.exe stop OtOpcUa`; use `sc.exe` because `sc` is an alias for `Set-Content` in Windows PowerShell 5.1). Watch `ServiceLevel` on Node A. | Transitions 255 → 230 (`IsolatedPrimary`) within ~6 s (3 × 2 s HTTP probe interval) | No |
| A4 | RedundancySupport | Browse `Server/ServerRedundancy/RedundancySupport` on either node. | Value = `Warm` or `Hot` matching the cluster `RedundancyMode` | No |
| A5 | ServerUriArray | Browse `Server/ServerRedundancy/ServerUriArray` on either node. | Array contains both `ApplicationUri` values; self listed first. Note: requires non-transparent redundancy-type upgrade (currently logs-and-skips — see known limitation A5 below). | No |
| A6 | Mid-apply ServiceLevel dip | Trigger a `sp_PublishGeneration` apply (via Admin UI draft → publish) while watching Node A `ServiceLevel`. | Drops to 200 (`PrimaryMidApply`) for the apply duration; returns to 255 after `RefreshAsync`. | No |
| A7 | Client.CLI reads correct ServiceLevel | `dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read -u opc.tcp://<node-a>:4840 -n "i=2267"` | Prints current byte value matching expected band. | **Yes** — scriptable with the Client CLI |
| A8 | otopcua-cli failover reconnect | `dotnet run ... -- connect -u opc.tcp://<node-a>:4840 -F opc.tcp://<node-b>:4840` — then kill Node A. | CLI session reconnects to Node B within the session keep-alive timeout. | **Yes** — scriptable with the Client CLI |
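
When scripting A7, the byte returned by the CLI still has to be compared against the expected band. A minimal lookup sketch follows, limited to the band values this plan names; the server may define further bands, and the helper names are hypothetical.

```python
# Band values exactly as named in this plan; other bands may exist in the server.
SERVICE_LEVEL_BANDS = {
    255: "AuthoritativePrimary",
    230: "IsolatedPrimary",
    200: "PrimaryMidApply",
    180: "RecoveringPrimary",
    100: "AuthoritativeBackup",
    80: "IsolatedBackup",
    1: "NoData",
}


def describe_service_level(value: int) -> str:
    """Map a ServiceLevel byte to the band name used in this plan."""
    if not 0 <= value <= 255:
        raise ValueError(f"ServiceLevel is a byte, got {value}")
    return SERVICE_LEVEL_BANDS.get(value, "Unknown")


def is_expected_band(value: int, expected_band: str) -> bool:
    """True when the observed byte maps to the band a scenario expects."""
    return describe_service_level(value) == expected_band
```

For A1 the check would be `is_expected_band(255, "AuthoritativePrimary")`; for A3 after the peer drops, `is_expected_band(230, "IsolatedPrimary")`.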
### Block B — Third-party client failover
| ID | Scenario | Procedure | Pass criterion |
|----|----------|-----------|----------------|
| B1 | UaExpert picks Primary by ServiceLevel | Configure a Redundancy Group in UaExpert with both endpoint URLs. | Client connects to Node A (higher ServiceLevel) |
| B2 | UaExpert cuts over on Primary kill | Kill Node A `OtOpcUa` service. | Client session reconnects to Node B within UaExpert's reconnect timeout (default 5 s). Data-change monitored items resume. |
| B3 | UaExpert returns when Primary restores | Start Node A. Wait >= 60 s recovery dwell. | `ServiceLevel` on Node A progresses: 180 (`RecoveringPrimary`) → 255 (`AuthoritativePrimary`). UaExpert may or may not switch back (client-policy-dependent; both outcomes accepted). |
| B4 | Kepware QuickClient failover | Repeat B1-B3 with Kepware configured for the same two endpoints. | Same pass criteria; confirms the failover behaviour is not UaExpert-specific. |
| B5 | AVEVA OI Gateway | Configure OI Gateway OPC DA/UA client object against the cluster. Kill Primary. | OI Gateway data quality recovers within `ReconnectInterval` (default 20 s); no permanent data-loss alert. |
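
B1 and B2 both rest on the client preferring the reachable endpoint with the highest `ServiceLevel`. The sketch below illustrates that selection rule only; it is not the algorithm of UaExpert, Kepware, or OI Gateway, whose policies are their own, and `pick_endpoint` is a hypothetical helper.

```python
from typing import Mapping, Optional


def pick_endpoint(levels: Mapping[str, Optional[int]]) -> Optional[str]:
    """Return the reachable endpoint with the highest ServiceLevel.

    `levels` maps endpoint URL to the last ServiceLevel read, or None when
    the endpoint could not be reached. Returns None if nothing is reachable.
    """
    reachable = {url: lvl for url, lvl in levels.items() if lvl is not None}
    if not reachable:
        return None
    return max(reachable, key=reachable.get)
```

With Node A at 255 and Node B at 100 the function picks Node A (B1); once Node A is unreachable and Node B reports 80, it picks Node B (B2).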
### Block C — Galaxy MXAccess failover
This block requires a running Galaxy and `$MxAccessClient` object (AVEVA
System Platform installed, Galaxy deployed on dev box — see project memory
`project_aveva_platform_installed.md`).
| ID | Scenario | Procedure | Pass criterion |
|----|----------|-----------|----------------|
| C1 | Galaxy binds to Primary on first connect | Bring cluster up. Start a Galaxy `$MxAccessClient` with both node URLs configured. | Galaxy reports `QUALITY = Good`; initial values stream from Node A. |
| C2 | Galaxy redirects on Primary drop | Stop Node A. | Galaxy `QUALITY` briefly goes `Uncertain`, then returns to `Good`; values continue streaming from Node B within MXAccess's `ReconnectInterval` (default 20 s). |
| C3 | Galaxy tolerates mid-apply dip | Trigger generation apply on Node A. | Galaxy remains bound — mid-apply dip (200) is advisory, not a session drop. No quality interruption. |
Note: A negative result on C1-C3 does not necessarily indicate an OtOpcUa
defect. Cross-check with Block A / B first to confirm our `ServiceLevel`
signal is correct before debugging the MXAccess client layer.
## Step-by-step cutover-validation runbook
This is the minimum procedure to satisfy the v2 GA exit criterion:
"Non-transparent redundancy cutover validated with at least one production
client (Ignition 8.3 recommended — see decision #85)."
### Step 1 — Provision the cluster
```powershell
# On the Config DB host, seed or verify cluster rows:
# ServerCluster: Id=<id>, Name="test-cluster", NodeCount=2, RedundancyMode=Warm
# ClusterNode A: NodeId="node-a", ClusterId=<id>, RedundancyRole=Primary,
# ServiceLevelBase=255, ApplicationUri="urn:node-a:OtOpcUa"
# ClusterNode B: NodeId="node-b", ClusterId=<id>, RedundancyRole=Secondary,
# ServiceLevelBase=100, ApplicationUri="urn:node-b:OtOpcUa"
```
Verify uniqueness constraint: no two `ClusterNode` rows share the same
`ApplicationUri` (unique index on `ApplicationUri`).
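
Before starting the servers, the seeded rows can be sanity-checked against the rules this plan mentions (node count, a single Primary, unique `ApplicationUri`). The following is a rough sketch over in-memory rows; the `ClusterNode` dataclass and `topology_problems` are illustrative stand-ins, not the repo's `DraftValidator`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClusterNode:
    """Illustrative mirror of the ClusterNode columns named in Step 1."""
    node_id: str
    redundancy_role: str
    service_level_base: int
    application_uri: str


def topology_problems(node_count: int, nodes: List[ClusterNode]) -> List[str]:
    """Return human-readable problems; an empty list means the seed looks sane."""
    problems = []
    if len(nodes) != node_count:
        problems.append(f"expected {node_count} nodes, found {len(nodes)}")
    primaries = [n.node_id for n in nodes if n.redundancy_role == "Primary"]
    if len(primaries) > 1:
        problems.append("multiple Primary nodes: " + ", ".join(primaries))
    uris = [n.application_uri for n in nodes]
    duplicates = sorted({u for u in uris if uris.count(u) > 1})
    if duplicates:
        problems.append("duplicate ApplicationUri: " + ", ".join(duplicates))
    return problems
```

Running it on the two rows from Step 1 should return an empty list; giving both nodes the same `ApplicationUri` should report the duplicate.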
### Step 2 — Start both server instances
On Node A host:
```powershell
# appsettings.json: Node:NodeId = "node-a"
# Use sc.exe: in Windows PowerShell 5.1, `sc` is an alias for Set-Content.
sc.exe start OtOpcUa
```
On Node B host:
```powershell
# appsettings.json: Node:NodeId = "node-b"
# Use sc.exe: in Windows PowerShell 5.1, `sc` is an alias for Set-Content.
sc.exe start OtOpcUa
```
Wait 10 s for the HostedServices to complete their first probe cycle.
### Step 3 — Verify baseline ServiceLevel via Client CLI
```powershell
# Node A should report 255
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
-u opc.tcp://<node-a-host>:4840 -n "i=2267"
# Node B should report 100
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
-u opc.tcp://<node-b-host>:4840 -n "i=2267"
```
Pass: Node A = 255, Node B = 100.
### Step 4 — Verify ServerUriArray
```powershell
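# i=11314 is the standard NodeId for Server_ServerRedundancy_ServerUriArray
# (i=2271 is Server_ServerCapabilities_LocaleIdArray).
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
  -u opc.tcp://<node-a-host>:4840 -n "i=11314"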
```
Pass: the returned array contains both `ApplicationUri` strings. If the
`ServerUriArray` read returns an empty value or an error, the non-transparent
redundancy-type upgrade follow-up is still pending (known limitation:
`ServerRedundancyNodeWriter.ApplyServerUriArray` logs-and-skips on the
base `ServerRedundancyState` object type).
### Step 5 — Execute Primary kill + failover (B2 scenario)
1. Connect UaExpert (or Kepware) Redundancy Group to both endpoints.
2. Confirm the client is subscribed to at least one variable node.
3. Kill Node A: `sc.exe stop OtOpcUa` on Node A host.
4. Observe:
   - Node B `ServiceLevel` should transition: 100 (`AuthoritativeBackup`)
     → 80 (`IsolatedBackup`) within ~6 s.
   - Client should reconnect to Node B and resume data-change events.
5. Record: time from kill to client reconnect; whether data gaps occurred.
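
The "time from kill to client reconnect" figure can be captured with a small polling timer rather than a stopwatch. The sketch below is generic: `condition` is any caller-supplied callable (for example, one that checks whether Node B reports 80), and the clock and sleep functions are injectable so the helper can be exercised without real waiting. All names are hypothetical.

```python
import time
from typing import Callable, Optional


def seconds_until(
    condition: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `condition` until it is true; return elapsed seconds or None on timeout."""
    start = clock()
    while True:
        if condition():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

The measured value is only as precise as `interval_s`, so a polling interval well below the ~6 s expectation is advisable.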
### Step 6 — Verify Primary recovery (B3 scenario)
1. Restart Node A: `sc.exe start OtOpcUa` on Node A host.
2. Observe Node A `ServiceLevel` progression:
   - ~0 s: 1 (`NoData`) briefly while HostedServices start.
   - Startup: 180 (`RecoveringPrimary`) while the recovery dwell gate is active.
   - After >= 60 s dwell + one positive publish witness: 255 (`AuthoritativePrimary`).
3. Observe Node B:
   - Returns to 100 (`AuthoritativeBackup`) once it sees Node A peer probe succeed.
4. Record dwell duration and whether the client (UaExpert/Kepware) switches back.
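
The Node A progression in this step follows a two-condition gate: the dwell must have elapsed and a publish witness must have been seen. A simplified sketch of that rule follows; it ignores the brief `NoData` startup phase and peer state, and it is an illustration rather than the `RecoveryStateManager` implementation.

```python
def recovering_primary_level(
    elapsed_s: float, witness_seen: bool, dwell_s: float = 60.0
) -> int:
    """Expected Node A ServiceLevel while it recovers as Primary.

    Returns 180 (RecoveringPrimary) until BOTH the dwell has elapsed and a
    positive publish witness has been observed, then 255 (AuthoritativePrimary).
    """
    dwell_met = elapsed_s >= dwell_s
    return 255 if (dwell_met and witness_seen) else 180
```

Neither condition alone is enough: a witness at 30 s still yields 180, and 60 s without a witness also yields 180.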
### Step 7 — Execute mid-apply dip (A6 scenario)
1. Via Admin UI, create a trivial draft change and publish.
2. Watch Node A `ServiceLevel` during apply.
3. Expected: drops to 200 (`PrimaryMidApply`) for the apply duration
(typically seconds); returns to 255 when `GenerationRefreshHostedService`
releases the lease.
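
To fill in the "mid-apply duration" field, the dip can be measured from a series of timestamped `ServiceLevel` samples. The following is a minimal sketch, assuming the samples were collected by polling; the `(timestamp, level)` tuple shape and the function name are assumptions.

```python
from typing import Iterable, Optional, Tuple


def band_duration(
    samples: Iterable[Tuple[float, int]], band: int = 200
) -> Optional[float]:
    """Seconds from the first sample at `band` to the first later sample off it.

    `samples` are (timestamp_seconds, service_level) pairs in time order.
    Returns None if the band was never entered or never left.
    """
    entered = None
    for timestamp, level in samples:
        if entered is None:
            if level == band:
                entered = timestamp
        elif level != band:
            return timestamp - entered
    return None
```

The same helper works for other bands, for example `band=180` to approximate the observed recovery dwell in Step 6.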
### Step 8 — Record results
Copy the following block into a tracking doc:
```
Run date: YYYY-MM-DD
Release SHA: <git sha>
Cluster: <cluster-id> Primary: node-a Backup: node-b
Config DB: 10.100.0.35,14330
A1: [PASS/FAIL] evidence: <screenshot or CLI output>
A2: [PASS/FAIL]
A3: [PASS/FAIL] time-to-IsolatedPrimary: <N>s
A4: [PASS/FAIL]
A5: [PASS/FAIL/DEFERRED - ServerUriArray upgrade pending]
A6: [PASS/FAIL] mid-apply duration: <N>s
A7: [PASS/FAIL] CLI output attached
A8: [PASS/FAIL] CLI reconnect observed
B1: [PASS/FAIL]
B2: [PASS/FAIL] reconnect time: <N>s
B3: [PASS/FAIL] dwell observed: <N>s
B4: [PASS/FAIL] (Kepware)
B5: [PASS/FAIL] (OI Gateway — if available)
C1: [PASS/FAIL/SKIP - Galaxy not available]
C2: [PASS/FAIL/SKIP]
C3: [PASS/FAIL/SKIP]
```
One pass of every non-SKIP row is the v2 GA acceptance criterion.
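
A filled-in results block can be checked against this criterion automatically. The sketch below assumes each recorded row starts with its ID followed by a single outcome word, optionally bracketed; unfilled template rows such as `[PASS/FAIL]` are treated as missing. Whether `DEFERRED` blocks acceptance is not stated in this plan, so the `allow_deferred` switch is an assumption and defaults to the strict reading.

```python
import re

# Row IDs from the template: A1-A8, B1-B5, C1-C3.
EXPECTED_ROWS = (
    [f"A{i}" for i in range(1, 9)]
    + [f"B{i}" for i in range(1, 6)]
    + [f"C{i}" for i in range(1, 4)]
)

# A single outcome word, optionally bracketed; "[PASS/FAIL]" does not match.
ROW = re.compile(r"^([ABC]\d):\s*\[?(PASS|FAIL|SKIP|DEFERRED)(?![/\w])")


def summarize_results(text: str, allow_deferred: bool = False) -> dict:
    """Summarize a recorded results block against the acceptance criterion."""
    outcomes = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            outcomes[match.group(1)] = match.group(2)
    blocking = {"FAIL"} if allow_deferred else {"FAIL", "DEFERRED"}
    blocking_rows = sorted(r for r, o in outcomes.items() if o in blocking)
    missing_rows = [r for r in EXPECTED_ROWS if r not in outcomes]
    return {
        "accepted": not blocking_rows and not missing_rows,
        "blocking_rows": blocking_rows,
        "missing_rows": missing_rows,
    }
```

A run with every A and B row at `PASS` and the C rows at `SKIP` is accepted; a run that still contains template placeholders is rejected with those rows listed under `missing_rows`.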
## Known limitations
### A5 — ServerUriArray node not yet writable
The OPC UA .NET Standard SDK's default `Server.ServerRedundancy` object is the
base `ServerRedundancyState`, which has no `ServerUriArray` child node.
`ServerRedundancyNodeWriter.ApplyServerUriArray` currently logs a warning and
skips. The operator obtains `ServerUriArray` by reading `ClusterNode` rows
directly until the non-transparent redundancy-type upgrade follow-up ships.
### Recovery dwell is 60 s by default
`RecoveryStateManager.DwellTime` defaults to `TimeSpan.FromSeconds(60)` in
`Program.cs`. Step 6 of the runbook will block for at least 60 s waiting for
Node A to return to `AuthoritativePrimary`. This is intentional per
decision #154 (thrash prevention) — do not lower it for the test run.
### IsolatedBackup (80) does not auto-promote
Per decision #154, the Backup at band 80 does not self-elevate. If the operator
needs authoritative service from Node B while Node A is down, they must write
`RedundancyRole=Primary` on the `ClusterNode` row for Node B and publish a
draft generation. The Admin UI `RedundancyTab` exposes this flow.
## Dependency on existing tests
The cutover runbook validates the end-to-end wire path. The math and edge cases
are already locked by the unit/integration tests enumerated in the first section.
A failing runbook step that contradicts a passing unit test indicates a
deployment configuration error or an SDK version mismatch — not a logic bug.
Check `PeerHttpProbeLoop` logs first (look for `PeerProbe` Serilog events).