docs: add four planning runbooks for Phase 6.3 interop, v2 GA gates, live-hardware validation, and alarms worker wiring
Produces docs/plans/ entries for tasks #13, #15, #16, and #17-#20:

- phase-6-3-redundancy-interop-plan.md: automation boundary analysis, concrete test matrix (A/B/C blocks), and a step-by-step cutover runbook for the deferred Stream F client interop work
- v2-ga-lab-gates-plan.md: exact gate list with command, pass criterion, and owner for each of the nine v2 GA exit criteria
- live-hardware-validation-runbooks.md: one runbook per driver (FOCAS CNC smoke #54, AB CIP live-boot, TwinCAT wire-live) with preconditions, procedure, expected results, and recording template
- alarms-worker-wiring-plan.md: focused plan for A.2/A.3-A.4/C.1/D.1 worker wiring in the mxaccessgw sibling repo, documenting the discovered AVEVA API surface, the architectural decision that blocks A.2, the dependency order, and what each item needs to unblock

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
278
docs/plans/phase-6-3-redundancy-interop-plan.md
Normal file
@@ -0,0 +1,278 @@
# Phase 6.3 Redundancy: Client Interop Matrix and Cutover Validation Plan

> **Scope**: Phase 6.3 redundancy runtime core shipped (PRs #89-90, #98-99,
> #24-peerprobe, Stream C node wiring, Stream D lease wrap). What remains is
> Stream F (task #150): validating that third-party OPC UA clients honour
> our `ServiceLevel` / `ServerUriArray` / `RedundancySupport` signals and
> fail over correctly when the Primary drops. This document defines what is
> automatable as integration tests, what requires two live instances plus a
> real client, and a step-by-step cutover-validation runbook.
>
> **Source of truth**: `docs/Redundancy.md`, `docs/v2/redundancy-interop-playbook.md`,
> `docs/v2/implementation/phase-6-3-redundancy-runtime.md`,
> `scripts/compliance/phase-6-3-compliance.ps1`.

## What is already tested (no live cluster needed)

The following are covered by existing automated tests that run in ordinary
`dotnet test`:

| Area | Test class(es) | What it asserts |
|---|---|---|
| `ServiceLevelCalculator`: 8-state matrix | `ServiceLevelCalculatorTests` | All 10 band values; role × self-health × peer-http × peer-ua × apply × recovery × topology combinations |
| `RecoveryStateManager`: dwell + witness | `RecoveryStateManagerTests` | 60 s dwell default; premature-exit rejection; witness-required gate |
| `ApplyLeaseRegistry`: lease lifecycle | `ApplyLeaseRegistryTests` | Disposal on success / exception / cancellation; watchdog force-close at 10 min |
| `ServerRedundancyNodeWriter`: OPC UA variable binding | `ServerRedundancyNodeWriterTests` | `ServiceLevel` byte push; `RedundancySupport` enum; `ServerUriArray` skip-log when node absent |
| `RedundancyStatePublisher`: orchestration | `RedundancyStatePublisherTests` | Edge-triggered `OnStateChanged`; idempotent dedup |
| `ClusterTopologyLoader` | `ClusterTopologyLoaderTests` | Two-node seed; one-node degenerate; duplicate-URI rejection |
| `DraftValidator.ValidateClusterTopology` | `DraftValidatorTests` (8 cases) | NodeCount/mode pairs; Enabled-count vs NodeCount; multiple-Primary rejection |

Run with:

```powershell
dotnet test ZB.MOM.WW.OtOpcUa.slnx --filter "FullyQualifiedName~Redundancy"
```

Compliance gate (every Phase 6.3 static check):

```powershell
pwsh ./scripts/compliance/phase-6-3-compliance.ps1
```

Pass criteria: exit 0; all `[PASS]` lines green; `[DEFERRED]` lines are
known-deferred surfaces, not failures.
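
The pass criteria above can be checked mechanically when the compliance output is captured to a file or CI log. The following is a minimal sketch in Python, not part of the repo. It assumes each check prints one line containing `[PASS]`, `[DEFERRED]`, or `[FAIL]`; the `[FAIL]` marker and the sample check names are hypothetical, since only `[PASS]` and `[DEFERRED]` are named above.

```python
import re

# [PASS] and [DEFERRED] come from the pass criteria; [FAIL] is an assumed marker.
MARKER = re.compile(r"\[(PASS|FAIL|DEFERRED)\]")


def summarize_compliance(output: str, exit_code: int) -> dict:
    """Count markers in captured output and decide whether the gate passed."""
    counts = {"PASS": 0, "FAIL": 0, "DEFERRED": 0}
    for line in output.splitlines():
        match = MARKER.search(line)
        if match:
            counts[match.group(1)] += 1
    # Deferred lines never fail the gate; a non-zero exit or any FAIL does.
    counts["ok"] = exit_code == 0 and counts["FAIL"] == 0
    return counts


# Hypothetical output lines, for illustration only.
sample = "\n".join([
    "[PASS] ServiceLevelCalculator present",
    "[DEFERRED] Stream F client interop",
    "[PASS] RecoveryStateManager present",
])
print(summarize_compliance(sample, exit_code=0))
```

Keeping the deferred count separate from the failure count makes it easy to see when a deferred surface quietly disappears from (or reappears in) the report between runs.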

## What cannot be automated: requires two live instances

The scenarios below require two running `OtOpcUa.Server` processes in the
same `ServerCluster`, a real SQL Server Config DB, and at least one driver
instance with a reachable endpoint (simulator or real PLC).

### Why it cannot be unit/integration-tested in-process

- UaExpert, Kepware KEPServerEX, and AVEVA OI Gateway are closed-source
  Windows GUI binaries with no headless CLI interface for the
  subscribe/browse flows.
- The AVEVA MXAccess failover leg (`IAlarmSource` reconnect, `$MxAccessClient`
  quality transition) involves the Galaxy runtime's own client-redundancy
  policy and the COM-layer session model, both of which live outside this repo.
- Even the automatable sub-set (our own `otopcua-cli` as the client) needs
  two distinct listening TCP endpoints; that requires two live processes,
  which is out of scope for `dotnet test` integration fixtures.

## Test matrix

### Prerequisites

1. Two `OtOpcUa.Server` processes on separate Windows hosts (or separate
   ports on the same host for dev) sharing one Config DB (`ServerCluster`
   with `NodeCount=2`, `RedundancyMode=Warm` or `Hot`).
2. Each node registered in `ClusterNode`:
   - Node A: `RedundancyRole=Primary`, `ServiceLevelBase=255`,
     `ApplicationUri=urn:node-a:OtOpcUa`
   - Node B: `RedundancyRole=Secondary`, `ServiceLevelBase=100`,
     `ApplicationUri=urn:node-b:OtOpcUa`
3. `PeerHttpProbeLoop` and `PeerUaProbeLoop` HostedServices running on both
   nodes (registered via `AddHostedService<PeerHttpProbeLoop>` +
   `AddHostedService<PeerUaProbeLoop>` in `Program.cs`).
4. At least one `DriverInstance` in the cluster with a reachable PLC or
   simulator (e.g. Modbus sim at `10.100.0.35:5020`).
5. Client machine with UaExpert >= 1.7 installed.
6. Optional second client: Kepware KEPServerEX 6.x QuickClient or AVEVA
   OI Gateway 2020R2+.

### Block A: OPC UA protocol signals (UaExpert, no failover yet)

| ID | Scenario | Procedure | Pass criterion | Automatable? |
|----|----------|-----------|----------------|--------------|
| A1 | ServiceLevel published on Primary | Connect UaExpert to Node A. Browse `Server/ServiceLevel` (a property of the `Server` object, `i=2267`). | Value = 255 (`AuthoritativePrimary`) | No (requires UaExpert GUI) |
| A2 | ServiceLevel published on Backup | Connect UaExpert to Node B. Read same node. | Value = 100 (`AuthoritativeBackup`) | No |
| A3 | ServiceLevel updates when peer drops | Node A connected. Stop Node B (`sc.exe stop OtOpcUa`). Watch `ServiceLevel` on Node A. | Transitions 255 → 230 (`IsolatedPrimary`) within ~6 s (3 × 2 s HTTP probe interval) | No |
| A4 | RedundancySupport | Browse `Server/ServerRedundancy/RedundancySupport` on either node. | Value = `Warm` or `Hot` matching the cluster `RedundancyMode` | No |
| A5 | ServerUriArray | Browse `Server/ServerRedundancy/ServerUriArray` on either node. | Array contains both `ApplicationUri` values; self listed first. Note: requires non-transparent redundancy-type upgrade (currently logs-and-skips; see known limitation A5 below). | No |
| A6 | Mid-apply ServiceLevel dip | Trigger a `sp_PublishGeneration` apply (via Admin UI draft → publish) while watching Node A `ServiceLevel`. | Drops to 200 (`PrimaryMidApply`) for the apply duration; returns to 255 after `RefreshAsync`. | No |
| A7 | Client.CLI reads correct ServiceLevel | `dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read -u opc.tcp://<node-a>:4840 -n "i=2267"` | Prints current byte value matching expected band. | **Yes** (scriptable with the Client CLI) |
| A8 | otopcua-cli failover reconnect | `dotnet run ... -- connect -u opc.tcp://<node-a>:4840 -F opc.tcp://<node-b>:4840`, then kill Node A. | CLI session reconnects to Node B within the session keep-alive timeout. | **Yes** (scriptable with the Client CLI) |

### Block B: Third-party client failover

| ID | Scenario | Procedure | Pass criterion |
|----|----------|-----------|----------------|
| B1 | UaExpert picks Primary by ServiceLevel | Configure a Redundancy Group in UaExpert with both endpoint URLs. | Client connects to Node A (higher ServiceLevel) |
| B2 | UaExpert cuts over on Primary kill | Kill Node A `OtOpcUa` service. | Client session reconnects to Node B within UaExpert's reconnect timeout (default 5 s). Data-change monitored items resume. |
| B3 | UaExpert returns when Primary restores | Start Node A. Wait >= 60 s recovery dwell. | `ServiceLevel` on Node A progresses: 180 (`RecoveringPrimary`) → 255 (`AuthoritativePrimary`). UaExpert may or may not switch back (client-policy-dependent; both outcomes accepted). |
| B4 | Kepware QuickClient failover | Repeat B1–B3 with Kepware configured for the same two endpoints. | Same pass criteria; confirms the behaviour is not UaExpert-specific. |
| B5 | AVEVA OI Gateway | Configure OI Gateway OPC DA/UA client object against the cluster. Kill Primary. | OI Gateway data quality recovers within `ReconnectInterval` (default 20 s); no permanent data-loss alert. |

### Block C: Galaxy MXAccess failover

This block requires a running Galaxy and `$MxAccessClient` object (AVEVA
System Platform installed, Galaxy deployed on dev box; see project memory
`project_aveva_platform_installed.md`).

| ID | Scenario | Procedure | Pass criterion |
|----|----------|-----------|----------------|
| C1 | Galaxy binds to Primary on first connect | Bring cluster up. Start a Galaxy `$MxAccessClient` with both node URLs configured. | Galaxy reports `QUALITY = Good`; initial values stream from Node A. |
| C2 | Galaxy redirects on Primary drop | Stop Node A. | Galaxy `QUALITY` briefly goes `Uncertain`, then returns to `Good`; values continue streaming from Node B within MXAccess's `ReconnectInterval` (default 20 s). |
| C3 | Galaxy tolerates mid-apply dip | Trigger generation apply on Node A. | Galaxy remains bound; the mid-apply dip (200) is advisory, not a session drop. No quality interruption. |

Note: A negative result on C1–C3 does not necessarily indicate an OtOpcUa
defect. Cross-check with Block A / B first to confirm our `ServiceLevel`
signal is correct before debugging the MXAccess client layer.

## Step-by-step cutover-validation runbook

This is the minimum procedure to satisfy the v2 GA exit criterion:
"Non-transparent redundancy cutover validated with at least one production
client (Ignition 8.3 recommended; see decision #85)."

### Step 1: Provision the cluster

```powershell
# On the Config DB host, seed or verify cluster rows:
#   ServerCluster: Id=<id>, Name="test-cluster", NodeCount=2, RedundancyMode=Warm
#   ClusterNode A: NodeId="node-a", ClusterId=<id>, RedundancyRole=Primary,
#                  ServiceLevelBase=255, ApplicationUri="urn:node-a:OtOpcUa"
#   ClusterNode B: NodeId="node-b", ClusterId=<id>, RedundancyRole=Secondary,
#                  ServiceLevelBase=100, ApplicationUri="urn:node-b:OtOpcUa"
```

Verify uniqueness constraint: no two `ClusterNode` rows share the same
`ApplicationUri` (unique index on `ApplicationUri`).
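
Before starting the servers, the seeded rows can be sanity-checked offline. The sketch below, written in Python for brevity, mirrors the checks the document attributes to `ClusterTopologyLoader` and `DraftValidator` (node count, duplicate URI, multiple Primary). The `ClusterNodeRow` type and `topology_problems` function are hypothetical illustrations, not the repo's code, and how the rows are exported from the Config DB is left open.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterNodeRow:
    """Hypothetical in-memory copy of one ClusterNode row."""
    node_id: str
    role: str             # "Primary" or "Secondary"
    application_uri: str


def topology_problems(nodes, node_count=2):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(nodes) != node_count:
        problems.append(f"expected {node_count} nodes, found {len(nodes)}")
    uri_counts = Counter(node.application_uri for node in nodes)
    for uri, count in uri_counts.items():
        if count > 1:
            problems.append(f"duplicate ApplicationUri: {uri}")
    primaries = [node.node_id for node in nodes if node.role == "Primary"]
    if len(primaries) > 1:
        problems.append(f"multiple Primary nodes: {', '.join(primaries)}")
    return problems


rows = [
    ClusterNodeRow("node-a", "Primary", "urn:node-a:OtOpcUa"),
    ClusterNodeRow("node-b", "Secondary", "urn:node-b:OtOpcUa"),
]
print(topology_problems(rows))  # empty list for the rows from Step 1
```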

### Step 2: Start both server instances

On Node A host:

```powershell
# appsettings.json: Node:NodeId = "node-a"
# Use sc.exe explicitly: in Windows PowerShell, "sc" is an alias for Set-Content.
sc.exe start OtOpcUa
```

On Node B host:

```powershell
# appsettings.json: Node:NodeId = "node-b"
sc.exe start OtOpcUa
```

Wait 10 s for the HostedServices to complete their first probe cycle.

### Step 3: Verify baseline ServiceLevel via Client CLI

```powershell
# Node A should report 255
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
  -u opc.tcp://<node-a-host>:4840 -n "i=2267"

# Node B should report 100
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
  -u opc.tcp://<node-b-host>:4840 -n "i=2267"
```

Pass: Node A = 255, Node B = 100.
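
When recording results it helps to translate the raw byte into its band name. The lookup below is a small Python sketch that lists only the seven bands this plan names; the calculator is said to define ten, so unlisted values are reported as unknown rather than guessed. The helper names are illustrative assumptions.

```python
# Bands named in this plan; the full calculator defines more.
SERVICE_LEVEL_BANDS = {
    255: "AuthoritativePrimary",
    230: "IsolatedPrimary",
    200: "PrimaryMidApply",
    180: "RecoveringPrimary",
    100: "AuthoritativeBackup",
    80: "IsolatedBackup",
    1: "NoData",
}


def describe(value):
    """Render a ServiceLevel byte with its band name, if known."""
    return f"{value} ({SERVICE_LEVEL_BANDS.get(value, 'unknown band')})"


def baseline_mismatches(observed, expected=None):
    """Compare observed per-node values against the Step 3 baseline."""
    expected = expected or {"node-a": 255, "node-b": 100}
    return [
        f"{node}: expected {describe(want)}, got {describe(observed.get(node))}"
        for node, want in expected.items()
        if observed.get(node) != want
    ]


print(baseline_mismatches({"node-a": 255, "node-b": 100}))
print(baseline_mismatches({"node-a": 255, "node-b": 80}))
```

An `IsolatedBackup` reading on Node B at this step would point at the peer probes rather than at the role configuration.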

### Step 4: Verify ServerUriArray

```powershell
# i=11314 is Server/ServerRedundancy/ServerUriArray in the standard namespace
# (i=2271 is Server/ServerCapabilities/LocaleIdArray).
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- read `
  -u opc.tcp://<node-a-host>:4840 -n "i=11314"
```

Pass: array returned contains both `ApplicationUri` strings. If the
`ServerUriArray` node returns empty or an error, the non-transparent
redundancy-type upgrade follow-up is still pending (known limitation:
`ServerRedundancyNodeWriter.ApplyServerUriArray` logs-and-skips on the
base `ServerRedundancyState` object type).

### Step 5: Execute Primary kill + failover (B2 scenario)

1. Connect UaExpert (or Kepware) Redundancy Group to both endpoints.
2. Confirm the client is subscribed to at least one variable node.
3. Kill Node A: `sc.exe stop OtOpcUa` on Node A host.
4. Observe:
   - Node B `ServiceLevel` should transition: 100 (`AuthoritativeBackup`)
     → 80 (`IsolatedBackup`) within ~6 s.
   - Client should reconnect to Node B and resume data-change events.
5. Record: time from kill to client reconnect; whether data gaps occurred.
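
The timings in item 5 are easier to record consistently with a small polling helper. The Python sketch below is generic: `wait_until` is a hypothetical helper, and the predicate would wrap whatever reads Node B's `ServiceLevel` (for example a call to the Client CLI, whose output format is not specified here). The demo uses a fake clock and canned readings so it runs without a cluster.

```python
import time


def wait_until(predicate, timeout_s, interval_s=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate until true; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if predicate():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


class FakeClock:
    """Deterministic stand-in for time.monotonic and time.sleep (demo only)."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


# Canned Node B readings, one per poll: 100, 100, then 80 (IsolatedBackup).
readings = iter([100, 100, 80])
fake = FakeClock()
elapsed = wait_until(lambda: next(readings) == 80, timeout_s=10, interval_s=2,
                     clock=fake, sleep=fake.sleep)
print(elapsed)  # 4.0 on the fake clock
```

The measured value is only as precise as the polling interval, so pick an interval well below the ~6 s budget when running against real nodes.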

### Step 6: Verify Primary recovery (B3 scenario)

1. Restart Node A: `sc.exe start OtOpcUa` on Node A host.
2. Observe Node A `ServiceLevel` progression:
   - ~0 s: 1 (`NoData`) briefly while HostedServices start.
   - Startup: 180 (`RecoveringPrimary`) while the recovery dwell gate is active.
   - After >= 60 s dwell + one positive publish witness: 255 (`AuthoritativePrimary`).
3. Observe Node B:
   - Returns to 100 (`AuthoritativeBackup`) once it sees Node A peer probe succeed.
4. Record dwell duration and whether the client (UaExpert/Kepware) switches back.
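
Given timestamped `ServiceLevel` samples from Node A, the progression and dwell in item 2 can be checked after the fact. This Python sketch assumes samples are `(seconds_since_restart, value)` pairs in time order, treats the brief `NoData` reading as optional because polling may miss it, and measures dwell from the first observed 180 to the first observed 255. The function name and the `tolerance_s` allowance for polling granularity are assumptions.

```python
def check_recovery(samples, min_dwell_s=60.0, tolerance_s=0.0):
    """Return (ok, message) for a Node A recovery trace."""
    # Collapse consecutive duplicates to get the sequence of distinct bands.
    seen = []
    for _, level in samples:
        if not seen or seen[-1] != level:
            seen.append(level)
    # NoData (1) is brief and may not be observed at all.
    if seen and seen[0] == 1:
        seen = seen[1:]
    if seen != [180, 255]:
        return False, f"unexpected progression: {seen}"
    first_180 = next(t for t, level in samples if level == 180)
    first_255 = next(t for t, level in samples if level == 255)
    dwell = first_255 - first_180
    if dwell + tolerance_s < min_dwell_s:
        return False, f"dwell {dwell:.0f}s shorter than {min_dwell_s:.0f}s"
    return True, f"dwell {dwell:.0f}s"


trace = [(0, 1), (2, 180), (30, 180), (64, 255), (70, 255)]
print(check_recovery(trace))  # (True, 'dwell 62s')
```

Set `tolerance_s` to the polling interval when the observed dwell sits right at the 60 s boundary, since the first 180 sample can lag the real state change.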

### Step 7: Execute mid-apply dip (A6 scenario)

1. Via Admin UI, create a trivial draft change and publish.
2. Watch Node A `ServiceLevel` during apply.
3. Expected: drops to 200 (`PrimaryMidApply`) for the apply duration
   (typically seconds); returns to 255 when `GenerationRefreshHostedService`
   releases the lease.
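
The mid-apply duration that Step 8 asks for can be derived from the same kind of timestamped samples. A minimal Python sketch, assuming `(seconds, value)` pairs in time order; `dip_duration` is a hypothetical helper and the sample trace is invented for illustration.

```python
def dip_duration(samples, dip_level=200, normal_level=255):
    """Seconds from first dip sample to the return to normal, else None."""
    start = end = None
    for t, level in samples:
        if level == dip_level and start is None:
            start = t          # first time the dip is observed
        elif level == normal_level and start is not None:
            end = t            # first normal reading after the dip
            break
    if start is None or end is None:
        return None            # no dip seen, or the dip never ended
    return end - start


trace_apply = [(0, 255), (1, 255), (2, 200), (3, 200), (4, 255)]
print(dip_duration(trace_apply))  # 2
```

A `None` result means either the apply was too fast for the polling interval or the level never returned to 255; the raw trace distinguishes the two.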

### Step 8: Record results

Copy the following block into a tracking doc:

```
Run date: YYYY-MM-DD
Release SHA: <git sha>
Cluster: <cluster-id>  Primary: node-a  Backup: node-b
Config DB: 10.100.0.35,14330

A1: [PASS/FAIL]  evidence: <screenshot or CLI output>
A2: [PASS/FAIL]
A3: [PASS/FAIL]  time-to-IsolatedPrimary: <N>s
A4: [PASS/FAIL]
A5: [PASS/FAIL/DEFERRED - ServerUriArray upgrade pending]
A6: [PASS/FAIL]  mid-apply duration: <N>s
A7: [PASS/FAIL]  CLI output attached
A8: [PASS/FAIL]  CLI reconnect observed
B1: [PASS/FAIL]
B2: [PASS/FAIL]  reconnect time: <N>s
B3: [PASS/FAIL]  dwell observed: <N>s
B4: [PASS/FAIL]  (Kepware)
B5: [PASS/FAIL]  (OI Gateway, if available)
C1: [PASS/FAIL/SKIP - Galaxy not available]
C2: [PASS/FAIL/SKIP]
C3: [PASS/FAIL/SKIP]
```

One pass of every non-SKIP row is the v2 GA acceptance criterion.
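
A filled-in record can be checked against that criterion with a few lines of Python. The sketch below is an illustration, not repo tooling. It reads the criterion strictly: any row that is not `PASS` or `SKIP` blocks acceptance, so `DEFERRED` and rows left as the unfilled `[PASS/FAIL]` placeholder both count as blocking. Whether a deferred A5 should block is a policy question this sketch does not settle.

```python
import re

# Matches "A1: [PASS]", "A5: [DEFERRED - ...]", "C1: [SKIP - ...]".
# The unfilled placeholder "[PASS/FAIL]" deliberately does not match.
ROW = re.compile(r"^([ABC]\d):\s*\[(PASS|FAIL|DEFERRED|SKIP)(?:\]|\s*-)")

EXPECTED_IDS = (
    [f"A{i}" for i in range(1, 9)]
    + [f"B{i}" for i in range(1, 6)]
    + [f"C{i}" for i in range(1, 4)]
)


def blocking_rows(record: str):
    """Return IDs of rows that are neither PASS nor SKIP (or are missing)."""
    results = {}
    for line in record.splitlines():
        match = ROW.match(line.strip())
        if match:
            results[match.group(1)] = match.group(2)
    return [
        row_id for row_id in EXPECTED_IDS
        if results.get(row_id, "MISSING") not in ("PASS", "SKIP")
    ]


# Hypothetical filled-in record: everything passes, Galaxy rows skipped.
filled = "\n".join(
    [f"{row_id}: [PASS]" for row_id in EXPECTED_IDS if not row_id.startswith("C")]
    + [f"{row_id}: [SKIP - Galaxy not available]" for row_id in ("C1", "C2", "C3")]
)
print(blocking_rows(filled))  # empty list means the run is acceptable
```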

## Known limitations

### A5: ServerUriArray node not yet writable

The OPC UA .NET Standard SDK's default `Server.ServerRedundancy` object is the
base `ServerRedundancyState`, which has no `ServerUriArray` child node.
`ServerRedundancyNodeWriter.ApplyServerUriArray` currently logs a warning and
skips. The operator obtains `ServerUriArray` by reading `ClusterNode` rows
directly until the non-transparent redundancy-type upgrade follow-up ships.

### Recovery dwell is 60 s by default

`RecoveryStateManager.DwellTime` defaults to `TimeSpan.FromSeconds(60)` in
`Program.cs`. Step 6 of the runbook will block for at least 60 s waiting for
Node A to return to `AuthoritativePrimary`. This is intentional per
decision #154 (thrash prevention); do not lower it for the test run.

### IsolatedBackup (80) does not auto-promote

Per decision #154, the Backup at band 80 does not self-elevate. If the operator
needs authoritative service from Node B while Node A is down, they must write
`RedundancyRole=Primary` on the `ClusterNode` row for Node B and publish a
draft generation. The Admin UI `RedundancyTab` exposes this flow.

## Dependency on existing tests

The cutover runbook validates the end-to-end wire path. The math and edge cases
are already locked by the unit/integration tests enumerated in the first section.
A failing runbook step that contradicts a passing unit test indicates a
deployment configuration error or an SDK version mismatch, not a logic bug.
Check `PeerHttpProbeLoop` logs first (look for `PeerProbe` Serilog events).