Catch-all commit for pending work on the task-galaxy-e2e branch that
wasn't part of the FOCAS migration. Changes are grouped by topic so
per-topic commits can be cherry-picked later if needed.
TwinCAT
- src/.../Driver.TwinCAT/AdsTwinCATClient.cs + TwinCATDriverFactoryExtensions.cs:
factory-registration extensions + ADS client refinements.
- src/.../Driver.TwinCAT.Cli/Commands/BrowseCommand.cs: new browse command
for the TwinCAT test-client CLI.
- tests/.../Driver.TwinCAT.IntegrationTests/TwinCAT3SmokeTests.cs + TwinCatProject/:
fixture scaffold with a minimal POU + README pointing at the TCBSD/ESXi
VM for e2e.
- docs/Driver.TwinCAT.Cli.md + docs/drivers/TwinCAT-Test-Fixture.md:
documentation for the above.
- docs/v3/twincat-backlog.md: forward-looking backlog seed.
Admin UI + fleet status
- src/.../Admin/Components/Pages/Clusters/DriversTab.razor + Hosts.razor:
UI refresh for fleet-status rendering.
- src/.../Admin/Hubs/FleetStatusHub.cs + FleetStatusPoller.cs +
Admin/Program.cs: SignalR hub + poller plumbing for live fleet data.
- tests/.../Admin.Tests/FleetStatusPollerTests.cs: poller coverage.
Server + redundancy runtime (Phase 6.3 follow-ups)
- src/.../Server/Hosting/RedundancyPublisherHostedService.cs: HostedService
that owns the RedundancyStatePublisher lifecycle + wires peer reachability.
- src/.../Server/Redundancy/ServerRedundancyNodeWriter.cs: OPC UA
variable-node writer binding ServiceLevel + ServerUriArray to the
publisher's events.
- src/.../Server/Program.cs + Server.csproj: hosted-service registration.
- tests/.../Server.Tests/ServerRedundancyNodeWriterTests.cs +
Server.Tests.csproj: coverage for the above.
Configuration
- src/.../Configuration/Validation/DraftValidator.cs +
tests/.../Configuration.Tests/DraftValidatorTests.cs: draft-validation
refinements.
E2E scripts (shared infrastructure)
- scripts/e2e/README.md + _common.ps1 + test-all.ps1: shared helpers + the
all-drivers test-all runner.
- scripts/e2e/test-opcuaclient.ps1: OPC UA Client e2e runner.
Docs
- docs/v2/implementation/phase-6-{1,2,3,4}*.md + exit-gate-phase-{3,7}.md:
phase-gate + implementation doc updates.
- docs/v2/plan.md: top-level plan refresh.
- docs/v2/redundancy-interop-playbook.md: client interop playbook for the
Phase 6.3 redundancy-runtime work.
Two orphan FOCAS docs remain on disk but are deliberately left unstaged —
docs/v2/focas-deployment.md and docs/v2/implementation/focas-simulator-plan.md
describe the now-retired Tier-C topology and should either be rewritten
or deleted in a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Redundancy Interop Playbook (Phase 6.3 Stream F — task #150)

> **Scope**: manual validation that third-party OPC UA clients + AVEVA MXAccess
> observe our non-transparent redundancy signals (ServiceLevel, ServerUriArray,
> RedundancySupport) and fail over to the Backup node when the Primary drops.
>
> **Why manual**: the third-party clients named here are Windows-GUI binaries
> (UaExpert, Kepware QuickClient) or embedded inside AVEVA System Platform.
> Automating any of them into PR-CI is out of scope for v2. This playbook
> captures the minimal dev-box-plus-VM setup and the expected pass criteria so
> the work can be executed repeatably at v2 release readiness and after any
> Phase 6.3 follow-up change.

## Prerequisites

1. Two `OtOpcUa.Server` nodes in one `ServerCluster`:
   - Declared as `NodeCount = 2`, `RedundancyMode = Hot` (or `Warm`).
   - Each with a distinct `ApplicationUri` (enforced by unique index per
     decision #86).
   - Each node's `StaticRoutes.xml` points at the other (`ServerCluster.Node[].Host`).
2. `scripts/install/Install-Services.ps1` applied on each node so the
   `RedundancyPublisherHostedService` is running.
3. At least one `DriverInstance` with a reachable simulator or PLC so both
   servers have a non-empty address space to browse.
4. On the client host:
   - `UaExpert` ≥ 1.7 installed
   - Kepware `ClientAce QuickClient` (or equivalent) — optional, for a second
     client
5. For the AVEVA leg: a `Galaxy.Host` running against an MXAccess deployment
   with an external OPC UA client object pointed at the cluster (not at a
   single node).

## Expected signals on a running cluster

| Node | `ServiceLevel` | `RedundancySupport` | `ServerUriArray` |
|---|---|---|---|
| Primary, healthy, peer reachable | 200 | `Hot` (or `Warm`) | self + peer |
| Primary, mid-apply | 75 (`PrimaryMidApply`) | same | same |
| Primary, peer unreachable | 150 (`PrimaryPeerDown`) | same | same |
| Backup, healthy | 100 (`Secondary`) | same | same |
| Either, dwelling in recovery | 50 (`Recovering`) | same | same |
| Either, invariant violation (two Primaries or a disabled-node mismatch) | 2 (`InvalidTopology`) | same | same |

(The band constants live in `ServiceLevelCalculator.Classify`.)
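
For orientation, the table collapses to a simple byte banding. The sketch below restates it in C#; only `ServiceLevelCalculator.Classify` is a name from the repo, and the `NodeCondition` enum plus the class wrapper are illustrative assumptions, not the calculator's actual input type.

```csharp
// Sketch only: the real band constants live in ServiceLevelCalculator.Classify.
// NodeCondition and this class are illustrative assumptions; the byte values
// are taken from the table above.
public enum NodeCondition
{
    PrimaryHealthy,
    PrimaryMidApply,
    PrimaryPeerDown,
    SecondaryHealthy,
    Recovering,
    InvalidTopology,
}

public static class ServiceLevelBandsSketch
{
    public static byte Classify(NodeCondition condition) => condition switch
    {
        NodeCondition.PrimaryHealthy   => 200, // healthy Primary, peer reachable
        NodeCondition.PrimaryPeerDown  => 150, // Primary up, peer probe failing
        NodeCondition.SecondaryHealthy => 100, // healthy Backup
        NodeCondition.PrimaryMidApply  => 75,  // generation apply in flight
        NodeCondition.Recovering       => 50,  // dwelling in recovery
        NodeCondition.InvalidTopology  => 2,   // e.g. two Primaries at once
        _ => 0,
    };
}
```
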
## Test matrix

Each row is one manual run; the pass criterion is in the right-hand column.

### Block A — UA protocol signals (UaExpert)

| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| A1 | ServiceLevel published | Connect UaExpert to the Primary. Browse to `Server.ServerStatus.ServiceLevel`. | Value = 200 (or the expected band byte per the table above) |
| A2 | ServiceLevel updates on peer down | Connect to the Primary. Stop the Backup (`sc stop OtOpcUa`). Watch `ServiceLevel`. | Transitions 200 → 150 within ~2 s of the peer-probe timeout |
| A3 | RedundancySupport | Browse to `Server.ServerRedundancy.RedundancySupport`. | Value matches the declared `RedundancyMode` (Warm / Hot / None) |
| A4 | ServerUriArray (non-transparent upgrade) | Requires the redundancy-object-type upgrade follow-up. | When the upgrade lands: `ServerUriArray` reports both ApplicationUris, self first |
| A5 | Mid-apply dip | On the Primary, trigger an `sp_PublishGeneration` apply. | `ServiceLevel` drops to 75 for the apply duration plus dwell |

### Block B — Client failover

| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| B1 | UaExpert picks Primary by ServiceLevel | In UaExpert, configure a Redundancy Group with both endpoint URLs. | Client picks the Primary URL (higher ServiceLevel) |
| B2 | UaExpert cuts over on Primary kill | Kill the Primary's `OtOpcUa` service. | Client session reconnects to the Backup within UaExpert's reconnect timeout (default 5 s). Data-change monitored items resume. |
| B3 | UaExpert cuts back when Primary returns | Start the Primary service. Wait ≥ the recovery dwell (see `RecoveryStateManager.DwellTime`). | `ServiceLevel` on the returning Primary goes through 50 (Recovering) → 200; UaExpert may or may not switch back (client-policy dependent; both outcomes are accepted) |
| B4 | Kepware QuickClient failover | Repeat B1–B3 with Kepware in place of UaExpert. | Same pass criteria; establishes we're not UaExpert-specific |

### Block C — Galaxy MXAccess failover

This block validates that an AVEVA System Platform app consuming our cluster
via MXAccess tolerates a Primary drop the same way a native OPC UA client does.
The MXAccess toolkit internally wraps the OPC UA Client and does its own
redundancy negotiation; we're asserting that the negotiation honors our
`ServiceLevel` signal.

| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| C1 | Galaxy binds to the Primary on first connect | Bring the cluster up. Start a Galaxy `$MxAccessClient` object pointed at the cluster with both node URLs. | Galaxy reports `QUALITY = Good` + initial values from the Primary |
| C2 | Galaxy redirects on Primary drop | Stop the Primary. | Galaxy's `QUALITY` briefly goes `Uncertain`, then back to `Good`; values continue streaming from the Backup within MXAccess's `ReconnectInterval` (default 20 s) |
| C3 | Galaxy handles the mid-apply dip | Trigger a generation apply on the Primary. | Galaxy continues reading — the mid-apply dip is advisory (ServiceLevel 75), not a session drop; MXAccess should stay bound |

## Recording results

Copy the tables above into a tracking doc per run. The tracking-doc shape:

```
Run date: 2026-MM-DD
Cluster: <id>  Primary: <node>  Backup: <node>  Release: <sha>
A1: PASS  evidence: UaExpert screenshot uaexpert-a1.png
A2: PASS  evidence: ServiceLevel trend grafana-a2.png
…
```

One pass of every row is the acceptance criterion. Re-run after any Phase 6.3
follow-up ships (especially the non-transparent redundancy-type upgrade, which
flips A4 from "deferred" to "expected pass").

## Known limitations

- **A4 pending**: `Server.ServerRedundancy` on our current SDK build lands as
  the base `ServerRedundancyState`, which has no `ServerUriArray` child.
  `ServerRedundancyNodeWriter.ApplyServerUriArray` logs-and-skips until the
  redundancy-object-type upgrade follow-up lands (see the sketch after this
  list).
- **Recovery dwell default**: `RecoveryStateManager.DwellTime` defaults to 60 s
  in `Program.cs`. Adjust via a future config knob if B3 takes too long to
  observe.
- **C-block external dependency**: the `Galaxy.Host` side of the redundancy
  story is largely out of our code — it's MXAccess's own client-redundancy
  policy talking to our published ServiceLevel. A negative result on C1-C3
  does not necessarily indicate an OtOpcUa bug; cross-check with UaExpert
  (Block A / B) first.
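
For concreteness, here is a hedged sketch of the A4 log-and-skip guard referenced above. `NonTransparentRedundancyState` and its `ServerUriArray` child are real types in the OPC Foundation stack; the wrapper class, method body, and logger wiring are assumptions, not the repo's actual `ServerRedundancyNodeWriter`.

```csharp
using Microsoft.Extensions.Logging;
using Opc.Ua;

// Assumed shape of the log-and-skip guard; not the repo's actual code.
public sealed class ServerUriArraySketch
{
    private readonly ILogger _logger;

    public ServerUriArraySketch(ILogger logger) => _logger = logger;

    public void ApplyServerUriArray(ServerRedundancyState node, string[] serverUris)
    {
        // On SDK builds where Server.ServerRedundancy lands as the base state
        // class, there is no ServerUriArray child to write: log and skip.
        if (node is not NonTransparentRedundancyState nonTransparent)
        {
            _logger.LogWarning(
                "ServerRedundancy is the base ServerRedundancyState; skipping " +
                "ServerUriArray until the redundancy-object-type upgrade lands.");
            return;
        }

        // Self first, then peer, per the expected-signals table.
        nonTransparent.ServerUriArray.Value = serverUris;
    }
}
```
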
## Automation notes (why this is a playbook, not a test)

- UaExpert and Kepware binaries are closed-source Windows GUIs; they don't
  ship headless CLIs for the browse/connect/subscribe flows.
- The OPC Foundation reference SDK *can* drive every scenario, but our own
  `Driver.OpcUaClient` tests already cover that client's behaviour; Block B
  adds value specifically because these two clients have independent
  redundancy implementations we don't control.
- For the subset of scenarios that *can* be automated — the self-loopback
  case where our own `otopcua-cli` drives Primary + Backup — the existing
  `tests/ZB.MOM.WW.OtOpcUa.Server.Tests/RedundancyStatePublisherTests` +
  `ServiceLevelCalculatorTests` (unit) + `ClusterTopologyLoaderTests`
  (integration) already cover the math + data path. The wire-level assertion
  that the values actually land on the right OPC UA nodes is covered by
  `ServerRedundancyNodeWriterTests`.
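
For a scripted sanity check of the A1/A3 reads without a GUI client, a minimal sketch against the OPC Foundation reference SDK (NuGet `OPCFoundation.NetStandard.Opc.Ua.Client`) follows. It is not part of the repo or of the pass criteria above; the endpoint URL and application name are placeholders, and security is dropped to None for brevity, which is not how the playbook clusters are deployed.

```csharp
using Opc.Ua;
using Opc.Ua.Client;

// Minimal client configuration; no application certificate is provisioned
// because we connect with SecurityPolicy None. AutoAccept is for lab use only.
var config = new ApplicationConfiguration
{
    ApplicationName = "ServiceLevelSpotCheck",
    ApplicationUri = "urn:servicelevel-spotcheck",
    ApplicationType = ApplicationType.Client,
    SecurityConfiguration = new SecurityConfiguration
    {
        ApplicationCertificate = new CertificateIdentifier(),
        AutoAcceptUntrustedCertificates = true,
    },
    ClientConfiguration = new ClientConfiguration { DefaultSessionTimeout = 60_000 },
    TransportQuotas = new TransportQuotas(),
};
await config.Validate(ApplicationType.Client);

// Placeholder endpoint URL; point this at the node under test.
var endpoint = CoreClientUtils.SelectEndpoint(
    config, "opc.tcp://primary-node:4840", useSecurity: false);
var configured = new ConfiguredEndpoint(null, endpoint, EndpointConfiguration.Create(config));

using var session = await Session.Create(
    config, configured, false, "service-level-spot-check", 60_000,
    new UserIdentity(new AnonymousIdentityToken()), null);

// A1: ServiceLevel is a Byte; 200 expected on a healthy Primary.
var serviceLevel = (byte)session.ReadValue(VariableIds.Server_ServerStatus_ServiceLevel).Value;

// A3: RedundancySupport should match the declared RedundancyMode (Hot/Warm).
var redundancy = (RedundancySupport)(int)session
    .ReadValue(VariableIds.Server_ServerRedundancy_RedundancySupport).Value;

Console.WriteLine($"ServiceLevel={serviceLevel} RedundancySupport={redundancy}");
```

Run it against both nodes during an A2 drill; the Primary's reported value should move from 200 to 150 within the peer-probe timeout.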