Catch-all commit for pending work on the task-galaxy-e2e branch that
wasn't part of the FOCAS migration. Grouping by topic so future per-topic
commits can be cherry-picked if needed.
TwinCAT
- src/.../Driver.TwinCAT/AdsTwinCATClient.cs + TwinCATDriverFactoryExtensions.cs:
factory-registration extensions + ADS client refinements.
- src/.../Driver.TwinCAT.Cli/Commands/BrowseCommand.cs: new browse command
for the TwinCAT test-client CLI.
- tests/.../Driver.TwinCAT.IntegrationTests/TwinCAT3SmokeTests.cs + TwinCatProject/:
fixture scaffold with a minimal POU + README pointing at the TCBSD/ESXi
VM for e2e.
- docs/Driver.TwinCAT.Cli.md + docs/drivers/TwinCAT-Test-Fixture.md:
documentation for the above.
- docs/v3/twincat-backlog.md: forward-looking backlog seed.
Admin UI + fleet status
- src/.../Admin/Components/Pages/Clusters/DriversTab.razor + Hosts.razor:
UI refresh for fleet-status rendering.
- src/.../Admin/Hubs/FleetStatusHub.cs + FleetStatusPoller.cs +
Admin/Program.cs: SignalR hub + poller plumbing for live fleet data.
- tests/.../Admin.Tests/FleetStatusPollerTests.cs: poller coverage.
Server + redundancy runtime (Phase 6.3 follow-ups)
- src/.../Server/Hosting/RedundancyPublisherHostedService.cs: HostedService
that owns the RedundancyStatePublisher lifecycle + wires peer reachability.
- src/.../Server/Redundancy/ServerRedundancyNodeWriter.cs: OPC UA
variable-node writer binding ServiceLevel + ServerUriArray to the
publisher's events.
- src/.../Server/Program.cs + Server.csproj: hosted-service registration.
- tests/.../Server.Tests/ServerRedundancyNodeWriterTests.cs +
Server.Tests.csproj: coverage for the above.
Configuration
- src/.../Configuration/Validation/DraftValidator.cs +
tests/.../Configuration.Tests/DraftValidatorTests.cs: draft-validation
refinements.
E2E scripts (shared infrastructure)
- scripts/e2e/README.md + _common.ps1 + test-all.ps1: shared helpers + the
all-drivers test-all runner.
- scripts/e2e/test-opcuaclient.ps1: OPC UA Client e2e runner.
Docs
- docs/v2/implementation/phase-6-{1,2,3,4}*.md + exit-gate-phase-{3,7}.md:
phase-gate + implementation doc updates.
- docs/v2/plan.md: top-level plan refresh.
- docs/v2/redundancy-interop-playbook.md: client interop playbook for the
Phase 6.3 redundancy-runtime work.
Two orphan FOCAS docs remain on disk but deliberately unstaged —
docs/v2/focas-deployment.md and docs/v2/implementation/focas-simulator-plan.md
describe the now-retired Tier-C topology and should either be rewritten
or deleted in a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Redundancy Interop Playbook (Phase 6.3 Stream F — task #150)

**Scope:** manual validation that third-party OPC UA clients + AVEVA MXAccess observe our non-transparent redundancy signals (`ServiceLevel`, `ServerUriArray`, `RedundancySupport`) and fail over to the Backup node when the Primary drops.

**Why manual:** the third-party clients named here are Windows-GUI binaries (UaExpert, Kepware QuickClient) or embedded inside AVEVA System Platform. Automating any of them into PR-CI is out of scope for v2. This playbook captures the minimal dev-box-plus-VM setup and the expected pass criteria so the work can be executed repeatably at v2 release readiness and after any Phase 6.3 follow-up change.
## Prerequisites

- Two `OtOpcUa.Server` nodes in one `ServerCluster` (a hypothetical shape is sketched after this list):
  - Declared as `NodeCount = 2`, `RedundancyMode = Hot` (or `Warm`).
  - Each with a distinct `ApplicationUri` (enforced by unique index per decision #86).
  - Each node's `StaticRoutes.xml` points at the other (`ServerCluster.Node[].Host`).
- `scripts/install/Install-Services.ps1` applied on each node so the `RedundancyPublisherHostedService` is running.
- At least one `DriverInstance` with a reachable simulator or PLC so both servers have a non-empty address space to browse.
- On the client host:
  - UaExpert ≥ 1.7 installed.
  - Kepware `ClientAce QuickClient` (or equivalent), optional, for a second client.
- For the AVEVA leg: a `Galaxy.Host` running against an MXAccess deployment with an external OPC UA client object pointed at the cluster (not at a single node).
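For orientation, a minimal sketch of the topology those bullets describe, written as C# records. The type and property names (`ClusterNode`, `PeerHost`) and the example URIs/hosts are assumptions for illustration, not the actual Configuration schema:

```csharp
// Hypothetical shape only: these records mirror the prose above and are NOT
// the real Configuration schema; all names, URIs, and hosts are illustrative.
var cluster = new ServerCluster(
    NodeCount: 2,
    RedundancyMode: RedundancyMode.Hot, // or RedundancyMode.Warm
    Nodes: new[]
    {
        // Distinct ApplicationUri per node (unique-index enforced); each
        // node's StaticRoutes.xml points at the *other* node's host.
        new ClusterNode("urn:example:otopcua:node-a", PeerHost: "node-b.example.local"),
        new ClusterNode("urn:example:otopcua:node-b", PeerHost: "node-a.example.local"),
    });

enum RedundancyMode { None, Warm, Hot }
record ClusterNode(string ApplicationUri, string PeerHost);
record ServerCluster(int NodeCount, RedundancyMode RedundancyMode, ClusterNode[] Nodes);
```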
## Expected signals on a running cluster

| Node | ServiceLevel | RedundancySupport | ServerUriArray |
|---|---|---|---|
| Primary, healthy, peer reachable | 200 | Hot (or Warm) | self + peer |
| Primary, mid-apply | 75 (PrimaryMidApply) | same | same |
| Primary, peer unreachable | 150 (PrimaryPeerDown) | same | same |
| Backup, healthy | 100 (Secondary) | same | same |
| Either, dwelling in recovery | 50 (Recovering) | same | same |
| Either, invariant violation (two Primaries, disabled-node mismatch) | 2 (InvalidTopology) | same | same |

(The band constants live in `ServiceLevelCalculator.Classify`.)
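To make the bands concrete, here is a minimal C# sketch of that classification. The enum, member names, and argument list are assumptions for illustration; the authoritative mapping is `ServiceLevelCalculator.Classify`:

```csharp
// Sketch of the band mapping in the table above. Names and the assumed
// precedence (invariants > recovery dwell > role > mid-apply > peer
// reachability) are illustrative, not the production implementation.
public enum ServiceLevelBand : byte
{
    InvalidTopology = 2,
    Recovering = 50,
    PrimaryMidApply = 75,
    Secondary = 100,
    PrimaryPeerDown = 150,
    PrimaryHealthy = 200,
}

public static class ServiceLevelBands
{
    public static byte Classify(
        bool isPrimary, bool peerReachable, bool midApply,
        bool inRecoveryDwell, bool invariantViolated)
    {
        if (invariantViolated) return (byte)ServiceLevelBand.InvalidTopology;
        if (inRecoveryDwell)   return (byte)ServiceLevelBand.Recovering;
        if (!isPrimary)        return (byte)ServiceLevelBand.Secondary;
        if (midApply)          return (byte)ServiceLevelBand.PrimaryMidApply;
        return peerReachable
            ? (byte)ServiceLevelBand.PrimaryHealthy
            : (byte)ServiceLevelBand.PrimaryPeerDown;
    }
}
```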
## Test matrix

Each row is one manual run; the pass criterion is in the right-hand column.
### Block A — UA protocol signals (UaExpert)

| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| A1 | ServiceLevel published | Connect UaExpert to the Primary. Browse to `Server.ServerStatus.ServiceLevel`. | Value = 200 (or the expected band byte per the table above) |
| A2 | ServiceLevel updates on peer down | Connect to the Primary. Stop the Backup (`sc stop OtOpcUa`). Watch ServiceLevel. | Transitions 200 → 150 within ~2 s of the peer-probe timeout |
| A3 | RedundancySupport | Browse to `Server.ServerRedundancy.RedundancySupport`. | Value matches the declared `RedundancyMode` (Warm / Hot / None) |
| A4 | ServerUriArray (non-transparent upgrade) | Requires the redundancy-object-type upgrade follow-up. | When the upgrade lands: ServerUriArray reports both ApplicationUris, self first |
| A5 | Mid-apply dip | On the Primary, trigger an `sp_PublishGeneration` apply. | ServiceLevel drops to 75 for the apply duration + dwell |
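Where a quick non-GUI spot check of the published value is useful alongside A1/A2, the OPC Foundation .NET client can read the standard ServiceLevel node by its well-known id. A sketch, assuming an already-open `Opc.Ua.Client.Session` (endpoint and security setup elided):

```csharp
using Opc.Ua;
using Opc.Ua.Client;

// Sketch: spot-check the A1 pass criterion without a GUI. Assumes `session`
// is an already-open Opc.Ua.Client.Session against the node under test.
public static class ServiceLevelProbe
{
    public static byte Read(Session session)
    {
        // ServiceLevel is the standard node ns=0;i=2267; the Foundation SDK
        // exposes it as a well-known VariableId. The value is the band byte
        // from the "Expected signals" table.
        DataValue dv = session.ReadValue(VariableIds.Server_ServiceLevel);
        return (byte)dv.Value;
    }
}
```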
### Block B — Client failover

| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| B1 | UaExpert picks Primary by ServiceLevel | In UaExpert, configure a Redundancy Group with both endpoint URLs. | Client picks the Primary URL (higher ServiceLevel) |
| B2 | UaExpert cuts over on Primary kill | Kill the Primary's `OtOpcUa` service. | Client session reconnects to the Backup within UaExpert's reconnect timeout (default 5 s); data-change monitored items resume |
| B3 | UaExpert cuts back when Primary returns | Start the Primary service. Wait ≥ the recovery dwell (see `RecoveryStateManager.DwellTime`). | ServiceLevel on the returning Primary goes through 50 (Recovering) → 200; UaExpert may or may not switch back (client-policy dependent; both are accepted outcomes) |
| B4 | Kepware QuickClient failover | Repeat B1–B3 with Kepware in place of UaExpert. | Same pass criteria; establishes we're not UaExpert-specific |
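For B2/B3 evidence, a subscription that logs every ServiceLevel transition with timestamps can be easier than watching a GUI. Again a sketch over an open session, not project code:

```csharp
using System;
using Opc.Ua;
using Opc.Ua.Client;

// Sketch: log ServiceLevel transitions during B2/B3 instead of eyeballing
// UaExpert. Assumes `session` is an open Opc.Ua.Client.Session.
public static class ServiceLevelWatcher
{
    public static void Attach(Session session)
    {
        var subscription = new Subscription(session.DefaultSubscription)
        {
            PublishingInterval = 500, // ms
        };

        var item = new MonitoredItem(subscription.DefaultItem)
        {
            StartNodeId = VariableIds.Server_ServiceLevel,
            AttributeId = Attributes.Value,
            SamplingInterval = 500, // ms
        };
        item.Notification += (monitoredItem, _) =>
        {
            // Each dequeued value is one observed transition sample.
            foreach (DataValue dv in monitoredItem.DequeueValues())
                Console.WriteLine($"{dv.SourceTimestamp:O} ServiceLevel={dv.Value}");
        };

        subscription.AddItem(item);
        session.AddSubscription(subscription);
        subscription.Create();
    }
}
```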
### Block C — Galaxy MXAccess failover

This block validates that an AVEVA System Platform app consuming our cluster
via MXAccess tolerates a Primary drop the same way a native OPC UA client does.
The MXAccess toolkit internally wraps the OPC UA client and does its own
redundancy negotiation; we're asserting that negotiation honors our
ServiceLevel signal.

| # | Scenario | Procedure | Pass criterion |
|---|---|---|---|
| C1 | Galaxy binds to Primary on first connect | Bring the cluster up. Start a Galaxy `$MxAccessClient` object pointed at the cluster with both node URLs. | Galaxy reports QUALITY = Good + initial values from the Primary |
| C2 | Galaxy redirects on Primary drop | Stop the Primary. | Galaxy's QUALITY briefly goes Uncertain, then back to Good; values continue streaming from the Backup within MXAccess's ReconnectInterval (default 20 s) |
| C3 | Galaxy handles mid-apply dip | Trigger a generation apply on the Primary. | Galaxy continues reading; the mid-apply dip is advisory (ServiceLevel 75), not a session drop, so MXAccess should stay bound |
## Recording results

Copy the tables above into a tracking doc per run. The tracking-doc shape:

    Run date: 2026-MM-DD
    Cluster: <id>   Primary: <node>   Backup: <node>   Release: <sha>
    A1: PASS   evidence: UaExpert screenshot uaexpert-a1.png
    A2: PASS   evidence: ServiceLevel trend grafana-a2.png
    …
One pass of every row is the acceptance criterion. Re-run after any Phase 6.3 follow-up ships (especially the non-transparent redundancy-type upgrade, which flips A4 from "deferred" to "expected pass").
## Known limitations

- A4 pending: `Server.ServerRedundancy` on our current SDK build lands as the base `ServerRedundancyState`, which has no `ServerUriArray` child. `ServerRedundancyNodeWriter.ApplyServerUriArray` logs-and-skips until the redundancy-object-type upgrade follow-up lands.
- Recovery dwell default: `RecoveryStateManager.DwellTime` defaults to 60 s in `Program.cs`. Adjust via a future config knob if B3 takes too long to observe.
- C-block external dependency: the `Galaxy.Host` side of the redundancy story is largely out of our code; it's MXAccess's own client-redundancy policy talking to our published ServiceLevel. A negative result on C1–C3 does not necessarily indicate an OtOpcUa bug; cross-check with UaExpert (Blocks A/B) first.
## Automation notes (why this is a playbook, not a test)

- UaExpert and Kepware binaries are closed-source Windows GUIs; they don't ship headless CLIs for the browse/connect/subscribe flows.
- The OPC Foundation reference SDK can drive every scenario, but our own `Driver.OpcUaClient` tests already cover that client's behaviour; Block B adds value specifically because these two clients have independent redundancy implementations we don't control.
- For the subset of scenarios that can be automated (the self-loopback case where our own `otopcua-cli` drives Primary + Backup), the existing `tests/ZB.MOM.WW.OtOpcUa.Server.Tests/RedundancyStatePublisherTests` + `ServiceLevelCalculatorTests` (unit) + `ClusterTopologyLoaderTests` (integration) already cover the math + data path. The wire-level assertion that the values actually land on the right OPC UA nodes is covered by `ServerRedundancyNodeWriterTests`.