lmxopcua/docs/v2/redundancy-interop-playbook.md
Joseph Doherty 69e0d02c72 task-galaxy-e2e branch — non-FOCAS work-in-progress snapshot
Catch-all commit for pending work on the task-galaxy-e2e branch that
wasn't part of the FOCAS migration. Grouping by topic so future per-topic
commits can be cherry-picked if needed.

TwinCAT
- src/.../Driver.TwinCAT/AdsTwinCATClient.cs + TwinCATDriverFactoryExtensions.cs:
  factory-registration extensions + ADS client refinements.
- src/.../Driver.TwinCAT.Cli/Commands/BrowseCommand.cs: new browse command
  for the TwinCAT test-client CLI.
- tests/.../Driver.TwinCAT.IntegrationTests/TwinCAT3SmokeTests.cs + TwinCatProject/:
  fixture scaffold with a minimal POU + README pointing at the TCBSD/ESXi
  VM for e2e.
- docs/Driver.TwinCAT.Cli.md + docs/drivers/TwinCAT-Test-Fixture.md:
  documentation for the above.
- docs/v3/twincat-backlog.md: forward-looking backlog seed.

Admin UI + fleet status
- src/.../Admin/Components/Pages/Clusters/DriversTab.razor + Hosts.razor:
  UI refresh for fleet-status rendering.
- src/.../Admin/Hubs/FleetStatusHub.cs + FleetStatusPoller.cs +
  Admin/Program.cs: SignalR hub + poller plumbing for live fleet data.
- tests/.../Admin.Tests/FleetStatusPollerTests.cs: poller coverage.

Server + redundancy runtime (Phase 6.3 follow-ups)
- src/.../Server/Hosting/RedundancyPublisherHostedService.cs: HostedService
  that owns the RedundancyStatePublisher lifecycle + wires peer reachability.
- src/.../Server/Redundancy/ServerRedundancyNodeWriter.cs: OPC UA
  variable-node writer binding ServiceLevel + ServerUriArray to the
  publisher's events.
- src/.../Server/Program.cs + Server.csproj: hosted-service registration.
- tests/.../Server.Tests/ServerRedundancyNodeWriterTests.cs +
  Server.Tests.csproj: coverage for the above.

Configuration
- src/.../Configuration/Validation/DraftValidator.cs +
  tests/.../Configuration.Tests/DraftValidatorTests.cs: draft-validation
  refinements.

E2E scripts (shared infrastructure)
- scripts/e2e/README.md + _common.ps1 + test-all.ps1: shared helpers + the
  all-drivers test-all runner.
- scripts/e2e/test-opcuaclient.ps1: OPC UA Client e2e runner.

Docs
- docs/v2/implementation/phase-6-{1,2,3,4}*.md + exit-gate-phase-{3,7}.md:
  phase-gate + implementation doc updates.
- docs/v2/plan.md: top-level plan refresh.
- docs/v2/redundancy-interop-playbook.md: client interop playbook for the
  Phase 6.3 redundancy-runtime work.

Two orphan FOCAS docs remain on disk but deliberately unstaged —
docs/v2/focas-deployment.md and docs/v2/implementation/focas-simulator-plan.md
describe the now-retired Tier-C topology and should either be rewritten
or deleted in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:12:19 -04:00


# Redundancy Interop Playbook (Phase 6.3 Stream F — task #150)

**Scope:** manual validation that third-party OPC UA clients + AVEVA MXAccess observe our non-transparent redundancy signals (ServiceLevel, ServerUriArray, RedundancySupport) and fail over to the Backup node when the Primary drops.

**Why manual:** the third-party clients named here are Windows-GUI binaries (UaExpert, Kepware QuickClient) or embedded inside AVEVA System Platform. Automating any of them into PR-CI is out of scope for v2. This playbook captures the minimal dev-box-plus-VM setup and the expected pass criteria so the work can be executed repeatably at v2 release readiness and after any Phase 6.3 follow-up change.

## Prerequisites

1. Two OtOpcUa.Server nodes in one ServerCluster:
   - Declared as NodeCount = 2, RedundancyMode = Hot (or Warm).
   - Each with a distinct ApplicationUri (enforced by a unique index per decision #86).
   - Each node's StaticRoutes.xml points at the other (ServerCluster.Node[].Host).
2. scripts/install/Install-Services.ps1 applied on each node so the RedundancyPublisherHostedService is running.
3. At least one DriverInstance with a reachable simulator or PLC, so both servers have a non-empty address space to browse.
4. On the client host:
   - UaExpert ≥ 1.7 installed.
   - Kepware ClientAce QuickClient (or equivalent) — optional, as a second client.
5. For the AVEVA leg: a Galaxy.Host running against an MXAccess deployment with an external OPC UA client object pointed at the cluster (not at a single node).

## Expected signals on a running cluster

| Node state | ServiceLevel | RedundancySupport | ServerUriArray |
| --- | --- | --- | --- |
| Primary, healthy, peer reachable | 200 | Hot (or Warm) | self + peer |
| Primary, mid-apply | 75 (PrimaryMidApply) | same | same |
| Primary, peer unreachable | 150 (PrimaryPeerDown) | same | same |
| Backup, healthy | 100 (Secondary) | same | same |
| Either, dwelling in recovery | 50 (Recovering) | same | same |
| Either, invariant violation (two Primaries, disabled-node mismatch) | 2 (InvalidTopology) | same | same |

(The band constants live in ServiceLevelCalculator.Classify.)
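
The production classification lives in C# (ServiceLevelCalculator.Classify); the sketch below mirrors the band precedence in Python purely for reference while eyeballing trends during a run. The band names, the function signature, and the precedence order are illustrative assumptions — only the byte values come from the table above.

```python
# Hypothetical mirror of ServiceLevelCalculator.Classify, for reference only.
# Byte values are from the signals table above; everything else is illustrative.
from enum import IntEnum

class Band(IntEnum):
    INVALID_TOPOLOGY  = 2    # invariant violation (two Primaries, disabled-node mismatch)
    RECOVERING        = 50   # either role, dwelling in recovery
    PRIMARY_MID_APPLY = 75   # Primary while a generation apply is in flight
    SECONDARY         = 100  # healthy Backup
    PRIMARY_PEER_DOWN = 150  # Primary, peer unreachable
    PRIMARY_HEALTHY   = 200  # Primary, healthy, peer reachable

def classify(is_primary: bool, peer_reachable: bool,
             mid_apply: bool, recovering: bool, invalid_topology: bool) -> Band:
    """Return the ServiceLevel band for one node, most severe condition first."""
    if invalid_topology:
        return Band.INVALID_TOPOLOGY
    if recovering:
        return Band.RECOVERING
    if not is_primary:
        return Band.SECONDARY
    if mid_apply:
        return Band.PRIMARY_MID_APPLY
    if not peer_reachable:
        return Band.PRIMARY_PEER_DOWN
    return Band.PRIMARY_HEALTHY

print(int(classify(True, False, False, False, False)))  # 150: Primary with peer down
```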

## Test matrix

Each row is one manual run; pass criterion in the right column.

### Block A — UA protocol signals (UaExpert)

| # | Scenario | Procedure | Pass criterion |
| --- | --- | --- | --- |
| A1 | ServiceLevel published | Connect UaExpert to Primary. Browse to Server.ServerStatus.ServiceLevel. | Value = 200 (or the expected band byte per the table above) |
| A2 | ServiceLevel updates on peer down | Connect to Primary. Stop the Backup (sc stop OtOpcUa). Watch ServiceLevel. | Transitions 200 → 150 within ~2 s of the peer-probe timeout |
| A3 | RedundancySupport published | Browse to Server.ServerRedundancy.RedundancySupport. | Value matches the declared RedundancyMode (Warm / Hot / None) |
| A4 | ServerUriArray (non-transparent upgrade) | Requires the redundancy-object-type upgrade follow-up. | When the upgrade lands: ServerUriArray reports both ApplicationUris, self first |
| A5 | Mid-apply dip | On the Primary, trigger a sp_PublishGeneration apply. | ServiceLevel drops to 75 for the apply duration + dwell |
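
A2's pass criterion is a timing assertion. When the ServiceLevel trend is exported as (seconds, value) samples — e.g. from the Grafana panel used as evidence — a small helper can check the window mechanically. The helper and the sample data are hypothetical, not part of the codebase:

```python
# Hypothetical helper for A2: confirm ServiceLevel moved 200 -> 150 within
# the expected window after the last healthy sample. Illustrative only.
def transition_within(samples, before, after, window_s):
    """True if the value goes from `before` to `after` in <= window_s seconds."""
    t_last_before = None
    for t, level in samples:
        if level == before:
            t_last_before = t
        elif level == after and t_last_before is not None:
            return (t - t_last_before) <= window_s
    return False

# Samples trended while running A2 (made-up data):
samples = [(0.0, 200), (1.0, 200), (2.4, 150), (3.0, 150)]
print(transition_within(samples, before=200, after=150, window_s=2.0))  # True: 1.4 s gap
```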

### Block B — Client failover

| # | Scenario | Procedure | Pass criterion |
| --- | --- | --- | --- |
| B1 | UaExpert picks Primary by ServiceLevel | In UaExpert, configure a Redundancy Group with both endpoint URLs. | Client picks the Primary URL (higher ServiceLevel) |
| B2 | UaExpert cuts over on Primary kill | Kill the Primary's OtOpcUa service. | Client session reconnects to the Backup within UaExpert's reconnect timeout (default 5 s); data-change monitored items resume |
| B3 | UaExpert cuts back when Primary returns | Start the Primary service. Wait ≥ the recovery dwell (see RecoveryStateManager.DwellTime). | ServiceLevel on the returning Primary goes through 50 (Recovering) → 200; UaExpert may or may not switch back (client-policy dependent; both are accepted outcomes) |
| B4 | Kepware QuickClient failover | Repeat B1–B3 with Kepware in place of UaExpert. | Same pass criteria; establishes we're not UaExpert-specific |
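
B1 asserts a selection policy, not any UaExpert internals: read ServiceLevel from every configured endpoint and connect to the highest. A minimal Python sketch of that policy, with made-up endpoint URLs and a plain dict standing in for live OPC UA reads of each server's ServiceLevel:

```python
# Hypothetical sketch of the redundancy-group selection policy B1 expects.
# The dict stands in for live ServiceLevel reads; URLs are made up.
levels = {
    "opc.tcp://node-a:4840": 200,  # Primary, healthy
    "opc.tcp://node-b:4840": 100,  # Backup (Secondary band)
}

def pick_endpoint(service_levels):
    """Choose the endpoint advertising the highest ServiceLevel."""
    return max(service_levels, key=service_levels.get)

print(pick_endpoint(levels))  # opc.tcp://node-a:4840
```

After a Primary kill (B2), node-a's entry disappears or its ServiceLevel collapses, and the same policy lands on node-b.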

### Block C — Galaxy MXAccess failover

This block validates that an AVEVA System Platform app consuming our cluster via MXAccess tolerates a Primary drop the same way a native OPC UA client does. The MXAccess toolkit internally wraps the OPC UA Client and does its own redundancy negotiation; we're asserting that negotiation honors our ServiceLevel signal.

| # | Scenario | Procedure | Pass criterion |
| --- | --- | --- | --- |
| C1 | Galaxy binds to Primary on first connect | Bring the cluster up. Start a Galaxy $MxAccessClient object pointed at the cluster with both node URLs. | Galaxy reports QUALITY = Good + initial values from the Primary |
| C2 | Galaxy redirects on Primary drop | Stop the Primary. | Galaxy's QUALITY briefly goes Uncertain, then back to Good; values continue streaming from the Backup within MXAccess's ReconnectInterval (default 20 s) |
| C3 | Galaxy handles mid-apply dip | Trigger a generation apply on the Primary. | Galaxy continues reading — the mid-apply dip is advisory (ServiceLevel 75), not a session drop; MXAccess should stay bound |
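
C2's window can be checked the same way from a logged QUALITY timeline on the Galaxy side. Hypothetical helper over illustrative data; nothing here is MXAccess API:

```python
# Hypothetical check for C2: from (seconds, QUALITY) samples, verify every
# Uncertain stretch heals back to Good within the ReconnectInterval
# (default 20 s per the table above). Illustrative only.
def max_uncertain_gap(samples):
    """Longest span between entering Uncertain and the next Good sample."""
    gap, entered = 0.0, None
    for t, quality in samples:
        if quality == "Uncertain" and entered is None:
            entered = t
        elif quality == "Good" and entered is not None:
            gap = max(gap, t - entered)
            entered = None
    return gap

timeline = [(0, "Good"), (10, "Uncertain"), (14, "Good"), (30, "Good")]
print(max_uncertain_gap(timeline) <= 20.0)  # True: 4 s outage, inside the window
```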

## Recording results

Copy the tables above into a tracking doc per run. The tracking doc shape:

```
Run date: 2026-MM-DD
Cluster: <id>  Primary: <node>  Backup: <node>  Release: <sha>
A1: PASS  evidence: UaExpert screenshot uaexpert-a1.png
A2: PASS  evidence: ServiceLevel trend grafana-a2.png
…
```
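
A throwaway formatter for the stanza above can keep per-run records consistent. Field names mirror the template; the function, its signature, and the sample values are all illustrative:

```python
# Hypothetical formatter emitting the tracking-doc stanza for one run.
def render_run(date, cluster, primary, backup, sha, rows):
    lines = [
        f"Run date: {date}",
        f"Cluster: {cluster}  Primary: {primary}  Backup: {backup}  Release: {sha}",
    ]
    for row_id, (verdict, evidence) in rows.items():
        lines.append(f"{row_id}: {verdict}  evidence: {evidence}")
    return "\n".join(lines)

print(render_run("2026-MM-DD", "clu-01", "node-a", "node-b", "69e0d02",
                 {"A1": ("PASS", "uaexpert-a1.png"),
                  "A2": ("PASS", "grafana-a2.png")}))
```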

One pass of every row is the acceptance criterion. Re-run after any Phase 6.3 follow-up ships (especially the non-transparent redundancy-type upgrade, which flips A4 from "deferred" to "expected pass").

## Known limitations

- A4 pending: Server.ServerRedundancy on our current SDK build lands as the base ServerRedundancyState, which has no ServerUriArray child. ServerRedundancyNodeWriter.ApplyServerUriArray logs-and-skips until the redundancy-object-type upgrade follow-up lands.
- Recovery dwell default: RecoveryStateManager.DwellTime defaults to 60 s in Program.cs. Adjust via a future config knob if B3 takes too long to observe.
- C-block external dependency: the Galaxy.Host side of the redundancy story is largely out of our code — it's MXAccess's own client-redundancy policy talking to our published ServiceLevel. A negative result on C1–C3 does not necessarily indicate an OtOpcUa bug; cross-check with UaExpert (Blocks A and B) first.

## Automation notes (why this is a playbook, not a test)

- UaExpert and Kepware binaries are closed-source Windows GUIs; they don't ship headless CLIs for the browse/connect/subscribe flows.
- The OPC Foundation reference SDK can drive every scenario, but our own Driver.OpcUaClient tests already cover that client's behaviour; Block B adds value precisely because these two clients have independent redundancy implementations we don't control.
- For the subset of scenarios that can be automated — the self-loopback case where our own otopcua-cli drives Primary + Backup — the existing tests/ZB.MOM.WW.OtOpcUa.Server.Tests/RedundancyStatePublisherTests + ServiceLevelCalculatorTests (unit) + ClusterTopologyLoaderTests (integration) already cover the math + data path. The wire-level assertion that the values actually land on the right OPC UA nodes is covered by ServerRedundancyNodeWriterTests.