lmxopcua/docs/v2/redundancy-interop-playbook.md
Joseph Doherty 69e0d02c72 task-galaxy-e2e branch — non-FOCAS work-in-progress snapshot
Catch-all commit for pending work on the task-galaxy-e2e branch that
wasn't part of the FOCAS migration. Grouping by topic so future per-topic
commits can be cherry-picked if needed.

TwinCAT
- src/.../Driver.TwinCAT/AdsTwinCATClient.cs + TwinCATDriverFactoryExtensions.cs:
  factory-registration extensions + ADS client refinements.
- src/.../Driver.TwinCAT.Cli/Commands/BrowseCommand.cs: new browse command
  for the TwinCAT test-client CLI.
- tests/.../Driver.TwinCAT.IntegrationTests/TwinCAT3SmokeTests.cs + TwinCatProject/:
  fixture scaffold with a minimal POU + README pointing at the TCBSD/ESXi
  VM for e2e.
- docs/Driver.TwinCAT.Cli.md + docs/drivers/TwinCAT-Test-Fixture.md:
  documentation for the above.
- docs/v3/twincat-backlog.md: forward-looking backlog seed.

Admin UI + fleet status
- src/.../Admin/Components/Pages/Clusters/DriversTab.razor + Hosts.razor:
  UI refresh for fleet-status rendering.
- src/.../Admin/Hubs/FleetStatusHub.cs + FleetStatusPoller.cs +
  Admin/Program.cs: SignalR hub + poller plumbing for live fleet data.
- tests/.../Admin.Tests/FleetStatusPollerTests.cs: poller coverage.

Server + redundancy runtime (Phase 6.3 follow-ups)
- src/.../Server/Hosting/RedundancyPublisherHostedService.cs: HostedService
  that owns the RedundancyStatePublisher lifecycle + wires peer reachability.
- src/.../Server/Redundancy/ServerRedundancyNodeWriter.cs: OPC UA
  variable-node writer binding ServiceLevel + ServerUriArray to the
  publisher's events.
- src/.../Server/Program.cs + Server.csproj: hosted-service registration.
- tests/.../Server.Tests/ServerRedundancyNodeWriterTests.cs +
  Server.Tests.csproj: coverage for the above.

Configuration
- src/.../Configuration/Validation/DraftValidator.cs +
  tests/.../Configuration.Tests/DraftValidatorTests.cs: draft-validation
  refinements.

E2E scripts (shared infrastructure)
- scripts/e2e/README.md + _common.ps1 + test-all.ps1: shared helpers + the
  all-drivers test-all runner.
- scripts/e2e/test-opcuaclient.ps1: OPC UA Client e2e runner.

Docs
- docs/v2/implementation/phase-6-{1,2,3,4}*.md + exit-gate-phase-{3,7}.md:
  phase-gate + implementation doc updates.
- docs/v2/plan.md: top-level plan refresh.
- docs/v2/redundancy-interop-playbook.md: client interop playbook for the
  Phase 6.3 redundancy-runtime work.

Two orphan FOCAS docs remain on disk but deliberately unstaged —
docs/v2/focas-deployment.md and docs/v2/implementation/focas-simulator-plan.md
describe the now-retired Tier-C topology and should either be rewritten
or deleted in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:12:19 -04:00


# Redundancy Interop Playbook (Phase 6.3 Stream F — task #150)

**Scope:** manual validation that third-party OPC UA clients + AVEVA MXAccess observe our non-transparent redundancy signals (ServiceLevel, ServerUriArray, RedundancySupport) and fail over to the Backup node when the Primary drops.

**Why manual:** the third-party clients named here are Windows-GUI binaries (UaExpert, Kepware QuickClient) or embedded inside AVEVA System Platform. Automating any of them into PR-CI is out of scope for v2. This playbook captures the minimal dev-box-plus-VM setup and the expected pass criteria so the work can be executed repeatably at v2 release readiness and after any Phase 6.3 follow-up change.

## Prerequisites

1. Two OtOpcUa.Server nodes in one ServerCluster:
   - Declared as NodeCount = 2, RedundancyMode = Hot (or Warm).
   - Each with a distinct ApplicationUri (enforced by a unique index per decision #86).
   - Each node's StaticRoutes.xml points at the other (ServerCluster.Node[].Host).
2. scripts/install/Install-Services.ps1 applied on each node so the RedundancyPublisherHostedService is running.
3. At least one DriverInstance with a reachable simulator or PLC, so both servers have a non-empty address space to browse.
4. On the client host:
   - UaExpert ≥ 1.7 installed.
   - Kepware ClientAce QuickClient (or equivalent) — optional, as a second client.
5. For the AVEVA leg: a Galaxy.Host running against an MXAccess deployment with an external OPC UA client object pointed at the cluster (not at a single node).

## Expected signals on a running cluster

| Node state | ServiceLevel | RedundancySupport | ServerUriArray |
| --- | --- | --- | --- |
| Primary, healthy, peer reachable | 200 | Hot (or Warm) | self + peer |
| Primary, mid-apply | 75 (PrimaryMidApply) | same | same |
| Primary, peer unreachable | 150 (PrimaryPeerDown) | same | same |
| Backup, healthy | 100 (Secondary) | same | same |
| Either, dwelling in recovery | 50 (Recovering) | same | same |
| Either, invariant violation (two Primaries, disabled-node mismatch) | 2 (InvalidTopology) | same | same |

(The band constants live in ServiceLevelCalculator.Classify.)
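
The production classification lives in C# (ServiceLevelCalculator.Classify); the sketch below mirrors the band precedence in Python purely for reference while eyeballing trends during a run. The band names, the function signature, and the precedence order are illustrative assumptions — only the byte values come from the table above.

```python
# Hypothetical mirror of ServiceLevelCalculator.Classify, for reference only.
# Byte values are from the signals table above; everything else is illustrative.
from enum import IntEnum

class Band(IntEnum):
    INVALID_TOPOLOGY  = 2    # invariant violation (two Primaries, disabled-node mismatch)
    RECOVERING        = 50   # either role, dwelling in recovery
    PRIMARY_MID_APPLY = 75   # Primary while a generation apply is in flight
    SECONDARY         = 100  # healthy Backup
    PRIMARY_PEER_DOWN = 150  # Primary, peer unreachable
    PRIMARY_HEALTHY   = 200  # Primary, healthy, peer reachable

def classify(is_primary: bool, peer_reachable: bool,
             mid_apply: bool, recovering: bool, invalid_topology: bool) -> Band:
    """Return the ServiceLevel band for one node, most severe condition first."""
    if invalid_topology:
        return Band.INVALID_TOPOLOGY
    if recovering:
        return Band.RECOVERING
    if not is_primary:
        return Band.SECONDARY
    if mid_apply:
        return Band.PRIMARY_MID_APPLY
    if not peer_reachable:
        return Band.PRIMARY_PEER_DOWN
    return Band.PRIMARY_HEALTHY

print(int(classify(True, False, False, False, False)))  # 150: Primary with peer down
```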

## Test matrix

Each row is one manual run; pass criterion in the right column.

### Block A — UA protocol signals (UaExpert)

| # | Scenario | Procedure | Pass criterion |
| --- | --- | --- | --- |
| A1 | ServiceLevel published | Connect UaExpert to Primary. Browse to Server.ServerStatus.ServiceLevel. | Value = 200 (or the expected band byte per the table above) |
| A2 | ServiceLevel updates on peer down | Connect to Primary. Stop the Backup (sc stop OtOpcUa). Watch ServiceLevel. | Transitions 200 → 150 within ~2 s of the peer-probe timeout |
| A3 | RedundancySupport published | Browse to Server.ServerRedundancy.RedundancySupport. | Value matches the declared RedundancyMode (Warm / Hot / None) |
| A4 | ServerUriArray (non-transparent upgrade) | Requires the redundancy-object-type upgrade follow-up. | When the upgrade lands: ServerUriArray reports both ApplicationUris, self first |
| A5 | Mid-apply dip | On the Primary, trigger a sp_PublishGeneration apply. | ServiceLevel drops to 75 for the apply duration + dwell |
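
A2's pass criterion is a timing assertion. When the ServiceLevel trend is exported as (seconds, value) samples — e.g. from the Grafana panel used as evidence — a small helper can check the window mechanically. The helper and the sample data are hypothetical, not part of the codebase:

```python
# Hypothetical helper for A2: confirm ServiceLevel moved 200 -> 150 within
# the expected window after the last healthy sample. Illustrative only.
def transition_within(samples, before, after, window_s):
    """True if the value goes from `before` to `after` in <= window_s seconds."""
    t_last_before = None
    for t, level in samples:
        if level == before:
            t_last_before = t
        elif level == after and t_last_before is not None:
            return (t - t_last_before) <= window_s
    return False

# Samples trended while running A2 (made-up data):
samples = [(0.0, 200), (1.0, 200), (2.4, 150), (3.0, 150)]
print(transition_within(samples, before=200, after=150, window_s=2.0))  # True: 1.4 s gap
```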

### Block B — Client failover

| # | Scenario | Procedure | Pass criterion |
| --- | --- | --- | --- |
| B1 | UaExpert picks Primary by ServiceLevel | In UaExpert, configure a Redundancy Group with both endpoint URLs. | Client picks the Primary URL (higher ServiceLevel) |
| B2 | UaExpert cuts over on Primary kill | Kill the Primary's OtOpcUa service. | Client session reconnects to the Backup within UaExpert's reconnect timeout (default 5 s); data-change monitored items resume |
| B3 | UaExpert cuts back when Primary returns | Start the Primary service. Wait ≥ the recovery dwell (see RecoveryStateManager.DwellTime). | ServiceLevel on the returning Primary goes through 50 (Recovering) → 200; UaExpert may or may not switch back (client-policy dependent; both are accepted outcomes) |
| B4 | Kepware QuickClient failover | Repeat B1–B3 with Kepware in place of UaExpert. | Same pass criteria; establishes we're not UaExpert-specific |
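
B1 asserts a selection policy, not any UaExpert internals: read ServiceLevel from every configured endpoint and connect to the highest. A minimal Python sketch of that policy, with made-up endpoint URLs and a plain dict standing in for live OPC UA reads of each server's ServiceLevel:

```python
# Hypothetical sketch of the redundancy-group selection policy B1 expects.
# The dict stands in for live ServiceLevel reads; URLs are made up.
levels = {
    "opc.tcp://node-a:4840": 200,  # Primary, healthy
    "opc.tcp://node-b:4840": 100,  # Backup (Secondary band)
}

def pick_endpoint(service_levels):
    """Choose the endpoint advertising the highest ServiceLevel."""
    return max(service_levels, key=service_levels.get)

print(pick_endpoint(levels))  # opc.tcp://node-a:4840
```

After a Primary kill (B2), node-a's entry disappears or its ServiceLevel collapses, and the same policy lands on node-b.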

### Block C — Galaxy MXAccess failover

This block validates that an AVEVA System Platform app consuming our cluster via MXAccess tolerates a Primary drop the same way a native OPC UA client does. The MXAccess toolkit internally wraps the OPC UA Client and does its own redundancy negotiation; we're asserting that negotiation honors our ServiceLevel signal.

| # | Scenario | Procedure | Pass criterion |
| --- | --- | --- | --- |
| C1 | Galaxy binds to Primary on first connect | Bring the cluster up. Start a Galaxy $MxAccessClient object pointed at the cluster with both node URLs. | Galaxy reports QUALITY = Good + initial values from the Primary |
| C2 | Galaxy redirects on Primary drop | Stop the Primary. | Galaxy's QUALITY briefly goes Uncertain, then back to Good; values continue streaming from the Backup within MXAccess's ReconnectInterval (default 20 s) |
| C3 | Galaxy handles mid-apply dip | Trigger a generation apply on the Primary. | Galaxy continues reading — the mid-apply dip is advisory (ServiceLevel 75), not a session drop; MXAccess should stay bound |
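
C2's window can be checked the same way from a logged QUALITY timeline on the Galaxy side. Hypothetical helper over illustrative data; nothing here is MXAccess API:

```python
# Hypothetical check for C2: from (seconds, QUALITY) samples, verify every
# Uncertain stretch heals back to Good within the ReconnectInterval
# (default 20 s per the table above). Illustrative only.
def max_uncertain_gap(samples):
    """Longest span between entering Uncertain and the next Good sample."""
    gap, entered = 0.0, None
    for t, quality in samples:
        if quality == "Uncertain" and entered is None:
            entered = t
        elif quality == "Good" and entered is not None:
            gap = max(gap, t - entered)
            entered = None
    return gap

timeline = [(0, "Good"), (10, "Uncertain"), (14, "Good"), (30, "Good")]
print(max_uncertain_gap(timeline) <= 20.0)  # True: 4 s outage, inside the window
```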

## Recording results

Copy the tables above into a tracking doc per run. The tracking doc shape:

```
Run date: 2026-MM-DD
Cluster: <id>  Primary: <node>  Backup: <node>  Release: <sha>
A1: PASS  evidence: UaExpert screenshot uaexpert-a1.png
A2: PASS  evidence: ServiceLevel trend grafana-a2.png
…
```
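
A throwaway formatter for the stanza above can keep per-run records consistent. Field names mirror the template; the function, its signature, and the sample values are all illustrative:

```python
# Hypothetical formatter emitting the tracking-doc stanza for one run.
def render_run(date, cluster, primary, backup, sha, rows):
    lines = [
        f"Run date: {date}",
        f"Cluster: {cluster}  Primary: {primary}  Backup: {backup}  Release: {sha}",
    ]
    for row_id, (verdict, evidence) in rows.items():
        lines.append(f"{row_id}: {verdict}  evidence: {evidence}")
    return "\n".join(lines)

print(render_run("2026-MM-DD", "clu-01", "node-a", "node-b", "69e0d02",
                 {"A1": ("PASS", "uaexpert-a1.png"),
                  "A2": ("PASS", "grafana-a2.png")}))
```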

One pass of every row is the acceptance criterion. Re-run after any Phase 6.3 follow-up ships (especially the non-transparent redundancy-type upgrade, which flips A4 from "deferred" to "expected pass").

## Known limitations

- A4 pending: Server.ServerRedundancy on our current SDK build lands as the base ServerRedundancyState, which has no ServerUriArray child. ServerRedundancyNodeWriter.ApplyServerUriArray logs-and-skips until the redundancy-object-type upgrade follow-up lands.
- Recovery dwell default: RecoveryStateManager.DwellTime defaults to 60 s in Program.cs. Adjust via a future config knob if B3 takes too long to observe.
- C-block external dependency: the Galaxy.Host side of the redundancy story is largely out of our code — it's MXAccess's own client-redundancy policy talking to our published ServiceLevel. A negative result on C1–C3 does not necessarily indicate an OtOpcUa bug; cross-check with UaExpert (Blocks A and B) first.

## Automation notes (why this is a playbook, not a test)

- UaExpert and Kepware binaries are closed-source Windows GUIs; they don't ship headless CLIs for the browse/connect/subscribe flows.
- The OPC Foundation reference SDK can drive every scenario, but our own Driver.OpcUaClient tests already cover that client's behaviour; Block B adds value precisely because these two clients have independent redundancy implementations we don't control.
- For the subset of scenarios that can be automated — the self-loopback case where our own otopcua-cli drives Primary + Backup — the existing tests/ZB.MOM.WW.OtOpcUa.Server.Tests/RedundancyStatePublisherTests + ServiceLevelCalculatorTests (unit) + ClusterTopologyLoaderTests (integration) already cover the math + data path. The wire-level assertion that the values actually land on the right OPC UA nodes is covered by ServerRedundancyNodeWriterTests.