task-galaxy-e2e branch — non-FOCAS work-in-progress snapshot

Catch-all commit for pending work on the task-galaxy-e2e branch that
wasn't part of the FOCAS migration. Grouping by topic so future per-topic
commits can be cherry-picked if needed.

TwinCAT
- src/.../Driver.TwinCAT/AdsTwinCATClient.cs + TwinCATDriverFactoryExtensions.cs:
  factory-registration extensions + ADS client refinements.
- src/.../Driver.TwinCAT.Cli/Commands/BrowseCommand.cs: new browse command
  for the TwinCAT test-client CLI.
- tests/.../Driver.TwinCAT.IntegrationTests/TwinCAT3SmokeTests.cs + TwinCatProject/:
  fixture scaffold with a minimal POU + README pointing at the TCBSD/ESXi
  VM for e2e.
- docs/Driver.TwinCAT.Cli.md + docs/drivers/TwinCAT-Test-Fixture.md:
  documentation for the above.
- docs/v3/twincat-backlog.md: forward-looking backlog seed.

Admin UI + fleet status
- src/.../Admin/Components/Pages/Clusters/DriversTab.razor + Hosts.razor:
  UI refresh for fleet-status rendering.
- src/.../Admin/Hubs/FleetStatusHub.cs + FleetStatusPoller.cs +
  Admin/Program.cs: SignalR hub + poller plumbing for live fleet data.
- tests/.../Admin.Tests/FleetStatusPollerTests.cs: poller coverage.

Server + redundancy runtime (Phase 6.3 follow-ups)
- src/.../Server/Hosting/RedundancyPublisherHostedService.cs: HostedService
  that owns the RedundancyStatePublisher lifecycle + wires peer reachability.
- src/.../Server/Redundancy/ServerRedundancyNodeWriter.cs: OPC UA
  variable-node writer binding ServiceLevel + ServerUriArray to the
  publisher's events.
- src/.../Server/Program.cs + Server.csproj: hosted-service registration.
- tests/.../Server.Tests/ServerRedundancyNodeWriterTests.cs +
  Server.Tests.csproj: coverage for the above.

Configuration
- src/.../Configuration/Validation/DraftValidator.cs +
  tests/.../Configuration.Tests/DraftValidatorTests.cs: draft-validation
  refinements.

E2E scripts (shared infrastructure)
- scripts/e2e/README.md + _common.ps1 + test-all.ps1: shared helpers + the
  all-drivers test-all runner.
- scripts/e2e/test-opcuaclient.ps1: OPC UA Client e2e runner.

Docs
- docs/v2/implementation/phase-6-{1,2,3,4}*.md + exit-gate-phase-{3,7}.md:
  phase-gate + implementation doc updates.
- docs/v2/plan.md: top-level plan refresh.
- docs/v2/redundancy-interop-playbook.md: client interop playbook for the
  Phase 6.3 redundancy-runtime work.

Two orphan FOCAS docs remain on disk but deliberately unstaged —
docs/v2/focas-deployment.md and docs/v2/implementation/focas-simulator-plan.md
describe the now-retired Tier-C topology and should either be rewritten
or deleted in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-04-24 14:12:19 -04:00
parent 4b0664bd55
commit 69e0d02c72
58 changed files with 3070 additions and 247 deletions

View File

@@ -1,13 +1,20 @@
# Phase 6.3 — Redundancy Runtime
> **Status**: **SHIPPED (core)** 2026-04-19 — Streams B (ServiceLevelCalculator + RecoveryStateManager) and D core (ApplyLeaseRegistry) merged to `v2` in PR #89. Exit gate in PR #90.
> **Status**: **SHIPPED (core + Stream C)** — original body merged 2026-04-19; audit 2026-04-23 promoted **Stream C (task #147)** into shipped state.
>
> Deferred follow-ups (tracked separately):
> - Stream A — RedundancyCoordinator cluster-topology loader (task #145).
> - Stream COPC UA node wiring: ServiceLevel + ServerUriArray + RedundancySupport (task #147).
> - Stream E — Admin UI RedundancyTab + OpenTelemetry metrics + SignalR (task #149).
> - Stream Fclient interop matrix + Galaxy MXAccess failover test (task #150).
> - sp_PublishGeneration pre-publish validator rejecting unsupported RedundancyMode values (task #148 part 2 — SQL-side).
> **In** (verified in repo):
> - Stream A — `ClusterTopologyLoader`, `RedundancyCoordinator`, `RedundancyTopology`, `PeerReachability` all present under `src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/`. Coordinator is now also hosted by `Program.cs` via the new `RedundancyPublisherHostedService`, which calls `RefreshAsync` on startup.
> - Stream B`ServiceLevelCalculator` + `RecoveryStateManager`.
> - **Stream C (task #147) — OPC UA node wiring**. `ServerRedundancyNodeWriter` maintains `Server.ServiceLevel` (i=2267), `Server.ServerRedundancy.RedundancySupport` (i=2994), and `Server.ServerRedundancy.ServerUriArray` (non-transparent subtype) by writing the `PropertyState.Value` + calling `ClearChangeMasks`. `RedundancyPublisherHostedService` drives the publisher on a 1 s tick and fans `OnStateChanged` / `OnServerUriArrayChanged` into the writer. Mapping of `Configuration.RedundancyMode` → Part 4 `RedundancySupport` is Warm/Hot/None (v2 clusters don't enumerate Cold / HotAndMirrored per decision #85). Idempotent per-value dedupe prevents spurious OPC UA notifications. Unit coverage: `ServerRedundancyNodeWriterTests` (4 tests, green).
> - Stream D`ApplyLeaseRegistry`.
> - Stream E — `RedundancyTab.razor` with SignalR `RoleChanged` wiring (via `FleetStatusPoller` + `FleetStatusHub`) — stale-flag + role-swap banner.
>
> **Closed this session (2026-04-23)**:
> - **Task #148 part 2** — `DraftValidator.ValidateClusterTopology(cluster, nodes)` now catches three pre-publish invariants the SQL CHECK can't see: (a) unsupported `NodeCount`/`RedundancyMode` pairs; (b) `Enabled`-node count vs. declared `NodeCount` mismatch (catches disabled-node drift with mode still Hot/Warm); (c) multiple-Primary per decision #84. Returns every failure in one pass — same shape as `Validate`. 8 new tests in `DraftValidatorTests` green.
> - **Task #150 Stream F** — `docs/v2/redundancy-interop-playbook.md` captures the manual validation matrix against UaExpert + Kepware + AVEVA MXAccess failover. Automating these closed-source GUI clients in PR-CI is out of scope; the automatable half is already covered by `ServiceLevelCalculatorTests` / `RedundancyStatePublisherTests` / `ClusterTopologyLoaderTests` / `ServerRedundancyNodeWriterTests`.
>
> **Remaining (documented limitation, not blocking v2.0)**:
> - Non-transparent redundancy-state node upgrade — the SDK's default `Server.ServerRedundancy` object is the base `ServerRedundancyState`, so `ApplyServerUriArray` currently logs-and-skips. Operators on the rare deployment that needs `ServerUriArray` read-back get a clear warning with the upgrade path. Documented in the interop playbook's "Known limitations" section.
>
> Baseline pre-Phase-6.3: 1097 solution tests → post-Phase-6.3 core: 1137 passing (+40 net).
>