Catch-all commit for pending work on the task-galaxy-e2e branch that
wasn't part of the FOCAS migration. Grouping by topic so future per-topic
commits can be cherry-picked if needed.
TwinCAT
- src/.../Driver.TwinCAT/AdsTwinCATClient.cs + TwinCATDriverFactoryExtensions.cs:
factory-registration extensions + ADS client refinements.
- src/.../Driver.TwinCAT.Cli/Commands/BrowseCommand.cs: new browse command
for the TwinCAT test-client CLI.
- tests/.../Driver.TwinCAT.IntegrationTests/TwinCAT3SmokeTests.cs + TwinCatProject/:
fixture scaffold with a minimal POU + README pointing at the TCBSD/ESXi
VM for e2e.
- docs/Driver.TwinCAT.Cli.md + docs/drivers/TwinCAT-Test-Fixture.md:
documentation for the above.
- docs/v3/twincat-backlog.md: forward-looking backlog seed.
Admin UI + fleet status
- src/.../Admin/Components/Pages/Clusters/DriversTab.razor + Hosts.razor:
UI refresh for fleet-status rendering.
- src/.../Admin/Hubs/FleetStatusHub.cs + FleetStatusPoller.cs +
Admin/Program.cs: SignalR hub + poller plumbing for live fleet data.
- tests/.../Admin.Tests/FleetStatusPollerTests.cs: poller coverage.
Server + redundancy runtime (Phase 6.3 follow-ups)
- src/.../Server/Hosting/RedundancyPublisherHostedService.cs: HostedService
that owns the RedundancyStatePublisher lifecycle + wires peer reachability.
- src/.../Server/Redundancy/ServerRedundancyNodeWriter.cs: OPC UA
variable-node writer binding ServiceLevel + ServerUriArray to the
publisher's events.
- src/.../Server/Program.cs + Server.csproj: hosted-service registration.
- tests/.../Server.Tests/ServerRedundancyNodeWriterTests.cs +
Server.Tests.csproj: coverage for the above.
Configuration
- src/.../Configuration/Validation/DraftValidator.cs +
tests/.../Configuration.Tests/DraftValidatorTests.cs: draft-validation
refinements.
E2E scripts (shared infrastructure)
- scripts/e2e/README.md + _common.ps1 + test-all.ps1: shared helpers + the
all-drivers test-all runner.
- scripts/e2e/test-opcuaclient.ps1: OPC UA Client e2e runner.
Docs
- docs/v2/implementation/phase-6-{1,2,3,4}*.md + exit-gate-phase-{3,7}.md:
phase-gate + implementation doc updates.
- docs/v2/plan.md: top-level plan refresh.
- docs/v2/redundancy-interop-playbook.md: client interop playbook for the
Phase 6.3 redundancy-runtime work.
Two orphan FOCAS docs remain on disk but deliberately unstaged —
docs/v2/focas-deployment.md and docs/v2/implementation/focas-simulator-plan.md
describe the now-retired Tier-C topology and should either be rewritten
or deleted in a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 6.3 — Redundancy Runtime
Status: SHIPPED (core + Stream C) — original body merged 2026-04-19; audit 2026-04-23 promoted Stream C (task #147) into shipped state.
In (verified in repo):

- Stream A — `ClusterTopologyLoader`, `RedundancyCoordinator`, `RedundancyTopology`, `PeerReachability` all present under `src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/`. Coordinator is now also hosted by `Program.cs` via the new `RedundancyPublisherHostedService`, which calls `RefreshAsync` on startup.
- Stream B — `ServiceLevelCalculator` + `RecoveryStateManager`.
- Stream C (task #147) — OPC UA node wiring. `ServerRedundancyNodeWriter` maintains `Server.ServiceLevel` (i=2267), `Server.ServerRedundancy.RedundancySupport` (i=2994), and `Server.ServerRedundancy.ServerUriArray` (non-transparent subtype) by writing the `PropertyState.Value` + calling `ClearChangeMasks`. `RedundancyPublisherHostedService` drives the publisher on a 1 s tick and fans `OnStateChanged` / `OnServerUriArrayChanged` into the writer. Mapping of `Configuration.RedundancyMode` → Part 4 `RedundancySupport` is Warm/Hot/None (v2 clusters don't enumerate Cold / HotAndMirrored per decision #85). Idempotent per-value dedupe prevents spurious OPC UA notifications. Unit coverage: `ServerRedundancyNodeWriterTests` (4 tests, green).
- Stream D — `ApplyLeaseRegistry`.
- Stream E — `RedundancyTab.razor` with SignalR `RoleChanged` wiring (via `FleetStatusPoller` + `FleetStatusHub`) — stale-flag + role-swap banner.

Closed this session (2026-04-23):

- Task #148 part 2 — `DraftValidator.ValidateClusterTopology(cluster, nodes)` now catches three pre-publish invariants the SQL CHECK can't see: (a) unsupported `NodeCount`/`RedundancyMode` pairs; (b) `Enabled`-node count vs. declared `NodeCount` mismatch (catches disabled-node drift with mode still Hot/Warm); (c) multiple-Primary per decision #84. Returns every failure in one pass — same shape as `Validate`. 8 new tests in `DraftValidatorTests` green.
- Task #150 Stream F — `docs/v2/redundancy-interop-playbook.md` captures the manual validation matrix against UaExpert + Kepware + AVEVA MXAccess failover. Automating these closed-source GUI clients in PR-CI is out of scope; the automatable half is already covered by `ServiceLevelCalculatorTests` / `RedundancyStatePublisherTests` / `ClusterTopologyLoaderTests` / `ServerRedundancyNodeWriterTests`.

Remaining (documented limitation, not blocking v2.0):

- Non-transparent redundancy-state node upgrade — the SDK's default `Server.ServerRedundancy` object is the base `ServerRedundancyState`, so `ApplyServerUriArray` currently logs-and-skips. Operators on the rare deployment that needs `ServerUriArray` read-back get a clear warning with the upgrade path. Documented in the interop playbook's "Known limitations" section.

Baseline pre-Phase-6.3: 1097 solution tests → post-Phase-6.3 core: 1137 passing (+40 net).
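The three `ValidateClusterTopology` invariants above can be sketched compactly. This is a minimal Python rendering of the logic only — the shipped implementation is C# in `DraftValidator.cs`, and the dict shapes plus the supported mode/node-count table here are illustrative assumptions, not the real schema:

```python
# Illustrative sketch of the three ValidateClusterTopology invariants.
# The SUPPORTED table and dict keys are assumptions for the example.
SUPPORTED = {("Hot", 2), ("Warm", 2), ("None", 1)}

def validate_cluster_topology(cluster, nodes):
    """Return every failure in one pass, mirroring the validator's shape."""
    failures = []
    # (a) unsupported NodeCount/RedundancyMode pairs
    if (cluster["mode"], cluster["node_count"]) not in SUPPORTED:
        failures.append(f"unsupported mode/node-count pair: "
                        f"{cluster['mode']}/{cluster['node_count']}")
    # (b) Enabled-node count vs. declared NodeCount mismatch
    enabled = [n for n in nodes if n["enabled"]]
    if len(enabled) != cluster["node_count"]:
        failures.append(f"declared NodeCount={cluster['node_count']} "
                        f"but {len(enabled)} node(s) enabled")
    # (c) at most one Primary (decision #84)
    primaries = [n for n in nodes if n["role"] == "Primary"]
    if len(primaries) > 1:
        failures.append(f"{len(primaries)} Primary nodes declared; at most one allowed")
    return failures
```

The one-pass shape matters: an operator fixing a draft sees all three violations at once rather than replaying publish attempts.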
Branch: `v2/phase-6-3-redundancy-runtime`
Estimated duration: 2 weeks
Predecessor: Phase 6.2 (Authorization) — reuses the Phase 6.1 health endpoints for cluster-peer probing
Successor: Phase 6.4 (Admin UI completion)
Phase Objective
Land the non-transparent redundancy protocol end-to-end: two OtOpcUa.Server instances in a ServerCluster each expose a live ServiceLevel node whose value reflects that instance's suitability to serve traffic, advertise each other via ServerUriArray, and transition role (Primary ↔ Backup) based on health + operator intent.
Closes these gaps:
- Dynamic `ServiceLevel` — OPC UA Part 5 §6.3.34 specifies a Byte (0..255) that clients poll to pick the healthiest server. Our server publishes it as a static value today.
- `ServerUriArray` broadcast — Part 4 specifies that every node in a redundant pair should advertise its peers' ApplicationUris. Currently advertises only its own.
- Primary / Backup role coordination — entities carry `RedundancyRole` but the runtime doesn't read it; no peer health probing; no role-transfer on primary failure.
- Mid-apply dip — decision-level expectation that a server mid-generation-apply should report a lower ServiceLevel so clients cut over to the peer during the apply window. Not implemented.
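The client side of this protocol is simple by design. A hypothetical sketch of the selection rule non-transparent redundancy relies on — real clients (UaExpert, Kepware, etc.) implement this themselves; this just illustrates why the published byte matters:

```python
# Sketch of the non-transparent client selection rule: poll each advertised
# server's ServiceLevel and connect to the highest. Illustrative only.
def pick_server(service_levels: dict[str, int]) -> str:
    """service_levels maps ApplicationUri -> ServiceLevel byte (0..255)."""
    # 0 (Maintenance) and 1 (NoData) are reserved bands per Part 5 §6.3.34
    usable = {uri: lvl for uri, lvl in service_levels.items() if lvl >= 2}
    if not usable:
        raise RuntimeError("no usable server in cluster")
    return max(usable, key=usable.get)
```

A static ServiceLevel defeats this rule entirely — both gaps above exist to give the client a truthful input.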
Scope — What Changes
| Concern | Change |
|---|---|
| OtOpcUa.Server → new Server.Redundancy sub-namespace | RedundancyCoordinator singleton. Resolves the current node's ClusterNode row at startup, loads peers, runs a two-layer peer health probe: (a) /healthz every 2 s as the fast-fail (inherits Phase 6.1 semantics — HTTP + DB/cache healthy); (b) UaHealthProbe every 10 s — opens a lightweight OPC UA client session to the peer, reads its ServiceLevel node, and verifies the endpoint serves data. Authority decisions use UaHealthProbe; /healthz is used only to avoid wasting UA probes when the peer is obviously down. |
| Publish-generation fencing | Topology + role decisions are stamped with a monotonic ConfigGenerationId from the shared config DB. Coordinator re-reads topology via CAS on (ClusterId, ExpectedGeneration) → new row; peers reject state propagated from a lower generation. Prevents split-publish races. |
| InvalidTopology runtime state | If both nodes detect >1 Primary AFTER startup (config-DB drift during a publish), both self-demote to ServiceLevel 2 until convergence. Neither node serves authoritatively; clients pick the healthier alternative or reconnect later. |
| OPC UA server root | ServiceLevel variable node becomes a BaseDataVariable whose value updates on RedundancyCoordinator state change. ServerUriArray array variable includes self + peers in stable deterministic ordering (decision per OPC UA Part 4 §6.6.2.2). RedundancySupport stays static (set from RedundancyMode at startup); Transparent mode validated pre-publish, not rejected at startup. |
| RedundancyCoordinator computation | 8-state ServiceLevel matrix — avoids OPC UA Part 5 §6.3.34 collision (0=Maintenance, 1=NoData). Operator-declared maintenance only = 0. Unreachable / Faulted = 1. In-range operational states occupy 2..255: Authoritative-Primary = 255; Isolated-Primary (peer unreachable, self serving) = 230; Primary-Mid-Apply = 200; Recovering-Primary (post-fault, dwell not met) = 180; Authoritative-Backup = 100; Isolated-Backup (primary unreachable, "take over if asked") = 80; Backup-Mid-Apply = 50; Recovering-Backup = 30; InvalidTopology (runtime detects >1 Primary) = 2 (detected-inconsistency band — below normal operation). Full matrix documented in the docs/Redundancy.md update. |
| Role transition | Split-brain avoidance: role is declared in the shared config DB (ClusterNode.RedundancyRole), not elected at runtime. An operator flips the row (or a failover script does). Coordinator only reads; never writes. |
| sp_PublishGeneration hook | Uses named apply leases keyed to (ConfigGenerationId, PublishRequestId). `await using var lease = coordinator.BeginApplyLease(...)`. Disposal on any exit path (success, exception, cancellation) decrements. Watchdog auto-closes any lease older than ApplyMaxDuration (default 10 min) → ServiceLevel can't stick at mid-apply. Pre-publish validator rejects unsupported RedundancyMode (e.g. Transparent) with a clear error so runtime never sees an invalid state. |
| Admin UI /cluster/{id} page | New RedundancyTab.razor — shows the current node's role + ServiceLevel + peer reachability. FleetAdmin can trigger a role-swap by editing ClusterNode.RedundancyRole + publishing a draft. |
| Metrics | New OpenTelemetry metrics: ot_opcua_service_level{cluster,node}, ot_opcua_peer_reachable{cluster,node,peer}, ot_opcua_apply_in_progress{cluster,node}. Sink via Phase 6.1 observability layer. |
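The 8-state matrix above can be sketched as a pure function. This is a Python rendering of the table only — the real calculator is the C# `ServiceLevelCalculator` with a wider signature (separate HTTP/UA peer inputs); the precedence between overlapping inputs follows the plan where it pins an order (InvalidTopology dominates everything; apply dominates peer-unreachable per the C.4 cutover test) and is otherwise an assumption:

```python
# Sketch of the 8-state ServiceLevel matrix. Precedence: topology > maintenance
# > faulted > apply > recovery-dwell > isolation (assumed beyond what the plan pins).
def compute_service_level(role, maintenance, self_faulted, peer_reachable,
                          apply_in_progress, recovery_dwell_met, topology_valid) -> int:
    if not topology_valid:
        return 2          # InvalidTopology: detected-inconsistency band
    if maintenance:
        return 0          # reserved: operator-declared maintenance only
    if self_faulted:
        return 1          # reserved: NoData per OPC UA Part 5 §6.3.34
    if role == "Primary":
        if apply_in_progress:
            return 200    # Primary-Mid-Apply (dominates peer-unreachable)
        if not recovery_dwell_met:
            return 180    # Recovering-Primary
        if not peer_reachable:
            return 230    # Isolated-Primary: retains authority
        return 255        # Authoritative-Primary
    # Backup
    if apply_in_progress:
        return 50         # Backup-Mid-Apply
    if not recovery_dwell_met:
        return 30         # Recovering-Backup
    if not peer_reachable:
        return 80         # Isolated-Backup: "take over if asked", no auto-promote
    return 100            # Authoritative-Backup
```

Keeping the computation a pure function of its inputs is what makes the exhaustive matrix test in Stream B cheap.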
Scope — What Does NOT Change
| Item | Reason |
|---|---|
| OPC UA authn / authz | Phases 6.2 + prior. Redundancy is orthogonal. |
| Driver layer | Drivers aren't redundancy-aware; they run on each node independently against the same equipment. The server layer handles the ServiceLevel story. |
| Automatic failover / election | Explicitly out of scope. Non-transparent = client picks which server to use via ServiceLevel + ServerUriArray. We do NOT ship consensus, leader election, or automatic promotion. Operator-driven failover is the v2.0 model per decisions #79–85. |
| Transparent redundancy (RedundancySupport=Transparent) | Not supported. If the operator asks for it, the server fails startup with a clear error. |
| Historian redundancy | Galaxy Historian's own redundancy (two historians on two CPUs) is out of scope. The Galaxy driver talks to whichever historian is reachable from its node. |
Entry Gate Checklist
- Phase 6.1 merged (uses `/healthz` for peer probing)
- `CLAUDE.md` §Redundancy + `docs/Redundancy.md` re-read
- Decisions #79–85 re-skimmed
- `ServerCluster` / `ClusterNode` / `RedundancyRole` / `RedundancyMode` entities + existing migration reviewed
- OPC UA Part 4 §Redundancy + Part 5 §6.3.34 (ServiceLevel) re-skimmed
- Dev box has two OtOpcUa.Server instances configured against the same cluster — one designated Primary, one Backup — for integration testing
Task Breakdown
Stream A — Cluster topology loader (3 days)
- A.1 `RedundancyCoordinator` startup path: reads the `ClusterNode` row for the current node (identified by `appsettings.json` `Cluster:NodeId`), reads the cluster's peer list, validates invariants (no duplicate `ApplicationUri`, at most one `Primary` per cluster if `RedundancyMode.WarmActive`, at most two nodes total in v2.0 per decision #83).
- A.2 Topology subscription — coordinator re-reads on `sp_PublishGeneration` confirmation so an operator role-swap takes effect after publish (no process restart needed).
- A.3 Tests: two-node cluster seed, one-node cluster seed (degenerate), duplicate-uri rejection.
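The A.1 invariants are fail-fast checks, distinct from draft validation: if they trip at startup the node refuses to join. A minimal Python sketch of that logic (the real code lives in the C# coordinator's startup path; the dict keys are illustrative assumptions):

```python
# Illustrative sketch of the A.1 startup invariants: fail fast, never serve
# under a topology that violates the cluster contract.
def validate_startup_topology(nodes):
    uris = [n["application_uri"] for n in nodes]
    if len(set(uris)) != len(uris):
        raise ValueError("duplicate ApplicationUri in cluster")
    if sum(1 for n in nodes if n["role"] == "Primary") > 1:
        raise ValueError("more than one Primary declared")
    if len(nodes) > 2:
        raise ValueError("v2.0 supports at most two nodes per cluster (decision #83)")
```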
Stream B — Peer health probing + ServiceLevel computation (6 days, widened)
- B.1 `PeerHttpProbeLoop` per peer at 2 s — calls the peer's `/healthz`, 1 s timeout, exponential backoff on sustained failure. Used as fast-fail.
- B.2 `PeerUaProbeLoop` per peer at 10 s — opens an OPC UA client session to the peer (reuses the Phase 5 `Driver.OpcUaClient` stack), reads the peer's `ServiceLevel` node + verifies the endpoint serves data. Short-circuit: if the HTTP probe is failing, skip the UA probe (no wasted sessions).
- B.3 `ServiceLevelCalculator.Compute(role, selfHealth, peerHttpHealthy, peerUaHealthy, applyInProgress, recoveryDwellMet, topologyValid) → byte`. 8-state matrix per §Scope. `topologyValid=false` forces InvalidTopology = 2 regardless of other inputs.
- B.4 `RecoveryStateManager`: after a `Faulted → Healthy` transition, hold the node in the `Recovering` band (180 Primary / 30 Backup) for `RecoveryDwellTime` (default 60 s) AND require one positive publish witness (successful `Read` on a reference node) before entering the Authoritative band.
- B.5 Calculator reacts to inputs via `IObserver` so changes immediately push to the OPC UA `ServiceLevel` node.
- B.6 Tests: 64-case matrix covering role × self-health × peer-http × peer-ua × apply × recovery × topology. Specific cases flagged: Primary-with-unreachable-peer-serves-at-230 (authority retained); Backup-with-unreachable-primary-escalates-to-80 (not auto-promote); InvalidTopology demotes both nodes; Recovering dwell + publish-witness blocks premature return to 255.
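The B.1/B.2 two-layer decision can be sketched as follows — a Python illustration of the short-circuit and authority rules only (the real probes are C# loops over live HTTP and OPC UA sessions; `ua_probe` here stands in for the session-open-and-read):

```python
# Sketch of the two-layer probe decision: /healthz is the cheap fast-fail,
# the UA probe is the authority signal, and a failing HTTP probe short-circuits
# the (more expensive) UA session.
def peer_ua_healthy(http_ok: bool, ua_probe) -> bool:
    """ua_probe: callable that opens a UA session and returns the peer's ServiceLevel."""
    if not http_ok:
        return False       # short-circuit: don't waste a UA session on a dead peer
    try:
        level = ua_probe()
        return level >= 2  # 0/1 are Maintenance/NoData — not a valid authority source
    except Exception:
        return False       # session failed: peer is UA-unhealthy even if HTTP is up
```

This is also the property the exit-gate compliance check exercises: HTTP 200 with an unreachable UA endpoint must yield "UA-unhealthy".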
Stream C — OPC UA node wiring (3 days)
- C.1 `ServiceLevel` variable node created under `ServerStatus` at server startup. Type `Byte`, AccessLevel = CurrentRead only. Subscribe to the `ServiceLevelCalculator` observable; push updates via `DataChangeNotification`.
- C.2 `ServerUriArray` variable node under `ServerCapabilities`. Array of `String`, includes self + peers with deterministic ordering (self first). Updates on topology change. Compliance test asserts local-plus-peer membership.
- C.3 `RedundancySupport` variable — static at startup from `RedundancyMode`. Values: `None`, `Cold`, `Warm`, `WarmActive`, `Hot`. Unsupported values (`Transparent`, `HotAndMirrored`) are rejected pre-publish by the validator — runtime never sees them.
- C.4 Client.CLI cutover test: connect to primary, read `ServiceLevel` → 255; pause primary apply → 200; unreachable peer while apply in progress → 200 (apply dominates peer-unreachable per matrix); client sees peer via `ServerUriArray`; fail primary → client reconnects to peer at 80 (isolated-backup band).
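C.2's deterministic ordering is small enough to state as code. A Python sketch of the assumed rule — self first, then peers in a stable sort — so both nodes advertise consistent array content:

```python
# Sketch of the deterministic ServerUriArray ordering: self first, peers sorted.
# Stable ordering means a topology re-read never reshuffles the array spuriously.
def build_server_uri_array(self_uri: str, peer_uris: list[str]) -> list[str]:
    return [self_uri] + sorted(u for u in peer_uris if u != self_uri)
```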
Stream D — Apply-window integration (3 days)
- D.1 The `sp_PublishGeneration` caller wraps the apply in `await using var lease = coordinator.BeginApplyLease(generationId, publishRequestId)`. Lease keyed to `(ConfigGenerationId, PublishRequestId)` so concurrent publishes stay isolated. Disposal decrements on every exit path.
- D.2 `ApplyLeaseWatchdog` auto-closes leases older than `ApplyMaxDuration` (default 10 min) so a crashed publisher can't pin the node at mid-apply.
- D.3 Pre-publish validator in `sp_PublishGeneration` rejects unsupported `RedundancyMode` values (`Transparent`, `HotAndMirrored`) with a clear error message — runtime never sees an invalid mode.
- D.4 Tests: (a) mid-apply client subscribes → sees ServiceLevel drop → sees restore; (b) lease leak via `ThreadAbort` / cancellation → watchdog closes; (c) publish rejected for `Transparent` → operator-actionable error.
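The lease semantics in D.1/D.2 can be sketched with a context manager. This is an illustrative Python analogue of the C# `await using` lease — class and method names here are hypothetical stand-ins, not the real `ApplyLeaseRegistry` API:

```python
import time

# Sketch of the apply-lease semantics: a lease keyed to
# (generation_id, publish_request_id) that always closes on exit, plus a
# watchdog sweep that force-closes leases older than apply_max_duration.
class ApplyLeaseRegistry:
    def __init__(self, apply_max_duration_s=600.0, clock=time.monotonic):
        self._leases = {}   # (gen, req) -> opened-at timestamp
        self._max = apply_max_duration_s
        self._clock = clock

    class _Lease:
        def __init__(self, registry, key):
            self._registry, self._key = registry, key
        def __enter__(self):
            return self
        def __exit__(self, *exc):
            # Runs on success, exception, or cancellation — the paired decrement.
            self._registry._leases.pop(self._key, None)
            return False

    def begin_apply_lease(self, generation_id, publish_request_id):
        key = (generation_id, publish_request_id)
        self._leases[key] = self._clock()
        return self._Lease(self, key)

    @property
    def apply_in_progress(self):
        return bool(self._leases)

    def watchdog_sweep(self):
        """Force-close abandoned leases so ServiceLevel can't stick mid-apply."""
        now = self._clock()
        for key, opened in list(self._leases.items()):
            if now - opened > self._max:
                del self._leases[key]
```

Keying by `(generation, request)` rather than a bare counter is what keeps concurrent publishes isolated and makes an abandoned lease individually identifiable to the watchdog.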
Stream E — Admin UI + metrics (3 days)
- E.1 `RedundancyTab.razor` under `/cluster/{id}/redundancy`. Shows each node's role, current ServiceLevel (with band label per 8-state matrix), peer reachability (HTTP + UA probe separately), and last apply timestamp. Role-swap button posts a draft edit on `ClusterNode.RedundancyRole`; publish applies.
- E.2 OpenTelemetry meter export: `ot_opcua_service_level{cluster,node}` gauge + `ot_opcua_peer_reachable{cluster,node,peer,kind=http|ua}` + `ot_opcua_apply_in_progress{cluster,node}` + `ot_opcua_topology_valid{cluster}`. Sink via Phase 6.1 observability.
- E.3 SignalR push: `FleetStatusHub` broadcasts ServiceLevel changes so the Admin UI updates within ~1 s of the coordinator observing a peer flip.
Stream F — Client-interoperability matrix (3 days, new)
- F.1 Validate ServiceLevel-driven cutover against Ignition 8.1 + 8.3, Kepware KEPServerEX 6.x, Aveva OI Gateway 2020R2 + 2023R1. For each: configure the client with both endpoints, verify it honors `ServiceLevel` + `ServerUriArray` during primary failover.
- F.2 Clients that don't honor the standards (doc field — may include Kepware and OI Gateway per Codex review) get an explicit compatibility-matrix entry: "requires manual backup-endpoint config / vendor-specific redundancy primitives". Documented in `docs/Redundancy.md`.
- F.3 Galaxy MXAccess failover test — boot Galaxy.Proxy on both nodes, kill Primary, assert the Galaxy consumer reconnects to Backup within `(SessionTimeout + KeepAliveInterval × 3)`. Document required session-timeout config in `docs/Redundancy.md`.
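The F.3 reconnect budget is plain arithmetic, worth pinning down because the test asserts against it. A sketch (the 30 s / 5 s values are illustrative, not the required configuration):

```python
# Sketch of the F.3 reconnect budget: the Galaxy consumer must re-session on
# the Backup within SessionTimeout + KeepAliveInterval * 3.
def reconnect_budget_s(session_timeout_s: float, keep_alive_interval_s: float) -> float:
    return session_timeout_s + keep_alive_interval_s * 3

# e.g. a 30 s session timeout with a 5 s keep-alive gives a 45 s budget
```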
Compliance Checks (run at exit gate)
- OPC UA band compliance: 0=Maintenance reserved, 1=NoData reserved. Operational states in 2..255 per 8-state matrix.
- Isolated-Primary (peer unreachable, self serving) = 230 — Primary retains authority.
- Primary-Mid-Apply = 200.
- Recovering-Primary = 180 with dwell + publish witness enforced.
- Authoritative-Backup = 100.
- Isolated-Backup (primary unreachable) = 80 — does NOT auto-promote.
- InvalidTopology = 2 — both nodes self-demote when >1 Primary is detected at runtime.
- ServerUriArray returns self + peer URIs, self first.
- UaHealthProbe authority: integration test — peer returns HTTP 200 but OPC UA endpoint unreachable → coordinator treats peer as UA-unhealthy; peer is not a valid authority source.
- Apply-lease disposal: leases close on exception, cancellation, and watchdog timeout; ServiceLevel never sticks at mid-apply band.
- Transparent-mode rejection: attempting to publish `RedundancyMode=Transparent` is blocked at `sp_PublishGeneration`; runtime never sees an invalid mode.
- Role transition via operator publish: FleetAdmin swaps `RedundancyRole` in a draft, publishes; both nodes re-read topology on publish confirmation + flip ServiceLevel — no restart.
- Client.CLI cutover: with primary halted, a Client.CLI that was connected to primary sees the primary drop + reconnects to backup via `ServerUriArray`.
- Client interoperability matrix (Stream F): Ignition 8.1 + 8.3 honor ServiceLevel; Kepware + Aveva OI Gateway findings documented.
- Galaxy MXAccess failover: end-to-end test — primary kill → Galaxy consumer reconnects to backup within session-timeout budget.
- No regression in existing driver test suites; no regression in `/healthz` reachability under redundancy load.
Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Split-brain from operator race (both nodes marked Primary) | Low | High | Coordinator rejects startup if its cluster has >1 Primary row; logs + fails fast. Document as a publish-time validation in sp_PublishGeneration. |
| ServiceLevel thrashing on flaky peer | Medium | Medium | 2 s probe interval + 3-sample smoothing window; only declares a peer unreachable after 3 consecutive failed probes |
| Client ignores ServiceLevel and stays on broken primary | Medium | Medium | Documented in docs/Redundancy.md — non-transparent redundancy requires client cooperation; most SCADA clients (Ignition, Kepware, Aveva OI Gateway) honor it. Unit-test the advertised values; field behavior is client-responsibility |
| Apply-window counter leaks on exception | Low | High | BeginApplyLease returns a lease disposed via await using, enforcing a paired decrement on every exit path; watchdog force-closes abandoned leases; unit test for exception-in-apply path |
| HttpClient probe leaks sockets | Low | Medium | Single shared HttpClient per coordinator (not per-probe); timeouts kept tight to avoid holding connections open during peer downtime |
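The thrash mitigation in the table above reduces to a small piece of state. A Python sketch of the assumed smoothing rule — declare a peer unreachable only after 3 consecutive failed probes, and let any success reset the window:

```python
# Sketch of the probe-smoothing rule: a peer stays "reachable" until it fails
# `threshold` probes in a row; one success resets the failure streak.
class ProbeSmoother:
    def __init__(self, threshold=3):
        self._threshold = threshold
        self._consecutive_failures = 0

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return True while the peer is considered reachable."""
        self._consecutive_failures = 0 if probe_ok else self._consecutive_failures + 1
        return self._consecutive_failures < self._threshold
```

With a 2 s probe interval this means roughly 6 s of sustained failure before ServiceLevel reacts, trading detection latency for stability.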
Completion Checklist
- Stream A: topology loader + tests
- Stream B: peer probe + ServiceLevel calculator + 64-case matrix tests
- Stream C: ServiceLevel / ServerUriArray / RedundancySupport node wiring + Client.CLI smoke test
- Stream D: apply-window integration + apply-lease registry + watchdog
- Stream E: Admin `RedundancyTab` + OpenTelemetry metrics + SignalR push
- `phase-6-3-compliance.ps1` exits 0; exit-gate doc; `docs/Redundancy.md` updated with the ServiceLevel matrix
Adversarial Review — 2026-04-19 (Codex, thread 019da490-3fa0-7340-98b8-cceeca802550)
- Crit · ACCEPT — No publish-generation fencing enables a split-publish advertising both nodes as authoritative. Change: coordinator CAS on a monotonic `ConfigGenerationId`; every topology decision is generation-stamped; peers reject state propagated from a lower generation.
- Crit · ACCEPT — `>1 Primary` at startup is covered, but runtime containment is missing when invalid topology appears later (mid-apply race). Change: add a runtime `InvalidTopology` state — both nodes self-demote to ServiceLevel 2 (the "detected inconsistency" band, below normal operation) until convergence.
- High · ACCEPT — `0 = Faulted` collides with OPC UA Part 5 §6.3.34 semantics, where 0 means Maintenance and 1 means NoData. Change: reserve 0 for operator-declared maintenance-mode only; Faulted/unreachable uses 1 (NoData); in-range degraded states occupy 2..199.
- High · ACCEPT — Matrix collapses distinct operational states onto the same value. Change: matrix expanded to Authoritative-Primary=255, Isolated-Primary=230 (peer unreachable — still serving), Primary-Mid-Apply=200, Recovering-Primary=180, Authoritative-Backup=100, Isolated-Backup=80 (primary unreachable — "take over if asked"), Backup-Mid-Apply=50, Recovering-Backup=30.
- High · ACCEPT — `/healthz` from 6.1 is HTTP-healthy but doesn't guarantee the OPC UA data plane. Change: add a redundancy-specific probe `UaHealthProbe` — issues a `ReadAsync(ServiceLevel)` against the peer's OPC UA endpoint via a lightweight client session. `/healthz` remains the fast-fail; the UA probe is the authority signal.
- High · ACCEPT — `ServerUriArray` must include self + peers, not peers only. Change: array contains `[self.ApplicationUri, peer.ApplicationUri]` in stable deterministic ordering; compliance test asserts local-plus-peer membership.
- Med · ACCEPT — No `Faulted → Recovering → Healthy` path. Change: add a `Recovering` state with min dwell time (60 s default) + positive publish witness (one successful Read on a reference node) before returning to Healthy. Thrash-prevention.
- Med · ACCEPT — Topology change during an in-flight probe is undefined. Change: every probe task tagged with `ConfigGenerationId` at dispatch; obsolete results discarded; in-flight probes cancelled on topology reload.
- Med · ACCEPT — Apply-window counter race on exception/cancellation/async ownership. Change: apply-window is a named lease keyed to `(ConfigGenerationId, PublishRequestId)` with disposal enforced via `await using`; watchdog detects leased-but-abandoned and force-closes after `ApplyMaxDuration` (default 10 min).
- High · ACCEPT — Ignition + Kepware + Aveva OI Gateway `ServiceLevel` compliance is unverified. Change: risk elevated to High; add Stream F (new) — build an interop matrix: validate against Ignition 8.1/8.3, Kepware KEPServerEX 6.x, Aveva OI Gateway 2020R2 + 2023R1. Document per-client cutover behaviour. Field deployments get a documented compatibility table; clients that ignore ServiceLevel are documented as requiring explicit backup-endpoint config.
- Med · ACCEPT — Galaxy MXAccess re-session on Primary death not in acceptance. Change: Stream F adds an end-to-end failover smoke test that boots Galaxy.Proxy on both nodes, kills Primary, and asserts the Galaxy consumer reconnects to Backup within the `(SessionTimeout + KeepAliveInterval × 3)` budget. `docs/Redundancy.md` updated with required session timeouts.
- Med · ACCEPT — Transparent-mode startup rejection is outage-prone. Change: `sp_PublishGeneration` validates `RedundancyMode` pre-publish — unsupported values reject the publish attempt with a clear validation error; runtime never sees an unsupported mode. Last-good config stays active.