lmxopcua/docs/v2/implementation/phase-6-3-redundancy-runtime.md
Joseph Doherty 2fe4bac508 Phase 6.3 exit gate — compliance real-checks + phase doc = SHIPPED (core)
scripts/compliance/phase-6-3-compliance.ps1 turns stub TODOs into 21 real
checks covering:
- Stream B 8-state matrix: ServiceLevelCalculator + ServiceLevelBand present;
  Maintenance=0, NoData=1, InvalidTopology=2, AuthoritativePrimary=255,
  IsolatedPrimary=230, PrimaryMidApply=200, RecoveringPrimary=180,
  AuthoritativeBackup=100, IsolatedBackup=80, BackupMidApply=50,
  RecoveringBackup=30 — every numeric band pattern-matched in source (any
  drift turns a check red).
- Stream B RecoveryStateManager with dwell + publish-witness gate + 60s
  default dwell.
- Stream D ApplyLeaseRegistry: BeginApplyLease returns IAsyncDisposable;
  key includes PublishRequestId (decision #162); PruneStale watchdog present;
  10 min default ApplyMaxDuration.

Five [DEFERRED] follow-up surfaces explicitly listed with task IDs:
  - Stream A topology loader (task #145)
  - Stream C OPC UA node wiring (task #147)
  - Stream E Admin UI (task #149)
  - Stream F interop + Galaxy failover (task #150)
  - sp_PublishGeneration Transparent-mode rejection (task #148 part 2)

Cross-cutting: full solution dotnet test passes 1137 >= 1097 pre-Phase-6.3
baseline; pre-existing Client.CLI Subscribe flake tolerated.

docs/v2/implementation/phase-6-3-redundancy-runtime.md status updated from
DRAFT to SHIPPED (core). Non-transparent redundancy per decision #84 keeps
role election out of scope — operator-driven failover is the v2.0 model.

`Phase 6.3 compliance: PASS` — exit 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:00:30 -04:00


Phase 6.3 — Redundancy Runtime

Status: SHIPPED (core) 2026-04-19 — Streams B (ServiceLevelCalculator + RecoveryStateManager) and D core (ApplyLeaseRegistry) merged to v2 in PR #89. Exit gate in PR #90.

Deferred follow-ups (tracked separately):

  • Stream A — RedundancyCoordinator cluster-topology loader (task #145).
  • Stream C — OPC UA node wiring: ServiceLevel + ServerUriArray + RedundancySupport (task #147).
  • Stream E — Admin UI RedundancyTab + OpenTelemetry metrics + SignalR (task #149).
  • Stream F — client interop matrix + Galaxy MXAccess failover test (task #150).
  • sp_PublishGeneration pre-publish validator rejecting unsupported RedundancyMode values (task #148 part 2 — SQL-side).

Baseline pre-Phase-6.3: 1097 solution tests → post-Phase-6.3 core: 1137 passing (+40 net).

Branch: v2/phase-6-3-redundancy-runtime
Estimated duration: 2 weeks
Predecessor: Phase 6.2 (Authorization) — reuses the Phase 6.1 health endpoints for cluster-peer probing
Successor: Phase 6.4 (Admin UI completion)

Phase Objective

Land the non-transparent redundancy protocol end-to-end: two OtOpcUa.Server instances in a ServerCluster each expose a live ServiceLevel node whose value reflects that instance's suitability to serve traffic, advertise each other via ServerUriArray, and transition role (Primary ↔ Backup) based on health + operator intent.

Closes these gaps:

  1. Dynamic ServiceLevel — OPC UA Part 5 §6.3.34 specifies a Byte (0..255) that clients poll to pick the healthiest server. Our server publishes it as a static value today.
  2. ServerUriArray broadcast — Part 4 specifies that every node in a redundant pair should advertise its peers' ApplicationUris. Currently advertises only its own.
  3. Primary / Backup role coordination — entities carry RedundancyRole but the runtime doesn't read it; no peer health probing; no role-transfer on primary failure.
  4. Mid-apply dip — decision-level expectation that a server mid-generation-apply should report a lower ServiceLevel so clients cut over to the peer during the apply window. Not implemented.
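
Gap 1 is what the rest depends on: a redundancy-aware client polls ServiceLevel on every advertised endpoint and connects to the one reporting the highest value. A minimal sketch of that selection rule, hedged — the endpoint URLs and helper name are illustrative, not from the codebase:

```csharp
using System;
using System.Linq;

// Hedged sketch of the client-side selection rule that a dynamic ServiceLevel
// enables. Candidates are the endpoints learned from ServerUriArray.
static string PickServer((string Url, byte ServiceLevel)[] candidates) =>
    candidates.OrderByDescending(c => c.ServiceLevel).First().Url;

// Failed primary (NoData = 1) vs isolated backup (80): client cuts over,
// matching the C.4 cutover expectation.
Console.WriteLine(PickServer(new[]
{
    ("opc.tcp://nodeA:4840", (byte)1),
    ("opc.tcp://nodeB:4840", (byte)80),
})); // opc.tcp://nodeB:4840
```

Real clients layer session management and thresholds on top; the point is only that ServiceLevel must move for this comparison to mean anything.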

Scope — What Changes

  • OtOpcUa.Server → new Server.Redundancy sub-namespace: RedundancyCoordinator singleton. Resolves the current node's ClusterNode row at startup, loads peers, and runs a two-layer peer health probe: (a) /healthz every 2 s as the fast-fail (inherits Phase 6.1 semantics — HTTP + DB/cache healthy); (b) UaHealthProbe every 10 s — opens a lightweight OPC UA client session to the peer, reads its ServiceLevel node, and verifies the endpoint serves data. Authority decisions use UaHealthProbe; /healthz is used only to avoid wasting UA probes when the peer is obviously down.
  • Publish-generation fencing: topology + role decisions are stamped with a monotonic ConfigGenerationId from the shared config DB. The coordinator re-reads topology via CAS on (ClusterId, ExpectedGeneration) → new row; peers reject state propagated from a lower generation. Prevents split-publish races.
  • InvalidTopology runtime state: if both nodes detect >1 Primary AFTER startup (config-DB drift during a publish), both self-demote to ServiceLevel 2 until convergence. Neither node serves authoritatively; clients pick the healthier alternative or reconnect later.
  • OPC UA server root: the ServiceLevel variable node becomes a BaseDataVariable whose value updates on RedundancyCoordinator state change. The ServerUriArray array variable includes self + peers in stable deterministic ordering (decision per OPC UA Part 4 §6.6.2.2). RedundancySupport stays static (set from RedundancyMode at startup); Transparent mode is validated pre-publish, not rejected at startup.
  • RedundancyCoordinator computation: 8-state ServiceLevel matrix — avoids the OPC UA Part 5 §6.3.34 collision (0=Maintenance, 1=NoData). Operator-declared maintenance only = 0. Unreachable / Faulted = 1. In-range operational states occupy 2..255: Authoritative-Primary = 255; Isolated-Primary (peer unreachable, self serving) = 230; Primary-Mid-Apply = 200; Recovering-Primary (post-fault, dwell not met) = 180; Authoritative-Backup = 100; Isolated-Backup (primary unreachable, "take over if asked") = 80; Backup-Mid-Apply = 50; Recovering-Backup = 30; InvalidTopology (runtime detects >1 Primary) = 2 (detected-inconsistency band — below normal operation). Full matrix documented in the docs/Redundancy.md update.
  • Role transition: split-brain avoidance — role is declared in the shared config DB (ClusterNode.RedundancyRole), not elected at runtime. An operator flips the row (or a failover script does). The coordinator only reads; it never writes.
  • sp_PublishGeneration hook: named apply leases keyed to (ConfigGenerationId, PublishRequestId): await using var lease = coordinator.BeginApplyLease(...). Disposal on any exit path (success, exception, cancellation) decrements. A watchdog auto-closes any lease older than ApplyMaxDuration (default 10 min), so ServiceLevel can't stick at mid-apply. The pre-publish validator rejects unsupported RedundancyMode values (e.g. Transparent) with a clear error so the runtime never sees an invalid state.
  • Admin UI /cluster/{id} page: new RedundancyTab.razor — shows the current node's role, ServiceLevel, and peer reachability. FleetAdmin can trigger a role-swap by editing ClusterNode.RedundancyRole and publishing a draft.
  • Metrics: new OpenTelemetry metrics — ot_opcua_service_level{cluster,node}, ot_opcua_peer_reachable{cluster,node,peer}, ot_opcua_apply_in_progress{cluster,node}. Sink via the Phase 6.1 observability layer.
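
The 8-state matrix can be sketched as a pure function. This is a hedged illustration, not the shipped ServiceLevelCalculator: the parameter order follows B.3 and the band values follow the matrix above, but the string-typed health input and the Recovering-before-Isolated precedence are assumptions — the doc only fixes that apply dominates peer-unreachable (C.4):

```csharp
using System;

// Hedged sketch of B.3's ServiceLevelCalculator.Compute. Band values and the
// "apply dominates peer-unreachable" rule come from the phase doc; everything
// else (selfHealth as a string, Recovering checked before Isolated) is assumed.
static byte Compute(string role, string selfHealth, bool peerHttpHealthy,
                    bool peerUaHealthy, bool applyInProgress,
                    bool recoveryDwellMet, bool topologyValid)
{
    if (!topologyValid) return 2;                 // InvalidTopology: self-demote
    if (selfHealth == "Maintenance") return 0;    // operator-declared maintenance only
    if (selfHealth == "Faulted") return 1;        // NoData per OPC UA Part 5 §6.3.34
    bool primary = role == "Primary";
    // Authority signal is the UA probe; a failing HTTP probe means the UA probe
    // was skipped, so the peer cannot count as reachable either way.
    bool peerReachable = peerHttpHealthy && peerUaHealthy;
    if (applyInProgress)   return (byte)(primary ? 200 : 50);  // mid-apply dip
    if (!recoveryDwellMet) return (byte)(primary ? 180 : 30);  // recovering band
    if (!peerReachable)    return (byte)(primary ? 230 : 80);  // isolated band
    return (byte)(primary ? 255 : 100);                        // authoritative
}

Console.WriteLine(Compute("Primary", "Healthy", true, true, false, true, true)); // 255
Console.WriteLine(Compute("Primary", "Healthy", true, false, true, true, true)); // 200: apply dominates peer-unreachable
```

Keeping the computation pure is what makes the B.6 matrix tests cheap: every case is a single function call.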

Scope — What Does NOT Change

  • OPC UA authn / authz: Phases 6.2 and prior. Redundancy is orthogonal.
  • Driver layer: drivers aren't redundancy-aware; they run on each node independently against the same equipment. The server layer handles the ServiceLevel story.
  • Automatic failover / election: explicitly out of scope. Non-transparent means the client picks which server to use via ServiceLevel + ServerUriArray. We do NOT ship consensus, leader election, or automatic promotion. Operator-driven failover is the v2.0 model per decision #7985.
  • Transparent redundancy (RedundancySupport=Transparent): not supported. Publishing it is rejected by the sp_PublishGeneration pre-publish validator with a clear error, so the runtime never sees it.
  • Historian redundancy: Galaxy Historian's own redundancy (two historians on two CPUs) is out of scope. The Galaxy driver talks to whichever historian is reachable from its node.

Entry Gate Checklist

  • Phase 6.1 merged (uses /healthz for peer probing)
  • CLAUDE.md §Redundancy + docs/Redundancy.md re-read
  • Decisions #7985 re-skimmed
  • ServerCluster/ClusterNode/RedundancyRole/RedundancyMode entities + existing migration reviewed
  • OPC UA Part 4 §Redundancy + Part 5 §6.3.34 (ServiceLevel) re-skimmed
  • Dev box has two OtOpcUa.Server instances configured against the same cluster — one designated Primary, one Backup — for integration testing

Task Breakdown

Stream A — Cluster topology loader (3 days)

  1. A.1 RedundancyCoordinator startup path: reads ClusterNode row for the current node (identified by appsettings.json Cluster:NodeId), reads the cluster's peer list, validates invariants (no duplicate ApplicationUri, at most one Primary per cluster if RedundancyMode.WarmActive, at most two nodes total in v2.0 per decision #83).
  2. A.2 Topology subscription — coordinator re-reads on sp_PublishGeneration confirmation so an operator role-swap takes effect after publish (no process restart needed).
  3. A.3 Tests: two-node cluster seed, one-node cluster seed (degenerate), duplicate-uri rejection.
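
A.1's invariant checks are mechanical enough to sketch. A hedged illustration — nodes are modeled as (Uri, Role) tuples rather than the real ClusterNode entity, and the validator returns a message instead of throwing:

```csharp
using System;
using System.Linq;

// Hedged sketch of the A.1 startup invariants: at most two nodes (decision #83),
// no duplicate ApplicationUri, at most one Primary under RedundancyMode.WarmActive.
static string ValidateTopology((string Uri, string Role)[] nodes, bool warmActive)
{
    if (nodes.Length > 2)
        return "invalid: more than two nodes (v2.0 limit, decision #83)";
    if (nodes.Select(n => n.Uri).Distinct().Count() != nodes.Length)
        return "invalid: duplicate ApplicationUri";
    if (warmActive && nodes.Count(n => n.Role == "Primary") > 1)
        return "invalid: more than one Primary under WarmActive";
    return "valid";
}

Console.WriteLine(ValidateTopology(
    new[] { ("urn:nodeA", "Primary"), ("urn:nodeB", "Backup") }, warmActive: true)); // valid
```

The A.3 seeds (two-node, degenerate one-node, duplicate-uri rejection) map one-to-one onto calls to a function of this shape.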

Stream B — Peer health probing + ServiceLevel computation (6 days, widened)

  1. B.1 PeerHttpProbeLoop per peer at 2 s — calls peer's /healthz, 1 s timeout, exponential backoff on sustained failure. Used as fast-fail.
  2. B.2 PeerUaProbeLoop per peer at 10 s — opens an OPC UA client session to the peer (reuses Phase 5 Driver.OpcUaClient stack), reads peer's ServiceLevel node + verifies endpoint serves data. Short-circuit: if HTTP probe is failing, skip UA probe (no wasted sessions).
  3. B.3 ServiceLevelCalculator.Compute(role, selfHealth, peerHttpHealthy, peerUaHealthy, applyInProgress, recoveryDwellMet, topologyValid) → byte. 8-state matrix per §Scope. topologyValid=false forces InvalidTopology = 2 regardless of other inputs.
  4. B.4 RecoveryStateManager: after a Faulted → Healthy transition, hold driver in Recovering band (180 Primary / 30 Backup) for RecoveryDwellTime (default 60 s) AND require one positive publish witness (successful Read on a reference node) before entering Authoritative band.
  5. B.5 Calculator reacts to inputs via IObserver so changes immediately push to the OPC UA ServiceLevel node.
  6. B.6 Tests: 64-case matrix covering role × self-health × peer-http × peer-ua × apply × recovery × topology. Specific cases flagged: Primary-with-unreachable-peer-serves-at-230 (authority retained); Backup-with-unreachable-primary-escalates-to-80 (not auto-promote); InvalidTopology demotes both nodes; Recovering dwell + publish-witness blocks premature return to 255.
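
B.4's re-entry gate is the subtle part: the dwell clock AND one publish witness must both pass before leaving the Recovering band. A hedged sketch of those AND semantics — names are illustrative; only the 60 s default and the two-condition rule come from the doc:

```csharp
using System;

// Hedged sketch of B.4: after a Faulted -> Healthy transition, stay in the
// Recovering band until RecoveryDwellTime has elapsed AND one publish witness
// (a successful Read on a reference node) has been observed.
TimeSpan dwell = TimeSpan.FromSeconds(60);
DateTime healthySince = default;
bool witnessed = false;

void OnBecameHealthy(DateTime now) { healthySince = now; witnessed = false; }
void OnPublishWitness() => witnessed = true;   // one successful reference-node Read
bool RecoveryDwellMet(DateTime now) => witnessed && now - healthySince >= dwell;

var t0 = new DateTime(2026, 4, 19, 10, 0, 0, DateTimeKind.Utc);
OnBecameHealthy(t0);
Console.WriteLine(RecoveryDwellMet(t0.AddSeconds(90))); // False — dwell elapsed, no witness yet
OnPublishWitness();
Console.WriteLine(RecoveryDwellMet(t0.AddSeconds(30))); // False — witnessed, dwell not elapsed
Console.WriteLine(RecoveryDwellMet(t0.AddSeconds(90))); // True — both conditions hold
```

The output of RecoveryDwellMet is exactly the recoveryDwellMet input to the B.3 calculator, which is what blocks the premature return to 255 flagged in B.6.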

Stream C — OPC UA node wiring (3 days)

  1. C.1 ServiceLevel variable node created under ServerStatus at server startup. Type Byte, AccessLevel = CurrentRead only. Subscribe to ServiceLevelCalculator observable; push updates via DataChangeNotification.
  2. C.2 ServerUriArray variable node under ServerCapabilities. Array of String, includes self + peers with deterministic ordering (self first). Updates on topology change. Compliance test asserts local-plus-peer membership.
  3. C.3 RedundancySupport variable — static at startup from RedundancyMode. Values: None, Cold, Warm, WarmActive, Hot. Unsupported values (Transparent, HotAndMirrored) are rejected pre-publish by validator — runtime never sees them.
  4. C.4 Client.CLI cutover test: connect to primary, read ServiceLevel → 255; pause primary apply → 200; unreachable peer while apply in progress → 200 (apply dominates peer-unreachable per matrix); client sees peer via ServerUriArray; fail primary → client reconnects to peer at 80 (isolated-backup band).

Stream D — Apply-window integration (3 days)

  1. D.1 sp_PublishGeneration caller wraps the apply in await using var lease = coordinator.BeginApplyLease(generationId, publishRequestId). Lease keyed to (ConfigGenerationId, PublishRequestId) so concurrent publishes stay isolated. Disposal decrements on every exit path.
  2. D.2 ApplyLeaseWatchdog auto-closes leases older than ApplyMaxDuration (default 10 min) so a crashed publisher can't pin the node at mid-apply.
  3. D.3 Pre-publish validator in sp_PublishGeneration rejects unsupported RedundancyMode values (Transparent, HotAndMirrored) with a clear error message — runtime never sees an invalid mode.
  4. D.4 Tests: (a) mid-apply client subscribes → sees ServiceLevel drop → sees restore; (b) lease leak via ThreadAbort / cancellation → watchdog closes; (c) publish rejected for Transparent → operator-actionable error.
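
The D.1/D.2 lease semantics can be sketched without the real registry. The sketch below models the (ConfigGenerationId, PublishRequestId) key, the watchdog pass, and the try/finally shape that await using lowers to — the shipped BeginApplyLease hands out IAsyncDisposable leases; everything here is an illustrative stand-in:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hedged sketch of the apply-lease bookkeeping from D.1/D.2.
var leases = new Dictionary<(long Gen, Guid Req), DateTime>();

void BeginApplyLease(long gen, Guid req, DateTime now) => leases[(gen, req)] = now;
void ReleaseLease(long gen, Guid req) => leases.Remove((gen, req));
bool ApplyInProgress() => leases.Count > 0;

// D.2 watchdog: force-close anything older than ApplyMaxDuration (default 10 min)
// so a crashed publisher can't pin the node at the mid-apply band.
void PruneStale(TimeSpan maxDuration, DateTime now)
{
    foreach (var key in leases.Where(kv => now - kv.Value > maxDuration)
                              .Select(kv => kv.Key).ToList())
        leases.Remove(key);
}

var req = Guid.NewGuid();
var start = DateTime.UtcNow;
BeginApplyLease(7, req, start);
try
{
    // ... apply generation 7; ServiceLevel sits in the mid-apply band here ...
}
finally
{
    ReleaseLease(7, req); // `await using` guarantees this on success, exception, cancellation
}
Console.WriteLine(ApplyInProgress()); // False
```

Keying on the publish request (not just the generation) is what keeps concurrent publishes isolated, per decision #162.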

Stream E — Admin UI + metrics (3 days)

  1. E.1 RedundancyTab.razor under /cluster/{id}/redundancy. Shows each node's role, current ServiceLevel (with band label per 8-state matrix), peer reachability (HTTP + UA probe separately), last apply timestamp. Role-swap button posts a draft edit on ClusterNode.RedundancyRole; publish applies.
  2. E.2 OpenTelemetry meter export: ot_opcua_service_level{cluster,node} gauge + ot_opcua_peer_reachable{cluster,node,peer,kind=http|ua} + ot_opcua_apply_in_progress{cluster,node} + ot_opcua_topology_valid{cluster}. Sink via Phase 6.1 observability.
  3. E.3 SignalR push: FleetStatusHub broadcasts ServiceLevel changes so the Admin UI updates within ~1 s of the coordinator observing a peer flip.
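
E.2's gauges map onto System.Diagnostics.Metrics, the instrument API the .NET OpenTelemetry exporter observes. A hedged sketch — the meter name and tag values are illustrative, and in the real server the callback would read the RedundancyCoordinator's current state:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hedged sketch of the ot_opcua_service_level gauge from E.2.
byte currentServiceLevel = 255; // stand-in for the coordinator's live value

var meter = new Meter("OtOpcUa.Server.Redundancy");
meter.CreateObservableGauge("ot_opcua_service_level",
    () => new Measurement<int>(currentServiceLevel,
        new KeyValuePair<string, object?>("cluster", "c1"),
        new KeyValuePair<string, object?>("node", "n1")));

Console.WriteLine("gauge registered; value is polled on each collection cycle");
```

Because the gauge is observable, the exporter pulls the value on its own schedule — no push plumbing is needed beyond the callback.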

Stream F — Client-interoperability matrix (3 days, new)

  1. F.1 Validate ServiceLevel-driven cutover against Ignition 8.1 + 8.3, Kepware KEPServerEX 6.x, Aveva OI Gateway 2020R2 + 2023R1. For each: configure the client with both endpoints, verify it honors ServiceLevel + ServerUriArray during primary failover.
  2. F.2 Clients that don't honour the standards (doc field — may include Kepware and OI Gateway per Codex review) get an explicit compatibility-matrix entry: "requires manual backup-endpoint config / vendor-specific redundancy primitives". Documented in docs/Redundancy.md.
  3. F.3 Galaxy MXAccess failover test — boot Galaxy.Proxy on both nodes, kill Primary, assert Galaxy consumer reconnects to Backup within (SessionTimeout + KeepAliveInterval × 3). Document required session-timeout config in docs/Redundancy.md.

Compliance Checks (run at exit gate)

  • OPC UA band compliance: 0=Maintenance reserved, 1=NoData reserved. Operational states in 2..255 per 8-state matrix.
  • Authoritative-Primary ServiceLevel = 255.
  • Isolated-Primary (peer unreachable, self serving) = 230 — Primary retains authority.
  • Primary-Mid-Apply = 200.
  • Recovering-Primary = 180 with dwell + publish witness enforced.
  • Authoritative-Backup = 100.
  • Isolated-Backup (primary unreachable) = 80 — does NOT auto-promote.
  • InvalidTopology = 2 — both nodes self-demote when >1 Primary is detected at runtime.
  • ServerUriArray returns self + peer URIs, self first.
  • UaHealthProbe authority: integration test — peer returns HTTP 200 but OPC UA endpoint unreachable → coordinator treats peer as UA-unhealthy; peer is not a valid authority source.
  • Apply-lease disposal: leases close on exception, cancellation, and watchdog timeout; ServiceLevel never sticks at mid-apply band.
  • Transparent-mode rejection: attempting to publish RedundancyMode=Transparent is blocked at sp_PublishGeneration; runtime never sees an invalid mode.
  • Role transition via operator publish: FleetAdmin swaps RedundancyRole in a draft, publishes; both nodes re-read topology on publish confirmation + flip ServiceLevel — no restart.
  • Client.CLI cutover: with primary halted, Client.CLI that was connected to primary sees primary drop + reconnects to backup via ServerUriArray.
  • Client interoperability matrix (Stream F): Ignition 8.1 + 8.3 honour ServiceLevel; Kepware + Aveva OI Gateway findings documented.
  • Galaxy MXAccess failover: end-to-end test — primary kill → Galaxy consumer reconnects to backup within session-timeout budget.
  • No regression in existing driver test suites; no regression in /healthz reachability under redundancy load.

Risks and Mitigations

  • Split-brain from operator race (both nodes marked Primary) — likelihood Low, impact High. Mitigation: coordinator rejects startup if its cluster has >1 Primary row (logs + fails fast); sp_PublishGeneration validates at publish time; runtime drift triggers the InvalidTopology self-demotion to ServiceLevel 2.
  • ServiceLevel thrashing on flaky peer — likelihood Medium, impact Medium. Mitigation: 2 s probe interval + 3-sample smoothing window; a peer is declared unreachable only after 3 consecutive failed probes.
  • Client ignores ServiceLevel and stays on broken primary — likelihood Medium, impact Medium. Mitigation: documented in docs/Redundancy.md — non-transparent redundancy requires client cooperation; most SCADA clients (Ignition, Kepware, Aveva OI Gateway) honor it. Unit-test the advertised values; field behavior is the client's responsibility, verified per Stream F.
  • Apply-lease leak on exception — likelihood Low, impact High. Mitigation: BeginApplyLease returns IAsyncDisposable; await using enforces paired release on every exit path; the watchdog force-closes abandoned leases; unit test for the exception-in-apply path.
  • HttpClient probe leaks sockets — likelihood Low, impact Medium. Mitigation: single shared HttpClient per coordinator (not per-probe); tight timeouts avoid keeping connections open during peer downtime.

Completion Checklist

  • Stream A: topology loader + tests
  • Stream B: peer probe + ServiceLevel calculator + 64-case matrix tests
  • Stream C: ServiceLevel / ServerUriArray / RedundancySupport node wiring + Client.CLI smoke test
  • Stream D: apply-lease integration (BeginApplyLease + ApplyLeaseWatchdog)
  • Stream E: Admin RedundancyTab + OpenTelemetry metrics + SignalR push
  • phase-6-3-compliance.ps1 exits 0; exit-gate doc; docs/Redundancy.md updated with the ServiceLevel matrix

Adversarial Review — 2026-04-19 (Codex, thread 019da490-3fa0-7340-98b8-cceeca802550)

  1. Crit · ACCEPT — No publish-generation fencing enables split-publish advertising both as authoritative. Change: coordinator CAS on a monotonic ConfigGenerationId; every topology decision is generation-stamped; peers reject state propagated from a lower generation.
  2. Crit · ACCEPT — >1 Primary at startup is covered, but runtime containment is missing when an invalid topology appears later (mid-apply race). Change: add a runtime InvalidTopology state — both nodes self-demote to ServiceLevel 2 (the "detected inconsistency" band, below normal operation) until convergence.
  3. High · ACCEPT — 0 = Faulted collides with OPC UA Part 5 §6.3.34 semantics, where 0 means Maintenance and 1 means NoData. Change: reserve 0 for operator-declared maintenance-mode only; Faulted/unreachable uses 1 (NoData); in-range degraded states occupy 2..199.
  4. High · ACCEPT — Matrix collapses distinct operational states onto the same value. Change: matrix expanded to Authoritative-Primary=255, Isolated-Primary=230 (peer unreachable — still serving), Primary-Mid-Apply=200, Recovering-Primary=180, Authoritative-Backup=100, Isolated-Backup=80 (primary unreachable — "take over if asked"), Backup-Mid-Apply=50, Recovering-Backup=30.
  5. High · ACCEPT — /healthz from 6.1 is HTTP-healthy but doesn't guarantee the OPC UA data plane. Change: add a redundancy-specific probe UaHealthProbe — issues a ReadAsync(ServiceLevel) against the peer's OPC UA endpoint via a lightweight client session. /healthz remains the fast-fail; the UA probe is the authority signal.
  6. High · ACCEPT — ServerUriArray must include self + peers, not peers only. Change: array contains [self.ApplicationUri, peer.ApplicationUri] in stable deterministic ordering; compliance test asserts local-plus-peer membership.
  7. Med · ACCEPT — No Faulted → Recovering → Healthy path. Change: add Recovering state with min dwell time (60 s default) + positive publish witness (one successful Read on a reference node) before returning to Healthy. Thrash-prevention.
  8. Med · ACCEPT — Topology change during in-flight probe undefined. Change: every probe task tagged with ConfigGenerationId at dispatch; obsolete results discarded; in-flight probes cancelled on topology reload.
  9. Med · ACCEPT — Apply-window counter race on exception/cancellation/async ownership. Change: apply-window is a named lease keyed to (ConfigGenerationId, PublishRequestId) with disposal enforced via await using; watchdog detects leased-but-abandoned and force-closes after ApplyMaxDuration (default 10 min).
  10. High · ACCEPT — Ignition + Kepware + Aveva OI Gateway ServiceLevel compliance is unverified. Change: risk elevated to High; add Stream F (new) — build an interop matrix: validate against Ignition 8.1/8.3, Kepware KEPServerEX 6.x, Aveva OI Gateway 2020R2 + 2023R1. Document per-client cutover behaviour. Field deployments get a documented compatibility table; clients that ignore ServiceLevel documented as requiring explicit backup-endpoint config.
  11. Med · ACCEPT — Galaxy MXAccess re-session on Primary death not in acceptance. Change: Stream F adds an end-to-end failover smoke test that boots Galaxy.Proxy on both nodes, kills Primary, asserts Galaxy consumer reconnects to Backup within (SessionTimeout + KeepAliveInterval × 3) budget. docs/Redundancy.md updated with required session timeouts.
  12. Med · ACCEPT — Transparent-mode startup rejection is outage-prone. Change: sp_PublishGeneration validates RedundancyMode pre-publish — unsupported values reject the publish attempt with a clear validation error; runtime never sees an unsupported mode. Last-good config stays active.