lmxopcua/docs/v2/implementation/phase-6-3-redundancy-runtime.md
Joseph Doherty 4695a5c88e Phase 6 — Draft 4 implementation plans covering v2 unimplemented features + adversarial review + adjustments. After drivers were paused per user direction, audited the v2 plan for features documented-but-unshipped and identified four coherent tracks that had no implementation plan at all. Each plan follows the docs/v2/implementation/phase-*.md template (DRAFT status, branch name, Stream A-E task breakdown, Compliance Checks, Risks, Completion Checklist). docs/v2/implementation/phase-6-1-resilience-and-observability.md (243 lines) covers Polly resilience pipelines wired to every capability interface, Tier A/B/C runtime enforcement (memory watchdog generalized beyond Galaxy, scheduled recycle per decision #67, wedge detection), health endpoints on :4841, structured Serilog with correlation IDs, LiteDB local-cache fallback per decision #36. phase-6-2-authorization-runtime.md (145 lines) wires ACL enforcement on every OPC UA Read/Write/Subscribe/Call path + LDAP-group-to-admin-role grants per decisions #105 and #129 -- runtime permission-trie evaluator over the 6-level Cluster/Namespace/UnsArea/UnsLine/Equipment/Tag hierarchy, per-session cache invalidated on generation-apply + LDAP-cache expiry. phase-6-3-redundancy-runtime.md (165 lines) lands the non-transparent warm/hot redundancy runtime per decisions #79-85: dynamic ServiceLevel node, ServerUriArray peer broadcast, mid-apply dip via sp_PublishGeneration hook, operator-driven role transition (no auto-election -- plan remains explicit about what's out of scope). phase-6-4-admin-ui-completion.md (178 lines) closes Phase 1 Stream E completion-checklist items that never landed: UNS drag-reorder + impact preview, Equipment CSV import, 5-identifier search, draft-diff viewer enhancements, OPC 40010 _base Identification field exposure per decisions #138-139. Each plan then got a Codex adversarial-review pass (codex mcp tool, read-only sandbox, synchronous). 
Reviews explicitly targeted decision-log conflicts, API-shape assumptions, unbounded blast radius, under-specified state transitions, and testing holes. Appended 'Adversarial Review — 2026-04-19' section to each plan with numbered findings (severity / finding / why-it-matters / adjustment accepted). Review surfaced real substantive issues that the initial drafts glossed over: Phase 6.1 auto-retry conflicting with decisions #44-45 no-auto-write-retry rule; Phase 6.1 per-driver-instance pipeline breaking decision #35's per-device isolation; Phase 6.1 recycle/watchdog at Tier A/B breaching decisions #73-74 Tier-C-only constraint; Phase 6.2 conflating control-plane LdapGroupRoleMapping with data-plane ACL grants; Phase 6.2 missing Browse enforcement entirely; Phase 6.2 subscription re-authorization policy unresolved between create-time-only and per-publish; Phase 6.3 ServiceLevel=0 colliding with OPC UA Part 5 Maintenance semantics; Phase 6.3 ServerUriArray excluding self (spec-bug); Phase 6.3 apply-window counter race on cancellation; Phase 6.3 client cutover for Kepware/Aveva OI Gateway is unverified hearsay; Phase 6.4 stale UNS impact preview overwriting concurrent draft edits; Phase 6.4 identifier contract drifting from admin-ui.md canonical set (ZTag/MachineCode/SAPID/EquipmentId/EquipmentUuid, not ZTag/SAPID/UniqueId/Alias1/Alias2); Phase 6.4 CSV import atomicity internally contradictory (single txn vs chunked inserts); Phase 6.4 OPC 40010 field list not matching decision #139. Every finding has an adjustment in the plan doc -- plans are meant to be executable from the next session with the critique already baked in rather than a clean draft that would run into the same issues at implementation time. Codex thread IDs cited in each plan's review section for reproducibility. Pure documentation PR -- no code changes. Plans are DRAFT status; each becomes its own implementation phase with its own entry-gate + exit-gate when business prioritizes.
2026-04-19 03:15:00 -04:00


Phase 6.3 — Redundancy Runtime

Status: DRAFT — CLAUDE.md + docs/Redundancy.md describe a non-transparent warm/hot redundancy model with unique ApplicationUris, RedundancySupport advertisement, ServerUriArray, and dynamic ServiceLevel. Entities (ServerCluster, ClusterNode, RedundancyRole, RedundancyMode) exist; the runtime behavior (actual ServiceLevel number computation, mid-apply dip, ServerUriArray broadcast) is not wired.

Branch: v2/phase-6-3-redundancy-runtime
Estimated duration: 2 weeks
Predecessor: Phase 6.2 (Authorization) — reuses the Phase 6.1 health endpoints for cluster-peer probing
Successor: Phase 6.4 (Admin UI completion)

Phase Objective

Land the non-transparent redundancy protocol end-to-end: two OtOpcUa.Server instances in a ServerCluster each expose a live ServiceLevel node whose value reflects that instance's suitability to serve traffic, advertise each other via ServerUriArray, and transition role (Primary ↔ Backup) based on health + operator intent.

Closes these gaps:

  1. Dynamic ServiceLevel — OPC UA Part 5 §6.3.34 specifies a Byte (0..255) that clients poll to pick the healthiest server. Our server publishes it as a static value today.
  2. ServerUriArray broadcast — Part 4 specifies that every node in a redundant pair should advertise its peers' ApplicationUris. Currently advertises only its own.
  3. Primary / Backup role coordination — entities carry RedundancyRole but the runtime doesn't read it; no peer health probing; no role-transfer on primary failure.
  4. Mid-apply dip — decision-level expectation that a server mid-generation-apply should report a lower ServiceLevel so clients cut over to the peer during the apply window. Not implemented.

Scope — What Changes

| Concern | Change |
| --- | --- |
| OtOpcUa.Server → new Server.Redundancy sub-namespace | RedundancyCoordinator singleton. Resolves the current node's ClusterNode row at startup, loads its peers from ServerCluster, probes each peer's /healthz (Phase 6.1 endpoint) every PeerProbeInterval (default 2 s), maintains per-peer health state. |
| OPC UA server root | ServiceLevel variable node becomes a BaseDataVariable whose value updates on RedundancyCoordinator state change. ServerUriArray array variable refreshes on cluster-topology change. RedundancySupport stays static (set from RedundancyMode at startup). |
| RedundancyCoordinator computation | ServiceLevel formula: 255 = Primary + fully healthy + no apply in progress; 200 = Primary + mid-apply (clients should prefer the peer); 100 = Backup + fully healthy; 50 = Backup + mid-apply; 0 = Faulted, or peer unreachable while not authoritative. Documented in the docs/Redundancy.md update. |
| Role transition | Split-brain avoidance: the role is declared in the shared config DB (ClusterNode.RedundancyRole), not elected at runtime. An operator flips the row (or a failover script does). The coordinator only reads; it never writes. |
| sp_PublishGeneration hook | Before the apply starts, the coordinator sets ApplyInProgress = true in-memory → ServiceLevel drops to the mid-apply band. Clears after sp_PublishGeneration returns. |
| Admin UI /cluster/{id} page | New RedundancyTab.razor — shows the current node's role + ServiceLevel + peer reachability. FleetAdmin can trigger a role-swap by editing ClusterNode.RedundancyRole and publishing a draft. |
| Metrics | New OpenTelemetry metrics: ot_opcua_service_level{cluster,node}, ot_opcua_peer_reachable{cluster,node,peer}, ot_opcua_apply_in_progress{cluster,node}. Sink via the Phase 6.1 observability layer. |
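
The ServiceLevel formula above is a pure function of four inputs, which makes it a natural candidate for exhaustive matrix tests. A minimal sketch follows — the type and member names (`RedundancyRole`, `ServiceLevelCalculator.Compute`) are illustrative, not the shipped API, and it encodes the draft bands exactly as tabled here (the adversarial review below later widens them):

```csharp
using System;

public enum RedundancyRole { Primary, Backup }

public static class ServiceLevelCalculator
{
    // Draft matrix from the Scope table: 255/200 for Primary, 100/50 for Backup,
    // 0 = Faulted, or peer-unreachable while not authoritative.
    public static byte Compute(RedundancyRole role, bool selfHealthy,
                               bool peerReachable, bool applyInProgress)
    {
        if (!selfHealthy)
            return 0; // Faulted

        if (role == RedundancyRole.Primary)
            return applyInProgress ? (byte)200 : (byte)255;

        // Backup with an unreachable peer is not authoritative (draft band;
        // review finding 4 replaces this with Isolated-Backup=80).
        if (!peerReachable)
            return 0;

        return applyInProgress ? (byte)50 : (byte)100;
    }
}
```

Keeping the calculator side-effect free means the Stream B matrix tests reduce to asserting one byte per input tuple.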

Scope — What Does NOT Change

| Item | Reason |
| --- | --- |
| OPC UA authn / authz | Phases 6.2 and prior. Redundancy is orthogonal. |
| Driver layer | Drivers aren't redundancy-aware; they run on each node independently against the same equipment. The server layer handles the ServiceLevel story. |
| Automatic failover / election | Explicitly out of scope. Non-transparent = the client picks which server to use via ServiceLevel + ServerUriArray. We do NOT ship consensus, leader election, or automatic promotion. Operator-driven failover is the v2.0 model per decisions #79-85. |
| Transparent redundancy (RedundancySupport=Transparent) | Not supported. If the operator asks for it, the server fails startup with a clear error. |
| Historian redundancy | Galaxy Historian's own redundancy (two historians on two CPUs) is out of scope. The Galaxy driver talks to whichever historian is reachable from its node. |

Entry Gate Checklist

  • Phase 6.1 merged (uses /healthz for peer probing)
  • CLAUDE.md §Redundancy + docs/Redundancy.md re-read
  • Decisions #79-85 re-skimmed
  • ServerCluster/ClusterNode/RedundancyRole/RedundancyMode entities + existing migration reviewed
  • OPC UA Part 4 §Redundancy + Part 5 §6.3.34 (ServiceLevel) re-skimmed
  • Dev box has two OtOpcUa.Server instances configured against the same cluster — one designated Primary, one Backup — for integration testing

Task Breakdown

Stream A — Cluster topology loader (3 days)

  1. A.1 RedundancyCoordinator startup path: reads ClusterNode row for the current node (identified by appsettings.json Cluster:NodeId), reads the cluster's peer list, validates invariants (no duplicate ApplicationUri, at most one Primary per cluster if RedundancyMode.WarmActive, at most two nodes total in v2.0 per decision #83).
  2. A.2 Topology subscription — coordinator re-reads on sp_PublishGeneration confirmation so an operator role-swap takes effect after publish (no process restart needed).
  3. A.3 Tests: two-node cluster seed, one-node cluster seed (degenerate), duplicate-uri rejection.
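
The A.1 invariants can be sketched as a single pure validation over the loaded rows. This is an assumed shape — `ClusterNodeRow`, `TopologyValidator`, and the string-typed role are stand-ins for the real EF entities:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record ClusterNodeRow(string NodeId, string ApplicationUri, string RedundancyRole);

public static class TopologyValidator
{
    // Returns null when the cluster satisfies the A.1 invariants,
    // otherwise the violated rule (for fail-fast logging at startup).
    public static string? Validate(IReadOnlyList<ClusterNodeRow> nodes, bool warmActive)
    {
        if (nodes.Count == 0 || nodes.Count > 2)
            return "v2.0 clusters contain one or two nodes (decision #83).";

        if (nodes.Select(n => n.ApplicationUri)
                 .Distinct(StringComparer.OrdinalIgnoreCase)
                 .Count() != nodes.Count)
            return "Duplicate ApplicationUri within the cluster.";

        if (warmActive && nodes.Count(n => n.RedundancyRole == "Primary") > 1)
            return "At most one Primary per WarmActive cluster.";

        return null;
    }
}
```

Returning the violated rule as a string (rather than throwing) keeps the A.3 seed tests to simple equality assertions.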

Stream B — Peer health probing + ServiceLevel computation (4 days)

  1. B.1 PeerProbeLoop runs per peer at PeerProbeInterval (2 s default, configurable via appsettings.json). Calls peer's /healthz via HttpClient; timeout 1 s. Exponential backoff on sustained failure.
  2. B.2 ServiceLevelCalculator.Compute(current role, self health, peer reachable, apply in progress) → byte. Matrix documented in §Scope.
  3. B.3 Calculator reacts to inputs via IObserver pattern so changes immediately push to the OPC UA ServiceLevel node.
  4. B.4 Tests: matrix coverage for all role × health × apply permutations (32 cases); injected IClock + fake HttpClient so tests are deterministic.
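
B.1's "exponential backoff on sustained failure" can be isolated as a pure schedule so the probe loop stays trivially testable without a clock. A sketch under assumed names (`ProbeBackoff.NextDelay`; the 2 s base and 30 s cap are illustrative defaults, only the 2 s base comes from the plan):

```csharp
using System;

public static class ProbeBackoff
{
    // Delay before the next probe: PeerProbeInterval (2 s default) doubled per
    // consecutive failure, capped so a recovering peer is re-detected promptly.
    public static TimeSpan NextDelay(int consecutiveFailures,
                                     double baseSeconds = 2.0, double maxSeconds = 30.0)
    {
        if (consecutiveFailures <= 0)
            return TimeSpan.FromSeconds(baseSeconds);

        double delay = baseSeconds * Math.Pow(2, Math.Min(consecutiveFailures, 16));
        return TimeSpan.FromSeconds(Math.Min(delay, maxSeconds));
    }
}
```

The PeerProbeLoop then only has to feed its failure counter in and await the returned delay, which pairs cleanly with the injected IClock from B.4.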

Stream C — OPC UA node wiring (3 days)

  1. C.1 ServiceLevel variable node created under ServerStatus at server startup. Type Byte, AccessLevel = CurrentRead only. Subscribe to ServiceLevelCalculator observable; push updates via DataChangeNotification.
  2. C.2 ServerUriArray variable node under ServerCapabilities. Array of String, length = peer count. Updates on topology change.
  3. C.3 RedundancySupport variable — static at startup from RedundancyMode. Values: None, Cold, Warm, WarmActive, Hot. Phase 6.3 supports everything except Transparent + HotAndMirrored.
  4. C.4 Test against the Client.CLI: connect to primary, read ServiceLevel → expect 255; pause primary apply → expect 200; fail primary → client sees Bad_ServerNotConnected + reconnects to peer at 100.
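
C.2 as drafted advertises peers only; review finding 6 (appended below) amends that to self + peers in a stable order. A sketch of the amended array construction — `ServerUriArrayBuilder` is a hypothetical helper name:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ServerUriArrayBuilder
{
    // Self first, then peers in ordinal order — a stable, deterministic layout
    // so compliance tests can assert exact membership and position.
    public static string[] Build(string selfApplicationUri,
                                 IEnumerable<string> peerApplicationUris)
        => new[] { selfApplicationUri }
           .Concat(peerApplicationUris.OrderBy(u => u, StringComparer.Ordinal))
           .ToArray();
}
```

With at most two nodes per cluster (decision #83) the array is [self, peer], but building it generically keeps the topology-change refresh path identical for the degenerate one-node case.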

Stream D — Apply-window integration (2 days)

  1. D.1 sp_PublishGeneration caller wraps the apply in using (coordinator.BeginApplyWindow()). BeginApplyWindow increments an in-process counter; ServiceLevel drops on first increment. Dispose decrements.
  2. D.2 Nested applies are handled by the counter (rare, but Ignition and Kepware clients have both been observed firing rapid-succession draft publishes).
  3. D.3 Test: mid-apply subscribe on primary; assert the subscribing client sees the ServiceLevel drop immediately after the apply starts, then restore after apply completes.
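
The D.1/D.2 counter-with-IDisposable pattern could look like the sketch below — `ApplyWindow` and its `Changed` event are assumed names, and the event is where the ServiceLevel calculator would be notified to drop to the mid-apply band:

```csharp
using System;
using System.Threading;

public sealed class ApplyWindow
{
    private int _depth;

    public bool ApplyInProgress => Volatile.Read(ref _depth) > 0;

    // Fires true on the first Begin and false when the last scope is disposed,
    // so the ServiceLevel calculator can drop/restore the mid-apply band.
    public event Action<bool>? Changed;

    public IDisposable BeginApplyWindow()
    {
        if (Interlocked.Increment(ref _depth) == 1)
            Changed?.Invoke(true);
        return new Scope(this);
    }

    private void End()
    {
        if (Interlocked.Decrement(ref _depth) == 0)
            Changed?.Invoke(false);
    }

    private sealed class Scope : IDisposable
    {
        private ApplyWindow? _owner;
        public Scope(ApplyWindow owner) => _owner = owner;

        // Idempotent: a double Dispose must not unbalance the counter.
        public void Dispose() => Interlocked.Exchange(ref _owner, null)?.End();
    }
}
```

The `using (coordinator.BeginApplyWindow())` call site then guarantees the paired decrement even on the exception path flagged in the risk table.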

Stream E — Admin UI + metrics (3 days)

  1. E.1 RedundancyTab.razor under /cluster/{id}/redundancy. Shows each node's role, current ServiceLevel, peer reachability, last apply timestamp. Role-swap button posts a draft edit on ClusterNode.RedundancyRole; publish applies.
  2. E.2 OpenTelemetry meter export: three gauges per the §Scope metrics. Sink via Phase 6.1 observability.
  3. E.3 SignalR push: FleetStatusHub broadcasts ServiceLevel changes so the Admin UI updates within ~1 s of the coordinator observing a peer flip.
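
E.2's gauges can be registered as observable instruments over coordinator state. A sketch using the standard System.Diagnostics.Metrics API — `RedundancyMetrics`, the meter name, and the `Func` value hooks are assumptions; only the gauge names come from the §Scope table (the peer gauge is omitted here for brevity):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public sealed class RedundancyMetrics
{
    private readonly Meter _meter = new("OtOpcUa.Server.Redundancy");

    // Value providers are assumed hooks into the RedundancyCoordinator;
    // observable gauges are sampled lazily on each OTel collection pass.
    public RedundancyMetrics(Func<byte> serviceLevel, Func<bool> applyInProgress,
                             string cluster, string node)
    {
        var tags = new KeyValuePair<string, object?>[]
        {
            new("cluster", cluster),
            new("node", node),
        };

        _meter.CreateObservableGauge("ot_opcua_service_level",
            () => new Measurement<int>(serviceLevel(), tags));
        _meter.CreateObservableGauge("ot_opcua_apply_in_progress",
            () => new Measurement<int>(applyInProgress() ? 1 : 0, tags));
    }
}
```

Observable gauges avoid pushing on every coordinator transition; the exporter pulls the current value on its own schedule via the Phase 6.1 observability layer.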

Compliance Checks (run at exit gate)

  • Primary-healthy ServiceLevel = 255.
  • Backup-healthy ServiceLevel = 100.
  • Mid-apply Primary ServiceLevel = 200 — verified via Client.CLI subscription polling ServiceLevel during a forced draft publish.
  • Peer-unreachable handling: when a Primary can't probe its Backup's /healthz, Primary still serves at 255 (peer is the one with the problem). When a Backup can't probe Primary, Backup flips to 200 (per decision #81 — a lonely Backup promotes its advertised level to signal "I'll take over if you ask" without auto-promoting).
  • Role transition via operator publish: FleetAdmin swaps RedundancyRole rows in a draft, publishes; both nodes re-read topology on publish confirmation and flip ServiceLevel accordingly — no restart needed.
  • ServerUriArray returns exactly the peer node's ApplicationUri.
  • Client.CLI cutover: with a primary deliberately halted, a client that was connected to primary reconnects to the backup within the ServiceLevel-polling interval.
  • No regression in existing driver test suites; no regression in /healthz reachability under redundancy load.

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Split-brain from operator race (both nodes marked Primary) | Low | High | Coordinator rejects startup if its cluster has >1 Primary row; logs + fails fast. Document as a publish-time validation in sp_PublishGeneration. |
| ServiceLevel thrashing on flaky peer | Medium | Medium | 2 s probe interval + 3-sample smoothing window; a peer is declared unreachable only after 3 consecutive failed probes. |
| Client ignores ServiceLevel and stays on broken primary | Medium | Medium | Documented in docs/Redundancy.md — non-transparent redundancy requires client cooperation; most SCADA clients (Ignition, Kepware, Aveva OI Gateway) honor it. Unit-test the advertised values; field behavior is the client's responsibility. |
| Apply-window counter leaks on exception | Low | High | BeginApplyWindow returns IDisposable; using syntax enforces the paired decrement; unit test the exception-in-apply path. |
| HttpClient probe leaks sockets | Low | Medium | Single shared HttpClient per coordinator (not per probe); tight timeouts avoid keeping connections open during peer downtime. |
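
The 3-sample smoothing mitigation is a small state machine worth pinning down: failures accumulate, any success resets. A sketch under an assumed name (`ReachabilitySmoother`); the asymmetry — instant recovery, delayed declaration of failure — is a design choice that biases against thrashing:

```csharp
public sealed class ReachabilitySmoother
{
    private readonly int _threshold;
    private int _consecutiveFailures;

    public bool PeerReachable { get; private set; } = true;

    public ReachabilitySmoother(int threshold = 3) => _threshold = threshold;

    // Feed one probe outcome; a single success restores reachability immediately,
    // but it takes `threshold` consecutive failures to declare the peer down.
    public bool Observe(bool probeSucceeded)
    {
        if (probeSucceeded)
        {
            _consecutiveFailures = 0;
            PeerReachable = true;
        }
        else if (++_consecutiveFailures >= _threshold)
        {
            PeerReachable = false;
        }
        return PeerReachable;
    }
}
```

At the 2 s probe interval this means a flapping peer costs at most one ServiceLevel transition per ~6 s window instead of one per probe.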

Completion Checklist

  • Stream A: topology loader + tests
  • Stream B: peer probe + ServiceLevel calculator + 32-case matrix tests
  • Stream C: ServiceLevel / ServerUriArray / RedundancySupport node wiring + Client.CLI smoke test
  • Stream D: apply-window integration + nested-apply counter
  • Stream E: Admin RedundancyTab + OpenTelemetry metrics + SignalR push
  • phase-6-3-compliance.ps1 exits 0; exit-gate doc; docs/Redundancy.md updated with the ServiceLevel matrix

Adversarial Review — 2026-04-19 (Codex, thread 019da490-3fa0-7340-98b8-cceeca802550)

  1. Crit · ACCEPT — No publish-generation fencing: a split publish can leave both nodes advertising as authoritative. Change: coordinator CAS on a monotonic ConfigGenerationId; every topology decision is generation-stamped; peers reject state propagated from a lower generation.
  2. Crit · ACCEPT — >1 Primary at startup is covered, but runtime containment is missing when an invalid topology appears later (mid-apply race). Change: add a runtime InvalidTopology state — both nodes self-demote to ServiceLevel 2 (the "detected inconsistency" band, below normal operation) until convergence.
  3. High · ACCEPT — 0 = Faulted collides with OPC UA Part 5 §6.3.34 semantics, where 0 means Maintenance and 1 means NoData. Change: reserve 0 for operator-declared maintenance mode only; Faulted/unreachable uses 1 (NoData); in-range degraded states occupy 2..199.
  4. High · ACCEPT — Matrix collapses distinct operational states onto the same value. Change: matrix expanded to Authoritative-Primary=255, Isolated-Primary=230 (peer unreachable — still serving), Primary-Mid-Apply=200, Recovering-Primary=180, Authoritative-Backup=100, Isolated-Backup=80 (primary unreachable — "take over if asked"), Backup-Mid-Apply=50, Recovering-Backup=30.
  5. High · ACCEPT — /healthz from 6.1 is HTTP-healthy but doesn't guarantee the OPC UA data plane. Change: add a redundancy-specific probe, UaHealthProbe — issues a ReadAsync(ServiceLevel) against the peer's OPC UA endpoint via a lightweight client session. /healthz remains the fast-fail; the UA probe is the authority signal.
  6. High · ACCEPT — ServerUriArray must include self + peers, not peers only. Change: the array contains [self.ApplicationUri, peer.ApplicationUri] in stable deterministic ordering; a compliance test asserts local-plus-peer membership.
  7. Med · ACCEPT — No Faulted → Recovering → Healthy path. Change: add Recovering state with min dwell time (60 s default) + positive publish witness (one successful Read on a reference node) before returning to Healthy. Thrash-prevention.
  8. Med · ACCEPT — Topology change during in-flight probe undefined. Change: every probe task tagged with ConfigGenerationId at dispatch; obsolete results discarded; in-flight probes cancelled on topology reload.
  9. Med · ACCEPT — Apply-window counter race on exception/cancellation/async ownership. Change: apply-window is a named lease keyed to (ConfigGenerationId, PublishRequestId) with disposal enforced via await using; watchdog detects leased-but-abandoned and force-closes after ApplyMaxDuration (default 10 min).
  10. High · ACCEPT — Ignition + Kepware + Aveva OI Gateway ServiceLevel compliance is unverified. Change: risk elevated to High; add Stream F (new) — build an interop matrix: validate against Ignition 8.1/8.3, Kepware KEPServerEX 6.x, Aveva OI Gateway 2020R2 + 2023R1. Document per-client cutover behavior. Field deployments get a documented compatibility table; clients that ignore ServiceLevel documented as requiring explicit backup-endpoint config.
  11. Med · ACCEPT — Galaxy MXAccess re-session on Primary death not in acceptance. Change: Stream F adds an end-to-end failover smoke test that boots Galaxy.Proxy on both nodes, kills Primary, asserts Galaxy consumer reconnects to Backup within (SessionTimeout + KeepAliveInterval × 3) budget. docs/Redundancy.md updated with required session timeouts.
  12. Med · ACCEPT — Transparent-mode startup rejection is outage-prone. Change: sp_PublishGeneration validates RedundancyMode pre-publish — unsupported values reject the publish attempt with a clear validation error; runtime never sees an unsupported mode. Last-good config stays active.
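
Finding 9's named-lease-with-watchdog replacement for the bare counter could be sketched as below. `ApplyLeaseRegistry`, `Acquire`, and `SweepAbandoned` are hypothetical names; the `(ConfigGenerationId, PublishRequestId)` key and the 10-minute `ApplyMaxDuration` default come from the finding itself:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class ApplyLeaseRegistry
{
    private readonly ConcurrentDictionary<(long Gen, Guid Publish), DateTimeOffset> _open = new();

    public TimeSpan ApplyMaxDuration { get; init; } = TimeSpan.FromMinutes(10);

    public bool ApplyInProgress => !_open.IsEmpty;

    // Call site: await using var lease = registry.Acquire(gen, publishRequestId);
    public IAsyncDisposable Acquire(long configGenerationId, Guid publishRequestId)
    {
        _open[(configGenerationId, publishRequestId)] = DateTimeOffset.UtcNow;
        return new Lease(this, configGenerationId, publishRequestId);
    }

    // Watchdog sweep: force-close leases held longer than ApplyMaxDuration
    // (abandoned by a crashed or cancelled apply); returns how many were closed.
    public int SweepAbandoned(DateTimeOffset now)
    {
        int closed = 0;
        foreach (var kv in _open)
            if (now - kv.Value > ApplyMaxDuration && _open.TryRemove(kv.Key, out _))
                closed++;
        return closed;
    }

    private sealed class Lease : IAsyncDisposable
    {
        private readonly ApplyLeaseRegistry _owner;
        private readonly (long, Guid) _key;

        public Lease(ApplyLeaseRegistry owner, long gen, Guid publish)
        {
            _owner = owner;
            _key = (gen, publish);
        }

        public ValueTask DisposeAsync()
        {
            _owner._open.TryRemove(_key, out _);
            return ValueTask.CompletedTask;
        }
    }
}
```

Unlike the depth counter, each lease is individually identifiable, so the watchdog can close exactly the abandoned apply without disturbing a concurrent legitimate one.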