Phase 6.3 — Redundancy Runtime
Status: DRAFT — CLAUDE.md + docs/Redundancy.md describe a non-transparent warm/hot redundancy model with unique ApplicationUris, RedundancySupport advertisement, ServerUriArray, and a dynamic ServiceLevel. Entities (ServerCluster, ClusterNode, RedundancyRole, RedundancyMode) exist; the runtime behavior (actual ServiceLevel computation, mid-apply dip, ServerUriArray broadcast) is not wired.
Branch: v2/phase-6-3-redundancy-runtime
Estimated duration: 2 weeks
Predecessor: Phase 6.2 (Authorization) — reuses the Phase 6.1 health endpoints for cluster-peer probing
Successor: Phase 6.4 (Admin UI completion)
Phase Objective
Land the non-transparent redundancy protocol end-to-end: two OtOpcUa.Server instances in a ServerCluster each expose a live ServiceLevel node whose value reflects that instance's suitability to serve traffic, advertise each other via ServerUriArray, and transition role (Primary ↔ Backup) based on health + operator intent.
Closes these gaps:
- Dynamic ServiceLevel — OPC UA Part 5 §6.3.34 specifies a Byte (0..255) that clients poll to pick the healthiest server. Our server publishes it as a static value today.
- ServerUriArray broadcast — Part 4 specifies that every node in a redundant pair should advertise its peers' ApplicationUris. Currently the server advertises only its own.
- Primary / Backup role coordination — entities carry RedundancyRole, but the runtime doesn't read it; no peer health probing; no role transfer on primary failure.
- Mid-apply dip — decision-level expectation that a server mid-generation-apply should report a lower ServiceLevel so clients cut over to the peer during the apply window. Not implemented.
Scope — What Changes
| Concern | Change |
|---|---|
| OtOpcUa.Server → new Server.Redundancy sub-namespace | RedundancyCoordinator singleton. Resolves the current node's ClusterNode row at startup, loads its peers from ServerCluster, probes each peer's /healthz (Phase 6.1 endpoint) every PeerProbeInterval (default 2 s), and maintains per-peer health state. |
| OPC UA server root | ServiceLevel variable node becomes a BaseDataVariable whose value updates on RedundancyCoordinator state change. ServerUriArray array variable refreshes on cluster-topology change. RedundancySupport stays static (set from RedundancyMode at startup). |
| RedundancyCoordinator computation | ServiceLevel formula: 255 = Primary + fully healthy + no apply in progress; 200 = Primary + apply in progress (clients should prefer the peer); 100 = Backup + fully healthy; 50 = Backup + mid-apply; 0 = Faulted, or peer unreachable and this node is not authoritative. Documented in the docs/Redundancy.md update. |
| Role transition | Split-brain avoidance: role is declared in the shared config DB (ClusterNode.RedundancyRole), not elected at runtime. An operator flips the row (or a failover script does). The coordinator only reads; it never writes. |
| sp_PublishGeneration hook | Before the apply starts, the coordinator sets ApplyInProgress = true in memory → ServiceLevel drops to the mid-apply band. Clears after sp_PublishGeneration returns. |
| Admin UI /cluster/{id} page | New RedundancyTab.razor — shows the current node's role, ServiceLevel, and peer reachability. FleetAdmin can trigger a role swap by editing ClusterNode.RedundancyRole and publishing a draft. |
| Metrics | New OpenTelemetry metrics: ot_opcua_service_level{cluster,node}, ot_opcua_peer_reachable{cluster,node,peer}, ot_opcua_apply_in_progress{cluster,node}. Sink via the Phase 6.1 observability layer. |
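The ServiceLevel formula above can be sketched as a pure function. This is a minimal illustration only — NodeRole and ServiceLevels are hypothetical names, not the shipped API, and the peer-unreachable input from the full matrix is omitted for brevity:

```csharp
using System;

// Hypothetical sketch of the RedundancyCoordinator computation row above.
public enum NodeRole { Primary, Backup }

public static class ServiceLevels
{
    public static byte Compute(NodeRole role, bool selfHealthy, bool applyInProgress)
    {
        if (!selfHealthy) return 0;               // Faulted band
        return (role, applyInProgress) switch
        {
            (NodeRole.Primary, false) => 255,     // Primary, fully healthy
            (NodeRole.Primary, true)  => 200,     // Primary, mid-apply: prefer the peer
            (NodeRole.Backup,  false) => 100,     // Backup, fully healthy
            (NodeRole.Backup,  true)  => 50,      // Backup, mid-apply
            _ => (byte)0,                         // unreachable for valid enum values
        };
    }
}
```

A pure function keeps the 32-case matrix test in Stream B trivially table-driven.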
Scope — What Does NOT Change
| Item | Reason |
|---|---|
| OPC UA authn / authz | Phases 6.2 + prior. Redundancy is orthogonal. |
| Driver layer | Drivers aren't redundancy-aware; they run on each node independently against the same equipment. The server layer handles the ServiceLevel story. |
| Automatic failover / election | Explicitly out of scope. Non-transparent = client picks which server to use via ServiceLevel + ServerUriArray. We do NOT ship consensus, leader election, or automatic promotion. Operator-driven failover is the v2.0 model per decision #79–85. |
| Transparent redundancy (RedundancySupport = Transparent) | Not supported. If the operator asks for it, the server fails startup with a clear error. |
| Historian redundancy | Galaxy Historian's own redundancy (two historians on two CPUs) is out of scope. The Galaxy driver talks to whichever historian is reachable from its node. |
Entry Gate Checklist
- Phase 6.1 merged (uses /healthz for peer probing)
- CLAUDE.md §Redundancy + docs/Redundancy.md re-read
- Decisions #79–85 re-skimmed
- ServerCluster / ClusterNode / RedundancyRole / RedundancyMode entities + existing migration reviewed
- OPC UA Part 4 §Redundancy + Part 5 §6.3.34 (ServiceLevel) re-skimmed
- Dev box has two OtOpcUa.Server instances configured against the same cluster — one designated Primary, one Backup — for integration testing
Task Breakdown
Stream A — Cluster topology loader (3 days)
- A.1 RedundancyCoordinator startup path: reads the ClusterNode row for the current node (identified by appsettings.json Cluster:NodeId), reads the cluster's peer list, validates invariants (no duplicate ApplicationUri, at most one Primary per cluster if RedundancyMode.WarmActive, at most two nodes total in v2.0 per decision #83).
- A.2 Topology subscription — the coordinator re-reads on sp_PublishGeneration confirmation so an operator role swap takes effect after publish (no process restart needed).
- A.3 Tests: two-node cluster seed, one-node cluster seed (degenerate), duplicate-URI rejection.
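The A.1 invariants can be expressed as a fail-fast validation at startup. A sketch under assumed names — ClusterNodeRow, NodeRole, and ValidateTopology are illustrative, not the shipped types:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum NodeRole { Primary, Backup }
public sealed record ClusterNodeRow(string ApplicationUri, NodeRole Role);

public static class TopologyInvariants
{
    // Throws on any invariant violation so the coordinator fails fast at startup.
    public static void ValidateTopology(IReadOnlyList<ClusterNodeRow> nodes, bool warmActive)
    {
        if (nodes.Count > 2)
            throw new InvalidOperationException(
                "v2.0 clusters are limited to two nodes (decision #83).");

        if (nodes.GroupBy(n => n.ApplicationUri).Any(g => g.Count() > 1))
            throw new InvalidOperationException("Duplicate ApplicationUri in cluster.");

        if (warmActive && nodes.Count(n => n.Role == NodeRole.Primary) > 1)
            throw new InvalidOperationException(
                "More than one Primary in a WarmActive cluster.");
    }
}
```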
Stream B — Peer health probing + ServiceLevel computation (4 days)
- B.1 PeerProbeLoop runs per peer at PeerProbeInterval (2 s default, configurable via appsettings.json). Calls the peer's /healthz via HttpClient; timeout 1 s. Exponential backoff on sustained failure.
- B.2 ServiceLevelCalculator.Compute(current role, self health, peer reachable, apply in progress) → byte. Matrix documented in §Scope.
- B.3 The calculator reacts to inputs via the IObserver pattern so changes immediately push to the OPC UA ServiceLevel node.
- B.4 Tests: matrix coverage for all role × health × apply permutations (32 cases); injected IClock + fake HttpClient so tests are deterministic.
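The probe loop in B.1 might look roughly like this — a sketch assuming a single shared HttpClient per coordinator and the 3-consecutive-failure threshold from §Risks; all names are illustrative and the exponential backoff is elided:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public sealed class PeerProbeLoop
{
    private readonly HttpClient _http;   // one shared instance per coordinator, not per probe
    private int _consecutiveFailures;

    public PeerProbeLoop(HttpClient http) => _http = http;

    public bool PeerReachable { get; private set; } = true;

    public async Task RunAsync(Uri healthz, TimeSpan interval, CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            using var timeout = CancellationTokenSource.CreateLinkedTokenSource(ct);
            timeout.CancelAfter(TimeSpan.FromSeconds(1));   // 1 s probe timeout
            try
            {
                var resp = await _http.GetAsync(healthz, timeout.Token);
                resp.EnsureSuccessStatusCode();
                _consecutiveFailures = 0;
                PeerReachable = true;
            }
            catch (Exception) when (!ct.IsCancellationRequested)
            {
                // Only declare the peer unreachable after 3 consecutive failed probes,
                // per the thrash mitigation in §Risks.
                if (++_consecutiveFailures >= 3) PeerReachable = false;
            }
            await Task.Delay(interval, ct);   // exponential backoff elided for brevity
        }
    }
}
```

In tests, the injected IClock and fake HttpClient from B.4 would replace the real timer and transport.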
Stream C — OPC UA node wiring (3 days)
- C.1 ServiceLevel variable node created under ServerStatus at server startup. Type Byte, AccessLevel = CurrentRead only. Subscribes to the ServiceLevelCalculator observable; pushes updates via DataChangeNotification.
- C.2 ServerUriArray variable node under ServerCapabilities. Array of String, length = peer count. Updates on topology change.
- C.3 RedundancySupport variable — static at startup, from RedundancyMode. Values: None, Cold, Warm, WarmActive, Hot. Phase 6.3 supports everything except Transparent + HotAndMirrored.
- C.4 Test against the Client.CLI: connect to the primary, read ServiceLevel → expect 255; pause the primary's apply → expect 200; fail the primary → the client sees Bad_ServerNotConnected and reconnects to the peer at 100.
Stream D — Apply-window integration (2 days)
- D.1 The sp_PublishGeneration caller wraps the apply in using (coordinator.BeginApplyWindow()). BeginApplyWindow increments an in-process counter; ServiceLevel drops on the first increment. Dispose decrements.
- D.2 Nested applies are handled by the counter (rare, but Ignition and Kepware clients have both been observed firing rapid-succession draft publishes).
- D.3 Test: mid-apply subscribe on primary; assert the subscribing client sees the ServiceLevel drop immediately after the apply starts, then restore after apply completes.
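The counter semantics of D.1/D.2 can be sketched as follows — names are illustrative, not the shipped API. A nested apply bumps the counter past 1, so ServiceLevel drops on the first increment and only restores on the last Dispose:

```csharp
using System;
using System.Threading;

public sealed class RedundancyCoordinator
{
    private int _applyDepth;

    public bool ApplyInProgress => Volatile.Read(ref _applyDepth) > 0;

    public IDisposable BeginApplyWindow()
    {
        if (Interlocked.Increment(ref _applyDepth) == 1)
            OnApplyStateChanged();           // first increment: ServiceLevel drops
        return new ApplyWindow(this);
    }

    private void OnApplyStateChanged() { /* push recomputed ServiceLevel to the node */ }

    private sealed class ApplyWindow : IDisposable
    {
        private RedundancyCoordinator? _owner;
        public ApplyWindow(RedundancyCoordinator owner) => _owner = owner;

        public void Dispose()
        {
            // Null-swap makes Dispose idempotent, so double-dispose can't
            // decrement twice and leak a negative depth.
            var owner = Interlocked.Exchange(ref _owner, null);
            if (owner != null && Interlocked.Decrement(ref owner._applyDepth) == 0)
                owner.OnApplyStateChanged(); // last decrement: ServiceLevel restores
        }
    }
}
```

Returning IDisposable keeps the using syntax honest even on the exception-in-apply path called out in §Risks.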
Stream E — Admin UI + metrics (3 days)
- E.1 RedundancyTab.razor under /cluster/{id}/redundancy. Shows each node's role, current ServiceLevel, peer reachability, and last apply timestamp. The role-swap button posts a draft edit on ClusterNode.RedundancyRole; publish applies it.
- E.2 OpenTelemetry meter export: three gauges per the §Scope metrics. Sink via Phase 6.1 observability.
- E.3 SignalR push: FleetStatusHub broadcasts ServiceLevel changes so the Admin UI updates within ~1 s of the coordinator observing a peer flip.
Compliance Checks (run at exit gate)
- Primary-healthy ServiceLevel = 255.
- Backup-healthy ServiceLevel = 100.
- Mid-apply Primary ServiceLevel = 200 — verified via Client.CLI subscription polling ServiceLevel during a forced draft publish.
- Peer-unreachable handling: when a Primary can't probe its Backup's /healthz, the Primary still serves at 255 (the peer is the one with the problem). When a Backup can't probe the Primary, the Backup flips to 200 (per decision #81 — a lonely Backup promotes its advertised level to signal "I'll take over if you ask" without auto-promoting).
- Role transition via operator publish: FleetAdmin swaps RedundancyRole rows in a draft and publishes; both nodes re-read topology on publish confirmation and flip ServiceLevel accordingly — no restart needed.
- ServerUriArray returns exactly the peer node's ApplicationUri.
- Client.CLI cutover: with a primary deliberately halted, a client that was connected to primary reconnects to the backup within the ServiceLevel-polling interval.
- No regression in existing driver test suites; no regression in /healthz reachability under redundancy load.
Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Split-brain from operator race (both nodes marked Primary) | Low | High | Coordinator rejects startup if its cluster has >1 Primary row; logs + fails fast. Document as a publish-time validation in sp_PublishGeneration. |
| ServiceLevel thrashing on flaky peer | Medium | Medium | 2 s probe interval + 3-sample smoothing window; only declares a peer unreachable after 3 consecutive failed probes |
| Client ignores ServiceLevel and stays on broken primary | Medium | Medium | Documented in docs/Redundancy.md — non-transparent redundancy requires client cooperation; most SCADA clients (Ignition, Kepware, Aveva OI Gateway) honor it. Unit-test the advertised values; field behavior is client-responsibility |
| Apply-window counter leaks on exception | Low | High | BeginApplyWindow returns IDisposable; using syntax enforces paired decrement; unit test for exception-in-apply path |
| HttpClient probe leaks sockets | Low | Medium | Single shared HttpClient per coordinator (not per probe); tight timeouts to avoid keeping connections open during peer downtime |
Completion Checklist
- Stream A: topology loader + tests
- Stream B: peer probe + ServiceLevel calculator + 32-case matrix tests
- Stream C: ServiceLevel / ServerUriArray / RedundancySupport node wiring + Client.CLI smoke test
- Stream D: apply-window integration + nested-apply counter
- Stream E: Admin RedundancyTab + OpenTelemetry metrics + SignalR push
- phase-6-3-compliance.ps1 exits 0; exit-gate doc; docs/Redundancy.md updated with the ServiceLevel matrix
Adversarial Review — 2026-04-19 (Codex, thread 019da490-3fa0-7340-98b8-cceeca802550)
- Crit · ACCEPT — No publish-generation fencing enables a split publish advertising both nodes as authoritative. Change: coordinator CAS on a monotonic ConfigGenerationId; every topology decision is generation-stamped; peers reject state propagated from a lower generation.
- Crit · ACCEPT — >1 Primary at startup is covered, but runtime containment is missing when an invalid topology appears later (mid-apply race). Change: add a runtime InvalidTopology state — both nodes self-demote to ServiceLevel 2 (the "detected inconsistency" band, below normal operation) until convergence.
- High · ACCEPT — 0 = Faulted collides with OPC UA Part 5 §6.3.34 semantics, where 0 means Maintenance and 1 means NoData. Change: reserve 0 for operator-declared maintenance mode only; Faulted/unreachable uses 1 (NoData); in-range degraded states occupy 2..199.
- High · ACCEPT — The matrix collapses distinct operational states onto the same value. Change: matrix expanded to Authoritative-Primary = 255, Isolated-Primary = 230 (peer unreachable — still serving), Primary-Mid-Apply = 200, Recovering-Primary = 180, Authoritative-Backup = 100, Isolated-Backup = 80 (primary unreachable — "take over if asked"), Backup-Mid-Apply = 50, Recovering-Backup = 30.
- High · ACCEPT — /healthz from 6.1 is HTTP-healthy but doesn't guarantee the OPC UA data plane. Change: add a redundancy-specific probe, UaHealthProbe — issues a ReadAsync(ServiceLevel) against the peer's OPC UA endpoint via a lightweight client session. /healthz remains the fast-fail; the UA probe is the authority signal.
- High · ACCEPT — ServerUriArray must include self + peers, not peers only. Change: the array contains [self.ApplicationUri, peer.ApplicationUri] in stable deterministic ordering; a compliance test asserts local-plus-peer membership.
- Med · ACCEPT — No Faulted → Recovering → Healthy path. Change: add a Recovering state with a minimum dwell time (60 s default) + a positive publish witness (one successful Read on a reference node) before returning to Healthy. Thrash prevention.
- Med · ACCEPT — Topology change during an in-flight probe is undefined. Change: every probe task is tagged with the ConfigGenerationId at dispatch; obsolete results are discarded; in-flight probes are cancelled on topology reload.
- Med · ACCEPT — Apply-window counter races on exception/cancellation/async ownership. Change: the apply window becomes a named lease keyed to (ConfigGenerationId, PublishRequestId) with disposal enforced via await using; a watchdog detects leased-but-abandoned windows and force-closes them after ApplyMaxDuration (default 10 min).
- High · ACCEPT — Ignition + Kepware + Aveva OI Gateway ServiceLevel compliance is unverified. Change: risk elevated to High; add Stream F (new) — build an interop matrix: validate against Ignition 8.1/8.3, Kepware KEPServerEX 6.x, and Aveva OI Gateway 2020R2 + 2023R1. Document per-client cutover behaviour. Field deployments get a documented compatibility table; clients that ignore ServiceLevel are documented as requiring explicit backup-endpoint config.
- Med · ACCEPT — Galaxy MXAccess re-session on Primary death is not in the acceptance criteria. Change: Stream F adds an end-to-end failover smoke test that boots Galaxy.Proxy on both nodes, kills the Primary, and asserts the Galaxy consumer reconnects to the Backup within a (SessionTimeout + KeepAliveInterval × 3) budget. docs/Redundancy.md updated with the required session timeouts.
- Med · ACCEPT — Transparent-mode startup rejection is outage-prone. Change: sp_PublishGeneration validates RedundancyMode pre-publish — unsupported values reject the publish attempt with a clear validation error; the runtime never sees an unsupported mode. Last-good config stays active.
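Collecting the accepted review changes, the revised bands (the expanded matrix plus the Part 5-aligned reserved values) could be recorded as named constants. A sketch only — ServiceLevelBands and the member names are illustrative:

```csharp
public static class ServiceLevelBands
{
    public const byte Maintenance          = 0;   // operator-declared only (Part 5: Maintenance)
    public const byte NoData               = 1;   // Faulted / unreachable (Part 5: NoData)
    public const byte InvalidTopology      = 2;   // runtime "detected inconsistency" band

    public const byte RecoveringBackup     = 30;
    public const byte BackupMidApply       = 50;
    public const byte IsolatedBackup       = 80;  // primary unreachable: "take over if asked"
    public const byte AuthoritativeBackup  = 100;
    public const byte RecoveringPrimary    = 180;
    public const byte PrimaryMidApply      = 200;
    public const byte IsolatedPrimary      = 230; // peer unreachable, still serving
    public const byte AuthoritativePrimary = 255;
}
```

Keeping the bands in one place makes the 2..199 degraded range and the two reserved low values hard to violate accidentally, and gives the compliance script a single source of truth.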