lmxopcua/docs/v2/implementation/phase-6-3-redundancy-runtime.md
Joseph Doherty 4695a5c88e Phase 6 — Draft 4 implementation plans covering v2 unimplemented features + adversarial review + adjustments. After drivers were paused per user direction, audited the v2 plan for features documented-but-unshipped and identified four coherent tracks that had no implementation plan at all. Each plan follows the docs/v2/implementation/phase-*.md template (DRAFT status, branch name, Stream A-E task breakdown, Compliance Checks, Risks, Completion Checklist). docs/v2/implementation/phase-6-1-resilience-and-observability.md (243 lines) covers Polly resilience pipelines wired to every capability interface, Tier A/B/C runtime enforcement (memory watchdog generalized beyond Galaxy, scheduled recycle per decision #67, wedge detection), health endpoints on :4841, structured Serilog with correlation IDs, LiteDB local-cache fallback per decision #36. phase-6-2-authorization-runtime.md (145 lines) wires ACL enforcement on every OPC UA Read/Write/Subscribe/Call path + LDAP-group-to-admin-role grants per decisions #105 and #129 -- runtime permission-trie evaluator over the 6-level Cluster/Namespace/UnsArea/UnsLine/Equipment/Tag hierarchy, per-session cache invalidated on generation-apply + LDAP-cache expiry. phase-6-3-redundancy-runtime.md (165 lines) lands the non-transparent warm/hot redundancy runtime per decisions #79-85: dynamic ServiceLevel node, ServerUriArray peer broadcast, mid-apply dip via sp_PublishGeneration hook, operator-driven role transition (no auto-election -- plan remains explicit about what's out of scope). phase-6-4-admin-ui-completion.md (178 lines) closes Phase 1 Stream E completion-checklist items that never landed: UNS drag-reorder + impact preview, Equipment CSV import, 5-identifier search, draft-diff viewer enhancements, OPC 40010 _base Identification field exposure per decisions #138-139. Each plan then got a Codex adversarial-review pass (codex mcp tool, read-only sandbox, synchronous). 
Reviews explicitly targeted decision-log conflicts, API-shape assumptions, unbounded blast radius, under-specified state transitions, and testing holes. Appended 'Adversarial Review — 2026-04-19' section to each plan with numbered findings (severity / finding / why-it-matters / adjustment accepted). Review surfaced real substantive issues that the initial drafts glossed over: Phase 6.1 auto-retry conflicting with decisions #44-45 no-auto-write-retry rule; Phase 6.1 per-driver-instance pipeline breaking decision #35's per-device isolation; Phase 6.1 recycle/watchdog at Tier A/B breaching decisions #73-74 Tier-C-only constraint; Phase 6.2 conflating control-plane LdapGroupRoleMapping with data-plane ACL grants; Phase 6.2 missing Browse enforcement entirely; Phase 6.2 subscription re-authorization policy unresolved between create-time-only and per-publish; Phase 6.3 ServiceLevel=0 colliding with OPC UA Part 5 Maintenance semantics; Phase 6.3 ServerUriArray excluding self (spec-bug); Phase 6.3 apply-window counter race on cancellation; Phase 6.3 client cutover for Kepware/Aveva OI Gateway is unverified hearsay; Phase 6.4 stale UNS impact preview overwriting concurrent draft edits; Phase 6.4 identifier contract drifting from admin-ui.md canonical set (ZTag/MachineCode/SAPID/EquipmentId/EquipmentUuid, not ZTag/SAPID/UniqueId/Alias1/Alias2); Phase 6.4 CSV import atomicity internally contradictory (single txn vs chunked inserts); Phase 6.4 OPC 40010 field list not matching decision #139. Every finding has an adjustment in the plan doc -- plans are meant to be executable from the next session with the critique already baked in rather than a clean draft that would run into the same issues at implementation time. Codex thread IDs cited in each plan's review section for reproducibility. Pure documentation PR -- no code changes. Plans are DRAFT status; each becomes its own implementation phase with its own entry-gate + exit-gate when business prioritizes.
2026-04-19 03:15:00 -04:00


Phase 6.3 — Redundancy Runtime

Status: DRAFT — CLAUDE.md + docs/Redundancy.md describe a non-transparent warm/hot redundancy model with unique ApplicationUris, RedundancySupport advertisement, ServerUriArray, and dynamic ServiceLevel. Entities (ServerCluster, ClusterNode, RedundancyRole, RedundancyMode) exist; the runtime behavior (actual ServiceLevel number computation, mid-apply dip, ServerUriArray broadcast) is not wired.

Branch: v2/phase-6-3-redundancy-runtime
Estimated duration: 2 weeks
Predecessor: Phase 6.2 (Authorization) — reuses the Phase 6.1 health endpoints for cluster-peer probing
Successor: Phase 6.4 (Admin UI completion)

Phase Objective

Land the non-transparent redundancy protocol end-to-end: two OtOpcUa.Server instances in a ServerCluster each expose a live ServiceLevel node whose value reflects that instance's suitability to serve traffic, advertise each other via ServerUriArray, and transition role (Primary ↔ Backup) based on health + operator intent.

Closes these gaps:

  1. Dynamic ServiceLevel — OPC UA Part 5 §6.3.34 specifies a Byte (0..255) that clients poll to pick the healthiest server. Our server publishes it as a static value today.
  2. ServerUriArray broadcast — Part 4 specifies that every node in a redundant pair should advertise its peers' ApplicationUris. Currently advertises only its own.
  3. Primary / Backup role coordination — entities carry RedundancyRole but the runtime doesn't read it; no peer health probing; no role-transfer on primary failure.
  4. Mid-apply dip — decision-level expectation that a server mid-generation-apply should report a lower ServiceLevel so clients cut over to the peer during the apply window. Not implemented.

Scope — What Changes

| Concern | Change |
| --- | --- |
| OtOpcUa.Server → new Server.Redundancy sub-namespace | RedundancyCoordinator singleton. Resolves the current node's ClusterNode row at startup, loads its peers from ServerCluster, probes each peer's /healthz (Phase 6.1 endpoint) every PeerProbeInterval (default 2 s), maintains per-peer health state. |
| OPC UA server root | ServiceLevel variable node becomes a BaseDataVariable whose value updates on RedundancyCoordinator state change. ServerUriArray array variable refreshes on cluster-topology change. RedundancySupport stays static (set from RedundancyMode at startup). |
| RedundancyCoordinator computation | ServiceLevel formula: 255 = Primary + fully healthy + no apply in progress; 200 = Primary + mid-apply (clients should prefer the peer); 100 = Backup + fully healthy; 50 = Backup + mid-apply; 0 = Faulted, or peer unreachable while not authoritative. Documented in the docs/Redundancy.md update. |
| Role transition | Split-brain avoidance: the role is declared in the shared config DB (ClusterNode.RedundancyRole), not elected at runtime. An operator flips the row (or a failover script does). The coordinator only reads; it never writes. |
| sp_PublishGeneration hook | Before the apply starts, the coordinator sets ApplyInProgress = true in-memory → ServiceLevel drops to the mid-apply band. Clears after sp_PublishGeneration returns. |
| Admin UI /cluster/{id} page | New RedundancyTab.razor — shows the current node's role + ServiceLevel + peer reachability. FleetAdmin can trigger a role-swap by editing ClusterNode.RedundancyRole and publishing a draft. |
| Metrics | New OpenTelemetry metrics: ot_opcua_service_level{cluster,node}, ot_opcua_peer_reachable{cluster,node,peer}, ot_opcua_apply_in_progress{cluster,node}. Sink via the Phase 6.1 observability layer. |
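
The ServiceLevel formula above is a pure function of four inputs, which makes it a natural candidate for exhaustive matrix tests. A minimal sketch follows — the type and member names (`RedundancyRole`, `ServiceLevelCalculator.Compute`) are illustrative, not the shipped API, and it encodes the draft bands exactly as tabled here (the adversarial review below later widens them):

```csharp
using System;

public enum RedundancyRole { Primary, Backup }

public static class ServiceLevelCalculator
{
    // Draft matrix from the Scope table: 255/200 for Primary, 100/50 for Backup,
    // 0 = Faulted, or peer-unreachable while not authoritative.
    public static byte Compute(RedundancyRole role, bool selfHealthy,
                               bool peerReachable, bool applyInProgress)
    {
        if (!selfHealthy)
            return 0; // Faulted

        if (role == RedundancyRole.Primary)
            return applyInProgress ? (byte)200 : (byte)255;

        // Backup with an unreachable peer is not authoritative (draft band;
        // review finding 4 replaces this with Isolated-Backup=80).
        if (!peerReachable)
            return 0;

        return applyInProgress ? (byte)50 : (byte)100;
    }
}
```

Keeping the calculator side-effect free means the Stream B matrix tests reduce to asserting one byte per input tuple.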

Scope — What Does NOT Change

| Item | Reason |
| --- | --- |
| OPC UA authn / authz | Phases 6.2 and prior. Redundancy is orthogonal. |
| Driver layer | Drivers aren't redundancy-aware; they run on each node independently against the same equipment. The server layer handles the ServiceLevel story. |
| Automatic failover / election | Explicitly out of scope. Non-transparent = the client picks which server to use via ServiceLevel + ServerUriArray. We do NOT ship consensus, leader election, or automatic promotion. Operator-driven failover is the v2.0 model per decisions #79-85. |
| Transparent redundancy (RedundancySupport=Transparent) | Not supported. If the operator asks for it, the server fails startup with a clear error. |
| Historian redundancy | Galaxy Historian's own redundancy (two historians on two CPUs) is out of scope. The Galaxy driver talks to whichever historian is reachable from its node. |

Entry Gate Checklist

  • Phase 6.1 merged (uses /healthz for peer probing)
  • CLAUDE.md §Redundancy + docs/Redundancy.md re-read
  • Decisions #79-85 re-skimmed
  • ServerCluster/ClusterNode/RedundancyRole/RedundancyMode entities + existing migration reviewed
  • OPC UA Part 4 §Redundancy + Part 5 §6.3.34 (ServiceLevel) re-skimmed
  • Dev box has two OtOpcUa.Server instances configured against the same cluster — one designated Primary, one Backup — for integration testing

Task Breakdown

Stream A — Cluster topology loader (3 days)

  1. A.1 RedundancyCoordinator startup path: reads ClusterNode row for the current node (identified by appsettings.json Cluster:NodeId), reads the cluster's peer list, validates invariants (no duplicate ApplicationUri, at most one Primary per cluster if RedundancyMode.WarmActive, at most two nodes total in v2.0 per decision #83).
  2. A.2 Topology subscription — coordinator re-reads on sp_PublishGeneration confirmation so an operator role-swap takes effect after publish (no process restart needed).
  3. A.3 Tests: two-node cluster seed, one-node cluster seed (degenerate), duplicate-uri rejection.
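
The A.1 invariants can be sketched as a single pure validation over the loaded rows. This is an assumed shape — `ClusterNodeRow`, `TopologyValidator`, and the string-typed role are stand-ins for the real EF entities:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record ClusterNodeRow(string NodeId, string ApplicationUri, string RedundancyRole);

public static class TopologyValidator
{
    // Returns null when the cluster satisfies the A.1 invariants,
    // otherwise the violated rule (for fail-fast logging at startup).
    public static string? Validate(IReadOnlyList<ClusterNodeRow> nodes, bool warmActive)
    {
        if (nodes.Count == 0 || nodes.Count > 2)
            return "v2.0 clusters contain one or two nodes (decision #83).";

        if (nodes.Select(n => n.ApplicationUri)
                 .Distinct(StringComparer.OrdinalIgnoreCase)
                 .Count() != nodes.Count)
            return "Duplicate ApplicationUri within the cluster.";

        if (warmActive && nodes.Count(n => n.RedundancyRole == "Primary") > 1)
            return "At most one Primary per WarmActive cluster.";

        return null;
    }
}
```

Returning the violated rule as a string (rather than throwing) keeps the A.3 seed tests to simple equality assertions.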

Stream B — Peer health probing + ServiceLevel computation (4 days)

  1. B.1 PeerProbeLoop runs per peer at PeerProbeInterval (2 s default, configurable via appsettings.json). Calls peer's /healthz via HttpClient; timeout 1 s. Exponential backoff on sustained failure.
  2. B.2 ServiceLevelCalculator.Compute(current role, self health, peer reachable, apply in progress) → byte. Matrix documented in §Scope.
  3. B.3 Calculator reacts to inputs via IObserver pattern so changes immediately push to the OPC UA ServiceLevel node.
  4. B.4 Tests: matrix coverage for all role × health × apply permutations (32 cases); injected IClock + fake HttpClient so tests are deterministic.
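
B.1's "exponential backoff on sustained failure" can be isolated as a pure schedule so the probe loop stays trivially testable without a clock. A sketch under assumed names (`ProbeBackoff.NextDelay`; the 2 s base and 30 s cap are illustrative defaults, only the 2 s base comes from the plan):

```csharp
using System;

public static class ProbeBackoff
{
    // Delay before the next probe: PeerProbeInterval (2 s default) doubled per
    // consecutive failure, capped so a recovering peer is re-detected promptly.
    public static TimeSpan NextDelay(int consecutiveFailures,
                                     double baseSeconds = 2.0, double maxSeconds = 30.0)
    {
        if (consecutiveFailures <= 0)
            return TimeSpan.FromSeconds(baseSeconds);

        double delay = baseSeconds * Math.Pow(2, Math.Min(consecutiveFailures, 16));
        return TimeSpan.FromSeconds(Math.Min(delay, maxSeconds));
    }
}
```

The PeerProbeLoop then only has to feed its failure counter in and await the returned delay, which pairs cleanly with the injected IClock from B.4.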

Stream C — OPC UA node wiring (3 days)

  1. C.1 ServiceLevel variable node created under ServerStatus at server startup. Type Byte, AccessLevel = CurrentRead only. Subscribe to ServiceLevelCalculator observable; push updates via DataChangeNotification.
  2. C.2 ServerUriArray variable node under ServerCapabilities. Array of String, length = peer count. Updates on topology change.
  3. C.3 RedundancySupport variable — static at startup from RedundancyMode. Values: None, Cold, Warm, WarmActive, Hot. Phase 6.3 supports everything except Transparent + HotAndMirrored.
  4. C.4 Test against the Client.CLI: connect to primary, read ServiceLevel → expect 255; pause primary apply → expect 200; fail primary → client sees Bad_ServerNotConnected + reconnects to peer at 100.
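
C.2 as drafted advertises peers only; review finding 6 (appended below) amends that to self + peers in a stable order. A sketch of the amended array construction — `ServerUriArrayBuilder` is a hypothetical helper name:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ServerUriArrayBuilder
{
    // Self first, then peers in ordinal order — a stable, deterministic layout
    // so compliance tests can assert exact membership and position.
    public static string[] Build(string selfApplicationUri,
                                 IEnumerable<string> peerApplicationUris)
        => new[] { selfApplicationUri }
           .Concat(peerApplicationUris.OrderBy(u => u, StringComparer.Ordinal))
           .ToArray();
}
```

With at most two nodes per cluster (decision #83) the array is [self, peer], but building it generically keeps the topology-change refresh path identical for the degenerate one-node case.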

Stream D — Apply-window integration (2 days)

  1. D.1 sp_PublishGeneration caller wraps the apply in using (coordinator.BeginApplyWindow()). BeginApplyWindow increments an in-process counter; ServiceLevel drops on first increment. Dispose decrements.
  2. D.2 Nested applies are handled by the counter (rare, but Ignition and Kepware clients have both been observed firing rapid-succession draft publishes).
  3. D.3 Test: mid-apply subscribe on primary; assert the subscribing client sees the ServiceLevel drop immediately after the apply starts, then restore after apply completes.
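
The D.1/D.2 counter-with-IDisposable pattern could look like the sketch below — `ApplyWindow` and its `Changed` event are assumed names, and the event is where the ServiceLevel calculator would be notified to drop to the mid-apply band:

```csharp
using System;
using System.Threading;

public sealed class ApplyWindow
{
    private int _depth;

    public bool ApplyInProgress => Volatile.Read(ref _depth) > 0;

    // Fires true on the first Begin and false when the last scope is disposed,
    // so the ServiceLevel calculator can drop/restore the mid-apply band.
    public event Action<bool>? Changed;

    public IDisposable BeginApplyWindow()
    {
        if (Interlocked.Increment(ref _depth) == 1)
            Changed?.Invoke(true);
        return new Scope(this);
    }

    private void End()
    {
        if (Interlocked.Decrement(ref _depth) == 0)
            Changed?.Invoke(false);
    }

    private sealed class Scope : IDisposable
    {
        private ApplyWindow? _owner;
        public Scope(ApplyWindow owner) => _owner = owner;

        // Idempotent: a double Dispose must not unbalance the counter.
        public void Dispose() => Interlocked.Exchange(ref _owner, null)?.End();
    }
}
```

The `using (coordinator.BeginApplyWindow())` call site then guarantees the paired decrement even on the exception path flagged in the risk table.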

Stream E — Admin UI + metrics (3 days)

  1. E.1 RedundancyTab.razor under /cluster/{id}/redundancy. Shows each node's role, current ServiceLevel, peer reachability, last apply timestamp. Role-swap button posts a draft edit on ClusterNode.RedundancyRole; publish applies.
  2. E.2 OpenTelemetry meter export: three gauges per the §Scope metrics. Sink via Phase 6.1 observability.
  3. E.3 SignalR push: FleetStatusHub broadcasts ServiceLevel changes so the Admin UI updates within ~1 s of the coordinator observing a peer flip.
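
E.2's gauges can be registered as observable instruments over coordinator state. A sketch using the standard System.Diagnostics.Metrics API — `RedundancyMetrics`, the meter name, and the `Func` value hooks are assumptions; only the gauge names come from the §Scope table (the peer gauge is omitted here for brevity):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public sealed class RedundancyMetrics
{
    private readonly Meter _meter = new("OtOpcUa.Server.Redundancy");

    // Value providers are assumed hooks into the RedundancyCoordinator;
    // observable gauges are sampled lazily on each OTel collection pass.
    public RedundancyMetrics(Func<byte> serviceLevel, Func<bool> applyInProgress,
                             string cluster, string node)
    {
        var tags = new KeyValuePair<string, object?>[]
        {
            new("cluster", cluster),
            new("node", node),
        };

        _meter.CreateObservableGauge("ot_opcua_service_level",
            () => new Measurement<int>(serviceLevel(), tags));
        _meter.CreateObservableGauge("ot_opcua_apply_in_progress",
            () => new Measurement<int>(applyInProgress() ? 1 : 0, tags));
    }
}
```

Observable gauges avoid pushing on every coordinator transition; the exporter pulls the current value on its own schedule via the Phase 6.1 observability layer.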

Compliance Checks (run at exit gate)

  • Primary-healthy ServiceLevel = 255.
  • Backup-healthy ServiceLevel = 100.
  • Mid-apply Primary ServiceLevel = 200 — verified via Client.CLI subscription polling ServiceLevel during a forced draft publish.
  • Peer-unreachable handling: when a Primary can't probe its Backup's /healthz, Primary still serves at 255 (peer is the one with the problem). When a Backup can't probe Primary, Backup flips to 200 (per decision #81 — a lonely Backup promotes its advertised level to signal "I'll take over if you ask" without auto-promoting).
  • Role transition via operator publish: FleetAdmin swaps RedundancyRole rows in a draft, publishes; both nodes re-read topology on publish confirmation and flip ServiceLevel accordingly — no restart needed.
  • ServerUriArray returns exactly the peer node's ApplicationUri.
  • Client.CLI cutover: with a primary deliberately halted, a client that was connected to primary reconnects to the backup within the ServiceLevel-polling interval.
  • No regression in existing driver test suites; no regression in /healthz reachability under redundancy load.

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Split-brain from operator race (both nodes marked Primary) | Low | High | Coordinator rejects startup if its cluster has >1 Primary row; logs + fails fast. Document as a publish-time validation in sp_PublishGeneration. |
| ServiceLevel thrashing on flaky peer | Medium | Medium | 2 s probe interval + 3-sample smoothing window; a peer is declared unreachable only after 3 consecutive failed probes. |
| Client ignores ServiceLevel and stays on broken primary | Medium | Medium | Documented in docs/Redundancy.md — non-transparent redundancy requires client cooperation; most SCADA clients (Ignition, Kepware, Aveva OI Gateway) honor it. Unit-test the advertised values; field behavior is the client's responsibility. |
| Apply-window counter leaks on exception | Low | High | BeginApplyWindow returns IDisposable; using syntax enforces the paired decrement; unit test the exception-in-apply path. |
| HttpClient probe leaks sockets | Low | Medium | Single shared HttpClient per coordinator (not per probe); tight timeouts avoid keeping connections open during peer downtime. |
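
The 3-sample smoothing mitigation is a small state machine worth pinning down: failures accumulate, any success resets. A sketch under an assumed name (`ReachabilitySmoother`); the asymmetry — instant recovery, delayed declaration of failure — is a design choice that biases against thrashing:

```csharp
public sealed class ReachabilitySmoother
{
    private readonly int _threshold;
    private int _consecutiveFailures;

    public bool PeerReachable { get; private set; } = true;

    public ReachabilitySmoother(int threshold = 3) => _threshold = threshold;

    // Feed one probe outcome; a single success restores reachability immediately,
    // but it takes `threshold` consecutive failures to declare the peer down.
    public bool Observe(bool probeSucceeded)
    {
        if (probeSucceeded)
        {
            _consecutiveFailures = 0;
            PeerReachable = true;
        }
        else if (++_consecutiveFailures >= _threshold)
        {
            PeerReachable = false;
        }
        return PeerReachable;
    }
}
```

At the 2 s probe interval this means a flapping peer costs at most one ServiceLevel transition per ~6 s window instead of one per probe.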

Completion Checklist

  • Stream A: topology loader + tests
  • Stream B: peer probe + ServiceLevel calculator + 32-case matrix tests
  • Stream C: ServiceLevel / ServerUriArray / RedundancySupport node wiring + Client.CLI smoke test
  • Stream D: apply-window integration + nested-apply counter
  • Stream E: Admin RedundancyTab + OpenTelemetry metrics + SignalR push
  • phase-6-3-compliance.ps1 exits 0; exit-gate doc; docs/Redundancy.md updated with the ServiceLevel matrix

Adversarial Review — 2026-04-19 (Codex, thread 019da490-3fa0-7340-98b8-cceeca802550)

  1. Crit · ACCEPT — No publish-generation fencing: a split publish can leave both nodes advertising as authoritative. Change: coordinator CAS on a monotonic ConfigGenerationId; every topology decision is generation-stamped; peers reject state propagated from a lower generation.
  2. Crit · ACCEPT — >1 Primary at startup is covered, but runtime containment is missing when an invalid topology appears later (mid-apply race). Change: add a runtime InvalidTopology state — both nodes self-demote to ServiceLevel 2 (the "detected inconsistency" band, below normal operation) until convergence.
  3. High · ACCEPT — 0 = Faulted collides with OPC UA Part 5 §6.3.34 semantics, where 0 means Maintenance and 1 means NoData. Change: reserve 0 for operator-declared maintenance mode only; Faulted/unreachable uses 1 (NoData); in-range degraded states occupy 2..199.
  4. High · ACCEPT — Matrix collapses distinct operational states onto the same value. Change: matrix expanded to Authoritative-Primary=255, Isolated-Primary=230 (peer unreachable — still serving), Primary-Mid-Apply=200, Recovering-Primary=180, Authoritative-Backup=100, Isolated-Backup=80 (primary unreachable — "take over if asked"), Backup-Mid-Apply=50, Recovering-Backup=30.
  5. High · ACCEPT — /healthz from 6.1 is HTTP-healthy but doesn't guarantee the OPC UA data plane. Change: add a redundancy-specific probe, UaHealthProbe — issues a ReadAsync(ServiceLevel) against the peer's OPC UA endpoint via a lightweight client session. /healthz remains the fast-fail; the UA probe is the authority signal.
  6. High · ACCEPT — ServerUriArray must include self + peers, not peers only. Change: the array contains [self.ApplicationUri, peer.ApplicationUri] in stable deterministic ordering; a compliance test asserts local-plus-peer membership.
  7. Med · ACCEPT — No Faulted → Recovering → Healthy path. Change: add Recovering state with min dwell time (60 s default) + positive publish witness (one successful Read on a reference node) before returning to Healthy. Thrash-prevention.
  8. Med · ACCEPT — Topology change during in-flight probe undefined. Change: every probe task tagged with ConfigGenerationId at dispatch; obsolete results discarded; in-flight probes cancelled on topology reload.
  9. Med · ACCEPT — Apply-window counter race on exception/cancellation/async ownership. Change: apply-window is a named lease keyed to (ConfigGenerationId, PublishRequestId) with disposal enforced via await using; watchdog detects leased-but-abandoned and force-closes after ApplyMaxDuration (default 10 min).
  10. High · ACCEPT — Ignition + Kepware + Aveva OI Gateway ServiceLevel compliance is unverified. Change: risk elevated to High; add Stream F (new) — build an interop matrix: validate against Ignition 8.1/8.3, Kepware KEPServerEX 6.x, Aveva OI Gateway 2020R2 + 2023R1. Document per-client cutover behavior. Field deployments get a documented compatibility table; clients that ignore ServiceLevel documented as requiring explicit backup-endpoint config.
  11. Med · ACCEPT — Galaxy MXAccess re-session on Primary death not in acceptance. Change: Stream F adds an end-to-end failover smoke test that boots Galaxy.Proxy on both nodes, kills Primary, asserts Galaxy consumer reconnects to Backup within (SessionTimeout + KeepAliveInterval × 3) budget. docs/Redundancy.md updated with required session timeouts.
  12. Med · ACCEPT — Transparent-mode startup rejection is outage-prone. Change: sp_PublishGeneration validates RedundancyMode pre-publish — unsupported values reject the publish attempt with a clear validation error; runtime never sees an unsupported mode. Last-good config stays active.
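
Finding 9's named-lease-with-watchdog replacement for the bare counter could be sketched as below. `ApplyLeaseRegistry`, `Acquire`, and `SweepAbandoned` are hypothetical names; the `(ConfigGenerationId, PublishRequestId)` key and the 10-minute `ApplyMaxDuration` default come from the finding itself:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class ApplyLeaseRegistry
{
    private readonly ConcurrentDictionary<(long Gen, Guid Publish), DateTimeOffset> _open = new();

    public TimeSpan ApplyMaxDuration { get; init; } = TimeSpan.FromMinutes(10);

    public bool ApplyInProgress => !_open.IsEmpty;

    // Call site: await using var lease = registry.Acquire(gen, publishRequestId);
    public IAsyncDisposable Acquire(long configGenerationId, Guid publishRequestId)
    {
        _open[(configGenerationId, publishRequestId)] = DateTimeOffset.UtcNow;
        return new Lease(this, configGenerationId, publishRequestId);
    }

    // Watchdog sweep: force-close leases held longer than ApplyMaxDuration
    // (abandoned by a crashed or cancelled apply); returns how many were closed.
    public int SweepAbandoned(DateTimeOffset now)
    {
        int closed = 0;
        foreach (var kv in _open)
            if (now - kv.Value > ApplyMaxDuration && _open.TryRemove(kv.Key, out _))
                closed++;
        return closed;
    }

    private sealed class Lease : IAsyncDisposable
    {
        private readonly ApplyLeaseRegistry _owner;
        private readonly (long, Guid) _key;

        public Lease(ApplyLeaseRegistry owner, long gen, Guid publish)
        {
            _owner = owner;
            _key = (gen, publish);
        }

        public ValueTask DisposeAsync()
        {
            _owner._open.TryRemove(_key, out _);
            return ValueTask.CompletedTask;
        }
    }
}
```

Unlike the depth counter, each lease is individually identifiable, so the watchdog can close exactly the abandoned apply without disturbing a concurrent legitimate one.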