lmxopcua/docs/v2/implementation/phase-6-3-redundancy-runtime.md
Joseph Doherty ba31f200f6 Phase 6 reconcile — merge adjustments into plan bodies, add decisions #143-162, scaffold compliance stubs
After shipping the four Phase 6 plan drafts (PRs 77-80), the adversarial-review
adjustments lived only as trailing "Review" sections. An implementer reading
Stream A would find the original unadjusted guidance, then have to cross-reference
the review to reconcile. This PR makes the plans genuinely executable:

1. Merges every ACCEPTed review finding into the actual Scope / Stream / Compliance
   sections of each phase plan:
   - phase-6-1: Scope table rewrite (per-capability retry, (instance,host) pipeline key,
     MemoryTracking vs MemoryRecycle split, hybrid watchdog formula, demand-aware
     wedge detector, generation-sealed LiteDB). Streams A/B/D + Compliance rewritten.
   - phase-6-2: AuthorizationDecision tri-state, control/data-plane separation,
     MembershipFreshnessInterval (15 min), AuthCacheMaxStaleness (5 min),
     subscription stamp-and-reevaluate. Stream C widened to 11 OPC UA operations.
   - phase-6-3: 8-state ServiceLevel matrix (OPC UA Part 5 §6.3.34-compliant),
     two-layer peer probe (/healthz + UaHealthProbe), apply-lease via await using,
     publish-generation fencing, InvalidTopology runtime state, ServerUriArray
     self-first + peers. New Stream F (interop matrix + Galaxy failover).
   - phase-6-4: DraftRevisionToken concurrency control, staged-import via
     EquipmentImportBatch with user-scoped visibility, CSV header version marker,
     decision-#117-aligned identifier columns, 1000-row diff cap,
     decision-#139 OPC 40010 fields, Identification inherits Equipment ACL.

2. Appends decisions #143 through #162 to docs/v2/plan.md capturing the
   architectural commitments the adjustments created. Each decision carries its
   dated rationale so future readers know why the choice was made.

3. Scaffolds scripts/compliance/phase-6-{1,2,3,4}-compliance.ps1 — PowerShell
   stubs with Assert-Todo / Assert-Pass / Assert-Fail helpers. Every check
   maps to a Stream task ID from the corresponding phase plan. Currently all
   checks are TODO and scripts exit 0; each implementation task is responsible
   for replacing its TODO with a real check before closing that task. Saved
   as UTF-8 with BOM so Windows PowerShell 5.1 parses em-dash characters
   without breaking.

Net result: the Phase 6.1 plan is genuinely ready to execute. Stream A.3 can
start tomorrow without reconciling Streams vs. Review on every task; the
compliance script is wired to the Stream IDs; plan.md has the architectural
commitments that justify the Stream choices.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 03:49:41 -04:00


Phase 6.3 — Redundancy Runtime

Status: DRAFT — CLAUDE.md + docs/Redundancy.md describe a non-transparent warm/hot redundancy model with unique ApplicationUris, RedundancySupport advertisement, ServerUriArray, and dynamic ServiceLevel. Entities (ServerCluster, ClusterNode, RedundancyRole, RedundancyMode) exist; the runtime behavior (actual ServiceLevel number computation, mid-apply dip, ServerUriArray broadcast) is not wired.

Branch: v2/phase-6-3-redundancy-runtime
Estimated duration: 2 weeks
Predecessor: Phase 6.2 (Authorization) — reuses the Phase 6.1 health endpoints for cluster-peer probing
Successor: Phase 6.4 (Admin UI completion)

Phase Objective

Land the non-transparent redundancy protocol end-to-end: two OtOpcUa.Server instances in a ServerCluster each expose a live ServiceLevel node whose value reflects that instance's suitability to serve traffic, advertise each other via ServerUriArray, and transition role (Primary ↔ Backup) based on health + operator intent.

Closes these gaps:

  1. Dynamic ServiceLevel — OPC UA Part 5 §6.3.34 specifies a Byte (0..255) that clients poll to pick the healthiest server. Our server publishes it as a static value today.
  2. ServerUriArray broadcast — Part 4 specifies that every node in a redundant pair should advertise its peers' ApplicationUris. Currently advertises only its own.
  3. Primary / Backup role coordination — entities carry RedundancyRole but the runtime doesn't read it; no peer health probing; no role-transfer on primary failure.
  4. Mid-apply dip — decision-level expectation that a server mid-generation-apply should report a lower ServiceLevel so clients cut over to the peer during the apply window. Not implemented.

Scope — What Changes

| Concern | Change |
| --- | --- |
| OtOpcUa.Server → new Server.Redundancy sub-namespace | RedundancyCoordinator singleton. Resolves the current node's ClusterNode row at startup, loads peers, runs a two-layer peer health probe: (a) /healthz every 2 s as the fast-fail (inherits Phase 6.1 semantics — HTTP + DB/cache healthy); (b) UaHealthProbe every 10 s — opens a lightweight OPC UA client session to the peer, reads its ServiceLevel node, and verifies the endpoint serves data. Authority decisions use UaHealthProbe; /healthz is used only to avoid wasting UA probes when the peer is obviously down. |
| Publish-generation fencing | Topology + role decisions are stamped with a monotonic ConfigGenerationId from the shared config DB. The coordinator re-reads topology via CAS on (ClusterId, ExpectedGeneration) → new row; peers reject state propagated from a lower generation. Prevents split-publish races. |
| InvalidTopology runtime state | If both nodes detect >1 Primary AFTER startup (config-DB drift during a publish), both self-demote to ServiceLevel 2 until convergence. Neither node serves authoritatively; clients pick the healthier alternative or reconnect later. |
| OPC UA server root | ServiceLevel variable node becomes a BaseDataVariable whose value updates on RedundancyCoordinator state change. ServerUriArray array variable includes self + peers in stable deterministic ordering (decision per OPC UA Part 4 §6.6.2.2). RedundancySupport stays static (set from RedundancyMode at startup); Transparent mode is validated pre-publish, not rejected at startup. |
| RedundancyCoordinator computation | 8-state ServiceLevel matrix — avoids the OPC UA Part 5 §6.3.34 collision (0=Maintenance, 1=NoData). Operator-declared maintenance only = 0. Unreachable / Faulted = 1. In-range operational states occupy 2..255: Authoritative-Primary = 255; Isolated-Primary (peer unreachable, self serving) = 230; Primary-Mid-Apply = 200; Recovering-Primary (post-fault, dwell not met) = 180; Authoritative-Backup = 100; Isolated-Backup (primary unreachable, "take over if asked") = 80; Backup-Mid-Apply = 50; Recovering-Backup = 30; InvalidTopology (runtime detects >1 Primary) = 2 (detected-inconsistency band — below normal operation). Full matrix documented in the docs/Redundancy.md update. |
| Role transition | Split-brain avoidance: role is declared in the shared config DB (ClusterNode.RedundancyRole), not elected at runtime. An operator flips the row (or a failover script does). The coordinator only reads; it never writes. |
| sp_PublishGeneration hook | Uses named apply leases keyed to (ConfigGenerationId, PublishRequestId): await using var lease = coordinator.BeginApplyLease(...). Disposal on any exit path (success, exception, cancellation) releases the lease. A watchdog auto-closes any lease older than ApplyMaxDuration (default 10 min) → ServiceLevel can't stick at mid-apply. The pre-publish validator rejects unsupported RedundancyMode values (e.g. Transparent) with a clear error so the runtime never sees an invalid state. |
| Admin UI /cluster/{id} page | New RedundancyTab.razor — shows the current node's role + ServiceLevel + peer reachability. FleetAdmin can trigger a role-swap by editing ClusterNode.RedundancyRole + publishing a draft. |
| Metrics | New OpenTelemetry metrics: ot_opcua_service_level{cluster,node}, ot_opcua_peer_reachable{cluster,node,peer}, ot_opcua_apply_in_progress{cluster,node}. Sink via the Phase 6.1 observability layer. |

Scope — What Does NOT Change

| Item | Reason |
| --- | --- |
| OPC UA authn / authz | Phases 6.2 + prior. Redundancy is orthogonal. |
| Driver layer | Drivers aren't redundancy-aware; they run on each node independently against the same equipment. The server layer handles the ServiceLevel story. |
| Automatic failover / election | Explicitly out of scope. Non-transparent = the client picks which server to use via ServiceLevel + ServerUriArray. We do NOT ship consensus, leader election, or automatic promotion. Operator-driven failover is the v2.0 model per decision #7985. |
| Transparent redundancy (RedundancySupport=Transparent) | Not supported. Publishing RedundancyMode=Transparent is rejected by the pre-publish validator with a clear error; the runtime never sees it. |
| Historian redundancy | Galaxy Historian's own redundancy (two historians on two CPUs) is out of scope. The Galaxy driver talks to whichever historian is reachable from its node. |

Entry Gate Checklist

  • Phase 6.1 merged (uses /healthz for peer probing)
  • CLAUDE.md §Redundancy + docs/Redundancy.md re-read
  • Decisions #7985 re-skimmed
  • ServerCluster/ClusterNode/RedundancyRole/RedundancyMode entities + existing migration reviewed
  • OPC UA Part 4 §Redundancy + Part 5 §6.3.34 (ServiceLevel) re-skimmed
  • Dev box has two OtOpcUa.Server instances configured against the same cluster — one designated Primary, one Backup — for integration testing

Task Breakdown

Stream A — Cluster topology loader (3 days)

  1. A.1 RedundancyCoordinator startup path: reads ClusterNode row for the current node (identified by appsettings.json Cluster:NodeId), reads the cluster's peer list, validates invariants (no duplicate ApplicationUri, at most one Primary per cluster if RedundancyMode.WarmActive, at most two nodes total in v2.0 per decision #83).
  2. A.2 Topology subscription — coordinator re-reads on sp_PublishGeneration confirmation so an operator role-swap takes effect after publish (no process restart needed).
  3. A.3 Tests: two-node cluster seed, one-node cluster seed (degenerate), duplicate-uri rejection.
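The A.1 invariants can be expressed compactly. A hypothetical Python sketch (the shipped code is C#; `validate_topology` and its dict shape are assumptions) covering the three checks — duplicate ApplicationUri, at most one Primary under WarmActive, at most two nodes per decision #83:

```python
# Startup topology invariants from A.1, as a pure validation function.
# Returns a list of human-readable errors; empty list means the topology
# passes. Names and shapes here are illustrative only.
from collections import Counter


def validate_topology(nodes: list[dict], mode: str) -> list[str]:
    """nodes: [{'application_uri': str, 'role': 'Primary' | 'Backup'}]"""
    errors = []
    if len(nodes) > 2:                       # at most two nodes in v2.0
        errors.append("more than two nodes in cluster")
    uri_counts = Counter(n["application_uri"] for n in nodes)
    for uri, count in uri_counts.items():
        if count > 1:
            errors.append(f"duplicate ApplicationUri: {uri}")
    if mode == "WarmActive":
        primaries = sum(1 for n in nodes if n["role"] == "Primary")
        if primaries > 1:
            errors.append("more than one Primary in WarmActive cluster")
    return errors
```

Keeping validation pure (no I/O) is what makes the A.3 seed tests cheap: each seed is just a list of dicts.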

Stream B — Peer health probing + ServiceLevel computation (6 days, widened)

  1. B.1 PeerHttpProbeLoop per peer at 2 s — calls peer's /healthz, 1 s timeout, exponential backoff on sustained failure. Used as fast-fail.
  2. B.2 PeerUaProbeLoop per peer at 10 s — opens an OPC UA client session to the peer (reuses Phase 5 Driver.OpcUaClient stack), reads peer's ServiceLevel node + verifies endpoint serves data. Short-circuit: if HTTP probe is failing, skip UA probe (no wasted sessions).
  3. B.3 ServiceLevelCalculator.Compute(role, selfHealth, peerHttpHealthy, peerUaHealthy, applyInProgress, recoveryDwellMet, topologyValid) → byte. 8-state matrix per §Scope. topologyValid=false forces InvalidTopology = 2 regardless of other inputs.
  4. B.4 RecoveryStateManager: after a Faulted → Healthy transition, hold the node in the Recovering band (180 Primary / 30 Backup) for RecoveryDwellTime (default 60 s) AND require one positive publish witness (successful Read on a reference node) before entering the Authoritative band.
  5. B.5 Calculator reacts to inputs via IObserver so changes immediately push to the OPC UA ServiceLevel node.
  6. B.6 Tests: 64-case matrix covering role × self-health × peer-http × peer-ua × apply × recovery × topology. Specific cases flagged: Primary-with-unreachable-peer-serves-at-230 (authority retained); Backup-with-unreachable-primary-escalates-to-80 (not auto-promote); InvalidTopology demotes both nodes; Recovering dwell + publish-witness blocks premature return to 255.
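The 8-state matrix in B.3 can be sketched as a single pure function. This is a Python illustration of the decision order, not the C# `ServiceLevelCalculator`; it collapses `peerHttpHealthy`/`peerUaHealthy` into one `peer_reachable` input, and the precedence of Mid-Apply over Recovering is an assumption — the plan only pins "apply dominates peer-unreachable" (C.4):

```python
# Hedged sketch of ServiceLevelCalculator.Compute. Checks run in strict
# precedence: maintenance, then self-fault, then topology, then the
# role-specific band. Values follow the 8-state matrix in the Scope table.
def compute_service_level(role: str, self_healthy: bool, peer_reachable: bool,
                          apply_in_progress: bool, recovery_dwell_met: bool,
                          topology_valid: bool, maintenance: bool = False) -> int:
    if maintenance:
        return 0                     # OPC UA Part 5 §6.3.34: 0 = Maintenance
    if not self_healthy:
        return 1                     # 1 = NoData (faulted / unreachable)
    if not topology_valid:
        return 2                     # InvalidTopology: self-demote
    if role == "Primary":
        if apply_in_progress:
            return 200               # apply dominates peer-unreachable (C.4)
        if not recovery_dwell_met:
            return 180               # Recovering-Primary
        return 255 if peer_reachable else 230   # Authoritative / Isolated
    else:                            # Backup
        if apply_in_progress:
            return 50
        if not recovery_dwell_met:
            return 30
        return 100 if peer_reachable else 80    # "take over if asked"
```

A pure function like this is exactly what the B.6 matrix tests want: every case is one call with seven inputs and an expected byte.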

Stream C — OPC UA node wiring (3 days)

  1. C.1 ServiceLevel variable node created under ServerStatus at server startup. Type Byte, AccessLevel = CurrentRead only. Subscribe to ServiceLevelCalculator observable; push updates via DataChangeNotification.
  2. C.2 ServerUriArray variable node under ServerCapabilities. Array of String, includes self + peers with deterministic ordering (self first). Updates on topology change. Compliance test asserts local-plus-peer membership.
  3. C.3 RedundancySupport variable — static at startup from RedundancyMode. Values: None, Cold, Warm, WarmActive, Hot. Unsupported values (Transparent, HotAndMirrored) are rejected pre-publish by validator — runtime never sees them.
  4. C.4 Client.CLI cutover test: connect to the primary, read ServiceLevel → 255; start an apply on the primary → 200; peer unreachable while the apply is in progress → still 200 (apply dominates peer-unreachable per the matrix); client sees the peer via ServerUriArray; fail the primary → client reconnects to the peer at 80 (isolated-backup band).
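C.2's ordering rule is small enough to state as code. A minimal sketch — the plan only requires "self first" plus a stable deterministic order, so sorting the peers lexically is an assumption here:

```python
# Deterministic ServerUriArray ordering: self first, then peers in a
# stable (here: lexical) order, so every rebuild yields the same array.
def build_server_uri_array(self_uri: str, peer_uris: list[str]) -> list[str]:
    return [self_uri] + sorted(peer_uris)
```

Determinism matters because the compliance test asserts exact local-plus-peer membership, and clients may cache the array between topology changes.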

Stream D — Apply-window integration (3 days)

  1. D.1 sp_PublishGeneration caller wraps the apply in await using var lease = coordinator.BeginApplyLease(generationId, publishRequestId). The lease is keyed to (ConfigGenerationId, PublishRequestId) so concurrent publishes stay isolated. Disposal releases the lease on every exit path.
  2. D.2 ApplyLeaseWatchdog auto-closes leases older than ApplyMaxDuration (default 10 min) so a crashed publisher can't pin the node at mid-apply.
  3. D.3 Pre-publish validator in sp_PublishGeneration rejects unsupported RedundancyMode values (Transparent, HotAndMirrored) with a clear error message — runtime never sees an invalid mode.
  4. D.4 Tests: (a) mid-apply client subscribes → sees ServiceLevel drop → sees restore; (b) lease leak via ThreadAbort / cancellation → watchdog closes; (c) publish rejected for Transparent → operator-actionable error.
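The lease + watchdog mechanics of D.1/D.2 can be mirrored in Python with an async context manager (the analogue of C#'s `await using`). `ApplyLeases`, `begin`, and `watchdog_sweep` are illustrative names, not the real API:

```python
# Sketch of the apply lease: `begin` always releases on success, exception,
# or cancellation; `watchdog_sweep` force-closes leases older than
# ApplyMaxDuration so a crashed publisher can't pin the mid-apply band.
import asyncio
import time
from contextlib import asynccontextmanager


class ApplyLeases:
    def __init__(self, max_duration_s: float = 600.0):  # ApplyMaxDuration
        self.max_duration_s = max_duration_s
        self._open: dict[tuple[int, str], float] = {}   # lease key -> opened-at

    @asynccontextmanager
    async def begin(self, generation_id: int, publish_request_id: str):
        key = (generation_id, publish_request_id)
        self._open[key] = time.monotonic()
        try:
            yield key
        finally:
            self._open.pop(key, None)   # runs on every exit path

    def watchdog_sweep(self) -> list[tuple[int, str]]:
        now = time.monotonic()
        stale = [k for k, opened in self._open.items()
                 if now - opened > self.max_duration_s]
        for key in stale:
            self._open.pop(key, None)   # ServiceLevel can leave mid-apply
        return stale

    @property
    def apply_in_progress(self) -> bool:
        return bool(self._open)
```

Keying leases by (generation, request) rather than a bare counter is what makes the D.4(b) leak test tractable: the watchdog can name exactly which lease it closed.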

Stream E — Admin UI + metrics (3 days)

  1. E.1 RedundancyTab.razor under /cluster/{id}/redundancy. Shows each node's role, current ServiceLevel (with band label per 8-state matrix), peer reachability (HTTP + UA probe separately), last apply timestamp. Role-swap button posts a draft edit on ClusterNode.RedundancyRole; publish applies.
  2. E.2 OpenTelemetry meter export: ot_opcua_service_level{cluster,node} gauge + ot_opcua_peer_reachable{cluster,node,peer,kind=http|ua} + ot_opcua_apply_in_progress{cluster,node} + ot_opcua_topology_valid{cluster}. Sink via Phase 6.1 observability.
  3. E.3 SignalR push: FleetStatusHub broadcasts ServiceLevel changes so the Admin UI updates within ~1 s of the coordinator observing a peer flip.

Stream F — Client-interoperability matrix (3 days, new)

  1. F.1 Validate ServiceLevel-driven cutover against Ignition 8.1 + 8.3, Kepware KEPServerEX 6.x, Aveva OI Gateway 2020R2 + 2023R1. For each: configure the client with both endpoints, verify it honors ServiceLevel + ServerUriArray during primary failover.
  2. F.2 Clients that don't honour the standards (doc field — may include Kepware and OI Gateway per Codex review) get an explicit compatibility-matrix entry: "requires manual backup-endpoint config / vendor-specific redundancy primitives". Documented in docs/Redundancy.md.
  3. F.3 Galaxy MXAccess failover test — boot Galaxy.Proxy on both nodes, kill Primary, assert Galaxy consumer reconnects to Backup within (SessionTimeout + KeepAliveInterval × 3). Document required session-timeout config in docs/Redundancy.md.
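The F.3 budget is plain arithmetic; spelling it out keeps the test assertion honest. A trivial sketch (function name assumed):

```python
# F.3 reconnect budget: the Galaxy consumer must be back on the Backup
# within SessionTimeout + KeepAliveInterval * 3.
def reconnect_budget_s(session_timeout_s: float, keep_alive_interval_s: float) -> float:
    return session_timeout_s + keep_alive_interval_s * 3
```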

Compliance Checks (run at exit gate)

  • OPC UA band compliance: 0=Maintenance reserved, 1=NoData reserved. Operational states in 2..255 per 8-state matrix.
  • Authoritative-Primary ServiceLevel = 255.
  • Isolated-Primary (peer unreachable, self serving) = 230 — Primary retains authority.
  • Primary-Mid-Apply = 200.
  • Recovering-Primary = 180 with dwell + publish witness enforced.
  • Authoritative-Backup = 100.
  • Isolated-Backup (primary unreachable) = 80 — does NOT auto-promote.
  • InvalidTopology = 2 — both nodes self-demote when >1 Primary is detected at runtime.
  • ServerUriArray returns self + peer URIs, self first.
  • UaHealthProbe authority: integration test — peer returns HTTP 200 but OPC UA endpoint unreachable → coordinator treats peer as UA-unhealthy; peer is not a valid authority source.
  • Apply-lease disposal: leases close on exception, cancellation, and watchdog timeout; ServiceLevel never sticks at mid-apply band.
  • Transparent-mode rejection: attempting to publish RedundancyMode=Transparent is blocked at sp_PublishGeneration; runtime never sees an invalid mode.
  • Role transition via operator publish: FleetAdmin swaps RedundancyRole in a draft, publishes; both nodes re-read topology on publish confirmation + flip ServiceLevel — no restart.
  • Client.CLI cutover: with primary halted, Client.CLI that was connected to primary sees primary drop + reconnects to backup via ServerUriArray.
  • Client interoperability matrix (Stream F): Ignition 8.1 + 8.3 honour ServiceLevel; Kepware + Aveva OI Gateway findings documented.
  • Galaxy MXAccess failover: end-to-end test — primary kill → Galaxy consumer reconnects to backup within session-timeout budget.
  • No regression in existing driver test suites; no regression in /healthz reachability under redundancy load.

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Split-brain from operator race (both nodes marked Primary) | Low | High | Coordinator rejects startup if its cluster has >1 Primary row; logs + fails fast. sp_PublishGeneration validates at publish time; drift detected after startup triggers InvalidTopology self-demotion to ServiceLevel 2. |
| ServiceLevel thrashing on flaky peer | Medium | Medium | 2 s probe interval + 3-sample smoothing window; only declares a peer unreachable after 3 consecutive failed probes. |
| Client ignores ServiceLevel and stays on broken primary | Medium | Medium | Documented in docs/Redundancy.md — non-transparent redundancy requires client cooperation. Stream F builds the interop matrix (Ignition, Kepware, Aveva OI Gateway); clients that ignore ServiceLevel get a documented backup-endpoint config. Unit-test the advertised values; field behavior is client responsibility. |
| Apply lease leaks on exception | Low | High | BeginApplyLease returns an async-disposable lease; await using enforces paired release on every exit path; ApplyLeaseWatchdog force-closes leases older than ApplyMaxDuration; unit test for the exception-in-apply path. |
| HttpClient probe leaks sockets | Low | Medium | Single shared HttpClient per coordinator (not per probe); tight timeouts avoid keeping connections open during peer downtime. |
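The thrash-prevention rule ("only declare a peer unreachable after 3 consecutive failed probes") is easy to get subtly wrong, so here is a minimal sketch. `PeerDebounce` is a hypothetical helper, not the C# probe loop:

```python
# Debounce for peer reachability: unreachable only after `threshold`
# consecutive failures; a single success resets the verdict to reachable.
class PeerDebounce:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self._failures = 0

    def observe(self, probe_ok: bool) -> bool:
        """Feed one probe result; return the current reachability verdict."""
        self._failures = 0 if probe_ok else self._failures + 1
        return self._failures < self.threshold
```

Note the asymmetry: failure accumulates, success resets immediately. That keeps the Primary's demotion sticky under flapping while letting recovery propagate in one probe cycle.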

Completion Checklist

  • Stream A: topology loader + tests
  • Stream B: peer probe + ServiceLevel calculator + 64-case matrix tests
  • Stream C: ServiceLevel / ServerUriArray / RedundancySupport node wiring + Client.CLI smoke test
  • Stream D: apply-lease integration + watchdog
  • Stream E: Admin RedundancyTab + OpenTelemetry metrics + SignalR push
  • Stream F: client-interop matrix (Ignition / Kepware / Aveva OI Gateway) + Galaxy MXAccess failover test
  • phase-6-3-compliance.ps1 exits 0; exit-gate doc; docs/Redundancy.md updated with the ServiceLevel matrix

Adversarial Review — 2026-04-19 (Codex, thread 019da490-3fa0-7340-98b8-cceeca802550)

  1. Crit · ACCEPT — No publish-generation fencing enables split-publish advertising both as authoritative. Change: coordinator CAS on a monotonic ConfigGenerationId; every topology decision is generation-stamped; peers reject state propagated from a lower generation.
  2. Crit · ACCEPT — >1 Primary at startup covered but runtime containment missing when invalid topology appears later (mid-apply race). Change: add runtime InvalidTopology state — both nodes self-demote to ServiceLevel 2 (the "detected inconsistency" band, below normal operation) until convergence.
  3. High · ACCEPT — 0 = Faulted collides with OPC UA Part 5 §6.3.34 semantics where 0 means Maintenance and 1 means NoData. Change: reserve 0 for operator-declared maintenance-mode only; Faulted/unreachable uses 1 (NoData); in-range degraded states occupy 2..199.
  4. High · ACCEPT — Matrix collapses distinct operational states onto the same value. Change: matrix expanded to Authoritative-Primary=255, Isolated-Primary=230 (peer unreachable — still serving), Primary-Mid-Apply=200, Recovering-Primary=180, Authoritative-Backup=100, Isolated-Backup=80 (primary unreachable — "take over if asked"), Backup-Mid-Apply=50, Recovering-Backup=30.
  5. High · ACCEPT — /healthz from 6.1 is HTTP-healthy but doesn't guarantee the OPC UA data plane. Change: add a redundancy-specific probe UaHealthProbe — issues a ReadAsync(ServiceLevel) against the peer's OPC UA endpoint via a lightweight client session. /healthz remains the fast-fail; the UA probe is the authority signal.
  6. High · ACCEPT — ServerUriArray must include self + peers, not peers only. Change: array contains [self.ApplicationUri, peer.ApplicationUri] in stable deterministic ordering; compliance test asserts local-plus-peer membership.
  7. Med · ACCEPT — No Faulted → Recovering → Healthy path. Change: add Recovering state with min dwell time (60 s default) + positive publish witness (one successful Read on a reference node) before returning to Healthy. Thrash-prevention.
  8. Med · ACCEPT — Topology change during in-flight probe undefined. Change: every probe task tagged with ConfigGenerationId at dispatch; obsolete results discarded; in-flight probes cancelled on topology reload.
  9. Med · ACCEPT — Apply-window counter race on exception/cancellation/async ownership. Change: apply-window is a named lease keyed to (ConfigGenerationId, PublishRequestId) with disposal enforced via await using; watchdog detects leased-but-abandoned and force-closes after ApplyMaxDuration (default 10 min).
  10. High · ACCEPT — Ignition + Kepware + Aveva OI Gateway ServiceLevel compliance is unverified. Change: risk elevated to High; add Stream F (new) — build an interop matrix: validate against Ignition 8.1/8.3, Kepware KEPServerEX 6.x, Aveva OI Gateway 2020R2 + 2023R1. Document per-client cutover behaviour. Field deployments get a documented compatibility table; clients that ignore ServiceLevel documented as requiring explicit backup-endpoint config.
  11. Med · ACCEPT — Galaxy MXAccess re-session on Primary death not in acceptance. Change: Stream F adds an end-to-end failover smoke test that boots Galaxy.Proxy on both nodes, kills Primary, asserts Galaxy consumer reconnects to Backup within (SessionTimeout + KeepAliveInterval × 3) budget. docs/Redundancy.md updated with required session timeouts.
  12. Med · ACCEPT — Transparent-mode startup rejection is outage-prone. Change: sp_PublishGeneration validates RedundancyMode pre-publish — unsupported values reject the publish attempt with a clear validation error; runtime never sees an unsupported mode. Last-good config stays active.