After shipping the four Phase 6 plan drafts (PRs 77-80), the adversarial-review
adjustments lived only as trailing "Review" sections. An implementer reading
Stream A would find the original unadjusted guidance, then have to cross-reference
the review to reconcile. This PR makes the plans genuinely executable:
1. Merges every ACCEPTed review finding into the actual Scope / Stream / Compliance
sections of each phase plan:
- phase-6-1: Scope table rewrite (per-capability retry, (instance,host) pipeline key,
MemoryTracking vs MemoryRecycle split, hybrid watchdog formula, demand-aware
wedge detector, generation-sealed LiteDB). Streams A/B/D + Compliance rewritten.
- phase-6-2: AuthorizationDecision tri-state, control/data-plane separation,
MembershipFreshnessInterval (15 min), AuthCacheMaxStaleness (5 min),
subscription stamp-and-reevaluate. Stream C widened to 11 OPC UA operations.
- phase-6-3: 8-state ServiceLevel matrix (OPC UA Part 5 §6.3.34-compliant),
two-layer peer probe (/healthz + UaHealthProbe), apply-lease via await using,
publish-generation fencing, InvalidTopology runtime state, ServerUriArray
self-first + peers. New Stream F (interop matrix + Galaxy failover).
- phase-6-4: DraftRevisionToken concurrency control, staged-import via
EquipmentImportBatch with user-scoped visibility, CSV header version marker,
decision-#117-aligned identifier columns, 1000-row diff cap,
decision-#139 OPC 40010 fields, Identification inherits Equipment ACL.
2. Appends decisions #143 through #162 to docs/v2/plan.md capturing the
architectural commitments the adjustments created. Each decision carries its
dated rationale so future readers know why the choice was made.
3. Scaffolds scripts/compliance/phase-6-{1,2,3,4}-compliance.ps1 — PowerShell
stubs with Assert-Todo / Assert-Pass / Assert-Fail helpers. Every check
maps to a Stream task ID from the corresponding phase plan. Currently all
checks are TODO and scripts exit 0; each implementation task is responsible
for replacing its TODO with a real check before closing that task. Saved
as UTF-8 with BOM so Windows PowerShell 5.1 parses em-dash characters
without breaking.
Net result: the Phase 6.1 plan is genuinely ready to execute. Stream A.3 can
start tomorrow without reconciling Streams vs. Review on every task; the
compliance script is wired to the Stream IDs; plan.md has the architectural
commitments that justify the Stream choices.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 6.3 — Redundancy Runtime
Status: DRAFT
Context: `CLAUDE.md` + `docs/Redundancy.md` describe a non-transparent warm/hot redundancy model with unique `ApplicationUri`s, `RedundancySupport` advertisement, `ServerUriArray`, and dynamic `ServiceLevel`. Entities (`ServerCluster`, `ClusterNode`, `RedundancyRole`, `RedundancyMode`) exist; the runtime behavior (actual `ServiceLevel` number computation, mid-apply dip, `ServerUriArray` broadcast) is not wired.
Branch: `v2/phase-6-3-redundancy-runtime`
Estimated duration: 2 weeks
Predecessor: Phase 6.2 (Authorization) — reuses the Phase 6.1 health endpoints for cluster-peer probing
Successor: Phase 6.4 (Admin UI completion)
Phase Objective
Land the non-transparent redundancy protocol end-to-end: two OtOpcUa.Server instances in a ServerCluster each expose a live ServiceLevel node whose value reflects that instance's suitability to serve traffic, advertise each other via ServerUriArray, and transition role (Primary ↔ Backup) based on health + operator intent.
Closes these gaps:
- Dynamic `ServiceLevel` — OPC UA Part 5 §6.3.34 specifies a Byte (0..255) that clients poll to pick the healthiest server. Our server publishes it as a static value today.
- `ServerUriArray` broadcast — Part 4 specifies that every node in a redundant pair should advertise its peers' ApplicationUris. Currently advertises only its own.
- Primary / Backup role coordination — entities carry `RedundancyRole` but the runtime doesn't read it; no peer health probing; no role-transfer on primary failure.
- Mid-apply dip — decision-level expectation that a server mid-generation-apply should report a lower ServiceLevel so clients cut over to the peer during the apply window. Not implemented.
Scope — What Changes
| Concern | Change |
|---|---|
| `OtOpcUa.Server` → new `Server.Redundancy` sub-namespace | `RedundancyCoordinator` singleton. Resolves the current node's `ClusterNode` row at startup, loads peers, runs a two-layer peer health probe: (a) `/healthz` every 2 s as the fast-fail (inherits Phase 6.1 semantics — HTTP + DB/cache healthy); (b) `UaHealthProbe` every 10 s — opens a lightweight OPC UA client session to the peer, reads its `ServiceLevel` node, and verifies the endpoint serves data. Authority decisions use `UaHealthProbe`; `/healthz` is used only to avoid wasting UA probes when the peer is obviously down. |
| Publish-generation fencing | Topology + role decisions are stamped with a monotonic `ConfigGenerationId` from the shared config DB. Coordinator re-reads topology via CAS on `(ClusterId, ExpectedGeneration)` → new row; peers reject state propagated from a lower generation. Prevents split-publish races. |
| `InvalidTopology` runtime state | If both nodes detect >1 Primary AFTER startup (config-DB drift during a publish), both self-demote to ServiceLevel 2 until convergence. Neither node serves authoritatively; clients pick the healthier alternative or reconnect later. |
| OPC UA server root | `ServiceLevel` variable node becomes a `BaseDataVariable` whose value updates on `RedundancyCoordinator` state change. `ServerUriArray` array variable includes self + peers in stable deterministic ordering (decision per OPC UA Part 4 §6.6.2.2). `RedundancySupport` stays static (set from `RedundancyMode` at startup); Transparent mode is validated pre-publish, not rejected at startup. |
| `RedundancyCoordinator` computation | 8-state ServiceLevel matrix — avoids the OPC UA Part 5 §6.3.34 collision (0=Maintenance, 1=NoData). Operator-declared maintenance only = 0. Unreachable / Faulted = 1. In-range operational states occupy 2..255: Authoritative-Primary = 255; Isolated-Primary (peer unreachable, self serving) = 230; Primary-Mid-Apply = 200; Recovering-Primary (post-fault, dwell not met) = 180; Authoritative-Backup = 100; Isolated-Backup (primary unreachable, "take over if asked") = 80; Backup-Mid-Apply = 50; Recovering-Backup = 30; InvalidTopology (runtime detects >1 Primary) = 2 (detected-inconsistency band — below normal operation). Full matrix documented in the `docs/Redundancy.md` update. |
| Role transition | Split-brain avoidance: role is declared in the shared config DB (`ClusterNode.RedundancyRole`), not elected at runtime. An operator flips the row (or a failover script does). Coordinator only reads; never writes. |
| `sp_PublishGeneration` hook | Uses named apply leases keyed to `(ConfigGenerationId, PublishRequestId)`: `await using var lease = coordinator.BeginApplyLease(...)`. Disposal on any exit path (success, exception, cancellation) decrements. Watchdog auto-closes any lease older than `ApplyMaxDuration` (default 10 min) → ServiceLevel can't stick at mid-apply. Pre-publish validator rejects unsupported `RedundancyMode` (e.g. Transparent) with a clear error so runtime never sees an invalid state. |
| Admin UI `/cluster/{id}` page | New `RedundancyTab.razor` — shows the current node's role + ServiceLevel + peer reachability. FleetAdmin can trigger a role-swap by editing `ClusterNode.RedundancyRole` + publishing a draft. |
| Metrics | New OpenTelemetry metrics: `ot_opcua_service_level{cluster,node}`, `ot_opcua_peer_reachable{cluster,node,peer}`, `ot_opcua_apply_in_progress{cluster,node}`. Sink via the Phase 6.1 observability layer. |
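The 8-state matrix above reduces to a pure function of the coordinator's inputs. A minimal Python sketch of that mapping — argument names and the input shapes are illustrative; the real implementation is the C# `ServiceLevelCalculator` described in Stream B:

```python
def compute_service_level(role, self_healthy, maintenance, peer_reachable,
                          apply_in_progress, recovery_dwell_met, topology_valid):
    """Sketch of the 8-state ServiceLevel matrix. role is 'Primary' or 'Backup'."""
    if not topology_valid:            # runtime >1 Primary: detected-inconsistency band
        return 2
    if maintenance:                   # operator-declared maintenance only
        return 0
    if not self_healthy:              # Faulted / unreachable (NoData band)
        return 1
    primary = role == "Primary"
    if apply_in_progress:             # mid-apply dip so clients cut over
        return 200 if primary else 50
    if not recovery_dwell_met:        # post-fault dwell + publish witness not met
        return 180 if primary else 30
    if not peer_reachable:            # isolated: Primary keeps authority,
        return 230 if primary else 80 # Backup advertises "take over if asked"
    return 255 if primary else 100    # authoritative bands
```

Note the precedence encoded here: invalid topology dominates everything (per B.3), and the apply dip dominates peer-unreachable (per the C.4 cutover test).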
Scope — What Does NOT Change
| Item | Reason |
|---|---|
| OPC UA authn / authz | Phases 6.2 + prior. Redundancy is orthogonal. |
| Driver layer | Drivers aren't redundancy-aware; they run on each node independently against the same equipment. The server layer handles the ServiceLevel story. |
| Automatic failover / election | Explicitly out of scope. Non-transparent = client picks which server to use via ServiceLevel + ServerUriArray. We do NOT ship consensus, leader election, or automatic promotion. Operator-driven failover is the v2.0 model per decision #79–85. |
| Transparent redundancy (`RedundancySupport=Transparent`) | Not supported. Publishing `RedundancyMode=Transparent` is rejected by the `sp_PublishGeneration` pre-publish validator with a clear error; the last-good config stays active. |
| Historian redundancy | Galaxy Historian's own redundancy (two historians on two CPUs) is out of scope. The Galaxy driver talks to whichever historian is reachable from its node. |
Entry Gate Checklist
- Phase 6.1 merged (uses `/healthz` for peer probing)
- `CLAUDE.md` §Redundancy + `docs/Redundancy.md` re-read
- Decisions #79–85 re-skimmed
- `ServerCluster` / `ClusterNode` / `RedundancyRole` / `RedundancyMode` entities + existing migration reviewed
- OPC UA Part 4 §Redundancy + Part 5 §6.3.34 (ServiceLevel) re-skimmed
- Dev box has two OtOpcUa.Server instances configured against the same cluster — one designated Primary, one Backup — for integration testing
Task Breakdown
Stream A — Cluster topology loader (3 days)
- A.1 `RedundancyCoordinator` startup path: reads the `ClusterNode` row for the current node (identified by `appsettings.json` `Cluster:NodeId`), reads the cluster's peer list, validates invariants (no duplicate `ApplicationUri`, at most one `Primary` per cluster if `RedundancyMode.WarmActive`, at most two nodes total in v2.0 per decision #83).
- A.2 Topology subscription — coordinator re-reads on `sp_PublishGeneration` confirmation so an operator role-swap takes effect after publish (no process restart needed).
- A.3 Tests: two-node cluster seed, one-node cluster seed (degenerate), duplicate-uri rejection.
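The A.1 invariants are simple enough to sketch as a standalone validation pass. A hedged Python sketch — the row shape and error strings are illustrative, not the real `ClusterNode` entity:

```python
def validate_topology(nodes, redundancy_mode):
    """Sketch of the A.1 startup invariants over the cluster's node rows.
    Each node is a dict: {'application_uri': str, 'role': 'Primary' | 'Backup'}.
    Returns a list of violation messages (empty = valid)."""
    errors = []
    uris = [n["application_uri"] for n in nodes]
    if len(set(uris)) != len(uris):
        errors.append("duplicate ApplicationUri in cluster")
    if len(nodes) > 2:                           # v2.0 cap per decision #83
        errors.append("more than two nodes in cluster")
    primaries = sum(1 for n in nodes if n["role"] == "Primary")
    if redundancy_mode == "WarmActive" and primaries > 1:
        errors.append("more than one Primary in WarmActive cluster")
    return errors
```

Returning all violations (rather than failing on the first) gives the fail-fast startup log a complete picture of the misconfiguration.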
Stream B — Peer health probing + ServiceLevel computation (6 days, widened)
- B.1 `PeerHttpProbeLoop` per peer at 2 s — calls the peer's `/healthz`, 1 s timeout, exponential backoff on sustained failure. Used as fast-fail.
- B.2 `PeerUaProbeLoop` per peer at 10 s — opens an OPC UA client session to the peer (reuses the Phase 5 `Driver.OpcUaClient` stack), reads the peer's `ServiceLevel` node + verifies the endpoint serves data. Short-circuit: if the HTTP probe is failing, skip the UA probe (no wasted sessions).
- B.3 `ServiceLevelCalculator.Compute(role, selfHealth, peerHttpHealthy, peerUaHealthy, applyInProgress, recoveryDwellMet, topologyValid) → byte`. 8-state matrix per §Scope. `topologyValid=false` forces InvalidTopology = 2 regardless of other inputs.
- B.4 `RecoveryStateManager`: after a `Faulted → Healthy` transition, hold the node in the Recovering band (180 Primary / 30 Backup) for `RecoveryDwellTime` (default 60 s) AND require one positive publish witness (successful `Read` on a reference node) before entering the Authoritative band.
- B.5 Calculator reacts to inputs via `IObserver` so changes immediately push to the OPC UA `ServiceLevel` node.
- B.6 Tests: 64-case matrix covering role × self-health × peer-http × peer-ua × apply × recovery × topology. Specific cases flagged: Primary-with-unreachable-peer-serves-at-230 (authority retained); Backup-with-unreachable-primary-escalates-to-80 (not auto-promote); InvalidTopology demotes both nodes; Recovering dwell + publish-witness blocks premature return to 255.
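The B.1/B.2 interaction plus the 3-consecutive-failure smoothing window (from the risk table) amounts to a small bookkeeping state machine per peer. A Python sketch — class and member names are illustrative, not the C# probe loops:

```python
class PeerProbeState:
    """Sketch of the two-layer probe bookkeeping for one peer: the HTTP
    fast-fail gates the (expensive) UA probe, and a peer is only declared
    unreachable after 3 consecutive failed probes to avoid thrashing."""
    FAIL_THRESHOLD = 3

    def __init__(self):
        self.http_failures = 0
        self.ua_failures = 0

    def record_http(self, ok):
        self.http_failures = 0 if ok else self.http_failures + 1

    def record_ua(self, ok):
        self.ua_failures = 0 if ok else self.ua_failures + 1

    @property
    def http_down(self):
        return self.http_failures >= self.FAIL_THRESHOLD

    def should_run_ua_probe(self):
        # B.2 short-circuit: don't open a UA session when HTTP is already down.
        return not self.http_down

    @property
    def ua_healthy(self):
        # Authority signal: only a passing UA probe counts the peer as healthy.
        return not self.http_down and self.ua_failures < self.FAIL_THRESHOLD
```

The `ua_healthy` property is what feeds the calculator's `peerUaHealthy` input; `http_down` only suppresses probing.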
Stream C — OPC UA node wiring (3 days)
- C.1 `ServiceLevel` variable node created under `ServerStatus` at server startup. Type `Byte`, AccessLevel = CurrentRead only. Subscribe to the `ServiceLevelCalculator` observable; push updates via `DataChangeNotification`.
- C.2 `ServerUriArray` variable node under `ServerCapabilities`. Array of `String`, includes self + peers with deterministic ordering (self first). Updates on topology change. Compliance test asserts local-plus-peer membership.
- C.3 `RedundancySupport` variable — static at startup from `RedundancyMode`. Values: `None`, `Cold`, `Warm`, `WarmActive`, `Hot`. Unsupported values (`Transparent`, `HotAndMirrored`) are rejected pre-publish by the validator — runtime never sees them.
- C.4 Client.CLI cutover test: connect to the primary, read `ServiceLevel` → 255; pause the primary mid-apply → 200; unreachable peer while apply in progress → 200 (apply dominates peer-unreachable per matrix); client sees the peer via `ServerUriArray`; fail the primary → client reconnects to the peer at 80 (isolated-backup band).
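The C.2 ordering rule ("self first, deterministic") can be pinned down in a few lines. A Python sketch — sorting the peer URIs is one way to get a stable order, an assumption here rather than a decision the plan records:

```python
def server_uri_array(self_uri, peer_uris):
    """Sketch of the C.2 ServerUriArray value: self first, then peers in a
    stable (sorted) order so the advertised array is identical across rebuilds
    and across both nodes' views of the same topology."""
    return [self_uri] + sorted(u for u in peer_uris if u != self_uri)
```

Filtering `self_uri` out of the peer list keeps the array duplicate-free even if the topology row accidentally lists the local node as its own peer.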
Stream D — Apply-window integration (3 days)
- D.1 The `sp_PublishGeneration` caller wraps the apply in `await using var lease = coordinator.BeginApplyLease(generationId, publishRequestId)`. Lease keyed to `(ConfigGenerationId, PublishRequestId)` so concurrent publishes stay isolated. Disposal decrements on every exit path.
- D.2 `ApplyLeaseWatchdog` auto-closes leases older than `ApplyMaxDuration` (default 10 min) so a crashed publisher can't pin the node at mid-apply.
- D.3 Pre-publish validator in `sp_PublishGeneration` rejects unsupported `RedundancyMode` values (`Transparent`, `HotAndMirrored`) with a clear error message — runtime never sees an invalid mode.
- D.4 Tests: (a) mid-apply client subscribes → sees ServiceLevel drop → sees restore; (b) lease leak via `ThreadAbort` / cancellation → watchdog closes; (c) publish rejected for `Transparent` → operator-actionable error.
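The lease lifecycle in D.1/D.2 — keyed leases, release on every exit path, watchdog sweep — can be sketched with a context manager standing in for the C# `await using` disposal. Names here are illustrative; the real lease lives in the coordinator:

```python
import time
from contextlib import contextmanager

class ApplyLeaseRegistry:
    """Sketch of the D.1/D.2 apply-lease model: leases are keyed to
    (generation_id, publish_request_id), released on every exit path, and a
    watchdog sweep force-closes anything older than apply_max_duration."""

    def __init__(self, apply_max_duration=600.0, clock=time.monotonic):
        self.apply_max_duration = apply_max_duration
        self.clock = clock
        self.leases = {}  # (gen, req) -> acquired-at timestamp

    @contextmanager
    def begin_apply_lease(self, generation_id, publish_request_id):
        key = (generation_id, publish_request_id)
        self.leases[key] = self.clock()
        try:
            yield key
        finally:  # runs on success, exception, or cancellation alike
            self.leases.pop(key, None)

    @property
    def apply_in_progress(self):
        return bool(self.leases)

    def watchdog_sweep(self):
        """Force-close abandoned leases so ServiceLevel can't stick mid-apply."""
        now = self.clock()
        stale = [k for k, t in self.leases.items()
                 if now - t > self.apply_max_duration]
        for k in stale:
            del self.leases[k]
        return stale
```

`apply_in_progress` is the signal the calculator consumes for the mid-apply bands; the injectable `clock` makes the watchdog path unit-testable without waiting 10 minutes.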
Stream E — Admin UI + metrics (3 days)
- E.1 `RedundancyTab.razor` under `/cluster/{id}/redundancy`. Shows each node's role, current ServiceLevel (with band label per 8-state matrix), peer reachability (HTTP + UA probe separately), last apply timestamp. Role-swap button posts a draft edit on `ClusterNode.RedundancyRole`; publish applies.
- E.2 OpenTelemetry meter export: `ot_opcua_service_level{cluster,node}` gauge + `ot_opcua_peer_reachable{cluster,node,peer,kind=http|ua}` + `ot_opcua_apply_in_progress{cluster,node}` + `ot_opcua_topology_valid{cluster}`. Sink via Phase 6.1 observability.
- E.3 SignalR push: `FleetStatusHub` broadcasts ServiceLevel changes so the Admin UI updates within ~1 s of the coordinator observing a peer flip.
Stream F — Client-interoperability matrix (3 days, new)
- F.1 Validate ServiceLevel-driven cutover against Ignition 8.1 + 8.3, Kepware KEPServerEX 6.x, Aveva OI Gateway 2020R2 + 2023R1. For each: configure the client with both endpoints, verify it honors `ServiceLevel` + `ServerUriArray` during primary failover.
- F.2 Clients that don't honour the standards (doc field — may include Kepware and OI Gateway per Codex review) get an explicit compatibility-matrix entry: "requires manual backup-endpoint config / vendor-specific redundancy primitives". Documented in `docs/Redundancy.md`.
- F.3 Galaxy MXAccess failover test — boot Galaxy.Proxy on both nodes, kill the Primary, assert the Galaxy consumer reconnects to the Backup within `(SessionTimeout + KeepAliveInterval × 3)`. Document the required session-timeout config in `docs/Redundancy.md`.
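The F.3 budget is a simple arithmetic bound worth making explicit, since the test asserts against it. A tiny sketch (function name is illustrative):

```python
def reconnect_budget_seconds(session_timeout_s, keep_alive_interval_s):
    """F.3 bound sketch: worst case, the client needs a full session timeout
    plus three missed keep-alives to notice the Primary is gone and re-session
    against the Backup."""
    return session_timeout_s + keep_alive_interval_s * 3
```

With, say, a 30 s session timeout and a 5 s keep-alive (illustrative numbers, not plan-mandated config), the assertion budget would be 45 s.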
Compliance Checks (run at exit gate)
- OPC UA band compliance: `0=Maintenance` reserved, `1=NoData` reserved. Operational states in 2..255 per 8-state matrix.
- Authoritative-Primary ServiceLevel = 255.
- Isolated-Primary (peer unreachable, self serving) = 230 — Primary retains authority.
- Primary-Mid-Apply = 200.
- Recovering-Primary = 180 with dwell + publish witness enforced.
- Authoritative-Backup = 100.
- Isolated-Backup (primary unreachable) = 80 — does NOT auto-promote.
- InvalidTopology = 2 — both nodes self-demote when >1 Primary is detected at runtime.
- ServerUriArray returns self + peer URIs, self first.
- UaHealthProbe authority: integration test — peer returns HTTP 200 but OPC UA endpoint unreachable → coordinator treats peer as UA-unhealthy; peer is not a valid authority source.
- Apply-lease disposal: leases close on exception, cancellation, and watchdog timeout; ServiceLevel never sticks at mid-apply band.
- Transparent-mode rejection: attempting to publish `RedundancyMode=Transparent` is blocked at `sp_PublishGeneration`; runtime never sees an invalid mode.
- Role transition via operator publish: FleetAdmin swaps `RedundancyRole` in a draft, publishes; both nodes re-read topology on publish confirmation + flip ServiceLevel — no restart.
- Client.CLI cutover: with the primary halted, a Client.CLI that was connected to the primary sees the primary drop + reconnects to the backup via `ServerUriArray`.
- Client interoperability matrix (Stream F): Ignition 8.1 + 8.3 honour ServiceLevel; Kepware + Aveva OI Gateway findings documented.
- Galaxy MXAccess failover: end-to-end test — primary kill → Galaxy consumer reconnects to backup within session-timeout budget.
- No regression in existing driver test suites; no regression in `/healthz` reachability under redundancy load.
Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Split-brain from operator race (both nodes marked Primary) | Low | High | Coordinator rejects startup if its cluster has >1 Primary row; logs + fails fast. Document as a publish-time validation in sp_PublishGeneration. |
| ServiceLevel thrashing on flaky peer | Medium | Medium | 2 s probe interval + 3-sample smoothing window; only declares a peer unreachable after 3 consecutive failed probes |
| Client ignores ServiceLevel and stays on broken primary | Medium | Medium | Documented in docs/Redundancy.md — non-transparent redundancy requires client cooperation; most SCADA clients (Ignition, Kepware, Aveva OI Gateway) honor it. Unit-test the advertised values; field behavior is client-responsibility |
| Apply-lease leaks on exception | Low | High | `BeginApplyLease` returns an async-disposable lease; `await using` enforces paired release on every exit path; watchdog force-closes abandoned leases; unit test for the exception-in-apply path |
| `HttpClient` probe leaks sockets | Low | Medium | Single shared `HttpClient` per coordinator (not per-probe); tight timeouts avoid keeping connections open during peer downtime |
Completion Checklist
- Stream A: topology loader + tests
- Stream B: peer probe + ServiceLevel calculator + 64-case matrix tests
- Stream C: ServiceLevel / ServerUriArray / RedundancySupport node wiring + Client.CLI smoke test
- Stream D: apply-window integration + apply-lease watchdog
- Stream E: Admin `RedundancyTab` + OpenTelemetry metrics + SignalR push
- `phase-6-3-compliance.ps1` exits 0; exit-gate doc; `docs/Redundancy.md` updated with the ServiceLevel matrix
Adversarial Review — 2026-04-19 (Codex, thread 019da490-3fa0-7340-98b8-cceeca802550)
- Crit · ACCEPT — No publish-generation fencing enables a split-publish advertising both nodes as authoritative. Change: coordinator CAS on a monotonic `ConfigGenerationId`; every topology decision is generation-stamped; peers reject state propagated from a lower generation.
- Crit · ACCEPT — `>1 Primary` at startup is covered, but runtime containment is missing when invalid topology appears later (mid-apply race). Change: add a runtime `InvalidTopology` state — both nodes self-demote to ServiceLevel 2 (the "detected inconsistency" band, below normal operation) until convergence.
- High · ACCEPT — `0 = Faulted` collides with OPC UA Part 5 §6.3.34 semantics where 0 means Maintenance and 1 means NoData. Change: reserve 0 for operator-declared maintenance mode only; Faulted/unreachable uses 1 (NoData); in-range degraded states occupy 2..199.
- High · ACCEPT — Matrix collapses distinct operational states onto the same value. Change: matrix expanded to Authoritative-Primary=255, Isolated-Primary=230 (peer unreachable — still serving), Primary-Mid-Apply=200, Recovering-Primary=180, Authoritative-Backup=100, Isolated-Backup=80 (primary unreachable — "take over if asked"), Backup-Mid-Apply=50, Recovering-Backup=30.
- High · ACCEPT — `/healthz` from 6.1 is HTTP-healthy but doesn't guarantee the OPC UA data plane. Change: add a redundancy-specific probe `UaHealthProbe` — issues a `ReadAsync(ServiceLevel)` against the peer's OPC UA endpoint via a lightweight client session. `/healthz` remains the fast-fail; the UA probe is the authority signal.
- High · ACCEPT — `ServerUriArray` must include self + peers, not peers only. Change: array contains `[self.ApplicationUri, peer.ApplicationUri]` in stable deterministic ordering; compliance test asserts local-plus-peer membership.
- Med · ACCEPT — No `Faulted → Recovering → Healthy` path. Change: add a `Recovering` state with a minimum dwell time (60 s default) + positive publish witness (one successful Read on a reference node) before returning to Healthy. Thrash prevention.
- Med · ACCEPT — Topology change during an in-flight probe is undefined. Change: every probe task is tagged with the `ConfigGenerationId` at dispatch; obsolete results are discarded; in-flight probes are cancelled on topology reload.
- Med · ACCEPT — Apply-window counter races on exception/cancellation/async ownership. Change: the apply window is a named lease keyed to `(ConfigGenerationId, PublishRequestId)` with disposal enforced via `await using`; a watchdog detects leased-but-abandoned and force-closes after `ApplyMaxDuration` (default 10 min).
- High · ACCEPT — Ignition + Kepware + Aveva OI Gateway `ServiceLevel` compliance is unverified. Change: risk elevated to High; add Stream F (new) — build an interop matrix: validate against Ignition 8.1/8.3, Kepware KEPServerEX 6.x, Aveva OI Gateway 2020R2 + 2023R1. Document per-client cutover behaviour. Field deployments get a documented compatibility table; clients that ignore ServiceLevel are documented as requiring explicit backup-endpoint config.
- Med · ACCEPT — Galaxy MXAccess re-session on Primary death is not in acceptance. Change: Stream F adds an end-to-end failover smoke test that boots Galaxy.Proxy on both nodes, kills the Primary, and asserts the Galaxy consumer reconnects to the Backup within the `(SessionTimeout + KeepAliveInterval × 3)` budget. `docs/Redundancy.md` updated with required session timeouts.
- Med · ACCEPT — Transparent-mode startup rejection is outage-prone. Change: `sp_PublishGeneration` validates `RedundancyMode` pre-publish — unsupported values reject the publish attempt with a clear validation error; runtime never sees an unsupported mode. Last-good config stays active.