Phase 6 reconcile — merge adjustments into plan bodies, add decisions #143-162, scaffold compliance stubs

After shipping the four Phase 6 plan drafts (PRs 77-80), the adversarial-review adjustments lived only as trailing "Review" sections. An implementer reading Stream A would find the original unadjusted guidance, then have to cross-reference the review to reconcile. This PR makes the plans genuinely executable: 1. Merges every ACCEPTed review finding into the actual Scope / Stream / Compliance sections of each phase plan: - phase-6-1: Scope table rewrite (per-capability retry, (instance,host) pipeline key, MemoryTracking vs MemoryRecycle split, hybrid watchdog formula, demand-aware wedge detector, generation-sealed LiteDB). Streams A/B/D + Compliance rewritten. - phase-6-2: AuthorizationDecision tri-state, control/data-plane separation, MembershipFreshnessInterval (15 min), AuthCacheMaxStaleness (5 min), subscription stamp-and-reevaluate. Stream C widened to 11 OPC UA operations. - phase-6-3: 8-state ServiceLevel matrix (OPC UA Part 5 §6.3.34-compliant), two-layer peer probe (/healthz + UaHealthProbe), apply-lease via await using, publish-generation fencing, InvalidTopology runtime state, ServerUriArray self-first + peers. New Stream F (interop matrix + Galaxy failover). - phase-6-4: DraftRevisionToken concurrency control, staged-import via EquipmentImportBatch with user-scoped visibility, CSV header version marker, decision-#117-aligned identifier columns, 1000-row diff cap, decision-#139 OPC 40010 fields, Identification inherits Equipment ACL. 2. Appends decisions #143 through #162 to docs/v2/plan.md capturing the architectural commitments the adjustments created. Each decision carries its dated rationale so future readers know why the choice was made. 3. Scaffolds scripts/compliance/phase-6-{1,2,3,4}-compliance.ps1 — PowerShell stubs with Assert-Todo / Assert-Pass / Assert-Fail helpers. Every check maps to a Stream task ID from the corresponding phase plan. Currently all checks are TODO and scripts exit 0; each implementation task is responsible for replacing its TODO with a real check before closing that task. Saved as UTF-8 with BOM so Windows PowerShell 5.1 parses em-dash characters without breaking. Net result: the Phase 6.1 plan is genuinely ready to execute. Stream A.3 can start tomorrow without reconciling Streams vs. Review on every task; the compliance script is wired to the Stream IDs; plan.md has the architectural commitments that justify the Stream choices. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 03:49:41 -04:00
parent 81a1f7f0f6
commit ba31f200f6
9 changed files with 517 additions and 102 deletions
@@ -22,11 +22,13 @@ Closes these gaps:

 | Concern | Change |
 |---------|--------|
-| `OtOpcUa.Server` → new `Server.Redundancy` sub-namespace | `RedundancyCoordinator` singleton. Resolves the current node's `ClusterNode` row at startup, loads its peers from `ServerCluster`, probes each peer's `/healthz` (Phase 6.1 endpoint) every `PeerProbeInterval` (default 2 s), maintains per-peer health state. |
-| OPC UA server root | `ServiceLevel` variable node becomes a `BaseDataVariable` whose value updates on `RedundancyCoordinator` state change. `ServerUriArray` array variable refreshes on cluster-topology change. `RedundancySupport` stays static (set from `RedundancyMode` at startup). |
-| `RedundancyCoordinator` computation | `ServiceLevel` formula: 255 = Primary + fully healthy + no apply in progress; 200 = Primary + an apply in the middle (clients should prefer peer); 100 = Backup + fully healthy; 50 = Backup + mid-apply; 0 = Faulted or peer-unreachable-and-I'm-not-authoritative. Documented in `docs/Redundancy.md` update. |
+| `OtOpcUa.Server` → new `Server.Redundancy` sub-namespace | `RedundancyCoordinator` singleton. Resolves the current node's `ClusterNode` row at startup, loads peers, runs **two-layer peer health probe**: (a) `/healthz` every 2 s as the fast-fail (inherits Phase 6.1 semantics — HTTP + DB/cache healthy); (b) `UaHealthProbe` every 10 s — opens a lightweight OPC UA client session to the peer + reads its `ServiceLevel` node + verifies endpoint serves data. Authority decisions use UaHealthProbe; `/healthz` is used only to avoid wasting UA probes when peer is obviously down. |
+| Publish-generation fencing | Topology + role decisions are stamped with a monotonic `ConfigGenerationId` from the shared config DB. Coordinator re-reads topology via CAS on `(ClusterId, ExpectedGeneration)` → new row; peers reject state propagated from a lower generation. Prevents split-publish races. |
+| `InvalidTopology` runtime state | If both nodes detect >1 Primary AFTER startup (config-DB drift during a publish), both self-demote to ServiceLevel 2 until convergence. Neither node serves authoritatively; clients pick the healthier alternative or reconnect later. |
+| OPC UA server root | `ServiceLevel` variable node becomes a `BaseDataVariable` whose value updates on `RedundancyCoordinator` state change. `ServerUriArray` array variable includes **self + peers** in stable deterministic ordering (decision per OPC UA Part 4 §6.6.2.2). `RedundancySupport` stays static (set from `RedundancyMode` at startup); `Transparent` mode validated pre-publish, not rejected at startup. |
+| `RedundancyCoordinator` computation | **8-state ServiceLevel matrix** — avoids OPC UA Part 5 §6.3.34 collision (`0=Maintenance`, `1=NoData`). Operator-declared maintenance only = **0**. Unreachable / Faulted = **1**. In-range operational states occupy **2..255**: Authoritative-Primary = **255**; Isolated-Primary (peer unreachable, self serving) = **230**; Primary-Mid-Apply = **200**; Recovering-Primary (post-fault, dwell not met) = **180**; Authoritative-Backup = **100**; Isolated-Backup (primary unreachable, "take over if asked") = **80**; Backup-Mid-Apply = **50**; Recovering-Backup = **30**; `InvalidTopology` (runtime detects >1 Primary) = **2** (detected-inconsistency band — below normal operation). Full matrix documented in `docs/Redundancy.md` update. |
 | Role transition | Split-brain avoidance: role is *declared* in the shared config DB (`ClusterNode.RedundancyRole`), not elected at runtime. An operator flips the row (or a failover script does). Coordinator only reads; never writes. |
-| `sp_PublishGeneration` hook | Before the apply starts, the coordinator sets `ApplyInProgress = true` in-memory → `ServiceLevel` drops to mid-apply band. Clears after `sp_PublishGeneration` returns. |
+| `sp_PublishGeneration` hook | Uses named **apply leases** keyed to `(ConfigGenerationId, PublishRequestId)`. `await using var lease = coordinator.BeginApplyLease(...)`. Disposal on any exit path (success, exception, cancellation) decrements. Watchdog auto-closes any lease older than `ApplyMaxDuration` (default 10 min) → ServiceLevel can't stick at mid-apply. Pre-publish validator rejects unsupported `RedundancyMode` (e.g. `Transparent`) with a clear error so runtime never sees an invalid state. |
 | Admin UI `/cluster/{id}` page | New `RedundancyTab.razor` — shows current node's role + ServiceLevel + peer reachability. FleetAdmin can trigger a role-swap by editing `ClusterNode.RedundancyRole` + publishing a draft. |
 | Metrics | New OpenTelemetry metrics: `ot_opcua_service_level{cluster,node}`, `ot_opcua_peer_reachable{cluster,node,peer}`, `ot_opcua_apply_in_progress{cluster,node}`. Sink via Phase 6.1 observability layer. |

@@ -57,41 +59,59 @@ Closes these gaps:
 2. **A.2** Topology subscription — coordinator re-reads on `sp_PublishGeneration` confirmation so an operator role-swap takes effect after publish (no process restart needed).
 3. **A.3** Tests: two-node cluster seed, one-node cluster seed (degenerate), duplicate-uri rejection.

-### Stream B — Peer health probing + ServiceLevel computation (4 days)
+### Stream B — Peer health probing + ServiceLevel computation (6 days, widened)

-1. **B.1** `PeerProbeLoop` runs per peer at `PeerProbeInterval` (2 s default, configurable via `appsettings.json`). Calls peer's `/healthz` via `HttpClient`; timeout 1 s. Exponential backoff on sustained failure.
-2. **B.2** `ServiceLevelCalculator.Compute(current role, self health, peer reachable, apply in progress) → byte`. Matrix documented in §Scope.
-3. **B.3** Calculator reacts to inputs via `IObserver` pattern so changes immediately push to the OPC UA `ServiceLevel` node.
-4. **B.4** Tests: matrix coverage for all role × health × apply permutations (32 cases); injected `IClock` + fake `HttpClient` so tests are deterministic.
+1. **B.1** `PeerHttpProbeLoop` per peer at 2 s — calls peer's `/healthz`, 1 s timeout, exponential backoff on sustained failure. Used as fast-fail.
+2. **B.2** `PeerUaProbeLoop` per peer at 10 s — opens an OPC UA client session to the peer (reuses Phase 5 `Driver.OpcUaClient` stack), reads peer's `ServiceLevel` node + verifies endpoint serves data. Short-circuit: if HTTP probe is failing, skip UA probe (no wasted sessions).
+3. **B.3** `ServiceLevelCalculator.Compute(role, selfHealth, peerHttpHealthy, peerUaHealthy, applyInProgress, recoveryDwellMet, topologyValid) → byte`. 8-state matrix per §Scope. `topologyValid=false` forces InvalidTopology = 2 regardless of other inputs.
+4. **B.4** `RecoveryStateManager`: after a `Faulted → Healthy` transition, hold driver in `Recovering` band (180 Primary / 30 Backup) for `RecoveryDwellTime` (default 60 s) AND require one positive publish witness (successful `Read` on a reference node) before entering Authoritative band.
+5. **B.5** Calculator reacts to inputs via `IObserver` so changes immediately push to the OPC UA `ServiceLevel` node.
+6. **B.6** Tests: **64-case matrix** covering role × self-health × peer-http × peer-ua × apply × recovery × topology. Specific cases flagged: Primary-with-unreachable-peer-serves-at-230 (authority retained); Backup-with-unreachable-primary-escalates-to-80 (not auto-promote); InvalidTopology demotes both nodes; Recovering dwell + publish-witness blocks premature return to 255.

 ### Stream C — OPC UA node wiring (3 days)

 1. **C.1** `ServiceLevel` variable node created under `ServerStatus` at server startup. Type `Byte`, AccessLevel = CurrentRead only. Subscribe to `ServiceLevelCalculator` observable; push updates via `DataChangeNotification`.
-2. **C.2** `ServerUriArray` variable node under `ServerCapabilities`. Array of `String`, length = peer count. Updates on topology change.
-3. **C.3** `RedundancySupport` variable — static at startup from `RedundancyMode`. Values: `None`, `Cold`, `Warm`, `WarmActive`, `Hot`. Phase 6.3 supports everything except `Transparent` + `HotAndMirrored`.
-4. **C.4** Test against the Client.CLI: connect to primary, read `ServiceLevel` → expect 255; pause primary apply → expect 200; fail primary → client sees `Bad_ServerNotConnected` + reconnects to peer at 100.
+2. **C.2** `ServerUriArray` variable node under `ServerCapabilities`. Array of `String`, **includes self + peers** with deterministic ordering (self first). Updates on topology change. Compliance test asserts local-plus-peer membership.
+3. **C.3** `RedundancySupport` variable — static at startup from `RedundancyMode`. Values: `None`, `Cold`, `Warm`, `WarmActive`, `Hot`. Unsupported values (`Transparent`, `HotAndMirrored`) are rejected **pre-publish** by validator — runtime never sees them.
+4. **C.4** Client.CLI cutover test: connect to primary, read `ServiceLevel` → 255; pause primary apply → 200; unreachable peer while apply in progress → 200 (apply dominates peer-unreachable per matrix); client sees peer via `ServerUriArray`; fail primary → client reconnects to peer at 80 (isolated-backup band).

-### Stream D — Apply-window integration (2 days)
+### Stream D — Apply-window integration (3 days)

-1. **D.1** `sp_PublishGeneration` caller wraps the apply in `using (coordinator.BeginApplyWindow())`. `BeginApplyWindow` increments an in-process counter; ServiceLevel drops on first increment. Dispose decrements.
-2. **D.2** Nested applies handled by the counter (rarely happens but Ignition and Kepware clients have both been observed firing rapid-succession draft publishes).
-3. **D.3** Test: mid-apply subscribe on primary; assert the subscribing client sees the ServiceLevel drop immediately after the apply starts, then restore after apply completes.
+1. **D.1** `sp_PublishGeneration` caller wraps the apply in `await using var lease = coordinator.BeginApplyLease(generationId, publishRequestId)`. Lease keyed to `(ConfigGenerationId, PublishRequestId)` so concurrent publishes stay isolated. Disposal decrements on every exit path.
+2. **D.2** `ApplyLeaseWatchdog` auto-closes leases older than `ApplyMaxDuration` (default 10 min) so a crashed publisher can't pin the node at mid-apply.
+3. **D.3** Pre-publish validator in `sp_PublishGeneration` rejects unsupported `RedundancyMode` values (`Transparent`, `HotAndMirrored`) with a clear error message — runtime never sees an invalid mode.
+4. **D.4** Tests: (a) mid-apply client subscribes → sees ServiceLevel drop → sees restore; (b) lease leak via `ThreadAbort` / cancellation → watchdog closes; (c) publish rejected for `Transparent` → operator-actionable error.

 ### Stream E — Admin UI + metrics (3 days)

-1. **E.1** `RedundancyTab.razor` under `/cluster/{id}/redundancy`. Shows each node's role, current ServiceLevel, peer reachability, last apply timestamp. Role-swap button posts a draft edit on `ClusterNode.RedundancyRole`; publish applies.
-2. **E.2** OpenTelemetry meter export: three gauges per the §Scope metrics. Sink via Phase 6.1 observability.
+1. **E.1** `RedundancyTab.razor` under `/cluster/{id}/redundancy`. Shows each node's role, current ServiceLevel (with band label per 8-state matrix), peer reachability (HTTP + UA probe separately), last apply timestamp. Role-swap button posts a draft edit on `ClusterNode.RedundancyRole`; publish applies.
+2. **E.2** OpenTelemetry meter export: `ot_opcua_service_level{cluster,node}` gauge + `ot_opcua_peer_reachable{cluster,node,peer,kind=http|ua}` + `ot_opcua_apply_in_progress{cluster,node}` + `ot_opcua_topology_valid{cluster}`. Sink via Phase 6.1 observability.
 3. **E.3** SignalR push: `FleetStatusHub` broadcasts ServiceLevel changes so the Admin UI updates within ~1 s of the coordinator observing a peer flip.

+### Stream F — Client-interoperability matrix (3 days, new)
+
+1. **F.1** Validate ServiceLevel-driven cutover against **Ignition 8.1 + 8.3**, **Kepware KEPServerEX 6.x**, **Aveva OI Gateway 2020R2 + 2023R1**. For each: configure the client with both endpoints, verify it honors `ServiceLevel` + `ServerUriArray` during primary failover.
+2. **F.2** Clients that don't honour the standards (doc field — may include Kepware and OI Gateway per Codex review) get an explicit compatibility-matrix entry: "requires manual backup-endpoint config / vendor-specific redundancy primitives". Documented in `docs/Redundancy.md`.
+3. **F.3** Galaxy MXAccess failover test — boot Galaxy.Proxy on both nodes, kill Primary, assert Galaxy consumer reconnects to Backup within `(SessionTimeout + KeepAliveInterval × 3)`. Document required session-timeout config in `docs/Redundancy.md`.
+
 ## Compliance Checks (run at exit gate)

- [ ] **Primary-healthy** ServiceLevel = 255.
- [ ] **Backup-healthy** ServiceLevel = 100.
- [ ] **Mid-apply Primary** ServiceLevel = 200 — verified via Client.CLI subscription polling ServiceLevel during a forced draft publish.
- [ ] **Peer-unreachable** handling: when a Primary can't probe its Backup's `/healthz`, Primary still serves at 255 (peer is the one with the problem). When a Backup can't probe Primary, Backup flips to 200 (per decision #81 — a lonely Backup promotes its advertised level to signal "I'll take over if you ask" without auto-promoting).
- [ ] **Role transition via operator publish**: FleetAdmin swaps `RedundancyRole` rows in a draft, publishes; both nodes re-read topology on publish confirmation and flip ServiceLevel accordingly — no restart needed.
- [ ] **ServerUriArray** returns exactly the peer node's ApplicationUri.
- [ ] **Client.CLI cutover**: with a primary deliberately halted, a client that was connected to primary reconnects to the backup within the ServiceLevel-polling interval.
+- [ ] **OPC UA band compliance**: `0=Maintenance` reserved, `1=NoData` reserved. Operational states in 2..255 per 8-state matrix.
+- [ ] **Authoritative-Primary** ServiceLevel = 255.
+- [ ] **Isolated-Primary** (peer unreachable, self serving) = 230 — Primary retains authority.
+- [ ] **Primary-Mid-Apply** = 200.
+- [ ] **Recovering-Primary** = 180 with dwell + publish witness enforced.
+- [ ] **Authoritative-Backup** = 100.
+- [ ] **Isolated-Backup** (primary unreachable) = 80 — does NOT auto-promote.
+- [ ] **InvalidTopology** = 2 — both nodes self-demote when >1 Primary detected runtime.
+- [ ] **ServerUriArray** returns self + peer URIs, self first.
+- [ ] **UaHealthProbe authority**: integration test — peer returns HTTP 200 but OPC UA endpoint unreachable → coordinator treats peer as UA-unhealthy; peer is not a valid authority source.
+- [ ] **Apply-lease disposal**: leases close on exception, cancellation, and watchdog timeout; ServiceLevel never sticks at mid-apply band.
+- [ ] **Transparent-mode rejection**: attempting to publish `RedundancyMode=Transparent` is blocked at `sp_PublishGeneration`; runtime never sees an invalid mode.
+- [ ] **Role transition via operator publish**: FleetAdmin swaps `RedundancyRole` in a draft, publishes; both nodes re-read topology on publish confirmation + flip ServiceLevel — no restart.
+- [ ] **Client.CLI cutover**: with primary halted, Client.CLI that was connected to primary sees primary drop + reconnects to backup via `ServerUriArray`.
+- [ ] **Client interoperability matrix** (Stream F): Ignition 8.1 + 8.3 honour ServiceLevel; Kepware + Aveva OI Gateway findings documented.
+- [ ] **Galaxy MXAccess failover**: end-to-end test — primary kill → Galaxy consumer reconnects to backup within session-timeout budget.
 - [ ] No regression in existing driver test suites; no regression in `/healthz` reachability under redundancy load.

 ## Risks and Mitigations