From 0528353315a1c68f0f7b92b4cee563034f37f56a Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Mon, 15 Jun 2026 12:33:09 -0400
Subject: [PATCH] =?UTF-8?q?docs(redundancy):=20Phase=202=20design=20?=
 =?UTF-8?q?=E2=80=94=20health-aware=20ServiceLevel=20(H3)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 ...tillpending-phase-2-servicelevel-design.md | 183 ++++++++++++++++++
 1 file changed, 183 insertions(+)
 create mode 100644 docs/plans/2026-06-15-stillpending-phase-2-servicelevel-design.md

diff --git a/docs/plans/2026-06-15-stillpending-phase-2-servicelevel-design.md b/docs/plans/2026-06-15-stillpending-phase-2-servicelevel-design.md
new file mode 100644
index 00000000..46e1bc1a
--- /dev/null
+++ b/docs/plans/2026-06-15-stillpending-phase-2-servicelevel-design.md
@@ -0,0 +1,183 @@
+# Phase 2 — Health-aware redundancy ServiceLevel (H3) — design
+
+> **Status:** approved 2026-06-15. Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md`
+> (Phase 2). Backlog item **H3** in `stillpending.md` §1.
+> **Branch:** `feat/stillpending-phase-2-servicelevel` off master `4bd7180e`.
+> **Classification:** high-risk (actor model + redundancy correctness). Live `/run` on the 2-node rig is
+> the acceptance gate; unit tests cannot prove the cross-node wiring (per `project_redundancy_state_delivery`).
+
+## Problem (H3)
+
+The SDK's `Server.ServiceLevel` byte is driven **solely by the coarse role map** today
+(`OpcUaPublishActor.HandleRedundancyStateChanged`: Primary-leader→240, Primary→200, Secondary→100,
+Detached→0). The richer `ServiceLevelCalculator.Compute(NodeHealthInputs)` — which folds in DB
+reachability, an OPC-UA liveness probe, and a staleness signal — **is never invoked in production**.
+The two health producers are dead or half-wired:
+
+- `DbHealthProbeActor` — **spawned** per node, but only feeds `/health/ready`; its `DbReachable` never
+  reaches the published byte.
+- `PeerOpcUaProbeActor` — **never spawned anywhere**; its `OpcUaProbeResult` has **zero consumers**.
+- `NodeHealthInputs.Stale` — **no producer exists** anywhere in the codebase.
+
+Consequence: a **DB-unreachable / OPC-UA-down primary still advertises its full role-based ServiceLevel**,
+so redundant clients keep subscribing to a degraded node instead of failing over.
+
+## Goal
+
+Make each driver node publish a ServiceLevel computed by `ServiceLevelCalculator.Compute` from its **real
+local health**, so an unhealthy node drops below its role-based level (and below a healthy peer's level,
+triggering client failover). Honor the **already-documented** tiers in `docs/Redundancy.md`
+(250 / 240 / 200 / 100 / 0) — do not reshape the calculator's truth table. No EF migration; no change to
+the `RedundancyStateChanged` / `NodeRedundancyState` message contract.
+
+## Locked decisions (from brainstorming 2026-06-15)
+
+1. **Compute site = local per-node** (`OpcUaPublishActor`), *not* the admin `RedundancyStateActor`
+   singleton. DB reachability is an inherently local fact; centralizing it would require a new
+   cluster-wide health-broadcast contract for no benefit. The publish actor already runs per node, already
+   subscribes to the `redundancy-state` topic, already holds the role snapshot, and already writes the byte.
+2. **`Stale` is derived from signal freshness, with DB-unreachable ⟹ stale.** This makes the documented
+   200/100 graceful-degradation tiers reachable, so a node that loses its DB settles into 100 ("degraded,
+   still serving cached values") rather than slamming to 0. The 0 tier stays reserved for "not a healthy
+   cluster member / detached." A richer *driver-input* staleness signal is **deferred** (flagged, not built).
+3. **`OpcUaProbeOk` = peer-probes-me**, reusing the existing `PeerOpcUaProbeActor` (finally spawned),
+   closing the audit's "never spawned / zero consumers" gaps. Absence of a fresh peer result → `true`
+   (benefit of the doubt: don't penalize a node for its peer being down); only an *actively observed*
+   recent failure demotes. Single-node / no-peer cluster → `true`.
+
+## Architecture
+
+All changes are **per-driver-node, inside `Runtime`** (`WithOtOpcUaRuntimeActors`). No ControlPlane /
+admin-singleton change. No new Commons message types.
+
+```
+WithOtOpcUaRuntimeActors (per driver node)
+ ├─ DbHealthProbeActor            (exists; now ALSO consulted by the publish actor)
+ ├─ PeerProbeSupervisor (NEW)     watches cluster membership →
+ │    └─ PeerOpcUaProbeActor(peer)   one child per OTHER driver member;
+ │                                   publishes OpcUaProbeResult(peer, ok) on the redundancy-state topic
+ └─ OpcUaPublishActor (MODIFIED)
+      subscribes redundancy-state topic → RedundancyStateChanged (role/leadership)
+                                        → OpcUaProbeResult (peer's verdict on ME)
+      holds IActorRef dbHealthProbe → Ask on a periodic HealthTick
+      Cluster.SelfMember.Status → MemberState (local)
+      on any trigger: ServiceLevelCalculator.Compute(inputs) → publish-if-changed (existing dedup)
+```
+
+### Components
+
+**1. `PeerProbeSupervisor` (NEW, `Runtime/Health/PeerProbeSupervisor.cs`)** — a small per-node actor whose
+only job is the probe lifecycle (kept separate so the pinned-dispatcher `OpcUaPublishActor` never parents
+network-probe children):
+- Subscribes to cluster membership (`IMemberEvent` / `ReachabilityEvent`) and/or the redundancy snapshot.
+- Maintains one `PeerOpcUaProbeActor` child per **other driver-role member**, (re)spawning on membership
+  change, stopping children for departed members. Resolves the peer's OPC UA host from its node id
+  (`host:port` → host) and the configured OPC UA port.
+- Children publish `OpcUaProbeResult` to the `redundancy-state` topic (existing behavior).
+- Spawned on the default dispatcher (not the OPC UA pinned one).
+
+**2. `OpcUaPublishActor` (MODIFIED)** — replace the role-only switch in `HandleRedundancyStateChanged`
+with a `RecomputeServiceLevel()` that calls the calculator. New Props params (both optional, defaulted so
+test Props/harnesses keep working): `IActorRef? dbHealthProbe`, and the freshness windows /
+self-OPC-UA-port (injectable for tests).
+- New state: `_dbReachable` + `_dbAsOfUtc` (from the DB Ask), `_lastSnapshotAsOfUtc`, a per-peer map of
+  `(ok, asOfUtc)` for results **about my own node**, and the current `RedundancyStateChanged` local entry.
+- New `Receive<OpcUaProbeResult>`: if `msg.NodeId == _localNode`, record `(msg.Ok, now)` for the reporting
+  peer; recompute.
+- New `Receive<HealthTick>` (periodic, e.g. 5 s): `dbHealthProbe.Ask(GetStatus, 1s).PipeTo(Self)`.
+- New `Receive<DbHealthStatus>`: cache `_dbReachable` + `_dbAsOfUtc`; recompute.
+- `RecomputeServiceLevel()` builds `NodeHealthInputs` and calls `ServiceLevelCalculator.Compute`, then
+  `Self.Tell(new ServiceLevelChanged(level))` (the existing handler dedups + publishes + emits the metric).
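+
+The per-peer `(ok, asOfUtc)` map and the benefit-of-the-doubt rule for `OpcUaProbeOk` could be sketched
+roughly as below. This is a minimal illustration, not the planned implementation: `PeerVerdictTracker`,
+its string peer key, and the `freshWindow` parameter are hypothetical names introduced here, and the
+sketch assumes that with several peers the most recent fresh verdict wins.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+// Hypothetical helper (not in the codebase): tracks what peers last reported about THIS node.
+public sealed class PeerVerdictTracker
+{
+    private readonly TimeSpan _freshWindow;
+    private readonly Dictionary<string, (bool Ok, DateTime AsOfUtc)> _byPeer = new();
+
+    public PeerVerdictTracker(TimeSpan freshWindow) => _freshWindow = freshWindow;
+
+    // Record the latest verdict a given peer reported about the local node.
+    public void Record(string reportingPeer, bool ok, DateTime nowUtc) =>
+        _byPeer[reportingPeer] = (ok, nowUtc);
+
+    // True unless the most recent verdict still inside the freshness window is a failure.
+    public bool OpcUaProbeOk(DateTime nowUtc)
+    {
+        (bool Ok, DateTime AsOfUtc)? latest = null;
+        foreach (var verdict in _byPeer.Values)
+        {
+            if (nowUtc - verdict.AsOfUtc > _freshWindow) continue; // aged out: ignore
+            if (latest is null || verdict.AsOfUtc > latest.Value.AsOfUtc) latest = verdict;
+        }
+        return latest?.Ok ?? true; // no fresh result at all: benefit of the doubt
+    }
+}
+```
+
+With a 15-second window, for example, a `false` recorded at `t0` yields `false` at `t0 + 5 s` and `true`
+again at `t0 + 30 s`, once the failure has aged out. A single-node cluster never records anything, so the
+result stays `true`.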
+
+### Input derivation (the load-bearing logic)
+
+For the **local** node, on each recompute:
+
+| `NodeHealthInputs` field | Source |
+|---|---|
+| `MemberState` | `Cluster.Get(system).SelfMember.Status` (local; most accurate) |
+| `IsDriverRoleLeader` | local entry of the latest `RedundancyStateChanged` snapshot (`IsRoleLeaderForDriver`) |
+| `DbReachable` | latest `DbHealthStatus.Reachable` from the local `DbHealthProbeActor` |
+| `OpcUaProbeOk` | `true` if **no** fresh peer result about me, else the latest such result's `Ok` (only an actively-observed recent failure → `false`); single-node → `true` |
+| `Stale` | `!DbReachable` **OR** `(now − _dbAsOfUtc) > StaleWindow` **OR** `(now − _lastSnapshotAsOfUtc) > StaleWindow` |
+
+**Guards (calculator does not model these):**
+- If there is **no local snapshot entry**, or the local entry's `Role == Detached`, publish **0** directly
+  and skip the calculator (a healthy detached node would otherwise compute 240 because the calculator only
+  checks `MemberState` + the leader bonus). In steady state a node running `OpcUaPublishActor` always carries
+  the `driver` role, so this is defensive — it preserves the current Detached→0 behavior during transitions.
+
+### Resulting ServiceLevel truth table (now fully reachable)
+
+| Node condition | Byte | How reached |
+|---|---|---|
+| Not a healthy member / detached | **0** | `MemberState` not Up/Joining, or Detached guard |
+| DB unreachable, sustained | **100** | `DbReachable=false ⇒ Stale=true` → `(false,_,true)` |
+| DB reachable but signals stale | **200** | `(true,_,true)` |
+| Healthy follower (DB ok + probe ok + fresh) | **240** | `(true,true,false)` |
+| Healthy leader | **250** | follower 240 + `IsDriverRoleLeader` +10 |
+| DB ok + fresh but peer reports me unreachable | **0** | `(true,false,false)` → falls through |
+
+**Behavior change to be aware of (already documented as the target):** a *healthy Secondary moves
+100 → 240*. Both healthy nodes sit at 240/250 with the leader preferred by the +10 bonus; role-leadership is
+stable so the bytes do not flap. The "DB ok + probe-fail" → 0 edge is intentionally sharp but **debounced**
+by the `OpcUaProbeOk` freshness rule (a single missed probe does not flip it; only a sustained, actively
+observed failure does).
+
+## Error handling / edge cases
+
+- **DB Ask timeout** → treat as `DbReachable=false` for that cycle (fail-safe demote); the next tick re-Asks.
+- **No peer (single node)** → `OpcUaProbeOk=true`; supervisor spawns no children.
+- **Peer departs** → its prior result ages out of the freshness window → `OpcUaProbeOk` reverts to `true`
+  (we don't penalize a node for its peer's absence). Supervisor stops the child for the departed member.
+- **Cluster forming / role-leader not yet resolved** → no local snapshot entry yet → Detached guard → 0,
+  same as today, until the first snapshot arrives.
+- **`HealthTick` before the first snapshot** → no local entry → 0 (no spurious 240 before role is known).
+
+## Testing strategy (xUnit + Shouldly, TDD; NO bUnit)
+
+Unit-testable (the wiring that *can* be proven in a TestKit):
+- `OpcUaPublishActor` input-derivation: feed a `RedundancyStateChanged` (Primary-leader) + a cached
+  `DbHealthStatus(reachable)` + an `OpcUaProbeResult(me, ok)` → assert published byte 250; flip DB
+  unreachable → 100; stale snapshot → 200; healthy secondary → 240; Detached entry → 0; probe `Ok=false`
+  about me with DB-ok+fresh → 0. Use `PropsForTests` with a broadcast/serviceLevel capture + injected
+  `dbHealthProbe` test ref + short windows.
+- `OpcUaProbeOk` freshness/debounce: a single stale/absent result → `true`; a recent explicit `false` → `false`.
+- `PeerProbeSupervisor`: on a membership snapshot with one peer driver member, spawns exactly one child
+  targeting that peer; on the peer leaving, stops it; single-node → no children. (TestKit, with a probe
+  factory injected so no real TCP is attempted.)
+- Reuse existing `ServiceLevelCalculatorTests` unchanged (the pure function is not modified).
+
+Not unit-testable (acceptance gate): the live cross-node ServiceLevel on the **2-node rig** — proven by
+`/run` (user-driven). A DB-unreachable primary must be observed dropping below the secondary, and the
+secondary taking over as the higher-ServiceLevel endpoint. See `project_redundancy_state_delivery`.
+
+## Verification
+
+- `dotnet build` clean (production projects are `TreatWarningsAsErrors`); full `dotnet test` green.
+- High-risk review chain (serial spec → code → final integration).
+- **Live `/run` on the 2-node / docker-dev rig** (user-driven; agent does not sign in): confirm
+  steady-state 250/240, then kill a node's DB reachability and confirm its ServiceLevel drops to 100 and the
+  peer is preferred. Update `docs/Redundancy.md` to mark the calculator path **wired** (remove the "not yet
+  wired into the live driver publish path" caveats) and document the freshness/peer-probe sourcing.
+
+## Alternatives considered
+
+- **Central compute in `RedundancyStateActor`** — rejected: the admin singleton can't see each node's local
+  DB health without a new cluster-wide health-broadcast contract; heavier, bigger blast radius, no benefit.
+- **Local self-probe** (TCP to own `localhost:4840` / SDK server-state) instead of the peer-probe — simpler
+  (no membership-tracking supervisor) and a defensible "is my OPC UA serving" signal, but leaves
+  `PeerOpcUaProbeActor` dead and misses the cross-node vantage. Rejected to honor the locked design and close
+  the audit's dead-code gap; recorded here so the trade-off is explicit if the supervisor proves heavier than
+  valued (then it becomes a follow-up swap, not a redesign).
+- **Reshape the calculator to role-baseline-demoted-by-health** (keep the wide 240-vs-100 gap; health only
+  lowers) — rejected: the calculator's 250/240/200/100/0 tiers are already approved, documented, and tested;
+  reshaping re-opens a settled decision. The +10 leader bonus already preserves client preference for the
+  primary.
+
+## Hard constraints (carried from the roadmap)
+
+NO Configuration entity / EF migration. Stage by path — never `git add .`; never stage `sql_login.txt`,
+`src/Server/.../Host/pki/`, `pending.md`, `current.md`, `docker-dev/docker-compose.yml`, `stillpending.md`.
+Never echo or commit secrets. No force-push, no `--no-verify`. Razor/runtime cross-node behavior proven only
+by live `/run`, never bUnit.