# Phase 2: Health-aware redundancy ServiceLevel (H3) design

> **Status:** approved 2026-06-15. Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md`
> (Phase 2). Backlog item **H3** in `stillpending.md` §1.
> **Branch:** `feat/stillpending-phase-2-servicelevel` off master `4bd7180e`.
> **Classification:** high-risk (actor model + redundancy correctness). Live `/run` on the 2-node rig is
> the acceptance gate; unit tests cannot prove the cross-node wiring (per `project_redundancy_state_delivery`).

## Problem (H3)

The SDK's `Server.ServiceLevel` byte is driven **solely by the coarse role map** today
(`OpcUaPublishActor.HandleRedundancyStateChanged`: Primary-leader→240, Primary→200, Secondary→100,
Detached→0). The richer `ServiceLevelCalculator.Compute(NodeHealthInputs)`, which folds in DB reachability,
an OPC UA liveness probe, and a staleness signal, **is never invoked in production**. The two health
producers are dead or half-wired, and the third input has no producer at all:

- `DbHealthProbeActor`: **spawned** per node, but it only feeds `/health/ready`; its `DbReachable` never
  reaches the published byte.
- `PeerOpcUaProbeActor`: **never spawned anywhere**; its `OpcUaProbeResult` has **zero consumers**.
- `NodeHealthInputs.Stale`: **no producer exists** anywhere in the codebase.

Consequence: a **DB-unreachable / OPC-UA-down primary still advertises its full role-based ServiceLevel**,
so redundant clients keep subscribing to a degraded node instead of failing over.

## Goal

Make each driver node publish a ServiceLevel computed by `ServiceLevelCalculator.Compute` from its
**real local health**, so an unhealthy node drops below its role-based level (and below a healthy peer's
level, triggering client failover). Honor the **already-documented** tiers in `docs/Redundancy.md`
(250 / 240 / 200 / 100 / 0); do not reshape the calculator's truth table. No EF migration; no change to
the `RedundancyStateChanged` / `NodeRedundancyState` message contract.

## Locked decisions (from brainstorming 2026-06-15)

1. **Compute site = local per-node** (`OpcUaPublishActor`), *not* the admin `RedundancyStateActor` singleton.
   DB reachability is an inherently local fact; centralizing it would require a new cluster-wide
   health-broadcast contract for no benefit. The publish actor already runs per node, already subscribes
   to the `redundancy-state` topic, already holds the role snapshot, and already writes the byte.
2. **`Stale` is derived from signal freshness, with DB-unreachable ⟹ stale.** This makes the documented
   200/100 graceful-degradation tiers reachable, so a node that loses its DB settles into 100 ("degraded,
   still serving cached values") rather than slamming to 0. The 0 tier stays reserved for "not a healthy
   cluster member / detached." A richer *driver-input* staleness signal is **deferred** (flagged, not built).
3. **`OpcUaProbeOk` = peer-probes-me**, reusing the existing `PeerOpcUaProbeActor` (finally spawned),
   closing the audit's "never spawned / zero consumers" gaps. Absence of a fresh peer result → `true`
   (benefit of the doubt: don't penalize a node for its peer being down); only an *actively observed*
   recent failure demotes. Single-node / no-peer cluster → `true`.

## Architecture

All changes are **per-driver-node, inside `Runtime`** (`WithOtOpcUaRuntimeActors`). No ControlPlane /
admin-singleton change. No new Commons message types.

```
WithOtOpcUaRuntimeActors (per driver node)
 ├─ DbHealthProbeActor             (exists; now ALSO consulted by the publish actor)
 ├─ PeerProbeSupervisor   (NEW)    watches cluster membership →
 │    └─ PeerOpcUaProbeActor(peer)   one child per OTHER driver member;
 │                                   publishes OpcUaProbeResult(peer, ok) on the redundancy-state topic
 └─ OpcUaPublishActor     (MODIFIED)
        subscribes redundancy-state topic → RedundancyStateChanged (role/leadership)
                                          → OpcUaProbeResult (peer's verdict on ME)
        holds IActorRef dbHealthProbe     → Ask on a periodic HealthTick
        Cluster.SelfMember.Status         → MemberState (local)
        on any trigger: ServiceLevelCalculator.Compute(inputs) → publish-if-changed (existing dedup)
```

### Components

**1. `PeerProbeSupervisor` (NEW, `Runtime/Health/PeerProbeSupervisor.cs`)**: a small per-node actor whose
only job is the probe lifecycle (kept separate so the pinned-dispatcher `OpcUaPublishActor` never parents
network-probe children):

- Subscribes to cluster membership (`IMemberEvent` / `ReachabilityEvent`) and/or the redundancy snapshot.
- Maintains one `PeerOpcUaProbeActor` child per **other driver-role member**, (re)spawning on membership
  change, stopping children for departed members. Resolves the peer's OPC UA host from its node id
  (`host:port` → host) and the configured OPC UA port.
- Children publish `OpcUaProbeResult` to the `redundancy-state` topic (existing behavior).
- Spawned on the default dispatcher (not the OPC UA pinned one).

**2. `OpcUaPublishActor` (MODIFIED)**: replace the role-only switch in `HandleRedundancyStateChanged` with
a `RecomputeServiceLevel()` that calls the calculator. New Props params (both optional, defaulted so test
Props/harnesses keep working): `IActorRef? dbHealthProbe`, and the freshness windows / self-OPC-UA-port
(injectable for tests).

- New state: `_dbReachable` + `_dbAsOfUtc` (from the DB Ask), `_lastSnapshotAsOfUtc`, a per-peer map of
  `(ok, asOfUtc)` for results **about my own node**, and the current `RedundancyStateChanged` local entry.
- New `Receive<OpcUaProbeResult>`: if `msg.NodeId == _localNode`, record `(msg.Ok, now)` for the reporting
  peer; recompute.
- New `Receive<HealthTick>` (periodic, e.g. 5 s): `dbHealthProbe.Ask(GetStatus, 1s).PipeTo(Self)`.
- New `Receive<DbHealthStatus>`: cache `_dbReachable` + `_dbAsOfUtc`; recompute.
- `RecomputeServiceLevel()` builds `NodeHealthInputs` and calls `ServiceLevelCalculator.Compute`, then
  `Self.Tell(new ServiceLevelChanged(level))` (the existing handler dedups + publishes + emits the metric).

### Input derivation (the load-bearing logic)

For the **local** node, on each recompute:

| `NodeHealthInputs` field | Source |
|---|---|
| `MemberState` | `Cluster.Get(system).SelfMember.Status` (local; most accurate) |
| `IsDriverRoleLeader` | local entry of the latest `RedundancyStateChanged` snapshot (`IsRoleLeaderForDriver`) |
| `DbReachable` | latest `DbHealthStatus.Reachable` from the local `DbHealthProbeActor` |
| `OpcUaProbeOk` | `true` if **no** fresh peer result about me, else the latest such result's `Ok` (only an actively-observed recent failure → `false`); single-node → `true` |
| `Stale` | `!DbReachable` **OR** `(now − _dbAsOfUtc) > StaleWindow` **OR** `(now − _lastSnapshotAsOfUtc) > StaleWindow` |

**Guards (calculator does not model these):**

- If there is **no local snapshot entry**, or the local entry's `Role == Detached`, publish **0** directly
  and skip the calculator (a healthy detached node would otherwise compute 240 because the calculator only
  checks `MemberState` + the leader bonus). In steady state a node running `OpcUaPublishActor` always
  carries the `driver` role, so this guard is defensive; it preserves the current Detached→0 behavior
  during transitions.

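
The derivation table and the Detached guard above can be sketched as one pure function. This is a minimal illustration, not the production `RecomputeServiceLevel()`: the record and enum shapes, the `InputDerivation` class, and the separate `probeFreshWindow` parameter are hypothetical stand-ins (the design only names `StaleWindow` and "freshness windows").

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the real types; shapes are assumptions for illustration.
public enum MemberState { Joining, Up, Leaving, Down }
public enum Role { Primary, Secondary, Detached }

public sealed record NodeHealthInputs(
    MemberState MemberState, bool IsDriverRoleLeader,
    bool DbReachable, bool OpcUaProbeOk, bool Stale);

public sealed record LocalEntry(Role Role, bool IsRoleLeaderForDriver);
public sealed record PeerVerdict(bool Ok, DateTime AsOfUtc);

public static class InputDerivation
{
    // Returns null when the guard applies: the caller publishes 0 and skips the calculator.
    public static NodeHealthInputs? Derive(
        DateTime nowUtc,
        MemberState selfStatus,
        LocalEntry? localEntry,
        bool dbReachable,
        DateTime dbAsOfUtc,
        DateTime lastSnapshotAsOfUtc,
        IReadOnlyDictionary<string, PeerVerdict> verdictsAboutMe,
        TimeSpan staleWindow,
        TimeSpan probeFreshWindow)
    {
        // Guard: no local snapshot entry yet, or the entry says Detached.
        if (localEntry is null || localEntry.Role == Role.Detached)
            return null;

        // OpcUaProbeOk: take the latest result that is still fresh.
        // No fresh result (single node, peer gone, aged out) means benefit of the doubt.
        PeerVerdict? latestFresh = null;
        foreach (var verdict in verdictsAboutMe.Values)
        {
            if (nowUtc - verdict.AsOfUtc > probeFreshWindow) continue;
            if (latestFresh is null || verdict.AsOfUtc > latestFresh.AsOfUtc)
                latestFresh = verdict;
        }
        bool opcUaProbeOk = latestFresh?.Ok ?? true;

        // Stale: DB unreachable implies stale; otherwise either signal ageing out does.
        bool stale = !dbReachable
            || nowUtc - dbAsOfUtc > staleWindow
            || nowUtc - lastSnapshotAsOfUtc > staleWindow;

        return new NodeHealthInputs(
            selfStatus, localEntry.IsRoleLeaderForDriver, dbReachable, opcUaProbeOk, stale);
    }
}
```

Passing `nowUtc` in, rather than reading the clock inside, is what lets a unit test drive the freshness windows deterministically with short values.
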
### Resulting ServiceLevel truth table (now fully reachable)

| Node condition | Byte | How reached |
|---|---|---|
| Not a healthy member / detached | **0** | `MemberState` not Up/Joining, or Detached guard |
| DB unreachable, sustained | **100** | `DbReachable=false ⇒ Stale=true` → `(false,_,true)` |
| DB reachable but signals stale | **200** | `(true,_,true)` |
| Healthy follower (DB ok + probe ok + fresh) | **240** | `(true,true,false)` |
| Healthy leader | **250** | follower 240 + `IsDriverRoleLeader` +10 |
| DB ok + fresh but peer reports me unreachable | **0** | `(true,false,false)` → falls through |

**Behavior change to be aware of (already documented as the target):** a *healthy Secondary moves
100 → 240*. Both healthy nodes sit at 240/250 with the leader preferred by the +10 bonus; role-leadership
is stable so the bytes do not flap. The "DB ok + probe-fail" → 0 edge is intentionally sharp but
**debounced** by the `OpcUaProbeOk` freshness rule (a single missed probe does not flip it; only a
sustained, actively observed failure does).

## Error handling / edge cases

- **DB Ask timeout** → treat as `DbReachable=false` for that cycle (fail-safe demote); the next tick re-Asks.
- **No peer (single node)** → `OpcUaProbeOk=true`; supervisor spawns no children.
- **Peer departs** → its prior result ages out of the freshness window → `OpcUaProbeOk` reverts to `true`
  (we don't penalize a node for its peer's absence). Supervisor stops the child for the departed member.
- **Cluster forming / role-leader not yet resolved** → no local snapshot entry yet → Detached guard → 0,
  same as today, until the first snapshot arrives.
- **`HealthTick` before the first snapshot** → no local entry → 0 (no spurious 240 before role is known).

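
The truth table above can be cross-checked with a small pure function. The sketch below is a model reconstructed only from that table, not the real `ServiceLevelCalculator.Compute`. The type shapes and the `ServiceLevelModel` name are hypothetical, and two points the table leaves open are filled with assumptions: the +10 leader bonus is applied only at the healthy 240 tier, and the unlisted combination `DbReachable=false` with `Stale=false` (which the derivation rules never produce, since DB-unreachable implies stale) falls through to 0.

```csharp
using System;

// Hypothetical stand-ins repeated here so the block stands alone.
public enum MemberState { Joining, Up, Leaving, Down }

public sealed record NodeHealthInputs(
    MemberState MemberState, bool IsDriverRoleLeader,
    bool DbReachable, bool OpcUaProbeOk, bool Stale);

public static class ServiceLevelModel
{
    // Model of the documented tiers 250 / 240 / 200 / 100 / 0.
    // The Detached guard is handled by the caller before this is reached.
    public static byte Compute(NodeHealthInputs inputs)
    {
        // Not a healthy cluster member: 0.
        if (inputs.MemberState is not (MemberState.Up or MemberState.Joining))
            return 0;

        // Stale signals: 200 with a reachable DB, 100 without one.
        if (inputs.Stale)
            return inputs.DbReachable ? (byte)200 : (byte)100;

        // Fresh, DB ok, probe ok: healthy tier, leader gets the +10 bonus (assumed 240-tier only).
        if (inputs.DbReachable && inputs.OpcUaProbeOk)
            return inputs.IsDriverRoleLeader ? (byte)250 : (byte)240;

        // Fresh and DB ok but a peer actively reports this node unreachable: falls through.
        return 0;
    }
}
```

Under this model the DB Ask timeout case from the list above (treated as `DbReachable=false`, hence stale) lands on 100 rather than 0, which is the graceful-degradation behavior the locked decisions call for.
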
## Testing strategy (xUnit + Shouldly, TDD; NO bUnit)

Unit-testable (the wiring that *can* be proven in a TestKit):

- `OpcUaPublishActor` input-derivation: feed a `RedundancyStateChanged` (Primary-leader) + a cached
  `DbHealthStatus(reachable)` + an `OpcUaProbeResult(me, ok)` → assert published byte 250; flip DB
  unreachable → 100; stale snapshot → 200; healthy secondary → 240; Detached entry → 0; probe `Ok=false`
  about me with DB-ok+fresh → 0. Use `PropsForTests` with a broadcast/serviceLevel capture + injected
  `dbHealthProbe` test ref + short windows.
- `OpcUaProbeOk` freshness/debounce: a single stale/absent result → `true`; a recent explicit `false` →
  `false`.
- `PeerProbeSupervisor`: on a membership snapshot with one peer driver member, spawns exactly one child
  targeting that peer; on the peer leaving, stops it; single-node → no children. (TestKit, with a probe
  factory injected so no real TCP is attempted.)
- Reuse existing `ServiceLevelCalculatorTests` unchanged (the pure function is not modified).

Not unit-testable (acceptance gate): the live cross-node ServiceLevel on the **2-node rig**, proven by
`/run` (user-driven). A DB-unreachable primary must be observed dropping below the secondary, and the
secondary taking over as the higher-ServiceLevel endpoint. See `project_redundancy_state_delivery`.

## Verification

- `dotnet build` clean (production projects are `TreatWarningsAsErrors`); full `dotnet test` green.
- High-risk review chain (serial spec → code → final integration).
- **Live `/run` on the 2-node / docker-dev rig** (user-driven; agent does not sign in): confirm
  steady-state 250/240, then kill a node's DB reachability and confirm its ServiceLevel drops to 100 and
  the peer is preferred.

Update `docs/Redundancy.md` to mark the calculator path **wired** (remove the "not yet wired into the
live driver publish path" caveats) and document the freshness/peer-probe sourcing.

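
The `PeerProbeSupervisor` cases in the testing strategy above (one peer → one child, peer leaves → child stopped, single node → no children) come down to a set reconciliation that can be kept outside the actor. The sketch below is one way to factor it, assuming node ids are `host:port` strings as described under Components; `PeerProbeReconciler`, `ProbePlan`, and `HostOf` are hypothetical names, not existing code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical result type: which peers need a new probe child, which children must stop.
public sealed record ProbePlan(IReadOnlyList<string> ToSpawn, IReadOnlyList<string> ToStop);

public static class PeerProbeReconciler
{
    // Compares the current driver-role members (self excluded) with the running children.
    public static ProbePlan Reconcile(
        string selfNodeId,
        IEnumerable<string> driverMembers,
        IEnumerable<string> currentChildren)
    {
        var wanted = driverMembers.Where(m => m != selfNodeId).ToHashSet();
        var running = currentChildren.ToHashSet();

        // Sorted only to make the plan deterministic for assertions.
        var toSpawn = wanted.Except(running).OrderBy(x => x, StringComparer.Ordinal).ToList();
        var toStop = running.Except(wanted).OrderBy(x => x, StringComparer.Ordinal).ToList();
        return new ProbePlan(toSpawn, toStop);
    }

    // host:port → host; an id without a port is returned unchanged.
    public static string HostOf(string nodeId)
    {
        int idx = nodeId.LastIndexOf(':');
        return idx < 0 ? nodeId : nodeId[..idx];
    }
}
```

With the decision isolated like this, the TestKit test only has to confirm that the actor applies the plan through the injected probe factory; the membership arithmetic itself needs no actor system.
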
## Alternatives considered

- **Central compute in `RedundancyStateActor`**: rejected. The admin singleton can't see each node's local
  DB health without a new cluster-wide health-broadcast contract; heavier, bigger blast radius, no benefit.
- **Local self-probe** (TCP to own `localhost:4840` / SDK server-state) instead of the peer-probe: simpler
  (no membership-tracking supervisor) and a defensible "is my OPC UA serving" signal, but leaves
  `PeerOpcUaProbeActor` dead and misses the cross-node vantage. Rejected to honor the locked design and
  close the audit's dead-code gap; recorded here so the trade-off is explicit if the supervisor proves
  heavier than valued (then it becomes a follow-up swap, not a redesign).
- **Reshape the calculator to role-baseline-demoted-by-health** (keep the wide 240-vs-100 gap; health only
  lowers): rejected. The calculator's 250/240/200/100/0 tiers are already approved, documented, and
  tested; reshaping re-opens a settled decision. The +10 leader bonus already preserves client preference
  for the primary.

## Hard constraints (carried from the roadmap)

NO Configuration entity / EF migration. Stage by path (never `git add .`); never stage `sql_login.txt`,
`src/Server/.../Host/pki/`, `pending.md`, `current.md`, `docker-dev/docker-compose.yml`, `stillpending.md`.
Never echo or commit secrets. No force-push, no `--no-verify`. Razor/runtime cross-node behavior proven
only by live `/run`, never bUnit.