docs(redundancy): Phase 2 design — health-aware ServiceLevel (H3)

2026-06-15 12:33:09 -04:00
parent 4bd7180e7f
commit 0528353315
1 changed files with 183 additions and 0 deletions
@@ -0,0 +1,183 @@
+# Phase 2 — Health-aware redundancy ServiceLevel (H3) — design
+
+> **Status:** approved 2026-06-15. Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md`
+> (Phase 2). Backlog item **H3** in `stillpending.md` §1.
+> **Branch:** `feat/stillpending-phase-2-servicelevel` off master `4bd7180e`.
+> **Classification:** high-risk (actor model + redundancy correctness). Live `/run` on the 2-node rig is
+> the acceptance gate; unit tests cannot prove the cross-node wiring (per `project_redundancy_state_delivery`).
+
+## Problem (H3)
+
+The SDK's `Server.ServiceLevel` byte is driven **solely by the coarse role map** today
+(`OpcUaPublishActor.HandleRedundancyStateChanged`: Primary-leader→240, Primary→200, Secondary→100,
+Detached→0). The richer `ServiceLevelCalculator.Compute(NodeHealthInputs)` — which folds in DB
+reachability, an OPC-UA liveness probe, and a staleness signal — **is never invoked in production**.
+The two health producers are half-wired or dead, and the third input has no producer at all:
+
+- `DbHealthProbeActor` — **spawned** per node, but only feeds `/health/ready`; its `DbReachable` never
+  reaches the published byte.
+- `PeerOpcUaProbeActor` — **never spawned anywhere**; its `OpcUaProbeResult` has **zero consumers**.
+- `NodeHealthInputs.Stale` — **no producer exists** anywhere in the codebase.
+
+Consequence: a **DB-unreachable / OPC-UA-down primary still advertises its full role-based ServiceLevel**,
+so redundant clients keep subscribing to a degraded node instead of failing over.
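+
+For orientation, the role-only behavior described above can be modeled roughly as follows. This is an
+illustrative sketch, not the body of `HandleRedundancyStateChanged`: the `Role` enum, the
+`RoleOnlyServiceLevel` class, and the `FromRole` signature are hypothetical, and treating every Secondary as
+100 regardless of leadership is an assumption. What matters is that no health input appears in the signature.
+
+```csharp
+using System;
+
+// Hypothetical stand-in for the role carried in the redundancy snapshot.
+enum Role { Primary, Secondary, Detached }
+
+static class RoleOnlyServiceLevel
+{
+    // Mirrors the coarse map quoted above: Primary-leader 240, Primary 200,
+    // Secondary 100, Detached 0. DB reachability, probe results and staleness
+    // are not parameters, so they cannot influence the byte.
+    public static byte FromRole(Role role, bool isRoleLeader)
+    {
+        switch (role)
+        {
+            case Role.Primary:
+                return isRoleLeader ? (byte)240 : (byte)200;
+            case Role.Secondary:
+                return 100; // assumption: leadership is ignored for a Secondary
+            default:
+                return 0;   // Detached
+        }
+    }
+}
+```
+
+With this shape, `FromRole(Role.Primary, true)` stays at 240 even when the node's DB is unreachable, which
+is exactly the failure mode H3 addresses.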
+
+## Goal
+
+Make each driver node publish a ServiceLevel computed by `ServiceLevelCalculator.Compute` from its **real
+local health**, so an unhealthy node drops below its role-based level (and below a healthy peer's level,
+triggering client failover). Honor the **already-documented** tiers in `docs/Redundancy.md`
+(250 / 240 / 200 / 100 / 0) — do not reshape the calculator's truth table. No EF migration; no change to
+the `RedundancyStateChanged` / `NodeRedundancyState` message contract.
+
+## Locked decisions (from brainstorming 2026-06-15)
+
+1. **Compute site = local per-node** (`OpcUaPublishActor`), *not* the admin `RedundancyStateActor`
+   singleton. DB reachability is an inherently local fact; centralizing it would require a new
+   cluster-wide health-broadcast contract for no benefit. The publish actor already runs per node, already
+   subscribes to the `redundancy-state` topic, already holds the role snapshot, and already writes the byte.
+2. **`Stale` is derived from signal freshness, with DB-unreachable ⟹ stale.** This makes the documented
+   200/100 graceful-degradation tiers reachable, so a node that loses its DB settles into 100 ("degraded,
+   still serving cached values") rather than slamming to 0. The 0 tier stays reserved for "not a healthy
+   cluster member / detached." A richer *driver-input* staleness signal is **deferred** (flagged, not built).
+3. **`OpcUaProbeOk` = peer-probes-me**, reusing the existing `PeerOpcUaProbeActor` (finally spawned),
+   closing the audit's "never spawned / zero consumers" gaps. Absence of a fresh peer result → `true`
+   (benefit of the doubt: don't penalize a node for its peer being down); only an *actively observed*
+   recent failure demotes. Single-node / no-peer cluster → `true`.
+
+## Architecture
+
+All changes are **per-driver-node, inside `Runtime`** (`WithOtOpcUaRuntimeActors`). No ControlPlane /
+admin-singleton change. No new Commons message types.
+
+```
+WithOtOpcUaRuntimeActors (per driver node)
+  ├─ DbHealthProbeActor            (exists; now ALSO consulted by the publish actor)
+  ├─ PeerProbeSupervisor   (NEW)   watches cluster membership →
+  │     └─ PeerOpcUaProbeActor(peer)  one child per OTHER driver member;
+  │            publishes OpcUaProbeResult(peer, ok) on the redundancy-state topic
+  └─ OpcUaPublishActor      (MODIFIED)
+        subscribes redundancy-state topic  → RedundancyStateChanged  (role/leadership)
+                                           → OpcUaProbeResult         (peer's verdict on ME)
+        holds IActorRef dbHealthProbe      → Ask<DbHealthStatus> on a periodic HealthTick
+        Cluster.SelfMember.Status          → MemberState (local)
+        on any trigger:  ServiceLevelCalculator.Compute(inputs) → publish-if-changed (existing dedup)
+```
+
+### Components
+
+**1. `PeerProbeSupervisor` (NEW, `Runtime/Health/PeerProbeSupervisor.cs`)** — a small per-node actor whose
+only job is the probe lifecycle (kept separate so the pinned-dispatcher `OpcUaPublishActor` never parents
+network-probe children):
+- Subscribes to cluster membership (`IMemberEvent` / `ReachabilityEvent`) and/or the redundancy snapshot.
+- Maintains one `PeerOpcUaProbeActor` child per **other driver-role member**, (re)spawning on membership
+  change, stopping children for departed members. Resolves the peer's OPC UA host from its node id
+  (`host:port` → host) and the configured OPC UA port.
+- Children publish `OpcUaProbeResult` to the `redundancy-state` topic (existing behavior).
+- Spawned on the default dispatcher (not the OPC UA pinned one).
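+
+The child-lifecycle rule above reduces to a set difference, sketched here without any Akka.NET types so it
+can be read in isolation. `PeerProbePlanner`, `Reconcile`, and `HostOf` are hypothetical names, and node ids
+are assumed to be plain `host:port` strings as described above; the real supervisor would act on the result
+by spawning or stopping `PeerOpcUaProbeActor` children.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+static class PeerProbePlanner
+{
+    // "host:port" -> "host". Splits on the last colon; an id without a colon is returned unchanged.
+    public static string HostOf(string nodeId)
+    {
+        int i = nodeId.LastIndexOf(':');
+        return i < 0 ? nodeId : nodeId.Substring(0, i);
+    }
+
+    // Compares the current driver-role members with the peers that already have a probe child.
+    // ToSpawn: other driver members without a child. ToStop: children whose member has departed.
+    public static (IReadOnlyList<string> ToSpawn, IReadOnlyList<string> ToStop) Reconcile(
+        string selfNodeId,
+        IEnumerable<string> driverMembers,
+        IEnumerable<string> probedPeers)
+    {
+        var wanted = driverMembers.Where(m => m != selfNodeId).ToHashSet();
+        var have = probedPeers.ToHashSet();
+        return (
+            wanted.Except(have).OrderBy(x => x, StringComparer.Ordinal).ToList(),
+            have.Except(wanted).OrderBy(x => x, StringComparer.Ordinal).ToList());
+    }
+}
+```
+
+For example, `Reconcile("node-a:4053", new[] { "node-a:4053", "node-b:4053" }, Array.Empty<string>())`
+yields one peer to spawn (`node-b:4053`) and none to stop, and `HostOf("node-b:4053")` returns `node-b`
+(the port `4053` is made up for the example). A single-node cluster yields two empty lists, matching the
+"no children" case.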
+
+**2. `OpcUaPublishActor` (MODIFIED)** — replace the role-only switch in `HandleRedundancyStateChanged`
+with a `RecomputeServiceLevel()` that calls the calculator. New Props params (all optional, defaulted so
+test Props/harnesses keep working): `IActorRef? dbHealthProbe`, the freshness windows, and the
+self-OPC-UA-port (the latter two injectable for tests).
+- New state: `_dbReachable` + `_dbAsOfUtc` (from the DB Ask), `_lastSnapshotAsOfUtc`, a per-peer map of
+  `(ok, asOfUtc)` for results **about my own node**, and the current `RedundancyStateChanged` local entry.
+- New `Receive<OpcUaProbeResult>`: if `msg.NodeId == _localNode`, record `(msg.Ok, now)` for the reporting
+  peer; recompute.
+- New `Receive<HealthTick>` (periodic, e.g. 5 s): `dbHealthProbe.Ask<DbHealthStatus>(GetStatus, 1s).PipeTo(Self)`.
+- New `Receive<DbHealthStatus>`: cache `_dbReachable` + `_dbAsOfUtc`; recompute.
+- `RecomputeServiceLevel()` builds `NodeHealthInputs` and calls `ServiceLevelCalculator.Compute`, then
+  `Self.Tell(new ServiceLevelChanged(level))` (the existing handler dedups + publishes + emits the metric).
+
+### Input derivation (the load-bearing logic)
+
+For the **local** node, on each recompute:
+
+| `NodeHealthInputs` field | Source |
+|---|---|
+| `MemberState` | `Cluster.Get(system).SelfMember.Status` (local; most accurate) |
+| `IsDriverRoleLeader` | local entry of the latest `RedundancyStateChanged` snapshot (`IsRoleLeaderForDriver`) |
+| `DbReachable` | latest `DbHealthStatus.Reachable` from the local `DbHealthProbeActor` |
+| `OpcUaProbeOk` | `true` if **no** fresh peer result about me, else the latest such result's `Ok` (only an actively-observed recent failure → `false`); single-node → `true` |
+| `Stale` | `!DbReachable` **OR** `(now − _dbAsOfUtc) > StaleWindow` **OR** `(now − _lastSnapshotAsOfUtc) > StaleWindow` |
+
+**Guards (calculator does not model these):**
+- If there is **no local snapshot entry**, or the local entry's `Role == Detached`, publish **0** directly
+  and skip the calculator. The calculator has no notion of `Role`: beyond the health inputs it looks only at
+  `MemberState` and the leader bonus, so a healthy detached node would otherwise compute 240. In steady state
+  a node running `OpcUaPublishActor` always carries the `driver` role, so this guard is defensive: it
+  preserves the current Detached→0 behavior during transitions.
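+
+The derivation rules and the guard above can be sketched as pure functions. Everything here is a stand-in
+under stated assumptions: `PeerVerdict`, `HealthFlags`, `Role`, and `InputDerivation` are hypothetical names
+(the real `NodeHealthInputs` may be shaped differently), and "newest fresh verdict wins" is an assumption for
+the case where more than one peer reports.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+// Hypothetical stand-ins; the real types live in the codebase.
+enum Role { Primary, Secondary, Detached }
+record PeerVerdict(bool Ok, DateTime AsOfUtc);
+record HealthFlags(bool DbReachable, bool OpcUaProbeOk, bool Stale);
+
+static class InputDerivation
+{
+    // Guard: no local snapshot entry (null) or a Detached entry publishes 0 without the calculator.
+    public static bool PublishZeroDirectly(Role? localRole)
+        => localRole is null || localRole == Role.Detached;
+
+    // Benefit of the doubt: with no fresh verdict about this node, the probe counts as OK.
+    // Otherwise the newest fresh verdict decides (assumption when several peers report).
+    public static bool ProbeOk(IEnumerable<PeerVerdict> aboutMe, DateTime nowUtc, TimeSpan freshWindow)
+    {
+        var newestFresh = aboutMe
+            .Where(v => nowUtc - v.AsOfUtc <= freshWindow)
+            .OrderByDescending(v => v.AsOfUtc)
+            .FirstOrDefault();
+        return newestFresh is null || newestFresh.Ok;
+    }
+
+    // Stale = DB unreachable OR DB status too old OR role snapshot too old.
+    public static bool IsStale(bool dbReachable, DateTime dbAsOfUtc, DateTime snapshotAsOfUtc,
+                               DateTime nowUtc, TimeSpan staleWindow)
+        => !dbReachable
+           || nowUtc - dbAsOfUtc > staleWindow
+           || nowUtc - snapshotAsOfUtc > staleWindow;
+
+    // Combines the three rules into the health portion of the calculator inputs.
+    public static HealthFlags Derive(bool dbReachable, DateTime dbAsOfUtc, DateTime snapshotAsOfUtc,
+                                     IEnumerable<PeerVerdict> aboutMe, DateTime nowUtc,
+                                     TimeSpan staleWindow, TimeSpan probeWindow)
+        => new HealthFlags(
+            dbReachable,
+            ProbeOk(aboutMe, nowUtc, probeWindow),
+            IsStale(dbReachable, dbAsOfUtc, snapshotAsOfUtc, nowUtc, staleWindow));
+}
+```
+
+Passing `nowUtc` in rather than reading the clock keeps the freshness windows testable with short,
+deterministic values. An empty verdict list, or one whose only entry is older than the window, gives
+`OpcUaProbeOk = true`; a `false` verdict inside the window gives `false`.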
+
+### Resulting ServiceLevel truth table (now fully reachable)
+
+| Node condition | Byte | How reached |
+|---|---|---|
+| Not a healthy member / detached | **0** | `MemberState` not Up/Joining, or Detached guard |
+| DB unreachable, sustained | **100** | `DbReachable=false ⇒ Stale=true` → `(false,_,true)` |
+| DB reachable but signals stale | **200** | `(true,_,true)` |
+| Healthy follower (DB ok + probe ok + fresh) | **240** | `(true,true,false)` |
+| Healthy leader | **250** | follower 240 + `IsDriverRoleLeader` +10 |
+| DB ok + fresh but peer reports me unreachable | **0** | `(true,false,false)` → falls through |
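+
+For test fixtures it can help to have the table above as a lookup. The sketch below is a hypothetical
+stand-in that reproduces the six rows; it is not `ServiceLevelCalculator.Compute` and makes no claim about
+how that function is written. Two points are assumptions: the leader bonus is applied only at the healthy
+tier (the table shows 250 only there), and the `(false, _, false)` combination, which the derivation rules
+make unreachable, is mapped to 0. `MemberState` is a simplified enum invented for the example.
+
+```csharp
+using System;
+
+// Simplified, hypothetical membership states for the example.
+enum MemberState { Joining, Up, Leaving, Down }
+
+static class ServiceLevelTable
+{
+    public static byte Lookup(MemberState member, bool isDriverRoleLeader,
+                              bool dbReachable, bool opcUaProbeOk, bool stale)
+    {
+        // Row 1: not a healthy cluster member.
+        if (member != MemberState.Up && member != MemberState.Joining) return 0;
+
+        // Rows 2 and 3: stale signals; the probe result is ignored in both.
+        if (stale) return dbReachable ? (byte)200 : (byte)100;
+
+        // Row 6: fresh, DB ok, but a peer reports this node unreachable.
+        // Also covers (false, _, false), which is an assumption (not in the table).
+        if (!dbReachable || !opcUaProbeOk) return 0;
+
+        // Rows 4 and 5: healthy follower, or healthy leader with the +10 bonus.
+        return isDriverRoleLeader ? (byte)250 : (byte)240;
+    }
+}
+```
+
+The Detached guard is deliberately absent here, since the guard runs before the calculator is consulted.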
+
+**Behavior change to be aware of (already documented as the target):** a *healthy Secondary moves
+100 → 240*. Both healthy nodes sit at 240/250 with the leader preferred by the +10 bonus; role-leadership is
+stable so the bytes do not flap. The "DB ok + probe-fail" → 0 edge is intentionally sharp but **debounced**
+by the `OpcUaProbeOk` freshness rule (a single missed probe does not flip it; only a sustained, actively
+observed failure does).
+
+## Error handling / edge cases
+
+- **DB Ask timeout** → treat as `DbReachable=false` for that cycle (fail-safe demote); the next tick re-Asks.
+- **No peer (single node)** → `OpcUaProbeOk=true`; supervisor spawns no children.
+- **Peer departs** → its prior result ages out of the freshness window → `OpcUaProbeOk` reverts to `true`
+  (we don't penalize a node for its peer's absence). Supervisor stops the child for the departed member.
+- **Cluster forming / role-leader not yet resolved** → no local snapshot entry yet → Detached guard → 0,
+  same as today, until the first snapshot arrives.
+- **`HealthTick` before the first snapshot** → no local entry → 0 (no spurious 240 before role is known).
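+
+The fail-safe rule for the DB Ask timeout can be illustrated with plain tasks. This is a sketch under
+assumptions, not the actor code: it uses `Task.WaitAsync` (available from .NET 6) in place of the Akka.NET
+`Ask` / `PipeTo` pair, and `FailSafeDbCheck` / `ReachableOrDemoteAsync` are hypothetical names. In the actor
+the same mapping would be applied to whatever failure the timed-out Ask produces.
+
+```csharp
+using System;
+using System.Threading.Tasks;
+
+static class FailSafeDbCheck
+{
+    // Awaits the pending reachability answer. If it does not arrive within the
+    // timeout, the node is treated as DB-unreachable for this cycle.
+    public static async Task<bool> ReachableOrDemoteAsync(Task<bool> pendingAnswer, TimeSpan timeout)
+    {
+        try
+        {
+            return await pendingAnswer.WaitAsync(timeout);
+        }
+        catch (TimeoutException)
+        {
+            return false; // fail-safe demote; the next tick asks again
+        }
+    }
+}
+```
+
+A task that never completes resolves to `false` after the timeout, while an answer that arrives in time is
+passed through unchanged, so a slow probe demotes the node only until the next successful tick.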
+
+## Testing strategy (xUnit + Shouldly, TDD; NO bUnit)
+
+Unit-testable (the wiring that *can* be proven in a TestKit):
+- `OpcUaPublishActor` input-derivation: feed a `RedundancyStateChanged` (Primary-leader) + a cached
+  `DbHealthStatus(reachable)` + an `OpcUaProbeResult(me, ok)` → assert published byte 250; flip DB
+  unreachable → 100; stale snapshot → 200; healthy secondary → 240; Detached entry → 0; probe `Ok=false`
+  about me with DB-ok+fresh → 0. Use `PropsForTests` with a broadcast/serviceLevel capture + injected
+  `dbHealthProbe` test ref + short windows.
+- `OpcUaProbeOk` freshness/debounce: a single stale/absent result → `true`; a recent explicit `false` → `false`.
+- `PeerProbeSupervisor`: on a membership snapshot with one peer driver member, spawns exactly one child
+  targeting that peer; on the peer leaving, stops it; single-node → no children. (TestKit, with a probe
+  factory injected so no real TCP is attempted.)
+- Reuse existing `ServiceLevelCalculatorTests` unchanged (the pure function is not modified).
+
+Not unit-testable (acceptance gate): the live cross-node ServiceLevel on the **2-node rig** — proven by
+`/run` (user-driven). A DB-unreachable primary must be observed dropping below the secondary, and the
+secondary taking over as the higher-ServiceLevel endpoint. See `project_redundancy_state_delivery`.
+
+## Verification
+
+- `dotnet build` is clean (production projects set `TreatWarningsAsErrors`); the full `dotnet test` run is green.
+- High-risk review chain (serial spec → code → final integration).
+- **Live `/run` on the 2-node / docker-dev rig** (user-driven; agent does not sign in): confirm
+  steady-state 250/240, then kill a node's DB reachability and confirm its ServiceLevel drops to 100 and the
+  peer is preferred. Update `docs/Redundancy.md` to mark the calculator path **wired** (remove the "not yet
+  wired into the live driver publish path" caveats) and document the freshness/peer-probe sourcing.
+
+## Alternatives considered
+
+- **Central compute in `RedundancyStateActor`** — rejected: the admin singleton can't see each node's local
+  DB health without a new cluster-wide health-broadcast contract; heavier, bigger blast radius, no benefit.
+- **Local self-probe** (TCP to own `localhost:4840` / SDK server-state) instead of the peer-probe — simpler
+  (no membership-tracking supervisor) and a defensible "is my OPC UA serving" signal, but leaves
+  `PeerOpcUaProbeActor` dead and misses the cross-node vantage. Rejected to honor the locked design and close
+  the audit's dead-code gap; recorded here so the trade-off is explicit if the supervisor proves heavier than
+  it is worth (in which case it becomes a follow-up swap, not a redesign).
+- **Reshape the calculator to role-baseline-demoted-by-health** (keep the wide 240-vs-100 gap; health only
+  lowers) — rejected: the calculator's 250/240/200/100/0 tiers are already approved, documented, and tested;
+  reshaping re-opens a settled decision. The +10 leader bonus already preserves client preference for the
+  primary.
+
+## Hard constraints (carried from the roadmap)
+
+NO Configuration entity / EF migration. Stage by path — never `git add .`; never stage `sql_login.txt`,
+`src/Server/.../Host/pki/`, `pending.md`, `current.md`, `docker-dev/docker-compose.yml`, `stillpending.md`.
+Never echo or commit secrets. No force-push, no `--no-verify`. Razor/runtime cross-node behavior is proven only
+by live `/run`, never by bUnit.