docs(redundancy): Phase 2 design — health-aware ServiceLevel (H3)

Joseph Doherty
2026-06-15 12:33:09 -04:00
parent 4bd7180e7f
commit 0528353315
@@ -0,0 +1,183 @@
# Phase 2 — Health-aware redundancy ServiceLevel (H3) — design
> **Status:** approved 2026-06-15. Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md`
> (Phase 2). Backlog item **H3** in `stillpending.md` §1.
> **Branch:** `feat/stillpending-phase-2-servicelevel` off master `4bd7180e`.
> **Classification:** high-risk (actor model + redundancy correctness). Live `/run` on the 2-node rig is
> the acceptance gate; unit tests cannot prove the cross-node wiring (per `project_redundancy_state_delivery`).
## Problem (H3)
The SDK's `Server.ServiceLevel` byte is driven **solely by the coarse role map** today
(`OpcUaPublishActor.HandleRedundancyStateChanged`: Primary-leader→240, Primary→200, Secondary→100,
Detached→0). The richer `ServiceLevelCalculator.Compute(NodeHealthInputs)` — which folds in DB
reachability, an OPC-UA liveness probe, and a staleness signal — **is never invoked in production**.
The two health producers are dead or half-wired:
- `DbHealthProbeActor`: **spawned** per node, but only feeds `/health/ready`; its `DbReachable` never
reaches the published byte.
- `PeerOpcUaProbeActor`: **never spawned anywhere**; its `OpcUaProbeResult` has **zero consumers**.
- `NodeHealthInputs.Stale`: **no producer exists** anywhere in the codebase.
Consequence: a **DB-unreachable / OPC-UA-down primary still advertises its full role-based ServiceLevel**,
so redundant clients keep subscribing to a degraded node instead of failing over.
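
To make the gap concrete, the role map quoted above can be sketched as a pure function. This is an illustration only: `RedundancyRole`, `RoleOnlyServiceLevel`, and the `isRoleLeader` parameter are hypothetical names, not the actual shape of `OpcUaPublishActor.HandleRedundancyStateChanged`.

```csharp
using System;

// Hypothetical stand-in for the role carried in the redundancy snapshot.
public enum RedundancyRole { Detached, Secondary, Primary }

public static class RoleOnlyServiceLevel
{
    // Mirrors the role map described above. Note what is missing from the
    // signature: DB reachability, the OPC UA probe, and staleness are not inputs.
    public static byte FromRole(RedundancyRole role, bool isRoleLeader) => role switch
    {
        RedundancyRole.Primary when isRoleLeader => (byte)240,
        RedundancyRole.Primary => (byte)200,
        RedundancyRole.Secondary => (byte)100,
        _ => (byte)0, // Detached
    };
}
```

Because no health signal appears in the signature, a Primary leader with an unreachable DB still maps to 240 in this model.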
## Goal
Make each driver node publish a ServiceLevel computed by `ServiceLevelCalculator.Compute` from its **real
local health**, so an unhealthy node drops below its role-based level (and below a healthy peer's level,
triggering client failover). Honor the **already-documented** tiers in `docs/Redundancy.md`
(250 / 240 / 200 / 100 / 0) — do not reshape the calculator's truth table. No EF migration; no change to
the `RedundancyStateChanged` / `NodeRedundancyState` message contract.
## Locked decisions (from brainstorming 2026-06-15)
1. **Compute site = local per-node** (`OpcUaPublishActor`), *not* the admin `RedundancyStateActor`
singleton. DB reachability is an inherently local fact; centralizing it would require a new
cluster-wide health-broadcast contract for no benefit. The publish actor already runs per node, already
subscribes to the `redundancy-state` topic, already holds the role snapshot, and already writes the byte.
2. **`Stale` is derived from signal freshness, with DB-unreachable ⟹ stale.** This makes the documented
200/100 graceful-degradation tiers reachable, so a node that loses its DB settles into 100 ("degraded,
still serving cached values") rather than slamming to 0. The 0 tier stays reserved for "not a healthy
cluster member / detached." A richer *driver-input* staleness signal is **deferred** (flagged, not built).
3. **`OpcUaProbeOk` = peer-probes-me**, reusing the existing `PeerOpcUaProbeActor` (finally spawned),
closing the audit's "never spawned / zero consumers" gaps. Absence of a fresh peer result → `true`
(benefit of the doubt: don't penalize a node for its peer being down); only an *actively observed*
recent failure demotes. Single-node / no-peer cluster → `true`.
## Architecture
All changes are **per-driver-node, inside `Runtime`** (`WithOtOpcUaRuntimeActors`). No ControlPlane /
admin-singleton change. No new Commons message types.
```
WithOtOpcUaRuntimeActors (per driver node)
 ├─ DbHealthProbeActor             (exists; now ALSO consulted by the publish actor)
 ├─ PeerProbeSupervisor (NEW)      watches cluster membership →
 │    └─ PeerOpcUaProbeActor(peer)    one child per OTHER driver member;
 │                                    publishes OpcUaProbeResult(peer, ok) on the redundancy-state topic
 └─ OpcUaPublishActor (MODIFIED)
       subscribes redundancy-state topic → RedundancyStateChanged (role/leadership)
                                         → OpcUaProbeResult (peer's verdict on ME)
       holds IActorRef dbHealthProbe     → Ask<DbHealthStatus> on a periodic HealthTick
       Cluster.SelfMember.Status         → MemberState (local)
       on any trigger: ServiceLevelCalculator.Compute(inputs) → publish-if-changed (existing dedup)
```
### Components
**1. `PeerProbeSupervisor` (NEW, `Runtime/Health/PeerProbeSupervisor.cs`)** — a small per-node actor whose
only job is the probe lifecycle (kept separate so the pinned-dispatcher `OpcUaPublishActor` never parents
network-probe children):
- Subscribes to cluster membership (`IMemberEvent` / `ReachabilityEvent`) and/or the redundancy snapshot.
- Maintains one `PeerOpcUaProbeActor` child per **other driver-role member**, (re)spawning on membership
change, stopping children for departed members. Resolves the peer's OPC UA host from its node id
(`host:port` → host) and the configured OPC UA port.
- Children publish `OpcUaProbeResult` to the `redundancy-state` topic (existing behavior).
- Spawned on the default dispatcher (not the OPC UA pinned one).
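
The supervisor's bookkeeping reduces to a set difference between the driver members currently in the cluster and the peers that already have a probe child. A minimal sketch of that reconciliation, assuming node ids are plain `host:port` strings; `PeerProbePlanner`, `HostOf`, and `Reconcile` are hypothetical helper names, and the actual spawning and stopping of children is left out:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PeerProbePlanner
{
    // Node ids are described as `host:port`; keep only the host part.
    public static string HostOf(string nodeId)
    {
        int colon = nodeId.LastIndexOf(':');
        return colon < 0 ? nodeId : nodeId.Substring(0, colon);
    }

    // Given the current driver-role members and the peers that already have a
    // probe child, return which peers need a child and which children must stop.
    public static (IReadOnlyList<string> ToSpawn, IReadOnlyList<string> ToStop) Reconcile(
        string selfNodeId,
        IEnumerable<string> driverMembers,
        IEnumerable<string> probedPeers)
    {
        var desired = new HashSet<string>(driverMembers, StringComparer.Ordinal);
        desired.Remove(selfNodeId); // never probe ourselves
        var current = new HashSet<string>(probedPeers, StringComparer.Ordinal);

        var toSpawn = desired.Except(current).OrderBy(x => x, StringComparer.Ordinal).ToList();
        var toStop = current.Except(desired).OrderBy(x => x, StringComparer.Ordinal).ToList();
        return (toSpawn, toStop);
    }
}
```

On a single-node cluster `desired` is empty after removing the local node, so nothing is spawned, which matches the no-peer case described in this design.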
**2. `OpcUaPublishActor` (MODIFIED)** — replace the role-only switch in `HandleRedundancyStateChanged`
with a `RecomputeServiceLevel()` that calls the calculator. New Props params (all optional, defaulted so
existing test Props/harnesses keep working): `IActorRef? dbHealthProbe`, the freshness windows, and the
self OPC UA port (all injectable for tests).
- New state: `_dbReachable` + `_dbAsOfUtc` (from the DB Ask), `_lastSnapshotAsOfUtc`, a per-peer map of
`(ok, asOfUtc)` for results **about my own node**, and the current `RedundancyStateChanged` local entry.
- New `Receive<OpcUaProbeResult>`: if `msg.NodeId == _localNode`, record `(msg.Ok, now)` for the reporting
peer; recompute.
- New `Receive<HealthTick>` (periodic, e.g. 5 s): `dbHealthProbe.Ask<DbHealthStatus>(GetStatus, 1s).PipeTo(Self)`.
- New `Receive<DbHealthStatus>`: cache `_dbReachable` + `_dbAsOfUtc`; recompute.
- `RecomputeServiceLevel()` builds `NodeHealthInputs` and calls `ServiceLevelCalculator.Compute`, then
`Self.Tell(new ServiceLevelChanged(level))` (the existing handler dedups + publishes + emits the metric).
### Input derivation (the load-bearing logic)
For the **local** node, on each recompute:
| `NodeHealthInputs` field | Source |
|---|---|
| `MemberState` | `Cluster.Get(system).SelfMember.Status` (local; most accurate) |
| `IsDriverRoleLeader` | local entry of the latest `RedundancyStateChanged` snapshot (`IsRoleLeaderForDriver`) |
| `DbReachable` | latest `DbHealthStatus.Reachable` from the local `DbHealthProbeActor` |
| `OpcUaProbeOk` | `true` if **no** fresh peer result about me, else the latest such result's `Ok` (only an actively-observed recent failure → `false`); single-node → `true` |
| `Stale` | `!DbReachable` **OR** `(now - _dbAsOfUtc) > StaleWindow` **OR** `(now - _lastSnapshotAsOfUtc) > StaleWindow` |
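
The `OpcUaProbeOk` and `Stale` rows of the table above are the two derived inputs. The following is a minimal sketch of how they might be computed from cached timestamps. `HealthSignals` and `probeFreshWindow` are hypothetical names; using the most recent fresh verdict when several peers report is an assumption, since this design only specifies "the latest such result".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class HealthSignals
{
    private readonly TimeSpan _staleWindow;
    private readonly TimeSpan _probeFreshWindow; // hypothetical: window for peer verdicts
    private readonly Dictionary<string, (bool Ok, DateTime AsOfUtc)> _peerVerdicts = new();

    public HealthSignals(TimeSpan staleWindow, TimeSpan probeFreshWindow)
    {
        _staleWindow = staleWindow;
        _probeFreshWindow = probeFreshWindow;
    }

    public bool DbReachable { get; set; }
    public DateTime DbAsOfUtc { get; set; }
    public DateTime LastSnapshotAsOfUtc { get; set; }

    // Record a peer's verdict about the LOCAL node.
    public void RecordPeerVerdict(string peer, bool ok, DateTime nowUtc) =>
        _peerVerdicts[peer] = (ok, nowUtc);

    // DB-unreachable implies stale; otherwise stale when either signal is too old.
    public bool Stale(DateTime nowUtc) =>
        !DbReachable
        || nowUtc - DbAsOfUtc > _staleWindow
        || nowUtc - LastSnapshotAsOfUtc > _staleWindow;

    // Benefit of the doubt: with no fresh verdict the probe counts as ok.
    public bool OpcUaProbeOk(DateTime nowUtc)
    {
        var fresh = _peerVerdicts.Values
            .Where(v => nowUtc - v.AsOfUtc <= _probeFreshWindow)
            .ToList();
        if (fresh.Count == 0) return true;
        return fresh.OrderByDescending(v => v.AsOfUtc).First().Ok;
    }
}
```

Passing `nowUtc` in explicitly, rather than reading the clock inside the class, keeps the freshness rules testable with short windows, in line with the injectable windows mentioned for the test Props.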
**Guards (calculator does not model these):**
- If there is **no local snapshot entry**, or the local entry's `Role == Detached`, publish **0** directly
and skip the calculator. The calculator has no role input (it sees only `MemberState`, the health flags,
and the leader bonus), so a healthy detached node would otherwise compute 240. In steady state a node
running `OpcUaPublishActor` always carries the `driver` role, so this guard is defensive: it preserves the
current Detached→0 behavior during transitions.
### Resulting ServiceLevel truth table (now fully reachable)
| Node condition | Byte | How reached, as `(DbReachable, OpcUaProbeOk, Stale)` |
|---|---|---|
| Not a healthy member / detached | **0** | `MemberState` not Up/Joining, or Detached guard |
| DB unreachable, sustained | **100** | `DbReachable=false ⇒ Stale=true` → `(false,_,true)` |
| DB reachable but signals stale | **200** | `(true,_,true)` |
| Healthy follower (DB ok + probe ok + fresh) | **240** | `(true,true,false)` |
| Healthy leader | **250** | follower 240 + `IsDriverRoleLeader` +10 |
| DB ok + fresh but peer reports me unreachable | **0** | `(true,false,false)` → falls through |
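
The table can be read as a small pure function plus the Detached guard. The sketch below is reconstructed from the table alone and is not the real `ServiceLevelCalculator`. The `Sketch`-prefixed types are hypothetical stand-ins, and applying the +10 leader bonus only to the healthy tier is an assumption based on the documented 250 / 240 / 200 / 100 / 0 tiers.

```csharp
using System;

// Hypothetical stand-in for the cluster member status.
public enum SketchMemberState { Joining, Up, Leaving, Exiting, Down, Removed }

// Hypothetical stand-in for NodeHealthInputs.
public readonly record struct SketchHealthInputs(
    SketchMemberState MemberState,
    bool IsDriverRoleLeader,
    bool DbReachable,
    bool OpcUaProbeOk,
    bool Stale);

public static class ServiceLevelSketch
{
    // Pure truth table over (DbReachable, OpcUaProbeOk, Stale).
    public static byte Compute(SketchHealthInputs i)
    {
        if (i.MemberState is not (SketchMemberState.Up or SketchMemberState.Joining))
            return 0; // not a healthy cluster member

        return (i.DbReachable, i.OpcUaProbeOk, i.Stale) switch
        {
            (false, _, true) => (byte)100,   // DB unreachable, serving cached values
            (true, _, true) => (byte)200,    // DB reachable but signals stale
            (true, true, false) => i.IsDriverRoleLeader ? (byte)250 : (byte)240,
            _ => (byte)0,                    // e.g. peer actively reports me unreachable
        };
    }

    // Guard applied before the calculator: no local entry or Detached publishes 0.
    public static byte WithDetachedGuard(bool hasLocalEntry, bool isDetached, SketchHealthInputs i) =>
        !hasLocalEntry || isDetached ? (byte)0 : Compute(i);
}
```

For example, a healthy leader and a healthy follower evaluate to 250 and 240 under this sketch, and the same leader with `DbReachable=false` (hence `Stale=true`) evaluates to 100, below the follower.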
**Behavior change to be aware of (already documented as the target):** a *healthy Secondary moves
100 → 240*. Both healthy nodes sit at 240/250 with the leader preferred by the +10 bonus; role-leadership is
stable so the bytes do not flap. The "DB ok + probe-fail" → 0 edge is intentionally sharp but **debounced**
by the `OpcUaProbeOk` freshness rule (a single missed probe does not flip it; only a sustained, actively
observed failure does).
## Error handling / edge cases
- **DB Ask timeout** → treat as `DbReachable=false` for that cycle (fail-safe demote); the next tick re-Asks.
- **No peer (single node)** → `OpcUaProbeOk=true`; supervisor spawns no children.
- **Peer departs** → its prior result ages out of the freshness window → `OpcUaProbeOk` reverts to `true`
(we don't penalize a node for its peer's absence). Supervisor stops the child for the departed member.
- **Cluster forming / role-leader not yet resolved** → no local snapshot entry yet → Detached guard → 0,
same as today, until the first snapshot arrives.
- **`HealthTick` before the first snapshot** → no local entry → 0 (no spurious 240 before role is known).
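
The fail-safe rule for the DB Ask timeout can be illustrated without any actor machinery. This is a sketch using plain tasks rather than the Akka.NET `Ask`/`PipeTo` path the design describes; `DbAskFailSafe` is a hypothetical helper, and `Task.WaitAsync(TimeSpan)` requires .NET 6 or later.

```csharp
using System;
using System.Threading.Tasks;

public static class DbAskFailSafe
{
    // Returns the probe's answer, or false when no answer arrives within the timeout.
    public static async Task<bool> ReachableOrFalseAsync(Task<bool> ask, TimeSpan timeout)
    {
        try
        {
            // WaitAsync throws TimeoutException if the reply is late.
            return await ask.WaitAsync(timeout);
        }
        catch (TimeoutException)
        {
            return false; // fail-safe demote for this cycle; the next tick asks again
        }
    }
}
```

Only the timeout is mapped to `false` here, because that is the one failure this design specifies; how a faulted probe should be treated is left open.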
## Testing strategy (xUnit + Shouldly, TDD; NO bUnit)
Unit-testable (the wiring that *can* be proven in a TestKit):
- `OpcUaPublishActor` input-derivation: feed a `RedundancyStateChanged` (Primary-leader) + a cached
`DbHealthStatus(reachable)` + an `OpcUaProbeResult(me, ok)` → assert published byte 250; flip DB
unreachable → 100; stale snapshot → 200; healthy secondary → 240; Detached entry → 0; probe `Ok=false`
about me with DB-ok+fresh → 0. Use `PropsForTests` with a broadcast/serviceLevel capture + injected
`dbHealthProbe` test ref + short windows.
- `OpcUaProbeOk` freshness/debounce: a single stale/absent result → `true`; a recent explicit `false` → `false`.
- `PeerProbeSupervisor`: on a membership snapshot with one peer driver member, spawns exactly one child
targeting that peer; on the peer leaving, stops it; single-node → no children. (TestKit, with a probe
factory injected so no real TCP is attempted.)
- Reuse existing `ServiceLevelCalculatorTests` unchanged (the pure function is not modified).
Not unit-testable (acceptance gate): the live cross-node ServiceLevel on the **2-node rig** — proven by
`/run` (user-driven). A DB-unreachable primary must be observed dropping below the secondary, and the
secondary taking over as the higher-ServiceLevel endpoint. See `project_redundancy_state_delivery`.
## Verification
- `dotnet build` clean (production projects are `TreatWarningsAsErrors`); full `dotnet test` green.
- High-risk review chain (serial spec → code → final integration).
- **Live `/run` on the 2-node / docker-dev rig** (user-driven; agent does not sign in): confirm
steady-state 250/240, then kill a node's DB reachability and confirm its ServiceLevel drops to 100 and the
peer is preferred. Update `docs/Redundancy.md` to mark the calculator path **wired** (remove the "not yet
wired into the live driver publish path" caveats) and document the freshness/peer-probe sourcing.
## Alternatives considered
- **Central compute in `RedundancyStateActor`** — rejected: the admin singleton can't see each node's local
DB health without a new cluster-wide health-broadcast contract; heavier, bigger blast radius, no benefit.
- **Local self-probe** (TCP to own `localhost:4840` / SDK server-state) instead of the peer-probe — simpler
(no membership-tracking supervisor) and a defensible "is my OPC UA serving" signal, but leaves
`PeerOpcUaProbeActor` dead and misses the cross-node vantage. Rejected to honor the locked design and close
the audit's dead-code gap; recorded here so the trade-off is explicit if the supervisor proves heavier than
valued (then it becomes a follow-up swap, not a redesign).
- **Reshape the calculator to role-baseline-demoted-by-health** (keep the wide 240-vs-100 gap; health only
lowers) — rejected: the calculator's 250/240/200/100/0 tiers are already approved, documented, and tested;
reshaping re-opens a settled decision. The +10 leader bonus already preserves client preference for the
primary.
## Hard constraints (carried from the roadmap)
NO Configuration entity / EF migration. Stage by path — never `git add .`; never stage `sql_login.txt`,
`src/Server/.../Host/pki/`, `pending.md`, `current.md`, `docker-dev/docker-compose.yml`, `stillpending.md`.
Never echo or commit secrets. No force-push, no `--no-verify`. Razor/runtime cross-node behavior proven only
by live `/run`, never bUnit.