docs(redundancy): Phase 2 design — health-aware ServiceLevel (H3)
This commit is contained in:
@@ -0,0 +1,183 @@
|
||||
# Phase 2 — Health-aware redundancy ServiceLevel (H3) — design
|
||||
|
||||
> **Status:** approved 2026-06-15. Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md`
|
||||
> (Phase 2). Backlog item **H3** in `stillpending.md` §1.
|
||||
> **Branch:** `feat/stillpending-phase-2-servicelevel` off master `4bd7180e`.
|
||||
> **Classification:** high-risk (actor model + redundancy correctness). Live `/run` on the 2-node rig is
|
||||
> the acceptance gate; unit tests cannot prove the cross-node wiring (per `project_redundancy_state_delivery`).
|
||||
|
||||
## Problem (H3)
|
||||
|
||||
The SDK's `Server.ServiceLevel` byte is driven **solely by the coarse role map** today
|
||||
(`OpcUaPublishActor.HandleRedundancyStateChanged`: Primary-leader→240, Primary→200, Secondary→100,
|
||||
Detached→0). The richer `ServiceLevelCalculator.Compute(NodeHealthInputs)` — which folds in DB
|
||||
reachability, an OPC-UA liveness probe, and a staleness signal — **is never invoked in production**.
|
||||
The two health producers are dead or half-wired:
|
||||
|
||||
- `DbHealthProbeActor` — **spawned** per node, but only feeds `/health/ready`; its `DbReachable` never
|
||||
reaches the published byte.
|
||||
- `PeerOpcUaProbeActor` — **never spawned anywhere**; its `OpcUaProbeResult` has **zero consumers**.
|
||||
- `NodeHealthInputs.Stale` — **no producer exists** anywhere in the codebase.
|
||||
|
||||
Consequence: a **DB-unreachable / OPC-UA-down primary still advertises its full role-based ServiceLevel**,
|
||||
so redundant clients keep subscribing to a degraded node instead of failing over.
|
||||
|
||||
## Goal
|
||||
|
||||
Make each driver node publish a ServiceLevel computed by `ServiceLevelCalculator.Compute` from its **real
|
||||
local health**, so an unhealthy node drops below its role-based level (and below a healthy peer's level,
|
||||
triggering client failover). Honor the **already-documented** tiers in `docs/Redundancy.md`
|
||||
(250 / 240 / 200 / 100 / 0) — do not reshape the calculator's truth table. No EF migration; no change to
|
||||
the `RedundancyStateChanged` / `NodeRedundancyState` message contract.
|
||||
|
||||
## Locked decisions (from brainstorming 2026-06-15)
|
||||
|
||||
1. **Compute site = local per-node** (`OpcUaPublishActor`), *not* the admin `RedundancyStateActor`
|
||||
singleton. DB reachability is an inherently local fact; centralizing it would require a new
|
||||
cluster-wide health-broadcast contract for no benefit. The publish actor already runs per node, already
|
||||
subscribes to the `redundancy-state` topic, already holds the role snapshot, and already writes the byte.
|
||||
2. **`Stale` is derived from signal freshness, with DB-unreachable ⟹ stale.** This makes the documented
|
||||
200/100 graceful-degradation tiers reachable, so a node that loses its DB settles into 100 ("degraded,
|
||||
still serving cached values") rather than slamming to 0. The 0 tier stays reserved for "not a healthy
|
||||
cluster member / detached." A richer *driver-input* staleness signal is **deferred** (flagged, not built).
|
||||
3. **`OpcUaProbeOk` = peer-probes-me**, reusing the existing `PeerOpcUaProbeActor` (finally spawned),
|
||||
closing the audit's "never spawned / zero consumers" gaps. Absence of a fresh peer result → `true`
|
||||
(benefit of the doubt: don't penalize a node for its peer being down); only an *actively observed*
|
||||
recent failure demotes. Single-node / no-peer cluster → `true`.
|
||||
|
||||
## Architecture
|
||||
|
||||
All changes are **per-driver-node, inside `Runtime`** (`WithOtOpcUaRuntimeActors`). No ControlPlane /
|
||||
admin-singleton change. No new Commons message types.
|
||||
|
||||
```
|
||||
WithOtOpcUaRuntimeActors (per driver node)
|
||||
├─ DbHealthProbeActor (exists; now ALSO consulted by the publish actor)
|
||||
├─ PeerProbeSupervisor (NEW) watches cluster membership →
|
||||
│ └─ PeerOpcUaProbeActor(peer) one child per OTHER driver member;
|
||||
│ publishes OpcUaProbeResult(peer, ok) on the redundancy-state topic
|
||||
└─ OpcUaPublishActor (MODIFIED)
|
||||
subscribes redundancy-state topic → RedundancyStateChanged (role/leadership)
|
||||
→ OpcUaProbeResult (peer's verdict on ME)
|
||||
holds IActorRef dbHealthProbe → Ask<DbHealthStatus> on a periodic HealthTick
|
||||
Cluster.SelfMember.Status → MemberState (local)
|
||||
on any trigger: ServiceLevelCalculator.Compute(inputs) → publish-if-changed (existing dedup)
|
||||
```
|
||||
|
||||
### Components
|
||||
|
||||
**1. `PeerProbeSupervisor` (NEW, `Runtime/Health/PeerProbeSupervisor.cs`)** — a small per-node actor whose
|
||||
only job is the probe lifecycle (kept separate so the pinned-dispatcher `OpcUaPublishActor` never parents
|
||||
network-probe children):
|
||||
- Subscribes to cluster membership (`IMemberEvent` / `ReachabilityEvent`) and/or the redundancy snapshot.
|
||||
- Maintains one `PeerOpcUaProbeActor` child per **other driver-role member**, (re)spawning on membership
|
||||
change, stopping children for departed members. Resolves the peer's OPC UA host from its node id
|
||||
(`host:port` → host) and the configured OPC UA port.
|
||||
- Children publish `OpcUaProbeResult` to the `redundancy-state` topic (existing behavior).
|
||||
- Spawned on the default dispatcher (not the OPC UA pinned one).
|
||||
|
||||
**2. `OpcUaPublishActor` (MODIFIED)** — replace the role-only switch in `HandleRedundancyStateChanged`
|
||||
with a `RecomputeServiceLevel()` that calls the calculator. New Props params (both optional, defaulted so
|
||||
test Props/harnesses keep working): `IActorRef? dbHealthProbe`, and the freshness windows /
|
||||
self-OPC-UA-port (injectable for tests).
|
||||
- New state: `_dbReachable` + `_dbAsOfUtc` (from the DB Ask), `_lastSnapshotAsOfUtc`, a per-peer map of
|
||||
`(ok, asOfUtc)` for results **about my own node**, and the current `RedundancyStateChanged` local entry.
|
||||
- New `Receive<OpcUaProbeResult>`: if `msg.NodeId == _localNode`, record `(msg.Ok, now)` for the reporting
|
||||
peer; recompute.
|
||||
- New `Receive<HealthTick>` (periodic, e.g. 5 s): `dbHealthProbe.Ask<DbHealthStatus>(GetStatus, 1s).PipeTo(Self)`.
|
||||
- New `Receive<DbHealthStatus>`: cache `_dbReachable` + `_dbAsOfUtc`; recompute.
|
||||
- `RecomputeServiceLevel()` builds `NodeHealthInputs` and calls `ServiceLevelCalculator.Compute`, then
|
||||
`Self.Tell(new ServiceLevelChanged(level))` (the existing handler dedups + publishes + emits the metric).
|
||||
|
||||
### Input derivation (the load-bearing logic)
|
||||
|
||||
For the **local** node, on each recompute:
|
||||
|
||||
| `NodeHealthInputs` field | Source |
|
||||
|---|---|
|
||||
| `MemberState` | `Cluster.Get(system).SelfMember.Status` (local; most accurate) |
|
||||
| `IsDriverRoleLeader` | local entry of the latest `RedundancyStateChanged` snapshot (`IsRoleLeaderForDriver`) |
|
||||
| `DbReachable` | latest `DbHealthStatus.Reachable` from the local `DbHealthProbeActor` |
|
||||
| `OpcUaProbeOk` | `true` if **no** fresh peer result about me, else the latest such result's `Ok` (only an actively-observed recent failure → `false`); single-node → `true` |
|
||||
| `Stale` | `!DbReachable` **OR** `(now − _dbAsOfUtc) > StaleWindow` **OR** `(now − _lastSnapshotAsOfUtc) > StaleWindow` |
|
||||
|
||||
**Guards (calculator does not model these):**
|
||||
- If there is **no local snapshot entry**, or the local entry's `Role == Detached`, publish **0** directly
|
||||
and skip the calculator (a healthy detached node would otherwise compute 240 because the calculator only
|
||||
checks `MemberState` + the leader bonus). In steady state a node running `OpcUaPublishActor` always carries
|
||||
the `driver` role, so this is defensive — it preserves the current Detached→0 behavior during transitions.
|
||||
|
||||
### Resulting ServiceLevel truth table (now fully reachable)
|
||||
|
||||
| Node condition | Byte | How reached |
|
||||
|---|---|---|
|
||||
| Not a healthy member / detached | **0** | `MemberState` not Up/Joining, or Detached guard |
|
||||
| DB unreachable, sustained | **100** | `DbReachable=false ⇒ Stale=true` → `(false,_,true)` |
|
||||
| DB reachable but signals stale | **200** | `(true,_,true)` |
|
||||
| Healthy follower (DB ok + probe ok + fresh) | **240** | `(true,true,false)` |
|
||||
| Healthy leader | **250** | follower 240 + `IsDriverRoleLeader` +10 |
|
||||
| DB ok + fresh but peer reports me unreachable | **0** | `(true,false,false)` → falls through |
|
||||
|
||||
**Behavior change to be aware of (already documented as the target):** a *healthy Secondary moves
|
||||
100 → 240*. Both healthy nodes sit at 240/250 with the leader preferred by the +10 bonus; role-leadership is
|
||||
stable so the bytes do not flap. The "DB ok + probe-fail" → 0 edge is intentionally sharp but **debounced**
|
||||
by the `OpcUaProbeOk` freshness rule (a single missed probe does not flip it; only a sustained, actively
|
||||
observed failure does).
|
||||
|
||||
## Error handling / edge cases
|
||||
|
||||
- **DB Ask timeout** → treat as `DbReachable=false` for that cycle (fail-safe demote); the next tick re-Asks.
|
||||
- **No peer (single node)** → `OpcUaProbeOk=true`; supervisor spawns no children.
|
||||
- **Peer departs** → its prior result ages out of the freshness window → `OpcUaProbeOk` reverts to `true`
|
||||
(we don't penalize a node for its peer's absence). Supervisor stops the child for the departed member.
|
||||
- **Cluster forming / role-leader not yet resolved** → no local snapshot entry yet → Detached guard → 0,
|
||||
same as today, until the first snapshot arrives.
|
||||
- **`HealthTick` before the first snapshot** → no local entry → 0 (no spurious 240 before role is known).
|
||||
|
||||
## Testing strategy (xUnit + Shouldly, TDD; NO bUnit)
|
||||
|
||||
Unit-testable (the wiring that *can* be proven in a TestKit):
|
||||
- `OpcUaPublishActor` input-derivation: feed a `RedundancyStateChanged` (Primary-leader) + a cached
|
||||
`DbHealthStatus(reachable)` + an `OpcUaProbeResult(me, ok)` → assert published byte 250; flip DB
|
||||
unreachable → 100; stale snapshot → 200; healthy secondary → 240; Detached entry → 0; probe `Ok=false`
|
||||
about me with DB-ok+fresh → 0. Use `PropsForTests` with a broadcast/serviceLevel capture + injected
|
||||
`dbHealthProbe` test ref + short windows.
|
||||
- `OpcUaProbeOk` freshness/debounce: a single stale/absent result → `true`; a recent explicit `false` → `false`.
|
||||
- `PeerProbeSupervisor`: on a membership snapshot with one peer driver member, spawns exactly one child
|
||||
targeting that peer; on the peer leaving, stops it; single-node → no children. (TestKit, with a probe
|
||||
factory injected so no real TCP is attempted.)
|
||||
- Reuse existing `ServiceLevelCalculatorTests` unchanged (the pure function is not modified).
|
||||
|
||||
Not unit-testable (acceptance gate): the live cross-node ServiceLevel on the **2-node rig** — proven by
|
||||
`/run` (user-driven). A DB-unreachable primary must be observed dropping below the secondary, and the
|
||||
secondary taking over as the higher-ServiceLevel endpoint. See `project_redundancy_state_delivery`.
|
||||
|
||||
## Verification
|
||||
|
||||
- `dotnet build` clean (production projects are `TreatWarningsAsErrors`); full `dotnet test` green.
|
||||
- High-risk review chain (serial spec → code → final integration).
|
||||
- **Live `/run` on the 2-node / docker-dev rig** (user-driven; agent does not sign in): confirm
|
||||
steady-state 250/240, then kill a node's DB reachability and confirm its ServiceLevel drops to 100 and the
|
||||
peer is preferred. Update `docs/Redundancy.md` to mark the calculator path **wired** (remove the "not yet
|
||||
wired into the live driver publish path" caveats) and document the freshness/peer-probe sourcing.
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- **Central compute in `RedundancyStateActor`** — rejected: the admin singleton can't see each node's local
|
||||
DB health without a new cluster-wide health-broadcast contract; heavier, bigger blast radius, no benefit.
|
||||
- **Local self-probe** (TCP to own `localhost:4840` / SDK server-state) instead of the peer-probe — simpler
|
||||
(no membership-tracking supervisor) and a defensible "is my OPC UA serving" signal, but leaves
|
||||
`PeerOpcUaProbeActor` dead and misses the cross-node vantage. Rejected to honor the locked design and close
|
||||
the audit's dead-code gap; recorded here so the trade-off is explicit if the supervisor proves heavier than
|
||||
valued (then it becomes a follow-up swap, not a redesign).
|
||||
- **Reshape the calculator to role-baseline-demoted-by-health** (keep the wide 240-vs-100 gap; health only
|
||||
lowers) — rejected: the calculator's 250/240/200/100/0 tiers are already approved, documented, and tested;
|
||||
reshaping re-opens a settled decision. The +10 leader bonus already preserves client preference for the
|
||||
primary.
|
||||
|
||||
## Hard constraints (carried from the roadmap)
|
||||
|
||||
NO Configuration entity / EF migration. Stage by path — never `git add .`; never stage `sql_login.txt`,
|
||||
`src/Server/.../Host/pki/`, `pending.md`, `current.md`, `docker-dev/docker-compose.yml`, `stillpending.md`.
|
||||
Never echo or commit secrets. No force-push, no `--no-verify`. Razor/runtime cross-node behavior proven only
|
||||
by live `/run`, never bUnit.
|
||||
Reference in New Issue
Block a user