# Phase 2 — Health-aware redundancy ServiceLevel (H3) — design
> **Status:** approved 2026-06-15. Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md`
> (Phase 2). Backlog item **H3** in `stillpending.md` §1.
> **Branch:** `feat/stillpending-phase-2-servicelevel` off master `4bd7180e`.
> **Classification:** high-risk (actor model + redundancy correctness). Live `/run` on the 2-node rig is
> the acceptance gate; unit tests cannot prove the cross-node wiring (per `project_redundancy_state_delivery`).
## Problem (H3)
The SDK's `Server.ServiceLevel` byte is driven **solely by the coarse role map** today
(`OpcUaPublishActor.HandleRedundancyStateChanged`: Primary-leader→240, Primary→200, Secondary→100,
Detached→0). The richer `ServiceLevelCalculator.Compute(NodeHealthInputs)` — which folds in DB
reachability, an OPC-UA liveness probe, and a staleness signal — **is never invoked in production**.
The two health producers are dead or half-wired:
- `DbHealthProbeActor`: **spawned** per node, but only feeds `/health/ready`; its `DbReachable` never
reaches the published byte.
- `PeerOpcUaProbeActor`: **never spawned anywhere**; its `OpcUaProbeResult` has **zero consumers**.
- `NodeHealthInputs.Stale`: **no producer exists** anywhere in the codebase.
Consequence: a **DB-unreachable / OPC-UA-down primary still advertises its full role-based ServiceLevel**,
so redundant clients keep subscribing to a degraded node instead of failing over.
## Goal
Make each driver node publish a ServiceLevel computed by `ServiceLevelCalculator.Compute` from its **real
local health**, so an unhealthy node drops below its role-based level (and below a healthy peer's level,
triggering client failover). Honor the **already-documented** tiers in `docs/Redundancy.md`
(250 / 240 / 200 / 100 / 0) — do not reshape the calculator's truth table. No EF migration; no change to
the `RedundancyStateChanged` / `NodeRedundancyState` message contract.
## Locked decisions (from brainstorming 2026-06-15)
1. **Compute site = local per-node** (`OpcUaPublishActor`), *not* the admin `RedundancyStateActor`
singleton. DB reachability is an inherently local fact; centralizing it would require a new
cluster-wide health-broadcast contract for no benefit. The publish actor already runs per node, already
subscribes to the `redundancy-state` topic, already holds the role snapshot, and already writes the byte.
2. **`Stale` is derived from signal freshness, with DB-unreachable ⟹ stale.** This makes the documented
200/100 graceful-degradation tiers reachable, so a node that loses its DB settles into 100 ("degraded,
still serving cached values") rather than slamming to 0. The 0 tier stays reserved for "not a healthy
cluster member / detached." A richer *driver-input* staleness signal is **deferred** (flagged, not built).
3. **`OpcUaProbeOk` = peer-probes-me**, reusing the existing `PeerOpcUaProbeActor` (finally spawned),
closing the audit's "never spawned / zero consumers" gaps. Absence of a fresh peer result → `true`
(benefit of the doubt: don't penalize a node for its peer being down); only an *actively observed*
recent failure demotes. Single-node / no-peer cluster → `true`.
## Architecture
All changes are **per-driver-node, inside `Runtime`** (`WithOtOpcUaRuntimeActors`). No ControlPlane /
admin-singleton change. No new Commons message types.
```
WithOtOpcUaRuntimeActors (per driver node)
├─ DbHealthProbeActor (exists; now ALSO consulted by the publish actor)
├─ PeerProbeSupervisor (NEW) watches cluster membership →
│ └─ PeerOpcUaProbeActor(peer) one child per OTHER driver member;
│ publishes OpcUaProbeResult(peer, ok) on the redundancy-state topic
└─ OpcUaPublishActor (MODIFIED)
subscribes redundancy-state topic → RedundancyStateChanged (role/leadership)
→ OpcUaProbeResult (peer's verdict on ME)
holds IActorRef dbHealthProbe → Ask<DbHealthStatus> on a periodic HealthTick
Cluster.SelfMember.Status → MemberState (local)
on any trigger: ServiceLevelCalculator.Compute(inputs) → publish-if-changed (existing dedup)
```
### Components
**1. `PeerProbeSupervisor` (NEW, `Runtime/Health/PeerProbeSupervisor.cs`)** — a small per-node actor whose
only job is the probe lifecycle (kept separate so the pinned-dispatcher `OpcUaPublishActor` never parents
network-probe children):
- Subscribes to cluster membership (`IMemberEvent` / `ReachabilityEvent`) and/or the redundancy snapshot.
- Maintains one `PeerOpcUaProbeActor` child per **other driver-role member**, (re)spawning on membership
change, stopping children for departed members. Resolves the peer's OPC UA host from its node id
(`host:port` → host) and the configured OPC UA port.
- Children publish `OpcUaProbeResult` to the `redundancy-state` topic (existing behavior).
- Spawned on the default dispatcher (not the OPC UA pinned one).
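The child lifecycle described above reduces to a set difference between the probes already running and the current driver-role members. The sketch below shows that reconciliation and the `host:port` → host step as plain functions, without Akka.NET. The class and method names (`PeerProbePlanning`, `HostFromNodeId`, `Reconcile`) are hypothetical, and the node-id shape is assumed to be exactly `host:port`; the real supervisor would call something like this from its membership-event handlers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; the real PeerProbeSupervisor is an Akka.NET actor.
public static class PeerProbePlanning
{
    // Assumption: node ids look like "host:port"; only the host part is kept.
    // An id without a colon is returned unchanged.
    public static string HostFromNodeId(string nodeId)
    {
        int colon = nodeId.LastIndexOf(':');
        return colon < 0 ? nodeId : nodeId.Substring(0, colon);
    }

    // Given the peers that already have a probe child and the current
    // driver-role members, return which peers need a new child and which
    // children should be stopped. The local node never gets a probe.
    public static (IReadOnlyList<string> ToSpawn, IReadOnlyList<string> ToStop) Reconcile(
        IEnumerable<string> runningPeers,
        IEnumerable<string> driverMembers,
        string localNode)
    {
        var desired = new HashSet<string>(driverMembers.Where(m => m != localNode));
        var running = new HashSet<string>(runningPeers);

        var toSpawn = desired.Except(running).OrderBy(p => p, StringComparer.Ordinal).ToList();
        var toStop = running.Except(desired).OrderBy(p => p, StringComparer.Ordinal).ToList();
        return (toSpawn, toStop);
    }
}
```

Keeping the reconciliation pure makes the "one child per other driver member, none on a single-node cluster" rule checkable without a TestKit; the actor then only has to apply the two lists.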
**2. `OpcUaPublishActor` (MODIFIED)** — replace the role-only switch in `HandleRedundancyStateChanged`
with a `RecomputeServiceLevel()` that calls the calculator. New Props params (both optional, defaulted so
test Props/harnesses keep working): `IActorRef? dbHealthProbe`, and the freshness windows /
self-OPC-UA-port (injectable for tests).
- New state: `_dbReachable` + `_dbAsOfUtc` (from the DB Ask), `_lastSnapshotAsOfUtc`, a per-peer map of
`(ok, asOfUtc)` for results **about my own node**, and the current `RedundancyStateChanged` local entry.
- New `Receive<OpcUaProbeResult>`: if `msg.NodeId == _localNode`, record `(msg.Ok, now)` for the reporting
peer; recompute.
- New `Receive<HealthTick>` (periodic, e.g. 5 s): `dbHealthProbe.Ask<DbHealthStatus>(GetStatus, 1s).PipeTo(Self)`.
- New `Receive<DbHealthStatus>`: cache `_dbReachable` + `_dbAsOfUtc`; recompute.
- `RecomputeServiceLevel()` builds `NodeHealthInputs` and calls `ServiceLevelCalculator.Compute`, then
`Self.Tell(new ServiceLevelChanged(level))` (the existing handler dedups + publishes + emits the metric).
### Input derivation (the load-bearing logic)
For the **local** node, on each recompute:
| `NodeHealthInputs` field | Source |
|---|---|
| `MemberState` | `Cluster.Get(system).SelfMember.Status` (local; most accurate) |
| `IsDriverRoleLeader` | local entry of the latest `RedundancyStateChanged` snapshot (`IsRoleLeaderForDriver`) |
| `DbReachable` | latest `DbHealthStatus.Reachable` from the local `DbHealthProbeActor` |
| `OpcUaProbeOk` | `true` if **no** fresh peer result about me, else the latest such result's `Ok` (only an actively-observed recent failure → `false`); single-node → `true` |
| `Stale` | `!DbReachable` **OR** `(now - _dbAsOfUtc) > StaleWindow` **OR** `(now - _lastSnapshotAsOfUtc) > StaleWindow` |
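
The two derived fields in the table (`OpcUaProbeOk` and `Stale`) can be written as small pure functions. This is a minimal sketch under stated assumptions, not the actor's code: the class name `HealthInputDerivation`, the method names, and the `freshWindow` parameter are hypothetical, and a single latest peer result is assumed (the design keeps a per-peer map).

```csharp
using System;

// Hypothetical helper illustrating the input-derivation rules.
public static class HealthInputDerivation
{
    // peerResult is null when no peer has reported on this node
    // (single-node cluster, or the peer has not probed yet).
    public static bool DeriveOpcUaProbeOk(
        (bool Ok, DateTime AsOfUtc)? peerResult,
        DateTime nowUtc,
        TimeSpan freshWindow)
    {
        if (peerResult is null) return true;          // benefit of the doubt
        var (ok, asOfUtc) = peerResult.Value;
        bool fresh = nowUtc - asOfUtc <= freshWindow;
        return !fresh || ok;                          // only a fresh failure demotes
    }

    // Stale when the DB is unreachable, or when either signal is older
    // than the stale window.
    public static bool DeriveStale(
        bool dbReachable,
        DateTime dbAsOfUtc,
        DateTime lastSnapshotAsOfUtc,
        DateTime nowUtc,
        TimeSpan staleWindow)
        => !dbReachable
           || nowUtc - dbAsOfUtc > staleWindow
           || nowUtc - lastSnapshotAsOfUtc > staleWindow;
}
```

Passing `nowUtc` in as a parameter, rather than reading the clock inside, is what lets the unit tests use short windows deterministically.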
**Guards (calculator does not model these):**
- If there is **no local snapshot entry**, or the local entry's `Role == Detached`, publish **0** directly
and skip the calculator (a healthy detached node would otherwise compute 240 because the calculator only
checks `MemberState` + the leader bonus). In steady state a node running `OpcUaPublishActor` always carries
the `driver` role, so this is defensive — it preserves the current Detached→0 behavior during transitions.
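
A minimal sketch of the guard, assuming a stand-in role enum and a delegate in place of the real calculator call. `SketchRole`, `DetachedGuard`, and `Resolve` are hypothetical names used only for illustration.

```csharp
using System;

// Stand-in for the real role type; only Detached matters to the guard.
public enum SketchRole { Detached, Secondary, Primary }

public static class DetachedGuard
{
    // localRole is null when the latest snapshot has no entry for this node.
    // computeFromHealth stands in for the calculator call and is only invoked
    // when the guard does not short-circuit.
    public static byte Resolve(SketchRole? localRole, Func<byte> computeFromHealth)
        => localRole is null || localRole == SketchRole.Detached
            ? (byte)0
            : computeFromHealth();
}
```

Because the delegate is not invoked on the guarded paths, a `HealthTick` that arrives before the first snapshot cannot produce a non-zero byte.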
### Resulting ServiceLevel truth table (now fully reachable)
| Node condition | Byte | How reached |
|---|---|---|
| Not a healthy member / detached | **0** | `MemberState` not Up/Joining, or Detached guard |
| DB unreachable, sustained | **100** | `DbReachable=false ⇒ Stale=true` → `(false,_,true)` |
| DB reachable but signals stale | **200** | `(true,_,true)` |
| Healthy follower (DB ok + probe ok + fresh) | **240** | `(true,true,false)` |
| Healthy leader | **250** | follower 240 + `IsDriverRoleLeader` +10 |
| DB ok + fresh but peer reports me unreachable | **0** | `(true,false,false)` → falls through |
**Behavior change to be aware of (already documented as the target):** a *healthy Secondary moves
100 → 240*. Both healthy nodes sit at 240/250 with the leader preferred by the +10 bonus; role-leadership is
stable so the bytes do not flap. The "DB ok + probe-fail" → 0 edge is intentionally sharp but **debounced**
by the `OpcUaProbeOk` freshness rule (a missed or stale probe result does not flip it; only a fresh, actively
observed failure does).
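
The truth table can be read as a pattern match over `(DbReachable, OpcUaProbeOk, Stale)`. The sketch below reproduces only the rows listed above; it is not the source of `ServiceLevelCalculator.Compute`, and the names `SketchMemberState` and `ServiceLevelSketch` are hypothetical. Applying the +10 leader bonus only at the healthy tier is an assumption: the table does not say whether the degraded tiers receive it.

```csharp
using System;

// Stand-in for the cluster member status; only Up/Joining count as healthy.
public enum SketchMemberState { Joining, Up, Leaving, Down }

public static class ServiceLevelSketch
{
    public static byte Compute(
        SketchMemberState memberState,
        bool isDriverRoleLeader,
        bool dbReachable,
        bool opcUaProbeOk,
        bool stale)
    {
        // Row 1: not a healthy cluster member.
        if (memberState != SketchMemberState.Up && memberState != SketchMemberState.Joining)
            return 0;

        return (dbReachable, opcUaProbeOk, stale) switch
        {
            (false, _, true)    => (byte)100, // DB unreachable (implies stale)
            (true, _, true)     => (byte)200, // DB ok, signals stale
            (true, true, false) => isDriverRoleLeader ? (byte)250 : (byte)240,
            _                   => (byte)0,   // e.g. DB ok + fresh, peer reports failure
        };
    }
}
```

For example, `Compute(SketchMemberState.Up, true, true, true, false)` yields 250 and the same inputs with `isDriverRoleLeader: false` yield 240, matching the steady-state 250/240 pair expected on the 2-node rig.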
## Error handling / edge cases
- **DB Ask timeout** → treat as `DbReachable=false` for that cycle (fail-safe demote); the next tick re-Asks.
- **No peer (single node)** → `OpcUaProbeOk=true`; supervisor spawns no children.
- **Peer departs** → its prior result ages out of the freshness window → `OpcUaProbeOk` reverts to `true`
(we don't penalize a node for its peer's absence). Supervisor stops the child for the departed member.
- **Cluster forming / role-leader not yet resolved** → no local snapshot entry yet → Detached guard → 0,
same as today, until the first snapshot arrives.
- **`HealthTick` before the first snapshot** → no local entry → 0 (no spurious 240 before role is known).
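
The fail-safe rule for the DB Ask (a timeout or failure counts as unreachable for that cycle) can be illustrated with a plain `Task`, independent of Akka.NET. This is a sketch under assumptions: `DbAskFailSafe` and `ReachableOrFalse` are hypothetical names, the `Task<bool>` stands in for the Ask result, and `Task.WaitAsync(TimeSpan)` requires .NET 6 or later. In the actor itself the Ask carries its own timeout and the failure arrives via `PipeTo`.

```csharp
using System;
using System.Threading.Tasks;

public static class DbAskFailSafe
{
    // Returns the probe's answer, or false if the answer does not arrive
    // within the timeout or the task faults. The caller simply asks again
    // on the next tick.
    public static async Task<bool> ReachableOrFalse(Task<bool> ask, TimeSpan timeout)
    {
        try
        {
            return await ask.WaitAsync(timeout);
        }
        catch (Exception)
        {
            // TimeoutException or any probe failure: demote for this cycle.
            return false;
        }
    }
}
```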
## Testing strategy (xUnit + Shouldly, TDD; NO bUnit)
Unit-testable (the wiring that *can* be proven in a TestKit):
- `OpcUaPublishActor` input-derivation: feed a `RedundancyStateChanged` (Primary-leader) + a cached
`DbHealthStatus(reachable)` + an `OpcUaProbeResult(me, ok)` → assert published byte 250; flip DB
unreachable → 100; stale snapshot → 200; healthy secondary → 240; Detached entry → 0; probe `Ok=false`
about me with DB-ok+fresh → 0. Use `PropsForTests` with a broadcast/serviceLevel capture + injected
`dbHealthProbe` test ref + short windows.
- `OpcUaProbeOk` freshness/debounce: a single stale/absent result → `true`; a recent explicit `false` → `false`.
- `PeerProbeSupervisor`: on a membership snapshot with one peer driver member, spawns exactly one child
targeting that peer; on the peer leaving, stops it; single-node → no children. (TestKit, with a probe
factory injected so no real TCP is attempted.)
- Reuse existing `ServiceLevelCalculatorTests` unchanged (the pure function is not modified).
Not unit-testable (acceptance gate): the live cross-node ServiceLevel on the **2-node rig** — proven by
`/run` (user-driven). A DB-unreachable primary must be observed dropping below the secondary, and the
secondary taking over as the higher-ServiceLevel endpoint. See `project_redundancy_state_delivery`.
## Verification
- `dotnet build` clean (production projects are `TreatWarningsAsErrors`); full `dotnet test` green.
- High-risk review chain (serial spec → code → final integration).
- **Live `/run` on the 2-node / docker-dev rig** (user-driven; agent does not sign in): confirm
steady-state 250/240, then kill a node's DB reachability and confirm its ServiceLevel drops to 100 and the
peer is preferred. Update `docs/Redundancy.md` to mark the calculator path **wired** (remove the "not yet
wired into the live driver publish path" caveats) and document the freshness/peer-probe sourcing.
## Alternatives considered
- **Central compute in `RedundancyStateActor`** — rejected: the admin singleton can't see each node's local
DB health without a new cluster-wide health-broadcast contract; heavier, bigger blast radius, no benefit.
- **Local self-probe** (TCP to own `localhost:4840` / SDK server-state) instead of the peer-probe — simpler
(no membership-tracking supervisor) and a defensible "is my OPC UA serving" signal, but leaves
`PeerOpcUaProbeActor` dead and misses the cross-node vantage. Rejected to honor the locked design and close
the audit's dead-code gap; recorded here so the trade-off is explicit if the supervisor proves heavier than
valued (then it becomes a follow-up swap, not a redesign).
- **Reshape the calculator to role-baseline-demoted-by-health** (keep the wide 240-vs-100 gap; health only
lowers) — rejected: the calculator's 250/240/200/100/0 tiers are already approved, documented, and tested;
reshaping re-opens a settled decision. The +10 leader bonus already preserves client preference for the
primary.
## Hard constraints (carried from the roadmap)
NO Configuration entity / EF migration. Stage by path — never `git add .`; never stage `sql_login.txt`,
`src/Server/.../Host/pki/`, `pending.md`, `current.md`, `docker-dev/docker-compose.yml`, `stillpending.md`.
Never echo or commit secrets. No force-push, no `--no-verify`. Razor/runtime cross-node behavior proven only
by live `/run`, never bUnit.