Phase 2 — Health-aware redundancy ServiceLevel (H3) — design
Status: approved 2026-06-15. Parent roadmap: docs/plans/2026-06-15-stillpending-backlog-design.md (Phase 2). Backlog item H3 in stillpending.md §1. Branch: feat/stillpending-phase-2-servicelevel off master 4bd7180e. Classification: high-risk (actor model + redundancy correctness). Live /run on the 2-node rig is the acceptance gate; unit tests cannot prove the cross-node wiring (per project_redundancy_state_delivery).
Problem (H3)
The SDK's Server.ServiceLevel byte is driven solely by the coarse role map today
(OpcUaPublishActor.HandleRedundancyStateChanged: Primary-leader→240, Primary→200, Secondary→100,
Detached→0). The richer ServiceLevelCalculator.Compute(NodeHealthInputs) — which folds in DB
reachability, an OPC-UA liveness probe, and a staleness signal — is never invoked in production.
The two health producers are dead or half-wired:
- DbHealthProbeActor: spawned per node, but only feeds /health/ready; its DbReachable never reaches the published byte.
- PeerOpcUaProbeActor: never spawned anywhere; its OpcUaProbeResult has zero consumers.
- NodeHealthInputs.Stale: no producer exists anywhere in the codebase.
Consequence: a DB-unreachable / OPC-UA-down primary still advertises its full role-based ServiceLevel, so redundant clients keep subscribing to a degraded node instead of failing over.
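To make the gap concrete, here is a minimal sketch of the role-only mapping described above, written in Python for brevity rather than the project's C#. The function name and the unused db_reachable parameter are hypothetical; only the four byte values come from the text.

```python
def role_only_level(role: str, is_leader: bool, db_reachable: bool = True) -> int:
    """Role-only mapping as described in the text: health is never consulted.

    db_reachable is accepted but deliberately ignored, to show that a
    DB-unreachable primary still advertises its full role-based level.
    """
    if role == "Primary":
        return 240 if is_leader else 200
    if role == "Secondary":
        return 100
    return 0  # Detached (and, by assumption, anything unrecognised)


# A primary leader that has lost its DB still reports 240.
print(role_only_level("Primary", True, db_reachable=False))
```

The healthy and degraded cases are indistinguishable to a client, which is the failure mode H3 sets out to close.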
Goal
Make each driver node publish a ServiceLevel computed by ServiceLevelCalculator.Compute from its real
local health, so an unhealthy node drops below its role-based level (and below a healthy peer's level,
triggering client failover). Honor the already-documented tiers in docs/Redundancy.md
(250 / 240 / 200 / 100 / 0) — do not reshape the calculator's truth table. No EF migration; no change to
the RedundancyStateChanged / NodeRedundancyState message contract.
Locked decisions (from brainstorming 2026-06-15)
- Compute site = local per-node (OpcUaPublishActor), not the admin RedundancyStateActor singleton. DB reachability is an inherently local fact; centralizing it would require a new cluster-wide health-broadcast contract for no benefit. The publish actor already runs per node, already subscribes to the redundancy-state topic, already holds the role snapshot, and already writes the byte.
- Stale is derived from signal freshness, with DB-unreachable ⟹ stale. This makes the documented 200/100 graceful-degradation tiers reachable, so a node that loses its DB settles into 100 ("degraded, still serving cached values") rather than slamming to 0. The 0 tier stays reserved for "not a healthy cluster member / detached." A richer driver-input staleness signal is deferred (flagged, not built).
- OpcUaProbeOk = peer-probes-me, reusing the existing PeerOpcUaProbeActor (finally spawned), closing the audit's "never spawned / zero consumers" gaps. Absence of a fresh peer result → true (benefit of the doubt: don't penalize a node for its peer being down); only an actively observed recent failure demotes. Single-node / no-peer cluster → true.
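The OpcUaProbeOk rule in the last decision can be sketched as a small pure function. This is an illustrative Python sketch, not the project's C#; the function name, the dictionary shape, and the 15-second window are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def opcua_probe_ok(results, now, fresh_window):
    """Derive OpcUaProbeOk from peer verdicts about the local node.

    results: {peer_id: (ok, as_of_utc)}, holding only results about this node.
    Returns True when no fresh result exists (benefit of the doubt),
    otherwise the Ok flag of the most recent fresh result.
    """
    fresh = [(as_of, ok) for ok, as_of in results.values()
             if now - as_of <= fresh_window]
    if not fresh:
        return True  # single node, peer down, or every result aged out
    latest_as_of, latest_ok = max(fresh, key=lambda r: r[0])
    return latest_ok


# Hypothetical values for illustration only.
NOW = datetime(2026, 6, 15, 12, 0, 0, tzinfo=timezone.utc)
WINDOW = timedelta(seconds=15)
```

Only a failure observed inside the window demotes the node; an old failure, or no result at all, leaves the input at true.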
Architecture
All changes are per-driver-node, inside Runtime (WithOtOpcUaRuntimeActors). No ControlPlane /
admin-singleton change. No new Commons message types.
WithOtOpcUaRuntimeActors (per driver node)
 ├─ DbHealthProbeActor              (exists; now ALSO consulted by the publish actor)
 ├─ PeerProbeSupervisor (NEW)       watches cluster membership →
 │    └─ PeerOpcUaProbeActor(peer)  one child per OTHER driver member;
 │                                  publishes OpcUaProbeResult(peer, ok) on the redundancy-state topic
 └─ OpcUaPublishActor (MODIFIED)
      subscribes redundancy-state topic → RedundancyStateChanged (role/leadership)
                                        → OpcUaProbeResult (peer's verdict on ME)
      holds IActorRef dbHealthProbe     → Ask<DbHealthStatus> on a periodic HealthTick
      Cluster.SelfMember.Status         → MemberState (local)
      on any trigger: ServiceLevelCalculator.Compute(inputs) → publish-if-changed (existing dedup)
Components
1. PeerProbeSupervisor (NEW, Runtime/Health/PeerProbeSupervisor.cs) — a small per-node actor whose
only job is the probe lifecycle (kept separate so the pinned-dispatcher OpcUaPublishActor never parents
network-probe children):
- Subscribes to cluster membership (IMemberEvent/ReachabilityEvent) and/or the redundancy snapshot.
- Maintains one PeerOpcUaProbeActor child per other driver-role member, (re)spawning on membership change, stopping children for departed members. Resolves the peer's OPC UA host from its node id (host:port → host) and the configured OPC UA port.
- Children publish OpcUaProbeResult to the redundancy-state topic (existing behavior).
- Spawned on the default dispatcher (not the OPC UA pinned one).
2. OpcUaPublishActor (MODIFIED) — replace the role-only switch in HandleRedundancyStateChanged
with a RecomputeServiceLevel() that calls the calculator. New Props params (both optional, defaulted so
test Props/harnesses keep working): IActorRef? dbHealthProbe, and the freshness windows /
self-OPC-UA-port (injectable for tests).
- New state: _dbReachable + _dbAsOfUtc (from the DB Ask), _lastSnapshotAsOfUtc, a per-peer map of (ok, asOfUtc) for results about my own node, and the current RedundancyStateChanged local entry.
- New Receive<OpcUaProbeResult>: if msg.NodeId == _localNode, record (msg.Ok, now) for the reporting peer; recompute.
- New Receive<HealthTick> (periodic, e.g. 5 s): dbHealthProbe.Ask<DbHealthStatus>(GetStatus, 1s).PipeTo(Self).
- New Receive<DbHealthStatus>: cache _dbReachable + _dbAsOfUtc; recompute.
- RecomputeServiceLevel() builds NodeHealthInputs and calls ServiceLevelCalculator.Compute, then Self.Tell(new ServiceLevelChanged(level)) (the existing handler dedups + publishes + emits the metric).
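The probe lifecycle that component 1 assigns to PeerProbeSupervisor reduces to a set reconciliation plus the host:port → host resolution. The following is a language-neutral Python sketch of that logic only; the function names and the node ids (including port 4053) are hypothetical, and the real actor would spawn and stop children rather than return sets.

```python
def peer_host(node_id: str) -> str:
    """Resolve a 'host:port' node id to its host (split on the last colon)."""
    host, sep, _port = node_id.rpartition(":")
    return host if sep else node_id


def reconcile(children: set, driver_members: set, self_id: str):
    """Return (to_spawn, to_stop) so that exactly one probe child exists
    per OTHER driver-role member after a membership change."""
    wanted = driver_members - {self_id}
    return wanted - children, children - wanted


# Two-node cluster seen from node-a: one child for node-b is needed.
print(reconcile(set(), {"node-a:4053", "node-b:4053"}, "node-a:4053"))
```

On a single-node cluster both sets come back empty, which matches the "supervisor spawns no children" case.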
Input derivation (the load-bearing logic)
For the local node, on each recompute:
| NodeHealthInputs field | Source |
|---|---|
| MemberState | Cluster.Get(system).SelfMember.Status (local; most accurate) |
| IsDriverRoleLeader | local entry of the latest RedundancyStateChanged snapshot (IsRoleLeaderForDriver) |
| DbReachable | latest DbHealthStatus.Reachable from the local DbHealthProbeActor |
| OpcUaProbeOk | true if no fresh peer result about me, else the latest such result's Ok (only an actively-observed recent failure → false); single-node → true |
| Stale | !DbReachable OR (now − _dbAsOfUtc) > StaleWindow OR (now − _lastSnapshotAsOfUtc) > StaleWindow |
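The Stale row is the only input with a three-clause formula, so it is worth spelling out. This Python sketch transcribes the expression from the table; the function name, parameter names, and the 15-second StaleWindow are assumptions for illustration, not values from the design.

```python
from datetime import datetime, timedelta, timezone


def is_stale(db_reachable, db_as_of, snapshot_as_of, now, stale_window):
    """Stale = !DbReachable OR db signal too old OR snapshot too old."""
    return (
        not db_reachable
        or now - db_as_of > stale_window
        or now - snapshot_as_of > stale_window
    )


# Hypothetical values for illustration only.
NOW = datetime(2026, 6, 15, 12, 0, 0, tzinfo=timezone.utc)
WINDOW = timedelta(seconds=15)
RECENT = NOW - timedelta(seconds=3)
OLD = NOW - timedelta(seconds=40)
```

Because the first clause is !DbReachable, an unreachable DB always yields Stale=true, which is what makes the 100 tier reachable.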
Guards (calculator does not model these):
- If there is no local snapshot entry, or the local entry's Role == Detached, publish 0 directly and skip the calculator (a healthy detached node would otherwise compute 240 because the calculator only checks MemberState + the leader bonus). In steady state a node running OpcUaPublishActor always carries the driver role, so this guard is defensive: it preserves the current Detached→0 behavior during transitions.
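The guard sits in front of the calculator rather than inside it. A minimal Python sketch of that ordering, with a hypothetical function name and the calculator passed in as a callable so the sketch stays independent of its real signature:

```python
from typing import Callable, Optional


def guarded_level(local_role: Optional[str], compute: Callable[[], int]) -> int:
    """Publish 0 without consulting the calculator when the local node has
    no snapshot entry yet (None) or is Detached; otherwise defer to it."""
    if local_role is None or local_role == "Detached":
        return 0
    return compute()


# A healthy-but-detached node would compute 240; the guard forces 0.
print(guarded_level("Detached", lambda: 240))
```

Keeping the check outside the calculator leaves the pure function and its existing tests untouched.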
Resulting ServiceLevel truth table (now fully reachable)
| Node condition | Byte | How reached |
|---|---|---|
| Not a healthy member / detached | 0 | MemberState not Up/Joining, or Detached guard |
| DB unreachable, sustained | 100 | DbReachable=false ⇒ Stale=true → (false,_,true) |
| DB reachable but signals stale | 200 | (true,_,true) |
| Healthy follower (DB ok + probe ok + fresh) | 240 | (true,true,false) |
| Healthy leader | 250 | follower 240 + IsDriverRoleLeader +10 |
| DB ok + fresh but peer reports me unreachable | 0 | (true,false,false) → falls through |
Behavior change to be aware of (already documented as the target): a healthy Secondary moves
100 → 240. Both healthy nodes sit at 240/250 with the leader preferred by the +10 bonus; role-leadership is
stable so the bytes do not flap. The "DB ok + probe-fail" → 0 edge is intentionally sharp but debounced
by the OpcUaProbeOk freshness rule (a single missed probe does not flip it; only a sustained, actively
observed failure does).
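The truth table can be restated as a short function for checking the tiers by hand. This Python sketch mirrors the table above and is not the ServiceLevelCalculator source. Two points are assumptions: the +10 leader bonus applies only to the healthy tier (inferred from the documented 250/240/200/100/0 set), and (DB unreachable, not stale) falls through to 0, a combination the design never produces because DB-unreachable implies stale.

```python
def compute_level(member_ok: bool, is_leader: bool,
                  db_reachable: bool, probe_ok: bool, stale: bool) -> int:
    """Tiers as laid out in the truth table (illustrative, not the real calculator)."""
    if not member_ok:              # MemberState not Up/Joining
        return 0
    if stale:                      # degraded tiers ignore the probe verdict
        return 200 if db_reachable else 100
    if db_reachable and probe_ok:  # healthy tier, leader preferred by +10
        return 240 + (10 if is_leader else 0)
    return 0                       # e.g. DB ok + fresh but peer reports failure


# Healthy leader versus a leader that has lost its DB.
print(compute_level(True, True, True, True, False),
      compute_level(True, True, False, True, True))
```

With both nodes healthy the pair sits at 250/240; once the leader's DB becomes unreachable it drops to 100, below the healthy follower's 240, so redundant clients prefer the follower.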
Error handling / edge cases
- DB Ask timeout → treat as DbReachable=false for that cycle (fail-safe demote); the next tick re-Asks.
- No peer (single node) → OpcUaProbeOk=true; supervisor spawns no children.
- Peer departs → its prior result ages out of the freshness window → OpcUaProbeOk reverts to true (we don't penalize a node for its peer's absence). Supervisor stops the child for the departed member.
- Cluster forming / role-leader not yet resolved → no local snapshot entry yet → Detached guard → 0, same as today, until the first snapshot arrives.
- HealthTick before the first snapshot → no local entry → 0 (no spurious 240 before role is known).
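The fail-safe handling of a DB Ask timeout can be illustrated with Python's asyncio standing in for the actor Ask. This is an analogy, not the Akka.NET code; ask_db_reachable and the two stub probes are hypothetical names.

```python
import asyncio


async def ask_db_reachable(ask, timeout_s: float = 1.0) -> bool:
    """Await the probe's answer; a timeout counts as unreachable for this
    cycle only, and the next tick simply asks again."""
    try:
        return await asyncio.wait_for(ask(), timeout_s)
    except asyncio.TimeoutError:
        return False  # fail-safe demote


async def responsive_probe() -> bool:
    return True


async def hung_probe() -> bool:
    await asyncio.sleep(10)  # never answers within the timeout
    return True


print(asyncio.run(ask_db_reachable(hung_probe, timeout_s=0.05)))
```

Treating the timeout as a negative result keeps the node from advertising a healthy byte on the strength of an answer it never received.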
Testing strategy (xUnit + Shouldly, TDD; NO bUnit)
Unit-testable (the wiring that can be proven in a TestKit):
- OpcUaPublishActor input-derivation: feed a RedundancyStateChanged (Primary-leader) + a cached DbHealthStatus(reachable) + an OpcUaProbeResult(me, ok) → assert published byte 250; flip DB unreachable → 100; stale snapshot → 200; healthy secondary → 240; Detached entry → 0; probe Ok=false about me with DB-ok+fresh → 0. Use PropsForTests with a broadcast/serviceLevel capture + injected dbHealthProbe test ref + short windows.
- OpcUaProbeOk freshness/debounce: a single stale/absent result → true; a recent explicit false → false.
- PeerProbeSupervisor: on a membership snapshot with one peer driver member, spawns exactly one child targeting that peer; on the peer leaving, stops it; single-node → no children. (TestKit, with a probe factory injected so no real TCP is attempted.)
- Reuse existing ServiceLevelCalculatorTests unchanged (the pure function is not modified).
Not unit-testable (acceptance gate): the live cross-node ServiceLevel on the 2-node rig — proven by
/run (user-driven). A DB-unreachable primary must be observed dropping below the secondary, and the
secondary taking over as the higher-ServiceLevel endpoint. See project_redundancy_state_delivery.
Verification
- dotnet build clean (production projects are TreatWarningsAsErrors); full dotnet test green.
- High-risk review chain (serial spec → code → final integration).
- Live /run on the 2-node / docker-dev rig (user-driven; agent does not sign in): confirm steady-state 250/240, then kill a node's DB reachability and confirm its ServiceLevel drops to 100 and the peer is preferred.
- Update docs/Redundancy.md to mark the calculator path wired (remove the "not yet wired into the live driver publish path" caveats) and document the freshness/peer-probe sourcing.
Alternatives considered
- Central compute in RedundancyStateActor. Rejected: the admin singleton can't see each node's local DB health without a new cluster-wide health-broadcast contract; heavier, bigger blast radius, no benefit.
- Local self-probe (TCP to own localhost:4840 / SDK server-state) instead of the peer-probe. Simpler (no membership-tracking supervisor) and a defensible "is my OPC UA serving" signal, but leaves PeerOpcUaProbeActor dead and misses the cross-node vantage. Rejected to honor the locked design and close the audit's dead-code gap; recorded here so the trade-off is explicit if the supervisor proves heavier than valued (then it becomes a follow-up swap, not a redesign).
- Reshape the calculator to role-baseline-demoted-by-health (keep the wide 240-vs-100 gap; health only lowers). Rejected: the calculator's 250/240/200/100/0 tiers are already approved, documented, and tested; reshaping re-opens a settled decision. The +10 leader bonus already preserves client preference for the primary.
Hard constraints (carried from the roadmap)
NO Configuration entity / EF migration. Stage by path — never git add .; never stage sql_login.txt,
src/Server/.../Host/pki/, pending.md, current.md, docker-dev/docker-compose.yml, stillpending.md.
Never echo or commit secrets. No force-push, no --no-verify. Razor/runtime cross-node behavior proven only
by live /run, never bUnit.