Joseph Doherty 0528353315 docs(redundancy): Phase 2 design — health-aware ServiceLevel (H3)

2026-06-15 12:33:09 -04:00

Phase 2 — Health-aware redundancy ServiceLevel (H3) — design

Status: approved 2026-06-15.
Parent roadmap: docs/plans/2026-06-15-stillpending-backlog-design.md (Phase 2).
Backlog item: H3 in stillpending.md §1.
Branch: feat/stillpending-phase-2-servicelevel off master 4bd7180e.
Classification: high-risk (actor model + redundancy correctness). Live /run on the 2-node rig is the acceptance gate; unit tests cannot prove the cross-node wiring (per project_redundancy_state_delivery).

Problem (H3)

The SDK's Server.ServiceLevel byte is driven solely by the coarse role map today (OpcUaPublishActor.HandleRedundancyStateChanged: Primary-leader→240, Primary→200, Secondary→100, Detached→0). The richer ServiceLevelCalculator.Compute(NodeHealthInputs) — which folds in DB reachability, an OPC-UA liveness probe, and a staleness signal — is never invoked in production. The two health producers are dead or half-wired:

DbHealthProbeActor — spawned per node, but only feeds /health/ready; its DbReachable never reaches the published byte.
PeerOpcUaProbeActor — never spawned anywhere; its OpcUaProbeResult has zero consumers.
NodeHealthInputs.Stale — no producer exists anywhere in the codebase.

Consequence: a DB-unreachable / OPC-UA-down primary still advertises its full role-based ServiceLevel, so redundant clients keep subscribing to a degraded node instead of failing over.

Goal

Make each driver node publish a ServiceLevel computed by ServiceLevelCalculator.Compute from its real local health, so an unhealthy node drops below its role-based level (and below a healthy peer's level, triggering client failover). Honor the already-documented tiers in docs/Redundancy.md (250 / 240 / 200 / 100 / 0) — do not reshape the calculator's truth table. No EF migration; no change to the RedundancyStateChanged / NodeRedundancyState message contract.

Locked decisions (from brainstorming 2026-06-15)

Compute site = local per-node (OpcUaPublishActor), not the admin RedundancyStateActor singleton. DB reachability is an inherently local fact; centralizing it would require a new cluster-wide health-broadcast contract for no benefit. The publish actor already runs per node, already subscribes to the redundancy-state topic, already holds the role snapshot, and already writes the byte.
Stale is derived from signal freshness, with DB-unreachable ⟹ stale. This makes the documented 200/100 graceful-degradation tiers reachable, so a node that loses its DB settles into 100 ("degraded, still serving cached values") rather than slamming to 0. The 0 tier stays reserved for "not a healthy cluster member / detached." A richer driver-input staleness signal is deferred (flagged, not built).
OpcUaProbeOk = peer-probes-me, reusing the existing PeerOpcUaProbeActor (finally spawned), closing the audit's "never spawned / zero consumers" gaps. Absence of a fresh peer result → true (benefit of the doubt: don't penalize a node for its peer being down); only an actively observed recent failure demotes. Single-node / no-peer cluster → true.

Architecture

All changes are per-driver-node, inside Runtime (WithOtOpcUaRuntimeActors). No ControlPlane / admin-singleton change. No new Commons message types.

WithOtOpcUaRuntimeActors (per driver node)
  ├─ DbHealthProbeActor            (exists; now ALSO consulted by the publish actor)
  ├─ PeerProbeSupervisor   (NEW)   watches cluster membership →
  │     └─ PeerOpcUaProbeActor(peer)  one child per OTHER driver member;
  │            publishes OpcUaProbeResult(peer, ok) on the redundancy-state topic
  └─ OpcUaPublishActor      (MODIFIED)
        subscribes redundancy-state topic  → RedundancyStateChanged  (role/leadership)
                                           → OpcUaProbeResult         (peer's verdict on ME)
        holds IActorRef dbHealthProbe      → Ask<DbHealthStatus> on a periodic HealthTick
        Cluster.SelfMember.Status          → MemberState (local)
        on any trigger:  ServiceLevelCalculator.Compute(inputs) → publish-if-changed (existing dedup)

Components

1. PeerProbeSupervisor (NEW, Runtime/Health/PeerProbeSupervisor.cs) — a small per-node actor whose only job is the probe lifecycle (kept separate so the pinned-dispatcher OpcUaPublishActor never parents network-probe children):

Subscribes to cluster membership (IMemberEvent / ReachabilityEvent) and/or the redundancy snapshot.
Maintains one PeerOpcUaProbeActor child per other driver-role member, (re)spawning on membership change, stopping children for departed members. Resolves the peer's OPC UA host from its node id (host:port → host) and the configured OPC UA port.
Children publish OpcUaProbeResult to the redundancy-state topic (existing behavior).
Spawned on the default dispatcher (not the OPC UA pinned one).
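
The membership reconciliation and host resolution described above can be sketched as a pure function, kept free of Akka so it could be unit-tested without a cluster. This is a minimal sketch, not the supervisor itself: PeerProbePlan, HostOf, and Reconcile are hypothetical names, and it assumes node ids are plain host:port strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; none of these names come from the codebase.
public static class PeerProbePlan
{
    // "host:port" -> "host". Splits on the last colon; a value with no
    // colon is returned unchanged. Assumes plain host:port node ids.
    public static string HostOf(string nodeId)
    {
        int idx = nodeId.LastIndexOf(':');
        return idx < 0 ? nodeId : nodeId.Substring(0, idx);
    }

    // Compares the peers that already have a probe child against the current
    // driver-role members and returns which children to spawn and which to stop.
    // The local node is never probed by itself.
    public static (IReadOnlyList<string> ToSpawn, IReadOnlyList<string> ToStop) Reconcile(
        string selfNodeId,
        IEnumerable<string> currentlyProbed,
        IEnumerable<string> driverMembers)
    {
        var wanted = new HashSet<string>(driverMembers.Where(m => m != selfNodeId));
        var have = new HashSet<string>(currentlyProbed);

        // Sorted only to make the result deterministic for tests.
        var toSpawn = wanted.Except(have).OrderBy(x => x, StringComparer.Ordinal).ToList();
        var toStop = have.Except(wanted).OrderBy(x => x, StringComparer.Ordinal).ToList();
        return (toSpawn, toStop);
    }
}
```

With made-up ids, a local node node-a:4053 that still holds a child for a departed node-c:4053 and sees members node-a:4053 and node-b:4053 would spawn a probe for node-b:4053 and stop the one for node-c:4053. A single-node membership yields two empty lists, which matches the "supervisor spawns no children" case.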

2. OpcUaPublishActor (MODIFIED) — replace the role-only switch in HandleRedundancyStateChanged with a RecomputeServiceLevel() that calls the calculator. New Props params (both optional, defaulted so test Props/harnesses keep working): IActorRef? dbHealthProbe, and the freshness windows / self-OPC-UA-port (injectable for tests).

New state: _dbReachable + _dbAsOfUtc (from the DB Ask), _lastSnapshotAsOfUtc, a per-peer map of (ok, asOfUtc) for results about my own node, and the current RedundancyStateChanged local entry.
New Receive<OpcUaProbeResult>: if msg.NodeId == _localNode, record (msg.Ok, now) for the reporting peer; recompute.
New Receive<HealthTick> (periodic, e.g. 5 s): dbHealthProbe.Ask<DbHealthStatus>(GetStatus, 1s).PipeTo(Self).
New Receive<DbHealthStatus>: cache _dbReachable + _dbAsOfUtc; recompute.
RecomputeServiceLevel() builds NodeHealthInputs and calls ServiceLevelCalculator.Compute, then Self.Tell(new ServiceLevelChanged(level)) (the existing handler dedups + publishes + emits the metric).

Input derivation (the load-bearing logic)

For the local node, on each recompute:

| `NodeHealthInputs` field | Source |
| --- | --- |
| `MemberState` | `Cluster.Get(system).SelfMember.Status` (local; most accurate) |
| `IsDriverRoleLeader` | local entry of the latest `RedundancyStateChanged` snapshot (`IsRoleLeaderForDriver`) |
| `DbReachable` | latest `DbHealthStatus.Reachable` from the local `DbHealthProbeActor` |
| `OpcUaProbeOk` | `true` if no fresh peer result about me, else the latest such result's `Ok` (only an actively-observed recent failure → `false`); single-node → `true` |
| `Stale` | `!DbReachable` OR `(now − _dbAsOfUtc) > StaleWindow` OR `(now − _lastSnapshotAsOfUtc) > StaleWindow` |

Guards (calculator does not model these):

If there is no local snapshot entry, or the local entry's Role == Detached, publish 0 directly and skip the calculator (a healthy detached node would otherwise compute 240 because the calculator only checks MemberState + the leader bonus). In steady state a node running OpcUaPublishActor always carries the driver role, so this is defensive — it preserves the current Detached→0 behavior during transitions.
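
The derivation rules and the guard above can be written as small pure functions. This is a sketch under assumptions, not the actor's code: HealthInputDerivation and its method names are hypothetical, the probe freshness window is passed separately from StaleWindow because the document only says "freshness windows", and the Detached check is reduced to two booleans since the role type is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; names are illustrative only.
public static class HealthInputDerivation
{
    // Stale = !DbReachable OR the DB sample is too old OR the snapshot is too old.
    public static bool IsStale(
        bool dbReachable,
        DateTime dbAsOfUtc,
        DateTime lastSnapshotAsOfUtc,
        DateTime nowUtc,
        TimeSpan staleWindow)
        => !dbReachable
           || nowUtc - dbAsOfUtc > staleWindow
           || nowUtc - lastSnapshotAsOfUtc > staleWindow;

    // Benefit of the doubt: true unless the most recent FRESH peer result
    // about this node is an explicit failure. An empty or fully aged-out
    // set (single node, departed peer) therefore yields true.
    public static bool OpcUaProbeOk(
        IEnumerable<(bool Ok, DateTime AsOfUtc)> peerResultsAboutMe,
        DateTime nowUtc,
        TimeSpan probeFreshWindow)
    {
        var fresh = peerResultsAboutMe
            .Where(r => nowUtc - r.AsOfUtc <= probeFreshWindow)
            .ToList();
        if (fresh.Count == 0) return true;
        return fresh.OrderByDescending(r => r.AsOfUtc).First().Ok;
    }

    // Guard: publish 0 directly and skip the calculator when there is no
    // local snapshot entry or the local role is Detached.
    public static bool MustPublishZero(bool hasLocalEntry, bool localRoleIsDetached)
        => !hasLocalEntry || localRoleIsDetached;
}
```

Passing nowUtc in rather than reading the clock inside the functions is a deliberate choice in this sketch: it keeps the freshness rules testable with short, fixed windows.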

Resulting ServiceLevel truth table (now fully reachable)

| Node condition | Byte | How reached |
| --- | --- | --- |
| Not a healthy member / detached | 0 | `MemberState` not Up/Joining, or Detached guard |
| DB unreachable, sustained | 100 | `DbReachable=false ⇒ Stale=true` → `(false,_,true)` |
| DB reachable but signals stale | 200 | `(true,_,true)` |
| Healthy follower (DB ok + probe ok + fresh) | 240 | `(true,true,false)` |
| Healthy leader | 250 | follower 240 + `IsDriverRoleLeader` +10 |
| DB ok + fresh but peer reports me unreachable | 0 | `(true,false,false)` → falls through |

Tuples in the last column are `(DbReachable, OpcUaProbeOk, Stale)`; `_` means the value does not matter.

Behavior change to be aware of (already documented as the target): a healthy Secondary moves 100 → 240. Both healthy nodes sit at 240/250 with the leader preferred by the +10 bonus; role-leadership is stable so the bytes do not flap. The "DB ok + probe-fail" → 0 edge is intentionally sharp but debounced by the OpcUaProbeOk freshness rule (a missed or aged-out probe result does not flip it; only a fresh, actively observed failure does).
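
For reference while reviewing, the truth table can be restated as a pure function. This is a reconstruction from the table, not the body of ServiceLevelCalculator.Compute: MemberStatusSketch, HealthInputsSketch, and ServiceLevelSketch are hypothetical stand-ins for the Akka member status and NodeHealthInputs, and the sketch assumes the +10 leader bonus applies only to the healthy 240 tier, which is what the five documented tiers imply.

```csharp
using System;

// Stand-in for the cluster member status; hypothetical, for this sketch only.
public enum MemberStatusSketch { Joining, Up, Leaving, Exiting, Down, Removed }

// Stand-in for NodeHealthInputs, using the five fields named in the design.
public readonly record struct HealthInputsSketch(
    MemberStatusSketch MemberState,
    bool IsDriverRoleLeader,
    bool DbReachable,
    bool OpcUaProbeOk,
    bool Stale);

public static class ServiceLevelSketch
{
    public static byte Compute(HealthInputsSketch i)
    {
        // Not a healthy cluster member: 0.
        bool healthyMember =
            i.MemberState is MemberStatusSketch.Up or MemberStatusSketch.Joining;
        if (!healthyMember) return 0;

        // Stale signals: 200 if the DB is still reachable, 100 if not.
        // The probe verdict is ignored in both stale tiers.
        if (i.Stale) return i.DbReachable ? (byte)200 : (byte)100;

        // Fresh and fully healthy: 240, plus 10 for the driver role leader.
        if (i.DbReachable && i.OpcUaProbeOk)
            return i.IsDriverRoleLeader ? (byte)250 : (byte)240;

        // Everything else falls through, including (true,false,false).
        return 0;
    }
}
```

Under these assumptions, a healthy leader yields 250, a healthy follower 240, a node whose DB is unreachable (and therefore stale) 100, and a fresh node with a failing peer probe 0.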

Error handling / edge cases

DB Ask timeout → treat as DbReachable=false for that cycle (fail-safe demote); the next tick re-Asks.
No peer (single node) → OpcUaProbeOk=true; supervisor spawns no children.
Peer departs → its prior result ages out of the freshness window → OpcUaProbeOk reverts to true (we don't penalize a node for its peer's absence). Supervisor stops the child for the departed member.
Cluster forming / role-leader not yet resolved → no local snapshot entry yet → Detached guard → 0, same as today, until the first snapshot arrives.
HealthTick before the first snapshot → no local entry → 0 (no spurious 240 before role is known).
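
The fail-safe in the first case (a timed-out DB query counts as unreachable for that cycle) can be illustrated outside the actor with plain tasks. In the actor the timeout would come from the Ask itself and the result would be piped back to Self; this sketch only models the mapping from timeout or fault to false. DbAskFailSafe and ReachableOrFalse are hypothetical names, and Task.WaitAsync requires .NET 6 or later.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper illustrating the fail-safe demotion rule.
public static class DbAskFailSafe
{
    // Awaits a reachability query. A timeout or any fault is reported as
    // "not reachable" for this cycle; the next tick simply asks again.
    public static async Task<bool> ReachableOrFalse(Task<bool> query, TimeSpan timeout)
    {
        try
        {
            // WaitAsync throws TimeoutException if the query does not
            // complete within the timeout.
            return await query.WaitAsync(timeout);
        }
        catch (Exception)
        {
            return false;
        }
    }
}
```

Collapsing every failure mode into false keeps the recompute path simple: the node demotes to the 100 tier through the DbReachable=false ⟹ Stale rule instead of needing a separate "unknown" state.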

Testing strategy (xUnit + Shouldly, TDD; NO bUnit)

Unit-testable (the wiring that can be proven in a TestKit):

OpcUaPublishActor input-derivation: feed a RedundancyStateChanged (Primary-leader) + a cached DbHealthStatus(reachable) + an OpcUaProbeResult(me, ok) → assert published byte 250; flip DB unreachable → 100; stale snapshot → 200; healthy secondary → 240; Detached entry → 0; probe Ok=false about me with DB-ok+fresh → 0. Use PropsForTests with a broadcast/serviceLevel capture + injected dbHealthProbe test ref + short windows.
OpcUaProbeOk freshness/debounce: a single stale/absent result → true; a recent explicit false → false.
PeerProbeSupervisor: on a membership snapshot with one peer driver member, spawns exactly one child targeting that peer; on the peer leaving, stops it; single-node → no children. (TestKit, with a probe factory injected so no real TCP is attempted.)
Reuse existing ServiceLevelCalculatorTests unchanged (the pure function is not modified).

Not unit-testable (acceptance gate): the live cross-node ServiceLevel on the 2-node rig — proven by /run (user-driven). A DB-unreachable primary must be observed dropping below the secondary, and the secondary taking over as the higher-ServiceLevel endpoint. See project_redundancy_state_delivery.

Verification

dotnet build clean (production projects are TreatWarningsAsErrors); full dotnet test green.
High-risk review chain (serial spec → code → final integration).
Live /run on the 2-node / docker-dev rig (user-driven; agent does not sign in): confirm steady-state 250/240, then kill a node's DB reachability and confirm its ServiceLevel drops to 100 and the peer is preferred.
Update docs/Redundancy.md to mark the calculator path wired (remove the "not yet wired into the live driver publish path" caveats) and document the freshness/peer-probe sourcing.

Alternatives considered

Central compute in RedundancyStateActor — rejected: the admin singleton can't see each node's local DB health without a new cluster-wide health-broadcast contract; heavier, bigger blast radius, no benefit.
Local self-probe (TCP to own localhost:4840 / SDK server-state) instead of the peer-probe — simpler (no membership-tracking supervisor) and a defensible "is my OPC UA serving" signal, but leaves PeerOpcUaProbeActor dead and misses the cross-node vantage. Rejected to honor the locked design and close the audit's dead-code gap; recorded here so the trade-off is explicit if the supervisor proves heavier than valued (then it becomes a follow-up swap, not a redesign).
Reshape the calculator to role-baseline-demoted-by-health (keep the wide 240-vs-100 gap; health only lowers) — rejected: the calculator's 250/240/200/100/0 tiers are already approved, documented, and tested; reshaping re-opens a settled decision. The +10 leader bonus already preserves client preference for the primary.

Hard constraints (carried from the roadmap)

NO Configuration entity / EF migration.
Stage by path: never git add . and never stage sql_login.txt, src/Server/.../Host/pki/, pending.md, current.md, docker-dev/docker-compose.yml, or stillpending.md.
Never echo or commit secrets.
No force-push, no --no-verify.
Razor/runtime cross-node behavior is proven only by live /run, never bUnit.
