Phase 2 — Health-aware redundancy ServiceLevel (H3) — design

Status: approved 2026-06-15. Parent roadmap: docs/plans/2026-06-15-stillpending-backlog-design.md (Phase 2). Backlog item H3 in stillpending.md §1. Branch: feat/stillpending-phase-2-servicelevel off master 4bd7180e. Classification: high-risk (actor model + redundancy correctness). Live /run on the 2-node rig is the acceptance gate; unit tests cannot prove the cross-node wiring (per project_redundancy_state_delivery).

Problem (H3)

The SDK's Server.ServiceLevel byte is driven solely by the coarse role map today (OpcUaPublishActor.HandleRedundancyStateChanged: Primary-leader→240, Primary→200, Secondary→100, Detached→0). The richer ServiceLevelCalculator.Compute(NodeHealthInputs) — which folds in DB reachability, an OPC-UA liveness probe, and a staleness signal — is never invoked in production. The two health producers are dead or half-wired:

  • DbHealthProbeActor: spawned per node, but it only feeds /health/ready; its DbReachable never reaches the published byte.
  • PeerOpcUaProbeActor: never spawned anywhere; its OpcUaProbeResult has zero consumers.
  • NodeHealthInputs.Stale: no producer exists anywhere in the codebase.

Consequence: a DB-unreachable / OPC-UA-down primary still advertises its full role-based ServiceLevel, so redundant clients keep subscribing to a degraded node instead of failing over.

Goal

Make each driver node publish a ServiceLevel computed by ServiceLevelCalculator.Compute from its real local health, so an unhealthy node drops below its role-based level (and below a healthy peer's level, triggering client failover). Honor the already-documented tiers in docs/Redundancy.md (250 / 240 / 200 / 100 / 0) — do not reshape the calculator's truth table. No EF migration; no change to the RedundancyStateChanged / NodeRedundancyState message contract.

Locked decisions (from brainstorming 2026-06-15)

  1. Compute site = local per-node (OpcUaPublishActor), not the admin RedundancyStateActor singleton. DB reachability is an inherently local fact; centralizing it would require a new cluster-wide health-broadcast contract for no benefit. The publish actor already runs per node, already subscribes to the redundancy-state topic, already holds the role snapshot, and already writes the byte.
  2. Stale is derived from signal freshness, with DB-unreachable ⟹ stale. This makes the documented 200/100 graceful-degradation tiers reachable, so a node that loses its DB settles into 100 ("degraded, still serving cached values") rather than slamming to 0. The 0 tier stays reserved for "not a healthy cluster member / detached." A richer driver-input staleness signal is deferred (flagged, not built).
  3. OpcUaProbeOk = peer-probes-me, reusing the existing PeerOpcUaProbeActor (finally spawned), closing the audit's "never spawned / zero consumers" gaps. Absence of a fresh peer result → true (benefit of the doubt: don't penalize a node for its peer being down); only an actively observed recent failure demotes. Single-node / no-peer cluster → true.

Architecture

All changes are per-driver-node, inside Runtime (WithOtOpcUaRuntimeActors). No ControlPlane / admin-singleton change. No new Commons message types.

WithOtOpcUaRuntimeActors (per driver node)
  ├─ DbHealthProbeActor            (exists; now ALSO consulted by the publish actor)
  ├─ PeerProbeSupervisor   (NEW)   watches cluster membership →
  │     └─ PeerOpcUaProbeActor(peer)  one child per OTHER driver member;
  │            publishes OpcUaProbeResult(peer, ok) on the redundancy-state topic
  └─ OpcUaPublishActor      (MODIFIED)
        subscribes redundancy-state topic  → RedundancyStateChanged  (role/leadership)
                                           → OpcUaProbeResult         (peer's verdict on ME)
        holds IActorRef dbHealthProbe      → Ask<DbHealthStatus> on a periodic HealthTick
        Cluster.SelfMember.Status          → MemberState (local)
        on any trigger:  ServiceLevelCalculator.Compute(inputs) → publish-if-changed (existing dedup)

Components

1. PeerProbeSupervisor (NEW, Runtime/Health/PeerProbeSupervisor.cs) — a small per-node actor whose only job is the probe lifecycle (kept separate so the pinned-dispatcher OpcUaPublishActor never parents network-probe children):

  • Subscribes to cluster membership (IMemberEvent / ReachabilityEvent) and/or the redundancy snapshot.
  • Maintains one PeerOpcUaProbeActor child per other driver-role member, (re)spawning on membership change, stopping children for departed members. Resolves the peer's OPC UA host from its node id (host:port → host) and the configured OPC UA port.
  • Children publish OpcUaProbeResult to the redundancy-state topic (existing behavior).
  • Spawned on the default dispatcher (not the OPC UA pinned one).
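
The child lifecycle described above amounts to a set difference between the children that exist and the peers that should be probed. The following is a minimal, language-neutral sketch in Python, not the C# actor itself. The helper names `resolve_opcua_endpoint` and `reconcile_children` are hypothetical, and node ids are assumed to be plain `host:port` strings without IPv6 literals.

```python
def resolve_opcua_endpoint(node_id: str, opcua_port: int) -> tuple[str, int]:
    """Drop the cluster port from a 'host:port' node id and pair the host
    with the configured OPC UA port. Assumes no IPv6 literals."""
    host = node_id.rsplit(":", 1)[0]
    return host, opcua_port


def reconcile_children(current: set[str], driver_members: set[str], self_node: str):
    """Return (to_spawn, to_stop) so that exactly one probe child exists
    per OTHER driver-role member."""
    wanted = driver_members - {self_node}
    to_spawn = sorted(wanted - current)   # peers that joined
    to_stop = sorted(current - wanted)    # peers that departed
    return to_spawn, to_stop
```

On a single-node cluster `wanted` is empty, so nothing is spawned and any leftover child is stopped, which matches the "single-node → no children" rule.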

2. OpcUaPublishActor (MODIFIED) — replace the role-only switch in HandleRedundancyStateChanged with a RecomputeServiceLevel() that calls the calculator. New Props params (both optional, defaulted so test Props/harnesses keep working): IActorRef? dbHealthProbe, and the freshness windows / self-OPC-UA-port (injectable for tests).

  • New state: _dbReachable + _dbAsOfUtc (from the DB Ask), _lastSnapshotAsOfUtc, a per-peer map of (ok, asOfUtc) for results about my own node, and the current RedundancyStateChanged local entry.
  • New Receive<OpcUaProbeResult>: if msg.NodeId == _localNode, record (msg.Ok, now) for the reporting peer; recompute.
  • New Receive<HealthTick> (periodic, e.g. 5 s): dbHealthProbe.Ask<DbHealthStatus>(GetStatus, 1s).PipeTo(Self).
  • New Receive<DbHealthStatus>: cache _dbReachable + _dbAsOfUtc; recompute.
  • RecomputeServiceLevel() builds NodeHealthInputs and calls ServiceLevelCalculator.Compute, then Self.Tell(new ServiceLevelChanged(level)) (the existing handler dedups + publishes + emits the metric).
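
How the probe-result filter and the existing publish-if-changed step fit together can be illustrated with a small state object. This is a sketch only, written in Python rather than Akka.NET C#. The class `PublishState`, its method names, and the `reporter` argument are hypothetical, and the real `OpcUaProbeResult` shape may differ.

```python
from datetime import datetime, timezone


class PublishState:
    def __init__(self, local_node, publish):
        self.local_node = local_node
        self.publish = publish        # callback standing in for the SDK write
        self.peer_results = {}        # reporter -> (ok, as_of_utc)
        self.last_published = None

    def on_probe_result(self, target_node, reporter, ok, now=None):
        """Record a verdict only when it is about this node."""
        if target_node != self.local_node:
            return False              # a verdict about some other node: ignore
        self.peer_results[reporter] = (ok, now or datetime.now(timezone.utc))
        return True                   # the caller would recompute here

    def on_service_level_changed(self, level):
        """Publish-if-changed: suppress repeats of the same byte."""
        if level == self.last_published:
            return False
        self.last_published = level
        self.publish(level)
        return True
```

Because every trigger funnels through the same dedup step, a periodic tick that recomputes an unchanged level produces no extra publish.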

Input derivation (the load-bearing logic)

For the local node, on each recompute:

| NodeHealthInputs field | Source |
|---|---|
| MemberState | Cluster.Get(system).SelfMember.Status (local; most accurate) |
| IsDriverRoleLeader | local entry of the latest RedundancyStateChanged snapshot (IsRoleLeaderForDriver) |
| DbReachable | latest DbHealthStatus.Reachable from the local DbHealthProbeActor |
| OpcUaProbeOk | true if no fresh peer result about me, else the latest such result's Ok (only an actively-observed recent failure → false); single-node → true |
| Stale | !DbReachable OR (now - _dbAsOfUtc) > StaleWindow OR (now - _lastSnapshotAsOfUtc) > StaleWindow |
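
The two derived inputs, OpcUaProbeOk and Stale, can be sketched as pure functions of cached timestamps. This is an illustrative Python sketch, not the C# implementation. The function names are hypothetical, and the window values used in the usage lines are placeholders because this document does not fix them.

```python
from datetime import datetime, timedelta, timezone


def derive_opcua_probe_ok(peer_results, now, fresh_window):
    """peer_results maps reporter -> (ok, as_of). Only fresh results count;
    with none fresh (or no peers at all) the node gets the benefit of the doubt."""
    fresh = [(as_of, ok) for ok, as_of in peer_results.values()
             if now - as_of <= fresh_window]
    if not fresh:
        return True
    return max(fresh, key=lambda r: r[0])[1]   # Ok of the most recent fresh result


def derive_stale(db_reachable, db_as_of, snapshot_as_of, now, stale_window):
    """DB-unreachable implies stale; otherwise either signal aging out does."""
    return (not db_reachable
            or now - db_as_of > stale_window
            or now - snapshot_as_of > stale_window)


now = datetime(2026, 6, 15, 12, 0, tzinfo=timezone.utc)
window = timedelta(seconds=15)                  # placeholder value
old_failure = {"peer": (False, now - timedelta(seconds=60))}
print(derive_opcua_probe_ok(old_failure, now, window))   # aged out, so True
```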

Guards (calculator does not model these):

  • If there is no local snapshot entry, or the local entry's Role == Detached, publish 0 directly and skip the calculator (a healthy detached node would otherwise compute 240 because the calculator only checks MemberState + the leader bonus). In steady state a node running OpcUaPublishActor always carries the driver role, so this is defensive — it preserves the current Detached→0 behavior during transitions.

Resulting ServiceLevel truth table (now fully reachable)

| Node condition | Byte | How reached (DbReachable, OpcUaProbeOk, Stale) |
|---|---|---|
| Not a healthy member / detached | 0 | MemberState not Up/Joining, or Detached guard |
| DB unreachable, sustained | 100 | DbReachable=false ⇒ Stale=true, so (false,_,true) |
| DB reachable but signals stale | 200 | (true,_,true) |
| Healthy follower (DB ok + probe ok + fresh) | 240 | (true,true,false) |
| Healthy leader | 250 | follower 240 + IsDriverRoleLeader +10 |
| DB ok + fresh but peer reports me unreachable | 0 | (true,false,false) → falls through |
Behavior change to be aware of (already documented as the target): a healthy Secondary moves 100 → 240. Both healthy nodes sit at 240/250 with the leader preferred by the +10 bonus; role-leadership is stable so the bytes do not flap. The "DB ok + probe-fail" → 0 edge is intentionally sharp but debounced by the OpcUaProbeOk freshness rule (a single missed probe does not flip it; only a sustained, actively observed failure does).

Error handling / edge cases

  • DB Ask timeout → treat as DbReachable=false for that cycle (fail-safe demote); the next tick re-Asks.
  • No peer (single node) → OpcUaProbeOk=true; supervisor spawns no children.
  • Peer departs → its prior result ages out of the freshness window → OpcUaProbeOk reverts to true (we don't penalize a node for its peer's absence). Supervisor stops the child for the departed member.
  • Cluster forming / role-leader not yet resolved → no local snapshot entry yet → Detached guard → 0, same as today, until the first snapshot arrives.
  • HealthTick before the first snapshot → no local entry → 0 (no spurious 240 before role is known).
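
The fail-safe handling of a DB Ask timeout can be illustrated with a timeout wrapper. The sketch below uses Python's asyncio as an analogy for Akka.NET's Ask with a timeout; it is not the C# code. The name `ask_db_reachable` and the `probe` callable are hypothetical.

```python
import asyncio


async def ask_db_reachable(probe, timeout_s=1.0):
    """Await the probe's answer; a timeout counts as 'unreachable' for this
    cycle only. The next tick simply asks again."""
    try:
        return await asyncio.wait_for(probe(), timeout_s)
    except asyncio.TimeoutError:
        return False               # fail-safe demote


async def slow_probe():
    await asyncio.sleep(0.2)       # simulates a probe that answers too late
    return True


print(asyncio.run(ask_db_reachable(slow_probe, timeout_s=0.05)))  # False
```

Treating the timeout as a negative answer rather than an error keeps the recompute path uniform: the cached DbReachable is always a plain boolean.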

Testing strategy (xUnit + Shouldly, TDD; NO bUnit)

Unit-testable (the wiring that can be proven in a TestKit):

  • OpcUaPublishActor input-derivation: feed a RedundancyStateChanged (Primary-leader) + a cached DbHealthStatus(reachable) + an OpcUaProbeResult(me, ok) → assert published byte 250; flip DB unreachable → 100; stale snapshot → 200; healthy secondary → 240; Detached entry → 0; probe Ok=false about me with DB-ok+fresh → 0. Use PropsForTests with a broadcast/serviceLevel capture + injected dbHealthProbe test ref + short windows.
  • OpcUaProbeOk freshness/debounce: a single stale/absent result → true; a recent explicit false → false.
  • PeerProbeSupervisor: on a membership snapshot with one peer driver member, spawns exactly one child targeting that peer; on the peer leaving, stops it; single-node → no children. (TestKit, with a probe factory injected so no real TCP is attempted.)
  • Reuse existing ServiceLevelCalculatorTests unchanged (the pure function is not modified).

Not unit-testable (acceptance gate): the live cross-node ServiceLevel on the 2-node rig — proven by /run (user-driven). A DB-unreachable primary must be observed dropping below the secondary, and the secondary taking over as the higher-ServiceLevel endpoint. See project_redundancy_state_delivery.

Verification

  • dotnet build clean (production projects are TreatWarningsAsErrors); full dotnet test green.
  • High-risk review chain (serial spec → code → final integration).
  • Live /run on the 2-node / docker-dev rig (user-driven; agent does not sign in): confirm steady-state 250/240, then kill a node's DB reachability and confirm its ServiceLevel drops to 100 and the peer is preferred. Update docs/Redundancy.md to mark the calculator path wired (remove the "not yet wired into the live driver publish path" caveats) and document the freshness/peer-probe sourcing.

Alternatives considered

  • Central compute in RedundancyStateActor — rejected: the admin singleton can't see each node's local DB health without a new cluster-wide health-broadcast contract; heavier, bigger blast radius, no benefit.
  • Local self-probe (TCP to own localhost:4840 / SDK server-state) instead of the peer-probe — simpler (no membership-tracking supervisor) and a defensible "is my OPC UA serving" signal, but leaves PeerOpcUaProbeActor dead and misses the cross-node vantage. Rejected to honor the locked design and close the audit's dead-code gap; recorded here so the trade-off is explicit if the supervisor proves heavier than valued (then it becomes a follow-up swap, not a redesign).
  • Reshape the calculator to role-baseline-demoted-by-health (keep the wide 240-vs-100 gap; health only lowers) — rejected: the calculator's 250/240/200/100/0 tiers are already approved, documented, and tested; reshaping re-opens a settled decision. The +10 leader bonus already preserves client preference for the primary.

Hard constraints (carried from the roadmap)

  • NO Configuration entity / EF migration.
  • Stage by path: never git add .; never stage sql_login.txt, src/Server/.../Host/pki/, pending.md, current.md, docker-dev/docker-compose.yml, stillpending.md.
  • Never echo or commit secrets.
  • No force-push, no --no-verify.
  • Razor/runtime cross-node behavior proven only by live /run, never bUnit.