Joseph Doherty
483f55557c
Phase 6.3 Stream B + Stream D (core) — ServiceLevelCalculator + RecoveryStateManager + ApplyLeaseRegistry

Lands the pure-logic heart of Phase 6.3. OPC UA node wiring (Stream C),
the RedundancyCoordinator topology loader (Stream A), Admin UI + metrics
(Stream E), and client interop tests (Stream F) are follow-up work,
tracked as tasks #145-150.

New Server.Redundancy sub-namespace:
- ServiceLevelCalculator: a pure 8-state matrix per decision #154. Inputs:
  role, selfHealthy, peerUa/HttpHealthy, applyInProgress, recoveryDwellMet,
  topologyValid, operatorMaintenance. Output: a Byte per OPC UA Part 5
  §6.3.34. Reserved bands (0=Maintenance, 1=NoData, 2=InvalidTopology)
  override everything; operational bands occupy 30..255.
  Key invariants:
  * Authoritative-Primary = 255, Authoritative-Backup = 100.
  * Isolated-Primary = 230 (retains authority while the peer is down).
  * Isolated-Backup = 80 (does NOT auto-promote; non-transparent model).
  * Primary-Mid-Apply = 200, Backup-Mid-Apply = 50; apply dominates
    peer-unreachable, per the Stream C.4 integration expectation.
  * Recovering-Primary = 180, Recovering-Backup = 30.
  * Standalone treats a healthy node as Authoritative-Primary (no peer
    concept).
- ServiceLevelBand enum: labels every numeric band for logs + Admin UI.
  Values match the calculator table exactly; a compliance script asserts
  the two stay in sync (drift detection).
- RecoveryStateManager: holds the Recovering band until the dwell has
  elapsed (≥ 60 s by default) AND at least one publish witness has been
  observed. A re-fault resets both gates, so a flapping node can't reuse
  stale gate state to shortcut through recovery a second time.
- ApplyLeaseRegistry: keyed on (ConfigGenerationId, PublishRequestId) per
  decision #162. BeginApplyLease returns an IAsyncDisposable, so every exit
  path (success, exception, cancellation, double dispose) closes the lease.
  An ApplyMaxDuration watchdog (10 min default), driven by the PruneStale
  tick, force-closes leases left behind by a crashed publisher so
  ServiceLevel can't stick at mid-apply.

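To make the calculator's precedence concrete, here is a minimal Python sketch of the matrix and of band classification. It is not the C# implementation. The `Role`/`Inputs` types, the function names `compute_service_level` and `classify`, the `UNKNOWN = -1` sentinel value, the order among the three reserved bands, Standalone reusing the primary's mid-apply and recovering bands, and Recovering outranking Isolated are all assumptions. The band values, "apply dominates peer-unreachable", and "an HTTP-only peer failure still means isolated" come from this commit.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Role(Enum):  # hypothetical role type
    PRIMARY = "primary"
    BACKUP = "backup"
    STANDALONE = "standalone"


class ServiceLevelBand(IntEnum):
    # Reserved bands: override every operational state.
    MAINTENANCE = 0
    NO_DATA = 1
    INVALID_TOPOLOGY = 2
    # Operational bands (30..255), values as listed in the commit.
    RECOVERING_BACKUP = 30
    BACKUP_MID_APPLY = 50
    ISOLATED_BACKUP = 80
    AUTHORITATIVE_BACKUP = 100
    RECOVERING_PRIMARY = 180
    PRIMARY_MID_APPLY = 200
    ISOLATED_PRIMARY = 230
    AUTHORITATIVE_PRIMARY = 255
    # Hypothetical sentinel for bytes that match no band.
    UNKNOWN = -1


@dataclass(frozen=True)
class Inputs:  # snake_case mirror of the calculator inputs
    role: Role
    self_healthy: bool
    peer_ua_healthy: bool
    peer_http_healthy: bool
    apply_in_progress: bool
    recovery_dwell_met: bool
    topology_valid: bool
    operator_maintenance: bool


def compute_service_level(i: Inputs) -> int:
    """Return the ServiceLevel byte for one node (sketch)."""
    # Reserved bands first; their relative order here is an assumption.
    if i.operator_maintenance:
        return int(ServiceLevelBand.MAINTENANCE)
    if not i.self_healthy:
        return int(ServiceLevelBand.NO_DATA)
    if not i.topology_valid:
        return int(ServiceLevelBand.INVALID_TOPOLOGY)

    # Standalone has no peer concept: behave like a primary whose peer is fine.
    is_primary = i.role is not Role.BACKUP
    # Either probe (UA or HTTP) failing counts as peer-unreachable.
    peer_reachable = i.role is Role.STANDALONE or (
        i.peer_ua_healthy and i.peer_http_healthy
    )

    if i.apply_in_progress:  # apply dominates peer-unreachable
        pair = (ServiceLevelBand.PRIMARY_MID_APPLY, ServiceLevelBand.BACKUP_MID_APPLY)
    elif not i.recovery_dwell_met:  # assumed to outrank isolation
        pair = (ServiceLevelBand.RECOVERING_PRIMARY, ServiceLevelBand.RECOVERING_BACKUP)
    elif not peer_reachable:  # backup stays at 80, never auto-promotes
        pair = (ServiceLevelBand.ISOLATED_PRIMARY, ServiceLevelBand.ISOLATED_BACKUP)
    else:
        pair = (ServiceLevelBand.AUTHORITATIVE_PRIMARY, ServiceLevelBand.AUTHORITATIVE_BACKUP)
    return int(pair[0] if is_primary else pair[1])


def classify(level: int) -> ServiceLevelBand:
    """Map a byte back to its band label, or UNKNOWN if none matches."""
    try:
        return ServiceLevelBand(level)
    except ValueError:
        return ServiceLevelBand.UNKNOWN
```

Keeping the function free of I/O and clocks is what lets every cell of the matrix be covered by small table-style tests like the 27 listed below.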
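The RecoveryStateManager's two-gate hold can be sketched the same way. This is a minimal Python illustration, not the C# class: the method names are hypothetical, the caller passes the current time in seconds (so the sketch is deterministic), and only witnesses observed after recovery are assumed to count.

```python
from typing import Optional


class RecoveryStateManager:
    """Sketch of the recovery hold: dwell elapsed AND publish witness seen."""

    def __init__(self, dwell_seconds: float = 60.0) -> None:
        self.dwell_seconds = dwell_seconds
        self._faulted = False
        self._recovered_at: Optional[float] = None  # set when a fault clears
        self._witness_seen = False

    def mark_faulted(self) -> None:
        # A (re-)fault resets both gates, so a flapping node starts over.
        self._faulted = True
        self._recovered_at = None
        self._witness_seen = False

    def mark_recovered(self, now: float) -> None:
        # Start the dwell clock only on a faulted -> healthy transition.
        if self._faulted:
            self._faulted = False
            self._recovered_at = now

    def record_publish_witness(self) -> None:
        # Only witnesses observed after recovery count toward the gate.
        if self._recovered_at is not None:
            self._witness_seen = True

    def is_dwell_met(self, now: float) -> bool:
        if self._recovered_at is None:
            # Never faulted, or still faulted; in the faulted case the
            # calculator already reports NoData from self_healthy=False.
            return True
        elapsed = now - self._recovered_at
        return self._witness_seen and elapsed >= self.dwell_seconds
```

Returning true while still faulted looks odd in isolation; it matches the "faulted-only returns true" test below, which documents that the coordinator short-circuits on selfHealthy=false before this value matters.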
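The lease pattern translates to Python as an async context manager, which plays the role that IAsyncDisposable plus `await using` plays in C#: cleanup runs on success, exception, and cancellation. This is a sketch under assumptions; the class and method names, the key types, the injectable clock, and tracking leases by identity are illustrative choices, not the registry's actual API, and BeginApplyLease's real signature isn't shown in this commit.

```python
import threading
import time
from typing import Callable, Dict, Hashable, Tuple

LeaseKey = Tuple[Hashable, Hashable]  # (config_generation_id, publish_request_id)


class ApplyLease:
    """Async context manager standing in for the IAsyncDisposable lease."""

    def __init__(self, registry: "ApplyLeaseRegistry", key: LeaseKey) -> None:
        self._registry = registry
        self._key = key
        self._closed = False

    async def __aenter__(self) -> "ApplyLease":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        # Runs on success, exception, and cancellation alike.
        await self.aclose()
        return False  # never swallow the caller's exception

    async def aclose(self) -> None:
        if not self._closed:  # a second close is a no-op
            self._closed = True
            self._registry._release(self._key, self)


class ApplyLeaseRegistry:
    def __init__(self, max_duration: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_duration = max_duration  # ApplyMaxDuration, 10 min default
        self._clock = clock
        self._open: Dict[LeaseKey, Tuple[float, ApplyLease]] = {}
        self._lock = threading.Lock()

    def begin_apply_lease(self, generation_id: Hashable,
                          publish_request_id: Hashable) -> ApplyLease:
        key = (generation_id, publish_request_id)
        lease = ApplyLease(self, key)
        with self._lock:
            self._open[key] = (self._clock(), lease)
        return lease

    @property
    def is_apply_in_progress(self) -> bool:
        with self._lock:
            return bool(self._open)

    def prune_stale(self) -> int:
        """Watchdog tick: force-close leases older than max_duration."""
        now = self._clock()
        with self._lock:
            stale = [k for k, (started, _) in self._open.items()
                     if now - started >= self._max_duration]
            for k in stale:
                del self._open[k]
        return len(stale)

    def _release(self, key: LeaseKey, lease: ApplyLease) -> None:
        with self._lock:
            entry = self._open.get(key)
            # Ignore a late close for a lease that was pruned or replaced.
            if entry is not None and entry[1] is lease:
                del self._open[key]
```

Matching on the lease object, not just the key, means a publisher that wakes up after the watchdog pruned its lease can't accidentally close a newer lease for the same key.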
Tests (40 new, all pass):
- ServiceLevelCalculatorTests (27): reserved bands override; self-unhealthy
  → NoData; invalid topology demotes both nodes to 2; authoritative primary
  255; backup 100; isolated primary 230 retains authority; isolated backup
  80 does not promote; HTTP-only peer unreachable triggers isolated;
  mid-apply primary 200; mid-apply backup 50; apply dominates
  peer-unreachable; recovering primary 180; recovering backup 30;
  standalone treats healthy as 255; classify round-trips every band,
  including the Unknown sentinel.
- RecoveryStateManagerTests (6): never-faulted auto-meets dwell; faulted-only
  returns true (semantics-documenting test: the coordinator short-circuits
  on selfHealthy=false); recovered without witness never meets; witness
  without dwell never meets; witness + dwell elapsed meets; re-fault resets.
- ApplyLeaseRegistryTests (7): empty registry not in progress; begin +
  dispose closes; dispose on exception still closes; dispose twice is safe;
  concurrent leases are isolated; watchdog closes stale leases; watchdog
  leaves recent ones alone.

Full-solution dotnet test: 1137 passing (Phase 6.2 shipped at 1097;
Phase 6.3 B + D core adds 40, for 1137). The pre-existing Client.CLI
Subscribe flake is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 09:56:34 -04:00