Lands the pure-logic heart of Phase 6.3. OPC UA node wiring (Stream C), RedundancyCoordinator topology loader (Stream A), Admin UI + metrics (Stream E), and client interop tests (Stream F) are follow-up work, tracked as tasks #145-150.

New Server.Redundancy sub-namespace:

- ServiceLevelCalculator — pure 8-state matrix per decision #154. Inputs: role, selfHealthy, peerUa/HttpHealthy, applyInProgress, recoveryDwellMet, topologyValid, operatorMaintenance. Output: OPC UA Part 5 §6.3.34 Byte. Reserved bands (0 = Maintenance, 1 = NoData, 2 = InvalidTopology) override everything; operational bands occupy 30..255. Key invariants:
  * Authoritative-Primary = 255, Authoritative-Backup = 100.
  * Isolated-Primary = 230 (retains authority with peer down).
  * Isolated-Backup = 80 (does NOT auto-promote — non-transparent model).
  * Primary-Mid-Apply = 200, Backup-Mid-Apply = 50; apply dominates peer-unreachable per the Stream C.4 integration expectation.
  * Recovering-Primary = 180, Recovering-Backup = 30.
  * Standalone treats healthy as Authoritative-Primary (no peer concept).
- ServiceLevelBand enum — labels every numeric band for logs + Admin UI. Values match the calculator table exactly; the compliance script asserts drift detection.
- RecoveryStateManager — holds the Recovering band until (dwell ≥ 60 s default) AND (one publish witness observed). A re-fault resets both gates so a flapping node can't shortcut through recovery twice.
- ApplyLeaseRegistry — keyed on (ConfigGenerationId, PublishRequestId) per decision #162. BeginApplyLease returns an IAsyncDisposable so every exit path (success, exception, cancellation, dispose-twice) closes the lease. The ApplyMaxDuration watchdog (10 min default), driven by the PruneStale tick, forces a close after a crashed publisher so ServiceLevel can't stick at mid-apply.
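The band matrix above can be sketched as a single pure function. This is a minimal illustration assuming names from the summary (the real ServiceLevelCalculator signature, and the split peerUa/HttpHealthy inputs, may differ); the relative precedence of Recovering vs Isolated is not stated in the summary, so this sketch checks apply first (per "apply dominates peer-unreachable"), then recovery, then isolation:

```csharp
// Sketch of the 8-state band matrix described above. Band values (255/230/200/180,
// 100/80/50/30, reserved 0/1/2) are from the summary; method shape is assumed.
public enum NodeRole { Primary, Backup, Standalone }

public static class ServiceLevelSketch
{
    public static byte Calculate(
        NodeRole role,
        bool selfHealthy,
        bool peerReachable,      // peerUaHealthy && peerHttpHealthy, collapsed for the sketch
        bool applyInProgress,
        bool recoveryDwellMet,
        bool topologyValid,
        bool operatorMaintenance)
    {
        // Reserved bands override everything (OPC UA Part 5 §6.3.34).
        if (operatorMaintenance) return 0;   // Maintenance
        if (!selfHealthy) return 1;          // NoData
        if (!topologyValid) return 2;        // InvalidTopology (demotes both nodes)

        // Standalone has no peer concept: healthy is treated as Authoritative-Primary.
        bool primary = role is NodeRole.Primary or NodeRole.Standalone;

        // Mid-apply dominates peer-unreachable (Stream C.4 expectation).
        if (applyInProgress) return primary ? (byte)200 : (byte)50;

        // Recovering until dwell + publish witness both met.
        if (!recoveryDwellMet) return primary ? (byte)180 : (byte)30;

        // Isolated: peer down. The backup does NOT auto-promote (non-transparent model).
        if (role != NodeRole.Standalone && !peerReachable)
            return primary ? (byte)230 : (byte)80;

        return primary ? (byte)255 : (byte)100; // Authoritative
    }
}
```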
Tests (40 new, all pass):

- ServiceLevelCalculatorTests (27): reserved bands override; self-unhealthy → NoData; invalid topology demotes both nodes to 2; authoritative primary 255; backup 100; isolated primary 230 retains authority; isolated backup 80 does not promote; http-only unreachable triggers isolated; mid-apply primary 200; mid-apply backup 50; apply dominates peer-unreachable; recovering primary 180; recovering backup 30; standalone treats healthy as 255; classify round-trips every band including the Unknown sentinel.
- RecoveryStateManagerTests (6): never-faulted auto-meets dwell; faulted-only returns true (semantics-doc test — the coordinator short-circuits on selfHealthy=false); recovered without witness never meets; witness without dwell never meets; witness + dwell-elapsed meets; re-fault resets.
- ApplyLeaseRegistryTests (7): empty registry not-in-progress; begin+dispose closes; dispose on exception still closes; dispose twice is safe; concurrent leases are isolated; watchdog closes stale leases; watchdog leaves recent ones alone.

Full solution dotnet test: 1137 passing (Phase 6.2 shipped at 1097; Phase 6.3 B + D core add +40 = 1137). The pre-existing Client.CLI Subscribe flake is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
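The lease behaviors exercised by ApplyLeaseRegistryTests can be sketched as follows. This is an illustrative reconstruction from the summary, not the shipped implementation: the key shape, BeginApplyLease returning IAsyncDisposable, and the PruneStale watchdog are taken from the description above, while the concrete types (Guid/uint keys, ConcurrentDictionary storage) are assumptions:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Minimal sketch of the lease pattern: BeginApplyLease hands back an IAsyncDisposable
// so success, exception, cancellation, and dispose-twice all close the lease.
public sealed class ApplyLeaseRegistrySketch
{
    private readonly ConcurrentDictionary<(Guid ConfigGenerationId, uint PublishRequestId), DateTime> _open = new();
    private readonly TimeSpan _maxDuration;
    private readonly TimeProvider _time;

    public ApplyLeaseRegistrySketch(TimeSpan? applyMaxDuration = null, TimeProvider? time = null)
    {
        _maxDuration = applyMaxDuration ?? TimeSpan.FromMinutes(10); // ApplyMaxDuration default
        _time = time ?? TimeProvider.System;
    }

    public bool IsApplyInProgress => !_open.IsEmpty;

    public IAsyncDisposable BeginApplyLease(Guid configGenerationId, uint publishRequestId)
    {
        var key = (configGenerationId, publishRequestId);
        _open[key] = _time.GetUtcNow().UtcDateTime;
        return new Lease(this, key);
    }

    /// <summary>Watchdog tick: force-close leases older than ApplyMaxDuration (crashed publisher).</summary>
    public void PruneStale()
    {
        var now = _time.GetUtcNow().UtcDateTime;
        foreach (var kv in _open)
            if (now - kv.Value >= _maxDuration)
                _open.TryRemove(kv.Key, out _);
    }

    private sealed class Lease((Guid, uint) key, ApplyLeaseRegistrySketch owner) : IAsyncDisposable
    {
        public ValueTask DisposeAsync()
        {
            owner._open.TryRemove(key, out _); // idempotent: dispose-twice is safe
            return ValueTask.CompletedTask;
        }
    }
}
```

A caller would then wrap each apply in `await using (registry.BeginApplyLease(gen, req)) { ... }`, so every exit path of the block closes the lease and ServiceLevel cannot stick at mid-apply.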
66 lines
2.5 KiB
C#
namespace ZB.MOM.WW.OtOpcUa.Server.Redundancy;

/// <summary>
/// Tracks the Recovering-band dwell for a node after a <c>Faulted → Healthy</c> transition.
/// Per decision #154 and Phase 6.3 Stream B.4 a node that has just returned to health stays
/// in the Recovering band (180 Primary / 30 Backup) until BOTH: (a) the configured
/// <see cref="DwellTime"/> has elapsed, AND (b) at least one successful publish-witness
/// read has been observed.
/// </summary>
/// <remarks>
/// Purely in-memory, no I/O. The coordinator feeds events into <see cref="MarkFaulted"/>,
/// <see cref="MarkRecovered"/>, and <see cref="RecordPublishWitness"/>; <see cref="IsDwellMet"/>
/// becomes true only after both conditions converge.
/// </remarks>
public sealed class RecoveryStateManager
{
    private readonly TimeSpan _dwellTime;
    private readonly TimeProvider _timeProvider;

    /// <summary>Last time the node transitioned Faulted → Healthy. Null until first recovery.</summary>
    private DateTime? _recoveredUtc;

    /// <summary>True once a publish-witness read has succeeded after the last recovery.</summary>
    private bool _witnessed;

    public TimeSpan DwellTime => _dwellTime;

    public RecoveryStateManager(TimeSpan? dwellTime = null, TimeProvider? timeProvider = null)
    {
        _dwellTime = dwellTime ?? TimeSpan.FromSeconds(60);
        _timeProvider = timeProvider ?? TimeProvider.System;
    }

    /// <summary>Report that the node has entered the Faulted state.</summary>
    public void MarkFaulted()
    {
        _recoveredUtc = null;
        _witnessed = false;
    }

    /// <summary>Report that the node has transitioned Faulted → Healthy; dwell clock starts now.</summary>
    public void MarkRecovered()
    {
        _recoveredUtc = _timeProvider.GetUtcNow().UtcDateTime;
        _witnessed = false;
    }

    /// <summary>Report a successful publish-witness read.</summary>
    public void RecordPublishWitness() => _witnessed = true;

    /// <summary>
    /// True when the dwell is considered met: either the node never faulted in the first
    /// place, or both (dwell time elapsed + publish witness recorded) since the last
    /// recovery. False means the coordinator should report Recovering-band ServiceLevel.
    /// </summary>
    public bool IsDwellMet()
    {
        if (_recoveredUtc is null) return true; // never faulted → dwell N/A

        if (!_witnessed) return false;

        var elapsed = _timeProvider.GetUtcNow().UtcDateTime - _recoveredUtc.Value;
        return elapsed >= _dwellTime;
    }
}
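The two-gate sequence can be walked through with a controllable clock. A minimal manual clock stands in for a test TimeProvider here (real tests would more likely use FakeTimeProvider from the Microsoft.Extensions.Time.Testing package); ManualClock is illustrative only:

```csharp
var clock = new ManualClock();
var recovery = new RecoveryStateManager(timeProvider: clock);

Console.WriteLine(recovery.IsDwellMet());   // True: never faulted, dwell N/A

recovery.MarkFaulted();
recovery.MarkRecovered();                   // dwell clock starts now
Console.WriteLine(recovery.IsDwellMet());   // False: no publish witness yet

recovery.RecordPublishWitness();
Console.WriteLine(recovery.IsDwellMet());   // False: witness seen, dwell not elapsed

clock.UtcNow += TimeSpan.FromSeconds(60);
Console.WriteLine(recovery.IsDwellMet());   // True: both gates met

recovery.MarkFaulted();                     // re-fault resets BOTH gates
recovery.MarkRecovered();
clock.UtcNow += TimeSpan.FromSeconds(60);
Console.WriteLine(recovery.IsDwellMet());   // False: dwell elapsed but witness not re-observed

// Illustrative fake clock; only GetUtcNow is overridden.
sealed class ManualClock : TimeProvider
{
    public DateTimeOffset UtcNow = DateTimeOffset.UnixEpoch;
    public override DateTimeOffset GetUtcNow() => UtcNow;
}
```

The final step is the anti-flap guarantee from the summary: a re-fault clears both the dwell timestamp and the witness flag, so a node cannot coast through recovery on stale progress.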