Lands the pure-logic heart of Phase 6.3. OPC UA node wiring (Stream C), RedundancyCoordinator topology loader (Stream A), Admin UI + metrics (Stream E), and client interop tests (Stream F) are follow-up work, tracked as tasks #145-150.

New Server.Redundancy sub-namespace:

- ServiceLevelCalculator: pure 8-state matrix per decision #154.
  Inputs: role, selfHealthy, peerUa/HttpHealthy, applyInProgress, recoveryDwellMet, topologyValid, operatorMaintenance.
  Output: OPC UA Part 5 §6.3.34 Byte.
  Reserved bands (0=Maintenance, 1=NoData, 2=InvalidTopology) override everything; operational bands occupy 30..255.
  Key invariants:
  * Authoritative-Primary = 255, Authoritative-Backup = 100.
  * Isolated-Primary = 230 (retains authority with peer down).
  * Isolated-Backup = 80 (does NOT auto-promote; non-transparent model).
  * Primary-Mid-Apply = 200, Backup-Mid-Apply = 50; apply dominates peer-unreachable per Stream C.4 integration expectation.
  * Recovering-Primary = 180, Recovering-Backup = 30.
  * Standalone treats healthy as Authoritative-Primary (no peer concept).
- ServiceLevelBand enum: labels every numeric band for logs + Admin UI. Values match the calculator table exactly; compliance script asserts drift detection.
- RecoveryStateManager: holds Recovering band until (dwell ≥ 60s default) AND (one publish witness observed). Re-fault resets both gates so a flapping node doesn't shortcut through recovery twice.
- ApplyLeaseRegistry: keyed on (ConfigGenerationId, PublishRequestId) per decision #162. BeginApplyLease returns an IAsyncDisposable so every exit path (success, exception, cancellation, dispose-twice) closes the lease. ApplyMaxDuration watchdog (10 min default) via PruneStale tick forces close after a crashed publisher so ServiceLevel can't stick at mid-apply.
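The RecoveryStateManager gating described above can be sketched roughly as follows. This is a minimal illustration and not the shipped class: the type name `RecoveryGateSketch`, its method names, and the explicit `nowUtc` parameters are assumptions made for the example. Only the 60 s default dwell, the dwell-AND-witness rule, and the re-fault reset come from the description. Returning true when no recovery is pending is also an assumption; the caller is expected to check self-health separately.

```csharp
using System;

// Hypothetical sketch of the dwell + publish-witness gate; all names are illustrative only.
public sealed class RecoveryGateSketch
{
    private readonly TimeSpan _dwell;
    private DateTime? _recoveredAtUtc; // null = never faulted, or still faulted
    private bool _witnessSeen;

    // 60 s default dwell, as stated in the description above.
    public RecoveryGateSketch(TimeSpan? dwell = null) =>
        _dwell = dwell ?? TimeSpan.FromSeconds(60);

    // A fault (or re-fault) clears both gates.
    public void MarkFaulted()
    {
        _recoveredAtUtc = null;
        _witnessSeen = false;
    }

    // Health came back: start the dwell clock. A witness must be observed after this point.
    public void MarkRecovered(DateTime nowUtc)
    {
        _recoveredAtUtc = nowUtc;
        _witnessSeen = false;
    }

    // Record that one publish has been observed since the last state change.
    public void RecordPublishWitness() => _witnessSeen = true;

    // Assumption: with no recovery pending there is no dwell to wait for, so this returns true.
    public bool IsDwellMet(DateTime nowUtc)
    {
        if (_recoveredAtUtc is null) return true;
        return _witnessSeen && nowUtc - _recoveredAtUtc.Value >= _dwell;
    }
}
```

With this shape, after `MarkFaulted()` and `MarkRecovered(t0)`, a call to `IsDwellMet(t0 + 61 s)` stays false until `RecordPublishWitness()` has been called, and a second `MarkFaulted()` followed by `MarkRecovered(...)` starts both gates over.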
Tests (40 new, all pass):

- ServiceLevelCalculatorTests (27): reserved bands override; self-unhealthy → NoData; invalid topology demotes both nodes to 2; authoritative primary 255; backup 100; isolated primary 230 retains authority; isolated backup 80 does not promote; http-only unreachable triggers isolated; mid-apply primary 200; mid-apply backup 50; apply dominates peer-unreachable; recovering primary 180; recovering backup 30; standalone treats healthy as 255; classify round-trips every band including Unknown sentinel.
- RecoveryStateManagerTests (6): never-faulted auto-meets dwell; faulted-only returns true (semantics-doc test: coordinator short-circuits on selfHealthy=false); recovered without witness never meets; witness without dwell never meets; witness + dwell-elapsed meets; re-fault resets.
- ApplyLeaseRegistryTests (7): empty registry not-in-progress; begin+dispose closes; dispose on exception still closes; dispose twice safe; concurrent leases isolated; watchdog closes stale; watchdog leaves recent alone.

Full solution dotnet test: 1137 passing (Phase 6.2 shipped at 1097, Phase 6.3 B + D core = +40 = 1137). Pre-existing Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
132 lines
6.6 KiB
C#
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;

namespace ZB.MOM.WW.OtOpcUa.Server.Redundancy;

/// <summary>
/// Pure-function translator from the redundancy-state inputs (role, self health, peer
/// reachability via HTTP + UA probes, apply-in-progress flag, recovery dwell, topology
/// validity) to the OPC UA Part 5 §6.3.34 <see cref="byte"/> ServiceLevel value.
/// </summary>
/// <remarks>
/// <para>Per decision #154 the 8-state matrix keeps operational states out of the reserved
/// bands (0=Maintenance, 1=NoData) and out of the InvalidTopology band (2). Operational
/// values occupy 30..255, so a spec-compliant client that cuts over on "&lt;3 = unhealthy"
/// keeps working without its vendor treating the server as "under maintenance" during
/// normal runtime.</para>
///
/// <para>This class is pure: no threads, no I/O. The coordinator that owns it re-evaluates
/// on every input change and pushes the new byte through an <c>IObserver&lt;byte&gt;</c> to
/// the OPC UA ServiceLevel variable. Tests exercise the full matrix without touching a UA
/// stack.</para>
/// </remarks>
public static class ServiceLevelCalculator
{
    /// <summary>Compute the ServiceLevel for the given inputs.</summary>
    /// <param name="role">Role declared for this node in the shared config DB.</param>
    /// <param name="selfHealthy">This node's own health (from Phase 6.1 /healthz).</param>
    /// <param name="peerUaHealthy">Peer node reachable via OPC UA probe.</param>
    /// <param name="peerHttpHealthy">Peer node reachable via HTTP /healthz probe.</param>
    /// <param name="applyInProgress">True while this node is inside a publish-generation apply window.</param>
    /// <param name="recoveryDwellMet">True once the post-fault dwell + publish-witness conditions are met.</param>
    /// <param name="topologyValid">False when the cluster has detected >1 Primary (InvalidTopology demotes both nodes).</param>
    /// <param name="operatorMaintenance">True when operator has declared the node in maintenance.</param>
    public static byte Compute(
        RedundancyRole role,
        bool selfHealthy,
        bool peerUaHealthy,
        bool peerHttpHealthy,
        bool applyInProgress,
        bool recoveryDwellMet,
        bool topologyValid,
        bool operatorMaintenance = false)
    {
        // Reserved bands first: they override everything per OPC UA Part 5 §6.3.34.
        if (operatorMaintenance) return (byte)ServiceLevelBand.Maintenance; // 0
        if (!selfHealthy) return (byte)ServiceLevelBand.NoData; // 1
        if (!topologyValid) return (byte)ServiceLevelBand.InvalidTopology; // 2

        // Standalone nodes have no peer, so treat as authoritative when healthy.
        // recoveryDwellMet and the peer probes are not consulted on this path.
        if (role == RedundancyRole.Standalone)
            return (byte)(applyInProgress ? ServiceLevelBand.PrimaryMidApply : ServiceLevelBand.AuthoritativePrimary);

        var isPrimary = role == RedundancyRole.Primary;

        // Apply-in-progress band dominates recovery + isolation (client should cut to peer).
        if (applyInProgress)
            return (byte)(isPrimary ? ServiceLevelBand.PrimaryMidApply : ServiceLevelBand.BackupMidApply);

        // Post-fault recovering: hold until dwell + witness satisfied.
        if (!recoveryDwellMet)
            return (byte)(isPrimary ? ServiceLevelBand.RecoveringPrimary : ServiceLevelBand.RecoveringBackup);

        // Peer unreachable (either probe fails) → isolated band. Per decision #154 Primary
        // retains authority at 230 when isolated; Backup signals 80 "take over if asked" and
        // does NOT auto-promote (non-transparent model).
        var peerReachable = peerUaHealthy && peerHttpHealthy;
        if (!peerReachable)
            return (byte)(isPrimary ? ServiceLevelBand.IsolatedPrimary : ServiceLevelBand.IsolatedBackup);

        return (byte)(isPrimary ? ServiceLevelBand.AuthoritativePrimary : ServiceLevelBand.AuthoritativeBackup);
    }

    /// <summary>Labels a ServiceLevel byte with its matrix band name, for logs + Admin UI.</summary>
    public static ServiceLevelBand Classify(byte value) => value switch
    {
        (byte)ServiceLevelBand.Maintenance => ServiceLevelBand.Maintenance,
        (byte)ServiceLevelBand.NoData => ServiceLevelBand.NoData,
        (byte)ServiceLevelBand.InvalidTopology => ServiceLevelBand.InvalidTopology,
        (byte)ServiceLevelBand.RecoveringBackup => ServiceLevelBand.RecoveringBackup,
        (byte)ServiceLevelBand.BackupMidApply => ServiceLevelBand.BackupMidApply,
        (byte)ServiceLevelBand.IsolatedBackup => ServiceLevelBand.IsolatedBackup,
        (byte)ServiceLevelBand.AuthoritativeBackup => ServiceLevelBand.AuthoritativeBackup,
        (byte)ServiceLevelBand.RecoveringPrimary => ServiceLevelBand.RecoveringPrimary,
        (byte)ServiceLevelBand.PrimaryMidApply => ServiceLevelBand.PrimaryMidApply,
        (byte)ServiceLevelBand.IsolatedPrimary => ServiceLevelBand.IsolatedPrimary,
        (byte)ServiceLevelBand.AuthoritativePrimary => ServiceLevelBand.AuthoritativePrimary,
        _ => ServiceLevelBand.Unknown,
    };
}

/// <summary>
/// Named bands of the 8-state ServiceLevel matrix. Numeric values match the
/// <see cref="ServiceLevelCalculator"/> table exactly; any drift will be caught by the
/// Phase 6.3 compliance script.
/// </summary>
public enum ServiceLevelBand : byte
{
    /// <summary>Operator-declared maintenance. Reserved per OPC UA Part 5 §6.3.34.</summary>
    Maintenance = 0,

    /// <summary>Unreachable / Faulted. Reserved per OPC UA Part 5 §6.3.34.</summary>
    NoData = 1,

    /// <summary>Detected-inconsistency band: >1 Primary observed at runtime; both nodes self-demote.</summary>
    InvalidTopology = 2,

    /// <summary>Backup post-fault, dwell not met.</summary>
    RecoveringBackup = 30,

    /// <summary>Backup inside a publish-apply window.</summary>
    BackupMidApply = 50,

    /// <summary>Backup with unreachable Primary: "take over if asked"; does NOT auto-promote.</summary>
    IsolatedBackup = 80,

    /// <summary>Backup nominal operation.</summary>
    AuthoritativeBackup = 100,

    /// <summary>Primary post-fault, dwell not met.</summary>
    RecoveringPrimary = 180,

    /// <summary>Primary inside a publish-apply window.</summary>
    PrimaryMidApply = 200,

    /// <summary>Primary with unreachable peer, self serving; retains authority.</summary>
    IsolatedPrimary = 230,

    /// <summary>Primary nominal operation.</summary>
    AuthoritativePrimary = 255,

    /// <summary>Sentinel for unrecognised byte values.</summary>
    Unknown = 254,
}