Closes out Stream B per docs/v2/implementation/phase-6-1-resilience-and-observability.md.

Core.Abstractions:
- IDriverSupervisor: process-level supervisor contract a Tier C driver's out-of-process topology provides (Galaxy Proxy/Supervisor implements this in a follow-up Driver.Galaxy wiring PR). Concerns: DriverInstanceId + RecycleAsync. Tier A/B drivers don't implement this; Stream B code asserts tier == C before ever calling it.

Core.Stability:
- MemoryRecycle: companion to MemoryTracking. On HardBreach, invokes the supervisor IFF tier == C AND a supervisor is wired. Tier A/B HardBreach logs a promotion-to-Tier-C recommendation and returns false. Soft/None/Warming never trigger a recycle at any tier.
- ScheduledRecycleScheduler: Tier C opt-in periodic recycler per decision #67. Ctor throws for Tier A/B (structural guard: scheduled recycle on an in-process driver would kill every OPC UA session and every co-hosted driver). TickAsync(now) advances the schedule by one interval per fire; RequestRecycleNowAsync drives an ad-hoc recycle without shifting the cron.
- WedgeDetector: demand-aware per decision #147. Classify(state, demand, now) returns:
  * NotApplicable when driver state != Healthy
  * Idle when Healthy + no pending work (bulkhead=0 && monitored=0 && historic=0)
  * Healthy when Healthy + pending work + progress within threshold
  * Faulted when Healthy + pending work + no progress within threshold
  Threshold clamps to min 60 s. DemandSignal.HasPendingWork ORs the three counters. The three false-wedge cases the plan calls out are never classified as Faulted: idle subscription-only, slow historian backfill making progress, write-only burst with drained bulkhead.

Tests (22 new, all pass):
- MemoryRecycleTests (7): Tier C hard-breach requests recycle; Tier A/B hard-breach never requests; Tier C without supervisor no-ops; soft-breach at every tier never requests; None/Warming never request.
- ScheduledRecycleSchedulerTests (6): ctor throws for A/B; zero/negative interval throws; tick before due no-ops; tick at/after due fires once and advances; RequestRecycleNow fires immediately without shifting schedule; multiple fires across ticks advance one interval each.
- WedgeDetectorTests (9): threshold clamp to 60 s; unhealthy driver always NotApplicable; idle subscription stays Idle; pending+fresh progress stays Healthy; pending+stale progress is Faulted; MonitoredItems active but no publish is Faulted; MonitoredItems active with fresh publish stays Healthy; historian backfill with fresh progress stays Healthy; write-only burst with empty bulkhead is Idle; HasPendingWork theory for any non-zero counter.

Full solution dotnet test: 989 passing (baseline 906, +83 for Phase 6.1 so far). Pre-existing Client.CLI Subscribe flake unchanged.

Stream B complete. Next up: Stream C (health endpoints + structured logging).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
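The scheduling rules stated in the commit message for ScheduledRecycleScheduler (Tier C only, one interval advanced per fire, ad-hoc recycles leave the schedule alone) can be illustrated with a minimal sketch. This is not the committed implementation: `RecycleScheduleSketch`, `DriverTierSketch`, `NextDueUtc`, and the `Func<Task>` recycle callback are hypothetical names chosen for illustration, and the real class talks to an IDriverSupervisor whose exact signature is not shown here.

```csharp
using System;
using System.Threading.Tasks;

var start = new DateTime(2025, 1, 1, 0, 0, 0, DateTimeKind.Utc);
var fired = 0;
var scheduler = new RecycleScheduleSketch(
    DriverTierSketch.C, TimeSpan.FromHours(24), start,
    () => { fired++; return Task.CompletedTask; });

await scheduler.TickAsync(start.AddHours(23)); // before due: no-op
await scheduler.TickAsync(start.AddHours(24)); // due: fires once, schedule advances 24 h
await scheduler.RequestRecycleNowAsync();      // ad hoc: fires, schedule unchanged
Console.WriteLine(fired);                      // 2
Console.WriteLine(scheduler.NextDueUtc == start.AddHours(48)); // True

// Hypothetical stand-in for the driver tier concept.
enum DriverTierSketch { A, B, C }

// Hypothetical sketch of the scheduling rules only; names are assumptions.
sealed class RecycleScheduleSketch
{
    private readonly TimeSpan _interval;
    private readonly Func<Task> _recycle;

    public DateTime NextDueUtc { get; private set; }

    public RecycleScheduleSketch(
        DriverTierSketch tier, TimeSpan interval, DateTime startUtc, Func<Task> recycle)
    {
        // Structural guard: recycling an in-process (Tier A/B) driver is never allowed.
        if (tier != DriverTierSketch.C)
            throw new ArgumentException("Scheduled recycle is Tier C only.", nameof(tier));
        if (interval <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(interval));

        _interval = interval;
        _recycle = recycle;
        NextDueUtc = startUtc + interval;
    }

    // Fires at most once per call and advances the schedule by exactly one interval.
    public async Task<bool> TickAsync(DateTime utcNow)
    {
        if (utcNow < NextDueUtc)
            return false;
        await _recycle();
        NextDueUtc += _interval;
        return true;
    }

    // Ad-hoc recycle: does not touch NextDueUtc.
    public Task RequestRecycleNowAsync() => _recycle();
}
```

Advancing by one interval per fire, rather than jumping to "now + interval", keeps the schedule anchored to its original cadence, which matches the "without shifting the cron" wording above.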
82 lines
3.8 KiB
C#
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;

namespace ZB.MOM.WW.OtOpcUa.Core.Stability;

/// <summary>
/// Demand-aware driver-wedge detector per <c>docs/v2/plan.md</c> decision #147.
/// Flips a driver to <see cref="WedgeVerdict.Faulted"/> only when BOTH of the following hold:
/// (a) there is pending work outstanding, AND (b) no progress has been observed for longer
/// than <see cref="Threshold"/>. Idle drivers, write-only burst drivers, and subscription-only
/// drivers whose signals don't arrive regularly are not flagged as wedged.
/// </summary>
/// <remarks>
/// <para>Pending work signal is supplied by the caller via <see cref="DemandSignal"/>:
/// non-zero Polly bulkhead depth, ≥1 active MonitoredItem, or ≥1 queued historian read
/// each qualifies. The detector itself holds no state beyond <see cref="Threshold"/>: the
/// caller supplies <c>LastProgressUtc</c> on every call. No history buffer.</para>
///
/// <para>Default threshold per plan: <c>5 × PublishingInterval</c>, with a minimum of 60 s.
/// Concrete values are driver-agnostic and configured per-instance by the caller.</para>
/// </remarks>
public sealed class WedgeDetector
{
    /// <summary>Wedge-detection threshold; values below 60 s are clamped to 60 s.</summary>
    public TimeSpan Threshold { get; }

    /// <summary>Creates a detector with the given threshold, clamped to a minimum of 60 s.</summary>
    public WedgeDetector(TimeSpan threshold)
    {
        Threshold = threshold < TimeSpan.FromSeconds(60) ? TimeSpan.FromSeconds(60) : threshold;
    }

    /// <summary>
    /// Classify the current state against the demand signal. Does not retain state across
    /// calls: each call is self-contained, and the caller owns the <c>LastProgressUtc</c> clock.
    /// </summary>
    public WedgeVerdict Classify(DriverState state, DemandSignal demand, DateTime utcNow)
    {
        if (state != DriverState.Healthy)
            return WedgeVerdict.NotApplicable;

        if (!demand.HasPendingWork)
            return WedgeVerdict.Idle;

        // Strictly greater than: progress exactly Threshold ago still counts as Healthy.
        var sinceProgress = utcNow - demand.LastProgressUtc;
        return sinceProgress > Threshold ? WedgeVerdict.Faulted : WedgeVerdict.Healthy;
    }
}

/// <summary>
/// Caller-supplied demand snapshot. All three counters are OR'd: any non-zero value means work
/// is outstanding, which is the trigger for checking the <see cref="LastProgressUtc"/> clock.
/// </summary>
/// <param name="BulkheadDepth">Polly bulkhead depth (in-flight capability calls).</param>
/// <param name="ActiveMonitoredItems">Number of live OPC UA MonitoredItems bound to this driver.</param>
/// <param name="QueuedHistoryReads">Pending historian-read requests the driver owes the server.</param>
/// <param name="LastProgressUtc">Last time the driver reported a successful unit of work (read, subscribe-ack, publish).</param>
public readonly record struct DemandSignal(
    int BulkheadDepth,
    int ActiveMonitoredItems,
    int QueuedHistoryReads,
    DateTime LastProgressUtc)
{
    /// <summary>True when any of the three counters is > 0.</summary>
    public bool HasPendingWork => BulkheadDepth > 0 || ActiveMonitoredItems > 0 || QueuedHistoryReads > 0;
}

/// <summary>Outcome of a single <see cref="WedgeDetector.Classify"/> call.</summary>
public enum WedgeVerdict
{
    /// <summary>Driver wasn't Healthy to begin with, so wedge detection doesn't apply.</summary>
    NotApplicable,

    /// <summary>Driver claims Healthy + no pending work → idle, not wedged.</summary>
    Idle,

    /// <summary>Driver claims Healthy + has pending work + has made progress within the threshold → stays Healthy.</summary>
    Healthy,

    /// <summary>Driver claims Healthy + has pending work + has NOT made progress within the threshold → wedged.</summary>
    Faulted,
}
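As a usage illustration, the following self-contained sketch walks through the four verdicts and the 60 s clamp. It re-declares compact copies of the types above so it can run on its own; the `DriverState` enum here is a hypothetical stand-in (the real one lives in Core.Abstractions and is not shown in this file), and the `Degraded` member is an assumption made only so there is a non-Healthy value to pass.

```csharp
using System;

var now = new DateTime(2025, 1, 1, 12, 0, 0, DateTimeKind.Utc);

// Plan default is 5 x PublishingInterval; the constructor clamps to a 60 s minimum.
var fast = new WedgeDetector(TimeSpan.FromSeconds(1) * 5);
var slow = new WedgeDetector(TimeSpan.FromSeconds(30) * 5);
Console.WriteLine(fast.Threshold); // 00:01:00 (clamped up from 5 s)
Console.WriteLine(slow.Threshold); // 00:02:30

// No counters set: Idle, no matter how old LastProgressUtc is.
var idle = new DemandSignal(0, 0, 0, now.AddHours(-1));
Console.WriteLine(fast.Classify(DriverState.Healthy, idle, now));   // Idle

// Live MonitoredItems with progress 30 s ago: inside the 60 s threshold.
var fresh = new DemandSignal(0, 12, 0, now.AddSeconds(-30));
Console.WriteLine(fast.Classify(DriverState.Healthy, fresh, now));  // Healthy

// Same demand, but the last progress was 61 s ago: wedged.
var stale = new DemandSignal(0, 12, 0, now.AddSeconds(-61));
Console.WriteLine(fast.Classify(DriverState.Healthy, stale, now));  // Faulted

// A driver that is not Healthy is out of scope for wedge detection.
Console.WriteLine(fast.Classify(DriverState.Degraded, stale, now)); // NotApplicable

// Hypothetical stand-in for Core.Abstractions.DriverState; only Healthy is known from the source.
enum DriverState { Healthy, Degraded }

enum WedgeVerdict { NotApplicable, Idle, Healthy, Faulted }

// Compact copy of the record struct defined in the file above.
readonly record struct DemandSignal(
    int BulkheadDepth, int ActiveMonitoredItems, int QueuedHistoryReads, DateTime LastProgressUtc)
{
    public bool HasPendingWork =>
        BulkheadDepth > 0 || ActiveMonitoredItems > 0 || QueuedHistoryReads > 0;
}

// Compact copy of the detector defined in the file above.
sealed class WedgeDetector
{
    public TimeSpan Threshold { get; }

    public WedgeDetector(TimeSpan threshold) =>
        Threshold = threshold < TimeSpan.FromSeconds(60) ? TimeSpan.FromSeconds(60) : threshold;

    public WedgeVerdict Classify(DriverState state, DemandSignal demand, DateTime utcNow)
    {
        if (state != DriverState.Healthy) return WedgeVerdict.NotApplicable;
        if (!demand.HasPendingWork) return WedgeVerdict.Idle;
        return utcNow - demand.LastProgressUtc > Threshold
            ? WedgeVerdict.Faulted
            : WedgeVerdict.Healthy;
    }
}
```

Because the detector keeps no state between calls, a caller can run it from any polling loop and decide separately what to do with a Faulted verdict.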