Admin /hosts red-badge + resilience columns + Polly telemetry observer. Closes task #164 (the remaining slice of Phase 6.1 Stream E.3 after the earlier publisher + hub PR). Three cooperating pieces wired together so the operator-facing /hosts table actually reflects the live Polly counters that the pipeline builder is producing. DriverResiliencePipelineBuilder gains an optional DriverResilienceStatusTracker ctor param — when non-null, every built pipeline wires Polly's OnRetry/OnOpened/OnClosed strategy-options callbacks into the tracker. OnRetry → tracker.RecordFailure (so ConsecutiveFailures climbs per retry), OnOpened → tracker.RecordBreakerOpen (stamps LastCircuitBreakerOpenUtc), OnClosed → tracker.RecordSuccess (resets the failure counter once the target recovers). Absent tracker = silent, preserving the unit-test constructor path + any deployment that doesn't care about resilience observability. Cancellation stays excluded from the failure count via the existing ShouldHandle predicate. HostStatusService.HostStatusRow extends with four new fields — ConsecutiveFailures, LastCircuitBreakerOpenUtc, CurrentBulkheadDepth, LastRecycleUtc — populated via a second LEFT JOIN onto DriverInstanceResilienceStatuses keyed on (DriverInstanceId, HostName). LEFT JOIN because brand-new hosts haven't been sampled yet; a missing row means zero failures + never-opened breaker, which is the correct default. New FailureFlagThreshold constant (=3, matches plan decision #143's conservative half-of-breaker convention) + IsFlagged predicate so the UI can pre-warn before the breaker actually trips. Hosts.razor paints three new columns between State and Last-transition — Fail# (bold red when flagged), In-flight (bulkhead-depth proxy), Breaker-opened (relative age). Per-row "Flagged" red badge alongside State when IsFlagged is true. Above the first cluster table, a red alert banner summarises the flagged-host count when ≥1 host is flagged, so operators see the problem before scanning rows. Three new tests in DriverResiliencePipelineBuilderTests — Tracker_RecordsFailure_OnEveryRetry verifies ConsecutiveFailures reaches RetryCount after a transient-forever operation, Tracker_StampsBreakerOpen_WhenBreakerTrips verifies LastBreakerOpenUtc is set after threshold failures on a Write pipeline, Tracker_IsolatesCounters_PerHost verifies one dead host does not leak failure counts into a healthy sibling. Full suite — Core.Tests 14/14 resilience-builder tests passing (11 existing + 3 new), Admin.Tests 72/72 passing, Admin project builds 0 errors. SignalR live push of status changes + browser visual review are deliberately left to a follow-up — this PR keeps the structural change minimal (polling refresh already exists in the page's 10s timer; SignalR would be a structural add that touches hub registration + client subscription).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-04-19 21:35:54 -04:00
parent 5536e96b46
commit d06cc01a48
4 changed files with 169 additions and 16 deletions

View File

@@ -219,4 +219,67 @@ public sealed class DriverResiliencePipelineBuilderTests
attempts.ShouldBeLessThanOrEqualTo(1);
}
[Fact]
public async Task Tracker_RecordsFailure_OnEveryRetry()
{
var tracker = new DriverResilienceStatusTracker();
var builder = new DriverResiliencePipelineBuilder(statusTracker: tracker);
var pipeline = builder.GetOrCreate("drv-trk", "host-x", DriverCapability.Read, TierAOptions);
await Should.ThrowAsync<InvalidOperationException>(async () =>
await pipeline.ExecuteAsync(async _ =>
{
await Task.Yield();
throw new InvalidOperationException("always fails");
}));
var snap = tracker.TryGet("drv-trk", "host-x");
snap.ShouldNotBeNull();
var retryCount = TierAOptions.Resolve(DriverCapability.Read).RetryCount;
snap!.ConsecutiveFailures.ShouldBe(retryCount);
}
[Fact]
public async Task Tracker_StampsBreakerOpen_WhenBreakerTrips()
{
var tracker = new DriverResilienceStatusTracker();
var builder = new DriverResiliencePipelineBuilder(statusTracker: tracker);
var pipeline = builder.GetOrCreate("drv-trk", "host-b", DriverCapability.Write, TierAOptions);
var threshold = TierAOptions.Resolve(DriverCapability.Write).BreakerFailureThreshold;
for (var i = 0; i < threshold; i++)
{
await Should.ThrowAsync<InvalidOperationException>(async () =>
await pipeline.ExecuteAsync(async _ =>
{
await Task.Yield();
throw new InvalidOperationException("boom");
}));
}
var snap = tracker.TryGet("drv-trk", "host-b");
snap.ShouldNotBeNull();
snap!.LastBreakerOpenUtc.ShouldNotBeNull();
}
[Fact]
public async Task Tracker_IsolatesCounters_PerHost()
{
var tracker = new DriverResilienceStatusTracker();
var builder = new DriverResiliencePipelineBuilder(statusTracker: tracker);
var dead = builder.GetOrCreate("drv-trk", "dead", DriverCapability.Read, TierAOptions);
var live = builder.GetOrCreate("drv-trk", "live", DriverCapability.Read, TierAOptions);
await Should.ThrowAsync<InvalidOperationException>(async () =>
await dead.ExecuteAsync(async _ =>
{
await Task.Yield();
throw new InvalidOperationException("dead");
}));
await live.ExecuteAsync(async _ => await Task.Yield());
tracker.TryGet("drv-trk", "dead")!.ConsecutiveFailures.ShouldBeGreaterThan(0);
tracker.TryGet("drv-trk", "live").ShouldBeNull();
}
}