auto/driver-gaps-stash
3 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
d06cc01a48 |
Admin /hosts red-badge + resilience columns + Polly telemetry observer. Closes task #164 (the remaining slice of Phase 6.1 Stream E.3 after the earlier publisher + hub PR). Three cooperating pieces wired together so the operator-facing /hosts table actually reflects the live Polly counters that the pipeline builder is producing. DriverResiliencePipelineBuilder gains an optional DriverResilienceStatusTracker ctor param — when non-null, every built pipeline wires Polly's OnRetry/OnOpened/OnClosed strategy-options callbacks into the tracker. OnRetry → tracker.RecordFailure (so ConsecutiveFailures climbs per retry), OnOpened → tracker.RecordBreakerOpen (stamps LastCircuitBreakerOpenUtc), OnClosed → tracker.RecordSuccess (resets the failure counter once the target recovers). Absent tracker = silent, preserving the unit-test constructor path + any deployment that doesn't care about resilience observability. Cancellation stays excluded from the failure count via the existing ShouldHandle predicate. HostStatusService.HostStatusRow extends with four new fields — ConsecutiveFailures, LastCircuitBreakerOpenUtc, CurrentBulkheadDepth, LastRecycleUtc — populated via a second LEFT JOIN onto DriverInstanceResilienceStatuses keyed on (DriverInstanceId, HostName). LEFT JOIN because brand-new hosts haven't been sampled yet; a missing row means zero failures + never-opened breaker, which is the correct default. New FailureFlagThreshold constant (=3, matches plan decision #143's conservative half-of-breaker convention) + IsFlagged predicate so the UI can pre-warn before the breaker actually trips. Hosts.razor paints three new columns between State and Last-transition — Fail# (bold red when flagged), In-flight (bulkhead-depth proxy), Breaker-opened (relative age). Per-row "Flagged" red badge alongside State when IsFlagged is true. Above the first cluster table, a red alert banner summarises the flagged-host count when ≥1 host is flagged, so operators see the problem before scanning rows. Three new tests in DriverResiliencePipelineBuilderTests — Tracker_RecordsFailure_OnEveryRetry verifies ConsecutiveFailures reaches RetryCount after a transient-forever operation, Tracker_StampsBreakerOpen_WhenBreakerTrips verifies LastBreakerOpenUtc is set after threshold failures on a Write pipeline, Tracker_IsolatesCounters_PerHost verifies one dead host does not leak failure counts into a healthy sibling. Full suite — Core.Tests 14/14 resilience-builder tests passing (11 existing + 3 new), Admin.Tests 72/72 passing, Admin project builds 0 errors. SignalR live push of status changes + browser visual review are deliberately left to a follow-up — this PR keeps the structural change minimal (polling refresh already exists in the page's 10s timer; SignalR would be a structural add that touches hub registration + client subscription).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
b6d2803ff6 |
Phase 6.1 Stream A — switch pipeline keys from Guid to string to match IDriver.DriverInstanceId
IDriver.DriverInstanceId is declared as string in Core.Abstractions; keeping the pipeline key as Guid meant every call site would need .ToString() / Guid.Parse at the boundary. Switching the Resilience types to string removes that friction and lets OtOpcUaServer pass driver.DriverInstanceId directly to the builder in the upcoming server-dispatch wiring PR. - DriverResiliencePipelineBuilder.GetOrCreate + Invalidate + PipelineKey - CapabilityInvoker.ctor + _driverInstanceId field Tests: all 48 Core.Tests still pass. The Invalidate test's keepId / dropId now use distinct "drv-keep" / "drv-drop" literals (previously both were distinct Guid.NewGuid() values, which the sed-driven refactor had collapsed to the same literal — caught pre-commit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
c04b13f436 |
Phase 6.1 Stream A.1/A.2/A.6 — Polly resilience foundation: pipeline builder + per-tier policy defaults + WriteIdempotent attribute
Lands the first chunk of the Phase 6.1 Stream A resilience layer per docs/v2/implementation/phase-6-1-resilience-and-observability.md §Stream A. Downstream CapabilityInvoker (A.3) + driver-dispatch wiring land in follow-up PRs on the same branch. Core.Abstractions additions: - WriteIdempotentAttribute — marker for tag-definition records that opt into auto-retry on IWritable.WriteAsync. Absence = no retry per decisions #44, #45, #143. Read once via reflection at driver-init time; no per-write cost. - DriverCapability enum — enumerates the 8 capability surface points (Read / Write / Discover / Subscribe / Probe / AlarmSubscribe / AlarmAcknowledge / HistoryRead). AlarmAcknowledge is write-shaped (no retry by default). - DriverTier enum — A/B/C per driver-stability.md §2-4. Stream B.1 wires this into DriverTypeMetadata; surfaced here because the resilience policy defaults key on it. Core.Resilience new namespace: - DriverResilienceOptions — per-tier × per-capability policy defaults. GetTierDefaults(tier) is the source of truth: * Tier A: Read 2s/3 retries, Write 2s/0 retries, breaker threshold 5 * Tier B: Read 4s/3, Write 4s/0, breaker threshold 5 * Tier C: Read 10s/1, Write 10s/0, breaker threshold 0 (supervisor handles process-level breaker per decision #68) Resolve(capability) overlays CapabilityPolicies on top of the defaults. - DriverResiliencePipelineBuilder — composes Timeout → Retry (capability- permitting, never on cancellation) → CircuitBreaker (tier-permitting) → Bulkhead. Pipelines cached in a lock-free ConcurrentDictionary keyed on (DriverInstanceId, HostName, DriverCapability) per decision #144 — one dead PLC behind a multi-device driver does not open the breaker for healthy siblings. Invalidate(driverInstanceId) supports Admin-triggered reload. Tests (30 new, all pass): - DriverResilienceOptionsTests: tier-default coverage for every capability, Write + AlarmAcknowledge never retry at any tier, Tier C disables breaker, resolve-with-override layering. - DriverResiliencePipelineBuilderTests: Read retries transients, Write does NOT retry on failure (decision #44 guard), dead-host isolation from sibling hosts, pipeline reuse for same triple, per-capability isolation, breaker opens after threshold on Tier A, timeout fires, cancellation is not retried, invalidation scoped to matching instance. Polly.Core 8.6.6 added to Core.csproj. Full solution dotnet test: 936 passing (baseline 906 + 30 new). One pre-existing Client.CLI Subscribe flake unchanged by this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |