Files
lmxopcua/src/ZB.MOM.WW.OtOpcUa.Admin/Services/HostStatusService.cs
Joseph Doherty d06cc01a48

Admin /hosts red-badge + resilience columns + Polly telemetry observer. Closes task #164 (the remaining slice of Phase 6.1 Stream E.3 after the earlier publisher + hub PR). Three cooperating pieces wired together so the operator-facing /hosts table actually reflects the live Polly counters that the pipeline builder is producing.

DriverResiliencePipelineBuilder gains an optional DriverResilienceStatusTracker ctor param — when non-null, every built pipeline wires Polly's OnRetry/OnOpened/OnClosed strategy-options callbacks into the tracker: OnRetry → tracker.RecordFailure (so ConsecutiveFailures climbs per retry), OnOpened → tracker.RecordBreakerOpen (stamps LastCircuitBreakerOpenUtc), OnClosed → tracker.RecordSuccess (resets the failure counter once the target recovers). Absent tracker = silent, preserving the unit-test constructor path + any deployment that doesn't care about resilience observability. Cancellation stays excluded from the failure count via the existing ShouldHandle predicate.

HostStatusService.HostStatusRow extends with four new fields — ConsecutiveFailures, LastCircuitBreakerOpenUtc, CurrentBulkheadDepth, LastRecycleUtc — populated via a second LEFT JOIN onto DriverInstanceResilienceStatuses keyed on (DriverInstanceId, HostName). LEFT JOIN because brand-new hosts haven't been sampled yet; a missing row means zero failures + never-opened breaker, which is the correct default. New FailureFlagThreshold constant (= 3, matching plan decision #143's conservative half-of-breaker convention) + IsFlagged predicate so the UI can pre-warn before the breaker actually trips.

Hosts.razor paints three new columns between State and Last-transition — Fail# (bold red when flagged), In-flight (bulkhead-depth proxy), Breaker-opened (relative age) — plus a per-row "Flagged" red badge alongside State when IsFlagged is true.
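The builder's callback wiring can be sketched as below, with the tracker calls reduced to plain delegates. This is a sketch against the Polly v8 API, not the real DriverResiliencePipelineBuilder: the method shape, delegate parameters, and the retry-count/delay values are illustrative assumptions.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;

static class PipelineWiringSketch
{
    // recordFailure / recordBreakerOpen / recordSuccess stand in for the tracker's
    // RecordFailure / RecordBreakerOpen / RecordSuccess. Null delegates model the
    // "absent tracker = silent" path.
    public static ResiliencePipeline Build(
        Action? recordFailure, Action? recordBreakerOpen, Action? recordSuccess)
    {
        return new ResiliencePipelineBuilder()
            .AddRetry(new RetryStrategyOptions
            {
                MaxRetryAttempts = 3,          // illustrative values, not the real config
                Delay = TimeSpan.Zero,
                // Cancellation stays excluded from the failure count.
                ShouldHandle = args => new ValueTask<bool>(
                    args.Outcome.Exception is not null and not OperationCanceledException),
                OnRetry = _ =>
                {
                    recordFailure?.Invoke();   // ConsecutiveFailures climbs per retry
                    return default;
                }
            })
            .AddCircuitBreaker(new CircuitBreakerStrategyOptions
            {
                // stamp LastCircuitBreakerOpenUtc when the breaker trips
                OnOpened = _ => { recordBreakerOpen?.Invoke(); return default; },
                // target recovered: reset the failure counter
                OnClosed = _ => { recordSuccess?.Invoke(); return default; }
            })
            .Build();
    }
}
```

A transient-forever callback then drives one `recordFailure` per retry attempt, which is exactly the counter behaviour the /hosts Fail# column surfaces.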
Above the first cluster table, a red alert banner summarises the flagged-host count when ≥1 host is flagged, so operators see the problem before scanning rows.

Three new tests in DriverResiliencePipelineBuilderTests: Tracker_RecordsFailure_OnEveryRetry verifies ConsecutiveFailures reaches RetryCount after a transient-forever operation; Tracker_StampsBreakerOpen_WhenBreakerTrips verifies LastCircuitBreakerOpenUtc is set after threshold failures on a Write pipeline; Tracker_IsolatesCounters_PerHost verifies one dead host does not leak failure counts into a healthy sibling.

Full suite: Core.Tests 14/14 resilience-builder tests passing (11 existing + 3 new), Admin.Tests 72/72 passing, Admin project builds with 0 errors.

SignalR live push of status changes + browser visual review are deliberately left to a follow-up — this PR keeps the structural change minimal (polling refresh already exists in the page's 10s timer; SignalR would be a structural add that touches hub registration + client subscription).
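The counter semantics those three tests pin down — climb per retry, reset once the target recovers, per-(DriverInstanceId, HostName) isolation — can be sketched with a minimal in-memory tracker. The class and member names below are assumptions modelled on the commit description, not the real DriverResilienceStatusTracker.

```csharp
using System;
using System.Collections.Concurrent;

// Minimal stand-in for the tracker, keyed per (DriverInstanceId, HostName) so one
// dead host cannot leak failure counts into a healthy sibling.
public sealed class ResilienceCounterSketch
{
    private sealed class Entry
    {
        public int ConsecutiveFailures;
        public DateTime? LastCircuitBreakerOpenUtc;
    }

    private readonly ConcurrentDictionary<(string Instance, string Host), Entry> _entries = new();

    private Entry For(string instance, string host) =>
        _entries.GetOrAdd((instance, host), _ => new Entry());

    // OnRetry callback target: one increment per retry attempt.
    public void RecordFailure(string instance, string host) =>
        For(instance, host).ConsecutiveFailures++;

    // OnOpened callback target: stamp when the breaker tripped.
    public void RecordBreakerOpen(string instance, string host) =>
        For(instance, host).LastCircuitBreakerOpenUtc = DateTime.UtcNow;

    // OnClosed callback target: the target recovered, so the counter resets.
    public void RecordSuccess(string instance, string host) =>
        For(instance, host).ConsecutiveFailures = 0;

    public int Failures(string instance, string host) =>
        For(instance, host).ConsecutiveFailures;

    public DateTime? BreakerOpenedUtc(string instance, string host) =>
        For(instance, host).LastCircuitBreakerOpenUtc;
}
```

A missing key behaves like the LEFT JOIN's missing row: zero failures and a never-opened breaker.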
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:35:54 -04:00


using Microsoft.EntityFrameworkCore;
using ZB.MOM.WW.OtOpcUa.Configuration;
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;

namespace ZB.MOM.WW.OtOpcUa.Admin.Services;

/// <summary>
/// One row per <see cref="DriverHostStatus"/> record, enriched with the owning
/// <c>ClusterNode.ClusterId</c> (left-join) + the per-<c>(DriverInstanceId, HostName)</c>
/// <see cref="DriverInstanceResilienceStatus"/> counters (also left-join) so the Admin
/// <c>/hosts</c> page renders the resilience surface inline with host state.
/// </summary>
public sealed record HostStatusRow(
    string NodeId,
    string? ClusterId,
    string DriverInstanceId,
    string HostName,
    DriverHostState State,
    DateTime StateChangedUtc,
    DateTime LastSeenUtc,
    string? Detail,
    int ConsecutiveFailures,
    DateTime? LastCircuitBreakerOpenUtc,
    int CurrentBulkheadDepth,
    DateTime? LastRecycleUtc);

/// <summary>
/// Read-side service for the Admin UI's per-host drill-down. Loads
/// <see cref="DriverHostStatus"/> rows (written by the Server process's
/// <c>HostStatusPublisher</c>) and left-joins <c>ClusterNode</c> so each row knows which
/// cluster it belongs to — the Admin UI groups by cluster for the fleet-wide view.
/// </summary>
/// <remarks>
/// The publisher heartbeat is 10s (<c>HostStatusPublisher.HeartbeatInterval</c>). The
/// Admin page also polls every ~10s and treats rows with <c>LastSeenUtc</c> older than
/// <c>StaleThreshold</c> (30s) as stale — covers a missed heartbeat tolerance plus
/// a generous buffer for clock skew and publisher GC pauses.
/// </remarks>
public sealed class HostStatusService(OtOpcUaConfigDbContext db)
{
    public static readonly TimeSpan StaleThreshold = TimeSpan.FromSeconds(30);

    /// <summary>Consecutive-failure threshold at which <see cref="IsFlagged"/> returns <c>true</c>
    /// so the Admin UI can paint a red badge. Matches Phase 6.1 decision #143's conservative
    /// half-of-breaker-threshold convention — flags before the breaker actually opens.</summary>
    public const int FailureFlagThreshold = 3;

    public async Task<IReadOnlyList<HostStatusRow>> ListAsync(CancellationToken ct = default)
    {
        // Two LEFT JOINs:
        //   1. ClusterNodes on NodeId — row persists even when its owning ClusterNode row
        //      hasn't been created yet (first-boot bootstrap case).
        //   2. DriverInstanceResilienceStatuses on (DriverInstanceId, HostName) — resilience
        //      counters haven't been sampled yet for brand-new hosts, so a missing row means
        //      zero failures + never-opened breaker.
        var rows = await (from s in db.DriverHostStatuses.AsNoTracking()
                          join n in db.ClusterNodes.AsNoTracking()
                              on s.NodeId equals n.NodeId into nodeJoin
                          from n in nodeJoin.DefaultIfEmpty()
                          join r in db.DriverInstanceResilienceStatuses.AsNoTracking()
                              on new { s.DriverInstanceId, s.HostName } equals new { r.DriverInstanceId, r.HostName } into resilJoin
                          from r in resilJoin.DefaultIfEmpty()
                          orderby s.NodeId, s.DriverInstanceId, s.HostName
                          select new HostStatusRow(
                              s.NodeId,
                              n != null ? n.ClusterId : null,
                              s.DriverInstanceId,
                              s.HostName,
                              s.State,
                              s.StateChangedUtc,
                              s.LastSeenUtc,
                              s.Detail,
                              r != null ? r.ConsecutiveFailures : 0,
                              r != null ? r.LastCircuitBreakerOpenUtc : null,
                              r != null ? r.CurrentBulkheadDepth : 0,
                              r != null ? r.LastRecycleUtc : null)).ToListAsync(ct);
        return rows;
    }

    public static bool IsStale(HostStatusRow row) =>
        DateTime.UtcNow - row.LastSeenUtc > StaleThreshold;

    /// <summary>
    /// Red-badge predicate — <c>true</c> when the host has accumulated enough consecutive
    /// failures that an operator should take notice before the breaker trips.
    /// </summary>
    public static bool IsFlagged(HostStatusRow row) =>
        row.ConsecutiveFailures >= FailureFlagThreshold;
}