feat(host): readiness gates on required cluster singletons (#28, M2.14)

REQ-HOST-4a lists "required cluster singletons running (if applicable)" as a readiness criterion, but /health/ready only checked database + akka-cluster. Add a third Ready-tagged check, RequiredSingletonsHealthCheck, registered in the Central-role AddHealthChecks() chain (so it is naturally role-scoped: site nodes never run it).

Probe: for each required central singleton, Ask its local ClusterSingletonProxy an Identify with a short bounded per-singleton timeout (~2s, probes run concurrently via Task.WhenAll). A non-null ActorIdentity.Subject within the timeout means the singleton is running and reachable through the proxy; a null subject or a timeout means unreachable → Unhealthy, naming the unreachable singleton(s). The check never throws (catch-all → Unhealthy) and resolves ActorSystem lazily from DI per probe (Unhealthy if Akka not yet up).

Required-always set = the five singleton proxies created unconditionally in AkkaHostedService.RegisterCentralActors: notification-outbox, audit-log-ingest, site-call-audit, audit-log-purge, site-audit-reconciliation. There are no feature/config-gated central singletons today; any future gated singleton is the "if applicable" case and must NOT be added to the required set.

Leadership-agnostic: the proxy reaches the singleton from either central node, so a ready standby still reports ready (readiness must not require cluster leadership; that is the Active tier's job).

During a brief singleton handover the probe may time out and the node flaps to not-ready, which is correct (a node mid-handover is legitimately not fully ready); no retries, to keep the probe fast.

Tests (TDD): RequiredSingletonsHealthCheckTests exercises the probe against a TestKit ActorSystem: all proxies present+reachable → Healthy; one missing → Unhealthy naming it; ActorSystem absent → Unhealthy, no throw. HealthCheckTests regression-guards the Ready tag + absence of the Active tag on the new check.
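
Illustration only (not part of this change): a minimal, Akka-free sketch of the probe shape described above, showing why running the bounded probes concurrently keeps the whole check near one timeout rather than the sum of all of them. The names BoundedProbeSketch, Probe, ProbeOneAsync and FindUnreachableAsync are hypothetical, and the Probe delegate is an assumed stand-in for the Identify Ask against a proxy.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedProbeSketch
{
    // Hypothetical stand-in for "Ask the proxy an Identify": completes with true
    // when the simulated singleton answers before the token is cancelled.
    public delegate Task<bool> Probe(string name, CancellationToken ct);

    // Runs one probe under its own timeout. Never throws: a timeout, a
    // cancellation, or any other failure is reported as "not reachable".
    public static async Task<(string Name, bool Reachable)> ProbeOneAsync(
        string name, Probe probe, TimeSpan timeout, CancellationToken ct)
    {
        try
        {
            using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
            cts.CancelAfter(timeout);
            var ok = await probe(name, cts.Token).ConfigureAwait(false);
            return (name, ok);
        }
        catch (Exception)
        {
            return (name, false);
        }
    }

    // Starts every probe before awaiting any of them, so the total wait is
    // bounded by roughly one timeout. Returns the names that did not answer.
    public static async Task<IReadOnlyList<string>> FindUnreachableAsync(
        IEnumerable<string> names, Probe probe, TimeSpan timeout,
        CancellationToken ct = default)
    {
        var results = await Task
            .WhenAll(names.Select(n => ProbeOneAsync(n, probe, timeout, ct)))
            .ConfigureAwait(false);

        return results.Where(r => !r.Reachable).Select(r => r.Name).ToList();
    }

    public static async Task Main()
    {
        // Simulated singletons: one is "mid-handover" and answers far too late.
        Probe fake = async (name, ct) =>
        {
            var latency = name == "audit-log-purge"
                ? TimeSpan.FromSeconds(30)
                : TimeSpan.FromMilliseconds(20);
            await Task.Delay(latency, ct).ConfigureAwait(false);
            return true;
        };

        var names = new[] { "notification-outbox", "audit-log-ingest", "audit-log-purge" };
        var unreachable = await FindUnreachableAsync(
            names, fake, TimeSpan.FromMilliseconds(500));

        Console.WriteLine(unreachable.Count == 0
            ? "Healthy"
            : "Unhealthy: " + string.Join(", ", unreachable));
    }
}
```

The sketch mirrors only the control flow (per-probe timeout, catch-all, WhenAll, name the failures); it makes no claim about how Akka.NET routes Identify.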
2026-06-16 06:48:52 -04:00
parent 3945789970
commit 253bec5a52
6 changed files with 311 additions and 2 deletions
@@ -0,0 +1,177 @@
+using Akka.Actor;
+using Microsoft.Extensions.DependencyInjection;
+using Microsoft.Extensions.Diagnostics.HealthChecks;
+using Microsoft.Extensions.Logging;
+
+namespace ZB.MOM.WW.ScadaBridge.Host.Health;
+
+/// <summary>
+/// M2.14 (#28): readiness check that verifies every <b>required central cluster
+/// singleton</b> is reachable from this node, satisfying the "required cluster
+/// singletons running (if applicable)" clause of REQ-HOST-4a. Register it
+/// <see cref="ZB.MOM.WW.Health.ZbHealthTags.Ready"/>-tagged in the Central-role
+/// <c>AddHealthChecks()</c> chain only, so it is naturally role-scoped (site nodes
+/// never register it).
+/// </summary>
+/// <remarks>
+/// <para>
+/// <b>Probe strategy.</b> Each central singleton has a local
+/// <c>ClusterSingletonProxy</c> actor (created unconditionally in
+/// <c>AkkaHostedService.RegisterCentralActors</c>). The proxy actor exists locally
+/// as soon as it is created, so merely resolving its path proves nothing about the
+/// singleton itself. Instead we send the proxy an <see cref="Identify"/> via
+/// <see cref="Futures.Ask{T}(ICanTell, object, TimeSpan?, CancellationToken)"/> with a
+/// short bounded per-singleton timeout and expect an <see cref="ActorIdentity"/> whose
+/// <see cref="ActorIdentity.Subject"/> is non-null. The proxy buffers and forwards to
+/// the live singleton, so a non-null Subject within the timeout means the singleton is
+/// running and reachable; a null Subject or a timeout means it is unreachable. Probes
+/// run concurrently (<see cref="Task.WhenAll{TResult}(Task{TResult}[])"/>) so the
+/// whole check stays cheap and readiness polling stays fast.
+/// </para>
+/// <para>
+/// <b>Required-always vs if-applicable.</b> All five central singleton proxies are
+/// created unconditionally on a central node (there is no feature/config gate around
+/// any of them), so all five are treated as required-always here. If a future
+/// singleton is created behind a feature flag, it should NOT be added to
+/// <see cref="RequiredSingletonProxyNames"/> — "if applicable" means skip when its
+/// feature is off.
+/// </para>
+/// <para>
+/// <b>Failover flakiness.</b> During a brief singleton handover the singleton may be
+/// momentarily unreachable through the proxy. The bounded per-singleton timeout maps
+/// that to Unhealthy (we never throw and never retry — retries would make the probe
+/// slow). Readiness flapping briefly during a failover is acceptable and correct: a
+/// node mid-handover is legitimately not fully ready. We deliberately accept that
+/// tradeoff rather than masking it with retries.
+/// </para>
+/// <para>
+/// <b>No leadership requirement.</b> The proxy reaches the singleton from either node
+/// (active or standby), so a ready standby still reports Healthy here — readiness must
+/// NOT require cluster leadership (that is the Active tier's job).
+/// </para>
+/// <para>
+/// The <see cref="ActorSystem"/> is resolved lazily from DI on every check invocation,
+/// mirroring <c>AkkaClusterHealthCheck</c>; if it is not yet available (startup race)
+/// the check returns Unhealthy rather than throwing.
+/// </para>
+/// </remarks>
+public sealed class RequiredSingletonsHealthCheck : IHealthCheck
+{
+    /// <summary>
+    /// Local actor names (under <c>/user</c>) of the <c>ClusterSingletonProxy</c>
+    /// actors for the singletons that must always be running on a central node.
+    /// Matches the unconditional proxy registrations in
+    /// <c>AkkaHostedService.RegisterCentralActors</c>.
+    /// </summary>
+    public static readonly IReadOnlyList<string> RequiredSingletonProxyNames = new[]
+    {
+        "notification-outbox-proxy",
+        "audit-log-ingest-proxy",
+        "site-call-audit-proxy",
+        "audit-log-purge-proxy",
+        "site-audit-reconciliation-proxy",
+    };
+
+    // Short, bounded per-singleton timeout. Kept small so readiness polling stays
+    // fast; a singleton in mid-handover that does not answer within this window is
+    // (correctly) treated as momentarily unreachable. Do NOT add retries here.
+    private static readonly TimeSpan ProbeTimeout = TimeSpan.FromSeconds(2);
+
+    private readonly IServiceProvider _serviceProvider;
+    private readonly ILogger<RequiredSingletonsHealthCheck> _logger;
+
+    /// <summary>Initializes a new <see cref="RequiredSingletonsHealthCheck"/>.</summary>
+    /// <param name="serviceProvider">
+    /// Application service provider; the <see cref="ActorSystem"/> is resolved lazily so the
+    /// check is startup-safe (Unhealthy, never throwing, if Akka is not yet up).
+    /// </param>
+    /// <param name="logger">Logger for diagnostic detail on unreachable singletons.</param>
+    public RequiredSingletonsHealthCheck(
+        IServiceProvider serviceProvider,
+        ILogger<RequiredSingletonsHealthCheck> logger)
+    {
+        _serviceProvider = serviceProvider ?? throw new ArgumentNullException(nameof(serviceProvider));
+        _logger = logger ?? throw new ArgumentNullException(nameof(logger));
+    }
+
+    /// <inheritdoc />
+    public async Task<HealthCheckResult> CheckHealthAsync(
+        HealthCheckContext context,
+        CancellationToken cancellationToken = default)
+    {
+        // CheckHealthAsync must NEVER throw: catch everything and map it to Unhealthy
+        // with a descriptive message. An escaping exception would still be recorded as
+        // a failure by the health-check service, but without our descriptive message.
+        try
+        {
+            var system = _serviceProvider.GetService<ActorSystem>();
+            if (system is null)
+                return HealthCheckResult.Unhealthy("ActorSystem not yet available.");
+
+            // Probe each required singleton concurrently so the whole check is bounded
+            // by ~ProbeTimeout, not the sum of the per-singleton timeouts.
+            var probes = RequiredSingletonProxyNames
+                .Select(name => ProbeAsync(system, name, cancellationToken))
+                .ToArray();
+
+            var results = await Task.WhenAll(probes).ConfigureAwait(false);
+
+            var unreachable = results
+                .Where(r => !r.Reachable)
+                .Select(r => r.Name)
+                .ToList();
+
+            if (unreachable.Count == 0)
+                return HealthCheckResult.Healthy(
+                    $"All {RequiredSingletonProxyNames.Count} required cluster singletons are reachable.");
+
+            var joined = string.Join(", ", unreachable);
+            _logger.LogWarning(
+                "Readiness degraded: required cluster singleton(s) unreachable: {Unreachable}",
+                joined);
+            return HealthCheckResult.Unhealthy(
+                $"Required cluster singleton(s) unreachable: {joined}.");
+        }
+        catch (Exception ex)
+        {
+            // Defensive: ProbeAsync already maps timeout/cancellation to "unreachable", so only
+            // unexpected failures (e.g. a disposed provider on shutdown) become Unhealthy here.
+            return HealthCheckResult.Unhealthy(
+                "Failed to probe required cluster singletons.", ex);
+        }
+    }
+
+    /// <summary>
+    /// Asks the named local proxy an <see cref="Identify"/> with a bounded timeout.
+    /// Reachable iff a non-null <see cref="ActorIdentity.Subject"/> comes back in time.
+    /// A null Subject (path not present) or a timeout/exception → not reachable. This
+    /// method itself never throws.
+    /// </summary>
+    private async Task<(string Name, bool Reachable)> ProbeAsync(
+        ActorSystem system,
+        string proxyName,
+        CancellationToken cancellationToken)
+    {
+        try
+        {
+            // ActorSelection so a missing path resolves an ActorIdentity with a null
+            // Subject (rather than throwing) within the bounded timeout.
+            var selection = system.ActorSelection($"/user/{proxyName}");
+            using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
+            cts.CancelAfter(ProbeTimeout);
+
+            var identity = await selection
+                .Ask<ActorIdentity>(new Identify(proxyName), ProbeTimeout, cts.Token)
+                .ConfigureAwait(false);
+
+            return (proxyName, identity.Subject is not null);
+        }
+        catch (Exception)
+        {
+            // Timeout / cancellation / any failure → momentarily unreachable. Bounded,
+            // no retry — readiness may briefly flap during a singleton handover, which
+            // is the correct signal for a node mid-handover.
+            return (proxyName, false);
+        }
+    }
+}
@@ -202,6 +202,18 @@ try
                failureStatus: null,
                tags: new[] { ZbHealthTags.Ready },
                args: AkkaClusterStatusPolicy.Default)
+            // M2.14 (#28): readiness ALSO reflects "required cluster singletons running"
+            // (REQ-HOST-4a). Probes each central singleton's local ClusterSingletonProxy
+            // with a bounded Identify and reports Unhealthy if any required singleton is
+            // unreachable. Registered in this Central-role branch so the check is
+            // naturally role-scoped: site nodes never run it. It resolves ActorSystem
+            // from DI on every check, like the akka-cluster check above, and is
+            // leadership-agnostic, so a ready standby still reports ready (the proxy
+            // reaches the singleton from either node).
+            .AddTypeActivatedCheck<RequiredSingletonsHealthCheck>(
+                "required-singletons",
+                failureStatus: null,
+                tags: new[] { ZbHealthTags.Ready })
            .AddTypeActivatedCheck<ActiveNodeHealthCheck>(
                "active-node",
                failureStatus: null,