feat(host): readiness gates on required cluster singletons (#28, M2.14)

REQ-HOST-4a lists "required cluster singletons running (if applicable)" as a
readiness criterion, but /health/ready only checked database + akka-cluster.
Add a third Ready-tagged check, RequiredSingletonsHealthCheck, registered in the
Central-role AddHealthChecks() chain (so it is naturally role-scoped — site nodes
never run it).

Probe: for each required central singleton, Ask its local ClusterSingletonProxy
an Identify with a short per-singleton timeout (~2s). Probes run concurrently
via Task.WhenAll, so the whole check is bounded by roughly one timeout rather
than the sum of all five. A non-null ActorIdentity.Subject within the timeout
means the singleton is running and reachable through the proxy; a null subject
or a timeout means unreachable → Unhealthy, naming the unreachable singleton(s).
The check never throws (catch-all → Unhealthy) and resolves ActorSystem lazily
from DI per probe (Unhealthy if Akka is not yet up).
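
The aggregation described above can be sketched roughly as follows. This is a
minimal illustration, not the actual RequiredSingletonsHealthCheck: the Akka
Identify Ask is abstracted behind a hypothetical `probe` delegate, the names
ProbeOneAsync and FindUnreachableAsync are invented for the example, and it
assumes .NET 6+ for Task.WaitAsync. It is written as top-level local functions
so it can be pasted into a scratch Program.cs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Probes one singleton with a bounded wait. `probe` is a hypothetical stand-in
// for "Ask the proxy an Identify and report whether Subject was non-null".
static async Task<(string Name, bool Reachable)> ProbeOneAsync(
    string name,
    Func<string, CancellationToken, Task<bool>> probe,
    TimeSpan timeout)
{
    using var cts = new CancellationTokenSource(timeout);
    try
    {
        // WaitAsync bounds the wait even if the probe ignores the token.
        var reachable = await probe(name, cts.Token).WaitAsync(timeout);
        return (name, reachable);
    }
    catch (Exception)
    {
        // A timeout or any other failure counts as unreachable; never rethrow.
        return (name, false);
    }
}

// Runs all probes concurrently and returns the names that did not answer,
// in the same order as the input.
static async Task<IReadOnlyList<string>> FindUnreachableAsync(
    IEnumerable<string> required,
    Func<string, CancellationToken, Task<bool>> probe,
    TimeSpan timeout)
{
    var results = await Task.WhenAll(
        required.Select(name => ProbeOneAsync(name, probe, timeout)));
    return results.Where(r => !r.Reachable).Select(r => r.Name).ToArray();
}
```

A health check built on this would report Healthy when the returned list is
empty and otherwise Unhealthy, with the names joined into the description.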

Required-always set = the five singleton proxies created unconditionally in
AkkaHostedService.RegisterCentralActors: notification-outbox, audit-log-ingest,
site-call-audit, audit-log-purge, site-audit-reconciliation. There are no
feature/config-gated central singletons today; any future gated singleton is the
"if applicable" case and must NOT be added to the required set.

Leadership-agnostic: the proxy reaches the singleton from either central node, so
a ready standby still reports ready (readiness must not require cluster
leadership; that is the Active tier's job). During a brief singleton handover the
probe may time out and the node flaps to not-ready. This is correct, since a node
mid-handover is legitimately not fully ready. The probe does not retry, which
keeps it fast.

Tests (TDD): RequiredSingletonsHealthCheckTests exercises the probe against a
TestKit ActorSystem — all proxies present+reachable → Healthy; one missing →
Unhealthy naming it; ActorSystem absent → Unhealthy, no throw. HealthCheckTests
regression-guards the Ready tag + absence of the Active tag on the new check.
Joseph Doherty
2026-06-16 06:48:52 -04:00
parent 3945789970
commit 253bec5a52
6 changed files with 311 additions and 2 deletions
@@ -202,6 +202,18 @@ try
             failureStatus: null,
             tags: new[] { ZbHealthTags.Ready },
             args: AkkaClusterStatusPolicy.Default)
+        // M2.14 (#28): readiness ALSO reflects "required cluster singletons running"
+        // (REQ-HOST-4a). Probes each central singleton's local ClusterSingletonProxy
+        // with a bounded Identify and degrades to Unhealthy if any required singleton
+        // is unreachable. Registered inside the Central-role branch (this is it) so the
+        // check is naturally role-scoped — site nodes never run it. It resolves
+        // ActorSystem from DI per probe, like the akka-cluster check above, and is
+        // leadership-agnostic so a ready standby still reports ready (the proxy reaches
+        // the singleton from either node).
+        .AddTypeActivatedCheck<RequiredSingletonsHealthCheck>(
+            "required-singletons",
+            failureStatus: null,
+            tags: new[] { ZbHealthTags.Ready })
         .AddTypeActivatedCheck<ActiveNodeHealthCheck>(
             "active-node",
             failureStatus: null,