Health — current state: OtOpcUa
Repo: ~/Desktop/OtOpcUa. Stack: .NET 10, Akka.NET, OPC UA; solution ZB.MOM.WW.OtOpcUa.slnx.
Health code lives in src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/. All paths relative to repo root.
Verified 2026-06-01.
Full three-tier pattern: /health/ready, /health/active, and /healthz. Three probes covering
the database, the Akka cluster, and the admin-role leader. All endpoints are AllowAnonymous to
permit Traefik and load-balancer probing without credentials.
1. Endpoint wiring
src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/HealthEndpoints.cs:
- :13 - XML comment explicitly names this as "ScadaLink's three-tier pattern: ready = boot ok; active = fully serving traffic; healthz = bare process liveness."
- :17 - AddOtOpcUaHealth(IServiceCollection) calls services.AddHealthChecks() and registers all three probes (lines 20–22):
  - DatabaseHealthCheck: name "configdb", tags ["ready","active"]
  - AkkaClusterHealthCheck: name "akka", tags ["ready","active"]
  - AdminRoleLeaderHealthCheck: name "admin-leader", tags ["active"] only
- :28 - MapOtOpcUaHealth(IEndpointRouteBuilder) maps three endpoints (lines 33–44):
  - /health/ready: predicate c => c.Tags.Contains("ready"), .AllowAnonymous() (lines 33–36)
  - /health/active: predicate c => c.Tags.Contains("active"), .AllowAnonymous() (lines 37–40)
  - /healthz: predicate _ => false (no probes run; bare process liveness only), .AllowAnonymous() (lines 41–44)
Program.cs:
- :137 - builder.Services.AddOtOpcUaHealth()
- :159 - app.MapOtOpcUaHealth()
Response writer: the ASP.NET Core default, which writes the aggregate status as plain text (no HealthChecks.UI.Client JSON writer).
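The tag-to-tier wiring described in this section can be sketched as a pair of extension methods. This is an illustration under stated assumptions, not the contents of HealthEndpoints.cs: the names HealthWiringSketch, AddSketchHealth, and MapSketchHealth are hypothetical, and the stub delegates stand in for the real probe classes. Only the check names, tags, and predicates are taken from the description above.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical names throughout; stub delegates replace the real probes.
public static class HealthWiringSketch
{
    public static IServiceCollection AddSketchHealth(this IServiceCollection services)
    {
        services.AddHealthChecks()
            // Database and cluster probes gate both readiness and active status.
            .AddCheck("configdb", () => HealthCheckResult.Healthy("stub"),
                tags: new[] { "ready", "active" })
            .AddCheck("akka", () => HealthCheckResult.Healthy("stub"),
                tags: new[] { "ready", "active" })
            // The leader probe only participates in the active tier.
            .AddCheck("admin-leader", () => HealthCheckResult.Healthy("stub"),
                tags: new[] { "active" });
        return services;
    }

    public static IEndpointRouteBuilder MapSketchHealth(this IEndpointRouteBuilder endpoints)
    {
        endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
        {
            Predicate = c => c.Tags.Contains("ready"),
        }).AllowAnonymous();

        endpoints.MapHealthChecks("/health/active", new HealthCheckOptions
        {
            Predicate = c => c.Tags.Contains("active"),
        }).AllowAnonymous();

        // A predicate that rejects every registration runs zero probes,
        // so the endpoint answers as long as the process is serving HTTP.
        endpoints.MapHealthChecks("/healthz", new HealthCheckOptions
        {
            Predicate = _ => false,
        }).AllowAnonymous();

        return endpoints;
    }
}
```

Because the tiers are selected purely by tag predicate, moving a probe between tiers is a registration change rather than an endpoint change.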
2. Probes
DatabaseHealthCheck
src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/DatabaseHealthCheck.cs:
- :9 - injects IDbContextFactory<OtOpcUaConfigDbContext>
- :25–37 - opens a pooled context via CreateDbContextAsync, runs db.Deployments.AsNoTracking().Take(1).ToListAsync(). If the query succeeds → HealthCheckResult.Healthy("ConfigDb reachable") (:31). If it throws → HealthCheckResult.Unhealthy("ConfigDb unreachable", ex) (:35). No Degraded path.
The probe exercises a real query (not just CanConnectAsync) — it confirms the Deployments table
is readable, which implies the schema migration has run. This is stricter than ScadaBridge's
CanConnectAsync but more opaque about the failure reason.
Tags on registration: ["ready","active"] — the database must be reachable for both readiness and
active-node determination.
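A query probe of this shape can be sketched generically. The class below is a hypothetical illustration, not the OtOpcUa DatabaseHealthCheck nor the shared ZB.MOM.WW.Health implementation: the name QueryProbeHealthCheckSketch and the probe-delegate signature are assumptions made for the example. The result messages mirror the ones quoted above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch: runs a caller-supplied query against a fresh context.
public sealed class QueryProbeHealthCheckSketch<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly IDbContextFactory<TContext> _factory;
    private readonly Func<TContext, CancellationToken, Task> _probe;

    public QueryProbeHealthCheckSketch(
        IDbContextFactory<TContext> factory,
        Func<TContext, CancellationToken, Task> probe)
    {
        _factory = factory;
        _probe = probe;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            // A short-lived context per check avoids holding state between probes.
            await using var db = await _factory.CreateDbContextAsync(cancellationToken);
            await _probe(db, cancellationToken);
            return HealthCheckResult.Healthy("ConfigDb reachable");
        }
        catch (Exception ex)
        {
            // Any failure (connection, missing table, timeout) maps to Unhealthy;
            // the exception is attached so the cause is not lost entirely.
            return HealthCheckResult.Unhealthy("ConfigDb unreachable", ex);
        }
    }
}
```

With the OtOpcUa context, the probe delegate would be something like (db, ct) => db.Deployments.AsNoTracking().Take(1).ToListAsync(ct). Passing the exception through to the result is what keeps the failure reason recoverable from logs even though the status itself is a single Unhealthy value.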
AkkaClusterHealthCheck
src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AkkaClusterHealthCheck.cs:
- :9 - injects ActorSystem directly
- :27–33 - calls Cluster.Get(_system), scans cluster.State.Members for the member whose Address == cluster.SelfAddress and Status == MemberStatus.Up:
  - Found Up → HealthCheckResult.Healthy($"Self Up; {cluster.State.Members.Count} member(s)") (:32)
  - Not found → HealthCheckResult.Degraded("Self not yet Up in cluster") (:33)
No Unhealthy path — joining/leaving/removed nodes are all reported as Degraded. This differs from
ScadaBridge's more granular three-way policy (see GAPS).
Tags on registration: ["ready","active"].
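The two-way policy reads roughly as follows. This is a reconstruction from the description above, not a copy of AkkaClusterHealthCheck.cs; the class name SelfUpHealthCheckSketch is hypothetical, and the real file may differ in detail.

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch of the self-Up-among-members scan.
public sealed class SelfUpHealthCheckSketch : IHealthCheck
{
    private readonly ActorSystem _system;

    public SelfUpHealthCheckSketch(ActorSystem system) => _system = system;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var cluster = Akka.Cluster.Cluster.Get(_system);
        var members = cluster.State.Members;

        // Healthy only when this node appears in the member list with status Up.
        var selfUp = members.Any(m =>
            m.Address.Equals(cluster.SelfAddress) &&
            m.Status == Akka.Cluster.MemberStatus.Up);

        // Every other state (Joining, Leaving, Exiting, absent) collapses to Degraded.
        return Task.FromResult(selfUp
            ? HealthCheckResult.Healthy($"Self Up; {members.Count} member(s)")
            : HealthCheckResult.Degraded("Self not yet Up in cluster"));
    }
}
```

Note that the check never returns Unhealthy, so with default status-code mapping it cannot by itself produce a non-200 response.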
AdminRoleLeaderHealthCheck
src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AdminRoleLeaderHealthCheck.cs:
- :14 - injects IClusterRoleInfo
- :27–38 - three-path logic:
  - Node does not carry the "admin" role → Healthy("Node does not carry admin role") (:30). Non-admin nodes are immediately healthy, so this probe never gates a non-admin node.
  - Admin role + node is the role leader → Healthy($"Admin leader ({...})") (:36)
  - Admin role + not the leader → Degraded($"Admin member but not leader (leader=...)") (:37)
Tags on registration: ["active"] only, so the probe does not participate in /health/ready. The intent is
Traefik routing: the active node (admin-role leader) gets sticky admin-UI traffic; standby nodes
are reachable for data-plane OPC UA but report Degraded on /health/active so the load balancer
does not route control-plane traffic to them. This only gates traffic if the balancer can tell
Degraded apart from Healthy: ASP.NET Core returns HTTP 200 for Degraded by default (see §5).
Note: no Unhealthy path for the role-filter case. If the ActorSystem is not running, IClusterRoleInfo
presumably returns safe defaults (no role); this is not separately health-checked.
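The three-path decision can be isolated as a pure function, which makes the policy easy to unit-test. The interface below is a hypothetical stand-in: the actual members of IClusterRoleInfo are not shown in this document, so IRoleViewSketch and its three members are assumptions made for the example.

```csharp
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-in for IClusterRoleInfo; the real interface is not shown here.
public interface IRoleViewSketch
{
    bool SelfHasRole(string role);
    bool SelfIsRoleLeader(string role);
    string? RoleLeaderAddress(string role);
}

public static class RoleLeaderPolicySketch
{
    public static HealthCheckResult Evaluate(IRoleViewSketch roles, string role = "admin")
    {
        // Path 1: nodes without the role are never gated by this probe.
        if (!roles.SelfHasRole(role))
            return HealthCheckResult.Healthy($"Node does not carry {role} role");

        // Path 2: the role leader is the active node.
        if (roles.SelfIsRoleLeader(role))
            return HealthCheckResult.Healthy($"{role} leader");

        // Path 3: a role member that is not the leader is a standby.
        var leader = roles.RoleLeaderAddress(role) ?? "unknown";
        return HealthCheckResult.Degraded($"{role} member but not leader (leader={leader})");
    }
}
```

Parameterizing the role name is what lets the same policy be reused with a role filter other than "admin".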
3. Tag / tier summary
| Probe | /health/ready | /health/active | /healthz |
|---|---|---|---|
| DatabaseHealthCheck | ✅ | ✅ | - |
| AkkaClusterHealthCheck | ✅ | ✅ | - |
| AdminRoleLeaderHealthCheck | - | ✅ | - |
| (no probes) | - | - | ✅ (bare liveness) |
/healthz runs zero probes — it is a pure process liveness sentinel (process reachable = healthy;
a crashed process = no response). Kubernetes liveness probes, Traefik TCP checks, and uptime
monitors use this tier.
4. Downstream dependency coverage
No probe for the upstream MxAccessGateway gRPC channel. If the gateway is unreachable, OtOpcUa
reports healthy here (the GalaxyDriver will surface errors in OPC UA diagnostics, but /health/ready
and /health/active will not reflect it). This is a gap that the shared GrpcDependencyHealthCheck
probe in ZB.MOM.WW.Health would close.
5. Notable design choices
- AllowAnonymous on all tiers: see the HealthEndpoints.cs:30–32 comment: "Without it the AddOtOpcUaAuth fallback policy 401s every probe and Traefik marks every backend unhealthy."
- Query probe, not CanConnectAsync: the Deployments query validates that the schema has been applied. ScadaBridge uses CanConnectAsync; neither is wrong but they diverge.
- Degraded semantics: the Akka check uses Degraded (not Unhealthy) for a joining/pre-Up node. ASP.NET Core maps Degraded to HTTP 200 by default; Traefik sees 200 and considers the node ready. If Unhealthy (HTTP 503) is required to gate traffic, the Degraded path is insufficient.
- IClusterRoleInfo abstraction: the admin-leader check depends on IClusterRoleInfo, an OtOpcUa interface, not the raw Akka.Cluster.Cluster API. This is a testability-friendly layer absent in ScadaBridge's direct Akka usage.
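If a Degraded result ever needs to take a node out of rotation, one option is to override the status-code mapping on the endpoint rather than change the probes. The sketch below uses the standard HealthCheckOptions.ResultStatusCodes property; the class name StrictActiveTierSketch is hypothetical, and nothing in the current OtOpcUa wiring is described as doing this.

```csharp
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch: options for an "active" tier where Degraded fails the probe.
public static class StrictActiveTierSketch
{
    public static HealthCheckOptions Create() => new HealthCheckOptions
    {
        Predicate = c => c.Tags.Contains("active"),
        ResultStatusCodes =
        {
            [HealthStatus.Healthy] = StatusCodes.Status200OK,
            // Default is 200; mapping to 503 lets a status-code-only balancer see standbys as down.
            [HealthStatus.Degraded] = StatusCodes.Status503ServiceUnavailable,
            [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable,
        },
    };
}
```

The options would be passed as the second argument to MapHealthChecks for the tier in question. Applying the override per endpoint keeps /health/ready tolerant of Degraded while making /health/active strict.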
Adoption plan → ZB.MOM.WW.Health
Replace with shared probes:
- AkkaClusterHealthCheck → ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck using the OtOpcUaCompat preset (self-Up-among-members scan → Healthy/Degraded). The preset keeps OtOpcUa's existing two-way policy without forcing ScadaBridge's three-way policy onto it.
- AdminRoleLeaderHealthCheck → ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck with RoleFilter = "admin". The role-filter parameter produces identical behavior: non-admin nodes immediately healthy, admin leader healthy, admin non-leader degraded.
- DatabaseHealthCheck → ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<OtOpcUaConfigDbContext> with a ProbeQuery delegate of db => db.Deployments.AsNoTracking().Take(1).ToListAsync(). The delegate preserves the stricter query probe rather than falling back to CanConnectAsync.
- Add GrpcDependencyHealthCheck targeting the MxAccessGateway channel (closes the downstream dependency gap noted in §4). Tag ["ready","active"].
- Replace AddOtOpcUaHealth/MapOtOpcUaHealth with services.AddHealthChecks().AddCheck<...>() (one call per probe, per spec §5) + app.MapZbHealth(). The /healthz bare-liveness tier is part of MapZbHealth by default, so no separate wiring is needed.
Keep bespoke:
- IClusterRoleInfo and its Akka implementation: on adoption this testability seam is given up for the health-check path. The shared ActiveNodeHealthCheck reads cluster role state from the ActorSystem directly (resolving it lazily via IServiceProvider); it does not accept IClusterRoleInfo as an injection point. This is an accepted trade-off: the shared implementation is simpler and consistent across projects, while IClusterRoleInfo remains available elsewhere in the OtOpcUa codebase where it is used outside health checks.
- The AllowAnonymous policy: this is an OtOpcUa auth concern; MapZbHealth must document that callers are responsible for applying AllowAnonymous (or the shared helper applies it by default).
- Which probes are registered and their tag assignments: the shared library supplies the check implementations; the wiring (which names, which tags, which options) remains per-project.
Adoption is a follow-on task (tracked in GAPS.md), not part of the ZB.MOM.WW.Health
library build. The library build delivers the shared implementations; adoption lands in the
OtOpcUa repo as a separate commit once the nupkg is available.