Health — current state: OtOpcUa
Repo: ~/Desktop/OtOpcUa. Stack: .NET 10, Akka.NET, OPC UA; solution ZB.MOM.WW.OtOpcUa.slnx.
Health code lives in src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/. All paths relative to repo root.
Verified 2026-06-01.
Full three-tier pattern: /health/ready, /health/active, and /healthz. Three probes covering
the database, the Akka cluster, and the admin-role leader. All endpoints are AllowAnonymous to
permit Traefik and load-balancer probing without credentials.
1. Endpoint wiring
src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/HealthEndpoints.cs:
- :13: XML comment explicitly names this as "ScadaLink's three-tier pattern: ready = boot ok; active = fully serving traffic; healthz = bare process liveness."
- :17: AddOtOpcUaHealth(IServiceCollection) calls services.AddHealthChecks() and registers all three probes (lines 20–22):
  - DatabaseHealthCheck: name "configdb", tags ["ready","active"]
  - AkkaClusterHealthCheck: name "akka", tags ["ready","active"]
  - AdminRoleLeaderHealthCheck: name "admin-leader", tags ["active"] only
- :28: MapOtOpcUaHealth(IEndpointRouteBuilder) maps three endpoints (lines 33–44):
  - /health/ready: predicate c => c.Tags.Contains("ready"), .AllowAnonymous() (lines 33–36)
  - /health/active: predicate c => c.Tags.Contains("active"), .AllowAnonymous() (lines 37–40)
  - /healthz: predicate _ => false (no probes run; bare process liveness only), .AllowAnonymous() (lines 41–44)
Program.cs:
- :137: builder.Services.AddOtOpcUaHealth()
- :159: app.MapOtOpcUaHealth()
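Response writer: the ASP.NET Core default, which writes the aggregate status (Healthy, Degraded, or
Unhealthy) as plain text. No JSON body is produced because HealthChecks.UI.Client is not used.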
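
The wiring described in this section can be sketched roughly as follows. This is a minimal illustration built from the standard ASP.NET Core health-check APIs, not the contents of HealthEndpoints.cs: StubCheck is a hypothetical stand-in for the three real probes, and the return types of the two extension methods are assumptions.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-in so the sketch compiles without the real probes.
public sealed class StubCheck : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
        => Task.FromResult(HealthCheckResult.Healthy("stub"));
}

public static class HealthEndpointsSketch
{
    // Registers the three probes with the names and tags listed above.
    public static IServiceCollection AddOtOpcUaHealth(this IServiceCollection services)
    {
        services.AddHealthChecks()
            .AddCheck<StubCheck>("configdb", tags: new[] { "ready", "active" })
            .AddCheck<StubCheck>("akka", tags: new[] { "ready", "active" })
            .AddCheck<StubCheck>("admin-leader", tags: new[] { "active" });
        return services;
    }

    // Maps the three tiers; each tier selects probes by tag.
    public static IEndpointRouteBuilder MapOtOpcUaHealth(this IEndpointRouteBuilder endpoints)
    {
        endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
        {
            Predicate = c => c.Tags.Contains("ready")
        }).AllowAnonymous();

        endpoints.MapHealthChecks("/health/active", new HealthCheckOptions
        {
            Predicate = c => c.Tags.Contains("active")
        }).AllowAnonymous();

        // The predicate rejects every registration, so no probe runs
        // and the endpoint reports Healthy whenever the process answers.
        endpoints.MapHealthChecks("/healthz", new HealthCheckOptions
        {
            Predicate = _ => false
        }).AllowAnonymous();

        return endpoints;
    }
}
```

Selecting probes by tag keeps a single registration list while letting each endpoint run a different subset of it.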
2. Probes
DatabaseHealthCheck
src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/DatabaseHealthCheck.cs:
- :9: injects IDbContextFactory<OtOpcUaConfigDbContext>
- :25–37: opens a pooled context via CreateDbContextAsync, runs db.Deployments.AsNoTracking().Take(1).ToListAsync(). If the query succeeds → HealthCheckResult.Healthy("ConfigDb reachable") (:31). If it throws → HealthCheckResult.Unhealthy("ConfigDb unreachable", ex) (:35). No Degraded path.
The probe exercises a real query (not just CanConnectAsync) — it confirms the Deployments table
is readable, which implies the schema migration has run. This is stricter than ScadaBridge's
CanConnectAsync but more opaque about the failure reason.
Tags on registration: ["ready","active"] — the database must be reachable for both readiness and
active-node determination.
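
A minimal sketch of a query-based probe of this shape is shown below. The Deployment entity and the DbContext body are hypothetical placeholders (the real OtOpcUaConfigDbContext is not reproduced in this document), and the exception handling is an assumption about how "if it throws" is implemented.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical minimal shapes; the real entity and context have more members.
public sealed class Deployment { public int Id { get; set; } }

public sealed class OtOpcUaConfigDbContext : DbContext
{
    public OtOpcUaConfigDbContext(DbContextOptions<OtOpcUaConfigDbContext> options)
        : base(options) { }

    public DbSet<Deployment> Deployments => Set<Deployment>();
}

public sealed class DatabaseHealthCheck : IHealthCheck
{
    private readonly IDbContextFactory<OtOpcUaConfigDbContext> _factory;

    public DatabaseHealthCheck(IDbContextFactory<OtOpcUaConfigDbContext> factory)
        => _factory = factory;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // A short-lived context per probe; disposed when the method exits.
            await using var db = await _factory.CreateDbContextAsync(cancellationToken);

            // Reading one row proves the table exists and is readable,
            // which a bare connectivity check would not.
            _ = await db.Deployments.AsNoTracking().Take(1).ToListAsync(cancellationToken);
            return HealthCheckResult.Healthy("ConfigDb reachable");
        }
        catch (Exception ex)
        {
            // Any failure (connection, missing table, timeout) lands here,
            // which is why the failure reason is opaque to the caller.
            return HealthCheckResult.Unhealthy("ConfigDb unreachable", ex);
        }
    }
}
```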
AkkaClusterHealthCheck
src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AkkaClusterHealthCheck.cs:
- :9: injects ActorSystem directly
- :27–33: calls Cluster.Get(_system), scans cluster.State.Members for the member whose Address == cluster.SelfAddress and Status == MemberStatus.Up:
  - Found Up → HealthCheckResult.Healthy($"Self Up; {cluster.State.Members.Count} member(s)") (:32)
  - Not found → HealthCheckResult.Degraded("Self not yet Up in cluster") (:33)
No Unhealthy path — joining/leaving/removed nodes are all reported as Degraded. This differs from
ScadaBridge's more granular three-way policy (see GAPS).
Tags on registration: ["ready","active"].
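
The two-way policy can be sketched as below. This is an approximation written from the description above, assuming the Akka.Cluster package; it is not a copy of AkkaClusterHealthCheck.cs.

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class AkkaClusterHealthCheck : IHealthCheck
{
    private readonly ActorSystem _system;

    public AkkaClusterHealthCheck(ActorSystem system) => _system = system;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Fully qualified to avoid any clash between the Akka.Cluster
        // namespace and the Cluster type.
        var cluster = Akka.Cluster.Cluster.Get(_system);
        var members = cluster.State.Members;

        // Healthy only when this node appears in the member list as Up.
        var selfUp = members.Any(m =>
            m.Address.Equals(cluster.SelfAddress) && m.Status == MemberStatus.Up);

        // Every other state (joining, leaving, removed, absent) is Degraded.
        var result = selfUp
            ? HealthCheckResult.Healthy($"Self Up; {members.Count} member(s)")
            : HealthCheckResult.Degraded("Self not yet Up in cluster");

        return Task.FromResult(result);
    }
}
```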
AdminRoleLeaderHealthCheck
src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AdminRoleLeaderHealthCheck.cs:
- :14: injects IClusterRoleInfo
- :27–38: three-path logic:
  - Node does not carry the "admin" role → Healthy("Node does not carry admin role") (:30). Non-admin nodes are immediately healthy, so this probe never gates a non-admin node.
  - Admin role + node is the role leader → Healthy($"Admin leader ({...})") (:36)
  - Admin role + not the leader → Degraded($"Admin member but not leader (leader=...)") (:37)
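Tags on registration: ["active"] only; the check does not participate in /health/ready. The intent
is Traefik routing: the active node (admin-role leader) gets sticky admin-UI traffic, while standby
nodes stay reachable for data-plane OPC UA but report Degraded on /health/active so the load
balancer does not route control-plane traffic to them. Because Degraded maps to HTTP 200 by default
(see §5), this only takes effect if the load balancer inspects the response body rather than the
status code, or if the status-code mapping is overridden.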
Note: no Unhealthy path for the role-filter case. If the ActorSystem is not running, IClusterRoleInfo
presumably returns safe defaults (no role); this is not separately health-checked.
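
The three-path logic can be illustrated with the sketch below. The members of IClusterRoleInfo are not listed in this document, so the interface shown here (HasRole, IsRoleLeader) is a hypothetical shape invented for the example, and the result descriptions omit the interpolated details that the real messages carry.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical shape; the real OtOpcUa interface may expose different members.
public interface IClusterRoleInfo
{
    bool HasRole(string role);
    bool IsRoleLeader(string role);
}

public sealed class AdminRoleLeaderHealthCheck : IHealthCheck
{
    private const string AdminRole = "admin";
    private readonly IClusterRoleInfo _roles;

    public AdminRoleLeaderHealthCheck(IClusterRoleInfo roles) => _roles = roles;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        HealthCheckResult result;

        if (!_roles.HasRole(AdminRole))
        {
            // Non-admin nodes are never gated by this probe.
            result = HealthCheckResult.Healthy("Node does not carry admin role");
        }
        else if (_roles.IsRoleLeader(AdminRole))
        {
            result = HealthCheckResult.Healthy("Admin leader");
        }
        else
        {
            // Standby admin node: alive, but not the active control-plane node.
            result = HealthCheckResult.Degraded("Admin member but not leader");
        }

        return Task.FromResult(result);
    }
}
```

Keeping the decision behind an interface means the three paths can be unit-tested with a fake IClusterRoleInfo and no running ActorSystem.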
3. Tag / tier summary
| Probe | /health/ready | /health/active | /healthz |
|---|---|---|---|
| DatabaseHealthCheck | ✅ | ✅ | - |
| AkkaClusterHealthCheck | ✅ | ✅ | - |
| AdminRoleLeaderHealthCheck | - | ✅ | - |
| (no probes) | - | - | ✅ (bare liveness) |
/healthz runs zero probes — it is a pure process liveness sentinel (process reachable = healthy;
a crashed process = no response). Kubernetes liveness probes, Traefik TCP checks, and uptime
monitors use this tier.
4. Downstream dependency coverage
No probe for the upstream MxAccessGateway gRPC channel. If the gateway is unreachable, OtOpcUa
reports healthy here (the GalaxyDriver will surface errors in OPC UA diagnostics, but /health/ready
and /health/active will not reflect it). This is a gap that the shared GrpcDependencyHealthCheck
probe in ZB.MOM.WW.Health would close.
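
One possible shape for such a dependency probe is sketched below. It is not the GrpcDependencyHealthCheck from ZB.MOM.WW.Health, whose API is not described here; the class name, the timeout parameter, and the choice of GrpcChannel.ConnectAsync as the reachability test are all assumptions. ConnectAsync requires the channel to use the default SocketsHttpHandler transport.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Net.Client;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical probe: reports whether a gRPC channel can establish a connection.
public sealed class GrpcChannelReachabilityCheck : IHealthCheck
{
    private readonly GrpcChannel _channel;
    private readonly TimeSpan _timeout;

    public GrpcChannelReachabilityCheck(GrpcChannel channel, TimeSpan timeout)
    {
        _channel = channel;
        _timeout = timeout;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Bound the probe so a black-holed gateway cannot stall the endpoint.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(_timeout);

        try
        {
            // Completes once the channel reaches the Ready state.
            await _channel.ConnectAsync(cts.Token);
            return HealthCheckResult.Healthy("Gateway channel connected");
        }
        catch (Exception ex)
        {
            // Covers connection failures and the timeout-driven cancellation.
            return HealthCheckResult.Unhealthy("Gateway channel unreachable", ex);
        }
    }
}
```

A transport-level connect only shows that the gateway accepts connections; it says nothing about whether the gateway's own upstream is working.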
5. Notable design choices
- AllowAnonymous on all tiers. See the comment at HealthEndpoints.cs:30–32: "Without it the AddOtOpcUaAuth fallback policy 401s every probe and Traefik marks every backend unhealthy."
- Query probe, not CanConnectAsync. The Deployments query validates that the schema has been applied. ScadaBridge uses CanConnectAsync; neither is wrong but they diverge.
- Degraded semantics. The Akka check uses Degraded (not Unhealthy) for a joining/pre-Up node. ASP.NET Core maps Degraded to HTTP 200 by default; Traefik sees 200 and considers the node ready. If Unhealthy (HTTP 503) is required to gate traffic, the Degraded path is insufficient.
- IClusterRoleInfo abstraction. The admin-leader check depends on IClusterRoleInfo, an OtOpcUa interface, not the raw Akka.Cluster.Cluster API. This is a testability-friendly layer absent in ScadaBridge's direct Akka usage.
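
If a Degraded result should gate traffic at the HTTP level, one option is to override the status-code mapping on the endpoint. The sketch below uses the standard HealthCheckOptions.ResultStatusCodes property; the MapStrictActive method name is hypothetical, and nothing in the current code does this.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public static class StrictActiveEndpointSketch
{
    // Hypothetical variant of the /health/active mapping that returns 503 for Degraded.
    public static IEndpointConventionBuilder MapStrictActive(this IEndpointRouteBuilder endpoints)
        => endpoints.MapHealthChecks("/health/active", new HealthCheckOptions
        {
            Predicate = c => c.Tags.Contains("active"),
            ResultStatusCodes =
            {
                [HealthStatus.Healthy] = StatusCodes.Status200OK,
                // Default is 200; 503 lets a status-code-only load balancer see it.
                [HealthStatus.Degraded] = StatusCodes.Status503ServiceUnavailable,
                [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable,
            },
        }).AllowAnonymous();
}
```

With this mapping, both a not-yet-Up node and an admin non-leader would return 503 on /health/active, so the change affects every probe tagged "active", not only the Akka check.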
Adoption plan → ZB.MOM.WW.Health
Replace with shared probes:
- AkkaClusterHealthCheck → ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck using the OtOpcUaCompat preset (self-Up-among-members scan → Healthy/Degraded). The preset keeps OtOpcUa's existing two-way policy without forcing ScadaBridge's three-way policy onto it.
- AdminRoleLeaderHealthCheck → ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck with RoleFilter = "admin". The role-filter parameter produces identical behavior: non-admin nodes immediately healthy, admin leader healthy, admin non-leader degraded.
- DatabaseHealthCheck → ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<OtOpcUaConfigDbContext> with a ProbeQuery delegate of db => db.Deployments.AsNoTracking().Take(1).ToListAsync(). The delegate preserves the stricter query probe rather than falling back to CanConnectAsync.
- Add GrpcDependencyHealthCheck targeting the MxAccessGateway channel (closes the downstream dependency gap noted in §4). Tag ["ready","active"].
- Replace AddOtOpcUaHealth/MapOtOpcUaHealth with services.AddZbHealthChecks() + app.MapZbHealth(). The /healthz bare-liveness tier is part of MapZbHealth by default; no separate wiring is needed.
Keep bespoke:
- IClusterRoleInfo and its Akka implementation: this is an OtOpcUa abstraction used for more than health checks; it should remain in the OtOpcUa codebase. The shared ActiveNodeHealthCheck will accept IClusterRoleInfo (or an equivalent cluster-info abstraction) as an injection point.
- The AllowAnonymous policy: this is an OtOpcUa auth concern; MapZbHealth must document that callers are responsible for applying AllowAnonymous (or the shared helper applies it by default).
- Which probes are registered and their tag assignments: the shared library supplies the check implementations; the wiring (which names, which tags, which options) remains per-project.
Adoption is a follow-on task (tracked in GAPS.md), not part of the ZB.MOM.WW.Health
library build. The library build delivers the shared implementations; adoption lands in the
OtOpcUa repo as a separate commit once the nupkg is available.