Joseph Doherty 3d25ee5090 docs(health): current-state x3 + GAPS + README
Code-verified current-state docs for OtOpcUa (three-tier full), ScadaBridge
(two-tier, no /healthz), and MxAccessGateway (bare liveness only / no probes).
GAPS backlog with P1 for MxGateway and convergence items for Akka status policy,
DB probe technique, and response writer. README with per-project status table.
2026-06-01 06:23:53 -04:00


Health — current state: OtOpcUa

Repo: ~/Desktop/OtOpcUa. Stack: .NET 10, Akka.NET, OPC UA; solution ZB.MOM.WW.OtOpcUa.slnx. Health code lives in src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/. All paths are relative to the repo root. Verified 2026-06-01.

Full three-tier pattern: /health/ready, /health/active, and /healthz. Three probes covering the database, the Akka cluster, and the admin-role leader. All endpoints are AllowAnonymous to permit Traefik and load-balancer probing without credentials.

1. Endpoint wiring

src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/HealthEndpoints.cs:

  • :13 — XML comment explicitly names this as "ScadaLink's three-tier pattern: ready = boot ok; active = fully serving traffic; healthz = bare process liveness."
  • :17 — AddOtOpcUaHealth(IServiceCollection) calls services.AddHealthChecks() and registers all three probes (lines 20–22):
    • DatabaseHealthCheck name "configdb", tags ["ready","active"]
    • AkkaClusterHealthCheck name "akka", tags ["ready","active"]
    • AdminRoleLeaderHealthCheck name "admin-leader", tags ["active"] only
  • :28 — MapOtOpcUaHealth(IEndpointRouteBuilder) maps three endpoints (lines 33–44):
    • /health/ready — predicate c => c.Tags.Contains("ready"), .AllowAnonymous() (lines 33–36)
    • /health/active — predicate c => c.Tags.Contains("active"), .AllowAnonymous() (lines 37–40)
    • /healthz — predicate _ => false (no probes run; bare process liveness only), .AllowAnonymous() (lines 41–44)

Program.cs:

  • :137 — builder.Services.AddOtOpcUaHealth()
  • :159 — app.MapOtOpcUaHealth()

Response writer: default ASP.NET Core plain-text/JSON (no HealthChecks.UI.Client).
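The wiring above can be sketched roughly as follows. This is a reconstruction from the names, tags, and predicates listed in this section, not a copy of HealthEndpoints.cs. The class name HealthEndpointsSketch is an illustrative assumption, and the three check classes are assumed to be the ones described in §2 and already in scope.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.DependencyInjection;

// Illustrative sketch only; the real code lives in Health/HealthEndpoints.cs.
// DatabaseHealthCheck, AkkaClusterHealthCheck and AdminRoleLeaderHealthCheck
// are assumed to be the IHealthCheck classes described in section 2.
public static class HealthEndpointsSketch
{
    public static IServiceCollection AddOtOpcUaHealth(this IServiceCollection services)
    {
        services.AddHealthChecks()
            .AddCheck<DatabaseHealthCheck>("configdb", tags: new[] { "ready", "active" })
            .AddCheck<AkkaClusterHealthCheck>("akka", tags: new[] { "ready", "active" })
            .AddCheck<AdminRoleLeaderHealthCheck>("admin-leader", tags: new[] { "active" });
        return services;
    }

    public static IEndpointRouteBuilder MapOtOpcUaHealth(this IEndpointRouteBuilder app)
    {
        // Readiness: only checks tagged "ready" run.
        app.MapHealthChecks("/health/ready", new HealthCheckOptions
        {
            Predicate = c => c.Tags.Contains("ready"),
        }).AllowAnonymous();

        // Active: only checks tagged "active" run.
        app.MapHealthChecks("/health/active", new HealthCheckOptions
        {
            Predicate = c => c.Tags.Contains("active"),
        }).AllowAnonymous();

        // Liveness: the predicate rejects every check, so an empty report
        // (Healthy) is returned whenever the process can answer at all.
        app.MapHealthChecks("/healthz", new HealthCheckOptions
        {
            Predicate = _ => false,
        }).AllowAnonymous();

        return app;
    }
}
```

Because every tier shares one registration list and differs only by tag predicate, moving a probe between tiers is a one-line tag change rather than a new endpoint.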

2. Probes

DatabaseHealthCheck

src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/DatabaseHealthCheck.cs:

  • :9 — injects IDbContextFactory<OtOpcUaConfigDbContext>
  • :25–37 — opens a pooled context via CreateDbContextAsync, runs db.Deployments.AsNoTracking().Take(1).ToListAsync(). If the query succeeds → HealthCheckResult.Healthy("ConfigDb reachable") (:31). If it throws → HealthCheckResult.Unhealthy("ConfigDb unreachable", ex) (:35). No Degraded path.

The probe exercises a real query (not just CanConnectAsync) — it confirms the Deployments table is readable, which implies the schema migration has run. This is stricter than ScadaBridge's CanConnectAsync but more opaque about the failure reason.

Tags on registration: ["ready","active"] — the database must be reachable for both readiness and active-node determination.
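A minimal sketch of the query-probe shape described above, under stated assumptions: the Deployment entity and the OtOpcUaConfigDbContext body are hypothetical stand-ins so the snippet compiles on its own, the class carries a Sketch suffix to mark it as illustrative, and whether the real code forwards the cancellation token is not stated in this document.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// HYPOTHETICAL stand-ins: the real entity and context are defined elsewhere in OtOpcUa.
public sealed class Deployment { public int Id { get; set; } }

public class OtOpcUaConfigDbContext : DbContext
{
    public OtOpcUaConfigDbContext(DbContextOptions<OtOpcUaConfigDbContext> options) : base(options) { }
    public DbSet<Deployment> Deployments => Set<Deployment>();
}

public sealed class DatabaseHealthCheckSketch : IHealthCheck
{
    private readonly IDbContextFactory<OtOpcUaConfigDbContext> _factory;

    public DatabaseHealthCheckSketch(IDbContextFactory<OtOpcUaConfigDbContext> factory) => _factory = factory;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            await using var db = await _factory.CreateDbContextAsync(cancellationToken);
            // A real read: fails if the table is missing, not only if the server is down.
            _ = await db.Deployments.AsNoTracking().Take(1).ToListAsync(cancellationToken);
            return HealthCheckResult.Healthy("ConfigDb reachable");
        }
        catch (Exception ex)
        {
            // Connection failures and schema failures both land here,
            // which is why the failure reason is opaque to the caller.
            return HealthCheckResult.Unhealthy("ConfigDb unreachable", ex);
        }
    }
}
```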

AkkaClusterHealthCheck

src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AkkaClusterHealthCheck.cs:

  • :9 — injects ActorSystem directly
  • :27–33 — calls Cluster.Get(_system), scans cluster.State.Members for the member whose Address == cluster.SelfAddress and Status == MemberStatus.Up:
    • Found Up → HealthCheckResult.Healthy($"Self Up; {cluster.State.Members.Count} member(s)") (:32)
    • Not found → HealthCheckResult.Degraded("Self not yet Up in cluster") (:33)

No Unhealthy path — joining/leaving/removed nodes are all reported as Degraded. This differs from ScadaBridge's more granular three-way policy (see GAPS).

Tags on registration: ["ready","active"].
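The two-way policy above could look roughly like this. It is a sketch built from the description in this section, not the file's actual contents; the Sketch suffix on the class name is an assumption added to keep it distinct from the real check.

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class AkkaClusterHealthCheckSketch : IHealthCheck
{
    private readonly ActorSystem _system;

    public AkkaClusterHealthCheckSketch(ActorSystem system) => _system = system;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Fully qualified to avoid any clash between the Akka.Cluster
        // namespace and the Cluster extension class.
        var cluster = Akka.Cluster.Cluster.Get(_system);
        var members = cluster.State.Members;

        // Healthy only when this node appears in the member list with status Up.
        var selfUp = members.Any(m =>
            m.Address.Equals(cluster.SelfAddress) && m.Status == MemberStatus.Up);

        // Every other state (joining, leaving, removed, not yet a member) is Degraded.
        return Task.FromResult(selfUp
            ? HealthCheckResult.Healthy($"Self Up; {members.Count} member(s)")
            : HealthCheckResult.Degraded("Self not yet Up in cluster"));
    }
}
```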

AdminRoleLeaderHealthCheck

src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AdminRoleLeaderHealthCheck.cs:

  • :14 — injects IClusterRoleInfo
  • :27–38 — three-path logic:
    • Node does not carry the "admin" role → Healthy("Node does not carry admin role") (:30) — non-admin nodes are immediately healthy, so this probe never gates a non-admin node.
    • Admin role + node is the role leader → Healthy($"Admin leader ({...})") (:36)
    • Admin role + not the leader → Degraded($"Admin member but not leader (leader=...)") (:37)

Tags on registration: ["active"] only — does not participate in /health/ready. The intent is Traefik routing: the active node (admin-role leader) gets sticky admin-UI traffic; standby nodes are reachable for data-plane OPC UA but report Degraded on /health/active so the load balancer does not route control-plane traffic to them.

Note: no Unhealthy path for the role-filter case. If the ActorSystem is not running, IClusterRoleInfo presumably returns safe defaults (no role); this is not separately health-checked.
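The three-path logic can be illustrated as below. The members of IClusterRoleInfo are not listed in this document, so the interface here (IClusterRoleInfoSketch with HasRole, IsRoleLeader, RoleLeaderAddress) is entirely hypothetical. The values interpolated into the messages are also placeholders, since the originals are elided above.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// HYPOTHETICAL shape: stands in for OtOpcUa's IClusterRoleInfo, whose real members are not shown here.
public interface IClusterRoleInfoSketch
{
    bool HasRole(string role);
    bool IsRoleLeader(string role);
    string? RoleLeaderAddress(string role);
}

public sealed class AdminRoleLeaderHealthCheckSketch : IHealthCheck
{
    private const string AdminRole = "admin";
    private readonly IClusterRoleInfoSketch _roles;

    public AdminRoleLeaderHealthCheckSketch(IClusterRoleInfoSketch roles) => _roles = roles;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Path 1: nodes without the admin role are never gated by this probe.
        if (!_roles.HasRole(AdminRole))
            return Task.FromResult(HealthCheckResult.Healthy("Node does not carry admin role"));

        // Placeholder detail; the real interpolated value is elided in this document.
        var leader = _roles.RoleLeaderAddress(AdminRole) ?? "none";

        // Path 2: admin role and leader. Path 3: admin role, standby.
        return Task.FromResult(_roles.IsRoleLeader(AdminRole)
            ? HealthCheckResult.Healthy($"Admin leader ({leader})")
            : HealthCheckResult.Degraded($"Admin member but not leader (leader={leader})"));
    }
}
```

Keeping the check behind an interface means each path can be exercised with a small fake, without starting an ActorSystem.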

3. Tag / tier summary

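| Probe | /health/ready | /health/active | /healthz |
|---|---|---|---|
| DatabaseHealthCheck | yes | yes | no |
| AkkaClusterHealthCheck | yes | yes | no |
| AdminRoleLeaderHealthCheck | no | yes | no |
| (no probes) | | | yes (bare liveness) |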

/healthz runs zero probes — it is a pure process liveness sentinel (process reachable = healthy; a crashed process = no response). Kubernetes liveness probes, Traefik TCP checks, and uptime monitors use this tier.

4. Downstream dependency coverage

No probe for the upstream MxAccessGateway gRPC channel. If the gateway is unreachable, OtOpcUa reports healthy here (the GalaxyDriver will surface errors in OPC UA diagnostics, but /health/ready and /health/active will not reflect it). This is a gap that the shared GrpcDependencyHealthCheck probe in ZB.MOM.WW.Health would close.

5. Notable design choices

  • AllowAnonymous on all tiers — see HealthEndpoints.cs:30–32 comment: "Without it the AddOtOpcUaAuth fallback policy 401s every probe and Traefik marks every backend unhealthy."
  • Query probe, not CanConnectAsync — the Deployments query validates that the schema has been applied. ScadaBridge uses CanConnectAsync; neither is wrong but they diverge.
  • Degraded semantics — the Akka check uses Degraded (not Unhealthy) for a joining/pre-Up node. ASP.NET Core maps Degraded to HTTP 200 by default; Traefik sees 200 and considers the node ready. If Unhealthy (HTTP 503) is required to gate traffic, the Degraded path is insufficient.
  • IClusterRoleInfo abstraction — the admin-leader check depends on IClusterRoleInfo, an OtOpcUa interface, not the raw Akka.Cluster.Cluster API. This is a testability-friendly layer absent in ScadaBridge's direct Akka usage.
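If Degraded ever needs to gate traffic, one option is to override the status-code mapping on the endpoint rather than change the probes. The sketch below assumes the default mapping is currently in effect (this document does not say whether ResultStatusCodes is overridden). The method name MapStrictActive is hypothetical; HealthCheckOptions.ResultStatusCodes is the standard ASP.NET Core setting.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public static class StrictActiveEndpointSketch
{
    // Hypothetical helper: maps /health/active so that Degraded answers 503 instead of 200.
    public static IEndpointConventionBuilder MapStrictActive(this IEndpointRouteBuilder app) =>
        app.MapHealthChecks("/health/active", new HealthCheckOptions
        {
            Predicate = c => c.Tags.Contains("active"),
            ResultStatusCodes =
            {
                [HealthStatus.Healthy] = StatusCodes.Status200OK,
                // Default is 200; 503 lets an HTTP-status-based load balancer skip the node.
                [HealthStatus.Degraded] = StatusCodes.Status503ServiceUnavailable,
                [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable,
            },
        }).AllowAnonymous();
}
```

The mapping is per endpoint, so /health/ready could keep treating Degraded as 200 while /health/active treats it as 503.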

Adoption plan → ZB.MOM.WW.Health

Replace with shared probes:

  • AkkaClusterHealthCheck → ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck using the OtOpcUaCompat preset (self-Up-among-members scan → Healthy/Degraded). The preset keeps OtOpcUa's existing two-way policy without forcing ScadaBridge's three-way policy onto it.
  • AdminRoleLeaderHealthCheck → ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck with RoleFilter = "admin". The role-filter parameter produces identical behavior: non-admin nodes immediately healthy, admin leader healthy, admin non-leader degraded.
  • DatabaseHealthCheck → ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<OtOpcUaConfigDbContext> with a ProbeQuery delegate of db => db.Deployments.AsNoTracking().Take(1).ToListAsync(). The delegate preserves the stricter query probe rather than falling back to CanConnectAsync.
  • Add GrpcDependencyHealthCheck targeting the MxAccessGateway channel (closes the downstream dependency gap noted in §4). Tag ["ready","active"].
  • Replace AddOtOpcUaHealth / MapOtOpcUaHealth with services.AddZbHealthChecks() + app.MapZbHealth(). The /healthz bare-liveness tier is part of MapZbHealth by default — no separate wiring needed.

Keep bespoke:

  • IClusterRoleInfo and its Akka implementation — this is an OtOpcUa abstraction used for more than health checks; it should remain in the OtOpcUa codebase. The shared ActiveNodeHealthCheck will accept IClusterRoleInfo (or an equivalent cluster-info abstraction) as an injection point.
  • The AllowAnonymous policy — this is an OtOpcUa auth concern; MapZbHealth must document that callers are responsible for applying AllowAnonymous (or the shared helper applies it by default).
  • Which probes are registered and their tag assignments — the shared library supplies the check implementations; the wiring (which names, which tags, which options) remains per-project.

Adoption is a follow-on task (tracked in GAPS.md), not part of the ZB.MOM.WW.Health library build. The library build delivers the shared implementations; adoption lands in the OtOpcUa repo as a separate commit once the nupkg is available.