docs(health): current-state x3 + GAPS + README

Code-verified current-state docs for OtOpcUa (three-tier full), ScadaBridge
(two-tier, no /healthz), and MxAccessGateway (bare liveness only / no probes).
GAPS backlog with P1 for MxGateway and convergence items for Akka status policy,
DB probe technique, and response writer. README with per-project status table.
Joseph Doherty
2026-06-01 06:23:53 -04:00
parent 1dc35a8c43
commit 3d25ee5090
5 changed files with 698 additions and 0 deletions
@@ -0,0 +1,150 @@
# Health — current state: OtOpcUa
Repo: `~/Desktop/OtOpcUa`. Stack: .NET 10, Akka.NET, OPC UA; solution `ZB.MOM.WW.OtOpcUa.slnx`.
Health code lives in `src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/`. All paths relative to repo root.
Verified 2026-06-01.
Full three-tier pattern: `/health/ready`, `/health/active`, and `/healthz`. Three probes cover
the database, the Akka cluster, and the admin-role leader. All endpoints are `AllowAnonymous` to
permit Traefik and load-balancer probing without credentials.
## 1. Endpoint wiring
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/HealthEndpoints.cs`:
- `:13` — XML comment explicitly names this as "ScadaLink's three-tier pattern: `ready` = boot ok;
`active` = fully serving traffic; `healthz` = bare process liveness."
- `:17` — `AddOtOpcUaHealth(IServiceCollection)` calls `services.AddHealthChecks()` and registers
all three probes (lines 20-22):
- `DatabaseHealthCheck` name `"configdb"`, tags `["ready","active"]`
- `AkkaClusterHealthCheck` name `"akka"`, tags `["ready","active"]`
- `AdminRoleLeaderHealthCheck` name `"admin-leader"`, tags `["active"]` only
- `:28` — `MapOtOpcUaHealth(IEndpointRouteBuilder)` maps three endpoints (lines 33-44):
- `/health/ready` — predicate `c => c.Tags.Contains("ready")`, `.AllowAnonymous()` (lines 33-36)
- `/health/active` — predicate `c => c.Tags.Contains("active")`, `.AllowAnonymous()` (lines 37-40)
- `/healthz` — predicate `_ => false` (no probes run; bare process liveness only), `.AllowAnonymous()` (lines 41-44)
`Program.cs`:
- `:137` — `builder.Services.AddOtOpcUaHealth()`
- `:159` — `app.MapOtOpcUaHealth()`
Response writer: ASP.NET Core default, which writes the aggregate status as plain text (no `HealthChecks.UI.Client` JSON writer).
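The wiring described in this section can be sketched as follows. This is a minimal sketch, not the repository's code: `HealthWiringSketch`, `AddSketchHealth`, and `MapSketchHealth` are hypothetical names, and the lambda checks are placeholders for the three real probe classes. Only the registration names, tags, predicates, and `AllowAnonymous` calls come from the description above.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch of the three-tier wiring; placeholder checks always report Healthy.
public static class HealthWiringSketch
{
    public static IServiceCollection AddSketchHealth(this IServiceCollection services)
    {
        services.AddHealthChecks()
            // Database and cluster probes gate both readiness and active-node status.
            .AddCheck("configdb", () => HealthCheckResult.Healthy(), tags: new[] { "ready", "active" })
            .AddCheck("akka", () => HealthCheckResult.Healthy(), tags: new[] { "ready", "active" })
            // The leader probe only participates in the active tier.
            .AddCheck("admin-leader", () => HealthCheckResult.Healthy(), tags: new[] { "active" });
        return services;
    }

    public static IEndpointRouteBuilder MapSketchHealth(this IEndpointRouteBuilder endpoints)
    {
        // Each tier filters registrations by tag.
        endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
        {
            Predicate = c => c.Tags.Contains("ready")
        }).AllowAnonymous();

        endpoints.MapHealthChecks("/health/active", new HealthCheckOptions
        {
            Predicate = c => c.Tags.Contains("active")
        }).AllowAnonymous();

        // Predicate rejects every registration, so no probe runs: bare liveness.
        endpoints.MapHealthChecks("/healthz", new HealthCheckOptions
        {
            Predicate = _ => false
        }).AllowAnonymous();

        return endpoints;
    }
}
```

Because `/healthz` filters out every registration, its report is empty and the aggregate status is `Healthy` whenever the process can answer the request.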
## 2. Probes
### DatabaseHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/DatabaseHealthCheck.cs`:
- `:9` — injects `IDbContextFactory<OtOpcUaConfigDbContext>`
- `:25-37` — opens a pooled context via `CreateDbContextAsync`, runs
`db.Deployments.AsNoTracking().Take(1).ToListAsync()`. If the query succeeds →
`HealthCheckResult.Healthy("ConfigDb reachable")` (`:31`). If it throws →
`HealthCheckResult.Unhealthy("ConfigDb unreachable", ex)` (`:35`). No `Degraded` path.
The probe exercises a real query (not just `CanConnectAsync`) — it confirms the `Deployments` table
is readable, which implies the schema migration has run. This is **stricter** than ScadaBridge's
`CanConnectAsync` but more opaque about the failure reason.
Tags on registration: `["ready","active"]` — the database must be reachable for both readiness and
active-node determination.
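The success/failure mapping described above can be sketched without the EF Core dependency. This is an illustrative sketch under assumptions: `QueryProbeHealthCheck` is a hypothetical class, and the probe delegate stands in for the `Deployments` query that the real check runs through `IDbContextFactory<OtOpcUaConfigDbContext>`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch: any exception from the probe maps to Unhealthy; there is no Degraded path.
public sealed class QueryProbeHealthCheck : IHealthCheck
{
    private readonly Func<CancellationToken, Task> _probe;

    public QueryProbeHealthCheck(Func<CancellationToken, Task> probe) => _probe = probe;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // In the real check this step would open a context and run
            // db.Deployments.AsNoTracking().Take(1).ToListAsync().
            await _probe(cancellationToken);
            return HealthCheckResult.Healthy("ConfigDb reachable");
        }
        catch (Exception ex)
        {
            // The exception is attached to the result, but the status alone does not
            // distinguish "server down" from "schema missing".
            return HealthCheckResult.Unhealthy("ConfigDb unreachable", ex);
        }
    }
}
```

Passing the query as a delegate is a choice made for this sketch only; it keeps the status mapping testable without a database provider.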
### AkkaClusterHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AkkaClusterHealthCheck.cs`:
- `:9` — injects `ActorSystem` directly
- `:27-33` — calls `Cluster.Get(_system)`, scans `cluster.State.Members` for the member whose
`Address == cluster.SelfAddress` and `Status == MemberStatus.Up`:
- Found Up → `HealthCheckResult.Healthy($"Self Up; {cluster.State.Members.Count} member(s)")` (`:32`)
- Not found → `HealthCheckResult.Degraded("Self not yet Up in cluster")` (`:33`)
No `Unhealthy` path — joining/leaving/removed nodes are all reported as `Degraded`. This differs from
ScadaBridge's more granular three-way policy (see GAPS).
Tags on registration: `["ready","active"]`.
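The two-way policy above can be sketched as a member scan plus a pure decision function. This is a sketch under assumptions, not the repository's code: `SelfUpHealthCheckSketch` and `Evaluate` are hypothetical names, and splitting the decision into a static method is done here only so the policy can be exercised without a running cluster.

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch of the self-Up scan; only Healthy and Degraded are ever returned.
public sealed class SelfUpHealthCheckSketch : IHealthCheck
{
    private readonly ActorSystem _system;

    public SelfUpHealthCheckSketch(ActorSystem system) => _system = system;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var cluster = Akka.Cluster.Cluster.Get(_system);
        var members = cluster.State.Members;

        // Self counts as Up only if it appears in the member set with status Up.
        var selfUp = members.Any(m =>
            m.Address.Equals(cluster.SelfAddress) && m.Status == MemberStatus.Up);

        return Task.FromResult(Evaluate(selfUp, members.Count));
    }

    // Pure policy: every non-Up state (joining, leaving, absent) collapses to Degraded.
    public static HealthCheckResult Evaluate(bool selfUp, int memberCount) =>
        selfUp
            ? HealthCheckResult.Healthy($"Self Up; {memberCount} member(s)")
            : HealthCheckResult.Degraded("Self not yet Up in cluster");
}
```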
### AdminRoleLeaderHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AdminRoleLeaderHealthCheck.cs`:
- `:14` — injects `IClusterRoleInfo`
- `:27-38` — three-path logic:
- Node does not carry the `"admin"` role → `Healthy("Node does not carry admin role")` (`:30`) —
non-admin nodes are immediately healthy, so this probe never gates a non-admin node.
- Admin role + node is the role leader → `Healthy($"Admin leader ({...})")` (`:36`)
- Admin role + not the leader → `Degraded($"Admin member but not leader (leader=...)")` (`:37`)
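Tags on registration: `["active"]` only — does not participate in `/health/ready`. The intent is
Traefik routing: the active node (admin-role leader) gets sticky admin-UI traffic; standby nodes
are reachable for data-plane OPC UA but report `Degraded` on `/health/active` so the load balancer
does not route control-plane traffic to them. Because ASP.NET Core maps `Degraded` to HTTP 200 by
default (see §5), this intent only holds if the status-code mapping is overridden or the load
balancer inspects the response body rather than the status code.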
Note: no `Unhealthy` path for the role-filter case. If the ActorSystem is not running, `IClusterRoleInfo`
presumably returns safe defaults (no role); this is not separately health-checked.
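The three-path logic above can be sketched as follows. The members of the real `IClusterRoleInfo` are not shown in this document, so `IRoleInfoSketch`, its two methods, and `FixedRoleInfo` are hypothetical stand-ins. The real descriptions also interpolate details that are elided above; this sketch leaves them out rather than guessing.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-in for IClusterRoleInfo.
public interface IRoleInfoSketch
{
    bool HasRole(string role);
    bool IsRoleLeader(string role);
}

public sealed class AdminLeaderCheckSketch : IHealthCheck
{
    private const string AdminRole = "admin";
    private readonly IRoleInfoSketch _roles;

    public AdminLeaderCheckSketch(IRoleInfoSketch roles) => _roles = roles;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
        => Task.FromResult(Evaluate(_roles));

    public static HealthCheckResult Evaluate(IRoleInfoSketch roles)
    {
        // Path 1: non-admin nodes are never gated by this probe.
        if (!roles.HasRole(AdminRole))
            return HealthCheckResult.Healthy("Node does not carry admin role");

        // Path 2 and 3: admin nodes are Healthy only while they lead the role.
        return roles.IsRoleLeader(AdminRole)
            ? HealthCheckResult.Healthy("Admin leader")
            : HealthCheckResult.Degraded("Admin member but not leader");
    }
}

// Fixed-answer fake, useful for unit tests of the policy.
public sealed record FixedRoleInfo(bool Admin, bool Leader) : IRoleInfoSketch
{
    public bool HasRole(string role) => role == "admin" && Admin;
    public bool IsRoleLeader(string role) => role == "admin" && Leader;
}
```

For example, `AdminLeaderCheckSketch.Evaluate(new FixedRoleInfo(Admin: true, Leader: false))` yields a `Degraded` result, while a non-admin node yields `Healthy` regardless of leadership.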
## 3. Tag / tier summary
| Probe | `/health/ready` | `/health/active` | `/healthz` |
|---|---|---|---|
| `DatabaseHealthCheck` | ✅ | ✅ | — |
| `AkkaClusterHealthCheck` | ✅ | ✅ | — |
| `AdminRoleLeaderHealthCheck` | — | ✅ | — |
| (no probes) | — | — | ✅ (bare liveness) |
`/healthz` runs zero probes — it is a pure process liveness sentinel (process reachable = healthy;
a crashed process = no response). Kubernetes liveness probes, Traefik TCP checks, and uptime
monitors use this tier.
## 4. Downstream dependency coverage
No probe for the upstream MxAccessGateway gRPC channel. If the gateway is unreachable, OtOpcUa
reports healthy here (the GalaxyDriver will surface errors in OPC UA diagnostics, but `/health/ready`
and `/health/active` will not reflect it). This is a gap that the shared `GrpcDependencyHealthCheck`
probe in `ZB.MOM.WW.Health` would close.
## 5. Notable design choices
- **AllowAnonymous on all tiers** — see `HealthEndpoints.cs:30-32` comment: "Without it the
`AddOtOpcUaAuth` fallback policy 401s every probe and Traefik marks every backend unhealthy."
- **Query probe, not `CanConnectAsync`** — the `Deployments` query validates that the schema has
been applied. ScadaBridge uses `CanConnectAsync`; neither is wrong but they diverge.
- **`Degraded` semantics** — the Akka check uses `Degraded` (not `Unhealthy`) for a joining/pre-Up
node. ASP.NET Core maps `Degraded` to HTTP 200 by default; Traefik sees 200 and considers the
node ready. If `Unhealthy` (HTTP 503) is required to gate traffic, the `Degraded` path is
insufficient.
- **`IClusterRoleInfo` abstraction** — the admin-leader check depends on `IClusterRoleInfo`, an OtOpcUa
interface, not the raw `Akka.Cluster.Cluster` API. This is a testability-friendly layer absent in
ScadaBridge's direct Akka usage.
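If a `Degraded` result ever needs to take a node out of rotation, one option is to override the status-code mapping on the endpoint. The sketch below uses the standard `HealthCheckOptions.ResultStatusCodes` property; `StrictActiveTierSketch`, `StrictOptions`, and `MapStrictActive` are hypothetical names, and whether OtOpcUa should adopt this mapping is a decision this document leaves open.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch: make Degraded visible to status-code-only probes.
public static class StrictActiveTierSketch
{
    public static HealthCheckOptions StrictOptions(string tag) => new HealthCheckOptions
    {
        Predicate = c => c.Tags.Contains(tag),
        ResultStatusCodes =
        {
            [HealthStatus.Healthy] = StatusCodes.Status200OK,
            // Default is 200; 503 lets a status-code-only probe see the difference.
            [HealthStatus.Degraded] = StatusCodes.Status503ServiceUnavailable,
            [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable,
        },
    };

    public static IEndpointConventionBuilder MapStrictActive(this IEndpointRouteBuilder endpoints) =>
        endpoints.MapHealthChecks("/health/active", StrictOptions("active")).AllowAnonymous();
}
```

Applying this only to `/health/active` would keep `/health/ready` tolerant of a node that is still joining while letting the active tier exclude standby admin nodes.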
---
## Adoption plan → `ZB.MOM.WW.Health`
**Replace with shared probes:**
- `AkkaClusterHealthCheck` → `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` using the
**`OtOpcUaCompat` preset** (self-Up-among-members scan → Healthy/Degraded). The preset keeps
OtOpcUa's existing two-way policy without forcing ScadaBridge's three-way policy onto it.
- `AdminRoleLeaderHealthCheck` → `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with
`RoleFilter = "admin"`. The role-filter parameter produces identical behavior: non-admin nodes
immediately healthy, admin leader healthy, admin non-leader degraded.
- `DatabaseHealthCheck` → `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<OtOpcUaConfigDbContext>`
with a `ProbeQuery` delegate of `db => db.Deployments.AsNoTracking().Take(1).ToListAsync()`.
The delegate preserves the stricter query probe rather than falling back to `CanConnectAsync`.
- Add `GrpcDependencyHealthCheck` targeting the MxAccessGateway channel (closes the downstream
dependency gap noted in §4). Tag `["ready","active"]`.
- Replace `AddOtOpcUaHealth` / `MapOtOpcUaHealth` with `services.AddZbHealthChecks()` +
`app.MapZbHealth()`. The `/healthz` bare-liveness tier is part of `MapZbHealth` by default —
no separate wiring needed.
**Keep bespoke:**
- `IClusterRoleInfo` and its Akka implementation — this is an OtOpcUa abstraction used for more
than health checks; it should remain in the OtOpcUa codebase. The shared `ActiveNodeHealthCheck`
will accept `IClusterRoleInfo` (or an equivalent cluster-info abstraction) as an injection point.
- The `AllowAnonymous` policy — this is an OtOpcUa auth concern; `MapZbHealth` must document that
callers are responsible for applying `AllowAnonymous` (or the shared helper applies it by default).
- Which probes are registered and their tag assignments — the shared library supplies the check
implementations; the wiring (which names, which tags, which options) remains per-project.
**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Health`
library build. The library build delivers the shared implementations; adoption lands in the
OtOpcUa repo as a separate commit once the nupkg is available.