docs(health): current-state x3 + GAPS + README
Code-verified current-state docs for OtOpcUa (three-tier full), ScadaBridge (two-tier, no /healthz), and MxAccessGateway (bare liveness only / no probes). GAPS backlog with P1 for MxGateway and convergence items for Akka status policy, DB probe technique, and response writer. README with per-project status table.
@@ -0,0 +1,150 @@
# Health — current state: OtOpcUa

Repo: `~/Desktop/OtOpcUa`. Stack: .NET 10, Akka.NET, OPC UA; solution `ZB.MOM.WW.OtOpcUa.slnx`.
Health code lives in `src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/`. All paths are relative to the repo root.
Verified 2026-06-01.

Full three-tier pattern: `/health/ready`, `/health/active`, and `/healthz`. Three probes cover
the database, the Akka cluster, and the admin-role leader. All endpoints are `AllowAnonymous` to
permit Traefik and load-balancer probing without credentials.

## 1. Endpoint wiring

`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/HealthEndpoints.cs`:

- `:13`: the XML comment explicitly names this as "ScadaLink's three-tier pattern: `ready` = boot ok;
  `active` = fully serving traffic; `healthz` = bare process liveness."
- `:17`: `AddOtOpcUaHealth(IServiceCollection)` calls `services.AddHealthChecks()` and registers
  all three probes (lines 20–22):
  - `DatabaseHealthCheck`, name `"configdb"`, tags `["ready","active"]`
  - `AkkaClusterHealthCheck`, name `"akka"`, tags `["ready","active"]`
  - `AdminRoleLeaderHealthCheck`, name `"admin-leader"`, tags `["active"]` only
- `:28`: `MapOtOpcUaHealth(IEndpointRouteBuilder)` maps three endpoints (lines 33–44):
  - `/health/ready`: predicate `c => c.Tags.Contains("ready")`, `.AllowAnonymous()` (lines 33–36)
  - `/health/active`: predicate `c => c.Tags.Contains("active")`, `.AllowAnonymous()` (lines 37–40)
  - `/healthz`: predicate `_ => false` (no probes run; bare process liveness only), `.AllowAnonymous()` (lines 41–44)

`Program.cs`:

- `:137`: `builder.Services.AddOtOpcUaHealth()`
- `:159`: `app.MapOtOpcUaHealth()`

Response writer: the ASP.NET Core default, which writes the aggregate status as plain text
(`Healthy`, `Degraded`, or `Unhealthy`). No JSON writer such as `HealthChecks.UI.Client` is configured.
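The wiring described above can be sketched as a minimal `Program.cs`. This is an illustration, not the contents of `HealthEndpoints.cs`: the three checks are replaced by placeholder lambdas that always report `Healthy`, and only the names, tags, predicates, and `AllowAnonymous` calls are taken from the description above.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Placeholder checks (assumption): the real project registers its three
// IHealthCheck classes here instead of these always-healthy lambdas.
builder.Services.AddHealthChecks()
    .AddCheck("configdb", () => HealthCheckResult.Healthy(), tags: new[] { "ready", "active" })
    .AddCheck("akka", () => HealthCheckResult.Healthy(), tags: new[] { "ready", "active" })
    .AddCheck("admin-leader", () => HealthCheckResult.Healthy(), tags: new[] { "active" });

var app = builder.Build();

// Tier 1: readiness runs only the checks tagged "ready".
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = c => c.Tags.Contains("ready"),
}).AllowAnonymous();

// Tier 2: active-node status runs the checks tagged "active".
app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = c => c.Tags.Contains("active"),
}).AllowAnonymous();

// Tier 3: bare liveness. The predicate rejects every check, so the report
// is empty and the endpoint answers Healthy whenever the process responds.
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    Predicate = _ => false,
}).AllowAnonymous();

app.Run();
```

With this layout, adding a probe to a tier is a matter of adding a tag at registration; the endpoint mappings do not change.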
## 2. Probes

### DatabaseHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/DatabaseHealthCheck.cs`:

- `:9`: injects `IDbContextFactory<OtOpcUaConfigDbContext>`
- `:25–37`: opens a pooled context via `CreateDbContextAsync`, then runs
  `db.Deployments.AsNoTracking().Take(1).ToListAsync()`. If the query succeeds →
  `HealthCheckResult.Healthy("ConfigDb reachable")` (`:31`). If it throws →
  `HealthCheckResult.Unhealthy("ConfigDb unreachable", ex)` (`:35`). No `Degraded` path.

The probe exercises a real query (not just `CanConnectAsync`): it confirms the `Deployments` table
is readable, which implies the schema migration has run. This is **stricter** than ScadaBridge's
`CanConnectAsync` but more opaque about the failure reason, because a connectivity failure and a
missing table both surface under the same `ConfigDb unreachable` description.

Tags on registration: `["ready","active"]`. The database must be reachable for both readiness and
active-node determination.
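As a rough sketch of the query-probe technique described above (not the project's actual class), the check could look like the following. The `Deployment` entity and the `OtOpcUaConfigDbContext` body are hypothetical stand-ins, since neither is shown here, and passing the cancellation token through is an assumption.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-ins: the real entity and context are not shown in this document.
public sealed class Deployment
{
    public int Id { get; set; }
}

public sealed class OtOpcUaConfigDbContext : DbContext
{
    public OtOpcUaConfigDbContext(DbContextOptions<OtOpcUaConfigDbContext> options)
        : base(options) { }

    public DbSet<Deployment> Deployments => Set<Deployment>();
}

public sealed class DatabaseHealthCheckSketch : IHealthCheck
{
    private readonly IDbContextFactory<OtOpcUaConfigDbContext> _factory;

    public DatabaseHealthCheckSketch(IDbContextFactory<OtOpcUaConfigDbContext> factory)
        => _factory = factory;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            await using var db = await _factory.CreateDbContextAsync(cancellationToken);

            // A real read rather than a connection test: this also fails
            // when the Deployments table is missing.
            await db.Deployments.AsNoTracking().Take(1).ToListAsync(cancellationToken);

            return HealthCheckResult.Healthy("ConfigDb reachable");
        }
        catch (Exception ex)
        {
            // Every failure mode collapses into one Unhealthy result; the
            // exception is attached so the cause is still recoverable.
            return HealthCheckResult.Unhealthy("ConfigDb unreachable", ex);
        }
    }
}
```

Swapping the query line for `db.Database.CanConnectAsync(cancellationToken)` would give the looser connection-only behaviour that the text attributes to ScadaBridge.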
### AkkaClusterHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AkkaClusterHealthCheck.cs`:

- `:9`: injects `ActorSystem` directly
- `:27–33`: calls `Cluster.Get(_system)`, then scans `cluster.State.Members` for the member whose
  `Address == cluster.SelfAddress` and `Status == MemberStatus.Up`:
  - Found Up → `HealthCheckResult.Healthy($"Self Up; {cluster.State.Members.Count} member(s)")` (`:32`)
  - Not found → `HealthCheckResult.Degraded("Self not yet Up in cluster")` (`:33`)

No `Unhealthy` path: joining, leaving, and removed nodes are all reported as `Degraded`. This differs from
ScadaBridge's more granular three-way policy (see GAPS).

Tags on registration: `["ready","active"]`.
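The self-Up scan can be sketched as follows. This is an illustration of the two-way policy described above, not the project's source: the class name and the `Evaluate` helper are assumptions, the latter added only so the policy can be exercised without a running cluster.

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class AkkaClusterHealthCheckSketch : IHealthCheck
{
    private readonly ActorSystem _system;

    public AkkaClusterHealthCheckSketch(ActorSystem system) => _system = system;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Fully qualified to avoid the namespace/type name clash on "Cluster".
        var cluster = Akka.Cluster.Cluster.Get(_system);
        var members = cluster.State.Members;

        // Healthy only when this node appears in the member list with status Up.
        var selfUp = members.Any(m =>
            m.Address.Equals(cluster.SelfAddress) && m.Status == MemberStatus.Up);

        return Task.FromResult(Evaluate(selfUp, members.Count));
    }

    // Hypothetical helper: the pure two-way policy (Healthy or Degraded, never Unhealthy).
    public static HealthCheckResult Evaluate(bool selfUp, int memberCount) =>
        selfUp
            ? HealthCheckResult.Healthy($"Self Up; {memberCount} member(s)")
            : HealthCheckResult.Degraded("Self not yet Up in cluster");
}
```

Because every non-Up state falls into the same branch, a node that is leaving or has been removed is indistinguishable from one that is still joining.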
### AdminRoleLeaderHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AdminRoleLeaderHealthCheck.cs`:

- `:14`: injects `IClusterRoleInfo`
- `:27–38`: three-path logic:
  - Node does not carry the `"admin"` role → `Healthy("Node does not carry admin role")` (`:30`).
    Non-admin nodes are immediately healthy, so this probe never gates a non-admin node.
  - Admin role + node is the role leader → `Healthy($"Admin leader ({...})")` (`:36`)
  - Admin role + not the leader → `Degraded($"Admin member but not leader (leader=...)")` (`:37`)

Tags on registration: `["active"]` only; the probe does not participate in `/health/ready`. The intent is
Traefik routing: the active node (admin-role leader) gets sticky admin-UI traffic; standby nodes
are reachable for data-plane OPC UA but report `Degraded` on `/health/active` so the load balancer
does not route control-plane traffic to them. With the default status mapping (see §5), `Degraded`
still returns HTTP 200, so a check that looks only at the status code will not tell a standby from the
leader unless that mapping is overridden.

Note: there is no `Unhealthy` path for the role-filter case. If the ActorSystem is not running, `IClusterRoleInfo`
presumably returns safe defaults (no role); this is not separately health-checked.
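The three-path logic can be illustrated with a small sketch. The members of the real `IClusterRoleInfo` are not shown in this document, so the interface below (`IClusterRoleInfoSketch` and its three methods) is entirely hypothetical, as is the `FixedRoleInfo` test double; only the result messages and statuses follow the description above.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical shape: the real IClusterRoleInfo may expose different members.
public interface IClusterRoleInfoSketch
{
    bool SelfHasRole(string role);
    bool SelfIsRoleLeader(string role);
    string? RoleLeaderAddress(string role);
}

public sealed class AdminRoleLeaderHealthCheckSketch : IHealthCheck
{
    private const string AdminRole = "admin";
    private readonly IClusterRoleInfoSketch _roles;

    public AdminRoleLeaderHealthCheckSketch(IClusterRoleInfoSketch roles) => _roles = roles;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Path 1: nodes without the admin role are never gated by this probe.
        if (!_roles.SelfHasRole(AdminRole))
            return Task.FromResult(HealthCheckResult.Healthy("Node does not carry admin role"));

        var leader = _roles.RoleLeaderAddress(AdminRole) ?? "unknown";

        // Path 2: admin leader is Healthy. Path 3: admin standby is Degraded.
        var result = _roles.SelfIsRoleLeader(AdminRole)
            ? HealthCheckResult.Healthy($"Admin leader ({leader})")
            : HealthCheckResult.Degraded($"Admin member but not leader (leader={leader})");

        return Task.FromResult(result);
    }
}

// Fixed-answer test double (hypothetical), possible because the check depends
// on an interface rather than on the Akka cluster API directly.
public sealed record FixedRoleInfo(bool HasAdmin, bool IsLeader, string? Leader)
    : IClusterRoleInfoSketch
{
    public bool SelfHasRole(string role) => HasAdmin;
    public bool SelfIsRoleLeader(string role) => IsLeader;
    public string? RoleLeaderAddress(string role) => Leader;
}
```

For example, `new AdminRoleLeaderHealthCheckSketch(new FixedRoleInfo(true, false, "node-a"))` models an admin standby and should report `Degraded`.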
## 3. Tag / tier summary

| Probe | `/health/ready` | `/health/active` | `/healthz` |
|---|---|---|---|
| `DatabaseHealthCheck` | ✅ | ✅ | - |
| `AkkaClusterHealthCheck` | ✅ | ✅ | - |
| `AdminRoleLeaderHealthCheck` | - | ✅ | - |
| (no probes) | - | - | ✅ (bare liveness) |

`/healthz` runs zero probes: it is a pure process liveness sentinel (process reachable = healthy;
a crashed process = no response). Kubernetes liveness probes, Traefik TCP checks, and uptime
monitors use this tier.

## 4. Downstream dependency coverage

There is no probe for the upstream MxAccessGateway gRPC channel. If the gateway is unreachable, OtOpcUa
still reports healthy here (the GalaxyDriver will surface errors in OPC UA diagnostics, but `/health/ready`
and `/health/active` will not reflect it). This is a gap that the shared `GrpcDependencyHealthCheck`
probe in `ZB.MOM.WW.Health` would close.

## 5. Notable design choices

- **AllowAnonymous on all tiers**: see the `HealthEndpoints.cs:30–32` comment: "Without it the
  `AddOtOpcUaAuth` fallback policy 401s every probe and Traefik marks every backend unhealthy."
- **Query probe, not `CanConnectAsync`**: the `Deployments` query validates that the schema has
  been applied. ScadaBridge uses `CanConnectAsync`; neither is wrong but they diverge.
- **`Degraded` semantics**: the Akka check uses `Degraded` (not `Unhealthy`) for a joining/pre-Up
  node. ASP.NET Core maps `Degraded` to HTTP 200 by default; Traefik sees 200 and considers the
  node ready. If `Unhealthy` (HTTP 503) is required to gate traffic, the `Degraded` path is
  insufficient.
- **`IClusterRoleInfo` abstraction**: the admin-leader check depends on `IClusterRoleInfo`, an OtOpcUa
  interface, not the raw `Akka.Cluster.Cluster` API. This is a testability-friendly layer absent in
  ScadaBridge's direct Akka usage.

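One way to make `Degraded` gate traffic without changing any probe is to override the status-code mapping on the endpoint. The following is a sketch of that option, not something OtOpcUa currently does; the single placeholder check stands in for the real `"akka"` probe so the program is self-contained.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Placeholder (assumption): always reports Degraded, like a node that is not yet Up.
builder.Services.AddHealthChecks()
    .AddCheck("akka",
        () => HealthCheckResult.Degraded("Self not yet Up in cluster"),
        tags: new[] { "active" });

var app = builder.Build();

app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = c => c.Tags.Contains("active"),

    // By default both Healthy and Degraded map to 200. Mapping Degraded to 503
    // lets a status-code-only load balancer check treat it as "do not route".
    ResultStatusCodes =
    {
        [HealthStatus.Healthy] = StatusCodes.Status200OK,
        [HealthStatus.Degraded] = StatusCodes.Status503ServiceUnavailable,
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable,
    },
}).AllowAnonymous();

app.Run();
```

The mapping is per endpoint, so `/health/ready` could keep the default (Degraded = 200) while `/health/active` uses the stricter one.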

---

## Adoption plan → `ZB.MOM.WW.Health`

**Replace with shared probes:**

- `AkkaClusterHealthCheck` → `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` using the
  **`OtOpcUaCompat` preset** (self-Up-among-members scan → Healthy/Degraded). The preset keeps
  OtOpcUa's existing two-way policy without forcing ScadaBridge's three-way policy onto it.
- `AdminRoleLeaderHealthCheck` → `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with
  `RoleFilter = "admin"`. The role-filter parameter produces identical behavior: non-admin nodes
  immediately healthy, admin leader healthy, admin non-leader degraded.
- `DatabaseHealthCheck` → `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<OtOpcUaConfigDbContext>`
  with a `ProbeQuery` delegate of `db => db.Deployments.AsNoTracking().Take(1).ToListAsync()`.
  The delegate preserves the stricter query probe rather than falling back to `CanConnectAsync`.
- Add `GrpcDependencyHealthCheck` targeting the MxAccessGateway channel (closes the downstream
  dependency gap noted in §4). Tag `["ready","active"]`.
- Replace `AddOtOpcUaHealth` / `MapOtOpcUaHealth` with `services.AddZbHealthChecks()` +
  `app.MapZbHealth()`. The `/healthz` bare-liveness tier is part of `MapZbHealth` by default,
  so no separate wiring is needed.

**Keep bespoke:**

- `IClusterRoleInfo` and its Akka implementation: this is an OtOpcUa abstraction used for more
  than health checks, so it should remain in the OtOpcUa codebase. The shared `ActiveNodeHealthCheck`
  will accept `IClusterRoleInfo` (or an equivalent cluster-info abstraction) as an injection point.
- The `AllowAnonymous` policy: this is an OtOpcUa auth concern; `MapZbHealth` must document that
  callers are responsible for applying `AllowAnonymous` (or the shared helper applies it by default).
- Which probes are registered and their tag assignments: the shared library supplies the check
  implementations; the wiring (which names, which tags, which options) remains per-project.

**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Health`
library build. The library build delivers the shared implementations; adoption lands in the
OtOpcUa repo as a separate commit once the nupkg is available.