# Health — current state: OtOpcUa

Repo: `~/Desktop/OtOpcUa`. Stack: .NET 10, Akka.NET, OPC UA; solution `ZB.MOM.WW.OtOpcUa.slnx`.
Health code lives in `src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/`. All paths relative to repo root.
Verified 2026-06-01.

Full three-tier pattern: `/health/ready`, `/health/active`, and `/healthz`. Three probes covering
the database, the Akka cluster, and the admin-role leader. All endpoints are `AllowAnonymous` to
permit Traefik and load-balancer probing without credentials.

## 1. Endpoint wiring

`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/HealthEndpoints.cs`:

- `:13`: XML comment explicitly names this as "ScadaLink's three-tier pattern: `ready` = boot ok;
  `active` = fully serving traffic; `healthz` = bare process liveness."
- `:17`: `AddOtOpcUaHealth(IServiceCollection)` calls `services.AddHealthChecks()` and registers
  all three probes (lines 20–22):
  - `DatabaseHealthCheck` name `"configdb"`, tags `["ready","active"]`
  - `AkkaClusterHealthCheck` name `"akka"`, tags `["ready","active"]`
  - `AdminRoleLeaderHealthCheck` name `"admin-leader"`, tags `["active"]` only
- `:28`: `MapOtOpcUaHealth(IEndpointRouteBuilder)` maps three endpoints (lines 33–44):
  - `/health/ready`: predicate `c => c.Tags.Contains("ready")`, `.AllowAnonymous()` (lines 33–36)
  - `/health/active`: predicate `c => c.Tags.Contains("active")`, `.AllowAnonymous()` (lines 37–40)
  - `/healthz`: predicate `_ => false` (no probes run; bare process liveness only), `.AllowAnonymous()` (lines 41–44)

`Program.cs`:

- `:137`: `builder.Services.AddOtOpcUaHealth()`
- `:159`: `app.MapOtOpcUaHealth()`

Response writer: the ASP.NET Core default, which writes only the aggregate status as plain text
(no `HealthChecks.UI.Client` JSON writer).

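
The registration and mapping described above can be sketched as a minimal ASP.NET Core program. This is an illustration, not the contents of `HealthEndpoints.cs`: the three probes are replaced by stub delegate checks that always return `Healthy`, and only the names, tags, and predicates are taken from this document.

```csharp
// Sketch only: stub delegate checks stand in for the three real probe classes.
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// The tags decide which tier runs each probe.
builder.Services.AddHealthChecks()
    .AddCheck("configdb", () => HealthCheckResult.Healthy("stub"), tags: new[] { "ready", "active" })
    .AddCheck("akka", () => HealthCheckResult.Healthy("stub"), tags: new[] { "ready", "active" })
    .AddCheck("admin-leader", () => HealthCheckResult.Healthy("stub"), tags: new[] { "active" });

var app = builder.Build();

// Readiness tier: only probes tagged "ready".
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = c => c.Tags.Contains("ready"),
}).AllowAnonymous();

// Active-node tier: only probes tagged "active".
app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = c => c.Tags.Contains("active"),
}).AllowAnonymous();

// Liveness tier: the predicate rejects every probe, so the endpoint
// answers Healthy whenever the process is able to respond at all.
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    Predicate = _ => false,
}).AllowAnonymous();

app.Run();
```

Keeping tier membership in tags means a probe can join or leave a tier by changing its registration only; the endpoint mapping stays untouched.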
## 2. Probes

### DatabaseHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/DatabaseHealthCheck.cs`:

- `:9`: injects `IDbContextFactory<OtOpcUaConfigDbContext>`
- `:25–37`: opens a pooled context via `CreateDbContextAsync`, runs
  `db.Deployments.AsNoTracking().Take(1).ToListAsync()`. If the query succeeds →
  `HealthCheckResult.Healthy("ConfigDb reachable")` (`:31`). If it throws →
  `HealthCheckResult.Unhealthy("ConfigDb unreachable", ex)` (`:35`). No `Degraded` path.

The probe exercises a real query (not just `CanConnectAsync`); it confirms the `Deployments` table
is readable, which implies the schema migration has run. This is **stricter** than ScadaBridge's
`CanConnectAsync` but more opaque about the failure reason.

Tags on registration: `["ready","active"]`. The database must be reachable for both readiness and
active-node determination.

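
A minimal sketch of the query-probe technique follows, assuming EF Core 6 or later. `Deployment`, `ConfigDbContextSketch`, and `DatabaseHealthCheckSketch` are hypothetical stand-ins for the real entity, `OtOpcUaConfigDbContext`, and the real check; passing the cancellation token to the query is also an assumption.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-ins for the real entity and OtOpcUaConfigDbContext.
public sealed class Deployment
{
    public int Id { get; set; }
}

public sealed class ConfigDbContextSketch : DbContext
{
    public ConfigDbContextSketch(DbContextOptions<ConfigDbContextSketch> options)
        : base(options) { }

    public DbSet<Deployment> Deployments => Set<Deployment>();
}

public sealed class DatabaseHealthCheckSketch : IHealthCheck
{
    private readonly IDbContextFactory<ConfigDbContextSketch> _factory;

    public DatabaseHealthCheckSketch(IDbContextFactory<ConfigDbContextSketch> factory)
        => _factory = factory;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            await using var db = await _factory.CreateDbContextAsync(cancellationToken);

            // One-row read: throws if the server is unreachable or the table is missing.
            await db.Deployments.AsNoTracking().Take(1).ToListAsync(cancellationToken);

            return HealthCheckResult.Healthy("ConfigDb reachable");
        }
        catch (Exception ex)
        {
            // Connection failures and schema failures land in the same branch,
            // which is why the failure reason is opaque to the caller.
            return HealthCheckResult.Unhealthy("ConfigDb unreachable", ex);
        }
    }
}
```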
### AkkaClusterHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AkkaClusterHealthCheck.cs`:

- `:9`: injects `ActorSystem` directly
- `:27–33`: calls `Cluster.Get(_system)`, scans `cluster.State.Members` for the member whose
  `Address == cluster.SelfAddress` and `Status == MemberStatus.Up`:
  - Found Up → `HealthCheckResult.Healthy($"Self Up; {cluster.State.Members.Count} member(s)")` (`:32`)
  - Not found → `HealthCheckResult.Degraded("Self not yet Up in cluster")` (`:33`)

No `Unhealthy` path: joining/leaving/removed nodes are all reported as `Degraded`. This differs from
ScadaBridge's more granular three-way policy (see GAPS).

Tags on registration: `["ready","active"]`.

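
The self-Up scan can be sketched as below. The class name `AkkaSelfUpCheckSketch` is hypothetical, and the sketch assumes the injected `ActorSystem` is configured with the cluster actor-ref provider (otherwise `Cluster.Get` throws). It uses `Equals` for the address comparison; the real file may be written differently.

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class AkkaSelfUpCheckSketch : IHealthCheck
{
    private readonly ActorSystem _system;

    public AkkaSelfUpCheckSketch(ActorSystem system) => _system = system;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var cluster = Cluster.Get(_system);
        var members = cluster.State.Members;

        // Healthy only when this node appears in the member list with status Up.
        var selfUp = members.Any(m =>
            m.Address.Equals(cluster.SelfAddress) && m.Status == MemberStatus.Up);

        // Two-way policy: every other state (joining, leaving, removed) is Degraded.
        var result = selfUp
            ? HealthCheckResult.Healthy($"Self Up; {members.Count} member(s)")
            : HealthCheckResult.Degraded("Self not yet Up in cluster");

        return Task.FromResult(result);
    }
}
```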
### AdminRoleLeaderHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AdminRoleLeaderHealthCheck.cs`:

- `:14`: injects `IClusterRoleInfo`
- `:27–38`: three-path logic:
  - Node does not carry the `"admin"` role → `Healthy("Node does not carry admin role")` (`:30`).
    Non-admin nodes are immediately healthy, so this probe never gates a non-admin node.
  - Admin role + node is the role leader → `Healthy($"Admin leader ({...})")` (`:36`)
  - Admin role + not the leader → `Degraded($"Admin member but not leader (leader=...)")` (`:37`)

Tags on registration: `["active"]` only, so the check does not participate in `/health/ready`. The
intent is Traefik routing: the active node (admin-role leader) gets sticky admin-UI traffic; standby
nodes are reachable for data-plane OPC UA but report `Degraded` on `/health/active` so the load
balancer does not route control-plane traffic to them. With the default status mapping noted in §5
(`Degraded` → HTTP 200), this gating only takes effect if the load balancer distinguishes `Degraded`
from `Healthy`.

Note: no `Unhealthy` path for the role-filter case. If the ActorSystem is not running, `IClusterRoleInfo`
presumably returns safe defaults (no role); this is not separately health-checked.

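
The three-path logic can be sketched as follows. The members of the real `IClusterRoleInfo` are not shown in this document, so `IClusterRoleInfoSketch` and its three methods are hypothetical, as is the exact wording of the leader messages (the source elides them).

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical shape; the real IClusterRoleInfo members are not documented here.
public interface IClusterRoleInfoSketch
{
    bool HasRole(string role);
    bool IsRoleLeader(string role);
    string? RoleLeaderAddress(string role);
}

public sealed class AdminRoleLeaderCheckSketch : IHealthCheck
{
    private const string AdminRole = "admin";
    private readonly IClusterRoleInfoSketch _roles;

    public AdminRoleLeaderCheckSketch(IClusterRoleInfoSketch roles) => _roles = roles;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        HealthCheckResult result;

        if (!_roles.HasRole(AdminRole))
        {
            // Path 1: non-admin nodes are never gated by this probe.
            result = HealthCheckResult.Healthy("Node does not carry admin role");
        }
        else if (_roles.IsRoleLeader(AdminRole))
        {
            // Path 2: this node is the admin-role leader (message text is illustrative).
            result = HealthCheckResult.Healthy("Admin leader");
        }
        else
        {
            // Path 3: admin member on standby (message text is illustrative).
            var leader = _roles.RoleLeaderAddress(AdminRole) ?? "unknown";
            result = HealthCheckResult.Degraded($"Admin member but not leader (leader={leader})");
        }

        return Task.FromResult(result);
    }
}
```

Depending on an interface rather than the raw cluster API lets a test supply a fake implementation for each of the three paths without starting an actor system.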
## 3. Tag / tier summary

| Probe | `/health/ready` | `/health/active` | `/healthz` |
|---|---|---|---|
| `DatabaseHealthCheck` | ✅ | ✅ | - |
| `AkkaClusterHealthCheck` | ✅ | ✅ | - |
| `AdminRoleLeaderHealthCheck` | - | ✅ | - |
| (no probes) | - | - | ✅ (bare liveness) |

`/healthz` runs zero probes: it is a pure process liveness sentinel (process reachable = healthy;
a crashed process = no response). Kubernetes liveness probes, Traefik TCP checks, and uptime
monitors use this tier.

## 4. Downstream dependency coverage

No probe for the upstream MxAccessGateway gRPC channel. If the gateway is unreachable, OtOpcUa
reports healthy here (the GalaxyDriver will surface errors in OPC UA diagnostics, but `/health/ready`
and `/health/active` will not reflect it). This is a gap that the shared `GrpcDependencyHealthCheck`
probe in `ZB.MOM.WW.Health` would close.

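
One possible shape for such a probe is sketched below. It rests on two assumptions that this document does not confirm: that the gateway exposes the standard `grpc.health.v1` health service, and that the `Grpc.Net.Client` and `Grpc.HealthCheck` packages are available. The class name, the two-second deadline, and the result messages are hypothetical and do not describe the planned `GrpcDependencyHealthCheck`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Core;
using Grpc.Health.V1;
using Grpc.Net.Client;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class GrpcDependencyCheckSketch : IHealthCheck
{
    private readonly GrpcChannel _channel;

    public GrpcDependencyCheckSketch(GrpcChannel channel) => _channel = channel;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            var client = new Health.HealthClient(_channel);

            // An empty service name asks for the overall server status.
            var reply = await client.CheckAsync(
                new HealthCheckRequest { Service = "" },
                deadline: DateTime.UtcNow.AddSeconds(2), // assumed timeout
                cancellationToken: cancellationToken);

            return reply.Status == HealthCheckResponse.Types.ServingStatus.Serving
                ? HealthCheckResult.Healthy("Gateway serving")
                : HealthCheckResult.Unhealthy($"Gateway status: {reply.Status}");
        }
        catch (RpcException ex)
        {
            // Covers an unreachable channel, an expired deadline, and an unimplemented service.
            return HealthCheckResult.Unhealthy("Gateway unreachable", ex);
        }
    }
}
```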
## 5. Notable design choices

- **AllowAnonymous on all tiers**: per the comment at `HealthEndpoints.cs:30–32`, "Without it the
  `AddOtOpcUaAuth` fallback policy 401s every probe and Traefik marks every backend unhealthy."
- **Query probe, not `CanConnectAsync`**: the `Deployments` query validates that the schema has
  been applied. ScadaBridge uses `CanConnectAsync`; neither is wrong but they diverge.
- **`Degraded` semantics**: the Akka check uses `Degraded` (not `Unhealthy`) for a joining/pre-Up
  node. ASP.NET Core maps `Degraded` to HTTP 200 by default; Traefik sees 200 and considers the
  node ready. If `Unhealthy` (HTTP 503) is required to gate traffic, the `Degraded` path is
  insufficient.
- **`IClusterRoleInfo` abstraction**: the admin-leader check depends on `IClusterRoleInfo`, an OtOpcUa
  interface, not the raw `Akka.Cluster.Cluster` API. This is a testability-friendly layer absent in
  ScadaBridge's direct Akka usage.

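
If `Degraded` ever needs to gate traffic, one option is to override the status-code mapping per endpoint through `HealthCheckOptions.ResultStatusCodes`. The sketch below is not what OtOpcUa does today: the stub check that always reports `Degraded` is hypothetical, and the endpoint and tag names are reused from this document for illustration only. All three statuses are mapped because the option expects a complete mapping.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Stub check that always reports Degraded, standing in for a pre-Up Akka node.
builder.Services.AddHealthChecks()
    .AddCheck("akka", () => HealthCheckResult.Degraded("Self not yet Up in cluster"),
        tags: new[] { "ready" });

var app = builder.Build();

app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = c => c.Tags.Contains("ready"),
    ResultStatusCodes =
    {
        [HealthStatus.Healthy] = StatusCodes.Status200OK,
        // Non-default: report Degraded as 503 so a status-code-only prober backs off.
        [HealthStatus.Degraded] = StatusCodes.Status503ServiceUnavailable,
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable,
    },
}).AllowAnonymous();

app.Run();
```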
---
## Adoption plan → `ZB.MOM.WW.Health`

**Replace with shared probes:**

- `AkkaClusterHealthCheck` → `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` using the
  **`OtOpcUaCompat` preset** (self-Up-among-members scan → Healthy/Degraded). The preset keeps
  OtOpcUa's existing two-way policy without forcing ScadaBridge's three-way policy onto it.
- `AdminRoleLeaderHealthCheck` → `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with
  `RoleFilter = "admin"`. The role-filter parameter produces identical behavior: non-admin nodes
  immediately healthy, admin leader healthy, admin non-leader degraded.
- `DatabaseHealthCheck` → `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<OtOpcUaConfigDbContext>`
  with a `ProbeQuery` delegate of `db => db.Deployments.AsNoTracking().Take(1).ToListAsync()`.
  The delegate preserves the stricter query probe rather than falling back to `CanConnectAsync`.
- Add `GrpcDependencyHealthCheck` targeting the MxAccessGateway channel (closes the downstream
  dependency gap noted in §4). Tag `["ready","active"]`.
- Replace `AddOtOpcUaHealth` / `MapOtOpcUaHealth` with `services.AddZbHealthChecks()` +
  `app.MapZbHealth()`. The `/healthz` bare-liveness tier is part of `MapZbHealth` by default, so
  no separate wiring is needed.

**Keep bespoke:**

- `IClusterRoleInfo` and its Akka implementation: this is an OtOpcUa abstraction used for more
  than health checks; it should remain in the OtOpcUa codebase. The shared `ActiveNodeHealthCheck`
  will accept `IClusterRoleInfo` (or an equivalent cluster-info abstraction) as an injection point.
- The `AllowAnonymous` policy: this is an OtOpcUa auth concern; `MapZbHealth` must document that
  callers are responsible for applying `AllowAnonymous` (or the shared helper applies it by default).
- Which probes are registered and their tag assignments: the shared library supplies the check
  implementations; the wiring (which names, which tags, which options) remains per-project.

**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Health`
library build. The library build delivers the shared implementations; adoption lands in the
OtOpcUa repo as a separate commit once the nupkg is available.