# Health — current state: OtOpcUa

Repo: `~/Desktop/OtOpcUa`. Stack: .NET 10, Akka.NET, OPC UA; solution `ZB.MOM.WW.OtOpcUa.slnx`. Health code lives in `src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/`. All paths are relative to the repo root. Verified 2026-06-01.

Full three-tier pattern: `/health/ready`, `/health/active`, and `/healthz`. Three probes cover the database, the Akka cluster, and the admin-role leader. All endpoints are `AllowAnonymous` to permit Traefik and load-balancer probing without credentials.

## 1. Endpoint wiring

`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/HealthEndpoints.cs`:

- `:13`: XML comment explicitly names this as "ScadaLink's three-tier pattern: `ready` = boot ok; `active` = fully serving traffic; `healthz` = bare process liveness."
- `:17`: `AddOtOpcUaHealth(IServiceCollection)` calls `services.AddHealthChecks()` and registers all three probes (lines 20–22):
  - `DatabaseHealthCheck` name `"configdb"`, tags `["ready","active"]`
  - `AkkaClusterHealthCheck` name `"akka"`, tags `["ready","active"]`
  - `AdminRoleLeaderHealthCheck` name `"admin-leader"`, tags `["active"]` only
- `:28`: `MapOtOpcUaHealth(IEndpointRouteBuilder)` maps three endpoints (lines 33–44):
  - `/health/ready`: predicate `c => c.Tags.Contains("ready")`, `.AllowAnonymous()` (lines 33–36)
  - `/health/active`: predicate `c => c.Tags.Contains("active")`, `.AllowAnonymous()` (lines 37–40)
  - `/healthz`: predicate `_ => false` (no probes run; bare process liveness only), `.AllowAnonymous()` (lines 41–44)

`Program.cs`:

- `:137`: `builder.Services.AddOtOpcUaHealth()`
- `:159`: `app.MapOtOpcUaHealth()`

Response writer: the ASP.NET Core default, which writes the aggregate status as plain text (no `HealthChecks.UI.Client` JSON writer).

## 2. Probes

### DatabaseHealthCheck

`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/DatabaseHealthCheck.cs`:

- `:9`: injects `IDbContextFactory`
- `:25–37`: opens a pooled context via `CreateDbContextAsync`, runs `db.Deployments.AsNoTracking().Take(1).ToListAsync()`.
If the query succeeds → `HealthCheckResult.Healthy("ConfigDb reachable")` (`:31`). If it throws → `HealthCheckResult.Unhealthy("ConfigDb unreachable", ex)` (`:35`). No `Degraded` path.

The probe exercises a real query (not just `CanConnectAsync`): it confirms the `Deployments` table is readable, which implies the schema migration has run. This is **stricter** than ScadaBridge's `CanConnectAsync` but more opaque about the failure reason.

Tags on registration: `["ready","active"]`. The database must be reachable for both readiness and active-node determination.

### AkkaClusterHealthCheck

`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AkkaClusterHealthCheck.cs`:

- `:9`: injects `ActorSystem` directly
- `:27–33`: calls `Cluster.Get(_system)`, scans `cluster.State.Members` for the member whose `Address == cluster.SelfAddress` and `Status == MemberStatus.Up`:
  - Found Up → `HealthCheckResult.Healthy($"Self Up; {cluster.State.Members.Count} member(s)")` (`:32`)
  - Not found → `HealthCheckResult.Degraded("Self not yet Up in cluster")` (`:33`)

No `Unhealthy` path: joining, leaving, and removed nodes are all reported as `Degraded`. This differs from ScadaBridge's more granular three-way policy (see GAPS).

Tags on registration: `["ready","active"]`.

### AdminRoleLeaderHealthCheck

`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AdminRoleLeaderHealthCheck.cs`:

- `:14`: injects `IClusterRoleInfo`
- `:27–38`: three-path logic:
  - Node does not carry the `"admin"` role → `Healthy("Node does not carry admin role")` (`:30`). Non-admin nodes are immediately healthy, so this probe never gates a non-admin node.
  - Admin role + node is the role leader → `Healthy($"Admin leader ({...})")` (`:36`)
  - Admin role + not the leader → `Degraded($"Admin member but not leader (leader=...)")` (`:37`)

Tags on registration: `["active"]` only; it does not participate in `/health/ready`.
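
The three-path logic of `AdminRoleLeaderHealthCheck` can be sketched as follows. This is a minimal illustration, not the repository code: the members of `IClusterRoleInfo` are not shown in this document, so the interface `IClusterRoleInfoSketch`, its three methods, and the `FixedRoleInfo` fake are hypothetical names introduced only for the example. The `IHealthCheck`, `HealthCheckContext`, and `HealthCheckResult` types are the stock `Microsoft.Extensions.Diagnostics.HealthChecks` ones.

```csharp
#nullable enable
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-in for IClusterRoleInfo; the real member names may differ.
public interface IClusterRoleInfoSketch
{
    bool HasRole(string role);
    bool IsRoleLeader(string role);
    string? RoleLeaderAddress(string role);
}

// Hypothetical fixed-answer fake, showing why an interface seam is easy to test.
public sealed record FixedRoleInfo(bool Member, bool Leader, string? LeaderAddress)
    : IClusterRoleInfoSketch
{
    public bool HasRole(string role) => Member;
    public bool IsRoleLeader(string role) => Leader;
    public string? RoleLeaderAddress(string role) => LeaderAddress;
}

public sealed class AdminRoleLeaderHealthCheckSketch : IHealthCheck
{
    private const string AdminRole = "admin";
    private readonly IClusterRoleInfoSketch _roles;

    public AdminRoleLeaderHealthCheckSketch(IClusterRoleInfoSketch roles) => _roles = roles;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Path 1: non-admin nodes are never gated by this probe.
        if (!_roles.HasRole(AdminRole))
            return Task.FromResult(HealthCheckResult.Healthy("Node does not carry admin role"));

        var leader = _roles.RoleLeaderAddress(AdminRole) ?? "unknown";

        // Path 2: admin leader is Healthy. Path 3: admin standby is Degraded, never Unhealthy.
        var result = _roles.IsRoleLeader(AdminRole)
            ? HealthCheckResult.Healthy($"Admin leader ({leader})")
            : HealthCheckResult.Degraded($"Admin member but not leader (leader={leader})");

        return Task.FromResult(result);
    }
}
```

With the fake, `new FixedRoleInfo(true, false, "node-b")` yields `Degraded` and `new FixedRoleInfo(false, false, null)` yields `Healthy`, matching the paths listed above.
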
The intent is Traefik routing: the active node (admin-role leader) gets sticky admin-UI traffic; standby nodes are reachable for data-plane OPC UA but report `Degraded` on `/health/active` so the load balancer does not route control-plane traffic to them.

Note: no `Unhealthy` path for the role-filter case. If the ActorSystem is not running, `IClusterRoleInfo` presumably returns safe defaults (no role); this is not separately health-checked.

## 3. Tag / tier summary

| Probe | `/health/ready` | `/health/active` | `/healthz` |
|---|---|---|---|
| `DatabaseHealthCheck` | ✅ | ✅ | - |
| `AkkaClusterHealthCheck` | ✅ | ✅ | - |
| `AdminRoleLeaderHealthCheck` | - | ✅ | - |
| (no probes) | - | - | ✅ (bare liveness) |

`/healthz` runs zero probes. It is a pure process liveness sentinel (process reachable = healthy; a crashed process = no response). Kubernetes liveness probes, Traefik TCP checks, and uptime monitors use this tier.

## 4. Downstream dependency coverage

No probe for the upstream MxAccessGateway gRPC channel. If the gateway is unreachable, OtOpcUa reports healthy here (the GalaxyDriver will surface errors in OPC UA diagnostics, but `/health/ready` and `/health/active` will not reflect it). This is a gap that the shared `GrpcDependencyHealthCheck` probe in `ZB.MOM.WW.Health` would close.

## 5. Notable design choices

- **AllowAnonymous on all tiers**: see `HealthEndpoints.cs:30–32` comment: "Without it the `AddOtOpcUaAuth` fallback policy 401s every probe and Traefik marks every backend unhealthy."
- **Query probe, not `CanConnectAsync`**: the `Deployments` query validates that the schema has been applied. ScadaBridge uses `CanConnectAsync`; neither is wrong but they diverge.
- **`Degraded` semantics**: the Akka check uses `Degraded` (not `Unhealthy`) for a joining/pre-Up node. ASP.NET Core maps `Degraded` to HTTP 200 by default; Traefik sees 200 and considers the node ready.
If `Unhealthy` (HTTP 503) is required to gate traffic, the `Degraded` path is insufficient.
- **`IClusterRoleInfo` abstraction**: the admin-leader check depends on `IClusterRoleInfo`, an OtOpcUa interface, not the raw `Akka.Cluster.Cluster` API. This is a testability-friendly layer absent in ScadaBridge's direct Akka usage.

---

## Adoption plan → `ZB.MOM.WW.Health`

**Replace with shared probes:**

- `AkkaClusterHealthCheck` → `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` using the **`OtOpcUaCompat` preset** (self-Up-among-members scan → Healthy/Degraded). The preset keeps OtOpcUa's existing two-way policy without forcing ScadaBridge's three-way policy onto it.
- `AdminRoleLeaderHealthCheck` → `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with `RoleFilter = "admin"`. The role-filter parameter produces identical behavior: non-admin nodes immediately healthy, admin leader healthy, admin non-leader degraded.
- `DatabaseHealthCheck` → `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck` with a `ProbeQuery` delegate of `db => db.Deployments.AsNoTracking().Take(1).ToListAsync()`. The delegate preserves the stricter query probe rather than falling back to `CanConnectAsync`.
- Add `GrpcDependencyHealthCheck` targeting the MxAccessGateway channel (closes the downstream dependency gap noted in §4). Tag `["ready","active"]`.
- Replace `AddOtOpcUaHealth` / `MapOtOpcUaHealth` with `services.AddHealthChecks().AddCheck<...>()` (one call per probe, per spec §5) + `app.MapZbHealth()`. The `/healthz` bare-liveness tier is part of `MapZbHealth` by default, so no separate wiring is needed.

**Keep bespoke:**

- `IClusterRoleInfo` and its Akka implementation: on adoption this testability seam is given up for the health-check path. The shared `ActiveNodeHealthCheck` reads cluster role state from the ActorSystem directly (resolving it lazily via `IServiceProvider`); it does not accept `IClusterRoleInfo` as an injection point.
This is an accepted trade-off: the shared implementation is simpler and consistent across projects, while `IClusterRoleInfo` remains available elsewhere in the OtOpcUa codebase where it is used outside health checks.
- The `AllowAnonymous` policy: this is an OtOpcUa auth concern; `MapZbHealth` must document that callers are responsible for applying `AllowAnonymous` (or the shared helper applies it by default).
- Which probes are registered and their tag assignments: the shared library supplies the check implementations; the wiring (which names, which tags, which options) remains per-project.

**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Health` library build. The library build delivers the shared implementations; adoption lands in the OtOpcUa repo as a separate commit once the nupkg is available.
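
For reference when the per-project wiring is reviewed, the names, tags, predicates, and anonymous access described in §1 can be sketched with stock ASP.NET Core APIs only. This is an illustrative sketch under stated assumptions, not the contents of `HealthEndpoints.cs`: the three lambda checks are placeholders for the real probes, and the `ResultStatusCodes` override on `/health/active` is not in the current code. It is shown only as one standard way to return 503 for `Degraded`, should the §5 concern about `Degraded` mapping to HTTP 200 need addressing.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Placeholder lambdas stand in for DatabaseHealthCheck, AkkaClusterHealthCheck,
// and AdminRoleLeaderHealthCheck. Names and tags follow the registration in §1.
builder.Services.AddHealthChecks()
    .AddCheck("configdb", () => HealthCheckResult.Healthy("ConfigDb reachable"),
        tags: new[] { "ready", "active" })
    .AddCheck("akka", () => HealthCheckResult.Healthy("Self Up"),
        tags: new[] { "ready", "active" })
    .AddCheck("admin-leader", () => HealthCheckResult.Degraded("Admin member but not leader"),
        tags: new[] { "active" });

var app = builder.Build();

// Readiness tier: only checks tagged "ready" run.
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = c => c.Tags.Contains("ready")
}).AllowAnonymous();

// Active tier: checks tagged "active" run.
app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = c => c.Tags.Contains("active"),
    // Assumption, not in the current code: map Degraded to 503 so a probing
    // load balancer treats a standby or not-yet-Up node as not active.
    ResultStatusCodes =
    {
        [HealthStatus.Healthy] = StatusCodes.Status200OK,
        [HealthStatus.Degraded] = StatusCodes.Status503ServiceUnavailable,
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
    }
}).AllowAnonymous();

// Liveness tier: the predicate rejects every check, so the endpoint
// returns Healthy whenever the process can answer HTTP at all.
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    Predicate = _ => false
}).AllowAnonymous();

app.Run();
```

Without the `ResultStatusCodes` override, the default mapping returns 200 for both `Healthy` and `Degraded` and 503 for `Unhealthy`, which is the behavior described in §5.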