# Health — current state: OtOpcUa

Repo: `~/Desktop/OtOpcUa`. Stack: .NET 10, Akka.NET, OPC UA; solution `ZB.MOM.WW.OtOpcUa.slnx`.
Health code lives in `src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/`. All paths are relative to the repo root.
Verified 2026-06-01.

Full three-tier pattern: `/health/ready`, `/health/active`, and `/healthz`. Three probes cover
the database, the Akka cluster, and the admin-role leader. All endpoints are `AllowAnonymous` to
permit Traefik and load-balancer probing without credentials.

## 1. Endpoint wiring

`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/HealthEndpoints.cs`:

- `:13` — XML comment explicitly names this as "ScadaLink's three-tier pattern: `ready` = boot ok;
  `active` = fully serving traffic; `healthz` = bare process liveness."
- `:17` — `AddOtOpcUaHealth(IServiceCollection)` calls `services.AddHealthChecks()` and registers
  all three probes (lines 20–22):
  - `DatabaseHealthCheck` name `"configdb"`, tags `["ready","active"]`
  - `AkkaClusterHealthCheck` name `"akka"`, tags `["ready","active"]`
  - `AdminRoleLeaderHealthCheck` name `"admin-leader"`, tags `["active"]` only
- `:28` — `MapOtOpcUaHealth(IEndpointRouteBuilder)` maps three endpoints (lines 33–44):
  - `/health/ready` — predicate `c => c.Tags.Contains("ready")`, `.AllowAnonymous()` (lines 33–36)
  - `/health/active` — predicate `c => c.Tags.Contains("active")`, `.AllowAnonymous()` (lines 37–40)
  - `/healthz` — predicate `_ => false` (no probes run; bare process liveness only), `.AllowAnonymous()` (lines 41–44)

`Program.cs`:

- `:137` — `builder.Services.AddOtOpcUaHealth()`
- `:159` — `app.MapOtOpcUaHealth()`

Response writer: ASP.NET Core default, which writes the aggregate status as plain text (no `HealthChecks.UI.Client` JSON writer).

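
A minimal sketch of the wiring described in this section, reconstructed from the line references rather than copied from `HealthEndpoints.cs`. The names, tags, and predicates are the ones listed above; the three probe classes are placeholder stubs added only so the block compiles on its own (an assumption, not the real probes).

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public static class HealthEndpointsSketch
{
    private static readonly string[] ReadyAndActive = ["ready", "active"];
    private static readonly string[] ActiveOnly = ["active"];

    public static IServiceCollection AddOtOpcUaHealth(this IServiceCollection services)
    {
        services.AddHealthChecks()
            .AddCheck<DatabaseHealthCheck>("configdb", tags: ReadyAndActive)
            .AddCheck<AkkaClusterHealthCheck>("akka", tags: ReadyAndActive)
            .AddCheck<AdminRoleLeaderHealthCheck>("admin-leader", tags: ActiveOnly);
        return services;
    }

    public static IEndpointRouteBuilder MapOtOpcUaHealth(this IEndpointRouteBuilder endpoints)
    {
        // Readiness: every probe tagged "ready".
        endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
        {
            Predicate = c => c.Tags.Contains("ready"),
        }).AllowAnonymous();

        // Active node: every probe tagged "active".
        endpoints.MapHealthChecks("/health/active", new HealthCheckOptions
        {
            Predicate = c => c.Tags.Contains("active"),
        }).AllowAnonymous();

        // Bare liveness: the predicate rejects every probe, so none run.
        endpoints.MapHealthChecks("/healthz", new HealthCheckOptions
        {
            Predicate = _ => false,
        }).AllowAnonymous();

        return endpoints;
    }
}

// Placeholder stubs (hypothetical): stand-ins for the real probes covered in section 2.
public abstract class StubCheck : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
        => Task.FromResult(HealthCheckResult.Healthy());
}

public sealed class DatabaseHealthCheck : StubCheck { }
public sealed class AkkaClusterHealthCheck : StubCheck { }
public sealed class AdminRoleLeaderHealthCheck : StubCheck { }
```

Because `/healthz` selects no checks, the aggregate result is `Healthy` whenever the process can answer the request at all.
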
## 2. Probes

### DatabaseHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/DatabaseHealthCheck.cs`:

- `:9` — injects `IDbContextFactory<OtOpcUaConfigDbContext>`
- `:25–37` — opens a pooled context via `CreateDbContextAsync`, runs
  `db.Deployments.AsNoTracking().Take(1).ToListAsync()`. If the query succeeds →
  `HealthCheckResult.Healthy("ConfigDb reachable")` (`:31`). If it throws →
  `HealthCheckResult.Unhealthy("ConfigDb unreachable", ex)` (`:35`). No `Degraded` path.

The probe exercises a real query (not just `CanConnectAsync`) — it confirms the `Deployments` table
is readable, which implies the schema migration has run. This is **stricter** than ScadaBridge's
`CanConnectAsync` but more opaque about the failure reason.

Tags on registration: `["ready","active"]` — the database must be reachable for both readiness and
active-node determination.

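
The probe logic above can be sketched as follows. This is a reconstruction from the description, not the file contents: the `Deployment` entity and the context body are hypothetical stand-ins, and only the factory injection, the query, and the two result messages come from the notes above.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical minimal entity and context; the real types live in the OtOpcUa repo.
public sealed class Deployment
{
    public int Id { get; set; }
}

public sealed class OtOpcUaConfigDbContext(DbContextOptions<OtOpcUaConfigDbContext> options)
    : DbContext(options)
{
    public DbSet<Deployment> Deployments => Set<Deployment>();
}

public sealed class DatabaseHealthCheck(IDbContextFactory<OtOpcUaConfigDbContext> factory)
    : IHealthCheck
{
    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            await using var db = await factory.CreateDbContextAsync(cancellationToken);

            // A real query: it succeeds only if the Deployments table exists and is readable.
            _ = await db.Deployments.AsNoTracking().Take(1).ToListAsync(cancellationToken);

            return HealthCheckResult.Healthy("ConfigDb reachable");
        }
        catch (Exception ex)
        {
            // Any failure (connection, missing table, timeout) collapses into one result,
            // which is why the failure reason is opaque to the caller.
            return HealthCheckResult.Unhealthy("ConfigDb unreachable", ex);
        }
    }
}
```

The single `catch` is what makes the probe strict but opaque: a missing migration and an unreachable server produce the same description, and only the attached exception tells them apart.
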
### AkkaClusterHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AkkaClusterHealthCheck.cs`:

- `:9` — injects `ActorSystem` directly
- `:27–33` — calls `Cluster.Get(_system)`, scans `cluster.State.Members` for the member whose
  `Address == cluster.SelfAddress` and `Status == MemberStatus.Up`:
  - Found Up → `HealthCheckResult.Healthy($"Self Up; {cluster.State.Members.Count} member(s)")` (`:32`)
  - Not found → `HealthCheckResult.Degraded("Self not yet Up in cluster")` (`:33`)

No `Unhealthy` path — joining/leaving/removed nodes are all reported as `Degraded`. This differs from
ScadaBridge's more granular three-way policy (see GAPS).

Tags on registration: `["ready","active"]`.

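
A sketch of the self-Up scan described above, assuming the Akka.NET `Akka.Cluster` package. It is reconstructed from the line notes, so the field layout and exact expression shape in the real file may differ; the two result messages are the ones quoted above.

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class AkkaClusterHealthCheck(ActorSystem system) : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var cluster = Cluster.Get(system);
        var members = cluster.State.Members;

        // Healthy only when this node appears in the member list with status Up.
        var selfUp = members.Any(m =>
            m.Address.Equals(cluster.SelfAddress) && m.Status == MemberStatus.Up);

        // Joining, Leaving, Exiting, and Removed all fall through to Degraded.
        return Task.FromResult(selfUp
            ? HealthCheckResult.Healthy($"Self Up; {members.Count} member(s)")
            : HealthCheckResult.Degraded("Self not yet Up in cluster"));
    }
}
```

The check is two-way by construction: there is no branch that returns `Unhealthy`, so a node that has left the cluster is indistinguishable from one that has not joined yet.
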
### AdminRoleLeaderHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AdminRoleLeaderHealthCheck.cs`:

- `:14` — injects `IClusterRoleInfo`
- `:27–38` — three-path logic:
  - Node does not carry the `"admin"` role → `Healthy("Node does not carry admin role")` (`:30`) —
    non-admin nodes are immediately healthy, so this probe never gates a non-admin node.
  - Admin role + node is the role leader → `Healthy($"Admin leader ({...})")` (`:36`)
  - Admin role + not the leader → `Degraded($"Admin member but not leader (leader=...)")` (`:37`)

Tags on registration: `["active"]` only — does not participate in `/health/ready`. The intent is
Traefik routing: the active node (admin-role leader) gets sticky admin-UI traffic; standby nodes
are reachable for data-plane OPC UA but report `Degraded` on `/health/active` so the load balancer
does not route control-plane traffic to them. This only works if the load balancer distinguishes
`Degraded` from `Healthy`; with the default status-code mapping both return HTTP 200 (see §5).

Note: no `Unhealthy` path for the role-filter case. If the ActorSystem is not running, `IClusterRoleInfo`
presumably returns safe defaults (no role); this is not separately health-checked.

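
The three-path decision can be sketched as a pure function. The members of `IClusterRoleInfo` are not shown in these notes, so the sketch takes plain values instead; the parameter names and the text interpolated into the messages are assumptions, and only the three outcomes and their fixed message prefixes come from the description above.

```csharp
using Microsoft.Extensions.Diagnostics.HealthChecks;

public static class AdminRoleLeaderPolicy
{
    // Hypothetical inputs: values an IClusterRoleInfo implementation would supply.
    public static HealthCheckResult Evaluate(
        bool selfHasAdminRole, string selfAddress, string? adminLeaderAddress)
    {
        // Path 1: non-admin nodes are never gated by this probe.
        if (!selfHasAdminRole)
            return HealthCheckResult.Healthy("Node does not carry admin role");

        // Path 2: this node is the admin-role leader.
        if (adminLeaderAddress is not null && adminLeaderAddress == selfAddress)
            return HealthCheckResult.Healthy($"Admin leader ({selfAddress})");

        // Path 3: admin member, but another node (or no node yet) leads the role.
        return HealthCheckResult.Degraded(
            $"Admin member but not leader (leader={adminLeaderAddress ?? "unknown"})");
    }
}
```

Keeping the decision free of cluster types means all three paths can be unit-tested without a running ActorSystem.
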
## 3. Tag / tier summary

| Probe | `/health/ready` | `/health/active` | `/healthz` |
|---|---|---|---|
| `DatabaseHealthCheck` | ✅ | ✅ | — |
| `AkkaClusterHealthCheck` | ✅ | ✅ | — |
| `AdminRoleLeaderHealthCheck` | — | ✅ | — |
| (no probes) | — | — | ✅ (bare liveness) |

`/healthz` runs zero probes — it is a pure process liveness sentinel (process reachable = healthy;
a crashed process = no response). Kubernetes liveness probes, Traefik TCP checks, and uptime
monitors use this tier.

## 4. Downstream dependency coverage

No probe for the upstream MxAccessGateway gRPC channel. If the gateway is unreachable, OtOpcUa
reports healthy here (the GalaxyDriver will surface errors in OPC UA diagnostics, but `/health/ready`
and `/health/active` will not reflect it). This is a gap that the shared `GrpcDependencyHealthCheck`
probe in `ZB.MOM.WW.Health` would close.

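
For illustration only, a probe of this kind could look like the sketch below. It is not the shared `GrpcDependencyHealthCheck`, whose API is not described here. It assumes the gateway exposes the standard `grpc.health.v1` health service (via the `Grpc.HealthCheck` package); nothing in these notes confirms that, and the class name and result messages are hypothetical.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Grpc.Core;
using Grpc.Health.V1;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical probe; assumes the target implements the gRPC Health Checking Protocol.
public sealed class GatewayGrpcHealthCheckSketch(ChannelBase channel) : IHealthCheck
{
    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            var client = new Health.HealthClient(channel);

            // An empty request asks for the overall server status.
            var reply = await client.CheckAsync(
                new HealthCheckRequest(), cancellationToken: cancellationToken);

            return reply.Status == HealthCheckResponse.Types.ServingStatus.Serving
                ? HealthCheckResult.Healthy("Gateway serving")
                : HealthCheckResult.Unhealthy($"Gateway status: {reply.Status}");
        }
        catch (RpcException ex)
        {
            // Transport failures and deadline expiry surface as RpcException.
            return HealthCheckResult.Unhealthy("Gateway unreachable", ex);
        }
    }
}
```

Registered with tags `["ready","active"]`, a check like this would make both tiers reflect gateway outages instead of leaving them visible only in OPC UA diagnostics.
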
## 5. Notable design choices

- **AllowAnonymous on all tiers** — see `HealthEndpoints.cs:30–32` comment: "Without it the
  `AddOtOpcUaAuth` fallback policy 401s every probe and Traefik marks every backend unhealthy."
- **Query probe, not `CanConnectAsync`** — the `Deployments` query validates that the schema has
  been applied. ScadaBridge uses `CanConnectAsync`; neither is wrong but they diverge.
- **`Degraded` semantics** — the Akka check uses `Degraded` (not `Unhealthy`) for a joining/pre-Up
  node. ASP.NET Core maps `Degraded` to HTTP 200 by default; Traefik sees 200 and considers the
  node ready. If `Unhealthy` (HTTP 503) is required to gate traffic, the `Degraded` path is
  insufficient.
- **`IClusterRoleInfo` abstraction** — the admin-leader check depends on `IClusterRoleInfo`, an OtOpcUa
  interface, not the raw `Akka.Cluster.Cluster` API. This is a testability-friendly layer absent in
  ScadaBridge's direct Akka usage.

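
If gating on `Degraded` turns out to be required, one option (a sketch, not something the current code does) is to override the status-code mapping per endpoint through `HealthCheckOptions.ResultStatusCodes`. The example below applies it to `/health/active` only, so a standby admin node answers 503 there while `/health/ready` keeps the default mapping. Probe registrations are omitted, and the choice of endpoint is an assumption.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddHealthChecks(); // probe registrations omitted in this sketch
var app = builder.Build();

app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = c => c.Tags.Contains("active"),
    ResultStatusCodes =
    {
        [HealthStatus.Healthy] = StatusCodes.Status200OK,
        // Default is 200; mapping Degraded to 503 makes the load balancer drop the node.
        [HealthStatus.Degraded] = StatusCodes.Status503ServiceUnavailable,
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable,
    },
}).AllowAnonymous();

app.Run();
```
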
---

## Adoption plan → `ZB.MOM.WW.Health`

**Replace with shared probes:**

- `AkkaClusterHealthCheck` → `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` using the
  **`OtOpcUaCompat` preset** (self-Up-among-members scan → Healthy/Degraded). The preset keeps
  OtOpcUa's existing two-way policy without forcing ScadaBridge's three-way policy onto it.
- `AdminRoleLeaderHealthCheck` → `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with
  `RoleFilter = "admin"`. The role-filter parameter produces identical behavior: non-admin nodes
  immediately healthy, admin leader healthy, admin non-leader degraded.
- `DatabaseHealthCheck` → `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<OtOpcUaConfigDbContext>`
  with a `ProbeQuery` delegate of `db => db.Deployments.AsNoTracking().Take(1).ToListAsync()`.
  The delegate preserves the stricter query probe rather than falling back to `CanConnectAsync`.
- Add `GrpcDependencyHealthCheck` targeting the MxAccessGateway channel (closes the downstream
  dependency gap noted in §4). Tag `["ready","active"]`.
- Replace `AddOtOpcUaHealth` / `MapOtOpcUaHealth` with
  `services.AddHealthChecks().AddCheck<...>()` (one call per probe, per spec §5) +
  `app.MapZbHealth()`. The `/healthz` bare-liveness tier is part of `MapZbHealth` by default —
  no separate wiring needed.

**Keep bespoke:**

- `IClusterRoleInfo` and its Akka implementation — on adoption this testability seam is given up
  for the health-check path. The shared `ActiveNodeHealthCheck` reads cluster role state from the
  ActorSystem directly (resolving it lazily via `IServiceProvider`); it does not accept
  `IClusterRoleInfo` as an injection point. This is an accepted trade-off: the shared implementation
  is simpler and consistent across projects, while `IClusterRoleInfo` remains available elsewhere
  in the OtOpcUa codebase where it is used outside health checks.
- The `AllowAnonymous` policy — this is an OtOpcUa auth concern; `MapZbHealth` must document that
  callers are responsible for applying `AllowAnonymous` (or the shared helper applies it by default).
- Which probes are registered and their tag assignments — the shared library supplies the check
  implementations; the wiring (which names, which tags, which options) remains per-project.

**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Health`
library build. The library build delivers the shared implementations; adoption lands in the
OtOpcUa repo as a separate commit once the nupkg is available.