
# Health — current state: OtOpcUa
Repo: `~/Desktop/OtOpcUa`. Stack: .NET 10, Akka.NET, OPC UA; solution `ZB.MOM.WW.OtOpcUa.slnx`.
Health code lives in `src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/`. All paths relative to repo root.
Verified 2026-06-01.
OtOpcUa implements the full three-tier pattern (`/health/ready`, `/health/active`, and `/healthz`)
with three probes covering the database, the Akka cluster, and the admin-role leader. All endpoints
are `AllowAnonymous` so that Traefik and load balancers can probe without credentials.
## 1. Endpoint wiring
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/HealthEndpoints.cs`:
- `:13` — XML comment explicitly names this as "ScadaLink's three-tier pattern: `ready` = boot ok;
`active` = fully serving traffic; `healthz` = bare process liveness."
- `:17` — `AddOtOpcUaHealth(IServiceCollection)` calls `services.AddHealthChecks()` and registers
  all three probes (lines 20–22):
  - `DatabaseHealthCheck` name `"configdb"`, tags `["ready","active"]`
  - `AkkaClusterHealthCheck` name `"akka"`, tags `["ready","active"]`
  - `AdminRoleLeaderHealthCheck` name `"admin-leader"`, tags `["active"]` only
- `:28` — `MapOtOpcUaHealth(IEndpointRouteBuilder)` maps three endpoints (lines 33–44):
  - `/health/ready` — predicate `c => c.Tags.Contains("ready")`, `.AllowAnonymous()` (lines 33–36)
  - `/health/active` — predicate `c => c.Tags.Contains("active")`, `.AllowAnonymous()` (lines 37–40)
  - `/healthz` — predicate `_ => false` (no probes run; bare process liveness only), `.AllowAnonymous()` (lines 41–44)
`Program.cs`:
- `:137` — `builder.Services.AddOtOpcUaHealth()`
- `:159` — `app.MapOtOpcUaHealth()`
Response writer: ASP.NET Core default, which writes the aggregate status as plain text (no
`HealthChecks.UI.Client` JSON writer).
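The wiring described above can be sketched as follows. This is a reconstruction from the names, tags, and predicates listed in this section, not a copy of `HealthEndpoints.cs`; the return types and the `StubCheck` placeholder base class are assumptions added so the sketch compiles on its own.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Placeholder probes (assumption): stand-ins for the real checks described in section 2.
public abstract class StubCheck : IHealthCheck
{
    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
        => Task.FromResult(HealthCheckResult.Healthy());
}
public sealed class DatabaseHealthCheck : StubCheck { }
public sealed class AkkaClusterHealthCheck : StubCheck { }
public sealed class AdminRoleLeaderHealthCheck : StubCheck { }

public static class HealthEndpoints
{
    // Registers the three probes with the names and tags listed above.
    public static IServiceCollection AddOtOpcUaHealth(this IServiceCollection services)
    {
        services.AddHealthChecks()
            .AddCheck<DatabaseHealthCheck>("configdb", tags: new[] { "ready", "active" })
            .AddCheck<AkkaClusterHealthCheck>("akka", tags: new[] { "ready", "active" })
            .AddCheck<AdminRoleLeaderHealthCheck>("admin-leader", tags: new[] { "active" });
        return services;
    }

    // Maps the three tiers; each tier selects probes by tag.
    public static IEndpointRouteBuilder MapOtOpcUaHealth(this IEndpointRouteBuilder endpoints)
    {
        endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
        {
            Predicate = c => c.Tags.Contains("ready"),
        }).AllowAnonymous();

        endpoints.MapHealthChecks("/health/active", new HealthCheckOptions
        {
            Predicate = c => c.Tags.Contains("active"),
        }).AllowAnonymous();

        // Predicate rejects every registration, so no probe runs: bare liveness.
        endpoints.MapHealthChecks("/healthz", new HealthCheckOptions
        {
            Predicate = _ => false,
        }).AllowAnonymous();

        return endpoints;
    }
}
```

Because tier membership is decided purely by tags, adding a probe to a tier is a registration change only; the endpoint mapping does not need to be touched.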
## 2. Probes
### DatabaseHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/DatabaseHealthCheck.cs`:
- `:9` — injects `IDbContextFactory<OtOpcUaConfigDbContext>`
- `:25–37` — opens a pooled context via `CreateDbContextAsync`, runs
`db.Deployments.AsNoTracking().Take(1).ToListAsync()`. If the query succeeds →
`HealthCheckResult.Healthy("ConfigDb reachable")` (`:31`). If it throws →
`HealthCheckResult.Unhealthy("ConfigDb unreachable", ex)` (`:35`). No `Degraded` path.
The probe exercises a real query (not just `CanConnectAsync`) — it confirms the `Deployments` table
is readable, which implies the schema migration has run. This is **stricter** than ScadaBridge's
`CanConnectAsync` but more opaque about the failure reason.
Tags on registration: `["ready","active"]` — the database must be reachable for both readiness and
active-node determination.
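A minimal sketch of the query probe described above, assuming EF Core. The `Deployment` entity and the `OtOpcUaConfigDbContext` body are hypothetical stand-ins (the real entity shape is not shown in this document); only the query and the two result messages come from the description.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-ins for the real config-db model.
public sealed class Deployment { public int Id { get; set; } }

public sealed class OtOpcUaConfigDbContext : DbContext
{
    public OtOpcUaConfigDbContext(DbContextOptions<OtOpcUaConfigDbContext> options)
        : base(options) { }

    public DbSet<Deployment> Deployments => Set<Deployment>();
}

public sealed class DatabaseHealthCheck : IHealthCheck
{
    private readonly IDbContextFactory<OtOpcUaConfigDbContext> _factory;

    public DatabaseHealthCheck(IDbContextFactory<OtOpcUaConfigDbContext> factory)
        => _factory = factory;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            await using var db = await _factory.CreateDbContextAsync(cancellationToken);

            // A real read, not just a connection test: this fails if the table is missing.
            await db.Deployments.AsNoTracking().Take(1).ToListAsync(cancellationToken);

            return HealthCheckResult.Healthy("ConfigDb reachable");
        }
        catch (Exception ex)
        {
            // Connection failures and missing-schema failures both land here,
            // which is why the failure reason is opaque to the caller.
            return HealthCheckResult.Unhealthy("ConfigDb unreachable", ex);
        }
    }
}
```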
### AkkaClusterHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AkkaClusterHealthCheck.cs`:
- `:9` — injects `ActorSystem` directly
- `:27–33` — calls `Cluster.Get(_system)`, scans `cluster.State.Members` for the member whose
  `Address == cluster.SelfAddress` and `Status == MemberStatus.Up`:
  - Found Up → `HealthCheckResult.Healthy($"Self Up; {cluster.State.Members.Count} member(s)")` (`:32`)
  - Not found → `HealthCheckResult.Degraded("Self not yet Up in cluster")` (`:33`)
No `Unhealthy` path — joining/leaving/removed nodes are all reported as `Degraded`. This differs from
ScadaBridge's more granular three-way policy (see GAPS).
Tags on registration: `["ready","active"]`.
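The two-way policy above can be sketched as follows, assuming the Akka.NET `Akka.Cluster` package. This is an illustration of the described scan, not the contents of `AkkaClusterHealthCheck.cs`.

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class AkkaClusterHealthCheck : IHealthCheck
{
    private readonly ActorSystem _system;

    public AkkaClusterHealthCheck(ActorSystem system) => _system = system;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var cluster = Cluster.Get(_system);
        var members = cluster.State.Members;

        // Healthy only when this node appears in the member list with status Up.
        var selfUp = members.Any(m =>
            m.Address.Equals(cluster.SelfAddress) && m.Status == MemberStatus.Up);

        // Joining, Leaving, Exiting and Removed all fall through to Degraded.
        return Task.FromResult(selfUp
            ? HealthCheckResult.Healthy($"Self Up; {members.Count} member(s)")
            : HealthCheckResult.Degraded("Self not yet Up in cluster"));
    }
}
```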
### AdminRoleLeaderHealthCheck
`src/Server/ZB.MOM.WW.OtOpcUa.Host/Health/AdminRoleLeaderHealthCheck.cs`:
- `:14` — injects `IClusterRoleInfo`
- `:27–38` — three-path logic:
  - Node does not carry the `"admin"` role → `Healthy("Node does not carry admin role")` (`:30`) —
    non-admin nodes are immediately healthy, so this probe never gates a non-admin node.
  - Admin role + node is the role leader → `Healthy($"Admin leader ({...})")` (`:36`)
  - Admin role + not the leader → `Degraded($"Admin member but not leader (leader=...)")` (`:37`)
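Tags on registration: `["active"]` only — does not participate in `/health/ready`. The intent is
Traefik routing: the active node (admin-role leader) gets sticky admin-UI traffic; standby nodes
are reachable for data-plane OPC UA but report `Degraded` on `/health/active` so the load balancer
does not route control-plane traffic to them. Because ASP.NET Core maps `Degraded` to HTTP 200 by
default (see §5), this intent is realized only if the status-code mapping is overridden or the
prober reads the response body.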
Note: no `Unhealthy` path for the role-filter case. If the ActorSystem is not running, `IClusterRoleInfo`
presumably returns safe defaults (no role); this is not separately health-checked.
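The three-path logic can be sketched as below. The shape of `IClusterRoleInfo` is not shown in this document, so the two members used here (`SelfHasRole`, `SelfIsRoleLeader`) are hypothetical, and the interpolated details that the source messages carry are left out rather than guessed.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical shape: the real OtOpcUa interface may expose different members.
public interface IClusterRoleInfo
{
    bool SelfHasRole(string role);
    bool SelfIsRoleLeader(string role);
}

public sealed class AdminRoleLeaderHealthCheck : IHealthCheck
{
    private const string AdminRole = "admin";
    private readonly IClusterRoleInfo _roles;

    public AdminRoleLeaderHealthCheck(IClusterRoleInfo roles) => _roles = roles;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Path 1: nodes without the admin role are never gated by this probe.
        if (!_roles.SelfHasRole(AdminRole))
        {
            return Task.FromResult(
                HealthCheckResult.Healthy("Node does not carry admin role"));
        }

        // Paths 2 and 3: the admin leader is Healthy, admin standbys are Degraded.
        // The source messages also interpolate leader details, omitted here.
        return Task.FromResult(_roles.SelfIsRoleLeader(AdminRole)
            ? HealthCheckResult.Healthy("Admin leader")
            : HealthCheckResult.Degraded("Admin member but not leader"));
    }
}
```

Depending on an interface rather than the cluster API means each path can be exercised with a small fake and no running `ActorSystem`.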
## 3. Tag / tier summary
| Probe | `/health/ready` | `/health/active` | `/healthz` |
|---|---|---|---|
| `DatabaseHealthCheck` | ✅ | ✅ | — |
| `AkkaClusterHealthCheck` | ✅ | ✅ | — |
| `AdminRoleLeaderHealthCheck` | — | ✅ | — |
| (no probes) | — | — | ✅ (bare liveness) |
`/healthz` runs zero probes — it is a pure process liveness sentinel (process reachable = healthy;
a crashed process = no response). Kubernetes liveness probes, Traefik TCP checks, and uptime
monitors use this tier.
## 4. Downstream dependency coverage
No probe for the upstream MxAccessGateway gRPC channel. If the gateway is unreachable, OtOpcUa
reports healthy here (the GalaxyDriver will surface errors in OPC UA diagnostics, but `/health/ready`
and `/health/active` will not reflect it). This is a gap that the shared `GrpcDependencyHealthCheck`
probe in `ZB.MOM.WW.Health` would close.
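For illustration only, a probe of this kind might look like the sketch below. It is not the shared `GrpcDependencyHealthCheck`; the class name `GatewayGrpcHealthCheck` is hypothetical, and the sketch assumes the gateway exposes the standard `grpc.health.v1` health service (via the `Grpc.HealthCheck` package), which this document does not confirm. The two-second deadline is likewise an arbitrary choice.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Core;
using Grpc.Health.V1;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical probe: assumes the upstream implements grpc.health.v1.Health.
public sealed class GatewayGrpcHealthCheck : IHealthCheck
{
    private readonly Health.HealthClient _client;

    public GatewayGrpcHealthCheck(ChannelBase channel)
        => _client = new Health.HealthClient(channel);

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // Short deadline so a hung gateway cannot stall the health endpoint.
            var reply = await _client.CheckAsync(
                new HealthCheckRequest(),
                deadline: DateTime.UtcNow.AddSeconds(2),
                cancellationToken: cancellationToken);

            return reply.Status == HealthCheckResponse.Types.ServingStatus.Serving
                ? HealthCheckResult.Healthy("Gateway serving")
                : HealthCheckResult.Unhealthy($"Gateway status: {reply.Status}");
        }
        catch (RpcException ex)
        {
            // Unavailable, DeadlineExceeded, Unimplemented, and so on.
            return HealthCheckResult.Unhealthy("Gateway unreachable", ex);
        }
    }
}
```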
## 5. Notable design choices
- **AllowAnonymous on all tiers** — see `HealthEndpoints.cs:30–32` comment: "Without it the
`AddOtOpcUaAuth` fallback policy 401s every probe and Traefik marks every backend unhealthy."
- **Query probe, not `CanConnectAsync`** — the `Deployments` query validates that the schema has
been applied. ScadaBridge uses `CanConnectAsync`; neither is wrong but they diverge.
- **`Degraded` semantics** — the Akka check uses `Degraded` (not `Unhealthy`) for a joining/pre-Up
node. ASP.NET Core maps `Degraded` to HTTP 200 by default; Traefik sees 200 and considers the
node ready. If `Unhealthy` (HTTP 503) is required to gate traffic, the `Degraded` path is
insufficient.
- **`IClusterRoleInfo` abstraction** — the admin-leader check depends on `IClusterRoleInfo`, an OtOpcUa
interface, not the raw `Akka.Cluster.Cluster` API. This is a testability-friendly layer absent in
ScadaBridge's direct Akka usage.
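If `Degraded` does need to gate traffic, one option is to override the status-code mapping per endpoint. The sketch below uses ASP.NET Core's `HealthCheckOptions.ResultStatusCodes`; the `StrictHealthOptions` helper is a hypothetical name, and nothing in this document indicates that OtOpcUa currently does this.

```csharp
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper: builds tier options where Degraded is reported as HTTP 503.
public static class StrictHealthOptions
{
    public static HealthCheckOptions ForTag(string tag)
    {
        var options = new HealthCheckOptions
        {
            Predicate = registration => registration.Tags.Contains(tag),
        };

        // Defaults are Healthy=200, Degraded=200, Unhealthy=503; only Degraded changes.
        options.ResultStatusCodes[HealthStatus.Degraded] =
            StatusCodes.Status503ServiceUnavailable;

        return options;
    }
}
```

A tier would opt in with `endpoints.MapHealthChecks("/health/active", StrictHealthOptions.ForTag("active")).AllowAnonymous()`. Applying it to `/health/active` only would keep standby nodes out of control-plane routing while leaving `/health/ready` on the default mapping.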
---
## Adoption plan → `ZB.MOM.WW.Health`
**Replace with shared probes:**
- `AkkaClusterHealthCheck` → `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` using the
**`OtOpcUaCompat` preset** (self-Up-among-members scan → Healthy/Degraded). The preset keeps
OtOpcUa's existing two-way policy without forcing ScadaBridge's three-way policy onto it.
- `AdminRoleLeaderHealthCheck` → `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with
`RoleFilter = "admin"`. The role-filter parameter produces identical behavior: non-admin nodes
immediately healthy, admin leader healthy, admin non-leader degraded.
- `DatabaseHealthCheck` → `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<OtOpcUaConfigDbContext>`
with a `ProbeQuery` delegate of `db => db.Deployments.AsNoTracking().Take(1).ToListAsync()`.
The delegate preserves the stricter query probe rather than falling back to `CanConnectAsync`.
- Add `GrpcDependencyHealthCheck` targeting the MxAccessGateway channel (closes the downstream
dependency gap noted in §4). Tag `["ready","active"]`.
- Replace `AddOtOpcUaHealth` / `MapOtOpcUaHealth` with
`services.AddHealthChecks().AddCheck<...>()` (one call per probe, per spec §5) +
`app.MapZbHealth()`. The `/healthz` bare-liveness tier is part of `MapZbHealth` by default —
no separate wiring needed.
**Keep bespoke:**
- `IClusterRoleInfo` and its Akka implementation — on adoption this testability seam is given up
for the health-check path. The shared `ActiveNodeHealthCheck` reads cluster role state from the
ActorSystem directly (resolving it lazily via `IServiceProvider`); it does not accept
`IClusterRoleInfo` as an injection point. This is an accepted trade-off: the shared implementation
is simpler and consistent across projects, while `IClusterRoleInfo` remains available elsewhere
in the OtOpcUa codebase where it is used outside health checks.
- The `AllowAnonymous` policy — this is an OtOpcUa auth concern; `MapZbHealth` must document that
callers are responsible for applying `AllowAnonymous` (or the shared helper applies it by default).
- Which probes are registered and their tag assignments — the shared library supplies the check
implementations; the wiring (which names, which tags, which options) remains per-project.
**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Health`
library build. The library build delivers the shared implementations; adoption lands in the
OtOpcUa repo as a separate commit once the nupkg is available.