# Health — gaps & adoption backlog
Divergence of each project from [`spec/SPEC.md`](spec/SPEC.md), and the ordered backlog to
reach the shared `ZB.MOM.WW.Health` library. Status legend: ⛔ gap · 🟡 partial · ✅ matches.
## Divergence vs spec
### §1 Endpoint tiers
| Spec tier | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `/health/ready` (tag `ready`) | ✅ present | ⛔ absent | ✅ present (name-predicate) |
| `/health/active` (tag `active`) | ✅ present | ⛔ absent | ✅ present (name-predicate) |
| `/healthz` (bare process liveness) | ✅ present | ⛔ absent | ⛔ absent |
| `/health/live` (non-standard) | — | ⛔ present (hardcoded `"Healthy"`, bypasses health-check pipeline) | — |
**Gap T1 (P1):** MxAccessGateway has no standard health tiers. The existing `/health/live`
`MapGet` lambda must be replaced by `app.MapZbHealth()` + real probes.
**Gap T2:** ScadaBridge lacks `/healthz`. `MapZbHealth()` adds it automatically.
**Gap T3:** MxAccessGateway's `/health/live` uses a raw `MapGet` that bypasses the ASP.NET Core
health-check middleware — it does not participate in `IHealthCheckPublisher`, `HealthReport`, or
UI integration. Must be removed.
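
For orientation, the three spec tiers can be expressed with stock ASP.NET Core APIs roughly as below. This is a minimal sketch, not the actual body of `MapZbHealth()`, whose internals are not shown here. The placeholder check names are hypothetical, and the tag strings are taken from the tier table above.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Placeholder probes (hypothetical names); real probes would test the
// database, cluster membership, and so on.
builder.Services.AddHealthChecks()
    .AddCheck("ready-placeholder", () => HealthCheckResult.Healthy(), tags: new[] { "ready" })
    .AddCheck("active-placeholder", () => HealthCheckResult.Healthy(), tags: new[] { "active" });

var app = builder.Build();

// Readiness tier: runs only checks tagged "ready".
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
}).AllowAnonymous();

// Active tier: runs only checks tagged "active".
app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("active"),
}).AllowAnonymous();

// Bare liveness: the predicate rejects every check, so the endpoint only
// proves that the process is up and the pipeline is serving requests.
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    Predicate = _ => false,
}).AllowAnonymous();

app.Run();
```

Unlike a raw `MapGet` lambda, each of these endpoints goes through the health-check middleware, so results show up in `HealthReport` and are visible to publishers.
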
### §2 Probe coverage
| Probe | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Database connectivity | ✅ `DatabaseHealthCheck` (query probe) | ⛔ none | ✅ `DatabaseHealthCheck` (`CanConnectAsync`) |
| Akka cluster membership | ✅ `AkkaClusterHealthCheck` (2-way) | n/a (no Akka) | ✅ `AkkaClusterHealthCheck` (3-way) |
| Active / leader node | ✅ `AdminRoleLeaderHealthCheck` (role-filtered) | n/a | ✅ `ActiveNodeHealthCheck` (role-less) |
| Downstream gRPC dependency | ⛔ none | ⛔ none | ⛔ none |
**Gap P1 (P1):** MxAccessGateway has zero probes — `AddHealthChecks()` at
`GatewayApplication.cs:61` is dead code. Minimum viable: a `GrpcDependencyHealthCheck`
targeting the x86 worker IPC channel.
**Gap P2:** No project probes its downstream gRPC dependency. OtOpcUa should probe the
MxAccessGateway channel; MxAccessGateway should probe the worker IPC.
**Gap P3:** Dead `AddHealthChecks()` in MxAccessGateway (`GatewayApplication.cs:61`) should be
removed or replaced — it currently implies health checks are configured when they are not.
### §3 Akka status-policy divergence
| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe implementation | Scans `State.Members` for self by address | Reads `SelfMember.Status` directly |
| Joining status | Degraded (not in Members as Up) | Healthy |
| Leaving/Exiting status | Degraded | Degraded |
| Other (Removed, Down…) | Degraded | Unhealthy |
| ActorSystem null guard | — (none; `ActorSystem` injected directly) | ✅ Degraded if null |
The two implementations diverge on `Joining`: ScadaBridge reports Healthy, while OtOpcUa reports
Degraded, because a node whose status is `Joining` is not found as `Up` in the member scan. They
also diverge on Removed/Down: ScadaBridge reports Unhealthy, OtOpcUa reports Degraded.
The shared `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` ships two presets to preserve both
behaviors rather than forcing one onto the other:
- **Default** — ScadaBridge's three-way policy (`Up`/`Joining`=Healthy, `Leaving`/`Exiting`=Degraded,
else Unhealthy)
- **OtOpcUaCompat** — OtOpcUa's self-Up-among-members scan (found Up=Healthy, not found=Degraded)
**Gap A1:** OtOpcUa adopts the `OtOpcUaCompat` preset; ScadaBridge adopts the `Default` preset.
Both preserve existing behavior without forcing convergence on a single policy.
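
The two presets reduce to two small classification functions. The sketch below is an illustration under assumptions, not the library source: `MemberState` and `Health` are local stand-ins that mirror `Akka.Cluster.MemberStatus` and `HealthStatus` so the block compiles without either package, and the tuple-based member list is a hypothetical simplification of `State.Members`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Local stand-ins (assumption): names mirror Akka.Cluster.MemberStatus and
// Microsoft.Extensions.Diagnostics.HealthChecks.HealthStatus.
public enum MemberState { Joining, Up, Leaving, Exiting, Down, Removed, WeaklyUp }
public enum Health { Unhealthy, Degraded, Healthy }

public static class ClusterPolicySketch
{
    // "Default" preset: three-way classification of the node's own status.
    public static Health Default(MemberState self) => self switch
    {
        MemberState.Up or MemberState.Joining => Health.Healthy,
        MemberState.Leaving or MemberState.Exiting => Health.Degraded,
        _ => Health.Unhealthy,
    };

    // "OtOpcUaCompat" preset: Healthy only when this node is found in the
    // member list with status Up; anything else is Degraded.
    public static Health OtOpcUaCompat(
        string selfAddress,
        IEnumerable<(string Address, MemberState Status)> members) =>
        members.Any(m => m.Address == selfAddress && m.Status == MemberState.Up)
            ? Health.Healthy
            : Health.Degraded;
}
```

The `Joining` and Removed/Down rows of the table are exactly where these two functions disagree, which is why one policy cannot stand in for the other without a behaviour change.
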
**Gap A2:** OtOpcUa's `AkkaClusterHealthCheck` injects `ActorSystem` directly (no null guard).
The shared implementation injects via `AkkaHostedService` for startup safety.
### §4 Database probe technique
| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe method | `db.Deployments.AsNoTracking().Take(1).ToListAsync()` (query) | `_dbContext.Database.CanConnectAsync()` (connection only) |
| Injection style | `IDbContextFactory<T>` (pooled, safe for concurrent probes) | `DbContext` directly (scoped, requires care in background use) |
| Schema verification | ✅ implies schema is applied | ⛔ connection only |
**Gap D1:** `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<TContext>` uses
`CanConnectAsync` as the default (ScadaBridge behavior). An optional `ProbeQuery` delegate covers
OtOpcUa's stricter approach. Both apps retain their existing probe semantics; neither is forced
to change unless desired.
**Gap D2:** ScadaBridge injects `DbContext` directly; the shared probe should use
`IDbContextFactory<TContext>` for safe reuse from a background-service health-check context.
ScadaBridge's DI registration will need updating on adoption.
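
A factory-based probe with an optional query delegate could look roughly like the sketch below. It is not the shipped `DatabaseHealthCheck<TContext>`: the class name, the delegate shape, and the result descriptions are assumptions, and it covers only the `IDbContextFactory<TContext>` path. It assumes EF Core 6 or later for `CreateDbContextAsync`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch; the real library type and its options may differ.
public sealed class DatabaseHealthCheckSketch<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly IDbContextFactory<TContext> _factory;
    private readonly Func<TContext, CancellationToken, Task>? _probeQuery;

    public DatabaseHealthCheckSketch(
        IDbContextFactory<TContext> factory,
        Func<TContext, CancellationToken, Task>? probeQuery = null)
    {
        _factory = factory;
        _probeQuery = probeQuery;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // A fresh context per probe avoids sharing a scoped DbContext
            // across concurrent health-check runs.
            await using var db = await _factory.CreateDbContextAsync(cancellationToken);

            if (_probeQuery is not null)
            {
                // Stricter mode: a real query also exercises the schema.
                await _probeQuery(db, cancellationToken);
                return HealthCheckResult.Healthy("Probe query succeeded.");
            }

            // Default mode: connection-only check.
            return await db.Database.CanConnectAsync(cancellationToken)
                ? HealthCheckResult.Healthy("Database reachable.")
                : HealthCheckResult.Unhealthy("Database unreachable.");
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            return HealthCheckResult.Unhealthy("Database probe failed.", ex);
        }
    }
}
```

With this shape, OtOpcUa's stricter probe would be passed in as something like `(db, ct) => db.Deployments.AsNoTracking().Take(1).ToListAsync(ct)`, while ScadaBridge would omit the delegate and keep connection-only semantics.
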
### §5 Active-node / leader check
| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe type | `AdminRoleLeaderHealthCheck` (role-filtered: `"admin"`) | `ActiveNodeHealthCheck` (role-less; Up + leader) |
| Non-role-bearing node | Healthy immediately | n/a (all central nodes have no role filter) |
| Leader status | Healthy | Healthy |
| Non-leader (standby) | Degraded | Unhealthy |
| `IActiveNodeGate` backing | Not present | `ActiveNodeGate` (separate type, duplicated logic) |
**Gap L1:** `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with an optional `RoleFilter`
parameter unifies both behaviors. OtOpcUa passes `RoleFilter = "admin"` (role-filtered);
ScadaBridge uses no role filter.
**Gap L2:** ScadaBridge's `ActiveNodeGate` duplicates `ActiveNodeHealthCheck` logic. The shared
`IActiveNodeGate` seam + a backing singleton eliminates the duplication.
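
The decision table above can be condensed into one function, sketched here under explicit assumptions. `NodeView`, `roleFiltered`, and `standbyStatus` are hypothetical names introduced for illustration; how the shared `ActiveNodeHealthCheck` actually selects the standby status per app is not stated in this document. `Health` is a local stand-in for `HealthStatus`.

```csharp
using System;

// Local stand-in (assumption) for HealthStatus.
public enum Health { Unhealthy, Degraded, Healthy }

// Hypothetical snapshot of what a probe would read from the cluster.
public sealed record NodeView(bool ClusterReady, bool SelfIsUp, bool SelfIsLeader, bool SelfHasRole);

public static class ActiveNodePolicySketch
{
    public static Health Evaluate(NodeView node, bool roleFiltered, Health standbyStatus)
    {
        // Startup window: cluster state is not available yet.
        if (!node.ClusterReady) return Health.Degraded;

        // Role-filtered mode: a node that does not carry the role is not a
        // candidate for leadership, so it is Healthy immediately.
        if (roleFiltered && !node.SelfHasRole) return Health.Healthy;

        // The active node: Up and currently the leader.
        if (node.SelfIsUp && node.SelfIsLeader) return Health.Healthy;

        // Standby: Degraded for OtOpcUa, Unhealthy for ScadaBridge (per the table).
        return standbyStatus;
    }
}
```

Called with `roleFiltered: true, standbyStatus: Health.Degraded` this reproduces the OtOpcUa column; with `roleFiltered: false, standbyStatus: Health.Unhealthy` it reproduces the ScadaBridge column.
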
### §6 Response writer
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Writer | Default (plain-text/JSON) | Bespoke `GatewayHealthReply` JSON | `UIResponseWriter.WriteHealthCheckUIResponse` |
**Gap W1:** The shared `ZB.MOM.WW.Health` package ships a canonical JSON response writer
(making the `HealthChecks.UI.Client` style the default). All three projects pick it up simply by
calling `MapZbHealth()`; no per-project writer wiring is needed.
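
As a rough illustration of what a JSON response writer involves, the sketch below serializes a `HealthReport` with `System.Text.Json`. The field names are illustrative assumptions; they are not claimed to match the canonical writer's schema or the exact `HealthChecks.UI.Client` output.

```csharp
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch; the shipped writer and its JSON shape may differ.
public static class HealthJsonWriterSketch
{
    // Pure serialization step, kept separate so it can be tested without HTTP.
    public static string Serialize(HealthReport report) =>
        JsonSerializer.Serialize(new
        {
            status = report.Status.ToString(),
            totalDurationMs = report.TotalDuration.TotalMilliseconds,
            entries = report.Entries.ToDictionary(
                e => e.Key,
                e => new
                {
                    status = e.Value.Status.ToString(),
                    description = e.Value.Description,
                    durationMs = e.Value.Duration.TotalMilliseconds,
                    tags = e.Value.Tags,
                }),
        });

    // Matches the Func<HttpContext, HealthReport, Task> signature expected
    // by HealthCheckOptions.ResponseWriter.
    public static Task WriteAsync(HttpContext context, HealthReport report)
    {
        context.Response.ContentType = "application/json; charset=utf-8";
        return context.Response.WriteAsync(Serialize(report));
    }
}
```

With plain ASP.NET Core wiring this would be attached as `new HealthCheckOptions { ResponseWriter = HealthJsonWriterSketch.WriteAsync }`.
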
### §7 Endpoint authentication
Both OtOpcUa and ScadaBridge expose health endpoints without authentication (`AllowAnonymous` or
open by default). MxAccessGateway's `/health/live` has no authentication requirement. The spec
canonizes this: health tiers are `AllowAnonymous`; `MapZbHealth()` applies `AllowAnonymous` by
default.
No gap — consistent across all three. `MapZbHealth()` should document and enforce this default.
## Adoption backlog (ordered)
| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxAccessGateway: remove dead `/health/live` + `AddHealthChecks()`, add `GrpcDependencyHealthCheck` (worker IPC) + `MapZbHealth()` | MxGateway | P1 | S | Low | Gap T1, T3, P1, P3 — no probes/tiers today; highest delta |
| 2 | OtOpcUa: replace 3 bespoke checks with shared probes (`AkkaClusterHealthCheck` OtOpcUaCompat + `ActiveNodeHealthCheck` role-filtered + `DatabaseHealthCheck<T>` ProbeQuery) | OtOpcUa | P2 | S | Low | Gap A1, D1, L1 |
| 3 | ScadaBridge: replace 3 bespoke checks with shared probes (Default policy + role-less Active + `CanConnectAsync`) + add `/healthz` + unify `ActiveNodeGate` with `IActiveNodeGate` seam | ScadaBridge | P2 | S | Low | Gap T2, A1, D2, L1, L2 |
| 4 | OtOpcUa + MxAccessGateway: add `GrpcDependencyHealthCheck` for downstream gRPC channel | OtOpcUa, MxGateway | P2 | S | Low | Gap P2 — closes the silent-gateway-down scenario |
| 5 | All: adopt canonical response writer (switch from per-project writers to `MapZbHealth` default) | all 3 | P3 | XS | Low | Gap W1 — mechanical; bundled with #1–3 |
| 6 | DB injection style: switch ScadaBridge from injected `DbContext` to `IDbContextFactory<T>` | ScadaBridge | P3 | XS | Low | Gap D2 — background-service safety |
**Note: adoption items #1–6 are all follow-on tasks.** They are tracked here as the backlog for
after `ZB.MOM.WW.Health` @ 0.1.0 is published. The library build itself (nupkgs, tests) is a
separate task. This is consistent with how `ZB.MOM.WW.Auth` and `ZB.MOM.WW.Theme` are structured:
the library is built first; adoption by the three apps is the next step.
## Adoption status — 2026-06-01 (DONE)
`ZB.MOM.WW.Health` 0.1.0 was published to the `dohertj2-gitea` NuGet feed and adopted across all
three apps on branch `feat/adopt-zb-health` in each repo (one branch per repo; commits below).
Plan + design: [`../../docs/plans/2026-06-01-health-library-adoption.md`](../../docs/plans/2026-06-01-health-library-adoption.md).
| Repo | What shipped | Build / tests |
|---|---|---|
| **MxAccessGateway** | Removed the pipeline-bypassing `/health/live` lambda + dead `AddHealthChecks()`; added a custom `AuthStoreHealthCheck` (readiness probe over the SQLite auth store) tagged `Ready`; `MapZbHealth()` → ready/active/healthz + canonical writer. | Server builds; 568 pass / 3 **pre-existing** macOS failures unrelated to health (OrphanWorkerTerminator ×2, fake-worker timeout ×1). |
| **OtOpcUa** | Swapped all 3 bespoke checks → shared probes (`DatabaseHealthCheck<OtOpcUaConfigDbContext>` + `ProbeQuery`, `AkkaClusterHealthCheck` **OtOpcUaCompat**, `ActiveNodeHealthCheck(role:"admin")`); `MapZbHealth()`. | Host builds clean; health tests pass **including a real two-node-cluster integration test**. Independently code-reviewed: APPROVED, behaviour-preserving. |
| **ScadaBridge** | Swapped 3 bespoke checks → shared probes (`DatabaseHealthCheck<ScadaBridgeDbContext>` scoped fallback, `AkkaClusterHealthCheck` **Default**, role-less `ActiveNodeHealthCheck`); added transient `ActorSystem` DI bridge (Central + Site roots); added `/healthz`; canonical writer. Kept its own `ActiveNodeGate`. | Builds clean (TreatWarningsAsErrors); 212 Host tests pass. |
### Deferred (verified ill-fitting on adoption — re-scoped from the original backlog)
- **#4 downstream gRPC dependency probes — DROPPED for now.** Neither repo holds a host-level
`GrpcChannel` to probe: the MxGateway↔worker IPC is **named pipes** (not gRPC), and OtOpcUa's
gateway channel is created **per-driver** from DB config (no DI-level channel). MxGateway readiness
instead probes the SQLite auth store (`AuthStoreHealthCheck`). A real worker/downstream probe needs
a custom non-gRPC check — future work.
- **#3 `IActiveNodeGate` seam unification (ScadaBridge) — DEFERRED.** ScadaBridge's `IActiveNodeGate`
is `…ScadaBridge.InboundAPI.IActiveNodeGate`, wired into inbound-API endpoint gating — a different
interface from the shared `ZB.MOM.WW.Health.IActiveNodeGate`. Unification touches the InboundAPI
path; its existing `ActiveNodeGate` (logic identical to the shared `AkkaActiveNodeGate`) was kept.
- **#6 `IDbContextFactory<T>` switch (ScadaBridge) — DROPPED as unnecessary.** The shared
`DatabaseHealthCheck<T>` self-scopes (creates its own DI scope per probe) when no factory is
registered — that *is* the background-safety fix — and ScadaBridge's context is built with an
injected `IDataProtectionProvider`, which `AddDbContextFactory` does not accommodate cleanly.
### Accepted behaviour change (one) — flag for ops
ScadaBridge `/health/active` during the **startup window** (ActorSystem/cluster not yet ready) now
returns **Degraded → HTTP 200** instead of the prior **Unhealthy → HTTP 503**. This is the shared
`ActiveNodeHealthCheck`'s documented startup-safe behaviour and the normalized convergence target.
The steady-state standby case (node Up but not leader) is **unchanged** (Unhealthy → 503). If a
load-balancer (Traefik) keys strictly on a 503 from `/health/active` to fence standby nodes during
startup, that fail-safe is briefly relaxed until the cluster forms.
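
The Degraded → 200 outcome follows from the stock ASP.NET Core status-code mapping, spelled out in the sketch below. Whether `MapZbHealth()` exposes this mapping for override is not stated here, so the options object is shown against plain `HealthCheckOptions`; the class name is hypothetical.

```csharp
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper; it makes the framework's default mapping explicit.
public static class ActiveTierOptionsSketch
{
    public static HealthCheckOptions Create() => new HealthCheckOptions
    {
        Predicate = check => check.Tags.Contains("active"),
        ResultStatusCodes =
        {
            [HealthStatus.Healthy] = StatusCodes.Status200OK,
            // Startup window: Degraded still answers 200.
            [HealthStatus.Degraded] = StatusCodes.Status200OK,
            // Steady-state standby: Unhealthy answers 503.
            [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable,
        },
    };
}
```

If ops decide the startup relaxation is not acceptable, the `Degraded` entry is the usual ASP.NET Core lever for returning 503 instead. That would be a deliberate deviation from the shared default.
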