# Health — gaps & adoption backlog

Divergence of each project from [`spec/SPEC.md`](spec/SPEC.md), and the ordered backlog to
reach the shared `ZB.MOM.WW.Health` library. Status legend: ⛔ gap · 🟡 partial · ✅ matches.

## Divergence vs spec
### §1 Endpoint tiers

| Spec tier | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `/health/ready` (tag `ready`) | ✅ present | ⛔ absent | ✅ present (name-predicate) |
| `/health/active` (tag `active`) | ✅ present | ⛔ absent | ✅ present (name-predicate) |
| `/healthz` (bare process liveness) | ✅ present | ⛔ absent | ⛔ absent |
| `/health/live` (non-standard) | n/a | ⛔ present (hardcoded `"Healthy"`, bypasses health-check pipeline) | n/a |

→ **Gap T1 (P1):** MxAccessGateway has no standard health tiers. The existing `/health/live`
`MapGet` lambda must be replaced by `app.MapZbHealth()` + real probes.

→ **Gap T2:** ScadaBridge lacks `/healthz`. `MapZbHealth()` adds it automatically.

→ **Gap T3:** MxAccessGateway's `/health/live` uses a raw `MapGet` that bypasses the ASP.NET Core
health-check middleware, so it does not participate in `IHealthCheckPublisher`, `HealthReport`, or
UI integration. It must be removed.

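The tier layout in the table can be sketched with stock ASP.NET Core APIs. This is a minimal illustration of tag-routed endpoints, not the implementation of `MapZbHealth()`; the probe names and placeholder lambdas are hypothetical.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Probes are registered once and routed to a tier by tag.
// The names and always-healthy lambdas are placeholders (assumptions), not real probes.
builder.Services.AddHealthChecks()
    .AddCheck("database", () => HealthCheckResult.Healthy("placeholder"), tags: new[] { "ready" })
    .AddCheck("active-node", () => HealthCheckResult.Healthy("placeholder"), tags: new[] { "active" });

var app = builder.Build();

// Readiness tier: only checks tagged "ready" run.
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = registration => registration.Tags.Contains("ready"),
}).AllowAnonymous();

// Active tier: only checks tagged "active" run.
app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = registration => registration.Tags.Contains("active"),
}).AllowAnonymous();

// Bare process liveness: no probes run, but the request still goes through the
// health-check middleware (unlike a raw MapGet that returns a hardcoded string).
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    Predicate = _ => false,
}).AllowAnonymous();

app.Run();
```

Because every endpoint goes through `MapHealthChecks`, each response is produced from a `HealthReport`, which is what the hardcoded `/health/live` lambda skips.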
### §2 Probe coverage

| Probe | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Database connectivity | ✅ `DatabaseHealthCheck` (query probe) | ⛔ none | ✅ `DatabaseHealthCheck` (`CanConnectAsync`) |
| Akka cluster membership | ✅ `AkkaClusterHealthCheck` (2-way) | n/a (no Akka) | ✅ `AkkaClusterHealthCheck` (3-way) |
| Active / leader node | ✅ `AdminRoleLeaderHealthCheck` (role-filtered) | n/a | ✅ `ActiveNodeHealthCheck` (role-less) |
| Downstream gRPC dependency | ⛔ none | ⛔ none | ⛔ none |

→ **Gap P1 (P1):** MxAccessGateway has zero probes; `AddHealthChecks()` at
`GatewayApplication.cs:61` is dead code. Minimum viable: a `GrpcDependencyHealthCheck`
targeting the x86 worker IPC channel.

→ **Gap P2:** No project probes its downstream gRPC dependency. OtOpcUa should probe the
MxAccessGateway channel; MxAccessGateway should probe the worker IPC.

→ **Gap P3:** The dead `AddHealthChecks()` in MxAccessGateway (`GatewayApplication.cs:61`) should be
removed or replaced; it currently implies health checks are configured when they are not.

### §3 Akka status-policy divergence

| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe implementation | Scans `State.Members` for self by address | Reads `SelfMember.Status` directly |
| Joining status | Degraded (not in Members as Up) | Healthy |
| Leaving/Exiting status | Degraded | Degraded |
| Other (Removed, Down…) | Degraded | Unhealthy |
| ActorSystem null guard | none (`ActorSystem` injected directly) | ✅ Degraded if null |

The two implementations classify `Joining` differently: ScadaBridge reports Healthy, while
OtOpcUa reports Degraded because a self member in `Joining` status does not appear as `Up` in
the member scan. They also differ on `Removed`/`Down`: ScadaBridge reports Unhealthy, OtOpcUa
reports Degraded.

The shared `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` ships two presets to preserve both
behaviors rather than forcing one onto the other:

- **Default**: ScadaBridge's three-way policy (`Up`/`Joining` = Healthy, `Leaving`/`Exiting` = Degraded,
  anything else = Unhealthy)
- **OtOpcUaCompat**: OtOpcUa's self-Up-among-members scan (found `Up` = Healthy, not found = Degraded)

→ **Gap A1:** OtOpcUa adopts the `OtOpcUaCompat` preset; ScadaBridge adopts the `Default` preset.
Both preserve existing behavior without forcing convergence on a single policy.

→ **Gap A2:** OtOpcUa's `AkkaClusterHealthCheck` injects `ActorSystem` directly (no null guard).
The shared implementation injects via `AkkaHostedService` for startup safety.

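The two presets can be written as pure mapping functions. This is a sketch under stated assumptions: the enums below are local, hypothetical stand-ins for a subset of Akka.NET's `MemberStatus` and for ASP.NET Core's `HealthStatus`, and `AkkaStatusPolicy` is an illustrative name, not a type from the shared library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Local stand-ins (assumptions) so the sketch compiles without Akka.NET or ASP.NET Core.
public enum MemberStatus { Joining, Up, Leaving, Exiting, Down, Removed }
public enum HealthStatus { Unhealthy, Degraded, Healthy }

public static class AkkaStatusPolicy
{
    // "Default" preset: three-way mapping on the node's own member status.
    public static HealthStatus Default(MemberStatus self) => self switch
    {
        MemberStatus.Up or MemberStatus.Joining => HealthStatus.Healthy,
        MemberStatus.Leaving or MemberStatus.Exiting => HealthStatus.Degraded,
        _ => HealthStatus.Unhealthy,
    };

    // "OtOpcUaCompat" preset: Healthy only if self appears among the members as Up;
    // every other outcome (Joining, Leaving, Down, not listed at all) is Degraded.
    public static HealthStatus OtOpcUaCompat(
        string selfAddress,
        IEnumerable<(string Address, MemberStatus Status)> members) =>
        members.Any(m => m.Address == selfAddress && m.Status == MemberStatus.Up)
            ? HealthStatus.Healthy
            : HealthStatus.Degraded;
}
```

A node that is still `Joining` shows the divergence directly: `Default` yields Healthy, while `OtOpcUaCompat` yields Degraded because the scan finds no `Up` entry for the node's own address.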
### §4 Database probe technique

| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe method | `db.Deployments.AsNoTracking().Take(1).ToListAsync()` (query) | `_dbContext.Database.CanConnectAsync()` (connection only) |
| Injection style | `IDbContextFactory<T>` (pooled, safe for concurrent probes) | `DbContext` directly (scoped, requires care in background use) |
| Schema verification | ✅ implies schema is applied | ⛔ connection only |

→ **Gap D1:** `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<TContext>` uses
`CanConnectAsync` as the default (ScadaBridge behavior). An optional `ProbeQuery` delegate covers
OtOpcUa's stricter approach. Both apps retain their existing probe semantics; neither is forced
to change unless desired.

→ **Gap D2:** ScadaBridge injects `DbContext` directly; the shared probe should use
`IDbContextFactory<TContext>` for safe reuse from a background-service health-check context.
ScadaBridge's DI registration will need updating on adoption.

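A factory-based probe with an optional query delegate, as Gaps D1 and D2 describe it, might look like the sketch below. The class name, constructor shape, and delegate signature are assumptions for illustration; the actual `DatabaseHealthCheck<TContext>` API may differ.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch, not the shipped ZB.MOM.WW.Health type.
public sealed class SketchDatabaseHealthCheck<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly IDbContextFactory<TContext> _factory;
    private readonly Func<TContext, CancellationToken, Task>? _probeQuery;

    public SketchDatabaseHealthCheck(
        IDbContextFactory<TContext> factory,
        Func<TContext, CancellationToken, Task>? probeQuery = null)
    {
        _factory = factory;
        _probeQuery = probeQuery;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // A fresh context per probe avoids sharing a scoped DbContext across probes.
            await using var db = await _factory.CreateDbContextAsync(cancellationToken);

            if (_probeQuery is not null)
            {
                // Stricter mode: run a real query, which also exercises the schema.
                await _probeQuery(db, cancellationToken);
                return HealthCheckResult.Healthy("probe query succeeded");
            }

            // Default mode: connection check only.
            return await db.Database.CanConnectAsync(cancellationToken)
                ? HealthCheckResult.Healthy("connection ok")
                : new HealthCheckResult(context.Registration.FailureStatus, "cannot connect");
        }
        catch (Exception ex)
        {
            return new HealthCheckResult(
                context.Registration.FailureStatus, "database probe failed", ex);
        }
    }
}
```

Under this sketch, OtOpcUa's existing behavior would correspond to passing a delegate such as `(db, ct) => db.Deployments.AsNoTracking().Take(1).ToListAsync(ct)`, while ScadaBridge would pass no delegate and keep the `CanConnectAsync` path.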
### §5 Active-node / leader check

| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe type | `AdminRoleLeaderHealthCheck` (role-filtered: `"admin"`) | `ActiveNodeHealthCheck` (role-less; Up + leader) |
| Non-role-bearing node | Healthy immediately | n/a (all central nodes have no role filter) |
| Leader status | Healthy | Healthy |
| Non-leader (standby) | Degraded | Unhealthy |
| `IActiveNodeGate` backing | Not present | `ActiveNodeGate` (separate type, duplicated logic) |

→ **Gap L1:** `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with an optional `RoleFilter`
parameter unifies both behaviors. OtOpcUa passes `RoleFilter = "admin"` (role-filtered);
ScadaBridge uses no role filter.

→ **Gap L2:** ScadaBridge's `ActiveNodeGate` duplicates `ActiveNodeHealthCheck` logic. The shared
`IActiveNodeGate` seam + a backing singleton eliminates the duplication.

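The role-filtered and role-less behaviors in the table can be expressed as one decision function. This is a sketch, not the shared check: `NodeView`, `ActiveNodePolicy`, and the `standbyStatus` parameter are hypothetical names, and how the shared library actually selects the standby status is not stated in this document.

```csharp
using System;
using System.Collections.Generic;

// Local stand-in (assumption) for ASP.NET Core's HealthStatus.
public enum HealthStatus { Unhealthy, Degraded, Healthy }

// Hypothetical snapshot of the cluster facts the check needs.
public sealed record NodeView(IReadOnlySet<string> SelfRoles, bool SelfIsUp, bool SelfIsLeader);

public static class ActiveNodePolicy
{
    public static HealthStatus Evaluate(
        NodeView node,
        string? roleFilter = null,
        HealthStatus standbyStatus = HealthStatus.Unhealthy) // assumed knob, see lead-in
    {
        // With a role filter, a node that does not carry the role is not a
        // leadership candidate and reports Healthy immediately.
        if (roleFilter is not null && !node.SelfRoles.Contains(roleFilter))
            return HealthStatus.Healthy;

        // Otherwise the node is active only when it is Up and the leader.
        return node.SelfIsUp && node.SelfIsLeader ? HealthStatus.Healthy : standbyStatus;
    }
}
```

In this model OtOpcUa's row corresponds to `Evaluate(node, "admin", HealthStatus.Degraded)` and ScadaBridge's row to `Evaluate(node)`.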
### §6 Response writer

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Writer | Default (plain-text/JSON) | Bespoke `GatewayHealthReply` JSON | `UIResponseWriter.WriteHealthCheckUIResponse` |

→ **Gap W1:** the shared `ZB.MOM.WW.Health` package ships a canonical JSON response writer
(lifting the `HealthChecks.UI.Client` style to the default). All three projects pick it up by
calling `MapZbHealth()`; no per-project writer wiring is needed.

### §7 Endpoint authentication

Both OtOpcUa and ScadaBridge expose health endpoints without authentication (`AllowAnonymous` or
open by default). MxAccessGateway's `/health/live` has no authentication requirement either. The
spec canonizes this: health tiers are `AllowAnonymous`, and `MapZbHealth()` applies `AllowAnonymous`
by default.

No gap: behavior is consistent across all three. `MapZbHealth()` should document and enforce this default.

## Adoption backlog (ordered)

| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxAccessGateway: remove the pipeline-bypassing `/health/live` lambda + dead `AddHealthChecks()`, add `GrpcDependencyHealthCheck` (worker IPC) + `MapZbHealth()` | MxGateway | P1 | S | Low | Gaps T1, T3, P1, P3: no probes/tiers today; highest delta |
| 2 | OtOpcUa: replace 3 bespoke checks with shared probes (`AkkaClusterHealthCheck` OtOpcUaCompat + `ActiveNodeHealthCheck` role-filtered + `DatabaseHealthCheck<T>` ProbeQuery) | OtOpcUa | P2 | S | Low | Gaps A1, D1, L1 |
| 3 | ScadaBridge: replace 3 bespoke checks with shared probes (Default policy + role-less Active + `CanConnectAsync`) + add `/healthz` + unify `ActiveNodeGate` with `IActiveNodeGate` seam | ScadaBridge | P2 | S | Low | Gaps T2, A1, D2, L1, L2 |
| 4 | OtOpcUa + MxAccessGateway: add `GrpcDependencyHealthCheck` for downstream gRPC channel | OtOpcUa, MxGateway | P2 | S | Low | Gap P2: closes the silent-gateway-down scenario |
| 5 | All: adopt canonical response writer (switch from per-project writers to `MapZbHealth` default) | all 3 | P3 | XS | Low | Gap W1: mechanical; bundled with #1–3 |
| 6 | DB injection style: switch ScadaBridge from injected `DbContext` to `IDbContextFactory<T>` | ScadaBridge | P3 | XS | Low | Gap D2: background-service safety |

**Note: adoption items #1–6 are all follow-on tasks.** They are tracked here as the backlog to
work through once `ZB.MOM.WW.Health` 0.1.0 is published. The library build itself (nupkgs, tests)
is a separate task. This is consistent with how `ZB.MOM.WW.Auth` and `ZB.MOM.WW.Theme` are
structured: the library is built first; adoption by the three apps is the next step.

## Adoption status — 2026-06-01 (DONE)

`ZB.MOM.WW.Health` 0.1.0 was published to the `dohertj2-gitea` NuGet feed and adopted across all
three apps on branch `feat/adopt-zb-health` in each repo (one branch per repo; commits below).
Plan + design: [`../../docs/plans/2026-06-01-health-library-adoption.md`](../../docs/plans/2026-06-01-health-library-adoption.md).

| Repo | What shipped | Build / tests |
|---|---|---|
| **MxAccessGateway** | Removed the pipeline-bypassing `/health/live` lambda + dead `AddHealthChecks()`; added a custom `AuthStoreHealthCheck` (readiness probe over the SQLite auth store) tagged `Ready`; `MapZbHealth()` → ready/active/healthz + canonical writer. | Server builds; 568 pass / 3 **pre-existing** macOS failures unrelated to health (OrphanWorkerTerminator ×2, fake-worker timeout ×1). |
| **OtOpcUa** | Swapped all 3 bespoke checks → shared probes (`DatabaseHealthCheck<OtOpcUaConfigDbContext>` + `ProbeQuery`, `AkkaClusterHealthCheck` **OtOpcUaCompat**, `ActiveNodeHealthCheck(role:"admin")`); `MapZbHealth()`. | Host builds clean; health tests pass **including a real two-node-cluster integration test**. Independently code-reviewed: APPROVED, behaviour-preserving. |
| **ScadaBridge** | Swapped 3 bespoke checks → shared probes (`DatabaseHealthCheck<ScadaBridgeDbContext>` scoped fallback, `AkkaClusterHealthCheck` **Default**, role-less `ActiveNodeHealthCheck`); added transient `ActorSystem` DI bridge (Central + Site roots); added `/healthz`; canonical writer. Kept its own `ActiveNodeGate`. | Builds clean (TreatWarningsAsErrors); 212 Host tests pass. |

### Deferred (verified ill-fitting on adoption — re-scoped from the original backlog)

- **#4 downstream gRPC dependency probes: DROPPED for now.** Neither repo holds a host-level
  `GrpcChannel` to probe: the MxGateway↔worker IPC is **named pipes** (not gRPC), and OtOpcUa's
  gateway channel is created **per-driver** from DB config (no DI-level channel). MxGateway readiness
  instead probes the SQLite auth store (`AuthStoreHealthCheck`). A real worker/downstream probe needs
  a custom non-gRPC check, which is future work.
- **#3 `IActiveNodeGate` seam unification (ScadaBridge): DEFERRED.** ScadaBridge's `IActiveNodeGate`
  is `…ScadaBridge.InboundAPI.IActiveNodeGate`, wired into inbound-API endpoint gating, and is a
  different interface from the shared `ZB.MOM.WW.Health.IActiveNodeGate`. Unification touches the
  InboundAPI path, so its existing `ActiveNodeGate` (logic identical to the shared
  `AkkaActiveNodeGate`) was kept.
- **#6 `IDbContextFactory<T>` switch (ScadaBridge): DROPPED as unnecessary.** The shared
  `DatabaseHealthCheck<T>` self-scopes (creates its own DI scope per probe) when no factory is
  registered, and that *is* the background-safety fix. In addition, ScadaBridge's context is built
  with an injected `IDataProtectionProvider`, which `AddDbContextFactory` does not accommodate cleanly.

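The self-scoping fallback described for #6 can be sketched as follows. This illustrates the technique (prefer a registered factory, otherwise open a DI scope per probe); the class name is hypothetical and the shared library's internals may differ.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch, not the shipped ZB.MOM.WW.Health type.
public sealed class SelfScopingDatabaseHealthCheck<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly IServiceProvider _services;

    public SelfScopingDatabaseHealthCheck(IServiceProvider services) => _services = services;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // GetService returns null when no factory is registered.
        var factory = _services.GetService<IDbContextFactory<TContext>>();
        if (factory is not null)
        {
            await using var fromFactory = await factory.CreateDbContextAsync(cancellationToken);
            return await ProbeAsync(fromFactory, context, cancellationToken);
        }

        // Fallback: a fresh DI scope per probe, so the scoped DbContext is created
        // and disposed inside this call and never shared with request handling.
        await using var scope = _services.CreateAsyncScope();
        var db = scope.ServiceProvider.GetRequiredService<TContext>();
        return await ProbeAsync(db, context, cancellationToken);
    }

    private static async Task<HealthCheckResult> ProbeAsync(
        TContext db, HealthCheckContext context, CancellationToken cancellationToken) =>
        await db.Database.CanConnectAsync(cancellationToken)
            ? HealthCheckResult.Healthy("connection ok")
            : new HealthCheckResult(context.Registration.FailureStatus, "cannot connect");
}
```

With this shape ScadaBridge can keep its existing scoped `AddDbContext` registration, which is why the factory switch was not needed.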
### Accepted behaviour change (one) — flag for ops

ScadaBridge `/health/active` during the **startup window** (ActorSystem/cluster not yet ready) now
returns **Degraded → HTTP 200** instead of the prior **Unhealthy → HTTP 503**. This is the shared
`ActiveNodeHealthCheck`'s documented startup-safe behaviour and the normalized convergence target.
The steady-state standby case (node Up but not leader) is **unchanged** (Unhealthy → 503). If a
load-balancer (Traefik) keys strictly on a 503 from `/health/active` to fence standby nodes during
startup, that fail-safe is briefly relaxed until the cluster forms.
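
The status-to-HTTP mapping behind this change can be modelled in a few lines. This is a sketch under stated assumptions: `ActiveEndpoint` and the `clusterReady` flag are hypothetical, and the status codes mirror ASP.NET Core's default `ResultStatusCodes` (Healthy and Degraded map to 200, Unhealthy to 503).

```csharp
using System;

// Local stand-in (assumption) for ASP.NET Core's HealthStatus.
public enum HealthStatus { Unhealthy, Degraded, Healthy }

public static class ActiveEndpoint
{
    // Default ASP.NET Core mapping: only Unhealthy produces a 503.
    public static int ToHttpStatus(HealthStatus status) =>
        status == HealthStatus.Unhealthy ? 503 : 200;

    // clusterReady == false models the startup window before the cluster forms.
    public static HealthStatus Evaluate(bool clusterReady, bool isUp, bool isLeader)
    {
        if (!clusterReady)
            return HealthStatus.Degraded; // startup-safe: reported as 200, not 503

        return isUp && isLeader ? HealthStatus.Healthy : HealthStatus.Unhealthy;
    }
}
```

A load balancer that fences on 503 therefore sees 200 from a starting node until the cluster forms, and 503 from a steady-state standby, which matches the two cases described above.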