# Health: gaps & adoption backlog

Divergence of each project from [`spec/SPEC.md`](spec/SPEC.md), and the ordered backlog to reach the shared `ZB.MOM.WW.Health` library.

Status legend: ⛔ gap · 🟡 partial · ✅ matches.

## Divergence vs spec

### §1 Endpoint tiers

| Spec tier | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `/health/ready` (tag `ready`) | ✅ present | ⛔ absent | ✅ present (name-predicate) |
| `/health/active` (tag `active`) | ✅ present | ⛔ absent | ✅ present (name-predicate) |
| `/healthz` (bare process liveness) | ✅ present | ⛔ absent | ⛔ absent |
| `/health/live` (non-standard) | - | ⛔ present (hardcoded `"Healthy"`, bypasses health-check pipeline) | - |

→ **Gap T1 (P1):** MxAccessGateway has no standard health tiers. The existing `/health/live` `MapGet` lambda must be replaced by `app.MapZbHealth()` + real probes.

→ **Gap T2:** ScadaBridge lacks `/healthz`. `MapZbHealth()` adds it automatically.

→ **Gap T3:** MxAccessGateway's `/health/live` uses a raw `MapGet` that bypasses the ASP.NET Core health-check middleware, so it does not participate in `IHealthCheckPublisher`, `HealthReport`, or UI integration. Must be removed.

### §2 Probe coverage

| Probe | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Database connectivity | ✅ `DatabaseHealthCheck` (query probe) | ⛔ none | ✅ `DatabaseHealthCheck` (`CanConnectAsync`) |
| Akka cluster membership | ✅ `AkkaClusterHealthCheck` (2-way) | n/a (no Akka) | ✅ `AkkaClusterHealthCheck` (3-way) |
| Active / leader node | ✅ `AdminRoleLeaderHealthCheck` (role-filtered) | n/a | ✅ `ActiveNodeHealthCheck` (role-less) |
| Downstream gRPC dependency | ⛔ none | ⛔ none | ⛔ none |

→ **Gap P1 (P1):** MxAccessGateway has zero probes; `AddHealthChecks()` at `GatewayApplication.cs:61` is dead code. Minimum viable: a `GrpcDependencyHealthCheck` targeting the x86 worker IPC channel.

→ **Gap P2:** No project probes its downstream gRPC dependency. OtOpcUa should probe the MxAccessGateway channel; MxAccessGateway should probe the worker IPC.

→ **Gap P3:** Dead `AddHealthChecks()` in MxAccessGateway (`GatewayApplication.cs:61`) should be removed or replaced; it currently implies health checks are configured when they are not.

### §3 Akka status-policy divergence

| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe implementation | Scans `State.Members` for self by address | Reads `SelfMember.Status` directly |
| Joining status | Degraded (not in Members as Up) | Healthy |
| Leaving/Exiting status | Degraded | Degraded |
| Other (Removed, Down…) | Degraded | Unhealthy |
| ActorSystem null guard | None (`ActorSystem` injected directly) | ✅ Degraded if null |

The two implementations diverge in how they classify `Joining` (ScadaBridge calls it Healthy; OtOpcUa would see it as Degraded because `SelfMember` with status `Joining` would not appear as `Up` in the member scan). They also diverge in the Removed/Down classification (ScadaBridge Unhealthy, OtOpcUa Degraded). The shared `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` ships two presets to preserve both behaviors rather than forcing one onto the other:

- **Default**: ScadaBridge's three-way policy (`Up`/`Joining`=Healthy, `Leaving`/`Exiting`=Degraded, else Unhealthy)
- **OtOpcUaCompat**: OtOpcUa's self-Up-among-members scan (found Up=Healthy, not found=Degraded)

→ **Gap A1:** OtOpcUa adopts the `OtOpcUaCompat` preset; ScadaBridge adopts the `Default` preset. Both preserve existing behavior without forcing convergence on a single policy.

→ **Gap A2:** OtOpcUa's `AkkaClusterHealthCheck` injects `ActorSystem` directly (no null guard). The shared implementation injects via `AkkaHostedService` for startup safety.

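
The two Akka status policies described in §3 can be read as two small classification functions. The sketch below is a minimal illustration under stated assumptions, not the library's code: `MemberStatus`, `HealthStatus`, and `Member` are local stand-ins for the Akka.NET and ASP.NET Core types so the example compiles without package references, and the `AkkaStatusPolicy` class, its method names, and the node addresses are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Demo: a node that is still Joining is classified differently by each preset.
var members = new List<Member>
{
    new Member("akka.tcp://sys@node-a:4053", MemberStatus.Up),
    new Member("akka.tcp://sys@node-b:4053", MemberStatus.Joining),
};
const string self = "akka.tcp://sys@node-b:4053";

Console.WriteLine(AkkaStatusPolicy.Default(MemberStatus.Joining));  // Healthy
Console.WriteLine(AkkaStatusPolicy.OtOpcUaCompat(members, self));   // Degraded
Console.WriteLine(AkkaStatusPolicy.Default(MemberStatus.Down));     // Unhealthy

// Local stand-ins (assumptions) for the real Akka.NET / ASP.NET Core types.
public enum MemberStatus { Joining, Up, Leaving, Exiting, Down, Removed }
public enum HealthStatus { Unhealthy, Degraded, Healthy }
public sealed record Member(string Address, MemberStatus Status);

// Hypothetical helper that mirrors the two presets as described in the text.
public static class AkkaStatusPolicy
{
    // "Default" preset: three-way policy on the node's own member status.
    public static HealthStatus Default(MemberStatus self) => self switch
    {
        MemberStatus.Up or MemberStatus.Joining => HealthStatus.Healthy,
        MemberStatus.Leaving or MemberStatus.Exiting => HealthStatus.Degraded,
        _ => HealthStatus.Unhealthy,
    };

    // "OtOpcUaCompat" preset: Healthy only if this node is found in the
    // member list with status Up; every other outcome is Degraded.
    public static HealthStatus OtOpcUaCompat(IEnumerable<Member> members, string selfAddress) =>
        members.Any(m => m.Address == selfAddress && m.Status == MemberStatus.Up)
            ? HealthStatus.Healthy
            : HealthStatus.Degraded;
}
```

The `Joining` and `Down` rows are where the two presets disagree, which is why the shared check keeps both rather than picking one.
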
### §4 Database probe technique

| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe method | `db.Deployments.AsNoTracking().Take(1).ToListAsync()` (query) | `_dbContext.Database.CanConnectAsync()` (connection only) |
| Injection style | `IDbContextFactory` (pooled, safe for concurrent probes) | `DbContext` directly (scoped, requires care in background use) |
| Schema verification | ✅ implies schema is applied | ⛔ connection only |

→ **Gap D1:** `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck` uses `CanConnectAsync` as the default (ScadaBridge behavior). An optional `ProbeQuery` delegate covers OtOpcUa's stricter approach. Both apps retain their existing probe semantics; neither is forced to change unless desired.

→ **Gap D2:** ScadaBridge injects `DbContext` directly; the shared probe should use `IDbContextFactory` for safe reuse from a background-service health-check context. ScadaBridge's DI registration will need updating on adoption.

### §5 Active-node / leader check

| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe type | `AdminRoleLeaderHealthCheck` (role-filtered: `"admin"`) | `ActiveNodeHealthCheck` (role-less; Up + leader) |
| Non-role-bearing node | Healthy immediately | n/a (all central nodes have no role filter) |
| Leader status | Healthy | Healthy |
| Non-leader (standby) | Degraded | Unhealthy |
| `IActiveNodeGate` backing | Not present | `ActiveNodeGate` (separate type, duplicated logic) |

→ **Gap L1:** `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with an optional `RoleFilter` parameter unifies both behaviors. OtOpcUa passes `RoleFilter = "admin"` (role-filtered); ScadaBridge uses no role filter.

→ **Gap L2:** ScadaBridge's `ActiveNodeGate` duplicates `ActiveNodeHealthCheck` logic. The shared `IActiveNodeGate` seam + a backing singleton eliminates the duplication.

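
The role-filtered and role-less behaviours in the §5 table can be expressed as one function with an optional role filter. The following is a sketch under assumptions, not the shared library's implementation: `NodeView`, `ActiveNodePolicy`, and the local `HealthStatus` enum are hypothetical stand-ins, and the `standbyStatus` parameter is an illustration device only, since the source does not say how the shared check selects between Degraded (OtOpcUa) and Unhealthy (ScadaBridge) for a standby node.

```csharp
#nullable enable
using System;
using System.Linq;

// OtOpcUa-style: role-filtered on "admin", standby reports Degraded.
var worker = new NodeView(new[] { "worker" }, IsActiveLeader: false);
var adminStandby = new NodeView(new[] { "admin" }, IsActiveLeader: false);
Console.WriteLine(ActiveNodePolicy.Evaluate(worker, "admin", HealthStatus.Degraded));       // Healthy
Console.WriteLine(ActiveNodePolicy.Evaluate(adminStandby, "admin", HealthStatus.Degraded)); // Degraded

// ScadaBridge-style: no role filter, standby reports Unhealthy.
var standby = new NodeView(Array.Empty<string>(), IsActiveLeader: false);
Console.WriteLine(ActiveNodePolicy.Evaluate(standby, null, HealthStatus.Unhealthy));        // Unhealthy

// Local stand-in (assumption) for ASP.NET Core's HealthStatus.
public enum HealthStatus { Unhealthy, Degraded, Healthy }

// Hypothetical snapshot of the local node. IsActiveLeader means "Up and leader";
// with a role filter it refers to leadership among nodes carrying that role.
public sealed record NodeView(string[] Roles, bool IsActiveLeader);

public static class ActiveNodePolicy
{
    public static HealthStatus Evaluate(NodeView node, string? roleFilter, HealthStatus standbyStatus)
    {
        // Role-filtered mode: a node that does not carry the role reports
        // Healthy immediately, matching the "Non-role-bearing node" row.
        if (roleFilter is not null && !node.Roles.Contains(roleFilter))
            return HealthStatus.Healthy;

        // Otherwise the leader is Healthy and a standby gets the caller's status.
        return node.IsActiveLeader ? HealthStatus.Healthy : standbyStatus;
    }
}
```

Passing `null` for the role filter gives the role-less ScadaBridge behaviour; passing `"admin"` gives the OtOpcUa behaviour, where nodes without the role never fail the check.
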
### §6 Response writer

| Aspect | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Writer | Default (plain-text/JSON) | Bespoke `GatewayHealthReply` JSON | `UIResponseWriter.WriteHealthCheckUIResponse` |

→ **Gap W1:** The shared `ZB.MOM.WW.Health` package ships a canonical JSON response writer (lifting `HealthChecks.UI.Client` style to the default). All three projects adopt it on `MapZbHealth()` call; no per-project writer wiring needed.

### §7 Endpoint authentication

Both OtOpcUa and ScadaBridge expose health endpoints without authentication (`AllowAnonymous` or open by default). MxAccessGateway's `/health/live` has no authentication requirement. The spec canonizes this: health tiers are `AllowAnonymous`; `MapZbHealth()` applies `AllowAnonymous` by default.

No gap: consistent across all three. `MapZbHealth()` should document and enforce this default.

## Adoption backlog (ordered)

| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxAccessGateway: remove dead `/health/live` + `AddHealthChecks()`, add `GrpcDependencyHealthCheck` (worker IPC) + `MapZbHealth()` | MxGateway | P1 | S | Low | Gap T1, T3, P1, P3: no probes/tiers today; highest delta |
| 2 | OtOpcUa: replace 3 bespoke checks with shared probes (`AkkaClusterHealthCheck` OtOpcUaCompat + `ActiveNodeHealthCheck` role-filtered + `DatabaseHealthCheck` ProbeQuery) | OtOpcUa | P2 | S | Low | Gap A1, D1, L1 |
| 3 | ScadaBridge: replace 3 bespoke checks with shared probes (Default policy + role-less Active + `CanConnectAsync`) + add `/healthz` + unify `ActiveNodeGate` with `IActiveNodeGate` seam | ScadaBridge | P2 | S | Low | Gap T2, A1, D2, L1, L2 |
| 4 | OtOpcUa + MxAccessGateway: add `GrpcDependencyHealthCheck` for downstream gRPC channel | OtOpcUa, MxGateway | P2 | S | Low | Gap P2: closes the silent-gateway-down scenario |
| 5 | All: adopt canonical response writer (switch from per-project writers to `MapZbHealth` default) | all 3 | P3 | XS | Low | Gap W1: mechanical; bundled with #1–3 |
| 6 | DB injection style: switch ScadaBridge from injected `DbContext` to `IDbContextFactory` | ScadaBridge | P3 | XS | Low | Gap D2: background-service safety |

**Note: adoption items #1–6 are all follow-on tasks.** They are tracked here as the backlog for after `ZB.MOM.WW.Health` @ 0.1.0 is published. The library build itself (nupkgs, tests) is a separate task. This is consistent with how `ZB.MOM.WW.Auth` and `ZB.MOM.WW.Theme` are structured: the library is built first; adoption by the three apps is the next step.

## Adoption status: 2026-06-01 (DONE)

`ZB.MOM.WW.Health` 0.1.0 was published to the `dohertj2-gitea` NuGet feed and adopted across all three apps on branch `feat/adopt-zb-health` in each repo (one branch per repo; commits below). Plan + design: [`../../docs/plans/2026-06-01-health-library-adoption.md`](../../docs/plans/2026-06-01-health-library-adoption.md).

| Repo | What shipped | Build / tests |
|---|---|---|
| **MxAccessGateway** | Removed the pipeline-bypassing `/health/live` lambda + dead `AddHealthChecks()`; added a custom `AuthStoreHealthCheck` (readiness probe over the SQLite auth store) tagged `Ready`; `MapZbHealth()` → ready/active/healthz + canonical writer. | Server builds; 568 pass / 3 **pre-existing** macOS failures unrelated to health (OrphanWorkerTerminator ×2, fake-worker timeout ×1). |
| **OtOpcUa** | Swapped all 3 bespoke checks → shared probes (`DatabaseHealthCheck` + `ProbeQuery`, `AkkaClusterHealthCheck` **OtOpcUaCompat**, `ActiveNodeHealthCheck(role:"admin")`); `MapZbHealth()`. | Host builds clean; health tests pass **including a real two-node-cluster integration test**. Independently code-reviewed: APPROVED, behaviour-preserving. |
| **ScadaBridge** | Swapped 3 bespoke checks → shared probes (`DatabaseHealthCheck` scoped fallback, `AkkaClusterHealthCheck` **Default**, role-less `ActiveNodeHealthCheck`); added transient `ActorSystem` DI bridge (Central + Site roots); added `/healthz`; canonical writer. Kept its own `ActiveNodeGate`. | Builds clean (TreatWarningsAsErrors); 212 Host tests pass. |

### Deferred (verified ill-fitting on adoption; re-scoped from the original backlog)

- **#4 downstream gRPC dependency probes: DROPPED for now.** Neither repo holds a host-level `GrpcChannel` to probe: the MxGateway↔worker IPC is **named pipes** (not gRPC), and OtOpcUa's gateway channel is created **per-driver** from DB config (no DI-level channel). MxGateway readiness instead probes the SQLite auth store (`AuthStoreHealthCheck`). A real worker/downstream probe needs a custom non-gRPC check (future work).
- **#3 `IActiveNodeGate` seam unification (ScadaBridge): DEFERRED.** ScadaBridge's `IActiveNodeGate` is `…ScadaBridge.InboundAPI.IActiveNodeGate`, wired into inbound-API endpoint gating, a different interface from the shared `ZB.MOM.WW.Health.IActiveNodeGate`. Unification touches the InboundAPI path; its existing `ActiveNodeGate` (logic identical to the shared `AkkaActiveNodeGate`) was kept.
- **#6 `IDbContextFactory` switch (ScadaBridge): DROPPED as unnecessary.** The shared `DatabaseHealthCheck` self-scopes (creates its own DI scope per probe) when no factory is registered, which *is* the background-safety fix, and ScadaBridge's context is built with an injected `IDataProtectionProvider`, which `AddDbContextFactory` does not accommodate cleanly.

### Accepted behaviour change (one): flag for ops

ScadaBridge `/health/active` during the **startup window** (ActorSystem/cluster not yet ready) now returns **Degraded → HTTP 200** instead of the prior **Unhealthy → HTTP 503**.
This is the shared `ActiveNodeHealthCheck`'s documented startup-safe behaviour and the normalized convergence target. The steady-state standby case (node Up but not leader) is **unchanged** (Unhealthy → 503). If a load-balancer (Traefik) keys strictly on a 503 from `/health/active` to fence standby nodes during startup, that fail-safe is briefly relaxed until the cluster forms.
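
For ops, the change can be summarised as a small truth table. The sketch below is illustrative only: `ActiveEndpoint` and its methods are hypothetical names, the `HealthStatus` enum is a local stand-in, and the status-code mapping assumes ASP.NET Core's default health-check behaviour (Healthy and Degraded return 200, Unhealthy returns 503), which matches the codes quoted above.

```csharp
using System;

// Startup window (cluster not ready): the only case that changed.
Console.WriteLine(ActiveEndpoint.StatusCode(ActiveEndpoint.BeforeAdoption(clusterReady: false, isLeader: false))); // 503
Console.WriteLine(ActiveEndpoint.StatusCode(ActiveEndpoint.AfterAdoption(clusterReady: false, isLeader: false)));  // 200

// Steady-state standby (Up but not leader): unchanged.
Console.WriteLine(ActiveEndpoint.StatusCode(ActiveEndpoint.BeforeAdoption(clusterReady: true, isLeader: false)));  // 503
Console.WriteLine(ActiveEndpoint.StatusCode(ActiveEndpoint.AfterAdoption(clusterReady: true, isLeader: false)));   // 503

// Local stand-in (assumption) for ASP.NET Core's HealthStatus.
public enum HealthStatus { Unhealthy, Degraded, Healthy }

// Hypothetical model of ScadaBridge's /health/active before and after adoption.
public static class ActiveEndpoint
{
    // Assumed default mapping: only Unhealthy produces a 503.
    public static int StatusCode(HealthStatus status) =>
        status == HealthStatus.Unhealthy ? 503 : 200;

    // Prior behaviour: anything other than a ready leader was Unhealthy.
    public static HealthStatus BeforeAdoption(bool clusterReady, bool isLeader) =>
        clusterReady && isLeader ? HealthStatus.Healthy : HealthStatus.Unhealthy;

    // Shared check: a not-yet-ready cluster is Degraded; standby stays Unhealthy.
    public static HealthStatus AfterAdoption(bool clusterReady, bool isLeader)
    {
        if (!clusterReady) return HealthStatus.Degraded;
        return isLeader ? HealthStatus.Healthy : HealthStatus.Unhealthy;
    }
}
```

A load balancer that should keep fencing nodes during startup would need to key on the reported status in the response body rather than on the 503 alone.
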