diff --git a/components/health/GAPS.md b/components/health/GAPS.md
index 7fa3b13..fd04cea 100644
--- a/components/health/GAPS.md
+++ b/components/health/GAPS.md
@@ -131,3 +131,40 @@ after `ZB.MOM.WW.Health` @ 0.1.0 is published. The library build itself (nupkgs,
 separate task. This is consistent with how `ZB.MOM.WW.Auth` and `ZB.MOM.WW.Theme` are structured:
 the library is built first; adoption by the three apps is the next step.
 
+## Adoption status — 2026-06-01 (DONE)
+
+`ZB.MOM.WW.Health` 0.1.0 was published to the `dohertj2-gitea` NuGet feed and adopted across all
+three apps on branch `feat/adopt-zb-health` in each repo (one branch per repo; commits below).
+Plan + design: [`../../docs/plans/2026-06-01-health-library-adoption.md`](../../docs/plans/2026-06-01-health-library-adoption.md).
+
+| Repo | What shipped | Build / tests |
+|---|---|---|
+| **MxAccessGateway** | Removed the pipeline-bypassing `/health/live` lambda + dead `AddHealthChecks()`; added a custom `AuthStoreHealthCheck` (readiness probe over the SQLite auth store) tagged `Ready`; `MapZbHealth()` → ready/active/healthz + canonical writer. | Server builds; 568 pass / 3 **pre-existing** macOS failures unrelated to health (OrphanWorkerTerminator ×2, fake-worker timeout ×1). |
+| **OtOpcUa** | Swapped all 3 bespoke checks → shared probes (`DatabaseHealthCheck` + `ProbeQuery`, `AkkaClusterHealthCheck` **OtOpcUaCompat**, `ActiveNodeHealthCheck(role:"admin")`); `MapZbHealth()`. | Host builds clean; health tests pass **including a real two-node-cluster integration test**. Independently code-reviewed: APPROVED, behaviour-preserving. |
+| **ScadaBridge** | Swapped 3 bespoke checks → shared probes (`DatabaseHealthCheck` scoped fallback, `AkkaClusterHealthCheck` **Default**, role-less `ActiveNodeHealthCheck`); added transient `ActorSystem` DI bridge (Central + Site roots); added `/healthz`; canonical writer. Kept its own `ActiveNodeGate`. | Builds clean (TreatWarningsAsErrors); 212 Host tests pass. |
+
+### Deferred (verified ill-fitting on adoption — re-scoped from the original backlog)
+
+- **#4 downstream gRPC dependency probes — DROPPED for now.** Neither repo holds a host-level
+  `GrpcChannel` to probe: the MxGateway↔worker IPC is **named pipes** (not gRPC), and OtOpcUa's
+  gateway channel is created **per-driver** from DB config (no DI-level channel). MxGateway readiness
+  instead probes the SQLite auth store (`AuthStoreHealthCheck`). A real worker/downstream probe needs
+  a custom non-gRPC check — future work.
+- **#3 `IActiveNodeGate` seam unification (ScadaBridge) — DEFERRED.** ScadaBridge's `IActiveNodeGate`
+  is `…ScadaBridge.InboundAPI.IActiveNodeGate`, wired into inbound-API endpoint gating — a different
+  interface from the shared `ZB.MOM.WW.Health.IActiveNodeGate`. Unification touches the InboundAPI
+  path; its existing `ActiveNodeGate` (logic identical to the shared `AkkaActiveNodeGate`) was kept.
+- **#6 `IDbContextFactory` switch (ScadaBridge) — DROPPED as unnecessary.** The shared
+  `DatabaseHealthCheck` self-scopes (creates its own DI scope per probe) when no factory is
+  registered — that *is* the background-safety fix — and ScadaBridge's context is built with an
+  injected `IDataProtectionProvider`, which `AddDbContextFactory` does not accommodate cleanly.
+
+### Accepted behaviour change (one) — flag for ops
+
+ScadaBridge `/health/active` during the **startup window** (ActorSystem/cluster not yet ready) now
+returns **Degraded → HTTP 200** instead of the prior **Unhealthy → HTTP 503**. This is the shared
+`ActiveNodeHealthCheck`'s documented startup-safe behaviour and the normalized convergence target.
+The steady-state standby case (node Up but not leader) is **unchanged** (Unhealthy → 503). If a
+load-balancer (Traefik) keys strictly on a 503 from `/health/active` to fence standby nodes during
+startup, that fail-safe is briefly relaxed until the cluster forms.
+
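
Reviewer note on the accepted behaviour change: the `/health/active` status-to-HTTP mapping described in the hunk can be summarized as a small truth table. The sketch below is a language-neutral model in Python, not the library's code. The names `active_status`, `http_code`, and the `startup_safe` flag are hypothetical, and the Healthy → 200 result for the active leader is an assumption, since the diff states only the Degraded and Unhealthy cases.

```python
from enum import Enum


class HealthStatus(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"


# Status-to-HTTP mapping as stated in the diff (Degraded -> 200, Unhealthy -> 503).
# Healthy -> 200 is an assumption; the diff does not spell it out.
STATUS_TO_HTTP = {
    HealthStatus.HEALTHY: 200,
    HealthStatus.DEGRADED: 200,
    HealthStatus.UNHEALTHY: 503,
}


def active_status(cluster_ready: bool, is_leader: bool, startup_safe: bool) -> HealthStatus:
    """Model of the /health/active decision (hypothetical helper).

    startup_safe=True models the shared check's behaviour after adoption;
    startup_safe=False models ScadaBridge's prior bespoke check.
    """
    if not cluster_ready:
        # Startup window: ActorSystem/cluster not yet ready.
        return HealthStatus.DEGRADED if startup_safe else HealthStatus.UNHEALTHY
    # Steady state: a standby node (Up but not leader) stays Unhealthy.
    return HealthStatus.HEALTHY if is_leader else HealthStatus.UNHEALTHY


def http_code(status: HealthStatus) -> int:
    return STATUS_TO_HTTP[status]


if __name__ == "__main__":
    for ready, leader in [(False, False), (True, False), (True, True)]:
        old = http_code(active_status(ready, leader, startup_safe=False))
        new = http_code(active_status(ready, leader, startup_safe=True))
        print(f"cluster_ready={ready} is_leader={leader}: before={old} after={new}")
```

Only the first row (cluster not ready) differs between the two modes, which is the window in which a load balancer fencing strictly on 503 would briefly admit a node that has not yet joined the cluster.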