docs(health): record ZB.MOM.WW.Health adoption across 3 apps + deferrals + accepted /health/active startup behaviour change

Joseph Doherty
2026-06-01 13:50:09 -04:00
parent 1e91784ba3
commit 19f7ea5eeb
@@ -131,3 +131,40 @@ after `ZB.MOM.WW.Health` @ 0.1.0 is published. The library build itself (nupkgs,
separate task. This is consistent with how `ZB.MOM.WW.Auth` and `ZB.MOM.WW.Theme` are structured:
the library is built first; adoption by the three apps is the next step.
## Adoption status — 2026-06-01 (DONE)
`ZB.MOM.WW.Health` 0.1.0 was published to the `dohertj2-gitea` NuGet feed and adopted across all
three apps on branch `feat/adopt-zb-health` in each repo (one branch per repo; commits below).
Plan + design: [`../../docs/plans/2026-06-01-health-library-adoption.md`](../../docs/plans/2026-06-01-health-library-adoption.md).
| Repo | What shipped | Build / tests |
|---|---|---|
| **MxAccessGateway** | Removed the pipeline-bypassing `/health/live` lambda + dead `AddHealthChecks()`; added a custom `AuthStoreHealthCheck` (readiness probe over the SQLite auth store) tagged `Ready`; `MapZbHealth()` → ready/active/healthz + canonical writer. | Server builds; 568 pass / 3 **pre-existing** macOS failures unrelated to health (OrphanWorkerTerminator ×2, fake-worker timeout ×1). |
| **OtOpcUa** | Swapped all 3 bespoke checks → shared probes (`DatabaseHealthCheck<OtOpcUaConfigDbContext>` + `ProbeQuery`, `AkkaClusterHealthCheck` **OtOpcUaCompat**, `ActiveNodeHealthCheck(role:"admin")`); `MapZbHealth()`. | Host builds clean; health tests pass **including a real two-node-cluster integration test**. Independently code-reviewed: APPROVED, behaviour-preserving. |
| **ScadaBridge** | Swapped 3 bespoke checks → shared probes (`DatabaseHealthCheck<ScadaBridgeDbContext>` scoped fallback, `AkkaClusterHealthCheck` **Default**, role-less `ActiveNodeHealthCheck`); added transient `ActorSystem` DI bridge (Central + Site roots); added `/healthz`; canonical writer. Kept its own `ActiveNodeGate`. | Builds clean (TreatWarningsAsErrors); 212 Host tests pass. |
### Deferred (verified ill-fitting on adoption — re-scoped from the original backlog)
- **#4 downstream gRPC dependency probes — DROPPED for now.** Neither repo holds a host-level
`GrpcChannel` to probe: the MxGateway↔worker IPC is **named pipes** (not gRPC), and OtOpcUa's
gateway channel is created **per-driver** from DB config (no DI-level channel). MxGateway readiness
instead probes the SQLite auth store (`AuthStoreHealthCheck`). A real worker/downstream probe needs
a custom non-gRPC check — future work.
- **#3 `IActiveNodeGate` seam unification (ScadaBridge) — DEFERRED.** ScadaBridge's `IActiveNodeGate`
is `…ScadaBridge.InboundAPI.IActiveNodeGate`, wired into inbound-API endpoint gating — a different
interface from the shared `ZB.MOM.WW.Health.IActiveNodeGate`. Unification touches the InboundAPI
path; its existing `ActiveNodeGate` (logic identical to the shared `AkkaActiveNodeGate`) was kept.
- **#6 `IDbContextFactory<T>` switch (ScadaBridge) — DROPPED as unnecessary.** The shared
`DatabaseHealthCheck<T>` self-scopes (creates its own DI scope per probe) when no factory is
registered — that *is* the background-safety fix — and ScadaBridge's context is built with an
injected `IDataProtectionProvider`, which `AddDbContextFactory` does not accommodate cleanly.
### Accepted behaviour change (one) — flag for ops
ScadaBridge `/health/active` during the **startup window** (ActorSystem/cluster not yet ready) now
returns **Degraded → HTTP 200** instead of the prior **Unhealthy → HTTP 503**. This is the shared
`ActiveNodeHealthCheck`'s documented startup-safe behaviour and the normalized convergence target.
The steady-state standby case (node Up but not leader) is **unchanged** (Unhealthy → 503). If a
load balancer (Traefik) relies strictly on a 503 from `/health/active` to fence standby nodes, that
fail-safe is relaxed during startup: a node that is still starting answers 200, so it is not fenced
until the cluster forms and the check can report its real role.
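
For ops review, the change can be summarised as a small decision table. The sketch below is a
language-neutral model in Python, not the library's code: `active_status`, `active_http_code`, and
the `cluster_ready` / `is_leader` flags are hypothetical names, and the Healthy → 200 row for the
leader is an assumption (this note only states the Degraded and Unhealthy mappings).

```python
from enum import Enum


class HealthStatus(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"


# Status-to-HTTP mapping as described in this note (Degraded -> 200, Unhealthy -> 503).
# Healthy -> 200 is an assumption; the note does not state it.
HTTP_CODE = {
    HealthStatus.HEALTHY: 200,
    HealthStatus.DEGRADED: 200,
    HealthStatus.UNHEALTHY: 503,
}


def active_status(cluster_ready: bool, is_leader: bool) -> HealthStatus:
    """Hypothetical model of the /health/active decision after adoption."""
    if not cluster_ready:
        # Startup window: ActorSystem/cluster not yet ready.
        # Previously this case was Unhealthy (503); it is now Degraded (200).
        return HealthStatus.DEGRADED
    # Steady state: the leader is assumed Healthy; a standby node
    # (Up but not leader) stays Unhealthy, unchanged from before.
    return HealthStatus.HEALTHY if is_leader else HealthStatus.UNHEALTHY


def active_http_code(cluster_ready: bool, is_leader: bool) -> int:
    """HTTP status a probe consumer would see for the modelled state."""
    return HTTP_CODE[active_status(cluster_ready, is_leader)]
```

Under these assumptions, only the `cluster_ready=False` row differs from the previous behaviour, so
a health-check rule that treats any 200 from `/health/active` as "route traffic here" is the one to
re-check.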