docs(health): record ZB.MOM.WW.Health adoption across 3 apps + deferrals + accepted /health/active startup behaviour change
@@ -131,3 +131,40 @@ after `ZB.MOM.WW.Health` @ 0.1.0 is published. The library build itself (nupkgs,
separate task. This is consistent with how `ZB.MOM.WW.Auth` and `ZB.MOM.WW.Theme` are structured:
the library is built first; adoption by the three apps is the next step.

## Adoption status — 2026-06-01 (DONE)

`ZB.MOM.WW.Health` 0.1.0 was published to the `dohertj2-gitea` NuGet feed and adopted across all
three apps on branch `feat/adopt-zb-health` in each repo (one branch per repo; commits below).
Plan + design: [`../../docs/plans/2026-06-01-health-library-adoption.md`](../../docs/plans/2026-06-01-health-library-adoption.md).

| Repo | What shipped | Build / tests |
|---|---|---|
| **MxAccessGateway** | Removed the pipeline-bypassing `/health/live` lambda + dead `AddHealthChecks()`; added a custom `AuthStoreHealthCheck` (readiness probe over the SQLite auth store) tagged `Ready`; `MapZbHealth()` → ready/active/healthz + canonical writer. | Server builds; 568 pass / 3 **pre-existing** macOS failures unrelated to health (OrphanWorkerTerminator ×2, fake-worker timeout ×1). |
| **OtOpcUa** | Swapped all 3 bespoke checks → shared probes (`DatabaseHealthCheck<OtOpcUaConfigDbContext>` + `ProbeQuery`, `AkkaClusterHealthCheck` **OtOpcUaCompat**, `ActiveNodeHealthCheck(role:"admin")`); `MapZbHealth()`. | Host builds clean; health tests pass **including a real two-node-cluster integration test**. Independently code-reviewed: APPROVED, behaviour-preserving. |
| **ScadaBridge** | Swapped 3 bespoke checks → shared probes (`DatabaseHealthCheck<ScadaBridgeDbContext>` scoped fallback, `AkkaClusterHealthCheck` **Default**, role-less `ActiveNodeHealthCheck`); added transient `ActorSystem` DI bridge (Central + Site roots); added `/healthz`; canonical writer. Kept its own `ActiveNodeGate`. | Builds clean (TreatWarningsAsErrors); 212 Host tests pass. |

### Deferred (verified ill-fitting on adoption — re-scoped from the original backlog)

- **#4 downstream gRPC dependency probes — DROPPED for now.** Neither repo holds a host-level
  `GrpcChannel` to probe: the MxGateway↔worker IPC is **named pipes** (not gRPC), and OtOpcUa's
  gateway channel is created **per-driver** from DB config (no DI-level channel). MxGateway readiness
  instead probes the SQLite auth store (`AuthStoreHealthCheck`). A real worker/downstream probe needs
  a custom non-gRPC check — future work.
- **#3 `IActiveNodeGate` seam unification (ScadaBridge) — DEFERRED.** ScadaBridge's `IActiveNodeGate`
  is `…ScadaBridge.InboundAPI.IActiveNodeGate`, wired into inbound-API endpoint gating — a different
  interface from the shared `ZB.MOM.WW.Health.IActiveNodeGate`. Unification touches the InboundAPI
  path; its existing `ActiveNodeGate` (logic identical to the shared `AkkaActiveNodeGate`) was kept.
- **#6 `IDbContextFactory<T>` switch (ScadaBridge) — DROPPED as unnecessary.** The shared
  `DatabaseHealthCheck<T>` self-scopes (creates its own DI scope per probe) when no factory is
  registered — that *is* the background-safety fix — and ScadaBridge's context is built with an
  injected `IDataProtectionProvider`, which `AddDbContextFactory` does not accommodate cleanly.
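
The self-scoping behaviour described in #6 can be pictured with a minimal sketch. This is not the source of `DatabaseHealthCheck<T>`: the class name `SelfScopingDbHealthCheck` is hypothetical, and the order of the fallback (use a registered factory if there is one, otherwise resolve the scoped context from a fresh DI scope) is an assumption drawn from the description above. Only standard `Microsoft.Extensions.*` and EF Core calls are used.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical illustration only; not the ZB.MOM.WW.Health implementation.
public sealed class SelfScopingDbHealthCheck<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly IServiceScopeFactory _scopes;

    public SelfScopingDbHealthCheck(IServiceScopeFactory scopes) => _scopes = scopes;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // One fresh scope per probe: nothing scoped outlives this call,
            // so the check is safe to run from a background publisher.
            await using var scope = _scopes.CreateAsyncScope();
            var services = scope.ServiceProvider;

            // Assumed order: prefer a registered factory when one exists.
            var factory = services.GetService<IDbContextFactory<TContext>>();
            if (factory is not null)
            {
                await using var fromFactory =
                    await factory.CreateDbContextAsync(cancellationToken);
                return await ProbeAsync(fromFactory, cancellationToken);
            }

            // Fallback: the scoped context, owned and disposed by the scope above.
            var scoped = services.GetRequiredService<TContext>();
            return await ProbeAsync(scoped, cancellationToken);
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Database probe threw.", ex);
        }
    }

    private static async Task<HealthCheckResult> ProbeAsync(
        TContext db, CancellationToken cancellationToken) =>
        await db.Database.CanConnectAsync(cancellationToken)
            ? HealthCheckResult.Healthy()
            : HealthCheckResult.Unhealthy("Database unreachable.");
}
```

Under these assumptions the fallback path works with whatever `AddDbContext` registration the host already has, including a context whose constructor takes extra injected services, which is why a switch to `AddDbContextFactory` would not be needed for probe safety.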

### Accepted behaviour change (one) — flag for ops

ScadaBridge `/health/active` during the **startup window** (ActorSystem/cluster not yet ready) now
returns **Degraded → HTTP 200** instead of the prior **Unhealthy → HTTP 503**. This is the shared
`ActiveNodeHealthCheck`'s documented startup-safe behaviour and the normalized convergence target.
The steady-state standby case (node Up but not leader) is **unchanged** (Unhealthy → 503). If a
load-balancer (Traefik) keys strictly on a 503 from `/health/active` to fence standby nodes during
startup, that fail-safe is briefly relaxed until the cluster forms.
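
The Degraded → 200 result is consistent with the stock ASP.NET Core mapping, where `HealthCheckOptions.ResultStatusCodes` sends both Healthy and Degraded to 200 and only Unhealthy to 503. The sketch below shows that mapping with plain `MapHealthChecks` as a stand-in for `MapZbHealth()`, whose options are not shown in this document. The `clusterReady` and `isLeader` flags, the check name and the `active` tag are hypothetical placeholders for the real cluster state.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Hypothetical stand-ins for "cluster has formed" and "this node is the leader".
var clusterReady = false;
var isLeader = false;

builder.Services.AddHealthChecks().AddCheck(
    "active-node",
    () => !clusterReady ? HealthCheckResult.Degraded("Cluster not ready yet.")
        : isLeader ? HealthCheckResult.Healthy()
        : HealthCheckResult.Unhealthy("Standby node."),
    tags: new[] { "active" });

var app = builder.Build();

app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = registration => registration.Tags.Contains("active"),
    // These are the framework defaults, spelled out to make the mapping visible.
    ResultStatusCodes =
    {
        [HealthStatus.Healthy] = StatusCodes.Status200OK,
        [HealthStatus.Degraded] = StatusCodes.Status200OK,                  // startup window
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable, // standby
    },
});

app.Run();
```

With both flags `false`, as written, a request to `/health/active` is answered with HTTP 200, which is the startup-window case flagged above. If ops needs strict fencing during startup, one option with the stock endpoint would be to map `HealthStatus.Degraded` to 503 in that dictionary; whether `MapZbHealth()` exposes an equivalent setting is not stated here.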