diff --git a/docs/plans/2026-06-01-health-library-adoption-design.md b/docs/plans/2026-06-01-health-library-adoption-design.md
new file mode 100644
index 0000000..7db1c12
--- /dev/null
+++ b/docs/plans/2026-06-01-health-library-adoption-design.md
@@ -0,0 +1,177 @@
+# Adopt `ZB.MOM.WW.Health` across the three sister apps — design
+
+**Date:** 2026-06-01
+**Status:** Approved (design); implementation plan to follow via writing-plans.
+**Scope:** Integrate the built-but-unadopted `ZB.MOM.WW.Health` shared library into all three
+sister apps — **OtOpcUa**, **MxAccessGateway**, **ScadaBridge** — replacing each app's bespoke
+health-check wiring with the shared probes, tiers, and writer.
+
+This is the first full cross-fleet adoption of one of the six shared `ZB.MOM.WW.*` libraries.
+It follows the adoption backlog in [`components/health/GAPS.md`](../../components/health/GAPS.md),
+re-verified against current code on 2026-06-01.
+
+---
+
+## 1. Goal & scope
+
+Replace each app's bespoke health-check wiring with `ZB.MOM.WW.Health`, **preserving each app's
+existing health policy** — the library ships presets precisely so that no app's Healthy / Degraded
+/ Unhealthy classifications change. Outcome:
+
+- All three apps expose the canonical tiers `/health/ready`, `/health/active`, `/healthz` with the
+  canonical JSON writer (`ZbHealthWriter`).
+- **MxAccessGateway gains real health checks for the first time** (today its `/health/live` is a
+  hardcoded `"Healthy"` lambda that bypasses the ASP.NET Core health-check pipeline, and its
+  `AddHealthChecks()` call is dead code).
+- No breaking external contract; no metric, dashboard, or wire-format change; no ops coordination.
+
+**Out of scope:** OtOpcUa's actor-based `Runtime/Health/*` *driver* health (a different concern —
+OPC UA driver connectivity, not the ASP.NET health-endpoint tier), and ScadaBridge's distributed
+health-monitoring pipeline beyond the endpoint probes.
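+
+A minimal sketch of the MxAccessGateway point above, using only stock ASP.NET Core APIs rather
+than `ZB.MOM.WW.Health` itself: a `MapGet` lambda always answers "Healthy", whereas
+`MapHealthChecks` runs the registered probes. The probe name `worker-placeholder` and the tag
+string `"ready"` are illustrative assumptions, not values taken from the library or the gateway.
+
+```csharp
+using System;
+using Microsoft.AspNetCore.Builder;
+using Microsoft.AspNetCore.Diagnostics.HealthChecks;
+using Microsoft.AspNetCore.Http;
+using Microsoft.Extensions.DependencyInjection;
+using Microsoft.Extensions.Diagnostics.HealthChecks;
+
+var builder = WebApplication.CreateBuilder(args);
+
+// Register one probe and tag it. The delegate is a stand-in; a real probe would
+// exercise a dependency and could return Degraded or Unhealthy.
+builder.Services.AddHealthChecks()
+    .AddCheck(
+        "worker-placeholder",                                   // hypothetical probe name
+        () => HealthCheckResult.Healthy("placeholder probe"),
+        tags: new[] { "ready" });                               // hypothetical tag string
+
+var app = builder.Build();
+
+// Hardcoded endpoint: never consults the registered checks, so it cannot report a failure.
+app.MapGet("/health/live", () => "Healthy");
+
+// Pipeline-backed endpoint: runs only the checks carrying the "ready" tag.
+app.MapHealthChecks("/health/ready", new HealthCheckOptions
+{
+    Predicate = registration => registration.Tags.Contains("ready"),
+});
+
+app.Run();
+```
+
+With this shape, a failing probe changes what `/health/ready` returns, while the hardcoded
+`/health/live` route keeps answering "Healthy" regardless of which checks are registered.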
+
+### Library public surface this design depends on (code-verified)
+
+| API | Package | Use |
+|---|---|---|
+| `IEndpointRouteBuilder.MapZbHealth(ZbHealthEndpointOptions?)` | `ZB.MOM.WW.Health` | Maps `ready`/`active`/`live` tiers by tag. Does **not** call `AddHealthChecks()` — caller registers probes + tags. |
+| `ZbHealthTags.Ready / Active / Live` | `ZB.MOM.WW.Health` | Tag each probe so `MapZbHealth` routes it to the right tier. |
+| `ZbHealthWriter` | `ZB.MOM.WW.Health` | Canonical JSON response writer. |
+| `GrpcDependencyHealthCheck` + `GrpcDependencyOptions { Probe, DependencyName, Timeout }` | `ZB.MOM.WW.Health` | Probe a downstream gRPC channel. |
+| `IActiveNodeGate` (+ `AkkaActiveNodeGate`) | `ZB.MOM.WW.Health` / `.Akka` | Active-node seam, replacing duplicated leader logic. |
+| `AkkaClusterStatusPolicy.Default` / `.OtOpcUaCompat` → `AkkaClusterHealthCheck(sp, policy)` | `ZB.MOM.WW.Health.Akka` | Cluster-membership probe with per-app preset. |
+| `ActiveNodeHealthCheck(sp)` / `(sp, string role)` | `ZB.MOM.WW.Health.Akka` | Active/leader probe, role-filtered overload. |
+| `DatabaseHealthCheck` + `DatabaseHealthCheckOptions { ProbeQuery, Timeout }` | `ZB.MOM.WW.Health.EntityFrameworkCore` | DB probe; default `CanConnectAsync`, optional stricter `ProbeQuery`. |
+
+**Consumer matrix:** MxGateway → `ZB.MOM.WW.Health` (core) only; OtOpcUa & ScadaBridge → all three.
+
+---
+
+## 2. Distribution & referencing — Gitea registry (chosen)
+
+The family is already inconsistent in how it distributes shared `ZB.MOM.WW.*` packages:
+OtOpcUa uses a committed local folder feed (`./nuget-packages/`), ScadaBridge uses the Gitea NuGet
+registry + package-source-mapping, and MxAccessGateway has no `nuget.config` (it is the *producer* of
+`MxGateway.*`). We standardize Health distribution on the **Gitea NuGet registry** — the only
+mechanism that gives a single versioned source of truth, commits no binaries, and is already proven
+in this family (ScadaBridge consumes `MxGateway.*` exactly this way).
+
+### Step 0 — publish (one-time per version, prerequisite for all repos)
+From `scadaproj`:
+1. `dotnet pack` the three Health projects (already emit `0.1.0` nupkgs).
+2. `dotnet nuget push` the three packages to the `dohertj2-gitea` feed
+   (`https://gitea.dohertylan.com/api/packages/dohertj2/nuget/index.json`).
+3. Credentials (push token / per-dev feed creds) supplied via env or `dotnet nuget add source`,
+   **never committed** — same posture ScadaBridge already documents.
+
+### Per-repo reference wiring
+
+| Repo | Change | Notes |
+|---|---|---|
+| **ScadaBridge** | Extend existing `packageSourceMapping` to route `ZB.MOM.WW.Health.*` → `dohertj2-gitea`; add 3 CPM `<PackageVersion>` entries; add `<PackageReference>` (no version) to the Host csproj. | Smallest change — already wired for the Gitea feed + CPM. |
+| **OtOpcUa** | Add `dohertj2-gitea` source to `NuGet.config` (keep `local-mxgw` folder feed for `MxGateway.*`); add source-mapping (`MxGateway.*`→local, `Health.*`→gitea, `*`→nuget.org) for determinism; add 3 CPM `<PackageVersion>` entries + `<PackageReference>`s. | Keeps its existing folder-feed arrangement untouched. |
+| **MxAccessGateway** | Create its **first** `nuget.config` (nuget.org + gitea sources + source-mapping); add a direct versioned `<PackageReference>`. | No CPM in this repo — a direct versioned reference is correct; introducing CPM for one package is deliberately avoided. |
+
+Existing `MxGateway.*` distribution arrangements are untouched; only `ZB.MOM.WW.Health.*` is added.
+
+---
+
+## 3. Per-repo integration
+
+### 3a. MxAccessGateway — highest delta (no health infra today)
+- Delete the `/health/live` `MapGet` lambda (`GatewayApplication.cs:173`) and the dead
+  `AddHealthChecks()` (`:66`).
+- Re-add `AddHealthChecks()` **with real probes**: register a `GrpcDependencyHealthCheck`
+  (tag `Ready`) whose `Probe` exercises the **x86 worker IPC gRPC channel** the gateway already
+  owns; `DependencyName = "mxworker"`, explicit `Timeout`.
+- `app.MapZbHealth()` → `/health/ready` (worker reachable), `/health/active`, `/healthz`.
+- Update `GatewayApplicationTests` (currently asserts `/health/live` exists) to assert the three
+  new tier routes; add a worker-down test asserting `ready` = Unhealthy.
+
+### 3b. OtOpcUa — all three packages
+- `Host/Health/AkkaClusterHealthCheck.cs` → shared `AkkaClusterHealthCheck` with
+  **`AkkaClusterStatusPolicy.OtOpcUaCompat`** (preserves self-Up-among-members semantics).
+- `AdminRoleLeaderHealthCheck.cs` → shared `ActiveNodeHealthCheck(sp, role: "admin")`.
+- `DatabaseHealthCheck.cs` → shared `DatabaseHealthCheck` with `ProbeQuery` =
+  its existing `Deployments.AsNoTracking().Take(1)` query (keeps stricter schema-touch semantics).
+- `HealthEndpoints.cs` → `MapZbHealth()` (same tier semantics, canonical writer); register each
+  probe with the matching `ZbHealthTags`.
+- Add a downstream `GrpcDependencyHealthCheck` probing the **MxAccessGateway channel** (tag `Ready`)
+  — closes the silent-gateway-down gap.
+- `Runtime/Health/*` (actor-based driver health) left untouched.
+
+### 3c. ScadaBridge — all three packages
+- Three bespoke checks → shared `AkkaClusterHealthCheck` (**`Default`** policy), role-less
+  `ActiveNodeHealthCheck(sp)`, `DatabaseHealthCheck` (default `CanConnectAsync`).
+- Switch the DB probe from injected `DbContext` to `IDbContextFactory` (background-safe).
+- Replace bespoke `ActiveNodeGate.cs` with the shared `IActiveNodeGate` seam + `AkkaActiveNodeGate`
+  backing (removes duplicated leader logic).
+- Add `/healthz` (free via `MapZbHealth()`); swap `UIResponseWriter` for `ZbHealthWriter`.
+
+---
+
+## 4. Cross-cutting conventions
+
+- **Tags drive tiers:** every probe is registered with `tags: [ZbHealthTags.Ready|Active|Live]`;
+  `MapZbHealth()` routes by tag. This is the one mechanical convention each repo must follow.
+- **Canonical writer** (`ZbHealthWriter`) everywhere — replaces three different writers
+  (gateway `GatewayHealthReply`, ScadaBridge `UIResponseWriter`, OtOpcUa default).
+- **Auth:** all tiers stay `AllowAnonymous` (matches all three apps today).
+
+---
+
+## 5. Sequencing — one PR per repo
+
+The publish-to-Gitea step (§2 Step 0) is a shared prerequisite. After that, each repo PR is
+independent. Recommended order:
+
+1. **MxAccessGateway** — highest delta, smallest surface; validates the publish→consume loop and
+   the canonical writer end-to-end in the simplest app.
+2. **OtOpcUa** — exercises all three packages + the `OtOpcUaCompat`/role-filter presets + the
+   downstream gRPC probe.
+3. **ScadaBridge** — heaviest (the `IActiveNodeGate` / `IDbContextFactory` cleanups); done last
+   with the pattern proven twice.
+
+---
+
+## 6. Behaviour-preservation & error handling
+
+- **No policy change:** presets (`OtOpcUaCompat` vs `Default`) and `RoleFilter="admin"` vs role-less
+  are chosen so each app's Healthy/Degraded/Unhealthy classifications are unchanged.
+- **Fail-soft:** a probe that throws maps to `Unhealthy`, never crashes the host; gRPC/DB probes
+  carry explicit `Timeout`s.
+- **Credentials:** Gitea push token + per-dev feed creds handled out-of-band (env /
+  `dotnet nuget add source`), never committed — verified by a "no secrets in diff" check per PR.
+
+---
+
+## 7. Testing & verification gates (per repo)
+
+- `dotnet build` + `dotnet test` green **in the sister repo** after adoption (not just scadaproj).
+- **MxGateway:** retarget the route-assertion test to the three tiers; add a worker-down → `ready`
+  = Unhealthy test.
+- **OtOpcUa / ScadaBridge:** existing health tests retargeted to the shared types; assert tier→tag
+  routing and that the preset preserves prior classification (ScadaBridge `Joining` = Healthy;
+  OtOpcUa self-not-Up = Degraded).
+- Check off the corresponding `components/health/GAPS.md` items and update that file to reflect
+  adoption.
+
+---
+
+## 8. Risks & open questions
+
+- **MxGateway worker-IPC probe shape** — the exact `Probe` delegate depends on how the gateway holds
+  the per-session worker channel. Implementation detail; the plan pins it against
+  `GatewayApplication`'s worker-client wiring.
+- **Gitea availability / credentials** in this environment — if the registry is unreachable when
+  implementation starts, the fallback is the **local folder feed** without changing any per-repo
+  code, only the `nuget.config` source. This is flagged explicitly rather than switched silently.
+- **CPM in MxGateway** — none today; this design uses a direct versioned `PackageReference` rather
+  than introducing CPM for one package. Standardizing MxGateway onto CPM is a possible follow-up,
+  out of scope here.
+
+---
+
+## Next step
+
+Hand off to the **writing-plans** skill to turn this design into a detailed, step-by-step
+implementation plan (per-repo tasks, exact edit sites, test changes, commit/PR structure).