docs: design for ZB.MOM.WW.Health adoption across the 3 sister apps

Plan to integrate the built-but-unadopted Health library into OtOpcUa,
MxAccessGateway, and ScadaBridge: Gitea-registry distribution, per-repo
behaviour-preserving probe swaps (preset-based), canonical tiers + writer,
MxGateway-first sequencing.
Joseph Doherty
2026-06-01 13:01:36 -04:00
parent f47d4e1030
commit f72403d6f0
@@ -0,0 +1,177 @@
# Adopt `ZB.MOM.WW.Health` across the three sister apps — design
**Date:** 2026-06-01
**Status:** Approved (design); implementation plan to follow via writing-plans.
**Scope:** Integrate the built-but-unadopted `ZB.MOM.WW.Health` shared library into all three
sister apps — **OtOpcUa**, **MxAccessGateway**, **ScadaBridge** — replacing each app's bespoke
health-check wiring with the shared probes, tiers, and writer.
This is the first full cross-fleet adoption of one of the six shared `ZB.MOM.WW.*` libraries.
It follows the adoption backlog in [`components/health/GAPS.md`](../../components/health/GAPS.md),
re-verified against current code on 2026-06-01.
---
## 1. Goal & scope
Replace each app's bespoke health-check wiring with `ZB.MOM.WW.Health`, **preserving each app's
existing health policy**: the library ships presets precisely so that no app's Healthy / Degraded
/ Unhealthy classifications change. Outcome:
- All three apps expose the canonical tiers `/health/ready`, `/health/active`, `/healthz` with the
canonical JSON writer (`ZbHealthWriter`).
- **MxAccessGateway gains real health checks for the first time** (today its `/health/live` is a
hardcoded `"Healthy"` lambda that bypasses the ASP.NET Core health-check pipeline, and its
`AddHealthChecks()` call is dead code).
- No breaking external contract; no metric, dashboard, or wire-format change; no ops coordination.
**Out of scope:** OtOpcUa's actor-based `Runtime/Health/*` *driver* health (a different concern —
OPC UA driver connectivity, not the ASP.NET health-endpoint tier). ScadaBridge's distributed
health-monitoring pipeline beyond the endpoint probes.
### Library public surface this design depends on (code-verified)
| API | Package | Use |
|---|---|---|
| `IEndpointRouteBuilder.MapZbHealth(ZbHealthEndpointOptions?)` | `ZB.MOM.WW.Health` | Maps `ready`/`active`/`live` tiers by tag. Does **not** call `AddHealthChecks()` — caller registers probes + tags. |
| `ZbHealthTags.Ready / Active / Live` | `ZB.MOM.WW.Health` | Tag each probe so `MapZbHealth` routes it to the right tier. |
| `ZbHealthWriter` | `ZB.MOM.WW.Health` | Canonical JSON response writer. |
| `GrpcDependencyHealthCheck` + `GrpcDependencyOptions { Probe, DependencyName, Timeout }` | `ZB.MOM.WW.Health` | Probe a downstream gRPC channel. |
| `IActiveNodeGate` (+ `AkkaActiveNodeGate`) | `ZB.MOM.WW.Health` / `.Akka` | Active-node seam, replacing duplicated leader logic. |
| `AkkaClusterStatusPolicy.Default` / `.OtOpcUaCompat` → `AkkaClusterHealthCheck(sp, policy)` | `ZB.MOM.WW.Health.Akka` | Cluster-membership probe with per-app preset. |
| `ActiveNodeHealthCheck(sp)` / `(sp, string role)` | `ZB.MOM.WW.Health.Akka` | Active/leader probe, role-filtered overload. |
| `DatabaseHealthCheck<TContext>` + `DatabaseHealthCheckOptions<TContext> { ProbeQuery, Timeout }` | `ZB.MOM.WW.Health.EntityFrameworkCore` | DB probe; default `CanConnectAsync`, optional stricter `ProbeQuery`. |
**Consumer matrix:** MxGateway → `ZB.MOM.WW.Health` (core) only; OtOpcUa & ScadaBridge → all three.
---
## 2. Distribution & referencing — Gitea registry (chosen)
The family is already inconsistent in how it distributes shared `ZB.MOM.WW.*` packages:
OtOpcUa uses a committed local folder feed (`./nuget-packages/`), ScadaBridge uses the Gitea NuGet
registry + package-source-mapping, and MxAccessGateway has no `nuget.config` (it is the *producer* of
`MxGateway.*`). We standardize Health distribution on the **Gitea NuGet registry** — the only
mechanism that gives a single versioned source of truth, commits no binaries, and is already proven
in this family (ScadaBridge consumes `MxGateway.*` exactly this way).
### Step 0 — publish (one-time per version, prerequisite for all repos)
From `scadaproj`:
1. `dotnet pack` the three Health projects (already emit `0.1.0` nupkgs).
2. `dotnet nuget push` the three packages to the `dohertj2-gitea` feed
(`https://gitea.dohertylan.com/api/packages/dohertj2/nuget/index.json`).
3. Credentials (push token / per-dev feed creds) supplied via env or `dotnet nuget add source`,
**never committed** — same posture ScadaBridge already documents.
### Per-repo reference wiring
| Repo | Change | Notes |
|---|---|---|
| **ScadaBridge** | Extend existing `packageSourceMapping` to route `ZB.MOM.WW.Health.*` → `dohertj2-gitea`; add 3 CPM `<PackageVersion>` entries; add `<PackageReference>` (no version) to the Host csproj. | Smallest change: already wired for the Gitea feed + CPM. |
| **OtOpcUa** | Add `dohertj2-gitea` source to `NuGet.config` (keep `local-mxgw` folder feed for `MxGateway.*`); add source-mapping (`MxGateway.*`→local, `Health.*`→gitea, `*`→nuget.org) for determinism; add 3 CPM `<PackageVersion>` entries + `<PackageReference>`s. | Keeps its existing folder-feed arrangement untouched. |
| **MxAccessGateway** | Create its **first** `nuget.config` (nuget.org + gitea sources + source-mapping); add a direct `<PackageReference Include="ZB.MOM.WW.Health" Version="0.1.0" />`. | No CPM in this repo — a direct versioned reference is correct; introducing CPM for one package is deliberately avoided. |
Existing `MxGateway.*` distribution arrangements are untouched; only `ZB.MOM.WW.Health.*` is added.
---
## 3. Per-repo integration
### 3a. MxAccessGateway — highest delta (no health infra today)
- Delete the `/health/live` `MapGet` lambda (`GatewayApplication.cs:173`) and the dead
`AddHealthChecks()` (`:66`).
- Re-add `AddHealthChecks()` **with real probes**: register a `GrpcDependencyHealthCheck`
(tag `Ready`) whose `Probe` exercises the **x86 worker IPC gRPC channel** the gateway already
owns; `DependencyName = "mxworker"`, explicit `Timeout`.
- `app.MapZbHealth()` → `/health/ready` (worker reachable), `/health/active`, `/healthz`.
- Update `GatewayApplicationTests` (currently asserts `/health/live` exists) to assert the three
new tier routes; add a worker-down test asserting `ready` = Unhealthy.
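
The shape of the `ready` probe described above can be sketched with stock ASP.NET Core types. This is a minimal illustration, not the library's code: `DelegateDependencyHealthCheck` is a hypothetical stand-in for `GrpcDependencyHealthCheck`, and the `Func<CancellationToken, Task>` probe signature is an assumption, since this design does not pin the delegate type of `GrpcDependencyOptions.Probe`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-in for GrpcDependencyHealthCheck: runs a caller-supplied
// probe against a downstream dependency and bounds it with an explicit timeout.
public sealed class DelegateDependencyHealthCheck : IHealthCheck
{
    private readonly string _dependencyName;
    private readonly Func<CancellationToken, Task> _probe; // assumed delegate shape
    private readonly TimeSpan _timeout;

    public DelegateDependencyHealthCheck(
        string dependencyName, Func<CancellationToken, Task> probe, TimeSpan timeout)
    {
        _dependencyName = dependencyName;
        _probe = probe;
        _timeout = timeout;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Link the caller's token with the timeout so either one cancels the probe.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(_timeout);

        try
        {
            await _probe(cts.Token).ConfigureAwait(false);
            return HealthCheckResult.Healthy($"{_dependencyName} reachable");
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // The timeout fired, not the caller: report the dependency as down.
            return HealthCheckResult.Unhealthy(
                $"{_dependencyName} timed out after {_timeout.TotalSeconds:0.#}s");
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            return HealthCheckResult.Unhealthy($"{_dependencyName} unreachable", ex);
        }
    }
}
```

The timeout is only effective if the probe delegate honours the token it is given. In the gateway, the delegate would wrap whatever cheap call the worker IPC channel already exposes, and the check would be registered under the name `mxworker` with the `Ready` tag.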
### 3b. OtOpcUa — all three packages
- `Host/Health/AkkaClusterHealthCheck.cs` → shared `AkkaClusterHealthCheck` with
**`AkkaClusterStatusPolicy.OtOpcUaCompat`** (preserves self-Up-among-members semantics).
- `AdminRoleLeaderHealthCheck.cs` → shared `ActiveNodeHealthCheck(sp, role: "admin")`.
- `DatabaseHealthCheck.cs` → shared `DatabaseHealthCheck<TContext>` with `ProbeQuery` =
its existing `Deployments.AsNoTracking().Take(1)` query (keeps stricter schema-touch semantics).
- `HealthEndpoints.cs` → `MapZbHealth()` (same tier semantics, canonical writer); register each
probe with the matching `ZbHealthTags`.
- Add a downstream `GrpcDependencyHealthCheck` probing the **MxAccessGateway channel** (tag `Ready`)
— closes the silent-gateway-down gap.
- `Runtime/Health/*` (actor-based driver health) left untouched.
### 3c. ScadaBridge — all three packages
- Three bespoke checks → shared `AkkaClusterHealthCheck` (**`Default`** policy), role-less
`ActiveNodeHealthCheck(sp)`, `DatabaseHealthCheck<TContext>` (default `CanConnectAsync`).
- Switch the DB probe from injected `DbContext` to `IDbContextFactory<TContext>` (background-safe).
- Replace bespoke `ActiveNodeGate.cs` with the shared `IActiveNodeGate` seam + `AkkaActiveNodeGate`
backing (removes duplicated leader logic).
- Add `/healthz` (free via `MapZbHealth()`); swap `UIResponseWriter` for `ZbHealthWriter`.
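
The `IDbContextFactory<TContext>` switch and the default-versus-stricter probe choice can be illustrated as follows. This is a sketch under stated assumptions, not the shared `DatabaseHealthCheck<TContext>`: the class name `FactoryDatabaseHealthCheck` and the `Func<TContext, CancellationToken, Task<bool>>` probe shape are hypothetical, and `CreateDbContextAsync` assumes EF Core 6 or later.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-in for DatabaseHealthCheck<TContext>. Each check creates and
// disposes its own context through the factory, so it never shares a scoped
// DbContext with request handling or background work.
public sealed class FactoryDatabaseHealthCheck<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly IDbContextFactory<TContext> _factory;
    private readonly Func<TContext, CancellationToken, Task<bool>>? _probeQuery;

    public FactoryDatabaseHealthCheck(
        IDbContextFactory<TContext> factory,
        Func<TContext, CancellationToken, Task<bool>>? probeQuery = null)
    {
        _factory = factory;
        _probeQuery = probeQuery;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        await using var db = await _factory.CreateDbContextAsync(cancellationToken);

        // Default: connectivity only. Optional: a stricter query that touches the schema.
        var ok = _probeQuery is null
            ? await db.Database.CanConnectAsync(cancellationToken)
            : await _probeQuery(db, cancellationToken);

        return ok
            ? HealthCheckResult.Healthy()
            : HealthCheckResult.Unhealthy("database probe failed");
    }
}
```

With no probe query this matches the connectivity-only behaviour ScadaBridge keeps; passing a query is the slot where OtOpcUa's existing `Deployments.AsNoTracking().Take(1)` check would go.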
---
## 4. Cross-cutting conventions
- **Tags drive tiers:** every probe is registered with `tags: [ZbHealthTags.Ready|Active|Live]`;
`MapZbHealth()` routes by tag. This is the one mechanical convention each repo must follow.
- **Canonical writer** (`ZbHealthWriter`) everywhere — replaces three different writers
(gateway `GatewayHealthReply`, ScadaBridge `UIResponseWriter`, OtOpcUa default).
- **Auth:** all tiers stay `AllowAnonymous` (matches all three apps today).
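
The tag-to-tier convention can be sketched using only stock ASP.NET Core APIs. This is an illustration of the mechanism, not the library's implementation: `MapTiersByTag` is a hypothetical stand-in for `MapZbHealth()`, the lowercase tag strings stand in for `ZbHealthTags`, the default response writer is used where `ZbHealthWriter` would be set, and pairing the `live` tag with `/healthz` is inferred from the tier list in this design.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// The caller registers probes and tags them; the mapping step only routes by tag.
builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "live" })
    .AddCheck("downstream", () => HealthCheckResult.Healthy("reachable"), tags: new[] { "ready" });

var app = builder.Build();
app.MapTiersByTag();
app.Run();

// Hypothetical stand-in for MapZbHealth(): one endpoint per tier, each filtered by tag.
static class TierMappingSketch
{
    public static void MapTiersByTag(this IEndpointRouteBuilder endpoints)
    {
        Map(endpoints, "/health/ready", "ready");
        Map(endpoints, "/health/active", "active");
        Map(endpoints, "/healthz", "live"); // assumed route-to-tag pairing
    }

    private static void Map(IEndpointRouteBuilder endpoints, string path, string tag) =>
        endpoints.MapHealthChecks(path, new HealthCheckOptions
        {
            // Only checks carrying this tier's tag contribute to the endpoint's status.
            Predicate = registration => registration.Tags.Contains(tag),
        }).AllowAnonymous();
}
```

A probe registered without a tier tag would not appear on any of these endpoints, which is why the tagging convention has to be followed in every repo.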
---
## 5. Sequencing — one PR per repo
The publish-to-Gitea step (§2 Step 0) is a shared prerequisite. After that, each repo PR is
independent. Recommended order:
1. **MxAccessGateway** — highest delta, smallest surface; validates the publish→consume loop and
the canonical writer end-to-end in the simplest app.
2. **OtOpcUa** — exercises all three packages + the `OtOpcUaCompat`/role-filter presets + the
downstream gRPC probe.
3. **ScadaBridge** — heaviest (the `IActiveNodeGate` / `IDbContextFactory` cleanups); done last
with the pattern proven twice.
---
## 6. Behaviour-preservation & error handling
- **No policy change:** presets (`OtOpcUaCompat` vs `Default`) and `RoleFilter="admin"` vs role-less
are chosen so each app's Healthy/Degraded/Unhealthy classifications are unchanged.
- **Fail-soft:** a probe that throws maps to `Unhealthy`, never crashes the host; gRPC/DB probes
carry explicit `Timeout`s.
- **Credentials:** Gitea push token + per-dev feed creds handled out-of-band (env /
`dotnet nuget add source`), never committed — verified by a "no secrets in diff" check per PR.
---
## 7. Testing & verification gates (per repo)
- `dotnet build` + `dotnet test` green **in the sister repo** after adoption (not just scadaproj).
- **MxGateway:** retarget the route-assertion test to the three tiers; add a worker-down → `ready`
= Unhealthy test.
- **OtOpcUa / ScadaBridge:** existing health tests retargeted to the shared types; assert tier→tag
routing and that the preset preserves prior classification (ScadaBridge `Joining` = Healthy;
OtOpcUa self-not-Up = Degraded).
- Check off the corresponding `components/health/GAPS.md` items and update that file to reflect
adoption.
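
A tier-routing test of the kind listed above could look roughly like this. It is a sketch that exercises the stock health-check pipeline directly rather than the HTTP endpoints; xUnit, the test name, and the literal `"ready"` / `"live"` tag strings are assumptions, and a real test would use `ZbHealthTags` and the shared probe types.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Xunit;

public class ReadyTierTests
{
    [Fact]
    public async Task Ready_tier_is_unhealthy_when_worker_probe_throws()
    {
        // Simulates a worker-down condition: the probe throws instead of returning.
        Func<HealthCheckResult> workerDown =
            () => throw new InvalidOperationException("worker down");

        var services = new ServiceCollection();
        services.AddLogging(); // the default HealthCheckService requires a logger
        services.AddHealthChecks()
            .AddCheck("mxworker", workerDown, tags: new[] { "ready" })
            .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "live" });

        await using var provider = services.BuildServiceProvider();
        var health = provider.GetRequiredService<HealthCheckService>();

        // Evaluate each tier the way a tag-filtered endpoint would.
        var ready = await health.CheckHealthAsync(r => r.Tags.Contains("ready"));
        var live = await health.CheckHealthAsync(r => r.Tags.Contains("live"));

        // The throwing probe is caught by the pipeline and reported as Unhealthy,
        // and it does not leak into the tier it is not tagged for.
        Assert.Equal(HealthStatus.Unhealthy, ready.Status);
        Assert.Equal(HealthStatus.Healthy, live.Status);
    }
}
```

The same pattern extends to the preset assertions: register the shared check with the chosen policy, drive it into the state of interest, and assert the resulting `HealthStatus` per tier.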
---
## 8. Risks & open questions
- **MxGateway worker-IPC probe shape** — the exact `Probe` delegate depends on how the gateway holds
the per-session worker channel. Implementation detail; the plan pins it against
`GatewayApplication`'s worker-client wiring.
- **Gitea availability / credentials** in this environment: if the registry is unreachable when
implementation starts, the fallback is the **local folder feed**, which changes only the
`nuget.config` source and no per-repo code. The fallback is flagged explicitly rather than switched silently.
- **CPM in MxGateway** — none today; this design uses a direct versioned `PackageReference` rather
than introducing CPM for one package. Standardizing MxGateway onto CPM is a possible follow-up,
out of scope here.
---
## Next step
Hand off to the **writing-plans** skill to turn this design into a detailed, step-by-step
implementation plan (per-repo tasks, exact edit sites, test changes, commit/PR structure).