docs: design for ZB.MOM.WW.Health adoption across the 3 sister apps
Plan to integrate the built-but-unadopted Health library into OtOpcUa, MxAccessGateway, and ScadaBridge: Gitea-registry distribution, per-repo behaviour-preserving probe swaps (preset-based), canonical tiers + writer, MxGateway-first sequencing.
# Adopt `ZB.MOM.WW.Health` across the three sister apps — design

**Date:** 2026-06-01
**Status:** Approved (design); implementation plan to follow via writing-plans.
**Scope:** Integrate the built-but-unadopted `ZB.MOM.WW.Health` shared library into all three
sister apps — **OtOpcUa**, **MxAccessGateway**, **ScadaBridge** — replacing each app's bespoke
health-check wiring with the shared probes, tiers, and writer.

This is the first full cross-fleet adoption of one of the six shared `ZB.MOM.WW.*` libraries.
It follows the adoption backlog in [`components/health/GAPS.md`](../../components/health/GAPS.md),
re-verified against current code on 2026-06-01.

---

## 1. Goal & scope

Replace each app's bespoke health-check wiring with `ZB.MOM.WW.Health`, **preserving each app's
existing health policy** — the library ships presets precisely so that no app's Healthy / Degraded
/ Unhealthy classifications change. Outcome:

- All three apps expose the canonical tiers `/health/ready`, `/health/active`, `/healthz` with the
  canonical JSON writer (`ZbHealthWriter`).
- **MxAccessGateway gains real health checks for the first time** (today its `/health/live` is a
  hardcoded `"Healthy"` lambda that bypasses the ASP.NET Core health-check pipeline, and its
  `AddHealthChecks()` call is dead code).
- No breaking external contract; no metric, dashboard, or wire-format change; no ops coordination.

**Out of scope:** OtOpcUa's actor-based `Runtime/Health/*` *driver* health (a different concern —
OPC UA driver connectivity, not the ASP.NET health-endpoint tier), and ScadaBridge's distributed
health-monitoring pipeline beyond the endpoint probes.

### Library public surface this design depends on (code-verified)

| API | Package | Use |
|---|---|---|
| `IEndpointRouteBuilder.MapZbHealth(ZbHealthEndpointOptions?)` | `ZB.MOM.WW.Health` | Maps `ready`/`active`/`live` tiers by tag. Does **not** call `AddHealthChecks()` — caller registers probes + tags. |
| `ZbHealthTags.Ready / Active / Live` | `ZB.MOM.WW.Health` | Tag each probe so `MapZbHealth` routes it to the right tier. |
| `ZbHealthWriter` | `ZB.MOM.WW.Health` | Canonical JSON response writer. |
| `GrpcDependencyHealthCheck` + `GrpcDependencyOptions { Probe, DependencyName, Timeout }` | `ZB.MOM.WW.Health` | Probe a downstream gRPC channel. |
| `IActiveNodeGate` (+ `AkkaActiveNodeGate`) | `ZB.MOM.WW.Health` / `.Akka` | Active-node seam, replacing duplicated leader logic. |
| `AkkaClusterStatusPolicy.Default` / `.OtOpcUaCompat` → `AkkaClusterHealthCheck(sp, policy)` | `ZB.MOM.WW.Health.Akka` | Cluster-membership probe with per-app preset. |
| `ActiveNodeHealthCheck(sp)` / `(sp, string role)` | `ZB.MOM.WW.Health.Akka` | Active/leader probe, role-filtered overload. |
| `DatabaseHealthCheck<TContext>` + `DatabaseHealthCheckOptions<TContext> { ProbeQuery, Timeout }` | `ZB.MOM.WW.Health.EntityFrameworkCore` | DB probe; default `CanConnectAsync`, optional stricter `ProbeQuery`. |

**Consumer matrix:** MxGateway → `ZB.MOM.WW.Health` (core) only; OtOpcUa & ScadaBridge → all three.

---

## 2. Distribution & referencing — Gitea registry (chosen)

The family is already inconsistent in how it distributes shared `ZB.MOM.WW.*` packages:
OtOpcUa uses a committed local folder feed (`./nuget-packages/`), ScadaBridge uses the Gitea NuGet
registry + package-source-mapping, and MxAccessGateway has no `nuget.config` (it is the *producer*
of `MxGateway.*`). We standardize Health distribution on the **Gitea NuGet registry** — the only
mechanism that gives a single versioned source of truth, commits no binaries, and is already proven
in this family (ScadaBridge consumes `MxGateway.*` exactly this way).

### Step 0 — publish (one-time per version, prerequisite for all repos)

From `scadaproj`:

1. `dotnet pack` the three Health projects (already emit `0.1.0` nupkgs).
2. `dotnet nuget push` the three packages to the `dohertj2-gitea` feed
   (`https://gitea.dohertylan.com/api/packages/dohertj2/nuget/index.json`).
3. Credentials (push token / per-dev feed creds) supplied via env or `dotnet nuget add source`,
   **never committed** — same posture ScadaBridge already documents.

### Per-repo reference wiring

| Repo | Change | Notes |
|---|---|---|
| **ScadaBridge** | Extend existing `packageSourceMapping` to route `ZB.MOM.WW.Health.*` → `dohertj2-gitea`; add 3 CPM `<PackageVersion>` entries; add `<PackageReference>` (no version) to the Host csproj. | Smallest change — already wired for the Gitea feed + CPM. |
| **OtOpcUa** | Add `dohertj2-gitea` source to `NuGet.config` (keep `local-mxgw` folder feed for `MxGateway.*`); add source-mapping (`MxGateway.*`→local, `Health.*`→gitea, `*`→nuget.org) for determinism; add 3 CPM `<PackageVersion>` entries + `<PackageReference>`s. | Keeps its existing folder-feed arrangement untouched. |
| **MxAccessGateway** | Create its **first** `nuget.config` (nuget.org + gitea sources + source-mapping); add a direct `<PackageReference Include="ZB.MOM.WW.Health" Version="0.1.0" />`. | No CPM in this repo — a direct versioned reference is correct; introducing CPM for one package is deliberately avoided. |

Existing `MxGateway.*` distribution arrangements are untouched; only `ZB.MOM.WW.Health.*` is added.
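
As a rough illustration of the OtOpcUa row above, a `NuGet.config` with source mapping could look like the fragment below. The `local-mxgw` and `dohertj2-gitea` keys, the folder path, and the registry URL come from this design; the `nuget.org` key and the exact patterns are assumptions to be checked against the repo. One detail worth verifying during implementation: NuGet prefix patterns match the literal text before the `*`, so `ZB.MOM.WW.Health.*` on its own may not cover the core `ZB.MOM.WW.Health` package ID, which is why the sketch also lists that ID explicitly.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <clear />
    <add key="nuget.org" value="https://api.nuget.org/v3/index.json" />
    <!-- Existing committed folder feed for MxGateway.* (path taken from this design). -->
    <add key="local-mxgw" value="./nuget-packages/" />
    <!-- Gitea registry for the Health packages; credentials are supplied out-of-band. -->
    <add key="dohertj2-gitea" value="https://gitea.dohertylan.com/api/packages/dohertj2/nuget/index.json" />
  </packageSources>
  <packageSourceMapping>
    <packageSource key="local-mxgw">
      <package pattern="MxGateway.*" />
    </packageSource>
    <packageSource key="dohertj2-gitea">
      <!-- Exact ID for the core package, prefix pattern for the satellites. -->
      <package pattern="ZB.MOM.WW.Health" />
      <package pattern="ZB.MOM.WW.Health.*" />
    </packageSource>
    <packageSource key="nuget.org">
      <package pattern="*" />
    </packageSource>
  </packageSourceMapping>
</configuration>
```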

---

## 3. Per-repo integration

### 3a. MxAccessGateway — highest delta (no health infra today)

- Delete the `/health/live` `MapGet` lambda (`GatewayApplication.cs:173`) and the dead
  `AddHealthChecks()` (`:66`).
- Re-add `AddHealthChecks()` **with real probes**: register a `GrpcDependencyHealthCheck`
  (tag `Ready`) whose `Probe` exercises the **x86 worker IPC gRPC channel** the gateway already
  owns; `DependencyName = "mxworker"`, explicit `Timeout`.
- `app.MapZbHealth()` → `/health/ready` (worker reachable), `/health/active`, `/healthz`.
- Update `GatewayApplicationTests` (currently asserts `/health/live` exists) to assert the three
  new tier routes; add a worker-down test asserting `ready` = Unhealthy.

### 3b. OtOpcUa — all three packages

- `Host/Health/AkkaClusterHealthCheck.cs` → shared `AkkaClusterHealthCheck` with
  **`AkkaClusterStatusPolicy.OtOpcUaCompat`** (preserves self-Up-among-members semantics).
- `AdminRoleLeaderHealthCheck.cs` → shared `ActiveNodeHealthCheck(sp, role: "admin")`.
- `DatabaseHealthCheck.cs` → shared `DatabaseHealthCheck<TContext>` with `ProbeQuery` =
  its existing `Deployments.AsNoTracking().Take(1)` query (keeps stricter schema-touch semantics).
- `HealthEndpoints.cs` → `MapZbHealth()` (same tier semantics, canonical writer); register each
  probe with the matching `ZbHealthTags`.
- Add a downstream `GrpcDependencyHealthCheck` probing the **MxAccessGateway channel** (tag `Ready`)
  — closes the silent-gateway-down gap.
- `Runtime/Health/*` (actor-based driver health) left untouched.

### 3c. ScadaBridge — all three packages

- Three bespoke checks → shared `AkkaClusterHealthCheck` (**`Default`** policy), role-less
  `ActiveNodeHealthCheck(sp)`, `DatabaseHealthCheck<TContext>` (default `CanConnectAsync`).
- Switch the DB probe from injected `DbContext` to `IDbContextFactory<TContext>` (background-safe).
- Replace bespoke `ActiveNodeGate.cs` with the shared `IActiveNodeGate` seam + `AkkaActiveNodeGate`
  backing (removes duplicated leader logic).
- Add `/healthz` (free via `MapZbHealth()`); swap `UIResponseWriter` for `ZbHealthWriter`.
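
The difference between the default connectivity probe and a stricter `ProbeQuery`, together with the move to `IDbContextFactory<TContext>`, can be pictured with a stand-in check. This is only a sketch of the idea, not the shared library's implementation: `FactoryDbProbeCheck<TContext>` and the shape of its `probeQuery` delegate are hypothetical, while `IDbContextFactory<TContext>.CreateDbContextAsync` and `Database.CanConnectAsync` are standard EF Core calls. Under these assumptions, OtOpcUa would pass a delegate wrapping its existing `Deployments.AsNoTracking().Take(1)` query and ScadaBridge would pass none.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-in, not the shared DatabaseHealthCheck<TContext>.
public sealed class FactoryDbProbeCheck<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly IDbContextFactory<TContext> _factory;
    private readonly Func<TContext, CancellationToken, Task>? _probeQuery;

    public FactoryDbProbeCheck(
        IDbContextFactory<TContext> factory,
        Func<TContext, CancellationToken, Task>? probeQuery = null)
    {
        _factory = factory;
        _probeQuery = probeQuery;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            // A fresh context per check avoids sharing a scoped DbContext
            // with whatever else is running on the host.
            await using var db = await _factory.CreateDbContextAsync(cancellationToken);

            if (_probeQuery is null)
            {
                // Default semantics: connectivity only.
                return await db.Database.CanConnectAsync(cancellationToken)
                    ? HealthCheckResult.Healthy("database reachable")
                    : HealthCheckResult.Unhealthy("database unreachable");
            }

            // Stricter semantics: the query itself must succeed,
            // so a missing table or broken schema fails the probe.
            await _probeQuery(db, cancellationToken);
            return HealthCheckResult.Healthy("probe query succeeded");
        }
        catch (Exception ex)
        {
            // Any failure is reported as Unhealthy instead of escaping the check.
            return HealthCheckResult.Unhealthy("database probe failed", ex);
        }
    }
}
```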

---

## 4. Cross-cutting conventions

- **Tags drive tiers:** every probe is registered with `tags: [ZbHealthTags.Ready|Active|Live]`;
  `MapZbHealth()` routes by tag. This is the one mechanical convention each repo must follow.
- **Canonical writer** (`ZbHealthWriter`) everywhere — replaces three different writers
  (gateway `GatewayHealthReply`, ScadaBridge `UIResponseWriter`, OtOpcUa default).
- **Auth:** all tiers stay `AllowAnonymous` (matches all three apps today).
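
To make the tag convention concrete, here is a minimal sketch of tag-driven tier routing built only on stock ASP.NET Core `MapHealthChecks`. It illustrates the mechanism `MapZbHealth()` is described as providing and is not that method's implementation: `TierRoutingSketch`, its string tags, the `self` check, and the pairing of the `live` tag with `/healthz` are assumptions made for illustration.

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-in for the shared tags and mapping extension.
public static class TierRoutingSketch
{
    public const string Ready = "ready";
    public const string Active = "active";
    public const string Live = "live";

    // Selects only the registrations that carry the given tier tag.
    public static Func<HealthCheckRegistration, bool> ForTag(string tag) =>
        registration => registration.Tags.Contains(tag);

    // Registration side: the tag is what places a probe in a tier.
    public static IHealthChecksBuilder AddSelfCheck(this IServiceCollection services) =>
        services.AddHealthChecks()
            .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { Live });

    // Mapping side: one endpoint per tier, each running just the probes tagged for it.
    public static void MapTieredHealth(this IEndpointRouteBuilder endpoints)
    {
        MapTier(endpoints, "/health/ready", Ready);
        MapTier(endpoints, "/health/active", Active);
        MapTier(endpoints, "/healthz", Live);
    }

    private static void MapTier(IEndpointRouteBuilder endpoints, string route, string tag) =>
        endpoints
            .MapHealthChecks(route, new HealthCheckOptions { Predicate = ForTag(tag) })
            .AllowAnonymous(); // tiers stay anonymous, matching all three apps today
}
```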

---

## 5. Sequencing — one PR per repo

The publish-to-Gitea step (§2 Step 0) is a shared prerequisite. After that, each repo PR is
independent. Recommended order:

1. **MxAccessGateway** — highest delta, smallest surface; validates the publish→consume loop and
   the canonical writer end-to-end in the simplest app.
2. **OtOpcUa** — exercises all three packages + the `OtOpcUaCompat`/role-filter presets + the
   downstream gRPC probe.
3. **ScadaBridge** — heaviest (the `IActiveNodeGate` / `IDbContextFactory` cleanups); done last
   with the pattern proven twice.

---

## 6. Behaviour-preservation & error handling

- **No policy change:** presets (`OtOpcUaCompat` vs `Default`) and `RoleFilter="admin"` vs role-less
  are chosen so each app's Healthy/Degraded/Unhealthy classifications are unchanged.
- **Fail-soft:** a probe that throws maps to `Unhealthy`, never crashes the host; gRPC/DB probes
  carry explicit `Timeout`s.
- **Credentials:** Gitea push token + per-dev feed creds handled out-of-band (env /
  `dotnet nuget add source`), never committed — verified by a "no secrets in diff" check per PR.
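
The fail-soft rule and the explicit timeouts can be sketched as a small wrapper around an arbitrary probe delegate. `TimeBoundedProbeCheck` is a hypothetical stand-in written against the standard `IHealthCheck` interface; it is not the shared `GrpcDependencyHealthCheck`, and its constructor shape is assumed for illustration. It relies on `Task.WaitAsync`, so it assumes .NET 6 or later.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-in: bounds a probe by a timeout and maps failures to Unhealthy.
public sealed class TimeBoundedProbeCheck : IHealthCheck
{
    private readonly string _dependencyName;
    private readonly TimeSpan _timeout;
    private readonly Func<CancellationToken, Task> _probe;

    public TimeBoundedProbeCheck(
        string dependencyName,
        TimeSpan timeout,
        Func<CancellationToken, Task> probe)
    {
        _dependencyName = dependencyName;
        _timeout = timeout;
        _probe = probe;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Cancel the probe once the timeout elapses or the caller gives up.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(_timeout);

        try
        {
            // WaitAsync also bounds probes that ignore the cancellation token.
            await _probe(cts.Token).WaitAsync(_timeout, cancellationToken);
            return HealthCheckResult.Healthy($"{_dependencyName} reachable");
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // The caller cancelled the whole health request; this is not a probe failure.
            throw;
        }
        catch (TimeoutException)
        {
            return HealthCheckResult.Unhealthy($"{_dependencyName} timed out after {_timeout}");
        }
        catch (Exception ex)
        {
            // Fail-soft: a throwing probe is reported as Unhealthy.
            return HealthCheckResult.Unhealthy($"{_dependencyName} probe failed", ex);
        }
    }
}
```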

---

## 7. Testing & verification gates (per repo)

- `dotnet build` + `dotnet test` green **in the sister repo** after adoption (not just scadaproj).
- **MxGateway:** retarget the route-assertion test to the three tiers; add a worker-down → `ready`
  = Unhealthy test.
- **OtOpcUa / ScadaBridge:** existing health tests retargeted to the shared types; assert tier→tag
  routing and that the preset preserves prior classification (ScadaBridge `Joining` = Healthy;
  OtOpcUa self-not-Up = Degraded).
- Check off the corresponding `components/health/GAPS.md` items and update that file to reflect
  adoption.

---

## 8. Risks & open questions

- **MxGateway worker-IPC probe shape** — the exact `Probe` delegate depends on how the gateway holds
  the per-session worker channel. Implementation detail; the plan pins it against
  `GatewayApplication`'s worker-client wiring.
- **Gitea availability / credentials** in this environment — if the registry is unreachable when
  implementation starts, the fallback is the **local folder feed**, which changes only the
  `nuget.config` source and no per-repo code. This is flagged explicitly rather than switched silently.
- **CPM in MxGateway** — none today; this design uses a direct versioned `PackageReference` rather
  than introducing CPM for one package. Standardizing MxGateway onto CPM is a possible follow-up,
  out of scope here.

---

## Next step

Hand off to the **writing-plans** skill to turn this design into a detailed, step-by-step
implementation plan (per-repo tasks, exact edit sites, test changes, commit/PR structure).