scadaproj/docs/plans/2026-06-01-health-library-adoption-design.md
Joseph Doherty f72403d6f0 docs: design for ZB.MOM.WW.Health adoption across the 3 sister apps
Plan to integrate the built-but-unadopted Health library into OtOpcUa,
MxAccessGateway, and ScadaBridge: Gitea-registry distribution, per-repo
behaviour-preserving probe swaps (preset-based), canonical tiers + writer,
MxGateway-first sequencing.
2026-06-01 13:01:36 -04:00

Adopt ZB.MOM.WW.Health across the three sister apps — design

Date: 2026-06-01
Status: Approved (design); implementation plan to follow via writing-plans.
Scope: Integrate the built-but-unadopted ZB.MOM.WW.Health shared library into all three sister apps (OtOpcUa, MxAccessGateway, ScadaBridge), replacing each app's bespoke health-check wiring with the shared probes, tiers, and writer.

This is the first full cross-fleet adoption of one of the six shared ZB.MOM.WW.* libraries. It follows the adoption backlog in components/health/GAPS.md, re-verified against current code on 2026-06-01.


1. Goal & scope

Replace each app's bespoke health-check wiring with ZB.MOM.WW.Health, preserving each app's existing health policy: the library ships presets precisely so that no app's Healthy / Degraded / Unhealthy classifications change. Outcome:

  • All three apps expose the canonical tiers /health/ready, /health/active, /healthz with the canonical JSON writer (ZbHealthWriter).
  • MxAccessGateway gains real health checks for the first time (today its /health/live is a hardcoded "Healthy" lambda that bypasses the ASP.NET Core health-check pipeline, and its AddHealthChecks() call is dead code).
  • No breaking external contract; no metric, dashboard, or wire-format change; no ops coordination.

Out of scope: OtOpcUa's actor-based Runtime/Health/* driver health (a different concern: OPC UA driver connectivity, not the ASP.NET health-endpoint tier), and ScadaBridge's distributed health-monitoring pipeline beyond the endpoint probes.

Library public surface this design depends on (code-verified)

| API | Package | Use |
|---|---|---|
| IEndpointRouteBuilder.MapZbHealth(ZbHealthEndpointOptions?) | ZB.MOM.WW.Health | Maps ready/active/live tiers by tag. Does not call AddHealthChecks(); the caller registers probes + tags. |
| ZbHealthTags.Ready / Active / Live | ZB.MOM.WW.Health | Tag each probe so MapZbHealth routes it to the right tier. |
| ZbHealthWriter | ZB.MOM.WW.Health | Canonical JSON response writer. |
| GrpcDependencyHealthCheck + GrpcDependencyOptions { Probe, DependencyName, Timeout } | ZB.MOM.WW.Health | Probe a downstream gRPC channel. |
| IActiveNodeGate (+ AkkaActiveNodeGate) | ZB.MOM.WW.Health / .Akka | Active-node seam, replacing duplicated leader logic. |
| AkkaClusterStatusPolicy.Default / .OtOpcUaCompat; AkkaClusterHealthCheck(sp, policy) | ZB.MOM.WW.Health.Akka | Cluster-membership probe with per-app preset. |
| ActiveNodeHealthCheck(sp) / (sp, string role) | ZB.MOM.WW.Health.Akka | Active/leader probe, role-filtered overload. |
| DatabaseHealthCheck<TContext> + DatabaseHealthCheckOptions<TContext> { ProbeQuery, Timeout } | ZB.MOM.WW.Health.EntityFrameworkCore | DB probe; default CanConnectAsync, optional stricter ProbeQuery. |

Consumer matrix: MxGateway → ZB.MOM.WW.Health (core) only; OtOpcUa & ScadaBridge → all three.


2. Distribution & referencing — Gitea registry (chosen)

The family is already inconsistent in how it distributes shared ZB.MOM.WW.* packages: OtOpcUa uses a committed local folder feed (./nuget-packages/), ScadaBridge uses the Gitea NuGet registry + package-source-mapping, MxAccessGateway has no nuget.config (it is the producer of MxGateway.*). We standardize Health distribution on the Gitea NuGet registry — the only mechanism that gives a single versioned source of truth, commits no binaries, and is already proven in this family (ScadaBridge consumes MxGateway.* exactly this way).

Step 0 — publish (one-time per version, prerequisite for all repos)

From scadaproj:

  1. dotnet pack the three Health projects (they already emit 0.1.0 nupkgs).
  2. dotnet nuget push the three packages to the dohertj2-gitea feed (https://gitea.dohertylan.com/api/packages/dohertj2/nuget/index.json).
  3. Credentials (push token / per-dev feed creds) supplied via env or dotnet nuget add source, never committed — same posture ScadaBridge already documents.

Per-repo reference wiring

| Repo | Change | Notes |
|---|---|---|
| ScadaBridge | Extend existing packageSourceMapping to route ZB.MOM.WW.Health.* → dohertj2-gitea; add 3 CPM <PackageVersion> entries; add <PackageReference> (no version) to the Host csproj. | Smallest change: already wired for the Gitea feed + CPM. |
| OtOpcUa | Add dohertj2-gitea source to NuGet.config (keep local-mxgw folder feed for MxGateway.*); add source-mapping (MxGateway.*→local, Health.*→gitea, *→nuget.org) for determinism; add 3 CPM <PackageVersion> entries + <PackageReference>s. | Keeps its existing folder-feed arrangement untouched. |
| MxAccessGateway | Create its first nuget.config (nuget.org + gitea sources + source-mapping); add a direct <PackageReference Include="ZB.MOM.WW.Health" Version="0.1.0" />. | No CPM in this repo, so a direct versioned reference is correct; introducing CPM for one package is deliberately avoided. |

Existing MxGateway.* distribution arrangements are untouched; only ZB.MOM.WW.Health.* is added.


3. Per-repo integration

3a. MxAccessGateway — highest delta (no health infra today)

  • Delete the /health/live MapGet lambda (GatewayApplication.cs:173) and the dead AddHealthChecks() (:66).
  • Re-add AddHealthChecks() with real probes: register a GrpcDependencyHealthCheck (tag Ready) whose Probe exercises the x86 worker IPC gRPC channel the gateway already owns; DependencyName = "mxworker", explicit Timeout.
  • app.MapZbHealth() → /health/ready (worker reachable), /health/active, /healthz.
  • Update GatewayApplicationTests (currently asserts /health/live exists) to assert the three new tier routes; add a worker-down test asserting ready = Unhealthy.

3b. OtOpcUa — all three packages

  • Host/Health/AkkaClusterHealthCheck.cs → shared AkkaClusterHealthCheck with AkkaClusterStatusPolicy.OtOpcUaCompat (preserves self-Up-among-members semantics).
  • AdminRoleLeaderHealthCheck.cs → shared ActiveNodeHealthCheck(sp, role: "admin").
  • DatabaseHealthCheck.cs → shared DatabaseHealthCheck<TContext> with ProbeQuery = its existing Deployments.AsNoTracking().Take(1) query (keeps stricter schema-touch semantics).
  • HealthEndpoints.cs → MapZbHealth() (same tier semantics, canonical writer); register each probe with the matching ZbHealthTags.
  • Add a downstream GrpcDependencyHealthCheck probing the MxAccessGateway channel (tag Ready) — closes the silent-gateway-down gap.
  • Runtime/Health/* (actor-based driver health) left untouched.

3c. ScadaBridge — all three packages

  • Three bespoke checks → shared AkkaClusterHealthCheck (Default policy), role-less ActiveNodeHealthCheck(sp), DatabaseHealthCheck<TContext> (default CanConnectAsync).
  • Switch the DB probe from injected DbContext to IDbContextFactory<TContext> (background-safe).
  • Replace bespoke ActiveNodeGate.cs with the shared IActiveNodeGate seam + AkkaActiveNodeGate backing (removes duplicated leader logic).
  • Add /healthz (free via MapZbHealth()); swap UIResponseWriter for ZbHealthWriter.
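
The switch from an injected DbContext to IDbContextFactory<TContext> can be pictured with the minimal sketch below. It is not the shared DatabaseHealthCheck<TContext>, whose internals this document does not show: FactoryBackedDbProbe is a hypothetical name, and only standard EF Core and ASP.NET Core health-check APIs are used. It assumes the app registers the factory with AddDbContextFactory<TContext>().

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-in for illustration; not the library's DatabaseHealthCheck<TContext>.
public sealed class FactoryBackedDbProbe<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly IDbContextFactory<TContext> _factory;

    public FactoryBackedDbProbe(IDbContextFactory<TContext> factory) => _factory = factory;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Create a short-lived context for this probe only and dispose it afterwards,
        // instead of borrowing a context instance that other code may be using.
        await using var db = await _factory.CreateDbContextAsync(cancellationToken);

        // CanConnectAsync only checks reachability; a stricter query would go here.
        return await db.Database.CanConnectAsync(cancellationToken)
            ? HealthCheckResult.Healthy()
            : new HealthCheckResult(context.Registration.FailureStatus, "database unreachable");
    }
}
```

Because DbContext instances are not thread-safe, giving each probe run its own instance keeps the check independent of whatever else is using the database at that moment.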

4. Cross-cutting conventions

  • Tags drive tiers: every probe is registered with tags: [ZbHealthTags.Ready|Active|Live]; MapZbHealth() routes by tag. This is the one mechanical convention each repo must follow.
  • Canonical writer (ZbHealthWriter) everywhere — replaces three different writers (gateway GatewayHealthReply, ScadaBridge UIResponseWriter, OtOpcUa default).
  • Auth: all tiers stay AllowAnonymous (matches all three apps today).
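
The tag convention can be illustrated with plain ASP.NET Core health-check APIs. The sketch below is an assumption about the general mechanism, not the implementation of MapZbHealth(): the tag strings, probe names, and the MapTier helper are hypothetical, and the real tag values live in ZbHealthTags.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Assumed tag strings for illustration; the library defines its own in ZbHealthTags.
const string Ready = "ready", Active = "active", Live = "live";

// Each probe is registered once and tagged with the tier it belongs to.
builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { Live })
    .AddCheck("dependency", () => HealthCheckResult.Healthy(), tags: new[] { Ready })
    .AddCheck("active-node", () => HealthCheckResult.Healthy(), tags: new[] { Active });

var app = builder.Build();

// One endpoint per tier; each endpoint runs only the probes carrying that tier's tag.
MapTier(app, "/health/ready", Ready);
MapTier(app, "/health/active", Active);
MapTier(app, "/healthz", Live);

app.Run();

static void MapTier(WebApplication host, string path, string tag) =>
    host.MapHealthChecks(path, new HealthCheckOptions
    {
        // Filter the registered checks down to the ones tagged for this tier.
        Predicate = registration => registration.Tags.Contains(tag),
    }).AllowAnonymous();
```

A probe registered without a tier tag would be skipped by every endpoint in this arrangement, which is why the tagging rule is the one convention each repo has to follow mechanically.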

5. Sequencing — one PR per repo

The publish-to-Gitea step (§2 Step 0) is a shared prerequisite. After that, each repo PR is independent. Recommended order:

  1. MxAccessGateway — highest delta, smallest surface; validates the publish→consume loop and the canonical writer end-to-end in the simplest app.
  2. OtOpcUa — exercises all three packages + the OtOpcUaCompat/role-filter presets + the downstream gRPC probe.
  3. ScadaBridge — heaviest (the IActiveNodeGate / IDbContextFactory cleanups); done last with the pattern proven twice.

6. Behaviour-preservation & error handling

  • No policy change: presets (OtOpcUaCompat vs Default) and the role filter (role: "admin" vs role-less) are chosen so each app's Healthy/Degraded/Unhealthy classifications are unchanged.
  • Fail-soft: a probe that throws maps to Unhealthy, never crashes the host; gRPC/DB probes carry explicit Timeouts.
  • Credentials: Gitea push token + per-dev feed creds handled out-of-band (env / dotnet nuget add source), never committed — verified by a "no secrets in diff" check per PR.
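
The fail-soft and explicit-timeout rules can be sketched as a small wrapper. This is a hypothetical illustration, not the library's GrpcDependencyHealthCheck: the TimeBoundedProbe name and the Func<CancellationToken, Task> probe shape are assumptions, and the probe delegate must honour the token it is given for the timeout to take effect.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical wrapper for illustration; names and probe shape are assumptions.
public sealed class TimeBoundedProbe : IHealthCheck
{
    private readonly Func<CancellationToken, Task> _probe;
    private readonly TimeSpan _timeout;

    public TimeBoundedProbe(Func<CancellationToken, Task> probe, TimeSpan timeout)
    {
        _probe = probe;
        _timeout = timeout;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Cancel the probe when either the caller cancels or the timeout elapses.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(_timeout);

        try
        {
            await _probe(cts.Token);
            return HealthCheckResult.Healthy();
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // The caller did not cancel, so the cancellation came from the probe side,
            // typically the timeout above.
            return HealthCheckResult.Unhealthy("probe cancelled or timed out");
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Any other failure is reported as Unhealthy instead of escaping to the host.
            return HealthCheckResult.Unhealthy("probe failed", ex);
        }
    }
}
```

The standard ASP.NET Core health-check service also converts an exception thrown by a check into that registration's failure status (Unhealthy by default), so the explicit catch mainly controls the description that reaches the response writer.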

7. Testing & verification gates (per repo)

  • dotnet build + dotnet test green in the sister repo after adoption (not just scadaproj).
  • MxGateway: retarget the route-assertion test to the three tiers; add a worker-down → ready = Unhealthy test.
  • OtOpcUa / ScadaBridge: existing health tests retargeted to the shared types; assert tier→tag routing and that the preset preserves prior classification (ScadaBridge Joining = Healthy; OtOpcUa self-not-Up = Degraded).
  • Check off the corresponding components/health/GAPS.md items and update that file to reflect adoption.
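
A worker-down test could look like the following minimal sketch. It runs the standard HealthCheckService in-process without HTTP; the "ready" tag string and the failing delegate are stand-ins for the real ZbHealthTags value and the gateway's worker probe, and a real test would use the repo's test framework assertions instead of the manual guard.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var services = new ServiceCollection();
services.AddLogging(); // the default HealthCheckService depends on ILogger<T>

// Stand-in probe that always fails, simulating an unreachable worker.
services.AddHealthChecks().AddAsyncCheck(
    "mxworker",
    _ => Task.FromException<HealthCheckResult>(new IOException("worker unreachable")),
    tags: new[] { "ready" }); // assumed tag string

await using var provider = services.BuildServiceProvider();
var health = provider.GetRequiredService<HealthCheckService>();

// Evaluate only the checks that belong to the ready tier.
var report = await health.CheckHealthAsync(r => r.Tags.Contains("ready"));

if (report.Status != HealthStatus.Unhealthy)
    throw new InvalidOperationException($"expected Unhealthy, got {report.Status}");

Console.WriteLine(report.Entries["mxworker"].Status);
```

The same shape extends to the preset checks: swap the stand-in probe for the shared check under test and assert the classification the app had before adoption.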

8. Risks & open questions

  • MxGateway worker-IPC probe shape — the exact Probe delegate depends on how the gateway holds the per-session worker channel. Implementation detail; the plan pins it against GatewayApplication's worker-client wiring.
  • Gitea availability / credentials in this environment: if the registry is unreachable when implementation starts, the fallback is the local folder feed. That changes only the nuget.config source, not any per-repo code, and is flagged explicitly rather than switched silently.
  • CPM in MxGateway — none today; this design uses a direct versioned PackageReference rather than introducing CPM for one package. Standardizing MxGateway onto CPM is a possible follow-up, out of scope here.

Next step

Hand off to the writing-plans skill to turn this design into a detailed, step-by-step implementation plan (per-repo tasks, exact edit sites, test changes, commit/PR structure).