Health — gaps & adoption backlog
Divergence of each project from spec/SPEC.md, and the ordered backlog to
reach the shared ZB.MOM.WW.Health library. Status legend: ⛔ gap · 🟡 partial · ✅ matches.
Divergence vs spec
§1 Endpoint tiers
| Spec tier | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| /health/ready (tag ready) | ✅ present | ⛔ absent | ✅ present (name-predicate) |
| /health/active (tag active) | ✅ present | ⛔ absent | ✅ present (name-predicate) |
| /healthz (bare process liveness) | ✅ present | ⛔ absent | ⛔ absent |
| /health/live (non-standard) | n/a | ⛔ present (hardcoded "Healthy", bypasses health-check pipeline) | n/a |
→ Gap T1 (P1): MxAccessGateway has no standard health tiers. The existing /health/live
MapGet lambda must be replaced by app.MapZbHealth() + real probes.
→ Gap T2: ScadaBridge lacks /healthz. MapZbHealth() adds it automatically.
→ Gap T3: MxAccessGateway's /health/live uses a raw MapGet that bypasses the ASP.NET Core
health-check middleware — it does not participate in IHealthCheckPublisher, HealthReport, or
UI integration. Must be removed.
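The tier table can be read as a tag filter over the registered checks, with the worst result deciding the tier's status. The following is a minimal, language-neutral sketch in Python, not the MapZbHealth() implementation. The Check type, the TIER_TAGS table, and the assumption that /healthz runs no checks at all are illustrative and do not come from the spec.

```python
from dataclasses import dataclass
from typing import Callable

# Severity order used to aggregate a tier: the worst entry wins.
SEVERITY = {"Healthy": 0, "Degraded": 1, "Unhealthy": 2}


@dataclass(frozen=True)
class Check:
    """Hypothetical stand-in for a registered health check."""
    name: str
    tags: frozenset
    run: Callable[[], str]  # returns "Healthy", "Degraded" or "Unhealthy"


# Hypothetical tier table: endpoint path -> required tag.
# None means "run no checks" (assumed meaning of bare process liveness).
TIER_TAGS = {
    "/health/ready": "ready",
    "/health/active": "active",
    "/healthz": None,
}


def evaluate_tier(path: str, checks: list) -> str:
    """Run only the checks carrying the tier's tag and return the worst status."""
    tag = TIER_TAGS[path]
    if tag is None:
        # Liveness: the process answered, so it is alive.
        return "Healthy"
    results = [check.run() for check in checks if tag in check.tags]
    return max(results, key=SEVERITY.__getitem__, default="Healthy")
```

Under these assumptions, a failing check tagged ready turns /health/ready Unhealthy while /health/active and /healthz stay Healthy. The hardcoded /health/live lambda cannot behave this way because it never consults any registered check.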
§2 Probe coverage
| Probe | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Database connectivity | ✅ DatabaseHealthCheck (query probe) | ⛔ none | ✅ DatabaseHealthCheck (CanConnectAsync) |
| Akka cluster membership | ✅ AkkaClusterHealthCheck (2-way) | n/a (no Akka) | ✅ AkkaClusterHealthCheck (3-way) |
| Active / leader node | ✅ AdminRoleLeaderHealthCheck (role-filtered) | n/a | ✅ ActiveNodeHealthCheck (role-less) |
| Downstream gRPC dependency | ⛔ none | ⛔ none | ⛔ none |
→ Gap P1 (P1): MxAccessGateway has zero probes — AddHealthChecks() at
GatewayApplication.cs:61 is dead code. Minimum viable: a GrpcDependencyHealthCheck
targeting the x86 worker IPC channel.
→ Gap P2: No project probes its downstream gRPC dependency. OtOpcUa should probe the
MxAccessGateway channel; MxAccessGateway should probe the worker IPC.
→ Gap P3: Dead AddHealthChecks() in MxAccessGateway (GatewayApplication.cs:61) should be
removed or replaced — it currently implies health checks are configured when they are not.
§3 Akka status-policy divergence
| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe implementation | Scans State.Members for self by address | Reads SelfMember.Status directly |
| Joining status | Degraded (not in Members as Up) | Healthy |
| Leaving/Exiting status | Degraded | Degraded |
| Other (Removed, Down…) | Degraded | Unhealthy |
| ActorSystem null guard | none (ActorSystem injected directly) | ✅ Degraded if null |
The two implementations diverge in how they classify Joining (ScadaBridge calls it Healthy;
OtOpcUa would see it as Degraded because SelfMember with status Joining would not appear as
Up in the member scan). They also diverge in the Removed/Down classification (ScadaBridge
Unhealthy, OtOpcUa Degraded).
The shared ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck ships two presets to preserve both
behaviors rather than forcing one onto the other:
- Default: ScadaBridge's three-way policy (Up/Joining = Healthy, Leaving/Exiting = Degraded, else Unhealthy)
- OtOpcUaCompat: OtOpcUa's self-Up-among-members scan (found Up = Healthy, not found = Degraded)
→ Gap A1: OtOpcUa adopts the OtOpcUaCompat preset; ScadaBridge adopts the Default preset.
Both preserve existing behavior without forcing convergence on a single policy.
→ Gap A2: OtOpcUa's AkkaClusterHealthCheck injects ActorSystem directly (no null guard).
The shared implementation injects via AkkaHostedService for startup safety.
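The two presets reduce to two small status mappings. The sketch below restates the table in Python as an illustration only. The function names, the string member statuses, and the address-keyed members dictionary are assumptions made for the example and are not the shared library's API.

```python
from enum import Enum
from typing import Optional


class Health(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"


def default_policy(self_status: Optional[str]) -> Health:
    """ScadaBridge-style three-way policy over the node's own member status.

    None models the "ActorSystem not available yet" case, which the table
    above maps to Degraded.
    """
    if self_status is None:
        return Health.DEGRADED
    if self_status in ("Up", "Joining"):
        return Health.HEALTHY
    if self_status in ("Leaving", "Exiting"):
        return Health.DEGRADED
    return Health.UNHEALTHY  # Removed, Down, and anything else


def otopcua_compat_policy(self_address: str, members: dict) -> Health:
    """OtOpcUa-style two-way policy: Healthy only if self is listed as Up.

    `members` maps a member address to its status (hypothetical shape).
    """
    if members.get(self_address) == "Up":
        return Health.HEALTHY
    return Health.DEGRADED
```

The two divergences called out above fall out directly: a Joining node is Healthy under default_policy but Degraded under otopcua_compat_policy, and a Down node is Unhealthy under the first but Degraded under the second.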
§4 Database probe technique
| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe method | db.Deployments.AsNoTracking().Take(1).ToListAsync() (query) | _dbContext.Database.CanConnectAsync() (connection only) |
| Injection style | IDbContextFactory<T> (pooled, safe for concurrent probes) | DbContext directly (scoped, requires care in background use) |
| Schema verification | ✅ implies schema is applied | ⛔ connection only |
→ Gap D1: ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<TContext> uses
CanConnectAsync as the default (ScadaBridge behavior). An optional ProbeQuery delegate covers
OtOpcUa's stricter approach. Both apps retain their existing probe semantics; neither is forced
to change unless desired.
→ Gap D2: ScadaBridge injects DbContext directly; the shared probe should use
IDbContextFactory<TContext> for safe reuse from a background-service health-check context.
ScadaBridge's DI registration will need updating on adoption.
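The difference between a connection-only probe and a query probe is easiest to see against a database whose schema has not been applied. The following Python sketch uses the standard-library sqlite3 module as a stand-in for EF Core. The database_health function, its probe_query parameter, and the deployments table name are illustrative assumptions that mirror the shape described above, not the shared library's code.

```python
import sqlite3
from typing import Callable, Optional


def database_health(
    connect: Callable[[], sqlite3.Connection],
    probe_query: Optional[Callable[[sqlite3.Connection], None]] = None,
) -> str:
    """Connection-only by default; runs an optional stricter query probe."""
    try:
        conn = connect()
    except sqlite3.Error:
        return "Unhealthy"
    try:
        if probe_query is not None:
            probe_query(conn)  # raises if the schema is missing
        return "Healthy"
    except sqlite3.Error:
        return "Unhealthy"
    finally:
        conn.close()


def deployments_probe(conn: sqlite3.Connection) -> None:
    """Query probe: reading one row implies the table (schema) exists."""
    conn.execute("SELECT 1 FROM deployments LIMIT 1").fetchall()


def empty_database() -> sqlite3.Connection:
    """Reachable database with no schema applied."""
    return sqlite3.connect(":memory:")


def migrated_database() -> sqlite3.Connection:
    """Reachable database with the schema applied (table may be empty)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE deployments (id INTEGER PRIMARY KEY)")
    return conn
```

With these helpers, database_health(empty_database) reports Healthy because the connection opens, while database_health(empty_database, deployments_probe) reports Unhealthy because the table is missing. Against migrated_database both variants report Healthy, which matches the "implies schema is applied" row.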
§5 Active-node / leader check
| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe type | AdminRoleLeaderHealthCheck (role-filtered: "admin") | ActiveNodeHealthCheck (role-less; Up + leader) |
| Non-role-bearing node | Healthy immediately | n/a (all central nodes have no role filter) |
| Leader status | Healthy | Healthy |
| Non-leader (standby) | Degraded | Unhealthy |
| IActiveNodeGate backing | Not present | ActiveNodeGate (separate type, duplicated logic) |
→ Gap L1: ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck with an optional RoleFilter
parameter unifies both behaviors. OtOpcUa passes RoleFilter = "admin" (role-filtered);
ScadaBridge uses no role filter.
→ Gap L2: ScadaBridge's ActiveNodeGate duplicates ActiveNodeHealthCheck logic. The shared
IActiveNodeGate seam + a backing singleton eliminates the duplication.
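A single check can cover both columns of the table if the role filter is optional. The Python sketch below is one possible reading, not the shared ActiveNodeHealthCheck itself. The standby_status parameter is hypothetical: the source names only RoleFilter, so how the Degraded versus Unhealthy standby result is selected is an assumption here. The cluster_ready flag models the startup-safe Degraded result of the shared check.

```python
from typing import Optional


def active_node_health(
    cluster_ready: bool,
    node_roles: set,
    is_leader: bool,
    role_filter: Optional[str] = None,
    standby_status: str = "Unhealthy",  # hypothetical knob, see lead-in
) -> str:
    """Unified leader check with an optional role filter."""
    if not cluster_ready:
        # Startup window: cluster not formed yet.
        return "Degraded"
    if role_filter is not None and role_filter not in node_roles:
        # Node does not carry the filtered role, so leadership is irrelevant.
        return "Healthy"
    return "Healthy" if is_leader else standby_status
```

Under these assumptions, OtOpcUa's behavior corresponds to role_filter="admin" with standby_status="Degraded", and ScadaBridge's to the defaults: no role filter and an Unhealthy standby.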
§6 Response writer
| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Writer | Default (plain-text/JSON) | Bespoke GatewayHealthReply JSON | UIResponseWriter.WriteHealthCheckUIResponse |
→ Gap W1: the shared ZB.MOM.WW.Health package ships a canonical JSON response writer
(lifting the HealthChecks.UI.Client style to the default). All three projects adopt it by
calling MapZbHealth(); no per-project writer wiring is needed.
§7 Endpoint authentication
Both OtOpcUa and ScadaBridge expose health endpoints without authentication (AllowAnonymous or
open by default). MxAccessGateway's /health/live has no authentication requirement. The spec
canonizes this: health tiers are AllowAnonymous; MapZbHealth() applies AllowAnonymous by
default.
No gap — consistent across all three. MapZbHealth() should document and enforce this default.
Adoption backlog (ordered)
| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxAccessGateway: remove dead /health/live + AddHealthChecks(), add GrpcDependencyHealthCheck (worker IPC) + MapZbHealth() | MxGateway | P1 | S | Low | Gap T1, T3, P1, P3: no probes/tiers today; highest delta |
| 2 | OtOpcUa: replace 3 bespoke checks with shared probes (AkkaClusterHealthCheck OtOpcUaCompat + ActiveNodeHealthCheck role-filtered + DatabaseHealthCheck<T> ProbeQuery) | OtOpcUa | P2 | S | Low | Gap A1, D1, L1 |
| 3 | ScadaBridge: replace 3 bespoke checks with shared probes (Default policy + role-less Active + CanConnectAsync) + add /healthz + unify ActiveNodeGate with IActiveNodeGate seam | ScadaBridge | P2 | S | Low | Gap T2, A1, D2, L1, L2 |
| 4 | OtOpcUa + MxAccessGateway: add GrpcDependencyHealthCheck for downstream gRPC channel | OtOpcUa, MxGateway | P2 | S | Low | Gap P2: closes the silent-gateway-down scenario |
| 5 | All: adopt canonical response writer (switch from per-project writers to MapZbHealth default) | all 3 | P3 | XS | Low | Gap W1: mechanical; bundled with #1–3 |
| 6 | DB injection style: switch ScadaBridge from injected DbContext to IDbContextFactory<T> | ScadaBridge | P3 | XS | Low | Gap D2: background-service safety |
Note: adoption items #1–6 are all follow-on tasks. They are tracked here as the backlog for
after ZB.MOM.WW.Health @ 0.1.0 is published. The library build itself (nupkgs, tests) is a
separate task. This is consistent with how ZB.MOM.WW.Auth and ZB.MOM.WW.Theme are structured:
the library is built first; adoption by the three apps is the next step.
Adoption status — 2026-06-01 (DONE)
ZB.MOM.WW.Health 0.1.0 was published to the dohertj2-gitea NuGet feed and adopted across all
three apps on branch feat/adopt-zb-health in each repo (one branch per repo; commits below).
Plan + design: ../../docs/plans/2026-06-01-health-library-adoption.md.
| Repo | What shipped | Build / tests |
|---|---|---|
| MxAccessGateway | Removed the pipeline-bypassing /health/live lambda + dead AddHealthChecks(); added a custom AuthStoreHealthCheck (readiness probe over the SQLite auth store) tagged Ready; MapZbHealth() → ready/active/healthz + canonical writer. | Server builds; 568 pass / 3 pre-existing macOS failures unrelated to health (OrphanWorkerTerminator ×2, fake-worker timeout ×1). |
| OtOpcUa | Swapped all 3 bespoke checks → shared probes (DatabaseHealthCheck<OtOpcUaConfigDbContext> + ProbeQuery, AkkaClusterHealthCheck OtOpcUaCompat, ActiveNodeHealthCheck(role:"admin")); MapZbHealth(). | Host builds clean; health tests pass including a real two-node-cluster integration test. Independently code-reviewed: APPROVED, behaviour-preserving. |
| ScadaBridge | Swapped 3 bespoke checks → shared probes (DatabaseHealthCheck<ScadaBridgeDbContext> scoped fallback, AkkaClusterHealthCheck Default, role-less ActiveNodeHealthCheck); added transient ActorSystem DI bridge (Central + Site roots); added /healthz; canonical writer. Kept its own ActiveNodeGate. | Builds clean (TreatWarningsAsErrors); 212 Host tests pass. |
Deferred (verified ill-fitting on adoption — re-scoped from the original backlog)
- #4 downstream gRPC dependency probes: DROPPED for now. Neither repo holds a host-level GrpcChannel to probe: the MxGateway↔worker IPC is named pipes (not gRPC), and OtOpcUa's gateway channel is created per-driver from DB config (no DI-level channel). MxGateway readiness instead probes the SQLite auth store (AuthStoreHealthCheck). A real worker/downstream probe needs a custom non-gRPC check; that is future work.
- #3 IActiveNodeGate seam unification (ScadaBridge): DEFERRED. ScadaBridge's IActiveNodeGate is …ScadaBridge.InboundAPI.IActiveNodeGate, wired into inbound-API endpoint gating, a different interface from the shared ZB.MOM.WW.Health.IActiveNodeGate. Unification touches the InboundAPI path; its existing ActiveNodeGate (logic identical to the shared AkkaActiveNodeGate) was kept.
- #6 IDbContextFactory<T> switch (ScadaBridge): DROPPED as unnecessary. The shared DatabaseHealthCheck<T> self-scopes (creates its own DI scope per probe) when no factory is registered, which is the background-safety fix, and ScadaBridge's context is built with an injected IDataProtectionProvider, which AddDbContextFactory does not accommodate cleanly.
Accepted behaviour change (one) — flag for ops
ScadaBridge /health/active during the startup window (ActorSystem/cluster not yet ready) now
returns Degraded → HTTP 200 instead of the prior Unhealthy → HTTP 503. This is the shared
ActiveNodeHealthCheck's documented startup-safe behaviour and the normalized convergence target.
The steady-state standby case (node Up but not leader) is unchanged (Unhealthy → 503). If a
load-balancer (Traefik) keys strictly on a 503 from /health/active to fence standby nodes during
startup, that fail-safe is briefly relaxed until the cluster forms.
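For ops review, the change can be expressed as a before/after comparison of what a load balancer sees on /health/active. The Python sketch below is illustrative only. The Degraded → 200 and Unhealthy → 503 mappings come from the text above; Healthy → 200 is the ASP.NET Core default and is assumed to be unchanged by the canonical writer. The scenario names and the leader row are added for the example.

```python
# Health status -> HTTP status code seen by the load balancer.
HTTP_STATUS = {"Healthy": 200, "Degraded": 200, "Unhealthy": 503}

# ScadaBridge /health/active: scenario -> (status before, status after adoption).
SCENARIOS = {
    "startup window": ("Unhealthy", "Degraded"),
    "steady-state standby": ("Unhealthy", "Unhealthy"),
    "steady-state leader": ("Healthy", "Healthy"),
}


def changed_scenarios() -> list:
    """Return the scenarios whose HTTP status code differs after adoption."""
    return [
        name
        for name, (before, after) in SCENARIOS.items()
        if HTTP_STATUS[before] != HTTP_STATUS[after]
    ]
```

Only the startup window changes its HTTP code (503 before, 200 after), so a fencing rule keyed on 503 is affected during startup and nowhere else.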