Health — gaps & adoption backlog

Divergence of each project from spec/SPEC.md, and the ordered backlog to reach the shared ZB.MOM.WW.Health library. Status legend: gap · 🟡 partial · matches.

Divergence vs spec

§1 Endpoint tiers

| Spec tier | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| /health/ready (tag ready) | present | absent | present (name-predicate) |
| /health/active (tag active) | present | absent | present (name-predicate) |
| /healthz (bare process liveness) | present | absent | absent |
| /health/live (non-standard) | | present (hardcoded "Healthy", bypasses health-check pipeline) | |

  • Gap T1 (P1): MxAccessGateway has no standard health tiers. The existing /health/live MapGet lambda must be replaced by app.MapZbHealth() plus real probes.
  • Gap T2: ScadaBridge lacks /healthz. MapZbHealth() adds it automatically.
  • Gap T3: MxAccessGateway's /health/live uses a raw MapGet that bypasses the ASP.NET Core health-check middleware, so it does not participate in IHealthCheckPublisher, HealthReport, or UI integration. It must be removed.
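The three standard tiers can be sketched with the stock ASP.NET Core health-check middleware. This is only an approximation of what MapZbHealth() might wire up; the names MapHealthTiersSketch and TaggedWith are hypothetical and are not taken from the shared library.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;

// Hypothetical stand-in for MapZbHealth(); the real extension may differ.
public static class HealthTierEndpoints
{
    public static IEndpointRouteBuilder MapHealthTiersSketch(this IEndpointRouteBuilder endpoints)
    {
        // Readiness: runs only the checks registered with the "ready" tag.
        endpoints.MapHealthChecks("/health/ready", TaggedWith("ready")).AllowAnonymous();

        // Active/leader: runs only the checks registered with the "active" tag.
        endpoints.MapHealthChecks("/health/active", TaggedWith("active")).AllowAnonymous();

        // Bare liveness: runs no checks, so it reports Healthy whenever the process
        // can serve HTTP, while still going through the health-check middleware.
        endpoints.MapHealthChecks("/healthz", new HealthCheckOptions { Predicate = _ => false })
                 .AllowAnonymous();

        return endpoints;
    }

    // Builds options that select registrations by tag.
    public static HealthCheckOptions TaggedWith(string tag) =>
        new HealthCheckOptions { Predicate = registration => registration.Tags.Contains(tag) };
}
```

Unlike a raw MapGet lambda, every route here is served by the health-check middleware, so each response is produced from a HealthReport.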

§2 Probe coverage

| Probe | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Database connectivity | DatabaseHealthCheck (query probe) | none | DatabaseHealthCheck (CanConnectAsync) |
| Akka cluster membership | AkkaClusterHealthCheck (2-way) | n/a (no Akka) | AkkaClusterHealthCheck (3-way) |
| Active / leader node | AdminRoleLeaderHealthCheck (role-filtered) | n/a | ActiveNodeHealthCheck (role-less) |
| Downstream gRPC dependency | none | none | none |

  • Gap P1 (P1): MxAccessGateway has zero probes; AddHealthChecks() at GatewayApplication.cs:61 is dead code. Minimum viable: a GrpcDependencyHealthCheck targeting the x86 worker IPC channel.
  • Gap P2: No project probes its downstream gRPC dependency. OtOpcUa should probe the MxAccessGateway channel; MxAccessGateway should probe the worker IPC.
  • Gap P3: Dead AddHealthChecks() in MxAccessGateway (GatewayApplication.cs:61) should be removed or replaced, because it currently implies health checks are configured when they are not.

§3 Akka status-policy divergence

| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe implementation | Scans State.Members for self by address | Reads SelfMember.Status directly |
| Joining status | Degraded (not in Members as Up) | Healthy |
| Leaving/Exiting status | Degraded | Degraded |
| Other (Removed, Down…) | Degraded | Unhealthy |
| ActorSystem null guard | none (ActorSystem injected directly) | Degraded if null |

The two implementations diverge in how they classify Joining: ScadaBridge reports Healthy, while OtOpcUa reports Degraded, because a node whose own status is Joining is not found as Up in the member scan. They also diverge on Removed/Down (ScadaBridge Unhealthy, OtOpcUa Degraded).

The shared ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck ships two presets to preserve both behaviors rather than forcing one onto the other:

  • Default — ScadaBridge's three-way policy (Up/Joining=Healthy, Leaving/Exiting=Degraded, else Unhealthy)
  • OtOpcUaCompat — OtOpcUa's self-Up-among-members scan (found Up=Healthy, not found=Degraded)

  • Gap A1: OtOpcUa adopts the OtOpcUaCompat preset; ScadaBridge adopts the Default preset. Both preserve existing behavior without forcing convergence on a single policy.
  • Gap A2: OtOpcUa's AkkaClusterHealthCheck injects ActorSystem directly (no null guard). The shared implementation injects via AkkaHostedService for startup safety.
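The two policies described above can be written as small mapping functions. This is a sketch under assumptions: the class and method names are hypothetical, and the shared library's preset implementation is not shown in this document. Only the status mappings themselves come from the table above.

```csharp
using System.Linq;
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Illustrative only: these names are hypothetical, not the shared library's API.
public static class AkkaStatusPolicySketch
{
    // ScadaBridge-style three-way policy, keyed on the node's own member status.
    public static HealthStatus ThreeWay(MemberStatus self) => self switch
    {
        MemberStatus.Up or MemberStatus.Joining => HealthStatus.Healthy,
        MemberStatus.Leaving or MemberStatus.Exiting => HealthStatus.Degraded,
        _ => HealthStatus.Unhealthy, // Removed, Down, and anything else
    };

    // OtOpcUa-style two-way policy: Healthy only if this node is found
    // in the member list with status Up; otherwise Degraded.
    public static HealthStatus SelfUpAmongMembers(Cluster cluster) =>
        cluster.State.Members.Any(m =>
            m.Address.Equals(cluster.SelfAddress) && m.Status == MemberStatus.Up)
            ? HealthStatus.Healthy
            : HealthStatus.Degraded;
}
```

A Joining node shows the difference: ThreeWay returns Healthy, while SelfUpAmongMembers returns Degraded because the node is not yet Up in the member list.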

§4 Database probe technique

| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe method | db.Deployments.AsNoTracking().Take(1).ToListAsync() (query) | _dbContext.Database.CanConnectAsync() (connection only) |
| Injection style | IDbContextFactory<T> (pooled, safe for concurrent probes) | DbContext directly (scoped, requires care in background use) |
| Schema verification | implies schema is applied | connection only |

  • Gap D1: ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<TContext> uses CanConnectAsync as the default (ScadaBridge behavior). An optional ProbeQuery delegate covers OtOpcUa's stricter approach. Both apps retain their existing probe semantics; neither is forced to change unless desired.
  • Gap D2: ScadaBridge injects DbContext directly; the shared probe should use IDbContextFactory<TContext> for safe reuse from a background-service health-check context. ScadaBridge's DI registration will need updating on adoption.
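A factory-based probe with an optional query delegate might look roughly like the following. This is a hedged sketch, not the shared DatabaseHealthCheck<TContext>: the class name DatabaseProbeSketch and the delegate shape are assumptions, and only the EF Core and health-check calls are standard APIs.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch; not the actual ZB.MOM.WW.Health implementation.
public sealed class DatabaseProbeSketch<TContext> : IHealthCheck where TContext : DbContext
{
    private readonly IDbContextFactory<TContext> _factory;
    private readonly Func<TContext, CancellationToken, Task>? _probeQuery;

    public DatabaseProbeSketch(
        IDbContextFactory<TContext> factory,
        Func<TContext, CancellationToken, Task>? probeQuery = null)
    {
        _factory = factory;
        _probeQuery = probeQuery;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // A fresh context per probe avoids sharing one DbContext across concurrent checks.
            await using var db = await _factory.CreateDbContextAsync(cancellationToken);

            if (_probeQuery is not null)
            {
                // Stricter mode: run a real query, which also fails if the schema is missing.
                await _probeQuery(db, cancellationToken);
                return HealthCheckResult.Healthy("Probe query succeeded.");
            }

            // Default mode: connection only.
            return await db.Database.CanConnectAsync(cancellationToken)
                ? HealthCheckResult.Healthy("Database reachable.")
                : new HealthCheckResult(context.Registration.FailureStatus, "Database unreachable.");
        }
        catch (Exception ex)
        {
            return new HealthCheckResult(
                context.Registration.FailureStatus, "Database probe failed.", ex);
        }
    }
}
```

Under this shape, OtOpcUa's stricter probe would be passed as a delegate such as `(db, ct) => db.Deployments.AsNoTracking().Take(1).ToListAsync(ct)`, while ScadaBridge would pass nothing and get the connection-only check.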

§5 Active-node / leader check

| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe type | AdminRoleLeaderHealthCheck (role-filtered: "admin") | ActiveNodeHealthCheck (role-less; Up + leader) |
| Non-role-bearing node | Healthy immediately | n/a (all central nodes have no role filter) |
| Leader status | Healthy | Healthy |
| Non-leader (standby) | Degraded | Unhealthy |
| IActiveNodeGate backing | Not present | ActiveNodeGate (separate type, duplicated logic) |

  • Gap L1: ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck with an optional RoleFilter parameter unifies both behaviors. OtOpcUa passes RoleFilter = "admin" (role-filtered); ScadaBridge uses no role filter.
  • Gap L2: ScadaBridge's ActiveNodeGate duplicates ActiveNodeHealthCheck logic. The shared IActiveNodeGate seam + a backing singleton eliminates the duplication.
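One way to picture a single check covering both the role-filtered and role-less cases is shown below. Treat it as a sketch under assumptions: ActiveNodeSketch, Decide, and the standbyStatus parameter are hypothetical (this document does not say how the shared check selects Degraded versus Unhealthy for a standby node), and the Akka.NET calls are used as I understand them.

```csharp
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch of a role-aware active-node decision; not the shared library's code.
public static class ActiveNodeSketch
{
    // Pure decision: nodes that do not carry the filtered role are Healthy immediately,
    // the leader is Healthy, and any other node reports the configured standby status.
    public static HealthStatus Decide(bool carriesRole, bool isUpAndLeader, HealthStatus standbyStatus) =>
        !carriesRole ? HealthStatus.Healthy
        : isUpAndLeader ? HealthStatus.Healthy
        : standbyStatus;

    // Adapter over a live cluster. roleFilter == null means "no role filter".
    public static HealthStatus Evaluate(Cluster cluster, string? roleFilter, HealthStatus standbyStatus)
    {
        var carriesRole = roleFilter is null || cluster.SelfRoles.Contains(roleFilter);

        // With a role filter, compare against the leader among nodes with that role.
        var leader = roleFilter is null
            ? cluster.State.Leader
            : cluster.State.RoleLeader(roleFilter);

        var isUpAndLeader = cluster.SelfMember.Status == MemberStatus.Up
                            && cluster.SelfAddress.Equals(leader);

        return Decide(carriesRole, isUpAndLeader, standbyStatus);
    }
}
```

In these terms, OtOpcUa's existing behavior corresponds to `Evaluate(cluster, "admin", HealthStatus.Degraded)` and ScadaBridge's to `Evaluate(cluster, null, HealthStatus.Unhealthy)`.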

§6 Response writer

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Writer | Default (plain-text/JSON) | Bespoke GatewayHealthReply JSON | UIResponseWriter.WriteHealthCheckUIResponse |

Gap W1: the shared ZB.MOM.WW.Health package ships a canonical JSON response writer (lifting HealthChecks.UI.Client style to the default). All three projects adopt it on MapZbHealth() call — no per-project writer wiring needed.
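For orientation, a JSON response writer compatible with HealthCheckOptions.ResponseWriter can be as small as the sketch below. The JSON field names and the class name JsonHealthWriterSketch are assumptions; the canonical writer's actual schema is not described in this document.

```csharp
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical writer; the field names are illustrative, not the canonical schema.
public static class JsonHealthWriterSketch
{
    // Projects the report into a plain object graph and serializes it.
    public static string Serialize(HealthReport report) =>
        JsonSerializer.Serialize(new
        {
            status = report.Status.ToString(),
            totalDurationMs = report.TotalDuration.TotalMilliseconds,
            entries = report.Entries.ToDictionary(
                e => e.Key,
                e => new
                {
                    status = e.Value.Status.ToString(),
                    description = e.Value.Description,
                    durationMs = e.Value.Duration.TotalMilliseconds,
                }),
        });

    // Matches the Func<HttpContext, HealthReport, Task> shape of HealthCheckOptions.ResponseWriter.
    public static Task WriteAsync(HttpContext context, HealthReport report)
    {
        context.Response.ContentType = "application/json; charset=utf-8";
        return context.Response.WriteAsync(Serialize(report));
    }
}
```

It would be assigned per endpoint as `new HealthCheckOptions { ResponseWriter = JsonHealthWriterSketch.WriteAsync }`, which is the kind of per-project wiring a shared default removes.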

§7 Endpoint authentication

Both OtOpcUa and ScadaBridge expose health endpoints without authentication (AllowAnonymous or open by default). MxAccessGateway's /health/live has no authentication requirement. The spec canonizes this: health tiers are AllowAnonymous; MapZbHealth() applies AllowAnonymous by default.

No gap — consistent across all three. MapZbHealth() should document and enforce this default.

Adoption backlog (ordered)

| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxAccessGateway: remove dead /health/live + AddHealthChecks(), add GrpcDependencyHealthCheck (worker IPC) + MapZbHealth() | MxGateway | P1 | S | Low | Gap T1, T3, P1, P3: no probes/tiers today; highest delta |
| 2 | OtOpcUa: replace 3 bespoke checks with shared probes (AkkaClusterHealthCheck OtOpcUaCompat + ActiveNodeHealthCheck role-filtered + DatabaseHealthCheck<T> ProbeQuery) | OtOpcUa | P2 | S | Low | Gap A1, D1, L1 |
| 3 | ScadaBridge: replace 3 bespoke checks with shared probes (Default policy + role-less Active + CanConnectAsync) + add /healthz + unify ActiveNodeGate with IActiveNodeGate seam | ScadaBridge | P2 | S | Low | Gap T2, A1, D2, L1, L2 |
| 4 | OtOpcUa + MxAccessGateway: add GrpcDependencyHealthCheck for downstream gRPC channel | OtOpcUa, MxGateway | P2 | S | Low | Gap P2: closes the silent-gateway-down scenario |
| 5 | All: adopt canonical response writer (switch from per-project writers to MapZbHealth default) | all 3 | P3 | XS | Low | Gap W1: mechanical; bundled with #1–3 |
| 6 | DB injection style: switch ScadaBridge from injected DbContext to IDbContextFactory<T> | ScadaBridge | P3 | XS | Low | Gap D2: background-service safety |

Note: adoption items #1–6 are all follow-on tasks. They are tracked here as the backlog for after ZB.MOM.WW.Health @ 0.1.0 is published. The library build itself (nupkgs, tests) is a separate task. This is consistent with how ZB.MOM.WW.Auth and ZB.MOM.WW.Theme are structured: the library is built first; adoption by the three apps is the next step.

Adoption status — 2026-06-01 (DONE)

ZB.MOM.WW.Health 0.1.0 was published to the dohertj2-gitea NuGet feed and adopted across all three apps on branch feat/adopt-zb-health in each repo (one branch per repo; commits below). Plan + design: ../../docs/plans/2026-06-01-health-library-adoption.md.

| Repo | What shipped | Build / tests |
|---|---|---|
| MxAccessGateway | Removed the pipeline-bypassing /health/live lambda + dead AddHealthChecks(); added a custom AuthStoreHealthCheck (readiness probe over the SQLite auth store) tagged Ready; MapZbHealth() → ready/active/healthz + canonical writer. | Server builds; 568 pass / 3 pre-existing macOS failures unrelated to health (OrphanWorkerTerminator ×2, fake-worker timeout ×1). |
| OtOpcUa | Swapped all 3 bespoke checks → shared probes (DatabaseHealthCheck<OtOpcUaConfigDbContext> + ProbeQuery, AkkaClusterHealthCheck OtOpcUaCompat, ActiveNodeHealthCheck(role:"admin")); MapZbHealth(). | Host builds clean; health tests pass including a real two-node-cluster integration test. Independently code-reviewed: APPROVED, behaviour-preserving. |
| ScadaBridge | Swapped 3 bespoke checks → shared probes (DatabaseHealthCheck<ScadaBridgeDbContext> scoped fallback, AkkaClusterHealthCheck Default, role-less ActiveNodeHealthCheck); added transient ActorSystem DI bridge (Central + Site roots); added /healthz; canonical writer. Kept its own ActiveNodeGate. | Builds clean (TreatWarningsAsErrors); 212 Host tests pass. |

Deferred (verified ill-fitting on adoption — re-scoped from the original backlog)

  • #4 downstream gRPC dependency probes — DROPPED for now. Neither repo holds a host-level GrpcChannel to probe: the MxGateway↔worker IPC is named pipes (not gRPC), and OtOpcUa's gateway channel is created per-driver from DB config (no DI-level channel). MxGateway readiness instead probes the SQLite auth store (AuthStoreHealthCheck). A real worker/downstream probe needs a custom non-gRPC check — future work.
  • #3 IActiveNodeGate seam unification (ScadaBridge) — DEFERRED. ScadaBridge's IActiveNodeGate is …ScadaBridge.InboundAPI.IActiveNodeGate, wired into inbound-API endpoint gating — a different interface from the shared ZB.MOM.WW.Health.IActiveNodeGate. Unification touches the InboundAPI path; its existing ActiveNodeGate (logic identical to the shared AkkaActiveNodeGate) was kept.
  • #6 IDbContextFactory<T> switch (ScadaBridge) — DROPPED as unnecessary. The shared DatabaseHealthCheck<T> self-scopes (creates its own DI scope per probe) when no factory is registered — that is the background-safety fix — and ScadaBridge's context is built with an injected IDataProtectionProvider, which AddDbContextFactory does not accommodate cleanly.
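The self-scoping fallback mentioned for #6 can be illustrated with a check that opens its own DI scope on every probe instead of holding a DbContext. This is a sketch of the general technique only; SelfScopingDatabaseProbeSketch is a hypothetical name and the shared DatabaseHealthCheck<T> may be implemented differently.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch of a self-scoping probe; not the shared library's code.
public sealed class SelfScopingDatabaseProbeSketch<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly IServiceScopeFactory _scopeFactory;

    public SelfScopingDatabaseProbeSketch(IServiceScopeFactory scopeFactory) =>
        _scopeFactory = scopeFactory;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // Each probe gets its own scope, so the scoped DbContext (and whatever it
            // depends on) is resolved fresh and disposed when the probe ends.
            await using var scope = _scopeFactory.CreateAsyncScope();
            var db = scope.ServiceProvider.GetRequiredService<TContext>();

            return await db.Database.CanConnectAsync(cancellationToken)
                ? HealthCheckResult.Healthy("Database reachable.")
                : new HealthCheckResult(context.Registration.FailureStatus, "Database unreachable.");
        }
        catch (Exception ex)
        {
            return new HealthCheckResult(
                context.Registration.FailureStatus, "Database probe failed.", ex);
        }
    }
}
```

Because the context is resolved through the app's normal scoped registration, its existing construction path (including injected services) is reused unchanged.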

Accepted behaviour change (one) — flag for ops

ScadaBridge /health/active during the startup window (ActorSystem/cluster not yet ready) now returns Degraded → HTTP 200 instead of the prior Unhealthy → HTTP 503. This is the shared ActiveNodeHealthCheck's documented startup-safe behaviour and the normalized convergence target. The steady-state standby case (node Up but not leader) is unchanged (Unhealthy → 503). If a load-balancer (Traefik) keys strictly on a 503 from /health/active to fence standby nodes during startup, that fail-safe is briefly relaxed until the cluster forms.
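The HTTP codes follow from the ASP.NET Core default mapping, where Healthy and Degraded both produce 200 and Unhealthy produces 503. If ops needs a Degraded /health/active to keep returning 503, the stock middleware allows overriding that mapping, as sketched below. Whether MapZbHealth() exposes these options is an assumption not confirmed by this document, and StatusCodeMappingSketch is a hypothetical name.

```csharp
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper; shows the stock ResultStatusCodes override only.
public static class StatusCodeMappingSketch
{
    // Options for an "active" tier where Degraded is also reported as 503,
    // so a load balancer keyed on 503 keeps fencing the node during startup.
    public static HealthCheckOptions StrictActive() => new HealthCheckOptions
    {
        Predicate = registration => registration.Tags.Contains("active"),
        ResultStatusCodes =
        {
            [HealthStatus.Healthy] = StatusCodes.Status200OK,
            [HealthStatus.Degraded] = StatusCodes.Status503ServiceUnavailable,
            [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable,
        },
    };
}
```

With plain middleware this would be applied as `app.MapHealthChecks("/health/active", StatusCodeMappingSketch.StrictActive())`.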