scadaproj/components/health/GAPS.md
Joseph Doherty 07d5907258 docs(health): resolve spec/contract/gaps consistency (review fixes)
Applies canonical resolutions for eight settled decisions:
- GAPS: remove three stale "Decisions still open" bullets (#1 IActiveNodeGate placement, #2 GrpcChannel type, #3 OtOpcUaCompat named constant)
- Shared contract: AkkaClusterHealthCheck, ActiveNodeHealthCheck constructors take IServiceProvider (lazy ActorSystem, Degraded-when-not-ready)
- Shared contract: AkkaActiveNodeGate takes IServiceProvider; reads SelfMember+leader directly, null-guarded; does not proxy ActiveNodeHealthCheck
- Shared contract: DatabaseHealthCheckOptions.Probe renamed to ProbeQuery; consumer matrix updated
- Shared contract: settled AddZbHealthChecks open question removed (spec §5 is per-project AddHealthChecks)
- SPEC §2.2: OtOpcUaCompat Leaving/Exiting cell updated from — to Degraded + footnote; §2.3 startup-safety note added
- README: status line corrected from "built and tested" to "scaffolded … implementation is follow-on (task #7)"; IActiveNodeGate "left per-project" bullet removed
- OtOpcUa current-state: AddZbHealthChecks → AddHealthChecks().AddCheck<...>(); IClusterRoleInfo note reframed as accepted trade-off
- ScadaBridge current-state: IActiveNodeGate bullet rewritten — interface moves to ZB.MOM.WW.Health on adoption, InboundApiEndpointFilter references shared interface
2026-06-01 06:33:42 -04:00


Health — gaps & adoption backlog

Divergence of each project from spec/SPEC.md, and the ordered backlog to reach the shared ZB.MOM.WW.Health library. Status legend: gap · 🟡 partial · matches.

Divergence vs spec

§1 Endpoint tiers

| Spec tier | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| /health/ready (tag ready) | present | absent | present (name-predicate) |
| /health/active (tag active) | present | absent | present (name-predicate) |
| /healthz (bare process liveness) | present | absent | absent |
| /health/live (non-standard) | | present (hardcoded "Healthy", bypasses health-check pipeline) | |

Gap T1 (P1): MxAccessGateway has no standard health tiers. The existing /health/live MapGet lambda must be replaced by app.MapZbHealth() plus real probes.

Gap T2: ScadaBridge lacks /healthz. MapZbHealth() adds it automatically.

Gap T3: MxAccessGateway's /health/live is a raw MapGet that bypasses the ASP.NET Core health-check middleware. As a result, it does not participate in IHealthCheckPublisher, HealthReport, or UI integration. It must be removed.
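The tier split amounts to filtering the registered checks per endpoint and then taking the worst status. The block below is a minimal Python sketch of that behavior, not the shared library. The check names, tags, results, and the evaluate() helper are hypothetical. It mirrors two ASP.NET Core behaviors:

- a report's overall status is its worst entry;
- an endpoint whose predicate selects no checks, like the bare /healthz tier, reports Healthy.

```python
from enum import IntEnum


class HealthStatus(IntEnum):
    # Same ordering as ASP.NET Core's HealthStatus enum: lower is worse.
    UNHEALTHY = 0
    DEGRADED = 1
    HEALTHY = 2


# Hypothetical registrations: check name -> (tags, latest result).
CHECKS = {
    "database": ({"ready"}, HealthStatus.HEALTHY),
    "akka-cluster": ({"ready"}, HealthStatus.DEGRADED),
    "active-node": ({"active"}, HealthStatus.UNHEALTHY),
}

# Each tier selects checks by tag; /healthz selects none (process liveness only).
TIER_PREDICATES = {
    "/healthz": lambda tags: False,
    "/health/ready": lambda tags: "ready" in tags,
    "/health/active": lambda tags: "active" in tags,
}


def evaluate(path, checks=CHECKS):
    """Return (overall status, names of the checks that ran) for one tier."""
    predicate = TIER_PREDICATES[path]
    ran = {name: status for name, (tags, status) in checks.items() if predicate(tags)}
    # Overall status is the worst entry; a report with no entries is Healthy.
    overall = min(ran.values(), default=HealthStatus.HEALTHY)
    return overall, sorted(ran)
```

With these sample results, /healthz stays Healthy even though the active-node check is failing, which is the point of a bare liveness tier. ScadaBridge's current name-predicate filtering would test the check name instead of its tags.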

§2 Probe coverage

| Probe | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Database connectivity | DatabaseHealthCheck (query probe) | none | DatabaseHealthCheck (CanConnectAsync) |
| Akka cluster membership | AkkaClusterHealthCheck (2-way) | n/a (no Akka) | AkkaClusterHealthCheck (3-way) |
| Active / leader node | AdminRoleLeaderHealthCheck (role-filtered) | n/a | ActiveNodeHealthCheck (role-less) |
| Downstream gRPC dependency | none | none | none |

Gap P1 (P1): MxAccessGateway has zero probes. The AddHealthChecks() call at GatewayApplication.cs:61 registers no checks and is dead code. The minimum viable fix is a GrpcDependencyHealthCheck targeting the x86 worker IPC channel.

Gap P2: No project probes its downstream gRPC dependency. OtOpcUa should probe the MxAccessGateway channel, and MxAccessGateway should probe the worker IPC channel.

Gap P3: The dead AddHealthChecks() in MxAccessGateway (GatewayApplication.cs:61) should be removed or replaced. It currently implies that health checks are configured when they are not.
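A downstream gRPC probe mostly comes down to a bounded call plus a mapping from outcome to status. The block below is a minimal Python sketch of that idea. It rests on a hypothetical assumption: GrpcDependencyHealthCheck asks the dependency for a grpc.health.v1-style serving status and treats anything other than SERVING as Unhealthy, including a timeout or a transport error. The function, the stub probes, and the 2-second default timeout are illustrative, not the library's API.

```python
import asyncio
from enum import IntEnum


class HealthStatus(IntEnum):
    UNHEALTHY = 0
    DEGRADED = 1
    HEALTHY = 2


async def grpc_dependency_check(probe, timeout_s=2.0):
    """Map one call to a downstream dependency onto (status, description).

    `probe` is a hypothetical coroutine function that returns a grpc.health.v1
    serving-status name such as "SERVING" or "NOT_SERVING".
    """
    try:
        serving = await asyncio.wait_for(probe(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return HealthStatus.UNHEALTHY, f"no reply within {timeout_s}s"
    except Exception as exc:  # e.g. connection refused because the worker is down
        return HealthStatus.UNHEALTHY, f"probe failed: {exc}"
    if serving == "SERVING":
        return HealthStatus.HEALTHY, "dependency serving"
    return HealthStatus.UNHEALTHY, f"dependency reported {serving}"


# Stub probes standing in for the x86 worker IPC channel.
async def serving_worker():
    return "SERVING"


async def hung_worker():
    await asyncio.sleep(5)
    return "SERVING"


async def stopped_worker():
    raise ConnectionRefusedError("worker IPC down")
```

The timeout matters as much as the call. For example, asyncio.run(grpc_dependency_check(hung_worker, timeout_s=0.1)) reports Unhealthy after about 0.1 s rather than stalling the health endpoint behind a hung worker.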

§3 Akka status-policy divergence

| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe implementation | Scans State.Members for self by address | Reads SelfMember.Status directly |
| Joining status | Degraded (not in Members as Up) | Healthy |
| Leaving/Exiting status | Degraded | Degraded |
| Other (Removed, Down…) | Degraded | Unhealthy |
| ActorSystem null guard | none (ActorSystem injected directly) | Degraded if null |

The two implementations diverge in how they classify Joining. ScadaBridge reports it as Healthy. OtOpcUa reports Degraded, because a self member in the Joining state does not appear with status Up in its member scan. They also diverge on Removed/Down: ScadaBridge reports Unhealthy, OtOpcUa reports Degraded.

The shared ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck ships two presets to preserve both behaviors rather than forcing one onto the other:

  • Default — ScadaBridge's three-way policy (Up/Joining=Healthy, Leaving/Exiting=Degraded, else Unhealthy)
  • OtOpcUaCompat — OtOpcUa's self-Up-among-members scan (found Up=Healthy, not found=Degraded)
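The two presets reduce to small decision functions. The block below is a minimal Python sketch. It assumes member statuses are strings named after Akka's MemberStatus values and that members are (address, status) pairs. The function names are hypothetical, not the ZB.MOM.WW.Health.Akka API.

```python
from enum import IntEnum


class HealthStatus(IntEnum):
    UNHEALTHY = 0
    DEGRADED = 1
    HEALTHY = 2


def default_policy(self_status):
    """Default preset (ScadaBridge today): classify the self member's own status."""
    if self_status in ("Up", "Joining"):
        return HealthStatus.HEALTHY
    if self_status in ("Leaving", "Exiting"):
        return HealthStatus.DEGRADED
    return HealthStatus.UNHEALTHY  # any other status, e.g. Removed or Down


def otopcua_compat_policy(self_address, members):
    """OtOpcUaCompat preset: look for self among the members with status Up."""
    for address, status in members:
        if address == self_address and status == "Up":
            return HealthStatus.HEALTHY
    return HealthStatus.DEGRADED  # not found as Up, whatever the actual status
```

Take a node that is still Joining:

- default_policy("Joining") returns Healthy;
- otopcua_compat_policy("node-a", [("node-a", "Joining")]) returns Degraded.

That is exactly the divergence the two presets preserve.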

Gap A1: OtOpcUa adopts the OtOpcUaCompat preset; ScadaBridge adopts the Default preset. Both keep their existing behavior without being forced onto a single policy.

Gap A2: OtOpcUa's AkkaClusterHealthCheck injects ActorSystem directly, with no null guard. For startup safety, the shared implementation takes IServiceProvider instead. It resolves the ActorSystem lazily and reports Degraded until the ActorSystem is ready.
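The commit note at the top of this page records the shared-contract decision behind A2: constructors take IServiceProvider, resolve the ActorSystem lazily, and report Degraded when it is not ready. The block below is a minimal Python sketch of that pattern. A dict stands in for IServiceProvider, and the class names are hypothetical. Construction never touches the actor system, so the check can be registered, and even probed, before Akka has started.

```python
from dataclasses import dataclass
from enum import IntEnum


class HealthStatus(IntEnum):
    UNHEALTHY = 0
    DEGRADED = 1
    HEALTHY = 2


@dataclass
class FakeActorSystem:
    """Stand-in for the real ActorSystem; only the self status matters here."""
    self_status: str


class LazyClusterHealthCheck:
    """Looks the ActorSystem up on every probe instead of at construction time."""

    def __init__(self, services, policy):
        # `services` is a dict standing in for IServiceProvider; the ActorSystem
        # entry is absent until the actor system has started.
        self._services = services
        self._policy = policy

    def check(self):
        system = self._services.get("ActorSystem")
        if system is None:
            return HealthStatus.DEGRADED  # not ready yet: degrade instead of throwing
        return self._policy(system.self_status)


def up_is_healthy(status):
    """Trivial policy for the example: Up is Healthy, anything else Degraded."""
    return HealthStatus.HEALTHY if status == "Up" else HealthStatus.DEGRADED
```

The check moves through two states as startup proceeds:

1. Created over an empty services dict, it reports Degraded.
2. Once an "ActorSystem" entry appears, the same instance starts applying the real policy.

The check itself never has to be re-registered.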

§4 Database probe technique

| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe method | db.Deployments.AsNoTracking().Take(1).ToListAsync() (query) | _dbContext.Database.CanConnectAsync() (connection only) |
| Injection style | IDbContextFactory<T> (pooled, safe for concurrent probes) | DbContext directly (scoped, requires care in background use) |
| Schema verification | implies schema is applied | connection only |

Gap D1: ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<TContext> uses CanConnectAsync by default, which is the ScadaBridge behavior. An optional ProbeQuery delegate covers OtOpcUa's stricter query-based approach. Both apps keep their existing probe semantics; neither has to change unless it chooses to.

Gap D2: ScadaBridge injects DbContext directly. The shared probe should use IDbContextFactory<TContext> instead, so that each probe gets its own context and can run safely from a background-service health-check context. ScadaBridge's DI registration will need updating on adoption.
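The D1/D2 shape combines two ideas: a factory that yields a fresh connection per probe, and an optional stricter query. The block below is a minimal Python sketch that uses sqlite3 as a stand-in for EF Core, with these mappings:

- connection_factory plays the role of IDbContextFactory<TContext>, giving each probe its own connection so that concurrent or background probes never share one;
- SELECT 1 stands in for CanConnectAsync;
- probe_query mirrors the ProbeQuery delegate.

Class, parameter, and helper names are illustrative, not the package's API.

```python
import sqlite3
from enum import IntEnum


class HealthStatus(IntEnum):
    UNHEALTHY = 0
    DEGRADED = 1
    HEALTHY = 2


class DatabaseHealthCheck:
    """Opens a fresh connection per probe, like the IDbContextFactory pattern."""

    def __init__(self, connection_factory, probe_query=None):
        self._factory = connection_factory
        self._probe_query = probe_query  # None means a connection-only probe

    def check(self):
        try:
            conn = self._factory()
        except sqlite3.Error as exc:
            return HealthStatus.UNHEALTHY, f"cannot connect: {exc}"
        try:
            if self._probe_query is None:
                conn.execute("SELECT 1")  # CanConnectAsync analogue
            else:
                self._probe_query(conn)  # stricter probe: exercises the schema
            return HealthStatus.HEALTHY, "ok"
        except sqlite3.Error as exc:
            return HealthStatus.UNHEALTHY, f"probe failed: {exc}"
        finally:
            conn.close()


def empty_db():
    """Reachable database with no schema applied."""
    return sqlite3.connect(":memory:")


def migrated_db():
    """Reachable database with the (hypothetical) Deployments table present."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Deployments (id INTEGER PRIMARY KEY)")
    return conn


def deployments_probe(conn):
    conn.execute("SELECT * FROM Deployments LIMIT 1").fetchall()
```

Against empty_db, the connection-only probe reports Healthy while deployments_probe reports Unhealthy. This is the "connection only" versus "implies schema is applied" row in the table above.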

§5 Active-node / leader check

| Aspect | OtOpcUa | ScadaBridge |
|---|---|---|
| Probe type | AdminRoleLeaderHealthCheck (role-filtered: "admin") | ActiveNodeHealthCheck (role-less; Up + leader) |
| Non-role-bearing node | Healthy immediately | n/a (all central nodes have no role filter) |
| Leader status | Healthy | Healthy |
| Non-leader (standby) | Degraded | Unhealthy |
| IActiveNodeGate backing | Not present | ActiveNodeGate (separate type, duplicated logic) |

Gap L1: ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck, with an optional RoleFilter parameter, unifies both behaviors. OtOpcUa passes RoleFilter = "admin" (role-filtered); ScadaBridge uses no role filter.

Gap L2: ScadaBridge's ActiveNodeGate duplicates the ActiveNodeHealthCheck logic. The shared IActiveNodeGate seam, backed by a singleton (AkkaActiveNodeGate in the shared contract), replaces ScadaBridge's bespoke type and removes the duplication from ScadaBridge.
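Both rows of behavior in the table above, and the gate that L2 describes, fit in a few lines. The block below is a minimal Python sketch that uses a hypothetical ClusterView snapshot in place of Akka's cluster state; function names are illustrative. Two points go beyond this page and are assumptions:

- The role-filtered path also requires the node to be Up.
- standby_status is a hypothetical parameter. The table shows non-leaders reported as Degraded (OtOpcUa) versus Unhealthy (ScadaBridge), and the text does not say how the shared check expresses that difference.

Following the commit note above, the gate reads self status and leader directly rather than calling the health check.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class HealthStatus(IntEnum):
    UNHEALTHY = 0
    DEGRADED = 1
    HEALTHY = 2


@dataclass
class ClusterView:
    """Hypothetical snapshot of the cluster state a check or gate reads."""
    self_address: str
    self_status: str
    self_roles: frozenset = frozenset()
    leader: Optional[str] = None                       # cluster-wide leader
    role_leaders: dict = field(default_factory=dict)   # role -> leader address


def active_node_status(view, role_filter=None, standby_status=HealthStatus.UNHEALTHY):
    """Active-node check; standby_status is a hypothetical per-project knob."""
    if role_filter is not None and role_filter not in view.self_roles:
        return HealthStatus.HEALTHY  # non-role-bearing node: Healthy immediately
    leader = view.leader if role_filter is None else view.role_leaders.get(role_filter)
    if view.self_status == "Up" and leader == view.self_address:
        return HealthStatus.HEALTHY
    return standby_status


def is_active(view):
    """IActiveNodeGate-style answer: reads the view directly and is None-safe."""
    if view is None:
        return False
    return view.self_status == "Up" and view.leader == view.self_address
```

The two projects differ only in configuration:

- ScadaBridge's behavior is active_node_status(view) with the defaults.
- OtOpcUa's is active_node_status(view, role_filter="admin", standby_status=HealthStatus.DEGRADED).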

§6 Response writer

|  | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Writer | Default (plain-text/JSON) | Bespoke GatewayHealthReply JSON | UIResponseWriter.WriteHealthCheckUIResponse |

Gap W1: the shared ZB.MOM.WW.Health package ships a canonical JSON response writer that makes the HealthChecks.UI.Client style the default. All three projects pick it up simply by calling MapZbHealth(); no per-project writer wiring is needed.
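A canonical writer turns a report into two things: an HTTP status code and a single JSON body. The block below is a minimal Python sketch. The status-code mapping is ASP.NET Core's default (Healthy and Degraded map to 200, Unhealthy to 503). The JSON field names are an assumption: they loosely follow HealthChecks.UI.Client's status/totalDuration/entries layout but render durations as milliseconds. The shared package defines the real canonical shape, and write_health_json() is a hypothetical name.

```python
import json
from enum import IntEnum


class HealthStatus(IntEnum):
    UNHEALTHY = 0
    DEGRADED = 1
    HEALTHY = 2


# ASP.NET Core's default HealthCheckOptions.ResultStatusCodes mapping.
HTTP_STATUS = {HealthStatus.HEALTHY: 200, HealthStatus.DEGRADED: 200, HealthStatus.UNHEALTHY: 503}


def write_health_json(entries, total_duration_ms):
    """Render check results as (HTTP status code, JSON body).

    `entries` maps check name -> dict with keys status (HealthStatus) and
    duration_ms (float), plus optional description (str) and tags (list).
    """
    overall = min((e["status"] for e in entries.values()), default=HealthStatus.HEALTHY)
    body = {
        "status": overall.name.capitalize(),  # e.g. "Degraded"
        "totalDurationMs": total_duration_ms,
        "entries": {
            name: {
                "status": e["status"].name.capitalize(),
                "description": e.get("description"),
                "durationMs": e["duration_ms"],
                "tags": sorted(e.get("tags", [])),
            }
            for name, e in entries.items()
        },
    }
    return HTTP_STATUS[overall], json.dumps(body, sort_keys=True)
```

Under the default mapping, a Degraded report still returns HTTP 200 and only Unhealthy yields 503. A load balancer keyed on status code therefore keeps sending traffic to a degraded node.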

§7 Endpoint authentication

Both OtOpcUa and ScadaBridge expose health endpoints without authentication (AllowAnonymous or open by default). MxAccessGateway's /health/live has no authentication requirement. The spec canonizes this: health tiers are AllowAnonymous; MapZbHealth() applies AllowAnonymous by default.

No gap — consistent across all three. MapZbHealth() should document and enforce this default.

Adoption backlog (ordered)

| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxAccessGateway: remove dead /health/live + AddHealthChecks(), add GrpcDependencyHealthCheck (worker IPC) + MapZbHealth() | MxGateway | P1 | S | Low | Gap T1, T3, P1, P3: no probes or tiers today; highest delta |
| 2 | OtOpcUa: replace 3 bespoke checks with shared probes (AkkaClusterHealthCheck OtOpcUaCompat + ActiveNodeHealthCheck role-filtered + DatabaseHealthCheck<T> ProbeQuery) | OtOpcUa | P2 | S | Low | Gap A1, D1, L1 |
| 3 | ScadaBridge: replace 3 bespoke checks with shared probes (Default policy + role-less Active + CanConnectAsync) + add /healthz + unify ActiveNodeGate with IActiveNodeGate seam | ScadaBridge | P2 | S | Low | Gap T2, A1, D2, L1, L2 |
| 4 | OtOpcUa + MxAccessGateway: add GrpcDependencyHealthCheck for downstream gRPC channel | OtOpcUa, MxGateway | P2 | S | Low | Gap P2: closes the silent-gateway-down scenario |
| 5 | All: adopt canonical response writer (switch from per-project writers to MapZbHealth default) | all 3 | P3 | XS | Low | Gap W1: mechanical; bundled with items #1 to #3 |
| 6 | DB injection style: switch ScadaBridge from injected DbContext to IDbContextFactory<T> | ScadaBridge | P3 | XS | Low | Gap D2: background-service safety |

Note: adoption items #1 to #6 are all follow-on tasks. They are tracked here as the backlog for after ZB.MOM.WW.Health @ 0.1.0 is published. The library build itself (nupkgs, tests) is a separate task. This matches how ZB.MOM.WW.Auth and ZB.MOM.WW.Theme are structured: the library is built first, and adoption by the three apps is the next step.