Health — current state: ScadaBridge
Repo: ~/Desktop/ScadaBridge. Stack: .NET 10, Akka.NET; solution ZB.MOM.WW.ScadaBridge.slnx.
Health code centers on src/ZB.MOM.WW.ScadaBridge.Host/Health/ (ASP.NET probes) and the
separate src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/ project (domain aggregation pipeline).
All paths relative to repo root.
Verified 2026-06-01.
Two-tier pattern: /health/ready and /health/active — no /healthz. Three probes (database,
Akka cluster, active-node). ScadaBridge also has a bespoke distributed HealthMonitoring/
pipeline that is entirely separate from the ASP.NET health checks and is out of scope for the
shared library.
1. Endpoint wiring
src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:
- :114–117: builder.Services.AddHealthChecks() followed by three .AddCheck<T>() calls (no tags, checked by name at the endpoint level):
  - .AddCheck<DatabaseHealthCheck>("database")
  - .AddCheck<AkkaClusterHealthCheck>("akka-cluster")
  - .AddCheck<ActiveNodeHealthCheck>("active-node")
- :131: builder.Services.AddSingleton<IActiveNodeGate, ActiveNodeGate>() registers the production IActiveNodeGate implementation (Inbound API gating, not a health-check probe).
- :222–226: /health/ready mapped with Predicate = check => check.Name != "active-node" and ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse (from HealthChecks.UI.Client). Excludes the active-node check so a healthy standby node reports ready.
- :229–233: /health/active mapped with Predicate = check => check.Name == "active-node" and ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse. Active-node check only.
No /healthz endpoint. Both mapped endpoints use HealthChecks.UI.Client JSON (not the default
plain-text writer), which is a divergence from OtOpcUa.
2. Probes
DatabaseHealthCheck
src/ZB.MOM.WW.ScadaBridge.Host/Health/DatabaseHealthCheck.cs:
- :11: injects ScadaBridgeDbContext directly (not a factory)
- :33–43: calls _dbContext.Database.CanConnectAsync(cancellationToken):
  - Returns true → HealthCheckResult.Healthy("Database connection is available.") (:34–35)
  - Returns false → HealthCheckResult.Unhealthy("Database connection failed.") (:36)
  - Throws → HealthCheckResult.Unhealthy("Database connection failed.", ex) (:40)
CanConnectAsync tests the connection layer only — it does not run any query or verify schema
state. This is less strict than OtOpcUa's Deployments query but more transparent about failure
cause (connection vs. schema). No Degraded path.
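The three outcomes above (true, false, throws) can be sketched without EF Core by passing the connectivity test in as a delegate. This is a minimal illustration, not the ScadaBridge source: DatabaseProbe, ProbeResult, and the local HealthStatus enum are hypothetical stand-ins, and the canConnect delegate takes the place of Database.CanConnectAsync.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Usage: a reachable database, an unreachable one, and a probe that throws.
var up = await DatabaseProbe.CheckAsync(_ => Task.FromResult(true));
var down = await DatabaseProbe.CheckAsync(_ => Task.FromResult(false));
var failed = await DatabaseProbe.CheckAsync(_ => throw new InvalidOperationException("boom"));
Console.WriteLine($"{up.Status} / {down.Status} / {failed.Status}"); // Healthy / Unhealthy / Unhealthy

// Stand-in for Microsoft.Extensions.Diagnostics.HealthChecks.HealthStatus.
public enum HealthStatus { Unhealthy, Degraded, Healthy }

// Hypothetical result type; the real check returns HealthCheckResult.
public sealed record ProbeResult(HealthStatus Status, string Description, Exception? Error = null);

public static class DatabaseProbe
{
    // canConnect is an assumed stand-in for DbContext.Database.CanConnectAsync.
    public static async Task<ProbeResult> CheckAsync(
        Func<CancellationToken, Task<bool>> canConnect,
        CancellationToken cancellationToken = default)
    {
        try
        {
            return await canConnect(cancellationToken)
                ? new ProbeResult(HealthStatus.Healthy, "Database connection is available.")
                : new ProbeResult(HealthStatus.Unhealthy, "Database connection failed.");
        }
        catch (Exception ex)
        {
            // Same description as the false branch, with the exception attached.
            return new ProbeResult(HealthStatus.Unhealthy, "Database connection failed.", ex);
        }
    }
}
```

Note that there is no branch that yields Degraded, matching the description above.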
AkkaClusterHealthCheck
src/ZB.MOM.WW.ScadaBridge.Host/Health/AkkaClusterHealthCheck.cs:
- :13: injects AkkaHostedService (not ActorSystem directly)
- :33–50: gets _akkaService.ActorSystem, guards on null → Degraded("ActorSystem not yet available."), then reads cluster.SelfMember.Status:
  - Up or Joining → Healthy($"Akka cluster member status: {status}") (:43)
  - Leaving or Exiting → Degraded($"Akka cluster member status: {status}") (:45)
  - anything else (Removed, Down, WeaklyUp…) → Unhealthy($"Akka cluster member status: {status}") (:47)
Three-way status policy: Healthy / Degraded / Unhealthy. This is more granular than OtOpcUa's two-way policy (self-Up-or-not → Healthy/Degraded with no Unhealthy path).
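The three-way policy reads naturally as a single switch over the member status. The sketch below is an illustration under assumptions: the two enums are local stand-ins for Akka.Cluster.MemberStatus and the ASP.NET Core HealthStatus so the block compiles without those packages, and ClusterStatusPolicy is a hypothetical name. A null status models the "ActorSystem not yet available" guard.

```csharp
using System;

// Usage: print the health status for a few member states.
foreach (var status in new MemberStatus?[] { MemberStatus.Up, MemberStatus.Leaving, MemberStatus.Down, null })
{
    Console.WriteLine($"{status?.ToString() ?? "(no ActorSystem)"} -> {ClusterStatusPolicy.Evaluate(status)}");
}

// Stand-in for Akka.Cluster.MemberStatus.
public enum MemberStatus { Joining, Up, Leaving, Exiting, Down, Removed, WeaklyUp }

// Stand-in for Microsoft.Extensions.Diagnostics.HealthChecks.HealthStatus.
public enum HealthStatus { Unhealthy, Degraded, Healthy }

public static class ClusterStatusPolicy
{
    // null models the startup window before the ActorSystem exists.
    public static HealthStatus Evaluate(MemberStatus? status) => status switch
    {
        null => HealthStatus.Degraded,
        MemberStatus.Up or MemberStatus.Joining => HealthStatus.Healthy,
        MemberStatus.Leaving or MemberStatus.Exiting => HealthStatus.Degraded,
        _ => HealthStatus.Unhealthy, // Removed, Down, WeaklyUp, and anything else
    };
}
```

Keeping the mapping in one pure function makes the policy easy to unit test independently of a running cluster.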
ActiveNodeHealthCheck
src/ZB.MOM.WW.ScadaBridge.Host/Health/ActiveNodeHealthCheck.cs:
- :13: injects AkkaHostedService
- :29–44: four-branch logic:
  - ActorSystem == null → Unhealthy("ActorSystem not yet available.") (:31)
  - SelfMember.Status != Up → Unhealthy($"Node not Up (status: ...)") (:37)
  - Up AND cluster.State.Leader == self.Address → Healthy("Active node (cluster leader).") (:41)
  - Up but not leader → Unhealthy("Standby node (not cluster leader).") (:43)
No Degraded path — ActiveNodeHealthCheck uses Unhealthy for standby and non-Up states,
which causes /health/active to return HTTP 503 on a standby. This is the intended behavior for
Traefik active-node routing.
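To make the 503-on-standby behavior concrete, the sketch below pairs the active-node decision with the HTTP status the endpoint would return. It assumes /health/active keeps ASP.NET Core's default ResultStatusCodes mapping (Healthy and Degraded → 200, Unhealthy → 503). ActiveNodeProbe and the local enums are hypothetical stand-ins, not the ScadaBridge types.

```csharp
using System;

// Usage: leader, standby, and a node still starting up.
Console.WriteLine(ActiveNodeProbe.HttpStatusFor(ActiveNodeProbe.Evaluate(true, MemberStatus.Up, true)));   // 200
Console.WriteLine(ActiveNodeProbe.HttpStatusFor(ActiveNodeProbe.Evaluate(true, MemberStatus.Up, false)));  // 503
Console.WriteLine(ActiveNodeProbe.HttpStatusFor(ActiveNodeProbe.Evaluate(false, MemberStatus.Up, false))); // 503

// Stand-in for Akka.Cluster.MemberStatus.
public enum MemberStatus { Joining, Up, Leaving, Exiting, Down, Removed, WeaklyUp }

// Stand-in for Microsoft.Extensions.Diagnostics.HealthChecks.HealthStatus.
public enum HealthStatus { Unhealthy, Degraded, Healthy }

public static class ActiveNodeProbe
{
    public static HealthStatus Evaluate(bool actorSystemAvailable, MemberStatus selfStatus, bool selfIsLeader)
    {
        if (!actorSystemAvailable) return HealthStatus.Unhealthy;          // ActorSystem not yet available
        if (selfStatus != MemberStatus.Up) return HealthStatus.Unhealthy;  // node not Up
        return selfIsLeader ? HealthStatus.Healthy : HealthStatus.Unhealthy; // active vs. standby
    }

    // Assumed default ASP.NET Core mapping; a project can override ResultStatusCodes.
    public static int HttpStatusFor(HealthStatus status) =>
        status == HealthStatus.Unhealthy ? 503 : 200;
}
```

Because Degraded would map to 200 under the default mapping, Unhealthy is the only status that takes a standby out of active routing.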
3. Tag / tier summary
ScadaBridge uses name-based predicates at the endpoint level rather than tags on the check
registration. Tags are absent from all three .AddCheck<T>() calls.
| Probe | /health/ready | /health/active | /healthz |
|---|---|---|---|
| DatabaseHealthCheck | ✅ | excluded by name | ⛔ absent |
| AkkaClusterHealthCheck | ✅ | excluded by name | ⛔ absent |
| ActiveNodeHealthCheck | excluded by name | ✅ | ⛔ absent |
/healthz is absent: there is no bare process liveness endpoint. Kubernetes or Traefik liveness
probes have to target /health/ready and tolerate its 503-until-ready behavior.
4. IActiveNodeGate and Inbound API gating
src/ZB.MOM.WW.ScadaBridge.Host/Health/ActiveNodeGate.cs:
- :24: ActiveNodeGate implements IActiveNodeGate (from the InboundAPI project)
- :40: IsActiveNode property: returns true only when _akkaService.ActorSystem != null AND cluster.SelfMember.Status == MemberStatus.Up AND cluster.State.Leader == self.Address. Defaults to false safely during startup (:45–46).
- :131 in Program.cs: registered as a singleton. The InboundApiEndpointFilter consults this gate on every /api/* request and returns HTTP 503 on a standby node.
ActiveNodeGate mirrors the exact same logic as ActiveNodeHealthCheck — both check Up + leader.
They are separate types serving two different concerns (the health endpoint and the API gate) but
are not abstracted into a shared service; each reads cluster state independently.
IActiveNodeGate is the generalized seam the ZB.MOM.WW.Health core package lifts to the shared
library.
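A minimal sketch of the gate seam follows, assuming only what is stated above: an IsActiveNode property that is true when the ActorSystem exists, the node is Up, and it is the leader, and false otherwise. ClusterSnapshot, SnapshotActiveNodeGate, and ApiStatusFor are hypothetical names introduced for illustration; the real gate reads Akka cluster state rather than a snapshot delegate.

```csharp
using System;

// Usage: a standby node (Up but not leader) is rejected by the API gate.
var snapshot = new ClusterSnapshot(ActorSystemAvailable: true, SelfIsUp: true, SelfIsLeader: false);
IActiveNodeGate gate = new SnapshotActiveNodeGate(() => snapshot);
Console.WriteLine(gate.IsActiveNode);            // False
Console.WriteLine(InboundApi.ApiStatusFor(gate)); // 503

public interface IActiveNodeGate
{
    bool IsActiveNode { get; }
}

// Hypothetical value type standing in for the live Akka cluster state.
public sealed record ClusterSnapshot(bool ActorSystemAvailable, bool SelfIsUp, bool SelfIsLeader);

public sealed class SnapshotActiveNodeGate : IActiveNodeGate
{
    private readonly Func<ClusterSnapshot> _read;

    public SnapshotActiveNodeGate(Func<ClusterSnapshot> read) => _read = read;

    // Any missing condition yields false, so startup defaults to "not active".
    public bool IsActiveNode =>
        _read() is { ActorSystemAvailable: true, SelfIsUp: true, SelfIsLeader: true };
}

public static class InboundApi
{
    // Models the filter's decision only: 503 on a standby, otherwise let the request through.
    public static int ApiStatusFor(IActiveNodeGate gate) => gate.IsActiveNode ? 200 : 503;
}
```

With the state read behind a single delegate, the same evaluation could back both the gate and a health check, which is the unification the shared library aims for.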
5. HealthMonitoring domain pipeline (out of scope for shared library)
src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/ is a separate project implementing a distributed
health aggregation pipeline. It is not ASP.NET Core health checks and is not in scope for
ZB.MOM.WW.Health.
Key types:
- SiteHealthCollector: thread-safe singleton accumulating per-site error counters, connection metrics, and tag-read metrics. Populated by actors in the DCL layer.
- HealthReportSender: a background service on site clusters that serializes SiteHealthState and ships it to the central cluster via Akka remoting at a configurable interval.
- CentralHealthReportLoop: central-only background service that generates a synthetic SiteHealthReport for the central cluster itself (siteId "$central") and feeds it into the central aggregator.
- CentralHealthAggregator: a BackgroundService on the central cluster tracking the latest health report per site and detecting offline sites via heartbeat timeout. Exposes GetAggregatedHealth() to the Central UI's /monitoring/health endpoint.
This pipeline is domain-specific (multi-site ScadaBridge topology) and will remain per-project regardless of shared-library adoption.
6. Notable design choices
- Name-based predicates vs. tags: ScadaBridge uses check.Name == "active-node" predicate logic at the endpoint level. OtOpcUa uses tag membership (c.Tags.Contains("ready")). The tag approach is more composable (a probe can participate in multiple tiers); the name approach is more explicit. The shared MapZbHealth should use tags by default.
- HealthChecks.UI.Client response writer: ScadaBridge uses the richer JSON response writer from the AspNetCore.HealthChecks.UI.Client package. OtOpcUa uses the default plain-text writer. The shared library's canonical response writer standardizes this.
- ActiveNodeHealthCheck returns Unhealthy for standby: a standby is not unhealthy in the system sense; it is a deliberate routing discriminator. Using Unhealthy here ensures /health/active returns HTTP 503 (Traefik sees the node as down for active traffic). The naming is semantically imprecise but operationally correct.
- IActiveNodeGate + ActiveNodeGate duplication: the gate and the health check implement the same logic independently. The shared library's IActiveNodeGate seam + ActiveNodeHealthCheck unify them: one backing service, two consumers.
Adoption plan → ZB.MOM.WW.Health
Replace with shared probes:
- AkkaClusterHealthCheck → ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck using the Default policy (Up/Joining=Healthy, Leaving/Exiting=Degraded, else Unhealthy). ScadaBridge's existing three-way policy is the Default, so no preset selection is needed.
- ActiveNodeHealthCheck → ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck with no role filter (role-less default: Up && leader = Healthy, else Unhealthy). The shared implementation also backs IActiveNodeGate, eliminating the duplicated leader-check logic between ActiveNodeHealthCheck and ActiveNodeGate.
- DatabaseHealthCheck → ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<ScadaBridgeDbContext> using the default CanConnectAsync probe (ScadaBridge's existing behavior). No ProbeQuery delegate needed.
- Replace the name-based predicates with tag-based predicates by adding tags at registration time: "database" and "akka-cluster" → ["ready"]; "active-node" → ["active"]. Then call app.MapZbHealth() instead of the two manual MapHealthChecks calls.
- Add /healthz: MapZbHealth() maps the bare liveness tier automatically. ScadaBridge currently lacks this endpoint.
- Switch ResponseWriter from UIResponseWriter.WriteHealthCheckUIResponse to the shared canonical writer (a convergence item: HealthChecks.UI.Client style lifted to the default in ZB.MOM.WW.Health).
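The tag assignment in the plan above can be modeled in a few lines to show which probes each tier would run. This is a sketch under assumptions, not the MapZbHealth implementation: Registration and TagTiers are hypothetical names, and treating /healthz as a tier that runs no probes (a bare process liveness answer) is an assumption about how the liveness tier behaves.

```csharp
using System;
using System.Linq;

// Usage: list the probes each tier would evaluate.
foreach (var path in new[] { "/healthz", "/health/ready", "/health/active" })
{
    Console.WriteLine($"{path}: [{string.Join(", ", TagTiers.ChecksFor(path))}]");
}

// Hypothetical model of a check registration: a name plus its tags.
public sealed record Registration(string Name, string[] Tags);

public static class TagTiers
{
    // Tags as proposed in the adoption plan.
    private static readonly Registration[] Registrations =
    {
        new("database", new[] { "ready" }),
        new("akka-cluster", new[] { "ready" }),
        new("active-node", new[] { "active" }),
    };

    public static string[] ChecksFor(string path) => path switch
    {
        "/health/ready" => WithTag("ready"),
        "/health/active" => WithTag("active"),
        "/healthz" => Array.Empty<string>(), // assumption: liveness runs no probes
        _ => throw new ArgumentException($"Unknown health path: {path}", nameof(path)),
    };

    private static string[] WithTag(string tag) =>
        Registrations.Where(r => r.Tags.Contains(tag)).Select(r => r.Name).ToArray();
}
```

Unlike the name-based predicates, adding a probe to a second tier here only means adding another tag to its registration.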
Keep bespoke:
- HealthMonitoring/ domain pipeline (SiteHealthCollector, CentralHealthAggregator, etc.): entirely per-project, no shared-library equivalent.
- IActiveNodeGate moves from the InboundAPI project to ZB.MOM.WW.Health (core package) on adoption. InboundApiEndpointFilter references the shared interface; AkkaActiveNodeGate (from ZB.MOM.WW.Health.Akka) becomes the singleton implementation registered in DI. The interface definition is no longer owned by the InboundAPI project.
- The Central UI's /monitoring/health endpoint: powered by CentralHealthAggregator, not by ASP.NET health checks.
- The comment at Program.cs:217–221 explains the readiness design decision (standby nodes are ready; leadership is a separate concern). This intent is preserved by the tag-based approach.
Adoption is a follow-on task (tracked in GAPS.md), not part of the ZB.MOM.WW.Health
library build. The library build delivers the shared implementations; adoption lands in the
ScadaBridge repo as a separate commit once the nupkg is available.