scadaproj/components/health/current-state/scadabridge/CURRENT-STATE.md
Joseph Doherty 07d5907258 docs(health): resolve spec/contract/gaps consistency (review fixes)
Applies canonical resolutions for eight settled decisions:
- GAPS: remove three stale "Decisions still open" bullets (#1 IActiveNodeGate placement, #2 GrpcChannel type, #3 OtOpcUaCompat named constant)
- Shared contract: AkkaClusterHealthCheck, ActiveNodeHealthCheck constructors take IServiceProvider (lazy ActorSystem, Degraded-when-not-ready)
- Shared contract: AkkaActiveNodeGate takes IServiceProvider; reads SelfMember+leader directly, null-guarded; does not proxy ActiveNodeHealthCheck
- Shared contract: DatabaseHealthCheckOptions.Probe renamed to ProbeQuery; consumer matrix updated
- Shared contract: settled AddZbHealthChecks open question removed (spec §5 is per-project AddHealthChecks)
- SPEC §2.2: OtOpcUaCompat Leaving/Exiting cell updated from — to Degraded + footnote; §2.3 startup-safety note added
- README: status line corrected from "built and tested" to "scaffolded … implementation is follow-on (task #7)"; IActiveNodeGate "left per-project" bullet removed
- OtOpcUa current-state: AddZbHealthChecks → AddHealthChecks().AddCheck<...>(); IClusterRoleInfo note reframed as accepted trade-off
- ScadaBridge current-state: IActiveNodeGate bullet rewritten — interface moves to ZB.MOM.WW.Health on adoption, InboundApiEndpointFilter references shared interface
2026-06-01 06:33:42 -04:00


Health — current state: ScadaBridge

Repo: ~/Desktop/ScadaBridge. Stack: .NET 10, Akka.NET; solution ZB.MOM.WW.ScadaBridge.slnx. Health code centers on src/ZB.MOM.WW.ScadaBridge.Host/Health/ (ASP.NET probes) and the separate src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/ project (domain aggregation pipeline). All paths relative to repo root. Verified 2026-06-01.

Two-tier pattern: /health/ready and /health/active — no /healthz. Three probes (database, Akka cluster, active-node). ScadaBridge also has a bespoke distributed HealthMonitoring/ pipeline that is entirely separate from the ASP.NET health checks and is out of scope for the shared library.

1. Endpoint wiring

src/ZB.MOM.WW.ScadaBridge.Host/Program.cs:

  • :114-117 - builder.Services.AddHealthChecks() followed by three .AddCheck<T>() calls (no tags; checks are selected by name at the endpoint level):
    • .AddCheck<DatabaseHealthCheck>("database")
    • .AddCheck<AkkaClusterHealthCheck>("akka-cluster")
    • .AddCheck<ActiveNodeHealthCheck>("active-node")
  • :131 - builder.Services.AddSingleton<IActiveNodeGate, ActiveNodeGate>() registers the production IActiveNodeGate implementation (Inbound API gating, not a health-check probe).
  • :222-226 - /health/ready mapped with Predicate = check => check.Name != "active-node" and ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse (from HealthChecks.UI.Client). Excludes the active-node check so a healthy standby node reports ready.
  • :229-233 - /health/active mapped with Predicate = check => check.Name == "active-node" and ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse. Active-node check only.

No /healthz endpoint. Both mapped endpoints use HealthChecks.UI.Client JSON (not the default plain-text writer), which is a divergence from OtOpcUa.
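
The wiring described above can be sketched roughly as follows. This is a minimal illustration, not a copy of Program.cs: the three real check classes are replaced by stub delegates (an assumption made so the snippet compiles on its own), and only stock ASP.NET Core and HealthChecks.UI.Client calls are used.

```csharp
using HealthChecks.UI.Client;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Stub delegates stand in for DatabaseHealthCheck, AkkaClusterHealthCheck and
// ActiveNodeHealthCheck (assumption: stubs keep the sketch self-contained).
// No tags are attached; selection happens by name at the endpoints below.
builder.Services.AddHealthChecks()
    .AddCheck("database", () => HealthCheckResult.Healthy())
    .AddCheck("akka-cluster", () => HealthCheckResult.Healthy())
    .AddCheck("active-node", () => HealthCheckResult.Unhealthy("Standby node (not cluster leader)."));

var app = builder.Build();

// Readiness: every check except "active-node", so a healthy standby reports ready.
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Name != "active-node",
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

// Active tier: the "active-node" check only.
app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = check => check.Name == "active-node",
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});

app.Run();
```

With these stubs, and the default status-code mapping of HealthCheckOptions, /health/ready would answer 200 and /health/active would answer 503, which matches the standby behavior described in this document.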

2. Probes

DatabaseHealthCheck

src/ZB.MOM.WW.ScadaBridge.Host/Health/DatabaseHealthCheck.cs:

  • :11 - injects ScadaBridgeDbContext directly (not a factory)
  • :33-43 - calls _dbContext.Database.CanConnectAsync(cancellationToken):
    • Returns true → HealthCheckResult.Healthy("Database connection is available.") (:34-35)
    • Returns false → HealthCheckResult.Unhealthy("Database connection failed.") (:36)
    • Throws → HealthCheckResult.Unhealthy("Database connection failed.", ex) (:40)

CanConnectAsync tests the connection layer only — it does not run any query or verify schema state. This is less strict than OtOpcUa's Deployments query but more transparent about failure cause (connection vs. schema). No Degraded path.
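
A probe with the three outcomes listed above might look like the sketch below. The class name and the generic TContext parameter are illustrative assumptions (the real check injects ScadaBridgeDbContext directly); the result messages are the ones quoted in this section.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical name; generic over the context type so the sketch stands alone.
public sealed class DatabaseHealthCheckSketch<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly TContext _dbContext;

    public DatabaseHealthCheckSketch(TContext dbContext) => _dbContext = dbContext;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            // true -> Healthy, false -> Unhealthy; there is no Degraded path.
            var canConnect = await _dbContext.Database.CanConnectAsync(cancellationToken);
            return canConnect
                ? HealthCheckResult.Healthy("Database connection is available.")
                : HealthCheckResult.Unhealthy("Database connection failed.");
        }
        catch (Exception ex)
        {
            // A thrown exception is reported as Unhealthy with the exception attached.
            return HealthCheckResult.Unhealthy("Database connection failed.", ex);
        }
    }
}
```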

AkkaClusterHealthCheck

src/ZB.MOM.WW.ScadaBridge.Host/Health/AkkaClusterHealthCheck.cs:

  • :13 - injects AkkaHostedService (not ActorSystem directly)
  • :33-50 - gets _akkaService.ActorSystem, guards on null → Degraded("ActorSystem not yet available."), then reads cluster.SelfMember.Status:
    • Up or Joining → Healthy($"Akka cluster member status: {status}") (:43)
    • Leaving or Exiting → Degraded($"Akka cluster member status: {status}") (:45)
    • anything else (Removed, Down, WeaklyUp…) → Unhealthy($"Akka cluster member status: {status}") (:47)

Three-way status policy: Healthy / Degraded / Unhealthy. This is more granular than OtOpcUa's two-way policy (self-Up-or-not → Healthy/Degraded with no Unhealthy path).
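
The three-way policy can be expressed as a single mapping from member status to health result. The sketch below isolates that mapping in a hypothetical static helper (the class and method names are assumptions); it uses Akka.NET's MemberStatus enum and omits the null-ActorSystem guard described above.

```csharp
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper; the real check performs this mapping inside CheckHealthAsync.
public static class ClusterStatusPolicySketch
{
    public static HealthCheckResult Classify(MemberStatus status)
    {
        var description = $"Akka cluster member status: {status}";
        return status switch
        {
            // Member is in, or entering, the cluster.
            MemberStatus.Up or MemberStatus.Joining => HealthCheckResult.Healthy(description),
            // Member is on its way out but still present.
            MemberStatus.Leaving or MemberStatus.Exiting => HealthCheckResult.Degraded(description),
            // Every remaining status (Removed, Down, WeaklyUp, ...) falls through here.
            _ => HealthCheckResult.Unhealthy(description)
        };
    }
}
```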

ActiveNodeHealthCheck

src/ZB.MOM.WW.ScadaBridge.Host/Health/ActiveNodeHealthCheck.cs:

  • :13 - injects AkkaHostedService
  • :29-44 - four-branch logic:
    • ActorSystem == null → Unhealthy("ActorSystem not yet available.") (:31)
    • SelfMember.Status != Up → Unhealthy($"Node not Up (status: ...)") (:37)
    • Up AND cluster.State.Leader == self.Address → Healthy("Active node (cluster leader).") (:41)
    • Up but not leader → Unhealthy("Standby node (not cluster leader).") (:43)

No Degraded path — ActiveNodeHealthCheck uses Unhealthy for standby and non-Up states, which causes /health/active to return HTTP 503 on a standby. This is the intended behavior for Traefik active-node routing.
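
The branches above can be sketched as one evaluation function. The helper name is hypothetical, and the function takes the ActorSystem as a parameter instead of reading it from AkkaHostedService (an assumption that keeps the sketch independent of the host project). The not-Up message is shortened because its exact wording is elided in this document.

```csharp
using Akka.Actor;
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper mirroring the branch order described above.
public static class ActiveNodePolicySketch
{
    public static HealthCheckResult Evaluate(ActorSystem? system)
    {
        // Startup: the actor system has not been created yet.
        if (system is null)
            return HealthCheckResult.Unhealthy("ActorSystem not yet available.");

        var cluster = Akka.Cluster.Cluster.Get(system);
        var self = cluster.SelfMember;

        // Any status other than Up means this node cannot be the active node.
        if (self.Status != MemberStatus.Up)
            return HealthCheckResult.Unhealthy("Node not Up.");

        // Leader can be null while the cluster is still converging.
        var leader = cluster.State.Leader;
        return leader is not null && leader.Equals(self.Address)
            ? HealthCheckResult.Healthy("Active node (cluster leader).")
            : HealthCheckResult.Unhealthy("Standby node (not cluster leader).");
    }
}
```

Because every non-leader outcome is Unhealthy, an endpoint that exposes only this check answers 503 on a standby under the default ASP.NET Core status-code mapping.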

3. Tag / tier summary

ScadaBridge uses name-based predicates at the endpoint level rather than tags on the check registration. Tags are absent from all three .AddCheck<T>() calls.

| Probe | /health/ready | /health/active | /healthz |
|---|---|---|---|
| DatabaseHealthCheck | included | excluded by name | absent |
| AkkaClusterHealthCheck | included | excluded by name | absent |
| ActiveNodeHealthCheck | excluded by name | included | absent |

/healthz is absent: there is no bare process liveness endpoint. Kubernetes or Traefik liveness probes therefore have to target /health/ready and tolerate its 503-until-ready behavior.

4. IActiveNodeGate and Inbound API gating

src/ZB.MOM.WW.ScadaBridge.Host/Health/ActiveNodeGate.cs:

  • :24 - ActiveNodeGate implements IActiveNodeGate (from the InboundAPI project)
  • :40 - IsActiveNode property: returns true only when _akkaService.ActorSystem != null AND cluster.SelfMember.Status == MemberStatus.Up AND cluster.State.Leader == self.Address. Defaults to false safely during startup (:45-46).
  • :131 in Program.cs - registered as a singleton. The InboundApiEndpointFilter consults this gate on every /api/* request and returns HTTP 503 on a standby node.

ActiveNodeGate mirrors the exact same logic as ActiveNodeHealthCheck — both check Up + leader. They are separate types serving two different concerns (the health endpoint and the API gate) but are not abstracted into a shared service; each reads cluster state independently.

IActiveNodeGate is the generalized seam the ZB.MOM.WW.Health core package lifts to the shared library.
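
To illustrate the seam, here is a minimal sketch of a gate interface and an endpoint filter that consults it. Only the IsActiveNode property is confirmed by this document; the interface shape beyond that, and the filter class shown here, are assumptions and not the actual InboundApiEndpointFilter.

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

// Assumed shape: a single read-only property, as described for ActiveNodeGate.
public interface IActiveNodeGate
{
    bool IsActiveNode { get; }
}

// Hypothetical filter illustrating how a gate can short-circuit requests on a standby.
public sealed class ActiveNodeFilterSketch : IEndpointFilter
{
    private readonly IActiveNodeGate _gate;

    public ActiveNodeFilterSketch(IActiveNodeGate gate) => _gate = gate;

    public async ValueTask<object?> InvokeAsync(
        EndpointFilterInvocationContext context,
        EndpointFilterDelegate next)
    {
        // Standby (or not-yet-started) node: reject without invoking the endpoint.
        if (!_gate.IsActiveNode)
            return Results.StatusCode(StatusCodes.Status503ServiceUnavailable);

        // Active node: continue down the filter pipeline.
        return await next(context);
    }
}
```

Keeping the filter dependent on the interface alone is what allows the implementation behind it to be swapped for a shared one without touching the API layer.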

5. HealthMonitoring domain pipeline (out of scope for shared library)

src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/ is a separate project implementing a distributed health aggregation pipeline. It is not ASP.NET Core health checks and is not in scope for ZB.MOM.WW.Health.

Key types:

  • SiteHealthCollector — thread-safe singleton accumulating per-site error counters, connection metrics, and tag-read metrics. Populated by actors in the DCL layer.
  • HealthReportSender — a background service on site clusters that serializes SiteHealthState and ships it to the central cluster via Akka remoting at a configurable interval.
  • CentralHealthReportLoop — central-only background service that generates a synthetic SiteHealthReport for the central cluster itself (siteId "$central") and feeds it into the central aggregator.
  • CentralHealthAggregator — a BackgroundService on the central cluster tracking the latest health report per site and detecting offline sites via heartbeat timeout. Exposes GetAggregatedHealth() to the Central UI's /monitoring/health endpoint.

This pipeline is domain-specific (multi-site ScadaBridge topology) and will remain per-project regardless of shared-library adoption.
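
For orientation only, the heartbeat-timeout idea behind CentralHealthAggregator can be sketched in a few lines. Every name below is hypothetical and the real aggregator's data model, timing source, and API are not shown in this document; the sketch only demonstrates "track the latest report per site, flag sites whose last report is older than a timeout".

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration; not the CentralHealthAggregator implementation.
public sealed class HeartbeatAggregatorSketch
{
    private readonly ConcurrentDictionary<string, DateTimeOffset> _lastSeen = new();
    private readonly TimeSpan _timeout;

    public HeartbeatAggregatorSketch(TimeSpan timeout) => _timeout = timeout;

    // Record the arrival time of the latest report for a site.
    public void RecordReport(string siteId, DateTimeOffset receivedAt) =>
        _lastSeen[siteId] = receivedAt;

    // A site counts as offline when its latest report is older than the timeout.
    public IReadOnlyList<string> GetOfflineSites(DateTimeOffset now) =>
        _lastSeen
            .Where(entry => now - entry.Value > _timeout)
            .Select(entry => entry.Key)
            .OrderBy(id => id, StringComparer.Ordinal)
            .ToList();
}
```

Passing the current time in as a parameter keeps the timeout logic deterministic to test; a production service would more likely read a clock on a timer.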

6. Notable design choices

  • Name-based predicates vs. tags — ScadaBridge uses check.Name == "active-node" predicate logic at the endpoint level. OtOpcUa uses tag membership (c.Tags.Contains("ready")). The tag approach is more composable (a probe can participate in multiple tiers), the name approach is more explicit. The shared MapZbHealth should use tags by default.
  • HealthChecks.UI.Client response writer — ScadaBridge uses the richer JSON response writer from the AspNetCore.HealthChecks.UI.Client package. OtOpcUa uses the default plain-text writer. The shared library's canonical response writer standardizes this.
  • ActiveNodeHealthCheck returns Unhealthy for standby — a standby is not unhealthy in the system sense; it is a deliberate routing discriminator. Using Unhealthy here ensures /health/active returns HTTP 503 (Traefik sees the node as down for active traffic). The naming is semantically imprecise but operationally correct.
  • IActiveNodeGate + ActiveNodeGate duplication — the gate and the health check implement the same logic independently. The shared library's IActiveNodeGate seam + ActiveNodeHealthCheck unify them: one backing service, two consumers.

Adoption plan → ZB.MOM.WW.Health

Replace with shared probes:

  • AkkaClusterHealthCheck → ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck using the Default policy (Up/Joining=Healthy, Leaving/Exiting=Degraded, else Unhealthy). ScadaBridge's existing three-way policy is the Default, so no preset selection is needed.
  • ActiveNodeHealthCheck → ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck with no role filter (role-less default: Up && leader = Healthy, else Unhealthy). The shared implementation also backs IActiveNodeGate, eliminating the duplicated leader-check logic between ActiveNodeHealthCheck and ActiveNodeGate.
  • DatabaseHealthCheck → ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<ScadaBridgeDbContext> using the default CanConnectAsync probe (ScadaBridge's existing behavior). No ProbeQuery delegate needed.
  • Replace the name-based predicates with tag-based predicates by adding tags at registration time: "database" and "akka-cluster" → ["ready"]; "active-node" → ["active"]. Then call app.MapZbHealth() instead of the two manual MapHealthChecks calls.
  • Add /healthz: MapZbHealth() maps the bare liveness tier automatically. ScadaBridge currently lacks this endpoint.
  • Switch ResponseWriter from UIResponseWriter.WriteHealthCheckUIResponse to the shared canonical writer (a convergence item — HealthChecks.UI.Client style lifted to the default in ZB.MOM.WW.Health).
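
The tag-based tiers in this plan can be approximated with stock ASP.NET Core calls, as sketched below. This is not the MapZbHealth() implementation, whose signature and behavior are not given here; it only shows, under that caveat, what tag registration and three tag-driven endpoints look like. Stub delegates replace the shared probes, and the default response writer is left in place.

```csharp
using System.Linq;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Tags assigned at registration time, as proposed in the adoption plan.
// Stub delegates stand in for the shared probes (assumption).
builder.Services.AddHealthChecks()
    .AddCheck("database", () => HealthCheckResult.Healthy(), tags: new[] { "ready" })
    .AddCheck("akka-cluster", () => HealthCheckResult.Healthy(), tags: new[] { "ready" })
    .AddCheck("active-node", () => HealthCheckResult.Healthy(), tags: new[] { "active" });

var app = builder.Build();

// Each tier selects checks by tag membership instead of by name.
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready")
});
app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("active")
});

// Bare liveness: no checks run, so the endpoint answers 200 whenever the process is serving.
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    Predicate = _ => false
});

app.Run();
```

A probe that should count toward more than one tier would simply carry several tags, which is the composability advantage noted under the design choices.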

Keep bespoke:

  • HealthMonitoring/ domain pipeline (SiteHealthCollector, CentralHealthAggregator, etc.) — entirely per-project, no shared-library equivalent.
  • IActiveNodeGate moves from the InboundAPI project to ZB.MOM.WW.Health (core package) on adoption. InboundApiEndpointFilter references the shared interface; AkkaActiveNodeGate (from ZB.MOM.WW.Health.Akka) becomes the singleton implementation registered in DI. The interface definition is no longer owned by the InboundAPI project.
  • The Central UI's /monitoring/health endpoint — powered by CentralHealthAggregator, not by ASP.NET health checks.
  • The comment at Program.cs:217-221 explains the readiness design decision (standby nodes are ready; leadership is a separate concern). This intent is preserved by the tag-based approach.

Adoption is a follow-on task (tracked in GAPS.md), not part of the ZB.MOM.WW.Health library build. The library build delivers the shared implementations; adoption lands in the ScadaBridge repo as a separate commit once the nupkg is available.