# Health current state: ScadaBridge

Repo: `~/Desktop/ScadaBridge`. Stack: .NET 10, Akka.NET; solution `ZB.MOM.WW.ScadaBridge.slnx`. Health code centers on `src/ZB.MOM.WW.ScadaBridge.Host/Health/` (ASP.NET probes) and the separate `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/` project (domain aggregation pipeline). All paths relative to repo root. Verified 2026-06-01.

Two-tier pattern: `/health/ready` and `/health/active`, with no `/healthz`. Three probes (database, Akka cluster, active-node). ScadaBridge also has a bespoke distributed `HealthMonitoring/` pipeline that is entirely separate from the ASP.NET health checks and is out of scope for the shared library.

## 1. Endpoint wiring

`src/ZB.MOM.WW.ScadaBridge.Host/Program.cs`:

- `:114–117`: `builder.Services.AddHealthChecks()` followed by three `.AddCheck()` calls (no tags, checked by name at the endpoint level):
  - `.AddCheck("database")`
  - `.AddCheck("akka-cluster")`
  - `.AddCheck("active-node")`
- `:131`: `builder.Services.AddSingleton()` registers the production `IActiveNodeGate` implementation (Inbound API gating, not a health-check probe).
- `:222–226`: `/health/ready` mapped with `Predicate = check => check.Name != "active-node"` and `ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse` (from `HealthChecks.UI.Client`). Excludes the active-node check so a healthy standby node reports ready.
- `:229–233`: `/health/active` mapped with `Predicate = check => check.Name == "active-node"` and `ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse`. Active-node check only.

No `/healthz` endpoint. Both mapped endpoints use `HealthChecks.UI.Client` JSON (not the default plain-text writer), which is a divergence from OtOpcUa.

## 2. Probes

### DatabaseHealthCheck

`src/ZB.MOM.WW.ScadaBridge.Host/Health/DatabaseHealthCheck.cs`:

- `:11`: injects `ScadaBridgeDbContext` directly (not a factory)
- `:33–43`: calls `_dbContext.Database.CanConnectAsync(cancellationToken)`:
  - Returns `true` → `HealthCheckResult.Healthy("Database connection is available.")` (`:34–35`)
  - Returns `false` → `HealthCheckResult.Unhealthy("Database connection failed.")` (`:36`)
  - Throws → `HealthCheckResult.Unhealthy("Database connection failed.", ex)` (`:40`)

`CanConnectAsync` tests the connection layer only; it does not run any query or verify schema state. This is less strict than OtOpcUa's `Deployments` query but more transparent about failure cause (connection vs. schema). No `Degraded` path.

### AkkaClusterHealthCheck

`src/ZB.MOM.WW.ScadaBridge.Host/Health/AkkaClusterHealthCheck.cs`:

- `:13`: injects `AkkaHostedService` (not `ActorSystem` directly)
- `:33–50`: gets `_akkaService.ActorSystem`, guards on null → `Degraded("ActorSystem not yet available.")`, then reads `cluster.SelfMember.Status`:
  - `Up` or `Joining` → `Healthy($"Akka cluster member status: {status}")` (`:43`)
  - `Leaving` or `Exiting` → `Degraded($"Akka cluster member status: {status}")` (`:45`)
  - anything else (Removed, Down, WeaklyUp…) → `Unhealthy($"Akka cluster member status: {status}")` (`:47`)

Three-way status policy: Healthy / Degraded / Unhealthy. This is more granular than OtOpcUa's two-way policy (self-Up-or-not → Healthy/Degraded with no Unhealthy path).
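The three-way policy above can be read as a pure mapping from member status to health status. The following is a minimal sketch of that mapping, not the code in `AkkaClusterHealthCheck.cs`: `MemberStatusSketch`, `HealthStatusSketch`, and `ClusterStatusPolicy` are hypothetical stand-ins for `Akka.Cluster.MemberStatus`, `Microsoft.Extensions.Diagnostics.HealthChecks.HealthStatus`, and the probe body, introduced so the example compiles without package references.

```csharp
using System;

// Hypothetical stand-in for Akka.Cluster.MemberStatus (illustration only).
public enum MemberStatusSketch { Joining, Up, Leaving, Exiting, Down, Removed, WeaklyUp }

// Hypothetical stand-in for the ASP.NET Core HealthStatus enum (illustration only).
public enum HealthStatusSketch { Unhealthy, Degraded, Healthy }

public static class ClusterStatusPolicy
{
    // A null status models "ActorSystem not yet available", which the
    // document says the probe reports as Degraded rather than Unhealthy.
    public static HealthStatusSketch Map(MemberStatusSketch? status) => status switch
    {
        null => HealthStatusSketch.Degraded,
        MemberStatusSketch.Up or MemberStatusSketch.Joining => HealthStatusSketch.Healthy,
        MemberStatusSketch.Leaving or MemberStatusSketch.Exiting => HealthStatusSketch.Degraded,
        // Removed, Down, WeaklyUp and any future status fall through to Unhealthy.
        _ => HealthStatusSketch.Unhealthy,
    };
}
```

Keeping the policy in one pure function like this makes it easy to unit-test each status without starting an actor system.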
### ActiveNodeHealthCheck

`src/ZB.MOM.WW.ScadaBridge.Host/Health/ActiveNodeHealthCheck.cs`:

- `:13`: injects `AkkaHostedService`
- `:29–44`: three-path logic:
  - `ActorSystem == null` → `Unhealthy("ActorSystem not yet available.")` (`:31`)
  - `SelfMember.Status != Up` → `Unhealthy($"Node not Up (status: ...)")` (`:37`)
  - `Up` AND `cluster.State.Leader == self.Address` → `Healthy("Active node (cluster leader).")` (`:41`)
  - `Up` but not leader → `Unhealthy("Standby node (not cluster leader).")` (`:43`)

No `Degraded` path: `ActiveNodeHealthCheck` uses `Unhealthy` for standby and non-Up states, which causes `/health/active` to return HTTP 503 on a standby. This is the intended behavior for Traefik active-node routing.

## 3. Tag / tier summary

ScadaBridge uses **name-based predicates** at the endpoint level rather than tags on the check registration. Tags are absent from all three `.AddCheck()` calls.

| Probe | `/health/ready` | `/health/active` | `/healthz` |
|---|---|---|---|
| `DatabaseHealthCheck` | ✅ | excluded by name | ⛔ absent |
| `AkkaClusterHealthCheck` | ✅ | excluded by name | ⛔ absent |
| `ActiveNodeHealthCheck` | excluded by name | ✅ | ⛔ absent |

`/healthz` is absent, so there is no bare process liveness endpoint. A Kubernetes or Traefik liveness probe therefore has to target `/health/ready` and tolerate its 503-until-ready behavior.

## 4. IActiveNodeGate and Inbound API gating

`src/ZB.MOM.WW.ScadaBridge.Host/Health/ActiveNodeGate.cs`:

- `:24`: `ActiveNodeGate` implements `IActiveNodeGate` (from the `InboundAPI` project)
- `:40`: `IsActiveNode` property returns `true` only when `_akkaService.ActorSystem != null` AND `cluster.SelfMember.Status == MemberStatus.Up` AND `cluster.State.Leader == self.Address`. Defaults to `false` safely during startup (`:45–46`).
- `:131` in `Program.cs`: registered as a singleton. The `InboundApiEndpointFilter` consults this gate on every `/api/*` request and returns HTTP 503 on a standby node.
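
The gate condition listed above (ActorSystem available, self is `Up`, self is the cluster leader) reduces to a single predicate. Below is a minimal sketch of that predicate under stated assumptions: `ClusterSnapshot`, `ActiveNodeRule`, and `RejectionStatusCode` are hypothetical names for illustration, and addresses are modelled as plain strings, whereas the real gate reads Akka.NET cluster state directly.

```csharp
using System;

// Hypothetical snapshot of the cluster facts the gate reads (illustration only).
public sealed record ClusterSnapshot(
    bool ActorSystemAvailable,
    bool SelfIsUp,
    string? LeaderAddress,
    string SelfAddress);

public static class ActiveNodeRule
{
    // True only when all conditions hold. Any missing piece, including
    // "no leader known yet", yields false, the safe default during startup.
    public static bool IsActiveNode(ClusterSnapshot s) =>
        s.ActorSystemAvailable
        && s.SelfIsUp
        && s.LeaderAddress is not null
        && s.LeaderAddress == s.SelfAddress;

    // What an inbound filter could do with the answer: null means
    // "let the request through", 503 means "reject, this node is a standby".
    public static int? RejectionStatusCode(ClusterSnapshot s) =>
        IsActiveNode(s) ? null : 503;
}
```

A health check and an API gate could both call `IsActiveNode` and differ only in how they present the result (a `HealthCheckResult` versus an HTTP 503).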

`ActiveNodeGate` mirrors the logic of `ActiveNodeHealthCheck`: both check Up + leader. They are separate types serving two different concerns (the health endpoint and the API gate) but are not abstracted into a shared service; each reads cluster state independently. `IActiveNodeGate` is the generalized seam the `ZB.MOM.WW.Health` core package lifts to the shared library.

## 5. HealthMonitoring domain pipeline (out of scope for shared library)

`src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/` is a separate project implementing a distributed health aggregation pipeline. It is **not ASP.NET Core health checks** and is **not in scope** for `ZB.MOM.WW.Health`.

Key types:

- `SiteHealthCollector`: thread-safe singleton accumulating per-site error counters, connection metrics, and tag-read metrics. Populated by actors in the DCL layer.
- `HealthReportSender`: a background service on site clusters that serializes `SiteHealthState` and ships it to the central cluster via Akka remoting at a configurable interval.
- `CentralHealthReportLoop`: central-only background service that generates a synthetic `SiteHealthReport` for the central cluster itself (siteId `"$central"`) and feeds it into the central aggregator.
- `CentralHealthAggregator`: a `BackgroundService` on the central cluster tracking the latest health report per site and detecting offline sites via heartbeat timeout. Exposes `GetAggregatedHealth()` to the Central UI's `/monitoring/health` endpoint.

This pipeline is domain-specific (multi-site ScadaBridge topology) and will remain per-project regardless of shared-library adoption.

## 6. Notable design choices

- **Name-based predicates vs. tags**: ScadaBridge uses `check.Name == "active-node"` predicate logic at the endpoint level. OtOpcUa uses tag membership (`c.Tags.Contains("ready")`). The tag approach is more composable (a probe can participate in multiple tiers); the name approach is more explicit. The shared `MapZbHealth` should use tags by default.
- **`HealthChecks.UI.Client` response writer**: ScadaBridge uses the richer JSON response writer from the `AspNetCore.HealthChecks.UI.Client` package. OtOpcUa uses the default plain-text writer. The shared library's canonical response writer standardizes this.
- **`ActiveNodeHealthCheck` returns `Unhealthy` for standby**: a standby is not *unhealthy* in the system sense; it is a deliberate routing discriminator. Using `Unhealthy` here ensures `/health/active` returns HTTP 503 (Traefik sees the node as down for active traffic). The naming is semantically imprecise but operationally correct.
- **`IActiveNodeGate` + `ActiveNodeGate` duplication**: the gate and the health check implement the same logic independently. The shared library's `IActiveNodeGate` seam + `ActiveNodeHealthCheck` unify them: one backing service, two consumers.

---

## Adoption plan → `ZB.MOM.WW.Health`

**Replace with shared probes:**

- `AkkaClusterHealthCheck` → `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` using the **Default policy** (`Up`/`Joining`=Healthy, `Leaving`/`Exiting`=Degraded, else Unhealthy). ScadaBridge's existing three-way policy is the Default, so no preset selection is needed.
- `ActiveNodeHealthCheck` → `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with no role filter (role-less default: Up && leader = Healthy, else Unhealthy). The shared implementation also backs `IActiveNodeGate`, eliminating the duplicated leader-check logic between `ActiveNodeHealthCheck` and `ActiveNodeGate`.
- `DatabaseHealthCheck` → `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck` using the default `CanConnectAsync` probe (ScadaBridge's existing behavior). No `ProbeQuery` delegate needed.
- Replace the name-based predicates with tag-based predicates by adding tags at registration time: `"database"` and `"akka-cluster"` → `["ready"]`; `"active-node"` → `["active"]`. Then call `app.MapZbHealth()` instead of the two manual `MapHealthChecks` calls.
- **Add `/healthz`**: `MapZbHealth()` maps the bare liveness tier automatically. ScadaBridge currently lacks this endpoint.
- Switch `ResponseWriter` from `UIResponseWriter.WriteHealthCheckUIResponse` to the shared canonical writer (a convergence item: the `HealthChecks.UI.Client` style is lifted to the default in `ZB.MOM.WW.Health`).

**Keep bespoke:**

- `HealthMonitoring/` domain pipeline (`SiteHealthCollector`, `CentralHealthAggregator`, etc.): entirely per-project, no shared-library equivalent.
- `IActiveNodeGate` moves from the `InboundAPI` project to `ZB.MOM.WW.Health` (core package) on adoption. `InboundApiEndpointFilter` references the shared interface; `AkkaActiveNodeGate` (from `ZB.MOM.WW.Health.Akka`) becomes the singleton implementation registered in DI. The interface definition is no longer owned by the `InboundAPI` project.
- The Central UI's `/monitoring/health` endpoint, which is powered by `CentralHealthAggregator`, not by ASP.NET health checks.
- The comment at `Program.cs:217–221` explains the readiness design decision (standby nodes are ready; leadership is a separate concern). This intent is preserved by the tag-based approach.

**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Health` library build. The library build delivers the shared implementations; adoption lands in the ScadaBridge repo as a separate commit once the nupkg is available.
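
For orientation, the tag-based tiers described in the adoption plan can be approximated with plain ASP.NET Core health-check APIs. This is a sketch only: the signature and behavior of `MapZbHealth()` are not shown in this document, so the three `MapHealthChecks` calls below are an assumption about what it might expand to, and the stub delegates stand in for the real probe classes. It assumes a project using the `Microsoft.NET.Sdk.Web` SDK.

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Tags are attached at registration time; stub delegates replace the real probes.
builder.Services.AddHealthChecks()
    .AddCheck("database", () => HealthCheckResult.Healthy("stub"), tags: new[] { "ready" })
    .AddCheck("akka-cluster", () => HealthCheckResult.Healthy("stub"), tags: new[] { "ready" })
    .AddCheck("active-node", () => HealthCheckResult.Unhealthy("stub standby"), tags: new[] { "active" });

var app = builder.Build();

// Readiness tier: every check tagged "ready"; a healthy standby still reports ready.
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
});

// Active tier: only checks tagged "active"; the stub above yields HTTP 503.
app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("active"),
});

// Liveness tier: the predicate matches nothing, so no checks run and the
// empty report is Healthy (HTTP 200) whenever the process can answer.
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    Predicate = _ => false,
});

app.Run();
```

With tags, a probe can join a second tier by adding another tag at registration, without touching the endpoint predicates.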