# Health — current state: ScadaBridge

Repo: `~/Desktop/ScadaBridge`. Stack: .NET 10, Akka.NET; solution `ZB.MOM.WW.ScadaBridge.slnx`.
Health code centers on `src/ZB.MOM.WW.ScadaBridge.Host/Health/` (ASP.NET probes) and the
separate `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/` project (domain aggregation pipeline).
All paths relative to repo root.
Verified 2026-06-01.

Two-tier pattern: `/health/ready` and `/health/active` — no `/healthz`. Three probes (database,
Akka cluster, active-node). ScadaBridge also has a bespoke distributed `HealthMonitoring/`
pipeline that is entirely separate from the ASP.NET health checks and is out of scope for the
shared library.

## 1. Endpoint wiring

`src/ZB.MOM.WW.ScadaBridge.Host/Program.cs`:

- `:114–117` — `builder.Services.AddHealthChecks()` followed by three `.AddCheck<T>()` calls
  (no tags, checked by name at the endpoint level):
  - `.AddCheck<DatabaseHealthCheck>("database")`
  - `.AddCheck<AkkaClusterHealthCheck>("akka-cluster")`
  - `.AddCheck<ActiveNodeHealthCheck>("active-node")`
- `:131` — `builder.Services.AddSingleton<IActiveNodeGate, ActiveNodeGate>()` registers the
  production `IActiveNodeGate` implementation (Inbound API gating, not a health-check probe).
- `:222–226` — `/health/ready` mapped with `Predicate = check => check.Name != "active-node"` and
  `ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse` (from `HealthChecks.UI.Client`).
  Excludes the active-node check so a healthy standby node reports ready.
- `:229–233` — `/health/active` mapped with `Predicate = check => check.Name == "active-node"` and
  `ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse`. Active-node check only.

No `/healthz` endpoint. Both mapped endpoints use `HealthChecks.UI.Client` JSON (not the default
plain-text writer), which is a divergence from OtOpcUa.

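The wiring described above can be sketched roughly as follows. This is an illustrative reconstruction, not the contents of `Program.cs`: the `HealthEndpointSketch` class and its method names are hypothetical, inline delegate checks stand in for the three real probe classes, and the `AspNetCore.HealthChecks.UI.Client` package is assumed to be referenced.

```csharp
using HealthChecks.UI.Client;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper; the real registrations live inline in Program.cs.
public static class HealthEndpointSketch
{
    private const string ActiveNodeCheckName = "active-node";

    // Readiness tier: every registered check except the active-node check.
    public static bool IsReadyCheck(HealthCheckRegistration check) =>
        check.Name != ActiveNodeCheckName;

    // Active tier: the active-node check only.
    public static bool IsActiveCheck(HealthCheckRegistration check) =>
        check.Name == ActiveNodeCheckName;

    public static void AddProbes(IServiceCollection services) =>
        services.AddHealthChecks()
            // Delegate stand-ins for AddCheck<DatabaseHealthCheck>("database") and friends.
            .AddCheck("database", () => HealthCheckResult.Healthy())
            .AddCheck("akka-cluster", () => HealthCheckResult.Healthy())
            .AddCheck(ActiveNodeCheckName, () => HealthCheckResult.Unhealthy("Standby node."));

    public static void MapProbes(IEndpointRouteBuilder endpoints)
    {
        endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
        {
            Predicate = IsReadyCheck,
            ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse,
        });

        endpoints.MapHealthChecks("/health/active", new HealthCheckOptions
        {
            Predicate = IsActiveCheck,
            ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse,
        });
    }
}
```

In a minimal-hosting program these would be called as `HealthEndpointSketch.AddProbes(builder.Services)` before `builder.Build()` and `HealthEndpointSketch.MapProbes(app)` after it. With the stand-in checks above, `/health/ready` would report healthy while `/health/active` would report unhealthy, which is the standby-node case.
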
## 2. Probes

### DatabaseHealthCheck
`src/ZB.MOM.WW.ScadaBridge.Host/Health/DatabaseHealthCheck.cs`:

- `:11` — injects `ScadaBridgeDbContext` directly (not a factory)
- `:33–43` — calls `_dbContext.Database.CanConnectAsync(cancellationToken)`:
  - Returns `true` → `HealthCheckResult.Healthy("Database connection is available.")` (`:34–35`)
  - Returns `false` → `HealthCheckResult.Unhealthy("Database connection failed.")` (`:36`)
  - Throws → `HealthCheckResult.Unhealthy("Database connection failed.", ex)` (`:40`)

`CanConnectAsync` tests connectivity only — it does not query any application table or verify
schema state. This is less strict than OtOpcUa's `Deployments` query but more transparent about
failure cause (connection vs. schema). No `Degraded` path.

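A minimal sketch of a probe with this shape, assuming EF Core is referenced. The generic class name `ConnectivityHealthCheckSketch<TContext>` is hypothetical and is not the ScadaBridge type; only the three result paths and their messages are taken from the description above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical generic stand-in for a CanConnectAsync-based probe.
public sealed class ConnectivityHealthCheckSketch<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly TContext _dbContext;

    // The context is injected directly rather than through a factory.
    public ConnectivityHealthCheckSketch(TContext dbContext) => _dbContext = dbContext;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            // true  -> Healthy, false -> Unhealthy; there is no Degraded path.
            return await _dbContext.Database.CanConnectAsync(cancellationToken)
                ? HealthCheckResult.Healthy("Database connection is available.")
                : HealthCheckResult.Unhealthy("Database connection failed.");
        }
        catch (Exception ex)
        {
            // A thrown exception is reported as Unhealthy with the exception attached.
            return HealthCheckResult.Unhealthy("Database connection failed.", ex);
        }
    }
}
```

Because the context is scoped, a check like this would normally be registered with `AddCheck<T>()` so that the health check service resolves it from a fresh scope on each probe.
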
### AkkaClusterHealthCheck
`src/ZB.MOM.WW.ScadaBridge.Host/Health/AkkaClusterHealthCheck.cs`:

- `:13` — injects `AkkaHostedService` (not `ActorSystem` directly)
- `:33–50` — gets `_akkaService.ActorSystem`, guards on null →
  `Degraded("ActorSystem not yet available.")`, then reads `cluster.SelfMember.Status`:
  - `Up` or `Joining` → `Healthy($"Akka cluster member status: {status}")` (`:43`)
  - `Leaving` or `Exiting` → `Degraded($"Akka cluster member status: {status}")` (`:45`)
  - anything else (Removed, Down, WeaklyUp…) → `Unhealthy($"Akka cluster member status: {status}")` (`:47`)

Three-way status policy: Healthy / Degraded / Unhealthy. This is more granular than OtOpcUa's
two-way policy (self-Up-or-not → Healthy/Degraded with no Unhealthy path).

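The three-way policy can be expressed as a single mapping from `MemberStatus` to a health result. The sketch below assumes the `Akka.Cluster` package; the `ClusterStatusPolicySketch` class is a hypothetical name used only for illustration, and the null-`ActorSystem` guard is left out because it happens before the status is read.

```csharp
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper isolating the status-to-result mapping.
public static class ClusterStatusPolicySketch
{
    public static HealthCheckResult Evaluate(MemberStatus status)
    {
        var description = $"Akka cluster member status: {status}";

        return status switch
        {
            // Joining counts as Healthy so a node that is still joining is not flagged.
            MemberStatus.Up or MemberStatus.Joining => HealthCheckResult.Healthy(description),

            // A node on its way out is Degraded rather than Unhealthy.
            MemberStatus.Leaving or MemberStatus.Exiting => HealthCheckResult.Degraded(description),

            // Removed, Down, WeaklyUp and anything else fall through to Unhealthy.
            _ => HealthCheckResult.Unhealthy(description),
        };
    }
}
```

Keeping the mapping in a pure function like this makes the policy unit-testable without starting an actor system.
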
### ActiveNodeHealthCheck
`src/ZB.MOM.WW.ScadaBridge.Host/Health/ActiveNodeHealthCheck.cs`:

- `:13` — injects `AkkaHostedService`
- `:29–44` — four-outcome logic:
  - `ActorSystem == null` → `Unhealthy("ActorSystem not yet available.")` (`:31`)
  - `SelfMember.Status != Up` → `Unhealthy($"Node not Up (status: ...)")` (`:37`)
  - `Up` AND `cluster.State.Leader == self.Address` → `Healthy("Active node (cluster leader).")` (`:41`)
  - `Up` but not leader → `Unhealthy("Standby node (not cluster leader).")` (`:43`)

No `Degraded` path — `ActiveNodeHealthCheck` uses `Unhealthy` for standby and non-Up states,
which causes `/health/active` to return HTTP 503 on a standby. This is the intended behavior for
Traefik active-node routing.

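The decision above can be sketched as a pure function plus a thin adapter that reads the cluster state. The `ActiveNodePolicySketch` class and its method names are hypothetical, and the wording of the "not Up" message is illustrative because the original message is elided in this document. It assumes the `Akka.Cluster` package.

```csharp
using Akka.Actor;
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper; not the ScadaBridge ActiveNodeHealthCheck itself.
public static class ActiveNodePolicySketch
{
    // Pure decision: only an Up node that is also the cluster leader is Healthy.
    public static HealthCheckResult Decide(bool systemAvailable, MemberStatus status, bool isLeader)
    {
        if (!systemAvailable)
            return HealthCheckResult.Unhealthy("ActorSystem not yet available.");

        if (status != MemberStatus.Up)
            return HealthCheckResult.Unhealthy($"Node not Up: {status}"); // illustrative wording

        return isLeader
            ? HealthCheckResult.Healthy("Active node (cluster leader).")
            : HealthCheckResult.Unhealthy("Standby node (not cluster leader).");
    }

    // Adapter: reads SelfMember and the current leader from the cluster extension.
    public static HealthCheckResult Evaluate(ActorSystem? system)
    {
        if (system is null)
            return Decide(systemAvailable: false, MemberStatus.Removed, isLeader: false);

        var cluster = Cluster.Get(system);
        var self = cluster.SelfMember;

        // Leader can be null before the cluster has converged; Equals handles that case.
        var isLeader = Equals(cluster.State.Leader, self.Address);

        return Decide(systemAvailable: true, self.Status, isLeader);
    }
}
```

Since every non-active outcome is `Unhealthy`, the default status-code mapping of the health check middleware yields HTTP 503 for all of them, which is what the routing layer keys on.
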
## 3. Tag / tier summary

ScadaBridge uses **name-based predicates** at the endpoint level rather than tags on the check
registration. Tags are absent from all three `.AddCheck<T>()` calls.

| Probe | `/health/ready` | `/health/active` | `/healthz` |
|---|---|---|---|
| `DatabaseHealthCheck` | ✅ | — (excluded by name) | ⛔ absent |
| `AkkaClusterHealthCheck` | ✅ | — (excluded by name) | ⛔ absent |
| `ActiveNodeHealthCheck` | — (excluded by name) | ✅ | ⛔ absent |

`/healthz` is absent — there is no bare process liveness endpoint. Kubernetes or Traefik liveness
probes have to fall back to `/health/ready` and tolerate its 503-until-ready behavior.

## 4. IActiveNodeGate and Inbound API gating

`src/ZB.MOM.WW.ScadaBridge.Host/Health/ActiveNodeGate.cs`:

- `:24` — `ActiveNodeGate` implements `IActiveNodeGate` (from the `InboundAPI` project)
- `:40` — `IsActiveNode` property: returns `true` only when `_akkaService.ActorSystem != null`
  AND `cluster.SelfMember.Status == MemberStatus.Up` AND `cluster.State.Leader == self.Address`.
  Defaults to `false` safely during startup (`:45–46`).
- `:131` in `Program.cs` — registered as a singleton. The `InboundApiEndpointFilter` consults this
  gate on every `/api/*` request and returns HTTP 503 on a standby node.

`ActiveNodeGate` applies the same logic as `ActiveNodeHealthCheck` — both check Up + leader.
They are separate types serving two different concerns (the health endpoint and the API gate) but
are not abstracted into a shared service; each reads cluster state independently.

`IActiveNodeGate` is the generalized seam that the `ZB.MOM.WW.Health` core package lifts into the
shared library.

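As a rough illustration of the gating pattern, the sketch below shows an endpoint filter that short-circuits with 503 when the gate reports a standby node. The single-property shape of `IActiveNodeGate` is an assumption based on the `IsActiveNode` property mentioned above, and `ActiveNodeFilterSketch` and `FixedGate` are hypothetical names; the real `InboundApiEndpointFilter` may do more than this.

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

// Assumed interface shape: a single read-only flag.
public interface IActiveNodeGate
{
    bool IsActiveNode { get; }
}

// Hypothetical fixed-value gate, useful for tests and local runs.
public sealed record FixedGate(bool IsActiveNode) : IActiveNodeGate;

// Hypothetical filter; rejects requests on a standby node before the handler runs.
public sealed class ActiveNodeFilterSketch : IEndpointFilter
{
    private readonly IActiveNodeGate _gate;

    public ActiveNodeFilterSketch(IActiveNodeGate gate) => _gate = gate;

    public async ValueTask<object?> InvokeAsync(
        EndpointFilterInvocationContext context,
        EndpointFilterDelegate next)
    {
        // The gate is consulted on every request, so a leadership change takes effect immediately.
        if (!_gate.IsActiveNode)
            return Results.StatusCode(StatusCodes.Status503ServiceUnavailable);

        return await next(context);
    }
}
```

A filter like this could be attached to a route group, for example `app.MapGroup("/api").AddEndpointFilter<ActiveNodeFilterSketch>()`, with the gate registered as a singleton so the filter and any health check share one source of truth.
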
## 5. HealthMonitoring domain pipeline (out of scope for shared library)

`src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/` is a separate project implementing a distributed
health aggregation pipeline. It is **not ASP.NET Core health checks** and is **not in scope** for
`ZB.MOM.WW.Health`.

Key types:
- `SiteHealthCollector` — thread-safe singleton accumulating per-site error counters, connection
  metrics, and tag-read metrics. Populated by actors in the DCL layer.
- `HealthReportSender` — a background service on site clusters that serializes `SiteHealthState`
  and ships it to the central cluster via Akka remoting at a configurable interval.
- `CentralHealthReportLoop` — central-only background service that generates a synthetic
  `SiteHealthReport` for the central cluster itself (siteId `"$central"`) and feeds it into the
  central aggregator.
- `CentralHealthAggregator` — a `BackgroundService` on the central cluster tracking the latest
  health report per site and detecting offline sites via heartbeat timeout. Exposes
  `GetAggregatedHealth()` to the Central UI's `/monitoring/health` endpoint.

This pipeline is domain-specific (multi-site ScadaBridge topology) and will remain per-project
regardless of shared-library adoption.

## 6. Notable design choices

- **Name-based predicates vs. tags** — ScadaBridge uses `check.Name == "active-node"` predicate
  logic at the endpoint level. OtOpcUa uses tag membership (`c.Tags.Contains("ready")`). The tag
  approach is more composable (a probe can participate in multiple tiers); the name approach is
  more explicit. The shared `MapZbHealth` should use tags by default.
- **`HealthChecks.UI.Client` response writer** — ScadaBridge uses the richer JSON response writer
  from the `AspNetCore.HealthChecks.UI.Client` package. OtOpcUa uses the default plain-text writer.
  The shared library's canonical response writer standardizes this.
- **`ActiveNodeHealthCheck` returns `Unhealthy` for standby** — a standby is not *unhealthy* in the
  system sense; it is a deliberate routing discriminator. Using `Unhealthy` here ensures `/health/active`
  returns HTTP 503 (Traefik sees the node as down for active traffic). The naming is semantically
  imprecise but operationally correct.
- **`IActiveNodeGate` + `ActiveNodeGate` duplication** — the gate and the health check implement the
  same logic independently. The shared library's `IActiveNodeGate` seam + `ActiveNodeHealthCheck`
  unify them: one backing service, two consumers.

---

## Adoption plan → `ZB.MOM.WW.Health`

**Replace with shared probes:**

- `AkkaClusterHealthCheck` → `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` using the **Default
  policy** (`Up`/`Joining`=Healthy, `Leaving`/`Exiting`=Degraded, else Unhealthy). ScadaBridge's
  existing three-way policy is the Default — no preset selection needed.
- `ActiveNodeHealthCheck` → `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with no role filter
  (role-less default: Up && leader = Healthy, else Unhealthy). The shared implementation also backs
  `IActiveNodeGate`, eliminating the duplicated leader-check logic between `ActiveNodeHealthCheck`
  and `ActiveNodeGate`.
- `DatabaseHealthCheck` → `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<ScadaBridgeDbContext>`
  using the default `CanConnectAsync` probe (ScadaBridge's existing behavior). No `ProbeQuery`
  delegate needed.
- Replace the name-based predicates with tag-based predicates by adding tags at registration time:
  `"database"` and `"akka-cluster"` → `["ready"]`; `"active-node"` → `["active"]`. Then call
  `app.MapZbHealth()` instead of the two manual `MapHealthChecks` calls.
- **Add `/healthz`** — `MapZbHealth()` maps the bare liveness tier automatically. ScadaBridge
  currently lacks this endpoint.
- Switch `ResponseWriter` from `UIResponseWriter.WriteHealthCheckUIResponse` to the shared
  canonical writer (a convergence item — `HealthChecks.UI.Client` style lifted to the default in
  `ZB.MOM.WW.Health`).

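The tag-based registration and the three tiers can be sketched with the stock ASP.NET Core APIs. This is only an approximation of what `MapZbHealth()` might expand to: the shared library's actual implementation and canonical response writer are not shown in this document, so the `TaggedTierSketch` class, its method names, and the use of the default writer are all assumptions. Delegate checks stand in for the shared probe classes.

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper approximating tag-based tier mapping.
public static class TaggedTierSketch
{
    // A check belongs to a tier when it carries that tier's tag.
    public static Func<HealthCheckRegistration, bool> InTier(string tag) =>
        check => check.Tags.Contains(tag);

    public static void AddProbes(IServiceCollection services) =>
        services.AddHealthChecks()
            // Stand-ins for the shared probes, tagged at registration time.
            .AddCheck("database", () => HealthCheckResult.Healthy(), new[] { "ready" })
            .AddCheck("akka-cluster", () => HealthCheckResult.Healthy(), new[] { "ready" })
            .AddCheck("active-node", () => HealthCheckResult.Healthy(), new[] { "active" });

    public static void MapTiers(IEndpointRouteBuilder endpoints)
    {
        endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions { Predicate = InTier("ready") });
        endpoints.MapHealthChecks("/health/active", new HealthCheckOptions { Predicate = InTier("active") });

        // Liveness: select no checks. An empty report is Healthy, so this endpoint
        // answers 200 whenever the process is able to serve HTTP at all.
        endpoints.MapHealthChecks("/healthz", new HealthCheckOptions { Predicate = _ => false });
    }
}
```

With tags, a probe can join a second tier by adding another tag at registration, with no change to the endpoint mapping.
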
**Keep bespoke:**

- `HealthMonitoring/` domain pipeline (`SiteHealthCollector`, `CentralHealthAggregator`, etc.) —
  entirely per-project, no shared-library equivalent.
- `IActiveNodeGate` moves from the `InboundAPI` project to `ZB.MOM.WW.Health` (core package) on
  adoption. `InboundApiEndpointFilter` references the shared interface; `AkkaActiveNodeGate`
  (from `ZB.MOM.WW.Health.Akka`) becomes the singleton implementation registered in DI. The
  interface definition is no longer owned by the `InboundAPI` project.
- The Central UI's `/monitoring/health` endpoint — powered by `CentralHealthAggregator`, not by
  ASP.NET health checks.
- The comment at `Program.cs:217–221` explains the readiness design decision (standby nodes are
  ready; leadership is a separate concern). This intent is preserved by the tag-based approach.

**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Health`
library build. The library build delivers the shared implementations; adoption lands in the
ScadaBridge repo as a separate commit once the nupkg is available.