scadaproj/components/health/current-state/scadabridge/CURRENT-STATE.md
Joseph Doherty 07d5907258 docs(health): resolve spec/contract/gaps consistency (review fixes)
Applies canonical resolutions for eight settled decisions:
- GAPS: remove three stale "Decisions still open" bullets (#1 IActiveNodeGate placement, #2 GrpcChannel type, #3 OtOpcUaCompat named constant)
- Shared contract: AkkaClusterHealthCheck, ActiveNodeHealthCheck constructors take IServiceProvider (lazy ActorSystem, Degraded-when-not-ready)
- Shared contract: AkkaActiveNodeGate takes IServiceProvider; reads SelfMember+leader directly, null-guarded; does not proxy ActiveNodeHealthCheck
- Shared contract: DatabaseHealthCheckOptions.Probe renamed to ProbeQuery; consumer matrix updated
- Shared contract: settled AddZbHealthChecks open question removed (spec §5 is per-project AddHealthChecks)
- SPEC §2.2: OtOpcUaCompat Leaving/Exiting cell updated from — to Degraded + footnote; §2.3 startup-safety note added
- README: status line corrected from "built and tested" to "scaffolded … implementation is follow-on (task #7)"; IActiveNodeGate "left per-project" bullet removed
- OtOpcUa current-state: AddZbHealthChecks → AddHealthChecks().AddCheck<...>(); IClusterRoleInfo note reframed as accepted trade-off
- ScadaBridge current-state: IActiveNodeGate bullet rewritten — interface moves to ZB.MOM.WW.Health on adoption, InboundApiEndpointFilter references shared interface
2026-06-01 06:33:42 -04:00

# Health — current state: ScadaBridge
Repo: `~/Desktop/ScadaBridge`. Stack: .NET 10, Akka.NET; solution `ZB.MOM.WW.ScadaBridge.slnx`.
Health code centers on `src/ZB.MOM.WW.ScadaBridge.Host/Health/` (ASP.NET probes) and the
separate `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/` project (domain aggregation pipeline).
All paths relative to repo root.
Verified 2026-06-01.
Two-tier pattern: `/health/ready` and `/health/active` — no `/healthz`. Three probes (database,
Akka cluster, active-node). ScadaBridge also has a bespoke distributed `HealthMonitoring/`
pipeline that is entirely separate from the ASP.NET health checks and is out of scope for the
shared library.
## 1. Endpoint wiring
`src/ZB.MOM.WW.ScadaBridge.Host/Program.cs`:
- `:114–117`: `builder.Services.AddHealthChecks()` followed by three `.AddCheck<T>()` calls
  (no tags, checked by name at the endpoint level):
  - `.AddCheck<DatabaseHealthCheck>("database")`
  - `.AddCheck<AkkaClusterHealthCheck>("akka-cluster")`
  - `.AddCheck<ActiveNodeHealthCheck>("active-node")`
- `:131`: `builder.Services.AddSingleton<IActiveNodeGate, ActiveNodeGate>()` registers the
  production `IActiveNodeGate` implementation (Inbound API gating, not a health-check probe).
- `:222–226`: `/health/ready` mapped with `Predicate = check => check.Name != "active-node"` and
  `ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse` (from `HealthChecks.UI.Client`).
  Excludes the active-node check so a healthy standby node reports ready.
- `:229–233`: `/health/active` mapped with `Predicate = check => check.Name == "active-node"` and
  `ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse`. Active-node check only.
No `/healthz` endpoint. Both mapped endpoints use `HealthChecks.UI.Client` JSON (not the default
plain-text writer), which is a divergence from OtOpcUa.
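
The wiring described above can be sketched as a minimal `Program.cs`. This is an illustration under stated assumptions, not a copy of the repo file: the three checks are stubbed with delegates so the sketch compiles on its own, whereas the real project registers `DatabaseHealthCheck`, `AkkaClusterHealthCheck`, and `ActiveNodeHealthCheck` by type. It assumes the `AspNetCore.HealthChecks.UI.Client` package is referenced.

```csharp
using HealthChecks.UI.Client;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Stub delegates stand in for the three real check classes (assumption for
// this sketch). No tags are attached; the endpoints filter by name instead.
builder.Services.AddHealthChecks()
    .AddCheck("database", () => HealthCheckResult.Healthy("stub"))
    .AddCheck("akka-cluster", () => HealthCheckResult.Healthy("stub"))
    .AddCheck("active-node", () => HealthCheckResult.Unhealthy("stub: standby"));

var app = builder.Build();

// Readiness: every check except the leadership check, so a standby is ready.
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Name != "active-node",
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse,
});

// Active: the leadership check only.
app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = check => check.Name == "active-node",
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse,
});

app.Run();
```

With these stubs, `/health/ready` would report healthy while `/health/active` would report unhealthy, which is the shape a standby node is expected to show.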
## 2. Probes
### DatabaseHealthCheck
`src/ZB.MOM.WW.ScadaBridge.Host/Health/DatabaseHealthCheck.cs`:
- `:11`: injects `ScadaBridgeDbContext` directly (not a factory)
- `:33–43`: calls `_dbContext.Database.CanConnectAsync(cancellationToken)`:
  - Returns `true` → `HealthCheckResult.Healthy("Database connection is available.")` (`:34–35`)
  - Returns `false` → `HealthCheckResult.Unhealthy("Database connection failed.")` (`:36`)
  - Throws → `HealthCheckResult.Unhealthy("Database connection failed.", ex)` (`:40`)
`CanConnectAsync` tests the connection layer only — it does not run any query or verify schema
state. This is less strict than OtOpcUa's `Deployments` query but more transparent about failure
cause (connection vs. schema). No `Degraded` path.
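
The three outcomes listed above map onto a small `IHealthCheck` implementation. The following is a sketch only: the class name `ConnectivityHealthCheck<TContext>` and its generic shape are assumptions made so the example does not depend on `ScadaBridgeDbContext`; the result descriptions are taken from the list above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-in for the repo's DatabaseHealthCheck, which injects
// ScadaBridgeDbContext directly rather than a generic TContext.
public sealed class ConnectivityHealthCheck<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly TContext _dbContext;

    public ConnectivityHealthCheck(TContext dbContext) => _dbContext = dbContext;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // Connection-level probe: true when the database can be reached.
            var canConnect = await _dbContext.Database.CanConnectAsync(cancellationToken);
            return canConnect
                ? HealthCheckResult.Healthy("Database connection is available.")
                : HealthCheckResult.Unhealthy("Database connection failed.");
        }
        catch (Exception ex)
        {
            // Same description as the false branch, with the exception attached.
            return HealthCheckResult.Unhealthy("Database connection failed.", ex);
        }
    }
}
```

There is deliberately no branch returning `Degraded`, matching the behavior described above.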
### AkkaClusterHealthCheck
`src/ZB.MOM.WW.ScadaBridge.Host/Health/AkkaClusterHealthCheck.cs`:
- `:13`: injects `AkkaHostedService` (not `ActorSystem` directly)
- `:33–50`: gets `_akkaService.ActorSystem`, guards on null →
  `Degraded("ActorSystem not yet available.")`, then reads `cluster.SelfMember.Status`:
  - `Up` or `Joining` → `Healthy($"Akka cluster member status: {status}")` (`:43`)
  - `Leaving` or `Exiting` → `Degraded($"Akka cluster member status: {status}")` (`:45`)
  - anything else (Removed, Down, WeaklyUp…) → `Unhealthy($"Akka cluster member status: {status}")` (`:47`)
Three-way status policy: Healthy / Degraded / Unhealthy. This is more granular than OtOpcUa's
two-way policy (self-Up-or-not → Healthy/Degraded with no Unhealthy path).
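
The three-way policy can be expressed as a pure mapping from `MemberStatus` to `HealthStatus`. The sketch below is illustrative: `ClusterStatusPolicy` and its method names are hypothetical and do not appear in the repo, and the null-`ActorSystem` guard is left out because it happens before any status is read.

```csharp
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper; the real check performs this mapping inline.
public static class ClusterStatusPolicy
{
    // Up/Joining are healthy, Leaving/Exiting are degraded, and every other
    // status (Removed, Down, WeaklyUp, ...) is unhealthy.
    public static HealthStatus Map(MemberStatus status) => status switch
    {
        MemberStatus.Up or MemberStatus.Joining => HealthStatus.Healthy,
        MemberStatus.Leaving or MemberStatus.Exiting => HealthStatus.Degraded,
        _ => HealthStatus.Unhealthy,
    };

    // Wraps the mapped status with the description format listed above.
    public static HealthCheckResult ToResult(MemberStatus status) =>
        new HealthCheckResult(Map(status), $"Akka cluster member status: {status}");
}
```

Keeping the mapping separate from the cluster lookup makes the policy easy to unit test without starting an `ActorSystem`.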
### ActiveNodeHealthCheck
`src/ZB.MOM.WW.ScadaBridge.Host/Health/ActiveNodeHealthCheck.cs`:
- `:13`: injects `AkkaHostedService`
- `:29–44`: four paths, only one of which is `Healthy`:
  - `ActorSystem == null` → `Unhealthy("ActorSystem not yet available.")` (`:31`)
  - `SelfMember.Status != Up` → `Unhealthy($"Node not Up (status: ...)")` (`:37`)
  - `Up` AND `cluster.State.Leader == self.Address` → `Healthy("Active node (cluster leader).")` (`:41`)
  - `Up` but not leader → `Unhealthy("Standby node (not cluster leader).")` (`:43`)
No `Degraded` path — `ActiveNodeHealthCheck` uses `Unhealthy` for standby and non-Up states,
which causes `/health/active` to return HTTP 503 on a standby. This is the intended behavior for
Traefik active-node routing.
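
The decision order above can be sketched as a pure function. This is an illustration, not the repo code: `ActiveNodeDecision` is a hypothetical name, the inputs are passed in as plain values so the logic can be exercised without a running cluster, and the description for the not-Up branch is shortened because the source elides part of it.

```csharp
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper. In the real check the inputs would come from Akka:
//   actorSystemAvailable = _akkaService.ActorSystem != null
//   selfStatus           = cluster.SelfMember.Status
//   isLeader             = cluster.State.Leader == cluster.SelfMember.Address
public static class ActiveNodeDecision
{
    public static HealthCheckResult Evaluate(
        bool actorSystemAvailable, MemberStatus selfStatus, bool isLeader)
    {
        if (!actorSystemAvailable)
            return HealthCheckResult.Unhealthy("ActorSystem not yet available.");

        if (selfStatus != MemberStatus.Up)
            return HealthCheckResult.Unhealthy("Node not Up."); // wording shortened for this sketch

        return isLeader
            ? HealthCheckResult.Healthy("Active node (cluster leader).")
            : HealthCheckResult.Unhealthy("Standby node (not cluster leader).");
    }
}
```

Because ASP.NET Core's default status-code mapping turns `Unhealthy` into HTTP 503, every branch except the leader branch makes `/health/active` answer 503.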
## 3. Tag / tier summary
ScadaBridge uses **name-based predicates** at the endpoint level rather than tags on the check
registration. Tags are absent from all three `.AddCheck<T>()` calls.
| Probe | `/health/ready` | `/health/active` | `/healthz` |
|---|---|---|---|
| `DatabaseHealthCheck` | ✅ | — (excluded by name) | ⛔ absent |
| `AkkaClusterHealthCheck` | ✅ | — (excluded by name) | ⛔ absent |
| `ActiveNodeHealthCheck` | — (excluded by name) | ✅ | ⛔ absent |
`/healthz` is absent, so there is no bare process liveness endpoint. A Kubernetes or Traefik
liveness probe has to point at `/health/ready` and tolerate its 503-until-ready behavior.
## 4. IActiveNodeGate and Inbound API gating
`src/ZB.MOM.WW.ScadaBridge.Host/Health/ActiveNodeGate.cs`:
- `:24`: `ActiveNodeGate` implements `IActiveNodeGate` (from the `InboundAPI` project)
- `:40`: the `IsActiveNode` property returns `true` only when `_akkaService.ActorSystem != null`
  AND `cluster.SelfMember.Status == MemberStatus.Up` AND `cluster.State.Leader == self.Address`.
  Defaults to `false` safely during startup (`:45–46`).
- `:131` in `Program.cs`: registered as a singleton. The `InboundApiEndpointFilter` consults this
  gate on every `/api/*` request and returns HTTP 503 on a standby node.
`ActiveNodeGate` duplicates the logic of `ActiveNodeHealthCheck`: both check Up + leader.
They are separate types serving two different concerns (the health endpoint and the API gate) and
are not abstracted into a shared service; each reads cluster state independently.
`IActiveNodeGate` is the seam that gets generalized: on adoption the interface is lifted into the
`ZB.MOM.WW.Health` core package.
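
The gating pattern can be sketched with a minimal-API endpoint filter. Only the `IsActiveNode` property and the 503-on-standby behavior come from the description above; the filter name `ActiveNodeEndpointFilter` and its body are hypothetical, and the real `InboundApiEndpointFilter` may be built differently.

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

// Shape of the gate as described above; the real interface may expose more.
public interface IActiveNodeGate
{
    bool IsActiveNode { get; }
}

// Hypothetical filter: short-circuits with 503 unless this node is active.
public sealed class ActiveNodeEndpointFilter : IEndpointFilter
{
    private readonly IActiveNodeGate _gate;

    public ActiveNodeEndpointFilter(IActiveNodeGate gate) => _gate = gate;

    public async ValueTask<object?> InvokeAsync(
        EndpointFilterInvocationContext context, EndpointFilterDelegate next)
    {
        if (!_gate.IsActiveNode)
            return Results.StatusCode(StatusCodes.Status503ServiceUnavailable);

        // Active node: continue to the next filter or the endpoint handler.
        return await next(context);
    }
}
```

A filter like this could be attached to a route group, for example `app.MapGroup("/api").AddEndpointFilter<ActiveNodeEndpointFilter>()`, with the gate resolved from DI as a singleton.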
## 5. HealthMonitoring domain pipeline (out of scope for shared library)
`src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/` is a separate project implementing a distributed
health aggregation pipeline. It is **not ASP.NET Core health checks** and is **not in scope** for
`ZB.MOM.WW.Health`.
Key types:
- `SiteHealthCollector` — thread-safe singleton accumulating per-site error counters, connection
metrics, and tag-read metrics. Populated by actors in the DCL layer.
- `HealthReportSender` — a background service on site clusters that serializes `SiteHealthState`
and ships it to the central cluster via Akka remoting at a configurable interval.
- `CentralHealthReportLoop` — central-only background service that generates a synthetic
`SiteHealthReport` for the central cluster itself (siteId `"$central"`) and feeds it into the
central aggregator.
- `CentralHealthAggregator` — a `BackgroundService` on the central cluster tracking the latest
health report per site and detecting offline sites via heartbeat timeout. Exposes
`GetAggregatedHealth()` to the Central UI's `/monitoring/health` endpoint.
This pipeline is domain-specific (multi-site ScadaBridge topology) and will remain per-project
regardless of shared-library adoption.
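
The heartbeat-timeout idea behind offline-site detection can be sketched in a few lines. This is a simplified illustration only: `HeartbeatTracker` and its members are hypothetical names, and the real `CentralHealthAggregator` tracks full health reports rather than bare timestamps.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in that keeps only the arrival time of the latest report.
public sealed class HeartbeatTracker
{
    private readonly ConcurrentDictionary<string, DateTimeOffset> _lastSeen = new();
    private readonly TimeSpan _timeout;

    public HeartbeatTracker(TimeSpan timeout) => _timeout = timeout;

    // Record when the latest report for a site arrived.
    public void Report(string siteId, DateTimeOffset receivedAt) =>
        _lastSeen[siteId] = receivedAt;

    // A site counts as offline when its latest report is older than the timeout.
    public IReadOnlyList<string> OfflineSites(DateTimeOffset now) =>
        _lastSeen.Where(kv => now - kv.Value > _timeout)
                 .Select(kv => kv.Key)
                 .OrderBy(id => id, StringComparer.Ordinal)
                 .ToList();
}
```

Passing `now` in explicitly, rather than reading the clock inside the method, keeps the timeout logic deterministic under test.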
## 6. Notable design choices
- **Name-based predicates vs. tags** — ScadaBridge uses `check.Name == "active-node"` predicate
logic at the endpoint level. OtOpcUa uses tag membership (`c.Tags.Contains("ready")`). The tag
approach is more composable (a probe can participate in multiple tiers); the name approach is
more explicit. The shared `MapZbHealth` should use tags by default.
- **`HealthChecks.UI.Client` response writer** — ScadaBridge uses the richer JSON response writer
from the `AspNetCore.HealthChecks.UI.Client` package. OtOpcUa uses the default plain-text writer.
The shared library's canonical response writer standardizes this.
- **`ActiveNodeHealthCheck` returns `Unhealthy` for standby** — a standby is not *unhealthy* in the
system sense; it is a deliberate routing discriminator. Using `Unhealthy` here ensures `/health/active`
returns HTTP 503 (Traefik sees the node as down for active traffic). The naming is semantically
imprecise but operationally correct.
- **`IActiveNodeGate` + `ActiveNodeGate` duplication** — the gate and the health check implement the
same logic independently. The shared library's `IActiveNodeGate` seam + `ActiveNodeHealthCheck`
unify them: one backing service, two consumers.
---
## Adoption plan → `ZB.MOM.WW.Health`
**Replace with shared probes:**
- `AkkaClusterHealthCheck` → `ZB.MOM.WW.Health.Akka.AkkaClusterHealthCheck` using the **Default
policy** (`Up`/`Joining`=Healthy, `Leaving`/`Exiting`=Degraded, else Unhealthy). ScadaBridge's
existing three-way policy is the Default — no preset selection needed.
- `ActiveNodeHealthCheck` → `ZB.MOM.WW.Health.Akka.ActiveNodeHealthCheck` with no role filter
(role-less default: Up && leader = Healthy, else Unhealthy). The shared implementation also backs
`IActiveNodeGate`, eliminating the duplicated leader-check logic between `ActiveNodeHealthCheck`
and `ActiveNodeGate`.
- `DatabaseHealthCheck` → `ZB.MOM.WW.Health.EntityFrameworkCore.DatabaseHealthCheck<ScadaBridgeDbContext>`
using the default `CanConnectAsync` probe (ScadaBridge's existing behavior). No `ProbeQuery`
delegate needed.
- Replace the name-based predicates with tag-based predicates by adding tags at registration time:
`"database"` and `"akka-cluster"` → `["ready"]`; `"active-node"` → `["active"]`. Then call
`app.MapZbHealth()` instead of the two manual `MapHealthChecks` calls.
- **Add `/healthz`** — `MapZbHealth()` maps the bare liveness tier automatically. ScadaBridge
currently lacks this endpoint.
- Switch `ResponseWriter` from `UIResponseWriter.WriteHealthCheckUIResponse` to the shared
canonical writer (a convergence item — `HealthChecks.UI.Client` style lifted to the default in
`ZB.MOM.WW.Health`).
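
The tag-based tiers described in this list can be approximated today with the stock ASP.NET Core APIs. The sketch below is not the `MapZbHealth` implementation, whose internals are not shown here: it only illustrates tags at registration, tag predicates at the endpoints, and a bare `/healthz` tier that runs no checks. The checks are stubbed with delegates, and the assumption that the shared helper behaves like these three `MapHealthChecks` calls is mine.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Tags replace the name-based predicates; stub delegates stand in for the probes.
builder.Services.AddHealthChecks()
    .AddCheck("database", () => HealthCheckResult.Healthy("stub"), new[] { "ready" })
    .AddCheck("akka-cluster", () => HealthCheckResult.Healthy("stub"), new[] { "ready" })
    .AddCheck("active-node", () => HealthCheckResult.Healthy("stub"), new[] { "active" });

var app = builder.Build();

// Readiness tier: every check tagged "ready".
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
});

// Active tier: every check tagged "active".
app.MapHealthChecks("/health/active", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("active"),
});

// Liveness tier: no checks run, so the endpoint answers 200 while the process is up.
app.MapHealthChecks("/healthz", new HealthCheckOptions
{
    Predicate = _ => false,
});

app.Run();
```

With tags, a new probe joins a tier by declaring a tag at registration, and the endpoint mappings stay unchanged.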
**Keep bespoke:**
- `HealthMonitoring/` domain pipeline (`SiteHealthCollector`, `CentralHealthAggregator`, etc.) —
entirely per-project, no shared-library equivalent.
- `IActiveNodeGate` moves from the `InboundAPI` project to `ZB.MOM.WW.Health` (core package) on
adoption. `InboundApiEndpointFilter` references the shared interface; `AkkaActiveNodeGate`
(from `ZB.MOM.WW.Health.Akka`) becomes the singleton implementation registered in DI. The
interface definition is no longer owned by the `InboundAPI` project.
- The Central UI's `/monitoring/health` endpoint — powered by `CentralHealthAggregator`, not by
ASP.NET health checks.
- The comment at `Program.cs:217–221` explains the readiness design decision (standby nodes are
  ready; leadership is a separate concern). This intent is preserved by the tag-based approach.
**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Health`
library build. The library build delivers the shared implementations; adoption lands in the
ScadaBridge repo as a separate commit once the nupkg is available.