# Health: normalized target spec

Status: **Draft**. The single design the sister projects converge on. Derived from the three code-verified current-state docs (`../current-state/`). The goal is a *path to shared code* (`../shared-contract/ZB.MOM.WW.Health.md`), so each normalized section maps to a shared library seam.

## 0. Scope

**Normalized here:**

- the three-tier endpoint convention (`/health/ready`, `/health/active`, `/healthz`) with canonical tags `ready` / `active` / `live` and their semantics;
- the canonical JSON response shape;
- the `IActiveNodeGate` request-gating seam;
- a configurable `AkkaClusterHealthCheck` with two named policy presets that reconcile the diverging Akka logic in OtOpcUa and ScadaBridge;
- a role-filtered `ActiveNodeHealthCheck` that unifies OtOpcUa's `AdminRoleLeaderHealthCheck` and ScadaBridge's `ActiveNodeHealthCheck`;
- a generic `DatabaseHealthCheck` that covers both apps' EF Core probe patterns;
- a `GrpcDependencyHealthCheck` for downstream gRPC reachability.

**Explicitly NOT normalized** (domain-specific, keep per project):

- which probes each app registers and how it wires them to tags;
- orchestrator / Traefik routing rules and routing priorities;
- ScadaBridge's `HealthMonitoring/` domain-aggregation pipeline. This is a distributed, actor-based domain-health telemetry system (background services + Akka actors that aggregate site-cluster signals into a central health picture) and is **not** an ASP.NET health probe; it is an independent concern that happens to share the word "health".

## 1. Tier convention

Three tiers, always served in this order, each filtered to a named tag:

| Tier | Endpoint | Tag | Semantics | HTTP if Healthy | HTTP if Degraded | HTTP if Unhealthy |
|---|---|---|---|---|---|---|
| Ready | `/health/ready` | `ready` | Can this node serve its dependencies? Fails if a DB, gRPC dependency, or cluster membership check is unhealthy. Orchestrators use this to gate traffic. | 200 | 200 | 503 |
| Active | `/health/active` | `active` | Is this the leader / active node? Fails (503) on a standby or role-member-but-not-leader node. Used to route write traffic or admin requests to exactly one node. | 200 | 200 | 503 |
| Live | `/healthz` | `live` | Bare process liveness: is the process alive and not deadlocked? **No probes registered to this tag** (predicate `_ => false`). Always 200 as long as the process can handle HTTP. | 200 | 200 | 200 |

Notes:

- The `live` tier intentionally carries no probes. Registering a probe to `live` is an error: a liveness failure that kills the pod should be reserved for total process hangs, not probe failures.
- `Degraded` maps to HTTP 200 (not 503) for the `ready` and `active` tiers. Orchestrators use 503 to remove a node from load-balancing; Degraded means "still up but degraded", so the node is removed only on hard failure.
- The tag names (`ready`, `active`, `live`) are declared as constants in `ZbHealthTags` and used consistently across all three apps. Per-project probe registrations must filter by these tags.

## 2. Probe catalog

### 2.1 Database probe: `DatabaseHealthCheck`

Wraps an EF Core `DbContext` to verify database reachability. The default behavior calls `context.Database.CanConnectAsync()`, which matches ScadaBridge's pattern. An optional delegate (`Func`) overrides the default for more specific validation (matching OtOpcUa's "query `Deployments`" pattern).

Registered to the `ready` tag.

### 2.2 Akka cluster probe: `AkkaClusterHealthCheck`

Checks the local node's cluster membership status via Akka.Cluster. The status-to-health mapping is **configurable** through `AkkaClusterStatusPolicy`.

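
A configurable status-to-health mapping of this kind can be pictured as a lookup table with a fallback. The following is only a minimal sketch under stated assumptions: `ClusterStatus`, `HealthLevel`, and `StatusPolicySketch` are hypothetical stand-ins (so the block compiles without Akka.NET or ASP.NET Core packages) and are not the shared library's actual API. The values in `Default` follow the ScadaBridge-derived preset defined in this section.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for Akka's member status values named in this section.
public enum ClusterStatus { Joining, WeaklyUp, Up, Leaving, Exiting, Down, Removed, Unknown }

// Hypothetical stand-in for the three health levels.
public enum HealthLevel { Unhealthy, Degraded, Healthy }

// Sketch of a policy object: explicit mappings plus a fallback for everything else.
public sealed class StatusPolicySketch
{
    private readonly IReadOnlyDictionary<ClusterStatus, HealthLevel> _map;
    private readonly HealthLevel _fallback;

    public StatusPolicySketch(
        IReadOnlyDictionary<ClusterStatus, HealthLevel> map,
        HealthLevel fallback)
    {
        _map = map;
        _fallback = fallback;
    }

    // Returns the mapped level, or the fallback when the status has no explicit entry.
    public HealthLevel Evaluate(ClusterStatus status) =>
        _map.TryGetValue(status, out var level) ? level : _fallback;

    // Mirrors the Default preset: Up/Joining are Healthy,
    // Leaving/Exiting are Degraded, every other status is Unhealthy.
    public static StatusPolicySketch Default { get; } = new StatusPolicySketch(
        new Dictionary<ClusterStatus, HealthLevel>
        {
            [ClusterStatus.Up] = HealthLevel.Healthy,
            [ClusterStatus.Joining] = HealthLevel.Healthy,
            [ClusterStatus.Leaving] = HealthLevel.Degraded,
            [ClusterStatus.Exiting] = HealthLevel.Degraded,
        },
        fallback: HealthLevel.Unhealthy);
}
```

Keeping the mapping as data rather than a `switch` is one way to let a project supply its own preset without subclassing the check; whether the shared library takes that route is not specified here.
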

**Two named policy presets reconcile the existing divergence:**

| Preset | Origin | `Up` / `Joining` | `Leaving` / `Exiting` | Other (`WeaklyUp`, `Down`, `Removed`, `Unknown`) |
|---|---|---|---|---|
| `AkkaClusterStatusPolicy.Default` | ScadaBridge `AkkaClusterHealthCheck.cs` | Healthy | Degraded | Unhealthy |
| `AkkaClusterStatusPolicy.OtOpcUaCompat` | OtOpcUa `AkkaClusterHealthCheck.cs` | Healthy (if self is `Up` among reachable members) | Degraded[^1] | Degraded |

[^1]: In the `OtOpcUaCompat` member-scan approach, `Leaving`/`Exiting` statuses also map to Degraded because a member with those statuses will not appear with `Status == Up` in the reachable member set: the scan finds self without `Up`, so the result is Degraded.

The `Default` preset is the convergence target. `OtOpcUaCompat` is provided for backward compatibility during OtOpcUa's migration; it maps any non-`Up`-among-members state to Degraded rather than Unhealthy.

Registered to the `ready` tag.

> **Note on error/exception cases:** under both presets, if the `ActorSystem` is not yet ready or cluster
> state is inaccessible (e.g. during startup), the check returns Degraded (startup-safety rule).
> The status cells in the table above describe the normal-operation path only; the Degraded cells in the
> `OtOpcUaCompat` row are states that collapse into Degraded via the member-scan result,
> not explicit policy matches.

### 2.3 Active / leader probe: `ActiveNodeHealthCheck`

Checks whether this node is the designated leader (active node). Accepts an optional Akka cluster role name that scopes the check to nodes carrying that role.

**Two behaviors unify the existing divergence:**

| Mode | Role param | Origin | Healthy | Degraded | Unhealthy |
|---|---|---|---|---|---|
| Role-less | `null` | ScadaBridge `ActiveNodeHealthCheck` | Node is `Up` **and** cluster leader | n/a | Otherwise |
| Role-filtered | e.g. `"admin"` | OtOpcUa `AdminRoleLeaderHealthCheck` | Node does **not** carry the role (not a participant, so the probe ignores it) **or** node carries the role and is the role-singleton leader | Carries the role but is **not** the role-singleton leader (role member, not leader) | n/a |

The role-filtered variant maps "not a member of the role" to Healthy (transparent: the probe is irrelevant for this node). This is the correct behavior for heterogeneous clusters where not every node carries every role.

Registered to the `active` tag.

### 2.4 gRPC dependency probe: `GrpcDependencyHealthCheck`

Checks that a downstream gRPC channel is reachable by invoking a caller-supplied probe delegate (`Func>`). The default probe calls `GrpcChannel.ConnectAsync`.

Used by:

- OtOpcUa: checks the MxAccessGateway gRPC channel.
- MxGateway: checks the x86 worker gRPC channel.

Registered to the `ready` tag.

## 3. Response-writer contract

All health endpoints share one canonical JSON serializer. The shape is lifted from ScadaBridge's `HealthChecks.UI.Client` style and becomes the library default (replacing per-project divergence).

**Content-type:** `application/json`

**Shape:**

```json
{
  "status": "Healthy",
  "totalDurationMs": 12,
  "entries": {
    "database": {
      "status": "Healthy",
      "description": "SQL Server reachable",
      "durationMs": 12
    },
    "akka-cluster": {
      "status": "Healthy",
      "description": "Member status: Up",
      "durationMs": 0.1
    }
  }
}
```

**Field rules:**

| Field | Type | Notes |
|---|---|---|
| `status` | string | `"Healthy"` \| `"Degraded"` \| `"Unhealthy"`: the aggregate across all filtered checks |
| `totalDurationMs` | long | Total wall-clock time for all probes in this tier, milliseconds |
| `entries` | object | Keyed by check registration name |
| `entries.<name>.status` | string | Per-check status |
| `entries.<name>.description` | string? | Human-readable detail (may be null) |
| `entries.<name>.durationMs` | number | Per-check elapsed time, milliseconds |

The writer is exposed as a static `Task WriteJsonAsync(HttpContext, HealthReport)` so consumers can plug it into `MapHealthChecks` options and also call it from custom endpoints.

## 4. Active-node gating seam: `IActiveNodeGate`

`IActiveNodeGate` is a single-property interface (`bool IsActiveNode { get; }`) that expresses whether the current node should accept write / active-role requests.

The default implementation, `AkkaActiveNodeGate`, reads cluster state **directly**: `IsActiveNode` returns `true` if and only if the `ActorSystem` is available, `SelfMember.Status == Up`, and the node is the cluster leader. It is null-guarded and returns `false` when the `ActorSystem` is not yet ready (safe default during startup). It does **not** resolve `ActiveNodeHealthCheck` from DI.

A `RequireActiveNode()` extension on `IEndpointConventionBuilder` attaches a policy that short-circuits with `503 Service Unavailable` on standby nodes.

This seam is generalized from ScadaBridge's `ActiveNodeGate.cs`. The interface lives in the core `ZB.MOM.WW.Health` package (not the Akka satellite) so MxGateway can implement it without an Akka dependency if needed.

## 5. Endpoint registration

`app.MapZbHealth()` maps all three tiers in one call:

```csharp
app.MapZbHealth();                       // all three tiers, defaults

app.MapZbHealth(o =>
{
    o.ReadyPath  = "/health/ready";      // override paths if needed
    o.ActivePath = "/health/active";
    o.LivePath   = "/healthz";
    o.ResponseWriter = ZbHealthWriter.WriteJsonAsync;
});
```

The library does **not** call `services.AddHealthChecks()`; that is the app's responsibility, as the probe set is per-project. `MapZbHealth` only maps the three endpoints with the correct tag predicates and response writer.

## 6. Migration notes

| Project | Current state | Gap | What normalizes |
|---|---|---|---|
| **OtOpcUa** | All three tiers present (`/health/ready`, `/health/active`, `/healthz`); `DatabaseHealthCheck`, `AkkaClusterHealthCheck`, `AdminRoleLeaderHealthCheck` inline. | Inline probes diverge from the shared policy model; no `IActiveNodeGate`. | Replace inline `AkkaClusterHealthCheck` with shared + `OtOpcUaCompat` preset; replace `AdminRoleLeaderHealthCheck` with shared `ActiveNodeHealthCheck(role: "admin")`; replace inline `DatabaseHealthCheck` with shared generic; call `app.MapZbHealth()`. |
| **ScadaBridge** | `/health/ready` + `/health/active` present; no `/healthz`; `DatabaseHealthCheck`, `AkkaClusterHealthCheck`, `ActiveNodeHealthCheck`, `ActiveNodeGate` inline. | Missing `/healthz` live tier; inline implementations. | Add `/healthz` via `MapZbHealth()`; replace inline probes with shared equivalents (Default policy); replace inline `ActiveNodeGate` with `AkkaActiveNodeGate`. |
| **MxGateway** | Only `/health/live` (custom `GatewayHealthReply`); `AddHealthChecks()` called but zero probes registered. | Missing `ready` and `active` tiers; no probes; not using standard health middleware. | Replace custom endpoint with `app.MapZbHealth()`; register `GrpcDependencyHealthCheck` for the x86 worker channel on the `ready` tag. |

## 7. Acceptance (what "converged" means)

A project is converged when:

- (a) it calls `app.MapZbHealth()` and exposes all three canonical endpoints;
- (b) its Akka probes (if applicable) use the `AkkaClusterHealthCheck` + `ActiveNodeHealthCheck` from `ZB.MOM.WW.Health.Akka` with the Default policy;
- (c) its DB probe uses `DatabaseHealthCheck` from `ZB.MOM.WW.Health.EntityFrameworkCore`;
- (d) its gRPC-dependency probe (if applicable) uses `GrpcDependencyHealthCheck`;
- (e) its `IActiveNodeGate` implementation is `AkkaActiveNodeGate` (or a project-specific implementation of the shared interface);
- (f) all health endpoints return the canonical JSON shape defined in §3.
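
Criterion (f) can be spot-checked against a small model of the §3 shape. The sketch below is illustrative only: `EntrySketch` and `HealthJsonSketch` are hypothetical names, not the library's `ZbHealthWriter`, and it uses only `System.Text.Json` so it runs without ASP.NET Core. It assumes the aggregate `status` is the worst per-check status, which is how ASP.NET Core's `HealthReport` computes it; an empty entry set (as on the `live` tier) therefore aggregates to Healthy.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical per-check entry: status, optional description, elapsed milliseconds.
public sealed record EntrySketch(string Status, string? Description, double DurationMs);

public static class HealthJsonSketch
{
    // camelCase property names to match the field names in §3.
    // Dictionary keys (check registration names) are left unchanged.
    private static readonly JsonSerializerOptions Options = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase
    };

    // Worst status wins: Unhealthy beats Degraded, which beats Healthy.
    public static string Aggregate(IEnumerable<EntrySketch> entries)
    {
        var statuses = entries.Select(e => e.Status).ToList();
        if (statuses.Contains("Unhealthy")) return "Unhealthy";
        if (statuses.Contains("Degraded")) return "Degraded";
        return "Healthy";
    }

    // Produces { "status", "totalDurationMs", "entries": { name: { ... } } }.
    public static string Serialize(
        IReadOnlyDictionary<string, EntrySketch> entries,
        long totalDurationMs)
    {
        var payload = new
        {
            Status = Aggregate(entries.Values),
            TotalDurationMs = totalDurationMs,
            Entries = entries
        };
        return JsonSerializer.Serialize(payload, Options);
    }
}
```

A contract test in each project could parse a real endpoint response with `JsonDocument` and assert the same property names and types, which keeps the check independent of any one writer implementation.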