Health — normalized target spec
Status: Draft. This is the single design the sister projects converge on, derived from the
three code-verified current-state docs (../current-state/). The goal is a path to shared code
(../shared-contract/ZB.MOM.WW.Health.md), so each normalized section maps to a shared library seam.
0. Scope
Normalized here: the three-tier endpoint convention (/health/ready, /health/active,
/healthz) with canonical tags ready / active / live and their semantics; the canonical
JSON response shape; the IActiveNodeGate request-gating seam; a configurable
AkkaClusterHealthCheck with two named policy presets that reconcile the diverging Akka logic in
OtOpcUa and ScadaBridge; a role-filtered ActiveNodeHealthCheck that unifies OtOpcUa's
AdminRoleLeaderHealthCheck and ScadaBridge's ActiveNodeHealthCheck; a generic
DatabaseHealthCheck<TContext> that covers both apps' EF Core probe patterns; a
GrpcDependencyHealthCheck for downstream gRPC reachability.
Explicitly NOT normalized (domain-specific — keep per project): which probes each app
registers and how it wires them to tags; orchestrator / Traefik routing rules and routing priorities;
ScadaBridge's HealthMonitoring/ domain-aggregation pipeline — this is a distributed, actor-based
domain-health telemetry system (background services + Akka actors that aggregate site-cluster signals
into a central health picture) and is not an ASP.NET health-probe; it is an independent concern
that happens to share the word "health".
1. Tier convention
Three tiers, always served in this order, each filtered to a named tag:
| Tier | Endpoint | Tag | Semantics | Healthy→ | Degraded→ | Unhealthy→ |
|---|---|---|---|---|---|---|
| Ready | /health/ready | ready | Can this node serve its dependencies? Fails if a DB, gRPC dependency, or cluster membership check is unhealthy. Orchestrators use this to gate traffic. | 200 | 200 | 503 |
| Active | /health/active | active | Is this the leader / active node? Fails (503) on a standby or role-member-but-not-leader node. Used to route write traffic or admin requests to exactly one node. | 200 | 200 | 503 |
| Live | /healthz | live | Bare process liveness: is the process alive and not deadlocked? No probes registered to this tag (predicate _ => false). Always 200 as long as the process can handle HTTP. | 200 | 200 | 200 |
Notes:
- The live tier intentionally carries no probes. Registering a probe to live is an error: a liveness failure that kills the pod should be reserved for total process hangs, not probe failures.
- Degraded maps to HTTP 200 (not 503) for the ready and active tiers. Orchestrators use 503 to remove a node from load-balancing; Degraded means "still up but degraded", so the node is removed only on hard failure.
- The tag names (ready, active, live) are declared as constants in ZbHealthTags and used consistently across all three apps. Per-project probe registrations must filter by these tags.
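As a rough illustration of the tagging rule, the sketch below declares the three tag constants and registers two placeholder probes against them. ZbHealthTags is named in this spec, but its member names (Ready, Active, Live), the ProbeRegistrationSketch class, the check names, and the always-healthy lambdas are assumptions made for the example, not the library's actual surface.

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Assumed shape of the shared constants; only the string values come from this spec.
public static class ZbHealthTags
{
    public const string Ready = "ready";
    public const string Active = "active";
    public const string Live = "live"; // never attached to a probe
}

public static class ProbeRegistrationSketch
{
    // Hypothetical app-side registration: each probe carries the tag of the
    // tier it belongs to, and nothing is tagged "live".
    public static IServiceCollection AddExampleProbes(this IServiceCollection services)
    {
        services.AddHealthChecks()
            .AddCheck("database", () => HealthCheckResult.Healthy("placeholder"),
                tags: new[] { ZbHealthTags.Ready })
            .AddCheck("active-node", () => HealthCheckResult.Healthy("placeholder"),
                tags: new[] { ZbHealthTags.Active });
        return services;
    }
}
```

In a real app the lambdas would be replaced by the shared probes from the catalog below; the point here is only that the tag decides which endpoint reports a probe.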
2. Probe catalog
2.1 Database probe — DatabaseHealthCheck<TContext>
Wraps an EF Core DbContext to verify database reachability. Default behavior calls
context.Database.CanConnectAsync() — matches ScadaBridge's pattern. An optional delegate
(Func<TContext, CancellationToken, Task>) overrides the default for more specific validation
(matches OtOpcUa's "query Deployments" pattern). Registered to the ready tag.
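A minimal sketch of how such a probe could look, assuming the optional delegate signals failure by throwing and that any exception maps to Unhealthy. The constructor shape and the description strings are assumptions; the spec only fixes the type name, the default CanConnectAsync call, and the delegate signature.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class DatabaseHealthCheck<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly TContext _db;
    private readonly Func<TContext, CancellationToken, Task>? _probe;

    // probe == null selects the default CanConnectAsync behavior.
    public DatabaseHealthCheck(TContext db, Func<TContext, CancellationToken, Task>? probe = null)
    {
        _db = db;
        _probe = probe;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            if (_probe is not null)
            {
                // Assumption: a custom probe reports failure by throwing.
                await _probe(_db, cancellationToken);
                return HealthCheckResult.Healthy("Custom database probe succeeded");
            }

            return await _db.Database.CanConnectAsync(cancellationToken)
                ? HealthCheckResult.Healthy("Database reachable")
                : HealthCheckResult.Unhealthy("Database unreachable");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Database probe failed", ex);
        }
    }
}
```

Because the delegate is a constructor argument, an app that wants the OtOpcUa-style query would most likely register the check through a factory rather than plain type-based registration.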
2.2 Akka cluster probe — AkkaClusterHealthCheck
Checks the local node's cluster membership status via Akka.Cluster. The status-to-health
mapping is configurable through AkkaClusterStatusPolicy.
Two named policy presets reconcile the existing divergence:
| Preset | Origin | Up / Joining | Leaving / Exiting | Other (WeaklyUp, Down, Removed, Unknown) |
|---|---|---|---|---|
| AkkaClusterStatusPolicy.Default | ScadaBridge AkkaClusterHealthCheck.cs | Healthy | Degraded | Unhealthy |
| AkkaClusterStatusPolicy.OtOpcUaCompat | OtOpcUa AkkaClusterHealthCheck.cs | Healthy (if self is Up among reachable members) | Degraded¹ | Degraded |
The Default preset is the convergence target. OtOpcUaCompat is provided for backward
compatibility during OtOpcUa's migration; it maps any non-Up-among-members state to Degraded
rather than Unhealthy. Registered to the ready tag.
Note on error/exception cases: under both presets, if the ActorSystem is not yet ready or cluster state is inaccessible (e.g. during startup), the check returns Degraded (startup-safety rule). The status cells in the table above describe the normal-operation path only; the Degraded cells in the OtOpcUaCompat row refer to states that collapse into Degraded via the member-scan result, not to an explicit policy match.
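The two presets and the startup-safety rule can be sketched as a pure mapping. This is only an illustration under stated assumptions: NodeStatus is a local stand-in for Akka.Cluster.MemberStatus so the example carries no Akka dependency, and the Evaluate method and its selfUpAmongReachable parameter are hypothetical names, not the library's API.

```csharp
using System;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Local stand-in for Akka.Cluster.MemberStatus (assumption, keeps the sketch dependency-free).
public enum NodeStatus { Joining, Up, Leaving, Exiting, Down, Removed, WeaklyUp }

public sealed class AkkaClusterStatusPolicy
{
    private readonly Func<NodeStatus, bool, HealthStatus> _map;

    private AkkaClusterStatusPolicy(Func<NodeStatus, bool, HealthStatus> map) => _map = map;

    // self == null models "ActorSystem not ready / cluster state inaccessible",
    // which both presets report as Degraded (startup-safety rule).
    public HealthStatus Evaluate(NodeStatus? self, bool selfUpAmongReachable) =>
        self is null ? HealthStatus.Degraded : _map(self.Value, selfUpAmongReachable);

    // ScadaBridge-derived preset: explicit status-to-health mapping.
    public static AkkaClusterStatusPolicy Default { get; } = new((status, _) => status switch
    {
        NodeStatus.Up or NodeStatus.Joining => HealthStatus.Healthy,
        NodeStatus.Leaving or NodeStatus.Exiting => HealthStatus.Degraded,
        _ => HealthStatus.Unhealthy,
    });

    // OtOpcUa-derived preset: only the member-scan result matters.
    public static AkkaClusterStatusPolicy OtOpcUaCompat { get; } = new((_, upAmongReachable) =>
        upAmongReachable ? HealthStatus.Healthy : HealthStatus.Degraded);
}
```

Modelling the presets as data rather than subclasses keeps the check itself identical across apps; only the policy instance passed at registration would differ.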
2.3 Active / leader probe — ActiveNodeHealthCheck
Checks whether this node is the designated leader (active node). Accepts an optional Akka cluster role name that scopes the check to nodes carrying that role.
Two behaviors unify the existing divergence:
| Mode | Role param | Origin | Healthy | Degraded | Unhealthy |
|---|---|---|---|---|---|
| Role-less | null | ScadaBridge ActiveNodeHealthCheck | Node is Up and cluster leader | n/a | Otherwise |
| Role-filtered | e.g. "admin" | OtOpcUa AdminRoleLeaderHealthCheck | Node does not carry the role (not a participant, so the probe ignores it) or node carries the role and is the role-singleton leader | Carries the role but is not the role-singleton leader (role member, not leader) | n/a |
The role-filtered variant maps "not a member of the role" to Healthy (transparent — the probe
is irrelevant for this node). This is the correct behavior for heterogeneous clusters where not
every node carries every role. Registered to the active tag.
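The decision table above can be expressed as a small pure function. The sketch assumes the cluster facts have already been read into a snapshot; NodeView and ActiveNodeDecision are hypothetical names introduced for illustration, and a real check would read these values from Akka.Cluster.

```csharp
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical snapshot of the cluster facts the probe needs.
public readonly record struct NodeView(
    bool IsUp, bool IsClusterLeader, bool CarriesRole, bool IsRoleLeader);

public static class ActiveNodeDecision
{
    public static HealthStatus Evaluate(string? role, NodeView node)
    {
        if (role is null)
        {
            // Role-less mode: Healthy only for an Up cluster leader, otherwise Unhealthy.
            return node.IsUp && node.IsClusterLeader
                ? HealthStatus.Healthy
                : HealthStatus.Unhealthy;
        }

        // Role-filtered mode: nodes outside the role are transparent (Healthy).
        if (!node.CarriesRole)
        {
            return HealthStatus.Healthy;
        }

        // Role members are Healthy when they lead the role, Degraded otherwise.
        return node.IsRoleLeader ? HealthStatus.Healthy : HealthStatus.Degraded;
    }
}
```

For example, Evaluate("admin", new NodeView(true, false, true, false)) yields Degraded, while the same node evaluated with a null role yields Unhealthy.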
2.4 gRPC dependency probe — GrpcDependencyHealthCheck
Checks that a downstream gRPC channel is reachable by invoking a caller-supplied probe
delegate (Func<GrpcChannel, CancellationToken, Task<bool>>). The default probe calls
GrpcChannel.ConnectAsync. Used by:
- OtOpcUa — checks the MxAccessGateway gRPC channel.
- MxGateway — checks the x86 worker gRPC channel.
Registered to the ready tag.
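One possible shape for this probe is sketched below. The delegate signature and the ConnectAsync default come from the text above; the constructor shape, the mapping of a false result or an exception to Unhealthy, and the description strings are assumptions.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Net.Client;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class GrpcDependencyHealthCheck : IHealthCheck
{
    private readonly GrpcChannel _channel;
    private readonly Func<GrpcChannel, CancellationToken, Task<bool>> _probe;

    public GrpcDependencyHealthCheck(
        GrpcChannel channel,
        Func<GrpcChannel, CancellationToken, Task<bool>>? probe = null)
    {
        _channel = channel;
        _probe = probe ?? DefaultProbeAsync;
    }

    // Default probe: the channel counts as reachable once ConnectAsync completes.
    private static async Task<bool> DefaultProbeAsync(GrpcChannel channel, CancellationToken ct)
    {
        await channel.ConnectAsync(ct);
        return true;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            return await _probe(_channel, cancellationToken)
                ? HealthCheckResult.Healthy($"gRPC target {_channel.Target} reachable")
                : HealthCheckResult.Unhealthy($"gRPC target {_channel.Target} not reachable");
        }
        catch (Exception ex)
        {
            // Assumption: connection errors and cancellation both count as unreachable.
            return HealthCheckResult.Unhealthy($"gRPC probe for {_channel.Target} failed", ex);
        }
    }
}
```

ConnectAsync can wait a long time on an unreachable target, so a registration would probably want a health-check timeout so the cancellation token bounds the probe.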
3. Response-writer contract
All health endpoints share one canonical JSON serializer. The shape is lifted from ScadaBridge's
HealthChecks.UI.Client style and becomes the library default (replacing per-project divergence).
Content-type: application/json
Shape:
```json
{
  "status": "Healthy",
  "totalDurationMs": 12,
  "entries": {
    "database": {
      "status": "Healthy",
      "description": "SQL Server reachable",
      "durationMs": 12
    },
    "akka-cluster": {
      "status": "Healthy",
      "description": "Member status: Up",
      "durationMs": 0.1
    }
  }
}
```
Field rules:
| Field | Type | Notes |
|---|---|---|
| status | string | "Healthy", "Degraded", or "Unhealthy"; the aggregate across all filtered checks |
| totalDurationMs | long | Total wall-clock time for all probes in this tier, milliseconds |
| entries | object | Keyed by check registration name |
| entries.<name>.status | string | Per-check status |
| entries.<name>.description | string? | Human-readable detail (may be null) |
| entries.<name>.durationMs | number | Per-check elapsed time, milliseconds |
The writer is exposed as a static Task WriteJsonAsync(HttpContext, HealthReport) so consumers can
plug it into MapHealthChecks options and also call it from custom endpoints.
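A writer that produces the shape above might look roughly like this. The WriteJsonAsync signature and the ZbHealthWriter name appear in this spec; the ToPayload helper and the use of anonymous types with System.Text.Json are assumptions made to keep the sketch short and testable without an HttpContext.

```csharp
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public static class ZbHealthWriter
{
    // Hypothetical helper: builds the canonical payload from a HealthReport.
    public static object ToPayload(HealthReport report) => new
    {
        status = report.Status.ToString(),
        // Truncated to whole milliseconds to match the "long" field type.
        totalDurationMs = (long)report.TotalDuration.TotalMilliseconds,
        entries = report.Entries.ToDictionary(
            e => e.Key,
            e => new
            {
                status = e.Value.Status.ToString(),
                description = e.Value.Description,
                durationMs = e.Value.Duration.TotalMilliseconds,
            }),
    };

    public static Task WriteJsonAsync(HttpContext context, HealthReport report)
    {
        context.Response.ContentType = "application/json";
        return JsonSerializer.SerializeAsync(context.Response.Body, ToPayload(report));
    }
}
```

Property names are written in their final casing in the anonymous types, so no naming policy is needed on the serializer.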
4. Active-node gating seam — IActiveNodeGate
IActiveNodeGate is a single-property interface (bool IsActiveNode { get; }) that expresses
whether the current node should accept write / active-role requests. The default implementation,
AkkaActiveNodeGate, reads cluster state directly: IsActiveNode returns true iff the
ActorSystem is available, SelfMember.Status == Up, and the node is the cluster leader. It is
null-guarded and returns false when the ActorSystem is not yet ready (safe default during
startup). It does not resolve ActiveNodeHealthCheck from DI. A RequireActiveNode() extension on
IEndpointConventionBuilder attaches a policy that short-circuits with 503 Service Unavailable
on standby nodes.
This seam is generalized from ScadaBridge's ActiveNodeGate.cs. It is in the core ZB.MOM.WW.Health
package (not the Akka satellite) so MxGateway can implement it without an Akka dependency if needed.
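To make the seam concrete, here is a sketch of the interface together with one way RequireActiveNode() could be built. The interface shape is taken from the text; AlwaysActiveNodeGate is a hypothetical non-Akka implementation, and implementing the 503 short-circuit with an endpoint filter (minimal-API endpoints, .NET 7 or later) is an assumption, since the spec only says the extension "attaches a policy".

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;

public interface IActiveNodeGate
{
    bool IsActiveNode { get; }
}

// Hypothetical gate for a service with no cluster: the single node is always active.
public sealed class AlwaysActiveNodeGate : IActiveNodeGate
{
    public bool IsActiveNode => true;
}

public static class ActiveNodeGateExtensions
{
    // Assumed implementation: an endpoint filter that answers 503 on standby
    // nodes and otherwise lets the request continue down the pipeline.
    public static TBuilder RequireActiveNode<TBuilder>(this TBuilder builder)
        where TBuilder : IEndpointConventionBuilder
    {
        return builder.AddEndpointFilter(async (ctx, next) =>
        {
            var gate = ctx.HttpContext.RequestServices.GetRequiredService<IActiveNodeGate>();
            return gate.IsActiveNode
                ? await next(ctx)
                : Results.StatusCode(StatusCodes.Status503ServiceUnavailable);
        });
    }
}
```

Usage would then be along the lines of app.MapPost("/commands", handler).RequireActiveNode(), with the gate registered as a singleton in DI.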
5. Endpoint registration
app.MapZbHealth() maps all three tiers in one call:
```csharp
app.MapZbHealth(); // all three tiers, defaults

app.MapZbHealth(o =>
{
    o.ReadyPath = "/health/ready"; // override paths if needed
    o.ActivePath = "/health/active";
    o.LivePath = "/healthz";
    o.ResponseWriter = ZbHealthWriter.WriteJsonAsync;
});
```
The library does not call services.AddHealthChecks() — that is the app's responsibility, as
the probe set is per-project. MapZbHealth only maps the three endpoints with the correct tag
predicates and response writer.
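Internally, the mapping could be little more than three MapHealthChecks calls with different predicates. The sketch below is an assumption about that internal shape: ZbHealthOptions mirrors the four properties shown above, while the BuildTier helper, the return type, and the class name are hypothetical. The tag strings are written as literals here to keep the example self-contained.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class ZbHealthOptions
{
    public string ReadyPath { get; set; } = "/health/ready";
    public string ActivePath { get; set; } = "/health/active";
    public string LivePath { get; set; } = "/healthz";
    public Func<HttpContext, HealthReport, Task>? ResponseWriter { get; set; }
}

public static class ZbHealthEndpointSketch
{
    public static IEndpointRouteBuilder MapZbHealth(
        this IEndpointRouteBuilder app, Action<ZbHealthOptions>? configure = null)
    {
        var o = new ZbHealthOptions();
        configure?.Invoke(o);
        app.MapHealthChecks(o.ReadyPath, BuildTier("ready", o));
        app.MapHealthChecks(o.ActivePath, BuildTier("active", o));
        app.MapHealthChecks(o.LivePath, BuildTier(null, o)); // live: no probes
        return app;
    }

    // tag == null selects no registrations, so the report is always Healthy (200).
    public static HealthCheckOptions BuildTier(string? tag, ZbHealthOptions o)
    {
        var tier = new HealthCheckOptions();
        if (tag is null)
        {
            tier.Predicate = _ => false;
        }
        else
        {
            tier.Predicate = registration => registration.Tags.Contains(tag);
        }

        // Status mapping from the tier table: only Unhealthy produces 503.
        tier.ResultStatusCodes[HealthStatus.Healthy] = StatusCodes.Status200OK;
        tier.ResultStatusCodes[HealthStatus.Degraded] = StatusCodes.Status200OK;
        tier.ResultStatusCodes[HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable;

        if (o.ResponseWriter is not null)
        {
            tier.ResponseWriter = o.ResponseWriter;
        }
        return tier;
    }
}
```

MapHealthChecks itself requires that the app has already called AddHealthChecks(), which fits the division of responsibility described above.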
6. Migration notes
| Project | Current state | Gap | What normalizes |
|---|---|---|---|
| OtOpcUa | All three tiers present (/health/ready, /health/active, /healthz); DatabaseHealthCheck, AkkaClusterHealthCheck, AdminRoleLeaderHealthCheck inline. | Inline probes diverge from the shared policy model; no IActiveNodeGate. | Replace inline AkkaClusterHealthCheck with shared + OtOpcUaCompat preset; replace AdminRoleLeaderHealthCheck with shared ActiveNodeHealthCheck(role: "admin"); replace inline DatabaseHealthCheck with shared generic; call app.MapZbHealth(). |
| ScadaBridge | /health/ready + /health/active present; no /healthz; DatabaseHealthCheck, AkkaClusterHealthCheck, ActiveNodeHealthCheck, ActiveNodeGate inline. | Missing /healthz live tier; inline implementations. | Add /healthz via MapZbHealth(); replace inline probes with shared equivalents (Default policy); replace inline ActiveNodeGate with AkkaActiveNodeGate. |
| MxGateway | Only /health/live (custom GatewayHealthReply); AddHealthChecks() called but zero probes registered. | Missing ready and active tiers; no probes; not using standard health middleware. | Replace custom endpoint with app.MapZbHealth(); register GrpcDependencyHealthCheck for the x86 worker channel on the ready tag. |
7. Acceptance (what "converged" means)
A project is converged when: (a) it calls app.MapZbHealth() and exposes all three canonical
endpoints; (b) its Akka probes (if applicable) use the AkkaClusterHealthCheck + ActiveNodeHealthCheck
from ZB.MOM.WW.Health.Akka with the Default policy; (c) its DB probe uses DatabaseHealthCheck<TContext>
from ZB.MOM.WW.Health.EntityFrameworkCore; (d) its gRPC-dependency probe (if applicable) uses
GrpcDependencyHealthCheck; (e) its IActiveNodeGate implementation is AkkaActiveNodeGate
(or a project-specific implementation of the shared interface); (f) all health endpoints return the
canonical JSON shape defined in §3.
¹ In the OtOpcUaCompat member-scan approach, Leaving / Exiting statuses also map to Degraded because a member with those statuses will not appear with Status == Up in the reachable member set: the scan finds self without Up, so the result is Degraded.