Health — normalized target spec

Status: Draft. The single design the sister projects converge on. Derived from the three code-verified current-state docs (../current-state/). The goal is a path to shared code (../shared-contract/ZB.MOM.WW.Health.md), so each normalized section maps to a shared library seam.

0. Scope

Normalized here:

  • the three-tier endpoint convention (/health/ready, /health/active, /healthz) with canonical tags ready / active / live and their semantics;
  • the canonical JSON response shape;
  • the IActiveNodeGate request-gating seam;
  • a configurable AkkaClusterHealthCheck with two named policy presets that reconcile the diverging Akka logic in OtOpcUa and ScadaBridge;
  • a role-filtered ActiveNodeHealthCheck that unifies OtOpcUa's AdminRoleLeaderHealthCheck and ScadaBridge's ActiveNodeHealthCheck;
  • a generic DatabaseHealthCheck<TContext> that covers both apps' EF Core probe patterns;
  • a GrpcDependencyHealthCheck for downstream gRPC reachability.

Explicitly NOT normalized (domain-specific, keep per project):

  • which probes each app registers and how it wires them to tags;
  • orchestrator / Traefik routing rules and routing priorities;
  • ScadaBridge's HealthMonitoring/ domain-aggregation pipeline. This is a distributed, actor-based domain-health telemetry system (background services + Akka actors that aggregate site-cluster signals into a central health picture), not an ASP.NET health probe; it is an independent concern that happens to share the word "health".

1. Tier convention

Three tiers, always served in this order, each filtered to a named tag:

| Tier | Endpoint | Tag | Semantics | Healthy→ | Degraded→ | Unhealthy→ |
|---|---|---|---|---|---|---|
| Ready | /health/ready | ready | Can this node reach its dependencies and serve traffic? Fails if a DB, gRPC dependency, or cluster membership check is unhealthy. Orchestrators use this to gate traffic. | 200 | 200 | 503 |
| Active | /health/active | active | Is this the leader / active node? Fails (503) on a standby or role-member-but-not-leader node. Used to route write traffic or admin requests to exactly one node. | 200 | 200 | 503 |
| Live | /healthz | live | Bare process liveness: is the process alive and not deadlocked? No probes are registered to this tag (predicate _ => false). Always 200 as long as the process can handle HTTP. | 200 | 200 | 200 |

Notes:

  • The live tier intentionally carries no probes. Registering a probe to live is an error — a liveness failure that kills the pod should be reserved for total process hangs, not probe failures.
  • Degraded maps to HTTP 200 (not 503) for the ready and active tiers. Orchestrators use 503 to remove a node from load balancing; a Degraded node is still serving with reduced capability, so it stays in rotation and is removed only on hard failure (Unhealthy).
  • The tag names (ready, active, live) are declared as constants in ZbHealthTags and used consistently across all three apps. Per-project probe registrations must filter by these tags.
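
The tier table can be expressed with the stock ASP.NET Core health-check middleware. The following is a minimal sketch, not the library's implementation: the member names and values of ZbHealthTags and the TierOptionsSketch helper are assumptions made for illustration (the spec names ZbHealthTags but does not show its members).

```csharp
using System.Collections.Generic;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Assumed member names and values; the spec only names the ZbHealthTags class.
public static class ZbHealthTags
{
    public const string Ready = "ready";
    public const string Active = "active";
    public const string Live = "live";
}

// Hypothetical helper that builds per-tier middleware options.
public static class TierOptionsSketch
{
    // Ready and active tiers: run only the checks that carry the tier's tag.
    public static HealthCheckOptions ForTag(string tag) => new HealthCheckOptions
    {
        Predicate = registration => registration.Tags.Contains(tag),
        ResultStatusCodes = TierStatusCodes(),
    };

    // Live tier: the predicate rejects every registration, so no probe runs
    // and the resulting empty report is Healthy.
    public static HealthCheckOptions ForLive() => new HealthCheckOptions
    {
        Predicate = _ => false,
        ResultStatusCodes = TierStatusCodes(),
    };

    // Healthy and Degraded answer 200; only Unhealthy answers 503.
    private static Dictionary<HealthStatus, int> TierStatusCodes() => new()
    {
        [HealthStatus.Healthy] = StatusCodes.Status200OK,
        [HealthStatus.Degraded] = StatusCodes.Status200OK,
        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable,
    };
}
```

An app could then map a tier with, for example, app.MapHealthChecks("/health/ready", TierOptionsSketch.ForTag(ZbHealthTags.Ready)). The live tier never returns 503 in practice because its report contains no entries.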

2. Probe catalog

2.1 Database probe — DatabaseHealthCheck<TContext>

Wraps an EF Core DbContext to verify database reachability. Default behavior calls context.Database.CanConnectAsync() — matches ScadaBridge's pattern. An optional delegate (Func<TContext, CancellationToken, Task>) overrides the default for more specific validation (matches OtOpcUa's "query Deployments" pattern). Registered to the ready tag.
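
A minimal sketch of such a probe follows. The class name DatabaseHealthCheckSketch, the constructor shape, and the choice to report every failure as Unhealthy are assumptions for illustration; the spec does not state how failures are classified.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-in for DatabaseHealthCheck<TContext>.
public sealed class DatabaseHealthCheckSketch<TContext> : IHealthCheck
    where TContext : DbContext
{
    private readonly TContext _context;
    private readonly Func<TContext, CancellationToken, Task>? _probe;

    public DatabaseHealthCheckSketch(
        TContext context,
        Func<TContext, CancellationToken, Task>? probe = null)
    {
        _context = context;
        _probe = probe;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            if (_probe is not null)
            {
                // Custom validation: the delegate signals failure by throwing.
                await _probe(_context, cancellationToken);
                return HealthCheckResult.Healthy("Custom database probe succeeded");
            }

            // Default behavior: a plain connectivity test.
            return await _context.Database.CanConnectAsync(cancellationToken)
                ? HealthCheckResult.Healthy("Database reachable")
                : HealthCheckResult.Unhealthy("Database unreachable");
        }
        catch (Exception ex)
        {
            // Assumption: any probe failure is reported as Unhealthy.
            return HealthCheckResult.Unhealthy("Database probe failed", ex);
        }
    }
}
```

An OtOpcUa-style registration would pass a delegate that runs a small query (for instance against its Deployments set), while a ScadaBridge-style registration would omit the delegate and rely on the connectivity test.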

2.2 Akka cluster probe — AkkaClusterHealthCheck

Checks the local node's cluster membership status via Akka.Cluster. The status-to-health mapping is configurable through AkkaClusterStatusPolicy.

Two named policy presets reconcile the existing divergence:

| Preset | Origin | Up / Joining | Leaving / Exiting | Other (WeaklyUp, Down, Removed, Unknown) |
|---|---|---|---|---|
| AkkaClusterStatusPolicy.Default | ScadaBridge AkkaClusterHealthCheck.cs | Healthy | Degraded | Unhealthy |
| AkkaClusterStatusPolicy.OtOpcUaCompat | OtOpcUa AkkaClusterHealthCheck.cs | Healthy (if self is Up among reachable members) | Degraded ¹ | Degraded |

The Default preset is the convergence target. OtOpcUaCompat is provided for backward compatibility during OtOpcUa's migration; it maps any non-Up-among-members state to Degraded rather than Unhealthy. Registered to the ready tag.

Note on error/exception cases: under both presets, if the ActorSystem is not yet ready or the cluster state is inaccessible (e.g. during startup), the check returns Degraded (startup-safety rule). The status cells in the table above describe the normal-operation path only; the Degraded cells in the OtOpcUaCompat row arise because those states collapse into Degraded via the member-scan result, not because of an explicit policy match.
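
The preset table and the startup-safety rule can be read as a pair of pure mapping functions plus a guard. The sketch below assumes that reading; AkkaStatusPolicySketch and its method names are hypothetical and do not reflect the real AkkaClusterStatusPolicy API, which the spec does not show. MemberStatus is Akka.NET's own enum.

```csharp
using System;
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical illustration of the two presets as pure functions.
public static class AkkaStatusPolicySketch
{
    // Default preset: Up/Joining are Healthy, Leaving/Exiting are Degraded,
    // and every other status is Unhealthy.
    public static HealthStatus MapDefault(MemberStatus status) => status switch
    {
        MemberStatus.Up or MemberStatus.Joining => HealthStatus.Healthy,
        MemberStatus.Leaving or MemberStatus.Exiting => HealthStatus.Degraded,
        _ => HealthStatus.Unhealthy,
    };

    // OtOpcUaCompat preset: the member scan either finds self as Up or it
    // does not; every "not found as Up" outcome is Degraded, never Unhealthy.
    public static HealthStatus MapOtOpcUaCompat(bool selfIsUpAmongReachableMembers) =>
        selfIsUpAmongReachableMembers ? HealthStatus.Healthy : HealthStatus.Degraded;

    // Startup-safety rule: if the cluster state cannot be read at all,
    // report Degraded instead of letting the exception escape.
    public static HealthStatus Evaluate(
        Func<MemberStatus> readSelfStatus,
        Func<MemberStatus, HealthStatus> policy)
    {
        try
        {
            return policy(readSelfStatus());
        }
        catch (Exception)
        {
            return HealthStatus.Degraded;
        }
    }
}
```

Keeping the mapping free of any ActorSystem reference makes each preset easy to unit-test; the real check would supply the status reader from the running cluster.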

2.3 Active / leader probe — ActiveNodeHealthCheck

Checks whether this node is the designated leader (active node). Accepts an optional Akka cluster role name that scopes the check to nodes carrying that role.

Two behaviors unify the existing divergence:

| Mode | Role param | Origin | Healthy | Degraded | Unhealthy |
|---|---|---|---|---|---|
| Role-less | null | ScadaBridge ActiveNodeHealthCheck | Node is Up and cluster leader | n/a | Otherwise |
| Role-filtered | e.g. "admin" | OtOpcUa AdminRoleLeaderHealthCheck | Node does not carry the role (not a participant, so ignore it) or node carries the role and is the role-singleton leader | n/a | Carries the role but is not the role-singleton leader (role member, not leader) |

The role-filtered variant maps "not a member of the role" to Healthy (transparent — the probe is irrelevant for this node). This is the correct behavior for heterogeneous clusters where not every node carries every role. Registered to the active tag.
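
The two modes reduce to a small decision function. The sketch below is one reading of the table, assuming that a non-leader maps to Unhealthy (consistent with the 503 on the active tier in §1); NodeFacts and ActiveNodeDecisionSketch are hypothetical names, and the facts would be gathered from cluster state by the real probe.

```csharp
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical snapshot of the cluster facts the probe needs.
// CarriesRole and IsRoleLeader refer to the role passed to Evaluate.
public readonly record struct NodeFacts(
    bool IsUp,
    bool IsClusterLeader,
    bool CarriesRole,
    bool IsRoleLeader);

public static class ActiveNodeDecisionSketch
{
    public static HealthStatus Evaluate(string? role, NodeFacts facts)
    {
        if (role is null)
        {
            // Role-less mode: active only when Up and cluster leader.
            return facts.IsUp && facts.IsClusterLeader
                ? HealthStatus.Healthy
                : HealthStatus.Unhealthy;
        }

        // Role-filtered mode: a node outside the role is not a participant,
        // so the probe is transparent for it.
        if (!facts.CarriesRole)
        {
            return HealthStatus.Healthy;
        }

        // A role member is active only when it leads that role.
        return facts.IsRoleLeader ? HealthStatus.Healthy : HealthStatus.Unhealthy;
    }
}
```

For example, a node that carries the "admin" role without leading it evaluates to Unhealthy, while a node without that role evaluates to Healthy.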

2.4 gRPC dependency probe — GrpcDependencyHealthCheck

Checks that a downstream gRPC channel is reachable by invoking a caller-supplied probe delegate (Func<GrpcChannel, CancellationToken, Task<bool>>). The default probe calls GrpcChannel.ConnectAsync. Used by:

  • OtOpcUa — checks the MxAccessGateway gRPC channel.
  • MxGateway — checks the x86 worker gRPC channel.

Registered to the ready tag.
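
A minimal sketch of the delegate-based probe is shown below. The class name, the timeout parameter and its 2-second default, and the mapping of failures and timeouts to Unhealthy are assumptions for illustration; only the delegate signature and the use of GrpcChannel.ConnectAsync come from the text above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Net.Client;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-in for GrpcDependencyHealthCheck.
public sealed class GrpcDependencyHealthCheckSketch : IHealthCheck
{
    private readonly GrpcChannel _channel;
    private readonly Func<GrpcChannel, CancellationToken, Task<bool>> _probe;
    private readonly TimeSpan _timeout;

    public GrpcDependencyHealthCheckSketch(
        GrpcChannel channel,
        Func<GrpcChannel, CancellationToken, Task<bool>>? probe = null,
        TimeSpan? timeout = null)
    {
        _channel = channel;
        _probe = probe ?? new Func<GrpcChannel, CancellationToken, Task<bool>>(DefaultProbeAsync);
        _timeout = timeout ?? TimeSpan.FromSeconds(2); // assumed default
    }

    // Default probe: try to establish the connection; completing means reachable.
    private static async Task<bool> DefaultProbeAsync(GrpcChannel channel, CancellationToken ct)
    {
        await channel.ConnectAsync(ct);
        return true;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Bound the probe so an unreachable peer cannot stall the endpoint.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(_timeout);
        try
        {
            return await _probe(_channel, cts.Token)
                ? HealthCheckResult.Healthy("gRPC dependency reachable")
                : HealthCheckResult.Unhealthy("gRPC probe reported failure");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("gRPC probe failed or timed out", ex);
        }
    }
}
```

A caller that needs more than connectivity could pass a delegate that invokes a lightweight RPC on the downstream service and returns false on an unexpected reply.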

3. Response-writer contract

All health endpoints share one canonical JSON serializer. The shape is lifted from ScadaBridge's HealthChecks.UI.Client style and becomes the library default (replacing per-project divergence).

Content-type: application/json

Shape:

{
  "status": "Healthy",
  "totalDurationMs": 12,
  "entries": {
    "database": {
      "status": "Healthy",
      "description": "SQL Server reachable",
      "durationMs": 12
    },
    "akka-cluster": {
      "status": "Healthy",
      "description": "Member status: Up",
      "durationMs": 0.1
    }
  }
}

Field rules:

| Field | Type | Notes |
|---|---|---|
| status | string | One of "Healthy", "Degraded", "Unhealthy": the aggregate across all filtered checks |
| totalDurationMs | long | Total wall-clock time for all probes in this tier, milliseconds |
| entries | object | Keyed by check registration name |
| entries.<name>.status | string | Per-check status |
| entries.<name>.description | string? | Human-readable detail (may be null) |
| entries.<name>.durationMs | number | Per-check elapsed time, milliseconds |

The writer is exposed as a static Task WriteJsonAsync(HttpContext, HealthReport) so consumers can plug it into MapHealthChecks options and also call it from custom endpoints.
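
A writer producing this shape could look roughly like the sketch below. It assumes System.Text.Json and an anonymous-object projection; the class name ZbHealthWriterSketch is hypothetical and the real writer may be implemented differently.

```csharp
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-in for the shared response writer.
public static class ZbHealthWriterSketch
{
    public static Task WriteJsonAsync(HttpContext httpContext, HealthReport report)
    {
        // Project the report onto the canonical field names.
        var payload = new
        {
            status = report.Status.ToString(),
            totalDurationMs = (long)report.TotalDuration.TotalMilliseconds,
            entries = report.Entries.ToDictionary(
                entry => entry.Key,
                entry => new
                {
                    status = entry.Value.Status.ToString(),
                    description = entry.Value.Description,
                    durationMs = entry.Value.Duration.TotalMilliseconds,
                }),
        };

        httpContext.Response.ContentType = "application/json";
        return JsonSerializer.SerializeAsync(
            httpContext.Response.Body,
            payload,
            cancellationToken: httpContext.RequestAborted);
    }
}
```

Because the method matches the Func<HttpContext, HealthReport, Task> signature of HealthCheckOptions.ResponseWriter, it can be assigned there directly or called from a custom endpoint.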

4. Active-node gating seam — IActiveNodeGate

IActiveNodeGate is a single-property interface (bool IsActiveNode { get; }) that expresses whether the current node should accept write / active-role requests. The default implementation, AkkaActiveNodeGate, reads cluster state directly: IsActiveNode returns true iff the ActorSystem is available, SelfMember.Status == Up, and the node is the cluster leader. It is null-guarded and returns false when the ActorSystem is not yet ready (safe default during startup). It does not resolve ActiveNodeHealthCheck from DI. A RequireActiveNode() extension on IEndpointConventionBuilder attaches a policy that short-circuits with 503 Service Unavailable on standby nodes.

This seam is generalized from ScadaBridge's ActiveNodeGate.cs. It is in the core ZB.MOM.WW.Health package (not the Akka satellite) so MxGateway can implement it without an Akka dependency if needed.
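
The seam can be sketched as follows. The interface shape comes from the text above; everything else is an assumption: the Func-based ActorSystem accessor, the class names suffixed with Sketch, and the use of an endpoint filter (available from .NET 7, and effective on minimal-API style endpoints) as the short-circuit mechanism. The interface and the Akka-backed gate are shown together only for brevity; per the packaging described above they would live in different packages.

```csharp
using System;
using Akka.Actor;
using Akka.Cluster;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;

public interface IActiveNodeGate
{
    bool IsActiveNode { get; }
}

// Hypothetical Akka-backed gate; the accessor returns null until the
// ActorSystem has started.
public sealed class AkkaActiveNodeGateSketch : IActiveNodeGate
{
    private readonly Func<ActorSystem?> _systemAccessor;

    public AkkaActiveNodeGateSketch(Func<ActorSystem?> systemAccessor) =>
        _systemAccessor = systemAccessor;

    public bool IsActiveNode
    {
        get
        {
            var system = _systemAccessor();
            if (system is null)
            {
                return false; // safe default during startup
            }

            var cluster = Cluster.Get(system);
            return cluster.SelfMember.Status == MemberStatus.Up
                && cluster.SelfAddress.Equals(cluster.State.Leader);
        }
    }
}

public static class ActiveNodeGateSketchExtensions
{
    // Answers 503 on standby nodes before the endpoint handler runs.
    public static TBuilder RequireActiveNode<TBuilder>(this TBuilder builder)
        where TBuilder : IEndpointConventionBuilder =>
        builder.AddEndpointFilter(async (context, next) =>
        {
            var gate = context.HttpContext.RequestServices.GetRequiredService<IActiveNodeGate>();
            if (!gate.IsActiveNode)
            {
                return Results.StatusCode(StatusCodes.Status503ServiceUnavailable);
            }

            return await next(context);
        });
}
```

A write endpoint would opt in with something like app.MapPost("/commands", handler).RequireActiveNode(), and a project without Akka could register its own IActiveNodeGate implementation instead.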

5. Endpoint registration

app.MapZbHealth() maps all three tiers in one call:

app.MapZbHealth();                        // all three tiers, defaults
// ...or, to override the defaults, call it with an options callback instead:
app.MapZbHealth(o => {
    o.ReadyPath   = "/health/ready";      // override paths if needed
    o.ActivePath  = "/health/active";
    o.LivePath    = "/healthz";
    o.ResponseWriter = ZbHealthWriter.WriteJsonAsync;
});

The library does not call services.AddHealthChecks() — that is the app's responsibility, as the probe set is per-project. MapZbHealth only maps the three endpoints with the correct tag predicates and response writer.
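
On the app side, registration might look like the sketch below. The probe names and the inline delegate checks are placeholders standing in for the real probes, and AppProbeRegistrationSketch is a hypothetical helper; RunTierAsync only demonstrates how a tag predicate selects a tier's checks.

```csharp
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical app-side registration helper.
public static class AppProbeRegistrationSketch
{
    public static IServiceCollection AddExampleProbes(this IServiceCollection services)
    {
        services.AddLogging(); // the default HealthCheckService needs logging
        services.AddHealthChecks()
            // Placeholder for a real DB probe, wired to the ready tier.
            .AddCheck("database",
                () => HealthCheckResult.Healthy("stand-in for a DB probe"),
                tags: new[] { "ready" })
            // Placeholder for a leader probe on a standby node, active tier.
            .AddCheck("active-node",
                () => HealthCheckResult.Unhealthy("stand-in: this node is a standby"),
                tags: new[] { "active" });
        return services;
    }

    // Runs only the checks registered under the given tag.
    public static Task<HealthReport> RunTierAsync(HealthCheckService service, string tag) =>
        service.CheckHealthAsync(registration => registration.Tags.Contains(tag));
}
```

With these registrations the ready tier reports Healthy, the active tier reports Unhealthy, and the live tier (no checks tagged live) reports Healthy with no entries.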

6. Migration notes

| Project | Current state | Gap | What normalizes |
|---|---|---|---|
| OtOpcUa | All three tiers present (/health/ready, /health/active, /healthz); DatabaseHealthCheck, AkkaClusterHealthCheck, AdminRoleLeaderHealthCheck inline. | Inline probes diverge from the shared policy model; no IActiveNodeGate. | Replace inline AkkaClusterHealthCheck with shared + OtOpcUaCompat preset; replace AdminRoleLeaderHealthCheck with shared ActiveNodeHealthCheck(role: "admin"); replace inline DatabaseHealthCheck with shared generic; call app.MapZbHealth(). |
| ScadaBridge | /health/ready + /health/active present; no /healthz; DatabaseHealthCheck, AkkaClusterHealthCheck, ActiveNodeHealthCheck, ActiveNodeGate inline. | Missing /healthz live tier; inline implementations. | Add /healthz via MapZbHealth(); replace inline probes with shared equivalents (Default policy); replace inline ActiveNodeGate with AkkaActiveNodeGate. |
| MxGateway | Only /health/live (custom GatewayHealthReply); AddHealthChecks() called but zero probes registered. | Missing ready and active tiers; no probes; not using standard health middleware. | Replace custom endpoint with app.MapZbHealth(); register GrpcDependencyHealthCheck for the x86 worker channel on the ready tag. |

7. Acceptance (what "converged" means)

A project is converged when:

  (a) it calls app.MapZbHealth() and exposes all three canonical endpoints;
  (b) its Akka probes (if applicable) use the AkkaClusterHealthCheck + ActiveNodeHealthCheck from ZB.MOM.WW.Health.Akka with the Default policy;
  (c) its DB probe uses DatabaseHealthCheck<TContext> from ZB.MOM.WW.Health.EntityFrameworkCore;
  (d) its gRPC-dependency probe (if applicable) uses GrpcDependencyHealthCheck;
  (e) its IActiveNodeGate implementation is AkkaActiveNodeGate (or a project-specific implementation of the shared interface);
  (f) all health endpoints return the canonical JSON shape defined in §3.


  1. In the OtOpcUaCompat member-scan approach, Leaving/Exiting statuses also map to Degraded because a member with those statuses will not appear with Status == Up in the reachable member set: the scan finds self without Up, so the result is Degraded.