diff --git a/components/health/shared-contract/ZB.MOM.WW.Health.md b/components/health/shared-contract/ZB.MOM.WW.Health.md
new file mode 100644
index 0000000..4c5a6fe
--- /dev/null
+++ b/components/health/shared-contract/ZB.MOM.WW.Health.md
@@ -0,0 +1,238 @@
+# Proposed shared library: `ZB.MOM.WW.Health`
+
+A contract on paper — the public surface to extract so the three projects stop re-implementing
+health-check tiers, probe logic, and the active-node gating seam. Realizes
+[`../spec/SPEC.md`](../spec/SPEC.md). **Not yet created.** Reference implementations already
+exist: OtOpcUa `Health/` (three-tier + probes), ScadaBridge `Health/` (inline probes +
+`ActiveNodeGate`).
+
+## Packages (.NET 10)
+
+```
+ZB.MOM.WW.Health                        # core: tier convention, response writer, IActiveNodeGate, GrpcDependencyHealthCheck
+ZB.MOM.WW.Health.Akka                   # AkkaClusterHealthCheck, ActiveNodeHealthCheck, AkkaActiveNodeGate
+ZB.MOM.WW.Health.EntityFrameworkCore    # DatabaseHealthCheck
+```
+
+All three are .NET 10. The split keeps Akka.Cluster and EF Core out of MxGateway's dependency
+graph — MxGateway pulls only the core package. Published to the Gitea NuGet feed; SemVer; lockstep
+to start. The x86 net48 mxaccessgw worker has no HTTP surface — net48 multi-targeting is **not**
+required.
+
+## Packaging & distribution
+
+**Three NuGet packages, one DLL each**, on the Gitea NuGet feed. These are **libraries** linked
+into each app — there is no central health service. Consumers reference only what they need:
+
+| Package (→ DLL) | Transitive deps | MxGateway | OtOpcUa | ScadaBridge |
+|---|---|---|---|---|
+| `…Health` | `Microsoft.Extensions.Diagnostics.HealthChecks`, ASP.NET Core abstractions | ✅ | ✅ | ✅ |
+| `…Health.Akka` | Akka.Cluster | — | ✅ | ✅ |
+| `…Health.EntityFrameworkCore` | EF Core | — | ✅ | ✅ |
+
+**Why MxGateway takes only core:** it is not Akka-based and does not use EF Core.
+The `GrpcDependencyHealthCheck` in the core package covers its only probe need (worker channel
+reachability), so it avoids the Akka and EF transitive trees entirely.
+
+## `ZB.MOM.WW.Health`
+
+```csharp
+namespace ZB.MOM.WW.Health;
+
+/// Canonical tag constants — use these when calling AddCheck(..., tags: [ZbHealthTags.Ready]).
+public static class ZbHealthTags
+{
+    public const string Ready = "ready";
+    public const string Active = "active";
+    public const string Live = "live";
+}
+
+/// Options for MapZbHealth(). All paths and the response writer are overridable.
+public sealed class ZbHealthEndpointOptions
+{
+    public string ReadyPath { get; set; } = "/health/ready";
+    public string ActivePath { get; set; } = "/health/active";
+    public string LivePath { get; set; } = "/healthz";
+
+    /// Defaults to ZbHealthWriter.WriteJsonAsync.
+    public Func<HttpContext, HealthReport, Task>? ResponseWriter { get; set; }
+}
+
+/// Extension that maps all three health tiers in one call.
+public static class ZbHealthEndpointExtensions
+{
+    /// Maps /health/ready (tag "ready"), /health/active (tag "active"), /healthz (tag "live").
+    /// Does NOT call services.AddHealthChecks() — caller is responsible for probe registration.
+    public static IEndpointConventionBuilder MapZbHealth(
+        this IEndpointRouteBuilder endpoints,
+        ZbHealthEndpointOptions? options = null);
+
+    /// Maps /health/ready (tag "ready"), /health/active (tag "active"), /healthz (tag "live").
+    public static IEndpointConventionBuilder MapZbHealth(
+        this IEndpointRouteBuilder endpoints,
+        Action<ZbHealthEndpointOptions> configure);
+}
+
+/// Canonical JSON response writer. Shape: { status, totalDurationMs, entries: { name: { status, description, duration } } }.
+public static class ZbHealthWriter
+{
+    public static Task WriteJsonAsync(HttpContext context, HealthReport report);
+}
+
+/// Single-property seam: is this node the active/leader node?
+/// Attach to route groups via RequireActiveNode().
+/// Implement with AkkaActiveNodeGate (Health.Akka) or a project-specific implementation for non-Akka nodes.
+public interface IActiveNodeGate
+{
+    bool IsActiveNode { get; }
+}
+
+/// Route convention that returns 503 on standby nodes. DI-resolves IActiveNodeGate.
+public static class ActiveNodeGateExtensions
+{
+    public static IEndpointConventionBuilder RequireActiveNode(
+        this IEndpointConventionBuilder builder);
+}
+
+/// Checks that a downstream gRPC channel is reachable.
+public sealed class GrpcDependencyHealthCheck : IHealthCheck
+{
+    public GrpcDependencyHealthCheck(GrpcChannel channel, GrpcDependencyOptions? options = null);
+
+    public Task<HealthCheckResult> CheckHealthAsync(
+        HealthCheckContext context,
+        CancellationToken cancellationToken = default);
+}
+
+/// Options for GrpcDependencyHealthCheck.
+public sealed class GrpcDependencyOptions
+{
+    /// Override the default probe (GrpcChannel.ConnectAsync).
+    /// Return true = reachable, false = unreachable.
+    public Func>? Probe { get; set; }
+
+    /// Human-readable name of the dependency, used in the HealthCheckResult description.
+    public string? DependencyName { get; set; }
+
+    public TimeSpan Timeout { get; set; } = TimeSpan.FromSeconds(5);
+}
+```
+
+## `ZB.MOM.WW.Health.Akka`
+
+```csharp
+namespace ZB.MOM.WW.Health.Akka;
+
+/// Checks the local node's Akka cluster membership status.
+/// Register to tag ZbHealthTags.Ready.
+public sealed class AkkaClusterHealthCheck : IHealthCheck
+{
+    public AkkaClusterHealthCheck(
+        ActorSystem system,
+        AkkaClusterStatusPolicy policy);
+
+    public Task<HealthCheckResult> CheckHealthAsync(
+        HealthCheckContext context,
+        CancellationToken cancellationToken = default);
+}
+
+/// Maps Akka MemberStatus values to HealthStatus.
+/// Two named presets cover the two existing implementations; construct a custom instance for
+/// project-specific overrides.
+public sealed class AkkaClusterStatusPolicy
+{
+    public AkkaClusterStatusPolicy(Func evaluate);
+
+    /// ScadaBridge origin: Up/Joining→Healthy, Leaving/Exiting→Degraded, else Unhealthy.
+    /// Convergence target for all projects.
+    public static AkkaClusterStatusPolicy Default { get; }
+
+    /// OtOpcUa origin: self-Up-among-reachable-members→Healthy, else Degraded.
+    /// Provided for backward compatibility during OtOpcUa migration.
+    public static AkkaClusterStatusPolicy OtOpcUaCompat { get; }
+}
+
+/// Checks whether this node is the designated leader / active node.
+/// Optional role parameter scopes the check to nodes carrying that role.
+/// Register to tag ZbHealthTags.Active.
+public sealed class ActiveNodeHealthCheck : IHealthCheck
+{
+    /// Role-less constructor: Healthy = node is Up AND cluster leader (ScadaBridge ActiveNode pattern).
+    public ActiveNodeHealthCheck(ActorSystem system);
+
+    /// Role-filtered constructor: Healthy = (node lacks the role) OR (node carries role AND is role-singleton leader).
+    /// Degraded = node carries role but is not the role-singleton leader (OtOpcUa AdminRoleLeader pattern).
+    public ActiveNodeHealthCheck(ActorSystem system, string role);
+
+    public Task<HealthCheckResult> CheckHealthAsync(
+        HealthCheckContext context,
+        CancellationToken cancellationToken = default);
+}
+
+/// IActiveNodeGate implementation backed by ActiveNodeHealthCheck.
+/// Register as a singleton; resolves ActiveNodeHealthCheck from DI.
+public sealed class AkkaActiveNodeGate : IActiveNodeGate
+{
+    public AkkaActiveNodeGate(ActiveNodeHealthCheck check);
+
+    public bool IsActiveNode { get; }
+}
+```
+
+## `ZB.MOM.WW.Health.EntityFrameworkCore`
+
+```csharp
+namespace ZB.MOM.WW.Health.EntityFrameworkCore;
+
+/// Checks database reachability via an EF Core DbContext.
+/// Default probe: context.Database.CanConnectAsync() (ScadaBridge pattern).
+/// Supply a custom probe delegate for query-based validation (OtOpcUa "query Deployments" pattern).
+/// Register to tag ZbHealthTags.Ready.
+public sealed class DatabaseHealthCheck<TContext> : IHealthCheck
+    where TContext : DbContext
+{
+    public DatabaseHealthCheck(
+        IDbContextFactory<TContext> factory,
+        DatabaseHealthCheckOptions<TContext>? options = null);
+
+    public Task<HealthCheckResult> CheckHealthAsync(
+        HealthCheckContext context,
+        CancellationToken cancellationToken = default);
+}
+
+/// Options for DatabaseHealthCheck.
+public sealed class DatabaseHealthCheckOptions<TContext>
+    where TContext : DbContext
+{
+    /// Override the default CanConnectAsync() probe.
+    /// Throw to signal failure; return normally to signal success.
+    public Func? Probe { get; set; }
+
+    public TimeSpan Timeout { get; set; } = TimeSpan.FromSeconds(10);
+}
+```
+
+## Consumer matrix summary
+
+| Consumer | Packages | Notes |
+|---|---|---|
+| **MxGateway** | `ZB.MOM.WW.Health` (core only) | `GrpcDependencyHealthCheck` on the worker channel; all three tiers via `MapZbHealth()`; `IActiveNodeGate` not needed (not Akka-based) |
+| **OtOpcUa** | All three | `AkkaClusterHealthCheck` + `OtOpcUaCompat` preset → `Default` on convergence; `ActiveNodeHealthCheck(role: "admin")`; `DatabaseHealthCheck` with custom probe delegate |
+| **ScadaBridge** | All three | `AkkaClusterHealthCheck` + `Default` policy; `ActiveNodeHealthCheck` (role-less); `DatabaseHealthCheck` default probe; `AkkaActiveNodeGate` replaces inline `ActiveNodeGate` |
+
+## Open contract questions
+
+1. **`IActiveNodeGate` for non-Akka nodes:** MxGateway does not need active-node gating today.
+   If a future MxGateway cluster requires it, the interface is in the core package and can be
+   implemented without an Akka dependency. Validate whether a stub `AlwaysActiveGate` (returns
+   `true`) should ship in core for single-node deployments.
+2. **DI helpers:** decide whether `services.AddZbHealthChecks()` (a DI-registered convenience
+   that pre-registers gRPC + DB + Akka probes via options) is worth adding, or whether explicit
+   `services.AddHealthChecks().AddCheck<...>()` calls per project are clearer. The spec currently
+   leaves probe registration entirely per-project.
+3. **`AkkaActiveNodeGate` caching:** `IsActiveNode` is a synchronous property; the underlying
+   `ActiveNodeHealthCheck.CheckHealthAsync` is async. Validate whether the gate should cache the
+   last probe result on a short TTL (e.g. 5 s) or drive a background refresh, to avoid blocking
+   synchronous callers.
+
+See [`../GAPS.md`](../GAPS.md) for the adoption order and effort/risk.
diff --git a/components/health/spec/SPEC.md b/components/health/spec/SPEC.md
new file mode 100644
index 0000000..b59947d
--- /dev/null
+++ b/components/health/spec/SPEC.md
@@ -0,0 +1,184 @@
+# Health — normalized target spec
+
+Status: **Draft**. The single design the sister projects converge on. Derived from the
+three code-verified current-state docs (`../current-state/`). Goal is *path to shared code*
+(`../shared-contract/ZB.MOM.WW.Health.md`), so each normalized section maps to a shared library seam.
+
+## 0. Scope
+
+**Normalized here:** the three-tier endpoint convention (`/health/ready`, `/health/active`,
+`/healthz`) with canonical tags `ready` / `active` / `live` and their semantics; the canonical
+JSON response shape; the `IActiveNodeGate` request-gating seam; a configurable
+`AkkaClusterHealthCheck` with two named policy presets that reconcile the diverging Akka logic in
+OtOpcUa and ScadaBridge; a role-filtered `ActiveNodeHealthCheck` that unifies OtOpcUa's
+`AdminRoleLeaderHealthCheck` and ScadaBridge's `ActiveNodeHealthCheck`; a generic
+`DatabaseHealthCheck` that covers both apps' EF Core probe patterns; a
+`GrpcDependencyHealthCheck` for downstream gRPC reachability.
+
+**Explicitly NOT normalized** (domain-specific — keep per project): which probes each app
+registers and how it wires them to tags; orchestrator / Traefik routing rules and routing priorities;
+ScadaBridge's `HealthMonitoring/` domain-aggregation pipeline — this is a distributed, actor-based
+domain-health telemetry system (background services + Akka actors that aggregate site-cluster signals
+into a central health picture) and is **not** an ASP.NET health-probe; it is an independent concern
+that happens to share the word "health".
+
+## 1. Tier convention
+
+Three tiers, always served in this order, each filtered to a named tag:
+
+| Tier | Endpoint | Tag | Semantics | Healthy→ | Degraded→ | Unhealthy→ |
+|---|---|---|---|---|---|---|
+| Ready | `/health/ready` | `ready` | Can this node serve its dependencies? Fails if a DB, gRPC dependency, or cluster membership check is unhealthy. Orchestrators use this to gate traffic. | 200 | 200 | 503 |
+| Active | `/health/active` | `active` | Is this the leader / active node? Fails (503) on a standby or role-member-but-not-leader node. Used to route write traffic or admin requests to exactly one node. | 200 | 200 | 503 |
+| Live | `/healthz` | `live` | Bare process liveness — is the process alive and not deadlocked? **No probes registered to this tag** (predicate `_ => false`). Always 200 as long as the process can handle HTTP. | 200 | 200 | 200 |
+
+Notes:
+
+- The `live` tier intentionally carries no probes. Registering a probe to `live` is an error —
+  a liveness failure that kills the pod should be reserved for total process hangs, not probe failures.
+- `Degraded` maps to HTTP 200 (not 503) for the `ready` and `active` tiers. Orchestrators use 503
+  to remove a node from load-balancing; Degraded means "still up but degraded" — remove the node
+  only on hard failure.
+- The tag names (`ready`, `active`, `live`) are declared as constants in `ZbHealthTags` and used consistently across all three apps.
+  Per-project probe registrations must filter by these tags.
+
+## 2. Probe catalog
+
+### 2.1 Database probe — `DatabaseHealthCheck`
+
+Wraps an EF Core `DbContext` to verify database reachability. Default behavior calls
+`context.Database.CanConnectAsync()` — matches ScadaBridge's pattern. An optional delegate
+(`Func`) overrides the default for more specific validation
+(matches OtOpcUa's "query `Deployments`" pattern). Registered to the `ready` tag.
+
+### 2.2 Akka cluster probe — `AkkaClusterHealthCheck`
+
+Checks the local node's cluster membership status via Akka.Cluster. The status-to-health
+mapping is **configurable** through `AkkaClusterStatusPolicy`.
+
+**Two named policy presets reconcile the existing divergence:**
+
+| Preset | Origin | `Up` / `Joining` | `Leaving` / `Exiting` | Other (`WeaklyUp`, `Down`, `Removed`, `Unknown`) |
+|---|---|---|---|---|
+| `AkkaClusterStatusPolicy.Default` | ScadaBridge `AkkaClusterHealthCheck.cs` | Healthy | Degraded | Unhealthy |
+| `AkkaClusterStatusPolicy.OtOpcUaCompat` | OtOpcUa `AkkaClusterHealthCheck.cs` | Healthy (if self is `Up` among reachable members) | — | Degraded |
+
+The `Default` preset is the convergence target. `OtOpcUaCompat` is provided for backward
+compatibility during OtOpcUa's migration; it maps any non-`Up`-among-members state to Degraded
+rather than Unhealthy. Registered to the `ready` tag.
+
+### 2.3 Active / leader probe — `ActiveNodeHealthCheck`
+
+Checks whether this node is the designated leader (active node). Accepts an optional Akka
+cluster role name that scopes the check to nodes carrying that role.
+
+**Two behaviors unify the existing divergence:**
+
+| Mode | Role param | Origin | Healthy | Degraded | Unhealthy |
+|---|---|---|---|---|---|
+| Role-less | `null` | ScadaBridge `ActiveNodeHealthCheck` | Node is `Up` **and** cluster leader | — | Otherwise |
+| Role-filtered | e.g. `"admin"` | OtOpcUa `AdminRoleLeaderHealthCheck` | Node does **not** carry the role (not a participant — ignore it) **or** node carries the role and is the role-singleton leader | Carries the role but is **not** the role-singleton leader (role member, not leader) | — |
+
+The role-filtered variant maps "not a member of the role" to Healthy (transparent — the probe
+is irrelevant for this node). This is the correct behavior for heterogeneous clusters where not
+every node carries every role. Registered to the `active` tag.
+
+### 2.4 gRPC dependency probe — `GrpcDependencyHealthCheck`
+
+Checks that a downstream gRPC channel is reachable by invoking a caller-supplied probe
+delegate (`Func>`). The default probe calls
+`GrpcChannel.ConnectAsync`. Used by:
+
+- OtOpcUa — checks the MxAccessGateway gRPC channel.
+- MxGateway — checks the x86 worker gRPC channel.
+
+Registered to the `ready` tag.
+
+## 3. Response-writer contract
+
+All health endpoints share one canonical JSON serializer. The shape is lifted from ScadaBridge's
+`HealthChecks.UI.Client` style and becomes the library default (replacing per-project divergence).
+
+**Content-type:** `application/json`
+
+**Shape:**
+
+```json
+{
+  "status": "Healthy",
+  "totalDurationMs": 12,
+  "entries": {
+    "database": {
+      "status": "Healthy",
+      "description": "SQL Server reachable",
+      "duration": "00:00:00.0120000"
+    },
+    "akka-cluster": {
+      "status": "Healthy",
+      "description": "Member status: Up",
+      "duration": "00:00:00.0001000"
+    }
+  }
+}
+```
+
+**Field rules:**
+
+| Field | Type | Notes |
+|---|---|---|
+| `status` | string | `"Healthy"` \| `"Degraded"` \| `"Unhealthy"` — the aggregate across all filtered checks |
+| `totalDurationMs` | long | Total wall-clock time for all probes in this tier, milliseconds |
+| `entries` | object | Keyed by check registration name |
+| `entries.<name>.status` | string | Per-check status |
+| `entries.<name>.description` | string? | Human-readable detail (may be null) |
+| `entries.<name>.duration` | string | TimeSpan `ToString()` — per-check elapsed time |
+
+The writer is exposed as a static `Task WriteJsonAsync(HttpContext, HealthReport)` so consumers can
+plug it into `MapHealthChecks` options and also call it from custom endpoints.
+
+## 4. Active-node gating seam — `IActiveNodeGate`
+
+`IActiveNodeGate` is a single-property interface (`bool IsActiveNode { get; }`) that expresses
+whether the current node should accept write / active-role requests. The default implementation,
+`AkkaActiveNodeGate`, delegates to `ActiveNodeHealthCheck`. A `RequireActiveNode()` extension on
+`IEndpointConventionBuilder` attaches a policy that short-circuits with `503 Service Unavailable`
+on standby nodes.
+
+This seam is generalized from ScadaBridge's `ActiveNodeGate.cs`. It is in the core `ZB.MOM.WW.Health`
+package (not the Akka satellite) so MxGateway can implement it without an Akka dependency if needed.
+
+## 5. Endpoint registration
+
+`app.MapZbHealth()` maps all three tiers in one call:
+
+```csharp
+app.MapZbHealth(); // all three tiers, defaults
+app.MapZbHealth(o => {
+    o.ReadyPath = "/health/ready"; // override paths if needed
+    o.ActivePath = "/health/active";
+    o.LivePath = "/healthz";
+    o.ResponseWriter = ZbHealthWriter.WriteJsonAsync;
+});
+```
+
+The library does **not** call `services.AddHealthChecks()` — that is the app's responsibility, as
+the probe set is per-project. `MapZbHealth` only maps the three endpoints with the correct tag
+predicates and response writer.
+
+## 6. Migration notes
+
+| Project | Current state | Gap | What normalizes |
+|---|---|---|---|
+| **OtOpcUa** | All three tiers present (`/health/ready`, `/health/active`, `/healthz`); `DatabaseHealthCheck`, `AkkaClusterHealthCheck`, `AdminRoleLeaderHealthCheck` inline. | Inline probes diverge from the shared policy model; no `IActiveNodeGate`. | Replace inline `AkkaClusterHealthCheck` with shared + `OtOpcUaCompat` preset; replace `AdminRoleLeaderHealthCheck` with shared `ActiveNodeHealthCheck(role: "admin")`; replace inline `DatabaseHealthCheck` with shared generic; call `app.MapZbHealth()`. |
+| **ScadaBridge** | `/health/ready` + `/health/active` present; no `/healthz`; `DatabaseHealthCheck`, `AkkaClusterHealthCheck`, `ActiveNodeHealthCheck`, `ActiveNodeGate` inline. | Missing `/healthz` live tier; inline implementations. | Add `/healthz` via `MapZbHealth()`; replace inline probes with shared equivalents (Default policy); replace inline `ActiveNodeGate` with `AkkaActiveNodeGate`. |
+| **MxGateway** | Only `/health/live` (custom `GatewayHealthReply`); `AddHealthChecks()` called but zero probes registered. | Missing `ready` and `active` tiers; no probes; not using standard health middleware. | Replace custom endpoint with `app.MapZbHealth()`; register `GrpcDependencyHealthCheck` for the x86 worker channel on the `ready` tag. |
+
+## 7. Acceptance (what "converged" means)
+
+A project is converged when: (a) it calls `app.MapZbHealth()` and exposes all three canonical
+endpoints; (b) its Akka probes (if applicable) use the `AkkaClusterHealthCheck` + `ActiveNodeHealthCheck`
+from `ZB.MOM.WW.Health.Akka` with the Default policy; (c) its DB probe uses `DatabaseHealthCheck`
+from `ZB.MOM.WW.Health.EntityFrameworkCore`; (d) its gRPC-dependency probe (if applicable) uses
+`GrpcDependencyHealthCheck`; (e) its `IActiveNodeGate` implementation is `AkkaActiveNodeGate`
+(or a project-specific implementation of the shared interface); (f) all health endpoints return the
+canonical JSON shape defined in §3.
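+
+As an illustration of §1 and §5, the sketch below shows one way the three tiers could be mapped
+on top of ASP.NET Core's built-in `MapHealthChecks`. It is a sketch under stated assumptions, not
+the library's implementation: `TierMappingSketch`, `MapTiersSketch`, and `MapTier` are hypothetical
+names, the paths and tags are hard-coded rather than read from `ZbHealthEndpointOptions`, and the
+response writer is left at the framework default.
+
+```csharp
+using Microsoft.AspNetCore.Builder;
+using Microsoft.AspNetCore.Diagnostics.HealthChecks;
+using Microsoft.AspNetCore.Http;
+using Microsoft.AspNetCore.Routing;
+using Microsoft.Extensions.Diagnostics.HealthChecks;
+
+// Hypothetical helper for illustration only; not part of the proposed package.
+public static class TierMappingSketch
+{
+    public static void MapTiersSketch(this IEndpointRouteBuilder endpoints)
+    {
+        // Ready and Active tiers run only the probes carrying the matching tag.
+        MapTier(endpoints, "/health/ready", "ready");
+        MapTier(endpoints, "/health/active", "active");
+
+        // Live tier: the predicate rejects every registration, so no probe runs.
+        endpoints.MapHealthChecks("/healthz", new HealthCheckOptions
+        {
+            Predicate = _ => false,
+        });
+    }
+
+    private static IEndpointConventionBuilder MapTier(
+        IEndpointRouteBuilder endpoints, string path, string tag) =>
+        endpoints.MapHealthChecks(path, new HealthCheckOptions
+        {
+            // Filter the registered checks down to this tier's tag.
+            Predicate = registration => registration.Tags.Contains(tag),
+
+            // Status mapping from the §1 table: Degraded stays at 200, only Unhealthy returns 503.
+            ResultStatusCodes =
+            {
+                [HealthStatus.Healthy] = StatusCodes.Status200OK,
+                [HealthStatus.Degraded] = StatusCodes.Status200OK,
+                [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable,
+            },
+        });
+}
+```
+
+A host would call `app.MapTiersSketch()` after registering its own probes through
+`services.AddHealthChecks()`. Because the `/healthz` predicate rejects every registration, the
+report for that tier has no entries and aggregates to Healthy, which matches the always-200
+behavior described in §1.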