From 07d5907258e5ff50bf4f979204c95c8c8327c92b Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Mon, 1 Jun 2026 06:33:42 -0400 Subject: [PATCH] docs(health): resolve spec/contract/gaps consistency (review fixes) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Applies canonical resolutions for eight settled decisions: - GAPS: remove three stale "Decisions still open" bullets (#1 IActiveNodeGate placement, #2 GrpcChannel type, #3 OtOpcUaCompat named constant) - Shared contract: AkkaClusterHealthCheck, ActiveNodeHealthCheck constructors take IServiceProvider (lazy ActorSystem, Degraded-when-not-ready) - Shared contract: AkkaActiveNodeGate takes IServiceProvider; reads SelfMember+leader directly, null-guarded; does not proxy ActiveNodeHealthCheck - Shared contract: DatabaseHealthCheckOptions.Probe renamed to ProbeQuery; consumer matrix updated - Shared contract: settled AddZbHealthChecks open question removed (spec §5 is per-project AddHealthChecks) - SPEC §2.2: OtOpcUaCompat Leaving/Exiting cell updated from — to Degraded + footnote; §2.3 startup-safety note added - README: status line corrected from "built and tested" to "scaffolded … implementation is follow-on (task #7)"; IActiveNodeGate "left per-project" bullet removed - OtOpcUa current-state: AddZbHealthChecks → AddHealthChecks().AddCheck<...>(); IClusterRoleInfo note reframed as accepted trade-off - ScadaBridge current-state: IActiveNodeGate bullet rewritten — interface moves to ZB.MOM.WW.Health on adoption, InboundApiEndpointFilter references shared interface --- components/health/GAPS.md | 8 --- components/health/README.md | 8 +-- .../current-state/otopcua/CURRENT-STATE.md | 12 ++-- .../scadabridge/CURRENT-STATE.md | 7 ++- .../shared-contract/ZB.MOM.WW.Health.md | 62 ++++++++++++++----- components/health/spec/SPEC.md | 12 +++- 6 files changed, 71 insertions(+), 38 deletions(-) diff --git a/components/health/GAPS.md b/components/health/GAPS.md index e50474b..7fa3b13 100644 --- a/components/health/GAPS.md +++ b/components/health/GAPS.md @@ -131,11 +131,3 @@ after `ZB.MOM.WW.Health` @ 0.1.0 is published. The library build itself (nupkgs, separate task. This is consistent with how `ZB.MOM.WW.Auth` and `ZB.MOM.WW.Theme` are structured: the library is built first; adoption by the three apps is the next step. -## Decisions still open - -- Whether `GrpcDependencyHealthCheck` takes a named channel (from DI) or a raw `ChannelBase` — - affects how MxAccessGateway registers the worker-IPC probe without a standard gRPC channel. -- Whether `IActiveNodeGate` lives in `ZB.MOM.WW.Health` (making it a hard dependency) or stays - in ScadaBridge's `InboundAPI` project (keeping the gate as a ScadaBridge concern). -- Whether the `OtOpcUaCompat` preset for `AkkaClusterHealthCheck` is a named constant or just - documented configuration. diff --git a/components/health/README.md b/components/health/README.md index 7e89837..47a9b29 100644 --- a/components/health/README.md +++ b/components/health/README.md @@ -65,8 +65,6 @@ code-verified detail and its adoption plan. keep as a bespoke `/info` endpoint. - The x86 worker process — out of process and out of scope; the gateway-side `GrpcDependencyHealthCheck` observes it indirectly. -- Per-project `IActiveNodeGate` contract location (whether the interface lives in the shared - library or in each project's own surface). ## Package structure @@ -84,6 +82,6 @@ all three. ## Component status **Status: Draft.** Spec and shared-contract written; current-state docs verified; GAPS backlog -populated. Library (`ZB.MOM.WW.Health` @ 0.1.0) built and tested in this repo at -[`../../ZB.MOM.WW.Health/`](../../ZB.MOM.WW.Health/). Adoption by the three apps is a follow-on -tracked in [`GAPS.md`](GAPS.md). +populated. Library scaffolded at [`../../ZB.MOM.WW.Health/`](../../ZB.MOM.WW.Health/); source +implementation is a follow-on (task #7 in the adoption backlog). Adoption by the three apps is +a further follow-on tracked in [`GAPS.md`](GAPS.md). diff --git a/components/health/current-state/otopcua/CURRENT-STATE.md b/components/health/current-state/otopcua/CURRENT-STATE.md index 19ae6ea..83f34f0 100644 --- a/components/health/current-state/otopcua/CURRENT-STATE.md +++ b/components/health/current-state/otopcua/CURRENT-STATE.md @@ -131,15 +131,19 @@ probe in `ZB.MOM.WW.Health` would close. The delegate preserves the stricter query probe rather than falling back to `CanConnectAsync`. - Add `GrpcDependencyHealthCheck` targeting the MxAccessGateway channel (closes the downstream dependency gap noted in §4). Tag `["ready","active"]`. -- Replace `AddOtOpcUaHealth` / `MapOtOpcUaHealth` with `services.AddZbHealthChecks()` + +- Replace `AddOtOpcUaHealth` / `MapOtOpcUaHealth` with + `services.AddHealthChecks().AddCheck<...>()` (one call per probe, per spec §5) + `app.MapZbHealth()`. The `/healthz` bare-liveness tier is part of `MapZbHealth` by default — no separate wiring needed. **Keep bespoke:** -- `IClusterRoleInfo` and its Akka implementation — this is an OtOpcUa abstraction used for more - than health checks; it should remain in the OtOpcUa codebase. The shared `ActiveNodeHealthCheck` - will accept `IClusterRoleInfo` (or an equivalent cluster-info abstraction) as an injection point. +- `IClusterRoleInfo` and its Akka implementation — on adoption this testability seam is given up + for the health-check path. The shared `ActiveNodeHealthCheck` reads cluster role state from the + ActorSystem directly (resolving it lazily via `IServiceProvider`); it does not accept + `IClusterRoleInfo` as an injection point. This is an accepted trade-off: the shared implementation + is simpler and consistent across projects, while `IClusterRoleInfo` remains available elsewhere + in the OtOpcUa codebase where it is used outside health checks. - The `AllowAnonymous` policy — this is an OtOpcUa auth concern; `MapZbHealth` must document that callers are responsible for applying `AllowAnonymous` (or the shared helper applies it by default). - Which probes are registered and their tag assignments — the shared library supplies the check diff --git a/components/health/current-state/scadabridge/CURRENT-STATE.md b/components/health/current-state/scadabridge/CURRENT-STATE.md index ff1d816..dea29f8 100644 --- a/components/health/current-state/scadabridge/CURRENT-STATE.md +++ b/components/health/current-state/scadabridge/CURRENT-STATE.md @@ -172,9 +172,10 @@ regardless of shared-library adoption. - `HealthMonitoring/` domain pipeline (`SiteHealthCollector`, `CentralHealthAggregator`, etc.) — entirely per-project, no shared-library equivalent. -- `IActiveNodeGate` from the `InboundAPI` project is the contract the `InboundApiEndpointFilter` - depends on; it can be implemented by the shared `ActiveNodeHealthCheck` backing service but the - interface definition stays in the InboundAPI project (or moves to a shared abstractions package). +- `IActiveNodeGate` moves from the `InboundAPI` project to `ZB.MOM.WW.Health` (core package) on + adoption. `InboundApiEndpointFilter` references the shared interface; `AkkaActiveNodeGate` + (from `ZB.MOM.WW.Health.Akka`) becomes the singleton implementation registered in DI. The + interface definition is no longer owned by the `InboundAPI` project. - The Central UI's `/monitoring/health` endpoint — powered by `CentralHealthAggregator`, not by ASP.NET health checks. - The comment at `Program.cs:217–221` explains the readiness design decision (standby nodes are diff --git a/components/health/shared-contract/ZB.MOM.WW.Health.md b/components/health/shared-contract/ZB.MOM.WW.Health.md index 4c5a6fe..0f81a35 100644 --- a/components/health/shared-contract/ZB.MOM.WW.Health.md +++ b/components/health/shared-contract/ZB.MOM.WW.Health.md @@ -125,10 +125,19 @@ namespace ZB.MOM.WW.Health.Akka; /// Checks the local node's Akka cluster membership status. /// Register to tag ZbHealthTags.Ready. +/// +/// The ActorSystem is resolved lazily from the service provider. If the ActorSystem is not yet +/// available (e.g. during startup before Akka is initialised), the check returns Degraded rather +/// than throwing. This makes the check safe to register before Akka is fully up. +/// public sealed class AkkaClusterHealthCheck : IHealthCheck { + /// + /// The application service provider. ActorSystem is resolved lazily so the check is + /// startup-safe: if no ActorSystem is registered yet the result is Degraded. + /// public AkkaClusterHealthCheck( - ActorSystem system, + IServiceProvider serviceProvider, AkkaClusterStatusPolicy policy); public Task CheckHealthAsync( @@ -155,25 +164,46 @@ public sealed class AkkaClusterStatusPolicy /// Checks whether this node is the designated leader / active node. /// Optional role parameter scopes the check to nodes carrying that role. /// Register to tag ZbHealthTags.Active. +/// +/// The ActorSystem is resolved lazily from the service provider. If the ActorSystem is not yet +/// available (e.g. during startup before Akka is initialised), the check returns Degraded rather +/// than throwing. This makes the check startup-safe. +/// public sealed class ActiveNodeHealthCheck : IHealthCheck { /// Role-less constructor: Healthy = node is Up AND cluster leader (ScadaBridge ActiveNode pattern). - public ActiveNodeHealthCheck(ActorSystem system); + /// Returns Degraded when ActorSystem/cluster is not yet ready. + /// + /// The application service provider. ActorSystem is resolved lazily so the check is + /// startup-safe: if no ActorSystem is registered yet the result is Degraded. + /// + public ActiveNodeHealthCheck(IServiceProvider serviceProvider); /// Role-filtered constructor: Healthy = (node lacks the role) OR (node carries role AND is role-singleton leader). /// Degraded = node carries role but is not the role-singleton leader (OtOpcUa AdminRoleLeader pattern). - public ActiveNodeHealthCheck(ActorSystem system, string role); + /// Returns Degraded when ActorSystem/cluster is not yet ready. + /// + /// The application service provider. ActorSystem is resolved lazily so the check is + /// startup-safe: if no ActorSystem is registered yet the result is Degraded. + /// + public ActiveNodeHealthCheck(IServiceProvider serviceProvider, string role); public Task CheckHealthAsync( HealthCheckContext context, CancellationToken cancellationToken = default); } -/// IActiveNodeGate implementation backed by ActiveNodeHealthCheck. -/// Register as a singleton; resolves ActiveNodeHealthCheck from DI. +/// IActiveNodeGate implementation that computes IsActiveNode directly from the ActorSystem +/// (SelfMember Up + cluster leader), null-guarded for startup safety. +/// Register as a singleton. Does NOT resolve ActiveNodeHealthCheck from DI. public sealed class AkkaActiveNodeGate : IActiveNodeGate { - public AkkaActiveNodeGate(ActiveNodeHealthCheck check); + /// + /// The application service provider. ActorSystem is resolved lazily; if not yet available + /// IsActiveNode returns false (safe default during startup). + /// The gate checks SelfMember.Status == Up AND cluster.State.Leader == self.Address directly. + /// + public AkkaActiveNodeGate(IServiceProvider serviceProvider); public bool IsActiveNode { get; } } @@ -204,9 +234,10 @@ public sealed class DatabaseHealthCheck : IHealthCheck public sealed class DatabaseHealthCheckOptions where TContext : DbContext { - /// Override the default CanConnectAsync() probe. + /// Override the default CanConnectAsync() probe with a custom query-based probe. /// Throw to signal failure; return normally to signal success. - public Func? Probe { get; set; } + /// Example: db => db.Deployments.AsNoTracking().Take(1).ToListAsync() + public Func? ProbeQuery { get; set; } public TimeSpan Timeout { get; set; } = TimeSpan.FromSeconds(10); } @@ -217,7 +248,7 @@ public sealed class DatabaseHealthCheckOptions | Consumer | Packages | Notes | |---|---|---| | **MxGateway** | `ZB.MOM.WW.Health` (core only) | `GrpcDependencyHealthCheck` on the worker channel; all three tiers via `MapZbHealth()`; `IActiveNodeGate` not needed (not Akka-based) | -| **OtOpcUa** | All three | `AkkaClusterHealthCheck` + `OtOpcUaCompat` preset → `Default` on convergence; `ActiveNodeHealthCheck(role: "admin")`; `DatabaseHealthCheck` with custom probe delegate | +| **OtOpcUa** | All three | `AkkaClusterHealthCheck` + `OtOpcUaCompat` preset → `Default` on convergence; `ActiveNodeHealthCheck(role: "admin")`; `DatabaseHealthCheck` with `ProbeQuery` delegate | | **ScadaBridge** | All three | `AkkaClusterHealthCheck` + `Default` policy; `ActiveNodeHealthCheck` (role-less); `DatabaseHealthCheck` default probe; `AkkaActiveNodeGate` replaces inline `ActiveNodeGate` | ## Open contract questions @@ -226,13 +257,10 @@ public sealed class DatabaseHealthCheckOptions If a future MxGateway cluster requires it, the interface is in the core package and can be implemented without an Akka dependency. Validate whether a stub `AlwaysActiveGate` (returns `true`) should ship in core for single-node deployments. -2. **DI helpers:** decide whether `services.AddZbHealthChecks()` (a DI-registered convenience - that pre-registers gRPC + DB + Akka probes via options) is worth adding, or whether explicit - `services.AddHealthChecks().AddCheck<...>()` calls per project are clearer. The spec currently - leaves probe registration entirely per-project. -3. **`AkkaActiveNodeGate` caching:** `IsActiveNode` is a synchronous property; the underlying - `ActiveNodeHealthCheck.CheckHealthAsync` is async. Validate whether the gate should cache the - last probe result on a short TTL (e.g. 5 s) or drive a background refresh, to avoid blocking - synchronous callers. +2. **`AkkaActiveNodeGate` caching:** `IsActiveNode` is a synchronous property; the underlying + cluster-state read is synchronous but the ActorSystem lookup is lazy. Validate whether the + gate should cache the computed value on a short TTL (e.g. 5 s) to reduce Akka.Cluster API + overhead on high-frequency API routing checks, or whether reading `SelfMember`/`State.Leader` + directly on every call is acceptable. See [`../GAPS.md`](../GAPS.md) for the adoption order and effort/risk. diff --git a/components/health/spec/SPEC.md b/components/health/spec/SPEC.md index b59947d..93224e0 100644 --- a/components/health/spec/SPEC.md +++ b/components/health/spec/SPEC.md @@ -61,12 +61,22 @@ mapping is **configurable** through `AkkaClusterStatusPolicy`. | Preset | Origin | `Up` / `Joining` | `Leaving` / `Exiting` | Other (`WeaklyUp`, `Down`, `Removed`, `Unknown`) | |---|---|---|---|---| | `AkkaClusterStatusPolicy.Default` | ScadaBridge `AkkaClusterHealthCheck.cs` | Healthy | Degraded | Unhealthy | -| `AkkaClusterStatusPolicy.OtOpcUaCompat` | OtOpcUa `AkkaClusterHealthCheck.cs` | Healthy (if self is `Up` among reachable members) | — | Degraded | +| `AkkaClusterStatusPolicy.OtOpcUaCompat` | OtOpcUa `AkkaClusterHealthCheck.cs` | Healthy (if self is `Up` among reachable members) | Degraded[^1] | Degraded | + +[^1]: In the `OtOpcUaCompat` member-scan approach, `Leaving`/`Exiting` statuses also map to +Degraded because a member with those statuses will not appear with `Status == Up` in the reachable +member set — the scan finds self without `Up`, so the result is Degraded. The `Default` preset is the convergence target. `OtOpcUaCompat` is provided for backward compatibility during OtOpcUa's migration; it maps any non-`Up`-among-members state to Degraded rather than Unhealthy. Registered to the `ready` tag. +> **Note on error/exception cases:** in both modes, if the ActorSystem is not yet ready or cluster +> state is inaccessible (e.g. during startup), the check returns Degraded (startup-safety rule). +> The status cells in the table above describe the normal-operation path only; the "—" cells in the +> `OtOpcUaCompat` column refer to states that collapse into Degraded via the member-scan result, +> not to an explicit policy match. + ### 2.3 Active / leader probe — `ActiveNodeHealthCheck` Checks whether this node is the designated leader (active node). Accepts an optional Akka