docs(health): resolve spec/contract/gaps consistency (review fixes)
Applies canonical resolutions for eight settled decisions: - GAPS: remove three stale "Decisions still open" bullets (#1 IActiveNodeGate placement, #2 GrpcChannel type, #3 OtOpcUaCompat named constant) - Shared contract: AkkaClusterHealthCheck, ActiveNodeHealthCheck constructors take IServiceProvider (lazy ActorSystem, Degraded-when-not-ready) - Shared contract: AkkaActiveNodeGate takes IServiceProvider; reads SelfMember+leader directly, null-guarded; does not proxy ActiveNodeHealthCheck - Shared contract: DatabaseHealthCheckOptions.Probe renamed to ProbeQuery; consumer matrix updated - Shared contract: settled AddZbHealthChecks open question removed (spec §5 is per-project AddHealthChecks) - SPEC §2.2: OtOpcUaCompat Leaving/Exiting cell updated from — to Degraded + footnote; §2.3 startup-safety note added - README: status line corrected from "built and tested" to "scaffolded … implementation is follow-on (task #7)"; IActiveNodeGate "left per-project" bullet removed - OtOpcUa current-state: AddZbHealthChecks → AddHealthChecks().AddCheck<...>(); IClusterRoleInfo note reframed as accepted trade-off - ScadaBridge current-state: IActiveNodeGate bullet rewritten — interface moves to ZB.MOM.WW.Health on adoption, InboundApiEndpointFilter references shared interface
This commit is contained in:
@@ -131,11 +131,3 @@ after `ZB.MOM.WW.Health` @ 0.1.0 is published. The library build itself (nupkgs,
|
|||||||
separate task. This is consistent with how `ZB.MOM.WW.Auth` and `ZB.MOM.WW.Theme` are structured:
|
separate task. This is consistent with how `ZB.MOM.WW.Auth` and `ZB.MOM.WW.Theme` are structured:
|
||||||
the library is built first; adoption by the three apps is the next step.
|
the library is built first; adoption by the three apps is the next step.
|
||||||
|
|
||||||
## Decisions still open
|
|
||||||
|
|
||||||
- Whether `GrpcDependencyHealthCheck` takes a named channel (from DI) or a raw `ChannelBase` —
|
|
||||||
affects how MxAccessGateway registers the worker-IPC probe without a standard gRPC channel.
|
|
||||||
- Whether `IActiveNodeGate` lives in `ZB.MOM.WW.Health` (making it a hard dependency) or stays
|
|
||||||
in ScadaBridge's `InboundAPI` project (keeping the gate as a ScadaBridge concern).
|
|
||||||
- Whether the `OtOpcUaCompat` preset for `AkkaClusterHealthCheck` is a named constant or just
|
|
||||||
documented configuration.
|
|
||||||
|
|||||||
@@ -65,8 +65,6 @@ code-verified detail and its adoption plan.
|
|||||||
keep as a bespoke `/info` endpoint.
|
keep as a bespoke `/info` endpoint.
|
||||||
- The x86 worker process — out of process and out of scope; the gateway-side
|
- The x86 worker process — out of process and out of scope; the gateway-side
|
||||||
`GrpcDependencyHealthCheck` observes it indirectly.
|
`GrpcDependencyHealthCheck` observes it indirectly.
|
||||||
- Per-project `IActiveNodeGate` contract location (whether the interface lives in the shared
|
|
||||||
library or in each project's own surface).
|
|
||||||
|
|
||||||
## Package structure
|
## Package structure
|
||||||
|
|
||||||
@@ -84,6 +82,6 @@ all three.
|
|||||||
## Component status
|
## Component status
|
||||||
|
|
||||||
**Status: Draft.** Spec and shared-contract written; current-state docs verified; GAPS backlog
|
**Status: Draft.** Spec and shared-contract written; current-state docs verified; GAPS backlog
|
||||||
populated. Library (`ZB.MOM.WW.Health` @ 0.1.0) built and tested in this repo at
|
populated. Library scaffolded at [`../../ZB.MOM.WW.Health/`](../../ZB.MOM.WW.Health/); source
|
||||||
[`../../ZB.MOM.WW.Health/`](../../ZB.MOM.WW.Health/). Adoption by the three apps is a follow-on
|
implementation is a follow-on (task #7 in the adoption backlog). Adoption by the three apps is
|
||||||
tracked in [`GAPS.md`](GAPS.md).
|
a further follow-on tracked in [`GAPS.md`](GAPS.md).
|
||||||
|
|||||||
@@ -131,15 +131,19 @@ probe in `ZB.MOM.WW.Health` would close.
|
|||||||
The delegate preserves the stricter query probe rather than falling back to `CanConnectAsync`.
|
The delegate preserves the stricter query probe rather than falling back to `CanConnectAsync`.
|
||||||
- Add `GrpcDependencyHealthCheck` targeting the MxAccessGateway channel (closes the downstream
|
- Add `GrpcDependencyHealthCheck` targeting the MxAccessGateway channel (closes the downstream
|
||||||
dependency gap noted in §4). Tag `["ready","active"]`.
|
dependency gap noted in §4). Tag `["ready","active"]`.
|
||||||
- Replace `AddOtOpcUaHealth` / `MapOtOpcUaHealth` with `services.AddZbHealthChecks()` +
|
- Replace `AddOtOpcUaHealth` / `MapOtOpcUaHealth` with
|
||||||
|
`services.AddHealthChecks().AddCheck<...>()` (one call per probe, per spec §5) +
|
||||||
`app.MapZbHealth()`. The `/healthz` bare-liveness tier is part of `MapZbHealth` by default —
|
`app.MapZbHealth()`. The `/healthz` bare-liveness tier is part of `MapZbHealth` by default —
|
||||||
no separate wiring needed.
|
no separate wiring needed.
|
||||||
|
|
||||||
**Keep bespoke:**
|
**Keep bespoke:**
|
||||||
|
|
||||||
- `IClusterRoleInfo` and its Akka implementation — this is an OtOpcUa abstraction used for more
|
- `IClusterRoleInfo` and its Akka implementation — on adoption this testability seam is given up
|
||||||
than health checks; it should remain in the OtOpcUa codebase. The shared `ActiveNodeHealthCheck`
|
for the health-check path. The shared `ActiveNodeHealthCheck` reads cluster role state from the
|
||||||
will accept `IClusterRoleInfo` (or an equivalent cluster-info abstraction) as an injection point.
|
ActorSystem directly (resolving it lazily via `IServiceProvider`); it does not accept
|
||||||
|
`IClusterRoleInfo` as an injection point. This is an accepted trade-off: the shared implementation
|
||||||
|
is simpler and consistent across projects, while `IClusterRoleInfo` remains available elsewhere
|
||||||
|
in the OtOpcUa codebase where it is used outside health checks.
|
||||||
- The `AllowAnonymous` policy — this is an OtOpcUa auth concern; `MapZbHealth` must document that
|
- The `AllowAnonymous` policy — this is an OtOpcUa auth concern; `MapZbHealth` must document that
|
||||||
callers are responsible for applying `AllowAnonymous` (or the shared helper applies it by default).
|
callers are responsible for applying `AllowAnonymous` (or the shared helper applies it by default).
|
||||||
- Which probes are registered and their tag assignments — the shared library supplies the check
|
- Which probes are registered and their tag assignments — the shared library supplies the check
|
||||||
|
|||||||
@@ -172,9 +172,10 @@ regardless of shared-library adoption.
|
|||||||
|
|
||||||
- `HealthMonitoring/` domain pipeline (`SiteHealthCollector`, `CentralHealthAggregator`, etc.) —
|
- `HealthMonitoring/` domain pipeline (`SiteHealthCollector`, `CentralHealthAggregator`, etc.) —
|
||||||
entirely per-project, no shared-library equivalent.
|
entirely per-project, no shared-library equivalent.
|
||||||
- `IActiveNodeGate` from the `InboundAPI` project is the contract the `InboundApiEndpointFilter`
|
- `IActiveNodeGate` moves from the `InboundAPI` project to `ZB.MOM.WW.Health` (core package) on
|
||||||
depends on; it can be implemented by the shared `ActiveNodeHealthCheck` backing service but the
|
adoption. `InboundApiEndpointFilter` references the shared interface; `AkkaActiveNodeGate`
|
||||||
interface definition stays in the InboundAPI project (or moves to a shared abstractions package).
|
(from `ZB.MOM.WW.Health.Akka`) becomes the singleton implementation registered in DI. The
|
||||||
|
interface definition is no longer owned by the `InboundAPI` project.
|
||||||
- The Central UI's `/monitoring/health` endpoint — powered by `CentralHealthAggregator`, not by
|
- The Central UI's `/monitoring/health` endpoint — powered by `CentralHealthAggregator`, not by
|
||||||
ASP.NET health checks.
|
ASP.NET health checks.
|
||||||
- The comment at `Program.cs:217–221` explains the readiness design decision (standby nodes are
|
- The comment at `Program.cs:217–221` explains the readiness design decision (standby nodes are
|
||||||
|
|||||||
@@ -125,10 +125,19 @@ namespace ZB.MOM.WW.Health.Akka;
|
|||||||
|
|
||||||
/// Checks the local node's Akka cluster membership status.
|
/// Checks the local node's Akka cluster membership status.
|
||||||
/// Register to tag ZbHealthTags.Ready.
|
/// Register to tag ZbHealthTags.Ready.
|
||||||
|
/// <remarks>
|
||||||
|
/// The ActorSystem is resolved lazily from the service provider. If the ActorSystem is not yet
|
||||||
|
/// available (e.g. during startup before Akka is initialised), the check returns Degraded rather
|
||||||
|
/// than throwing. This makes the check safe to register before Akka is fully up.
|
||||||
|
/// </remarks>
|
||||||
public sealed class AkkaClusterHealthCheck : IHealthCheck
|
public sealed class AkkaClusterHealthCheck : IHealthCheck
|
||||||
{
|
{
|
||||||
|
/// <param name="serviceProvider">
|
||||||
|
/// The application service provider. ActorSystem is resolved lazily so the check is
|
||||||
|
/// startup-safe: if no ActorSystem is registered yet the result is Degraded.
|
||||||
|
/// </param>
|
||||||
public AkkaClusterHealthCheck(
|
public AkkaClusterHealthCheck(
|
||||||
ActorSystem system,
|
IServiceProvider serviceProvider,
|
||||||
AkkaClusterStatusPolicy policy);
|
AkkaClusterStatusPolicy policy);
|
||||||
|
|
||||||
public Task<HealthCheckResult> CheckHealthAsync(
|
public Task<HealthCheckResult> CheckHealthAsync(
|
||||||
@@ -155,25 +164,46 @@ public sealed class AkkaClusterStatusPolicy
|
|||||||
/// Checks whether this node is the designated leader / active node.
|
/// Checks whether this node is the designated leader / active node.
|
||||||
/// Optional role parameter scopes the check to nodes carrying that role.
|
/// Optional role parameter scopes the check to nodes carrying that role.
|
||||||
/// Register to tag ZbHealthTags.Active.
|
/// Register to tag ZbHealthTags.Active.
|
||||||
|
/// <remarks>
|
||||||
|
/// The ActorSystem is resolved lazily from the service provider. If the ActorSystem is not yet
|
||||||
|
/// available (e.g. during startup before Akka is initialised), the check returns Degraded rather
|
||||||
|
/// than throwing. This makes the check startup-safe.
|
||||||
|
/// </remarks>
|
||||||
public sealed class ActiveNodeHealthCheck : IHealthCheck
|
public sealed class ActiveNodeHealthCheck : IHealthCheck
|
||||||
{
|
{
|
||||||
/// Role-less constructor: Healthy = node is Up AND cluster leader (ScadaBridge ActiveNode pattern).
|
/// Role-less constructor: Healthy = node is Up AND cluster leader (ScadaBridge ActiveNode pattern).
|
||||||
public ActiveNodeHealthCheck(ActorSystem system);
|
/// Returns Degraded when ActorSystem/cluster is not yet ready.
|
||||||
|
/// <param name="serviceProvider">
|
||||||
|
/// The application service provider. ActorSystem is resolved lazily so the check is
|
||||||
|
/// startup-safe: if no ActorSystem is registered yet the result is Degraded.
|
||||||
|
/// </param>
|
||||||
|
public ActiveNodeHealthCheck(IServiceProvider serviceProvider);
|
||||||
|
|
||||||
/// Role-filtered constructor: Healthy = (node lacks the role) OR (node carries role AND is role-singleton leader).
|
/// Role-filtered constructor: Healthy = (node lacks the role) OR (node carries role AND is role-singleton leader).
|
||||||
/// Degraded = node carries role but is not the role-singleton leader (OtOpcUa AdminRoleLeader pattern).
|
/// Degraded = node carries role but is not the role-singleton leader (OtOpcUa AdminRoleLeader pattern).
|
||||||
public ActiveNodeHealthCheck(ActorSystem system, string role);
|
/// Returns Degraded when ActorSystem/cluster is not yet ready.
|
||||||
|
/// <param name="serviceProvider">
|
||||||
|
/// The application service provider. ActorSystem is resolved lazily so the check is
|
||||||
|
/// startup-safe: if no ActorSystem is registered yet the result is Degraded.
|
||||||
|
/// </param>
|
||||||
|
public ActiveNodeHealthCheck(IServiceProvider serviceProvider, string role);
|
||||||
|
|
||||||
public Task<HealthCheckResult> CheckHealthAsync(
|
public Task<HealthCheckResult> CheckHealthAsync(
|
||||||
HealthCheckContext context,
|
HealthCheckContext context,
|
||||||
CancellationToken cancellationToken = default);
|
CancellationToken cancellationToken = default);
|
||||||
}
|
}
|
||||||
|
|
||||||
/// IActiveNodeGate implementation backed by ActiveNodeHealthCheck.
|
/// IActiveNodeGate implementation that computes IsActiveNode directly from the ActorSystem
|
||||||
/// Register as a singleton; resolves ActiveNodeHealthCheck from DI.
|
/// (SelfMember Up + cluster leader), null-guarded for startup safety.
|
||||||
|
/// Register as a singleton. Does NOT resolve ActiveNodeHealthCheck from DI.
|
||||||
public sealed class AkkaActiveNodeGate : IActiveNodeGate
|
public sealed class AkkaActiveNodeGate : IActiveNodeGate
|
||||||
{
|
{
|
||||||
public AkkaActiveNodeGate(ActiveNodeHealthCheck check);
|
/// <param name="serviceProvider">
|
||||||
|
/// The application service provider. ActorSystem is resolved lazily; if not yet available
|
||||||
|
/// IsActiveNode returns false (safe default during startup).
|
||||||
|
/// The gate checks SelfMember.Status == Up AND cluster.State.Leader == self.Address directly.
|
||||||
|
/// </param>
|
||||||
|
public AkkaActiveNodeGate(IServiceProvider serviceProvider);
|
||||||
|
|
||||||
public bool IsActiveNode { get; }
|
public bool IsActiveNode { get; }
|
||||||
}
|
}
|
||||||
@@ -204,9 +234,10 @@ public sealed class DatabaseHealthCheck<TContext> : IHealthCheck
|
|||||||
public sealed class DatabaseHealthCheckOptions<TContext>
|
public sealed class DatabaseHealthCheckOptions<TContext>
|
||||||
where TContext : DbContext
|
where TContext : DbContext
|
||||||
{
|
{
|
||||||
/// Override the default CanConnectAsync() probe.
|
/// Override the default CanConnectAsync() probe with a custom query-based probe.
|
||||||
/// Throw to signal failure; return normally to signal success.
|
/// Throw to signal failure; return normally to signal success.
|
||||||
public Func<TContext, CancellationToken, Task>? Probe { get; set; }
|
/// Example: <c>db => db.Deployments.AsNoTracking().Take(1).ToListAsync()</c>
|
||||||
|
public Func<TContext, CancellationToken, Task>? ProbeQuery { get; set; }
|
||||||
|
|
||||||
public TimeSpan Timeout { get; set; } = TimeSpan.FromSeconds(10);
|
public TimeSpan Timeout { get; set; } = TimeSpan.FromSeconds(10);
|
||||||
}
|
}
|
||||||
@@ -217,7 +248,7 @@ public sealed class DatabaseHealthCheckOptions<TContext>
|
|||||||
| Consumer | Packages | Notes |
|
| Consumer | Packages | Notes |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| **MxGateway** | `ZB.MOM.WW.Health` (core only) | `GrpcDependencyHealthCheck` on the worker channel; all three tiers via `MapZbHealth()`; `IActiveNodeGate` not needed (not Akka-based) |
|
| **MxGateway** | `ZB.MOM.WW.Health` (core only) | `GrpcDependencyHealthCheck` on the worker channel; all three tiers via `MapZbHealth()`; `IActiveNodeGate` not needed (not Akka-based) |
|
||||||
| **OtOpcUa** | All three | `AkkaClusterHealthCheck` + `OtOpcUaCompat` preset → `Default` on convergence; `ActiveNodeHealthCheck(role: "admin")`; `DatabaseHealthCheck<T>` with custom probe delegate |
|
| **OtOpcUa** | All three | `AkkaClusterHealthCheck` + `OtOpcUaCompat` preset → `Default` on convergence; `ActiveNodeHealthCheck(role: "admin")`; `DatabaseHealthCheck<T>` with `ProbeQuery` delegate |
|
||||||
| **ScadaBridge** | All three | `AkkaClusterHealthCheck` + `Default` policy; `ActiveNodeHealthCheck` (role-less); `DatabaseHealthCheck<T>` default probe; `AkkaActiveNodeGate` replaces inline `ActiveNodeGate` |
|
| **ScadaBridge** | All three | `AkkaClusterHealthCheck` + `Default` policy; `ActiveNodeHealthCheck` (role-less); `DatabaseHealthCheck<T>` default probe; `AkkaActiveNodeGate` replaces inline `ActiveNodeGate` |
|
||||||
|
|
||||||
## Open contract questions
|
## Open contract questions
|
||||||
@@ -226,13 +257,10 @@ public sealed class DatabaseHealthCheckOptions<TContext>
|
|||||||
If a future MxGateway cluster requires it, the interface is in the core package and can be
|
If a future MxGateway cluster requires it, the interface is in the core package and can be
|
||||||
implemented without an Akka dependency. Validate whether a stub `AlwaysActiveGate` (returns
|
implemented without an Akka dependency. Validate whether a stub `AlwaysActiveGate` (returns
|
||||||
`true`) should ship in core for single-node deployments.
|
`true`) should ship in core for single-node deployments.
|
||||||
2. **DI helpers:** decide whether `services.AddZbHealthChecks()` (a DI-registered convenience
|
2. **`AkkaActiveNodeGate` caching:** `IsActiveNode` is a synchronous property; the underlying
|
||||||
that pre-registers gRPC + DB + Akka probes via options) is worth adding, or whether explicit
|
cluster-state read is synchronous but the ActorSystem lookup is lazy. Validate whether the
|
||||||
`services.AddHealthChecks().AddCheck<...>()` calls per project are clearer. The spec currently
|
gate should cache the computed value on a short TTL (e.g. 5 s) to reduce Akka.Cluster API
|
||||||
leaves probe registration entirely per-project.
|
overhead on high-frequency API routing checks, or whether reading `SelfMember`/`State.Leader`
|
||||||
3. **`AkkaActiveNodeGate` caching:** `IsActiveNode` is a synchronous property; the underlying
|
directly on every call is acceptable.
|
||||||
`ActiveNodeHealthCheck.CheckHealthAsync` is async. Validate whether the gate should cache the
|
|
||||||
last probe result on a short TTL (e.g. 5 s) or drive a background refresh, to avoid blocking
|
|
||||||
synchronous callers.
|
|
||||||
|
|
||||||
See [`../GAPS.md`](../GAPS.md) for the adoption order and effort/risk.
|
See [`../GAPS.md`](../GAPS.md) for the adoption order and effort/risk.
|
||||||
|
|||||||
@@ -61,12 +61,22 @@ mapping is **configurable** through `AkkaClusterStatusPolicy`.
|
|||||||
| Preset | Origin | `Up` / `Joining` | `Leaving` / `Exiting` | Other (`WeaklyUp`, `Down`, `Removed`, `Unknown`) |
|
| Preset | Origin | `Up` / `Joining` | `Leaving` / `Exiting` | Other (`WeaklyUp`, `Down`, `Removed`, `Unknown`) |
|
||||||
|---|---|---|---|---|
|
|---|---|---|---|---|
|
||||||
| `AkkaClusterStatusPolicy.Default` | ScadaBridge `AkkaClusterHealthCheck.cs` | Healthy | Degraded | Unhealthy |
|
| `AkkaClusterStatusPolicy.Default` | ScadaBridge `AkkaClusterHealthCheck.cs` | Healthy | Degraded | Unhealthy |
|
||||||
| `AkkaClusterStatusPolicy.OtOpcUaCompat` | OtOpcUa `AkkaClusterHealthCheck.cs` | Healthy (if self is `Up` among reachable members) | — | Degraded |
|
| `AkkaClusterStatusPolicy.OtOpcUaCompat` | OtOpcUa `AkkaClusterHealthCheck.cs` | Healthy (if self is `Up` among reachable members) | Degraded[^1] | Degraded |
|
||||||
|
|
||||||
|
[^1]: In the `OtOpcUaCompat` member-scan approach, `Leaving`/`Exiting` statuses also map to
|
||||||
|
Degraded because a member with those statuses will not appear with `Status == Up` in the reachable
|
||||||
|
member set — the scan finds self without `Up`, so the result is Degraded.
|
||||||
|
|
||||||
The `Default` preset is the convergence target. `OtOpcUaCompat` is provided for backward
|
The `Default` preset is the convergence target. `OtOpcUaCompat` is provided for backward
|
||||||
compatibility during OtOpcUa's migration; it maps any non-`Up`-among-members state to Degraded
|
compatibility during OtOpcUa's migration; it maps any non-`Up`-among-members state to Degraded
|
||||||
rather than Unhealthy. Registered to the `ready` tag.
|
rather than Unhealthy. Registered to the `ready` tag.
|
||||||
|
|
||||||
|
> **Note on error/exception cases:** in both modes, if the ActorSystem is not yet ready or cluster
|
||||||
|
> state is inaccessible (e.g. during startup), the check returns Degraded (startup-safety rule).
|
||||||
|
> The status cells in the table above describe the normal-operation path only; the "—" cells in the
|
||||||
|
> `OtOpcUaCompat` column refer to states that collapse into Degraded via the member-scan result,
|
||||||
|
> not to an explicit policy match.
|
||||||
|
|
||||||
### 2.3 Active / leader probe — `ActiveNodeHealthCheck`
|
### 2.3 Active / leader probe — `ActiveNodeHealthCheck`
|
||||||
|
|
||||||
Checks whether this node is the designated leader (active node). Accepts an optional Akka
|
Checks whether this node is the designated leader (active node). Accepts an optional Akka
|
||||||
|
|||||||
Reference in New Issue
Block a user