docs(health): resolve spec/contract/gaps consistency (review fixes)
Applies canonical resolutions for eight settled decisions:

- GAPS: remove three stale "Decisions still open" bullets (#1 IActiveNodeGate placement, #2 GrpcChannel type, #3 OtOpcUaCompat named constant)
- Shared contract: AkkaClusterHealthCheck, ActiveNodeHealthCheck constructors take IServiceProvider (lazy ActorSystem, Degraded-when-not-ready)
- Shared contract: AkkaActiveNodeGate takes IServiceProvider; reads SelfMember+leader directly, null-guarded; does not proxy ActiveNodeHealthCheck
- Shared contract: DatabaseHealthCheckOptions.Probe renamed to ProbeQuery; consumer matrix updated
- Shared contract: settled AddZbHealthChecks open question removed (spec §5 is per-project AddHealthChecks)
- SPEC §2.2: OtOpcUaCompat Leaving/Exiting cell updated from "—" to Degraded + footnote; §2.3 startup-safety note added
- README: status line corrected from "built and tested" to "scaffolded … implementation is follow-on (task #7)"; IActiveNodeGate "left per-project" bullet removed
- OtOpcUa current-state: AddZbHealthChecks → AddHealthChecks().AddCheck<...>(); IClusterRoleInfo note reframed as accepted trade-off
- ScadaBridge current-state: IActiveNodeGate bullet rewritten: interface moves to ZB.MOM.WW.Health on adoption, InboundApiEndpointFilter references shared interface
@@ -125,10 +125,19 @@ namespace ZB.MOM.WW.Health.Akka;
 
 /// Checks the local node's Akka cluster membership status.
 /// Register to tag ZbHealthTags.Ready.
+/// <remarks>
+/// The ActorSystem is resolved lazily from the service provider. If the ActorSystem is not yet
+/// available (e.g. during startup before Akka is initialised), the check returns Degraded rather
+/// than throwing. This makes the check safe to register before Akka is fully up.
+/// </remarks>
 public sealed class AkkaClusterHealthCheck : IHealthCheck
 {
+    /// <param name="serviceProvider">
+    /// The application service provider. ActorSystem is resolved lazily so the check is
+    /// startup-safe: if no ActorSystem is registered yet the result is Degraded.
+    /// </param>
     public AkkaClusterHealthCheck(
-        ActorSystem system,
+        IServiceProvider serviceProvider,
         AkkaClusterStatusPolicy policy);
 
     public Task<HealthCheckResult> CheckHealthAsync(
@@ -155,25 +164,46 @@ public sealed class AkkaClusterStatusPolicy
 /// Checks whether this node is the designated leader / active node.
 /// Optional role parameter scopes the check to nodes carrying that role.
 /// Register to tag ZbHealthTags.Active.
+/// <remarks>
+/// The ActorSystem is resolved lazily from the service provider. If the ActorSystem is not yet
+/// available (e.g. during startup before Akka is initialised), the check returns Degraded rather
+/// than throwing. This makes the check startup-safe.
+/// </remarks>
 public sealed class ActiveNodeHealthCheck : IHealthCheck
 {
     /// Role-less constructor: Healthy = node is Up AND cluster leader (ScadaBridge ActiveNode pattern).
-    public ActiveNodeHealthCheck(ActorSystem system);
+    /// Returns Degraded when ActorSystem/cluster is not yet ready.
+    /// <param name="serviceProvider">
+    /// The application service provider. ActorSystem is resolved lazily so the check is
+    /// startup-safe: if no ActorSystem is registered yet the result is Degraded.
+    /// </param>
+    public ActiveNodeHealthCheck(IServiceProvider serviceProvider);
 
     /// Role-filtered constructor: Healthy = (node lacks the role) OR (node carries role AND is role-singleton leader).
     /// Degraded = node carries role but is not the role-singleton leader (OtOpcUa AdminRoleLeader pattern).
-    public ActiveNodeHealthCheck(ActorSystem system, string role);
+    /// Returns Degraded when ActorSystem/cluster is not yet ready.
+    /// <param name="serviceProvider">
+    /// The application service provider. ActorSystem is resolved lazily so the check is
+    /// startup-safe: if no ActorSystem is registered yet the result is Degraded.
+    /// </param>
+    public ActiveNodeHealthCheck(IServiceProvider serviceProvider, string role);
 
     public Task<HealthCheckResult> CheckHealthAsync(
         HealthCheckContext context,
         CancellationToken cancellationToken = default);
 }
 
-/// IActiveNodeGate implementation backed by ActiveNodeHealthCheck.
-/// Register as a singleton; resolves ActiveNodeHealthCheck from DI.
+/// IActiveNodeGate implementation that computes IsActiveNode directly from the ActorSystem
+/// (SelfMember Up + cluster leader), null-guarded for startup safety.
+/// Register as a singleton. Does NOT resolve ActiveNodeHealthCheck from DI.
 public sealed class AkkaActiveNodeGate : IActiveNodeGate
 {
-    public AkkaActiveNodeGate(ActiveNodeHealthCheck check);
+    /// <param name="serviceProvider">
+    /// The application service provider. ActorSystem is resolved lazily; if not yet available
+    /// IsActiveNode returns false (safe default during startup).
+    /// The gate checks SelfMember.Status == Up AND cluster.State.Leader == self.Address directly.
+    /// </param>
+    public AkkaActiveNodeGate(IServiceProvider serviceProvider);
 
     public bool IsActiveNode { get; }
 }
@@ -204,9 +234,10 @@ public sealed class DatabaseHealthCheck<TContext> : IHealthCheck
 public sealed class DatabaseHealthCheckOptions<TContext>
     where TContext : DbContext
 {
-    /// Override the default CanConnectAsync() probe.
+    /// Override the default CanConnectAsync() probe with a custom query-based probe.
     /// Throw to signal failure; return normally to signal success.
-    public Func<TContext, CancellationToken, Task>? Probe { get; set; }
+    /// Example: <c>db => db.Deployments.AsNoTracking().Take(1).ToListAsync()</c>
+    public Func<TContext, CancellationToken, Task>? ProbeQuery { get; set; }
 
     public TimeSpan Timeout { get; set; } = TimeSpan.FromSeconds(10);
 }
@@ -217,7 +248,7 @@ public sealed class DatabaseHealthCheckOptions<TContext>
 | Consumer | Packages | Notes |
 |---|---|---|
 | **MxGateway** | `ZB.MOM.WW.Health` (core only) | `GrpcDependencyHealthCheck` on the worker channel; all three tiers via `MapZbHealth()`; `IActiveNodeGate` not needed (not Akka-based) |
-| **OtOpcUa** | All three | `AkkaClusterHealthCheck` + `OtOpcUaCompat` preset → `Default` on convergence; `ActiveNodeHealthCheck(role: "admin")`; `DatabaseHealthCheck<T>` with custom probe delegate |
+| **OtOpcUa** | All three | `AkkaClusterHealthCheck` + `OtOpcUaCompat` preset → `Default` on convergence; `ActiveNodeHealthCheck(role: "admin")`; `DatabaseHealthCheck<T>` with `ProbeQuery` delegate |
 | **ScadaBridge** | All three | `AkkaClusterHealthCheck` + `Default` policy; `ActiveNodeHealthCheck` (role-less); `DatabaseHealthCheck<T>` default probe; `AkkaActiveNodeGate` replaces inline `ActiveNodeGate` |
 
 ## Open contract questions
@@ -226,13 +257,10 @@ public sealed class DatabaseHealthCheckOptions<TContext>
    If a future MxGateway cluster requires it, the interface is in the core package and can be
    implemented without an Akka dependency. Validate whether a stub `AlwaysActiveGate` (returns
    `true`) should ship in core for single-node deployments.
-2. **DI helpers:** decide whether `services.AddZbHealthChecks()` (a DI-registered convenience
-   that pre-registers gRPC + DB + Akka probes via options) is worth adding, or whether explicit
-   `services.AddHealthChecks().AddCheck<...>()` calls per project are clearer. The spec currently
-   leaves probe registration entirely per-project.
-3. **`AkkaActiveNodeGate` caching:** `IsActiveNode` is a synchronous property; the underlying
-   `ActiveNodeHealthCheck.CheckHealthAsync` is async. Validate whether the gate should cache the
-   last probe result on a short TTL (e.g. 5 s) or drive a background refresh, to avoid blocking
-   synchronous callers.
+2. **`AkkaActiveNodeGate` caching:** `IsActiveNode` is a synchronous property; the underlying
+   cluster-state read is synchronous but the ActorSystem lookup is lazy. Validate whether the
+   gate should cache the computed value on a short TTL (e.g. 5 s) to reduce Akka.Cluster API
+   overhead on high-frequency API routing checks, or whether reading `SelfMember`/`State.Leader`
+   directly on every call is acceptable.
 
 See [`../GAPS.md`](../GAPS.md) for the adoption order and effort/risk.
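The startup-safe behaviour this commit specifies for `AkkaActiveNodeGate` (resolve lazily from `IServiceProvider`, return `false` while nothing is registered, otherwise require "self is Up AND self is leader") can be sketched without any Akka dependency. This is only an illustration of the pattern, not the contract's implementation: `ClusterSnapshot`, `LateBoundServices`, and `SketchActiveNodeGate` are hypothetical stand-ins, and the real gate would read `SelfMember` and `State.Leader` from the resolved `ActorSystem` instead.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Usage: the gate is constructed before any cluster handle is registered.
var services = new LateBoundServices();
var gate = new SketchActiveNodeGate(services);
Console.WriteLine(gate.IsActiveNode);   // False: nothing registered yet

services.Register(new ClusterSnapshot(SelfAddress: "node-a", Leader: "node-a", SelfIsUp: true));
Console.WriteLine(gate.IsActiveNode);   // True: Up and leader

services.Register(new ClusterSnapshot(SelfAddress: "node-a", Leader: "node-b", SelfIsUp: true));
Console.WriteLine(gate.IsActiveNode);   // False: another node leads

// Hypothetical stand-in for the cluster state the real gate would read from Akka.
public sealed record ClusterSnapshot(string SelfAddress, string? Leader, bool SelfIsUp);

// Minimal IServiceProvider that allows late registration (illustration only).
public sealed class LateBoundServices : IServiceProvider
{
    private readonly Dictionary<Type, object> _instances = new();

    public void Register<T>(T instance) where T : notnull => _instances[typeof(T)] = instance;

    public object? GetService(Type serviceType) =>
        _instances.TryGetValue(serviceType, out var instance) ? instance : null;
}

// Hypothetical gate: holds the provider, resolves the dependency on every read.
public sealed class SketchActiveNodeGate
{
    private readonly IServiceProvider _serviceProvider;

    public SketchActiveNodeGate(IServiceProvider serviceProvider) =>
        _serviceProvider = serviceProvider;

    public bool IsActiveNode
    {
        get
        {
            // Lazy lookup: a null result means "not ready", which maps to the safe default.
            var cluster = _serviceProvider.GetService(typeof(ClusterSnapshot)) as ClusterSnapshot;
            if (cluster is null) return false;

            // No leader elected yet also counts as "not active".
            return cluster.SelfIsUp
                && cluster.Leader is not null
                && cluster.Leader == cluster.SelfAddress;
        }
    }
}
```

Because the lookup happens inside the property getter rather than the constructor, registration order stops mattering. The cost is one provider lookup per read, which is the overhead the caching question above is asking about.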
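The `ProbeQuery` contract in the diff ("throw to signal failure; return normally to signal success", bounded by `Timeout`) could be driven roughly as follows. This is a sketch under assumptions: `RunProbeAsync` is a hypothetical helper, the string results stand in for `HealthCheckResult` values, and mapping failures and timeouts to "Unhealthy" is a guess, since the contract excerpt does not say which status a failed probe produces. It needs .NET 6 or later for `Task.WaitAsync`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical runner: applies the timeout, then maps the probe outcome to a status string.
static async Task<string> RunProbeAsync<TContext>(
    TContext context,
    Func<TContext, CancellationToken, Task> probeQuery,
    TimeSpan timeout,
    CancellationToken ct)
{
    // Linked source: cancelled either by the caller or when the timeout elapses.
    using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
    cts.CancelAfter(timeout);
    try
    {
        // WaitAsync also covers probes that ignore the token they are given.
        await probeQuery(context, cts.Token).WaitAsync(cts.Token);
        return "Healthy";                      // returned normally: success
    }
    catch (OperationCanceledException) when (!ct.IsCancellationRequested)
    {
        return "Unhealthy (timeout)";          // our timeout fired, not the caller's token
    }
    catch (OperationCanceledException)
    {
        throw;                                 // caller cancelled: propagate
    }
    catch (Exception ex)
    {
        return $"Unhealthy ({ex.Message})";    // probe threw: failure
    }
}

// A probe that returns normally.
Console.WriteLine(await RunProbeAsync(
    "ctx", (_, _) => Task.CompletedTask, TimeSpan.FromSeconds(1), CancellationToken.None));
// prints "Healthy"

// A probe that never completes within the timeout.
Console.WriteLine(await RunProbeAsync(
    "ctx", (_, token) => Task.Delay(Timeout.InfiniteTimeSpan, token),
    TimeSpan.FromMilliseconds(50), CancellationToken.None));
// prints "Unhealthy (timeout)"

// A probe that throws.
Console.WriteLine(await RunProbeAsync(
    "ctx", (_, _) => throw new InvalidOperationException("boom"),
    TimeSpan.FromSeconds(1), CancellationToken.None));
// prints "Unhealthy (boom)"
```

The exception filter separates the check's own timeout from caller-initiated cancellation, so a host shutting down is not reported as a database failure.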