All 35 findings fixed in 544a6dd and marked Status: Resolved with resolution
notes. README regenerated: 0 pending / 35 total across 6 libraries.
Code Review — Health
| Field | Value |
|---|---|
| Library | ZB.MOM.WW.Health/ |
| Packages | ZB.MOM.WW.Health, ZB.MOM.WW.Health.Akka, ZB.MOM.WW.Health.EntityFrameworkCore |
| Component spec | components/health/spec/SPEC.md |
| Shared contract | components/health/shared-contract/ZB.MOM.WW.Health.md |
| Status | Reviewed |
| Last reviewed | 2026-06-01 |
| Reviewer | Claude (automated baseline) |
| Commit reviewed | 5f75cd4 |
| Open findings | 0 |
Summary
The library is small, cohesive, and well-documented, and it tracks the normalized spec closely:
the three-tier endpoint convention, the canonical JSON writer, the IActiveNodeGate seam, the
configurable AkkaClusterStatusPolicy presets, and the role-filtered ActiveNodeHealthCheck all
match the contract. The package split is clean — Akka and EF Core stay out of the core package, so
MxGateway can take core only (category 8: no leakage). The decision logic is factored into pure,
exhaustively table-tested functions (AkkaClusterStatusPolicy, ActiveNodeDecision), and the test
suite (58 tests) exercises the public surface through a real TestServer / real SQLite / real probe
delegates rather than mocks.
The findings cluster around error-handling completeness in the health-check probes (category 4),
the highest-risk area for this library: a probe that throws past the IHealthCheck boundary, or that
returns a status the spec says it should not, degrades the value of every consuming app's /health/*
endpoints. The two Akka checks and the gRPC check do not guard the cluster-state / probe call the way
the spec requires (return Degraded on inaccessible cluster) or the way the sibling DatabaseHealthCheck
does (catch-all → Unhealthy). The framework's HealthCheckService catches escaping exceptions, so none
of these crash an endpoint — that caps their severity at Medium — but they produce a result that
disagrees with the documented contract. The remainder are Low: a JSON shape nuance (description key
omitted vs. emitted-null), recommended-tag drift in XML docs, a return-value footgun in MapZbHealth,
and missing GenerateDocumentationFile so the (otherwise excellent) XML docs do not ship in the nupkgs.
Checklist coverage
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | ☑ | Tier predicates, policy presets, and ActiveNodeDecision matrices are correct and match the spec tables. No logic defects found. |
| 2 | Public API surface & compatibility | ☑ | Health-005. Surface is minimal, sealed, well-named; nullable annotations correct; no internal types leak (ActiveNodeDecision is internal). |
| 3 | Concurrency & thread safety | ☑ | Options are mutable POCOs but per-registration; checks hold no shared mutable state; cluster-state reads are snapshot reads; static writer options are immutable. No issues found. |
| 4 | Error handling & resilience | ☑ | Health-001, Health-002. Akka checks don't guard cluster-state access; gRPC check lets non-Rpc/non-OCE exceptions escape. |
| 5 | Security & secret handling | ☑ | No secrets/PII. Exception objects attached to HealthCheckResult (standard); endpoints AllowAnonymous by design. No issues found. |
| 6 | Performance & resource management | ☑ | DatabaseHealthCheck is pool-safe and eagerly releases the timeout timer; scopes/contexts disposed via await using; channel not owned (correctly not disposed). No issues found. |
| 7 | Spec & shared-contract adherence | ☑ | Health-001, Health-003, Health-004. Endpoint convention + status strings otherwise match §3. |
| 8 | Packaging, dependencies & project layout | ☑ | Health-006. Dependency split correct; central versions; net10.0; correct package ids. |
| 9 | Testing coverage | ☑ | Strong: TestServer tier tests, real SQLite EF tests, table-driven policy/decision tests, gRPC probe + timeout + cancellation. Gaps noted inline in findings, not raised separately. |
| 10 | Documentation & XML docs | ☑ | Health-004, Health-006. Public API XML docs are thorough and accurate; the issues are recommended-tag drift and docs not being packed. |
Findings
Health-001 — Akka health checks throw (instead of returning Degraded) when cluster state is inaccessible
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | ZB.MOM.WW.Health/src/ZB.MOM.WW.Health.Akka/AkkaClusterHealthCheck.cs:45, ZB.MOM.WW.Health/src/ZB.MOM.WW.Health.Akka/ActiveNodeHealthCheck.cs:106 |
Description
Both Akka checks guard only the null ActorSystem case: _serviceProvider.GetService<ActorSystem>()
returns null → Degraded. But once a non-null ActorSystem is resolved they call Cluster.Get(system)
and read .SelfMember / .State.Leader with no try/catch. The spec is explicit that this path must be
Degraded, not an exception:
"in both modes, if the ActorSystem is not yet ready or cluster state is inaccessible (e.g. during startup), the check returns Degraded (startup-safety rule)." (SPEC.md §2.2 note)
Cluster.Get(system) throws (ConfigurationException) when the Akka.Cluster extension is not
configured on the resolved ActorSystem, and SelfMember can throw during the window where the
ActorSystem object exists but the cluster has not finished initialising — exactly the startup race the
spec calls out. A present-but-not-yet-clustered ActorSystem is a realistic startup ordering in a
Microsoft.Extensions.Hosting app, because the ActorSystem is registered in DI before the Cluster has joined.
The escaping exception is caught by ASP.NET Core's HealthCheckService, which records the entry as
Unhealthy — so the endpoint does not crash. But Unhealthy on the ready tier returns 503 and
removes the node from load-balancing, whereas the spec's intended Degraded returns 200 (still in
rotation). A transient startup window therefore yanks a healthy-but-starting node out of rotation,
the opposite of the "startup-safe" guarantee the XML docs advertise.
Recommendation
Wrap the cluster-state read in a try/catch and return HealthCheckResult.Degraded(...) on failure, so
both the null-ActorSystem and inaccessible-cluster paths converge on Degraded as the spec requires:
resolve Cluster.Get(system) and read membership inside a try, catching the cluster extension's
startup exceptions and mapping them to Degraded with a "cluster not yet ready" description.
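The guard described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual class: `GuardedClusterHealthCheck` is a hypothetical name, and the Up-versus-other status mapping is a placeholder for whatever AkkaClusterStatusPolicy preset the real check applies.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical check used only to illustrate the startup-safety guard.
public sealed class GuardedClusterHealthCheck : IHealthCheck
{
    private readonly IServiceProvider _serviceProvider;

    public GuardedClusterHealthCheck(IServiceProvider serviceProvider)
        => _serviceProvider = serviceProvider;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Path 1: the ActorSystem is not registered or not yet resolvable.
        var system = _serviceProvider.GetService<ActorSystem>();
        if (system is null)
        {
            return Task.FromResult(
                HealthCheckResult.Degraded("ActorSystem not yet available."));
        }

        try
        {
            // Path 2: both the extension lookup and the membership read can
            // throw while the cluster is still initialising, so both sit
            // inside the try block.
            var cluster = Cluster.Get(system);
            var status = cluster.SelfMember.Status;

            // Placeholder mapping (assumption): the real check would consult
            // its configured status policy here.
            return Task.FromResult(status == MemberStatus.Up
                ? HealthCheckResult.Healthy("Member is Up.")
                : HealthCheckResult.Degraded($"Member status is {status}."));
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Converge on Degraded so a starting node stays in rotation.
            return Task.FromResult(
                HealthCheckResult.Degraded("Akka cluster state not yet accessible.", ex));
        }
    }
}
```

Keeping `Cluster.Get(system)` inside the `try` matters as much as guarding the membership read: if only `SelfMember` were guarded, an unconfigured cluster extension would still escape the check.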
Resolution
Resolved in 544a6dd, 2026-06-01 — wrapped the Cluster.Get(system)/SelfMember/leader reads in AkkaClusterHealthCheck and ActiveNodeHealthCheck in try/catch (Exception when not OCE) returning Degraded("Akka cluster state not yet accessible."); tests add a plain non-clustered ActorSystem to assert Degraded instead of the escaping ConfigurationException.
Health-002 — GrpcDependencyHealthCheck lets non-RpcException/non-OperationCanceledException errors escape
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/GrpcDependencyHealthCheck.cs:54 |
Description
CheckHealthAsync catches only RpcException and OperationCanceledException. Any other exception
thrown by the probe escapes the method. This is reachable on both paths:
- Default probe: GrpcChannel.ConnectAsync can throw exceptions other than RpcException, e.g. InvalidOperationException (channel/socket misuse, shutdown) or HttpRequestException / SocketException surfaced from the transport before a gRPC status is produced.
- Custom probe (GrpcDependencyOptions.Probe): a caller-supplied delegate (e.g. a grpc.health.v1.Health/Check RPC built on a non-gRPC client) can throw anything.
The XML docs and shared contract describe the result as Unhealthy "when it returns false, throws an
RpcException, or times out" — they do not promise to handle arbitrary exceptions, and the code does
not. By contrast the sibling DatabaseHealthCheck has a catch-all catch (Exception ex) →
Unhealthy("Database connection failed.", ex) (DatabaseHealthCheck.cs:78), so the two probes are
asymmetric. As with Health-001 the framework's HealthCheckService records the escaping exception as
Unhealthy, so no endpoint crashes — but the dependency-named, controlled description
("{name} probe failed…") is lost, and the library violates its own "must not throw past the
IHealthCheck boundary" intent.
Recommendation
Add a trailing catch (Exception ex) returning HealthCheckResult.Unhealthy($"{name} probe failed: {ex.Message}", ex), after the existing OCE/Rpc handlers (so external cancellation still propagates).
This brings the gRPC probe in line with DatabaseHealthCheck and the documented contract.
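The handler ordering matters here, so a rough sketch may help. `GrpcProbeGuard`, its `probe` delegate shape, and the description strings other than the "probe failed" one are assumptions for illustration; the real check's timeout handling is not shown in this review and may differ.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Core;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper illustrating catch ordering; not the library's code.
public static class GrpcProbeGuard
{
    public static async Task<HealthCheckResult> RunAsync(
        string name,
        Func<CancellationToken, Task<bool>> probe,
        CancellationToken cancellationToken)
    {
        try
        {
            var ok = await probe(cancellationToken).ConfigureAwait(false);
            return ok
                ? HealthCheckResult.Healthy($"{name} reachable.")
                : HealthCheckResult.Unhealthy($"{name} probe returned false.");
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // External cancellation: let it propagate to the caller.
            throw;
        }
        catch (OperationCanceledException ex)
        {
            // Assumption: any other cancellation is treated as a timeout.
            return HealthCheckResult.Unhealthy($"{name} probe timed out.", ex);
        }
        catch (RpcException ex)
        {
            return HealthCheckResult.Unhealthy($"{name} probe failed: {ex.StatusCode}", ex);
        }
        catch (Exception ex)
        {
            // Trailing catch-all: must come last so the specific handlers
            // above still run for cancellation and gRPC status errors.
            return HealthCheckResult.Unhealthy($"{name} probe failed: {ex.Message}", ex);
        }
    }
}
```

Because C# evaluates catch clauses top to bottom, placing the catch-all last keeps the cancellation rethrow intact while still giving arbitrary exceptions a dependency-named description.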
Resolution
Resolved in 544a6dd, 2026-06-01 — added a trailing catch (Exception ex) to GrpcDependencyHealthCheck.CheckHealthAsync returning Unhealthy("{name} probe failed: {ex.Message}", ex) after the OCE/Rpc external-cancellation handlers; new test asserts a probe throwing InvalidOperationException maps to Unhealthy.
Health-003 — Null description is omitted from the JSON body instead of emitted as null
| Field | Value |
|---|---|
| Severity | Low |
| Category | Spec & shared-contract adherence |
| Status | Resolved |
| Location | ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/ZbHealthWriter.cs:32 |
Description
SerializerOptions sets DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull, so a check that
produces no description renders an entry with no description key at all:
{ "status": "Healthy", "durationMs": 0.1 }. The spec models description as a string? that "may be
null" and both the canonical example (SPEC.md §3) and the HealthChecks.UI.Client shape the writer
claims to mirror emit the key with a null value rather than dropping it. Consumers/dashboards parsing
the body and reading entries.<name>.description must therefore handle a missing property, not just a
null one. The deviation is undocumented and untested — ResponseWriterTests only asserts the
present-description case.
Low because the aggregate status/HTTP-code contract that orchestrators key on is unaffected; only the
descriptive sub-field shape drifts.
Recommendation
Either remove WhenWritingNull (so description: null is emitted, matching the spec example and the
UI-client shape), or document explicitly in the writer's XML docs and SPEC.md §3 that absent
descriptions are omitted, and add a test pinning the chosen behaviour.
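The difference between the two behaviours can be reproduced with System.Text.Json alone. In this sketch `EntryDto` and `NullDescriptionDemo` are hypothetical stand-ins for the writer's entry shape and options, which may differ in the real code.

```csharp
#nullable enable
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical entry shape for illustration only.
public sealed record EntryDto(string Status, string? Description, double DurationMs);

public static class NullDescriptionDemo
{
    // Serializes one entry with or without the WhenWritingNull ignore rule.
    // Options are built per call for clarity; real code would cache them.
    public static string Serialize(EntryDto entry, bool omitNulls)
    {
        var options = new JsonSerializerOptions
        {
            PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
            DefaultIgnoreCondition = omitNulls
                ? JsonIgnoreCondition.WhenWritingNull
                : JsonIgnoreCondition.Never,
        };
        return JsonSerializer.Serialize(entry, options);
    }
}
```

With `omitNulls: true` the serialized entry has no description key at all; with `omitNulls: false` the key is present with a null value, which is the shape a consumer reading entries.<name>.description can rely on.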
Resolution
Resolved in 544a6dd, 2026-06-01 — removed DefaultIgnoreCondition = WhenWritingNull from ZbHealthWriter.SerializerOptions so a null description renders as "description": null (matching the spec example / UI-client shape); writer XML doc now states the key is always present, and a new ResponseWriterTests case pins present-and-null.
Health-004 — XML docs recommend the active tag for ready-tier probes, contradicting the spec
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & XML docs |
| Status | Resolved |
| Location | ZB.MOM.WW.Health/src/ZB.MOM.WW.Health.Akka/AkkaClusterHealthCheck.cs:11, ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/GrpcDependencyHealthCheck.cs:21 |
Description
AkkaClusterHealthCheck's summary says "Register to the ZbHealthTags.Ready tag (recommended
[ready, active])", and GrpcDependencyHealthCheck likewise recommends tagging [ready, active].
The spec is unambiguous that both probes belong to the ready tier only: §2.2 "Registered to the
ready tag" (Akka cluster) and §2.4 "Registered to the ready tag" (gRPC dependency). The active
tier is reserved by §1 / §2.3 for the leader/active-singleton probe (ActiveNodeHealthCheck); putting
cluster-membership or gRPC-reachability checks on the active tier pollutes the active-node tier with
non-leadership concerns and contradicts the convergence convention the library exists to enforce.
Recommendation
Bring the XML-doc recommendations in line with the spec — recommend ZbHealthTags.Ready only for both
probes (drop the active suggestion); or, if the dual-tag recommendation is intentional, reconcile it
back into SPEC.md §2.2/§2.4 so code and spec agree.
Resolution
Resolved in 544a6dd, 2026-06-01 — updated the XML summaries on AkkaClusterHealthCheck and GrpcDependencyHealthCheck to recommend ZbHealthTags.Ready only (dropped the [ready, active] suggestion), aligning the docs with SPEC §2.2/§2.4.
Health-005 — MapZbHealth returns only the readiness builder, silently dropping conventions for the active/live tiers
| Field | Value |
|---|---|
| Severity | Low |
| Category | Public API surface & compatibility |
| Status | Resolved |
| Location | ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/ZbHealthEndpointExtensions.cs:64 |
Description
MapZbHealth maps three endpoints but returns the IEndpointConventionBuilder for /health/ready
only. Any convention a caller chains onto the result — .RequireHost(...), .RequireAuthorization(),
.WithMetadata(...), .CacheOutput() — applies to the readiness endpoint alone; the active and
liveness endpoints silently do not receive it. The behaviour is documented in the XML <returns>, so
this is a deliberate, disclosed trade-off rather than a bug, but it is a real footgun: the most natural
reading of endpoints.MapZbHealth().RequireHost("…") is "gate all three health endpoints", and the
result type gives the caller no signal that two of the three are excluded.
Recommendation
Consider returning a composite builder that fans conventions out to all three endpoints (collect the
three IEndpointConventionBuilders and wrap them) so chained conventions behave as a caller expects.
If the single-builder return is kept for simplicity, strengthen the <returns> remark to a warning and
note it in the README's MapZbHealth example.
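One possible shape for such a composite is sketched below. `FanOutConventionBuilder` is a hypothetical name, and the sketch assumes a target framework where IEndpointConventionBuilder exposes `Finally` (.NET 7 or later).

```csharp
using System;
using System.Collections.Generic;
using Microsoft.AspNetCore.Builder;

// Hypothetical composite that forwards every convention to each wrapped
// builder, so chained calls apply to all endpoints rather than one.
public sealed class FanOutConventionBuilder : IEndpointConventionBuilder
{
    private readonly IReadOnlyList<IEndpointConventionBuilder> _builders;

    public FanOutConventionBuilder(params IEndpointConventionBuilder[] builders)
        => _builders = builders;

    public void Add(Action<EndpointBuilder> convention)
    {
        foreach (var builder in _builders)
        {
            builder.Add(convention);
        }
    }

    public void Finally(Action<EndpointBuilder> finallyConvention)
    {
        foreach (var builder in _builders)
        {
            builder.Finally(finallyConvention);
        }
    }
}
```

Extension methods such as `.RequireHost(...)` are implemented on top of `Add`, so returning a wrapper like this from the mapping method would make a single chained call reach the readiness, active, and liveness endpoints alike.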
Resolution
Resolved in 544a6dd, 2026-06-01 — MapZbHealth now returns a private CompositeEndpointConventionBuilder that fans Add/Finally to all three endpoint builders (readiness, active, liveness); <returns> docs updated, and a new TierMappingTests case proves .RequireHost(...) gates all three tiers.
Health-006 — XML documentation is not emitted into the packed nupkgs
| Severity | Low |
| Category | Packaging, dependencies & project layout |
| Status | Resolved |
| Location | ZB.MOM.WW.Health/Directory.Build.props:3, ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/ZB.MOM.WW.Health.csproj:3 |
Description
Every public type and member in all three packages carries thorough, accurate XML documentation, but no
project (and neither Directory.Build.props nor the three .csproj files) sets
<GenerateDocumentationFile>true</GenerateDocumentationFile>. As a result dotnet pack produces nupkgs
that contain the DLLs but no .xml doc files, so consumers of ZB.MOM.WW.Health / .Akka /
.EntityFrameworkCore get no IntelliSense summaries or parameter docs for the API. For a shared library
whose value proposition is a documented common surface across three apps, this discards the
documentation effort at the package boundary. (It also means the compiler does not enforce the
"public members are documented" invariant via CS1591.)
Recommendation
Set <GenerateDocumentationFile>true</GenerateDocumentationFile> once in Directory.Build.props so all
three packable projects emit and pack their XML docs. Optionally pair with targeted <NoWarn> if any
CS1591 warnings surface on currently-undocumented members.
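As a rough illustration, the property could be set along these lines. This fragment is a sketch only: the repository's actual Directory.Build.props contains other settings not shown here, and the blanket NoWarn is an optional assumption rather than a requirement.

```xml
<Project>
  <PropertyGroup>
    <!-- Emit the XML doc file next to each assembly so dotnet pack includes it. -->
    <GenerateDocumentationFile>true</GenerateDocumentationFile>
    <!-- Optional (assumption): silence CS1591 if undocumented members surface. -->
    <NoWarn>$(NoWarn);CS1591</NoWarn>
  </PropertyGroup>
</Project>
```

Appending to `$(NoWarn)` rather than overwriting it preserves any suppressions that individual projects already define.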
Resolution
Resolved in 544a6dd, 2026-06-01 — set <GenerateDocumentationFile>true</GenerateDocumentationFile> (plus NoWarn=CS1591 for test/non-packed members) in Directory.Build.props; verified dotnet pack -c Release now ships lib/net10.0/*.xml in all three nupkgs.