All 35 findings fixed in 544a6dd and marked Status: Resolved with resolution
notes. README regenerated: 0 pending / 35 total across 6 libraries.
# Code Review — Health

| Field | Value |
|-------|-------|
| Library | `ZB.MOM.WW.Health/` |
| Packages | `ZB.MOM.WW.Health`, `ZB.MOM.WW.Health.Akka`, `ZB.MOM.WW.Health.EntityFrameworkCore` |
| Component spec | `components/health/spec/SPEC.md` |
| Shared contract | `components/health/shared-contract/ZB.MOM.WW.Health.md` |
| Status | Reviewed |
| Last reviewed | 2026-06-01 |
| Reviewer | Claude (automated baseline) |
| Commit reviewed | `5f75cd4` |
| Open findings | 0 |

## Summary

The library is small, cohesive, and well-documented, and it tracks the normalized spec closely:
the three-tier endpoint convention, the canonical JSON writer, the `IActiveNodeGate` seam, the
configurable `AkkaClusterStatusPolicy` presets, and the role-filtered `ActiveNodeHealthCheck` all
match the contract. The package split is clean — Akka and EF Core stay out of the core package, so
MxGateway can take core only (category 8: no leakage). The decision logic is factored into pure,
exhaustively table-tested functions (`AkkaClusterStatusPolicy`, `ActiveNodeDecision`), and the test
suite (58 tests) exercises the public surface through a real TestServer / real SQLite / real probe
delegates rather than mocks.

The findings cluster around **error-handling completeness in the health-check probes** (category 4),
the highest-risk area for this library: a probe that throws past the `IHealthCheck` boundary, or that
returns a status the spec says it should not, degrades the value of every consuming app's `/health/*`
endpoints. The two Akka checks and the gRPC check do not guard the cluster-state / probe call the way
the spec requires (return Degraded on inaccessible cluster) or the way the sibling `DatabaseHealthCheck`
does (catch-all → Unhealthy). The framework's `HealthCheckService` catches escaping exceptions, so none
of these crash an endpoint — that caps their severity at Medium — but they produce a result that
disagrees with the documented contract. The remainder are Low: a JSON shape nuance (`description` key
omitted vs. emitted-null), recommended-tag drift in XML docs, a return-value footgun in `MapZbHealth`,
and missing `GenerateDocumentationFile` so the (otherwise excellent) XML docs do not ship in the nupkgs.

## Checklist coverage

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Tier predicates, policy presets, and `ActiveNodeDecision` matrices are correct and match the spec tables. No logic defects found. |
| 2 | Public API surface & compatibility | ☑ | Health-005. Surface is minimal, sealed, well-named; nullable annotations correct; no internal types leak (`ActiveNodeDecision` is `internal`). |
| 3 | Concurrency & thread safety | ☑ | Options are mutable POCOs but per-registration; checks hold no shared mutable state; cluster-state reads are snapshot reads; static writer options are immutable. No issues found. |
| 4 | Error handling & resilience | ☑ | Health-001, Health-002. Akka checks don't guard cluster-state access; gRPC check lets non-Rpc/non-OCE exceptions escape. |
| 5 | Security & secret handling | ☑ | No secrets/PII. Exception objects attached to `HealthCheckResult` (standard); endpoints `AllowAnonymous` by design. No issues found. |
| 6 | Performance & resource management | ☑ | `DatabaseHealthCheck` is pool-safe and eagerly releases the timeout timer; scopes/contexts disposed via `await using`; channel not owned (correctly not disposed). No issues found. |
| 7 | Spec & shared-contract adherence | ☑ | Health-001, Health-003, Health-004. Endpoint convention + status strings otherwise match §3. |
| 8 | Packaging, dependencies & project layout | ☑ | Health-006. Dependency split correct; central versions; net10.0; correct package ids. |
| 9 | Testing coverage | ☑ | Strong: TestServer tier tests, real SQLite EF tests, table-driven policy/decision tests, gRPC probe + timeout + cancellation. Gaps noted inline in findings, not raised separately. |
| 10 | Documentation & XML docs | ☑ | Health-004, Health-006. Public API XML docs are thorough and accurate; the issues are recommended-tag drift and docs not being packed. |

## Findings

### Health-001 — Akka health checks throw (instead of returning Degraded) when cluster state is inaccessible

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `ZB.MOM.WW.Health/src/ZB.MOM.WW.Health.Akka/AkkaClusterHealthCheck.cs:45`, `ZB.MOM.WW.Health/src/ZB.MOM.WW.Health.Akka/ActiveNodeHealthCheck.cs:106` |

**Description**

Both Akka checks guard only the *null* ActorSystem case: `_serviceProvider.GetService<ActorSystem>()`
returns `null` → Degraded. But once a non-null `ActorSystem` is resolved they call `Cluster.Get(system)`
and read `.SelfMember` / `.State.Leader` with no try/catch. The spec is explicit that this path must be
Degraded, not an exception:

> "in both modes, if the ActorSystem is not yet ready **or cluster state is inaccessible** (e.g.
> during startup), the check returns Degraded (startup-safety rule)." — `SPEC.md` §2.2 note.

`Cluster.Get(system)` throws (`ConfigurationException`) when the `Akka.Cluster` extension is not
configured on the resolved ActorSystem, and `SelfMember` can throw during the window where the
ActorSystem object exists but the cluster has not finished initialising — exactly the startup race the
spec calls out. The ActorSystem being present-but-not-yet-clustered is a very real ordering in a
`Microsoft.Extensions.Hosting` app (the ActorSystem is registered in DI before `Cluster` has joined).

The escaping exception is caught by ASP.NET Core's `HealthCheckService`, which records the entry as
Unhealthy — so the endpoint does not crash. But Unhealthy on the `ready` tier returns **503** and
removes the node from load-balancing, whereas the spec's intended Degraded returns **200** (still in
rotation). A transient startup window therefore yanks a healthy-but-starting node out of rotation,
the opposite of the "startup-safe" guarantee the XML docs advertise.

**Recommendation**

Wrap the cluster-state read in a try/catch and return `HealthCheckResult.Degraded(...)` on failure, so
both the null-ActorSystem and inaccessible-cluster paths converge on Degraded as the spec requires:
resolve `Cluster.Get(system)` and read membership inside a `try`, catching the cluster extension's
startup exceptions and mapping them to Degraded with a "cluster not yet ready" description.

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — wrapped the `Cluster.Get(system)`/`SelfMember`/leader reads in `AkkaClusterHealthCheck` and `ActiveNodeHealthCheck` in `try/catch (Exception when not OCE)` returning `Degraded("Akka cluster state not yet accessible.")`; tests add a plain non-clustered ActorSystem to assert Degraded instead of the escaping `ConfigurationException`.

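
For illustration, the guard described in the resolution could look roughly like the sketch below. It is not the library's actual source: the class name `StartupSafeClusterCheckSketch` and the simplified "Up means Healthy, anything else means Degraded" mapping are assumptions made for this example (the real check maps member status through `AkkaClusterStatusPolicy`). It assumes the Akka.Cluster and Microsoft.Extensions health-check packages are referenced.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical sketch of a startup-safe cluster check; not the shipped implementation.
public sealed class StartupSafeClusterCheckSketch : IHealthCheck
{
    private readonly IServiceProvider _serviceProvider;

    public StartupSafeClusterCheckSketch(IServiceProvider serviceProvider)
        => _serviceProvider = serviceProvider;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        // Path 1: ActorSystem not registered or not yet created -> Degraded.
        var system = _serviceProvider.GetService<ActorSystem>();
        if (system is null)
        {
            return Task.FromResult(HealthCheckResult.Degraded("ActorSystem not yet available."));
        }

        try
        {
            // Both calls can throw while the cluster extension is missing or still starting.
            var cluster = Akka.Cluster.Cluster.Get(system);
            var status = cluster.SelfMember.Status;

            // Simplified mapping assumed for this sketch only.
            return Task.FromResult(status == MemberStatus.Up
                ? HealthCheckResult.Healthy($"Member status: {status}")
                : HealthCheckResult.Degraded($"Member status: {status}"));
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Path 2: cluster state inaccessible -> Degraded, never an escaping exception.
            return Task.FromResult(HealthCheckResult.Degraded("Akka cluster state not yet accessible."));
        }
    }
}
```

With this shape, both startup paths report Degraded, so a node that is still joining stays in rotation on the `ready` tier instead of being reported Unhealthy by `HealthCheckService`.
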
### Health-002 — `GrpcDependencyHealthCheck` lets non-`RpcException`/non-`OperationCanceledException` errors escape

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/GrpcDependencyHealthCheck.cs:54` |

**Description**

`CheckHealthAsync` catches only `RpcException` and `OperationCanceledException`. Any other exception
thrown by the probe escapes the method. This is reachable on both paths:

- Default probe: `GrpcChannel.ConnectAsync` can throw exceptions other than `RpcException` —
  e.g. `InvalidOperationException` (channel/socket misuse, shutdown) or `HttpRequestException` /
  `SocketException` surfaced from the transport before a gRPC status is produced.
- Custom probe (`GrpcDependencyOptions.Probe`): a caller-supplied delegate (e.g. a
  `grpc.health.v1.Health/Check` RPC built on a non-gRPC client) can throw anything.

The XML docs and shared contract describe the result as Unhealthy "when it returns `false`, throws an
`RpcException`, or times out" — they do not promise to handle arbitrary exceptions, and the code does
not. By contrast the sibling `DatabaseHealthCheck` has a catch-all `catch (Exception ex)` →
`Unhealthy("Database connection failed.", ex)` (`DatabaseHealthCheck.cs:78`), so the two probes are
asymmetric. As with Health-001 the framework's `HealthCheckService` records the escaping exception as
Unhealthy, so no endpoint crashes — but the dependency-named, controlled description
(`"{name} probe failed…"`) is lost, and the library violates its own "must not throw past the
`IHealthCheck` boundary" intent.

**Recommendation**

Add a trailing `catch (Exception ex)` returning `HealthCheckResult.Unhealthy($"{name} probe failed:
{ex.Message}", ex)`, after the existing OCE/Rpc handlers (so external cancellation still propagates).
This brings the gRPC probe in line with `DatabaseHealthCheck` and the documented contract.

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — added a trailing `catch (Exception ex)` to `GrpcDependencyHealthCheck.CheckHealthAsync` returning `Unhealthy("{name} probe failed: {ex.Message}", ex)` after the OCE/Rpc external-cancellation handlers; new test asserts a probe throwing `InvalidOperationException` maps to Unhealthy.

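
The handler ordering matters here, so a minimal sketch may help. The helper below is hypothetical (`GrpcProbeGuardSketch.RunAsync` and its parameters are invented for this example, and the exact wording of the timeout and `RpcException` descriptions is assumed); only the trailing catch-all and its message format come from the resolution above. It assumes the Grpc.Core.Api and Microsoft.Extensions health-check packages are referenced.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Core;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper showing catch ordering; not the shipped GrpcDependencyHealthCheck.
public static class GrpcProbeGuardSketch
{
    public static async Task<HealthCheckResult> RunAsync(
        string name,
        Func<CancellationToken, Task<bool>> probe,
        TimeSpan timeout,
        CancellationToken cancellationToken)
    {
        // Link the caller's token with a timeout so either one cancels the probe.
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        timeoutCts.CancelAfter(timeout);

        try
        {
            var ok = await probe(timeoutCts.Token).ConfigureAwait(false);
            return ok
                ? HealthCheckResult.Healthy($"{name} reachable.")
                : HealthCheckResult.Unhealthy($"{name} probe returned false.");
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // External cancellation is not a health result; let it propagate.
            throw;
        }
        catch (OperationCanceledException)
        {
            // Only the timeout token fired.
            return HealthCheckResult.Unhealthy($"{name} probe timed out.");
        }
        catch (RpcException ex)
        {
            return HealthCheckResult.Unhealthy($"{name} probe failed: {ex.StatusCode}", ex);
        }
        catch (Exception ex)
        {
            // Trailing catch-all: transport errors and custom-probe failures stay controlled.
            return HealthCheckResult.Unhealthy($"{name} probe failed: {ex.Message}", ex);
        }
    }
}
```

Because the catch-all comes last, a probe that throws `InvalidOperationException` yields an Unhealthy result that still names the dependency, while caller-initiated cancellation continues to surface as `OperationCanceledException`.
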
### Health-003 — Null `description` is omitted from the JSON body instead of emitted as `null`

| | |
|--|--|
| Severity | Low |
| Category | Spec & shared-contract adherence |
| Status | Resolved |
| Location | `ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/ZbHealthWriter.cs:32` |

**Description**

`SerializerOptions` sets `DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull`, so a check that
produces no description renders an entry with **no** `description` key at all:
`{ "status": "Healthy", "durationMs": 0.1 }`. The spec models `description` as a `string?` that "may be
null" and both the canonical example (`SPEC.md` §3) and the `HealthChecks.UI.Client` shape the writer
claims to mirror emit the key with a `null` value rather than dropping it. Consumers/dashboards parsing
the body and reading `entries.<name>.description` must therefore handle a *missing* property, not just a
null one. The deviation is undocumented and untested — `ResponseWriterTests` only asserts the
present-description case.

Low because the aggregate `status`/HTTP-code contract that orchestrators key on is unaffected; only the
descriptive sub-field shape drifts.

**Recommendation**

Either remove `WhenWritingNull` (so `description: null` is emitted, matching the spec example and the
UI-client shape), or document explicitly in the writer's XML docs and `SPEC.md` §3 that absent
descriptions are omitted, and add a test pinning the chosen behaviour.

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — removed `DefaultIgnoreCondition = WhenWritingNull` from `ZbHealthWriter.SerializerOptions` so a null `description` renders as `"description": null` (matching the spec example / UI-client shape); writer XML doc now states the key is always present, and a new `ResponseWriterTests` case pins present-and-null.

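
The difference between the two serializer settings can be reproduced with `System.Text.Json` alone. In the sketch below, `EntrySketch` is a hypothetical stand-in for one entry of the response body, not the writer's real DTO, and the camel-case naming policy is assumed from the JSON shown in the description.

```csharp
#nullable enable
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical stand-in for a single health entry; the real writer's types may differ.
public sealed record EntrySketch(string Status, string? Description, double DurationMs);

public static class NullDescriptionDemo
{
    // Behaviour before the fix: null properties are dropped from the output.
    private static readonly JsonSerializerOptions OmitNulls = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull,
    };

    // Behaviour after the fix: default ignore condition, so null is written explicitly.
    private static readonly JsonSerializerOptions EmitNulls = new()
    {
        PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
    };

    public static string Serialize(EntrySketch entry, bool omitNulls)
        => JsonSerializer.Serialize(entry, omitNulls ? OmitNulls : EmitNulls);
}

// Usage, for an entry with no description:
//   var entry = new EntrySketch("Healthy", null, 0.1);
//   NullDescriptionDemo.Serialize(entry, omitNulls: true);
//     -> {"status":"Healthy","durationMs":0.1}
//   NullDescriptionDemo.Serialize(entry, omitNulls: false);
//     -> {"status":"Healthy","description":null,"durationMs":0.1}
```

A consumer reading `entries.<name>.description` only has to handle a null value in the second shape, which is the behaviour the resolution pins with a test.
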
### Health-004 — XML docs recommend the `active` tag for ready-tier probes, contradicting the spec

| | |
|--|--|
| Severity | Low |
| Category | Documentation & XML docs |
| Status | Resolved |
| Location | `ZB.MOM.WW.Health/src/ZB.MOM.WW.Health.Akka/AkkaClusterHealthCheck.cs:11`, `ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/GrpcDependencyHealthCheck.cs:21` |

**Description**

`AkkaClusterHealthCheck`'s summary says *"Register to the `ZbHealthTags.Ready` tag (recommended
`[ready, active]`)"*, and `GrpcDependencyHealthCheck` likewise recommends tagging `[ready, active]`.
The spec is unambiguous that both probes belong to the `ready` tier only: §2.2 "Registered to the
`ready` tag" (Akka cluster) and §2.4 "Registered to the `ready` tag" (gRPC dependency). The `active`
tier is reserved by §1 / §2.3 for the leader/active-singleton probe (`ActiveNodeHealthCheck`); putting
cluster-membership or gRPC-reachability checks on the `active` tier pollutes the active-node tier with
non-leadership concerns and contradicts the convergence convention the library exists to enforce.

**Recommendation**

Bring the XML-doc recommendations in line with the spec — recommend `ZbHealthTags.Ready` only for both
probes (drop the `active` suggestion); or, if the dual-tag recommendation is intentional, reconcile it
back into `SPEC.md` §2.2/§2.4 so code and spec agree.

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — updated the XML summaries on `AkkaClusterHealthCheck` and `GrpcDependencyHealthCheck` to recommend `ZbHealthTags.Ready` only (dropped the `[ready, active]` suggestion), aligning the docs with SPEC §2.2/§2.4.

### Health-005 — `MapZbHealth` returns only the readiness builder, silently dropping conventions for the active/live tiers

| | |
|--|--|
| Severity | Low |
| Category | Public API surface & compatibility |
| Status | Resolved |
| Location | `ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/ZbHealthEndpointExtensions.cs:64` |

**Description**

`MapZbHealth` maps three endpoints but returns the `IEndpointConventionBuilder` for `/health/ready`
only. Any convention a caller chains onto the result — `.RequireHost(...)`, `.RequireAuthorization()`,
`.WithMetadata(...)`, `.CacheOutput()` — applies to the readiness endpoint **alone**; the active and
liveness endpoints silently do not receive it. The behaviour is documented in the XML `<returns>`, so
this is a deliberate, disclosed trade-off rather than a bug, but it is a real footgun: the most natural
reading of `endpoints.MapZbHealth().RequireHost("…")` is "gate all three health endpoints", and the
result type gives the caller no signal that two of the three are excluded.

**Recommendation**

Consider returning a composite builder that fans conventions out to all three endpoints (collect the
three `IEndpointConventionBuilder`s and wrap them) so chained conventions behave as a caller expects.
If the single-builder return is kept for simplicity, strengthen the `<returns>` remark to a warning and
note it in the README's `MapZbHealth` example.

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — `MapZbHealth` now returns a private `CompositeEndpointConventionBuilder` that fans `Add`/`Finally` to all three endpoint builders (readiness, active, liveness); `<returns>` docs updated, and a new `TierMappingTests` case proves `.RequireHost(...)` gates all three tiers.

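
A fan-out builder of the kind described can be quite small. The sketch below is an assumption about its general shape, not a copy of the private `CompositeEndpointConventionBuilder`; the class name `CompositeConventionBuilderSketch` is invented for this example. It relies only on the ASP.NET Core `IEndpointConventionBuilder` interface (with `Finally`, available from .NET 7).

```csharp
using System;
using System.Collections.Generic;
using Microsoft.AspNetCore.Builder;

// Hypothetical sketch: forwards every convention to each wrapped endpoint builder.
public sealed class CompositeConventionBuilderSketch : IEndpointConventionBuilder
{
    private readonly IReadOnlyList<IEndpointConventionBuilder> _builders;

    public CompositeConventionBuilderSketch(params IEndpointConventionBuilder[] builders)
        => _builders = builders ?? throw new ArgumentNullException(nameof(builders));

    // Called by extension methods such as RequireHost or RequireAuthorization.
    public void Add(Action<EndpointBuilder> convention)
    {
        foreach (var builder in _builders)
        {
            builder.Add(convention);
        }
    }

    // Conventions registered here run after all Add conventions on each endpoint.
    public void Finally(Action<EndpointBuilder> finallyConvention)
    {
        foreach (var builder in _builders)
        {
            builder.Finally(finallyConvention);
        }
    }
}
```

Since extension methods like `RequireHost` are implemented in terms of `Add`, returning a wrapper like this from a mapping method means a chained convention reaches every wrapped endpoint rather than only the first.
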
### Health-006 — XML documentation is not emitted into the packed nupkgs

| | |
|--|--|
| Severity | Low |
| Category | Packaging, dependencies & project layout |
| Status | Resolved |
| Location | `ZB.MOM.WW.Health/Directory.Build.props:3`, `ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/ZB.MOM.WW.Health.csproj:3` |

**Description**

Every public type and member in all three packages carries thorough, accurate XML documentation, but no
project (and neither `Directory.Build.props` nor the three `.csproj` files) sets
`<GenerateDocumentationFile>true</GenerateDocumentationFile>`. As a result `dotnet pack` produces nupkgs
that contain the DLLs but **no `.xml` doc files**, so consumers of `ZB.MOM.WW.Health` / `.Akka` /
`.EntityFrameworkCore` get no IntelliSense summaries or parameter docs for the API. For a shared library
whose value proposition is a documented common surface across three apps, this discards the
documentation effort at the package boundary. (It also means the compiler does not enforce the
"public members are documented" invariant via CS1591.)

**Recommendation**

Set `<GenerateDocumentationFile>true</GenerateDocumentationFile>` once in `Directory.Build.props` so all
three packable projects emit and pack their XML docs. Optionally pair with targeted `<NoWarn>` if any
CS1591 warnings surface on currently-undocumented members.

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — set `<GenerateDocumentationFile>true</GenerateDocumentationFile>` (plus `NoWarn=CS1591` for test/non-packed members) in `Directory.Build.props`; verified `dotnet pack -c Release` now ships `lib/net10.0/*.xml` in all three nupkgs.