scadaproj/code-reviews/Health/findings.md
Joseph Doherty ae0ccc9a3a Mark all baseline code-review findings resolved
All 35 findings fixed in 544a6dd and marked Status: Resolved with resolution
notes. README regenerated: 0 pending / 35 total across 6 libraries.
2026-06-01 11:22:37 -04:00

# Code Review — Health
| Field | Value |
|-------|-------|
| Library | `ZB.MOM.WW.Health/` |
| Packages | `ZB.MOM.WW.Health`, `ZB.MOM.WW.Health.Akka`, `ZB.MOM.WW.Health.EntityFrameworkCore` |
| Component spec | `components/health/spec/SPEC.md` |
| Shared contract | `components/health/shared-contract/ZB.MOM.WW.Health.md` |
| Status | Reviewed |
| Last reviewed | 2026-06-01 |
| Reviewer | Claude (automated baseline) |
| Commit reviewed | `5f75cd4` |
| Open findings | 0 |
## Summary
The library is small, cohesive, and well-documented, and it tracks the normalized spec closely:
the three-tier endpoint convention, the canonical JSON writer, the `IActiveNodeGate` seam, the
configurable `AkkaClusterStatusPolicy` presets, and the role-filtered `ActiveNodeHealthCheck` all
match the contract. The package split is clean — Akka and EF Core stay out of the core package, so
MxGateway can take core only (category 8: no leakage). The decision logic is factored into pure,
exhaustively table-tested functions (`AkkaClusterStatusPolicy`, `ActiveNodeDecision`), and the test
suite (58 tests) exercises the public surface through a real TestServer / real SQLite / real probe
delegates rather than mocks.

The findings cluster around **error-handling completeness in the health-check probes** (category 4),
the highest-risk area for this library: a probe that throws past the `IHealthCheck` boundary, or that
returns a status the spec says it should not, degrades the value of every consuming app's `/health/*`
endpoints. The two Akka checks and the gRPC check do not guard the cluster-state / probe call the way
the spec requires (return Degraded on inaccessible cluster) or the way the sibling `DatabaseHealthCheck`
does (catch-all → Unhealthy). The framework's `HealthCheckService` catches escaping exceptions, so none
of these crash an endpoint — that caps their severity at Medium — but they produce a result that
disagrees with the documented contract. The remainder are Low: a JSON shape nuance (`description` key
omitted vs. emitted-null), recommended-tag drift in XML docs, a return-value footgun in `MapZbHealth`,
and missing `GenerateDocumentationFile` so the (otherwise excellent) XML docs do not ship in the nupkgs.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Tier predicates, policy presets, and `ActiveNodeDecision` matrices are correct and match the spec tables. No logic defects found. |
| 2 | Public API surface & compatibility | ☑ | Health-005. Surface is minimal, sealed, well-named; nullable annotations correct; no internal types leak (`ActiveNodeDecision` is `internal`). |
| 3 | Concurrency & thread safety | ☑ | Options are mutable POCOs but per-registration; checks hold no shared mutable state; cluster-state reads are snapshot reads; static writer options are immutable. No issues found. |
| 4 | Error handling & resilience | ☑ | Health-001, Health-002. Akka checks don't guard cluster-state access; gRPC check lets non-Rpc/non-OCE exceptions escape. |
| 5 | Security & secret handling | ☑ | No secrets/PII. Exception objects attached to `HealthCheckResult` (standard); endpoints `AllowAnonymous` by design. No issues found. |
| 6 | Performance & resource management | ☑ | `DatabaseHealthCheck` is pool-safe and eagerly releases the timeout timer; scopes/contexts disposed via `await using`; channel not owned (correctly not disposed). No issues found. |
| 7 | Spec & shared-contract adherence | ☑ | Health-001, Health-003, Health-004. Endpoint convention + status strings otherwise match §3. |
| 8 | Packaging, dependencies & project layout | ☑ | Health-006. Dependency split correct; central versions; net10.0; correct package ids. |
| 9 | Testing coverage | ☑ | Strong: TestServer tier tests, real SQLite EF tests, table-driven policy/decision tests, gRPC probe + timeout + cancellation. Gaps noted inline in findings, not raised separately. |
| 10 | Documentation & XML docs | ☑ | Health-004, Health-006. Public API XML docs are thorough and accurate; the issues are recommended-tag drift and docs not being packed. |
## Findings
### Health-001 — Akka health checks throw (instead of returning Degraded) when cluster state is inaccessible
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `ZB.MOM.WW.Health/src/ZB.MOM.WW.Health.Akka/AkkaClusterHealthCheck.cs:45`, `ZB.MOM.WW.Health/src/ZB.MOM.WW.Health.Akka/ActiveNodeHealthCheck.cs:106` |

**Description**
Both Akka checks guard only the *null* ActorSystem case: `_serviceProvider.GetService<ActorSystem>()`
returns `null` → Degraded. But once a non-null `ActorSystem` is resolved they call `Cluster.Get(system)`
and read `.SelfMember` / `.State.Leader` with no try/catch. The spec is explicit that this path must be
Degraded, not an exception:
> "in both modes, if the ActorSystem is not yet ready **or cluster state is inaccessible** (e.g.
> during startup), the check returns Degraded (startup-safety rule)." — `SPEC.md` §2.2 note.

`Cluster.Get(system)` throws (`ConfigurationException`) when the `Akka.Cluster` extension is not
configured on the resolved ActorSystem, and `SelfMember` can throw during the window where the
ActorSystem object exists but the cluster has not finished initialising — exactly the startup race the
spec calls out. The ActorSystem being present-but-not-yet-clustered is a very real ordering in a
`Microsoft.Extensions.Hosting` app (the ActorSystem is registered in DI before `Cluster` has joined).

The escaping exception is caught by ASP.NET Core's `HealthCheckService`, which records the entry as
Unhealthy — so the endpoint does not crash. But Unhealthy on the `ready` tier returns **503** and
removes the node from load-balancing, whereas the spec's intended Degraded returns **200** (still in
rotation). A transient startup window therefore yanks a healthy-but-starting node out of rotation,
the opposite of the "startup-safe" guarantee the XML docs advertise.
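
The 503-versus-200 difference comes from the status-code mapping applied where the endpoint is mapped. The sketch below spells out the ASP.NET Core default mapping on `HealthCheckOptions.ResultStatusCodes`. The class name `ReadyTierStatusCodes` is hypothetical, and the options that `MapZbHealth` actually passes are not shown in this review, so treat this as an illustration of the framework behaviour rather than the library's code.

```csharp
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper: writes out the framework's default status-to-HTTP mapping.
public static class ReadyTierStatusCodes
{
    public static HealthCheckOptions Create() => new()
    {
        ResultStatusCodes =
        {
            // Healthy and Degraded both keep the node in rotation.
            [HealthStatus.Healthy] = StatusCodes.Status200OK,
            [HealthStatus.Degraded] = StatusCodes.Status200OK,
            // Unhealthy is what an escaped exception is recorded as: 503, out of rotation.
            [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable,
        },
    };
}
```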
**Recommendation**
Wrap the cluster-state read in a try/catch and return `HealthCheckResult.Degraded(...)` on failure, so
both the null-ActorSystem and inaccessible-cluster paths converge on Degraded as the spec requires:
resolve `Cluster.Get(system)` and read membership inside a `try`, catching the cluster extension's
startup exceptions and mapping them to Degraded with a "cluster not yet ready" description.
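
A minimal sketch of that guard, assuming the Akka.NET `Akka.Cluster` package and `Microsoft.Extensions.Diagnostics.HealthChecks`. `ClusterStateGuard` is a hypothetical helper, not the library's code, and the "Up means Healthy" mapping is only a placeholder for the real `AkkaClusterStatusPolicy` presets, which this review does not reproduce. The point is the shape: the null-ActorSystem path and the inaccessible-cluster path both end in Degraded.

```csharp
using System;
using Akka.Actor;
using Akka.Cluster;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper illustrating the guard; not the library's actual check.
public static class ClusterStateGuard
{
    public static HealthCheckResult Check(ActorSystem? system)
    {
        // Path already handled per the finding: no ActorSystem resolved yet.
        if (system is null)
            return HealthCheckResult.Degraded("ActorSystem not yet available.");

        try
        {
            // Both the extension lookup and the membership read sit inside the try,
            // because either can throw while the cluster is absent or still starting.
            var cluster = Akka.Cluster.Cluster.Get(system);
            var self = cluster.SelfMember;

            // Placeholder policy (assumption): only Up counts as Healthy.
            return self.Status == MemberStatus.Up
                ? HealthCheckResult.Healthy($"Cluster member is {self.Status}.")
                : HealthCheckResult.Degraded($"Cluster member is {self.Status}.");
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Startup-safety rule: inaccessible cluster state maps to Degraded.
            return HealthCheckResult.Degraded("Akka cluster state not yet accessible.", ex);
        }
    }
}
```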
**Resolution**
Resolved in `544a6dd`, 2026-06-01 — wrapped the `Cluster.Get(system)`/`SelfMember`/leader reads in `AkkaClusterHealthCheck` and `ActiveNodeHealthCheck` in `try/catch (Exception when not OCE)` returning `Degraded("Akka cluster state not yet accessible.")`; tests add a plain non-clustered ActorSystem to assert Degraded instead of the escaping `ConfigurationException`.
### Health-002 — `GrpcDependencyHealthCheck` lets non-`RpcException`/non-`OperationCanceledException` errors escape
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/GrpcDependencyHealthCheck.cs:54` |

**Description**
`CheckHealthAsync` catches only `RpcException` and `OperationCanceledException`. Any other exception
thrown by the probe escapes the method. This is reachable on both paths:
- Default probe: `GrpcChannel.ConnectAsync` can throw exceptions other than `RpcException`,
  e.g. `InvalidOperationException` (channel/socket misuse, shutdown) or `HttpRequestException` /
  `SocketException` surfaced from the transport before a gRPC status is produced.
- Custom probe (`GrpcDependencyOptions.Probe`): a caller-supplied delegate (e.g. a
  `grpc.health.v1.Health/Check` RPC built on a non-gRPC client) can throw anything.

The XML docs and shared contract describe the result as Unhealthy "when it returns `false`, throws an
`RpcException`, or times out" — they do not promise to handle arbitrary exceptions, and the code does
not. By contrast the sibling `DatabaseHealthCheck` has a catch-all `catch (Exception ex)` →
`Unhealthy("Database connection failed.", ex)` (`DatabaseHealthCheck.cs:78`), so the two probes are
asymmetric. As with Health-001 the framework's `HealthCheckService` records the escaping exception as
Unhealthy, so no endpoint crashes — but the dependency-named, controlled description
(`"{name} probe failed…"`) is lost, and the library violates its own "must not throw past the
`IHealthCheck` boundary" intent.
**Recommendation**
Add a trailing `catch (Exception ex)` returning `HealthCheckResult.Unhealthy($"{name} probe failed:
{ex.Message}", ex)`, after the existing OCE/Rpc handlers (so external cancellation still propagates).
This brings the gRPC probe in line with `DatabaseHealthCheck` and the documented contract.
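
The handler ordering can be sketched as follows, assuming `RpcException` from the `Grpc.Core` namespace (the `Grpc.Core.Api` package) and `Microsoft.Extensions.Diagnostics.HealthChecks`. `GrpcProbeGuard`, the probe delegate shape, the linked-token timeout, and every message except the `"{name} probe failed: ..."` one are assumptions made for illustration; the real `CheckHealthAsync` body is not shown in this review. The catch-all comes last so that external cancellation is rethrown before it can be swallowed.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Core;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical stand-in for the check body; only the catch ordering matters here.
public static class GrpcProbeGuard
{
    public static async Task<HealthCheckResult> RunAsync(
        string name,
        Func<CancellationToken, Task<bool>> probe,
        TimeSpan timeout,
        CancellationToken cancellationToken)
    {
        // Link the caller's token with a timeout so either one cancels the probe.
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        timeoutCts.CancelAfter(timeout);

        try
        {
            return await probe(timeoutCts.Token).ConfigureAwait(false)
                ? HealthCheckResult.Healthy($"{name} is reachable.")
                : HealthCheckResult.Unhealthy($"{name} probe returned false.");
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            throw; // external cancellation still propagates
        }
        catch (OperationCanceledException)
        {
            return HealthCheckResult.Unhealthy($"{name} probe timed out.");
        }
        catch (RpcException ex)
        {
            return HealthCheckResult.Unhealthy($"{name} probe failed: {ex.StatusCode}", ex);
        }
        catch (Exception ex)
        {
            // Trailing catch-all: nothing escapes past the IHealthCheck boundary.
            return HealthCheckResult.Unhealthy($"{name} probe failed: {ex.Message}", ex);
        }
    }
}
```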
**Resolution**
Resolved in `544a6dd`, 2026-06-01 — added a trailing `catch (Exception ex)` to `GrpcDependencyHealthCheck.CheckHealthAsync` returning `Unhealthy("{name} probe failed: {ex.Message}", ex)` after the OCE/Rpc external-cancellation handlers; new test asserts a probe throwing `InvalidOperationException` maps to Unhealthy.
### Health-003 — Null `description` is omitted from the JSON body instead of emitted as `null`
| | |
|--|--|
| Severity | Low |
| Category | Spec & shared-contract adherence |
| Status | Resolved |
| Location | `ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/ZbHealthWriter.cs:32` |

**Description**
`SerializerOptions` sets `DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull`, so a check that
produces no description renders an entry with **no** `description` key at all:
`{ "status": "Healthy", "durationMs": 0.1 }`. The spec models `description` as a `string?` that "may be
null", and both the canonical example (`SPEC.md` §3) and the `HealthChecks.UI.Client` shape that the writer
claims to mirror emit the key with a `null` value rather than dropping it. Consumers/dashboards parsing
the body and reading `entries.<name>.description` must therefore handle a *missing* property, not just a
null one. The deviation is undocumented and untested — `ResponseWriterTests` only asserts the
present-description case.

Low because the aggregate `status`/HTTP-code contract that orchestrators key on is unaffected; only the
descriptive sub-field shape drifts.
**Recommendation**
Either remove `WhenWritingNull` (so `description: null` is emitted, matching the spec example and the
UI-client shape), or document explicitly in the writer's XML docs and `SPEC.md` §3 that absent
descriptions are omitted, and add a test pinning the chosen behaviour.
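
The two shapes are easy to see side by side with `System.Text.Json` alone. In the sketch below, `EntryDto` and `DescriptionShapeDemo` are hypothetical names, and the camel-case naming policy is assumed from the `durationMs` key in the example above; the writer's real entry type is not shown in this review. A test pinning the chosen behaviour could assert on either string.

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical entry shape for illustration only.
public sealed record EntryDto(string Status, string? Description, double DurationMs);

public static class DescriptionShapeDemo
{
    // omitNulls: true mimics WhenWritingNull; false is the serializer default (Never).
    public static string Render(bool omitNulls)
    {
        var options = new JsonSerializerOptions
        {
            PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
            DefaultIgnoreCondition = omitNulls
                ? JsonIgnoreCondition.WhenWritingNull
                : JsonIgnoreCondition.Never,
        };

        // Same entry in both cases: a check that produced no description.
        return JsonSerializer.Serialize(new EntryDto("Healthy", null, 0.1), options);
    }
}

// DescriptionShapeDemo.Render(omitNulls: true)
//   {"status":"Healthy","durationMs":0.1}
// DescriptionShapeDemo.Render(omitNulls: false)
//   {"status":"Healthy","description":null,"durationMs":0.1}
```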
**Resolution**
Resolved in `544a6dd`, 2026-06-01 — removed `DefaultIgnoreCondition = WhenWritingNull` from `ZbHealthWriter.SerializerOptions` so a null `description` renders as `"description": null` (matching the spec example / UI-client shape); writer XML doc now states the key is always present, and a new `ResponseWriterTests` case pins present-and-null.
### Health-004 — XML docs recommend the `active` tag for ready-tier probes, contradicting the spec
| | |
|--|--|
| Severity | Low |
| Category | Documentation & XML docs |
| Status | Resolved |
| Location | `ZB.MOM.WW.Health/src/ZB.MOM.WW.Health.Akka/AkkaClusterHealthCheck.cs:11`, `ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/GrpcDependencyHealthCheck.cs:21` |

**Description**
`AkkaClusterHealthCheck`'s summary says *"Register to the `ZbHealthTags.Ready` tag (recommended
`[ready, active]`)"*, and `GrpcDependencyHealthCheck` likewise recommends tagging `[ready, active]`.
The spec is unambiguous that both probes belong to the `ready` tier only: §2.2 "Registered to the
`ready` tag" (Akka cluster) and §2.4 "Registered to the `ready` tag" (gRPC dependency). The `active`
tier is reserved by §1 / §2.3 for the leader/active-singleton probe (`ActiveNodeHealthCheck`); putting
cluster-membership or gRPC-reachability checks on the `active` tier pollutes the active-node tier with
non-leadership concerns and contradicts the convergence convention the library exists to enforce.
**Recommendation**
Bring the XML-doc recommendations in line with the spec — recommend `ZbHealthTags.Ready` only for both
probes (drop the `active` suggestion); or, if the dual-tag recommendation is intentional, reconcile it
back into `SPEC.md` §2.2/§2.4 so code and spec agree.
**Resolution**
Resolved in `544a6dd`, 2026-06-01 — updated the XML summaries on `AkkaClusterHealthCheck` and `GrpcDependencyHealthCheck` to recommend `ZbHealthTags.Ready` only (dropped the `[ready, active]` suggestion), aligning the docs with SPEC §2.2/§2.4.
### Health-005 — `MapZbHealth` returns only the readiness builder, silently dropping conventions for the active/live tiers
| | |
|--|--|
| Severity | Low |
| Category | Public API surface & compatibility |
| Status | Resolved |
| Location | `ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/ZbHealthEndpointExtensions.cs:64` |

**Description**
`MapZbHealth` maps three endpoints but returns the `IEndpointConventionBuilder` for `/health/ready`
only. Any convention a caller chains onto the result — `.RequireHost(...)`, `.RequireAuthorization()`,
`.WithMetadata(...)`, `.CacheOutput()` — applies to the readiness endpoint **alone**; the active and
liveness endpoints silently do not receive it. The behaviour is documented in the XML `<returns>`, so
this is a deliberate, disclosed trade-off rather than a bug, but it is a real footgun: the most natural
reading of `endpoints.MapZbHealth().RequireHost("…")` is "gate all three health endpoints", and the
result type gives the caller no signal that two of the three are excluded.
**Recommendation**
Consider returning a composite builder that fans conventions out to all three endpoints (collect the
three `IEndpointConventionBuilder`s and wrap them) so chained conventions behave as a caller expects.
If the single-builder return is kept for simplicity, strengthen the `<returns>` remark to a warning and
note it in the README's `MapZbHealth` example.
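
One possible shape for such a composite, assuming ASP.NET Core 7 or later (where `IEndpointConventionBuilder.Finally` exists as a default interface member). `FanOutEndpointConventionBuilder` is a hypothetical name and this is a sketch, not the builder the library ships. `MapZbHealth` would construct it from the three builders returned by its per-tier mapping calls and return it, so a chained `.RequireHost(...)` reaches every tier.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.AspNetCore.Builder;

// Hypothetical fan-out builder: forwards every convention to each wrapped builder.
public sealed class FanOutEndpointConventionBuilder : IEndpointConventionBuilder
{
    private readonly IReadOnlyList<IEndpointConventionBuilder> _inner;

    public FanOutEndpointConventionBuilder(params IEndpointConventionBuilder[] inner)
        => _inner = inner;

    // Extension methods such as RequireHost and WithMetadata end up calling Add.
    public void Add(Action<EndpointBuilder> convention)
    {
        foreach (var builder in _inner)
            builder.Add(convention);
    }

    // Without this override the interface default applies, which throws NotSupportedException.
    public void Finally(Action<EndpointBuilder> finallyConvention)
    {
        foreach (var builder in _inner)
            builder.Finally(finallyConvention);
    }
}
```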
**Resolution**
Resolved in `544a6dd`, 2026-06-01 — `MapZbHealth` now returns a private `CompositeEndpointConventionBuilder` that fans `Add`/`Finally` to all three endpoint builders (readiness, active, liveness); `<returns>` docs updated, and a new `TierMappingTests` case proves `.RequireHost(...)` gates all three tiers.
### Health-006 — XML documentation is not emitted into the packed nupkgs
| | |
|--|--|
| Severity | Low |
| Category | Packaging, dependencies & project layout |
| Status | Resolved |
| Location | `ZB.MOM.WW.Health/Directory.Build.props:3`, `ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/ZB.MOM.WW.Health.csproj:3` |

**Description**
Every public type and member in all three packages carries thorough, accurate XML documentation, but
neither `Directory.Build.props` nor any of the three `.csproj` files sets
`<GenerateDocumentationFile>true</GenerateDocumentationFile>`. As a result `dotnet pack` produces nupkgs
that contain the DLLs but **no `.xml` doc files**, so consumers of `ZB.MOM.WW.Health` / `.Akka` /
`.EntityFrameworkCore` get no IntelliSense summaries or parameter docs for the API. For a shared library
whose value proposition is a documented common surface across three apps, this discards the
documentation effort at the package boundary. (It also means the compiler does not enforce the
"public members are documented" invariant via CS1591.)
**Recommendation**
Set `<GenerateDocumentationFile>true</GenerateDocumentationFile>` once in `Directory.Build.props` so all
three packable projects emit and pack their XML docs. Optionally pair with targeted `<NoWarn>` if any
CS1591 warnings surface on currently-undocumented members.
**Resolution**
Resolved in `544a6dd`, 2026-06-01 — set `<GenerateDocumentationFile>true</GenerateDocumentationFile>` (plus `NoWarn=CS1591` for test/non-packed members) in `Directory.Build.props`; verified `dotnet pack -c Release` now ships `lib/net10.0/*.xml` in all three nupkgs.