scadaproj/code-reviews/Health/findings.md
Joseph Doherty ae0ccc9a3a Mark all baseline code-review findings resolved
All 35 findings fixed in 544a6dd and marked Status: Resolved with resolution
notes. README regenerated: 0 pending / 35 total across 6 libraries.
2026-06-01 11:22:37 -04:00

Code Review — Health

| Field | Value |
| --- | --- |
| Library | ZB.MOM.WW.Health/ |
| Packages | ZB.MOM.WW.Health, ZB.MOM.WW.Health.Akka, ZB.MOM.WW.Health.EntityFrameworkCore |
| Component spec | components/health/spec/SPEC.md |
| Shared contract | components/health/shared-contract/ZB.MOM.WW.Health.md |
| Status | Reviewed |
| Last reviewed | 2026-06-01 |
| Reviewer | Claude (automated baseline) |
| Commit reviewed | 5f75cd4 |
| Open findings | 0 |

Summary

The library is small, cohesive, and well-documented, and it tracks the normalized spec closely: the three-tier endpoint convention, the canonical JSON writer, the IActiveNodeGate seam, the configurable AkkaClusterStatusPolicy presets, and the role-filtered ActiveNodeHealthCheck all match the contract. The package split is clean — Akka and EF Core stay out of the core package, so MxGateway can take core only (category 8: no leakage). The decision logic is factored into pure, exhaustively table-tested functions (AkkaClusterStatusPolicy, ActiveNodeDecision), and the test suite (58 tests) exercises the public surface through a real TestServer / real SQLite / real probe delegates rather than mocks.

The findings cluster around error-handling completeness in the health-check probes (category 4), the highest-risk area for this library: a probe that throws past the IHealthCheck boundary, or that returns a status the spec says it should not, degrades the value of every consuming app's /health/* endpoints. The two Akka checks and the gRPC check do not guard the cluster-state / probe call the way the spec requires (return Degraded on inaccessible cluster) or the way the sibling DatabaseHealthCheck does (catch-all → Unhealthy). The framework's HealthCheckService catches escaping exceptions, so none of these crash an endpoint — that caps their severity at Medium — but they produce a result that disagrees with the documented contract. The remainder are Low: a JSON shape nuance (description key omitted vs. emitted-null), recommended-tag drift in XML docs, a return-value footgun in MapZbHealth, and missing GenerateDocumentationFile so the (otherwise excellent) XML docs do not ship in the nupkgs.

Checklist coverage

| # | Category | Examined | Notes |
| --- | --- | --- | --- |
| 1 | Correctness & logic bugs | | Tier predicates, policy presets, and ActiveNodeDecision matrices are correct and match the spec tables. No logic defects found. |
| 2 | Public API surface & compatibility | | Health-005. Surface is minimal, sealed, well-named; nullable annotations correct; no internal types leak (ActiveNodeDecision is internal). |
| 3 | Concurrency & thread safety | | Options are mutable POCOs but per-registration; checks hold no shared mutable state; cluster-state reads are snapshot reads; static writer options are immutable. No issues found. |
| 4 | Error handling & resilience | | Health-001, Health-002. Akka checks don't guard cluster-state access; gRPC check lets non-Rpc/non-OCE exceptions escape. |
| 5 | Security & secret handling | | No secrets/PII. Exception objects attached to HealthCheckResult (standard); endpoints AllowAnonymous by design. No issues found. |
| 6 | Performance & resource management | | DatabaseHealthCheck is pool-safe and eagerly releases the timeout timer; scopes/contexts disposed via await using; channel not owned (correctly not disposed). No issues found. |
| 7 | Spec & shared-contract adherence | | Health-001, Health-003, Health-004. Endpoint convention + status strings otherwise match §3. |
| 8 | Packaging, dependencies & project layout | | Health-006. Dependency split correct; central versions; net10.0; correct package ids. |
| 9 | Testing coverage | | Strong: TestServer tier tests, real SQLite EF tests, table-driven policy/decision tests, gRPC probe + timeout + cancellation. Gaps noted inline in findings, not raised separately. |
| 10 | Documentation & XML docs | | Health-004, Health-006. Public API XML docs are thorough and accurate; the issues are recommended-tag drift and docs not being packed. |

Findings

Health-001 — Akka health checks throw (instead of returning Degraded) when cluster state is inaccessible

Severity Medium
Category Error handling & resilience
Status Resolved
Location ZB.MOM.WW.Health/src/ZB.MOM.WW.Health.Akka/AkkaClusterHealthCheck.cs:45, ZB.MOM.WW.Health/src/ZB.MOM.WW.Health.Akka/ActiveNodeHealthCheck.cs:106

Description

Both Akka checks guard only the null ActorSystem case: _serviceProvider.GetService<ActorSystem>() returns null → Degraded. But once a non-null ActorSystem is resolved they call Cluster.Get(system) and read .SelfMember / .State.Leader with no try/catch. The spec is explicit that this path must be Degraded, not an exception:

"in both modes, if the ActorSystem is not yet ready or cluster state is inaccessible (e.g. during startup), the check returns Degraded (startup-safety rule)." — SPEC.md §2.2 note.

Cluster.Get(system) throws (ConfigurationException) when the Akka.Cluster extension is not configured on the resolved ActorSystem, and SelfMember can throw during the window where the ActorSystem object exists but the cluster has not finished initialising — exactly the startup race the spec calls out. The ActorSystem being present-but-not-yet-clustered is a very real ordering in a Microsoft.Extensions.Hosting app (the ActorSystem is registered in DI before Cluster has joined).

The escaping exception is caught by ASP.NET Core's HealthCheckService, which records the entry as Unhealthy — so the endpoint does not crash. But Unhealthy on the ready tier returns 503 and removes the node from load-balancing, whereas the spec's intended Degraded returns 200 (still in rotation). A transient startup window therefore yanks a healthy-but-starting node out of rotation, the opposite of the "startup-safe" guarantee the XML docs advertise.
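The 503-versus-200 difference follows from ASP.NET Core's default result-to-status-code mapping. A minimal sketch of that mapping, assuming the ready tier is selected by a tag literally named "ready" (the library's actual tag constant and endpoint wiring are not shown in this review; `ReadyTierSketch` is a hypothetical name):

```csharp
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper, for illustration only; not part of ZB.MOM.WW.Health.
public static class ReadyTierSketch
{
    public static HealthCheckOptions Options() => new HealthCheckOptions
    {
        // Assumption: ready-tier checks carry the tag "ready".
        Predicate = registration => registration.Tags.Contains("ready"),

        // These entries restate the framework defaults explicitly:
        // Degraded stays at 200 (node remains in rotation), Unhealthy becomes 503.
        ResultStatusCodes =
        {
            [HealthStatus.Healthy] = StatusCodes.Status200OK,
            [HealthStatus.Degraded] = StatusCodes.Status200OK,
            [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable,
        },
    };
}
```

Such options would typically be passed to `app.MapHealthChecks("/health/ready", ReadyTierSketch.Options())`. An exception that escapes a check is recorded as Unhealthy and therefore lands on the 503 row.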

Recommendation

Wrap the cluster-state read in a try/catch and return HealthCheckResult.Degraded(...) on failure, so both the null-ActorSystem and inaccessible-cluster paths converge on Degraded as the spec requires: resolve Cluster.Get(system) and read membership inside a try, catching the cluster extension's startup exceptions and mapping them to Degraded with a "cluster not yet ready" description.
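A minimal sketch of the recommended guard, not the library's actual source. The class name `GuardedClusterCheckSketch` and the Up-means-Healthy mapping are assumptions for illustration; the real checks delegate the status decision to AkkaClusterStatusPolicy / ActiveNodeDecision.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Akka.Actor;
using Akka.Cluster;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical check showing the guard only; names are illustrative.
public sealed class GuardedClusterCheckSketch : IHealthCheck
{
    private readonly IServiceProvider _serviceProvider;

    public GuardedClusterCheckSketch(IServiceProvider serviceProvider)
        => _serviceProvider = serviceProvider;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Existing guard: no ActorSystem registered or resolved yet.
        var system = _serviceProvider.GetService<ActorSystem>();
        if (system is null)
            return Task.FromResult(HealthCheckResult.Degraded("ActorSystem not yet available."));

        try
        {
            // Cluster.Get can throw when the cluster extension is not configured,
            // and SelfMember can throw before initialisation has finished.
            var cluster = Cluster.Get(system);
            var status = cluster.SelfMember.Status;

            // Placeholder mapping (assumption): the real library applies its policy here.
            return Task.FromResult(status == MemberStatus.Up
                ? HealthCheckResult.Healthy($"Member status: {status}")
                : HealthCheckResult.Degraded($"Member status: {status}"));
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Startup-safety rule: inaccessible cluster state is Degraded, not an exception.
            return Task.FromResult(
                HealthCheckResult.Degraded("Akka cluster state not yet accessible.", ex));
        }
    }
}
```

With this shape, the null-ActorSystem path and the inaccessible-cluster path both end in Degraded, so a starting node stays in rotation.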

Resolution

Resolved in 544a6dd, 2026-06-01 — wrapped the Cluster.Get(system)/SelfMember/leader reads in AkkaClusterHealthCheck and ActiveNodeHealthCheck in try/catch (Exception when not OCE) returning Degraded("Akka cluster state not yet accessible."); tests add a plain non-clustered ActorSystem to assert Degraded instead of the escaping ConfigurationException.

Health-002 — GrpcDependencyHealthCheck lets non-RpcException/non-OperationCanceledException errors escape

Severity Medium
Category Error handling & resilience
Status Resolved
Location ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/GrpcDependencyHealthCheck.cs:54

Description

CheckHealthAsync catches only RpcException and OperationCanceledException. Any other exception thrown by the probe escapes the method. This is reachable on both paths:

  • Default probe: GrpcChannel.ConnectAsync can throw exceptions other than RpcException — e.g. InvalidOperationException (channel/socket misuse, shutdown) or HttpRequestException / SocketException surfaced from the transport before a gRPC status is produced.
  • Custom probe (GrpcDependencyOptions.Probe): a caller-supplied delegate (e.g. a grpc.health.v1.Health/Check RPC built on a non-gRPC client) can throw anything.

The XML docs and shared contract describe the result as Unhealthy "when it returns false, throws an RpcException, or times out"; they do not promise to handle arbitrary exceptions, and the code does not. By contrast, the sibling DatabaseHealthCheck has a catch-all, catch (Exception ex) → Unhealthy("Database connection failed.", ex) (DatabaseHealthCheck.cs:78), so the two probes are asymmetric. As with Health-001, the framework's HealthCheckService records the escaping exception as Unhealthy, so no endpoint crashes, but the dependency-named, controlled description ("{name} probe failed…") is lost, and the library violates its own "must not throw past the IHealthCheck boundary" intent.

Recommendation

Add a trailing catch (Exception ex) returning HealthCheckResult.Unhealthy($"{name} probe failed: {ex.Message}", ex), after the existing OCE/Rpc handlers (so external cancellation still propagates). This brings the gRPC probe in line with DatabaseHealthCheck and the documented contract.
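The handler ordering can be sketched as follows. This is an illustration under assumptions, not the library's code: the class name, the `Func<CancellationToken, Task<bool>>` probe signature, the linked-token timeout, and the description strings other than the "{name} probe failed" pattern are all hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Core;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical probe wrapper; only the catch ladder is the point here.
public sealed class GuardedProbeCheckSketch : IHealthCheck
{
    private readonly string _name;
    private readonly Func<CancellationToken, Task<bool>> _probe;
    private readonly TimeSpan _timeout;

    public GuardedProbeCheckSketch(
        string name, Func<CancellationToken, Task<bool>> probe, TimeSpan timeout)
    {
        _name = name;
        _probe = probe;
        _timeout = timeout;
    }

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Assumption: the timeout is implemented with a linked token source.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(_timeout);

        try
        {
            return await _probe(cts.Token)
                ? HealthCheckResult.Healthy($"{_name} reachable.")
                : HealthCheckResult.Unhealthy($"{_name} probe returned false.");
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            throw; // external cancellation still propagates to the caller
        }
        catch (OperationCanceledException)
        {
            return HealthCheckResult.Unhealthy($"{_name} probe timed out.");
        }
        catch (RpcException ex)
        {
            return HealthCheckResult.Unhealthy($"{_name} probe failed: {ex.StatusCode}", ex);
        }
        catch (Exception ex)
        {
            // Trailing catch-all: nothing else escapes the IHealthCheck boundary.
            return HealthCheckResult.Unhealthy($"{_name} probe failed: {ex.Message}", ex);
        }
    }
}
```

The catch-all has to come last in any case: C# rejects a general `catch (Exception)` placed before a more specific handler, so the OCE and Rpc handlers keep their priority.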

Resolution

Resolved in 544a6dd, 2026-06-01 — added a trailing catch (Exception ex) to GrpcDependencyHealthCheck.CheckHealthAsync returning Unhealthy("{name} probe failed: {ex.Message}", ex) after the OCE/Rpc external-cancellation handlers; new test asserts a probe throwing InvalidOperationException maps to Unhealthy.

Health-003 — Null description is omitted from the JSON body instead of emitted as null

Severity Low
Category Spec & shared-contract adherence
Status Resolved
Location ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/ZbHealthWriter.cs:32

Description

SerializerOptions sets DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull, so a check that produces no description renders an entry with no description key at all: { "status": "Healthy", "durationMs": 0.1 }. The spec models description as a string? that "may be null" and both the canonical example (SPEC.md §3) and the HealthChecks.UI.Client shape the writer claims to mirror emit the key with a null value rather than dropping it. Consumers/dashboards parsing the body and reading entries.<name>.description must therefore handle a missing property, not just a null one. The deviation is undocumented and untested — ResponseWriterTests only asserts the present-description case.

Low because the aggregate status/HTTP-code contract that orchestrators key on is unaffected; only the descriptive sub-field shape drifts.

Recommendation

Either remove WhenWritingNull (so description: null is emitted, matching the spec example and the UI-client shape), or document explicitly in the writer's XML docs and SPEC.md §3 that absent descriptions are omitted, and add a test pinning the chosen behaviour.
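The two shapes can be reproduced with System.Text.Json alone. In this sketch the `Entry` record and the camel-case naming policy are assumptions standing in for the writer's real entry type and options:

```csharp
#nullable enable
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical entry shape, for illustration only.
public sealed record Entry(string Status, string? Description, double DurationMs);

public static class JsonShapeSketch
{
    public static string Serialize(Entry entry, bool omitNulls)
    {
        var options = new JsonSerializerOptions
        {
            PropertyNamingPolicy = JsonNamingPolicy.CamelCase,
            // WhenWritingNull drops null-valued properties; Never (the default) emits them.
            DefaultIgnoreCondition = omitNulls
                ? JsonIgnoreCondition.WhenWritingNull
                : JsonIgnoreCondition.Never,
        };
        return JsonSerializer.Serialize(entry, options);
    }
}
```

For `new Entry("Healthy", null, 0.1)`, `omitNulls: true` yields `{"status":"Healthy","durationMs":0.1}`, while `omitNulls: false` yields `{"status":"Healthy","description":null,"durationMs":0.1}`. A test pinning the chosen behaviour can assert on exactly these strings.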

Resolution

Resolved in 544a6dd, 2026-06-01 — removed DefaultIgnoreCondition = WhenWritingNull from ZbHealthWriter.SerializerOptions so a null description renders as "description": null (matching the spec example / UI-client shape); writer XML doc now states the key is always present, and a new ResponseWriterTests case pins present-and-null.

Health-004 — XML docs recommend the active tag for ready-tier probes, contradicting the spec

Severity Low
Category Documentation & XML docs
Status Resolved
Location ZB.MOM.WW.Health/src/ZB.MOM.WW.Health.Akka/AkkaClusterHealthCheck.cs:11, ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/GrpcDependencyHealthCheck.cs:21

Description

AkkaClusterHealthCheck's summary says "Register to the ZbHealthTags.Ready tag (recommended [ready, active])", and GrpcDependencyHealthCheck likewise recommends tagging [ready, active]. The spec is unambiguous that both probes belong to the ready tier only: §2.2 "Registered to the ready tag" (Akka cluster) and §2.4 "Registered to the ready tag" (gRPC dependency). The active tier is reserved by §1 / §2.3 for the leader/active-singleton probe (ActiveNodeHealthCheck); putting cluster-membership or gRPC-reachability checks on the active tier pollutes the active-node tier with non-leadership concerns and contradicts the convergence convention the library exists to enforce.

Recommendation

Bring the XML-doc recommendations in line with the spec — recommend ZbHealthTags.Ready only for both probes (drop the active suggestion); or, if the dual-tag recommendation is intentional, reconcile it back into SPEC.md §2.2/§2.4 so code and spec agree.

Resolution

Resolved in 544a6dd, 2026-06-01 — updated the XML summaries on AkkaClusterHealthCheck and GrpcDependencyHealthCheck to recommend ZbHealthTags.Ready only (dropped the [ready, active] suggestion), aligning the docs with SPEC §2.2/§2.4.

Health-005 — MapZbHealth returns only the readiness builder, silently dropping conventions for the active/live tiers

Severity Low
Category Public API surface & compatibility
Status Resolved
Location ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/ZbHealthEndpointExtensions.cs:64

Description

MapZbHealth maps three endpoints but returns the IEndpointConventionBuilder for /health/ready only. Any convention a caller chains onto the result — .RequireHost(...), .RequireAuthorization(), .WithMetadata(...), .CacheOutput() — applies to the readiness endpoint alone; the active and liveness endpoints silently do not receive it. The behaviour is documented in the XML <returns>, so this is a deliberate, disclosed trade-off rather than a bug, but it is a real footgun: the most natural reading of endpoints.MapZbHealth().RequireHost("…") is "gate all three health endpoints", and the result type gives the caller no signal that two of the three are excluded.

Recommendation

Consider returning a composite builder that fans conventions out to all three endpoints (collect the three IEndpointConventionBuilders and wrap them) so chained conventions behave as a caller expects. If the single-builder return is kept for simplicity, strengthen the <returns> remark to a warning and note it in the README's MapZbHealth example.
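One possible shape for such a composite, as a sketch rather than a definitive implementation. The class name `CompositeConventionBuilderSketch` is hypothetical, and the `Finally` override assumes .NET 7 or later, where `IEndpointConventionBuilder` declares that member:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.AspNetCore.Builder;

// Hypothetical composite: forwards every convention to each wrapped builder.
public sealed class CompositeConventionBuilderSketch : IEndpointConventionBuilder
{
    private readonly IReadOnlyList<IEndpointConventionBuilder> _builders;

    public CompositeConventionBuilderSketch(params IEndpointConventionBuilder[] builders)
        => _builders = builders;

    // Called by extension methods such as RequireHost or RequireAuthorization.
    public void Add(Action<EndpointBuilder> convention)
    {
        foreach (var builder in _builders)
            builder.Add(convention);
    }

    // Conventions that must run after all Add conventions (.NET 7+).
    public void Finally(Action<EndpointBuilder> finallyConvention)
    {
        foreach (var builder in _builders)
            builder.Finally(finallyConvention);
    }
}
```

A mapping method would collect the three builders returned by its `MapHealthChecks` calls and return `new CompositeConventionBuilderSketch(ready, active, live)`, so a chained `.RequireHost(...)` reaches every tier. The variable names here are placeholders.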

Resolution

Resolved in 544a6dd, 2026-06-01 — MapZbHealth now returns a private CompositeEndpointConventionBuilder that fans Add/Finally to all three endpoint builders (readiness, active, liveness); <returns> docs updated, and a new TierMappingTests case proves .RequireHost(...) gates all three tiers.

Health-006 — XML documentation is not emitted into the packed nupkgs

Severity Low
Category Packaging, dependencies & project layout
Status Resolved
Location ZB.MOM.WW.Health/Directory.Build.props:3, ZB.MOM.WW.Health/src/ZB.MOM.WW.Health/ZB.MOM.WW.Health.csproj:3

Description

Every public type and member in all three packages carries thorough, accurate XML documentation, but no project (and neither Directory.Build.props nor the three .csproj files) sets <GenerateDocumentationFile>true</GenerateDocumentationFile>. As a result dotnet pack produces nupkgs that contain the DLLs but no .xml doc files, so consumers of ZB.MOM.WW.Health / .Akka / .EntityFrameworkCore get no IntelliSense summaries or parameter docs for the API. For a shared library whose value proposition is a documented common surface across three apps, this discards the documentation effort at the package boundary. (It also means the compiler does not enforce the "public members are documented" invariant via CS1591.)

Recommendation

Set <GenerateDocumentationFile>true</GenerateDocumentationFile> once in Directory.Build.props so all three packable projects emit and pack their XML docs. Optionally pair with targeted <NoWarn> if any CS1591 warnings surface on currently-undocumented members.
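A minimal fragment showing the property in a Directory.Build.props file. The file's existing contents are not shown in this review, so the surrounding elements are an assumption, and the blanket CS1591 suppression is only one option:

```xml
<Project>
  <PropertyGroup>
    <!-- Emit the XML doc file next to each assembly so dotnet pack includes it. -->
    <GenerateDocumentationFile>true</GenerateDocumentationFile>
    <!-- Optional: suppress "missing XML comment" warnings on undocumented members. -->
    <NoWarn>$(NoWarn);CS1591</NoWarn>
  </PropertyGroup>
</Project>
```

Appending to `$(NoWarn)` rather than overwriting it keeps any suppressions that individual projects already define.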

Resolution

Resolved in 544a6dd, 2026-06-01 — set <GenerateDocumentationFile>true</GenerateDocumentationFile> (plus NoWarn=CS1591 for test/non-packed members) in Directory.Build.props; verified dotnet pack -c Release now ships lib/net10.0/*.xml in all three nupkgs.