fix(health-monitoring): resolve HealthMonitoring-013,014,016 — shorter-timeout cadence, options validation, injected TimeProvider; HealthMonitoring-015 left open (cross-module design decision)

This commit is contained in:
Joseph Doherty
2026-05-17 03:18:24 -04:00
parent da8c9f171b
commit eae4077414
8 changed files with 296 additions and 12 deletions

View File

@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 4 |
| Open findings | 1 |
## Summary
@@ -585,7 +585,7 @@ the contract is now honest and no further code change was required.
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:194-196` |
**Description**
@@ -617,7 +617,16 @@ that the cadence is derived from `OfflineTimeout` only (acceptable while the def
**Resolution**
_Unresolved._
Resolved 2026-05-17. Root cause confirmed against source — `ExecuteAsync` derived the
`PeriodicTimer` cadence solely from `OfflineTimeout` while the comment claimed it
tracked the "(shorter)" timeout. Took the code-matches-comment option: extracted the
cadence into `CentralHealthAggregator.ComputeCheckInterval`, which now derives it from
half of the *shorter* of `OfflineTimeout` and `CentralOfflineTimeout`, so a
`CentralOfflineTimeout` configured below `OfflineTimeout` is still polled at least
twice within its window. The comment was rewritten to match. Regression test
`CentralHealthAggregatorTests.CheckInterval_IsHalfTheShorterTimeout` asserts the
default case (30s) and the shorter-`CentralOfflineTimeout` case (10s) — the latter
would have returned 30s against the pre-fix code.
### HealthMonitoring-014 — `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service
@@ -625,7 +634,7 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.HealthMonitoring/HealthMonitoringOptions.cs:3-20`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:196`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:67`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:63` |
**Description**
@@ -653,7 +662,21 @@ configuration fails fast with a message naming the section and key.
**Resolution**
_Unresolved._
Resolved 2026-05-17. Root cause confirmed — `HealthMonitoringOptions` had no
validator, so a zero/negative interval reached `new PeriodicTimer(...)` and crashed
the hosted service with an opaque `ArgumentOutOfRangeException`. Added
`HealthMonitoringOptionsValidator : IValidateOptions<HealthMonitoringOptions>` that
rejects non-positive `ReportInterval`/`OfflineTimeout`/`CentralOfflineTimeout` and a
`CentralOfflineTimeout` shorter than `OfflineTimeout`, each failure naming the
`ScadaLink:HealthMonitoring` config key. It is registered (idempotently, via
`TryAddEnumerable`) by all three `ServiceCollectionExtensions` registration methods,
so it fires when the hosted services resolve `IOptions.Value` at startup — failing
fast with a clear message. (`ValidateOnStart()` lives in the Host module's binding
call, which is out of scope; the validator nonetheless runs at startup because the
hosted-service constructors resolve the options eagerly — matching the existing
`ClusterOptionsValidator` registration pattern.) Regression tests in
`HealthMonitoringOptionsValidatorTests` cover the valid default plus zero/negative
intervals and the `CentralOfflineTimeout < OfflineTimeout` case.
### HealthMonitoring-015 — Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt`
@@ -689,7 +712,25 @@ nullable option is safer and matches the existing `LatestReport` treatment.
**Resolution**
_Unresolved._
_Unresolved — left Open; requires a coordinated cross-module change._ Root cause
confirmed against source: `CentralHealthAggregator.MarkHeartbeat` registers an
unknown site with `LastReportReceivedAt = default` (`0001-01-01`), and the audit of
the single UI reader (`src/ScadaLink.CentralUI/Components/Pages/Monitoring/Health.razor:74`)
confirms the bug is live — it passes `state.LastReportReceivedAt` straight into the
`TimestampDisplay` component, which would render the year-0001 sentinel verbatim for
a heartbeat-only site. The finding's recommended fix — change the field to
`DateTimeOffset?` — is the correct one, but it is a **breaking public-API change**:
`TimestampDisplay.Value` is a non-nullable `DateTimeOffset` parameter, so making
`LastReportReceivedAt` nullable stops `Health.razor` compiling, and the
`CentralUI` module is explicitly outside this task's edit scope. The only fix that
stays module-local would be to stamp the heartbeat time into `LastReportReceivedAt`,
which is semantically dishonest (the field documents "time the latest full report
was processed") and would simply move the bug rather than fix it. The correct
resolution therefore needs the HealthMonitoring change (`DateTimeOffset?`) **and** a
matching `CentralUI` change (a null-checked `TimestampDisplay` or an "awaiting first
report" placeholder) applied together — a design decision spanning two modules.
**Called out for follow-up: open as a coordinated HealthMonitoring + CentralUI work
item.**
### HealthMonitoring-016 — `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider`
@@ -697,7 +738,7 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:151` |
**Description**
@@ -722,4 +763,15 @@ testable and the module is consistent.
**Resolution**
_Unresolved._
Resolved 2026-05-17. Root cause confirmed — `CollectReport` stamped `ReportTimestamp`
from `DateTimeOffset.UtcNow` directly while every other time-dependent class in the
module takes an injectable `TimeProvider`. Added an optional `TimeProvider`
constructor parameter to `SiteHealthCollector` (defaulting to `TimeProvider.System`,
mirroring `CentralHealthAggregator`/`HealthReportSender`/`CentralHealthReportLoop`)
and `CollectReport` now derives `ReportTimestamp` from `_timeProvider.GetUtcNow()`.
The DI registration (`AddSingleton<ISiteHealthCollector, SiteHealthCollector>`)
continues to work via the optional parameter. Regression test
`SiteHealthCollectorTests.CollectReport_StampsTimestamp_FromInjectedTimeProvider`
asserts the timestamp equals a fixed injected instant exactly (not just a
before/after window); it would not compile against the pre-fix single-arg-less
constructor.