fix(health-monitoring): resolve HealthMonitoring-013,014,016 — shorter-timeout cadence, options validation, injected TimeProvider; HealthMonitoring-015 left open (cross-module design decision)
This commit is contained in:
@@ -8,7 +8,7 @@
|
||||
| Last reviewed | 2026-05-17 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `39d737e` |
|
||||
| Open findings | 4 |
|
||||
| Open findings | 1 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -585,7 +585,7 @@ the contract is now honest and no further code change was required.
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Documentation & comments |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:194-196` |
|
||||
|
||||
**Description**
|
||||
@@ -617,7 +617,16 @@ that the cadence is derived from `OfflineTimeout` only (acceptable while the def
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-17. Root cause confirmed against source — `ExecuteAsync` derived the
|
||||
`PeriodicTimer` cadence solely from `OfflineTimeout` while the comment claimed it
|
||||
tracked the "(shorter)" timeout. Took the code-matches-comment option: extracted the
|
||||
cadence into `CentralHealthAggregator.ComputeCheckInterval`, which now derives it from
|
||||
half of the *shorter* of `OfflineTimeout` and `CentralOfflineTimeout`, so a
|
||||
`CentralOfflineTimeout` configured below `OfflineTimeout` is still polled at least
|
||||
twice within its window. The comment was rewritten to match. Regression test
|
||||
`CentralHealthAggregatorTests.CheckInterval_IsHalfTheShorterTimeout` asserts the
|
||||
default case (30s) and the shorter-`CentralOfflineTimeout` case (10s) — the latter
|
||||
would have returned 30s against the pre-fix code.
|
||||
|
||||
### HealthMonitoring-014 — `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service
|
||||
|
||||
@@ -625,7 +634,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.HealthMonitoring/HealthMonitoringOptions.cs:3-20`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:196`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:67`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:63` |
|
||||
|
||||
**Description**
|
||||
@@ -653,7 +662,21 @@ configuration fails fast with a message naming the section and key.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-17. Root cause confirmed — `HealthMonitoringOptions` had no
|
||||
validator, so a zero/negative interval reached `new PeriodicTimer(...)` and crashed
|
||||
the hosted service with an opaque `ArgumentOutOfRangeException`. Added
|
||||
`HealthMonitoringOptionsValidator : IValidateOptions<HealthMonitoringOptions>` that
|
||||
rejects non-positive `ReportInterval`/`OfflineTimeout`/`CentralOfflineTimeout` and a
|
||||
`CentralOfflineTimeout` shorter than `OfflineTimeout`, each failure naming the
|
||||
`ScadaLink:HealthMonitoring` config key. It is registered (idempotently, via
|
||||
`TryAddEnumerable`) by all three `ServiceCollectionExtensions` registration methods,
|
||||
so it fires when the hosted services resolve `IOptions.Value` at startup — failing
|
||||
fast with a clear message. (`ValidateOnStart()` lives in the Host module's binding
|
||||
call, which is out of scope; the validator nonetheless runs at startup because the
|
||||
hosted-service constructors resolve the options eagerly — matching the existing
|
||||
`ClusterOptionsValidator` registration pattern.) Regression tests in
|
||||
`HealthMonitoringOptionsValidatorTests` cover the valid default plus zero/negative
|
||||
intervals and the `CentralOfflineTimeout < OfflineTimeout` case.
|
||||
|
||||
### HealthMonitoring-015 — Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt`
|
||||
|
||||
@@ -689,7 +712,25 @@ nullable option is safer and matches the existing `LatestReport` treatment.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
_Unresolved — left Open; requires a coordinated cross-module change._ Root cause
|
||||
confirmed against source: `CentralHealthAggregator.MarkHeartbeat` registers an
|
||||
unknown site with `LastReportReceivedAt = default` (`0001-01-01`), and the audit of
|
||||
the single UI reader (`src/ScadaLink.CentralUI/Components/Pages/Monitoring/Health.razor:74`)
|
||||
confirms the bug is live — it passes `state.LastReportReceivedAt` straight into the
|
||||
`TimestampDisplay` component, which would render the year-0001 sentinel verbatim for
|
||||
a heartbeat-only site. The finding's recommended fix — change the field to
|
||||
`DateTimeOffset?` — is the correct one, but it is a **breaking public-API change**:
|
||||
`TimestampDisplay.Value` is a non-nullable `DateTimeOffset` parameter, so making
|
||||
`LastReportReceivedAt` nullable stops `Health.razor` compiling, and the
|
||||
`CentralUI` module is explicitly outside this task's edit scope. The only fix that
|
||||
stays module-local would be to stamp the heartbeat time into `LastReportReceivedAt`,
|
||||
which is semantically dishonest (the field documents "time the latest full report
|
||||
was processed") and would simply move the bug rather than fix it. The correct
|
||||
resolution therefore needs the HealthMonitoring change (`DateTimeOffset?`) **and** a
|
||||
matching `CentralUI` change (a null-checked `TimestampDisplay` or an "awaiting first
|
||||
report" placeholder) applied together — a design decision spanning two modules.
|
||||
**Called out for follow-up: open as a coordinated HealthMonitoring + CentralUI work
|
||||
item.**
|
||||
|
||||
### HealthMonitoring-016 — `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider`
|
||||
|
||||
@@ -697,7 +738,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Code organization & conventions |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:151` |
|
||||
|
||||
**Description**
|
||||
@@ -722,4 +763,15 @@ testable and the module is consistent.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-17. Root cause confirmed — `CollectReport` stamped `ReportTimestamp`
|
||||
from `DateTimeOffset.UtcNow` directly while every other time-dependent class in the
|
||||
module takes an injectable `TimeProvider`. Added an optional `TimeProvider`
|
||||
constructor parameter to `SiteHealthCollector` (defaulting to `TimeProvider.System`,
|
||||
mirroring `CentralHealthAggregator`/`HealthReportSender`/`CentralHealthReportLoop`)
|
||||
and `CollectReport` now derives `ReportTimestamp` from `_timeProvider.GetUtcNow()`.
|
||||
The DI registration (`AddSingleton<ISiteHealthCollector, SiteHealthCollector>`)
|
||||
continues to work via the optional parameter. Regression test
|
||||
`SiteHealthCollectorTests.CollectReport_StampsTimestamp_FromInjectedTimeProvider`
|
||||
asserts the timestamp equals a fixed injected instant exactly (not just a
|
||||
before/after window); it would not compile against the pre-fix single-arg-less
|
||||
constructor.
|
||||
|
||||
Reference in New Issue
Block a user