fix(health-monitoring): resolve HealthMonitoring-013,014,016 — shorter-timeout cadence, options validation, injected TimeProvider; HealthMonitoring-015 left open (cross-module design decision)

This commit is contained in:
Joseph Doherty
2026-05-17 03:18:24 -04:00
parent da8c9f171b
commit eae4077414
8 changed files with 296 additions and 12 deletions

View File

@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 4 |
| Open findings | 1 |
## Summary
@@ -585,7 +585,7 @@ the contract is now honest and no further code change was required.
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:194-196` |
**Description**
@@ -617,7 +617,16 @@ that the cadence is derived from `OfflineTimeout` only (acceptable while the def
**Resolution**
_Unresolved._
Resolved 2026-05-17. Root cause confirmed against source — `ExecuteAsync` derived the
`PeriodicTimer` cadence solely from `OfflineTimeout` while the comment claimed it
tracked the "(shorter)" timeout. Took the code-matches-comment option: extracted the
cadence into `CentralHealthAggregator.ComputeCheckInterval`, which now derives it from
half of the *shorter* of `OfflineTimeout` and `CentralOfflineTimeout`, so a
`CentralOfflineTimeout` configured below `OfflineTimeout` is still polled at least
twice within its window. The comment was rewritten to match. Regression test
`CentralHealthAggregatorTests.CheckInterval_IsHalfTheShorterTimeout` asserts the
default case (30s) and the shorter-`CentralOfflineTimeout` case (10s) — the latter
would have returned 30s against the pre-fix code.
### HealthMonitoring-014 — `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service
@@ -625,7 +634,7 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.HealthMonitoring/HealthMonitoringOptions.cs:3-20`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:196`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:67`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:63` |
**Description**
@@ -653,7 +662,21 @@ configuration fails fast with a message naming the section and key.
**Resolution**
_Unresolved._
Resolved 2026-05-17. Root cause confirmed — `HealthMonitoringOptions` had no
validator, so a zero/negative interval reached `new PeriodicTimer(...)` and crashed
the hosted service with an opaque `ArgumentOutOfRangeException`. Added
`HealthMonitoringOptionsValidator : IValidateOptions<HealthMonitoringOptions>` that
rejects non-positive `ReportInterval`/`OfflineTimeout`/`CentralOfflineTimeout` and a
`CentralOfflineTimeout` shorter than `OfflineTimeout`, each failure naming the
`ScadaLink:HealthMonitoring` config key. It is registered (idempotently, via
`TryAddEnumerable`) by all three `ServiceCollectionExtensions` registration methods,
so it fires when the hosted services resolve `IOptions.Value` at startup — failing
fast with a clear message. (`ValidateOnStart()` lives in the Host module's binding
call, which is out of scope; the validator nonetheless runs at startup because the
hosted-service constructors resolve the options eagerly — matching the existing
`ClusterOptionsValidator` registration pattern.) Regression tests in
`HealthMonitoringOptionsValidatorTests` cover the valid default plus zero/negative
intervals and the `CentralOfflineTimeout < OfflineTimeout` case.
### HealthMonitoring-015 — Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt`
@@ -689,7 +712,25 @@ nullable option is safer and matches the existing `LatestReport` treatment.
**Resolution**
_Unresolved._
_Unresolved — left Open; requires a coordinated cross-module change._ Root cause
confirmed against source: `CentralHealthAggregator.MarkHeartbeat` registers an
unknown site with `LastReportReceivedAt = default` (`0001-01-01`), and the audit of
the single UI reader (`src/ScadaLink.CentralUI/Components/Pages/Monitoring/Health.razor:74`)
confirms the bug is live — it passes `state.LastReportReceivedAt` straight into the
`TimestampDisplay` component, which would render the year-0001 sentinel verbatim for
a heartbeat-only site. The finding's recommended fix — change the field to
`DateTimeOffset?` — is the correct one, but it is a **breaking public-API change**:
`TimestampDisplay.Value` is a non-nullable `DateTimeOffset` parameter, so making
`LastReportReceivedAt` nullable stops `Health.razor` compiling, and the
`CentralUI` module is explicitly outside this task's edit scope. The only fix that
stays module-local would be to stamp the heartbeat time into `LastReportReceivedAt`,
which is semantically dishonest (the field documents "time the latest full report
was processed") and would simply move the bug rather than fix it. The correct
resolution therefore needs the HealthMonitoring change (`DateTimeOffset?`) **and** a
matching `CentralUI` change (a null-checked `TimestampDisplay` or an "awaiting first
report" placeholder) applied together — a design decision spanning two modules.
**Called out for follow-up: open as a coordinated HealthMonitoring + CentralUI work
item.**
### HealthMonitoring-016 — `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider`
@@ -697,7 +738,7 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:151` |
**Description**
@@ -722,4 +763,15 @@ testable and the module is consistent.
**Resolution**
_Unresolved._
Resolved 2026-05-17. Root cause confirmed — `CollectReport` stamped `ReportTimestamp`
from `DateTimeOffset.UtcNow` directly while every other time-dependent class in the
module takes an injectable `TimeProvider`. Added an optional `TimeProvider`
constructor parameter to `SiteHealthCollector` (defaulting to `TimeProvider.System`,
mirroring `CentralHealthAggregator`/`HealthReportSender`/`CentralHealthReportLoop`)
and `CollectReport` now derives `ReportTimestamp` from `_timeProvider.GetUtcNow()`.
The DI registration (`AddSingleton<ISiteHealthCollector, SiteHealthCollector>`)
continues to work via the optional parameter. Regression test
`SiteHealthCollectorTests.CollectReport_StampsTimestamp_FromInjectedTimeProvider`
asserts the timestamp equals a fixed injected instant exactly (not just a
before/after window); it would not compile against the pre-fix single-arg-less
constructor.

View File

@@ -191,9 +191,10 @@ public class CentralHealthAggregator : BackgroundService, ICentralHealthAggregat
"Central health aggregator started, offline timeout {Timeout}s (central {CentralTimeout}s)",
_options.OfflineTimeout.TotalSeconds, _options.CentralOfflineTimeout.TotalSeconds);
// Check at half the (shorter) offline timeout interval for timely detection
var checkInterval = TimeSpan.FromMilliseconds(_options.OfflineTimeout.TotalMilliseconds / 2);
using var timer = new PeriodicTimer(checkInterval);
// Check at half the shorter of the two offline timeouts so detection is
// timely for whichever site class (real or "central") has the tighter
// window — see ComputeCheckInterval.
using var timer = new PeriodicTimer(ComputeCheckInterval(_options));
while (await timer.WaitForNextTickAsync(stoppingToken).ConfigureAwait(false))
{
@@ -201,6 +202,24 @@ public class CentralHealthAggregator : BackgroundService, ICentralHealthAggregat
}
}
/// <summary>
/// Computes the offline-check timer cadence: half of the <em>shorter</em> of
/// <see cref="HealthMonitoringOptions.OfflineTimeout"/> and
/// <see cref="HealthMonitoringOptions.CentralOfflineTimeout"/>. Deriving it
/// from the shorter timeout guarantees that whichever site class has the
/// tighter window is still polled at least twice within it — so if an
/// operator configures <c>CentralOfflineTimeout</c> smaller than
/// <c>OfflineTimeout</c>, central offline detection is not delayed by up to a
/// full <c>OfflineTimeout / 2</c>.
/// </summary>
internal static TimeSpan ComputeCheckInterval(HealthMonitoringOptions options)
{
var shorter = options.OfflineTimeout < options.CentralOfflineTimeout
? options.OfflineTimeout
: options.CentralOfflineTimeout;
return TimeSpan.FromMilliseconds(shorter.TotalMilliseconds / 2);
}
internal void CheckForOfflineSites()
{
var now = _timeProvider.GetUtcNow();

View File

@@ -0,0 +1,59 @@
using Microsoft.Extensions.Options;
namespace ScadaLink.HealthMonitoring;
/// <summary>
/// HealthMonitoring-014: validates <see cref="HealthMonitoringOptions"/> at
/// startup. The interval values are fed straight into <c>new PeriodicTimer(...)</c>
/// (and into a division for the offline-check cadence); a zero or negative value
/// makes <see cref="PeriodicTimer"/>'s constructor throw
/// <see cref="ArgumentOutOfRangeException"/>, crashing the
/// <see cref="HealthReportSender"/> / <see cref="CentralHealthReportLoop"/> /
/// <see cref="CentralHealthAggregator"/> hosted service with an opaque exception
/// that does not name the offending config key. Registered with
/// <c>ValidateOnStart()</c> so a bad <c>ScadaLink:HealthMonitoring</c> section
/// fails fast at boot with a clear, key-naming message.
/// </summary>
public sealed class HealthMonitoringOptionsValidator : IValidateOptions<HealthMonitoringOptions>
{
public ValidateOptionsResult Validate(string? name, HealthMonitoringOptions options)
{
var failures = new List<string>();
if (options.ReportInterval <= TimeSpan.Zero)
{
failures.Add(
$"ScadaLink:HealthMonitoring:ReportInterval must be a positive duration " +
$"(was {options.ReportInterval}); it is used directly as a PeriodicTimer period.");
}
if (options.OfflineTimeout <= TimeSpan.Zero)
{
failures.Add(
$"ScadaLink:HealthMonitoring:OfflineTimeout must be a positive duration " +
$"(was {options.OfflineTimeout}); it drives the offline-check PeriodicTimer cadence.");
}
if (options.CentralOfflineTimeout <= TimeSpan.Zero)
{
failures.Add(
$"ScadaLink:HealthMonitoring:CentralOfflineTimeout must be a positive duration " +
$"(was {options.CentralOfflineTimeout}).");
}
if (options.OfflineTimeout > TimeSpan.Zero
&& options.CentralOfflineTimeout > TimeSpan.Zero
&& options.CentralOfflineTimeout < options.OfflineTimeout)
{
failures.Add(
$"ScadaLink:HealthMonitoring:CentralOfflineTimeout ({options.CentralOfflineTimeout}) " +
$"must be >= OfflineTimeout ({options.OfflineTimeout}): the synthetic 'central' site has " +
"no heartbeat source and is fed only by the slower self-report loop, so it needs at " +
"least as much offline grace as a real site.");
}
return failures.Count > 0
? ValidateOptionsResult.Fail(failures)
: ValidateOptionsResult.Success;
}
}

View File

@@ -1,4 +1,6 @@
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.DependencyInjection.Extensions;
using Microsoft.Extensions.Options;
namespace ScadaLink.HealthMonitoring;
@@ -10,6 +12,7 @@ public static class ServiceCollectionExtensions
/// </summary>
public static IServiceCollection AddSiteHealthMonitoring(this IServiceCollection services)
{
AddOptionsValidation(services);
services.AddSingleton<ISiteHealthCollector, SiteHealthCollector>();
services.AddHostedService<HealthReportSender>();
return services;
@@ -21,6 +24,7 @@ public static class ServiceCollectionExtensions
/// </summary>
public static IServiceCollection AddHealthMonitoring(this IServiceCollection services)
{
AddOptionsValidation(services);
services.AddSingleton<ISiteHealthCollector, SiteHealthCollector>();
return services;
}
@@ -32,10 +36,27 @@ public static class ServiceCollectionExtensions
/// </summary>
public static IServiceCollection AddCentralHealthAggregation(this IServiceCollection services)
{
AddOptionsValidation(services);
services.AddSingleton<CentralHealthAggregator>();
services.AddSingleton<ICentralHealthAggregator>(sp => sp.GetRequiredService<CentralHealthAggregator>());
services.AddHostedService(sp => sp.GetRequiredService<CentralHealthAggregator>());
services.AddHostedService<CentralHealthReportLoop>();
return services;
}
/// <summary>
/// HealthMonitoring-014: register the <see cref="HealthMonitoringOptionsValidator"/>
/// so a misconfigured <c>ScadaLink:HealthMonitoring</c> section (zero/negative
/// intervals, or a <c>CentralOfflineTimeout</c> shorter than
/// <c>OfflineTimeout</c>) is rejected with a clear, key-naming message when the
/// hosted services resolve their options at startup — rather than crashing
/// later inside a <see cref="PeriodicTimer"/> constructor with an opaque
/// <see cref="ArgumentOutOfRangeException"/>. Idempotent so it is safe when
/// more than one of the registration methods above is called.
/// </summary>
private static void AddOptionsValidation(IServiceCollection services)
{
services.TryAddEnumerable(
ServiceDescriptor.Singleton<IValidateOptions<HealthMonitoringOptions>, HealthMonitoringOptionsValidator>());
}
}

View File

@@ -23,6 +23,18 @@ public class SiteHealthCollector : ISiteHealthCollector
private volatile string _nodeHostname = "";
private volatile IReadOnlyList<Commons.Messages.Health.NodeStatus>? _clusterNodes;
private volatile bool _isActiveNode;
private readonly TimeProvider _timeProvider;
/// <summary>
/// Creates a collector. The <paramref name="timeProvider"/> stamps each
/// report's timestamp; it defaults to <see cref="TimeProvider.System"/> and
/// is injectable so the report timestamp is deterministically testable —
/// consistent with the rest of the module's time-dependent classes.
/// </summary>
public SiteHealthCollector(TimeProvider? timeProvider = null)
{
_timeProvider = timeProvider ?? TimeProvider.System;
}
/// <summary>
/// Increment the script error counter. Covers unhandled exceptions,
@@ -148,7 +160,7 @@ public class SiteHealthCollector : ISiteHealthCollector
return new SiteHealthReport(
SiteId: siteId,
SequenceNumber: 0, // Caller (HealthReportSender) assigns the sequence number
ReportTimestamp: DateTimeOffset.UtcNow,
ReportTimestamp: _timeProvider.GetUtcNow(),
DataConnectionStatuses: connectionStatuses,
TagResolutionCounts: tagResolution,
ScriptErrorCount: scriptErrors,

View File

@@ -307,6 +307,36 @@ public class CentralHealthAggregatorTests
Assert.False(_aggregator.GetSiteState(CentralHealthReportLoop.CentralSiteId)!.IsOnline);
}
/// <summary>
/// HealthMonitoring-013 regression: the offline-check cadence must be derived
/// from the *shorter* of <see cref="HealthMonitoringOptions.OfflineTimeout"/>
/// and <see cref="HealthMonitoringOptions.CentralOfflineTimeout"/>, so that if
/// an operator configures <c>CentralOfflineTimeout</c> smaller than
/// <c>OfflineTimeout</c>, central offline detection is still timely instead of
/// being delayed by up to a full <c>OfflineTimeout / 2</c>.
/// </summary>
[Fact]
public void CheckInterval_IsHalfTheShorterTimeout()
{
// Default: OfflineTimeout (60s) is the shorter of the two.
Assert.Equal(
TimeSpan.FromSeconds(30),
CentralHealthAggregator.ComputeCheckInterval(new HealthMonitoringOptions
{
OfflineTimeout = TimeSpan.FromSeconds(60),
CentralOfflineTimeout = TimeSpan.FromMinutes(3)
}));
// Operator configures CentralOfflineTimeout shorter — cadence must adapt.
Assert.Equal(
TimeSpan.FromSeconds(10),
CentralHealthAggregator.ComputeCheckInterval(new HealthMonitoringOptions
{
OfflineTimeout = TimeSpan.FromSeconds(60),
CentralOfflineTimeout = TimeSpan.FromSeconds(20)
}));
}
[Fact]
public void SequenceNumberReset_RejectedUntilExceedsPrevMax()
{

View File

@@ -0,0 +1,73 @@
using Microsoft.Extensions.Options;
namespace ScadaLink.HealthMonitoring.Tests;
/// <summary>
/// HealthMonitoring-014 regression: <see cref="HealthMonitoringOptions"/> intervals
/// are fed straight into <c>new PeriodicTimer(...)</c>, which throws
/// <see cref="ArgumentOutOfRangeException"/> for a zero/negative period. A
/// misconfigured <c>appsettings.json</c> must be rejected by an
/// <see cref="IValidateOptions{TOptions}"/> with a clear, key-naming message
/// rather than crashing the hosted service with an opaque exception.
/// </summary>
public class HealthMonitoringOptionsValidatorTests
{
private static ValidateOptionsResult Validate(HealthMonitoringOptions options) =>
new HealthMonitoringOptionsValidator().Validate(Options.DefaultName, options);
[Fact]
public void DefaultOptions_AreValid()
{
var result = Validate(new HealthMonitoringOptions());
Assert.True(result.Succeeded, result.FailureMessage);
}
[Fact]
public void ZeroReportInterval_IsRejected()
{
var result = Validate(new HealthMonitoringOptions { ReportInterval = TimeSpan.Zero });
Assert.True(result.Failed);
Assert.Contains("ReportInterval", result.FailureMessage);
}
[Fact]
public void NegativeReportInterval_IsRejected()
{
var result = Validate(new HealthMonitoringOptions { ReportInterval = TimeSpan.FromSeconds(-1) });
Assert.True(result.Failed);
Assert.Contains("ReportInterval", result.FailureMessage);
}
[Fact]
public void ZeroOfflineTimeout_IsRejected()
{
var result = Validate(new HealthMonitoringOptions { OfflineTimeout = TimeSpan.Zero });
Assert.True(result.Failed);
Assert.Contains("OfflineTimeout", result.FailureMessage);
}
[Fact]
public void ZeroCentralOfflineTimeout_IsRejected()
{
var result = Validate(new HealthMonitoringOptions { CentralOfflineTimeout = TimeSpan.Zero });
Assert.True(result.Failed);
Assert.Contains("CentralOfflineTimeout", result.FailureMessage);
}
[Fact]
public void CentralOfflineTimeout_ShorterThanOfflineTimeout_IsRejected()
{
var result = Validate(new HealthMonitoringOptions
{
OfflineTimeout = TimeSpan.FromSeconds(60),
CentralOfflineTimeout = TimeSpan.FromSeconds(30)
});
Assert.True(result.Failed);
Assert.Contains("CentralOfflineTimeout", result.FailureMessage);
}
}

View File

@@ -131,6 +131,24 @@ public class SiteHealthCollectorTests
Assert.InRange(report.ReportTimestamp, before, after);
}
/// <summary>
/// HealthMonitoring-016 regression: <see cref="SiteHealthCollector.CollectReport"/>
/// must stamp <c>ReportTimestamp</c> from an injected <see cref="TimeProvider"/>
/// (consistent with the rest of the module), not directly from
/// <c>DateTimeOffset.UtcNow</c>, so the report timestamp is deterministically
/// testable against a known instant.
/// </summary>
[Fact]
public void CollectReport_StampsTimestamp_FromInjectedTimeProvider()
{
var fixedInstant = new DateTimeOffset(2026, 5, 17, 9, 30, 0, TimeSpan.Zero);
var collector = new SiteHealthCollector(new TestTimeProvider(fixedInstant));
var report = collector.CollectReport("site-1");
Assert.Equal(fixedInstant, report.ReportTimestamp);
}
[Fact]
public void CollectReport_SequenceNumberIsZero_CallerAssignsIt()
{