diff --git a/code-reviews/HealthMonitoring/findings.md b/code-reviews/HealthMonitoring/findings.md index a15e9d6..073cd89 100644 --- a/code-reviews/HealthMonitoring/findings.md +++ b/code-reviews/HealthMonitoring/findings.md @@ -8,7 +8,7 @@ | Last reviewed | 2026-05-17 | | Reviewer | claude-agent | | Commit reviewed | `39d737e` | -| Open findings | 4 | +| Open findings | 1 | ## Summary @@ -585,7 +585,7 @@ the contract is now honest and no further code change was required. |--|--| | Severity | Low | | Category | Documentation & comments | -| Status | Open | +| Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:194-196` | **Description** @@ -617,7 +617,16 @@ that the cadence is derived from `OfflineTimeout` only (acceptable while the def **Resolution** -_Unresolved._ +Resolved 2026-05-17. Root cause confirmed against source — `ExecuteAsync` derived the +`PeriodicTimer` cadence solely from `OfflineTimeout` while the comment claimed it +tracked the "(shorter)" timeout. Took the code-matches-comment option: extracted the +cadence into `CentralHealthAggregator.ComputeCheckInterval`, which now derives it from +half of the *shorter* of `OfflineTimeout` and `CentralOfflineTimeout`, so a +`CentralOfflineTimeout` configured below `OfflineTimeout` is still polled at least +twice within its window. The comment was rewritten to match. Regression test +`CentralHealthAggregatorTests.CheckInterval_IsHalfTheShorterTimeout` asserts the +default case (30s) and the shorter-`CentralOfflineTimeout` case (10s) — the latter +would have returned 30s against the pre-fix code. ### HealthMonitoring-014 — `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service @@ -625,7 +634,7 @@ _Unresolved._ |--|--| | Severity | Low | | Category | Error handling & resilience | -| Status | Open | +| Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/HealthMonitoringOptions.cs:3-20`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:196`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:67`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:63` | **Description** @@ -653,7 +662,21 @@ configuration fails fast with a message naming the section and key. **Resolution** -_Unresolved._ +Resolved 2026-05-17. Root cause confirmed — `HealthMonitoringOptions` had no +validator, so a zero/negative interval reached `new PeriodicTimer(...)` and crashed +the hosted service with an opaque `ArgumentOutOfRangeException`. Added +`HealthMonitoringOptionsValidator : IValidateOptions` that +rejects non-positive `ReportInterval`/`OfflineTimeout`/`CentralOfflineTimeout` and a +`CentralOfflineTimeout` shorter than `OfflineTimeout`, each failure naming the +`ScadaLink:HealthMonitoring` config key. It is registered (idempotently, via +`TryAddEnumerable`) by all three `ServiceCollectionExtensions` registration methods, +so it fires when the hosted services resolve `IOptions.Value` at startup — failing +fast with a clear message. (`ValidateOnStart()` lives in the Host module's binding +call, which is out of scope; the validator nonetheless runs at startup because the +hosted-service constructors resolve the options eagerly — matching the existing +`ClusterOptionsValidator` registration pattern.) Regression tests in +`HealthMonitoringOptionsValidatorTests` cover the valid default plus zero/negative +intervals and the `CentralOfflineTimeout < OfflineTimeout` case. ### HealthMonitoring-015 — Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt` @@ -689,7 +712,25 @@ nullable option is safer and matches the existing `LatestReport` treatment. **Resolution** -_Unresolved._ +_Unresolved — left Open; requires a coordinated cross-module change._ Root cause +confirmed against source: `CentralHealthAggregator.MarkHeartbeat` registers an +unknown site with `LastReportReceivedAt = default` (`0001-01-01`), and the audit of +the single UI reader (`src/ScadaLink.CentralUI/Components/Pages/Monitoring/Health.razor:74`) +confirms the bug is live — it passes `state.LastReportReceivedAt` straight into the +`TimestampDisplay` component, which would render the year-0001 sentinel verbatim for +a heartbeat-only site. The finding's recommended fix — change the field to +`DateTimeOffset?` — is the correct one, but it is a **breaking public-API change**: +`TimestampDisplay.Value` is a non-nullable `DateTimeOffset` parameter, so making +`LastReportReceivedAt` nullable stops `Health.razor` compiling, and the +`CentralUI` module is explicitly outside this task's edit scope. The only fix that +stays module-local would be to stamp the heartbeat time into `LastReportReceivedAt`, +which is semantically dishonest (the field documents "time the latest full report +was processed") and would simply move the bug rather than fix it. The correct +resolution therefore needs the HealthMonitoring change (`DateTimeOffset?`) **and** a +matching `CentralUI` change (a null-checked `TimestampDisplay` or an "awaiting first +report" placeholder) applied together — a design decision spanning two modules. +**Called out for follow-up: open as a coordinated HealthMonitoring + CentralUI work +item.** ### HealthMonitoring-016 — `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider` @@ -697,7 +738,7 @@ _Unresolved._ |--|--| | Severity | Low | | Category | Code organization & conventions | -| Status | Open | +| Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:151` | **Description** @@ -722,4 +763,15 @@ testable and the module is consistent. **Resolution** -_Unresolved._ +Resolved 2026-05-17. Root cause confirmed — `CollectReport` stamped `ReportTimestamp` +from `DateTimeOffset.UtcNow` directly while every other time-dependent class in the +module takes an injectable `TimeProvider`. Added an optional `TimeProvider` +constructor parameter to `SiteHealthCollector` (defaulting to `TimeProvider.System`, +mirroring `CentralHealthAggregator`/`HealthReportSender`/`CentralHealthReportLoop`) +and `CollectReport` now derives `ReportTimestamp` from `_timeProvider.GetUtcNow()`. +The DI registration (`AddSingleton`) +continues to work via the optional parameter. Regression test +`SiteHealthCollectorTests.CollectReport_StampsTimestamp_FromInjectedTimeProvider` +asserts the timestamp equals a fixed injected instant exactly (not just a +before/after window); it would not compile against the pre-fix single-arg-less +constructor. diff --git a/src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs b/src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs index 4238b5e..f02986e 100644 --- a/src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs +++ b/src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs @@ -191,9 +191,10 @@ public class CentralHealthAggregator : BackgroundService, ICentralHealthAggregat "Central health aggregator started, offline timeout {Timeout}s (central {CentralTimeout}s)", _options.OfflineTimeout.TotalSeconds, _options.CentralOfflineTimeout.TotalSeconds); - // Check at half the (shorter) offline timeout interval for timely detection - var checkInterval = TimeSpan.FromMilliseconds(_options.OfflineTimeout.TotalMilliseconds / 2); - using var timer = new PeriodicTimer(checkInterval); + // Check at half the shorter of the two offline timeouts so detection is + // timely for whichever site class (real or "central") has the tighter + // window — see ComputeCheckInterval. + using var timer = new PeriodicTimer(ComputeCheckInterval(_options)); while (await timer.WaitForNextTickAsync(stoppingToken).ConfigureAwait(false)) { @@ -201,6 +202,24 @@ public class CentralHealthAggregator : BackgroundService, ICentralHealthAggregat } } + /// + /// Computes the offline-check timer cadence: half of the shorter of + /// and + /// . Deriving it + /// from the shorter timeout guarantees that whichever site class has the + /// tighter window is still polled at least twice within it — so if an + /// operator configures CentralOfflineTimeout smaller than + /// OfflineTimeout, central offline detection is not delayed by up to a + /// full OfflineTimeout / 2. + /// + internal static TimeSpan ComputeCheckInterval(HealthMonitoringOptions options) + { + var shorter = options.OfflineTimeout < options.CentralOfflineTimeout + ? options.OfflineTimeout + : options.CentralOfflineTimeout; + return TimeSpan.FromMilliseconds(shorter.TotalMilliseconds / 2); + } + internal void CheckForOfflineSites() { var now = _timeProvider.GetUtcNow(); diff --git a/src/ScadaLink.HealthMonitoring/HealthMonitoringOptionsValidator.cs b/src/ScadaLink.HealthMonitoring/HealthMonitoringOptionsValidator.cs new file mode 100644 index 0000000..1defbe3 --- /dev/null +++ b/src/ScadaLink.HealthMonitoring/HealthMonitoringOptionsValidator.cs @@ -0,0 +1,59 @@ +using Microsoft.Extensions.Options; + +namespace ScadaLink.HealthMonitoring; + +/// +/// HealthMonitoring-014: validates at +/// startup. The interval values are fed straight into new PeriodicTimer(...) +/// (and into a division for the offline-check cadence); a zero or negative value +/// makes 's constructor throw +/// , crashing the +/// / / +/// hosted service with an opaque exception +/// that does not name the offending config key. Registered with +/// ValidateOnStart() so a bad ScadaLink:HealthMonitoring section +/// fails fast at boot with a clear, key-naming message. +/// +public sealed class HealthMonitoringOptionsValidator : IValidateOptions +{ + public ValidateOptionsResult Validate(string? name, HealthMonitoringOptions options) + { + var failures = new List(); + + if (options.ReportInterval <= TimeSpan.Zero) + { + failures.Add( + $"ScadaLink:HealthMonitoring:ReportInterval must be a positive duration " + + $"(was {options.ReportInterval}); it is used directly as a PeriodicTimer period."); + } + + if (options.OfflineTimeout <= TimeSpan.Zero) + { + failures.Add( + $"ScadaLink:HealthMonitoring:OfflineTimeout must be a positive duration " + + $"(was {options.OfflineTimeout}); it drives the offline-check PeriodicTimer cadence."); + } + + if (options.CentralOfflineTimeout <= TimeSpan.Zero) + { + failures.Add( + $"ScadaLink:HealthMonitoring:CentralOfflineTimeout must be a positive duration " + + $"(was {options.CentralOfflineTimeout})."); + } + + if (options.OfflineTimeout > TimeSpan.Zero + && options.CentralOfflineTimeout > TimeSpan.Zero + && options.CentralOfflineTimeout < options.OfflineTimeout) + { + failures.Add( + $"ScadaLink:HealthMonitoring:CentralOfflineTimeout ({options.CentralOfflineTimeout}) " + + $"must be >= OfflineTimeout ({options.OfflineTimeout}): the synthetic 'central' site has " + + "no heartbeat source and is fed only by the slower self-report loop, so it needs at " + + "least as much offline grace as a real site."); + } + + return failures.Count > 0 + ? ValidateOptionsResult.Fail(failures) + : ValidateOptionsResult.Success; + } +} diff --git a/src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs b/src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs index 9f533cc..4a26c6c 100644 --- a/src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs +++ b/src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs @@ -1,4 +1,6 @@ using Microsoft.Extensions.DependencyInjection; +using Microsoft.Extensions.DependencyInjection.Extensions; +using Microsoft.Extensions.Options; namespace ScadaLink.HealthMonitoring; @@ -10,6 +12,7 @@ public static class ServiceCollectionExtensions /// public static IServiceCollection AddSiteHealthMonitoring(this IServiceCollection services) { + AddOptionsValidation(services); services.AddSingleton(); services.AddHostedService(); return services; @@ -21,6 +24,7 @@ public static class ServiceCollectionExtensions /// public static IServiceCollection AddHealthMonitoring(this IServiceCollection services) { + AddOptionsValidation(services); services.AddSingleton(); return services; } @@ -32,10 +36,27 @@ public static class ServiceCollectionExtensions /// public static IServiceCollection AddCentralHealthAggregation(this IServiceCollection services) { + AddOptionsValidation(services); services.AddSingleton(); services.AddSingleton(sp => sp.GetRequiredService()); services.AddHostedService(sp => sp.GetRequiredService()); services.AddHostedService(); return services; } + + /// + /// HealthMonitoring-014: register the + /// so a misconfigured ScadaLink:HealthMonitoring section (zero/negative + /// intervals, or a CentralOfflineTimeout shorter than + /// OfflineTimeout) is rejected with a clear, key-naming message when the + /// hosted services resolve their options at startup — rather than crashing + /// later inside a constructor with an opaque + /// . Idempotent so it is safe when + /// more than one of the registration methods above is called. + /// + private static void AddOptionsValidation(IServiceCollection services) + { + services.TryAddEnumerable( + ServiceDescriptor.Singleton, HealthMonitoringOptionsValidator>()); + } } diff --git a/src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs b/src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs index 8dc02a7..ca05cf9 100644 --- a/src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs +++ b/src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs @@ -23,6 +23,18 @@ public class SiteHealthCollector : ISiteHealthCollector private volatile string _nodeHostname = ""; private volatile IReadOnlyList? _clusterNodes; private volatile bool _isActiveNode; + private readonly TimeProvider _timeProvider; + + /// + /// Creates a collector. The stamps each + /// report's timestamp; it defaults to and + /// is injectable so the report timestamp is deterministically testable — + /// consistent with the rest of the module's time-dependent classes. + /// + public SiteHealthCollector(TimeProvider? timeProvider = null) + { + _timeProvider = timeProvider ?? TimeProvider.System; + } /// /// Increment the script error counter. Covers unhandled exceptions, @@ -148,7 +160,7 @@ public class SiteHealthCollector : ISiteHealthCollector return new SiteHealthReport( SiteId: siteId, SequenceNumber: 0, // Caller (HealthReportSender) assigns the sequence number - ReportTimestamp: DateTimeOffset.UtcNow, + ReportTimestamp: _timeProvider.GetUtcNow(), DataConnectionStatuses: connectionStatuses, TagResolutionCounts: tagResolution, ScriptErrorCount: scriptErrors, diff --git a/tests/ScadaLink.HealthMonitoring.Tests/CentralHealthAggregatorTests.cs b/tests/ScadaLink.HealthMonitoring.Tests/CentralHealthAggregatorTests.cs index 963fd75..22b3395 100644 --- a/tests/ScadaLink.HealthMonitoring.Tests/CentralHealthAggregatorTests.cs +++ b/tests/ScadaLink.HealthMonitoring.Tests/CentralHealthAggregatorTests.cs @@ -307,6 +307,36 @@ public class CentralHealthAggregatorTests Assert.False(_aggregator.GetSiteState(CentralHealthReportLoop.CentralSiteId)!.IsOnline); } + /// + /// HealthMonitoring-013 regression: the offline-check cadence must be derived + /// from the *shorter* of + /// and , so that if + /// an operator configures CentralOfflineTimeout smaller than + /// OfflineTimeout, central offline detection is still timely instead of + /// being delayed by up to a full OfflineTimeout / 2. + /// + [Fact] + public void CheckInterval_IsHalfTheShorterTimeout() + { + // Default: OfflineTimeout (60s) is the shorter of the two. + Assert.Equal( + TimeSpan.FromSeconds(30), + CentralHealthAggregator.ComputeCheckInterval(new HealthMonitoringOptions + { + OfflineTimeout = TimeSpan.FromSeconds(60), + CentralOfflineTimeout = TimeSpan.FromMinutes(3) + })); + + // Operator configures CentralOfflineTimeout shorter — cadence must adapt. + Assert.Equal( + TimeSpan.FromSeconds(10), + CentralHealthAggregator.ComputeCheckInterval(new HealthMonitoringOptions + { + OfflineTimeout = TimeSpan.FromSeconds(60), + CentralOfflineTimeout = TimeSpan.FromSeconds(20) + })); + } + [Fact] public void SequenceNumberReset_RejectedUntilExceedsPrevMax() { diff --git a/tests/ScadaLink.HealthMonitoring.Tests/HealthMonitoringOptionsValidatorTests.cs b/tests/ScadaLink.HealthMonitoring.Tests/HealthMonitoringOptionsValidatorTests.cs new file mode 100644 index 0000000..4532188 --- /dev/null +++ b/tests/ScadaLink.HealthMonitoring.Tests/HealthMonitoringOptionsValidatorTests.cs @@ -0,0 +1,73 @@ +using Microsoft.Extensions.Options; + +namespace ScadaLink.HealthMonitoring.Tests; + +/// +/// HealthMonitoring-014 regression: intervals +/// are fed straight into new PeriodicTimer(...), which throws +/// for a zero/negative period. A +/// misconfigured appsettings.json must be rejected by an +/// with a clear, key-naming message +/// rather than crashing the hosted service with an opaque exception. +/// +public class HealthMonitoringOptionsValidatorTests +{ + private static ValidateOptionsResult Validate(HealthMonitoringOptions options) => + new HealthMonitoringOptionsValidator().Validate(Options.DefaultName, options); + + [Fact] + public void DefaultOptions_AreValid() + { + var result = Validate(new HealthMonitoringOptions()); + Assert.True(result.Succeeded, result.FailureMessage); + } + + [Fact] + public void ZeroReportInterval_IsRejected() + { + var result = Validate(new HealthMonitoringOptions { ReportInterval = TimeSpan.Zero }); + + Assert.True(result.Failed); + Assert.Contains("ReportInterval", result.FailureMessage); + } + + [Fact] + public void NegativeReportInterval_IsRejected() + { + var result = Validate(new HealthMonitoringOptions { ReportInterval = TimeSpan.FromSeconds(-1) }); + + Assert.True(result.Failed); + Assert.Contains("ReportInterval", result.FailureMessage); + } + + [Fact] + public void ZeroOfflineTimeout_IsRejected() + { + var result = Validate(new HealthMonitoringOptions { OfflineTimeout = TimeSpan.Zero }); + + Assert.True(result.Failed); + Assert.Contains("OfflineTimeout", result.FailureMessage); + } + + [Fact] + public void ZeroCentralOfflineTimeout_IsRejected() + { + var result = Validate(new HealthMonitoringOptions { CentralOfflineTimeout = TimeSpan.Zero }); + + Assert.True(result.Failed); + Assert.Contains("CentralOfflineTimeout", result.FailureMessage); + } + + [Fact] + public void CentralOfflineTimeout_ShorterThanOfflineTimeout_IsRejected() + { + var result = Validate(new HealthMonitoringOptions + { + OfflineTimeout = TimeSpan.FromSeconds(60), + CentralOfflineTimeout = TimeSpan.FromSeconds(30) + }); + + Assert.True(result.Failed); + Assert.Contains("CentralOfflineTimeout", result.FailureMessage); + } +} diff --git a/tests/ScadaLink.HealthMonitoring.Tests/SiteHealthCollectorTests.cs b/tests/ScadaLink.HealthMonitoring.Tests/SiteHealthCollectorTests.cs index 2a6c45f..a640a9b 100644 --- a/tests/ScadaLink.HealthMonitoring.Tests/SiteHealthCollectorTests.cs +++ b/tests/ScadaLink.HealthMonitoring.Tests/SiteHealthCollectorTests.cs @@ -131,6 +131,24 @@ public class SiteHealthCollectorTests Assert.InRange(report.ReportTimestamp, before, after); } + /// + /// HealthMonitoring-016 regression: + /// must stamp ReportTimestamp from an injected + /// (consistent with the rest of the module), not directly from + /// DateTimeOffset.UtcNow, so the report timestamp is deterministically + /// testable against a known instant. + /// + [Fact] + public void CollectReport_StampsTimestamp_FromInjectedTimeProvider() + { + var fixedInstant = new DateTimeOffset(2026, 5, 17, 9, 30, 0, TimeSpan.Zero); + var collector = new SiteHealthCollector(new TestTimeProvider(fixedInstant)); + + var report = collector.CollectReport("site-1"); + + Assert.Equal(fixedInstant, report.ReportTimestamp); + } + [Fact] public void CollectReport_SequenceNumberIsZero_CallerAssignsIt() {