diff --git a/code-reviews/HealthMonitoring/findings.md b/code-reviews/HealthMonitoring/findings.md index 3c706fa..6f4674d 100644 --- a/code-reviews/HealthMonitoring/findings.md +++ b/code-reviews/HealthMonitoring/findings.md @@ -8,7 +8,7 @@ | Last reviewed | 2026-05-16 | | Reviewer | claude-agent | | Commit reviewed | `9c60592` | -| Open findings | 5 | +| Open findings | 0 | ## Summary @@ -187,7 +187,7 @@ updates and no torn snapshots. No further code change was required for this find |--|--| | Severity | Low | | Category | Documentation & comments | -| Status | Open | +| Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:146-148`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:21`, `src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs:16` | **Description** @@ -207,7 +207,17 @@ comments, ideally referencing the owning component rather than restating a magic **Resolution** -_Unresolved._ +Resolved 2026-05-16 (commit `pending`). Documentation-only — no regression test is +meaningful. Verified the authoritative cadence against the Communication module: +`SiteCommunicationActor.PreStart` schedules the application-level heartbeat to central +at `CommunicationOptions.TransportHeartbeatInterval`, which defaults to **5 seconds** +(`CommunicationOptions.cs:49`). The stale "~2s" in `ICentralHealthAggregator.MarkHeartbeat` +was corrected; all three XML docs (`ICentralHealthAggregator.MarkHeartbeat`, +`SiteHealthState.LastHeartbeatAt`, `CentralHealthAggregator.CheckForOfflineSites`) now +state the single authoritative ~5s figure and reference the owning component +(`Cluster Infrastructure / SiteCommunicationActor` — +`CommunicationOptions.TransportHeartbeatInterval`) rather than restating a bare magic +number, so readers can reason about the 60s offline grace. ### HealthMonitoring-005 — Central self-report site can flap offline; no heartbeat grace like real sites @@ -259,7 +269,7 @@ offline after 10 minutes) verify the behaviour. |--|--| | Severity | Low | | Category | Design-document adherence | -| Status | Open | +| Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:28`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:32` | **Description** @@ -283,7 +293,20 @@ clock dependency is explicit. **Resolution** -_Unresolved._ +Resolved 2026-05-16 (commit `pending`). The `HealthReportSender` XML summary was +rewritten to describe the real strategy — sequence numbers are monotonic and +restart-resetting but explicitly **not** zero/one-based; they are seeded with the +current Unix epoch (ms) so a freshly-active node always sorts after the prior active. +Both `HealthReportSender` and `CentralHealthReportLoop` now accept an optional +`TimeProvider` (defaulting to `TimeProvider.System`) and derive the seed via +`GetUtcNow().ToUnixTimeMilliseconds()` in the constructor body instead of reading +`DateTimeOffset.UtcNow` at field initialization, so the seeding is deterministically +testable and the clock dependency is explicit. `CentralHealthReportLoop` gained a +`CurrentSequenceNumber` test accessor mirroring `HealthReportSender`. Regression tests +`HealthReportSenderTests.SequenceNumberSeed_UsesInjectedTimeProvider` and +`CentralHealthReportLoopTests.SequenceNumberSeed_UsesInjectedTimeProvider` assert the +seed equals the injected provider's Unix-ms instant (these would not compile against +the pre-fix code, which had no `TimeProvider` parameter). ### HealthMonitoring-007 — Heartbeats for not-yet-registered sites are silently dropped @@ -430,7 +453,7 @@ The HealthMonitoring test suite now stands at 47 passing tests (was 30). |--|--| | Severity | Low | | Category | Error handling & resilience | -| Status | Open | +| Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:70-87` | **Description** @@ -451,7 +474,21 @@ the failure so persistent degradation is diagnosable; avoid swallowing **Resolution** -_Unresolved._ +Resolved 2026-05-16 (commit `pending`). All three bare `catch { }` blocks in +`HealthReportSender.ExecuteAsync` (cluster-nodes refresh, parked-message-count query, +S&F buffer-depth query) were changed to `catch (Exception ex)` and now emit a +`LogWarning(ex, ...)` naming the failed operation and the site, so a persistent +degradation (broken S&F SQLite store, always-throwing `GetClusterNodes()`) is +diagnosable instead of silently shipping stale/zero metrics. On the `OperationCanceledException` +concern: verified the inner calls cannot raise OCE from cancellation — +`IClusterNodeProvider.GetClusterNodes()` is synchronous and takes no token, and +`StoreAndForwardStorage.GetParkedMessageCountAsync()`/`GetBufferDepthByCategoryAsync()` +take no `CancellationToken` parameter, so the only cancellation path is the outer loop's +`PeriodicTimer.WaitForNextTickAsync(stoppingToken)`, which is unaffected. Regression +test `HealthReportSenderTests.ClusterNodeRefreshFailure_IsLoggedNotSwallowed` injects a +throwing `IClusterNodeProvider`, asserts the loop still ships reports, and asserts a +warning carrying the `InvalidOperationException` is logged — confirmed to fail against +the pre-fix bare `catch { }` (logged-entry collection was empty). ### HealthMonitoring-011 — `AddHealthMonitoringActors` is a dead no-op placeholder @@ -459,7 +496,7 @@ _Unresolved._ |--|--| | Severity | Low | | Category | Code organization & conventions | -| Status | Open | +| Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs:42-46` | **Description** @@ -476,15 +513,23 @@ planned, track it in the design doc instead of a half-method. **Resolution** -_Unresolved._ +Resolved 2026-05-16 (commit `pending`). Verified a codebase-wide search — the +`AddHealthMonitoringActors` no-op extension has no production callers (only references +were in the code-review docs themselves). The HealthMonitoring module deliberately +contains no actors (transport is abstracted behind `IHealthReportTransport`; actor-side +wiring lives in the Communication module), so there is no genuine Phase-4 actor work for +this method to host. The dead placeholder was removed from `ServiceCollectionExtensions` +so a caller can no longer be misled into believing actor wiring exists. No regression +test is meaningful for a deleted no-op; removal is verified by the module continuing to +build and all 50 tests passing. ### HealthMonitoring-012 — `SiteHealthState.LatestReport` initialized to `null!`, misrepresenting the contract | | | |--|--| -| Severity | Low | +| Severity | Low — re-triaged: already resolved as a side-effect of HealthMonitoring-002. | | Category | Documentation & comments | -| Status | Open | +| Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11` | **Description** @@ -506,4 +551,12 @@ for HealthMonitoring-007. **Resolution** -_Unresolved._ +Resolved 2026-05-16 (commit `pending`). Re-triaged: verified against the current +source — the root cause was already eliminated by the HealthMonitoring-002 / +HealthMonitoring-007 fixes. `SiteHealthState.LatestReport` is already declared +`SiteHealthReport? LatestReport { get; init; }` (the recommendation's "make it properly +nullable" option) with an XML doc explaining the `null` case ("known only via heartbeats, +has not yet sent a report"). A codebase-wide search confirms no `null!` suppression +remains anywhere in `src/ScadaLink.HealthMonitoring`. This is exactly the change +HealthMonitoring-002 made when converting `SiteHealthState` to an immutable record, so +the contract is now honest and no further code change was required. diff --git a/src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs b/src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs index 1f6c5e5..4238b5e 100644 --- a/src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs +++ b/src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs @@ -210,10 +210,12 @@ public class CentralHealthAggregator : BackgroundService, ICentralHealthAggregat var state = kvp.Value; if (!state.IsOnline) continue; - // Use LastHeartbeatAt — heartbeats arrive frequently from any + // Use LastHeartbeatAt — heartbeats arrive every ~5s from any // healthy site node (cadence owned by Cluster Infrastructure / - // SiteCommunicationActor), so OfflineTimeout only fires when no - // node can reach central, not during single-node failovers. + // SiteCommunicationActor — CommunicationOptions.TransportHeartbeatInterval), + // so the 60s OfflineTimeout tolerates several missed heartbeats and + // only fires when no node can reach central, not during single-node + // failovers. // // The synthetic "central" site has no heartbeat source — its only // signal is the 30s CentralHealthReportLoop self-report — so it gets diff --git a/src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs b/src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs index b0cc9cd..eaa93f7 100644 --- a/src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs +++ b/src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs @@ -29,22 +29,31 @@ public class CentralHealthReportLoop : BackgroundService // Seeded with Unix-ms so reports from a newly-elected central leader // always sort after reports from any prior leader for siteId="central". - private long _sequenceNumber = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds(); + // The clock is read through the injected TimeProvider so the seeding is + // deterministically testable. + private long _sequenceNumber; public CentralHealthReportLoop( ISiteHealthCollector collector, ICentralHealthAggregator aggregator, IClusterNodeProvider clusterNodeProvider, IOptions options, - ILogger logger) + ILogger logger, + TimeProvider? timeProvider = null) { _collector = collector; _aggregator = aggregator; _clusterNodeProvider = clusterNodeProvider; _options = options.Value; _logger = logger; + _sequenceNumber = (timeProvider ?? TimeProvider.System).GetUtcNow().ToUnixTimeMilliseconds(); } + /// + /// Current sequence number (for testing). + /// + public long CurrentSequenceNumber => Interlocked.Read(ref _sequenceNumber); + protected override async Task ExecuteAsync(CancellationToken stoppingToken) { _logger.LogInformation( diff --git a/src/ScadaLink.HealthMonitoring/HealthReportSender.cs b/src/ScadaLink.HealthMonitoring/HealthReportSender.cs index dc640d4..6e32a6c 100644 --- a/src/ScadaLink.HealthMonitoring/HealthReportSender.cs +++ b/src/ScadaLink.HealthMonitoring/HealthReportSender.cs @@ -8,7 +8,12 @@ namespace ScadaLink.HealthMonitoring; /// /// Periodically collects a SiteHealthReport and sends it to central via Akka remoting. -/// Sequence numbers are monotonic, starting at 1, and reset on service restart. +/// Sequence numbers are monotonic and reset on service restart. They are not +/// zero/one-based: the per-process counter is seeded with the current Unix epoch +/// (milliseconds) at construction so that, after a failover, reports from a +/// freshly-active node always sort after reports from any prior active node for the +/// same site — otherwise the central aggregator's sequence-number guard would +/// silently reject the new active's first reports as stale. /// public class HealthReportSender : BackgroundService { @@ -24,8 +29,9 @@ public class HealthReportSender : BackgroundService // node always sort after reports from any prior active node for the same // site. Without this seeding, failover would silently drop the new // active's first reports because their per-process counter starts below - // the prior active's last sequence number. - private long _sequenceNumber = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds(); + // the prior active's last sequence number. The clock is read through the + // injected TimeProvider so the seeding is deterministically testable. + private long _sequenceNumber; public HealthReportSender( ISiteHealthCollector collector, @@ -34,7 +40,8 @@ public class HealthReportSender : BackgroundService ILogger logger, ISiteIdentityProvider siteIdentityProvider, StoreAndForwardStorage? sfStorage = null, - IClusterNodeProvider? clusterNodeProvider = null) + IClusterNodeProvider? clusterNodeProvider = null, + TimeProvider? timeProvider = null) { _collector = collector; _transport = transport; @@ -43,6 +50,7 @@ public class HealthReportSender : BackgroundService _siteId = siteIdentityProvider.SiteId; _sfStorage = sfStorage; _clusterNodeProvider = clusterNodeProvider; + _sequenceNumber = (timeProvider ?? TimeProvider.System).GetUtcNow().ToUnixTimeMilliseconds(); } /// @@ -73,7 +81,14 @@ public class HealthReportSender : BackgroundService { _collector.SetClusterNodes(_clusterNodeProvider.GetClusterNodes()); } - catch { /* Non-fatal */ } + catch (Exception ex) + { + // Non-fatal — the report ships with the previous cluster + // node list. Logged so a persistent failure is diagnosable. + _logger.LogWarning(ex, + "Failed to refresh cluster nodes for health report (site {SiteId}); using stale list", + _siteId); + } } if (_sfStorage != null) @@ -83,7 +98,13 @@ public class HealthReportSender : BackgroundService var parkedCount = await _sfStorage.GetParkedMessageCountAsync(); _collector.SetParkedMessageCount(parkedCount); } - catch { /* Non-fatal — parked count will be 0 */ } + catch (Exception ex) + { + // Non-fatal — parked count will be 0 in this report. + _logger.LogWarning(ex, + "Failed to query parked message count for health report (site {SiteId})", + _siteId); + } try { @@ -97,7 +118,13 @@ public class HealthReportSender : BackgroundService kvp => kvp.Value); _collector.SetStoreAndForwardDepths(depths); } - catch { /* Non-fatal — buffer depths will be empty */ } + catch (Exception ex) + { + // Non-fatal — buffer depths will be empty in this report. + _logger.LogWarning(ex, + "Failed to query store-and-forward buffer depths for health report (site {SiteId})", + _siteId); + } } var seq = Interlocked.Increment(ref _sequenceNumber); diff --git a/src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs b/src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs index b9b11e8..083a5d6 100644 --- a/src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs +++ b/src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs @@ -13,9 +13,14 @@ public interface ICentralHealthAggregator /// /// Bumps the last-seen timestamp for a site, keeping it marked online /// between full 30s reports when heartbeats are arriving — protects against - /// the offline threshold firing on a transiently delayed report. A heartbeat - /// for a site with no aggregator state yet (e.g. just after a central - /// restart/failover) registers that site as online with no + /// the offline threshold firing on a transiently delayed report. Heartbeat + /// cadence is owned by the Cluster Infrastructure / SiteCommunicationActor + /// (the application-level heartbeat to central, sent every + /// CommunicationOptions.TransportHeartbeatInterval — 5s by default); + /// the 60s therefore + /// tolerates several missed heartbeats. A heartbeat for a site with no + /// aggregator state yet (e.g. just after a central restart/failover) + /// registers that site as online with no /// , so reachable sites are not /// shown as "unknown" during the failover window. /// diff --git a/src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs b/src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs index 58c232f..9f533cc 100644 --- a/src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs +++ b/src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs @@ -38,10 +38,4 @@ public static class ServiceCollectionExtensions services.AddHostedService(); return services; } - - public static IServiceCollection AddHealthMonitoringActors(this IServiceCollection services) - { - // Placeholder for Akka actor registration (Phase 4+) - return services; - } } diff --git a/src/ScadaLink.HealthMonitoring/SiteHealthState.cs b/src/ScadaLink.HealthMonitoring/SiteHealthState.cs index 2ec0d2a..0774105 100644 --- a/src/ScadaLink.HealthMonitoring/SiteHealthState.cs +++ b/src/ScadaLink.HealthMonitoring/SiteHealthState.cs @@ -30,8 +30,9 @@ public sealed record SiteHealthState /// Time the most recent signal of any kind (full report OR heartbeat) was /// received. Drives offline detection — heartbeats from the standby keep the /// site marked online even when the active node is unable to produce a report - /// (mid-failover, brief stalls). See the heartbeat scheduler owned by the - /// Cluster Infrastructure / SiteCommunicationActor for the actual cadence. + /// (mid-failover, brief stalls). Heartbeat cadence is owned by the Cluster + /// Infrastructure / SiteCommunicationActor (every + /// CommunicationOptions.TransportHeartbeatInterval — 5s by default). /// public DateTimeOffset LastHeartbeatAt { get; init; } diff --git a/tests/ScadaLink.HealthMonitoring.Tests/CentralHealthReportLoopTests.cs b/tests/ScadaLink.HealthMonitoring.Tests/CentralHealthReportLoopTests.cs index e014252..0bbb29b 100644 --- a/tests/ScadaLink.HealthMonitoring.Tests/CentralHealthReportLoopTests.cs +++ b/tests/ScadaLink.HealthMonitoring.Tests/CentralHealthReportLoopTests.cs @@ -110,6 +110,29 @@ public class CentralHealthReportLoopTests } } + /// + /// HealthMonitoring-006 regression: the central loop's sequence-number seed + /// must be derived from the injected (Unix-ms), + /// not from DateTimeOffset.UtcNow read at field initialization, so the + /// seeding strategy is deterministically testable. + /// + [Fact] + public void SequenceNumberSeed_UsesInjectedTimeProvider() + { + var fixedInstant = new DateTimeOffset(2026, 5, 16, 12, 0, 0, TimeSpan.Zero); + var timeProvider = new TestTimeProvider(fixedInstant); + + var loop = new CentralHealthReportLoop( + new SiteHealthCollector(), + new RecordingAggregator(), + new FakeClusterNodeProvider { SelfIsPrimary = true }, + Options.Create(new HealthMonitoringOptions()), + NullLogger.Instance, + timeProvider); + + Assert.Equal(fixedInstant.ToUnixTimeMilliseconds(), loop.CurrentSequenceNumber); + } + [Fact] public async Task SetsActiveNodeFlag_EvenWhenNotPrimary() { diff --git a/tests/ScadaLink.HealthMonitoring.Tests/HealthReportSenderTests.cs b/tests/ScadaLink.HealthMonitoring.Tests/HealthReportSenderTests.cs index c23c06b..25e1515 100644 --- a/tests/ScadaLink.HealthMonitoring.Tests/HealthReportSenderTests.cs +++ b/tests/ScadaLink.HealthMonitoring.Tests/HealthReportSenderTests.cs @@ -1,4 +1,5 @@ using Microsoft.Data.Sqlite; +using Microsoft.Extensions.Logging; using Microsoft.Extensions.Logging.Abstractions; using Microsoft.Extensions.Options; using ScadaLink.Commons.Messages.Health; @@ -20,6 +21,44 @@ public class HealthReportSenderTests public string SiteId { get; set; } = "test-site"; } + /// + /// Captures emitted log entries so tests can assert that non-fatal failures + /// are surfaced (HealthMonitoring-010) rather than silently swallowed. + /// + private sealed class CapturingLogger : ILogger + { + public sealed record Entry(LogLevel Level, string Message, Exception? Exception); + + public List Entries { get; } = []; + + public IDisposable BeginScope(TState state) where TState : notnull => NullScope.Instance; + public bool IsEnabled(LogLevel logLevel) => true; + + public void Log( + LogLevel logLevel, EventId eventId, TState state, Exception? exception, + Func formatter) + { + lock (Entries) + { + Entries.Add(new Entry(logLevel, formatter(state, exception), exception)); + } + } + + private sealed class NullScope : IDisposable + { + public static readonly NullScope Instance = new(); + public void Dispose() { } + } + } + + /// An whose query always throws. + private sealed class ThrowingClusterNodeProvider : IClusterNodeProvider + { + public bool SelfIsPrimary => true; + public IReadOnlyList GetClusterNodes() => + throw new InvalidOperationException("cluster query failed"); + } + [Fact] public async Task SendsReportsWithMonotonicSequenceNumbers() { @@ -226,4 +265,76 @@ public class HealthReportSenderTests Assert.InRange(sender.CurrentSequenceNumber, beforeCtor, afterCtor); } + + /// + /// HealthMonitoring-010 regression: a failure refreshing cluster nodes is + /// non-fatal (the report still ships) but must no longer be swallowed by a + /// bare catch {} — it must be logged as a warning with the exception so + /// persistent degradation is diagnosable. + /// + [Fact] + public async Task ClusterNodeRefreshFailure_IsLoggedNotSwallowed() + { + var transport = new FakeTransport(); + var collector = new SiteHealthCollector(); + collector.SetActiveNode(true); + var logger = new CapturingLogger(); + var options = Options.Create(new HealthMonitoringOptions + { + ReportInterval = TimeSpan.FromMilliseconds(50) + }); + + var sender = new HealthReportSender( + collector, + transport, + options, + logger, + new FakeSiteIdentityProvider(), + clusterNodeProvider: new ThrowingClusterNodeProvider()); + + using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(300)); + try + { + await sender.StartAsync(cts.Token); + await Task.Delay(250, CancellationToken.None); + await sender.StopAsync(CancellationToken.None); + } + catch (OperationCanceledException) { } + + // The report loop continues despite the failure... + Assert.NotEmpty(transport.SentReports); + // ...but the failure is surfaced as a warning carrying the exception. + CapturingLogger.Entry[] warnings; + lock (logger.Entries) + { + warnings = logger.Entries + .Where(e => e.Level == LogLevel.Warning && e.Exception is InvalidOperationException) + .ToArray(); + } + Assert.NotEmpty(warnings); + Assert.Contains(warnings, w => w.Message.Contains("cluster nodes", StringComparison.OrdinalIgnoreCase)); + } + + /// + /// HealthMonitoring-006 regression: the sequence-number seed must be derived + /// from the injected so the Unix-ms seeding strategy + /// is deterministically testable and the clock dependency is explicit, rather + /// than reading DateTimeOffset.UtcNow directly at field initialization. + /// + [Fact] + public void SequenceNumberSeed_UsesInjectedTimeProvider() + { + var fixedInstant = new DateTimeOffset(2026, 5, 16, 12, 0, 0, TimeSpan.Zero); + var timeProvider = new TestTimeProvider(fixedInstant); + + var sender = new HealthReportSender( + new SiteHealthCollector(), + new FakeTransport(), + Options.Create(new HealthMonitoringOptions()), + NullLogger.Instance, + new FakeSiteIdentityProvider(), + timeProvider: timeProvider); + + Assert.Equal(fixedInstant.ToUnixTimeMilliseconds(), sender.CurrentSequenceNumber); + } }