# Code Review — HealthMonitoring | Field | Value | |-------|-------| | Module | `src/ScadaLink.HealthMonitoring` | | Design doc | `docs/requirements/Component-HealthMonitoring.md` | | Status | Reviewed | | Last reviewed | 2026-05-17 | | Reviewer | claude-agent | | Commit reviewed | `39d737e` | | Open findings | 4 | ## Summary The HealthMonitoring module is small, readable, and broadly faithful to the design intent: per-interval error counters with atomic read-and-reset, monotonic sequence numbers with Unix-ms seeding to survive failover, sequence-guarded staleness rejection, and a 60s offline timeout. However, the review surfaced two recurring themes. First, **a documented metric is silently unimplemented** — store-and-forward buffer depths are never populated (`SetStoreAndForwardDepths` has zero callers and a test asserts the field is always empty), so the dashboard cannot show the buffer depth metric the design doc requires. Second, **the central aggregator's in-memory state model has unguarded shared mutable state**: `SiteHealthState` is a mutable class whose fields are written by a background timer thread, by `ProcessReport`, and by `MarkHeartbeat` with no synchronization, and the same live mutable objects are handed straight to UI callers via `GetAllSiteStates`. The `ProcessReport` logic also mutates shared state inside a `ConcurrentDictionary.AddOrUpdate` update delegate, which the runtime may invoke more than once under contention. Additionally there are gaps around central self-report offline detection, heartbeats for not-yet-registered sites being dropped, and missing test coverage for the central report loop, heartbeat path, and most collector setters. None of the findings are crash-class, but the concurrency issues are Medium/High and the missing S&F metric is a real design-adherence gap. #### Re-review 2026-05-17 (commit `39d737e`) All twelve prior findings (HealthMonitoring-001..012) are confirmed `Resolved` — `SiteHealthState` is now an immutable `sealed record` mutated only via atomic compare-and-swap, the store-and-forward buffer-depth metric is populated, the central-site offline grace and the unknown-site heartbeat registration are in place, and the test suite has grown to cover the report loop, heartbeat path, and collector setters. This re-review found **4 new findings, all Low/Medium, none crash-class**. They are residual polish items rather than behaviour regressions: an inaccurate offline-check-interval comment (HealthMonitoring-013), unvalidated `HealthMonitoringOptions` intervals that crash the hosted service on misconfiguration (HealthMonitoring-014), a heartbeat-only registered site left with a year-0001 `LastReportReceivedAt` that the UI's staleness display must special-case (HealthMonitoring-015), and `CollectReport` reading `DateTimeOffset.UtcNow` directly instead of the module's now-standard injected `TimeProvider` (HealthMonitoring-016). The module remains small, readable, and broadly faithful to the design intent. ## Checklist coverage | # | Category | Examined | Notes | |---|----------|----------|-------| | 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). Re-review: heartbeat-registered site left with year-0001 `LastReportReceivedAt` (HealthMonitoring-015). | | 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via `IHealthReportTransport`); `AddHealthMonitoringActors` is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. | | 3 | Concurrency & thread safety | x | Unguarded mutable `SiteHealthState` (HealthMonitoring-002); mutation inside `AddOrUpdate` delegate (HealthMonitoring-003); `GetAllSiteStates` leaks live mutable references (HealthMonitoring-008). Collector counters correctly use `Interlocked`. | | 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010, resolved); top-level loop error handling is sound. Re-review: `HealthMonitoringOptions` intervals unvalidated — a zero/negative value crashes the hosted service at `PeriodicTimer` construction (HealthMonitoring-014). | | 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. | | 6 | Performance & resource management | x | `PeriodicTimer` instances correctly disposed via `using`. Dictionary snapshots per report are acceptable at the documented scale. No issues found. | | 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). | | 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011, resolved). Re-review: `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of the module's now-standard injected `TimeProvider` (HealthMonitoring-016). | | 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). | | 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses `OfflineTimeout` (HealthMonitoring-013). | ## Findings ### HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated | | | |--|--| | Severity | High | | Category | Design-document adherence | | Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:104`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:79` | **Description** `Component-HealthMonitoring.md` lists "Store-and-forward buffer depth" (pending messages by category) as a required monitored metric. `SiteHealthCollector` exposes `SetStoreAndForwardDepths(...)` to receive it, but a codebase-wide search shows the method has **no callers** — `_sfBufferDepths` always remains the empty dictionary it is initialized to. `HealthReportSender` queries `GetParkedMessageCountAsync()` and sets `ParkedMessageCount`, but parked count is a distinct metric from per-category buffer depth. The test `SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder` even codifies the unimplemented state as expected behaviour. The result is that the central dashboard cannot display buffer depth, a documented triage metric. **Recommendation** Wire `SetStoreAndForwardDepths` into `HealthReportSender.ExecuteAsync` (alongside the existing parked-count call) using the S&F engine's per-category depth API, or, if the metric is intentionally deferred, record that decision in the design doc and remove the dead setter. Update the placeholder test accordingly once implemented. **Resolution** Resolved 2026-05-16 (commit ``). `HealthReportSender.ExecuteAsync` now queries the existing public `StoreAndForwardStorage.GetBufferDepthByCategoryAsync()` API alongside the parked-count call and feeds the per-category depths into `SiteHealthCollector.SetStoreAndForwardDepths` (category enum names as keys), so the documented store-and-forward buffer depth metric is populated in every emitted report. Regression test `HealthReportSenderTests.ReportsIncludeStoreAndForwardBufferDepthsFromStorage` verifies populated per-category depths. The obsolete placeholder test `SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder` continues to pass — it only exercises the collector with no setter call and still correctly asserts the empty default; it was left in place as the collector-level default-state test. No StoreAndForward source was modified (existing public API only). ### HealthMonitoring-002 — `SiteHealthState` mutable fields written from multiple threads without synchronization | | | |--|--| | Severity | High | | Category | Concurrency & thread safety | | Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:137` | **Description** `SiteHealthState` is a plain mutable class. Its fields (`LatestReport`, `LastReportReceivedAt`, `LastHeartbeatAt`, `LastSequenceNumber`, `IsOnline`) are mutated from at least three concurrent contexts: `ProcessReport` (caller thread — ClusterClient/PubSub message handlers), `MarkHeartbeat` (caller thread — heartbeat handler), and `CheckForOfflineSites` (the `BackgroundService` timer thread). The `ConcurrentDictionary` only protects the dictionary structure, not the objects it stores. A heartbeat update and the offline-check can interleave on the same `SiteHealthState` instance, and reads/writes of `DateTimeOffset` (a 16-byte struct) and `long` fields are not guaranteed atomic on all platforms — producing torn reads and lost updates of `IsOnline`/`LastHeartbeatAt`. **Recommendation** Make state transitions atomic: either guard all reads/writes of a `SiteHealthState` with a per-site lock, or replace `SiteHealthState` with an immutable record updated via `ConcurrentDictionary` compare-and-swap (`TryUpdate`) so every transition is a single atomic reference swap. **Resolution** Resolved 2026-05-16 (commit ``). `SiteHealthState` is now a `sealed record` with `init`-only properties. `CentralHealthAggregator.ProcessReport`, `MarkHeartbeat`, and `CheckForOfflineSites` were rewritten to perform every state transition as an atomic compare-and-swap (`TryAdd`/`TryUpdate`) producing a new record instance — no field of a stored state is ever mutated in place. `ProcessReport` uses an explicit CAS retry loop instead of the `AddOrUpdate` update delegate so the sequence-number guard and the field writes are evaluated against the value actually installed (this also closes the root cause behind HealthMonitoring-003). Reads via `GetAllSiteStates`/`GetSiteState` now hand out immutable snapshots, so a concurrent reader can never observe a torn or half-applied state. `LatestReport` was changed from `SiteHealthReport` (`null!`) to `SiteHealthReport?`, making the contract honest; all existing consumers (CentralUI, integration/perf tests) already null-checked it and continue to build clean. Regression test `CentralHealthAggregatorTests.ProcessReport_ConcurrentUpdates_NeverLoseSequenceOrTearState` exercises concurrent report/heartbeat/read threads and asserts snapshot consistency and no lost updates. ### HealthMonitoring-003 — Shared state mutated inside `ConcurrentDictionary.AddOrUpdate` update delegate | | | |--|--| | Severity | Medium — re-triaged: already resolved as a side-effect of HealthMonitoring-002. | | Category | Concurrency & thread safety | | Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:45-103` | **Description** The update delegate passed to `AddOrUpdate` mutates the `existing` object in place (`existing.LatestReport = report; existing.IsOnline = true; ...`). `AddOrUpdate`'s contract explicitly allows the update delegate to be invoked **more than once** under contention (when the CAS that installs the result loses a race and is retried). Each invocation mutates the shared object, so a concurrent report for the same site can observe a half-applied update, and the multi-field assignment is not atomic with respect to readers in `GetAllSiteStates`/`CheckForOfflineSites`. The intended "only replace if sequence is higher" guard can also be subverted because the sequence comparison and the field writes are not a single atomic step. **Recommendation** Have the update delegate return a **new** `SiteHealthState` (record `with` copy) rather than mutating `existing`, and treat the dictionary value as immutable. Combined with HealthMonitoring-002, this makes every state transition an atomic reference swap with no observable intermediate state. **Resolution** Resolved 2026-05-16 (commit `pending`). Re-triaged: verified against the current source — the root cause was already eliminated by the HealthMonitoring-002 fix. `ProcessReport` no longer uses `AddOrUpdate` at all; it is now an explicit compare-and-swap retry loop (`TryGetValue` → guard → `TryAdd`/`TryUpdate`) that produces a brand-new immutable `SiteHealthState` record per transition and never mutates a stored value in place. The sequence-number guard and the field writes are evaluated against the value actually installed by the CAS, so the "only replace if sequence is higher" invariant holds. The concurrency stress test `CentralHealthAggregatorTests.ProcessReport_ConcurrentUpdates_NeverLoseSequenceOrTearState` (added under HealthMonitoring-002) already exercises this path and asserts no lost updates and no torn snapshots. No further code change was required for this finding. ### HealthMonitoring-004 — Inconsistent heartbeat interval described across XML docs | | | |--|--| | Severity | Low | | Category | Documentation & comments | | Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:146-148`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:21`, `src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs:16` | **Description** The heartbeat cadence that offline detection relies on is documented inconsistently. `CheckForOfflineSites` says "heartbeats arrive every ~5s"; `SiteHealthState.LastHeartbeatAt` says "~5s heartbeat"; but `ICentralHealthAggregator.MarkHeartbeat` says "~2s heartbeats are arriving". The actual cadence is set elsewhere (Cluster Infrastructure / `SiteCommunicationActor`). Readers cannot reason about whether a 60s offline timeout gives the intended grace without a single authoritative number. **Recommendation** Pick the correct interval (verify against the heartbeat scheduler in `SiteCommunicationActor`/Cluster Infrastructure) and use it consistently in all three comments, ideally referencing the owning component rather than restating a magic number. **Resolution** Resolved 2026-05-16 (commit `pending`). Documentation-only — no regression test is meaningful. Verified the authoritative cadence against the Communication module: `SiteCommunicationActor.PreStart` schedules the application-level heartbeat to central at `CommunicationOptions.TransportHeartbeatInterval`, which defaults to **5 seconds** (`CommunicationOptions.cs:49`). The stale "~2s" in `ICentralHealthAggregator.MarkHeartbeat` was corrected; all three XML docs (`ICentralHealthAggregator.MarkHeartbeat`, `SiteHealthState.LastHeartbeatAt`, `CentralHealthAggregator.CheckForOfflineSites`) now state the single authoritative ~5s figure and reference the owning component (`Cluster Infrastructure / SiteCommunicationActor` — `CommunicationOptions.TransportHeartbeatInterval`) rather than restating a bare magic number, so readers can reason about the 60s offline grace. ### HealthMonitoring-005 — Central self-report site can flap offline; no heartbeat grace like real sites | | | |--|--| | Severity | Medium | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:48-81`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:149` | **Description** `CheckForOfflineSites` decides offline status purely from `LastHeartbeatAt`, and for real sites that field is kept fresh by frequent (~2-5s) heartbeats so the 60s timeout only fires on genuine total loss. The synthetic `central` site, however, has no heartbeat source — `LastHeartbeatAt` is only bumped by `ProcessReport` from the 30s `CentralHealthReportLoop`. The loop also only runs on the cluster leader and silently skips a cycle on any exception. Consequently, a single skipped/late central self-report (leader GC pause, brief stall, mid-failover before the new leader's loop spins up) leaves `central` with no signal for >60s and it is marked offline even though the central cluster is healthy. The central card thus has no equivalent of the "one missed report grace" the design doc grants real sites. **Recommendation** Either feed `central` a heartbeat equivalent (e.g. have `MarkHeartbeat` called for `CentralSiteId` on a fast timer independent of the leader-only report loop), or apply a longer/distinct offline timeout to the `central` keyspace entry, and ensure the new leader starts the report loop promptly on failover. **Resolution** Resolved 2026-05-16 (commit `pending`). Applied the distinct-timeout option. A new `HealthMonitoringOptions.CentralOfflineTimeout` (default 3x the report interval = 3 minutes) is applied by `CentralHealthAggregator.CheckForOfflineSites` to the `central` keyspace entry only — real sites keep the existing `OfflineTimeout`. This gives the synthetic `central` site (which has no heartbeat source and is fed solely by the 30s leader-only `CentralHealthReportLoop`) enough grace to survive a single skipped or late self-report — the equivalent of the "one missed report" grace the design doc grants real sites — while still going offline on genuine total loss. Regression tests `CentralHealthAggregatorTests.OfflineDetection_CentralSite_HasLongerGraceThanRealSites` (central survives 75s of silence while a real site goes offline) and `OfflineDetection_CentralSite_StillGoesOfflineOnGenuineLoss` (central still detected offline after 10 minutes) verify the behaviour. ### HealthMonitoring-006 — Sequence seeding contradicts the doc's "starting at 1" wording and is untestable | | | |--|--| | Severity | Low | | Category | Design-document adherence | | Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:28`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:32` | **Description** The `HealthReportSender` class XML summary states "Sequence numbers are monotonic, starting at 1, and reset on service restart." The implementation instead seeds `_sequenceNumber` with `DateTimeOffset.UtcNow.ToUnixTimeMilliseconds()` so the first emitted sequence is a large epoch value, specifically to keep ordering correct across failover. The summary is therefore stale and contradicts the code. Separately, the seed reads `DateTimeOffset.UtcNow` directly at field initialization rather than through an injected `TimeProvider` (which `CentralHealthAggregator` already uses), making the seeding logic impossible to unit-test deterministically and dependent on node wall-clock agreement — if one node's clock lags, its post-failover reports can be silently rejected as stale by the aggregator. **Recommendation** Fix the `HealthReportSender` XML summary to describe the actual Unix-ms seeding strategy, and inject `TimeProvider` for the seed so the behaviour is testable and the clock dependency is explicit. **Resolution** Resolved 2026-05-16 (commit `pending`). The `HealthReportSender` XML summary was rewritten to describe the real strategy — sequence numbers are monotonic and restart-resetting but explicitly **not** zero/one-based; they are seeded with the current Unix epoch (ms) so a freshly-active node always sorts after the prior active. Both `HealthReportSender` and `CentralHealthReportLoop` now accept an optional `TimeProvider` (defaulting to `TimeProvider.System`) and derive the seed via `GetUtcNow().ToUnixTimeMilliseconds()` in the constructor body instead of reading `DateTimeOffset.UtcNow` at field initialization, so the seeding is deterministically testable and the clock dependency is explicit. `CentralHealthReportLoop` gained a `CurrentSequenceNumber` test accessor mirroring `HealthReportSender`. Regression tests `HealthReportSenderTests.SequenceNumberSeed_UsesInjectedTimeProvider` and `CentralHealthReportLoopTests.SequenceNumberSeed_UsesInjectedTimeProvider` assert the seed equals the injected provider's Unix-ms instant (these would not compile against the pre-fix code, which had no `TimeProvider` parameter). ### HealthMonitoring-007 — Heartbeats for not-yet-registered sites are silently dropped | | | |--|--| | Severity | Medium | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86-99` | **Description** `MarkHeartbeat` returns immediately if the site is not already in `_siteStates` ("registration only happens on report"). Central health state is in-memory only and not persisted. After a central restart or failover the aggregator starts empty, so for up to one full report interval (default 30s) every site emits only heartbeats that are all discarded — the site is reported as *unknown* (absent from `GetAllSiteStates`) rather than *online*, even though heartbeats prove it is reachable. This is a visible dashboard regression precisely during the failover window, which is when operators most need accurate status. **Recommendation** Allow `MarkHeartbeat` to register a minimal `SiteHealthState` (online, no `LatestReport` yet, with a UI-visible "awaiting first report" indication) when a heartbeat arrives for an unknown site, so reachable sites show online immediately after a central restart. **Resolution** Resolved 2026-05-16 (commit `pending`). `CentralHealthAggregator.MarkHeartbeat` no longer returns early for an unknown site. When a heartbeat arrives for a site with no aggregator state, it now atomically registers (`TryAdd`, with CAS-loss retry) a minimal `SiteHealthState` that is `IsOnline = true`, `LatestReport = null`, `LastSequenceNumber = 0` and `LastHeartbeatAt = receivedAt` — an "online, awaiting first report" state. This relies on the HealthMonitoring-002 change that made `LatestReport` properly nullable, so UI consumers already handle the null case. Reachable sites therefore show online immediately after a central restart/failover instead of being absent ("unknown") for up to a full report interval. The `ICentralHealthAggregator.MarkHeartbeat` XML doc was corrected to describe the new behaviour. Regression test `CentralHealthAggregatorTests.MarkHeartbeat_RegistersUnknownSite_AsOnlineAwaitingReport` verifies the registration; `MarkHeartbeat_KeepsSiteOnline_BetweenReports` and `MarkHeartbeat_BringsOfflineSiteBackOnline` cover the already-registered paths. ### HealthMonitoring-008 — `GetAllSiteStates` / `GetSiteState` leak live mutable state objects to callers | | | |--|--| | Severity | Medium — re-triaged: already resolved as a side-effect of HealthMonitoring-002. | | Category | Concurrency & thread safety | | Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:146-158` | **Description** `GetAllSiteStates` copies the dictionary but the copy still holds references to the same live mutable `SiteHealthState` instances; `GetSiteState` returns the live instance directly. UI consumers (Blazor Server / SignalR circuits) read these objects on their own threads while the aggregator's background timer and report handlers concurrently mutate the very same instances (see HealthMonitoring-002). A UI render can observe a `SiteHealthState` with, e.g., `IsOnline == true` but a `LatestReport` from a different update, or a torn `DateTimeOffset`. Callers could also mutate the shared state, corrupting aggregator state. **Recommendation** Return immutable snapshots: convert `SiteHealthState` to a record (per HealthMonitoring-002/003) so handing out the reference is safe, or deep-copy each state into an immutable DTO before returning. **Resolution** Resolved 2026-05-16 (commit `pending`). Re-triaged: verified against the current source — the root cause was already eliminated by the HealthMonitoring-002 fix. `SiteHealthState` is now a `sealed record` with `init`-only properties (fully immutable). Every aggregator transition installs a brand-new record instance via an atomic compare-and-swap, so the references `GetAllSiteStates` and `GetSiteState` hand out are immutable snapshots — a UI consumer reading one on its own thread can never observe a torn or half-applied state, and cannot mutate aggregator state through the returned reference. The recommended fix (make `SiteHealthState` a record) is exactly what the HealthMonitoring-002 change did, so no further code change was required. ### HealthMonitoring-009 — Missing test coverage for central report loop, heartbeat path, replication, and collector setters | | | |--|--| | Severity | Medium | | Category | Testing coverage | | Status | Resolved | | Location | `tests/ScadaLink.HealthMonitoring.Tests/` | **Description** Several behaviours have no automated coverage: - `CentralHealthReportLoop` — leader-only gating (`SelfIsPrimary`), self-report generation, sequence assignment: no test file at all. - `CentralHealthAggregator.MarkHeartbeat` — keeping a site online between reports, online recovery via heartbeat, and the unknown-site drop behaviour (HealthMonitoring-007): untested. - Offline detection driven by `LastHeartbeatAt` vs `LastReportReceivedAt` — the existing offline tests only advance time after a report, never exercising the heartbeat-keeps-alive path the design depends on. - `SiteHealthCollector` — `SetClusterNodes`, `SetInstanceCounts`, `SetParkedMessageCount`, `SetNodeHostname`, `SetActiveNode`/`NodeRole`, `UpdateTagQuality`, `UpdateConnectionEndpoint`: not reflected-in-report tested. - `SiteHealthReportReplica` idempotency under double delivery: untested. **Recommendation** Add tests for the central report loop (with a fake `IClusterNodeProvider`), the heartbeat-keeps-online and unknown-site heartbeat paths, and the remaining collector setters' presence in `CollectReport` output. **Resolution** Resolved 2026-05-16 (commit `pending`). Added the missing coverage: - **`CentralHealthReportLoopTests`** (new file) — `GeneratesCentralReports_WhenSelfIsPrimary`, `GeneratesNoReports_WhenNotPrimary` (leader-only `SelfIsPrimary` gating with a fake `IClusterNodeProvider`), `AssignsMonotonicSequenceNumbers`, and `SetsActiveNodeFlag_EvenWhenNotPrimary`. - **`CentralHealthAggregatorTests`** — `MarkHeartbeat_RegistersUnknownSite_AsOnlineAwaitingReport`, `MarkHeartbeat_KeepsSiteOnline_BetweenReports` (heartbeat keeps a site online past the offline timeout — the path the design depends on), and `MarkHeartbeat_BringsOfflineSiteBackOnline`. - **`SiteHealthCollectorTests`** — reflected-in-report tests for `SetClusterNodes`, `SetInstanceCounts`, `SetParkedMessageCount`, `SetNodeHostname`, `SetActiveNode`/`NodeRole`, `UpdateTagQuality`, `UpdateConnectionEndpoint`, and `SetStoreAndForwardDepths`. The `SiteHealthReportReplica` idempotency item is **out of scope** for this module: `SiteHealthReportReplica` is declared in `ScadaLink.Commons` and published/consumed by `CentralCommunicationActor` in the `ScadaLink.Communication` module — the HealthMonitoring module itself has no replication code. Replica double-delivery idempotency is already covered by `ProcessReport`'s sequence-number guard (`ProcessReport_RejectsEqualSequence`, `ProcessReport_RejectsStaleReport_WhenSequenceNotGreater`); testing the actor-side double-publish belongs in the Communication module's review. The HealthMonitoring test suite now stands at 47 passing tests (was 30). ### HealthMonitoring-010 — `HealthReportSender` silently swallows inner failures with bare `catch {}` | | | |--|--| | Severity | Low | | Category | Error handling & resilience | | Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:70-87` | **Description** The cluster-nodes update and parked-message-count query are each wrapped in `try { ... } catch { /* Non-fatal */ }` with no logging. A persistent failure (e.g. the S&F SQLite store is permanently broken, or `GetClusterNodes()` always throws) is then completely invisible — every report silently ships with stale cluster nodes and a parked count of 0, with nothing in the logs to explain the wrong dashboard values. Bare `catch` with no exception variable also catches `OperationCanceledException` and would mask shutdown signalling if the awaited call observed the token. **Recommendation** Catch a specific exception type (or at least `Exception ex`) and `LogWarning`/`LogDebug` the failure so persistent degradation is diagnosable; avoid swallowing `OperationCanceledException`. **Resolution** Resolved 2026-05-16 (commit `pending`). All three bare `catch { }` blocks in `HealthReportSender.ExecuteAsync` (cluster-nodes refresh, parked-message-count query, S&F buffer-depth query) were changed to `catch (Exception ex)` and now emit a `LogWarning(ex, ...)` naming the failed operation and the site, so a persistent degradation (broken S&F SQLite store, always-throwing `GetClusterNodes()`) is diagnosable instead of silently shipping stale/zero metrics. On the `OperationCanceledException` concern: verified the inner calls cannot raise OCE from cancellation — `IClusterNodeProvider.GetClusterNodes()` is synchronous and takes no token, and `StoreAndForwardStorage.GetParkedMessageCountAsync()`/`GetBufferDepthByCategoryAsync()` take no `CancellationToken` parameter, so the only cancellation path is the outer loop's `PeriodicTimer.WaitForNextTickAsync(stoppingToken)`, which is unaffected. Regression test `HealthReportSenderTests.ClusterNodeRefreshFailure_IsLoggedNotSwallowed` injects a throwing `IClusterNodeProvider`, asserts the loop still ships reports, and asserts a warning carrying the `InvalidOperationException` is logged — confirmed to fail against the pre-fix bare `catch { }` (logged-entry collection was empty). ### HealthMonitoring-011 — `AddHealthMonitoringActors` is a dead no-op placeholder | | | |--|--| | Severity | Low | | Category | Code organization & conventions | | Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs:42-46` | **Description** `AddHealthMonitoringActors` does nothing but `return services` with a "Placeholder for Phase 4+" comment. A public extension method that silently no-ops is a trap: a caller who registers it will believe actor wiring is in place. No caller currently invokes it. **Recommendation** Remove the method until it has real behaviour, or throw `NotImplementedException` so accidental use fails loudly. If the actor model for this component is genuinely planned, track it in the design doc instead of a half-method. **Resolution** Resolved 2026-05-16 (commit `pending`). Verified a codebase-wide search — the `AddHealthMonitoringActors` no-op extension has no production callers (only references were in the code-review docs themselves). The HealthMonitoring module deliberately contains no actors (transport is abstracted behind `IHealthReportTransport`; actor-side wiring lives in the Communication module), so there is no genuine Phase-4 actor work for this method to host. The dead placeholder was removed from `ServiceCollectionExtensions` so a caller can no longer be misled into believing actor wiring exists. No regression test is meaningful for a deleted no-op; removal is verified by the module continuing to build and all 50 tests passing. ### HealthMonitoring-012 — `SiteHealthState.LatestReport` initialized to `null!`, misrepresenting the contract | | | |--|--| | Severity | Low — re-triaged: already resolved as a side-effect of HealthMonitoring-002. | | Category | Documentation & comments | | Status | Resolved | | Location | `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11` | **Description** `LatestReport` is declared `SiteHealthReport LatestReport { get; set; } = null!;`, suppressing nullability. Today every code path that creates a `SiteHealthState` (only `ProcessReport`) assigns `LatestReport`, so it is never actually null — but the `null!` declaration tells readers and the compiler the opposite of the real invariant. If HealthMonitoring-007 is addressed by registering state from a heartbeat (no report yet), this becomes a live `NullReferenceException` risk for UI code that dereferences `LatestReport`. **Recommendation** Either make `LatestReport` `required` (matching how it is genuinely always set today) or make it properly nullable `SiteHealthReport?` and have consumers handle the "registered, no report yet" case explicitly — consistent with whatever is decided for HealthMonitoring-007. **Resolution** Resolved 2026-05-16 (commit `pending`). Re-triaged: verified against the current source — the root cause was already eliminated by the HealthMonitoring-002 / HealthMonitoring-007 fixes. `SiteHealthState.LatestReport` is already declared `SiteHealthReport? LatestReport { get; init; }` (the recommendation's "make it properly nullable" option) with an XML doc explaining the `null` case ("known only via heartbeats, has not yet sent a report"). A codebase-wide search confirms no `null!` suppression remains anywhere in `src/ScadaLink.HealthMonitoring`. This is exactly the change HealthMonitoring-002 made when converting `SiteHealthState` to an immutable record, so the contract is now honest and no further code change was required. ### HealthMonitoring-013 — Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout` | | | |--|--| | Severity | Low | | Category | Documentation & comments | | Status | Open | | Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:194-196` | **Description** `ExecuteAsync` derives the `PeriodicTimer` cadence with the comment "Check at half the (shorter) offline timeout interval for timely detection", but the code only reads `_options.OfflineTimeout`: ```csharp var checkInterval = TimeSpan.FromMilliseconds(_options.OfflineTimeout.TotalMilliseconds / 2); ``` `CentralOfflineTimeout` (HealthMonitoring-005's fix) is never considered. With the default options (`OfflineTimeout` 60s, `CentralOfflineTimeout` 3m) `OfflineTimeout` genuinely is the shorter of the two, so the parenthetical happens to hold. But the comment states an invariant the code does not enforce: if an operator configures `CentralOfflineTimeout` *smaller* than `OfflineTimeout`, the check cadence stays tied to `OfflineTimeout`, and central offline detection is delayed by up to a full `OfflineTimeout / 2` beyond the intended `CentralOfflineTimeout` window. The comment misleads a reader into believing the cadence already adapts to whichever timeout is shorter. **Recommendation** Either compute `checkInterval` from `Math.Min(OfflineTimeout, CentralOfflineTimeout)` so the code matches the comment, or drop the "(shorter)" wording and state plainly that the cadence is derived from `OfflineTimeout` only (acceptable while the default `CentralOfflineTimeout` is the larger value). **Resolution** _Unresolved._ ### HealthMonitoring-014 — `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service | | | |--|--| | Severity | Low | | Category | Error handling & resilience | | Status | Open | | Location | `src/ScadaLink.HealthMonitoring/HealthMonitoringOptions.cs:3-20`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:196`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:67`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:63` | **Description** `HealthMonitoringOptions` is bound from the `ScadaLink:HealthMonitoring` config section (`SiteServiceRegistration.BindSharedOptions`) with no validation — no `IValidateOptions`, no `ValidateDataAnnotations`, no `ValidateOnStart`. `ReportInterval`, `OfflineTimeout`, and `CentralOfflineTimeout` are all fed straight into `new PeriodicTimer(...)` (and `OfflineTimeout` into a division for the check interval). `PeriodicTimer`'s constructor throws `ArgumentOutOfRangeException` for a zero or negative period. A misconfigured `appsettings.json` (e.g. `"ReportInterval": "00:00:00"`, an empty/garbled value that binds to `TimeSpan.Zero`, or a negative span) therefore crashes the `HealthReportSender` / `CentralHealthReportLoop` / `CentralHealthAggregator` hosted service at startup with an opaque exception that does not name the offending config key, rather than failing fast with a clear validation message. **Recommendation** Add an options validator (DataAnnotations `[Range]`-style on the spans, or an `IValidateOptions`) that rejects non-positive `ReportInterval`/`OfflineTimeout`/`CentralOfflineTimeout` and ideally requires `CentralOfflineTimeout >= OfflineTimeout`, and call `.ValidateOnStart()` so a bad configuration fails fast with a message naming the section and key. **Resolution** _Unresolved._ ### HealthMonitoring-015 — Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt` | | | |--|--| | Severity | Medium | | Category | Correctness & logic bugs | | Status | Open | | Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:122-130`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:27` | **Description** When `MarkHeartbeat` registers a previously-unknown site (HealthMonitoring-007's fix), it sets `LastReportReceivedAt = default` — i.e. `DateTimeOffset.MinValue` (`0001-01-01`). The XML doc on `SiteHealthState.LastReportReceivedAt` states the field is "Used by the UI to surface report staleness during failover." A heartbeat-only site therefore has `LatestReport == null` **and** `LastReportReceivedAt == DateTimeOffset.MinValue`. Any UI code that computes "last report N ago" as `now - LastReportReceivedAt` without first checking `LatestReport != null` will render a nonsensical staleness of roughly two thousand years for a site that is, in fact, freshly reachable. The two "no report yet" signals (`LatestReport == null`, `LastReportReceivedAt == default`) are independent and both must be special-cased; the sentinel value is an easy trap. **Recommendation** Make `LastReportReceivedAt` nullable (`DateTimeOffset?`) so "no report received yet" is an explicit, unmissable state rather than a magic sentinel — consistent with how `LatestReport` was already made nullable for the same case — and have UI consumers render staleness only when it has a value. Alternatively, document the `default` sentinel prominently on the field and audit every UI reader, but the nullable option is safer and matches the existing `LatestReport` treatment. **Resolution** _Unresolved._ ### HealthMonitoring-016 — `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider` | | | |--|--| | Severity | Low | | Category | Code organization & conventions | | Status | Open | | Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:151` | **Description** `CollectReport` stamps each report with `ReportTimestamp: DateTimeOffset.UtcNow`, read directly from the system clock. Every other time-dependent class in the module — `CentralHealthAggregator`, `HealthReportSender`, `CentralHealthReportLoop` — was deliberately refactored (HealthMonitoring-006) to take an injectable `TimeProvider` so the behaviour is deterministically testable and the clock dependency is explicit. `SiteHealthCollector` is the lone holdout: the report timestamp cannot be controlled in a unit test, which is why `SiteHealthCollectorTests.CollectReport_IncludesUtcTimestamp` can only assert the timestamp falls in a `before`/`after` wall-clock window rather than equalling a known instant. This is a minor consistency/testability gap, not a behaviour bug. **Recommendation** Add an optional `TimeProvider` constructor parameter to `SiteHealthCollector` (defaulting to `TimeProvider.System`, mirroring the other classes) and derive `ReportTimestamp` from `GetUtcNow()`, so the report timestamp is deterministically testable and the module is consistent. **Resolution** _Unresolved._