A heartbeat-registered site that has never sent a full report now has
LastReportReceivedAt = null instead of the year-0001 sentinel. TimestampDisplay
accepts DateTimeOffset? and renders null as a placeholder ('awaiting first
report') rather than a ~2000-year-stale date. Cross-module: HealthMonitoring +
CentralUI.
44 KiB
Code Review — HealthMonitoring
| Field | Value |
|---|---|
| Module | src/ScadaLink.HealthMonitoring |
| Design doc | docs/requirements/Component-HealthMonitoring.md |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | 39d737e |
| Open findings | 0 |
Summary
The HealthMonitoring module is small, readable, and broadly faithful to the design
intent: per-interval error counters with atomic read-and-reset, monotonic sequence
numbers with Unix-ms seeding to survive failover, sequence-guarded staleness
rejection, and a 60s offline timeout. However, the review surfaced two recurring
themes. First, a documented metric is silently unimplemented — store-and-forward
buffer depths are never populated (SetStoreAndForwardDepths has zero callers and a
test asserts the field is always empty), so the dashboard cannot show the buffer
depth metric the design doc requires. Second, the central aggregator's in-memory
state model has unguarded shared mutable state: SiteHealthState is a mutable
class whose fields are written by a background timer thread, by ProcessReport, and
by MarkHeartbeat with no synchronization, and the same live mutable objects are
handed straight to UI callers via GetAllSiteStates. The ProcessReport logic also
mutates shared state inside a ConcurrentDictionary.AddOrUpdate update delegate,
which the runtime may invoke more than once under contention. Additionally there are
gaps around central self-report offline detection, heartbeats for not-yet-registered
sites being dropped, and missing test coverage for the central report loop,
heartbeat path, and most collector setters. None of the findings are crash-class,
but the concurrency issues are Medium/High and the missing S&F metric is a real
design-adherence gap.
Re-review 2026-05-17 (commit 39d737e)
All twelve prior findings (HealthMonitoring-001..012) are confirmed Resolved —
SiteHealthState is now an immutable sealed record mutated only via atomic
compare-and-swap, the store-and-forward buffer-depth metric is populated, the
central-site offline grace and the unknown-site heartbeat registration are in
place, and the test suite has grown to cover the report loop, heartbeat path, and
collector setters. This re-review found 4 new findings, all Low/Medium, none
crash-class. They are residual polish items rather than behaviour regressions:
an inaccurate offline-check-interval comment (HealthMonitoring-013), unvalidated
HealthMonitoringOptions intervals that crash the hosted service on
misconfiguration (HealthMonitoring-014), a heartbeat-only registered site left
with a year-0001 LastReportReceivedAt that the UI's staleness display would
otherwise render verbatim (HealthMonitoring-015 — resolved via a coordinated
HealthMonitoring + CentralUI change), and CollectReport reading
DateTimeOffset.UtcNow directly instead of the module's now-standard injected
TimeProvider (HealthMonitoring-016). The module remains small, readable, and
broadly faithful to the design intent.
Checklist coverage
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | x | MarkHeartbeat drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). Re-review: heartbeat-registered site left with year-0001 LastReportReceivedAt (HealthMonitoring-015). |
| 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via IHealthReportTransport); AddHealthMonitoringActors is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. |
| 3 | Concurrency & thread safety | x | Unguarded mutable SiteHealthState (HealthMonitoring-002); mutation inside AddOrUpdate delegate (HealthMonitoring-003); GetAllSiteStates leaks live mutable references (HealthMonitoring-008). Collector counters correctly use Interlocked. |
| 4 | Error handling & resilience | x | HealthReportSender silently swallows inner failures with bare catch {} (HealthMonitoring-010, resolved); top-level loop error handling is sound. Re-review: HealthMonitoringOptions intervals unvalidated — a zero/negative value crashes the hosted service at PeriodicTimer construction (HealthMonitoring-014). |
| 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. |
| 6 | Performance & resource management | x | PeriodicTimer instances correctly disposed via using. Dictionary snapshots per report are acceptable at the documented scale. No issues found. |
| 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). |
| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011, resolved). Re-review: SiteHealthCollector.CollectReport reads DateTimeOffset.UtcNow directly instead of the module's now-standard injected TimeProvider (HealthMonitoring-016). |
| 9 | Testing coverage | x | No tests for CentralHealthReportLoop, MarkHeartbeat, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); LatestReport = null! misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses OfflineTimeout (HealthMonitoring-013). |
Findings
HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:104, src/ScadaLink.HealthMonitoring/HealthReportSender.cs:79 |
Description
Component-HealthMonitoring.md lists "Store-and-forward buffer depth" (pending
messages by category) as a required monitored metric. SiteHealthCollector exposes
SetStoreAndForwardDepths(...) to receive it, but a codebase-wide search shows the
method has no callers — _sfBufferDepths always remains the empty dictionary it
is initialized to. HealthReportSender queries GetParkedMessageCountAsync() and
sets ParkedMessageCount, but parked count is a distinct metric from per-category
buffer depth. The test SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder
even codifies the unimplemented state as expected behaviour. The result is that the
central dashboard cannot display buffer depth, a documented triage metric.
Recommendation
Wire SetStoreAndForwardDepths into HealthReportSender.ExecuteAsync (alongside the
existing parked-count call) using the S&F engine's per-category depth API, or, if the
metric is intentionally deferred, record that decision in the design doc and remove
the dead setter. Update the placeholder test accordingly once implemented.
Resolution
Resolved 2026-05-16 (commit <pending>). HealthReportSender.ExecuteAsync now
queries the existing public StoreAndForwardStorage.GetBufferDepthByCategoryAsync()
API alongside the parked-count call and feeds the per-category depths into
SiteHealthCollector.SetStoreAndForwardDepths (category enum names as keys), so the
documented store-and-forward buffer depth metric is populated in every emitted
report. Regression test HealthReportSenderTests.ReportsIncludeStoreAndForwardBufferDepthsFromStorage
verifies populated per-category depths. The obsolete placeholder test
SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder continues
to pass — it only exercises the collector with no setter call and still correctly
asserts the empty default; it was left in place as the collector-level default-state
test. No StoreAndForward source was modified (existing public API only).
HealthMonitoring-002 — SiteHealthState mutable fields written from multiple threads without synchronization
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11, src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86, src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:137 |
Description
SiteHealthState is a plain mutable class. Its fields (LatestReport,
LastReportReceivedAt, LastHeartbeatAt, LastSequenceNumber, IsOnline) are
mutated from at least three concurrent contexts: ProcessReport (caller thread —
ClusterClient/PubSub message handlers), MarkHeartbeat (caller thread — heartbeat
handler), and CheckForOfflineSites (the BackgroundService timer thread). The
ConcurrentDictionary only protects the dictionary structure, not the objects it
stores. A heartbeat update and the offline-check can interleave on the same
SiteHealthState instance, and reads/writes of DateTimeOffset (a 16-byte struct)
and long fields are not guaranteed atomic on all platforms — producing torn reads
and lost updates of IsOnline/LastHeartbeatAt.
Recommendation
Make state transitions atomic: either guard all reads/writes of a SiteHealthState
with a per-site lock, or replace SiteHealthState with an immutable record updated
via ConcurrentDictionary compare-and-swap (TryUpdate) so every transition is
a single atomic reference swap.
Resolution
Resolved 2026-05-16 (commit <pending>). SiteHealthState is now a sealed record
with init-only properties. CentralHealthAggregator.ProcessReport,
MarkHeartbeat, and CheckForOfflineSites were rewritten to perform every state
transition as an atomic compare-and-swap (TryAdd/TryUpdate) producing a new
record instance — no field of a stored state is ever mutated in place. ProcessReport
uses an explicit CAS retry loop instead of the AddOrUpdate update delegate so the
sequence-number guard and the field writes are evaluated against the value actually
installed (this also closes the root cause behind HealthMonitoring-003). Reads via
GetAllSiteStates/GetSiteState now hand out immutable snapshots, so a concurrent
reader can never observe a torn or half-applied state. LatestReport was changed
from SiteHealthReport (null!) to SiteHealthReport?, making the contract honest;
all existing consumers (CentralUI, integration/perf tests) already null-checked it
and continue to build clean. Regression test
CentralHealthAggregatorTests.ProcessReport_ConcurrentUpdates_NeverLoseSequenceOrTearState
exercises concurrent report/heartbeat/read threads and asserts snapshot consistency
and no lost updates.
HealthMonitoring-003 — Shared state mutated inside ConcurrentDictionary.AddOrUpdate update delegate
| Severity | Medium — re-triaged: already resolved as a side-effect of HealthMonitoring-002. |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:45-103 |
Description
The update delegate passed to AddOrUpdate mutates the existing object in place
(existing.LatestReport = report; existing.IsOnline = true; ...). AddOrUpdate's
contract explicitly allows the update delegate to be invoked more than once under
contention (when the CAS that installs the result loses a race and is retried). Each
invocation mutates the shared object, so a concurrent report for the same site can
observe a half-applied update, and the multi-field assignment is not atomic with
respect to readers in GetAllSiteStates/CheckForOfflineSites. The intended
"only replace if sequence is higher" guard can also be subverted because the
sequence comparison and the field writes are not a single atomic step.
Recommendation
Have the update delegate return a new SiteHealthState (record with copy)
rather than mutating existing, and treat the dictionary value as immutable.
Combined with HealthMonitoring-002, this makes every state transition an atomic
reference swap with no observable intermediate state.
Resolution
Resolved 2026-05-16 (commit pending). Re-triaged: verified against the current
source — the root cause was already eliminated by the HealthMonitoring-002 fix.
ProcessReport no longer uses AddOrUpdate at all; it is now an explicit
compare-and-swap retry loop (TryGetValue → guard → TryAdd/TryUpdate) that
produces a brand-new immutable SiteHealthState record per transition and never
mutates a stored value in place. The sequence-number guard and the field writes are
evaluated against the value actually installed by the CAS, so the "only replace if
sequence is higher" invariant holds. The concurrency stress test
CentralHealthAggregatorTests.ProcessReport_ConcurrentUpdates_NeverLoseSequenceOrTearState
(added under HealthMonitoring-002) already exercises this path and asserts no lost
updates and no torn snapshots. No further code change was required for this finding.
HealthMonitoring-004 — Inconsistent heartbeat interval described across XML docs
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:146-148, src/ScadaLink.HealthMonitoring/SiteHealthState.cs:21, src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs:16 |
Description
The heartbeat cadence that offline detection relies on is documented inconsistently.
CheckForOfflineSites says "heartbeats arrive every ~5s"; SiteHealthState.LastHeartbeatAt
says "~5s heartbeat"; but ICentralHealthAggregator.MarkHeartbeat says "~2s
heartbeats are arriving". The actual cadence is set elsewhere (Cluster Infrastructure /
SiteCommunicationActor). Readers cannot reason about whether a 60s offline timeout
gives the intended grace without a single authoritative number.
Recommendation
Pick the correct interval (verify against the heartbeat scheduler in
SiteCommunicationActor/Cluster Infrastructure) and use it consistently in all three
comments, ideally referencing the owning component rather than restating a magic number.
Resolution
Resolved 2026-05-16 (commit pending). Documentation-only — no regression test is
meaningful. Verified the authoritative cadence against the Communication module:
SiteCommunicationActor.PreStart schedules the application-level heartbeat to central
at CommunicationOptions.TransportHeartbeatInterval, which defaults to 5 seconds
(CommunicationOptions.cs:49). The stale "~2s" in ICentralHealthAggregator.MarkHeartbeat
was corrected; all three XML docs (ICentralHealthAggregator.MarkHeartbeat,
SiteHealthState.LastHeartbeatAt, CentralHealthAggregator.CheckForOfflineSites) now
state the single authoritative ~5s figure and reference the owning component
(Cluster Infrastructure / SiteCommunicationActor —
CommunicationOptions.TransportHeartbeatInterval) rather than restating a bare magic
number, so readers can reason about the 60s offline grace.
HealthMonitoring-005 — Central self-report site can flap offline; no heartbeat grace like real sites
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:48-81, src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:149 |
Description
CheckForOfflineSites decides offline status purely from LastHeartbeatAt, and for
real sites that field is kept fresh by frequent (~2-5s) heartbeats so the 60s timeout
only fires on genuine total loss. The synthetic central site, however, has no
heartbeat source — LastHeartbeatAt is only bumped by ProcessReport from the
30s CentralHealthReportLoop. The loop also only runs on the cluster leader and
silently skips a cycle on any exception. Consequently, a single skipped/late central
self-report (leader GC pause, brief stall, mid-failover before the new leader's loop
spins up) leaves central with no signal for >60s and it is marked offline even
though the central cluster is healthy. The central card thus has no equivalent of
the "one missed report grace" the design doc grants real sites.
Recommendation
Either feed central a heartbeat equivalent (e.g. have MarkHeartbeat called for
CentralSiteId on a fast timer independent of the leader-only report loop), or apply
a longer/distinct offline timeout to the central keyspace entry, and ensure the new
leader starts the report loop promptly on failover.
Resolution
Resolved 2026-05-16 (commit pending). Applied the distinct-timeout option. A new
HealthMonitoringOptions.CentralOfflineTimeout (default 3x the report interval =
3 minutes) is applied by CentralHealthAggregator.CheckForOfflineSites to the
central keyspace entry only — real sites keep the existing OfflineTimeout. This
gives the synthetic central site (which has no heartbeat source and is fed solely
by the 30s leader-only CentralHealthReportLoop) enough grace to survive a single
skipped or late self-report — the equivalent of the "one missed report" grace the
design doc grants real sites — while still going offline on genuine total loss.
Regression tests CentralHealthAggregatorTests.OfflineDetection_CentralSite_HasLongerGraceThanRealSites
(central survives 75s of silence while a real site goes offline) and
OfflineDetection_CentralSite_StillGoesOfflineOnGenuineLoss (central still detected
offline after 10 minutes) verify the behaviour.
HealthMonitoring-006 — Sequence seeding contradicts the doc's "starting at 1" wording and is untestable
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ScadaLink.HealthMonitoring/HealthReportSender.cs:28, src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:32 |
Description
The HealthReportSender class XML summary states "Sequence numbers are monotonic,
starting at 1, and reset on service restart." The implementation instead seeds
_sequenceNumber with DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() so the first
emitted sequence is a large epoch value, specifically to keep ordering correct across
failover. The summary is therefore stale and contradicts the code. Separately, the
seed reads DateTimeOffset.UtcNow directly at field initialization rather than
through an injected TimeProvider (which CentralHealthAggregator already uses),
making the seeding logic impossible to unit-test deterministically and dependent on
node wall-clock agreement — if one node's clock lags, its post-failover reports can
be silently rejected as stale by the aggregator.
Recommendation
Fix the HealthReportSender XML summary to describe the actual Unix-ms seeding
strategy, and inject TimeProvider for the seed so the behaviour is testable and the
clock dependency is explicit.
Resolution
Resolved 2026-05-16 (commit pending). The HealthReportSender XML summary was
rewritten to describe the real strategy — sequence numbers are monotonic and
restart-resetting but explicitly not zero/one-based; they are seeded with the
current Unix epoch (ms) so a freshly-active node always sorts after the prior active.
Both HealthReportSender and CentralHealthReportLoop now accept an optional
TimeProvider (defaulting to TimeProvider.System) and derive the seed via
GetUtcNow().ToUnixTimeMilliseconds() in the constructor body instead of reading
DateTimeOffset.UtcNow at field initialization, so the seeding is deterministically
testable and the clock dependency is explicit. CentralHealthReportLoop gained a
CurrentSequenceNumber test accessor mirroring HealthReportSender. Regression tests
HealthReportSenderTests.SequenceNumberSeed_UsesInjectedTimeProvider and
CentralHealthReportLoopTests.SequenceNumberSeed_UsesInjectedTimeProvider assert the
seed equals the injected provider's Unix-ms instant (these would not compile against
the pre-fix code, which had no TimeProvider parameter).
HealthMonitoring-007 — Heartbeats for not-yet-registered sites are silently dropped
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86-99 |
Description
MarkHeartbeat returns immediately if the site is not already in _siteStates
("registration only happens on report"). Central health state is in-memory only and
not persisted. After a central restart or failover the aggregator starts empty, so
for up to one full report interval (default 30s) every site emits only heartbeats
that are all discarded — the site is reported as unknown (absent from
GetAllSiteStates) rather than online, even though heartbeats prove it is
reachable. This is a visible dashboard regression precisely during the failover
window, which is when operators most need accurate status.
Recommendation
Allow MarkHeartbeat to register a minimal SiteHealthState (online, no
LatestReport yet, with a UI-visible "awaiting first report" indication) when a
heartbeat arrives for an unknown site, so reachable sites show online immediately
after a central restart.
Resolution
Resolved 2026-05-16 (commit pending). CentralHealthAggregator.MarkHeartbeat no
longer returns early for an unknown site. When a heartbeat arrives for a site with no
aggregator state, it now atomically registers (TryAdd, with CAS-loss retry) a
minimal SiteHealthState that is IsOnline = true, LatestReport = null,
LastSequenceNumber = 0 and LastHeartbeatAt = receivedAt — an "online, awaiting
first report" state. This relies on the HealthMonitoring-002 change that made
LatestReport properly nullable, so UI consumers already handle the null case.
Reachable sites therefore show online immediately after a central restart/failover
instead of being absent ("unknown") for up to a full report interval. The
ICentralHealthAggregator.MarkHeartbeat XML doc was corrected to describe the new
behaviour. Regression test
CentralHealthAggregatorTests.MarkHeartbeat_RegistersUnknownSite_AsOnlineAwaitingReport
verifies the registration; MarkHeartbeat_KeepsSiteOnline_BetweenReports and
MarkHeartbeat_BringsOfflineSiteBackOnline cover the already-registered paths.
HealthMonitoring-008 — GetAllSiteStates / GetSiteState leak live mutable state objects to callers
| Severity | Medium — re-triaged: already resolved as a side-effect of HealthMonitoring-002. |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:146-158 |
Description
GetAllSiteStates copies the dictionary but the copy still holds references to the
same live mutable SiteHealthState instances; GetSiteState returns the live
instance directly. UI consumers (Blazor Server / SignalR circuits) read these objects
on their own threads while the aggregator's background timer and report handlers
concurrently mutate the very same instances (see HealthMonitoring-002). A UI render
can observe a SiteHealthState with, e.g., IsOnline == true but a LatestReport
from a different update, or a torn DateTimeOffset. Callers could also mutate the
shared state, corrupting aggregator state.
Recommendation
Return immutable snapshots: convert SiteHealthState to a record (per
HealthMonitoring-002/003) so handing out the reference is safe, or deep-copy each
state into an immutable DTO before returning.
Resolution
Resolved 2026-05-16 (commit pending). Re-triaged: verified against the current
source — the root cause was already eliminated by the HealthMonitoring-002 fix.
SiteHealthState is now a sealed record with init-only properties (fully
immutable). Every aggregator transition installs a brand-new record instance via an
atomic compare-and-swap, so the references GetAllSiteStates and GetSiteState hand
out are immutable snapshots — a UI consumer reading one on its own thread can never
observe a torn or half-applied state, and cannot mutate aggregator state through the
returned reference. The recommended fix (make SiteHealthState a record) is exactly
what the HealthMonitoring-002 change did, so no further code change was required.
HealthMonitoring-009 — Missing test coverage for central report loop, heartbeat path, replication, and collector setters
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | tests/ScadaLink.HealthMonitoring.Tests/ |
Description
Several behaviours have no automated coverage:
CentralHealthReportLoop— leader-only gating (SelfIsPrimary), self-report generation, sequence assignment: no test file at all.CentralHealthAggregator.MarkHeartbeat— keeping a site online between reports, online recovery via heartbeat, and the unknown-site drop behaviour (HealthMonitoring-007): untested.- Offline detection driven by
LastHeartbeatAtvsLastReportReceivedAt— the existing offline tests only advance time after a report, never exercising the heartbeat-keeps-alive path the design depends on. SiteHealthCollector—SetClusterNodes,SetInstanceCounts,SetParkedMessageCount,SetNodeHostname,SetActiveNode/NodeRole,UpdateTagQuality,UpdateConnectionEndpoint: not reflected-in-report tested.SiteHealthReportReplicaidempotency under double delivery: untested.
Recommendation
Add tests for the central report loop (with a fake IClusterNodeProvider), the
heartbeat-keeps-online and unknown-site heartbeat paths, and the remaining collector
setters' presence in CollectReport output.
Resolution
Resolved 2026-05-16 (commit pending). Added the missing coverage:
CentralHealthReportLoopTests(new file) —GeneratesCentralReports_WhenSelfIsPrimary,GeneratesNoReports_WhenNotPrimary(leader-onlySelfIsPrimarygating with a fakeIClusterNodeProvider),AssignsMonotonicSequenceNumbers, andSetsActiveNodeFlag_EvenWhenNotPrimary.CentralHealthAggregatorTests—MarkHeartbeat_RegistersUnknownSite_AsOnlineAwaitingReport,MarkHeartbeat_KeepsSiteOnline_BetweenReports(heartbeat keeps a site online past the offline timeout — the path the design depends on), andMarkHeartbeat_BringsOfflineSiteBackOnline.SiteHealthCollectorTests— reflected-in-report tests forSetClusterNodes,SetInstanceCounts,SetParkedMessageCount,SetNodeHostname,SetActiveNode/NodeRole,UpdateTagQuality,UpdateConnectionEndpoint, andSetStoreAndForwardDepths.
The SiteHealthReportReplica idempotency item is out of scope for this module:
SiteHealthReportReplica is declared in ScadaLink.Commons and published/consumed by
CentralCommunicationActor in the ScadaLink.Communication module — the
HealthMonitoring module itself has no replication code. Replica double-delivery
idempotency is already covered by ProcessReport's sequence-number guard
(ProcessReport_RejectsEqualSequence, ProcessReport_RejectsStaleReport_WhenSequenceNotGreater);
testing the actor-side double-publish belongs in the Communication module's review.
The HealthMonitoring test suite now stands at 47 passing tests (was 30).
HealthMonitoring-010 — HealthReportSender silently swallows inner failures with bare catch {}
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.HealthMonitoring/HealthReportSender.cs:70-87 |
Description
The cluster-nodes update and parked-message-count query are each wrapped in
try { ... } catch { /* Non-fatal */ } with no logging. A persistent failure (e.g.
the S&F SQLite store is permanently broken, or GetClusterNodes() always throws)
is then completely invisible — every report silently ships with stale cluster nodes
and a parked count of 0, with nothing in the logs to explain the wrong dashboard
values. Bare catch with no exception variable also catches OperationCanceledException
and would mask shutdown signalling if the awaited call observed the token.
Recommendation
Catch a specific exception type (or at least Exception ex) and LogWarning/LogDebug
the failure so persistent degradation is diagnosable; avoid swallowing
OperationCanceledException.
Resolution
Resolved 2026-05-16 (commit pending). All three bare catch { } blocks in
HealthReportSender.ExecuteAsync (cluster-nodes refresh, parked-message-count query,
S&F buffer-depth query) were changed to catch (Exception ex) and now emit a
LogWarning(ex, ...) naming the failed operation and the site, so a persistent
degradation (broken S&F SQLite store, always-throwing GetClusterNodes()) is
diagnosable instead of silently shipping stale/zero metrics. On the OperationCanceledException
concern: verified the inner calls cannot raise OCE from cancellation —
IClusterNodeProvider.GetClusterNodes() is synchronous and takes no token, and
StoreAndForwardStorage.GetParkedMessageCountAsync()/GetBufferDepthByCategoryAsync()
take no CancellationToken parameter, so the only cancellation path is the outer loop's
PeriodicTimer.WaitForNextTickAsync(stoppingToken), which is unaffected. Regression
test HealthReportSenderTests.ClusterNodeRefreshFailure_IsLoggedNotSwallowed injects a
throwing IClusterNodeProvider, asserts the loop still ships reports, and asserts a
warning carrying the InvalidOperationException is logged — confirmed to fail against
the pre-fix bare catch { } (logged-entry collection was empty).
HealthMonitoring-011 — AddHealthMonitoringActors is a dead no-op placeholder
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs:42-46 |
Description
AddHealthMonitoringActors does nothing but return services with a "Placeholder for
Phase 4+" comment. A public extension method that silently no-ops is a trap: a caller
who registers it will believe actor wiring is in place. No caller currently invokes it.
Recommendation
Remove the method until it has real behaviour, or throw NotImplementedException so
accidental use fails loudly. If the actor model for this component is genuinely
planned, track it in the design doc instead of a half-method.
Resolution
Resolved 2026-05-16 (commit pending). Verified a codebase-wide search — the
AddHealthMonitoringActors no-op extension has no production callers (only references
were in the code-review docs themselves). The HealthMonitoring module deliberately
contains no actors (transport is abstracted behind IHealthReportTransport; actor-side
wiring lives in the Communication module), so there is no genuine Phase-4 actor work for
this method to host. The dead placeholder was removed from ServiceCollectionExtensions
so a caller can no longer be misled into believing actor wiring exists. No regression
test is meaningful for a deleted no-op; removal is verified by the module continuing to
build and all 50 tests passing.
HealthMonitoring-012 — SiteHealthState.LatestReport initialized to null!, misrepresenting the contract
| Severity | Low — re-triaged: already resolved as a side-effect of HealthMonitoring-002. |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11 |
Description
LatestReport is declared SiteHealthReport LatestReport { get; set; } = null!;,
suppressing nullability. Today every code path that creates a SiteHealthState (only
ProcessReport) assigns LatestReport, so it is never actually null — but the
null! declaration tells readers and the compiler the opposite of the real
invariant. If HealthMonitoring-007 is addressed by registering state from a heartbeat
(no report yet), this becomes a live NullReferenceException risk for UI code that
dereferences LatestReport.
Recommendation
Either make LatestReport required (matching how it is genuinely always set today)
or make it properly nullable SiteHealthReport? and have consumers handle the
"registered, no report yet" case explicitly — consistent with whatever is decided
for HealthMonitoring-007.
Resolution
Resolved 2026-05-16 (commit pending). Re-triaged: verified against the current
source — the root cause was already eliminated by the HealthMonitoring-002 /
HealthMonitoring-007 fixes. SiteHealthState.LatestReport is already declared
SiteHealthReport? LatestReport { get; init; } (the recommendation's "make it properly
nullable" option) with an XML doc explaining the null case ("known only via heartbeats,
has not yet sent a report"). A codebase-wide search confirms no null! suppression
remains anywhere in src/ScadaLink.HealthMonitoring. This is exactly the change
HealthMonitoring-002 made when converting SiteHealthState to an immutable record, so
the contract is now honest and no further code change was required.
HealthMonitoring-013 — Offline-check interval comment claims "shorter timeout" but only ever uses OfflineTimeout
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:194-196 |
Description
ExecuteAsync derives the PeriodicTimer cadence with the comment "Check at half
the (shorter) offline timeout interval for timely detection", but the code only
reads _options.OfflineTimeout:
var checkInterval = TimeSpan.FromMilliseconds(_options.OfflineTimeout.TotalMilliseconds / 2);
CentralOfflineTimeout (HealthMonitoring-005's fix) is never considered. With the
default options (OfflineTimeout 60s, CentralOfflineTimeout 3m) OfflineTimeout
genuinely is the shorter of the two, so the parenthetical happens to hold. But the
comment states an invariant the code does not enforce: if an operator configures
CentralOfflineTimeout smaller than OfflineTimeout, the check cadence stays
tied to OfflineTimeout, and central offline detection is delayed by up to a full
OfflineTimeout / 2 beyond the intended CentralOfflineTimeout window. The comment
misleads a reader into believing the cadence already adapts to whichever timeout is
shorter.
Recommendation
Either compute checkInterval from Math.Min(OfflineTimeout, CentralOfflineTimeout)
so the code matches the comment, or drop the "(shorter)" wording and state plainly
that the cadence is derived from OfflineTimeout only (acceptable while the default
CentralOfflineTimeout is the larger value).
Resolution
Resolved 2026-05-17. Root cause confirmed against source — ExecuteAsync derived the
PeriodicTimer cadence solely from OfflineTimeout while the comment claimed it
tracked the "(shorter)" timeout. Took the code-matches-comment option: extracted the
cadence into CentralHealthAggregator.ComputeCheckInterval, which now derives it from
half of the shorter of OfflineTimeout and CentralOfflineTimeout, so a
CentralOfflineTimeout configured below OfflineTimeout is still polled at least
twice within its window. The comment was rewritten to match. Regression test
CentralHealthAggregatorTests.CheckInterval_IsHalfTheShorterTimeout asserts the
default case (30s) and the shorter-CentralOfflineTimeout case (10s) — the latter
would have returned 30s against the pre-fix code.
HealthMonitoring-014 — HealthMonitoringOptions intervals are unvalidated; a zero/negative value crashes the hosted service
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.HealthMonitoring/HealthMonitoringOptions.cs:3-20, src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:196, src/ScadaLink.HealthMonitoring/HealthReportSender.cs:67, src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:63 |
Description
HealthMonitoringOptions is bound from the ScadaLink:HealthMonitoring config
section (SiteServiceRegistration.BindSharedOptions) with no validation —
no IValidateOptions<HealthMonitoringOptions>, no ValidateDataAnnotations, no
ValidateOnStart. ReportInterval, OfflineTimeout, and CentralOfflineTimeout
are all fed straight into new PeriodicTimer(...) (and OfflineTimeout into a
division for the check interval). PeriodicTimer's constructor throws
ArgumentOutOfRangeException for a zero or negative period. A misconfigured
appsettings.json (e.g. "ReportInterval": "00:00:00", an empty/garbled value
that binds to TimeSpan.Zero, or a negative span) therefore crashes the
HealthReportSender / CentralHealthReportLoop / CentralHealthAggregator
hosted service at startup with an opaque exception that does not name the
offending config key, rather than failing fast with a clear validation message.
Recommendation
Add an options validator (DataAnnotations [Range]-style on the spans, or an
IValidateOptions<HealthMonitoringOptions>) that rejects non-positive
ReportInterval/OfflineTimeout/CentralOfflineTimeout and ideally requires
CentralOfflineTimeout >= OfflineTimeout, and call .ValidateOnStart() so a bad
configuration fails fast with a message naming the section and key.
Resolution
Resolved 2026-05-17. Root cause confirmed — HealthMonitoringOptions had no
validator, so a zero/negative interval reached new PeriodicTimer(...) and crashed
the hosted service with an opaque ArgumentOutOfRangeException. Added
HealthMonitoringOptionsValidator : IValidateOptions<HealthMonitoringOptions> that
rejects non-positive ReportInterval/OfflineTimeout/CentralOfflineTimeout and a
CentralOfflineTimeout shorter than OfflineTimeout, each failure naming the
ScadaLink:HealthMonitoring config key. It is registered (idempotently, via
TryAddEnumerable) by all three ServiceCollectionExtensions registration methods,
so it fires when the hosted services resolve IOptions.Value at startup — failing
fast with a clear message. (ValidateOnStart() lives in the Host module's binding
call, which is out of scope; the validator nonetheless runs at startup because the
hosted-service constructors resolve the options eagerly — matching the existing
ClusterOptionsValidator registration pattern.) Regression tests in
HealthMonitoringOptionsValidatorTests cover the valid default plus zero/negative
intervals and the CentralOfflineTimeout < OfflineTimeout case.
HealthMonitoring-015 — Heartbeat-registered site is left with a year-0001 LastReportReceivedAt
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:122-130, src/ScadaLink.HealthMonitoring/SiteHealthState.cs:27 |
Description
When MarkHeartbeat registers a previously-unknown site (HealthMonitoring-007's
fix), it sets LastReportReceivedAt = default — i.e. DateTimeOffset.MinValue
(0001-01-01). The XML doc on SiteHealthState.LastReportReceivedAt states the
field is "Used by the UI to surface report staleness during failover." A
heartbeat-only site therefore has LatestReport == null and
LastReportReceivedAt == DateTimeOffset.MinValue. Any UI code that computes
"last report N ago" as now - LastReportReceivedAt without first checking
LatestReport != null will render a nonsensical staleness of roughly two
thousand years for a site that is, in fact, freshly reachable. The two
"no report yet" signals (LatestReport == null, LastReportReceivedAt == default)
are independent and both must be special-cased; the sentinel value is an easy trap.
Recommendation
Make LastReportReceivedAt nullable (DateTimeOffset?) so "no report received
yet" is an explicit, unmissable state rather than a magic sentinel — consistent
with how LatestReport was already made nullable for the same case — and have UI
consumers render staleness only when it has a value. Alternatively, document the
default sentinel prominently on the field and audit every UI reader, but the
nullable option is safer and matches the existing LatestReport treatment.
Resolution
Resolved 2026-05-17. Root cause confirmed against source — CentralHealthAggregator.MarkHeartbeat
registered a heartbeat-only site with LastReportReceivedAt = default (0001-01-01),
which Health.razor:74 passed verbatim into TimestampDisplay, rendering a
~2000-year-stale timestamp. Applied the finding's recommended cross-module fix.
HealthMonitoring: SiteHealthState.LastReportReceivedAt is now DateTimeOffset?,
and MarkHeartbeat's unknown-site registration sets it to null — "no report yet"
is now an explicit nullable state, consistent with the already-nullable LatestReport.
ProcessReport continues to set a real instant. CentralUI: TimestampDisplay.Value
now accepts DateTimeOffset? and renders null as a plain text-muted placeholder
(default "never", configurable via a new NullText parameter); existing non-null
callers (AuditLog, EventLogs, Deployments) are unaffected by the implicit
widening. Health.razor's "Last report" cell passes NullText="awaiting first report"
so a heartbeat-only site reads cleanly. Regression tests:
CentralHealthAggregatorTests.MarkHeartbeat_RegistersUnknownSite_WithNullLastReportReceivedAt
and ProcessReport_SetsLastReportReceivedAt_ForHeartbeatRegisteredSite (HealthMonitoring —
the first would not compile against the pre-fix non-nullable field), and
TimestampDisplayTests.Render_NullValue_ShowsNeverPlaceholder,
Render_NullValue_DoesNotRenderYear0001Sentinel, Render_NonNullValue_ShowsFormattedTimestamp
(CentralUI).
HealthMonitoring-016 — SiteHealthCollector.CollectReport reads DateTimeOffset.UtcNow directly instead of an injected TimeProvider
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:151 |
Description
CollectReport stamps each report with ReportTimestamp: DateTimeOffset.UtcNow,
read directly from the system clock. Every other time-dependent class in the
module — CentralHealthAggregator, HealthReportSender, CentralHealthReportLoop
— was deliberately refactored (HealthMonitoring-006) to take an injectable
TimeProvider so the behaviour is deterministically testable and the clock
dependency is explicit. SiteHealthCollector is the lone holdout: the report
timestamp cannot be controlled in a unit test, which is why
SiteHealthCollectorTests.CollectReport_IncludesUtcTimestamp can only assert the
timestamp falls in a before/after wall-clock window rather than equalling a
known instant. This is a minor consistency/testability gap, not a behaviour bug.
Recommendation
Add an optional TimeProvider constructor parameter to SiteHealthCollector
(defaulting to TimeProvider.System, mirroring the other classes) and derive
ReportTimestamp from GetUtcNow(), so the report timestamp is deterministically
testable and the module is consistent.
Resolution
Resolved 2026-05-17. Root cause confirmed — CollectReport stamped ReportTimestamp
from DateTimeOffset.UtcNow directly while every other time-dependent class in the
module takes an injectable TimeProvider. Added an optional TimeProvider
constructor parameter to SiteHealthCollector (defaulting to TimeProvider.System,
mirroring CentralHealthAggregator/HealthReportSender/CentralHealthReportLoop)
and CollectReport now derives ReportTimestamp from _timeProvider.GetUtcNow().
The DI registration (AddSingleton<ISiteHealthCollector, SiteHealthCollector>)
continues to work via the optional parameter. Regression test
SiteHealthCollectorTests.CollectReport_StampsTimestamp_FromInjectedTimeProvider
asserts the timestamp equals a fixed injected instant exactly (not just a
before/after window); it would not compile against the pre-fix single-arg-less
constructor.