Files
scadalink-design/code-reviews/HealthMonitoring/findings.md

21 KiB

Code Review — HealthMonitoring

Field Value
Module src/ScadaLink.HealthMonitoring
Design doc docs/requirements/Component-HealthMonitoring.md
Status Reviewed
Last reviewed 2026-05-16
Reviewer claude-agent
Commit reviewed 9c60592
Open findings 10

Summary

The HealthMonitoring module is small, readable, and broadly faithful to the design intent: per-interval error counters with atomic read-and-reset, monotonic sequence numbers with Unix-ms seeding to survive failover, sequence-guarded staleness rejection, and a 60s offline timeout. However, the review surfaced two recurring themes. First, a documented metric is silently unimplemented — store-and-forward buffer depths are never populated (SetStoreAndForwardDepths has zero callers and a test asserts the field is always empty), so the dashboard cannot show the buffer depth metric the design doc requires. Second, the central aggregator's in-memory state model has unguarded shared mutable state: SiteHealthState is a mutable class whose fields are written by a background timer thread, by ProcessReport, and by MarkHeartbeat with no synchronization, and the same live mutable objects are handed straight to UI callers via GetAllSiteStates. The ProcessReport logic also mutates shared state inside a ConcurrentDictionary.AddOrUpdate update delegate, which the runtime may invoke more than once under contention. Additionally there are gaps around central self-report offline detection, heartbeats for not-yet-registered sites being dropped, and missing test coverage for the central report loop, heartbeat path, and most collector setters. None of the findings are crash-class, but the concurrency issues are Medium/High and the missing S&F metric is a real design-adherence gap.

Checklist coverage

# Category Examined Notes
1 Correctness & logic bugs x MarkHeartbeat drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005).
2 Akka.NET conventions x Module itself contains no actors (transport abstracted via IHealthReportTransport); AddHealthMonitoringActors is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope.
3 Concurrency & thread safety x Unguarded mutable SiteHealthState (HealthMonitoring-002); mutation inside AddOrUpdate delegate (HealthMonitoring-003); GetAllSiteStates leaks live mutable references (HealthMonitoring-008). Collector counters correctly use Interlocked.
4 Error handling & resilience x HealthReportSender silently swallows inner failures with bare catch {} (HealthMonitoring-010); top-level loop error handling is sound.
5 Security x No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface.
6 Performance & resource management x PeriodicTimer instances correctly disposed via using. Dictionary snapshots per report are acceptable at the documented scale. No issues found.
7 Design-document adherence x Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006).
8 Code organization & conventions x Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011).
9 Testing coverage x No tests for CentralHealthReportLoop, MarkHeartbeat, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009).
10 Documentation & comments x Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004); LatestReport = null! misrepresents the contract (HealthMonitoring-012).

Findings

HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated

Severity High
Category Design-document adherence
Status Resolved
Location src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:104, src/ScadaLink.HealthMonitoring/HealthReportSender.cs:79

Description

Component-HealthMonitoring.md lists "Store-and-forward buffer depth" (pending messages by category) as a required monitored metric. SiteHealthCollector exposes SetStoreAndForwardDepths(...) to receive it, but a codebase-wide search shows the method has no callers_sfBufferDepths always remains the empty dictionary it is initialized to. HealthReportSender queries GetParkedMessageCountAsync() and sets ParkedMessageCount, but parked count is a distinct metric from per-category buffer depth. The test SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder even codifies the unimplemented state as expected behaviour. The result is that the central dashboard cannot display buffer depth, a documented triage metric.

Recommendation

Wire SetStoreAndForwardDepths into HealthReportSender.ExecuteAsync (alongside the existing parked-count call) using the S&F engine's per-category depth API, or, if the metric is intentionally deferred, record that decision in the design doc and remove the dead setter. Update the placeholder test accordingly once implemented.

Resolution

Resolved 2026-05-16 (commit <pending>). HealthReportSender.ExecuteAsync now queries the existing public StoreAndForwardStorage.GetBufferDepthByCategoryAsync() API alongside the parked-count call and feeds the per-category depths into SiteHealthCollector.SetStoreAndForwardDepths (category enum names as keys), so the documented store-and-forward buffer depth metric is populated in every emitted report. Regression test HealthReportSenderTests.ReportsIncludeStoreAndForwardBufferDepthsFromStorage verifies populated per-category depths. The obsolete placeholder test SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder continues to pass — it only exercises the collector with no setter call and still correctly asserts the empty default; it was left in place as the collector-level default-state test. No StoreAndForward source was modified (existing public API only).

HealthMonitoring-002 — SiteHealthState mutable fields written from multiple threads without synchronization

Severity High
Category Concurrency & thread safety
Status Resolved
Location src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11, src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86, src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:137

Description

SiteHealthState is a plain mutable class. Its fields (LatestReport, LastReportReceivedAt, LastHeartbeatAt, LastSequenceNumber, IsOnline) are mutated from at least three concurrent contexts: ProcessReport (caller thread — ClusterClient/PubSub message handlers), MarkHeartbeat (caller thread — heartbeat handler), and CheckForOfflineSites (the BackgroundService timer thread). The ConcurrentDictionary only protects the dictionary structure, not the objects it stores. A heartbeat update and the offline-check can interleave on the same SiteHealthState instance, and reads/writes of DateTimeOffset (a 16-byte struct) and long fields are not guaranteed atomic on all platforms — producing torn reads and lost updates of IsOnline/LastHeartbeatAt.

Recommendation

Make state transitions atomic: either guard all reads/writes of a SiteHealthState with a per-site lock, or replace SiteHealthState with an immutable record updated via ConcurrentDictionary compare-and-swap (TryUpdate) so every transition is a single atomic reference swap.

Resolution

Resolved 2026-05-16 (commit <pending>). SiteHealthState is now a sealed record with init-only properties. CentralHealthAggregator.ProcessReport, MarkHeartbeat, and CheckForOfflineSites were rewritten to perform every state transition as an atomic compare-and-swap (TryAdd/TryUpdate) producing a new record instance — no field of a stored state is ever mutated in place. ProcessReport uses an explicit CAS retry loop instead of the AddOrUpdate update delegate so the sequence-number guard and the field writes are evaluated against the value actually installed (this also closes the root cause behind HealthMonitoring-003). Reads via GetAllSiteStates/GetSiteState now hand out immutable snapshots, so a concurrent reader can never observe a torn or half-applied state. LatestReport was changed from SiteHealthReport (null!) to SiteHealthReport?, making the contract honest; all existing consumers (CentralUI, integration/perf tests) already null-checked it and continue to build clean. Regression test CentralHealthAggregatorTests.ProcessReport_ConcurrentUpdates_NeverLoseSequenceOrTearState exercises concurrent report/heartbeat/read threads and asserts snapshot consistency and no lost updates.

HealthMonitoring-003 — Shared state mutated inside ConcurrentDictionary.AddOrUpdate update delegate

Severity Medium
Category Concurrency & thread safety
Status Open
Location src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:55-78

Description

The update delegate passed to AddOrUpdate mutates the existing object in place (existing.LatestReport = report; existing.IsOnline = true; ...). AddOrUpdate's contract explicitly allows the update delegate to be invoked more than once under contention (when the CAS that installs the result loses a race and is retried). Each invocation mutates the shared object, so a concurrent report for the same site can observe a half-applied update, and the multi-field assignment is not atomic with respect to readers in GetAllSiteStates/CheckForOfflineSites. The intended "only replace if sequence is higher" guard can also be subverted because the sequence comparison and the field writes are not a single atomic step.

Recommendation

Have the update delegate return a new SiteHealthState (record with copy) rather than mutating existing, and treat the dictionary value as immutable. Combined with HealthMonitoring-002, this makes every state transition an atomic reference swap with no observable intermediate state.

Resolution

Unresolved.

HealthMonitoring-004 — Inconsistent heartbeat interval described across XML docs

Severity Low
Category Documentation & comments
Status Open
Location src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:146-148, src/ScadaLink.HealthMonitoring/SiteHealthState.cs:21, src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs:16

Description

The heartbeat cadence that offline detection relies on is documented inconsistently. CheckForOfflineSites says "heartbeats arrive every ~5s"; SiteHealthState.LastHeartbeatAt says "~5s heartbeat"; but ICentralHealthAggregator.MarkHeartbeat says "~2s heartbeats are arriving". The actual cadence is set elsewhere (Cluster Infrastructure / SiteCommunicationActor). Readers cannot reason about whether a 60s offline timeout gives the intended grace without a single authoritative number.

Recommendation

Pick the correct interval (verify against the heartbeat scheduler in SiteCommunicationActor/Cluster Infrastructure) and use it consistently in all three comments, ideally referencing the owning component rather than restating a magic number.

Resolution

Unresolved.

HealthMonitoring-005 — Central self-report site can flap offline; no heartbeat grace like real sites

Severity Medium
Category Correctness & logic bugs
Status Open
Location src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:48-81, src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:149

Description

CheckForOfflineSites decides offline status purely from LastHeartbeatAt, and for real sites that field is kept fresh by frequent (~2-5s) heartbeats so the 60s timeout only fires on genuine total loss. The synthetic central site, however, has no heartbeat source — LastHeartbeatAt is only bumped by ProcessReport from the 30s CentralHealthReportLoop. The loop also only runs on the cluster leader and silently skips a cycle on any exception. Consequently, a single skipped/late central self-report (leader GC pause, brief stall, mid-failover before the new leader's loop spins up) leaves central with no signal for >60s and it is marked offline even though the central cluster is healthy. The central card thus has no equivalent of the "one missed report grace" the design doc grants real sites.

Recommendation

Either feed central a heartbeat equivalent (e.g. have MarkHeartbeat called for CentralSiteId on a fast timer independent of the leader-only report loop), or apply a longer/distinct offline timeout to the central keyspace entry, and ensure the new leader starts the report loop promptly on failover.

Resolution

Unresolved.

HealthMonitoring-006 — Sequence seeding contradicts the doc's "starting at 1" wording and is untestable

Severity Low
Category Design-document adherence
Status Open
Location src/ScadaLink.HealthMonitoring/HealthReportSender.cs:28, src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:32

Description

The HealthReportSender class XML summary states "Sequence numbers are monotonic, starting at 1, and reset on service restart." The implementation instead seeds _sequenceNumber with DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() so the first emitted sequence is a large epoch value, specifically to keep ordering correct across failover. The summary is therefore stale and contradicts the code. Separately, the seed reads DateTimeOffset.UtcNow directly at field initialization rather than through an injected TimeProvider (which CentralHealthAggregator already uses), making the seeding logic impossible to unit-test deterministically and dependent on node wall-clock agreement — if one node's clock lags, its post-failover reports can be silently rejected as stale by the aggregator.

Recommendation

Fix the HealthReportSender XML summary to describe the actual Unix-ms seeding strategy, and inject TimeProvider for the seed so the behaviour is testable and the clock dependency is explicit.

Resolution

Unresolved.

HealthMonitoring-007 — Heartbeats for not-yet-registered sites are silently dropped

Severity Medium
Category Correctness & logic bugs
Status Open
Location src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86-99

Description

MarkHeartbeat returns immediately if the site is not already in _siteStates ("registration only happens on report"). Central health state is in-memory only and not persisted. After a central restart or failover the aggregator starts empty, so for up to one full report interval (default 30s) every site emits only heartbeats that are all discarded — the site is reported as unknown (absent from GetAllSiteStates) rather than online, even though heartbeats prove it is reachable. This is a visible dashboard regression precisely during the failover window, which is when operators most need accurate status.

Recommendation

Allow MarkHeartbeat to register a minimal SiteHealthState (online, no LatestReport yet, with a UI-visible "awaiting first report" indication) when a heartbeat arrives for an unknown site, so reachable sites show online immediately after a central restart.

Resolution

Unresolved.

HealthMonitoring-008 — GetAllSiteStates / GetSiteState leak live mutable state objects to callers

Severity Medium
Category Concurrency & thread safety
Status Open
Location src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:104-116

Description

GetAllSiteStates copies the dictionary but the copy still holds references to the same live mutable SiteHealthState instances; GetSiteState returns the live instance directly. UI consumers (Blazor Server / SignalR circuits) read these objects on their own threads while the aggregator's background timer and report handlers concurrently mutate the very same instances (see HealthMonitoring-002). A UI render can observe a SiteHealthState with, e.g., IsOnline == true but a LatestReport from a different update, or a torn DateTimeOffset. Callers could also mutate the shared state, corrupting aggregator state.

Recommendation

Return immutable snapshots: convert SiteHealthState to a record (per HealthMonitoring-002/003) so handing out the reference is safe, or deep-copy each state into an immutable DTO before returning.

Resolution

Unresolved.

HealthMonitoring-009 — Missing test coverage for central report loop, heartbeat path, replication, and collector setters

Severity Medium
Category Testing coverage
Status Open
Location tests/ScadaLink.HealthMonitoring.Tests/

Description

Several behaviours have no automated coverage:

  • CentralHealthReportLoop — leader-only gating (SelfIsPrimary), self-report generation, sequence assignment: no test file at all.
  • CentralHealthAggregator.MarkHeartbeat — keeping a site online between reports, online recovery via heartbeat, and the unknown-site drop behaviour (HealthMonitoring-007): untested.
  • Offline detection driven by LastHeartbeatAt vs LastReportReceivedAt — the existing offline tests only advance time after a report, never exercising the heartbeat-keeps-alive path the design depends on.
  • SiteHealthCollectorSetClusterNodes, SetInstanceCounts, SetParkedMessageCount, SetNodeHostname, SetActiveNode/NodeRole, UpdateTagQuality, UpdateConnectionEndpoint: not reflected-in-report tested.
  • SiteHealthReportReplica idempotency under double delivery: untested.

Recommendation

Add tests for the central report loop (with a fake IClusterNodeProvider), the heartbeat-keeps-online and unknown-site heartbeat paths, and the remaining collector setters' presence in CollectReport output.

Resolution

Unresolved.

HealthMonitoring-010 — HealthReportSender silently swallows inner failures with bare catch {}

Severity Low
Category Error handling & resilience
Status Open
Location src/ScadaLink.HealthMonitoring/HealthReportSender.cs:70-87

Description

The cluster-nodes update and parked-message-count query are each wrapped in try { ... } catch { /* Non-fatal */ } with no logging. A persistent failure (e.g. the S&F SQLite store is permanently broken, or GetClusterNodes() always throws) is then completely invisible — every report silently ships with stale cluster nodes and a parked count of 0, with nothing in the logs to explain the wrong dashboard values. Bare catch with no exception variable also catches OperationCanceledException and would mask shutdown signalling if the awaited call observed the token.

Recommendation

Catch a specific exception type (or at least Exception ex) and LogWarning/LogDebug the failure so persistent degradation is diagnosable; avoid swallowing OperationCanceledException.

Resolution

Unresolved.

HealthMonitoring-011 — AddHealthMonitoringActors is a dead no-op placeholder

Severity Low
Category Code organization & conventions
Status Open
Location src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs:42-46

Description

AddHealthMonitoringActors does nothing but return services with a "Placeholder for Phase 4+" comment. A public extension method that silently no-ops is a trap: a caller who registers it will believe actor wiring is in place. No caller currently invokes it.

Recommendation

Remove the method until it has real behaviour, or throw NotImplementedException so accidental use fails loudly. If the actor model for this component is genuinely planned, track it in the design doc instead of a half-method.

Resolution

Unresolved.

HealthMonitoring-012 — SiteHealthState.LatestReport initialized to null!, misrepresenting the contract

Severity Low
Category Documentation & comments
Status Open
Location src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11

Description

LatestReport is declared SiteHealthReport LatestReport { get; set; } = null!;, suppressing nullability. Today every code path that creates a SiteHealthState (only ProcessReport) assigns LatestReport, so it is never actually null — but the null! declaration tells readers and the compiler the opposite of the real invariant. If HealthMonitoring-007 is addressed by registering state from a heartbeat (no report yet), this becomes a live NullReferenceException risk for UI code that dereferences LatestReport.

Recommendation

Either make LatestReport required (matching how it is genuinely always set today) or make it properly nullable SiteHealthReport? and have consumers handle the "registered, no report yet" case explicitly — consistent with whatever is decided for HealthMonitoring-007.

Resolution

Unresolved.