# Code Review — HealthMonitoring

| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.HealthMonitoring` |
| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 12 |

## Summary

The HealthMonitoring module is small, readable, and broadly faithful to the design
intent: per-interval error counters with atomic read-and-reset, monotonic sequence
numbers with Unix-ms seeding to survive failover, sequence-guarded staleness
rejection, and a 60s offline timeout. However, the review surfaced two recurring
themes. First, **a documented metric is silently unimplemented** — store-and-forward
buffer depths are never populated (`SetStoreAndForwardDepths` has zero callers and a
test asserts the field is always empty), so the dashboard cannot show the buffer
depth metric the design doc requires. Second, **the central aggregator's in-memory
state model has unguarded shared mutable state**: `SiteHealthState` is a mutable
class whose fields are written by a background timer thread, by `ProcessReport`, and
by `MarkHeartbeat` with no synchronization, and the same live mutable objects are
handed straight to UI callers via `GetAllSiteStates`. The `ProcessReport` logic also
mutates shared state inside a `ConcurrentDictionary.AddOrUpdate` update delegate,
which the runtime may invoke more than once under contention. Additionally there are
gaps around central self-report offline detection, heartbeats for not-yet-registered
sites being dropped, and missing test coverage for the central report loop,
heartbeat path, and most collector setters. None of the findings are crash-class,
but the concurrency issues are Medium/High and the missing S&F metric is a real
design-adherence gap.

## Checklist coverage

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). |
| 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via `IHealthReportTransport`); `AddHealthMonitoringActors` is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. |
| 3 | Concurrency & thread safety | x | Unguarded mutable `SiteHealthState` (HealthMonitoring-002); mutation inside `AddOrUpdate` delegate (HealthMonitoring-003); `GetAllSiteStates` leaks live mutable references (HealthMonitoring-008). Collector counters correctly use `Interlocked`. |
| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010); top-level loop error handling is sound. |
| 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. |
| 6 | Performance & resource management | x | `PeriodicTimer` instances correctly disposed via `using`. Dictionary snapshots per report are acceptable at the documented scale. No issues found. |
| 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). |
| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011). |
| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012). |

## Findings

### HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated

| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:104`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:79` |

**Description**

`Component-HealthMonitoring.md` lists "Store-and-forward buffer depth" (pending
messages by category) as a required monitored metric. `SiteHealthCollector` exposes
`SetStoreAndForwardDepths(...)` to receive it, but a codebase-wide search shows the
method has **no callers** — `_sfBufferDepths` always remains the empty dictionary it
is initialized to. `HealthReportSender` queries `GetParkedMessageCountAsync()` and
sets `ParkedMessageCount`, but parked count is a distinct metric from per-category
buffer depth. The test `SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder`
even codifies the unimplemented state as expected behaviour. The result is that the
central dashboard cannot display buffer depth, a documented triage metric.

**Recommendation**

Wire `SetStoreAndForwardDepths` into `HealthReportSender.ExecuteAsync` (alongside the
existing parked-count call) using the S&F engine's per-category depth API, or, if the
metric is intentionally deferred, record that decision in the design doc and remove
the dead setter. Update the placeholder test accordingly once implemented.

**Resolution**

_Unresolved._

### HealthMonitoring-002 — `SiteHealthState` mutable fields written from multiple threads without synchronization

| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:137` |

**Description**

`SiteHealthState` is a plain mutable class. Its fields (`LatestReport`,
`LastReportReceivedAt`, `LastHeartbeatAt`, `LastSequenceNumber`, `IsOnline`) are
mutated from at least three concurrent contexts: `ProcessReport` (caller thread —
ClusterClient/PubSub message handlers), `MarkHeartbeat` (caller thread — heartbeat
handler), and `CheckForOfflineSites` (the `BackgroundService` timer thread). The
`ConcurrentDictionary` only protects the dictionary structure, not the objects it
stores. A heartbeat update and the offline-check can interleave on the same
`SiteHealthState` instance, and reads/writes of `DateTimeOffset` (a 16-byte struct)
and `long` fields are not guaranteed atomic on all platforms — producing torn reads
and lost updates of `IsOnline`/`LastHeartbeatAt`.

**Recommendation**

Make state transitions atomic: either guard all reads/writes of a `SiteHealthState`
with a per-site lock, or replace `SiteHealthState` with an immutable record updated
via `ConcurrentDictionary` compare-and-swap (`TryUpdate`) so every transition is
a single atomic reference swap.

**Resolution**

_Unresolved._

### HealthMonitoring-003 — Shared state mutated inside `ConcurrentDictionary.AddOrUpdate` update delegate

| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:55-78` |

**Description**

The update delegate passed to `AddOrUpdate` mutates the `existing` object in place
(`existing.LatestReport = report; existing.IsOnline = true; ...`). `AddOrUpdate`'s
contract explicitly allows the update delegate to be invoked **more than once** under
contention (when the CAS that installs the result loses a race and is retried). Each
invocation mutates the shared object, so a concurrent report for the same site can
observe a half-applied update, and the multi-field assignment is not atomic with
respect to readers in `GetAllSiteStates`/`CheckForOfflineSites`. The intended
"only replace if sequence is higher" guard can also be subverted because the
sequence comparison and the field writes are not a single atomic step.

**Recommendation**

Have the update delegate return a **new** `SiteHealthState` (record `with` copy)
rather than mutating `existing`, and treat the dictionary value as immutable.
Combined with HealthMonitoring-002, this makes every state transition an atomic
reference swap with no observable intermediate state.

**Resolution**

_Unresolved._

### HealthMonitoring-004 — Inconsistent heartbeat interval described across XML docs

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:146-148`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:21`, `src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs:16` |

**Description**

The heartbeat cadence that offline detection relies on is documented inconsistently.
`CheckForOfflineSites` says "heartbeats arrive every ~5s"; `SiteHealthState.LastHeartbeatAt`
says "~5s heartbeat"; but `ICentralHealthAggregator.MarkHeartbeat` says "~2s
heartbeats are arriving". The actual cadence is set elsewhere (Cluster Infrastructure /
`SiteCommunicationActor`). Readers cannot reason about whether a 60s offline timeout
gives the intended grace without a single authoritative number.

**Recommendation**

Pick the correct interval (verify against the heartbeat scheduler in
`SiteCommunicationActor`/Cluster Infrastructure) and use it consistently in all three
comments, ideally referencing the owning component rather than restating a magic number.

**Resolution**

_Unresolved._

### HealthMonitoring-005 — Central self-report site can flap offline; no heartbeat grace like real sites

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:48-81`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:149` |

**Description**

`CheckForOfflineSites` decides offline status purely from `LastHeartbeatAt`, and for
real sites that field is kept fresh by frequent (~2-5s) heartbeats so the 60s timeout
only fires on genuine total loss. The synthetic `central` site, however, has no
heartbeat source — `LastHeartbeatAt` is only bumped by `ProcessReport` from the
30s `CentralHealthReportLoop`. The loop also only runs on the cluster leader and
silently skips a cycle on any exception. Consequently, a single skipped/late central
self-report (leader GC pause, brief stall, mid-failover before the new leader's loop
spins up) leaves `central` with no signal for >60s and it is marked offline even
though the central cluster is healthy. The central card thus has no equivalent of
the "one missed report grace" the design doc grants real sites.

**Recommendation**

Either feed `central` a heartbeat equivalent (e.g. have `MarkHeartbeat` called for
`CentralSiteId` on a fast timer independent of the leader-only report loop), or apply
a longer/distinct offline timeout to the `central` keyspace entry, and ensure the new
leader starts the report loop promptly on failover.

**Resolution**

_Unresolved._

### HealthMonitoring-006 — Sequence seeding contradicts the doc's "starting at 1" wording and is untestable

| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:28`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:32` |

**Description**

The `HealthReportSender` class XML summary states "Sequence numbers are monotonic,
starting at 1, and reset on service restart." The implementation instead seeds
`_sequenceNumber` with `DateTimeOffset.UtcNow.ToUnixTimeMilliseconds()` so the first
emitted sequence is a large epoch value, specifically to keep ordering correct across
failover. The summary is therefore stale and contradicts the code. Separately, the
seed reads `DateTimeOffset.UtcNow` directly at field initialization rather than
through an injected `TimeProvider` (which `CentralHealthAggregator` already uses),
making the seeding logic impossible to unit-test deterministically and dependent on
node wall-clock agreement — if one node's clock lags, its post-failover reports can
be silently rejected as stale by the aggregator.

**Recommendation**

Fix the `HealthReportSender` XML summary to describe the actual Unix-ms seeding
strategy, and inject `TimeProvider` for the seed so the behaviour is testable and the
clock dependency is explicit.

**Resolution**

_Unresolved._

### HealthMonitoring-007 — Heartbeats for not-yet-registered sites are silently dropped

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86-99` |

**Description**

`MarkHeartbeat` returns immediately if the site is not already in `_siteStates`
("registration only happens on report"). Central health state is in-memory only and
not persisted. After a central restart or failover the aggregator starts empty, so
for up to one full report interval (default 30s) every site emits only heartbeats
that are all discarded — the site is reported as *unknown* (absent from
`GetAllSiteStates`) rather than *online*, even though heartbeats prove it is
reachable. This is a visible dashboard regression precisely during the failover
window, which is when operators most need accurate status.

**Recommendation**

Allow `MarkHeartbeat` to register a minimal `SiteHealthState` (online, no
`LatestReport` yet, with a UI-visible "awaiting first report" indication) when a
heartbeat arrives for an unknown site, so reachable sites show online immediately
after a central restart.

**Resolution**

_Unresolved._

### HealthMonitoring-008 — `GetAllSiteStates` / `GetSiteState` leak live mutable state objects to callers

| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:104-116` |

**Description**

`GetAllSiteStates` copies the dictionary but the copy still holds references to the
same live mutable `SiteHealthState` instances; `GetSiteState` returns the live
instance directly. UI consumers (Blazor Server / SignalR circuits) read these objects
on their own threads while the aggregator's background timer and report handlers
concurrently mutate the very same instances (see HealthMonitoring-002). A UI render
can observe a `SiteHealthState` with, e.g., `IsOnline == true` but a `LatestReport`
from a different update, or a torn `DateTimeOffset`. Callers could also mutate the
shared state, corrupting aggregator state.

**Recommendation**

Return immutable snapshots: convert `SiteHealthState` to a record (per
HealthMonitoring-002/003) so handing out the reference is safe, or deep-copy each
state into an immutable DTO before returning.

**Resolution**

_Unresolved._

### HealthMonitoring-009 — Missing test coverage for central report loop, heartbeat path, replication, and collector setters

| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.HealthMonitoring.Tests/` |

**Description**

Several behaviours have no automated coverage:
- `CentralHealthReportLoop` — leader-only gating (`SelfIsPrimary`), self-report
  generation, sequence assignment: no test file at all.
- `CentralHealthAggregator.MarkHeartbeat` — keeping a site online between reports,
  online recovery via heartbeat, and the unknown-site drop behaviour
  (HealthMonitoring-007): untested.
- Offline detection driven by `LastHeartbeatAt` vs `LastReportReceivedAt` — the
  existing offline tests only advance time after a report, never exercising the
  heartbeat-keeps-alive path the design depends on.
- `SiteHealthCollector` — `SetClusterNodes`, `SetInstanceCounts`, `SetParkedMessageCount`,
  `SetNodeHostname`, `SetActiveNode`/`NodeRole`, `UpdateTagQuality`,
  `UpdateConnectionEndpoint`: not reflected-in-report tested.
- `SiteHealthReportReplica` idempotency under double delivery: untested.

**Recommendation**

Add tests for the central report loop (with a fake `IClusterNodeProvider`), the
heartbeat-keeps-online and unknown-site heartbeat paths, and the remaining collector
setters' presence in `CollectReport` output.

**Resolution**

_Unresolved._

### HealthMonitoring-010 — `HealthReportSender` silently swallows inner failures with bare `catch {}`

| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:70-87` |

**Description**

The cluster-nodes update and parked-message-count query are each wrapped in
`try { ... } catch { /* Non-fatal */ }` with no logging. A persistent failure (e.g.
the S&F SQLite store is permanently broken, or `GetClusterNodes()` always throws)
is then completely invisible — every report silently ships with stale cluster nodes
and a parked count of 0, with nothing in the logs to explain the wrong dashboard
values. Bare `catch` with no exception variable also catches `OperationCanceledException`
and would mask shutdown signalling if the awaited call observed the token.

**Recommendation**

Catch a specific exception type (or at least `Exception ex`) and `LogWarning`/`LogDebug`
the failure so persistent degradation is diagnosable; avoid swallowing
`OperationCanceledException`.

**Resolution**

_Unresolved._

### HealthMonitoring-011 — `AddHealthMonitoringActors` is a dead no-op placeholder

| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs:42-46` |

**Description**

`AddHealthMonitoringActors` does nothing but `return services` with a "Placeholder for
Phase 4+" comment. A public extension method that silently no-ops is a trap: a caller
who registers it will believe actor wiring is in place. No caller currently invokes it.

**Recommendation**

Remove the method until it has real behaviour, or throw `NotImplementedException` so
accidental use fails loudly. If the actor model for this component is genuinely
planned, track it in the design doc instead of a half-method.

**Resolution**

_Unresolved._

### HealthMonitoring-012 — `SiteHealthState.LatestReport` initialized to `null!`, misrepresenting the contract

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11` |

**Description**

`LatestReport` is declared `SiteHealthReport LatestReport { get; set; } = null!;`,
suppressing nullability. Today every code path that creates a `SiteHealthState` (only
`ProcessReport`) assigns `LatestReport`, so it is never actually null — but the
`null!` declaration tells readers and the compiler the opposite of the real
invariant. If HealthMonitoring-007 is addressed by registering state from a heartbeat
(no report yet), this becomes a live `NullReferenceException` risk for UI code that
dereferences `LatestReport`.

**Recommendation**

Either make `LatestReport` `required` (matching how it is genuinely always set today)
or make it properly nullable `SiteHealthReport?` and have consumers handle the
"registered, no report yet" case explicitly — consistent with whatever is decided
for HealthMonitoring-007.

**Resolution**

_Unresolved._