docs: add code review process and baseline review of all 19 modules
Establishes a per-module code review workflow under code-reviews/ and
records the 2026-05-16 baseline review (commit 9c60592): 241 findings
across all src/ modules (6 Critical, 46 High, 100 Medium, 89 Low).
This is the clean starting point for remediation work.
420
code-reviews/HealthMonitoring/findings.md
Normal file
@@ -0,0 +1,420 @@
# Code Review — HealthMonitoring

| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.HealthMonitoring` |
| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 12 |

## Summary

The HealthMonitoring module is small, readable, and broadly faithful to the design
intent: per-interval error counters with atomic read-and-reset, monotonic sequence
numbers with Unix-ms seeding to survive failover, sequence-guarded staleness
rejection, and a 60s offline timeout. However, the review surfaced two recurring
themes. First, **a documented metric is silently unimplemented**: store-and-forward
buffer depths are never populated (`SetStoreAndForwardDepths` has zero callers and a
test asserts the field is always empty), so the dashboard cannot show the buffer
depth metric the design doc requires. Second, **the central aggregator's in-memory
state model has unguarded shared mutable state**: `SiteHealthState` is a mutable
class whose fields are written by a background timer thread, by `ProcessReport`, and
by `MarkHeartbeat` with no synchronization, and the same live mutable objects are
handed straight to UI callers via `GetAllSiteStates`. The `ProcessReport` logic also
mutates shared state inside a `ConcurrentDictionary.AddOrUpdate` update delegate,
which the runtime may invoke more than once under contention. Additionally there are
gaps around central self-report offline detection, heartbeats for not-yet-registered
sites being dropped, and missing test coverage for the central report loop,
heartbeat path, and most collector setters. None of the findings are crash-class,
but the concurrency issues are Medium/High and the missing S&F metric is a real
design-adherence gap.

## Checklist coverage

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). |
| 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via `IHealthReportTransport`); `AddHealthMonitoringActors` is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. |
| 3 | Concurrency & thread safety | x | Unguarded mutable `SiteHealthState` (HealthMonitoring-002); mutation inside `AddOrUpdate` delegate (HealthMonitoring-003); `GetAllSiteStates` leaks live mutable references (HealthMonitoring-008). Collector counters correctly use `Interlocked`. |
| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010); top-level loop error handling is sound. |
| 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. |
| 6 | Performance & resource management | x | `PeriodicTimer` instances correctly disposed via `using`. Dictionary snapshots per report are acceptable at the documented scale. No issues found. |
| 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). |
| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011). |
| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012). |

## Findings

### HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated

| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:104`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:79` |

**Description**

`Component-HealthMonitoring.md` lists "Store-and-forward buffer depth" (pending
messages by category) as a required monitored metric. `SiteHealthCollector` exposes
`SetStoreAndForwardDepths(...)` to receive it, but a codebase-wide search shows the
method has **no callers**, so `_sfBufferDepths` always remains the empty dictionary it
is initialized to. `HealthReportSender` queries `GetParkedMessageCountAsync()` and
sets `ParkedMessageCount`, but parked count is a distinct metric from per-category
buffer depth. The test `SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder`
even codifies the unimplemented state as expected behaviour. The result is that the
central dashboard cannot display buffer depth, a documented triage metric.

**Recommendation**

Wire `SetStoreAndForwardDepths` into `HealthReportSender.ExecuteAsync` (alongside the
existing parked-count call) using the S&F engine's per-category depth API, or, if the
metric is intentionally deferred, record that decision in the design doc and remove
the dead setter. Update the placeholder test accordingly once implemented.

**Resolution**

_Unresolved._

### HealthMonitoring-002 — `SiteHealthState` mutable fields written from multiple threads without synchronization

| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:137` |

**Description**

`SiteHealthState` is a plain mutable class. Its fields (`LatestReport`,
`LastReportReceivedAt`, `LastHeartbeatAt`, `LastSequenceNumber`, `IsOnline`) are
mutated from at least three concurrent contexts: `ProcessReport` (caller thread:
ClusterClient/PubSub message handlers), `MarkHeartbeat` (caller thread: heartbeat
handler), and `CheckForOfflineSites` (the `BackgroundService` timer thread). The
`ConcurrentDictionary` only protects the dictionary structure, not the objects it
stores. A heartbeat update and the offline-check can interleave on the same
`SiteHealthState` instance, and reads/writes of `DateTimeOffset` (a 16-byte struct)
and `long` fields are not guaranteed atomic on all platforms, producing torn reads
and lost updates of `IsOnline`/`LastHeartbeatAt`.

**Recommendation**

Make state transitions atomic: either guard all reads/writes of a `SiteHealthState`
with a per-site lock, or replace `SiteHealthState` with an immutable record updated
via `ConcurrentDictionary` compare-and-swap (`TryUpdate`) so every transition is
a single atomic reference swap.

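
The immutable-record option could look roughly like the sketch below. It is illustrative only: `SiteHealthSnapshot`, `SiteStateStore`, and `Transition` are hypothetical names, not types from the module, and the record carries only a subset of the real fields. Each transition builds a new record and installs it with `TryUpdate`, retrying if another thread got there first.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;

// Hypothetical immutable replacement for the mutable SiteHealthState class.
public sealed record SiteHealthSnapshot(
    string SiteId,
    long LastSequenceNumber,
    DateTimeOffset LastHeartbeatAt,
    bool IsOnline);

public sealed class SiteStateStore
{
    private readonly ConcurrentDictionary<string, SiteHealthSnapshot> _states = new();

    public void Register(SiteHealthSnapshot initial) => _states[initial.SiteId] = initial;

    public SiteHealthSnapshot? Get(string siteId) =>
        _states.TryGetValue(siteId, out var state) ? state : null;

    // Applies a pure transition and retries when another thread replaced the value
    // first. TryUpdate compares with the value type's default equality, which for
    // records is value equality. Returns false when the site is not registered.
    public bool Transition(string siteId, Func<SiteHealthSnapshot, SiteHealthSnapshot> next)
    {
        while (_states.TryGetValue(siteId, out var current))
        {
            var updated = next(current);
            if (_states.TryUpdate(siteId, updated, current))
                return true;
        }
        return false;
    }

    public bool MarkHeartbeat(string siteId, DateTimeOffset now) =>
        Transition(siteId, s => s with { LastHeartbeatAt = now, IsOnline = true });

    public bool MarkOfflineIfStale(string siteId, DateTimeOffset now, TimeSpan timeout) =>
        Transition(siteId, s =>
            now - s.LastHeartbeatAt > timeout ? s with { IsOnline = false } : s);
}
```

Because readers only ever see a fully constructed record, a heartbeat and the offline check can no longer interleave field by field on the same instance.
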

**Resolution**

_Unresolved._

### HealthMonitoring-003 — Shared state mutated inside `ConcurrentDictionary.AddOrUpdate` update delegate

| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:55-78` |

**Description**

The update delegate passed to `AddOrUpdate` mutates the `existing` object in place
(`existing.LatestReport = report; existing.IsOnline = true; ...`). `AddOrUpdate`'s
contract explicitly allows the update delegate to be invoked **more than once** under
contention (when the compare-and-update that installs the result loses a race and is
retried), and the delegate runs outside the dictionary's internal lock. Each
invocation mutates the shared object, so a concurrent report for the same site can
observe a half-applied update, and the multi-field assignment is not atomic with
respect to readers in `GetAllSiteStates`/`CheckForOfflineSites`. The intended
"only replace if sequence is higher" guard can also be subverted because the
sequence comparison and the field writes are not a single atomic step.

**Recommendation**

Have the update delegate return a **new** `SiteHealthState` (record `with` copy)
rather than mutating `existing`, and treat the dictionary value as immutable.
Combined with HealthMonitoring-002, this makes every state transition an atomic
reference swap with no observable intermediate state.

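
A minimal sketch of a side-effect-free update delegate follows, assuming the state type has been converted to a record. `HealthReport`, `SiteState`, and `ReportAggregator` are simplified, hypothetical stand-ins for the real types; only `ConcurrentDictionary.AddOrUpdate` is the actual library API.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;

// Hypothetical, simplified stand-ins for the real report and state types.
public sealed record HealthReport(string SiteId, long SequenceNumber);

public sealed record SiteState(
    HealthReport LatestReport,
    long LastSequenceNumber,
    DateTimeOffset LastReportReceivedAt,
    bool IsOnline);

public sealed class ReportAggregator
{
    private readonly ConcurrentDictionary<string, SiteState> _siteStates = new();

    public void ProcessReport(HealthReport report, DateTimeOffset now)
    {
        _siteStates.AddOrUpdate(
            report.SiteId,
            // Add path: first report seen for this site.
            _ => new SiteState(report, report.SequenceNumber, now, true),
            // Update path: a pure function of (existing, report). It may run more
            // than once under contention, which is harmless because it mutates nothing.
            (_, existing) => report.SequenceNumber > existing.LastSequenceNumber
                ? existing with
                {
                    LatestReport = report,
                    LastSequenceNumber = report.SequenceNumber,
                    LastReportReceivedAt = now,
                    IsOnline = true,
                }
                : existing); // stale or duplicate report: keep the current state
    }

    public SiteState? GetSiteState(string siteId) =>
        _siteStates.TryGetValue(siteId, out var state) ? state : null;
}
```

With this shape the sequence comparison and the resulting state are computed together from one snapshot of `existing`, and a retry simply recomputes from the newer snapshot.
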

**Resolution**

_Unresolved._

### HealthMonitoring-004 — Inconsistent heartbeat interval described across XML docs

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:146-148`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:21`, `src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs:16` |

**Description**

The heartbeat cadence that offline detection relies on is documented inconsistently.
`CheckForOfflineSites` says "heartbeats arrive every ~5s"; `SiteHealthState.LastHeartbeatAt`
says "~5s heartbeat"; but `ICentralHealthAggregator.MarkHeartbeat` says "~2s
heartbeats are arriving". The actual cadence is set elsewhere (Cluster Infrastructure /
`SiteCommunicationActor`). Readers cannot reason about whether a 60s offline timeout
gives the intended grace without a single authoritative number.

**Recommendation**

Pick the correct interval (verify against the heartbeat scheduler in
`SiteCommunicationActor`/Cluster Infrastructure) and use it consistently in all three
comments, ideally referencing the owning component rather than restating a magic number.

**Resolution**

_Unresolved._

### HealthMonitoring-005 — Central self-report site can flap offline; no heartbeat grace like real sites

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:48-81`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:149` |

**Description**

`CheckForOfflineSites` decides offline status purely from `LastHeartbeatAt`, and for
real sites that field is kept fresh by frequent (~2-5s) heartbeats so the 60s timeout
only fires on genuine total loss. The synthetic `central` site, however, has no
heartbeat source: `LastHeartbeatAt` is only bumped by `ProcessReport` from the
30s `CentralHealthReportLoop`. The loop also only runs on the cluster leader and
silently skips a cycle on any exception. Consequently, a single skipped or late
central self-report (leader GC pause, brief stall, mid-failover before the new
leader's loop spins up) is enough to trip the timeout: with reports 30s apart, one
missed cycle already leaves a gap of about 60s, and any further delay pushes
`central` past the 60s limit, so it is marked offline even though the central
cluster is healthy. The central card thus has no equivalent of the "one missed
report grace" the design doc grants real sites.

**Recommendation**

Either feed `central` a heartbeat equivalent (e.g. have `MarkHeartbeat` called for
`CentralSiteId` on a fast timer independent of the leader-only report loop), or apply
a longer/distinct offline timeout to the `central` entry, and ensure the new
leader starts the report loop promptly on failover.

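
The distinct-timeout option can be sketched as below. Only the 60s site timeout and the 30s central report interval come from the reviewed code; the `OfflinePolicy` class, the tolerance of two missed reports, and the half-interval jitter margin are illustrative assumptions, as is the literal value of `CentralSiteId`.

```csharp
using System;

public static class OfflinePolicy
{
    public const string CentralSiteId = "central"; // assumed literal value

    public static readonly TimeSpan SiteTimeout = TimeSpan.FromSeconds(60);
    public static readonly TimeSpan CentralReportInterval = TimeSpan.FromSeconds(30);

    // Timeout that tolerates `missedReports` consecutive skipped cycles:
    // the gap after k missed reports is (k + 1) intervals, plus half an
    // interval of slack for scheduling jitter (an assumed margin).
    public static TimeSpan CentralTimeout(int missedReports) =>
        CentralReportInterval * (missedReports + 1) + CentralReportInterval / 2;

    public static TimeSpan TimeoutFor(string siteId) =>
        siteId == CentralSiteId ? CentralTimeout(missedReports: 2) : SiteTimeout;

    public static bool IsOffline(string siteId, DateTimeOffset lastSignal, DateTimeOffset now) =>
        now - lastSignal > TimeoutFor(siteId);
}
```

Under these assumptions `central` gets a 105s timeout (3 x 30s + 15s), so one or two skipped self-reports no longer flip the card offline, while real sites keep the 60s limit.
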

**Resolution**

_Unresolved._

### HealthMonitoring-006 — Sequence seeding contradicts the doc's "starting at 1" wording and is untestable

| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:28`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:32` |

**Description**

The `HealthReportSender` class XML summary states "Sequence numbers are monotonic,
starting at 1, and reset on service restart." The implementation instead seeds
`_sequenceNumber` with `DateTimeOffset.UtcNow.ToUnixTimeMilliseconds()` so the first
emitted sequence is a large epoch value, specifically to keep ordering correct across
failover. The summary is therefore stale and contradicts the code. Separately, the
seed reads `DateTimeOffset.UtcNow` directly at field initialization rather than
through an injected `TimeProvider` (which `CentralHealthAggregator` already uses),
making the seeding logic impossible to unit-test deterministically and dependent on
node wall-clock agreement: if one node's clock lags, its post-failover reports can
be silently rejected as stale by the aggregator.

**Recommendation**

Fix the `HealthReportSender` XML summary to describe the actual Unix-ms seeding
strategy, and inject `TimeProvider` for the seed so the behaviour is testable and the
clock dependency is explicit.

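
One way to make the seed injectable is sketched below, assuming .NET 8's `System.TimeProvider`. `SequenceGenerator` and `FixedTimeProvider` are hypothetical names, and whether the real sender increments before or after emitting is not known from the review, so the first value here (seed + 1) is an assumption.

```csharp
using System;
using System.Threading;

// Hypothetical helper that owns the sequence counter.
public sealed class SequenceGenerator
{
    private long _sequenceNumber;

    public SequenceGenerator(TimeProvider timeProvider)
    {
        // Seed from the injected clock instead of DateTimeOffset.UtcNow.
        _sequenceNumber = timeProvider.GetUtcNow().ToUnixTimeMilliseconds();
    }

    // Thread-safe, strictly increasing within one process lifetime.
    public long Next() => Interlocked.Increment(ref _sequenceNumber);
}

// Minimal fixed clock for tests; overrides only what the generator reads.
public sealed class FixedTimeProvider : TimeProvider
{
    private readonly DateTimeOffset _now;

    public FixedTimeProvider(DateTimeOffset now) => _now = now;

    public override DateTimeOffset GetUtcNow() => _now;
}
```

Production code would pass `TimeProvider.System`; a test can pin the clock and assert the exact first sequence number, and can also simulate a lagging node clock to exercise the stale-rejection case described above.
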

**Resolution**

_Unresolved._

### HealthMonitoring-007 — Heartbeats for not-yet-registered sites are silently dropped

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86-99` |

**Description**

`MarkHeartbeat` returns immediately if the site is not already in `_siteStates`
("registration only happens on report"). Central health state is in-memory only and
not persisted. After a central restart or failover the aggregator starts empty, so
for up to one full report interval (default 30s) every site emits only heartbeats
that are all discarded, and the site is reported as *unknown* (absent from
`GetAllSiteStates`) rather than *online*, even though heartbeats prove it is
reachable. This is a visible dashboard regression precisely during the failover
window, which is when operators most need accurate status.

**Recommendation**

Allow `MarkHeartbeat` to register a minimal `SiteHealthState` (online, no
`LatestReport` yet, with a UI-visible "awaiting first report" indication) when a
heartbeat arrives for an unknown site, so reachable sites show online immediately
after a central restart.

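
A rough sketch of heartbeat-driven registration follows. It assumes the state type becomes a record with a nullable `LatestReport` (see HealthMonitoring-012); `HeartbeatRegistry`, `SiteState`, and `AwaitingFirstReport` are hypothetical names chosen for illustration.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;

// Hypothetical, simplified stand-ins for the real report and state types.
public sealed record HealthReport(string SiteId, long SequenceNumber);

public sealed record SiteState(
    HealthReport? LatestReport,
    DateTimeOffset LastHeartbeatAt,
    bool IsOnline)
{
    // Lets the UI render an "awaiting first report" badge instead of dereferencing null.
    public bool AwaitingFirstReport => LatestReport is null;
}

public sealed class HeartbeatRegistry
{
    private readonly ConcurrentDictionary<string, SiteState> _siteStates = new();

    public void MarkHeartbeat(string siteId, DateTimeOffset now) =>
        _siteStates.AddOrUpdate(
            siteId,
            // Unknown site: register it as online with no report yet.
            _ => new SiteState(null, now, true),
            // Known site: refresh the heartbeat and bring it back online.
            (_, existing) => existing with { LastHeartbeatAt = now, IsOnline = true });

    public SiteState? GetSiteState(string siteId) =>
        _siteStates.TryGetValue(siteId, out var state) ? state : null;
}
```
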

**Resolution**

_Unresolved._

### HealthMonitoring-008 — `GetAllSiteStates` / `GetSiteState` leak live mutable state objects to callers

| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:104-116` |

**Description**

`GetAllSiteStates` copies the dictionary but the copy still holds references to the
same live mutable `SiteHealthState` instances; `GetSiteState` returns the live
instance directly. UI consumers (Blazor Server / SignalR circuits) read these objects
on their own threads while the aggregator's background timer and report handlers
concurrently mutate the very same instances (see HealthMonitoring-002). A UI render
can observe a `SiteHealthState` with, e.g., `IsOnline == true` but a `LatestReport`
from a different update, or a torn `DateTimeOffset`. Callers could also mutate the
shared state, corrupting aggregator state.

**Recommendation**

Return immutable snapshots: convert `SiteHealthState` to a record (per
HealthMonitoring-002/003) so handing out the reference is safe, or deep-copy each
state into an immutable DTO before returning.

**Resolution**

_Unresolved._

### HealthMonitoring-009 — Missing test coverage for central report loop, heartbeat path, replication, and collector setters

| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.HealthMonitoring.Tests/` |

**Description**

Several behaviours have no automated coverage:
- `CentralHealthReportLoop`: leader-only gating (`SelfIsPrimary`), self-report
  generation, and sequence assignment have no test file at all.
- `CentralHealthAggregator.MarkHeartbeat`: keeping a site online between reports,
  online recovery via heartbeat, and the unknown-site drop behaviour
  (HealthMonitoring-007) are untested.
- Offline detection driven by `LastHeartbeatAt` vs `LastReportReceivedAt`: the
  existing offline tests only advance time after a report, never exercising the
  heartbeat-keeps-alive path the design depends on.
- `SiteHealthCollector`: no test asserts that values set via `SetClusterNodes`,
  `SetInstanceCounts`, `SetParkedMessageCount`, `SetNodeHostname`,
  `SetActiveNode`/`NodeRole`, `UpdateTagQuality`, or `UpdateConnectionEndpoint`
  appear in the collected report.
- `SiteHealthReportReplica`: idempotency under double delivery is untested.

**Recommendation**

Add tests for the central report loop (with a fake `IClusterNodeProvider`), the
heartbeat-keeps-online and unknown-site heartbeat paths, and the remaining collector
setters' presence in `CollectReport` output.

**Resolution**

_Unresolved._

### HealthMonitoring-010 — `HealthReportSender` silently swallows inner failures with bare `catch {}`

| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:70-87` |

**Description**

The cluster-nodes update and parked-message-count query are each wrapped in
`try { ... } catch { /* Non-fatal */ }` with no logging. A persistent failure (e.g.
the S&F SQLite store is permanently broken, or `GetClusterNodes()` always throws)
is then completely invisible: every report silently ships with stale cluster nodes
and a parked count of 0, with nothing in the logs to explain the wrong dashboard
values. Bare `catch` with no exception variable also catches `OperationCanceledException`
and would mask shutdown signalling if the awaited call observed the token.

**Recommendation**

Catch a specific exception type (or at least `Exception ex`) and `LogWarning`/`LogDebug`
the failure so persistent degradation is diagnosable; avoid swallowing
`OperationCanceledException`.

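
The pattern might be sketched as a small helper like the one below. `NonFatal.TryAsync` is a hypothetical name, and the `Action<Exception, string>` parameter stands in for an `ILogger.LogWarning` call so the example stays free of package dependencies.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class NonFatal
{
    // Runs a best-effort step. Failures are logged and replaced by `fallback`;
    // cancellation is deliberately not caught, so shutdown still propagates.
    public static async Task<T> TryAsync<T>(
        Func<CancellationToken, Task<T>> step,
        T fallback,
        Action<Exception, string> logWarning, // stand-in for ILogger.LogWarning
        string description,
        CancellationToken cancellationToken)
    {
        try
        {
            return await step(cancellationToken);
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            logWarning(ex, $"Health report step failed: {description}");
            return fallback;
        }
    }
}
```

The exception filter keeps the non-fatal behaviour for ordinary failures while leaving `OperationCanceledException` (and its subclass `TaskCanceledException`) to bubble up to the top-level loop.
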

**Resolution**

_Unresolved._

### HealthMonitoring-011 — `AddHealthMonitoringActors` is a dead no-op placeholder

| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs:42-46` |

**Description**

`AddHealthMonitoringActors` does nothing but `return services` with a "Placeholder for
Phase 4+" comment. A public extension method that silently no-ops is a trap: a caller
who registers it will believe actor wiring is in place. No caller currently invokes it.

**Recommendation**

Remove the method until it has real behaviour, or throw `NotImplementedException` so
accidental use fails loudly. If the actor model for this component is genuinely
planned, track it in the design doc instead of a half-method.

**Resolution**

_Unresolved._

### HealthMonitoring-012 — `SiteHealthState.LatestReport` initialized to `null!`, misrepresenting the contract

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11` |

**Description**

`LatestReport` is declared `SiteHealthReport LatestReport { get; set; } = null!;`,
suppressing the nullability warning. Today every code path that creates a
`SiteHealthState` (only `ProcessReport`) assigns `LatestReport`, so it is never
actually null, but the `null!` initializer silences the compiler instead of
enforcing that invariant: the property starts out null and nothing stops a future
code path from leaving it that way. If HealthMonitoring-007 is addressed by
registering state from a heartbeat (no report yet), this becomes a live
`NullReferenceException` risk for UI code that dereferences `LatestReport`.

**Recommendation**

Either make `LatestReport` `required` (matching how it is genuinely always set today)
or make it properly nullable `SiteHealthReport?` and have consumers handle the
"registered, no report yet" case explicitly, consistent with whatever is decided
for HealthMonitoring-007.

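
The two options can be compared side by side in the sketch below. The class names are hypothetical and the report type is reduced to a single field; `required` needs C# 11 or later.

```csharp
#nullable enable

// Hypothetical, simplified stand-in for the real report type.
public sealed record SiteHealthReport(string SiteId);

// Option 1: `required` makes the compiler reject any object initializer
// that omits LatestReport, so the "always set" invariant is enforced.
public sealed class SiteHealthStateRequired
{
    public required SiteHealthReport LatestReport { get; set; }
}

// Option 2: an honest nullable type, for the case where a heartbeat may
// register a site before its first report arrives (HealthMonitoring-007).
public sealed class SiteHealthStateNullable
{
    public SiteHealthReport? LatestReport { get; set; }

    public bool AwaitingFirstReport => LatestReport is null;
}
```

With option 1, `new SiteHealthStateRequired()` without an initializer is a compile-time error; with option 2, consumers are steered by nullable analysis to check `LatestReport` before dereferencing it.
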

**Resolution**

_Unresolved._