docs: add code review process and baseline review of all 19 modules

Establishes a per-module code review workflow under code-reviews/ and records the 2026-05-16 baseline review (commit 9c60592): 241 findings across all src/ modules (6 Critical, 46 High, 100 Medium, 89 Low). This is the clean starting point for remediation work.
2026-05-16 18:09:09 -04:00
parent 9c60592632
commit 977d7369a7
23 changed files with 8899 additions and 0 deletions
--- a/code-reviews/HealthMonitoring/findings.md
+++ b/code-reviews/HealthMonitoring/findings.md
@@ -0,0 +1,420 @@
+# Code Review — HealthMonitoring
+
+| Field | Value |
+|-------|-------|
+| Module | `src/ScadaLink.HealthMonitoring` |
+| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
+| Status | Reviewed |
+| Last reviewed | 2026-05-16 |
+| Reviewer | claude-agent |
+| Commit reviewed | `9c60592` |
+| Open findings | 12 |
+
+## Summary
+
+The HealthMonitoring module is small, readable, and broadly faithful to the design
+intent: per-interval error counters with atomic read-and-reset, monotonic sequence
+numbers with Unix-ms seeding to survive failover, sequence-guarded staleness
+rejection, and a 60s offline timeout. However, the review surfaced two recurring
+themes. First, **a documented metric is silently unimplemented** — store-and-forward
+buffer depths are never populated (`SetStoreAndForwardDepths` has zero callers and a
+test asserts the field is always empty), so the dashboard cannot show the buffer
+depth metric the design doc requires. Second, **the central aggregator's in-memory
+state model has unguarded shared mutable state**: `SiteHealthState` is a mutable
+class whose fields are written by a background timer thread, by `ProcessReport`, and
+by `MarkHeartbeat` with no synchronization, and the same live mutable objects are
+handed straight to UI callers via `GetAllSiteStates`. The `ProcessReport` logic also
+mutates shared state inside a `ConcurrentDictionary.AddOrUpdate` update delegate,
+which the runtime may invoke more than once under contention. Additionally there are
+gaps around central self-report offline detection, heartbeats for not-yet-registered
+sites being dropped, and missing test coverage for the central report loop,
+heartbeat path, and most collector setters. None of the findings are crash-class,
+but the concurrency issues are Medium/High and the missing S&F metric is a real
+design-adherence gap.
+
+## Checklist coverage
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). |
+| 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via `IHealthReportTransport`); `AddHealthMonitoringActors` is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. |
+| 3 | Concurrency & thread safety | x | Unguarded mutable `SiteHealthState` (HealthMonitoring-002); mutation inside `AddOrUpdate` delegate (HealthMonitoring-003); `GetAllSiteStates` leaks live mutable references (HealthMonitoring-008). Collector counters correctly use `Interlocked`. |
+| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010); top-level loop error handling is sound. |
+| 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. |
+| 6 | Performance & resource management | x | `PeriodicTimer` instances correctly disposed via `using`. Dictionary snapshots per report are acceptable at the documented scale. No issues found. |
+| 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). |
+| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011). |
+| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
+| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012). |
+
+## Findings
+
+### HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:104`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:79` |
+
+**Description**
+
+`Component-HealthMonitoring.md` lists "Store-and-forward buffer depth" (pending
+messages by category) as a required monitored metric. `SiteHealthCollector` exposes
+`SetStoreAndForwardDepths(...)` to receive it, but a codebase-wide search shows the
+method has **no callers** — `_sfBufferDepths` always remains the empty dictionary it
+is initialized to. `HealthReportSender` queries `GetParkedMessageCountAsync()` and
+sets `ParkedMessageCount`, but parked count is a distinct metric from per-category
+buffer depth. The test `SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder`
+even codifies the unimplemented state as expected behaviour. The result is that the
+central dashboard cannot display buffer depth, a documented triage metric.
+
+**Recommendation**
+
+Wire `SetStoreAndForwardDepths` into `HealthReportSender.ExecuteAsync` (alongside the
+existing parked-count call) using the S&F engine's per-category depth API, or, if the
+metric is intentionally deferred, record that decision in the design doc and remove
+the dead setter. Update the placeholder test accordingly once implemented.
+
+**Resolution**
+
+_Unresolved._
+
+### HealthMonitoring-002 — `SiteHealthState` mutable fields written from multiple threads without synchronization
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:137` |
+
+**Description**
+
+`SiteHealthState` is a plain mutable class. Its fields (`LatestReport`,
+`LastReportReceivedAt`, `LastHeartbeatAt`, `LastSequenceNumber`, `IsOnline`) are
+mutated from at least three concurrent contexts: `ProcessReport` (caller thread —
+ClusterClient/PubSub message handlers), `MarkHeartbeat` (caller thread — heartbeat
+handler), and `CheckForOfflineSites` (the `BackgroundService` timer thread). The
+`ConcurrentDictionary` only protects the dictionary structure, not the objects it
+stores. A heartbeat update and the offline-check can interleave on the same
+`SiteHealthState` instance, and reads/writes of `DateTimeOffset` (a 16-byte struct)
+and `long` fields are not guaranteed atomic on all platforms — producing torn reads
+and lost updates of `IsOnline`/`LastHeartbeatAt`.
+
+**Recommendation**
+
+Make state transitions atomic: either guard all reads/writes of a `SiteHealthState`
+with a per-site lock, or replace `SiteHealthState` with an immutable record updated
+via `ConcurrentDictionary` compare-and-swap (`TryUpdate`) so every transition is
+a single atomic reference swap.
+
+**Resolution**
+
+_Unresolved._
+
+### HealthMonitoring-003 — Shared state mutated inside `ConcurrentDictionary.AddOrUpdate` update delegate
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:55-78` |
+
+**Description**
+
+The update delegate passed to `AddOrUpdate` mutates the `existing` object in place
+(`existing.LatestReport = report; existing.IsOnline = true; ...`). `AddOrUpdate`'s
+contract explicitly allows the update delegate to be invoked **more than once** under
+contention (when the CAS that installs the result loses a race and is retried). Each
+invocation mutates the shared object, so a concurrent report for the same site can
+observe a half-applied update, and the multi-field assignment is not atomic with
+respect to readers in `GetAllSiteStates`/`CheckForOfflineSites`. The intended
+"only replace if sequence is higher" guard can also be subverted because the
+sequence comparison and the field writes are not a single atomic step.
+
+**Recommendation**
+
+Have the update delegate return a **new** `SiteHealthState` (record `with` copy)
+rather than mutating `existing`, and treat the dictionary value as immutable.
+Combined with HealthMonitoring-002, this makes every state transition an atomic
+reference swap with no observable intermediate state.
+
+**Resolution**
+
+_Unresolved._
+
+### HealthMonitoring-004 — Inconsistent heartbeat interval described across XML docs
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:146-148`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:21`, `src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs:16` |
+
+**Description**
+
+The heartbeat cadence that offline detection relies on is documented inconsistently.
+`CheckForOfflineSites` says "heartbeats arrive every ~5s"; `SiteHealthState.LastHeartbeatAt`
+says "~5s heartbeat"; but `ICentralHealthAggregator.MarkHeartbeat` says "~2s
+heartbeats are arriving". The actual cadence is set elsewhere (Cluster Infrastructure /
+`SiteCommunicationActor`). Readers cannot reason about whether a 60s offline timeout
+gives the intended grace without a single authoritative number.
+
+**Recommendation**
+
+Pick the correct interval (verify against the heartbeat scheduler in
+`SiteCommunicationActor`/Cluster Infrastructure) and use it consistently in all three
+comments, ideally referencing the owning component rather than restating a magic number.
+
+**Resolution**
+
+_Unresolved._
+
+### HealthMonitoring-005 — Central self-report site can flap offline; no heartbeat grace like real sites
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:48-81`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:149` |
+
+**Description**
+
+`CheckForOfflineSites` decides offline status purely from `LastHeartbeatAt`, and for
+real sites that field is kept fresh by frequent (~2-5s) heartbeats so the 60s timeout
+only fires on genuine total loss. The synthetic `central` site, however, has no
+heartbeat source — `LastHeartbeatAt` is only bumped by `ProcessReport` from the
+30s `CentralHealthReportLoop`. The loop also only runs on the cluster leader and
+silently skips a cycle on any exception. Consequently, a single skipped/late central
+self-report (leader GC pause, brief stall, mid-failover before the new leader's loop
+spins up) leaves `central` with no signal for >60s and it is marked offline even
+though the central cluster is healthy. The central card thus has no equivalent of
+the "one missed report grace" the design doc grants real sites.
+
+**Recommendation**
+
+Either feed `central` a heartbeat equivalent (e.g. have `MarkHeartbeat` called for
+`CentralSiteId` on a fast timer independent of the leader-only report loop), or apply
+a longer/distinct offline timeout to the `central` keyspace entry, and ensure the new
+leader starts the report loop promptly on failover.
+
+**Resolution**
+
+_Unresolved._
+
+### HealthMonitoring-006 — Sequence seeding contradicts the doc's "starting at 1" wording and is untestable
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:28`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:32` |
+
+**Description**
+
+The `HealthReportSender` class XML summary states "Sequence numbers are monotonic,
+starting at 1, and reset on service restart." The implementation instead seeds
+`_sequenceNumber` with `DateTimeOffset.UtcNow.ToUnixTimeMilliseconds()` so the first
+emitted sequence is a large epoch value, specifically to keep ordering correct across
+failover. The summary is therefore stale and contradicts the code. Separately, the
+seed reads `DateTimeOffset.UtcNow` directly at field initialization rather than
+through an injected `TimeProvider` (which `CentralHealthAggregator` already uses),
+making the seeding logic impossible to unit-test deterministically and dependent on
+node wall-clock agreement — if one node's clock lags, its post-failover reports can
+be silently rejected as stale by the aggregator.
+
+**Recommendation**
+
+Fix the `HealthReportSender` XML summary to describe the actual Unix-ms seeding
+strategy, and inject `TimeProvider` for the seed so the behaviour is testable and the
+clock dependency is explicit.
+
+**Resolution**
+
+_Unresolved._
+
+### HealthMonitoring-007 — Heartbeats for not-yet-registered sites are silently dropped
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86-99` |
+
+**Description**
+
+`MarkHeartbeat` returns immediately if the site is not already in `_siteStates`
+("registration only happens on report"). Central health state is in-memory only and
+not persisted. After a central restart or failover the aggregator starts empty, so
+for up to one full report interval (default 30s) every site emits only heartbeats
+that are all discarded — the site is reported as *unknown* (absent from
+`GetAllSiteStates`) rather than *online*, even though heartbeats prove it is
+reachable. This is a visible dashboard regression precisely during the failover
+window, which is when operators most need accurate status.
+
+**Recommendation**
+
+Allow `MarkHeartbeat` to register a minimal `SiteHealthState` (online, no
+`LatestReport` yet, with a UI-visible "awaiting first report" indication) when a
+heartbeat arrives for an unknown site, so reachable sites show online immediately
+after a central restart.
+
+**Resolution**
+
+_Unresolved._
+
+### HealthMonitoring-008 — `GetAllSiteStates` / `GetSiteState` leak live mutable state objects to callers
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:104-116` |
+
+**Description**
+
+`GetAllSiteStates` copies the dictionary but the copy still holds references to the
+same live mutable `SiteHealthState` instances; `GetSiteState` returns the live
+instance directly. UI consumers (Blazor Server / SignalR circuits) read these objects
+on their own threads while the aggregator's background timer and report handlers
+concurrently mutate the very same instances (see HealthMonitoring-002). A UI render
+can observe a `SiteHealthState` with, e.g., `IsOnline == true` but a `LatestReport`
+from a different update, or a torn `DateTimeOffset`. Callers could also mutate the
+shared state, corrupting aggregator state.
+
+**Recommendation**
+
+Return immutable snapshots: convert `SiteHealthState` to a record (per
+HealthMonitoring-002/003) so handing out the reference is safe, or deep-copy each
+state into an immutable DTO before returning.
+
+**Resolution**
+
+_Unresolved._
+
+### HealthMonitoring-009 — Missing test coverage for central report loop, heartbeat path, replication, and collector setters
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.HealthMonitoring.Tests/` |
+
+**Description**
+
+Several behaviours have no automated coverage:
+- `CentralHealthReportLoop` — leader-only gating (`SelfIsPrimary`), self-report
+  generation, sequence assignment: no test file at all.
+- `CentralHealthAggregator.MarkHeartbeat` — keeping a site online between reports,
+  online recovery via heartbeat, and the unknown-site drop behaviour
+  (HealthMonitoring-007): untested.
+- Offline detection driven by `LastHeartbeatAt` vs `LastReportReceivedAt` — the
+  existing offline tests only advance time after a report, never exercising the
+  heartbeat-keeps-alive path the design depends on.
+- `SiteHealthCollector` — `SetClusterNodes`, `SetInstanceCounts`, `SetParkedMessageCount`,
+  `SetNodeHostname`, `SetActiveNode`/`NodeRole`, `UpdateTagQuality`,
+  `UpdateConnectionEndpoint`: not reflected-in-report tested.
+- `SiteHealthReportReplica` idempotency under double delivery: untested.
+
+**Recommendation**
+
+Add tests for the central report loop (with a fake `IClusterNodeProvider`), the
+heartbeat-keeps-online and unknown-site heartbeat paths, and the remaining collector
+setters' presence in `CollectReport` output.
+
+**Resolution**
+
+_Unresolved._
+
+### HealthMonitoring-010 — `HealthReportSender` silently swallows inner failures with bare `catch {}`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:70-87` |
+
+**Description**
+
+The cluster-nodes update and parked-message-count query are each wrapped in
+`try { ... } catch { /* Non-fatal */ }` with no logging. A persistent failure (e.g.
+the S&F SQLite store is permanently broken, or `GetClusterNodes()` always throws)
+is then completely invisible — every report silently ships with stale cluster nodes
+and a parked count of 0, with nothing in the logs to explain the wrong dashboard
+values. Bare `catch` with no exception variable also catches `OperationCanceledException`
+and would mask shutdown signalling if the awaited call observed the token.
+
+**Recommendation**
+
+Catch a specific exception type (or at least `Exception ex`) and `LogWarning`/`LogDebug`
+the failure so persistent degradation is diagnosable; avoid swallowing
+`OperationCanceledException`.
+
+**Resolution**
+
+_Unresolved._
+
+### HealthMonitoring-011 — `AddHealthMonitoringActors` is a dead no-op placeholder
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs:42-46` |
+
+**Description**
+
+`AddHealthMonitoringActors` does nothing but `return services` with a "Placeholder for
+Phase 4+" comment. A public extension method that silently no-ops is a trap: a caller
+who registers it will believe actor wiring is in place. No caller currently invokes it.
+
+**Recommendation**
+
+Remove the method until it has real behaviour, or throw `NotImplementedException` so
+accidental use fails loudly. If the actor model for this component is genuinely
+planned, track it in the design doc instead of a half-method.
+
+**Resolution**
+
+_Unresolved._
+
+### HealthMonitoring-012 — `SiteHealthState.LatestReport` initialized to `null!`, misrepresenting the contract
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11` |
+
+**Description**
+
+`LatestReport` is declared `SiteHealthReport LatestReport { get; set; } = null!;`,
+suppressing nullability. Today every code path that creates a `SiteHealthState` (only
+`ProcessReport`) assigns `LatestReport`, so it is never actually null — but the
+`null!` declaration tells readers and the compiler the opposite of the real
+invariant. If HealthMonitoring-007 is addressed by registering state from a heartbeat
+(no report yet), this becomes a live `NullReferenceException` risk for UI code that
+dereferences `LatestReport`.
+
+**Recommendation**
+
+Either make `LatestReport` `required` (matching how it is genuinely always set today)
+or make it properly nullable `SiteHealthReport?` and have consumers handle the
+"registered, no report yet" case explicitly — consistent with whatever is decided
+for HealthMonitoring-007.
+
+**Resolution**
+
+_Unresolved._