code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including first-time reviews of the four newer components (AuditLog, NotificationOutbox, SiteCallAudit, Transport) — so the code-reviews/ index reflects today's codebase rather than the 2026-05-16 baseline. 172 new Open findings (0 Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules. regen-readme.py now derives each module's Last reviewed + Commit from its findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future single-module re-reviews show their own date in the Module Status table.
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.HealthMonitoring` |
 | Design doc | `docs/requirements/Component-HealthMonitoring.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 7 |

 ## Summary

@@ -51,6 +51,35 @@ HealthMonitoring + CentralUI change), and `CollectReport` reading
 `TimeProvider` (HealthMonitoring-016). The module remains small, readable, and
 broadly faithful to the design intent.

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+All sixteen prior findings (HealthMonitoring-001..016) remain `Resolved`. This
+baseline re-review applied the full 10-category checklist and produced **7 new
+findings** (2 Medium, 5 Low; none crash-class). The most material observation
+is a **metric-loss race** in `HealthReportSender.ExecuteAsync`
+(HealthMonitoring-017): `CollectReport` resets the per-interval error counters
+(`ScriptErrorCount`, `AlarmEvaluationErrorCount`, `DeadLetterCount`,
+`SiteAuditWriteFailures`, `AuditRedactionFailure`) **before**
+`_transport.Send(...)` is attempted, so a transport failure (the existing
+`catch { LogError; }` path) silently discards every error this site recorded in
+the failed interval. This inverts the module-specific concern of "metric
+counters drifting from raw-per-interval to cumulative": here the counts are
+lost rather than accumulated. A parallel hazard exists in
+`CentralHealthReportLoop` (HealthMonitoring-018). The
+remaining items are smaller: two Audit Log metrics
+(`SiteAuditTelemetryStalled`, `CentralAuditWriteFailures`) listed in the design
+doc never make it into a HealthMonitoring surface (HealthMonitoring-019); a
+heartbeat with `receivedAt <= existing.LastHeartbeatAt` brings an offline site
+back online with a stale heartbeat that can flap right back to offline on the
+next check (HealthMonitoring-020); the reserved `CentralSiteId = "central"`
+constant collides with any real site named `"central"` and silently extends its
+offline grace (HealthMonitoring-021); `CentralHealthReportLoopTests` uses real
+wall-clock 50 ms timers + `Task.Delay`, making it timing-sensitive
+(HealthMonitoring-022); and one obsolete placeholder test name
+(`StoreAndForwardBufferDepths_IsEmptyPlaceholder`) misrepresents what it now
+covers (HealthMonitoring-023). All sequence-number and offline-detection
+arithmetic uses `_timeProvider.GetUtcNow()` consistently — no wall-clock vs
+monotonic mismatch was observed.
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -66,6 +95,21 @@ broadly faithful to the design intent.
 | 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
 | 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses `OfflineTimeout` (HealthMonitoring-013). |

+_Re-review (2026-05-28, `1eb6e97`):_
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | x | `HealthReportSender` and `CentralHealthReportLoop` reset per-interval counters before the send/process call — counts lost on transport failure (HealthMonitoring-017, HealthMonitoring-018). `MarkHeartbeat` brings an offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt` — site can flap straight back to offline (HealthMonitoring-020). `CentralSiteId = "central"` reserved constant silently collides with any real site named "central" (HealthMonitoring-021). |
+| 2 | Akka.NET conventions | x | Module contains no actors itself. `IHealthReportTransport` cleanly abstracts the Akka-remoting send. `ProcessReport`/`MarkHeartbeat` are called from `CentralCommunicationActor`'s receive — invoked on the actor thread but the aggregator's CAS loops make that safe regardless. No issues found. |
+| 3 | Concurrency & thread safety | x | Verified the resolved `SiteHealthState` immutable-record / CAS-loop pattern still holds across `ProcessReport`, `MarkHeartbeat`, `CheckForOfflineSites`. `SiteHealthCollector` uses `volatile` for reference fields (`_clusterNodes`, `_nodeHostname`, `_siteAuditBacklog`, `_isActiveNode`) and `Interlocked` for counters consistently. `CollectReport`'s `new Dictionary<>(concurrentDict)` snapshots are not strictly atomic but acceptable at the documented scale. No new issues found. |
+| 4 | Error handling & resilience | x | `try/catch` blocks now log all non-fatal failures (resolved HealthMonitoring-010 still in place). Outer `catch (Exception)` in `ExecuteAsync` keeps the loop alive — sound. New: the counter-reset-before-send issue (HealthMonitoring-017, HealthMonitoring-018) is an error-handling gap — transport failure silently swallows the interval's metric data. |
+| 5 | Security | x | No issues found. The module handles only numeric/string operational metrics; no secrets, auth surface, or untrusted input parsing. `MarkHeartbeat` and `ProcessReport` trust the caller (intra-cluster). |
+| 6 | Performance & resource management | x | `PeriodicTimer` instances disposed via `using`. CAS retry loops in `ProcessReport`/`MarkHeartbeat` have no retry cap, but contention is limited to concurrent updates of the same site's entry (one entry per site), so the loops rarely retry in practice. No issues found. |
+| 7 | Design-document adherence | x | `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` are listed as required dashboard tiles in `Component-HealthMonitoring.md` but have no HealthMonitoring-side surface — both live only in `AuditLog`'s `AuditCentralHealthSnapshot` with no integration into the health aggregator or its consumers (HealthMonitoring-019). |
+| 8 | Code organization & conventions | x | Options class correctly owned by the component, validator registered idempotently across all three `Add*` paths. POCO/messages in Commons. `AddCentralHealthAggregation` implicitly depends on `ISiteHealthCollector` being registered elsewhere (Host calls `AddHealthMonitoring()` first) — works but is a hidden ordering requirement. Minor; not flagged. |
+| 9 | Testing coverage | x | Per-interval reset semantics covered for site-side counters but NOT for the failed-send case (no test asserts counters remain accumulated when the transport throws — would catch HealthMonitoring-017). `CentralHealthReportLoopTests` uses real wall-clock 50 ms `PeriodicTimer` + `Task.Delay(250)` for timing — flake-prone on a slow CI runner (HealthMonitoring-022). The placeholder test `StoreAndForwardBufferDepths_IsEmptyPlaceholder` name is stale (HealthMonitoring-023). |
+| 10 | Documentation & comments | x | XML docs in the new audit-bridge surfaces (`IncrementSiteAuditWriteFailures`, `IncrementAuditRedactionFailure`, `UpdateSiteAuditBacklog`) are accurate. The stale placeholder test name is the only issue (HealthMonitoring-023). |
+
 ## Findings

 ### HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated
@@ -776,3 +820,314 @@ continues to work via the optional parameter. Regression test
 asserts the timestamp equals a fixed injected instant exactly (not just a
 before/after window); it would not compile against the pre-fix single-arg-less
 constructor.
+
+### HealthMonitoring-017 — `HealthReportSender` resets interval counters before `Send`; transport failures silently drop the interval's error counts
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:140-154`, `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:146-153` |
+
+**Description**
+
+`HealthReportSender.ExecuteAsync` calls `_collector.CollectReport(_siteId)` and
+then `_transport.Send(reportWithSeq)` inside a single `try` block whose `catch`
+logs and continues. `CollectReport` atomically reads and resets the per-interval
+counters via `Interlocked.Exchange(ref _scriptErrorCount, 0)` (and the same for
+`_alarmErrorCount`, `_deadLetterCount`, `_siteAuditWriteFailures`,
+`_auditRedactionFailures`). If `_transport.Send` then throws — Akka remoting
+hiccup, transport not yet associated, central side temporarily unavailable,
+serialization failure on a malformed metric, etc. — the `catch (Exception ex)`
+on line 150 logs an error and the loop simply waits for the next tick. The
+report was never delivered, but the counters have already been reset to zero, so
+**every error this site recorded in the failed interval is gone**: it is neither
+in the (un-sent) report nor in the (zeroed) collector. The very next successful
+report will show "0 script errors / 0 alarm errors" for the entire window in
+which the transport was broken, masking exactly the period the operator most
+needs to triage.
+
+This contradicts the design doc's "raw counts per reporting interval" / "counter
+resets **after each report is sent**" wording — current code resets on each
+report _attempt_, regardless of outcome. The hazard worsens under sustained
+transport failure: every interval's errors are lost; the central dashboard sees
+a quiet site while the site is, in fact, failing.
+
+The same shape exists in `CentralHealthReportLoop` (see HealthMonitoring-018) —
+`CollectReport` is called before `_aggregator.ProcessReport`. The aggregator
+call is in-process and unlikely to throw, but the structural bug is identical.
+
+**Recommendation**
+
+Build the report from a non-destructive read first (`PeekReport(siteId)`,
+returning a snapshot without mutating the counters) and only call a dedicated
+`ResetIntervalCounters()` after a successful `_transport.Send`. Alternatively,
+on a `Send` failure, restore the lost counts via `Interlocked.Add` of the
+captured values back into the collector fields — atomically correct as long as
+no other thread can read them in between, which is true here because the next
+read is the next `CollectReport` on the same loop. The "peek then commit"
+shape is the cleaner public API.
+
+A regression test should add a failing-transport scenario:
+`Send` throws an `InvalidOperationException`; assert that the next successful
+report includes the previously-failed interval's `ScriptErrorCount`.
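+
+A minimal sketch of the "peek then commit" shape, assuming a hypothetical
+`IntervalCounter` type in place of the real `SiteHealthCollector` (one counter
+stands in for all five). `Peek`, `Commit`, and the simulated `Send` are
+illustrative names, not existing APIs. `Commit` subtracts the delivered value
+instead of zeroing the field, a variation on `ResetIntervalCounters()` that
+also keeps increments recorded between the read and the commit:
+
+```csharp
+using System;
+using System.Threading;
+
+// Hypothetical stand-in for SiteHealthCollector; one counter represents all five.
+public sealed class IntervalCounter
+{
+    private long _scriptErrorCount;
+
+    public void Increment() => Interlocked.Increment(ref _scriptErrorCount);
+
+    // Non-destructive read: the counter keeps its value.
+    public long Peek() => Interlocked.Read(ref _scriptErrorCount);
+
+    // Subtract only what was delivered, so increments recorded after
+    // Peek carry over into the next interval.
+    public void Commit(long delivered) => Interlocked.Add(ref _scriptErrorCount, -delivered);
+}
+
+public static class PeekThenCommitDemo
+{
+    // Simulated transport (assumption); throws when failNow is true.
+    private static void Send(long scriptErrors, bool failNow)
+    {
+        if (failNow) throw new InvalidOperationException("simulated transport failure");
+    }
+
+    // One loop iteration: commit only after Send returns.
+    private static void Tick(IntervalCounter counter, bool failNow)
+    {
+        long snapshot = counter.Peek();
+        try
+        {
+            Send(snapshot, failNow);
+            counter.Commit(snapshot);
+        }
+        catch (InvalidOperationException)
+        {
+            // Log and continue; the counts stay in the collector.
+        }
+    }
+
+    public static void Main()
+    {
+        var counter = new IntervalCounter();
+        counter.Increment();
+        counter.Increment();
+
+        Tick(counter, failNow: true);
+        Console.WriteLine(counter.Peek()); // 2: the failed interval is retained
+
+        counter.Increment();
+        Tick(counter, failNow: false);
+        Console.WriteLine(counter.Peek()); // 0: all three errors were delivered
+    }
+}
+```
+
+The second `Tick` delivers three errors, which is the behaviour the proposed
+regression test would assert.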
+
+### HealthMonitoring-018 — Same counter-reset-before-publish hazard in `CentralHealthReportLoop`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:87-98` |
+
+**Description**
+
+`CentralHealthReportLoop.ExecuteAsync` calls `_collector.CollectReport(CentralSiteId)`
+(which resets the per-interval counters on the shared `SiteHealthCollector`
+instance — see HealthMonitoring-017) and then `_aggregator.ProcessReport(reportWithSeq)`
+inside the same `try` block. If `ProcessReport` throws, the central node's own
+per-interval counters (`ScriptErrorCount`, `AlarmEvaluationErrorCount`,
+`DeadLetterCount`, `SiteAuditWriteFailures`, `AuditRedactionFailure`) are lost
+for that interval.
+
+In practice `ProcessReport` is a pure in-memory CAS loop and is very unlikely
+to throw, so the operational impact is small. However, the structural bug is
+identical to HealthMonitoring-017 and would be fixed by the same
+"peek then commit" refactor in `SiteHealthCollector`. The Audit-Log-related
+metrics matter most here: `AuditRedactionFailure` is genuinely incremented at
+central during normal operation (the Notification Outbox dispatcher and
+Inbound API middleware both write through `CentralAuditRedactionFailureCounter`,
+which can fan out to the collector via the bridge), so this is not purely
+theoretical.
+
+**Recommendation**
+
+Adopt the same "peek then reset on successful publish" pattern recommended for
+HealthMonitoring-017. Reuse the new `PeekReport` / `ResetIntervalCounters`
+collector API once it lands.
+
+### HealthMonitoring-019 — `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` design-doc metrics have no HealthMonitoring-side surface
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `docs/requirements/Component-HealthMonitoring.md:39,40`, `src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs`, `src/ScadaLink.AuditLog/Central/AuditCentralHealthSnapshot.cs:39-58` |
+
+**Description**
+
+`Component-HealthMonitoring.md` lists `SiteAuditTelemetryStalled` and
+`CentralAuditWriteFailures` (and reiterates them under the Audit Log KPIs
+section and in the Dependencies section) as required dashboard metrics. The
+doc also says they "are central-computed alongside the existing central KPIs"
+(Notification Outbox / Site Call Audit) and surface in the **Audit** dashboard
+tile group.
+
+Tracing the code:
+
+- `SiteAuditTelemetryStalled` is published by `SiteAuditReconciliationActor`,
+  picked up by `SiteAuditTelemetryStalledTracker`, and latched into
+  `AuditCentralHealthSnapshot._stalled` (a `ConcurrentDictionary<string, bool>`
+  in the `ScadaLink.AuditLog` assembly).
+- `CentralAuditWriteFailures` is incremented inside `AuditCentralHealthSnapshot`
+  via `ICentralAuditWriteFailureCounter.Increment()` (also in `ScadaLink.AuditLog`).
+
+Neither metric is referenced anywhere in `src/ScadaLink.HealthMonitoring/`:
+- `ICentralHealthAggregator` does not expose them.
+- `SiteHealthCollector` has no central counterpart (it is site-only).
+- `SiteHealthReport` has no `SiteAuditTelemetryStalled` / `CentralAuditWriteFailures`
+  fields (the site-only `SiteAuditWriteFailures`, `AuditRedactionFailure`, and
+  `SiteAuditBacklog` _are_ wired; the central pair is the gap).
+
+Currently the only consumer of `IAuditCentralHealthSnapshot` is whatever
+Central UI page binds to it directly (out of scope for this module), but the
+design doc places these metrics under HealthMonitoring's responsibility
+("Health Monitoring Dashboard displays aggregated metrics"). At minimum the
+Dependencies section's claim that Health Monitoring provides "the
+central-computed `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics"
+is false for `CentralAuditWriteFailures`: nothing under
+`src/ScadaLink.HealthMonitoring/` knows about it.
+
+**Recommendation**
+
+Decide whether HealthMonitoring or the consuming UI page owns the
+`IAuditCentralHealthSnapshot` integration:
+
+- If HealthMonitoring owns it, expose a `CentralKpis` accessor on
+  `ICentralHealthAggregator` (e.g. a `GetCentralAuditHealth()` method that
+  returns a typed DTO derived from the injected `IAuditCentralHealthSnapshot`)
+  so the dashboard has a single read surface mirroring `GetAllSiteStates`.
+- If the UI page binds `IAuditCentralHealthSnapshot` directly, update the
+  HealthMonitoring design doc's Responsibilities / Dependencies sections to
+  reflect that and remove the implied integration.
+
+Either way, add a regression test that the chosen surface returns the live
+counter and per-site stalled state.
+
+### HealthMonitoring-020 — `MarkHeartbeat` brings offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:128-147` |
+
+**Description**
+
+The CAS path in `MarkHeartbeat` picks `newHeartbeat = max(receivedAt, existing.LastHeartbeatAt)`,
+then short-circuits only when `newHeartbeat == existing.LastHeartbeatAt &&
+existing.IsOnline`. That short-circuit is correct, but consider the case where
+`existing.IsOnline == false` and `receivedAt <= existing.LastHeartbeatAt`:
+
+1. Suppose a site is marked offline by `CheckForOfflineSites` at time T1.
+2. A late/out-of-order heartbeat carrying a `receivedAt` _older_ than the last
+   stored `LastHeartbeatAt` arrives at T2 (clock skew at the receive site, or a
+   delayed message that was generated before the offline-marking).
+3. `newHeartbeat == existing.LastHeartbeatAt` (kept), but the short-circuit
+   condition fails because `existing.IsOnline == false`, so the CAS produces a
+   new record with `IsOnline = true` and the **stale** `LastHeartbeatAt`.
+4. On the very next `CheckForOfflineSites` tick (≤ `OfflineTimeout/2` later),
+   `now - LastHeartbeatAt` is still ≥ `OfflineTimeout`, so the site is
+   immediately marked offline again — the heartbeat brought it online for less
+   than the check cadence, producing a "flap" in the dashboard.
+
+In practice `receivedAt` is normally `_timeProvider.GetUtcNow()` at the
+`CentralCommunicationActor` receive site, so monotonically increasing — the bug
+is latent. But the contract `MarkHeartbeat(string siteId, DateTimeOffset receivedAt)`
+makes no guarantee about ordering, and an out-of-order delivery (for example
+around an Akka remoting connection re-establishment) or a small wall-clock
+correction at central would expose it.
+
+**Recommendation**
+
+When transitioning offline → online, use `now` (from the injected
+`TimeProvider`) rather than the caller-supplied `receivedAt` for
+`LastHeartbeatAt`, or take `max(receivedAt, _timeProvider.GetUtcNow())` so the
+recovery point is always recent. A unit test should drive `MarkHeartbeat` with a
+`receivedAt` older than the last stored heartbeat on an offline site, call
+`CheckForOfflineSites` immediately afterwards, and assert that the site stays
+online.
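+
+The transition can be sketched as a pure function, assuming a hypothetical
+two-field `SiteState` record in place of the real `SiteHealthState`. The real
+fix would live inside the existing CAS loop; `Apply` and its `now` parameter
+are illustrative only:
+
+```csharp
+using System;
+
+// Hypothetical, reduced form of the SiteHealthState record.
+public sealed record SiteState(DateTimeOffset LastHeartbeatAt, bool IsOnline);
+
+public static class HeartbeatSketch
+{
+    private static DateTimeOffset Max(DateTimeOffset a, DateTimeOffset b) => a > b ? a : b;
+
+    // Pure transition; the real code would run this inside its CAS loop.
+    public static SiteState Apply(SiteState existing, DateTimeOffset receivedAt, DateTimeOffset now)
+    {
+        var newest = Max(receivedAt, existing.LastHeartbeatAt);
+
+        // Offline -> online: never recover with a timestamp older than now.
+        if (!existing.IsOnline)
+            newest = Max(newest, now);
+
+        if (newest == existing.LastHeartbeatAt && existing.IsOnline)
+            return existing; // nothing to change
+
+        return existing with { LastHeartbeatAt = newest, IsOnline = true };
+    }
+
+    public static void Main()
+    {
+        var now = new DateTimeOffset(2026, 5, 28, 12, 0, 0, TimeSpan.Zero);
+        var offline = new SiteState(now.AddMinutes(-5), IsOnline: false);
+
+        // Late heartbeat stamped before the stored LastHeartbeatAt.
+        var recovered = Apply(offline, now.AddMinutes(-6), now);
+
+        Console.WriteLine(recovered.IsOnline);               // True
+        Console.WriteLine(recovered.LastHeartbeatAt == now); // True
+    }
+}
+```
+
+Because the recovered `LastHeartbeatAt` equals `now`, an immediate
+`CheckForOfflineSites` sees an age of zero and leaves the site online.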
+
+### HealthMonitoring-021 — `CentralSiteId = "central"` reserved constant silently collides with a real site named "central"
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:22`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:224-226` |
+
+**Description**
+
+`CentralHealthAggregator.CheckForOfflineSites` looks up the per-site offline
+timeout with:
+
+```csharp
+var timeout = kvp.Key == CentralHealthReportLoop.CentralSiteId
+    ? _options.CentralOfflineTimeout
+    : _options.OfflineTimeout;
+```
+
+`CentralSiteId` is the literal string `"central"`. Site IDs are free-form
+strings set in configuration / the Sites repository; there is no validation
+that excludes the reserved `"central"` name. An operator who creates a real
+site with `SiteId = "central"` will hit two problems:
+
+- Real-site reports arriving via `ProcessReport` are stored in the same
+  dictionary slot as the central self-report (they share the keyspace), so the
+  two reports repeatedly displace each other through the sequence-number
+  guard: whichever has the higher Unix-ms seed wins, and the other is silently
+  rejected as stale. The dashboard alternates between two unrelated payloads.
+- The real site gets the longer `CentralOfflineTimeout` (default 3 minutes)
+  instead of the normal `OfflineTimeout` (60 s), so a genuinely failed real
+  site named "central" stays falsely online for an extra two minutes.
+
+**Recommendation**
+
+Two options:
+
+1. Reject the reserved name at the Site entity / configuration validation
+   layer (Configuration Database component, out of this module's scope) and
+   document `"central"` as reserved. This is the cleaner UX fix.
+2. As a defence-in-depth inside HealthMonitoring, store the central
+   self-report under a key that cannot collide — e.g. prefix it with a
+   character that is forbidden in real site IDs (`":central"` or `"#central"`)
+   — and adjust `CheckForOfflineSites` accordingly.
+
+Either fix should include a regression test creating a real `SiteHealthReport`
+with `SiteId = "central"` and asserting the central self-report's identity is
+preserved.
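+
+Option 1 could look roughly like the following, assuming a hypothetical
+`SiteIdRules.IsAllowed` hook in the configuration validation layer. Matching
+case-insensitively and trimming whitespace are assumptions made here to be
+conservative; the aggregator itself compares with `==`:
+
+```csharp
+using System;
+
+public static class SiteIdRules
+{
+    // Mirrors the value of CentralHealthReportLoop.CentralSiteId.
+    public const string ReservedCentralSiteId = "central";
+
+    // Hypothetical validation hook: rejects empty IDs and the reserved name.
+    public static bool IsAllowed(string siteId) =>
+        !string.IsNullOrWhiteSpace(siteId) &&
+        !string.Equals(siteId.Trim(), ReservedCentralSiteId, StringComparison.OrdinalIgnoreCase);
+
+    public static void Main()
+    {
+        Console.WriteLine(IsAllowed("plant-7"));  // True
+        Console.WriteLine(IsAllowed("central"));  // False
+        Console.WriteLine(IsAllowed("Central ")); // False
+    }
+}
+```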
+
+### HealthMonitoring-022 — `CentralHealthReportLoopTests` uses real-time `PeriodicTimer` + `Task.Delay`; flake-prone on slow CI
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.HealthMonitoring.Tests/CentralHealthReportLoopTests.cs:32-42` |
+
+**Description**
+
+`RunLoopBriefly` starts the hosted service with a 50 ms `PeriodicTimer` and
+then awaits `Task.Delay(runForMs, CancellationToken.None)` (with `runForMs`
+between 150 ms and 300 ms). `GeneratesCentralReports_WhenSelfIsPrimary` and
+`AssignsMonotonicSequenceNumbers` both assert "at least 2 reports were
+generated" within the window. On a heavily-contended CI runner where the
+hosted-service start-up plus a couple of `PeriodicTimer` ticks can blow past
+300 ms, these tests will fail intermittently.
+
+The rest of the suite (`CentralHealthAggregatorTests`, `SiteHealthCollectorTests`,
+`HealthReportSenderTests` partially) was deliberately refactored to use the
+injected `TimeProvider` precisely to avoid this. `CentralHealthReportLoop` and
+`HealthReportSender` already accept a `TimeProvider`, but the loop's
+`PeriodicTimer` is still real-time because the timer is not driven by the
+injected `TimeProvider`.
+
+**Recommendation**
+
+Either (a) accept the timing-sensitivity and bump the delay budget
+generously, or (b) refactor the hosted-service loop to use a
+`TimeProvider.CreateTimer`-based tick mechanism so the test can advance a
+fake clock and assert deterministically how many ticks fire. Option (b) is
+the better long-term fix and matches the pattern used elsewhere in the
+module's tests.
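+
+One way to get a `TimeProvider`-backed tick for option (b) is the
+`PeriodicTimer(TimeSpan, TimeProvider)` constructor overload, which exists on
+.NET 8 and later. Whether the project targets .NET 8 is an assumption, and
+`TickLoop` below is a hypothetical stand-in for the hosted-service loop, not
+the module's actual code:
+
+```csharp
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+
+// Hypothetical stand-in for the hosted-service loop body.
+public sealed class TickLoop
+{
+    private readonly TimeProvider _timeProvider;
+    private readonly TimeSpan _interval;
+    private readonly Action _onTick;
+
+    public TickLoop(TimeProvider timeProvider, TimeSpan interval, Action onTick)
+    {
+        _timeProvider = timeProvider;
+        _interval = interval;
+        _onTick = onTick;
+    }
+
+    public async Task RunAsync(CancellationToken ct)
+    {
+        // The timer takes its ticks from the injected TimeProvider,
+        // so a fake provider controls when they fire.
+        using var timer = new PeriodicTimer(_interval, _timeProvider);
+        while (await timer.WaitForNextTickAsync(ct))
+        {
+            _onTick();
+        }
+    }
+}
+
+public static class TickLoopDemo
+{
+    public static async Task Main()
+    {
+        int ticks = 0;
+        using var cts = new CancellationTokenSource();
+
+        // Real clock here; a test would pass a fake TimeProvider instead.
+        var loop = new TickLoop(TimeProvider.System, TimeSpan.FromMilliseconds(20), () =>
+        {
+            if (++ticks == 3) cts.Cancel();
+        });
+
+        try { await loop.RunAsync(cts.Token); }
+        catch (OperationCanceledException) { /* expected on shutdown */ }
+
+        Console.WriteLine(ticks);
+    }
+}
+```
+
+In a test, a fake provider such as `FakeTimeProvider` from the
+`Microsoft.Extensions.TimeProvider.Testing` package could be advanced one
+interval at a time, replacing the `Task.Delay` budget with an explicit tick
+count.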
+
+### HealthMonitoring-023 — `StoreAndForwardBufferDepths_IsEmptyPlaceholder` test name is stale; it now covers the default-state contract, not a placeholder
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `tests/ScadaLink.HealthMonitoring.Tests/SiteHealthCollectorTests.cs:117-122` |
+
+**Description**
+
+The test `StoreAndForwardBufferDepths_IsEmptyPlaceholder` was originally named
+to codify the HealthMonitoring-001 bug ("`SetStoreAndForwardDepths` has no
+callers, so `StoreAndForwardBufferDepths` is always empty"). HealthMonitoring-001
+is `Resolved` — `HealthReportSender` now populates per-category depths from
+the S&F engine, and the same test class has `SetStoreAndForwardDepths_ReflectedInReport`
+covering the populated path. The "placeholder" test still passes because it
+constructs a fresh collector and never calls the setter, so its assertion
+(`Assert.Empty(report.StoreAndForwardBufferDepths)`) is now testing the
+**default empty state of an un-configured collector**. The HealthMonitoring-001
+resolution note explicitly chose to keep it as "the collector-level
+default-state test", but the test method name and the implied semantics no
+longer match.
+
+A maintainer reading the test name today will misread it as documentation that
+the metric is unimplemented (which it isn't), and may waste time investigating
+a non-bug.
+
+**Recommendation**
+
+Rename to `StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled`
+(or similar) so the name states the default-state intent. This is purely a
+documentation / maintainability fix; no behaviour change.