code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.

regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
Joseph Doherty
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
+358 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.HealthMonitoring` |
| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |
## Summary
@@ -51,6 +51,35 @@ HealthMonitoring + CentralUI change), and `CollectReport` reading
`TimeProvider` (HealthMonitoring-016). The module remains small, readable, and
broadly faithful to the design intent.
#### Re-review 2026-05-28 (commit `1eb6e97`)
All sixteen prior findings (HealthMonitoring-001..016) remain `Resolved`. This
baseline re-review applied the full 10-category checklist and produced **7 new
findings** (2 Medium, 5 Low; none crash-class). The most material observation
is a **metric-loss race** in `HealthReportSender.ExecuteAsync`
(HealthMonitoring-017): `CollectReport` resets the per-interval error counters
(`ScriptErrorCount`, `AlarmEvaluationErrorCount`, `DeadLetterCount`,
`SiteAuditWriteFailures`, `AuditRedactionFailure`) **before**
`_transport.Send(...)` is attempted, so a transport failure (the existing
`catch { LogError; }` path) silently discards every error this site recorded in
the failed interval. This inverts the module-specific concern of "metric counters
drifting from raw-per-interval to cumulative": the counts are not inflated but
lost. A parallel hazard exists in `CentralHealthReportLoop` (HealthMonitoring-018). The
remaining items are smaller: two Audit Log metrics
(`SiteAuditTelemetryStalled`, `CentralAuditWriteFailures`) listed in the design
doc never make it into a HealthMonitoring surface (HealthMonitoring-019); a
heartbeat with `receivedAt <= existing.LastHeartbeatAt` brings an offline site
back online with a stale `LastHeartbeatAt`, so the site can flap right back to
offline on the next check (HealthMonitoring-020); the reserved `CentralSiteId = "central"`
constant collides with any real site named `"central"` and silently extends its
offline grace (HealthMonitoring-021); `CentralHealthReportLoopTests` uses real
wall-clock 50 ms timers + `Task.Delay`, making it timing-sensitive
(HealthMonitoring-022); and one obsolete placeholder test name
(`StoreAndForwardBufferDepths_IsEmptyPlaceholder`) misrepresents what it now
covers (HealthMonitoring-023). All sequence-number and offline-detection
arithmetic uses `_timeProvider.GetUtcNow()` consistently — no wall-clock vs
monotonic mismatch was observed.
## Checklist coverage
| # | Category | Examined | Notes |
@@ -66,6 +95,21 @@ broadly faithful to the design intent.
| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses `OfflineTimeout` (HealthMonitoring-013). |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | `HealthReportSender` and `CentralHealthReportLoop` reset per-interval counters before the send/process call — counts lost on transport failure (HealthMonitoring-017, HealthMonitoring-018). `MarkHeartbeat` brings an offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt` — site can flap straight back to offline (HealthMonitoring-020). `CentralSiteId = "central"` reserved constant silently collides with any real site named "central" (HealthMonitoring-021). |
| 2 | Akka.NET conventions | x | Module contains no actors itself. `IHealthReportTransport` cleanly abstracts the Akka-remoting send. `ProcessReport`/`MarkHeartbeat` are called from `CentralCommunicationActor`'s receive — invoked on the actor thread but the aggregator's CAS loops make that safe regardless. No issues found. |
| 3 | Concurrency & thread safety | x | Verified the resolved `SiteHealthState` immutable-record / CAS-loop pattern still holds across `ProcessReport`, `MarkHeartbeat`, `CheckForOfflineSites`. `SiteHealthCollector` uses `volatile` for reference fields (`_clusterNodes`, `_nodeHostname`, `_siteAuditBacklog`, `_isActiveNode`) and `Interlocked` for counters consistently. `CollectReport`'s `new Dictionary<>(concurrentDict)` snapshots are not strictly atomic but acceptable at the documented scale. No new issues found. |
| 4 | Error handling & resilience | x | `try/catch` blocks now log all non-fatal failures (resolved HealthMonitoring-010 still in place). Outer `catch (Exception)` in `ExecuteAsync` keeps the loop alive — sound. New: the counter-reset-before-send issue (HealthMonitoring-017, HealthMonitoring-018) is an error-handling gap — transport failure silently swallows the interval's metric data. |
| 5 | Security | x | No issues found. The module handles only numeric/string operational metrics; no secrets, auth surface, or untrusted input parsing. `MarkHeartbeat` and `ProcessReport` trust the caller (intra-cluster). |
| 6 | Performance & resource management | x | `PeriodicTimer` instances are disposed via `using`. CAS retry loops in `ProcessReport`/`MarkHeartbeat` have no retry cap, but contention is limited to writers of the same dictionary entry (one entry per site), so a loop normally succeeds on its first attempt. No issues found. |
| 7 | Design-document adherence | x | `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` are listed as required dashboard tiles in `Component-HealthMonitoring.md` but have no HealthMonitoring-side surface — both live only in `AuditLog`'s `AuditCentralHealthSnapshot` with no integration into the health aggregator or its consumers (HealthMonitoring-019). |
| 8 | Code organization & conventions | x | Options class correctly owned by the component, validator registered idempotently across all three `Add*` paths. POCO/messages in Commons. `AddCentralHealthAggregation` implicitly depends on `ISiteHealthCollector` being registered elsewhere (Host calls `AddHealthMonitoring()` first) — works but is a hidden ordering requirement. Minor; not flagged. |
| 9 | Testing coverage | x | Per-interval reset semantics are covered for site-side counters but NOT for the failed-send case (no test asserts that counters remain accumulated when the transport throws, which would catch HealthMonitoring-017). `CentralHealthReportLoopTests` uses a real wall-clock 50 ms `PeriodicTimer` + `Task.Delay(250)` for timing, which is flake-prone on a slow CI runner (HealthMonitoring-022). The name of the placeholder test `StoreAndForwardBufferDepths_IsEmptyPlaceholder` is stale (HealthMonitoring-023). |
| 10 | Documentation & comments | x | XML docs in the new audit-bridge surfaces (`IncrementSiteAuditWriteFailures`, `IncrementAuditRedactionFailure`, `UpdateSiteAuditBacklog`) are accurate. The stale placeholder test name is the only issue (HealthMonitoring-023). |
## Findings
### HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated
@@ -776,3 +820,314 @@ continues to work via the optional parameter. Regression test
asserts the timestamp equals a fixed injected instant exactly (not just a
before/after window); it would not compile against the pre-fix single-arg-less
constructor.
### HealthMonitoring-017 — `HealthReportSender` resets interval counters before `Send`; transport failures silently drop the interval's error counts
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:140-154`, `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:146-153` |
**Description**
`HealthReportSender.ExecuteAsync` calls `_collector.CollectReport(_siteId)` and
then `_transport.Send(reportWithSeq)` inside a single `try` block whose `catch`
logs and continues. `CollectReport` atomically reads and resets the per-interval
counters via `Interlocked.Exchange(ref _scriptErrorCount, 0)` (and the same for
`_alarmErrorCount`, `_deadLetterCount`, `_siteAuditWriteFailures`,
`_auditRedactionFailures`). If `_transport.Send` then throws — Akka remoting
hiccup, transport not yet associated, central side temporarily unavailable,
serialization failure on a malformed metric, etc. — the `catch (Exception ex)`
on line 150 logs an error and the loop simply waits for the next tick. The
report was never delivered, but the counters have already been reset to zero, so
**every error this site recorded in the failed interval is gone**: it is neither
in the (un-sent) report nor in the (zeroed) collector. The very next successful
report will show "0 script errors / 0 alarm errors" for the entire window in
which the transport was broken, masking exactly the period the operator most
needs to triage.
This contradicts the design doc's "raw counts per reporting interval" / "counter
resets **after each report is sent**" wording — current code resets on each
report _attempt_, regardless of outcome. The hazard worsens under sustained
transport failure: every interval's errors are lost; the central dashboard sees
a quiet site while the site is, in fact, failing.
The same shape exists in `CentralHealthReportLoop` (see HealthMonitoring-018) —
`CollectReport` is called before `_aggregator.ProcessReport`. The aggregator
call is in-process and unlikely to throw, but the structural bug is identical.
**Recommendation**
Build the report from a non-destructive read first (`PeekReport(siteId)`,
returning a snapshot without mutating the counters) and only call a dedicated
`ResetIntervalCounters()` after a successful `_transport.Send`. Alternatively,
on a `Send` failure, restore the lost counts via `Interlocked.Add` of the
captured values back into the collector fields — atomically correct as long as
no other thread can read them in between, which is true here because the next
read is the next `CollectReport` on the same loop. The "peek then commit"
shape is the cleaner public API.
A regression test should add a failing-transport scenario:
`Send` throws an `InvalidOperationException`; assert that the next successful
report includes the previously-failed interval's `ScriptErrorCount`.
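The "peek then commit" shape can be sketched roughly as follows. This is a minimal illustration, not the module's code: `IntervalCounterSketch`, `PeekScriptErrorCount`, `CommitScriptErrorCount`, and `TrySendInterval` are hypothetical names, only one counter is modelled, and the transport is reduced to an `Action<int>`. The commit step subtracts the reported value instead of zeroing the counter (a variation on `ResetIntervalCounters()`), so increments that arrive while the send is in flight carry into the next interval.

```csharp
using System;
using System.Threading;

// Hypothetical, simplified stand-in for SiteHealthCollector with a single counter.
public sealed class IntervalCounterSketch
{
    private int _scriptErrorCount;

    public void IncrementScriptError() => Interlocked.Increment(ref _scriptErrorCount);

    // Non-destructive read: returns the current count without resetting it.
    public int PeekScriptErrorCount() => Volatile.Read(ref _scriptErrorCount);

    // Commit: subtract only what was reported, so increments recorded between
    // the peek and the commit are kept for the next interval.
    public void CommitScriptErrorCount(int reported) =>
        Interlocked.Add(ref _scriptErrorCount, -reported);
}

public static class PeekThenCommitDemo
{
    // Returns true when the send succeeded and the reported count was committed.
    public static bool TrySendInterval(IntervalCounterSketch collector, Action<int> send)
    {
        int snapshot = collector.PeekScriptErrorCount();
        try
        {
            send(snapshot); // stand-in for _transport.Send(report)
        }
        catch (Exception)
        {
            // Send failed: nothing was reset, so the counts stay in the collector.
            return false;
        }

        collector.CommitScriptErrorCount(snapshot);
        return true;
    }
}
```

With this shape, an interval whose send throws leaves the counter untouched, and the next successful send reports the accumulated total.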
### HealthMonitoring-018 — Same counter-reset-before-publish hazard in `CentralHealthReportLoop`
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:87-98` |
**Description**
`CentralHealthReportLoop.ExecuteAsync` calls `_collector.CollectReport(CentralSiteId)`
(which resets the per-interval counters on the shared `SiteHealthCollector`
instance — see HealthMonitoring-017) and then `_aggregator.ProcessReport(reportWithSeq)`
inside the same `try` block. If `ProcessReport` throws, the central node's own
per-interval counters (`ScriptErrorCount`, `AlarmEvaluationErrorCount`,
`DeadLetterCount`, `SiteAuditWriteFailures`, `AuditRedactionFailure`) are lost
for that interval.
In practice `ProcessReport` is a pure in-memory CAS loop and is very unlikely
to throw, so the operational impact is small. However, the structural bug is
identical to HealthMonitoring-017 and would be fixed by the same
"peek then commit" refactor in `SiteHealthCollector`. The Audit-Log-related
metrics matter most here: `AuditRedactionFailure` is genuinely incremented at
central during normal operation (the Notification Outbox dispatcher and
Inbound API middleware both write through `CentralAuditRedactionFailureCounter`
which can fan out to the collector via the bridge), so this is not purely
theoretical.
**Recommendation**
Adopt the same "peek then reset on successful publish" pattern recommended for
HealthMonitoring-017. Reuse the new `PeekReport` / `ResetIntervalCounters`
collector API once it lands.
### HealthMonitoring-019 — `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` design-doc metrics have no HealthMonitoring-side surface
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-HealthMonitoring.md:39,40`, `src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs`, `src/ScadaLink.AuditLog/Central/AuditCentralHealthSnapshot.cs:39-58` |
**Description**
`Component-HealthMonitoring.md` lists `SiteAuditTelemetryStalled` and
`CentralAuditWriteFailures` (and reiterates them under the Audit Log KPIs
section and in the Dependencies section) as required dashboard metrics. The
doc also says they "are central-computed alongside the existing central KPIs"
(Notification Outbox / Site Call Audit) and surface in the **Audit** dashboard
tile group.
Tracing the code:
- `SiteAuditTelemetryStalled` is published by `SiteAuditReconciliationActor`,
picked up by `SiteAuditTelemetryStalledTracker`, and latched into
`AuditCentralHealthSnapshot._stalled` (a `ConcurrentDictionary<string, bool>`
in the `ScadaLink.AuditLog` assembly).
- `CentralAuditWriteFailures` is incremented inside `AuditCentralHealthSnapshot`
via `ICentralAuditWriteFailureCounter.Increment()` (also in `ScadaLink.AuditLog`).
Neither metric is referenced anywhere in `src/ScadaLink.HealthMonitoring/`:
- `ICentralHealthAggregator` does not expose them.
- `SiteHealthCollector` has no central counterpart (it is site-only).
- `SiteHealthReport` has no `SiteAuditTelemetryStalled` / `CentralAuditWriteFailures`
fields (the site-only `SiteAuditWriteFailures`, `AuditRedactionFailure`, and
`SiteAuditBacklog` _are_ wired; the central pair is the gap).
Currently the only consumer of `IAuditCentralHealthSnapshot` is whatever
Central UI page binds to it directly (out of scope for this module), but the
design doc places these metrics under HealthMonitoring's responsibility
("Health Monitoring Dashboard displays aggregated metrics"). At minimum the
Dependencies section's claim that Health Monitoring provides "the
central-computed `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics"
is false for `CentralAuditWriteFailures`: nothing under
`src/ScadaLink.HealthMonitoring/` knows about it.
**Recommendation**
Decide whether HealthMonitoring or the consuming UI page owns the
`IAuditCentralHealthSnapshot` integration:
- If HealthMonitoring owns it, expose a `CentralKpis` accessor on
`ICentralHealthAggregator` (e.g. a `GetCentralAuditHealth()` method that
returns a typed DTO derived from the injected `IAuditCentralHealthSnapshot`)
so the dashboard has a single read surface mirroring `GetAllSiteStates`.
- If the UI page binds `IAuditCentralHealthSnapshot` directly, update the
HealthMonitoring design doc's Responsibilities / Dependencies sections to
reflect that and remove the implied integration.
Either way, add a regression test that the chosen surface returns the live
counter and per-site stalled state.
### HealthMonitoring-020 — `MarkHeartbeat` brings offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt`
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:128-147` |
**Description**
The CAS path in `MarkHeartbeat` picks `newHeartbeat = max(receivedAt, existing.LastHeartbeatAt)`,
then short-circuits only when `newHeartbeat == existing.LastHeartbeatAt &&
existing.IsOnline`. That short-circuit is correct, but consider the case where
`existing.IsOnline == false` and `receivedAt <= existing.LastHeartbeatAt`:
1. Suppose a site is marked offline by `CheckForOfflineSites` at time T1.
2. A late/out-of-order heartbeat carrying a `receivedAt` _older_ than the last
stored `LastHeartbeatAt` arrives at T2 (clock skew at the receive site, or a
delayed message that was generated before the offline-marking).
3. `newHeartbeat == existing.LastHeartbeatAt` (kept), but the short-circuit
condition fails because `existing.IsOnline == false`, so the CAS produces a
new record with `IsOnline = true` and the **stale** `LastHeartbeatAt`.
4. On the very next `CheckForOfflineSites` tick (≤ `OfflineTimeout/2` later),
`now - LastHeartbeatAt` is still ≥ `OfflineTimeout`, so the site is
immediately marked offline again — the heartbeat brought it online for less
than the check cadence, producing a "flap" in the dashboard.
In practice `receivedAt` is normally `_timeProvider.GetUtcNow()` at the
`CentralCommunicationActor` receive site, so monotonically increasing — the bug
is latent. But the contract `MarkHeartbeat(string siteId, DateTimeOffset receivedAt)`
makes no guarantee about ordering, and an out-of-order delivery (Akka remoting
ordering across connection re-establishment edge cases) or a small wall-clock
correction at central would expose it.
**Recommendation**
When transitioning offline → online, use `now` (from the injected
`TimeProvider`) rather than the caller-supplied `receivedAt` for
`LastHeartbeatAt`, or take `max(receivedAt, _timeProvider.GetUtcNow())` so the
recovery point is always recent. A unit test driving `MarkHeartbeat` with a
`receivedAt` older than the last stored heartbeat on an offline site, then a
`CheckForOfflineSites` immediately afterwards, would assert the site stays
online.
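One way to picture the recommended transition rule is the pure function below. It is a sketch under assumptions: `SiteStateSketch`, `ApplyHeartbeat`, and `IsOffline` are hypothetical names, the real aggregator's CAS loop and dictionary are omitted, and `now` stands in for `_timeProvider.GetUtcNow()`.

```csharp
using System;

// Hypothetical, reduced version of the per-site state record.
public sealed record SiteStateSketch(bool IsOnline, DateTimeOffset LastHeartbeatAt);

public static class HeartbeatSketch
{
    public static SiteStateSketch ApplyHeartbeat(
        SiteStateSketch existing, DateTimeOffset receivedAt, DateTimeOffset now)
    {
        if (existing.IsOnline)
        {
            // Online path: keep the newest timestamp, never move it backwards.
            var newest = receivedAt > existing.LastHeartbeatAt
                ? receivedAt
                : existing.LastHeartbeatAt;
            return newest == existing.LastHeartbeatAt
                ? existing
                : existing with { LastHeartbeatAt = newest };
        }

        // Offline -> online: anchor the recovery point at max(receivedAt, now)
        // so a late or out-of-order heartbeat cannot leave a stale timestamp.
        var recovery = receivedAt > now ? receivedAt : now;
        return new SiteStateSketch(IsOnline: true, LastHeartbeatAt: recovery);
    }

    // Mirrors the offline check described above: now - LastHeartbeatAt >= timeout.
    public static bool IsOffline(SiteStateSketch state, DateTimeOffset now, TimeSpan timeout) =>
        now - state.LastHeartbeatAt >= timeout;
}
```

Under this rule, a stale heartbeat on an offline site yields a state whose `LastHeartbeatAt` equals `now`, so an offline check run immediately afterwards does not flip the site back.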
### HealthMonitoring-021 — `CentralSiteId = "central"` reserved constant silently collides with a real site named "central"
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:22`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:224-226` |
**Description**
`CentralHealthAggregator.CheckForOfflineSites` looks up the per-site offline
timeout with:
```csharp
var timeout = kvp.Key == CentralHealthReportLoop.CentralSiteId
? _options.CentralOfflineTimeout
: _options.OfflineTimeout;
```
`CentralSiteId` is the literal string `"central"`. Site IDs are free-form
strings set in configuration / the Sites repository; there is no validation
that excludes the reserved `"central"` name. An operator who creates a real
site with `SiteId = "central"` triggers two problems:
- The real site's reports arriving via `ProcessReport` are stored in the same
dictionary slot as the central self-report (they share the keyspace), so the
central self-report and the real-site report repeatedly overwrite each
other via the sequence-number guard: whichever has the higher Unix-ms seed
wins, and the other is silently rejected as stale. The dashboard alternates
between two unrelated payloads.
- The real site gets the longer `CentralOfflineTimeout` (default 3 minutes)
instead of the normal `OfflineTimeout` (60 s), so a genuinely failed real
site named "central" stays falsely online for an extra two minutes.
**Recommendation**
Two options:
1. Reject the reserved name at the Site entity / configuration validation
layer (Configuration Database component, out of this module's scope) and
document `"central"` as reserved. This is the cleaner UX fix.
2. As a defence-in-depth inside HealthMonitoring, store the central
self-report under a key that cannot collide — e.g. prefix it with a
character that is forbidden in real site IDs (`":central"` or `"#central"`)
— and adjust `CheckForOfflineSites` accordingly.
Either fix should include a regression test creating a real `SiteHealthReport`
with `SiteId = "central"` and asserting the central self-report's identity is
preserved.
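Option 1 could look something like the guard below. It is only a sketch: `ReservedSiteIdSketch` and `ValidateRealSiteId` are hypothetical names, the actual validation layer lives outside this module, and the ordinal comparison is an assumption chosen to mirror the `==` check in `CheckForOfflineSites`.

```csharp
using System;

public static class ReservedSiteIdSketch
{
    // Mirrors the literal used by CentralHealthReportLoop.CentralSiteId.
    public const string CentralSiteId = "central";

    public static bool IsReserved(string siteId) =>
        string.Equals(siteId, CentralSiteId, StringComparison.Ordinal);

    // Hypothetical validation hook for real (non-central) site IDs.
    public static void ValidateRealSiteId(string siteId)
    {
        if (string.IsNullOrWhiteSpace(siteId))
            throw new ArgumentException("Site ID must not be empty.", nameof(siteId));

        if (IsReserved(siteId))
            throw new ArgumentException(
                $"Site ID '{CentralSiteId}' is reserved for the central self-report.",
                nameof(siteId));
    }
}
```

Rejecting the name at creation time keeps the aggregator's keyspace unambiguous without changing how `CheckForOfflineSites` selects its timeout.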
### HealthMonitoring-022 — `CentralHealthReportLoopTests` uses real-time `PeriodicTimer` + `Task.Delay`; flake-prone on slow CI
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.HealthMonitoring.Tests/CentralHealthReportLoopTests.cs:32-42` |
**Description**
`RunLoopBriefly` starts the hosted service with a 50 ms `PeriodicTimer` and
then `await Task.Delay(runForMs, CancellationToken.None)` (with `runForMs`
between 150 ms and 300 ms). `GeneratesCentralReports_WhenSelfIsPrimary` and
`AssignsMonotonicSequenceNumbers` both assert "at least 2 reports were
generated" within the window. On a heavily contended CI runner, where the
hosted-service start-up plus a couple of `PeriodicTimer` ticks can blow past
300 ms, these tests will fail intermittently.
The rest of the suite (`CentralHealthAggregatorTests`, `SiteHealthCollectorTests`,
`HealthReportSenderTests` partially) was deliberately refactored to use the
injected `TimeProvider` precisely to avoid this. `CentralHealthReportLoop` and
`HealthReportSender` already accept a `TimeProvider`, but the loop's
`PeriodicTimer` is still real-time because it is not driven by the injected
`TimeProvider`.
**Recommendation**
Either (a) accept the timing-sensitivity and bump the delay budget
generously, or (b) refactor the hosted-service loop to use a
`TimeProvider.CreateTimer`-based tick mechanism so the test can advance a
fake clock and assert deterministically how many ticks fire. Option (b) is
the better long-term fix and matches the pattern used elsewhere in the
module's tests.
### HealthMonitoring-023 — `StoreAndForwardBufferDepths_IsEmptyPlaceholder` test name is stale; it now covers the default-state contract, not a placeholder
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `tests/ScadaLink.HealthMonitoring.Tests/SiteHealthCollectorTests.cs:117-122` |
**Description**
The test `StoreAndForwardBufferDepths_IsEmptyPlaceholder` was originally named
to codify the HealthMonitoring-001 bug ("`SetStoreAndForwardDepths` has no
callers, so `StoreAndForwardBufferDepths` is always empty"). HealthMonitoring-001
is `Resolved`: `HealthReportSender` now populates per-category depths from
the S&F engine, and the same test class has `SetStoreAndForwardDepths_ReflectedInReport`
covering the populated path. The "placeholder" test still passes because it
constructs a fresh collector and never calls the setter, so its assertion
(`Assert.Empty(report.StoreAndForwardBufferDepths)`) is now testing the
**default empty state of an un-configured collector**. The HealthMonitoring-001
resolution note explicitly chose to keep it as "the collector-level
default-state test", but the test method name and the implied semantics no
longer match.
A maintainer reading the test name today will misread it as documentation that
the metric is unimplemented (which it isn't), and may waste time investigating
a non-bug.
**Recommendation**
Rename to `StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled`
(or similar) and update any comments in the test body to match. This is purely
a documentation / maintainability fix; no behaviour change.