docs(code-reviews): re-review batch 2 at 39d737e — ConfigurationDatabase, DataConnectionLayer, DeploymentManager, ExternalSystemGateway, HealthMonitoring

17 new findings: ConfigurationDatabase-012..014, DataConnectionLayer-014..017, DeploymentManager-015..017, ExternalSystemGateway-015..017, HealthMonitoring-013..016.
Joseph Doherty
2026-05-17 00:45:10 -04:00
parent e49846603e
commit 89636e2bbf
6 changed files with 895 additions and 64 deletions

@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.HealthMonitoring` |
| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 4 |
## Summary
@@ -32,20 +32,38 @@ heartbeat path, and most collector setters. None of the findings are crash-class
but the concurrency issues are Medium/High and the missing S&F metric is a real
design-adherence gap.
#### Re-review 2026-05-17 (commit `39d737e`)
All twelve prior findings (HealthMonitoring-001..012) are confirmed `Resolved`:
`SiteHealthState` is now an immutable `sealed record` mutated only via atomic
compare-and-swap, the store-and-forward buffer-depth metric is populated, the
central-site offline grace and the unknown-site heartbeat registration are in
place, and the test suite has grown to cover the report loop, heartbeat path, and
collector setters. This re-review found **4 new findings, all Low/Medium, none
crash-class**. They are residual polish items rather than behaviour regressions:
an inaccurate offline-check-interval comment (HealthMonitoring-013), unvalidated
`HealthMonitoringOptions` intervals that crash the hosted service on
misconfiguration (HealthMonitoring-014), a heartbeat-only registered site left
with a year-0001 `LastReportReceivedAt` that the UI's staleness display must
special-case (HealthMonitoring-015), and `CollectReport` reading
`DateTimeOffset.UtcNow` directly instead of the module's now-standard injected
`TimeProvider` (HealthMonitoring-016). The module remains small, readable, and
broadly faithful to the design intent.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). |
| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). Re-review: heartbeat-registered site left with year-0001 `LastReportReceivedAt` (HealthMonitoring-015). |
| 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via `IHealthReportTransport`); `AddHealthMonitoringActors` is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. |
| 3 | Concurrency & thread safety | x | Unguarded mutable `SiteHealthState` (HealthMonitoring-002); mutation inside `AddOrUpdate` delegate (HealthMonitoring-003); `GetAllSiteStates` leaks live mutable references (HealthMonitoring-008). Collector counters correctly use `Interlocked`. |
| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010); top-level loop error handling is sound. |
| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010, resolved); top-level loop error handling is sound. Re-review: `HealthMonitoringOptions` intervals unvalidated — a zero/negative value crashes the hosted service at `PeriodicTimer` construction (HealthMonitoring-014). |
| 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. |
| 6 | Performance & resource management | x | `PeriodicTimer` instances correctly disposed via `using`. Dictionary snapshots per report are acceptable at the documented scale. No issues found. |
| 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). |
| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011). |
| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011, resolved). Re-review: `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of the module's now-standard injected `TimeProvider` (HealthMonitoring-016). |
| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012). |
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses `OfflineTimeout` (HealthMonitoring-013). |
## Findings
@@ -560,3 +578,148 @@ has not yet sent a report"). A codebase-wide search confirms no `null!` suppress
remains anywhere in `src/ScadaLink.HealthMonitoring`. This is exactly the change
HealthMonitoring-002 made when converting `SiteHealthState` to an immutable record, so
the contract is now honest and no further code change was required.
### HealthMonitoring-013 — Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout`
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:194-196` |
**Description**
`ExecuteAsync` derives the `PeriodicTimer` cadence with the comment "Check at half
the (shorter) offline timeout interval for timely detection", but the code only
reads `_options.OfflineTimeout`:
```csharp
var checkInterval = TimeSpan.FromMilliseconds(_options.OfflineTimeout.TotalMilliseconds / 2);
```
`CentralOfflineTimeout` (HealthMonitoring-005's fix) is never considered. With the
default options (`OfflineTimeout` 60s, `CentralOfflineTimeout` 3m) `OfflineTimeout`
genuinely is the shorter of the two, so the parenthetical happens to hold. But the
comment states an invariant the code does not enforce: if an operator configures
`CentralOfflineTimeout` *smaller* than `OfflineTimeout`, the check cadence stays
tied to `OfflineTimeout`, and central offline detection is delayed by up to a full
`OfflineTimeout / 2` beyond the intended `CentralOfflineTimeout` window. The comment
misleads a reader into believing the cadence already adapts to whichever timeout is
shorter.
**Recommendation**
Either compute `checkInterval` from the smaller of `OfflineTimeout` and
`CentralOfflineTimeout` (compare the two `TimeSpan` values directly; `Math.Min` has
no `TimeSpan` overload) so the code matches the comment, or drop the "(shorter)"
wording and state plainly that the cadence is derived from `OfflineTimeout` only
(acceptable while the default `CentralOfflineTimeout` is the larger value).
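A minimal sketch of the first option, assuming the options class exposes the two
timeouts as plain `TimeSpan` properties. The `HealthMonitoringOptions` stand-in and
the `OfflineCheck.ComputeCheckInterval` helper below are hypothetical names used for
illustration only; the real `ExecuteAsync` computes the interval inline.

```csharp
using System;

// Hypothetical stand-in for the module's options class. Only the two timeouts
// named in this finding are modelled, using the defaults quoted above.
public sealed class HealthMonitoringOptions
{
    public TimeSpan OfflineTimeout { get; set; } = TimeSpan.FromSeconds(60);
    public TimeSpan CentralOfflineTimeout { get; set; } = TimeSpan.FromMinutes(3);
}

public static class OfflineCheck
{
    // Returns half of whichever timeout is shorter, so the check cadence
    // follows the tighter of the two detection windows.
    public static TimeSpan ComputeCheckInterval(HealthMonitoringOptions options)
    {
        // TimeSpan supports comparison operators; Math.Min has no TimeSpan overload.
        var shorter = options.OfflineTimeout <= options.CentralOfflineTimeout
            ? options.OfflineTimeout
            : options.CentralOfflineTimeout;

        return TimeSpan.FromMilliseconds(shorter.TotalMilliseconds / 2);
    }
}
```

With the default options this yields the same 30-second cadence as the current
code; the result only differs when `CentralOfflineTimeout` is configured below
`OfflineTimeout`.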
**Resolution**
_Unresolved._
### HealthMonitoring-014 — `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/HealthMonitoringOptions.cs:3-20`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:196`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:67`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:63` |
**Description**
`HealthMonitoringOptions` is bound from the `ScadaLink:HealthMonitoring` config
section (`SiteServiceRegistration.BindSharedOptions`) with no validation —
no `IValidateOptions<HealthMonitoringOptions>`, no `ValidateDataAnnotations`, no
`ValidateOnStart`. `ReportInterval`, `OfflineTimeout`, and `CentralOfflineTimeout`
are all fed straight into `new PeriodicTimer(...)` (and `OfflineTimeout` into a
division for the check interval). `PeriodicTimer`'s constructor throws
`ArgumentOutOfRangeException` for a zero or negative period. A misconfigured
`appsettings.json` (e.g. `"ReportInterval": "00:00:00"`, an empty/garbled value
that binds to `TimeSpan.Zero`, or a negative span) therefore crashes the
`HealthReportSender` / `CentralHealthReportLoop` / `CentralHealthAggregator`
hosted service at startup with an opaque exception that does not name the
offending config key, rather than failing fast with a clear validation message.
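The constructor behaviour described above can be checked in isolation. The probe
below is a hypothetical helper, not code from the module; it only demonstrates that
a zero period is rejected at `PeriodicTimer` construction, before any tick is
awaited.

```csharp
using System;
using System.Threading;

public static class PeriodicTimerProbe
{
    // Returns true when constructing a PeriodicTimer with the given period
    // throws ArgumentOutOfRangeException, false when construction succeeds.
    public static bool RejectsPeriod(TimeSpan period)
    {
        try
        {
            using var timer = new PeriodicTimer(period);
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }
}
```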
**Recommendation**
Add an options validator (either `[Range]`-style DataAnnotations on the `TimeSpan`
properties or an `IValidateOptions<HealthMonitoringOptions>`) that rejects
non-positive `ReportInterval`/`OfflineTimeout`/`CentralOfflineTimeout` and, ideally,
requires `CentralOfflineTimeout >= OfflineTimeout`. Call `.ValidateOnStart()` so a
bad configuration fails fast with a message naming the section and key.
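One possible shape for the `IValidateOptions` variant, as a sketch only. It depends
on the `Microsoft.Extensions.Options` package. The `HealthMonitoringOptions` class
here is a hypothetical stand-in carrying just the three intervals named in this
finding; the `ReportInterval` default is an assumption (the review does not state
it), and the validator class name is invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Options;

// Hypothetical stand-in for the real options class.
public sealed class HealthMonitoringOptions
{
    public TimeSpan ReportInterval { get; set; } = TimeSpan.FromSeconds(30); // assumed default
    public TimeSpan OfflineTimeout { get; set; } = TimeSpan.FromSeconds(60);
    public TimeSpan CentralOfflineTimeout { get; set; } = TimeSpan.FromMinutes(3);
}

public sealed class HealthMonitoringOptionsValidator : IValidateOptions<HealthMonitoringOptions>
{
    private const string Section = "ScadaLink:HealthMonitoring";

    public ValidateOptionsResult Validate(string? name, HealthMonitoringOptions options)
    {
        var failures = new List<string>();

        // Each interval must be strictly positive before it reaches a PeriodicTimer.
        RequirePositive(failures, nameof(options.ReportInterval), options.ReportInterval);
        RequirePositive(failures, nameof(options.OfflineTimeout), options.OfflineTimeout);
        RequirePositive(failures, nameof(options.CentralOfflineTimeout), options.CentralOfflineTimeout);

        // Optional ordering rule suggested in the recommendation above.
        if (options.CentralOfflineTimeout < options.OfflineTimeout)
        {
            failures.Add(
                $"{Section}:{nameof(options.CentralOfflineTimeout)} must be greater than or equal to " +
                $"{Section}:{nameof(options.OfflineTimeout)}.");
        }

        return failures.Count == 0
            ? ValidateOptionsResult.Success
            : ValidateOptionsResult.Fail(failures);
    }

    private static void RequirePositive(List<string> failures, string key, TimeSpan value)
    {
        if (value <= TimeSpan.Zero)
        {
            failures.Add($"{Section}:{key} must be a positive duration, but was {value}.");
        }
    }
}
```

The validator would be registered as an `IValidateOptions<HealthMonitoringOptions>`
service alongside the existing binding. How that fits into
`SiteServiceRegistration.BindSharedOptions` is not shown here because this review
does not reproduce that method.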
**Resolution**
_Unresolved._
### HealthMonitoring-015 — Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt`
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:122-130`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:27` |
**Description**
When `MarkHeartbeat` registers a previously-unknown site (HealthMonitoring-007's
fix), it sets `LastReportReceivedAt = default` — i.e. `DateTimeOffset.MinValue`
(`0001-01-01`). The XML doc on `SiteHealthState.LastReportReceivedAt` states the
field is "Used by the UI to surface report staleness during failover." A
heartbeat-only site therefore has `LatestReport == null` **and**
`LastReportReceivedAt == DateTimeOffset.MinValue`. Any UI code that computes
"last report N ago" as `now - LastReportReceivedAt` without first checking
`LatestReport != null` will render a nonsensical staleness of roughly two
thousand years for a site that is, in fact, freshly reachable. The two
"no report yet" signals (`LatestReport == null`, `LastReportReceivedAt == default`)
are independent and both must be special-cased; the sentinel value is an easy trap.
**Recommendation**
Make `LastReportReceivedAt` nullable (`DateTimeOffset?`) so "no report received
yet" is an explicit, unmissable state rather than a magic sentinel — consistent
with how `LatestReport` was already made nullable for the same case — and have UI
consumers render staleness only when it has a value. Alternatively, document the
`default` sentinel prominently on the field and audit every UI reader, but the
nullable option is safer and matches the existing `LatestReport` treatment.
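A sketch of the nullable option and a consumer that respects it. The record below
is a hypothetical, trimmed-down stand-in (the real `SiteHealthState` has more
members), and `StalenessFormatter` is an invented name standing in for whatever UI
code renders staleness.

```csharp
using System;

// Hypothetical, reduced version of the record: only the members relevant
// to this finding are shown.
public sealed record SiteHealthState(string SiteId, DateTimeOffset? LastReportReceivedAt);

public static class StalenessFormatter
{
    // Renders staleness only when a report has actually been received.
    // A heartbeat-only site carries null, so no sentinel arithmetic can occur.
    public static string Describe(SiteHealthState state, DateTimeOffset now) =>
        state.LastReportReceivedAt is { } receivedAt
            ? $"last report {(int)(now - receivedAt).TotalSeconds}s ago"
            : "no report received yet";
}
```

Because the property type is `DateTimeOffset?`, a reader that subtracts it from
`now` without handling the null case gets a nullable `TimeSpan` rather than a
silently wrong two-thousand-year value, which makes the omission visible at the
call site.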
**Resolution**
_Unresolved._
### HealthMonitoring-016 — `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider`
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:151` |
**Description**
`CollectReport` stamps each report with `ReportTimestamp: DateTimeOffset.UtcNow`,
read directly from the system clock. Every other time-dependent class in the
module — `CentralHealthAggregator`, `HealthReportSender`, `CentralHealthReportLoop`
— was deliberately refactored (HealthMonitoring-006) to take an injectable
`TimeProvider` so the behaviour is deterministically testable and the clock
dependency is explicit. `SiteHealthCollector` is the lone holdout: the report
timestamp cannot be controlled in a unit test, which is why
`SiteHealthCollectorTests.CollectReport_IncludesUtcTimestamp` can only assert the
timestamp falls in a `before`/`after` wall-clock window rather than equalling a
known instant. This is a minor consistency/testability gap, not a behaviour bug.
**Recommendation**
Add an optional `TimeProvider` constructor parameter to `SiteHealthCollector`
(defaulting to `TimeProvider.System`, mirroring the other classes) and derive
`ReportTimestamp` from `GetUtcNow()`, so the report timestamp is deterministically
testable and the module is consistent.
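A sketch of the recommended change, assuming .NET 8's `System.TimeProvider`. The
`SiteHealthReport` record and the `CollectReport(string siteId)` signature are
hypothetical simplifications of the real types, and `FixedTimeProvider` is an
illustrative test double rather than anything in the module.

```csharp
using System;

// Hypothetical, reduced report type: the real report carries many more metrics.
public sealed record SiteHealthReport(string SiteId, DateTimeOffset ReportTimestamp);

public sealed class SiteHealthCollector
{
    private readonly TimeProvider _timeProvider;

    // Optional parameter keeps existing call sites compiling; production code
    // falls back to the system clock.
    public SiteHealthCollector(TimeProvider? timeProvider = null)
        => _timeProvider = timeProvider ?? TimeProvider.System;

    public SiteHealthReport CollectReport(string siteId)
        => new(siteId, ReportTimestamp: _timeProvider.GetUtcNow());
}

// Minimal fixed clock for unit tests: always returns the instant it was given.
public sealed class FixedTimeProvider : TimeProvider
{
    private readonly DateTimeOffset _now;

    public FixedTimeProvider(DateTimeOffset now) => _now = now;

    public override DateTimeOffset GetUtcNow() => _now;
}
```

With a fixed clock injected, a test such as `CollectReport_IncludesUtcTimestamp`
could assert equality against a known instant instead of a `before`/`after`
window. The `FakeTimeProvider` from the `Microsoft.Extensions.TimeProvider.Testing`
package is an alternative to a hand-written double.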
**Resolution**
_Unresolved._