docs(code-reviews): re-review batch 2 at 39d737e — ConfigurationDatabase, DataConnectionLayer, DeploymentManager, ExternalSystemGateway, HealthMonitoring
17 new findings: ConfigurationDatabase-012..014, DataConnectionLayer-014..017, DeploymentManager-015..017, ExternalSystemGateway-015..017, HealthMonitoring-013..016.
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.HealthMonitoring` |
| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 |
| Commit reviewed | `39d737e` |
| Open findings | 4 |

## Summary

@@ -32,20 +32,38 @@ heartbeat path, and most collector setters. None of the findings are crash-class
but the concurrency issues are Medium/High and the missing S&F metric is a real
design-adherence gap.

#### Re-review 2026-05-17 (commit `39d737e`)

All twelve prior findings (HealthMonitoring-001..012) are confirmed `Resolved` —
`SiteHealthState` is now an immutable `sealed record` mutated only via atomic
compare-and-swap, the store-and-forward buffer-depth metric is populated, the
central-site offline grace and the unknown-site heartbeat registration are in
place, and the test suite has grown to cover the report loop, heartbeat path, and
collector setters. This re-review found **4 new findings, all Low/Medium, none
crash-class**. They are residual polish items rather than behaviour regressions:
an inaccurate offline-check-interval comment (HealthMonitoring-013), unvalidated
`HealthMonitoringOptions` intervals that crash the hosted service on
misconfiguration (HealthMonitoring-014), a heartbeat-only registered site left
with a year-0001 `LastReportReceivedAt` that the UI's staleness display must
special-case (HealthMonitoring-015), and `CollectReport` reading
`DateTimeOffset.UtcNow` directly instead of the module's now-standard injected
`TimeProvider` (HealthMonitoring-016). The module remains small, readable, and
broadly faithful to the design intent.

## Checklist coverage

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). |
| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). Re-review: heartbeat-registered site left with year-0001 `LastReportReceivedAt` (HealthMonitoring-015). |
| 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via `IHealthReportTransport`); `AddHealthMonitoringActors` is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. |
| 3 | Concurrency & thread safety | x | Unguarded mutable `SiteHealthState` (HealthMonitoring-002); mutation inside `AddOrUpdate` delegate (HealthMonitoring-003); `GetAllSiteStates` leaks live mutable references (HealthMonitoring-008). Collector counters correctly use `Interlocked`. |
| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010); top-level loop error handling is sound. |
| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010, resolved); top-level loop error handling is sound. Re-review: `HealthMonitoringOptions` intervals unvalidated — a zero/negative value crashes the hosted service at `PeriodicTimer` construction (HealthMonitoring-014). |
| 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. |
| 6 | Performance & resource management | x | `PeriodicTimer` instances correctly disposed via `using`. Dictionary snapshots per report are acceptable at the documented scale. No issues found. |
| 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). |
| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011). |
| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011, resolved). Re-review: `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of the module's now-standard injected `TimeProvider` (HealthMonitoring-016). |
| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012). |
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses `OfflineTimeout` (HealthMonitoring-013). |

## Findings

@@ -560,3 +578,148 @@ has not yet sent a report"). A codebase-wide search confirms no `null!` suppress
remains anywhere in `src/ScadaLink.HealthMonitoring`. This is exactly the change
HealthMonitoring-002 made when converting `SiteHealthState` to an immutable record, so
the contract is now honest and no further code change was required.

### HealthMonitoring-013 — Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout`

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:194-196` |

**Description**

`ExecuteAsync` derives the `PeriodicTimer` cadence with the comment "Check at half
the (shorter) offline timeout interval for timely detection", but the code only
reads `_options.OfflineTimeout`:

```csharp
var checkInterval = TimeSpan.FromMilliseconds(_options.OfflineTimeout.TotalMilliseconds / 2);
```

`CentralOfflineTimeout` (HealthMonitoring-005's fix) is never considered. With the
default options (`OfflineTimeout` 60s, `CentralOfflineTimeout` 3m) `OfflineTimeout`
genuinely is the shorter of the two, so the parenthetical happens to hold. But the
comment states an invariant the code does not enforce: if an operator configures
`CentralOfflineTimeout` *smaller* than `OfflineTimeout`, the check cadence stays
tied to `OfflineTimeout`, and central offline detection is delayed by up to a full
`OfflineTimeout / 2` beyond the intended `CentralOfflineTimeout` window. The comment
misleads a reader into believing the cadence already adapts to whichever timeout is
shorter.

**Recommendation**

Either compute `checkInterval` from `Math.Min(OfflineTimeout, CentralOfflineTimeout)`
so the code matches the comment, or drop the "(shorter)" wording and state plainly
that the cadence is derived from `OfflineTimeout` only (acceptable while the default
`CentralOfflineTimeout` is the larger value).
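
A minimal sketch of the first option, assuming the cadence is factored into a
helper. `OfflineCheckCadence` and `FromTimeouts` are hypothetical names, not part
of the module. `Math.Min` has no `TimeSpan` overload, so the sketch compares the
two spans directly:

```csharp
using System;

// Hypothetical helper; per the finding, the real aggregator computes the
// interval inline in ExecuteAsync from OfflineTimeout alone.
public static class OfflineCheckCadence
{
    // Returns half of whichever timeout is shorter, so neither offline window
    // is overshot by more than one check interval.
    public static TimeSpan FromTimeouts(TimeSpan offlineTimeout, TimeSpan centralOfflineTimeout)
    {
        // Math.Min has no TimeSpan overload; TimeSpan supports comparison operators.
        var shorter = offlineTimeout <= centralOfflineTimeout
            ? offlineTimeout
            : centralOfflineTimeout;

        return TimeSpan.FromMilliseconds(shorter.TotalMilliseconds / 2);
    }
}
```

With the default options cited above (60s and 3m) this yields the same 30s cadence
as the current code. The result only changes when `CentralOfflineTimeout` is
configured below `OfflineTimeout`.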

**Resolution**

_Unresolved._

### HealthMonitoring-014 — `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service

| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/HealthMonitoringOptions.cs:3-20`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:196`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:67`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:63` |

**Description**

`HealthMonitoringOptions` is bound from the `ScadaLink:HealthMonitoring` config
section (`SiteServiceRegistration.BindSharedOptions`) with no validation —
no `IValidateOptions<HealthMonitoringOptions>`, no `ValidateDataAnnotations`, no
`ValidateOnStart`. `ReportInterval`, `OfflineTimeout`, and `CentralOfflineTimeout`
are all fed straight into `new PeriodicTimer(...)` (and `OfflineTimeout` into a
division for the check interval). `PeriodicTimer`'s constructor throws
`ArgumentOutOfRangeException` for a zero or negative period. A misconfigured
`appsettings.json` (e.g. `"ReportInterval": "00:00:00"`, an empty/garbled value
that binds to `TimeSpan.Zero`, or a negative span) therefore crashes the
`HealthReportSender` / `CentralHealthReportLoop` / `CentralHealthAggregator`
hosted service at startup with an opaque exception that does not name the
offending config key, rather than failing fast with a clear validation message.

**Recommendation**

Add an options validator (DataAnnotations `[Range]`-style on the spans, or an
`IValidateOptions<HealthMonitoringOptions>`) that rejects non-positive
`ReportInterval`/`OfflineTimeout`/`CentralOfflineTimeout` and ideally requires
`CentralOfflineTimeout >= OfflineTimeout`, and call `.ValidateOnStart()` so a bad
configuration fails fast with a message naming the section and key.
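
The rules themselves could look roughly like the sketch below. It is written as a
plain static method so it stands alone; in the module it would presumably be
wrapped in an `IValidateOptions<HealthMonitoringOptions>` implementation and
registered with `.ValidateOnStart()`. `HealthMonitoringOptionsSketch` and
`HealthMonitoringOptionsRules` are hypothetical stand-ins that model only the
three intervals named in this finding:

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the real options class; only the three intervals are modelled.
public sealed record HealthMonitoringOptionsSketch(
    TimeSpan ReportInterval,
    TimeSpan OfflineTimeout,
    TimeSpan CentralOfflineTimeout);

public static class HealthMonitoringOptionsRules
{
    // Section name as quoted in the finding text.
    private const string Section = "ScadaLink:HealthMonitoring";

    // Returns one message per violated rule; an empty list means the options are valid.
    public static IReadOnlyList<string> Validate(HealthMonitoringOptionsSketch options)
    {
        var errors = new List<string>();

        RequirePositive(errors, nameof(options.ReportInterval), options.ReportInterval);
        RequirePositive(errors, nameof(options.OfflineTimeout), options.OfflineTimeout);
        RequirePositive(errors, nameof(options.CentralOfflineTimeout), options.CentralOfflineTimeout);

        // Optional ordering rule suggested in the recommendation.
        if (options.CentralOfflineTimeout < options.OfflineTimeout)
        {
            errors.Add($"{Section}:CentralOfflineTimeout ({options.CentralOfflineTimeout}) " +
                       $"must be >= {Section}:OfflineTimeout ({options.OfflineTimeout}).");
        }

        return errors;
    }

    // PeriodicTimer rejects zero and negative periods, so the same bound applies here.
    private static void RequirePositive(List<string> errors, string key, TimeSpan value)
    {
        if (value <= TimeSpan.Zero)
        {
            errors.Add($"{Section}:{key} must be a positive duration, but was {value}.");
        }
    }
}
```

Each message carries the full configuration key, which is the piece missing from
the `ArgumentOutOfRangeException` raised at `PeriodicTimer` construction.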

**Resolution**

_Unresolved._

### HealthMonitoring-015 — Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt`

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:122-130`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:27` |

**Description**

When `MarkHeartbeat` registers a previously-unknown site (HealthMonitoring-007's
fix), it sets `LastReportReceivedAt = default` — i.e. `DateTimeOffset.MinValue`
(`0001-01-01`). The XML doc on `SiteHealthState.LastReportReceivedAt` states the
field is "Used by the UI to surface report staleness during failover." A
heartbeat-only site therefore has `LatestReport == null` **and**
`LastReportReceivedAt == DateTimeOffset.MinValue`. Any UI code that computes
"last report N ago" as `now - LastReportReceivedAt` without first checking
`LatestReport != null` will render a nonsensical staleness of roughly two
thousand years for a site that is, in fact, freshly reachable. The two
"no report yet" signals (`LatestReport == null`, `LastReportReceivedAt == default`)
are independent and both must be special-cased; the sentinel value is an easy trap.

**Recommendation**

Make `LastReportReceivedAt` nullable (`DateTimeOffset?`) so "no report received
yet" is an explicit, unmissable state rather than a magic sentinel — consistent
with how `LatestReport` was already made nullable for the same case — and have UI
consumers render staleness only when it has a value. Alternatively, document the
`default` sentinel prominently on the field and audit every UI reader, but the
nullable option is safer and matches the existing `LatestReport` treatment.
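
A sketch of the nullable option from the consumer's side. `SiteHealthStateSketch`
and `StalenessDisplay` are hypothetical stand-ins, and the display strings are
illustrative only. The real record has more members and the real UI code is not
shown in this review:

```csharp
using System;

// Stand-in for SiteHealthState; only the field discussed here is modelled.
public sealed record SiteHealthStateSketch(string SiteId, DateTimeOffset? LastReportReceivedAt);

public static class StalenessDisplay
{
    // Renders report staleness, treating "no report yet" as its own state
    // instead of subtracting from a year-0001 sentinel.
    public static string Describe(SiteHealthStateSketch state, DateTimeOffset now)
    {
        if (state.LastReportReceivedAt is not DateTimeOffset received)
        {
            return "no report yet";
        }

        var age = now - received;
        return $"last report {(int)age.TotalSeconds}s ago";
    }
}
```

With a non-nullable field the equivalent subtraction, `now - default(DateTimeOffset)`,
compiles without complaint and produces a span of more than 700,000 days. With
`DateTimeOffset?` the compiler forces the reader to handle the missing value
before it can subtract.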

**Resolution**

_Unresolved._

### HealthMonitoring-016 — `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider`

| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:151` |

**Description**

`CollectReport` stamps each report with `ReportTimestamp: DateTimeOffset.UtcNow`,
read directly from the system clock. Every other time-dependent class in the
module — `CentralHealthAggregator`, `HealthReportSender`, `CentralHealthReportLoop`
— was deliberately refactored (HealthMonitoring-006) to take an injectable
`TimeProvider` so the behaviour is deterministically testable and the clock
dependency is explicit. `SiteHealthCollector` is the lone holdout: the report
timestamp cannot be controlled in a unit test, which is why
`SiteHealthCollectorTests.CollectReport_IncludesUtcTimestamp` can only assert the
timestamp falls in a `before`/`after` wall-clock window rather than equalling a
known instant. This is a minor consistency/testability gap, not a behaviour bug.

**Recommendation**

Add an optional `TimeProvider` constructor parameter to `SiteHealthCollector`
(defaulting to `TimeProvider.System`, mirroring the other classes) and derive
`ReportTimestamp` from `GetUtcNow()`, so the report timestamp is deterministically
testable and the module is consistent.
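
A sketch of the suggested shape, assuming .NET 8's `System.TimeProvider`.
`SiteHealthCollectorSketch`, `SiteHealthReportSketch`, and `FixedTimeProvider` are
hypothetical stand-ins. The real collector and report carry many more members,
which are omitted here:

```csharp
#nullable enable
using System;

// Stand-in for the real report message; only the timestamp is modelled.
public sealed record SiteHealthReportSketch(string SiteId, DateTimeOffset ReportTimestamp);

// Stand-in for SiteHealthCollector showing only the clock dependency.
public sealed class SiteHealthCollectorSketch
{
    private readonly TimeProvider _timeProvider;

    // Optional parameter: production callers get the system clock by default.
    public SiteHealthCollectorSketch(TimeProvider? timeProvider = null)
        => _timeProvider = timeProvider ?? TimeProvider.System;

    // The timestamp comes from the injected clock, not DateTimeOffset.UtcNow.
    public SiteHealthReportSketch CollectReport(string siteId)
        => new(siteId, _timeProvider.GetUtcNow());
}

// Minimal fixed clock for tests; not part of the module.
public sealed class FixedTimeProvider : TimeProvider
{
    private readonly DateTimeOffset _now;

    public FixedTimeProvider(DateTimeOffset now) => _now = now;

    public override DateTimeOffset GetUtcNow() => _now;
}
```

A test constructed this way can assert that `ReportTimestamp` equals a known
instant rather than checking that it falls inside a `before`/`after` window.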

**Resolution**

_Unresolved._