A heartbeat-registered site that has never sent a full report now has
LastReportReceivedAt = null instead of the year-0001 sentinel. TimestampDisplay
accepts DateTimeOffset? and renders null as a placeholder ('awaiting first
report') rather than a ~2000-year-stale date. Cross-module: HealthMonitoring +
CentralUI.
779 lines
44 KiB
Markdown
779 lines
44 KiB
Markdown
# Code Review — HealthMonitoring
|
|
|
|
| Field | Value |
|
|
|-------|-------|
|
|
| Module | `src/ScadaLink.HealthMonitoring` |
|
|
| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
|
|
| Status | Reviewed |
|
|
| Last reviewed | 2026-05-17 |
|
|
| Reviewer | claude-agent |
|
|
| Commit reviewed | `39d737e` |
|
|
| Open findings | 0 |
|
|
|
|
## Summary
|
|
|
|
The HealthMonitoring module is small, readable, and broadly faithful to the design
|
|
intent: per-interval error counters with atomic read-and-reset, monotonic sequence
|
|
numbers with Unix-ms seeding to survive failover, sequence-guarded staleness
|
|
rejection, and a 60s offline timeout. However, the review surfaced two recurring
|
|
themes. First, **a documented metric is silently unimplemented** — store-and-forward
|
|
buffer depths are never populated (`SetStoreAndForwardDepths` has zero callers and a
|
|
test asserts the field is always empty), so the dashboard cannot show the buffer
|
|
depth metric the design doc requires. Second, **the central aggregator's in-memory
|
|
state model has unguarded shared mutable state**: `SiteHealthState` is a mutable
|
|
class whose fields are written by a background timer thread, by `ProcessReport`, and
|
|
by `MarkHeartbeat` with no synchronization, and the same live mutable objects are
|
|
handed straight to UI callers via `GetAllSiteStates`. The `ProcessReport` logic also
|
|
mutates shared state inside a `ConcurrentDictionary.AddOrUpdate` update delegate,
|
|
which the runtime may invoke more than once under contention. Additionally there are
|
|
gaps around central self-report offline detection, heartbeats for not-yet-registered
|
|
sites being dropped, and missing test coverage for the central report loop,
|
|
heartbeat path, and most collector setters. None of the findings are crash-class,
|
|
but the concurrency issues are Medium/High and the missing S&F metric is a real
|
|
design-adherence gap.
|
|
|
|
#### Re-review 2026-05-17 (commit `39d737e`)
|
|
|
|
All twelve prior findings (HealthMonitoring-001..012) are confirmed `Resolved` —
|
|
`SiteHealthState` is now an immutable `sealed record` mutated only via atomic
|
|
compare-and-swap, the store-and-forward buffer-depth metric is populated, the
|
|
central-site offline grace and the unknown-site heartbeat registration are in
|
|
place, and the test suite has grown to cover the report loop, heartbeat path, and
|
|
collector setters. This re-review found **4 new findings, all Low/Medium, none
|
|
crash-class**. They are residual polish items rather than behaviour regressions:
|
|
an inaccurate offline-check-interval comment (HealthMonitoring-013), unvalidated
|
|
`HealthMonitoringOptions` intervals that crash the hosted service on
|
|
misconfiguration (HealthMonitoring-014), a heartbeat-only registered site left
|
|
with a year-0001 `LastReportReceivedAt` that the UI's staleness display would
|
|
otherwise render verbatim (HealthMonitoring-015 — resolved via a coordinated
|
|
HealthMonitoring + CentralUI change), and `CollectReport` reading
|
|
`DateTimeOffset.UtcNow` directly instead of the module's now-standard injected
|
|
`TimeProvider` (HealthMonitoring-016). The module remains small, readable, and
|
|
broadly faithful to the design intent.
|
|
|
|
## Checklist coverage
|
|
|
|
| # | Category | Examined | Notes |
|
|
|---|----------|----------|-------|
|
|
| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). Re-review: heartbeat-registered site left with year-0001 `LastReportReceivedAt` (HealthMonitoring-015). |
|
|
| 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via `IHealthReportTransport`); `AddHealthMonitoringActors` is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. |
|
|
| 3 | Concurrency & thread safety | x | Unguarded mutable `SiteHealthState` (HealthMonitoring-002); mutation inside `AddOrUpdate` delegate (HealthMonitoring-003); `GetAllSiteStates` leaks live mutable references (HealthMonitoring-008). Collector counters correctly use `Interlocked`. |
|
|
| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010, resolved); top-level loop error handling is sound. Re-review: `HealthMonitoringOptions` intervals unvalidated — a zero/negative value crashes the hosted service at `PeriodicTimer` construction (HealthMonitoring-014). |
|
|
| 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. |
|
|
| 6 | Performance & resource management | x | `PeriodicTimer` instances correctly disposed via `using`. Dictionary snapshots per report are acceptable at the documented scale. No issues found. |
|
|
| 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). |
|
|
| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011, resolved). Re-review: `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of the module's now-standard injected `TimeProvider` (HealthMonitoring-016). |
|
|
| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
|
|
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses `OfflineTimeout` (HealthMonitoring-013). |
|
|
|
|
## Findings
|
|
|
|
### HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | High |
|
|
| Category | Design-document adherence |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:104`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:79` |
|
|
|
|
**Description**
|
|
|
|
`Component-HealthMonitoring.md` lists "Store-and-forward buffer depth" (pending
|
|
messages by category) as a required monitored metric. `SiteHealthCollector` exposes
|
|
`SetStoreAndForwardDepths(...)` to receive it, but a codebase-wide search shows the
|
|
method has **no callers** — `_sfBufferDepths` always remains the empty dictionary it
|
|
is initialized to. `HealthReportSender` queries `GetParkedMessageCountAsync()` and
|
|
sets `ParkedMessageCount`, but parked count is a distinct metric from per-category
|
|
buffer depth. The test `SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder`
|
|
even codifies the unimplemented state as expected behaviour. The result is that the
|
|
central dashboard cannot display buffer depth, a documented triage metric.
|
|
|
|
**Recommendation**
|
|
|
|
Wire `SetStoreAndForwardDepths` into `HealthReportSender.ExecuteAsync` (alongside the
|
|
existing parked-count call) using the S&F engine's per-category depth API, or, if the
|
|
metric is intentionally deferred, record that decision in the design doc and remove
|
|
the dead setter. Update the placeholder test accordingly once implemented.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `<pending>`). `HealthReportSender.ExecuteAsync` now
|
|
queries the existing public `StoreAndForwardStorage.GetBufferDepthByCategoryAsync()`
|
|
API alongside the parked-count call and feeds the per-category depths into
|
|
`SiteHealthCollector.SetStoreAndForwardDepths` (category enum names as keys), so the
|
|
documented store-and-forward buffer depth metric is populated in every emitted
|
|
report. Regression test `HealthReportSenderTests.ReportsIncludeStoreAndForwardBufferDepthsFromStorage`
|
|
verifies populated per-category depths. The obsolete placeholder test
|
|
`SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder` continues
|
|
to pass — it only exercises the collector with no setter call and still correctly
|
|
asserts the empty default; it was left in place as the collector-level default-state
|
|
test. No StoreAndForward source was modified (existing public API only).
|
|
|
|
### HealthMonitoring-002 — `SiteHealthState` mutable fields written from multiple threads without synchronization
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | High |
|
|
| Category | Concurrency & thread safety |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:137` |
|
|
|
|
**Description**
|
|
|
|
`SiteHealthState` is a plain mutable class. Its fields (`LatestReport`,
|
|
`LastReportReceivedAt`, `LastHeartbeatAt`, `LastSequenceNumber`, `IsOnline`) are
|
|
mutated from at least three concurrent contexts: `ProcessReport` (caller thread —
|
|
ClusterClient/PubSub message handlers), `MarkHeartbeat` (caller thread — heartbeat
|
|
handler), and `CheckForOfflineSites` (the `BackgroundService` timer thread). The
|
|
`ConcurrentDictionary` only protects the dictionary structure, not the objects it
|
|
stores. A heartbeat update and the offline-check can interleave on the same
|
|
`SiteHealthState` instance, and reads/writes of `DateTimeOffset` (a 16-byte struct)
|
|
and `long` fields are not guaranteed atomic on all platforms — producing torn reads
|
|
and lost updates of `IsOnline`/`LastHeartbeatAt`.
|
|
|
|
**Recommendation**
|
|
|
|
Make state transitions atomic: either guard all reads/writes of a `SiteHealthState`
|
|
with a per-site lock, or replace `SiteHealthState` with an immutable record updated
|
|
via `ConcurrentDictionary` compare-and-swap (`TryUpdate`) so every transition is
|
|
a single atomic reference swap.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `<pending>`). `SiteHealthState` is now a `sealed record`
|
|
with `init`-only properties. `CentralHealthAggregator.ProcessReport`,
|
|
`MarkHeartbeat`, and `CheckForOfflineSites` were rewritten to perform every state
|
|
transition as an atomic compare-and-swap (`TryAdd`/`TryUpdate`) producing a new
|
|
record instance — no field of a stored state is ever mutated in place. `ProcessReport`
|
|
uses an explicit CAS retry loop instead of the `AddOrUpdate` update delegate so the
|
|
sequence-number guard and the field writes are evaluated against the value actually
|
|
installed (this also closes the root cause behind HealthMonitoring-003). Reads via
|
|
`GetAllSiteStates`/`GetSiteState` now hand out immutable snapshots, so a concurrent
|
|
reader can never observe a torn or half-applied state. `LatestReport` was changed
|
|
from `SiteHealthReport` (`null!`) to `SiteHealthReport?`, making the contract honest;
|
|
all existing consumers (CentralUI, integration/perf tests) already null-checked it
|
|
and continue to build clean. Regression test
|
|
`CentralHealthAggregatorTests.ProcessReport_ConcurrentUpdates_NeverLoseSequenceOrTearState`
|
|
exercises concurrent report/heartbeat/read threads and asserts snapshot consistency
|
|
and no lost updates.
|
|
|
|
### HealthMonitoring-003 — Shared state mutated inside `ConcurrentDictionary.AddOrUpdate` update delegate
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium — re-triaged: already resolved as a side-effect of HealthMonitoring-002. |
|
|
| Category | Concurrency & thread safety |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:45-103` |
|
|
|
|
**Description**
|
|
|
|
The update delegate passed to `AddOrUpdate` mutates the `existing` object in place
|
|
(`existing.LatestReport = report; existing.IsOnline = true; ...`). `AddOrUpdate`'s
|
|
contract explicitly allows the update delegate to be invoked **more than once** under
|
|
contention (when the CAS that installs the result loses a race and is retried). Each
|
|
invocation mutates the shared object, so a concurrent report for the same site can
|
|
observe a half-applied update, and the multi-field assignment is not atomic with
|
|
respect to readers in `GetAllSiteStates`/`CheckForOfflineSites`. The intended
|
|
"only replace if sequence is higher" guard can also be subverted because the
|
|
sequence comparison and the field writes are not a single atomic step.
|
|
|
|
**Recommendation**
|
|
|
|
Have the update delegate return a **new** `SiteHealthState` (record `with` copy)
|
|
rather than mutating `existing`, and treat the dictionary value as immutable.
|
|
Combined with HealthMonitoring-002, this makes every state transition an atomic
|
|
reference swap with no observable intermediate state.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`). Re-triaged: verified against the current
|
|
source — the root cause was already eliminated by the HealthMonitoring-002 fix.
|
|
`ProcessReport` no longer uses `AddOrUpdate` at all; it is now an explicit
|
|
compare-and-swap retry loop (`TryGetValue` → guard → `TryAdd`/`TryUpdate`) that
|
|
produces a brand-new immutable `SiteHealthState` record per transition and never
|
|
mutates a stored value in place. The sequence-number guard and the field writes are
|
|
evaluated against the value actually installed by the CAS, so the "only replace if
|
|
sequence is higher" invariant holds. The concurrency stress test
|
|
`CentralHealthAggregatorTests.ProcessReport_ConcurrentUpdates_NeverLoseSequenceOrTearState`
|
|
(added under HealthMonitoring-002) already exercises this path and asserts no lost
|
|
updates and no torn snapshots. No further code change was required for this finding.
|
|
|
|
### HealthMonitoring-004 — Inconsistent heartbeat interval described across XML docs
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Documentation & comments |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:146-148`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:21`, `src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs:16` |
|
|
|
|
**Description**
|
|
|
|
The heartbeat cadence that offline detection relies on is documented inconsistently.
|
|
`CheckForOfflineSites` says "heartbeats arrive every ~5s"; `SiteHealthState.LastHeartbeatAt`
|
|
says "~5s heartbeat"; but `ICentralHealthAggregator.MarkHeartbeat` says "~2s
|
|
heartbeats are arriving". The actual cadence is set elsewhere (Cluster Infrastructure /
|
|
`SiteCommunicationActor`). Readers cannot reason about whether a 60s offline timeout
|
|
gives the intended grace without a single authoritative number.
|
|
|
|
**Recommendation**
|
|
|
|
Pick the correct interval (verify against the heartbeat scheduler in
|
|
`SiteCommunicationActor`/Cluster Infrastructure) and use it consistently in all three
|
|
comments, ideally referencing the owning component rather than restating a magic number.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`). Documentation-only — no regression test is
|
|
meaningful. Verified the authoritative cadence against the Communication module:
|
|
`SiteCommunicationActor.PreStart` schedules the application-level heartbeat to central
|
|
at `CommunicationOptions.TransportHeartbeatInterval`, which defaults to **5 seconds**
|
|
(`CommunicationOptions.cs:49`). The stale "~2s" in `ICentralHealthAggregator.MarkHeartbeat`
|
|
was corrected; all three XML docs (`ICentralHealthAggregator.MarkHeartbeat`,
|
|
`SiteHealthState.LastHeartbeatAt`, `CentralHealthAggregator.CheckForOfflineSites`) now
|
|
state the single authoritative ~5s figure and reference the owning component
|
|
(`Cluster Infrastructure / SiteCommunicationActor` —
|
|
`CommunicationOptions.TransportHeartbeatInterval`) rather than restating a bare magic
|
|
number, so readers can reason about the 60s offline grace.
|
|
|
|
### HealthMonitoring-005 — Central self-report site can flap offline; no heartbeat grace like real sites
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:48-81`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:149` |
|
|
|
|
**Description**
|
|
|
|
`CheckForOfflineSites` decides offline status purely from `LastHeartbeatAt`, and for
|
|
real sites that field is kept fresh by frequent (~2-5s) heartbeats so the 60s timeout
|
|
only fires on genuine total loss. The synthetic `central` site, however, has no
|
|
heartbeat source — `LastHeartbeatAt` is only bumped by `ProcessReport` from the
|
|
30s `CentralHealthReportLoop`. The loop also only runs on the cluster leader and
|
|
silently skips a cycle on any exception. Consequently, a single skipped/late central
|
|
self-report (leader GC pause, brief stall, mid-failover before the new leader's loop
|
|
spins up) leaves `central` with no signal for >60s and it is marked offline even
|
|
though the central cluster is healthy. The central card thus has no equivalent of
|
|
the "one missed report grace" the design doc grants real sites.
|
|
|
|
**Recommendation**
|
|
|
|
Either feed `central` a heartbeat equivalent (e.g. have `MarkHeartbeat` called for
|
|
`CentralSiteId` on a fast timer independent of the leader-only report loop), or apply
|
|
a longer/distinct offline timeout to the `central` keyspace entry, and ensure the new
|
|
leader starts the report loop promptly on failover.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`). Applied the distinct-timeout option. A new
|
|
`HealthMonitoringOptions.CentralOfflineTimeout` (default 3x the report interval =
|
|
3 minutes) is applied by `CentralHealthAggregator.CheckForOfflineSites` to the
|
|
`central` keyspace entry only — real sites keep the existing `OfflineTimeout`. This
|
|
gives the synthetic `central` site (which has no heartbeat source and is fed solely
|
|
by the 30s leader-only `CentralHealthReportLoop`) enough grace to survive a single
|
|
skipped or late self-report — the equivalent of the "one missed report" grace the
|
|
design doc grants real sites — while still going offline on genuine total loss.
|
|
Regression tests `CentralHealthAggregatorTests.OfflineDetection_CentralSite_HasLongerGraceThanRealSites`
|
|
(central survives 75s of silence while a real site goes offline) and
|
|
`OfflineDetection_CentralSite_StillGoesOfflineOnGenuineLoss` (central still detected
|
|
offline after 10 minutes) verify the behaviour.
|
|
|
|
### HealthMonitoring-006 — Sequence seeding contradicts the doc's "starting at 1" wording and is untestable
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Design-document adherence |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:28`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:32` |
|
|
|
|
**Description**
|
|
|
|
The `HealthReportSender` class XML summary states "Sequence numbers are monotonic,
|
|
starting at 1, and reset on service restart." The implementation instead seeds
|
|
`_sequenceNumber` with `DateTimeOffset.UtcNow.ToUnixTimeMilliseconds()` so the first
|
|
emitted sequence is a large epoch value, specifically to keep ordering correct across
|
|
failover. The summary is therefore stale and contradicts the code. Separately, the
|
|
seed reads `DateTimeOffset.UtcNow` directly at field initialization rather than
|
|
through an injected `TimeProvider` (which `CentralHealthAggregator` already uses),
|
|
making the seeding logic impossible to unit-test deterministically and dependent on
|
|
node wall-clock agreement — if one node's clock lags, its post-failover reports can
|
|
be silently rejected as stale by the aggregator.
|
|
|
|
**Recommendation**
|
|
|
|
Fix the `HealthReportSender` XML summary to describe the actual Unix-ms seeding
|
|
strategy, and inject `TimeProvider` for the seed so the behaviour is testable and the
|
|
clock dependency is explicit.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`). The `HealthReportSender` XML summary was
|
|
rewritten to describe the real strategy — sequence numbers are monotonic and
|
|
restart-resetting but explicitly **not** zero/one-based; they are seeded with the
|
|
current Unix epoch (ms) so a freshly-active node always sorts after the prior active.
|
|
Both `HealthReportSender` and `CentralHealthReportLoop` now accept an optional
|
|
`TimeProvider` (defaulting to `TimeProvider.System`) and derive the seed via
|
|
`GetUtcNow().ToUnixTimeMilliseconds()` in the constructor body instead of reading
|
|
`DateTimeOffset.UtcNow` at field initialization, so the seeding is deterministically
|
|
testable and the clock dependency is explicit. `CentralHealthReportLoop` gained a
|
|
`CurrentSequenceNumber` test accessor mirroring `HealthReportSender`. Regression tests
|
|
`HealthReportSenderTests.SequenceNumberSeed_UsesInjectedTimeProvider` and
|
|
`CentralHealthReportLoopTests.SequenceNumberSeed_UsesInjectedTimeProvider` assert the
|
|
seed equals the injected provider's Unix-ms instant (these would not compile against
|
|
the pre-fix code, which had no `TimeProvider` parameter).
|
|
|
|
### HealthMonitoring-007 — Heartbeats for not-yet-registered sites are silently dropped
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86-99` |
|
|
|
|
**Description**
|
|
|
|
`MarkHeartbeat` returns immediately if the site is not already in `_siteStates`
|
|
("registration only happens on report"). Central health state is in-memory only and
|
|
not persisted. After a central restart or failover the aggregator starts empty, so
|
|
for up to one full report interval (default 30s) every site emits only heartbeats
|
|
that are all discarded — the site is reported as *unknown* (absent from
|
|
`GetAllSiteStates`) rather than *online*, even though heartbeats prove it is
|
|
reachable. This is a visible dashboard regression precisely during the failover
|
|
window, which is when operators most need accurate status.
|
|
|
|
**Recommendation**
|
|
|
|
Allow `MarkHeartbeat` to register a minimal `SiteHealthState` (online, no
|
|
`LatestReport` yet, with a UI-visible "awaiting first report" indication) when a
|
|
heartbeat arrives for an unknown site, so reachable sites show online immediately
|
|
after a central restart.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`). `CentralHealthAggregator.MarkHeartbeat` no
|
|
longer returns early for an unknown site. When a heartbeat arrives for a site with no
|
|
aggregator state, it now atomically registers (`TryAdd`, with CAS-loss retry) a
|
|
minimal `SiteHealthState` that is `IsOnline = true`, `LatestReport = null`,
|
|
`LastSequenceNumber = 0` and `LastHeartbeatAt = receivedAt` — an "online, awaiting
|
|
first report" state. This relies on the HealthMonitoring-002 change that made
|
|
`LatestReport` properly nullable, so UI consumers already handle the null case.
|
|
Reachable sites therefore show online immediately after a central restart/failover
|
|
instead of being absent ("unknown") for up to a full report interval. The
|
|
`ICentralHealthAggregator.MarkHeartbeat` XML doc was corrected to describe the new
|
|
behaviour. Regression test
|
|
`CentralHealthAggregatorTests.MarkHeartbeat_RegistersUnknownSite_AsOnlineAwaitingReport`
|
|
verifies the registration; `MarkHeartbeat_KeepsSiteOnline_BetweenReports` and
|
|
`MarkHeartbeat_BringsOfflineSiteBackOnline` cover the already-registered paths.
|
|
|
|
### HealthMonitoring-008 — `GetAllSiteStates` / `GetSiteState` leak live mutable state objects to callers
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium — re-triaged: already resolved as a side-effect of HealthMonitoring-002. |
|
|
| Category | Concurrency & thread safety |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:146-158` |
|
|
|
|
**Description**
|
|
|
|
`GetAllSiteStates` copies the dictionary but the copy still holds references to the
|
|
same live mutable `SiteHealthState` instances; `GetSiteState` returns the live
|
|
instance directly. UI consumers (Blazor Server / SignalR circuits) read these objects
|
|
on their own threads while the aggregator's background timer and report handlers
|
|
concurrently mutate the very same instances (see HealthMonitoring-002). A UI render
|
|
can observe a `SiteHealthState` with, e.g., `IsOnline == true` but a `LatestReport`
|
|
from a different update, or a torn `DateTimeOffset`. Callers could also mutate the
|
|
shared state, corrupting aggregator state.
|
|
|
|
**Recommendation**
|
|
|
|
Return immutable snapshots: convert `SiteHealthState` to a record (per
|
|
HealthMonitoring-002/003) so handing out the reference is safe, or deep-copy each
|
|
state into an immutable DTO before returning.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`). Re-triaged: verified against the current
|
|
source — the root cause was already eliminated by the HealthMonitoring-002 fix.
|
|
`SiteHealthState` is now a `sealed record` with `init`-only properties (fully
|
|
immutable). Every aggregator transition installs a brand-new record instance via an
|
|
atomic compare-and-swap, so the references `GetAllSiteStates` and `GetSiteState` hand
|
|
out are immutable snapshots — a UI consumer reading one on its own thread can never
|
|
observe a torn or half-applied state, and cannot mutate aggregator state through the
|
|
returned reference. The recommended fix (make `SiteHealthState` a record) is exactly
|
|
what the HealthMonitoring-002 change did, so no further code change was required.
|
|
|
|
### HealthMonitoring-009 — Missing test coverage for central report loop, heartbeat path, replication, and collector setters
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Testing coverage |
|
|
| Status | Resolved |
|
|
| Location | `tests/ScadaLink.HealthMonitoring.Tests/` |
|
|
|
|
**Description**
|
|
|
|
Several behaviours have no automated coverage:
|
|
- `CentralHealthReportLoop` — leader-only gating (`SelfIsPrimary`), self-report
|
|
generation, sequence assignment: no test file at all.
|
|
- `CentralHealthAggregator.MarkHeartbeat` — keeping a site online between reports,
|
|
online recovery via heartbeat, and the unknown-site drop behaviour
|
|
(HealthMonitoring-007): untested.
|
|
- Offline detection driven by `LastHeartbeatAt` vs `LastReportReceivedAt` — the
|
|
existing offline tests only advance time after a report, never exercising the
|
|
heartbeat-keeps-alive path the design depends on.
|
|
- `SiteHealthCollector` — `SetClusterNodes`, `SetInstanceCounts`, `SetParkedMessageCount`,
|
|
`SetNodeHostname`, `SetActiveNode`/`NodeRole`, `UpdateTagQuality`,
|
|
`UpdateConnectionEndpoint`: not reflected-in-report tested.
|
|
- `SiteHealthReportReplica` idempotency under double delivery: untested.
|
|
|
|
**Recommendation**
|
|
|
|
Add tests for the central report loop (with a fake `IClusterNodeProvider`), the
|
|
heartbeat-keeps-online and unknown-site heartbeat paths, and the remaining collector
|
|
setters' presence in `CollectReport` output.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`). Added the missing coverage:
|
|
|
|
- **`CentralHealthReportLoopTests`** (new file) — `GeneratesCentralReports_WhenSelfIsPrimary`,
|
|
`GeneratesNoReports_WhenNotPrimary` (leader-only `SelfIsPrimary` gating with a fake
|
|
`IClusterNodeProvider`), `AssignsMonotonicSequenceNumbers`, and
|
|
`SetsActiveNodeFlag_EvenWhenNotPrimary`.
|
|
- **`CentralHealthAggregatorTests`** — `MarkHeartbeat_RegistersUnknownSite_AsOnlineAwaitingReport`,
|
|
`MarkHeartbeat_KeepsSiteOnline_BetweenReports` (heartbeat keeps a site online past
|
|
the offline timeout — the path the design depends on), and
|
|
`MarkHeartbeat_BringsOfflineSiteBackOnline`.
|
|
- **`SiteHealthCollectorTests`** — reflected-in-report tests for `SetClusterNodes`,
|
|
`SetInstanceCounts`, `SetParkedMessageCount`, `SetNodeHostname`,
|
|
`SetActiveNode`/`NodeRole`, `UpdateTagQuality`, `UpdateConnectionEndpoint`, and
|
|
`SetStoreAndForwardDepths`.
|
|
|
|
The `SiteHealthReportReplica` idempotency item is **out of scope** for this module:
|
|
`SiteHealthReportReplica` is declared in `ScadaLink.Commons` and published/consumed by
|
|
`CentralCommunicationActor` in the `ScadaLink.Communication` module — the
|
|
HealthMonitoring module itself has no replication code. Replica double-delivery
|
|
idempotency is already covered by `ProcessReport`'s sequence-number guard
|
|
(`ProcessReport_RejectsEqualSequence`, `ProcessReport_RejectsStaleReport_WhenSequenceNotGreater`);
|
|
testing the actor-side double-publish belongs in the Communication module's review.
|
|
The HealthMonitoring test suite now stands at 47 passing tests (was 30).
|
|
|
|
### HealthMonitoring-010 — `HealthReportSender` silently swallows inner failures with bare `catch {}`
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Error handling & resilience |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:70-87` |
|
|
|
|
**Description**
|
|
|
|
The cluster-nodes update and parked-message-count query are each wrapped in
|
|
`try { ... } catch { /* Non-fatal */ }` with no logging. A persistent failure (e.g.
|
|
the S&F SQLite store is permanently broken, or `GetClusterNodes()` always throws)
|
|
is then completely invisible — every report silently ships with stale cluster nodes
|
|
and a parked count of 0, with nothing in the logs to explain the wrong dashboard
|
|
values. Bare `catch` with no exception variable also catches `OperationCanceledException`
|
|
and would mask shutdown signalling if the awaited call observed the token.
|
|
|
|
**Recommendation**
|
|
|
|
Catch a specific exception type (or at least `Exception ex`) and `LogWarning`/`LogDebug`
|
|
the failure so persistent degradation is diagnosable; avoid swallowing
|
|
`OperationCanceledException`.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`). All three bare `catch { }` blocks in
|
|
`HealthReportSender.ExecuteAsync` (cluster-nodes refresh, parked-message-count query,
|
|
S&F buffer-depth query) were changed to `catch (Exception ex)` and now emit a
|
|
`LogWarning(ex, ...)` naming the failed operation and the site, so a persistent
|
|
degradation (broken S&F SQLite store, always-throwing `GetClusterNodes()`) is
|
|
diagnosable instead of silently shipping stale/zero metrics. On the `OperationCanceledException`
|
|
concern: verified the inner calls cannot raise OCE from cancellation —
|
|
`IClusterNodeProvider.GetClusterNodes()` is synchronous and takes no token, and
|
|
`StoreAndForwardStorage.GetParkedMessageCountAsync()`/`GetBufferDepthByCategoryAsync()`
|
|
take no `CancellationToken` parameter, so the only cancellation path is the outer loop's
|
|
`PeriodicTimer.WaitForNextTickAsync(stoppingToken)`, which is unaffected. Regression
|
|
test `HealthReportSenderTests.ClusterNodeRefreshFailure_IsLoggedNotSwallowed` injects a
|
|
throwing `IClusterNodeProvider`, asserts the loop still ships reports, and asserts a
|
|
warning carrying the `InvalidOperationException` is logged — confirmed to fail against
|
|
the pre-fix bare `catch { }` (logged-entry collection was empty).
|
|
|
|
### HealthMonitoring-011 — `AddHealthMonitoringActors` is a dead no-op placeholder
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Code organization & conventions |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs:42-46` |
|
|
|
|
**Description**
|
|
|
|
`AddHealthMonitoringActors` does nothing but `return services` with a "Placeholder for
|
|
Phase 4+" comment. A public extension method that silently no-ops is a trap: a caller
|
|
who registers it will believe actor wiring is in place. No caller currently invokes it.
|
|
|
|
**Recommendation**
|
|
|
|
Remove the method until it has real behaviour, or throw `NotImplementedException` so
|
|
accidental use fails loudly. If the actor model for this component is genuinely
|
|
planned, track it in the design doc instead of a half-method.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`). Verified a codebase-wide search — the
|
|
`AddHealthMonitoringActors` no-op extension has no production callers (only references
|
|
were in the code-review docs themselves). The HealthMonitoring module deliberately
|
|
contains no actors (transport is abstracted behind `IHealthReportTransport`; actor-side
|
|
wiring lives in the Communication module), so there is no genuine Phase-4 actor work for
|
|
this method to host. The dead placeholder was removed from `ServiceCollectionExtensions`
|
|
so a caller can no longer be misled into believing actor wiring exists. No regression
|
|
test is meaningful for a deleted no-op; removal is verified by the module continuing to
|
|
build and all 50 tests passing.
|
|
|
|
### HealthMonitoring-012 — `SiteHealthState.LatestReport` initialized to `null!`, misrepresenting the contract
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low — re-triaged: already resolved as a side-effect of HealthMonitoring-002. |
|
|
| Category | Documentation & comments |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11` |
|
|
|
|
**Description**
|
|
|
|
`LatestReport` is declared `SiteHealthReport LatestReport { get; set; } = null!;`,
|
|
suppressing nullability. Today every code path that creates a `SiteHealthState` (only
|
|
`ProcessReport`) assigns `LatestReport`, so it is never actually null — but the
|
|
`null!` declaration tells readers and the compiler the opposite of the real
|
|
invariant. If HealthMonitoring-007 is addressed by registering state from a heartbeat
|
|
(no report yet), this becomes a live `NullReferenceException` risk for UI code that
|
|
dereferences `LatestReport`.
|
|
|
|
**Recommendation**
|
|
|
|
Either make `LatestReport` `required` (matching how it is genuinely always set today)
|
|
or make it properly nullable `SiteHealthReport?` and have consumers handle the
|
|
"registered, no report yet" case explicitly — consistent with whatever is decided
|
|
for HealthMonitoring-007.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-16 (commit `pending`). Re-triaged: verified against the current
|
|
source — the root cause was already eliminated by the HealthMonitoring-002 /
|
|
HealthMonitoring-007 fixes. `SiteHealthState.LatestReport` is already declared
|
|
`SiteHealthReport? LatestReport { get; init; }` (the recommendation's "make it properly
|
|
nullable" option) with an XML doc explaining the `null` case ("known only via heartbeats,
|
|
has not yet sent a report"). A codebase-wide search confirms no `null!` suppression
|
|
remains anywhere in `src/ScadaLink.HealthMonitoring`. This is exactly the change
|
|
HealthMonitoring-002 made when converting `SiteHealthState` to an immutable record, so
|
|
the contract is now honest and no further code change was required.
|
|
|
|
### HealthMonitoring-013 — Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout`
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Documentation & comments |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:194-196` |
|
|
|
|
**Description**
|
|
|
|
`ExecuteAsync` derives the `PeriodicTimer` cadence with the comment "Check at half
|
|
the (shorter) offline timeout interval for timely detection", but the code only
|
|
reads `_options.OfflineTimeout`:
|
|
|
|
```csharp
|
|
var checkInterval = TimeSpan.FromMilliseconds(_options.OfflineTimeout.TotalMilliseconds / 2);
|
|
```
|
|
|
|
`CentralOfflineTimeout` (HealthMonitoring-005's fix) is never considered. With the
|
|
default options (`OfflineTimeout` 60s, `CentralOfflineTimeout` 3m) `OfflineTimeout`
|
|
genuinely is the shorter of the two, so the parenthetical happens to hold. But the
|
|
comment states an invariant the code does not enforce: if an operator configures
|
|
`CentralOfflineTimeout` *smaller* than `OfflineTimeout`, the check cadence stays
|
|
tied to `OfflineTimeout`, and central offline detection is delayed by up to a full
|
|
`OfflineTimeout / 2` beyond the intended `CentralOfflineTimeout` window. The comment
|
|
misleads a reader into believing the cadence already adapts to whichever timeout is
|
|
shorter.
|
|
|
|
**Recommendation**
|
|
|
|
Either compute `checkInterval` from `Math.Min(OfflineTimeout, CentralOfflineTimeout)`
|
|
so the code matches the comment, or drop the "(shorter)" wording and state plainly
|
|
that the cadence is derived from `OfflineTimeout` only (acceptable while the default
|
|
`CentralOfflineTimeout` is the larger value).
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-17. Root cause confirmed against source — `ExecuteAsync` derived the
|
|
`PeriodicTimer` cadence solely from `OfflineTimeout` while the comment claimed it
|
|
tracked the "(shorter)" timeout. Took the code-matches-comment option: extracted the
|
|
cadence into `CentralHealthAggregator.ComputeCheckInterval`, which now derives it from
|
|
half of the *shorter* of `OfflineTimeout` and `CentralOfflineTimeout`, so a
|
|
`CentralOfflineTimeout` configured below `OfflineTimeout` is still polled at least
|
|
twice within its window. The comment was rewritten to match. Regression test
|
|
`CentralHealthAggregatorTests.CheckInterval_IsHalfTheShorterTimeout` asserts the
|
|
default case (30s) and the shorter-`CentralOfflineTimeout` case (10s) — the latter
|
|
would have returned 30s against the pre-fix code.
|
|
|
|
### HealthMonitoring-014 — `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Error handling & resilience |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.HealthMonitoring/HealthMonitoringOptions.cs:3-20`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:196`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:67`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:63` |
|
|
|
|
**Description**
|
|
|
|
`HealthMonitoringOptions` is bound from the `ScadaLink:HealthMonitoring` config
|
|
section (`SiteServiceRegistration.BindSharedOptions`) with no validation —
|
|
no `IValidateOptions<HealthMonitoringOptions>`, no `ValidateDataAnnotations`, no
|
|
`ValidateOnStart`. `ReportInterval`, `OfflineTimeout`, and `CentralOfflineTimeout`
|
|
are all fed straight into `new PeriodicTimer(...)` (and `OfflineTimeout` into a
|
|
division for the check interval). `PeriodicTimer`'s constructor throws
|
|
`ArgumentOutOfRangeException` for a zero or negative period. A misconfigured
|
|
`appsettings.json` (e.g. `"ReportInterval": "00:00:00"`, an empty/garbled value
|
|
that binds to `TimeSpan.Zero`, or a negative span) therefore crashes the
|
|
`HealthReportSender` / `CentralHealthReportLoop` / `CentralHealthAggregator`
|
|
hosted service at startup with an opaque exception that does not name the
|
|
offending config key, rather than failing fast with a clear validation message.
|
|
|
|
**Recommendation**
|
|
|
|
Add an options validator (DataAnnotations `[Range]`-style on the spans, or an
|
|
`IValidateOptions<HealthMonitoringOptions>`) that rejects non-positive
|
|
`ReportInterval`/`OfflineTimeout`/`CentralOfflineTimeout` and ideally requires
|
|
`CentralOfflineTimeout >= OfflineTimeout`, and call `.ValidateOnStart()` so a bad
|
|
configuration fails fast with a message naming the section and key.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-17. Root cause confirmed — `HealthMonitoringOptions` had no
|
|
validator, so a zero/negative interval reached `new PeriodicTimer(...)` and crashed
|
|
the hosted service with an opaque `ArgumentOutOfRangeException`. Added
|
|
`HealthMonitoringOptionsValidator : IValidateOptions<HealthMonitoringOptions>` that
|
|
rejects non-positive `ReportInterval`/`OfflineTimeout`/`CentralOfflineTimeout` and a
|
|
`CentralOfflineTimeout` shorter than `OfflineTimeout`, each failure naming the
|
|
`ScadaLink:HealthMonitoring` config key. It is registered (idempotently, via
|
|
`TryAddEnumerable`) by all three `ServiceCollectionExtensions` registration methods,
|
|
so it fires when the hosted services resolve `IOptions.Value` at startup — failing
|
|
fast with a clear message. (`ValidateOnStart()` lives in the Host module's binding
|
|
call, which is out of scope; the validator nonetheless runs at startup because the
|
|
hosted-service constructors resolve the options eagerly — matching the existing
|
|
`ClusterOptionsValidator` registration pattern.) Regression tests in
|
|
`HealthMonitoringOptionsValidatorTests` cover the valid default plus zero/negative
|
|
intervals and the `CentralOfflineTimeout < OfflineTimeout` case.
|
|
|
|
### HealthMonitoring-015 — Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt`
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:122-130`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:27` |
|
|
|
|
**Description**
|
|
|
|
When `MarkHeartbeat` registers a previously-unknown site (HealthMonitoring-007's
|
|
fix), it sets `LastReportReceivedAt = default` — i.e. `DateTimeOffset.MinValue`
|
|
(`0001-01-01`). The XML doc on `SiteHealthState.LastReportReceivedAt` states the
|
|
field is "Used by the UI to surface report staleness during failover." A
|
|
heartbeat-only site therefore has `LatestReport == null` **and**
|
|
`LastReportReceivedAt == DateTimeOffset.MinValue`. Any UI code that computes
|
|
"last report N ago" as `now - LastReportReceivedAt` without first checking
|
|
`LatestReport != null` will render a nonsensical staleness of roughly two
|
|
thousand years for a site that is, in fact, freshly reachable. The two
|
|
"no report yet" signals (`LatestReport == null`, `LastReportReceivedAt == default`)
|
|
are independent and both must be special-cased; the sentinel value is an easy trap.
|
|
|
|
**Recommendation**
|
|
|
|
Make `LastReportReceivedAt` nullable (`DateTimeOffset?`) so "no report received
|
|
yet" is an explicit, unmissable state rather than a magic sentinel — consistent
|
|
with how `LatestReport` was already made nullable for the same case — and have UI
|
|
consumers render staleness only when it has a value. Alternatively, document the
|
|
`default` sentinel prominently on the field and audit every UI reader, but the
|
|
nullable option is safer and matches the existing `LatestReport` treatment.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-17. Root cause confirmed against source — `CentralHealthAggregator.MarkHeartbeat`
|
|
registered a heartbeat-only site with `LastReportReceivedAt = default` (`0001-01-01`),
|
|
which `Health.razor:74` passed verbatim into `TimestampDisplay`, rendering a
|
|
~2000-year-stale timestamp. Applied the finding's recommended cross-module fix.
|
|
**HealthMonitoring:** `SiteHealthState.LastReportReceivedAt` is now `DateTimeOffset?`,
|
|
and `MarkHeartbeat`'s unknown-site registration sets it to `null` — "no report yet"
|
|
is now an explicit nullable state, consistent with the already-nullable `LatestReport`.
|
|
`ProcessReport` continues to set a real instant. **CentralUI:** `TimestampDisplay.Value`
|
|
now accepts `DateTimeOffset?` and renders `null` as a plain `text-muted` placeholder
|
|
(default "never", configurable via a new `NullText` parameter); existing non-null
|
|
callers (`AuditLog`, `EventLogs`, `Deployments`) are unaffected by the implicit
|
|
widening. `Health.razor`'s "Last report" cell passes `NullText="awaiting first report"`
|
|
so a heartbeat-only site reads cleanly. Regression tests:
|
|
`CentralHealthAggregatorTests.MarkHeartbeat_RegistersUnknownSite_WithNullLastReportReceivedAt`
|
|
and `ProcessReport_SetsLastReportReceivedAt_ForHeartbeatRegisteredSite` (HealthMonitoring —
|
|
the first would not compile against the pre-fix non-nullable field), and
|
|
`TimestampDisplayTests.Render_NullValue_ShowsNeverPlaceholder`,
|
|
`Render_NullValue_DoesNotRenderYear0001Sentinel`, `Render_NonNullValue_ShowsFormattedTimestamp`
|
|
(CentralUI).
|
|
|
|
### HealthMonitoring-016 — `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider`
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Code organization & conventions |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:151` |
|
|
|
|
**Description**
|
|
|
|
`CollectReport` stamps each report with `ReportTimestamp: DateTimeOffset.UtcNow`,
|
|
read directly from the system clock. Every other time-dependent class in the
|
|
module — `CentralHealthAggregator`, `HealthReportSender`, `CentralHealthReportLoop`
|
|
— was deliberately refactored (HealthMonitoring-006) to take an injectable
|
|
`TimeProvider` so the behaviour is deterministically testable and the clock
|
|
dependency is explicit. `SiteHealthCollector` is the lone holdout: the report
|
|
timestamp cannot be controlled in a unit test, which is why
|
|
`SiteHealthCollectorTests.CollectReport_IncludesUtcTimestamp` can only assert the
|
|
timestamp falls in a `before`/`after` wall-clock window rather than equalling a
|
|
known instant. This is a minor consistency/testability gap, not a behaviour bug.
|
|
|
|
**Recommendation**
|
|
|
|
Add an optional `TimeProvider` constructor parameter to `SiteHealthCollector`
|
|
(defaulting to `TimeProvider.System`, mirroring the other classes) and derive
|
|
`ReportTimestamp` from `GetUtcNow()`, so the report timestamp is deterministically
|
|
testable and the module is consistent.
|
|
|
|
**Resolution**
|
|
|
|
Resolved 2026-05-17. Root cause confirmed — `CollectReport` stamped `ReportTimestamp`
|
|
from `DateTimeOffset.UtcNow` directly while every other time-dependent class in the
|
|
module takes an injectable `TimeProvider`. Added an optional `TimeProvider`
|
|
constructor parameter to `SiteHealthCollector` (defaulting to `TimeProvider.System`,
|
|
mirroring `CentralHealthAggregator`/`HealthReportSender`/`CentralHealthReportLoop`)
|
|
and `CollectReport` now derives `ReportTimestamp` from `_timeProvider.GetUtcNow()`.
|
|
The DI registration (`AddSingleton<ISiteHealthCollector, SiteHealthCollector>`)
|
|
continues to work via the optional parameter. Regression test
|
|
`SiteHealthCollectorTests.CollectReport_StampsTimestamp_FromInjectedTimeProvider`
|
|
asserts the timestamp equals a fixed injected instant exactly (not just a
|
|
before/after window); it would not compile against the pre-fix single-arg-less
|
|
constructor.
|