code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97
Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.
regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.HealthMonitoring` |
 | Design doc | `docs/requirements/Component-HealthMonitoring.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 7 |
 
 ## Summary
 
@@ -51,6 +51,35 @@ HealthMonitoring + CentralUI change), and `CollectReport` reading
`TimeProvider` (HealthMonitoring-016). The module remains small, readable, and
broadly faithful to the design intent.

#### Re-review 2026-05-28 (commit `1eb6e97`)

All sixteen prior findings (HealthMonitoring-001..016) remain `Resolved`. This
baseline re-review applied the full 10-category checklist and produced **7 new
findings** (2 Medium, 5 Low; none crash-class). The most material observation
is a **metric-loss race** in `HealthReportSender.ExecuteAsync`
(HealthMonitoring-017): `CollectReport` resets the per-interval error counters
(`ScriptErrorCount`, `AlarmEvaluationErrorCount`, `DeadLetterCount`,
`SiteAuditWriteFailures`, `AuditRedactionFailure`) **before**
`_transport.Send(...)` is attempted, so a transport failure (the existing
`catch { LogError; }` path) silently discards every error this site recorded in
the failed interval. This inverts the module-specific concern of "metric
counters drifting from raw-per-interval to cumulative": the counts are _lost_
rather than _drifting_. A parallel hazard exists in `CentralHealthReportLoop`
(HealthMonitoring-018). The remaining items are smaller: two Audit Log metrics
(`SiteAuditTelemetryStalled`, `CentralAuditWriteFailures`) listed in the design
doc never make it into a HealthMonitoring surface (HealthMonitoring-019); a
heartbeat with `receivedAt <= existing.LastHeartbeatAt` brings an offline site
back online with a stale `LastHeartbeatAt`, so the site can flap right back to
offline on the next check (HealthMonitoring-020); the reserved
`CentralSiteId = "central"` constant collides with any real site named
`"central"` and silently extends its offline grace (HealthMonitoring-021);
`CentralHealthReportLoopTests` uses real wall-clock 50 ms timers plus
`Task.Delay`, making it timing-sensitive (HealthMonitoring-022); and one
obsolete placeholder test name
(`StoreAndForwardBufferDepths_IsEmptyPlaceholder`) misrepresents what it now
covers (HealthMonitoring-023). All sequence-number and offline-detection
arithmetic uses `_timeProvider.GetUtcNow()` consistently; no wall-clock vs
monotonic mismatch was observed.

## Checklist coverage

| # | Category | Examined | Notes |
@@ -66,6 +95,21 @@ broadly faithful to the design intent.
| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses `OfflineTimeout` (HealthMonitoring-013). |

_Re-review (2026-05-28, `1eb6e97`):_

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | `HealthReportSender` and `CentralHealthReportLoop` reset per-interval counters before the send/process call, so counts are lost on transport failure (HealthMonitoring-017, HealthMonitoring-018). `MarkHeartbeat` brings an offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt`, so the site can flap straight back to offline (HealthMonitoring-020). The reserved `CentralSiteId = "central"` constant silently collides with any real site named "central" (HealthMonitoring-021). |
| 2 | Akka.NET conventions | x | Module contains no actors itself. `IHealthReportTransport` cleanly abstracts the Akka-remoting send. `ProcessReport`/`MarkHeartbeat` are called from `CentralCommunicationActor`'s receive handler, so they run on the actor thread; the aggregator's CAS loops make them safe regardless. No issues found. |
| 3 | Concurrency & thread safety | x | Verified the resolved `SiteHealthState` immutable-record / CAS-loop pattern still holds across `ProcessReport`, `MarkHeartbeat`, `CheckForOfflineSites`. `SiteHealthCollector` uses `volatile` for reference fields (`_clusterNodes`, `_nodeHostname`, `_siteAuditBacklog`, `_isActiveNode`) and `Interlocked` for counters consistently. `CollectReport`'s `new Dictionary<>(concurrentDict)` snapshots are not strictly atomic but acceptable at the documented scale. No new issues found. |
| 4 | Error handling & resilience | x | `try/catch` blocks now log all non-fatal failures (resolved HealthMonitoring-010 still in place). The outer `catch (Exception)` in `ExecuteAsync` keeps the loop alive, which is sound. New: the counter-reset-before-send issue (HealthMonitoring-017, HealthMonitoring-018) is an error-handling gap, because a transport failure silently swallows the interval's metric data. |
| 5 | Security | x | No issues found. The module handles only numeric/string operational metrics; no secrets, auth surface, or untrusted input parsing. `MarkHeartbeat` and `ProcessReport` trust the caller (intra-cluster). |
| 6 | Performance & resource management | x | `PeriodicTimer` instances disposed via `using`. CAS retry loops in `ProcessReport`/`MarkHeartbeat` have no retry cap, but contention is limited to a single dictionary entry per site, so the loops are effectively wait-free in the common case. No issues found. |
| 7 | Design-document adherence | x | `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` are listed as required dashboard tiles in `Component-HealthMonitoring.md` but have no HealthMonitoring-side surface; both live only in `AuditLog`'s `AuditCentralHealthSnapshot` with no integration into the health aggregator or its consumers (HealthMonitoring-019). |
| 8 | Code organization & conventions | x | Options class correctly owned by the component, validator registered idempotently across all three `Add*` paths. POCO/messages in Commons. `AddCentralHealthAggregation` implicitly depends on `ISiteHealthCollector` being registered elsewhere (Host calls `AddHealthMonitoring()` first); this works but is a hidden ordering requirement. Minor; not flagged. |
| 9 | Testing coverage | x | Per-interval reset semantics are covered for site-side counters but NOT for the failed-send case: no test asserts that counters remain accumulated when the transport throws, which would catch HealthMonitoring-017. `CentralHealthReportLoopTests` uses a real wall-clock 50 ms `PeriodicTimer` plus `Task.Delay(250)` for timing, which is flake-prone on a slow CI runner (HealthMonitoring-022). The name of the placeholder test `StoreAndForwardBufferDepths_IsEmptyPlaceholder` is stale (HealthMonitoring-023). |
| 10 | Documentation & comments | x | XML docs in the new audit-bridge surfaces (`IncrementSiteAuditWriteFailures`, `IncrementAuditRedactionFailure`, `UpdateSiteAuditBacklog`) are accurate. The stale placeholder test name is the only issue (HealthMonitoring-023). |

## Findings

### HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated
@@ -776,3 +820,314 @@ continues to work via the optional parameter. Regression test
asserts the timestamp equals a fixed injected instant exactly (not just a
before/after window); it would not compile against the pre-fix single-arg-less
constructor.

### HealthMonitoring-017 — `HealthReportSender` resets interval counters before `Send`; transport failures silently drop the interval's error counts

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:140-154`, `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:146-153` |

**Description**

`HealthReportSender.ExecuteAsync` calls `_collector.CollectReport(_siteId)` and
then `_transport.Send(reportWithSeq)` inside a single `try` block whose `catch`
logs and continues. `CollectReport` atomically reads and resets the per-interval
counters via `Interlocked.Exchange(ref _scriptErrorCount, 0)` (and the same for
`_alarmErrorCount`, `_deadLetterCount`, `_siteAuditWriteFailures`,
`_auditRedactionFailures`). If `_transport.Send` then throws (Akka remoting
hiccup, transport not yet associated, central side temporarily unavailable,
serialization failure on a malformed metric, etc.), the `catch (Exception ex)`
on line 150 logs an error and the loop simply waits for the next tick. The
report was never delivered, but the counters have already been reset to zero, so
**every error this site recorded in the failed interval is gone**: it is neither
in the (un-sent) report nor in the (zeroed) collector. The very next successful
report will show "0 script errors / 0 alarm errors" for the entire window in
which the transport was broken, masking exactly the period the operator most
needs to triage.

This contradicts the design doc's "raw counts per reporting interval" / "counter
resets **after each report is sent**" wording: the current code resets on each
report _attempt_, regardless of outcome. The hazard worsens under sustained
transport failure: every interval's errors are lost, and the central dashboard
sees a quiet site while the site is, in fact, failing.

The same shape exists in `CentralHealthReportLoop` (see HealthMonitoring-018):
`CollectReport` is called before `_aggregator.ProcessReport`. The aggregator
call is in-process and unlikely to throw, but the structural bug is identical.

**Recommendation**

Build the report from a non-destructive read first (`PeekReport(siteId)`,
returning a snapshot without mutating the counters) and only call a dedicated
`ResetIntervalCounters()` after a successful `_transport.Send`. Alternatively,
on a `Send` failure, restore the lost counts via `Interlocked.Add` of the
captured values back into the collector fields. That is atomically correct as
long as no other thread can read the counters in between, which is true here
because the next read is the next `CollectReport` on the same loop. The "peek
then commit" shape is the cleaner public API.

A regression test should add a failing-transport scenario: `Send` throws an
`InvalidOperationException`; assert that the next successful report includes
the previously-failed interval's `ScriptErrorCount`.

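The "peek then commit" recommendation for HealthMonitoring-017 could look roughly like the sketch below. It is a minimal illustration, not the module's code: `IntervalSnapshot`, `IntervalCounters`, and `ReportCycle.TrySendOnce` are hypothetical stand-ins, only one counter is shown, and `ResetIntervalCounters` is assumed to take the sent snapshot so that it can subtract the reported value instead of writing zero (which would otherwise drop errors recorded between the peek and the commit).

```csharp
using System;
using System.Threading;

// Hypothetical, trimmed-down stand-ins for the real report and collector types.
public sealed record IntervalSnapshot(int ScriptErrorCount);

public sealed class IntervalCounters
{
    private int _scriptErrorCount;

    public void IncrementScriptError() => Interlocked.Increment(ref _scriptErrorCount);

    // Non-destructive read: the counter is left untouched.
    public IntervalSnapshot PeekReport() => new(Volatile.Read(ref _scriptErrorCount));

    // Commit after a successful send. Subtracting the reported value keeps
    // any errors that were recorded between the peek and the commit.
    public void ResetIntervalCounters(IntervalSnapshot sent) =>
        Interlocked.Add(ref _scriptErrorCount, -sent.ScriptErrorCount);
}

public static class ReportCycle
{
    // One tick of the sender loop; `send` stands in for _transport.Send(...).
    public static bool TrySendOnce(IntervalCounters counters, Action<IntervalSnapshot> send)
    {
        var snapshot = counters.PeekReport();
        try
        {
            send(snapshot);
        }
        catch (Exception)
        {
            // Real code would log here. The counters are untouched, so the
            // next tick reports this interval's errors as well.
            return false;
        }

        counters.ResetIntervalCounters(snapshot);
        return true;
    }
}
```

With this shape, a report sent after a failed interval carries the errors of both intervals, which is the behaviour the proposed regression test asserts.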
### HealthMonitoring-018 — Same counter-reset-before-publish hazard in `CentralHealthReportLoop`

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:87-98` |

**Description**

`CentralHealthReportLoop.ExecuteAsync` calls `_collector.CollectReport(CentralSiteId)`
(which resets the per-interval counters on the shared `SiteHealthCollector`
instance; see HealthMonitoring-017) and then `_aggregator.ProcessReport(reportWithSeq)`
inside the same `try` block. If `ProcessReport` throws, the central node's own
per-interval counters (`ScriptErrorCount`, `AlarmEvaluationErrorCount`,
`DeadLetterCount`, `SiteAuditWriteFailures`, `AuditRedactionFailure`) are lost
for that interval.

In practice `ProcessReport` is a pure in-memory CAS loop and is very unlikely
to throw, so the operational impact is small. However, the structural bug is
identical to HealthMonitoring-017 and would be fixed by the same
"peek then commit" refactor in `SiteHealthCollector`. The Audit-Log-related
metrics matter most here: `AuditRedactionFailure` is genuinely incremented at
central during normal operation (the Notification Outbox dispatcher and
Inbound API middleware both write through `CentralAuditRedactionFailureCounter`,
which can fan out to the collector via the bridge), so this is not purely
theoretical.

**Recommendation**

Adopt the same "peek then reset on successful publish" pattern recommended for
HealthMonitoring-017. Reuse the new `PeekReport` / `ResetIntervalCounters`
collector API once it lands.

### HealthMonitoring-019 — `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` design-doc metrics have no HealthMonitoring-side surface

| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-HealthMonitoring.md:39,40`, `src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs`, `src/ScadaLink.AuditLog/Central/AuditCentralHealthSnapshot.cs:39-58` |

**Description**

`Component-HealthMonitoring.md` lists `SiteAuditTelemetryStalled` and
`CentralAuditWriteFailures` as required dashboard metrics, and reiterates them
under the Audit Log KPIs section and in the Dependencies section. The doc also
says they "are central-computed alongside the existing central KPIs"
(Notification Outbox / Site Call Audit) and surface in the **Audit** dashboard
tile group.

Tracing the code:

- `SiteAuditTelemetryStalled` is published by `SiteAuditReconciliationActor`,
  picked up by `SiteAuditTelemetryStalledTracker`, and latched into
  `AuditCentralHealthSnapshot._stalled` (a `ConcurrentDictionary<string, bool>`
  in the `ScadaLink.AuditLog` assembly).
- `CentralAuditWriteFailures` is incremented inside `AuditCentralHealthSnapshot`
  via `ICentralAuditWriteFailureCounter.Increment()` (also in `ScadaLink.AuditLog`).

Neither metric is referenced anywhere in `src/ScadaLink.HealthMonitoring/`:

- `ICentralHealthAggregator` does not expose them.
- `SiteHealthCollector` has no central counterpart (it is site-only).
- `SiteHealthReport` has no `SiteAuditTelemetryStalled` / `CentralAuditWriteFailures`
  fields (the site-only `SiteAuditWriteFailures`, `AuditRedactionFailure`, and
  `SiteAuditBacklog` _are_ wired; the central pair is the gap).

Currently the only consumer of `IAuditCentralHealthSnapshot` is whatever
Central UI page binds to it directly (out of scope for this module), but the
design doc places these metrics under HealthMonitoring's responsibility
("Health Monitoring Dashboard displays aggregated metrics"). At minimum, the
Dependencies section's claim that Health Monitoring provides "the
central-computed `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics"
is false for `CentralAuditWriteFailures`: nothing under
`src/ScadaLink.HealthMonitoring/` knows about it.

**Recommendation**

Decide whether HealthMonitoring or the consuming UI page owns the
`IAuditCentralHealthSnapshot` integration:

- If HealthMonitoring owns it, expose a `CentralKpis` accessor on
  `ICentralHealthAggregator` (e.g. a `GetCentralAuditHealth()` method that
  returns a typed DTO derived from the injected `IAuditCentralHealthSnapshot`)
  so the dashboard has a single read surface mirroring `GetAllSiteStates`.
- If the UI page binds `IAuditCentralHealthSnapshot` directly, update the
  HealthMonitoring design doc's Responsibilities / Dependencies sections to
  reflect that and remove the implied integration.

Either way, add a regression test that the chosen surface returns the live
counter and per-site stalled state.

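If HealthMonitoring ends up owning the integration, the typed read surface might be shaped like the sketch below. Everything here is an assumption made for illustration: the members of the real `IAuditCentralHealthSnapshot` are not shown in this review, so a hypothetical `IAuditHealthSource` interface stands in for it, and `CentralAuditHealth`, `CentralAuditHealthReader`, and `InMemoryAuditHealthSource` are invented names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for IAuditCentralHealthSnapshot; its real members may differ.
public interface IAuditHealthSource
{
    long CentralAuditWriteFailures { get; }
    IReadOnlyDictionary<string, bool> StalledBySite { get; }
}

// Typed DTO the dashboard would read in one call.
public sealed record CentralAuditHealth(
    long CentralAuditWriteFailures,
    IReadOnlyList<string> StalledSiteIds);

public sealed class CentralAuditHealthReader
{
    private readonly IAuditHealthSource _source;

    public CentralAuditHealthReader(IAuditHealthSource source) =>
        _source = source ?? throw new ArgumentNullException(nameof(source));

    // Copies the live values into an immutable snapshot; only sites whose
    // stalled flag is currently set are listed, in ordinal order.
    public CentralAuditHealth GetCentralAuditHealth()
    {
        var stalled = _source.StalledBySite
            .Where(kvp => kvp.Value)
            .Select(kvp => kvp.Key)
            .OrderBy(id => id, StringComparer.Ordinal)
            .ToList();

        return new CentralAuditHealth(_source.CentralAuditWriteFailures, stalled);
    }
}

// In-memory source, useful for the regression test the finding asks for.
public sealed class InMemoryAuditHealthSource : IAuditHealthSource
{
    public long CentralAuditWriteFailures { get; set; }
    public Dictionary<string, bool> Stalled { get; } = new();
    public IReadOnlyDictionary<string, bool> StalledBySite => Stalled;
}
```

Returning a copied DTO rather than the live dictionary keeps the dashboard read consistent for the duration of a render, at the cost of one small allocation per call.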
### HealthMonitoring-020 — `MarkHeartbeat` brings offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt`

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:128-147` |

**Description**

The CAS path in `MarkHeartbeat` picks `newHeartbeat = max(receivedAt, existing.LastHeartbeatAt)`,
then short-circuits only when `newHeartbeat == existing.LastHeartbeatAt &&
existing.IsOnline`. That short-circuit is correct, but consider the case where
`existing.IsOnline == false` and `receivedAt <= existing.LastHeartbeatAt`:

1. Suppose a site is marked offline by `CheckForOfflineSites` at time T1.
2. A late/out-of-order heartbeat carrying a `receivedAt` _older_ than the last
   stored `LastHeartbeatAt` arrives at T2 (clock skew at the receive site, or a
   delayed message that was generated before the offline-marking).
3. `newHeartbeat == existing.LastHeartbeatAt` (kept), but the short-circuit
   condition fails because `existing.IsOnline == false`, so the CAS produces a
   new record with `IsOnline = true` and the **stale** `LastHeartbeatAt`.
4. On the very next `CheckForOfflineSites` tick (≤ `OfflineTimeout/2` later),
   `now - LastHeartbeatAt` is still ≥ `OfflineTimeout`, so the site is
   immediately marked offline again. The heartbeat brought it online for less
   than the check cadence, producing a "flap" in the dashboard.

In practice `receivedAt` is normally `_timeProvider.GetUtcNow()` at the
`CentralCommunicationActor` receive site, so it is monotonically increasing and
the bug is latent. But the contract `MarkHeartbeat(string siteId, DateTimeOffset receivedAt)`
makes no guarantee about ordering, and an out-of-order delivery (Akka remoting
ordering across connection re-establishment edge cases) or a small wall-clock
correction at central would expose it.

**Recommendation**

When transitioning offline → online, use `now` (from the injected
`TimeProvider`) rather than the caller-supplied `receivedAt` for
`LastHeartbeatAt`, or take `max(receivedAt, _timeProvider.GetUtcNow())` so the
recovery point is always recent. A unit test should drive `MarkHeartbeat` with a
`receivedAt` older than the last stored heartbeat on an offline site, call
`CheckForOfflineSites` immediately afterwards, and assert that the site stays
online.

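The `max(receivedAt, _timeProvider.GetUtcNow())` variant of the HealthMonitoring-020 recommendation is sketched below, assuming .NET 8's `TimeProvider`. This is not the aggregator's actual code: `SiteState`, `HeartbeatTracker`, and `FixedTimeProvider` are simplified, hypothetical types that keep only the two fields the finding discusses.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical, reduced version of the per-site state record.
public sealed record SiteState(DateTimeOffset LastHeartbeatAt, bool IsOnline);

public sealed class HeartbeatTracker
{
    private readonly ConcurrentDictionary<string, SiteState> _sites = new();
    private readonly TimeProvider _timeProvider;
    private readonly TimeSpan _offlineTimeout;

    public HeartbeatTracker(TimeProvider timeProvider, TimeSpan offlineTimeout)
    {
        _timeProvider = timeProvider;
        _offlineTimeout = offlineTimeout;
    }

    public void MarkHeartbeat(string siteId, DateTimeOffset receivedAt)
    {
        while (true)
        {
            if (!_sites.TryGetValue(siteId, out var existing))
            {
                if (_sites.TryAdd(siteId, new SiteState(receivedAt, true))) return;
                continue; // lost the race to add; retry against the stored value
            }

            var newHeartbeat = receivedAt > existing.LastHeartbeatAt
                ? receivedAt
                : existing.LastHeartbeatAt;

            if (!existing.IsOnline)
            {
                // Offline -> online: never recover with a timestamp older than "now".
                var now = _timeProvider.GetUtcNow();
                if (now > newHeartbeat) newHeartbeat = now;
            }
            else if (newHeartbeat == existing.LastHeartbeatAt)
            {
                return; // nothing to change
            }

            if (_sites.TryUpdate(siteId, new SiteState(newHeartbeat, true), existing)) return;
        }
    }

    public void CheckForOfflineSites()
    {
        var now = _timeProvider.GetUtcNow();
        foreach (var kvp in _sites)
        {
            var state = kvp.Value;
            if (state.IsOnline && now - state.LastHeartbeatAt >= _offlineTimeout)
                _sites.TryUpdate(kvp.Key, state with { IsOnline = false }, state);
        }
    }

    public bool IsOnline(string siteId) =>
        _sites.TryGetValue(siteId, out var state) && state.IsOnline;
}

// Settable clock for tests.
public sealed class FixedTimeProvider : TimeProvider
{
    public DateTimeOffset Now { get; set; }
    public override DateTimeOffset GetUtcNow() => Now;
}
```

Reading the clock only on the offline-to-online transition leaves the steady-state path unchanged, so in-order heartbeats still store the caller-supplied `receivedAt`.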
### HealthMonitoring-021 — `CentralSiteId = "central"` reserved constant silently collides with a real site named "central"

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:22`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:224-226` |

**Description**

`CentralHealthAggregator.CheckForOfflineSites` looks up the per-site offline
timeout with:

```csharp
var timeout = kvp.Key == CentralHealthReportLoop.CentralSiteId
    ? _options.CentralOfflineTimeout
    : _options.OfflineTimeout;
```

`CentralSiteId` is the literal string `"central"`. Site IDs are free-form
strings set in configuration / the Sites repository; there is no validation
that excludes the reserved `"central"` name. An operator who creates a real
site with `SiteId = "central"` will hit two problems:

- The real site's reports arriving via `ProcessReport` are stored in the same
  dictionary slot as the central self-report (they share the keyspace), so the
  central self-report and the real-site report repeatedly overwrite each
  other via the sequence-number guard: whichever has the higher Unix-ms seed
  wins, and the other is silently rejected as stale. The dashboard alternates
  between two unrelated payloads.
- The real site gets the longer `CentralOfflineTimeout` (default 3 minutes)
  instead of the normal `OfflineTimeout` (60 s), so a genuinely-failed real
  site named "central" stays falsely online for an extra two minutes.

**Recommendation**

Two options:

1. Reject the reserved name at the Site entity / configuration validation
   layer (Configuration Database component, out of this module's scope) and
   document `"central"` as reserved. This is the cleaner UX fix.
2. As defence-in-depth inside HealthMonitoring, store the central
   self-report under a key that cannot collide, e.g. prefix it with a
   character that is forbidden in real site IDs (`":central"` or `"#central"`),
   and adjust `CheckForOfflineSites` accordingly.

Either fix should include a regression test creating a real `SiteHealthReport`
with `SiteId = "central"` and asserting the central self-report's identity is
preserved.

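Both options for HealthMonitoring-021 can be combined in a few lines, as the sketch below tries to show. The `SiteIdRules` helper is hypothetical, and two things are assumed rather than taken from the codebase: that `:` is not a legal character in real site IDs, and that the reserved-name check should be case-insensitive.

```csharp
using System;

// Hypothetical helper; not part of the reviewed module.
public static class SiteIdRules
{
    public const string CentralSiteId = "central";

    // Assumption: ':' never appears in a real site ID, so this key cannot collide.
    public const string CentralStorageKey = ":central";

    // Assumption: "Central" and "CENTRAL" should be rejected too.
    public static bool IsReserved(string siteId) =>
        string.Equals(siteId, CentralSiteId, StringComparison.OrdinalIgnoreCase);

    // Option 1: reject the reserved name (and the reserved prefix character)
    // when a real site is created or configured.
    public static void ValidateRealSiteId(string siteId)
    {
        if (string.IsNullOrWhiteSpace(siteId))
            throw new ArgumentException("Site ID must not be empty.", nameof(siteId));
        if (IsReserved(siteId) || siteId.Contains(':'))
            throw new ArgumentException($"Site ID '{siteId}' is reserved or contains ':'.", nameof(siteId));
    }

    // Option 2: the timeout lookup keys off the non-colliding storage key, so a
    // real site can never pick up the longer central timeout by accident.
    public static TimeSpan TimeoutFor(string storageKey, TimeSpan offlineTimeout, TimeSpan centralOfflineTimeout) =>
        storageKey == CentralStorageKey ? centralOfflineTimeout : offlineTimeout;
}
```

With the storage key in place, a real site that somehow still carries the ID `"central"` is treated like any other site: it gets its own dictionary slot and the normal `OfflineTimeout`.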
### HealthMonitoring-022 — `CentralHealthReportLoopTests` uses real-time `PeriodicTimer` + `Task.Delay`; flake-prone on slow CI

| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | `tests/ScadaLink.HealthMonitoring.Tests/CentralHealthReportLoopTests.cs:32-42` |

**Description**

`RunLoopBriefly` starts the hosted service with a 50 ms `PeriodicTimer` and
then calls `await Task.Delay(runForMs, CancellationToken.None)` (with `runForMs`
between 150 ms and 300 ms). `GeneratesCentralReports_WhenSelfIsPrimary` and
`AssignsMonotonicSequenceNumbers` both assert "at least 2 reports were
generated" within the window. On a heavily-contended CI runner, where the
hosted-service start-up plus a couple of `PeriodicTimer` ticks can blow past
300 ms, these tests will flake.

The rest of the suite (`CentralHealthAggregatorTests`, `SiteHealthCollectorTests`,
and, partially, `HealthReportSenderTests`) was deliberately refactored to use the
injected `TimeProvider` precisely to avoid this. `CentralHealthReportLoop` and
`HealthReportSender` already accept a `TimeProvider`, but the loop's
`PeriodicTimer` still runs in real time because it is not driven by the
injected `TimeProvider`.

**Recommendation**

Either (a) accept the timing-sensitivity and bump the delay budget
generously, or (b) refactor the hosted-service loop to use a
`TimeProvider.CreateTimer`-based tick mechanism so the test can advance a
fake clock and assert deterministically how many ticks fire. Option (b) is
the better long-term fix and matches the pattern used elsewhere in the
module's tests.

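Option (b) might be approached as in the sketch below, assuming .NET 8's `TimeProvider.CreateTimer` and `ITimer`. `TickSource` and `ManualTimeProvider` are hypothetical names written for this illustration; the manual provider is deliberately minimal (single-threaded, no `Change` support) and is not a substitute for a maintained fake clock.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical tick mechanism: the period is driven by the injected TimeProvider.
public sealed class TickSource : IDisposable
{
    private readonly ITimer _timer;

    public TickSource(TimeProvider timeProvider, TimeSpan period, Action onTick) =>
        _timer = timeProvider.CreateTimer(_ => onTick(), null, period, period);

    public void Dispose() => _timer.Dispose();
}

// Minimal manual clock for tests; timers fire only when Advance is called.
public sealed class ManualTimeProvider : TimeProvider
{
    private readonly List<ManualTimer> _timers = new();
    private DateTimeOffset _now = DateTimeOffset.UnixEpoch;

    public override DateTimeOffset GetUtcNow() => _now;

    public override ITimer CreateTimer(TimerCallback callback, object? state, TimeSpan dueTime, TimeSpan period)
    {
        var timer = new ManualTimer(callback, state, _now + dueTime, period);
        _timers.Add(timer);
        return timer;
    }

    // Moves the clock forward and synchronously fires every tick that falls due.
    public void Advance(TimeSpan by)
    {
        var target = _now + by;
        foreach (var timer in _timers.ToArray())
            timer.FireUntil(target);
        _now = target;
    }

    private sealed class ManualTimer : ITimer
    {
        private readonly TimerCallback _callback;
        private readonly object? _state;
        private readonly TimeSpan _period;
        private DateTimeOffset _next;
        private bool _disposed;

        public ManualTimer(TimerCallback callback, object? state, DateTimeOffset firstDue, TimeSpan period)
        {
            _callback = callback;
            _state = state;
            _next = firstDue;
            _period = period;
        }

        public void FireUntil(DateTimeOffset target)
        {
            while (!_disposed && _next <= target)
            {
                _callback(_state);
                if (_period <= TimeSpan.Zero) { _disposed = true; break; } // one-shot timer
                _next += _period;
            }
        }

        // Rescheduling is not supported in this sketch.
        public bool Change(TimeSpan dueTime, TimeSpan period) => false;

        public void Dispose() => _disposed = true;

        public ValueTask DisposeAsync()
        {
            Dispose();
            return ValueTask.CompletedTask;
        }
    }
}
```

A test can then create a `TickSource` with a 50 ms period, call `Advance(TimeSpan.FromMilliseconds(120))`, and assert that exactly two ticks fired, with no dependence on CI runner speed. .NET 8 also provides a `PeriodicTimer(TimeSpan, TimeProvider)` constructor, which may be a smaller change if the loop keeps its `PeriodicTimer`.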
### HealthMonitoring-023 — `StoreAndForwardBufferDepths_IsEmptyPlaceholder` test name is stale; it now covers the default-state contract, not a placeholder

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `tests/ScadaLink.HealthMonitoring.Tests/SiteHealthCollectorTests.cs:117-122` |

**Description**

The test `StoreAndForwardBufferDepths_IsEmptyPlaceholder` was originally named
to codify the HealthMonitoring-001 bug ("`SetStoreAndForwardDepths` has no
callers, so `StoreAndForwardBufferDepths` is always empty"). HealthMonitoring-001
is `Resolved`: `HealthReportSender` now populates per-category depths from
the S&F engine, and the same test class has `SetStoreAndForwardDepths_ReflectedInReport`
covering the populated path. The "placeholder" test still passes because it
constructs a fresh collector and never calls the setter, so its assertion
(`Assert.Empty(report.StoreAndForwardBufferDepths)`) is now testing the
**default empty state of an un-configured collector**. The HealthMonitoring-001
resolution note explicitly chose to keep it as "the collector-level
default-state test", but the test method name and the implied semantics no
longer match.

A maintainer reading the test name today will misread it as documentation that
the metric is unimplemented (which it isn't), and may waste time investigating
a non-bug.

**Recommendation**

Rename to `StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled`
(or similar) and align the test body's stated intent with the new name. This is
purely a documentation / maintainability fix; no behaviour change.
