fix(health-monitoring): resolve HealthMonitoring-003..009 — central offline grace, register unknown-site heartbeats, test coverage
This commit is contained in:
@@ -8,7 +8,7 @@
|
||||
| Last reviewed | 2026-05-16 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `9c60592` |
|
||||
| Open findings | 10 |
|
||||
| Open findings | 5 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -143,10 +143,10 @@ and no lost updates.
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Severity | Medium — re-triaged: already resolved as a side-effect of HealthMonitoring-002. |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:55-78` |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:45-103` |
|
||||
|
||||
**Description**
|
||||
|
||||
@@ -169,7 +169,17 @@ reference swap with no observable intermediate state.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit `pending`). Re-triaged: verified against the current
|
||||
source — the root cause was already eliminated by the HealthMonitoring-002 fix.
|
||||
`ProcessReport` no longer uses `AddOrUpdate` at all; it is now an explicit
|
||||
compare-and-swap retry loop (`TryGetValue` → guard → `TryAdd`/`TryUpdate`) that
|
||||
produces a brand-new immutable `SiteHealthState` record per transition and never
|
||||
mutates a stored value in place. The sequence-number guard and the field writes are
|
||||
evaluated against the value actually installed by the CAS, so the "only replace if
|
||||
sequence is higher" invariant holds. The concurrency stress test
|
||||
`CentralHealthAggregatorTests.ProcessReport_ConcurrentUpdates_NeverLoseSequenceOrTearState`
|
||||
(added under HealthMonitoring-002) already exercises this path and asserts no lost
|
||||
updates and no torn snapshots. No further code change was required for this finding.
|
||||
|
||||
### HealthMonitoring-004 — Inconsistent heartbeat interval described across XML docs
|
||||
|
||||
@@ -205,7 +215,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:48-81`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:149` |
|
||||
|
||||
**Description**
|
||||
@@ -230,7 +240,18 @@ leader starts the report loop promptly on failover.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit `pending`). Applied the distinct-timeout option. A new
|
||||
`HealthMonitoringOptions.CentralOfflineTimeout` (default 3x the report interval =
|
||||
3 minutes) is applied by `CentralHealthAggregator.CheckForOfflineSites` to the
|
||||
`central` keyspace entry only — real sites keep the existing `OfflineTimeout`. This
|
||||
gives the synthetic `central` site (which has no heartbeat source and is fed solely
|
||||
by the 30s leader-only `CentralHealthReportLoop`) enough grace to survive a single
|
||||
skipped or late self-report — the equivalent of the "one missed report" grace the
|
||||
design doc grants real sites — while still going offline on genuine total loss.
|
||||
Regression tests `CentralHealthAggregatorTests.OfflineDetection_CentralSite_HasLongerGraceThanRealSites`
|
||||
(central survives 75s of silence while a real site goes offline) and
|
||||
`OfflineDetection_CentralSite_StillGoesOfflineOnGenuineLoss` (central still detected
|
||||
offline after 10 minutes) verify the behaviour.
|
||||
|
||||
### HealthMonitoring-006 — Sequence seeding contradicts the doc's "starting at 1" wording and is untestable
|
||||
|
||||
@@ -270,7 +291,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86-99` |
|
||||
|
||||
**Description**
|
||||
@@ -293,16 +314,29 @@ after a central restart.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit `pending`). `CentralHealthAggregator.MarkHeartbeat` no
|
||||
longer returns early for an unknown site. When a heartbeat arrives for a site with no
|
||||
aggregator state, it now atomically registers (`TryAdd`, with CAS-loss retry) a
|
||||
minimal `SiteHealthState` that is `IsOnline = true`, `LatestReport = null`,
|
||||
`LastSequenceNumber = 0` and `LastHeartbeatAt = receivedAt` — an "online, awaiting
|
||||
first report" state. This relies on the HealthMonitoring-002 change that made
|
||||
`LatestReport` properly nullable, so UI consumers already handle the null case.
|
||||
Reachable sites therefore show online immediately after a central restart/failover
|
||||
instead of being absent ("unknown") for up to a full report interval. The
|
||||
`ICentralHealthAggregator.MarkHeartbeat` XML doc was corrected to describe the new
|
||||
behaviour. Regression test
|
||||
`CentralHealthAggregatorTests.MarkHeartbeat_RegistersUnknownSite_AsOnlineAwaitingReport`
|
||||
verifies the registration; `MarkHeartbeat_KeepsSiteOnline_BetweenReports` and
|
||||
`MarkHeartbeat_BringsOfflineSiteBackOnline` cover the already-registered paths.
|
||||
|
||||
### HealthMonitoring-008 — `GetAllSiteStates` / `GetSiteState` leak live mutable state objects to callers
|
||||
|
||||
| | |
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Severity | Medium — re-triaged: already resolved as a side-effect of HealthMonitoring-002. |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:104-116` |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:146-158` |
|
||||
|
||||
**Description**
|
||||
|
||||
@@ -323,7 +357,15 @@ state into an immutable DTO before returning.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit `pending`). Re-triaged: verified against the current
|
||||
source — the root cause was already eliminated by the HealthMonitoring-002 fix.
|
||||
`SiteHealthState` is now a `sealed record` with `init`-only properties (fully
|
||||
immutable). Every aggregator transition installs a brand-new record instance via an
|
||||
atomic compare-and-swap, so the references `GetAllSiteStates` and `GetSiteState` hand
|
||||
out are immutable snapshots — a UI consumer reading one on its own thread can never
|
||||
observe a torn or half-applied state, and cannot mutate aggregator state through the
|
||||
returned reference. The recommended fix (make `SiteHealthState` a record) is exactly
|
||||
what the HealthMonitoring-002 change did, so no further code change was required.
|
||||
|
||||
### HealthMonitoring-009 — Missing test coverage for central report loop, heartbeat path, replication, and collector setters
|
||||
|
||||
@@ -331,7 +373,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Testing coverage |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `tests/ScadaLink.HealthMonitoring.Tests/` |
|
||||
|
||||
**Description**
|
||||
@@ -358,7 +400,29 @@ setters' presence in `CollectReport` output.
|
||||
|
||||
**Resolution**
|
||||
|
||||
_Unresolved._
|
||||
Resolved 2026-05-16 (commit `pending`). Added the missing coverage:
|
||||
|
||||
- **`CentralHealthReportLoopTests`** (new file) — `GeneratesCentralReports_WhenSelfIsPrimary`,
|
||||
`GeneratesNoReports_WhenNotPrimary` (leader-only `SelfIsPrimary` gating with a fake
|
||||
`IClusterNodeProvider`), `AssignsMonotonicSequenceNumbers`, and
|
||||
`SetsActiveNodeFlag_EvenWhenNotPrimary`.
|
||||
- **`CentralHealthAggregatorTests`** — `MarkHeartbeat_RegistersUnknownSite_AsOnlineAwaitingReport`,
|
||||
`MarkHeartbeat_KeepsSiteOnline_BetweenReports` (heartbeat keeps a site online past
|
||||
the offline timeout — the path the design depends on), and
|
||||
`MarkHeartbeat_BringsOfflineSiteBackOnline`.
|
||||
- **`SiteHealthCollectorTests`** — reflected-in-report tests for `SetClusterNodes`,
|
||||
`SetInstanceCounts`, `SetParkedMessageCount`, `SetNodeHostname`,
|
||||
`SetActiveNode`/`NodeRole`, `UpdateTagQuality`, `UpdateConnectionEndpoint`, and
|
||||
`SetStoreAndForwardDepths`.
|
||||
|
||||
The `SiteHealthReportReplica` idempotency item is **out of scope** for this module:
|
||||
`SiteHealthReportReplica` is declared in `ScadaLink.Commons` and published/consumed by
|
||||
`CentralCommunicationActor` in the `ScadaLink.Communication` module — the
|
||||
HealthMonitoring module itself has no replication code. Replica double-delivery
|
||||
idempotency is already covered by `ProcessReport`'s sequence-number guard
|
||||
(`ProcessReport_RejectsEqualSequence`, `ProcessReport_RejectsStaleReport_WhenSequenceNotGreater`);
|
||||
testing the actor-side double-publish belongs in the Communication module's review.
|
||||
The HealthMonitoring test suite now stands at 47 passing tests (was 30).
|
||||
|
||||
### HealthMonitoring-010 — `HealthReportSender` silently swallows inner failures with bare `catch {}`
|
||||
|
||||
|
||||
Reference in New Issue
Block a user