Files
ScadaBridge/code-reviews/HealthMonitoring/findings.md
T
Joseph Doherty 344379a40a fix(utc/locale): close Theme 2 — 8 UTC / time / locale findings
UTC invariant + culture-safety fixes across UI form binding, audit entity
hydrate, and locale-dependent parses. Highlights:
- CentralUI-026/027: AuditFilterBar / SiteCallsReport / NotificationReport /
  EventLogs now apply SpecifyKind(Local) + ToUniversalTime() at form submit
  so browser-local datetime-local inputs aren't silently treated as UTC.
- Commons-019: AuditEvent.OccurredAtUtc / IngestedAtUtc init-setters
  re-tag any incoming DateTime as Kind=Utc, documenting the invariant.
- CD-018: AuditLogEntityTypeConfiguration adds UTC ValueConverters on the
  *Utc DateTime columns so EF hydrate yields Kind=Utc (SQL Server's
  datetime2 has no Kind metadata, so reads were returning Unspecified).
- CD-020: GetPartitionBoundariesOlderThanAsync now SpecifyKind(Utc) on the
  raw-ADO read, matching the existing defence in AuditLogPartitionMaintenance.
- SEL-021: EventLogQueryService.DateTimeOffset.Parse now uses
  InvariantCulture + AssumeUniversal | AdjustToUniversal.
- SR-023: Convert.ToDouble in ScriptActor + AlarmActor (4 sites) now
  passes InvariantCulture so non-US locales don't mis-parse string values.
- HM-020: CentralHealthAggregator.MarkHeartbeat anchors LastHeartbeatAt to
  max(receivedAt, now) on offline→online so a stale receivedAt can't
  leave a recovered site one tick from re-going-offline.

3 new tests added (AuditLog UTC converter, AuditFilterBar/EventLogs/
NotificationReport-touching CentralUI tests already cover Apply paths,
heartbeat offline→online). Build clean; ConfigurationDatabase 236,
Commons 330, HealthMonitoring 71, SiteRuntime 301, SiteEventLogging 50,
CentralUI 50 — all green. README regenerated: 104 open (was 112).
2026-05-28 06:36:44 -04:00

66 KiB

Code Review — HealthMonitoring

Field Value
Module src/ScadaLink.HealthMonitoring
Design doc docs/requirements/Component-HealthMonitoring.md
Status Reviewed
Last reviewed 2026-05-28
Reviewer claude-agent
Commit reviewed 1eb6e97
Open findings 6

Summary

The HealthMonitoring module is small, readable, and broadly faithful to the design intent: per-interval error counters with atomic read-and-reset, monotonic sequence numbers with Unix-ms seeding to survive failover, sequence-guarded staleness rejection, and a 60s offline timeout. However, the review surfaced two recurring themes. First, a documented metric is silently unimplemented — store-and-forward buffer depths are never populated (SetStoreAndForwardDepths has zero callers and a test asserts the field is always empty), so the dashboard cannot show the buffer depth metric the design doc requires. Second, the central aggregator's in-memory state model has unguarded shared mutable state: SiteHealthState is a mutable class whose fields are written by a background timer thread, by ProcessReport, and by MarkHeartbeat with no synchronization, and the same live mutable objects are handed straight to UI callers via GetAllSiteStates. The ProcessReport logic also mutates shared state inside a ConcurrentDictionary.AddOrUpdate update delegate, which the runtime may invoke more than once under contention. Additionally there are gaps around central self-report offline detection, heartbeats for not-yet-registered sites being dropped, and missing test coverage for the central report loop, heartbeat path, and most collector setters. None of the findings are crash-class, but the concurrency issues are Medium/High and the missing S&F metric is a real design-adherence gap.

Re-review 2026-05-17 (commit 39d737e)

All twelve prior findings (HealthMonitoring-001..012) are confirmed ResolvedSiteHealthState is now an immutable sealed record mutated only via atomic compare-and-swap, the store-and-forward buffer-depth metric is populated, the central-site offline grace and the unknown-site heartbeat registration are in place, and the test suite has grown to cover the report loop, heartbeat path, and collector setters. This re-review found 4 new findings, all Low/Medium, none crash-class. They are residual polish items rather than behaviour regressions: an inaccurate offline-check-interval comment (HealthMonitoring-013), unvalidated HealthMonitoringOptions intervals that crash the hosted service on misconfiguration (HealthMonitoring-014), a heartbeat-only registered site left with a year-0001 LastReportReceivedAt that the UI's staleness display would otherwise render verbatim (HealthMonitoring-015 — resolved via a coordinated HealthMonitoring + CentralUI change), and CollectReport reading DateTimeOffset.UtcNow directly instead of the module's now-standard injected TimeProvider (HealthMonitoring-016). The module remains small, readable, and broadly faithful to the design intent.

Re-review 2026-05-28 (commit 1eb6e97)

All sixteen prior findings (HealthMonitoring-001..016) remain Resolved. This baseline re-review applied the full 10-category checklist and produced 7 new findings (1 Medium, 6 Low — none crash-class). The most material observation is a metric-loss race in HealthReportSender.ExecuteAsync (HealthMonitoring-017): CollectReport resets the per-interval error counters (ScriptErrorCount, AlarmEvaluationErrorCount, DeadLetterCount, SiteAuditWriteFailures, AuditRedactionFailure) before _transport.Send(...) is attempted, so a transport failure (the existing catch { LogError; } path) silently discards every error this site recorded in the failed interval — the module-specific concern of "metric counters drifting from raw-per-interval to cumulative" inverted into drifting to lost. A parallel hazard exists in CentralHealthReportLoop (HealthMonitoring-018). The remaining items are smaller: two Audit Log metrics (SiteAuditTelemetryStalled, CentralAuditWriteFailures) listed in the design doc never make it into a HealthMonitoring surface (HealthMonitoring-019); a heartbeat with receivedAt <= existing.LastHeartbeatAt brings an offline site back online with a stale heartbeat that can flap right back to offline on the next check (HealthMonitoring-020); the reserved CentralSiteId = "central" constant collides with any real site named "central" and silently extends its offline grace (HealthMonitoring-021); CentralHealthReportLoopTests uses real wall-clock 50 ms timers + Task.Delay, making it timing-sensitive (HealthMonitoring-022); and one obsolete placeholder test name (StoreAndForwardBufferDepths_IsEmptyPlaceholder) misrepresents what it now covers (HealthMonitoring-023). All sequence-number and offline-detection arithmetic uses _timeProvider.GetUtcNow() consistently — no wall-clock vs monotonic mismatch was observed.

Checklist coverage

# Category Examined Notes
1 Correctness & logic bugs x MarkHeartbeat drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). Re-review: heartbeat-registered site left with year-0001 LastReportReceivedAt (HealthMonitoring-015).
2 Akka.NET conventions x Module itself contains no actors (transport abstracted via IHealthReportTransport); AddHealthMonitoringActors is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope.
3 Concurrency & thread safety x Unguarded mutable SiteHealthState (HealthMonitoring-002); mutation inside AddOrUpdate delegate (HealthMonitoring-003); GetAllSiteStates leaks live mutable references (HealthMonitoring-008). Collector counters correctly use Interlocked.
4 Error handling & resilience x HealthReportSender silently swallows inner failures with bare catch {} (HealthMonitoring-010, resolved); top-level loop error handling is sound. Re-review: HealthMonitoringOptions intervals unvalidated — a zero/negative value crashes the hosted service at PeriodicTimer construction (HealthMonitoring-014).
5 Security x No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface.
6 Performance & resource management x PeriodicTimer instances correctly disposed via using. Dictionary snapshots per report are acceptable at the documented scale. No issues found.
7 Design-document adherence x Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006).
8 Code organization & conventions x Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011, resolved). Re-review: SiteHealthCollector.CollectReport reads DateTimeOffset.UtcNow directly instead of the module's now-standard injected TimeProvider (HealthMonitoring-016).
9 Testing coverage x No tests for CentralHealthReportLoop, MarkHeartbeat, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009).
10 Documentation & comments x Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); LatestReport = null! misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses OfflineTimeout (HealthMonitoring-013).

Re-review (2026-05-28, 1eb6e97):

# Category Examined Notes
1 Correctness & logic bugs x HealthReportSender and CentralHealthReportLoop reset per-interval counters before the send/process call — counts lost on transport failure (HealthMonitoring-017, HealthMonitoring-018). MarkHeartbeat brings an offline site back online with a stale LastHeartbeatAt when receivedAt <= existing.LastHeartbeatAt — site can flap straight back to offline (HealthMonitoring-020). CentralSiteId = "central" reserved constant silently collides with any real site named "central" (HealthMonitoring-021).
2 Akka.NET conventions x Module contains no actors itself. IHealthReportTransport cleanly abstracts the Akka-remoting send. ProcessReport/MarkHeartbeat are called from CentralCommunicationActor's receive — invoked on the actor thread but the aggregator's CAS loops make that safe regardless. No issues found.
3 Concurrency & thread safety x Verified the resolved SiteHealthState immutable-record / CAS-loop pattern still holds across ProcessReport, MarkHeartbeat, CheckForOfflineSites. SiteHealthCollector uses volatile for reference fields (_clusterNodes, _nodeHostname, _siteAuditBacklog, _isActiveNode) and Interlocked for counters consistently. CollectReport's new Dictionary<>(concurrentDict) snapshots are not strictly atomic but acceptable at the documented scale. No new issues found.
4 Error handling & resilience x try/catch blocks now log all non-fatal failures (resolved HealthMonitoring-010 still in place). Outer catch (Exception) in ExecuteAsync keeps the loop alive — sound. New: the counter-reset-before-send issue (HealthMonitoring-017, HealthMonitoring-018) is an error-handling gap — transport failure silently swallows the interval's metric data.
5 Security x No issues found. The module handles only numeric/string operational metrics; no secrets, auth surface, or untrusted input parsing. MarkHeartbeat and ProcessReport trust the caller (intra-cluster).
6 Performance & resource management x PeriodicTimer instances disposed via using. CAS retry loops in ProcessReport/MarkHeartbeat have no bounded retry cap but contention is the dictionary-size limit (one entry per site) so the loop is effectively wait-free for the common case. No issues found.
7 Design-document adherence x SiteAuditTelemetryStalled and CentralAuditWriteFailures are listed as required dashboard tiles in Component-HealthMonitoring.md but have no HealthMonitoring-side surface — both live only in AuditLog's AuditCentralHealthSnapshot with no integration into the health aggregator or its consumers (HealthMonitoring-019).
8 Code organization & conventions x Options class correctly owned by the component, validator registered idempotently across all three Add* paths. POCO/messages in Commons. AddCentralHealthAggregation implicitly depends on ISiteHealthCollector being registered elsewhere (Host calls AddHealthMonitoring() first) — works but is a hidden ordering requirement. Minor; not flagged.
9 Testing coverage x Per-interval reset semantics covered for site-side counters but NOT for the failed-send case (no test asserts counters remain accumulated when the transport throws — would catch HealthMonitoring-017). CentralHealthReportLoopTests uses real wall-clock 50 ms PeriodicTimer + Task.Delay(250) for timing — flake-prone on a slow CI runner (HealthMonitoring-022). The placeholder test StoreAndForwardBufferDepths_IsEmptyPlaceholder name is stale (HealthMonitoring-023).
10 Documentation & comments x XML docs in the new audit-bridge surfaces (IncrementSiteAuditWriteFailures, IncrementAuditRedactionFailure, UpdateSiteAuditBacklog) are accurate. The stale placeholder test name is the only issue (HealthMonitoring-023).

Findings

HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated

Severity High
Category Design-document adherence
Status Resolved
Location src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:104, src/ScadaLink.HealthMonitoring/HealthReportSender.cs:79

Description

Component-HealthMonitoring.md lists "Store-and-forward buffer depth" (pending messages by category) as a required monitored metric. SiteHealthCollector exposes SetStoreAndForwardDepths(...) to receive it, but a codebase-wide search shows the method has no callers_sfBufferDepths always remains the empty dictionary it is initialized to. HealthReportSender queries GetParkedMessageCountAsync() and sets ParkedMessageCount, but parked count is a distinct metric from per-category buffer depth. The test SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder even codifies the unimplemented state as expected behaviour. The result is that the central dashboard cannot display buffer depth, a documented triage metric.

Recommendation

Wire SetStoreAndForwardDepths into HealthReportSender.ExecuteAsync (alongside the existing parked-count call) using the S&F engine's per-category depth API, or, if the metric is intentionally deferred, record that decision in the design doc and remove the dead setter. Update the placeholder test accordingly once implemented.

Resolution

Resolved 2026-05-16 (commit <pending>). HealthReportSender.ExecuteAsync now queries the existing public StoreAndForwardStorage.GetBufferDepthByCategoryAsync() API alongside the parked-count call and feeds the per-category depths into SiteHealthCollector.SetStoreAndForwardDepths (category enum names as keys), so the documented store-and-forward buffer depth metric is populated in every emitted report. Regression test HealthReportSenderTests.ReportsIncludeStoreAndForwardBufferDepthsFromStorage verifies populated per-category depths. The obsolete placeholder test SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder continues to pass — it only exercises the collector with no setter call and still correctly asserts the empty default; it was left in place as the collector-level default-state test. No StoreAndForward source was modified (existing public API only).

HealthMonitoring-002 — SiteHealthState mutable fields written from multiple threads without synchronization

Severity High
Category Concurrency & thread safety
Status Resolved
Location src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11, src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86, src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:137

Description

SiteHealthState is a plain mutable class. Its fields (LatestReport, LastReportReceivedAt, LastHeartbeatAt, LastSequenceNumber, IsOnline) are mutated from at least three concurrent contexts: ProcessReport (caller thread — ClusterClient/PubSub message handlers), MarkHeartbeat (caller thread — heartbeat handler), and CheckForOfflineSites (the BackgroundService timer thread). The ConcurrentDictionary only protects the dictionary structure, not the objects it stores. A heartbeat update and the offline-check can interleave on the same SiteHealthState instance, and reads/writes of DateTimeOffset (a 16-byte struct) and long fields are not guaranteed atomic on all platforms — producing torn reads and lost updates of IsOnline/LastHeartbeatAt.

Recommendation

Make state transitions atomic: either guard all reads/writes of a SiteHealthState with a per-site lock, or replace SiteHealthState with an immutable record updated via ConcurrentDictionary compare-and-swap (TryUpdate) so every transition is a single atomic reference swap.

Resolution

Resolved 2026-05-16 (commit <pending>). SiteHealthState is now a sealed record with init-only properties. CentralHealthAggregator.ProcessReport, MarkHeartbeat, and CheckForOfflineSites were rewritten to perform every state transition as an atomic compare-and-swap (TryAdd/TryUpdate) producing a new record instance — no field of a stored state is ever mutated in place. ProcessReport uses an explicit CAS retry loop instead of the AddOrUpdate update delegate so the sequence-number guard and the field writes are evaluated against the value actually installed (this also closes the root cause behind HealthMonitoring-003). Reads via GetAllSiteStates/GetSiteState now hand out immutable snapshots, so a concurrent reader can never observe a torn or half-applied state. LatestReport was changed from SiteHealthReport (null!) to SiteHealthReport?, making the contract honest; all existing consumers (CentralUI, integration/perf tests) already null-checked it and continue to build clean. Regression test CentralHealthAggregatorTests.ProcessReport_ConcurrentUpdates_NeverLoseSequenceOrTearState exercises concurrent report/heartbeat/read threads and asserts snapshot consistency and no lost updates.

HealthMonitoring-003 — Shared state mutated inside ConcurrentDictionary.AddOrUpdate update delegate

Severity Medium — re-triaged: already resolved as a side-effect of HealthMonitoring-002.
Category Concurrency & thread safety
Status Resolved
Location src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:45-103

Description

The update delegate passed to AddOrUpdate mutates the existing object in place (existing.LatestReport = report; existing.IsOnline = true; ...). AddOrUpdate's contract explicitly allows the update delegate to be invoked more than once under contention (when the CAS that installs the result loses a race and is retried). Each invocation mutates the shared object, so a concurrent report for the same site can observe a half-applied update, and the multi-field assignment is not atomic with respect to readers in GetAllSiteStates/CheckForOfflineSites. The intended "only replace if sequence is higher" guard can also be subverted because the sequence comparison and the field writes are not a single atomic step.

Recommendation

Have the update delegate return a new SiteHealthState (record with copy) rather than mutating existing, and treat the dictionary value as immutable. Combined with HealthMonitoring-002, this makes every state transition an atomic reference swap with no observable intermediate state.

Resolution

Resolved 2026-05-16 (commit pending). Re-triaged: verified against the current source — the root cause was already eliminated by the HealthMonitoring-002 fix. ProcessReport no longer uses AddOrUpdate at all; it is now an explicit compare-and-swap retry loop (TryGetValue → guard → TryAdd/TryUpdate) that produces a brand-new immutable SiteHealthState record per transition and never mutates a stored value in place. The sequence-number guard and the field writes are evaluated against the value actually installed by the CAS, so the "only replace if sequence is higher" invariant holds. The concurrency stress test CentralHealthAggregatorTests.ProcessReport_ConcurrentUpdates_NeverLoseSequenceOrTearState (added under HealthMonitoring-002) already exercises this path and asserts no lost updates and no torn snapshots. No further code change was required for this finding.

HealthMonitoring-004 — Inconsistent heartbeat interval described across XML docs

Severity Low
Category Documentation & comments
Status Resolved
Location src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:146-148, src/ScadaLink.HealthMonitoring/SiteHealthState.cs:21, src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs:16

Description

The heartbeat cadence that offline detection relies on is documented inconsistently. CheckForOfflineSites says "heartbeats arrive every ~5s"; SiteHealthState.LastHeartbeatAt says "~5s heartbeat"; but ICentralHealthAggregator.MarkHeartbeat says "~2s heartbeats are arriving". The actual cadence is set elsewhere (Cluster Infrastructure / SiteCommunicationActor). Readers cannot reason about whether a 60s offline timeout gives the intended grace without a single authoritative number.

Recommendation

Pick the correct interval (verify against the heartbeat scheduler in SiteCommunicationActor/Cluster Infrastructure) and use it consistently in all three comments, ideally referencing the owning component rather than restating a magic number.

Resolution

Resolved 2026-05-16 (commit pending). Documentation-only — no regression test is meaningful. Verified the authoritative cadence against the Communication module: SiteCommunicationActor.PreStart schedules the application-level heartbeat to central at CommunicationOptions.TransportHeartbeatInterval, which defaults to 5 seconds (CommunicationOptions.cs:49). The stale "~2s" in ICentralHealthAggregator.MarkHeartbeat was corrected; all three XML docs (ICentralHealthAggregator.MarkHeartbeat, SiteHealthState.LastHeartbeatAt, CentralHealthAggregator.CheckForOfflineSites) now state the single authoritative ~5s figure and reference the owning component (Cluster Infrastructure / SiteCommunicationActorCommunicationOptions.TransportHeartbeatInterval) rather than restating a bare magic number, so readers can reason about the 60s offline grace.

HealthMonitoring-005 — Central self-report site can flap offline; no heartbeat grace like real sites

Severity Medium
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:48-81, src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:149

Description

CheckForOfflineSites decides offline status purely from LastHeartbeatAt, and for real sites that field is kept fresh by frequent (~2-5s) heartbeats so the 60s timeout only fires on genuine total loss. The synthetic central site, however, has no heartbeat source — LastHeartbeatAt is only bumped by ProcessReport from the 30s CentralHealthReportLoop. The loop also only runs on the cluster leader and silently skips a cycle on any exception. Consequently, a single skipped/late central self-report (leader GC pause, brief stall, mid-failover before the new leader's loop spins up) leaves central with no signal for >60s and it is marked offline even though the central cluster is healthy. The central card thus has no equivalent of the "one missed report grace" the design doc grants real sites.

Recommendation

Either feed central a heartbeat equivalent (e.g. have MarkHeartbeat called for CentralSiteId on a fast timer independent of the leader-only report loop), or apply a longer/distinct offline timeout to the central keyspace entry, and ensure the new leader starts the report loop promptly on failover.

Resolution

Resolved 2026-05-16 (commit pending). Applied the distinct-timeout option. A new HealthMonitoringOptions.CentralOfflineTimeout (default 3x the report interval = 3 minutes) is applied by CentralHealthAggregator.CheckForOfflineSites to the central keyspace entry only — real sites keep the existing OfflineTimeout. This gives the synthetic central site (which has no heartbeat source and is fed solely by the 30s leader-only CentralHealthReportLoop) enough grace to survive a single skipped or late self-report — the equivalent of the "one missed report" grace the design doc grants real sites — while still going offline on genuine total loss. Regression tests CentralHealthAggregatorTests.OfflineDetection_CentralSite_HasLongerGraceThanRealSites (central survives 75s of silence while a real site goes offline) and OfflineDetection_CentralSite_StillGoesOfflineOnGenuineLoss (central still detected offline after 10 minutes) verify the behaviour.

HealthMonitoring-006 — Sequence seeding contradicts the doc's "starting at 1" wording and is untestable

Severity Low
Category Design-document adherence
Status Resolved
Location src/ScadaLink.HealthMonitoring/HealthReportSender.cs:28, src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:32

Description

The HealthReportSender class XML summary states "Sequence numbers are monotonic, starting at 1, and reset on service restart." The implementation instead seeds _sequenceNumber with DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() so the first emitted sequence is a large epoch value, specifically to keep ordering correct across failover. The summary is therefore stale and contradicts the code. Separately, the seed reads DateTimeOffset.UtcNow directly at field initialization rather than through an injected TimeProvider (which CentralHealthAggregator already uses), making the seeding logic impossible to unit-test deterministically and dependent on node wall-clock agreement — if one node's clock lags, its post-failover reports can be silently rejected as stale by the aggregator.

Recommendation

Fix the HealthReportSender XML summary to describe the actual Unix-ms seeding strategy, and inject TimeProvider for the seed so the behaviour is testable and the clock dependency is explicit.

Resolution

Resolved 2026-05-16 (commit pending). The HealthReportSender XML summary was rewritten to describe the real strategy — sequence numbers are monotonic and restart-resetting but explicitly not zero/one-based; they are seeded with the current Unix epoch (ms) so a freshly-active node always sorts after the prior active. Both HealthReportSender and CentralHealthReportLoop now accept an optional TimeProvider (defaulting to TimeProvider.System) and derive the seed via GetUtcNow().ToUnixTimeMilliseconds() in the constructor body instead of reading DateTimeOffset.UtcNow at field initialization, so the seeding is deterministically testable and the clock dependency is explicit. CentralHealthReportLoop gained a CurrentSequenceNumber test accessor mirroring HealthReportSender. Regression tests HealthReportSenderTests.SequenceNumberSeed_UsesInjectedTimeProvider and CentralHealthReportLoopTests.SequenceNumberSeed_UsesInjectedTimeProvider assert the seed equals the injected provider's Unix-ms instant (these would not compile against the pre-fix code, which had no TimeProvider parameter).

HealthMonitoring-007 — Heartbeats for not-yet-registered sites are silently dropped

Severity Medium
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:86-99

Description

MarkHeartbeat returns immediately if the site is not already in _siteStates ("registration only happens on report"). Central health state is in-memory only and not persisted. After a central restart or failover the aggregator starts empty, so for up to one full report interval (default 30s) every site emits only heartbeats that are all discarded — the site is reported as unknown (absent from GetAllSiteStates) rather than online, even though heartbeats prove it is reachable. This is a visible dashboard regression precisely during the failover window, which is when operators most need accurate status.

Recommendation

Allow MarkHeartbeat to register a minimal SiteHealthState (online, no LatestReport yet, with a UI-visible "awaiting first report" indication) when a heartbeat arrives for an unknown site, so reachable sites show online immediately after a central restart.

Resolution

Resolved 2026-05-16 (commit pending). CentralHealthAggregator.MarkHeartbeat no longer returns early for an unknown site. When a heartbeat arrives for a site with no aggregator state, it now atomically registers (TryAdd, with CAS-loss retry) a minimal SiteHealthState that is IsOnline = true, LatestReport = null, LastSequenceNumber = 0 and LastHeartbeatAt = receivedAt — an "online, awaiting first report" state. This relies on the HealthMonitoring-002 change that made LatestReport properly nullable, so UI consumers already handle the null case. Reachable sites therefore show online immediately after a central restart/failover instead of being absent ("unknown") for up to a full report interval. The ICentralHealthAggregator.MarkHeartbeat XML doc was corrected to describe the new behaviour. Regression test CentralHealthAggregatorTests.MarkHeartbeat_RegistersUnknownSite_AsOnlineAwaitingReport verifies the registration; MarkHeartbeat_KeepsSiteOnline_BetweenReports and MarkHeartbeat_BringsOfflineSiteBackOnline cover the already-registered paths.

HealthMonitoring-008 — GetAllSiteStates / GetSiteState leak live mutable state objects to callers

Severity Medium — re-triaged: already resolved as a side-effect of HealthMonitoring-002.
Category Concurrency & thread safety
Status Resolved
Location src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:146-158

Description

GetAllSiteStates copies the dictionary but the copy still holds references to the same live mutable SiteHealthState instances; GetSiteState returns the live instance directly. UI consumers (Blazor Server / SignalR circuits) read these objects on their own threads while the aggregator's background timer and report handlers concurrently mutate the very same instances (see HealthMonitoring-002). A UI render can observe a SiteHealthState with, e.g., IsOnline == true but a LatestReport from a different update, or a torn DateTimeOffset. Callers could also mutate the shared state, corrupting aggregator state.

Recommendation

Return immutable snapshots: convert SiteHealthState to a record (per HealthMonitoring-002/003) so handing out the reference is safe, or deep-copy each state into an immutable DTO before returning.

Resolution

Resolved 2026-05-16 (commit pending). Re-triaged: verified against the current source — the root cause was already eliminated by the HealthMonitoring-002 fix. SiteHealthState is now a sealed record with init-only properties (fully immutable). Every aggregator transition installs a brand-new record instance via an atomic compare-and-swap, so the references GetAllSiteStates and GetSiteState hand out are immutable snapshots — a UI consumer reading one on its own thread can never observe a torn or half-applied state, and cannot mutate aggregator state through the returned reference. The recommended fix (make SiteHealthState a record) is exactly what the HealthMonitoring-002 change did, so no further code change was required.

HealthMonitoring-009 — Missing test coverage for central report loop, heartbeat path, replication, and collector setters

Severity Medium
Category Testing coverage
Status Resolved
Location tests/ScadaLink.HealthMonitoring.Tests/

Description

Several behaviours have no automated coverage:

  • CentralHealthReportLoop — leader-only gating (SelfIsPrimary), self-report generation, sequence assignment: no test file at all.
  • CentralHealthAggregator.MarkHeartbeat — keeping a site online between reports, online recovery via heartbeat, and the unknown-site drop behaviour (HealthMonitoring-007): untested.
  • Offline detection driven by LastHeartbeatAt vs LastReportReceivedAt — the existing offline tests only advance time after a report, never exercising the heartbeat-keeps-alive path the design depends on.
  • SiteHealthCollectorSetClusterNodes, SetInstanceCounts, SetParkedMessageCount, SetNodeHostname, SetActiveNode/NodeRole, UpdateTagQuality, UpdateConnectionEndpoint: not reflected-in-report tested.
  • SiteHealthReportReplica idempotency under double delivery: untested.

Recommendation

Add tests for the central report loop (with a fake IClusterNodeProvider), the heartbeat-keeps-online and unknown-site heartbeat paths, and the remaining collector setters' presence in CollectReport output.

Resolution

Resolved 2026-05-16 (commit pending). Added the missing coverage:

  • CentralHealthReportLoopTests (new file) — GeneratesCentralReports_WhenSelfIsPrimary, GeneratesNoReports_WhenNotPrimary (leader-only SelfIsPrimary gating with a fake IClusterNodeProvider), AssignsMonotonicSequenceNumbers, and SetsActiveNodeFlag_EvenWhenNotPrimary.
  • CentralHealthAggregatorTestsMarkHeartbeat_RegistersUnknownSite_AsOnlineAwaitingReport, MarkHeartbeat_KeepsSiteOnline_BetweenReports (heartbeat keeps a site online past the offline timeout — the path the design depends on), and MarkHeartbeat_BringsOfflineSiteBackOnline.
  • SiteHealthCollectorTests — reflected-in-report tests for SetClusterNodes, SetInstanceCounts, SetParkedMessageCount, SetNodeHostname, SetActiveNode/NodeRole, UpdateTagQuality, UpdateConnectionEndpoint, and SetStoreAndForwardDepths.

The SiteHealthReportReplica idempotency item is out of scope for this module: SiteHealthReportReplica is declared in ScadaLink.Commons and published/consumed by CentralCommunicationActor in the ScadaLink.Communication module — the HealthMonitoring module itself has no replication code. Replica double-delivery idempotency is already covered by ProcessReport's sequence-number guard (ProcessReport_RejectsEqualSequence, ProcessReport_RejectsStaleReport_WhenSequenceNotGreater); testing the actor-side double-publish belongs in the Communication module's review. The HealthMonitoring test suite now stands at 47 passing tests (was 30).

HealthMonitoring-010 — HealthReportSender silently swallows inner failures with bare catch {}

Severity Low
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.HealthMonitoring/HealthReportSender.cs:70-87

Description

The cluster-nodes update and parked-message-count query are each wrapped in try { ... } catch { /* Non-fatal */ } with no logging. A persistent failure (e.g. the S&F SQLite store is permanently broken, or GetClusterNodes() always throws) is then completely invisible — every report silently ships with stale cluster nodes and a parked count of 0, with nothing in the logs to explain the wrong dashboard values. Bare catch with no exception variable also catches OperationCanceledException and would mask shutdown signalling if the awaited call observed the token.

Recommendation

Catch a specific exception type (or at least Exception ex) and LogWarning/LogDebug the failure so persistent degradation is diagnosable; avoid swallowing OperationCanceledException.

Resolution

Resolved 2026-05-16 (commit pending). All three bare catch { } blocks in HealthReportSender.ExecuteAsync (cluster-nodes refresh, parked-message-count query, S&F buffer-depth query) were changed to catch (Exception ex) and now emit a LogWarning(ex, ...) naming the failed operation and the site, so a persistent degradation (broken S&F SQLite store, always-throwing GetClusterNodes()) is diagnosable instead of silently shipping stale/zero metrics. On the OperationCanceledException concern: verified the inner calls cannot raise OCE from cancellation — IClusterNodeProvider.GetClusterNodes() is synchronous and takes no token, and StoreAndForwardStorage.GetParkedMessageCountAsync()/GetBufferDepthByCategoryAsync() take no CancellationToken parameter, so the only cancellation path is the outer loop's PeriodicTimer.WaitForNextTickAsync(stoppingToken), which is unaffected. Regression test HealthReportSenderTests.ClusterNodeRefreshFailure_IsLoggedNotSwallowed injects a throwing IClusterNodeProvider, asserts the loop still ships reports, and asserts a warning carrying the InvalidOperationException is logged — confirmed to fail against the pre-fix bare catch { } (logged-entry collection was empty).

HealthMonitoring-011 — AddHealthMonitoringActors is a dead no-op placeholder

Severity Low
Category Code organization & conventions
Status Resolved
Location src/ScadaLink.HealthMonitoring/ServiceCollectionExtensions.cs:42-46

Description

AddHealthMonitoringActors does nothing but return services with a "Placeholder for Phase 4+" comment. A public extension method that silently no-ops is a trap: a caller who registers it will believe actor wiring is in place. No caller currently invokes it.

Recommendation

Remove the method until it has real behaviour, or throw NotImplementedException so accidental use fails loudly. If the actor model for this component is genuinely planned, track it in the design doc instead of a half-method.

Resolution

Resolved 2026-05-16 (commit pending). Verified a codebase-wide search — the AddHealthMonitoringActors no-op extension has no production callers (only references were in the code-review docs themselves). The HealthMonitoring module deliberately contains no actors (transport is abstracted behind IHealthReportTransport; actor-side wiring lives in the Communication module), so there is no genuine Phase-4 actor work for this method to host. The dead placeholder was removed from ServiceCollectionExtensions so a caller can no longer be misled into believing actor wiring exists. No regression test is meaningful for a deleted no-op; removal is verified by the module continuing to build and all 50 tests passing.

HealthMonitoring-012 — SiteHealthState.LatestReport initialized to null!, misrepresenting the contract

Severity Low — re-triaged: already resolved as a side-effect of HealthMonitoring-002.
Category Documentation & comments
Status Resolved
Location src/ScadaLink.HealthMonitoring/SiteHealthState.cs:11

Description

LatestReport is declared SiteHealthReport LatestReport { get; set; } = null!;, suppressing nullability. Today every code path that creates a SiteHealthState (only ProcessReport) assigns LatestReport, so it is never actually null — but the null! declaration tells readers and the compiler the opposite of the real invariant. If HealthMonitoring-007 is addressed by registering state from a heartbeat (no report yet), this becomes a live NullReferenceException risk for UI code that dereferences LatestReport.

Recommendation

Either make LatestReport required (matching how it is genuinely always set today) or make it properly nullable SiteHealthReport? and have consumers handle the "registered, no report yet" case explicitly — consistent with whatever is decided for HealthMonitoring-007.

Resolution

Resolved 2026-05-16 (commit pending). Re-triaged: verified against the current source — the root cause was already eliminated by the HealthMonitoring-002 / HealthMonitoring-007 fixes. SiteHealthState.LatestReport is already declared SiteHealthReport? LatestReport { get; init; } (the recommendation's "make it properly nullable" option) with an XML doc explaining the null case ("known only via heartbeats, has not yet sent a report"). A codebase-wide search confirms no null! suppression remains anywhere in src/ScadaLink.HealthMonitoring. This is exactly the change HealthMonitoring-002 made when converting SiteHealthState to an immutable record, so the contract is now honest and no further code change was required.

HealthMonitoring-013 — Offline-check interval comment claims "shorter timeout" but only ever uses OfflineTimeout

Severity Low
Category Documentation & comments
Status Resolved
Location src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:194-196

Description

ExecuteAsync derives the PeriodicTimer cadence with the comment "Check at half the (shorter) offline timeout interval for timely detection", but the code only reads _options.OfflineTimeout:

var checkInterval = TimeSpan.FromMilliseconds(_options.OfflineTimeout.TotalMilliseconds / 2);

CentralOfflineTimeout (HealthMonitoring-005's fix) is never considered. With the default options (OfflineTimeout 60s, CentralOfflineTimeout 3m) OfflineTimeout genuinely is the shorter of the two, so the parenthetical happens to hold. But the comment states an invariant the code does not enforce: if an operator configures CentralOfflineTimeout smaller than OfflineTimeout, the check cadence stays tied to OfflineTimeout, and central offline detection is delayed by up to a full OfflineTimeout / 2 beyond the intended CentralOfflineTimeout window. The comment misleads a reader into believing the cadence already adapts to whichever timeout is shorter.

Recommendation

Either compute checkInterval from Math.Min(OfflineTimeout, CentralOfflineTimeout) so the code matches the comment, or drop the "(shorter)" wording and state plainly that the cadence is derived from OfflineTimeout only (acceptable while the default CentralOfflineTimeout is the larger value).

Resolution

Resolved 2026-05-17. Root cause confirmed against source — ExecuteAsync derived the PeriodicTimer cadence solely from OfflineTimeout while the comment claimed it tracked the "(shorter)" timeout. Took the code-matches-comment option: extracted the cadence into CentralHealthAggregator.ComputeCheckInterval, which now derives it from half of the shorter of OfflineTimeout and CentralOfflineTimeout, so a CentralOfflineTimeout configured below OfflineTimeout is still polled at least twice within its window. The comment was rewritten to match. Regression test CentralHealthAggregatorTests.CheckInterval_IsHalfTheShorterTimeout asserts the default case (30s) and the shorter-CentralOfflineTimeout case (10s) — the latter would have returned 30s against the pre-fix code.

HealthMonitoring-014 — HealthMonitoringOptions intervals are unvalidated; a zero/negative value crashes the hosted service

Severity Low
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.HealthMonitoring/HealthMonitoringOptions.cs:3-20, src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:196, src/ScadaLink.HealthMonitoring/HealthReportSender.cs:67, src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:63

Description

HealthMonitoringOptions is bound from the ScadaLink:HealthMonitoring config section (SiteServiceRegistration.BindSharedOptions) with no validation — no IValidateOptions<HealthMonitoringOptions>, no ValidateDataAnnotations, no ValidateOnStart. ReportInterval, OfflineTimeout, and CentralOfflineTimeout are all fed straight into new PeriodicTimer(...) (and OfflineTimeout into a division for the check interval). PeriodicTimer's constructor throws ArgumentOutOfRangeException for a zero or negative period. A misconfigured appsettings.json (e.g. "ReportInterval": "00:00:00", an empty/garbled value that binds to TimeSpan.Zero, or a negative span) therefore crashes the HealthReportSender / CentralHealthReportLoop / CentralHealthAggregator hosted service at startup with an opaque exception that does not name the offending config key, rather than failing fast with a clear validation message.

Recommendation

Add an options validator (DataAnnotations [Range]-style on the spans, or an IValidateOptions<HealthMonitoringOptions>) that rejects non-positive ReportInterval/OfflineTimeout/CentralOfflineTimeout and ideally requires CentralOfflineTimeout >= OfflineTimeout, and call .ValidateOnStart() so a bad configuration fails fast with a message naming the section and key.

Resolution

Resolved 2026-05-17. Root cause confirmed — HealthMonitoringOptions had no validator, so a zero/negative interval reached new PeriodicTimer(...) and crashed the hosted service with an opaque ArgumentOutOfRangeException. Added HealthMonitoringOptionsValidator : IValidateOptions<HealthMonitoringOptions> that rejects non-positive ReportInterval/OfflineTimeout/CentralOfflineTimeout and a CentralOfflineTimeout shorter than OfflineTimeout, each failure naming the ScadaLink:HealthMonitoring config key. It is registered (idempotently, via TryAddEnumerable) by all three ServiceCollectionExtensions registration methods, so it fires when the hosted services resolve IOptions.Value at startup — failing fast with a clear message. (ValidateOnStart() lives in the Host module's binding call, which is out of scope; the validator nonetheless runs at startup because the hosted-service constructors resolve the options eagerly — matching the existing ClusterOptionsValidator registration pattern.) Regression tests in HealthMonitoringOptionsValidatorTests cover the valid default plus zero/negative intervals and the CentralOfflineTimeout < OfflineTimeout case.

HealthMonitoring-015 — Heartbeat-registered site is left with a year-0001 LastReportReceivedAt

Severity Medium
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:122-130, src/ScadaLink.HealthMonitoring/SiteHealthState.cs:27

Description

When MarkHeartbeat registers a previously-unknown site (HealthMonitoring-007's fix), it sets LastReportReceivedAt = default — i.e. DateTimeOffset.MinValue (0001-01-01). The XML doc on SiteHealthState.LastReportReceivedAt states the field is "Used by the UI to surface report staleness during failover." A heartbeat-only site therefore has LatestReport == null and LastReportReceivedAt == DateTimeOffset.MinValue. Any UI code that computes "last report N ago" as now - LastReportReceivedAt without first checking LatestReport != null will render a nonsensical staleness of roughly two thousand years for a site that is, in fact, freshly reachable. The two "no report yet" signals (LatestReport == null, LastReportReceivedAt == default) are independent and both must be special-cased; the sentinel value is an easy trap.

Recommendation

Make LastReportReceivedAt nullable (DateTimeOffset?) so "no report received yet" is an explicit, unmissable state rather than a magic sentinel — consistent with how LatestReport was already made nullable for the same case — and have UI consumers render staleness only when it has a value. Alternatively, document the default sentinel prominently on the field and audit every UI reader, but the nullable option is safer and matches the existing LatestReport treatment.

Resolution

Resolved 2026-05-17. Root cause confirmed against source — CentralHealthAggregator.MarkHeartbeat registered a heartbeat-only site with LastReportReceivedAt = default (0001-01-01), which Health.razor:74 passed verbatim into TimestampDisplay, rendering a ~2000-year-stale timestamp. Applied the finding's recommended cross-module fix. HealthMonitoring: SiteHealthState.LastReportReceivedAt is now DateTimeOffset?, and MarkHeartbeat's unknown-site registration sets it to null — "no report yet" is now an explicit nullable state, consistent with the already-nullable LatestReport. ProcessReport continues to set a real instant. CentralUI: TimestampDisplay.Value now accepts DateTimeOffset? and renders null as a plain text-muted placeholder (default "never", configurable via a new NullText parameter); existing non-null callers (AuditLog, EventLogs, Deployments) are unaffected by the implicit widening. Health.razor's "Last report" cell passes NullText="awaiting first report" so a heartbeat-only site reads cleanly. Regression tests: CentralHealthAggregatorTests.MarkHeartbeat_RegistersUnknownSite_WithNullLastReportReceivedAt and ProcessReport_SetsLastReportReceivedAt_ForHeartbeatRegisteredSite (HealthMonitoring — the first would not compile against the pre-fix non-nullable field), and TimestampDisplayTests.Render_NullValue_ShowsNeverPlaceholder, Render_NullValue_DoesNotRenderYear0001Sentinel, Render_NonNullValue_ShowsFormattedTimestamp (CentralUI).

HealthMonitoring-016 — SiteHealthCollector.CollectReport reads DateTimeOffset.UtcNow directly instead of an injected TimeProvider

Severity Low
Category Code organization & conventions
Status Resolved
Location src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:151

Description

CollectReport stamps each report with ReportTimestamp: DateTimeOffset.UtcNow, read directly from the system clock. Every other time-dependent class in the module — CentralHealthAggregator, HealthReportSender, CentralHealthReportLoop — was deliberately refactored (HealthMonitoring-006) to take an injectable TimeProvider so the behaviour is deterministically testable and the clock dependency is explicit. SiteHealthCollector is the lone holdout: the report timestamp cannot be controlled in a unit test, which is why SiteHealthCollectorTests.CollectReport_IncludesUtcTimestamp can only assert the timestamp falls in a before/after wall-clock window rather than equalling a known instant. This is a minor consistency/testability gap, not a behaviour bug.

Recommendation

Add an optional TimeProvider constructor parameter to SiteHealthCollector (defaulting to TimeProvider.System, mirroring the other classes) and derive ReportTimestamp from GetUtcNow(), so the report timestamp is deterministically testable and the module is consistent.

Resolution

Resolved 2026-05-17. Root cause confirmed — CollectReport stamped ReportTimestamp from DateTimeOffset.UtcNow directly while every other time-dependent class in the module takes an injectable TimeProvider. Added an optional TimeProvider constructor parameter to SiteHealthCollector (defaulting to TimeProvider.System, mirroring CentralHealthAggregator/HealthReportSender/CentralHealthReportLoop) and CollectReport now derives ReportTimestamp from _timeProvider.GetUtcNow(). The DI registration (AddSingleton<ISiteHealthCollector, SiteHealthCollector>) continues to work via the optional parameter. Regression test SiteHealthCollectorTests.CollectReport_StampsTimestamp_FromInjectedTimeProvider asserts the timestamp equals a fixed injected instant exactly (not just a before/after window); it would not compile against the pre-fix single-arg-less constructor.

HealthMonitoring-017 — HealthReportSender resets interval counters before Send; transport failures silently drop the interval's error counts

Severity Medium
Category Correctness & logic bugs
Status Open
Location src/ScadaLink.HealthMonitoring/HealthReportSender.cs:140-154, src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:146-153

Description

HealthReportSender.ExecuteAsync calls _collector.CollectReport(_siteId) and then _transport.Send(reportWithSeq) inside a single try block whose catch logs and continues. CollectReport atomically read-and-resets the per-interval counters via Interlocked.Exchange(ref _scriptErrorCount, 0) (and the same for _alarmErrorCount, _deadLetterCount, _siteAuditWriteFailures, _auditRedactionFailures). If _transport.Send then throws — Akka remoting hiccup, transport not yet associated, central side temporarily unavailable, serialization failure on a malformed metric, etc. — the catch (Exception ex) on line 150 logs an error and the loop simply waits for the next tick. The report was never delivered, but the counters have already been reset to zero, so every error this site recorded in the failed interval is gone: it is neither in the (un-sent) report nor in the (zeroed) collector. The very next successful report will show "0 script errors / 0 alarm errors" for the entire window in which the transport was broken, masking exactly the period the operator most needs to triage.

This contradicts the design doc's "raw counts per reporting interval" / "counter resets after each report is sent" wording — current code resets on each report attempt, regardless of outcome. The hazard worsens under sustained transport failure: every interval's errors are lost; the central dashboard sees a quiet site while the site is, in fact, failing.

The same shape exists in CentralHealthReportLoop (see HealthMonitoring-018) — CollectReport is called before _aggregator.ProcessReport. The aggregator call is in-process and unlikely to throw, but the structural bug is identical.

Recommendation

Build the report from a non-destructive read first (PeekReport(siteId), returning a snapshot without mutating the counters) and only call a dedicated ResetIntervalCounters() after a successful _transport.Send. Alternatively, on a Send failure, restore the lost counts via Interlocked.Add of the captured values back into the collector fields — atomically correct as long as no other thread can read them in between, which is true here because the next read is the next CollectReport on the same loop. The "peek then commit" shape is the cleaner public API.

A regression test should add a failing-transport scenario: Send throws an InvalidOperationException; assert that the next successful report includes the previously-failed interval's ScriptErrorCount.

HealthMonitoring-018 — Same counter-reset-before-publish hazard in CentralHealthReportLoop

Severity Low
Category Correctness & logic bugs
Status Open
Location src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:87-98

Description

CentralHealthReportLoop.ExecuteAsync calls _collector.CollectReport(CentralSiteId) (which resets the per-interval counters on the shared SiteHealthCollector instance — see HealthMonitoring-017) and then _aggregator.ProcessReport(reportWithSeq) inside the same try block. If ProcessReport throws, the central node's own per-interval counters (ScriptErrorCount, AlarmEvaluationErrorCount, DeadLetterCount, SiteAuditWriteFailures, AuditRedactionFailure) are lost for that interval.

In practice ProcessReport is a pure in-memory CAS loop and is very unlikely to throw, so the operational impact is small. However, the structural bug is identical to HealthMonitoring-017 and would be fixed by the same "peek then commit" refactor in SiteHealthCollector. The Audit-Log-related metrics matter most here: AuditRedactionFailure is genuinely incremented at central during normal operation (the Notification Outbox dispatcher and Inbound API middleware both write through CentralAuditRedactionFailureCounter which can fan out to the collector via the bridge), so this is not purely theoretical.

Recommendation

Adopt the same "peek then reset on successful publish" pattern recommended for HealthMonitoring-017. Reuse the new PeekReport / ResetIntervalCounters collector API once it lands.

HealthMonitoring-019 — SiteAuditTelemetryStalled and CentralAuditWriteFailures design-doc metrics have no HealthMonitoring-side surface

Severity Medium
Category Design-document adherence
Status Resolved
Location docs/requirements/Component-HealthMonitoring.md:39,40, src/ScadaLink.HealthMonitoring/ICentralHealthAggregator.cs, src/ScadaLink.AuditLog/Central/AuditCentralHealthSnapshot.cs:39-58

Resolution (2026-05-28): Took the simpler doc-removal path rather than surfacing the metrics through a new HealthMonitoring API. Removed SiteAuditTelemetryStalled and CentralAuditWriteFailures from the Monitored Metrics table, the Audit Log KPIs section, and the Audit Log Dependencies entry in Component-HealthMonitoring.md. The two metrics remain internal to AuditLog's AuditCentralHealthSnapshot — promoting them to dashboard tiles is a separate feature, out of scope here.

Description

Component-HealthMonitoring.md lists SiteAuditTelemetryStalled and CentralAuditWriteFailures (and reiterates them under the Audit Log KPIs section and in the Dependencies section) as required dashboard metrics. The doc also says they "are central-computed alongside the existing central KPIs" (Notification Outbox / Site Call Audit) and surface in the Audit dashboard tile group.

Tracing the code:

  • SiteAuditTelemetryStalled is published by SiteAuditReconciliationActor, picked up by SiteAuditTelemetryStalledTracker, and latched into AuditCentralHealthSnapshot._stalled (a ConcurrentDictionary<string, bool> in the ScadaLink.AuditLog assembly).
  • CentralAuditWriteFailures is incremented inside AuditCentralHealthSnapshot via ICentralAuditWriteFailureCounter.Increment() (also in ScadaLink.AuditLog).

Neither metric is referenced anywhere in src/ScadaLink.HealthMonitoring/:

  • ICentralHealthAggregator does not expose them.
  • SiteHealthCollector has no central counterpart (it is site-only).
  • SiteHealthReport has no SiteAuditTelemetryStalled / CentralAuditWriteFailures fields (the site-only SiteAuditWriteFailures, AuditRedactionFailure, and SiteAuditBacklog are wired; the central pair is the gap).

Currently the only consumer of IAuditCentralHealthSnapshot is whatever Central UI page binds to it directly (out of scope for this module), but the design doc places these metrics under HealthMonitoring's responsibility ("Health Monitoring Dashboard displays aggregated metrics"). At minimum the Dependencies section's claim that Health Monitoring provides "the central-computed CentralAuditWriteFailures / AuditRedactionFailure metrics" is false for CentralAuditWriteFailures: nothing under src/ScadaLink.HealthMonitoring/ knows about it.

Recommendation

Decide whether HealthMonitoring or the consuming UI page owns the IAuditCentralHealthSnapshot integration:

  • If HealthMonitoring owns it, expose a CentralKpis accessor on ICentralHealthAggregator (e.g. a GetCentralAuditHealth() method that returns a typed DTO derived from the injected IAuditCentralHealthSnapshot) so the dashboard has a single read surface mirroring GetAllSiteStates.
  • If the UI page binds IAuditCentralHealthSnapshot directly, update the HealthMonitoring design doc's Responsibilities / Dependencies sections to reflect that and remove the implied integration.

Either way, add a regression test that the chosen surface returns the live counter and per-site stalled state.

HealthMonitoring-020 — MarkHeartbeat brings offline site back online with a stale LastHeartbeatAt when receivedAt <= existing.LastHeartbeatAt

Severity Low
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:128-147

Resolution (2026-05-28): MarkHeartbeat now branches on existing.IsOnline: when transitioning offline-to-online it anchors LastHeartbeatAt to max(receivedAt, _timeProvider.GetUtcNow()) so an out-of-order or older receivedAt cannot leave the recovered site one tick away from re-going-offline. The online path retains the prior max(receivedAt, existing.LastHeartbeatAt) semantics. Regression test MarkHeartbeat_OfflineToOnline_StampsFreshLastHeartbeatAt asserts both the fresh LastHeartbeatAt (within 5 s of "now") and that the next CheckForOfflineSites does not flap the site back to offline.

Description

The CAS path in MarkHeartbeat picks newHeartbeat = max(receivedAt, existing.LastHeartbeatAt), then short-circuits only when newHeartbeat == existing.LastHeartbeatAt && existing.IsOnline. That short-circuit is correct, but consider the case where existing.IsOnline == false and receivedAt <= existing.LastHeartbeatAt:

  1. Suppose a site is marked offline by CheckForOfflineSites at time T1.
  2. A late/out-of-order heartbeat carrying a receivedAt older than the last stored LastHeartbeatAt arrives at T2 (clock skew at the receive site, or a delayed message that was generated before the offline-marking).
  3. newHeartbeat == existing.LastHeartbeatAt (kept), but the short-circuit condition fails because existing.IsOnline == false, so the CAS produces a new record with IsOnline = true and the stale LastHeartbeatAt.
  4. On the very next CheckForOfflineSites tick (≤ OfflineTimeout/2 later), now - LastHeartbeatAt is still ≥ OfflineTimeout, so the site is immediately marked offline again — the heartbeat brought it online for less than the check cadence, producing a "flap" in the dashboard.

In practice receivedAt is normally _timeProvider.GetUtcNow() at the CentralCommunicationActor receive site, so monotonically increasing — the bug is latent. But the contract MarkHeartbeat(string siteId, DateTimeOffset receivedAt) makes no guarantee about ordering, and an out-of-order delivery (Akka remoting ordering across connection re-establishment edge cases) or a small wall-clock correction at central would expose it.

Recommendation

When transitioning offline → online, use now (from the injected TimeProvider) rather than the caller-supplied receivedAt for LastHeartbeatAt, or take max(receivedAt, _timeProvider.GetUtcNow()) so the recovery point is always recent. A unit test driving MarkHeartbeat with a receivedAt older than the last stored heartbeat on an offline site, then a CheckForOfflineSites immediately afterwards, would assert the site stays online.

HealthMonitoring-021 — CentralSiteId = "central" reserved constant silently collides with a real site named "central"

Severity Low
Category Correctness & logic bugs
Status Open
Location src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:22, src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:224-226

Description

CentralHealthAggregator.CheckForOfflineSites looks up the per-site offline timeout with:

var timeout = kvp.Key == CentralHealthReportLoop.CentralSiteId
    ? _options.CentralOfflineTimeout
    : _options.OfflineTimeout;

CentralSiteId is the literal string "central". Site IDs are free-form strings set in configuration / the Sites repository; there is no validation that excludes the reserved "central" name. An operator who creates a real site with SiteId = "central" will have:

  • Their real-site reports arriving via ProcessReport get stored in the same dictionary slot as the central self-report (they share the keyspace), so the central self-report and the real-site report repeatedly overwrite each other via the sequence-number guard — whichever has the higher Unix-ms seed wins, and the other is silently rejected as stale. The dashboard alternates between two unrelated payloads.
  • The real site gets the longer CentralOfflineTimeout (default 3 minutes) instead of the normal OfflineTimeout (60 s), so a genuinely-failed real site marked "central" stays falsely-online for an extra two minutes.

Recommendation

Two options:

  1. Reject the reserved name at the Site entity / configuration validation layer (Configuration Database component, out of this module's scope) and document "central" as reserved. This is the cleaner UX fix.
  2. As a defence-in-depth inside HealthMonitoring, store the central self-report under a key that cannot collide — e.g. prefix it with a character that is forbidden in real site IDs (":central" or "#central") — and adjust CheckForOfflineSites accordingly.

Either fix should include a regression test creating a real SiteHealthReport with SiteId = "central" and asserting the central self-report's identity is preserved.

HealthMonitoring-022 — CentralHealthReportLoopTests uses real-time PeriodicTimer + Task.Delay; flake-prone on slow CI

Severity Low
Category Testing coverage
Status Open
Location tests/ScadaLink.HealthMonitoring.Tests/CentralHealthReportLoopTests.cs:32-42

Description

RunLoopBriefly starts the hosted service with a 50 ms PeriodicTimer and then await Task.Delay(runForMs, CancellationToken.None) (with runForMs between 150 ms and 300 ms). GeneratesCentralReports_WhenSelfIsPrimary and AssignsMonotonicSequenceNumbers both assert "at least 2 reports were generated" within the window. On a heavily-contended CI runner where the hosted-service start-up plus a couple of PeriodicTimer ticks can blow past 300 ms, these tests will silently flake.

The rest of the suite (CentralHealthAggregatorTests, SiteHealthCollectorTests, HealthReportSenderTests partially) was deliberately refactored to use the injected TimeProvider precisely to avoid this. CentralHealthReportLoop and HealthReportSender already accept a TimeProvider, but the loop's PeriodicTimer is still real-time because PeriodicTimer does not consume the TimeProvider parameter.

Recommendation

Either (a) accept the timing-sensitivity and bump the delay budget generously, or (b) refactor the hosted-service loop to use a TimeProvider.CreateTimer-based tick mechanism so the test can advance a fake clock and assert deterministically how many ticks fire. Option (b) is the better long-term fix and matches the pattern used elsewhere in the module's tests.

HealthMonitoring-023 — StoreAndForwardBufferDepths_IsEmptyPlaceholder test name is stale; it now covers the default-state contract, not a placeholder

Severity Low
Category Documentation & comments
Status Resolved
Location tests/ScadaLink.HealthMonitoring.Tests/SiteHealthCollectorTests.cs:117-122

Resolution (2026-05-28): Renamed the test method from StoreAndForwardBufferDepths_IsEmptyPlaceholder to StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled so the name describes what it actually asserts — the default-state contract of a fresh collector. No behaviour change; the body still constructs a collector without calling SetStoreAndForwardDepths and asserts Empty(report.StoreAndForwardBufferDepths).

Description

The test StoreAndForwardBufferDepths_IsEmptyPlaceholder was originally named to codify the HealthMonitoring-001 bug ("SetStoreAndForwardDepths has no callers, so StoreAndForwardBufferDepths is always empty"). HealthMonitoring-001 is ResolvedHealthReportSender now populates per-category depths from the S&F engine, and the same test class has SetStoreAndForwardDepths_ReflectedInReport covering the populated path. The "placeholder" test still passes because it constructs a fresh collector and never calls the setter, so its assertion (Assert.Empty(report.StoreAndForwardBufferDepths)) is now testing the default empty state of an un-configured collector. The HealthMonitoring-001 resolution note explicitly chose to keep it as "the collector-level default-state test", but the test method name and the implied semantics no longer match.

A maintainer reading the test name today will misread it as documentation that the metric is unimplemented (which it isn't), and may waste time investigating a non-bug.

Recommendation

Rename to StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled (or similar) and update the test body's intent — purely a documentation / maintainability fix; no behaviour change.