Files
Joseph Doherty 7b0b9c7365 refactor: rename ScadaLink → ZB.MOM.WW.ScadaBridge (code + projects + namespaces)
Solution + 23 src projects + 26 test projects renamed; folders, csproj,
namespaces, and ScadaLinkDbContext/ScadaBridgeDbContext class updated.
ActorSystem "scadalink" → "scadabridge", Akka seed-node URLs migrated.
SQL roles/logins, LDAP domains, CLI command name, and CLI config dir
(~/.scadalink → ~/.scadabridge) also renamed.

Build green; 5 Host.Tests fail awaiting SQL login rename in next commit.
Pre-existing StaleTagMonitor timing flakes unchanged.

Rename script committed at tools/rename-to-scadabridge.sh.
2026-05-28 09:37:45 -04:00

1156 lines
70 KiB
Markdown

# Code Review — HealthMonitoring
| Field | Value |
|-------|-------|
| Module | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring` |
| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 2 |
## Summary
The HealthMonitoring module is small, readable, and broadly faithful to the design
intent: per-interval error counters with atomic read-and-reset, monotonic sequence
numbers with Unix-ms seeding to survive failover, sequence-guarded staleness
rejection, and a 60s offline timeout. However, the review surfaced two recurring
themes. First, **a documented metric is silently unimplemented** — store-and-forward
buffer depths are never populated (`SetStoreAndForwardDepths` has zero callers and a
test asserts the field is always empty), so the dashboard cannot show the buffer
depth metric the design doc requires. Second, **the central aggregator's in-memory
state model has unguarded shared mutable state**: `SiteHealthState` is a mutable
class whose fields are written by a background timer thread, by `ProcessReport`, and
by `MarkHeartbeat` with no synchronization, and the same live mutable objects are
handed straight to UI callers via `GetAllSiteStates`. The `ProcessReport` logic also
mutates shared state inside a `ConcurrentDictionary.AddOrUpdate` update delegate,
which the runtime may invoke more than once under contention. Additionally there are
gaps around central self-report offline detection, heartbeats for not-yet-registered
sites being dropped, and missing test coverage for the central report loop,
heartbeat path, and most collector setters. None of the findings are crash-class,
but the concurrency issues are Medium/High and the missing S&F metric is a real
design-adherence gap.
#### Re-review 2026-05-17 (commit `39d737e`)
All twelve prior findings (HealthMonitoring-001..012) are confirmed `Resolved`
`SiteHealthState` is now an immutable `sealed record` mutated only via atomic
compare-and-swap, the store-and-forward buffer-depth metric is populated, the
central-site offline grace and the unknown-site heartbeat registration are in
place, and the test suite has grown to cover the report loop, heartbeat path, and
collector setters. This re-review found **4 new findings, all Low/Medium, none
crash-class**. They are residual polish items rather than behaviour regressions:
an inaccurate offline-check-interval comment (HealthMonitoring-013), unvalidated
`HealthMonitoringOptions` intervals that crash the hosted service on
misconfiguration (HealthMonitoring-014), a heartbeat-only registered site left
with a year-0001 `LastReportReceivedAt` that the UI's staleness display would
otherwise render verbatim (HealthMonitoring-015 — resolved via a coordinated
HealthMonitoring + CentralUI change), and `CollectReport` reading
`DateTimeOffset.UtcNow` directly instead of the module's now-standard injected
`TimeProvider` (HealthMonitoring-016). The module remains small, readable, and
broadly faithful to the design intent.
#### Re-review 2026-05-28 (commit `1eb6e97`)
All sixteen prior findings (HealthMonitoring-001..016) remain `Resolved`. This
baseline re-review applied the full 10-category checklist and produced **7 new
findings** (1 Medium, 6 Low — none crash-class). The most material observation
is a **metric-loss race** in `HealthReportSender.ExecuteAsync`
(HealthMonitoring-017): `CollectReport` resets the per-interval error counters
(`ScriptErrorCount`, `AlarmEvaluationErrorCount`, `DeadLetterCount`,
`SiteAuditWriteFailures`, `AuditRedactionFailure`) **before**
`_transport.Send(...)` is attempted, so a transport failure (the existing
`catch { LogError; }` path) silently discards every error this site recorded in
the failed interval — the module-specific concern of "metric counters drifting
from raw-per-interval to cumulative" inverted into _drifting_ to _lost_. A
parallel hazard exists in `CentralHealthReportLoop` (HealthMonitoring-018). The
remaining items are smaller: two Audit Log metrics
(`SiteAuditTelemetryStalled`, `CentralAuditWriteFailures`) listed in the design
doc never make it into a HealthMonitoring surface (HealthMonitoring-019); a
heartbeat with `receivedAt <= existing.LastHeartbeatAt` brings an offline site
back online with a stale heartbeat that can flap right back to offline on the
next check (HealthMonitoring-020); the reserved `CentralSiteId = "central"`
constant collides with any real site named `"central"` and silently extends its
offline grace (HealthMonitoring-021); `CentralHealthReportLoopTests` uses real
wall-clock 50 ms timers + `Task.Delay`, making it timing-sensitive
(HealthMonitoring-022); and one obsolete placeholder test name
(`StoreAndForwardBufferDepths_IsEmptyPlaceholder`) misrepresents what it now
covers (HealthMonitoring-023). All sequence-number and offline-detection
arithmetic uses `_timeProvider.GetUtcNow()` consistently — no wall-clock vs
monotonic mismatch was observed.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). Re-review: heartbeat-registered site left with year-0001 `LastReportReceivedAt` (HealthMonitoring-015). |
| 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via `IHealthReportTransport`); `AddHealthMonitoringActors` is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. |
| 3 | Concurrency & thread safety | x | Unguarded mutable `SiteHealthState` (HealthMonitoring-002); mutation inside `AddOrUpdate` delegate (HealthMonitoring-003); `GetAllSiteStates` leaks live mutable references (HealthMonitoring-008). Collector counters correctly use `Interlocked`. |
| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010, resolved); top-level loop error handling is sound. Re-review: `HealthMonitoringOptions` intervals unvalidated — a zero/negative value crashes the hosted service at `PeriodicTimer` construction (HealthMonitoring-014). |
| 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. |
| 6 | Performance & resource management | x | `PeriodicTimer` instances correctly disposed via `using`. Dictionary snapshots per report are acceptable at the documented scale. No issues found. |
| 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). |
| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011, resolved). Re-review: `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of the module's now-standard injected `TimeProvider` (HealthMonitoring-016). |
| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses `OfflineTimeout` (HealthMonitoring-013). |
_Re-review (2026-05-28, `1eb6e97`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | `HealthReportSender` and `CentralHealthReportLoop` reset per-interval counters before the send/process call — counts lost on transport failure (HealthMonitoring-017, HealthMonitoring-018). `MarkHeartbeat` brings an offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt` — site can flap straight back to offline (HealthMonitoring-020). `CentralSiteId = "central"` reserved constant silently collides with any real site named "central" (HealthMonitoring-021). |
| 2 | Akka.NET conventions | x | Module contains no actors itself. `IHealthReportTransport` cleanly abstracts the Akka-remoting send. `ProcessReport`/`MarkHeartbeat` are called from `CentralCommunicationActor`'s receive — invoked on the actor thread but the aggregator's CAS loops make that safe regardless. No issues found. |
| 3 | Concurrency & thread safety | x | Verified the resolved `SiteHealthState` immutable-record / CAS-loop pattern still holds across `ProcessReport`, `MarkHeartbeat`, `CheckForOfflineSites`. `SiteHealthCollector` uses `volatile` for reference fields (`_clusterNodes`, `_nodeHostname`, `_siteAuditBacklog`, `_isActiveNode`) and `Interlocked` for counters consistently. `CollectReport`'s `new Dictionary<>(concurrentDict)` snapshots are not strictly atomic but acceptable at the documented scale. No new issues found. |
| 4 | Error handling & resilience | x | `try/catch` blocks now log all non-fatal failures (resolved HealthMonitoring-010 still in place). Outer `catch (Exception)` in `ExecuteAsync` keeps the loop alive — sound. New: the counter-reset-before-send issue (HealthMonitoring-017, HealthMonitoring-018) is an error-handling gap — transport failure silently swallows the interval's metric data. |
| 5 | Security | x | No issues found. The module handles only numeric/string operational metrics; no secrets, auth surface, or untrusted input parsing. `MarkHeartbeat` and `ProcessReport` trust the caller (intra-cluster). |
| 6 | Performance & resource management | x | `PeriodicTimer` instances disposed via `using`. CAS retry loops in `ProcessReport`/`MarkHeartbeat` have no bounded retry cap but contention is the dictionary-size limit (one entry per site) so the loop is effectively wait-free for the common case. No issues found. |
| 7 | Design-document adherence | x | `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` are listed as required dashboard tiles in `Component-HealthMonitoring.md` but have no HealthMonitoring-side surface — both live only in `AuditLog`'s `AuditCentralHealthSnapshot` with no integration into the health aggregator or its consumers (HealthMonitoring-019). |
| 8 | Code organization & conventions | x | Options class correctly owned by the component, validator registered idempotently across all three `Add*` paths. POCO/messages in Commons. `AddCentralHealthAggregation` implicitly depends on `ISiteHealthCollector` being registered elsewhere (Host calls `AddHealthMonitoring()` first) — works but is a hidden ordering requirement. Minor; not flagged. |
| 9 | Testing coverage | x | Per-interval reset semantics covered for site-side counters but NOT for the failed-send case (no test asserts counters remain accumulated when the transport throws — would catch HealthMonitoring-017). `CentralHealthReportLoopTests` uses real wall-clock 50 ms `PeriodicTimer` + `Task.Delay(250)` for timing — flake-prone on a slow CI runner (HealthMonitoring-022). The placeholder test `StoreAndForwardBufferDepths_IsEmptyPlaceholder` name is stale (HealthMonitoring-023). |
| 10 | Documentation & comments | x | XML docs in the new audit-bridge surfaces (`IncrementSiteAuditWriteFailures`, `IncrementAuditRedactionFailure`, `UpdateSiteAuditBacklog`) are accurate. The stale placeholder test name is the only issue (HealthMonitoring-023). |
## Findings
### HealthMonitoring-001 — Store-and-forward buffer depth metric is never populated
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/SiteHealthCollector.cs:104`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/HealthReportSender.cs:79` |
**Description**
`Component-HealthMonitoring.md` lists "Store-and-forward buffer depth" (pending
messages by category) as a required monitored metric. `SiteHealthCollector` exposes
`SetStoreAndForwardDepths(...)` to receive it, but a codebase-wide search shows the
method has **no callers**`_sfBufferDepths` always remains the empty dictionary it
is initialized to. `HealthReportSender` queries `GetParkedMessageCountAsync()` and
sets `ParkedMessageCount`, but parked count is a distinct metric from per-category
buffer depth. The test `SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder`
even codifies the unimplemented state as expected behaviour. The result is that the
central dashboard cannot display buffer depth, a documented triage metric.
**Recommendation**
Wire `SetStoreAndForwardDepths` into `HealthReportSender.ExecuteAsync` (alongside the
existing parked-count call) using the S&F engine's per-category depth API, or, if the
metric is intentionally deferred, record that decision in the design doc and remove
the dead setter. Update the placeholder test accordingly once implemented.
**Resolution**
Resolved 2026-05-16 (commit `<pending>`). `HealthReportSender.ExecuteAsync` now
queries the existing public `StoreAndForwardStorage.GetBufferDepthByCategoryAsync()`
API alongside the parked-count call and feeds the per-category depths into
`SiteHealthCollector.SetStoreAndForwardDepths` (category enum names as keys), so the
documented store-and-forward buffer depth metric is populated in every emitted
report. Regression test `HealthReportSenderTests.ReportsIncludeStoreAndForwardBufferDepthsFromStorage`
verifies populated per-category depths. The obsolete placeholder test
`SiteHealthCollectorTests.StoreAndForwardBufferDepths_IsEmptyPlaceholder` continues
to pass — it only exercises the collector with no setter call and still correctly
asserts the empty default; it was left in place as the collector-level default-state
test. No StoreAndForward source was modified (existing public API only).
### HealthMonitoring-002 — `SiteHealthState` mutable fields written from multiple threads without synchronization
| | |
|--|--|
| Severity | High |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/SiteHealthState.cs:11`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthAggregator.cs:86`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthAggregator.cs:137` |
**Description**
`SiteHealthState` is a plain mutable class. Its fields (`LatestReport`,
`LastReportReceivedAt`, `LastHeartbeatAt`, `LastSequenceNumber`, `IsOnline`) are
mutated from at least three concurrent contexts: `ProcessReport` (caller thread —
ClusterClient/PubSub message handlers), `MarkHeartbeat` (caller thread — heartbeat
handler), and `CheckForOfflineSites` (the `BackgroundService` timer thread). The
`ConcurrentDictionary` only protects the dictionary structure, not the objects it
stores. A heartbeat update and the offline-check can interleave on the same
`SiteHealthState` instance, and reads/writes of `DateTimeOffset` (a 16-byte struct)
and `long` fields are not guaranteed atomic on all platforms — producing torn reads
and lost updates of `IsOnline`/`LastHeartbeatAt`.
**Recommendation**
Make state transitions atomic: either guard all reads/writes of a `SiteHealthState`
with a per-site lock, or replace `SiteHealthState` with an immutable record updated
via `ConcurrentDictionary` compare-and-swap (`TryUpdate`) so every transition is
a single atomic reference swap.
**Resolution**
Resolved 2026-05-16 (commit `<pending>`). `SiteHealthState` is now a `sealed record`
with `init`-only properties. `CentralHealthAggregator.ProcessReport`,
`MarkHeartbeat`, and `CheckForOfflineSites` were rewritten to perform every state
transition as an atomic compare-and-swap (`TryAdd`/`TryUpdate`) producing a new
record instance — no field of a stored state is ever mutated in place. `ProcessReport`
uses an explicit CAS retry loop instead of the `AddOrUpdate` update delegate so the
sequence-number guard and the field writes are evaluated against the value actually
installed (this also closes the root cause behind HealthMonitoring-003). Reads via
`GetAllSiteStates`/`GetSiteState` now hand out immutable snapshots, so a concurrent
reader can never observe a torn or half-applied state. `LatestReport` was changed
from `SiteHealthReport` (`null!`) to `SiteHealthReport?`, making the contract honest;
all existing consumers (CentralUI, integration/perf tests) already null-checked it
and continue to build clean. Regression test
`CentralHealthAggregatorTests.ProcessReport_ConcurrentUpdates_NeverLoseSequenceOrTearState`
exercises concurrent report/heartbeat/read threads and asserts snapshot consistency
and no lost updates.
### HealthMonitoring-003 — Shared state mutated inside `ConcurrentDictionary.AddOrUpdate` update delegate
| | |
|--|--|
| Severity | Medium — re-triaged: already resolved as a side-effect of HealthMonitoring-002. |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthAggregator.cs:45-103` |
**Description**
The update delegate passed to `AddOrUpdate` mutates the `existing` object in place
(`existing.LatestReport = report; existing.IsOnline = true; ...`). `AddOrUpdate`'s
contract explicitly allows the update delegate to be invoked **more than once** under
contention (when the CAS that installs the result loses a race and is retried). Each
invocation mutates the shared object, so a concurrent report for the same site can
observe a half-applied update, and the multi-field assignment is not atomic with
respect to readers in `GetAllSiteStates`/`CheckForOfflineSites`. The intended
"only replace if sequence is higher" guard can also be subverted because the
sequence comparison and the field writes are not a single atomic step.
**Recommendation**
Have the update delegate return a **new** `SiteHealthState` (record `with` copy)
rather than mutating `existing`, and treat the dictionary value as immutable.
Combined with HealthMonitoring-002, this makes every state transition an atomic
reference swap with no observable intermediate state.
**Resolution**
Resolved 2026-05-16 (commit `pending`). Re-triaged: verified against the current
source — the root cause was already eliminated by the HealthMonitoring-002 fix.
`ProcessReport` no longer uses `AddOrUpdate` at all; it is now an explicit
compare-and-swap retry loop (`TryGetValue` → guard → `TryAdd`/`TryUpdate`) that
produces a brand-new immutable `SiteHealthState` record per transition and never
mutates a stored value in place. The sequence-number guard and the field writes are
evaluated against the value actually installed by the CAS, so the "only replace if
sequence is higher" invariant holds. The concurrency stress test
`CentralHealthAggregatorTests.ProcessReport_ConcurrentUpdates_NeverLoseSequenceOrTearState`
(added under HealthMonitoring-002) already exercises this path and asserts no lost
updates and no torn snapshots. No further code change was required for this finding.
### HealthMonitoring-004 — Inconsistent heartbeat interval described across XML docs
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthAggregator.cs:146-148`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/SiteHealthState.cs:21`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/ICentralHealthAggregator.cs:16` |
**Description**
The heartbeat cadence that offline detection relies on is documented inconsistently.
`CheckForOfflineSites` says "heartbeats arrive every ~5s"; `SiteHealthState.LastHeartbeatAt`
says "~5s heartbeat"; but `ICentralHealthAggregator.MarkHeartbeat` says "~2s
heartbeats are arriving". The actual cadence is set elsewhere (Cluster Infrastructure /
`SiteCommunicationActor`). Readers cannot reason about whether a 60s offline timeout
gives the intended grace without a single authoritative number.
**Recommendation**
Pick the correct interval (verify against the heartbeat scheduler in
`SiteCommunicationActor`/Cluster Infrastructure) and use it consistently in all three
comments, ideally referencing the owning component rather than restating a magic number.
**Resolution**
Resolved 2026-05-16 (commit `pending`). Documentation-only — no regression test is
meaningful. Verified the authoritative cadence against the Communication module:
`SiteCommunicationActor.PreStart` schedules the application-level heartbeat to central
at `CommunicationOptions.TransportHeartbeatInterval`, which defaults to **5 seconds**
(`CommunicationOptions.cs:49`). The stale "~2s" in `ICentralHealthAggregator.MarkHeartbeat`
was corrected; all three XML docs (`ICentralHealthAggregator.MarkHeartbeat`,
`SiteHealthState.LastHeartbeatAt`, `CentralHealthAggregator.CheckForOfflineSites`) now
state the single authoritative ~5s figure and reference the owning component
(`Cluster Infrastructure / SiteCommunicationActor`
`CommunicationOptions.TransportHeartbeatInterval`) rather than restating a bare magic
number, so readers can reason about the 60s offline grace.
### HealthMonitoring-005 — Central self-report site can flap offline; no heartbeat grace like real sites
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthReportLoop.cs:48-81`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthAggregator.cs:149` |
**Description**
`CheckForOfflineSites` decides offline status purely from `LastHeartbeatAt`, and for
real sites that field is kept fresh by frequent (~2-5s) heartbeats so the 60s timeout
only fires on genuine total loss. The synthetic `central` site, however, has no
heartbeat source — `LastHeartbeatAt` is only bumped by `ProcessReport` from the
30s `CentralHealthReportLoop`. The loop also only runs on the cluster leader and
silently skips a cycle on any exception. Consequently, a single skipped/late central
self-report (leader GC pause, brief stall, mid-failover before the new leader's loop
spins up) leaves `central` with no signal for >60s and it is marked offline even
though the central cluster is healthy. The central card thus has no equivalent of
the "one missed report grace" the design doc grants real sites.
**Recommendation**
Either feed `central` a heartbeat equivalent (e.g. have `MarkHeartbeat` called for
`CentralSiteId` on a fast timer independent of the leader-only report loop), or apply
a longer/distinct offline timeout to the `central` keyspace entry, and ensure the new
leader starts the report loop promptly on failover.
**Resolution**
Resolved 2026-05-16 (commit `pending`). Applied the distinct-timeout option. A new
`HealthMonitoringOptions.CentralOfflineTimeout` (default 3x the report interval =
3 minutes) is applied by `CentralHealthAggregator.CheckForOfflineSites` to the
`central` keyspace entry only — real sites keep the existing `OfflineTimeout`. This
gives the synthetic `central` site (which has no heartbeat source and is fed solely
by the 30s leader-only `CentralHealthReportLoop`) enough grace to survive a single
skipped or late self-report — the equivalent of the "one missed report" grace the
design doc grants real sites — while still going offline on genuine total loss.
Regression tests `CentralHealthAggregatorTests.OfflineDetection_CentralSite_HasLongerGraceThanRealSites`
(central survives 75s of silence while a real site goes offline) and
`OfflineDetection_CentralSite_StillGoesOfflineOnGenuineLoss` (central still detected
offline after 10 minutes) verify the behaviour.
### HealthMonitoring-006 — Sequence seeding contradicts the doc's "starting at 1" wording and is untestable
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/HealthReportSender.cs:28`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthReportLoop.cs:32` |
**Description**
The `HealthReportSender` class XML summary states "Sequence numbers are monotonic,
starting at 1, and reset on service restart." The implementation instead seeds
`_sequenceNumber` with `DateTimeOffset.UtcNow.ToUnixTimeMilliseconds()` so the first
emitted sequence is a large epoch value, specifically to keep ordering correct across
failover. The summary is therefore stale and contradicts the code. Separately, the
seed reads `DateTimeOffset.UtcNow` directly at field initialization rather than
through an injected `TimeProvider` (which `CentralHealthAggregator` already uses),
making the seeding logic impossible to unit-test deterministically and dependent on
node wall-clock agreement — if one node's clock lags, its post-failover reports can
be silently rejected as stale by the aggregator.
**Recommendation**
Fix the `HealthReportSender` XML summary to describe the actual Unix-ms seeding
strategy, and inject `TimeProvider` for the seed so the behaviour is testable and the
clock dependency is explicit.
**Resolution**
Resolved 2026-05-16 (commit `pending`). The `HealthReportSender` XML summary was
rewritten to describe the real strategy — sequence numbers are monotonic and
restart-resetting but explicitly **not** zero/one-based; they are seeded with the
current Unix epoch (ms) so a freshly-active node always sorts after the prior active.
Both `HealthReportSender` and `CentralHealthReportLoop` now accept an optional
`TimeProvider` (defaulting to `TimeProvider.System`) and derive the seed via
`GetUtcNow().ToUnixTimeMilliseconds()` in the constructor body instead of reading
`DateTimeOffset.UtcNow` at field initialization, so the seeding is deterministically
testable and the clock dependency is explicit. `CentralHealthReportLoop` gained a
`CurrentSequenceNumber` test accessor mirroring `HealthReportSender`. Regression tests
`HealthReportSenderTests.SequenceNumberSeed_UsesInjectedTimeProvider` and
`CentralHealthReportLoopTests.SequenceNumberSeed_UsesInjectedTimeProvider` assert the
seed equals the injected provider's Unix-ms instant (these would not compile against
the pre-fix code, which had no `TimeProvider` parameter).
### HealthMonitoring-007 — Heartbeats for not-yet-registered sites are silently dropped
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthAggregator.cs:86-99` |
**Description**
`MarkHeartbeat` returns immediately if the site is not already in `_siteStates`
("registration only happens on report"). Central health state is in-memory only and
not persisted. After a central restart or failover the aggregator starts empty, so
for up to one full report interval (default 30s) every site emits only heartbeats
that are all discarded — the site is reported as *unknown* (absent from
`GetAllSiteStates`) rather than *online*, even though heartbeats prove it is
reachable. This is a visible dashboard regression precisely during the failover
window, which is when operators most need accurate status.
**Recommendation**
Allow `MarkHeartbeat` to register a minimal `SiteHealthState` (online, no
`LatestReport` yet, with a UI-visible "awaiting first report" indication) when a
heartbeat arrives for an unknown site, so reachable sites show online immediately
after a central restart.
**Resolution**
Resolved 2026-05-16 (commit `pending`). `CentralHealthAggregator.MarkHeartbeat` no
longer returns early for an unknown site. When a heartbeat arrives for a site with no
aggregator state, it now atomically registers (`TryAdd`, with CAS-loss retry) a
minimal `SiteHealthState` that is `IsOnline = true`, `LatestReport = null`,
`LastSequenceNumber = 0` and `LastHeartbeatAt = receivedAt` — an "online, awaiting
first report" state. This relies on the HealthMonitoring-002 change that made
`LatestReport` properly nullable, so UI consumers already handle the null case.
Reachable sites therefore show online immediately after a central restart/failover
instead of being absent ("unknown") for up to a full report interval. The
`ICentralHealthAggregator.MarkHeartbeat` XML doc was corrected to describe the new
behaviour. Regression test
`CentralHealthAggregatorTests.MarkHeartbeat_RegistersUnknownSite_AsOnlineAwaitingReport`
verifies the registration; `MarkHeartbeat_KeepsSiteOnline_BetweenReports` and
`MarkHeartbeat_BringsOfflineSiteBackOnline` cover the already-registered paths.
### HealthMonitoring-008 — `GetAllSiteStates` / `GetSiteState` leak live mutable state objects to callers
| | |
|--|--|
| Severity | Medium — re-triaged: already resolved as a side-effect of HealthMonitoring-002. |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthAggregator.cs:146-158` |
**Description**
`GetAllSiteStates` copies the dictionary but the copy still holds references to the
same live mutable `SiteHealthState` instances; `GetSiteState` returns the live
instance directly. UI consumers (Blazor Server / SignalR circuits) read these objects
on their own threads while the aggregator's background timer and report handlers
concurrently mutate the very same instances (see HealthMonitoring-002). A UI render
can observe a `SiteHealthState` with, e.g., `IsOnline == true` but a `LatestReport`
from a different update, or a torn `DateTimeOffset`. Callers could also mutate the
shared state, corrupting aggregator state.
**Recommendation**
Return immutable snapshots: convert `SiteHealthState` to a record (per
HealthMonitoring-002/003) so handing out the reference is safe, or deep-copy each
state into an immutable DTO before returning.
**Resolution**
Resolved 2026-05-16 (commit `pending`). Re-triaged: verified against the current
source — the root cause was already eliminated by the HealthMonitoring-002 fix.
`SiteHealthState` is now a `sealed record` with `init`-only properties (fully
immutable). Every aggregator transition installs a brand-new record instance via an
atomic compare-and-swap, so the references `GetAllSiteStates` and `GetSiteState` hand
out are immutable snapshots — a UI consumer reading one on its own thread can never
observe a torn or half-applied state, and cannot mutate aggregator state through the
returned reference. The recommended fix (make `SiteHealthState` a record) is exactly
what the HealthMonitoring-002 change did, so no further code change was required.
### HealthMonitoring-009 — Missing test coverage for central report loop, heartbeat path, replication, and collector setters
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.HealthMonitoring.Tests/` |
**Description**
Several behaviours have no automated coverage:
- `CentralHealthReportLoop` — leader-only gating (`SelfIsPrimary`), self-report
generation, sequence assignment: no test file at all.
- `CentralHealthAggregator.MarkHeartbeat` — keeping a site online between reports,
online recovery via heartbeat, and the unknown-site drop behaviour
(HealthMonitoring-007): untested.
- Offline detection driven by `LastHeartbeatAt` vs `LastReportReceivedAt` — the
existing offline tests only advance time after a report, never exercising the
heartbeat-keeps-alive path the design depends on.
- `SiteHealthCollector``SetClusterNodes`, `SetInstanceCounts`, `SetParkedMessageCount`,
`SetNodeHostname`, `SetActiveNode`/`NodeRole`, `UpdateTagQuality`,
`UpdateConnectionEndpoint`: not reflected-in-report tested.
- `SiteHealthReportReplica` idempotency under double delivery: untested.
**Recommendation**
Add tests for the central report loop (with a fake `IClusterNodeProvider`), the
heartbeat-keeps-online and unknown-site heartbeat paths, and the remaining collector
setters' presence in `CollectReport` output.
**Resolution**
Resolved 2026-05-16 (commit `pending`). Added the missing coverage:
- **`CentralHealthReportLoopTests`** (new file) — `GeneratesCentralReports_WhenSelfIsPrimary`,
`GeneratesNoReports_WhenNotPrimary` (leader-only `SelfIsPrimary` gating with a fake
`IClusterNodeProvider`), `AssignsMonotonicSequenceNumbers`, and
`SetsActiveNodeFlag_EvenWhenNotPrimary`.
- **`CentralHealthAggregatorTests`** — `MarkHeartbeat_RegistersUnknownSite_AsOnlineAwaitingReport`,
`MarkHeartbeat_KeepsSiteOnline_BetweenReports` (heartbeat keeps a site online past
the offline timeout — the path the design depends on), and
`MarkHeartbeat_BringsOfflineSiteBackOnline`.
- **`SiteHealthCollectorTests`** — reflected-in-report tests for `SetClusterNodes`,
`SetInstanceCounts`, `SetParkedMessageCount`, `SetNodeHostname`,
`SetActiveNode`/`NodeRole`, `UpdateTagQuality`, `UpdateConnectionEndpoint`, and
`SetStoreAndForwardDepths`.
The `SiteHealthReportReplica` idempotency item is **out of scope** for this module:
`SiteHealthReportReplica` is declared in `ZB.MOM.WW.ScadaBridge.Commons` and published/consumed by
`CentralCommunicationActor` in the `ZB.MOM.WW.ScadaBridge.Communication` module — the
HealthMonitoring module itself has no replication code. Replica double-delivery
idempotency is already covered by `ProcessReport`'s sequence-number guard
(`ProcessReport_RejectsEqualSequence`, `ProcessReport_RejectsStaleReport_WhenSequenceNotGreater`);
testing the actor-side double-publish belongs in the Communication module's review.
The HealthMonitoring test suite now stands at 47 passing tests (was 30).
### HealthMonitoring-010 — `HealthReportSender` silently swallows inner failures with bare `catch {}`
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/HealthReportSender.cs:70-87` |
**Description**
The cluster-nodes update and parked-message-count query are each wrapped in
`try { ... } catch { /* Non-fatal */ }` with no logging. A persistent failure (e.g.
the S&F SQLite store is permanently broken, or `GetClusterNodes()` always throws)
is then completely invisible — every report silently ships with stale cluster nodes
and a parked count of 0, with nothing in the logs to explain the wrong dashboard
values. Bare `catch` with no exception variable also catches `OperationCanceledException`
and would mask shutdown signalling if the awaited call observed the token.
**Recommendation**
Catch a specific exception type (or at least `Exception ex`) and `LogWarning`/`LogDebug`
the failure so persistent degradation is diagnosable; avoid swallowing
`OperationCanceledException`.
**Resolution**
Resolved 2026-05-16 (commit `pending`). All three bare `catch { }` blocks in
`HealthReportSender.ExecuteAsync` (cluster-nodes refresh, parked-message-count query,
S&F buffer-depth query) were changed to `catch (Exception ex)` and now emit a
`LogWarning(ex, ...)` naming the failed operation and the site, so a persistent
degradation (broken S&F SQLite store, always-throwing `GetClusterNodes()`) is
diagnosable instead of silently shipping stale/zero metrics. On the `OperationCanceledException`
concern: verified the inner calls cannot raise OCE from cancellation —
`IClusterNodeProvider.GetClusterNodes()` is synchronous and takes no token, and
`StoreAndForwardStorage.GetParkedMessageCountAsync()`/`GetBufferDepthByCategoryAsync()`
take no `CancellationToken` parameter, so the only cancellation path is the outer loop's
`PeriodicTimer.WaitForNextTickAsync(stoppingToken)`, which is unaffected. Regression
test `HealthReportSenderTests.ClusterNodeRefreshFailure_IsLoggedNotSwallowed` injects a
throwing `IClusterNodeProvider`, asserts the loop still ships reports, and asserts a
warning carrying the `InvalidOperationException` is logged — confirmed to fail against
the pre-fix bare `catch { }` (logged-entry collection was empty).
### HealthMonitoring-011 — `AddHealthMonitoringActors` is a dead no-op placeholder
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/ServiceCollectionExtensions.cs:42-46` |
**Description**
`AddHealthMonitoringActors` does nothing but `return services` with a "Placeholder for
Phase 4+" comment. A public extension method that silently no-ops is a trap: a caller
who registers it will believe actor wiring is in place. No caller currently invokes it.
**Recommendation**
Remove the method until it has real behaviour, or throw `NotImplementedException` so
accidental use fails loudly. If the actor model for this component is genuinely
planned, track it in the design doc instead of a half-method.
**Resolution**
Resolved 2026-05-16 (commit `pending`). Verified a codebase-wide search — the
`AddHealthMonitoringActors` no-op extension has no production callers (only references
were in the code-review docs themselves). The HealthMonitoring module deliberately
contains no actors (transport is abstracted behind `IHealthReportTransport`; actor-side
wiring lives in the Communication module), so there is no genuine Phase-4 actor work for
this method to host. The dead placeholder was removed from `ServiceCollectionExtensions`
so a caller can no longer be misled into believing actor wiring exists. No regression
test is meaningful for a deleted no-op; removal is verified by the module continuing to
build and all 50 tests passing.
### HealthMonitoring-012 — `SiteHealthState.LatestReport` initialized to `null!`, misrepresenting the contract
| | |
|--|--|
| Severity | Low — re-triaged: already resolved as a side-effect of HealthMonitoring-002. |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/SiteHealthState.cs:11` |
**Description**
`LatestReport` is declared `SiteHealthReport LatestReport { get; set; } = null!;`,
suppressing nullability. Today every code path that creates a `SiteHealthState` (only
`ProcessReport`) assigns `LatestReport`, so it is never actually null — but the
`null!` declaration tells readers and the compiler the opposite of the real
invariant. If HealthMonitoring-007 is addressed by registering state from a heartbeat
(no report yet), this becomes a live `NullReferenceException` risk for UI code that
dereferences `LatestReport`.
**Recommendation**
Either make `LatestReport` `required` (matching how it is genuinely always set today)
or make it properly nullable `SiteHealthReport?` and have consumers handle the
"registered, no report yet" case explicitly — consistent with whatever is decided
for HealthMonitoring-007.
**Resolution**
Resolved 2026-05-16 (commit `pending`). Re-triaged: verified against the current
source — the root cause was already eliminated by the HealthMonitoring-002 /
HealthMonitoring-007 fixes. `SiteHealthState.LatestReport` is already declared
`SiteHealthReport? LatestReport { get; init; }` (the recommendation's "make it properly
nullable" option) with an XML doc explaining the `null` case ("known only via heartbeats,
has not yet sent a report"). A codebase-wide search confirms no `null!` suppression
remains anywhere in `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring`. This is exactly the change
HealthMonitoring-002 made when converting `SiteHealthState` to an immutable record, so
the contract is now honest and no further code change was required.
### HealthMonitoring-013 — Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout`
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthAggregator.cs:194-196` |
**Description**
`ExecuteAsync` derives the `PeriodicTimer` cadence with the comment "Check at half
the (shorter) offline timeout interval for timely detection", but the code only
reads `_options.OfflineTimeout`:
```csharp
var checkInterval = TimeSpan.FromMilliseconds(_options.OfflineTimeout.TotalMilliseconds / 2);
```
`CentralOfflineTimeout` (HealthMonitoring-005's fix) is never considered. With the
default options (`OfflineTimeout` 60s, `CentralOfflineTimeout` 3m) `OfflineTimeout`
genuinely is the shorter of the two, so the parenthetical happens to hold. But the
comment states an invariant the code does not enforce: if an operator configures
`CentralOfflineTimeout` *smaller* than `OfflineTimeout`, the check cadence stays
tied to `OfflineTimeout`, and central offline detection is delayed by up to a full
`OfflineTimeout / 2` beyond the intended `CentralOfflineTimeout` window. The comment
misleads a reader into believing the cadence already adapts to whichever timeout is
shorter.
**Recommendation**
Either compute `checkInterval` from `Math.Min(OfflineTimeout, CentralOfflineTimeout)`
so the code matches the comment, or drop the "(shorter)" wording and state plainly
that the cadence is derived from `OfflineTimeout` only (acceptable while the default
`CentralOfflineTimeout` is the larger value).
**Resolution**
Resolved 2026-05-17. Root cause confirmed against source — `ExecuteAsync` derived the
`PeriodicTimer` cadence solely from `OfflineTimeout` while the comment claimed it
tracked the "(shorter)" timeout. Took the code-matches-comment option: extracted the
cadence into `CentralHealthAggregator.ComputeCheckInterval`, which now derives it from
half of the *shorter* of `OfflineTimeout` and `CentralOfflineTimeout`, so a
`CentralOfflineTimeout` configured below `OfflineTimeout` is still polled at least
twice within its window. The comment was rewritten to match. Regression test
`CentralHealthAggregatorTests.CheckInterval_IsHalfTheShorterTimeout` asserts the
default case (30s) and the shorter-`CentralOfflineTimeout` case (10s) — the latter
would have returned 30s against the pre-fix code.
### HealthMonitoring-014 — `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/HealthMonitoringOptions.cs:3-20`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthAggregator.cs:196`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/HealthReportSender.cs:67`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthReportLoop.cs:63` |
**Description**
`HealthMonitoringOptions` is bound from the `ScadaBridge:HealthMonitoring` config
section (`SiteServiceRegistration.BindSharedOptions`) with no validation —
no `IValidateOptions<HealthMonitoringOptions>`, no `ValidateDataAnnotations`, no
`ValidateOnStart`. `ReportInterval`, `OfflineTimeout`, and `CentralOfflineTimeout`
are all fed straight into `new PeriodicTimer(...)` (and `OfflineTimeout` into a
division for the check interval). `PeriodicTimer`'s constructor throws
`ArgumentOutOfRangeException` for a zero or negative period. A misconfigured
`appsettings.json` (e.g. `"ReportInterval": "00:00:00"`, an empty/garbled value
that binds to `TimeSpan.Zero`, or a negative span) therefore crashes the
`HealthReportSender` / `CentralHealthReportLoop` / `CentralHealthAggregator`
hosted service at startup with an opaque exception that does not name the
offending config key, rather than failing fast with a clear validation message.
**Recommendation**
Add an options validator (DataAnnotations `[Range]`-style on the spans, or an
`IValidateOptions<HealthMonitoringOptions>`) that rejects non-positive
`ReportInterval`/`OfflineTimeout`/`CentralOfflineTimeout` and ideally requires
`CentralOfflineTimeout >= OfflineTimeout`, and call `.ValidateOnStart()` so a bad
configuration fails fast with a message naming the section and key.
**Resolution**
Resolved 2026-05-17. Root cause confirmed — `HealthMonitoringOptions` had no
validator, so a zero/negative interval reached `new PeriodicTimer(...)` and crashed
the hosted service with an opaque `ArgumentOutOfRangeException`. Added
`HealthMonitoringOptionsValidator : IValidateOptions<HealthMonitoringOptions>` that
rejects non-positive `ReportInterval`/`OfflineTimeout`/`CentralOfflineTimeout` and a
`CentralOfflineTimeout` shorter than `OfflineTimeout`, each failure naming the
`ScadaBridge:HealthMonitoring` config key. It is registered (idempotently, via
`TryAddEnumerable`) by all three `ServiceCollectionExtensions` registration methods,
so it fires when the hosted services resolve `IOptions.Value` at startup — failing
fast with a clear message. (`ValidateOnStart()` lives in the Host module's binding
call, which is out of scope; the validator nonetheless runs at startup because the
hosted-service constructors resolve the options eagerly — matching the existing
`ClusterOptionsValidator` registration pattern.) Regression tests in
`HealthMonitoringOptionsValidatorTests` cover the valid default plus zero/negative
intervals and the `CentralOfflineTimeout < OfflineTimeout` case.
### HealthMonitoring-015 — Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt`
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthAggregator.cs:122-130`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/SiteHealthState.cs:27` |
**Description**
When `MarkHeartbeat` registers a previously-unknown site (HealthMonitoring-007's
fix), it sets `LastReportReceivedAt = default` — i.e. `DateTimeOffset.MinValue`
(`0001-01-01`). The XML doc on `SiteHealthState.LastReportReceivedAt` states the
field is "Used by the UI to surface report staleness during failover." A
heartbeat-only site therefore has `LatestReport == null` **and**
`LastReportReceivedAt == DateTimeOffset.MinValue`. Any UI code that computes
"last report N ago" as `now - LastReportReceivedAt` without first checking
`LatestReport != null` will render a nonsensical staleness of roughly two
thousand years for a site that is, in fact, freshly reachable. The two
"no report yet" signals (`LatestReport == null`, `LastReportReceivedAt == default`)
are independent and both must be special-cased; the sentinel value is an easy trap.
**Recommendation**
Make `LastReportReceivedAt` nullable (`DateTimeOffset?`) so "no report received
yet" is an explicit, unmissable state rather than a magic sentinel — consistent
with how `LatestReport` was already made nullable for the same case — and have UI
consumers render staleness only when it has a value. Alternatively, document the
`default` sentinel prominently on the field and audit every UI reader, but the
nullable option is safer and matches the existing `LatestReport` treatment.
**Resolution**
Resolved 2026-05-17. Root cause confirmed against source — `CentralHealthAggregator.MarkHeartbeat`
registered a heartbeat-only site with `LastReportReceivedAt = default` (`0001-01-01`),
which `Health.razor:74` passed verbatim into `TimestampDisplay`, rendering a
~2000-year-stale timestamp. Applied the finding's recommended cross-module fix.
**HealthMonitoring:** `SiteHealthState.LastReportReceivedAt` is now `DateTimeOffset?`,
and `MarkHeartbeat`'s unknown-site registration sets it to `null` — "no report yet"
is now an explicit nullable state, consistent with the already-nullable `LatestReport`.
`ProcessReport` continues to set a real instant. **CentralUI:** `TimestampDisplay.Value`
now accepts `DateTimeOffset?` and renders `null` as a plain `text-muted` placeholder
(default "never", configurable via a new `NullText` parameter); existing non-null
callers (`AuditLog`, `EventLogs`, `Deployments`) are unaffected by the implicit
widening. `Health.razor`'s "Last report" cell passes `NullText="awaiting first report"`
so a heartbeat-only site reads cleanly. Regression tests:
`CentralHealthAggregatorTests.MarkHeartbeat_RegistersUnknownSite_WithNullLastReportReceivedAt`
and `ProcessReport_SetsLastReportReceivedAt_ForHeartbeatRegisteredSite` (HealthMonitoring —
the first would not compile against the pre-fix non-nullable field), and
`TimestampDisplayTests.Render_NullValue_ShowsNeverPlaceholder`,
`Render_NullValue_DoesNotRenderYear0001Sentinel`, `Render_NonNullValue_ShowsFormattedTimestamp`
(CentralUI).
### HealthMonitoring-016 — `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider`
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/SiteHealthCollector.cs:151` |
**Description**
`CollectReport` stamps each report with `ReportTimestamp: DateTimeOffset.UtcNow`,
read directly from the system clock. Every other time-dependent class in the
module — `CentralHealthAggregator`, `HealthReportSender`, `CentralHealthReportLoop`
— was deliberately refactored (HealthMonitoring-006) to take an injectable
`TimeProvider` so the behaviour is deterministically testable and the clock
dependency is explicit. `SiteHealthCollector` is the lone holdout: the report
timestamp cannot be controlled in a unit test, which is why
`SiteHealthCollectorTests.CollectReport_IncludesUtcTimestamp` can only assert the
timestamp falls in a `before`/`after` wall-clock window rather than equalling a
known instant. This is a minor consistency/testability gap, not a behaviour bug.
**Recommendation**
Add an optional `TimeProvider` constructor parameter to `SiteHealthCollector`
(defaulting to `TimeProvider.System`, mirroring the other classes) and derive
`ReportTimestamp` from `GetUtcNow()`, so the report timestamp is deterministically
testable and the module is consistent.
**Resolution**
Resolved 2026-05-17. Root cause confirmed — `CollectReport` stamped `ReportTimestamp`
from `DateTimeOffset.UtcNow` directly while every other time-dependent class in the
module takes an injectable `TimeProvider`. Added an optional `TimeProvider`
constructor parameter to `SiteHealthCollector` (defaulting to `TimeProvider.System`,
mirroring `CentralHealthAggregator`/`HealthReportSender`/`CentralHealthReportLoop`)
and `CollectReport` now derives `ReportTimestamp` from `_timeProvider.GetUtcNow()`.
The DI registration (`AddSingleton<ISiteHealthCollector, SiteHealthCollector>`)
continues to work via the optional parameter. Regression test
`SiteHealthCollectorTests.CollectReport_StampsTimestamp_FromInjectedTimeProvider`
asserts the timestamp equals a fixed injected instant exactly (not just a
before/after window); it would not compile against the pre-fix single-arg-less
constructor.
### HealthMonitoring-017 — `HealthReportSender` resets interval counters before `Send`; transport failures silently drop the interval's error counts
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/HealthReportSender.cs:140-154`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/SiteHealthCollector.cs:146-153` |
**Resolution (2026-05-28):** Wrapped `_transport.Send(reportWithSeq)` in an inner try/catch that, on failure, atomically restores the captured per-interval counts via a new `ISiteHealthCollector.AddIntervalCounters(scriptErrors, alarmErrors, deadLetters, siteAuditWriteFailures, auditRedactionFailures)` API backed by `Interlocked.Add`. Concurrent increments arriving during the Send accumulate against the zero left by `CollectReport`'s `Exchange`; the restore Add sums correctly with them. The new interface method ships with a default no-op so existing test fakes (`CountCapturingHealthCollector` etc.) keep compiling without per-fake updates. Regression test `HealthReportSenderTests.SendFailure_PreservesIntervalCountersForNextReport` pre-populates all five counters, makes the first Send throw, and asserts the next successful report carries the original counts (2 / 1 / 3 / 1 / 2).
**Description**
`HealthReportSender.ExecuteAsync` calls `_collector.CollectReport(_siteId)` and
then `_transport.Send(reportWithSeq)` inside a single `try` block whose `catch`
logs and continues. `CollectReport` atomically read-and-resets the per-interval
counters via `Interlocked.Exchange(ref _scriptErrorCount, 0)` (and the same for
`_alarmErrorCount`, `_deadLetterCount`, `_siteAuditWriteFailures`,
`_auditRedactionFailures`). If `_transport.Send` then throws — Akka remoting
hiccup, transport not yet associated, central side temporarily unavailable,
serialization failure on a malformed metric, etc. — the `catch (Exception ex)`
on line 150 logs an error and the loop simply waits for the next tick. The
report was never delivered, but the counters have already been reset to zero, so
**every error this site recorded in the failed interval is gone**: it is neither
in the (un-sent) report nor in the (zeroed) collector. The very next successful
report will show "0 script errors / 0 alarm errors" for the entire window in
which the transport was broken, masking exactly the period the operator most
needs to triage.
This contradicts the design doc's "raw counts per reporting interval" / "counter
resets **after each report is sent**" wording — current code resets on each
report _attempt_, regardless of outcome. The hazard worsens under sustained
transport failure: every interval's errors are lost; the central dashboard sees
a quiet site while the site is, in fact, failing.
The same shape exists in `CentralHealthReportLoop` (see HealthMonitoring-018) —
`CollectReport` is called before `_aggregator.ProcessReport`. The aggregator
call is in-process and unlikely to throw, but the structural bug is identical.
**Recommendation**
Build the report from a non-destructive read first (`PeekReport(siteId)`,
returning a snapshot without mutating the counters) and only call a dedicated
`ResetIntervalCounters()` after a successful `_transport.Send`. Alternatively,
on a `Send` failure, restore the lost counts via `Interlocked.Add` of the
captured values back into the collector fields — atomically correct as long as
no other thread can read them in between, which is true here because the next
read is the next `CollectReport` on the same loop. The "peek then commit"
shape is the cleaner public API.
A regression test should add a failing-transport scenario:
`Send` throws an `InvalidOperationException`; assert that the next successful
report includes the previously-failed interval's `ScriptErrorCount`.
### HealthMonitoring-018 — Same counter-reset-before-publish hazard in `CentralHealthReportLoop`
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthReportLoop.cs:87-98` |
**Resolution (2026-05-28):** Same shape of fix as HealthMonitoring-017 — `_aggregator.ProcessReport(reportWithSeq)` now sits inside an inner try/catch that, on failure, calls `_collector.AddIntervalCounters(...)` with the captured report's counts. Reuses the same `ISiteHealthCollector.AddIntervalCounters` API; no extra collector surface. Regression test `CentralHealthReportLoopTests.ProcessReportFailure_PreservesIntervalCountersForNextReport` pre-populates all five counters, makes the first `ProcessReport` throw, and asserts the next successful report carries the original counts.
**Description**
`CentralHealthReportLoop.ExecuteAsync` calls `_collector.CollectReport(CentralSiteId)`
(which resets the per-interval counters on the shared `SiteHealthCollector`
instance — see HealthMonitoring-017) and then `_aggregator.ProcessReport(reportWithSeq)`
inside the same `try` block. If `ProcessReport` throws, the central node's own
per-interval counters (`ScriptErrorCount`, `AlarmEvaluationErrorCount`,
`DeadLetterCount`, `SiteAuditWriteFailures`, `AuditRedactionFailure`) are lost
for that interval.
In practice `ProcessReport` is a pure in-memory CAS loop and is very unlikely
to throw, so the operational impact is small. However, the structural bug is
identical to HealthMonitoring-017 and would be fixed by the same
"peek then commit" refactor in `SiteHealthCollector`. The Audit-Log-related
metrics matter most here: `AuditRedactionFailure` is genuinely incremented at
central during normal operation (the Notification Outbox dispatcher and
Inbound API middleware both write through `CentralAuditRedactionFailureCounter`
which can fan out to the collector via the bridge), so this is not purely
theoretical.
**Recommendation**
Adopt the same "peek then reset on successful publish" pattern recommended for
HealthMonitoring-017. Reuse the new `PeekReport` / `ResetIntervalCounters`
collector API once it lands.
### HealthMonitoring-019 — `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` design-doc metrics have no HealthMonitoring-side surface
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `docs/requirements/Component-HealthMonitoring.md:39,40`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/ICentralHealthAggregator.cs`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditCentralHealthSnapshot.cs:39-58` |
**Resolution (2026-05-28):** Took the simpler doc-removal path rather than surfacing the metrics through a new HealthMonitoring API. Removed `SiteAuditTelemetryStalled` and `CentralAuditWriteFailures` from the Monitored Metrics table, the Audit Log KPIs section, and the Audit Log Dependencies entry in `Component-HealthMonitoring.md`. The two metrics remain internal to `AuditLog`'s `AuditCentralHealthSnapshot` — promoting them to dashboard tiles is a separate feature, out of scope here.
**Description**
`Component-HealthMonitoring.md` lists `SiteAuditTelemetryStalled` and
`CentralAuditWriteFailures` (and reiterates them under the Audit Log KPIs
section and in the Dependencies section) as required dashboard metrics. The
doc also says they "are central-computed alongside the existing central KPIs"
(Notification Outbox / Site Call Audit) and surface in the **Audit** dashboard
tile group.
Tracing the code:
- `SiteAuditTelemetryStalled` is published by `SiteAuditReconciliationActor`,
picked up by `SiteAuditTelemetryStalledTracker`, and latched into
`AuditCentralHealthSnapshot._stalled` (a `ConcurrentDictionary<string, bool>`
in the `ZB.MOM.WW.ScadaBridge.AuditLog` assembly).
- `CentralAuditWriteFailures` is incremented inside `AuditCentralHealthSnapshot`
via `ICentralAuditWriteFailureCounter.Increment()` (also in `ZB.MOM.WW.ScadaBridge.AuditLog`).
Neither metric is referenced anywhere in `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/`:
- `ICentralHealthAggregator` does not expose them.
- `SiteHealthCollector` has no central counterpart (it is site-only).
- `SiteHealthReport` has no `SiteAuditTelemetryStalled` / `CentralAuditWriteFailures`
fields (the site-only `SiteAuditWriteFailures`, `AuditRedactionFailure`, and
`SiteAuditBacklog` _are_ wired; the central pair is the gap).
Currently the only consumer of `IAuditCentralHealthSnapshot` is whatever
Central UI page binds to it directly (out of scope for this module), but the
design doc places these metrics under HealthMonitoring's responsibility
("Health Monitoring Dashboard displays aggregated metrics"). At minimum the
Dependencies section's claim that Health Monitoring provides "the
central-computed `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics"
is false for `CentralAuditWriteFailures`: nothing under
`src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/` knows about it.
**Recommendation**
Decide whether HealthMonitoring or the consuming UI page owns the
`IAuditCentralHealthSnapshot` integration:
- If HealthMonitoring owns it, expose a `CentralKpis` accessor on
`ICentralHealthAggregator` (e.g. a `GetCentralAuditHealth()` method that
returns a typed DTO derived from the injected `IAuditCentralHealthSnapshot`)
so the dashboard has a single read surface mirroring `GetAllSiteStates`.
- If the UI page binds `IAuditCentralHealthSnapshot` directly, update the
HealthMonitoring design doc's Responsibilities / Dependencies sections to
reflect that and remove the implied integration.
Either way, add a regression test that the chosen surface returns the live
counter and per-site stalled state.
### HealthMonitoring-020 — `MarkHeartbeat` brings offline site back online with a stale `LastHeartbeatAt` when `receivedAt <= existing.LastHeartbeatAt`
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthAggregator.cs:128-147` |
**Resolution (2026-05-28):** `MarkHeartbeat` now branches on
`existing.IsOnline`: when transitioning offline-to-online it anchors
`LastHeartbeatAt` to `max(receivedAt, _timeProvider.GetUtcNow())` so an
out-of-order or older `receivedAt` cannot leave the recovered site one tick
away from re-going-offline. The online path retains the prior
`max(receivedAt, existing.LastHeartbeatAt)` semantics. Regression test
`MarkHeartbeat_OfflineToOnline_StampsFreshLastHeartbeatAt` asserts both the
fresh `LastHeartbeatAt` (within 5 s of "now") and that the next
`CheckForOfflineSites` does not flap the site back to offline.
**Description**
The CAS path in `MarkHeartbeat` picks `newHeartbeat = max(receivedAt, existing.LastHeartbeatAt)`,
then short-circuits only when `newHeartbeat == existing.LastHeartbeatAt &&
existing.IsOnline`. That short-circuit is correct, but consider the case where
`existing.IsOnline == false` and `receivedAt <= existing.LastHeartbeatAt`:
1. Suppose a site is marked offline by `CheckForOfflineSites` at time T1.
2. A late/out-of-order heartbeat carrying a `receivedAt` _older_ than the last
stored `LastHeartbeatAt` arrives at T2 (clock skew at the receive site, or a
delayed message that was generated before the offline-marking).
3. `newHeartbeat == existing.LastHeartbeatAt` (kept), but the short-circuit
condition fails because `existing.IsOnline == false`, so the CAS produces a
new record with `IsOnline = true` and the **stale** `LastHeartbeatAt`.
4. On the very next `CheckForOfflineSites` tick (≤ `OfflineTimeout/2` later),
`now - LastHeartbeatAt` is still ≥ `OfflineTimeout`, so the site is
immediately marked offline again — the heartbeat brought it online for less
than the check cadence, producing a "flap" in the dashboard.
In practice `receivedAt` is normally `_timeProvider.GetUtcNow()` at the
`CentralCommunicationActor` receive site, so monotonically increasing — the bug
is latent. But the contract `MarkHeartbeat(string siteId, DateTimeOffset receivedAt)`
makes no guarantee about ordering, and an out-of-order delivery (Akka remoting
ordering across connection re-establishment edge cases) or a small wall-clock
correction at central would expose it.
**Recommendation**
When transitioning offline → online, use `now` (from the injected
`TimeProvider`) rather than the caller-supplied `receivedAt` for
`LastHeartbeatAt`, or take `max(receivedAt, _timeProvider.GetUtcNow())` so the
recovery point is always recent. A unit test driving `MarkHeartbeat` with a
`receivedAt` older than the last stored heartbeat on an offline site, then a
`CheckForOfflineSites` immediately afterwards, would assert the site stays
online.
### HealthMonitoring-021 — `CentralSiteId = "central"` reserved constant silently collides with a real site named "central"
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthReportLoop.cs:22`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthAggregator.cs:224-226` |
**Resolution (2026-05-28):** `CentralHealthReportLoop.CentralSiteId` is now `"$central"` instead of `"central"`. The leading `$` is forbidden in operator-set `Site.SiteIdentifier` values (which are plain identifiers), so the synthetic central self-report cannot collide with a real site whose identifier happens to be the bare word `"central"`. The collision case the finding called out — two reports clobbering each other in the aggregator keyspace via the sequence-number guard and a real site inheriting `CentralOfflineTimeout` and staying falsely-online for an extra two minutes — is now impossible. The aggregator (`CentralHealthAggregator.CheckForOfflineSites`), the Central UI health dashboard (`Monitoring/Health.razor`), and every test reference the constant rather than the literal string, so the value change is local — no consumer code needed updating. Existing `CentralHealthAggregatorTests` and `CentralHealthReportLoopTests` already use the constant, so they continue to pin the central-self-report identity through the new sentinel.
**Description**
`CentralHealthAggregator.CheckForOfflineSites` looks up the per-site offline
timeout with:
```csharp
var timeout = kvp.Key == CentralHealthReportLoop.CentralSiteId
? _options.CentralOfflineTimeout
: _options.OfflineTimeout;
```
`CentralSiteId` is the literal string `"central"`. Site IDs are free-form
strings set in configuration / the Sites repository; there is no validation
that excludes the reserved `"central"` name. An operator who creates a real
site with `SiteId = "central"` will have:
- Their real-site reports arriving via `ProcessReport` get stored in the same
dictionary slot as the central self-report (they share the keyspace), so the
central self-report and the real-site report repeatedly overwrite each
other via the sequence-number guard — whichever has the higher Unix-ms seed
wins, and the other is silently rejected as stale. The dashboard alternates
between two unrelated payloads.
- The real site gets the longer `CentralOfflineTimeout` (default 3 minutes)
instead of the normal `OfflineTimeout` (60 s), so a genuinely-failed real
site marked "central" stays falsely-online for an extra two minutes.
**Recommendation**
Two options:
1. Reject the reserved name at the Site entity / configuration validation
layer (Configuration Database component, out of this module's scope) and
document `"central"` as reserved. This is the cleaner UX fix.
2. As a defence-in-depth inside HealthMonitoring, store the central
self-report under a key that cannot collide — e.g. prefix it with a
character that is forbidden in real site IDs (`":central"` or `"#central"`)
— and adjust `CheckForOfflineSites` accordingly.
Either fix should include a regression test creating a real `SiteHealthReport`
with `SiteId = "central"` and asserting the central self-report's identity is
preserved.
### HealthMonitoring-022 — `CentralHealthReportLoopTests` uses real-time `PeriodicTimer` + `Task.Delay`; flake-prone on slow CI
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.HealthMonitoring.Tests/CentralHealthReportLoopTests.cs:32-42` |
**Resolution (2026-05-28):** Picked the less-invasive recommended option (a) — kept the real-time `PeriodicTimer` but replaced the fixed-budget `Task.Delay` with a generous poll-until-condition helper. `RunLoopUntil(loop, condition, maxWait = 5s)` starts the hosted service, polls `condition` every 25 ms with a 5 s outer cap, and stops cleanly when met; `GeneratesCentralReports_WhenSelfIsPrimary`, `AssignsMonotonicSequenceNumbers`, and `ProcessReportFailure_PreservesIntervalCountersForNextReport` now use it. The legacy `RunLoopBriefly` retains a fixed wait (≥ 1 s) for the two tests that assert *absence* of reports (`GeneratesNoReports_WhenNotPrimary`, `SetsActiveNodeFlag_EvenWhenNotPrimary`) since there is no condition to poll for. Refactoring the production loop to consume `TimeProvider.CreateTimer` (option b) was rejected for batch scope — it would require a production change for what is currently a low-severity test-hygiene gap. All 73 `ZB.MOM.WW.ScadaBridge.HealthMonitoring.Tests` pass; the new generous budget tolerates slow CI runners.
**Description**
`RunLoopBriefly` starts the hosted service with a 50 ms `PeriodicTimer` and
then `await Task.Delay(runForMs, CancellationToken.None)` (with `runForMs`
between 150 ms and 300 ms). `GeneratesCentralReports_WhenSelfIsPrimary` and
`AssignsMonotonicSequenceNumbers` both assert "at least 2 reports were
generated" within the window. On a heavily-contended CI runner where the
hosted-service start-up plus a couple of `PeriodicTimer` ticks can blow past
300 ms, these tests will silently flake.
The rest of the suite (`CentralHealthAggregatorTests`, `SiteHealthCollectorTests`,
`HealthReportSenderTests` partially) was deliberately refactored to use the
injected `TimeProvider` precisely to avoid this. `CentralHealthReportLoop` and
`HealthReportSender` already accept a `TimeProvider`, but the loop's
`PeriodicTimer` is still real-time because `PeriodicTimer` does not consume
the `TimeProvider` parameter.
**Recommendation**
Either (a) accept the timing-sensitivity and bump the delay budget
generously, or (b) refactor the hosted-service loop to use a
`TimeProvider.CreateTimer`-based tick mechanism so the test can advance a
fake clock and assert deterministically how many ticks fire. Option (b) is
the better long-term fix and matches the pattern used elsewhere in the
module's tests.
### HealthMonitoring-023 — `StoreAndForwardBufferDepths_IsEmptyPlaceholder` test name is stale; it now covers the default-state contract, not a placeholder
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.HealthMonitoring.Tests/SiteHealthCollectorTests.cs:117-122` |
**Resolution (2026-05-28):** Renamed the test method from `StoreAndForwardBufferDepths_IsEmptyPlaceholder` to `StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled` so the name describes what it actually asserts — the default-state contract of a fresh collector. No behaviour change; the body still constructs a collector without calling `SetStoreAndForwardDepths` and asserts `Empty(report.StoreAndForwardBufferDepths)`.
**Description**
The test `StoreAndForwardBufferDepths_IsEmptyPlaceholder` was originally named
to codify the HealthMonitoring-001 bug ("`SetStoreAndForwardDepths` has no
callers, so `StoreAndForwardBufferDepths` is always empty"). HealthMonitoring-001
is `Resolved``HealthReportSender` now populates per-category depths from
the S&F engine, and the same test class has `SetStoreAndForwardDepths_ReflectedInReport`
covering the populated path. The "placeholder" test still passes because it
constructs a fresh collector and never calls the setter, so its assertion
(`Assert.Empty(report.StoreAndForwardBufferDepths)`) is now testing the
**default empty state of an un-configured collector**. The HealthMonitoring-001
resolution note explicitly chose to keep it as "the collector-level
default-state test", but the test method name and the implied semantics no
longer match.
A maintainer reading the test name today will misread it as documentation that
the metric is unimplemented (which it isn't), and may waste time investigating
a non-bug.
**Recommendation**
Rename to `StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled`
(or similar) and update the test body's intent — purely a documentation /
maintainability fix; no behaviour change.