docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28)
plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381.

67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit
fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix
with rationale. The 4 findings still pending are all InboundAPI Database-helper findings
(IA-026, rated High, through IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to
sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024),
an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001),
and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary
is semantically sound (symbol-based) in the production cluster config.

README regenerated; regen-readme.py --check passes (4 pending / 567 total).
Joseph Doherty
2026-06-20 18:02:32 -04:00
parent fd618cf1dc
commit d39089f4ed
19 changed files with 4031 additions and 69 deletions
+149 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring` |
| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 2 |
| Commit reviewed | `4307c381` |
| Open findings | 0 |
## Summary
@@ -80,6 +80,45 @@ covers (HealthMonitoring-023). All sequence-number and offline-detection
arithmetic uses `_timeProvider.GetUtcNow()` consistently — no wall-clock vs
monotonic mismatch was observed.
#### Re-review 2026-06-20 (commit `4307c381`) — full review
All twenty-three prior findings (HealthMonitoring-001..023) remain `Resolved`
and were spot-verified against the current source: the HealthMonitoring-017/018
"snapshot-then-restore-on-failure" counter logic is in place
(`HealthReportSender.cs:158-176`, `CentralHealthReportLoop.cs:118-134`,
`SiteHealthCollector.AddIntervalCounters`), the HealthMonitoring-020 offline→online
`max(receivedAt, now)` anchor is correct (`CentralHealthAggregator.cs:138-149`),
and the HealthMonitoring-021 `$central` sentinel is honoured by the aggregator and
the Central UI. The full 10-category checklist produced **2 new findings, both Low,
none crash-class**. Both are confined to the new M6 `SiteHealthKpiSampleSource`
(KPI History sampling) added since the last baseline. First, the synthetic `$central`
self-report is sampled as a real `KpiScopes.Site` series with meaningless
zero connection/instance/S&F values (HealthMonitoring-024). Second, 8 of the 12
emitted Site Health metrics are persisted every minute per site but never read by
any chart, while the design doc claims all 12 are "rendered in the dashboard's
per-site KpiTrendChart panel" (HealthMonitoring-025). The KPI-recorder
scope/captive-dependency reasoning is sound (recorder opens a per-pass scope and
the scoped source over a singleton aggregator is the normal lifetime relationship),
the per-source fault isolation is correct, and the `AddIntervalCounters` /
`SetSiteEventLogWriteFailures` default-interface no-ops are an appropriate
test-fake-compatibility seam. No concurrency, security, or sequence-arithmetic
regressions were found.
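The snapshot-then-restore counter behaviour verified above can be illustrated with a minimal sketch. It assumes a single `long` counter and a transport callback that reports success as a `bool`. `IntervalCounter`, `SnapshotRestoreSketch`, and `TrySend` are hypothetical names for illustration only, not the module's actual types.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for one interval counter; not the real SiteHealthCollector.
public sealed class IntervalCounter
{
    private long _value;

    public void Increment() => Interlocked.Increment(ref _value);

    // Snapshot: atomically read the count and leave zero behind.
    public long TakeSnapshot() => Interlocked.Exchange(ref _value, 0);

    // Restore: add an unsent snapshot back, so it sums with any increments
    // that landed after the snapshot was taken.
    public void Restore(long unsent) => Interlocked.Add(ref _value, unsent);

    public long Current => Interlocked.Read(ref _value);
}

public static class SnapshotRestoreSketch
{
    // Returns true if the snapshot was sent; otherwise restores it and returns false.
    public static bool TrySend(IntervalCounter counter, Func<long, bool> send)
    {
        long snapshot = counter.TakeSnapshot();
        try
        {
            if (send(snapshot)) return true;
        }
        catch (Exception)
        {
            // Treat a throwing transport like a failed send.
        }
        counter.Restore(snapshot);
        return false;
    }

    public static void Main()
    {
        var counter = new IntervalCounter();
        for (int i = 0; i < 3; i++) counter.Increment();

        // The failing send also simulates one increment arriving mid-send.
        bool sent = TrySend(counter, _ => { counter.Increment(); return false; });

        Console.WriteLine($"sent={sent}, current={counter.Current}"); // sent=False, current=4
    }
}
```

Restoring with an atomic add rather than an overwrite is what keeps the increment that arrived during the failed send from being lost.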
_Re-review (2026-06-20, `4307c381`):_
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | Re-verified HealthMonitoring-017/018/020 fixes in place and correct. `SiteHealthKpiSampleSource.CollectAsync` arithmetic (connectionsUp/Down split, summed `sfBufferDepth`, null-`SiteAuditBacklog`→0) is correct. No new logic bug. |
| 2 | Akka.NET conventions | x | Module still contains no actors. `IHealthReportTransport.Send` is fire-and-forget (Tell semantics). KPI source is a plain `IKpiSampleSource` consumed by the KpiHistory recorder singleton (out of scope). No issues found. |
| 3 | Concurrency & thread safety | x | `SiteHealthCollector` counters use `Interlocked`; `AddIntervalCounters` restore is per-field `Interlocked.Add` and correctly sums with concurrent increments against the zero left by `CollectReport`'s `Exchange`. Aggregator CAS pattern unchanged. `SiteEventLogFailureCountReporter` uses a linked CTS + `Task.Run` loop with catch-all probe isolation — sound. No issues found. |
| 4 | Error handling & resilience | x | Counter-restore-on-failure (HM-017/018) confirmed. KPI source is exception-isolated by the recorder per-source. `SiteEventLogFailureCountReporter.SafeProbe` logs and continues. No issues found. |
| 5 | Security | x | No issues found. Numeric/string operational metrics only; no secrets, auth surface, or untrusted-input parsing. |
| 6 | Performance & resource management | x | `PeriodicTimer`/CTS disposed correctly. New: `SiteHealthKpiSampleSource` writes 12 Site-scoped rows per site per minute, of which 8 are never charted (HealthMonitoring-025) — bounded but wasteful; also emits a full row-set for the synthetic `$central` (HealthMonitoring-024). |
| 7 | Design-document adherence | x | Design doc line 119 claims all 12 emitted Site Health metrics are "rendered in the dashboard's per-site KpiTrendChart panel" — only 4 (`connectionsDown`/`scriptErrors`/`sfBufferDepth`/`deadLetters`) are actually charted (HealthMonitoring-025). Synthetic `$central` is sampled as a real Site scope (HealthMonitoring-024). |
| 8 | Code organization & conventions | x | KPI source's metric-name catalog is split: 4 charted metrics share `KpiMetrics.SiteHealth` (Commons), 8 are private literals — noted under HealthMonitoring-025. Options/validator ownership and idempotent registration correct. `AddCentralHealthAggregation` registers the scoped KPI source via `TryAddEnumerable` — correct. |
| 9 | Testing coverage | x | `SiteHealthKpiSampleSourceTests` covers the populated/heartbeat-only/null-backlog/empty paths and exact (metric,value) tuples well. No test asserts `$central` is excluded (it currently is not — HealthMonitoring-024). Otherwise coverage is strong (73 tests). |
| 10 | Documentation & comments | x | XML docs accurate. The only doc drift is design-doc line 119's "all 12 rendered" claim (HealthMonitoring-025). |
## Checklist coverage
| # | Category | Examined | Notes |
@@ -1153,3 +1192,110 @@ a non-bug.
Rename to `StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled`
(or similar) and update the test body's intent — purely a documentation /
maintainability fix; no behaviour change.
### HealthMonitoring-024 — `SiteHealthKpiSampleSource` samples the synthetic `$central` self-report as a real `KpiScopes.Site` series
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/Kpi/SiteHealthKpiSampleSource.cs:69-86`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthReportLoop.cs:115-120` |
**Description**
`SiteHealthKpiSampleSource.CollectAsync` iterates
`ICentralHealthAggregator.GetAllSiteStates()` and emits the 12-metric catalog
for **every** entry whose `LatestReport` is non-null. The aggregator keyspace
includes the synthetic central self-report under `CentralHealthReportLoop.CentralSiteId`
(`"$central"`), which `CentralHealthReportLoop` feeds via `ProcessReport` every
30s — so `$central` always has a non-null `LatestReport` once the loop has run.
The source therefore writes 12 `KpiSample` rows per minute with
`Scope = KpiScopes.Site`, `ScopeKey = "$central"` for the central cluster.
The class XML doc, the design doc (`Component-HealthMonitoring.md:119`,
"emits `IKpiSampleSource` (`SiteHealthKpiSampleSource`, per-Site)"), and the
`KpiScopes.Site` scope all describe this as a **per-real-site** series.
`$central` is not a site — its `DataConnectionStatuses`, `StoreAndForwardBufferDepths`,
`ParkedMessageCount`, and instance counts are structurally empty/zero
(the central node runs no DCL, no S&F engine, no instances), so the persisted
`connectionsUp`/`connectionsDown`/`sfBufferDepth`/`parkedMessages`/`deployedInstances`
trend rows for `$central` are meaningless constant-zero noise. The Central UI
trend selector (`Health.razor:570-572`) deliberately pins `$central` first and
queries `KpiScopes.Site`/`scopeKey="$central"`, so the dashboard will plot these
flat-zero "Central Cluster" trends.
This matches the UI, which renders central as a card by design, so it is not a
behaviour bug. It does, however, conflate a synthetic non-site with real-site KPI history,
permanently store meaningless zero series, and contradict the "per-Site"
contract. None of the other M6 sample sources sample a synthetic scope key.
**Recommendation**
Decide whether central-cluster trends belong in the per-Site KPI store. If not,
skip the `CentralHealthReportLoop.CentralSiteId` entry in `CollectAsync` (a single
`if (siteId == CentralHealthReportLoop.CentralSiteId) continue;` guard) and remove
the `$central` option from the trend selector. If central trends are intended,
give them a distinct scope (e.g. a `KpiScopes.Central` scope or a central-specific
metric subset) so the data is not labelled as a real site, and update the design
doc + class XML doc to say so. Either way add a test asserting the chosen
behaviour for `$central`.
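The first option, skipping the central entry, might look roughly like the sketch below. It models the aggregator state as a simple site-id to has-report map, which is an assumption made to keep the example self-contained. `CentralGuardSketch` and `SitesToSample` are hypothetical names, and the real `CollectAsync` signature is not reproduced here.

```csharp
using System;
using System.Collections.Generic;

public static class CentralGuardSketch
{
    // Assumed to mirror CentralHealthReportLoop.CentralSiteId.
    public const string CentralSiteId = "$central";

    // Hypothetical simplified input: site id -> whether a LatestReport exists.
    public static List<string> SitesToSample(IEnumerable<KeyValuePair<string, bool>> states)
    {
        var result = new List<string>();
        foreach (var state in states)
        {
            if (!state.Value) continue;               // no report yet: nothing to sample
            if (state.Key == CentralSiteId) continue; // synthetic self-report: not a real site
            result.Add(state.Key);
        }
        return result;
    }

    public static void Main()
    {
        var states = new Dictionary<string, bool>
        {
            ["$central"] = true,
            ["site-a"] = true,
            ["site-b"] = false,
        };
        Console.WriteLine(string.Join(",", SitesToSample(states))); // site-a
    }
}
```

A test for the chosen behaviour would assert on exactly this list: `$central` is absent if the guard is adopted, present if central trends are kept.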
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): the design doc now documents that the synthetic `$central` self-report is intentionally sampled into the per-site KPI store (the Central UI deliberately pins `$central` first in the trend selector as 'Central Cluster'), and that its zero connection/instance/S&F values reflect the self-report carrying no site-runtime data — not a defect. Doc-only.
### HealthMonitoring-025 — 8 of the 12 emitted Site Health KPI metrics are never charted; design doc claims all 12 are rendered
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/Kpi/SiteHealthKpiSampleSource.cs:40-51`, `docs/requirements/Component-HealthMonitoring.md:119` |
**Description**
`SiteHealthKpiSampleSource` emits 12 metrics per site per sampling pass:
`connectionsUp`, `connectionsDown`, `scriptErrors`, `alarmEvalErrors`,
`sfBufferDepth`, `deadLetters`, `parkedMessages`, `deployedInstances`,
`enabledInstances`, `disabledInstances`, `auditBacklogPending`,
`eventLogWriteFailures`. Only **four** are ever read back: the Central UI
per-site trend panel calls `LoadTrendSeriesAsync` for exactly
`KpiMetrics.SiteHealth.ConnectionsDown`, `.ScriptErrors`, `.SfBufferDepth`, and
`.DeadLetters` (`Health.razor:596-603`) — and only those four are in the public
`KpiMetrics.SiteHealth` catalog (`Commons/Types/Kpi/KpiMetrics.cs:84-97`). The
other 8 metric names are private string literals in the source and have no
reader anywhere in the codebase.
Two consequences:
1. **Stale design doc.** `Component-HealthMonitoring.md:119` lists all 12 metric
names and states they are "rendered in the dashboard's per-site
`KpiTrendChart` panel." Eight of them are never rendered, so readers and future
maintainers are misled about what the trend panel shows.
2. **Persisted-but-dead samples.** The `KpiHistoryRecorderActor` writes all 12
rows to the central `KpiSample` table every minute for every reporting site;
8 are pure write-amplification that nothing queries and that the 90-day
purge eventually drops. Bounded, but wasteful — roughly two-thirds of the
Site Health KPI rows are never read.
A related maintainability smell: the split catalog (4 metrics shared via
`KpiMetrics.SiteHealth`, 8 private literals) means a future chart added for, say,
`alarmEvalErrors` would have to re-type the literal and risk a silent typo
mismatch against the source.
**Recommendation**
Reconcile intent in one of two directions: (a) if only the four charted metrics
are wanted, stop emitting the other eight (and correct the design doc's list to
the four actually rendered); or (b) if the full 12-metric history is intended for
future charts / ad-hoc query, promote all 12 names into the public
`KpiMetrics.SiteHealth` catalog so source and any future UI binding key off one
symbol, and either add the missing charts or soften the doc's "rendered … panel"
wording to "recorded for trend history." Add the design-doc fix in the same
session per the repo's doc-and-code-travel-together rule.
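As a rough illustration of option (b), a single catalog could expose all 12 names plus the charted subset, so that the sample source and any chart binding reference the same symbols. The class and member names below are hypothetical, and the string values are assumed to equal the metric names listed in the description. The real `KpiMetrics.SiteHealth` layout may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical catalog; the real KpiMetrics.SiteHealth may use different member names.
public static class SiteHealthMetricNames
{
    public const string ConnectionsUp = "connectionsUp";
    public const string ConnectionsDown = "connectionsDown";
    public const string ScriptErrors = "scriptErrors";
    public const string AlarmEvalErrors = "alarmEvalErrors";
    public const string SfBufferDepth = "sfBufferDepth";
    public const string DeadLetters = "deadLetters";
    public const string ParkedMessages = "parkedMessages";
    public const string DeployedInstances = "deployedInstances";
    public const string EnabledInstances = "enabledInstances";
    public const string DisabledInstances = "disabledInstances";
    public const string AuditBacklogPending = "auditBacklogPending";
    public const string EventLogWriteFailures = "eventLogWriteFailures";

    // Everything the sample source emits.
    public static readonly IReadOnlyList<string> All = new[]
    {
        ConnectionsUp, ConnectionsDown, ScriptErrors, AlarmEvalErrors,
        SfBufferDepth, DeadLetters, ParkedMessages, DeployedInstances,
        EnabledInstances, DisabledInstances, AuditBacklogPending, EventLogWriteFailures,
    };

    // The subset the trend panel currently reads.
    public static readonly IReadOnlyList<string> Charted = new[]
    {
        ConnectionsDown, ScriptErrors, SfBufferDepth, DeadLetters,
    };

    public static void Main()
    {
        var uncharted = All.Except(Charted).ToList();
        Console.WriteLine($"{All.Count} emitted, {Charted.Count} charted, {uncharted.Count} uncharted");
        // 12 emitted, 4 charted, 8 uncharted
    }
}
```

With one catalog, a future chart for `alarmEvalErrors` references `SiteHealthMetricNames.AlarmEvalErrors` instead of re-typing the literal, which removes the silent-typo risk noted above.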
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): corrected the design doc — of the 12 sampled Site Health KPI metrics only four are charted (connectionsDown, scriptErrors, sfBufferDepth, deadLetters); the other eight are now documented as sampled-but-not-charted (retained for future surfaces / ad-hoc query). Doc-only.