docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked
Full per-module re-review of the 16 stale modules (last seen at 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low).

Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. The 4 findings still pending are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites; DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023).

ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total).
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring` |
| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 2 |
| Commit reviewed | `4307c381` |
| Open findings | 0 |

## Summary

@@ -80,6 +80,45 @@ covers (HealthMonitoring-023). All sequence-number and offline-detection
arithmetic uses `_timeProvider.GetUtcNow()` consistently — no wall-clock vs
monotonic mismatch was observed.

#### Re-review 2026-06-20 (commit `4307c381`) — full review

All twenty-three prior findings (HealthMonitoring-001..023) remain `Resolved`
and were spot-verified against the current source: the HealthMonitoring-017/018
"snapshot-then-restore-on-failure" counter logic is in place
(`HealthReportSender.cs:158-176`, `CentralHealthReportLoop.cs:118-134`,
`SiteHealthCollector.AddIntervalCounters`), the HealthMonitoring-020 offline→online
`max(receivedAt, now)` anchor is correct (`CentralHealthAggregator.cs:138-149`),
and the HealthMonitoring-021 `$central` sentinel is honoured by the aggregator and
the Central UI. The full 10-category checklist produced **2 new findings, both Low,
none crash-class** — both confined to the new M6 `SiteHealthKpiSampleSource`
(KPI History sampling) added since the last baseline: the synthetic `$central`
self-report is sampled as a real `KpiScopes.Site` series with meaningless
zero connection/instance/S&F values (HealthMonitoring-024), and 8 of the 12
emitted Site Health metrics are persisted every minute per site but never read by
any chart while the design doc claims all 12 are "rendered in the dashboard's
per-site KpiTrendChart panel" (HealthMonitoring-025). The KPI-recorder
scope/captive-dependency reasoning is sound (recorder opens a per-pass scope and
the scoped source over a singleton aggregator is the normal lifetime relationship),
the per-source fault isolation is correct, and the `AddIntervalCounters` /
`SetSiteEventLogWriteFailures` default-interface no-ops are an appropriate
test-fake-compatibility seam. No concurrency, security, or sequence-arithmetic
regressions were found.

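The "snapshot-then-restore-on-failure" pattern referenced above can be illustrated with a minimal sketch. This is not the module's code: `IntervalCounter`, `TakeSnapshot`, `Restore`, and `TrySend` are hypothetical names chosen for illustration, and the only facts taken from the review are that the snapshot uses `Interlocked.Exchange` and the restore uses `Interlocked.Add`, so a restored value sums with increments that arrived in between.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for one interval counter; not the module's actual type.
public sealed class IntervalCounter
{
    private long _scriptErrors;

    public void Increment() => Interlocked.Increment(ref _scriptErrors);

    // Snapshot: atomically read the value and leave zero behind.
    public long TakeSnapshot() => Interlocked.Exchange(ref _scriptErrors, 0);

    // Restore: add the unsent snapshot back, so it sums with any
    // increments recorded after the snapshot was taken.
    public void Restore(long snapshot) => Interlocked.Add(ref _scriptErrors, snapshot);

    public long Current => Interlocked.Read(ref _scriptErrors);
}

public static class SnapshotRestoreDemo
{
    // Hypothetical send step; returns false to simulate a transport failure.
    private static bool TrySend(long value, bool succeed) => succeed;

    public static void Main()
    {
        var counter = new IntervalCounter();
        counter.Increment();
        counter.Increment();

        long snapshot = counter.TakeSnapshot(); // snapshot = 2, counter = 0
        counter.Increment();                    // arrives while the send is in flight

        if (!TrySend(snapshot, succeed: false))
        {
            counter.Restore(snapshot);          // 1 + 2, nothing is lost
        }

        Console.WriteLine(counter.Current);     // prints 3
    }
}
```

Using `Add` rather than a plain write on the restore path is what keeps the in-flight increment from being overwritten.
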
_Re-review (2026-06-20, `4307c381`):_

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | x | Re-verified HealthMonitoring-017/018/020 fixes in place and correct. `SiteHealthKpiSampleSource.CollectAsync` arithmetic (connectionsUp/Down split, summed `sfBufferDepth`, null-`SiteAuditBacklog`→0) is correct. No new logic bug. |
| 2 | Akka.NET conventions | x | Module still contains no actors. `IHealthReportTransport.Send` is fire-and-forget (Tell semantics). KPI source is a plain `IKpiSampleSource` consumed by the KpiHistory recorder singleton (out of scope). No issues found. |
| 3 | Concurrency & thread safety | x | `SiteHealthCollector` counters use `Interlocked`; `AddIntervalCounters` restore is per-field `Interlocked.Add` and correctly sums with concurrent increments against the zero left by `CollectReport`'s `Exchange`. Aggregator CAS pattern unchanged. `SiteEventLogFailureCountReporter` uses a linked CTS + `Task.Run` loop with catch-all probe isolation — sound. No issues found. |
| 4 | Error handling & resilience | x | Counter-restore-on-failure (HM-017/018) confirmed. KPI source is exception-isolated by the recorder per-source. `SiteEventLogFailureCountReporter.SafeProbe` logs and continues. No issues found. |
| 5 | Security | x | No issues found. Numeric/string operational metrics only; no secrets, auth surface, or untrusted-input parsing. |
| 6 | Performance & resource management | x | `PeriodicTimer`/CTS disposed correctly. New: `SiteHealthKpiSampleSource` writes 12 Site-scoped rows per site per minute, of which 8 are never charted (HealthMonitoring-025) — bounded but wasteful; also emits a full row-set for the synthetic `$central` (HealthMonitoring-024). |
| 7 | Design-document adherence | x | Design doc line 119 claims all 12 emitted Site Health metrics are "rendered in the dashboard's per-site KpiTrendChart panel" — only 4 (`connectionsDown`/`scriptErrors`/`sfBufferDepth`/`deadLetters`) are actually charted (HealthMonitoring-025). Synthetic `$central` is sampled as a real Site scope (HealthMonitoring-024). |
| 8 | Code organization & conventions | x | KPI source's metric-name catalog is split: 4 charted metrics share `KpiMetrics.SiteHealth` (Commons), 8 are private literals — noted under HealthMonitoring-025. Options/validator ownership and idempotent registration correct. `AddCentralHealthAggregation` registers the scoped KPI source via `TryAddEnumerable` — correct. |
| 9 | Testing coverage | x | `SiteHealthKpiSampleSourceTests` covers the populated/heartbeat-only/null-backlog/empty paths and exact (metric,value) tuples well. No test asserts `$central` is excluded (it currently is not — HealthMonitoring-024). Otherwise coverage is strong (73 tests). |
| 10 | Documentation & comments | x | XML docs accurate. The only doc drift is design-doc line 119's "all 12 rendered" claim (HealthMonitoring-025). |

## Checklist coverage

| # | Category | Examined | Notes |
@@ -1153,3 +1192,110 @@ a non-bug.
Rename to `StoreAndForwardBufferDepths_DefaultsToEmpty_WhenSetterNotCalled`
(or similar) and update the test body's intent — purely a documentation /
maintainability fix; no behaviour change.

### HealthMonitoring-024 — `SiteHealthKpiSampleSource` samples the synthetic `$central` self-report as a real `KpiScopes.Site` series

| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/Kpi/SiteHealthKpiSampleSource.cs:69-86`, `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/CentralHealthReportLoop.cs:115-120` |

**Description**

`SiteHealthKpiSampleSource.CollectAsync` iterates
`ICentralHealthAggregator.GetAllSiteStates()` and emits the 12-metric catalog
for **every** entry whose `LatestReport` is non-null. The aggregator keyspace
includes the synthetic central self-report under `CentralHealthReportLoop.CentralSiteId`
(`"$central"`), which `CentralHealthReportLoop` feeds via `ProcessReport` every
30s — so `$central` always has a non-null `LatestReport` once the loop has run.
The source therefore writes 12 `KpiSample` rows per minute with
`Scope = KpiScopes.Site`, `ScopeKey = "$central"` for the central cluster.

The class XML doc, the design doc (`Component-HealthMonitoring.md:119`,
"emits `IKpiSampleSource` (`SiteHealthKpiSampleSource`, per-Site)"), and the
`KpiScopes.Site` scope all describe this as a **per-real-site** series.
`$central` is not a site — its `DataConnectionStatuses`, `StoreAndForwardBufferDepths`,
`ParkedMessageCount`, and instance counts are structurally empty/zero
(the central node runs no DCL, no S&F engine, no instances), so the persisted
`connectionsUp`/`connectionsDown`/`sfBufferDepth`/`parkedMessages`/`deployedInstances`
trend rows for `$central` are meaningless constant-zero noise. The Central UI
trend selector (`Health.razor:570-572`) deliberately pins `$central` first and
queries `KpiScopes.Site`/`scopeKey="$central"`, so the dashboard will plot these
flat-zero "Central Cluster" trends.

This is consistent (the UI renders central as a card by design) so it is not a
behaviour bug — but it conflates a synthetic non-site with real-site KPI history,
permanently stores meaningless zero series, and contradicts the "per-Site"
contract. None of the other M6 sample sources sample a synthetic scope key.

**Recommendation**

Decide whether central-cluster trends belong in the per-Site KPI store. If not,
skip the `CentralHealthReportLoop.CentralSiteId` entry in `CollectAsync` (a single
`if (siteId == CentralHealthReportLoop.CentralSiteId) continue;` guard) and remove
the `$central` option from the trend selector. If central trends are intended,
give them a distinct scope (e.g. a `KpiScopes.Central` scope or a central-specific
metric subset) so the data is not labelled as a real site, and update the design
doc + class XML doc to say so. Either way add a test asserting the chosen
behaviour for `$central`.

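For illustration only, the skip-guard option in the recommendation could look roughly like the sketch below. It deliberately reduces the aggregator state to a hypothetical `siteId -> hasLatestReport` map; `CentralGuardSketch` and `SitesToSample` are invented names, and only the `"$central"` sentinel value and the shape of the `continue` guard come from the finding.

```csharp
using System;
using System.Collections.Generic;

public static class CentralGuardSketch
{
    // Mirrors the "$central" sentinel value quoted in the finding.
    public const string CentralSiteId = "$central";

    // Hypothetical simplification: site id -> whether a latest report exists.
    public static IEnumerable<string> SitesToSample(
        IReadOnlyDictionary<string, bool> hasLatestReport)
    {
        foreach (var (siteId, hasReport) in hasLatestReport)
        {
            if (siteId == CentralSiteId) continue; // skip the synthetic self-report
            if (!hasReport) continue;              // nothing to sample yet
            yield return siteId;
        }
    }

    public static void Main()
    {
        var states = new Dictionary<string, bool>
        {
            ["site-a"] = true,
            ["$central"] = true,
            ["site-b"] = false,
        };

        foreach (var id in SitesToSample(states))
        {
            Console.WriteLine(id); // prints only "site-a"
        }
    }
}
```

A test for the chosen behaviour would assert on exactly this kind of output: either `$central` never appears among the sampled keys, or it appears under a distinct scope.
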
**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): the design doc now documents that the synthetic `$central` self-report is intentionally sampled into the per-site KPI store (the Central UI deliberately pins `$central` first in the trend selector as 'Central Cluster'), and that its zero connection/instance/S&F values reflect the self-report carrying no site-runtime data — not a defect. Doc-only.

### HealthMonitoring-025 — 8 of the 12 emitted Site Health KPI metrics are never charted; design doc claims all 12 are rendered

| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/Kpi/SiteHealthKpiSampleSource.cs:40-51`, `docs/requirements/Component-HealthMonitoring.md:119` |

**Description**

`SiteHealthKpiSampleSource` emits 12 metrics per site per sampling pass:
`connectionsUp`, `connectionsDown`, `scriptErrors`, `alarmEvalErrors`,
`sfBufferDepth`, `deadLetters`, `parkedMessages`, `deployedInstances`,
`enabledInstances`, `disabledInstances`, `auditBacklogPending`,
`eventLogWriteFailures`. Only **four** are ever read back: the Central UI
per-site trend panel calls `LoadTrendSeriesAsync` for exactly
`KpiMetrics.SiteHealth.ConnectionsDown`, `.ScriptErrors`, `.SfBufferDepth`, and
`.DeadLetters` (`Health.razor:596-603`) — and only those four are in the public
`KpiMetrics.SiteHealth` catalog (`Commons/Types/Kpi/KpiMetrics.cs:84-97`). The
other 8 metric names are private string literals in the source and have no
reader anywhere in the codebase.

Two consequences:

1. **Stale design doc.** `Component-HealthMonitoring.md:119` lists all 12 metric
names and states they are "rendered in the dashboard's per-site
`KpiTrendChart` panel." Eight of them are never rendered. A reader / future
maintainer is misled about what the trend panel shows.
2. **Persisted-but-dead samples.** The `KpiHistoryRecorderActor` writes all 12
rows to the central `KpiSample` table every minute for every reporting site;
8 are pure write-amplification that nothing queries and that the 90-day
purge eventually drops. Bounded, but wasteful: two-thirds (8 of 12) of the
Site Health KPI rows are never read.

A related maintainability smell: the split catalog (4 metrics shared via
`KpiMetrics.SiteHealth`, 8 private literals) means a future chart added for, say,
`alarmEvalErrors` would have to re-type the literal and risk a silent typo
mismatch against the source.

**Recommendation**

Reconcile intent in one of two directions: (a) if only the four charted metrics
are wanted, stop emitting the other eight (and correct the design doc's list to
the four actually rendered); or (b) if the full 12-metric history is intended for
future charts / ad-hoc query, promote all 12 names into the public
`KpiMetrics.SiteHealth` catalog so source and any future UI binding key off one
symbol, and either add the missing charts or soften the doc's "rendered … panel"
wording to "recorded for trend history." Add the design-doc fix in the same
session per the repo's doc-and-code-travel-together rule.

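As a rough sketch of option (b), a single shared catalog might be shaped like the class below. The class name `SiteHealthMetricCatalogSketch` and the `All` list are hypothetical, as are the constant names for the eight currently private metrics; the finding only confirms that `ConnectionsDown`, `ScriptErrors`, `SfBufferDepth`, and `DeadLetters` exist on `KpiMetrics.SiteHealth`. The twelve string values are the metric names listed in the description.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical single catalog so the sample source and any chart
// reference one symbol per metric instead of re-typing a literal.
public static class SiteHealthMetricCatalogSketch
{
    public const string ConnectionsUp = "connectionsUp";
    public const string ConnectionsDown = "connectionsDown";
    public const string ScriptErrors = "scriptErrors";
    public const string AlarmEvalErrors = "alarmEvalErrors";
    public const string SfBufferDepth = "sfBufferDepth";
    public const string DeadLetters = "deadLetters";
    public const string ParkedMessages = "parkedMessages";
    public const string DeployedInstances = "deployedInstances";
    public const string EnabledInstances = "enabledInstances";
    public const string DisabledInstances = "disabledInstances";
    public const string AuditBacklogPending = "auditBacklogPending";
    public const string EventLogWriteFailures = "eventLogWriteFailures";

    // Enumerable form, useful for a test that the source emits exactly these names.
    public static readonly IReadOnlyList<string> All = new[]
    {
        ConnectionsUp, ConnectionsDown, ScriptErrors, AlarmEvalErrors,
        SfBufferDepth, DeadLetters, ParkedMessages, DeployedInstances,
        EnabledInstances, DisabledInstances, AuditBacklogPending,
        EventLogWriteFailures,
    };

    public static void Main()
    {
        Console.WriteLine(All.Count); // prints 12
    }
}
```

With one catalog, a future chart for `alarmEvalErrors` would reference the constant, and a typo becomes a compile error rather than a silently empty series.
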
**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): corrected the design doc — of the 12 sampled Site Health KPI metrics only four are charted (connectionsDown, scriptErrors, sfBufferDepth, deadLetters); the other eight are now documented as sampled-but-not-charted (retained for future surfaces / ad-hoc query). Doc-only.