Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision. Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites; DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total).
Code Review — KpiHistory
| Field | Value |
|---|---|
| Module | src/ZB.MOM.WW.ScadaBridge.KpiHistory |
| Design doc | docs/requirements/Component-KpiHistory.md |
| Status | Reviewed |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | 4307c381 |
| Open findings | 0 |
Summary
KpiHistory is a small, well-built observability module: a single ~305-line recorder
singleton (KpiHistoryRecorderActor), a strongly-typed options class with a fail-fast
validator, and a thin DI composition root. The owned code is clean — the actor is
textbook best-effort: every sample pass and purge sweep runs off the actor thread via
PipeTo, per-source faults are isolated so one throwing IKpiSampleSource never aborts
a tick or the other sources, the repository write/purge is guarded, no exception escapes
either tick handler, and a lifecycle CancellationTokenSource is cancelled in PostStop
so an in-flight pass observes shutdown promptly. The singleton is wired correctly in the
Host (ClusterSingletonManager + proxy + PhaseClusterLeave drain) and is deliberately
absent from the readiness barrier, exactly as the design requires. The options validator,
the EAV table mapping, both named indexes, and all four IKpiSampleSource implementations
exist and are registered on the central host as designed.
The dominant theme is unbounded work under load, in two places. First, the recorder
has no in-flight guard on its sample timer — directly contradicting its own XML-doc
claim to "mirror the NotificationOutboxActor timer + scope-per-tick + PipeTo pattern,"
because the NotificationOutbox dispatcher does hold an in-flight guard and the recorder
does not. When a sample pass runs longer than SampleInterval (slow/recovering DB), Akka
periodic timers enqueue, not coalesce, so overlapping RunSamplePass tasks pile up,
multiplying DB load at exactly the moment the store is struggling and double-writing
samples for overlapping windows. Second, GetRawSeriesAsync has no server-side row
cap: the design's DefaultMaxSeriesPoints ceiling is applied by KpiSeriesBucketer
only after the full raw window is materialised into memory — a 7-day window at the
default 60 s cadence is ~10 080 rows per series pulled to the Central UI before
downsampling, defeating the stated intent of the cap. A secondary theme is bucketer
contract drift (the "largest-timestamp-wins for unsorted input" doc claim is not what
the code does, and the short-series early-return emits raw capture timestamps where the
downsample path emits bucket-boundary timestamps) — both live in Commons but are core to
this module's query reducer. No Critical findings; one High, two Medium, three Low.
Checklist coverage
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | Yes | capturedAt/cut-off captured on the actor thread (correct). KpiSeriesBucketer short-series early-return emits raw capture timestamps vs bucket-boundary timestamps on the downsample path (KpiHistory-005). Bucketer "largest-timestamp-wins for unsorted input" doc claim is false — it is last-in-iteration (KpiHistory-006). |
| 2 | Akka.NET conventions | Yes | PipeTo + scope-per-tick + off-thread I/O + IWithTimers all correct; sender not captured across awaits; messages immutable singletons. But no in-flight guard despite the XML claiming to mirror NotificationOutbox (KpiHistory-001). |
| 3 | Concurrency & thread safety | Yes | Actor state is effectively stateless; _shutdownCts lifecycle is correct. The missing in-flight guard allows overlapping sample passes under DB latency (KpiHistory-001). |
| 4 | Error handling & resilience | Yes | Per-source isolation, write/purge guards, and OperationCanceledException shutdown handling are all correct and tested. No issues beyond KpiHistory-001's load amplification. |
| 5 | Security | Yes | No injection surface — all queries are parameterised LINQ; Source/Metric/Scope/ScopeKey are equality predicates, never interpolated SQL. Scope isolation via ScopeKey == scopeKey (incl. IS NULL for Global) is correct. No issues found. |
| 6 | Performance & resource management | Yes | GetRawSeriesAsync returns the entire window with no server-side cap before bucketing (KpiHistory-002). RecordSamplesAsync short-circuits empty batches (good). Purge is set-based ExecuteDeleteAsync but is a single unbatched statement (KpiHistory-004). |
| 7 | Design-document adherence | Yes | The recorder XML claims to mirror NotificationOutbox's pattern but omits its in-flight guard (KpiHistory-001). DefaultMaxSeriesPoints-as-a-true-cap intent is undermined by KpiHistory-002. Otherwise faithful: singleton, not-readiness-gated, daily purge, EAV schema, indexes. |
| 8 | Code organization & conventions | Yes | Options class owned by the component (correct); validator co-located; IKpiHistoryRepository/IKpiSampleSource/KpiSample/bucketer in Commons, impl in ConfigurationDatabase (correct); singleton Props built in Host. No issues found. |
| 9 | Testing coverage | Yes | Recorder isolation, faulted-tick recovery, and purge cut-off are tested; validator bounds fully covered; bucketer has strong sorted-input coverage. Gaps: no overlapping-tick test, no unsorted-input bucketer test, no short-series timestamp-semantics assertion (KpiHistory-003). |
| 10 | Documentation & comments | Yes | XML is generally excellent. Two drift points: the "mirror NotificationOutbox" claim (KpiHistory-001) and the bucketer unsorted-input claim (KpiHistory-006). SampleComplete/PurgeComplete no-op handlers are documented. |
Findings
KpiHistory-001 — Recorder has no in-flight guard; overlapping sample passes pile up under DB latency
| Field | Value |
|---|---|
| Severity | High |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.KpiHistory/KpiHistoryRecorderActor.cs:89-92, :103-107, :143-159 |
Description
The sample timer is a plain Akka periodic timer
(Timers.StartPeriodicTimer(SampleTimerKey, SampleTick.Instance, …, interval: _options.SampleInterval)).
HandleSampleTick launches RunSamplePass(...) off-thread via PipeTo and returns
immediately; the piped-back SampleComplete is a deliberate no-op
(Receive<SampleComplete>(_ => { })). There is no in-flight guard — nothing prevents
a second SampleTick from starting a second RunSamplePass while the first is still
awaiting its DB round-trip.
Akka periodic timers do not coalesce missed ticks — they enqueue. So when a sample pass
takes longer than SampleInterval (a slow, contended, or recovering central MS SQL —
exactly the regime where observability matters most), each subsequent tick spawns another
concurrent pass. Each pass opens its own DI scope + DbContext and issues its own
AddRange + SaveChangesAsync. The result is a self-amplifying load spiral against the
struggling store, plus duplicate sample rows whose CapturedAtUtc values straddle the same
real-time window (the design's "one shared tick timestamp" invariant assumes one pass per
tick).
The actor's own XML-doc states it "mirrors the NotificationOutboxActor timer +
scope-per-tick + PipeTo pattern" (lines 37-39) — but NotificationOutboxActor holds an
explicit in-flight boolean cleared on DispatchComplete precisely to serialise sweeps.
This recorder dropped that half of the pattern. The piped-back SampleComplete message is
the natural place the guard would be lowered, and its current empty body is the tell that
the guard was intended but omitted.
Recommendation
Add an in-flight guard: set a _sampleInFlight flag at the top of HandleSampleTick, skip
(and log at debug) the tick if already set, and clear it in the SampleComplete handler
(both success and failure projections already route there). This matches the
NotificationOutbox pattern the doc claims to follow and bounds the recorder to at most one
sample pass in flight at a time. Add a regression test that drives two SampleTicks before the first pass completes
(e.g. a repository whose RecordSamplesAsync blocks on a gate) and asserts only one pass
ran. The purge tick is daily + idempotent so a guard there is optional, but consider the
same treatment for symmetry.
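A minimal sketch of the guard described above, written as a Python model rather than the C# actor. The class, its method names, and the tick-based simulation are illustrative assumptions; only the idea of a `_sampleInFlight` flag that is set on a tick and cleared on `SampleComplete` comes from the recommendation.

```python
class RecorderModel:
    """Toy model of the recorder's tick handling (hypothetical, not the actor)."""

    def __init__(self, guarded: bool):
        self.guarded = guarded
        self.in_flight = False      # models the proposed _sampleInFlight flag
        self.running = 0            # passes currently awaiting the store
        self.max_concurrent = 0
        self.started = 0
        self.skipped = 0

    def on_sample_tick(self) -> bool:
        """Returns True when a new sample pass was started."""
        if self.guarded and self.in_flight:
            self.skipped += 1       # the real actor would log this at debug
            return False
        self.in_flight = True
        self.started += 1
        self.running += 1
        self.max_concurrent = max(self.max_concurrent, self.running)
        return True

    def on_sample_complete(self) -> None:
        """Piped back on both success and failure, so the flag always clears."""
        self.running -= 1
        self.in_flight = False


def simulate(guarded: bool, ticks: int, pass_duration_ticks: int) -> RecorderModel:
    """Fire one tick per interval; each started pass lasts pass_duration_ticks."""
    model = RecorderModel(guarded)
    finish_at = []                  # tick indices at which running passes end
    for t in range(ticks):
        due = [f for f in finish_at if f <= t]
        finish_at = [f for f in finish_at if f > t]
        for _ in due:
            model.on_sample_complete()
        if model.on_sample_tick():
            finish_at.append(t + pass_duration_ticks)
    return model


unguarded = simulate(guarded=False, ticks=6, pass_duration_ticks=3)
guarded = simulate(guarded=True, ticks=6, pass_duration_ticks=3)
```

With a pass that lasts three intervals, the unguarded model reaches three concurrent passes, while the guarded model never exceeds one and skips the ticks that arrive mid-pass.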
Resolution
Resolved 2026-06-20 (commit fd618cf1): added a _sampleInFlight guard to the recorder — HandleSampleTick skips (debug-logs) if a pass is in flight; the flag is cleared via a SampleComplete message piped on BOTH success and failure, so a faulted source can't wedge the guard. Overlapping-tick regression test added.
KpiHistory-002 — GetRawSeriesAsync materialises the entire window with no server-side cap; DefaultMaxSeriesPoints is applied only after the fact
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Deferred |
| Location | src/ZB.MOM.WW.ScadaBridge.Commons/Interfaces/Repositories/IKpiHistoryRepository.cs:39-41 (contract); src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/KpiHistoryRepository.cs:45-62 (impl); consumed by src/ZB.MOM.WW.ScadaBridge.CentralUI/Services/KpiHistoryQueryService.cs:77-93 |
Description
GetRawSeriesAsync has no limit/maxPoints/Take parameter. The implementation issues a
Where(...).OrderBy(CapturedAtUtc).Select(...).ToListAsync() that pulls every sample in
[fromUtc, toUtc] for the series into memory. The DefaultMaxSeriesPoints ceiling
(default 200, design-described as the cap that prevents "a single trend query [from
streaming] an arbitrarily large series to the Central UI" — KpiHistoryOptionsValidator
XML) is enforced by KpiSeriesBucketer.Bucket after the full raw set has already been
fetched across the wire and allocated as a List<KpiSeriesPoint> in the query service.
At the default 60 s sample cadence, a 24 h window is ~1 440 rows/series, a 7 d window is
~10 080 rows/series, and a 30 d window ~43 200 rows/series — per chart, with up to four
trend panels on a page each issuing its own query. The cap that the design relies on for
back-pressure provides none at the data tier; it only trims the in-memory result the UI
binds to. The IX_KpiSample_Series index makes the range scan efficient, but the row
count returned is still unbounded by anything except the window the parent page chooses.
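The row counts quoted above follow directly from window length divided by sample cadence. The helper below is a small illustrative calculation (the function names are hypothetical); the 60 s cadence and the cap of 200 points are the defaults cited in this finding.

```python
from datetime import timedelta

DEFAULT_SAMPLE_INTERVAL = timedelta(seconds=60)   # default cadence cited above
DEFAULT_MAX_SERIES_POINTS = 200                   # default cap cited above


def raw_rows_per_series(window: timedelta,
                        interval: timedelta = DEFAULT_SAMPLE_INTERVAL) -> int:
    """Rows fetched for one series before any downsampling happens."""
    return window // interval


def overfetch_factor(window: timedelta) -> float:
    """How many raw rows are pulled per point the chart finally keeps."""
    return raw_rows_per_series(window) / DEFAULT_MAX_SERIES_POINTS


rows_24h = raw_rows_per_series(timedelta(days=1))
rows_7d = raw_rows_per_series(timedelta(days=7))
rows_30d = raw_rows_per_series(timedelta(days=30))
```

For a 30 d window this works out to 216 raw rows transferred for every point that survives bucketing.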
Recommendation
Push the downsampling toward the store, or at least bound the fetch. Cheapest correct fix:
add an optional int? maxRows to GetRawSeriesAsync and have the query service pass a
generous multiple of effectiveMax (e.g. effectiveMax * k) so the bucketer still has
enough density to pick representative last-values while the DB-side Take caps the
transfer. A more thorough fix is a server-side bucketed aggregation (GROUP BY a computed
bucket index, MAX(CapturedAtUtc) per bucket), but that is a larger change the design
explicitly deferred ("v1 ships exactly one aggregation"). At minimum, document the
unbounded-fetch behaviour and the practical window ceiling so an operator does not point a
30 d chart at a busy multi-site deployment.
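As a rough illustration of the server-side bucketed aggregation mentioned above, the sketch below uses SQLite from Python. The table name, column names, integer-second timestamps, and the bucket-width rule are all assumptions made for the example; the production store is MS SQL and the real schema is not reproduced here.

```python
import sqlite3


def bucketed_last_values(conn, lo, hi, max_points):
    """Return the last sample per bucket, aggregated inside the database.

    A sample captured exactly at `hi` would land in one extra bucket, so the
    result can hold up to max_points + 1 rows. Timestamps returned are the raw
    capture times of the selected samples, not bucket starts.
    """
    width = max(1, (hi - lo) // max_points)       # assumed bucket-width rule
    sql = """
        SELECT s.captured_at, s.value
        FROM kpi_sample AS s
        JOIN (
            SELECT (captured_at - :lo) / :width AS bucket,
                   MAX(captured_at) AS last_at
            FROM kpi_sample
            WHERE captured_at >= :lo AND captured_at <= :hi
            GROUP BY bucket
        ) AS g ON s.captured_at = g.last_at
        ORDER BY s.captured_at
    """
    return conn.execute(sql, {"lo": lo, "hi": hi, "width": width}).fetchall()


# Demo: one series sampled every 60 s for 7 days (10 080 rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kpi_sample (captured_at INTEGER PRIMARY KEY, value REAL)")
week = 7 * 86400
conn.executemany("INSERT INTO kpi_sample VALUES (?, ?)",
                 [(t, float(t)) for t in range(0, week, 60)])
rows = bucketed_last_values(conn, 0, week, 200)
```

In this demo the query hands back 200 rows instead of 10 080, so the reduction happens before anything crosses the wire. Whether the same shape is acceptable on the shared central store is the design decision the finding leaves open.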
Resolution
Deferred 2026-06-20: bounding the raw window fetch (cheap Take-cap vs. server-side bucketed aggregation) is a design decision and the code lives in ConfigurationDatabase/Commons, not this project. Recorded as a query-path enhancement; the practical window ceiling should be documented when addressed.
KpiHistory-003 — Missing tests: overlapping ticks, unsorted-input bucketing, short-series timestamp semantics
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | tests/ZB.MOM.WW.ScadaBridge.KpiHistory.Tests/KpiHistoryRecorderActorTests.cs; tests/ZB.MOM.WW.ScadaBridge.Commons.Tests/Kpi/KpiSeriesBucketerTests.cs |
Description
The recorder and bucketer tests are otherwise strong (per-source isolation, faulted-tick recovery, purge cut-off, validator bounds, sorted-input bucketing, right-edge handling, empty-bucket omission, out-of-window filtering). Three behaviour gaps remain, each tied to a finding here:
- No overlapping-tick test (KpiHistory-001). Every recorder test sends a single tick and awaits its effect; nothing exercises a second SampleTick arriving while the first pass is still in flight, so the missing in-flight guard is invisible to the suite. A gated-repository test (block RecordSamplesAsync, send two ticks, count passes) would pin the intended one-pass-per-tick behaviour.
- No unsorted-input bucketer test (KpiHistory-006). All bucketer tests pass ascending input, so the doc's "largest-timestamp-wins for unsorted input" claim is never checked, and it is in fact wrong.
- No short-series timestamp-semantics assertion (KpiHistory-005). Bucket_BucketStartUtc_IsSetToBucketStart… covers only the downsample path; no test asserts what BucketStartUtc the early-return (raw.Count <= maxPoints) path emits, so the inconsistency between the two paths is untested.
Recommendation
Add the three tests above. The overlapping-tick test belongs in
KpiHistoryRecorderActorTests (it is a recorder behaviour); the two bucketer tests belong
in KpiSeriesBucketerTests.
Resolution
Resolved 2026-06-20 (commit fd618cf1): added the overlapping-tick gated-repository test plus unsorted-input and short-series bucketer tests (the latter pin the documented short-series behaviour).
KpiHistory-004 — Retention purge is a single unbatched ExecuteDeleteAsync; a large backlog deletes in one transaction
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Status | Deferred |
| Location | src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/KpiHistoryRepository.cs:65-71 |
Description
PurgeOlderThanAsync runs one set-based Where(s => s.CapturedAtUtc < before).ExecuteDeleteAsync(...).
For the steady-state daily cadence this deletes one day of expired rows and is fine. But if
the purge has not run for an extended period (singleton down across a long failover window,
PurgeInterval mis-set and later corrected, or RetentionDays shortened), a single
unbounded DELETE can touch a very large row count in one transaction — lock escalation on
KpiSample, a long-running transaction, and transaction-log growth, which on the shared
central MS SQL can affect operational tables. The Audit Log purge path, by contrast, uses a
bounded batched delete for exactly this reason.
This is observability data on a non-partitioned [PRIMARY] table, so the blast radius is
bounded and the severity is Low — but the unbatched delete is a latent operational hazard
on the shared store.
Recommendation
Loop a bounded delete (DELETE TOP (N) … WHERE CapturedAtUtc < @before, or EF Core's
batching) until zero rows are affected, mirroring the Audit Log purge shape. Keep the
returned total for the existing log line.
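The loop-until-zero shape can be sketched as follows, using SQLite from Python purely for illustration. The table and column names and the batch size are hypothetical, and SQLite's `LIMIT` subquery stands in for the `DELETE TOP (N)` form that MS SQL would use.

```python
import sqlite3


def purge_older_than(conn, before, batch_size=1000):
    """Delete expired rows in bounded batches; return the total removed."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM kpi_sample WHERE rowid IN ("
            "  SELECT rowid FROM kpi_sample WHERE captured_at < ? LIMIT ?)",
            (before, batch_size),
        )
        conn.commit()               # each batch is its own short transaction
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


# Demo: 2 500 expired rows plus 10 rows still inside the retention window.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kpi_sample (captured_at INTEGER, value REAL)")
conn.executemany("INSERT INTO kpi_sample VALUES (?, ?)",
                 [(t, 0.0) for t in range(2500)] +
                 [(t, 0.0) for t in range(10000, 10010)])
conn.commit()
deleted = purge_older_than(conn, before=5000, batch_size=1000)
remaining = conn.execute("SELECT COUNT(*) FROM kpi_sample").fetchone()[0]
```

Each iteration touches at most `batch_size` rows, so a large backlog is worked off as a series of short transactions while the caller still receives one total for the log line.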
Resolution
Deferred 2026-06-20: batching the retention purge is a shared-MS-SQL-store tradeoff (vs. the Audit Log's batched-delete precedent); low severity for non-partitioned observability data. Recorded as a future enhancement.
KpiHistory-005 — KpiSeriesBucketer short-series early-return emits raw capture timestamps where the downsample path emits bucket-boundary timestamps
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Deferred |
| Location | src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiSeriesBucketer.cs:57-58 (early return) vs :93-94 (downsample) |
Description
KpiSeriesBucketer.Bucket has two return paths that disagree on what BucketStartUtc
means. When raw.Count <= maxPoints it returns raw unchanged — those points carry the
raw CapturedAtUtc as their BucketStartUtc (the repository builds
new KpiSeriesPoint(s.CapturedAtUtc, s.Value)). When raw.Count > maxPoints it returns
points whose BucketStartUtc is the bucket boundary
(fromUtc + bucketIndex * bucketWidthTicks), as the dedicated test
Bucket_BucketStartUtc_IsSetToBucketStartNotRawPointTimestamp asserts.
So the same series, charted at a density below vs above maxPoints, plots its points at
different x-positions: actual capture instants in the sparse case, evenly-spaced bucket
starts in the dense case. The downstream KpiTrendChart normalises X across
[min(BucketStartUtc), max(BucketStartUtc)], so the visual impact is minor (the time range
is essentially the same), but the contract is inconsistent and the x-axis "tick spacing"
subtly changes as a window crosses the cap. This is the bucketer that the KpiHistory design
defines as the module's query reducer, so the inconsistency is in-scope even though the file
lives in Commons.
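The two return paths can be modelled in a few lines of Python. This is a simplified stand-in for `KpiSeriesBucketer.Bucket`, not its actual code: timestamps are plain integers, input is assumed ascending, and the bucket-width rule is an assumption for the example.

```python
def bucket(raw, from_utc, to_utc, max_points):
    """raw: list of (timestamp, value) pairs in ascending timestamp order."""
    if len(raw) <= max_points:
        # Short-series early return: points keep their raw capture timestamps.
        return list(raw)
    width = max(1, (to_utc - from_utc) // max_points)   # assumed width rule
    last = {}
    for ts, value in raw:
        i = min((ts - from_utc) // width, max_points - 1)
        # Downsample path: the emitted timestamp is the bucket boundary.
        last[i] = (from_utc + i * width, value)
    return [last[i] for i in sorted(last)]


samples = [(7, 1.0), (33, 2.0), (61, 3.0), (95, 4.0)]
sparse = bucket(samples, 0, 100, max_points=4)   # at or under the cap
dense = bucket(samples, 0, 100, max_points=2)    # over the cap
```

Here `sparse` carries the capture instants 7, 33, 61 and 95, while `dense` carries the bucket starts 0 and 50, which is the x-position shift the finding describes.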
Recommendation
Make the two paths agree. Either document the difference explicitly on Bucket (the
short-series path returns raw capture instants; the downsample path returns bucket starts),
or — cleaner — have the short-series path also project onto a consistent timestamp basis if
exact bucket-start semantics are part of the contract. Add the short-series timestamp test
from KpiHistory-003.
Resolution
Deferred 2026-06-20: whether the bucketer's short-series early-return and the downsample path must agree on BucketStartUtc semantics (vs. documenting the difference) is a contract decision for the design owner. The doc was corrected (KpiHistory-006) to describe current behaviour; the contract change is deferred.
KpiHistory-006 — KpiSeriesBucketer doc claims "largest timestamp within each bucket is selected" for unsorted input; the code selects last-in-iteration
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiSeriesBucketer.cs:20-21 (doc), :88-97 (code) |
Description
The raw param doc states: "If not sorted, the point with the largest timestamp within
each bucket is selected." The code does not do this. When a point is stored into a bucket,
best[bucketIndex].BucketStartUtc is set to the bucket-start timestamp
(fromUtc + bucketIndex * bucketWidthTicks), not the raw point's timestamp. The
last-value comparison for a subsequent point in the same bucket is then
point.BucketStartUtc > best[bucketIndex].BucketStartUtc — i.e. it compares the new raw
point's capture time against the bucket start, which (for any in-bucket point) is almost
always true. The effect is that each later-in-iteration point overwrites the previous one
regardless of their relative timestamps, so the last point in iteration order wins, not
the point with the largest timestamp.
For the production caller this is harmless: KpiHistoryRepository.GetRawSeriesAsync always
OrderBy(CapturedAtUtc), so iteration order equals time order and last-in-iteration is the
largest timestamp. But the documented contract for unsorted input is simply false, and the
"if ties, keep first encountered — stable" comment (line 89) is also inaccurate — the
overwrite also fires when two in-bucket points carry equal capture timestamps, because each is compared against the stored bucket start rather than against the other. A future caller that
trusts the unsorted-input guarantee will get wrong results silently.
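The comparison described above can be reproduced with a small Python model of the downsample loop. The names, integer timestamps, and bucket-width rule are assumptions for illustration; the `compare_raw_time` switch is a hypothetical way of showing what option (a) in the recommendation below would change.

```python
from typing import NamedTuple


class Point(NamedTuple):
    bucket_start_utc: int   # for raw input this holds the capture time
    value: float


def downsample(raw, from_utc, to_utc, max_points, compare_raw_time=False):
    """Model of the last-value-per-bucket loop (early return omitted)."""
    width = max(1, (to_utc - from_utc) // max_points)   # assumed width rule
    best = [None] * max_points
    best_raw_time = [None] * max_points
    for p in raw:
        if not from_utc <= p.bucket_start_utc <= to_utc:
            continue
        i = min((p.bucket_start_utc - from_utc) // width, max_points - 1)
        if best[i] is None:
            wins = True
        elif compare_raw_time:
            # Variant: compare against the selected point's own capture time.
            wins = p.bucket_start_utc > best_raw_time[i]
        else:
            # As described in the finding: compare against the bucket start.
            wins = p.bucket_start_utc > best[i].bucket_start_utc
        if wins:
            best[i] = Point(from_utc + i * width, p.value)
            best_raw_time[i] = p.bucket_start_utc
    return [b for b in best if b is not None]


unsorted_raw = [Point(40, 2.0), Point(10, 1.0), Point(70, 3.0)]
as_described = downsample(unsorted_raw, 0, 100, max_points=2)
with_raw_compare = downsample(unsorted_raw, 0, 100, max_points=2,
                              compare_raw_time=True)
```

In the first bucket the model as described keeps value 1.0 (captured at 10, last in iteration), whereas the variant keeps 2.0 (captured at 40, the largest timestamp).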
Recommendation
Either (a) fix the comparison to track the selected raw point's actual timestamp (store the
raw point.BucketStartUtc alongside the emitted value and compare against that), making the
"largest timestamp wins" claim true for unsorted input; or (b) tighten the doc to state the
method requires ascending-sorted input and selects last-in-iteration otherwise, and drop the
inaccurate "largest timestamp" / "stable ties" language. Pair with the unsorted-input test
from KpiHistory-003.
Resolution
Resolved 2026-06-20 (commit fd618cf1): corrected the KpiSeriesBucketer param XML doc — dropped the false 'largest-timestamp-wins / stable ties' claim; now states it requires ascending-sorted input and selects last-in-iteration otherwise. Doc-only.