Joseph Doherty d39089f4ed docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28)
plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381.

67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit
fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix
with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings
(IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to
sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024),
an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001),
and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary
is semantically sound (symbol-based) in the production cluster config.

README regenerated; regen-readme.py --check passes (4 pending / 567 total).

2026-06-20 18:02:32 -04:00


Code Review — KpiHistory

| Field | Value |
|---|---|
| Module | `src/ZB.MOM.WW.ScadaBridge.KpiHistory` |
| Design doc | `docs/requirements/Component-KpiHistory.md` |
| Status | Reviewed |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `4307c381` |
| Open findings | 0 |

Summary

KpiHistory is a small, well-built observability module: a single ~305-line recorder singleton (KpiHistoryRecorderActor), a strongly-typed options class with a fail-fast validator, and a thin DI composition root. The owned code is clean — the actor is textbook best-effort: every sample pass and purge sweep runs off the actor thread via PipeTo, per-source faults are isolated so one throwing IKpiSampleSource never aborts a tick or the other sources, the repository write/purge is guarded, no exception escapes either tick handler, and a lifecycle CancellationTokenSource is cancelled in PostStop so an in-flight pass observes shutdown promptly. The singleton is wired correctly in the Host (ClusterSingletonManager + proxy + PhaseClusterLeave drain) and is deliberately absent from the readiness barrier, exactly as the design requires. The options validator, the EAV table mapping, both named indexes, and all four IKpiSampleSource implementations exist and are registered on the central host as designed.

The dominant theme is unbounded work under load, in two places. First, the recorder has no in-flight guard on its sample timer. This directly contradicts its own XML-doc claim to "mirror the NotificationOutboxActor timer + scope-per-tick + PipeTo pattern," because the NotificationOutbox dispatcher does hold an in-flight guard and the recorder does not. When a sample pass runs longer than SampleInterval (slow/recovering DB), Akka periodic timers enqueue missed ticks rather than coalescing them, so overlapping RunSamplePass tasks pile up, multiplying DB load at exactly the moment the store is struggling and double-writing samples for overlapping windows. Second, GetRawSeriesAsync has no server-side row cap: the design's DefaultMaxSeriesPoints ceiling is applied by KpiSeriesBucketer only after the full raw window is materialised into memory. A 7-day window at the default 60 s cadence is ~10 080 rows per series pulled to the Central UI before downsampling, defeating the stated intent of the cap. A secondary theme is bucketer contract drift: the "largest-timestamp-wins for unsorted input" doc claim is not what the code does, and the short-series early-return emits raw capture timestamps where the downsample path emits bucket-boundary timestamps. Both issues live in Commons but are core to this module's query reducer. No Critical findings; one High, two Medium, three Low.

Checklist coverage

| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | Yes | `capturedAt`/cut-off captured on the actor thread (correct). `KpiSeriesBucketer` short-series early-return emits raw capture timestamps vs bucket-boundary timestamps on the downsample path (KpiHistory-005). Bucketer "largest-timestamp-wins for unsorted input" doc claim is false; the code selects last-in-iteration (KpiHistory-006). |
| 2 | Akka.NET conventions | Yes | `PipeTo` + scope-per-tick + off-thread I/O + `IWithTimers` all correct; sender not captured across awaits; messages immutable singletons. But no in-flight guard despite the XML claiming to mirror NotificationOutbox (KpiHistory-001). |
| 3 | Concurrency & thread safety | Yes | Actor state is effectively stateless; `_shutdownCts` lifecycle is correct. The missing in-flight guard allows overlapping sample passes under DB latency (KpiHistory-001). |
| 4 | Error handling & resilience | Yes | Per-source isolation, write/purge guards, and `OperationCanceledException` shutdown handling are all correct and tested. No issues beyond KpiHistory-001's load amplification. |
| 5 | Security | Yes | No injection surface: all queries are parameterised LINQ; `Source`/`Metric`/`Scope`/`ScopeKey` are equality predicates, never interpolated SQL. Scope isolation via `ScopeKey == scopeKey` (incl. `IS NULL` for Global) is correct. No issues found. |
| 6 | Performance & resource management | Yes | `GetRawSeriesAsync` returns the entire window with no server-side cap before bucketing (KpiHistory-002). `RecordSamplesAsync` short-circuits empty batches (good). Purge is set-based `ExecuteDeleteAsync` but is a single unbatched statement (KpiHistory-004). |
| 7 | Design-document adherence | Yes | The recorder XML claims to mirror NotificationOutbox's pattern but omits its in-flight guard (KpiHistory-001). `DefaultMaxSeriesPoints`-as-a-true-cap intent is undermined by KpiHistory-002. Otherwise faithful: singleton, not-readiness-gated, daily purge, EAV schema, indexes. |
| 8 | Code organization & conventions | Yes | Options class owned by the component (correct); validator co-located; `IKpiHistoryRepository`/`IKpiSampleSource`/`KpiSample`/bucketer in Commons, impl in ConfigurationDatabase (correct); singleton Props built in Host. No issues found. |
| 9 | Testing coverage | Yes | Recorder isolation, faulted-tick recovery, and purge cut-off are tested; validator bounds fully covered; bucketer has strong sorted-input coverage. Gaps: no overlapping-tick test, no unsorted-input bucketer test, no short-series timestamp-semantics assertion (KpiHistory-003). |
| 10 | Documentation & comments | Yes | XML is generally excellent. Two drift points: the "mirror NotificationOutbox" claim (KpiHistory-001) and the bucketer unsorted-input claim (KpiHistory-006). `SampleComplete`/`PurgeComplete` no-op handlers are documented. |

Findings

KpiHistory-001 — Recorder has no in-flight guard; overlapping sample passes pile up under DB latency


| Field | Value |
|---|---|
| Severity | High |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.KpiHistory/KpiHistoryRecorderActor.cs:89-92`, `:103-107`, `:143-159` |

Description

The sample timer is a plain Akka periodic timer (Timers.StartPeriodicTimer(SampleTimerKey, SampleTick.Instance, …, interval: _options.SampleInterval)). HandleSampleTick launches RunSamplePass(...) off-thread via PipeTo and returns immediately; the piped-back SampleComplete is a deliberate no-op (Receive<SampleComplete>(_ => { })). There is no in-flight guard — nothing prevents a second SampleTick from starting a second RunSamplePass while the first is still awaiting its DB round-trip.

Akka periodic timers do not coalesce missed ticks — they enqueue. So when a sample pass takes longer than SampleInterval (a slow, contended, or recovering central MS SQL — exactly the regime where observability matters most), each subsequent tick spawns another concurrent pass. Each pass opens its own DI scope + DbContext and issues its own AddRange + SaveChangesAsync. The result is a self-amplifying load spiral against the struggling store, plus duplicate sample rows whose CapturedAtUtc values straddle the same real-time window (the design's "one shared tick timestamp" invariant assumes one pass per tick).

The actor's own XML-doc states it "mirrors the NotificationOutboxActor timer + scope-per-tick + PipeTo pattern" (lines 37-39) — but NotificationOutboxActor holds an explicit in-flight boolean cleared on DispatchComplete precisely to serialise sweeps. This recorder dropped that half of the pattern. The piped-back SampleComplete message is the natural place the guard would be lowered, and its current empty body is the tell that the guard was intended but omitted.

Recommendation

Add an in-flight guard: set a _sampleInFlight flag at the top of HandleSampleTick, skip (and log at debug) the tick if already set, and clear it in the SampleComplete handler (both success and failure projections already route there). This matches the NotificationOutbox pattern the doc claims to follow and bounds the recorder to one pass per tick. Add a regression test that drives two SampleTicks before the first pass completes (e.g. a repository whose RecordSamplesAsync blocks on a gate) and asserts only one pass ran. The purge tick is daily + idempotent so a guard there is optional, but consider the same treatment for symmetry.
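The guard recommended above can be modeled in a few lines. This is a minimal Python sketch, not the Akka.NET actor: `RecorderModel` and its method names are hypothetical, the counter stands in for launching `RunSamplePass` via `PipeTo`, and only the `_sampleInFlight` flag is taken from the recommendation.

```python
class RecorderModel:
    """Toy model of the recorder's tick handling (hypothetical, not the actor)."""

    def __init__(self):
        self.sample_in_flight = False  # models the proposed _sampleInFlight flag
        self.passes_started = 0
        self.ticks_skipped = 0

    def handle_sample_tick(self):
        # Skip the tick if an earlier pass has not completed yet.
        if self.sample_in_flight:
            self.ticks_skipped += 1
            return False
        self.sample_in_flight = True
        self.passes_started += 1  # stands in for starting a sample pass off-thread
        return True

    def handle_sample_complete(self):
        # Called for both successful and failed passes, so the guard always lowers.
        self.sample_in_flight = False


recorder = RecorderModel()
recorder.handle_sample_tick()      # first tick starts a pass
recorder.handle_sample_tick()      # second tick arrives mid-pass and is skipped
recorder.handle_sample_complete()  # completion message lowers the guard
recorder.handle_sample_tick()      # next tick starts a fresh pass
print(recorder.passes_started, recorder.ticks_skipped)  # prints "2 1"
```

A plain boolean is enough in the actor setting because an actor handles one message at a time, so the tick and completion handlers never run concurrently.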

Resolution

Resolved 2026-06-20 (commit fd618cf1): added a _sampleInFlight guard to the recorder — HandleSampleTick skips (debug-logs) if a pass is in flight; the flag is cleared via a SampleComplete message piped on BOTH success and failure, so a faulted source can't wedge the guard. Overlapping-tick regression test added.

KpiHistory-002 — `GetRawSeriesAsync` materialises the entire window with no server-side cap; `DefaultMaxSeriesPoints` is applied only after the fact


| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Interfaces/Repositories/IKpiHistoryRepository.cs:39-41` (contract); `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/KpiHistoryRepository.cs:45-62` (impl); consumed by `src/ZB.MOM.WW.ScadaBridge.CentralUI/Services/KpiHistoryQueryService.cs:77-93` |

Description

GetRawSeriesAsync has no limit/maxPoints/Take parameter. The implementation issues a Where(...).OrderBy(CapturedAtUtc).Select(...).ToListAsync() that pulls every sample in [fromUtc, toUtc] for the series into memory. The DefaultMaxSeriesPoints ceiling (default 200, design-described as the cap that prevents "a single trend query [from streaming] an arbitrarily large series to the Central UI" — KpiHistoryOptionsValidator XML) is enforced by KpiSeriesBucketer.Bucket after the full raw set has already been fetched across the wire and allocated as a List<KpiSeriesPoint> in the query service.

At the default 60 s sample cadence, a 24 h window is ~1 440 rows/series, a 7 d window is ~10 080 rows/series, and a 30 d window ~43 200 rows/series — per chart, with up to four trend panels on a page each issuing its own query. The cap that the design relies on for back-pressure provides none at the data tier; it only trims the in-memory result the UI binds to. The IX_KpiSample_Series index makes the range scan efficient, but the row count returned is still unbounded by anything except the window the parent page chooses.
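The row counts above follow directly from the cadence. The short Python check below reproduces them and shows how much of each fetch the 200-point cap discards. It assumes exactly one sample per interval with no gaps; `rows_in_window` is a hypothetical helper, not part of the module.

```python
SAMPLE_INTERVAL_S = 60    # default sample cadence quoted in the finding
DEFAULT_MAX_POINTS = 200  # DefaultMaxSeriesPoints default quoted in the finding


def rows_in_window(window_hours, interval_s=SAMPLE_INTERVAL_S):
    """Rows per series held by a window at a fixed sample cadence."""
    return window_hours * 3600 // interval_s


for label, hours in [("24 h", 24), ("7 d", 7 * 24), ("30 d", 30 * 24)]:
    fetched = rows_in_window(hours)
    bound = min(fetched, DEFAULT_MAX_POINTS)
    print(f"{label}: fetched={fetched} bound={bound} ratio={fetched / bound:.1f}x")
# 24 h: fetched=1440 bound=200 ratio=7.2x
# 7 d: fetched=10080 bound=200 ratio=50.4x
# 30 d: fetched=43200 bound=200 ratio=216.0x
```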

Recommendation

Push the downsampling toward the store, or at least bound the fetch. Cheapest correct fix: add an optional int? maxRows to GetRawSeriesAsync and have the query service pass a generous multiple of effectiveMax (e.g. effectiveMax * k) so the bucketer still has enough density to pick representative last-values while the DB-side Take caps the transfer. A more thorough fix is a server-side bucketed aggregation (GROUP BY a computed bucket index, MAX(CapturedAtUtc) per bucket), but that is a larger change the design explicitly deferred ("v1 ships exactly one aggregation"). At minimum, document the unbounded-fetch behaviour and the practical window ceiling so an operator does not point a 30 d chart at a busy multi-site deployment.

Resolution

Deferred 2026-06-20: bounding the raw window fetch (cheap Take-cap vs. server-side bucketed aggregation) is a design decision and the code lives in ConfigurationDatabase/Commons, not this project. Recorded as a query-path enhancement; the practical window ceiling should be documented when addressed.

KpiHistory-003 — Missing tests: overlapping ticks, unsorted-input bucketing, short-series timestamp semantics


| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.KpiHistory.Tests/KpiHistoryRecorderActorTests.cs`; `tests/ZB.MOM.WW.ScadaBridge.Commons.Tests/Kpi/KpiSeriesBucketerTests.cs` |

Description

The recorder and bucketer tests are otherwise strong (per-source isolation, faulted-tick recovery, purge cut-off, validator bounds, sorted-input bucketing, right-edge handling, empty-bucket omission, out-of-window filtering). Three behaviour gaps remain, each tied to a finding here:

- No overlapping-tick test (KpiHistory-001). Every recorder test sends a single tick and awaits its effect; nothing exercises a second SampleTick arriving while the first pass is still in flight, so the missing in-flight guard is invisible to the suite. A gated-repository test (block RecordSamplesAsync, send two ticks, count passes) would pin the intended one-pass-per-tick behaviour.
- No unsorted-input bucketer test (KpiHistory-006). All bucketer tests pass ascending input, so the doc's "largest-timestamp-wins for unsorted input" claim is never checked, and it is in fact wrong.
- No short-series timestamp-semantics assertion (KpiHistory-005). Bucket_BucketStartUtc_IsSetToBucketStart… covers only the downsample path; no test asserts what BucketStartUtc the early-return (raw.Count <= maxPoints) path emits, so the inconsistency between the two paths is untested.

Recommendation

Add the three tests above. The overlapping-tick test belongs in KpiHistoryRecorderActorTests (it is a recorder behaviour); the two bucketer tests belong in KpiSeriesBucketerTests.

Resolution

Resolved 2026-06-20 (commit fd618cf1): added the overlapping-tick gated-repository test plus unsorted-input and short-series bucketer tests (the latter pin the documented short-series behaviour).

KpiHistory-004 — Retention purge is a single unbatched `ExecuteDeleteAsync`; a large backlog deletes in one transaction


| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/KpiHistoryRepository.cs:65-71` |

Description

PurgeOlderThanAsync runs one set-based Where(s => s.CapturedAtUtc < before).ExecuteDeleteAsync(...). For the steady-state daily cadence this deletes one day of expired rows and is fine. But if the purge has not run for an extended period (singleton down across a long failover window, PurgeInterval mis-set and later corrected, or RetentionDays shortened), a single unbounded DELETE can touch a very large row count in one transaction — lock escalation on KpiSample, a long-running transaction, and transaction-log growth, which on the shared central MS SQL can affect operational tables. The Audit Log purge path, by contrast, uses a bounded batched delete for exactly this reason.

This is observability data on a non-partitioned [PRIMARY] table, so the blast radius is bounded and the severity is Low — but the unbatched delete is a latent operational hazard on the shared store.

Recommendation

Loop a bounded delete (DELETE TOP (N) … WHERE CapturedAtUtc < @before, or EF Core's batching) until zero rows are affected, mirroring the Audit Log purge shape. Keep the returned total for the existing log line.
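The loop shape recommended above can be sketched as follows. This is an illustrative Python model that uses an in-memory SQLite database as a stand-in for the central MS SQL store, so the SQL dialect differs (SQLite has no `DELETE TOP (N)`; a `rowid` subquery with `LIMIT` bounds each batch instead). The function name, batch size, and integer timestamps are assumptions; only the `KpiSample` table and `CapturedAtUtc` column names come from the finding.

```python
import sqlite3


def purge_older_than(conn, before, batch_size=1000):
    """Delete expired rows in bounded batches and return the total removed."""
    total = 0
    while True:
        with conn:  # commits after each batch, keeping every transaction short
            cur = conn.execute(
                "DELETE FROM KpiSample WHERE rowid IN ("
                " SELECT rowid FROM KpiSample WHERE CapturedAtUtc < ? LIMIT ?)",
                (before, batch_size),
            )
        if cur.rowcount == 0:
            return total  # nothing left to purge
        total += cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE KpiSample (CapturedAtUtc INTEGER, Value REAL)")
conn.executemany("INSERT INTO KpiSample VALUES (?, ?)", [(t, 0.0) for t in range(2500)])
conn.commit()

deleted = purge_older_than(conn, before=2000, batch_size=1000)
remaining = conn.execute("SELECT COUNT(*) FROM KpiSample").fetchone()[0]
print(deleted, remaining)  # prints "2000 500"
```

Each batch commits on its own, so a large backlog is removed as several short transactions rather than one long one, and the returned total can still feed the existing log line.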

Resolution

Deferred 2026-06-20: batching the retention purge is a shared-MS-SQL-store tradeoff (vs. the Audit Log's batched-delete precedent); low severity for non-partitioned observability data. Recorded as a future enhancement.

KpiHistory-005 — `KpiSeriesBucketer` short-series early-return emits raw capture timestamps where the downsample path emits bucket-boundary timestamps


| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiSeriesBucketer.cs:57-58` (early return) vs `:93-94` (downsample) |

Description

KpiSeriesBucketer.Bucket has two return paths that disagree on what BucketStartUtc means. When raw.Count <= maxPoints it returns raw unchanged — those points carry the raw CapturedAtUtc as their BucketStartUtc (the repository builds new KpiSeriesPoint(s.CapturedAtUtc, s.Value)). When raw.Count > maxPoints it returns points whose BucketStartUtc is the bucket boundary (fromUtc + bucketIndex * bucketWidthTicks), as the dedicated test Bucket_BucketStartUtc_IsSetToBucketStartNotRawPointTimestamp asserts.

So the same series, charted at a density below vs above maxPoints, plots its points at different x-positions: actual capture instants in the sparse case, evenly-spaced bucket starts in the dense case. The downstream KpiTrendChart normalises X across [min(BucketStartUtc), max(BucketStartUtc)], so the visual impact is minor (the time range is essentially the same), but the contract is inconsistent and the x-axis "tick spacing" subtly changes as a window crosses the cap. This is the bucketer that the KpiHistory design defines as the module's query reducer, so the inconsistency is in-scope even though the file lives in Commons.
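The two return paths described above can be modeled in Python to make the timestamp difference visible. This is a simplified sketch, not the C# implementation: timestamps are plain integers standing in for ticks, the bucket width is assumed to be the window length divided by `max_points`, and the inclusive window check and right-edge clamp are assumptions about details the finding does not spell out.

```python
from typing import List, Tuple

Point = Tuple[int, float]  # (BucketStartUtc as integer ticks, value)


def bucket(raw: List[Point], from_utc: int, to_utc: int, max_points: int) -> List[Point]:
    # Short-series path: the raw points are returned unchanged, so each
    # point keeps its capture timestamp.
    if len(raw) <= max_points:
        return raw
    # Downsample path: the last value seen in each bucket is emitted with the
    # bucket boundary as its timestamp. Input is assumed to be ascending.
    width = max(1, (to_utc - from_utc) // max_points)
    best = {}
    for t, value in raw:
        if not (from_utc <= t <= to_utc):
            continue  # drop out-of-window points
        index = min((t - from_utc) // width, max_points - 1)  # clamp the right edge
        best[index] = (from_utc + index * width, value)
    return [best[i] for i in sorted(best)]


raw = [(3, 1.0), (17, 2.0), (26, 3.0), (38, 4.0)]
print(bucket(raw, 0, 40, max_points=4))  # [(3, 1.0), (17, 2.0), (26, 3.0), (38, 4.0)]
print(bucket(raw, 0, 40, max_points=2))  # [(0, 2.0), (20, 4.0)]
```

The same four samples plot at x = 3, 17, 26, 38 when they fit under the cap and at x = 0, 20 once they exceed it, which is the contract inconsistency the finding describes.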

Recommendation

Make the two paths agree. Either document the difference explicitly on Bucket (the short-series path returns raw capture instants; the downsample path returns bucket starts), or — cleaner — have the short-series path also project onto a consistent timestamp basis if exact bucket-start semantics are part of the contract. Add the short-series timestamp test from KpiHistory-003.

Resolution

Deferred 2026-06-20: whether the bucketer's short-series early-return and the downsample path must agree on BucketStartUtc semantics (vs. documenting the difference) is a contract decision for the design owner. The doc was corrected (KpiHistory-006) to describe current behaviour; the contract change is deferred.

KpiHistory-006 — `KpiSeriesBucketer` doc claims "largest timestamp within each bucket is selected" for unsorted input; the code selects last-in-iteration


| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiSeriesBucketer.cs:20-21` (doc), `:88-97` (code) |

Description

The raw param doc states: "If not sorted, the point with the largest timestamp within each bucket is selected." The code does not do this. When a point is stored into a bucket, best[bucketIndex].BucketStartUtc is set to the bucket-start timestamp (fromUtc + bucketIndex * bucketWidthTicks), not the raw point's timestamp. The last-value comparison for a subsequent point in the same bucket is then point.BucketStartUtc > best[bucketIndex].BucketStartUtc — i.e. it compares the new raw point's capture time against the bucket start, which (for any in-bucket point) is almost always true. The effect is that each later-in-iteration point overwrites the previous one regardless of their relative timestamps, so the last point in iteration order wins, not the point with the largest timestamp.

For the production caller this is harmless: KpiHistoryRepository.GetRawSeriesAsync always OrderBy(CapturedAtUtc), so iteration order equals time order and last-in-iteration is the largest timestamp. But the documented contract for unsorted input is simply false, and the "if ties, keep first encountered — stable" comment (line 89) is also inaccurate — the overwrite triggers on equal-as-well-as-greater for any in-bucket point. A future caller that trusts the unsorted-input guarantee will get wrong results silently.

Recommendation

Either (a) fix the comparison to track the selected raw point's actual timestamp (store the raw point.BucketStartUtc alongside the emitted value and compare against that), making the "largest timestamp wins" claim true for unsorted input; or (b) tighten the doc to state the method requires ascending-sorted input and selects last-in-iteration otherwise, and drop the inaccurate "largest timestamp" / "stable ties" language. Pair with the unsorted-input test from KpiHistory-003.
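The comparison described in this finding, and option (a) from the recommendation, can be contrasted in a small Python model. Both functions are hypothetical reconstructions from the description above, using integer timestamps and a fixed bucket width; they are not the code in `KpiSeriesBucketer`.

```python
def select_as_described(raw, from_utc, width):
    """Compare each point's capture time against the stored bucket start."""
    best = {}
    for t, value in raw:
        index = (t - from_utc) // width
        start = from_utc + index * width
        if index not in best or t > best[index][0]:
            best[index] = (start, value)  # stores the bucket start, not t
    return [best[i] for i in sorted(best)]


def select_option_a(raw, from_utc, width):
    """Option (a): remember the chosen raw timestamp and compare against it."""
    best = {}
    for t, value in raw:
        index = (t - from_utc) // width
        start = from_utc + index * width
        if index not in best or t > best[index][0]:
            best[index] = (t, start, value)  # keeps the raw timestamp for later compares
    return [(start, value) for _, start, value in (best[i] for i in sorted(best))]


unsorted = [(15, "late"), (12, "early")]  # both fall in bucket [10, 20)
print(select_as_described(unsorted, from_utc=0, width=10))  # [(10, 'early')]
print(select_option_a(unsorted, from_utc=0, width=10))      # [(10, 'late')]
```

With the described comparison, the point at 12 overwrites the point at 15 because 12 is greater than the stored bucket start of 10, so the last point in iteration order wins. Tracking the raw timestamp makes the largest timestamp win, and with a strict `>` the first point encountered is kept on ties.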

Resolution

Resolved 2026-06-20 (commit fd618cf1): corrected the KpiSeriesBucketer param XML doc — dropped the false 'largest-timestamp-wins / stable ties' claim; now states it requires ascending-sorted input and selects last-in-iteration otherwise. Doc-only.
