# Code Review — KpiHistory

| Field | Value |
|-------|-------|
| Module | `src/ZB.MOM.WW.ScadaBridge.KpiHistory` |
| Design doc | `docs/requirements/Component-KpiHistory.md` |
| Status | Reviewed |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `4307c381` |
| Open findings | 0 |

## Summary

KpiHistory is a small, well-built observability module: a single ~305-line recorder singleton (`KpiHistoryRecorderActor`), a strongly-typed options class with a fail-fast validator, and a thin DI composition root. The owned code is clean, and the actor is textbook best-effort: every sample pass and purge sweep runs off the actor thread via `PipeTo`, per-source faults are isolated so one throwing `IKpiSampleSource` never aborts a tick or the other sources, the repository write/purge is guarded, no exception escapes either tick handler, and a lifecycle `CancellationTokenSource` is cancelled in `PostStop` so an in-flight pass observes shutdown promptly. The singleton is wired correctly in the Host (ClusterSingletonManager + proxy + `PhaseClusterLeave` drain) and is deliberately absent from the readiness barrier, exactly as the design requires. The options validator, the EAV table mapping, both named indexes, and all four `IKpiSampleSource` implementations exist and are registered on the central host as designed.

The dominant theme is **unbounded work under load**, in two places. First, the recorder has **no in-flight guard** on its sample timer. This directly contradicts its own XML-doc claim to "mirror the NotificationOutboxActor timer + scope-per-tick + PipeTo pattern," because the NotificationOutbox dispatcher *does* hold an in-flight guard and the recorder does not. When a sample pass runs longer than `SampleInterval` (slow or recovering DB), Akka periodic timers enqueue ticks rather than coalescing them, so overlapping `RunSamplePass` tasks pile up, multiplying DB load at exactly the moment the store is struggling and double-writing samples for overlapping windows.
Second, `GetRawSeriesAsync` has **no server-side row cap**: the design's `DefaultMaxSeriesPoints` ceiling is applied by `KpiSeriesBucketer` only *after* the full raw window is materialised into memory. A 7-day window at the default 60 s cadence is ~10 080 rows per series pulled to the Central UI before downsampling, which defeats the stated intent of the cap.

A secondary theme is **bucketer contract drift**: the "largest-timestamp-wins for unsorted input" doc claim is not what the code does, and the short-series early-return emits raw capture timestamps where the downsample path emits bucket-boundary timestamps. Both live in Commons but are core to this module's query reducer.

No Critical findings; one High, two Medium, three Low.

## Checklist coverage

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | `capturedAt`/cut-off captured on the actor thread (correct). `KpiSeriesBucketer` short-series early-return emits raw capture timestamps vs bucket-boundary timestamps on the downsample path (KpiHistory-005). Bucketer "largest-timestamp-wins for unsorted input" doc claim is false: the code selects last-in-iteration (KpiHistory-006). |
| 2 | Akka.NET conventions | Yes | `PipeTo` + scope-per-tick + off-thread I/O + `IWithTimers` all correct; sender not captured across awaits; messages immutable singletons. But no in-flight guard despite the XML doc claiming to mirror NotificationOutbox (KpiHistory-001). |
| 3 | Concurrency & thread safety | Yes | Actor state is effectively stateless; `_shutdownCts` lifecycle is correct. The missing in-flight guard allows overlapping sample passes under DB latency (KpiHistory-001). |
| 4 | Error handling & resilience | Yes | Per-source isolation, write/purge guards, and `OperationCanceledException` shutdown handling are all correct and tested. No issues beyond KpiHistory-001's load amplification. |
| 5 | Security | Yes | No injection surface: all queries are parameterised LINQ; `Source`/`Metric`/`Scope`/`ScopeKey` are equality predicates, never interpolated SQL. Scope isolation via `ScopeKey == scopeKey` (incl. `IS NULL` for Global) is correct. No issues found. |
| 6 | Performance & resource management | Yes | `GetRawSeriesAsync` returns the entire window with no server-side cap before bucketing (KpiHistory-002). `RecordSamplesAsync` short-circuits empty batches (good). Purge is set-based `ExecuteDeleteAsync` but is a single unbatched statement (KpiHistory-004). |
| 7 | Design-document adherence | Yes | The recorder XML doc claims to mirror NotificationOutbox's pattern but omits its in-flight guard (KpiHistory-001). `DefaultMaxSeriesPoints`-as-a-true-cap intent is undermined by KpiHistory-002. Otherwise faithful: singleton, not-readiness-gated, daily purge, EAV schema, indexes. |
| 8 | Code organization & conventions | Yes | Options class owned by the component (correct); validator co-located; `IKpiHistoryRepository`/`IKpiSampleSource`/`KpiSample`/bucketer in Commons, impl in ConfigurationDatabase (correct); singleton Props built in Host. No issues found. |
| 9 | Testing coverage | Yes | Recorder isolation, faulted-tick recovery, and purge cut-off are tested; validator bounds fully covered; bucketer has strong sorted-input coverage. Gaps: no overlapping-tick test, no unsorted-input bucketer test, no short-series timestamp-semantics assertion (KpiHistory-003). |
| 10 | Documentation & comments | Yes | XML documentation is generally excellent. Two drift points: the "mirror NotificationOutbox" claim (KpiHistory-001) and the bucketer unsorted-input claim (KpiHistory-006). `SampleComplete`/`PurgeComplete` no-op handlers are documented. |

## Findings

### KpiHistory-001 — Recorder has no in-flight guard; overlapping sample passes pile up under DB latency

| | |
|--|--|
| Severity | High |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.KpiHistory/KpiHistoryRecorderActor.cs:89-92`, `:103-107`, `:143-159` |

**Description**

The sample timer is a plain Akka periodic timer (`Timers.StartPeriodicTimer(SampleTimerKey, SampleTick.Instance, …, interval: _options.SampleInterval)`). `HandleSampleTick` launches `RunSamplePass(...)` off-thread via `PipeTo` and returns immediately; the piped-back `SampleComplete` is a deliberate no-op (`Receive(_ => { })`). There is **no in-flight guard**: nothing prevents a second `SampleTick` from starting a second `RunSamplePass` while the first is still awaiting its DB round-trip.

Akka periodic timers do not coalesce missed ticks; they enqueue them. So when a sample pass takes longer than `SampleInterval` (a slow, contended, or recovering central MS SQL, which is exactly the regime where observability matters most), each subsequent tick spawns *another* concurrent pass. Each pass opens its own DI scope + `DbContext` and issues its own `AddRange` + `SaveChangesAsync`. The result is a self-amplifying load spiral against the struggling store, plus duplicate sample rows whose `CapturedAtUtc` values straddle the same real-time window (the design's "one shared tick timestamp" invariant assumes one pass per tick).

The actor's own XML-doc states it "mirrors the `NotificationOutboxActor` timer + scope-per-tick + PipeTo pattern" (lines 37-39), but `NotificationOutboxActor` holds an explicit in-flight boolean cleared on `DispatchComplete` precisely to serialise sweeps. This recorder dropped that half of the pattern. The piped-back `SampleComplete` message is the natural place the guard would be lowered, and its current empty body is the tell that the guard was intended but omitted.
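
For orientation, the guard half of the pattern that this description says is missing might look like the following. This is a minimal sketch under stated assumptions, not the module's code: `GuardedRecorderSketch`, its constructor parameters, and the `"sample"` timer key are hypothetical, and `Task.Run(_runSamplePass)` stands in for the real `RunSamplePass(...)` call.

```csharp
using System;
using System.Threading.Tasks;
using Akka.Actor;

// Hypothetical, trimmed-down recorder: only the in-flight guard is shown.
public sealed class GuardedRecorderSketch : ReceiveActor, IWithTimers
{
    public sealed class SampleTick
    {
        public static readonly SampleTick Instance = new SampleTick();
        private SampleTick() { }
    }

    public sealed class SampleComplete
    {
        public static readonly SampleComplete Instance = new SampleComplete();
        private SampleComplete() { }
    }

    private readonly Func<Task> _runSamplePass; // stand-in for RunSamplePass(...)
    private readonly TimeSpan _sampleInterval;
    private bool _sampleInFlight;               // read and written only inside message handlers

    public ITimerScheduler Timers { get; set; }

    public GuardedRecorderSketch(Func<Task> runSamplePass, TimeSpan sampleInterval)
    {
        _runSamplePass = runSamplePass;
        _sampleInterval = sampleInterval;

        Receive<SampleTick>(_ => HandleSampleTick());
        // Statement body so the lambda binds to the Action<T> overload.
        Receive<SampleComplete>(_ => { _sampleInFlight = false; });
    }

    protected override void PreStart()
    {
        Timers.StartPeriodicTimer("sample", SampleTick.Instance, _sampleInterval);
    }

    private void HandleSampleTick()
    {
        if (_sampleInFlight)
        {
            return; // previous pass still running: skip this tick
        }

        _sampleInFlight = true;

        // Success and failure both project to SampleComplete, so a faulted
        // pass cannot leave the guard raised.
        Task.Run(_runSamplePass).PipeTo(
            Self,
            success: () => SampleComplete.Instance,
            failure: _ => SampleComplete.Instance);
    }
}
```

Because the flag is only touched inside message handlers, it needs no lock; the actor's one-message-at-a-time processing serialises access.
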

**Recommendation**

Add an in-flight guard: set a `_sampleInFlight` flag at the top of `HandleSampleTick`, skip (and log at debug) the tick if already set, and clear it in the `SampleComplete` handler (both success and failure projections already route there). This matches the NotificationOutbox pattern the doc claims to follow and bounds the recorder to at most one sample pass in flight at a time. Add a regression test that drives two `SampleTick`s before the first pass completes (e.g. a repository whose `RecordSamplesAsync` blocks on a gate) and asserts only one pass ran. The purge tick is daily and idempotent, so a guard there is optional, but consider the same treatment for symmetry.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): added a `_sampleInFlight` guard to the recorder. `HandleSampleTick` skips (debug-logs) if a pass is in flight; the flag is cleared via a `SampleComplete` message piped on BOTH success and failure, so a faulted source can't wedge the guard. Overlapping-tick regression test added.

### KpiHistory-002 — `GetRawSeriesAsync` materialises the entire window with no server-side cap; `DefaultMaxSeriesPoints` is applied only after the fact

| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Interfaces/Repositories/IKpiHistoryRepository.cs:39-41` (contract); `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/KpiHistoryRepository.cs:45-62` (impl); consumed by `src/ZB.MOM.WW.ScadaBridge.CentralUI/Services/KpiHistoryQueryService.cs:77-93` |

**Description**

`GetRawSeriesAsync` has no `limit`/`maxPoints`/`Take` parameter. The implementation issues a `Where(...).OrderBy(CapturedAtUtc).Select(...).ToListAsync()` that pulls **every** sample in `[fromUtc, toUtc]` for the series into memory.
The `DefaultMaxSeriesPoints` ceiling (default 200; described in the `KpiHistoryOptionsValidator` XML doc as the cap that prevents "a single trend query [from streaming] an arbitrarily large series to the Central UI") is enforced by `KpiSeriesBucketer.Bucket` **after** the full raw set has already been fetched across the wire and allocated as a `List` in the query service.

At the default 60 s sample cadence, a 24 h window is ~1 440 rows/series, a 7 d window is ~10 080 rows/series, and a 30 d window is ~43 200 rows/series. That is per chart, with up to four trend panels on a page each issuing its own query. The cap that the design relies on for back-pressure provides none at the data tier; it only trims the in-memory result the UI binds to. The `IX_KpiSample_Series` index makes the *range scan* efficient, but the row count returned is still unbounded by anything except the window the parent page chooses.

**Recommendation**

Push the downsampling toward the store, or at least bound the fetch. Cheapest correct fix: add an optional `int? maxRows` to `GetRawSeriesAsync` and have the query service pass a generous multiple of `effectiveMax` (e.g. `effectiveMax * k`) so the bucketer still has enough density to pick representative last-values while the DB-side `Take` caps the transfer. A more thorough fix is a server-side bucketed aggregation (GROUP BY a computed bucket index, MAX(CapturedAtUtc) per bucket), but that is a larger change the design explicitly deferred ("v1 ships exactly one aggregation"). At minimum, document the unbounded-fetch behaviour and the practical window ceiling so an operator does not point a 30 d chart at a busy multi-site deployment.

**Resolution**

Deferred 2026-06-20: bounding the raw window fetch (cheap `Take`-cap vs. server-side bucketed aggregation) is a design decision and the code lives in ConfigurationDatabase/Commons, not this project. Recorded as a query-path enhancement; the practical window ceiling should be documented when addressed.
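
The row-count arithmetic and the "cheapest correct fix" can be sketched as below. This is an in-memory illustration only: `BoundedFetchSketch` and the `KpiSeriesPoint` record shape are hypothetical stand-ins, LINQ-to-Objects replaces the real EF Core query, and the `maxRows` parameter is the one proposed in the recommendation rather than an existing API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed shape; the real KpiSeriesPoint lives in Commons.
public readonly record struct KpiSeriesPoint(DateTime BucketStartUtc, double Value);

public static class BoundedFetchSketch
{
    // Raw rows one series produces over a window at a fixed sample cadence.
    public static long ExpectedRows(TimeSpan window, TimeSpan sampleInterval) =>
        window.Ticks / sampleInterval.Ticks;

    // Hypothetical capped variant of GetRawSeriesAsync over an in-memory source.
    public static List<KpiSeriesPoint> GetRawSeries(
        IEnumerable<KpiSeriesPoint> samples,
        DateTime fromUtc,
        DateTime toUtc,
        int? maxRows = null)
    {
        IEnumerable<KpiSeriesPoint> query = samples
            .Where(p => p.BucketStartUtc >= fromUtc && p.BucketStartUtc <= toUtc)
            .OrderBy(p => p.BucketStartUtc);

        if (maxRows is int cap)
        {
            // With an ascending sort, Take keeps the oldest rows in the window.
            query = query.Take(cap);
        }

        return query.ToList();
    }
}
```

`ExpectedRows(TimeSpan.FromDays(7), TimeSpan.FromSeconds(60))` gives 10 080 and the 30-day equivalent gives 43 200, matching the figures above. One design point the sketch makes visible: a plain `Take` on an ascending sort truncates the most recent end of the window, so a real implementation would need to decide which end to keep or how to thin evenly.
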
### KpiHistory-003 — Missing tests: overlapping ticks, unsorted-input bucketing, short-series timestamp semantics

| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.KpiHistory.Tests/KpiHistoryRecorderActorTests.cs`; `tests/ZB.MOM.WW.ScadaBridge.Commons.Tests/Kpi/KpiSeriesBucketerTests.cs` |

**Description**

The recorder and bucketer tests are otherwise strong (per-source isolation, faulted-tick recovery, purge cut-off, validator bounds, sorted-input bucketing, right-edge handling, empty-bucket omission, out-of-window filtering). Three behaviour gaps remain, each tied to a finding here:

1. **No overlapping-tick test** (KpiHistory-001). Every recorder test sends a single tick and awaits its effect; nothing exercises a second `SampleTick` arriving while the first pass is still in flight, so the missing in-flight guard is invisible to the suite. A gated-repository test (block `RecordSamplesAsync`, send two ticks, count passes) would pin the intended one-pass-per-tick behaviour.
2. **No unsorted-input bucketer test** (KpiHistory-006). All bucketer tests pass ascending input, so the doc's "largest-timestamp-wins for unsorted input" claim is never checked, and it is in fact wrong.
3. **No short-series timestamp-semantics assertion** (KpiHistory-005). `Bucket_BucketStartUtc_IsSetToBucketStart…` covers only the downsample path; no test asserts what `BucketStartUtc` the early-return (`raw.Count <= maxPoints`) path emits, so the inconsistency between the two paths is untested.

**Recommendation**

Add the three tests above. The overlapping-tick test belongs in `KpiHistoryRecorderActorTests` (it is a recorder behaviour); the two bucketer tests belong in `KpiSeriesBucketerTests`.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): added the overlapping-tick gated-repository test plus unsorted-input and short-series bucketer tests (the latter pin the documented short-series behaviour).
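
The "gated repository" that the overlapping-tick test relies on could be built along these lines. This is a sketch of the test double only: `GatedSampleWriter` is a hypothetical name, and its `RecordSamplesAsync` signature is simplified for illustration rather than copied from `IKpiHistoryRepository`.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical test double: every write parks on a gate until the test opens it.
public sealed class GatedSampleWriter
{
    private readonly TaskCompletionSource<bool> _gate =
        new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

    private int _calls;

    // Number of write attempts that have started, whether or not they finished.
    public int Calls => Volatile.Read(ref _calls);

    // Simplified signature; the real repository takes KPI samples, not doubles.
    public async Task RecordSamplesAsync(IReadOnlyList<double> samples, CancellationToken ct = default)
    {
        Interlocked.Increment(ref _calls);
        await _gate.Task.WaitAsync(ct); // blocks the pass until Release() or cancellation
    }

    // Opens the gate so every parked write completes.
    public void Release() => _gate.TrySetResult(true);
}
```

A test would send two ticks while the gate is closed, assert `Calls == 1`, call `Release()`, and then confirm that a later tick starts a new pass.
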
### KpiHistory-004 — Retention purge is a single unbatched `ExecuteDeleteAsync`; a large backlog deletes in one transaction

| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/KpiHistoryRepository.cs:65-71` |

**Description**

`PurgeOlderThanAsync` runs one set-based `Where(s => s.CapturedAtUtc < before).ExecuteDeleteAsync(...)`. For the steady-state daily cadence this deletes one day of expired rows and is fine. But if the purge has not run for an extended period (singleton down across a long failover window, `PurgeInterval` mis-set and later corrected, or `RetentionDays` shortened), a single unbounded `DELETE` can touch a very large row count in one transaction. That means lock escalation on `KpiSample`, a long-running transaction, and transaction-log growth, which on the shared central MS SQL can affect operational tables. The Audit Log purge path, by contrast, uses a bounded batched delete for exactly this reason.

This is observability data on a non-partitioned `[PRIMARY]` table, so the blast radius is bounded and the severity is Low, but the unbatched delete is a latent operational hazard on the shared store.

**Recommendation**

Loop a bounded delete (`DELETE TOP (N) … WHERE CapturedAtUtc < @before`, or EF Core's batching) until zero rows are affected, mirroring the Audit Log purge shape. Keep the returned total for the existing log line.

**Resolution**

Deferred 2026-06-20: batching the retention purge is a shared-MS-SQL-store tradeoff (vs. the Audit Log's batched-delete precedent); low severity for non-partitioned observability data. Recorded as a future enhancement.
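
The loop-until-zero shape from the recommendation can be sketched independently of the database. In this illustration `BatchedPurgeSketch` and the `deleteBatchAsync` delegate are hypothetical: the delegate stands in for whatever executes one bounded `DELETE TOP (N)` and returns the number of rows it removed.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BatchedPurgeSketch
{
    // Repeats a bounded delete until a batch removes nothing, returning the total.
    // deleteBatchAsync(batchSize, ct) is assumed to delete at most batchSize rows.
    public static async Task<long> PurgeInBatchesAsync(
        Func<int, CancellationToken, Task<int>> deleteBatchAsync,
        int batchSize,
        CancellationToken ct = default)
    {
        if (batchSize <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(batchSize));
        }

        long total = 0;
        while (true)
        {
            // Checked between batches so shutdown is observed promptly.
            ct.ThrowIfCancellationRequested();

            int deleted = await deleteBatchAsync(batchSize, ct);
            if (deleted == 0)
            {
                return total; // nothing left older than the cut-off
            }

            total += deleted;
        }
    }
}
```

Each batch commits on its own, so a large backlog becomes many short transactions instead of one long one, and the returned total can still feed the existing log line.
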
### KpiHistory-005 — `KpiSeriesBucketer` short-series early-return emits raw capture timestamps where the downsample path emits bucket-boundary timestamps

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiSeriesBucketer.cs:57-58` (early return) vs `:93-94` (downsample) |

**Description**

`KpiSeriesBucketer.Bucket` has two return paths that disagree on what `BucketStartUtc` means. When `raw.Count <= maxPoints` it returns `raw` unchanged, so those points carry the raw `CapturedAtUtc` as their `BucketStartUtc` (the repository builds `new KpiSeriesPoint(s.CapturedAtUtc, s.Value)`). When `raw.Count > maxPoints` it returns points whose `BucketStartUtc` is the **bucket boundary** (`fromUtc + bucketIndex * bucketWidthTicks`), as the dedicated test `Bucket_BucketStartUtc_IsSetToBucketStartNotRawPointTimestamp` asserts.

So the same series, charted at a density below vs above `maxPoints`, plots its points at different x-positions: actual capture instants in the sparse case, evenly-spaced bucket starts in the dense case. The downstream `KpiTrendChart` normalises X across `[min(BucketStartUtc), max(BucketStartUtc)]`, so the visual impact is minor (the time range is essentially the same), but the contract is inconsistent and the x-axis "tick spacing" subtly changes as a window crosses the cap. This is the bucketer that the KpiHistory design defines as the module's query reducer, so the inconsistency is in-scope even though the file lives in Commons.

**Recommendation**

Make the two paths agree. Either document the difference explicitly on `Bucket` (the short-series path returns raw capture instants; the downsample path returns bucket starts), or, more cleanly, have the short-series path also project onto a consistent timestamp basis if exact bucket-start semantics are part of the contract. Add the short-series timestamp test from KpiHistory-003.
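
To make the two timestamp bases concrete, the mapping the downsample path is described as applying can be sketched as follows. Only the `fromUtc + bucketIndex * bucketWidthTicks` formula comes from the finding; the `BucketTimestampSketch` name, the even split of the window into `maxPoints` buckets, and the right-edge clamp are assumptions made for illustration.

```csharp
using System;

public static class BucketTimestampSketch
{
    // Assumed width: the window divided evenly into maxPoints buckets.
    public static long BucketWidthTicks(DateTime fromUtc, DateTime toUtc, int maxPoints) =>
        Math.Max(1, (toUtc - fromUtc).Ticks / maxPoints);

    // Bucket-boundary timestamp for a capture instant inside [fromUtc, toUtc].
    public static DateTime BucketStartFor(
        DateTime capturedAtUtc, DateTime fromUtc, DateTime toUtc, int maxPoints)
    {
        long width = BucketWidthTicks(fromUtc, toUtc, maxPoints);

        // Clamp so a point exactly at toUtc lands in the last bucket (assumption).
        long index = Math.Min(maxPoints - 1, (capturedAtUtc - fromUtc).Ticks / width);

        return fromUtc.AddTicks(index * width);
    }
}
```

With a 12:00 to 13:00 window and `maxPoints = 6`, a sample captured at 12:07:31 maps to 12:00:00 on this basis, whereas the short-series path would report 12:07:31 unchanged.
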

**Resolution**

Deferred 2026-06-20: whether the bucketer's short-series early-return and the downsample path must agree on `BucketStartUtc` semantics (vs. documenting the difference) is a contract decision for the design owner. The doc was corrected (KpiHistory-006) to describe current behaviour; the contract change is deferred.

### KpiHistory-006 — `KpiSeriesBucketer` doc claims "largest timestamp within each bucket is selected" for unsorted input; the code selects last-in-iteration

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiSeriesBucketer.cs:20-21` (doc), `:88-97` (code) |

**Description**

The `raw` param doc states: *"If not sorted, the point with the largest timestamp within each bucket is selected."* The code does not do this. When a point is stored into a bucket, `best[bucketIndex].BucketStartUtc` is set to the **bucket-start** timestamp (`fromUtc + bucketIndex * bucketWidthTicks`), not the raw point's timestamp. The last-value comparison for a subsequent point in the same bucket is then `point.BucketStartUtc > best[bucketIndex].BucketStartUtc`. In other words, it compares the new raw point's capture time against the *bucket start*, which is true for every in-bucket point except one captured exactly at the bucket start. The effect is that each later-in-iteration point overwrites the previous one regardless of their relative timestamps, so the **last point in iteration order** wins, not the point with the largest timestamp.

For the production caller this is harmless: `KpiHistoryRepository.GetRawSeriesAsync` always `OrderBy(CapturedAtUtc)`, so iteration order equals time order and last-in-iteration is the largest timestamp. But the documented contract for unsorted input is simply false, and the "if ties, keep first encountered — stable" comment (line 89) is also inaccurate: two in-bucket points with equal timestamps still result in the later one overwriting the earlier, because each is compared against the bucket start rather than against the other.
A future caller that trusts the unsorted-input guarantee will get wrong results silently.

**Recommendation**

Either (a) fix the comparison to track the selected raw point's actual timestamp (store the raw `point.BucketStartUtc` alongside the emitted value and compare against that), making the "largest timestamp wins" claim true for unsorted input; or (b) tighten the doc to state the method requires ascending-sorted input and selects last-in-iteration otherwise, and drop the inaccurate "largest timestamp" / "stable ties" language. Pair with the unsorted-input test from KpiHistory-003.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): corrected the `KpiSeriesBucketer` param XML doc by dropping the false 'largest-timestamp-wins / stable ties' claim; it now states the method requires ascending-sorted input and selects last-in-iteration otherwise. Doc-only.
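
The difference between the comparison as described and option (a) can be reproduced in a few lines. This is a reconstruction from the finding's description, not the bucketer's source: `BucketSelectionSketch`, both method names, the `KpiSeriesPoint` record shape, and the right-edge clamp are assumptions, and out-of-window filtering is omitted.

```csharp
using System;
using System.Collections.Generic;

// Assumed shape; the real KpiSeriesPoint lives in Commons.
public readonly record struct KpiSeriesPoint(DateTime BucketStartUtc, double Value);

public static class BucketSelectionSketch
{
    // Assumes every point lies inside the window; the last bucket absorbs the right edge.
    private static int IndexOf(KpiSeriesPoint p, DateTime fromUtc, long widthTicks, int bucketCount) =>
        (int)Math.Min(bucketCount - 1, (p.BucketStartUtc - fromUtc).Ticks / widthTicks);

    // Comparison as described in the finding: raw capture time vs stored bucket start.
    public static KpiSeriesPoint?[] AsDescribed(
        IEnumerable<KpiSeriesPoint> raw, DateTime fromUtc, long widthTicks, int bucketCount)
    {
        var best = new KpiSeriesPoint?[bucketCount];
        foreach (var point in raw)
        {
            int i = IndexOf(point, fromUtc, widthTicks, bucketCount);
            if (!best[i].HasValue || point.BucketStartUtc > best[i].Value.BucketStartUtc)
            {
                best[i] = new KpiSeriesPoint(fromUtc.AddTicks(i * widthTicks), point.Value);
            }
        }
        return best;
    }

    // Option (a): remember the winning raw timestamp and compare against that.
    public static KpiSeriesPoint?[] LargestTimestampWins(
        IEnumerable<KpiSeriesPoint> raw, DateTime fromUtc, long widthTicks, int bucketCount)
    {
        var best = new KpiSeriesPoint?[bucketCount];
        var bestRawUtc = new DateTime[bucketCount];
        foreach (var point in raw)
        {
            int i = IndexOf(point, fromUtc, widthTicks, bucketCount);
            if (!best[i].HasValue || point.BucketStartUtc > bestRawUtc[i])
            {
                bestRawUtc[i] = point.BucketStartUtc;
                best[i] = new KpiSeriesPoint(fromUtc.AddTicks(i * widthTicks), point.Value);
            }
        }
        return best;
    }
}
```

Given one 10-minute bucket starting at 12:00 and the unsorted input (12:08, value 8) followed by (12:03, value 3), `AsDescribed` keeps value 3, the last point in iteration order, while `LargestTimestampWins` keeps value 8. With the strict `>` in the second method, equal raw timestamps keep the first encountered point, which is what the "stable ties" comment promised.
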