ScadaBridge/code-reviews/KpiHistory/findings.md

# Code Review — KpiHistory

| Field | Value |
|-------|-------|
| Module | `src/ZB.MOM.WW.ScadaBridge.KpiHistory` |
| Design doc | `docs/requirements/Component-KpiHistory.md` |
| Status | Reviewed |
| Last reviewed | 2026-06-24 |
| Reviewer | claude-agent |
| Commit reviewed | `1f9de8a2` |
| Open findings | 0 |

## Summary

KpiHistory is a small, well-built observability module: a single ~305-line recorder
singleton (`KpiHistoryRecorderActor`), a strongly-typed options class with a fail-fast
validator, and a thin DI composition root. The owned code is clean — the actor is
textbook best-effort: every sample pass and purge sweep runs off the actor thread via
`PipeTo`, per-source faults are isolated so one throwing `IKpiSampleSource` never aborts
a tick or the other sources, the repository write/purge is guarded, no exception escapes
either tick handler, and a lifecycle `CancellationTokenSource` is cancelled in `PostStop`
so an in-flight pass observes shutdown promptly. The singleton is wired correctly in the
Host (ClusterSingletonManager + proxy + `PhaseClusterLeave` drain) and is deliberately
absent from the readiness barrier, exactly as the design requires. The options validator,
the EAV table mapping, both named indexes, and all four `IKpiSampleSource` implementations
exist and are registered on the central host as designed.

The dominant theme is **unbounded work under load**, in two places. First, the recorder
has **no in-flight guard** on its sample timer — directly contradicting its own XML-doc
claim to "mirror the NotificationOutboxActor timer + scope-per-tick + PipeTo pattern,"
because the NotificationOutbox dispatcher *does* hold an in-flight guard and the recorder
does not. When a sample pass runs longer than `SampleInterval` (slow/recovering DB), Akka
periodic timers enqueue, not coalesce, so overlapping `RunSamplePass` tasks pile up,
multiplying DB load at exactly the moment the store is struggling and double-writing
samples for overlapping windows. Second, `GetRawSeriesAsync` has **no server-side row
cap**: the design's `DefaultMaxSeriesPoints` ceiling is applied by `KpiSeriesBucketer`
only *after* the full raw window is materialised into memory — a 7-day window at the
default 60 s cadence is ~10 080 rows per series pulled to the Central UI before
downsampling, defeating the stated intent of the cap. A secondary theme is **bucketer
contract drift** (the "largest-timestamp-wins for unsorted input" doc claim is not what
the code does, and the short-series early-return emits raw capture timestamps where the
downsample path emits bucket-boundary timestamps) — both live in Commons but are core to
this module's query reducer. No Critical findings; one High, two Medium, three Low.

## Checklist coverage

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | `capturedAt`/cut-off captured on the actor thread (correct). `KpiSeriesBucketer` short-series early-return emits raw capture timestamps vs bucket-boundary timestamps on the downsample path (KpiHistory-005). Bucketer "largest-timestamp-wins for unsorted input" doc claim is false — it is last-in-iteration (KpiHistory-006). |
| 2 | Akka.NET conventions | Yes | `PipeTo` + scope-per-tick + off-thread I/O + `IWithTimers` all correct; sender not captured across awaits; messages immutable singletons. But no in-flight guard despite the XML claiming to mirror NotificationOutbox (KpiHistory-001). |
| 3 | Concurrency & thread safety | Yes | Actor state is effectively stateless; `_shutdownCts` lifecycle is correct. The missing in-flight guard allows overlapping sample passes under DB latency (KpiHistory-001). |
| 4 | Error handling & resilience | Yes | Per-source isolation, write/purge guards, and `OperationCanceledException` shutdown handling are all correct and tested. No issues beyond KpiHistory-001's load amplification. |
| 5 | Security | Yes | No injection surface — all queries are parameterised LINQ; `Source`/`Metric`/`Scope`/`ScopeKey` are equality predicates, never interpolated SQL. Scope isolation via `ScopeKey == scopeKey` (incl. `IS NULL` for Global) is correct. No issues found. |
| 6 | Performance & resource management | Yes | `GetRawSeriesAsync` returns the entire window with no server-side cap before bucketing (KpiHistory-002). `RecordSamplesAsync` short-circuits empty batches (good). Purge is set-based `ExecuteDeleteAsync` but is a single unbatched statement (KpiHistory-004). |
| 7 | Design-document adherence | Yes | The recorder XML claims to mirror NotificationOutbox's pattern but omits its in-flight guard (KpiHistory-001). `DefaultMaxSeriesPoints`-as-a-true-cap intent is undermined by KpiHistory-002. Otherwise faithful: singleton, not-readiness-gated, daily purge, EAV schema, indexes. |
| 8 | Code organization & conventions | Yes | Options class owned by the component (correct); validator co-located; `IKpiHistoryRepository`/`IKpiSampleSource`/`KpiSample`/bucketer in Commons, impl in ConfigurationDatabase (correct); singleton Props built in Host. No issues found. |
| 9 | Testing coverage | Yes | Recorder isolation, faulted-tick recovery, and purge cut-off are tested; validator bounds fully covered; bucketer has strong sorted-input coverage. Gaps: no overlapping-tick test, no unsorted-input bucketer test, no short-series timestamp-semantics assertion (KpiHistory-003). |
| 10 | Documentation & comments | Yes | XML is generally excellent. Two drift points: the "mirror NotificationOutbox" claim (KpiHistory-001) and the bucketer unsorted-input claim (KpiHistory-006). `SampleComplete`/`PurgeComplete` no-op handlers are documented. |

## Findings

### KpiHistory-001 — Recorder has no in-flight guard; overlapping sample passes pile up under DB latency

| | |
|--|--|
| Severity | High |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.KpiHistory/KpiHistoryRecorderActor.cs:89-92`, `:103-107`, `:143-159` |

**Description**

The sample timer is a plain Akka periodic timer
(`Timers.StartPeriodicTimer(SampleTimerKey, SampleTick.Instance, …, interval: _options.SampleInterval)`).
`HandleSampleTick` launches `RunSamplePass(...)` off-thread via `PipeTo` and returns
immediately; the piped-back `SampleComplete` is a deliberate no-op
(`Receive<SampleComplete>(_ => { })`). There is **no in-flight guard** — nothing prevents
a second `SampleTick` from starting a second `RunSamplePass` while the first is still
awaiting its DB round-trip.

Akka periodic timers do not coalesce missed ticks — they enqueue. So when a sample pass
takes longer than `SampleInterval` (a slow, contended, or recovering central MS SQL —
exactly the regime where observability matters most), each subsequent tick spawns *another*
concurrent pass. Each pass opens its own DI scope + `DbContext` and issues its own
`AddRange` + `SaveChangesAsync`. The result is a self-amplifying load spiral against the
struggling store, plus duplicate sample rows whose `CapturedAtUtc` values straddle the same
real-time window (the design's "one shared tick timestamp" invariant assumes one pass per
tick).

The actor's own XML-doc states it "mirrors the `NotificationOutboxActor` timer +
scope-per-tick + PipeTo pattern" (lines 37-39) — but `NotificationOutboxActor` holds an
explicit in-flight boolean cleared on `DispatchComplete` precisely to serialise sweeps.
This recorder dropped that half of the pattern. The piped-back `SampleComplete` message is
the natural place the guard would be lowered, and its current empty body is the tell that
the guard was intended but omitted.

**Recommendation**

Add an in-flight guard: set a `_sampleInFlight` flag at the top of `HandleSampleTick`, skip
(and log at debug) the tick if already set, and clear it in the `SampleComplete` handler
(both success and failure projections already route there). This matches the
NotificationOutbox pattern the doc claims to follow and bounds the recorder to at most one
pass in flight at a time. Add a regression test that drives two `SampleTick`s before the first pass completes
(e.g. a repository whose `RecordSamplesAsync` blocks on a gate) and asserts only one pass
ran. The purge tick is daily + idempotent so a guard there is optional, but consider the
same treatment for symmetry.
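
The guard described above can be modelled in a few lines. This is a minimal sketch in Python, not the Akka.NET actor: `RecorderModel` and its method names are hypothetical stand-ins for `HandleSampleTick` and the `SampleComplete` handler, and the single-threaded calls stand in for messages processed one at a time on the actor thread.

```python
class RecorderModel:
    """Toy model of the recorder's tick handling (hypothetical, not the real actor)."""

    def __init__(self):
        self.sample_in_flight = False  # the in-flight guard
        self.passes_started = 0
        self.ticks_skipped = 0

    def on_sample_tick(self):
        """Start a pass unless one is already running; return True if started."""
        if self.sample_in_flight:
            self.ticks_skipped += 1    # the real actor would log this at debug
            return False
        self.sample_in_flight = True   # raise the guard before launching the pass
        self.passes_started += 1
        return True

    def on_sample_complete(self):
        """Lower the guard; must be reached on both success and failure."""
        self.sample_in_flight = False


recorder = RecorderModel()
recorder.on_sample_tick()      # first tick starts a pass
recorder.on_sample_tick()      # second tick arrives mid-pass and is skipped
recorder.on_sample_complete()  # the pass finishes
recorder.on_sample_tick()      # the next tick starts a fresh pass
```

The property that matters is that completion always lowers the guard; if a failure path bypassed `on_sample_complete`, every later tick would be skipped.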

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): added a `_sampleInFlight` guard to the recorder — `HandleSampleTick` skips (debug-logs) if a pass is in flight; the flag is cleared via a `SampleComplete` message piped on BOTH success and failure, so a faulted source can't wedge the guard. Overlapping-tick regression test added.

### KpiHistory-002 — `GetRawSeriesAsync` materialises the entire window with no server-side cap; `DefaultMaxSeriesPoints` is applied only after the fact

| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Interfaces/Repositories/IKpiHistoryRepository.cs:39-41` (contract); `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/KpiHistoryRepository.cs:45-62` (impl); consumed by `src/ZB.MOM.WW.ScadaBridge.CentralUI/Services/KpiHistoryQueryService.cs:77-93` |

**Description**

`GetRawSeriesAsync` has no `limit`/`maxPoints`/`Take` parameter. The implementation issues a
`Where(...).OrderBy(CapturedAtUtc).Select(...).ToListAsync()` that pulls **every** sample in
`[fromUtc, toUtc]` for the series into memory. The `DefaultMaxSeriesPoints` ceiling
defaults to 200, and the `KpiHistoryOptionsValidator` XML doc describes it as the cap that
prevents "a single trend query [from streaming] an arbitrarily large series to the Central
UI". Yet `KpiSeriesBucketer.Bucket` enforces it only **after** the full raw set has already
been fetched across the wire and allocated as a `List<KpiSeriesPoint>` in the query service.

At the default 60 s sample cadence, a 24 h window is ~1 440 rows/series, a 7 d window is
~10 080 rows/series, and a 30 d window ~43 200 rows/series — per chart, with up to four
trend panels on a page each issuing its own query. The cap that the design relies on for
back-pressure provides none at the data tier; it only trims the in-memory result the UI
binds to. The `IX_KpiSample_Series` index makes the *range scan* efficient, but the row
count returned is still unbounded by anything except the window the parent page chooses.
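
The row counts above follow directly from window length divided by sample cadence. A short sketch of that arithmetic, with `rows_per_series` as a hypothetical helper name and the 200-point cap taken from the default quoted above:

```python
from datetime import timedelta


def rows_per_series(window, cadence=timedelta(seconds=60)):
    """Rows one series accumulates over `window` at a fixed sample cadence."""
    return window // cadence  # timedelta // timedelta yields an int


counts = {
    "24 h": rows_per_series(timedelta(hours=24)),
    "7 d": rows_per_series(timedelta(days=7)),
    "30 d": rows_per_series(timedelta(days=30)),
}

# Ratio of rows fetched to points actually charted at the default cap of 200.
overfetch_30d = counts["30 d"] / 200
```

At the default cap, a 30 d window fetches 216 times as many rows as the chart ends up plotting.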

**Recommendation**

Push the downsampling toward the store, or at least bound the fetch. Cheapest correct fix:
add an optional `int? maxRows` to `GetRawSeriesAsync` and have the query service pass a
generous multiple of `effectiveMax` (e.g. `effectiveMax * k`) so the bucketer still has
enough density to pick representative last-values while the DB-side `Take` caps the
transfer. A more thorough fix is a server-side bucketed aggregation (GROUP BY a computed
bucket index, MAX(CapturedAtUtc) per bucket), but that is a larger change the design
explicitly deferred ("v1 ships exactly one aggregation"). At minimum, document the
unbounded-fetch behaviour and the practical window ceiling so an operator does not point a
30 d chart at a busy multi-site deployment.
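
For the server-side option, the query shape is roughly as follows. This sketch runs against an in-memory SQLite table rather than the real MS SQL schema, so it shows the shape only. The table `kpi_sample`, its integer `captured_at` column (epoch seconds), the half-open window, and the helper `bucketed_series` are all assumptions made for illustration. It also assumes timestamps are unique within the series, since the join matches rows on the per-bucket maximum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kpi_sample (captured_at INTEGER NOT NULL, value REAL NOT NULL)")
# 100 samples at a 60 s cadence: t = 0, 60, ..., 5940.
conn.executemany(
    "INSERT INTO kpi_sample VALUES (?, ?)",
    [(i * 60, float(i)) for i in range(100)],
)


def bucketed_series(conn, from_ts, to_ts, max_points):
    """Return at most max_points rows: the last sample of each time bucket."""
    width = max(1, -(-(to_ts - from_ts) // max_points))  # ceiling division
    sql = """
        SELECT s.captured_at, s.value
        FROM kpi_sample AS s
        JOIN (
            SELECT MAX(captured_at) AS last_ts
            FROM kpi_sample
            WHERE captured_at >= :from_ts AND captured_at < :to_ts
            GROUP BY (captured_at - :from_ts) / :width  -- integer bucket index
        ) AS g ON s.captured_at = g.last_ts
        ORDER BY s.captured_at
    """
    params = {"from_ts": from_ts, "to_ts": to_ts, "width": width}
    return conn.execute(sql, params).fetchall()


points = bucketed_series(conn, from_ts=0, to_ts=6000, max_points=10)
```

Only ten rows leave the database here, one per 600 s bucket, instead of the hundred raw samples in the window.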

**Resolution**

Deferred 2026-06-20: bounding the raw window fetch (cheap `Take`-cap vs. server-side bucketed aggregation) is a design decision and the code lives in ConfigurationDatabase/Commons, not this project. Recorded as a query-path enhancement; the practical window ceiling should be documented when addressed.

### KpiHistory-003 — Missing tests: overlapping ticks, unsorted-input bucketing, short-series timestamp semantics

| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.KpiHistory.Tests/KpiHistoryRecorderActorTests.cs`; `tests/ZB.MOM.WW.ScadaBridge.Commons.Tests/Kpi/KpiSeriesBucketerTests.cs` |

**Description**

The recorder and bucketer tests are otherwise strong (per-source isolation, faulted-tick
recovery, purge cut-off, validator bounds, sorted-input bucketing, right-edge handling,
empty-bucket omission, out-of-window filtering). Three behaviour gaps remain, each tied to
a finding here:

1. **No overlapping-tick test** (KpiHistory-001). Every recorder test sends a single tick
   and awaits its effect; nothing exercises a second `SampleTick` arriving while the first
   pass is still in flight, so the missing in-flight guard is invisible to the suite. A
   gated-repository test (block `RecordSamplesAsync`, send two ticks, count passes) would
   pin the intended one-pass-per-tick behaviour.
2. **No unsorted-input bucketer test** (KpiHistory-006). All bucketer tests pass ascending
   input, so the doc's "largest-timestamp-wins for unsorted input" claim is never checked —
   and it is in fact wrong.
3. **No short-series timestamp-semantics assertion** (KpiHistory-005).
   `Bucket_BucketStartUtc_IsSetToBucketStart…` covers only the downsample path; no test
   asserts what `BucketStartUtc` the early-return (`raw.Count <= maxPoints`) path emits, so
   the inconsistency between the two paths is untested.

**Recommendation**

Add the three tests above. The overlapping-tick test belongs in
`KpiHistoryRecorderActorTests` (it is a recorder behaviour); the two bucketer tests belong
in `KpiSeriesBucketerTests`.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): added the overlapping-tick gated-repository test plus unsorted-input and short-series bucketer tests (the latter pin the documented short-series behaviour).

### KpiHistory-004 — Retention purge is a single unbatched `ExecuteDeleteAsync`; a large backlog deletes in one transaction

| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/KpiHistoryRepository.cs:65-71` |

**Description**

`PurgeOlderThanAsync` runs one set-based `Where(s => s.CapturedAtUtc < before).ExecuteDeleteAsync(...)`.
For the steady-state daily cadence this deletes one day of expired rows and is fine. But if
the purge has not run for an extended period (singleton down across a long failover window,
`PurgeInterval` mis-set and later corrected, or `RetentionDays` shortened), a single
unbounded `DELETE` can touch a very large row count in one transaction — lock escalation on
`KpiSample`, a long-running transaction, and transaction-log growth, which on the shared
central MS SQL can affect operational tables. The Audit Log purge path, by contrast, uses a
bounded batched delete for exactly this reason.

This is observability data on a non-partitioned `[PRIMARY]` table, so the blast radius is
bounded and the severity is Low — but the unbatched delete is a latent operational hazard
on the shared store.

**Recommendation**

Loop a bounded delete (`DELETE TOP (N) … WHERE CapturedAtUtc < @before`, or EF Core's
batching) until zero rows are affected, mirroring the Audit Log purge shape. Keep the
returned total for the existing log line.
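
The loop shape is sketched below against an in-memory SQLite table, not the real repository. `purge_older_than`, the `kpi_sample` table, and the batch size of 1000 are illustrative assumptions, and the `rowid IN (... LIMIT ?)` subquery stands in for `DELETE TOP (N)` on MS SQL.

```python
import sqlite3


def purge_older_than(conn, before_ts, batch_size=1000):
    """Delete expired rows in bounded batches; return the total deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM kpi_sample WHERE rowid IN ("
            "SELECT rowid FROM kpi_sample WHERE captured_at < ? LIMIT ?)",
            (before_ts, batch_size),
        )
        conn.commit()  # each batch commits on its own, keeping transactions short
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kpi_sample (captured_at INTEGER NOT NULL, value REAL NOT NULL)")
conn.executemany("INSERT INTO kpi_sample VALUES (?, ?)", [(t, 0.0) for t in range(2510)])
conn.commit()

deleted = purge_older_than(conn, before_ts=2500)
remaining = conn.execute("SELECT COUNT(*) FROM kpi_sample").fetchone()[0]
```

Committing per batch is what bounds lock duration and log growth; the trade-off is that a purge interrupted mid-loop leaves some expired rows for the next sweep, which is harmless because the delete is idempotent.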

**Resolution**

Deferred 2026-06-20: batching the retention purge is a shared-MS-SQL-store tradeoff (vs. the Audit Log's batched-delete precedent); low severity for non-partitioned observability data. Recorded as a future enhancement.

### KpiHistory-005 — `KpiSeriesBucketer` short-series early-return emits raw capture timestamps where the downsample path emits bucket-boundary timestamps

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiSeriesBucketer.cs:57-58` (early return) vs `:93-94` (downsample) |

**Description**

`KpiSeriesBucketer.Bucket` has two return paths that disagree on what `BucketStartUtc`
means. When `raw.Count <= maxPoints` it returns `raw` unchanged — those points carry the
raw `CapturedAtUtc` as their `BucketStartUtc` (the repository builds
`new KpiSeriesPoint(s.CapturedAtUtc, s.Value)`). When `raw.Count > maxPoints` it returns
points whose `BucketStartUtc` is the **bucket boundary**
(`fromUtc + bucketIndex * bucketWidthTicks`), as the dedicated test
`Bucket_BucketStartUtc_IsSetToBucketStartNotRawPointTimestamp` asserts.

So the same series, charted at a density below vs above `maxPoints`, plots its points at
different x-positions: actual capture instants in the sparse case, evenly-spaced bucket
starts in the dense case. The downstream `KpiTrendChart` normalises X across
`[min(BucketStartUtc), max(BucketStartUtc)]`, so the visual impact is minor (the time range
is essentially the same), but the contract is inconsistent and the x-axis "tick spacing"
subtly changes as a window crosses the cap. This is the bucketer that the KpiHistory design
defines as the module's query reducer, so the inconsistency is in-scope even though the file
lives in Commons.
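
A simplified model makes the two timestamp bases concrete. This Python sketch is not the Commons implementation: integer timestamps, the ceiling-division bucket width, and the right-edge clamp are assumptions. It reproduces only the two return paths described above, for ascending input.

```python
def bucket(raw, from_ts, to_ts, max_points):
    """Simplified model of the two return paths (ascending input assumed)."""
    if len(raw) <= max_points:
        return list(raw)  # short-series path: raw capture timestamps pass through
    width = -(-(to_ts - from_ts) // max_points)  # ceiling division
    last = {}
    for ts, value in raw:
        idx = min((ts - from_ts) // width, max_points - 1)
        last[idx] = (from_ts + idx * width, value)  # emitted x is the bucket start
    return [last[i] for i in sorted(last)]


raw = [(7, 1.0), (13, 2.0), (58, 3.0), (94, 4.0)]
sparse = bucket(raw, from_ts=0, to_ts=100, max_points=10)  # under the cap
dense = bucket(raw, from_ts=0, to_ts=100, max_points=2)    # over the cap
```

The same samples plot at x = 7, 13, 58, 94 in the sparse case and at x = 0, 50 in the dense case.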

**Recommendation**

Make the two paths agree. Either document the difference explicitly on `Bucket` (the
short-series path returns raw capture instants; the downsample path returns bucket starts),
or — cleaner — have the short-series path also project onto a consistent timestamp basis if
exact bucket-start semantics are part of the contract. Add the short-series timestamp test
from KpiHistory-003.

**Resolution**

Deferred 2026-06-20: whether the bucketer's short-series early-return and the downsample path must agree on `BucketStartUtc` semantics (vs. documenting the difference) is a contract decision for the design owner. The doc was corrected (KpiHistory-006) to describe current behaviour; the contract change is deferred.

### KpiHistory-006 — `KpiSeriesBucketer` doc claims "largest timestamp within each bucket is selected" for unsorted input; the code selects last-in-iteration

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiSeriesBucketer.cs:20-21` (doc), `:88-97` (code) |

**Description**

The `raw` param doc states: *"If not sorted, the point with the largest timestamp within
each bucket is selected."* The code does not do this. When a point is stored into a bucket,
`best[bucketIndex].BucketStartUtc` is set to the **bucket-start** timestamp
(`fromUtc + bucketIndex * bucketWidthTicks`), not the raw point's timestamp. The
last-value comparison for a subsequent point in the same bucket is then
`point.BucketStartUtc > best[bucketIndex].BucketStartUtc`. That compares the new raw
point's capture time against the *bucket start*, which is true for every in-bucket point
except one captured exactly on the bucket boundary. The effect is that each
later-in-iteration point overwrites the previous one regardless of their relative
timestamps, so the **last point in iteration order** wins, not the point with the largest
timestamp.

For the production caller this is harmless: `KpiHistoryRepository.GetRawSeriesAsync` always
`OrderBy(CapturedAtUtc)`, so iteration order equals time order and last-in-iteration is the
largest timestamp. But the documented contract for unsorted input is simply false, and the
"if ties, keep first encountered — stable" comment (line 89) is also inaccurate: a later
in-bucket point overwrites the stored one when its timestamp is equal to the earlier
point's, not only when it is greater. A future caller that trusts the unsorted-input
guarantee will get wrong results silently.

**Recommendation**

Either (a) fix the comparison to track the selected raw point's actual timestamp (store the
raw `point.BucketStartUtc` alongside the emitted value and compare against that), making the
"largest timestamp wins" claim true for unsorted input; or (b) tighten the doc to state the
method requires ascending-sorted input and selects last-in-iteration otherwise, and drop the
inaccurate "largest timestamp" / "stable ties" language. Pair with the unsorted-input test
from KpiHistory-003.
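
The difference between the described behaviour and option (a) can be shown side by side. Both functions below are hypothetical Python models built from the description in this finding, not the C# code. Timestamps are plain integers and the values are labels so the winner is easy to see.

```python
def bucket_as_described(raw, from_ts, width):
    """Models the reported behaviour: compares against the stored bucket start."""
    best = {}
    for ts, value in raw:
        idx = (ts - from_ts) // width
        start = from_ts + idx * width
        # best[idx][0] holds the bucket start, so this is true for almost any point
        if idx not in best or ts > best[idx][0]:
            best[idx] = (start, value)
    return [best[i] for i in sorted(best)]


def bucket_largest_timestamp(raw, from_ts, width):
    """Models option (a): track the selected raw timestamp and compare against it."""
    best = {}
    for ts, value in raw:
        idx = (ts - from_ts) // width
        if idx not in best or ts > best[idx][0]:
            best[idx] = (ts, value)  # remember the raw capture time
    # Project onto bucket starts only when emitting.
    return [(from_ts + i * width, best[i][1]) for i in sorted(best)]


unsorted_raw = [(50, "late"), (10, "early")]  # both fall in bucket 0
as_described = bucket_as_described(unsorted_raw, from_ts=0, width=100)
fixed = bucket_largest_timestamp(unsorted_raw, from_ts=0, width=100)
```

With the raw timestamp tracked, the strict `>` comparison also keeps the first point on a tie, which is what the "stable" comment originally promised.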

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): corrected the `KpiSeriesBucketer` param XML doc — dropped the false 'largest-timestamp-wins / stable ties' claim; now states it requires ascending-sorted input and selects last-in-iteration otherwise. Doc-only.

## Re-review — 2026-06-24 (commit `1f9de8a2`)

Focused re-review of the changes since the prior review, verifying that the code-review remediation and feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator.

**Changes reviewed:** The diff adds an in-flight guard (`_sampleInFlight` bool field) to `KpiHistoryRecorderActor` so overlapping sample passes can never run concurrently. `HandleSampleTick` now skips (coalesces) a `SampleTick` and logs at debug if a pass is already in flight; otherwise it raises the guard before launching the off-thread `RunSamplePass`. The `SampleComplete` receive handler lowers the guard (it fires on both success and fault paths via the PipeTo projection). XML doc comments were updated accordingly. Tests: a new deterministic regression test (`OverlappingTick_WhileFirstPassInFlight_DoesNotStartSecondPass`) uses a gated repository to prove the second tick is skipped and the guard correctly resets; the pre-existing recovery test was hardened to re-send the tick per poll to avoid racing the asynchronous guard reset.

**Verdict:** The change is sound, minimal, and regression-free. It faithfully mirrors the already-shipped `NotificationOutboxActor` dispatch in-flight-guard pattern (skip-if-busy, raise-before-launch, lower-on-PipeTo-completion). The guard field is read and written only on the actor thread, so there is no thread-safety hazard; it cannot wedge permanently because `RunSamplePass` never throws and both PipeTo success and failure projections emit `SampleComplete`. The skip-on-overlap behaviour is consistent with the design doc, which describes best-effort per-tick sampling with no strict "a sample must land every interval" guarantee. The new behaviour is covered by a deterministic regression test, and all 4 actor tests pass. No new issues found.

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Guard raised before launch, lowered on both success+fault PipeTo paths; cannot wedge (RunSamplePass never throws). 1:1 raise/lower pairing — no double-lower. No issues found. |
| 2 | Akka.NET conventions | ☑ | Off-thread work via PipeTo to Self; guard mutated only on the actor thread in the SampleComplete handler. Matches the documented NotificationOutboxActor pattern. No issues found. |
| 3 | Concurrency & thread safety | ☑ | _sampleInFlight is touched only on the actor thread (HandleSampleTick + SampleComplete receive); no shared mutable state or captured this/sender in the off-thread pass. No issues found. |
| 4 | Error handling & resilience | ☑ | Faulted pass still produces SampleComplete (belt-and-braces failure projection), so the guard always clears; best-effort observability contract preserved. No issues found. |
| 5 | Security | ☑ | No new I/O, no secrets, no user input, no injection surface introduced. Debug-level skip log carries no sensitive data. No issues found. |
| 6 | Performance & resource management | ☑ | The guard's purpose is precisely to avoid load amplification on a slow/recovering DB by preventing overlapping passes. No new allocations or leaks; field is reset on restart. No issues found. |
| 7 | Design-document adherence | ☑ | Component-KpiHistory.md describes best-effort per-tick sampling with no strict per-interval landing guarantee; coalescing overlaps is consistent and the doc already says the recorder mirrors NotificationOutboxActor. No drift. |
| 8 | Code organization & conventions | ☑ | Single new private bool field with thorough XML doc; inline comments accurate. Consistent with the sibling actor's naming/structure. No issues found. |
| 9 | Testing coverage | ☑ | New deterministic gated-repository regression test pins one-pass-per-tick and guard reset; existing recovery test hardened against the async-reset race. All 4 tests pass. Coverage is adequate for the delta. |
| 10 | Documentation & comments | ☑ | Field, handler, and SampleComplete XML docs updated to explain the guard, the enqueue-not-coalesce timer rationale, and the lower-on-fault behaviour. Accurate and clear. No issues found. |

_No new findings — the changes in this module are clean._