ScadaBridge/code-reviews/SiteCallAudit/findings.md
Joseph Doherty d39089f4ed docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked
Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28)
plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381.

67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit
fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix
with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings
(IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to
sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024),
an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001),
and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary
is semantically sound (symbol-based) in the production cluster config.

README regenerated; regen-readme.py --check passes (4 pending / 567 total).
2026-06-20 18:02:32 -04:00


Code Review — SiteCallAudit

| Field | Value |
|---|---|
| Module | src/ZB.MOM.WW.ScadaBridge.SiteCallAudit |
| Design doc | docs/requirements/Component-SiteCallAudit.md |
| Status | Reviewed |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | 4307c381 |
| Open findings | 0 |

Summary

The module is small (one actor + DI extension + options class). The actor is a central cluster singleton that exposes three responsibility groups: direct UpsertSiteCallCommand ingest, paginated/KPI read handlers, and the central→site Retry/Discard relay. Ingest idempotency is delegated to the repository's monotonic upsert (the CD-015 check-then-act window is mitigated by the duplicate-key swallow on the insert leg). Findings cluster around two themes: (a) the SupervisorStrategy override is dead code that contradicts the XML docstring: it governs children, and this actor has none, so the documented "Resume on leaked exception" promise is unenforced; (b) several smaller drifts between the design doc and the code (the reconciliation puller and daily purge schedule are still deferred; OnUpsertAsync does not stamp IngestedAtUtc, unlike the dual-write path). The relay path is well covered by Akka TestKit unit tests; the ingest and KPI paths are covered by MSSQL-backed integration tests using a shared MsSqlMigrationFixture.

Checklist coverage

| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | Yes | OnUpsertAsync does not refresh IngestedAtUtc (Finding 003). |
| 2 | Akka.NET conventions | Yes | SupervisorStrategy() override is dead code (Finding 001). Sender correctly captured before first await on every handler. PipeTo used for read replies. |
| 3 | Concurrency & thread safety | Yes | _centralCommunication mutated only on actor thread via RegisterCentralCommunication. DI scope-per-message disposed in try/finally. No issues found. |
| 4 | Error handling & resilience | Yes | Ingest catches all + replies Accepted=false. Relay distinguishes SiteUnreachable vs OperationFailed. Failover handover does not wait for in-flight async work (Finding 002). |
| 5 | Security | Yes | All SQL is parameterised at the repository (FromSqlInterpolated). Relay carries no user-controlled strings beyond SourceSite. No issues found. |
| 6 | Performance & resource management | Yes | DI scope-per-message correctly disposed. MaxPageSize=200 clamp present. No issues found. |
| 7 | Design-document adherence | Yes | Reconciliation puller and daily terminal-purge scheduler still deferred; design doc reads as if they ship (Finding 004). |
| 8 | Code organization & conventions | Yes | RegisterCentralCommunication is a top-level record colocated with the actor, by design (carries IActorRef, cannot live in Commons). No issues found. |
| 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). |
| 10 | Documentation & comments | Yes | XML docstring claims SupervisorStrategy uses Resume, which is incorrect (Finding 001). AckErrorMessage switch arm for SiteUnreachable falls through instead of throwing (Finding 005). |

Re-review 2026-06-20 (commit 4307c381) — full review

Since the 1eb6e97 baseline the module grew the two pieces that were "still deferred" at the prior pass: the periodic per-site reconciliation puller (Piece A: OnReconciliationTickAsync → ReconcileSiteAsync, in-memory per-site cursor + idempotent monotonic upsert + per-site failure isolation) and the daily terminal-row purge scheduler (Piece B: OnPurgeTickAsync → PurgeTerminalAsync, continue-on-error). All of the new code is clean, idiomatic, and well-tested (dedicated SiteCallAuditReconciliationTests, SiteCallAuditPurgeTests, SiteCallRelayTests, KPI-sample and options suites); the M6 SiteCallAuditKpiSampleSource correctly anchors both cutoffs on a single capture instant and emits Global + per-Site + per-Node snapshots, and the per-node KPI handler is a clean additive peer of the per-site one. The three new findings are all forward-compat / cosmetic: a latent purge-gating gap (the purge timer is armed only when the reconciliation collaborators are present; currently masked because Program.cs co-registers them) plus two doc/edge items (the design doc over-claims six charted trend series when only three are charted, and the reconciliation cursor ignores MoreAvailable so a single-timestamp batch saturation re-pulls without progress). No correctness, concurrency, security, or resilience defects in the new code; the inclusive-cursor + idempotent-upsert pairing keeps every edge case safe-by-construction (wasted work, never corruption).

| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | Yes | Reconciliation cursor ignores MoreAvailable; a single-UpdatedAtUtc batch saturation re-pulls without progress (Finding 009: wasted work only, idempotent upsert prevents corruption). Finding 003 (IngestedAtUtc) resolved: now stamped in OnUpsertAsync and ReconcileSiteAsync. |
| 2 | Akka.NET conventions | Yes | Sender captured before first await on every async handler; PipeTo used for all read/relay replies; self-ticks scheduled via ScheduleTellRepeatedlyCancelable with Self sender, cancelled in PostStop. No issues found. |
| 3 | Concurrency & thread safety | Yes | _reconciliationCursors, _centralCommunication, timers all mutated only on the actor thread. Per-tick/per-message DI scope (CreateAsyncScope for the async tick paths, CreateScope for sync read paths) disposed in finally / await using. No issues found. |
| 4 | Error handling & resilience | Yes | Reconciliation has per-site try/catch isolation; purge is continue-on-error; ingest catches all and replies Accepted=false. Deliberate coarse-retry divergence from SiteAuditReconciliationActor (no per-row abandon) is documented and safe given idempotent upsert. No issues found. |
| 5 | Security | Yes | All SQL parameterised at the repository. Relay carries no user-controlled strings beyond SourceSite (a site id). No issues found. |
| 6 | Performance & resource management | Yes | One DI scope per tick reused across all sites; MaxPageSize=200 clamp; async DbContext disposal off the dispatcher. MoreAvailable-ignoring re-pull is bounded wasted work (Finding 009). |
| 7 | Design-document adherence | Yes | Reconciliation puller + purge scheduler now implemented (prior Finding 004 resolved). Design doc over-claims six charted KPI trend series; only three are charted (Finding 008). |
| 8 | Code organization & conventions | Yes | Purge timer gated on reconciliation collaborators rather than just the repository: a host registering SiteCallAudit without the reconciliation client silently never purges (Finding 007, latent). KpiSampleSource registered via TryAddEnumerable; options owned by the component. |
| 9 | Testing coverage | Yes | Reconciliation/purge/relay/KPI/options all have dedicated suites. No test exercises a saturated reconciliation batch (MoreAvailable: true) or the single-timestamp no-progress edge; every reconciliation test uses MoreAvailable: false (relates to Finding 009). |
| 10 | Documentation & comments | Yes | Prior Findings 001/005 doc fixes still in place. Reconciliation ReconcileSiteAsync XML claims parity with SiteAuditReconciliationActor while diverging on MoreAvailable (the sibling reads it for stalled detection; this actor ignores it); the Finding 009 recommendation includes a doc correction. |

Findings

SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced

Severity Medium
Category Akka.NET conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:32-46, src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:147-151

Description

The XML remarks block (lines 32-46) states:

"The SupervisorStrategy uses Resume so an unexpected throw before the catch (defence in depth) does not restart the actor and reset in-flight state."

The override at lines 147-151 returns a OneForOneStrategy with DefaultDecider and maxNrOfRetries: 0. Two problems compound:

  1. ActorBase.SupervisorStrategy() governs the actor's children, not the actor itself. SiteCallAuditActor creates no children, so this override is dead code.
  2. The returned strategy uses DefaultDecider (Restart for most exceptions), not Directive.Resume. So even if the actor did have children, the strategy would not be Resume — it would be the default Restart-on-most-faults behaviour with maxNrOfRetries: 0 (which forces a Stop after the first failure).

Net effect: the actor's own self-supervision is whatever the parent supplies (SupervisorStrategy.DefaultDecider from the singleton manager / user guardian in tests), which Restarts on most exceptions. If the try/catch in OnUpsertAsync ever leaked (e.g. a synchronous throw constructing replyTo), the actor would Restart, reset _centralCommunication to null, and silently break the relay until RegisterCentralCommunication runs again.

This same pattern (with the same misleading XML doc) exists in AuditLogIngestActor, AuditLogPurgeActor, and SiteAuditReconciliationActor — they were likely cargo-culted; this finding documents the local instance.

Recommendation

Either:

  • Remove the SupervisorStrategy() override entirely (it does nothing useful) and revise the XML comment to drop the "Resume" claim. Self-supervision is the parent's concern (the cluster singleton manager); the try/catch in OnUpsertAsync is what actually keeps the actor alive.
  • Or, if Resume-on-self-throw is actually desired, that requires wiring a custom supervisor in the parent (ClusterSingletonManager) — not overriding SupervisorStrategy() here. Simpler path: keep the try/catch, drop the override.

The CLAUDE.md "Resume for coordinator actors" decision applies to actors with children (Site Runtime hierarchy) — not to leaf cluster singletons.

Resolution (2026-05-28): Rewrote the class-level XML on SiteCallAuditActor plus the method-level XML on SupervisorStrategy() to accurately describe what the override does — a one-for-one strategy with DefaultDecider (Restart on most exceptions, Stop on ActorInitializationException/ActorKilledException) and maxNrOfRetries: 0, governing the actor's children (the actor has none today, so the override is currently inert). Dropped the misleading "Resume" claim. The new docs make clear that self-supervision of this cluster singleton is the parent ClusterSingletonManager's concern and the actor's own resilience comes from the in-handler try/catch in OnUpsertAsync, not from this override. No behaviour change — pure documentation fix; existing 24 SiteCallAudit tests remain green.

SiteCallAudit-002 — Singleton failover does not wait for in-flight async upserts

Severity Low
Category Error handling & resilience
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:455-462 (singleton wiring), src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:153-193

Resolution (2026-05-28): Added a CoordinatedShutdown task in the cluster-leave phase (named drain-site-call-audit-singleton) that issues an explicit GracefulStop(10s) to the SiteCallAudit cluster singleton manager before the cluster-leave proceeds. Akka.NET's singleton handover already waits for the active actor's ReceiveAsync task to complete before signalling HandOverDone, so an in-flight EF UpsertAsync (and its SQL round-trip) drains on the old node before the new singleton starts on the other central node — closing the seam where the new singleton could race a still-running upsert on the old node. The 10-second timeout is bounded so a misbehaving upsert cannot stall coordinated shutdown indefinitely; on timeout the existing PoisonPill termination path takes over and the repository's monotonic-upsert + 2601/2627 duplicate-key swallow remain as the storage-state safety net. Pattern is suitable for the NotificationOutbox singleton too; deferred to keep this change scoped.

Description

The singleton is created with terminationMessage: PoisonPill.Instance. On failover the active node's singleton stops as soon as the mailbox is drained of normal messages and the PoisonPill is processed. An in-flight OnUpsertAsync Task started before the PoisonPill arrived will be allowed to complete (the message-handler runs synchronously from the mailbox's view), but the Akka actor model does NOT cancel the EF Core ExecuteSqlInterpolatedAsync call.

Two consequences:

  1. The new singleton on the other node may begin accepting UpsertSiteCallCommand for the same TrackedOperationId while the old singleton's in-flight upsert is still running. The repository's monotonic-upsert and the SQL duplicate-key swallow protect storage state.
  2. The original replyTo sender may receive its Accepted=true after the new singleton has already returned a different reply. Idempotency keys protect correctness; wire-level ordering is best-effort by design.

This is consistent with the design ("eventually-consistent mirror, sites are source of truth"), but worth documenting as an explicit invariant. The Notification Outbox sibling has the same pattern.

Recommendation

  • Document the failover/handover semantics in the actor's XML remarks: "On cluster singleton handover, in-flight OnUpsertAsync tasks complete on the old node and may produce a late Accepted=true reply; the repository's monotonic upsert ensures storage state is consistent."
  • Add an integration test that deliberately races two concurrent upserts on the same TrackedOperationId to verify the duplicate-key swallow + monotonic rank check (the CD-015 race-pattern check the parent task flagged).
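
The race the second bullet asks for can be sketched in memory without a database. This is only an illustration of the monotonic-rank idea, not the repository's SQL: the `CallStatus` ladder, `SiteCallRow`, `InMemoryMonotonicStore`, and `RaceDemo` are all hypothetical names, and `ConcurrentDictionary.AddOrUpdate` stands in for the insert-if-not-exists plus monotonic UPDATE pair.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical status ladder; the real SiteCall statuses and ranks are not shown in this review.
public enum CallStatus { Buffered = 0, Retrying = 1, Delivered = 2 }

// Simplified stand-in for a SiteCall row.
public sealed record SiteCallRow(string TrackedOperationId, CallStatus Status);

public sealed class InMemoryMonotonicStore
{
    private readonly ConcurrentDictionary<string, SiteCallRow> _rows = new();

    // Insert if absent; otherwise accept the incoming row only when its rank
    // is not lower than the stored one, so a late write cannot regress state.
    public void Upsert(SiteCallRow incoming) =>
        _rows.AddOrUpdate(
            incoming.TrackedOperationId,
            incoming,
            (_, existing) => incoming.Status >= existing.Status ? incoming : existing);

    public SiteCallRow Get(string id) => _rows[id];
}

public static class RaceDemo
{
    // Races an "old node" write against a "new node" write for the same id
    // and returns the status that ends up stored.
    public static CallStatus RaceOnce()
    {
        var store = new InMemoryMonotonicStore();
        Parallel.Invoke(
            () => store.Upsert(new SiteCallRow("op-1", CallStatus.Retrying)),
            () => store.Upsert(new SiteCallRow("op-1", CallStatus.Delivered)));
        return store.Get("op-1").Status;
    }
}
```

Whichever write lands first, the higher-ranked status is the one that remains stored, which is the property the integration test would need to assert against the real repository.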

SiteCallAudit-003 — OnUpsertAsync does not refresh IngestedAtUtc; direct-write callers must remember to stamp it

Severity Medium
Category Correctness & logic bugs
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:153-193

Description

The combined-telemetry hot path (AuditLogIngestActor.OnCachedTelemetryAsync) stamps IngestedAtUtc = DateTime.UtcNow on both the AuditLog row and the SiteCall row at central-side persist time (src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:238-239). The design doc treats IngestedAtUtc as "central ingested (or last refreshed) this row" — a central-side timestamp.

SiteCallAuditActor.OnUpsertAsync writes the supplied SiteCall as-is, with whatever IngestedAtUtc the caller stamped. The only current callers are the unit tests (which use DateTime.UtcNow at command-construction time). Once the deferred reconciliation puller lands and starts emitting UpsertSiteCallCommands, the puller (running on central) is responsible for stamping a central timestamp — but if a future direct-write caller forgets, or constructs from a site DTO, the value could drift (e.g. become a site clock value).

This is currently latent because no production caller exists, but it's inconsistent with the dual-write code path and undocumented.

Recommendation

  • Either: stamp IngestedAtUtc = DateTime.UtcNow inside OnUpsertAsync before calling UpsertAsync (matching AuditLogIngestActor's behaviour), using cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }.
  • Or: document in the UpsertSiteCallCommand XML that callers MUST stamp IngestedAtUtc to a central-side DateTime.UtcNow immediately before sending.

Preferred: stamp inside the actor — same as the combined-telemetry path — because callers cannot in general know the actor is colocated on central.
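
A minimal sketch of the preferred option, assuming SiteCall is a C# record (the `with` expression quoted above implies it). The three-field `SiteCall` shape and the `IngestStamp` helper here are simplified, hypothetical stand-ins; the clock is passed in so the behaviour is testable.

```csharp
using System;

// Simplified, hypothetical stand-in for the real SiteCall record.
public sealed record SiteCall(
    string TrackedOperationId,
    DateTime UpdatedAtUtc,
    DateTime IngestedAtUtc);

public static class IngestStamp
{
    // Returns a copy whose IngestedAtUtc is the central-side clock value,
    // discarding whatever the caller (possibly a site DTO) supplied.
    public static SiteCall StampCentral(SiteCall incoming, DateTime centralNowUtc) =>
        incoming with { IngestedAtUtc = centralNowUtc };
}
```

Because the record copy is non-destructive, the caller's instance is left untouched and every other field carries over unchanged.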

Resolution (2026-05-28): OnUpsertAsync now rewrites the incoming SiteCall via cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow } immediately before calling repository.UpsertAsync, mirroring AuditLogIngestActor's combined-telemetry hot path. The repository writes IngestedAtUtc on both the insert-if-not-exists and the monotonic UPDATE legs (SiteCallAuditRepository.UpsertAsync), so the column is writable on every upsert. Callers (telemetry, the deferred reconciliation puller, any future direct-write) no longer need to remember to stamp a central-side timestamp — the actor owns it. Existing 24 SiteCallAudit tests remain green (the MSSQL-fixture test constructs rows with DateTime.UtcNow and doesn't assert the exact value, so the actor's re-stamp is backward compatible).

SiteCallAudit-004 — Reconciliation puller and daily terminal-purge scheduler still deferred; design-doc drift

Severity Low
Category Design-document adherence
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:23-30 (actor XML), src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/ServiceCollectionExtensions.cs:8-13, docs/requirements/Component-SiteCallAudit.md:24-32

Description

The design doc (Component-SiteCallAudit.md lines 24-32) lists five responsibilities, including:

  • "Run periodic per-site reconciliation pulls so missed telemetry self-heals."
  • "Purge terminal audit rows after a configurable retention window."

The repository exposes PurgeTerminalAsync but nothing in this module schedules a daily call (Notification Outbox owns a MaintenanceService for its equivalent; no SiteCallAuditMaintenanceService exists). The reconciliation puller is acknowledged in the actor XML (only reconciliation remains deferred) but is not surfaced in the design doc as deferred — the doc reads as if it ships.

Recommendation

  • Either: implement the deferred pieces (a hosted service that wakes daily and calls repo.PurgeTerminalAsync(now - retentionWindow), plus a per-site reconciliation puller with a cursor + an IPullCachedTelemetryClient).
  • Or: add a "Status" / "Deferred" subsection to the design doc explicitly listing what's not yet implemented (matches the pattern Audit Log uses for its tamper-evidence hash chain).

Resolution (2026-05-28):

Updated the class-level XML on SiteCallAuditActor to reflect actual state: telemetry ingest, query/detail/KPI handlers (Task 4), and the central→site Retry/Discard relay (Task 5) are implemented; the periodic reconciliation puller and the daily terminal-row purge scheduler remain deferred. The design doc update is tracked separately.

SiteCallAudit-005 — AckErrorMessage switch arm for SiteUnreachable returns ack message instead of throwing

Severity Low
Category Documentation & comments
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:548-563

Description

return outcome switch
{
    SiteCallRelayOutcome.Applied => null,
    SiteCallRelayOutcome.NotParked => "The operation is no longer parked at the site (...)",
    SiteCallRelayOutcome.OperationFailed => ack.ErrorMessage,
    // SiteUnreachable is never produced from a ParkedOperationActionAck —
    // unreachable responses are built by UnreachableRetry/UnreachableDiscard
    // before any ack is classified, so this arm is unreachable by construction.
    SiteCallRelayOutcome.SiteUnreachable => ack.ErrorMessage,
    _ => throw new ArgumentOutOfRangeException(...)
};

The comment correctly states the SiteUnreachable arm is unreachable when called from ClassifyAck. The arm therefore exists only to satisfy exhaustiveness, but instead of throwing or returning a sentinel, it falls through to ack.ErrorMessage — indistinguishable from the OperationFailed arm above. If any future caller does feed SiteUnreachable into AckErrorMessage (e.g. via refactor), the result will be a silent wrong-detail-text bug rather than an immediate crash. The default arm correctly throws ArgumentOutOfRangeException, so the SiteUnreachable arm is the inconsistent one.

Recommendation

Replace the SiteUnreachable => ack.ErrorMessage arm with:

SiteCallRelayOutcome.SiteUnreachable =>
    throw new InvalidOperationException(
        "AckErrorMessage cannot be called for SiteUnreachable — those responses "
        + "are built by UnreachableRetry/UnreachableDiscard before classification."),

— fail fast if the invariant is ever violated by a refactor.

Resolution (2026-05-28):

Behaviour kept (return ack.ErrorMessage); AckErrorMessage stays total and side-effect-free. Expanded the inline comment on the SiteUnreachable arm to explain WHY it returns rather than throws: site-unreachable is classified as transient by the upstream relay (which has already built its SiteUnreachable response and detail text via SiteUnreachableMessage), so a defensive fall-through surfaces the ack's message and lets the caller schedule a retry — throwing would turn a benign refactor invariant violation into a relay-path crash.

SiteCallAudit-006 — Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor

Severity Low
Category Testing coverage
Status Resolved
Location tests/ZB.MOM.WW.ScadaBridge.SiteCallAudit.Tests/SiteCallAuditActorTests.cs:335-392

Resolution (2026-05-28): Added SiteCallQueryRequest_StuckOnly_CursorAtNonStuckBoundary_SkipsToNextStuckRow to tests/ZB.MOM.WW.ScadaBridge.SiteCallAudit.Tests/SiteCallAuditActorTests.cs — drives six rows interleaved as stuck/non-stuck × 3 (oldest-first), then issues three page-size-1 stuck-only queries. The cursor between each page deliberately lands on a non-stuck row, so the SQL composition of the stuck predicate AND the keyset cursor predicate must skip it. Asserts each page returns exactly one stuck row in DESC-by-CreatedAtUtc order with no overlap and all three stuck rows visited. Locks the invariant that post-filtering does not produce under-filled pages with non-null next cursors.

Description

SiteCallQueryRequest_StuckOnly_PagesAreFull_NoEmptyPagesWithCursor covers the case where stuck rows are interleaved with non-stuck rows (page-1 returns 2 stuck rows, page-2 returns the third). It does not cover the edge where the row at the keyset cursor boundary (AfterCreatedAtUtc + AfterId) is itself a non-stuck row — i.e. the cursor points at a row the next page must SKIP through to find more stuck rows. The repository's SQL composes the cursor predicate (CreatedAtUtc < cursor OR (CreatedAtUtc = cursor AND id < ...)) with the stuck predicate, so it should be honest, but the test only asserts row counts and IsStuck, not that the second-page query specifically skipped non-stuck rows between the cursor and the next stuck row.

Lower priority because the SQL composition is straightforward, but adding a direct test would lock the invariant.

Recommendation

Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck, stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a StuckOnly page-size-1 query; (c) asserts each page returns exactly the stuck row, with no overlap and all 3 stuck rows visited.
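
The invariant that test would lock can be illustrated with an in-memory LINQ sketch. It is not the repository's SQL: `Row`, `StuckPaging.Page`, and the `long` id type are assumptions made for the example. The point it shows is that the stuck filter and the keyset cursor are both applied before the page is cut, so a cursor sitting on a non-stuck row is simply skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified row shape; the id type is an assumption.
public sealed record Row(long Id, DateTime CreatedAtUtc, bool IsStuck);

public static class StuckPaging
{
    // Returns one page, newest first. The stuck predicate and the keyset
    // cursor are composed before Take, so pages are never under-filled
    // by post-filtering.
    public static List<Row> Page(
        IEnumerable<Row> rows,
        int pageSize,
        (DateTime CreatedAtUtc, long Id)? after)
    {
        var query = rows.Where(r => r.IsStuck);
        if (after.HasValue)
        {
            var c = after.Value;
            query = query.Where(r =>
                r.CreatedAtUtc < c.CreatedAtUtc ||
                (r.CreatedAtUtc == c.CreatedAtUtc && r.Id < c.Id));
        }
        return query
            .OrderByDescending(r => r.CreatedAtUtc)
            .ThenByDescending(r => r.Id)
            .Take(pageSize)
            .ToList();
    }
}
```

With six rows interleaved stuck / not-stuck (oldest first) and a page size of 1, passing a cursor that points at a non-stuck row still yields the next older stuck row.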

SiteCallAudit-007 — Daily purge timer is armed only when the reconciliation collaborators are present; a host without the reconciliation client silently never purges

Severity Medium
Category Code organization & conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:307-321, src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:202-227

Description

StartPurgeTimer (lines 307-321) gates the daily terminal-row purge tick on the reconciliation collaborators being present:

private void StartPurgeTimer()
{
    if (_pullClient is null || _siteEnumerator is null)
    {
        return;
    }
    // ... schedule PurgeTick ...
}

But the purge pass (OnPurgeTickAsync → PurgeWithRepositoryAsync → ISiteCallAuditRepository.PurgeTerminalAsync) needs only the repository; it has no dependency on IPullSiteCallsClient / ISiteEnumerator. The two collaborators are resolved by the production constructor (lines 202-227) via serviceProvider.GetService<IPullSiteCallsClient>() / GetService<ISiteEnumerator>(), both registered by AddAuditLogCentralReconciliationClient (AuditLog/ServiceCollectionExtensions.cs:473,523, registered as ISiteEnumerator and IPullSiteCallsClient). GetService (not GetRequiredService) returns null if that helper was never called, so a host that registers AddSiteCallAudit() without also calling AddAuditLogCentralReconciliationClient(...) constructs the actor with both collaborators null. In PreStart both StartReconciliationTimer and StartPurgeTimer early-return, so the actor runs forever with no purge timer at all. The result is unbounded growth of the central SiteCalls table, with no log line to say the purge was skipped.

This is currently latent, not live: Program.cs co-registers AddAuditLogCentralReconciliationClient(builder.Configuration) (line 107) immediately before AddSiteCallAudit() (line 113), so on the real central node both collaborators resolve and both timers run today. The risk is a future host (or a refactor that splits the reconciliation client out of the central composition root, or a test/embedded host that wants only ingest + purge) silently losing the purge with no diagnostic. The gate exists for a legitimate reason — keeping the repo-only test ctor free of both background timers so the MSSQL read/upsert tests see no scheduled side effects — but it conflates "no reconciliation route" with "no purge", and the actor's own XML (lines 36-38) documents the coupling as deliberate rather than flagging it as a hazard.

Recommendation

Decouple StartPurgeTimer from the reconciliation collaborators — purge needs only a repository, which both the production and the repo-only test ctors always have. Two viable shapes:

  • Preferred: gate the purge timer on its own real precondition (a repository is always available, so arm it unconditionally in the production + reconciliation ctors; keep it off only in the repo-only test ctor via an explicit "background timers off" flag rather than by proxy of the reconciliation collaborators). This keeps the MSSQL test isolation while removing the accidental coupling.
  • At minimum: log a Warning in StartPurgeTimer (and StartReconciliationTimer) when the timer is not armed on the production path, so a misconfigured host surfaces "SiteCallAudit purge timer not started — PurgeTerminalAsync will never run" instead of growing the table silently.

Severity is a judgment call: Medium because the consequence (unbounded central table growth) is real and silent, but it is latent today (the only production composition root co-registers the reconciliation client), so an argument for Low is defensible. Flagged Medium to weight the silent-failure + data-growth aspect.
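
The preferred shape can be sketched as a pure decision function, separate from any Akka scheduling. Everything here is hypothetical (`TimerPlan`, `TimerGating.Decide`, the parameter names, and the warning text); it only illustrates gating purge on an explicit background-timers flag while reconciliation stays gated on its collaborators.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical result type describing which timers to arm.
public sealed class TimerPlan
{
    public bool ArmReconciliation { get; init; }
    public bool ArmPurge { get; init; }
    public IReadOnlyList<string> Warnings { get; init; } = Array.Empty<string>();
}

public static class TimerGating
{
    // Purge depends only on the explicit flag (a repository is always present);
    // reconciliation additionally needs both collaborators.
    public static TimerPlan Decide(
        bool backgroundTimersEnabled,
        bool hasPullClient,
        bool hasSiteEnumerator)
    {
        if (!backgroundTimersEnabled)
        {
            // Repo-only test construction: no timers, and no warning needed.
            return new TimerPlan();
        }

        var warnings = new List<string>();
        var canReconcile = hasPullClient && hasSiteEnumerator;
        if (!canReconcile)
        {
            warnings.Add("Reconciliation timer not started: collaborators missing.");
        }

        return new TimerPlan
        {
            ArmReconciliation = canReconcile,
            ArmPurge = true,
            Warnings = warnings,
        };
    }
}
```

Keeping the decision pure makes the "purge arms without the reconciliation client" case a one-line unit test, independent of the actor lifecycle.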

Resolution

Resolved 2026-06-20 (commit fd618cf1): decoupled the daily purge timer from the reconciliation collaborators (a new _backgroundTimersEnabled flag) so a central node that omits the reconciliation client still purges — no more silent unbounded SiteCalls growth; a Warning is logged if the reconciliation collaborators are absent. Test added (purge arms without the reconciliation client).

SiteCallAudit-008 — Design doc claims six charted KPI trend series; only three are actually charted

Severity Low
Category Design-document adherence
Status Resolved
Location docs/requirements/Component-SiteCallAudit.md:149-154, src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/Kpi/SiteCallAuditKpiSampleSource.cs:42-47, src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiMetrics.cs:52-62

Description

The design doc (Component-SiteCallAudit.md lines 149-154, the KPI History interaction) states:

"… the resulting buffered / parked / failedLastInterval / deliveredLastInterval / stuck / oldestPendingAgeSeconds series render as trends on the Site Calls page via KpiTrendChart."

That lists six series as charted. In the code only three are charted:

  • The public charted catalog KpiMetrics.SiteCallAudit (KpiMetrics.cs:52-62) exposes exactly Buffered, Parked, and FailedLastInterval — and its own XML says "Charted Site Call Audit (#22) metrics … Rendered by the Central UI Site Calls report trend panel."
  • SiteCallAuditKpiSampleSource (lines 42-47) keys those three off the public Commons catalog (KpiMetrics.SiteCallAudit.*) and keeps the other three — deliveredLastInterval, stuck, oldestPendingAgeSeconds — as private string literals, with the comment "Charted metrics share the public Commons catalog … the uncharted internal metrics stay private here (#178)."
  • The UI confirms it: SiteCalls/SiteCallsReport.razor.cs calls LoadSeriesAsync for exactly KpiMetrics.SiteCallAudit.Buffered, .Parked, and .FailedLastInterval — three series, no more.

So all six metrics are sampled into the KpiSample history store, but only three are charted. The doc reads as if all six are rendered, which is a small design-doc drift (over-claiming the UI surface). The code is internally consistent and correctly commented; the doc is the stale party.

Recommendation

One-line doc edit on Component-SiteCallAudit.md lines 149-154: state that the three series buffered / parked / failedLastInterval render as trends on the Site Calls page via KpiTrendChart, and that deliveredLastInterval / stuck / oldestPendingAgeSeconds are sampled into the KPI-history store but not charted (available for future trend panels / ad-hoc query). No code change — the code is already the intended state.

Resolution

Resolved 2026-06-20 (commit fd618cf1): the design doc now lists the three actually-charted KPI metrics (buffered/parked/failedLastInterval) and marks the other three as sampled-but-not-yet-charted, matching the KpiMetrics catalog. Doc-only.

SiteCallAudit-009 — Reconciliation cursor advances inclusively and ignores MoreAvailable; a single-timestamp batch saturation re-pulls without progress

Severity Low
Category Correctness & logic bugs
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:505-535

Description

ReconcileSiteAsync (lines 505-535) pulls rows since the per-site cursor, upserts each, and advances the cursor to the maximum UpdatedAtUtc observed:

var since = _reconciliationCursors.TryGetValue(site.SiteId, out var c) ? c : DateTime.MinValue;
var response = await client.PullAsync(site.SiteId, since, _options.ReconciliationBatchSize, ...);

var maxUpdated = since;
foreach (var row in response.SiteCalls)
{
    await repository.UpsertAsync(row with { IngestedAtUtc = nowUtc });
    if (row.UpdatedAtUtc > maxUpdated) maxUpdated = row.UpdatedAtUtc;
}
_reconciliationCursors[site.SiteId] = maxUpdated;

Two interacting properties:

  1. The pull is inclusive: PullAsync(site, since, …) asks for rows with UpdatedAtUtc >= since (documented in the method's XML, lines 497-503), and the cursor advances to the max UpdatedAtUtc seen, which is itself one of the rows just pulled. So the boundary row is re-pulled every tick and deduped by the idempotent monotonic upsert, which is intended and harmless.
  2. response.MoreAvailable is never read at all. The PullSiteCallsResponse carries MoreAvailable (PullSiteCallsResponse.cs:14-17: "True when the site saturated the requested batch size — the caller should advance the cursor and pull again"), but ReconcileSiteAsync ignores it and relies on the natural tick cadence to drain the backlog over successive ticks.

The edge case: if a site has more rows than the batch size all sharing one exact UpdatedAtUtc (e.g. a burst written in the same tick / same clock value), the saturated batch returns rows whose max UpdatedAtUtc equals since. maxUpdated therefore stays at since, the cursor does not advance, and because the pull is inclusive the next tick re-pulls the identical window — and again, and again — making no forward progress on that site's backlog. Because the upsert is idempotent and the SiteCalls table is an eventually-consistent mirror (not the source of truth), this is wasted work, never corruption — but it is an unbounded re-pull loop on a pathological input, and any rows in that backlog beyond the batch ceiling never get reconciled.

The sibling SiteAuditReconciliationActor shares the inclusive-cursor / max-timestamp shape, so the same single-timestamp-saturation no-progress edge applies to it — but that sibling does read MoreAvailable (it feeds its stalled-detection state machine, SiteAuditReconciliationActor.cs:324-325, publishing SiteAuditTelemetryStalledChanged so a non-draining site surfaces a health signal). ReconcileSiteAsync's XML (lines 530-534) claims the no-immediate-re-pull behaviour "match[es] SiteAuditReconciliationActor", which is only half true: it matches the cursor cadence but diverges by dropping MoreAvailable entirely, so this actor has neither a continuation pull nor a stalled signal — a saturated site lags silently with no observability.

Recommendation

  • Consume MoreAvailable: either continue draining within the same tick while MoreAvailable is true (bounded by a max-iterations guard), or — matching the sibling — surface a stalled/non-draining signal when a batch comes back saturated so a stuck site is observable rather than silent.
  • Defend the single-timestamp no-progress edge with a tiebreaker beyond the timestamp (e.g. advance on (UpdatedAtUtc, TrackedOperationId) as a composite keyset cursor, and ask the pull for rows strictly after that composite), so a burst sharing one UpdatedAtUtc cannot pin the cursor.
  • Correct the ReconcileSiteAsync XML (lines 530-534): it claims parity with SiteAuditReconciliationActor while diverging on MoreAvailable (the sibling reads it for stalled detection; this actor ignores it).

Severity Low: idempotent upsert + mirror-not-source-of-truth mean no data corruption, and the saturated-single-timestamp input is pathological; the cost is wasted re-pulls and an un-drained backlog tail on that one input, plus the missing observability.
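
The first recommendation, combined with explicit no-progress detection, might look roughly like the loop below. This is a synchronous, in-memory sketch under stated assumptions: `PulledRow`, `PullResult`, `DrainOutcome`, and `ReconcileDrain.Drain` are hypothetical names, the pull is modelled as a delegate rather than the real client, and the cursor is still the inclusive timestamp (the composite-keyset tiebreaker from the second bullet is not shown).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified shapes for a pulled row and a pull response.
public sealed record PulledRow(string TrackedOperationId, DateTime UpdatedAtUtc);
public sealed record PullResult(IReadOnlyList<PulledRow> Rows, bool MoreAvailable);

public enum DrainOutcome { Drained, PageLimitReached, NoProgress }

public static class ReconcileDrain
{
    // Keeps pulling within one tick while the site reports MoreAvailable,
    // bounded by maxPages, and stops if a saturated page fails to move the cursor.
    public static (DateTime Cursor, DrainOutcome Outcome) Drain(
        DateTime since,
        int maxPages,
        Func<DateTime, PullResult> pull,
        Action<PulledRow> upsert)
    {
        var cursor = since;
        for (var page = 0; page < maxPages; page++)
        {
            var result = pull(cursor);
            var next = cursor;
            foreach (var row in result.Rows)
            {
                upsert(row); // idempotent in the real repository
                if (row.UpdatedAtUtc > next) next = row.UpdatedAtUtc;
            }

            var advanced = next > cursor;
            cursor = next;

            if (!result.MoreAvailable) return (cursor, DrainOutcome.Drained);

            // Saturated page sharing one timestamp: re-pulling would return
            // the same window, so report it instead of looping.
            if (!advanced) return (cursor, DrainOutcome.NoProgress);
        }
        return (cursor, DrainOutcome.PageLimitReached);
    }
}
```

A caller could log a Warning on `NoProgress` or `PageLimitReached`, which gives a saturated site some observability even without the sibling's stalled-detection state machine.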

Resolution

Resolved 2026-06-20 (commit fd618cf1): ReconcileSiteAsync now consumes response.MoreAvailable via a within-tick continuation drain bounded by a max-pages guard, with explicit no-progress (single-timestamp-saturation) detection that breaks and logs a Warning instead of re-pulling forever. The XML claim of parity with the sibling reconciler was corrected. Idempotent upsert retained. Tests added.