Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision. Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites; DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total).
Code Review — SiteCallAudit
| Field | Value |
|---|---|
| Module | src/ZB.MOM.WW.ScadaBridge.SiteCallAudit |
| Design doc | docs/requirements/Component-SiteCallAudit.md |
| Status | Reviewed |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | 4307c381 |
| Open findings | 0 |
Summary
The module is small (one actor + DI extension + options class). The actor is a
central cluster singleton that exposes three responsibility groups: direct
UpsertSiteCallCommand ingest, paginated/KPI read handlers, and the central→site
Retry/Discard relay. Ingest idempotency is delegated to the repository's
monotonic-upsert (the CD-015 check-then-act window is mitigated by the
duplicate-key swallow on the insert leg). Findings cluster around two themes:
(a) the SupervisorStrategy override is dead-code that contradicts the XML
docstring — it governs children, and this actor has none, so the documented
"Resume on leaked exception" promise is unenforced; (b) several smaller drifts
between the design doc and the code (reconciliation puller + daily purge
schedule are still deferred; OnUpsertAsync does not stamp IngestedAtUtc
unlike the dual-write path). The relay path is well covered by Akka TestKit
unit tests; the ingest + KPI paths are covered by MSSQL-backed integration
tests using a shared MsSqlMigrationFixture.
Checklist coverage
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | Yes | OnUpsertAsync does not refresh IngestedAtUtc (Finding 003). |
| 2 | Akka.NET conventions | Yes | SupervisorStrategy() override is dead code (Finding 001). Sender correctly captured before first await on every handler. PipeTo used for read replies. |
| 3 | Concurrency & thread safety | Yes | _centralCommunication mutated only on actor thread via RegisterCentralCommunication. DI scope-per-message disposed in try/finally. No issues found. |
| 4 | Error handling & resilience | Yes | Ingest catches all + replies Accepted=false. Relay distinguishes SiteUnreachable vs OperationFailed. Failover handover does not wait for in-flight async work (Finding 002). |
| 5 | Security | Yes | All SQL is parameterised at the repository (FromSqlInterpolated). Relay carries no user-controlled strings beyond SourceSite. No issues found. |
| 6 | Performance & resource management | Yes | DI scope-per-message correctly disposed. MaxPageSize=200 clamp present. No issues found. |
| 7 | Design-document adherence | Yes | Reconciliation puller and daily terminal-purge scheduler still deferred; design doc reads as if they ship (Finding 004). |
| 8 | Code organization & conventions | Yes | RegisterCentralCommunication is a top-level record colocated with the actor — by design (carries IActorRef, cannot live in Commons). No issues found. |
| 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). |
| 10 | Documentation & comments | Yes | XML docstring claims SupervisorStrategy uses Resume — incorrect (Finding 001). AckErrorMessage switch arm for SiteUnreachable falls through instead of throwing (Finding 005). |
Re-review 2026-06-20 (commit 4307c381) — full review
Since the 1eb6e97 baseline the module grew the two pieces that were "still deferred" at the prior pass: the periodic per-site reconciliation puller (Piece A — OnReconciliationTickAsync → ReconcileSiteAsync, in-memory per-site cursor + idempotent monotonic upsert + per-site failure isolation) and the daily terminal-row purge scheduler (Piece B — OnPurgeTickAsync → PurgeTerminalAsync, continue-on-error). All of the new code is clean, idiomatic, and well-tested (dedicated SiteCallAuditReconciliationTests, SiteCallAuditPurgeTests, SiteCallRelayTests, KPI-sample and options suites); the M6 SiteCallAuditKpiSampleSource correctly anchors both cutoffs on a single capture instant and emits Global + per-Site + per-Node snapshots, and the per-node KPI handler is a clean additive peer of the per-site one. The three new findings are all forward-compat / cosmetic: a latent purge-gating gap (the purge timer is armed only when the reconciliation collaborators are present — currently masked because Program.cs co-registers them) plus two doc/edge items (the design doc over-claims six charted trend series when only three are charted, and the reconciliation cursor ignores MoreAvailable so a single-timestamp batch saturation re-pulls without progress). No correctness, concurrency, security, or resilience defects in the new code; the inclusive-cursor + idempotent-upsert pairing keeps every edge case safe-by-construction (wasted work, never corruption).
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | Yes | Reconciliation cursor ignores MoreAvailable; a single-UpdatedAtUtc batch saturation re-pulls without progress (Finding 009 — wasted work only, idempotent upsert prevents corruption). Finding 003 (IngestedAtUtc) resolved — now stamped in OnUpsertAsync and ReconcileSiteAsync. |
| 2 | Akka.NET conventions | Yes | Sender captured before first await on every async handler; PipeTo used for all read/relay replies; self-ticks scheduled via ScheduleTellRepeatedlyCancelable with Self sender, cancelled in PostStop. No issues found. |
| 3 | Concurrency & thread safety | Yes | _reconciliationCursors, _centralCommunication, timers all mutated only on the actor thread. Per-tick/per-message DI scope (CreateAsyncScope for the async tick paths, CreateScope for sync read paths) disposed in finally / await using. No issues found. |
| 4 | Error handling & resilience | Yes | Reconciliation has per-site try/catch isolation; purge is continue-on-error; ingest catches all and replies Accepted=false. Deliberate coarse-retry divergence from SiteAuditReconciliationActor (no per-row abandon) is documented and safe given idempotent upsert. No issues found. |
| 5 | Security | Yes | All SQL parameterised at the repository. Relay carries no user-controlled strings beyond SourceSite (a site id). No issues found. |
| 6 | Performance & resource management | Yes | One DI scope per tick reused across all sites; MaxPageSize=200 clamp; async DbContext disposal off the dispatcher. MoreAvailable-ignoring re-pull is bounded wasted work (Finding 009). |
| 7 | Design-document adherence | Yes | Reconciliation puller + purge scheduler now implemented (prior Finding 004 resolved). Design doc over-claims six charted KPI trend series; only three are charted (Finding 008). |
| 8 | Code organization & conventions | Yes | Purge timer gated on reconciliation collaborators rather than just the repository — a host registering SiteCallAudit without the reconciliation client silently never purges (Finding 007, latent). KpiSampleSource registered via TryAddEnumerable; options owned by the component. |
| 9 | Testing coverage | Yes | Reconciliation/purge/relay/KPI/options all have dedicated suites. No test exercises a saturated reconciliation batch (MoreAvailable: true) or the single-timestamp no-progress edge — every reconciliation test uses MoreAvailable: false (relates to Finding 009). |
| 10 | Documentation & comments | Yes | Prior Findings 001/005 doc fixes still in place. Reconciliation ReconcileSiteAsync XML claims parity with SiteAuditReconciliationActor while diverging on MoreAvailable (the sibling reads it for stalled detection; this actor ignores it) — Finding 009 recommendation includes a doc correction. |
Findings
SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:32-46, src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:147-151 |
Description
The XML remarks block (lines 32-46) states:
"The SupervisorStrategy uses Resume so an unexpected throw before the catch (defence in depth) does not restart the actor and reset in-flight state."
The override at lines 147-151 returns a OneForOneStrategy with DefaultDecider
and maxNrOfRetries: 0. Two problems compound:
- ActorBase.SupervisorStrategy() governs the actor's children, not the actor itself. SiteCallAuditActor creates no children, so this override is dead code.
- The returned strategy uses DefaultDecider (Restart for most exceptions), not Directive.Resume. So even if the actor did have children, the strategy would not be Resume; it would be the default Restart-on-most-faults behaviour with maxNrOfRetries: 0 (which forces a Stop after the first failure).
Net effect: the actor's own self-supervision is whatever the parent supplies
(SupervisorStrategy.DefaultDecider from the singleton manager / user
guardian in tests), which Restarts on most exceptions. If the try/catch in
OnUpsertAsync ever leaked (e.g. a synchronous throw constructing replyTo),
the actor would Restart, reset _centralCommunication to null, and silently
break the relay until RegisterCentralCommunication runs again.
This same pattern (with the same misleading XML doc) exists in
AuditLogIngestActor, AuditLogPurgeActor, and SiteAuditReconciliationActor
— they were likely cargo-culted; this finding documents the local instance.
Recommendation
Either:
- Remove the SupervisorStrategy() override entirely (it does nothing useful) and revise the XML comment to drop the "Resume" claim. Self-supervision is the parent's concern (the cluster singleton manager); the try/catch in OnUpsertAsync is what actually keeps the actor alive.
- Or, if Resume-on-self-throw is actually desired, that requires wiring a custom supervisor in the parent (ClusterSingletonManager), not overriding SupervisorStrategy() here. Simpler path: keep the try/catch, drop the override.
The CLAUDE.md "Resume for coordinator actors" decision applies to actors with children (Site Runtime hierarchy) — not to leaf cluster singletons.
Resolution (2026-05-28): Rewrote the class-level XML on SiteCallAuditActor plus the method-level XML on SupervisorStrategy() to accurately describe what the override does — a one-for-one strategy with DefaultDecider (Restart on most exceptions, Stop on ActorInitializationException/ActorKilledException) and maxNrOfRetries: 0, governing the actor's children (the actor has none today, so the override is currently inert). Dropped the misleading "Resume" claim. The new docs make clear that self-supervision of this cluster singleton is the parent ClusterSingletonManager's concern and the actor's own resilience comes from the in-handler try/catch in OnUpsertAsync, not from this override. No behaviour change — pure documentation fix; existing 24 SiteCallAudit tests remain green.
SiteCallAudit-002 — Singleton failover does not wait for in-flight async upserts
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:455-462 (singleton wiring), src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:153-193 |
Resolution (2026-05-28): Added a CoordinatedShutdown task in the cluster-leave phase (named drain-site-call-audit-singleton) that issues an explicit GracefulStop(10s) to the SiteCallAudit cluster singleton manager before the cluster-leave proceeds. Akka.NET's singleton handover already waits for the active actor's ReceiveAsync task to complete before signalling HandOverDone, so an in-flight EF UpsertAsync (and its SQL round-trip) drains on the old node before the new singleton starts on the other central node — closing the seam where the new singleton could race a still-running upsert on the old node. The 10-second timeout is bounded so a misbehaving upsert cannot stall coordinated shutdown indefinitely; on timeout the existing PoisonPill termination path takes over and the repository's monotonic-upsert + 2601/2627 duplicate-key swallow remain as the storage-state safety net. Pattern is suitable for the NotificationOutbox singleton too; deferred to keep this change scoped.
Description
The singleton is created with terminationMessage: PoisonPill.Instance. On
failover the active node's singleton stops as soon as the mailbox is drained
of normal messages and the PoisonPill is processed. An in-flight
OnUpsertAsync Task started before the PoisonPill arrived will be allowed to
complete (the message-handler runs synchronously from the mailbox's view),
but the Akka actor model does NOT cancel the EF Core
ExecuteSqlInterpolatedAsync call.
Two consequences:
- The new singleton on the other node may begin accepting UpsertSiteCallCommand for the same TrackedOperationId while the old singleton's in-flight upsert is still running. The repository's monotonic-upsert and the SQL duplicate-key swallow protect storage state.
- The original replyTo sender may receive its Accepted=true after the new singleton has already returned a different reply. Idempotency keys protect correctness; wire-level ordering is best-effort by design.
This is consistent with the design ("eventually-consistent mirror, sites are source of truth"), but worth documenting as an explicit invariant. The Notification Outbox sibling has the same pattern.
Recommendation
- Document the failover/handover semantics in the actor's XML remarks: "On cluster singleton handover, in-flight OnUpsertAsync tasks complete on the old node and may produce a late Accepted=true reply; the repository's monotonic upsert ensures storage state is consistent."
- Add an integration test that deliberately races two concurrent upserts on the same TrackedOperationId to verify the duplicate-key swallow + monotonic rank check (the CD-015 race-pattern check the parent task flagged).
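The storage-state argument above (a late write from the old singleton cannot regress a row) can be modelled in memory. This is a minimal sketch under stated assumptions, not the EF Core repository: `CallStatus`, `SiteCallRow`, and `InMemorySiteCallStore` are hypothetical names, and the three-step status ladder is invented for illustration because the review only says the upsert is rank-monotonic.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical status ladder; the real rank ordering lives in the repository.
public enum CallStatus { Buffered = 0, Parked = 1, Delivered = 2 }

public sealed record SiteCallRow(Guid TrackedOperationId, CallStatus Status);

// In-memory stand-in for a monotonic upsert (illustration only).
public sealed class InMemorySiteCallStore
{
    private readonly object _gate = new();
    private readonly Dictionary<Guid, SiteCallRow> _rows = new();

    // Inserts when the key is absent; otherwise only moves forward in rank.
    // Returns true when the stored row changed.
    public bool Upsert(SiteCallRow incoming)
    {
        lock (_gate)
        {
            if (_rows.TryGetValue(incoming.TrackedOperationId, out var existing)
                && incoming.Status <= existing.Status)
            {
                return false; // stale or duplicate write: ignored
            }

            _rows[incoming.TrackedOperationId] = incoming;
            return true;
        }
    }

    // Returns the stored status, or null when the key is unknown.
    public CallStatus? StatusOf(Guid id)
    {
        lock (_gate)
        {
            return _rows.TryGetValue(id, out var row) ? row.Status : (CallStatus?)null;
        }
    }
}
```

With this shape, the order in which the old and new singleton writes land does not matter: whichever arrives second with an equal or lower rank is dropped. A real race test would still need to run against SQL Server to exercise the duplicate-key swallow, which this model does not cover.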
SiteCallAudit-003 — OnUpsertAsync does not refresh IngestedAtUtc; direct-write callers must remember to stamp it
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:153-193 |
Description
The combined-telemetry hot path (AuditLogIngestActor.OnCachedTelemetryAsync)
stamps IngestedAtUtc = DateTime.UtcNow on both the AuditLog row and the
SiteCall row at central-side persist time
(src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:238-239). The design
doc treats IngestedAtUtc as "central ingested (or last refreshed) this row"
— a central-side timestamp.
SiteCallAuditActor.OnUpsertAsync writes the supplied SiteCall as-is, with
whatever IngestedAtUtc the caller stamped. The only current callers are the
unit tests (which use DateTime.UtcNow at command-construction time). Once
the deferred reconciliation puller lands and starts emitting
UpsertSiteCallCommands, the puller (running on central) is responsible for
stamping a central timestamp — but if a future direct-write caller forgets,
or constructs from a site DTO, the value could drift (e.g. become a site
clock value).
This is currently latent because no production caller exists, but it's inconsistent with the dual-write code path and undocumented.
Recommendation
- Either: stamp IngestedAtUtc = DateTime.UtcNow inside OnUpsertAsync before calling UpsertAsync (matching AuditLogIngestActor's behaviour), using cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }.
- Or: document in the UpsertSiteCallCommand XML that callers MUST stamp IngestedAtUtc to a central-side DateTime.UtcNow immediately before sending.
Preferred: stamp inside the actor — same as the combined-telemetry path — because callers cannot in general know the actor is colocated on central.
Resolution (2026-05-28): OnUpsertAsync now rewrites the incoming SiteCall via cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow } immediately before calling repository.UpsertAsync, mirroring AuditLogIngestActor's combined-telemetry hot path. The repository writes IngestedAtUtc on both the insert-if-not-exists and the monotonic UPDATE legs (SiteCallAuditRepository.UpsertAsync), so the column is writable on every upsert. Callers (telemetry, the deferred reconciliation puller, any future direct-write) no longer need to remember to stamp a central-side timestamp — the actor owns it. Existing 24 SiteCallAudit tests remain green (the MSSQL-fixture test constructs rows with DateTime.UtcNow and doesn't assert the exact value, so the actor's re-stamp is backward compatible).
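The re-stamp described in this resolution is a non-destructive record copy. The sketch below shows the idea in isolation, assuming trimmed-down record shapes (the real SiteCall carries more columns); `IngestStamping.StampCentral` and the injected clock are hypothetical and exist only to make the behaviour testable.

```csharp
using System;

// Hypothetical, trimmed-down shapes; the real types carry more members.
public sealed record SiteCall(Guid TrackedOperationId, DateTime IngestedAtUtc);
public sealed record UpsertSiteCallCommand(SiteCall SiteCall);

public static class IngestStamping
{
    // Returns a copy carrying a central-side timestamp. The caller-supplied
    // IngestedAtUtc is discarded; every other property is preserved.
    // The clock is a parameter here only so the sketch can be tested.
    public static SiteCall StampCentral(UpsertSiteCallCommand cmd, Func<DateTime> utcNow)
        => cmd.SiteCall with { IngestedAtUtc = utcNow() };
}
```

Because `with` produces a new record instance, the command the caller sent is left untouched, so a site-clock value in the original cannot leak into the persisted row.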
SiteCallAudit-004 — Reconciliation puller and daily terminal-purge scheduler still deferred; design-doc drift
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:23-30 (actor XML), src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/ServiceCollectionExtensions.cs:8-13, docs/requirements/Component-SiteCallAudit.md:24-32 |
Description
The design doc (Component-SiteCallAudit.md lines 24-32) lists five
responsibilities, including:
- "Run periodic per-site reconciliation pulls so missed telemetry self-heals."
- "Purge terminal audit rows after a configurable retention window."
The repository exposes PurgeTerminalAsync but nothing in this module
schedules a daily call (Notification Outbox owns a MaintenanceService for
its equivalent; no SiteCallAuditMaintenanceService exists). The
reconciliation puller is acknowledged in the actor XML
(only reconciliation remains deferred) but is not surfaced in the design
doc as deferred — the doc reads as if it ships.
Recommendation
- Either: implement the deferred pieces (a hosted service that wakes daily and calls repo.PurgeTerminalAsync(now - retentionWindow), plus a per-site reconciliation puller with a cursor + an IPullCachedTelemetryClient).
- Or: add a "Status" / "Deferred" subsection to the design doc explicitly listing what's not yet implemented (matches the pattern Audit Log uses for its tamper-evidence hash chain).
Resolution (2026-05-28):
Updated the class-level XML on SiteCallAuditActor to reflect actual state:
telemetry ingest, query/detail/KPI handlers (Task 4), and the central→site
Retry/Discard relay (Task 5) are implemented; the periodic reconciliation
puller and the daily terminal-row purge scheduler remain deferred. The design
doc update is tracked separately.
SiteCallAudit-005 — AckErrorMessage switch arm for SiteUnreachable returns ack message instead of throwing
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:548-563 |
Description
    return outcome switch
    {
        SiteCallRelayOutcome.Applied => null,
        SiteCallRelayOutcome.NotParked => "The operation is no longer parked at the site (...)",
        SiteCallRelayOutcome.OperationFailed => ack.ErrorMessage,
        // SiteUnreachable is never produced from a ParkedOperationActionAck —
        // unreachable responses are built by UnreachableRetry/UnreachableDiscard
        // before any ack is classified, so this arm is unreachable by construction.
        SiteCallRelayOutcome.SiteUnreachable => ack.ErrorMessage,
        _ => throw new ArgumentOutOfRangeException(...)
    };
The comment correctly states the SiteUnreachable arm is unreachable when
called from ClassifyAck. The arm therefore exists only to satisfy
exhaustiveness, but instead of throwing or returning a sentinel, it falls
through to ack.ErrorMessage — indistinguishable from the OperationFailed
arm above. If any future caller does feed SiteUnreachable into
AckErrorMessage (e.g. via refactor), the result will be a silent
wrong-detail-text bug rather than an immediate crash. The default arm
correctly throws ArgumentOutOfRangeException, so the SiteUnreachable arm
is the inconsistent one.
Recommendation
Replace the SiteUnreachable => ack.ErrorMessage arm with:
    SiteCallRelayOutcome.SiteUnreachable =>
        throw new InvalidOperationException(
            "AckErrorMessage cannot be called for SiteUnreachable — those responses "
            + "are built by UnreachableRetry/UnreachableDiscard before classification."),
— fail fast if the invariant is ever violated by a refactor.
Resolution (2026-05-28):
Behaviour kept (return ack.ErrorMessage); AckErrorMessage stays total and
side-effect-free. Expanded the inline comment on the SiteUnreachable arm to
explain WHY it returns rather than throws: site-unreachable is classified as
transient by the upstream relay (which has already built its
SiteUnreachable response and detail text via SiteUnreachableMessage), so a
defensive fall-through surfaces the ack's message and lets the caller schedule
a retry — throwing would turn a benign refactor invariant violation into a
relay-path crash.
SiteCallAudit-006 — Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | tests/ZB.MOM.WW.ScadaBridge.SiteCallAudit.Tests/SiteCallAuditActorTests.cs:335-392 |
Resolution (2026-05-28): Added SiteCallQueryRequest_StuckOnly_CursorAtNonStuckBoundary_SkipsToNextStuckRow to tests/ZB.MOM.WW.ScadaBridge.SiteCallAudit.Tests/SiteCallAuditActorTests.cs — drives six rows interleaved as stuck/non-stuck × 3 (oldest-first), then issues three page-size-1 stuck-only queries. The cursor between each page deliberately lands on a non-stuck row, so the SQL composition of the stuck predicate AND the keyset cursor predicate must skip it. Asserts each page returns exactly one stuck row in DESC-by-CreatedAtUtc order with no overlap and all three stuck rows visited. Locks the invariant that post-filtering does not produce under-filled pages with non-null next cursors.
Description
SiteCallQueryRequest_StuckOnly_PagesAreFull_NoEmptyPagesWithCursor covers
the case where stuck rows are interleaved with non-stuck rows (page-1 returns
2 stuck rows, page-2 returns the third). It does not cover the edge where
the row at the keyset cursor boundary (AfterCreatedAtUtc + AfterId) is
itself a non-stuck row — i.e. the cursor points at a row the next page must
SKIP through to find more stuck rows. The repository's SQL composes the
cursor predicate (CreatedAtUtc < cursor OR (CreatedAtUtc = cursor AND id < ...)) with the stuck predicate, so it should be honest, but the test only
asserts row counts and IsStuck, not that the second-page query specifically
skipped non-stuck rows between the cursor and the next stuck row.
Lower priority because the SQL composition is straightforward, but adding a direct test would lock the invariant.
Recommendation
Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck,
stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a StuckOnly
page-size-1 query; (c) asserts each page returns exactly the stuck row, with
no overlap and all 3 stuck rows visited.
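The invariant this test is meant to lock can be modelled with LINQ over an in-memory list. This is a sketch of the predicate composition only, not the repository's SQL: `AuditRow`, `StuckPaging.Page`, and the `long` id are hypothetical stand-ins. The point is that the stuck filter and the keyset cursor predicate are both applied before the page is cut, so a cursor sitting on a non-stuck row simply skips forward to the next stuck one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape; Id is a long here purely for illustration.
public sealed record AuditRow(long Id, DateTime CreatedAtUtc, bool IsStuck);

public static class StuckPaging
{
    // Newest-first page of stuck rows strictly after the (CreatedAtUtc, Id) cursor.
    // Filtering happens before Take, so a page is only short when rows run out.
    public static List<AuditRow> Page(
        IEnumerable<AuditRow> rows,
        int pageSize,
        DateTime? afterCreatedAtUtc = null,
        long? afterId = null)
    {
        return rows
            .Where(r => r.IsStuck)
            .Where(r => afterCreatedAtUtc is null
                || r.CreatedAtUtc < afterCreatedAtUtc
                || (r.CreatedAtUtc == afterCreatedAtUtc && r.Id < afterId))
            .OrderByDescending(r => r.CreatedAtUtc)
            .ThenByDescending(r => r.Id)
            .Take(pageSize)
            .ToList();
    }
}
```

Feeding it six rows that alternate stuck and non-stuck (oldest first) and passing a non-stuck row's `(CreatedAtUtc, Id)` as the cursor should yield the next older stuck row, with no repeats across pages.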
SiteCallAudit-007 — Daily purge timer is armed only when the reconciliation collaborators are present; a host without the reconciliation client silently never purges
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:307-321, src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:202-227 |
Description
StartPurgeTimer (lines 307-321) gates the daily terminal-row purge tick on
the reconciliation collaborators being present:
    private void StartPurgeTimer()
    {
        if (_pullClient is null || _siteEnumerator is null)
        {
            return;
        }
        // ... schedule PurgeTick ...
    }
But the purge pass (OnPurgeTickAsync → PurgeWithRepositoryAsync →
ISiteCallAuditRepository.PurgeTerminalAsync) needs only the repository — it
has no dependency on IPullSiteCallsClient / ISiteEnumerator. The two
collaborators are resolved by the production constructor (lines 202-227) via
serviceProvider.GetService<IPullSiteCallsClient>() /
GetService<ISiteEnumerator>() — both registered by
AddAuditLogCentralReconciliationClient
(AuditLog/ServiceCollectionExtensions.cs:473,523, registered as ISiteEnumerator
and IPullSiteCallsClient). GetService (not GetRequiredService) returns
null if that helper was never called, so a host that registers
AddSiteCallAudit() without also calling
AddAuditLogCentralReconciliationClient(...) constructs the actor with both
collaborators null. In PreStart both StartReconciliationTimer and
StartPurgeTimer early-return, so the actor runs forever with no purge timer
at all → unbounded growth of the central SiteCalls table, with no log line
to say the purge was skipped.
This is currently latent, not live: Program.cs co-registers
AddAuditLogCentralReconciliationClient(builder.Configuration) (line 107)
immediately before AddSiteCallAudit() (line 113), so on the real central node
both collaborators resolve and both timers run today. The risk is a future host
(or a refactor that splits the reconciliation client out of the central
composition root, or a test/embedded host that wants only ingest + purge)
silently losing the purge with no diagnostic. The gate exists for a legitimate
reason — keeping the repo-only test ctor free of both background timers so the
MSSQL read/upsert tests see no scheduled side effects — but it conflates "no
reconciliation route" with "no purge", and the actor's own XML
(lines 36-38) documents the coupling as deliberate rather than flagging it as a
hazard.
Recommendation
Decouple StartPurgeTimer from the reconciliation collaborators — purge needs
only a repository, which both the production and the repo-only test ctors
always have. Two viable shapes:
- Preferred: gate the purge timer on its own real precondition (a repository is always available, so arm it unconditionally in the production + reconciliation ctors; keep it off only in the repo-only test ctor via an explicit "background timers off" flag rather than by proxy of the reconciliation collaborators). This keeps the MSSQL test isolation while removing the accidental coupling.
- At minimum: log a Warning in StartPurgeTimer (and StartReconciliationTimer) when the timer is not armed on the production path, so a misconfigured host surfaces "SiteCallAudit purge timer not started — PurgeTerminalAsync will never run" instead of growing the table silently.
Severity is a judgment call: Medium because the consequence (unbounded central table growth) is real and silent, but it is latent today (the only production composition root co-registers the reconciliation client), so an argument for Low is defensible. Flagged Medium to weight the silent-failure + data-growth aspect.
Resolution
Resolved 2026-06-20 (commit fd618cf1): decoupled the daily purge timer from the reconciliation collaborators (a new _backgroundTimersEnabled flag) so a central node that omits the reconciliation client still purges — no more silent unbounded SiteCalls growth; a Warning is logged if the reconciliation collaborators are absent. Test added (purge arms without the reconciliation client).
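The decoupled gating can be summarised as a small decision function. This is a sketch of the intended truth table only, assuming the flag behaves as the resolution describes; `TimerGating.Plan` and the warning text are hypothetical, and the real actor schedules Akka timers rather than returning a tuple.

```csharp
using System.Collections.Generic;

public static class TimerGating
{
    // Decides which background ticks to arm. Purge depends only on the explicit
    // flag; reconciliation additionally needs both collaborators to be present.
    public static (bool Purge, bool Reconciliation, IReadOnlyList<string> Warnings) Plan(
        bool backgroundTimersEnabled, bool hasPullClient, bool hasSiteEnumerator)
    {
        var warnings = new List<string>();

        if (!backgroundTimersEnabled)
        {
            // Repo-only test construction: no timers and nothing to warn about.
            return (false, false, warnings);
        }

        var reconciliation = hasPullClient && hasSiteEnumerator;
        if (!reconciliation)
        {
            warnings.Add("Reconciliation collaborators absent; reconciliation timer not started.");
        }

        return (true, reconciliation, warnings);
    }
}
```

The case that matters for this finding is `Plan(true, false, false)`: purge is armed, reconciliation is not, and exactly one warning is produced instead of a silent skip.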
SiteCallAudit-008 — Design doc claims six charted KPI trend series; only three are actually charted
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | docs/requirements/Component-SiteCallAudit.md:149-154, src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/Kpi/SiteCallAuditKpiSampleSource.cs:42-47, src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiMetrics.cs:52-62 |
Description
The design doc (Component-SiteCallAudit.md lines 149-154, the KPI History
interaction) states:
"… the resulting buffered/parked/failedLastInterval/deliveredLastInterval/stuck/oldestPendingAgeSeconds series render as trends on the Site Calls page via KpiTrendChart."
That lists six series as charted. In the code only three are charted:
- The public charted catalog KpiMetrics.SiteCallAudit (KpiMetrics.cs:52-62) exposes exactly Buffered, Parked, and FailedLastInterval, and its own XML says "Charted Site Call Audit (#22) metrics … Rendered by the Central UI Site Calls report trend panel."
- SiteCallAuditKpiSampleSource (lines 42-47) keys those three off the public Commons catalog (KpiMetrics.SiteCallAudit.*) and keeps the other three (deliveredLastInterval, stuck, oldestPendingAgeSeconds) as private string literals, with the comment "Charted metrics share the public Commons catalog … the uncharted internal metrics stay private here (#178)."
- The UI confirms it: SiteCalls/SiteCallsReport.razor.cs calls LoadSeriesAsync for exactly KpiMetrics.SiteCallAudit.Buffered, .Parked, and .FailedLastInterval. Three series, no more.
So all six metrics are sampled into the KpiSample history store, but only
three are charted. The doc reads as if all six are rendered, which is a small
design-doc drift (over-claiming the UI surface). The code is internally
consistent and correctly commented; the doc is the stale party.
Recommendation
One-line doc edit on Component-SiteCallAudit.md lines 149-154: state that the
three series buffered / parked / failedLastInterval render as trends on
the Site Calls page via KpiTrendChart, and that deliveredLastInterval /
stuck / oldestPendingAgeSeconds are sampled into the KPI-history store but
not charted (available for future trend panels / ad-hoc query). No code change
— the code is already the intended state.
Resolution
Resolved 2026-06-20 (commit fd618cf1): the design doc now lists the three actually-charted KPI metrics (buffered/parked/failedLastInterval) and marks the other three as sampled-but-not-yet-charted, matching the KpiMetrics catalog. Doc-only.
SiteCallAudit-009 — Reconciliation cursor advances inclusively and ignores MoreAvailable; a single-timestamp batch saturation re-pulls without progress
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:505-535 |
Description
ReconcileSiteAsync (lines 505-535) pulls rows since the per-site cursor,
upserts each, and advances the cursor to the maximum UpdatedAtUtc observed:
    var since = _reconciliationCursors.TryGetValue(site.SiteId, out var c) ? c : DateTime.MinValue;
    var response = await client.PullAsync(site.SiteId, since, _options.ReconciliationBatchSize, ...);
    var maxUpdated = since;
    foreach (var row in response.SiteCalls)
    {
        await repository.UpsertAsync(row with { IngestedAtUtc = nowUtc });
        if (row.UpdatedAtUtc > maxUpdated) maxUpdated = row.UpdatedAtUtc;
    }
    _reconciliationCursors[site.SiteId] = maxUpdated;
Two interacting properties:
- The pull is inclusive: PullAsync(site, since, …) asks for rows with UpdatedAtUtc >= since (documented in the method's XML, lines 497-503), and the cursor advances to the max UpdatedAtUtc seen, which is itself one of the rows just pulled. So the boundary row is re-pulled every tick and deduped by the idempotent monotonic upsert (intended, harmless).
- response.MoreAvailable is never read at all. The PullSiteCallsResponse carries MoreAvailable (PullSiteCallsResponse.cs:14-17: "True when the site saturated the requested batch size — the caller should advance the cursor and pull again"), but ReconcileSiteAsync ignores it and relies on the natural tick cadence to drain the backlog over successive ticks.
The edge case: if a site has more rows than the batch size all sharing one
exact UpdatedAtUtc (e.g. a burst written in the same tick / same clock
value), the saturated batch returns rows whose max UpdatedAtUtc equals
since. maxUpdated therefore stays at since, the cursor does not
advance, and because the pull is inclusive the next tick re-pulls the identical
window — and again, and again — making no forward progress on that site's
backlog. Because the upsert is idempotent and the SiteCalls table is an
eventually-consistent mirror (not the source of truth), this is wasted work,
never corruption — but it is an unbounded re-pull loop on a pathological
input, and any rows in that backlog beyond the batch ceiling never get
reconciled.
The sibling SiteAuditReconciliationActor shares the inclusive-cursor /
max-timestamp shape, so the same single-timestamp-saturation no-progress edge
applies to it — but that sibling does read MoreAvailable (it feeds its
stalled-detection state machine, SiteAuditReconciliationActor.cs:324-325,
publishing SiteAuditTelemetryStalledChanged so a non-draining site surfaces a
health signal). ReconcileSiteAsync's XML (lines 530-534) claims the
no-immediate-re-pull behaviour "match[es] SiteAuditReconciliationActor", which
is only half true: it matches the cursor cadence but diverges by dropping
MoreAvailable entirely, so this actor has neither a continuation pull nor a
stalled signal — a saturated site lags silently with no observability.
Recommendation
- Consume MoreAvailable: either continue draining within the same tick while MoreAvailable is true (bounded by a max-iterations guard), or, matching the sibling, surface a stalled/non-draining signal when a batch comes back saturated so a stuck site is observable rather than silent.
- Defend the single-timestamp no-progress edge with a tiebreaker beyond the timestamp (e.g. advance on (UpdatedAtUtc, TrackedOperationId) as a composite keyset cursor, and ask the pull for rows strictly after that composite), so a burst sharing one UpdatedAtUtc cannot pin the cursor.
- Correct the ReconcileSiteAsync XML (lines 530-534): it claims parity with SiteAuditReconciliationActor while diverging on MoreAvailable (the sibling reads it for stalled detection; this actor ignores it).
Severity Low: idempotent upsert + mirror-not-source-of-truth mean no data corruption, and the saturated-single-timestamp input is pathological; the cost is wasted re-pulls and an un-drained backlog tail on that one input, plus the missing observability.
Resolution
Resolved 2026-06-20 (commit fd618cf1): ReconcileSiteAsync now consumes response.MoreAvailable via a within-tick continuation drain bounded by a max-pages guard, with explicit no-progress (single-timestamp-saturation) detection that breaks and logs a Warning instead of re-pulling forever. The XML claim of parity with the sibling reconciler was corrected. Idempotent upsert retained. Tests added.
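One possible shape for the bounded continuation drain with no-progress detection is sketched below. It is not the committed implementation: `PulledRow`, `PullResult`, `DrainOutcome`, `ReconciliationDrain`, and the synchronous delegates are hypothetical, and the in-memory site reports MoreAvailable when more rows remain than fit in the batch, which is an assumption about the real flag's semantics.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record PulledRow(string Id, DateTime UpdatedAtUtc);
public sealed record PullResult(IReadOnlyList<PulledRow> Rows, bool MoreAvailable);
public sealed record DrainOutcome(DateTime Cursor, int Pages, bool Stalled);

public static class ReconciliationDrain
{
    // Pulls repeatedly within one tick until the site reports no more rows,
    // the page budget is spent, or a saturated batch fails to move the cursor.
    public static DrainOutcome Drain(
        DateTime since, int batchSize, int maxPages,
        Func<DateTime, int, PullResult> pull, Action<PulledRow> upsert)
    {
        var cursor = since;
        var pages = 0;
        while (pages < maxPages)
        {
            var result = pull(cursor, batchSize);
            pages++;

            var maxUpdated = cursor;
            foreach (var row in result.Rows)
            {
                upsert(row); // idempotent, so re-pulled boundary rows are harmless
                if (row.UpdatedAtUtc > maxUpdated) maxUpdated = row.UpdatedAtUtc;
            }

            if (!result.MoreAvailable) return new DrainOutcome(maxUpdated, pages, false);
            if (maxUpdated == cursor) return new DrainOutcome(cursor, pages, true); // no progress
            cursor = maxUpdated;
        }

        return new DrainOutcome(cursor, pages, false); // budget spent; the next tick resumes
    }

    // Inclusive pull (UpdatedAtUtc >= since) over an in-memory site, oldest first.
    public static Func<DateTime, int, PullResult> InMemorySite(IReadOnlyList<PulledRow> rows) =>
        (since, batchSize) =>
        {
            var window = rows
                .Where(r => r.UpdatedAtUtc >= since)
                .OrderBy(r => r.UpdatedAtUtc)
                .ThenBy(r => r.Id, StringComparer.Ordinal)
                .ToList();
            return new PullResult(window.Take(batchSize).ToList(), window.Count > batchSize);
        };
}
```

A caller would log a Warning when `Stalled` is true. Note that this sketch only detects the single-timestamp saturation; the rows beyond the batch ceiling in that burst still need the composite keyset cursor from the recommendation to be reached.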