Files
ScadaBridge/code-reviews/SiteCallAudit/findings.md
T
Joseph Doherty d39089f4ed docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked
Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28)
plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381.

67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit
fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix
with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings
(IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to
sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024),
an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001),
and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary
is semantically sound (symbol-based) in the production cluster config.

README regenerated; regen-readme.py --check passes (4 pending / 567 total).
2026-06-20 18:02:32 -04:00

561 lines
35 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Code Review — SiteCallAudit
| Field | Value |
|-------|-------|
| Module | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit` |
| Design doc | `docs/requirements/Component-SiteCallAudit.md` |
| Status | Reviewed |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `4307c381` |
| Open findings | 0 |
## Summary
The module is small (one actor + DI extension + options class). The actor is a
central cluster singleton that exposes three responsibility groups: direct
`UpsertSiteCallCommand` ingest, paginated/KPI read handlers, and the central→site
Retry/Discard relay. Ingest idempotency is delegated to the repository's
monotonic-upsert (the CD-015 check-then-act window is mitigated by the
duplicate-key swallow on the insert leg). Findings cluster around two themes:
(a) the `SupervisorStrategy` override is dead-code that contradicts the XML
docstring — it governs children, and this actor has none, so the documented
"Resume on leaked exception" promise is unenforced; (b) several smaller drifts
between the design doc and the code (reconciliation puller + daily purge
schedule are still deferred; `OnUpsertAsync` does not stamp `IngestedAtUtc`
unlike the dual-write path). The relay path is well covered by Akka TestKit
unit tests; the ingest + KPI paths are covered by MSSQL-backed integration
tests using a shared `MsSqlMigrationFixture`.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | `OnUpsertAsync` does not refresh `IngestedAtUtc` (Finding 003). |
| 2 | Akka.NET conventions | Yes | `SupervisorStrategy()` override is dead code (Finding 001). `Sender` correctly captured before first await on every handler. `PipeTo` used for read replies. |
| 3 | Concurrency & thread safety | Yes | `_centralCommunication` mutated only on actor thread via `RegisterCentralCommunication`. DI scope-per-message disposed in `try/finally`. No issues found. |
| 4 | Error handling & resilience | Yes | Ingest catches all + replies `Accepted=false`. Relay distinguishes `SiteUnreachable` vs `OperationFailed`. Failover handover does not wait for in-flight async work (Finding 002). |
| 5 | Security | Yes | All SQL is parameterised at the repository (FromSqlInterpolated). Relay carries no user-controlled strings beyond `SourceSite`. No issues found. |
| 6 | Performance & resource management | Yes | DI scope-per-message correctly disposed. `MaxPageSize=200` clamp present. No issues found. |
| 7 | Design-document adherence | Yes | Reconciliation puller and daily terminal-purge scheduler still deferred; design doc reads as if they ship (Finding 004). |
| 8 | Code organization & conventions | Yes | `RegisterCentralCommunication` is a top-level record colocated with the actor — by design (carries `IActorRef`, cannot live in Commons). No issues found. |
| 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). |
| 10 | Documentation & comments | Yes | XML docstring claims `SupervisorStrategy` uses Resume — incorrect (Finding 001). `AckErrorMessage` switch arm for `SiteUnreachable` falls through instead of throwing (Finding 005). |
#### Re-review 2026-06-20 (commit `4307c381`) — full review
Since the `1eb6e97` baseline the module grew the two pieces that were "still deferred" at the prior pass: the periodic per-site reconciliation puller (Piece A — `OnReconciliationTickAsync``ReconcileSiteAsync`, in-memory per-site cursor + idempotent monotonic upsert + per-site failure isolation) and the daily terminal-row purge scheduler (Piece B — `OnPurgeTickAsync``PurgeTerminalAsync`, continue-on-error). All of the new code is clean, idiomatic, and well-tested (dedicated `SiteCallAuditReconciliationTests`, `SiteCallAuditPurgeTests`, `SiteCallRelayTests`, KPI-sample and options suites); the M6 `SiteCallAuditKpiSampleSource` correctly anchors both cutoffs on a single capture instant and emits Global + per-Site + per-Node snapshots, and the per-node KPI handler is a clean additive peer of the per-site one. The three new findings are all forward-compat / cosmetic: a latent purge-gating gap (the purge timer is armed only when the reconciliation collaborators are present — currently masked because `Program.cs` co-registers them) plus two doc/edge items (the design doc over-claims six charted trend series when only three are charted, and the reconciliation cursor ignores `MoreAvailable` so a single-timestamp batch saturation re-pulls without progress). No correctness, concurrency, security, or resilience defects in the new code; the inclusive-cursor + idempotent-upsert pairing keeps every edge case safe-by-construction (wasted work, never corruption).
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | Reconciliation cursor ignores `MoreAvailable`; a single-`UpdatedAtUtc` batch saturation re-pulls without progress (Finding 009 — wasted work only, idempotent upsert prevents corruption). Finding 003 (`IngestedAtUtc`) resolved — now stamped in `OnUpsertAsync` and `ReconcileSiteAsync`. |
| 2 | Akka.NET conventions | Yes | `Sender` captured before first await on every async handler; `PipeTo` used for all read/relay replies; self-ticks scheduled via `ScheduleTellRepeatedlyCancelable` with `Self` sender, cancelled in `PostStop`. No issues found. |
| 3 | Concurrency & thread safety | Yes | `_reconciliationCursors`, `_centralCommunication`, timers all mutated only on the actor thread. Per-tick/per-message DI scope (`CreateAsyncScope` for the async tick paths, `CreateScope` for sync read paths) disposed in `finally` / `await using`. No issues found. |
| 4 | Error handling & resilience | Yes | Reconciliation has per-site try/catch isolation; purge is continue-on-error; ingest catches all and replies `Accepted=false`. Deliberate coarse-retry divergence from `SiteAuditReconciliationActor` (no per-row abandon) is documented and safe given idempotent upsert. No issues found. |
| 5 | Security | Yes | All SQL parameterised at the repository. Relay carries no user-controlled strings beyond `SourceSite` (a site id). No issues found. |
| 6 | Performance & resource management | Yes | One DI scope per tick reused across all sites; `MaxPageSize=200` clamp; async DbContext disposal off the dispatcher. `MoreAvailable`-ignoring re-pull is bounded wasted work (Finding 009). |
| 7 | Design-document adherence | Yes | Reconciliation puller + purge scheduler now implemented (prior Finding 004 resolved). Design doc over-claims six charted KPI trend series; only three are charted (Finding 008). |
| 8 | Code organization & conventions | Yes | Purge timer gated on reconciliation collaborators rather than just the repository — a host registering SiteCallAudit without the reconciliation client silently never purges (Finding 007, latent). `KpiSampleSource` registered via `TryAddEnumerable`; options owned by the component. |
| 9 | Testing coverage | Yes | Reconciliation/purge/relay/KPI/options all have dedicated suites. No test exercises a saturated reconciliation batch (`MoreAvailable: true`) or the single-timestamp no-progress edge — every reconciliation test uses `MoreAvailable: false` (relates to Finding 009). |
| 10 | Documentation & comments | Yes | Prior Findings 001/005 doc fixes still in place. Reconciliation `ReconcileSiteAsync` XML claims parity with `SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling reads it for stalled detection; this actor ignores it) — Finding 009 recommendation includes a doc correction. |
## Findings
### SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced
| | |
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:32-46`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:147-151` |
**Description**
The XML remarks block (lines 32-46) states:
> "The `SupervisorStrategy` uses `Resume` so an unexpected throw before the catch (defence in depth) does not restart the actor and reset in-flight state."
The override at lines 147-151 returns a `OneForOneStrategy` with `DefaultDecider`
and `maxNrOfRetries: 0`. Two problems compound:
1. `ActorBase.SupervisorStrategy()` governs the actor's **children**, not the
actor itself. `SiteCallAuditActor` creates no children, so this override is
dead code.
2. The returned strategy uses `DefaultDecider` (Restart for most exceptions),
**not** `Directive.Resume`. So even if the actor did have children, the
strategy would not be Resume — it would be the default Restart-on-most-faults
behaviour with `maxNrOfRetries: 0` (which forces a Stop after the first
failure).
Net effect: the actor's own self-supervision is whatever the parent supplies
(`SupervisorStrategy.DefaultDecider` from the singleton manager / user
guardian in tests), which Restarts on most exceptions. If the `try/catch` in
`OnUpsertAsync` ever leaked (e.g. a synchronous throw constructing `replyTo`),
the actor would Restart, reset `_centralCommunication` to null, and silently
break the relay until `RegisterCentralCommunication` runs again.
This same pattern (with the same misleading XML doc) exists in
`AuditLogIngestActor`, `AuditLogPurgeActor`, and `SiteAuditReconciliationActor`
— they were likely cargo-culted; this finding documents the local instance.
**Recommendation**
Either:
- Remove the `SupervisorStrategy()` override entirely (it does nothing useful)
and revise the XML comment to drop the "Resume" claim. Self-supervision is
the parent's concern (the cluster singleton manager); the `try/catch` in
`OnUpsertAsync` is what actually keeps the actor alive.
- Or, if Resume-on-self-throw is actually desired, that requires wiring a
custom supervisor in the parent (`ClusterSingletonManager`) — not overriding
`SupervisorStrategy()` here. Simpler path: keep the `try/catch`, drop the
override.
The CLAUDE.md "Resume for coordinator actors" decision applies to actors with
children (Site Runtime hierarchy) — not to leaf cluster singletons.
**Resolution (2026-05-28):** Rewrote the class-level XML on `SiteCallAuditActor` plus the method-level XML on `SupervisorStrategy()` to accurately describe what the override does — a one-for-one strategy with `DefaultDecider` (Restart on most exceptions, Stop on `ActorInitializationException`/`ActorKilledException`) and `maxNrOfRetries: 0`, governing the actor's *children* (the actor has none today, so the override is currently inert). Dropped the misleading "Resume" claim. The new docs make clear that self-supervision of this cluster singleton is the parent `ClusterSingletonManager`'s concern and the actor's own resilience comes from the in-handler `try/catch` in `OnUpsertAsync`, not from this override. No behaviour change — pure documentation fix; existing 24 SiteCallAudit tests remain green.
### SiteCallAudit-002 — Singleton failover does not wait for in-flight async upserts
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:455-462` (singleton wiring), `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:153-193` |
**Resolution (2026-05-28):** Added a `CoordinatedShutdown` task in the `cluster-leave` phase (named `drain-site-call-audit-singleton`) that issues an explicit `GracefulStop(10s)` to the `SiteCallAudit` cluster singleton manager before the cluster-leave proceeds. Akka.NET's singleton handover already waits for the active actor's `ReceiveAsync` task to complete before signalling `HandOverDone`, so an in-flight EF `UpsertAsync` (and its SQL round-trip) drains on the old node before the new singleton starts on the other central node — closing the seam where the new singleton could race a still-running upsert on the old node. The 10-second timeout is bounded so a misbehaving upsert cannot stall coordinated shutdown indefinitely; on timeout the existing `PoisonPill` termination path takes over and the repository's monotonic-upsert + 2601/2627 duplicate-key swallow remain as the storage-state safety net. Pattern is suitable for the `NotificationOutbox` singleton too; deferred to keep this change scoped.
**Description**
The singleton is created with `terminationMessage: PoisonPill.Instance`. On
failover the active node's singleton stops as soon as the mailbox is drained
of normal messages and the PoisonPill is processed. An in-flight
`OnUpsertAsync` Task started before the PoisonPill arrived will be allowed to
complete (the message-handler runs synchronously from the mailbox's view),
but the Akka actor model does NOT cancel the EF Core
`ExecuteSqlInterpolatedAsync` call.
Two consequences:
1. The new singleton on the other node may begin accepting
`UpsertSiteCallCommand` for the same `TrackedOperationId` while the old
singleton's in-flight upsert is still running. The repository's
monotonic-upsert and the SQL duplicate-key swallow protect storage state.
2. The original `replyTo` sender may receive its `Accepted=true` after the new
singleton has already returned a different reply. Idempotency keys protect
correctness; wire-level ordering is best-effort by design.
This is consistent with the design ("eventually-consistent mirror, sites are
source of truth"), but worth documenting as an explicit invariant. The
Notification Outbox sibling has the same pattern.
**Recommendation**
- Document the failover/handover semantics in the actor's XML remarks: "On
cluster singleton handover, in-flight `OnUpsertAsync` tasks complete on the
old node and may produce a late `Accepted=true` reply; the repository's
monotonic upsert ensures storage state is consistent."
- Add an integration test that deliberately races two concurrent upserts on
the same `TrackedOperationId` to verify the duplicate-key swallow +
monotonic rank check (the CD-015 race-pattern check the parent task
flagged).
### SiteCallAudit-003 — `OnUpsertAsync` does not refresh `IngestedAtUtc`; direct-write callers must remember to stamp it
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:153-193` |
**Description**
The combined-telemetry hot path (`AuditLogIngestActor.OnCachedTelemetryAsync`)
stamps `IngestedAtUtc = DateTime.UtcNow` on both the `AuditLog` row and the
`SiteCall` row at central-side persist time
(`src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:238-239`). The design
doc treats `IngestedAtUtc` as "central ingested (or last refreshed) this row"
— a central-side timestamp.
`SiteCallAuditActor.OnUpsertAsync` writes the supplied `SiteCall` as-is, with
whatever `IngestedAtUtc` the caller stamped. The only current callers are the
unit tests (which use `DateTime.UtcNow` at command-construction time). Once
the deferred reconciliation puller lands and starts emitting
`UpsertSiteCallCommand`s, the puller (running on central) is responsible for
stamping a central timestamp — but if a future direct-write caller forgets,
or constructs from a site DTO, the value could drift (e.g. become a site
clock value).
This is currently latent because no production caller exists, but it's
inconsistent with the dual-write code path and undocumented.
**Recommendation**
- Either: stamp `IngestedAtUtc = DateTime.UtcNow` inside `OnUpsertAsync`
before calling `UpsertAsync` (matching `AuditLogIngestActor`'s behaviour),
using `cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }`.
- Or: document in the `UpsertSiteCallCommand` XML that callers MUST stamp
`IngestedAtUtc` to a central-side `DateTime.UtcNow` immediately before
sending.
Preferred: stamp inside the actor — same as the combined-telemetry path —
because callers cannot in general know the actor is colocated on central.
**Resolution (2026-05-28):** `OnUpsertAsync` now rewrites the incoming `SiteCall` via `cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }` immediately before calling `repository.UpsertAsync`, mirroring `AuditLogIngestActor`'s combined-telemetry hot path. The repository writes `IngestedAtUtc` on both the insert-if-not-exists and the monotonic UPDATE legs (`SiteCallAuditRepository.UpsertAsync`), so the column is writable on every upsert. Callers (telemetry, the deferred reconciliation puller, any future direct-write) no longer need to remember to stamp a central-side timestamp — the actor owns it. Existing 24 SiteCallAudit tests remain green (the MSSQL-fixture test constructs rows with `DateTime.UtcNow` and doesn't assert the exact value, so the actor's re-stamp is backward compatible).
### SiteCallAudit-004 — Reconciliation puller and daily terminal-purge scheduler still deferred; design-doc drift
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:23-30` (actor XML), `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/ServiceCollectionExtensions.cs:8-13`, `docs/requirements/Component-SiteCallAudit.md:24-32` |
**Description**
The design doc (`Component-SiteCallAudit.md` lines 24-32) lists five
responsibilities, including:
- "Run periodic per-site reconciliation pulls so missed telemetry self-heals."
- "Purge terminal audit rows after a configurable retention window."
The repository exposes `PurgeTerminalAsync` but nothing in this module
schedules a daily call (Notification Outbox owns a `MaintenanceService` for
its equivalent; no `SiteCallAuditMaintenanceService` exists). The
reconciliation puller is acknowledged in the actor XML
(`only reconciliation remains deferred`) but is not surfaced in the design
doc as deferred — the doc reads as if it ships.
**Recommendation**
- Either: implement the deferred pieces (a hosted service that wakes daily
and calls `repo.PurgeTerminalAsync(now - retentionWindow)`, plus a per-site
reconciliation puller with a cursor + an `IPullCachedTelemetryClient`).
- Or: add a "Status" / "Deferred" subsection to the design doc explicitly
listing what's not yet implemented (matches the pattern Audit Log uses for
its tamper-evidence hash chain).
**Resolution (2026-05-28):**
Updated the class-level XML on `SiteCallAuditActor` to reflect actual state:
telemetry ingest, query/detail/KPI handlers (Task 4), and the central→site
Retry/Discard relay (Task 5) are implemented; the periodic reconciliation
puller and the daily terminal-row purge scheduler remain deferred. The design
doc update is tracked separately.
### SiteCallAudit-005 — `AckErrorMessage` switch arm for `SiteUnreachable` returns ack message instead of throwing
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:548-563` |
**Description**
```csharp
return outcome switch
{
SiteCallRelayOutcome.Applied => null,
SiteCallRelayOutcome.NotParked => "The operation is no longer parked at the site (...)",
SiteCallRelayOutcome.OperationFailed => ack.ErrorMessage,
// SiteUnreachable is never produced from a ParkedOperationActionAck —
// unreachable responses are built by UnreachableRetry/UnreachableDiscard
// before any ack is classified, so this arm is unreachable by construction.
SiteCallRelayOutcome.SiteUnreachable => ack.ErrorMessage,
_ => throw new ArgumentOutOfRangeException(...)
};
```
The comment correctly states the `SiteUnreachable` arm is unreachable when
called from `ClassifyAck`. The arm therefore exists only to satisfy
exhaustiveness, but instead of throwing or returning a sentinel, it falls
through to `ack.ErrorMessage` — indistinguishable from the `OperationFailed`
arm above. If any future caller *does* feed `SiteUnreachable` into
`AckErrorMessage` (e.g. via refactor), the result will be a silent
wrong-detail-text bug rather than an immediate crash. The default arm
correctly throws `ArgumentOutOfRangeException`, so the `SiteUnreachable` arm
is the inconsistent one.
**Recommendation**
Replace the `SiteUnreachable => ack.ErrorMessage` arm with:
```csharp
SiteCallRelayOutcome.SiteUnreachable =>
throw new InvalidOperationException(
"AckErrorMessage cannot be called for SiteUnreachable — those responses "
+ "are built by UnreachableRetry/UnreachableDiscard before classification."),
```
— fail fast if the invariant is ever violated by a refactor.
**Resolution (2026-05-28):**
Behaviour kept (return `ack.ErrorMessage`); `AckErrorMessage` stays total and
side-effect-free. Expanded the inline comment on the `SiteUnreachable` arm to
explain WHY it returns rather than throws: site-unreachable is classified as
transient by the upstream relay (which has already built its
`SiteUnreachable` response and detail text via `SiteUnreachableMessage`), so a
defensive fall-through surfaces the ack's message and lets the caller schedule
a retry — throwing would turn a benign refactor invariant violation into a
relay-path crash.
### SiteCallAudit-006 — Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteCallAudit.Tests/SiteCallAuditActorTests.cs:335-392` |
**Resolution (2026-05-28):** Added `SiteCallQueryRequest_StuckOnly_CursorAtNonStuckBoundary_SkipsToNextStuckRow` to `tests/ZB.MOM.WW.ScadaBridge.SiteCallAudit.Tests/SiteCallAuditActorTests.cs` — drives six rows interleaved as `stuck/non-stuck` × 3 (oldest-first), then issues three page-size-1 stuck-only queries. The cursor between each page deliberately lands on a non-stuck row, so the SQL composition of the stuck predicate AND the keyset cursor predicate must skip it. Asserts each page returns exactly one stuck row in DESC-by-CreatedAtUtc order with no overlap and all three stuck rows visited. Locks the invariant that post-filtering does not produce under-filled pages with non-null next cursors.
**Description**
`SiteCallQueryRequest_StuckOnly_PagesAreFull_NoEmptyPagesWithCursor` covers
the case where stuck rows are interleaved with non-stuck rows (page-1 returns
2 stuck rows, page-2 returns the third). It does not cover the edge where
the row at the keyset cursor boundary (`AfterCreatedAtUtc + AfterId`) is
itself a non-stuck row — i.e. the cursor points at a row the next page must
SKIP through to find more stuck rows. The repository's SQL composes the
cursor predicate (`CreatedAtUtc < cursor OR (CreatedAtUtc = cursor AND id <
...)`) with the stuck predicate, so it should be honest, but the test only
asserts row counts and `IsStuck`, not that the second-page query specifically
skipped non-stuck rows between the cursor and the next stuck row.
Lower priority because the SQL composition is straightforward, but adding a
direct test would lock the invariant.
**Recommendation**
Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck,
stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a `StuckOnly`
page-size-1 query; (c) asserts each page returns exactly the stuck row, with
no overlap and all 3 stuck rows visited.
### SiteCallAudit-007 — Daily purge timer is armed only when the reconciliation collaborators are present; a host without the reconciliation client silently never purges
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:307-321`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:202-227` |
**Description**
`StartPurgeTimer` (lines 307-321) gates the daily terminal-row purge tick on
the **reconciliation** collaborators being present:
```csharp
private void StartPurgeTimer()
{
if (_pullClient is null || _siteEnumerator is null)
{
return;
}
// ... schedule PurgeTick ...
}
```
But the purge pass (`OnPurgeTickAsync``PurgeWithRepositoryAsync`
`ISiteCallAuditRepository.PurgeTerminalAsync`) needs only the repository — it
has no dependency on `IPullSiteCallsClient` / `ISiteEnumerator`. The two
collaborators are resolved by the production constructor (lines 202-227) via
`serviceProvider.GetService<IPullSiteCallsClient>()` /
`GetService<ISiteEnumerator>()` — both registered by
`AddAuditLogCentralReconciliationClient`
(`AuditLog/ServiceCollectionExtensions.cs:473,523`, registered as `ISiteEnumerator`
and `IPullSiteCallsClient`). `GetService` (not `GetRequiredService`) returns
`null` if that helper was never called, so a host that registers
`AddSiteCallAudit()` **without** also calling
`AddAuditLogCentralReconciliationClient(...)` constructs the actor with both
collaborators null. In `PreStart` both `StartReconciliationTimer` and
`StartPurgeTimer` early-return, so the actor runs forever with **no purge timer
at all** → unbounded growth of the central `SiteCalls` table, with no log line
to say the purge was skipped.
This is **currently latent, not live**: `Program.cs` co-registers
`AddAuditLogCentralReconciliationClient(builder.Configuration)` (line 107)
immediately before `AddSiteCallAudit()` (line 113), so on the real central node
both collaborators resolve and both timers run today. The risk is a future host
(or a refactor that splits the reconciliation client out of the central
composition root, or a test/embedded host that wants only ingest + purge)
silently losing the purge with no diagnostic. The gate exists for a legitimate
reason — keeping the repo-only test ctor free of *both* background timers so the
MSSQL read/upsert tests see no scheduled side effects — but it conflates "no
reconciliation route" with "no purge", and the actor's own XML
(lines 36-38) documents the coupling as deliberate rather than flagging it as a
hazard.
**Recommendation**
Decouple `StartPurgeTimer` from the reconciliation collaborators — purge needs
only a repository, which both the production and the repo-only test ctors
always have. Two viable shapes:
- Preferred: gate the purge timer on its own real precondition (a repository is
always available, so arm it unconditionally in the production + reconciliation
ctors; keep it off only in the repo-only test ctor via an explicit
"background timers off" flag rather than by proxy of the reconciliation
collaborators). This keeps the MSSQL test isolation while removing the
accidental coupling.
- At minimum: log a `Warning` in `StartPurgeTimer` (and `StartReconciliationTimer`)
when the timer is **not** armed on the production path, so a misconfigured host
surfaces "SiteCallAudit purge timer not started — `PurgeTerminalAsync` will
never run" instead of growing the table silently.
Severity is a judgment call: Medium because the consequence (unbounded central
table growth) is real and silent, but it is latent today (the only production
composition root co-registers the reconciliation client), so an argument for Low
is defensible. Flagged Medium to weight the silent-failure + data-growth aspect.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): decoupled the daily purge timer from the reconciliation collaborators (a new `_backgroundTimersEnabled` flag) so a central node that omits the reconciliation client still purges — no more silent unbounded `SiteCalls` growth; a Warning is logged if the reconciliation collaborators are absent. Test added (purge arms without the reconciliation client).
### SiteCallAudit-008 — Design doc claims six charted KPI trend series; only three are actually charted
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `docs/requirements/Component-SiteCallAudit.md:149-154`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/Kpi/SiteCallAuditKpiSampleSource.cs:42-47`, `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiMetrics.cs:52-62` |
**Description**
The design doc (`Component-SiteCallAudit.md` lines 149-154, the KPI History
interaction) states:
> "… the resulting `buffered` / `parked` / `failedLastInterval` /
> `deliveredLastInterval` / `stuck` / `oldestPendingAgeSeconds` series render as
> trends on the Site Calls page via `KpiTrendChart`."
That lists **six** series as charted. In the code only **three** are charted:
- The public charted catalog `KpiMetrics.SiteCallAudit`
(`KpiMetrics.cs:52-62`) exposes exactly `Buffered`, `Parked`, and
`FailedLastInterval` — and its own XML says "Charted Site Call Audit (#22)
metrics … Rendered by the Central UI Site Calls report trend panel."
- `SiteCallAuditKpiSampleSource` (lines 42-47) keys those three off the public
Commons catalog (`KpiMetrics.SiteCallAudit.*`) and keeps the other three —
`deliveredLastInterval`, `stuck`, `oldestPendingAgeSeconds` — as **private**
string literals, with the comment "Charted metrics share the public Commons
catalog … the uncharted internal metrics stay private here (#178)."
- The UI confirms it: `SiteCalls/SiteCallsReport.razor.cs` calls
`LoadSeriesAsync` for exactly `KpiMetrics.SiteCallAudit.Buffered`, `.Parked`,
and `.FailedLastInterval` — three series, no more.
So all six metrics are *sampled* into the `KpiSample` history store, but only
three are *charted*. The doc reads as if all six are rendered, which is a small
design-doc drift (over-claiming the UI surface). The code is internally
consistent and correctly commented; the doc is the stale party.
**Recommendation**
One-line doc edit on `Component-SiteCallAudit.md` lines 149-154: state that the
three series `buffered` / `parked` / `failedLastInterval` render as trends on
the Site Calls page via `KpiTrendChart`, and that `deliveredLastInterval` /
`stuck` / `oldestPendingAgeSeconds` are **sampled into the KPI-history store but
not charted** (available for future trend panels / ad-hoc query). No code change
— the code is already the intended state.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): the design doc now lists the three actually-charted KPI metrics (buffered/parked/failedLastInterval) and marks the other three as sampled-but-not-yet-charted, matching the `KpiMetrics` catalog. Doc-only.
### SiteCallAudit-009 — Reconciliation cursor advances inclusively and ignores `MoreAvailable`; a single-timestamp batch saturation re-pulls without progress
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:505-535` |
**Description**
`ReconcileSiteAsync` (lines 505-535) pulls rows since the per-site cursor,
upserts each, and advances the cursor to the maximum `UpdatedAtUtc` observed:
```csharp
var since = _reconciliationCursors.TryGetValue(site.SiteId, out var c) ? c : DateTime.MinValue;
var response = await client.PullAsync(site.SiteId, since, _options.ReconciliationBatchSize, ...);
var maxUpdated = since;
foreach (var row in response.SiteCalls)
{
await repository.UpsertAsync(row with { IngestedAtUtc = nowUtc });
if (row.UpdatedAtUtc > maxUpdated) maxUpdated = row.UpdatedAtUtc;
}
_reconciliationCursors[site.SiteId] = maxUpdated;
```
Two interacting properties:
1. The pull is **inclusive**`PullAsync(site, since, …)` asks for rows with
`UpdatedAtUtc >= since` (documented in the method's XML, lines 497-503), and
the cursor advances to the **max** `UpdatedAtUtc` seen, which is itself one
of the rows just pulled. So the boundary row is re-pulled every tick and
deduped by the idempotent monotonic upsert — intended, harmless.
2. `response.MoreAvailable` is **never read** at all. The `PullSiteCallsResponse`
carries `MoreAvailable` (`PullSiteCallsResponse.cs:14-17`: "True when the
site saturated the requested batch size — the caller should advance the
cursor and pull again"), but `ReconcileSiteAsync` ignores it and relies on
the natural tick cadence to drain the backlog over successive ticks.
The edge case: if a site has **more rows than the batch size all sharing one
exact `UpdatedAtUtc`** (e.g. a burst written in the same tick / same clock
value), the saturated batch returns rows whose max `UpdatedAtUtc` equals
`since`. `maxUpdated` therefore stays at `since`, the cursor does **not**
advance, and because the pull is inclusive the next tick re-pulls the identical
window — and again, and again — making no forward progress on that site's
backlog. Because the upsert is idempotent and the `SiteCalls` table is an
eventually-consistent mirror (not the source of truth), this is **wasted work,
never corruption** — but it is an unbounded re-pull loop on a pathological
input, and any rows in that backlog beyond the batch ceiling never get
reconciled.
The sibling `SiteAuditReconciliationActor` shares the inclusive-cursor /
max-timestamp shape, so the same single-timestamp-saturation no-progress edge
applies to it — but that sibling **does read `MoreAvailable`** (it feeds its
stalled-detection state machine, `SiteAuditReconciliationActor.cs:324-325`,
publishing `SiteAuditTelemetryStalledChanged` so a non-draining site surfaces a
health signal). `ReconcileSiteAsync`'s XML (lines 530-534) claims the
no-immediate-re-pull behaviour "match[es] `SiteAuditReconciliationActor`", which
is only half true: it matches the cursor cadence but diverges by dropping
`MoreAvailable` entirely, so this actor has neither a continuation pull nor a
stalled signal — a saturated site lags silently with no observability.
**Recommendation**
- Consume `MoreAvailable`: either continue draining within the same tick while
`MoreAvailable` is true (bounded by a max-iterations guard), or — matching the
sibling — surface a stalled/non-draining signal when a batch comes back
saturated so a stuck site is observable rather than silent.
- Defend the single-timestamp no-progress edge with a tiebreaker beyond the
timestamp (e.g. advance on `(UpdatedAtUtc, TrackedOperationId)` as a composite
keyset cursor, and ask the pull for rows strictly after that composite), so a
burst sharing one `UpdatedAtUtc` cannot pin the cursor.
- Correct the `ReconcileSiteAsync` XML (lines 530-534): it claims parity with
`SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling
reads it for stalled detection; this actor ignores it).
Severity Low: idempotent upsert + mirror-not-source-of-truth mean no data
corruption, and the saturated-single-timestamp input is pathological; the cost
is wasted re-pulls and an un-drained backlog tail on that one input, plus the
missing observability.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `ReconcileSiteAsync` now consumes `response.MoreAvailable` via a within-tick continuation drain bounded by a max-pages guard, with explicit no-progress (single-timestamp-saturation) detection that breaks and logs a Warning instead of re-pulling forever. The XML claim of parity with the sibling reconciler was corrected. Idempotent upsert retained. Tests added.