# Code Review — SiteCallAudit

| Field | Value |
|-------|-------|
| Module | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit` |
| Design doc | `docs/requirements/Component-SiteCallAudit.md` |
| Status | Reviewed |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `4307c381` |
| Open findings | 0 |

## Summary

The module is small (one actor + DI extension + options class). The actor is a
central cluster singleton that exposes three responsibility groups: direct
`UpsertSiteCallCommand` ingest, paginated/KPI read handlers, and the central→site
Retry/Discard relay. Ingest idempotency is delegated to the repository's
monotonic upsert (the CD-015 check-then-act window is mitigated by the
duplicate-key swallow on the insert leg). Findings cluster around two themes:
(a) the `SupervisorStrategy` override is dead code that contradicts the XML
docstring — it governs children, and this actor has none, so the documented
"Resume on leaked exception" promise is unenforced; (b) several smaller drifts
between the design doc and the code (reconciliation puller + daily purge
schedule are still deferred; `OnUpsertAsync` does not stamp `IngestedAtUtc`,
unlike the dual-write path). The relay path is well covered by Akka TestKit
unit tests; the ingest + KPI paths are covered by MSSQL-backed integration
tests using a shared `MsSqlMigrationFixture`.

## Checklist coverage

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | `OnUpsertAsync` does not refresh `IngestedAtUtc` (Finding 003). |
| 2 | Akka.NET conventions | Yes | `SupervisorStrategy()` override is dead code (Finding 001). `Sender` correctly captured before first await on every handler. `PipeTo` used for read replies. |
| 3 | Concurrency & thread safety | Yes | `_centralCommunication` mutated only on actor thread via `RegisterCentralCommunication`. DI scope-per-message disposed in `try/finally`. No issues found. |
| 4 | Error handling & resilience | Yes | Ingest catches all + replies `Accepted=false`. Relay distinguishes `SiteUnreachable` vs `OperationFailed`. Failover handover does not wait for in-flight async work (Finding 002). |
| 5 | Security | Yes | All SQL is parameterised at the repository (FromSqlInterpolated). Relay carries no user-controlled strings beyond `SourceSite`. No issues found. |
| 6 | Performance & resource management | Yes | DI scope-per-message correctly disposed. `MaxPageSize=200` clamp present. No issues found. |
| 7 | Design-document adherence | Yes | Reconciliation puller and daily terminal-purge scheduler still deferred; design doc reads as if they ship (Finding 004). |
| 8 | Code organization & conventions | Yes | `RegisterCentralCommunication` is a top-level record colocated with the actor — by design (carries `IActorRef`, cannot live in Commons). No issues found. |
| 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). |
| 10 | Documentation & comments | Yes | XML docstring claims `SupervisorStrategy` uses Resume — incorrect (Finding 001). `AckErrorMessage` switch arm for `SiteUnreachable` falls through instead of throwing (Finding 005). |

#### Re-review 2026-06-20 (commit `4307c381`) — full review

Since the `1eb6e97` baseline the module grew the two pieces that were "still deferred" at the prior pass: the periodic per-site reconciliation puller (Piece A — `OnReconciliationTickAsync` → `ReconcileSiteAsync`, in-memory per-site cursor + idempotent monotonic upsert + per-site failure isolation) and the daily terminal-row purge scheduler (Piece B — `OnPurgeTickAsync` → `PurgeTerminalAsync`, continue-on-error).

All of the new code is clean, idiomatic, and well-tested (dedicated `SiteCallAuditReconciliationTests`, `SiteCallAuditPurgeTests`, `SiteCallRelayTests`, KPI-sample and options suites); the M6 `SiteCallAuditKpiSampleSource` correctly anchors both cutoffs on a single capture instant and emits Global + per-Site + per-Node snapshots, and the per-node KPI handler is a clean additive peer of the per-site one.

The three new findings are all forward-compat / cosmetic: a latent purge-gating gap (the purge timer is armed only when the reconciliation collaborators are present — currently masked because `Program.cs` co-registers them) plus two doc/edge items (the design doc over-claims six charted trend series when only three are charted, and the reconciliation cursor ignores `MoreAvailable` so a single-timestamp batch saturation re-pulls without progress). No correctness, concurrency, security, or resilience defects in the new code; the inclusive-cursor + idempotent-upsert pairing keeps every edge case safe-by-construction (wasted work, never corruption).

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | Reconciliation cursor ignores `MoreAvailable`; a single-`UpdatedAtUtc` batch saturation re-pulls without progress (Finding 009 — wasted work only, idempotent upsert prevents corruption). Finding 003 (`IngestedAtUtc`) resolved — now stamped in `OnUpsertAsync` and `ReconcileSiteAsync`. |
| 2 | Akka.NET conventions | Yes | `Sender` captured before first await on every async handler; `PipeTo` used for all read/relay replies; self-ticks scheduled via `ScheduleTellRepeatedlyCancelable` with `Self` sender, cancelled in `PostStop`. No issues found. |
| 3 | Concurrency & thread safety | Yes | `_reconciliationCursors`, `_centralCommunication`, timers all mutated only on the actor thread. Per-tick/per-message DI scope (`CreateAsyncScope` for the async tick paths, `CreateScope` for sync read paths) disposed in `finally` / `await using`. No issues found. |
| 4 | Error handling & resilience | Yes | Reconciliation has per-site try/catch isolation; purge is continue-on-error; ingest catches all and replies `Accepted=false`. Deliberate coarse-retry divergence from `SiteAuditReconciliationActor` (no per-row abandon) is documented and safe given idempotent upsert. No issues found. |
| 5 | Security | Yes | All SQL parameterised at the repository. Relay carries no user-controlled strings beyond `SourceSite` (a site id). No issues found. |
| 6 | Performance & resource management | Yes | One DI scope per tick reused across all sites; `MaxPageSize=200` clamp; async DbContext disposal off the dispatcher. `MoreAvailable`-ignoring re-pull is bounded wasted work (Finding 009). |
| 7 | Design-document adherence | Yes | Reconciliation puller + purge scheduler now implemented (prior Finding 004 resolved). Design doc over-claims six charted KPI trend series; only three are charted (Finding 008). |
| 8 | Code organization & conventions | Yes | Purge timer gated on reconciliation collaborators rather than just the repository — a host registering SiteCallAudit without the reconciliation client silently never purges (Finding 007, latent). `KpiSampleSource` registered via `TryAddEnumerable`; options owned by the component. |
| 9 | Testing coverage | Yes | Reconciliation/purge/relay/KPI/options all have dedicated suites. No test exercises a saturated reconciliation batch (`MoreAvailable: true`) or the single-timestamp no-progress edge — every reconciliation test uses `MoreAvailable: false` (relates to Finding 009). |
| 10 | Documentation & comments | Yes | Prior Findings 001/005 doc fixes still in place. Reconciliation `ReconcileSiteAsync` XML claims parity with `SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling reads it for stalled detection; this actor ignores it) — Finding 009 recommendation includes a doc correction. |

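The single-timestamp saturation edge noted in the re-review (Finding 009) is easier to see in a toy model. The sketch below is not the actor's code: `Row`, `Pull`, `Run`, and the batch size are hypothetical stand-ins, and it assumes the site returns rows with `UpdatedAtUtc >= cursor`, oldest first, capped at the batch size, after which the cursor moves to the newest timestamp in the batch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a pulled site-call row.
public sealed record Row(Guid Id, DateTime UpdatedAtUtc);

public static class InclusiveCursorDemo
{
    // Assumed site behaviour: rows at or after the cursor, oldest first, capped at batchSize.
    public static (List<Row> Batch, bool MoreAvailable) Pull(
        IReadOnlyList<Row> siteRows, DateTime cursor, int batchSize)
    {
        var matching = siteRows
            .Where(r => r.UpdatedAtUtc >= cursor)
            .OrderBy(r => r.UpdatedAtUtc)
            .ThenBy(r => r.Id)
            .ToList();
        return (matching.Take(batchSize).ToList(), matching.Count > batchSize);
    }

    // Runs `ticks` reconciliation passes and returns the distinct row ids seen.
    // The cursor advances to the newest timestamp in each batch; MoreAvailable is ignored.
    public static HashSet<Guid> Run(IReadOnlyList<Row> siteRows, int batchSize, int ticks)
    {
        var cursor = DateTime.MinValue;
        var seen = new HashSet<Guid>();
        for (var i = 0; i < ticks; i++)
        {
            var (batch, _) = Pull(siteRows, cursor, batchSize);
            foreach (var row in batch) seen.Add(row.Id); // stand-in for the idempotent upsert
            if (batch.Count > 0) cursor = batch.Max(r => r.UpdatedAtUtc);
        }
        return seen;
    }
}
```

In this model, five rows sharing one timestamp with a batch size of three yield the same three rows on every tick: the repeated upserts are harmless, but the cursor never moves. Rows with distinct timestamps are all reached because the inclusive cursor only re-reads the boundary row.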
## Findings

### SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced

| | |
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:32-46`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:147-151` |

**Description**

The XML remarks block (lines 32-46) states:

> "The `SupervisorStrategy` uses `Resume` so an unexpected throw before the catch (defence in depth) does not restart the actor and reset in-flight state."

The override at lines 147-151 returns a `OneForOneStrategy` with `DefaultDecider`
and `maxNrOfRetries: 0`. Two problems compound:

1. `ActorBase.SupervisorStrategy()` governs the actor's **children**, not the
   actor itself. `SiteCallAuditActor` creates no children, so this override is
   dead code.
2. The returned strategy uses `DefaultDecider` (Restart for most exceptions),
   **not** `Directive.Resume`. So even if the actor did have children, the
   strategy would not be Resume — it would be the default Restart-on-most-faults
   behaviour with `maxNrOfRetries: 0` (which forces a Stop after the first
   failure).

Net effect: the actor's own self-supervision is whatever the parent supplies
(`SupervisorStrategy.DefaultDecider` from the singleton manager / user
guardian in tests), which Restarts on most exceptions. If the `try/catch` in
`OnUpsertAsync` ever leaked (e.g. a synchronous throw constructing `replyTo`),
the actor would Restart, reset `_centralCommunication` to null, and silently
break the relay until `RegisterCentralCommunication` runs again.

This same pattern (with the same misleading XML doc) exists in
`AuditLogIngestActor`, `AuditLogPurgeActor`, and `SiteAuditReconciliationActor`
— they were likely cargo-culted; this finding documents the local instance.

**Recommendation**

Either:

- Remove the `SupervisorStrategy()` override entirely (it does nothing useful)
  and revise the XML comment to drop the "Resume" claim. Self-supervision is
  the parent's concern (the cluster singleton manager); the `try/catch` in
  `OnUpsertAsync` is what actually keeps the actor alive.
- Or, if Resume-on-self-throw is actually desired, that requires wiring a
  custom supervisor in the parent (`ClusterSingletonManager`) — not overriding
  `SupervisorStrategy()` here. Simpler path: keep the `try/catch`, drop the
  override.

The CLAUDE.md "Resume for coordinator actors" decision applies to actors with
children (Site Runtime hierarchy) — not to leaf cluster singletons.

**Resolution (2026-05-28):** Rewrote the class-level XML on `SiteCallAuditActor` plus the method-level XML on `SupervisorStrategy()` to accurately describe what the override does — a one-for-one strategy with `DefaultDecider` (Restart on most exceptions, Stop on `ActorInitializationException`/`ActorKilledException`) and `maxNrOfRetries: 0`, governing the actor's *children* (the actor has none today, so the override is currently inert). Dropped the misleading "Resume" claim. The new docs make clear that self-supervision of this cluster singleton is the parent `ClusterSingletonManager`'s concern and the actor's own resilience comes from the in-handler `try/catch` in `OnUpsertAsync`, not from this override. No behaviour change — pure documentation fix; existing 24 SiteCallAudit tests remain green.

### SiteCallAudit-002 — Singleton failover does not wait for in-flight async upserts

| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:455-462` (singleton wiring), `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:153-193` |

**Resolution (2026-05-28):** Added a `CoordinatedShutdown` task in the `cluster-leave` phase (named `drain-site-call-audit-singleton`) that issues an explicit `GracefulStop(10s)` to the `SiteCallAudit` cluster singleton manager before the cluster-leave proceeds. Akka.NET's singleton handover already waits for the active actor's `ReceiveAsync` task to complete before signalling `HandOverDone`, so an in-flight EF `UpsertAsync` (and its SQL round-trip) drains on the old node before the new singleton starts on the other central node — closing the seam where the new singleton could race a still-running upsert on the old node. The 10-second timeout is bounded so a misbehaving upsert cannot stall coordinated shutdown indefinitely; on timeout the existing `PoisonPill` termination path takes over and the repository's monotonic-upsert + 2601/2627 duplicate-key swallow remain as the storage-state safety net. Pattern is suitable for the `NotificationOutbox` singleton too; deferred to keep this change scoped.

**Description**

The singleton is created with `terminationMessage: PoisonPill.Instance`. On
failover the active node's singleton stops as soon as the mailbox is drained
of normal messages and the PoisonPill is processed. An in-flight
`OnUpsertAsync` Task started before the PoisonPill arrived will be allowed to
complete (the message-handler runs synchronously from the mailbox's view),
but the Akka actor model does NOT cancel the EF Core
`ExecuteSqlInterpolatedAsync` call.

Two consequences:

1. The new singleton on the other node may begin accepting
   `UpsertSiteCallCommand` for the same `TrackedOperationId` while the old
   singleton's in-flight upsert is still running. The repository's
   monotonic-upsert and the SQL duplicate-key swallow protect storage state.
2. The original `replyTo` sender may receive its `Accepted=true` after the new
   singleton has already returned a different reply. Idempotency keys protect
   correctness; wire-level ordering is best-effort by design.

This is consistent with the design ("eventually-consistent mirror, sites are
source of truth"), but worth documenting as an explicit invariant. The
Notification Outbox sibling has the same pattern.

**Recommendation**

- Document the failover/handover semantics in the actor's XML remarks: "On
  cluster singleton handover, in-flight `OnUpsertAsync` tasks complete on the
  old node and may produce a late `Accepted=true` reply; the repository's
  monotonic upsert ensures storage state is consistent."
- Add an integration test that deliberately races two concurrent upserts on
  the same `TrackedOperationId` to verify the duplicate-key swallow +
  monotonic rank check (the CD-015 race-pattern check the parent task
  flagged).

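The invariant that the recommended race test would lock can be modelled without a database. This is only a sketch: `MonotonicStore` and `UpsertRaceDemo` are hypothetical in-memory stand-ins for the repository's monotonic upsert, the integer rank stands in for the status ordering, and a real test would race two `UpsertAsync` calls against the MSSQL fixture instead.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical in-memory stand-in for the repository's monotonic upsert:
// a write only wins when its rank is higher than the stored rank.
public sealed class MonotonicStore
{
    private readonly ConcurrentDictionary<Guid, int> _rankById = new();

    public void Upsert(Guid trackedOperationId, int rank) =>
        _rankById.AddOrUpdate(
            trackedOperationId,
            rank,                                       // insert leg
            (_, existing) => Math.Max(existing, rank)); // update leg never regresses

    public int RankOf(Guid trackedOperationId) => _rankById[trackedOperationId];
}

public static class UpsertRaceDemo
{
    // Fires every rank at the same id concurrently and returns the stored rank.
    public static int Race(params int[] ranks)
    {
        var store = new MonotonicStore();
        var id = Guid.NewGuid();
        Task.WaitAll(ranks.Select(r => Task.Run(() => store.Upsert(id, r))).ToArray());
        return store.RankOf(id);
    }
}
```

Whatever order the tasks are scheduled in, the stored rank ends at the maximum submitted rank, which is the property the old-node/new-node overlap during handover relies on.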
### SiteCallAudit-003 — `OnUpsertAsync` does not refresh `IngestedAtUtc`; direct-write callers must remember to stamp it

| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:153-193` |

**Description**

The combined-telemetry hot path (`AuditLogIngestActor.OnCachedTelemetryAsync`)
stamps `IngestedAtUtc = DateTime.UtcNow` on both the `AuditLog` row and the
`SiteCall` row at central-side persist time
(`src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:238-239`). The design
doc treats `IngestedAtUtc` as "central ingested (or last refreshed) this row"
— a central-side timestamp.

`SiteCallAuditActor.OnUpsertAsync` writes the supplied `SiteCall` as-is, with
whatever `IngestedAtUtc` the caller stamped. The only current callers are the
unit tests (which use `DateTime.UtcNow` at command-construction time). Once
the deferred reconciliation puller lands and starts emitting
`UpsertSiteCallCommand`s, the puller (running on central) is responsible for
stamping a central timestamp — but if a future direct-write caller forgets,
or constructs from a site DTO, the value could drift (e.g. become a site
clock value).

This is currently latent because no production caller exists, but it's
inconsistent with the dual-write code path and undocumented.

**Recommendation**

- Either: stamp `IngestedAtUtc = DateTime.UtcNow` inside `OnUpsertAsync`
  before calling `UpsertAsync` (matching `AuditLogIngestActor`'s behaviour),
  using `cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }`.
- Or: document in the `UpsertSiteCallCommand` XML that callers MUST stamp
  `IngestedAtUtc` to a central-side `DateTime.UtcNow` immediately before
  sending.

Preferred: stamp inside the actor — same as the combined-telemetry path —
because callers cannot in general know the actor is colocated on central.

**Resolution (2026-05-28):** `OnUpsertAsync` now rewrites the incoming `SiteCall` via `cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }` immediately before calling `repository.UpsertAsync`, mirroring `AuditLogIngestActor`'s combined-telemetry hot path. The repository writes `IngestedAtUtc` on both the insert-if-not-exists and the monotonic UPDATE legs (`SiteCallAuditRepository.UpsertAsync`), so the column is writable on every upsert. Callers (telemetry, the deferred reconciliation puller, any future direct-write) no longer need to remember to stamp a central-side timestamp — the actor owns it. Existing 24 SiteCallAudit tests remain green (the MSSQL-fixture test constructs rows with `DateTime.UtcNow` and doesn't assert the exact value, so the actor's re-stamp is backward compatible).

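As a minimal illustration of the re-stamp described in the resolution, the sketch below applies the same `with` expression to trimmed-down record types. `SiteCall` and `UpsertSiteCallCommand` here are hypothetical reductions of the real types, and the injected clock is an assumption added so the behaviour can be checked deterministically; the actor itself uses `DateTime.UtcNow` directly.

```csharp
using System;

// Hypothetical, trimmed-down stand-ins for the real types.
public sealed record SiteCall(Guid TrackedOperationId, string SourceSite, DateTime IngestedAtUtc);
public sealed record UpsertSiteCallCommand(SiteCall SiteCall);

public static class IngestStamp
{
    // Returns a copy of the row carrying a central-side timestamp.
    // Records are immutable here, so the caller's command is left untouched.
    public static SiteCall Stamp(UpsertSiteCallCommand cmd, Func<DateTime> utcNow) =>
        cmd.SiteCall with { IngestedAtUtc = utcNow() };
}
```

Because `with` produces a new record, a caller that stamped a site clock value keeps its own instance unchanged while the persisted row carries the central timestamp.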
### SiteCallAudit-004 — Reconciliation puller and daily terminal-purge scheduler still deferred; design-doc drift

| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:23-30` (actor XML), `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/ServiceCollectionExtensions.cs:8-13`, `docs/requirements/Component-SiteCallAudit.md:24-32` |

**Description**

The design doc (`Component-SiteCallAudit.md` lines 24-32) lists five
responsibilities, including:

- "Run periodic per-site reconciliation pulls so missed telemetry self-heals."
- "Purge terminal audit rows after a configurable retention window."

The repository exposes `PurgeTerminalAsync` but nothing in this module
schedules a daily call (Notification Outbox owns a `MaintenanceService` for
its equivalent; no `SiteCallAuditMaintenanceService` exists). The
reconciliation puller is acknowledged in the actor XML
(`only reconciliation remains deferred`) but is not surfaced in the design
doc as deferred — the doc reads as if it ships.

**Recommendation**

- Either: implement the deferred pieces (a hosted service that wakes daily
  and calls `repo.PurgeTerminalAsync(now - retentionWindow)`, plus a per-site
  reconciliation puller with a cursor + an `IPullCachedTelemetryClient`).
- Or: add a "Status" / "Deferred" subsection to the design doc explicitly
  listing what's not yet implemented (matches the pattern Audit Log uses for
  its tamper-evidence hash chain).

**Resolution (2026-05-28):**

Updated the class-level XML on `SiteCallAuditActor` to reflect actual state:
telemetry ingest, query/detail/KPI handlers (Task 4), and the central→site
Retry/Discard relay (Task 5) are implemented; the periodic reconciliation
puller and the daily terminal-row purge scheduler remain deferred. The design
doc update is tracked separately.

### SiteCallAudit-005 — `AckErrorMessage` switch arm for `SiteUnreachable` returns ack message instead of throwing

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:548-563` |

**Description**

```csharp
return outcome switch
{
    SiteCallRelayOutcome.Applied => null,
    SiteCallRelayOutcome.NotParked => "The operation is no longer parked at the site (...)",
    SiteCallRelayOutcome.OperationFailed => ack.ErrorMessage,
    // SiteUnreachable is never produced from a ParkedOperationActionAck —
    // unreachable responses are built by UnreachableRetry/UnreachableDiscard
    // before any ack is classified, so this arm is unreachable by construction.
    SiteCallRelayOutcome.SiteUnreachable => ack.ErrorMessage,
    _ => throw new ArgumentOutOfRangeException(...)
};
```

The comment correctly states the `SiteUnreachable` arm is unreachable when
called from `ClassifyAck`. The arm therefore exists only to satisfy
exhaustiveness, but instead of throwing or returning a sentinel, it falls
through to `ack.ErrorMessage` — indistinguishable from the `OperationFailed`
arm above. If any future caller *does* feed `SiteUnreachable` into
`AckErrorMessage` (e.g. via refactor), the result will be a silent
wrong-detail-text bug rather than an immediate crash. The default arm
correctly throws `ArgumentOutOfRangeException`, so the `SiteUnreachable` arm
is the inconsistent one.

**Recommendation**

Replace the `SiteUnreachable => ack.ErrorMessage` arm with:

```csharp
SiteCallRelayOutcome.SiteUnreachable =>
    throw new InvalidOperationException(
        "AckErrorMessage cannot be called for SiteUnreachable — those responses "
        + "are built by UnreachableRetry/UnreachableDiscard before classification."),
```

— fail fast if the invariant is ever violated by a refactor.

**Resolution (2026-05-28):**

Behaviour kept (return `ack.ErrorMessage`); `AckErrorMessage` stays total and
side-effect-free. Expanded the inline comment on the `SiteUnreachable` arm to
explain WHY it returns rather than throws: site-unreachable is classified as
transient by the upstream relay (which has already built its
`SiteUnreachable` response and detail text via `SiteUnreachableMessage`), so a
defensive fall-through surfaces the ack's message and lets the caller schedule
a retry — throwing would turn a benign refactor invariant violation into a
relay-path crash.

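The trade-off between the kept behaviour and the recommended fail-fast arm can be compared side by side in a self-contained form. This is a sketch under assumptions: `Ack` is a hypothetical one-field reduction of `ParkedOperationActionAck`, the `NotParked` text is a placeholder for the elided real message, and `Total` / `FailFast` are illustrative names rather than methods in the module.

```csharp
#nullable enable
using System;

public enum SiteCallRelayOutcome { Applied, NotParked, OperationFailed, SiteUnreachable }

// Hypothetical minimal ack; the real ParkedOperationActionAck carries more fields.
public sealed record Ack(string? ErrorMessage);

public static class AckText
{
    // Behaviour kept by the resolution: total over every defined outcome.
    public static string? Total(SiteCallRelayOutcome outcome, Ack ack) => outcome switch
    {
        SiteCallRelayOutcome.Applied => null,
        SiteCallRelayOutcome.NotParked => "The operation is no longer parked at the site.", // placeholder text
        SiteCallRelayOutcome.OperationFailed => ack.ErrorMessage,
        SiteCallRelayOutcome.SiteUnreachable => ack.ErrorMessage, // defensive fall-through
        _ => throw new ArgumentOutOfRangeException(nameof(outcome)),
    };

    // Alternative from the recommendation: throw on the arm that should be unreachable.
    public static string? FailFast(SiteCallRelayOutcome outcome, Ack ack) => outcome switch
    {
        SiteCallRelayOutcome.SiteUnreachable =>
            throw new InvalidOperationException("SiteUnreachable is classified before any ack."),
        _ => Total(outcome, ack),
    };
}
```

The total variant never throws for a defined enum value, which is what lets a caller keep scheduling a retry; the fail-fast variant surfaces a refactor mistake immediately, at the cost of an exception on the relay path.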
### SiteCallAudit-006 — Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor

| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteCallAudit.Tests/SiteCallAuditActorTests.cs:335-392` |

**Resolution (2026-05-28):** Added `SiteCallQueryRequest_StuckOnly_CursorAtNonStuckBoundary_SkipsToNextStuckRow` to `tests/ZB.MOM.WW.ScadaBridge.SiteCallAudit.Tests/SiteCallAuditActorTests.cs` — drives six rows interleaved as `stuck/non-stuck` × 3 (oldest-first), then issues three page-size-1 stuck-only queries. The cursor between each page deliberately lands on a non-stuck row, so the SQL composition of the stuck predicate AND the keyset cursor predicate must skip it. Asserts each page returns exactly one stuck row in DESC-by-CreatedAtUtc order with no overlap and all three stuck rows visited. Locks the invariant that post-filtering does not produce under-filled pages with non-null next cursors.

**Description**

`SiteCallQueryRequest_StuckOnly_PagesAreFull_NoEmptyPagesWithCursor` covers
the case where stuck rows are interleaved with non-stuck rows (page-1 returns
2 stuck rows, page-2 returns the third). It does not cover the edge where
the row at the keyset cursor boundary (`AfterCreatedAtUtc + AfterId`) is
itself a non-stuck row — i.e. the cursor points at a row the next page must
SKIP through to find more stuck rows. The repository's SQL composes the
cursor predicate (`CreatedAtUtc < cursor OR (CreatedAtUtc = cursor AND id <
...)`) with the stuck predicate, so it should be honest, but the test only
asserts row counts and `IsStuck`, not that the second-page query specifically
skipped non-stuck rows between the cursor and the next stuck row.

Lower priority because the SQL composition is straightforward, but adding a
direct test would lock the invariant.

**Recommendation**

Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck,
stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a `StuckOnly`
page-size-1 query; (c) asserts each page returns exactly the stuck row, with
no overlap and all 3 stuck rows visited.

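The predicate composition this finding is about can be sketched with LINQ over an in-memory list. This is an illustration only: `CallRow` and `StuckPaging.Page` are hypothetical, the numeric `Id` is an assumption made for simple ordering, and the real query runs as SQL in the repository.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape; only the fields the paging predicate needs.
public sealed record CallRow(long Id, DateTime CreatedAtUtc, bool IsStuck);

public static class StuckPaging
{
    // One page, newest first. The stuck filter and the keyset cursor are both applied
    // before Take, so non-stuck rows past the cursor are skipped rather than returned.
    public static List<CallRow> Page(
        IEnumerable<CallRow> rows, int pageSize, (DateTime CreatedAtUtc, long Id)? after)
    {
        return rows
            .Where(r => r.IsStuck)
            .Where(r => after is null
                || r.CreatedAtUtc < after.Value.CreatedAtUtc
                || (r.CreatedAtUtc == after.Value.CreatedAtUtc && r.Id < after.Value.Id))
            .OrderByDescending(r => r.CreatedAtUtc)
            .ThenByDescending(r => r.Id)
            .Take(pageSize)
            .ToList();
    }
}
```

Filtering before `Take` is the design point: if the stuck filter were applied after the page was cut, a cursor resting on a non-stuck row could produce an under-filled page with a non-null next cursor.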
### SiteCallAudit-007 — Daily purge timer is armed only when the reconciliation collaborators are present; a host without the reconciliation client silently never purges

| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:307-321`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:202-227` |

**Description**

`StartPurgeTimer` (lines 307-321) gates the daily terminal-row purge tick on
the **reconciliation** collaborators being present:

```csharp
private void StartPurgeTimer()
{
    if (_pullClient is null || _siteEnumerator is null)
    {
        return;
    }
    // ... schedule PurgeTick ...
}
```

But the purge pass (`OnPurgeTickAsync` → `PurgeWithRepositoryAsync` →
`ISiteCallAuditRepository.PurgeTerminalAsync`) needs only the repository — it
has no dependency on `IPullSiteCallsClient` / `ISiteEnumerator`. The two
collaborators are resolved by the production constructor (lines 202-227) via
`serviceProvider.GetService<IPullSiteCallsClient>()` /
`GetService<ISiteEnumerator>()` — both registered by
`AddAuditLogCentralReconciliationClient`
(`AuditLog/ServiceCollectionExtensions.cs:473,523`, registered as `ISiteEnumerator`
and `IPullSiteCallsClient`). `GetService` (not `GetRequiredService`) returns
`null` if that helper was never called, so a host that registers
`AddSiteCallAudit()` **without** also calling
`AddAuditLogCentralReconciliationClient(...)` constructs the actor with both
collaborators null. In `PreStart` both `StartReconciliationTimer` and
`StartPurgeTimer` early-return, so the actor runs forever with **no purge timer
at all** → unbounded growth of the central `SiteCalls` table, with no log line
to say the purge was skipped.

This is **currently latent, not live**: `Program.cs` co-registers
`AddAuditLogCentralReconciliationClient(builder.Configuration)` (line 107)
immediately before `AddSiteCallAudit()` (line 113), so on the real central node
both collaborators resolve and both timers run today. The risk is a future host
(or a refactor that splits the reconciliation client out of the central
composition root, or a test/embedded host that wants only ingest + purge)
silently losing the purge with no diagnostic. The gate exists for a legitimate
reason — keeping the repo-only test ctor free of *both* background timers so the
MSSQL read/upsert tests see no scheduled side effects — but it conflates "no
reconciliation route" with "no purge", and the actor's own XML
(lines 36-38) documents the coupling as deliberate rather than flagging it as a
hazard.

**Recommendation**

Decouple `StartPurgeTimer` from the reconciliation collaborators — purge needs
only a repository, which both the production and the repo-only test ctors
always have. Two viable shapes:

- Preferred: gate the purge timer on its own real precondition (a repository is
  always available, so arm it unconditionally in the production + reconciliation
  ctors; keep it off only in the repo-only test ctor via an explicit
  "background timers off" flag rather than by proxy of the reconciliation
  collaborators). This keeps the MSSQL test isolation while removing the
  accidental coupling.
- At minimum: log a `Warning` in `StartPurgeTimer` (and `StartReconciliationTimer`)
  when the timer is **not** armed on the production path, so a misconfigured host
  surfaces "SiteCallAudit purge timer not started — `PurgeTerminalAsync` will
  never run" instead of growing the table silently.

Severity is a judgment call: Medium because the consequence (unbounded central
table growth) is real and silent, but it is latent today (the only production
composition root co-registers the reconciliation client), so an argument for Low
is defensible. Flagged Medium to weight the silent-failure + data-growth aspect.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): decoupled the daily purge timer from the reconciliation collaborators (a new `_backgroundTimersEnabled` flag) so a central node that omits the reconciliation client still purges — no more silent unbounded `SiteCalls` growth; a Warning is logged if the reconciliation collaborators are absent. Test added (purge arms without the reconciliation client).

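The decoupled gating described in the resolution can be modelled as a pure decision function. This is a sketch, not the actor's `PreStart`: `TimerPlan` and `Decide` are hypothetical names, no Akka scheduler is involved, and the warning text is illustrative. Only the `_backgroundTimersEnabled` flag and the warning-on-missing-collaborators behaviour come from the review.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of the timer decisions made at actor start.
public sealed class TimerPlan
{
    public bool PurgeArmed { get; private set; }
    public bool ReconciliationArmed { get; private set; }
    public List<string> Warnings { get; } = new();

    public static TimerPlan Decide(
        bool backgroundTimersEnabled, bool hasPullClient, bool hasSiteEnumerator)
    {
        var plan = new TimerPlan();
        if (!backgroundTimersEnabled) return plan; // repo-only test ctor: no timers at all

        plan.PurgeArmed = true;                    // purge needs only the repository

        if (hasPullClient && hasSiteEnumerator)
            plan.ReconciliationArmed = true;
        else
            plan.Warnings.Add("Reconciliation collaborators missing; reconciliation timer not started.");

        return plan;
    }
}
```

Separating the explicit flag from the collaborator check keeps the MSSQL tests free of scheduled side effects while making the misconfigured-host case both safe (purge still runs) and visible (a warning is recorded).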
### SiteCallAudit-008 — Design doc claims six charted KPI trend series; only three are actually charted
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `docs/requirements/Component-SiteCallAudit.md:149-154`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/Kpi/SiteCallAuditKpiSampleSource.cs:42-47`, `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiMetrics.cs:52-62` |
|
||
|
||
**Description**
|
||
|
||
The design doc (`Component-SiteCallAudit.md` lines 149-154, the KPI History
|
||
interaction) states:
|
||
|
||
> "… the resulting `buffered` / `parked` / `failedLastInterval` /
|
||
> `deliveredLastInterval` / `stuck` / `oldestPendingAgeSeconds` series render as
|
||
> trends on the Site Calls page via `KpiTrendChart`."
|
||
|
||
That lists **six** series as charted. In the code only **three** are charted:

- The public charted catalog `KpiMetrics.SiteCallAudit`
  (`KpiMetrics.cs:52-62`) exposes exactly `Buffered`, `Parked`, and
  `FailedLastInterval` — and its own XML says "Charted Site Call Audit (#22)
  metrics … Rendered by the Central UI Site Calls report trend panel."
- `SiteCallAuditKpiSampleSource` (lines 42-47) keys those three off the public
  Commons catalog (`KpiMetrics.SiteCallAudit.*`) and keeps the other three —
  `deliveredLastInterval`, `stuck`, `oldestPendingAgeSeconds` — as **private**
  string literals, with the comment "Charted metrics share the public Commons
  catalog … the uncharted internal metrics stay private here (#178)."
- The UI confirms it: `SiteCalls/SiteCallsReport.razor.cs` calls
  `LoadSeriesAsync` for exactly `KpiMetrics.SiteCallAudit.Buffered`, `.Parked`,
  and `.FailedLastInterval` — three series, no more.

So all six metrics are *sampled* into the `KpiSample` history store, but only
three are *charted*. The doc reads as if all six are rendered, which is a small
design-doc drift (over-claiming the UI surface). The code is internally
consistent and correctly commented; the doc is the stale party.

**Recommendation**

One-line doc edit on `Component-SiteCallAudit.md` lines 149-154: state that the
three series `buffered` / `parked` / `failedLastInterval` render as trends on
the Site Calls page via `KpiTrendChart`, and that `deliveredLastInterval` /
`stuck` / `oldestPendingAgeSeconds` are **sampled into the KPI-history store but
not charted** (available for future trend panels / ad-hoc query). No code change
— the code is already the intended state.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): the design doc now lists the three actually-charted KPI metrics (buffered/parked/failedLastInterval) and marks the other three as sampled-but-not-yet-charted, matching the `KpiMetrics` catalog. Doc-only.
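
The sampled-versus-charted split described in this finding can be illustrated with a minimal sketch. The metric names are the ones quoted from the design doc; the arrays `sampled`, `charted`, and `sampledOnly` are hypothetical stand-ins for illustration only and are not the actual `KpiMetrics` catalog or sample-source code.

```csharp
using System;
using System.Linq;

// All six metric names the design doc quotes; every one is sampled into history.
string[] sampled =
{
    "buffered", "parked", "failedLastInterval",
    "deliveredLastInterval", "stuck", "oldestPendingAgeSeconds",
};

// Hypothetical stand-in for the public charted catalog (three entries).
string[] charted = { "buffered", "parked", "failedLastInterval" };

// Metrics that are sampled but have no trend panel.
string[] sampledOnly = sampled.Except(charted).ToArray();

Console.WriteLine(string.Join(", ", sampledOnly));
// prints: deliveredLastInterval, stuck, oldestPendingAgeSeconds
```

The set difference is the part the doc originally over-claimed: those three names exist in the history store but nothing renders them.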

### SiteCallAudit-009 — Reconciliation cursor advances inclusively and ignores `MoreAvailable`; a single-timestamp batch saturation re-pulls without progress

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:505-535` |

**Description**

`ReconcileSiteAsync` (lines 505-535) pulls rows since the per-site cursor,
upserts each, and advances the cursor to the maximum `UpdatedAtUtc` observed:

```csharp
var since = _reconciliationCursors.TryGetValue(site.SiteId, out var c) ? c : DateTime.MinValue;
var response = await client.PullAsync(site.SiteId, since, _options.ReconciliationBatchSize, ...);

var maxUpdated = since;
foreach (var row in response.SiteCalls)
{
    await repository.UpsertAsync(row with { IngestedAtUtc = nowUtc });
    if (row.UpdatedAtUtc > maxUpdated) maxUpdated = row.UpdatedAtUtc;
}
_reconciliationCursors[site.SiteId] = maxUpdated;
```

Two interacting properties:

1. The pull is **inclusive** — `PullAsync(site, since, …)` asks for rows with
   `UpdatedAtUtc >= since` (documented in the method's XML, lines 497-503), and
   the cursor advances to the **max** `UpdatedAtUtc` seen, which is itself one
   of the rows just pulled. So the boundary row is re-pulled every tick and
   deduped by the idempotent monotonic upsert — intended, harmless.
2. `response.MoreAvailable` is **never read** at all. The `PullSiteCallsResponse`
   carries `MoreAvailable` (`PullSiteCallsResponse.cs:14-17`: "True when the
   site saturated the requested batch size — the caller should advance the
   cursor and pull again"), but `ReconcileSiteAsync` ignores it and relies on
   the natural tick cadence to drain the backlog over successive ticks.

The edge case: if a site has **more rows than the batch size all sharing one
exact `UpdatedAtUtc`** (e.g. a burst written in the same tick / same clock
value), the saturated batch returns rows whose max `UpdatedAtUtc` equals
`since`. `maxUpdated` therefore stays at `since`, the cursor does **not**
advance, and because the pull is inclusive the next tick re-pulls the identical
window — and again, and again — making no forward progress on that site's
backlog. Because the upsert is idempotent and the `SiteCalls` table is an
eventually-consistent mirror (not the source of truth), this is **wasted work,
never corruption** — but it is an unbounded re-pull loop on a pathological
input, and any rows in that backlog beyond the batch ceiling never get
reconciled.

The sibling `SiteAuditReconciliationActor` shares the inclusive-cursor /
max-timestamp shape, so the same single-timestamp-saturation no-progress edge
applies to it — but that sibling **does read `MoreAvailable`** (it feeds its
stalled-detection state machine, `SiteAuditReconciliationActor.cs:324-325`,
publishing `SiteAuditTelemetryStalledChanged` so a non-draining site surfaces a
health signal). `ReconcileSiteAsync`'s XML (lines 530-534) claims the
no-immediate-re-pull behaviour "match[es] `SiteAuditReconciliationActor`", which
is only half true: it matches the cursor cadence but diverges by dropping
`MoreAvailable` entirely, so this actor has neither a continuation pull nor a
stalled signal — a saturated site lags silently with no observability.

**Recommendation**

- Consume `MoreAvailable`: either continue draining within the same tick while
  `MoreAvailable` is true (bounded by a max-iterations guard), or — matching the
  sibling — surface a stalled/non-draining signal when a batch comes back
  saturated so a stuck site is observable rather than silent.
- Defend the single-timestamp no-progress edge with a tiebreaker beyond the
  timestamp (e.g. advance on `(UpdatedAtUtc, TrackedOperationId)` as a composite
  keyset cursor, and ask the pull for rows strictly after that composite), so a
  burst sharing one `UpdatedAtUtc` cannot pin the cursor.
- Correct the `ReconcileSiteAsync` XML (lines 530-534): it claims parity with
  `SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling
  reads it for stalled detection; this actor ignores it).

Severity Low: idempotent upsert + mirror-not-source-of-truth mean no data
corruption, and the saturated-single-timestamp input is pathological; the cost
is wasted re-pulls and an un-drained backlog tail on that one input, plus the
missing observability.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): `ReconcileSiteAsync` now consumes `response.MoreAvailable` via a within-tick continuation drain bounded by a max-pages guard, with explicit no-progress (single-timestamp-saturation) detection that breaks and logs a Warning instead of re-pulling forever. The XML claim of parity with the sibling reconciler was corrected. Idempotent upsert retained. Tests added.
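
The shape of the drain described in this resolution can be sketched as follows. This is an illustrative model under stated assumptions, not the code from commit `fd618cf1`: rows are reduced to their `UpdatedAtUtc` timestamps, and `Pull`, `Drain`, `batchSize`, and `maxPages` are hypothetical names. `Pull` treats a batch as saturated when it returns exactly `batchSize` rows, following the `MoreAvailable` wording quoted above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var t0 = new DateTime(2026, 6, 20, 0, 0, 0, DateTimeKind.Utc);

// Five rows with distinct timestamps: the drain walks them to the end.
var distinct = Enumerable.Range(1, 5).Select(i => t0.AddSeconds(i)).ToList();
var healthy = Drain(distinct, DateTime.MinValue, batchSize: 2, maxPages: 10);
Console.WriteLine($"{healthy.Pages} {healthy.Stalled}"); // prints: 5 False

// Five rows sharing one timestamp: the second saturated page makes no progress.
var burst = Enumerable.Repeat(t0, 5).ToList();
var stuck = Drain(burst, DateTime.MinValue, batchSize: 2, maxPages: 10);
Console.WriteLine($"{stuck.Pages} {stuck.Stalled}"); // prints: 2 True

// Hypothetical inclusive pull: rows with UpdatedAtUtc >= since, oldest first.
(List<DateTime> Rows, bool MoreAvailable) Pull(List<DateTime> site, DateTime since, int batchSize)
{
    var rows = site.Where(r => r >= since).OrderBy(r => r).Take(batchSize).ToList();
    return (rows, rows.Count == batchSize); // saturated batch => more may be available
}

// Within-tick continuation drain, bounded by maxPages, with no-progress detection.
(DateTime Cursor, int Pages, bool Stalled) Drain(List<DateTime> site, DateTime since, int batchSize, int maxPages)
{
    var cursor = since;
    var pages = 0;
    while (pages < maxPages)
    {
        var (rows, more) = Pull(site, cursor, batchSize);
        pages++;
        // A real reconciler would upsert each row here (idempotent, so re-pulls are safe).
        var maxUpdated = rows.Count > 0 ? rows.Max() : cursor;
        var progressed = maxUpdated > cursor;
        if (progressed) cursor = maxUpdated;
        if (!more) return (cursor, pages, false);      // backlog drained
        if (!progressed) return (cursor, pages, true); // saturated and pinned: stop, warn
    }
    return (cursor, pages, false); // page budget spent; a later tick resumes from cursor
}
```

In this sketch the stall check only detects the single-timestamp burst and stops re-pulling; it does not drain it. Draining such a burst would still need the composite `(UpdatedAtUtc, TrackedOperationId)` cursor suggested in the recommendation.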
|