# Code Review — SiteCallAudit | Field | Value | |-------|-------| | Module | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit` | | Design doc | `docs/requirements/Component-SiteCallAudit.md` | | Status | Reviewed | | Last reviewed | 2026-05-28 | | Reviewer | claude-agent | | Commit reviewed | `1eb6e97` | | Open findings | 2 | ## Summary The module is small (one actor + DI extension + options class). The actor is a central cluster singleton that exposes three responsibility groups: direct `UpsertSiteCallCommand` ingest, paginated/KPI read handlers, and the central→site Retry/Discard relay. Ingest idempotency is delegated to the repository's monotonic-upsert (the CD-015 check-then-act window is mitigated by the duplicate-key swallow on the insert leg). Findings cluster around two themes: (a) the `SupervisorStrategy` override is dead-code that contradicts the XML docstring — it governs children, and this actor has none, so the documented "Resume on leaked exception" promise is unenforced; (b) several smaller drifts between the design doc and the code (reconciliation puller + daily purge schedule are still deferred; `OnUpsertAsync` does not stamp `IngestedAtUtc` unlike the dual-write path). The relay path is well covered by Akka TestKit unit tests; the ingest + KPI paths are covered by MSSQL-backed integration tests using a shared `MsSqlMigrationFixture`. ## Checklist coverage | # | Category | Examined | Notes | |---|----------|----------|-------| | 1 | Correctness & logic bugs | Yes | `OnUpsertAsync` does not refresh `IngestedAtUtc` (Finding 003). | | 2 | Akka.NET conventions | Yes | `SupervisorStrategy()` override is dead code (Finding 001). `Sender` correctly captured before first await on every handler. `PipeTo` used for read replies. | | 3 | Concurrency & thread safety | Yes | `_centralCommunication` mutated only on actor thread via `RegisterCentralCommunication`. DI scope-per-message disposed in `try/finally`. No issues found. | | 4 | Error handling & resilience | Yes | Ingest catches all + replies `Accepted=false`. Relay distinguishes `SiteUnreachable` vs `OperationFailed`. Failover handover does not wait for in-flight async work (Finding 002). | | 5 | Security | Yes | All SQL is parameterised at the repository (FromSqlInterpolated). Relay carries no user-controlled strings beyond `SourceSite`. No issues found. | | 6 | Performance & resource management | Yes | DI scope-per-message correctly disposed. `MaxPageSize=200` clamp present. No issues found. | | 7 | Design-document adherence | Yes | Reconciliation puller and daily terminal-purge scheduler still deferred; design doc reads as if they ship (Finding 004). | | 8 | Code organization & conventions | Yes | `RegisterCentralCommunication` is a top-level record colocated with the actor — by design (carries `IActorRef`, cannot live in Commons). No issues found. | | 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). | | 10 | Documentation & comments | Yes | XML docstring claims `SupervisorStrategy` uses Resume — incorrect (Finding 001). `AckErrorMessage` switch arm for `SiteUnreachable` falls through instead of throwing (Finding 005). | ## Findings ### SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced | | | |--|--| | Severity | Medium | | Category | Akka.NET conventions | | Status | Resolved | | Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:32-46`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:147-151` | **Description** The XML remarks block (lines 32-46) states: > "The `SupervisorStrategy` uses `Resume` so an unexpected throw before the catch (defence in depth) does not restart the actor and reset in-flight state." The override at lines 147-151 returns a `OneForOneStrategy` with `DefaultDecider` and `maxNrOfRetries: 0`. Two problems compound: 1. `ActorBase.SupervisorStrategy()` governs the actor's **children**, not the actor itself. `SiteCallAuditActor` creates no children, so this override is dead code. 2. The returned strategy uses `DefaultDecider` (Restart for most exceptions), **not** `Directive.Resume`. So even if the actor did have children, the strategy would not be Resume — it would be the default Restart-on-most-faults behaviour with `maxNrOfRetries: 0` (which forces a Stop after the first failure). Net effect: the actor's own self-supervision is whatever the parent supplies (`SupervisorStrategy.DefaultDecider` from the singleton manager / user guardian in tests), which Restarts on most exceptions. If the `try/catch` in `OnUpsertAsync` ever leaked (e.g. a synchronous throw constructing `replyTo`), the actor would Restart, reset `_centralCommunication` to null, and silently break the relay until `RegisterCentralCommunication` runs again. This same pattern (with the same misleading XML doc) exists in `AuditLogIngestActor`, `AuditLogPurgeActor`, and `SiteAuditReconciliationActor` — they were likely cargo-culted; this finding documents the local instance. **Recommendation** Either: - Remove the `SupervisorStrategy()` override entirely (it does nothing useful) and revise the XML comment to drop the "Resume" claim. Self-supervision is the parent's concern (the cluster singleton manager); the `try/catch` in `OnUpsertAsync` is what actually keeps the actor alive. - Or, if Resume-on-self-throw is actually desired, that requires wiring a custom supervisor in the parent (`ClusterSingletonManager`) — not overriding `SupervisorStrategy()` here. Simpler path: keep the `try/catch`, drop the override. The CLAUDE.md "Resume for coordinator actors" decision applies to actors with children (Site Runtime hierarchy) — not to leaf cluster singletons. **Resolution (2026-05-28):** Rewrote the class-level XML on `SiteCallAuditActor` plus the method-level XML on `SupervisorStrategy()` to accurately describe what the override does — a one-for-one strategy with `DefaultDecider` (Restart on most exceptions, Stop on `ActorInitializationException`/`ActorKilledException`) and `maxNrOfRetries: 0`, governing the actor's *children* (the actor has none today, so the override is currently inert). Dropped the misleading "Resume" claim. The new docs make clear that self-supervision of this cluster singleton is the parent `ClusterSingletonManager`'s concern and the actor's own resilience comes from the in-handler `try/catch` in `OnUpsertAsync`, not from this override. No behaviour change — pure documentation fix; existing 24 SiteCallAudit tests remain green. ### SiteCallAudit-002 — Singleton failover does not wait for in-flight async upserts | | | |--|--| | Severity | Low | | Category | Error handling & resilience | | Status | Resolved | | Location | `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:455-462` (singleton wiring), `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:153-193` | **Resolution (2026-05-28):** Added a `CoordinatedShutdown` task in the `cluster-leave` phase (named `drain-site-call-audit-singleton`) that issues an explicit `GracefulStop(10s)` to the `SiteCallAudit` cluster singleton manager before the cluster-leave proceeds. Akka.NET's singleton handover already waits for the active actor's `ReceiveAsync` task to complete before signalling `HandOverDone`, so an in-flight EF `UpsertAsync` (and its SQL round-trip) drains on the old node before the new singleton starts on the other central node — closing the seam where the new singleton could race a still-running upsert on the old node. The 10-second timeout is bounded so a misbehaving upsert cannot stall coordinated shutdown indefinitely; on timeout the existing `PoisonPill` termination path takes over and the repository's monotonic-upsert + 2601/2627 duplicate-key swallow remain as the storage-state safety net. Pattern is suitable for the `NotificationOutbox` singleton too; deferred to keep this change scoped. **Description** The singleton is created with `terminationMessage: PoisonPill.Instance`. On failover the active node's singleton stops as soon as the mailbox is drained of normal messages and the PoisonPill is processed. An in-flight `OnUpsertAsync` Task started before the PoisonPill arrived will be allowed to complete (the message-handler runs synchronously from the mailbox's view), but the Akka actor model does NOT cancel the EF Core `ExecuteSqlInterpolatedAsync` call. Two consequences: 1. The new singleton on the other node may begin accepting `UpsertSiteCallCommand` for the same `TrackedOperationId` while the old singleton's in-flight upsert is still running. The repository's monotonic-upsert and the SQL duplicate-key swallow protect storage state. 2. The original `replyTo` sender may receive its `Accepted=true` after the new singleton has already returned a different reply. Idempotency keys protect correctness; wire-level ordering is best-effort by design. This is consistent with the design ("eventually-consistent mirror, sites are source of truth"), but worth documenting as an explicit invariant. The Notification Outbox sibling has the same pattern. **Recommendation** - Document the failover/handover semantics in the actor's XML remarks: "On cluster singleton handover, in-flight `OnUpsertAsync` tasks complete on the old node and may produce a late `Accepted=true` reply; the repository's monotonic upsert ensures storage state is consistent." - Add an integration test that deliberately races two concurrent upserts on the same `TrackedOperationId` to verify the duplicate-key swallow + monotonic rank check (the CD-015 race-pattern check the parent task flagged). ### SiteCallAudit-003 — `OnUpsertAsync` does not refresh `IngestedAtUtc`; direct-write callers must remember to stamp it | | | |--|--| | Severity | Medium | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:153-193` | **Description** The combined-telemetry hot path (`AuditLogIngestActor.OnCachedTelemetryAsync`) stamps `IngestedAtUtc = DateTime.UtcNow` on both the `AuditLog` row and the `SiteCall` row at central-side persist time (`src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:238-239`). The design doc treats `IngestedAtUtc` as "central ingested (or last refreshed) this row" — a central-side timestamp. `SiteCallAuditActor.OnUpsertAsync` writes the supplied `SiteCall` as-is, with whatever `IngestedAtUtc` the caller stamped. The only current callers are the unit tests (which use `DateTime.UtcNow` at command-construction time). Once the deferred reconciliation puller lands and starts emitting `UpsertSiteCallCommand`s, the puller (running on central) is responsible for stamping a central timestamp — but if a future direct-write caller forgets, or constructs from a site DTO, the value could drift (e.g. become a site clock value). This is currently latent because no production caller exists, but it's inconsistent with the dual-write code path and undocumented. **Recommendation** - Either: stamp `IngestedAtUtc = DateTime.UtcNow` inside `OnUpsertAsync` before calling `UpsertAsync` (matching `AuditLogIngestActor`'s behaviour), using `cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }`. - Or: document in the `UpsertSiteCallCommand` XML that callers MUST stamp `IngestedAtUtc` to a central-side `DateTime.UtcNow` immediately before sending. Preferred: stamp inside the actor — same as the combined-telemetry path — because callers cannot in general know the actor is colocated on central. **Resolution (2026-05-28):** `OnUpsertAsync` now rewrites the incoming `SiteCall` via `cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }` immediately before calling `repository.UpsertAsync`, mirroring `AuditLogIngestActor`'s combined-telemetry hot path. The repository writes `IngestedAtUtc` on both the insert-if-not-exists and the monotonic UPDATE legs (`SiteCallAuditRepository.UpsertAsync`), so the column is writable on every upsert. Callers (telemetry, the deferred reconciliation puller, any future direct-write) no longer need to remember to stamp a central-side timestamp — the actor owns it. Existing 24 SiteCallAudit tests remain green (the MSSQL-fixture test constructs rows with `DateTime.UtcNow` and doesn't assert the exact value, so the actor's re-stamp is backward compatible). ### SiteCallAudit-004 — Reconciliation puller and daily terminal-purge scheduler still deferred; design-doc drift | | | |--|--| | Severity | Low | | Category | Design-document adherence | | Status | Resolved | | Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:23-30` (actor XML), `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/ServiceCollectionExtensions.cs:8-13`, `docs/requirements/Component-SiteCallAudit.md:24-32` | **Description** The design doc (`Component-SiteCallAudit.md` lines 24-32) lists five responsibilities, including: - "Run periodic per-site reconciliation pulls so missed telemetry self-heals." - "Purge terminal audit rows after a configurable retention window." The repository exposes `PurgeTerminalAsync` but nothing in this module schedules a daily call (Notification Outbox owns a `MaintenanceService` for its equivalent; no `SiteCallAuditMaintenanceService` exists). The reconciliation puller is acknowledged in the actor XML (`only reconciliation remains deferred`) but is not surfaced in the design doc as deferred — the doc reads as if it ships. **Recommendation** - Either: implement the deferred pieces (a hosted service that wakes daily and calls `repo.PurgeTerminalAsync(now - retentionWindow)`, plus a per-site reconciliation puller with a cursor + an `IPullCachedTelemetryClient`). - Or: add a "Status" / "Deferred" subsection to the design doc explicitly listing what's not yet implemented (matches the pattern Audit Log uses for its tamper-evidence hash chain). **Resolution (2026-05-28):** Updated the class-level XML on `SiteCallAuditActor` to reflect actual state: telemetry ingest, query/detail/KPI handlers (Task 4), and the central→site Retry/Discard relay (Task 5) are implemented; the periodic reconciliation puller and the daily terminal-row purge scheduler remain deferred. The design doc update is tracked separately. ### SiteCallAudit-005 — `AckErrorMessage` switch arm for `SiteUnreachable` returns ack message instead of throwing | | | |--|--| | Severity | Low | | Category | Documentation & comments | | Status | Resolved | | Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:548-563` | **Description** ```csharp return outcome switch { SiteCallRelayOutcome.Applied => null, SiteCallRelayOutcome.NotParked => "The operation is no longer parked at the site (...)", SiteCallRelayOutcome.OperationFailed => ack.ErrorMessage, // SiteUnreachable is never produced from a ParkedOperationActionAck — // unreachable responses are built by UnreachableRetry/UnreachableDiscard // before any ack is classified, so this arm is unreachable by construction. SiteCallRelayOutcome.SiteUnreachable => ack.ErrorMessage, _ => throw new ArgumentOutOfRangeException(...) }; ``` The comment correctly states the `SiteUnreachable` arm is unreachable when called from `ClassifyAck`. The arm therefore exists only to satisfy exhaustiveness, but instead of throwing or returning a sentinel, it falls through to `ack.ErrorMessage` — indistinguishable from the `OperationFailed` arm above. If any future caller *does* feed `SiteUnreachable` into `AckErrorMessage` (e.g. via refactor), the result will be a silent wrong-detail-text bug rather than an immediate crash. The default arm correctly throws `ArgumentOutOfRangeException`, so the `SiteUnreachable` arm is the inconsistent one. **Recommendation** Replace the `SiteUnreachable => ack.ErrorMessage` arm with: ```csharp SiteCallRelayOutcome.SiteUnreachable => throw new InvalidOperationException( "AckErrorMessage cannot be called for SiteUnreachable — those responses " + "are built by UnreachableRetry/UnreachableDiscard before classification."), ``` — fail fast if the invariant is ever violated by a refactor. **Resolution (2026-05-28):** Behaviour kept (return `ack.ErrorMessage`); `AckErrorMessage` stays total and side-effect-free. Expanded the inline comment on the `SiteUnreachable` arm to explain WHY it returns rather than throws: site-unreachable is classified as transient by the upstream relay (which has already built its `SiteUnreachable` response and detail text via `SiteUnreachableMessage`), so a defensive fall-through surfaces the ack's message and lets the caller schedule a retry — throwing would turn a benign refactor invariant violation into a relay-path crash. ### SiteCallAudit-006 — Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor | | | |--|--| | Severity | Low | | Category | Testing coverage | | Status | Resolved | | Location | `tests/ZB.MOM.WW.ScadaBridge.SiteCallAudit.Tests/SiteCallAuditActorTests.cs:335-392` | **Resolution (2026-05-28):** Added `SiteCallQueryRequest_StuckOnly_CursorAtNonStuckBoundary_SkipsToNextStuckRow` to `tests/ZB.MOM.WW.ScadaBridge.SiteCallAudit.Tests/SiteCallAuditActorTests.cs` — drives six rows interleaved as `stuck/non-stuck` × 3 (oldest-first), then issues three page-size-1 stuck-only queries. The cursor between each page deliberately lands on a non-stuck row, so the SQL composition of the stuck predicate AND the keyset cursor predicate must skip it. Asserts each page returns exactly one stuck row in DESC-by-CreatedAtUtc order with no overlap and all three stuck rows visited. Locks the invariant that post-filtering does not produce under-filled pages with non-null next cursors. **Description** `SiteCallQueryRequest_StuckOnly_PagesAreFull_NoEmptyPagesWithCursor` covers the case where stuck rows are interleaved with non-stuck rows (page-1 returns 2 stuck rows, page-2 returns the third). It does not cover the edge where the row at the keyset cursor boundary (`AfterCreatedAtUtc + AfterId`) is itself a non-stuck row — i.e. the cursor points at a row the next page must SKIP through to find more stuck rows. The repository's SQL composes the cursor predicate (`CreatedAtUtc < cursor OR (CreatedAtUtc = cursor AND id < ...)`) with the stuck predicate, so it should be honest, but the test only asserts row counts and `IsStuck`, not that the second-page query specifically skipped non-stuck rows between the cursor and the next stuck row. Lower priority because the SQL composition is straightforward, but adding a direct test would lock the invariant. **Recommendation** Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck, stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a `StuckOnly` page-size-1 query; (c) asserts each page returns exactly the stuck row, with no overlap and all 3 stuck rows visited.