Files
ScadaBridge/code-reviews/SiteCallAudit/findings.md
T
Joseph Doherty 2ed5c6c379 fix(concurrency/lifetime): close Theme 5 — 10 concurrency / DI / scope findings
Concurrency hazards, DI lifetime hygiene, and one verify-only confirmation
across 8 modules. Highlights:

Concurrency:
- CentralUI-030: SandboxConsoleCapture writes routed through WriteSynchronized
  locking on the captured StringWriter — intra-script Task fan-out can no
  longer corrupt the per-call buffer.
- Commons-021: ExternalCallResult.Response now backed by Lazy<dynamic?>
  (ExecutionAndPublication) — no more benign double-parse race.
- CD-017: DeploymentManagerRepository.DeleteDeploymentRecordAsync now takes
  an expected RowVersion and seeds entry.OriginalValues so EF emits
  DELETE ... WHERE Id=@id AND RowVersion=@prior; stale RowVersion now
  throws DbUpdateConcurrencyException instead of silent overwrite.
- Transport-009: AuditCorrelationContext.BundleImportId backed by
  AsyncLocal<Guid?> so concurrent imports get per-logical-call isolation
  (was a scoped instance shared via AuditService across runs).

DI / lifetime:
- AuditLog-003: All 3 AuditLog actor handlers switched to CreateAsyncScope
  + await using — async EF disposal no longer swallowed.
- AuditLog-007: INodeIdentityProvider resolution standardised on
  GetRequiredService<>() (was mixed with GetService<>()).
- AuditLog-011: AddAuditLogHealthMetricsBridge guarded by sentinel
  descriptor check — calling twice no longer double-registers the hosted
  service.

Shutdown / supervision:
- SiteCallAudit-002: AkkaHostedService adds a CoordinatedShutdown
  cluster-leave task (drain-site-call-audit-singleton) that issues a
  bounded GracefulStop(10s) so failover waits for in-flight upserts.

Registration safety:
- NS-020: AkkaHostedService now guards NotificationForwarder S&F
  registration with _notificationDeliveryHandlerRegistered + throws
  InvalidOperationException on double-register to make the regression loud.

VERIFY-only closures:
- NotifOutbox-005: Confirmed already closed by CD-015 fix (ac96b83) —
  NotificationOutboxRepository.InsertIfNotExistsAsync uses the same
  raw-SQL IF NOT EXISTS + 2601/2627 swallow pattern; race eliminated.

5+ new regression tests (CentralUI sandbox WhenAll, ExternalCallResult
64-reader Barrier, AuditLog DI idempotency, RowVersion stale-throw,
SiteCallAudit-002 shutdown drain). Build clean; affected suites all green.
README regenerated: 65 open (was 75).
2026-05-28 07:29:41 -04:00

16 KiB

Code Review — SiteCallAudit

Field Value
Module src/ScadaLink.SiteCallAudit
Design doc docs/requirements/Component-SiteCallAudit.md
Status Reviewed
Last reviewed 2026-05-28
Reviewer claude-agent
Commit reviewed 1eb6e97
Open findings 5

Summary

The module is small (one actor + DI extension + options class). The actor is a central cluster singleton that exposes three responsibility groups: direct UpsertSiteCallCommand ingest, paginated/KPI read handlers, and the central→site Retry/Discard relay. Ingest idempotency is delegated to the repository's monotonic-upsert (the CD-015 check-then-act window is mitigated by the duplicate-key swallow on the insert leg). Findings cluster around two themes: (a) the SupervisorStrategy override is dead-code that contradicts the XML docstring — it governs children, and this actor has none, so the documented "Resume on leaked exception" promise is unenforced; (b) several smaller drifts between the design doc and the code (reconciliation puller + daily purge schedule are still deferred; OnUpsertAsync does not stamp IngestedAtUtc unlike the dual-write path). The relay path is well covered by Akka TestKit unit tests; the ingest + KPI paths are covered by MSSQL-backed integration tests using a shared MsSqlMigrationFixture.

Checklist coverage

# Category Examined Notes
1 Correctness & logic bugs Yes OnUpsertAsync does not refresh IngestedAtUtc (Finding 003).
2 Akka.NET conventions Yes SupervisorStrategy() override is dead code (Finding 001). Sender correctly captured before first await on every handler. PipeTo used for read replies.
3 Concurrency & thread safety Yes _centralCommunication mutated only on actor thread via RegisterCentralCommunication. DI scope-per-message disposed in try/finally. No issues found.
4 Error handling & resilience Yes Ingest catches all + replies Accepted=false. Relay distinguishes SiteUnreachable vs OperationFailed. Failover handover does not wait for in-flight async work (Finding 002).
5 Security Yes All SQL is parameterised at the repository (FromSqlInterpolated). Relay carries no user-controlled strings beyond SourceSite. No issues found.
6 Performance & resource management Yes DI scope-per-message correctly disposed. MaxPageSize=200 clamp present. No issues found.
7 Design-document adherence Yes Reconciliation puller and daily terminal-purge scheduler still deferred; design doc reads as if they ship (Finding 004).
8 Code organization & conventions Yes RegisterCentralCommunication is a top-level record colocated with the actor — by design (carries IActorRef, cannot live in Commons). No issues found.
9 Testing coverage Yes Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006).
10 Documentation & comments Yes XML docstring claims SupervisorStrategy uses Resume — incorrect (Finding 001). AckErrorMessage switch arm for SiteUnreachable falls through instead of throwing (Finding 005).

Findings

SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced

Severity Medium
Category Akka.NET conventions
Status Open
Location src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:32-46, src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:147-151

Description

The XML remarks block (lines 32-46) states:

"The SupervisorStrategy uses Resume so an unexpected throw before the catch (defence in depth) does not restart the actor and reset in-flight state."

The override at lines 147-151 returns a OneForOneStrategy with DefaultDecider and maxNrOfRetries: 0. Two problems compound:

  1. ActorBase.SupervisorStrategy() governs the actor's children, not the actor itself. SiteCallAuditActor creates no children, so this override is dead code.
  2. The returned strategy uses DefaultDecider (Restart for most exceptions), not Directive.Resume. So even if the actor did have children, the strategy would not be Resume — it would be the default Restart-on-most-faults behaviour with maxNrOfRetries: 0 (which forces a Stop after the first failure).

Net effect: the actor's own self-supervision is whatever the parent supplies (SupervisorStrategy.DefaultDecider from the singleton manager / user guardian in tests), which Restarts on most exceptions. If the try/catch in OnUpsertAsync ever leaked (e.g. a synchronous throw constructing replyTo), the actor would Restart, reset _centralCommunication to null, and silently break the relay until RegisterCentralCommunication runs again.

This same pattern (with the same misleading XML doc) exists in AuditLogIngestActor, AuditLogPurgeActor, and SiteAuditReconciliationActor — they were likely cargo-culted; this finding documents the local instance.

Recommendation

Either:

  • Remove the SupervisorStrategy() override entirely (it does nothing useful) and revise the XML comment to drop the "Resume" claim. Self-supervision is the parent's concern (the cluster singleton manager); the try/catch in OnUpsertAsync is what actually keeps the actor alive.
  • Or, if Resume-on-self-throw is actually desired, that requires wiring a custom supervisor in the parent (ClusterSingletonManager) — not overriding SupervisorStrategy() here. Simpler path: keep the try/catch, drop the override.

The CLAUDE.md "Resume for coordinator actors" decision applies to actors with children (Site Runtime hierarchy) — not to leaf cluster singletons.

Resolution

Unresolved.

SiteCallAudit-002 — Singleton failover does not wait for in-flight async upserts

Severity Low
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.Host/Actors/AkkaHostedService.cs:455-462 (singleton wiring), src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:153-193

Resolution (2026-05-28): Added a CoordinatedShutdown task in the cluster-leave phase (named drain-site-call-audit-singleton) that issues an explicit GracefulStop(10s) to the SiteCallAudit cluster singleton manager before the cluster-leave proceeds. Akka.NET's singleton handover already waits for the active actor's ReceiveAsync task to complete before signalling HandOverDone, so an in-flight EF UpsertAsync (and its SQL round-trip) drains on the old node before the new singleton starts on the other central node — closing the seam where the new singleton could race a still-running upsert on the old node. The 10-second timeout is bounded so a misbehaving upsert cannot stall coordinated shutdown indefinitely; on timeout the existing PoisonPill termination path takes over and the repository's monotonic-upsert + 2601/2627 duplicate-key swallow remain as the storage-state safety net. Pattern is suitable for the NotificationOutbox singleton too; deferred to keep this change scoped.

Description

The singleton is created with terminationMessage: PoisonPill.Instance. On failover the active node's singleton stops as soon as the mailbox is drained of normal messages and the PoisonPill is processed. An in-flight OnUpsertAsync Task started before the PoisonPill arrived will be allowed to complete (the message-handler runs synchronously from the mailbox's view), but the Akka actor model does NOT cancel the EF Core ExecuteSqlInterpolatedAsync call.

Two consequences:

  1. The new singleton on the other node may begin accepting UpsertSiteCallCommand for the same TrackedOperationId while the old singleton's in-flight upsert is still running. The repository's monotonic-upsert and the SQL duplicate-key swallow protect storage state.
  2. The original replyTo sender may receive its Accepted=true after the new singleton has already returned a different reply. Idempotency keys protect correctness; wire-level ordering is best-effort by design.

This is consistent with the design ("eventually-consistent mirror, sites are source of truth"), but worth documenting as an explicit invariant. The Notification Outbox sibling has the same pattern.

Recommendation

  • Document the failover/handover semantics in the actor's XML remarks: "On cluster singleton handover, in-flight OnUpsertAsync tasks complete on the old node and may produce a late Accepted=true reply; the repository's monotonic upsert ensures storage state is consistent."
  • Add an integration test that deliberately races two concurrent upserts on the same TrackedOperationId to verify the duplicate-key swallow + monotonic rank check (the CD-015 race-pattern check the parent task flagged).

SiteCallAudit-003 — OnUpsertAsync does not refresh IngestedAtUtc; direct-write callers must remember to stamp it

Severity Medium
Category Correctness & logic bugs
Status Open
Location src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:153-193

Description

The combined-telemetry hot path (AuditLogIngestActor.OnCachedTelemetryAsync) stamps IngestedAtUtc = DateTime.UtcNow on both the AuditLog row and the SiteCall row at central-side persist time (src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:238-239). The design doc treats IngestedAtUtc as "central ingested (or last refreshed) this row" — a central-side timestamp.

SiteCallAuditActor.OnUpsertAsync writes the supplied SiteCall as-is, with whatever IngestedAtUtc the caller stamped. The only current callers are the unit tests (which use DateTime.UtcNow at command-construction time). Once the deferred reconciliation puller lands and starts emitting UpsertSiteCallCommands, the puller (running on central) is responsible for stamping a central timestamp — but if a future direct-write caller forgets, or constructs from a site DTO, the value could drift (e.g. become a site clock value).

This is currently latent because no production caller exists, but it's inconsistent with the dual-write code path and undocumented.

Recommendation

  • Either: stamp IngestedAtUtc = DateTime.UtcNow inside OnUpsertAsync before calling UpsertAsync (matching AuditLogIngestActor's behaviour), using cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }.
  • Or: document in the UpsertSiteCallCommand XML that callers MUST stamp IngestedAtUtc to a central-side DateTime.UtcNow immediately before sending.

Preferred: stamp inside the actor — same as the combined-telemetry path — because callers cannot in general know the actor is colocated on central.

Resolution

Unresolved.

SiteCallAudit-004 — Reconciliation puller and daily terminal-purge scheduler still deferred; design-doc drift

Severity Low
Category Design-document adherence
Status Resolved
Location src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:23-30 (actor XML), src/ScadaLink.SiteCallAudit/ServiceCollectionExtensions.cs:8-13, docs/requirements/Component-SiteCallAudit.md:24-32

Description

The design doc (Component-SiteCallAudit.md lines 24-32) lists five responsibilities, including:

  • "Run periodic per-site reconciliation pulls so missed telemetry self-heals."
  • "Purge terminal audit rows after a configurable retention window."

The repository exposes PurgeTerminalAsync but nothing in this module schedules a daily call (Notification Outbox owns a MaintenanceService for its equivalent; no SiteCallAuditMaintenanceService exists). The reconciliation puller is acknowledged in the actor XML (only reconciliation remains deferred) but is not surfaced in the design doc as deferred — the doc reads as if it ships.

Recommendation

  • Either: implement the deferred pieces (a hosted service that wakes daily and calls repo.PurgeTerminalAsync(now - retentionWindow), plus a per-site reconciliation puller with a cursor + an IPullCachedTelemetryClient).
  • Or: add a "Status" / "Deferred" subsection to the design doc explicitly listing what's not yet implemented (matches the pattern Audit Log uses for its tamper-evidence hash chain).

Resolution (2026-05-28):

Updated the class-level XML on SiteCallAuditActor to reflect actual state: telemetry ingest, query/detail/KPI handlers (Task 4), and the central→site Retry/Discard relay (Task 5) are implemented; the periodic reconciliation puller and the daily terminal-row purge scheduler remain deferred. The design doc update is tracked separately.

SiteCallAudit-005 — AckErrorMessage switch arm for SiteUnreachable returns ack message instead of throwing

Severity Low
Category Documentation & comments
Status Resolved
Location src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:548-563

Description

return outcome switch
{
    SiteCallRelayOutcome.Applied => null,
    SiteCallRelayOutcome.NotParked => "The operation is no longer parked at the site (...)",
    SiteCallRelayOutcome.OperationFailed => ack.ErrorMessage,
    // SiteUnreachable is never produced from a ParkedOperationActionAck —
    // unreachable responses are built by UnreachableRetry/UnreachableDiscard
    // before any ack is classified, so this arm is unreachable by construction.
    SiteCallRelayOutcome.SiteUnreachable => ack.ErrorMessage,
    _ => throw new ArgumentOutOfRangeException(...)
};

The comment correctly states the SiteUnreachable arm is unreachable when called from ClassifyAck. The arm therefore exists only to satisfy exhaustiveness, but instead of throwing or returning a sentinel, it falls through to ack.ErrorMessage — indistinguishable from the OperationFailed arm above. If any future caller does feed SiteUnreachable into AckErrorMessage (e.g. via refactor), the result will be a silent wrong-detail-text bug rather than an immediate crash. The default arm correctly throws ArgumentOutOfRangeException, so the SiteUnreachable arm is the inconsistent one.

Recommendation

Replace the SiteUnreachable => ack.ErrorMessage arm with:

SiteCallRelayOutcome.SiteUnreachable =>
    throw new InvalidOperationException(
        "AckErrorMessage cannot be called for SiteUnreachable — those responses "
        + "are built by UnreachableRetry/UnreachableDiscard before classification."),

— fail fast if the invariant is ever violated by a refactor.

Resolution (2026-05-28):

Behaviour kept (return ack.ErrorMessage); AckErrorMessage stays total and side-effect-free. Expanded the inline comment on the SiteUnreachable arm to explain WHY it returns rather than throws: site-unreachable is classified as transient by the upstream relay (which has already built its SiteUnreachable response and detail text via SiteUnreachableMessage), so a defensive fall-through surfaces the ack's message and lets the caller schedule a retry — throwing would turn a benign refactor invariant violation into a relay-path crash.

SiteCallAudit-006 — Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor

Severity Low
Category Testing coverage
Status Open
Location tests/ScadaLink.SiteCallAudit.Tests/SiteCallAuditActorTests.cs:335-392

Description

SiteCallQueryRequest_StuckOnly_PagesAreFull_NoEmptyPagesWithCursor covers the case where stuck rows are interleaved with non-stuck rows (page-1 returns 2 stuck rows, page-2 returns the third). It does not cover the edge where the row at the keyset cursor boundary (AfterCreatedAtUtc + AfterId) is itself a non-stuck row — i.e. the cursor points at a row the next page must SKIP through to find more stuck rows. The repository's SQL composes the cursor predicate (CreatedAtUtc < cursor OR (CreatedAtUtc = cursor AND id < ...)) with the stuck predicate, so it should be honest, but the test only asserts row counts and IsStuck, not that the second-page query specifically skipped non-stuck rows between the cursor and the next stuck row.

Lower priority because the SQL composition is straightforward, but adding a direct test would lock the invariant.

Recommendation

Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck, stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a StuckOnly page-size-1 query; (c) asserts each page returns exactly the stuck row, with no overlap and all 3 stuck rows visited.

Resolution

Unresolved.