2ed5c6c379
Concurrency hazards, DI lifetime hygiene, and one verify-only confirmation
across 8 modules. Highlights:
Concurrency:
- CentralUI-030: SandboxConsoleCapture writes routed through WriteSynchronized
locking on the captured StringWriter — intra-script Task fan-out can no
longer corrupt the per-call buffer.
- Commons-021: ExternalCallResult.Response now backed by Lazy<dynamic?>
(ExecutionAndPublication) — no more benign double-parse race.
- CD-017: DeploymentManagerRepository.DeleteDeploymentRecordAsync now takes
an expected RowVersion and seeds entry.OriginalValues so EF emits
DELETE ... WHERE Id=@id AND RowVersion=@prior; stale RowVersion now
throws DbUpdateConcurrencyException instead of silent overwrite.
- Transport-009: AuditCorrelationContext.BundleImportId backed by
AsyncLocal<Guid?> so concurrent imports get per-logical-call isolation
(was a scoped instance shared via AuditService across runs).
DI / lifetime:
- AuditLog-003: All 3 AuditLog actor handlers switched to CreateAsyncScope
+ await using — async EF disposal no longer swallowed.
- AuditLog-007: INodeIdentityProvider resolution standardised on
GetRequiredService<>() (was mixed with GetService<>()).
- AuditLog-011: AddAuditLogHealthMetricsBridge guarded by sentinel
descriptor check — calling twice no longer double-registers the hosted
service.
Shutdown / supervision:
- SiteCallAudit-002: AkkaHostedService adds a CoordinatedShutdown
cluster-leave task (drain-site-call-audit-singleton) that issues a
bounded GracefulStop(10s) so failover waits for in-flight upserts.
Registration safety:
- NS-020: AkkaHostedService now guards NotificationForwarder S&F
registration with _notificationDeliveryHandlerRegistered + throws
InvalidOperationException on double-register to make the regression loud.
VERIFY-only closures:
- NotifOutbox-005: Confirmed already closed by CD-015 fix (ac96b83) —
NotificationOutboxRepository.InsertIfNotExistsAsync uses the same
raw-SQL IF NOT EXISTS + 2601/2627 swallow pattern; race eliminated.
5+ new regression tests (CentralUI sandbox WhenAll, ExternalCallResult
64-reader Barrier, AuditLog DI idempotency, RowVersion stale-throw,
SiteCallAudit-002 shutdown drain). Build clean; affected suites all green.
README regenerated: 65 open (was 75).
332 lines
16 KiB
Markdown
332 lines
16 KiB
Markdown
# Code Review — SiteCallAudit
|
|
|
|
| Field | Value |
|
|
|-------|-------|
|
|
| Module | `src/ScadaLink.SiteCallAudit` |
|
|
| Design doc | `docs/requirements/Component-SiteCallAudit.md` |
|
|
| Status | Reviewed |
|
|
| Last reviewed | 2026-05-28 |
|
|
| Reviewer | claude-agent |
|
|
| Commit reviewed | `1eb6e97` |
|
|
| Open findings | 5 |
|
|
|
|
## Summary
|
|
|
|
The module is small (one actor + DI extension + options class). The actor is a
|
|
central cluster singleton that exposes three responsibility groups: direct
|
|
`UpsertSiteCallCommand` ingest, paginated/KPI read handlers, and the central→site
|
|
Retry/Discard relay. Ingest idempotency is delegated to the repository's
|
|
monotonic-upsert (the CD-015 check-then-act window is mitigated by the
|
|
duplicate-key swallow on the insert leg). Findings cluster around two themes:
|
|
(a) the `SupervisorStrategy` override is dead-code that contradicts the XML
|
|
docstring — it governs children, and this actor has none, so the documented
|
|
"Resume on leaked exception" promise is unenforced; (b) several smaller drifts
|
|
between the design doc and the code (reconciliation puller + daily purge
|
|
schedule are still deferred; `OnUpsertAsync` does not stamp `IngestedAtUtc`
|
|
unlike the dual-write path). The relay path is well covered by Akka TestKit
|
|
unit tests; the ingest + KPI paths are covered by MSSQL-backed integration
|
|
tests using a shared `MsSqlMigrationFixture`.
|
|
|
|
## Checklist coverage
|
|
|
|
| # | Category | Examined | Notes |
|
|
|---|----------|----------|-------|
|
|
| 1 | Correctness & logic bugs | Yes | `OnUpsertAsync` does not refresh `IngestedAtUtc` (Finding 003). |
|
|
| 2 | Akka.NET conventions | Yes | `SupervisorStrategy()` override is dead code (Finding 001). `Sender` correctly captured before first await on every handler. `PipeTo` used for read replies. |
|
|
| 3 | Concurrency & thread safety | Yes | `_centralCommunication` mutated only on actor thread via `RegisterCentralCommunication`. DI scope-per-message disposed in `try/finally`. No issues found. |
|
|
| 4 | Error handling & resilience | Yes | Ingest catches all + replies `Accepted=false`. Relay distinguishes `SiteUnreachable` vs `OperationFailed`. Failover handover does not wait for in-flight async work (Finding 002). |
|
|
| 5 | Security | Yes | All SQL is parameterised at the repository (FromSqlInterpolated). Relay carries no user-controlled strings beyond `SourceSite`. No issues found. |
|
|
| 6 | Performance & resource management | Yes | DI scope-per-message correctly disposed. `MaxPageSize=200` clamp present. No issues found. |
|
|
| 7 | Design-document adherence | Yes | Reconciliation puller and daily terminal-purge scheduler still deferred; design doc reads as if they ship (Finding 004). |
|
|
| 8 | Code organization & conventions | Yes | `RegisterCentralCommunication` is a top-level record colocated with the actor — by design (carries `IActorRef`, cannot live in Commons). No issues found. |
|
|
| 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). |
|
|
| 10 | Documentation & comments | Yes | XML docstring claims `SupervisorStrategy` uses Resume — incorrect (Finding 001). `AckErrorMessage` switch arm for `SiteUnreachable` falls through instead of throwing (Finding 005). |
|
|
|
|
## Findings
|
|
|
|
### SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Akka.NET conventions |
|
|
| Status | Open |
|
|
| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:32-46`, `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:147-151` |
|
|
|
|
**Description**
|
|
|
|
The XML remarks block (lines 32-46) states:
|
|
|
|
> "The `SupervisorStrategy` uses `Resume` so an unexpected throw before the catch (defence in depth) does not restart the actor and reset in-flight state."
|
|
|
|
The override at lines 147-151 returns a `OneForOneStrategy` with `DefaultDecider`
|
|
and `maxNrOfRetries: 0`. Two problems compound:
|
|
|
|
1. `ActorBase.SupervisorStrategy()` governs the actor's **children**, not the
|
|
actor itself. `SiteCallAuditActor` creates no children, so this override is
|
|
dead code.
|
|
2. The returned strategy uses `DefaultDecider` (Restart for most exceptions),
|
|
**not** `Directive.Resume`. So even if the actor did have children, the
|
|
strategy would not be Resume — it would be the default Restart-on-most-faults
|
|
behaviour with `maxNrOfRetries: 0` (which forces a Stop after the first
|
|
failure).
|
|
|
|
Net effect: the actor's own self-supervision is whatever the parent supplies
|
|
(`SupervisorStrategy.DefaultDecider` from the singleton manager / user
|
|
guardian in tests), which Restarts on most exceptions. If the `try/catch` in
|
|
`OnUpsertAsync` ever leaked (e.g. a synchronous throw constructing `replyTo`),
|
|
the actor would Restart, reset `_centralCommunication` to null, and silently
|
|
break the relay until `RegisterCentralCommunication` runs again.
|
|
|
|
This same pattern (with the same misleading XML doc) exists in
|
|
`AuditLogIngestActor`, `AuditLogPurgeActor`, and `SiteAuditReconciliationActor`
|
|
— they were likely cargo-culted; this finding documents the local instance.
|
|
|
|
**Recommendation**
|
|
|
|
Either:
|
|
|
|
- Remove the `SupervisorStrategy()` override entirely (it does nothing useful)
|
|
and revise the XML comment to drop the "Resume" claim. Self-supervision is
|
|
the parent's concern (the cluster singleton manager); the `try/catch` in
|
|
`OnUpsertAsync` is what actually keeps the actor alive.
|
|
- Or, if Resume-on-self-throw is actually desired, that requires wiring a
|
|
custom supervisor in the parent (`ClusterSingletonManager`) — not overriding
|
|
`SupervisorStrategy()` here. Simpler path: keep the `try/catch`, drop the
|
|
override.
|
|
|
|
The CLAUDE.md "Resume for coordinator actors" decision applies to actors with
|
|
children (Site Runtime hierarchy) — not to leaf cluster singletons.
|
|
|
|
**Resolution**
|
|
|
|
_Unresolved._
|
|
|
|
### SiteCallAudit-002 — Singleton failover does not wait for in-flight async upserts
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Error handling & resilience |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.Host/Actors/AkkaHostedService.cs:455-462` (singleton wiring), `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:153-193` |
|
|
|
|
**Resolution (2026-05-28):** Added a `CoordinatedShutdown` task in the `cluster-leave` phase (named `drain-site-call-audit-singleton`) that issues an explicit `GracefulStop(10s)` to the `SiteCallAudit` cluster singleton manager before the cluster-leave proceeds. Akka.NET's singleton handover already waits for the active actor's `ReceiveAsync` task to complete before signalling `HandOverDone`, so an in-flight EF `UpsertAsync` (and its SQL round-trip) drains on the old node before the new singleton starts on the other central node — closing the seam where the new singleton could race a still-running upsert on the old node. The 10-second timeout is bounded so a misbehaving upsert cannot stall coordinated shutdown indefinitely; on timeout the existing `PoisonPill` termination path takes over and the repository's monotonic-upsert + 2601/2627 duplicate-key swallow remain as the storage-state safety net. Pattern is suitable for the `NotificationOutbox` singleton too; deferred to keep this change scoped.
|
|
|
|
**Description**
|
|
|
|
The singleton is created with `terminationMessage: PoisonPill.Instance`. On
|
|
failover the active node's singleton stops as soon as the mailbox is drained
|
|
of normal messages and the PoisonPill is processed. An in-flight
|
|
`OnUpsertAsync` Task started before the PoisonPill arrived will be allowed to
|
|
complete (the message-handler runs synchronously from the mailbox's view),
|
|
but the Akka actor model does NOT cancel the EF Core
|
|
`ExecuteSqlInterpolatedAsync` call.
|
|
|
|
Two consequences:
|
|
|
|
1. The new singleton on the other node may begin accepting
|
|
`UpsertSiteCallCommand` for the same `TrackedOperationId` while the old
|
|
singleton's in-flight upsert is still running. The repository's
|
|
monotonic-upsert and the SQL duplicate-key swallow protect storage state.
|
|
2. The original `replyTo` sender may receive its `Accepted=true` after the new
|
|
singleton has already returned a different reply. Idempotency keys protect
|
|
correctness; wire-level ordering is best-effort by design.
|
|
|
|
This is consistent with the design ("eventually-consistent mirror, sites are
|
|
source of truth"), but worth documenting as an explicit invariant. The
|
|
Notification Outbox sibling has the same pattern.
|
|
|
|
**Recommendation**
|
|
|
|
- Document the failover/handover semantics in the actor's XML remarks: "On
|
|
cluster singleton handover, in-flight `OnUpsertAsync` tasks complete on the
|
|
old node and may produce a late `Accepted=true` reply; the repository's
|
|
monotonic upsert ensures storage state is consistent."
|
|
- Add an integration test that deliberately races two concurrent upserts on
|
|
the same `TrackedOperationId` to verify the duplicate-key swallow +
|
|
monotonic rank check (the CD-015 race-pattern check the parent task
|
|
flagged).
|
|
|
|
### SiteCallAudit-003 — `OnUpsertAsync` does not refresh `IngestedAtUtc`; direct-write callers must remember to stamp it
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Medium |
|
|
| Category | Correctness & logic bugs |
|
|
| Status | Open |
|
|
| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:153-193` |
|
|
|
|
**Description**
|
|
|
|
The combined-telemetry hot path (`AuditLogIngestActor.OnCachedTelemetryAsync`)
|
|
stamps `IngestedAtUtc = DateTime.UtcNow` on both the `AuditLog` row and the
|
|
`SiteCall` row at central-side persist time
|
|
(`src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:238-239`). The design
|
|
doc treats `IngestedAtUtc` as "central ingested (or last refreshed) this row"
|
|
— a central-side timestamp.
|
|
|
|
`SiteCallAuditActor.OnUpsertAsync` writes the supplied `SiteCall` as-is, with
|
|
whatever `IngestedAtUtc` the caller stamped. The only current callers are the
|
|
unit tests (which use `DateTime.UtcNow` at command-construction time). Once
|
|
the deferred reconciliation puller lands and starts emitting
|
|
`UpsertSiteCallCommand`s, the puller (running on central) is responsible for
|
|
stamping a central timestamp — but if a future direct-write caller forgets,
|
|
or constructs from a site DTO, the value could drift (e.g. become a site
|
|
clock value).
|
|
|
|
This is currently latent because no production caller exists, but it's
|
|
inconsistent with the dual-write code path and undocumented.
|
|
|
|
**Recommendation**
|
|
|
|
- Either: stamp `IngestedAtUtc = DateTime.UtcNow` inside `OnUpsertAsync`
|
|
before calling `UpsertAsync` (matching `AuditLogIngestActor`'s behaviour),
|
|
using `cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }`.
|
|
- Or: document in the `UpsertSiteCallCommand` XML that callers MUST stamp
|
|
`IngestedAtUtc` to a central-side `DateTime.UtcNow` immediately before
|
|
sending.
|
|
|
|
Preferred: stamp inside the actor — same as the combined-telemetry path —
|
|
because callers cannot in general know the actor is colocated on central.
|
|
|
|
**Resolution**
|
|
|
|
_Unresolved._
|
|
|
|
### SiteCallAudit-004 — Reconciliation puller and daily terminal-purge scheduler still deferred; design-doc drift
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Design-document adherence |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:23-30` (actor XML), `src/ScadaLink.SiteCallAudit/ServiceCollectionExtensions.cs:8-13`, `docs/requirements/Component-SiteCallAudit.md:24-32` |
|
|
|
|
**Description**
|
|
|
|
The design doc (`Component-SiteCallAudit.md` lines 24-32) lists five
|
|
responsibilities, including:
|
|
|
|
- "Run periodic per-site reconciliation pulls so missed telemetry self-heals."
|
|
- "Purge terminal audit rows after a configurable retention window."
|
|
|
|
The repository exposes `PurgeTerminalAsync` but nothing in this module
|
|
schedules a daily call (Notification Outbox owns a `MaintenanceService` for
|
|
its equivalent; no `SiteCallAuditMaintenanceService` exists). The
|
|
reconciliation puller is acknowledged in the actor XML
|
|
(`only reconciliation remains deferred`) but is not surfaced in the design
|
|
doc as deferred — the doc reads as if it ships.
|
|
|
|
**Recommendation**
|
|
|
|
- Either: implement the deferred pieces (a hosted service that wakes daily
|
|
and calls `repo.PurgeTerminalAsync(now - retentionWindow)`, plus a per-site
|
|
reconciliation puller with a cursor + an `IPullCachedTelemetryClient`).
|
|
- Or: add a "Status" / "Deferred" subsection to the design doc explicitly
|
|
listing what's not yet implemented (matches the pattern Audit Log uses for
|
|
its tamper-evidence hash chain).
|
|
|
|
**Resolution (2026-05-28):**
|
|
|
|
Updated the class-level XML on `SiteCallAuditActor` to reflect actual state:
|
|
telemetry ingest, query/detail/KPI handlers (Task 4), and the central→site
|
|
Retry/Discard relay (Task 5) are implemented; the periodic reconciliation
|
|
puller and the daily terminal-row purge scheduler remain deferred. The design
|
|
doc update is tracked separately.
|
|
|
|
### SiteCallAudit-005 — `AckErrorMessage` switch arm for `SiteUnreachable` returns ack message instead of throwing
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Documentation & comments |
|
|
| Status | Resolved |
|
|
| Location | `src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:548-563` |
|
|
|
|
**Description**
|
|
|
|
```csharp
|
|
return outcome switch
|
|
{
|
|
SiteCallRelayOutcome.Applied => null,
|
|
SiteCallRelayOutcome.NotParked => "The operation is no longer parked at the site (...)",
|
|
SiteCallRelayOutcome.OperationFailed => ack.ErrorMessage,
|
|
// SiteUnreachable is never produced from a ParkedOperationActionAck —
|
|
// unreachable responses are built by UnreachableRetry/UnreachableDiscard
|
|
// before any ack is classified, so this arm is unreachable by construction.
|
|
SiteCallRelayOutcome.SiteUnreachable => ack.ErrorMessage,
|
|
_ => throw new ArgumentOutOfRangeException(...)
|
|
};
|
|
```
|
|
|
|
The comment correctly states the `SiteUnreachable` arm is unreachable when
|
|
called from `ClassifyAck`. The arm therefore exists only to satisfy
|
|
exhaustiveness, but instead of throwing or returning a sentinel, it falls
|
|
through to `ack.ErrorMessage` — indistinguishable from the `OperationFailed`
|
|
arm above. If any future caller *does* feed `SiteUnreachable` into
|
|
`AckErrorMessage` (e.g. via refactor), the result will be a silent
|
|
wrong-detail-text bug rather than an immediate crash. The default arm
|
|
correctly throws `ArgumentOutOfRangeException`, so the `SiteUnreachable` arm
|
|
is the inconsistent one.
|
|
|
|
**Recommendation**
|
|
|
|
Replace the `SiteUnreachable => ack.ErrorMessage` arm with:
|
|
|
|
```csharp
|
|
SiteCallRelayOutcome.SiteUnreachable =>
|
|
throw new InvalidOperationException(
|
|
"AckErrorMessage cannot be called for SiteUnreachable — those responses "
|
|
+ "are built by UnreachableRetry/UnreachableDiscard before classification."),
|
|
```
|
|
|
|
— fail fast if the invariant is ever violated by a refactor.
|
|
|
|
**Resolution (2026-05-28):**
|
|
|
|
Behaviour kept (return `ack.ErrorMessage`); `AckErrorMessage` stays total and
|
|
side-effect-free. Expanded the inline comment on the `SiteUnreachable` arm to
|
|
explain WHY it returns rather than throws: site-unreachable is classified as
|
|
transient by the upstream relay (which has already built its
|
|
`SiteUnreachable` response and detail text via `SiteUnreachableMessage`), so a
|
|
defensive fall-through surfaces the ack's message and lets the caller schedule
|
|
a retry — throwing would turn a benign refactor invariant violation into a
|
|
relay-path crash.
|
|
|
|
### SiteCallAudit-006 — Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor
|
|
|
|
| | |
|
|
|--|--|
|
|
| Severity | Low |
|
|
| Category | Testing coverage |
|
|
| Status | Open |
|
|
| Location | `tests/ScadaLink.SiteCallAudit.Tests/SiteCallAuditActorTests.cs:335-392` |
|
|
|
|
**Description**
|
|
|
|
`SiteCallQueryRequest_StuckOnly_PagesAreFull_NoEmptyPagesWithCursor` covers
|
|
the case where stuck rows are interleaved with non-stuck rows (page-1 returns
|
|
2 stuck rows, page-2 returns the third). It does not cover the edge where
|
|
the row at the keyset cursor boundary (`AfterCreatedAtUtc + AfterId`) is
|
|
itself a non-stuck row — i.e. the cursor points at a row the next page must
|
|
SKIP through to find more stuck rows. The repository's SQL composes the
|
|
cursor predicate (`CreatedAtUtc < cursor OR (CreatedAtUtc = cursor AND id <
|
|
...)`) with the stuck predicate, so it should be honest, but the test only
|
|
asserts row counts and `IsStuck`, not that the second-page query specifically
|
|
skipped non-stuck rows between the cursor and the next stuck row.
|
|
|
|
Lower priority because the SQL composition is straightforward, but adding a
|
|
direct test would lock the invariant.
|
|
|
|
**Recommendation**
|
|
|
|
Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck,
|
|
stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a `StuckOnly`
|
|
page-size-1 query; (c) asserts each page returns exactly the stuck row, with
|
|
no overlap and all 3 stuck rows visited.
|
|
|
|
**Resolution**
|
|
|
|
_Unresolved._
|