Files
ScadaBridge/code-reviews/SiteCallAudit/findings.md
T
Joseph Doherty 7b0b9c7365 refactor: rename ScadaLink → ZB.MOM.WW.ScadaBridge (code + projects + namespaces)
Solution + 23 src projects + 26 test projects renamed; folders, csproj,
namespaces, and ScadaLinkDbContext/ScadaBridgeDbContext class updated.
ActorSystem "scadalink" → "scadabridge", Akka seed-node URLs migrated.
SQL roles/logins, LDAP domains, CLI command name, and CLI config dir
(~/.scadalink → ~/.scadabridge) also renamed.

Build green; 5 Host.Tests fail awaiting SQL login rename in next commit.
Pre-existing StaleTagMonitor timing flakes unchanged.

Rename script committed at tools/rename-to-scadabridge.sh.
2026-05-28 09:37:45 -04:00

326 lines
19 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Code Review — SiteCallAudit
| Field | Value |
|-------|-------|
| Module | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit` |
| Design doc | `docs/requirements/Component-SiteCallAudit.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 2 |
## Summary
The module is small (one actor + DI extension + options class). The actor is a
central cluster singleton that exposes three responsibility groups: direct
`UpsertSiteCallCommand` ingest, paginated/KPI read handlers, and the central→site
Retry/Discard relay. Ingest idempotency is delegated to the repository's
monotonic-upsert (the CD-015 check-then-act window is mitigated by the
duplicate-key swallow on the insert leg). Findings cluster around two themes:
(a) the `SupervisorStrategy` override is dead-code that contradicts the XML
docstring — it governs children, and this actor has none, so the documented
"Resume on leaked exception" promise is unenforced; (b) several smaller drifts
between the design doc and the code (reconciliation puller + daily purge
schedule are still deferred; `OnUpsertAsync` does not stamp `IngestedAtUtc`
unlike the dual-write path). The relay path is well covered by Akka TestKit
unit tests; the ingest + KPI paths are covered by MSSQL-backed integration
tests using a shared `MsSqlMigrationFixture`.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | `OnUpsertAsync` does not refresh `IngestedAtUtc` (Finding 003). |
| 2 | Akka.NET conventions | Yes | `SupervisorStrategy()` override is dead code (Finding 001). `Sender` correctly captured before first await on every handler. `PipeTo` used for read replies. |
| 3 | Concurrency & thread safety | Yes | `_centralCommunication` mutated only on actor thread via `RegisterCentralCommunication`. DI scope-per-message disposed in `try/finally`. No issues found. |
| 4 | Error handling & resilience | Yes | Ingest catches all + replies `Accepted=false`. Relay distinguishes `SiteUnreachable` vs `OperationFailed`. Failover handover does not wait for in-flight async work (Finding 002). |
| 5 | Security | Yes | All SQL is parameterised at the repository (FromSqlInterpolated). Relay carries no user-controlled strings beyond `SourceSite`. No issues found. |
| 6 | Performance & resource management | Yes | DI scope-per-message correctly disposed. `MaxPageSize=200` clamp present. No issues found. |
| 7 | Design-document adherence | Yes | Reconciliation puller and daily terminal-purge scheduler still deferred; design doc reads as if they ship (Finding 004). |
| 8 | Code organization & conventions | Yes | `RegisterCentralCommunication` is a top-level record colocated with the actor — by design (carries `IActorRef`, cannot live in Commons). No issues found. |
| 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). |
| 10 | Documentation & comments | Yes | XML docstring claims `SupervisorStrategy` uses Resume — incorrect (Finding 001). `AckErrorMessage` switch arm for `SiteUnreachable` falls through instead of throwing (Finding 005). |
## Findings
### SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced
| | |
|--|--|
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:32-46`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:147-151` |
**Description**
The XML remarks block (lines 32-46) states:
> "The `SupervisorStrategy` uses `Resume` so an unexpected throw before the catch (defence in depth) does not restart the actor and reset in-flight state."
The override at lines 147-151 returns a `OneForOneStrategy` with `DefaultDecider`
and `maxNrOfRetries: 0`. Two problems compound:
1. `ActorBase.SupervisorStrategy()` governs the actor's **children**, not the
actor itself. `SiteCallAuditActor` creates no children, so this override is
dead code.
2. The returned strategy uses `DefaultDecider` (Restart for most exceptions),
**not** `Directive.Resume`. So even if the actor did have children, the
strategy would not be Resume — it would be the default Restart-on-most-faults
behaviour with `maxNrOfRetries: 0` (which forces a Stop after the first
failure).
Net effect: the actor's own self-supervision is whatever the parent supplies
(`SupervisorStrategy.DefaultDecider` from the singleton manager / user
guardian in tests), which Restarts on most exceptions. If the `try/catch` in
`OnUpsertAsync` ever leaked (e.g. a synchronous throw constructing `replyTo`),
the actor would Restart, reset `_centralCommunication` to null, and silently
break the relay until `RegisterCentralCommunication` runs again.
This same pattern (with the same misleading XML doc) exists in
`AuditLogIngestActor`, `AuditLogPurgeActor`, and `SiteAuditReconciliationActor`
— they were likely cargo-culted; this finding documents the local instance.
**Recommendation**
Either:
- Remove the `SupervisorStrategy()` override entirely (it does nothing useful)
and revise the XML comment to drop the "Resume" claim. Self-supervision is
the parent's concern (the cluster singleton manager); the `try/catch` in
`OnUpsertAsync` is what actually keeps the actor alive.
- Or, if Resume-on-self-throw is actually desired, that requires wiring a
custom supervisor in the parent (`ClusterSingletonManager`) — not overriding
`SupervisorStrategy()` here. Simpler path: keep the `try/catch`, drop the
override.
The CLAUDE.md "Resume for coordinator actors" decision applies to actors with
children (Site Runtime hierarchy) — not to leaf cluster singletons.
**Resolution (2026-05-28):** Rewrote the class-level XML on `SiteCallAuditActor` plus the method-level XML on `SupervisorStrategy()` to accurately describe what the override does — a one-for-one strategy with `DefaultDecider` (Restart on most exceptions, Stop on `ActorInitializationException`/`ActorKilledException`) and `maxNrOfRetries: 0`, governing the actor's *children* (the actor has none today, so the override is currently inert). Dropped the misleading "Resume" claim. The new docs make clear that self-supervision of this cluster singleton is the parent `ClusterSingletonManager`'s concern and the actor's own resilience comes from the in-handler `try/catch` in `OnUpsertAsync`, not from this override. No behaviour change — pure documentation fix; existing 24 SiteCallAudit tests remain green.
### SiteCallAudit-002 — Singleton failover does not wait for in-flight async upserts
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:455-462` (singleton wiring), `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:153-193` |
**Resolution (2026-05-28):** Added a `CoordinatedShutdown` task in the `cluster-leave` phase (named `drain-site-call-audit-singleton`) that issues an explicit `GracefulStop(10s)` to the `SiteCallAudit` cluster singleton manager before the cluster-leave proceeds. Akka.NET's singleton handover already waits for the active actor's `ReceiveAsync` task to complete before signalling `HandOverDone`, so an in-flight EF `UpsertAsync` (and its SQL round-trip) drains on the old node before the new singleton starts on the other central node — closing the seam where the new singleton could race a still-running upsert on the old node. The 10-second timeout is bounded so a misbehaving upsert cannot stall coordinated shutdown indefinitely; on timeout the existing `PoisonPill` termination path takes over and the repository's monotonic-upsert + 2601/2627 duplicate-key swallow remain as the storage-state safety net. Pattern is suitable for the `NotificationOutbox` singleton too; deferred to keep this change scoped.
**Description**
The singleton is created with `terminationMessage: PoisonPill.Instance`. On
failover the active node's singleton stops as soon as the mailbox is drained
of normal messages and the PoisonPill is processed. An in-flight
`OnUpsertAsync` Task started before the PoisonPill arrived will be allowed to
complete (the message-handler runs synchronously from the mailbox's view),
but the Akka actor model does NOT cancel the EF Core
`ExecuteSqlInterpolatedAsync` call.
Two consequences:
1. The new singleton on the other node may begin accepting
`UpsertSiteCallCommand` for the same `TrackedOperationId` while the old
singleton's in-flight upsert is still running. The repository's
monotonic-upsert and the SQL duplicate-key swallow protect storage state.
2. The original `replyTo` sender may receive its `Accepted=true` after the new
singleton has already returned a different reply. Idempotency keys protect
correctness; wire-level ordering is best-effort by design.
This is consistent with the design ("eventually-consistent mirror, sites are
source of truth"), but worth documenting as an explicit invariant. The
Notification Outbox sibling has the same pattern.
**Recommendation**
- Document the failover/handover semantics in the actor's XML remarks: "On
cluster singleton handover, in-flight `OnUpsertAsync` tasks complete on the
old node and may produce a late `Accepted=true` reply; the repository's
monotonic upsert ensures storage state is consistent."
- Add an integration test that deliberately races two concurrent upserts on
the same `TrackedOperationId` to verify the duplicate-key swallow +
monotonic rank check (the CD-015 race-pattern check the parent task
flagged).
### SiteCallAudit-003 — `OnUpsertAsync` does not refresh `IngestedAtUtc`; direct-write callers must remember to stamp it
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:153-193` |
**Description**
The combined-telemetry hot path (`AuditLogIngestActor.OnCachedTelemetryAsync`)
stamps `IngestedAtUtc = DateTime.UtcNow` on both the `AuditLog` row and the
`SiteCall` row at central-side persist time
(`src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:238-239`). The design
doc treats `IngestedAtUtc` as "central ingested (or last refreshed) this row"
— a central-side timestamp.
`SiteCallAuditActor.OnUpsertAsync` writes the supplied `SiteCall` as-is, with
whatever `IngestedAtUtc` the caller stamped. The only current callers are the
unit tests (which use `DateTime.UtcNow` at command-construction time). Once
the deferred reconciliation puller lands and starts emitting
`UpsertSiteCallCommand`s, the puller (running on central) is responsible for
stamping a central timestamp — but if a future direct-write caller forgets,
or constructs from a site DTO, the value could drift (e.g. become a site
clock value).
This is currently latent because no production caller exists, but it's
inconsistent with the dual-write code path and undocumented.
**Recommendation**
- Either: stamp `IngestedAtUtc = DateTime.UtcNow` inside `OnUpsertAsync`
before calling `UpsertAsync` (matching `AuditLogIngestActor`'s behaviour),
using `cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }`.
- Or: document in the `UpsertSiteCallCommand` XML that callers MUST stamp
`IngestedAtUtc` to a central-side `DateTime.UtcNow` immediately before
sending.
Preferred: stamp inside the actor — same as the combined-telemetry path —
because callers cannot in general know the actor is colocated on central.
**Resolution (2026-05-28):** `OnUpsertAsync` now rewrites the incoming `SiteCall` via `cmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }` immediately before calling `repository.UpsertAsync`, mirroring `AuditLogIngestActor`'s combined-telemetry hot path. The repository writes `IngestedAtUtc` on both the insert-if-not-exists and the monotonic UPDATE legs (`SiteCallAuditRepository.UpsertAsync`), so the column is writable on every upsert. Callers (telemetry, the deferred reconciliation puller, any future direct-write) no longer need to remember to stamp a central-side timestamp — the actor owns it. Existing 24 SiteCallAudit tests remain green (the MSSQL-fixture test constructs rows with `DateTime.UtcNow` and doesn't assert the exact value, so the actor's re-stamp is backward compatible).
### SiteCallAudit-004 — Reconciliation puller and daily terminal-purge scheduler still deferred; design-doc drift
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:23-30` (actor XML), `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/ServiceCollectionExtensions.cs:8-13`, `docs/requirements/Component-SiteCallAudit.md:24-32` |
**Description**
The design doc (`Component-SiteCallAudit.md` lines 24-32) lists five
responsibilities, including:
- "Run periodic per-site reconciliation pulls so missed telemetry self-heals."
- "Purge terminal audit rows after a configurable retention window."
The repository exposes `PurgeTerminalAsync` but nothing in this module
schedules a daily call (Notification Outbox owns a `MaintenanceService` for
its equivalent; no `SiteCallAuditMaintenanceService` exists). The
reconciliation puller is acknowledged in the actor XML
(`only reconciliation remains deferred`) but is not surfaced in the design
doc as deferred — the doc reads as if it ships.
**Recommendation**
- Either: implement the deferred pieces (a hosted service that wakes daily
and calls `repo.PurgeTerminalAsync(now - retentionWindow)`, plus a per-site
reconciliation puller with a cursor + an `IPullCachedTelemetryClient`).
- Or: add a "Status" / "Deferred" subsection to the design doc explicitly
listing what's not yet implemented (matches the pattern Audit Log uses for
its tamper-evidence hash chain).
**Resolution (2026-05-28):**
Updated the class-level XML on `SiteCallAuditActor` to reflect actual state:
telemetry ingest, query/detail/KPI handlers (Task 4), and the central→site
Retry/Discard relay (Task 5) are implemented; the periodic reconciliation
puller and the daily terminal-row purge scheduler remain deferred. The design
doc update is tracked separately.
### SiteCallAudit-005 — `AckErrorMessage` switch arm for `SiteUnreachable` returns ack message instead of throwing
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:548-563` |
**Description**
```csharp
return outcome switch
{
SiteCallRelayOutcome.Applied => null,
SiteCallRelayOutcome.NotParked => "The operation is no longer parked at the site (...)",
SiteCallRelayOutcome.OperationFailed => ack.ErrorMessage,
// SiteUnreachable is never produced from a ParkedOperationActionAck —
// unreachable responses are built by UnreachableRetry/UnreachableDiscard
// before any ack is classified, so this arm is unreachable by construction.
SiteCallRelayOutcome.SiteUnreachable => ack.ErrorMessage,
_ => throw new ArgumentOutOfRangeException(...)
};
```
The comment correctly states the `SiteUnreachable` arm is unreachable when
called from `ClassifyAck`. The arm therefore exists only to satisfy
exhaustiveness, but instead of throwing or returning a sentinel, it falls
through to `ack.ErrorMessage` — indistinguishable from the `OperationFailed`
arm above. If any future caller *does* feed `SiteUnreachable` into
`AckErrorMessage` (e.g. via refactor), the result will be a silent
wrong-detail-text bug rather than an immediate crash. The default arm
correctly throws `ArgumentOutOfRangeException`, so the `SiteUnreachable` arm
is the inconsistent one.
**Recommendation**
Replace the `SiteUnreachable => ack.ErrorMessage` arm with:
```csharp
SiteCallRelayOutcome.SiteUnreachable =>
throw new InvalidOperationException(
"AckErrorMessage cannot be called for SiteUnreachable — those responses "
+ "are built by UnreachableRetry/UnreachableDiscard before classification."),
```
— fail fast if the invariant is ever violated by a refactor.
**Resolution (2026-05-28):**
Behaviour kept (return `ack.ErrorMessage`); `AckErrorMessage` stays total and
side-effect-free. Expanded the inline comment on the `SiteUnreachable` arm to
explain WHY it returns rather than throws: site-unreachable is classified as
transient by the upstream relay (which has already built its
`SiteUnreachable` response and detail text via `SiteUnreachableMessage`), so a
defensive fall-through surfaces the ack's message and lets the caller schedule
a retry — throwing would turn a benign refactor invariant violation into a
relay-path crash.
### SiteCallAudit-006 — Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteCallAudit.Tests/SiteCallAuditActorTests.cs:335-392` |
**Resolution (2026-05-28):** Added `SiteCallQueryRequest_StuckOnly_CursorAtNonStuckBoundary_SkipsToNextStuckRow` to `tests/ZB.MOM.WW.ScadaBridge.SiteCallAudit.Tests/SiteCallAuditActorTests.cs` — drives six rows interleaved as `stuck/non-stuck` × 3 (oldest-first), then issues three page-size-1 stuck-only queries. The cursor between each page deliberately lands on a non-stuck row, so the SQL composition of the stuck predicate AND the keyset cursor predicate must skip it. Asserts each page returns exactly one stuck row in DESC-by-CreatedAtUtc order with no overlap and all three stuck rows visited. Locks the invariant that post-filtering does not produce under-filled pages with non-null next cursors.
**Description**
`SiteCallQueryRequest_StuckOnly_PagesAreFull_NoEmptyPagesWithCursor` covers
the case where stuck rows are interleaved with non-stuck rows (page-1 returns
2 stuck rows, page-2 returns the third). It does not cover the edge where
the row at the keyset cursor boundary (`AfterCreatedAtUtc + AfterId`) is
itself a non-stuck row — i.e. the cursor points at a row the next page must
SKIP through to find more stuck rows. The repository's SQL composes the
cursor predicate (`CreatedAtUtc < cursor OR (CreatedAtUtc = cursor AND id <
...)`) with the stuck predicate, so it should be honest, but the test only
asserts row counts and `IsStuck`, not that the second-page query specifically
skipped non-stuck rows between the cursor and the next stuck row.
Lower priority because the SQL composition is straightforward, but adding a
direct test would lock the invariant.
**Recommendation**
Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck,
stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a `StuckOnly`
page-size-1 query; (c) asserts each page returns exactly the stuck row, with
no overlap and all 3 stuck rows visited.