Doc/XML-comment drift + small adherence fixes across 17 modules. Highlights: - Host-017: site CoordinatedShutdown ordering — SiteStreamGrpcServer gains CancelAllStreams() (refuse new streams, cancel active), wired into Program.cs site branch via ApplicationStopping. - InboundAPI-021: ParentExecutionId now travels on RouteToGet/SetAttributes symmetric with RouteToCallRequest; RouteHelper stamps from _parentExecutionId. - ClusterInfra-012: ClusterOptionsValidator now requires both seed nodes. - Comm-018: SiteCommunicationActor.HeartbeatMessage.IsActive derived from cluster leader check (was hardcoded true). - DM-020: reconciliation audit row attributes the current user, not prior deployer. - SEL-019: EventLogPurgeService early-exits on standby via active-node check. - Plus comment/XML-doc accuracy fixes across AuditLog, ConfigurationDatabase, NotificationOutbox, SiteRuntime, SiteCallAudit; doc refreshes for Component- Commons / -ManagementService / -CLI / -ExternalSystemGateway / -HealthMonitoring / -Transport / -ConfigurationDatabase; CD-023 index-name doc alignment. 11 new regression tests (RouteHelper x4, SiteStreamGrpcServer x2, ClusterOptionsValidator x1, SiteCommunicationActor x1, DeploymentService x1, EventLogPurgeService x3). Build clean (0 warnings); InboundAPI/Communication/ Host suites all green. README regenerated: 112 open (was 136).
15 KiB
Code Review — SiteCallAudit
| Field | Value |
|---|---|
| Module | src/ScadaLink.SiteCallAudit |
| Design doc | docs/requirements/Component-SiteCallAudit.md |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | 1eb6e97 |
| Open findings | 6 |
Summary
The module is small (one actor + DI extension + options class). The actor is a
central cluster singleton that exposes three responsibility groups: direct
UpsertSiteCallCommand ingest, paginated/KPI read handlers, and the central→site
Retry/Discard relay. Ingest idempotency is delegated to the repository's
monotonic-upsert (the CD-015 check-then-act window is mitigated by the
duplicate-key swallow on the insert leg). Findings cluster around two themes:
(a) the SupervisorStrategy override is dead-code that contradicts the XML
docstring — it governs children, and this actor has none, so the documented
"Resume on leaked exception" promise is unenforced; (b) several smaller drifts
between the design doc and the code (reconciliation puller + daily purge
schedule are still deferred; OnUpsertAsync does not stamp IngestedAtUtc
unlike the dual-write path). The relay path is well covered by Akka TestKit
unit tests; the ingest + KPI paths are covered by MSSQL-backed integration
tests using a shared MsSqlMigrationFixture.
Checklist coverage
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | Yes | OnUpsertAsync does not refresh IngestedAtUtc (Finding 003). |
| 2 | Akka.NET conventions | Yes | SupervisorStrategy() override is dead code (Finding 001). Sender correctly captured before first await on every handler. PipeTo used for read replies. |
| 3 | Concurrency & thread safety | Yes | _centralCommunication mutated only on actor thread via RegisterCentralCommunication. DI scope-per-message disposed in try/finally. No issues found. |
| 4 | Error handling & resilience | Yes | Ingest catches all + replies Accepted=false. Relay distinguishes SiteUnreachable vs OperationFailed. Failover handover does not wait for in-flight async work (Finding 002). |
| 5 | Security | Yes | All SQL is parameterised at the repository (FromSqlInterpolated). Relay carries no user-controlled strings beyond SourceSite. No issues found. |
| 6 | Performance & resource management | Yes | DI scope-per-message correctly disposed. MaxPageSize=200 clamp present. No issues found. |
| 7 | Design-document adherence | Yes | Reconciliation puller and daily terminal-purge scheduler still deferred; design doc reads as if they ship (Finding 004). |
| 8 | Code organization & conventions | Yes | RegisterCentralCommunication is a top-level record colocated with the actor — by design (carries IActorRef, cannot live in Commons). No issues found. |
| 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). |
| 10 | Documentation & comments | Yes | XML docstring claims SupervisorStrategy uses Resume — incorrect (Finding 001). AckErrorMessage switch arm for SiteUnreachable falls through instead of throwing (Finding 005). |
Findings
SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced
| Severity | Medium |
| Category | Akka.NET conventions |
| Status | Open |
| Location | src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:32-46, src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:147-151 |
Description
The XML remarks block (lines 32-46) states:
"The
SupervisorStrategyusesResumeso an unexpected throw before the catch (defence in depth) does not restart the actor and reset in-flight state."
The override at lines 147-151 returns a OneForOneStrategy with DefaultDecider
and maxNrOfRetries: 0. Two problems compound:
ActorBase.SupervisorStrategy()governs the actor's children, not the actor itself.SiteCallAuditActorcreates no children, so this override is dead code.- The returned strategy uses
DefaultDecider(Restart for most exceptions), notDirective.Resume. So even if the actor did have children, the strategy would not be Resume — it would be the default Restart-on-most-faults behaviour withmaxNrOfRetries: 0(which forces a Stop after the first failure).
Net effect: the actor's own self-supervision is whatever the parent supplies
(SupervisorStrategy.DefaultDecider from the singleton manager / user
guardian in tests), which Restarts on most exceptions. If the try/catch in
OnUpsertAsync ever leaked (e.g. a synchronous throw constructing replyTo),
the actor would Restart, reset _centralCommunication to null, and silently
break the relay until RegisterCentralCommunication runs again.
This same pattern (with the same misleading XML doc) exists in
AuditLogIngestActor, AuditLogPurgeActor, and SiteAuditReconciliationActor
— they were likely cargo-culted; this finding documents the local instance.
Recommendation
Either:
- Remove the
SupervisorStrategy()override entirely (it does nothing useful) and revise the XML comment to drop the "Resume" claim. Self-supervision is the parent's concern (the cluster singleton manager); thetry/catchinOnUpsertAsyncis what actually keeps the actor alive. - Or, if Resume-on-self-throw is actually desired, that requires wiring a
custom supervisor in the parent (
ClusterSingletonManager) — not overridingSupervisorStrategy()here. Simpler path: keep thetry/catch, drop the override.
The CLAUDE.md "Resume for coordinator actors" decision applies to actors with children (Site Runtime hierarchy) — not to leaf cluster singletons.
Resolution
Unresolved.
SiteCallAudit-002 — Singleton failover does not wait for in-flight async upserts
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Location | src/ScadaLink.Host/Actors/AkkaHostedService.cs:455-462 (singleton wiring), src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:153-193 |
Description
The singleton is created with terminationMessage: PoisonPill.Instance. On
failover the active node's singleton stops as soon as the mailbox is drained
of normal messages and the PoisonPill is processed. An in-flight
OnUpsertAsync Task started before the PoisonPill arrived will be allowed to
complete (the message-handler runs synchronously from the mailbox's view),
but the Akka actor model does NOT cancel the EF Core
ExecuteSqlInterpolatedAsync call.
Two consequences:
- The new singleton on the other node may begin accepting
UpsertSiteCallCommandfor the sameTrackedOperationIdwhile the old singleton's in-flight upsert is still running. The repository's monotonic-upsert and the SQL duplicate-key swallow protect storage state. - The original
replyTosender may receive itsAccepted=trueafter the new singleton has already returned a different reply. Idempotency keys protect correctness; wire-level ordering is best-effort by design.
This is consistent with the design ("eventually-consistent mirror, sites are source of truth"), but worth documenting as an explicit invariant. The Notification Outbox sibling has the same pattern.
Recommendation
- Document the failover/handover semantics in the actor's XML remarks: "On
cluster singleton handover, in-flight
OnUpsertAsynctasks complete on the old node and may produce a lateAccepted=truereply; the repository's monotonic upsert ensures storage state is consistent." - Add an integration test that deliberately races two concurrent upserts on
the same
TrackedOperationIdto verify the duplicate-key swallow + monotonic rank check (the CD-015 race-pattern check the parent task flagged).
Resolution
Unresolved.
SiteCallAudit-003 — OnUpsertAsync does not refresh IngestedAtUtc; direct-write callers must remember to stamp it
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:153-193 |
Description
The combined-telemetry hot path (AuditLogIngestActor.OnCachedTelemetryAsync)
stamps IngestedAtUtc = DateTime.UtcNow on both the AuditLog row and the
SiteCall row at central-side persist time
(src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:238-239). The design
doc treats IngestedAtUtc as "central ingested (or last refreshed) this row"
— a central-side timestamp.
SiteCallAuditActor.OnUpsertAsync writes the supplied SiteCall as-is, with
whatever IngestedAtUtc the caller stamped. The only current callers are the
unit tests (which use DateTime.UtcNow at command-construction time). Once
the deferred reconciliation puller lands and starts emitting
UpsertSiteCallCommands, the puller (running on central) is responsible for
stamping a central timestamp — but if a future direct-write caller forgets,
or constructs from a site DTO, the value could drift (e.g. become a site
clock value).
This is currently latent because no production caller exists, but it's inconsistent with the dual-write code path and undocumented.
Recommendation
- Either: stamp
IngestedAtUtc = DateTime.UtcNowinsideOnUpsertAsyncbefore callingUpsertAsync(matchingAuditLogIngestActor's behaviour), usingcmd.SiteCall with { IngestedAtUtc = DateTime.UtcNow }. - Or: document in the
UpsertSiteCallCommandXML that callers MUST stampIngestedAtUtcto a central-sideDateTime.UtcNowimmediately before sending.
Preferred: stamp inside the actor — same as the combined-telemetry path — because callers cannot in general know the actor is colocated on central.
Resolution
Unresolved.
SiteCallAudit-004 — Reconciliation puller and daily terminal-purge scheduler still deferred; design-doc drift
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:23-30 (actor XML), src/ScadaLink.SiteCallAudit/ServiceCollectionExtensions.cs:8-13, docs/requirements/Component-SiteCallAudit.md:24-32 |
Description
The design doc (Component-SiteCallAudit.md lines 24-32) lists five
responsibilities, including:
- "Run periodic per-site reconciliation pulls so missed telemetry self-heals."
- "Purge terminal audit rows after a configurable retention window."
The repository exposes PurgeTerminalAsync but nothing in this module
schedules a daily call (Notification Outbox owns a MaintenanceService for
its equivalent; no SiteCallAuditMaintenanceService exists). The
reconciliation puller is acknowledged in the actor XML
(only reconciliation remains deferred) but is not surfaced in the design
doc as deferred — the doc reads as if it ships.
Recommendation
- Either: implement the deferred pieces (a hosted service that wakes daily
and calls
repo.PurgeTerminalAsync(now - retentionWindow), plus a per-site reconciliation puller with a cursor + anIPullCachedTelemetryClient). - Or: add a "Status" / "Deferred" subsection to the design doc explicitly listing what's not yet implemented (matches the pattern Audit Log uses for its tamper-evidence hash chain).
Resolution (2026-05-28):
Updated the class-level XML on SiteCallAuditActor to reflect actual state:
telemetry ingest, query/detail/KPI handlers (Task 4), and the central→site
Retry/Discard relay (Task 5) are implemented; the periodic reconciliation
puller and the daily terminal-row purge scheduler remain deferred. The design
doc update is tracked separately.
SiteCallAudit-005 — AckErrorMessage switch arm for SiteUnreachable returns ack message instead of throwing
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ScadaLink.SiteCallAudit/SiteCallAuditActor.cs:548-563 |
Description
return outcome switch
{
SiteCallRelayOutcome.Applied => null,
SiteCallRelayOutcome.NotParked => "The operation is no longer parked at the site (...)",
SiteCallRelayOutcome.OperationFailed => ack.ErrorMessage,
// SiteUnreachable is never produced from a ParkedOperationActionAck —
// unreachable responses are built by UnreachableRetry/UnreachableDiscard
// before any ack is classified, so this arm is unreachable by construction.
SiteCallRelayOutcome.SiteUnreachable => ack.ErrorMessage,
_ => throw new ArgumentOutOfRangeException(...)
};
The comment correctly states the SiteUnreachable arm is unreachable when
called from ClassifyAck. The arm therefore exists only to satisfy
exhaustiveness, but instead of throwing or returning a sentinel, it falls
through to ack.ErrorMessage — indistinguishable from the OperationFailed
arm above. If any future caller does feed SiteUnreachable into
AckErrorMessage (e.g. via refactor), the result will be a silent
wrong-detail-text bug rather than an immediate crash. The default arm
correctly throws ArgumentOutOfRangeException, so the SiteUnreachable arm
is the inconsistent one.
Recommendation
Replace the SiteUnreachable => ack.ErrorMessage arm with:
SiteCallRelayOutcome.SiteUnreachable =>
throw new InvalidOperationException(
"AckErrorMessage cannot be called for SiteUnreachable — those responses "
+ "are built by UnreachableRetry/UnreachableDiscard before classification."),
— fail fast if the invariant is ever violated by a refactor.
Resolution (2026-05-28):
Behaviour kept (return ack.ErrorMessage); AckErrorMessage stays total and
side-effect-free. Expanded the inline comment on the SiteUnreachable arm to
explain WHY it returns rather than throws: site-unreachable is classified as
transient by the upstream relay (which has already built its
SiteUnreachable response and detail text via SiteUnreachableMessage), so a
defensive fall-through surfaces the ack's message and lets the caller schedule
a retry — throwing would turn a benign refactor invariant violation into a
relay-path crash.
SiteCallAudit-006 — Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor
| Severity | Low |
| Category | Testing coverage |
| Status | Open |
| Location | tests/ScadaLink.SiteCallAudit.Tests/SiteCallAuditActorTests.cs:335-392 |
Description
SiteCallQueryRequest_StuckOnly_PagesAreFull_NoEmptyPagesWithCursor covers
the case where stuck rows are interleaved with non-stuck rows (page-1 returns
2 stuck rows, page-2 returns the third). It does not cover the edge where
the row at the keyset cursor boundary (AfterCreatedAtUtc + AfterId) is
itself a non-stuck row — i.e. the cursor points at a row the next page must
SKIP through to find more stuck rows. The repository's SQL composes the
cursor predicate (CreatedAtUtc < cursor OR (CreatedAtUtc = cursor AND id < ...)) with the stuck predicate, so it should be honest, but the test only
asserts row counts and IsStuck, not that the second-page query specifically
skipped non-stuck rows between the cursor and the next stuck row.
Lower priority because the SQL composition is straightforward, but adding a direct test would lock the invariant.
Recommendation
Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck,
stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a StuckOnly
page-size-1 query; (c) asserts each page returns exactly the stuck row, with
no overlap and all 3 stuck rows visited.
Resolution
Unresolved.