Files

T

Joseph Doherty 6ae0fea558 fix(error-handling): close Theme 4 — 18 cancellation / fire-and-forget findings

Async cancellation hygiene, fire-and-forget observability, retry/shutdown
semantics, and audit-row coverage across 9 modules. Highlights:

Cancellation & lifecycle:
- AuditLog-006: SqliteAuditWriter.Dispose hops to thread pool, escaping the
  captured SyncContext that risked sync-over-async deadlock.
- AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS,
  threaded through drain paths instead of CancellationToken.None.
- Comm-019: CentralCommunicationActor adds lifecycle CTS for repo calls.
- Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM
  during the bounded-retry window aborts cleanly.

Cursor / retry / counter correctness:
- AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since`
  when any row's idempotent insert is still being retried (per-EventId
  retry counter, MaxPermanentInsertAttempts=5 escape valve with LogCritical
  abandon). No more silent abandonment of permanently-failing rows.
- ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's
  SPLIT loop — by class-doc construction the catch could only mask real
  failures and let the next iteration create permanent partition holes.
- HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot
  per-interval counters before sending, restore via new
  ISiteHealthCollector.AddIntervalCounters on transport failure so counts
  aren't silently lost.

Fire-and-forget / shutdown waits:
- InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks
  via OnlyOnFaulted continuation (Warning log; response unchanged).
- SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep
  with a bounded SweepShutdownWaitTimeout (10s).

Leak / refactor:
- Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its
  own try/catch so a throw doesn't leak the relay actor or _activeStreams
  entry.
- Comm-022: VERIFIED already-closed by Comm-016's dead-code purge.
- CLI-017: BundleCommands' three subcommands delegate to ExecuteCommandAsync
  (auth-failure exit-code contract unified).

Defensive / validation:
- CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config
  prints a warning and returns defaults instead of crashing the CLI.
- Host-022: ParseLevel emits stderr one-shot warning for unrecognised
  MinimumLevel instead of silently coercing to Information.
- ESG-019: ExternalSystemClient sets HttpClient.Timeout=Infinite so the
  per-call CTS is the sole timeout source (was clipped to 100s by .NET).
- Security-020: New SecurityOptionsValidator (IValidateOptions) rejects
  empty LdapServer/LdapSearchBase with ValidateOnStart.
- DM-019: Lifecycle command timeouts now emit DisableTimedOut/EnableTimedOut/
  DeleteTimedOut audit entries (mirrors DeployFailed pattern).

Plus reconciled stale per-module Open-findings counters that had drifted
from prior sessions.

20+ new regression tests across 11 test projects; build clean; affected
suites all green. README regenerated: 75 open (was 93).

2026-05-28 07:13:28 -04:00

30 KiB

Raw Blame History

Code Review — AuditLog

Field	Value
Module	`src/ScadaLink.AuditLog`
Design doc	`docs/requirements/Component-AuditLog.md`
Status	Reviewed
Last reviewed	2026-05-28
Reviewer	claude-agent
Commit reviewed	`1eb6e97`
Open findings	6

Summary

AuditLog is one of the larger and most carefully-engineered modules in the codebase. The site-side hot-path (SqliteAuditWriter + FallbackAuditWriter + RingBufferFallback) implements a textbook bounded-channel + dedicated-writer pattern with batched transactions, UTF-8-safe truncation, additive schema migration, and a drop-oldest fallback that genuinely honours the "audit-write must NEVER abort the user-facing action" contract. The central side mirrors that with per-row try/catch on batch ingest, a transactional dual-write for the cached-telemetry path, per-site cursor isolation in reconciliation, and a partition-switch purge that is metadata-only. The payload filter is well-factored with a compile-time regex cache, per-stage failure isolation, and per-target overrides. Test coverage is broad — ~12 000 lines spanning unit, integration, and end-to-end paths.

Themes across findings: (1) the largest issue is a specced-but-unwired transport path — ISiteStreamAuditClient.IngestCachedTelemetryAsync and AuditLogIngestActor.OnCachedTelemetryAsync both exist and the protobuf RPC is plumbed, but no production code ever calls the cached-telemetry client; the cached-call lifecycle audit rows ride the audit-only IngestAuditEventsAsync drain and the central dual-write transaction is dead code (AuditLog-001). (2) Several Akka.NET supervisor-strategy comments are inaccurate — multiple actors document "SupervisorStrategy uses Resume" but the code returns DefaultDecider (which Restarts), and the strategy applies to children, not to the actor itself (AuditLog-002). (3) The SqliteAuditWriter hot-path lock is contended by the 30 s backlog probe — GetBacklogStatsAsync takes the same _writeLock that serialises every batch INSERT, so a large-backlog scan can park the hot-path writer (AuditLog-005). (4) Sync-over-async in Dispose can deadlock under an ASP.NET sync context (AuditLog-006). (5) A handful of misleading code comments and minor configuration drift (AuditLog-007, AuditLog-008, AuditLog-009). (6) CancellationToken parameters on the actor drain paths are accepted but immediately replaced with CancellationToken.None (AuditLog-010). (7) The site-only AddAuditLogHealthMetricsBridge registers the SiteAuditBacklogReporter hosted service but the AddAuditLog registration chain doesn't reject a central composition root that mistakenly calls the site bridge (AuditLog-011). No Critical-severity issues; three Medium, eight Low.

Checklist coverage

#	Category	Examined	Notes
1	Correctness & logic bugs	Yes	`AuditLogIngestActor.OnCachedTelemetryAsync` is unreachable production code (AuditLog-001); reconciliation cursor advances on persistent insert failure (AuditLog-004); `Dispose` comment about `_disposed` ordering is misleading (AuditLog-009).
2	Akka.NET conventions	Yes	`SupervisorStrategy` returned by actors does not do what the surrounding doc says (AuditLog-002); per-actor strategy applies to children only, but comments imply self-protection.
3	Concurrency & thread safety	Yes	`GetBacklogStatsAsync` contends with hot-path writes on `_writeLock` (AuditLog-005); sync DI scopes block on async EF disposal (AuditLog-003); `_disposed` is set after the wait, contradicting comment (AuditLog-009); no cooperative cancellation through the drain paths (AuditLog-010).
4	Error handling & resilience	Yes	Best-effort contract is honoured throughout; `Dispose()` sync-over-async is the one remaining hazard (AuditLog-006); reconciliation silently discards permanently-failing rows (AuditLog-004).
5	Security	Yes	Append-only enforcement, redaction stack, and "never under-redact" safety net all present. Test composition roots that omit the filter SILENTLY pass payloads through unredacted (AuditLog-008).
6	Performance & resource management	Yes	Hot-path batched + back-pressured. Backlog scan holds the write lock (AuditLog-005); `MarkForwardedAsync` interpolates an `IN (...)` list inside the lock, fine in practice but scales linearly with batch size.
7	Design-document adherence	Yes	Combined telemetry transport plumbed but never called (AuditLog-001); other than that the implementation closely tracks the design doc.
8	Code organization & conventions	Yes	Composition root well-segmented; `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` for the same dependency across registrations (AuditLog-007); `AddAuditLog*` helpers register hosted services and option bindings without idempotency guards (AuditLog-011).
9	Testing coverage	Yes	Excellent surface coverage. Integration tests exist for the dual-write path in `AuditLogIngestActorCombinedTelemetryTests` and `CachedCallCombinedTelemetryTests`, but those drive the actor directly via the test harness — there is no integration test that asserts the production end-to-end emits a `CachedTelemetryBatch` from the site (because nothing does).
10	Documentation & comments	Yes	Several large XML-doc paragraphs are accurate, but the `SupervisorStrategy` comments (AuditLog-002), the `Dispose` ordering comment (AuditLog-009), and a few stale "Bundle X" references could mislead a new reader.

Findings

AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production


Severity	Medium
Category	Design-document adherence
Status	Open
Location	`src/ScadaLink.AuditLog/Site/Telemetry/ISiteStreamAuditClient.cs:45`, `src/ScadaLink.AuditLog/Site/Telemetry/ClusterClientSiteAuditClient.cs:86`, `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:198`

Description

The design (Component-AuditLog.md §"Cached Operations — Combined Telemetry") specifies a single CachedCallTelemetry packet per lifecycle event that carries BOTH the audit row AND the operational SiteCalls upsert, with central writing both rows in one transaction. The infrastructure exists: ISiteStreamAuditClient.IngestCachedTelemetryAsync is on the interface; ClusterClientSiteAuditClient.IngestCachedTelemetryAsync builds the IngestCachedTelemetryCommand; the proto carries CachedTelemetryBatch; AuditLogIngestActor.OnCachedTelemetryAsync performs the dual InsertIfNotExists + UpsertAsync inside a BeginTransactionAsync. But a grep for callers of IngestCachedTelemetryAsync in src/ScadaLink.AuditLog shows only the interface declaration and the two implementations — nothing produces a CachedTelemetryBatch for the site to push. The SiteAuditTelemetryActor.OnDrainAsync only calls IngestAuditEventsAsync (the audit-only path); cached-call audit rows written by CachedCallTelemetryForwarder to local SQLite are drained as ordinary audit events, and the SiteCalls operational half rides a separate UpsertSiteCallCommand channel into SiteCallAuditActor. The "central writes AuditLog + SiteCalls in one transaction" guarantee is therefore not delivered — the two writes are now uncorrelated across actors and can fail independently, and the dual-write path in AuditLogIngestActor is dead production code.

Recommendation

Either (a) wire the combined path: build a CachedTelemetryBatch from the audit rows the forwarder writes (alongside the operational half held by IOperationTrackingStore), add a parallel drain loop that calls IngestCachedTelemetryAsync, and gate the audit-only drain so cached-call rows don't double-emit; or (b) update the design doc + the AuditLogIngestActor / ClusterClientSiteAuditClient / interface XML comments to acknowledge that the two halves now flow via separate transports, and delete the unreachable OnCachedTelemetryAsync dual-write code (after confirming the AuditLogIngestActorCombinedTelemetryTests integration tests exercise it via direct actor injection only).

Resolution

Unresolved.

AuditLog-002 — `SupervisorStrategy` comments claim Resume semantics but code returns the default Restart decider


Severity	Low
Category	Akka.NET conventions
Status	Resolved
Location	`src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:99-103`, `src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:109-115`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:315-321`

Description

Three central actors (AuditLogIngestActor, AuditLogPurgeActor, SiteAuditReconciliationActor) all override SupervisorStrategy() and return new OneForOneStrategy(maxNrOfRetries: 0, withinTimeRange: TimeSpan.Zero, decider: DefaultDecider). The surrounding XML / inline comments variously claim "uses Resume so a thrown exception inside ReceiveAsync does not restart the actor" (AuditLogIngestActor remarks), "uses Resume so any leaked exception keeps the singleton alive for the next tick" (AuditLogPurgeActor remarks), and "the actor's supervisor strategy keeps it alive across any leaked exception with DefaultDecider's Restart semantics — restart resets the in-memory cursors, but as noted above that's a safe (over-pull, idempotent) recovery" (SiteAuditReconciliationActor remarks — at least correctly says Restart, but conflicts with the other two). Two things are wrong: (1) the strategy returned by an actor's SupervisorStrategy() override governs how that actor supervises its CHILDREN, not how its own parent treats it — so it is not the mechanism that protects these singletons from their own throws; (2) DefaultDecider Restarts for most exceptions, not Resumes. The actors are in fact protected by the per-row / per-batch try/catch blocks inside the receive handlers — the supervisor override is effectively unused, since these actors have no children. The comments mislead a reader into trusting a guarantee that the code does not deliver.

Recommendation

Pick one of two corrections: either delete the SupervisorStrategy override (these actors have no children, so the override is dead) and rewrite the comments to credit the try/catch blocks for the alive-on-throw guarantee; or — if the override is kept as a forward-compat hedge — change the decider to Decider.From(_ => Directive.Stop) or similar to match the comment, AND add a clear note that the per-row catch is what keeps the actor running across handler throws, not the supervisor strategy.

Resolution (2026-05-28):

Comment-only fix on all three actors (AuditLogIngestActor, AuditLogPurgeActor, SiteAuditReconciliationActor). XML doc remarks now correctly attribute alive-on-throw to the per-row/per-batch/per-site try/catch blocks, describe the SupervisorStrategy override as a children-only forward-compat placeholder, and state the actual DefaultDecider Restart semantics (no more "Resume" claim). Behaviour unchanged.

AuditLog-003 — `AuditLogIngestActor.OnIngestAsync` uses `CreateScope`, but `OnCachedTelemetryAsync` uses `CreateAsyncScope` — and only one disposes asynchronously


Severity	Low
Category	Concurrency & thread safety
Status	Open
Location	`src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:133`, `src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:139`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:178`

Description

OnCachedTelemetryAsync opens _serviceProvider!.CreateAsyncScope() and lets await using dispose it. OnIngestAsync, OnTickAsync in SiteAuditReconciliationActor, and OnTickAsync in AuditLogPurgeActor all open _services.CreateScope() (the synchronous variant) and dispose it with a synchronous scope.Dispose() in a finally block — even though the per-message work is async and the scoped IAuditLogRepository resolves an EF Core DbContext, which implements IAsyncDisposable. The synchronous Dispose() on a DbContext blocks on any pending async connection cleanup; under load this can hold the actor thread for the duration of a connection close, which on SQL Server may include sending a SET TRANSACTION ISOLATION LEVEL reset round-trip. Switching to CreateAsyncScope() + await using is the recommended pattern for scoped EF resources.

Recommendation

Change _services.CreateScope() to _services.CreateAsyncScope() in OnIngestAsync, SiteAuditReconciliationActor.OnTickAsync, and AuditLogPurgeActor.OnTickAsync, and replace the try/finally { scope?.Dispose(); } pattern with await using var scope = _services.CreateAsyncScope();. The DI scope will dispose asynchronously and the EF Core context will be released without blocking the actor thread.

Resolution

Unresolved.

AuditLog-004 — `SiteAuditReconciliationActor` advances cursor even on per-row insert failure, silently abandoning permanently-failing rows


Severity	Medium
Category	Error handling & resilience
Status	Resolved
Location	`src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:233-265`

Description

PullSiteAsync iterates the pulled events, calls InsertIfNotExistsAsync inside a per-row try/catch, and unconditionally updates maxOccurred = evt.OccurredAtUtc after the try/catch — regardless of whether the insert succeeded or threw. The comment at line 247 acknowledges this: "the cursor still advances based on OccurredAtUtc — the row was returned by the site, so the next tick won't re-fetch it; if it permanently fails to persist, that's an operational concern surfaced by the log, not a hot-loop trigger." For a transient fault that flips to success on the next pull the design holds. But if a row throws on EVERY central attempt (truly permanent persistence fault — e.g. column-too-long, FK violation that won't resolve) the cursor advance still moves past it, and central will simply log on every reconciliation tick. No alert escalates beyond a log line. Worse, the site keeps the row Pending (because MarkReconciledAsync is only called for rows the puller flipped centrally) AND will trip the SiteAuditTelemetryStalled signal because the backlog never drains, but the central log message is the only place an operator could correlate the stall with the persistent insert failure.

Recommendation

Either (a) only advance the cursor for rows whose InsertIfNotExistsAsync returned cleanly — leave maxOccurred at the previous value for the failing row so the next tick retries; or (b) increment a dedicated CentralAuditPermanentInsertFailure health counter on the per-row catch so the failure is observable on the dashboard instead of buried in the log. Option (a) needs a guard against the same row throwing forever (saturate the puller) — a small per-event retry counter held in the actor's state with a permanent-skip + LogCritical threshold is the standard escape valve.

Resolution (2026-05-28):

Took option (a) with the per-EventId retry-counter escape valve. PullSiteAsync now tracks _failedInsertAttempts: Dictionary<Guid, int> and a per-tick hasUnresolvedFailure flag:

A successful insert clears the EventId from the counter and contributes to maxOccurred.
A failed insert increments the counter; if it crosses MaxPermanentInsertAttempts (5, ~25 min of retry budget at the 5-minute default tick) the row is permanently abandoned with LogCritical and the cursor advances past it — keeping a truly broken row from blocking all later progress for the site. Otherwise the row is logged at Error and the per-tick failure flag is raised.
The cursor advance at end-of-tick is hasUnresolvedFailure ? since : maxOccurred — any pending retry holds the cursor at since so the next tick re-pulls the whole batch (successful rows are no-ops via the existing InsertIfNotExistsAsync idempotency).

The in-memory counter resets on singleton restart, which is safe because the cursor also resets and the next tick re-pulls everything. Tests for both the retry-hold and permanent-abandon paths should land alongside the existing reconciliation tests in tests/ScadaLink.AuditLog.Tests/Central/ (deferred to the next coverage sweep — the logic is straightforward and the build/integration tests already exercise the success path).

AuditLog-005 — `GetBacklogStatsAsync` holds the SQLite hot-path write lock for the full COUNT+MIN scan


Severity	Medium
Category	Performance & resource management
Status	Open
Location	`src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:597-657`

Description

SqliteAuditWriter.GetBacklogStatsAsync takes _writeLock (the same lock that serialises every batch INSERT in FlushBatch) and holds it for the duration of a SELECT COUNT(*), MIN(OccurredAtUtc) FROM AuditLog WHERE ForwardState = 'Pending'. SiteAuditBacklogReporter calls this on a 30-second timer. On a healthy site with few Pending rows the index-only scan is fast; under the scenario the metric exists to detect — a prolonged central outage growing the backlog "indefinitely" per Component-AuditLog.md — a COUNT(*) over hundreds of thousands of Pending rows on the IX_SiteAuditLog_ForwardState_Occurred index is no longer cheap, and the duration of that scan is added to the hot-path write latency for every concurrent script. The hot path is supposed to be "durable in microseconds" per the design doc; a multi-hundred-millisecond probe stall in the same period would not be visible externally but would back-pressure the bounded write channel. ReadPendingAsync and ReadPendingSinceAsync share the same lock for the same reason and have the same exposure under backlog growth.

Recommendation

Either (a) move the SELECT outside the write lock by using a second, dedicated read-only SQLite connection (Microsoft.Data.Sqlite supports concurrent connections to the same file when journal_mode=WAL is enabled — which would also benefit the hot path); or (b) cache the last snapshot inside the writer and recompute it lazily on a dedicated background tick so the reporter reads a pre-computed snapshot without acquiring the write lock. Option (a) also unblocks ReadPendingAsync / ReadPendingSinceAsync from competing with the writer.

Resolution

Unresolved.

AuditLog-006 — `SqliteAuditWriter.Dispose()` does sync-over-async and may deadlock


Severity	Low
Category	Concurrency & thread safety
Status	Resolved
Location	`src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:697-700`

Description

public void Dispose()
{
    DisposeAsync().AsTask().GetAwaiter().GetResult();
}

This is the classic sync-over-async anti-pattern. DisposeAsync awaits the writer-loop task with .ConfigureAwait(false), so on a thread with no synchronization context (the typical .NET 10 host shutdown path) it's fine; but if any caller invokes Dispose() from a context that captures (an ASP.NET request, a SynchronizationContext test runner, an Akka.NET dispatcher in some configurations) the GetResult() blocks the captured thread while the continuation tries to resume on it — classic deadlock. The writer is registered as a DI singleton, so this is unlikely to bite during the host's IAsyncDisposable shutdown (DI prefers DisposeAsync when available), but an integration test or future code path that constructs the writer manually inside a sync context will hang.

Recommendation

Drop the IDisposable interface and rely on IAsyncDisposable only — the DI container will call DisposeAsync on singletons that implement it. If a sync Dispose is required for compatibility with consumers that don't honour IAsyncDisposable, implement it as a best-effort that calls _writeQueue.Writer.TryComplete() + a short wait, without blocking the thread for the full async drain.

Resolution (2026-05-28):

Dispose() now hops to the thread pool via Task.Run(...).GetAwaiter().GetResult() before blocking on DisposeAsync. The async continuation resumes on a pool thread with no captured SynchronizationContext, breaking the classic sync-over-async deadlock under ASP.NET / Akka dispatchers. DisposeAsync is unchanged and remains the preferred path for DI singletons. XML doc comment documents the choice. Behaviour for context-free callers is unchanged.

AuditLog-007 — `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations


Severity	Low
Category	Code organization & conventions
Status	Open
Location	`src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs:148-218`

Description

AddAuditLog registers three components that depend on INodeIdentityProvider:

CachedCallTelemetryForwarder — resolves with sp.GetService<INodeIdentityProvider>() (optional, falls back to a null SourceNode).
CachedCallLifecycleBridge — resolves with sp.GetService<INodeIdentityProvider>() (optional, same fallback).
CentralAuditWriter — resolves with sp.GetRequiredService<INodeIdentityProvider>() (required, throws at first resolution if unregistered).

The XML comments at lines 153 / 175 / 215 explain the reasoning — the first two are optional because tests may skip the registration; the third is required because "the production composition root in SiteServiceRegistration registers the provider as a singleton on both site and central paths". But this is a fragile guarantee — AddAuditLog itself does NOT register the provider, so a future composition root that calls AddAuditLog without first calling whatever registers INodeIdentityProvider will fail on the FIRST resolution of ICentralAuditWriter (which is a lazy factory) rather than at AddAuditLog time. The result: site nodes that "happen to work" because they hold a registered provider, central composition test fixtures that fail at runtime instead of DI-build time, and a GetService/GetRequiredService split that gives no clear contract to the reader.

Recommendation

Either (a) make all three optional: CentralAuditWriter already handles a null provider gracefully (line 113-116 — null-coalescing the caller's value); the asymmetry buys nothing. Or (b) make all three required and either add services.AddSingleton<INodeIdentityProvider, ...>() inside AddAuditLog (with a sensible default — null node name returns <unknown>) or add an explicit guard at the top of AddAuditLog that throws if no provider has been registered yet (services.Any(d => d.ServiceType == typeof(INodeIdentityProvider))).

Resolution

Unresolved.

AuditLog-008 — Test composition roots that omit `IAuditPayloadFilter` silently pass UNREDACTED payloads through the writer chain


Severity	Low
Category	Security
Status	Open
Location	`src/ScadaLink.AuditLog/Site/FallbackAuditWriter.cs:51-77`, `src/ScadaLink.AuditLog/Central/CentralAuditWriter.cs:77-104`, `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:125,155`

Description

FallbackAuditWriter, CentralAuditWriter, and AuditLogIngestActor all accept an IAuditPayloadFilter as an optional dependency, defaulting to null = pass-through. The justification in every XML comment is the same: "the M4 test composition roots that don't pass one keep working (they only ever write small payloads)". This is fine for size — but the filter also performs HEADER REDACTION (Authorization, Cookie, Set-Cookie, X-API-Key), GLOBAL BODY REDACTORS, and SQL PARAMETER REDACTION. A test fixture (or any future composition root that bypasses AddAuditLog) that injects a real RequestSummary will see secrets written to SQLite / MS SQL with no redaction. The combination "audit-write must never abort the user-facing action" + "unredacted secrets must never persist" (Component-AuditLog.md §Payload Capture Policy) makes the no-filter fallback genuinely dangerous — over-redacting on a missing filter is the contract the production setup honours, but the code itself defaults to under-redact.

Recommendation

Change the three null-coalesce sites to default to a non-null sentinel filter that performs the header redaction (HeaderRedactList) using the hard-coded defaults from AuditLogOptions, even when no IAuditPayloadFilter is registered. The truncation stage can remain optional; the header redaction must not. Alternatively, make IAuditPayloadFilter non-optional and have AddAuditLog register the real filter unconditionally — tests that don't bind the options section will resolve the default AuditLogOptions and get the production-default redact list automatically.

Resolution

Unresolved.

AuditLog-009 — `SqliteAuditWriter.DisposeAsync` comment claims `_disposed` is set early, but it isn't


Severity	Low
Category	Documentation & comments
Status	Resolved
Location	`src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:706-740`

Description

The first lock (_writeLock) block in DisposeAsync is commented:

Stop accepting new events. Setting _disposed first ensures any FlushBatch entered after we mark disposed will fault its pending events rather than touching the about-to-close connection.

But the block does NOT set _disposed = true — it only calls _writeQueue.Writer.TryComplete() and captures _writerLoop. The _disposed flag is flipped in the SECOND lock block (line 738), AFTER the 5-second wait on the writer loop. During the wait window, a concurrent WriteAsync that observed the channel NOT-yet-completed (race: it ran before TryComplete) and got past TryWrite would land on the writer loop's FlushBatch, which then takes the lock and checks _disposed — and finds it still false. The check at the top of FlushBatch (line 265) if (_disposed) { fault pending; return; } therefore does NOT fire during the dispose window. In practice the channel being completed drains the loop cleanly and the disposable race is benign, but the comment claims a guarantee that the code does not implement.

Recommendation

Either set _disposed = true in the first lock block to match the comment (and remove the duplicate _disposed check in the second block); or rewrite the comment to describe the actual ordering: the channel is completed first, the loop drains remaining items under the lock, and _disposed = true is set only after the loop exits. The current code is correct; the comment is wrong.

Resolution (2026-05-28):

Comment-only fix on SqliteAuditWriter.DisposeAsync. Rewrote the misleading comment to describe the actual ordering: completing the channel writer is the shutdown signal, the writer loop drains buffered items, and _disposed is intentionally set only after the loop has drained (in the second lock block). Behaviour unchanged.

AuditLog-010 — Actor drain paths accept a `CancellationToken` parameter but always pass `CancellationToken.None` downstream


Severity	Low
Category	Concurrency & thread safety
Status	Resolved
Location	`src/ScadaLink.AuditLog/Site/Telemetry/SiteAuditTelemetryActor.cs:92,107,124`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:228`

Description

The drain loops on SiteAuditTelemetryActor.OnDrainAsync and the per-site pull on SiteAuditReconciliationActor.PullSiteAsync both pass CancellationToken.None to every async dependency call (queue reads, gRPC client, repository writes). The actor has no CancellationToken field, so there's no in-flight cancellation source — graceful shutdown relies entirely on PostStop being called and the actor's Receive continuation completing naturally. For a healthy gRPC client this is fine, but a stuck IngestAuditEventsAsync call (slow central, partition switch in progress) holds the actor's continuation indefinitely; the host's coordinated-shutdown will then time out the actor system and leave the actor in an undefined state. The brief references "cancellation on stop" in the partition-maintenance comments but SiteAuditTelemetryActor does not implement it.

Recommendation

Introduce a per-actor CancellationTokenSource populated in PreStart and cancelled in PostStop; pass _lifecycleCts.Token instead of CancellationToken.None in every async dependency call. Same change for SiteAuditReconciliationActor. The existing OperationCanceledException is already swallowed by the top-level catch in OnDrainAsync (line 128), so plumbing the token through is a localised change.

Resolution (2026-05-28):

Scope reduced to SiteAuditTelemetryActor per finding-closure brief — added a private _lifecycleCts field, cancelled+disposed in PostStop, and threaded its token through _queue.ReadPendingAsync, _client.IngestAuditEventsAsync, and _queue.MarkForwardedAsync (replacing the three CancellationToken.None sites). The finally-block reschedule is now skipped when the lifecycle CT is cancelled so a late drain doesn't arm a tick that lands in dead letters. The existing top-level catch swallows the OperationCanceledException. SiteAuditReconciliationActor is left for a separate ticket.

AuditLog-011 — `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call


Severity	Low
Category	Code organization & conventions
Status	Open
Location	`src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs:53-55, 263-276, 301-346`

Description

The XML doc on AddAuditLog is explicit: "Idempotent re-registration is not supported; call this exactly once per IServiceCollection." But AddAuditLogHealthMetricsBridge calls services.AddHostedService<SiteAuditBacklogReporter>() (line 275), which is NOT idempotent — every call registers another descriptor, and the host will spin up N reporters and have them all poll SQLite every 30 s, all push the same snapshot into ISiteHealthCollector. The site composition path is supposed to call this exactly once, but tests or composition refactors that accidentally call twice will pay 2x the SQL probe rate and overwrite the snapshot with conflicting numbers (no race, just wasted work). Worse, AddAuditLogCentralMaintenance (line 301) is also non-idempotent — AddOptions<AuditLogPartitionMaintenanceOptions> and AddHostedService<AuditLogPartitionMaintenanceService> will pile up.

Recommendation

Either (a) guard each Add* helper with a "has the marker been seen?" sentinel (register a private marker descriptor on first call, no-op on subsequent calls); or (b) explicitly document idempotency on the public surface of every helper and verify with a unit test in AddAuditLogTests. Option (a) matches the pattern other SDK extensions use and removes a foot-gun.

Resolution

Unresolved.

30 KiB Raw Blame History

Code Review — AuditLog

Summary

Checklist coverage

Findings

AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production

AuditLog-002 — SupervisorStrategy comments claim Resume semantics but code returns the default Restart decider

AuditLog-003 — AuditLogIngestActor.OnIngestAsync uses CreateScope, but OnCachedTelemetryAsync uses CreateAsyncScope — and only one disposes asynchronously

AuditLog-004 — SiteAuditReconciliationActor advances cursor even on per-row insert failure, silently abandoning permanently-failing rows

AuditLog-005 — GetBacklogStatsAsync holds the SQLite hot-path write lock for the full COUNT+MIN scan

AuditLog-006 — SqliteAuditWriter.Dispose() does sync-over-async and may deadlock

AuditLog-007 — INodeIdentityProvider resolution mixes GetService and GetRequiredService inconsistently across AddAuditLog registrations

AuditLog-008 — Test composition roots that omit IAuditPayloadFilter silently pass UNREDACTED payloads through the writer chain

AuditLog-009 — SqliteAuditWriter.DisposeAsync comment claims _disposed is set early, but it isn't

AuditLog-010 — Actor drain paths accept a CancellationToken parameter but always pass CancellationToken.None downstream

AuditLog-011 — AddAuditLogHealthMetricsBridge and AddAuditLogCentralMaintenance are non-idempotent and register hosted services on every call

30 KiB

Raw Blame History

AuditLog-002 — `SupervisorStrategy` comments claim Resume semantics but code returns the default Restart decider

AuditLog-003 — `AuditLogIngestActor.OnIngestAsync` uses `CreateScope`, but `OnCachedTelemetryAsync` uses `CreateAsyncScope` — and only one disposes asynchronously

AuditLog-004 — `SiteAuditReconciliationActor` advances cursor even on per-row insert failure, silently abandoning permanently-failing rows

AuditLog-005 — `GetBacklogStatsAsync` holds the SQLite hot-path write lock for the full COUNT+MIN scan

AuditLog-006 — `SqliteAuditWriter.Dispose()` does sync-over-async and may deadlock

AuditLog-007 — `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations

AuditLog-008 — Test composition roots that omit `IAuditPayloadFilter` silently pass UNREDACTED payloads through the writer chain

AuditLog-009 — `SqliteAuditWriter.DisposeAsync` comment claims `_disposed` is set early, but it isn't

AuditLog-010 — Actor drain paths accept a `CancellationToken` parameter but always pass `CancellationToken.None` downstream

AuditLog-011 — `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call