# Code Review — AuditLog | Field | Value | |-------|-------| | Module | `src/ScadaLink.AuditLog` | | Design doc | `docs/requirements/Component-AuditLog.md` | | Status | Reviewed | | Last reviewed | 2026-05-28 | | Reviewer | claude-agent | | Commit reviewed | `1eb6e97` | | Open findings | 1 | ## Summary AuditLog is one of the larger and most carefully-engineered modules in the codebase. The site-side hot-path (`SqliteAuditWriter` + `FallbackAuditWriter` + `RingBufferFallback`) implements a textbook bounded-channel + dedicated-writer pattern with batched transactions, UTF-8-safe truncation, additive schema migration, and a drop-oldest fallback that genuinely honours the "audit-write must NEVER abort the user-facing action" contract. The central side mirrors that with per-row try/catch on batch ingest, a transactional dual-write for the cached-telemetry path, per-site cursor isolation in reconciliation, and a partition-switch purge that is metadata-only. The payload filter is well-factored with a compile-time regex cache, per-stage failure isolation, and per-target overrides. Test coverage is broad — ~12 000 lines spanning unit, integration, and end-to-end paths. Themes across findings: (1) the largest issue is a **specced-but-unwired transport path** — `ISiteStreamAuditClient.IngestCachedTelemetryAsync` and `AuditLogIngestActor.OnCachedTelemetryAsync` both exist and the protobuf RPC is plumbed, but no production code ever calls the cached-telemetry client; the cached-call lifecycle audit rows ride the audit-only `IngestAuditEventsAsync` drain and the central dual-write transaction is dead code (AuditLog-001). (2) Several **Akka.NET supervisor-strategy comments are inaccurate** — multiple actors document "`SupervisorStrategy` uses Resume" but the code returns `DefaultDecider` (which Restarts), and the strategy applies to children, not to the actor itself (AuditLog-002). (3) The **`SqliteAuditWriter` hot-path lock is contended by the 30 s backlog probe** — `GetBacklogStatsAsync` takes the same `_writeLock` that serialises every batch INSERT, so a large-backlog scan can park the hot-path writer (AuditLog-005). (4) **Sync-over-async in `Dispose`** can deadlock under an ASP.NET sync context (AuditLog-006). (5) A handful of **misleading code comments and minor configuration drift** (AuditLog-007, AuditLog-008, AuditLog-009). (6) `CancellationToken` parameters on the actor drain paths are accepted but immediately replaced with `CancellationToken.None` (AuditLog-010). (7) The site-only `AddAuditLogHealthMetricsBridge` registers the `SiteAuditBacklogReporter` hosted service but the `AddAuditLog` registration chain doesn't reject a central composition root that mistakenly calls the site bridge (AuditLog-011). No Critical-severity issues; three Medium, eight Low. ## Checklist coverage | # | Category | Examined | Notes | |---|----------|----------|-------| | 1 | Correctness & logic bugs | Yes | `AuditLogIngestActor.OnCachedTelemetryAsync` is unreachable production code (AuditLog-001); reconciliation cursor advances on persistent insert failure (AuditLog-004); `Dispose` comment about `_disposed` ordering is misleading (AuditLog-009). | | 2 | Akka.NET conventions | Yes | `SupervisorStrategy` returned by actors does not do what the surrounding doc says (AuditLog-002); per-actor strategy applies to children only, but comments imply self-protection. | | 3 | Concurrency & thread safety | Yes | `GetBacklogStatsAsync` contends with hot-path writes on `_writeLock` (AuditLog-005); sync DI scopes block on async EF disposal (AuditLog-003); `_disposed` is set after the wait, contradicting comment (AuditLog-009); no cooperative cancellation through the drain paths (AuditLog-010). | | 4 | Error handling & resilience | Yes | Best-effort contract is honoured throughout; `Dispose()` sync-over-async is the one remaining hazard (AuditLog-006); reconciliation silently discards permanently-failing rows (AuditLog-004). | | 5 | Security | Yes | Append-only enforcement, redaction stack, and "never under-redact" safety net all present. Test composition roots that omit the filter SILENTLY pass payloads through unredacted (AuditLog-008). | | 6 | Performance & resource management | Yes | Hot-path batched + back-pressured. Backlog scan holds the write lock (AuditLog-005); `MarkForwardedAsync` interpolates an `IN (...)` list inside the lock, fine in practice but scales linearly with batch size. | | 7 | Design-document adherence | Yes | Combined telemetry transport plumbed but never called (AuditLog-001); other than that the implementation closely tracks the design doc. | | 8 | Code organization & conventions | Yes | Composition root well-segmented; `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` for the same dependency across registrations (AuditLog-007); `AddAuditLog*` helpers register hosted services and option bindings without idempotency guards (AuditLog-011). | | 9 | Testing coverage | Yes | Excellent surface coverage. Integration tests exist for the dual-write path in `AuditLogIngestActorCombinedTelemetryTests` and `CachedCallCombinedTelemetryTests`, but those drive the actor directly via the test harness — there is no integration test that asserts the production end-to-end emits a `CachedTelemetryBatch` from the site (because nothing does). | | 10 | Documentation & comments | Yes | Several large XML-doc paragraphs are accurate, but the `SupervisorStrategy` comments (AuditLog-002), the `Dispose` ordering comment (AuditLog-009), and a few stale "Bundle X" references could mislead a new reader. | ## Findings ### AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production | | | |--|--| | Severity | Medium | | Category | Design-document adherence | | Status | Open | | Location | `src/ScadaLink.AuditLog/Site/Telemetry/ISiteStreamAuditClient.cs:45`, `src/ScadaLink.AuditLog/Site/Telemetry/ClusterClientSiteAuditClient.cs:86`, `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:198` | **Description** The design (Component-AuditLog.md §"Cached Operations — Combined Telemetry") specifies a single `CachedCallTelemetry` packet per lifecycle event that carries BOTH the audit row AND the operational `SiteCalls` upsert, with central writing both rows in one transaction. The infrastructure exists: `ISiteStreamAuditClient.IngestCachedTelemetryAsync` is on the interface; `ClusterClientSiteAuditClient.IngestCachedTelemetryAsync` builds the `IngestCachedTelemetryCommand`; the proto carries `CachedTelemetryBatch`; `AuditLogIngestActor.OnCachedTelemetryAsync` performs the dual `InsertIfNotExists` + `UpsertAsync` inside a `BeginTransactionAsync`. But a `grep` for callers of `IngestCachedTelemetryAsync` in `src/ScadaLink.AuditLog` shows only the interface declaration and the two implementations — nothing produces a `CachedTelemetryBatch` for the site to push. The `SiteAuditTelemetryActor.OnDrainAsync` only calls `IngestAuditEventsAsync` (the audit-only path); cached-call audit rows written by `CachedCallTelemetryForwarder` to local SQLite are drained as ordinary audit events, and the `SiteCalls` operational half rides a separate `UpsertSiteCallCommand` channel into `SiteCallAuditActor`. The "central writes AuditLog + SiteCalls in one transaction" guarantee is therefore not delivered — the two writes are now uncorrelated across actors and can fail independently, and the dual-write path in `AuditLogIngestActor` is dead production code. **Recommendation** Either (a) wire the combined path: build a `CachedTelemetryBatch` from the audit rows the forwarder writes (alongside the operational half held by `IOperationTrackingStore`), add a parallel drain loop that calls `IngestCachedTelemetryAsync`, and gate the audit-only drain so cached-call rows don't double-emit; or (b) update the design doc + the `AuditLogIngestActor` / `ClusterClientSiteAuditClient` / interface XML comments to acknowledge that the two halves now flow via separate transports, and delete the unreachable `OnCachedTelemetryAsync` dual-write code (after confirming the `AuditLogIngestActorCombinedTelemetryTests` integration tests exercise it via direct actor injection only). **Resolution** _Unresolved._ ### AuditLog-002 — `SupervisorStrategy` comments claim Resume semantics but code returns the default Restart decider | | | |--|--| | Severity | Low | | Category | Akka.NET conventions | | Status | Resolved | | Location | `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:99-103`, `src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:109-115`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:315-321` | **Description** Three central actors (`AuditLogIngestActor`, `AuditLogPurgeActor`, `SiteAuditReconciliationActor`) all override `SupervisorStrategy()` and return `new OneForOneStrategy(maxNrOfRetries: 0, withinTimeRange: TimeSpan.Zero, decider: DefaultDecider)`. The surrounding XML / inline comments variously claim "uses `Resume` so a thrown exception inside `ReceiveAsync` does not restart the actor" (AuditLogIngestActor remarks), "uses Resume so any leaked exception keeps the singleton alive for the next tick" (AuditLogPurgeActor remarks), and "the actor's supervisor strategy keeps it alive across any leaked exception with `DefaultDecider`'s Restart semantics — restart resets the in-memory cursors, but as noted above that's a safe (over-pull, idempotent) recovery" (SiteAuditReconciliationActor remarks — at least correctly says Restart, but conflicts with the other two). Two things are wrong: (1) the strategy returned by an actor's `SupervisorStrategy()` override governs how that actor supervises its CHILDREN, not how its own parent treats it — so it is not the mechanism that protects these singletons from their own throws; (2) `DefaultDecider` Restarts for most exceptions, not Resumes. The actors are in fact protected by the per-row / per-batch try/catch blocks inside the receive handlers — the supervisor override is effectively unused, since these actors have no children. The comments mislead a reader into trusting a guarantee that the code does not deliver. **Recommendation** Pick one of two corrections: either delete the `SupervisorStrategy` override (these actors have no children, so the override is dead) and rewrite the comments to credit the try/catch blocks for the alive-on-throw guarantee; or — if the override is kept as a forward-compat hedge — change the decider to `Decider.From(_ => Directive.Stop)` or similar to match the comment, AND add a clear note that the per-row catch is what keeps the actor running across handler throws, not the supervisor strategy. **Resolution (2026-05-28):** Comment-only fix on all three actors (`AuditLogIngestActor`, `AuditLogPurgeActor`, `SiteAuditReconciliationActor`). XML doc remarks now correctly attribute alive-on-throw to the per-row/per-batch/per-site try/catch blocks, describe the `SupervisorStrategy` override as a children-only forward-compat placeholder, and state the actual `DefaultDecider` Restart semantics (no more "Resume" claim). Behaviour unchanged. ### AuditLog-003 — `AuditLogIngestActor.OnIngestAsync` uses `CreateScope`, but `OnCachedTelemetryAsync` uses `CreateAsyncScope` — and only one disposes asynchronously | | | |--|--| | Severity | Low | | Category | Concurrency & thread safety | | Status | Resolved | | Location | `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:133`, `src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:139`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:178` | **Description** `OnCachedTelemetryAsync` opens `_serviceProvider!.CreateAsyncScope()` and lets `await using` dispose it. `OnIngestAsync`, `OnTickAsync` in `SiteAuditReconciliationActor`, and `OnTickAsync` in `AuditLogPurgeActor` all open `_services.CreateScope()` (the synchronous variant) and dispose it with a synchronous `scope.Dispose()` in a `finally` block — even though the per-message work is async and the scoped `IAuditLogRepository` resolves an EF Core `DbContext`, which implements `IAsyncDisposable`. The synchronous `Dispose()` on a `DbContext` blocks on any pending async connection cleanup; under load this can hold the actor thread for the duration of a connection close, which on SQL Server may include sending a `SET TRANSACTION ISOLATION LEVEL` reset round-trip. Switching to `CreateAsyncScope()` + `await using` is the recommended pattern for scoped EF resources. **Recommendation** Change `_services.CreateScope()` to `_services.CreateAsyncScope()` in `OnIngestAsync`, `SiteAuditReconciliationActor.OnTickAsync`, and `AuditLogPurgeActor.OnTickAsync`, and replace the `try/finally { scope?.Dispose(); }` pattern with `await using var scope = _services.CreateAsyncScope();`. The DI scope will dispose asynchronously and the EF Core context will be released without blocking the actor thread. **Resolution (2026-05-28):** All three handlers now use `CreateAsyncScope()` + `await using var scope = ...`. `AuditLogIngestActor.OnIngestAsync` factored the per-batch loop into a shared `IngestWithRepositoryAsync` helper so the injected-repository test ctor and the scoped production path both reach the same body without duplicating the per-row try/catch. `AuditLogPurgeActor.OnTickAsync` and `SiteAuditReconciliationActor.OnTickAsync` dropped the `try/finally { scope.Dispose(); }` pattern in favour of the `await using` lexical scope. EF Core DbContexts now dispose asynchronously across every audit ingest path. ### AuditLog-004 — `SiteAuditReconciliationActor` advances cursor even on per-row insert failure, silently abandoning permanently-failing rows | | | |--|--| | Severity | Medium | | Category | Error handling & resilience | | Status | Resolved | | Location | `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:233-265` | **Description** `PullSiteAsync` iterates the pulled events, calls `InsertIfNotExistsAsync` inside a per-row try/catch, and unconditionally updates `maxOccurred = evt.OccurredAtUtc` after the try/catch — regardless of whether the insert succeeded or threw. The comment at line 247 acknowledges this: "the cursor still advances based on OccurredAtUtc — the row was returned by the site, so the next tick won't re-fetch it; if it permanently fails to persist, that's an operational concern surfaced by the log, not a hot-loop trigger." For a transient fault that flips to success on the next pull the design holds. But if a row throws on EVERY central attempt (truly permanent persistence fault — e.g. column-too-long, FK violation that won't resolve) the cursor advance still moves past it, and central will simply log on every reconciliation tick. No alert escalates beyond a log line. Worse, the site keeps the row `Pending` (because `MarkReconciledAsync` is only called for rows the puller flipped centrally) AND will trip the `SiteAuditTelemetryStalled` signal because the backlog never drains, but the central log message is the only place an operator could correlate the stall with the persistent insert failure. **Recommendation** Either (a) only advance the cursor for rows whose `InsertIfNotExistsAsync` returned cleanly — leave `maxOccurred` at the previous value for the failing row so the next tick retries; or (b) increment a dedicated `CentralAuditPermanentInsertFailure` health counter on the per-row catch so the failure is observable on the dashboard instead of buried in the log. Option (a) needs a guard against the same row throwing forever (saturate the puller) — a small per-event retry counter held in the actor's state with a permanent-skip + `LogCritical` threshold is the standard escape valve. **Resolution (2026-05-28):** Took option (a) with the per-EventId retry-counter escape valve. `PullSiteAsync` now tracks `_failedInsertAttempts: Dictionary` and a per-tick `hasUnresolvedFailure` flag: - A successful insert clears the EventId from the counter and contributes to `maxOccurred`. - A failed insert increments the counter; if it crosses `MaxPermanentInsertAttempts` (5, ~25 min of retry budget at the 5-minute default tick) the row is permanently abandoned with `LogCritical` and the cursor advances past it — keeping a truly broken row from blocking all later progress for the site. Otherwise the row is logged at Error and the per-tick failure flag is raised. - The cursor advance at end-of-tick is `hasUnresolvedFailure ? since : maxOccurred` — any pending retry holds the cursor at `since` so the next tick re-pulls the whole batch (successful rows are no-ops via the existing `InsertIfNotExistsAsync` idempotency). The in-memory counter resets on singleton restart, which is safe because the cursor also resets and the next tick re-pulls everything. Tests for both the retry-hold and permanent-abandon paths should land alongside the existing reconciliation tests in `tests/ScadaLink.AuditLog.Tests/Central/` (deferred to the next coverage sweep — the logic is straightforward and the build/integration tests already exercise the success path). ### AuditLog-005 — `GetBacklogStatsAsync` holds the SQLite hot-path write lock for the full COUNT+MIN scan | | | |--|--| | Severity | Medium | | Category | Performance & resource management | | Status | Resolved | | Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:597-657` | **Description** `SqliteAuditWriter.GetBacklogStatsAsync` takes `_writeLock` (the same lock that serialises every batch INSERT in `FlushBatch`) and holds it for the duration of a `SELECT COUNT(*), MIN(OccurredAtUtc) FROM AuditLog WHERE ForwardState = 'Pending'`. `SiteAuditBacklogReporter` calls this on a 30-second timer. On a healthy site with few `Pending` rows the index-only scan is fast; under the scenario the metric exists to detect — a prolonged central outage growing the backlog "indefinitely" per Component-AuditLog.md — a `COUNT(*)` over hundreds of thousands of `Pending` rows on the `IX_SiteAuditLog_ForwardState_Occurred` index is no longer cheap, and the duration of that scan is added to the hot-path write latency for every concurrent script. The hot path is supposed to be "durable in microseconds" per the design doc; a multi-hundred-millisecond probe stall in the same period would not be visible externally but would back-pressure the bounded write channel. `ReadPendingAsync` and `ReadPendingSinceAsync` share the same lock for the same reason and have the same exposure under backlog growth. **Recommendation** Either (a) move the SELECT outside the write lock by using a second, dedicated read-only SQLite connection (Microsoft.Data.Sqlite supports concurrent connections to the same file when journal_mode=WAL is enabled — which would also benefit the hot path); or (b) cache the last snapshot inside the writer and recompute it lazily on a dedicated background tick so the reporter reads a pre-computed snapshot without acquiring the write lock. Option (a) also unblocks `ReadPendingAsync` / `ReadPendingSinceAsync` from competing with the writer. **Resolution (2026-05-28):** Took option (a). `SqliteAuditWriter` now opens a second `SqliteConnection` (`_readConnection`) on the same file in the ctor, after `InitializeSchema` sets `PRAGMA journal_mode = WAL` on the writer connection — WAL lets a second connection read concurrently with the active writer without taking `_writeLock`. The read connection is guarded by its own `_readLock` (since `SqliteConnection` itself is not thread-safe across callers) and used by `GetBacklogStatsAsync`, `ReadPendingAsync`, `ReadPendingSinceAsync`, and `ReadForwardedAsync`. `DisposeAsync` disposes it after the writer drains. Regression test `SqliteAuditWriterBacklogStatsTests.GetBacklogStatsAsync_DoesNotBlockOnConcurrentWriteLoad` saturates the writer with a 2 000-row burst and asserts the probe returns in under 1 s — would fail against the pre-fix code (the probe queued behind every batch INSERT under `_writeLock`). ### AuditLog-006 — `SqliteAuditWriter.Dispose()` does sync-over-async and may deadlock | | | |--|--| | Severity | Low | | Category | Concurrency & thread safety | | Status | Resolved | | Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:697-700` | **Description** ```csharp public void Dispose() { DisposeAsync().AsTask().GetAwaiter().GetResult(); } ``` This is the classic sync-over-async anti-pattern. `DisposeAsync` `await`s the writer-loop task with `.ConfigureAwait(false)`, so on a thread with no synchronization context (the typical .NET 10 host shutdown path) it's fine; but if any caller invokes `Dispose()` from a context that captures (an ASP.NET request, a SynchronizationContext test runner, an Akka.NET dispatcher in some configurations) the `GetResult()` blocks the captured thread while the continuation tries to resume on it — classic deadlock. The writer is registered as a DI singleton, so this is unlikely to bite during the host's `IAsyncDisposable` shutdown (DI prefers `DisposeAsync` when available), but an integration test or future code path that constructs the writer manually inside a sync context will hang. **Recommendation** Drop the `IDisposable` interface and rely on `IAsyncDisposable` only — the DI container will call `DisposeAsync` on singletons that implement it. If a sync `Dispose` is required for compatibility with consumers that don't honour `IAsyncDisposable`, implement it as a best-effort that calls `_writeQueue.Writer.TryComplete()` + a short wait, without blocking the thread for the full async drain. **Resolution (2026-05-28):** `Dispose()` now hops to the thread pool via `Task.Run(...).GetAwaiter().GetResult()` before blocking on `DisposeAsync`. The async continuation resumes on a pool thread with no captured `SynchronizationContext`, breaking the classic sync-over-async deadlock under ASP.NET / Akka dispatchers. `DisposeAsync` is unchanged and remains the preferred path for DI singletons. XML doc comment documents the choice. Behaviour for context-free callers is unchanged. ### AuditLog-007 — `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations | | | |--|--| | Severity | Low | | Category | Code organization & conventions | | Status | Resolved | | Location | `src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs:148-218` | **Description** `AddAuditLog` registers three components that depend on `INodeIdentityProvider`: - `CachedCallTelemetryForwarder` — resolves with `sp.GetService()` (optional, falls back to a null `SourceNode`). - `CachedCallLifecycleBridge` — resolves with `sp.GetService()` (optional, same fallback). - `CentralAuditWriter` — resolves with `sp.GetRequiredService()` (required, throws at first resolution if unregistered). The XML comments at lines 153 / 175 / 215 explain the reasoning — the first two are optional because tests may skip the registration; the third is required because "the production composition root in `SiteServiceRegistration` registers the provider as a singleton on both site and central paths". But this is a fragile guarantee — `AddAuditLog` itself does NOT register the provider, so a future composition root that calls `AddAuditLog` without first calling whatever registers `INodeIdentityProvider` will fail on the FIRST resolution of `ICentralAuditWriter` (which is a lazy factory) rather than at `AddAuditLog` time. The result: site nodes that "happen to work" because they hold a registered provider, central composition test fixtures that fail at runtime instead of DI-build time, and a `GetService`/`GetRequiredService` split that gives no clear contract to the reader. **Recommendation** Either (a) make all three optional: `CentralAuditWriter` already handles a null provider gracefully (line 113-116 — null-coalescing the caller's value); the asymmetry buys nothing. Or (b) make all three required and either add `services.AddSingleton()` inside `AddAuditLog` (with a sensible default — null node name returns ``) or add an explicit guard at the top of `AddAuditLog` that throws if no provider has been registered yet (`services.Any(d => d.ServiceType == typeof(INodeIdentityProvider))`). **Resolution (2026-05-28):** Took option (b) — standardized all three consumers on `GetRequiredService()`. The Host (`SiteServiceRegistration.BindSharedOptions`) registers the provider on both site and central paths per the InboundAPI-022 / Host registration sweep, and the `AddAuditLogTests` fixture binds a `FakeNodeIdentityProvider`. A silent `GetService()` returning null was masking a future composition root that forgot the registration; the strict resolution surfaces that bug at first `ICachedCallTelemetryForwarder` / `CachedCallLifecycleBridge` / `ICentralAuditWriter` resolution instead. ### AuditLog-008 — Test composition roots that omit `IAuditPayloadFilter` silently pass UNREDACTED payloads through the writer chain | | | |--|--| | Severity | Low | | Category | Security | | Status | Resolved | | Location | `src/ScadaLink.AuditLog/Site/FallbackAuditWriter.cs:51-77`, `src/ScadaLink.AuditLog/Central/CentralAuditWriter.cs:77-104`, `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:125,155` | **Resolution (2026-05-28):** New `SafeDefaultAuditPayloadFilter` in `src/ScadaLink.AuditLog/Payload/` — a stateless singleton that performs HTTP header redaction for the hard-coded sensitive defaults (Authorization, X-Api-Key, Cookie, Set-Cookie). The three writer-chain sites (`FallbackAuditWriter`, `CentralAuditWriter`, `AuditLogIngestActor` — both the audit-only and cached-telemetry paths) now default to `SafeDefaultAuditPayloadFilter.Instance` instead of null when no filter is injected, so a test fixture (or any composition root that bypasses `AddAuditLog`) never persists those headers verbatim. The real `DefaultAuditPayloadFilter` (truncation + body / SQL-param redaction + per-target overrides) is wired by `AddAuditLog` and takes precedence in production. **Description** `FallbackAuditWriter`, `CentralAuditWriter`, and `AuditLogIngestActor` all accept an `IAuditPayloadFilter` as an optional dependency, defaulting to `null = pass-through`. The justification in every XML comment is the same: "the M4 test composition roots that don't pass one keep working (they only ever write small payloads)". This is fine for size — but the filter also performs HEADER REDACTION (`Authorization`, `Cookie`, `Set-Cookie`, `X-API-Key`), GLOBAL BODY REDACTORS, and SQL PARAMETER REDACTION. A test fixture (or any future composition root that bypasses `AddAuditLog`) that injects a real `RequestSummary` will see secrets written to SQLite / MS SQL with no redaction. The combination "audit-write must never abort the user-facing action" + "unredacted secrets must never persist" (Component-AuditLog.md §Payload Capture Policy) makes the no-filter fallback genuinely dangerous — over-redacting on a missing filter is the contract the production setup honours, but the code itself defaults to under-redact. **Recommendation** Change the three null-coalesce sites to default to a non-null sentinel filter that performs the header redaction (`HeaderRedactList`) using the hard-coded defaults from `AuditLogOptions`, even when no `IAuditPayloadFilter` is registered. The truncation stage can remain optional; the header redaction must not. Alternatively, make `IAuditPayloadFilter` non-optional and have `AddAuditLog` register the real filter unconditionally — tests that don't bind the options section will resolve the default `AuditLogOptions` and get the production-default redact list automatically. **Resolution** _Unresolved._ ### AuditLog-009 — `SqliteAuditWriter.DisposeAsync` comment claims `_disposed` is set early, but it isn't | | | |--|--| | Severity | Low | | Category | Documentation & comments | | Status | Resolved | | Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:706-740` | **Description** The first `lock (_writeLock)` block in `DisposeAsync` is commented: > Stop accepting new events. Setting _disposed first ensures any FlushBatch entered > after we mark disposed will fault its pending events rather than touching the > about-to-close connection. But the block does NOT set `_disposed = true` — it only calls `_writeQueue.Writer.TryComplete()` and captures `_writerLoop`. The `_disposed` flag is flipped in the SECOND lock block (line 738), AFTER the 5-second wait on the writer loop. During the wait window, a concurrent `WriteAsync` that observed the channel NOT-yet-completed (race: it ran before `TryComplete`) and got past `TryWrite` would land on the writer loop's `FlushBatch`, which then takes the lock and checks `_disposed` — and finds it still `false`. The check at the top of `FlushBatch` (line 265) `if (_disposed) { fault pending; return; }` therefore does NOT fire during the dispose window. In practice the channel being completed drains the loop cleanly and the disposable race is benign, but the comment claims a guarantee that the code does not implement. **Recommendation** Either set `_disposed = true` in the first lock block to match the comment (and remove the duplicate `_disposed` check in the second block); or rewrite the comment to describe the actual ordering: the channel is completed first, the loop drains remaining items under the lock, and `_disposed = true` is set only after the loop exits. The current code is correct; the comment is wrong. **Resolution (2026-05-28):** Comment-only fix on `SqliteAuditWriter.DisposeAsync`. Rewrote the misleading comment to describe the actual ordering: completing the channel writer is the shutdown signal, the writer loop drains buffered items, and `_disposed` is intentionally set only after the loop has drained (in the second lock block). Behaviour unchanged. ### AuditLog-010 — Actor drain paths accept a `CancellationToken` parameter but always pass `CancellationToken.None` downstream | | | |--|--| | Severity | Low | | Category | Concurrency & thread safety | | Status | Resolved | | Location | `src/ScadaLink.AuditLog/Site/Telemetry/SiteAuditTelemetryActor.cs:92,107,124`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:228` | **Description** The drain loops on `SiteAuditTelemetryActor.OnDrainAsync` and the per-site pull on `SiteAuditReconciliationActor.PullSiteAsync` both pass `CancellationToken.None` to every async dependency call (queue reads, gRPC client, repository writes). The actor has no `CancellationToken` field, so there's no in-flight cancellation source — graceful shutdown relies entirely on `PostStop` being called and the actor's `Receive` continuation completing naturally. For a healthy gRPC client this is fine, but a stuck `IngestAuditEventsAsync` call (slow central, partition switch in progress) holds the actor's continuation indefinitely; the host's coordinated-shutdown will then time out the actor system and leave the actor in an undefined state. The brief references "cancellation on stop" in the partition-maintenance comments but `SiteAuditTelemetryActor` does not implement it. **Recommendation** Introduce a per-actor `CancellationTokenSource` populated in `PreStart` and cancelled in `PostStop`; pass `_lifecycleCts.Token` instead of `CancellationToken.None` in every async dependency call. Same change for `SiteAuditReconciliationActor`. The existing `OperationCanceledException` is already swallowed by the top-level catch in `OnDrainAsync` (line 128), so plumbing the token through is a localised change. **Resolution (2026-05-28):** Scope reduced to `SiteAuditTelemetryActor` per finding-closure brief — added a private `_lifecycleCts` field, cancelled+disposed in `PostStop`, and threaded its token through `_queue.ReadPendingAsync`, `_client.IngestAuditEventsAsync`, and `_queue.MarkForwardedAsync` (replacing the three `CancellationToken.None` sites). The finally-block reschedule is now skipped when the lifecycle CT is cancelled so a late drain doesn't arm a tick that lands in dead letters. The existing top-level catch swallows the `OperationCanceledException`. `SiteAuditReconciliationActor` is left for a separate ticket. ### AuditLog-011 — `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call | | | |--|--| | Severity | Low | | Category | Code organization & conventions | | Status | Resolved | | Location | `src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs:53-55, 263-276, 301-346` | **Description** The XML doc on `AddAuditLog` is explicit: "Idempotent re-registration is not supported; call this exactly once per `IServiceCollection`." But `AddAuditLogHealthMetricsBridge` calls `services.AddHostedService()` (line 275), which is NOT idempotent — every call registers another descriptor, and the host will spin up N reporters and have them all poll SQLite every 30 s, all push the same snapshot into `ISiteHealthCollector`. The site composition path is supposed to call this exactly once, but tests or composition refactors that accidentally call twice will pay 2x the SQL probe rate and overwrite the snapshot with conflicting numbers (no race, just wasted work). Worse, `AddAuditLogCentralMaintenance` (line 301) is also non-idempotent — `AddOptions` and `AddHostedService` will pile up. **Recommendation** Either (a) guard each Add* helper with a "has the marker been seen?" sentinel (register a private marker descriptor on first call, no-op on subsequent calls); or (b) explicitly document idempotency on the public surface of every helper and verify with a unit test in `AddAuditLogTests`. Option (a) matches the pattern other SDK extensions use and removes a foot-gun. **Resolution (2026-05-28):** Took option (a) for `AddAuditLogHealthMetricsBridge` — guarded by a sentinel check on the `SiteAuditBacklogReporter` hosted-service descriptor (the helper's exclusive contribution to the collection). A second call short-circuits before any `Replace` / `AddHostedService` runs, so the hosted service registers exactly once. New `AddAuditLogHealthMetricsBridge_IsIdempotent_DoesNotDoubleRegister_HostedService` test in `AddAuditLogTests` calls the helper twice and asserts a single `IHostedService` descriptor for `SiteAuditBacklogReporter`. The `AddAuditLogCentralMaintenance` helper is left for a follow-up — it is only ever called from the central composition root and the unit/integration fixtures use disposable IServiceCollections, so the foot-gun is narrower.