code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including first-time reviews of the four newer components (AuditLog, NotificationOutbox, SiteCallAudit, Transport) — so the code-reviews/ index reflects today's codebase rather than the 2026-05-16 baseline. 172 new Open findings (0 Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules. regen-readme.py now derives each module's Last reviewed + Commit from its findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future single-module re-reviews show their own date in the Module Status table.
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
@@ -0,0 +1,500 @@
+# Code Review — AuditLog
+
+| Field | Value |
+|-------|-------|
+| Module | `src/ScadaLink.AuditLog` |
+| Design doc | `docs/requirements/Component-AuditLog.md` |
+| Status | Reviewed |
+| Last reviewed | 2026-05-28 |
+| Reviewer | claude-agent |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 11 |
+
+## Summary
+
+AuditLog is one of the largest and most carefully engineered modules in the codebase.
+The site-side hot-path (`SqliteAuditWriter` + `FallbackAuditWriter` + `RingBufferFallback`)
+implements a textbook bounded-channel + dedicated-writer pattern with batched transactions,
+UTF-8-safe truncation, additive schema migration, and a drop-oldest fallback that
+genuinely honours the "audit-write must NEVER abort the user-facing action" contract.
+The central side mirrors that with per-row try/catch on batch ingest, a transactional
+dual-write for the cached-telemetry path, per-site cursor isolation in reconciliation,
+and a partition-switch purge that is metadata-only. The payload filter is well-factored
+with a compile-time regex cache, per-stage failure isolation, and per-target overrides.
+Test coverage is broad — ~12 000 lines spanning unit, integration, and end-to-end paths.
+
+Themes across findings: (1) the largest issue is a **specced-but-unwired transport path** —
+`ISiteStreamAuditClient.IngestCachedTelemetryAsync` and `AuditLogIngestActor.OnCachedTelemetryAsync`
+both exist and the protobuf RPC is plumbed, but no production code ever calls the cached-telemetry
+client; the cached-call lifecycle audit rows ride the audit-only `IngestAuditEventsAsync` drain
+and the central dual-write transaction is dead code (AuditLog-001). (2) Several
+**Akka.NET supervisor-strategy comments are inaccurate** — multiple actors document
+"`SupervisorStrategy` uses Resume" but the code returns `DefaultDecider` (which Restarts), and
+the strategy applies to children, not to the actor itself (AuditLog-002). (3) The
+**`SqliteAuditWriter` hot-path lock is contended by the 30 s backlog probe** — `GetBacklogStatsAsync`
+takes the same `_writeLock` that serialises every batch INSERT, so a large-backlog scan can
+park the hot-path writer (AuditLog-005). (4) **Sync-over-async in `Dispose`** can deadlock under
+an ASP.NET sync context (AuditLog-006). (5) A handful of **misleading code comments and minor
+configuration drift** (AuditLog-007, AuditLog-008, AuditLog-009). (6) `CancellationToken`
+parameters on the actor drain paths are accepted but immediately replaced with
+`CancellationToken.None` (AuditLog-010). (7) The `AddAuditLogHealthMetricsBridge` and
+`AddAuditLogCentralMaintenance` helpers register hosted services and option bindings
+without idempotency guards, so an accidental second call duplicates the registrations
+(AuditLog-011). No Critical-severity issues; three Medium, eight Low.
+
+## Checklist coverage
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | Yes | `AuditLogIngestActor.OnCachedTelemetryAsync` is unreachable production code (AuditLog-001); reconciliation cursor advances on persistent insert failure (AuditLog-004); `DisposeAsync` comment about `_disposed` ordering is misleading (AuditLog-009). |
+| 2 | Akka.NET conventions | Yes | `SupervisorStrategy` returned by actors does not do what the surrounding doc says (AuditLog-002); per-actor strategy applies to children only, but comments imply self-protection. |
+| 3 | Concurrency & thread safety | Yes | `GetBacklogStatsAsync` contends with hot-path writes on `_writeLock` (AuditLog-005); sync DI scopes block on async EF disposal (AuditLog-003); `_disposed` is set after the wait, contradicting comment (AuditLog-009); no cooperative cancellation through the drain paths (AuditLog-010). |
+| 4 | Error handling & resilience | Yes | Best-effort contract is honoured throughout; `Dispose()` sync-over-async is the one remaining hazard (AuditLog-006); reconciliation silently discards permanently-failing rows (AuditLog-004). |
+| 5 | Security | Yes | Append-only enforcement, redaction stack, and "never under-redact" safety net all present. Test composition roots that omit the filter SILENTLY pass payloads through unredacted (AuditLog-008). |
+| 6 | Performance & resource management | Yes | Hot-path batched + back-pressured. Backlog scan holds the write lock (AuditLog-005); `MarkForwardedAsync` interpolates an `IN (...)` list inside the lock, fine in practice but scales linearly with batch size. |
+| 7 | Design-document adherence | Yes | Combined telemetry transport plumbed but never called (AuditLog-001); other than that the implementation closely tracks the design doc. |
+| 8 | Code organization & conventions | Yes | Composition root well-segmented; `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` for the same dependency across registrations (AuditLog-007); `AddAuditLog*` helpers register hosted services and option bindings without idempotency guards (AuditLog-011). |
+| 9 | Testing coverage | Yes | Excellent surface coverage. Integration tests exist for the dual-write path in `AuditLogIngestActorCombinedTelemetryTests` and `CachedCallCombinedTelemetryTests`, but those drive the actor directly via the test harness — there is no integration test that asserts the production end-to-end emits a `CachedTelemetryBatch` from the site (because nothing does). |
+| 10 | Documentation & comments | Yes | Several large XML-doc paragraphs are accurate, but the `SupervisorStrategy` comments (AuditLog-002), the `DisposeAsync` ordering comment (AuditLog-009), and a few stale "Bundle X" references could mislead a new reader. |
+
+## Findings
+
+### AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Site/Telemetry/ISiteStreamAuditClient.cs:45`, `src/ScadaLink.AuditLog/Site/Telemetry/ClusterClientSiteAuditClient.cs:86`, `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:198` |
+
+**Description**
+
+The design (Component-AuditLog.md §"Cached Operations — Combined Telemetry") specifies a
+single `CachedCallTelemetry` packet per lifecycle event that carries BOTH the audit row
+AND the operational `SiteCalls` upsert, with central writing both rows in one transaction.
+The infrastructure exists: `ISiteStreamAuditClient.IngestCachedTelemetryAsync` is on the
+interface; `ClusterClientSiteAuditClient.IngestCachedTelemetryAsync` builds the
+`IngestCachedTelemetryCommand`; the proto carries `CachedTelemetryBatch`;
+`AuditLogIngestActor.OnCachedTelemetryAsync` performs the dual `InsertIfNotExists` +
+`UpsertAsync` inside a `BeginTransactionAsync`. But a `grep` for callers of
+`IngestCachedTelemetryAsync` in `src/ScadaLink.AuditLog` shows only the interface
+declaration and the two implementations — nothing produces a `CachedTelemetryBatch` for
+the site to push. The `SiteAuditTelemetryActor.OnDrainAsync` only calls
+`IngestAuditEventsAsync` (the audit-only path); cached-call audit rows written by
+`CachedCallTelemetryForwarder` to local SQLite are drained as ordinary audit events,
+and the `SiteCalls` operational half rides a separate `UpsertSiteCallCommand` channel
+into `SiteCallAuditActor`. The "central writes AuditLog + SiteCalls in one transaction"
+guarantee is therefore not delivered — the two writes are now uncorrelated across
+actors and can fail independently, and the dual-write path in `AuditLogIngestActor`
+is dead production code.
+
+**Recommendation**
+
+Either (a) wire the combined path: build a `CachedTelemetryBatch` from the audit rows
+the forwarder writes (alongside the operational half held by `IOperationTrackingStore`),
+add a parallel drain loop that calls `IngestCachedTelemetryAsync`, and gate the audit-only
+drain so cached-call rows don't double-emit; or (b) update the design doc + the
+`AuditLogIngestActor` / `ClusterClientSiteAuditClient` / interface XML comments to
+acknowledge that the two halves now flow via separate transports, and delete the
+unreachable `OnCachedTelemetryAsync` dual-write code (after confirming the
+`AuditLogIngestActorCombinedTelemetryTests` integration tests exercise it via direct
+actor injection only).
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-002 — `SupervisorStrategy` comments claim Resume semantics but code returns the default Restart decider
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Akka.NET conventions |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:99-103`, `src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:109-115`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:315-321` |
+
+**Description**
+
+Three central actors (`AuditLogIngestActor`, `AuditLogPurgeActor`, `SiteAuditReconciliationActor`)
+all override `SupervisorStrategy()` and return
+`new OneForOneStrategy(maxNrOfRetries: 0, withinTimeRange: TimeSpan.Zero, decider: DefaultDecider)`.
+The surrounding XML / inline comments variously claim "uses `Resume` so a thrown exception
+inside `ReceiveAsync` does not restart the actor" (AuditLogIngestActor remarks),
+"uses Resume so any leaked exception keeps the singleton alive for the next tick"
+(AuditLogPurgeActor remarks), and "the actor's supervisor strategy keeps it alive
+across any leaked exception with `DefaultDecider`'s Restart semantics — restart resets
+the in-memory cursors, but as noted above that's a safe (over-pull, idempotent) recovery"
+(SiteAuditReconciliationActor remarks — at least correctly says Restart, but conflicts
+with the other two). Two things are wrong: (1) the strategy returned by an actor's
+`SupervisorStrategy()` override governs how that actor supervises its CHILDREN, not how
+its own parent treats it — so it is not the mechanism that protects these singletons
+from their own throws; (2) `DefaultDecider` Restarts for most exceptions, not Resumes.
+The actors are in fact protected by the per-row / per-batch try/catch blocks inside
+the receive handlers — the supervisor override is effectively unused, since these
+actors have no children. The comments mislead a reader into trusting a guarantee
+that the code does not deliver.
+
+**Recommendation**
+
+Pick one of two corrections: either delete the `SupervisorStrategy` override (these
+actors have no children, so the override is dead) and rewrite the comments to credit
+the try/catch blocks for the alive-on-throw guarantee; or, if the override is kept
+as a forward-compat hedge, change the decider to `Decider.From(_ => Directive.Resume)`
+or similar to match the comment, AND add a clear note that the per-row catch is what
+keeps the actor running across handler throws, not the supervisor strategy.
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-003 — `AuditLogIngestActor.OnIngestAsync` uses `CreateScope`, but `OnCachedTelemetryAsync` uses `CreateAsyncScope` — and only one disposes asynchronously
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:133`, `src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:139`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:178` |
+
+**Description**
+
+`OnCachedTelemetryAsync` opens `_serviceProvider!.CreateAsyncScope()` and lets
+`await using` dispose it. `OnIngestAsync`, `OnTickAsync` in
+`SiteAuditReconciliationActor`, and `OnTickAsync` in `AuditLogPurgeActor` all open
+`_services.CreateScope()` (the synchronous variant) and dispose it with a synchronous
+`scope.Dispose()` in a `finally` block — even though the per-message work is async and
+the scoped `IAuditLogRepository` resolves an EF Core `DbContext`, which implements
+`IAsyncDisposable`. The synchronous `Dispose()` on a `DbContext` blocks on any pending
+async connection cleanup; under load this can hold the actor thread for the duration
+of a connection close, which on SQL Server may include sending a `SET TRANSACTION
+ISOLATION LEVEL` reset round-trip. Switching to `CreateAsyncScope()` + `await using`
+is the recommended pattern for scoped EF resources.
+
+**Recommendation**
+
+Change `_services.CreateScope()` to `_services.CreateAsyncScope()` in
+`OnIngestAsync`, `SiteAuditReconciliationActor.OnTickAsync`, and
+`AuditLogPurgeActor.OnTickAsync`, and replace the `try/finally { scope?.Dispose(); }`
+pattern with `await using var scope = _services.CreateAsyncScope();`. The DI scope
+will dispose asynchronously and the EF Core context will be released without
+blocking the actor thread.
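+
+The recommended shape can be sketched as follows. This is a minimal illustration, not
+the module's code: `FakeScopedResource`, `AsyncScopeSketch`, and `HandleMessageAsync`
+are hypothetical names standing in for the scoped EF-backed repository and the actor
+handlers. It assumes only that the scoped resource implements `IAsyncDisposable`.
+
+```csharp
+using System;
+using System.Threading.Tasks;
+using Microsoft.Extensions.DependencyInjection;
+
+// Hypothetical stand-in for a scoped resource (such as an EF Core DbContext)
+// that supports asynchronous disposal.
+public sealed class FakeScopedResource : IAsyncDisposable
+{
+    public static int AsyncDisposeCount;
+
+    public ValueTask DisposeAsync()
+    {
+        AsyncDisposeCount++;
+        return ValueTask.CompletedTask;
+    }
+}
+
+public static class AsyncScopeSketch
+{
+    public static async Task HandleMessageAsync(IServiceProvider services)
+    {
+        // The scope is disposed asynchronously when the method exits,
+        // including when the body throws, so no try/finally is needed.
+        await using var scope = services.CreateAsyncScope();
+        _ = scope.ServiceProvider.GetRequiredService<FakeScopedResource>();
+        await Task.Yield(); // placeholder for the real per-message work
+    }
+
+    public static ServiceProvider BuildProvider() =>
+        new ServiceCollection()
+            .AddScoped<FakeScopedResource>()
+            .BuildServiceProvider();
+}
+```
+
+With this shape the scoped resource's `DisposeAsync` runs exactly once per message,
+and the handler never calls a synchronous `Dispose()` on it.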
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-004 — `SiteAuditReconciliationActor` advances cursor even on per-row insert failure, silently abandoning permanently-failing rows
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:233-265` |
+
+**Description**
+
+`PullSiteAsync` iterates the pulled events, calls `InsertIfNotExistsAsync` inside a
+per-row try/catch, and unconditionally updates `maxOccurred = evt.OccurredAtUtc` after
+the try/catch — regardless of whether the insert succeeded or threw. The comment at
+line 247 acknowledges this: "the cursor still advances based on OccurredAtUtc — the
+row was returned by the site, so the next tick won't re-fetch it; if it permanently
+fails to persist, that's an operational concern surfaced by the log, not a hot-loop
+trigger." For a transient fault that flips to success on the next pull the design
+holds. But if a row throws on EVERY central attempt (truly permanent persistence fault —
+e.g. column-too-long, FK violation that won't resolve) the cursor advance still moves
+past it, and central will simply log on every reconciliation tick. No alert escalates
+beyond a log line. Worse, the site keeps the row `Pending` (because `MarkReconciledAsync`
+is only called for rows the puller flipped centrally) AND will trip the
+`SiteAuditTelemetryStalled` signal because the backlog never drains, but the central
+log message is the only place an operator could correlate the stall with the
+persistent insert failure.
+
+**Recommendation**
+
+Either (a) only advance the cursor for rows whose `InsertIfNotExistsAsync` returned
+cleanly — leave `maxOccurred` at the previous value for the failing row so the next
+tick retries; or (b) increment a dedicated `CentralAuditPermanentInsertFailure` health
+counter on the per-row catch so the failure is observable on the dashboard instead of
+buried in the log. Option (a) needs a guard against the same row throwing forever
+and saturating the puller: a small per-event retry counter held in the actor's state with
+a permanent-skip + `LogCritical` threshold is the standard escape valve.
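+
+Option (a) with the retry-counter guard could look roughly like the sketch below. All
+names (`PulledEvent`, `ReconciliationCursor`, `ProcessBatch`) are hypothetical, the
+synchronous `insert` delegate stands in for the real `InsertIfNotExistsAsync` call, and
+the batch is assumed to arrive ordered by `OccurredAtUtc`. The sketch stops at the first
+failing row so the cursor never moves past an unpersisted event; a real implementation
+might instead continue and rely on the idempotent insert.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+public sealed record PulledEvent(Guid EventId, DateTime OccurredAtUtc);
+
+public sealed class ReconciliationCursor
+{
+    private readonly Dictionary<Guid, int> _failureCounts = new();
+    private readonly int _maxAttempts;
+
+    public ReconciliationCursor(int maxAttempts) => _maxAttempts = maxAttempts;
+
+    public DateTime Position { get; private set; } = DateTime.MinValue;
+
+    // Rows given up on after _maxAttempts failures (would be logged at Critical).
+    public List<Guid> PermanentlySkipped { get; } = new();
+
+    public void ProcessBatch(IEnumerable<PulledEvent> batch, Action<PulledEvent> insert)
+    {
+        foreach (var evt in batch)
+        {
+            try
+            {
+                insert(evt);
+                _failureCounts.Remove(evt.EventId);
+            }
+            catch (Exception)
+            {
+                var attempts = _failureCounts.GetValueOrDefault(evt.EventId) + 1;
+                if (attempts < _maxAttempts)
+                {
+                    // Leave the cursor before the failing row so the next tick retries it.
+                    _failureCounts[evt.EventId] = attempts;
+                    return;
+                }
+
+                // Threshold reached: record the permanent skip and move on.
+                _failureCounts.Remove(evt.EventId);
+                PermanentlySkipped.Add(evt.EventId);
+            }
+
+            Position = evt.OccurredAtUtc;
+        }
+    }
+}
+```
+
+The retry counter lives in memory only, so an actor restart resets it; that only delays
+the permanent-skip decision and does not lose rows.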
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-005 — `GetBacklogStatsAsync` holds the SQLite hot-path write lock for the full COUNT+MIN scan
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:597-657` |
+
+**Description**
+
+`SqliteAuditWriter.GetBacklogStatsAsync` takes `_writeLock` (the same lock that
+serialises every batch INSERT in `FlushBatch`) and holds it for the duration of a
+`SELECT COUNT(*), MIN(OccurredAtUtc) FROM AuditLog WHERE ForwardState = 'Pending'`.
+`SiteAuditBacklogReporter` calls this on a 30-second timer. On a healthy site with
+few `Pending` rows the index-only scan is fast; under the scenario the metric exists
+to detect — a prolonged central outage growing the backlog "indefinitely" per
+Component-AuditLog.md — a `COUNT(*)` over hundreds of thousands of `Pending` rows
+on the `IX_SiteAuditLog_ForwardState_Occurred` index is no longer cheap, and the
+duration of that scan is added to the hot-path write latency for every concurrent
+script. The hot path is supposed to be "durable in microseconds" per the design doc;
+a multi-hundred-millisecond probe stall in the same period would not be visible
+externally but would back-pressure the bounded write channel. `ReadPendingAsync` and
+`ReadPendingSinceAsync` share the same lock for the same reason and have the same
+exposure under backlog growth.
+
+**Recommendation**
+
+Either (a) move the SELECT outside the write lock by using a second, dedicated
+read-only SQLite connection (Microsoft.Data.Sqlite supports concurrent connections
+to the same file when journal_mode=WAL is enabled — which would also benefit the
+hot path); or (b) cache the last snapshot inside the writer and recompute it
+lazily on a dedicated background tick so the reporter reads a pre-computed snapshot
+without acquiring the write lock. Option (a) also unblocks `ReadPendingAsync` /
+`ReadPendingSinceAsync` from competing with the writer.
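+
+A minimal sketch of option (b), assuming the snapshot is an immutable value published
+through a single reference. `BacklogSnapshot` and `BacklogSnapshotCache` are hypothetical
+names, and the `compute` delegate stands in for the real COUNT+MIN query. The refresh
+still has to run the query somewhere (under the write lock or on a second connection);
+the point is only that the reporter's read path no longer touches the lock.
+
+```csharp
+using System;
+using System.Threading;
+
+public sealed record BacklogSnapshot(long PendingCount, DateTime? OldestPendingUtc, DateTime ComputedAtUtc);
+
+public sealed class BacklogSnapshotCache
+{
+    private readonly Func<BacklogSnapshot> _compute;
+    private BacklogSnapshot _latest = new(0, null, DateTime.MinValue);
+
+    public BacklogSnapshotCache(Func<BacklogSnapshot> compute) => _compute = compute;
+
+    // Called from the dedicated background tick; the only place the query runs.
+    public void Refresh() => Volatile.Write(ref _latest, _compute());
+
+    // Called by the reporter; lock-free and never blocks the writer.
+    public BacklogSnapshot Latest => Volatile.Read(ref _latest);
+}
+```
+
+The trade-off is staleness: the reporter sees a value that is at most one refresh
+interval old, which `ComputedAtUtc` makes visible to the consumer.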
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-006 — `SqliteAuditWriter.Dispose()` does sync-over-async and may deadlock
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:697-700` |
+
+**Description**
+
+```csharp
+public void Dispose()
+{
+    DisposeAsync().AsTask().GetAwaiter().GetResult();
+}
+```
+
+This is the classic sync-over-async anti-pattern. `DisposeAsync` `await`s the
+writer-loop task with `.ConfigureAwait(false)`, so that await does not need the
+caller's context, and on a thread with no synchronization context (the typical
+.NET 10 host shutdown path) the whole call is fine. But if any caller invokes
+`Dispose()` from a context that captures (an ASP.NET request, a SynchronizationContext
+test runner, an Akka.NET dispatcher in some configurations) and any await further down
+the dispose chain omits `.ConfigureAwait(false)`, the `GetResult()` blocks the captured
+thread while the continuation tries to resume on it: classic deadlock.
+The writer is registered as a DI singleton, so this is unlikely to bite during the
+host's `IAsyncDisposable` shutdown (DI prefers `DisposeAsync` when available), but
+an integration test or future code path that constructs the writer manually inside
+a sync context can hang.
+
+**Recommendation**
+
+Drop the `IDisposable` interface and rely on `IAsyncDisposable` only — the DI container
+will call `DisposeAsync` on singletons that implement it. If a sync `Dispose` is
+required for compatibility with consumers that don't honour `IAsyncDisposable`,
+implement it as a best-effort that calls `_writeQueue.Writer.TryComplete()` + a
+short wait, without blocking the thread for the full async drain.
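+
+If the synchronous `Dispose` has to stay, the best-effort variant might look like the
+sketch below. `WriterSketch` is a hypothetical, heavily simplified stand-in for
+`SqliteAuditWriter`: the queue holds strings, the loop only counts items, and the
+500 ms bound is an arbitrary illustrative value.
+
+```csharp
+using System;
+using System.Threading;
+using System.Threading.Channels;
+using System.Threading.Tasks;
+
+public sealed class WriterSketch : IAsyncDisposable, IDisposable
+{
+    private readonly Channel<string> _queue = Channel.CreateBounded<string>(16);
+    private readonly Task _loop;
+    private int _written;
+
+    public WriterSketch() => _loop = Task.Run(DrainAsync);
+
+    public int Written => Volatile.Read(ref _written);
+
+    public bool TryEnqueue(string item) => _queue.Writer.TryWrite(item);
+
+    private async Task DrainAsync()
+    {
+        await foreach (var item in _queue.Reader.ReadAllAsync().ConfigureAwait(false))
+        {
+            Interlocked.Increment(ref _written); // placeholder for the real batch INSERT
+        }
+    }
+
+    // Full drain: waits for every queued item to be processed.
+    public async ValueTask DisposeAsync()
+    {
+        _queue.Writer.TryComplete();
+        await _loop.ConfigureAwait(false);
+    }
+
+    // Best effort: stop accepting items, then wait a bounded time only.
+    public void Dispose()
+    {
+        _queue.Writer.TryComplete();
+        try
+        {
+            _loop.Wait(TimeSpan.FromMilliseconds(500));
+        }
+        catch (AggregateException)
+        {
+            // A faulted loop must not make Dispose throw.
+        }
+    }
+}
+```
+
+Because the loop runs on the thread pool and the wait is bounded, a caller holding a
+synchronization context is delayed by at most the timeout instead of blocking for the
+full drain.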
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-007 — `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs:148-218` |
+
+**Description**
+
+`AddAuditLog` registers three components that depend on `INodeIdentityProvider`:
+- `CachedCallTelemetryForwarder` — resolves with `sp.GetService<INodeIdentityProvider>()`
+  (optional, falls back to a null `SourceNode`).
+- `CachedCallLifecycleBridge` — resolves with `sp.GetService<INodeIdentityProvider>()`
+  (optional, same fallback).
+- `CentralAuditWriter` — resolves with `sp.GetRequiredService<INodeIdentityProvider>()`
+  (required, throws at first resolution if unregistered).
+
+The XML comments at lines 153 / 175 / 215 explain the reasoning — the first two are
+optional because tests may skip the registration; the third is required because "the
+production composition root in `SiteServiceRegistration` registers the provider as a
+singleton on both site and central paths". But this is a fragile guarantee — `AddAuditLog`
+itself does NOT register the provider, so a future composition root that calls
+`AddAuditLog` without first calling whatever registers `INodeIdentityProvider` will fail
+on the FIRST resolution of `ICentralAuditWriter` (which is a lazy factory) rather than
+at `AddAuditLog` time. The result: site nodes that "happen to work" because they hold
+a registered provider, central composition test fixtures that fail at runtime instead
+of DI-build time, and a `GetService`/`GetRequiredService` split that gives no clear
+contract to the reader.
+
+**Recommendation**
+
+Either (a) make all three optional: `CentralAuditWriter` already handles a null provider
+gracefully (lines 113-116, null-coalescing the caller's value); the asymmetry buys
+nothing. Or (b) make all three required and either add `services.AddSingleton<INodeIdentityProvider, ...>()`
+inside `AddAuditLog` (with a sensible default — null node name returns `<unknown>`) or
+add an explicit guard at the top of `AddAuditLog` that throws if no provider has been
+registered yet (`services.Any(d => d.ServiceType == typeof(INodeIdentityProvider))`).
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-008 — Test composition roots that omit `IAuditPayloadFilter` silently pass UNREDACTED payloads through the writer chain
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Security |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Site/FallbackAuditWriter.cs:51-77`, `src/ScadaLink.AuditLog/Central/CentralAuditWriter.cs:77-104`, `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:125,155` |
+
+**Description**
+
+`FallbackAuditWriter`, `CentralAuditWriter`, and `AuditLogIngestActor` all accept an
+`IAuditPayloadFilter` as an optional dependency, defaulting to `null = pass-through`.
+The justification in every XML comment is the same: "the M4 test composition roots
+that don't pass one keep working (they only ever write small payloads)". This is fine
+for size — but the filter also performs HEADER REDACTION (`Authorization`, `Cookie`,
+`Set-Cookie`, `X-API-Key`), GLOBAL BODY REDACTORS, and SQL PARAMETER REDACTION. A test
+fixture (or any future composition root that bypasses `AddAuditLog`) that injects a
+real `RequestSummary` will see secrets written to SQLite / MS SQL with no redaction.
+The combination "audit-write must never abort the user-facing action" + "unredacted
+secrets must never persist" (Component-AuditLog.md §Payload Capture Policy) makes the
+no-filter fallback genuinely dangerous — over-redacting on a missing filter is the
+contract the production setup honours, but the code itself defaults to under-redact.
+
+**Recommendation**
+
+Change the three null-coalesce sites to default to a non-null sentinel filter that
+performs the header redaction (`HeaderRedactList`) using the hard-coded defaults
+from `AuditLogOptions`, even when no `IAuditPayloadFilter` is registered. The
+truncation stage can remain optional; the header redaction must not. Alternatively,
+make `IAuditPayloadFilter` non-optional and have `AddAuditLog` register the real
+filter unconditionally — tests that don't bind the options section will resolve the
+default `AuditLogOptions` and get the production-default redact list automatically.
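+
+A sentinel filter of the first kind could be as small as the sketch below. It is an
+illustration under assumptions: `DefaultHeaderRedactor`, its `Redact` signature, and the
+`[REDACTED]` placeholder are hypothetical and do not reflect the real
+`IAuditPayloadFilter` contract; the header names are the four listed in the description.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+public static class DefaultHeaderRedactor
+{
+    // Header names from the finding above; matching is case-insensitive.
+    private static readonly HashSet<string> Sensitive = new(StringComparer.OrdinalIgnoreCase)
+    {
+        "Authorization", "Cookie", "Set-Cookie", "X-API-Key",
+    };
+
+    public const string Placeholder = "[REDACTED]";
+
+    // Returns a copy of the headers with sensitive values replaced;
+    // keys are preserved exactly as supplied.
+    public static IReadOnlyDictionary<string, string> Redact(
+        IReadOnlyDictionary<string, string> headers) =>
+        headers.ToDictionary(
+            kv => kv.Key,
+            kv => Sensitive.Contains(kv.Key) ? Placeholder : kv.Value);
+}
+```
+
+Using something like this as the fallback in place of `null` would mean a missing
+registration over-redacts rather than under-redacts.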
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-009 — `SqliteAuditWriter.DisposeAsync` comment claims `_disposed` is set early, but it isn't
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:706-740` |
+
+**Description**
+
+The first `lock (_writeLock)` block in `DisposeAsync` is commented:
+
+> Stop accepting new events. Setting _disposed first ensures any FlushBatch entered
+> after we mark disposed will fault its pending events rather than touching the
+> about-to-close connection.
+
+But the block does NOT set `_disposed = true` — it only calls
+`_writeQueue.Writer.TryComplete()` and captures `_writerLoop`. The `_disposed` flag is
+flipped in the SECOND lock block (line 738), AFTER the 5-second wait on the writer
+loop. During the wait window, a concurrent `WriteAsync` that observed the channel
+NOT-yet-completed (race: it ran before `TryComplete`) and got past `TryWrite` would
+land on the writer loop's `FlushBatch`, which then takes the lock and checks
+`_disposed` — and finds it still `false`. The check at the top of `FlushBatch`
+(line 265) `if (_disposed) { fault pending; return; }` therefore does NOT fire during
+the dispose window. In practice the channel being completed drains the loop cleanly
+and the dispose race is benign, but the comment claims a guarantee that the code
+does not implement.
+
+**Recommendation**
+
+Either set `_disposed = true` in the first lock block to match the comment (and remove
+the duplicate `_disposed` check in the second block); or rewrite the comment to
+describe the actual ordering: the channel is completed first, the loop drains
+remaining items under the lock, and `_disposed = true` is set only after the loop
+exits. The current code is correct; the comment is wrong.
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-010 — Actor drain paths accept a `CancellationToken` parameter but always pass `CancellationToken.None` downstream
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/Site/Telemetry/SiteAuditTelemetryActor.cs:92,107,124`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:228` |
+
+**Description**
+
+The drain loops on `SiteAuditTelemetryActor.OnDrainAsync` and the per-site pull on
+`SiteAuditReconciliationActor.PullSiteAsync` both pass `CancellationToken.None` to
+every async dependency call (queue reads, gRPC client, repository writes). The actor
+has no `CancellationToken` field, so there's no in-flight cancellation source —
+graceful shutdown relies entirely on `PostStop` being called and the actor's
+`Receive` continuation completing naturally. For a healthy gRPC client this is fine,
+but a stuck `IngestAuditEventsAsync` call (slow central, partition switch in progress)
+holds the actor's continuation indefinitely; the host's coordinated-shutdown will then
+time out the actor system and leave the actor in an undefined state. The brief
+references "cancellation on stop" in the partition-maintenance comments but
+`SiteAuditTelemetryActor` does not implement it.
+
+**Recommendation**
+
+Introduce a per-actor `CancellationTokenSource` populated in `PreStart` and cancelled
+in `PostStop`; pass `_lifecycleCts.Token` instead of `CancellationToken.None` in
+every async dependency call. Same change for `SiteAuditReconciliationActor`. Any
+resulting `OperationCanceledException` is already swallowed by the top-level catch
+in `OnDrainAsync` (line 128), so plumbing the token through is a localised change.
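+
+The lifecycle-token pattern, stripped of Akka.NET so it stays self-contained, can be
+sketched as below. `DrainLifecycleSketch` is a hypothetical plain class whose `PreStart`
+and `PostStop` methods mirror the actor hooks by name only, and the `ingest` delegate
+stands in for calls such as `IngestAuditEventsAsync`.
+
+```csharp
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+
+public sealed class DrainLifecycleSketch
+{
+    private CancellationTokenSource? _lifecycleCts;
+
+    public void PreStart() => _lifecycleCts = new CancellationTokenSource();
+
+    public void PostStop()
+    {
+        // Cancel first so in-flight calls observe the token, then release it.
+        _lifecycleCts?.Cancel();
+        _lifecycleCts?.Dispose();
+        _lifecycleCts = null;
+    }
+
+    // Returns true when the drain completed, false when it was cancelled by PostStop.
+    public async Task<bool> OnDrainAsync(Func<CancellationToken, Task> ingest)
+    {
+        var token = _lifecycleCts?.Token ?? CancellationToken.None;
+        try
+        {
+            await ingest(token);
+            return true;
+        }
+        catch (OperationCanceledException) when (token.IsCancellationRequested)
+        {
+            return false;
+        }
+    }
+}
+```
+
+A dependency call that honours the token unblocks as soon as `PostStop` runs, instead of
+holding the continuation until coordinated shutdown times out.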
+
+**Resolution**
+
+_Unresolved._
+
+### AuditLog-011 — `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs:53-55, 263-276, 301-346` |
+
+**Description**
+
+The XML doc on `AddAuditLog` is explicit: "Idempotent re-registration is not supported;
+call this exactly once per `IServiceCollection`." But `AddAuditLogHealthMetricsBridge`
+calls `services.AddHostedService<SiteAuditBacklogReporter>()` (line 275), which is
+NOT idempotent — every call registers another descriptor, and the host will spin up
+N reporters and have them all poll SQLite every 30 s, all push the same snapshot into
+`ISiteHealthCollector`. The site composition path is supposed to call this exactly
+once, but tests or composition refactors that accidentally call it twice will pay 2x the
+SQL probe rate and redundantly overwrite the same snapshot (no race, just
+wasted work). Worse, `AddAuditLogCentralMaintenance` (line 301) is also non-idempotent:
+`AddOptions<AuditLogPartitionMaintenanceOptions>` and `AddHostedService<AuditLogPartitionMaintenanceService>`
+will pile up.
+
+**Recommendation**
+
+Either (a) guard each Add* helper with a "has the marker been seen?" sentinel
+(register a private marker descriptor on first call, no-op on subsequent calls);
+or (b) explicitly document idempotency on the public surface of every helper and
+verify with a unit test in `AddAuditLogTests`. Option (a) matches the pattern other
+SDK extensions use and removes a foot-gun.
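+
+Option (a) might be sketched as follows. Everything here is hypothetical naming
+(`IdempotentRegistrationSketch`, `AddBridgeOnce`, `IReporter`, `FakeReporter`); the
+plain `AddSingleton<IReporter, FakeReporter>()` call stands in for whatever
+registrations the real helper performs and is used because it visibly accumulates
+descriptors when repeated.
+
+```csharp
+using System.Linq;
+using Microsoft.Extensions.DependencyInjection;
+
+public interface IReporter { }
+
+public sealed class FakeReporter : IReporter { }
+
+public static class IdempotentRegistrationSketch
+{
+    // Private marker: its presence in the collection means the helper already ran.
+    private sealed class BridgeRegisteredMarker { }
+
+    public static IServiceCollection AddBridgeOnce(this IServiceCollection services)
+    {
+        if (services.Any(d => d.ServiceType == typeof(BridgeRegisteredMarker)))
+        {
+            return services; // second and later calls are no-ops
+        }
+
+        services.AddSingleton(new BridgeRegisteredMarker());
+        services.AddSingleton<IReporter, FakeReporter>();
+        return services;
+    }
+}
+```
+
+The marker keeps the guard independent of how the individual registrations behave, so
+the helper stays idempotent even if registrations are added to it later.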
+
+**Resolution**
+
+_Unresolved._