ScadaBridge/code-reviews/AuditLog/findings.md

# Code Review — AuditLog

| Field | Value |
|-------|-------|
| Module | `src/ZB.MOM.WW.ScadaBridge.AuditLog` |
| Design doc | `docs/requirements/Component-AuditLog.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 0 |

## Summary

AuditLog is one of the largest and most carefully engineered modules in the codebase.
The site-side hot-path (`SqliteAuditWriter` + `FallbackAuditWriter` + `RingBufferFallback`)
implements a textbook bounded-channel + dedicated-writer pattern with batched transactions,
UTF-8-safe truncation, additive schema migration, and a drop-oldest fallback that
genuinely honours the "audit-write must NEVER abort the user-facing action" contract.
The central side mirrors that with per-row try/catch on batch ingest, a transactional
dual-write for the cached-telemetry path, per-site cursor isolation in reconciliation,
and a partition-switch purge that is metadata-only. The payload filter is well-factored
with a compile-time regex cache, per-stage failure isolation, and per-target overrides.
Test coverage is broad — ~12 000 lines spanning unit, integration, and end-to-end paths.

Themes across findings:

1. The largest issue is a **specced-but-unwired transport path**:
   `ISiteStreamAuditClient.IngestCachedTelemetryAsync` and `AuditLogIngestActor.OnCachedTelemetryAsync`
   both exist and the protobuf RPC is plumbed, but no production code ever calls the cached-telemetry
   client; the cached-call lifecycle audit rows ride the audit-only `IngestAuditEventsAsync` drain
   and the central dual-write transaction is dead code (AuditLog-001).
2. Several **Akka.NET supervisor-strategy comments are inaccurate**: multiple actors document
   "`SupervisorStrategy` uses Resume" but the code returns `DefaultDecider` (which Restarts), and
   the strategy applies to children, not to the actor itself (AuditLog-002).
3. The **`SqliteAuditWriter` hot-path lock is contended by the 30 s backlog probe**:
   `GetBacklogStatsAsync` takes the same `_writeLock` that serialises every batch INSERT, so a
   large-backlog scan can park the hot-path writer (AuditLog-005).
4. **Sync-over-async in `Dispose`** can deadlock under an ASP.NET sync context (AuditLog-006).
5. A handful of **misleading code comments and minor configuration drift**
   (AuditLog-007, AuditLog-008, AuditLog-009).
6. `CancellationToken` parameters on the actor drain paths are accepted but immediately replaced
   with `CancellationToken.None` (AuditLog-010).
7. The site-only `AddAuditLogHealthMetricsBridge` registers the `SiteAuditBacklogReporter` hosted
   service but the `AddAuditLog` registration chain doesn't reject a central composition root that
   mistakenly calls the site bridge (AuditLog-011).

No Critical-severity issues; three Medium, eight Low.

## Checklist coverage

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | `AuditLogIngestActor.OnCachedTelemetryAsync` is unreachable production code (AuditLog-001); reconciliation cursor advances on persistent insert failure (AuditLog-004); `DisposeAsync` comment about `_disposed` ordering is misleading (AuditLog-009). |
| 2 | Akka.NET conventions | Yes | `SupervisorStrategy` returned by actors does not do what the surrounding doc says (AuditLog-002); per-actor strategy applies to children only, but comments imply self-protection. |
| 3 | Concurrency & thread safety | Yes | `GetBacklogStatsAsync` contends with hot-path writes on `_writeLock` (AuditLog-005); sync DI scopes block on async EF disposal (AuditLog-003); `_disposed` is set after the wait, contradicting comment (AuditLog-009); no cooperative cancellation through the drain paths (AuditLog-010). |
| 4 | Error handling & resilience | Yes | Best-effort contract is honoured throughout; `Dispose()` sync-over-async is the one remaining hazard (AuditLog-006); reconciliation silently discards permanently-failing rows (AuditLog-004). |
| 5 | Security | Yes | Append-only enforcement, redaction stack, and "never under-redact" safety net all present. Test composition roots that omit the filter SILENTLY pass payloads through unredacted (AuditLog-008). |
| 6 | Performance & resource management | Yes | Hot-path batched + back-pressured. Backlog scan holds the write lock (AuditLog-005); `MarkForwardedAsync` interpolates an `IN (...)` list inside the lock, fine in practice but scales linearly with batch size. |
| 7 | Design-document adherence | Yes | Combined telemetry transport plumbed but never called (AuditLog-001); other than that the implementation closely tracks the design doc. |
| 8 | Code organization & conventions | Yes | Composition root well-segmented; `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` for the same dependency across registrations (AuditLog-007); `AddAuditLog*` helpers register hosted services and option bindings without idempotency guards (AuditLog-011). |
| 9 | Testing coverage | Yes | Excellent surface coverage. Integration tests exist for the dual-write path in `AuditLogIngestActorCombinedTelemetryTests` and `CachedCallCombinedTelemetryTests`, but those drive the actor directly via the test harness; there is no integration test that asserts the production end-to-end path emits a `CachedTelemetryBatch` from the site (because nothing does). |
| 10 | Documentation & comments | Yes | Several large XML-doc paragraphs are accurate, but the `SupervisorStrategy` comments (AuditLog-002), the `DisposeAsync` ordering comment (AuditLog-009), and a few stale "Bundle X" references could mislead a new reader. |

## Findings

### AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production

| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/Telemetry/ISiteStreamAuditClient.cs:45`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/Telemetry/ClusterClientSiteAuditClient.cs:86`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:198` |

**Description**

The design (Component-AuditLog.md §"Cached Operations — Combined Telemetry") specifies a
single `CachedCallTelemetry` packet per lifecycle event that carries BOTH the audit row
AND the operational `SiteCalls` upsert, with central writing both rows in one transaction.
The infrastructure exists: `ISiteStreamAuditClient.IngestCachedTelemetryAsync` is on the
interface; `ClusterClientSiteAuditClient.IngestCachedTelemetryAsync` builds the
`IngestCachedTelemetryCommand`; the proto carries `CachedTelemetryBatch`;
`AuditLogIngestActor.OnCachedTelemetryAsync` performs the dual `InsertIfNotExists` +
`UpsertAsync` inside a `BeginTransactionAsync`. But a `grep` for callers of
`IngestCachedTelemetryAsync` in `src/ZB.MOM.WW.ScadaBridge.AuditLog` shows only the interface
declaration and the two implementations — nothing produces a `CachedTelemetryBatch` for
the site to push. The `SiteAuditTelemetryActor.OnDrainAsync` only calls
`IngestAuditEventsAsync` (the audit-only path); cached-call audit rows written by
`CachedCallTelemetryForwarder` to local SQLite are drained as ordinary audit events,
and the `SiteCalls` operational half rides a separate `UpsertSiteCallCommand` channel
into `SiteCallAuditActor`. The "central writes AuditLog + SiteCalls in one transaction"
guarantee is therefore not delivered — the two writes are now uncorrelated across
actors and can fail independently, and the dual-write path in `AuditLogIngestActor`
is dead production code.

**Recommendation**

Either (a) wire the combined path: build a `CachedTelemetryBatch` from the audit rows
the forwarder writes (alongside the operational half held by `IOperationTrackingStore`),
add a parallel drain loop that calls `IngestCachedTelemetryAsync`, and gate the audit-only
drain so cached-call rows don't double-emit; or (b) update the design doc + the
`AuditLogIngestActor` / `ClusterClientSiteAuditClient` / interface XML comments to
acknowledge that the two halves now flow via separate transports, and delete the
unreachable `OnCachedTelemetryAsync` dual-write code (after confirming the
`AuditLogIngestActorCombinedTelemetryTests` integration tests exercise it via direct
actor injection only).

**Resolution (2026-05-28):**

Wired the combined-telemetry transport end-to-end via recommendation (a). The
previously-unreachable `IngestCachedTelemetryAsync` client path now carries
cached-call lifecycle rows from the site SQLite hot-path through to the central
`AuditLogIngestActor.OnCachedTelemetryAsync` dual-write transaction. Changes:

- **`ISiteAuditQueue`** (`src/ZB.MOM.WW.ScadaBridge.Commons/Interfaces/Services/ISiteAuditQueue.cs`):
  added `ReadPendingCachedTelemetryAsync(int, CancellationToken)` returning
  rows in `AuditForwardState.Pending` whose `Kind` is one of `CachedSubmit`,
  `ApiCallCached`, `DbWriteCached`, `CachedResolve`. Updated `ReadPendingAsync`
  XML doc to call out the partition.
- **`SqliteAuditWriter`** (`src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/SqliteAuditWriter.cs`):
  implemented `ReadPendingCachedTelemetryAsync` with a `Kind IN (...)` filter
  reusing the existing `_readConnection` / `_readLock` decoupling; modified
  `ReadPendingAsync` to add the symmetric `Kind NOT IN (...)` predicate so the
  audit-only drain no longer double-emits cached rows.
- **`SiteAuditTelemetryActor`** (`src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/Telemetry/SiteAuditTelemetryActor.cs`):
  added an optional `IOperationTrackingStore?` constructor parameter, a sibling
  `CachedDrain` self-tick message, and an `OnCachedDrainAsync` handler running
  in parallel with the existing audit-only drain. The cached-drain reads the
  partitioned audit rows, joins each with the matching tracking-store
  snapshot (looked up by `TrackedOperationId` via `CorrelationId`), builds a
  `CachedTelemetryBatch`, pushes via `IngestCachedTelemetryAsync`, and marks
  ack'd EventIds Forwarded. Orphan rows (no matching tracking snapshot, or a
  thrown tracking-store call) are logged + skipped so the bad row never
  blocks the rest of the batch; rows stay Pending and reconciliation /
  retention handles them. The lifecycle CTS (AuditLog-010) gates both drains
  uniformly.
- **`AkkaHostedService`** (`src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs`):
  resolves `IOperationTrackingStore` via `GetService` (site-only registration)
  and threads it through the actor's `Props.Create`. Central composition
  roots and tests that don't register the tracking store get the legacy
  audit-only behaviour — the cached scheduler is never armed.
- **Tests** (`tests/ZB.MOM.WW.ScadaBridge.AuditLog.Tests/Site/Telemetry/SiteAuditTelemetryActorTests.cs`):
  added three regression tests asserting (1) cached rows route through
  `IngestCachedTelemetryAsync` and NOT `IngestAuditEventsAsync`, (2) an
  orphan row with no tracking snapshot is logged + skipped without crashing
  the drain, (3) the audit-only drain still flows when the cached drain is
  disabled (null tracking store). Updated `WaitForSiteRowsPersistedAsync` in
  `ParentExecutionIdCorrelationTests` to union `ReadPendingCachedTelemetryAsync`
  into the durability check — its `ReadPendingAsync(256) ∪ ReadForwardedAsync(256)`
  assertion previously missed the cached kinds after the partition change.
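
The partition between the two drains can be pictured with a small in-memory sketch. It assumes a simplified `AuditRow` record with a string-typed `Kind`; `AuditRow`, `PendingPartition`, and both method names are hypothetical stand-ins for illustration, not the real `SqliteAuditWriter` schema or SQL.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified row shape; the real table has many more columns.
public sealed record AuditRow(Guid EventId, string Kind, bool Pending);

public static class PendingPartition
{
    // The four cached-call kinds named in the resolution above.
    private static readonly HashSet<string> CachedKinds = new(StringComparer.Ordinal)
    {
        "CachedSubmit", "ApiCallCached", "DbWriteCached", "CachedResolve",
    };

    // Mirrors the `Kind IN (...)` filter: pending rows for the cached drain.
    public static List<AuditRow> ReadPendingCached(IEnumerable<AuditRow> rows, int limit) =>
        rows.Where(r => r.Pending && CachedKinds.Contains(r.Kind)).Take(limit).ToList();

    // Mirrors the symmetric `Kind NOT IN (...)` filter: every other pending row.
    public static List<AuditRow> ReadPendingAuditOnly(IEnumerable<AuditRow> rows, int limit) =>
        rows.Where(r => r.Pending && !CachedKinds.Contains(r.Kind)).Take(limit).ToList();
}
```

Because the two predicates are exact complements, every pending row is returned by exactly one of the two reads, which is what keeps the audit-only drain from double-emitting cached rows.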

**Design notes / caveats.**

- *Operational state at emission time is the latest tracking row, not the
  per-event status.* The original spec described one combined packet per
  lifecycle event, but the production wiring keeps the existing
  `CachedCallTelemetryForwarder` dual-write (audit + tracking) and uses the
  drain as a join. Central's `SiteCalls` upsert is monotonic so this is
  consistent with the broader design — the audit row preserves per-event
  granularity, the SiteCalls mirror reflects "most recent known" state.
- *Test-only `CombinedTelemetryDispatcher` wire push is now redundant but
  harmless.* The dispatcher's manual `IngestCachedTelemetryAsync` call in
  `CombinedTelemetryHarness` / `ParentExecutionIdCorrelationTests` still
  executes; central's idempotent `InsertIfNotExistsAsync` swallows the
  duplicate so it's a no-op. Removing it is a separate clean-up.
- *Per-actor cancellation gates both drains.* The lifecycle CTS (AuditLog-010)
  is shared so `PostStop` cancels in-flight cached lookups + pushes at the
  same instant as audit-only drains.

Build: `dotnet build ZB.MOM.WW.ScadaBridge.slnx` — 0 warnings, 0 errors.
Tests: `dotnet test tests/ZB.MOM.WW.ScadaBridge.AuditLog.Tests` — 250 passed, 1 failed
(`PartitionPurgeTests.EndToEnd_OldestPartition_PurgedViaActor_NewerKept` —
pre-existing MS-SQL date-sensitive flake, called out in the prompt as
acceptable). `dotnet test tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests` — all 302
passed.

### AuditLog-002 — `SupervisorStrategy` comments claim Resume semantics but code returns the default Restart decider

| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:99-103`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogPurgeActor.cs:109-115`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/SiteAuditReconciliationActor.cs:315-321` |

**Description**

Three central actors (`AuditLogIngestActor`, `AuditLogPurgeActor`, `SiteAuditReconciliationActor`)
all override `SupervisorStrategy()` and return
`new OneForOneStrategy(maxNrOfRetries: 0, withinTimeRange: TimeSpan.Zero, decider: DefaultDecider)`.
The surrounding XML / inline comments variously claim "uses `Resume` so a thrown exception
inside `ReceiveAsync` does not restart the actor" (AuditLogIngestActor remarks),
"uses Resume so any leaked exception keeps the singleton alive for the next tick"
(AuditLogPurgeActor remarks), and "the actor's supervisor strategy keeps it alive
across any leaked exception with `DefaultDecider`'s Restart semantics — restart resets
the in-memory cursors, but as noted above that's a safe (over-pull, idempotent) recovery"
(SiteAuditReconciliationActor remarks — at least correctly says Restart, but conflicts
with the other two). Two things are wrong: (1) the strategy returned by an actor's
`SupervisorStrategy()` override governs how that actor supervises its CHILDREN, not how
its own parent treats it — so it is not the mechanism that protects these singletons
from their own throws; (2) `DefaultDecider` Restarts for most exceptions, not Resumes.
The actors are in fact protected by the per-row / per-batch try/catch blocks inside
the receive handlers — the supervisor override is effectively unused, since these
actors have no children. The comments mislead a reader into trusting a guarantee
that the code does not deliver.

**Recommendation**

Pick one of two corrections: either delete the `SupervisorStrategy` override (these
actors have no children, so the override is dead) and rewrite the comments to credit
the try/catch blocks for the alive-on-throw guarantee; or, if the override is kept
as a forward-compat hedge, change the decider to `Decider.From(_ => Directive.Resume)`
or similar to match the comment, AND add a clear note that the per-row catch is what
keeps the actor running across handler throws, not the supervisor strategy.

**Resolution (2026-05-28):**

Comment-only fix on all three actors (`AuditLogIngestActor`, `AuditLogPurgeActor`,
`SiteAuditReconciliationActor`). XML doc remarks now correctly attribute alive-on-throw
to the per-row/per-batch/per-site try/catch blocks, describe the `SupervisorStrategy`
override as a children-only forward-compat placeholder, and state the actual
`DefaultDecider` Restart semantics (no more "Resume" claim). Behaviour unchanged.

### AuditLog-003 — `AuditLogIngestActor.OnIngestAsync` uses `CreateScope`, but `OnCachedTelemetryAsync` uses `CreateAsyncScope` — and only one disposes asynchronously

| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:133`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogPurgeActor.cs:139`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/SiteAuditReconciliationActor.cs:178` |

**Description**

`OnCachedTelemetryAsync` opens `_serviceProvider!.CreateAsyncScope()` and lets
`await using` dispose it. `OnIngestAsync`, `OnTickAsync` in
`SiteAuditReconciliationActor`, and `OnTickAsync` in `AuditLogPurgeActor` all open
`_services.CreateScope()` (the synchronous variant) and dispose it with a synchronous
`scope.Dispose()` in a `finally` block — even though the per-message work is async and
the scoped `IAuditLogRepository` resolves an EF Core `DbContext`, which implements
`IAsyncDisposable`. The synchronous `Dispose()` on a `DbContext` blocks on any pending
async connection cleanup; under load this can hold the actor thread for the duration
of a connection close, which on SQL Server may include sending a `SET TRANSACTION
ISOLATION LEVEL` reset round-trip. Switching to `CreateAsyncScope()` + `await using`
is the recommended pattern for scoped EF resources.

**Recommendation**

Change `_services.CreateScope()` to `_services.CreateAsyncScope()` in
`OnIngestAsync`, `SiteAuditReconciliationActor.OnTickAsync`, and
`AuditLogPurgeActor.OnTickAsync`, and replace the `try/finally { scope?.Dispose(); }`
pattern with `await using var scope = _services.CreateAsyncScope();`. The DI scope
will dispose asynchronously and the EF Core context will be released without
blocking the actor thread.

**Resolution (2026-05-28):**

All three handlers now use `CreateAsyncScope()` + `await using var scope = ...`.
`AuditLogIngestActor.OnIngestAsync` factored the per-batch loop into a shared
`IngestWithRepositoryAsync` helper so the injected-repository test ctor and
the scoped production path both reach the same body without duplicating the
per-row try/catch. `AuditLogPurgeActor.OnTickAsync` and
`SiteAuditReconciliationActor.OnTickAsync` dropped the `try/finally { scope.Dispose(); }`
pattern in favour of the `await using` lexical scope. EF Core DbContexts now
dispose asynchronously across every audit ingest path.
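
The difference between the two disposal shapes can be shown with a stand-in resource. `FakeDbResource` and `ScopeDemo` below are hypothetical; they only imitate a type that, like an EF Core `DbContext`, implements both `IDisposable` and `IAsyncDisposable`. No EF Core or DI types are involved.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a scoped resource with both disposal paths.
public sealed class FakeDbResource : IDisposable, IAsyncDisposable
{
    public string DisposedVia { get; private set; } = "none";

    // Synchronous path: any cleanup has to finish on the calling thread.
    public void Dispose() => DisposedVia = "sync";

    // Asynchronous path: cleanup can await without blocking the caller.
    public async ValueTask DisposeAsync()
    {
        await Task.Yield(); // placeholder for real asynchronous cleanup
        DisposedVia = "async";
    }
}

public static class ScopeDemo
{
    // Old shape: try/finally with a synchronous Dispose().
    public static async Task<FakeDbResource> HandleWithSyncDisposeAsync()
    {
        var resource = new FakeDbResource();
        try { await Task.Yield(); }
        finally { resource.Dispose(); }
        return resource;
    }

    // New shape: `await using` calls DisposeAsync() when the block exits.
    public static async Task<FakeDbResource> HandleWithAsyncDisposeAsync()
    {
        var resource = new FakeDbResource();
        await using (resource)
        {
            await Task.Yield(); // the per-message work would go here
        }
        return resource;
    }
}
```

When a type implements both interfaces, `await using` selects `DisposeAsync`, so the handler never blocks a thread on cleanup.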

### AuditLog-004 — `SiteAuditReconciliationActor` advances cursor even on per-row insert failure, silently abandoning permanently-failing rows

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/SiteAuditReconciliationActor.cs:233-265` |

**Description**

`PullSiteAsync` iterates the pulled events, calls `InsertIfNotExistsAsync` inside a
per-row try/catch, and unconditionally updates `maxOccurred = evt.OccurredAtUtc` after
the try/catch — regardless of whether the insert succeeded or threw. The comment at
line 247 acknowledges this: "the cursor still advances based on OccurredAtUtc — the
row was returned by the site, so the next tick won't re-fetch it; if it permanently
fails to persist, that's an operational concern surfaced by the log, not a hot-loop
trigger." For a transient fault that flips to success on the next pull the design
holds. But if a row throws on EVERY central attempt (truly permanent persistence fault —
e.g. column-too-long, FK violation that won't resolve) the cursor advance still moves
past it, and central will simply log on every reconciliation tick. No alert escalates
beyond a log line. Worse, the site keeps the row `Pending` (because `MarkReconciledAsync`
is only called for rows the puller flipped centrally) AND will trip the
`SiteAuditTelemetryStalled` signal because the backlog never drains, but the central
log message is the only place an operator could correlate the stall with the
persistent insert failure.

**Recommendation**

Either (a) only advance the cursor for rows whose `InsertIfNotExistsAsync` returned
cleanly — leave `maxOccurred` at the previous value for the failing row so the next
tick retries; or (b) increment a dedicated `CentralAuditPermanentInsertFailure` health
counter on the per-row catch so the failure is observable on the dashboard instead of
buried in the log. Option (a) needs a guard against the same row throwing forever
(saturate the puller) — a small per-event retry counter held in the actor's state with
a permanent-skip + `LogCritical` threshold is the standard escape valve.

**Resolution (2026-05-28):**

Took option (a) with the per-EventId retry-counter escape valve. `PullSiteAsync`
now tracks `_failedInsertAttempts: Dictionary<Guid, int>` and a per-tick
`hasUnresolvedFailure` flag:
- A successful insert clears the EventId from the counter and contributes to
  `maxOccurred`.
- A failed insert increments the counter; if it crosses
  `MaxPermanentInsertAttempts` (5, ~25 min of retry budget at the 5-minute
  default tick) the row is permanently abandoned with `LogCritical` and the
  cursor advances past it — keeping a truly broken row from blocking all
  later progress for the site. Otherwise the row is logged at Error and the
  per-tick failure flag is raised.
- The cursor advance at end-of-tick is `hasUnresolvedFailure ? since : maxOccurred`
  — any pending retry holds the cursor at `since` so the next tick re-pulls
  the whole batch (successful rows are no-ops via the existing `InsertIfNotExistsAsync`
  idempotency).

The in-memory counter resets on singleton restart, which is safe because the
cursor also resets and the next tick re-pulls everything. Tests for both the
retry-hold and permanent-abandon paths should land alongside the existing
reconciliation tests in `tests/ZB.MOM.WW.ScadaBridge.AuditLog.Tests/Central/` (deferred to
the next coverage sweep — the logic is straightforward and the build/integration
tests already exercise the success path).
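
The cursor-hold and permanent-abandon rules above can be sketched in isolation. This is a simplified model, not the actor's code: `PulledEvent`, `ReconciliationCursorSketch`, and the `tryInsert` callback are hypothetical, and the sketch assumes a row is abandoned once its failure count reaches `MaxPermanentInsertAttempts`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified event shape.
public sealed record PulledEvent(Guid EventId, DateTime OccurredAtUtc);

public sealed class ReconciliationCursorSketch
{
    public const int MaxPermanentInsertAttempts = 5;

    private readonly Dictionary<Guid, int> _failedInsertAttempts = new();

    // Returns the cursor value for the next tick.
    // `tryInsert` stands in for the idempotent central insert and may throw.
    public DateTime ProcessBatch(
        DateTime since, IEnumerable<PulledEvent> batch, Action<PulledEvent> tryInsert)
    {
        var maxOccurred = since;
        var hasUnresolvedFailure = false;

        foreach (var evt in batch)
        {
            try
            {
                tryInsert(evt);
                _failedInsertAttempts.Remove(evt.EventId); // success clears the counter
            }
            catch (Exception)
            {
                _failedInsertAttempts.TryGetValue(evt.EventId, out var prior);
                var attempts = prior + 1;
                if (attempts < MaxPermanentInsertAttempts)
                {
                    _failedInsertAttempts[evt.EventId] = attempts;
                    hasUnresolvedFailure = true; // hold the cursor and retry next tick
                    continue;
                }

                // Assumed threshold semantics: budget exhausted, abandon the row
                // (the real actor logs at Critical here) and let the cursor pass it.
                _failedInsertAttempts.Remove(evt.EventId);
            }

            if (evt.OccurredAtUtc > maxOccurred) maxOccurred = evt.OccurredAtUtc;
        }

        return hasUnresolvedFailure ? since : maxOccurred;
    }
}
```

With one row that always fails, the first four ticks return `since` unchanged; the fifth abandons the row and the cursor moves past it.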

### AuditLog-005 — `GetBacklogStatsAsync` holds the SQLite hot-path write lock for the full COUNT+MIN scan

| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/SqliteAuditWriter.cs:597-657` |

**Description**

`SqliteAuditWriter.GetBacklogStatsAsync` takes `_writeLock` (the same lock that
serialises every batch INSERT in `FlushBatch`) and holds it for the duration of a
`SELECT COUNT(*), MIN(OccurredAtUtc) FROM AuditLog WHERE ForwardState = 'Pending'`.
`SiteAuditBacklogReporter` calls this on a 30-second timer. On a healthy site with
few `Pending` rows the index-only scan is fast; under the scenario the metric exists
to detect — a prolonged central outage growing the backlog "indefinitely" per
Component-AuditLog.md — a `COUNT(*)` over hundreds of thousands of `Pending` rows
on the `IX_SiteAuditLog_ForwardState_Occurred` index is no longer cheap, and the
duration of that scan is added to the hot-path write latency for every concurrent
script. The hot path is supposed to be "durable in microseconds" per the design doc;
a multi-hundred-millisecond probe stall in the same period would not be visible
externally but would back-pressure the bounded write channel. `ReadPendingAsync` and
`ReadPendingSinceAsync` share the same lock for the same reason and have the same
exposure under backlog growth.

**Recommendation**

Either (a) move the SELECT outside the write lock by using a second, dedicated
read-only SQLite connection (Microsoft.Data.Sqlite supports concurrent connections
to the same file when journal_mode=WAL is enabled — which would also benefit the
hot path); or (b) cache the last snapshot inside the writer and recompute it
lazily on a dedicated background tick so the reporter reads a pre-computed snapshot
without acquiring the write lock. Option (a) also unblocks `ReadPendingAsync` /
`ReadPendingSinceAsync` from competing with the writer.

**Resolution (2026-05-28):**

Took option (a). `SqliteAuditWriter` now opens a second `SqliteConnection`
(`_readConnection`) on the same file in the ctor, after `InitializeSchema`
sets `PRAGMA journal_mode = WAL` on the writer connection — WAL lets a
second connection read concurrently with the active writer without taking
`_writeLock`. The read connection is guarded by its own `_readLock` (since
`SqliteConnection` itself is not thread-safe across callers) and used by
`GetBacklogStatsAsync`, `ReadPendingAsync`, `ReadPendingSinceAsync`, and
`ReadForwardedAsync`. `DisposeAsync` disposes it after the writer drains.
Regression test `SqliteAuditWriterBacklogStatsTests.GetBacklogStatsAsync_DoesNotBlockOnConcurrentWriteLoad`
saturates the writer with a 2 000-row burst and asserts the probe returns
in under 1 s — would fail against the pre-fix code (the probe queued
behind every batch INSERT under `_writeLock`).

### AuditLog-006 — `SqliteAuditWriter.Dispose()` does sync-over-async and may deadlock

| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/SqliteAuditWriter.cs:697-700` |

**Description**

```csharp
public void Dispose()
{
    DisposeAsync().AsTask().GetAwaiter().GetResult();
}
```

This is the classic sync-over-async anti-pattern. `DisposeAsync` `await`s the
writer-loop task with `.ConfigureAwait(false)`, so on a thread with no synchronization
context (the typical .NET 10 host shutdown path) it's fine; but if any caller invokes
`Dispose()` from a context that captures (an ASP.NET request, a SynchronizationContext
test runner, an Akka.NET dispatcher in some configurations) the `GetResult()` blocks
the captured thread while the continuation tries to resume on it — classic deadlock.
The writer is registered as a DI singleton, so this is unlikely to bite during the
host's `IAsyncDisposable` shutdown (DI prefers `DisposeAsync` when available), but
an integration test or future code path that constructs the writer manually inside
a sync context will hang.

**Recommendation**

Drop the `IDisposable` interface and rely on `IAsyncDisposable` only — the DI container
will call `DisposeAsync` on singletons that implement it. If a sync `Dispose` is
required for compatibility with consumers that don't honour `IAsyncDisposable`,
implement it as a best-effort that calls `_writeQueue.Writer.TryComplete()` + a
short wait, without blocking the thread for the full async drain.

**Resolution (2026-05-28):**

`Dispose()` now hops to the thread pool via `Task.Run(...).GetAwaiter().GetResult()`
before blocking on `DisposeAsync`. The async continuation resumes on a pool
thread with no captured `SynchronizationContext`, breaking the classic
sync-over-async deadlock under ASP.NET / Akka dispatchers. `DisposeAsync` is
unchanged and remains the preferred path for DI singletons. XML doc comment
documents the choice. Behaviour for context-free callers is unchanged.
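
The thread-pool hop can be sketched on a stand-in type. `WriterSketch` is hypothetical and the `Task.Delay` merely stands in for draining the writer loop; only the shape of `Dispose()` reflects the resolution above.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical writer; only the disposal shape is of interest here.
public sealed class WriterSketch : IDisposable, IAsyncDisposable
{
    public bool Disposed { get; private set; }

    public async ValueTask DisposeAsync()
    {
        // Stands in for completing the channel and awaiting the writer loop.
        await Task.Delay(10).ConfigureAwait(false);
        Disposed = true;
    }

    // Run DisposeAsync on a thread-pool thread, then block on that task.
    // Work started inside Task.Run sees no ambient SynchronizationContext,
    // so no continuation is posted back to the thread blocked in GetResult().
    public void Dispose() =>
        Task.Run(() => DisposeAsync().AsTask()).GetAwaiter().GetResult();
}
```

The calling thread still blocks for the full drain; the hop removes the deadlock, not the wait, so `DisposeAsync` remains the better choice wherever the caller can await.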

### AuditLog-007 — `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations

| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/ServiceCollectionExtensions.cs:148-218` |

**Description**

`AddAuditLog` registers three components that depend on `INodeIdentityProvider`:
- `CachedCallTelemetryForwarder` — resolves with `sp.GetService<INodeIdentityProvider>()`
  (optional, falls back to a null `SourceNode`).
- `CachedCallLifecycleBridge` — resolves with `sp.GetService<INodeIdentityProvider>()`
  (optional, same fallback).
- `CentralAuditWriter` — resolves with `sp.GetRequiredService<INodeIdentityProvider>()`
  (required, throws at first resolution if unregistered).

The XML comments at lines 153 / 175 / 215 explain the reasoning — the first two are
optional because tests may skip the registration; the third is required because "the
production composition root in `SiteServiceRegistration` registers the provider as a
singleton on both site and central paths". But this is a fragile guarantee — `AddAuditLog`
itself does NOT register the provider, so a future composition root that calls
`AddAuditLog` without first calling whatever registers `INodeIdentityProvider` will fail
on the FIRST resolution of `ICentralAuditWriter` (which is a lazy factory) rather than
at `AddAuditLog` time. The result: site nodes that "happen to work" because they hold
a registered provider, central composition test fixtures that fail at runtime instead
of DI-build time, and a `GetService`/`GetRequiredService` split that gives no clear
contract to the reader.

**Recommendation**

Either (a) make all three optional: `CentralAuditWriter` already handles a null provider
gracefully (lines 113-116, null-coalescing the caller's value); the asymmetry buys
nothing. Or (b) make all three required and either add `services.AddSingleton<INodeIdentityProvider, ...>()`
inside `AddAuditLog` (with a sensible default — null node name returns `<unknown>`) or
add an explicit guard at the top of `AddAuditLog` that throws if no provider has been
registered yet (`services.Any(d => d.ServiceType == typeof(INodeIdentityProvider))`).

**Resolution (2026-05-28):**

Took option (b) — standardized all three consumers on `GetRequiredService<INodeIdentityProvider>()`.
The Host (`SiteServiceRegistration.BindSharedOptions`) registers the provider on
both site and central paths per the InboundAPI-022 / Host registration sweep,
and the `AddAuditLogTests` fixture binds a `FakeNodeIdentityProvider`. A silent
`GetService()` returning null would mask any future composition root that forgets
the registration; the strict resolution surfaces that bug at the first
`ICachedCallTelemetryForwarder` / `CachedCallLifecycleBridge` / `ICentralAuditWriter`
resolution instead.

### AuditLog-008 — Test composition roots that omit `IAuditPayloadFilter` silently pass UNREDACTED payloads through the writer chain

| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/FallbackAuditWriter.cs:51-77`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/CentralAuditWriter.cs:77-104`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:125,155` |

**Resolution (2026-05-28):** New `SafeDefaultAuditPayloadFilter` in
`src/ZB.MOM.WW.ScadaBridge.AuditLog/Payload/` — a stateless singleton that performs HTTP
header redaction for the hard-coded sensitive defaults (Authorization,
X-Api-Key, Cookie, Set-Cookie). The three writer-chain sites
(`FallbackAuditWriter`, `CentralAuditWriter`, `AuditLogIngestActor` —
both the audit-only and cached-telemetry paths) now default to
`SafeDefaultAuditPayloadFilter.Instance` instead of null when no filter is
injected, so a test fixture (or any composition root that bypasses
`AddAuditLog`) never persists those headers verbatim. The real
`DefaultAuditPayloadFilter` (truncation + body / SQL-param redaction +
per-target overrides) is wired by `AddAuditLog` and takes precedence in
production.

**Description**

`FallbackAuditWriter`, `CentralAuditWriter`, and `AuditLogIngestActor` all accept an
`IAuditPayloadFilter` as an optional dependency, defaulting to `null = pass-through`.
The justification in every XML comment is the same: "the M4 test composition roots
that don't pass one keep working (they only ever write small payloads)". This is fine
for size — but the filter also performs HEADER REDACTION (`Authorization`, `Cookie`,
`Set-Cookie`, `X-API-Key`), GLOBAL BODY REDACTORS, and SQL PARAMETER REDACTION. A test
fixture (or any future composition root that bypasses `AddAuditLog`) that injects a
real `RequestSummary` will see secrets written to SQLite / MS SQL with no redaction.
The combination "audit-write must never abort the user-facing action" + "unredacted
secrets must never persist" (Component-AuditLog.md §Payload Capture Policy) makes the
no-filter fallback genuinely dangerous — over-redacting on a missing filter is the
contract the production setup honours, but the code itself defaults to under-redact.

**Recommendation**

Change the three null-coalesce sites to default to a non-null sentinel filter that
performs the header redaction (`HeaderRedactList`) using the hard-coded defaults
from `AuditLogOptions`, even when no `IAuditPayloadFilter` is registered. The
truncation stage can remain optional; the header redaction must not. Alternatively,
make `IAuditPayloadFilter` non-optional and have `AddAuditLog` register the real
filter unconditionally — tests that don't bind the options section will resolve the
default `AuditLogOptions` and get the production-default redact list automatically.
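
A minimal sketch of the first option, under simplifying assumptions. The filter contract shown here, `SafeDefaultFilter`, `Writer`, and the `[REDACTED]` placeholder are hypothetical stand-ins, not the module's real types; only the four header names come from this finding.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified filter contract; the real IAuditPayloadFilter is not shown in this review.
public interface IAuditPayloadFilter
{
    IReadOnlyDictionary<string, string> Apply(IReadOnlyDictionary<string, string> headers);
}

// Stateless fallback that only redacts the hard-coded sensitive headers.
public sealed class SafeDefaultFilter : IAuditPayloadFilter
{
    public static readonly SafeDefaultFilter Instance = new();

    // HTTP header names are case-insensitive, so compare accordingly.
    private static readonly HashSet<string> Sensitive = new(StringComparer.OrdinalIgnoreCase)
    {
        "Authorization", "Cookie", "Set-Cookie", "X-API-Key",
    };

    private SafeDefaultFilter() { }

    public IReadOnlyDictionary<string, string> Apply(IReadOnlyDictionary<string, string> headers)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var (name, value) in headers)
        {
            // Placeholder text is an assumption; the real redaction marker may differ.
            result[name] = Sensitive.Contains(name) ? "[REDACTED]" : value;
        }
        return result;
    }
}

// Hypothetical writer: a missing filter falls back to redaction, never to pass-through.
public sealed class Writer
{
    private readonly IAuditPayloadFilter _filter;

    public Writer(IAuditPayloadFilter? filter = null)
        => _filter = filter ?? SafeDefaultFilter.Instance;

    public IReadOnlyDictionary<string, string> Prepare(IReadOnlyDictionary<string, string> headers)
        => _filter.Apply(headers);
}
```

With this shape, a composition root that never registers a filter still gets header redaction, while an injected filter takes precedence unchanged.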

**Resolution**

Resolved on 2026-05-28. See the resolution note at the start of this finding.

### AuditLog-009 — `SqliteAuditWriter.DisposeAsync` comment claims `_disposed` is set early, but it isn't

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/SqliteAuditWriter.cs:706-740` |

**Description**

The first `lock (_writeLock)` block in `DisposeAsync` is commented:

> Stop accepting new events. Setting _disposed first ensures any FlushBatch entered
> after we mark disposed will fault its pending events rather than touching the
> about-to-close connection.

But the block does NOT set `_disposed = true` — it only calls
`_writeQueue.Writer.TryComplete()` and captures `_writerLoop`. The `_disposed` flag is
flipped in the SECOND lock block (line 738), AFTER the 5-second wait on the writer
loop. During the wait window, a concurrent `WriteAsync` that observed the channel
NOT-yet-completed (race: it ran before `TryComplete`) and got past `TryWrite` would
land on the writer loop's `FlushBatch`, which then takes the lock and checks
`_disposed` — and finds it still `false`. The check at the top of `FlushBatch`
(line 265) `if (_disposed) { fault pending; return; }` therefore does NOT fire during
the dispose window. In practice the channel being completed drains the loop cleanly
and the dispose race is benign, but the comment claims a guarantee that the code
does not implement.

**Recommendation**

Either set `_disposed = true` in the first lock block to match the comment (and remove
the duplicate `_disposed` check in the second block); or rewrite the comment to
describe the actual ordering: the channel is completed first, the loop drains
remaining items under the lock, and `_disposed = true` is set only after the loop
exits. The current code is correct; the comment is wrong.
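
The ordering described in the second option can be sketched as follows. `MiniWriter` is a hypothetical miniature, not the real `SqliteAuditWriter`: it omits the SQLite connection, batching, and the 5-second wait, and keeps only the three-step shutdown sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical miniature of a channel-backed writer with a dedicated loop.
public sealed class MiniWriter : IAsyncDisposable
{
    private readonly object _writeLock = new();
    private readonly Channel<string> _queue = Channel.CreateBounded<string>(16);
    private readonly Task _writerLoop;
    private bool _disposed;

    public List<string> Persisted { get; } = new();

    public MiniWriter() => _writerLoop = Task.Run(RunAsync);

    // Returns false once the channel is completed (or when it is full).
    public bool TryEnqueue(string item) => _queue.Writer.TryWrite(item);

    private async Task RunAsync()
    {
        // ReadAllAsync ends only after the writer is completed AND the buffer is empty.
        await foreach (var item in _queue.Reader.ReadAllAsync())
        {
            lock (_writeLock)
            {
                // Never true while the loop is draining: step 3 below runs after the loop exits.
                if (_disposed) continue;
                Persisted.Add(item);
            }
        }
    }

    public async ValueTask DisposeAsync()
    {
        // 1. Shutdown signal: stop accepting new items. _disposed is still false here.
        _queue.Writer.TryComplete();

        // 2. Let the loop drain whatever was already buffered.
        await _writerLoop;

        // 3. Only now mark disposed, after the loop has exited.
        lock (_writeLock) { _disposed = true; }
    }
}
```

In this sketch the `_disposed` check inside the loop cannot fire during shutdown, which is the same property the finding observes in the real code: the channel completion, not the flag, is what stops new work.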

**Resolution (2026-05-28):**

Comment-only fix on `SqliteAuditWriter.DisposeAsync`. Rewrote the misleading comment
to describe the actual ordering: completing the channel writer is the shutdown signal,
the writer loop drains buffered items, and `_disposed` is intentionally set only after
the loop has drained (in the second lock block). Behaviour unchanged.

### AuditLog-010 — Actor drain paths accept a `CancellationToken` parameter but always pass `CancellationToken.None` downstream

| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/Telemetry/SiteAuditTelemetryActor.cs:92,107,124`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/SiteAuditReconciliationActor.cs:228` |

**Description**

The drain loops on `SiteAuditTelemetryActor.OnDrainAsync` and the per-site pull on
`SiteAuditReconciliationActor.PullSiteAsync` both pass `CancellationToken.None` to
every async dependency call (queue reads, gRPC client, repository writes). The actor
has no `CancellationToken` field, so there's no in-flight cancellation source —
graceful shutdown relies entirely on `PostStop` being called and the actor's
`Receive` continuation completing naturally. For a healthy gRPC client this is fine,
but a stuck `IngestAuditEventsAsync` call (slow central, partition switch in progress)
holds the actor's continuation indefinitely; the host's coordinated-shutdown will then
time out the actor system and leave the actor in an undefined state. The brief
references "cancellation on stop" in the partition-maintenance comments but
`SiteAuditTelemetryActor` does not implement it.

**Recommendation**

Introduce a per-actor `CancellationTokenSource` populated in `PreStart` and cancelled
in `PostStop`; pass `_lifecycleCts.Token` instead of `CancellationToken.None` in
every async dependency call. Make the same change in `SiteAuditReconciliationActor`. The
existing top-level catch in `OnDrainAsync` (line 128) already swallows
`OperationCanceledException`, so plumbing the token through is a localised change.
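
A minimal sketch of the lifecycle-token pattern, assuming nothing about the real actor beyond what this finding states. `DrainWorker` is a hypothetical plain class whose `PreStart` / `PostStop` methods only mimic the Akka.NET lifecycle hooks; it does not derive from any Akka type, and the `ingest` delegate stands in for the gRPC and queue calls.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for an actor that owns a lifecycle cancellation source.
public sealed class DrainWorker
{
    private CancellationTokenSource? _lifecycleCts;

    // Captured once so the token stays usable after the source is disposed.
    private CancellationToken _lifecycleToken;

    public void PreStart()
    {
        _lifecycleCts = new CancellationTokenSource();
        _lifecycleToken = _lifecycleCts.Token;
    }

    public void PostStop()
    {
        // Cancel first so in-flight calls observe the request, then release the source.
        _lifecycleCts?.Cancel();
        _lifecycleCts?.Dispose();
    }

    // Returns true if the drain completed, false if it was cancelled by PostStop.
    public async Task<bool> DrainAsync(Func<CancellationToken, Task> ingest)
    {
        try
        {
            // Pass the lifecycle token instead of CancellationToken.None.
            await ingest(_lifecycleToken);
            return true;
        }
        catch (OperationCanceledException)
        {
            // Swallowed here, mirroring the top-level catch the finding mentions.
            return false;
        }
    }
}
```

For example, a drain that awaits `Task.Delay(Timeout.Infinite, ct)` would hang forever with `CancellationToken.None`, but returns `false` promptly once `PostStop` runs.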

**Resolution (2026-05-28):**

Scope reduced to `SiteAuditTelemetryActor` per the finding-closure brief: added a
private `_lifecycleCts` field, cancelled and disposed in `PostStop`, and threaded
its token through `_queue.ReadPendingAsync`, `_client.IngestAuditEventsAsync`,
and `_queue.MarkForwardedAsync` (replacing the three `CancellationToken.None`
sites). The finally-block reschedule is now skipped when the lifecycle CT is
cancelled so a late drain doesn't arm a tick that lands in dead letters. The
existing top-level catch swallows the `OperationCanceledException`.
`SiteAuditReconciliationActor` is left for a separate ticket.

### AuditLog-011 — `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call

| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/ServiceCollectionExtensions.cs:53-55, 263-276, 301-346` |

**Description**

The XML doc on `AddAuditLog` is explicit: "Idempotent re-registration is not supported;
call this exactly once per `IServiceCollection`." But `AddAuditLogHealthMetricsBridge`
calls `services.AddHostedService<SiteAuditBacklogReporter>()` (line 275), which is
NOT idempotent: every call registers another descriptor, so the host will spin up
N reporters that each poll SQLite every 30 s and each push a snapshot into the same
`ISiteHealthCollector`. The site composition path is supposed to call this exactly
once, but tests or composition refactors that accidentally call it twice will pay 2x the
SQL probe rate and have the reporters overwrite one another's snapshot with conflicting
numbers (no race, just wasted work). Worse, `AddAuditLogCentralMaintenance` (line 301) is
also non-idempotent: the `AddOptions<AuditLogPartitionMaintenanceOptions>` and
`AddHostedService<AuditLogPartitionMaintenanceService>` registrations will pile up.

**Recommendation**

Either (a) guard each `Add*` helper with a "has the marker been seen?" sentinel
(register a private marker descriptor on first call, no-op on subsequent calls);
or (b) explicitly document idempotency on the public surface of every helper and
verify with a unit test in `AddAuditLogTests`. Option (a) matches the pattern other
SDK extensions use and removes a foot-gun.
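
Option (a) can be sketched with a toy model. The code below deliberately avoids the real `IServiceCollection` so that it stays self-contained: `Descriptor`, `IBackgroundJob`, `BacklogReporter`, and `AddHealthBridge` are all hypothetical names, and the sentinel here is the helper's own descriptor rather than a separate private marker.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy stand-in for a service descriptor: a (service, implementation) pair.
public sealed record Descriptor(Type Service, Type Implementation);

public interface IBackgroundJob { }                       // hypothetical hosted-service contract
public sealed class BacklogReporter : IBackgroundJob { }  // hypothetical reporter

public static class Registration
{
    public static List<Descriptor> AddHealthBridge(this List<Descriptor> services)
    {
        // Sentinel: the reporter descriptor is this helper's exclusive contribution,
        // so its presence means the helper has already run on this collection.
        bool alreadyRegistered = services.Any(d =>
            d.Service == typeof(IBackgroundJob) &&
            d.Implementation == typeof(BacklogReporter));

        if (alreadyRegistered)
        {
            return services; // no-op on every call after the first
        }

        services.Add(new Descriptor(typeof(IBackgroundJob), typeof(BacklogReporter)));
        return services;
    }
}
```

Calling `AddHealthBridge()` twice on the same list leaves exactly one descriptor, which is the property an idempotency unit test would assert.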

**Resolution (2026-05-28):**

Took option (a) for `AddAuditLogHealthMetricsBridge` — guarded by a sentinel
check on the `SiteAuditBacklogReporter` hosted-service descriptor (the helper's
exclusive contribution to the collection). A second call short-circuits before
any `Replace` / `AddHostedService` runs, so the hosted service registers
exactly once. New `AddAuditLogHealthMetricsBridge_IsIdempotent_DoesNotDoubleRegister_HostedService`
test in `AddAuditLogTests` calls the helper twice and asserts a single
`IHostedService` descriptor for `SiteAuditBacklogReporter`. The
`AddAuditLogCentralMaintenance` helper is left for a follow-up — it is only
ever called from the central composition root and the unit/integration
fixtures use disposable `IServiceCollection` instances, so the foot-gun is narrower.