Well-localised perf fixes across 8 modules.
Lock decoupling / SQL streaming:
- AuditLog-005: SqliteAuditWriter gains dedicated read-only _readConnection
(+ _readLock) backed by WAL journal mode. GetBacklogStatsAsync,
ReadPendingAsync, ReadPendingSinceAsync, ReadForwardedAsync no longer
contend with the hot-path INSERT lock — backlog probes on a 30s timer
can't stall the writer under multi-hundred-K Pending backlog.
- SEL-022: dropped Cache=Shared from SiteEventLogger's default connection
string (single-connection logger; mode was dormant config).
Memory / streaming:
- CLI-019: bundle export streams base64 in 1 MB-aligned chunks via
Convert.TryFromBase64Chars straight into the FileStream — no more
full-bundle byte[] allocation.
- CentralUI-031: TransportImport now stages the upload to a per-session
temp file under Path.GetTempPath() (replaces in-memory byte[] field);
page implements IDisposable to delete the temp file on reset / new
upload / dispose. Per-circuit working set drops from ~100 MB to ~80 KB.
N+1 hoisting:
- Transport-008: added ITemplateEngineRepository.GetTemplatesWithChildrenAsync
bulk method; BundleImporter.PreviewAsync calls it once instead of per-
template-name. Single query with .Include(...).AsSplitQuery().
- DM-023: BuildDeployArtifactsCommandAsync's per-site loop now references
a pre-fetched GlobalArtifactSnapshot (shared scripts, external systems,
DB connections, notification lists, SMTP) instead of re-querying per site.
- MgmtSvc-023: HandleQueryDeployments unfiltered branch uses one
GetAllInstancesAsync bulk load + Dictionary<int,int?> lookup (was a
GetInstanceByIdAsync per record).
Small allocations / per-tick rebuilds:
- InboundAPI-019: AuditWriteMiddleware gates EnableBuffering() on
RequestHasBody() so GET/HEAD/DELETE/TRACE/OPTIONS and Content-Length:0
requests skip the FileBufferingReadStream allocation.
- NotifOutbox-006: ResolveAdapters dictionary now cached on
_adaptersCache (built lazily on first sweep) + actor-lifetime
_adaptersScope; ResolveAdapters no longer rebuilds per dispatch tick.
Verify-only:
- Comm-017: Confirmed _inProgressDeployments was deleted by Comm-016 in
commit ac96b83 — marked Resolved with that attribution. No code change.
Doc-correction:
- NS-022: Updated MailKitSmtpClientWrapper XML doc to spell out single-
connection / per-delivery-factory contract (option (b) — transient
client per Send — rejected because it re-handshakes TLS per email).
10+ new regression tests across 8 test projects. Build clean; affected
suites all green. README regenerated: 54 open (was 65).
33 KiB
Code Review — AuditLog
| Field | Value |
|---|---|
| Module | src/ScadaLink.AuditLog |
| Design doc | docs/requirements/Component-AuditLog.md |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | 1eb6e97 |
| Open findings | 2 |
Summary
AuditLog is one of the larger and most carefully-engineered modules in the codebase.
The site-side hot-path (SqliteAuditWriter + FallbackAuditWriter + RingBufferFallback)
implements a textbook bounded-channel + dedicated-writer pattern with batched transactions,
UTF-8-safe truncation, additive schema migration, and a drop-oldest fallback that
genuinely honours the "audit-write must NEVER abort the user-facing action" contract.
The central side mirrors that with per-row try/catch on batch ingest, a transactional
dual-write for the cached-telemetry path, per-site cursor isolation in reconciliation,
and a partition-switch purge that is metadata-only. The payload filter is well-factored
with a compile-time regex cache, per-stage failure isolation, and per-target overrides.
Test coverage is broad — ~12 000 lines spanning unit, integration, and end-to-end paths.
Themes across findings: (1) the largest issue is a specced-but-unwired transport path —
ISiteStreamAuditClient.IngestCachedTelemetryAsync and AuditLogIngestActor.OnCachedTelemetryAsync
both exist and the protobuf RPC is plumbed, but no production code ever calls the cached-telemetry
client; the cached-call lifecycle audit rows ride the audit-only IngestAuditEventsAsync drain
and the central dual-write transaction is dead code (AuditLog-001). (2) Several
Akka.NET supervisor-strategy comments are inaccurate — multiple actors document
"SupervisorStrategy uses Resume" but the code returns DefaultDecider (which Restarts), and
the strategy applies to children, not to the actor itself (AuditLog-002). (3) The
SqliteAuditWriter hot-path lock is contended by the 30 s backlog probe — GetBacklogStatsAsync
takes the same _writeLock that serialises every batch INSERT, so a large-backlog scan can
park the hot-path writer (AuditLog-005). (4) Sync-over-async in Dispose can deadlock under
an ASP.NET sync context (AuditLog-006). (5) A handful of misleading code comments and minor
configuration drift (AuditLog-007, AuditLog-008, AuditLog-009). (6) CancellationToken
parameters on the actor drain paths are accepted but immediately replaced with
CancellationToken.None (AuditLog-010). (7) The site-only AddAuditLogHealthMetricsBridge
registers the SiteAuditBacklogReporter hosted service but the AddAuditLog registration
chain doesn't reject a central composition root that mistakenly calls the site bridge
(AuditLog-011). No Critical-severity issues; three Medium, eight Low.
Checklist coverage
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | Yes | AuditLogIngestActor.OnCachedTelemetryAsync is unreachable production code (AuditLog-001); reconciliation cursor advances on persistent insert failure (AuditLog-004); Dispose comment about _disposed ordering is misleading (AuditLog-009). |
| 2 | Akka.NET conventions | Yes | SupervisorStrategy returned by actors does not do what the surrounding doc says (AuditLog-002); per-actor strategy applies to children only, but comments imply self-protection. |
| 3 | Concurrency & thread safety | Yes | GetBacklogStatsAsync contends with hot-path writes on _writeLock (AuditLog-005); sync DI scopes block on async EF disposal (AuditLog-003); _disposed is set after the wait, contradicting comment (AuditLog-009); no cooperative cancellation through the drain paths (AuditLog-010). |
| 4 | Error handling & resilience | Yes | Best-effort contract is honoured throughout; Dispose() sync-over-async is the one remaining hazard (AuditLog-006); reconciliation silently discards permanently-failing rows (AuditLog-004). |
| 5 | Security | Yes | Append-only enforcement, redaction stack, and "never under-redact" safety net all present. Test composition roots that omit the filter SILENTLY pass payloads through unredacted (AuditLog-008). |
| 6 | Performance & resource management | Yes | Hot-path batched + back-pressured. Backlog scan holds the write lock (AuditLog-005); MarkForwardedAsync interpolates an IN (...) list inside the lock, fine in practice but scales linearly with batch size. |
| 7 | Design-document adherence | Yes | Combined telemetry transport plumbed but never called (AuditLog-001); other than that the implementation closely tracks the design doc. |
| 8 | Code organization & conventions | Yes | Composition root well-segmented; INodeIdentityProvider resolution mixes GetService and GetRequiredService for the same dependency across registrations (AuditLog-007); AddAuditLog* helpers register hosted services and option bindings without idempotency guards (AuditLog-011). |
| 9 | Testing coverage | Yes | Excellent surface coverage. Integration tests exist for the dual-write path in AuditLogIngestActorCombinedTelemetryTests and CachedCallCombinedTelemetryTests, but those drive the actor directly via the test harness — there is no integration test that asserts the production end-to-end emits a CachedTelemetryBatch from the site (because nothing does). |
| 10 | Documentation & comments | Yes | Several large XML-doc paragraphs are accurate, but the SupervisorStrategy comments (AuditLog-002), the Dispose ordering comment (AuditLog-009), and a few stale "Bundle X" references could mislead a new reader. |
Findings
AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | src/ScadaLink.AuditLog/Site/Telemetry/ISiteStreamAuditClient.cs:45, src/ScadaLink.AuditLog/Site/Telemetry/ClusterClientSiteAuditClient.cs:86, src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:198 |
Description
The design (Component-AuditLog.md §"Cached Operations — Combined Telemetry") specifies a
single CachedCallTelemetry packet per lifecycle event that carries BOTH the audit row
AND the operational SiteCalls upsert, with central writing both rows in one transaction.
The infrastructure exists: ISiteStreamAuditClient.IngestCachedTelemetryAsync is on the
interface; ClusterClientSiteAuditClient.IngestCachedTelemetryAsync builds the
IngestCachedTelemetryCommand; the proto carries CachedTelemetryBatch;
AuditLogIngestActor.OnCachedTelemetryAsync performs the dual InsertIfNotExists +
UpsertAsync inside a BeginTransactionAsync. But a grep for callers of
IngestCachedTelemetryAsync in src/ScadaLink.AuditLog shows only the interface
declaration and the two implementations — nothing produces a CachedTelemetryBatch for
the site to push. The SiteAuditTelemetryActor.OnDrainAsync only calls
IngestAuditEventsAsync (the audit-only path); cached-call audit rows written by
CachedCallTelemetryForwarder to local SQLite are drained as ordinary audit events,
and the SiteCalls operational half rides a separate UpsertSiteCallCommand channel
into SiteCallAuditActor. The "central writes AuditLog + SiteCalls in one transaction"
guarantee is therefore not delivered — the two writes are now uncorrelated across
actors and can fail independently, and the dual-write path in AuditLogIngestActor
is dead production code.
Recommendation
Either (a) wire the combined path: build a CachedTelemetryBatch from the audit rows
the forwarder writes (alongside the operational half held by IOperationTrackingStore),
add a parallel drain loop that calls IngestCachedTelemetryAsync, and gate the audit-only
drain so cached-call rows don't double-emit; or (b) update the design doc + the
AuditLogIngestActor / ClusterClientSiteAuditClient / interface XML comments to
acknowledge that the two halves now flow via separate transports, and delete the
unreachable OnCachedTelemetryAsync dual-write code (after confirming the
AuditLogIngestActorCombinedTelemetryTests integration tests exercise it via direct
actor injection only).
Resolution
Unresolved.
AuditLog-002 — SupervisorStrategy comments claim Resume semantics but code returns the default Restart decider
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:99-103, src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:109-115, src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:315-321 |
Description
Three central actors (AuditLogIngestActor, AuditLogPurgeActor, SiteAuditReconciliationActor)
all override SupervisorStrategy() and return
new OneForOneStrategy(maxNrOfRetries: 0, withinTimeRange: TimeSpan.Zero, decider: DefaultDecider).
The surrounding XML / inline comments variously claim "uses Resume so a thrown exception
inside ReceiveAsync does not restart the actor" (AuditLogIngestActor remarks),
"uses Resume so any leaked exception keeps the singleton alive for the next tick"
(AuditLogPurgeActor remarks), and "the actor's supervisor strategy keeps it alive
across any leaked exception with DefaultDecider's Restart semantics — restart resets
the in-memory cursors, but as noted above that's a safe (over-pull, idempotent) recovery"
(SiteAuditReconciliationActor remarks — at least correctly says Restart, but conflicts
with the other two). Two things are wrong: (1) the strategy returned by an actor's
SupervisorStrategy() override governs how that actor supervises its CHILDREN, not how
its own parent treats it — so it is not the mechanism that protects these singletons
from their own throws; (2) DefaultDecider Restarts for most exceptions, not Resumes.
The actors are in fact protected by the per-row / per-batch try/catch blocks inside
the receive handlers — the supervisor override is effectively unused, since these
actors have no children. The comments mislead a reader into trusting a guarantee
that the code does not deliver.
Recommendation
Pick one of two corrections: either delete the SupervisorStrategy override (these
actors have no children, so the override is dead) and rewrite the comments to credit
the try/catch blocks for the alive-on-throw guarantee; or — if the override is kept
as a forward-compat hedge — change the decider to Decider.From(_ => Directive.Stop)
or similar to match the comment, AND add a clear note that the per-row catch is what
keeps the actor running across handler throws, not the supervisor strategy.
Resolution (2026-05-28):
Comment-only fix on all three actors (AuditLogIngestActor, AuditLogPurgeActor,
SiteAuditReconciliationActor). XML doc remarks now correctly attribute alive-on-throw
to the per-row/per-batch/per-site try/catch blocks, describe the SupervisorStrategy
override as a children-only forward-compat placeholder, and state the actual
DefaultDecider Restart semantics (no more "Resume" claim). Behaviour unchanged.
AuditLog-003 — AuditLogIngestActor.OnIngestAsync uses CreateScope, but OnCachedTelemetryAsync uses CreateAsyncScope — and only one disposes asynchronously
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:133, src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:139, src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:178 |
Description
OnCachedTelemetryAsync opens _serviceProvider!.CreateAsyncScope() and lets
await using dispose it. OnIngestAsync, OnTickAsync in
SiteAuditReconciliationActor, and OnTickAsync in AuditLogPurgeActor all open
_services.CreateScope() (the synchronous variant) and dispose it with a synchronous
scope.Dispose() in a finally block — even though the per-message work is async and
the scoped IAuditLogRepository resolves an EF Core DbContext, which implements
IAsyncDisposable. The synchronous Dispose() on a DbContext blocks on any pending
async connection cleanup; under load this can hold the actor thread for the duration
of a connection close, which on SQL Server may include sending a SET TRANSACTION ISOLATION LEVEL reset round-trip. Switching to CreateAsyncScope() + await using
is the recommended pattern for scoped EF resources.
Recommendation
Change _services.CreateScope() to _services.CreateAsyncScope() in
OnIngestAsync, SiteAuditReconciliationActor.OnTickAsync, and
AuditLogPurgeActor.OnTickAsync, and replace the try/finally { scope?.Dispose(); }
pattern with await using var scope = _services.CreateAsyncScope();. The DI scope
will dispose asynchronously and the EF Core context will be released without
blocking the actor thread.
Resolution (2026-05-28):
All three handlers now use CreateAsyncScope() + await using var scope = ....
AuditLogIngestActor.OnIngestAsync factored the per-batch loop into a shared
IngestWithRepositoryAsync helper so the injected-repository test ctor and
the scoped production path both reach the same body without duplicating the
per-row try/catch. AuditLogPurgeActor.OnTickAsync and
SiteAuditReconciliationActor.OnTickAsync dropped the try/finally { scope.Dispose(); }
pattern in favour of the await using lexical scope. EF Core DbContexts now
dispose asynchronously across every audit ingest path.
AuditLog-004 — SiteAuditReconciliationActor advances cursor even on per-row insert failure, silently abandoning permanently-failing rows
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:233-265 |
Description
PullSiteAsync iterates the pulled events, calls InsertIfNotExistsAsync inside a
per-row try/catch, and unconditionally updates maxOccurred = evt.OccurredAtUtc after
the try/catch — regardless of whether the insert succeeded or threw. The comment at
line 247 acknowledges this: "the cursor still advances based on OccurredAtUtc — the
row was returned by the site, so the next tick won't re-fetch it; if it permanently
fails to persist, that's an operational concern surfaced by the log, not a hot-loop
trigger." For a transient fault that flips to success on the next pull the design
holds. But if a row throws on EVERY central attempt (truly permanent persistence fault —
e.g. column-too-long, FK violation that won't resolve) the cursor advance still moves
past it, and central will simply log on every reconciliation tick. No alert escalates
beyond a log line. Worse, the site keeps the row Pending (because MarkReconciledAsync
is only called for rows the puller flipped centrally) AND will trip the
SiteAuditTelemetryStalled signal because the backlog never drains, but the central
log message is the only place an operator could correlate the stall with the
persistent insert failure.
Recommendation
Either (a) only advance the cursor for rows whose InsertIfNotExistsAsync returned
cleanly — leave maxOccurred at the previous value for the failing row so the next
tick retries; or (b) increment a dedicated CentralAuditPermanentInsertFailure health
counter on the per-row catch so the failure is observable on the dashboard instead of
buried in the log. Option (a) needs a guard against the same row throwing forever
(saturate the puller) — a small per-event retry counter held in the actor's state with
a permanent-skip + LogCritical threshold is the standard escape valve.
Resolution (2026-05-28):
Took option (a) with the per-EventId retry-counter escape valve. PullSiteAsync
now tracks _failedInsertAttempts: Dictionary<Guid, int> and a per-tick
hasUnresolvedFailure flag:
- A successful insert clears the EventId from the counter and contributes to
maxOccurred. - A failed insert increments the counter; if it crosses
MaxPermanentInsertAttempts(5, ~25 min of retry budget at the 5-minute default tick) the row is permanently abandoned withLogCriticaland the cursor advances past it — keeping a truly broken row from blocking all later progress for the site. Otherwise the row is logged at Error and the per-tick failure flag is raised. - The cursor advance at end-of-tick is
hasUnresolvedFailure ? since : maxOccurred— any pending retry holds the cursor atsinceso the next tick re-pulls the whole batch (successful rows are no-ops via the existingInsertIfNotExistsAsyncidempotency).
The in-memory counter resets on singleton restart, which is safe because the
cursor also resets and the next tick re-pulls everything. Tests for both the
retry-hold and permanent-abandon paths should land alongside the existing
reconciliation tests in tests/ScadaLink.AuditLog.Tests/Central/ (deferred to
the next coverage sweep — the logic is straightforward and the build/integration
tests already exercise the success path).
AuditLog-005 — GetBacklogStatsAsync holds the SQLite hot-path write lock for the full COUNT+MIN scan
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:597-657 |
Description
SqliteAuditWriter.GetBacklogStatsAsync takes _writeLock (the same lock that
serialises every batch INSERT in FlushBatch) and holds it for the duration of a
SELECT COUNT(*), MIN(OccurredAtUtc) FROM AuditLog WHERE ForwardState = 'Pending'.
SiteAuditBacklogReporter calls this on a 30-second timer. On a healthy site with
few Pending rows the index-only scan is fast; under the scenario the metric exists
to detect — a prolonged central outage growing the backlog "indefinitely" per
Component-AuditLog.md — a COUNT(*) over hundreds of thousands of Pending rows
on the IX_SiteAuditLog_ForwardState_Occurred index is no longer cheap, and the
duration of that scan is added to the hot-path write latency for every concurrent
script. The hot path is supposed to be "durable in microseconds" per the design doc;
a multi-hundred-millisecond probe stall in the same period would not be visible
externally but would back-pressure the bounded write channel. ReadPendingAsync and
ReadPendingSinceAsync share the same lock for the same reason and have the same
exposure under backlog growth.
Recommendation
Either (a) move the SELECT outside the write lock by using a second, dedicated
read-only SQLite connection (Microsoft.Data.Sqlite supports concurrent connections
to the same file when journal_mode=WAL is enabled — which would also benefit the
hot path); or (b) cache the last snapshot inside the writer and recompute it
lazily on a dedicated background tick so the reporter reads a pre-computed snapshot
without acquiring the write lock. Option (a) also unblocks ReadPendingAsync /
ReadPendingSinceAsync from competing with the writer.
Resolution (2026-05-28):
Took option (a). SqliteAuditWriter now opens a second SqliteConnection
(_readConnection) on the same file in the ctor, after InitializeSchema
sets PRAGMA journal_mode = WAL on the writer connection — WAL lets a
second connection read concurrently with the active writer without taking
_writeLock. The read connection is guarded by its own _readLock (since
SqliteConnection itself is not thread-safe across callers) and used by
GetBacklogStatsAsync, ReadPendingAsync, ReadPendingSinceAsync, and
ReadForwardedAsync. DisposeAsync disposes it after the writer drains.
Regression test SqliteAuditWriterBacklogStatsTests.GetBacklogStatsAsync_DoesNotBlockOnConcurrentWriteLoad
saturates the writer with a 2 000-row burst and asserts the probe returns
in under 1 s — would fail against the pre-fix code (the probe queued
behind every batch INSERT under _writeLock).
AuditLog-006 — SqliteAuditWriter.Dispose() does sync-over-async and may deadlock
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:697-700 |
Description
public void Dispose()
{
DisposeAsync().AsTask().GetAwaiter().GetResult();
}
This is the classic sync-over-async anti-pattern. DisposeAsync awaits the
writer-loop task with .ConfigureAwait(false), so on a thread with no synchronization
context (the typical .NET 10 host shutdown path) it's fine; but if any caller invokes
Dispose() from a context that captures (an ASP.NET request, a SynchronizationContext
test runner, an Akka.NET dispatcher in some configurations) the GetResult() blocks
the captured thread while the continuation tries to resume on it — classic deadlock.
The writer is registered as a DI singleton, so this is unlikely to bite during the
host's IAsyncDisposable shutdown (DI prefers DisposeAsync when available), but
an integration test or future code path that constructs the writer manually inside
a sync context will hang.
Recommendation
Drop the IDisposable interface and rely on IAsyncDisposable only — the DI container
will call DisposeAsync on singletons that implement it. If a sync Dispose is
required for compatibility with consumers that don't honour IAsyncDisposable,
implement it as a best-effort that calls _writeQueue.Writer.TryComplete() + a
short wait, without blocking the thread for the full async drain.
Resolution (2026-05-28):
Dispose() now hops to the thread pool via Task.Run(...).GetAwaiter().GetResult()
before blocking on DisposeAsync. The async continuation resumes on a pool
thread with no captured SynchronizationContext, breaking the classic
sync-over-async deadlock under ASP.NET / Akka dispatchers. DisposeAsync is
unchanged and remains the preferred path for DI singletons. XML doc comment
documents the choice. Behaviour for context-free callers is unchanged.
AuditLog-007 — INodeIdentityProvider resolution mixes GetService and GetRequiredService inconsistently across AddAuditLog registrations
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs:148-218 |
Description
AddAuditLog registers three components that depend on INodeIdentityProvider:
CachedCallTelemetryForwarder— resolves withsp.GetService<INodeIdentityProvider>()(optional, falls back to a nullSourceNode).CachedCallLifecycleBridge— resolves withsp.GetService<INodeIdentityProvider>()(optional, same fallback).CentralAuditWriter— resolves withsp.GetRequiredService<INodeIdentityProvider>()(required, throws at first resolution if unregistered).
The XML comments at lines 153 / 175 / 215 explain the reasoning — the first two are
optional because tests may skip the registration; the third is required because "the
production composition root in SiteServiceRegistration registers the provider as a
singleton on both site and central paths". But this is a fragile guarantee — AddAuditLog
itself does NOT register the provider, so a future composition root that calls
AddAuditLog without first calling whatever registers INodeIdentityProvider will fail
on the FIRST resolution of ICentralAuditWriter (which is a lazy factory) rather than
at AddAuditLog time. The result: site nodes that "happen to work" because they hold
a registered provider, central composition test fixtures that fail at runtime instead
of DI-build time, and a GetService/GetRequiredService split that gives no clear
contract to the reader.
Recommendation
Either (a) make all three optional: CentralAuditWriter already handles a null provider
gracefully (line 113-116 — null-coalescing the caller's value); the asymmetry buys
nothing. Or (b) make all three required and either add services.AddSingleton<INodeIdentityProvider, ...>()
inside AddAuditLog (with a sensible default — null node name returns <unknown>) or
add an explicit guard at the top of AddAuditLog that throws if no provider has been
registered yet (services.Any(d => d.ServiceType == typeof(INodeIdentityProvider))).
Resolution (2026-05-28):
Took option (b) — standardized all three consumers on GetRequiredService<INodeIdentityProvider>().
The Host (SiteServiceRegistration.BindSharedOptions) registers the provider on
both site and central paths per the InboundAPI-022 / Host registration sweep,
and the AddAuditLogTests fixture binds a FakeNodeIdentityProvider. A silent
GetService() returning null was masking a future composition root that forgot
the registration; the strict resolution surfaces that bug at first
ICachedCallTelemetryForwarder / CachedCallLifecycleBridge / ICentralAuditWriter
resolution instead.
AuditLog-008 — Test composition roots that omit IAuditPayloadFilter silently pass UNREDACTED payloads through the writer chain
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | src/ScadaLink.AuditLog/Site/FallbackAuditWriter.cs:51-77, src/ScadaLink.AuditLog/Central/CentralAuditWriter.cs:77-104, src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:125,155 |
Description
FallbackAuditWriter, CentralAuditWriter, and AuditLogIngestActor all accept an
IAuditPayloadFilter as an optional dependency, defaulting to null = pass-through.
The justification in every XML comment is the same: "the M4 test composition roots
that don't pass one keep working (they only ever write small payloads)". This is fine
for size — but the filter also performs HEADER REDACTION (Authorization, Cookie,
Set-Cookie, X-API-Key), GLOBAL BODY REDACTORS, and SQL PARAMETER REDACTION. A test
fixture (or any future composition root that bypasses AddAuditLog) that injects a
real RequestSummary will see secrets written to SQLite / MS SQL with no redaction.
The combination "audit-write must never abort the user-facing action" + "unredacted
secrets must never persist" (Component-AuditLog.md §Payload Capture Policy) makes the
no-filter fallback genuinely dangerous — over-redacting on a missing filter is the
contract the production setup honours, but the code itself defaults to under-redact.
Recommendation
Change the three null-coalesce sites to default to a non-null sentinel filter that
performs the header redaction (HeaderRedactList) using the hard-coded defaults
from AuditLogOptions, even when no IAuditPayloadFilter is registered. The
truncation stage can remain optional; the header redaction must not. Alternatively,
make IAuditPayloadFilter non-optional and have AddAuditLog register the real
filter unconditionally — tests that don't bind the options section will resolve the
default AuditLogOptions and get the production-default redact list automatically.
Resolution
Unresolved.
AuditLog-009 — SqliteAuditWriter.DisposeAsync comment claims _disposed is set early, but it isn't
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:706-740 |
Description
The first lock (_writeLock) block in DisposeAsync is commented:
Stop accepting new events. Setting _disposed first ensures any FlushBatch entered after we mark disposed will fault its pending events rather than touching the about-to-close connection.
But the block does NOT set _disposed = true — it only calls
_writeQueue.Writer.TryComplete() and captures _writerLoop. The _disposed flag is
flipped in the SECOND lock block (line 738), AFTER the 5-second wait on the writer
loop. During the wait window, a concurrent WriteAsync that observed the channel
NOT-yet-completed (race: it ran before TryComplete) and got past TryWrite would
land on the writer loop's FlushBatch, which then takes the lock and checks
_disposed — and finds it still false. The check at the top of FlushBatch
(line 265) if (_disposed) { fault pending; return; } therefore does NOT fire during
the dispose window. In practice the channel being completed drains the loop cleanly
and the disposable race is benign, but the comment claims a guarantee that the code
does not implement.
Recommendation
Either set _disposed = true in the first lock block to match the comment (and remove
the duplicate _disposed check in the second block); or rewrite the comment to
describe the actual ordering: the channel is completed first, the loop drains
remaining items under the lock, and _disposed = true is set only after the loop
exits. The current code is correct; the comment is wrong.
Resolution (2026-05-28):
Comment-only fix on SqliteAuditWriter.DisposeAsync. Rewrote the misleading comment
to describe the actual ordering: completing the channel writer is the shutdown signal,
the writer loop drains buffered items, and _disposed is intentionally set only after
the loop has drained (in the second lock block). Behaviour unchanged.
AuditLog-010 — Actor drain paths accept a CancellationToken parameter but always pass CancellationToken.None downstream
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ScadaLink.AuditLog/Site/Telemetry/SiteAuditTelemetryActor.cs:92,107,124, src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:228 |
Description
The drain loops on SiteAuditTelemetryActor.OnDrainAsync and the per-site pull on
SiteAuditReconciliationActor.PullSiteAsync both pass CancellationToken.None to
every async dependency call (queue reads, gRPC client, repository writes). The actor
has no CancellationToken field, so there's no in-flight cancellation source —
graceful shutdown relies entirely on PostStop being called and the actor's
Receive continuation completing naturally. For a healthy gRPC client this is fine,
but a stuck IngestAuditEventsAsync call (slow central, partition switch in progress)
holds the actor's continuation indefinitely; the host's coordinated-shutdown will then
time out the actor system and leave the actor in an undefined state. The brief
references "cancellation on stop" in the partition-maintenance comments but
SiteAuditTelemetryActor does not implement it.
Recommendation
Introduce a per-actor CancellationTokenSource populated in PreStart and cancelled
in PostStop; pass _lifecycleCts.Token instead of CancellationToken.None in
every async dependency call. Same change for SiteAuditReconciliationActor. The
existing OperationCanceledException is already swallowed by the top-level catch
in OnDrainAsync (line 128), so plumbing the token through is a localised change.
Resolution (2026-05-28):
Scope reduced to SiteAuditTelemetryActor per finding-closure brief — added a
private _lifecycleCts field, cancelled+disposed in PostStop, and threaded
its token through _queue.ReadPendingAsync, _client.IngestAuditEventsAsync,
and _queue.MarkForwardedAsync (replacing the three CancellationToken.None
sites). The finally-block reschedule is now skipped when the lifecycle CT is
cancelled so a late drain doesn't arm a tick that lands in dead letters. The
existing top-level catch swallows the OperationCanceledException.
SiteAuditReconciliationActor is left for a separate ticket.
AuditLog-011 — AddAuditLogHealthMetricsBridge and AddAuditLogCentralMaintenance are non-idempotent and register hosted services on every call
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs:53-55, 263-276, 301-346 |
Description
The XML doc on AddAuditLog is explicit: "Idempotent re-registration is not supported;
call this exactly once per IServiceCollection." But AddAuditLogHealthMetricsBridge
calls services.AddHostedService<SiteAuditBacklogReporter>() (line 275), which is
NOT idempotent — every call registers another descriptor, and the host will spin up
N reporters and have them all poll SQLite every 30 s, all push the same snapshot into
ISiteHealthCollector. The site composition path is supposed to call this exactly
once, but tests or composition refactors that accidentally call twice will pay 2x the
SQL probe rate and overwrite the snapshot with conflicting numbers (no race, just
wasted work). Worse, AddAuditLogCentralMaintenance (line 301) is also non-idempotent —
AddOptions<AuditLogPartitionMaintenanceOptions> and AddHostedService<AuditLogPartitionMaintenanceService>
will pile up.
Recommendation
Either (a) guard each Add* helper with a "has the marker been seen?" sentinel
(register a private marker descriptor on first call, no-op on subsequent calls);
or (b) explicitly document idempotency on the public surface of every helper and
verify with a unit test in AddAuditLogTests. Option (a) matches the pattern other
SDK extensions use and removes a foot-gun.
Resolution (2026-05-28):
Took option (a) for AddAuditLogHealthMetricsBridge — guarded by a sentinel
check on the SiteAuditBacklogReporter hosted-service descriptor (the helper's
exclusive contribution to the collection). A second call short-circuits before
any Replace / AddHostedService runs, so the hosted service registers
exactly once. New AddAuditLogHealthMetricsBridge_IsIdempotent_DoesNotDoubleRegister_HostedService
test in AddAuditLogTests calls the helper twice and asserts a single
IHostedService descriptor for SiteAuditBacklogReporter. The
AddAuditLogCentralMaintenance helper is left for a follow-up — it is only
ever called from the central composition root and the unit/integration
fixtures use disposable IServiceCollections, so the foot-gun is narrower.