f93b7b99bb
Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.
regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
# Code Review — AuditLog

| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.AuditLog` |
| Design doc | `docs/requirements/Component-AuditLog.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 11 |

## Summary

AuditLog is one of the largest and most carefully engineered modules in the codebase.
The site-side hot path (`SqliteAuditWriter` + `FallbackAuditWriter` + `RingBufferFallback`)
implements a textbook bounded-channel + dedicated-writer pattern with batched transactions,
UTF-8-safe truncation, additive schema migration, and a drop-oldest fallback that
genuinely honours the "audit-write must NEVER abort the user-facing action" contract.
The central side mirrors that with per-row try/catch on batch ingest, a transactional
dual-write for the cached-telemetry path, per-site cursor isolation in reconciliation,
and a partition-switch purge that is metadata-only. The payload filter is well-factored
with a compile-time regex cache, per-stage failure isolation, and per-target overrides.
Test coverage is broad: ~12 000 lines spanning unit, integration, and end-to-end paths.

Themes across findings: (1) the largest issue is a **specced-but-unwired transport path**:
`ISiteStreamAuditClient.IngestCachedTelemetryAsync` and `AuditLogIngestActor.OnCachedTelemetryAsync`
both exist and the protobuf RPC is plumbed, but no production code ever calls the cached-telemetry
client; the cached-call lifecycle audit rows ride the audit-only `IngestAuditEventsAsync` drain
and the central dual-write transaction is dead code (AuditLog-001). (2) Several
**Akka.NET supervisor-strategy comments are inaccurate**: multiple actors document
"`SupervisorStrategy` uses Resume" but the code returns `DefaultDecider` (which Restarts), and
the strategy applies to children, not to the actor itself (AuditLog-002). (3) The
**`SqliteAuditWriter` hot-path lock is contended by the 30 s backlog probe**: `GetBacklogStatsAsync`
takes the same `_writeLock` that serialises every batch INSERT, so a large-backlog scan can
park the hot-path writer (AuditLog-005). (4) **Sync-over-async in `Dispose`** can deadlock under
a captured synchronization context (AuditLog-006). (5) A handful of **misleading code comments and minor
configuration drift** (AuditLog-007, AuditLog-008, AuditLog-009). (6) `CancellationToken`
parameters on the actor drain paths are accepted but immediately replaced with
`CancellationToken.None` (AuditLog-010). (7) The `AddAuditLog*` helpers are unguarded: the
site-only `AddAuditLogHealthMetricsBridge` registers the `SiteAuditBacklogReporter` hosted
service on every call, and nothing in the registration chain rejects a repeated or misplaced
call, such as a central composition root that mistakenly calls the site bridge
(AuditLog-011). No Critical-severity issues; three Medium, eight Low.

## Checklist coverage

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | `AuditLogIngestActor.OnCachedTelemetryAsync` is unreachable production code (AuditLog-001); reconciliation cursor advances on persistent insert failure (AuditLog-004); `DisposeAsync` comment about `_disposed` ordering is misleading (AuditLog-009). |
| 2 | Akka.NET conventions | Yes | `SupervisorStrategy` returned by actors does not do what the surrounding doc says (AuditLog-002); per-actor strategy applies to children only, but comments imply self-protection. |
| 3 | Concurrency & thread safety | Yes | `GetBacklogStatsAsync` contends with hot-path writes on `_writeLock` (AuditLog-005); sync DI scopes block on async EF disposal (AuditLog-003); `_disposed` is set after the wait, contradicting the comment (AuditLog-009); no cooperative cancellation through the drain paths (AuditLog-010). |
| 4 | Error handling & resilience | Yes | Best-effort contract is honoured throughout; `Dispose()` sync-over-async is the one remaining hazard (AuditLog-006); reconciliation silently discards permanently-failing rows (AuditLog-004). |
| 5 | Security | Yes | Append-only enforcement, redaction stack, and "never under-redact" safety net all present. Test composition roots that omit the filter SILENTLY pass payloads through unredacted (AuditLog-008). |
| 6 | Performance & resource management | Yes | Hot path is batched and back-pressured. Backlog scan holds the write lock (AuditLog-005); `MarkForwardedAsync` interpolates an `IN (...)` list inside the lock, fine in practice but scales linearly with batch size. |
| 7 | Design-document adherence | Yes | Combined telemetry transport plumbed but never called (AuditLog-001); other than that the implementation closely tracks the design doc. |
| 8 | Code organization & conventions | Yes | Composition root well-segmented; `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` for the same dependency across registrations (AuditLog-007); `AddAuditLog*` helpers register hosted services and option bindings without idempotency guards (AuditLog-011). |
| 9 | Testing coverage | Yes | Excellent surface coverage. Integration tests exist for the dual-write path in `AuditLogIngestActorCombinedTelemetryTests` and `CachedCallCombinedTelemetryTests`, but those drive the actor directly via the test harness; there is no integration test that asserts the production end-to-end emits a `CachedTelemetryBatch` from the site (because nothing does). |
| 10 | Documentation & comments | Yes | Several large XML-doc paragraphs are accurate, but the `SupervisorStrategy` comments (AuditLog-002), the `DisposeAsync` ordering comment (AuditLog-009), and a few stale "Bundle X" references could mislead a new reader. |

## Findings
### AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production

| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/Telemetry/ISiteStreamAuditClient.cs:45`, `src/ScadaLink.AuditLog/Site/Telemetry/ClusterClientSiteAuditClient.cs:86`, `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:198` |

**Description**

The design (Component-AuditLog.md §"Cached Operations — Combined Telemetry") specifies a
single `CachedCallTelemetry` packet per lifecycle event that carries BOTH the audit row
AND the operational `SiteCalls` upsert, with central writing both rows in one transaction.
The infrastructure exists: `ISiteStreamAuditClient.IngestCachedTelemetryAsync` is on the
interface; `ClusterClientSiteAuditClient.IngestCachedTelemetryAsync` builds the
`IngestCachedTelemetryCommand`; the proto carries `CachedTelemetryBatch`;
`AuditLogIngestActor.OnCachedTelemetryAsync` performs the dual `InsertIfNotExists` +
`UpsertAsync` inside a `BeginTransactionAsync`. But a `grep` for callers of
`IngestCachedTelemetryAsync` in `src/ScadaLink.AuditLog` shows only the interface
declaration and the two implementations; nothing produces a `CachedTelemetryBatch` for
the site to push. `SiteAuditTelemetryActor.OnDrainAsync` calls only
`IngestAuditEventsAsync` (the audit-only path); cached-call audit rows written by
`CachedCallTelemetryForwarder` to local SQLite are drained as ordinary audit events,
and the `SiteCalls` operational half rides a separate `UpsertSiteCallCommand` channel
into `SiteCallAuditActor`. The "central writes AuditLog + SiteCalls in one transaction"
guarantee is therefore not delivered: the two writes are now uncorrelated across
actors and can fail independently, and the dual-write path in `AuditLogIngestActor`
is dead production code.

**Recommendation**

Either (a) wire the combined path: build a `CachedTelemetryBatch` from the audit rows
the forwarder writes (alongside the operational half held by `IOperationTrackingStore`),
add a parallel drain loop that calls `IngestCachedTelemetryAsync`, and gate the audit-only
drain so cached-call rows don't double-emit; or (b) update the design doc + the
`AuditLogIngestActor` / `ClusterClientSiteAuditClient` / interface XML comments to
acknowledge that the two halves now flow via separate transports, and delete the
unreachable `OnCachedTelemetryAsync` dual-write code (after confirming the
`AuditLogIngestActorCombinedTelemetryTests` integration tests exercise it via direct
actor injection only).

**Resolution**

_Unresolved._

### AuditLog-002 — `SupervisorStrategy` comments claim Resume semantics but code returns the default Restart decider

| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:99-103`, `src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:109-115`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:315-321` |

**Description**

Three central actors (`AuditLogIngestActor`, `AuditLogPurgeActor`, `SiteAuditReconciliationActor`)
all override `SupervisorStrategy()` and return
`new OneForOneStrategy(maxNrOfRetries: 0, withinTimeRange: TimeSpan.Zero, decider: DefaultDecider)`.
The surrounding XML / inline comments variously claim "uses `Resume` so a thrown exception
inside `ReceiveAsync` does not restart the actor" (AuditLogIngestActor remarks),
"uses Resume so any leaked exception keeps the singleton alive for the next tick"
(AuditLogPurgeActor remarks), and "the actor's supervisor strategy keeps it alive
across any leaked exception with `DefaultDecider`'s Restart semantics — restart resets
the in-memory cursors, but as noted above that's a safe (over-pull, idempotent) recovery"
(SiteAuditReconciliationActor remarks, which at least correctly say Restart but conflict
with the other two). Two things are wrong: (1) the strategy returned by an actor's
`SupervisorStrategy()` override governs how that actor supervises its CHILDREN, not how
its own parent treats it, so it is not the mechanism that protects these singletons
from their own throws; (2) `DefaultDecider` Restarts for most exceptions, not Resumes.
The actors are in fact protected by the per-row / per-batch try/catch blocks inside
the receive handlers; the supervisor override is effectively unused, since these
actors have no children. The comments mislead a reader into trusting a guarantee
that the code does not deliver.

**Recommendation**

Pick one of two corrections: either delete the `SupervisorStrategy` override (these
actors have no children, so the override is dead) and rewrite the comments to credit
the try/catch blocks for the alive-on-throw guarantee; or, if the override is kept
as a forward-compat hedge, change the decider to `Decider.From(_ => Directive.Resume)`
or similar to match the comment, AND add a clear note that the per-row catch is what
keeps the actor running across handler throws, not the supervisor strategy.

**Resolution**

_Unresolved._

### AuditLog-003 — `AuditLogIngestActor.OnIngestAsync` uses `CreateScope`, but `OnCachedTelemetryAsync` uses `CreateAsyncScope` — and only one disposes asynchronously

| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:133`, `src/ScadaLink.AuditLog/Central/AuditLogPurgeActor.cs:139`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:178` |

**Description**

`OnCachedTelemetryAsync` opens `_serviceProvider!.CreateAsyncScope()` and lets
`await using` dispose it. `OnIngestAsync`, `OnTickAsync` in
`SiteAuditReconciliationActor`, and `OnTickAsync` in `AuditLogPurgeActor` all open
`_services.CreateScope()` (the synchronous variant) and dispose it with a synchronous
`scope.Dispose()` in a `finally` block, even though the per-message work is async and
the scoped `IAuditLogRepository` resolves an EF Core `DbContext`, which implements
`IAsyncDisposable`. The synchronous `Dispose()` on a `DbContext` blocks on any pending
async connection cleanup; under load this can hold the actor thread for the duration
of a connection close, which on SQL Server may include sending a `SET TRANSACTION
ISOLATION LEVEL` reset round-trip. Switching to `CreateAsyncScope()` + `await using`
is the recommended pattern for scoped EF resources.

**Recommendation**

Change `_services.CreateScope()` to `_services.CreateAsyncScope()` in
`OnIngestAsync`, `SiteAuditReconciliationActor.OnTickAsync`, and
`AuditLogPurgeActor.OnTickAsync`, and replace the `try/finally { scope?.Dispose(); }`
pattern with `await using var scope = _services.CreateAsyncScope();`. The DI scope
will dispose asynchronously and the EF Core context will be released without
blocking the actor thread.

**Resolution**

_Unresolved._

### AuditLog-004 — `SiteAuditReconciliationActor` advances cursor even on per-row insert failure, silently abandoning permanently-failing rows

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:233-265` |

**Description**

`PullSiteAsync` iterates the pulled events, calls `InsertIfNotExistsAsync` inside a
per-row try/catch, and unconditionally updates `maxOccurred = evt.OccurredAtUtc` after
the try/catch, regardless of whether the insert succeeded or threw. The comment at
line 247 acknowledges this: "the cursor still advances based on OccurredAtUtc — the
row was returned by the site, so the next tick won't re-fetch it; if it permanently
fails to persist, that's an operational concern surfaced by the log, not a hot-loop
trigger." For a transient fault that flips to success on the next pull the design
holds. But if a row throws on EVERY central attempt (a truly permanent persistence fault,
e.g. column-too-long, or an FK violation that won't resolve) the cursor advance still moves
past it, and central will simply log on every reconciliation tick. No alert escalates
beyond a log line. Worse, the site keeps the row `Pending` (because `MarkReconciledAsync`
is only called for rows the puller flipped centrally) AND will trip the
`SiteAuditTelemetryStalled` signal because the backlog never drains, but the central
log message is the only place an operator could correlate the stall with the
persistent insert failure.

**Recommendation**

Either (a) only advance the cursor for rows whose `InsertIfNotExistsAsync` returned
cleanly, leaving `maxOccurred` at the previous value for the failing row so the next
tick retries; or (b) increment a dedicated `CentralAuditPermanentInsertFailure` health
counter on the per-row catch so the failure is observable on the dashboard instead of
buried in the log. Option (a) needs a guard against the same row throwing forever
and saturating the puller: a small per-event retry counter held in the actor's state,
with a permanent-skip + `LogCritical` threshold, is the standard escape valve.

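Option (a) together with its escape valve could look roughly like the sketch below. `PulledEvent`, `ReconciliationCursor`, and the `tryInsert` delegate are hypothetical names used only for illustration, not types from the module; the sketch also assumes events arrive ordered by `OccurredAtUtc` and ignores timestamp ties.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event shape: only the two fields the cursor logic needs.
public sealed record PulledEvent(Guid EventId, DateTime OccurredAtUtc);

public sealed class ReconciliationCursor
{
    private readonly int _maxAttempts;
    private readonly Dictionary<Guid, int> _failedAttempts = new();

    public ReconciliationCursor(DateTime start, int maxAttempts)
    {
        Value = start;
        _maxAttempts = maxAttempts;
    }

    public DateTime Value { get; private set; }

    public List<Guid> PermanentlySkipped { get; } = new();

    // tryInsert returns false when the central insert failed (the caller is
    // assumed to catch the exception and translate it into false).
    public void Apply(IEnumerable<PulledEvent> orderedEvents, Func<PulledEvent, bool> tryInsert)
    {
        foreach (var evt in orderedEvents)
        {
            if (tryInsert(evt))
            {
                _failedAttempts.Remove(evt.EventId);
                Value = evt.OccurredAtUtc;
                continue;
            }

            _failedAttempts.TryGetValue(evt.EventId, out var prior);
            var attempts = prior + 1;
            if (attempts < _maxAttempts)
            {
                // Stop here: the cursor stays before the failing row, so the
                // next tick pulls it again.
                _failedAttempts[evt.EventId] = attempts;
                return;
            }

            // Threshold reached: skip permanently. A real actor would log at
            // critical level and bump a health counter at this point.
            _failedAttempts.Remove(evt.EventId);
            PermanentlySkipped.Add(evt.EventId);
            Value = evt.OccurredAtUtc;
        }
    }
}
```

Stopping at the first failure, rather than skipping it and continuing, matters here: if later rows in the same batch succeeded, a plain "advance on success" rule would still carry the cursor past the failed row.
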
**Resolution**

_Unresolved._

### AuditLog-005 — `GetBacklogStatsAsync` holds the SQLite hot-path write lock for the full COUNT+MIN scan

| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:597-657` |

**Description**

`SqliteAuditWriter.GetBacklogStatsAsync` takes `_writeLock` (the same lock that
serialises every batch INSERT in `FlushBatch`) and holds it for the duration of a
`SELECT COUNT(*), MIN(OccurredAtUtc) FROM AuditLog WHERE ForwardState = 'Pending'`.
`SiteAuditBacklogReporter` calls this on a 30-second timer. On a healthy site with
few `Pending` rows the index-only scan is fast. Under the scenario the metric exists
to detect (a prolonged central outage growing the backlog "indefinitely" per
Component-AuditLog.md), a `COUNT(*)` over hundreds of thousands of `Pending` rows
on the `IX_SiteAuditLog_ForwardState_Occurred` index is no longer cheap, and the
duration of that scan is added to the hot-path write latency for every concurrent
script. The hot path is supposed to be "durable in microseconds" per the design doc;
a multi-hundred-millisecond probe stall in the same period would not be visible
externally but would back-pressure the bounded write channel. `ReadPendingAsync` and
`ReadPendingSinceAsync` share the same lock for the same reason and have the same
exposure under backlog growth.

**Recommendation**

Either (a) move the SELECT outside the write lock by using a second, dedicated
read-only SQLite connection (SQLite allows multiple connections to the same file,
and with `journal_mode=WAL` a reader on a separate connection does not block the
writer, which would also benefit the hot path); or (b) cache the last snapshot inside
the writer and recompute it lazily on a dedicated background tick so the reporter
reads a pre-computed snapshot without acquiring the write lock. Option (a) also
unblocks `ReadPendingAsync` / `ReadPendingSinceAsync` from competing with the writer.

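The caching half of option (b) can be sketched as below. `BacklogSnapshot`, `BacklogSnapshotCache`, and the `query` delegate are hypothetical names, and the sketch deliberately leaves open where the query runs; the assumption is that a background tick executes it, ideally on a separate read connection, so the reporter never touches `_writeLock`.

```csharp
using System;
using System.Threading;

// Hypothetical snapshot shape mirroring the COUNT + MIN the probe computes.
public sealed record BacklogSnapshot(long PendingCount, DateTime? OldestPendingUtc, DateTime ComputedAtUtc);

public sealed class BacklogSnapshotCache
{
    private BacklogSnapshot _current = new(0, null, DateTime.MinValue);

    // Lock-free read for the reporter: it only ever sees a fully built snapshot.
    public BacklogSnapshot Current => Volatile.Read(ref _current);

    // Called from a background tick. The query delegate stands in for the
    // COUNT/MIN SELECT; it runs before the swap, so readers are never blocked by it.
    public void Refresh(Func<(long Count, DateTime? Oldest)> query, DateTime nowUtc)
    {
        var (count, oldest) = query();
        Volatile.Write(ref _current, new BacklogSnapshot(count, oldest, nowUtc));
    }
}
```

Publishing an immutable record through a single reference swap keeps the count and the oldest timestamp consistent with each other, at the cost of the reporter seeing data up to one tick old. `ComputedAtUtc` lets the reporter surface that staleness.
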
**Resolution**

_Unresolved._

### AuditLog-006 — `SqliteAuditWriter.Dispose()` does sync-over-async and may deadlock

| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:697-700` |

**Description**

```csharp
public void Dispose()
{
    DisposeAsync().AsTask().GetAwaiter().GetResult();
}
```

This is the classic sync-over-async anti-pattern. `DisposeAsync` `await`s the
writer-loop task with `.ConfigureAwait(false)`, so on a thread with no synchronization
context (the typical .NET 10 host shutdown path) it's fine. But if a caller invokes
`Dispose()` from a context that captures (a classic ASP.NET request, a test runner that
installs a `SynchronizationContext`, an Akka.NET dispatcher in some configurations) and
any await on the dispose path resumes on that context, `GetResult()` blocks the
captured thread while the continuation tries to resume on it: a classic deadlock.
The writer is registered as a DI singleton, so this is unlikely to bite during the
host's `IAsyncDisposable` shutdown (DI prefers `DisposeAsync` when available), but
an integration test or future code path that constructs the writer manually inside
a sync context will hang.

**Recommendation**

Drop the `IDisposable` interface and rely on `IAsyncDisposable` only; the DI container
will call `DisposeAsync` on singletons that implement it, provided the container itself
is disposed asynchronously (a synchronous `ServiceProvider.Dispose()` throws for
services that implement only `IAsyncDisposable`). If a sync `Dispose` is
required for compatibility with consumers that don't honour `IAsyncDisposable`,
implement it as a best-effort that calls `_writeQueue.Writer.TryComplete()` + a
short wait, without blocking the thread for the full async drain.

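A best-effort synchronous `Dispose` along those lines might look like the sketch below. `QueueWriterSketch` is a hypothetical stand-in for the writer, the 2-second bound is an arbitrary illustrative value, and the sketch assumes the writer loop was started with `Task.Run`, so it never needs the caller's context to finish.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class QueueWriterSketch : IAsyncDisposable, IDisposable
{
    private readonly Channel<string> _writeQueue = Channel.CreateBounded<string>(1024);
    private readonly Task _writerLoop;
    private int _written;

    public QueueWriterSketch() => _writerLoop = Task.Run(RunAsync);

    public int Written => Volatile.Read(ref _written);

    public bool TryEnqueue(string item) => _writeQueue.Writer.TryWrite(item);

    private async Task RunAsync()
    {
        await foreach (var item in _writeQueue.Reader.ReadAllAsync().ConfigureAwait(false))
        {
            _ = item;
            Interlocked.Increment(ref _written); // stand-in for the batched INSERT
        }
    }

    // Preferred path: completes the queue and awaits the full drain.
    public async ValueTask DisposeAsync()
    {
        _writeQueue.Writer.TryComplete();
        await _writerLoop.ConfigureAwait(false);
    }

    // Best-effort sync path: completes the queue, then waits for a bounded time
    // on the loop task itself instead of blocking on DisposeAsync.
    public void Dispose()
    {
        _writeQueue.Writer.TryComplete();
        try
        {
            _writerLoop.Wait(TimeSpan.FromSeconds(2));
        }
        catch (AggregateException)
        {
            // A faulted loop must not make Dispose throw.
        }
    }
}
```

The trade-off is explicit: if the drain takes longer than the bound, `Dispose` returns with items still queued, which is why the asynchronous path stays the preferred one.
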
**Resolution**

_Unresolved._

### AuditLog-007 — `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations

| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs:148-218` |

**Description**

`AddAuditLog` registers three components that depend on `INodeIdentityProvider`:
- `CachedCallTelemetryForwarder` — resolves with `sp.GetService<INodeIdentityProvider>()`
  (optional, falls back to a null `SourceNode`).
- `CachedCallLifecycleBridge` — resolves with `sp.GetService<INodeIdentityProvider>()`
  (optional, same fallback).
- `CentralAuditWriter` — resolves with `sp.GetRequiredService<INodeIdentityProvider>()`
  (required, throws at first resolution if unregistered).

The XML comments at lines 153 / 175 / 215 explain the reasoning: the first two are
optional because tests may skip the registration; the third is required because "the
production composition root in `SiteServiceRegistration` registers the provider as a
singleton on both site and central paths". But this is a fragile guarantee: `AddAuditLog`
itself does NOT register the provider, so a future composition root that calls
`AddAuditLog` without first calling whatever registers `INodeIdentityProvider` will fail
on the FIRST resolution of `ICentralAuditWriter` (which is a lazy factory) rather than
at `AddAuditLog` time. The result: site nodes that "happen to work" because they hold
a registered provider, central composition test fixtures that fail at runtime instead
of DI-build time, and a `GetService`/`GetRequiredService` split that gives no clear
contract to the reader.

**Recommendation**

Either (a) make all three optional: `CentralAuditWriter` already handles a null provider
gracefully (lines 113-116, null-coalescing the caller's value), so the asymmetry buys
nothing. Or (b) make all three required and either add `services.AddSingleton<INodeIdentityProvider, ...>()`
inside `AddAuditLog` (with a sensible default, where a null node name returns `<unknown>`) or
add an explicit guard at the top of `AddAuditLog` that throws if no provider has been
registered yet (`services.Any(d => d.ServiceType == typeof(INodeIdentityProvider))`).

**Resolution**

_Unresolved._

### AuditLog-008 — Test composition roots that omit `IAuditPayloadFilter` silently pass UNREDACTED payloads through the writer chain

| | |
|--|--|
| Severity | Low |
| Category | Security |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/FallbackAuditWriter.cs:51-77`, `src/ScadaLink.AuditLog/Central/CentralAuditWriter.cs:77-104`, `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:125,155` |

**Description**

`FallbackAuditWriter`, `CentralAuditWriter`, and `AuditLogIngestActor` all accept an
`IAuditPayloadFilter` as an optional dependency, defaulting to `null = pass-through`.
The justification in every XML comment is the same: "the M4 test composition roots
that don't pass one keep working (they only ever write small payloads)". This is fine
for size, but the filter also performs HEADER REDACTION (`Authorization`, `Cookie`,
`Set-Cookie`, `X-API-Key`), GLOBAL BODY REDACTORS, and SQL PARAMETER REDACTION. A test
fixture (or any future composition root that bypasses `AddAuditLog`) that injects a
real `RequestSummary` will see secrets written to SQLite / MS SQL with no redaction.
The combination "audit-write must never abort the user-facing action" + "unredacted
secrets must never persist" (Component-AuditLog.md §Payload Capture Policy) makes the
no-filter fallback genuinely dangerous: over-redacting on a missing filter is the
contract the production setup honours, but the code itself defaults to under-redact.

**Recommendation**

Change the three null-coalesce sites to default to a non-null sentinel filter that
performs the header redaction (`HeaderRedactList`) using the hard-coded defaults
from `AuditLogOptions`, even when no `IAuditPayloadFilter` is registered. The
truncation stage can remain optional; the header redaction must not. Alternatively,
make `IAuditPayloadFilter` non-optional and have `AddAuditLog` register the real
filter unconditionally; tests that don't bind the options section will resolve the
default `AuditLogOptions` and get the production-default redact list automatically.

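The header-only sentinel could be as small as the sketch below. `DefaultHeaderRedactor`, its `Redact` method, and the `[REDACTED]` placeholder are hypothetical; the header names are the four listed in the description, and a real implementation would read them from `AuditLogOptions` rather than hard-code them here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DefaultHeaderRedactor
{
    // Header names taken from the finding; matched case-insensitively because
    // HTTP header names are case-insensitive.
    private static readonly HashSet<string> RedactList = new(StringComparer.OrdinalIgnoreCase)
    {
        "Authorization", "Cookie", "Set-Cookie", "X-API-Key",
    };

    public const string Placeholder = "[REDACTED]";

    // Returns a copy with sensitive header values replaced; other headers pass
    // through unchanged and the input is not mutated.
    public static Dictionary<string, string> Redact(IReadOnlyDictionary<string, string> headers) =>
        headers.ToDictionary(
            kv => kv.Key,
            kv => RedactList.Contains(kv.Key) ? Placeholder : kv.Value);
}
```

Falling back to something like this instead of `null` means a composition root that forgets the filter still loses payload truncation, but no longer persists credentials.
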
**Resolution**

_Unresolved._

### AuditLog-009 — `SqliteAuditWriter.DisposeAsync` comment claims `_disposed` is set early, but it isn't

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:706-740` |

**Description**

The first `lock (_writeLock)` block in `DisposeAsync` is commented:

> Stop accepting new events. Setting _disposed first ensures any FlushBatch entered
> after we mark disposed will fault its pending events rather than touching the
> about-to-close connection.

But the block does NOT set `_disposed = true`; it only calls
`_writeQueue.Writer.TryComplete()` and captures `_writerLoop`. The `_disposed` flag is
flipped in the SECOND lock block (line 738), AFTER the 5-second wait on the writer
loop. During the wait window, a concurrent `WriteAsync` that observed the channel
NOT-yet-completed (race: it ran before `TryComplete`) and got past `TryWrite` would
land on the writer loop's `FlushBatch`, which then takes the lock and checks
`_disposed`, and finds it still `false`. The check at the top of `FlushBatch`
(line 265) `if (_disposed) { fault pending; return; }` therefore does NOT fire during
the dispose window. In practice the channel being completed drains the loop cleanly
and the disposal race is benign, but the comment claims a guarantee that the code
does not implement.

**Recommendation**

Either set `_disposed = true` in the first lock block to match the comment (and remove
the duplicate `_disposed` check in the second block); or rewrite the comment to
describe the actual ordering: the channel is completed first, the loop drains
remaining items under the lock, and `_disposed = true` is set only after the loop
exits. The current code is correct; the comment is wrong.

**Resolution**

_Unresolved._

### AuditLog-010 — Actor drain paths accept a `CancellationToken` parameter but always pass `CancellationToken.None` downstream

| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/Site/Telemetry/SiteAuditTelemetryActor.cs:92,107,124`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:228` |

**Description**

The drain loops on `SiteAuditTelemetryActor.OnDrainAsync` and the per-site pull on
`SiteAuditReconciliationActor.PullSiteAsync` both pass `CancellationToken.None` to
every async dependency call (queue reads, gRPC client, repository writes). The actor
has no `CancellationToken` field, so there's no in-flight cancellation source;
graceful shutdown relies entirely on `PostStop` being called and the actor's
`Receive` continuation completing naturally. For a healthy gRPC client this is fine,
but a stuck `IngestAuditEventsAsync` call (slow central, partition switch in progress)
holds the actor's continuation indefinitely; the host's coordinated-shutdown will then
time out the actor system and leave the actor in an undefined state. The brief
references "cancellation on stop" in the partition-maintenance comments but
`SiteAuditTelemetryActor` does not implement it.

**Recommendation**

Introduce a per-actor `CancellationTokenSource` populated in `PreStart` and cancelled
in `PostStop`; pass `_lifecycleCts.Token` instead of `CancellationToken.None` in
every async dependency call. Make the same change in `SiteAuditReconciliationActor`.
`OperationCanceledException` is already swallowed by the top-level catch
in `OnDrainAsync` (line 128), so plumbing the token through is a localised change.

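Stripped of Akka.NET, the lifecycle-token pattern looks roughly like this. `DrainLoopSketch` and the `pushBatch` delegate are hypothetical; the `PreStart` and `PostStop` methods here are plain stand-ins for the actor overrides of the same name, and the `bool` return value exists only so the outcome is observable.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class DrainLoopSketch
{
    private CancellationTokenSource _lifecycleCts = new();

    // Stand-in for the actor's PreStart: a fresh source per incarnation, so a
    // restarted actor does not inherit an already-cancelled token.
    public void PreStart() => _lifecycleCts = new CancellationTokenSource();

    // Stand-in for the actor's PostStop: cancels whatever call is in flight.
    public void PostStop()
    {
        _lifecycleCts.Cancel();
        _lifecycleCts.Dispose();
    }

    // Returns true when the batch was pushed, false when shutdown interrupted it.
    public async Task<bool> OnDrainAsync(Func<CancellationToken, Task> pushBatch)
    {
        var token = _lifecycleCts.Token; // instead of CancellationToken.None
        try
        {
            await pushBatch(token).ConfigureAwait(false);
            return true;
        }
        catch (OperationCanceledException)
        {
            // Same swallow the existing top-level catch already performs.
            return false;
        }
    }
}
```

With the token threaded through, a call that would otherwise hang on a slow central returns as soon as `PostStop` runs, so coordinated shutdown does not have to wait for its own timeout.
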
**Resolution**

_Unresolved._

### AuditLog-011 — `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call

| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.AuditLog/ServiceCollectionExtensions.cs:53-55, 263-276, 301-346` |

**Description**

The XML doc on `AddAuditLog` is explicit: "Idempotent re-registration is not supported;
call this exactly once per `IServiceCollection`." But `AddAuditLogHealthMetricsBridge`
calls `services.AddHostedService<SiteAuditBacklogReporter>()` (line 275), which is
NOT idempotent: every call registers another descriptor, and the host will spin up
N reporters and have them all poll SQLite every 30 s, all push the same snapshot into
`ISiteHealthCollector`. The site composition path is supposed to call this exactly
once, but tests or composition refactors that accidentally call twice will pay 2x the
SQL probe rate and overwrite the snapshot with conflicting numbers (no race, just
wasted work). Worse, `AddAuditLogCentralMaintenance` (line 301) is also non-idempotent:
`AddOptions<AuditLogPartitionMaintenanceOptions>` and `AddHostedService<AuditLogPartitionMaintenanceService>`
will pile up.

**Recommendation**

Either (a) guard each Add* helper with a "has the marker been seen?" sentinel
(register a private marker descriptor on first call, no-op on subsequent calls);
or (b) explicitly document idempotency on the public surface of every helper and
verify with a unit test in `AddAuditLogTests`. Option (a) matches the pattern other
SDK extensions use and removes a foot-gun.

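A marker-based guard for option (a) can be sketched as follows. It assumes the `Microsoft.Extensions.DependencyInjection` package is referenced; `AddBridgeOnce`, `BridgeMarker`, `IProbe`, and `Probe` are hypothetical names, with the `IProbe` registration standing in for whatever the real helper adds.

```csharp
using System.Linq;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical service pair standing in for the helper's real registrations.
public interface IProbe { }
public sealed class Probe : IProbe { }

public static class IdempotentRegistrationSketch
{
    // Private marker type: a descriptor for it means "this helper already ran".
    private sealed class BridgeMarker { }

    public static IServiceCollection AddBridgeOnce(this IServiceCollection services)
    {
        if (services.Any(d => d.ServiceType == typeof(BridgeMarker)))
        {
            return services; // second and later calls are no-ops
        }

        services.AddSingleton<BridgeMarker>();
        services.AddSingleton<IProbe, Probe>();
        return services;
    }
}
```

Because the marker is private to the extension class, no other code can register or resolve it by accident, and the check costs one linear scan of the collection per call.
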
**Resolution**

_Unresolved._