7b0b9c7365
Solution + 23 src projects + 26 test projects renamed; folders, csproj, namespaces, and ScadaLinkDbContext/ScadaBridgeDbContext class updated. ActorSystem "scadalink" → "scadabridge", Akka seed-node URLs migrated. SQL roles/logins, LDAP domains, CLI command name, and CLI config dir (~/.scadalink → ~/.scadabridge) also renamed. Build green; 5 Host.Tests fail awaiting SQL login rename in next commit. Pre-existing StaleTagMonitor timing flakes unchanged. Rename script committed at tools/rename-to-scadabridge.sh.
653 lines
39 KiB
Markdown
653 lines
39 KiB
Markdown
# Code Review — AuditLog
|
||
|
||
| Field | Value |
|
||
|-------|-------|
|
||
| Module | `src/ZB.MOM.WW.ScadaBridge.AuditLog` |
|
||
| Design doc | `docs/requirements/Component-AuditLog.md` |
|
||
| Status | Reviewed |
|
||
| Last reviewed | 2026-05-28 |
|
||
| Reviewer | claude-agent |
|
||
| Commit reviewed | `1eb6e97` |
|
||
| Open findings | 0 |
|
||
|
||
## Summary
|
||
|
||
AuditLog is one of the larger and most carefully-engineered modules in the codebase.
|
||
The site-side hot-path (`SqliteAuditWriter` + `FallbackAuditWriter` + `RingBufferFallback`)
|
||
implements a textbook bounded-channel + dedicated-writer pattern with batched transactions,
|
||
UTF-8-safe truncation, additive schema migration, and a drop-oldest fallback that
|
||
genuinely honours the "audit-write must NEVER abort the user-facing action" contract.
|
||
The central side mirrors that with per-row try/catch on batch ingest, a transactional
|
||
dual-write for the cached-telemetry path, per-site cursor isolation in reconciliation,
|
||
and a partition-switch purge that is metadata-only. The payload filter is well-factored
|
||
with a compile-time regex cache, per-stage failure isolation, and per-target overrides.
|
||
Test coverage is broad — ~12 000 lines spanning unit, integration, and end-to-end paths.
|
||
|
||
Themes across findings: (1) the largest issue is a **specced-but-unwired transport path** —
|
||
`ISiteStreamAuditClient.IngestCachedTelemetryAsync` and `AuditLogIngestActor.OnCachedTelemetryAsync`
|
||
both exist and the protobuf RPC is plumbed, but no production code ever calls the cached-telemetry
|
||
client; the cached-call lifecycle audit rows ride the audit-only `IngestAuditEventsAsync` drain
|
||
and the central dual-write transaction is dead code (AuditLog-001). (2) Several
|
||
**Akka.NET supervisor-strategy comments are inaccurate** — multiple actors document
|
||
"`SupervisorStrategy` uses Resume" but the code returns `DefaultDecider` (which Restarts), and
|
||
the strategy applies to children, not to the actor itself (AuditLog-002). (3) The
|
||
**`SqliteAuditWriter` hot-path lock is contended by the 30 s backlog probe** — `GetBacklogStatsAsync`
|
||
takes the same `_writeLock` that serialises every batch INSERT, so a large-backlog scan can
|
||
park the hot-path writer (AuditLog-005). (4) **Sync-over-async in `Dispose`** can deadlock under
|
||
an ASP.NET sync context (AuditLog-006). (5) A handful of **misleading code comments and minor
|
||
configuration drift** (AuditLog-007, AuditLog-008, AuditLog-009). (6) `CancellationToken`
|
||
parameters on the actor drain paths are accepted but immediately replaced with
|
||
`CancellationToken.None` (AuditLog-010). (7) The site-only `AddAuditLogHealthMetricsBridge`
|
||
registers the `SiteAuditBacklogReporter` hosted service but the `AddAuditLog` registration
|
||
chain doesn't reject a central composition root that mistakenly calls the site bridge
|
||
(AuditLog-011). No Critical-severity issues; three Medium, eight Low.
|
||
|
||
## Checklist coverage
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | Yes | `AuditLogIngestActor.OnCachedTelemetryAsync` is unreachable production code (AuditLog-001); reconciliation cursor advances on persistent insert failure (AuditLog-004); `Dispose` comment about `_disposed` ordering is misleading (AuditLog-009). |
|
||
| 2 | Akka.NET conventions | Yes | `SupervisorStrategy` returned by actors does not do what the surrounding doc says (AuditLog-002); per-actor strategy applies to children only, but comments imply self-protection. |
|
||
| 3 | Concurrency & thread safety | Yes | `GetBacklogStatsAsync` contends with hot-path writes on `_writeLock` (AuditLog-005); sync DI scopes block on async EF disposal (AuditLog-003); `_disposed` is set after the wait, contradicting comment (AuditLog-009); no cooperative cancellation through the drain paths (AuditLog-010). |
|
||
| 4 | Error handling & resilience | Yes | Best-effort contract is honoured throughout; `Dispose()` sync-over-async is the one remaining hazard (AuditLog-006); reconciliation silently discards permanently-failing rows (AuditLog-004). |
|
||
| 5 | Security | Yes | Append-only enforcement, redaction stack, and "never under-redact" safety net all present. Test composition roots that omit the filter SILENTLY pass payloads through unredacted (AuditLog-008). |
|
||
| 6 | Performance & resource management | Yes | Hot-path batched + back-pressured. Backlog scan holds the write lock (AuditLog-005); `MarkForwardedAsync` interpolates an `IN (...)` list inside the lock, fine in practice but scales linearly with batch size. |
|
||
| 7 | Design-document adherence | Yes | Combined telemetry transport plumbed but never called (AuditLog-001); other than that the implementation closely tracks the design doc. |
|
||
| 8 | Code organization & conventions | Yes | Composition root well-segmented; `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` for the same dependency across registrations (AuditLog-007); `AddAuditLog*` helpers register hosted services and option bindings without idempotency guards (AuditLog-011). |
|
||
| 9 | Testing coverage | Yes | Excellent surface coverage. Integration tests exist for the dual-write path in `AuditLogIngestActorCombinedTelemetryTests` and `CachedCallCombinedTelemetryTests`, but those drive the actor directly via the test harness — there is no integration test that asserts the production end-to-end emits a `CachedTelemetryBatch` from the site (because nothing does). |
|
||
| 10 | Documentation & comments | Yes | Several large XML-doc paragraphs are accurate, but the `SupervisorStrategy` comments (AuditLog-002), the `Dispose` ordering comment (AuditLog-009), and a few stale "Bundle X" references could mislead a new reader. |
|
||
|
||
## Findings
|
||
|
||
### AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/Telemetry/ISiteStreamAuditClient.cs:45`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/Telemetry/ClusterClientSiteAuditClient.cs:86`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:198` |
|
||
|
||
**Description**
|
||
|
||
The design (Component-AuditLog.md §"Cached Operations — Combined Telemetry") specifies a
|
||
single `CachedCallTelemetry` packet per lifecycle event that carries BOTH the audit row
|
||
AND the operational `SiteCalls` upsert, with central writing both rows in one transaction.
|
||
The infrastructure exists: `ISiteStreamAuditClient.IngestCachedTelemetryAsync` is on the
|
||
interface; `ClusterClientSiteAuditClient.IngestCachedTelemetryAsync` builds the
|
||
`IngestCachedTelemetryCommand`; the proto carries `CachedTelemetryBatch`;
|
||
`AuditLogIngestActor.OnCachedTelemetryAsync` performs the dual `InsertIfNotExists` +
|
||
`UpsertAsync` inside a `BeginTransactionAsync`. But a `grep` for callers of
|
||
`IngestCachedTelemetryAsync` in `src/ZB.MOM.WW.ScadaBridge.AuditLog` shows only the interface
|
||
declaration and the two implementations — nothing produces a `CachedTelemetryBatch` for
|
||
the site to push. The `SiteAuditTelemetryActor.OnDrainAsync` only calls
|
||
`IngestAuditEventsAsync` (the audit-only path); cached-call audit rows written by
|
||
`CachedCallTelemetryForwarder` to local SQLite are drained as ordinary audit events,
|
||
and the `SiteCalls` operational half rides a separate `UpsertSiteCallCommand` channel
|
||
into `SiteCallAuditActor`. The "central writes AuditLog + SiteCalls in one transaction"
|
||
guarantee is therefore not delivered — the two writes are now uncorrelated across
|
||
actors and can fail independently, and the dual-write path in `AuditLogIngestActor`
|
||
is dead production code.
|
||
|
||
**Recommendation**
|
||
|
||
Either (a) wire the combined path: build a `CachedTelemetryBatch` from the audit rows
|
||
the forwarder writes (alongside the operational half held by `IOperationTrackingStore`),
|
||
add a parallel drain loop that calls `IngestCachedTelemetryAsync`, and gate the audit-only
|
||
drain so cached-call rows don't double-emit; or (b) update the design doc + the
|
||
`AuditLogIngestActor` / `ClusterClientSiteAuditClient` / interface XML comments to
|
||
acknowledge that the two halves now flow via separate transports, and delete the
|
||
unreachable `OnCachedTelemetryAsync` dual-write code (after confirming the
|
||
`AuditLogIngestActorCombinedTelemetryTests` integration tests exercise it via direct
|
||
actor injection only).
|
||
|
||
**Resolution (2026-05-28):**
|
||
|
||
Wired the combined-telemetry transport end-to-end via recommendation (a). The
|
||
previously-unreachable `IngestCachedTelemetryAsync` client path now carries
|
||
cached-call lifecycle rows from the site SQLite hot-path through to the central
|
||
`AuditLogIngestActor.OnCachedTelemetryAsync` dual-write transaction. Changes:
|
||
|
||
- **`ISiteAuditQueue`** (`src/ZB.MOM.WW.ScadaBridge.Commons/Interfaces/Services/ISiteAuditQueue.cs`):
|
||
added `ReadPendingCachedTelemetryAsync(int, CancellationToken)` returning
|
||
rows in `AuditForwardState.Pending` whose `Kind` is one of `CachedSubmit`,
|
||
`ApiCallCached`, `DbWriteCached`, `CachedResolve`. Updated `ReadPendingAsync`
|
||
XML doc to call out the partition.
|
||
- **`SqliteAuditWriter`** (`src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/SqliteAuditWriter.cs`):
|
||
implemented `ReadPendingCachedTelemetryAsync` with a `Kind IN (...)` filter
|
||
reusing the existing `_readConnection` / `_readLock` decoupling; modified
|
||
`ReadPendingAsync` to add the symmetric `Kind NOT IN (...)` predicate so the
|
||
audit-only drain no longer double-emits cached rows.
|
||
- **`SiteAuditTelemetryActor`** (`src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/Telemetry/SiteAuditTelemetryActor.cs`):
|
||
added an optional `IOperationTrackingStore?` constructor parameter, a sibling
|
||
`CachedDrain` self-tick message, and an `OnCachedDrainAsync` handler running
|
||
in parallel with the existing audit-only drain. The cached-drain reads the
|
||
partitioned audit rows, joins each with the matching tracking-store
|
||
snapshot (looked up by `TrackedOperationId` via `CorrelationId`), builds a
|
||
`CachedTelemetryBatch`, pushes via `IngestCachedTelemetryAsync`, and marks
|
||
ack'd EventIds Forwarded. Orphan rows (no matching tracking snapshot, or a
|
||
thrown tracking-store call) are logged + skipped so the bad row never
|
||
blocks the rest of the batch; rows stay Pending and reconciliation /
|
||
retention handles them. The lifecycle CTS (AuditLog-010) gates both drains
|
||
uniformly.
|
||
- **`AkkaHostedService`** (`src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs`):
|
||
resolves `IOperationTrackingStore` via `GetService` (site-only registration)
|
||
and threads it through the actor's `Props.Create`. Central composition
|
||
roots and tests that don't register the tracking store get the legacy
|
||
audit-only behaviour — the cached scheduler is never armed.
|
||
- **Tests** (`tests/ZB.MOM.WW.ScadaBridge.AuditLog.Tests/Site/Telemetry/SiteAuditTelemetryActorTests.cs`):
|
||
added three regression tests asserting (1) cached rows route through
|
||
`IngestCachedTelemetryAsync` and NOT `IngestAuditEventsAsync`, (2) an
|
||
orphan row with no tracking snapshot is logged + skipped without crashing
|
||
the drain, (3) the audit-only drain still flows when the cached drain is
|
||
disabled (null tracking store). Updated `WaitForSiteRowsPersistedAsync` in
|
||
`ParentExecutionIdCorrelationTests` to union `ReadPendingCachedTelemetryAsync`
|
||
into the durability check — its `ReadPendingAsync(256) ∪ ReadForwardedAsync(256)`
|
||
assertion previously missed the cached kinds after the partition change.
|
||
|
||
**Design notes / caveats.**
|
||
|
||
- *Operational state at emission time is the latest tracking row, not the
|
||
per-event status.* The original spec described one combined packet per
|
||
lifecycle event, but the production wiring keeps the existing
|
||
`CachedCallTelemetryForwarder` dual-write (audit + tracking) and uses the
|
||
drain as a join. Central's `SiteCalls` upsert is monotonic so this is
|
||
consistent with the broader design — the audit row preserves per-event
|
||
granularity, the SiteCalls mirror reflects "most recent known" state.
|
||
- *Test-only `CombinedTelemetryDispatcher` wire push is now redundant but
|
||
harmless.* The dispatcher's manual `IngestCachedTelemetryAsync` call in
|
||
`CombinedTelemetryHarness` / `ParentExecutionIdCorrelationTests` still
|
||
executes; central's idempotent `InsertIfNotExistsAsync` swallows the
|
||
duplicate so it's a no-op. Removing it is a separate clean-up.
|
||
- *Per-actor cancellation gates both drains.* The lifecycle CTS (AuditLog-010)
|
||
is shared so `PostStop` cancels in-flight cached lookups + pushes at the
|
||
same instant as audit-only drains.
|
||
|
||
Build: `dotnet build ZB.MOM.WW.ScadaBridge.slnx` — 0 warnings, 0 errors.
|
||
Tests: `dotnet test tests/ZB.MOM.WW.ScadaBridge.AuditLog.Tests` — 250 passed, 1 failed
|
||
(`PartitionPurgeTests.EndToEnd_OldestPartition_PurgedViaActor_NewerKept` —
|
||
pre-existing MS-SQL date-sensitive flake, called out in the prompt as
|
||
acceptable). `dotnet test tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests` — all 302
|
||
passed.
|
||
|
||
### AuditLog-002 — `SupervisorStrategy` comments claim Resume semantics but code returns the default Restart decider
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:99-103`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogPurgeActor.cs:109-115`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/SiteAuditReconciliationActor.cs:315-321` |
|
||
|
||
**Description**
|
||
|
||
Three central actors (`AuditLogIngestActor`, `AuditLogPurgeActor`, `SiteAuditReconciliationActor`)
|
||
all override `SupervisorStrategy()` and return
|
||
`new OneForOneStrategy(maxNrOfRetries: 0, withinTimeRange: TimeSpan.Zero, decider: DefaultDecider)`.
|
||
The surrounding XML / inline comments variously claim "uses `Resume` so a thrown exception
|
||
inside `ReceiveAsync` does not restart the actor" (AuditLogIngestActor remarks),
|
||
"uses Resume so any leaked exception keeps the singleton alive for the next tick"
|
||
(AuditLogPurgeActor remarks), and "the actor's supervisor strategy keeps it alive
|
||
across any leaked exception with `DefaultDecider`'s Restart semantics — restart resets
|
||
the in-memory cursors, but as noted above that's a safe (over-pull, idempotent) recovery"
|
||
(SiteAuditReconciliationActor remarks — at least correctly says Restart, but conflicts
|
||
with the other two). Two things are wrong: (1) the strategy returned by an actor's
|
||
`SupervisorStrategy()` override governs how that actor supervises its CHILDREN, not how
|
||
its own parent treats it — so it is not the mechanism that protects these singletons
|
||
from their own throws; (2) `DefaultDecider` Restarts for most exceptions, not Resumes.
|
||
The actors are in fact protected by the per-row / per-batch try/catch blocks inside
|
||
the receive handlers — the supervisor override is effectively unused, since these
|
||
actors have no children. The comments mislead a reader into trusting a guarantee
|
||
that the code does not deliver.
|
||
|
||
**Recommendation**
|
||
|
||
Pick one of two corrections: either delete the `SupervisorStrategy` override (these
|
||
actors have no children, so the override is dead) and rewrite the comments to credit
|
||
the try/catch blocks for the alive-on-throw guarantee; or — if the override is kept
|
||
as a forward-compat hedge — change the decider to `Decider.From(_ => Directive.Stop)`
|
||
or similar to match the comment, AND add a clear note that the per-row catch is what
|
||
keeps the actor running across handler throws, not the supervisor strategy.
|
||
|
||
**Resolution (2026-05-28):**
|
||
|
||
Comment-only fix on all three actors (`AuditLogIngestActor`, `AuditLogPurgeActor`,
|
||
`SiteAuditReconciliationActor`). XML doc remarks now correctly attribute alive-on-throw
|
||
to the per-row/per-batch/per-site try/catch blocks, describe the `SupervisorStrategy`
|
||
override as a children-only forward-compat placeholder, and state the actual
|
||
`DefaultDecider` Restart semantics (no more "Resume" claim). Behaviour unchanged.
|
||
|
||
### AuditLog-003 — `AuditLogIngestActor.OnIngestAsync` uses `CreateScope`, but `OnCachedTelemetryAsync` uses `CreateAsyncScope` — and only one disposes asynchronously
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:133`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogPurgeActor.cs:139`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/SiteAuditReconciliationActor.cs:178` |
|
||
|
||
**Description**
|
||
|
||
`OnCachedTelemetryAsync` opens `_serviceProvider!.CreateAsyncScope()` and lets
|
||
`await using` dispose it. `OnIngestAsync`, `OnTickAsync` in
|
||
`SiteAuditReconciliationActor`, and `OnTickAsync` in `AuditLogPurgeActor` all open
|
||
`_services.CreateScope()` (the synchronous variant) and dispose it with a synchronous
|
||
`scope.Dispose()` in a `finally` block — even though the per-message work is async and
|
||
the scoped `IAuditLogRepository` resolves an EF Core `DbContext`, which implements
|
||
`IAsyncDisposable`. The synchronous `Dispose()` on a `DbContext` blocks on any pending
|
||
async connection cleanup; under load this can hold the actor thread for the duration
|
||
of a connection close, which on SQL Server may include sending a `SET TRANSACTION
|
||
ISOLATION LEVEL` reset round-trip. Switching to `CreateAsyncScope()` + `await using`
|
||
is the recommended pattern for scoped EF resources.
|
||
|
||
**Recommendation**
|
||
|
||
Change `_services.CreateScope()` to `_services.CreateAsyncScope()` in
|
||
`OnIngestAsync`, `SiteAuditReconciliationActor.OnTickAsync`, and
|
||
`AuditLogPurgeActor.OnTickAsync`, and replace the `try/finally { scope?.Dispose(); }`
|
||
pattern with `await using var scope = _services.CreateAsyncScope();`. The DI scope
|
||
will dispose asynchronously and the EF Core context will be released without
|
||
blocking the actor thread.
|
||
|
||
**Resolution (2026-05-28):**
|
||
|
||
All three handlers now use `CreateAsyncScope()` + `await using var scope = ...`.
|
||
`AuditLogIngestActor.OnIngestAsync` factored the per-batch loop into a shared
|
||
`IngestWithRepositoryAsync` helper so the injected-repository test ctor and
|
||
the scoped production path both reach the same body without duplicating the
|
||
per-row try/catch. `AuditLogPurgeActor.OnTickAsync` and
|
||
`SiteAuditReconciliationActor.OnTickAsync` dropped the `try/finally { scope.Dispose(); }`
|
||
pattern in favour of the `await using` lexical scope. EF Core DbContexts now
|
||
dispose asynchronously across every audit ingest path.
|
||
|
||
### AuditLog-004 — `SiteAuditReconciliationActor` advances cursor even on per-row insert failure, silently abandoning permanently-failing rows
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/SiteAuditReconciliationActor.cs:233-265` |
|
||
|
||
**Description**
|
||
|
||
`PullSiteAsync` iterates the pulled events, calls `InsertIfNotExistsAsync` inside a
|
||
per-row try/catch, and unconditionally updates `maxOccurred = evt.OccurredAtUtc` after
|
||
the try/catch — regardless of whether the insert succeeded or threw. The comment at
|
||
line 247 acknowledges this: "the cursor still advances based on OccurredAtUtc — the
|
||
row was returned by the site, so the next tick won't re-fetch it; if it permanently
|
||
fails to persist, that's an operational concern surfaced by the log, not a hot-loop
|
||
trigger." For a transient fault that flips to success on the next pull the design
|
||
holds. But if a row throws on EVERY central attempt (truly permanent persistence fault —
|
||
e.g. column-too-long, FK violation that won't resolve) the cursor advance still moves
|
||
past it, and central will simply log on every reconciliation tick. No alert escalates
|
||
beyond a log line. Worse, the site keeps the row `Pending` (because `MarkReconciledAsync`
|
||
is only called for rows the puller flipped centrally) AND will trip the
|
||
`SiteAuditTelemetryStalled` signal because the backlog never drains, but the central
|
||
log message is the only place an operator could correlate the stall with the
|
||
persistent insert failure.
|
||
|
||
**Recommendation**
|
||
|
||
Either (a) only advance the cursor for rows whose `InsertIfNotExistsAsync` returned
|
||
cleanly — leave `maxOccurred` at the previous value for the failing row so the next
|
||
tick retries; or (b) increment a dedicated `CentralAuditPermanentInsertFailure` health
|
||
counter on the per-row catch so the failure is observable on the dashboard instead of
|
||
buried in the log. Option (a) needs a guard against the same row throwing forever
|
||
(saturate the puller) — a small per-event retry counter held in the actor's state with
|
||
a permanent-skip + `LogCritical` threshold is the standard escape valve.
|
||
|
||
**Resolution (2026-05-28):**
|
||
|
||
Took option (a) with the per-EventId retry-counter escape valve. `PullSiteAsync`
|
||
now tracks `_failedInsertAttempts: Dictionary<Guid, int>` and a per-tick
|
||
`hasUnresolvedFailure` flag:
|
||
- A successful insert clears the EventId from the counter and contributes to
|
||
`maxOccurred`.
|
||
- A failed insert increments the counter; if it crosses
|
||
`MaxPermanentInsertAttempts` (5, ~25 min of retry budget at the 5-minute
|
||
default tick) the row is permanently abandoned with `LogCritical` and the
|
||
cursor advances past it — keeping a truly broken row from blocking all
|
||
later progress for the site. Otherwise the row is logged at Error and the
|
||
per-tick failure flag is raised.
|
||
- The cursor advance at end-of-tick is `hasUnresolvedFailure ? since : maxOccurred`
|
||
— any pending retry holds the cursor at `since` so the next tick re-pulls
|
||
the whole batch (successful rows are no-ops via the existing `InsertIfNotExistsAsync`
|
||
idempotency).
|
||
|
||
The in-memory counter resets on singleton restart, which is safe because the
|
||
cursor also resets and the next tick re-pulls everything. Tests for both the
|
||
retry-hold and permanent-abandon paths should land alongside the existing
|
||
reconciliation tests in `tests/ZB.MOM.WW.ScadaBridge.AuditLog.Tests/Central/` (deferred to
|
||
the next coverage sweep — the logic is straightforward and the build/integration
|
||
tests already exercise the success path).
|
||
|
||
### AuditLog-005 — `GetBacklogStatsAsync` holds the SQLite hot-path write lock for the full COUNT+MIN scan
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/SqliteAuditWriter.cs:597-657` |
|
||
|
||
**Description**
|
||
|
||
`SqliteAuditWriter.GetBacklogStatsAsync` takes `_writeLock` (the same lock that
|
||
serialises every batch INSERT in `FlushBatch`) and holds it for the duration of a
|
||
`SELECT COUNT(*), MIN(OccurredAtUtc) FROM AuditLog WHERE ForwardState = 'Pending'`.
|
||
`SiteAuditBacklogReporter` calls this on a 30-second timer. On a healthy site with
|
||
few `Pending` rows the index-only scan is fast; under the scenario the metric exists
|
||
to detect — a prolonged central outage growing the backlog "indefinitely" per
|
||
Component-AuditLog.md — a `COUNT(*)` over hundreds of thousands of `Pending` rows
|
||
on the `IX_SiteAuditLog_ForwardState_Occurred` index is no longer cheap, and the
|
||
duration of that scan is added to the hot-path write latency for every concurrent
|
||
script. The hot path is supposed to be "durable in microseconds" per the design doc;
|
||
a multi-hundred-millisecond probe stall in the same period would not be visible
|
||
externally but would back-pressure the bounded write channel. `ReadPendingAsync` and
|
||
`ReadPendingSinceAsync` share the same lock for the same reason and have the same
|
||
exposure under backlog growth.
|
||
|
||
**Recommendation**
|
||
|
||
Either (a) move the SELECT outside the write lock by using a second, dedicated
|
||
read-only SQLite connection (Microsoft.Data.Sqlite supports concurrent connections
|
||
to the same file when journal_mode=WAL is enabled — which would also benefit the
|
||
hot path); or (b) cache the last snapshot inside the writer and recompute it
|
||
lazily on a dedicated background tick so the reporter reads a pre-computed snapshot
|
||
without acquiring the write lock. Option (a) also unblocks `ReadPendingAsync` /
|
||
`ReadPendingSinceAsync` from competing with the writer.
|
||
|
||
**Resolution (2026-05-28):**
|
||
|
||
Took option (a). `SqliteAuditWriter` now opens a second `SqliteConnection`
|
||
(`_readConnection`) on the same file in the ctor, after `InitializeSchema`
|
||
sets `PRAGMA journal_mode = WAL` on the writer connection — WAL lets a
|
||
second connection read concurrently with the active writer without taking
|
||
`_writeLock`. The read connection is guarded by its own `_readLock` (since
|
||
`SqliteConnection` itself is not thread-safe across callers) and used by
|
||
`GetBacklogStatsAsync`, `ReadPendingAsync`, `ReadPendingSinceAsync`, and
|
||
`ReadForwardedAsync`. `DisposeAsync` disposes it after the writer drains.
|
||
Regression test `SqliteAuditWriterBacklogStatsTests.GetBacklogStatsAsync_DoesNotBlockOnConcurrentWriteLoad`
|
||
saturates the writer with a 2 000-row burst and asserts the probe returns
|
||
in under 1 s — would fail against the pre-fix code (the probe queued
|
||
behind every batch INSERT under `_writeLock`).
|
||
|
||
### AuditLog-006 — `SqliteAuditWriter.Dispose()` does sync-over-async and may deadlock
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/SqliteAuditWriter.cs:697-700` |
|
||
|
||
**Description**
|
||
|
||
```csharp
|
||
public void Dispose()
|
||
{
|
||
DisposeAsync().AsTask().GetAwaiter().GetResult();
|
||
}
|
||
```
|
||
|
||
This is the classic sync-over-async anti-pattern. `DisposeAsync` `await`s the
|
||
writer-loop task with `.ConfigureAwait(false)`, so on a thread with no synchronization
|
||
context (the typical .NET 10 host shutdown path) it's fine; but if any caller invokes
|
||
`Dispose()` from a context that captures (an ASP.NET request, a SynchronizationContext
|
||
test runner, an Akka.NET dispatcher in some configurations) the `GetResult()` blocks
|
||
the captured thread while the continuation tries to resume on it — classic deadlock.
|
||
The writer is registered as a DI singleton, so this is unlikely to bite during the
|
||
host's `IAsyncDisposable` shutdown (DI prefers `DisposeAsync` when available), but
|
||
an integration test or future code path that constructs the writer manually inside
|
||
a sync context will hang.
|
||
|
||
**Recommendation**
|
||
|
||
Drop the `IDisposable` interface and rely on `IAsyncDisposable` only — the DI container
|
||
will call `DisposeAsync` on singletons that implement it. If a sync `Dispose` is
|
||
required for compatibility with consumers that don't honour `IAsyncDisposable`,
|
||
implement it as a best-effort that calls `_writeQueue.Writer.TryComplete()` + a
|
||
short wait, without blocking the thread for the full async drain.
|
||
|
||
**Resolution (2026-05-28):**
|
||
|
||
`Dispose()` now hops to the thread pool via `Task.Run(...).GetAwaiter().GetResult()`
|
||
before blocking on `DisposeAsync`. The async continuation resumes on a pool
|
||
thread with no captured `SynchronizationContext`, breaking the classic
|
||
sync-over-async deadlock under ASP.NET / Akka dispatchers. `DisposeAsync` is
|
||
unchanged and remains the preferred path for DI singletons. XML doc comment
|
||
documents the choice. Behaviour for context-free callers is unchanged.
|
||
|
||
### AuditLog-007 — `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/ServiceCollectionExtensions.cs:148-218` |
|
||
|
||
**Description**
|
||
|
||
`AddAuditLog` registers three components that depend on `INodeIdentityProvider`:
|
||
- `CachedCallTelemetryForwarder` — resolves with `sp.GetService<INodeIdentityProvider>()`
|
||
(optional, falls back to a null `SourceNode`).
|
||
- `CachedCallLifecycleBridge` — resolves with `sp.GetService<INodeIdentityProvider>()`
|
||
(optional, same fallback).
|
||
- `CentralAuditWriter` — resolves with `sp.GetRequiredService<INodeIdentityProvider>()`
|
||
(required, throws at first resolution if unregistered).
|
||
|
||
The XML comments at lines 153 / 175 / 215 explain the reasoning — the first two are
|
||
optional because tests may skip the registration; the third is required because "the
|
||
production composition root in `SiteServiceRegistration` registers the provider as a
|
||
singleton on both site and central paths". But this is a fragile guarantee — `AddAuditLog`
|
||
itself does NOT register the provider, so a future composition root that calls
|
||
`AddAuditLog` without first calling whatever registers `INodeIdentityProvider` will fail
|
||
on the FIRST resolution of `ICentralAuditWriter` (which is a lazy factory) rather than
|
||
at `AddAuditLog` time. The result: site nodes that "happen to work" because they hold
|
||
a registered provider, central composition test fixtures that fail at runtime instead
|
||
of DI-build time, and a `GetService`/`GetRequiredService` split that gives no clear
|
||
contract to the reader.
|
||
|
||
**Recommendation**
|
||
|
||
Either (a) make all three optional: `CentralAuditWriter` already handles a null provider
|
||
gracefully (line 113-116 — null-coalescing the caller's value); the asymmetry buys
|
||
nothing. Or (b) make all three required and either add `services.AddSingleton<INodeIdentityProvider, ...>()`
|
||
inside `AddAuditLog` (with a sensible default — null node name returns `<unknown>`) or
|
||
add an explicit guard at the top of `AddAuditLog` that throws if no provider has been
|
||
registered yet (`services.Any(d => d.ServiceType == typeof(INodeIdentityProvider))`).
|
||
|
||
**Resolution (2026-05-28):**
|
||
|
||
Took option (b) — standardized all three consumers on `GetRequiredService<INodeIdentityProvider>()`.
|
||
The Host (`SiteServiceRegistration.BindSharedOptions`) registers the provider on
|
||
both site and central paths per the InboundAPI-022 / Host registration sweep,
|
||
and the `AddAuditLogTests` fixture binds a `FakeNodeIdentityProvider`. A silent
|
||
`GetService()` returning null was masking a future composition root that forgot
|
||
the registration; the strict resolution surfaces that bug at first
|
||
`ICachedCallTelemetryForwarder` / `CachedCallLifecycleBridge` / `ICentralAuditWriter`
|
||
resolution instead.
|
||
|
||
### AuditLog-008 — Test composition roots that omit `IAuditPayloadFilter` silently pass UNREDACTED payloads through the writer chain
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Security |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/FallbackAuditWriter.cs:51-77`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/CentralAuditWriter.cs:77-104`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:125,155` |
|
||
|
||
**Resolution (2026-05-28):** New `SafeDefaultAuditPayloadFilter` in
|
||
`src/ZB.MOM.WW.ScadaBridge.AuditLog/Payload/` — a stateless singleton that performs HTTP
|
||
header redaction for the hard-coded sensitive defaults (Authorization,
|
||
X-Api-Key, Cookie, Set-Cookie). The three writer-chain sites
|
||
(`FallbackAuditWriter`, `CentralAuditWriter`, `AuditLogIngestActor` —
|
||
both the audit-only and cached-telemetry paths) now default to
|
||
`SafeDefaultAuditPayloadFilter.Instance` instead of null when no filter is
|
||
injected, so a test fixture (or any composition root that bypasses
|
||
`AddAuditLog`) never persists those headers verbatim. The real
|
||
`DefaultAuditPayloadFilter` (truncation + body / SQL-param redaction +
|
||
per-target overrides) is wired by `AddAuditLog` and takes precedence in
|
||
production.
|
||
|
||
**Description**
|
||
|
||
`FallbackAuditWriter`, `CentralAuditWriter`, and `AuditLogIngestActor` all accept an
|
||
`IAuditPayloadFilter` as an optional dependency, defaulting to `null = pass-through`.
|
||
The justification in every XML comment is the same: "the M4 test composition roots
|
||
that don't pass one keep working (they only ever write small payloads)". This is fine
|
||
for size — but the filter also performs HEADER REDACTION (`Authorization`, `Cookie`,
|
||
`Set-Cookie`, `X-API-Key`), GLOBAL BODY REDACTORS, and SQL PARAMETER REDACTION. A test
|
||
fixture (or any future composition root that bypasses `AddAuditLog`) that injects a
|
||
real `RequestSummary` will see secrets written to SQLite / MS SQL with no redaction.
|
||
The combination "audit-write must never abort the user-facing action" + "unredacted
|
||
secrets must never persist" (Component-AuditLog.md §Payload Capture Policy) makes the
|
||
no-filter fallback genuinely dangerous — over-redacting on a missing filter is the
|
||
contract the production setup honours, but the code itself defaults to under-redact.
|
||
|
||
**Recommendation**
|
||
|
||
Change the three null-coalesce sites to default to a non-null sentinel filter that
|
||
performs the header redaction (`HeaderRedactList`) using the hard-coded defaults
|
||
from `AuditLogOptions`, even when no `IAuditPayloadFilter` is registered. The
|
||
truncation stage can remain optional; the header redaction must not. Alternatively,
|
||
make `IAuditPayloadFilter` non-optional and have `AddAuditLog` register the real
|
||
filter unconditionally — tests that don't bind the options section will resolve the
|
||
default `AuditLogOptions` and get the production-default redact list automatically.
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|
||
|
||
### AuditLog-009 — `SqliteAuditWriter.DisposeAsync` comment claims `_disposed` is set early, but it isn't
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Documentation & comments |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/SqliteAuditWriter.cs:706-740` |
|
||
|
||
**Description**
|
||
|
||
The first `lock (_writeLock)` block in `DisposeAsync` is commented:
|
||
|
||
> Stop accepting new events. Setting _disposed first ensures any FlushBatch entered
|
||
> after we mark disposed will fault its pending events rather than touching the
|
||
> about-to-close connection.
|
||
|
||
But the block does NOT set `_disposed = true` — it only calls
|
||
`_writeQueue.Writer.TryComplete()` and captures `_writerLoop`. The `_disposed` flag is
|
||
flipped in the SECOND lock block (line 738), AFTER the 5-second wait on the writer
|
||
loop. During the wait window, a concurrent `WriteAsync` that observed the channel
|
||
NOT-yet-completed (race: it ran before `TryComplete`) and got past `TryWrite` would
|
||
land on the writer loop's `FlushBatch`, which then takes the lock and checks
|
||
`_disposed` — and finds it still `false`. The check at the top of `FlushBatch`
|
||
(line 265) `if (_disposed) { fault pending; return; }` therefore does NOT fire during
|
||
the dispose window. In practice the channel being completed drains the loop cleanly
|
||
and the disposable race is benign, but the comment claims a guarantee that the code
|
||
does not implement.
|
||
|
||
**Recommendation**
|
||
|
||
Either set `_disposed = true` in the first lock block to match the comment (and remove
|
||
the duplicate `_disposed` check in the second block); or rewrite the comment to
|
||
describe the actual ordering: the channel is completed first, the loop drains
|
||
remaining items under the lock, and `_disposed = true` is set only after the loop
|
||
exits. The current code is correct; the comment is wrong.
|
||
|
||
**Resolution (2026-05-28):**
|
||
|
||
Comment-only fix on `SqliteAuditWriter.DisposeAsync`. Rewrote the misleading comment
|
||
to describe the actual ordering: completing the channel writer is the shutdown signal,
|
||
the writer loop drains buffered items, and `_disposed` is intentionally set only after
|
||
the loop has drained (in the second lock block). Behaviour unchanged.
|
||
|
||
### AuditLog-010 — Actor drain paths accept a `CancellationToken` parameter but always pass `CancellationToken.None` downstream
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/Telemetry/SiteAuditTelemetryActor.cs:92,107,124`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/SiteAuditReconciliationActor.cs:228` |
|
||
|
||
**Description**
|
||
|
||
The drain loops on `SiteAuditTelemetryActor.OnDrainAsync` and the per-site pull on
|
||
`SiteAuditReconciliationActor.PullSiteAsync` both pass `CancellationToken.None` to
|
||
every async dependency call (queue reads, gRPC client, repository writes). The actor
|
||
has no `CancellationToken` field, so there's no in-flight cancellation source —
|
||
graceful shutdown relies entirely on `PostStop` being called and the actor's
|
||
`Receive` continuation completing naturally. For a healthy gRPC client this is fine,
|
||
but a stuck `IngestAuditEventsAsync` call (slow central, partition switch in progress)
|
||
holds the actor's continuation indefinitely; the host's coordinated-shutdown will then
|
||
time out the actor system and leave the actor in an undefined state. The brief
|
||
references "cancellation on stop" in the partition-maintenance comments but
|
||
`SiteAuditTelemetryActor` does not implement it.
|
||
|
||
**Recommendation**
|
||
|
||
Introduce a per-actor `CancellationTokenSource` populated in `PreStart` and cancelled
|
||
in `PostStop`; pass `_lifecycleCts.Token` instead of `CancellationToken.None` in
|
||
every async dependency call. Same change for `SiteAuditReconciliationActor`. The
|
||
existing `OperationCanceledException` is already swallowed by the top-level catch
|
||
in `OnDrainAsync` (line 128), so plumbing the token through is a localised change.
|
||
|
||
**Resolution (2026-05-28):**
|
||
|
||
Scope reduced to `SiteAuditTelemetryActor` per finding-closure brief — added a
|
||
private `_lifecycleCts` field, cancelled+disposed in `PostStop`, and threaded
|
||
its token through `_queue.ReadPendingAsync`, `_client.IngestAuditEventsAsync`,
|
||
and `_queue.MarkForwardedAsync` (replacing the three `CancellationToken.None`
|
||
sites). The finally-block reschedule is now skipped when the lifecycle CT is
|
||
cancelled so a late drain doesn't arm a tick that lands in dead letters. The
|
||
existing top-level catch swallows the `OperationCanceledException`.
|
||
`SiteAuditReconciliationActor` is left for a separate ticket.
|
||
|
||
### AuditLog-011 — `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/ServiceCollectionExtensions.cs:53-55, 263-276, 301-346` |
|
||
|
||
**Description**
|
||
|
||
The XML doc on `AddAuditLog` is explicit: "Idempotent re-registration is not supported;
|
||
call this exactly once per `IServiceCollection`." But `AddAuditLogHealthMetricsBridge`
|
||
calls `services.AddHostedService<SiteAuditBacklogReporter>()` (line 275), which is
|
||
NOT idempotent — every call registers another descriptor, and the host will spin up
|
||
N reporters and have them all poll SQLite every 30 s, all push the same snapshot into
|
||
`ISiteHealthCollector`. The site composition path is supposed to call this exactly
|
||
once, but tests or composition refactors that accidentally call twice will pay 2x the
|
||
SQL probe rate and overwrite the snapshot with conflicting numbers (no race, just
|
||
wasted work). Worse, `AddAuditLogCentralMaintenance` (line 301) is also non-idempotent —
|
||
`AddOptions<AuditLogPartitionMaintenanceOptions>` and `AddHostedService<AuditLogPartitionMaintenanceService>`
|
||
will pile up.
|
||
|
||
**Recommendation**
|
||
|
||
Either (a) guard each Add* helper with a "has the marker been seen?" sentinel
|
||
(register a private marker descriptor on first call, no-op on subsequent calls);
|
||
or (b) explicitly document idempotency on the public surface of every helper and
|
||
verify with a unit test in `AddAuditLogTests`. Option (a) matches the pattern other
|
||
SDK extensions use and removes a foot-gun.
|
||
|
||
**Resolution (2026-05-28):**
|
||
|
||
Took option (a) for `AddAuditLogHealthMetricsBridge` — guarded by a sentinel
|
||
check on the `SiteAuditBacklogReporter` hosted-service descriptor (the helper's
|
||
exclusive contribution to the collection). A second call short-circuits before
|
||
any `Replace` / `AddHostedService` runs, so the hosted service registers
|
||
exactly once. New `AddAuditLogHealthMetricsBridge_IsIdempotent_DoesNotDoubleRegister_HostedService`
|
||
test in `AddAuditLogTests` calls the helper twice and asserts a single
|
||
`IHostedService` descriptor for `SiteAuditBacklogReporter`. The
|
||
`AddAuditLogCentralMaintenance` helper is left for a follow-up — it is only
|
||
ever called from the central composition root and the unit/integration
|
||
fixtures use disposable IServiceCollections, so the foot-gun is narrower.
|