feat(audit): close AuditLog-001 — wire combined-telemetry dual-write transport

Closes the last open code-review finding. The unreachable
IngestCachedTelemetryAsync path now carries production cached-call
lifecycle traffic, delivering the design's "AuditLog + SiteCalls in one
MS SQL transaction" guarantee. Before this commit, the SiteCalls
operational half had NO production transport at all — central's
SiteCallAuditActor.OnUpsertAsync had zero producers, so cached-call
operational state never reached the central mirror.

Site-side partition (so neither path double-emits):
- ISiteAuditQueue.ReadPendingCachedTelemetryAsync — new method returning
  rows where Kind ∈ {CachedSubmit, ApiCallCached, DbWriteCached,
  CachedResolve} AND ForwardState = Pending.
- ISiteAuditQueue.ReadPendingAsync — XML doc updated, SQLite impl now
  filters Kind NOT IN the cached set so cached rows no longer ride the
  audit-only drain.

New cached-drain in SiteAuditTelemetryActor:
- Optional IOperationTrackingStore? ctor param (null on central
  composition roots — the cached scheduler is never armed there).
- Independent CachedDrain message + scheduler tick parallel to the
  existing Drain — a stall on one path can't block the other; shared
  lifecycle CTS gates both.
- OnCachedDrainAsync: reads cached audit rows, joins each with its
  matching SiteCallOperational snapshot via CorrelationId →
  TrackedOperationId from the tracking store, builds CachedTelemetryBatch,
  pushes via IngestCachedTelemetryAsync, marks ack'd rows Forwarded.
- Orphan rows (no tracking snapshot, thrown tracking-store call,
  missing CorrelationId) logged at Warning + skipped — they stay
  Pending so reconciliation/retry picks them up later. Best-effort
  contract preserved.

Central side: AuditLogIngestActor.OnCachedTelemetryAsync was already
implemented (dead code since M3 Bundle G, live as of this commit). It
performs InsertIfNotExists for AuditLog and UpsertAsync for SiteCalls
inside a BeginTransactionAsync. The handler is idempotent on EventId,
so any duplicate arrivals from concurrent push + reconciliation are
silent no-ops.

Composition root: AkkaHostedService now resolves IOperationTrackingStore
via GetService<>() (site-only) and threads it through the actor's
Props.Create.

Tests added (+3 in SiteAuditTelemetryActorTests):
- Cached rows route through the new transport, not the audit-only drain.
- Orphan cached row (no tracking match) is logged + skipped, drain
  doesn't crash.
- Ordinary audit rows still flow through the audit-only drain unchanged.

Test updated (not counted in the +3):
- ParentExecutionIdCorrelationTests now unions both queues to assert
  all expected Kinds remain covered after the partition.

Build clean; AuditLog.Tests 250/251 (the 1 fail is the pre-existing
date-sensitive PartitionPurgeTests integration flake explicitly accepted
across the session); SiteRuntime.Tests 302/302.

README regenerated: 0 pending of 481 total.

Session-final totals: 136 of 136 originally-open Theme findings closed
across 11 commits (10 themed batches + this architectural close).
Joseph Doherty
2026-05-28 09:08:43 -04:00
parent 11950b0a8e
commit c1fe1c4f83
8 changed files with 698 additions and 34 deletions
+69 -4
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 1 |
| Open findings | 0 |
## Summary
@@ -65,7 +65,7 @@ chain doesn't reject a central composition root that mistakenly calls the site b
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.AuditLog/Site/Telemetry/ISiteStreamAuditClient.cs:45`, `src/ScadaLink.AuditLog/Site/Telemetry/ClusterClientSiteAuditClient.cs:86`, `src/ScadaLink.AuditLog/Central/AuditLogIngestActor.cs:198` |
**Description**
@@ -101,9 +101,74 @@ unreachable `OnCachedTelemetryAsync` dual-write code (after confirming the
`AuditLogIngestActorCombinedTelemetryTests` integration tests exercise it via direct
actor injection only).
**Resolution**
**Resolution (2026-05-28):**
_Unresolved._
Wired the combined-telemetry transport end-to-end via recommendation (a). The
previously-unreachable `IngestCachedTelemetryAsync` client path now carries
cached-call lifecycle rows from the site SQLite hot-path through to the central
`AuditLogIngestActor.OnCachedTelemetryAsync` dual-write transaction. Changes:
- **`ISiteAuditQueue`** (`src/ScadaLink.Commons/Interfaces/Services/ISiteAuditQueue.cs`):
added `ReadPendingCachedTelemetryAsync(int, CancellationToken)` returning
rows in `AuditForwardState.Pending` whose `Kind` is one of `CachedSubmit`,
`ApiCallCached`, `DbWriteCached`, `CachedResolve`. Updated `ReadPendingAsync`
XML doc to call out the partition.
- **`SqliteAuditWriter`** (`src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs`):
implemented `ReadPendingCachedTelemetryAsync` with a `Kind IN (...)` filter
reusing the existing `_readConnection` / `_readLock` decoupling; modified
`ReadPendingAsync` to add the symmetric `Kind NOT IN (...)` predicate so the
audit-only drain no longer double-emits cached rows.
- **`SiteAuditTelemetryActor`** (`src/ScadaLink.AuditLog/Site/Telemetry/SiteAuditTelemetryActor.cs`):
added an optional `IOperationTrackingStore?` constructor parameter, a sibling
`CachedDrain` self-tick message, and an `OnCachedDrainAsync` handler running
in parallel with the existing audit-only drain. The cached-drain reads the
partitioned audit rows, joins each with the matching tracking-store
snapshot (looked up by `TrackedOperationId` via `CorrelationId`), builds a
`CachedTelemetryBatch`, pushes via `IngestCachedTelemetryAsync`, and marks
ack'd EventIds Forwarded. Orphan rows (no matching tracking snapshot, or a
thrown tracking-store call) are logged + skipped so the bad row never
blocks the rest of the batch; rows stay Pending and reconciliation /
retention handles them. The lifecycle CTS (AuditLog-010) gates both drains
uniformly.
- **`AkkaHostedService`** (`src/ScadaLink.Host/Actors/AkkaHostedService.cs`):
resolves `IOperationTrackingStore` via `GetService` (site-only registration)
and threads it through the actor's `Props.Create`. Central composition
roots and tests that don't register the tracking store get the legacy
audit-only behaviour — the cached scheduler is never armed.
- **Tests** (`tests/ScadaLink.AuditLog.Tests/Site/Telemetry/SiteAuditTelemetryActorTests.cs`):
added three regression tests asserting (1) cached rows route through
`IngestCachedTelemetryAsync` and NOT `IngestAuditEventsAsync`, (2) an
orphan row with no tracking snapshot is logged + skipped without crashing
the drain, (3) the audit-only drain still flows when the cached drain is
disabled (null tracking store). Updated `WaitForSiteRowsPersistedAsync` in
`ParentExecutionIdCorrelationTests` to union `ReadPendingCachedTelemetryAsync`
into the durability check; its assertion over `ReadPendingAsync(256)` and
`ReadForwardedAsync(256)` previously missed the cached kinds after the
partition change.
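The partition and join described in the changes above can be sketched roughly as follows. This is a minimal in-memory illustration, not the production code: `AuditRow`, `Snapshot`, `CachedPacket`, and `CachedDrainSketch` are hypothetical stand-ins for the real audit-row, tracking-snapshot, and `CachedTelemetryBatch` types, and a dictionary keyed by the correlation id stands in for `IOperationTrackingStore`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Usage: one joinable cached row, one orphan, one ordinary audit row.
var opId = Guid.NewGuid();
var rows = new List<AuditRow>
{
    new(Guid.NewGuid(), Kind.CachedSubmit, opId),   // has a snapshot, joins
    new(Guid.NewGuid(), Kind.ApiCallCached, null),  // missing CorrelationId, orphan
    new(Guid.NewGuid(), Kind.ApiCall, null),        // not a cached kind, ignored here
};
var store = new Dictionary<Guid, Snapshot> { [opId] = new Snapshot(opId, "Submitted") };

var (batch, orphans) = CachedDrainSketch.BuildBatch(rows, store);
Console.WriteLine($"{batch.Count} joined, {orphans.Count} orphaned");
// prints "1 joined, 1 orphaned"

// Hypothetical, simplified types (assumptions, not the real schema).
enum Kind { ApiCall, CachedSubmit, ApiCallCached, DbWriteCached, CachedResolve }
record AuditRow(Guid EventId, Kind Kind, Guid? CorrelationId);
record Snapshot(Guid TrackedOperationId, string Status);
record CachedPacket(AuditRow Audit, Snapshot Operational);

static class CachedDrainSketch
{
    // The four cached kinds; every other kind stays on the audit-only drain.
    static readonly HashSet<Kind> CachedKinds = new()
    {
        Kind.CachedSubmit, Kind.ApiCallCached, Kind.DbWriteCached, Kind.CachedResolve
    };

    public static bool IsCached(AuditRow row) => CachedKinds.Contains(row.Kind);

    // Joins each cached row with its snapshot via CorrelationId.
    // Rows with no CorrelationId or no snapshot are collected as orphans
    // and left out of the batch, so one bad row never blocks the rest.
    public static (List<CachedPacket> Batch, List<AuditRow> Orphans) BuildBatch(
        IEnumerable<AuditRow> pending,
        IReadOnlyDictionary<Guid, Snapshot> trackingStore)
    {
        var batch = new List<CachedPacket>();
        var orphans = new List<AuditRow>();
        foreach (var row in pending.Where(IsCached))
        {
            if (row.CorrelationId is Guid id && trackingStore.TryGetValue(id, out var snap))
                batch.Add(new CachedPacket(row, snap));
            else
                orphans.Add(row); // the real handler logs a warning and skips
        }
        return (batch, orphans);
    }
}
```

The sketch omits the push, the ack handling, and the thrown-lookup case; it only shows why a row is either joined or skipped, and why the audit-only drain needs the complementary filter to avoid emitting the same row twice.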
**Design notes / caveats.**
- *Operational state at emission time is the latest tracking row, not the
per-event status.* The original spec described one combined packet per
lifecycle event, but the production wiring keeps the existing
`CachedCallTelemetryForwarder` dual-write (audit + tracking) and uses the
drain as a join. Central's `SiteCalls` upsert is monotonic so this is
consistent with the broader design — the audit row preserves per-event
granularity, the SiteCalls mirror reflects "most recent known" state.
- *Test-only `CombinedTelemetryDispatcher` wire push is now redundant but
harmless.* The dispatcher's manual `IngestCachedTelemetryAsync` call in
`CombinedTelemetryHarness` / `ParentExecutionIdCorrelationTests` still
executes; central's idempotent `InsertIfNotExistsAsync` swallows the
duplicate so it's a no-op. Removing it is a separate clean-up.
- *Per-actor cancellation gates both drains.* The lifecycle CTS (AuditLog-010)
is shared so `PostStop` cancels in-flight cached lookups + pushes at the
same instant as audit-only drains.
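To make the first two caveats concrete, here is a small sketch of why a duplicate push is harmless and why the mirror only ever moves forward. It is an in-memory approximation under stated assumptions: `CentralMirrorSketch` is hypothetical, an integer `version` stands in for whatever ordering the real `SiteCalls` upsert uses, and the real handler performs both writes inside one SQL transaction, which this sketch does not model.

```csharp
using System;
using System.Collections.Generic;

var central = new CentralMirrorSketch();
var eventId = Guid.NewGuid();
var opId = Guid.NewGuid();

central.Ingest(eventId, opId, version: 2);
central.Ingest(eventId, opId, version: 2);         // duplicate arrival, changes nothing
central.Ingest(Guid.NewGuid(), opId, version: 1);  // new audit event carrying older state

Console.WriteLine($"{central.AuditCount} audit rows, site call at version {central.VersionOf(opId)}");
// prints "2 audit rows, site call at version 2"

// Hypothetical stand-in for the central AuditLog + SiteCalls tables.
sealed class CentralMirrorSketch
{
    private readonly HashSet<Guid> _auditLog = new();
    private readonly Dictionary<Guid, int> _siteCalls = new();

    public int AuditCount => _auditLog.Count;
    public int VersionOf(Guid trackedOperationId) => _siteCalls[trackedOperationId];

    public void Ingest(Guid eventId, Guid trackedOperationId, int version)
    {
        // Insert-if-not-exists analogue: adding a known EventId is a no-op.
        _auditLog.Add(eventId);

        // Monotonic upsert analogue (assumed ordering): never regress to older state.
        if (!_siteCalls.TryGetValue(trackedOperationId, out var current) || version > current)
            _siteCalls[trackedOperationId] = version;
    }
}
```

Every distinct audit event is kept, which preserves per-event granularity, while the operational mirror holds only the most recent known state.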
Build (`dotnet build ScadaLink.slnx`): 0 warnings, 0 errors.
Tests (`dotnet test tests/ScadaLink.AuditLog.Tests`): 250 passed, 1 failed
(`PartitionPurgeTests.EndToEnd_OldestPartition_PurgedViaActor_NewerKept`, a
pre-existing MS-SQL date-sensitive flake, called out in the prompt as
acceptable). `dotnet test tests/ScadaLink.SiteRuntime.Tests`: all 302
passed.
### AuditLog-002 — `SupervisorStrategy` comments claim Resume semantics but code returns the default Restart decider
+5 -7
@@ -41,15 +41,15 @@ module file and counted in **Total**.
|----------|---------------|
| Critical | 0 |
| High | 0 |
| Medium | 1 |
| Medium | 0 |
| Low | 0 |
| **Total** | **1** |
| **Total** | **0** |
## Module Status
| Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total |
|--------|---------------|--------|----------------|------|-------|
| [AuditLog](AuditLog/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/0 | 1 | 11 |
| [AuditLog](AuditLog/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 11 |
| [CLI](CLI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 23 |
| [CentralUI](CentralUI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 33 |
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 14 |
@@ -88,11 +88,9 @@ _None open._
_None open._
### Medium (1)
### Medium (0)
| ID | Module | Title |
|----|--------|-------|
| AuditLog-001 | [AuditLog](AuditLog/findings.md) | Combined-telemetry transport is plumbed end-to-end but never invoked in production |
_None open._
### Low (0)