The snapshot's per-site stalled latch now lives on the snapshot itself
and is fed by SiteAuditTelemetryStalledTracker via ApplyStalled, removing
the chain that required ActorSystem at DI composition time. The tracker
is now constructed by AkkaHostedService once ActorSystem.Create returns,
with a lock-guarded auxiliary-disposable list so concurrent host
start/stop in tests cannot race the enumeration.
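A minimal sketch of the lock-guarded disposable list, written in Python
purely for illustration (the real code is .NET). The class and method
names are hypothetical. The idea shown is to swap the list out under the
lock and dispose the snapshot outside it, so a concurrent add cannot
mutate the list being enumerated.

```python
import threading


class AuxiliaryDisposables:
    """Hypothetical holder for disposables created after host start."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = []

    def add(self, disposable):
        with self._lock:
            self._items.append(disposable)

    def dispose_all(self):
        # Take ownership of the current list under the lock, then dispose
        # the snapshot outside it; later adds go to the fresh list.
        with self._lock:
            items, self._items = self._items, []
        for item in reversed(items):
            item.close()
```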
Central singleton (M6-T4 Bundle C) that drives the daily AuditLog partition
purge. On a configurable timer (default 24 hours) the actor:
1. Queries IAuditLogRepository.GetPartitionBoundariesOlderThanAsync for
monthly boundaries whose latest OccurredAtUtc is older than
DateTime.UtcNow - AuditLogOptions.RetentionDays.
2. For each eligible boundary calls SwitchOutPartitionAsync, which runs
the drop-and-rebuild dance around UX_AuditLog_EventId.
3. Publishes AuditLogPurgedEvent(boundary, rowsDeleted, durationMs) on
the actor-system EventStream so the Bundle E central health collector
and ops surfaces can subscribe without coupling to this actor.
Co-changes:
* SwitchOutPartitionAsync returns long (rows deleted) — sampled BEFORE the
switch via COUNT_BIG over the per-partition filter so the count
reflects what the switch removed, not a post-purge scan of a table that
no longer exists. All stub implementations updated.
* AuditLogPurgeOptions: IntervalHours (default 24), IntervalOverride for
tests, Interval property resolving either.
* AuditLogPurgedEvent: record with MonthBoundary, RowsDeleted, DurationMs.
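The Interval resolution described above can be sketched as follows.
This is an illustrative Python rendering, not the project's C# type;
the field names are snake_case stand-ins for IntervalHours and
IntervalOverride.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class AuditLogPurgeOptions:
    interval_hours: int = 24                        # production default
    interval_override: Optional[timedelta] = None   # set by tests only

    @property
    def interval(self) -> timedelta:
        # The override wins when present; otherwise fall back to hours.
        if self.interval_override is not None:
            return self.interval_override
        return timedelta(hours=self.interval_hours)
```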
Behavior:
* Continue-on-error per boundary — one partition that throws does NOT
abandon the rest of the tick.
* DI scope opened per tick (IAuditLogRepository is a SCOPED EF Core
service); mirrors SiteAuditReconciliationActor and AuditLogIngestActor.
* SupervisorStrategy Resume keeps the singleton alive across leaked
exceptions.
* EventStream capture BEFORE the first await — Context is unsafe after
await in async receive handlers (same pattern as Sender-capture in
AuditLogIngestActor.OnIngestAsync).
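A rough sketch of one purge tick with continue-on-error, in Python for
illustration only. The repo object, its method names, and the publish
callable are hypothetical stand-ins for IAuditLogRepository and the
EventStream; the actor plumbing is omitted.

```python
import time
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AuditLogPurgedEvent:
    month_boundary: datetime
    rows_deleted: int
    duration_ms: int


def run_purge_tick(repo, publish, retention_days, now=None):
    """Purge every eligible boundary; return the (boundary, error) pairs."""
    now = now or datetime.now(timezone.utc)
    threshold = now - timedelta(days=retention_days)
    failures = []
    for boundary in repo.get_partition_boundaries_older_than(threshold):
        started = time.monotonic()
        try:
            rows = repo.switch_out_partition(boundary)
        except Exception as exc:
            # One failing partition must not abandon the rest of the tick.
            failures.append((boundary, exc))
            continue
        duration_ms = int((time.monotonic() - started) * 1000)
        publish(AuditLogPurgedEvent(boundary, rows, duration_ms))
    return failures
```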
Tests:
* Tick_Fires_OnDailyInterval — visible timer side effect.
* Tick_OldPartitions_SwitchedOut — both seeded boundaries purged.
* Tick_NewerPartitions_Untouched — empty enumerator → no switches.
* Tick_PublishesPurgedEvent_WithRowCount — AuditLogPurgedEvent carries
RowsDeleted and DurationMs.
* Tick_SwitchThrows_OtherPartitionsStillProcessed — continue-on-error.
* Threshold_UsesAuditLogOptionsRetentionDays — non-default 30-day window
computed from UtcNow - RetentionDays.
* EndToEnd_RealPartition_RowsRemoved_PurgedEventPublished — TestKit +
MsSqlMigrationFixture: real partitioned table, Jan-2026 row purged,
Apr-2026 row kept, AuditLogPurgedEvent observed via probe.
Replaces M1's NotSupportedException stub with the production DROP-INDEX
→ CREATE-staging → SWITCH PARTITION → DROP-staging → CREATE-INDEX dance
documented in alog.md §4. UX_AuditLog_EventId is intentionally non-aligned
with ps_AuditLog_Month so single-column EventId uniqueness can be enforced
cheaply for InsertIfNotExistsAsync; SQL Server rejects ALTER TABLE SWITCH
while a non-aligned unique index is present, so the implementation drops
it, switches the partition data into a GUID-suffixed staging table on
[PRIMARY], drops staging (discarding the rows), and rebuilds the unique
index — all inside an explicit transaction with a CATCH that guarantees
the unique index is rebuilt regardless of failure point.
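The control flow of the dance can be sketched as below. This is a Python
illustration with hypothetical step names and callables, not the actual
T-SQL. Whether the real CATCH relies on the rollback alone or re-creates
the index explicitly is not stated here, so the existence check is an
assumption.

```python
STEPS = (
    "drop_index",        # DROP INDEX UX_AuditLog_EventId
    "create_staging",    # GUID-suffixed staging table on [PRIMARY]
    "switch_partition",  # ALTER TABLE ... SWITCH PARTITION
    "drop_staging",      # discards the switched-out rows
    "create_index",      # rebuild the unique index
)


def switch_out_partition(run_step, index_exists):
    """run_step(name) executes one step; index_exists() checks the catalog."""
    run_step("begin_transaction")
    try:
        for step in STEPS:
            run_step(step)
        run_step("commit")
    except Exception:
        run_step("rollback")
        # Guarantee: the unique index is present whatever step failed.
        if not index_exists():
            run_step("create_index")
        raise
```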
Also adds GetPartitionBoundariesOlderThanAsync to IAuditLogRepository: a
CROSS APPLY over sys.partition_range_values + per-partition MAX(OccurredAtUtc)
to enumerate retention-eligible months for the M6 purge actor (next commit).
Tests verify:
* Old partition's rows are removed; other months untouched
* UX_AuditLog_EventId is rebuilt after a successful switch
* InsertIfNotExistsAsync's first-write-wins idempotency still holds after switch
* On engineered SWITCH failure (inbound FK from a probe table), SqlException
propagates AND UX_AuditLog_EventId is still present (CATCH branch ran)
* GetPartitionBoundariesOlderThanAsync returns only boundaries whose partition's
MAX(OccurredAtUtc) is strictly older than the threshold; empty partitions
excluded
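The eligibility rule in the last bullet can be sketched as a pure
function. This is illustrative Python with a hypothetical input shape
(boundary mapped to that partition's MAX(OccurredAtUtc), or None when
the partition is empty), not the CROSS APPLY query itself.

```python
from datetime import datetime


def boundaries_older_than(latest_by_boundary: dict, threshold: datetime) -> list:
    """Return boundaries whose newest row is strictly older than threshold."""
    return sorted(
        boundary
        for boundary, latest in latest_by_boundary.items()
        # None means the partition holds no rows, so it is excluded.
        if latest is not None and latest < threshold
    )
```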
6 bundles: proto+site handler, reconciliation actor, purge actor with
drop-and-rebuild around UX index, partition maintenance, four health
metrics, integration tests. M5 realities baked in.
M6 head records M5 realities:
- IOptionsMonitor hot-reload pattern verified; M6 retention config can
reuse.
- AuditRedactionFailure counter site-only in M5; M6 wires central side.
- Filter integration is at 3 writer entry points; purge actor doesn't
emit so no filter integration needed.
- SwitchOutPartitionAsync drop-and-rebuild dance required (M1 reality
+ M6-T4 already documents it).
- M6 should land the real ISiteStreamAuditClient (Option A) so push
telemetry leaves NoOp behind.
M5 ships the payload filter pipeline. IAuditPayloadFilter runs between
event construction and writer call:
- Stage 1: HTTP header redaction (Authorization/Cookie/Set-Cookie/X-Api-Key
default list from M1-T9; case-insensitive name match against JSON
{headers,body} shape).
- Stage 2: Body regex redaction (global + per-target). Patterns compiled
at startup with 100ms budget; runtime 50ms timeout guard against
catastrophic backtracking. Over-redact on exception + increment counter.
- Stage 3: SQL parameter redaction (Channel=DbOutbound, per-connection
opt-in via PerTargetOverrides[connection].RedactSqlParamsMatching).
- Stage 4: UTF-8 boundary-safe truncation. Default cap 8 KB; error cap
64 KB on Status NOT IN (Delivered/Submitted/Forwarded). PayloadTruncated
set to true when applied.
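Stage 4 can be sketched as follows. This is an illustrative Python
version, assuming status names are plain strings; the function names are
hypothetical. The cut point steps back over UTF-8 continuation bytes so
a multi-byte character is never split.

```python
DEFAULT_CAP_BYTES = 8 * 1024
ERROR_CAP_BYTES = 64 * 1024
SUCCESS_STATUSES = {"Delivered", "Submitted", "Forwarded"}


def cap_for_status(status):
    # Rows outside the success statuses get the larger error cap.
    return DEFAULT_CAP_BYTES if status in SUCCESS_STATUSES else ERROR_CAP_BYTES


def truncate_utf8(text, cap_bytes):
    """Return (possibly truncated text, payload_truncated flag)."""
    data = text.encode("utf-8")
    if len(data) <= cap_bytes:
        return text, False
    cut = cap_bytes
    # data[cut] is the first dropped byte; if it is a continuation byte
    # (0b10xxxxxx) the cut would split a character, so move left.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8"), True
```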
Filter wired into all three writer entry points:
- FallbackAuditWriter (site chain) — filter before SqliteAuditWriter.
- CentralAuditWriter (central direct-write) — filter before
IAuditLogRepository.InsertIfNotExistsAsync (NotificationOutbox dispatcher,
AuditWriteMiddleware).
- AuditLogIngestActor — filter before dual-write transaction.
Health metric SiteAuditRedactionFailureCounter wired through the existing
M2 Bundle G + M4 Bundle B health-bridge pattern; central-side counter
deferred to M6 (the milestone that ships the full central health surface).
Hot-reload via IOptionsMonitor + per-call CurrentValue read. Regex cache
keyed by pattern string so changing the config naturally invalidates old
patterns.
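A sketch of the pattern-keyed cache and the over-redact-on-failure rule,
in Python for illustration. The names and the placeholder text are
hypothetical. Python's re module has no match timeout, so the 50ms
runtime guard is not modelled; only a compile failure triggers the
fallback here.

```python
import re


class RegexCache:
    """Compiled patterns keyed by their source string."""

    def __init__(self):
        self._compiled = {}

    def get(self, pattern):
        compiled = self._compiled.get(pattern)
        if compiled is None:
            compiled = re.compile(pattern)
            self._compiled[pattern] = compiled
        return compiled


def redact_body(body, patterns, cache, on_failure, placeholder="<redacted>"):
    # patterns is read from the current config on every call, so an edited
    # pattern is simply a new cache key and the old entry goes unused.
    try:
        for pattern in patterns:
            body = cache.get(pattern).sub(placeholder, body)
        return body
    except re.error:
        on_failure()        # increment the redaction-failure counter
        return placeholder  # over-redact rather than leak the payload
```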
Shipped: 11 commits, ~49 net new tests across AuditLog.Tests,
HealthMonitoring.Tests, PerformanceTests. Full solution 24/24 test projects
green. infra/* untouched on any branch commit.
Bundle C task M5-T7 — surface DefaultAuditPayloadFilter redactor
over-redactions as a Site Health metric so a misconfigured /
catastrophic regex shows up on /monitoring/health rather than
disappearing into a NoOp sink.
- SiteHealthReport: new 'AuditRedactionFailure' int field
(defaulted to 0 for back-compat with existing producers/tests).
- ISiteHealthCollector / SiteHealthCollector:
new IncrementAuditRedactionFailure() — per-interval atomic
counter with Interlocked, reset on CollectReport, mirroring
the M2 Bundle G SiteAuditWriteFailures pattern.
- HealthMetricsAuditRedactionFailureCounter: new bridge in
ScadaLink.AuditLog.Site that forwards IAuditRedactionFailureCounter
increments to ISiteHealthCollector — mirrors
HealthMetricsAuditWriteFailureCounter one-for-one.
- AddAuditLogHealthMetricsBridge: now ALSO Replaces the
NoOpAuditRedactionFailureCounter binding with the health-metrics
bridge, so a single AddAuditLogHealthMetricsBridge() call wires
both the M2 Bundle G write-failure counter and the M5 Bundle C
redaction-failure counter into the health report.
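The per-interval counter shape can be sketched as below. This Python
version uses a lock where the .NET code uses Interlocked; the class is a
simplified stand-in and the report is reduced to a dict.

```python
import threading


class SiteHealthCollector:
    def __init__(self):
        self._lock = threading.Lock()
        self._audit_redaction_failures = 0

    def increment_audit_redaction_failure(self):
        with self._lock:
            self._audit_redaction_failures += 1

    def collect_report(self):
        # Read and reset in one step so no increment is lost or counted
        # in two consecutive reports.
        with self._lock:
            count = self._audit_redaction_failures
            self._audit_redaction_failures = 0
        return {"AuditRedactionFailure": count}
```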
Site-side only for M5: the filter also runs on CentralAuditWriter
and AuditLogIngestActor (where the redaction-failure counter keeps the
NoOp default), but a central-side health-metric surface for
AuditRedactionFailure is deferred to M6 alongside the rest of the
central health collector work.
Tests:
- AuditRedactionFailureMetricTests (HealthMonitoring) covers the
SiteHealthCollector increment/report/reset shape (3 tests).
- HealthMetricsAuditRedactionFailureCounterTests (AuditLog) covers
the AuditLog → HealthMonitoring bridge (3 tests).
- Existing CountCapturingHealthCollector stub in
DeploymentManagerRedeployTests extended with the new no-op
interface method.
Verified: dotnet build clean, all 24 test projects green
(the only failure on the first ScadaLink.SiteRuntime.Tests run was the
known-flaky InstanceActorChildAttributeRaceTests; it passes on re-run
in isolation and in the full suite, and is unrelated to these changes).
Bundle C task M5-T6 — plugs the IAuditPayloadFilter singleton into the
three audit writer entry points so every event is truncated + redacted
before persistence, regardless of which path it took to disk:
- FallbackAuditWriter (site hot path): filter runs before the primary
SQLite write AND the ring-buffer enqueue, so a recovery drain replays
rows that are already capped/redacted.
- CentralAuditWriter (central direct-write): filter runs before the
per-call IAuditLogRepository.InsertIfNotExistsAsync.
- AuditLogIngestActor (site→central telemetry):
- OnIngestAsync resolves the filter from the per-message scope and
applies it to each row before IngestedAtUtc stamping.
- OnCachedTelemetryAsync (M3 dual-write) applies the filter to the
audit half of every CachedTelemetryEntry before the audit-insert
+ site-call-upsert transaction.
Filter parameter is optional (nullable) on each constructor so the
existing test composition roots that don't pass one keep working unchanged
— production DI wiring in AddAuditLog always passes the real filter
through. ICentralAuditWriter registration switched from the open-ctor
form to a factory so the filter flows through it.
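The optional-filter constructor shape can be sketched as follows. This
is illustrative Python with hypothetical callables standing in for the
repository insert and the filter; it also shows the writer swallowing
its own failures.

```python
import logging

log = logging.getLogger("audit-sketch")


class CentralAuditWriter:
    def __init__(self, insert_if_not_exists, payload_filter=None):
        self._insert = insert_if_not_exists
        # None keeps older test composition roots working unchanged.
        self._filter = payload_filter

    def write(self, event):
        try:
            if self._filter is not None:
                event = self._filter(event)
            self._insert(event)
        except Exception:
            # Audit failure must never surface to the caller.
            log.exception("central audit write failed")
```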
Tests: FilterIntegrationTests covers all three writer paths end-to-end
(4 tests). Full ScadaLink.AuditLog.Tests suite: 146 passed, 0 failed,
0 skipped.
4 bundles: filter+truncation, redactors (header/body/SQL-param), wire
into all emission paths + health metric, config+perf+safety-net.
Vocabulary translation locked: error-row cap (64 KB) on Status NOT IN
(Delivered, Submitted, Forwarded). Filter integration point in each
writer (FallbackAuditWriter, CentralAuditWriter, AuditLogIngestActor)
BEFORE storage call.
M5 head records M4 realities:
- AuditingDbConnection/Command/DataReader decorators need filter plug-in
at WriteAsync emission point.
- CentralAuditWriter + FallbackAuditWriter are both filter integration
points for the direct-write + chained-write paths.
- InboundAPI middleware RequestSummary populated, ResponseSummary=null
pending response-body buffering decision in M5.
- UseWhen(/api/) path-scoped middleware gives natural per-target
redaction hook.
- Error-row cap raised on Status IN (Failed, Parked, Discarded,
Attempted, Skipped) per M1 vocab reconciliation.
M4 closes the script-trust-boundary emission gaps:
- Sync DB writes/reads via AuditingDbConnection decorator (Channel=DbOutbound,
Kind=DbWrite; Extra carries op + rowsAffected/rowsReturned).
- Notification Outbox dispatcher: NotifyDeliver(Attempted) per attempt;
NotifyDeliver(Delivered/Parked/Discarded) on terminal. Direct-write via
new ICentralAuditWriter (CentralAuditWriter implementation wraps
IAuditLogRepository.InsertIfNotExistsAsync with scope-per-call).
- Site Notify.To().Send() emits NotifySend(Submitted) via the existing
IAuditWriter site path; correlation via NotificationId.
- Inbound API AuditWriteMiddleware emits InboundRequest on success,
InboundAuthFailure on 401/403; Actor = API key NAME (never material);
registered via UseWhen(/api/) AFTER UseAuthentication/UseAuthorization;
audit failure NEVER changes HTTP response.
Audit-write-failure-never-aborts-action proven end-to-end across all five
new code paths via AuditWriteFailureSafetyTests (broken ICentralAuditWriter
+ broken IAuditWriter scenarios all green).
Shipped: 12 commits, ~62 net new tests across SiteRuntime / NotificationOutbox
/ AuditLog / InboundAPI tests. Full solution 2763 tests passing. No
regressions. infra/* untouched on any branch commit.
Audit Log #23 M4 Bundle C — Task C1: every script-initiated
Notify.To(list).Send(...) now emits exactly one
Notification/NotifySend audit row via the IAuditWriter wired through
ScriptRuntimeContext. The row carries Status=Submitted,
Target=list name, RequestSummary={subject,body} JSON (M5 will redact),
CorrelationId=NotificationId (parsed as Guid), provenance from context,
ForwardState=Pending.
Emission is best-effort per alog.md §7: a thrown audit writer is logged
and swallowed inside the helper; the original NotificationId still flows
back to the script and the underlying S&F enqueue still happened.
Mirrors the M2 Bundle F ExternalSystem.Call wrapper pattern.
Tests: 7 new tests in NotifySendAuditEmissionTests covering submitted-
status, list-name target, request-summary JSON shape, writer-throws
fail-safe, provenance, NotificationId/CorrelationId round-trip, and the
null-writer degrade path.
M4 Bundle B (B3) — NotificationOutboxActor emits a second NotifyDeliver
audit row carrying the terminal AuditStatus whenever a notification
transitions to a terminal state (Delivered, Parked, Discarded).
- Dispatcher: after the B2 Attempted row, a Delivered or Parked row is
emitted when the post-outcome status is terminal. Discarded is never
produced by the dispatcher — only by the manual discard path.
- Missing-adapter park: now emits both Attempted and terminal Parked,
both carrying the same explanatory error.
- Manual discard (DiscardAsync): after the row update, emits a terminal
Discarded NotifyDeliver row with no error message (operator-driven
cancellation, not a delivery error).
- MapNotificationStatusToAuditStatus + IsTerminal helpers added; terminal
emission shares BuildNotifyDeliverEvent with the B2 Attempted path so
the two rows carry identical correlation/provenance fields.
Audit failure NEVER aborts the user-facing action: every emission is
wrapped in try/catch (defensive — the CentralAuditWriter itself swallows).
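The attempt-plus-terminal rule can be sketched as below. This is
illustrative Python; statuses are plain strings and the non-terminal
name used in the usage note ("Retrying") is hypothetical.

```python
TERMINAL_STATUSES = {"Delivered", "Parked", "Discarded"}


def is_terminal(status):
    return status in TERMINAL_STATUSES


def audit_statuses_for_attempt(post_outcome_status):
    """Audit rows emitted by the dispatcher for one delivery attempt."""
    rows = ["Attempted"]            # B2: always one row per attempt
    if is_terminal(post_outcome_status):
        rows.append(post_outcome_status)   # B3: second, terminal row
    return rows


def audit_statuses_for_manual_discard():
    # Operator-driven cancellation: terminal row only, no attempt row.
    return ["Discarded"]
```

For example, audit_statuses_for_attempt("Retrying") yields only the
Attempted row, while a Delivered outcome yields both rows.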
M4 Bundle B (B2) — NotificationOutboxActor's dispatcher loop emits a single
AuditChannel.Notification / AuditKind.NotifyDeliver row with AuditStatus.Attempted
for every delivery attempt (success, transient failure, permanent failure,
and the missing-adapter park).
- BuildNotifyDeliverEvent helper populates correlation id (parsed from the
string NotificationId — sites generate Guid.NewGuid().ToString("N"),
non-Guid ids fall through as null), list-name target, source site/instance/script
provenance, and Actor=null (central dispatch has no authenticated end-user).
- Attempt duration is measured around the adapter call and recorded as
DurationMs so KPIs can compute per-attempt latency.
- Emission is fire-and-forget (the writer swallows internally) and wrapped
in try/catch — audit failure NEVER aborts the user-facing dispatch.
Terminal-state emission lands separately in B3.
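The correlation-id parsing rule can be sketched as follows. This Python
analogue uses uuid.UUID in place of the .NET Guid parser; the function
name is hypothetical. Python's uuid4().hex is the same 32-hex-digit
shape as Guid.NewGuid().ToString("N").

```python
import uuid


def parse_correlation_id(notification_id):
    """Return a UUID for Guid-shaped ids, or None for anything else."""
    try:
        return uuid.UUID(notification_id)
    except (ValueError, AttributeError, TypeError):
        # Non-Guid ids fall through as null rather than failing the row.
        return None
```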
M4 Bundle B (B1) — add the central-only ICentralAuditWriter implementation
and inject it into NotificationOutboxActor so subsequent tasks (B2/B3) can
route attempt + terminal lifecycle events through the direct-write audit path.
- CentralAuditWriter: thin wrapper around IAuditLogRepository.InsertIfNotExistsAsync;
scope-per-call (matches AuditLogIngestActor / NotificationOutboxActor pattern);
stamps IngestedAtUtc; swallows all internal failures (alog.md §13).
- Registered as a singleton in AddAuditLog.
- NotificationOutboxActor ctor takes ICentralAuditWriter (validated non-null).
- Host wiring resolves the writer once from the root provider and passes it
into the singleton's Props.Create call.
- Existing TestKit fixtures updated with a NoOpCentralAuditWriter helper so
tests that don't exercise audit emission still compile and pass.
Audit Log #23 — M4 Bundle A (Tasks A1+A2): every script-initiated
synchronous DB call routed through Database.Connection(name) now emits
exactly one DbOutbound/DbWrite audit row.
Implementation — three thin ADO.NET decorators in
src/ScadaLink.SiteRuntime/Scripts/:
- AuditingDbConnection: wraps the gateway-returned DbConnection so
CreateDbCommand() hands the script an AuditingDbCommand. All other
ADO.NET surface forwards unchanged.
- AuditingDbCommand: intercepts ExecuteNonQuery / ExecuteScalar /
ExecuteReader (sync + async). On terminal:
Channel = DbOutbound, Kind = DbWrite, Status = Delivered|Failed,
Extra = {"op":"write","rowsAffected":N} (Execute*),
{"op":"read","rowsReturned":N} (ExecuteReader),
RequestSummary = JSON of SQL + parameter values (default capture;
redaction in M5),
Target = "<connection>.<first 60 chars of SQL>",
DurationMs captured via Stopwatch,
Provenance from ScriptRuntimeContext (SourceSiteId,
SourceInstanceId, SourceScript).
- AuditingDbDataReader: counts rows on Read/ReadAsync and fires the
audit emission exactly once on Close/CloseAsync/Dispose.
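The reader decorator's count-then-emit-once behaviour can be sketched as
below. This is illustrative Python over a plain iterable, not the
ADO.NET DbDataReader surface; the class and method names are
hypothetical.

```python
_END = object()  # sentinel so rows that are None still count


class AuditingReader:
    def __init__(self, rows, emit):
        self._rows = iter(rows)
        self._emit = emit
        self._count = 0
        self._emitted = False

    def read(self):
        """Advance one row; return False when the rows are exhausted."""
        row = next(self._rows, _END)
        if row is _END:
            return False
        self._count += 1
        return True

    def close(self):
        # Close and Dispose may both be called; emit on the first only.
        if self._emitted:
            return
        self._emitted = True
        self._emit({"op": "read", "rowsReturned": self._count})
```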
DatabaseHelper now takes an IAuditWriter; ScriptRuntimeContext.Database
threads through _auditWriter. When the writer is null (tests / minimal
hosts) Connection() returns the raw inner DbConnection unchanged.
Best-effort emission (alog.md §7): mirrors M2 Bundle F's 3-layer
fail-safe — build, write, continuation. Audit-build, audit-write, and
audit-continuation faults are logged + swallowed; the original ADO.NET
result (or original exception) flows back to the script untouched. The
SiteAuditWriteFailures counter increments automatically through the
existing FallbackAuditWriter (Bundle G).
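The best-effort rule can be sketched as below. This is a simplified
synchronous Python illustration with hypothetical names; the
continuation layer of the real three-layer fail-safe is not modelled.
Building and writing the audit event both sit inside one guarded block,
and the original result or exception always flows back.

```python
import logging
import time

log = logging.getLogger("audit-sketch")


def execute_with_audit(action, write_audit):
    started = time.monotonic()
    result, error = None, None
    try:
        result = action()
    except Exception as exc:
        error = exc
    try:
        # Build and write faults are both logged and swallowed.
        write_audit({
            "status": "Failed" if error is not None else "Delivered",
            "durationMs": int((time.monotonic() - started) * 1000),
            "errorMessage": str(error) if error is not None else None,
        })
    except Exception:
        log.exception("audit emission failed; continuing")
    if error is not None:
        raise error  # the script sees the original exception
    return result
```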
Tests — tests/ScadaLink.SiteRuntime.Tests/Scripts/DatabaseSyncEmissionTests.cs
(7 new, all passing):
1. Execute / INSERT success — one DbWrite row, op=write, rowsAffected=1.
2. ExecuteScalar success — one DbWrite row, op=write.
3. Execute throws — Status=Failed, ErrorMessage + ErrorDetail set.
4. ExecuteReader success — op=read, rowsReturned counts rows pulled.
5. AuditWriter throws — original ADO.NET rowsAffected returned, no
events captured, no exception propagates.
6. Provenance populated from context.
7. DurationMs recorded non-zero.
Tests use Microsoft.Data.Sqlite in-memory (already transitively
available via SiteRuntime). Total SiteRuntime test suite: 251 passing
(244 baseline + 7 new). Full solution test suite passes.
5 bundles: DB sync emissions, NotificationOutbox central, site Notify.Send,
Inbound API middleware, integration tests. M3-reality vocab baked in
(DbWrite/NotifyDeliver/NotifySend/InboundRequest/InboundAuthFailure).
M4 head now records M3 realities:
- Vocabulary translation table from pre-M1 spec strings to M1-aligned
enum values (DbWrite vs SyncWrite/SyncRead; NotifyDeliver vs
Notification.Attempt/Terminal; InboundRequest/InboundAuthFailure vs
ApiInbound.Completed; Failed vs PermanentFailure).
- Mapper consolidation: 4 DTO mappers exist; extract single helper
before M4 adds more channels.
- OnCachedTelemetryWithoutDualWriteAsync test-mode fallback may be
deprecated in M4.
- Site SQLite drain for OperationTrackingStore: only dual-write
transaction writes central today; plan drain if M4 needs in-flight
tracking visibility.
- SiteCallAuditActor wired but unused on M3 hot path; M4/M6 natural
first direct caller.
M3 ships the cached-call lifecycle: ExternalSystem.CachedCall and
Database.CachedWrite each produce 3-5 audit rows + 1 SiteCalls row
sharing the same TrackedOperationId. Site emits the combined packet
(AuditEvent + SiteCallOperational); central writes both rows in one
MS SQL transaction.
Inlines the minimum-viable Site Call Audit (#22) surface:
SiteCalls table + ISiteCallAuditRepository + SiteCallAuditActor.
Reconciliation, KPIs, central->site Retry/Discard relay deferred.
Shipped (23 commits, ~120 net new tests, 24/24 test projects green):
- TrackedOperationId strong type + OperationTrackingStore site-local
SQLite + Tracking.Status script API.
- CachedCallTelemetry combined operational+audit packet (additive per
Commons REQ-COM-5a — never renamed CachedOperationTelemetry).
- SiteCalls MS SQL table + monotonic upsert repository (operational
state, no partitioning) + migration.
- ScadaLink.SiteCallAudit new project + SiteCallAuditActor cluster
singleton.
- sitestream.proto extended with IngestCachedTelemetry RPC +
SiteCallOperationalDto + CachedTelemetryPacket/Batch.
- AuditLogIngestActor combined-telemetry handler with per-entry
BeginTransactionAsync; rollback on either-throw; per-entry try/catch
isolates failures; central singleton stays alive (Resume).
- ScriptRuntimeContext.ExternalSystem.CachedCall + Database.CachedWrite
wrappers emit CachedSubmit on enqueue + handle immediate-success path
(no S&F retry) with direct Attempted+CachedResolve emission.
- StoreAndForward observer hook (ICachedCallLifecycleObserver) +
CachedCallLifecycleBridge translates S&F outcomes to combined
telemetry; per-attempt rows carry Kind=ApiCallCached/DbWriteCached,
Status=Attempted (HttpStatus/ErrorMessage capture success/failure);
terminal carries Kind=CachedResolve, Status=Delivered/Failed/Parked/
Discarded.
- Component-level e2e via TestKit + MsSqlMigrationFixture +
DirectActorSiteStreamAuditClient extracted to shared Integration/
Infrastructure/ + CombinedTelemetryHarness/Dispatcher helpers.
- Health metric SiteAuditWriteFailures still wired (M2). Bridge from
ICachedCallTelemetryForwarder to AuditWriter chain.
Invariants honored: append-only AuditLog (writer role DENY UPDATE/DELETE
from M1); audit-failure-never-aborts-script (three-layer fail-safe
preserved); central singleton supervisor=Resume; idempotent at central
on EventId (M2 race-fix from Bundle A) + monotonic at central on
TrackedOperationId. infra/* never touched on any branch commit
(verified empty via 'git log main..feature/audit-log-m3-cached-operations -- infra/').
Site->central gRPC client still NoOpSiteStreamAuditClient in production
until M6; cached telemetry rows accumulate at site as Pending in
production.
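The two central invariants (idempotent on EventId, monotonic on
TrackedOperationId) can be sketched with in-memory dicts. This is
illustrative Python only. The status ranking below is an assumption
made for the sketch; the actual ordering used by the SiteCalls upsert is
not given here.

```python
# Assumed ordering: a later lifecycle stage has a higher rank.
STATUS_RANK = {
    "Submitted": 0, "Attempted": 1,
    "Delivered": 2, "Failed": 2, "Parked": 2, "Discarded": 2,
}


def insert_if_not_exists(audit_rows, event):
    """First write wins: a replayed EventId is ignored."""
    if event["event_id"] in audit_rows:
        return False
    audit_rows[event["event_id"]] = event
    return True


def upsert_monotonic(site_calls, update):
    """Apply an update unless it would move the row backwards."""
    key = update["tracked_operation_id"]
    current = site_calls.get(key)
    if current is not None:
        if STATUS_RANK[update["status"]] < STATUS_RANK[current["status"]]:
            return False  # stale or out-of-order packet
    site_calls[key] = update
    return True
```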
Bundle E left a gap in ExternalSystem.CachedCall: when the underlying HTTP
call succeeds immediately (WasBuffered=false), the store-and-forward retry
loop is never engaged and the ICachedCallLifecycleObserver hook never
fires. As a result Tracking.Status(id) would stay in Submitted forever and
the audit log would be missing the Attempted + CachedResolve pair the M3
contract requires.
Fix: capture the ExternalCallResult returned by IExternalSystemClient.
CachedCallAsync. When WasBuffered=false, emit the two missing telemetry
packets from the helper itself:
- ApiCallCached / Attempted (per-attempt mechanics row, HttpStatus +
ErrorMessage extracted via the same regex
the synchronous Call() audit row uses)
- CachedResolve / Delivered on Success, or
- CachedResolve / Failed on Success=false (immediate permanent
failure or transient failure without S&F).
The terminal CachedResolve row carries TerminalAtUtc so SiteCallAudit can
recognise the row as eligible for purge.
The WasBuffered=true path is unaffected — the S&F retry loop owns the
Attempted + Resolve emissions there via the CachedCallLifecycleBridge.
Database.CachedWrite is unaffected too because IDatabaseGateway.
CachedWriteAsync always enqueues into S&F (no immediate-success path).
Both new emissions are best-effort: a throwing forwarder is logged and
swallowed (alog.md §7) and each row is independently try/catch-wrapped so
a single fault cannot drop both halves of the terminal pair.
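The immediate-outcome emission can be sketched as follows. This is
illustrative Python with hypothetical parameter names; the forwarder and
logger are plain callables. Each row gets its own try/except so one
fault cannot drop both rows.

```python
def emit_immediate_outcome(was_buffered, success, forward, log_error):
    """Emit Attempted + CachedResolve when S&F was never engaged."""
    if was_buffered:
        # The S&F retry loop owns these emissions on the buffered path.
        return 0
    terminal = "Delivered" if success else "Failed"
    emitted = 0
    for kind, status in (("ApiCallCached", "Attempted"),
                         ("CachedResolve", terminal)):
        try:
            forward({"kind": kind, "status": status})
            emitted += 1
        except Exception as exc:
            log_error(exc)  # logged and swallowed; the next row still runs
    return emitted
```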
Tests in ExternalSystemCachedCallEmissionTests:
- CachedCall_ImmediateSuccess_EmitsAttemptedAndCachedResolve
- CachedCall_ImmediateFailure_EmitsAttemptedAndCachedResolveFailed
- CachedCall_BufferedPath_DoesNotEmitTerminalTelemetryFromHelper
Full suite: 244 SiteRuntime tests (3 new), 200 Host tests, all green.