The store-and-forward retry loop emits the per-attempt and terminal cached
audit rows (ApiCallCached/DbWriteCached Attempted, CachedResolve) via
CachedCallLifecycleBridge from a CachedCallAttemptContext, not from the
script context. ExecutionId (and SourceScript) were not threaded through the
S&F buffer, so those rows had ExecutionId = null and SourceScript = null.
Thread both, additively, from the cached-call enqueue path:
- StoreAndForwardMessage gains ExecutionId (Guid?) / SourceScript (string?).
- StoreAndForwardStorage adds nullable execution_id / source_script columns
via an idempotent PRAGMA-probed ALTER TABLE migration; rows persisted by
an older build read back null (back-compat).
- StoreAndForwardService.EnqueueAsync gains optional executionId /
sourceScript params, stamped onto the buffered message and surfaced on the
CachedCallAttemptContext built in the retry loop.
- CachedCallAttemptContext gains ExecutionId / SourceScript.
- CachedCallLifecycleBridge.BuildPacket sets AuditEvent.ExecutionId and
AuditEvent.SourceScript from the context (replacing the hard-coded
SourceScript = null and its now-stale comment).
- IExternalSystemClient.CachedCallAsync / IDatabaseGateway.CachedWriteAsync
gain optional executionId / sourceScript params; ScriptRuntimeContext's
CachedCall / CachedWrite helpers pass _executionId / _sourceScript.
Script-side cached rows (CachedSubmit, immediate Attempted+Resolve) are
unchanged. All threading is additive — old buffered S&F rows still
deserialize and process with the new fields null.
Add two optional nullable fields (TlsMode, Credentials) to the
UpdateSmtpConfigCommand record. The handler applies preserve-if-null
semantics: an update that omits a field leaves the existing value
intact, so existing 5-arg callers remain non-breaking.
Adds three KPI tiles to the central Health dashboard for the Audit channel:
volume (rows in the last hour), error rate (Failed/Parked/Discarded over
total), and backlog (sum of SiteAuditBacklog.PendingCount across all sites).
Repo + service:
- IAuditLogRepository.GetKpiSnapshotAsync(window, nowUtc) — single aggregate
SELECT over the trailing window returning total + error counts; nowUtc is
optional for production callers and pinned by integration tests against the
shared MSSQL fixture so the global counts are deterministic.
- AuditLogQueryService.GetKpiSnapshotAsync() — composes the repo aggregate
with a sum of SiteAuditBacklog.PendingCount read from ICentralHealthAggregator.
- AuditLogKpiSnapshot record in Commons/Types/.
UI:
- New AuditKpiTiles Blazor component (Components/Health/) — three Bootstrap
card-tiles, click navigates to /audit/log with the matching pre-filter.
- Health.razor wires the tiles in alongside the existing Notification Outbox
KPIs; LoadAuditKpis() runs on every 10s refresh tick and degrades to em
dashes + inline error if the query fails.
- AuditLogPage extended to parse ?status= so the error-rate tile drill-in
(?status=Failed) auto-loads the grid.
Tests:
- AuditLogRepositoryTests: GetKpiSnapshotAsync mixed-status + empty-window
cases against the MSSQL migration fixture.
- AuditLogQueryServiceTests: forwarding + backlog composition; sites with
null SiteAuditBacklog contribute zero.
- AuditKpiTilesTests: 9 bUnit tests covering tile render, error-rate maths
with safe zero-events handling, em-dash unavailable path, click-through
navigation, and warning/danger border thresholds.
- HealthPageTests: new Renders_AuditKpiTiles_WithValues plus IAuditLogQueryService
stub registration in the constructor so existing outbox tests still pass.
- AuditLogPageScaffoldTests: ?status=Failed auto-load + unknown status drop.
Central singleton (M6-T4 Bundle C) that drives the daily AuditLog partition
purge. On a configurable timer (default 24 hours) the actor:
1. Queries IAuditLogRepository.GetPartitionBoundariesOlderThanAsync for
monthly boundaries whose latest OccurredAtUtc is older than
DateTime.UtcNow - AuditLogOptions.RetentionDays.
2. For each eligible boundary calls SwitchOutPartitionAsync, which runs
the drop-and-rebuild dance around UX_AuditLog_EventId.
3. Publishes AuditLogPurgedEvent(boundary, rowsDeleted, durationMs) on
the actor-system EventStream so the Bundle E central health collector
and ops surfaces can subscribe without coupling to this actor.
Co-changes:
* SwitchOutPartitionAsync returns long (rows deleted) — sampled BEFORE the
switch via COUNT_BIG over the per-partition filter so the count
reflects what the switch removed, not a post-purge scan of a table that
no longer exists. All stub implementations updated.
* AuditLogPurgeOptions: IntervalHours (default 24), IntervalOverride for
tests, Interval property resolving either.
* AuditLogPurgedEvent: record with MonthBoundary, RowsDeleted, DurationMs.
Behavior:
* Continue-on-error per boundary — one partition that throws does NOT
abandon the rest of the tick.
* DI scope opened per tick (IAuditLogRepository is a SCOPED EF Core
service); mirrors SiteAuditReconciliationActor and AuditLogIngestActor.
* SupervisorStrategy Resume keeps the singleton alive across leaked
exceptions.
* EventStream capture BEFORE the first await — Context is unsafe after
await in async receive handlers (same pattern as Sender-capture in
AuditLogIngestActor.OnIngestAsync).
Tests:
* Tick_Fires_OnDailyInterval — visible timer side effect.
* Tick_OldPartitions_SwitchedOut — both seeded boundaries purged.
* Tick_NewerPartitions_Untouched — empty enumerator → no switches.
* Tick_PublishesPurgedEvent_WithRowCount — AuditLogPurgedEvent carries
RowsDeleted and DurationMs.
* Tick_SwitchThrows_OtherPartitionsStillProcessed — continue-on-error.
* Threshold_UsesAuditLogOptionsRetentionDays — non-default 30-day window
computed from UtcNow - RetentionDays.
* EndToEnd_RealPartition_RowsRemoved_PurgedEventPublished — TestKit +
MsSqlMigrationFixture: real partitioned table, Jan-2026 row purged,
Apr-2026 row kept, AuditLogPurgedEvent observed via probe.
Replaces M1's NotSupportedException stub with the production drop-DROP-INDEX
→ CREATE-staging → SWITCH PARTITION → DROP-staging → CREATE-INDEX dance
documented in alog.md §4. UX_AuditLog_EventId is intentionally non-aligned
with ps_AuditLog_Month so single-column EventId uniqueness can be enforced
cheaply for InsertIfNotExistsAsync; SQL Server rejects ALTER TABLE SWITCH
while a non-aligned unique index is present, so the implementation drops
it, switches the partition data into a GUID-suffixed staging table on
[PRIMARY], drops staging (discarding the rows), and rebuilds the unique
index — all inside an explicit transaction with a CATCH that guarantees
the unique index is rebuilt regardless of failure point.
Also adds GetPartitionBoundariesOlderThanAsync to IAuditLogRepository: a
CROSS APPLY over sys.partition_range_values + per-partition MAX(OccurredAtUtc)
to enumerate retention-eligible months for the M6 purge actor (next commit).
Tests verify:
* Old partition's rows are removed; other months untouched
* UX_AuditLog_EventId is rebuilt after a successful switch
* InsertIfNotExistsAsync's first-write-wins idempotency still holds after switch
* On engineered SWITCH failure (inbound FK from a probe table), SqlException
propagates AND UX_AuditLog_EventId is still present (CATCH branch ran)
* GetPartitionBoundariesOlderThanAsync returns only boundaries whose partition's
MAX(OccurredAtUtc) is strictly older than the threshold; empty partitions
excluded
Bundle C task M5-T7 — surface DefaultAuditPayloadFilter redactor
over-redactions as a Site Health metric so a misconfigured /
catastrophic regex shows up on /monitoring/health rather than
disappearing into a NoOp sink.
- SiteHealthReport: new 'AuditRedactionFailure' int field
(defaulted to 0 for back-compat with existing producers/tests).
- ISiteHealthCollector / SiteHealthCollector:
new IncrementAuditRedactionFailure() — per-interval atomic
counter with Interlocked, reset on CollectReport, mirroring
the M2 Bundle G SiteAuditWriteFailures pattern.
- HealthMetricsAuditRedactionFailureCounter: new bridge in
ScadaLink.AuditLog.Site that forwards IAuditRedactionFailureCounter
increments to ISiteHealthCollector — mirrors
HealthMetricsAuditWriteFailureCounter one-for-one.
- AddAuditLogHealthMetricsBridge: now ALSO Replaces the
NoOpAuditRedactionFailureCounter binding with the health-metrics
bridge, so a single AddAuditLogHealthMetricsBridge() call wires
both the M2 Bundle G write-failure counter and the M5 Bundle C
redaction-failure counter into the health report.
Site-side only for M5 — the filter also runs on CentralAuditWriter
and AuditLogIngestActor (where it just keeps the NoOp default), but
a central-side health-metric surface for AuditRedactionFailure is
deferred to M6 alongside the rest of the central health collector
work.
Tests:
- AuditRedactionFailureMetricTests (HealthMonitoring) covers the
SiteHealthCollector increment/report/reset shape (3 tests).
- HealthMetricsAuditRedactionFailureCounterTests (AuditLog) covers
the AuditLog → HealthMonitoring bridge (3 tests).
- Existing CountCapturingHealthCollector stub in
DeploymentManagerRedeployTests extended with the new no-op
interface method.
Verified: dotnet build clean, all 24 test projects green
(the only Failed at first ScadaLink.SiteRuntime.Tests run was the
known-flaky InstanceActorChildAttributeRaceTests; passes on re-run
in isolation and full suite, unrelated to these changes).
Hook the store-and-forward retry loop so the audit pipeline can emit
per-attempt + terminal telemetry under the original TrackedOperationId
(Bundle E Tasks E4 + E5).
New seam:
* ICachedCallLifecycleObserver + CachedCallAttemptContext in
Commons.Interfaces.Services. Outcome enum
(Delivered / TransientFailure / PermanentFailure / ParkedMaxRetries)
is S&F-vocabulary; the bridge living in ScadaLink.AuditLog (Bundle F)
will map it to the AuditKind/AuditStatus pair when building the
CachedCallTelemetry packet.
* StoreAndForwardService gains an optional cachedCallObserver
constructor parameter + siteId. RetryMessageAsync fires the observer
exactly once per attempt with the appropriate outcome:
- handler returns true -> Delivered
- handler returns false -> PermanentFailure (and parks)
- handler throws + retries remaining -> TransientFailure
- handler throws + max retries hit -> ParkedMaxRetries (and parks)
Hook is best-effort: a thrown observer is logged + swallowed so a
failing audit pipeline can never be misclassified as a transient
delivery failure or corrupt the retry-count bookkeeping (alog.md §7).
Only cached-call categories (ExternalSystem, CachedDbWrite) generate
notifications — Notification category has its own central-side
audit pipeline (Notification Outbox / #21).
Pre-M3 callers that didn't thread a TrackedOperationId into the S&F
message id are silently skipped — the observer requires a parseable id
by contract. New S&F callers stamp the id as messageId (Bundle E3).
Bundle E tasks E4 + E5.
Rework ScriptRuntimeContext.ExternalSystem.CachedCall to fit the M3
combined-telemetry model:
* Mints a fresh TrackedOperationId and emits one CachedSubmit packet
via ICachedCallTelemetryForwarder BEFORE handing the call off — the
SiteCalls row is materialised before the first delivery attempt so
Tracking.Status(id) can observe a Submitted row even if immediate
delivery resolves before the helper returns.
* Threads the TrackedOperationId into IExternalSystemClient.CachedCallAsync
as a new optional parameter (and into IDatabaseGateway.CachedWriteAsync
for the Database mirror set up here for E6). The gateway uses the id
as the StoreAndForward messageId so the retry loop (Tasks E4/E5) can
recover it from StoreAndForwardMessage.Id.
* Returns the TrackedOperationId rather than ExternalCallResult — the
script's contract is now "get a tracking handle, observe outcome via
Tracking.Status". Best-effort emission: a thrown forwarder is logged
+ swallowed; the original call still runs and the id is still returned.
DatabaseHelper gets the matching siteId / sourceScript / forwarder
fields and a parallel CachedSubmit emitter (Channel=DbOutbound) so Task
E6's Database.CachedWrite mirror plugs in without further runtime
wiring.
New ICachedCallTelemetryForwarder seam in Commons.Interfaces.Services
so SiteRuntime depends on Commons (existing arrow) rather than
ScadaLink.AuditLog (would have introduced a new dependency).
Bundle E task E3 (and helper-shape work for E6).
Bundle C of Audit Log #23 M3.
Adds the ScadaLink.SiteCallAudit project + matching tests project, mirroring
the ScadaLink.AuditLog scaffolding pattern (net10.0, central package
management, InternalsVisibleTo to the tests assembly).
SiteCallAuditActor is the central singleton entry point for Site Call Audit
(#22): it receives UpsertSiteCallCommand and persists the SiteCall via
ISiteCallAuditRepository.UpsertAsync (monotonic, idempotent — out-of-order
or duplicate updates are silent no-ops at the repo). Audit-write failures
NEVER abort the user-facing action (CLAUDE.md): repository throws are
caught + logged, the actor replies Accepted=false, and the singleton stays
alive (Resume supervisor strategy as defence in depth).
Two constructors mirror AuditLogIngestActor:
- IServiceProvider production constructor resolves the scoped EF repository
from a fresh DI scope per message.
- ISiteCallAuditRepository test constructor injects a concrete repository so
the TestKit tests exercise the real monotonic-upsert SQL end to end.
UpsertSiteCallCommand + UpsertSiteCallReply live in ScadaLink.Commons (same
home as IngestAuditEventsCommand) so Bundle D's gRPC server can construct
them without taking a project reference on the actor's host project.
AddSiteCallAudit() is a placeholder for symmetry with AddAuditLog /
AddNotificationOutbox; Bundle F will populate it with the actor's Props
factory + options bindings.
Tests (Akka.TestKit.Xunit2 + MsSqlMigrationFixture via project ref to
ScadaLink.ConfigurationDatabase.Tests, mirroring Bundle D2):
- Receive_UpsertSiteCallCommand_Persists_Replies_Accepted
- Receive_DuplicateUpsert_OlderStatus_NoOp_StillRepliesAccepted (idempotency)
- Receive_RepoThrowsTransient_RepliesAccepted_False_ActorStaysAlive
Reconciliation, KPIs, and the central->site Retry/Discard relay are
deferred per CLAUDE.md scope discipline.
ScadaLink.slnx updated to include both new projects.
All 3 new tests pass against the running infra/mssql container; full suite
(2683 tests across 27 projects) passes with no regressions.
Bundle B3 of Audit Log #23 M3: data-access layer for the central SiteCalls
table introduced in B1+B2. UpsertAsync is insert-if-not-exists then
monotonic-status update so out-of-order telemetry, duplicate gRPC packets,
and reconciliation pulls all converge on the same row without rolling
state backward.
- src/ScadaLink.Commons/Interfaces/Repositories/ISiteCallAuditRepository.cs:
UpsertAsync (monotonic), GetAsync, QueryAsync, PurgeTerminalAsync.
- src/ScadaLink.Commons/Types/Audit/SiteCallQueryFilter.cs +
SiteCallPaging.cs: filter (Channel/SourceSite/Status/Target/time range)
and keyset paging cursor on (CreatedAtUtc DESC, TrackedOperationId DESC),
mirrored on M1's AuditLog* equivalents.
- src/ScadaLink.ConfigurationDatabase/Repositories/SiteCallAuditRepository.cs:
raw-SQL InsertIfNotExists + conditional UPDATE with inline CASE rank
compare (Submitted=0, Forwarded=1, Attempted/Skipped=2, terminal=3 —
terminal statuses are mutually exclusive so e.g. Delivered cannot
overwrite Parked). Duplicate-key violations (SQL 2601/2627) are
swallowed at Debug, identical to AuditLogRepository's race-fix.
QueryAsync uses FromSqlInterpolated because EF Core 10 cannot translate
string.Compare against the value-converted TrackedOperationId column
inside an expression tree.
- ServiceCollectionExtensions wires the repository (scoped, after
IAuditLogRepository).
- 12 integration tests in tests/ScadaLink.ConfigurationDatabase.Tests/
Repositories/ (MsSqlMigrationFixture + [SkippableFact]): fresh insert,
monotonic advance, older-status no-op, same-status no-op,
terminal-over-terminal no-op, 50-way concurrent-insert race produces
exactly one row, Get known/unknown, filter by site, keyset paging no
overlap, purge terminal-and-old, purge keeps non-terminal-and-recent.
Bundle B1 of Audit Log #23 M3: introduces the SiteCall entity + EF mapping
for the central SiteCalls operational-state table. One row per
TrackedOperationId, mirrored from sites via best-effort telemetry then
periodic reconciliation; eventually-consistent mirror, not a dispatcher.
- src/ScadaLink.Commons/Entities/Audit/SiteCall.cs: append-once record
with required TrackedOperationId/Channel/Target/SourceSite/Status,
monotonic status update at the repo layer.
- src/ScadaLink.ConfigurationDatabase/Configurations/SiteCallEntityTypeConfiguration.cs:
table SiteCalls, PK on TrackedOperationId (stored as varchar(36) via
value conversion through the canonical 'D'-format GUID string —
matches the wire shape used by gRPC + SQLite columns), two named
indexes (IX_SiteCalls_Source_Created, IX_SiteCalls_Status_Updated).
- ScadaLinkDbContext: DbSet<SiteCall> SiteCalls in the existing Audit
section, after AuditLogs.
- Tests in tests/ScadaLink.ConfigurationDatabase.Tests/Configurations/:
table name, PK, value-conversion shape, index presence + ordering.
Bundle G of Audit Log #23 M2. Bridges the FallbackAuditWriter primary-
failure counter into the Site Health Monitoring report payload so a
sustained audit-write outage surfaces on /monitoring/health instead of
disappearing into a NoOp sink.
- SiteHealthReport: add SiteAuditWriteFailures (defaulted, additive).
- ISiteHealthCollector + SiteHealthCollector: new
IncrementSiteAuditWriteFailures() counter, per-interval reset
semantics matching ScriptErrorCount / DeadLetterCount.
- HealthMetricsAuditWriteFailureCounter: adapter forwarding
IAuditWriteFailureCounter.Increment() to the collector.
- AddAuditLogHealthMetricsBridge(): swaps the NoOp default
registration for the real bridge; called from
SiteServiceRegistration after AddSiteHealthMonitoring + AddAuditLog.
- Existing host-wiring test updated: site composition now resolves
HealthMetricsAuditWriteFailureCounter (not NoOp).
Tests: HealthMonitoring 60 -> 63 (3 new), AuditLog 56 -> 59 (3 new),
full solution green.
Append-only data-access surface for the central AuditLog table — three
methods: InsertIfNotExistsAsync (first-write-wins on EventId), QueryAsync
(filter + keyset paging on (OccurredAtUtc desc, EventId desc)), and
SwitchOutPartitionAsync (M1 honest contract — throws NotSupported until
M6 lands the non-aligned-index drop/rebuild dance for the partition
switch). No Update, no row-delete; bulk purge is partition-only.
Bundle D of the Audit Log #23 M1 Foundation plan.
The script-analysis sandbox Notify surface was stale after the Notification
Outbox change: SandboxNotifyTarget.Send returned Task<NotificationResult> and
there was no Status method, while production NotifyTarget.Send returns
Task<string> (a NotificationId) plus NotifyHelper.Status. A script that
test-ran cleanly in the sandbox would not compile against the real site
runtime.
- Move the NotificationDeliveryStatus record from ScadaLink.SiteRuntime.Scripts
into ScadaLink.Commons.Messages.Notification so both production and the
CentralUI sandbox reference the exact same type (CentralUI does not, and
should not, reference SiteRuntime). Production NotifyHelper.Status is
otherwise untouched.
- Rewrite SandboxNotifyHelper/SandboxNotifyTarget to be a signature-faithful
no-op fake: Send returns Task<string> (a fake NotificationId), Status returns
Task<NotificationDeliveryStatus>. Production now enqueues into the site S&F
engine, which has no central-side equivalent in the sandbox, so the fake no
longer carries an INotificationDeliveryService.
- Add script-analysis tests proving a script using the new Notify shape both
diagnoses clean and runs in the sandbox.
Inbound-API bearer credentials are no longer persisted in plaintext. ApiKey now
holds a KeyHash (peppered HMAC-SHA256); the key is shown once at creation and
only its hash is stored. Lookup and validation hash the presented candidate.
Cross-module: Commons (ApiKey, ApiKeyHasher), ConfigurationDatabase (mapping +
HashApiKeyValue migration), InboundAPI (ApiKeyValidator), ManagementService
(key creation), CentralUI (ApiKeys.razor). Existing keys must be re-issued.