Two follow-ups from the T13/T14 code review:
- M1: Add CachedWrite_StampsSourceNode_OnSubmitTelemetryRow and
CachedWrite_NoSourceNodeWired_LeavesSourceNodeNull to DatabaseCachedWriteEmissionTests,
mirroring the existing ApiOutbound SourceNode tests in
ExternalSystemCachedCallEmissionTests. Site-emitter coverage now symmetric
across both cached-call channels.
- M2: Clarify the GetService(INodeIdentityProvider) DI comments on the
CachedCallTelemetryForwarder and CachedCallLifecycleBridge factories:
it's test composition roots that may not register the provider, not
central production. Both site and central hosts always register it via
SiteServiceRegistration.BindSharedOptions.
Site: site emitters of SiteCallOperational (ExternalSystemClient, the script-API
cached call path in ScriptRuntimeContext, CachedCallLifecycleBridge) inject
INodeIdentityProvider and stamp SourceNode = NodeName at construction.
OperationTrackingStore call site in CachedCallTelemetryForwarder now stamps
SourceNode too.
Central: SiteCallAuditRepository.UpsertAsync INSERT includes SourceNode in the
column list; conditional monotonic UPDATE uses
COALESCE(@SourceNode, SourceNode) so later packets cannot blank a previously-
stamped value. After this commit every SiteCalls row carries node-a/node-b in
SourceNode (subject to monotonic preservation).
Two follow-ups flagged by code review on Tasks 11/12:
- Lock the back-compat contract for CentralAuditWriter's optional
`nodeIdentity = null` ctor parameter with two explicit tests
(`WriteAsync_PassesThroughCallerSourceNode_WhenNoProviderInjected` and
`WriteAsync_LeavesSourceNodeNull_WhenNoProviderInjected`). The previous
null-provider path was only exercised incidentally via legacy
CentralAuditWriterTests setups; the new tests make the contract explicit
and distinct from the "provider supplied, returns null" path.
- Document why the catch-block log references `evt` rather than the
post-stamp record: the three logged fields (EventId, Kind, Status) are
immutable across the filter+stamp chain, so referencing either name is
equivalent — but the comment warns future maintainers to switch names if
they ever add a field the chain mutates (e.g. SourceNode).
CentralAuditWriter injects INodeIdentityProvider and stamps the event before
handing to the repository. AuditLogRepository.InsertIfNotExistsAsync now
includes SourceNode in the INSERT column list. Caller-provided value wins
(supports any future direct-write callsite that already has its own node id).
Caller-provided SourceNode wins (preserves reconciled rows from other nodes);
otherwise the writer fills it from the local INodeIdentityProvider.NodeName.
Reads from the provider on every write — singleton lifetime means zero overhead.
- AddColumnIfMissing is now shared by ExecutionId and ParentExecutionId;
drop the ExecutionId-specific tag.
- AuditLogRepository.GetExecutionTreeAsync doc no longer hardcodes the
MAXRECURSION literal; reference the ExecutionChainMaxDepth const instead.
The store-and-forward retry loop emits the per-attempt and terminal cached
audit rows (ApiCallCached/DbWriteCached Attempted, CachedResolve) via
CachedCallLifecycleBridge from a CachedCallAttemptContext, not from the
script context. The ExecutionId rollout (Task 4) already threaded ExecutionId
and SourceScript through this path; ParentExecutionId — the spawning
inbound-API request's ExecutionId — was not, so those retry-loop rows had
ParentExecutionId = null even for an inbound-API-routed run.
Thread it additively as a sibling at every carry point ExecutionId passes
through:
- StoreAndForwardMessage gains ParentExecutionId (Guid?).
- StoreAndForwardStorage adds a nullable parent_execution_id column via the
same idempotent PRAGMA-probed ALTER TABLE migration; rows persisted by an
older build read back null (back-compat). The defensive Guid.TryParse read
helper (ParseExecutionId) is renamed ParseGuidColumn and reused for both
columns so a corrupt value cannot abort the retry sweep.
- StoreAndForwardService.EnqueueAsync gains an optional parentExecutionId
param, stamped onto the buffered message and surfaced on the
CachedCallAttemptContext built in the retry loop.
- CachedCallAttemptContext gains ParentExecutionId.
- CachedCallLifecycleBridge.BuildPacket sets AuditEvent.ParentExecutionId
from the context, beside the existing ExecutionId.
- IExternalSystemClient.CachedCallAsync / IDatabaseGateway.CachedWriteAsync
gain an optional parentExecutionId param; ScriptRuntimeContext's CachedCall
/ CachedWrite helpers pass _parentExecutionId.
All threading is additive — ParentExecutionId is Guid? everywhere, null for
non-routed runs, and old buffered S&F rows still deserialize with the new
field null.
The store-and-forward retry loop emits the per-attempt and terminal cached
audit rows (ApiCallCached/DbWriteCached Attempted, CachedResolve) via
CachedCallLifecycleBridge from a CachedCallAttemptContext, not from the
script context. ExecutionId (and SourceScript) were not threaded through the
S&F buffer, so those rows had ExecutionId = null and SourceScript = null.
Thread both, additively, from the cached-call enqueue path:
- StoreAndForwardMessage gains ExecutionId (Guid?) / SourceScript (string?).
- StoreAndForwardStorage adds nullable execution_id / source_script columns
via an idempotent PRAGMA-probed ALTER TABLE migration; rows persisted by
an older build read back null (back-compat).
- StoreAndForwardService.EnqueueAsync gains optional executionId /
sourceScript params, stamped onto the buffered message and surfaced on the
CachedCallAttemptContext built in the retry loop.
- CachedCallAttemptContext gains ExecutionId / SourceScript.
- CachedCallLifecycleBridge.BuildPacket sets AuditEvent.ExecutionId and
AuditEvent.SourceScript from the context (replacing the hard-coded
SourceScript = null and its now-stale comment).
- IExternalSystemClient.CachedCallAsync / IDatabaseGateway.CachedWriteAsync
gain optional executionId / sourceScript params; ScriptRuntimeContext's
CachedCall / CachedWrite helpers pass _executionId / _sourceScript.
Script-side cached rows (CachedSubmit, immediate Attempted+Resolve) are
unchanged. All threading is additive — old buffered S&F rows still
deserialize and process with the new fields null.
Extract the verbatim-duplicated SiteCallOperationalDto -> SiteCall mapper
into a single public SiteCallDtoMapper static class in
ScadaLink.Communication.Grpc, mirroring AuditEventDtoMapper. Replaces three
identical private copies (SiteStreamGrpcServer.MapSiteCallFromDto,
ClusterClientSiteAuditClient.MapSiteCall, and the test-infra
DirectActorSiteStreamAuditClient.MapSiteCallFromDto), removes the now-stale
doc comment that justified the duplication, and drops the using directives
that became unused. Adds SiteCallDtoMapperTests for field-by-field coverage.
Only the FromDto direction is provided: nothing maps SiteCall back onto the
wire, so a ToDto would be dead code.
The snapshot's per-site stalled latch now lives on the snapshot itself
and is fed by SiteAuditTelemetryStalledTracker via ApplyStalled, removing
the chain that required ActorSystem at DI composition time. The tracker
is now constructed by AkkaHostedService once ActorSystem.Create returns,
with a lock-guarded auxiliary-disposable list so concurrent host
start/stop in tests cannot race the enumeration.
Central singleton (M6-T4 Bundle C) that drives the daily AuditLog partition
purge. On a configurable timer (default 24 hours) the actor:
1. Queries IAuditLogRepository.GetPartitionBoundariesOlderThanAsync for
monthly boundaries whose latest OccurredAtUtc is older than
DateTime.UtcNow - AuditLogOptions.RetentionDays.
2. For each eligible boundary calls SwitchOutPartitionAsync, which runs
the drop-and-rebuild dance around UX_AuditLog_EventId.
3. Publishes AuditLogPurgedEvent(boundary, rowsDeleted, durationMs) on
the actor-system EventStream so the Bundle E central health collector
and ops surfaces can subscribe without coupling to this actor.
Co-changes:
* SwitchOutPartitionAsync returns long (rows deleted) — sampled BEFORE the
switch via COUNT_BIG over the per-partition filter so the count
reflects what the switch removed, not a post-purge scan of a table that
no longer exists. All stub implementations updated.
* AuditLogPurgeOptions: IntervalHours (default 24), IntervalOverride for
tests, Interval property resolving either.
* AuditLogPurgedEvent: record with MonthBoundary, RowsDeleted, DurationMs.
Behavior:
* Continue-on-error per boundary — one partition that throws does NOT
abandon the rest of the tick.
* DI scope opened per tick (IAuditLogRepository is a SCOPED EF Core
service); mirrors SiteAuditReconciliationActor and AuditLogIngestActor.
* SupervisorStrategy Resume keeps the singleton alive across leaked
exceptions.
* EventStream capture BEFORE the first await — Context is unsafe after
await in async receive handlers (same pattern as Sender-capture in
AuditLogIngestActor.OnIngestAsync).
Tests:
* Tick_Fires_OnDailyInterval — visible timer side effect.
* Tick_OldPartitions_SwitchedOut — both seeded boundaries purged.
* Tick_NewerPartitions_Untouched — empty enumerator → no switches.
* Tick_PublishesPurgedEvent_WithRowCount — AuditLogPurgedEvent carries
RowsDeleted and DurationMs.
* Tick_SwitchThrows_OtherPartitionsStillProcessed — continue-on-error.
* Threshold_UsesAuditLogOptionsRetentionDays — non-default 30-day window
computed from UtcNow - RetentionDays.
* EndToEnd_RealPartition_RowsRemoved_PurgedEventPublished — TestKit +
MsSqlMigrationFixture: real partitioned table, Jan-2026 row purged,
Apr-2026 row kept, AuditLogPurgedEvent observed via probe.
Bundle C task M5-T7 — surface DefaultAuditPayloadFilter redactor
over-redactions as a Site Health metric so a misconfigured /
catastrophic regex shows up on /monitoring/health rather than
disappearing into a NoOp sink.
- SiteHealthReport: new 'AuditRedactionFailure' int field
(defaulted to 0 for back-compat with existing producers/tests).
- ISiteHealthCollector / SiteHealthCollector:
new IncrementAuditRedactionFailure() — per-interval atomic
counter with Interlocked, reset on CollectReport, mirroring
the M2 Bundle G SiteAuditWriteFailures pattern.
- HealthMetricsAuditRedactionFailureCounter: new bridge in
ScadaLink.AuditLog.Site that forwards IAuditRedactionFailureCounter
increments to ISiteHealthCollector — mirrors
HealthMetricsAuditWriteFailureCounter one-for-one.
- AddAuditLogHealthMetricsBridge: now ALSO Replaces the
NoOpAuditRedactionFailureCounter binding with the health-metrics
bridge, so a single AddAuditLogHealthMetricsBridge() call wires
both the M2 Bundle G write-failure counter and the M5 Bundle C
redaction-failure counter into the health report.
Site-side only for M5 — the filter also runs on CentralAuditWriter
and AuditLogIngestActor (where it just keeps the NoOp default), but
a central-side health-metric surface for AuditRedactionFailure is
deferred to M6 alongside the rest of the central health collector
work.
Tests:
- AuditRedactionFailureMetricTests (HealthMonitoring) covers the
SiteHealthCollector increment/report/reset shape (3 tests).
- HealthMetricsAuditRedactionFailureCounterTests (AuditLog) covers
the AuditLog → HealthMonitoring bridge (3 tests).
- Existing CountCapturingHealthCollector stub in
DeploymentManagerRedeployTests extended with the new no-op
interface method.
Verified: dotnet build clean, all 24 test projects green
(the only Failed at first ScadaLink.SiteRuntime.Tests run was the
known-flaky InstanceActorChildAttributeRaceTests; passes on re-run
in isolation and full suite, unrelated to these changes).
Bundle C task M5-T6 — plugs the IAuditPayloadFilter singleton into the
three audit writer entry points so every event is truncated + redacted
before persistence, regardless of which path it took to disk:
- FallbackAuditWriter (site hot path): filter runs before the primary
SQLite write AND the ring-buffer enqueue, so a recovery drain replays
rows that are already capped/redacted.
- CentralAuditWriter (central direct-write): filter runs before the
per-call IAuditLogRepository.InsertIfNotExistsAsync.
- AuditLogIngestActor (site→central telemetry):
- OnIngestAsync resolves the filter from the per-message scope and
applies it to each row before IngestedAtUtc stamping.
- OnCachedTelemetryAsync (M3 dual-write) applies the filter to the
audit half of every CachedTelemetryEntry before the audit-insert
+ site-call-upsert transaction.
Filter parameter is optional (nullable) on each constructor so the
existing test composition roots that don't pass one keep working unchanged
— production DI wiring in AddAuditLog always passes the real filter
through. ICentralAuditWriter registration switched from the open-ctor
form to a factory so the filter flows through it.
Tests: FilterIntegrationTests covers all three writer paths end-to-end
(4 tests). Full ScadaLink.AuditLog.Tests suite: 146 passed, 0 failed,
0 skipped.
M4 Bundle B (B1) — add the central-only ICentralAuditWriter implementation
and inject it into NotificationOutboxActor so subsequent tasks (B2/B3) can
route attempt + terminal lifecycle events through the direct-write audit path.
- CentralAuditWriter: thin wrapper around IAuditLogRepository.InsertIfNotExistsAsync;
scope-per-call (matches AuditLogIngestActor / NotificationOutboxActor pattern);
stamps IngestedAtUtc; swallows all internal failures (alog.md §13).
- Registered as a singleton in AddAuditLog.
- NotificationOutboxActor ctor takes ICentralAuditWriter (validated non-null).
- Host wiring resolves the writer once from the root provider and passes it
into the singleton's Props.Create call.
- Existing TestKit fixtures updated with a NoOpCentralAuditWriter helper so
tests that don't exercise audit emission still compile and pass.
M3 Bundle F (Task F1) wires the cached-call audit pipeline through the
composition roots:
- Central: register SiteCallAuditActor as a cluster singleton + proxy
(mirrors AuditLogIngestActor and NotificationOutboxActor). Program.cs
calls .AddSiteCallAudit() on the central role.
- Site: register ICachedCallTelemetryForwarder + CachedCallLifecycleBridge
in AddAuditLog (lazy factory — Central nodes degrade to audit-only
emission because IOperationTrackingStore is site-only).
- Site: bind CachedCallLifecycleBridge to ICachedCallLifecycleObserver so
StoreAndForwardService picks it up via DI.
- Site: introduce IStoreAndForwardSiteContext + Host adapter to surface the
site id to StoreAndForwardService without creating a
StoreAndForward -> HealthMonitoring project-reference cycle.
- ScriptExecutionActor resolves ICachedCallTelemetryForwarder per script
scope and threads it into ScriptRuntimeContext.
CachedCallTelemetryForwarder's IOperationTrackingStore dependency is now
nullable so Central DI validation succeeds with the lazy registration; the
forwarder's tracking-half emission is a no-op when the store is absent.
Tests:
- AkkaHostedServiceAuditWiringTests: Central host builds with
AddSiteCallAudit and resolves ICachedCallTelemetryForwarder; Site
resolves the forwarder + bridge + observer + IStoreAndForwardSiteContext.
- Full solution: 194 Host tests green, 241 SiteRuntime tests green, every
other suite unchanged.
Wire the M3 cached-call audit pipeline end-to-end for the database
channel and close the loop between the S&F lifecycle observer and the
site-side dual emitter.
* DatabaseCachedWriteEmissionTests covers Database.CachedWrite (set up
in Bundle E3): mints a TrackedOperationId, emits one CachedSubmit
packet on DbOutbound, threads the id into IDatabaseGateway, and is
best-effort on a thrown forwarder. Mirrors ExternalSystem.CachedCall
coverage from E3.
* CachedCallLifecycleBridge (new) implements ICachedCallLifecycleObserver
and lives alongside CachedCallTelemetryForwarder. The bridge ingests
per-attempt notifications from the S&F retry loop and fans them out
to the forwarder:
- TransientFailure -> 1 Attempted row
- Delivered -> Attempted + CachedResolve(Delivered)
- PermanentFailure -> Attempted + CachedResolve(Parked)
- ParkedMaxRetries -> Attempted + CachedResolve(Parked)
Channel string -> AuditKind mapping (ApiOutbound->ApiCallCached,
DbOutbound->DbWriteCached). Best-effort top-level catch swallows any
unexpected throw so the S&F retry bookkeeping is never disturbed.
* Bridge tests (7) cover all four outcomes, channel mapping, provenance
propagation, and the no-throw-on-forwarder-failure contract.
Bundle F (Host registration) will instantiate the bridge and inject it
into StoreAndForwardService.cachedCallObserver, closing the wiring path
end-to-end.
Bundle E task E6.
Rework ScriptRuntimeContext.ExternalSystem.CachedCall to fit the M3
combined-telemetry model:
* Mints a fresh TrackedOperationId and emits one CachedSubmit packet
via ICachedCallTelemetryForwarder BEFORE handing the call off — the
SiteCalls row is materialised before the first delivery attempt so
Tracking.Status(id) can observe a Submitted row even if immediate
delivery resolves before the helper returns.
* Threads the TrackedOperationId into IExternalSystemClient.CachedCallAsync
as a new optional parameter (and into IDatabaseGateway.CachedWriteAsync
for the Database mirror set up here for E6). The gateway uses the id
as the StoreAndForward messageId so the retry loop (Tasks E4/E5) can
recover it from StoreAndForwardMessage.Id.
* Returns the TrackedOperationId rather than ExternalCallResult — the
script's contract is now "get a tracking handle, observe outcome via
Tracking.Status". Best-effort emission: a thrown forwarder is logged
+ swallowed; the original call still runs and the id is still returned.
DatabaseHelper gets the matching siteId / sourceScript / forwarder
fields and a parallel CachedSubmit emitter (Channel=DbOutbound) so Task
E6's Database.CachedWrite mirror plugs in without further runtime
wiring.
New ICachedCallTelemetryForwarder seam in Commons.Interfaces.Services
so SiteRuntime depends on Commons (existing arrow) rather than
ScadaLink.AuditLog (would have introduced a new dependency).
Bundle E task E3 (and helper-shape work for E6).
Sister to SiteAuditTelemetryActor: takes a combined CachedCallTelemetry
packet and fans it out to the two site-local stores.
* AuditEvent half writes through IAuditWriter (the M2 FallbackAuditWriter
+ SqliteAuditWriter chain — same site SQLite hot-path as sync calls).
* SiteCallOperational half maps Audit.Kind to the matching
IOperationTrackingStore method:
- CachedSubmit -> RecordEnqueueAsync (insert-if-not-exists)
- ApiCallCached / DbWriteCached -> RecordAttemptAsync (monotonic)
- CachedResolve -> RecordTerminalAsync (first-write-wins)
Best-effort contract (alog.md §7): independent try/catch per half so a
thrown writer cannot starve the tracking row (and vice-versa); both
failures are logged at warning level and swallowed — the calling script
never sees them.
Wire push deferred to M6 — the NoOp ISiteStreamAuditClient binding stays
in effect; the forwarder writes only to the local stores in M3. The
existing SiteAuditTelemetryActor drain loop will sweep the audit rows
once a real gRPC client lands.
Bundle E task E2.
Add the second site→central RPC seam alongside the existing M2
IngestAuditEventsAsync. The Bundle D proto already lit up
IngestCachedTelemetry (CachedTelemetryBatch / IngestAck) so this commit
just plumbs the client-side abstraction:
* ISiteStreamAuditClient gains IngestCachedTelemetryAsync(batch, ct).
* NoOpSiteStreamAuditClient implements it returning an empty IngestAck
(same shape as M2 — production gRPC client lands in M6).
* SyncCallEmissionEndToEndTests' DirectActorSiteStreamAuditClient stub
throws NotSupportedException from the new method so a regression that
accidentally routes a cached packet through the sync stub fails loudly.
* New NoOpSiteStreamAuditClientTests cover the null-guard + empty-ack
contract for both batch shapes.
Bundle E task E1.
Bundle G of Audit Log #23 M2. Bridges the FallbackAuditWriter primary-
failure counter into the Site Health Monitoring report payload so a
sustained audit-write outage surfaces on /monitoring/health instead of
disappearing into a NoOp sink.
- SiteHealthReport: add SiteAuditWriteFailures (defaulted, additive).
- ISiteHealthCollector + SiteHealthCollector: new
IncrementSiteAuditWriteFailures() counter, per-interval reset
semantics matching ScriptErrorCount / DeadLetterCount.
- HealthMetricsAuditWriteFailureCounter: adapter forwarding
IAuditWriteFailureCounter.Increment() to the collector.
- AddAuditLogHealthMetricsBridge(): swaps the NoOp default
registration for the real bridge; called from
SiteServiceRegistration after AddSiteHealthMonitoring + AddAuditLog.
- Existing host-wiring test updated: site composition now resolves
HealthMetricsAuditWriteFailureCounter (not NoOp).
Tests: HealthMonitoring 60 -> 63 (3 new), AuditLog 56 -> 59 (3 new),
full solution green.
Wires Bundle E of the M2 site-sync pipeline:
- AddAuditLog extended to register the site writer chain (SqliteAuditWriter
singleton + ISiteAuditQueue forward + RingBufferFallback + FallbackAuditWriter
composing them) and the telemetry collaborators (SiteAuditTelemetryOptions,
SqliteAuditWriterOptions, IAuditWriteFailureCounter NoOp default,
ISiteStreamAuditClient NoOp default).
- AkkaHostedService central role: AuditLogIngestActor as ClusterSingletonManager
(singleton name 'audit-log-ingest') + ClusterSingletonProxy, mirroring the
Notification Outbox pattern. Proxy is offered to SiteStreamGrpcServer if it
resolves (Site path only today; M6 reconciliation will host gRPC on central).
- AkkaHostedService site role: SiteAuditTelemetryActor (per-site, NOT a
singleton because each site is its own cluster), bound to a dedicated
audit-telemetry-dispatcher (ForkJoinDispatcher, 2 dedicated threads).
- Program.cs + SiteServiceRegistration.Configure call AddAuditLog on both roles.
- AuditLogIngestActor gains a second constructor that takes IServiceProvider so
the cluster singleton can create a fresh scope per message — IAuditLogRepository
is a scoped EF Core service and cannot be pre-resolved from the root. The
IAuditLogRepository constructor remains for Bundle D's MSSQL-fixture tests.
NoOp ISiteStreamAuditClient is deliberate: no site→central gRPC channel exists
in M2 (sites talk to central via Akka ClusterClient; gRPC SiteStreamService is
hosted on sites for central→site streaming). M6 reconciliation introduces the
real gRPC site→central client + central-hosted gRPC server. Bundle H's
integration test substitutes a stub client directly via the actor's Props.
Tests:
- tests/ScadaLink.AuditLog.Tests/AddAuditLogTests.cs — 11 tests (was 3): writer
singleton, IAuditWriter as FallbackAuditWriter, ISiteAuditQueue same-instance
as SqliteAuditWriter, options bind round-trip, NoOp default assertions.
- tests/ScadaLink.Host.Tests/AkkaHostedServiceAuditWiringTests.cs (new) — 13
tests: BuildHocon emits audit-telemetry-dispatcher block with the expected
type/throughput/thread-count; Central composition root resolves the writer
chain + options; Site composition root resolves the writer chain + options +
NoOp client.
Verified: dotnet build clean, 23 test suites green (Host 194 + AuditLog 54).