Component: Audit Log
Purpose
Provides a single, append-only, forensic + operational record of every integration action initiated by, or terminating in, a script — across outbound API, outbound DB, notifications, and inbound API. One row per lifecycle event, rich payloads, long retention, dashboards, drilldowns, and filter queries, answering both forensic questions ("did instance X send notification Y on date Z, with what body?") and operational ones ("which inbound caller is hammering us right now?").
The Audit Log is not a dispatcher. It does not drive delivery, retry loops,
or operator Retry/Discard actions — those remain in Notification Outbox
and Site Call Audit. The Audit Log is the
immutable history that observes those subsystems and adds coverage where
they are silent (sync ExternalSystem.Call, sync DB writes and reads, inbound
API requests).
Location
Central cluster and site clusters.
- Central: the AuditLog table in central MS SQL, plus three singletons on the active central node: AuditLogIngestActor (telemetry receiver), SiteAuditReconciliationActor, and AuditLogPurgeActor.
- Sites: a site-local AuditLog SQLite database file alongside the Store-and-Forward buffer, plus a SiteAuditTelemetryActor singleton on the active site node.
Registered as component #23 in the Host role configuration.
Responsibilities
- Accept site-local hot-path audit writes from script-trust-boundary call paths.
- Forward site audit rows to central via gRPC telemetry with at-least-once delivery and idempotency on EventId.
- Run periodic per-site reconciliation pulls so missed telemetry self-heals.
- Accept central-originated audit writes (Inbound API, Notification dispatch attempts and terminal status).
- Compute point-in-time KPIs (global and per-site) from the central AuditLog table.
- Purge expired rows by monthly partition switch; no row-level deletes.
Scope — the script trust boundary
The Audit Log captures every action a script causes to cross the cluster trust boundary:
| Channel | Trigger | Direction | Covered today? |
|---|---|---|---|
| ExternalSystem.Call(...) | Script | Outbound | No (gap) |
| ExternalSystem.CachedCall(...) | Script | Outbound | Yes: SiteCalls (Site Call Audit) |
| Database.Connection().Execute*(...) (writes) | Script | Outbound | No (gap) |
| Database.CachedWrite(...) | Script | Outbound | Yes: SiteCalls (Site Call Audit) |
| Notify.To(list).Send(...) | Script | Outbound | Yes: Notifications (Notification Outbox) |
| POST /api/{method} (Inbound API) | External | Inbound (invokes a script) | No (gap) |
Out of scope — framework traffic is not audited:
- Health checks, heartbeats, cluster membership messages.
- gRPC inter-cluster real-time streams (attribute values, alarm states).
- Data Connection Layer ↔ OPC UA / custom protocol traffic.
- LDAP authentication probes, Traefik routing decisions.
- Internal Configuration Database queries by the framework.
- Site Event Log writes; audit log writes themselves.
Script-initiated DB reads via Database.Connection().ExecuteReader(...)
count as actions from a script and are in scope. Reads via DCL / subscriptions
are framework traffic and excluded.
The AuditLog Table (central)
Single wide table in central MS SQL, polymorphic by Channel + Kind
discriminators, with a JSON Extra column for channel-specific overflow. One
row per lifecycle event across all channels.
| Column | Type | Notes |
|---|---|---|
| EventId | uniqueidentifier PK | Generated where the event originates (site or central). Idempotency key. |
| OccurredAtUtc | datetime2 | When the event happened (call returned, retry attempted, etc.). |
| IngestedAtUtc | datetime2 | When central persisted the row (lags OccurredAtUtc for site-originated rows). |
| Channel | varchar(32) | One of ApiOutbound, DbOutbound, Notification, ApiInbound. |
| Kind | varchar(32) | Event kind discriminator (see kinds list below). |
| CorrelationId | uniqueidentifier NULL | Ties multi-event operations together. TrackedOperationId for cached calls, NotificationId for notifications, request-id for inbound API. NULL for sync one-shot calls. |
| ExecutionId | uniqueidentifier NULL | The originating script execution / inbound request: the universal per-run correlation value, distinct from CorrelationId, which is the per-operation lifecycle id. Stamped on every audit row emitted by one execution. |
| ParentExecutionId | uniqueidentifier NULL | The ExecutionId of the execution that spawned this run (the cross-execution correlation pointer). Set on every row of an inbound-API-routed site script run (= the inbound request's ExecutionId); NULL for a top-level run (inbound, tag-change / timer-triggered, un-bridged). |
| SourceSiteId | varchar(64) NULL | NULL for central-originated events. |
| SourceInstanceId | varchar(128) NULL | Instance whose script initiated the action (when applicable). |
| SourceScript | varchar(128) NULL | Script name within the instance. |
| Actor | varchar(128) NULL | Inbound API: API key name. Outbound: script identity. Central: system user. |
| Target | varchar(256) NULL | Outbound API: external system + method. DB: connection name. Notification: list name. Inbound API: method name. |
| Status | varchar(32) | Outcome of this event: Submitted, Forwarded, Attempted, Delivered, Failed, Parked, Discarded, Skipped. |
| HttpStatus | int NULL | HTTP-bearing events only. |
| DurationMs | int NULL | Call / attempt duration. |
| ErrorMessage | nvarchar(1024) NULL | Truncated; see ErrorDetail for full text. |
| ErrorDetail | nvarchar(max) NULL | Optional full exception text on failures. |
| RequestSummary | nvarchar(max) NULL | Truncated request payload (configurable cap). Headers redacted. For Channel = ApiInbound, captured in full up to AuditLog:InboundMaxBytes (default 1 MiB); see Payload Capture Policy. |
| ResponseSummary | nvarchar(max) NULL | Truncated response payload. For Channel = ApiInbound, captured in full up to AuditLog:InboundMaxBytes (default 1 MiB). For other channels, capped at DefaultCapBytes by default and ErrorCapBytes on error rows. |
| PayloadTruncated | bit | Set if either summary was truncated. |
| Extra | nvarchar(max) NULL | Channel-specific JSON for fields we don't promote to columns. |
Indexes (first cut):
- IX_AuditLog_OccurredAtUtc: primary time-range index for global scans.
- IX_AuditLog_Site_Occurred (SourceSiteId, OccurredAtUtc): per-site filters.
- IX_AuditLog_CorrelationId (CorrelationId): drilldown from a single operation.
- IX_AuditLog_Execution (ExecutionId): drilldown to every action of one script execution / inbound request.
- IX_AuditLog_ParentExecution (ParentExecutionId): cross-execution drilldown. The downward leg of the execution-tree walk seeks on it (ParentExecutionId = ancestor.ExecutionId), and it backs the parentExecutionId filter.
- IX_AuditLog_Channel_Status_Occurred (Channel, Status, OccurredAtUtc): KPI / dashboard tiles.
- IX_AuditLog_Target_Occurred (Target, OccurredAtUtc): "what did we send to system X".
- Monthly partitioning on OccurredAtUtc from day one; purge is a partition switch (see Retention & Purge).
Kind values (flat — 10 discriminators across all channels):
| Kind | Fires when |
|---|---|
| ApiCall | Sync ExternalSystem.Call(...) returns (success or permanent failure). One row per call. |
| ApiCallCached | A cached outbound-API attempt records its forward-ack (Forwarded) or each retry (Attempted). |
| DbWrite | Sync Database.Connection().Execute*(...) / ExecuteReader(...) completes. One row per call. |
| DbWriteCached | A cached outbound-DB attempt records its forward-ack (Forwarded) or each retry (Attempted). |
| NotifySend | Script's Notify.Send(...) is enqueued on the site; first row in a notification's lifecycle (Status=Submitted). |
| NotifyDeliver | Central Notification Outbox dispatcher records a delivery attempt (Attempted) or terminal outcome (Delivered/Parked/Discarded). |
| InboundRequest | An inbound API request completes; one row per request, written at request end with final status. |
| InboundAuthFailure | An inbound API request was rejected at the auth boundary (bad/missing key). One row, Status=Failed, HttpStatus=401. |
| CachedSubmit | Script-side enqueue of a cached call (ExternalSystem.CachedCall / Database.CachedWrite); first row in the cached-call lifecycle, written to site SQLite before any forward attempt. |
| CachedResolve | Terminal row for a cached operation, with Status = Delivered / Failed / Parked / Discarded. |
Inbound API is intentionally collapsed to a single InboundRequest (or
InboundAuthFailure for auth rejections) row per request rather than a
multi-event lifecycle.
ExecutionId vs CorrelationId
The table carries two correlation columns at different granularities:
- ExecutionId is the universal per-run value: one id per script execution (tag-change / timer-triggered or otherwise) or per inbound API request. It is stamped on every audit row that run produces: the sync ApiCall and DbWrite rows, the full cached-call lifecycle, the NotifySend / NotifyDeliver rows, and the inbound row alike. A run that performs no trust-boundary action emits no rows, but any run that emits multiple rows ties them all together under one ExecutionId. This lets an audit reader pull the complete trust-boundary footprint of a single script run with one ExecutionId filter.
- CorrelationId is the per-operation lifecycle id. It groups the multiple events of one long-running operation (TrackedOperationId for a cached call, NotificationId for a notification, request-id for inbound API) and is NULL for sync one-shot calls that have no operation lifecycle.
The two are orthogonal: one execution may touch several operations (each with
its own CorrelationId) yet every resulting row shares the one ExecutionId.
ParentExecutionId adds cross-execution correlation on top. ExecutionId
is per-run and flat — WHERE ExecutionId = X returns everything one run did, but
nothing links a run to the run that spawned it. ParentExecutionId carries the
spawning execution's ExecutionId: a spawned run still gets its own fresh
ExecutionId, and every audit row it emits also carries the spawner's id in
ParentExecutionId. The first cut bridges the inbound API → routed-site-script
case: an inbound request runs a method script that calls Route.Call, routing to
a site instance; the routed site script records the inbound request's
ExecutionId as its ParentExecutionId, while the inbound InboundRequest row
itself is top-level (ParentExecutionId NULL). The pointer always references the
immediate spawner, so a routed run that itself routes onward threads its own
ExecutionId — walking ParentExecutionId → ExecutionId recursively
reconstructs the call chain as a tree of arbitrary depth. The tag-cascade case
(an attribute write triggering another script) is deferred — the model
generalises to it with no schema change once that spawn point is threaded.
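The recursive ParentExecutionId → ExecutionId walk described above can be sketched as a recursive query. This is a minimal illustration, assuming Python's sqlite3 as a stand-in for central MS SQL and a table cut down to the columns the walk needs; the function name execution_tree and the sample ids are hypothetical.

```python
import sqlite3

def execution_tree(conn, root_execution_id):
    """Return (execution_id, depth) for a root run and every run spawned beneath it."""
    sql = """
        WITH RECURSIVE tree(execution_id, depth) AS (
            SELECT ?, 0
            UNION
            SELECT a.ExecutionId, t.depth + 1
            FROM AuditLog AS a
            JOIN tree AS t ON a.ParentExecutionId = t.execution_id
        )
        SELECT execution_id, depth FROM tree ORDER BY depth, execution_id
    """
    return conn.execute(sql, (root_execution_id,)).fetchall()

# Hypothetical data: an inbound request routes to a site run, which routes onward.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AuditLog (EventId TEXT PRIMARY KEY, ExecutionId TEXT, "
    "ParentExecutionId TEXT, Kind TEXT)"
)
conn.executemany("INSERT INTO AuditLog VALUES (?, ?, ?, ?)", [
    ("e1", "inbound-1", None, "InboundRequest"),   # top-level: no parent
    ("e2", "site-run-1", "inbound-1", "ApiCall"),  # routed run, two rows
    ("e3", "site-run-1", "inbound-1", "DbWrite"),
    ("e4", "site-run-2", "site-run-1", "ApiCall"), # routed onward
])
tree = execution_tree(conn, "inbound-1")
print(tree)
```

UNION (rather than UNION ALL) collapses the several audit rows of one execution into a single tree node, so the result lists each run once with its depth.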
The Site-Local AuditLog (SQLite)
A SQLite database file on each site node, alongside the Store-and-Forward
buffer. Same schema as central minus IngestedAtUtc (irrelevant at the source),
plus a ForwardState column with values Pending | Forwarded | Reconciled that
drives the telemetry loop and reconciliation pull.
Site SQLite retention rule (hard invariant):
A row is eligible for purge only when both OccurredAtUtc < retention threshold AND ForwardState IN ('Forwarded', 'Reconciled'). Pending rows are never purged.
A prolonged central outage will grow the site audit table without bound until
central is reachable again. This is intentional: losing audit rows to make
room is a compliance violation, not a self-healing behavior. To keep that
growth visible, the site emits a SiteAuditBacklog health metric (pending
row count, oldest pending age, bytes on disk); crossing operator-configured
thresholds surfaces a warning on the relevant site tile in the Health
dashboard, mirroring the Store-and-Forward Engine's backlog metric.
Central is the durable home. Site SQLite is a write-buffer with a forwarding guarantee.
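The retention invariant above can be expressed as a single guarded DELETE. The following is a sketch only, assuming timestamps are stored as fixed-format UTC strings (so string comparison matches time order) and using a reduced table; purge_site_rows and iso are hypothetical helper names.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def iso(dt):
    """Fixed-width UTC timestamp, so lexical order equals chronological order."""
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

def purge_site_rows(conn, now, retention_days=7):
    """Delete rows that are both old enough AND already forwarded or reconciled."""
    threshold = iso(now - timedelta(days=retention_days))
    with conn:  # single transaction
        cur = conn.execute(
            "DELETE FROM AuditLog "
            "WHERE OccurredAtUtc < ? AND ForwardState IN ('Forwarded', 'Reconciled')",
            (threshold,),
        )
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AuditLog (EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT, ForwardState TEXT)"
)
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
old, recent = iso(now - timedelta(days=20)), iso(now - timedelta(days=1))
conn.executemany("INSERT INTO AuditLog VALUES (?, ?, ?)", [
    ("old-forwarded", old, "Forwarded"),
    ("old-reconciled", old, "Reconciled"),
    ("old-pending", old, "Pending"),       # past retention, but never purged
    ("new-forwarded", recent, "Forwarded"),
])
deleted = purge_site_rows(conn, now)
remaining = sorted(r[0] for r in conn.execute("SELECT EventId FROM AuditLog"))
print(deleted, remaining)
```

The old Pending row survives the purge even though it is well past the age threshold, which is the behavior the invariant requires.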
Ingestion Paths
Four paths feed the central AuditLog — one site originator and three central
writers — all idempotent on EventId.
Site hot-path append (site-originated events)
The component completing a script-trust-boundary action (External System
Gateway, Database layer, Store-and-Forward Engine) builds an AuditEvent with a
fresh EventId (Guid v4) and OccurredAtUtc = UtcNow, then appends it to the
site-local AuditLog SQLite via IAuditWriter with
ForwardState = 'Pending'. The append is a single-statement INSERT and is
durable in microseconds; control returns to the script with no central
round-trip on the hot path.
Telemetry forward (site → central)
A SiteAuditTelemetryActor singleton drives the forwarding loop: select up to
N Pending rows ordered by OccurredAtUtc, batch-send them to central via the
existing SiteStream gRPC channel as IngestAuditEvents(events), and on
central-ack flip ForwardState = 'Forwarded' for accepted IDs. Rejected IDs
stay Pending for the next sweep. Cadence is short (default 5 s) when
non-empty, longer (default 30 s) when idle; telemetry runs on a dedicated
dispatcher.
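One sweep of the forwarding loop might look like the sketch below. It assumes sqlite3 for the site store and replaces the gRPC IngestAuditEvents call with a plain callable that returns the accepted EventIds; forward_sweep, next_delay_seconds, and fake_central are hypothetical names, and the real actor's scheduling is not shown.

```python
import sqlite3

def forward_sweep(conn, send_batch, batch_size=100):
    """Send up to batch_size Pending rows; flip only the IDs central accepted."""
    rows = conn.execute(
        "SELECT EventId, OccurredAtUtc FROM AuditLog "
        "WHERE ForwardState = 'Pending' ORDER BY OccurredAtUtc LIMIT ?",
        (batch_size,),
    ).fetchall()
    if not rows:
        return 0
    accepted = list(send_batch(rows))  # stand-in for the IngestAuditEvents call
    with conn:
        conn.executemany(
            "UPDATE AuditLog SET ForwardState = 'Forwarded' "
            "WHERE EventId = ? AND ForwardState = 'Pending'",
            [(event_id,) for event_id in accepted],
        )
    return len(accepted)

def next_delay_seconds(pending_remaining, busy=5, idle=30):
    """Short cadence while rows are waiting, longer cadence when idle."""
    return busy if pending_remaining else idle

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AuditLog (EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT, ForwardState TEXT)"
)
conn.executemany("INSERT INTO AuditLog VALUES (?, ?, 'Pending')", [
    ("a", "2024-01-01T00:00:01Z"),
    ("b", "2024-01-01T00:00:02Z"),
    ("c", "2024-01-01T00:00:03Z"),
])

def fake_central(events):
    # Hypothetical central that rejects event "b".
    return [event_id for event_id, _ in events if event_id != "b"]

forwarded = forward_sweep(conn, fake_central)
states = dict(conn.execute("SELECT EventId, ForwardState FROM AuditLog"))
print(forwarded, states)
```

The rejected row stays Pending and is picked up again by the next sweep, which is what gives the loop its at-least-once behavior.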
Reconciliation pull (self-healing for missed telemetry)
A central SiteAuditReconciliationActor periodically (default 5 min per site)
asks each site for its oldest Pending row and pending count; if backlog is
non-draining (e.g., telemetry actor wedged), central issues a
PullAuditEvents(sinceUtc, batchSize) and inserts-if-not-exists. Accepted rows
are flipped to ForwardState = 'Reconciled' site-side. Same self-healing
pattern as Site Call Audit's reconciliation of SiteCalls.
Central direct-write (central-originated events)
Events originating at central never touch site SQLite. Inbound API writes one
ApiInbound.InboundRequest row via ICentralAuditWriter synchronously inside
the request-handler middleware, before the HTTP response is flushed; auth-layer
rejections emit ApiInbound.InboundAuthFailure (Status=Failed, HTTP 401)
instead. The Notification Outbox dispatcher writes
Notification.NotifyDeliver with Status=Attempted per delivery attempt and
Notification.NotifyDeliver with Status=Delivered/Parked/Discarded on
terminal status. Central direct-writes use the same insert-if-not-exists
semantics keyed on EventId.
Cached Operations — Combined Telemetry
For ExternalSystem.CachedCall and Database.CachedWrite, the site is the
source of truth for every audit row. The site writes each lifecycle event —
CachedSubmit (Status=Submitted), then ApiCallCached/DbWriteCached rows
for the forward-ack (Status=Forwarded) and each retry (Status=Attempted),
then a terminal CachedResolve row
(Status=Delivered/Failed/Parked/Discarded) — to its local SQLite
AuditLog on the hot path (or on the retry tick for Attempted rows), then
forwards via the same telemetry channel. The telemetry message format gains the
audit-row fields additively — one packet per lifecycle transition carries both
the operational state update AND the audit row content.
On receipt, central performs both writes in one transaction:
- Insert-if-not-exists the immutable AuditLog row, keyed on EventId.
- Upsert the operational SiteCalls row, following existing Site Call Audit behavior (status, retry count, last error, timestamps).
This collapses two telemetry concerns into one, keeps site SQLite as the
single local source of truth for audit content, and preserves the existing
operational SiteCalls shape for the dispatcher and UI.
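The two-writes-in-one-transaction step can be sketched as follows, assuming sqlite3 as a stand-in for central MS SQL, heavily reduced table shapes, and a packet modeled as a plain dict; ingest_cached_packet and the field subset are hypothetical. Out-of-order packet handling is deliberately left out.

```python
import sqlite3

def ingest_cached_packet(conn, packet):
    """Apply one combined packet: audit insert plus SiteCalls upsert, atomically."""
    with conn:  # one transaction: commit on success, roll back if either statement raises
        conn.execute(
            """INSERT INTO AuditLog (EventId, CorrelationId, Kind, Status)
               SELECT :EventId, :TrackedOperationId, :Kind, :Status
               WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = :EventId)""",
            packet,
        )
        conn.execute(
            """INSERT INTO SiteCalls (TrackedOperationId, Status, RetryCount)
               VALUES (:TrackedOperationId, :Status, :RetryCount)
               ON CONFLICT(TrackedOperationId) DO UPDATE
               SET Status = excluded.Status, RetryCount = excluded.RetryCount""",
            packet,
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AuditLog (EventId TEXT PRIMARY KEY, CorrelationId TEXT, Kind TEXT, Status TEXT)"
)
conn.execute(
    "CREATE TABLE SiteCalls (TrackedOperationId TEXT PRIMARY KEY, Status TEXT, RetryCount INTEGER)"
)
submit = {"EventId": "evt-1", "TrackedOperationId": "op-1",
          "Kind": "CachedSubmit", "Status": "Submitted", "RetryCount": 0}
attempt = {"EventId": "evt-2", "TrackedOperationId": "op-1",
           "Kind": "ApiCallCached", "Status": "Attempted", "RetryCount": 1}

# The duplicate delivery of `submit` simulates a telemetry retry.
for packet in (submit, submit, attempt):
    ingest_cached_packet(conn, packet)

audit_rows = conn.execute("SELECT COUNT(*) FROM AuditLog").fetchone()[0]
site_call = conn.execute(
    "SELECT Status, RetryCount FROM SiteCalls WHERE TrackedOperationId = 'op-1'"
).fetchone()
print(audit_rows, site_call)
```

The redelivered packet adds no second audit row, while the single operational row ends up reflecting the latest transition.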
Payload Capture Policy
- Default cap: 8 KiB for each of RequestSummary and ResponseSummary; raised to 64 KiB on any error row (Status IN ('Failed', 'Parked', 'Discarded')).
- Inbound API exception: for Channel = ApiInbound, RequestSummary and ResponseSummary are captured in full up to a per-body hard ceiling of 1 MiB (configurable via AuditLog:InboundMaxBytes; default 1 048 576 bytes; min 8 192; max 16 777 216). The 8 KiB / 64 KiB default/error caps that apply to other channels do not apply here. PayloadTruncated = 1 is set only when the inbound ceiling is hit; verbatim capture is the normal case. The ceiling applies independently to each body. Header redaction and per-target body redactors still run before persistence.
- Truncation: UTF-8 byte-safe; PayloadTruncated = 1 when applied. A body that exceeds its applicable cap is never stored in full.
- HTTP headers: Authorization, Cookie, Set-Cookie, X-API-Key, and any header matching the configured redact-list regex become <redacted>.
- HTTP bodies: captured verbatim by default. Operators register per-target body redactors (regex → replacement) for known secret fields.
- SQL: statement text and parameter values captured verbatim by default; per-connection opt-in to redact parameters whose name matches a regex.
- Never captured: raw API key material (only the key name via Actor), LDAP bind credentials, cluster secrets, Configuration DB connection strings.
- Safety net: if a configured redactor throws, the affected payload becomes "<redacted: redactor error>" and AuditRedactionFailure increments. We over-redact, never under-redact, on configuration faults.
Redaction happens at the write site, before the row touches SQLite (or central MS SQL for direct-write events). Unredacted secrets never persist.
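A minimal sketch of three of the rules above: header redaction, UTF-8 byte-safe truncation, and the over-redacting safety net. The function names are hypothetical, only the four fixed header names are handled (the configurable redact-list regex is omitted), and the failure counter stands in for the AuditRedactionFailure metric.

```python
import re

REDACTED_HEADERS = {"authorization", "cookie", "set-cookie", "x-api-key"}
redaction_failures = 0  # stand-in for the AuditRedactionFailure metric

def redact_headers(headers):
    """Replace the value of any sensitive header, matching names case-insensitively."""
    return {k: ("<redacted>" if k.lower() in REDACTED_HEADERS else v)
            for k, v in headers.items()}

def truncate_utf8(text, cap_bytes):
    """Cut to at most cap_bytes of UTF-8 without splitting a multi-byte character."""
    raw = text.encode("utf-8")
    if len(raw) <= cap_bytes:
        return text, False
    # The input was valid UTF-8, so the only invalid bytes after slicing are a
    # partial trailing sequence, which errors="ignore" drops.
    return raw[:cap_bytes].decode("utf-8", errors="ignore"), True

def apply_body_redactors(body, redactors):
    """Run (pattern, replacement) redactors; over-redact if any of them fails."""
    global redaction_failures
    try:
        for pattern, replacement in redactors:
            body = re.sub(pattern, replacement, body)
        return body
    except Exception:
        redaction_failures += 1
        return "<redacted: redactor error>"

headers = redact_headers({"Authorization": "Bearer abc", "Accept": "application/json"})
summary, truncated = truncate_utf8("héllo", 2)  # "é" is two bytes and would be split
clean = apply_body_redactors(
    '{"password":"hunter2"}',
    [(r'"password"\s*:\s*"[^"]+"', '"password":"<redacted>"')],
)
broken = apply_body_redactors("secret", [("(", "x")])  # invalid regex triggers the safety net
print(headers, summary, truncated, clean, broken)
```

In this sketch redaction would run before truncation, so a secret cannot survive merely because the cap cut a pattern in half; the source does not state the ordering, so treat that as an assumption.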
Failure Handling & Idempotency
- EventId is the dedup key. Generated at the originator; central ingest is INSERT … WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = @id) under the PK constraint. Idempotent across telemetry retries, reconciliation pulls, and any combination of the two.
- Never fail the action. A failed audit write (site SQLite or central direct-write) logs a critical Site Event Log entry and increments a health metric (SiteAuditWriteFailures or CentralAuditWriteFailures), but the user-facing action proceeds. We do not fail script-initiated work because the audit write failed.
- Hot-path ring buffer. While the site audit writer is unhealthy (disk full, schema lock, transient IO), events buffer in a small in-memory ring (default 1024 rows); oldest are discarded with a Site Event Log warning per drop.
- Reconciliation as fallback. If two consecutive reconciliation cycles report a non-draining backlog, the supervisor restarts the telemetry actor and a SiteAuditTelemetryStalled event fires.
- No dedup horizon. EventId PK enforces uniqueness only while a row exists. A retry that arrives after the original row is purged inserts a "new" row; this is vanishingly rare and harmless.
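The hot-path ring buffer from the list above can be illustrated with a bounded queue that evicts the oldest entry when full. This is a sketch under assumptions: AuditRingBuffer is a hypothetical class, and the dropped counter stands in for the per-drop Site Event Log warning.

```python
from collections import deque

class AuditRingBuffer:
    """Bounded in-memory buffer used while the audit writer is unhealthy."""

    def __init__(self, capacity=1024):
        self._events = deque()
        self._capacity = capacity
        self.dropped = 0  # a real writer would log one warning per drop

    def push(self, event):
        if len(self._events) >= self._capacity:
            self._events.popleft()  # discard the oldest event to make room
            self.dropped += 1
        self._events.append(event)

    def drain(self):
        """Return buffered events oldest-first and empty the buffer."""
        items = list(self._events)
        self._events.clear()
        return items

buffer = AuditRingBuffer(capacity=3)
for n in range(5):
    buffer.push(n)
drained = buffer.drain()
print(drained, buffer.dropped)
```

Once the writer recovers, the drained events would be appended to site SQLite in their original order.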
Retention & Purge
- Central: 365-day default based on OccurredAtUtc, configurable via AuditLog:RetentionDays (min 7, max 3650). Single global retention in v1; no per-channel overrides.
- Partitioning: monthly partitions on OccurredAtUtc from day one (pf_AuditLog_Month / ps_AuditLog_Month). Purge is a partition switch; there are no row-level deletes at central.
- Purge actor: AuditLogPurgeActor singleton on the active central node runs daily, switches out any partition whose latest OccurredAtUtc is older than the retention window, and emits an AuditLog:Purged event (partition range, rowcount, duration). A partition-maintenance step rolls forward each month, creating the next month's partition ahead of time.
- Sites: daily site job; default 7-day retention (configurable, min 1, max 90). Respects the hard ForwardState invariant: Pending rows are never purged on age alone.
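Selecting which monthly partitions the purge actor may switch out comes down to date arithmetic. The sketch below assumes a conservative reading: a partition is expired only when its upper boundary (the first day of the following month) is at or before the retention cutoff, which guarantees every row in it is older than the window. The function names are hypothetical.

```python
from datetime import date, timedelta

def next_month(first_of_month):
    """First day of the month after the given month start."""
    if first_of_month.month == 12:
        return date(first_of_month.year + 1, 1, 1)
    return date(first_of_month.year, first_of_month.month + 1, 1)

def expired_partitions(partition_starts, today, retention_days=365):
    """Return partition start dates whose entire month is older than the cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return [p for p in partition_starts if next_month(p) <= cutoff]

partitions = [date(2024, 1, 1), date(2024, 2, 1), date(2024, 3, 1), date(2024, 4, 1)]
expired = expired_partitions(partitions, today=date(2025, 3, 15))
print(expired)
```

With a cutoff of 2024-03-15, the March 2024 partition is kept because it still holds rows inside the retention window; it becomes eligible once the cutoff reaches 2024-04-01.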
Security & Tamper-Evidence
- Append-only enforcement. The application accesses AuditLog via a dedicated DB role scadalink_audit_writer granted INSERT + SELECT only: no UPDATE, no DELETE. Purge runs under a separate role scadalink_audit_purger whose permissions are limited to the partition-switch operation; row-level DELETE is not granted even to purge.
- CI grep guard. The build greps the data layer for any UPDATE … AuditLog or DELETE … AuditLog text and fails on a hit.
- Authorization. Reading the Audit Log requires the existing Audit role extended with a new OperationalAudit permission. Per-site row scoping reuses the existing site-permission model; bulk export requires an additional AuditExport permission.
- Payload redaction at write. See Payload Capture Policy. Unredacted secrets never persist; the safety net over-redacts on misconfiguration.
- Hash-chain tamper evidence (deferred to v1.x). A future RowHash column, computed per partition as SHA-256(prev.RowHash || canonical(row)), will be verifiable offline via scadalink audit verify-chain --month YYYY-MM. Off by default in v1.
- Site SQLite security. File permissions: read/write by the ScadaLink service account only. Not backed up off-machine; site SQLite is a buffer, not a record.
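The deferred hash chain, SHA-256(prev.RowHash || canonical(row)), can be sketched as below. The source does not define canonical(row) or the seed for the first row in a partition, so both are assumptions here: canonical form is JSON with sorted keys and compact separators, and the chain starts from an empty byte string.

```python
import hashlib
import json

def canonical(row):
    """Assumed canonical form: compact JSON with sorted keys, UTF-8 encoded."""
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")

def chain_hashes(rows, seed=b""):
    """Compute RowHash for each row as SHA-256(previous RowHash || canonical(row))."""
    prev = seed
    hashes = []
    for row in rows:
        prev = hashlib.sha256(prev + canonical(row)).digest()
        hashes.append(prev)
    return hashes

def verify_chain(rows, stored_hashes, seed=b""):
    """True when recomputing the chain reproduces the stored hashes exactly."""
    return chain_hashes(rows, seed) == stored_hashes

rows = [
    {"EventId": "e1", "Status": "Submitted"},
    {"EventId": "e2", "Status": "Delivered"},
]
stored = chain_hashes(rows)
tampered = [dict(rows[0], Status="Failed"), rows[1]]
print(verify_chain(rows, stored), verify_chain(tampered, stored))
```

Because each hash folds in its predecessor, altering any row changes that row's hash and every hash after it, so a single edit is detectable anywhere later in the partition.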
KPIs
Point-in-time, computed from the central AuditLog table; global and per-site.
- Audit volume: events/min landing in the central AuditLog; global plus per-site sparkline.
- Audit error rate: % of central AuditLog rows with Status IN ('Failed', 'Parked', 'Discarded') over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, permanent failures, parked deliveries), NOT audit-writer health, which surfaces separately via CentralAuditWriteFailures and AuditRedactionFailure.
- Audit backlog: sum of Pending site rows across sites; click drills into a per-site breakdown.
Notification Outbox and
Site Call Audit KPIs are unaffected — they remain
sourced from Notifications and SiteCalls respectively. Audit Log KPIs
describe the audit table itself.
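As an illustration of the error-rate KPI, the sketch below computes the percentage of error-status rows inside a rolling five-minute window. It assumes the window is measured on OccurredAtUtc (the source does not say whether OccurredAtUtc or IngestedAtUtc is used) and works on in-memory (timestamp, status) pairs rather than a SQL query; audit_error_rate is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

ERROR_STATUSES = {"Failed", "Parked", "Discarded"}

def audit_error_rate(rows, now, window=timedelta(minutes=5)):
    """Percentage of rows in [now - window, now] whose status is an error status."""
    recent = [status for occurred, status in rows if now - window <= occurred <= now]
    if not recent:
        return 0.0
    errors = sum(1 for status in recent if status in ERROR_STATUSES)
    return 100.0 * errors / len(recent)

now = datetime(2024, 6, 30, 12, 0, tzinfo=timezone.utc)
rows = [
    (now - timedelta(minutes=1), "Delivered"),
    (now - timedelta(minutes=2), "Failed"),
    (now - timedelta(minutes=3), "Attempted"),
    (now - timedelta(minutes=4), "Submitted"),
    (now - timedelta(minutes=30), "Parked"),  # outside the window, ignored
]
rate = audit_error_rate(rows, now)
print(rate)
```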
Configuration
Bound from appsettings.json to a new AuditLogOptions class owned by this
component (Options pattern):

    "AuditLog": {
      "DefaultCapBytes": 8192,
      "ErrorCapBytes": 65536,
      "HeaderRedactList": [ "Authorization", "Cookie", "Set-Cookie", "X-API-Key" ],
      "GlobalBodyRedactors": [
        { "Pattern": "\"password\"\\s*:\\s*\"[^\"]+\"", "Replacement": "\"password\":\"<redacted>\"" }
      ],
      "PerTargetOverrides": {
        "Weather/GetForecast": { "CapBytes": 4096 },
        "PlantDB": { "RedactSqlParamsMatching": "@apikey|@token" }
      },
      "RetentionDays": 365
    }

PerTargetOverrides keys bind by External System / Inbound Method /
Notification List / Database Connection name. RetentionDays is a single
global value in v1; per-channel overrides are deferred to v1.x.
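The documented bounds (RetentionDays 7 to 3650, InboundMaxBytes 8 192 to 16 777 216) suggest a validation step when options are bound. The sketch below is illustrative Python rather than the component's actual Options-pattern code; load_audit_options is a hypothetical name, and rejecting out-of-range values (instead of clamping them) is an assumption, since the source does not specify the behavior.

```python
import json

DEFAULTS = {
    "DefaultCapBytes": 8192,
    "ErrorCapBytes": 65536,
    "InboundMaxBytes": 1_048_576,
    "RetentionDays": 365,
}
BOUNDS = {
    "RetentionDays": (7, 3650),
    "InboundMaxBytes": (8192, 16_777_216),
}

def load_audit_options(settings_json):
    """Merge the AuditLog section over defaults and enforce the documented bounds."""
    section = json.loads(settings_json).get("AuditLog", {})
    options = {**DEFAULTS, **section}
    for key, (low, high) in BOUNDS.items():
        if not low <= options[key] <= high:
            raise ValueError(f"AuditLog:{key}={options[key]} is outside [{low}, {high}]")
    return options

options = load_audit_options('{"AuditLog": {"RetentionDays": 30}}')
try:
    load_audit_options('{"AuditLog": {"RetentionDays": 3}}')
    rejected = False
except ValueError:
    rejected = True
print(options["RetentionDays"], options["InboundMaxBytes"], rejected)
```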
Dependencies
- Commons (#16): AuditEvent, IAuditWriter / ICentralAuditWriter interfaces, and the AuditChannel, AuditKind, AuditStatus enum types live here.
- Configuration Database (#17): hosts the AuditLog table schema, the monthly partition function and scheme, the scadalink_audit_writer / scadalink_audit_purger DB roles, and the EF migration. Distinct concern from IAuditService (config-change audit), which is unchanged.
- Cluster Infrastructure (#13): singleton placement and supervision for AuditLogIngestActor, SiteAuditTelemetryActor, SiteAuditReconciliationActor, and AuditLogPurgeActor.
- Central–Site Communication (#5): carries audit telemetry. New gRPC message types (IngestAuditEvents, PullAuditEvents) are added to the existing site-stream proto additively.
- Site Runtime (#3): script-trust-boundary call paths invoke IAuditWriter to append events.
- Host (#15): registers this component (#23) under the central and site roles.
Interactions
- External System Gateway (#7): emits ApiOutbound.ApiCall rows on every sync Call(). For CachedCall, emits the combined cached telemetry packet (audit row + operational update) per Cached Operations — Combined Telemetry, using kinds CachedSubmit / ApiCallCached / CachedResolve.
- External System Gateway (#7), Database layer: the database access modes inside ESG emit DbOutbound.DbWrite rows on script-initiated Connection() calls (writes and reads share the kind; distinguish via Extra.rowsAffected vs Extra.rowsReturned); Database.CachedWrite emits the cached-write lifecycle rows via the combined-telemetry packet using kinds CachedSubmit / DbWriteCached / CachedResolve (same shape as ApiOutbound). Site Runtime is the API surface that exposes the Database.* calls to scripts; the audit emission itself lives in ESG.
- Inbound API (#14): emits one ApiInbound.InboundRequest row per successful request from request-handler middleware, written directly to central via ICentralAuditWriter before the response is flushed. Auth-layer rejections emit ApiInbound.InboundAuthFailure instead (Status=Failed, HTTP 401).
- Notification Outbox (#21): the site-emitted Notification.NotifySend row (Status=Submitted) flows via audit telemetry; the central dispatcher writes Notification.NotifyDeliver rows directly via ICentralAuditWriter, with Status=Attempted per delivery attempt and Status=Delivered/Parked/Discarded on terminal status. The operational Notifications table is unchanged.
- Site Call Audit (#22): shares the cached-call telemetry packet. Central ingest of that packet performs both the AuditLog insert and the SiteCalls upsert in one transaction. SiteCalls remains the operational state store; the Audit Log is its immutable shadow.
- Central UI (#9): a new Audit nav group hosts the Audit Log page (filter bar, results grid, drilldown drawer, server-side CSV export). Drill-in links appear on Notifications, Site Calls, External Systems, Inbound API key, Sites, and Instances detail pages. Double-clicking a node on the execution-tree page opens a detail modal listing that execution's audit rows, with click-through to each row's full detail view.
- Health Monitoring (#11): three new tiles (Volume, Error rate, Backlog) plus new health metrics: SiteAuditBacklog, SiteAuditWriteFailures, SiteAuditTelemetryStalled, CentralAuditWriteFailures, AuditRedactionFailure.
- CLI (#19): new scadalink audit query, scadalink audit export, and scadalink audit verify-chain commands; same permission requirements as the UI.