scadalink-design/docs/requirements/Component-AuditLog.md
Joseph Doherty 3592e74085 docs(audit): align alog.md + Component-AuditLog.md vocab with M1 enums (#23)
The M1 implementation (Bundle A) committed concrete AuditChannel /
AuditKind / AuditStatus enums that reflect CLAUDE.md's locked
cached-call lifecycle decisions. The older alog.md and
Component-AuditLog.md narratives still used pre-M1 vocabulary
(Success / TransientFailure / PermanentFailure / Enqueued / Retrying /
SyncCall / CachedEnqueued / Attempt / Terminal / Completed). This
commit reconciles both docs to the M1 vocabulary:

  AuditChannel  : ApiOutbound, DbOutbound, Notification, ApiInbound
  AuditKind (10): ApiCall, ApiCallCached, DbWrite, DbWriteCached,
                  NotifySend, NotifyDeliver, InboundRequest,
                  InboundAuthFailure, CachedSubmit, CachedResolve
  AuditStatus(8): Submitted, Forwarded, Attempted, Delivered, Failed,
                  Parked, Discarded, Skipped

Updates:
  - Status column description + worked examples use the new 8 values.
  - Kind table flattened from per-channel groupings to a single flat
    list of the 10 discriminators (no more SyncCall / Cached* /
    Attempt / Terminal / Completed).
  - Cached-call lifecycle examples rewritten to the
    CachedSubmit -> Forwarded -> Attempted... -> CachedResolve shape.
  - Notification lifecycle examples rewritten to
    NotifySend(Submitted) -> NotifyDeliver(Attempted) ->
    NotifyDeliver(Delivered/Parked/Discarded).
  - Inbound API examples split into InboundRequest (success path) and
    InboundAuthFailure (401 path).
  - 'Errors only' UI toggle, audit-error-rate KPI, and payload-cap
    decision (#6 in §16) all switched from 'non-Success' to
    Status IN ('Failed', 'Parked', 'Discarded').
  - Per-site event-rate table in §13.1 renamed to the new kinds.

Pure design correction; no operational behavior change. Per the
goal-prompt invariant #6, alog.md may change when a design correction
is committed before the affected code change — this commit is that
correction, landed ahead of the M1 merge so the merge order reads
design-first, code-second.

No code, test, or infra file changes.
2026-05-20 11:56:34 -04:00


Component: Audit Log

Purpose

Provides a single, append-only, forensic + operational record of every integration action initiated by, or terminating in, a script — across outbound API, outbound DB, notifications, and inbound API. One row per lifecycle event, rich payloads, long retention, dashboards, drilldowns, and filter queries, answering both forensic questions ("did instance X send notification Y on date Z, with what body?") and operational ones ("which inbound caller is hammering us right now?").

The Audit Log is not a dispatcher. It does not drive delivery, retry loops, or operator Retry/Discard actions — those remain in Notification Outbox and Site Call Audit. The Audit Log is the immutable history that observes those subsystems and adds coverage where they are silent (sync ExternalSystem.Call, sync DB writes and reads, inbound API requests).

Location

Central cluster and site clusters.

  • Central: the AuditLog table in central MS SQL, plus three singletons on the active central node — AuditLogIngestActor (telemetry receiver), SiteAuditReconciliationActor, and AuditLogPurgeActor.
  • Sites: a site-local AuditLog SQLite database file alongside the Store-and-Forward buffer, plus a SiteAuditTelemetryActor singleton on the active site node.

Registered as component #23 in the Host role configuration.

Responsibilities

  • Accept site-local hot-path audit writes from script-trust-boundary call paths.
  • Forward site audit rows to central via gRPC telemetry with at-least-once delivery and idempotency on EventId.
  • Run periodic per-site reconciliation pulls so missed telemetry self-heals.
  • Accept central-originated audit writes (Inbound API, Notification dispatch attempts and terminal status).
  • Compute point-in-time KPIs (global and per-site) from the central AuditLog table.
  • Purge expired rows by monthly partition switch — no row-level deletes.

Scope — the script trust boundary

The Audit Log captures every action a script causes to cross the cluster trust boundary:

| Channel | Trigger | Direction | Covered today? |
| --- | --- | --- | --- |
| ExternalSystem.Call(...) | Script | Outbound | No (gap) |
| ExternalSystem.CachedCall(...) | Script | Outbound | Yes: SiteCalls (Site Call Audit) |
| Database.Connection().Execute*(...) (writes) | Script | Outbound | No (gap) |
| Database.CachedWrite(...) | Script | Outbound | Yes: SiteCalls (Site Call Audit) |
| Notify.To(list).Send(...) | Script | Outbound | Yes: Notifications (Notification Outbox) |
| POST /api/{method} (Inbound API) | External | Inbound (invokes a script) | No (gap) |

Out of scope — framework traffic is not audited:

  • Health checks, heartbeats, cluster membership messages.
  • gRPC inter-cluster real-time streams (attribute values, alarm states).
  • Data Connection Layer ↔ OPC UA / custom protocol traffic.
  • LDAP authentication probes, Traefik routing decisions.
  • Internal Configuration Database queries by the framework.
  • Site Event Log writes; audit log writes themselves.

Script-initiated DB reads via Database.Connection().ExecuteReader(...) count as actions from a script and are in scope. Reads via DCL / subscriptions are framework traffic and excluded.

The AuditLog Table (central)

Single wide table in central MS SQL, polymorphic by Channel + Kind discriminators, with a JSON Extra column for channel-specific overflow. One row per lifecycle event across all channels.

| Column | Type | Notes |
| --- | --- | --- |
| EventId | uniqueidentifier PK | Generated where the event originates (site or central). Idempotency key. |
| OccurredAtUtc | datetime2 | When the event happened (call returned, retry attempted, etc.). |
| IngestedAtUtc | datetime2 | When central persisted the row (lags OccurredAtUtc for site-originated rows). |
| Channel | varchar(32) | One of ApiOutbound, DbOutbound, Notification, ApiInbound. |
| Kind | varchar(32) | Event kind discriminator (see kinds list below). |
| CorrelationId | uniqueidentifier NULL | Ties multi-event operations together. TrackedOperationId for cached calls, NotificationId for notifications, request-id for inbound API. NULL for sync one-shot calls. |
| SourceSiteId | varchar(64) NULL | NULL for central-originated events. |
| SourceInstanceId | varchar(128) NULL | Instance whose script initiated the action (when applicable). |
| SourceScript | varchar(128) NULL | Script name within the instance. |
| Actor | varchar(128) NULL | Inbound API: API key name. Outbound: script identity. Central: system user. |
| Target | varchar(256) NULL | Outbound API: external system + method. DB: connection name. Notification: list name. Inbound API: method name. |
| Status | varchar(32) | Outcome of this event: Submitted, Forwarded, Attempted, Delivered, Failed, Parked, Discarded, Skipped. |
| HttpStatus | int NULL | HTTP-bearing events only. |
| DurationMs | int NULL | Call / attempt duration. |
| ErrorMessage | nvarchar(1024) NULL | Truncated; see ErrorDetail for full text. |
| ErrorDetail | nvarchar(max) NULL | Optional full exception text on failures. |
| RequestSummary | nvarchar(max) NULL | Truncated request payload (configurable cap). Headers redacted. |
| ResponseSummary | nvarchar(max) NULL | Truncated response payload. Full on errors. |
| PayloadTruncated | bit | Set if either summary was truncated. |
| Extra | nvarchar(max) NULL | Channel-specific JSON for fields we don't promote to columns. |

Indexes (first cut):

  • IX_AuditLog_OccurredAtUtc — primary time-range index for global scans.
  • IX_AuditLog_Site_Occurred (SourceSiteId, OccurredAtUtc) — per-site filters.
  • IX_AuditLog_Correlation (CorrelationId) — drilldown from a single operation.
  • IX_AuditLog_Channel_Status_Occurred (Channel, Status, OccurredAtUtc) — KPI / dashboard tiles.
  • IX_AuditLog_Target_Occurred (Target, OccurredAtUtc) — "what did we send to system X".
  • Monthly partitioning on OccurredAtUtc from day one; purge is a partition switch (see Retention & Purge).

Kind values (flat — 10 discriminators across all channels):

| Kind | Fires when |
| --- | --- |
| ApiCall | Sync ExternalSystem.Call(...) returns (success or permanent failure). One row per call. |
| ApiCallCached | A cached outbound-API attempt records its forward-ack (Forwarded) or each retry (Attempted). |
| DbWrite | Sync Database.Connection().Execute*(...) / ExecuteReader(...) completes. One row per call. |
| DbWriteCached | A cached outbound-DB attempt records its forward-ack (Forwarded) or each retry (Attempted). |
| NotifySend | Script's Notify.Send(...) is enqueued on the site; first row in a notification's lifecycle (Status=Submitted). |
| NotifyDeliver | Central Notification Outbox dispatcher records a delivery attempt (Attempted) or terminal outcome (Delivered/Parked/Discarded). |
| InboundRequest | An inbound API request completes; one row per request, written at request end with final status. |
| InboundAuthFailure | An inbound API request was rejected at the auth boundary (bad/missing key). One row, Status=Failed, HttpStatus=401. |
| CachedSubmit | Script-side enqueue of a cached call (ExternalSystem.CachedCall / Database.CachedWrite); first row in the cached-call lifecycle, written to site SQLite before any forward attempt. |
| CachedResolve | Terminal row for a cached operation; Status = Delivered / Failed / Parked / Discarded. |

Inbound API is intentionally collapsed to a single InboundRequest (or InboundAuthFailure for auth rejections) row per request rather than a multi-event lifecycle.
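
As a rough illustration of the vocabulary above, the three enums could be modelled as follows. This is a Python sketch only; the real types live in Commons, and the `ERROR_STATUSES` set and `is_error` helper are hypothetical names introduced here to mirror the `Status IN ('Failed', 'Parked', 'Discarded')` predicate used elsewhere in this document.

```python
from enum import Enum


class AuditChannel(str, Enum):
    ApiOutbound = "ApiOutbound"
    DbOutbound = "DbOutbound"
    Notification = "Notification"
    ApiInbound = "ApiInbound"


class AuditKind(str, Enum):
    ApiCall = "ApiCall"
    ApiCallCached = "ApiCallCached"
    DbWrite = "DbWrite"
    DbWriteCached = "DbWriteCached"
    NotifySend = "NotifySend"
    NotifyDeliver = "NotifyDeliver"
    InboundRequest = "InboundRequest"
    InboundAuthFailure = "InboundAuthFailure"
    CachedSubmit = "CachedSubmit"
    CachedResolve = "CachedResolve"


class AuditStatus(str, Enum):
    Submitted = "Submitted"
    Forwarded = "Forwarded"
    Attempted = "Attempted"
    Delivered = "Delivered"
    Failed = "Failed"
    Parked = "Parked"
    Discarded = "Discarded"
    Skipped = "Skipped"


# Hypothetical helper: the statuses this document treats as error outcomes.
ERROR_STATUSES = frozenset(
    {AuditStatus.Failed, AuditStatus.Parked, AuditStatus.Discarded}
)


def is_error(status: AuditStatus) -> bool:
    """True when a row counts toward the 'errors only' filter and error-rate KPI."""
    return status in ERROR_STATUSES
```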

The Site-Local AuditLog (SQLite)

A SQLite database file on each site node, alongside the Store-and-Forward buffer. Same schema as central minus IngestedAtUtc (irrelevant at the source), plus a ForwardState column with values Pending | Forwarded | Reconciled that drives the telemetry loop and reconciliation pull.

Site SQLite retention rule (hard invariant):

A row is eligible for purge only when both OccurredAtUtc < retention threshold AND ForwardState IN ('Forwarded', 'Reconciled'). Pending rows are never purged.

A prolonged central outage will grow the site audit table indefinitely until central is reachable again. This is intentional — losing audit rows to make room is a compliance violation, not a self-healing behavior. To bound that growth in practice, the site emits a SiteAuditBacklog health metric (pending row count, oldest pending age, bytes on disk); crossing operator-configured thresholds surfaces a warning on the relevant site tile in the Health dashboard, mirroring the Store-and-Forward Engine's backlog metric.

Central is the durable home. Site SQLite is a write-buffer with a forwarding guarantee.
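
The retention invariant can be expressed as a single DELETE predicate. The sketch below uses Python's sqlite3 module against a deliberately reduced table; the column subset, the `TS_FORMAT` text encoding of timestamps, and the function names are assumptions made for illustration, not the actual site schema or purge job.

```python
import sqlite3
from datetime import datetime, timedelta

# Assumption: timestamps are stored as fixed-width UTC text, so that
# string comparison in SQL matches chronological order.
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def create_site_audit_table(conn: sqlite3.Connection) -> None:
    """Create a reduced stand-in for the site-local AuditLog table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS AuditLog (
               EventId       TEXT PRIMARY KEY,
               OccurredAtUtc TEXT NOT NULL,
               ForwardState  TEXT NOT NULL DEFAULT 'Pending'
           )"""
    )


def purge_site_rows(conn: sqlite3.Connection, now: datetime,
                    retention_days: int = 7) -> int:
    """Delete rows that are both past retention AND already forwarded.

    Pending rows never match the predicate, whatever their age.
    Returns the number of rows removed.
    """
    cutoff = (now - timedelta(days=retention_days)).strftime(TS_FORMAT)
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "DELETE FROM AuditLog "
            "WHERE OccurredAtUtc < ? "
            "AND ForwardState IN ('Forwarded', 'Reconciled')",
            (cutoff,),
        )
    return cur.rowcount
```

Because the ForwardState condition sits in the same WHERE clause as the age condition, there is no code path in this sketch that can remove a Pending row on age alone.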

Ingestion Paths

Four paths feed the central AuditLog — one site originator and three central writers — all idempotent on EventId.

Site hot-path append (site-originated events)

The component completing a script-trust-boundary action (External System Gateway, Database layer, Store-and-Forward Engine) builds an AuditEvent with a fresh EventId (Guid v4) and OccurredAtUtc = UtcNow, then appends it to the site-local AuditLog SQLite via IAuditWriter with ForwardState = 'Pending'. The append is a single-statement INSERT and is durable in microseconds; control returns to the script with no central round-trip on the hot path.

Telemetry forward (site → central)

A SiteAuditTelemetryActor singleton drives the forwarding loop: select up to N Pending rows ordered by OccurredAtUtc, batch-send them to central via the existing SiteStream gRPC channel as IngestAuditEvents(events), and on central-ack flip ForwardState = 'Forwarded' for accepted IDs. Rejected IDs stay Pending for the next sweep. Cadence is short (default 5 s) when non-empty, longer (default 30 s) when idle; telemetry runs on a dedicated dispatcher.
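
One sweep of the forwarding loop might look like the sketch below. It is an illustration under stated assumptions: `send_batch` is a hypothetical stand-in for the IngestAuditEvents gRPC call and is assumed to return the set of EventIds central accepted; the column subset and function name are likewise invented for the example, and the adaptive 5 s / 30 s cadence is left out.

```python
import sqlite3
from typing import Callable, Dict, List, Set

# Hypothetical transport signature: takes the batch, returns accepted EventIds.
SendBatch = Callable[[List[Dict[str, str]]], Set[str]]

_COLUMNS = ("EventId", "OccurredAtUtc", "Kind", "Status")


def forward_sweep(conn: sqlite3.Connection, send_batch: SendBatch,
                  batch_size: int = 100) -> int:
    """Send up to batch_size Pending rows and mark the accepted ones Forwarded.

    Rows central did not acknowledge stay Pending for the next sweep.
    Returns the number of rows flipped to Forwarded.
    """
    rows = conn.execute(
        "SELECT EventId, OccurredAtUtc, Kind, Status FROM AuditLog "
        "WHERE ForwardState = 'Pending' ORDER BY OccurredAtUtc LIMIT ?",
        (batch_size,),
    ).fetchall()
    if not rows:
        return 0

    events = [dict(zip(_COLUMNS, row)) for row in rows]
    sent_ids = {event["EventId"] for event in events}

    # Only trust acknowledgements for IDs that were actually in this batch.
    accepted = set(send_batch(events)) & sent_ids

    with conn:
        conn.executemany(
            "UPDATE AuditLog SET ForwardState = 'Forwarded' "
            "WHERE EventId = ? AND ForwardState = 'Pending'",
            [(event_id,) for event_id in accepted],
        )
    return len(accepted)
```

If the send fails outright (the callable raises), no row is updated, so the same batch is retried on the next sweep; central's idempotency on EventId makes that retry safe.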

Reconciliation pull (self-healing for missed telemetry)

A central SiteAuditReconciliationActor periodically (default 5 min per site) asks each site for its oldest Pending row and pending count; if backlog is non-draining (e.g., telemetry actor wedged), central issues a PullAuditEvents(sinceUtc, batchSize) and inserts-if-not-exists. Accepted rows are flipped to ForwardState = 'Reconciled' site-side. Same self-healing pattern as Site Call Audit's reconciliation of SiteCalls.

Central direct-write (central-originated events)

Events originating at central never touch site SQLite. Inbound API writes one ApiInbound.InboundRequest row via ICentralAuditWriter synchronously inside the request-handler middleware, before the HTTP response is flushed; auth-layer rejections emit ApiInbound.InboundAuthFailure (Status=Failed, HTTP 401) instead. The Notification Outbox dispatcher writes Notification.NotifyDeliver with Status=Attempted per delivery attempt and Notification.NotifyDeliver with Status=Delivered/Parked/Discarded on terminal status. Central direct-writes use the same insert-if-not-exists semantics keyed on EventId.

Cached Operations — Combined Telemetry

For ExternalSystem.CachedCall and Database.CachedWrite, the site is the source of truth for every audit row. The site writes each lifecycle event — CachedSubmit (Status=Submitted), then ApiCallCached/DbWriteCached rows for the forward-ack (Status=Forwarded) and each retry (Status=Attempted), then a terminal CachedResolve row (Status=Delivered/Failed/Parked/Discarded) — to its local SQLite AuditLog on the hot path (or on the retry tick for Attempted rows), then forwards via the same telemetry channel. The telemetry message format gains the audit-row fields additively — one packet per lifecycle transition carries both the operational state update AND the audit row content.

On receipt, central performs both writes in one transaction:

  1. Insert-if-not-exists the immutable AuditLog row, keyed on EventId.
  2. Upsert the operational SiteCalls row — existing Site Call Audit behavior (status, retry count, last error, timestamps).

This collapses two telemetry concerns into one, keeps site SQLite as the single local source of truth for audit content, and preserves the existing operational SiteCalls shape for the dispatcher and UI.
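
The two-write transaction can be sketched as follows, using an in-memory SQLite database as a stand-in for central MS SQL. The reduced table shapes, the packet field names, and `ingest_cached_packet` are assumptions for illustration; ordering of late or out-of-order packets is not handled here.

```python
import sqlite3

# Reduced stand-ins for the central tables (assumed column subset).
SCHEMA = """
CREATE TABLE AuditLog (
    EventId TEXT PRIMARY KEY, CorrelationId TEXT, Kind TEXT, Status TEXT);
CREATE TABLE SiteCalls (
    TrackedOperationId TEXT PRIMARY KEY, Status TEXT,
    RetryCount INTEGER, LastError TEXT);
"""


def ingest_cached_packet(conn: sqlite3.Connection, packet: dict) -> None:
    """Apply one combined telemetry packet: audit insert + SiteCalls upsert.

    Both statements run in a single transaction, so either both land or
    neither does. Replaying the same packet leaves the tables unchanged.
    """
    with conn:  # one transaction; rolled back if either statement fails
        # 1. Insert-if-not-exists the immutable audit row, keyed on EventId.
        conn.execute(
            "INSERT INTO AuditLog (EventId, CorrelationId, Kind, Status) "
            "SELECT ?, ?, ?, ? "
            "WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = ?)",
            (packet["EventId"], packet["TrackedOperationId"],
             packet["Kind"], packet["Status"], packet["EventId"]),
        )
        # 2. Upsert the operational row for the tracked operation.
        conn.execute(
            "INSERT INTO SiteCalls "
            "(TrackedOperationId, Status, RetryCount, LastError) "
            "VALUES (?, ?, ?, ?) "
            "ON CONFLICT(TrackedOperationId) DO UPDATE SET "
            "Status = excluded.Status, "
            "RetryCount = excluded.RetryCount, "
            "LastError = excluded.LastError",
            (packet["TrackedOperationId"], packet["Status"],
             packet["RetryCount"], packet["LastError"]),
        )
```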

Payload Capture Policy

  • Default cap — 8 KB for each of RequestSummary and ResponseSummary; raised to 64 KB on any error row (Status IN ('Failed', 'Parked', 'Discarded')).
  • Truncation — UTF-8 byte-safe; PayloadTruncated = 1 when applied. Full bodies are never stored.
  • HTTP headers: Authorization, Cookie, Set-Cookie, X-API-Key, and any header matching the configured redact-list regex become <redacted>.
  • HTTP bodies — captured verbatim by default. Operators register per-target body redactors (regex → replacement) for known secret fields.
  • SQL — statement text and parameter values captured verbatim by default; per-connection opt-in to redact parameters whose name matches a regex.
  • Never captured — raw API key material (only the key name via Actor), LDAP bind credentials, cluster secrets, Configuration DB connection strings.
  • Safety net — if a configured redactor throws, the affected payload becomes "<redacted: redactor error>" and AuditRedactionFailure increments. We over-redact, never under-redact, on configuration faults.

Redaction happens at the write site, before the row touches SQLite (or central MS SQL for direct-write events). Unredacted secrets never persist.
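
A minimal sketch of the capture rules above, assuming string payloads and a plain header dictionary. All function names are hypothetical, the regex redact-list is omitted, and the AuditRedactionFailure metric is noted in a comment rather than emitted.

```python
from typing import Callable, Dict, Tuple

DEFAULT_CAP_BYTES = 8192
ERROR_CAP_BYTES = 65536
ERROR_STATUSES = {"Failed", "Parked", "Discarded"}
REDACTED_HEADERS = {"authorization", "cookie", "set-cookie", "x-api-key"}


def cap_for(status: str) -> int:
    """Error rows get the larger cap; everything else gets the default."""
    return ERROR_CAP_BYTES if status in ERROR_STATUSES else DEFAULT_CAP_BYTES


def truncate_utf8(text: str, cap_bytes: int) -> Tuple[str, bool]:
    """Truncate to at most cap_bytes of UTF-8 without splitting a character.

    Returns (summary, truncated_flag); the flag maps to PayloadTruncated.
    """
    raw = text.encode("utf-8")
    if len(raw) <= cap_bytes:
        return text, False
    # errors="ignore" drops a trailing partial multi-byte sequence, if any.
    return raw[:cap_bytes].decode("utf-8", errors="ignore"), True


def redact_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Replace the values of sensitive headers, matching names case-insensitively."""
    return {
        name: "<redacted>" if name.lower() in REDACTED_HEADERS else value
        for name, value in headers.items()
    }


def safe_redact(body: str, redactor: Callable[[str], str]) -> str:
    """Run a configured body redactor; over-redact if the redactor itself fails."""
    try:
        return redactor(body)
    except Exception:
        # A real writer would also increment AuditRedactionFailure here.
        return "<redacted: redactor error>"
```

Running redaction before truncation, in that order, avoids a secret being cut in half and then slipping past a redactor pattern; the document does not state the order, so treat that as a design suggestion.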

Failure Handling & Idempotency

  • EventId is the dedup key. Generated at the originator; central ingest is INSERT … WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = @id) under the PK constraint. Idempotent across telemetry retries, reconciliation pulls, and any combination of the two.
  • Never fail the action. A failed audit write — site SQLite or central direct-write — logs a critical Site Event Log entry and increments a health metric (SiteAuditWriteFailures or CentralAuditWriteFailures), but the user-facing action proceeds. We do not fail script-initiated work because the audit write failed.
  • Hot-path ring buffer. While the site audit writer is unhealthy (disk full, schema lock, transient IO), events buffer in a small in-memory ring (default 1024 rows); oldest are discarded with a Site Event Log warning per drop.
  • Reconciliation as fallback. If two consecutive reconciliation cycles report a non-draining backlog, the supervisor restarts the telemetry actor and a SiteAuditTelemetryStalled event fires.
  • No dedup horizon. EventId PK enforces uniqueness only while a row exists. A retry that arrives after the original row is purged inserts a "new" row — vanishingly rare and harmless.
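
The hot-path ring buffer described above behaves roughly like this sketch. The class name and the `dropped` counter are hypothetical; in the real component each drop would also produce a Site Event Log warning.

```python
from collections import deque
from typing import Any, Deque, List


class AuditRingBuffer:
    """Bounded in-memory buffer that discards the oldest event when full."""

    def __init__(self, capacity: int = 1024) -> None:
        self._capacity = capacity
        self._rows: Deque[Any] = deque()
        self.dropped = 0  # stand-in for one warning per discarded event

    def push(self, event: Any) -> None:
        """Buffer an event, evicting the oldest one if the ring is full."""
        if len(self._rows) >= self._capacity:
            self._rows.popleft()
            self.dropped += 1
        self._rows.append(event)

    def drain(self) -> List[Any]:
        """Return buffered events oldest-first and empty the ring.

        Intended to be called once the site audit writer is healthy again.
        """
        rows = list(self._rows)
        self._rows.clear()
        return rows
```

For example, with a capacity of 3, pushing five events leaves the last three in the buffer and records two drops.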

Retention & Purge

  • Central: 365-day default based on OccurredAtUtc, configurable via AuditLog:RetentionDays (min 7, max 3650). Single global retention in v1 — no per-channel overrides.
  • Partitioning: monthly partitions on OccurredAtUtc from day one (pf_AuditLog_Month / ps_AuditLog_Month). Purge is a partition switch; there are no row-level deletes at central.
  • Purge actor: AuditLogPurgeActor singleton on the active central node runs daily, switches out any partition whose latest OccurredAtUtc is older than the retention window, and emits an AuditLog:Purged event (partition range, rowcount, duration). A partition-maintenance step rolls forward each month, creating the next month's partition ahead of time.
  • Sites: daily site job; default 7-day retention (configurable, min 1, max 90). Respects the hard ForwardState invariant — Pending rows are never purged on age alone.
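
Choosing which monthly partitions the purge actor may switch out can be sketched as a date comparison. This sketch makes a conservative assumption: instead of querying each partition's latest OccurredAtUtc, it uses the partition's exclusive upper boundary (the first day of the following month), so a partition is selected only once every possible row in it is past retention. Function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterable, List, Tuple

Month = Tuple[int, int]  # (year, month)


def month_upper_bound(year: int, month: int) -> date:
    """Exclusive upper boundary of a monthly partition."""
    if month == 12:
        return date(year + 1, 1, 1)
    return date(year, month + 1, 1)


def purgeable_months(partitions: Iterable[Month], today: date,
                     retention_days: int = 365) -> List[Month]:
    """Return the partitions that lie entirely outside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return [
        (year, month)
        for year, month in partitions
        if month_upper_bound(year, month) <= cutoff
    ]
```

With today set to 2026-05-20 and the default 365 days, the cutoff is 2025-05-20: the April 2025 partition qualifies, while May 2025 does not because it may still hold rows from 2025-05-20 onward.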

Security & Tamper-Evidence

  • Append-only enforcement. The application accesses AuditLog via a dedicated DB role scadalink_audit_writer granted INSERT + SELECT only — no UPDATE, no DELETE. Purge runs under a separate role scadalink_audit_purger whose permissions are limited to the partition-switch operation; row-level DELETE is not granted even to purge.
  • CI grep guard. The build greps the data layer for any UPDATE … AuditLog or DELETE … AuditLog text and fails on a hit.
  • Authorization. Reading the Audit Log requires the existing Audit role extended with a new OperationalAudit permission. Per-site row scoping reuses the existing site-permission model; bulk export requires an additional AuditExport permission.
  • Payload redaction at write. See Payload Capture Policy. Unredacted secrets never persist; the safety net over-redacts on misconfiguration.
  • Hash-chain tamper evidence — deferred to v1.x. A future RowHash column, computed per partition as SHA-256(prev.RowHash || canonical(row)), will be verifiable offline via scadalink audit verify-chain --month YYYY-MM. Off by default in v1.
  • Site SQLite security. File permissions: read/write by the ScadaLink service account only. Not backed up off-machine — site SQLite is a buffer, not a record.
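
The deferred hash chain, SHA-256(prev.RowHash || canonical(row)), could work along these lines. The document does not define `canonical(row)` or the seed for the first row of a partition; sorted-key compact JSON and an empty seed are assumptions made purely for this sketch.

```python
import hashlib
import json
from typing import Dict, Iterable, List


def canonical(row: Dict[str, object]) -> bytes:
    """Assumed canonical form: compact JSON with sorted keys, UTF-8 encoded."""
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")


def chain_hashes(rows: Iterable[Dict[str, object]], seed: bytes = b"") -> List[bytes]:
    """Compute RowHash for each row, chaining from the previous row's hash."""
    previous = seed
    hashes = []
    for row in rows:
        previous = hashlib.sha256(previous + canonical(row)).digest()
        hashes.append(previous)
    return hashes


def verify_chain(rows: Iterable[Dict[str, object]], stored: Iterable[bytes],
                 seed: bytes = b"") -> bool:
    """Recompute the chain and compare it with the stored RowHash values."""
    return chain_hashes(rows, seed) == list(stored)
```

Because each hash depends on every earlier row in the partition, altering or removing any row changes all subsequent hashes, which is what makes offline verification possible.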

KPIs

Point-in-time, computed from the central AuditLog table; global and per-site.

  • Audit volume — events/min landing in the central AuditLog; global plus per-site sparkline.
  • Audit error rate — % of central AuditLog rows with Status IN ('Failed', 'Parked', 'Discarded') over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, permanent failures, parked deliveries) — NOT audit-writer health, which surfaces separately via CentralAuditWriteFailures and AuditRedactionFailure.
  • Audit backlog — sum of Pending site rows across sites; click drills into a per-site breakdown.

Notification Outbox and Site Call Audit KPIs are unaffected — they remain sourced from Notifications and SiteCalls respectively. Audit Log KPIs describe the audit table itself.
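
The audit error rate KPI reduces to a filtered ratio. The sketch below computes it in memory from (timestamp, status) pairs; it assumes the rolling window is measured on OccurredAtUtc, which the document does not specify, and a production version would run as a query against the central table instead.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

ERROR_STATUSES = {"Failed", "Parked", "Discarded"}


def audit_error_rate(rows: Iterable[Tuple[datetime, str]], now: datetime,
                     window: timedelta = timedelta(minutes=5)) -> float:
    """Percentage of rows in the window whose Status is an error outcome.

    Returns 0.0 when the window holds no rows.
    """
    start = now - window
    recent = [status for occurred, status in rows if start <= occurred <= now]
    if not recent:
        return 0.0
    errors = sum(1 for status in recent if status in ERROR_STATUSES)
    return 100.0 * errors / len(recent)
```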

Configuration

Bound from appsettings.json to a new AuditLogOptions class owned by this component (Options pattern):

"AuditLog": {
  "DefaultCapBytes": 8192,
  "ErrorCapBytes": 65536,
  "HeaderRedactList": [ "Authorization", "Cookie", "Set-Cookie", "X-API-Key" ],
  "GlobalBodyRedactors": [
    { "Pattern": "\"password\"\\s*:\\s*\"[^\"]+\"", "Replacement": "\"password\":\"<redacted>\"" }
  ],
  "PerTargetOverrides": {
    "Weather/GetForecast": { "CapBytes": 4096 },
    "PlantDB":             { "RedactSqlParamsMatching": "@apikey|@token" }
  },
  "RetentionDays": 365
}

PerTargetOverrides keys bind by External System / Inbound Method / Notification List / Database Connection name. RetentionDays is a single global value in v1; per-channel overrides are deferred to v1.x.

Dependencies

  • Commons (#16): AuditEvent, IAuditWriter / ICentralAuditWriter interfaces, and the AuditChannel, AuditKind, AuditStatus enum types live here.
  • Configuration Database (#17) — hosts the AuditLog table schema, the monthly partition function and scheme, the scadalink_audit_writer / scadalink_audit_purger DB roles, and the EF migration. Distinct concern from IAuditService (config-change audit), which is unchanged.
  • Cluster Infrastructure (#13) — singleton placement and supervision for AuditLogIngestActor, SiteAuditTelemetryActor, SiteAuditReconciliationActor, and AuditLogPurgeActor.
  • Central-Site Communication (#5): carries audit telemetry. New gRPC message types (IngestAuditEvents, PullAuditEvents) are added to the existing site-stream proto additively.
  • Site Runtime (#3) — script-trust-boundary call paths invoke IAuditWriter to append events.
  • Host (#15) — registers this component (#23) under the central and site roles.

Interactions

  • External System Gateway (#7) — emits ApiOutbound.ApiCall rows on every sync Call(). For CachedCall, emits the combined cached telemetry packet (audit row + operational update) per Cached Operations — Combined Telemetry, using kinds CachedSubmit / ApiCallCached / CachedResolve.
  • External System Gateway (#7) — Database layer — the database access modes inside ESG emit DbOutbound.DbWrite rows on script-initiated Connection() calls (writes and reads share the kind; distinguish via Extra.rowsAffected vs Extra.rowsReturned); Database.CachedWrite emits the cached-write lifecycle rows via the combined-telemetry packet using kinds CachedSubmit / DbWriteCached / CachedResolve (same shape as ApiOutbound). Site Runtime is the API surface that exposes the Database.* calls to scripts; the audit emission itself lives in ESG.
  • Inbound API (#14) — emits one ApiInbound.InboundRequest row per successful request from request-handler middleware, written directly to central via ICentralAuditWriter before the response is flushed. Auth-layer rejections emit ApiInbound.InboundAuthFailure instead (Status=Failed, HTTP 401).
  • Notification Outbox (#21): the site-emitted Notification.NotifySend row (Status=Submitted) flows via audit telemetry; the central dispatcher writes Notification.NotifyDeliver rows directly via ICentralAuditWriter, with Status=Attempted per delivery attempt and Status=Delivered/Parked/Discarded on terminal status. The operational Notifications table is unchanged.
  • Site Call Audit (#22) — shares the cached-call telemetry packet. Central ingest of that packet performs both the AuditLog insert and the SiteCalls upsert in one transaction. SiteCalls remains the operational state store; the Audit Log is its immutable shadow.
  • Central UI (#9) — a new Audit nav group hosts the Audit Log page (filter bar, results grid, drilldown drawer, server-side CSV export). Drill-in links appear on Notifications, Site Calls, External Systems, Inbound API key, Sites, and Instances detail pages.
  • Health Monitoring (#11) — three new tiles (Volume, Error rate, Backlog) plus new health metrics: SiteAuditBacklog, SiteAuditWriteFailures, SiteAuditTelemetryStalled, CentralAuditWriteFailures, AuditRedactionFailure.
  • CLI (#19) — new scadalink audit query, scadalink audit export, and scadalink audit verify-chain commands; same permission requirements as the UI.