docs(audit): add Component-AuditLog (#23) design document

Joseph Doherty
2026-05-20 07:36:35 -04:00
parent d93ca4c56e
commit c334de03f4


@@ -0,0 +1,384 @@
# Component: Audit Log
## Purpose
Provides a single, append-only, forensic + operational record of every
integration action initiated by, or terminating in, a script — across outbound
API, outbound DB, notifications, and inbound API. One row per lifecycle event,
rich payloads, long retention, dashboards plus drilldowns plus filter queries,
answering both forensic questions ("did instance X send notification Y on date
Z, with what body?") and operational ones ("which inbound caller is hammering
us right now?").
The Audit Log is **not a dispatcher**. It does not drive delivery, retry loops,
or operator Retry/Discard actions — those remain in [Notification Outbox](Component-NotificationOutbox.md)
and [Site Call Audit](Component-SiteCallAudit.md). The Audit Log is the
immutable history that **observes** those subsystems and adds coverage where
they are silent (sync `ExternalSystem.Call`, sync DB writes and reads, inbound
API requests).
## Location
Central cluster and site clusters.
- **Central:** the `AuditLog` table in central MS SQL, plus three singletons on
the active central node — `AuditLogIngestActor` (telemetry receiver),
`SiteAuditReconciliationActor`, and `AuditLogPurgeActor`.
- **Sites:** a site-local `AuditLog` SQLite database file alongside the
Store-and-Forward buffer, plus a `SiteAuditTelemetryActor` singleton on the
active site node.
Registered as component #23 in the Host role configuration.
## Responsibilities
- Accept site-local hot-path audit writes from script-trust-boundary call paths.
- Forward site audit rows to central via gRPC telemetry with at-least-once
delivery and idempotency on `EventId`.
- Run periodic per-site reconciliation pulls so missed telemetry self-heals.
- Accept central-originated audit writes (Inbound API, Notification dispatch
attempts and terminal status).
- Compute point-in-time KPIs (global and per-site) from the central `AuditLog`
table.
- Purge expired rows by monthly partition switch — no row-level deletes.
## Scope — the script trust boundary
The Audit Log captures every action a script causes to cross the cluster trust
boundary:
| Channel | Trigger | Direction | Covered today? |
|---|---|---|---|
| `ExternalSystem.Call(...)` | Script | Outbound | No (gap) |
| `ExternalSystem.CachedCall(...)` | Script | Outbound | Yes — `SiteCalls` (Site Call Audit) |
| `Database.Connection().Execute*(...)` — writes | Script | Outbound | No (gap) |
| `Database.CachedWrite(...)` | Script | Outbound | Yes — `SiteCalls` (Site Call Audit) |
| `Notify.To(list).Send(...)` | Script | Outbound | Yes — `Notifications` (Notification Outbox) |
| `POST /api/{method}` (Inbound API) | External | Inbound (invokes a script) | No (gap) |
Out of scope — framework traffic is not audited:
- Health checks, heartbeats, cluster membership messages.
- gRPC inter-cluster real-time streams (attribute values, alarm states).
- Data Connection Layer ↔ OPC UA / custom protocol traffic.
- LDAP authentication probes, Traefik routing decisions.
- Internal Configuration Database queries by the framework.
- Site Event Log writes; audit log writes themselves.
Script-initiated DB **reads** via `Database.Connection().ExecuteReader(...)`
count as actions from a script and are in scope. Reads via DCL / subscriptions
are framework traffic and excluded.
## The `AuditLog` Table (central)
Single wide table in central MS SQL, polymorphic by `Channel` + `Kind`
discriminators, with a JSON `Extra` column for channel-specific overflow. One
row per lifecycle event across all channels.
| Column | Type | Notes |
|---|---|---|
| `EventId` | `uniqueidentifier` PK | Generated where the event originates (site or central). Idempotency key. |
| `OccurredAtUtc` | `datetime2` | When the event happened (call returned, retry attempted, etc.). |
| `IngestedAtUtc` | `datetime2` | When central persisted the row (lags `OccurredAtUtc` for site-originated rows). |
| `Channel` | `varchar(32)` | `ApiOutbound` \| `DbOutbound` \| `Notification` \| `ApiInbound`. |
| `Kind` | `varchar(32)` | Channel-specific event kind (see below). |
| `CorrelationId` | `uniqueidentifier` NULL | Ties multi-event operations together. `TrackedOperationId` for cached calls, `NotificationId` for notifications, request-id for inbound API. NULL for sync one-shot calls. |
| `SourceSiteId` | `varchar(64)` NULL | NULL for central-originated events. |
| `SourceInstanceId` | `varchar(128)` NULL | Instance whose script initiated the action (when applicable). |
| `SourceScript` | `varchar(128)` NULL | Script name within the instance. |
| `Actor` | `varchar(128)` NULL | Inbound API: API key name. Outbound: script identity. Central: system user. |
| `Target` | `varchar(256)` NULL | Outbound API: external system + method. DB: connection name. Notification: list name. Inbound API: method name. |
| `Status` | `varchar(32)` | Outcome of *this event*: `Success`, `TransientFailure`, `PermanentFailure`, `Enqueued`, `Retrying`, `Delivered`, `Parked`, `Discarded`. |
| `HttpStatus` | `int` NULL | HTTP-bearing events only. |
| `DurationMs` | `int` NULL | Call / attempt duration. |
| `ErrorMessage` | `nvarchar(1024)` NULL | Truncated; `ErrorDetail` for full text. |
| `ErrorDetail` | `nvarchar(max)` NULL | Optional full exception text on failures. |
| `RequestSummary` | `nvarchar(max)` NULL | Truncated request payload (configurable cap). Headers redacted. |
| `ResponseSummary` | `nvarchar(max)` NULL | Truncated response payload. The cap is raised on non-`Success` rows (see Payload Capture Policy). |
| `PayloadTruncated` | `bit` | Set if either summary was truncated. |
| `Extra` | `nvarchar(max)` NULL | Channel-specific JSON for fields we don't promote to columns. |
**Indexes (first cut):**
- `IX_AuditLog_OccurredAtUtc` — primary time-range index for global scans.
- `IX_AuditLog_Site_Occurred (SourceSiteId, OccurredAtUtc)` — per-site filters.
- `IX_AuditLog_Correlation (CorrelationId)` — drilldown from a single operation.
- `IX_AuditLog_Channel_Status_Occurred (Channel, Status, OccurredAtUtc)` — KPI / dashboard tiles.
- `IX_AuditLog_Target_Occurred (Target, OccurredAtUtc)` — "what did we send to system X".
- Monthly partitioning on `OccurredAtUtc` from day one; purge is a partition switch (§ Retention & Purge).
**`Kind` values by channel:**
| Channel | Kinds |
|---|---|
| `ApiOutbound` | `SyncCall`, `CachedEnqueued`, `CachedAttempt`, `CachedTerminal` |
| `DbOutbound` | `SyncWrite`, `SyncRead`, `CachedEnqueued`, `CachedAttempt`, `CachedTerminal` |
| `Notification` | `Enqueued`, `Attempt`, `Terminal` |
| `ApiInbound` | `Completed` — one row per request, written at request end with final status |
Inbound API is intentionally collapsed to a single `Completed` row per request
rather than a multi-event lifecycle.
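
The channel-to-kind mapping above can be checked mechanically before a row is written. A minimal sketch in Python, assuming a plain lookup table; `KINDS_BY_CHANNEL` and `is_valid_kind` are hypothetical names used only for illustration, not part of the design:

```python
# Allowed Kind values per Channel, copied from the table above.
KINDS_BY_CHANNEL = {
    "ApiOutbound": {"SyncCall", "CachedEnqueued", "CachedAttempt", "CachedTerminal"},
    "DbOutbound": {"SyncWrite", "SyncRead", "CachedEnqueued", "CachedAttempt", "CachedTerminal"},
    "Notification": {"Enqueued", "Attempt", "Terminal"},
    "ApiInbound": {"Completed"},
}


def is_valid_kind(channel: str, kind: str) -> bool:
    """Return True when `kind` is a legal discriminator for `channel`."""
    return kind in KINDS_BY_CHANNEL.get(channel, set())
```

A writer could call `is_valid_kind("DbOutbound", "SyncRead")` before building the row; an unknown channel simply yields `False`.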
## The Site-Local `AuditLog` (SQLite)
A SQLite database file on each site node, alongside the Store-and-Forward
buffer. Same schema as central minus `IngestedAtUtc` (irrelevant at the source),
plus a `ForwardState` column with values `Pending | Forwarded | Reconciled` that
drives the telemetry loop and reconciliation pull.
**Site SQLite retention rule (hard invariant):**
> A row is eligible for purge only when both `OccurredAtUtc < retention threshold` AND `ForwardState IN ('Forwarded', 'Reconciled')`. Pending rows are never purged.
A prolonged central outage grows the site audit table without bound until
central is reachable again. This is intentional: losing audit rows to make
room is a compliance violation, not a self-healing behavior. To keep that
growth visible, the site emits a `SiteAuditBacklog` health metric (pending
row count, oldest pending age, bytes on disk); crossing operator-configured
thresholds surfaces a warning on the relevant site tile in the Health
dashboard, mirroring the Store-and-Forward Engine's backlog metric.
Central is the durable home. Site SQLite is a write-buffer with a forwarding
guarantee.
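
The retention invariant reduces to a single `DELETE` predicate. A minimal sketch using Python's built-in `sqlite3`, assuming timestamps are stored as fixed-width ISO-8601 text; the storage format and the names `iso_utc` and `purge_site_audit` are assumptions for illustration:

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def iso_utc(moment: datetime) -> str:
    """Fixed-width ISO-8601 text, so string comparison matches time order."""
    return moment.astimezone(timezone.utc).isoformat(timespec="microseconds")


def purge_site_audit(conn: sqlite3.Connection, now: datetime, retention_days: int = 7) -> int:
    """Delete rows that are both expired AND already acknowledged by central.

    Pending rows never match the predicate, whatever their age.
    Returns the number of rows removed.
    """
    threshold = iso_utc(now - timedelta(days=retention_days))
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "DELETE FROM AuditLog "
            "WHERE OccurredAtUtc < ? "
            "AND ForwardState IN ('Forwarded', 'Reconciled')",
            (threshold,),
        )
    return cursor.rowcount
```

Keeping both conditions in one statement means no code path can purge on age alone.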
## Ingestion Paths
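Three write paths converge on the central `AuditLog` (telemetry forward,
reconciliation pull, and central direct-write), all idempotent on `EventId`.
Site-originated events enter through the site hot-path append first.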
### Site hot-path append (site-originated events)
The component completing a script-trust-boundary action (External System
Gateway, Database layer, Store-and-Forward Engine) builds an `AuditEvent` with a
fresh `EventId` (Guid v4) and `OccurredAtUtc = UtcNow`, then appends it to the
site-local `AuditLog` SQLite via `ISiteAuditWriter` with
`ForwardState = 'Pending'`. The append is a single-statement INSERT and is
durable in microseconds; control returns to the script with no central
round-trip on the hot path.
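
As an illustration of the hot-path append, the sketch below writes one `Pending` row with a fresh v4 `EventId`. It is written in Python against `sqlite3` rather than the real `ISiteAuditWriter`; the reduced column set and the name `append_site_event` are assumptions for illustration:

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Reduced site schema: only the columns this sketch needs.
SITE_SCHEMA = """
CREATE TABLE IF NOT EXISTS AuditLog (
    EventId       TEXT PRIMARY KEY,
    OccurredAtUtc TEXT NOT NULL,
    Channel       TEXT NOT NULL,
    Kind          TEXT NOT NULL,
    Status        TEXT NOT NULL,
    Target        TEXT,
    ForwardState  TEXT NOT NULL
)
"""


def append_site_event(conn, channel, kind, status, target=None) -> str:
    """Append one audit row locally with ForwardState = 'Pending'.

    No central round-trip happens here; forwarding is a separate loop.
    Returns the generated EventId.
    """
    event_id = str(uuid.uuid4())
    occurred = datetime.now(timezone.utc).isoformat(timespec="microseconds")
    with conn:  # single-statement INSERT in its own transaction
        conn.execute(
            "INSERT INTO AuditLog "
            "(EventId, OccurredAtUtc, Channel, Kind, Status, Target, ForwardState) "
            "VALUES (?, ?, ?, ?, ?, ?, 'Pending')",
            (event_id, occurred, channel, kind, status, target),
        )
    return event_id
```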
### Telemetry forward (site → central)
A `SiteAuditTelemetryActor` singleton drives the forwarding loop: select up to
N `Pending` rows ordered by `OccurredAtUtc`, batch-send them to central via the
existing `SiteStream` gRPC channel as `IngestAuditEvents(events)`, and on
central-ack flip `ForwardState = 'Forwarded'` for accepted IDs. Rejected IDs
stay `Pending` for the next sweep. Cadence is short (default 5 s) when
non-empty, longer (default 30 s) when idle; telemetry runs on a dedicated
dispatcher.
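
One sweep of the forwarding loop might look like the sketch below. `send_batch` stands in for the `IngestAuditEvents` gRPC call and is assumed to return the `EventId`s central accepted; that return shape, the batch size, and the function names are assumptions for illustration:

```python
import sqlite3

BUSY_DELAY_S = 5   # default cadence while rows are flowing
IDLE_DELAY_S = 30  # default cadence when nothing was pending


def forward_pending(conn, send_batch, batch_size=100) -> int:
    """Send up to `batch_size` Pending rows, oldest first.

    Only IDs acknowledged by central flip to 'Forwarded'; rejected IDs stay
    'Pending' for the next sweep. Returns the number of rows sent.
    """
    rows = conn.execute(
        "SELECT EventId, OccurredAtUtc, Channel, Kind, Status FROM AuditLog "
        "WHERE ForwardState = 'Pending' ORDER BY OccurredAtUtc LIMIT ?",
        (batch_size,),
    ).fetchall()
    if not rows:
        return 0
    accepted = list(send_batch(rows))
    with conn:
        conn.executemany(
            "UPDATE AuditLog SET ForwardState = 'Forwarded' "
            "WHERE EventId = ? AND ForwardState = 'Pending'",
            [(event_id,) for event_id in accepted],
        )
    return len(rows)


def next_delay_seconds(rows_sent: int) -> int:
    """Short delay after a non-empty sweep, longer delay when idle."""
    return BUSY_DELAY_S if rows_sent else IDLE_DELAY_S
```

Because central ingest is idempotent on `EventId`, resending a row whose ack was lost is safe.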
### Reconciliation pull (self-healing for missed telemetry)
A central `SiteAuditReconciliationActor` periodically (default 5 min per site)
asks each site for its oldest `Pending` row and pending count; if backlog is
non-draining (e.g., telemetry actor wedged), central issues a
`PullAuditEvents(sinceUtc, batchSize)` and inserts-if-not-exists. Accepted rows
are flipped to `ForwardState = 'Reconciled'` site-side. Same self-healing
pattern as Site Call Audit's reconciliation of `SiteCalls`.
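
The document does not define "non-draining" precisely. One plausible reading, sketched below in Python, treats a backlog as non-draining when the oldest `Pending` timestamp has not advanced between two consecutive reports. That rule, along with `BacklogReport` and `is_non_draining`, is an assumption for illustration, not the specified behavior:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BacklogReport:
    """What a site reports on each reconciliation cycle (hypothetical shape)."""
    pending_count: int
    oldest_pending_utc: Optional[str]  # ISO-8601 text; None when nothing is pending


def is_non_draining(previous: Optional[BacklogReport], current: BacklogReport) -> bool:
    """True when the oldest Pending row has not moved since the last cycle.

    Telemetry forwards oldest-first, so a stuck oldest row suggests the
    forwarding loop is wedged rather than merely busy.
    """
    if previous is None or previous.pending_count == 0 or current.pending_count == 0:
        return False
    if previous.oldest_pending_utc is None or current.oldest_pending_utc is None:
        return False
    return current.oldest_pending_utc <= previous.oldest_pending_utc
```

When this returns `True`, central would issue `PullAuditEvents` starting from `current.oldest_pending_utc`.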
### Central direct-write (central-originated events)
Events originating at central never touch site SQLite. Inbound API writes one
`ApiInbound`/`Completed` row via `ICentralAuditWriter` synchronously inside the
request-handler middleware, before the HTTP response is flushed. The
Notification Outbox dispatcher writes `Notification`/`Attempt` per delivery
attempt and `Notification`/`Terminal` on terminal status. Central direct-writes
use the same insert-if-not-exists semantics keyed on `EventId`.
## Cached Operations — Combined Telemetry
For `ExternalSystem.CachedCall` and `Database.CachedWrite`, the **site** is the
source of truth for every audit row. The site writes each lifecycle event
(`CachedEnqueued`, `CachedAttempt`, `CachedTerminal`) to its local SQLite
`AuditLog` on the hot path (or on the retry tick for `CachedAttempt`), then
forwards via the same telemetry channel. The telemetry message format gains the
audit-row fields additively — one packet per lifecycle transition carries both
the operational state update AND the audit row content.
On receipt, central performs both writes in one transaction:
1. Insert-if-not-exists the immutable `AuditLog` row, keyed on `EventId`.
2. Upsert the operational `SiteCalls` row — existing Site Call Audit behavior
(status, retry count, last error, timestamps).
This collapses two telemetry concerns into one, keeps site SQLite as the
single local source of truth for audit content, and preserves the existing
operational `SiteCalls` shape for the dispatcher and UI.
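
The two-write transaction can be sketched as follows. SQLite stands in for central MS SQL here, so `INSERT OR IGNORE` and `ON CONFLICT ... DO UPDATE` are SQLite syntax (3.24 or later), not the T-SQL the real ingest would use; the packet keys and reduced columns are assumptions for illustration:

```python
import sqlite3


def ingest_cached_packet(conn: sqlite3.Connection, packet: dict) -> bool:
    """Apply one combined telemetry packet atomically.

    Step 1 inserts the immutable audit row if its EventId is new.
    Step 2 upserts the operational SiteCalls row.
    Returns True when the audit row was newly inserted.
    """
    with conn:  # both writes commit together or not at all
        inserted = conn.execute(
            "INSERT OR IGNORE INTO AuditLog (EventId, CorrelationId, Kind, Status) "
            "VALUES (:EventId, :TrackedOperationId, :Kind, :Status)",
            packet,
        ).rowcount == 1
        conn.execute(
            "INSERT INTO SiteCalls (TrackedOperationId, Status, RetryCount, LastError) "
            "VALUES (:TrackedOperationId, :Status, :RetryCount, :LastError) "
            "ON CONFLICT(TrackedOperationId) DO UPDATE SET "
            "Status = excluded.Status, RetryCount = excluded.RetryCount, "
            "LastError = excluded.LastError",
            packet,
        )
    return inserted
```

Replaying the same packet leaves `AuditLog` unchanged and rewrites `SiteCalls` with identical values, so redelivery is harmless in this sketch.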
## Payload Capture Policy
- **Default cap** — 8 KB for each of `RequestSummary` and `ResponseSummary`;
raised to 64 KB on any non-`Success` row.
- **Truncation** — UTF-8 byte-safe; `PayloadTruncated = 1` when applied. Full
bodies are never stored.
- **HTTP headers** — `Authorization`, `Cookie`, `Set-Cookie`, `X-API-Key`, and
any header matching the configured redact-list regex become `<redacted>`.
- **HTTP bodies** — captured verbatim by default. Operators register per-target
body redactors (regex → replacement) for known secret fields.
- **SQL** — statement text and parameter values captured verbatim by default;
per-connection opt-in to redact parameters whose name matches a regex.
- **Never captured** — raw API key material (only the key *name* via `Actor`),
LDAP bind credentials, cluster secrets, Configuration DB connection strings.
- **Safety net** — if a configured redactor throws, the affected payload becomes
`"<redacted: redactor error>"` and `AuditRedactionFailure` increments. We
over-redact, never under-redact, on configuration faults.
Redaction happens at the write site, before the row touches SQLite (or central
MS SQL for direct-write events). Unredacted secrets never persist.
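
Three of the rules above (byte-safe truncation, header redaction, and the redactor safety net) are sketched below in Python. The function names and the `metrics` dictionary are hypothetical; only the header names, the `<redacted>` marker, and the error placeholder come from this document:

```python
import re
from typing import Dict, Iterable, Optional, Tuple

REDACTED = "<redacted>"
REDACTOR_ERROR = "<redacted: redactor error>"
DEFAULT_REDACTED_HEADERS = {"authorization", "cookie", "set-cookie", "x-api-key"}


def truncate_utf8(text: str, cap_bytes: int) -> Tuple[str, bool]:
    """Cut `text` to at most `cap_bytes` of UTF-8 without splitting a character.

    Returns (possibly shortened text, PayloadTruncated flag).
    """
    raw = text.encode("utf-8")
    if len(raw) <= cap_bytes:
        return text, False
    # errors="ignore" drops the incomplete trailing sequence left by the cut.
    return raw[:cap_bytes].decode("utf-8", errors="ignore"), True


def redact_headers(headers: Dict[str, str], extra_pattern: Optional[str] = None) -> Dict[str, str]:
    """Replace sensitive header values; names match case-insensitively."""
    extra = re.compile(extra_pattern, re.IGNORECASE) if extra_pattern else None
    result = {}
    for name, value in headers.items():
        sensitive = name.lower() in DEFAULT_REDACTED_HEADERS or bool(extra and extra.search(name))
        result[name] = REDACTED if sensitive else value
    return result


def apply_body_redactors(body: str, redactors: Iterable[Tuple[str, str]], metrics: Dict[str, int]) -> str:
    """Run (pattern, replacement) redactors; over-redact if any of them fails."""
    try:
        for pattern, replacement in redactors:
            body = re.sub(pattern, replacement, body)
        return body
    except Exception:
        metrics["AuditRedactionFailure"] = metrics.get("AuditRedactionFailure", 0) + 1
        return REDACTOR_ERROR
```

Returning the placeholder for the whole payload on any redactor fault is the over-redact behavior: a broken pattern can hide data but cannot leak it.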
## Failure Handling & Idempotency
- **`EventId` is the dedup key.** Generated at the originator; central ingest
is `INSERT … WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = @id)`
under the PK constraint. Idempotent across telemetry retries, reconciliation
pulls, and any combination of the two.
- **Never fail the action.** A failed audit write — site SQLite or central
direct-write — logs a critical Site Event Log entry and increments a health
metric (`SiteAuditWriteFailures` or `CentralAuditWriteFailures`), but the
user-facing action proceeds. We do not fail script-initiated work because the
audit write failed.
- **Hot-path ring buffer.** While the site audit writer is unhealthy
(disk full, schema lock, transient IO), events buffer in a small in-memory
ring (default 1024 rows); oldest are discarded with a Site Event Log warning
per drop.
- **Reconciliation as fallback.** If two consecutive reconciliation cycles
report a non-draining backlog, the supervisor restarts the telemetry actor
and a `SiteAuditTelemetryStalled` event fires.
- **No dedup horizon.** `EventId` PK enforces uniqueness only while a row
exists. A retry that arrives after the original row is purged inserts a "new"
row — vanishingly rare and harmless.
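
The "never fail the action" rule and the hot-path ring buffer combine naturally into one wrapper. A minimal sketch, assuming the underlying writer is a callable that raises on failure; `BufferedAuditWriter` and its counters are hypothetical names for illustration:

```python
from collections import deque
from typing import Callable, Deque


class BufferedAuditWriter:
    """Wraps an audit writer so that append() never raises to the caller."""

    def __init__(self, write: Callable[[dict], None], capacity: int = 1024):
        self._write = write
        self._capacity = capacity
        self._ring: Deque[dict] = deque()
        self.write_failures = 0  # would feed SiteAuditWriteFailures
        self.dropped = 0         # each drop would also log a warning

    def append(self, event: dict) -> None:
        self._ring.append(event)
        self._flush()
        # Still over capacity after a failed flush: discard the oldest rows.
        while len(self._ring) > self._capacity:
            self._ring.popleft()
            self.dropped += 1

    def _flush(self) -> None:
        """Write buffered events in order; stop at the first failure."""
        while self._ring:
            try:
                self._write(self._ring[0])
            except Exception:
                self.write_failures += 1
                return
            self._ring.popleft()

    def pending(self) -> int:
        return len(self._ring)
```

Events leave the ring only after a successful write, so a recovered writer drains the backlog in original order on the next append.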
## Retention & Purge
- **Central:** 365-day default based on `OccurredAtUtc`, configurable via
`AuditLog:RetentionDays` (min 7, max 3650). Single global retention in v1 —
no per-channel overrides.
- **Partitioning:** monthly partitions on `OccurredAtUtc` from day one
(`pf_AuditLog_Month` / `ps_AuditLog_Month`). Purge is a partition switch;
there are no row-level deletes at central.
- **Purge actor:** `AuditLogPurgeActor` singleton on the active central node
runs daily, switches out any partition whose latest `OccurredAtUtc` is older
than the retention window, and emits an `AuditLog:Purged` event (partition
range, rowcount, duration). A partition-maintenance step rolls forward each
month, creating the next month's partition ahead of time.
- **Sites:** daily site job; default 7-day retention (configurable, min 1,
max 90). Respects the hard `ForwardState` invariant — `Pending` rows are
never purged on age alone.
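
Which monthly partitions qualify for a switch-out can be computed from the partition boundaries alone. The sketch below assumes a partition is eligible once its exclusive upper bound is at or before the retention cutoff, a conservative proxy for "latest `OccurredAtUtc` older than the window"; the function names are hypothetical:

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


def month_start(year: int, month: int) -> datetime:
    return datetime(year, month, 1, tzinfo=timezone.utc)


def next_month_start(start: datetime) -> datetime:
    """Exclusive upper bound of the monthly partition beginning at `start`."""
    if start.month == 12:
        return month_start(start.year + 1, 1)
    return month_start(start.year, start.month + 1)


def purgeable_partitions(partition_starts: Iterable[datetime], now: datetime,
                         retention_days: int = 365) -> List[datetime]:
    """Return the partitions whose every possible row is past retention."""
    cutoff = now - timedelta(days=retention_days)
    return [start for start in partition_starts if next_month_start(start) <= cutoff]
```

With the 365-day default, a row can therefore live up to roughly one extra month before its partition becomes eligible.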
## Security & Tamper-Evidence
- **Append-only enforcement.** The application accesses `AuditLog` via a
dedicated DB role `scadalink_audit_writer` granted `INSERT` + `SELECT` only —
no `UPDATE`, no `DELETE`. Purge runs under a separate role
`scadalink_audit_purger` whose permissions are limited to the partition-switch
operation; row-level `DELETE` is not granted even to purge.
- **CI grep guard.** The build greps the data layer for any
`UPDATE … AuditLog` or `DELETE … AuditLog` text and fails on a hit.
- **Authorization.** Reading the Audit Log requires the existing **Audit** role
extended with a new **OperationalAudit** permission. Per-site row scoping
reuses the existing site-permission model; bulk export requires an additional
**AuditExport** permission.
- **Payload redaction at write.** See Payload Capture Policy. Unredacted
secrets never persist; the safety net over-redacts on misconfiguration.
- **Hash-chain tamper evidence — deferred to v1.x.** A future `RowHash` column,
computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will be
verifiable offline via `scadalink audit verify-chain --month YYYY-MM`. Off by
default in v1.
- **Site SQLite security.** File permissions: read/write by the ScadaLink
service account only. Not backed up off-machine — site SQLite is a buffer,
not a record.
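
The deferred hash chain, `SHA-256(prev.RowHash || canonical(row))`, can be sketched as below. The canonical form (compact JSON with sorted keys) and the all-zero genesis hash are assumptions; this document specifies neither:

```python
import hashlib
import json
from typing import Dict, List, Sequence

GENESIS_HASH = b"\x00" * 32  # assumed seed for the first row of a partition


def canonical(row: Dict) -> bytes:
    """Deterministic byte form of a row: sorted keys, no extra whitespace."""
    return json.dumps(row, sort_keys=True, separators=(",", ":"), ensure_ascii=False).encode("utf-8")


def chain_hashes(rows: Sequence[Dict]) -> List[bytes]:
    """RowHash for each row, each one folding in its predecessor's hash."""
    previous = GENESIS_HASH
    hashes = []
    for row in rows:
        previous = hashlib.sha256(previous + canonical(row)).digest()
        hashes.append(previous)
    return hashes


def first_broken_link(rows: Sequence[Dict], stored_hashes: Sequence[bytes]) -> int:
    """Index of the first row whose stored hash does not verify, or -1."""
    expected = chain_hashes(rows)
    for index, (want, got) in enumerate(zip(expected, stored_hashes)):
        if want != got:
            return index
    if len(expected) != len(stored_hashes):
        return min(len(expected), len(stored_hashes))
    return -1
```

Because each hash depends on the one before it, editing one row invalidates its own hash and every later hash in the partition.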
## KPIs
Point-in-time, computed from the central `AuditLog` table; global and per-site.
- **Volume** — events/min.
- **Error rate** — % non-`Success` rows, rolling 5 min.
- **Backlog** — sum of `Pending` site rows across sites.
- **Top inbound callers** — top-10 `Actor` by request count, last 1 h.
- **Top outbound 5xx** — top-10 `Target` by 5xx-status count, last 1 h.
[Notification Outbox](Component-NotificationOutbox.md) and
[Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected — they remain
sourced from `Notifications` and `SiteCalls` respectively. Audit Log KPIs
describe the audit table itself.
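
Two of these KPIs map directly onto aggregate queries. The sketch below runs them against SQLite as a stand-in for central MS SQL; the function names are hypothetical, and the caller supplies the window start (now minus 5 min or 1 h) as ISO-8601 text:

```python
import sqlite3
from typing import List, Tuple


def top_inbound_callers(conn: sqlite3.Connection, since_utc: str, limit: int = 10) -> List[Tuple[str, int]]:
    """Top `Actor` values by inbound request count since `since_utc`."""
    return conn.execute(
        "SELECT Actor, COUNT(*) AS Requests FROM AuditLog "
        "WHERE Channel = 'ApiInbound' AND OccurredAtUtc >= ? "
        "GROUP BY Actor ORDER BY Requests DESC, Actor LIMIT ?",
        (since_utc, limit),
    ).fetchall()


def error_rate_percent(conn: sqlite3.Connection, since_utc: str) -> float:
    """Percentage of rows whose Status is not 'Success' since `since_utc`."""
    total, failed = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(Status <> 'Success'), 0) "
        "FROM AuditLog WHERE OccurredAtUtc >= ?",
        (since_utc,),
    ).fetchone()
    return 0.0 if total == 0 else 100.0 * failed / total
```

Both queries filter on `OccurredAtUtc`, so the time-leading indexes listed earlier would serve them.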
## Configuration
Bound from `appsettings.json` to a new `AuditLogOptions` class owned by this
component (Options pattern):
```jsonc
"AuditLog": {
"DefaultCapBytes": 8192,
"ErrorCapBytes": 65536,
"HeaderRedactList": [ "Authorization", "Cookie", "Set-Cookie", "X-API-Key" ],
"GlobalBodyRedactors": [
{ "Pattern": "\"password\"\\s*:\\s*\"[^\"]+\"", "Replacement": "\"password\":\"<redacted>\"" }
],
"PerTargetOverrides": {
"Weather/GetForecast": { "CapBytes": 4096 },
"PlantDB": { "RedactSqlParamsMatching": "@apikey|@token" }
},
"RetentionDays": 365
}
```
`PerTargetOverrides` keys bind by External System / Inbound Method /
Notification List / Database Connection name. `RetentionDays` is a single
global value in v1; per-channel overrides are deferred to v1.x.
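
How the caps and overrides resolve at write time could look like the sketch below. It assumes a per-target `CapBytes` replaces the default cap on `Success` rows only, and that an out-of-range `RetentionDays` is rejected at startup; the document specifies neither, and this Python dataclass only mirrors the real `AuditLogOptions` loosely:

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AuditLogOptions:
    """Illustrative mirror of the "AuditLog" configuration section."""
    default_cap_bytes: int = 8192
    error_cap_bytes: int = 65536
    retention_days: int = 365
    per_target_overrides: Dict[str, dict] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Range taken from Retention & Purge (min 7, max 3650).
        if not 7 <= self.retention_days <= 3650:
            raise ValueError("RetentionDays must be between 7 and 3650")

    def cap_bytes_for(self, target: str, status: str) -> int:
        """Payload cap for one row: error cap on failures, else override or default."""
        if status != "Success":
            return self.error_cap_bytes
        override = self.per_target_overrides.get(target, {})
        return override.get("CapBytes", self.default_cap_bytes)
```

With the example configuration above, `cap_bytes_for("Weather/GetForecast", "Success")` would resolve to the 4096-byte override.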
## Dependencies
- **[Commons (#16)](Component-Commons.md)** — `AuditEvent`, `IAuditWriter` /
`ICentralAuditWriter` interfaces, and the `AuditChannel`, `AuditKind`,
`AuditStatus` enum types live here.
- **[Configuration Database (#17)](Component-ConfigurationDatabase.md)** — hosts
the `AuditLog` table schema, the monthly partition function and scheme, the
`scadalink_audit_writer` / `scadalink_audit_purger` DB roles, and the EF
migration. Distinct concern from `IAuditService` (config-change audit), which
is unchanged.
- **[Cluster Infrastructure (#13)](Component-ClusterInfrastructure.md)** —
singleton placement and supervision for `AuditLogIngestActor`,
`SiteAuditTelemetryActor`, `SiteAuditReconciliationActor`, and
`AuditLogPurgeActor`.
- **[Central-Site Communication (#5)](Component-Communication.md)** — carries
audit telemetry. New gRPC message types (`IngestAuditEvents`,
`PullAuditEvents`) are added to the existing site-stream proto additively.
- **[Site Runtime (#3)](Component-SiteRuntime.md)** — script-trust-boundary
call paths invoke `IAuditWriter` to append events.
- **[Host (#15)](Component-Host.md)** — registers this component (#23) under
the central and site roles.
## Interactions
- **[External System Gateway (#7)](Component-ExternalSystemGateway.md)** —
emits `ApiOutbound.SyncCall` rows on every sync `Call()`. For `CachedCall`,
emits the combined cached telemetry packet (audit row + operational update)
per § Cached Operations — Combined Telemetry.
- **[Site Runtime (#3)](Component-SiteRuntime.md) — Database layer** — emits
`DbOutbound.SyncWrite`, `DbOutbound.SyncRead`, and the cached-write variants
via the same combined-telemetry path.
- **[Inbound API (#14)](Component-InboundAPI.md)** — emits one
`ApiInbound.Completed` row per request from request-handler middleware,
written directly to central via `ICentralAuditWriter` before the response is
flushed.
- **[Notification Outbox (#21)](Component-NotificationOutbox.md)** — the
site-emitted `Notification.Enqueued` row flows via audit telemetry; the
central dispatcher writes `Notification.Attempt` (per delivery attempt) and
`Notification.Terminal` (on terminal status) directly via
`ICentralAuditWriter`. The operational `Notifications` table is unchanged.
- **[Site Call Audit (#22)](Component-SiteCallAudit.md)** — shares the
cached-call telemetry packet. Central ingest of that packet performs both the
`AuditLog` insert and the `SiteCalls` upsert in one transaction. `SiteCalls`
remains the operational state store; the Audit Log is its immutable shadow.
- **[Central UI (#9)](Component-CentralUI.md)** — a new **Audit** nav group
hosts the Audit Log page (filter bar, results grid, drilldown drawer,
server-side CSV export). Drill-in links appear on Notifications, Site Calls,
External Systems, Inbound API key, Sites, and Instances detail pages.
- **[Health Monitoring (#11)](Component-HealthMonitoring.md)** — three new
tiles (Volume, Error rate, Backlog) plus new health metrics:
`SiteAuditBacklog`, `SiteAuditWriteFailures`, `SiteAuditTelemetryStalled`,
`CentralAuditWriteFailures`, `AuditRedactionFailure`.
- **[CLI (#19)](Component-CLI.md)** — new `scadalink audit query`,
`scadalink audit export`, and `scadalink audit verify-chain` commands; same
permission requirements as the UI.