From c334de03f494e7a37c5288eecd8c0a843ae0cb55 Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Wed, 20 May 2026 07:36:35 -0400
Subject: [PATCH] docs(audit): add Component-AuditLog (#23) design document

---
 docs/requirements/Component-AuditLog.md | 384 ++++++++++++++++++++++++
 1 file changed, 384 insertions(+)
 create mode 100644 docs/requirements/Component-AuditLog.md

diff --git a/docs/requirements/Component-AuditLog.md b/docs/requirements/Component-AuditLog.md
new file mode 100644
index 0000000..189a905
--- /dev/null
+++ b/docs/requirements/Component-AuditLog.md
@@ -0,0 +1,384 @@
+# Component: Audit Log
+
+## Purpose
+
+Provides a single, append-only, forensic + operational record of every
+integration action initiated by, or terminating in, a script — across outbound
+API, outbound DB, notifications, and inbound API. One row per lifecycle event,
+rich payloads, long retention, dashboards plus drilldowns plus filter queries,
+answering both forensic questions ("did instance X send notification Y on date
+Z, with what body?") and operational ones ("which inbound caller is hammering
+us right now?").
+
+The Audit Log is **not a dispatcher**. It does not drive delivery, retry loops,
+or operator Retry/Discard actions — those remain in [Notification Outbox](Component-NotificationOutbox.md)
+and [Site Call Audit](Component-SiteCallAudit.md). The Audit Log is the
+immutable history that **observes** those subsystems and adds coverage where
+they are silent (sync `ExternalSystem.Call`, sync DB writes and reads, inbound
+API requests).
+
+## Location
+
+Central cluster and site clusters.
+
+- **Central:** the `AuditLog` table in central MS SQL, plus three singletons on
+  the active central node — `AuditLogIngestActor` (telemetry receiver),
+  `SiteAuditReconciliationActor`, and `AuditLogPurgeActor`.
+- **Sites:** a site-local `AuditLog` SQLite database file alongside the
+  Store-and-Forward buffer, plus a `SiteAuditTelemetryActor` singleton on the
+  active site node.
+
+Registered as component #23 in the Host role configuration.
+
+## Responsibilities
+
+- Accept site-local hot-path audit writes from script-trust-boundary call paths.
+- Forward site audit rows to central via gRPC telemetry with at-least-once
+  delivery and idempotency on `EventId`.
+- Run periodic per-site reconciliation pulls so missed telemetry self-heals.
+- Accept central-originated audit writes (Inbound API, Notification dispatch
+  attempts and terminal status).
+- Compute point-in-time KPIs (global and per-site) from the central `AuditLog`
+  table.
+- Purge expired rows by monthly partition switch — no row-level deletes.
+
+## Scope — the script trust boundary
+
+The Audit Log captures every action a script causes to cross the cluster trust
+boundary:
+
+| Channel | Trigger | Direction | Covered today? |
+|---|---|---|---|
+| `ExternalSystem.Call(...)` | Script | Outbound | No (gap) |
+| `ExternalSystem.CachedCall(...)` | Script | Outbound | Yes — `SiteCalls` (Site Call Audit) |
+| `Database.Connection().Execute*(...)` — writes | Script | Outbound | No (gap) |
+| `Database.CachedWrite(...)` | Script | Outbound | Yes — `SiteCalls` (Site Call Audit) |
+| `Notify.To(list).Send(...)` | Script | Outbound | Yes — `Notifications` (Notification Outbox) |
+| `POST /api/{method}` (Inbound API) | External | Inbound (invokes a script) | No (gap) |
+
+Out of scope — framework traffic is not audited:
+
+- Health checks, heartbeats, cluster membership messages.
+- gRPC inter-cluster real-time streams (attribute values, alarm states).
+- Data Connection Layer ↔ OPC UA / custom protocol traffic.
+- LDAP authentication probes, Traefik routing decisions.
+- Internal Configuration Database queries by the framework.
+- Site Event Log writes; audit log writes themselves.
+
+Script-initiated DB **reads** via `Database.Connection().ExecuteReader(...)`
+count as actions from a script and are in scope. Reads via DCL / subscriptions
+are framework traffic and excluded.
+
+## The `AuditLog` Table (central)
+
+Single wide table in central MS SQL, polymorphic by `Channel` + `Kind`
+discriminators, with a JSON `Extra` column for channel-specific overflow. One
+row per lifecycle event across all channels.
+
+| Column | Type | Notes |
+|---|---|---|
+| `EventId` | `uniqueidentifier` PK | Generated where the event originates (site or central). Idempotency key. |
+| `OccurredAtUtc` | `datetime2` | When the event happened (call returned, retry attempted, etc.). |
+| `IngestedAtUtc` | `datetime2` | When central persisted the row (lags `OccurredAtUtc` for site-originated rows). |
+| `Channel` | `varchar(32)` | `ApiOutbound` \| `DbOutbound` \| `Notification` \| `ApiInbound`. |
+| `Kind` | `varchar(32)` | Channel-specific event kind (see below). |
+| `CorrelationId` | `uniqueidentifier` NULL | Ties multi-event operations together. `TrackedOperationId` for cached calls, `NotificationId` for notifications, request-id for inbound API. NULL for sync one-shot calls. |
+| `SourceSiteId` | `varchar(64)` NULL | NULL for central-originated events. |
+| `SourceInstanceId` | `varchar(128)` NULL | Instance whose script initiated the action (when applicable). |
+| `SourceScript` | `varchar(128)` NULL | Script name within the instance. |
+| `Actor` | `varchar(128)` NULL | Inbound API: API key name. Outbound: script identity. Central: system user. |
+| `Target` | `varchar(256)` NULL | Outbound API: external system + method. DB: connection name. Notification: list name. Inbound API: method name. |
+| `Status` | `varchar(32)` | Outcome of *this event* — `Success`, `TransientFailure`, `PermanentFailure`, `Enqueued`, `Retrying`, `Delivered`, `Parked`, `Discarded`. |
+| `HttpStatus` | `int` NULL | HTTP-bearing events only. |
+| `DurationMs` | `int` NULL | Call / attempt duration. |
+| `ErrorMessage` | `nvarchar(1024)` NULL | Truncated; `ErrorDetail` for full text. |
+| `ErrorDetail` | `nvarchar(max)` NULL | Optional full exception text on failures. |
+| `RequestSummary` | `nvarchar(max)` NULL | Truncated request payload (configurable cap). Headers redacted. |
+| `ResponseSummary` | `nvarchar(max)` NULL | Truncated response payload. Full on errors. |
+| `PayloadTruncated` | `bit` | Set if either summary was truncated. |
+| `Extra` | `nvarchar(max)` NULL | Channel-specific JSON for fields we don't promote to columns. |
+
+**Indexes (first cut):**
+
+- `IX_AuditLog_OccurredAtUtc` — primary time-range index for global scans.
+- `IX_AuditLog_Site_Occurred (SourceSiteId, OccurredAtUtc)` — per-site filters.
+- `IX_AuditLog_Correlation (CorrelationId)` — drilldown from a single operation.
+- `IX_AuditLog_Channel_Status_Occurred (Channel, Status, OccurredAtUtc)` — KPI / dashboard tiles.
+- `IX_AuditLog_Target_Occurred (Target, OccurredAtUtc)` — "what did we send to system X".
+- Monthly partitioning on `OccurredAtUtc` from day one; purge is a partition switch (§ Retention & Purge).
+
+**`Kind` values by channel:**
+
+| Channel | Kinds |
+|---|---|
+| `ApiOutbound` | `SyncCall`, `CachedEnqueued`, `CachedAttempt`, `CachedTerminal` |
+| `DbOutbound` | `SyncWrite`, `SyncRead`, `CachedEnqueued`, `CachedAttempt`, `CachedTerminal` |
+| `Notification` | `Enqueued`, `Attempt`, `Terminal` |
+| `ApiInbound` | `Completed` — one row per request, written at request end with final status |
+
+Inbound API is intentionally collapsed to a single `Completed` row per request
+rather than a multi-event lifecycle.
+
+## The Site-Local `AuditLog` (SQLite)
+
+A SQLite database file on each site node, alongside the Store-and-Forward
+buffer. Same schema as central minus `IngestedAtUtc` (irrelevant at the source),
+plus a `ForwardState` column with values `Pending | Forwarded | Reconciled` that
+drives the telemetry loop and reconciliation pull.
+
+**Site SQLite retention rule (hard invariant):**
+
+> A row is eligible for purge only when both `OccurredAtUtc < retention threshold` AND `ForwardState IN ('Forwarded', 'Reconciled')`. Pending rows are never purged.
+
+A prolonged central outage will grow the site audit table indefinitely until
+central is reachable again. This is intentional — losing audit rows to make
+room is a compliance violation, not a self-healing behavior. To bound that
+growth in practice, the site emits a `SiteAuditBacklog` health metric (pending
+row count, oldest pending age, bytes on disk); crossing operator-configured
+thresholds surfaces a warning on the relevant site tile in the Health
+dashboard, mirroring the Store-and-Forward Engine's backlog metric.
+
+Central is the durable home. Site SQLite is a write-buffer with a forwarding
+guarantee.
+
+## Ingestion Paths
+
+Three write paths converge on the central `AuditLog`, all idempotent on
+`EventId`.
+
+### Site hot-path append (site-originated events)
+
+The component completing a script-trust-boundary action (External System
+Gateway, Database layer, Store-and-Forward Engine) builds an `AuditEvent` with a
+fresh `EventId` (Guid v4) and `OccurredAtUtc = UtcNow`, then appends it to the
+site-local `AuditLog` SQLite via `ISiteAuditWriter` with
+`ForwardState = 'Pending'`. The append is a single-statement INSERT and is
+durable in microseconds; control returns to the script with no central
+round-trip on the hot path.
+
+### Telemetry forward (site → central)
+
+A `SiteAuditTelemetryActor` singleton drives the forwarding loop: select up to
+N `Pending` rows ordered by `OccurredAtUtc`, batch-send them to central via the
+existing `SiteStream` gRPC channel as `IngestAuditEvents(events)`, and on
+central-ack flip `ForwardState = 'Forwarded'` for accepted IDs. Rejected IDs
+stay `Pending` for the next sweep. Cadence is short (default 5 s) when
+non-empty, longer (default 30 s) when idle; telemetry runs on a dedicated
+dispatcher.
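
Reviewer sketch, not part of the patch: one sweep of the telemetry forward loop described above, written in Python against an in-memory SQLite table purely for illustration. The names `open_site_db`, `forward_sweep`, and `send_batch` are hypothetical, the table carries only a subset of the columns, and `send_batch` stands in for the `IngestAuditEvents` round-trip by returning the IDs central accepted. The real actor, its gRPC plumbing, and its cadence handling are not shown.

```python
import sqlite3
from typing import Callable, Dict, Iterable, List

COLUMNS = ("EventId", "OccurredAtUtc", "Channel", "Kind", "Status")


def open_site_db() -> sqlite3.Connection:
    """In-memory stand-in for the site-local AuditLog (subset of columns)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE AuditLog ("
        " EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT NOT NULL,"
        " Channel TEXT NOT NULL, Kind TEXT NOT NULL, Status TEXT NOT NULL,"
        " ForwardState TEXT NOT NULL DEFAULT 'Pending')"
    )
    return db


def forward_sweep(
    db: sqlite3.Connection,
    send_batch: Callable[[List[Dict]], Iterable[str]],
    batch_size: int = 100,
) -> int:
    """Run one sweep; return how many event IDs central accepted."""
    # Select up to N Pending rows, oldest first.
    rows = db.execute(
        "SELECT EventId, OccurredAtUtc, Channel, Kind, Status FROM AuditLog"
        " WHERE ForwardState = 'Pending' ORDER BY OccurredAtUtc LIMIT ?",
        (batch_size,),
    ).fetchall()
    if not rows:
        return 0
    events = [dict(zip(COLUMNS, row)) for row in rows]

    # Hypothetical transport call: returns the IDs central acknowledged.
    accepted = list(send_batch(events))

    # Flip only acknowledged rows; rejected IDs stay Pending for the next sweep.
    with db:
        db.executemany(
            "UPDATE AuditLog SET ForwardState = 'Forwarded'"
            " WHERE EventId = ? AND ForwardState = 'Pending'",
            [(event_id,) for event_id in accepted],
        )
    return len(accepted)
```

Because the state flip happens only after the acknowledgement, a crash between send and flip resends the batch on the next sweep, which is why central ingest has to be idempotent on `EventId`.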
+
+### Reconciliation pull (self-healing for missed telemetry)
+
+A central `SiteAuditReconciliationActor` periodically (default 5 min per site)
+asks each site for its oldest `Pending` row and pending count; if backlog is
+non-draining (e.g., telemetry actor wedged), central issues a
+`PullAuditEvents(sinceUtc, batchSize)` and inserts-if-not-exists. Accepted rows
+are flipped to `ForwardState = 'Reconciled'` site-side. Same self-healing
+pattern as Site Call Audit's reconciliation of `SiteCalls`.
+
+### Central direct-write (central-originated events)
+
+Events originating at central never touch site SQLite. Inbound API writes one
+`ApiInbound`/`Completed` row via `ICentralAuditWriter` synchronously inside the
+request-handler middleware, before the HTTP response is flushed. The
+Notification Outbox dispatcher writes `Notification`/`Attempt` per delivery
+attempt and `Notification`/`Terminal` on terminal status. Central direct-writes
+use the same insert-if-not-exists semantics keyed on `EventId`.
+
+## Cached Operations — Combined Telemetry
+
+For `ExternalSystem.CachedCall` and `Database.CachedWrite`, the **site** is the
+source of truth for every audit row. The site writes each lifecycle event
+(`CachedEnqueued`, `CachedAttempt`, `CachedTerminal`) to its local SQLite
+`AuditLog` on the hot path (or on the retry tick for `CachedAttempt`), then
+forwards via the same telemetry channel. The telemetry message format gains the
+audit-row fields additively — one packet per lifecycle transition carries both
+the operational state update AND the audit row content.
+
+On receipt, central performs both writes in one transaction:
+
+1. Insert-if-not-exists the immutable `AuditLog` row, keyed on `EventId`.
+2. Upsert the operational `SiteCalls` row — existing Site Call Audit behavior
+   (status, retry count, last error, timestamps).
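
Reviewer sketch, not part of the patch: the two-step central ingest above, illustrated in Python with an in-memory SQLite database standing in for central MS SQL. The function names, the packet's dictionary shape, and the reduced column sets are assumptions made for the example; only the table names, `EventId`, and the insert-if-not-exists plus upsert pairing come from the design text.

```python
import sqlite3
from typing import Dict

SCHEMA = """
CREATE TABLE AuditLog (
    EventId TEXT PRIMARY KEY, CorrelationId TEXT,
    OccurredAtUtc TEXT NOT NULL, Kind TEXT NOT NULL, Status TEXT NOT NULL);
CREATE TABLE SiteCalls (
    TrackedOperationId TEXT PRIMARY KEY, Status TEXT NOT NULL,
    RetryCount INTEGER NOT NULL, LastError TEXT);
"""


def open_central_db() -> sqlite3.Connection:
    """In-memory stand-in for the central database (reduced schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def ingest_cached_packet(db: sqlite3.Connection, packet: Dict) -> bool:
    """Apply one combined packet; return True if the audit row was new."""
    with db:  # single transaction: commits on success, rolls back on error
        # Step 1: insert-if-not-exists, keyed on EventId.
        cur = db.execute(
            "INSERT INTO AuditLog"
            " (EventId, CorrelationId, OccurredAtUtc, Kind, Status)"
            " SELECT ?, ?, ?, ?, ?"
            " WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = ?)",
            (packet["EventId"], packet["TrackedOperationId"],
             packet["OccurredAtUtc"], packet["Kind"], packet["Status"],
             packet["EventId"]),
        )
        inserted = cur.rowcount == 1

        # Step 2: upsert the operational row (update first, insert if absent).
        updated = db.execute(
            "UPDATE SiteCalls SET Status = ?, RetryCount = ?, LastError = ?"
            " WHERE TrackedOperationId = ?",
            (packet["Status"], packet["RetryCount"], packet["LastError"],
             packet["TrackedOperationId"]),
        )
        if updated.rowcount == 0:
            db.execute(
                "INSERT INTO SiteCalls"
                " (TrackedOperationId, Status, RetryCount, LastError)"
                " VALUES (?, ?, ?, ?)",
                (packet["TrackedOperationId"], packet["Status"],
                 packet["RetryCount"], packet["LastError"]),
            )
    return inserted
```

Replaying the same packet leaves `AuditLog` unchanged and rewrites `SiteCalls` with identical values. This sketch does not guard the upsert against out-of-order packets; how the existing Site Call Audit upsert handles that is outside what the design text states.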
+
+This collapses two telemetry concerns into one, keeps site SQLite as the
+single local source of truth for audit content, and preserves the existing
+operational `SiteCalls` shape for the dispatcher and UI.
+
+## Payload Capture Policy
+
+- **Default cap** — 8 KB for each of `RequestSummary` and `ResponseSummary`;
+  raised to 64 KB on any non-`Success` row.
+- **Truncation** — UTF-8 byte-safe; `PayloadTruncated = 1` when applied. Full
+  bodies are never stored.
+- **HTTP headers** — `Authorization`, `Cookie`, `Set-Cookie`, `X-API-Key`, and
+  any header matching the configured redact-list regex are replaced with a
+  redaction placeholder.
+- **HTTP bodies** — captured verbatim by default. Operators register per-target
+  body redactors (regex → replacement) for known secret fields.
+- **SQL** — statement text and parameter values captured verbatim by default;
+  per-connection opt-in to redact parameters whose name matches a regex.
+- **Never captured** — raw API key material (only the key *name* via `Actor`),
+  LDAP bind credentials, cluster secrets, Configuration DB connection strings.
+- **Safety net** — if a configured redactor throws, the affected payload becomes
+  `""` and `AuditRedactionFailure` increments. We
+  over-redact, never under-redact, on configuration faults.
+
+Redaction happens at the write site, before the row touches SQLite (or central
+MS SQL for direct-write events). Unredacted secrets never persist.
+
+## Failure Handling & Idempotency
+
+- **`EventId` is the dedup key.** Generated at the originator; central ingest
+  is `INSERT … WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = @id)`
+  under the PK constraint. Idempotent across telemetry retries, reconciliation
+  pulls, and any combination of the two.
+- **Never fail the action.** A failed audit write — site SQLite or central
+  direct-write — logs a critical Site Event Log entry and increments a health
+  metric (`SiteAuditWriteFailures` or `CentralAuditWriteFailures`), but the
+  user-facing action proceeds. We do not fail script-initiated work because the
+  audit write failed.
+- **Hot-path ring buffer.** While the site audit writer is unhealthy
+  (disk full, schema lock, transient IO), events buffer in a small in-memory
+  ring (default 1024 rows); oldest are discarded with a Site Event Log warning
+  per drop.
+- **Reconciliation as fallback.** If two consecutive reconciliation cycles
+  report a non-draining backlog, the supervisor restarts the telemetry actor
+  and a `SiteAuditTelemetryStalled` event fires.
+- **No dedup horizon.** `EventId` PK enforces uniqueness only while a row
+  exists. A retry that arrives after the original row is purged inserts a "new"
+  row — vanishingly rare and harmless.
+
+## Retention & Purge
+
+- **Central:** 365-day default based on `OccurredAtUtc`, configurable via
+  `AuditLog:RetentionDays` (min 7, max 3650). Single global retention in v1 —
+  no per-channel overrides.
+- **Partitioning:** monthly partitions on `OccurredAtUtc` from day one
+  (`pf_AuditLog_Month` / `ps_AuditLog_Month`). Purge is a partition switch;
+  there are no row-level deletes at central.
+- **Purge actor:** `AuditLogPurgeActor` singleton on the active central node
+  runs daily, switches out any partition whose latest `OccurredAtUtc` is older
+  than the retention window, and emits an `AuditLog:Purged` event (partition
+  range, row count, duration). A partition-maintenance step rolls forward each
+  month, creating the next month's partition ahead of time.
+- **Sites:** daily site job; default 7-day retention (configurable, min 1,
+  max 90). Respects the hard `ForwardState` invariant — `Pending` rows are
+  never purged on age alone.
+
+## Security & Tamper-Evidence
+
+- **Append-only enforcement.** The application accesses `AuditLog` via a
+  dedicated DB role `scadalink_audit_writer` granted `INSERT` + `SELECT` only —
+  no `UPDATE`, no `DELETE`. Purge runs under a separate role
+  `scadalink_audit_purger` whose permissions are limited to the partition-switch
+  operation; row-level `DELETE` is not granted even to purge.
+- **CI grep guard.** The build greps the data layer for any
+  `UPDATE … AuditLog` or `DELETE … AuditLog` text and fails on a hit.
+- **Authorization.** Reading the Audit Log requires the existing **Audit** role
+  extended with a new **OperationalAudit** permission. Per-site row scoping
+  reuses the existing site-permission model; bulk export requires an additional
+  **AuditExport** permission.
+- **Payload redaction at write.** See Payload Capture Policy. Unredacted
+  secrets never persist; the safety net over-redacts on misconfiguration.
+- **Hash-chain tamper evidence — deferred to v1.x.** A future `RowHash` column,
+  computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will be
+  verifiable offline via `scadalink audit verify-chain --month YYYY-MM`. Off by
+  default in v1.
+- **Site SQLite security.** File permissions: read/write by the ScadaLink
+  service account only. Not backed up off-machine — site SQLite is a buffer,
+  not a record.
+
+## KPIs
+
+Point-in-time, computed from the central `AuditLog` table; global and per-site.
+
+- **Volume** — events/min.
+- **Error rate** — % non-`Success` rows, rolling 5 min.
+- **Backlog** — sum of `Pending` site rows across sites.
+- **Top inbound callers** — top-10 `Actor` by request count, last 1 h.
+- **Top outbound 5xx** — top-10 `Target` by 5xx-status count, last 1 h.
+
+[Notification Outbox](Component-NotificationOutbox.md) and
+[Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected — they remain
+sourced from `Notifications` and `SiteCalls` respectively. Audit Log KPIs
+describe the audit table itself.
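
Reviewer sketch, not part of the patch: two of the KPIs above expressed as queries, in Python over an in-memory SQLite stand-in for the central table. The function names, the reduced schema, and the ISO-8601 text timestamps are assumptions for illustration. The error rate follows the text literally and counts every non-`Success` status, including `Enqueued` and `Retrying`.

```python
import sqlite3
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

KPI_SCHEMA = (
    "CREATE TABLE AuditLog ("
    " EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT NOT NULL,"
    " Channel TEXT NOT NULL, Status TEXT NOT NULL,"
    " Actor TEXT, SourceSiteId TEXT)"
)


def _iso(ts: datetime) -> str:
    """Fixed-width UTC text so string comparison matches time order."""
    return ts.strftime("%Y-%m-%dT%H:%M:%S")


def error_rate_pct(
    db: sqlite3.Connection,
    now: datetime,
    window: timedelta = timedelta(minutes=5),
    site_id: Optional[str] = None,
) -> Optional[float]:
    """Percent of non-Success rows in (now - window, now]; None if no rows."""
    sql = (
        "SELECT COUNT(*),"
        " COALESCE(SUM(CASE WHEN Status <> 'Success' THEN 1 ELSE 0 END), 0)"
        " FROM AuditLog WHERE OccurredAtUtc > ? AND OccurredAtUtc <= ?"
    )
    params = [_iso(now - window), _iso(now)]
    if site_id is not None:  # per-site variant of the same KPI
        sql += " AND SourceSiteId = ?"
        params.append(site_id)
    total, failed = db.execute(sql, params).fetchone()
    return None if total == 0 else 100.0 * failed / total


def top_inbound_callers(
    db: sqlite3.Connection,
    now: datetime,
    window: timedelta = timedelta(hours=1),
    limit: int = 10,
) -> List[Tuple[str, int]]:
    """Top Actor values by ApiInbound request count within the window."""
    return db.execute(
        "SELECT Actor, COUNT(*) AS n FROM AuditLog"
        " WHERE Channel = 'ApiInbound'"
        " AND OccurredAtUtc > ? AND OccurredAtUtc <= ?"
        " GROUP BY Actor ORDER BY n DESC, Actor LIMIT ?",
        (_iso(now - window), _iso(now), limit),
    ).fetchall()
```

Both queries filter on `OccurredAtUtc` together with `Channel` or `Status`, which is the access pattern the `IX_AuditLog_Channel_Status_Occurred` index is listed for.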
+
+## Configuration
+
+Bound from `appsettings.json` to a new `AuditLogOptions` class owned by this
+component (Options pattern):
+
+```jsonc
+"AuditLog": {
+  "DefaultCapBytes": 8192,
+  "ErrorCapBytes": 65536,
+  "HeaderRedactList": [ "Authorization", "Cookie", "Set-Cookie", "X-API-Key" ],
+  "GlobalBodyRedactors": [
+    { "Pattern": "\"password\"\\s*:\\s*\"[^\"]+\"", "Replacement": "\"password\":\"\"" }
+  ],
+  "PerTargetOverrides": {
+    "Weather/GetForecast": { "CapBytes": 4096 },
+    "PlantDB": { "RedactSqlParamsMatching": "@apikey|@token" }
+  },
+  "RetentionDays": 365
+}
+```
+
+`PerTargetOverrides` keys bind by External System / Inbound Method /
+Notification List / Database Connection name. `RetentionDays` is a single
+global value in v1; per-channel overrides are deferred to v1.x.
+
+## Dependencies
+
+- **[Commons (#16)](Component-Commons.md)** — `AuditEvent`, `IAuditWriter` /
+  `ICentralAuditWriter` interfaces, and the `AuditChannel`, `AuditKind`,
+  `AuditStatus` enum types live here.
+- **[Configuration Database (#17)](Component-ConfigurationDatabase.md)** — hosts
+  the `AuditLog` table schema, the monthly partition function and scheme, the
+  `scadalink_audit_writer` / `scadalink_audit_purger` DB roles, and the EF
+  migration. Distinct concern from `IAuditService` (config-change audit), which
+  is unchanged.
+- **[Cluster Infrastructure (#13)](Component-ClusterInfrastructure.md)** —
+  singleton placement and supervision for `AuditLogIngestActor`,
+  `SiteAuditTelemetryActor`, `SiteAuditReconciliationActor`, and
+  `AuditLogPurgeActor`.
+- **[Central–Site Communication (#5)](Component-Communication.md)** — carries
+  audit telemetry. New gRPC message types (`IngestAuditEvents`,
+  `PullAuditEvents`) are added to the existing site-stream proto additively.
+- **[Site Runtime (#3)](Component-SiteRuntime.md)** — script-trust-boundary
+  call paths invoke `IAuditWriter` to append events.
+- **[Host (#15)](Component-Host.md)** — registers this component (#23) under
+  the central and site roles.
+
+## Interactions
+
+- **[External System Gateway (#7)](Component-ExternalSystemGateway.md)** —
+  emits `ApiOutbound.SyncCall` rows on every sync `Call()`. For `CachedCall`,
+  emits the combined cached telemetry packet (audit row + operational update)
+  per § Cached Operations — Combined Telemetry.
+- **[Site Runtime (#3)](Component-SiteRuntime.md) — Database layer** — emits
+  `DbOutbound.SyncWrite`, `DbOutbound.SyncRead`, and the cached-write variants
+  via the same combined-telemetry path.
+- **[Inbound API (#14)](Component-InboundAPI.md)** — emits one
+  `ApiInbound.Completed` row per request from request-handler middleware,
+  written directly to central via `ICentralAuditWriter` before the response is
+  flushed.
+- **[Notification Outbox (#21)](Component-NotificationOutbox.md)** — the
+  site-emitted `Notification.Enqueued` row flows via audit telemetry; the
+  central dispatcher writes `Notification.Attempt` (per delivery attempt) and
+  `Notification.Terminal` (on terminal status) directly via
+  `ICentralAuditWriter`. The operational `Notifications` table is unchanged.
+- **[Site Call Audit (#22)](Component-SiteCallAudit.md)** — shares the
+  cached-call telemetry packet. Central ingest of that packet performs both the
+  `AuditLog` insert and the `SiteCalls` upsert in one transaction. `SiteCalls`
+  remains the operational state store; the Audit Log is its immutable shadow.
+- **[Central UI (#9)](Component-CentralUI.md)** — a new **Audit** nav group
+  hosts the Audit Log page (filter bar, results grid, drilldown drawer,
+  server-side CSV export). Drill-in links appear on Notifications, Site Calls,
+  External Systems, Inbound API key, Sites, and Instances detail pages.
+- **[Health Monitoring (#11)](Component-HealthMonitoring.md)** — three new
+  tiles (Volume, Error rate, Backlog) plus new health metrics:
+  `SiteAuditBacklog`, `SiteAuditWriteFailures`, `SiteAuditTelemetryStalled`,
+  `CentralAuditWriteFailures`, `AuditRedactionFailure`.
+- **[CLI (#19)](Component-CLI.md)** — new `scadalink audit query`,
+  `scadalink audit export`, and `scadalink audit verify-chain` commands; same
+  permission requirements as the UI.