ScadaBridge/docs/requirements/Component-AuditLog.md
Joseph Doherty 639e331db1 test+docs(m5): M5.7 — de-date 2 EndToEnd purge tests (closes #52); document T3-T8 in Component-AuditLog/-CLI/README/CLAUDE
2026-06-16 22:26:09 -04:00

# Component: Audit Log
## Purpose
Provides a single, append-only, forensic + operational record of every
integration action initiated by, or terminating in, a script — across outbound
API, outbound DB, notifications, and inbound API. One row per lifecycle event,
rich payloads, long retention, dashboards, drilldowns, and filter queries,
answering both forensic questions ("did instance X send notification Y on date
Z, with what body?") and operational ones ("which inbound caller is hammering
us right now?").
The Audit Log is **not a dispatcher**. It does not drive delivery, retry loops,
or operator Retry/Discard actions — those remain in [Notification Outbox](Component-NotificationOutbox.md)
and [Site Call Audit](Component-SiteCallAudit.md). The Audit Log is the
immutable history that **observes** those subsystems and adds coverage where
they are silent (sync `ExternalSystem.Call`, sync DB writes and reads, inbound
API requests).
## Location
Central cluster and site clusters.
- **Central:** the `AuditLog` table in central MS SQL, plus three singletons on
the active central node — `AuditLogIngestActor` (telemetry receiver),
`SiteAuditReconciliationActor`, and `AuditLogPurgeActor`.
- **Sites:** a site-local `AuditLog` SQLite database file alongside the
Store-and-Forward buffer, plus a `SiteAuditTelemetryActor` singleton on the
active site node.
Registered as component #23 in the Host role configuration.
## Responsibilities
- Accept site-local hot-path audit writes from script-trust-boundary call paths.
- Forward site audit rows to central via gRPC telemetry with at-least-once
delivery and idempotency on `EventId`.
- Run periodic per-site reconciliation pulls so missed telemetry self-heals.
- Accept central-originated audit writes (Inbound API, Notification dispatch
attempts and terminal status).
- Compute point-in-time KPIs (global and per-site) from the central `AuditLog`
table.
- Purge expired rows by monthly partition switch; row-level deletes are limited to per-channel retention overrides (see Retention & Purge).
## Scope — the script trust boundary
The Audit Log captures every action a script causes to cross the cluster trust
boundary:
| Channel | Trigger | Direction | Covered today? |
|---|---|---|---|
| `ExternalSystem.Call(...)` | Script | Outbound | No (gap) |
| `ExternalSystem.CachedCall(...)` | Script | Outbound | Yes — `SiteCalls` (Site Call Audit) |
| `Database.Connection().Execute*(...)` — writes | Script | Outbound | No (gap) |
| `Database.CachedWrite(...)` | Script | Outbound | Yes — `SiteCalls` (Site Call Audit) |
| `Notify.To(list).Send(...)` | Script | Outbound | Yes — `Notifications` (Notification Outbox) |
| `POST /api/{method}` (Inbound API) | External | Inbound (invokes a script) | No (gap) |
Out of scope — framework traffic is not audited:
- Health checks, heartbeats, cluster membership messages.
- gRPC inter-cluster real-time streams (attribute values, alarm states).
- Data Connection Layer ↔ OPC UA / custom protocol traffic.
- LDAP authentication probes, Traefik routing decisions.
- Internal Configuration Database queries by the framework.
- Site Event Log writes; audit log writes themselves.
Script-initiated DB **reads** via `Database.Connection().ExecuteReader(...)`
count as actions from a script and are in scope. Reads via DCL / subscriptions
are framework traffic and excluded.
## The `AuditLog` Table (central)
Single wide table in central MS SQL, polymorphic by `Channel` + `Kind`
discriminators, with a JSON `Extra` column for channel-specific overflow. One
row per lifecycle event across all channels.
| Column | Type | Notes |
|---|---|---|
| `EventId` | `uniqueidentifier` PK | Generated where the event originates (site or central). Idempotency key. |
| `OccurredAtUtc` | `datetime2` | When the event happened (call returned, retry attempted, etc.). |
| `IngestedAtUtc` | `datetime2` | When central persisted the row (lags `OccurredAtUtc` for site-originated rows). |
| `Channel` | `varchar(32)` | `ApiOutbound` \| `DbOutbound` \| `Notification` \| `ApiInbound`. |
| `Kind` | `varchar(32)` | Event kind discriminator (see kinds list below). |
| `CorrelationId` | `uniqueidentifier` NULL | Ties multi-event operations together. `TrackedOperationId` for cached calls, `NotificationId` for notifications, request-id for inbound API. NULL for sync one-shot calls. |
| `ExecutionId` | `uniqueidentifier` NULL | The originating script execution / inbound request — the universal per-run correlation value; distinct from `CorrelationId`, which is the per-operation lifecycle id. Stamped on *every* audit row emitted by one execution. |
| `ParentExecutionId` | `uniqueidentifier` NULL | The `ExecutionId` of the execution that *spawned* this run — the cross-execution correlation pointer. Set on every row of an inbound-API-routed site script run (= the inbound request's `ExecutionId`); NULL for a top-level run (inbound, tag-change / timer-triggered, un-bridged). |
| `SourceSiteId` | `varchar(64)` NULL | NULL for central-originated events. |
| `SourceNode` | `varchar(64)` NULL | The cluster node on which the event was emitted — `node-a` / `node-b` for site rows (qualified by `SourceSiteId`), `central-a` / `central-b` for central-originated rows. Nullable so reconciled rows from a node that has since been retired don't block ingest. |
| `SourceInstanceId` | `varchar(128)` NULL | Instance whose script initiated the action (when applicable). |
| `SourceScript` | `varchar(128)` NULL | Script name within the instance. |
| `Actor` | `varchar(128)` NULL | Inbound API: API key name. Outbound: script identity. Central: system user. |
| `Target` | `varchar(256)` NULL | Outbound API: external system + method. DB: connection name. Notification: list name. Inbound API: method name. |
| `Status` | `varchar(32)` | Outcome of *this event*: `Submitted`, `Forwarded`, `Attempted`, `Delivered`, `Failed`, `Parked`, `Discarded`, `Skipped`. |
| `HttpStatus` | `int` NULL | HTTP-bearing events only. |
| `DurationMs` | `int` NULL | Call / attempt duration. |
| `ErrorMessage` | `nvarchar(1024)` NULL | Truncated; `ErrorDetail` for full text. |
| `ErrorDetail` | `nvarchar(max)` NULL | Optional full exception text on failures. |
| `RequestSummary` | `nvarchar(max)` NULL | Truncated request payload (configurable cap). Headers redacted. For `Channel = ApiInbound`, captured in full up to `AuditLog:InboundMaxBytes` (default 1 MiB) — see Payload Capture Policy. |
| `ResponseSummary` | `nvarchar(max)` NULL | Truncated response payload. For `Channel = ApiInbound`, captured in full up to `AuditLog:InboundMaxBytes` (default 1 MiB). For other channels, capped at `DefaultCapBytes` by default and `ErrorCapBytes` on error rows. |
| `PayloadTruncated` | `bit` | Set if either summary was truncated. |
| `Extra` | `nvarchar(max)` NULL | Channel-specific JSON for fields we don't promote to columns. |
**Indexes (first cut):**
- `IX_AuditLog_OccurredAtUtc` — primary time-range index for global scans.
- `IX_AuditLog_Site_Occurred (SourceSiteId, OccurredAtUtc)` — per-site filters.
- `IX_AuditLog_Node_Occurred (SourceNode, OccurredAtUtc)` — per-node filters ("everything `central-a` did in window X", or pinning a misbehaving site node).
- `IX_AuditLog_CorrelationId (CorrelationId)` — drilldown from a single operation.
- `IX_AuditLog_Execution (ExecutionId)` — drilldown to every action of one script execution / inbound request.
- `IX_AuditLog_ParentExecution (ParentExecutionId)` — cross-execution drilldown: the downward leg of the execution-tree walk seeks on it (`ParentExecutionId = ancestor.ExecutionId`), and it backs the `parentExecutionId` filter.
- `IX_AuditLog_Channel_Status_Occurred (Channel, Status, OccurredAtUtc)` — KPI / dashboard tiles.
- `IX_AuditLog_Target_Occurred (Target, OccurredAtUtc)` — "what did we send to system X".
- Monthly partitioning on `OccurredAtUtc` from day one; purge is a partition switch (see Retention & Purge).
**`Kind` values (flat — 10 discriminators across all channels):**
| Kind | Fires when |
|---|---|
| `ApiCall` | Sync `ExternalSystem.Call(...)` returns (success or permanent failure). One row per call. |
| `ApiCallCached` | A cached outbound-API attempt records its forward-ack (`Forwarded`) or each retry (`Attempted`). |
| `DbWrite` | Sync `Database.Connection().Execute*(...)` / `ExecuteReader(...)` completes. One row per call. |
| `DbWriteCached` | A cached outbound-DB attempt records its forward-ack (`Forwarded`) or each retry (`Attempted`). |
| `NotifySend` | Script's `Notify.To(list).Send(...)` is enqueued on the site; this is the first row in a notification's lifecycle (`Status=Submitted`). |
| `NotifyDeliver` | Central Notification Outbox dispatcher records a delivery attempt (`Attempted`) or terminal outcome (`Delivered`/`Parked`/`Discarded`). |
| `InboundRequest` | An inbound API request completes — one row per request, written at request end with final status. |
| `InboundAuthFailure` | An inbound API request was rejected at the auth boundary (bad/missing key). One row, `Status=Failed`, `HttpStatus=401`. |
| `CachedSubmit` | Script-side enqueue of a cached call (`ExternalSystem.CachedCall` / `Database.CachedWrite`); first row in the cached-call lifecycle, written to site SQLite before any forward attempt. |
| `CachedResolve` | Terminal row for a cached operation — `Status` = `Delivered` / `Failed` / `Parked` / `Discarded`. |
Inbound API is intentionally collapsed to a single `InboundRequest` (or
`InboundAuthFailure` for auth rejections) row per request rather than a
multi-event lifecycle.
### `ExecutionId` vs `CorrelationId`
The table carries two correlation columns at different granularities:
- **`ExecutionId`** is the *universal per-run* value: one id per script
execution (tag-change / timer-triggered or otherwise) or per inbound API
request. It is stamped on **every** audit row that run produces — the sync
`ApiCall` and `DbWrite` rows, the full cached-call lifecycle, the
`NotifySend` / `NotifyDeliver` rows, and the inbound row alike. A run that
performs no trust-boundary action emits no rows, but any run that emits
multiple rows ties them all together under one `ExecutionId`. This lets an
audit reader pull the complete trust-boundary footprint of a single script
run with one `ExecutionId` filter.
- **`CorrelationId`** is the *per-operation lifecycle* id — it groups the
multiple events of one long-running operation (`TrackedOperationId` for a
cached call, `NotificationId` for a notification, request-id for inbound
API) and is NULL for sync one-shot calls that have no operation lifecycle.
The two are orthogonal: one execution may touch several operations (each with
its own `CorrelationId`) yet every resulting row shares the one `ExecutionId`.
**`ParentExecutionId`** adds *cross-execution* correlation on top. `ExecutionId`
is per-run and flat — `WHERE ExecutionId = X` returns everything one run did, but
nothing links a run to the run that *spawned* it. `ParentExecutionId` carries the
spawning execution's `ExecutionId`: a spawned run still gets its own fresh
`ExecutionId`, and every audit row it emits also carries the spawner's id in
`ParentExecutionId`. The pointer always references the *immediate* spawner, so a
run that itself spawns further runs threads its own `ExecutionId` — walking
`ParentExecutionId → ExecutionId` recursively reconstructs the call chain as a
tree of arbitrary depth.
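
The recursive walk described above can be sketched in a few lines of Python. This is an illustration only, not the project's implementation: `build_execution_tree` is a hypothetical helper, rows are modelled as plain dicts carrying just the two correlation columns, and string ids stand in for the `uniqueidentifier` values.

```python
def build_execution_tree(rows, root_execution_id):
    """Return a nested dict {execution_id: {child_id: {...}}} for one root run."""
    # Index spawner -> distinct spawned executions. Many rows share an
    # ExecutionId, so a set collapses them to one node per run.
    children = {}
    for row in rows:
        parent = row.get("ParentExecutionId")
        if parent is not None:
            children.setdefault(parent, set()).add(row["ExecutionId"])

    def walk(execution_id, seen):
        # `seen` guards against malformed data that contains a cycle.
        if execution_id in seen:
            return {}
        seen = seen | {execution_id}
        return {
            child: walk(child, seen)
            for child in sorted(children.get(execution_id, ()))
        }

    return {root_execution_id: walk(root_execution_id, frozenset())}


# Example: an inbound request routes to a site run, which spawns a nested run.
rows = [
    {"ExecutionId": "inbound-1", "ParentExecutionId": None},
    {"ExecutionId": "site-run-2", "ParentExecutionId": "inbound-1"},
    {"ExecutionId": "site-run-2", "ParentExecutionId": "inbound-1"},
    {"ExecutionId": "nested-3", "ParentExecutionId": "site-run-2"},
]
tree = build_execution_tree(rows, "inbound-1")
```

Each level of the walk corresponds to one seek on `ParentExecutionId`, which is the access pattern `IX_AuditLog_ParentExecution` is described as serving.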
**Tag-cascade coverage (M5.4 T4):** `ParentExecutionId` threading now spans all
known spawn points:
- **Inbound API → routed site script** — an inbound request runs a method script
that calls `Route.Call`; the routed site script records the inbound request's
`ExecutionId` as its `ParentExecutionId`, while the inbound `InboundRequest` row
is top-level (`ParentExecutionId` NULL).
- **Alarm-triggered on-trigger script** — when an alarm fires and its on-trigger
script runs (via `AlarmActor → AlarmExecutionActor`), the alarm context's
`ExecutionId` is carried as the run's `ParentExecutionId`. Currently the alarm
subsystem has no Guid-typed firing id so on-trigger runs are roots (NULL) in
practice, but the wiring is in place for a future alarm `ExecutionId`.
- **Nested `CallScript` / `CallShared` invocations** — when a script calls
`Instance.CallScript(...)` or a shared script via `CallShared`, the calling
execution's `ExecutionId` threads into the spawned run as its
`ParentExecutionId`, making deeply nested call chains visible as a tree.
Attribute-write-triggered cascades (one tag change triggering another script via a
tag subscription) are also wired: trigger-driven runs carry `ParentExecutionId =
NULL` (top-level roots), and any nested `CallScript`/`CallShared` they perform
chains as above. The schema is unchanged — no further tag-cascade work is deferred.
## The Site-Local `AuditLog` (SQLite)
A SQLite database file on each site node, alongside the Store-and-Forward
buffer. Same schema as central minus `IngestedAtUtc` (irrelevant at the source),
plus a `ForwardState` column with values `Pending | Forwarded | Reconciled` that
drives the telemetry loop and reconciliation pull. `SourceNode` is stamped by the
writing node itself (`node-a` / `node-b`) at append time and travels with the row
through telemetry and reconciliation unchanged.
**Site SQLite retention rule (hard invariant):**
> A row is eligible for purge only when both `OccurredAtUtc < retention threshold` AND `ForwardState IN ('Forwarded', 'Reconciled')`. Pending rows are never purged.
A prolonged central outage grows the site audit table without bound until
central is reachable again. This is intentional: losing audit rows to make
room is a compliance violation, not a self-healing behavior. To keep that
growth visible, the site emits a `SiteAuditBacklog` health metric (pending
row count, oldest pending age, bytes on disk); crossing operator-configured
thresholds surfaces a warning on the relevant site tile in the Health
dashboard, mirroring the Store-and-Forward Engine's backlog metric.
Central is the durable home. Site SQLite is a write-buffer with a forwarding
guarantee.
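
As a rough illustration of the retention invariant, the sketch below applies it to an in-memory SQLite table using Python's standard `sqlite3` module. The three-column table and the `purge_site_rows` helper are assumptions made for the example; the real site schema carries the full column set, and the real purge job may be structured differently.

```python
import sqlite3

# Both conditions must hold; Pending rows never match, whatever their age.
PURGE_SQL = """
DELETE FROM AuditLog
WHERE OccurredAtUtc < :threshold
  AND ForwardState IN ('Forwarded', 'Reconciled')
"""


def purge_site_rows(conn, threshold_utc):
    """Delete eligible rows and return how many were removed."""
    cur = conn.execute(PURGE_SQL, {"threshold": threshold_utc})
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AuditLog ("
    "EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT NOT NULL, "
    "ForwardState TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO AuditLog VALUES (?, ?, ?)",
    [
        ("e1", "2026-01-01T00:00:00Z", "Forwarded"),   # old + forwarded: purged
        ("e2", "2026-01-01T00:00:00Z", "Pending"),     # old but Pending: kept
        ("e3", "2026-01-02T00:00:00Z", "Reconciled"),  # old + reconciled: purged
        ("e4", "2026-03-01T00:00:00Z", "Forwarded"),   # inside retention: kept
    ],
)
deleted = purge_site_rows(conn, "2026-02-01T00:00:00Z")
remaining = [r[0] for r in conn.execute("SELECT EventId FROM AuditLog ORDER BY EventId")]
```

The timestamps are stored as same-format ISO-8601 text so that string comparison matches chronological order; that storage choice is part of the sketch, not a statement about the actual site schema.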
## Ingestion Paths
Four paths feed the central `AuditLog` — one site originator and three central
writers — all idempotent on `EventId`.
### Site hot-path append (site-originated events)
The component completing a script-trust-boundary action (External System
Gateway, Database layer, Store-and-Forward Engine) builds an `AuditEvent` with a
fresh `EventId` (Guid v4) and `OccurredAtUtc = UtcNow`, then appends it to the
site-local `AuditLog` SQLite via `IAuditWriter` with
`ForwardState = 'Pending'`. The append is a single-statement INSERT and is
durable in microseconds; control returns to the script with no central
round-trip on the hot path.
### Telemetry forward (site → central)
A `SiteAuditTelemetryActor` singleton drives the forwarding loop: select up to
N `Pending` rows ordered by `OccurredAtUtc`, batch-send them to central via the
existing `SiteStream` gRPC channel as `IngestAuditEvents(events)`, and on
central-ack flip `ForwardState = 'Forwarded'` for accepted IDs. Rejected IDs
stay `Pending` for the next sweep. Cadence is short (default 5 s) when
non-empty, longer (default 30 s) when idle; telemetry runs on a dedicated
dispatcher.
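
One sweep of that loop might look like the following Python sketch. It is a simplification under stated assumptions: `forward_pending` is a hypothetical function, `send_batch` stands in for the `IngestAuditEvents` gRPC call and is assumed to return the ids central accepted, and only the columns needed for the example are modelled.

```python
import sqlite3


def forward_pending(conn, send_batch, batch_size=100):
    """Send the oldest Pending rows and flip the acknowledged ones."""
    rows = conn.execute(
        "SELECT EventId FROM AuditLog WHERE ForwardState = 'Pending' "
        "ORDER BY OccurredAtUtc LIMIT ?",
        (batch_size,),
    ).fetchall()
    event_ids = [row[0] for row in rows]
    if not event_ids:
        return []
    accepted = list(send_batch(event_ids))
    # Only accepted ids change state; rejected ids stay Pending for the next sweep.
    conn.executemany(
        "UPDATE AuditLog SET ForwardState = 'Forwarded' "
        "WHERE EventId = ? AND ForwardState = 'Pending'",
        [(event_id,) for event_id in accepted],
    )
    conn.commit()
    return accepted


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AuditLog ("
    "EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT NOT NULL, "
    "ForwardState TEXT NOT NULL DEFAULT 'Pending')"
)
conn.executemany(
    "INSERT INTO AuditLog (EventId, OccurredAtUtc) VALUES (?, ?)",
    [
        ("e1", "2026-02-01T00:00:01Z"),
        ("e2", "2026-02-01T00:00:02Z"),
        ("e3", "2026-02-01T00:00:03Z"),
    ],
)


def fake_central(event_ids):
    # Stand-in for central: accepts everything except e2.
    return [e for e in event_ids if e != "e2"]


acked = forward_pending(conn, fake_central, batch_size=2)
states = dict(conn.execute("SELECT EventId, ForwardState FROM AuditLog"))
```

With a batch size of 2, the sweep picks up `e1` and `e2`, flips only `e1`, and leaves `e3` untouched until a later sweep.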
### Reconciliation pull (self-healing for missed telemetry)
A central `SiteAuditReconciliationActor` periodically (default 5 min per site)
asks each site for its oldest `Pending` row and pending count; if backlog is
non-draining (e.g., telemetry actor wedged), central issues a
`PullAuditEvents(sinceUtc, batchSize)` and inserts-if-not-exists. Accepted rows
are flipped to `ForwardState = 'Reconciled'` site-side. Same self-healing
pattern as Site Call Audit's reconciliation of `SiteCalls`.
### Central direct-write (central-originated events)
Events originating at central never touch site SQLite. Inbound API writes one
`ApiInbound.InboundRequest` row via `ICentralAuditWriter` synchronously inside
the request-handler middleware, before the HTTP response is flushed; auth-layer
rejections emit `ApiInbound.InboundAuthFailure` (`Status=Failed`, HTTP 401)
instead. The Notification Outbox dispatcher writes
`Notification.NotifyDeliver` with `Status=Attempted` per delivery attempt and
`Notification.NotifyDeliver` with `Status=Delivered`/`Parked`/`Discarded` on
terminal status. Central direct-writes use the same insert-if-not-exists
semantics keyed on `EventId`. `SourceSiteId` is NULL on all central direct-write
rows; `SourceNode` is stamped to the local central node's role name
(`central-a` / `central-b`).
## Cached Operations — Combined Telemetry
For `ExternalSystem.CachedCall` and `Database.CachedWrite`, the **site** is the
source of truth for every audit row. The site writes each lifecycle event —
`CachedSubmit` (`Status=Submitted`), then `ApiCallCached`/`DbWriteCached` rows
for the forward-ack (`Status=Forwarded`) and each retry (`Status=Attempted`),
then a terminal `CachedResolve` row
(`Status=Delivered`/`Failed`/`Parked`/`Discarded`) — to its local SQLite
`AuditLog` on the hot path (or on the retry tick for `Attempted` rows), then
forwards via the same telemetry channel. The telemetry message format gains the
audit-row fields additively — one packet per lifecycle transition carries both
the operational state update AND the audit row content.
On receipt, central performs both writes in one transaction:
1. Insert-if-not-exists the immutable `AuditLog` row, keyed on `EventId`.
2. Upsert the operational `SiteCalls` row — existing Site Call Audit behavior
(status, retry count, last error, timestamps).
This collapses two telemetry concerns into one, keeps site SQLite as the
single local source of truth for audit content, and preserves the existing
operational `SiteCalls` shape for the dispatcher and UI.
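
The two-write transaction can be sketched as below, with SQLite standing in for central MS SQL. The table shapes, the `ingest_cached_packet` helper, and the packet fields are assumptions chosen for illustration; in particular the `SiteCalls` columns shown here are a guess at a minimal subset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE AuditLog (
        EventId TEXT PRIMARY KEY, CorrelationId TEXT, Kind TEXT, Status TEXT);
    CREATE TABLE SiteCalls (
        TrackedOperationId TEXT PRIMARY KEY, Status TEXT, RetryCount INTEGER);
    """
)


def ingest_cached_packet(conn, packet):
    """Apply one combined telemetry packet atomically."""
    with conn:  # commits on success, rolls back if either statement raises
        # 1. Immutable audit row: insert only if this EventId is new.
        conn.execute(
            "INSERT INTO AuditLog (EventId, CorrelationId, Kind, Status) "
            "SELECT :EventId, :CorrelationId, :Kind, :Status "
            "WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = :EventId)",
            packet,
        )
        # 2. Operational row: upsert the latest state for the tracked operation.
        conn.execute(
            "INSERT INTO SiteCalls (TrackedOperationId, Status, RetryCount) "
            "VALUES (:CorrelationId, :Status, :RetryCount) "
            "ON CONFLICT(TrackedOperationId) DO UPDATE SET "
            "Status = excluded.Status, RetryCount = excluded.RetryCount",
            packet,
        )


submit = {"EventId": "ev-1", "CorrelationId": "op-1", "Kind": "CachedSubmit",
          "Status": "Submitted", "RetryCount": 0}
attempt = {"EventId": "ev-2", "CorrelationId": "op-1", "Kind": "ApiCallCached",
           "Status": "Attempted", "RetryCount": 1}

ingest_cached_packet(conn, submit)
ingest_cached_packet(conn, submit)   # redelivery: no duplicate audit row
ingest_cached_packet(conn, attempt)

audit_count = conn.execute("SELECT COUNT(*) FROM AuditLog").fetchone()[0]
site_call = conn.execute("SELECT * FROM SiteCalls").fetchone()
```

The audit table grows by one row per distinct lifecycle event, while `SiteCalls` keeps a single mutable row per operation, which is the split the section describes.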
## Payload Capture Policy
- **Default cap.** 8 KiB for each of `RequestSummary` and `ResponseSummary`;
raised to 64 KiB on any error row (`Status IN ('Failed', 'Parked', 'Discarded')`).
- **Inbound API exception.** For `Channel = ApiInbound`, `RequestSummary` and
`ResponseSummary` are captured in full up to a per-body hard ceiling of 1 MiB
(configurable via `AuditLog:InboundMaxBytes`; default 1 048 576 bytes; min
8 192; max 16 777 216). The 8 KiB / 64 KiB default/error caps that apply to
other channels do not apply here. `PayloadTruncated = 1` is set only when the
inbound ceiling is hit — verbatim capture is the normal case. The ceiling
applies independently to each body. Header redaction and per-target body
redactors still run before persistence.
- **Inbound ceiling hits (M5.3 T7).** Every time the `InboundMaxBytes` ceiling
truncates a body an `IAuditInboundCeilingHitsCounter.Increment()` call fires.
This counter is surfaced as `AuditInboundCeilingHits` on the central health
snapshot (alongside `CentralAuditWriteFailures` / `AuditRedactionFailure`) so
operators can detect persistently oversized payloads and raise the ceiling or
add per-target body redactors.
- **Request headers in `Extra` (M5.3 T7).** For `Channel = ApiInbound`, the
`AuditWriteMiddleware` captures the inbound HTTP request headers into the
`Extra` JSON column under the key `"requestHeaders"`. Capture is
post-redaction: `Authorization`, `X-API-Key`, `Cookie`, `Set-Cookie`, and any
header on the configured `HeaderRedactList` are scrubbed before serialization.
This makes the full header envelope visible in the Audit Log UI's detail drawer
and the CLI's `audit query` output without widening the schema.
- **Per-method `SkipBodyCapture` (M5.3 T7).** `PerTargetOverrides` now includes
a `SkipBodyCapture: true` flag. When set for an inbound API method, the audit
row is always emitted (headers, status, duration, actor, etc. are recorded) but
`RequestSummary` and `ResponseSummary` are left null. Use this for methods whose
payloads are structurally large or contain secrets not covered by body redactors.
Headers are still captured into `Extra.requestHeaders` (after redaction) even
when `SkipBodyCapture` is true.
- **Truncation** — UTF-8 byte-safe; `PayloadTruncated = 1` when applied. Full
bodies are never stored.
- **HTTP headers** — `Authorization`, `Cookie`, `Set-Cookie`, `X-API-Key`, and
any header matching the configured redact-list regex become `<redacted>`.
- **HTTP bodies** — captured verbatim by default. Operators register per-target
body redactors (regex → replacement) for known secret fields.
- **SQL** — statement text and parameter values captured verbatim by default;
per-connection opt-in to redact parameters whose name matches a regex.
- **Never captured** — raw API key material (only the key *name* via `Actor`),
LDAP bind credentials, cluster secrets, Configuration DB connection strings.
- **Safety net** — if a configured redactor throws, the affected payload becomes
`"<redacted: redactor error>"` and `AuditRedactionFailure` increments. We
over-redact, never under-redact, on configuration faults.
Redaction happens at the write site, before the row touches SQLite (or central
MS SQL for direct-write events). Unredacted secrets never persist.
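
A minimal Python sketch of three of the rules above (status-dependent caps, UTF-8 byte-safe truncation, and header redaction) follows. The function names are hypothetical, and the regex argument is a simplified stand-in for the configured redact list; the actual redaction pipeline is not shown in this document.

```python
import re

DEFAULT_CAP_BYTES = 8192    # mirrors AuditLog:DefaultCapBytes
ERROR_CAP_BYTES = 65536     # mirrors AuditLog:ErrorCapBytes
ERROR_STATUSES = {"Failed", "Parked", "Discarded"}
ALWAYS_REDACTED = {"authorization", "cookie", "set-cookie", "x-api-key"}


def cap_for(status):
    """Error rows get the larger cap."""
    return ERROR_CAP_BYTES if status in ERROR_STATUSES else DEFAULT_CAP_BYTES


def truncate_utf8(text, cap_bytes):
    """Return (summary, truncated) without splitting a multi-byte character."""
    data = text.encode("utf-8")
    if len(data) <= cap_bytes:
        return text, False
    # The input was valid UTF-8, so the only invalid bytes after slicing are a
    # partial character at the very end; errors="ignore" drops exactly those.
    return data[:cap_bytes].decode("utf-8", errors="ignore"), True


def redact_headers(headers, redact_pattern=None):
    """Replace sensitive header values with the <redacted> marker."""
    regex = re.compile(redact_pattern, re.IGNORECASE) if redact_pattern else None
    redacted = {}
    for name, value in headers.items():
        hit = name.lower() in ALWAYS_REDACTED or bool(regex and regex.search(name))
        redacted[name] = "<redacted>" if hit else value
    return redacted


summary, truncated = truncate_utf8("héllo", 2)   # 'é' is 2 bytes and does not fit
clean = redact_headers(
    {"Authorization": "Bearer abc", "Accept": "application/json",
     "X-Internal-Token": "t0k3n"},
    redact_pattern=r"token",
)
```

Truncating on bytes and then dropping a trailing partial character keeps the stored summary valid UTF-8 while never exceeding the cap.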
## Failure Handling & Idempotency
- **`EventId` is the dedup key.** Generated at the originator; central ingest
is `INSERT … WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = @id)`
under the PK constraint. Idempotent across telemetry retries, reconciliation
pulls, and any combination of the two.
- **Never fail the action.** A failed audit write — site SQLite or central
direct-write — logs a critical Site Event Log entry and increments a health
metric (`SiteAuditWriteFailures` or `CentralAuditWriteFailures`), but the
user-facing action proceeds. We do not fail script-initiated work because the
audit write failed.
- **Hot-path ring buffer.** While the site audit writer is unhealthy
(disk full, schema lock, transient IO), events buffer in a small in-memory
ring (default 1024 rows); oldest are discarded with a Site Event Log warning
per drop.
- **Reconciliation as fallback.** If two consecutive reconciliation cycles
report a non-draining backlog, the supervisor restarts the telemetry actor
and a `SiteAuditTelemetryStalled` event fires.
- **No dedup horizon.** `EventId` PK enforces uniqueness only while a row
exists. A retry that arrives after the original row is purged inserts a "new"
row — vanishingly rare and harmless.
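
The hot-path ring buffer behaviour can be illustrated with a small Python class. `AuditRingBuffer` and its `on_drop` callback are hypothetical names; the callback stands in for the Site Event Log warning emitted per dropped event.

```python
from collections import deque


class AuditRingBuffer:
    """Bounded in-memory buffer that evicts the oldest event when full."""

    def __init__(self, capacity=1024, on_drop=None):
        self._events = deque()
        self._capacity = capacity
        self._on_drop = on_drop or (lambda event: None)

    def push(self, event):
        if len(self._events) >= self._capacity:
            # Evict the oldest event and report it, one callback per drop.
            self._on_drop(self._events.popleft())
        self._events.append(event)

    def drain(self):
        """Hand back buffered events in arrival order once the writer recovers."""
        drained = list(self._events)
        self._events.clear()
        return drained


dropped = []
ring = AuditRingBuffer(capacity=2, on_drop=dropped.append)
for event_id in ("a", "b", "c"):
    ring.push(event_id)
buffered = ring.drain()
```

An explicit eviction step is used instead of `deque(maxlen=...)` so that each discarded event can be reported rather than silently overwritten.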
## Retention & Purge
- **Central:** 365-day default based on `OccurredAtUtc`, configurable via
`AuditLog:RetentionDays` (min 30, max 3650).
- **Partitioning:** monthly partitions on `OccurredAtUtc` from day one
(`pf_AuditLog_Month` / `ps_AuditLog_Month`). The global partition switch is
channel-blind; it drops a whole month once every row in it is older than the
global window. There are no row-level deletes at central for the global purge.
- **Purge actor:** `AuditLogPurgeActor` singleton on the active central node
runs daily, switches out any partition whose latest `OccurredAtUtc` is older
than the retention window, then applies any per-channel overrides (see below),
and emits an `AuditLog:Purged` event (partition range, rowcount, duration) per
switched partition. A partition-maintenance step rolls forward each month,
creating the next month's partition ahead of time.
- **Per-channel retention overrides (M5.5 T3):** `AuditLog:PerChannelRetentionDays`
is a dictionary keyed by canonical channel name (`ApiOutbound`, `DbOutbound`,
`Notification`, `ApiInbound`) whose value is a retention window in days. An
override takes effect only when it is strictly shorter than the global `RetentionDays`. After the daily
partition switch-out, the purge actor runs a bounded, batched row DELETE
(`PurgeChannelOlderThanAsync`) for each channel whose override is shorter than
the global window — expiring rows of that channel earlier than the global
partition switch would. Overrides equal to or longer than the global window are
silently skipped (the global switch already covers them). The DELETE runs under
`scadabridge_audit_purger` (the maintenance role); the append-only writer role
is unaffected. Batch size is configurable via
`AuditLogPurge:ChannelPurgeBatchSize` (default 5000). Each channel override
runs in its own try/catch, mirroring the per-boundary error-isolation of the
partition switch-out loop. Values are validated to be in
`[30, RetentionDays]`; keys that are not a recognized `AuditChannel` enum name
are rejected at startup.
- **Sites:** daily site job; default 7-day retention (configurable, min 1,
max 90). Respects the hard `ForwardState` invariant — `Pending` rows are
never purged on age alone.
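
The validation rules for per-channel overrides can be sketched as follows. This is one reading of the rules stated above, not the actual options validator: `effective_overrides` is a hypothetical helper, it rejects values outside `[30, RetentionDays]` and unknown channel names, and it drops overrides equal to the global window because the partition switch already covers them.

```python
CHANNELS = {"ApiOutbound", "DbOutbound", "Notification", "ApiInbound"}


def effective_overrides(retention_days, per_channel):
    """Validate the config and return only the overrides that change behavior."""
    if not 30 <= retention_days <= 3650:
        raise ValueError("RetentionDays must be within [30, 3650]")
    effective = {}
    for channel, days in per_channel.items():
        if channel not in CHANNELS:
            raise ValueError(f"unknown audit channel: {channel}")
        if not 30 <= days <= retention_days:
            raise ValueError(f"{channel}: override must be within [30, {retention_days}]")
        if days < retention_days:
            # Strictly shorter than the global window: needs the batched DELETE.
            effective[channel] = days
    return effective


overrides = effective_overrides(365, {"ApiOutbound": 90, "Notification": 365})
```

In this example only `ApiOutbound` would trigger a channel purge; the `Notification` override equals the global window and is skipped.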
## Security & Tamper-Evidence
- **Append-only enforcement.** The application accesses `AuditLog` via a
dedicated DB role `scadabridge_audit_writer` granted `INSERT` + `SELECT` only
(no `UPDATE`, no `DELETE`). Purge runs under a separate role
`scadabridge_audit_purger` whose permissions are limited to the partition-switch
operation and the bounded per-channel retention `DELETE` described under
Retention & Purge; the writer role is never granted row-level `DELETE`.
- **CI grep guard.** The build greps the data layer for any
`UPDATE … AuditLog` or `DELETE … AuditLog` text and fails on a hit.
- **Authorization.** Reading the Audit Log requires the existing **Audit** role
extended with a new **OperationalAudit** permission. Per-site row scoping
reuses the existing site-permission model; bulk export requires an additional
**AuditExport** permission.
- **Payload redaction at write.** See Payload Capture Policy. Unredacted
secrets never persist; the safety net over-redacts on misconfiguration.
- **Hash-chain tamper evidence (T1) — deferred to v1.x.** A future `RowHash`
column, computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will
be verifiable offline via `scadabridge audit verify-chain --month YYYY-MM`. The
`verify-chain` CLI command is a no-op placeholder today. Off by default in v1.
- **Parquet archival (T2) — deferred to v1.x.** Long-term cold storage of purged
monthly partitions as Parquet files (suitable for offline analytics) will be
added in a future milestone. T1 and T2 are not shipped as part of M5.
- **Site SQLite security.** File permissions: read/write by the ScadaBridge
service account only. Not backed up off-machine — site SQLite is a buffer,
not a record.
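
Since the hash chain (T1) is deferred, the following Python sketch only illustrates the stated formula `SHA-256(prev.RowHash || canonical(row))`. Everything beyond that formula is an assumption: the canonical form (compact JSON with sorted keys), the all-zero genesis value for the first row in a partition, and the helper names.

```python
import hashlib
import json

GENESIS = b"\x00" * 32  # assumed starting value for the first row of a partition


def canonical(row):
    """Assumed canonical form: compact JSON with sorted keys, UTF-8 encoded."""
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")


def chain_hashes(rows):
    """Compute RowHash for each row, each one covering its predecessor's hash."""
    prev = GENESIS
    hashes = []
    for row in rows:
        prev = hashlib.sha256(prev + canonical(row)).digest()
        hashes.append(prev)
    return hashes


def verify_chain(rows, stored_hashes):
    """True only if every stored hash matches a recomputation from the rows."""
    return chain_hashes(rows) == list(stored_hashes)


partition = [
    {"EventId": "ev-1", "Status": "Submitted"},
    {"EventId": "ev-2", "Status": "Delivered"},
]
stored = chain_hashes(partition)
intact = verify_chain(partition, stored)

tampered_rows = [dict(partition[0], Status="Failed"), partition[1]]
tampered = verify_chain(tampered_rows, stored)
```

Because each hash folds in the previous one, changing any earlier row invalidates every later hash in the partition, which is what makes offline verification meaningful.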
## KPIs
Point-in-time, computed from the central `AuditLog` table; global and per-site.
- **Audit volume** — events/min landing in the central `AuditLog`; global plus per-site sparkline.
- **Audit error rate** — % of central `AuditLog` rows with `Status IN ('Failed', 'Parked', 'Discarded')` over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, permanent failures, parked deliveries) — NOT audit-writer health, which surfaces separately via `CentralAuditWriteFailures` and `AuditRedactionFailure`.
- **Audit backlog** — sum of `Pending` site rows across sites; click drills into a per-site breakdown.
- **`AuditInboundCeilingHits`** (M5.3 T7): rolling count of inbound API request or response bodies truncated by the `InboundMaxBytes` ceiling; surfaced on the central health snapshot alongside `CentralAuditWriteFailures`.
**Per-node stuck KPIs (M5.3 T6):** Both [Notification Outbox](Component-NotificationOutbox.md)
and [Site Call Audit](Component-SiteCallAudit.md) now expose a
`PerNodeNotificationKpiRequest` / `PerNodeSiteCallKpiRequest` message pair that
groups the existing stuck, parked, and delivered-last-interval counts by the
`SourceNode` that emitted the original row. This surfaces per-node breakdowns on
the Health dashboard tiles and the Notification Outbox / Site Calls pages,
making it possible to identify a single misbehaving node (e.g., `site-a:node-b`)
as the source of a spike rather than a site-wide problem. The existing global and
per-site KPI shapes are unchanged; the per-node slice is additive.
[Notification Outbox](Component-NotificationOutbox.md) and
[Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected for their
operational dispatch responsibilities — they remain sourced from `Notifications`
and `SiteCalls` respectively. Audit Log KPIs describe the audit table itself.
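
As a worked example of the audit error rate KPI, the Python sketch below computes the percentage over a rolling 5-minute window from in-memory rows. It assumes the window is evaluated on `OccurredAtUtc` (the document does not say which timestamp is used), and `audit_error_rate` is a hypothetical helper, not the production query.

```python
from datetime import datetime, timedelta, timezone

ERROR_STATUSES = {"Failed", "Parked", "Discarded"}


def audit_error_rate(rows, now, window=timedelta(minutes=5)):
    """Percentage of rows inside the window whose Status is an error status."""
    recent = [r for r in rows if now - window <= r["OccurredAtUtc"] <= now]
    if not recent:
        return 0.0
    errors = sum(1 for r in recent if r["Status"] in ERROR_STATUSES)
    return 100.0 * errors / len(recent)


now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"OccurredAtUtc": now - timedelta(minutes=1), "Status": "Delivered"},
    {"OccurredAtUtc": now - timedelta(minutes=2), "Status": "Failed"},
    {"OccurredAtUtc": now - timedelta(minutes=3), "Status": "Attempted"},
    {"OccurredAtUtc": now - timedelta(minutes=4), "Status": "Forwarded"},
    {"OccurredAtUtc": now - timedelta(minutes=30), "Status": "Failed"},  # outside window
]
rate = audit_error_rate(rows, now)
```

One of the four in-window rows is an error, so the rate is 25%; the older failure does not count. In the central table the same filter would lean on `IX_AuditLog_Channel_Status_Occurred`.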
## Configuration
Bound from `appsettings.json` to a new `AuditLogOptions` class owned by this
component (Options pattern):
```jsonc
"AuditLog": {
  "DefaultCapBytes": 8192,
  "ErrorCapBytes": 65536,
  "InboundMaxBytes": 1048576,
  "HeaderRedactList": [ "Authorization", "Cookie", "Set-Cookie", "X-API-Key" ],
  "GlobalBodyRedactors": [
    { "Pattern": "\"password\"\\s*:\\s*\"[^\"]+\"", "Replacement": "\"password\":\"<redacted>\"" }
  ],
  "PerTargetOverrides": {
    "Weather/GetForecast": { "CapBytes": 4096 },
    "PlantDB": { "RedactSqlParamsMatching": "@apikey|@token" },
    "HighVolumeMethod": { "SkipBodyCapture": true }
  },
  "RetentionDays": 365,
  "PerChannelRetentionDays": {
    "ApiOutbound": 90,
    "Notification": 180
  }
}
```
`PerTargetOverrides` keys bind by External System / Inbound Method /
Notification List / Database Connection name. `SkipBodyCapture: true` omits
`RequestSummary`/`ResponseSummary` for that method while still capturing headers
into `Extra.requestHeaders` and emitting the full audit row. `RetentionDays` is
the global window; `PerChannelRetentionDays` specifies per-channel windows that
are strictly shorter — any channel whose override equals or exceeds the global
value is silently ignored (the global partition switch-out already governs it).
The `AuditLogPurge` section controls the purge actor cadence and batch size:
```jsonc
"AuditLogPurge": {
"IntervalHours": 24,
"ChannelPurgeBatchSize": 5000
}
```
## Ops Notes — Historical Null Columns
### `SourceNode` backfill (M5.6 T5)
`SourceNode` (`varchar(64)` NULL) is a physical column stamped on every row at
write time. Rows ingested before M5.6 shipped have `SourceNode IS NULL` because
the value was not populated until the feature landed. A one-time CLI command sets
these to a configurable sentinel:
```
scadabridge audit backfill-source-node --before <ISO-8601-UTC> [--sentinel unknown] [--batch 5000]
```
The default sentinel is `"unknown"`. The true node-of-origin for pre-feature rows
is **unknowable** retroactively — the emitting node is long gone from the telemetry
pipeline. The sentinel makes that explicit rather than leaving the column NULL
(which the Audit Log UI's Node filter already treats as "unresolved", but which
an operator might mistake for a data-quality bug).
The backfill runs via `POST /api/audit/backfill-source-node` (Admin role required)
on the maintenance/purge path, not under the append-only `scadabridge_audit_writer`
role. It is idempotent and can be re-run safely.
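The idempotent, batched nature of the backfill can be illustrated with a small sketch. The example below uses an in-memory-friendly SQLite table as a stand-in for the real database; the `OccurredAtUtc` column name, the two-column table shape, and the function itself are assumptions for illustration only. Because each pass only touches rows where `SourceNode IS NULL`, a second run finds nothing to update.

```python
import sqlite3


def backfill_source_node(conn: sqlite3.Connection, before_utc: str,
                         sentinel: str = "unknown", batch: int = 5000) -> int:
    """Set SourceNode to the sentinel on NULL rows older than before_utc.

    Works in bounded batches and returns the total number of rows updated.
    Timestamps are compared as ISO-8601 strings in a single consistent format.
    """
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE AuditLog SET SourceNode = ?
            WHERE rowid IN (
                SELECT rowid FROM AuditLog
                WHERE SourceNode IS NULL AND OccurredAtUtc < ?
                LIMIT ?
            )
            """,
            (sentinel, before_utc, batch),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
```

Rows that already carry a real node name are never touched, since the predicate filters on `SourceNode IS NULL`.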
### `ExecutionId` and `ParentExecutionId` — cannot be backfilled
`ExecutionId` and `ParentExecutionId` are **PERSISTED COMPUTED columns** derived
from `DetailsJson`. They were introduced in the same feature window as the column
itself but their value comes from the JSON payload that was written at ingest time.
The AuditLog append-only invariant **forbids mutating `DetailsJson`** — rows may
only be inserted, never updated. Because backfilling the computed values would
require rewriting the underlying `DetailsJson`, it is impossible under the
append-only contract. Pre-feature rows carry `NULL` in both columns permanently.
This is a documented limitation, not a defect. The NULL values are visible in the
Audit Log UI's execution-tree drilldown (rows with no `ExecutionId` appear as
orphaned entries) and in the CLI's `audit tree` output.
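To make the limitation concrete, the following sketch mimics how a computed column derived from `DetailsJson` yields `NULL` when the payload predates the feature. The JSON key names `executionId` and `parentExecutionId` are assumptions chosen for this example; the actual keys used by the computed-column expressions are not stated here.

```python
import json


def derive_execution_ids(details_json):
    """Return (ExecutionId, ParentExecutionId) as derived from DetailsJson.

    A payload without the keys yields (None, None), and the only way to
    change that would be to rewrite DetailsJson, which the append-only
    contract forbids.
    """
    details = json.loads(details_json) if details_json else {}
    return details.get("executionId"), details.get("parentExecutionId")
```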
## Dependencies
- **[Commons (#16)](Component-Commons.md)** — `AuditEvent`, `IAuditWriter` /
`ICentralAuditWriter` interfaces, and the `AuditChannel`, `AuditKind`,
`AuditStatus` enum types live here.
- **[Configuration Database (#17)](Component-ConfigurationDatabase.md)** — hosts
the `AuditLog` table schema, the monthly partition function and scheme, the
`scadabridge_audit_writer` / `scadabridge_audit_purger` DB roles, and the EF
migration. Distinct concern from `IAuditService` (config-change audit), which
is unchanged.
- **[Cluster Infrastructure (#13)](Component-ClusterInfrastructure.md)** —
singleton placement and supervision for `AuditLogIngestActor`,
`SiteAuditTelemetryActor`, `SiteAuditReconciliationActor`, and
`AuditLogPurgeActor`.
- **[CentralSite Communication (#5)](Component-Communication.md)** — carries
audit telemetry. New gRPC message types (`IngestAuditEvents`,
`PullAuditEvents`) are added to the existing site-stream proto additively.
- **[Site Runtime (#3)](Component-SiteRuntime.md)** — script-trust-boundary
call paths invoke `IAuditWriter` to append events.
- **[Host (#15)](Component-Host.md)** — registers this component (#23) under
the central and site roles.
## Interactions
- **[External System Gateway (#7)](Component-ExternalSystemGateway.md)** —
emits `ApiOutbound.ApiCall` rows on every sync `Call()`. For `CachedCall`,
emits the combined cached telemetry packet (audit row + operational update)
per Cached Operations — Combined Telemetry, using kinds
`CachedSubmit` / `ApiCallCached` / `CachedResolve`.
- **[External System Gateway (#7)](Component-ExternalSystemGateway.md) — Database layer** — the database access modes inside ESG emit `DbOutbound.DbWrite` rows on script-initiated `Connection()` calls (writes and reads share the kind; distinguish via `Extra.rowsAffected` vs `Extra.rowsReturned`); `Database.CachedWrite` emits the cached-write lifecycle rows via the combined-telemetry packet using kinds `CachedSubmit` / `DbWriteCached` / `CachedResolve` (same shape as `ApiOutbound`). Site Runtime is the API surface that exposes the `Database.*` calls to scripts; the audit emission itself lives in ESG.
- **[Inbound API (#14)](Component-InboundAPI.md)** — emits one
`ApiInbound.InboundRequest` row per successful request from request-handler
middleware, written directly to central via `ICentralAuditWriter` before the
response is flushed. Auth-layer rejections emit
`ApiInbound.InboundAuthFailure` instead (`Status=Failed`, HTTP 401).
- **[Notification Outbox (#21)](Component-NotificationOutbox.md)** — the
site-emitted `Notification.NotifySend` row (`Status=Submitted`) flows via
audit telemetry; the central dispatcher writes `Notification.NotifyDeliver`
  rows directly via `ICentralAuditWriter`: `Status=Attempted` per delivery
attempt, `Status=Delivered`/`Parked`/`Discarded` on terminal status. The
operational `Notifications` table is unchanged.
- **[Site Call Audit (#22)](Component-SiteCallAudit.md)** — shares the
cached-call telemetry packet. Central ingest of that packet performs both the
`AuditLog` insert and the `SiteCalls` upsert in one transaction. `SiteCalls`
remains the operational state store; the Audit Log is its immutable shadow.
- **[Central UI (#9)](Component-CentralUI.md)** — a new **Audit** nav group
hosts the Audit Log page (filter bar, results grid, drilldown drawer,
server-side CSV export). Drill-in links appear on Notifications, Site Calls,
External Systems, Inbound API key, Sites, and Instances detail pages.
Double-clicking a node on the execution-tree page opens a detail modal
listing that execution's audit rows, with click-through to each row's full
detail view.
- **[Health Monitoring (#11)](Component-HealthMonitoring.md)** — three new
tiles (Volume, Error rate, Backlog) plus new health metrics:
`SiteAuditBacklog`, `SiteAuditWriteFailures`, `SiteAuditTelemetryStalled`,
`CentralAuditWriteFailures`, `AuditRedactionFailure`.
- **[CLI (#19)](Component-CLI.md)** — `scadabridge audit query`,
`scadabridge audit export`, `scadabridge audit tree --execution-id <guid>`,
`scadabridge audit backfill-source-node --sentinel <s> --before <date>`, and
`scadabridge audit verify-chain` (no-op placeholder for the deferred hash-chain
feature); same permission requirements as the UI.
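As noted in the External System Gateway database bullet above, `DbOutbound.DbWrite` rows cover both reads and writes and are told apart by the keys present in `Extra`. A consumer of exported rows might classify them along these lines; this is a hedged sketch that assumes `Extra` is available as a JSON object string, and `classify_db_row` is a hypothetical helper.

```python
import json


def classify_db_row(extra_json):
    """Classify a DbOutbound.DbWrite row as 'write', 'read', or 'unknown'."""
    extra = json.loads(extra_json) if extra_json else {}
    if "rowsAffected" in extra:
        return "write"
    if "rowsReturned" in extra:
        return "read"
    return "unknown"
```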