docs(audit): add AuditLog table, partitioning, and DB roles to Config DB

Joseph Doherty
2026-05-20 07:58:27 -04:00
parent acb160ecce
commit 36a598840f


@@ -60,6 +60,9 @@ The configuration database stores all central system data, organized by domain a
### Site Calls
- **SiteCalls**: The central audit table for cached site calls — `ExternalSystem.CachedCall()` and `Database.CachedWrite()` — owned by the Site Call Audit component and a sibling of the `Notifications` table. One row per cached operation. Columns: `TrackedOperationId` (GUID, primary key — generated site-side at call time, used as the idempotency key), `SourceSite`, `Kind` (a `TrackedOperationKind` enum stored with values `ExternalCall` / `DatabaseWrite`), `TargetSummary` (external system + method for an `ExternalCall`, database connection name for a `DatabaseWrite`), `Status` (a `TrackedOperationStatus` enum stored with values `Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`), `RetryCount`, `LastError`, `Provenance` (source instance / script), `CreatedAtUtc`, `UpdatedAtUtc`, `TerminalAtUtc`. The table is populated **only** by Site Call Audit telemetry and reconciliation pulls — sites are the source of truth and the row is an eventually-consistent mirror, never written by a central dispatcher. Ingestion is **insert-if-not-exists** keyed on `TrackedOperationId`, then **upsert-on-newer-status**; the lifecycle is monotonic, so at-least-once and out-of-order telemetry are harmless. Indexed on `Status` and `SourceSite` for KPI computation and the Central UI query page. Terminal rows are removed by a daily purge job — see Scheduled Maintenance below. See Component-SiteCallAudit.md for the full lifecycle.
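
The two-step ingest described above (insert-if-not-exists, then upsert-on-newer-status) can be sketched as follows. This is a minimal illustration, not the component's implementation: it uses an in-memory SQLite database in place of SQL Server, keeps only three of the columns, and the `STATUS_RANK` ordering is a hypothetical ranking chosen for the example. The authoritative lifecycle is the one in Component-SiteCallAudit.md.

```python
import sqlite3

# Hypothetical rank order, assumed for illustration only. A status may only
# replace a stored status of strictly lower rank (monotonic lifecycle).
STATUS_RANK = {
    "Pending": 0,
    "Retrying": 1,
    "Parked": 2,
    "Delivered": 3,
    "Failed": 3,
    "Discarded": 3,
}


def open_db():
    """Create an in-memory stand-in for the SiteCalls table (subset of columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE SiteCalls ("
        " TrackedOperationId TEXT PRIMARY KEY,"
        " Status TEXT NOT NULL,"
        " RetryCount INTEGER NOT NULL)"
    )
    return conn


def ingest(conn, op_id, status, retry_count):
    """Apply one telemetry message; returns 'inserted', 'updated', or 'ignored'."""
    # Step 1: insert-if-not-exists keyed on TrackedOperationId.
    cur = conn.execute(
        "INSERT OR IGNORE INTO SiteCalls (TrackedOperationId, Status, RetryCount)"
        " VALUES (?, ?, ?)",
        (op_id, status, retry_count),
    )
    if cur.rowcount == 1:
        return "inserted"

    # Step 2: upsert-on-newer-status. Stale or duplicate messages are dropped.
    (current,) = conn.execute(
        "SELECT Status FROM SiteCalls WHERE TrackedOperationId = ?", (op_id,)
    ).fetchone()
    if STATUS_RANK[status] > STATUS_RANK[current]:
        conn.execute(
            "UPDATE SiteCalls SET Status = ?, RetryCount = ?"
            " WHERE TrackedOperationId = ?",
            (status, retry_count, op_id),
        )
        return "updated"
    return "ignored"
```

Because an older status never overwrites a newer one, replaying the same message or delivering messages out of order leaves the row unchanged, which is why at-least-once delivery is safe here.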
### Audit Log
- **AuditLog**: The central, append-only audit table owned by the Audit Log component, with one row per script-trust-boundary lifecycle event across all channels (outbound API calls, outbound DB writes/reads, notifications, and inbound API requests). It is a sibling of the `Notifications` and `SiteCalls` tables but distinct: `AuditLog` is the immutable history that observes the other subsystems, not an operational state store. Columns: `EventId` (`uniqueidentifier` primary key, generated at the originator and used as the idempotency key), `OccurredAtUtc` (`datetime2`), `IngestedAtUtc` (`datetime2`), `Channel` (`varchar(32)`: `ApiOutbound` / `DbOutbound` / `Notification` / `ApiInbound`), `Kind` (`varchar(32)`: channel-specific event kind), `CorrelationId` (`uniqueidentifier` NULL: `TrackedOperationId` for cached calls, `NotificationId` for notifications, request-id for inbound API), `SourceSiteId` (`varchar(64)` NULL), `SourceInstanceId` (`varchar(128)` NULL), `SourceScript` (`varchar(128)` NULL), `Actor` (`varchar(128)` NULL), `Target` (`varchar(256)` NULL), `Status` (`varchar(32)`: outcome of *this event*, one of `Success`, `TransientFailure`, `PermanentFailure`, `Enqueued`, `Retrying`, `Delivered`, `Parked`, `Discarded`), `HttpStatus` (`int` NULL), `DurationMs` (`int` NULL), `ErrorMessage` (`nvarchar(1024)` NULL), `ErrorDetail` (`nvarchar(max)` NULL), `RequestSummary` (`nvarchar(max)` NULL: truncated request payload, headers redacted), `ResponseSummary` (`nvarchar(max)` NULL: truncated response payload), `PayloadTruncated` (`bit`), `Extra` (`nvarchar(max)` NULL: channel-specific JSON for fields not promoted to columns).
Indexes: `IX_AuditLog_OccurredAtUtc` (primary time-range index for global scans), `IX_AuditLog_Site_Occurred (SourceSiteId, OccurredAtUtc)` (per-site filters), `IX_AuditLog_Correlation (CorrelationId)` (drilldown from a single operation), `IX_AuditLog_Channel_Status_Occurred (Channel, Status, OccurredAtUtc)` (KPI / dashboard tiles), and `IX_AuditLog_Target_Occurred (Target, OccurredAtUtc)` ("what did we send to system X"). The primary key on `EventId` enforces idempotency: central ingest is `INSERT … WHERE NOT EXISTS` keyed on `EventId`, so at-least-once telemetry and reconciliation retries collapse to a single row. The table uses **monthly partitioning** on `OccurredAtUtc` from day one via partition function `pf_AuditLog_Month` and partition scheme `ps_AuditLog_Month`, with a filegroup-per-month rollover, so that retention purge is a partition switch rather than a row-level delete. The partition-maintenance job that rolls the scheme forward and switches expired partitions is owned by the Audit Log component, not this component. The table is populated only by Audit Log writers (site telemetry, central direct-write, reconciliation pulls). See Component-AuditLog.md for the full lifecycle, payload-capture policy, and ingestion paths.
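
The idempotent ingest can be illustrated with a small runnable sketch. It assumes an in-memory SQLite database as a stand-in for SQL Server and a reduced column set. The helper names `open_audit_db` and `ingest_event` are hypothetical, and the statement only shows the shape of an `INSERT … WHERE NOT EXISTS` keyed on `EventId`, not the production query.

```python
import sqlite3


def open_audit_db():
    """Create an in-memory stand-in for AuditLog with a subset of its columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE AuditLog ("
        " EventId TEXT PRIMARY KEY,"
        " OccurredAtUtc TEXT NOT NULL,"
        " Channel TEXT NOT NULL,"
        " Status TEXT NOT NULL)"
    )
    return conn


# The row is inserted only when no row with the same EventId exists yet.
INSERT_IF_ABSENT = (
    "INSERT INTO AuditLog (EventId, OccurredAtUtc, Channel, Status) "
    "SELECT ?, ?, ?, ? "
    "WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = ?)"
)


def ingest_event(conn, event):
    """Insert the event if unseen; return True when a new row was written."""
    cur = conn.execute(
        INSERT_IF_ABSENT,
        (
            event["EventId"],
            event["OccurredAtUtc"],
            event["Channel"],
            event["Status"],
            event["EventId"],
        ),
    )
    return cur.rowcount == 1
```

Delivering the same event twice writes one row and reports the second delivery as a no-op, so the writer needs only `INSERT` and `SELECT` on the table.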
### Inbound API
- **API Keys**: Key definitions (name/label, key value, enabled flag).
- **API Methods**: Method definitions (name, approved key references, parameter definitions, return value definitions, implementation script, timeout).
@@ -226,6 +229,17 @@ Results are returned in reverse chronological order (most recent first) with pag
---
## Database Roles
The configuration database defines dedicated SQL Server roles for the append-only `AuditLog` table so that the application can never accidentally mutate audit history:
- **`scadalink_audit_writer`** — the role used by application code that ingests audit events (the `AuditLogIngestActor`, central direct-write paths, and the Notification Outbox dispatcher). Granted `INSERT` and `SELECT` on `AuditLog` only — explicitly **no** `UPDATE` and **no** `DELETE`. Audit ingest is `INSERT … WHERE NOT EXISTS` keyed on `EventId`, which this grant set fully supports.
- **`scadalink_audit_purger`** — the role used by the `AuditLogPurgeActor`. Granted only the permissions required to execute the monthly partition-switch operation (switch out a partition to a staging table and drop the staging table). Row-level `DELETE` on `AuditLog` is **not** granted even to the purge role; retention is a partition switch, never a row-by-row delete.
A CI grep guard fails the build on any occurrence of `UPDATE … AuditLog` or `DELETE … AuditLog` in the data-access layer source, backstopping the DB-grant enforcement at code-review time. See Component-AuditLog.md (Security & Tamper-Evidence) for the full enforcement contract.
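
As an illustration of what such a guard might check, the sketch below scans source text line by line for an `UPDATE` or `DELETE` keyword followed by the `AuditLog` identifier. The regular expression and the `find_violations` helper are assumptions made for this example; the actual guard's pattern and tooling are not specified here.

```python
import re

# Assumed pattern: an UPDATE or DELETE keyword followed, before any ';',
# by the whole word AuditLog. Word boundaries keep identifiers such as
# AuditLogPurgeActor or UpdatedAtUtc from matching.
FORBIDDEN = re.compile(r"\b(UPDATE|DELETE)\b[^;]*?\bAuditLog\b", re.IGNORECASE)


def find_violations(source_text):
    """Return (line_number, stripped_line) for every line that trips the guard."""
    violations = []
    for line_no, line in enumerate(source_text.splitlines(), start=1):
        if FORBIDDEN.search(line):
            violations.append((line_no, line.strip()))
    return violations
```

A line-based text match like this also flags comments and string literals that mention the forbidden statements, which is acceptable for a backstop: the DB grants remain the actual enforcement.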
---
## Migration Management
### Entity Framework Core Migrations
@@ -233,6 +247,7 @@ Results are returned in reverse chronological order (most recent first) with pag
- Schema changes are managed via EF Core Migrations (`dotnet ef migrations add`, `dotnet ef migrations script`).
- Each migration is a versioned, incremental schema change.
- New tables are introduced as their own migration — for example, the `Notifications` table for the Notification Outbox ships as a dedicated EF Core migration that creates the table, its `Type`/`Status` value conversions, and its dispatcher and KPI indexes.
- The initial `AuditLog` migration creates the monthly partition function `pf_AuditLog_Month` and partition scheme `ps_AuditLog_Month`, then creates the `AuditLog` table aligned to that scheme on `OccurredAtUtc`, along with the indexes listed under Database Schema. The migration also creates the `scadalink_audit_writer` and `scadalink_audit_purger` DB roles with the grants described in Database Roles. The ongoing **partition-maintenance job** that rolls the scheme forward each month (creating the next month's partition ahead of time) and switches out expired partitions is owned by the **Audit Log component** (`AuditLogPurgeActor` and its monthly roll-forward step), not by the Configuration Database component — this component is responsible only for the initial schema, roles, and any EF migrations against the table going forward.
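
To make the monthly partition function concrete, the sketch below computes first-of-month boundary values and renders them into a `CREATE PARTITION FUNCTION` statement. It is an illustration under assumptions: `RANGE RIGHT` (each boundary belongs to the partition on its right), the `datetime2` parameter type, and the helper names are choices made for this example, and the real migration may differ.

```python
from datetime import datetime


def month_boundaries(start_year, start_month, count):
    """Return `count` consecutive first-of-month instants from the given month."""
    boundaries = []
    year, month = start_year, start_month
    for _ in range(count):
        boundaries.append(datetime(year, month, 1))
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return boundaries


def render_partition_function(boundaries):
    """Render T-SQL for pf_AuditLog_Month, assuming RANGE RIGHT on datetime2."""
    values = ", ".join(f"'{b:%Y-%m-%dT%H:%M:%S}'" for b in boundaries)
    return (
        "CREATE PARTITION FUNCTION pf_AuditLog_Month (datetime2) "
        f"AS RANGE RIGHT FOR VALUES ({values});"
    )
```

The same boundary arithmetic applies to the monthly roll-forward step, which adds the next month's boundary ahead of time.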
### Development Environment
- Migrations are **auto-applied** at application startup using `dbContext.Database.MigrateAsync()`.
@@ -282,6 +297,10 @@ The `Notifications` table grows one row per notification and is never trimmed by
The `SiteCalls` table grows one row per cached site call and is never trimmed by normal operation. To bound table growth while preserving a strong audit trail, a **daily purge job** deletes terminal rows (`Delivered`, `Failed`, `Discarded`) older than a configurable retention window (default 365 days). Non-terminal rows (`Pending`, `Retrying`, `Parked`) are never purged. The purge is a bulk `DELETE`; it is owned and scheduled by the Site Call Audit component (see Component-SiteCallAudit.md), which supplies the retention window. The Configuration Database component provides only the repository operation and the table.
### AuditLog Table Purge
The `AuditLog` table is append-only and grows by one row for every script-trust-boundary event across all channels. Unlike `Notifications` and `SiteCalls`, purge is **never a row-level `DELETE`**: it is a **partition switch** of whole monthly partitions against the `ps_AuditLog_Month` scheme. A daily job switches out any partition whose latest `OccurredAtUtc` is older than the global retention window and drops the resulting staging table. The window defaults to 365 days and is configurable via the `AuditLog:RetentionDays` Audit Log option (a single global value in v1, with no per-channel overrides). The job is owned and scheduled by the Audit Log component (`AuditLogPurgeActor`, see Component-AuditLog.md), which is also the consumer of the `AuditLog:RetentionDays` option. The Configuration Database component contributes only the table, the partition function/scheme, the indexes, and the DB roles that constrain the purge to a partition switch.
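
The selection rule for the daily job can be sketched as a pure function. This is a simplified illustration, not the `AuditLogPurgeActor` logic: it assumes each partition is identified by its first-of-month start, and it uses the partition's exclusive upper bound (the start of the next month) as a conservative proxy for the latest `OccurredAtUtc` it can contain. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def next_month(start):
    """Return the first instant of the month following `start`."""
    if start.month == 12:
        return datetime(start.year + 1, 1, 1)
    return datetime(start.year, start.month + 1, 1)


def expired_partitions(partition_starts, now_utc, retention_days=365):
    """Return the monthly partitions that lie entirely outside the retention window.

    A partition qualifies only when its exclusive upper bound is at or before
    the cutoff, so every row it could hold is older than the window.
    """
    cutoff = now_utc - timedelta(days=retention_days)
    return [s for s in sorted(partition_starts) if next_month(s) <= cutoff]
```

With `now_utc` of 2026-05-20 and the default window, the cutoff is 2025-05-20, so the March and April 2025 partitions qualify while May 2025 is kept until all of its rows have aged out.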
---
## Connection Management