# Component: Audit Log

## Purpose

Provides a single, append-only, forensic + operational record of every
integration action initiated by, or terminating in, a script — across outbound
API, outbound DB, notifications, and inbound API. It stores one row per
lifecycle event with rich payloads and long retention, and backs dashboards,
drilldowns, and filter queries that answer both forensic questions ("did
instance X send notification Y on date Z, with what body?") and operational
ones ("which inbound caller is hammering us right now?").

The Audit Log is **not a dispatcher**. It does not drive delivery, retry loops,
or operator Retry/Discard actions — those remain in [Notification Outbox](Component-NotificationOutbox.md)
and [Site Call Audit](Component-SiteCallAudit.md). The Audit Log is the
immutable history that **observes** those subsystems and adds coverage where
they are silent (sync `ExternalSystem.Call`, sync DB writes and reads, inbound
API requests).

## Location

Central cluster and site clusters.

- **Central:** the `AuditLog` table in central MS SQL, plus three singletons on
  the active central node — `AuditLogIngestActor` (telemetry receiver),
  `SiteAuditReconciliationActor`, and `AuditLogPurgeActor`.
- **Sites:** a site-local `AuditLog` SQLite database file alongside the
  Store-and-Forward buffer, plus a `SiteAuditTelemetryActor` singleton on the
  active site node.

Registered as component #23 in the Host role configuration.

## Responsibilities

- Accept site-local hot-path audit writes from script-trust-boundary call paths.
- Forward site audit rows to central via gRPC telemetry with at-least-once
  delivery and idempotency on `EventId`.
- Run periodic per-site reconciliation pulls so missed telemetry self-heals.
- Accept central-originated audit writes (Inbound API, Notification dispatch
  attempts and terminal status).
- Compute point-in-time KPIs (global and per-site) from the central `AuditLog`
  table.
- Purge expired rows by monthly partition switch — no row-level deletes.

## Scope — the script trust boundary

The Audit Log captures every action a script causes to cross the cluster trust
boundary:

| Channel | Trigger | Direction | Covered today? |
|---|---|---|---|
| `ExternalSystem.Call(...)` | Script | Outbound | No (gap) |
| `ExternalSystem.CachedCall(...)` | Script | Outbound | Yes — `SiteCalls` (Site Call Audit) |
| `Database.Connection().Execute*(...)` — writes | Script | Outbound | No (gap) |
| `Database.CachedWrite(...)` | Script | Outbound | Yes — `SiteCalls` (Site Call Audit) |
| `Notify.To(list).Send(...)` | Script | Outbound | Yes — `Notifications` (Notification Outbox) |
| `POST /api/{method}` (Inbound API) | External | Inbound (invokes a script) | No (gap) |

Out of scope — framework traffic is not audited:

- Health checks, heartbeats, cluster membership messages.
- gRPC inter-cluster real-time streams (attribute values, alarm states).
- Data Connection Layer ↔ OPC UA / custom protocol traffic.
- LDAP authentication probes, Traefik routing decisions.
- Internal Configuration Database queries by the framework.
- Site Event Log writes; audit log writes themselves.

Script-initiated DB **reads** via `Database.Connection().ExecuteReader(...)`
count as actions from a script and are in scope. Reads via DCL / subscriptions
are framework traffic and excluded.

## The `AuditLog` Table (central)

Single wide table in central MS SQL, polymorphic by `Channel` + `Kind`
discriminators, with a JSON `Extra` column for channel-specific overflow. One
row per lifecycle event across all channels.

| Column | Type | Notes |
|---|---|---|
| `EventId` | `uniqueidentifier` PK | Generated where the event originates (site or central). Idempotency key. |
| `OccurredAtUtc` | `datetime2` | When the event happened (call returned, retry attempted, etc.). |
| `IngestedAtUtc` | `datetime2` | When central persisted the row (lags `OccurredAtUtc` for site-originated rows). |
| `Channel` | `varchar(32)` | `ApiOutbound` \| `DbOutbound` \| `Notification` \| `ApiInbound`. |
| `Kind` | `varchar(32)` | Event kind discriminator (see kinds list below). |
| `CorrelationId` | `uniqueidentifier` NULL | Ties multi-event operations together. `TrackedOperationId` for cached calls, `NotificationId` for notifications, request-id for inbound API. NULL for sync one-shot calls. |
| `ExecutionId` | `uniqueidentifier` NULL | The originating script execution / inbound request — the universal per-run correlation value; distinct from `CorrelationId`, which is the per-operation lifecycle id. Stamped on *every* audit row emitted by one execution. |
| `ParentExecutionId` | `uniqueidentifier` NULL | The `ExecutionId` of the execution that *spawned* this run — the cross-execution correlation pointer. Set on every row of an inbound-API-routed site script run (= the inbound request's `ExecutionId`); NULL for a top-level run (inbound, tag-change / timer-triggered, un-bridged). |
| `SourceSiteId` | `varchar(64)` NULL | NULL for central-originated events. |
| `SourceNode` | `varchar(64)` NULL | The cluster node on which the event was emitted — `node-a` / `node-b` for site rows (qualified by `SourceSiteId`), `central-a` / `central-b` for central-originated rows. Nullable so reconciled rows from a node that has since been retired don't block ingest. |
| `SourceInstanceId` | `varchar(128)` NULL | Instance whose script initiated the action (when applicable). |
| `SourceScript` | `varchar(128)` NULL | Script name within the instance. |
| `Actor` | `varchar(128)` NULL | Inbound API: API key name. Outbound: script identity. Central: system user. |
| `Target` | `varchar(256)` NULL | Outbound API: external system + method. DB: connection name. Notification: list name. Inbound API: method name. |
| `Status` | `varchar(32)` | Outcome of *this event* — `Submitted`, `Forwarded`, `Attempted`, `Delivered`, `Failed`, `Parked`, `Discarded`, `Skipped`. |
| `HttpStatus` | `int` NULL | HTTP-bearing events only. |
| `DurationMs` | `int` NULL | Call / attempt duration. |
| `ErrorMessage` | `nvarchar(1024)` NULL | Truncated; `ErrorDetail` for full text. |
| `ErrorDetail` | `nvarchar(max)` NULL | Optional full exception text on failures. |
| `RequestSummary` | `nvarchar(max)` NULL | Truncated request payload (configurable cap). Headers redacted. For `Channel = ApiInbound`, captured in full up to `AuditLog:InboundMaxBytes` (default 1 MiB) — see Payload Capture Policy. |
| `ResponseSummary` | `nvarchar(max)` NULL | Truncated response payload. For `Channel = ApiInbound`, captured in full up to `AuditLog:InboundMaxBytes` (default 1 MiB). For other channels, capped at `DefaultCapBytes` by default and `ErrorCapBytes` on error rows. |
| `PayloadTruncated` | `bit` | Set if either summary was truncated. |
| `Extra` | `nvarchar(max)` NULL | Channel-specific JSON for fields we don't promote to columns. |

**Indexes (first cut):**

- `IX_AuditLog_OccurredAtUtc` — primary time-range index for global scans.
- `IX_AuditLog_Site_Occurred (SourceSiteId, OccurredAtUtc)` — per-site filters.
- `IX_AuditLog_Node_Occurred (SourceNode, OccurredAtUtc)` — per-node filters ("everything `central-a` did in window X", or pinning a misbehaving site node).
- `IX_AuditLog_CorrelationId (CorrelationId)` — drilldown from a single operation.
- `IX_AuditLog_Execution (ExecutionId)` — drilldown to every action of one script execution / inbound request.
- `IX_AuditLog_ParentExecution (ParentExecutionId)` — cross-execution drilldown: the downward leg of the execution-tree walk seeks on it (`ParentExecutionId = ancestor.ExecutionId`), and it backs the `parentExecutionId` filter.
- `IX_AuditLog_Channel_Status_Occurred (Channel, Status, OccurredAtUtc)` — KPI / dashboard tiles.
- `IX_AuditLog_Target_Occurred (Target, OccurredAtUtc)` — "what did we send to system X".
- Monthly partitioning on `OccurredAtUtc` from day one; purge is a partition switch (see Retention & Purge).

**`Kind` values (flat — 10 discriminators across all channels):**

| Kind | Fires when |
|---|---|
| `ApiCall` | Sync `ExternalSystem.Call(...)` returns (success or permanent failure). One row per call. |
| `ApiCallCached` | A cached outbound-API attempt records its forward-ack (`Forwarded`) or each retry (`Attempted`). |
| `DbWrite` | Sync `Database.Connection().Execute*(...)` / `ExecuteReader(...)` completes. One row per call. |
| `DbWriteCached` | A cached outbound-DB attempt records its forward-ack (`Forwarded`) or each retry (`Attempted`). |
| `NotifySend` | Script's `Notify.To(list).Send(...)` is enqueued on the site — first row in a notification's lifecycle (`Status=Submitted`). |
| `NotifyDeliver` | Central Notification Outbox dispatcher records a delivery attempt (`Attempted`) or terminal outcome (`Delivered`/`Parked`/`Discarded`). |
| `InboundRequest` | An inbound API request completes — one row per request, written at request end with final status. |
| `InboundAuthFailure` | An inbound API request was rejected at the auth boundary (bad/missing key). One row, `Status=Failed`, `HttpStatus=401`. |
| `CachedSubmit` | Script-side enqueue of a cached call (`ExternalSystem.CachedCall` / `Database.CachedWrite`); first row in the cached-call lifecycle, written to site SQLite before any forward attempt. |
| `CachedResolve` | Terminal row for a cached operation — `Status` = `Delivered` / `Failed` / `Parked` / `Discarded`. |

Inbound API is intentionally collapsed to a single `InboundRequest` (or
`InboundAuthFailure` for auth rejections) row per request rather than a
multi-event lifecycle.

### `ExecutionId` vs `CorrelationId`

The table carries two correlation columns at different granularities:

- **`ExecutionId`** is the *universal per-run* value: one id per script
  execution (tag-change / timer-triggered or otherwise) or per inbound API
  request. It is stamped on **every** audit row that run produces — the sync
  `ApiCall` and `DbWrite` rows, the full cached-call lifecycle, the
  `NotifySend` / `NotifyDeliver` rows, and the inbound row alike. A run that
  performs no trust-boundary action emits no rows, but any run that emits
  multiple rows ties them all together under one `ExecutionId`. This lets an
  audit reader pull the complete trust-boundary footprint of a single script
  run with one `ExecutionId` filter.
- **`CorrelationId`** is the *per-operation lifecycle* id — it groups the
  multiple events of one long-running operation (`TrackedOperationId` for a
  cached call, `NotificationId` for a notification, request-id for inbound
  API) and is NULL for sync one-shot calls that have no operation lifecycle.

The two are orthogonal: one execution may touch several operations (each with
its own `CorrelationId`) yet every resulting row shares the one `ExecutionId`.

**`ParentExecutionId`** adds *cross-execution* correlation on top. `ExecutionId`
is per-run and flat — `WHERE ExecutionId = X` returns everything one run did, but
nothing links a run to the run that *spawned* it. `ParentExecutionId` carries the
spawning execution's `ExecutionId`: a spawned run still gets its own fresh
`ExecutionId`, and every audit row it emits also carries the spawner's id in
`ParentExecutionId`. The first cut bridges the **inbound API → routed-site-script**
case: an inbound request runs a method script that calls `Route.Call`, routing to
a site instance; the routed site script records the inbound request's
`ExecutionId` as its `ParentExecutionId`, while the inbound `InboundRequest` row
itself is top-level (`ParentExecutionId` NULL). The pointer always references the
*immediate* spawner, so a routed run that itself routes onward threads its own
`ExecutionId` — walking `ParentExecutionId → ExecutionId` recursively
reconstructs the call chain as a tree of arbitrary depth. The tag-cascade case
(an attribute write triggering another script) is **deferred** — the model
generalises to it with no schema change once that spawn point is threaded.

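The downward leg of the execution-tree walk can be illustrated with a recursive query. The sketch below is only an illustration: it uses Python's built-in `sqlite3` as a stand-in for central MS SQL (T-SQL spells recursive CTEs without the `RECURSIVE` keyword and requires `UNION ALL`), a cut-down table with four columns, and made-up execution ids. The `descendants` helper is hypothetical and not part of the component.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down stand-in for the central table; only the columns the walk needs.
conn.execute("""
    CREATE TABLE AuditLog (
        EventId TEXT PRIMARY KEY,
        ExecutionId TEXT,
        ParentExecutionId TEXT,
        Kind TEXT
    )
""")
conn.executemany(
    "INSERT INTO AuditLog VALUES (?, ?, ?, ?)",
    [
        ("e1", "exec-inbound", None, "InboundRequest"),   # top-level run
        ("e2", "exec-site-1", "exec-inbound", "ApiCall"),  # routed site script
        ("e3", "exec-site-1", "exec-inbound", "DbWrite"),
        ("e4", "exec-site-2", "exec-site-1", "ApiCall"),   # routed onward
        ("e5", "exec-other", None, "ApiCall"),             # unrelated run
    ],
)

def descendants(conn, root_execution_id):
    """Return (ExecutionId, depth) for the root and every run it spawned."""
    return conn.execute(
        """
        WITH RECURSIVE tree(ExecutionId, depth) AS (
            SELECT ?, 0
            UNION  -- UNION (not UNION ALL) collapses the many rows of one run
            SELECT a.ExecutionId, t.depth + 1
            FROM AuditLog a
            JOIN tree t ON a.ParentExecutionId = t.ExecutionId
        )
        SELECT ExecutionId, depth FROM tree ORDER BY depth, ExecutionId
        """,
        (root_execution_id,),
    ).fetchall()

tree = descendants(conn, "exec-inbound")
for execution_id, depth in tree:
    print("  " * depth + execution_id)
```

Each recursive step is the seek `ParentExecutionId = ancestor.ExecutionId`, which is the access pattern `IX_AuditLog_ParentExecution` exists to serve.
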
## The Site-Local `AuditLog` (SQLite)

A SQLite database file on each site node, alongside the Store-and-Forward
buffer. Same schema as central minus `IngestedAtUtc` (irrelevant at the source),
plus a `ForwardState` column with values `Pending | Forwarded | Reconciled` that
drives the telemetry loop and reconciliation pull. `SourceNode` is stamped by the
writing node itself (`node-a` / `node-b`) at append time and travels with the row
through telemetry and reconciliation unchanged.

**Site SQLite retention rule (hard invariant):**

> A row is eligible for purge only when both `OccurredAtUtc < retention threshold` AND `ForwardState IN ('Forwarded', 'Reconciled')`. Pending rows are never purged.

A prolonged central outage will grow the site audit table without bound until
central is reachable again. This is intentional — losing audit rows to make
room is a compliance violation, not a self-healing behavior. To keep that
growth visible in practice, the site emits a `SiteAuditBacklog` health metric (pending
row count, oldest pending age, bytes on disk); crossing operator-configured
thresholds surfaces a warning on the relevant site tile in the Health
dashboard, mirroring the Store-and-Forward Engine's backlog metric.

Central is the durable home. Site SQLite is a write-buffer with a forwarding
guarantee.

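The retention invariant maps directly onto a single `DELETE` predicate. The following is a minimal sketch, assuming ISO-8601 UTC strings for `OccurredAtUtc` (so string comparison matches time order) and a three-column table; the `purge_site_rows` function and the sample rows are illustrative, not the component's actual purge job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE AuditLog (
        EventId TEXT PRIMARY KEY,
        OccurredAtUtc TEXT NOT NULL,   -- ISO-8601 UTC, sorts lexicographically
        ForwardState TEXT NOT NULL     -- Pending | Forwarded | Reconciled
    )
""")
conn.executemany(
    "INSERT INTO AuditLog VALUES (?, ?, ?)",
    [
        ("old-forwarded", "2026-01-01T00:00:00Z", "Forwarded"),
        ("old-reconciled", "2026-01-02T00:00:00Z", "Reconciled"),
        ("old-pending", "2026-01-03T00:00:00Z", "Pending"),
        ("new-forwarded", "2026-05-20T00:00:00Z", "Forwarded"),
    ],
)

def purge_site_rows(conn, threshold_utc):
    """Delete rows that are both past retention AND already at central."""
    cur = conn.execute(
        "DELETE FROM AuditLog "
        "WHERE OccurredAtUtc < ? "
        "AND ForwardState IN ('Forwarded', 'Reconciled')",
        (threshold_utc,),
    )
    conn.commit()
    return cur.rowcount

deleted = purge_site_rows(conn, "2026-05-16T00:00:00Z")
remaining = sorted(r[0] for r in conn.execute("SELECT EventId FROM AuditLog"))
print(deleted, remaining)
```

The old `Pending` row survives even though it is past the threshold, which is the behavior the invariant requires.
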
## Ingestion Paths

Four paths feed the central `AuditLog` — one site originator and three central
writers — all idempotent on `EventId`.

### Site hot-path append (site-originated events)

The component completing a script-trust-boundary action (External System
Gateway, Database layer, Store-and-Forward Engine) builds an `AuditEvent` with a
fresh `EventId` (Guid v4) and `OccurredAtUtc = UtcNow`, then appends it to the
site-local `AuditLog` SQLite via `IAuditWriter` with
`ForwardState = 'Pending'`. The append is a single-statement INSERT and is
durable in microseconds; control returns to the script with no central
round-trip on the hot path.

### Telemetry forward (site → central)

A `SiteAuditTelemetryActor` singleton drives the forwarding loop: select up to
N `Pending` rows ordered by `OccurredAtUtc`, batch-send them to central via the
existing `SiteStream` gRPC channel as `IngestAuditEvents(events)`, and on
central-ack flip `ForwardState = 'Forwarded'` for accepted IDs. Rejected IDs
stay `Pending` for the next sweep. Cadence is short (default 5 s) when
non-empty, longer (default 30 s) when idle; telemetry runs on a dedicated
dispatcher.

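One sweep of the forwarding loop can be sketched as below. This is an illustration under assumptions: the gRPC `IngestAuditEvents` call is replaced by a plain callable (`send_batch`) that returns the accepted ids, the table is cut down to three columns, and `forward_pending` and `fake_central` are hypothetical names.

```python
import sqlite3

def forward_pending(conn, send_batch, batch_size=100):
    """One sweep: send the oldest Pending rows, mark accepted ones Forwarded."""
    rows = conn.execute(
        "SELECT EventId FROM AuditLog WHERE ForwardState = 'Pending' "
        "ORDER BY OccurredAtUtc LIMIT ?",
        (batch_size,),
    ).fetchall()
    if not rows:
        return 0  # idle: the caller would back off to the longer cadence
    accepted = list(send_batch([r[0] for r in rows]))
    # Only acknowledged ids are flipped; rejected ids stay Pending.
    conn.executemany(
        "UPDATE AuditLog SET ForwardState = 'Forwarded' "
        "WHERE EventId = ? AND ForwardState = 'Pending'",
        [(event_id,) for event_id in accepted],
    )
    conn.commit()
    return len(accepted)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AuditLog ("
    "EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT, ForwardState TEXT)"
)
conn.executemany(
    "INSERT INTO AuditLog VALUES (?, ?, 'Pending')",
    [
        ("a", "2026-05-23T10:00:00Z"),
        ("b", "2026-05-23T10:00:01Z"),
        ("c", "2026-05-23T10:00:02Z"),
    ],
)

def fake_central(event_ids):
    # Stand-in for the central ack: rejects "b", accepts everything else.
    return [e for e in event_ids if e != "b"]

forwarded = forward_pending(conn, fake_central)
states = dict(conn.execute("SELECT EventId, ForwardState FROM AuditLog"))
print(forwarded, states)
```

Because central ingest is idempotent on `EventId`, resending a row whose ack was lost is safe; the sweep never needs to know whether a `Pending` row was already delivered.
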
### Reconciliation pull (self-healing for missed telemetry)

A central `SiteAuditReconciliationActor` periodically (default 5 min per site)
asks each site for its oldest `Pending` row and pending count; if backlog is
non-draining (e.g., telemetry actor wedged), central issues a
`PullAuditEvents(sinceUtc, batchSize)` and inserts-if-not-exists. Accepted rows
are flipped to `ForwardState = 'Reconciled'` site-side. Same self-healing
pattern as Site Call Audit's reconciliation of `SiteCalls`.

### Central direct-write (central-originated events)

Events originating at central never touch site SQLite. Inbound API writes one
`ApiInbound.InboundRequest` row via `ICentralAuditWriter` synchronously inside
the request-handler middleware, before the HTTP response is flushed; auth-layer
rejections emit `ApiInbound.InboundAuthFailure` (`Status=Failed`, HTTP 401)
instead. The Notification Outbox dispatcher writes
`Notification.NotifyDeliver` with `Status=Attempted` per delivery attempt and
`Notification.NotifyDeliver` with `Status=Delivered`/`Parked`/`Discarded` on
terminal status. Central direct-writes use the same insert-if-not-exists
semantics keyed on `EventId`. `SourceSiteId` is NULL on all central direct-write
rows; `SourceNode` is stamped to the local central node's role name
(`central-a` / `central-b`).

## Cached Operations — Combined Telemetry

For `ExternalSystem.CachedCall` and `Database.CachedWrite`, the **site** is the
source of truth for every audit row. The site writes each lifecycle event —
`CachedSubmit` (`Status=Submitted`), then `ApiCallCached`/`DbWriteCached` rows
for the forward-ack (`Status=Forwarded`) and each retry (`Status=Attempted`),
then a terminal `CachedResolve` row
(`Status=Delivered`/`Failed`/`Parked`/`Discarded`) — to its local SQLite
`AuditLog` on the hot path (or on the retry tick for `Attempted` rows), then
forwards via the same telemetry channel. The telemetry message format gains the
audit-row fields additively — one packet per lifecycle transition carries both
the operational state update AND the audit row content.

On receipt, central performs both writes in one transaction:

1. Insert-if-not-exists the immutable `AuditLog` row, keyed on `EventId`.
2. Upsert the operational `SiteCalls` row — existing Site Call Audit behavior
   (status, retry count, last error, timestamps).

This collapses two telemetry concerns into one, keeps site SQLite as the
single local source of truth for audit content, and preserves the existing
operational `SiteCalls` shape for the dispatcher and UI.

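The two-write receipt step can be sketched as a single transaction. This is an assumption-laden illustration: SQLite stands in for central MS SQL, both tables are reduced to a few columns, the packet is a plain dict whose field names are invented for the example, and `ingest_packet` is a hypothetical helper. It relies on SQLite's `ON CONFLICT ... DO UPDATE` upsert (SQLite 3.24 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE AuditLog (
        EventId TEXT PRIMARY KEY, CorrelationId TEXT, Kind TEXT, Status TEXT);
    CREATE TABLE SiteCalls (
        TrackedOperationId TEXT PRIMARY KEY, Status TEXT, RetryCount INTEGER);
""")

def ingest_packet(conn, packet):
    """Apply one combined packet; returns True if the audit row was new."""
    with conn:  # one transaction: commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO AuditLog (EventId, CorrelationId, Kind, Status) "
            "SELECT ?, ?, ?, ? "
            "WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = ?)",
            (packet["EventId"], packet["TrackedOperationId"],
             packet["Kind"], packet["Status"], packet["EventId"]),
        )
        inserted = cur.rowcount == 1
        conn.execute(
            "INSERT INTO SiteCalls (TrackedOperationId, Status, RetryCount) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT(TrackedOperationId) DO UPDATE SET "
            "Status = excluded.Status, RetryCount = excluded.RetryCount",
            (packet["TrackedOperationId"], packet["Status"],
             packet["RetryCount"]),
        )
    return inserted

submit = {"EventId": "ev-1", "TrackedOperationId": "op-1",
          "Kind": "CachedSubmit", "Status": "Submitted", "RetryCount": 0}
attempt = {"EventId": "ev-2", "TrackedOperationId": "op-1",
           "Kind": "ApiCallCached", "Status": "Attempted", "RetryCount": 1}

first = ingest_packet(conn, submit)
duplicate = ingest_packet(conn, submit)   # at-least-once redelivery
second = ingest_packet(conn, attempt)

audit_count = conn.execute("SELECT COUNT(*) FROM AuditLog").fetchone()[0]
site_call = conn.execute(
    "SELECT Status, RetryCount FROM SiteCalls WHERE TrackedOperationId = 'op-1'"
).fetchone()
print(first, duplicate, second, audit_count, site_call)
```

The redelivered packet leaves `AuditLog` untouched while the `SiteCalls` row keeps a single operational record per tracked operation.
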
## Payload Capture Policy

- **Default cap** — 8 KiB for each of `RequestSummary` and `ResponseSummary`;
  raised to 64 KiB on any error row (`Status IN ('Failed', 'Parked', 'Discarded')`).
- **Inbound API exception.** For `Channel = ApiInbound`, `RequestSummary` and `ResponseSummary` are captured in full up to a per-body hard ceiling of 1 MiB (configurable via `AuditLog:InboundMaxBytes`; default 1 048 576 bytes; min 8 192; max 16 777 216). The 8 KiB / 64 KiB default/error caps that apply to other channels do not apply here. `PayloadTruncated = 1` is set only when the inbound ceiling is hit — verbatim capture is the normal case. The ceiling applies independently to each body. Header redaction and per-target body redactors still run before persistence.
- **Truncation** — UTF-8 byte-safe; `PayloadTruncated = 1` when applied. A
  body larger than its applicable cap is never stored in full.
- **HTTP headers** — `Authorization`, `Cookie`, `Set-Cookie`, `X-API-Key`, and
  any header matching the configured redact-list regex become `<redacted>`.
- **HTTP bodies** — captured verbatim by default. Operators register per-target
  body redactors (regex → replacement) for known secret fields.
- **SQL** — statement text and parameter values captured verbatim by default;
  per-connection opt-in to redact parameters whose name matches a regex.
- **Never captured** — raw API key material (only the key *name* via `Actor`),
  LDAP bind credentials, cluster secrets, Configuration DB connection strings.
- **Safety net** — if a configured redactor throws, the affected payload becomes
  `"<redacted: redactor error>"` and `AuditRedactionFailure` increments. We
  over-redact, never under-redact, on configuration faults.

Redaction happens at the write site, before the row touches SQLite (or central
MS SQL for direct-write events). Unredacted secrets never persist.

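A few of these rules are easy to get subtly wrong, so here is a hedged sketch of three of them: byte-safe truncation, header redaction, and the redactor safety net. All function names are hypothetical, the case-insensitive header match is an assumption (HTTP header names are case-insensitive, but the document does not say how the component compares them), and the metric increment is only indicated by a comment.

```python
import re

DEFAULT_CAP_BYTES = 8192
ERROR_CAP_BYTES = 65536
ERROR_STATUSES = {"Failed", "Parked", "Discarded"}
REDACT_HEADERS = {"authorization", "cookie", "set-cookie", "x-api-key"}

def cap_for(status):
    """Pick the non-inbound cap: larger on error rows."""
    return ERROR_CAP_BYTES if status in ERROR_STATUSES else DEFAULT_CAP_BYTES

def truncate_utf8(text, cap_bytes):
    """Cut to at most cap_bytes of UTF-8 without splitting a character."""
    data = text.encode("utf-8")
    if len(data) <= cap_bytes:
        return text, False
    # The input was valid UTF-8, so the only invalid bytes after slicing are
    # a partial trailing sequence; errors="ignore" drops exactly that.
    return data[:cap_bytes].decode("utf-8", errors="ignore"), True

def redact_headers(headers):
    """Replace the values of sensitive headers with <redacted>."""
    return {
        name: "<redacted>" if name.lower() in REDACT_HEADERS else value
        for name, value in headers.items()
    }

def redact_body(body, redactors):
    """Apply (pattern, replacement) pairs; over-redact if any of them fails."""
    try:
        for pattern, replacement in redactors:
            body = re.sub(pattern, replacement, body)
        return body
    except Exception:
        # The real component would also increment AuditRedactionFailure here.
        return "<redacted: redactor error>"

summary, truncated = truncate_utf8("héllo", 2)   # "é" is two bytes in UTF-8
headers = redact_headers({"Authorization": "Bearer abc", "Accept": "*/*"})
clean = redact_body('{"password": "hunter2"}',
                    [(r'"password"\s*:\s*"[^"]+"', '"password":"<redacted>"')])
broken = redact_body("anything", [("(", "")])    # invalid regex -> safety net
print(summary, truncated, headers, clean, broken)
```
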
## Failure Handling & Idempotency

- **`EventId` is the dedup key.** Generated at the originator; central ingest
  is `INSERT … WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = @id)`
  under the PK constraint. Idempotent across telemetry retries, reconciliation
  pulls, and any combination of the two.
- **Never fail the action.** A failed audit write — site SQLite or central
  direct-write — logs a critical Site Event Log entry and increments a health
  metric (`SiteAuditWriteFailures` or `CentralAuditWriteFailures`), but the
  user-facing action proceeds. We do not fail script-initiated work because the
  audit write failed.
- **Hot-path ring buffer.** While the site audit writer is unhealthy
  (disk full, schema lock, transient IO), events buffer in a small in-memory
  ring (default 1024 rows); oldest are discarded with a Site Event Log warning
  per drop.
- **Reconciliation as fallback.** If two consecutive reconciliation cycles
  report a non-draining backlog, the supervisor restarts the telemetry actor
  and a `SiteAuditTelemetryStalled` event fires.
- **No dedup horizon.** `EventId` PK enforces uniqueness only while a row
  exists. A retry that arrives after the original row is purged inserts a "new"
  row — vanishingly rare and harmless.

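The hot-path ring buffer behavior (bounded, oldest dropped first, one warning per drop) can be sketched with a `collections.deque`. The class name and the `dropped` list, which stands in for Site Event Log warnings, are assumptions made for the illustration.

```python
from collections import deque

class AuditRingBuffer:
    """Bounded in-memory buffer used while the site audit writer is unhealthy."""

    def __init__(self, capacity=1024):
        self._ring = deque(maxlen=capacity)
        self.dropped = []  # stand-in for one Site Event Log warning per drop

    def append(self, event_id):
        # deque(maxlen=...) evicts the oldest item silently, so record the
        # victim first in order to emit the per-drop warning.
        if len(self._ring) == self._ring.maxlen:
            self.dropped.append(self._ring[0])
        self._ring.append(event_id)

    def drain(self):
        """Hand buffered events back, oldest first, once the writer recovers."""
        items = list(self._ring)
        self._ring.clear()
        return items

buffer = AuditRingBuffer(capacity=3)
for event_id in [1, 2, 3, 4, 5]:
    buffer.append(event_id)

drained = buffer.drain()
print(buffer.dropped, drained)
```
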
## Retention & Purge

- **Central:** 365-day default based on `OccurredAtUtc`, configurable via
  `AuditLog:RetentionDays` (min 7, max 3650). Single global retention in v1 —
  no per-channel overrides.
- **Partitioning:** monthly partitions on `OccurredAtUtc` from day one
  (`pf_AuditLog_Month` / `ps_AuditLog_Month`). Purge is a partition switch;
  there are no row-level deletes at central.
- **Purge actor:** `AuditLogPurgeActor` singleton on the active central node
  runs daily, switches out any partition whose latest `OccurredAtUtc` is older
  than the retention window, and emits an `AuditLog:Purged` event (partition
  range, rowcount, duration). A partition-maintenance step rolls forward each
  month, creating the next month's partition ahead of time.
- **Sites:** daily site job; default 7-day retention (configurable, min 1,
  max 90). Respects the hard `ForwardState` invariant — `Pending` rows are
  never purged on age alone.

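To make the partition-eligibility rule concrete, the sketch below decides which monthly partitions a daily purge run could switch out. It is a simplification under a stated assumption: instead of reading each partition's latest `OccurredAtUtc`, it uses the partition's exclusive upper bound (the first instant of the next month) as a conservative proxy, so a partition is only selected when every row it could contain is past retention. `purgeable_months` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

def purgeable_months(partition_months, now_utc, retention_days=365):
    """Return the (year, month) partitions that lie wholly outside retention."""
    cutoff = now_utc - timedelta(days=retention_days)
    result = []
    for year, month in partition_months:
        # Exclusive upper bound of the monthly partition.
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        upper = datetime(next_year, next_month, 1, tzinfo=timezone.utc)
        if upper <= cutoff:
            result.append((year, month))
    return result

now = datetime(2026, 5, 23, tzinfo=timezone.utc)
months = [(2025, 3), (2025, 4), (2025, 5), (2025, 12)]
eligible = purgeable_months(months, now)
print(eligible)
```

With a 365-day window on 2026-05-23 the cutoff is 2025-05-23, so May 2025 stays: it may still hold rows from the last week of that month.
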
## Security & Tamper-Evidence

- **Append-only enforcement.** The application accesses `AuditLog` via a
  dedicated DB role `scadalink_audit_writer` granted `INSERT` + `SELECT` only —
  no `UPDATE`, no `DELETE`. Purge runs under a separate role
  `scadalink_audit_purger` whose permissions are limited to the partition-switch
  operation; row-level `DELETE` is not granted even to purge.
- **CI grep guard.** The build greps the data layer for any
  `UPDATE … AuditLog` or `DELETE … AuditLog` text and fails on a hit.
- **Authorization.** Reading the Audit Log requires the existing **Audit** role
  extended with a new **OperationalAudit** permission. Per-site row scoping
  reuses the existing site-permission model; bulk export requires an additional
  **AuditExport** permission.
- **Payload redaction at write.** See Payload Capture Policy. Unredacted
  secrets never persist; the safety net over-redacts on misconfiguration.
- **Hash-chain tamper evidence — deferred to v1.x.** A future `RowHash` column,
  computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will be
  verifiable offline via `scadalink audit verify-chain --month YYYY-MM`. Off by
  default in v1.
- **Site SQLite security.** File permissions: read/write by the ScadaLink
  service account only. Not backed up off-machine — site SQLite is a buffer,
  not a record.

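Since the hash chain is deferred, nothing here is the component's actual design beyond the formula `SHA-256(prev.RowHash || canonical(row))`. The sketch shows how such a chain detects tampering, with two loudly labeled assumptions: `canonical(row)` is taken to be compact JSON with sorted keys, and the first row of a partition chains from 32 zero bytes. Both are placeholders the real specification would need to pin down.

```python
import hashlib
import json

GENESIS = b"\x00" * 32  # assumed seed for the first row of a partition

def canonical(row):
    """Assumed canonical form: compact JSON with sorted keys, UTF-8 encoded."""
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")

def chain(rows):
    """Compute RowHash for each row as SHA-256(prev.RowHash || canonical(row))."""
    prev = GENESIS
    hashes = []
    for row in rows:
        prev = hashlib.sha256(prev + canonical(row)).digest()
        hashes.append(prev)
    return hashes

def verify(rows, stored_hashes):
    """Recompute the chain and compare against the stored RowHash values."""
    return chain(rows) == stored_hashes

rows = [
    {"EventId": "ev-1", "Status": "Submitted"},
    {"EventId": "ev-2", "Status": "Attempted"},
    {"EventId": "ev-3", "Status": "Delivered"},
]
stored = chain(rows)
intact = verify(rows, stored)

tampered_rows = [dict(r) for r in rows]
tampered_rows[1]["Status"] = "Delivered"   # edit one historical row
tampered = verify(tampered_rows, stored)
print(intact, tampered)
```

Editing any row changes its hash and every hash after it, so an offline verifier only needs the rows in partition order and the stored `RowHash` values.
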
## KPIs

Point-in-time, computed from the central `AuditLog` table; global and per-site.

- **Audit volume** — events/min landing in the central `AuditLog`; global plus per-site sparkline.
- **Audit error rate** — % of central `AuditLog` rows with `Status IN ('Failed', 'Parked', 'Discarded')` over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, permanent failures, parked deliveries) — NOT audit-writer health, which surfaces separately via `CentralAuditWriteFailures` and `AuditRedactionFailure`.
- **Audit backlog** — sum of `Pending` site rows across sites; click drills into a per-site breakdown.

[Notification Outbox](Component-NotificationOutbox.md) and
[Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected — they remain
sourced from `Notifications` and `SiteCalls` respectively. Audit Log KPIs
describe the audit table itself.

## Configuration

Bound from `appsettings.json` to a new `AuditLogOptions` class owned by this
component (Options pattern):

```jsonc
"AuditLog": {
  "DefaultCapBytes": 8192,
  "ErrorCapBytes": 65536,
  "HeaderRedactList": [ "Authorization", "Cookie", "Set-Cookie", "X-API-Key" ],
  "GlobalBodyRedactors": [
    { "Pattern": "\"password\"\\s*:\\s*\"[^\"]+\"", "Replacement": "\"password\":\"<redacted>\"" }
  ],
  "PerTargetOverrides": {
    "Weather/GetForecast": { "CapBytes": 4096 },
    "PlantDB": { "RedactSqlParamsMatching": "@apikey|@token" }
  },
  "RetentionDays": 365
}
```

`PerTargetOverrides` keys bind by External System / Inbound Method /
Notification List / Database Connection name. `RetentionDays` is a single
global value in v1; per-channel overrides are deferred to v1.x.

## Dependencies

- **[Commons (#16)](Component-Commons.md)** — `AuditEvent`, `IAuditWriter` /
  `ICentralAuditWriter` interfaces, and the `AuditChannel`, `AuditKind`,
  `AuditStatus` enum types live here.
- **[Configuration Database (#17)](Component-ConfigurationDatabase.md)** — hosts
  the `AuditLog` table schema, the monthly partition function and scheme, the
  `scadalink_audit_writer` / `scadalink_audit_purger` DB roles, and the EF
  migration. Distinct concern from `IAuditService` (config-change audit), which
  is unchanged.
- **[Cluster Infrastructure (#13)](Component-ClusterInfrastructure.md)** —
  singleton placement and supervision for `AuditLogIngestActor`,
  `SiteAuditTelemetryActor`, `SiteAuditReconciliationActor`, and
  `AuditLogPurgeActor`.
- **[Central–Site Communication (#5)](Component-Communication.md)** — carries
  audit telemetry. New gRPC message types (`IngestAuditEvents`,
  `PullAuditEvents`) are added to the existing site-stream proto additively.
- **[Site Runtime (#3)](Component-SiteRuntime.md)** — script-trust-boundary
  call paths invoke `IAuditWriter` to append events.
- **[Host (#15)](Component-Host.md)** — registers this component (#23) under
  the central and site roles.

## Interactions

- **[External System Gateway (#7)](Component-ExternalSystemGateway.md)** —
  emits `ApiOutbound.ApiCall` rows on every sync `Call()`. For `CachedCall`,
  emits the combined cached telemetry packet (audit row + operational update)
  per Cached Operations — Combined Telemetry, using kinds
  `CachedSubmit` / `ApiCallCached` / `CachedResolve`.
- **[External System Gateway (#7)](Component-ExternalSystemGateway.md) — Database layer** — the database access modes inside ESG emit `DbOutbound.DbWrite` rows on script-initiated `Connection()` calls (writes and reads share the kind; distinguish via `Extra.rowsAffected` vs `Extra.rowsReturned`); `Database.CachedWrite` emits the cached-write lifecycle rows via the combined-telemetry packet using kinds `CachedSubmit` / `DbWriteCached` / `CachedResolve` (same shape as `ApiOutbound`). Site Runtime is the API surface that exposes the `Database.*` calls to scripts; the audit emission itself lives in ESG.
- **[Inbound API (#14)](Component-InboundAPI.md)** — emits one
  `ApiInbound.InboundRequest` row per request that passes authentication, from
  request-handler middleware, written directly to central via
  `ICentralAuditWriter` before the response is flushed. Auth-layer rejections emit
  `ApiInbound.InboundAuthFailure` instead (`Status=Failed`, HTTP 401).
- **[Notification Outbox (#21)](Component-NotificationOutbox.md)** — the
  site-emitted `Notification.NotifySend` row (`Status=Submitted`) flows via
  audit telemetry; the central dispatcher writes `Notification.NotifyDeliver`
  rows directly via `ICentralAuditWriter` — `Status=Attempted` per delivery
  attempt, `Status=Delivered`/`Parked`/`Discarded` on terminal status. The
  operational `Notifications` table is unchanged.
- **[Site Call Audit (#22)](Component-SiteCallAudit.md)** — shares the
  cached-call telemetry packet. Central ingest of that packet performs both the
  `AuditLog` insert and the `SiteCalls` upsert in one transaction. `SiteCalls`
  remains the operational state store; the Audit Log is its immutable shadow.
- **[Central UI (#9)](Component-CentralUI.md)** — a new **Audit** nav group
  hosts the Audit Log page (filter bar, results grid, drilldown drawer,
  server-side CSV export). Drill-in links appear on Notifications, Site Calls,
  External Systems, Inbound API key, Sites, and Instances detail pages.
  Double-clicking a node on the execution-tree page opens a detail modal
  listing that execution's audit rows, with click-through to each row's full
  detail view.
- **[Health Monitoring (#11)](Component-HealthMonitoring.md)** — three new
  tiles (Volume, Error rate, Backlog) plus new health metrics:
  `SiteAuditBacklog`, `SiteAuditWriteFailures`, `SiteAuditTelemetryStalled`,
  `CentralAuditWriteFailures`, `AuditRedactionFailure`.
- **[CLI (#19)](Component-CLI.md)** — new `scadalink audit query`,
  `scadalink audit export`, and `scadalink audit verify-chain` commands; same
  permission requirements as the UI.