# Centralized Audit Log — Design (Working Draft)
**Status:** Validated — ready for implementation planning.
**Owner:** Joseph Doherty
**Date:** 2026-05-20
**Provisional component number:** #23 Audit Log
> A new central, append-only audit log capturing every action a script causes to cross the cluster trust boundary: outbound API calls (sync + cached), outbound DB writes (sync + cached), notifications sent, and inbound API requests that invoke a script.
---
## 1. Purpose
Provide a **single forensic + operational record** of every integration action initiated by, or terminating in, a script — answering both:
- **Compliance / forensic:** "Did instance X send notification Y on date Z? What was the body? Did external system A get called by script B last quarter, and with what result?"
- **Operational visibility:** "Why is site S misbehaving right now? What did its scripts touch in the last 10 minutes? Which inbound API caller is hammering us?"
One store, rich payloads, long retention, dashboards + drilldowns + filter queries.
The audit log is **not** the operational state store. It does not drive dispatchers, retry loops, or Retry/Discard actions — those remain in [Notification Outbox](#21) and [Site Call Audit](#22). The audit log is the immutable history that **observes** those subsystems and adds coverage where they are silent.
---
## 2. Scope — the script trust boundary
The audit log captures **every action that a script causes to cross the cluster trust boundary**:
| Channel | Trigger | Direction | Covered today? |
|---|---|---|---|
| `ExternalSystem.Call(...)` | Script | Outbound | ❌ (gap) |
| `ExternalSystem.CachedCall(...)` | Script | Outbound | ✅ `SiteCalls` (Site Call Audit) |
| `Database.Connection().Execute*(...)` — writes | Script | Outbound | ❌ (gap) |
| `Database.CachedWrite(...)` | Script | Outbound | ✅ `SiteCalls` (Site Call Audit) |
| `Notify.To(list).Send(...)` | Script | Outbound | ✅ `Notifications` (Notification Outbox) |
| `POST /api/{method}` (Inbound API) | External | Inbound (invokes a script) | ❌ (gap) |
**Out of scope** — framework traffic is *not* audited:
- Health checks, heartbeats, cluster membership messages.
- gRPC inter-cluster real-time streams (attribute values, alarm states).
- Data Connection Layer ↔ OPC UA / custom protocol traffic.
- LDAP authentication probes, Traefik routing decisions.
- Internal Configuration Database queries by the framework.
- Site Event Log writes, audit log writes themselves.
This boundary is meaningful because the script trust model already controls what scripts can do; the audit log is the record of how that surface was exercised.
> **Note on DB reads.** Script-initiated reads via `Database.Connection().ExecuteReader(...)` count as actions from a script and ARE in scope. They are expected to be far less common than reads via DCL/subscriptions (which are framework traffic and excluded).
---
## 3. Architecture — layered, append-only
```
┌──────────────────────────────────────────────────────────────────────┐
│ Central cluster (MS SQL) │
│ │
│ ┌──────────────────┐ ┌───────────────┐ ┌────────────────────┐ │
│ │ Notification │ │ Site Call │ │ Inbound API │ │
│ │ Outbox (#21) │ │ Audit (#22) │ │ (#14) │ │
│ │ Notifications │ │ SiteCalls │ │ (no audit today) │ │
│ └────────┬─────────┘ └───────┬───────┘ └─────────┬──────────┘ │
│ │ emits │ emits │ emits │
│ ▼ ▼ ▼ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ AuditLog (new, append-only, MS SQL) │ │
│ │ one row per lifecycle event across all channels │ │
│ └─────────────────────────▲──────────────────────────────────┘ │
│ │ telemetry (gRPC, idempotent) │
└─────────────────────────────────┼─────────────────────────────────────┘
┌─────────────────────────────────┼─────────────────────────────────────┐
│ Site cluster (SQLite, per active node) │
│ │ │
│ ┌─────────────────────────┴──────────────────────────────┐ │
│ │ Site-local AuditLog (SQLite, hot-path append) │ │
│ └────▲───────────────▲───────────────▲───────────────────┘ │
│ │ │ │ │
│ ┌─────────┴────────┐ ┌───┴──────┐ ┌─────┴────────────┐ │
│ │ External System │ │ Database │ │ Site S&F / │ │
│ │ Gateway (#7) │ │ Layer │ │ Notifications │ │
│ │ sync + cached │ │ sync + │ │ (transitions) │ │
│ └──────────────────┘ │ cached │ └──────────────────┘ │
│ └──────────┘ │
└───────────────────────────────────────────────────────────────────────┘
```
**Key properties:**
- **Strictly append-only.** Once written, an AuditLog row is never updated or deleted (except by retention purge). Operational state (live status, parked-for-retry, etc.) lives in `Notifications` / `SiteCalls` — not in AuditLog.
- **One row per lifecycle event.** A cached call that retries three times and then parks produces six rows: `CachedSubmit` (`Submitted`), the forward-ack (`Forwarded`), three `Attempted` rows, and a terminal `CachedResolve` (`Parked`). A sync call produces one row. An inbound API hit produces one row.
- **Site-local first for site-originated events.** Hot-path script calls never wait on the network for an audit write.
- **Direct write for central-originated events.** Notification delivery attempts and inbound API hits land at central — they write directly to the central `AuditLog`. No detour through site SQLite.
- **At-least-once telemetry, idempotent on `EventId`.** Same dedup model as Site Call Audit today.
---
## 4. Data Model (first cut)
Single wide table, polymorphic by `Channel` + `Kind` discriminators, JSON payload column.
### Central: `AuditLog`
| Column | Type | Notes |
|---|---|---|
| `EventId` | `uniqueidentifier` PK | Generated where the event originates (site or central). Idempotency key. |
| `OccurredAtUtc` | `datetime2` | When the event happened (call returned, retry attempted, etc.). |
| `IngestedAtUtc` | `datetime2` | When central persisted the row (lags `OccurredAtUtc` for site-originated rows). |
| `Channel` | `varchar(32)` | `ApiOutbound` \| `DbOutbound` \| `Notification` \| `ApiInbound`. |
| `Kind` | `varchar(32)` | Event kind discriminator (see kinds list below). |
| `CorrelationId` | `uniqueidentifier` NULL | Ties multi-event operations together. `TrackedOperationId` for cached calls, `NotificationId` for notifications, request-id for inbound API. NULL for sync one-shot calls. |
| `SourceSiteId` | `varchar(64)` NULL | NULL for central-originated events (inbound API, central notification dispatch). |
| `SourceInstanceId` | `varchar(128)` NULL | Instance whose script initiated the action (when applicable). |
| `SourceScript` | `varchar(128)` NULL | Script name within the instance. |
| `Actor` | `varchar(128)` NULL | Inbound API: API key name. Outbound: script identity. Central: system user. |
| `Target` | `varchar(256)` NULL | Outbound API: external system + method. DB: connection name. Notification: list name. Inbound API: method name. |
| `Status` | `varchar(32)` | Outcome of *this event*: `Submitted`, `Forwarded`, `Attempted`, `Delivered`, `Failed`, `Parked`, `Discarded`, `Skipped`. |
| `HttpStatus` | `int` NULL | HTTP-bearing events only. |
| `DurationMs` | `int` NULL | Call/attempt duration. |
| `ErrorMessage` | `nvarchar(1024)` NULL | Truncated; `ErrorDetail` for full text. |
| `ErrorDetail` | `nvarchar(max)` NULL | Optional full exception/text on failures. |
| `RequestSummary` | `nvarchar(max)` NULL | Truncated request payload (configurable cap, default 8 KB). Headers redacted. |
| `ResponseSummary` | `nvarchar(max)` NULL | Truncated response payload (same cap). Full on errors. |
| `PayloadTruncated` | `bit` | Set if either summary was truncated. |
| `Extra` | `nvarchar(max)` NULL | Channel-specific JSON for fields we don't promote to columns. |
**Indexes (first cut):**
- `IX_AuditLog_OccurredAtUtc` — primary time-range index for global scans.
- `IX_AuditLog_Site_Occurred (SourceSiteId, OccurredAtUtc)` — per-site filters.
- `IX_AuditLog_Correlation (CorrelationId)` — drilldown from a single operation.
- `IX_AuditLog_Channel_Status_Occurred (Channel, Status, OccurredAtUtc)` — KPI / dashboard tiles.
- `IX_AuditLog_Target_Occurred (Target, OccurredAtUtc)` — "what did we send to system X."
- Partitioning by month on `OccurredAtUtc` from day one (purge becomes a partition switch instead of a delete storm).
**`Kind` values (flat — 10 discriminators across all channels):**
| Kind | Fires when |
|---|---|
| `ApiCall` | Sync `ExternalSystem.Call(...)` returns (success or permanent failure). One row per call. |
| `ApiCallCached` | A cached outbound-API attempt records its forward-ack (`Forwarded`) or each retry (`Attempted`). |
| `DbWrite` | Sync `Database.Connection().Execute*(...)` / `ExecuteReader(...)` completes. One row per call. |
| `DbWriteCached` | A cached outbound-DB attempt records its forward-ack (`Forwarded`) or each retry (`Attempted`). |
| `NotifySend` | Script's `Notify.Send(...)` is enqueued on the site — first row in a notification's lifecycle (`Status=Submitted`). |
| `NotifyDeliver` | Central Notification Outbox dispatcher records a delivery attempt (`Attempted`) or terminal outcome (`Delivered`/`Parked`/`Discarded`). |
| `InboundRequest` | An inbound API request completes — one row per request, written at request end with final status. |
| `InboundAuthFailure` | An inbound API request was rejected at the auth boundary (bad/missing key). One row, `Status=Failed`, `HttpStatus=401`. |
| `CachedSubmit` | Script-side enqueue of a cached call (`ExternalSystem.CachedCall` / `Database.CachedWrite`); first row in the cached-call lifecycle, written to site SQLite before any forward attempt. |
| `CachedResolve` | Terminal row for a cached operation — `Status` = `Delivered` / `Failed` / `Parked` / `Discarded`. |
### Site: `AuditLog` (SQLite)
Same shape minus `IngestedAtUtc` (irrelevant at the source) plus a local `ForwardState` column:
- `ForwardState`: `Pending` | `Forwarded` | `Reconciled`. Drives the telemetry loop and reconciliation pull.
**Site SQLite retention rule (hard invariant):**
A row is eligible for purge only when **both** conditions hold:
1. `OccurredAtUtc` is older than the configured site retention window (default **7 days**); AND
2. `ForwardState IN ('Forwarded', 'Reconciled')` — i.e., central has acknowledged receipt.
Rows still in `ForwardState = 'Pending'` are **never** purged on the basis of age. A prolonged central outage will grow the site audit table indefinitely until central is reachable again. This is intentional — losing audit rows to make room is a compliance violation, not a self-healing behavior.
To keep that growth visible before it becomes a disk problem, the site emits a **`SiteAuditBacklog`** health metric (pending row count, oldest pending age, bytes on disk). Crossing operator-configured thresholds surfaces as a Health dashboard warning on the relevant site tile. This is the same pattern used by the Store-and-Forward Engine's backlog metric.
Central is the durable home; site SQLite is a write-buffer with a forwarding guarantee.
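The two-condition purge rule can be sketched as a single `DELETE`. This is a minimal illustration using Python's stdlib `sqlite3`, not the actual site job; the function name `purge_site_audit` and the reduced three-column table are assumptions made for the example, and timestamps are stored as ISO-8601 UTC strings so they compare lexicographically.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def purge_site_audit(conn, now, retention_days=7):
    """Delete rows that are BOTH past retention AND acknowledged by central."""
    threshold = (now - timedelta(days=retention_days)).isoformat()
    cur = conn.execute(
        "DELETE FROM AuditLog "
        "WHERE OccurredAtUtc < ? "
        "AND ForwardState IN ('Forwarded', 'Reconciled')",
        (threshold,),
    )
    conn.commit()
    return cur.rowcount

# Hypothetical, reduced schema: only the columns the rule needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AuditLog ("
    "EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT NOT NULL, ForwardState TEXT NOT NULL)"
)
now = datetime(2026, 5, 20, 12, 0, tzinfo=timezone.utc)
old = (now - timedelta(days=30)).isoformat()
recent = (now - timedelta(days=1)).isoformat()
conn.executemany(
    "INSERT INTO AuditLog VALUES (?, ?, ?)",
    [
        ("a", old, "Forwarded"),     # old and acknowledged: purged
        ("b", old, "Pending"),       # old but never acknowledged: kept
        ("c", recent, "Forwarded"),  # acknowledged but inside retention: kept
        ("d", old, "Reconciled"),    # old and acknowledged: purged
    ],
)
purged = purge_site_audit(conn, now)
remaining = sorted(r[0] for r in conn.execute("SELECT EventId FROM AuditLog"))
```

Row `b` survives regardless of age, which is the behavior the invariant requires during a central outage.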
---
## 5. Where this fits in the existing component matrix
This work is provisionally **component #23: Audit Log**, with edges into:
- **#7 External System Gateway** — emits audit events for sync `Call()`, sync DB writes (and reads from scripts), and cached operations.
- **#14 Inbound API** — emits one row per request (success or failure) at request completion.
- **#21 Notification Outbox** — emits an audit row on enqueue, on each delivery attempt, and on terminal status.
- **#22 Site Call Audit** — emits an audit row on each lifecycle transition (enqueue, attempt, terminal). `SiteCalls` remains the operational state store; AuditLog is the immutable shadow.
- **#3 Site Runtime / #16 Commons** — script-trust-boundary call paths gain a thin audit interface.
- **#17 Configuration Database** — Audit Log is a separate concern from `IAuditService` (which stays config-change-only). Both coexist.
---
## 6. Ingestion paths
There are three write paths into the central `AuditLog`, all converging on the same table.
### 6.1 Site hot-path append (site-originated events)
1. Script issues an action across the trust boundary (`ExternalSystem.Call`, `Database` write/read, `Notify.Send`, etc.).
2. The component completing the action (External System Gateway, Database Layer, S&F Engine) builds an `AuditEvent` value object with a fresh `EventId` (Guid v4) and `OccurredAtUtc = UtcNow`.
3. Component appends the event to the site-local `AuditLog` SQLite via the `ISiteAuditWriter` interface. Single-statement `INSERT`, `ForwardState = 'Pending'`. The caller waits only on the local write: the await returns once that write is durable (typically microseconds) and never on a network round-trip.
4. Control returns to the script. No central round-trip on the hot path.
Failure modes on the hot path:
- **SQLite write fails** (disk full, IO error): the audit writer logs a critical event to the Site Event Log, surfaces a `SiteAuditWriteFailures` health metric, and *the action proceeds*. We do not fail user-facing actions because the audit write failed — but the operator must be told loudly. (Open question: do we want a "strict mode" where audit-write failure aborts the action? Default off.)
- **Audit writer not yet bootstrapped** (very early startup): events buffer in-memory bounded by a small ring; oldest discarded with a warning if it overflows. This window is normally sub-second.
### 6.2 Telemetry forward (site → central)
A `SiteAuditTelemetryActor` runs as a singleton on the active site node and drives the forwarding loop:
1. Selects up to N `Pending` rows from local `AuditLog` ordered by `OccurredAtUtc`.
2. Sends them in a batched gRPC `IngestAuditEvents(events)` call to central (over the existing `SiteStream` channel — same transport as cached-call telemetry today).
3. Central performs **insert-if-not-exists** on `EventId` (idempotent) and returns the accepted IDs.
4. Site flips `ForwardState = 'Forwarded'` for accepted IDs. Rejected IDs (transient central error) stay `Pending` for the next sweep.
Cadence: short polling interval (default 5s) when the queue is non-empty, longer (default 30s) when idle. Telemetry runs on a dedicated dispatcher so it doesn't compete with the script blocking-I/O dispatcher.
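One sweep of the forwarding loop above might look like the following sketch. It assumes a reduced site table and a hypothetical `send_batch` callable standing in for the gRPC `IngestAuditEvents` call; the real actor, transport, and polling cadence are not modeled here.

```python
import sqlite3

def forward_pending(conn, send_batch, batch_size=100):
    """Run one telemetry sweep; returns how many rows central accepted."""
    rows = conn.execute(
        "SELECT EventId, OccurredAtUtc, Kind FROM AuditLog "
        "WHERE ForwardState = 'Pending' ORDER BY OccurredAtUtc LIMIT ?",
        (batch_size,),
    ).fetchall()
    if not rows:
        return 0
    events = [{"EventId": r[0], "OccurredAtUtc": r[1], "Kind": r[2]} for r in rows]
    try:
        accepted = list(send_batch(events))
    except ConnectionError:
        return 0  # central unreachable: everything stays Pending for the next sweep
    conn.executemany(
        "UPDATE AuditLog SET ForwardState = 'Forwarded' "
        "WHERE EventId = ? AND ForwardState = 'Pending'",
        [(event_id,) for event_id in accepted],
    )
    conn.commit()
    return len(accepted)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AuditLog ("
    "EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT, Kind TEXT, ForwardState TEXT)"
)
conn.executemany(
    "INSERT INTO AuditLog VALUES (?, ?, ?, 'Pending')",
    [
        ("e1", "2026-05-20T10:00:00Z", "ApiCall"),
        ("e2", "2026-05-20T10:00:01Z", "DbWrite"),
        ("e3", "2026-05-20T10:00:02Z", "CachedSubmit"),
    ],
)

def fake_central(events):
    # Stand-in for central: accepts every event except e2 (simulated transient error).
    return [e["EventId"] for e in events if e["EventId"] != "e2"]

forwarded = forward_pending(conn, fake_central)
states = dict(conn.execute("SELECT EventId, ForwardState FROM AuditLog"))
```

The rejected row `e2` stays `Pending` and is picked up again by the next sweep, which is safe because central ingest is idempotent on `EventId`.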
### 6.3 Reconciliation pull (self-healing for missed telemetry)
A central `SiteAuditReconciliationActor` periodically (default every 5 minutes per site) asks each site: *"What is the highest `OccurredAtUtc` among your rows with `ForwardState = 'Pending'`, and how many pending rows do you have?"* If central sees a non-empty pending backlog that hasn't drained on its own (e.g., telemetry actor is wedged), it issues a `PullAuditEvents(sinceUtc, batchSize)` request that returns rows directly. Central inserts-if-not-exists and acks them; the site then flips those rows to `ForwardState = 'Reconciled'`.
This is the same self-healing pattern Site Call Audit uses for `SiteCalls`.
### 6.4 Central direct-write (central-originated events)
Events that originate at central never touch site SQLite:
- **Inbound API** — request completed at central; one `ApiInbound`/`InboundRequest` row written via `ICentralAuditWriter` synchronously inside the request handler middleware before the HTTP response is flushed. Auth failures emit `ApiInbound`/`InboundAuthFailure` instead.
- **Notification Outbox dispatcher** — each delivery attempt writes a `Notification`/`NotifyDeliver` row with `Status=Attempted`; terminal status writes a `Notification`/`NotifyDeliver` row with `Status=Delivered`/`Parked`/`Discarded`. (The site-originated `Notification`/`NotifySend` row, `Status=Submitted`, arrives via §6.2.)
Central direct-writes use the same insert-if-not-exists semantics keyed on `EventId`, so a retried request handler can't produce duplicates.
### 6.5 Cached operations — site emits, central writes twice
For `ExternalSystem.CachedCall` and `Database.CachedWrite`, the **site** is the source of truth for every audit row. The site writes each lifecycle event — `CachedSubmit` (`Status=Submitted`), then `ApiCallCached`/`DbWriteCached` rows for the forward-ack (`Status=Forwarded`) and each retry (`Status=Attempted`), then a terminal `CachedResolve` row (`Status=Delivered`/`Failed`/`Parked`/`Discarded`) — to its local SQLite `AuditLog` on the hot path (or on the retry tick for `Attempted` rows), then forwards via the same telemetry channel described in §6.2. The telemetry message format gains the audit-row fields additively — one packet per lifecycle transition carries both the operational state update AND the audit row content.
On receipt, central does two things in **one transaction**:
1. Insert-if-not-exists the immutable `AuditLog` row, keyed on `EventId`.
2. Upsert the operational `SiteCalls` row (existing Site Call Audit behavior — status, retry count, last error, timestamps).
This collapses what would otherwise be two telemetry concerns into one, keeps site SQLite as the single local source of truth for audit content, and preserves the existing operational `SiteCalls` shape for the dispatcher / UI. No central-side derivation; no double-emission from the site.
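The "two writes, one transaction" step can be illustrated with the sketch below. SQLite stands in for central MS SQL purely so the example runs; `INSERT OR IGNORE` and `ON CONFLICT ... DO UPDATE` are SQLite syntax (the latter needs SQLite 3.24 or newer), and the reduced `SiteCalls` columns are assumptions. Skipping the `SiteCalls` upsert when the audit row already exists is also an assumption of this sketch, chosen so a redelivered packet cannot double-count attempts.

```python
import sqlite3

def ingest_cached_event(conn, ev):
    """Insert the audit row and upsert SiteCalls atomically; True if the event was new."""
    with conn:  # one transaction: commits on success, rolls back on any exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO AuditLog (EventId, CorrelationId, Kind, Status) "
            "VALUES (?, ?, ?, ?)",
            (ev["EventId"], ev["CorrelationId"], ev["Kind"], ev["Status"]),
        )
        is_new = cur.rowcount == 1
        if is_new:
            conn.execute(
                "INSERT INTO SiteCalls (TrackedOperationId, Status, Attempts) "
                "VALUES (?, ?, ?) "
                "ON CONFLICT(TrackedOperationId) DO UPDATE SET "
                "Status = excluded.Status, "
                "Attempts = SiteCalls.Attempts + excluded.Attempts",
                (ev["CorrelationId"], ev["Status"], 1 if ev["Status"] == "Attempted" else 0),
            )
    return is_new

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AuditLog ("
    "EventId TEXT PRIMARY KEY, CorrelationId TEXT, Kind TEXT, Status TEXT)"
)
conn.execute(
    "CREATE TABLE SiteCalls ("
    "TrackedOperationId TEXT PRIMARY KEY, Status TEXT, Attempts INTEGER)"
)

# Lifecycle of one cached API call: submit, forward-ack, three attempts, resolve.
lifecycle = [
    ("CachedSubmit", "Submitted"),
    ("ApiCallCached", "Forwarded"),
    ("ApiCallCached", "Attempted"),
    ("ApiCallCached", "Attempted"),
    ("ApiCallCached", "Attempted"),
    ("CachedResolve", "Delivered"),
]
events = [
    {"EventId": f"ev-{i}", "CorrelationId": "op-1", "Kind": kind, "Status": status}
    for i, (kind, status) in enumerate(lifecycle)
]
results = [ingest_cached_event(conn, ev) for ev in events]
duplicate_result = ingest_cached_event(conn, events[2])  # redelivered telemetry packet

audit_rows = conn.execute("SELECT COUNT(*) FROM AuditLog").fetchone()[0]
site_call = conn.execute("SELECT * FROM SiteCalls").fetchone()
```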
---
## 7. Per-channel event mapping
Worked examples — what each `Channel`/`Kind` row actually looks like. (Other columns omitted for brevity unless interesting.)
### 7.1 `ApiOutbound` — outbound HTTP via External System Gateway
**Sync call** (`ExternalSystem.Call("Weather", "GetForecast", { city: "Dublin" })` succeeds):
```
EventId = <new guid>
Channel = ApiOutbound
Kind = ApiCall
CorrelationId = NULL -- one-shot, no operation to correlate
SourceSiteId = "site-01"
SourceInstanceId = "Plant1.Boiler"
SourceScript = "OnHourly"
Target = "Weather/GetForecast"
Status = Delivered
HttpStatus = 200
DurationMs = 142
RequestSummary = '{"city":"Dublin"}' -- truncated to cap
ResponseSummary= '{"tempC":11.4,...}' -- truncated to cap
```
**Cached call** (`ExternalSystem.CachedCall(...)`, hits a 500, retries, succeeds on attempt 3):
```
1. Kind=CachedSubmit Status=Submitted CorrelationId=<tracked-op-id>
2. Kind=ApiCallCached Status=Forwarded CorrelationId=<same>
3. Kind=ApiCallCached Status=Attempted HttpStatus=500 CorrelationId=<same>
4. Kind=ApiCallCached Status=Attempted HttpStatus=500 CorrelationId=<same>
5. Kind=ApiCallCached Status=Attempted HttpStatus=200 CorrelationId=<same>
6. Kind=CachedResolve Status=Delivered CorrelationId=<same>
```
These rows shadow the lifecycle of the `SiteCalls` row, but they are immutable and time-ordered.
### 7.2 `DbOutbound` — outbound DB via Database layer
**Sync write** (`db.Execute("INSERT INTO Readings ...", new {...})`):
```
Channel = DbOutbound
Kind = DbWrite
Target = "PlantDB" -- connection name only, not server
CorrelationId = NULL
Status = Delivered
DurationMs = 9
RequestSummary = "INSERT INTO Readings(ts,val) VALUES (@p0,@p1)" -- SQL text
Extra = '{"rowsAffected":1,"params":{"p0":"2026-05-20T14:00Z","p1":42.7}}' -- values captured by default
```
**Sync read** (`db.Query<...>(...)`):
```
Channel = DbOutbound
Kind = DbWrite
Status = Delivered
DurationMs = 31
RequestSummary = "SELECT id, value FROM Readings WHERE ts > @p0"
Extra = '{"rowsReturned":42}'
ResponseSummary= NULL -- rows not captured by default; opt-in per connection
```
(Reads and writes share the `DbWrite` kind — the kind distinguishes the trust-boundary call shape, not the SQL verb. Distinguish by `RequestSummary` / `Extra.rowsAffected` vs `Extra.rowsReturned` when needed.)
**Cached write** — same multi-row lifecycle as the cached API example, using `Kind=DbWriteCached` for the `Forwarded` / `Attempted` rows in place of `ApiCallCached`.
### 7.3 `Notification` — outbound notifications
```
1. Kind=NotifySend Status=Submitted CorrelationId=<NotificationId> SourceSiteId="site-01" SourceInstanceId="Plant1.Boiler"
2. Kind=NotifyDeliver Status=Attempted ErrorMessage="SMTP 451 ..." CorrelationId=<same> SourceSiteId=NULL (dispatch is central)
3. Kind=NotifyDeliver Status=Attempted CorrelationId=<same>
4. Kind=NotifyDeliver Status=Delivered CorrelationId=<same>
Target = "OpsTeamEmail" -- notification list name
Extra = '{"resolvedTargets":["a@x.com","b@x.com"], "subject":"Boiler high temp"}'
RequestSummary = '...body, truncated...'
```
Note that the site→central handoff is implicit: row 1 arrives via §6.2 telemetry (it originated at the site script); rows 2 through 4 are written by the central dispatcher directly via §6.4.
### 7.4 `ApiInbound` — inbound API
One row per request, written at request completion:
```
Channel = ApiInbound
Kind = InboundRequest
CorrelationId = <request-id> -- the request's correlation header (or generated)
SourceSiteId = NULL -- central-originated event
Actor = "AcmeSCADA" -- API key name (NOT the key itself)
Target = "RecordReading" -- inbound method name
Status = Delivered | Failed -- mapped from final HTTP outcome
HttpStatus = 200 | 400 | 500
DurationMs = 73
RequestSummary = '{"siteId":"...","value":12.4}' -- truncated; secrets/PII per redaction policy
ResponseSummary= '{"ok":true}' -- full body on 5xx
Extra = '{"remoteIp":"203.0.113.42","userAgent":"...","scriptInvoked":"RecordReading.Handle"}'
```
A bad API key → separate kind: `Kind=InboundAuthFailure`, `Status=Failed`, `HttpStatus=401`, `Actor=NULL`, `Extra` carries `remoteIp` for abuse triage.
---
## 8. Payload capture policy
### 8.1 Truncation
- Default cap: **8 KB** for each of `RequestSummary` and `ResponseSummary`. Configurable globally; per-target overrides allowed (§8.4).
- On any error row (`Status IN ('Failed', 'Parked', 'Discarded')`), the cap is raised to **64 KB** for that row — error context is precious.
- When a body is truncated, `PayloadTruncated = 1` and the captured prefix is preserved verbatim (UTF-8 byte-safe truncation, no mid-character cuts).
- Bodies exceeding the larger cap are still truncated; full bodies are never stored.
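A byte-safe truncation with the two caps could be sketched as follows. The helper names `truncate_utf8` and `summarize` are illustrative only; the cap values and the error-status set come from the rules above.

```python
ERROR_STATUSES = {"Failed", "Parked", "Discarded"}

def truncate_utf8(text, cap_bytes):
    """Return (prefix, truncated) where prefix fits in cap_bytes of UTF-8."""
    raw = text.encode("utf-8")
    if len(raw) <= cap_bytes:
        return text, False
    # The input was valid UTF-8, so the only invalid bytes after slicing are a
    # partial multi-byte sequence at the very end; errors="ignore" drops it.
    return raw[:cap_bytes].decode("utf-8", errors="ignore"), True

def summarize(body, status, default_cap=8192, error_cap=65536):
    """Apply the larger cap on error rows, the default cap otherwise."""
    cap = error_cap if status in ERROR_STATUSES else default_cap
    return truncate_utf8(body, cap)

# "é" is two bytes in UTF-8, so a 2-byte cap would cut it in half; the
# partial character is dropped instead of being stored as garbage.
cut_text, cut_flag = truncate_utf8("héllo", 2)

ok_summary, ok_truncated = summarize("x" * 10000, "Delivered")
err_summary, err_truncated = summarize("x" * 10000, "Failed")
```

The returned flag is what would populate `PayloadTruncated` for the row.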
### 8.2 Redaction
Redaction happens **at the write site**, before the row touches SQLite (or central MS SQL for §6.4 events). Unredacted secrets never persist.
- **HTTP headers** — `Authorization`, `Cookie`, `Set-Cookie`, `X-API-Key`, and any header matching the configured redact-list (regex) become `<redacted>`. List is operator-extensible.
- **HTTP bodies** — captured verbatim by default. Operators can register per-External-System / per-Inbound-Method body redactors (regex → replacement) for known secret fields (e.g., `"password"\s*:\s*"[^"]+"`).
- **SQL** — statement text and parameter values captured verbatim by default. Per-connection redaction opt-in (e.g., redact parameters whose name matches `@apikey|@token|@password`).
- **Notification bodies** — captured per the existing notification rules (no behavioral change from today).
- **Safety net** — if a configured redactor throws, the affected payload becomes `"<redacted: redactor error>"` and an `AuditRedactionFailure` health metric increments. We over-redact, never under-redact, on configuration faults.
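The header and body rules, including the over-redacting safety net, can be sketched like this. The function names and the `metrics` dictionary are hypothetical stand-ins for the real redaction component and health metric; the body pattern is the `password` example given above.

```python
import re

DEFAULT_HEADER_REDACT_LIST = {"authorization", "cookie", "set-cookie", "x-api-key"}
metrics = {"AuditRedactionFailure": 0}  # stand-in for the health metric

def redact_headers(headers, extra_patterns=()):
    """Replace values of sensitive headers; names match case-insensitively."""
    out = {}
    for name, value in headers.items():
        hit = name.lower() in DEFAULT_HEADER_REDACT_LIST or any(
            re.fullmatch(p, name, re.IGNORECASE) for p in extra_patterns
        )
        out[name] = "<redacted>" if hit else value
    return out

def redact_body(body, redactors):
    """Apply (pattern, replacement) redactors; over-redact if any of them fails."""
    try:
        for pattern, replacement in redactors:
            body = re.sub(pattern, replacement, body)
        return body
    except Exception:
        metrics["AuditRedactionFailure"] += 1
        return "<redacted: redactor error>"

headers = redact_headers(
    {"Authorization": "Bearer abc", "X-Tenant-Secret": "s3", "Accept": "application/json"},
    extra_patterns=[r"x-.*-secret"],
)
password_rule = (r'"password"\s*:\s*"[^"]+"', '"password":"<redacted>"')
clean = redact_body('{"user":"bob","password":"hunter2"}', [password_rule])
broken = redact_body('{"user":"bob"}', [("(", "x")])  # invalid regex raises re.error
```

A misconfigured pattern therefore costs the whole payload rather than leaking part of it.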
### 8.3 Never captured
- Raw API key material (only the key *name* via `Actor`).
- LDAP bind credentials, cluster secrets, Configuration DB connection strings.
- Framework traffic per §2 (out of scope by construction, not by redaction).
### 8.4 Configurability
Bound from `appsettings.json` (new `AuditLog` options class owned by the Audit Log component):
```jsonc
"AuditLog": {
"DefaultCapBytes": 8192,
"ErrorCapBytes": 65536,
"HeaderRedactList": [ "Authorization", "Cookie", "Set-Cookie", "X-API-Key" ],
"GlobalBodyRedactors": [
{ "Pattern": "\"password\"\\s*:\\s*\"[^\"]+\"", "Replacement": "\"password\":\"<redacted>\"" }
],
"PerTargetOverrides": {
"Weather/GetForecast": { "CapBytes": 4096 },
"PlantDB": { "RedactSqlParamsMatching": "@apikey|@token" }
}
}
```
Per-target keys bind by External System / Inbound Method / Notification List / Database Connection name.
---
## 9. Failure handling & idempotency
### 9.1 `EventId` is the dedup key
- Generated at the originator (site for §6.1/§6.5, central for §6.4). Guid v4.
- Central ingest is `INSERT … WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = @id)`, executed under the PK constraint.
- Idempotent across telemetry retries, reconciliation pulls, and any combination thereof.
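The insert-if-not-exists ingest can be demonstrated end to end with a small sketch. SQLite is used here only so the example is runnable; the table is reduced to four columns, and returning every `EventId` in the batch as acknowledged (whether newly inserted or already present) is an assumption of this sketch about what "accepted" means.

```python
import sqlite3

INSERT_IF_ABSENT = """
INSERT INTO AuditLog (EventId, OccurredAtUtc, Kind, Status)
SELECT ?, ?, ?, ?
WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = ?)
"""

def ingest_audit_events(conn, events):
    """Idempotent batch ingest; returns (acked_ids, newly_inserted_count)."""
    acked, inserted = [], 0
    with conn:  # single transaction for the batch
        for ev in events:
            cur = conn.execute(
                INSERT_IF_ABSENT,
                (ev["EventId"], ev["OccurredAtUtc"], ev["Kind"], ev["Status"], ev["EventId"]),
            )
            inserted += cur.rowcount  # 0 when the row already existed
            acked.append(ev["EventId"])
    return acked, inserted

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AuditLog ("
    "EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT, Kind TEXT, Status TEXT)"
)
batch = [
    {"EventId": "e1", "OccurredAtUtc": "2026-05-20T10:00:00Z", "Kind": "ApiCall", "Status": "Delivered"},
    {"EventId": "e2", "OccurredAtUtc": "2026-05-20T10:00:01Z", "Kind": "DbWrite", "Status": "Failed"},
]
first_acked, first_inserted = ingest_audit_events(conn, batch)
retry_acked, retry_inserted = ingest_audit_events(conn, batch)  # telemetry retry
row_count = conn.execute("SELECT COUNT(*) FROM AuditLog").fetchone()[0]
```

Replaying the same batch acknowledges the same IDs but adds no rows, so a site can safely resend after a lost response.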
### 9.2 Central MS SQL outage
- Site telemetry calls fail; `ForwardState` stays `Pending`; backlog grows.
- Reconciliation pulls also fail.
- Site SQLite continues to absorb hot-path writes (no upstream dependency on the hot path).
- `SiteAuditBacklog` health metric crosses thresholds → Health dashboard surfaces it on the affected site tile.
- On recovery, telemetry drains; insert-if-not-exists handles any overlap.
### 9.3 Site SQLite write failure
- Audit writer fails to append (disk full, schema lock, transient IO error).
- **The action proceeds** — we do not fail script-initiated work because the audit write failed.
- `SiteAuditWriteFailures` health metric increments; critical-severity Site Event Log entry.
- A small in-memory ring (default 1024 rows) buffers events while the local writer is unhealthy; on ring overflow, oldest events are dropped with a Site Event Log warning per drop.
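The "action proceeds, buffer in a bounded ring" behavior might be sketched as below. The class name `SiteAuditWriter`, its counters, and the `drain` method are assumptions for illustration; the Site Event Log entries are represented only by the counters.

```python
import sqlite3
from collections import deque

class SiteAuditWriter:
    """Append audit events locally; never raise into the calling action."""

    def __init__(self, conn, ring_size=1024):
        self.conn = conn
        self.ring = deque(maxlen=ring_size)  # bounded: oldest entry is evicted on overflow
        self.write_failures = 0              # stand-in for SiteAuditWriteFailures
        self.dropped = 0

    def _insert(self, event):
        with self.conn:
            self.conn.execute(
                "INSERT INTO AuditLog (EventId, Kind, ForwardState) VALUES (?, ?, 'Pending')",
                (event["EventId"], event["Kind"]),
            )

    def append(self, event):
        try:
            self._insert(event)
        except sqlite3.Error:
            self.write_failures += 1
            if len(self.ring) == self.ring.maxlen:
                self.dropped += 1  # the deque is about to evict its oldest event
            self.ring.append(event)

    def drain(self):
        """Retry buffered events in order once the local writer is healthy again."""
        while self.ring:
            try:
                self._insert(self.ring[0])
            except sqlite3.Error:
                return
            self.ring.popleft()

conn = sqlite3.connect(":memory:")  # no AuditLog table yet, so every insert fails
writer = SiteAuditWriter(conn, ring_size=2)
for i in range(3):
    writer.append({"EventId": f"e{i}", "Kind": "ApiCall"})
buffered = [e["EventId"] for e in writer.ring]

conn.execute("CREATE TABLE AuditLog (EventId TEXT PRIMARY KEY, Kind TEXT, ForwardState TEXT)")
writer.drain()
persisted = [r[0] for r in conn.execute("SELECT EventId FROM AuditLog ORDER BY EventId")]
```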
### 9.4 Telemetry actor wedged
- Reconciliation pull (§6.3) is the fallback. If two consecutive reconciliation cycles report a non-draining backlog, the supervisor restarts the telemetry actor and a `SiteAuditTelemetryStalled` event fires.
### 9.5 Central direct-write failure
- Inbound API: a middleware audit failure is logged and counted in a metric, but it never affects the HTTP response.
- Notification Outbox dispatcher: audit failure logs critical and increments `CentralAuditWriteFailures`; the operational `Notifications` row update proceeds.
### 9.6 Dedup horizon — there isn't one
`EventId` PK enforces uniqueness as long as a row exists in the table. Purge (§12) removes rows by `OccurredAtUtc`, not `EventId`; a stale telemetry retry arriving after the original was purged will insert a "new" row. Acceptable — a retry that arrives more than a year late is vanishingly rare and an extra row is harmless.
---
## 10. UI & query surface
### 10.1 Audit Log page (new, top-level)
Lives under a new **Audit** nav group in Central UI (sibling to **Notifications**). Standard Blazor Server + Bootstrap, custom components per the project UI rules.
**Filter bar (top of page, collapses to one row when not focused):**
- Time range (relative: 15m / 1h / 24h / 7d / custom).
- Channel (multi-select: `ApiOutbound`, `DbOutbound`, `Notification`, `ApiInbound`).
- Kind (filtered by selected channels).
- Status (multi-select).
- Site (multi-select, scoped to user's authorized sites).
- Instance / Script (text search with autocomplete).
- Target (text search — system+method, DB connection, list name).
- Actor (text search — inbound API key name).
- CorrelationId (paste a `TrackedOperationId` / `NotificationId` / request-id to see its full event sequence).
- "Errors only" toggle (`Status IN ('Failed', 'Parked', 'Discarded')`).
**Results grid:**
- Columns (resizable, reorderable, persisted per user): `OccurredAtUtc`, `Site`, `Channel`, `Kind`, `Status`, `Target`, `Actor`, `DurationMs`, `HttpStatus`, `ErrorMessage`.
- Keyset pagination on `(OccurredAtUtc desc, EventId desc)`. Default page 100.
- Click row → drilldown drawer.
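Keyset pagination on `(OccurredAtUtc desc, EventId desc)` can be sketched as follows, again with SQLite as a runnable stand-in and a two-column table. `fetch_page` and its cursor shape are hypothetical; the point is that each page is located by the last row seen rather than by an offset, so ties on `OccurredAtUtc` are broken by `EventId`.

```python
import sqlite3

PAGE_SQL = """
SELECT OccurredAtUtc, EventId FROM AuditLog
WHERE (? IS NULL)
   OR OccurredAtUtc < ?
   OR (OccurredAtUtc = ? AND EventId < ?)
ORDER BY OccurredAtUtc DESC, EventId DESC
LIMIT ?
"""

def fetch_page(conn, cursor=None, page_size=100):
    """Return (rows, next_cursor); cursor is the (OccurredAtUtc, EventId) of the last row seen."""
    ts, event_id = cursor if cursor else (None, None)
    rows = conn.execute(PAGE_SQL, (ts, ts, ts, event_id, page_size)).fetchall()
    next_cursor = rows[-1] if len(rows) == page_size else None
    return rows, next_cursor

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AuditLog (EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT)")
conn.executemany(
    "INSERT INTO AuditLog VALUES (?, ?)",
    [
        ("e1", "2026-05-20T10:00:00Z"),
        ("e2", "2026-05-20T10:00:01Z"),
        ("e3", "2026-05-20T10:00:01Z"),  # same timestamp as e2
        ("e4", "2026-05-20T10:00:02Z"),
        ("e5", "2026-05-20T10:00:02Z"),  # same timestamp as e4
    ],
)

seen, cursor = [], None
while True:
    rows, cursor = fetch_page(conn, cursor, page_size=2)
    seen.extend(event_id for _, event_id in rows)
    if cursor is None:
        break
```

Because the cursor is a value in the index rather than a row offset, rows ingested while a user is paging do not shift or duplicate earlier pages.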
**Drilldown drawer:**
- Pretty-prints `RequestSummary` / `ResponseSummary` (JSON auto-detected; SQL syntax-highlighted).
- Redaction indicators where headers/fields were stripped.
- "Copy as cURL" for `ApiOutbound` / `ApiInbound` rows.
- "Show all events for this operation" link → filters by `CorrelationId`.
### 10.2 Drill-in links from existing pages
- **Notifications** row → "View audit history" → Audit Log filtered to `CorrelationId = NotificationId`.
- **Site Calls** row → "View audit history" → Audit Log filtered to `CorrelationId = TrackedOperationId`.
- **External Systems** detail → "Recent activity" → Audit Log filtered to `Target starts-with <system>`.
- **Inbound API keys** detail → "Recent calls" → Audit Log filtered to `Actor = <key name>` AND `Channel = ApiInbound`.
- **Sites** detail → new "Audit feed" tab.
- **Instances** detail → new "Audit feed" tab.
### 10.3 Health dashboard tiles
Three new tiles in an "Audit" KPI group:
- **Audit volume** — events/min global + per-site sparkline.
- **Audit error rate** — % rows where `Status IN ('Failed', 'Parked', 'Discarded')`, rolling 5 min.
- **Audit backlog** — sum of `Pending` site rows; click → per-site breakdown.
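As an illustration of the error-rate tile, the rolling percentage could be computed with one aggregate query. This is a sketch under assumptions: the function name `audit_error_rate` is hypothetical, SQLite stands in for central MS SQL, and timestamps are ISO-8601 UTC strings.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def audit_error_rate(conn, now, window_minutes=5):
    """Percentage of rows in the window whose Status is Failed, Parked, or Discarded."""
    since = (now - timedelta(minutes=window_minutes)).isoformat()
    total, errors = conn.execute(
        "SELECT COUNT(*), "
        "COALESCE(SUM(Status IN ('Failed', 'Parked', 'Discarded')), 0) "
        "FROM AuditLog WHERE OccurredAtUtc >= ?",
        (since,),
    ).fetchone()
    return 0.0 if total == 0 else 100.0 * errors / total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AuditLog (EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT, Status TEXT)")
now = datetime(2026, 5, 20, 12, 0, tzinfo=timezone.utc)
samples = [
    ("e1", 1, "Delivered"),
    ("e2", 2, "Failed"),
    ("e3", 3, "Attempted"),   # an attempt is not an error outcome
    ("e4", 4, "Delivered"),
    ("e5", 10, "Parked"),     # outside the 5-minute window
]
conn.executemany(
    "INSERT INTO AuditLog VALUES (?, ?, ?)",
    [(eid, (now - timedelta(minutes=age)).isoformat(), status) for eid, age, status in samples],
)
rate = audit_error_rate(conn, now)
```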
### 10.4 Export
Audit Log page **Export** button streams CSV (current filter) server-side. Default cap 100k rows; larger exports use the CLI (§15).
---
## 11. Security & tamper-evidence
### 11.1 Append-only enforcement
- Application accesses `AuditLog` via a dedicated DB role `scadalink_audit_writer` granted `INSERT` + `SELECT` only — no `UPDATE`, no `DELETE`.
- Purge runs under a separate role `scadalink_audit_purger` whose permissions are limited to the partition-switch operation (§12.2). Row-level `DELETE` is not granted even to purge.
- A CI guard greps the data layer for any `UPDATE … AuditLog` or `DELETE … AuditLog` text and fails the build.
### 11.2 Authorization
- Reading the Audit Log requires the existing **Audit** role (today used for the IAuditService config-change log) extended with a new **OperationalAudit** permission.
- Per-site row scoping reuses the existing site-permission model from Security & Auth — a user sees only rows for sites they are authorized to operate.
- Bulk export (UI button + CLI) requires an additional **AuditExport** permission.
### 11.3 Payload redaction at write
See §8.2. Contract: unredacted secrets never persist. Safety net over-redacts on misconfiguration.
### 11.4 Tamper-evidence hash chain (deferred, v1.x)
- Each row gains a `RowHash` column.
- `RowHash = SHA-256(prev.RowHash || canonical(row))` per partition.
- Computed by a chaining job that runs after each monthly partition closes.
- Verifiable offline via `scadalink audit verify-chain --month YYYY-MM`.
- Default **off** in v1 to avoid operational burden. Flag for v1.x.
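
To make the chaining rule concrete, here is a small sketch of `RowHash = SHA-256(prev.RowHash || canonical(row))`. The canonical form (sorted-key JSON) and the all-zero seed for the first row of a partition are assumptions; the design does not specify either.

```python
import hashlib
import json

GENESIS = b"\x00" * 32  # assumed seed for the first row of a partition


def canonical(row):
    # Assumed canonical form: JSON with sorted keys and no extra whitespace.
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")


def chain(rows):
    """Compute the RowHash for each row of one partition, in order."""
    hashes, prev = [], GENESIS
    for row in rows:
        prev = hashlib.sha256(prev + canonical(row)).digest()
        hashes.append(prev)
    return hashes


def verify(rows, hashes):
    """Return the index of the first row whose hash mismatches, else None."""
    if len(rows) != len(hashes):
        return min(len(rows), len(hashes))
    prev = GENESIS
    for i, (row, stored) in enumerate(zip(rows, hashes)):
        if hashlib.sha256(prev + canonical(row)).digest() != stored:
            return i
        prev = stored
    return None
```

Because each hash folds in its predecessor, altering any row invalidates that row's stored hash, which is what an offline verifier would report.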
### 11.5 Site SQLite security
- File permissions: read/write by the ScadaLink service account only.
- Not backed up off-machine — site SQLite is a buffer with a forwarding guarantee, not a record. Central is the durable home.
---
## 12. Retention & purge mechanics
### 12.1 Central retention defaults
- **365 days** based on `OccurredAtUtc`. Configurable via `AuditLog:RetentionDays` (min 7, max 3650, validated at startup).
- **Single global retention in v1** — no per-channel/Kind overrides. Deferred to v1.x once production cost data shows whether overrides are needed.
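
The startup validation of `AuditLog:RetentionDays` amounts to a bounds check. A sketch, with the function name and the "missing value falls back to the default" behaviour as assumptions:

```python
MIN_DAYS, MAX_DAYS, DEFAULT_DAYS = 7, 3650, 365  # bounds from section 12.1


def validate_retention_days(value=None):
    """Return a valid retention in days, or raise so startup fails fast."""
    if value is None:
        return DEFAULT_DAYS  # assumed: an unset key means the default
    days = int(value)
    if not MIN_DAYS <= days <= MAX_DAYS:
        raise ValueError(
            f"AuditLog:RetentionDays must be between {MIN_DAYS} and {MAX_DAYS}, got {days}"
        )
    return days
```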
### 12.2 Partition strategy
- Monthly partitions on `OccurredAtUtc`. Partition function `pf_AuditLog_Month`, scheme `ps_AuditLog_Month`, created in the EF Core migration.
- Purge by partition switch: move the eligible partition to a staging table, then drop. No row-by-row delete; no log bloat.
- Partition-maintenance job rolls forward each month (creates the next month's partition ahead of time).
### 12.3 Purge job
- Singleton actor `AuditLogPurgeActor` on the active central node, runs daily.
- Switches out any partition whose latest `OccurredAtUtc` is older than the global retention window. Pure partition-switch; no row-level deletes.
- Emits an `AuditLog:Purged` event (partition range, rowcount, duration).
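
The eligibility test for a monthly partition can be sketched as below. As a simplification, the sketch uses the partition's exclusive upper boundary instead of looking up the latest `OccurredAtUtc` in it: if the boundary is at or before the cutoff, every row in the partition is older than the retention window. Function names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def next_month(start):
    """First instant of the month after `start` (start must be day 1)."""
    if start.month == 12:
        return start.replace(year=start.year + 1, month=1)
    return start.replace(month=start.month + 1)


def eligible_partitions(partition_starts, now, retention_days=365):
    """Return the partition start dates that can be switched out and dropped."""
    cutoff = now - timedelta(days=retention_days)
    # Upper bound is exclusive, so bound <= cutoff means all rows < cutoff.
    return [s for s in partition_starts if next_month(s) <= cutoff]
```

The same `next_month` helper also gives the boundary a roll-forward job would need when creating the following month's partition ahead of time.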
### 12.4 Site SQLite purge
- Daily site job, hard invariant per §4: purge only `OccurredAtUtc < threshold AND ForwardState IN ('Forwarded','Reconciled')`.
- Default site retention **7 days** (configurable, min 1, max 90).
- Backlog metric (§9.2) provides visibility into "central outage → site bloat" before disk-full.
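
The site purge invariant maps directly onto a single guarded `DELETE`. The sketch below uses Python's built-in `sqlite3` module; the table name `AuditBuffer` and the ISO-8601 text timestamp format are assumptions made for the example, while the predicate and the retention bounds come from this section.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def iso(dt):
    # Fixed-width UTC text so string comparison matches time order.
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def purge_site_rows(conn, now, retention_days=7):
    """Delete old rows that central already holds; return the number removed."""
    if not 1 <= retention_days <= 90:
        raise ValueError("site retention must be between 1 and 90 days")
    threshold = iso(now - timedelta(days=retention_days))
    cur = conn.execute(
        "DELETE FROM AuditBuffer "
        "WHERE OccurredAtUtc < ? "
        "AND ForwardState IN ('Forwarded', 'Reconciled')",
        (threshold,),
    )
    conn.commit()
    return cur.rowcount
```

Rows still in `Pending` are never touched regardless of age, which is why a long central outage shows up as backlog growth rather than data loss.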
---
## 13. Performance & sizing
Rough back-of-envelope; load testing will confirm.
### 13.1 Per-site event rate (assumed nominal site)
| Channel/Kind | Typ events/min | Peak events/min |
|---|---:|---:|
| `ApiOutbound.ApiCall` | 10 | 100 |
| `ApiOutbound.ApiCallCached` (~4 rows/op incl. `CachedSubmit`/`CachedResolve`) | 4 | 20 |
| `DbOutbound.DbWrite` (writes) | 30 | 300 |
| `DbOutbound.DbWrite` (reads) | 60 | 600 |
| `DbOutbound.DbWriteCached` (~4 rows/op incl. `CachedSubmit`/`CachedResolve`) | 4 | 20 |
| `Notification.NotifySend` (site-emit) | 1 | 10 |
| **Per-site total** | **~110** | **~1,050** |
### 13.2 Central total (50-site deployment)
- Typical: ~5,500 events/min = ~**92 events/sec**.
- Peak: ~52,500 events/min = ~**875 events/sec**.
- Plus central-originated (Notification dispatch + Inbound API): assume ~30 events/sec typical.
MS SQL should sustain this rate given batched ingest and the time-aligned indexes.
### 13.3 Row size
- Fixed columns: ~250 bytes.
- Average captured payload: ~1 KB.
- Per row: **~1.3 KB**.
### 13.4 Yearly central footprint
- Typical: ~100 events/sec × 86,400 s/day × 365 days × 1.3 KB ≈ **4 TB** at the default cap.
- Reducing the cap (8 KB → 2 KB) or adding per-channel retention (deferred to v1.x, §12.1) would cut this several-fold.
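
The arithmetic in §13.1 through §13.4 can be reproduced with a few lines, which also makes it easy to re-run with different assumptions. All inputs are the figures from the tables above; the variable and function names are illustrative.

```python
# Events per minute for one nominal site, from the table in 13.1.
TYP = {"ApiCall": 10, "ApiCallCached": 4, "DbWrite (writes)": 30,
       "DbWrite (reads)": 60, "DbWriteCached": 4, "NotifySend": 1}
PEAK = {"ApiCall": 100, "ApiCallCached": 20, "DbWrite (writes)": 300,
        "DbWrite (reads)": 600, "DbWriteCached": 20, "NotifySend": 10}

SITES = 50     # deployment size assumed in 13.2
ROW_KB = 1.3   # per-row size from 13.3

per_site_typ = sum(TYP.values())    # 109, quoted as ~110
per_site_peak = sum(PEAK.values())  # 1050

central_typ_per_sec = per_site_typ * SITES / 60
central_peak_per_sec = per_site_peak * SITES / 60


def yearly_tb(events_per_sec, row_kb=ROW_KB):
    """Yearly storage in decimal terabytes for a steady event rate."""
    rows_per_year = events_per_sec * 86_400 * 365
    return rows_per_year * row_kb / 1e9  # KB to TB
```

With the rounded 100 events/sec used above, `yearly_tb(100)` comes out at about 4.1 TB. Using 122 events/sec instead (92 site-originated plus the assumed 30 central-originated) gives roughly 5 TB.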
### 13.5 Site SQLite footprint
- 110/min × 60 × 24 × 7 ≈ 1.1 M rows × 1.3 KB ≈ **~1.4 GB / site** at the 7-day window. Modest for local disk.
### 13.6 Levers
- Reduce `DefaultCapBytes` per §8.1.
- Tighten per-channel retention once overrides exist (deferred to v1.x per §12.1), especially for `DbOutbound.DbWrite` read traffic.
- Defer to v1.x: Parquet archival to object storage before purge (§15.2).
---
## 14. KPI surface & relationship to existing KPIs
### 14.1 New Audit Log KPIs
- **Volume** — events/min, global + per-site.
- **Error rate** — % rows where `Status IN ('Failed', 'Parked', 'Discarded')`, rolling 5 min.
- **Backlog** — sum of `Pending` site rows.
- **Top inbound callers** — top-10 `Actor` by request count, last 1h.
- **Top outbound 5xx** — top-10 `Target` by 5xx-status count, last 1h.
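
Two of these KPIs are sketched below over an in-memory list of rows, purely to pin down the definitions. In the real system they would presumably be SQL aggregates; the dictionary row shape and function names here are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

ERROR_STATUSES = {"Failed", "Parked", "Discarded"}


def error_rate(rows, now, window=timedelta(minutes=5)):
    """Percentage of rows in the rolling window whose Status is an error."""
    recent = [r for r in rows if now - window <= r["OccurredAtUtc"] <= now]
    if not recent:
        return 0.0
    errors = sum(r["Status"] in ERROR_STATUSES for r in recent)
    return 100.0 * errors / len(recent)


def top_inbound_callers(rows, now, n=10, window=timedelta(hours=1)):
    """Top-n Actor values by inbound request count within the window."""
    counts = Counter(
        r["Actor"]
        for r in rows
        if r["Channel"] == "ApiInbound"
        and now - window <= r["OccurredAtUtc"] <= now
    )
    return counts.most_common(n)
```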
### 14.2 Relationship to existing KPIs
- **Notification Outbox KPIs** (queue depth, parked, delivered-last-interval, etc.) — unchanged, sourced from `Notifications`. Audit Log KPIs describe the audit table itself, not the notification subsystem.
- **Site Call Audit KPIs** — unchanged, sourced from `SiteCalls`.
- Audit Log KPIs occupy their own group on the Health dashboard. Nothing is collapsed or superseded.
---
## 15. CLI & external access
### 15.1 CLI commands
New `scadalink audit` command group:
- `scadalink audit query --site <s> --since <t> --kind <k> [...]` — same filter set as the UI.
- `scadalink audit export --since <t> --until <t> --format csv|jsonl|parquet --output <path>` — bulk export, server-side streaming.
- `scadalink audit verify-chain --month <YYYY-MM>` — hash-chain verification (when §11.4 is enabled).
Requires the same **OperationalAudit** / **AuditExport** permissions as the UI.
### 15.2 Object-storage archival (deferred, v1.x)
A monthly job dumps the closing partition to Parquet on operator-configured object storage before central purge — enabling indefinite cold retention without bloating MS SQL. Flag for v1.x; not in initial scope.
---
## 16. Locked decisions
| # | Question | Decision |
|---|---|---|
| 1 | Component number | **#23 Audit Log** (README matrix + HighLevelReqs). |
| 2 | Nav placement | New top-level **Audit** nav group in Central UI. |
| 3 | Hash-chain tamper evidence (§11.4) | Deferred to v1.x. v1 enforces append-only via DB grants only. |
| 4 | Parquet archival to object storage (§15.2) | Deferred to v1.x. |
| 5 | Per-channel retention overrides (§12.1) | Deferred to v1.x. v1 uses a single global `RetentionDays`. |
| 6 | Default payload cap | **8 KB** for `RequestSummary` / `ResponseSummary`; **64 KB** on error rows (`Status IN ('Failed', 'Parked', 'Discarded')`). |
All earlier design decisions (purpose, topology, scope, payload depth, lifecycle granularity, retention default, site→central path, UI shape, cached-call audit emission, SQL parameter capture, never-fail-on-audit-failure) are also locked. See §1–§15.