Validated design for a new append-only AuditLog covering the script trust boundary: outbound API calls (sync + cached), outbound DB operations (sync + cached, incl. script-initiated reads), notifications, and inbound API requests. Layered alongside existing Notifications (#21) and SiteCalls (#22) operational tables. Key decisions: - One row per lifecycle event; strictly append-only. - Site SQLite hot-path append + best-effort gRPC telemetry + central reconciliation pull. Site purge requires ForwardState=Forwarded. - Cached calls: site emits; one telemetry packet feeds both the immutable AuditLog row and the operational SiteCalls upsert. - Payload: metadata + truncated bodies (8 KB default, 64 KB on errors). Headers redacted; SQL parameter values captured by default. - Audit-write failures never abort the user-facing action. - Monthly partitioning at central; 365-day global retention. - New Audit nav group + drill-in links from existing pages. Deferred to v1.x: hash-chain tamper evidence, Parquet archival, per-channel retention overrides. Provisional component #23.
# Centralized Audit Log: Design (Working Draft)

**Status:** Validated, ready for implementation planning.
**Owner:** Joseph Doherty
**Date:** 2026-05-20
**Provisional component number:** #23 Audit Log

> A new central, append-only audit log capturing every action a script causes to cross the cluster trust boundary: outbound API calls (sync + cached), outbound DB operations (sync + cached, including script-initiated reads), notifications sent, and inbound API requests that invoke a script.

---

## 1. Purpose

Provide a **single forensic + operational record** of every integration action initiated by, or terminating in, a script. It answers both kinds of question:

- **Compliance / forensic:** "Did instance X send notification Y on date Z? What was the body? Did external system A get called by script B last quarter, and with what result?"
- **Operational visibility:** "Why is site S misbehaving right now? What did its scripts touch in the last 10 minutes? Which inbound API caller is hammering us?"

One store, rich payloads, long retention, dashboards + drilldowns + filter queries.

The audit log is **not** the operational state store. It does not drive dispatchers, retry loops, or Retry/Discard actions; those remain in [Notification Outbox](#21) and [Site Call Audit](#22). The audit log is the immutable history that **observes** those subsystems and adds coverage where they are silent.

---

## 2. Scope: the script trust boundary

The audit log captures **every action that a script causes to cross the cluster trust boundary**:

| Channel | Trigger | Direction | Covered today? |
|---|---|---|---|
| `ExternalSystem.Call(...)` | Script | Outbound | ❌ (gap) |
| `ExternalSystem.CachedCall(...)` | Script | Outbound | ✅ `SiteCalls` (Site Call Audit) |
| `Database.Connection().Execute*(...)` (writes) | Script | Outbound | ❌ (gap) |
| `Database.CachedWrite(...)` | Script | Outbound | ✅ `SiteCalls` (Site Call Audit) |
| `Notify.To(list).Send(...)` | Script | Outbound | ✅ `Notifications` (Notification Outbox) |
| `POST /api/{method}` (Inbound API) | External | Inbound (invokes a script) | ❌ (gap) |

**Out of scope:** framework traffic is *not* audited:

- Health checks, heartbeats, cluster membership messages.
- gRPC inter-cluster real-time streams (attribute values, alarm states).
- Data Connection Layer ↔ OPC UA / custom protocol traffic.
- LDAP authentication probes, Traefik routing decisions.
- Internal Configuration Database queries by the framework.
- Site Event Log writes, audit log writes themselves.

This boundary is meaningful because the script trust model already controls what scripts can do; the audit log is the record of how that surface was exercised.

> **Note on DB reads.** Script-initiated reads via `Database.Connection().ExecuteReader(...)` count as actions from a script and ARE in scope. They are expected to be far less common than reads via DCL/subscriptions (which are framework traffic and excluded).

---

## 3. Architecture: layered, append-only

```
┌──────────────────────────────────────────────────────────────────────┐
│                          Central cluster (MS SQL)                    │
│                                                                       │
│  ┌──────────────────┐  ┌───────────────┐  ┌────────────────────┐    │
│  │ Notification     │  │ Site Call     │  │ Inbound API        │    │
│  │ Outbox (#21)     │  │ Audit (#22)   │  │ (#14)              │    │
│  │ Notifications    │  │ SiteCalls     │  │ (no audit today)   │    │
│  └────────┬─────────┘  └───────┬───────┘  └─────────┬──────────┘    │
│           │ emits              │ emits              │ emits         │
│           ▼                    ▼                    ▼               │
│  ┌────────────────────────────────────────────────────────────┐     │
│  │             AuditLog (new, append-only, MS SQL)            │     │
│  │  one row per lifecycle event across all channels            │     │
│  └─────────────────────────▲──────────────────────────────────┘     │
│                            │ telemetry (gRPC, idempotent)            │
└─────────────────────────────────┼─────────────────────────────────────┘
                                  │
                                  │
┌─────────────────────────────────┼─────────────────────────────────────┐
│                  Site cluster (SQLite, per active node)               │
│                                 │                                     │
│       ┌─────────────────────────┴──────────────────────────────┐      │
│       │  Site-local AuditLog (SQLite, hot-path append)         │      │
│       └────▲───────────────▲───────────────▲───────────────────┘      │
│            │               │               │                           │
│  ┌─────────┴────────┐  ┌───┴──────┐  ┌─────┴────────────┐            │
│  │ External System  │  │ Database │  │ Site S&F /       │            │
│  │ Gateway (#7)     │  │ Layer    │  │ Notifications    │            │
│  │ sync + cached    │  │ sync +   │  │ (transitions)    │            │
│  └──────────────────┘  │ cached   │  └──────────────────┘            │
│                        └──────────┘                                   │
└───────────────────────────────────────────────────────────────────────┘
```

**Key properties:**

- **Strictly append-only.** Once written, an AuditLog row is never updated or deleted (except by retention purge). Operational state (live status, parked-for-retry, etc.) lives in `Notifications` / `SiteCalls`, not in AuditLog.
- **One row per lifecycle event.** A cached call that makes three attempts and then parks produces five rows: enqueued, attempt #1, attempt #2, attempt #3, parked. A sync call produces one row. An inbound API hit produces one row.
- **Site-local first for site-originated events.** Hot-path script calls never wait on the network for an audit write.
- **Direct write for central-originated events.** Notification delivery attempts and inbound API hits land at central, so they write directly to the central `AuditLog`. No detour through site SQLite.
- **At-least-once telemetry, idempotent on `EventId`.** Same dedup model as Site Call Audit today.

---

## 4. Data Model (first cut)

Single wide table, polymorphic by `Channel` + `Kind` discriminators, JSON payload column.

### Central: `AuditLog`

| Column | Type | Notes |
|---|---|---|
| `EventId` | `uniqueidentifier` PK | Generated where the event originates (site or central). Idempotency key. |
| `OccurredAtUtc` | `datetime2` | When the event happened (call returned, retry attempted, etc.). |
| `IngestedAtUtc` | `datetime2` | When central persisted the row (lags `OccurredAtUtc` for site-originated rows). |
| `Channel` | `varchar(32)` | `ApiOutbound` \| `DbOutbound` \| `Notification` \| `ApiInbound`. |
| `Kind` | `varchar(32)` | Channel-specific event kind (see table below). |
| `CorrelationId` | `uniqueidentifier` NULL | Ties multi-event operations together. `TrackedOperationId` for cached calls, `NotificationId` for notifications, request-id for inbound API. NULL for sync one-shot calls. |
| `SourceSiteId` | `varchar(64)` NULL | NULL for central-originated events (inbound API, central notification dispatch). |
| `SourceInstanceId` | `varchar(128)` NULL | Instance whose script initiated the action (when applicable). |
| `SourceScript` | `varchar(128)` NULL | Script name within the instance. |
| `Actor` | `varchar(128)` NULL | Inbound API: API key name. Outbound: script identity. Central: system user. |
| `Target` | `varchar(256)` NULL | Outbound API: external system + method. DB: connection name. Notification: list name. Inbound API: method name. |
| `Status` | `varchar(32)` | Outcome of *this event*: `Success`, `TransientFailure`, `PermanentFailure`, `Enqueued`, `Retrying`, `Delivered`, `Parked`, `Discarded`. |
| `HttpStatus` | `int` NULL | HTTP-bearing events only. |
| `DurationMs` | `int` NULL | Call/attempt duration. |
| `ErrorMessage` | `nvarchar(1024)` NULL | Truncated; see `ErrorDetail` for the full text. |
| `ErrorDetail` | `nvarchar(max)` NULL | Optional full exception/text on failures. |
| `RequestSummary` | `nvarchar(max)` NULL | Truncated request payload (configurable cap, default 8 KB). Headers redacted. |
| `ResponseSummary` | `nvarchar(max)` NULL | Truncated response payload (same cap; raised to 64 KB on non-`Success` rows per §8.1). |
| `PayloadTruncated` | `bit` | Set if either summary was truncated. |
| `Extra` | `nvarchar(max)` NULL | Channel-specific JSON for fields we don't promote to columns. |

**Indexes (first cut):**

- `IX_AuditLog_OccurredAtUtc`: primary time-range index for global scans.
- `IX_AuditLog_Site_Occurred (SourceSiteId, OccurredAtUtc)`: per-site filters.
- `IX_AuditLog_Correlation (CorrelationId)`: drilldown from a single operation.
- `IX_AuditLog_Channel_Status_Occurred (Channel, Status, OccurredAtUtc)`: KPI / dashboard tiles.
- `IX_AuditLog_Target_Occurred (Target, OccurredAtUtc)`: "what did we send to system X."
- Partitioning by month on `OccurredAtUtc` from day one (purge becomes a partition switch instead of a delete storm).

**`Kind` values by channel:**

| Channel | Kinds |
|---|---|
| `ApiOutbound` | `SyncCall`, `CachedEnqueued`, `CachedAttempt`, `CachedTerminal` |
| `DbOutbound` | `SyncWrite`, `SyncRead`, `CachedEnqueued`, `CachedAttempt`, `CachedTerminal` |
| `Notification` | `Enqueued`, `Attempt`, `Terminal` |
| `ApiInbound` | `Completed` (one row per request, written at request end with final status) |

### Site: `AuditLog` (SQLite)

Same shape minus `IngestedAtUtc` (irrelevant at the source) plus a local `ForwardState` column:

- `ForwardState`: `Pending` | `Forwarded` | `Reconciled`. Drives the telemetry loop and reconciliation pull.

**Site SQLite retention rule (hard invariant):**

A row is eligible for purge only when **both** conditions hold:

1. `OccurredAtUtc` is older than the configured site retention window (default **7 days**); AND
2. `ForwardState IN ('Forwarded', 'Reconciled')`, i.e., central has acknowledged receipt.

Rows still in `ForwardState = 'Pending'` are **never** purged on the basis of age. A prolonged central outage will grow the site audit table indefinitely until central is reachable again. This is intentional: losing audit rows to make room is a compliance violation, not a self-healing behavior.

To keep that growth visible in practice, the site emits a **`SiteAuditBacklog`** health metric (pending row count, oldest pending age, bytes on disk). Crossing operator-configured thresholds surfaces as a Health dashboard warning on the relevant site tile. This is the same pattern used by the Store-and-Forward Engine's backlog metric.

Central is the durable home; site SQLite is a write-buffer with a forwarding guarantee.

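
The purge invariant above can be sketched in a few lines. This is a minimal illustration in Python with the standard `sqlite3` module, not the project's implementation: the three-column table, the ISO-8601 text timestamps, and the `purge_site_audit` helper are all assumptions made for the example.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def purge_site_audit(conn, now, retention_days=7):
    """Delete rows that are BOTH past the window AND acknowledged by central."""
    threshold = (now - timedelta(days=retention_days)).isoformat()
    cur = conn.execute(
        "DELETE FROM AuditLog "
        "WHERE OccurredAtUtc < ? AND ForwardState IN ('Forwarded', 'Reconciled')",
        (threshold,),
    )
    conn.commit()
    return cur.rowcount

def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical, heavily reduced schema: only the columns the rule needs.
    conn.execute(
        "CREATE TABLE AuditLog (EventId TEXT PRIMARY KEY, "
        "OccurredAtUtc TEXT NOT NULL, ForwardState TEXT NOT NULL)"
    )
    now = datetime(2026, 5, 20, tzinfo=timezone.utc)
    old = (now - timedelta(days=30)).isoformat()
    recent = (now - timedelta(days=1)).isoformat()
    conn.executemany("INSERT INTO AuditLog VALUES (?, ?, ?)", [
        ("a", old, "Forwarded"),     # old and acknowledged: purged
        ("b", old, "Pending"),       # old but never acknowledged: kept
        ("c", recent, "Forwarded"),  # acknowledged but inside the window: kept
        ("d", old, "Reconciled"),    # old and acknowledged: purged
    ])
    deleted = purge_site_audit(conn, now)
    remaining = [r[0] for r in conn.execute("SELECT EventId FROM AuditLog ORDER BY EventId")]
    return deleted, remaining
```

Row `b` is the case the invariant exists for: it is well past the window, yet it survives because central never acknowledged it.
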

---

## 5. Where this fits in the existing component matrix

This work probably becomes **component #23: Audit Log**, with edges into:

- **#7 External System Gateway:** emits audit events for sync `Call()`, sync DB writes (and reads from scripts), and cached operations.
- **#14 Inbound API:** emits one row per request (success or failure) at request completion.
- **#21 Notification Outbox:** emits an audit row on enqueue, on each delivery attempt, and on terminal status.
- **#22 Site Call Audit:** emits an audit row on each lifecycle transition (enqueue, attempt, terminal). `SiteCalls` remains the operational state store; AuditLog is the immutable shadow.
- **#3 Site Runtime / #16 Commons:** script-trust-boundary call paths gain a thin audit interface.
- **#17 Configuration Database:** Audit Log is a separate concern from `IAuditService` (which stays config-change-only). Both coexist.

---

## 6. Ingestion paths

There are three write paths into the central `AuditLog`, all converging on the same table.

### 6.1 Site hot-path append (site-originated events)

1. Script issues an action across the trust boundary (`ExternalSystem.Call`, `Database` write/read, `Notify.Send`, etc.).
2. The component completing the action (External System Gateway, Database Layer, S&F Engine) builds an `AuditEvent` value object with a fresh `EventId` (Guid v4) and `OccurredAtUtc = UtcNow`.
3. The component appends the event to the site-local SQLite `AuditLog` via the `ISiteAuditWriter` interface: a single-statement `INSERT` with `ForwardState = 'Pending'`. This is effectively fire-and-forget from the caller's point of view: the await returns once the local write is durable (typically microseconds).
4. Control returns to the script. No central round-trip on the hot path.

Failure modes on the hot path:

- **SQLite write fails** (disk full, IO error): the audit writer logs a critical event to the Site Event Log, surfaces a `SiteAuditWriteFailures` health metric, and *the action proceeds*. We do not fail user-facing actions because the audit write failed, but the operator must be told loudly. (Open question: do we want a "strict mode" where audit-write failure aborts the action? Default off.)
- **Audit writer not yet bootstrapped** (very early startup): events buffer in memory, bounded by a small ring; the oldest are discarded with a warning if it overflows. This window is normally sub-second.

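
The hot-path steps and the "action proceeds" failure mode can be sketched as follows. This is an illustrative Python stand-in, not the `ISiteAuditWriter` implementation: the `SiteAuditWriter` class, its reduced column set, and the `write_failures` counter (standing in for the `SiteAuditWriteFailures` metric) are assumptions.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

class SiteAuditWriter:
    """Hypothetical stand-in for ISiteAuditWriter, reduced to a few columns."""

    def __init__(self, conn):
        self.conn = conn
        self.write_failures = 0  # stand-in for the SiteAuditWriteFailures metric
        conn.execute(
            "CREATE TABLE IF NOT EXISTS AuditLog ("
            "EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT NOT NULL, "
            "Channel TEXT NOT NULL, Kind TEXT NOT NULL, "
            "Status TEXT NOT NULL, ForwardState TEXT NOT NULL)"
        )

    def append(self, channel, kind, status):
        """Single-statement INSERT with ForwardState='Pending'. Never raises."""
        event_id = str(uuid.uuid4())  # Guid v4, generated at the originator
        occurred = datetime.now(timezone.utc).isoformat()
        try:
            with self.conn:  # commits on success, rolls back on error
                self.conn.execute(
                    "INSERT INTO AuditLog VALUES (?, ?, ?, ?, ?, 'Pending')",
                    (event_id, occurred, channel, kind, status),
                )
        except sqlite3.Error:
            # Audit failure is counted and reported; the caller's action proceeds.
            self.write_failures += 1
            return None
        return event_id
```

Returning `None` instead of raising is the design choice being illustrated: the script-facing call site never has to handle an audit exception.
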
### 6.2 Telemetry forward (site → central)

A `SiteAuditTelemetryActor` runs as a singleton on the active site node and drives the forwarding loop:

1. Selects up to N `Pending` rows from local `AuditLog` ordered by `OccurredAtUtc`.
2. Sends them in a batched gRPC `IngestAuditEvents(events)` call to central (over the existing `SiteStream` channel, the same transport as cached-call telemetry today).
3. Central performs **insert-if-not-exists** on `EventId` (idempotent) and returns the accepted IDs.
4. Site flips `ForwardState = 'Forwarded'` for accepted IDs. Rejected IDs (transient central error) stay `Pending` for the next sweep.

Cadence: short polling interval (default 5s) when the queue is non-empty, longer (default 30s) when idle. Telemetry runs on a dedicated dispatcher so it doesn't compete with the script blocking-I/O dispatcher.

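
One sweep of the forwarding loop can be sketched like this. It is a simplified Python illustration: `FakeCentral` replaces the gRPC `IngestAuditEvents` call with an in-memory dictionary, and `forward_pending` and the reduced schema are hypothetical names for the example.

```python
import sqlite3

class FakeCentral:
    """Hypothetical in-memory stand-in for the central ingest endpoint."""

    def __init__(self):
        self.rows = {}

    def ingest(self, events):
        for event in events:
            # Insert-if-not-exists on EventId: a resend never duplicates a row.
            self.rows.setdefault(event["EventId"], event)
        return [event["EventId"] for event in events]

def forward_pending(site, ingest, batch_size=100):
    """One sweep: send up to batch_size Pending rows, flip accepted ones."""
    rows = site.execute(
        "SELECT EventId, OccurredAtUtc FROM AuditLog "
        "WHERE ForwardState = 'Pending' ORDER BY OccurredAtUtc LIMIT ?",
        (batch_size,),
    ).fetchall()
    if not rows:
        return 0
    accepted = ingest([{"EventId": r[0], "OccurredAtUtc": r[1]} for r in rows])
    with site:  # IDs that were not accepted simply stay Pending
        site.executemany(
            "UPDATE AuditLog SET ForwardState = 'Forwarded' WHERE EventId = ?",
            [(event_id,) for event_id in accepted],
        )
    return len(accepted)

def demo():
    site = sqlite3.connect(":memory:")
    site.execute(
        "CREATE TABLE AuditLog (EventId TEXT PRIMARY KEY, "
        "OccurredAtUtc TEXT, ForwardState TEXT)"
    )
    site.executemany("INSERT INTO AuditLog VALUES (?, ?, 'Pending')", [
        ("e1", "2026-05-20T10:00:00Z"),
        ("e2", "2026-05-20T10:00:01Z"),
        ("e3", "2026-05-20T10:00:02Z"),
    ])
    central = FakeCentral()
    sweeps = [forward_pending(site, central.ingest, batch_size=2) for _ in range(3)]
    pending = site.execute(
        "SELECT COUNT(*) FROM AuditLog WHERE ForwardState = 'Pending'"
    ).fetchone()[0]
    return sweeps, pending, sorted(central.rows)
```

With a batch size of 2 and three pending rows, the backlog drains in two sweeps and the third sweep is a no-op.
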
### 6.3 Reconciliation pull (self-healing for missed telemetry)

A central `SiteAuditReconciliationActor` periodically (default every 5 minutes per site) asks each site: *"What's your highest `OccurredAtUtc` with `ForwardState = 'Pending'`? And how many pending rows do you have?"* If central sees a non-empty pending backlog that hasn't drained on its own (e.g., the telemetry actor is wedged), it issues a `PullAuditEvents(sinceUtc, batchSize)` request that returns rows directly. Central inserts-if-not-exists and acks them; the site then flips them to `ForwardState = 'Reconciled'`.

This is the same self-healing pattern Site Call Audit uses for `SiteCalls`.

### 6.4 Central direct-write (central-originated events)

Events that originate at central never touch site SQLite:

- **Inbound API:** request completed at central; one `ApiInbound`/`Completed` row written via `ICentralAuditWriter` synchronously inside the request handler middleware before the HTTP response is flushed.
- **Notification Outbox dispatcher:** each delivery attempt writes a `Notification`/`Attempt` row; terminal status writes a `Notification`/`Terminal` row. (The site-originated `Notification`/`Enqueued` row arrives via §6.2.)

Central direct-writes use the same insert-if-not-exists semantics keyed on `EventId`, so a retried request handler can't produce duplicates.

### 6.5 Cached operations: site emits, central writes twice

For `ExternalSystem.CachedCall` and `Database.CachedWrite`, the **site** is the source of truth for every audit row. The site writes each lifecycle event (`CachedEnqueued`, `CachedAttempt`, `CachedTerminal`) to its local SQLite `AuditLog` on the hot path (or on the retry tick for `CachedAttempt`), then forwards via the same telemetry channel described in §6.2. The telemetry message format gains the audit-row fields additively: one packet per lifecycle transition carries both the operational state update AND the audit row content.

On receipt, central does two things in **one transaction**:

1. Insert-if-not-exists the immutable `AuditLog` row, keyed on `EventId`.
2. Upsert the operational `SiteCalls` row (existing Site Call Audit behavior: status, retry count, last error, timestamps).

This collapses what would otherwise be two telemetry concerns into one, keeps site SQLite as the single local source of truth for audit content, and preserves the existing operational `SiteCalls` shape for the dispatcher / UI. No central-side derivation; no double-emission from the site.

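
A sketch of the two-writes-in-one-transaction step, using Python and SQLite as a stand-in for central MS SQL. The packet shape (including an `AttemptCount` field), the reduced schemas, and `ingest_cached_event` are assumptions for illustration; SQLite's `ON CONFLICT` upsert syntax is used here and would be expressed differently in T-SQL.

```python
import sqlite3

# Hypothetical, reduced schemas for the two central tables.
SCHEMA = """
CREATE TABLE AuditLog (
    EventId TEXT PRIMARY KEY, CorrelationId TEXT, Kind TEXT, Status TEXT);
CREATE TABLE SiteCalls (
    TrackedOperationId TEXT PRIMARY KEY, Status TEXT, AttemptCount INTEGER);
"""

def ingest_cached_event(conn, packet):
    """Apply one telemetry packet: immutable audit row + operational upsert."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "INSERT INTO AuditLog (EventId, CorrelationId, Kind, Status) "
            "VALUES (:EventId, :TrackedOperationId, :Kind, :Status) "
            "ON CONFLICT(EventId) DO NOTHING",
            packet,
        )
        conn.execute(
            "INSERT INTO SiteCalls (TrackedOperationId, Status, AttemptCount) "
            "VALUES (:TrackedOperationId, :Status, :AttemptCount) "
            "ON CONFLICT(TrackedOperationId) DO UPDATE SET "
            "Status = excluded.Status, AttemptCount = excluded.AttemptCount",
            packet,
        )

def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    enqueued = {"EventId": "e1", "TrackedOperationId": "op1",
                "Kind": "CachedEnqueued", "Status": "Enqueued", "AttemptCount": 0}
    attempt = {"EventId": "e2", "TrackedOperationId": "op1",
               "Kind": "CachedAttempt", "Status": "TransientFailure", "AttemptCount": 1}
    for packet in (enqueued, attempt, attempt):  # last one is a redelivery
        ingest_cached_event(conn, packet)
    audit_rows = conn.execute("SELECT COUNT(*) FROM AuditLog").fetchone()[0]
    site_call = conn.execute("SELECT Status, AttemptCount FROM SiteCalls").fetchone()
    return audit_rows, site_call
```

Because the packet carries absolute state rather than increments, redelivery leaves both tables unchanged. A production upsert would likely also need a guard against out-of-order packets, which this sketch leaves out.
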

---

## 7. Per-channel event mapping

Worked examples: what each `Channel`/`Kind` row actually looks like. (Other columns omitted for brevity unless interesting.)

### 7.1 `ApiOutbound`: outbound HTTP via External System Gateway

**Sync call** (`ExternalSystem.Call("Weather", "GetForecast", { city: "Dublin" })` succeeds):

```
EventId          = <new guid>
Channel          = ApiOutbound
Kind             = SyncCall
CorrelationId    = NULL                  -- one-shot, no operation to correlate
SourceSiteId     = "site-01"
SourceInstanceId = "Plant1.Boiler"
SourceScript     = "OnHourly"
Target           = "Weather/GetForecast"
Status           = Success
HttpStatus       = 200
DurationMs       = 142
RequestSummary   = '{"city":"Dublin"}'   -- truncated to cap
ResponseSummary  = '{"tempC":11.4,...}'  -- truncated to cap
```

**Cached call** (`ExternalSystem.CachedCall(...)`, hits a 500, retries, succeeds on attempt 3):

```
1. Kind=CachedEnqueued   Status=Enqueued           CorrelationId=<tracked-op-id>
2. Kind=CachedAttempt    Status=TransientFailure   HttpStatus=500  CorrelationId=<same>
3. Kind=CachedAttempt    Status=TransientFailure   HttpStatus=500  CorrelationId=<same>
4. Kind=CachedAttempt    Status=Success            HttpStatus=200  CorrelationId=<same>
5. Kind=CachedTerminal   Status=Delivered          CorrelationId=<same>
```

This is the shadow of the `SiteCalls` row's lifecycle, but immutable and time-ordered.

### 7.2 `DbOutbound`: outbound DB via Database layer

**Sync write** (`db.Execute("INSERT INTO Readings ...", new {...})`):

```
Channel        = DbOutbound
Kind           = SyncWrite
Target         = "PlantDB"               -- connection name only, not server
CorrelationId  = NULL
Status         = Success
DurationMs     = 9
RequestSummary = "INSERT INTO Readings(ts,val) VALUES (@p0,@p1)"   -- SQL text
Extra          = '{"rowsAffected":1,"params":{"p0":"2026-05-20T14:00Z","p1":42.7}}'  -- values captured by default
```

**Sync read** (`db.Query<...>(...)`):

```
Channel         = DbOutbound
Kind            = SyncRead
Status          = Success
DurationMs      = 31
RequestSummary  = "SELECT id, value FROM Readings WHERE ts > @p0"
Extra           = '{"rowsReturned":42}'
ResponseSummary = NULL                   -- rows not captured by default; opt-in per connection
```

**Cached write:** same five-row lifecycle as the cached API example.

### 7.3 `Notification`: outbound notifications

```
1. Kind=Enqueued   Status=Enqueued           CorrelationId=<NotificationId>  SourceSiteId="site-01"  SourceInstanceId="Plant1.Boiler"
2. Kind=Attempt    Status=TransientFailure   ErrorMessage="SMTP 451 ..."  CorrelationId=<same>  SourceSiteId=NULL (dispatch is central)
3. Kind=Attempt    Status=Success            CorrelationId=<same>
4. Kind=Terminal   Status=Delivered          CorrelationId=<same>
Target         = "OpsTeamEmail"              -- notification list name
Extra          = '{"resolvedTargets":["a@x.com","b@x.com"], "subject":"Boiler high temp"}'
RequestSummary = '...body, truncated...'
```

Note the site→central handoff is implicit: row 1 arrives via §6.2 telemetry (it originated at the site script); rows 2–4 are written by the central dispatcher directly via §6.4.

### 7.4 `ApiInbound`: inbound API

One row per request, written at request completion:

```
Channel         = ApiInbound
Kind            = Completed
CorrelationId   = <request-id>            -- the request's correlation header (or generated)
SourceSiteId    = NULL                    -- central-originated event
Actor           = "AcmeSCADA"             -- API key name (NOT the key itself)
Target          = "RecordReading"         -- inbound method name
Status          = Success | PermanentFailure   -- mapped from final HTTP outcome
HttpStatus      = 200 | 400 | 401 | 500
DurationMs      = 73
RequestSummary  = '{"siteId":"...","value":12.4}'  -- truncated; secrets/PII per redaction policy
ResponseSummary = '{"ok":true}'           -- cap raised to 64 KB on failures (§8.1)
Extra           = '{"remoteIp":"203.0.113.42","userAgent":"...","scriptInvoked":"RecordReading.Handle"}'
```

A bad API key produces a row with `Status=PermanentFailure`, `HttpStatus=401`, `Actor=NULL`; `Extra` carries `remoteIp` for abuse triage.

---

## 8. Payload capture policy

### 8.1 Truncation
- Default cap: **8 KB** for each of `RequestSummary` and `ResponseSummary`. Configurable globally; per-target overrides allowed (§8.4).
- On any non-`Success` row, the cap is raised to **64 KB** for that row, because error context is precious.
- When a body is truncated, `PayloadTruncated = 1` and the captured prefix is preserved verbatim (UTF-8 byte-safe truncation, no mid-character cuts).
- Bodies exceeding the larger cap are still truncated; full bodies are never stored.

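
The byte-safe truncation rule can be illustrated with a short Python sketch. `truncate_utf8` and `summarize` are hypothetical helpers, and reading "non-`Success`" literally as `status != "Success"` is an assumption about how the cap is selected.

```python
def truncate_utf8(text, cap_bytes):
    """Return (prefix, truncated) where prefix is at most cap_bytes of UTF-8."""
    data = text.encode("utf-8")
    if len(data) <= cap_bytes:
        return text, False
    # Cutting at a byte offset may split a multi-byte character; decoding
    # with errors="ignore" drops that trailing partial sequence.
    return data[:cap_bytes].decode("utf-8", errors="ignore"), True

def summarize(body, status, default_cap=8192, error_cap=65536):
    """Pick the cap by row status, then truncate. Returns (summary, truncated)."""
    cap = default_cap if status == "Success" else error_cap
    return truncate_utf8(body, cap)
```

For example, `truncate_utf8("héllo", 2)` yields `("h", True)`: the two-byte `é` would straddle the cap, so it is dropped rather than cut in half.
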
### 8.2 Redaction
Redaction happens **at the write site**, before the row touches SQLite (or central MS SQL for §6.4 events). Unredacted secrets never persist.

- **HTTP headers:** `Authorization`, `Cookie`, `Set-Cookie`, `X-API-Key`, and any header matching the configured redact-list (regex) become `<redacted>`. List is operator-extensible.
- **HTTP bodies:** captured verbatim by default. Operators can register per-External-System / per-Inbound-Method body redactors (regex → replacement) for known secret fields (e.g., `"password"\s*:\s*"[^"]+"`).
- **SQL:** statement text and parameter values captured verbatim by default. Per-connection redaction opt-in (e.g., redact parameters whose name matches `@apikey|@token|@password`).
- **Notification bodies:** captured per the existing notification rules (no behavioral change from today).
- **Safety net:** if a configured redactor throws, the affected payload becomes `"<redacted: redactor error>"` and an `AuditRedactionFailure` health metric increments. We over-redact, never under-redact, on configuration faults.

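
A minimal Python sketch of header redaction, regex body redactors, and the over-redacting safety net. The function names are hypothetical, and matching header names case-insensitively against the whole name is an assumption (HTTP header names are case-insensitive, but the document does not say how the redact-list is matched).

```python
import re

REDACTED = "<redacted>"
HEADER_REDACT_LIST = ["Authorization", "Cookie", "Set-Cookie", "X-API-Key"]

def redact_headers(headers, patterns=HEADER_REDACT_LIST):
    """Replace the value of any header whose name matches a redact pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return {
        name: REDACTED if any(rx.fullmatch(name) for rx in compiled) else value
        for name, value in headers.items()
    }

def regex_redactor(pattern, replacement):
    """Build a body redactor from a (regex, replacement) pair."""
    rx = re.compile(pattern)
    return lambda body: rx.sub(replacement, body)

def redact_body(body, redactors):
    """Apply redactors in order; on any redactor fault, drop the whole payload."""
    try:
        for redact in redactors:
            body = redact(body)
        return body
    except Exception:
        # Safety net: over-redact rather than risk persisting a secret.
        return "<redacted: redactor error>"
```

Usage: `redact_body(raw, [regex_redactor(r'"password"\s*:\s*"[^"]+"', '"password":"<redacted>"')])` rewrites the password field and leaves the rest of the body verbatim.
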
### 8.3 Never captured
- Raw API key material (only the key *name* via `Actor`).
- LDAP bind credentials, cluster secrets, Configuration DB connection strings.
- Framework traffic per §2 (out of scope by construction, not by redaction).

### 8.4 Configurability
Bound from `appsettings.json` (new `AuditLog` options class owned by the Audit Log component):

```jsonc
"AuditLog": {
  "DefaultCapBytes": 8192,
  "ErrorCapBytes": 65536,
  "HeaderRedactList": [ "Authorization", "Cookie", "Set-Cookie", "X-API-Key" ],
  "GlobalBodyRedactors": [
    { "Pattern": "\"password\"\\s*:\\s*\"[^\"]+\"", "Replacement": "\"password\":\"<redacted>\"" }
  ],
  "PerTargetOverrides": {
    "Weather/GetForecast": { "CapBytes": 4096 },
    "PlantDB": { "RedactSqlParamsMatching": "@apikey|@token" }
  }
}
```

Per-target keys bind by External System / Inbound Method / Notification List / Database Connection name.

---

## 9. Failure handling & idempotency

### 9.1 `EventId` is the dedup key
- Generated at the originator (site for §6.1/§6.5, central for §6.4). Guid v4.
- Central ingest is `INSERT … WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = @id)`, executed under the PK constraint.
- Idempotent across telemetry retries, reconciliation pulls, and any combination thereof.

### 9.2 Central MS SQL outage
- Site telemetry calls fail; `ForwardState` stays `Pending`; backlog grows.
- Reconciliation pulls also fail.
- Site SQLite continues to absorb hot-path writes (no upstream dependency on the hot path).
- `SiteAuditBacklog` health metric crosses thresholds → Health dashboard surfaces it on the affected site tile.
- On recovery, telemetry drains; insert-if-not-exists handles any overlap.

### 9.3 Site SQLite write failure
- Audit writer fails to append (disk full, schema lock, transient IO error).
- **The action proceeds:** we do not fail script-initiated work because the audit write failed.
- `SiteAuditWriteFailures` health metric increments; critical-severity Site Event Log entry.
- A small in-memory ring (default 1024 rows) buffers events while the local writer is unhealthy; on ring overflow, oldest events are dropped with a Site Event Log warning per drop.

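
The bounded ring in §9.3 can be sketched as below. `AuditRing` is a hypothetical Python class; the `dropped` counter stands in for the per-drop Site Event Log warning, and the `drain` step (flushing oldest-first once the writer recovers) is an assumption, since the text only describes buffering and overflow.

```python
from collections import deque

class AuditRing:
    """Bounded in-memory buffer used while the local audit writer is unhealthy."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.events = deque()
        self.dropped = 0  # stand-in for one Site Event Log warning per drop

    def push(self, event):
        if len(self.events) >= self.capacity:
            self.events.popleft()  # overflow: the oldest event is discarded
            self.dropped += 1
        self.events.append(event)

    def drain(self, write):
        """Flush oldest-first via write(event) -> bool; stop at first failure."""
        flushed = 0
        while self.events:
            if not write(self.events[0]):
                break  # writer still unhealthy: keep the rest buffered
            self.events.popleft()
            flushed += 1
        return flushed
```

An explicit `popleft` is used instead of `deque(maxlen=...)` so that each discarded event can be counted and reported.
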
### 9.4 Telemetry actor wedged
- Reconciliation pull (§6.3) is the fallback. If two consecutive reconciliation cycles report a non-draining backlog, the supervisor restarts the telemetry actor and a `SiteAuditTelemetryStalled` event fires.

### 9.5 Central direct-write failure
- Inbound API: middleware audit failure is logged + metricked but never affects the HTTP response.
- Notification Outbox dispatcher: audit failure logs critical and increments `CentralAuditWriteFailures`; the operational `Notifications` row update proceeds.

### 9.6 Dedup horizon: there isn't one
`EventId` PK enforces uniqueness as long as a row exists in the table. Purge (§12) removes rows by `OccurredAtUtc`, not `EventId`; a stale telemetry retry arriving after the original was purged will insert a "new" row. This is acceptable: a retry that arrives more than a year late is vanishingly rare and an extra row is harmless.

---

## 10. UI & query surface

### 10.1 Audit Log page (new, top-level)
Lives under a new **Audit** nav group in Central UI (sibling to **Notifications**). Standard Blazor Server + Bootstrap, custom components per the project UI rules.

**Filter bar (top of page, collapses to one row when not focused):**
- Time range (relative: 15m / 1h / 24h / 7d / custom).
- Channel (multi-select: `ApiOutbound`, `DbOutbound`, `Notification`, `ApiInbound`).
- Kind (filtered by selected channels).
- Status (multi-select).
- Site (multi-select, scoped to user's authorized sites).
- Instance / Script (text search with autocomplete).
- Target (text search: system+method, DB connection, list name).
- Actor (text search: inbound API key name).
- CorrelationId (paste a `TrackedOperationId` / `NotificationId` / request-id to see its full event sequence).
- "Errors only" toggle (`Status NOT IN (Success, Delivered, Enqueued)`).

**Results grid:**
- Columns (resizable, reorderable, persisted per user): `OccurredAtUtc`, `Site`, `Channel`, `Kind`, `Status`, `Target`, `Actor`, `DurationMs`, `HttpStatus`, `ErrorMessage`.
- Keyset pagination on `(OccurredAtUtc desc, EventId desc)`. Default page 100.
- Click row → drilldown drawer.

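
Keyset pagination on `(OccurredAtUtc desc, EventId desc)` can be sketched as follows, with Python and SQLite standing in for the central store. `fetch_page` and the two-column table are assumptions for the example. The "strictly before the last row" predicate is written in its expanded `OR` form, which does not rely on row-value comparison support in the database.

```python
import sqlite3

COLUMNS = "OccurredAtUtc, EventId"
ORDER = "ORDER BY OccurredAtUtc DESC, EventId DESC LIMIT ?"

def fetch_page(conn, page_size=100, after=None):
    """after is the (OccurredAtUtc, EventId) of the last row already shown."""
    if after is None:
        sql = f"SELECT {COLUMNS} FROM AuditLog {ORDER}"
        params = (page_size,)
    else:
        # Rows strictly "after" the cursor in descending (time, id) order.
        sql = (f"SELECT {COLUMNS} FROM AuditLog "
               "WHERE OccurredAtUtc < ? OR (OccurredAtUtc = ? AND EventId < ?) "
               + ORDER)
        params = (after[0], after[0], after[1], page_size)
    return conn.execute(sql, params).fetchall()

def fetch_all_pages(conn, page_size):
    pages, cursor = [], None
    while True:
        page = fetch_page(conn, page_size, cursor)
        if not page:
            return pages
        pages.append([event_id for _, event_id in page])
        cursor = page[-1]  # last row of this page becomes the next cursor

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE AuditLog (OccurredAtUtc TEXT, EventId TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO AuditLog VALUES (?, ?)", [
        ("2026-05-20T10:00:00Z", "a"), ("2026-05-20T10:00:00Z", "b"),
        ("2026-05-20T10:00:01Z", "c"), ("2026-05-20T10:00:01Z", "d"),
        ("2026-05-20T10:00:02Z", "e"),
    ])
    return fetch_all_pages(conn, page_size=2)
```

Including `EventId` as the tiebreaker is what keeps rows with identical timestamps from being skipped or repeated across page boundaries.
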
**Drilldown drawer:**
- Pretty-prints `RequestSummary` / `ResponseSummary` (JSON auto-detected; SQL syntax-highlighted).
- Redaction indicators where headers/fields were stripped.
- "Copy as cURL" for `ApiOutbound` / `ApiInbound` rows.
- "Show all events for this operation" link → filters by `CorrelationId`.

### 10.2 Drill-in links from existing pages
- **Notifications** row → "View audit history" → Audit Log filtered to `CorrelationId = NotificationId`.
- **Site Calls** row → "View audit history" → Audit Log filtered to `CorrelationId = TrackedOperationId`.
- **External Systems** detail → "Recent activity" → Audit Log filtered to `Target starts-with <system>`.
- **Inbound API keys** detail → "Recent calls" → Audit Log filtered to `Actor = <key name>` AND `Channel = ApiInbound`.
- **Sites** detail → new "Audit feed" tab.
- **Instances** detail → new "Audit feed" tab.

### 10.3 Health dashboard tiles
Three new tiles in an "Audit" KPI group:
- **Audit volume:** events/min global + per-site sparkline.
- **Audit error rate:** % non-`Success` rows, rolling 5 min.
- **Audit backlog:** sum of `Pending` site rows; click → per-site breakdown.

### 10.4 Export
Audit Log page **Export** button streams CSV (current filter) server-side. Default cap 100k rows; larger exports use the CLI (§15).

---

## 11. Security & tamper-evidence

### 11.1 Append-only enforcement
- Application accesses `AuditLog` via a dedicated DB role `scadalink_audit_writer` granted `INSERT` + `SELECT` only (no `UPDATE`, no `DELETE`).
- Purge runs under a separate role `scadalink_audit_purger` whose permissions are limited to the partition-switch operation (§12.2). Row-level `DELETE` is not granted even to purge.
- A CI guard greps the data layer for any `UPDATE … AuditLog` or `DELETE … AuditLog` text and fails the build.

### 11.2 Authorization
- Reading the Audit Log requires the existing **Audit** role (today used for the IAuditService config-change log) extended with a new **OperationalAudit** permission.
- Per-site row scoping reuses the existing site-permission model from Security & Auth: a user sees only rows for sites they are authorized to operate.
- Bulk export (UI button + CLI) requires an additional **AuditExport** permission.

### 11.3 Payload redaction at write
See §8.2. Contract: unredacted secrets never persist. Safety net over-redacts on misconfiguration.

### 11.4 Tamper-evidence hash chain (deferred, v1.x)
- Each row gains a `RowHash` column.
- `RowHash = SHA-256(prev.RowHash || canonical(row))` per partition.
- Computed by a chaining job that runs after each monthly partition closes.
- Verifiable offline via `scadalink audit verify-chain --month YYYY-MM`.
- Default **off** in v1 to avoid operational burden. Flag for v1.x.

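
The chaining rule `RowHash = SHA-256(prev.RowHash || canonical(row))` can be sketched in Python as follows. Both the canonical form (sorted-key compact JSON) and the all-zero genesis value for the first row of a partition are assumptions; the design does not specify either.

```python
import hashlib
import json

GENESIS = b"\x00" * 32  # assumed seed for the first row of a partition

def canonical(row):
    """Assumed canonical form: compact JSON with sorted keys, UTF-8 encoded."""
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")

def chain(rows, prev=GENESIS):
    """Return the hex RowHash for each row, each one covering its predecessor."""
    hashes = []
    for row in rows:
        prev = hashlib.sha256(prev + canonical(row)).digest()
        hashes.append(prev.hex())
    return hashes

def verify(rows, hashes):
    """Recompute the chain and compare against the stored hashes."""
    return chain(rows) == hashes
```

Because each hash covers the previous one, editing or removing any row changes every later `RowHash` in the partition, which is what makes an offline verification pass meaningful.
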
### 11.5 Site SQLite security
- File permissions: read/write by the ScadaLink service account only.
- Not backed up off-machine: site SQLite is a buffer with a forwarding guarantee, not a record. Central is the durable home.

---

## 12. Retention & purge mechanics

### 12.1 Central retention defaults

- **365 days** based on `OccurredAtUtc`. Configurable via `AuditLog:RetentionDays` (min 7, max 3650, validated at startup).
- **Single global retention in v1** — no per-channel/Kind overrides. Deferred to v1.x once production cost data shows whether overrides are needed.

### 12.2 Partition strategy

- Monthly partitions on `OccurredAtUtc`. Partition function `pf_AuditLog_Month`, scheme `ps_AuditLog_Month`, created in the EF Core migration.
- Purge by partition switch: move the eligible partition to a staging table, then drop. No row-by-row delete; no log bloat.
- Partition-maintenance job rolls forward each month (creates the next month's partition ahead of time).

### 12.3 Purge job

- Singleton actor `AuditLogPurgeActor` on the active central node, runs daily.
- Switches out any partition whose latest `OccurredAtUtc` is older than the global retention window. Pure partition-switch; no row-level deletes.
- Emits an `AuditLog:Purged` event (partition range, rowcount, duration).

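
The eligibility rule for the daily purge can be sketched as follows. This is an illustration under stated assumptions, not the actor's implementation: it uses the partition's exclusive upper bound as a conservative stand-in for "latest `OccurredAtUtc`", and the names `validate_retention_days`, `next_month`, and `eligible_partitions` are hypothetical. The 7 to 3650 day bounds come from §12.1.

```python
from datetime import datetime, timedelta, timezone

MIN_RETENTION_DAYS, MAX_RETENTION_DAYS = 7, 3650  # bounds from §12.1


def validate_retention_days(days: int) -> int:
    """Mirror the startup validation of AuditLog:RetentionDays."""
    if not MIN_RETENTION_DAYS <= days <= MAX_RETENTION_DAYS:
        raise ValueError(f"RetentionDays must be within 7..3650, got {days}")
    return days


def next_month(start: datetime) -> datetime:
    """Exclusive upper bound of the monthly partition that begins at `start`."""
    if start.month == 12:
        return start.replace(year=start.year + 1, month=1)
    return start.replace(month=start.month + 1)


def eligible_partitions(partition_starts, now, retention_days):
    """Partitions whose entire month lies outside the retention window."""
    cutoff = now - timedelta(days=validate_retention_days(retention_days))
    return [s for s in partition_starts if next_month(s) <= cutoff]


if __name__ == "__main__":
    utc = timezone.utc
    starts = [datetime(2025, m, 1, tzinfo=utc) for m in (3, 4, 5)]
    now = datetime(2026, 5, 20, tzinfo=utc)
    for start in eligible_partitions(starts, now, 365):
        print(start.date())
```

A whole month is switched out only once its newest possible row has aged past the window, so with monthly partitions a row can live up to roughly a month beyond `RetentionDays` before its partition is purged.
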
### 12.4 Site SQLite purge

- Daily site job, hard invariant per §4: purge only `OccurredAtUtc < threshold AND ForwardState IN ('Forwarded','Reconciled')`.
- Default site retention **7 days** (configurable, min 1, max 90).
- Backlog metric (§9.2) provides visibility into "central outage → site bloat" before disk-full.

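
The site purge invariant can be exercised against an in-memory SQLite database. This is a sketch only: the table name `SiteAuditBuffer`, its three columns, and the ISO-8601 text timestamps are assumptions made for illustration, while the `WHERE` clause is the predicate stated above.

```python
import sqlite3


def purge_site_rows(conn: sqlite3.Connection, threshold_utc: str) -> int:
    """Delete only rows that are both past the threshold and already forwarded."""
    cur = conn.execute(
        "DELETE FROM SiteAuditBuffer "
        "WHERE OccurredAtUtc < ? AND ForwardState IN ('Forwarded', 'Reconciled')",
        (threshold_utc,),
    )
    conn.commit()
    return cur.rowcount


def make_demo_db() -> sqlite3.Connection:
    """Build a hypothetical site buffer with three rows for demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE SiteAuditBuffer ("
        "Id INTEGER PRIMARY KEY, OccurredAtUtc TEXT NOT NULL, ForwardState TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO SiteAuditBuffer (OccurredAtUtc, ForwardState) VALUES (?, ?)",
        [
            ("2026-05-01T00:00:00Z", "Forwarded"),  # old and forwarded: purged
            ("2026-05-01T00:00:00Z", "Pending"),    # old but not forwarded: kept
            ("2026-05-19T00:00:00Z", "Forwarded"),  # forwarded but recent: kept
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    db = make_demo_db()
    print(purge_site_rows(db, "2026-05-13T00:00:00Z"))
```

The `Pending` row survives regardless of age, which is why a prolonged central outage shows up as site backlog growth rather than as lost audit rows.
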
---

## 13. Performance & sizing

Rough back-of-envelope; load testing will confirm.

### 13.1 Per-site event rate (assumed nominal site)

| Channel/Kind | Typical events/min | Peak events/min |
|---|---:|---:|
| `ApiOutbound.SyncCall` | 10 | 100 |
| `ApiOutbound.Cached*` (~4 rows/op) | 4 | 20 |
| `DbOutbound.SyncWrite` | 30 | 300 |
| `DbOutbound.SyncRead` | 60 | 600 |
| `DbOutbound.Cached*` (~4 rows/op) | 4 | 20 |
| `Notification.Enqueued` (site-emit) | 1 | 10 |
| **Per-site total** | **~110** | **~1,050** |

### 13.2 Central total (50-site deployment)

- Typical: ~5,500 events/min = ~**92 events/sec**.
- Peak: ~52,500 events/min = ~**875 events/sec**.
- Plus central-originated (Notification dispatch + Inbound API): assume ~30 events/sec typical.

MS SQL is expected to handle this rate with batched ingest and the time-aligned indexes.

### 13.3 Row size

- Fixed columns: ~250 bytes.
- Average captured payload: ~1 KB.
- Per row: **~1.3 KB**.

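### 13.4 Yearly central footprint

- Typical: 100 events/sec × 86,400 s/day × 365 days × 1.3 KB ≈ **4 TB** at the default cap. The 100 events/sec figure is a round number; the §13.2 typical rates (~92 site + ~30 central-originated) sum to ~122 events/sec, which gives ≈ 5 TB.
- Reducing the cap (8 KB → 2 KB) or adding per-channel retention (deferred to v1.x, §12.1) would shrink this several-fold.
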
### 13.5 Site SQLite footprint

- 110 events/min × 60 × 24 × 7 × 1.3 KB ≈ **~1.4 GB / site** at the 7-day window (~1.1 million rows). Still modest for a site disk.

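
The sizing arithmetic in §13.1 through §13.5 can be reproduced with a few lines. This is a sketch for checking the figures, assuming decimal units (1 KB = 1,000 bytes) and the ~1.3 KB average row from §13.3; the function and variable names are made up for illustration.

```python
KB = 1_000  # decimal kilobyte, assumed for back-of-envelope sizing

# Per-site event rates from the §13.1 table, in events/min.
TYPICAL_PER_MIN = [10, 4, 30, 60, 4, 1]
PEAK_PER_MIN = [100, 20, 300, 600, 20, 10]


def yearly_central_bytes(events_per_sec: float, row_kb: float = 1.3, days: int = 365) -> float:
    """Bytes accumulated at central over the retention window."""
    return events_per_sec * 86_400 * days * row_kb * KB


def site_buffer_bytes(events_per_min: float, row_kb: float = 1.3, days: int = 7) -> float:
    """Bytes held in one site's SQLite buffer over the site retention window."""
    return events_per_min * 60 * 24 * days * row_kb * KB


if __name__ == "__main__":
    print(sum(TYPICAL_PER_MIN), sum(PEAK_PER_MIN))           # 109 1050
    print(round(yearly_central_bytes(100) / 1e12, 1), "TB")  # 4.1 TB
    print(round(yearly_central_bytes(122) / 1e12, 1), "TB")  # 5.0 TB
    print(round(site_buffer_bytes(110) / 1e9, 2), "GB")      # 1.44 GB
```

Changing `row_kb` or `days` shows how strongly the payload cap and the retention window drive the totals.
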
### 13.6 Levers

- Reduce `DefaultCapBytes` per §8.1.
- Tighten per-channel retention (especially `DbOutbound.SyncRead`) once the overrides deferred in §12.1 land in v1.x; v1 has only the single global `RetentionDays`.
- Deferred to v1.x: Parquet archival to object storage before purge (§15.2).

---

## 14. KPI surface & relationship to existing KPIs

### 14.1 New Audit Log KPIs

- **Volume** — events/min, global + per-site.
- **Error rate** — % non-`Success` rows, rolling 5 min.
- **Backlog** — sum of `Pending` site rows.
- **Top inbound callers** — top-10 `Actor` by request count, last 1h.
- **Top outbound 5xx** — top-10 `Target` by 5xx-status count, last 1h.

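
As an illustration of the rolling error-rate KPI, the sketch below keeps a 5-minute sliding window in memory. It is one possible shape, not the planned implementation (which may well be a SQL aggregate over `AuditLog`); the class name `RollingErrorRate`, the use of plain seconds for timestamps, and the `"Failed"` status value in the usage example are assumptions. Only `Success` is taken from the design.

```python
from collections import deque


class RollingErrorRate:
    """Percentage of non-Success events seen in the last `window_seconds`."""

    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        self.events = deque()  # (timestamp_seconds, is_error) in arrival order
        self.errors = 0

    def record(self, ts: float, status: str) -> None:
        is_error = status != "Success"
        self.events.append((ts, is_error))
        self.errors += int(is_error)
        self._evict(ts)

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, was_error = self.events.popleft()
            self.errors -= int(was_error)

    def percent(self, now: float) -> float:
        self._evict(now)
        if not self.events:
            return 0.0
        return 100.0 * self.errors / len(self.events)


if __name__ == "__main__":
    kpi = RollingErrorRate()
    for ts, status in [(0, "Success"), (100, "Failed"), (200, "Success"), (290, "Failed")]:
        kpi.record(ts, status)
    print(kpi.percent(299))  # 50.0
```

Timestamps are assumed to arrive in non-decreasing order; out-of-order events would need a different structure.
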
### 14.2 Relationship to existing KPIs

- **Notification Outbox KPIs** (queue depth, parked, delivered-last-interval, etc.) — unchanged, sourced from `Notifications`. Audit Log KPIs describe the audit table itself, not the notification subsystem.
- **Site Call Audit KPIs** — unchanged, sourced from `SiteCalls`.
- Audit Log KPIs occupy their own group on the Health dashboard. Nothing is collapsed or superseded.

---

## 15. CLI & external access

### 15.1 CLI commands

New `scadalink audit` command group:

- `scadalink audit query --site <s> --since <t> --kind <k> [...]` — same filter set as the UI.
- `scadalink audit export --since <t> --until <t> --format csv|jsonl|parquet --output <path>` — bulk export, server-side streaming.
- `scadalink audit verify-chain --month <YYYY-MM>` — hash-chain verification (when §11.4 is enabled).

Requires the same **OperationalAudit** / **AuditExport** permissions as the UI.

### 15.2 Object-storage archival (deferred, v1.x)

A monthly job dumps the closing partition to Parquet on operator-configured object storage before central purge — enabling indefinite cold retention without bloating MS SQL. Flag for v1.x; not in initial scope.

---

## 16. Locked decisions

| # | Question | Decision |
|---|---|---|
| 1 | Component number | **#23 Audit Log** (README matrix + HighLevelReqs). |
| 2 | Nav placement | New top-level **Audit** nav group in Central UI. |
| 3 | Hash-chain tamper evidence (§11.4) | Deferred to v1.x. v1 enforces append-only via DB grants only. |
| 4 | Parquet archival to object storage (§15.2) | Deferred to v1.x. |
| 5 | Per-channel retention overrides (§12.1) | Deferred to v1.x. v1 uses a single global `RetentionDays`. |
| 6 | Default payload cap | **8 KB** for `RequestSummary` / `ResponseSummary`; **64 KB** on non-`Success` rows. |

All earlier design decisions (purpose, topology, scope, payload depth, lifecycle granularity, retention default, site→central path, UI shape, cached-call audit emission, SQL parameter capture, never-fail-on-audit-failure) are also locked. See §1–§15.