
Joseph Doherty fec0bb10ff docs(audit): add centralized audit log design (alog.md)

Validated design for a new append-only AuditLog covering the script
trust boundary: outbound API calls (sync + cached), outbound DB
operations (sync + cached, incl. script-initiated reads), notifications,
and inbound API requests. Layered alongside existing Notifications (#21)
and SiteCalls (#22) operational tables.

Key decisions:
- One row per lifecycle event; strictly append-only.
- Site SQLite hot-path append + best-effort gRPC telemetry + central
  reconciliation pull. Site purge requires ForwardState=Forwarded.
- Cached calls: site emits; one telemetry packet feeds both the
  immutable AuditLog row and the operational SiteCalls upsert.
- Payload: metadata + truncated bodies (8 KB default, 64 KB on errors).
  Headers redacted; SQL parameter values captured by default.
- Audit-write failures never abort the user-facing action.
- Monthly partitioning at central; 365-day global retention.
- New Audit nav group + drill-in links from existing pages.

Deferred to v1.x: hash-chain tamper evidence, Parquet archival,
per-channel retention overrides. Provisional component #23.

2026-05-20 07:21:44 -04:00


Centralized Audit Log — Design (Working Draft)

Status: Validated (ready for implementation planning)
Owner: Joseph Doherty
Date: 2026-05-20
Provisional component number: #23 Audit Log

A new central, append-only audit log capturing every action a script causes to cross the cluster trust boundary: outbound API calls (sync + cached), outbound DB operations (sync + cached, including script-initiated reads), notifications sent, and inbound API requests that invoke a script.

1. Purpose

Provide a single forensic + operational record of every integration action initiated by, or terminating in, a script — answering both:

Compliance / forensic: "Did instance X send notification Y on date Z? What was the body? Did external system A get called by script B last quarter, and with what result?"
Operational visibility: "Why is site S misbehaving right now? What did its scripts touch in the last 10 minutes? Which inbound API caller is hammering us?"

One store, rich payloads, long retention, dashboards + drilldowns + filter queries.

The audit log is not the operational state store. It does not drive dispatchers, retry loops, or Retry/Discard actions — those remain in Notification Outbox and Site Call Audit. The audit log is the immutable history that observes those subsystems and adds coverage where they are silent.

2. Scope — the script trust boundary

The audit log captures every action that a script causes to cross the cluster trust boundary:

Channel	Trigger	Direction	Covered today?
`ExternalSystem.Call(...)`	Script	Outbound	❌ (gap)
`ExternalSystem.CachedCall(...)`	Script	Outbound	✅ `SiteCalls` (Site Call Audit)
`Database.Connection().Execute*(...)` — writes	Script	Outbound	❌ (gap)
`Database.CachedWrite(...)`	Script	Outbound	✅ `SiteCalls` (Site Call Audit)
`Notify.To(list).Send(...)`	Script	Outbound	✅ `Notifications` (Notification Outbox)
`POST /api/{method}` (Inbound API)	External	Inbound (invokes a script)	❌ (gap)

Out of scope — framework traffic is not audited:

Health checks, heartbeats, cluster membership messages.
gRPC inter-cluster real-time streams (attribute values, alarm states).
Data Connection Layer ↔ OPC UA / custom protocol traffic.
LDAP authentication probes, Traefik routing decisions.
Internal Configuration Database queries by the framework.
Site Event Log writes, audit log writes themselves.

This boundary is meaningful because the script trust model already controls what scripts can do; the audit log is the record of how that surface was exercised.

Note on DB reads. Script-initiated reads via Database.Connection().ExecuteReader(...) count as actions from a script and ARE in scope. They are expected to be far less common than reads via DCL/subscriptions (which are framework traffic and excluded).

3. Architecture — layered, append-only

              ┌──────────────────────────────────────────────────────────────────────┐
              │                       Central cluster (MS SQL)                       │
              │                                                                      │
              │   ┌──────────────────┐   ┌───────────────┐   ┌────────────────────┐  │
              │   │ Notification     │   │ Site Call     │   │ Inbound API        │  │
              │   │ Outbox (#21)     │   │ Audit (#22)   │   │ (#14)              │  │
              │   │ Notifications    │   │ SiteCalls     │   │ (no audit today)   │  │
              │   └────────┬─────────┘   └───────┬───────┘   └─────────┬──────────┘  │
              │            │ emits               │ emits               │ emits        │
              │            ▼                     ▼                     ▼              │
              │       ┌────────────────────────────────────────────────────────────┐  │
              │       │            AuditLog  (new, append-only, MS SQL)            │  │
              │       │  one row per lifecycle event across all channels           │  │
              │       └─────────────────────────▲──────────────────────────────────┘  │
              │                                 │ telemetry (gRPC, idempotent)        │
              └─────────────────────────────────┼─────────────────────────────────────┘
                                                │
                                                │
              ┌─────────────────────────────────┼─────────────────────────────────────┐
              │                 Site cluster (SQLite, per active node)                │
              │                                 │                                     │
              │       ┌─────────────────────────┴──────────────────────────────┐      │
              │       │   Site-local AuditLog  (SQLite, hot-path append)       │      │
              │       └────▲───────────────▲───────────────▲───────────────────┘      │
              │            │               │               │                          │
              │  ┌─────────┴────────┐  ┌───┴──────┐  ┌─────┴────────────┐             │
              │  │ External System  │  │ Database │  │ Site S&F /       │             │
              │  │ Gateway (#7)     │  │ Layer    │  │ Notifications    │             │
              │  │ sync + cached    │  │ sync +   │  │ (transitions)    │             │
              │  └──────────────────┘  │ cached   │  └──────────────────┘             │
              │                        └──────────┘                                   │
              └───────────────────────────────────────────────────────────────────────┘

Key properties:

Strictly append-only. Once written, an AuditLog row is never updated or deleted (except by retention purge). Operational state (live status, parked-for-retry, etc.) lives in Notifications / SiteCalls — not in AuditLog.
One row per lifecycle event. A cached call whose three delivery attempts all fail and which then parks produces five rows: enqueued, attempt #1, attempt #2, attempt #3, parked. A sync call produces one row. An inbound API hit produces one row.
Site-local first for site-originated events. Hot-path script calls never wait on the network for an audit write.
Direct write for central-originated events. Notification delivery attempts and inbound API hits land at central — they write directly to the central AuditLog. No detour through site SQLite.
At-least-once telemetry, idempotent on EventId. Same dedup model as Site Call Audit today.

4. Data Model (first cut)

Single wide table, polymorphic by Channel + Kind discriminators, JSON payload column.

Central: `AuditLog`

Column	Type	Notes
`EventId`	`uniqueidentifier` PK	Generated where the event originates (site or central). Idempotency key.
`OccurredAtUtc`	`datetime2`	When the event happened (call returned, retry attempted, etc.).
`IngestedAtUtc`	`datetime2`	When central persisted the row (lags `OccurredAtUtc` for site-originated rows).
`Channel`	`varchar(32)`	`ApiOutbound` \| `DbOutbound` \| `Notification` \| `ApiInbound`.
`Kind`	`varchar(32)`	Channel-specific event kind (see table below).
`CorrelationId`	`uniqueidentifier` NULL	Ties multi-event operations together. `TrackedOperationId` for cached calls, `NotificationId` for notifications, request-id for inbound API. NULL for sync one-shot calls.
`SourceSiteId`	`varchar(64)` NULL	NULL for central-originated events (inbound API, central notification dispatch).
`SourceInstanceId`	`varchar(128)` NULL	Instance whose script initiated the action (when applicable).
`SourceScript`	`varchar(128)` NULL	Script name within the instance.
`Actor`	`varchar(128)` NULL	Inbound API: API key name. Outbound: script identity. Central: system user.
`Target`	`varchar(256)` NULL	Outbound API: external system + method. DB: connection name. Notification: list name. Inbound API: method name.
`Status`	`varchar(32)`	Outcome of this event: `Success`, `TransientFailure`, `PermanentFailure`, `Enqueued`, `Retrying`, `Delivered`, `Parked`, `Discarded`.
`HttpStatus`	`int` NULL	HTTP-bearing events only.
`DurationMs`	`int` NULL	Call/attempt duration.
`ErrorMessage`	`nvarchar(1024)` NULL	Truncated; `ErrorDetail` for full text.
`ErrorDetail`	`nvarchar(max)` NULL	Optional full exception/text on failures.
`RequestSummary`	`nvarchar(max)` NULL	Truncated request payload (configurable cap, default 8 KB). Headers redacted.
`ResponseSummary`	`nvarchar(max)` NULL	Truncated response payload (same cap). Cap raised to 64 KB on non-`Success` rows (§8.1); full bodies are never stored.
`PayloadTruncated`	`bit`	Set if either summary was truncated.
`Extra`	`nvarchar(max)` NULL	Channel-specific JSON for fields we don't promote to columns.

Indexes (first cut):

IX_AuditLog_OccurredAtUtc — primary time-range index for global scans.
IX_AuditLog_Site_Occurred (SourceSiteId, OccurredAtUtc) — per-site filters.
IX_AuditLog_Correlation (CorrelationId) — drilldown from a single operation.
IX_AuditLog_Channel_Status_Occurred (Channel, Status, OccurredAtUtc) — KPI / dashboard tiles.
IX_AuditLog_Target_Occurred (Target, OccurredAtUtc) — "what did we send to system X."
Partitioning by month on OccurredAtUtc from day one (purge becomes a partition switch instead of a delete storm).

Kind values by channel:

Channel	Kinds
`ApiOutbound`	`SyncCall`, `CachedEnqueued`, `CachedAttempt`, `CachedTerminal`
`DbOutbound`	`SyncWrite`, `SyncRead`, `CachedEnqueued`, `CachedAttempt`, `CachedTerminal`
`Notification`	`Enqueued`, `Attempt`, `Terminal`
`ApiInbound`	`Completed` (one row per request, written at request end with final status)
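The Channel/Kind table above can be enforced at write time with a simple lookup. The sketch below is illustrative only: the `ALLOWED_KINDS` name and the `is_valid_discriminator` helper are hypothetical, and Python is used purely for demonstration.

```python
# Allowed Kind values per Channel, copied from the table above.
ALLOWED_KINDS = {
    "ApiOutbound": {"SyncCall", "CachedEnqueued", "CachedAttempt", "CachedTerminal"},
    "DbOutbound": {"SyncWrite", "SyncRead", "CachedEnqueued", "CachedAttempt", "CachedTerminal"},
    "Notification": {"Enqueued", "Attempt", "Terminal"},
    "ApiInbound": {"Completed"},
}


def is_valid_discriminator(channel: str, kind: str) -> bool:
    """Return True when (channel, kind) is a combination the table permits."""
    return kind in ALLOWED_KINDS.get(channel, set())
```

A writer could call this before the INSERT and treat a mismatch as a programming error rather than persisting an unclassifiable row.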

Site: `AuditLog` (SQLite)

Same shape minus IngestedAtUtc (irrelevant at the source) plus a local ForwardState column:

ForwardState: Pending | Forwarded | Reconciled. Drives the telemetry loop and reconciliation pull.

Site SQLite retention rule (hard invariant):

A row is eligible for purge only when both conditions hold:

OccurredAtUtc is older than the configured site retention window (default 7 days); AND
ForwardState IN ('Forwarded', 'Reconciled') — i.e., central has acknowledged receipt.

Rows still in ForwardState = 'Pending' are never purged on the basis of age. A prolonged central outage will grow the site audit table indefinitely until central is reachable again. This is intentional — losing audit rows to make room is a compliance violation, not a self-healing behavior.
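The purge invariant above is small enough to express as a single DELETE. The following is a minimal sketch using Python's `sqlite3` module as a stand-in for the site store; the `make_site_db`, `iso`, and `purge_site_rows` helpers are hypothetical, and the table carries only the columns the rule needs. Timestamps are assumed to be stored as fixed-width ISO-8601 text so that string comparison matches time order.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def iso(dt: datetime) -> str:
    """Fixed-width ISO-8601 in UTC so string comparison matches time order."""
    return dt.astimezone(timezone.utc).isoformat(timespec="microseconds")


def make_site_db() -> sqlite3.Connection:
    """In-memory stand-in for the site AuditLog (only the columns the rule needs)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE AuditLog ("
        " EventId TEXT PRIMARY KEY,"
        " OccurredAtUtc TEXT NOT NULL,"
        " ForwardState TEXT NOT NULL DEFAULT 'Pending')"
    )
    return conn


def purge_site_rows(conn: sqlite3.Connection, now: datetime, retention_days: int = 7) -> int:
    """Delete rows that are BOTH older than the window AND acknowledged by central."""
    threshold = iso(now - timedelta(days=retention_days))
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "DELETE FROM AuditLog"
            " WHERE OccurredAtUtc < ?"
            " AND ForwardState IN ('Forwarded', 'Reconciled')",
            (threshold,),
        )
    return cur.rowcount
```

Because `ForwardState` is part of the WHERE clause, an old `Pending` row survives any number of purge runs, which is exactly the behavior the invariant demands.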

To bound that growth in practice, the site emits a SiteAuditBacklog health metric (pending row count, oldest pending age, bytes on disk). Crossing operator-configured thresholds surfaces as a Health dashboard warning on the relevant site tile. This is the same pattern used by the Store-and-Forward Engine's backlog metric.

Central is the durable home; site SQLite is a write-buffer with a forwarding guarantee.

5. Where this fits in the existing component matrix

This work becomes component #23: Audit Log (number provisional), with edges into:

#7 External System Gateway — emits audit events for sync Call(), sync DB writes (and reads from scripts), and cached operations.
#14 Inbound API — emits one row per request (success or failure) at request completion.
#21 Notification Outbox — emits an audit row on enqueue, on each delivery attempt, and on terminal status.
#22 Site Call Audit — emits an audit row on each lifecycle transition (enqueue, attempt, terminal). SiteCalls remains the operational state store; AuditLog is the immutable shadow.
#3 Site Runtime / #16 Commons — script-trust-boundary call paths gain a thin audit interface.
#17 Configuration Database — Audit Log is a separate concern from IAuditService (which stays config-change-only). Both coexist.

6. Ingestion paths

There are three write paths into the central AuditLog, all converging on the same table.

6.1 Site hot-path append (site-originated events)

Script issues an action across the trust boundary (ExternalSystem.Call, Database write/read, Notify.Send, etc.).
The component completing the action (External System Gateway, Database Layer, S&F Engine) builds an AuditEvent value object with a fresh EventId (Guid v4) and OccurredAtUtc = UtcNow.
Component appends the event to the site-local AuditLog SQLite via the ISiteAuditWriter interface. Single-statement INSERT, ForwardState = 'Pending'. The caller awaits only the local write (the await returns once the row is durable, typically microseconds); forwarding to central happens later and asynchronously (§6.2).
Control returns to the script. No central round-trip on the hot path.

Failure modes on the hot path:

SQLite write fails (disk full, IO error): the audit writer logs a critical event to the Site Event Log, surfaces a SiteAuditWriteFailures health metric, and the action proceeds. We do not fail user-facing actions because the audit write failed — but the operator must be told loudly. (Open question: do we want a "strict mode" where audit-write failure aborts the action? Default off.)
Audit writer not yet bootstrapped (very early startup): events buffer in-memory bounded by a small ring; oldest discarded with a warning if it overflows. This window is normally sub-second.
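The hot-path append and its first failure mode can be sketched as follows. This is an illustration under stated assumptions, not the ISiteAuditWriter implementation: the `SiteAuditWriter` class, its `write_failures` counter (standing in for the SiteAuditWriteFailures metric), and the reduced column set are all hypothetical, and Python's `sqlite3` stands in for the site store.

```python
import sqlite3
import uuid
from datetime import datetime, timezone
from typing import Optional

# Reduced, hypothetical schema: only enough columns to show the append.
SCHEMA = (
    "CREATE TABLE AuditLog ("
    " EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT NOT NULL,"
    " Channel TEXT NOT NULL, Kind TEXT NOT NULL, Status TEXT NOT NULL,"
    " Target TEXT, ForwardState TEXT NOT NULL)"
)


class SiteAuditWriter:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.write_failures = 0  # stand-in for the SiteAuditWriteFailures metric

    def append(self, channel: str, kind: str, status: str,
               target: Optional[str] = None) -> Optional[str]:
        """Single-statement INSERT with ForwardState = 'Pending'.

        Returns the new EventId, or None when the local write failed.
        A failure is counted but never raised, so the caller's action proceeds.
        """
        event_id = str(uuid.uuid4())  # Guid v4 generated at the originator
        occurred = datetime.now(timezone.utc).isoformat(timespec="microseconds")
        try:
            with self.conn:  # commit on success, roll back on error
                self.conn.execute(
                    "INSERT INTO AuditLog VALUES (?, ?, ?, ?, ?, ?, 'Pending')",
                    (event_id, occurred, channel, kind, status, target),
                )
            return event_id
        except sqlite3.Error:
            self.write_failures += 1
            return None
```

A real writer would also log the critical Site Event Log entry described above; the sketch only shows that the exception is contained and counted.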

6.2 Telemetry forward (site → central)

A SiteAuditTelemetryActor runs as a singleton on the active site node and drives the forwarding loop:

Selects up to N Pending rows from local AuditLog ordered by OccurredAtUtc.
Sends them in a batched gRPC IngestAuditEvents(events) call to central (over the existing SiteStream channel — same transport as cached-call telemetry today).
Central performs insert-if-not-exists on EventId (idempotent) and returns the accepted IDs.
Site flips ForwardState = 'Forwarded' for accepted IDs. Rejected IDs (transient central error) stay Pending for the next sweep.

Cadence: short polling interval (default 5s) when the queue is non-empty, longer (default 30s) when idle. Telemetry runs on a dedicated dispatcher so it doesn't compete with the script blocking-I/O dispatcher.
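One sweep of the forwarding loop described above might look like the sketch below. Two in-memory SQLite databases stand in for the site store and the central MS SQL table, and a direct function call stands in for the gRPC IngestAuditEvents round-trip; `make_db`, `central_ingest`, and `forward_batch` are hypothetical names, and `INSERT OR IGNORE` is SQLite's spelling of insert-if-not-exists.

```python
import sqlite3


def make_db(with_forward_state: bool) -> sqlite3.Connection:
    """In-memory stand-in; the site variant adds the ForwardState column."""
    conn = sqlite3.connect(":memory:")
    extra = ", ForwardState TEXT NOT NULL DEFAULT 'Pending'" if with_forward_state else ""
    conn.execute(
        "CREATE TABLE AuditLog ("
        " EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT NOT NULL, Payload TEXT"
        f"{extra})"
    )
    return conn


def central_ingest(central: sqlite3.Connection, events: list) -> list:
    """Insert-if-not-exists keyed on EventId; returns the ids now stored at central."""
    with central:
        central.executemany(
            "INSERT OR IGNORE INTO AuditLog (EventId, OccurredAtUtc, Payload)"
            " VALUES (?, ?, ?)",
            events,
        )
    return [event[0] for event in events]


def forward_batch(site: sqlite3.Connection, central: sqlite3.Connection,
                  batch_size: int = 100) -> int:
    """Forward up to batch_size Pending rows, oldest first; returns rows flipped."""
    rows = site.execute(
        "SELECT EventId, OccurredAtUtc, Payload FROM AuditLog"
        " WHERE ForwardState = 'Pending' ORDER BY OccurredAtUtc LIMIT ?",
        (batch_size,),
    ).fetchall()
    if not rows:
        return 0
    accepted = central_ingest(central, rows)  # would be the gRPC call
    with site:
        site.executemany(
            "UPDATE AuditLog SET ForwardState = 'Forwarded' WHERE EventId = ?",
            [(event_id,) for event_id in accepted],
        )
    return len(accepted)
```

If the site crashes between the central insert and the local UPDATE, the rows stay Pending and are resent on the next sweep; the insert-if-not-exists at central makes that resend harmless.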

6.3 Reconciliation pull (self-healing for missed telemetry)

A central SiteAuditReconciliationActor periodically (default every 5 minutes per site) asks each site: "What's your highest EventId.OccurredAtUtc with ForwardState = 'Pending'? And how many pending rows do you have?" If central sees a non-empty pending backlog that hasn't drained on its own (e.g., telemetry actor is wedged), it issues a PullAuditEvents(sinceUtc, batchSize) request that returns rows directly. Central inserts-if-not-exists and acks them — site flips to ForwardState = 'Reconciled'.

This is the same self-healing pattern Site Call Audit uses for SiteCalls.

6.4 Central direct-write (central-originated events)

Events that originate at central never touch site SQLite:

Inbound API — request completed at central; one ApiInbound/Completed row written via ICentralAuditWriter synchronously inside the request handler middleware before the HTTP response is flushed.
Notification Outbox dispatcher — each delivery attempt writes a Notification/Attempt row; terminal status writes a Notification/Terminal row. (The site-originated Notification/Enqueued row arrives via §6.2.) Central direct-writes use the same insert-if-not-exists semantics keyed on EventId, so a retried request handler can't produce duplicates.

6.5 Cached operations — site emits, central writes twice

For ExternalSystem.CachedCall and Database.CachedWrite, the site is the source of truth for every audit row. The site writes each lifecycle event (CachedEnqueued, CachedAttempt, CachedTerminal) to its local SQLite AuditLog on the hot path (or on the retry tick for CachedAttempt), then forwards via the same telemetry channel described in §6.2. The telemetry message format gains the audit-row fields additively — one packet per lifecycle transition carries both the operational state update AND the audit row content.

On receipt, central does two things in one transaction:

Insert-if-not-exists the immutable AuditLog row, keyed on EventId.
Upsert the operational SiteCalls row (existing Site Call Audit behavior — status, retry count, last error, timestamps).

This collapses what would otherwise be two telemetry concerns into one, keeps site SQLite as the single local source of truth for audit content, and preserves the existing operational SiteCalls shape for the dispatcher / UI. No central-side derivation; no double-emission from the site.
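The two writes performed on receipt can be sketched as one transaction. SQLite stands in for central MS SQL here, and the `SiteCalls` columns (`AttemptCount`, `LastError`) and the packet field names are hypothetical, chosen only to mirror the "status, retry count, last error" description above. The upsert syntax used requires SQLite 3.24 or later.

```python
import sqlite3


def make_central() -> sqlite3.Connection:
    """In-memory stand-in with reduced, hypothetical table shapes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE AuditLog ("
        " EventId TEXT PRIMARY KEY, CorrelationId TEXT, Kind TEXT, Status TEXT)"
    )
    conn.execute(
        "CREATE TABLE SiteCalls ("
        " TrackedOperationId TEXT PRIMARY KEY, Status TEXT,"
        " AttemptCount INTEGER NOT NULL, LastError TEXT)"
    )
    return conn


def apply_cached_packet(conn: sqlite3.Connection, packet: dict) -> None:
    """Apply one lifecycle packet: immutable audit row + operational upsert."""
    with conn:  # both statements commit together or not at all
        # 1. Immutable audit row; a replayed packet is silently ignored.
        conn.execute(
            "INSERT OR IGNORE INTO AuditLog (EventId, CorrelationId, Kind, Status)"
            " VALUES (:EventId, :TrackedOperationId, :Kind, :Status)",
            packet,
        )
        # 2. Operational row; overwritten with the latest state.
        conn.execute(
            "INSERT INTO SiteCalls (TrackedOperationId, Status, AttemptCount, LastError)"
            " VALUES (:TrackedOperationId, :Status, :AttemptCount, :LastError)"
            " ON CONFLICT(TrackedOperationId) DO UPDATE SET"
            " Status = excluded.Status,"
            " AttemptCount = excluded.AttemptCount,"
            " LastError = excluded.LastError",
            packet,
        )
```

After three lifecycle packets for one operation, AuditLog holds three rows while SiteCalls holds one row reflecting the latest state. The sketch does not guard against out-of-order packets overwriting newer operational state; that ordering concern belongs to the existing Site Call Audit behavior and is not modeled here.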

7. Per-channel event mapping

Worked examples — what each Channel/Kind row actually looks like. (Other columns omitted for brevity unless interesting.)

7.1 `ApiOutbound` — outbound HTTP via External System Gateway

Sync call (ExternalSystem.Call("Weather", "GetForecast", { city: "Dublin" }) succeeds):

EventId        = <new guid>
Channel        = ApiOutbound
Kind           = SyncCall
CorrelationId  = NULL                                  -- one-shot, no operation to correlate
SourceSiteId   = "site-01"
SourceInstance = "Plant1.Boiler"
SourceScript   = "OnHourly"
Target         = "Weather/GetForecast"
Status         = Success
HttpStatus     = 200
DurationMs     = 142
RequestSummary = '{"city":"Dublin"}'                  -- truncated to cap
ResponseSummary= '{"tempC":11.4,...}'                 -- truncated to cap

Cached call (ExternalSystem.CachedCall(...), hits a 500, retries, succeeds on attempt 3):

1. Kind=CachedEnqueued    Status=Enqueued    CorrelationId=<tracked-op-id>
2. Kind=CachedAttempt     Status=TransientFailure  HttpStatus=500  CorrelationId=<same>
3. Kind=CachedAttempt     Status=TransientFailure  HttpStatus=500  CorrelationId=<same>
4. Kind=CachedAttempt     Status=Success           HttpStatus=200  CorrelationId=<same>
5. Kind=CachedTerminal    Status=Delivered         CorrelationId=<same>

The shadow of the SiteCalls row's lifecycle, but immutable and time-ordered.

7.2 `DbOutbound` — outbound DB via Database layer

Sync write (db.Execute("INSERT INTO Readings ...", new {...})):

Channel        = DbOutbound
Kind           = SyncWrite
Target         = "PlantDB"                            -- connection name only, not server
CorrelationId  = NULL
Status         = Success
DurationMs     = 9
RequestSummary = "INSERT INTO Readings(ts,val) VALUES (@p0,@p1)"   -- SQL text
Extra          = '{"rowsAffected":1,"params":{"p0":"2026-05-20T14:00Z","p1":42.7}}'  -- values captured by default

Sync read (db.Query<...>(...)):

Channel        = DbOutbound
Kind           = SyncRead
Status         = Success
DurationMs     = 31
RequestSummary = "SELECT id, value FROM Readings WHERE ts > @p0"
Extra          = '{"rowsReturned":42}'
ResponseSummary= NULL                                 -- rows not captured by default; opt-in per connection

Cached write — same five-row lifecycle as the cached API example.

7.3 `Notification` — outbound notifications

1. Kind=Enqueued   Status=Enqueued      CorrelationId=<NotificationId>  SourceSiteId="site-01" SourceInstance="Plant1.Boiler"
2. Kind=Attempt    Status=TransientFailure  ErrorMessage="SMTP 451 ..." CorrelationId=<same>   SourceSiteId=NULL (dispatch is central)
3. Kind=Attempt    Status=Success                                       CorrelationId=<same>
4. Kind=Terminal   Status=Delivered                                     CorrelationId=<same>
Target = "OpsTeamEmail"                              -- notification list name
Extra  = '{"resolvedTargets":["a@x.com","b@x.com"], "subject":"Boiler high temp"}'
RequestSummary = '...body, truncated...'

Note the site→central handoff is implicit: row 1 arrives via §6.2 telemetry (it originated at the site script); rows 2–4 are written by the central dispatcher directly via §6.4.

7.4 `ApiInbound` — inbound API

One row per request, written at request completion:

Channel        = ApiInbound
Kind           = Completed
CorrelationId  = <request-id>                        -- the request's correlation header (or generated)
SourceSiteId   = NULL                                -- central-originated event
Actor          = "AcmeSCADA"                         -- API key name (NOT the key itself)
Target         = "RecordReading"                     -- inbound method name
Status         = Success | PermanentFailure          -- mapped from final HTTP outcome
HttpStatus     = 200 | 400 | 401 | 500
DurationMs     = 73
RequestSummary = '{"siteId":"...","value":12.4}'     -- truncated; secrets/PII per redaction policy
ResponseSummary= '{"ok":true}'                       -- truncated to cap; cap raised to 64 KB on non-Success rows
Extra          = '{"remoteIp":"203.0.113.42","userAgent":"...","scriptInvoked":"RecordReading.Handle"}'

A bad API key → row with Status=PermanentFailure, HttpStatus=401, Actor=NULL, Extra carries remoteIp for abuse triage.

8. Payload capture policy

8.1 Truncation

Default cap: 8 KB for each of RequestSummary and ResponseSummary. Configurable globally; per-target overrides allowed (§8.4).
On any non-Success row, the cap is raised to 64 KB for that row — error context is precious.
When a body is truncated, PayloadTruncated = 1 and the captured prefix is preserved verbatim (UTF-8 byte-safe truncation, no mid-character cuts).
Bodies exceeding the larger cap are still truncated; full bodies are never stored.
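A byte-safe truncation of the kind described above can be sketched in a few lines. The `cap_for` and `truncate_utf8` helper names are hypothetical; the cap values are the defaults stated in this section, and the status test follows the "any non-Success row" wording literally.

```python
from typing import Tuple


def cap_for(status: str, default_cap: int = 8192, error_cap: int = 65536) -> int:
    """8 KB for Success rows, 64 KB for any non-Success row."""
    return default_cap if status == "Success" else error_cap


def truncate_utf8(text: str, cap_bytes: int) -> Tuple[str, bool]:
    """Return (captured_prefix, payload_truncated) without cutting mid-character."""
    data = text.encode("utf-8")
    if len(data) <= cap_bytes:
        return text, False
    # Slicing bytes may leave a partial multi-byte sequence at the end;
    # errors="ignore" drops those trailing bytes so the prefix stays valid UTF-8.
    return data[:cap_bytes].decode("utf-8", errors="ignore"), True
```

For example, `truncate_utf8("héllo", 2)` keeps only `"h"`: the second byte of the cap falls in the middle of the two-byte `é`, so that character is dropped rather than stored half-encoded.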

8.2 Redaction

Redaction happens at the write site, before the row touches SQLite (or central MS SQL for §6.4 events). Unredacted secrets never persist.

HTTP headers — Authorization, Cookie, Set-Cookie, X-API-Key, and any header matching the configured redact-list (regex) become <redacted>. List is operator-extensible.
HTTP bodies — captured verbatim by default. Operators can register per-External-System / per-Inbound-Method body redactors (regex → replacement) for known secret fields (e.g., "password"\s*:\s*"[^"]+").
SQL — statement text and parameter values captured verbatim by default. Per-connection redaction opt-in (e.g., redact parameters whose name matches @apikey|@token|@password).
Notification bodies — captured per the existing notification rules (no behavioral change from today).
Safety net: if a configured redactor throws, the affected payload becomes "<redacted: redactor error>" and an AuditRedactionFailure health metric increments. We over-redact, never under-redact, on configuration faults.
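The header and body rules above, including the over-redacting safety net, can be sketched as follows. The `redact_headers` and `redact_body` functions are hypothetical; the default header list and the password pattern are taken from this document, and header names are matched case-insensitively because HTTP header names are case-insensitive.

```python
import re
from typing import Dict, List, Tuple

HEADER_REDACT_LIST = ["Authorization", "Cookie", "Set-Cookie", "X-API-Key"]


def redact_headers(headers: Dict[str, str],
                   patterns: List[str] = HEADER_REDACT_LIST) -> Dict[str, str]:
    """Replace the value of any header whose name fully matches a pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return {
        name: "<redacted>" if any(c.fullmatch(name) for c in compiled) else value
        for name, value in headers.items()
    }


def redact_body(body: str, redactors: List[Tuple[str, str]]) -> Tuple[str, bool]:
    """Apply (pattern, replacement) pairs in order.

    Returns (body, redactor_failed). If any redactor raises, the whole payload
    is replaced, so a configuration fault over-redacts rather than leaking.
    """
    try:
        for pattern, replacement in redactors:
            body = re.sub(pattern, replacement, body)
        return body, False
    except Exception:
        return "<redacted: redactor error>", True
```

A caller would increment the AuditRedactionFailure metric whenever the second element of the `redact_body` result is true.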

8.3 Never captured

Raw API key material (only the key name via Actor).
LDAP bind credentials, cluster secrets, Configuration DB connection strings.
Framework traffic per §2 (out of scope by construction, not by redaction).

8.4 Configurability

Bound from appsettings.json (new AuditLog options class owned by the Audit Log component):

"AuditLog": {
  "DefaultCapBytes": 8192,
  "ErrorCapBytes": 65536,
  "HeaderRedactList": [ "Authorization", "Cookie", "Set-Cookie", "X-API-Key" ],
  "GlobalBodyRedactors": [
    { "Pattern": "\"password\"\\s*:\\s*\"[^\"]+\"", "Replacement": "\"password\":\"<redacted>\"" }
  ],
  "PerTargetOverrides": {
    "Weather/GetForecast": { "CapBytes": 4096 },
    "PlantDB":             { "RedactSqlParamsMatching": "@apikey|@token" }
  }
}

Per-target keys bind by External System / Inbound Method / Notification List / Database Connection name.

9. Failure handling & idempotency

9.1 `EventId` is the dedup key

Generated at the originator (site for §6.1/§6.5, central for §6.4). Guid v4.
Central ingest is INSERT … WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = @id), executed under the PK constraint.
Idempotent across telemetry retries, reconciliation pulls, and any combination thereof.

9.2 Central MS SQL outage

Site telemetry calls fail; ForwardState stays Pending; backlog grows.
Reconciliation pulls also fail.
Site SQLite continues to absorb hot-path writes (no upstream dependency on the hot path).
SiteAuditBacklog health metric crosses thresholds → Health dashboard surfaces it on the affected site tile.
On recovery, telemetry drains; insert-if-not-exists handles any overlap.

9.3 Site SQLite write failure

Audit writer fails to append (disk full, schema lock, transient IO error).
The action proceeds — we do not fail script-initiated work because the audit write failed.
SiteAuditWriteFailures health metric increments; critical-severity Site Event Log entry.
A small in-memory ring (default 1024 rows) buffers events while the local writer is unhealthy; on ring overflow, oldest events are dropped with a Site Event Log warning per drop.
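The bounded ring described above maps directly onto a fixed-length deque. This is a minimal sketch; the `AuditRing` class and its `dropped` counter are hypothetical, and the real buffer would also emit the Site Event Log warning on each drop.

```python
from collections import deque
from typing import Any, List


class AuditRing:
    """Bounded buffer that evicts the oldest event when full."""

    def __init__(self, capacity: int = 1024) -> None:
        self._buf = deque(maxlen=capacity)
        self.dropped = 0  # number of events lost to overflow

    def push(self, event: Any) -> None:
        if len(self._buf) == self._buf.maxlen:
            self.dropped += 1  # the append below evicts the oldest event
        self._buf.append(event)

    def drain(self) -> List[Any]:
        """Return buffered events oldest-first and empty the ring."""
        items = list(self._buf)
        self._buf.clear()
        return items
```

Once the local writer recovers, draining the ring and appending the events in order preserves their original sequence for everything that was not dropped.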

9.4 Telemetry actor wedged

Reconciliation pull (§6.3) is the fallback. If two consecutive reconciliation cycles report a non-draining backlog, the supervisor restarts the telemetry actor and a SiteAuditTelemetryStalled event fires.

9.5 Central direct-write failure

Inbound API: middleware audit failure is logged + metricked but never affects the HTTP response.
Notification Outbox dispatcher: audit failure logs critical and increments CentralAuditWriteFailures; the operational Notifications row update proceeds.

9.6 Dedup horizon — there isn't one

EventId PK enforces uniqueness as long as a row exists in the table. Purge (§12) removes rows by OccurredAtUtc, not EventId; a stale telemetry retry arriving after the original was purged will insert a "new" row. Acceptable — a retry that arrives more than a year late is vanishingly rare and an extra row is harmless.

10. UI & query surface

10.1 Audit Log page (new, top-level)

Lives under a new Audit nav group in Central UI (sibling to Notifications). Standard Blazor Server + Bootstrap, custom components per the project UI rules.

Filter bar (top of page, collapses to one row when not focused):

Time range (relative: 15m / 1h / 24h / 7d / custom).
Channel (multi-select: ApiOutbound, DbOutbound, Notification, ApiInbound).
Kind (filtered by selected channels).
Status (multi-select).
Site (multi-select, scoped to user's authorized sites).
Instance / Script (text search with autocomplete).
Target (text search — system+method, DB connection, list name).
Actor (text search — inbound API key name).
CorrelationId (paste a TrackedOperationId / NotificationId / request-id to see its full event sequence).
"Errors only" toggle (Status NOT IN (Success, Delivered, Enqueued)).

Results grid:

Columns (resizable, reorderable, persisted per user): OccurredAtUtc, Site, Channel, Kind, Status, Target, Actor, DurationMs, HttpStatus, ErrorMessage.
Keyset pagination on (OccurredAtUtc desc, EventId desc). Default page 100.
Click row → drilldown drawer.
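Keyset pagination on (OccurredAtUtc desc, EventId desc) can be sketched as below, with SQLite standing in for central MS SQL. The `fetch_page` helper is hypothetical. The seek predicate is written in its expanded form rather than as a row-value comparison, since T-SQL does not support comparing row values directly.

```python
import sqlite3
from typing import List, Optional, Tuple

Cursor = Tuple[str, str]  # (OccurredAtUtc, EventId) of the last row on a page


def fetch_page(conn: sqlite3.Connection, page_size: int = 100,
               after: Optional[Cursor] = None) -> Tuple[List[Cursor], Optional[Cursor]]:
    """Return (rows, next_cursor); next_cursor is None on a short (last) page."""
    order = " ORDER BY OccurredAtUtc DESC, EventId DESC LIMIT ?"
    if after is None:
        rows = conn.execute(
            "SELECT OccurredAtUtc, EventId FROM AuditLog" + order,
            (page_size,),
        ).fetchall()
    else:
        ts, event_id = after
        # Rows strictly "after" the cursor in descending (ts, id) order.
        rows = conn.execute(
            "SELECT OccurredAtUtc, EventId FROM AuditLog"
            " WHERE OccurredAtUtc < ? OR (OccurredAtUtc = ? AND EventId < ?)" + order,
            (ts, ts, event_id, page_size),
        ).fetchall()
    next_cursor = rows[-1] if len(rows) == page_size else None
    return rows, next_cursor
```

Including EventId as the tiebreaker matters because many events can share the same OccurredAtUtc; without it, rows with equal timestamps could be skipped or repeated across page boundaries.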

Drilldown drawer:

Pretty-prints RequestSummary / ResponseSummary (JSON auto-detected; SQL syntax-highlighted).
Redaction indicators where headers/fields were stripped.
"Copy as cURL" for ApiOutbound / ApiInbound rows.
"Show all events for this operation" link → filters by CorrelationId.

10.2 Drill-in links from existing pages

Notifications row → "View audit history" → Audit Log filtered to CorrelationId = NotificationId.
Site Calls row → "View audit history" → Audit Log filtered to CorrelationId = TrackedOperationId.
External Systems detail → "Recent activity" → Audit Log filtered to Target starts-with <system>.
Inbound API keys detail → "Recent calls" → Audit Log filtered to Actor = <key name> AND Channel = ApiInbound.
Sites detail → new "Audit feed" tab.
Instances detail → new "Audit feed" tab.

10.3 Health dashboard tiles

Three new tiles in an "Audit" KPI group:

Audit volume — events/min global + per-site sparkline.
Audit error rate — % non-Success rows, rolling 5 min.
Audit backlog — sum of Pending site rows; click → per-site breakdown.

10.4 Export

Audit Log page Export button streams CSV (current filter) server-side. Default cap 100k rows; larger exports use the CLI (§15).

11. Security & tamper-evidence

11.1 Append-only enforcement

Application accesses AuditLog via a dedicated DB role scadalink_audit_writer granted INSERT + SELECT only — no UPDATE, no DELETE.
Purge runs under a separate role scadalink_audit_purger whose permissions are limited to the partition-switch operation (§12.2). Row-level DELETE is not granted even to purge.
A CI guard greps the data layer for any UPDATE … AuditLog or DELETE … AuditLog text and fails the build.

11.2 Authorization

Reading the Audit Log requires the existing Audit role (today used for the IAuditService config-change log) extended with a new OperationalAudit permission.
Per-site row scoping reuses the existing site-permission model from Security & Auth — a user sees only rows for sites they are authorized to operate.
Bulk export (UI button + CLI) requires an additional AuditExport permission.

11.3 Payload redaction at write

See §8.2. Contract: unredacted secrets never persist. Safety net over-redacts on misconfiguration.

11.4 Tamper-evidence hash chain (deferred, v1.x)

Each row gains a RowHash column.
RowHash = SHA-256(prev.RowHash || canonical(row)) per partition.
Computed by a chaining job that runs after each monthly partition closes.
Verifiable offline via scadalink audit verify-chain --month YYYY-MM.
Default off in v1 to avoid operational burden. Flag for v1.x.
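The chaining formula above can be illustrated with a short sketch. Two details are assumptions, since the design does not specify them: `canonical(row)` is taken to be JSON with sorted keys and no whitespace, and the first row in a partition chains from 32 zero bytes (`GENESIS`). The function names are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = b"\x00" * 32  # assumed starting value for the first row of a partition


def canonical(row: Dict) -> bytes:
    """Assumed canonical form: JSON with sorted keys and no extra whitespace."""
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")


def chain_hashes(rows: List[Dict]) -> List[str]:
    """RowHash = SHA-256(prev.RowHash || canonical(row)), in row order."""
    prev = GENESIS
    hashes = []
    for row in rows:
        prev = hashlib.sha256(prev + canonical(row)).digest()
        hashes.append(prev.hex())
    return hashes


def verify_chain(rows: List[Dict], stored_hashes: List[str]) -> bool:
    """Recompute the chain and compare with the stored RowHash values."""
    return chain_hashes(rows) == stored_hashes
```

Because each hash folds in its predecessor, altering any row changes its own RowHash and every RowHash after it in the partition, which is what makes tampering detectable offline.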

11.5 Site SQLite security

File permissions: read/write by the ScadaLink service account only.
Not backed up off-machine — site SQLite is a buffer with a forwarding guarantee, not a record. Central is the durable home.

12. Retention & purge mechanics

12.1 Central retention defaults

365 days based on OccurredAtUtc. Configurable via AuditLog:RetentionDays (min 7, max 3650, validated at startup).
Single global retention in v1 — no per-channel/Kind overrides. Deferred to v1.x once production cost data shows whether overrides are needed.

12.2 Partition strategy

Monthly partitions on OccurredAtUtc. Partition function pf_AuditLog_Month, scheme ps_AuditLog_Month, created in the EF Core migration.
Purge by partition switch: move the eligible partition to a staging table, then drop. No row-by-row delete; no log bloat.
Partition-maintenance job rolls forward each month (creates the next month's partition ahead of time).

12.3 Purge job

Singleton actor AuditLogPurgeActor on the active central node, runs daily.
Switches out any partition whose latest OccurredAtUtc is older than the global retention window. Pure partition-switch; no row-level deletes.
Emits an AuditLog:Purged event (partition range, rowcount, duration).
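The eligibility test for a monthly partition can be sketched as below. As a simplifying assumption, the sketch uses the partition's upper boundary (the first instant of the following month) as a conservative stand-in for "latest OccurredAtUtc": if the boundary is older than the retention window, every row in the partition is too. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

Partition = Tuple[int, int]  # (year, month)


def next_month_start(year: int, month: int) -> datetime:
    """First instant (UTC) of the month after (year, month)."""
    if month == 12:
        return datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    return datetime(year, month + 1, 1, tzinfo=timezone.utc)


def eligible_partitions(partitions: List[Partition], now: datetime,
                        retention_days: int = 365) -> List[Partition]:
    """Partitions whose every possible row is older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [(y, m) for (y, m) in partitions if next_month_start(y, m) <= cutoff]
```

With the default 365-day retention and a run on 2026-05-20, the April 2025 partition is eligible while May 2025 is not, because May still contains dates inside the window.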

12.4 Site SQLite purge

Daily site job, hard invariant per §4: purge only OccurredAtUtc < threshold AND ForwardState IN ('Forwarded','Reconciled').
Default site retention 7 days (configurable, min 1, max 90).
Backlog metric (§9.2) provides visibility into "central outage → site bloat" before disk-full.

13. Performance & sizing

Rough back-of-envelope; load testing will confirm.

13.1 Per-site event rate (assumed nominal site)

Channel/Kind	Typ events/min	Peak events/min
`ApiOutbound.SyncCall`	10	100
`ApiOutbound.Cached*` (~4 rows/op)	4	20
`DbOutbound.SyncWrite`	30	300
`DbOutbound.SyncRead`	60	600
`DbOutbound.Cached*` (~4 rows/op)	4	20
`Notification.Enqueued` (site-emit)	1	10
Per-site total	~110	~1,050

13.2 Central total (50-site deployment)

Typical: ~5,500 events/min = ~92 events/sec.
Peak: ~52,500 events/min = ~875 events/sec.
Plus central-originated (Notification dispatch + Inbound API): assume ~30 events/sec typical.

MS SQL handles this with batched ingest and the time-aligned indexes.

13.3 Row size

Fixed columns: ~250 bytes.
Average captured payload: ~1 KB.
Per row: ~1.3 KB.

13.4 Yearly central footprint

Typical: 100 events/sec × 86,400 × 365 × 1.3 KB ≈ 4 TB at default cap.
Cap reduction (8 KB → 2 KB) or per-channel retention shaves this multi-fold.
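The arithmetic in §13.1 through §13.4 can be reproduced with a few lines, which makes it easy to re-run under different assumptions. The rates are copied from the table in §13.1; treating 1.3 KB as 1,300 bytes and 1 TB as 10^12 bytes is an assumption of this sketch, as are the helper names.

```python
# Per-site events/min from the table in section 13.1.
TYPICAL = {
    "ApiOutbound.SyncCall": 10, "ApiOutbound.Cached*": 4,
    "DbOutbound.SyncWrite": 30, "DbOutbound.SyncRead": 60,
    "DbOutbound.Cached*": 4, "Notification.Enqueued": 1,
}
PEAK = {
    "ApiOutbound.SyncCall": 100, "ApiOutbound.Cached*": 20,
    "DbOutbound.SyncWrite": 300, "DbOutbound.SyncRead": 600,
    "DbOutbound.Cached*": 20, "Notification.Enqueued": 10,
}
SITES = 50
ROW_BYTES = 1300  # ~1.3 KB per row, taken as 1,300 bytes


def central_events_per_sec(per_site_per_min: float, sites: int = SITES) -> float:
    """Aggregate site-originated rate arriving at central."""
    return per_site_per_min * sites / 60


def yearly_bytes(events_per_sec: float, row_bytes: int = ROW_BYTES) -> float:
    """Bytes accumulated over 365 days at a constant event rate."""
    return events_per_sec * 86_400 * 365 * row_bytes


typical_total = sum(TYPICAL.values())   # 109, rounded to ~110 in the table
peak_total = sum(PEAK.values())         # 1050
footprint_tb = yearly_bytes(100) / 1e12  # about 4.1 TB
```

Using the summed typical rates instead of the rounded 100 events/sec (about 92 from sites plus about 30 central-originated, so roughly 122 events/sec) gives approximately 5 TB per year from the same formula, so the 4 TB figure is best read as an order-of-magnitude estimate.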

13.5 Site SQLite footprint

110/min × 60 × 24 × 7 × 1.3 KB ≈ 1.4 GB / site at the 7-day window (about 1.1 million rows). Still small for a local disk.

13.6 Levers

Reduce DefaultCapBytes per §8.1.
Tighten per-channel retention (especially DbOutbound.SyncRead) once overrides land in v1.x; v1 has a single global retention (§12.1).
Defer to v1.x: Parquet archival to object storage before purge (§15.2).

14. KPI surface & relationship to existing KPIs

14.1 New Audit Log KPIs

Volume — events/min, global + per-site.
Error rate — % non-Success rows, rolling 5 min.
Backlog — sum of Pending site rows.
Top inbound callers — top-10 Actor by request count, last 1h.
Top outbound 5xx — top-10 Target by 5xx-status count, last 1h.

14.2 Relationship to existing KPIs

Notification Outbox KPIs (queue depth, parked, delivered-last-interval, etc.) — unchanged, sourced from Notifications. Audit Log KPIs describe the audit table itself, not the notification subsystem.
Site Call Audit KPIs — unchanged, sourced from SiteCalls.
Audit Log KPIs occupy their own group on the Health dashboard. Nothing is collapsed or superseded.

15. CLI & external access

15.1 CLI commands

New scadalink audit command group:

scadalink audit query --site <s> --since <t> --kind <k> [...] — same filter set as the UI.
scadalink audit export --since <t> --until <t> --format csv|jsonl|parquet --output <path> — bulk export, server-side streaming.
scadalink audit verify-chain --month <YYYY-MM> — hash-chain verification (when §11.4 is enabled).

Requires the same OperationalAudit / AuditExport permissions as the UI.

15.2 Object-storage archival (deferred, v1.x)

A monthly job dumps the closing partition to Parquet on operator-configured object storage before central purge — enabling indefinite cold retention without bloating MS SQL. Flag for v1.x; not in initial scope.

16. Locked decisions

#	Question	Decision
1	Component number	#23 Audit Log (README matrix + HighLevelReqs).
2	Nav placement	New top-level Audit nav group in Central UI.
3	Hash-chain tamper evidence (§11.4)	Deferred to v1.x. v1 enforces append-only via DB grants only.
4	Parquet archival to object storage (§15.2)	Deferred to v1.x.
5	Per-channel retention overrides (§12.1)	Deferred to v1.x. v1 uses a single global `RetentionDays`.
6	Default payload cap	8 KB for `RequestSummary` / `ResponseSummary`; 64 KB on non-`Success` rows.

All earlier design decisions (purpose, topology, scope, payload depth, lifecycle granularity, retention default, site→central path, UI shape, cached-call audit emission, SQL parameter capture, never-fail-on-audit-failure) are also locked. See §1–§15.
