docs(audit): apply cross-bundle review fixes before merge

Final cross-bundle reviewer identified 7 inconsistencies that the per-bundle
reviewers couldn't see; all fixed in one logical commit.

Critical:
- HighLevelReqs AL-3: drop 'then upsert-on-newer-status' — AuditLog is
  strictly append-only (correct for SiteCalls/Notifications, wrong for
  the immutable AuditLog shadow).
- Component-AuditLog Error rate KPI: align with HealthMonitoring's
  exclusion list (Success/Delivered/Enqueued) rather than just non-Success;
  otherwise every Delivered notification or Enqueued cached call would be
  counted as an error.

Important:
- Component-AuditLog line 154: ISiteAuditWriter -> IAuditWriter (canonical
  name per Commons and the rest of this doc).
- Component-AuditLog Central direct-write paragraph: convert remaining
  slash notation (ApiInbound/Completed, Notification/Attempt,
  Notification/Terminal) to dot notation used everywhere else.
- Component-ClusterInfrastructure: scope SiteCallAuditActor to
  reconciliation + KPIs + Retry/Discard relay; cached-telemetry ingest is
  AuditLogIngestActor's role per Combined Telemetry contract.
- Component-CentralUI Audit Log page: state the OperationalAudit read
  permission and the read-vs-export split (matching CLI doc).
- Component-NotificationOutbox: add never-fail-the-action invariant for
  dispatcher audit writes.

Minor:
- Component-InboundAPI: 'Non-blocking semantics' was ambiguous (could be
  read as async); reword to 'Fail-soft' — the write is still synchronous
  before flush, but failures are caught and don't change the response.
- Component-CLI: realign audit-query/audit-export flags to actually match
  the Central UI Audit Log filter set (channel, kind, status, site,
  instance, target, actor, correlation-id, errors-only); drop --user and
  --entity-id which are IAuditService concepts, not Audit Log columns.
- Component-AuditLog KPI tile names: 'Volume/Error rate/Backlog' ->
  'Audit volume/Audit error rate/Audit backlog' (matches Central UI and
  Health Monitoring); drop the two orphan KPIs (Top inbound callers, Top
  outbound 5xx) that were never surfaced anywhere.
- Component-AuditLog Interactions: re-attribute DbOutbound emissions to
  ESG (where Database.* lives) with a note that Site Runtime is the API
  surface for scripts.
- HighLevelReqs AL-12: drop 'and reconciliation operations' (CLI has no
  reconcile command; reconciliation is an internal self-healing pull).
  Add note that verify-chain becomes operational once AL-11's hash chain
  ships.

author Joseph Doherty
date 2026-05-20 09:00:11 -04:00
parent 34ea97bae9
commit c929562e41
7 changed files with 20 additions and 21 deletions

View File

@@ -151,7 +151,7 @@ writers — all idempotent on `EventId`.
 The component completing a script-trust-boundary action (External System
 Gateway, Database layer, Store-and-Forward Engine) builds an `AuditEvent` with a
 fresh `EventId` (Guid v4) and `OccurredAtUtc = UtcNow`, then appends it to the
-site-local `AuditLog` SQLite via `ISiteAuditWriter` with
+site-local `AuditLog` SQLite via `IAuditWriter` with
 `ForwardState = 'Pending'`. The append is a single-statement INSERT and is
 durable in microseconds; control returns to the script with no central
 round-trip on the hot path.
@@ -178,10 +178,10 @@ pattern as Site Call Audit's reconciliation of `SiteCalls`.
 ### Central direct-write (central-originated events)
 Events originating at central never touch site SQLite. Inbound API writes one
-`ApiInbound`/`Completed` row via `ICentralAuditWriter` synchronously inside the
+`ApiInbound.Completed` row via `ICentralAuditWriter` synchronously inside the
 request-handler middleware, before the HTTP response is flushed. The
-Notification Outbox dispatcher writes `Notification`/`Attempt` per delivery
-attempt and `Notification`/`Terminal` on terminal status. Central direct-writes
+Notification Outbox dispatcher writes `Notification.Attempt` per delivery
+attempt and `Notification.Terminal` on terminal status. Central direct-writes
 use the same insert-if-not-exists semantics keyed on `EventId`.
 ## Cached Operations — Combined Telemetry
@@ -291,11 +291,9 @@ MS SQL for direct-write events). Unredacted secrets never persist.
 Point-in-time, computed from the central `AuditLog` table; global and per-site.
-- **Volume** — events/min.
-- **Error rate** — % non-`Success` rows, rolling 5 min.
-- **Backlog** — sum of `Pending` site rows across sites.
-- **Top inbound callers** — top-10 `Actor` by request count, last 1 h.
-- **Top outbound 5xx** — top-10 `Target` by 5xx-status count, last 1 h.
+- **Audit volume** — events/min landing in the central `AuditLog`; global plus per-site sparkline.
+- **Audit error rate** — % of central `AuditLog` rows with `Status` NOT IN (`Success`, `Delivered`, `Enqueued`) over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, transient failures, parked deliveries) — NOT audit-writer health, which surfaces separately via `CentralAuditWriteFailures` and `AuditRedactionFailure`.
+- **Audit backlog** — sum of `Pending` site rows across sites; click drills into a per-site breakdown.
 [Notification Outbox](Component-NotificationOutbox.md) and
 [Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected — they remain
@@ -355,9 +353,7 @@ global value in v1; per-channel overrides are deferred to v1.x.
 emits `ApiOutbound.SyncCall` rows on every sync `Call()`. For `CachedCall`,
 emits the combined cached telemetry packet (audit row + operational update)
 per Cached Operations — Combined Telemetry.
-- **[Site Runtime (#3)](Component-SiteRuntime.md) — Database layer** — emits
-`DbOutbound.SyncWrite`, `DbOutbound.SyncRead`, and the cached-write variants
-via the same combined-telemetry path.
+- **[External System Gateway (#7)](Component-ExternalSystemGateway.md) — Database layer** — the database access modes inside ESG emit `DbOutbound.SyncWrite` and `DbOutbound.SyncRead` on script-initiated `Connection()` calls; `Database.CachedWrite` emits the cached-write lifecycle rows via the combined-telemetry packet (same path as `ApiOutbound.Cached*`). Site Runtime is the API surface that exposes the `Database.*` calls to scripts; the audit emission itself lives in ESG.
 - **[Inbound API (#14)](Component-InboundAPI.md)** — emits one
 `ApiInbound.Completed` row per request from request-handler middleware,
 written directly to central via `ICentralAuditWriter` before the response is
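
Reviewer sketch (not part of the commit): the revised Audit error rate definition in the KPI hunk above can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions. The `(occurred_at_utc, status)` row shape, the function names, and the `Parked` status string are hypothetical and are not taken from the codebase; only the exclusion list (`Success`, `Delivered`, `Enqueued`) and the 5-minute window come from the diff.

```python
from datetime import datetime, timedelta, timezone

# Statuses that the revised KPI treats as non-errors (taken from the diff above).
NON_ERROR_STATUSES = frozenset({"Success", "Delivered", "Enqueued"})


def audit_error_rate(rows, now, window=timedelta(minutes=5)):
    """Percent of rows inside the rolling window whose status counts as an error.

    `rows` is an iterable of (occurred_at_utc, status) tuples. That shape is a
    hypothetical stand-in for a query against the central AuditLog table.
    """
    cutoff = now - window
    recent = [status for occurred_at, status in rows if cutoff < occurred_at <= now]
    if not recent:
        return 0.0
    errors = sum(1 for status in recent if status not in NON_ERROR_STATUSES)
    return 100.0 * errors / len(recent)


def non_success_rate(rows, now, window=timedelta(minutes=5)):
    """The previous definition: every row that is not `Success` is an error."""
    cutoff = now - window
    recent = [status for occurred_at, status in rows if cutoff < occurred_at <= now]
    if not recent:
        return 0.0
    return 100.0 * sum(1 for status in recent if status != "Success") / len(recent)


now = datetime(2026, 5, 20, 13, 0, tzinfo=timezone.utc)
rows = [
    (now - timedelta(minutes=1), "Success"),
    (now - timedelta(minutes=2), "Delivered"),
    (now - timedelta(minutes=3), "Enqueued"),
    (now - timedelta(minutes=4), "Parked"),  # assumed error status string
    (now - timedelta(minutes=9), "Parked"),  # outside the 5-minute window
]
print(audit_error_rate(rows, now))  # 25.0
print(non_success_rate(rows, now))  # 75.0
```

With the same four in-window rows, the old non-`Success` rule reports 75% while the aligned rule reports 25%, which is the over-counting the commit message describes.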

View File

@@ -187,14 +187,15 @@ require the `OperationalAudit` permission; `audit export` additionally requires
 exit code 2) on denial.
 ```
-scadalink audit query --site <s> --since <t> [--until <t>] [--kind <k>] [--user <u>] [--entity-id <id>] [--correlation-id <id>] [--status <s>] [--page <n>] [--page-size <n>]
-scadalink audit export --since <t> --until <t> --format csv|jsonl|parquet --output <path> [--site <s>] [--kind <k>]
+scadalink audit query --since <t> [--until <t>] [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--instance <i>] [--target <t>] [--actor <a>] [--correlation-id <id>] [--errors-only] [--page <n>] [--page-size <n>]
+scadalink audit export --since <t> --until <t> --format csv|jsonl|parquet --output <path> [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--target <t>] [--actor <a>]
 scadalink audit verify-chain --month <YYYY-MM>
 ```
 - `audit query` — filtered query against the central `AuditLog` table, matching the
-Central UI filter set (site, time range, audit kind, user, entity, correlation ID,
-status, paging). Results stream as JSON (default) or table.
+Central UI Audit Log page filter set (time range, channel, kind, status, site,
+instance/script, target, actor, correlation ID, errors-only). Results stream as
+JSON (default) or table.
 - `audit export` — server-side streaming export of the central `AuditLog` to the
 requested format (`csv`, `jsonl`, `parquet`) written to `--output`. The server
 streams rows rather than materializing them in memory; the CLI writes bytes

View File

@@ -157,7 +157,7 @@ Central cluster only. Sites have no user interface.
 ### Audit Log (Admin / Audit Role)
 - Lives under a **new top-level "Audit" nav group** (sibling to Notifications). In v1 the Audit nav group contains this single Audit Log page; the pre-existing Configuration Audit Log Viewer remains its own page below.
 - Global query / filter / drilldown over the central `AuditLog` table maintained by the Audit Log component (#23). Read-only — the table is append-only, so there are no edit actions on rows.
-- Per-site row scoping reuses the existing site-permission model from Security & Auth: a user sees only rows for sites they are authorized to operate. Bulk export (see below) requires the additional `AuditExport` permission.
+- Read access to the page requires the `OperationalAudit` permission (Security & Auth #10). Per-site row scoping reuses the existing site-permission model: a user sees only rows for sites they are authorized to operate. Bulk export (see below) additionally requires `AuditExport`. The split mirrors the CLI's permission model (see Component-CLI.md).
 - **Filter bar** (top of page, collapses to a single row when not focused):
 - Time range — relative (15m / 1h / 24h / 7d) or custom.
 - Channel — multi-select: `ApiOutbound`, `DbOutbound`, `Notification`, `ApiInbound`.

View File

@@ -61,7 +61,7 @@ Akka.NET cluster singletons run on the active node of their cluster and migrate
 ### Central singletons (active central node)
 - **`NotificationOutboxActor`** — owned by Notification Outbox (#21). Drives the central notification dispatch loop against the `Notifications` table.
-- **`SiteCallAuditActor`** — owned by Site Call Audit (#22). Ingests `CachedCall` / `CachedWrite` telemetry and reconciliation pulls into the `SiteCalls` table.
+- **`SiteCallAuditActor`** — owned by Site Call Audit (#22). Owns the operational `SiteCalls` table: drives periodic reconciliation pulls for `CachedCall` / `CachedWrite` lifecycle, computes KPIs, and relays operator Retry/Discard actions to the owning site. Note: ingest of cached-call telemetry is performed by `AuditLogIngestActor` (#23) in one transaction with the immutable `AuditLog` insert — see Component-AuditLog.md, Cached Operations — Combined Telemetry.
 - **`AuditLogIngestActor`** — owned by Audit Log (#23). Receives gRPC telemetry batches of `AuditEvent` rows from sites and performs insert-if-not-exists on `EventId` against the central `AuditLog` table. For cached-call telemetry (which carries both audit-row content and operational-state fields in a single packet), the ingest performs the `AuditLog` insert and the `SiteCalls` upsert in **one transaction** — see Component-AuditLog.md for the combined-telemetry contract.
 - **`SiteAuditReconciliationActor`** — owned by Audit Log (#23). Periodic per-site pull (default every 5 minutes) that self-heals missed audit telemetry by asking each site for its oldest `ForwardState = 'Pending'` row and issuing a `PullAuditEvents(sinceUtc, batchSize)` when a non-draining backlog is detected.
 - **`AuditLogPurgeActor`** — owned by Audit Log (#23). Daily partition-switch purge against `ps_AuditLog_Month`; switches out any partition older than `AuditLog:RetentionDays` and emits an `AuditLog:Purged` event. Also rolls the partition scheme forward each month so the next month's partition exists ahead of time.

View File

@@ -118,7 +118,7 @@ API method scripts are compiled at central startup — all method definitions ar
 - **Every request — success or failure — emits one `ApiInbound.Completed` row** to `ICentralAuditWriter` from request middleware before the HTTP response is flushed. The row captures the API key **name** (never the key material), remote IP, user-agent, response status, duration, and truncated request/response bodies per the Audit Log capture policy (see Component-AuditLog.md, Payload Capture Policy). This supersedes the earlier failures-only stance: operational API traffic is now part of the centralized audit log, so configuration changes and call activity share a single retention/query surface.
 - Script execution errors (500 responses) remain captured on the same `ApiInbound.Completed` row (response status + error fields) rather than emitting a separate failure-only event.
-- **Non-blocking semantics.** Middleware audit-write failures are logged and metricked (see Health Monitoring #11 – `CentralAuditWriteFailures`) but **never affect the HTTP response**: a failed audit append does not turn a successful API call into an error returned to the caller.
+- **Fail-soft semantics.** The audit write is synchronous (inline before the response is flushed), but failures are caught: a write that throws is logged and increments `CentralAuditWriteFailures` (see Health Monitoring #11) and the request still returns its normal HTTP response. A failed audit append never turns a successful API call into an error returned to the caller.
 - No rate limiting — this is a private API in a controlled industrial environment with a known set of callers. Misbehaving callers are handled operationally (disable the API key).
 ## Request Flow
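
Reviewer sketch (not part of the commit): the fail-soft wording in the hunk above describes a synchronous write whose failures are caught rather than propagated. The following Python sketch shows that control flow only. `complete_request`, `Metrics`, and `broken_writer` are hypothetical names for illustration; the real middleware is not shown in this diff, and only the `CentralAuditWriteFailures` counter name comes from the source.

```python
import logging

logger = logging.getLogger("inbound_api.audit")


class Metrics:
    """Hypothetical stand-in for the health-metric sink."""

    def __init__(self):
        self.central_audit_write_failures = 0


def complete_request(response, write_audit_row, metrics):
    """Write the audit row inline, then return the response unchanged.

    The write is synchronous (it runs before the response is handed back),
    but any exception it raises is logged and counted instead of propagated.
    """
    try:
        write_audit_row(response)
    except Exception:
        logger.exception("audit write failed; response is unchanged")
        metrics.central_audit_write_failures += 1
    return response


def broken_writer(response):
    # Simulates an audit store that is temporarily unavailable.
    raise RuntimeError("simulated audit database outage")


metrics = Metrics()
result = complete_request({"status": 200}, broken_writer, metrics)
```

After the call, `result` is still the original response and the failure counter has been incremented once, which is the distinction between "fail-soft" and "asynchronous" that the rewording is meant to make explicit.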

View File

@@ -110,6 +110,8 @@ Each delivery attempt also writes a `Notification.Attempt` row to the central `A
 The operational `Notifications` table remains the **source of truth** for the dispatcher and for Retry/Discard actions; the `AuditLog` rows are immutable shadows. Operator Retry/Discard still mutates only the `Notifications` row, and each transition emits the corresponding `Notification.Attempt` / `Notification.Terminal` audit row.
+**Audit-write failure never affects delivery.** If the `ICentralAuditWriter` direct-write fails (transient DB error, schema lock, etc.) the dispatcher logs the failure and increments the `CentralAuditWriteFailures` health metric (see Health Monitoring #11), but the delivery attempt's outcome on the `Notifications` row stands. The audit row is recovered by re-emission on the next dispatcher tick or by the on-startup reconciliation sweep; central never aborts a notification because audit failed.
 ## Delivery Adapters
 A delivery adapter implementing `INotificationDeliveryAdapter` is registered per `Type`. Each `Deliver(...)` call returns one of `success | transient failure | permanent failure`, mirroring the External System Gateway error-classification pattern.

View File

@@ -448,7 +448,7 @@ Sections 10.1–10.4 cover **configuration-database audit** (config-mutating use
 - **AL-1**: The system maintains an **append-only** central Audit Log recording every script-trust-boundary action — outbound external system calls (sync `Call` and `CachedCall`), outbound database operations (sync `Connection` access and `CachedWrite`), notifications, and inbound API method invocations.
 - **AL-2**: For cached calls and notifications, the Audit Log captures **one row per lifecycle event** (e.g., enqueued, retrying, delivered, parked, discarded), not a single mutable row per operation.
-- **AL-3**: Site-originated events are appended to a **site-local SQLite hot-path** synchronously with the action, then **forwarded to central via gRPC telemetry**; central ingest is **idempotent on `EventId`** (insert-if-not-exists then upsert-on-newer-status).
+- **AL-3**: Site-originated events are appended to a **site-local SQLite hot-path** synchronously with the action, then **forwarded to central via gRPC telemetry**; central ingest is **idempotent on `EventId`** (insert-if-not-exists; the `AuditLog` table is strictly append-only, so rows are never updated after insert).
 - **AL-4**: A periodic **central→site reconciliation pull** detects and replays any telemetry events that were missed (e.g., during a central outage), making the central Audit Log eventually consistent with sites.
 - **AL-5**: Each row captures **payload metadata** (target, method, status, timings, correlation IDs) plus a **truncated request/response body** – **8 KB default**, expanded to **64 KB on error** outcomes.
 - **AL-6**: **HTTP headers are redacted by default**; **SQL parameter values are captured by default**. Per-target **redaction opt-in** is configurable on external systems, database connections, and inbound API methods.
@@ -457,7 +457,7 @@ Sections 10.1–10.4 cover **configuration-database audit** (config-mutating use
 - **AL-9**: The site SQLite Audit Log is purged only when `ForwardState ∈ {Forwarded, Reconciled}` — i.e., a row must be either confirmed-forwarded *or* confirmed-reconciled before it can be removed. A central outage therefore **cannot cause audit loss at sites**.
 - **AL-10**: The Central UI exposes an **Audit Log page** with a cross-channel filter (by site, target, status, time range, correlation ID), plus **drill-ins from existing operational pages** (Site Calls, Notification Outbox, Inbound API).
 - **AL-11**: Append-only semantics are **enforced via DB roles** (no UPDATE/DELETE granted on the `AuditLog` table to application accounts); a **tamper-evidence hash chain is deferred to v1.x**.
-- **AL-12**: The CLI provides a `scadalink audit` command group for query, export, and reconciliation operations against the central Audit Log.
+- **AL-12**: The CLI provides a `scadalink audit` command group for query, export, and hash-chain verification (verify-chain becomes operational once AL-11's hash chain ships) against the central Audit Log.
 ## 11. Health Monitoring
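
Reviewer sketch (not part of the commit): the AL-3 change above narrows central ingest to insert-if-not-exists keyed on `EventId`, with no later update. A small Python example using an in-memory SQLite database shows the intended effect of a replayed event. This is only an illustration: the three-column table, the `ingest` helper, and the `Failed` status string are hypothetical, and SQLite's `INSERT OR IGNORE` stands in for whatever statement the central store actually uses.

```python
import sqlite3
import uuid


def open_audit_log():
    """Create a throwaway in-memory table keyed on EventId (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE AuditLog ("
        "EventId TEXT PRIMARY KEY, Kind TEXT NOT NULL, Status TEXT NOT NULL)"
    )
    return conn


def ingest(conn, event_id, kind, status):
    """Insert-if-not-exists: a row whose EventId already exists is left untouched."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO AuditLog (EventId, Kind, Status) VALUES (?, ?, ?)",
            (event_id, kind, status),
        )


conn = open_audit_log()
event_id = str(uuid.uuid4())

ingest(conn, event_id, "ApiOutbound.SyncCall", "Success")
# A replay of the same EventId, here with a different (assumed) status string,
# must not modify the stored row.
ingest(conn, event_id, "ApiOutbound.SyncCall", "Failed")

stored = conn.execute("SELECT EventId, Status FROM AuditLog").fetchall()
```

`stored` holds a single row carrying the first status, which is the behaviour the removed "upsert-on-newer-status" wording would have contradicted.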