docs(audit): apply cross-bundle review fixes before merge

Final cross-bundle reviewer identified 7 inconsistencies that the per-bundle reviewers couldn't see; all fixed in one logical commit.

Critical:
- HighLevelReqs AL-3: drop 'then upsert-on-newer-status' — AuditLog is strictly append-only (correct for SiteCalls/Notifications, wrong for the immutable AuditLog shadow).
- Component-AuditLog Error rate KPI: align with HealthMonitoring's exclusion list (Success/Delivered/Enqueued) rather than just non-Success; otherwise every Delivered notification or Enqueued cached call would be counted as an error.

Important:
- Component-AuditLog line 154: ISiteAuditWriter -> IAuditWriter (canonical name per Commons and the rest of this doc).
- Component-AuditLog Central direct-write paragraph: convert remaining slash notation (ApiInbound/Completed, Notification/Attempt, Notification/Terminal) to dot notation used everywhere else.
- Component-ClusterInfrastructure: scope SiteCallAuditActor to reconciliation + KPIs + Retry/Discard relay; cached-telemetry ingest is AuditLogIngestActor's role per Combined Telemetry contract.
- Component-CentralUI Audit Log page: state the OperationalAudit read permission and the read-vs-export split (matching CLI doc).
- Component-NotificationOutbox: add never-fail-the-action invariant for dispatcher audit writes.

Minor:
- Component-InboundAPI: 'Non-blocking semantics' was ambiguous (could be read as async); reword to 'Fail-soft' — the write is still synchronous before flush, but failures are caught and don't change the response.
- Component-CLI: realign audit-query/audit-export flags to actually match the Central UI Audit Log filter set (channel, kind, status, site, instance, target, actor, correlation-id, errors-only); drop --user and --entity-id which are IAuditService concepts, not Audit Log columns.
- Component-AuditLog KPI tile names: 'Volume/Error rate/Backlog' -> 'Audit volume/Audit error rate/Audit backlog' (matches Central UI and Health Monitoring); drop the two orphan KPIs (Top inbound callers, Top outbound 5xx) that were never surfaced anywhere.
- Component-AuditLog Interactions: re-attribute DbOutbound emissions to ESG (where Database.* lives) with a note that Site Runtime is the API surface for scripts.
- HighLevelReqs AL-12: drop 'and reconciliation operations' (CLI has no reconcile command; reconciliation is an internal self-healing pull). Add note that verify-chain becomes operational once AL-11's hash chain ships.
@@ -151,7 +151,7 @@ writers — all idempotent on `EventId`.
 The component completing a script-trust-boundary action (External System
 Gateway, Database layer, Store-and-Forward Engine) builds an `AuditEvent` with a
 fresh `EventId` (Guid v4) and `OccurredAtUtc = UtcNow`, then appends it to the
-site-local `AuditLog` SQLite via `ISiteAuditWriter` with
+site-local `AuditLog` SQLite via `IAuditWriter` with
 `ForwardState = 'Pending'`. The append is a single-statement INSERT and is
 durable in microseconds; control returns to the script with no central
 round-trip on the hot path.
@@ -178,10 +178,10 @@ pattern as Site Call Audit's reconciliation of `SiteCalls`.
 ### Central direct-write (central-originated events)
 
 Events originating at central never touch site SQLite. Inbound API writes one
-`ApiInbound`/`Completed` row via `ICentralAuditWriter` synchronously inside the
+`ApiInbound.Completed` row via `ICentralAuditWriter` synchronously inside the
 request-handler middleware, before the HTTP response is flushed. The
-Notification Outbox dispatcher writes `Notification`/`Attempt` per delivery
-attempt and `Notification`/`Terminal` on terminal status. Central direct-writes
+Notification Outbox dispatcher writes `Notification.Attempt` per delivery
+attempt and `Notification.Terminal` on terminal status. Central direct-writes
 use the same insert-if-not-exists semantics keyed on `EventId`.
 
 ## Cached Operations — Combined Telemetry
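Illustrative note on the hunk above: the "insert-if-not-exists semantics keyed on `EventId`" can be sketched as below. This is a minimal sketch, not the project's writer. It assumes an in-memory SQLite table as a stand-in for the central store, and the column set, `new_event_id`, and `append_event` are hypothetical names chosen for the example.

```python
import sqlite3
import uuid

# Hypothetical, reduced schema: the real AuditLog has more columns.
SCHEMA = """
CREATE TABLE AuditLog (
    EventId       TEXT PRIMARY KEY,
    Kind          TEXT NOT NULL,
    Status        TEXT NOT NULL,
    OccurredAtUtc TEXT NOT NULL
)
"""


def new_event_id() -> str:
    """Fresh random (version 4) identifier, as the doc describes for EventId."""
    return str(uuid.uuid4())


def append_event(conn, event_id, kind, status, occurred_at_utc) -> bool:
    """Insert the row unless EventId already exists.

    Returns True when a row was inserted and False when the write was a
    duplicate. An existing row is never updated, so replays are harmless.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO AuditLog (EventId, Kind, Status, OccurredAtUtc) "
        "VALUES (?, ?, ?, ?)",
        (event_id, kind, status, occurred_at_utc),
    )
    conn.commit()
    return cur.rowcount == 1
```

Replaying the same `EventId` with a different status leaves the first row untouched, which is the append-only behaviour the AL-3 fix in this commit describes.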
@@ -291,11 +291,9 @@ MS SQL for direct-write events). Unredacted secrets never persist.
 
 Point-in-time, computed from the central `AuditLog` table; global and per-site.
 
-- **Volume** — events/min.
-- **Error rate** — % non-`Success` rows, rolling 5 min.
-- **Backlog** — sum of `Pending` site rows across sites.
-- **Top inbound callers** — top-10 `Actor` by request count, last 1 h.
-- **Top outbound 5xx** — top-10 `Target` by 5xx-status count, last 1 h.
+- **Audit volume** — events/min landing in the central `AuditLog`; global plus per-site sparkline.
+- **Audit error rate** — % of central `AuditLog` rows with `Status` NOT IN (`Success`, `Delivered`, `Enqueued`) over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, transient failures, parked deliveries) — NOT audit-writer health, which surfaces separately via `CentralAuditWriteFailures` and `AuditRedactionFailure`.
+- **Audit backlog** — sum of `Pending` site rows across sites; click drills into a per-site breakdown.
 
 [Notification Outbox](Component-NotificationOutbox.md) and
 [Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected — they remain
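Illustrative note on the hunk above: the difference between the old and new error-rate definitions can be checked with a small calculation. This is a sketch under assumptions: rows are modelled as `(timestamp, status)` pairs, the window is applied to the event timestamp, and the `Failed` status used in the usage note is a hypothetical value that does not come from the docs.

```python
from datetime import datetime, timedelta, timezone

# Statuses that do NOT count as errors, per the revised KPI definition.
NON_ERROR_STATUSES = frozenset({"Success", "Delivered", "Enqueued"})


def audit_error_rate(rows, now, window=timedelta(minutes=5)):
    """Percentage of rows inside the rolling window whose status is an error.

    rows: iterable of (timestamp: datetime, status: str) pairs.
    Returns None when no rows fall inside the window.
    """
    cutoff = now - window
    in_window = [status for ts, status in rows if cutoff <= ts <= now]
    if not in_window:
        return None
    errors = sum(1 for status in in_window if status not in NON_ERROR_STATUSES)
    return 100.0 * errors / len(in_window)
```

With one `Success`, one `Delivered`, one `Enqueued`, and one `Failed` row in the window, this yields 25%, whereas the old "non-`Success`" rule would have reported 75%.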
@@ -355,9 +353,7 @@ global value in v1; per-channel overrides are deferred to v1.x.
 emits `ApiOutbound.SyncCall` rows on every sync `Call()`. For `CachedCall`,
 emits the combined cached telemetry packet (audit row + operational update)
 per Cached Operations — Combined Telemetry.
-- **[Site Runtime (#3)](Component-SiteRuntime.md) — Database layer** — emits
-`DbOutbound.SyncWrite`, `DbOutbound.SyncRead`, and the cached-write variants
-via the same combined-telemetry path.
+- **[External System Gateway (#7)](Component-ExternalSystemGateway.md) — Database layer** — the database access modes inside ESG emit `DbOutbound.SyncWrite` and `DbOutbound.SyncRead` on script-initiated `Connection()` calls; `Database.CachedWrite` emits the cached-write lifecycle rows via the combined-telemetry packet (same path as `ApiOutbound.Cached*`). Site Runtime is the API surface that exposes the `Database.*` calls to scripts; the audit emission itself lives in ESG.
 - **[Inbound API (#14)](Component-InboundAPI.md)** — emits one
 `ApiInbound.Completed` row per request from request-handler middleware,
 written directly to central via `ICentralAuditWriter` before the response is
@@ -187,14 +187,15 @@ require the `OperationalAudit` permission; `audit export` additionally requires
 exit code 2) on denial.
 
 ```
-scadalink audit query --site <s> --since <t> [--until <t>] [--kind <k>] [--user <u>] [--entity-id <id>] [--correlation-id <id>] [--status <s>] [--page <n>] [--page-size <n>]
-scadalink audit export --since <t> --until <t> --format csv|jsonl|parquet --output <path> [--site <s>] [--kind <k>]
+scadalink audit query --since <t> [--until <t>] [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--instance <i>] [--target <t>] [--actor <a>] [--correlation-id <id>] [--errors-only] [--page <n>] [--page-size <n>]
+scadalink audit export --since <t> --until <t> --format csv|jsonl|parquet --output <path> [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--target <t>] [--actor <a>]
 scadalink audit verify-chain --month <YYYY-MM>
 ```
 
 - `audit query` — filtered query against the central `AuditLog` table, matching the
-Central UI filter set (site, time range, audit kind, user, entity, correlation ID,
-status, paging). Results stream as JSON (default) or table.
+Central UI Audit Log page filter set (time range, channel, kind, status, site,
+instance/script, target, actor, correlation ID, errors-only). Results stream as
+JSON (default) or table.
 - `audit export` — server-side streaming export of the central `AuditLog` to the
 requested format (`csv`, `jsonl`, `parquet`) written to `--output`. The server
 streams rows rather than materializing them in memory; the CLI writes bytes
@@ -157,7 +157,7 @@ Central cluster only. Sites have no user interface.
 ### Audit Log (Admin / Audit Role)
 - Lives under a **new top-level "Audit" nav group** (sibling to Notifications). In v1 the Audit nav group contains this single Audit Log page; the pre-existing Configuration Audit Log Viewer remains its own page below.
 - Global query / filter / drilldown over the central `AuditLog` table maintained by the Audit Log component (#23). Read-only — the table is append-only, so there are no edit actions on rows.
-- Per-site row scoping reuses the existing site-permission model from Security & Auth: a user sees only rows for sites they are authorized to operate. Bulk export (see below) requires the additional `AuditExport` permission.
+- Read access to the page requires the `OperationalAudit` permission (Security & Auth #10). Per-site row scoping reuses the existing site-permission model: a user sees only rows for sites they are authorized to operate. Bulk export (see below) additionally requires `AuditExport`. The split mirrors the CLI's permission model (see Component-CLI.md).
 - **Filter bar** (top of page, collapses to a single row when not focused):
 - Time range — relative (15m / 1h / 24h / 7d) or custom.
 - Channel — multi-select: `ApiOutbound`, `DbOutbound`, `Notification`, `ApiInbound`.
@@ -61,7 +61,7 @@ Akka.NET cluster singletons run on the active node of their cluster and migrate
 ### Central singletons (active central node)
 
 - **`NotificationOutboxActor`** — owned by Notification Outbox (#21). Drives the central notification dispatch loop against the `Notifications` table.
-- **`SiteCallAuditActor`** — owned by Site Call Audit (#22). Ingests `CachedCall` / `CachedWrite` telemetry and reconciliation pulls into the `SiteCalls` table.
+- **`SiteCallAuditActor`** — owned by Site Call Audit (#22). Owns the operational `SiteCalls` table: drives periodic reconciliation pulls for `CachedCall` / `CachedWrite` lifecycle, computes KPIs, and relays operator Retry/Discard actions to the owning site. Note: ingest of cached-call telemetry is performed by `AuditLogIngestActor` (#23) in one transaction with the immutable `AuditLog` insert — see Component-AuditLog.md, Cached Operations — Combined Telemetry.
 - **`AuditLogIngestActor`** — owned by Audit Log (#23). Receives gRPC telemetry batches of `AuditEvent` rows from sites and performs insert-if-not-exists on `EventId` against the central `AuditLog` table. For cached-call telemetry (which carries both audit-row content and operational-state fields in a single packet), the ingest performs the `AuditLog` insert and the `SiteCalls` upsert in **one transaction** — see Component-AuditLog.md for the combined-telemetry contract.
 - **`SiteAuditReconciliationActor`** — owned by Audit Log (#23). Periodic per-site pull (default every 5 minutes) that self-heals missed audit telemetry by asking each site for its oldest `ForwardState = 'Pending'` row and issuing a `PullAuditEvents(sinceUtc, batchSize)` when a non-draining backlog is detected.
 - **`AuditLogPurgeActor`** — owned by Audit Log (#23). Daily partition-switch purge against `ps_AuditLog_Month`; switches out any partition older than `AuditLog:RetentionDays` and emits an `AuditLog:Purged` event. Also rolls the partition scheme forward each month so the next month's partition exists ahead of time.
@@ -118,7 +118,7 @@ API method scripts are compiled at central startup — all method definitions ar
 
 - **Every request — success or failure — emits one `ApiInbound.Completed` row** to `ICentralAuditWriter` from request middleware before the HTTP response is flushed. The row captures the API key **name** (never the key material), remote IP, user-agent, response status, duration, and truncated request/response bodies per the Audit Log capture policy (see Component-AuditLog.md, Payload Capture Policy). This supersedes the earlier failures-only stance: operational API traffic is now part of the centralized audit log, so configuration changes and call activity share a single retention/query surface.
 - Script execution errors (500 responses) remain captured on the same `ApiInbound.Completed` row (response status + error fields) rather than emitting a separate failure-only event.
-- **Non-blocking semantics.** Middleware audit-write failures are logged and metricked (see Health Monitoring #11 — `CentralAuditWriteFailures`) but **never affect the HTTP response**: a failed audit append does not turn a successful API call into an error returned to the caller.
+- **Fail-soft semantics.** The audit write is synchronous (inline before the response is flushed), but failures are caught: a write that throws is logged and increments `CentralAuditWriteFailures` (see Health Monitoring #11) and the request still returns its normal HTTP response. A failed audit append never turns a successful API call into an error returned to the caller.
 - No rate limiting — this is a private API in a controlled industrial environment with a known set of callers. Misbehaving callers are handled operationally (disable the API key).
 
 ## Request Flow
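Illustrative note on the hunk above: "fail-soft" here means a synchronous write whose exceptions are swallowed, which is different from an asynchronous, fire-and-forget write. A minimal sketch of that shape follows. `handle_request`, `FailureCounter`, and the callable parameters are hypothetical names for this example, and the counter only stands in for the `CentralAuditWriteFailures` metric.

```python
import logging

logger = logging.getLogger("audit")


class FailureCounter:
    """Stand-in for a health metric such as CentralAuditWriteFailures."""

    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1


def handle_request(handler, audit_write, failures):
    """Run the handler, then write the audit row inline before returning.

    The audit write is synchronous: the response is not returned until the
    write has either completed or failed. A failure is logged and counted,
    and the original response is returned unchanged.
    """
    response = handler()
    try:
        audit_write(response)
    except Exception:
        logger.exception("audit write failed; response is unaffected")
        failures.increment()
    return response
```

The design choice is that the caller's outcome depends only on `handler()`; the audit path can degrade without turning a successful call into an error.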
@@ -110,6 +110,8 @@ Each delivery attempt also writes a `Notification.Attempt` row to the central `A
 
 The operational `Notifications` table remains the **source of truth** for the dispatcher and for Retry/Discard actions; the `AuditLog` rows are immutable shadows. Operator Retry/Discard still mutates only the `Notifications` row, and each transition emits the corresponding `Notification.Attempt` / `Notification.Terminal` audit row.
 
+**Audit-write failure never affects delivery.** If the `ICentralAuditWriter` direct-write fails (transient DB error, schema lock, etc.) the dispatcher logs the failure and increments the `CentralAuditWriteFailures` health metric (see Health Monitoring #11), but the delivery attempt's outcome on the `Notifications` row stands. The audit row is recovered by re-emission on the next dispatcher tick or by the on-startup reconciliation sweep; central never aborts a notification because audit failed.
+
 ## Delivery Adapters
 
 A delivery adapter implementing `INotificationDeliveryAdapter` is registered per `Type`. Each `Deliver(...)` call returns one of `success | transient failure | permanent failure`, mirroring the External System Gateway error-classification pattern.
@@ -448,7 +448,7 @@ Sections 10.1–10.4 cover **configuration-database audit** (config-mutating use
 
 - **AL-1**: The system maintains an **append-only** central Audit Log recording every script-trust-boundary action — outbound external system calls (sync `Call` and `CachedCall`), outbound database operations (sync `Connection` access and `CachedWrite`), notifications, and inbound API method invocations.
 - **AL-2**: For cached calls and notifications, the Audit Log captures **one row per lifecycle event** (e.g., enqueued, retrying, delivered, parked, discarded), not a single mutable row per operation.
-- **AL-3**: Site-originated events are appended to a **site-local SQLite hot-path** synchronously with the action, then **forwarded to central via gRPC telemetry**; central ingest is **idempotent on `EventId`** (insert-if-not-exists then upsert-on-newer-status).
+- **AL-3**: Site-originated events are appended to a **site-local SQLite hot-path** synchronously with the action, then **forwarded to central via gRPC telemetry**; central ingest is **idempotent on `EventId`** (insert-if-not-exists; the `AuditLog` table is strictly append-only, so rows are never updated after insert).
 - **AL-4**: A periodic **central→site reconciliation pull** detects and replays any telemetry events that were missed (e.g., during a central outage), making the central Audit Log eventually consistent with sites.
 - **AL-5**: Each row captures **payload metadata** (target, method, status, timings, correlation IDs) plus a **truncated request/response body** — **8 KB default**, expanded to **64 KB on error** outcomes.
 - **AL-6**: **HTTP headers are redacted by default**; **SQL parameter values are captured by default**. Per-target **redaction opt-in** is configurable on external systems, database connections, and inbound API methods.
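Illustrative note on AL-5 in the hunk above: the two truncation limits can be sketched as a single function. This is an assumption-laden sketch: it treats 1 KB as 1024 bytes, truncates on raw bytes rather than characters, and `truncate_body` is a hypothetical name that is not taken from the docs.

```python
# Assumed limits: the docs give "8 KB" and "64 KB"; 1024-byte KB is assumed.
DEFAULT_LIMIT_BYTES = 8 * 1024
ERROR_LIMIT_BYTES = 64 * 1024


def truncate_body(body: bytes, is_error: bool) -> bytes:
    """Return the captured portion of a request/response body.

    Error outcomes keep up to 64 KB; all other outcomes keep up to 8 KB.
    Bodies shorter than the limit are returned unchanged.
    """
    limit = ERROR_LIMIT_BYTES if is_error else DEFAULT_LIMIT_BYTES
    return body[:limit]
```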
@@ -457,7 +457,7 @@ Sections 10.1–10.4 cover **configuration-database audit** (config-mutating use
 - **AL-9**: The site SQLite Audit Log is purged only when `ForwardState ∈ {Forwarded, Reconciled}` — i.e., a row must be either confirmed-forwarded *or* confirmed-reconciled before it can be removed. A central outage therefore **cannot cause audit loss at sites**.
 - **AL-10**: The Central UI exposes an **Audit Log page** with a cross-channel filter (by site, target, status, time range, correlation ID), plus **drill-ins from existing operational pages** (Site Calls, Notification Outbox, Inbound API).
 - **AL-11**: Append-only semantics are **enforced via DB roles** (no UPDATE/DELETE granted on the `AuditLog` table to application accounts); a **tamper-evidence hash chain is deferred to v1.x**.
-- **AL-12**: The CLI provides a `scadalink audit` command group for query, export, and reconciliation operations against the central Audit Log.
+- **AL-12**: The CLI provides a `scadalink audit` command group for query, export, and hash-chain verification (verify-chain becomes operational once AL-11's hash chain ships) against the central Audit Log.
 
 ## 11. Health Monitoring
 
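Illustrative note on AL-9 in the hunk above: the purge guard amounts to a `DELETE` that can never match a `Pending` row. The sketch below assumes a reduced site-local schema, ISO-8601 UTC timestamps stored as text, and a hypothetical age cutoff parameter; none of these details are specified in the docs.

```python
import sqlite3

# Hypothetical, reduced site-local schema for the example.
SITE_SCHEMA = """
CREATE TABLE AuditLog (
    EventId       TEXT PRIMARY KEY,
    ForwardState  TEXT NOT NULL,
    OccurredAtUtc TEXT NOT NULL
)
"""

# Only rows confirmed at central may be removed from the site.
PURGEABLE_STATES = ("Forwarded", "Reconciled")


def purge_site_audit(conn, older_than_utc: str) -> int:
    """Delete confirmed rows older than the cutoff; return the number removed.

    Rows still in 'Pending' are never deleted, whatever their age, so a
    central outage cannot cause audit loss at the site.
    """
    cur = conn.execute(
        "DELETE FROM AuditLog "
        "WHERE ForwardState IN (?, ?) AND OccurredAtUtc < ?",
        (*PURGEABLE_STATES, older_than_utc),
    )
    conn.commit()
    return cur.rowcount
```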