diff --git a/docs/requirements/Component-AuditLog.md b/docs/requirements/Component-AuditLog.md
index dcb0e0b..13417b9 100644
--- a/docs/requirements/Component-AuditLog.md
+++ b/docs/requirements/Component-AuditLog.md
@@ -151,7 +151,7 @@ writers — all idempotent on `EventId`.
 The component completing a script-trust-boundary action (External System
 Gateway, Database layer, Store-and-Forward Engine) builds an `AuditEvent` with
 a fresh `EventId` (Guid v4) and `OccurredAtUtc = UtcNow`, then appends it to the
-site-local `AuditLog` SQLite via `ISiteAuditWriter` with
+site-local `AuditLog` SQLite via `IAuditWriter` with
 `ForwardState = 'Pending'`. The append is a single-statement INSERT and is
 durable in microseconds; control returns to the script with no central
 round-trip on the hot path.
@@ -178,10 +178,10 @@ pattern as Site Call Audit's reconciliation of `SiteCalls`.
 ### Central direct-write (central-originated events)
 
 Events originating at central never touch site SQLite. Inbound API writes one
-`ApiInbound`/`Completed` row via `ICentralAuditWriter` synchronously inside the
+`ApiInbound.Completed` row via `ICentralAuditWriter` synchronously inside the
 request-handler middleware, before the HTTP response is flushed. The
-Notification Outbox dispatcher writes `Notification`/`Attempt` per delivery
-attempt and `Notification`/`Terminal` on terminal status. Central direct-writes
+Notification Outbox dispatcher writes `Notification.Attempt` per delivery
+attempt and `Notification.Terminal` on terminal status. Central direct-writes
 use the same insert-if-not-exists semantics keyed on `EventId`.
 
 ## Cached Operations — Combined Telemetry
@@ -291,11 +291,9 @@ MS SQL for direct-write events). Unredacted secrets never persist.
 
 Point-in-time, computed from the central `AuditLog` table; global and per-site.
 
-- **Volume** — events/min.
-- **Error rate** — % non-`Success` rows, rolling 5 min.
-- **Backlog** — sum of `Pending` site rows across sites.
-- **Top inbound callers** — top-10 `Actor` by request count, last 1 h.
-- **Top outbound 5xx** — top-10 `Target` by 5xx-status count, last 1 h.
+- **Audit volume** — events/min landing in the central `AuditLog`; global plus per-site sparkline.
+- **Audit error rate** — % of central `AuditLog` rows with `Status` NOT IN (`Success`, `Delivered`, `Enqueued`) over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, transient failures, parked deliveries) — NOT audit-writer health, which surfaces separately via `CentralAuditWriteFailures` and `AuditRedactionFailure`.
+- **Audit backlog** — sum of `Pending` site rows across sites; click drills into a per-site breakdown.
 
 [Notification Outbox](Component-NotificationOutbox.md) and
 [Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected — they remain
@@ -355,9 +353,7 @@ global value in v1; per-channel overrides are deferred to v1.x.
   emits `ApiOutbound.SyncCall` rows on every sync `Call()`. For `CachedCall`,
   emits the combined cached telemetry packet (audit row + operational update)
   per Cached Operations — Combined Telemetry.
-- **[Site Runtime (#3)](Component-SiteRuntime.md) — Database layer** — emits
-  `DbOutbound.SyncWrite`, `DbOutbound.SyncRead`, and the cached-write variants
-  via the same combined-telemetry path.
+- **[External System Gateway (#7)](Component-ExternalSystemGateway.md) — Database layer** — the database access modes inside ESG emit `DbOutbound.SyncWrite` and `DbOutbound.SyncRead` on script-initiated `Connection()` calls; `Database.CachedWrite` emits the cached-write lifecycle rows via the combined-telemetry packet (same path as `ApiOutbound.Cached*`). Site Runtime is the API surface that exposes the `Database.*` calls to scripts; the audit emission itself lives in ESG.
 - **[Inbound API (#14)](Component-InboundAPI.md)** — emits one
   `ApiInbound.Completed` row per request from request-handler middleware,
   written directly to central via `ICentralAuditWriter` before the response is
diff --git a/docs/requirements/Component-CLI.md b/docs/requirements/Component-CLI.md
index 6f04cdc..c002575 100644
--- a/docs/requirements/Component-CLI.md
+++ b/docs/requirements/Component-CLI.md
@@ -187,14 +187,15 @@ require the `OperationalAudit` permission; `audit export` additionally requires
 exit code 2) on denial.
 
 ```
-scadalink audit query --site --since [--until ] [--kind ] [--user ] [--entity-id ] [--correlation-id ] [--status ] [--page ] [--page-size ]
-scadalink audit export --since --until --format csv|jsonl|parquet --output [--site ] [--kind ]
+scadalink audit query --since [--until ] [--channel ] [--kind ] [--status ] [--site ] [--instance ] [--target ] [--actor ] [--correlation-id ] [--errors-only] [--page ] [--page-size ]
+scadalink audit export --since --until --format csv|jsonl|parquet --output [--channel ] [--kind ] [--status ] [--site ] [--target ] [--actor ]
 scadalink audit verify-chain --month
 ```
 
 - `audit query` — filtered query against the central `AuditLog` table, matching the
-  Central UI filter set (site, time range, audit kind, user, entity, correlation ID,
-  status, paging). Results stream as JSON (default) or table.
+  Central UI Audit Log page filter set (time range, channel, kind, status, site,
+  instance/script, target, actor, correlation ID, errors-only). Results stream as
+  JSON (default) or table.
 - `audit export` — server-side streaming export of the central `AuditLog` to the
   requested format (`csv`, `jsonl`, `parquet`) written to `--output`.
   The server streams rows rather than materializing them in memory; the CLI writes bytes
diff --git a/docs/requirements/Component-CentralUI.md b/docs/requirements/Component-CentralUI.md
index 19e5e19..a43cfa3 100644
--- a/docs/requirements/Component-CentralUI.md
+++ b/docs/requirements/Component-CentralUI.md
@@ -157,7 +157,7 @@ Central cluster only. Sites have no user interface.
 ### Audit Log (Admin / Audit Role)
 - Lives under a **new top-level "Audit" nav group** (sibling to Notifications). In v1 the Audit nav group contains this single Audit Log page; the pre-existing Configuration Audit Log Viewer remains its own page below.
 - Global query / filter / drilldown over the central `AuditLog` table maintained by the Audit Log component (#23). Read-only — the table is append-only, so there are no edit actions on rows.
-- Per-site row scoping reuses the existing site-permission model from Security & Auth: a user sees only rows for sites they are authorized to operate. Bulk export (see below) requires the additional `AuditExport` permission.
+- Read access to the page requires the `OperationalAudit` permission (Security & Auth #10). Per-site row scoping reuses the existing site-permission model: a user sees only rows for sites they are authorized to operate. Bulk export (see below) additionally requires `AuditExport`. The split mirrors the CLI's permission model (see Component-CLI.md).
 - **Filter bar** (top of page, collapses to a single row when not focused):
   - Time range — relative (15m / 1h / 24h / 7d) or custom.
   - Channel — multi-select: `ApiOutbound`, `DbOutbound`, `Notification`, `ApiInbound`.
diff --git a/docs/requirements/Component-ClusterInfrastructure.md b/docs/requirements/Component-ClusterInfrastructure.md
index 85400a6..895641d 100644
--- a/docs/requirements/Component-ClusterInfrastructure.md
+++ b/docs/requirements/Component-ClusterInfrastructure.md
@@ -61,7 +61,7 @@ Akka.NET cluster singletons run on the active node of their cluster and migrate
 ### Central singletons (active central node)
 
 - **`NotificationOutboxActor`** — owned by Notification Outbox (#21). Drives the central notification dispatch loop against the `Notifications` table.
-- **`SiteCallAuditActor`** — owned by Site Call Audit (#22). Ingests `CachedCall` / `CachedWrite` telemetry and reconciliation pulls into the `SiteCalls` table.
+- **`SiteCallAuditActor`** — owned by Site Call Audit (#22). Owns the operational `SiteCalls` table: drives periodic reconciliation pulls for `CachedCall` / `CachedWrite` lifecycle, computes KPIs, and relays operator Retry/Discard actions to the owning site. Note: ingest of cached-call telemetry is performed by `AuditLogIngestActor` (#23) in one transaction with the immutable `AuditLog` insert — see Component-AuditLog.md, Cached Operations — Combined Telemetry.
 - **`AuditLogIngestActor`** — owned by Audit Log (#23). Receives gRPC telemetry batches of `AuditEvent` rows from sites and performs insert-if-not-exists on `EventId` against the central `AuditLog` table. For cached-call telemetry (which carries both audit-row content and operational-state fields in a single packet), the ingest performs the `AuditLog` insert and the `SiteCalls` upsert in **one transaction** — see Component-AuditLog.md for the combined-telemetry contract.
 - **`SiteAuditReconciliationActor`** — owned by Audit Log (#23). Periodic per-site pull (default every 5 minutes) that self-heals missed audit telemetry by asking each site for its oldest `ForwardState = 'Pending'` row and issuing a `PullAuditEvents(sinceUtc, batchSize)` when a non-draining backlog is detected.
 - **`AuditLogPurgeActor`** — owned by Audit Log (#23). Daily partition-switch purge against `ps_AuditLog_Month`; switches out any partition older than `AuditLog:RetentionDays` and emits an `AuditLog:Purged` event. Also rolls the partition scheme forward each month so the next month's partition exists ahead of time.
diff --git a/docs/requirements/Component-InboundAPI.md b/docs/requirements/Component-InboundAPI.md
index 9870266..7844f1a 100644
--- a/docs/requirements/Component-InboundAPI.md
+++ b/docs/requirements/Component-InboundAPI.md
@@ -118,7 +118,7 @@ API method scripts are compiled at central startup — all method definitions ar
 
 - **Every request — success or failure — emits one `ApiInbound.Completed` row** to `ICentralAuditWriter` from request middleware before the HTTP response is flushed. The row captures the API key **name** (never the key material), remote IP, user-agent, response status, duration, and truncated request/response bodies per the Audit Log capture policy (see Component-AuditLog.md, Payload Capture Policy). This supersedes the earlier failures-only stance: operational API traffic is now part of the centralized audit log, so configuration changes and call activity share a single retention/query surface.
 - Script execution errors (500 responses) remain captured on the same `ApiInbound.Completed` row (response status + error fields) rather than emitting a separate failure-only event.
-- **Non-blocking semantics.** Middleware audit-write failures are logged and metricked (see Health Monitoring #11 — `CentralAuditWriteFailures`) but **never affect the HTTP response**: a failed audit append does not turn a successful API call into an error returned to the caller.
+- **Fail-soft semantics.** The audit write is synchronous (inline before the response is flushed), but failures are caught: a write that throws is logged and increments `CentralAuditWriteFailures` (see Health Monitoring #11), and the request still returns its normal HTTP response. A failed audit append never turns a successful API call into an error returned to the caller.
 - No rate limiting — this is a private API in a controlled industrial environment with a known set of callers. Misbehaving callers are handled operationally (disable the API key).
 
 ## Request Flow
diff --git a/docs/requirements/Component-NotificationOutbox.md b/docs/requirements/Component-NotificationOutbox.md
index 020d6e4..e7cfb37 100644
--- a/docs/requirements/Component-NotificationOutbox.md
+++ b/docs/requirements/Component-NotificationOutbox.md
@@ -110,6 +110,8 @@ Each delivery attempt also writes a `Notification.Attempt` row to the central `A
 
 The operational `Notifications` table remains the **source of truth** for the dispatcher and for Retry/Discard actions; the `AuditLog` rows are immutable shadows. Operator Retry/Discard still mutates only the `Notifications` row, and each transition emits the corresponding `Notification.Attempt` / `Notification.Terminal` audit row.
 
+**Audit-write failure never affects delivery.** If the `ICentralAuditWriter` direct-write fails (transient DB error, schema lock, etc.), the dispatcher logs the failure and increments the `CentralAuditWriteFailures` health metric (see Health Monitoring #11), but the delivery attempt's outcome on the `Notifications` row stands. The audit row is recovered by re-emission on the next dispatcher tick or by the on-startup reconciliation sweep; central never aborts a notification because audit failed.
+
 ## Delivery Adapters
 
 A delivery adapter implementing `INotificationDeliveryAdapter` is registered per `Type`. Each `Deliver(...)` call returns one of `success | transient failure | permanent failure`, mirroring the External System Gateway error-classification pattern.
diff --git a/docs/requirements/HighLevelReqs.md b/docs/requirements/HighLevelReqs.md
index 0871577..06853cf 100644
--- a/docs/requirements/HighLevelReqs.md
+++ b/docs/requirements/HighLevelReqs.md
@@ -448,7 +448,7 @@ Sections 10.1–10.4 cover **configuration-database audit** (config-mutating use
 
 - **AL-1**: The system maintains an **append-only** central Audit Log recording every script-trust-boundary action — outbound external system calls (sync `Call` and `CachedCall`), outbound database operations (sync `Connection` access and `CachedWrite`), notifications, and inbound API method invocations.
 - **AL-2**: For cached calls and notifications, the Audit Log captures **one row per lifecycle event** (e.g., enqueued, retrying, delivered, parked, discarded), not a single mutable row per operation.
-- **AL-3**: Site-originated events are appended to a **site-local SQLite hot-path** synchronously with the action, then **forwarded to central via gRPC telemetry**; central ingest is **idempotent on `EventId`** (insert-if-not-exists then upsert-on-newer-status).
+- **AL-3**: Site-originated events are appended to a **site-local SQLite hot-path** synchronously with the action, then **forwarded to central via gRPC telemetry**; central ingest is **idempotent on `EventId`** (insert-if-not-exists; the `AuditLog` table is strictly append-only, so rows are never updated after insert).
 - **AL-4**: A periodic **central→site reconciliation pull** detects and replays any telemetry events that were missed (e.g., during a central outage), making the central Audit Log eventually consistent with sites.
 - **AL-5**: Each row captures **payload metadata** (target, method, status, timings, correlation IDs) plus a **truncated request/response body** — **8 KB default**, expanded to **64 KB on error** outcomes.
 - **AL-6**: **HTTP headers are redacted by default**; **SQL parameter values are captured by default**. Per-target **redaction opt-in** is configurable on external systems, database connections, and inbound API methods.
@@ -457,7 +457,7 @@ Sections 10.1–10.4 cover **configuration-database audit** (config-mutating use
 - **AL-9**: The site SQLite Audit Log is purged only when `ForwardState ∈ {Forwarded, Reconciled}` — i.e., a row must be either confirmed-forwarded *or* confirmed-reconciled before it can be removed. A central outage therefore **cannot cause audit loss at sites**.
 - **AL-10**: The Central UI exposes an **Audit Log page** with a cross-channel filter (by site, target, status, time range, correlation ID), plus **drill-ins from existing operational pages** (Site Calls, Notification Outbox, Inbound API).
 - **AL-11**: Append-only semantics are **enforced via DB roles** (no UPDATE/DELETE granted on the `AuditLog` table to application accounts); a **tamper-evidence hash chain is deferred to v1.x**.
-- **AL-12**: The CLI provides a `scadalink audit` command group for query, export, and reconciliation operations against the central Audit Log.
+- **AL-12**: The CLI provides a `scadalink audit` command group for query, export, and hash-chain verification against the central Audit Log; `verify-chain` becomes operational once AL-11's hash chain ships.
 
 ## 11. Health Monitoring
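
The two ingest rules this change tightens can be sketched in a few lines of Python: AL-3's insert-if-not-exists on `EventId` with no later updates, and the revised **Audit error rate** KPI (`Status` NOT IN `Success`, `Delivered`, `Enqueued` over a rolling 5-minute window). This is a minimal illustration under stated assumptions, not the component's implementation. `CentralAuditLog`, `AuditEvent`, their field names, and the `Parked` status value are hypothetical. A dict stands in for the central MS SQL `AuditLog` table. Measuring the window against `OccurredAtUtc` rather than ingest time is an assumption the spec does not settle.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
import uuid

# Statuses the "Audit error rate" KPI does not count as errors (per the KPI text).
NON_ERROR_STATUSES = frozenset({"Success", "Delivered", "Enqueued"})


@dataclass(frozen=True)  # frozen: a row object cannot be modified once created
class AuditEvent:
    # Hypothetical field names; channel/kind/status values below are illustrative.
    event_id: uuid.UUID
    site: str
    channel: str
    kind: str
    status: str
    occurred_at_utc: datetime


class CentralAuditLog:
    """Hypothetical in-memory stand-in for the central AuditLog table."""

    def __init__(self) -> None:
        self._rows: dict[uuid.UUID, AuditEvent] = {}

    def insert_if_not_exists(self, event: AuditEvent) -> bool:
        """Append a row unless its EventId is already present.

        Returns True when a row was inserted and False for a duplicate
        (e.g. a batch replayed by the reconciliation pull). A duplicate
        never overwrites the stored row, even if its fields differ.
        """
        if event.event_id in self._rows:
            return False
        self._rows[event.event_id] = event
        return True

    def error_rate(self, now_utc: datetime,
                   window: timedelta = timedelta(minutes=5)) -> float:
        """Percent of rows in [now_utc - window, now_utc] with an error status."""
        recent = [r for r in self._rows.values()
                  if now_utc - window <= r.occurred_at_utc <= now_utc]
        if not recent:
            return 0.0
        errors = sum(1 for r in recent if r.status not in NON_ERROR_STATUSES)
        return 100.0 * errors / len(recent)


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    log = CentralAuditLog()
    ok = AuditEvent(uuid.uuid4(), "site-a", "ApiOutbound", "SyncCall",
                    "Success", now - timedelta(minutes=1))
    parked = AuditEvent(uuid.uuid4(), "site-a", "ApiOutbound", "CachedCall",
                        "Parked", now - timedelta(minutes=2))
    stale = AuditEvent(uuid.uuid4(), "site-b", "ApiOutbound", "SyncCall",
                       "Success", now - timedelta(minutes=10))  # outside window
    for e in (ok, parked, stale):
        log.insert_if_not_exists(e)
    # A replay carrying the same EventId is ignored; the stored "Parked" row stands.
    replay = AuditEvent(parked.event_id, "site-a", "ApiOutbound", "CachedCall",
                        "Delivered", parked.occurred_at_utc)
    print(log.insert_if_not_exists(replay))  # False
    print(log.error_rate(now))               # 50.0
```

The dictionary lookup only models the contract. Per the diff, the real duplicate check is the ingest actor's insert-if-not-exists keyed on `EventId`, and DB roles withhold UPDATE/DELETE. The sketch shows why replayed telemetry can neither double-count a row nor rewrite its status after insert.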