Merge branch 'feature/audit-log-docs': centralized Audit Log (#23) design

Adds new component #23 Audit Log: a central, append-only forensic +
operational record of every script-trust-boundary action — outbound API
calls (sync + cached), outbound DB operations (sync + cached, incl.
script-initiated reads), notifications, and inbound API requests.

Sits alongside the existing operational stores (Notifications #21 and
SiteCalls #22) without replacing them. Site-local SQLite hot-path append
+ best-effort gRPC telemetry + central reconciliation pull; cached calls
emit one combined telemetry packet that drives both the immutable
AuditLog insert and the operational SiteCalls upsert in a single
transaction. Central direct-write for Inbound API middleware and
Notification Outbox dispatcher events.

Key invariants:
- Strictly append-only at central (enforced via DB roles + CI grep
  guard); monthly partitioning, 365-day default retention via partition
  switch (no row-level deletes).
- Site SQLite purge requires ForwardState in {Forwarded, Reconciled};
  central outage cannot cause audit loss at sites.
- Audit-write failure never aborts the user-facing action.
- Payload: metadata + truncated bodies (8 KB default, 64 KB on errors);
  headers redacted by default, SQL parameter values captured by default
  with per-connection opt-out.
- New top-level Audit nav group in Central UI with drill-ins from
  Notifications, Site Calls, External Systems, Inbound API Keys, Sites,
  Instances.

Deferred to v1.x: hash-chain tamper evidence, Parquet archival,
per-channel retention overrides.

23 commits, 17 files changed (+1,419/-21). Component-AuditLog.md (new)
plus cross-references in 11 existing component docs, README,
HighLevelReqs (AL-1..AL-12), and CLAUDE.md.
Joseph Doherty
2026-05-20 09:00:30 -04:00
17 changed files with 1418 additions and 12 deletions


@@ -36,7 +36,7 @@ This project contains design documentation for a distributed SCADA system built
- Use `git diff` to review changes before committing.
- Commit related changes together with a descriptive message summarizing the design decision.
## Current Component List (22 components) ## Current Component List (23 components)
1. Template Engine — Template modeling, inheritance, composition, validation, flattening, diffs.
2. Deployment Manager — Central-side deployment pipeline, system-wide artifact deployment, instance lifecycle.
@@ -60,6 +60,7 @@ This project contains design documentation for a distributed SCADA system built
20. Traefik Proxy — Reverse proxy/load balancer fronting central cluster, active node routing via `/health/active`, automatic failover.
21. Notification Outbox — Central component ingesting store-and-forwarded notifications, `Notifications` audit table, dispatcher loop, retry/parking, delivery KPIs.
22. Site Call Audit — Central component auditing site cached calls (`CachedCall`/`CachedWrite`); `SiteCalls` audit table, telemetry ingest, reconciliation, KPIs, central→site Retry/Discard relay; sites remain the source of truth.
23. Audit Log — Central append-only AuditLog table spanning every script-trust-boundary action (outbound API sync+cached, outbound DB sync+cached, notifications, inbound API). Site SQLite hot-path + gRPC telemetry + reconciliation; combined telemetry with Site Call Audit; central direct-write for Notification Outbox dispatch + Inbound API; monthly partitioning, 365-day retention.
## Key Design Decisions (for context across sessions)
@@ -127,6 +128,18 @@ This project contains design documentation for a distributed SCADA system built
- Site Call Audit (#22): central `SiteCallAuditActor` singleton with a `SiteCalls` audit table (central MS SQL) fed by best-effort site telemetry plus periodic reconciliation pulls — an eventually-consistent mirror, NOT a dispatcher; cached-call delivery stays site-local. Ingest is insert-if-not-exists then upsert-on-newer-status.
- Central UI Site Calls page + central→site `RetryParkedOperation`/`DiscardParkedOperation` relay for parked cached calls; central never mutates the `SiteCalls` row directly.
### Centralized Audit Log
- Layered design — append-only `AuditLog` (#23) sits alongside operational `Notifications` (#21) and `SiteCalls` (#22), not replacing them.
- Scope = script trust boundary: outbound API (sync + cached), outbound DB (sync + cached), notifications, inbound API. Framework/internal traffic is explicitly excluded.
- One row per lifecycle event; cached calls produce 4+ rows per operation (`Submitted`, `Forwarded`, `Attempted`, `Delivered`/`Parked`/`Discarded`).
- Site SQLite hot-path first, then gRPC telemetry to central; ingest is idempotent on `EventId`; periodic reconciliation pull as fallback when telemetry is lost.
- Cached operations: site emits a single additively-extended `CachedCallTelemetry` packet carrying both audit events and operational state; central writes `AuditLog` + `SiteCalls` in one transaction.
- Payload cap 8 KB by default / 64 KB on error rows; auth headers redacted by default; SQL parameter values captured by default; per-target redaction opt-in.
- Audit-write failure NEVER aborts the user-facing action — audit is best-effort, the action's own success/failure path is authoritative.
- 365-day central retention with monthly partition-switch purge; 7-day site SQLite retention with a hard `ForwardState` invariant (no row purged until forwarded or reconciled).
- Append-only enforced via DB roles (writer role has INSERT only, no UPDATE/DELETE); hash-chain tamper evidence and Parquet archival are deferred to v1.x.
- Central UI: new top-level **Audit** nav group + Audit Log page, with drill-ins from Notifications, Site Calls, External Systems, Inbound API Keys, Sites, and Instances.
### Security & Auth
- Authentication: direct LDAP bind (username/password), no Kerberos/NTLM. LDAPS/StartTLS required.
- Cookie+JWT hybrid sessions: HttpOnly/Secure cookie carries an embedded JWT (HMAC-SHA256 shared symmetric key), 15-minute expiry with sliding refresh, 30-minute idle timeout. Cookies are the correct transport for Blazor Server (SignalR circuits).


@@ -56,6 +56,7 @@ This document serves as the master index for the SCADA system design. The system
| 20 | Traefik Proxy | [docs/requirements/Component-TraefikProxy.md](docs/requirements/Component-TraefikProxy.md) | Reverse proxy/load balancer fronting central cluster, active node routing via `/health/active`, automatic failover. |
| 21 | Notification Outbox | [docs/requirements/Component-NotificationOutbox.md](docs/requirements/Component-NotificationOutbox.md) | Central component ingesting store-and-forwarded notifications into the `Notifications` audit table, with `NotificationOutboxActor` singleton dispatcher, per-type delivery adapters, retry/parking, status tracking, daily purge, and delivery KPIs. |
| 22 | Site Call Audit | [docs/requirements/Component-SiteCallAudit.md](docs/requirements/Component-SiteCallAudit.md) | Central component auditing site cached calls (`ExternalSystem.CachedCall`/`Database.CachedWrite`) into the `SiteCalls` audit table, with `SiteCallAuditActor` singleton, telemetry ingest, periodic reconciliation, point-in-time KPIs, daily purge, and central→site Retry/Discard relay for parked calls. |
| 23 | Audit Log | [docs/requirements/Component-AuditLog.md](docs/requirements/Component-AuditLog.md) | New central append-only AuditLog spanning every script-trust-boundary action (outbound API sync+cached, outbound DB sync+cached, notifications, inbound API). Site-local SQLite hot-path append + gRPC telemetry + central reconciliation; combined telemetry packet with Site Call Audit; central direct-write for Notification Outbox dispatch + Inbound API middleware; monthly partitioning, 365-day default retention. |
### Reference Documentation
@@ -90,6 +91,17 @@ This document serves as the master index for the SCADA system design. The system
│ │ Mgmt │ ◄── CLI (ClusterClient) │
│ │ Service │ ManagementActor + Receptionist │
│ └──────────┘ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Ntf │ │ Site │ │ Audit │ Observ. / │
│ │ Outbox │ │ Call │ │ Log │ Audit area │
│ │ (#21) │ │ Audit │ │ (#23) │ │
│ │ │ │ (#22) │ │ │ │
│ └────▲─────┘ └────▲─────┘ └────▲─────┘ │
│ │ ingests │ ingests │ ingests │
│ │ (S&F) │ (telemetry)│ (telemetry + │
│ │ │ │ direct-write │
│ │ │ │ from Ntf Outbox │
│ │ │ │ & Inbound API) │
│ ┌───────────────────────────────────┐ │
│ │ Akka.NET Communication Layer │ │
│ │ ClusterClient: command/control │ │


@@ -0,0 +1,787 @@
# Centralized Audit Log Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans to implement this plan task-by-task.
>
> **Repo nature:** Design-documentation only. No code, no tests. Each task is a documentation change. "Verify" = re-read the diff + grep for stale cross-references. Commit after each task.
**Goal:** Document the new **#23 Audit Log** component and propagate its cross-references across every affected component design, the README, HighLevelReqs, and CLAUDE.md — exactly as specified in `alog.md` (committed `fec0bb1`).
**Architecture:** Layered, append-only `AuditLog` table at central, alongside existing `Notifications` (#21) and `SiteCalls` (#22) operational stores. Site SQLite writes on the hot path; gRPC telemetry forwards to central; site purge requires `ForwardState ∈ {Forwarded, Reconciled}`. Cached calls send a single telemetry packet that drives both the immutable `AuditLog` insert and the operational `SiteCalls` upsert. Central-originated events (Inbound API, Notification dispatch attempts) write directly. Monthly partitioning at central, 365-day default retention.
**Tech Stack:** Markdown only. No code in v1 of this plan.
**Spec:** `/Users/dohertj2/Desktop/scadalink-design/alog.md` (see commit `fec0bb1`). All task content below cites sections of that file.
---
## Task 0: Prepare branch
**Files:**
- None — git operation only.
**Step 1: Confirm working tree state**
Run: `git status --short`
Expected: three unstaged `infra/` modifications (unrelated; leave them alone), nothing else.
**Step 2: Create feature branch off `main`**
Run: `git switch -c feature/audit-log-docs`
Expected: switched to a new branch.
**Step 3: Verify branch**
Run: `git rev-parse --abbrev-ref HEAD`
Expected: `feature/audit-log-docs`.
**No commit at this task — just branch prep.**
---
## Task 1: Author `Component-AuditLog.md`
**Files:**
- Create: `docs/requirements/Component-AuditLog.md`
**Step 1: Read context**
Read `alog.md` §1–§16. Read the structural style of `docs/requirements/Component-SiteCallAudit.md` and `docs/requirements/Component-NotificationOutbox.md` — mirror their section ordering (Purpose / Location / Responsibilities / Tables / Lifecycle / Ingest & Idempotency / Reconciliation / Retention & Purge / KPIs / Configuration / Dependencies / Interactions).
**Step 2: Write the skeleton**
Create the file with these top-level headings (verbatim, in order):
```
# Component: Audit Log
## Purpose
## Location
## Responsibilities
## Scope — the script trust boundary
## The `AuditLog` Table (central)
## The Site-Local `AuditLog` (SQLite)
## Ingestion Paths
## Cached Operations — Combined Telemetry
## Payload Capture Policy
## Failure Handling & Idempotency
## Retention & Purge
## Security & Tamper-Evidence
## KPIs
## Configuration
## Dependencies
## Interactions
```
**Step 3: Fill `Purpose`**
Two-paragraph version of `alog.md` §1. Lead sentence: "Provides a single, append-only, forensic + operational record of every integration action initiated by, or terminating in, a script — across outbound API, outbound DB, notifications, and inbound API." Second paragraph: not a dispatcher, observes Notification Outbox (#21) and Site Call Audit (#22), adds coverage where they are silent.
**Step 4: Fill `Location`**
Central cluster + site cluster. Central: `AuditLog` table in MS SQL plus three singleton actors on the active central node — `AuditLogIngestActor` (telemetry receiver), `SiteAuditReconciliationActor`, `AuditLogPurgeActor`. Sites: `AuditLog` SQLite database file alongside the S&F buffer plus `SiteAuditTelemetryActor` singleton on the active site node. Registered as component #23 in the Host role configuration.
**Step 5: Fill `Responsibilities`**
Bullet list mirroring `alog.md` §1–§3 commitments. Six bullets:
- Accept site-local hot-path audit writes from script-trust-boundary call paths.
- Forward site audit rows to central via gRPC telemetry with at-least-once + idempotency on `EventId`.
- Run periodic reconciliation pulls per site to self-heal missed telemetry.
- Accept central-originated audit writes (Inbound API, Notification dispatch attempts).
- Compute point-in-time KPIs (global + per-site) from the central `AuditLog` table.
- Purge expired rows by monthly partition switch.
**Step 6: Fill `Scope — the script trust boundary`**
Reproduce the table from `alog.md` §2 verbatim (the six rows). Add the "Out of scope" bullet list. Add the DB-reads note.
**Step 7: Fill `The AuditLog Table (central)`**
Reproduce the column table from `alog.md` §4. Then the index list. Then the `Kind`-per-channel table (with the inbound API simplification — only `Completed`).
**Step 8: Fill `The Site-Local AuditLog (SQLite)`**
State same schema as central minus `IngestedAtUtc`, plus `ForwardState` (`Pending | Forwarded | Reconciled`). Reproduce the **hard purge invariant** from `alog.md` §4 verbatim:
> A row is eligible for purge only when both `OccurredAtUtc < retention threshold` AND `ForwardState IN ('Forwarded', 'Reconciled')`. Pending rows are never purged.
Mention the `SiteAuditBacklog` health metric.
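The purge invariant above can be sketched against an in-memory SQLite database. This is a minimal illustration, not the component's implementation: the three-column table is a reduced stand-in for the real schema, ISO-8601 text timestamps are an assumption, and `purge_site_audit` and `demo` are hypothetical names.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def purge_site_audit(conn: sqlite3.Connection, now: datetime, retention_days: int = 7) -> int:
    """Delete rows that are both expired AND already forwarded or reconciled."""
    threshold = (now - timedelta(days=retention_days)).isoformat()
    cur = conn.execute(
        "DELETE FROM AuditLog "
        "WHERE OccurredAtUtc < ? AND ForwardState IN ('Forwarded', 'Reconciled')",
        (threshold,),
    )
    conn.commit()
    return cur.rowcount


def demo() -> list[str]:
    conn = sqlite3.connect(":memory:")
    # Reduced, hypothetical schema: only the columns the invariant needs.
    conn.execute(
        "CREATE TABLE AuditLog ("
        "EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT NOT NULL, ForwardState TEXT NOT NULL)"
    )
    now = datetime(2026, 5, 20, tzinfo=timezone.utc)
    old = (now - timedelta(days=30)).isoformat()
    fresh = (now - timedelta(days=1)).isoformat()
    conn.executemany(
        "INSERT INTO AuditLog VALUES (?, ?, ?)",
        [
            ("old-forwarded", old, "Forwarded"),
            ("old-reconciled", old, "Reconciled"),
            ("old-pending", old, "Pending"),      # expired but never forwarded: must survive
            ("fresh-forwarded", fresh, "Forwarded"),
        ],
    )
    purge_site_audit(conn, now)
    return [row[0] for row in conn.execute("SELECT EventId FROM AuditLog ORDER BY EventId")]
```

In this sketch the expired `Pending` row outlives the purge, which is the property that keeps a central outage from causing audit loss at a site.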
**Step 9: Fill `Ingestion Paths`**
Four subsections mirroring `alog.md` §6.1, §6.2, §6.3, §6.4. Keep concise — full pseudo-code lives in `alog.md`; the component doc captures the contract.
**Step 10: Fill `Cached Operations — Combined Telemetry`**
Capture `alog.md` §6.5 — site is source of truth, one telemetry packet carries both the audit row and the SiteCalls operational update; central ingest performs both writes in a single transaction.
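A rough sketch of the single-transaction dual write, using SQLite as a stand-in for central MS SQL. The reduced schemas, the packet field names, and the `StatusSeq` ordering column used to decide "newer status" are all assumptions made for illustration; the real packet shape is defined in `alog.md`.

```python
import sqlite3

# Hypothetical, reduced schemas; StatusSeq is an assumed ordering column.
SCHEMA = """
CREATE TABLE AuditLog (EventId TEXT PRIMARY KEY, TrackedOperationId TEXT, Kind TEXT);
CREATE TABLE SiteCalls (TrackedOperationId TEXT PRIMARY KEY, Status TEXT, StatusSeq INTEGER);
"""


def ingest_cached_packet(conn: sqlite3.Connection, packet: dict) -> None:
    """Apply one combined packet: immutable audit insert plus operational upsert."""
    with conn:  # one transaction: commit on success, roll back on any exception
        conn.execute(
            "INSERT OR IGNORE INTO AuditLog (EventId, TrackedOperationId, Kind) "
            "VALUES (?, ?, ?)",
            (packet["EventId"], packet["TrackedOperationId"], packet["Kind"]),
        )
        conn.execute(
            "INSERT INTO SiteCalls (TrackedOperationId, Status, StatusSeq) VALUES (?, ?, ?) "
            "ON CONFLICT(TrackedOperationId) DO UPDATE SET "
            "Status = excluded.Status, StatusSeq = excluded.StatusSeq "
            "WHERE excluded.StatusSeq > SiteCalls.StatusSeq",
            (packet["TrackedOperationId"], packet["Status"], packet["StatusSeq"]),
        )


def demo() -> tuple[int, str]:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    enqueued = {"EventId": "e1", "TrackedOperationId": "op-1",
                "Kind": "CachedEnqueued", "Status": "Enqueued", "StatusSeq": 1}
    attempt = {"EventId": "e2", "TrackedOperationId": "op-1",
               "Kind": "CachedAttempt", "Status": "Retrying", "StatusSeq": 2}
    # The third packet simulates at-least-once redelivery of an older event.
    for packet in (enqueued, attempt, enqueued):
        ingest_cached_packet(conn, packet)
    audit_rows = conn.execute("SELECT COUNT(*) FROM AuditLog").fetchone()[0]
    status = conn.execute(
        "SELECT Status FROM SiteCalls WHERE TrackedOperationId = 'op-1'"
    ).fetchone()[0]
    return audit_rows, status
```

The redelivered packet neither duplicates the audit row nor regresses the operational status, because the insert is keyed on `EventId` and the upsert only applies a newer status.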
**Step 11: Fill `Payload Capture Policy`**
Compress `alog.md` §8 into 8–12 lines: defaults (8 KB / 64 KB on error), header redaction, body-redactor regex hook, SQL captures values by default with per-connection opt-out, never-captured list (API keys, LDAP creds, secrets), safety-net over-redacts on misconfiguration.
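The capture defaults could be sketched as below. The constants mirror the 8 KB / 64 KB caps stated above (taking 1 KB = 1024 bytes as an assumption); the header names in the redact list, the `[REDACTED]` marker, and all function names are hypothetical.

```python
import re

DEFAULT_CAP_BYTES = 8 * 1024     # default body cap
ERROR_CAP_BYTES = 64 * 1024      # larger cap for error rows
HEADER_REDACT_LIST = {"authorization", "x-api-key"}  # hypothetical example entries
REDACTED = "[REDACTED]"


def cap_body(body: bytes, is_error: bool) -> bytes:
    """Truncate a captured body to the cap that applies to this row."""
    cap = ERROR_CAP_BYTES if is_error else DEFAULT_CAP_BYTES
    return body[:cap]


def redact_headers(headers: dict[str, str]) -> dict[str, str]:
    """Replace the values of listed headers, matching names case-insensitively."""
    return {
        name: (REDACTED if name.lower() in HEADER_REDACT_LIST else value)
        for name, value in headers.items()
    }


def redact_body(text: str, patterns: list[str]) -> str:
    """Apply regex redactors; a broken pattern redacts the whole body (safety net)."""
    try:
        for pattern in patterns:
            text = re.sub(pattern, REDACTED, text)
        return text
    except re.error:
        # Misconfigured redactor: over-redact rather than leak.
        return REDACTED
```

For example, `redact_body('{"pin": "1234"}', [r"\d{4}"])` replaces only the digits, while an invalid pattern such as `"("` causes the entire body to be withheld.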
**Step 12: Fill `Failure Handling & Idempotency`**
Compress `alog.md` §9: EventId is the PK and dedup key; never-fail-the-action principle; ring buffer for transient SQLite write failures; reconciliation as fallback when telemetry actor wedges; central-direct-write failure handling.
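One way to picture the ring buffer for transient write failures is a bounded queue in front of the store. This sketch assumes that a full buffer drops its oldest entry and that buffered events are retried on the next write; neither behavior is confirmed by the text above, and `BufferedAuditWriter` is a hypothetical name.

```python
from collections import deque
from typing import Any, Callable


class BufferedAuditWriter:
    """Never raises to the caller; holds events in a bounded buffer until the sink accepts them."""

    def __init__(self, sink: Callable[[Any], None], capacity: int = 1024) -> None:
        self._sink = sink                                   # may raise on transient failure
        self._pending: deque = deque(maxlen=capacity)       # assumption: oldest dropped when full
        self.write_failures = 0                             # would feed a health metric

    def write(self, event: Any) -> None:
        self._pending.append(event)
        try:
            while self._pending:
                self._sink(self._pending[0])   # write the oldest buffered event first
                self._pending.popleft()        # remove only after a successful write
        except Exception:
            self.write_failures += 1           # swallow: the user-facing action continues


def demo() -> tuple[list[str], int]:
    stored: list[str] = []
    calls = {"n": 0}

    def flaky_sink(event: str) -> None:
        calls["n"] += 1
        if calls["n"] == 1:
            raise OSError("database is locked")  # simulated transient failure
        stored.append(event)

    writer = BufferedAuditWriter(flaky_sink)
    writer.write("a")   # fails, stays buffered
    writer.write("b")   # flushes "a" and then "b"
    return stored, writer.write_failures
```

Ordering is preserved across the failure because the buffer is always drained from the front before newer events are written.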
**Step 13: Fill `Retention & Purge`**
Compress `alog.md` §12: 365-day default central retention; monthly partition switch; no row-level deletes at central; site 7-day default; site purge respects `ForwardState`.
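The monthly partition-switch purge can be illustrated with plain date arithmetic. The sketch assumes a partition is eligible only once every row it could contain is older than the retention cutoff; that rule, and the `expired_partitions` helper, are assumptions rather than the documented algorithm.

```python
from datetime import date, timedelta


def expired_partitions(
    partition_months: list[tuple[int, int]],
    today: date,
    retention_days: int = 365,
) -> list[tuple[int, int]]:
    """Return the (year, month) partitions that lie entirely before the cutoff."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for year, month in partition_months:
        # The first day of the following month is the partition's exclusive upper bound.
        if month == 12:
            upper_bound = date(year + 1, 1, 1)
        else:
            upper_bound = date(year, month + 1, 1)
        if upper_bound <= cutoff:
            expired.append((year, month))
    return expired
```

With `today = date(2026, 5, 20)` the cutoff is 2025-05-20, so April 2025 and earlier are eligible while May 2025 is retained until it has aged out completely.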
**Step 14: Fill `Security & Tamper-Evidence`**
Compress `alog.md` §11: dedicated `scadalink_audit_writer` (INSERT+SELECT) and `scadalink_audit_purger` (partition-switch only) DB roles; CI grep guard against `UPDATE`/`DELETE` of `AuditLog`; Audit + OperationalAudit + AuditExport permissions; hash-chain tamper evidence deferred to v1.x.
**Step 15: Fill `KPIs`**
List the five KPIs from `alog.md` §14: Volume, Error rate, Backlog, Top inbound callers, Top outbound 5xx. Note that Notification Outbox and Site Call Audit KPIs are unaffected.
**Step 16: Fill `Configuration`**
Show the `AuditLog` `appsettings.json` shape from `alog.md` §8.4. Include `DefaultCapBytes`, `ErrorCapBytes`, `HeaderRedactList`, `GlobalBodyRedactors`, `PerTargetOverrides`, and `RetentionDays` (global only in v1).
**Step 17: Fill `Dependencies`**
Cross-references to:
- **Commons (#16)** — `AuditEvent`, `IAuditWriter`, `ICentralAuditWriter`, `AuditChannel`, `AuditKind`, `AuditStatus` types and interfaces.
- **Configuration Database (#17)** — `AuditLog` table schema, partition function/scheme, DB roles, retention options.
- **Cluster Infrastructure (#13)** — singleton placement and supervision (`AuditLogIngestActor`, `SiteAuditTelemetryActor`, `SiteAuditReconciliationActor`, `AuditLogPurgeActor`).
- **Communication (#5)** — gRPC telemetry message types added to the existing site-stream proto additively.
- **Site Runtime (#3)** — script trust boundary touchpoints invoke `IAuditWriter`.
- **Host (#15)** — registers the new component under the central + site roles.
**Step 18: Fill `Interactions`**
Edges to:
- **External System Gateway (#7)** — emits `ApiOutbound.SyncCall` rows; for `CachedCall` emits combined telemetry (audit + operational).
- **Site Runtime (#3) / Database layer** — emits `DbOutbound.SyncWrite`, `DbOutbound.SyncRead`, and cached variants similarly.
- **Inbound API (#14)** — emits `ApiInbound.Completed` rows from request middleware.
- **Notification Outbox (#21)** — site-emitted `Notification.Enqueued` flows via audit telemetry; central dispatcher writes `Notification.Attempt` and `Notification.Terminal` rows directly via `ICentralAuditWriter`.
- **Site Call Audit (#22)** — shares the cached-call telemetry packet; central ingest of that packet performs both `AuditLog` insert and `SiteCalls` upsert in one transaction.
- **Central UI (#9)** — new Audit nav group + Audit Log page; drill-in links from Notifications, Site Calls, External Systems, Inbound API Keys, Sites, Instances detail pages.
- **Health Monitoring (#11)** — three new tiles (Volume, Error rate, Backlog) plus new metrics (`SiteAuditBacklog`, `SiteAuditWriteFailures`, `SiteAuditTelemetryStalled`, `CentralAuditWriteFailures`, `AuditRedactionFailure`).
- **CLI (#19)** — `scadalink audit query|export|verify-chain` commands.
**Step 19: Verify**
Run: `grep -n "Component-AuditLog.md\|#23" docs/requirements/Component-AuditLog.md`
Expected: file references itself sensibly.
Run: `wc -l docs/requirements/Component-AuditLog.md`
Expected: ~250–400 lines (sanity check; not exact).
**Step 20: Commit**
```bash
git add docs/requirements/Component-AuditLog.md
git commit -m "docs(audit): add Component-AuditLog (#23) design document"
```
---
## Task 2: Update `Component-Commons.md`
**Files:**
- Modify: `docs/requirements/Component-Commons.md`
**Step 1: Read existing structure**
Read the file to find the right sections — likely "Types", "Interfaces", "Messages", "Entities". Note which subsections audit-related additions belong in.
**Step 2: Add to `Types/`**
Under the Types section, add:
- `AuditChannel` enum: `ApiOutbound | DbOutbound | Notification | ApiInbound`.
- `AuditKind` enum: union of channel-specific values from `alog.md` §4 table.
- `AuditStatus` enum: `Success | TransientFailure | PermanentFailure | Enqueued | Retrying | Delivered | Parked | Discarded`.
- `AuditEvent` POCO record carrying every column from `alog.md` §4 (central schema), plus a `ForwardState` for site SQLite.
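For orientation, the enums listed above could look like this in Python. The member names come from the bullets; `AuditKind` is omitted because its values live in `alog.md`, and the `AuditEvent` fields shown are a small hypothetical subset of the real column list.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class AuditChannel(Enum):
    ApiOutbound = "ApiOutbound"
    DbOutbound = "DbOutbound"
    Notification = "Notification"
    ApiInbound = "ApiInbound"


class AuditStatus(Enum):
    Success = "Success"
    TransientFailure = "TransientFailure"
    PermanentFailure = "PermanentFailure"
    Enqueued = "Enqueued"
    Retrying = "Retrying"
    Delivered = "Delivered"
    Parked = "Parked"
    Discarded = "Discarded"


class ForwardState(Enum):
    Pending = "Pending"
    Forwarded = "Forwarded"
    Reconciled = "Reconciled"


@dataclass(frozen=True)
class AuditEvent:
    """Illustrative subset only; the full column list is specified in alog.md."""
    event_id: str
    channel: AuditChannel
    kind: str                      # stands in for the AuditKind enum
    status: AuditStatus
    occurred_at_utc: datetime
    forward_state: Optional[ForwardState] = None  # populated only in site SQLite
```

Making the record immutable (`frozen=True`) is a design choice of this sketch that echoes the append-only nature of the log.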
**Step 3: Add to `Interfaces/`**
- `IAuditWriter` — site-local hot-path interface: `Task WriteAsync(AuditEvent evt, CancellationToken ct)`. Implementation lives in Audit Log (#23) component.
- `ICentralAuditWriter` — central direct-write interface: `Task WriteAsync(AuditEvent evt, CancellationToken ct)` with insert-if-not-exists semantics on `EventId`.
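The two writer contracts can be approximated in Python to show the intended semantics. This is a sketch under assumptions: `AuditWriter` stands in for the `WriteAsync` shape, a dict stands in for the central table, events are plain dicts, and `run_audited`, `InMemoryCentralWriter`, and `BrokenWriter` are hypothetical names.

```python
import asyncio
from typing import Any, Awaitable, Callable, Protocol


class AuditWriter(Protocol):
    """Python stand-in for the WriteAsync(AuditEvent, CancellationToken) shape."""

    async def write(self, event: dict[str, Any]) -> None: ...


class InMemoryCentralWriter:
    """Insert-if-not-exists on EventId; a dict stands in for the central table."""

    def __init__(self) -> None:
        self.rows: dict[str, dict[str, Any]] = {}

    async def write(self, event: dict[str, Any]) -> None:
        self.rows.setdefault(event["EventId"], event)  # a duplicate EventId is ignored


class BrokenWriter:
    """Simulates an unavailable audit store."""

    async def write(self, event: dict[str, Any]) -> None:
        raise RuntimeError("audit store unavailable")


async def run_audited(
    action: Callable[[], Awaitable[Any]],
    writer: AuditWriter,
    event: dict[str, Any],
) -> Any:
    result = await action()        # the action's own outcome is authoritative
    try:
        await writer.write(event)
    except Exception:
        pass                       # a real system would bump a failure metric here
    return result


async def call_external() -> str:
    return "200 OK"
```

Only the success path of the action is audited here for brevity; the point of the sketch is that a failing writer never changes what the caller receives.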
**Step 4: Add to `Messages/`**
- `AuditTelemetryEnvelope` — gRPC message wrapping a batch of `AuditEvent` rows for telemetry forwarding.
- `CachedCallTelemetry` — the existing SiteCalls telemetry message, additively extended in place to also carry `AuditEvent` content alongside the operational `SiteCalls` upsert fields. Do NOT rename; per `Component-Commons.md` REQ-COM-5a, message renames are breaking changes. Extend the existing entry's description.
**Step 5: Verify**
Run: `grep -n "AuditEvent\|IAuditWriter\|AuditChannel" docs/requirements/Component-Commons.md`
Expected: all three grepped identifiers appear in the right sections.
**Step 6: Commit**
```bash
git add docs/requirements/Component-Commons.md
git commit -m "docs(audit): register AuditEvent, IAuditWriter, AuditTelemetry types in Commons"
```
---
## Task 3: Update `Component-ConfigurationDatabase.md`
**Files:**
- Modify: `docs/requirements/Component-ConfigurationDatabase.md`
**Step 1: Read existing structure**
Find the "Tables" and "Roles" / "Permissions" / "Migrations" sections.
**Step 2: Add `AuditLog` table description**
Under Tables, add a new subsection mirroring how `Notifications` and `SiteCalls` are documented. Include:
- Full column list from `alog.md` §4 (central table).
- Index list from `alog.md` §4.
- Monthly partitioning: partition function `pf_AuditLog_Month`, scheme `ps_AuditLog_Month`, filegroup-per-month rollover.
- PK on `EventId` for idempotency.
**Step 3: Add `AuditLog` DB roles**
Under Roles/Permissions, add `scadalink_audit_writer` (INSERT+SELECT only) and `scadalink_audit_purger` (partition-switch only). Note the CI grep guard against `UPDATE … AuditLog` / `DELETE … AuditLog`.
**Step 4: Add `AuditLog` migration note**
Under Migrations, note that the initial migration creates the partition function/scheme and the table aligned to the scheme; partition-maintenance job is owned by the Audit Log component, not the Configuration DB.
**Step 5: Add retention config note**
Mention `AuditLog:RetentionDays` (global only in v1) as an Audit Log options key consumed by the purge actor.
**Step 6: Verify cross-reference**
Run: `grep -n "AuditLog\|Audit Log" docs/requirements/Component-ConfigurationDatabase.md`
Expected: new table appears in the Tables section, roles in Roles section.
**Step 7: Commit**
```bash
git add docs/requirements/Component-ConfigurationDatabase.md
git commit -m "docs(audit): add AuditLog table, partitioning, and DB roles to Config DB"
```
---
## Task 4: Update `Component-ClusterInfrastructure.md`
**Files:**
- Modify: `docs/requirements/Component-ClusterInfrastructure.md`
**Step 1: Read singleton-placement section**
Find where Notification Outbox / Site Call Audit singletons are documented (active-central placement model).
**Step 2: Register central singletons**
Add to the central-singleton list:
- `AuditLogIngestActor` — receives gRPC telemetry batches, performs insert-if-not-exists on `EventId`; for cached telemetry, performs both `AuditLog` insert and `SiteCalls` upsert in one transaction.
- `SiteAuditReconciliationActor` — periodic per-site pull, default every 5 minutes.
- `AuditLogPurgeActor` — daily partition-switch purge.
**Step 3: Register site singletons**
Add to the site-singleton list:
- `SiteAuditTelemetryActor` — drains the local `AuditLog` SQLite's `Pending` rows to central in batches; short interval (5s) when busy, longer (30s) when idle.
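The drain cadence described above might be sketched as a single step function that returns the delay before the next run. The 5 s and 30 s intervals come from the text; the batch size, the in-memory row list, the `send_batch` callback, and the choice to retry at the short interval after a failed send are assumptions.

```python
from typing import Callable

BUSY_INTERVAL_S = 5    # short interval while rows are pending
IDLE_INTERVAL_S = 30   # longer interval when nothing is pending


def drain_once(
    rows: list[dict],
    send_batch: Callable[[list[dict]], None],
    batch_size: int = 100,   # hypothetical default
) -> int:
    """Forward one batch of Pending rows; return seconds to wait before the next drain."""
    batch = [row for row in rows if row["ForwardState"] == "Pending"][:batch_size]
    if not batch:
        return IDLE_INTERVAL_S
    try:
        send_batch(batch)            # stands in for the gRPC telemetry call
    except Exception:
        return BUSY_INTERVAL_S       # rows stay Pending and are retried on the next pass
    for row in batch:
        row["ForwardState"] = "Forwarded"   # mark only after central accepted the batch
    return BUSY_INTERVAL_S
```

Because rows are marked `Forwarded` only after a successful send, a lost batch is simply resent, which is safe when central ingest is idempotent on `EventId`.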
**Step 4: Note dedicated dispatcher**
Add a one-liner: `SiteAuditTelemetryActor` runs on a dedicated dispatcher so it doesn't compete with the script blocking-I/O dispatcher (per `alog.md` §6.2).
**Step 5: Verify**
Run: `grep -n "AuditLogIngestActor\|SiteAuditTelemetryActor\|AuditLogPurgeActor\|SiteAuditReconciliationActor" docs/requirements/Component-ClusterInfrastructure.md`
Expected: all four singletons listed.
**Step 6: Commit**
```bash
git add docs/requirements/Component-ClusterInfrastructure.md
git commit -m "docs(audit): register AuditLog singletons in Cluster Infrastructure"
```
---
## Task 5: Update `Component-SiteRuntime.md`
**Files:**
- Modify: `docs/requirements/Component-SiteRuntime.md`
**Step 1: Find script-trust-boundary section**
Locate the section listing what scripts can/cannot do and how their boundary-crossing calls are mediated.
**Step 2: Note audit hook**
Add: "Every script-trust-boundary call (External System Gateway, Database layer, Notify) emits an `AuditEvent` to `IAuditWriter` (site-local SQLite append). Hot path; never fails the calling action; failures logged via the `SiteAuditWriteFailures` health metric (see Health Monitoring #11)."
**Step 3: Note site SQLite footprint**
Find the section discussing site storage (SQLite for deployed configs, S&F buffer, event log, operation tracking). Add the `AuditLog` SQLite database file as a peer with the 7-day-purge-respecting-ForwardState invariant; cross-reference to Component-AuditLog.md.
**Step 4: Verify**
Run: `grep -n "IAuditWriter\|AuditLog\|Audit Log" docs/requirements/Component-SiteRuntime.md`
Expected: hook documented, SQLite file mentioned.
**Step 5: Commit**
```bash
git add docs/requirements/Component-SiteRuntime.md
git commit -m "docs(audit): note IAuditWriter hook and site SQLite in Site Runtime"
```
---
## Task 6: Update `Component-ExternalSystemGateway.md`
**Files:**
- Modify: `docs/requirements/Component-ExternalSystemGateway.md`
**Step 1: Find Call/CachedCall sections**
Locate the dual-call-modes documentation.
**Step 2: Note audit emission on sync calls**
Under `ExternalSystem.Call`, add: "Emits an `ApiOutbound.SyncCall` row to `IAuditWriter` at call completion (success or failure). Payload captured per the Audit Log policy (#23 §Payload Capture Policy). Audit-write failure never aborts the script."
**Step 3: Note audit emission on cached calls**
Under `ExternalSystem.CachedCall`, add: "Each lifecycle transition (`CachedEnqueued`, `CachedAttempt`, `CachedTerminal`) emits an audit row via the combined cached-operation telemetry packet — one packet carries both the audit row and the SiteCalls upsert (see Audit Log #23 §Cached Operations and Site Call Audit #22)."
**Step 4: Note audit emission on DB writes**
Under `Database.Connection()` (synchronous), add: "Script-initiated `Execute`/`ExecuteScalar` calls emit `DbOutbound.SyncWrite` rows; `ExecuteReader` emits `DbOutbound.SyncRead`. SQL parameter values are captured by default; per-connection redaction opt-in via the Audit Log configuration (#23 §Payload Capture Policy §8.2)."
**Step 5: Note audit emission on cached DB writes**
Under `Database.CachedWrite`, add: same combined-telemetry pattern as cached external calls.
**Step 6: Verify**
Run: `grep -n "AuditLog\|Audit Log\|ApiOutbound\|DbOutbound\|IAuditWriter" docs/requirements/Component-ExternalSystemGateway.md`
Expected: hooks documented in all four call-mode subsections.
**Step 7: Commit**
```bash
git add docs/requirements/Component-ExternalSystemGateway.md
git commit -m "docs(audit): emit AuditLog rows from External System Gateway call paths"
```
---
## Task 7: Update `Component-SiteCallAudit.md`
**Files:**
- Modify: `docs/requirements/Component-SiteCallAudit.md`
**Step 1: Find Ingest & Idempotency section**
Locate the "Ingest & Idempotency" section (around line 69 in current file).
**Step 2: Note combined telemetry**
Add a new paragraph: "From v1.x onward, the cached-operation telemetry packet additively carries the `AuditEvent` content alongside the existing operational fields. Central's `AuditLogIngestActor` (Audit Log #23) performs both the immutable `AuditLog` insert and the `SiteCalls` upsert in a single transaction. Idempotency keys remain `EventId` (for AuditLog) and `TrackedOperationId` (for SiteCalls)."
**Step 3: Cross-reference Audit Log**
Find the Dependencies / Interactions sections (typically near the end). Add an edge to **Audit Log (#23)** noting the shared telemetry packet and dual-write ingest.
**Step 4: Verify**
Run: `grep -n "Audit Log\|AuditLog\|AuditEvent\|#23" docs/requirements/Component-SiteCallAudit.md`
Expected: combined-telemetry paragraph + Dependencies edge present.
**Step 5: Commit**
```bash
git add docs/requirements/Component-SiteCallAudit.md
git commit -m "docs(audit): note shared cached-operation telemetry with Audit Log"
```
---
## Task 8: Update `Component-NotificationOutbox.md`
**Files:**
- Modify: `docs/requirements/Component-NotificationOutbox.md`
**Step 1: Find dispatcher section**
Locate the section describing the central dispatcher's delivery attempt loop.
**Step 2: Note central direct-write of attempt/terminal**
Add: "Each delivery attempt writes a `Notification.Attempt` row to the `AuditLog` via `ICentralAuditWriter`; transition to a terminal status (`Delivered` / `Parked` / `Discarded`) writes a `Notification.Terminal` row. Audit writes are direct (no telemetry — the dispatcher runs at central). The site-emitted `Notification.Enqueued` row arrives via the standard audit telemetry channel."
**Step 3: Cross-reference Audit Log**
Add to Dependencies / Interactions: edge to **Audit Log (#23)** noting central direct-write of dispatch lifecycle events.
**Step 4: Note status independence**
Add a clarifying sentence: "The operational `Notifications` table remains the source of truth for the dispatcher and for Retry/Discard actions; the `AuditLog` rows are immutable shadows."
**Step 5: Verify**
Run: `grep -n "Audit Log\|ICentralAuditWriter\|Notification.Attempt\|#23" docs/requirements/Component-NotificationOutbox.md`
Expected: dispatcher hook + Dependencies edge present.
**Step 6: Commit**
```bash
git add docs/requirements/Component-NotificationOutbox.md
git commit -m "docs(audit): central direct-write of notification dispatch events to AuditLog"
```
---
## Task 9: Update `Component-InboundAPI.md`
**Files:**
- Modify: `docs/requirements/Component-InboundAPI.md`
**Step 1: Find request-completion / logging section**
Locate the section describing how requests are processed and what gets logged (currently failures only, per the brainstorm exploration).
**Step 2: Replace failures-only stance**
Edit the "failures-only logging" claim so it now reads: "Every request (success or failure) emits one `ApiInbound.Completed` row to `ICentralAuditWriter` from request middleware before the HTTP response is flushed. The row captures the API key *name* (never the key material), remote IP, user-agent, response status, duration, and truncated request/response bodies per the Audit Log capture policy (#23 §Payload Capture Policy)."
**Step 3: Cross-reference Audit Log**
Add Dependencies edge to **Audit Log (#23)**.
**Step 4: Note non-blocking semantics**
Add: "Middleware audit-write failures are logged and metricked (see Health Monitoring #11) but never affect the HTTP response."
**Step 5: Verify**
Run: `grep -n "Audit Log\|ApiInbound\|ICentralAuditWriter\|#23" docs/requirements/Component-InboundAPI.md`
Expected: middleware hook + Dependencies edge present.
**Step 6: Commit**
```bash
git add docs/requirements/Component-InboundAPI.md
git commit -m "docs(audit): emit ApiInbound.Completed audit row per request"
```
---
## Task 10: Update `Component-CentralUI.md`
**Files:**
- Modify: `docs/requirements/Component-CentralUI.md`
**Step 1: Find navigation / page list**
Locate the section enumerating top-level nav groups and pages.
**Step 2: Add Audit nav group**
Add a new top-level group **Audit** with one page in v1:
- **Audit Log** — global query/filter/drilldown over the central `AuditLog` table.
Document the filter bar and results grid columns from `alog.md` §10.1.
**Step 3: Add drill-in links**
In the existing Notifications, Site Calls, External Systems, Inbound API Keys, Sites, and Instances detail-page documentation, add a "View audit history" / "Recent activity" / "Audit feed" entry that opens the Audit Log page pre-filtered (per `alog.md` §10.2).
**Step 4: Add Health dashboard tiles**
In the Health dashboard documentation, add three tiles under a new "Audit" KPI group: Audit volume, Audit error rate, Audit backlog (per `alog.md` §10.3 / §14).
**Step 5: Note UI rules already covered**
No new framework choices — sticks to Blazor Server + Bootstrap + custom components per the existing project rules (per memory note `feedback_central_ui.md`).
**Step 6: Verify**
Run: `grep -n "Audit Log\|Audit nav\|Audit feed\|Audit volume\|#23" docs/requirements/Component-CentralUI.md`
Expected: nav group, page, drill-ins, tiles all documented.
**Step 7: Commit**
```bash
git add docs/requirements/Component-CentralUI.md
git commit -m "docs(audit): add Audit nav group, Audit Log page, drill-ins, and KPI tiles to Central UI"
```
---
## Task 11: Update `Component-HealthMonitoring.md`
**Files:**
- Modify: `docs/requirements/Component-HealthMonitoring.md`
**Step 1: Find metrics list**
Locate where existing site + central metrics are enumerated.
**Step 2: Add new site metrics**
- `SiteAuditBacklog` — count of `Pending` rows in site-local `AuditLog` plus oldest-pending-age plus on-disk bytes. Threshold drives a Health dashboard warning on the affected site tile.
- `SiteAuditWriteFailures` — count of failed hot-path appends since last report.
- `SiteAuditTelemetryStalled` — boolean flag set when reconciliation reports a non-draining backlog over two cycles.
**Step 3: Add new central metrics**
- `CentralAuditWriteFailures` — central direct-write failures (Inbound API middleware, Notification Outbox dispatcher).
- `AuditRedactionFailure` — payload redactor errors (over-redacted, safety-net hit).
**Step 4: Add new tiles**
Three new dashboard tiles under an "Audit" group: Audit volume, Audit error rate, Audit backlog.
**Step 5: Cross-reference Audit Log**
Dependencies edge to **Audit Log (#23)**.
**Step 6: Verify**
Run: `grep -n "SiteAuditBacklog\|SiteAuditWriteFailures\|SiteAuditTelemetryStalled\|CentralAuditWriteFailures\|AuditRedactionFailure\|Audit volume" docs/requirements/Component-HealthMonitoring.md`
Expected: all five metrics + three tiles listed.
**Step 7: Commit**
```bash
git add docs/requirements/Component-HealthMonitoring.md
git commit -m "docs(audit): add Audit Log health metrics and dashboard tiles"
```
---
## Task 12: Update `Component-CLI.md`
**Files:**
- Modify: `docs/requirements/Component-CLI.md`
**Step 1: Find command-group list**
Locate the section enumerating top-level CLI command groups.
**Step 2: Add `scadalink audit` group**
Three subcommands per `alog.md` §15.1:
- `audit query --site <s> --since <t> --kind <k> [...]` — UI-equivalent filter set.
- `audit export --since <t> --until <t> --format csv|jsonl|parquet --output <path>` — server-side streaming export.
- `audit verify-chain --month <YYYY-MM>` — hash-chain verification (no-op in v1; available once §11.4 ships).
Note: requires the `OperationalAudit` permission; `audit export` additionally requires `AuditExport` (Security & Auth #10).
**Step 3: Cross-reference Audit Log and Management Service**
Dependencies edges to **Audit Log (#23)** and **Management Service (#18)** (the CLI hits central via the existing HTTP Management API).
**Step 4: Verify**
Run: `grep -n "scadalink audit\|audit query\|audit export\|audit verify-chain\|#23" docs/requirements/Component-CLI.md`
Expected: command group documented with all three subcommands.
**Step 5: Commit**
```bash
git add docs/requirements/Component-CLI.md
git commit -m "docs(audit): add scadalink audit command group to CLI"
```
---
## Task 13: Update `README.md`
**Files:**
- Modify: `README.md`
**Step 1: Find component table**
Locate the markdown table containing rows #1–#22 (currently around lines 36–58).
**Step 2: Add row #23**
Append a row after `Site Call Audit`:
```
| 23 | Audit Log | [docs/requirements/Component-AuditLog.md](docs/requirements/Component-AuditLog.md) | New central append-only AuditLog spanning every script-trust-boundary action (outbound API sync+cached, outbound DB sync+cached, notifications, inbound API). Site-local SQLite hot-path append + gRPC telemetry + central reconciliation; combined telemetry packet with Site Call Audit; central direct-write for Notification Outbox dispatch + Inbound API middleware; monthly partitioning, 365-day default retention. |
```
**Step 3: Update architecture diagram (logical)**
In the architecture diagram, add an `AuditLog` box under the central cluster's "Audit Log" / observability cluster (parallel to Notification Outbox and Site Call Audit). Add a thin arrow from each affected component into it.
**Step 4: Verify**
Run: `grep -n "Audit Log\|Component-AuditLog.md\|| 23 |" README.md`
Expected: new row + diagram entry present.
**Step 5: Commit**
```bash
git add README.md
git commit -m "docs(audit): register Audit Log (#23) in the README component table"
```
---
## Task 14: Update `docs/requirements/HighLevelReqs.md`
**Files:**
- Modify: `docs/requirements/HighLevelReqs.md`
**Step 1: Find functional-area sections**
Locate the section that currently contains requirements for Notification Outbox and Site Call Audit (likely under "Observability" or "Audit & Reporting").
**Step 2: Add Audit Log requirements section**
Add a new subsection **"Centralized Audit Log"** with numbered requirements covering:
- AL-1: Append-only central record of every script-trust-boundary action.
- AL-2: One row per lifecycle event for cached calls and notifications.
- AL-3: Site-local hot-path append; gRPC telemetry to central; idempotent on `EventId`.
- AL-4: Reconciliation pull self-heals missed telemetry.
- AL-5: Payload metadata + truncated bodies (8 KB default, 64 KB on errors).
- AL-6: Headers redacted by default; SQL parameter values captured by default; per-target redaction opt-in.
- AL-7: Audit-write failure never aborts the user-facing action.
- AL-8: 365-day default central retention; monthly partition switch purge.
- AL-9: Site SQLite purge requires `ForwardState ∈ {Forwarded, Reconciled}`; central outage cannot cause audit loss at sites.
- AL-10: Central UI Audit Log page with cross-channel filter and drill-ins from existing operational pages.
- AL-11: Append-only enforced via DB roles; tamper-evidence hash chain deferred to v1.x.
- AL-12: CLI `scadalink audit` command group.
**Step 3: Cross-reference Audit Log component**
Add a "See Component-AuditLog.md (#23)" pointer at the top of the subsection.
**Step 4: Verify**
Run: `grep -n "AL-[0-9]\|Centralized Audit Log\|Component-AuditLog.md" docs/requirements/HighLevelReqs.md`
Expected: section header and all twelve requirements present.
**Step 5: Commit**
```bash
git add docs/requirements/HighLevelReqs.md
git commit -m "docs(audit): add Centralized Audit Log requirements (AL-1..AL-12) to HighLevelReqs"
```
---
## Task 15: Update `CLAUDE.md`
**Files:**
- Modify: `CLAUDE.md`
**Step 1: Update Current Component List**
Change the heading from `## Current Component List (22 components)` to `## Current Component List (23 components)`. Append a new line at the end of the numbered list:
```
23. Audit Log — Central append-only AuditLog table spanning every script-trust-boundary action (outbound API sync+cached, outbound DB sync+cached, notifications, inbound API). Site SQLite hot-path + gRPC telemetry + reconciliation; combined telemetry with Site Call Audit; central direct-write for Notification Outbox dispatch + Inbound API; monthly partitioning, 365-day retention.
```
**Step 2: Add Key Design Decisions block**
In the **Key Design Decisions** section, add a new subsection **`### Centralized Audit Log`** with bulleted decisions mirroring `alog.md` §1–§15 highlights:
- Layered design — append-only AuditLog alongside operational Notifications (#21) and SiteCalls (#22), not replacing them.
- Scope = script trust boundary; framework traffic explicitly excluded.
- One row per lifecycle event; cached calls produce 4+ rows per operation.
- Site SQLite hot-path first; gRPC telemetry to central; idempotent on `EventId`; reconciliation pull as fallback.
- Cached operations: site emits, one telemetry packet carries audit + operational state; central writes both in one transaction.
- Payload cap 8 KB default / 64 KB on errors; headers redacted by default; SQL parameter values captured by default; per-target redaction opt-in.
- Audit-write failure never aborts the user-facing action.
- 365-day central retention with monthly partition-switch purge; 7-day site SQLite with hard `ForwardState` invariant.
- Append-only enforced via DB roles; hash-chain tamper evidence and Parquet archival deferred to v1.x.
- New top-level **Audit** nav group + Audit Log page + drill-ins from Notifications / Site Calls / External Systems / Inbound API Keys / Sites / Instances.
**Step 3: Verify**
Run: `grep -n "Centralized Audit Log\|Audit Log\|23 components\|23\\. Audit Log" CLAUDE.md`
Expected: count updated, list extended, Key Design Decisions block present.
**Step 4: Commit**
```bash
git add CLAUDE.md
git commit -m "docs(audit): register Audit Log (#23) in CLAUDE.md component list and key decisions"
```
---
## Task 16: Final cross-reference verification
**Files:**
- None — verification only.
**Step 1: Grep for stale references**
Run: `grep -rn "22 components\|Currently 22\|22\\. Site Call Audit\\s*$" docs/ README.md CLAUDE.md`
Expected: no hits — all updated to 23.
**Step 2: Grep for orphan references**
Run: `grep -rn "Component-AuditLog.md" docs/ README.md CLAUDE.md`
Expected: hits in README, CLAUDE.md, and each affected component doc. Confirm the file exists at the referenced path.
**Step 3: Verify all eleven affected component docs cross-reference Audit Log**
Run: `for f in docs/requirements/Component-{ExternalSystemGateway,InboundAPI,NotificationOutbox,SiteCallAudit,SiteRuntime,Commons,CentralUI,ConfigurationDatabase,ClusterInfrastructure,HealthMonitoring,CLI}.md; do echo "--- $f"; grep -c "Audit Log\|AuditLog\|#23" "$f"; done`
Expected: each file shows count ≥ 1.
**Step 4: Verify alog.md still matches the design canonically**
Run: `git diff fec0bb1 -- alog.md`
Expected: no diff — alog.md is unchanged from the validated commit.
**Step 5: Skim the new file once more end-to-end**
Read: `docs/requirements/Component-AuditLog.md`. Verify section ordering, completeness, no contradictions with `alog.md`.
**Step 6: Review the commit graph**
Run: `git log --oneline feature/audit-log-docs ^main`
Expected: 15 commits, one per task for Tasks 1–15.
**Step 7: Final commit (only if any fix-ups needed)**
If grep finds any issue, fix it and commit with `docs(audit): cross-reference cleanup`. Otherwise no commit at this task.
---
## Task 17: Merge to main (optional, on user request only)
**Files:**
- None — git operation only.
**Step 1: Confirm with user**
Per CLAUDE.md and harness policy, do not push or merge to main without explicit user instruction. This task documents the option but does not execute automatically.
**Step 2: If user requests merge**
```bash
git switch main
git merge --no-ff feature/audit-log-docs -m "Merge feature/audit-log-docs: centralized audit log design"
```
**Step 3: If user requests push**
```bash
git push origin main
```
(or push the feature branch instead — operator's call).
---
## Execution Notes
- **Tasks 2–14 are mostly independent of each other** once Task 1 is done. Suitable for parallel execution via the **subagent-driven-development** sub-skill: one fresh subagent per task, review between commits.
- **Tasks 15 and 16** must run last (Task 15 is the CLAUDE.md rollup; Task 16 is verification).
- **Task 0** must run first (branch prep).
- Total: 18 tasks (Task 0 through Task 17), ~15 commits, ~250–400 lines of new prose in `Component-AuditLog.md` plus smaller per-component additions.
- Spec is `alog.md` (commit `fec0bb1`); every task cites the relevant section.

@@ -0,0 +1,26 @@
{
"planPath": "docs/plans/2026-05-20-centralized-audit-log.md",
"spec": "alog.md (commit fec0bb1)",
"repoNature": "design-documentation-only",
"tasks": [
{"id": 0, "subject": "Task 0: Prepare branch", "status": "pending", "blockedBy": []},
{"id": 1, "subject": "Task 1: Author Component-AuditLog.md", "status": "pending", "blockedBy": [0]},
{"id": 2, "subject": "Task 2: Update Component-Commons.md", "status": "pending", "blockedBy": [0]},
{"id": 3, "subject": "Task 3: Update Component-ConfigurationDatabase.md", "status": "pending", "blockedBy": [1]},
{"id": 4, "subject": "Task 4: Update Component-ClusterInfrastructure.md", "status": "pending", "blockedBy": [1]},
{"id": 5, "subject": "Task 5: Update Component-SiteRuntime.md", "status": "pending", "blockedBy": [1]},
{"id": 6, "subject": "Task 6: Update Component-ExternalSystemGateway.md", "status": "pending", "blockedBy": [1]},
{"id": 7, "subject": "Task 7: Update Component-SiteCallAudit.md", "status": "pending", "blockedBy": [1]},
{"id": 8, "subject": "Task 8: Update Component-NotificationOutbox.md", "status": "pending", "blockedBy": [1]},
{"id": 9, "subject": "Task 9: Update Component-InboundAPI.md", "status": "pending", "blockedBy": [1]},
{"id": 10, "subject": "Task 10: Update Component-CentralUI.md", "status": "pending", "blockedBy": [1]},
{"id": 11, "subject": "Task 11: Update Component-HealthMonitoring.md", "status": "pending", "blockedBy": [1]},
{"id": 12, "subject": "Task 12: Update Component-CLI.md", "status": "pending", "blockedBy": [1]},
{"id": 13, "subject": "Task 13: Update README.md", "status": "pending", "blockedBy": [1]},
{"id": 14, "subject": "Task 14: Update HighLevelReqs.md", "status": "pending", "blockedBy": [1]},
{"id": 15, "subject": "Task 15: Update CLAUDE.md", "status": "pending", "blockedBy": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]},
{"id": 16, "subject": "Task 16: Final cross-reference verification", "status": "pending", "blockedBy": [15]},
{"id": 17, "subject": "Task 17: Merge to main (user-gated)", "status": "pending", "blockedBy": [16]}
],
"lastUpdated": "2026-05-20T00:00:00Z"
}

@@ -0,0 +1,380 @@
# Component: Audit Log
## Purpose
Provides a single, append-only, forensic + operational record of every
integration action initiated by, or terminating in, a script — across outbound
API, outbound DB, notifications, and inbound API. One row per lifecycle event,
rich payloads, long retention, dashboards, drilldowns, and filter queries,
answering both forensic questions ("did instance X send notification Y on date
Z, with what body?") and operational ones ("which inbound caller is hammering
us right now?").
The Audit Log is **not a dispatcher**. It does not drive delivery, retry loops,
or operator Retry/Discard actions — those remain in [Notification Outbox](Component-NotificationOutbox.md)
and [Site Call Audit](Component-SiteCallAudit.md). The Audit Log is the
immutable history that **observes** those subsystems and adds coverage where
they are silent (sync `ExternalSystem.Call`, sync DB writes and reads, inbound
API requests).
## Location
Central cluster and site clusters.
- **Central:** the `AuditLog` table in central MS SQL, plus three singletons on
the active central node — `AuditLogIngestActor` (telemetry receiver),
`SiteAuditReconciliationActor`, and `AuditLogPurgeActor`.
- **Sites:** a site-local `AuditLog` SQLite database file alongside the
Store-and-Forward buffer, plus a `SiteAuditTelemetryActor` singleton on the
active site node.
Registered as component #23 in the Host role configuration.
## Responsibilities
- Accept site-local hot-path audit writes from script-trust-boundary call paths.
- Forward site audit rows to central via gRPC telemetry with at-least-once
delivery and idempotency on `EventId`.
- Run periodic per-site reconciliation pulls so missed telemetry self-heals.
- Accept central-originated audit writes (Inbound API, Notification dispatch
attempts and terminal status).
- Compute point-in-time KPIs (global and per-site) from the central `AuditLog`
table.
- Purge expired rows by monthly partition switch — no row-level deletes.
## Scope — the script trust boundary
The Audit Log captures every action a script causes to cross the cluster trust
boundary:
| Channel | Trigger | Direction | Covered today? |
|---|---|---|---|
| `ExternalSystem.Call(...)` | Script | Outbound | No (gap) |
| `ExternalSystem.CachedCall(...)` | Script | Outbound | Yes — `SiteCalls` (Site Call Audit) |
| `Database.Connection().Execute*(...)` — writes | Script | Outbound | No (gap) |
| `Database.CachedWrite(...)` | Script | Outbound | Yes — `SiteCalls` (Site Call Audit) |
| `Notify.To(list).Send(...)` | Script | Outbound | Yes — `Notifications` (Notification Outbox) |
| `POST /api/{method}` (Inbound API) | External | Inbound (invokes a script) | No (gap) |
Out of scope — framework traffic is not audited:
- Health checks, heartbeats, cluster membership messages.
- gRPC inter-cluster real-time streams (attribute values, alarm states).
- Data Connection Layer ↔ OPC UA / custom protocol traffic.
- LDAP authentication probes, Traefik routing decisions.
- Internal Configuration Database queries by the framework.
- Site Event Log writes; audit log writes themselves.
Script-initiated DB **reads** via `Database.Connection().ExecuteReader(...)`
count as actions from a script and are in scope. Reads via DCL / subscriptions
are framework traffic and excluded.
## The `AuditLog` Table (central)
Single wide table in central MS SQL, polymorphic by `Channel` + `Kind`
discriminators, with a JSON `Extra` column for channel-specific overflow. One
row per lifecycle event across all channels.
| Column | Type | Notes |
|---|---|---|
| `EventId` | `uniqueidentifier` PK | Generated where the event originates (site or central). Idempotency key. |
| `OccurredAtUtc` | `datetime2` | When the event happened (call returned, retry attempted, etc.). |
| `IngestedAtUtc` | `datetime2` | When central persisted the row (lags `OccurredAtUtc` for site-originated rows). |
| `Channel` | `varchar(32)` | `ApiOutbound` \| `DbOutbound` \| `Notification` \| `ApiInbound`. |
| `Kind` | `varchar(32)` | Channel-specific event kind (see below). |
| `CorrelationId` | `uniqueidentifier` NULL | Ties multi-event operations together. `TrackedOperationId` for cached calls, `NotificationId` for notifications, request-id for inbound API. NULL for sync one-shot calls. |
| `SourceSiteId` | `varchar(64)` NULL | NULL for central-originated events. |
| `SourceInstanceId` | `varchar(128)` NULL | Instance whose script initiated the action (when applicable). |
| `SourceScript` | `varchar(128)` NULL | Script name within the instance. |
| `Actor` | `varchar(128)` NULL | Inbound API: API key name. Outbound: script identity. Central: system user. |
| `Target` | `varchar(256)` NULL | Outbound API: external system + method. DB: connection name. Notification: list name. Inbound API: method name. |
| `Status` | `varchar(32)` | Outcome of *this event*: `Success`, `TransientFailure`, `PermanentFailure`, `Enqueued`, `Retrying`, `Delivered`, `Parked`, `Discarded`. |
| `HttpStatus` | `int` NULL | HTTP-bearing events only. |
| `DurationMs` | `int` NULL | Call / attempt duration. |
| `ErrorMessage` | `nvarchar(1024)` NULL | Truncated; `ErrorDetail` for full text. |
| `ErrorDetail` | `nvarchar(max)` NULL | Optional full exception text on failures. |
| `RequestSummary` | `nvarchar(max)` NULL | Truncated request payload (configurable cap). Headers redacted. |
| `ResponseSummary` | `nvarchar(max)` NULL | Truncated response payload. Full on errors. |
| `PayloadTruncated` | `bit` | Set if either summary was truncated. |
| `Extra` | `nvarchar(max)` NULL | Channel-specific JSON for fields we don't promote to columns. |
**Indexes (first cut):**
- `IX_AuditLog_OccurredAtUtc` — primary time-range index for global scans.
- `IX_AuditLog_Site_Occurred (SourceSiteId, OccurredAtUtc)` — per-site filters.
- `IX_AuditLog_Correlation (CorrelationId)` — drilldown from a single operation.
- `IX_AuditLog_Channel_Status_Occurred (Channel, Status, OccurredAtUtc)` — KPI / dashboard tiles.
- `IX_AuditLog_Target_Occurred (Target, OccurredAtUtc)` — "what did we send to system X".
- Monthly partitioning on `OccurredAtUtc` from day one; purge is a partition switch (see Retention & Purge).
**`Kind` values by channel:**
| Channel | Kinds |
|---|---|
| `ApiOutbound` | `SyncCall`, `CachedEnqueued`, `CachedAttempt`, `CachedTerminal` |
| `DbOutbound` | `SyncWrite`, `SyncRead`, `CachedEnqueued`, `CachedAttempt`, `CachedTerminal` |
| `Notification` | `Enqueued`, `Attempt`, `Terminal` |
| `ApiInbound` | `Completed` — one row per request, written at request end with final status |
Inbound API is intentionally collapsed to a single `Completed` row per request
rather than a multi-event lifecycle.
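
The `Channel` + `Kind` discriminator pair can be sketched as a small validated record. This is an illustration only, written in Python for brevity: the class name mirrors `AuditEvent` from this document, but the field subset, the snake_case names, and the validation in `__post_init__` are assumptions, not the Commons type definition.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
from uuid import UUID, uuid4

# Channel -> allowed Kind values, copied from the "Kind values by channel" table.
KINDS_BY_CHANNEL = {
    "ApiOutbound": {"SyncCall", "CachedEnqueued", "CachedAttempt", "CachedTerminal"},
    "DbOutbound": {"SyncWrite", "SyncRead", "CachedEnqueued", "CachedAttempt", "CachedTerminal"},
    "Notification": {"Enqueued", "Attempt", "Terminal"},
    "ApiInbound": {"Completed"},
}


@dataclass(frozen=True)
class AuditEvent:
    """Hypothetical in-memory shape of one AuditLog row (subset of columns)."""

    channel: str
    kind: str
    status: str
    # EventId is generated where the event originates and is the idempotency key.
    event_id: UUID = field(default_factory=uuid4)
    occurred_at_utc: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    correlation_id: Optional[UUID] = None  # NULL for sync one-shot calls
    source_site_id: Optional[str] = None   # NULL for central-originated events
    target: Optional[str] = None

    def __post_init__(self):
        # Assumed validation: reject a Kind that the table does not list for the Channel.
        allowed = KINDS_BY_CHANNEL.get(self.channel)
        if allowed is None:
            raise ValueError(f"unknown channel: {self.channel}")
        if self.kind not in allowed:
            raise ValueError(f"kind {self.kind} is not valid for channel {self.channel}")
```

For example, `AuditEvent(channel="ApiInbound", kind="Completed", status="Success")` is accepted, while pairing `ApiInbound` with `SyncCall` raises `ValueError`.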
## The Site-Local `AuditLog` (SQLite)
A SQLite database file on each site node, alongside the Store-and-Forward
buffer. Same schema as central minus `IngestedAtUtc` (irrelevant at the source),
plus a `ForwardState` column with values `Pending | Forwarded | Reconciled` that
drives the telemetry loop and reconciliation pull.
**Site SQLite retention rule (hard invariant):**
> A row is eligible for purge only when both `OccurredAtUtc < retention threshold` AND `ForwardState IN ('Forwarded', 'Reconciled')`. Pending rows are never purged.
A prolonged central outage will grow the site audit table indefinitely until
central is reachable again. This is intentional — losing audit rows to make
room is a compliance violation, not a self-healing behavior. To bound that
growth in practice, the site emits a `SiteAuditBacklog` health metric (pending
row count, oldest pending age, bytes on disk); crossing operator-configured
thresholds surfaces a warning on the relevant site tile in the Health
dashboard, mirroring the Store-and-Forward Engine's backlog metric.
Central is the durable home. Site SQLite is a write-buffer with a forwarding
guarantee.
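
The retention invariant above can be sketched as a single guarded delete. This is a minimal illustration using Python's `sqlite3`; the three-column schema, the ISO-8601 text timestamps, and the function name `purge_site_audit` are assumptions made for the example, not the site job itself.

```python
import sqlite3

# Assumed minimal schema: only the columns the purge rule needs.
SCHEMA = """
CREATE TABLE IF NOT EXISTS AuditLog (
    EventId       TEXT PRIMARY KEY,
    OccurredAtUtc TEXT NOT NULL,   -- ISO-8601 UTC text, so string order equals time order
    ForwardState  TEXT NOT NULL    -- Pending | Forwarded | Reconciled
)
"""

PURGE_SQL = """
DELETE FROM AuditLog
WHERE OccurredAtUtc < :threshold
  AND ForwardState IN ('Forwarded', 'Reconciled')
"""


def purge_site_audit(conn: sqlite3.Connection, threshold_utc: str) -> int:
    """Delete rows that are both expired and already held by central.

    Pending rows never match the predicate, whatever their age.
    Returns the number of rows removed.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(PURGE_SQL, {"threshold": threshold_utc})
    return cur.rowcount
```

Because both conditions sit in one `WHERE` clause, an old `Pending` row survives every purge run until telemetry or reconciliation has moved it out of `Pending`.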
## Ingestion Paths
Four paths feed the central `AuditLog` — one site originator and three central
writers — all idempotent on `EventId`.
### Site hot-path append (site-originated events)
The component completing a script-trust-boundary action (External System
Gateway, Database layer, Store-and-Forward Engine) builds an `AuditEvent` with a
fresh `EventId` (Guid v4) and `OccurredAtUtc = UtcNow`, then appends it to the
site-local `AuditLog` SQLite via `IAuditWriter` with
`ForwardState = 'Pending'`. The append is a single-statement INSERT and is
durable in microseconds; control returns to the script with no central
round-trip on the hot path.
### Telemetry forward (site → central)
A `SiteAuditTelemetryActor` singleton drives the forwarding loop: select up to
N `Pending` rows ordered by `OccurredAtUtc`, batch-send them to central via the
existing `SiteStream` gRPC channel as `IngestAuditEvents(events)`, and on
central-ack flip `ForwardState = 'Forwarded'` for accepted IDs. Rejected IDs
stay `Pending` for the next sweep. Cadence is short (default 5 s) when
non-empty, longer (default 30 s) when idle; telemetry runs on a dedicated
dispatcher.
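
One sweep of the forwarding loop might look like the following sketch. It is a simplification under stated assumptions: `send_batch` is a hypothetical stand-in for the `IngestAuditEvents` gRPC call and returns the accepted IDs, only IDs are sent rather than full rows, and the batch size default is invented for the example.

```python
import sqlite3
from typing import Callable, Iterable, List

# Assumed minimal schema for the example.
SCHEMA = """
CREATE TABLE IF NOT EXISTS AuditLog (
    EventId       TEXT PRIMARY KEY,
    OccurredAtUtc TEXT NOT NULL,
    ForwardState  TEXT NOT NULL
)
"""


def forward_pending(
    conn: sqlite3.Connection,
    send_batch: Callable[[List[str]], Iterable[str]],
    batch_size: int = 100,
) -> int:
    """Send up to batch_size Pending rows, oldest first, and mark the acked ones Forwarded."""
    rows = conn.execute(
        "SELECT EventId FROM AuditLog WHERE ForwardState = 'Pending' "
        "ORDER BY OccurredAtUtc LIMIT ?",
        (batch_size,),
    ).fetchall()
    ids = [r[0] for r in rows]
    if not ids:
        return 0
    # Only IDs we actually sent can be flipped; anything else in the ack is ignored.
    accepted = set(send_batch(ids)) & set(ids)
    with conn:
        conn.executemany(
            "UPDATE AuditLog SET ForwardState = 'Forwarded' "
            "WHERE EventId = ? AND ForwardState = 'Pending'",
            [(event_id,) for event_id in accepted],
        )
    return len(accepted)


def next_delay_seconds(forwarded_any: bool) -> int:
    """Cadence from the text: short interval while draining, longer when idle."""
    return 5 if forwarded_any else 30
```

Rejected IDs are simply left untouched, so they are picked up again by the next sweep without any extra bookkeeping.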
### Reconciliation pull (self-healing for missed telemetry)
A central `SiteAuditReconciliationActor` periodically (default 5 min per site)
asks each site for its oldest `Pending` row and pending count; if backlog is
non-draining (e.g., telemetry actor wedged), central issues a
`PullAuditEvents(sinceUtc, batchSize)` and inserts-if-not-exists. Accepted rows
are flipped to `ForwardState = 'Reconciled'` site-side. Same self-healing
pattern as Site Call Audit's reconciliation of `SiteCalls`.
### Central direct-write (central-originated events)
Events originating at central never touch site SQLite. Inbound API writes one
`ApiInbound.Completed` row via `ICentralAuditWriter` synchronously inside the
request-handler middleware, before the HTTP response is flushed. The
Notification Outbox dispatcher writes `Notification.Attempt` per delivery
attempt and `Notification.Terminal` on terminal status. Central direct-writes
use the same insert-if-not-exists semantics keyed on `EventId`.
## Cached Operations — Combined Telemetry
For `ExternalSystem.CachedCall` and `Database.CachedWrite`, the **site** is the
source of truth for every audit row. The site writes each lifecycle event
(`CachedEnqueued`, `CachedAttempt`, `CachedTerminal`) to its local SQLite
`AuditLog` on the hot path (or on the retry tick for `CachedAttempt`), then
forwards via the same telemetry channel. The telemetry message format gains the
audit-row fields additively — one packet per lifecycle transition carries both
the operational state update AND the audit row content.
On receipt, central performs both writes in one transaction:
1. Insert-if-not-exists the immutable `AuditLog` row, keyed on `EventId`.
2. Upsert the operational `SiteCalls` row — existing Site Call Audit behavior
(status, retry count, last error, timestamps).
This collapses two telemetry concerns into one, keeps site SQLite as the
single local source of truth for audit content, and preserves the existing
operational `SiteCalls` shape for the dispatcher and UI.
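
The dual write can be sketched as two statements inside one transaction. The example below uses SQLite through Python's `sqlite3` purely as a stand-in for central MS SQL (the real upsert syntax would differ), and the packet keys, the reduced column sets, and the function name `ingest_cached_packet` are assumptions for illustration.

```python
import sqlite3

# Assumed, heavily reduced schemas for both tables.
SCHEMA = """
CREATE TABLE AuditLog (
    EventId TEXT PRIMARY KEY,
    Kind    TEXT NOT NULL,
    Status  TEXT NOT NULL
);
CREATE TABLE SiteCalls (
    TrackedOperationId TEXT PRIMARY KEY,
    Status             TEXT NOT NULL,
    RetryCount         INTEGER NOT NULL,
    LastError          TEXT
);
"""


def ingest_cached_packet(conn: sqlite3.Connection, packet: dict) -> None:
    """Apply one combined telemetry packet: immutable audit insert plus operational upsert."""
    with conn:  # both statements commit together or not at all
        # 1. Insert-if-not-exists, keyed on EventId, so redelivery is a no-op.
        conn.execute(
            "INSERT INTO AuditLog (EventId, Kind, Status) "
            "SELECT :event_id, :kind, :status "
            "WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = :event_id)",
            packet,
        )
        # 2. Upsert the operational row for the tracked operation.
        conn.execute(
            "INSERT INTO SiteCalls (TrackedOperationId, Status, RetryCount, LastError) "
            "VALUES (:tracked_operation_id, :status, :retry_count, :last_error) "
            "ON CONFLICT(TrackedOperationId) DO UPDATE SET "
            "Status = excluded.Status, "
            "RetryCount = excluded.RetryCount, "
            "LastError = excluded.LastError",
            packet,
        )
```

Replaying the same packet leaves `AuditLog` unchanged and rewrites `SiteCalls` with identical values, which is what makes at-least-once telemetry safe here.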
## Payload Capture Policy
- **Default cap** — 8 KB for each of `RequestSummary` and `ResponseSummary`;
raised to 64 KB on any non-`Success` row.
- **Truncation** — UTF-8 byte-safe; `PayloadTruncated = 1` when applied. Full
bodies are never stored.
- **HTTP headers** — `Authorization`, `Cookie`, `Set-Cookie`, `X-API-Key`, and
any header matching the configured redact-list regex become `<redacted>`.
- **HTTP bodies** — captured verbatim by default. Operators register per-target
body redactors (regex → replacement) for known secret fields.
- **SQL** — statement text and parameter values captured verbatim by default;
per-connection opt-in to redact parameters whose name matches a regex.
- **Never captured** — raw API key material (only the key *name* via `Actor`),
LDAP bind credentials, cluster secrets, Configuration DB connection strings.
- **Safety net** — if a configured redactor throws, the affected payload becomes
`"<redacted: redactor error>"` and `AuditRedactionFailure` increments. We
over-redact, never under-redact, on configuration faults.
Redaction happens at the write site, before the row touches SQLite (or central
MS SQL for direct-write events). Unredacted secrets never persist.
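
The capture rules above can be illustrated with a few small helpers. This is a sketch under assumptions: the function names are invented, the redact-list regex is matched against header names case-insensitively, and the metric increment for `AuditRedactionFailure` is left out.

```python
import re
from typing import Callable, Dict, Optional, Tuple

DEFAULT_CAP_BYTES = 8192
ERROR_CAP_BYTES = 65536
ALWAYS_REDACTED = {"authorization", "cookie", "set-cookie", "x-api-key"}


def cap_for_status(status: str) -> int:
    """8 KB by default, raised to 64 KB on any non-Success row."""
    return DEFAULT_CAP_BYTES if status == "Success" else ERROR_CAP_BYTES


def truncate_utf8(text: str, cap_bytes: int) -> Tuple[str, bool]:
    """Cut text to at most cap_bytes of UTF-8 without splitting a character.

    Returns (summary, truncated_flag); the flag maps to PayloadTruncated.
    """
    raw = text.encode("utf-8")
    if len(raw) <= cap_bytes:
        return text, False
    # errors="ignore" drops the partial multi-byte sequence left at the cut point.
    return raw[:cap_bytes].decode("utf-8", errors="ignore"), True


def redact_headers(headers: Dict[str, str], extra_pattern: Optional[str] = None) -> Dict[str, str]:
    """Replace sensitive header values with <redacted>."""
    rx = re.compile(extra_pattern, re.IGNORECASE) if extra_pattern else None
    return {
        name: "<redacted>"
        if name.lower() in ALWAYS_REDACTED or (rx is not None and rx.search(name))
        else value
        for name, value in headers.items()
    }


def apply_body_redactor(body: str, redactor: Callable[[str], str]) -> str:
    """Safety net: a failing redactor over-redacts instead of leaking the body."""
    try:
        return redactor(body)
    except Exception:
        return "<redacted: redactor error>"
```

Running these steps before the row is built keeps the "redaction happens at the write site" property: nothing downstream of the writer ever sees the unredacted payload.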
## Failure Handling & Idempotency
- **`EventId` is the dedup key.** Generated at the originator; central ingest
is `INSERT … WHERE NOT EXISTS (SELECT 1 FROM AuditLog WHERE EventId = @id)`
under the PK constraint. Idempotent across telemetry retries, reconciliation
pulls, and any combination of the two.
- **Never fail the action.** A failed audit write — site SQLite or central
direct-write — logs a critical Site Event Log entry and increments a health
metric (`SiteAuditWriteFailures` or `CentralAuditWriteFailures`), but the
user-facing action proceeds. We do not fail script-initiated work because the
audit write failed.
- **Hot-path ring buffer.** While the site audit writer is unhealthy
(disk full, schema lock, transient IO), events buffer in a small in-memory
ring (default 1024 rows); oldest are discarded with a Site Event Log warning
per drop.
- **Reconciliation as fallback.** If two consecutive reconciliation cycles
report a non-draining backlog, the supervisor restarts the telemetry actor
and a `SiteAuditTelemetryStalled` event fires.
- **No dedup horizon.** `EventId` PK enforces uniqueness only while a row
exists. A retry that arrives after the original row is purged inserts a "new"
row — vanishingly rare and harmless.
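
The "never fail the action" rule and the hot-path ring buffer can be sketched together. The class and function names are hypothetical, and the callbacks stand in for the Site Event Log entry and health-metric increment that this section describes.

```python
from collections import deque
from typing import Any, Callable, List, Optional


class AuditRingBuffer:
    """Bounded in-memory buffer that discards the oldest event when full."""

    def __init__(self, capacity: int = 1024, on_drop: Optional[Callable[[Any], None]] = None):
        self._events: deque = deque()
        self._capacity = capacity
        self._on_drop = on_drop or (lambda event: None)
        self.dropped = 0

    def push(self, event: Any) -> None:
        if len(self._events) >= self._capacity:
            oldest = self._events.popleft()
            self.dropped += 1
            self._on_drop(oldest)  # stand-in for the per-drop Site Event Log warning
        self._events.append(event)

    def drain(self) -> List[Any]:
        """Hand back everything buffered, oldest first, and empty the buffer."""
        items = list(self._events)
        self._events.clear()
        return items


def write_audit_safely(
    write: Callable[[Any], None],
    event: Any,
    buffer: AuditRingBuffer,
    on_failure: Callable[[Exception], None],
) -> bool:
    """Try the audit write; on any error record it, buffer the event, and return False.

    The caller's action continues either way: no exception escapes.
    """
    try:
        write(event)
        return True
    except Exception as exc:
        on_failure(exc)  # stand-in for the critical log entry and metric increment
        buffer.push(event)
        return False
```

A recovered writer would call `drain()` and replay the buffered events; since `EventId` is assigned before the first attempt, a replay that overlaps an earlier partial success is still deduplicated at insert time.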
## Retention & Purge
- **Central:** 365-day default based on `OccurredAtUtc`, configurable via
`AuditLog:RetentionDays` (min 7, max 3650). Single global retention in v1 —
no per-channel overrides.
- **Partitioning:** monthly partitions on `OccurredAtUtc` from day one
(`pf_AuditLog_Month` / `ps_AuditLog_Month`). Purge is a partition switch;
there are no row-level deletes at central.
- **Purge actor:** `AuditLogPurgeActor` singleton on the active central node
runs daily, switches out any partition whose latest `OccurredAtUtc` is older
than the retention window, and emits an `AuditLog:Purged` event (partition
range, rowcount, duration). A partition-maintenance step rolls forward each
month, creating the next month's partition ahead of time.
- **Sites:** daily site job; default 7-day retention (configurable, min 1,
max 90). Respects the hard `ForwardState` invariant — `Pending` rows are
never purged on age alone.
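
Which monthly partitions qualify for a switch-out can be worked out from the partition bounds alone. The sketch below makes one simplifying assumption: it uses the partition's exclusive upper bound as a conservative proxy for its latest `OccurredAtUtc`, so a partition is purged only when every possible row in it is past retention. The function names are invented for the example.

```python
from datetime import datetime, timedelta, timezone


def month_end_exclusive(year: int, month: int) -> datetime:
    """First instant of the following month, in UTC."""
    if month == 12:
        return datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    return datetime(year, month + 1, 1, tzinfo=timezone.utc)


def partition_is_purgeable(year: int, month: int, now_utc: datetime, retention_days: int = 365) -> bool:
    """True when the whole monthly partition lies before the retention cutoff."""
    cutoff = now_utc - timedelta(days=retention_days)
    return month_end_exclusive(year, month) <= cutoff
```

With the default 365 days and a purge run on 2026-05-20, the April 2025 partition qualifies while May 2025 does not, because May 2025 may still hold rows newer than the cutoff of 2025-05-20.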
## Security & Tamper-Evidence
- **Append-only enforcement.** The application accesses `AuditLog` via a
dedicated DB role `scadalink_audit_writer` granted `INSERT` + `SELECT` only —
no `UPDATE`, no `DELETE`. Purge runs under a separate role
`scadalink_audit_purger` whose permissions are limited to the partition-switch
operation; row-level `DELETE` is not granted even to purge.
- **CI grep guard.** The build greps the data layer for any
`UPDATE … AuditLog` or `DELETE … AuditLog` text and fails on a hit.
- **Authorization.** Reading the Audit Log requires the existing **Audit** role
extended with a new **OperationalAudit** permission. Per-site row scoping
reuses the existing site-permission model; bulk export requires an additional
**AuditExport** permission.
- **Payload redaction at write.** See Payload Capture Policy. Unredacted
secrets never persist; the safety net over-redacts on misconfiguration.
- **Hash-chain tamper evidence — deferred to v1.x.** A future `RowHash` column,
computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will be
verifiable offline via `scadalink audit verify-chain --month YYYY-MM`. Off by
default in v1.
- **Site SQLite security.** File permissions: read/write by the ScadaLink
service account only. Not backed up off-machine — site SQLite is a buffer,
not a record.
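
The deferred hash chain, `SHA-256(prev.RowHash || canonical(row))`, could be computed roughly as follows. Everything beyond that formula is an assumption for illustration: the canonical form (sorted-key compact JSON), the all-zero genesis value for the first row of a partition, and the row ordering are not specified in v1.

```python
import hashlib
import json
from typing import Dict, List, Sequence

GENESIS = b"\x00" * 32  # assumed starting value for the first row in a partition


def canonical(row: Dict) -> bytes:
    """Assumed canonical form: compact JSON with sorted keys, UTF-8 encoded."""
    return json.dumps(row, sort_keys=True, separators=(",", ":"), default=str).encode("utf-8")


def chain_hashes(rows: Sequence[Dict]) -> List[bytes]:
    """Compute RowHash for each row as SHA-256(prev.RowHash || canonical(row))."""
    prev = GENESIS
    hashes = []
    for row in rows:
        prev = hashlib.sha256(prev + canonical(row)).digest()
        hashes.append(prev)
    return hashes


def verify_chain(rows: Sequence[Dict], stored_hashes: Sequence[bytes]) -> bool:
    """Recompute the chain and compare it with the stored RowHash values."""
    return len(rows) == len(stored_hashes) and chain_hashes(rows) == list(stored_hashes)
```

Because each hash folds in its predecessor, editing or removing any row changes every later hash in the partition, which is what an offline `verify-chain` run would detect.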
## KPIs
Point-in-time, computed from the central `AuditLog` table; global and per-site.
- **Audit volume** — events/min landing in the central `AuditLog`; global plus per-site sparkline.
- **Audit error rate** — % of central `AuditLog` rows with `Status` NOT IN (`Success`, `Delivered`, `Enqueued`) over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, transient failures, parked deliveries) — NOT audit-writer health, which surfaces separately via `CentralAuditWriteFailures` and `AuditRedactionFailure`.
- **Audit backlog** — sum of `Pending` site rows across sites; click drills into a per-site breakdown.
[Notification Outbox](Component-NotificationOutbox.md) and
[Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected — they remain
sourced from `Notifications` and `SiteCalls` respectively. Audit Log KPIs
describe the audit table itself.
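The error-rate definition above can be sketched in Python. The choice of `OccurredAtUtc` as the window timestamp, the `None` result for an empty window, and the function name are assumptions for the example rather than documented behavior.

```python
from datetime import datetime, timedelta, timezone

# Statuses that do NOT count as errors, per the KPI definition above.
NON_ERROR_STATUSES = {"Success", "Delivered", "Enqueued"}


def audit_error_rate(rows, now_utc, window=timedelta(minutes=5)):
    """Percentage of rows in the rolling window whose Status is an error.

    rows is an iterable of (occurred_at_utc, status) pairs.
    Returns None when the window holds no rows (assumption: no data, no rate).
    """
    start = now_utc - window
    statuses = [status for ts, status in rows if start < ts <= now_utc]
    if not statuses:
        return None
    errors = sum(1 for status in statuses if status not in NON_ERROR_STATUSES)
    return 100.0 * errors / len(statuses)
```

Note that under this definition `Retrying` and `Parked` rows count toward the rate, matching the "transient failures, parked deliveries" wording.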
## Configuration
Bound from `appsettings.json` to a new `AuditLogOptions` class owned by this
component (Options pattern):
```jsonc
"AuditLog": {
"DefaultCapBytes": 8192,
"ErrorCapBytes": 65536,
"HeaderRedactList": [ "Authorization", "Cookie", "Set-Cookie", "X-API-Key" ],
"GlobalBodyRedactors": [
{ "Pattern": "\"password\"\\s*:\\s*\"[^\"]+\"", "Replacement": "\"password\":\"<redacted>\"" }
],
"PerTargetOverrides": {
"Weather/GetForecast": { "CapBytes": 4096 },
"PlantDB": { "RedactSqlParamsMatching": "@apikey|@token" }
},
"RetentionDays": 365
}
```
`PerTargetOverrides` keys bind by External System / Inbound Method /
Notification List / Database Connection name. `RetentionDays` is a single
global value in v1; per-channel overrides are deferred to v1.x.
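To make the options concrete, here is a minimal Python sketch of how such settings might be applied when summarizing a payload. The ordering (redact, then truncate), the case-insensitive header match, and the rule that a per-target `CapBytes` applies only to non-error rows are assumptions; the authoritative behavior is the Payload Capture Policy.

```python
import re

# Mirrors a subset of the sample AuditLog options above.
OPTIONS = {
    "DefaultCapBytes": 8192,
    "ErrorCapBytes": 65536,
    "HeaderRedactList": ["Authorization", "Cookie", "Set-Cookie", "X-API-Key"],
    "GlobalBodyRedactors": [
        {"Pattern": r'"password"\s*:\s*"[^"]+"', "Replacement": '"password":"<redacted>"'}
    ],
    "PerTargetOverrides": {"Weather/GetForecast": {"CapBytes": 4096}},
}


def redact_headers(headers, options):
    """Replace values of listed headers (assumed case-insensitive match)."""
    blocked = {name.lower() for name in options["HeaderRedactList"]}
    return {k: ("<redacted>" if k.lower() in blocked else v) for k, v in headers.items()}


def summarize_body(body, target, is_error, options):
    """Return (summary, payload_truncated) after redaction and byte-capping."""
    for redactor in options["GlobalBodyRedactors"]:
        body = re.sub(redactor["Pattern"], redactor["Replacement"], body)
    override = options["PerTargetOverrides"].get(target, {})
    if is_error:
        cap = options["ErrorCapBytes"]
    else:
        cap = override.get("CapBytes", options["DefaultCapBytes"])
    raw = body.encode("utf-8")
    if len(raw) <= cap:
        return body, False
    # Drop any multi-byte character split by the cut.
    return raw[:cap].decode("utf-8", errors="ignore"), True
```

Redacting before truncating avoids a secret being cut in half and slipping past the pattern.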
## Dependencies
- **[Commons (#16)](Component-Commons.md)** — `AuditEvent`, `IAuditWriter` /
`ICentralAuditWriter` interfaces, and the `AuditChannel`, `AuditKind`,
`AuditStatus` enum types live here.
- **[Configuration Database (#17)](Component-ConfigurationDatabase.md)** — hosts
the `AuditLog` table schema, the monthly partition function and scheme, the
`scadalink_audit_writer` / `scadalink_audit_purger` DB roles, and the EF
migration. Distinct concern from `IAuditService` (config-change audit), which
is unchanged.
- **[Cluster Infrastructure (#13)](Component-ClusterInfrastructure.md)** —
singleton placement and supervision for `AuditLogIngestActor`,
`SiteAuditTelemetryActor`, `SiteAuditReconciliationActor`, and
`AuditLogPurgeActor`.
- **[Central–Site Communication (#5)](Component-Communication.md)** — carries
audit telemetry. New gRPC message types (`IngestAuditEvents`,
`PullAuditEvents`) are added to the existing site-stream proto additively.
- **[Site Runtime (#3)](Component-SiteRuntime.md)** — script-trust-boundary
call paths invoke `IAuditWriter` to append events.
- **[Host (#15)](Component-Host.md)** — registers this component (#23) under
the central and site roles.
## Interactions
- **[External System Gateway (#7)](Component-ExternalSystemGateway.md)** —
emits `ApiOutbound.SyncCall` rows on every sync `Call()`. For `CachedCall`,
emits the combined cached telemetry packet (audit row + operational update)
per Cached Operations — Combined Telemetry.
- **[External System Gateway (#7)](Component-ExternalSystemGateway.md) — Database layer** — the database access modes inside ESG emit `DbOutbound.SyncWrite` and `DbOutbound.SyncRead` on script-initiated `Connection()` calls; `Database.CachedWrite` emits the cached-write lifecycle rows via the combined-telemetry packet (same path as `ApiOutbound.Cached*`). Site Runtime is the API surface that exposes the `Database.*` calls to scripts; the audit emission itself lives in ESG.
- **[Inbound API (#14)](Component-InboundAPI.md)** — emits one
`ApiInbound.Completed` row per request from request-handler middleware,
written directly to central via `ICentralAuditWriter` before the response is
flushed.
- **[Notification Outbox (#21)](Component-NotificationOutbox.md)** — the
site-emitted `Notification.Enqueued` row flows via audit telemetry; the
central dispatcher writes `Notification.Attempt` (per delivery attempt) and
`Notification.Terminal` (on terminal status) directly via
`ICentralAuditWriter`. The operational `Notifications` table is unchanged.
- **[Site Call Audit (#22)](Component-SiteCallAudit.md)** — shares the
cached-call telemetry packet. Central ingest of that packet performs both the
`AuditLog` insert and the `SiteCalls` upsert in one transaction. `SiteCalls`
remains the operational state store; the Audit Log is its immutable shadow.
- **[Central UI (#9)](Component-CentralUI.md)** — a new **Audit** nav group
hosts the Audit Log page (filter bar, results grid, drilldown drawer,
server-side CSV export). Drill-in links appear on Notifications, Site Calls,
External Systems, Inbound API key, Sites, and Instances detail pages.
- **[Health Monitoring (#11)](Component-HealthMonitoring.md)** — three new
tiles (Volume, Error rate, Backlog) plus new health metrics:
`SiteAuditBacklog`, `SiteAuditWriteFailures`, `SiteAuditTelemetryStalled`,
`CentralAuditWriteFailures`, `AuditRedactionFailure`.
- **[CLI (#19)](Component-CLI.md)** — new `scadalink audit query`,
`scadalink audit export`, and `scadalink audit verify-chain` commands; same
permission requirements as the UI.
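The central ingest of a combined cached-call telemetry packet (append-only `AuditLog` insert plus `SiteCalls` upsert in one transaction) can be sketched with SQLite standing in for the central database. The trimmed table shapes, the packet field names, and the function names are hypothetical; `INSERT OR REPLACE` is used here as a simple stand-in for an upsert.

```python
import sqlite3


def open_central():
    """In-memory stand-in for the central DB with heavily trimmed tables."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE AuditLog (
            EventId TEXT PRIMARY KEY, CorrelationId TEXT, Kind TEXT, Status TEXT);
        CREATE TABLE SiteCalls (
            TrackedOperationId TEXT PRIMARY KEY, Status TEXT, RetryCount INTEGER);
        """
    )
    return db


def ingest_cached_packet(db, packet):
    """Apply one combined packet atomically: both writes commit or neither does."""
    with db:  # commits on success, rolls back if either statement raises
        # Insert-if-not-exists on EventId: replayed telemetry adds no duplicate row.
        db.execute(
            "INSERT OR IGNORE INTO AuditLog (EventId, CorrelationId, Kind, Status) "
            "VALUES (?, ?, ?, ?)",
            (packet["EventId"], packet["TrackedOperationId"], packet["Kind"], packet["Status"]),
        )
        # Operational state is mutable: the latest packet wins.
        db.execute(
            "INSERT OR REPLACE INTO SiteCalls (TrackedOperationId, Status, RetryCount) "
            "VALUES (?, ?, ?)",
            (packet["TrackedOperationId"], packet["Status"], packet["RetryCount"]),
        )
```

The audit table only ever grows, while `SiteCalls` holds a single row per operation that tracks its current state.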


@@ -172,6 +172,40 @@ scadalink security scope-rule delete --id <id>
scadalink audit-log query [--user <username>] [--entity-type <type>] [--action <action>] [--from <date>] [--to <date>] [--page <n>] [--page-size <n>]
```
The legacy `audit-log query` above targets the original configuration-change audit
surface (`IAuditService`). The new centralized Audit Log component (#23) is exposed via
the `scadalink audit` group below.
### Centralized Audit Commands
The `scadalink audit` group targets the centralized Audit Log component (#23) and
exposes the UI-equivalent operational audit surface. Permissions follow the same
read-vs-export split the Central UI uses (see Component-AuditLog.md, Security &
Tamper-Evidence, and Security & Auth #10): `audit query` and `audit verify-chain`
require the `OperationalAudit` permission; `audit export` additionally requires
`AuditExport`. The server enforces permission checks and returns HTTP 403 (CLI
exit code 2) on denial.
```
scadalink audit query --since <t> [--until <t>] [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--instance <i>] [--target <t>] [--actor <a>] [--correlation-id <id>] [--errors-only] [--page <n>] [--page-size <n>]
scadalink audit export --since <t> --until <t> --format csv|jsonl|parquet --output <path> [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--target <t>] [--actor <a>]
scadalink audit verify-chain --month <YYYY-MM>
```
- `audit query` — filtered query against the central `AuditLog` table, matching the
Central UI Audit Log page filter set (time range, channel, kind, status, site,
instance/script, target, actor, correlation ID, errors-only). Results stream as
JSON (default) or table.
- `audit export` — server-side streaming export of the central `AuditLog` to the
requested format (`csv`, `jsonl`, `parquet`) written to `--output`. The server
streams rows rather than materializing them in memory; the CLI writes bytes
through to disk. Supports the same scoping filters as `audit query`.
- `audit verify-chain` — hash-chain verification for the named month.
**No-op in v1**: the command is defined so the command tree is stable, but
verification only becomes meaningful once the hash-chain ships (see
Component-AuditLog.md, Security & Tamper-Evidence). Until then, the server
responds with a "verification not yet available" status and the CLI exits 0.
### Health Commands
```
scadalink health summary
@@ -273,6 +307,8 @@ Configuration is resolved in the following priority order (highest wins):
- **Commons**: Message contracts (`Messages/Management/`) for command type definitions and registry.
- **System.CommandLine**: Command-line argument parsing.
- **Microsoft.AspNetCore.SignalR.Client**: SignalR client for the `debug stream` command's WebSocket connection.
- **Management Service (#18)**: The CLI hits the central cluster via the existing HTTP Management API (`POST /management`), which dispatches to the ManagementActor. The `scadalink audit` command group rides this same transport — there is no separate audit endpoint.
- **Audit Log (#23)**: The `scadalink audit query`, `audit export`, and `audit verify-chain` subcommands target the centralized Audit Log component's query/export/verify surfaces via the Management API. Permission checks (`OperationalAudit`, `AuditExport`) are enforced server-side.
## Interactions


@@ -58,6 +58,7 @@ Central cluster only. Sites have no user interface.
### External System Management (Design Role)
- Define external system contracts: connection details, API method definitions (parameters, return types).
- Define retry settings per external system (max retry count, fixed time between retries).
- The external system detail page includes a **"Recent activity"** link that opens the Audit Log page pre-filtered to `Channel = ApiOutbound` and `Target` starts-with the system name — surfacing the system's recent outbound API audit history.
### Database Connection Management (Design Role)
- Define named database connections: server, database, credentials.
@@ -74,6 +75,11 @@ Central cluster only. Sites have no user interface.
- Define data connections and assign them to sites (name, protocol type, connection details).
- **Data connection form**: "Primary Endpoint Configuration" (required JSON text area) and optional "Backup Endpoint Configuration" (collapsible section, hidden by default, revealed via "Add Backup Endpoint" button; "Remove Backup" button when editing an existing backup). "Failover Retry Count" numeric input (default 3, min 1, max 20) is visible only when a backup endpoint is configured.
- **Data connection list page**: Shows Primary Config and Backup Config columns. Active Endpoint column populated from health reports.
- The site detail page exposes a new **"Audit feed"** tab that hosts the Audit Log page pre-filtered to `Site = <site>` — an in-context view of every operational audit event for that site.
### Inbound API Management (Admin Role for keys, Design Role for methods)
- Manage inbound API keys (create, enable / disable, delete) and define API methods (name, parameters, return values, approved keys, implementation script).
- The API key detail page includes a **"Recent calls"** link that opens the Audit Log page pre-filtered to `Actor = <key name>` and `Channel = ApiInbound` — surfacing the key's recent inbound-call audit history.
### Area Management (Admin Role)
- Define hierarchical area structures per site.
@@ -89,6 +95,7 @@ Central cluster only. Sites have no user interface.
- **Disable** instances — stops data collection, script triggers, and alarm evaluation at the site while retaining the deployed configuration.
- **Enable** instances — re-activates a disabled instance.
- **Delete** instances — removes the running configuration from the site. Blocked if the site is unreachable. Store-and-forward messages are not cleared.
- The instance detail page exposes a new **"Audit feed"** tab that hosts the Audit Log page pre-filtered to the instance (`Site = <site>` and the `Instance / Script` filter set to the instance unique name) — an in-context view of every operational audit event involving that instance.
### Deployment (Deployment Role)
- View list of instances with staleness indicators (deployed config differs from template-derived config).
@@ -124,6 +131,7 @@ Central cluster only. Sites have no user interface.
- **KPI tiles** at the top of the page: queue depth (`Pending` + `Retrying`), stuck count, parked count, delivered in the last interval, and oldest pending age. The KPIs are central-computed on demand from the `Notifications` table.
- A **queryable notification list** filterable by status, type, source site, notification list, and time range, with a **stuck-only toggle** and keyword search on subject. Each row shows the notification's status, retry count, last error, and key timestamps.
- **Retry** and **Discard** actions are available on parked notifications: Retry returns the notification to `Pending` and resets `RetryCount` / `NextAttemptAt`; Discard moves it to `Discarded`. The row is retained either way so the table stays a complete audit record.
- Each row exposes a **"View audit history"** action that opens the Audit Log page pre-filtered to `CorrelationId = NotificationId`, surfacing every operational audit event recorded for that notification.
- **Stuck rows are visually badged** — a notification is stuck if it is `Pending` or `Retrying` and older than the configurable stuck-age threshold. Stuck detection is display-only; there is no automated escalation or alerting.
- All queries are served from the central `Notifications` table — no remote per-site queries are needed, unlike the Parked Message Management page.
@@ -131,6 +139,7 @@ Central cluster only. Sites have no user interface.
- Monitor cached calls store-and-forwarded from sites — `ExternalSystem.CachedCall()` and `Database.CachedWrite()` operations. Scoped to the `ExternalCall` and `DatabaseWrite` kinds only; notifications keep their separate Notification Outbox page and are not merged here.
- A **queryable cached-call list** filterable by site, kind, status, and time range. Each row shows the call's timestamp, site, kind, target summary, status badge, retry count, and last error.
- **Retry** and **Discard** actions are available on `Parked` rows only — `Failed` rows are not actionable, since a permanent failure would simply fail again and its error was already returned synchronously to the calling script. The actions issue central→site commands to the owning site; if the site is offline the UI surfaces a "site unreachable" message.
- Each row exposes a **"View audit history"** action that opens the Audit Log page pre-filtered to `CorrelationId = TrackedOperationId`, showing every operational audit event recorded for that cached call.
- Data is served from the central Site Call Audit component's `SiteCalls` table. The page is **read-mostly** — an eventually-consistent mirror of site state; the site remains the source of truth.
### Health Monitoring Dashboard (All Roles)
@@ -138,14 +147,42 @@ Central cluster only. Sites have no user interface.
- Per-site detail: active/standby node status, data connection health, script error rates, alarm evaluation error rates, store-and-forward buffer depths.
- Headline **Notification Outbox KPI tiles** — queue depth, stuck count, and parked count. These are central-computed by the Notification Outbox from the central `Notifications` table (not part of any site health report). The full outbox view is on the dedicated Notification Outbox page.
- Headline **Site Call Audit KPI tiles** — buffered count, parked count, and failed-last-interval. These are central-computed by the Site Call Audit component from the central `SiteCalls` table (not part of any site health report). The full cached-call view is on the dedicated Site Calls page.
- Headline **Audit KPI tiles** — three tiles in a new "Audit" KPI group: **Audit volume**, **Audit error rate**, and **Audit backlog**. These are sourced from the Audit Log component (#23) and Health Monitoring per the metric definitions in Component-HealthMonitoring.md; the dashboard simply surfaces them. The full audit query view is on the dedicated Audit Log page.
### Site Event Log Viewer (Deployment Role)
- Query site event logs remotely.
- Filter by event type, time range, instance.
- View script executions, alarm events (activations, clears, evaluation errors), deployment events (including script compilation results), connection status changes, store-and-forward activity, instance lifecycle events (enable, disable, delete).
### Audit Log (Admin / Audit Role)
- Lives under a **new top-level "Audit" nav group** (sibling to Notifications). In v1 the Audit nav group contains this single Audit Log page; the pre-existing Configuration Audit Log Viewer remains its own page below.
- Global query / filter / drilldown over the central `AuditLog` table maintained by the Audit Log component (#23). Read-only — the table is append-only, so there are no edit actions on rows.
- Read access to the page requires the `OperationalAudit` permission (Security & Auth #10). Per-site row scoping reuses the existing site-permission model: a user sees only rows for sites they are authorized to operate. Bulk export (see below) additionally requires `AuditExport`. The split mirrors the CLI's permission model (see Component-CLI.md).
- **Filter bar** (top of page, collapses to a single row when not focused):
- Time range — relative (15m / 1h / 24h / 7d) or custom.
- Channel — multi-select: `ApiOutbound`, `DbOutbound`, `Notification`, `ApiInbound`.
- Kind — multi-select; the available options are filtered by the selected Channels.
- Status — multi-select.
- Site — multi-select, scoped to the user's authorized sites.
- Instance / Script — text search with autocomplete.
- Target — text search (system + method, DB connection, list name).
- Actor — text search (inbound API key name).
- CorrelationId — paste a `TrackedOperationId` / `NotificationId` / request-id to see the full event sequence for one operation.
- "Errors only" toggle — shorthand for `Status NOT IN (Success, Delivered, Enqueued)`.
- **Results grid** (custom Blazor + Bootstrap component, consistent with the rest of the UI — no third-party grid):
- Columns, all resizable and reorderable, persisted per user: `OccurredAtUtc`, `Site`, `Channel`, `Kind`, `Status`, `Target`, `Actor`, `DurationMs`, `HttpStatus`, `ErrorMessage`.
- Keyset pagination ordered by `(OccurredAtUtc desc, EventId desc)`. Default page size 100.
- Clicking a row opens the drilldown drawer.
- **Drilldown drawer**:
- Pretty-prints `RequestSummary` / `ResponseSummary` — JSON is auto-detected and syntax-highlighted; SQL is syntax-highlighted.
- Surfaces **redaction indicators** wherever headers or fields were stripped at write time, per the Audit Log component's "Payload Capture Policy".
- **"Copy as cURL"** action on `ApiOutbound` and `ApiInbound` rows.
- **"Show all events for this operation"** link — re-applies the current view filtered by the row's `CorrelationId`.
- **Export** button on the page header streams a server-side CSV of the current filter (default cap 100k rows; larger exports go through the CLI). Requires the `AuditExport` permission.
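The keyset pagination described for the results grid, ordered by `(OccurredAtUtc desc, EventId desc)`, can be sketched against SQLite. This is an illustrative Python sketch; the function name, the cursor shape, and the two-column table are assumptions, and a real query would also apply the filter bar and per-site scoping.

```python
import sqlite3

ORDER = "ORDER BY OccurredAtUtc DESC, EventId DESC LIMIT ?"


def fetch_page(db, page_size=100, after=None):
    """Return (rows, next_cursor) for one page.

    after is the (OccurredAtUtc, EventId) pair of the last row already shown,
    or None for the first page. next_cursor is None when the page came back short.
    """
    if after is None:
        sql = "SELECT OccurredAtUtc, EventId FROM AuditLog " + ORDER
        params = (page_size,)
    else:
        ts, event_id = after
        # Strictly "older than the cursor" under the composite descending order.
        sql = (
            "SELECT OccurredAtUtc, EventId FROM AuditLog "
            "WHERE OccurredAtUtc < ? OR (OccurredAtUtc = ? AND EventId < ?) " + ORDER
        )
        params = (ts, ts, event_id, page_size)
    rows = db.execute(sql, params).fetchall()
    next_cursor = rows[-1] if len(rows) == page_size else None
    return rows, next_cursor
```

Including `EventId` as a tiebreaker keeps pages stable when several rows share a timestamp, and the cursor avoids the growing cost of offset-based paging on a large append-only table.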
### Configuration Audit Log Viewer (Admin Role)
- Pre-existing viewer for the `IAuditService` configuration-change log (template / instance / site / etc. before-after edits). Lives under the same **Audit** nav group as the operational Audit Log above.
- Query the central configuration audit log.
- Filter by user, entity type, action type, time range.
- View before/after state for each change.
@@ -163,3 +200,4 @@ Central cluster only. Sites have no user interface.
- **Health Monitoring**: Provides site health data for the dashboard.
- **Notification Outbox**: Provides notification delivery KPIs and serves the `Notifications` table queries and Retry/Discard actions for the Notification Outbox page.
- **Site Call Audit**: Serves the `SiteCalls` table queries and relays Retry/Discard actions to sites for the Site Calls page.
- **Audit Log (#23)**: Serves all `AuditLog` table queries (filter / grid / drilldown / CSV export) for the new Audit Log page and the drill-in surfaces on Notifications, Site Calls, External Systems, Inbound API keys, Sites, and Instances. Payload capture, redaction, and per-site authorization follow the Audit Log component's "Payload Capture Policy" and "Security & Tamper-Evidence" sections.


@@ -54,6 +54,23 @@ remains the home of the configuration contract that the Host consumes.
- Connected to local SQLite databases (store-and-forward buffer, event logs, deployed configurations).
- Connected to machines via data connections (OPC UA).
## Cluster Singletons
Akka.NET cluster singletons run on the active node of their cluster and migrate on failover. Each singleton listed here is owned by the named component; this component (Cluster Infrastructure) provides only the hosting, supervision, and active-node placement guarantee.
### Central singletons (active central node)
- **`NotificationOutboxActor`** — owned by Notification Outbox (#21). Drives the central notification dispatch loop against the `Notifications` table.
- **`SiteCallAuditActor`** — owned by Site Call Audit (#22). Owns the operational `SiteCalls` table: drives periodic reconciliation pulls for `CachedCall` / `CachedWrite` lifecycle, computes KPIs, and relays operator Retry/Discard actions to the owning site. Note: ingest of cached-call telemetry is performed by `AuditLogIngestActor` (#23) in one transaction with the immutable `AuditLog` insert — see Component-AuditLog.md, Cached Operations — Combined Telemetry.
- **`AuditLogIngestActor`** — owned by Audit Log (#23). Receives gRPC telemetry batches of `AuditEvent` rows from sites and performs insert-if-not-exists on `EventId` against the central `AuditLog` table. For cached-call telemetry (which carries both audit-row content and operational-state fields in a single packet), the ingest performs the `AuditLog` insert and the `SiteCalls` upsert in **one transaction** — see Component-AuditLog.md for the combined-telemetry contract.
- **`SiteAuditReconciliationActor`** — owned by Audit Log (#23). Periodic per-site pull (default every 5 minutes) that self-heals missed audit telemetry by asking each site for its oldest `ForwardState = 'Pending'` row and issuing a `PullAuditEvents(sinceUtc, batchSize)` when a non-draining backlog is detected.
- **`AuditLogPurgeActor`** — owned by Audit Log (#23). Daily partition-switch purge against `ps_AuditLog_Month`; switches out any partition older than `AuditLog:RetentionDays` and emits an `AuditLog:Purged` event. Also rolls the partition scheme forward each month so the next month's partition exists ahead of time.
### Site singletons (active site node, per site cluster)
- **Site Runtime Deployment Manager** — owned by Site Runtime (#3). Owns the full Instance Actor hierarchy; re-creates it on failover from local SQLite.
- **`SiteAuditTelemetryActor`** — owned by Audit Log (#23). Drains the local site `AuditLog` SQLite's `ForwardState = 'Pending'` rows to central in batches via the existing `SiteStream` gRPC channel; cadence is short (default 5 s) when the queue is non-empty and longer (default 30 s) when idle. Runs on a **dedicated dispatcher** so it does not compete with the script blocking-I/O dispatcher (per Component-AuditLog.md, Ingestion Paths → Telemetry forward).
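The drain cadence of `SiteAuditTelemetryActor` (short interval while rows are pending, longer when idle) can be sketched as a single step function. This Python sketch is illustrative only: the `send_batch` callback, the batch size, the list-of-dicts row shape, and the use of `ConnectionError` to model a failed send are all assumptions.

```python
BUSY_INTERVAL_S = 5.0   # default cadence while the queue is non-empty
IDLE_INTERVAL_S = 30.0  # default cadence when there is nothing to forward


def drain_once(rows, send_batch, batch_size=100):
    """Forward one batch of Pending rows and return the delay before the next drain.

    rows: list of dicts with "EventId" and "ForwardState" (stand-in for site SQLite).
    send_batch: hypothetical callable returning the set of EventIds central acknowledged.
    """
    pending = [r for r in rows if r["ForwardState"] == "Pending"][:batch_size]
    if pending:
        try:
            acked = send_batch(pending)
        except ConnectionError:
            acked = set()  # best-effort: rows simply stay Pending for the next attempt
        for row in pending:
            if row["EventId"] in acked:
                row["ForwardState"] = "Forwarded"
    still_pending = any(r["ForwardState"] == "Pending" for r in rows)
    return BUSY_INTERVAL_S if still_pending else IDLE_INTERVAL_S
```

Only acknowledged rows leave `Pending`, so a central outage leaves the site buffer intact and the reconciliation pull can pick up anything the telemetry path missed.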
## Failover Behavior
### Detection


@@ -38,6 +38,10 @@ Commons must define shared primitive and utility types used across multiple comp
- **`TrackedOperationId`**: A GUID identifying a tracked store-and-forward operation (`ExternalSystem.CachedCall`, `Database.CachedWrite`, `Notify.Send`). Generated caller-side at the site at call time, returned to the script as a tracking handle, and reused as the idempotency key for telemetry sent to central. The notification domain's existing `NotificationId` is the notification-specific name for this same concept.
- **`TrackedOperationKind` enum**: ExternalCall, DatabaseWrite. Discriminates the two cached-call kinds carried by a tracked operation (notifications are tracked separately via the `NotificationType` enum).
- **`TrackedOperationStatus` enum**: Pending, Retrying, Delivered, Parked, Failed, Discarded. The unified lifecycle state shared by all tracked store-and-forward operations. This is the operation's externally-observable lifecycle status in the site-local tracking table (the status record); it is related to but distinct from the S&F buffer's own `StoreAndForwardMessageStatus`, which tracks a buffered message's retry state within the buffer (the retry mechanism). `Failed` (permanent failure) has no notification analogue — notifications use only the other five states (the `NotificationStatus` enum omits `Failed`).
- **`AuditChannel` enum**: ApiOutbound, DbOutbound, Notification, ApiInbound. Discriminates the script-trust-boundary channel that produced an `AuditEvent`. Owned by the Audit Log component.
- **`AuditKind` enum**: SyncCall, CachedEnqueued, CachedAttempt, CachedTerminal, SyncWrite, SyncRead, Enqueued, Attempt, Terminal, Completed. Channel-specific event kind — the valid `Kind` values for each `AuditChannel` are listed in the Audit Log component design (`Component-AuditLog.md`).
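- **`AuditStatus` enum**: Success, TransientFailure, PermanentFailure, Enqueued, Retrying, Delivered, Parked, Discarded. Outcome of a single audit event row. It covers the tracked-operation lifecycle and adds outcomes for one-shot sync calls; it is not a strict superset of `TrackedOperationStatus` by name, since `Pending` and `Failed` do not appear under those names.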
- **`AuditEvent`**: A record carrying every column of the central `AuditLog` row — `EventId` (GUID, idempotency key), `OccurredAtUtc`, `IngestedAtUtc`, `Channel` (`AuditChannel`), `Kind` (`AuditKind`), `CorrelationId`, `SourceSiteId`, `SourceInstanceId`, `SourceScript`, `Actor`, `Target`, `Status` (`AuditStatus`), `HttpStatus`, `DurationMs`, `ErrorMessage`, `ErrorDetail`, `RequestSummary`, `ResponseSummary`, `PayloadTruncated`, `Extra` — plus a site-only `ForwardState` (`Pending` | `Forwarded` | `Reconciled`) used by the site SQLite write-buffer's telemetry/reconciliation loop. `IngestedAtUtc` is unset at the site and stamped on central ingest. See `Component-AuditLog.md` for the persistence schema and ingest semantics.
Types defined here must be immutable and thread-safe.
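The enum and record bullets above could be expressed roughly as follows. This is a minimal sketch, not the project's actual code: the member names come from the bullets, while the CLR types, the nullability choices, the split between positional and `init` properties, and the `AuditForwardState` enum name are assumptions made for illustration.

```csharp
#nullable enable
using System;

public enum AuditChannel { ApiOutbound, DbOutbound, Notification, ApiInbound }

public enum AuditKind
{
    SyncCall, CachedEnqueued, CachedAttempt, CachedTerminal,
    SyncWrite, SyncRead, Enqueued, Attempt, Terminal, Completed
}

public enum AuditStatus
{
    Success, TransientFailure, PermanentFailure,
    Enqueued, Retrying, Delivered, Parked, Discarded
}

// Site-only forwarding state; the enum name is hypothetical.
public enum AuditForwardState { Pending, Forwarded, Reconciled }

// Immutable record: positional members are the always-present columns,
// init-only members are the nullable or defaulted ones (an assumed split).
public sealed record AuditEvent(
    Guid EventId,
    DateTime OccurredAtUtc,
    AuditChannel Channel,
    AuditKind Kind,
    AuditStatus Status)
{
    public DateTime? IngestedAtUtc { get; init; }   // unset at the site, stamped on central ingest
    public Guid? CorrelationId { get; init; }
    public string? SourceSiteId { get; init; }
    public string? SourceInstanceId { get; init; }
    public string? SourceScript { get; init; }
    public string? Actor { get; init; }
    public string? Target { get; init; }
    public int? HttpStatus { get; init; }
    public int? DurationMs { get; init; }
    public string? ErrorMessage { get; init; }
    public string? ErrorDetail { get; init; }
    public string? RequestSummary { get; init; }
    public string? ResponseSummary { get; init; }
    public bool PayloadTruncated { get; init; }
    public string? Extra { get; init; }
    public AuditForwardState? ForwardState { get; init; } // site-only
}
```

Because the record is immutable, central ingest would stamp the ingest time with a non-destructive copy such as `evt with { IngestedAtUtc = DateTime.UtcNow }` rather than mutating the instance.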
@@ -107,6 +111,8 @@ Commons must define service interfaces for cross-cutting concerns that multiple
- **`IExternalSystemClient`**: Provides script-facing invocation of external system HTTP APIs (synchronous `Call` and store-and-forward `CachedCall`). Implemented by the External System Gateway, consumed by the script runtime context.
- **`IInstanceLocator`**: Resolves an instance unique name to its site identifier. Used by the Inbound API's `Route.To()` to determine the destination site.
- **`INotificationDeliveryService`**: Sends notifications to a named notification list, routing transient failures to store-and-forward. Implemented by the Notification Service, consumed by the script runtime context.
- **`IAuditWriter`**: Site-local hot-path interface for appending an `AuditEvent` to the site SQLite `AuditLog`: `Task WriteAsync(AuditEvent evt, CancellationToken ct)`. Single durable INSERT, `ForwardState = Pending`. Consumed by the script-trust-boundary call paths (External System Gateway, Database layer, Store-and-Forward Engine). Implementation lives in the Audit Log component.
- **`ICentralAuditWriter`**: Central direct-write interface for central-originated audit rows (Inbound API request completion, Notification Outbox dispatcher attempts/terminals): `Task WriteAsync(AuditEvent evt, CancellationToken ct)`, with insert-if-not-exists semantics on `EventId` so retried handlers cannot produce duplicates. Implementation lives in the Audit Log component.
These interfaces are defined in Commons so that consuming components depend only on the abstraction, not on the implementing component.
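As an illustration of the two audit writer contracts, the sketch below declares both interfaces with the `WriteAsync` signature quoted in the bullets and pairs `ICentralAuditWriter` with a hypothetical in-memory implementation that shows the insert-if-not-exists behaviour on `EventId`. The trimmed `AuditEvent` stand-in and the `InMemoryCentralAuditWriter` class are assumptions for the example only; the real implementations live in the Audit Log component and write to SQLite and SQL Server respectively.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Trimmed stand-in for the full AuditEvent record (assumption for this sketch).
public sealed record AuditEvent(Guid EventId, string Channel, string Kind);

public interface IAuditWriter
{
    Task WriteAsync(AuditEvent evt, CancellationToken ct);
}

public interface ICentralAuditWriter
{
    Task WriteAsync(AuditEvent evt, CancellationToken ct);
}

// Hypothetical in-memory writer: the first write for an EventId wins,
// and later writes with the same EventId are silently ignored.
public sealed class InMemoryCentralAuditWriter : ICentralAuditWriter
{
    private readonly ConcurrentDictionary<Guid, AuditEvent> _rows = new();

    public int Count => _rows.Count;

    public Task WriteAsync(AuditEvent evt, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        _rows.TryAdd(evt.EventId, evt); // never updates an existing row
        return Task.CompletedTask;
    }
}
```

A retried handler that calls `WriteAsync` twice with the same event leaves `Count` at 1, which is the property the bullet describes.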
@@ -123,8 +129,9 @@ Commons must define the shared DTOs and message contracts used for inter-compone
- **Script Execution DTOs**: Script call requests (with recursion depth), return values, error results.
- **System-Wide Artifact DTOs**: Shared script packages, external system definitions, database connection definitions, notification list definitions.
- **Notification DTOs**: `NotificationSubmit` (site→central submission: `NotificationId`, `ListName`, `Subject`, `Body`, provenance, `SiteEnqueuedAt`) and `NotificationSubmitAck` (central acknowledgement returned only after the `Notifications` row is persisted — ack-after-persist — which the site Store-and-Forward Engine waits on before clearing the buffered message). `NotificationStatusQuery` / `NotificationStatusResponse` back the `Notify.Status` script API, round-tripping a status record (status, retry count, last error, key timestamps) once a notification has been forwarded. Recipient resolution is *not* part of any contract — the site forwards only `(listName, subject, body)` and central resolves the list at delivery time. Subject to the additive-only evolution rules in REQ-COM-5a, since a submission can cross the site→central version-skew boundary.
- **Cached Call Tracking DTOs**: `CachedCallTelemetry` (site→central lifecycle telemetry for a tracked cached call: `TrackedOperationId`, source site, `Kind` — the `TrackedOperationKind` enum (`ExternalCall` / `DatabaseWrite`) — target summary, status, retry count, last error, key timestamps, and source instance / script provenance) and `CachedCallReconcileRequest` / `CachedCallReconcileResponse` (cursor-based per-site pull of tracking rows changed since a cursor, used so missed telemetry self-heals). All three live in the `Integration/` message folder and are subject to the additive-only evolution rules in REQ-COM-5a, since they cross the site→central version-skew boundary. - **Cached Call Tracking DTOs**: `CachedCallTelemetry` (site→central lifecycle telemetry for a tracked cached call: `TrackedOperationId`, source site, `Kind` — the `TrackedOperationKind` enum (`ExternalCall` / `DatabaseWrite`) — target summary, status, retry count, last error, key timestamps, and source instance / script provenance) and `CachedCallReconcileRequest` / `CachedCallReconcileResponse` (cursor-based per-site pull of tracking rows changed since a cursor, used so missed telemetry self-heals). All three live in the `Integration/` message folder and are subject to the additive-only evolution rules in REQ-COM-5a, since they cross the site→central version-skew boundary. `CachedCallTelemetry` is additively extended to also carry the `AuditEvent` content for the corresponding lifecycle transition (`CachedEnqueued` / `CachedAttempt` / `CachedTerminal`), so one packet drives both the `SiteCalls` operational upsert and the `AuditLog` insert-if-not-exists in a single central transaction — see [Component-AuditLog.md](Component-AuditLog.md), Cached Operations — Combined Telemetry.
- **Parked Operation Command DTOs**: `RetryParkedOperation` and `DiscardParkedOperation` (central→site command/control messages keyed by `TrackedOperationId`, instructing the owning site to retry or discard a parked store-and-forward operation). These generalize the existing parked-message retry/discard commands to also cover parked cached calls; they live in the `RemoteQuery/` message folder alongside the other parked-message management messages.
- **Audit Telemetry DTOs**: `AuditTelemetryEnvelope` (site→central gRPC message wrapping a batch of `AuditEvent` rows for the `IngestAuditEvents` telemetry call) and the matching reconciliation pull messages (`PullAuditEvents` request/response carrying a `sinceUtc` cursor and a batch of `AuditEvent` rows). Live in the `Integration/` message folder, subject to the additive-only evolution rules in REQ-COM-5a since they cross the site→central version-skew boundary. Cached-operation audit rows do **not** travel via `AuditTelemetryEnvelope` — they are folded into `CachedCallTelemetry` per the bullet above.
All message types must be `record` types or immutable classes suitable for use as Akka.NET messages (though Commons itself must not depend on Akka.NET).
@@ -157,7 +164,9 @@ ScadaLink.Commons/
│ │ # DataType, StoreAndForwardCategory,
│ │ # StoreAndForwardMessageStatus,
│ │ # NotificationType, NotificationStatus,
│ │ # TrackedOperationKind, TrackedOperationStatus │ │ # TrackedOperationKind, TrackedOperationStatus,
│ │ # AuditChannel, AuditKind, AuditStatus
│ ├── Audit/ # AuditEvent record (site + central audit row)
│ ├── DataConnections/ # OPC UA endpoint config value objects + enums
│ ├── Flattening/ # FlattenedConfiguration, ConfigurationDiff,
│ │ # DeploymentPackage, ValidationResult
@@ -177,6 +186,8 @@ ScadaLink.Commons/
│ │ └── ICentralUiRepository.cs
│ └── Services/ # REQ-COM-4a: Cross-cutting service interfaces
│ ├── IAuditService.cs
│ ├── IAuditWriter.cs
│ ├── ICentralAuditWriter.cs
│ ├── IDatabaseGateway.cs
│ ├── IExternalSystemClient.cs
│ ├── IInstanceLocator.cs
@@ -209,7 +220,8 @@ ScadaLink.Commons/
│ ├── DataConnection/ # data-connection subscribe/write/health messages
│ ├── Instance/ # attribute get/set request/command messages
│ ├── Integration/ # external-integration call request/response,
│ │ # cached-call tracking telemetry + reconcile │ │ # cached-call tracking telemetry + reconcile,
│ │ # audit telemetry envelope + reconcile
│ ├── Notification/ # NotificationSubmit + ack,
│ │ # NotificationStatusQuery/Response
│ ├── InboundApi/ # Route.To() request messages


@@ -60,6 +60,9 @@ The configuration database stores all central system data, organized by domain a
### Site Calls
- **SiteCalls**: The central audit table for cached site calls — `ExternalSystem.CachedCall()` and `Database.CachedWrite()` — owned by the Site Call Audit component and a sibling of the `Notifications` table. One row per cached operation. Columns: `TrackedOperationId` (GUID, primary key — generated site-side at call time, used as the idempotency key), `SourceSite`, `Kind` (a `TrackedOperationKind` enum stored with values `ExternalCall` / `DatabaseWrite`), `TargetSummary` (external system + method for an `ExternalCall`, database connection name for a `DatabaseWrite`), `Status` (a `TrackedOperationStatus` enum stored with values `Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`), `RetryCount`, `LastError`, `Provenance` (source instance / script), `CreatedAtUtc`, `UpdatedAtUtc`, `TerminalAtUtc`. The table is populated **only** by Site Call Audit telemetry and reconciliation pulls — sites are the source of truth and the row is an eventually-consistent mirror, never written by a central dispatcher. Ingestion is **insert-if-not-exists** keyed on `TrackedOperationId`, then **upsert-on-newer-status**; the lifecycle is monotonic, so at-least-once and out-of-order telemetry are harmless. Indexed on `Status` and `SourceSite` for KPI computation and the Central UI query page. Terminal rows are removed by a daily purge job — see Scheduled Maintenance below. See Component-SiteCallAudit.md for the full lifecycle.
### Audit Log
- **AuditLog**: The central, append-only audit table owned by the Audit Log component — one row per script-trust-boundary lifecycle event across all channels (outbound API calls, outbound DB writes/reads, notifications, and inbound API requests). Sibling of the `Notifications` and `SiteCalls` tables but distinct: `AuditLog` is the immutable history that observes the other subsystems, not an operational state store. Columns: `EventId` (`uniqueidentifier` primary key — generated at the originator, used as the idempotency key), `OccurredAtUtc` (`datetime2`), `IngestedAtUtc` (`datetime2`), `Channel` (`varchar(32)`; one of `ApiOutbound` / `DbOutbound` / `Notification` / `ApiInbound`), `Kind` (`varchar(32)` — channel-specific event kind), `CorrelationId` (`uniqueidentifier` NULL — `TrackedOperationId` for cached calls, `NotificationId` for notifications, request-id for inbound API), `SourceSiteId` (`varchar(64)` NULL), `SourceInstanceId` (`varchar(128)` NULL), `SourceScript` (`varchar(128)` NULL), `Actor` (`varchar(128)` NULL), `Target` (`varchar(256)` NULL), `Status` (`varchar(32)` — outcome of *this event*: `Success`, `TransientFailure`, `PermanentFailure`, `Enqueued`, `Retrying`, `Delivered`, `Parked`, `Discarded`), `HttpStatus` (`int` NULL), `DurationMs` (`int` NULL), `ErrorMessage` (`nvarchar(1024)` NULL), `ErrorDetail` (`nvarchar(max)` NULL), `RequestSummary` (`nvarchar(max)` NULL — truncated request payload, headers redacted), `ResponseSummary` (`nvarchar(max)` NULL — truncated response payload), `PayloadTruncated` (`bit`), `Extra` (`nvarchar(max)` NULL — channel-specific JSON for fields not promoted to columns).
Indexes: `IX_AuditLog_OccurredAtUtc` (primary time-range index for global scans), `IX_AuditLog_Site_Occurred (SourceSiteId, OccurredAtUtc)` (per-site filters), `IX_AuditLog_Correlation (CorrelationId)` (drilldown from a single operation), `IX_AuditLog_Channel_Status_Occurred (Channel, Status, OccurredAtUtc)` (KPI / dashboard tiles), and `IX_AuditLog_Target_Occurred (Target, OccurredAtUtc)` ("what did we send to system X"). The primary key on `EventId` enforces idempotency — central ingest is `INSERT … WHERE NOT EXISTS`, so at-least-once telemetry and reconciliation retries collapse to a single row. **Monthly partitioning** on `OccurredAtUtc` from day one via partition function `pf_AuditLog_Month` and partition scheme `ps_AuditLog_Month`, with a filegroup-per-month rollover so that retention purge is a partition switch rather than a row-level delete. The partition-maintenance job that rolls the scheme forward and switches expired partitions is owned by the Audit Log component, not this component. The table is populated only by Audit Log writers (site telemetry, central direct-write, reconciliation pulls); central ingest is **insert-if-not-exists** keyed on `EventId`. See Component-AuditLog.md for the full lifecycle, payload-capture policy, and ingestion paths.
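To make the monthly partitioning on `OccurredAtUtc` concrete, the sketch below computes month-start boundary values of the kind a function such as `pf_AuditLog_Month` would be built from. It is an assumption that boundaries fall on the first instant of each UTC month; the helper class and its method names are hypothetical and do not appear in the design.

```csharp
using System;
using System.Collections.Generic;

public static class AuditLogPartitions
{
    // First instant of the UTC month that contains the given timestamp.
    public static DateTime MonthStart(DateTime occurredAtUtc) =>
        new DateTime(occurredAtUtc.Year, occurredAtUtc.Month, 1, 0, 0, 0, DateTimeKind.Utc);

    // Boundary values for `count` consecutive months, starting with the
    // month that contains `fromUtc`.
    public static IReadOnlyList<DateTime> Boundaries(DateTime fromUtc, int count)
    {
        var boundaries = new List<DateTime>(count);
        var start = MonthStart(fromUtc);
        for (var i = 0; i < count; i++)
        {
            boundaries.Add(start.AddMonths(i));
        }
        return boundaries;
    }
}
```

For example, `Boundaries(new DateTime(2026, 5, 20), 3)` yields 2026-05-01, 2026-06-01, and 2026-07-01, so an event that occurred on 2026-05-20 would land in the partition that starts at the first boundary.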
### Inbound API
- **API Keys**: Key definitions (name/label, key value, enabled flag).
- **API Methods**: Method definitions (name, approved key references, parameter definitions, return value definitions, implementation script, timeout).
@@ -215,7 +218,7 @@ Since only the after-state is stored, change history for an entity is reconstruc
### Query Capabilities
The Central UI audit log viewer can filter by: The Central UI Configuration Audit Log Viewer (distinct from the operational Audit Log page in #23) can filter by:
- **User**: Who made the change.
- **Entity type**: What kind of entity was changed.
- **Action type**: What kind of operation was performed.
@@ -226,6 +229,17 @@ Results are returned in reverse chronological order (most recent first) with pag
---
## Database Roles
The configuration database defines dedicated SQL Server roles for the append-only `AuditLog` table so that the application can never accidentally mutate audit history:
- **`scadalink_audit_writer`** — the role used by application code that ingests audit events (the `AuditLogIngestActor`, central direct-write paths, and the Notification Outbox dispatcher). Granted `INSERT` and `SELECT` on `AuditLog` only — explicitly **no** `UPDATE` and **no** `DELETE`. Audit ingest is `INSERT … WHERE NOT EXISTS` keyed on `EventId`, which this grant set fully supports.
- **`scadalink_audit_purger`** — the role used by the `AuditLogPurgeActor`. Granted only the permissions required to execute the monthly partition-switch operation (switch out a partition to a staging table and drop the staging table). Row-level `DELETE` on `AuditLog` is **not** granted even to the purge role; retention is a partition switch, never a row-by-row delete.
A CI grep guard fails the build on any occurrence of `UPDATE … AuditLog` or `DELETE … AuditLog` in the data-access layer source, backstopping the DB-grant enforcement at code-review time. See Component-AuditLog.md (Security & Tamper-Evidence) for the full enforcement contract.
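The CI guard described above could be approximated with a single regular expression. The sketch below is one possible shape, assuming the guard inspects source text line by line; the class name, method name, and the exact pattern are hypothetical and not taken from the design.

```csharp
using System.Text.RegularExpressions;

public static class AuditLogMutationGuard
{
    // Matches UPDATE or DELETE followed, within the same statement line,
    // by the AuditLog table name as a whole word.
    private static readonly Regex Forbidden = new Regex(
        @"\b(UPDATE|DELETE)\b[^;\r\n]*\bAuditLog\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // True when the line contains a statement the guard should reject.
    public static bool IsViolation(string sourceLine) => Forbidden.IsMatch(sourceLine);
}
```

The whole-word match keeps identifiers such as `AuditLogPurgeActor` from tripping the guard, but a comment that mentions both a forbidden verb and the table name on one line would still be flagged, so a real guard would probably need an allow-list or comment stripping.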
---
## Migration Management
### Entity Framework Core Migrations
@@ -233,6 +247,7 @@ Results are returned in reverse chronological order (most recent first) with pag
- Schema changes are managed via EF Core Migrations (`dotnet ef migrations add`, `dotnet ef migrations script`).
- Each migration is a versioned, incremental schema change.
- New tables are introduced as their own migration — for example, the `Notifications` table for the Notification Outbox ships as a dedicated EF Core migration that creates the table, its `Type`/`Status` value conversions, and its dispatcher and KPI indexes.
- The initial `AuditLog` migration creates the monthly partition function `pf_AuditLog_Month` and partition scheme `ps_AuditLog_Month`, then creates the `AuditLog` table aligned to that scheme on `OccurredAtUtc`, along with the indexes listed under Database Schema. The migration also creates the `scadalink_audit_writer` and `scadalink_audit_purger` DB roles with the grants described in Database Roles. The ongoing **partition-maintenance job** that rolls the scheme forward each month (creating the next month's partition ahead of time) and switches out expired partitions is owned by the **Audit Log component** (`AuditLogPurgeActor` and its monthly roll-forward step), not by the Configuration Database component — this component is responsible only for the initial schema, roles, and any EF migrations against the table going forward.
### Development Environment
- Migrations are **auto-applied** at application startup using `dbContext.Database.MigrateAsync()`.
@@ -282,6 +297,10 @@ The `Notifications` table grows one row per notification and is never trimmed by
The `SiteCalls` table grows one row per cached site call and is never trimmed by normal operation. To bound table growth while preserving a strong audit trail, a **daily purge job** deletes terminal rows (`Delivered`, `Failed`, `Discarded`) older than a configurable retention window (default 365 days). Non-terminal rows (`Pending`, `Retrying`, `Parked`) are never purged. The purge is a bulk `DELETE`; it is owned and scheduled by the Site Call Audit component (see Component-SiteCallAudit.md), which supplies the retention window. The Configuration Database component provides only the repository operation and the table.
### AuditLog Table Purge
The `AuditLog` table is append-only and grows by one row per script-trust-boundary event across all channels. Unlike `Notifications` and `SiteCalls`, purge is **never a row-level `DELETE`** — it is a **monthly partition switch** against the `ps_AuditLog_Month` scheme. A daily job switches out any partition whose latest `OccurredAtUtc` is older than the global retention window (default 365 days, configurable via the `AuditLog:RetentionDays` Audit Log option — single global value in v1, no per-channel overrides) and drops the resulting staging table. The job is owned and scheduled by the Audit Log component (`AuditLogPurgeActor` — see Component-AuditLog.md), which is also the consumer of the `AuditLog:RetentionDays` option. The Configuration Database component contributes only the table, the partition function/scheme, the indexes, and the DB roles that constrain the purge to a partition switch.
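The retention rule can be read as a simple date comparison per monthly partition. The sketch below is a hedged illustration: it treats a partition as expired only when its exclusive upper bound (the start of the following month) is at or before the retention cutoff, which is a conservative stand-in for "latest `OccurredAtUtc` older than the window". The class and method names are hypothetical; the 365-day default mirrors the `AuditLog:RetentionDays` default quoted above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AuditLogRetention
{
    // Returns the month-start keys of partitions that lie entirely
    // outside the retention window and can therefore be switched out.
    public static IReadOnlyList<DateTime> ExpiredPartitions(
        IEnumerable<DateTime> partitionMonthStartsUtc,
        DateTime nowUtc,
        int retentionDays = 365)
    {
        var cutoff = nowUtc.AddDays(-retentionDays);
        return partitionMonthStartsUtc
            .Where(start => start.AddMonths(1) <= cutoff) // whole month is older than the cutoff
            .OrderBy(start => start)
            .ToList();
    }
}
```

With `nowUtc` set to 2026-05-20 and the default window, the cutoff is 2025-05-20: the April 2025 partition is eligible, while the May 2025 partition is kept because it may still hold rows newer than the cutoff.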
---
## Connection Management
@@ -310,6 +329,6 @@ The `SiteCalls` table grows one row per cached site call and is never trimmed by
- **Notification Service**: Uses `INotificationRepository` for notification lists and SMTP configuration.
- **Notification Outbox**: Uses `INotificationOutboxRepository` for all access to the `Notifications` table — ingest, dispatch polling, status updates, KPI queries, and the daily purge of terminal rows.
- **Site Call Audit**: Uses `ISiteCallAuditRepository` for all access to the `SiteCalls` table — telemetry/reconciliation ingest, KPI queries, and the daily purge of terminal rows.
- **Central UI**: Uses `ICentralUiRepository` for read-oriented queries across domain areas, including audit log queries for the audit log viewer. - **Central UI**: Uses `ICentralUiRepository` for read-oriented queries across domain areas, including config-audit queries for the Configuration Audit Log Viewer (the operational Audit Log page is owned by #23).
- **All central components that modify state**: Call `IAuditService.LogAsync()` after successful operations to record audit entries within the same transaction.
- **Host**: Provides database connection configuration. Registers DbContext, repository implementations, and `IAuditService` implementation in the DI container. Triggers auto-migration in development or validates schema version in production.


@@ -57,6 +57,7 @@ Each database connection definition includes:
- Script calls `Database.Connection("name")` and receives a raw ADO.NET `SqlConnection`.
- Full control: queries, updates, transactions, stored procedures.
- Failures are immediate — no buffering.
- **Audit emission**: script-initiated `Execute`/`ExecuteScalar` calls emit `DbOutbound.SyncWrite` rows; `ExecuteReader` emits `DbOutbound.SyncRead`. SQL parameter values are captured by default; redaction is opt-in per connection via the Audit Log configuration (see [Component-AuditLog.md](Component-AuditLog.md), Payload Capture Policy). Audit-write failure never aborts the script.
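The mapping from the executed command method to the audit kind in the last bullet can be sketched as a small switch. The `DbCommandMethod` enum and the mapper class are hypothetical names introduced for illustration, and the `AuditKind` enum is trimmed to the two members this path uses.

```csharp
using System;

// Hypothetical discriminator for the three script-visible command methods.
public enum DbCommandMethod { Execute, ExecuteScalar, ExecuteReader }

// Trimmed to the two kinds emitted on the synchronous DB path.
public enum AuditKind { SyncWrite, SyncRead }

public static class DbAuditKindMapper
{
    // ExecuteReader is audited as a read; Execute and ExecuteScalar as writes.
    public static AuditKind ForMethod(DbCommandMethod method) => method switch
    {
        DbCommandMethod.ExecuteReader => AuditKind.SyncRead,
        DbCommandMethod.Execute or DbCommandMethod.ExecuteScalar => AuditKind.SyncWrite,
        _ => throw new ArgumentOutOfRangeException(nameof(method))
    };
}
```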
### Cached Write (Store-and-Forward)
- Script calls `Database.CachedWrite("name", "sql", parameters)`. This is **deferred delivery**: the call returns a `TrackedOperationId` tracking handle immediately rather than the write result.
@@ -64,6 +65,7 @@ Each database connection definition includes:
- The write is attempted immediately. On immediate success it is recorded as a terminal `Delivered` tracking record. On **transient failure** (database unavailable) it is buffered (`Pending`/`Retrying`) and retried per the connection's retry settings by the Store-and-Forward Engine.
- On **permanent failure** (e.g. a SQL syntax or constraint error — a request that will never succeed), the error is returned **synchronously** to the calling script and the write is **not** buffered. The call is also recorded as a terminal `Failed` tracking record capturing the error.
- Cached-write status is observable to scripts via `Tracking.Status(id)` (answered site-locally and authoritatively) and centrally via the Site Call Audit component.
- **Audit emission**: each lifecycle transition (`CachedEnqueued`, `CachedAttempt`, `CachedTerminal`) emits an audit row via the combined cached-operation telemetry packet — one packet carries both the audit row and the SiteCalls upsert (see [Component-AuditLog.md](Component-AuditLog.md), Cached Operations — Combined Telemetry, and [Component-SiteCallAudit.md](Component-SiteCallAudit.md)). Audit-write failure never aborts the script.
## Invocation Protocol
@@ -83,6 +85,7 @@ Scripts choose between two call modes per invocation, mirroring the dual-mode da
- The HTTP request is executed immediately. The script blocks until the response is received or the timeout elapses.
- **All failures** (transient and permanent) return an error to the calling script. No store-and-forward buffering.
- Use for request/response interactions where the script needs the result (e.g., fetching a recipe, querying inventory).
- **Audit emission**: emits an `ApiOutbound.SyncCall` row to `IAuditWriter` at call completion (success or failure). Payload captured per the Audit Log policy (see [Component-AuditLog.md](Component-AuditLog.md), Payload Capture Policy). Audit-write failure never aborts the script.
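The rule that an audit-write failure never aborts the script can be sketched as a wrapper that emits the audit row in a `finally` block and swallows any exception from the writer. This is an illustrative shape only: the trimmed `AuditEvent`, the `AuditedCall` helper, and the `ThrowingAuditWriter` test double are hypothetical, and recording every failure as `PermanentFailure` is a placeholder because the sketch does not classify transient versus permanent errors.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Trimmed stand-in for the full AuditEvent record (assumption for this sketch).
public sealed record AuditEvent(Guid EventId, string Channel, string Kind, string Status);

public interface IAuditWriter
{
    Task WriteAsync(AuditEvent evt, CancellationToken ct);
}

// Test double that simulates a broken audit store.
public sealed class ThrowingAuditWriter : IAuditWriter
{
    public Task WriteAsync(AuditEvent evt, CancellationToken ct) =>
        Task.FromException(new InvalidOperationException("audit store unavailable"));
}

public static class AuditedCall
{
    // Runs the call, then emits one SyncCall audit row whether it succeeded or failed.
    public static async Task<T> RunAsync<T>(
        Func<CancellationToken, Task<T>> call, IAuditWriter audit, CancellationToken ct)
    {
        var status = "Success";
        try
        {
            return await call(ct);
        }
        catch
        {
            status = "PermanentFailure"; // placeholder: no transient/permanent classification here
            throw;                       // the call's own failure still reaches the script
        }
        finally
        {
            try
            {
                await audit.WriteAsync(
                    new AuditEvent(Guid.NewGuid(), "ApiOutbound", "SyncCall", status), ct);
            }
            catch (Exception)
            {
                // Audit-write failure is swallowed so it never aborts the user-facing action.
            }
        }
    }
}
```

Calling `AuditedCall.RunAsync(_ => Task.FromResult(42), new ThrowingAuditWriter(), CancellationToken.None)` still returns 42 even though the writer throws; a production version would presumably also log or count the swallowed failure.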
### Cached (Store-and-Forward)
- Script calls `ExternalSystem.CachedCall("systemName", "methodName", params)`. This is **deferred delivery**: the call returns a `TrackedOperationId` tracking handle immediately rather than the response body.
@@ -90,6 +93,7 @@ Scripts choose between two call modes per invocation, mirroring the dual-mode da
- On **transient failure** (connection refused, timeout, HTTP 5xx), the call is routed to the Store-and-Forward Engine for retry per the system's retry settings. The script does **not** block — the call is buffered (`Pending`/`Retrying`) and the script continues.
- On **permanent failure** (HTTP 4xx), the error is returned **synchronously** to the calling script. No retry — the request itself is wrong. The call is also recorded as a terminal `Failed` tracking record capturing the error.
- Cached-call status is observable to scripts via `Tracking.Status(id)` (answered site-locally and authoritatively) and centrally via the Site Call Audit component.
- **Audit emission**: each lifecycle transition (`CachedEnqueued`, `CachedAttempt`, `CachedTerminal`) emits an audit row via the combined cached-operation telemetry packet — one packet carries both the audit row and the SiteCalls upsert (see [Component-AuditLog.md](Component-AuditLog.md), Cached Operations — Combined Telemetry, and [Component-SiteCallAudit.md](Component-SiteCallAudit.md)). Audit-write failure never aborts the script.
- Use for outbound data pushes where deferred delivery is acceptable (e.g., posting production data, sending quality reports).
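
The transient/permanent split described in the list above can be sketched as a small classifier. This is an illustrative Python sketch, not the project's .NET implementation; the `Outcome` enum and `classify` function are hypothetical names, and treating every status outside 4xx/5xx as success is an assumption the text does not state.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "success"
    TRANSIENT = "transient"   # buffered for Store-and-Forward retry
    PERMANENT = "permanent"   # returned synchronously, recorded as Failed


def classify(status_code: Optional[int], network_error: bool = False) -> Outcome:
    """Classify a call result; status_code is None when no HTTP response arrived."""
    if network_error or status_code is None:
        return Outcome.TRANSIENT      # connection refused or timeout
    if 500 <= status_code <= 599:
        return Outcome.TRANSIENT      # server-side fault, worth retrying
    if 400 <= status_code <= 499:
        return Outcome.PERMANENT      # the request itself is wrong
    return Outcome.SUCCESS            # assumption: everything else counts as success
```

A caller would route `TRANSIENT` results to the retry buffer and surface `PERMANENT` results to the script immediately.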
## Call Timeout & Error Handling


@@ -34,6 +34,11 @@ Site clusters (metric collection and reporting). Central cluster (aggregation an
| Notification Outbox queue depth | Notification Outbox (central) | Count of `Pending` + `Retrying` notifications (central-computed, not site-reported) |
| Notification Outbox stuck count | Notification Outbox (central) | Count of `Pending` / `Retrying` notifications older than the configurable stuck-age threshold (central-computed, not site-reported) |
| Notification Outbox parked count | Notification Outbox (central) | Count of `Parked` notifications (central-computed, not site-reported) |
| `SiteAuditBacklog` | Audit Log (site) | Count of `Pending` rows in the site-local `AuditLog` plus oldest-pending-age plus on-disk bytes. A configurable threshold drives a Health dashboard warning on the affected site tile. |
| `SiteAuditWriteFailures` | Audit Log (site) | Count of failed hot-path audit appends at the site since the last health report. |
| `SiteAuditTelemetryStalled` | Audit Log (site) | Boolean flag set when reconciliation reports a non-draining site-local audit backlog over two consecutive cycles. |
| `CentralAuditWriteFailures` | Audit Log (central) | Count of central direct-write audit failures (Inbound API middleware, Notification Outbox dispatcher, and any other central direct writers) since the last interval. |
| `AuditRedactionFailure` | Audit Log (central) | Count of payload redactor errors (over-redacted payloads, safety-net hit) since the last interval. |
## Reporting Protocol
@@ -76,6 +81,16 @@ The Site Call Audit is a **central** component, so its KPIs — like the Notific
Unlike the Notification Outbox, the Site Call Audit is **not a dispatcher**: cached calls are delivered by each site's Store-and-Forward Engine, and the `SiteCalls` table is an eventually-consistent central mirror of site-owned status.
## Audit Log KPIs
The Audit Log spans both sites (hot-path append + telemetry forward) and central (direct-write + ingest + redaction). Its operational health surfaces as three new dashboard tiles grouped under **Audit**:
- **Audit volume** — events/min landing in the central `AuditLog` table, shown global plus per-site sparkline; sourced from the Audit Log component on the active central node.
- **Audit error rate** — percent of central `AuditLog` rows with `Status` other than `Success` / `Delivered` / `Enqueued` over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, transient failures, parked deliveries, etc.) — NOT the audit writer's own health. Audit-writer issues surface separately via `CentralAuditWriteFailures` and `AuditRedactionFailure`.
- **Audit backlog** — global aggregate of `SiteAuditBacklog` across reporting sites (count of `Pending` site-local audit rows, oldest pending age, on-disk bytes); click drills into a per-site breakdown. The per-site tile surfaces a warning badge when its `SiteAuditBacklog` crosses the configurable threshold or when `SiteAuditTelemetryStalled` is set.
These tiles are **point-in-time** like the Notification Outbox and Site Call Audit KPI tiles — no time-series store; consistent with Health Monitoring's "current status only" philosophy. The site-scoped `SiteAuditBacklog` / `SiteAuditWriteFailures` / `SiteAuditTelemetryStalled` metrics arrive in the existing site health report; the central-scoped `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics are central-computed alongside the existing central KPIs.
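
As a rough illustration of the **Audit error rate** tile described above, the following Python sketch computes the percentage over a rolling 5-minute window. It assumes rows are available as `(timestamp, status)` pairs; the function name and row shape are hypothetical, and returning `0.0` for an empty window is an assumption.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

# Statuses the text treats as non-errors for this tile.
NON_ERROR_STATUSES = {"Success", "Delivered", "Enqueued"}


def audit_error_rate(
    rows: Iterable[Tuple[datetime, str]],
    now: datetime,
    window: timedelta = timedelta(minutes=5),
) -> float:
    """Percent of rows inside the window whose status is not a non-error status."""
    cutoff = now - window
    recent = [status for ts, status in rows if cutoff < ts <= now]
    if not recent:
        return 0.0  # assumption: an empty window reports zero rather than "no data"
    errors = sum(1 for status in recent if status not in NON_ERROR_STATUSES)
    return 100.0 * errors / len(recent)
```

Because the tile is point-in-time, a sketch like this would be re-evaluated on each dashboard refresh rather than stored as a time series.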
## Central Storage
- Health metrics are held **in memory** at the central cluster for display in the UI.
@@ -97,6 +112,7 @@ Unlike the Notification Outbox, the Site Call Audit is **not a dispatcher** —
- **Cluster Infrastructure (site)**: Provides node role status.
- **Notification Outbox (central)**: Provides central-computed outbox KPIs (queue depth, stuck count, parked count) for the headline dashboard tiles.
- **Site Call Audit (central)**: Provides central-computed cached-call KPIs (buffered count, parked count, failed/delivered over the last interval, oldest pending age, stuck count) for the headline dashboard tiles.
- **Audit Log (#23)**: Provides the site-reported `SiteAuditBacklog` / `SiteAuditWriteFailures` / `SiteAuditTelemetryStalled` metrics (via the site health report) and the central-computed `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics, plus the central audit-row rate feeding the **Audit** dashboard tile group (Audit volume, Audit error rate, Audit backlog).
## Interactions


@@ -116,8 +116,9 @@ API method scripts are compiled at central startup — all method definitions ar
## API Call Logging
Removed:
- **Only failures are logged.** Script execution errors (500 responses) are logged centrally.
- Successful API calls are **not** logged; the audit log is reserved for configuration changes, not operational traffic.
Added:
- **Every request, success or failure, emits one `ApiInbound.Completed` row** to `ICentralAuditWriter` from request middleware before the HTTP response is flushed. The row captures the API key **name** (never the key material), remote IP, user-agent, response status, duration, and truncated request/response bodies per the Audit Log capture policy (see Component-AuditLog.md, Payload Capture Policy). This supersedes the earlier failures-only stance: operational API traffic is now part of the centralized audit log, so configuration changes and call activity share a single retention/query surface.
- Script execution errors (500 responses) remain captured on the same `ApiInbound.Completed` row (response status + error fields) rather than emitting a separate failure-only event.
- **Fail-soft semantics.** The audit write is synchronous (inline before the response is flushed), but failures are caught: a write that throws is logged and increments `CentralAuditWriteFailures` (see Health Monitoring #11) and the request still returns its normal HTTP response. A failed audit append never turns a successful API call into an error returned to the caller.
- No rate limiting: this is a private API in a controlled industrial environment with a known set of callers. Misbehaving callers are handled operationally (disable the API key).
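
The fail-soft behaviour described above (write inline, catch failures, bump a counter, return the normal response) might look roughly like this. It is a Python sketch for illustration only; `FailSoftAuditWriter`, `try_write`, and `handle_request` are hypothetical names, and the `write_failures` attribute merely stands in for the `CentralAuditWriteFailures` metric.

```python
import logging
from typing import Callable, Dict, Tuple

logger = logging.getLogger("audit")


class FailSoftAuditWriter:
    """Wraps an audit sink so that a failing write never propagates to the caller."""

    def __init__(self, write: Callable[[Dict], None]) -> None:
        self._write = write
        self.write_failures = 0  # stand-in for the CentralAuditWriteFailures metric

    def try_write(self, row: Dict) -> bool:
        try:
            self._write(row)
            return True
        except Exception:
            self.write_failures += 1
            logger.exception("audit write failed; request continues")
            return False


def handle_request(
    writer: FailSoftAuditWriter, handler: Callable[[], Tuple[int, str]]
) -> Tuple[int, str]:
    """Run the handler, then audit inline before returning the response."""
    status, body = handler()
    writer.try_write({"Kind": "ApiInbound.Completed", "Status": status})
    return status, body  # unchanged even if the audit write failed
```

The key design point is that the audit call sits between the handler and the return, so it is synchronous, yet its exception path is fully contained.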
## Request Flow
@@ -197,7 +198,8 @@ Inbound API scripts **cannot** call shared scripts directly — shared scripts a
- **Configuration Database (MS SQL)**: Stores API keys and method definitions.
- **Communication Layer**: Routes requests to sites when method implementations need site data.
- **Security & Auth**: API key validation (separate from LDAP/AD; the API uses key-based auth).
Removed:
- **Configuration Database (via IAuditService)**: All API key and method definition changes are audit logged. Optionally, API call activity can be logged.
Added:
- **Configuration Database (via IAuditService)**: All API key and method definition changes are audit logged.
- **Audit Log (#23)**: Every inbound API request emits an `ApiInbound.Completed` row via `ICentralAuditWriter` from request middleware. The write is synchronous but fail-soft: an audit-write failure never changes the HTTP response. Payload truncation/redaction follows the Audit Log Payload Capture Policy.
- **Cluster Infrastructure**: API is hosted on the active central node and fails over with it.
## Interactions


@@ -106,6 +106,12 @@ The dispatcher loop runs on a fixed interval. On each tick the `NotificationOutb
- **transient failure** → `Retrying`, increment `RetryCount`, set `NextAttemptAt`, record `LastError`; once retries are exhausted → `Parked`.
- **permanent failure** → `Parked`, record `LastError`.
Each delivery attempt also writes a `Notification.Attempt` row to the central `AuditLog` via `ICentralAuditWriter`; a transition to a terminal status (`Delivered` / `Parked` / `Discarded`) writes a `Notification.Terminal` row. Audit writes are **direct** (no telemetry — the dispatcher runs at central), insert-if-not-exists on `EventId`. The site-emitted `Notification.Enqueued` row arrives separately via the standard audit telemetry channel from the site's SQLite write-buffer, so the full per-notification audit trail is `Enqueued` (site-originated) → `Attempt` × N (central direct-write) → `Terminal` (central direct-write). See [Component-AuditLog.md](Component-AuditLog.md), Central direct-write (central-originated events).
The operational `Notifications` table remains the **source of truth** for the dispatcher and for Retry/Discard actions; the `AuditLog` rows are immutable shadows. Operator Retry/Discard still mutates only the `Notifications` row, and each transition emits the corresponding `Notification.Attempt` / `Notification.Terminal` audit row.
**Audit-write failure never affects delivery.** If the `ICentralAuditWriter` direct-write fails (transient DB error, schema lock, etc.) the dispatcher logs the failure and increments the `CentralAuditWriteFailures` health metric (see Health Monitoring #11), but the delivery attempt's outcome on the `Notifications` row stands. The audit row is recovered by re-emission on the next dispatcher tick or by the on-startup reconciliation sweep; central never aborts a notification because audit failed.
## Delivery Adapters
A delivery adapter implementing `INotificationDeliveryAdapter` is registered per `Type`. Each `Deliver(...)` call returns one of `success | transient failure | permanent failure`, mirroring the External System Gateway error-classification pattern.
@@ -157,6 +163,7 @@ Delivery max-retry-count and retry interval are not part of `NotificationOutboxO
- **Notification Service**: Provides notification-list and SMTP definitions, and the per-type delivery adapters the outbox invokes.
- **Configuration Database**: Hosts the `Notifications` table; provides the entity POCO, repository, and EF migration for outbox persistence.
- **Central-Site Communication**: Carries inbound notification submissions and acks between sites and central.
- **Audit Log (#23)**: The dispatcher direct-writes `Notification.Attempt` and `Notification.Terminal` rows to the central `AuditLog` via `ICentralAuditWriter` (insert-if-not-exists on `EventId`); the site-emitted `Notification.Enqueued` row arrives via the standard audit telemetry channel. See [Component-AuditLog.md](Component-AuditLog.md), Central direct-write (central-originated events).
- **Health Monitoring**: Consumes the outbox KPIs as central-computed headline metrics.
- **Central UI**: Hosts the Notification Outbox page.


@@ -73,6 +73,14 @@ then **upsert-on-newer-status**. The lifecycle is monotonic, so status only
advances and never regresses; at-least-once and out-of-order telemetry are
therefore harmless.
From v1.x onward, the `CachedCallTelemetry` message additively carries the
`AuditEvent` content alongside the existing operational fields. Central's
`AuditLogIngestActor` (Audit Log #23) performs both the immutable `AuditLog`
insert and the `SiteCalls` upsert in a single transaction. Idempotency keys
remain `EventId` (for `AuditLog`) and `TrackedOperationId` (for `SiteCalls`).
See [Component-AuditLog.md](Component-AuditLog.md), Cached Operations —
Combined Telemetry, for the dual-write contract.
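
The dual-write described in the paragraph above (insert-if-not-exists on `EventId`, upsert-on-newer-status on `TrackedOperationId`, one transaction) can be sketched with SQLite. This is an illustrative Python sketch, not the central MS SQL implementation; the table columns, the packet dictionary keys, and the `STATUS_RANK` ordering are all assumptions made for the example.

```python
import sqlite3

# Hypothetical monotonic ranks; the real status enum lives in Commons.
STATUS_RANK = {"Pending": 0, "Retrying": 1, "Delivered": 2, "Failed": 2, "Parked": 2}


def create_schema(conn: sqlite3.Connection) -> None:
    """Minimal stand-in tables keyed on the two idempotency keys."""
    conn.execute(
        "CREATE TABLE AuditLog (EventId TEXT PRIMARY KEY, TrackedOperationId TEXT, Kind TEXT)"
    )
    conn.execute(
        "CREATE TABLE SiteCalls (TrackedOperationId TEXT PRIMARY KEY, Status TEXT, StatusRank INTEGER)"
    )


def ingest(conn: sqlite3.Connection, packet: dict) -> None:
    """Apply one combined telemetry packet atomically."""
    status = packet["status"]
    with conn:  # commits on success, rolls back if either statement raises
        # Immutable audit row: a replayed EventId is silently ignored.
        conn.execute(
            "INSERT OR IGNORE INTO AuditLog (EventId, TrackedOperationId, Kind) VALUES (?, ?, ?)",
            (packet["event_id"], packet["tracked_operation_id"], packet["kind"]),
        )
        # Operational mirror: only a strictly newer status overwrites the row.
        conn.execute(
            "INSERT INTO SiteCalls (TrackedOperationId, Status, StatusRank) VALUES (?, ?, ?) "
            "ON CONFLICT(TrackedOperationId) DO UPDATE SET "
            "Status = excluded.Status, StatusRank = excluded.StatusRank "
            "WHERE excluded.StatusRank > SiteCalls.StatusRank",
            (packet["tracked_operation_id"], status, STATUS_RANK[status]),
        )
```

Replaying an old packet after a newer one leaves both tables unchanged, which is why at-least-once and out-of-order delivery are tolerable.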
## Reconciliation
Because telemetry is best-effort, `SiteCallAuditActor` periodically, and on site
@@ -119,6 +127,12 @@ configurable window (default 365 days), matching the `Notifications` purge.
responses; sends Retry/Discard commands.
- **Store-and-Forward Engine**: the site-side origin of cached-call telemetry and
the executor of relayed Retry/Discard commands.
- **Audit Log (#23)**: shares the `CachedCallTelemetry` packet — each lifecycle
transition (`CachedEnqueued`, `CachedAttempt`, `CachedTerminal`) carries an
`AuditEvent` alongside the operational fields, and central's
`AuditLogIngestActor` performs the `AuditLog` insert and the `SiteCalls`
upsert in a single transaction (see [Component-AuditLog.md](Component-AuditLog.md),
Cached Operations — Combined Telemetry).
- **Commons**: `TrackedOperationId`, status enum, telemetry message contracts.
## Interactions


@@ -294,6 +294,10 @@ Scripts execute **in-process** with constrained access. The following restrictio
These constraints are enforced by restricting the set of assemblies and namespaces available to the script compilation context.
### Script Trust Boundary Auditing
Every script-trust-boundary call (External System Gateway, Database layer, Notify) emits an `AuditEvent` to `IAuditWriter` (site-local SQLite append). The append runs on the hot path but never fails the calling action; write failures are logged and counted by the `SiteAuditWriteFailures` health metric (see [Component-HealthMonitoring.md](Component-HealthMonitoring.md)). The central audit mirror and event schema live in [Component-AuditLog.md](Component-AuditLog.md).
## Script Scoping Rules
- Scripts can only read/write attributes on **their own instance** (via the parent Instance Actor).
@@ -363,7 +367,7 @@ Per Akka.NET best practices, internal actor communication uses **Tell** (fire-an
- **Communication Layer**: Receives deployments and lifecycle commands from central. Handles debug view requests. Reports deployment results.
- **Site Event Logging**: Records script executions, alarm events, deployment events, instance lifecycle events.
- **Health Monitoring**: Reports script error rates and alarm evaluation error rates.
Removed:
- **Local SQLite**: Persists deployed configurations, system-wide artifacts (external system definitions, database connection definitions, data connection definitions).
Added:
- **Local SQLite**: Persists deployed configurations, system-wide artifacts (external system definitions, database connection definitions, data connection definitions). Sites also maintain peer SQLite stores for the Store-and-Forward buffer, the site event log, the operation tracking table, and the site-local `AuditLog` (see [Component-AuditLog.md](Component-AuditLog.md)). The `AuditLog` file is purged on the same daily cadence as the others but respects the hard `ForwardState` invariant: rows still `Pending` forward are never purged, regardless of age.
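
The `ForwardState` purge invariant mentioned above can be sketched as a single guarded delete. This Python/SQLite sketch is illustrative only; the `OccurredAtUtc` column, the ISO-8601 text timestamps, and the `purge_site_audit` name are assumptions, not the documented site schema.

```python
import sqlite3

# Hypothetical site-local table; only ForwardState and its values come from the text.
SCHEMA = "CREATE TABLE AuditLog (EventId TEXT PRIMARY KEY, OccurredAtUtc TEXT, ForwardState TEXT)"

# Rows may be removed only once central has them, by either path.
PURGEABLE_STATES = ("Forwarded", "Reconciled")


def purge_site_audit(conn: sqlite3.Connection, cutoff_utc: str) -> int:
    """Delete rows older than cutoff_utc, but never rows still Pending forward."""
    with conn:  # single transaction for the purge
        cur = conn.execute(
            "DELETE FROM AuditLog WHERE OccurredAtUtc < ? AND ForwardState IN (?, ?)",
            (cutoff_utc, *PURGEABLE_STATES),
        )
    return cur.rowcount  # number of rows actually purged
```

Because the state check is part of the `WHERE` clause, an arbitrarily old `Pending` row survives every purge run until it is forwarded or reconciled.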
## Interactions ## Interactions


@@ -440,6 +440,25 @@ All system-modifying actions are logged, including:
### 10.4 Transactional Guarantee
- Audit entries are written **synchronously** within the same database transaction as the change (via the unit-of-work pattern). If the change succeeds, the audit entry is guaranteed to be recorded. If the change rolls back, the audit entry rolls back too.
### 10.5 Centralized Audit Log (Script Trust Boundary)
*See [Component-AuditLog.md](Component-AuditLog.md) (#23) for the full component design.*
Sections 10.1-10.4 cover **configuration-database audit** (config-mutating user actions via `IAuditService`). This subsection defines the separate **runtime Audit Log** that captures every action crossing the **script trust boundary** at sites and central:
- **AL-1**: The system maintains an **append-only** central Audit Log recording every script-trust-boundary action — outbound external system calls (sync `Call` and `CachedCall`), outbound database operations (sync `Connection` access and `CachedWrite`), notifications, and inbound API method invocations.
- **AL-2**: For cached calls and notifications, the Audit Log captures **one row per lifecycle event** (e.g., enqueued, retrying, delivered, parked, discarded), not a single mutable row per operation.
- **AL-3**: Site-originated events are appended to a **site-local SQLite hot-path** synchronously with the action, then **forwarded to central via gRPC telemetry**; central ingest is **idempotent on `EventId`** (insert-if-not-exists; the `AuditLog` table is strictly append-only, so rows are never updated after insert).
- **AL-4**: A periodic **central→site reconciliation pull** detects and replays any telemetry events that were missed (e.g., during a central outage), making the central Audit Log eventually consistent with sites.
- **AL-5**: Each row captures **payload metadata** (target, method, status, timings, correlation IDs) plus a **truncated request/response body**: **8 KB default**, expanded to **64 KB on error** outcomes.
- **AL-6**: **HTTP headers are redacted by default**; **SQL parameter values are captured by default**. Per-target **redaction opt-in** is configurable on external systems, database connections, and inbound API methods.
- **AL-7**: A failure to write or forward an audit row **never aborts the user-facing action** — the hot-path action proceeds and the audit record is recovered via the local hot-path buffer plus reconciliation.
- **AL-8**: Central retention defaults to **365 days**, enforced by a **monthly partition switch-and-drop** purge — no row-by-row delete.
- **AL-9**: The site SQLite Audit Log is purged only when `ForwardState ∈ {Forwarded, Reconciled}` — i.e., a row must be either confirmed-forwarded *or* confirmed-reconciled before it can be removed. A central outage therefore **cannot cause audit loss at sites**.
- **AL-10**: The Central UI exposes an **Audit Log page** with a cross-channel filter (by site, target, status, time range, correlation ID), plus **drill-ins from existing operational pages** (Site Calls, Notification Outbox, Inbound API).
- **AL-11**: Append-only semantics are **enforced via DB roles** (no UPDATE/DELETE granted on the `AuditLog` table to application accounts); a **tamper-evidence hash chain is deferred to v1.x**.
- **AL-12**: The CLI provides a `scadalink audit` command group for query, export, and hash-chain verification (verify-chain becomes operational once AL-11's hash chain ships) against the central Audit Log.
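
The AL-5 body-capture limits in the list above can be illustrated with a short helper. This is a Python sketch under assumptions: the limits are applied to raw bytes (the requirement does not say whether KB means bytes or characters), and `truncate_body` with its returned truncation flag is a hypothetical shape.

```python
from typing import Tuple

DEFAULT_LIMIT = 8 * 1024    # 8 KB for normal outcomes
ERROR_LIMIT = 64 * 1024     # 64 KB when the audited action ended in error


def truncate_body(body: bytes, is_error: bool) -> Tuple[bytes, bool]:
    """Return the captured prefix of body and whether anything was cut off."""
    limit = ERROR_LIMIT if is_error else DEFAULT_LIMIT
    return body[:limit], len(body) > limit
```

Returning the flag alongside the prefix lets an audit row record that the stored body is incomplete without keeping the full payload.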
## 11. Health Monitoring
### 11.1 Monitored Metrics