From 6ffa47f2587cb96b4f4a1e73e9c3e6e9d21e87b0 Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Thu, 21 May 2026 14:34:12 -0400 Subject: [PATCH] docs(design): Audit Log ExecutionId universal correlation --- .../2026-05-21-audit-executionid-design.md | 115 ++++++++++++++++++ 1 file changed, 115 insertions(+) create mode 100644 docs/plans/2026-05-21-audit-executionid-design.md diff --git a/docs/plans/2026-05-21-audit-executionid-design.md b/docs/plans/2026-05-21-audit-executionid-design.md new file mode 100644 index 0000000..c2ef66f --- /dev/null +++ b/docs/plans/2026-05-21-audit-executionid-design.md @@ -0,0 +1,115 @@ +# Audit Log — ExecutionId Universal Correlation (Design) + +**Date:** 2026-05-21 +**Status:** Validated — ready for implementation planning. + +## Problem + +The audit `CorrelationId` column is overloaded with three incompatible meanings — +`TrackedOperationId` for cached calls, `NotificationId` for notifications, the +script-execution id for sync calls (added 2026-05-21), and request-local ids for +inbound. It is `NULL` for sync one-shot calls. There is no single value that ties +together *everything one script run (or inbound request) did*: a run that makes a +sync API call, a cached call and a notification produces three unrelated +correlation ids, and nothing links the cached call's lifecycle rows back to the +run that launched them. + +A single `CorrelationId` column cannot serve both scopes — the **operation +lifecycle** (a cached call's `Submit→Attempted→Resolve`; a notification's +`Send→Deliver`, which the Site Calls / Notifications "View audit history" +drill-ins depend on) and the **execution trace** (all operations of one run). + +## Decision + +Add a dedicated, nullable **`ExecutionId`** column to the audit row. It identifies +the originating **script execution** or **inbound API request**. Every audit row +that execution produces carries the same `ExecutionId`. `CorrelationId` is left +exactly as it is — it keeps the per-operation lifecycle meaning, so the existing +operation drill-ins are unaffected. + +Result: `WHERE ExecutionId = X` returns every audit row of one run — sync +`ApiCall`/`DbWrite`, the whole cached-call lifecycle, `NotifySend`, +`NotifyDeliver`, and the inbound row — across both the site and central tables. + +`ScriptRuntimeContext` already holds a per-execution id (`_auditCorrelationId`, +added 2026-05-21). That id becomes the `ExecutionId`; this work stamps it into the +new column from every emitter and threads it to the two paths where the script +context is not in scope. + +### Considered and rejected + +- **Overload `CorrelationId`** with the execution id everywhere — breaks the + cached-call / notification "View audit history" drill-ins (they filter + `CorrelationId` by `TrackedOperationId` / `NotificationId`), or forces them to + show the whole run instead of the one operation. +- **Stash the execution id in `Extra` JSON** — no schema change, but `Extra` is + unindexed; filtering an audit table of this volume by it is unworkable. + +## Schema changes (all additive, nullable — no backfill; pre-existing rows stay `NULL`) + +| Where | Change | +|---|---| +| `ScadaLink.Commons` | `AuditEvent` record (and the site-local variant) gains `Guid? ExecutionId`. | +| Central MS SQL `AuditLog` | new `ExecutionId uniqueidentifier NULL` column + index `IX_AuditLog_Execution (ExecutionId)`. EF migration — additive nullable column is a metadata-only `ALTER`, fast even on the monthly-partitioned table. | +| Site SQLite `auditlog.db` `AuditLog` | new `ExecutionId TEXT NULL` column (`SqliteAuditWriter` schema + `MapRow`). | +| gRPC `AuditEventDto` (`sitestream.proto`) | additive `execution_id` field; `AuditEventDtoMapper` maps it both directions. | +| Central MS SQL `Notifications` | new `OriginExecutionId uniqueidentifier NULL` column — carries the originating run's id so the dispatcher can echo it onto `NotifyDeliver` audit rows. EF migration. | + +`SiteCalls` needs no new column — the cached telemetry packet already carries the +audit half, which now has `ExecutionId` directly. + +## Emitter coverage — every audit row carries `ExecutionId` + +| Emitter | `ExecutionId` source | +|---|---| +| Sync `ApiCall`, sync `DbWrite` | `ScriptRuntimeContext` execution id (in scope today) | +| Cached call script-side rows (`CachedSubmit`, immediate `Attempted`/`CachedResolve`) | `ScriptRuntimeContext` execution id | +| Cached call **S&F retry-loop** rows (`CachedCallLifecycleBridge`) | threaded through the store-and-forward buffered message → `CachedCallAttemptContext` → the bridge. This same threading also fixes the pre-existing `SourceScript = NULL` gap on those rows (identical boundary). | +| `NotifySend` (site, script-side) | `ScriptRuntimeContext` execution id | +| `NotifyDeliver` (central dispatch) | `Notifications.OriginExecutionId` — the id rides on `NotificationSubmit`, is persisted on the `Notifications` row, and the dispatcher stamps it on every `NotifyDeliver` row | +| Inbound `InboundRequest` / `InboundAuthFailure` | request id minted once in `AuditWriteMiddleware` | + +## Data flow + +- **Site script run** — `ScriptRuntimeContext` generates the execution id (or is + given one); every emitter it owns stamps `ExecutionId`. +- **Buffered cached call** — the execution id rides on the S&F buffered message; + the retry loop reconstructs it into `CachedCallAttemptContext`; + `CachedCallLifecycleBridge` stamps it on the retry-loop audit rows. +- **Notification** — the `NotifySend` row stamps it site-side; the id travels on + `NotificationSubmit`, is stored as `Notifications.OriginExecutionId`, and the + dispatcher stamps every `NotifyDeliver` row it emits. +- **Inbound API request** — `AuditWriteMiddleware` mints a request id and stamps + the inbound audit row. + +## UI / CLI surface + +- **Central UI Audit Log page** — `ExecutionId` added as a results-grid column + (the grid already supports resize/reorder); an `ExecutionId` paste-filter in + the filter bar; the page accepts `?executionId=`; a row drill-in + "View this execution" → `/audit/log?executionId=`. +- **CLI** — `scadalink audit query --execution-id `. +- **ManagementService** — `/api/audit/query` and the export endpoint accept an + `executionId` filter parameter. + +## Compatibility + +- Two additive nullable columns; additive proto field; additive message-contract + fields — all version-compatible. No data backfill; historical rows keep + `ExecutionId = NULL`. +- `CorrelationId` semantics unchanged — every existing drill-in keeps working. + +## Testing + +- Repository: query-by-`ExecutionId`; migration smoke test. +- Emitter unit tests: each emitter stamps `ExecutionId`; the cached-call lifecycle + rows from one run share it; `NotifyDeliver` echoes `Notifications.OriginExecutionId`. +- Integration: a script run that does a sync call + a cached call + a notification + → all resulting audit rows share one `ExecutionId` end-to-end. +- Central UI: bUnit (grid column, filter, drill-in) + Playwright. + +## Out of scope + +- Bridging the inbound request id into the routed site script's execution + (cross-cluster threading) — a separate future change. +- Backfilling `ExecutionId` on historical audit rows.