Files
scadalink-design/docs/plans/2026-05-21-audit-executionid-design.md
2026-05-21 14:34:12 -04:00

116 lines
6.4 KiB
Markdown

# Audit Log — ExecutionId Universal Correlation (Design)
**Date:** 2026-05-21
**Status:** Validated — ready for implementation planning.
## Problem
The audit `CorrelationId` column is overloaded with three incompatible meanings —
`TrackedOperationId` for cached calls, `NotificationId` for notifications, the
script-execution id for sync calls (added 2026-05-21), and request-local ids for
inbound. It is `NULL` for sync one-shot calls. There is no single value that ties
together *everything one script run (or inbound request) did*: a run that makes a
sync API call, a cached call and a notification produces three unrelated
correlation ids, and nothing links the cached call's lifecycle rows back to the
run that launched them.
A single `CorrelationId` column cannot serve both scopes — the **operation
lifecycle** (a cached call's `Submit→Attempted→Resolve`; a notification's
`Send→Deliver`, which the Site Calls / Notifications "View audit history"
drill-ins depend on) and the **execution trace** (all operations of one run).
## Decision
Add a dedicated, nullable **`ExecutionId`** column to the audit row. It identifies
the originating **script execution** or **inbound API request**. Every audit row
that execution produces carries the same `ExecutionId`. `CorrelationId` is left
exactly as it is — it keeps the per-operation lifecycle meaning, so the existing
operation drill-ins are unaffected.
Result: `WHERE ExecutionId = X` returns every audit row of one run — sync
`ApiCall`/`DbWrite`, the whole cached-call lifecycle, `NotifySend`,
`NotifyDeliver`, and the inbound row — across both the site and central tables.
`ScriptRuntimeContext` already holds a per-execution id (`_auditCorrelationId`,
added 2026-05-21). That id becomes the `ExecutionId`; this work stamps it into the
new column from every emitter and threads it to the two paths where the script
context is not in scope.
### Considered and rejected
- **Overload `CorrelationId`** with the execution id everywhere — breaks the
cached-call / notification "View audit history" drill-ins (they filter
`CorrelationId` by `TrackedOperationId` / `NotificationId`), or forces them to
show the whole run instead of the one operation.
- **Stash the execution id in `Extra` JSON** — no schema change, but `Extra` is
unindexed; filtering an audit table of this volume by it is unworkable.
## Schema changes (all additive, nullable — no backfill; pre-existing rows stay `NULL`)
| Where | Change |
|---|---|
| `ScadaLink.Commons` | `AuditEvent` record (and the site-local variant) gains `Guid? ExecutionId`. |
| Central MS SQL `AuditLog` | new `ExecutionId uniqueidentifier NULL` column + index `IX_AuditLog_Execution (ExecutionId)`. EF migration — additive nullable column is a metadata-only `ALTER`, fast even on the monthly-partitioned table. |
| Site SQLite `auditlog.db` `AuditLog` | new `ExecutionId TEXT NULL` column (`SqliteAuditWriter` schema + `MapRow`). |
| gRPC `AuditEventDto` (`sitestream.proto`) | additive `execution_id` field; `AuditEventDtoMapper` maps it both directions. |
| Central MS SQL `Notifications` | new `OriginExecutionId uniqueidentifier NULL` column — carries the originating run's id so the dispatcher can echo it onto `NotifyDeliver` audit rows. EF migration. |
`SiteCalls` needs no new column — the cached telemetry packet already carries the
audit half, which now has `ExecutionId` directly.
## Emitter coverage — every audit row carries `ExecutionId`
| Emitter | `ExecutionId` source |
|---|---|
| Sync `ApiCall`, sync `DbWrite` | `ScriptRuntimeContext` execution id (in scope today) |
| Cached call script-side rows (`CachedSubmit`, immediate `Attempted`/`CachedResolve`) | `ScriptRuntimeContext` execution id |
| Cached call **S&F retry-loop** rows (`CachedCallLifecycleBridge`) | threaded through the store-and-forward buffered message → `CachedCallAttemptContext` → the bridge. This same threading also fixes the pre-existing `SourceScript = NULL` gap on those rows (identical boundary). |
| `NotifySend` (site, script-side) | `ScriptRuntimeContext` execution id |
| `NotifyDeliver` (central dispatch) | `Notifications.OriginExecutionId` — the id rides on `NotificationSubmit`, is persisted on the `Notifications` row, and the dispatcher stamps it on every `NotifyDeliver` row |
| Inbound `InboundRequest` / `InboundAuthFailure` | request id minted once in `AuditWriteMiddleware` |
## Data flow
- **Site script run** — `ScriptRuntimeContext` generates the execution id (or is
given one); every emitter it owns stamps `ExecutionId`.
- **Buffered cached call** — the execution id rides on the S&F buffered message;
the retry loop reconstructs it into `CachedCallAttemptContext`;
`CachedCallLifecycleBridge` stamps it on the retry-loop audit rows.
- **Notification** — the `NotifySend` row stamps it site-side; the id travels on
`NotificationSubmit`, is stored as `Notifications.OriginExecutionId`, and the
dispatcher stamps every `NotifyDeliver` row it emits.
- **Inbound API request** — `AuditWriteMiddleware` mints a request id and stamps
the inbound audit row.
## UI / CLI surface
- **Central UI Audit Log page** — `ExecutionId` added as a results-grid column
(the grid already supports resize/reorder); an `ExecutionId` paste-filter in
the filter bar; the page accepts `?executionId=<guid>`; a row drill-in
"View this execution" → `/audit/log?executionId=<guid>`.
- **CLI** — `scadalink audit query --execution-id <guid>`.
- **ManagementService** — `/api/audit/query` and the export endpoint accept an
`executionId` filter parameter.
## Compatibility
- Two additive nullable columns; additive proto field; additive message-contract
fields — all version-compatible. No data backfill; historical rows keep
`ExecutionId = NULL`.
- `CorrelationId` semantics unchanged — every existing drill-in keeps working.
## Testing
- Repository: query-by-`ExecutionId`; migration smoke test.
- Emitter unit tests: each emitter stamps `ExecutionId`; the cached-call lifecycle
rows from one run share it; `NotifyDeliver` echoes `Notifications.OriginExecutionId`.
- Integration: a script run that does a sync call + a cached call + a notification
→ all resulting audit rows share one `ExecutionId` end-to-end.
- Central UI: bUnit (grid column, filter, drill-in) + Playwright.
## Out of scope
- Bridging the inbound request id into the routed site script's execution
(cross-cluster threading) — a separate future change.
- Backfilling `ExecutionId` on historical audit rows.