docs(design): Audit Log ExecutionId universal correlation
This commit is contained in:
115
docs/plans/2026-05-21-audit-executionid-design.md
Normal file
115
docs/plans/2026-05-21-audit-executionid-design.md
Normal file
@@ -0,0 +1,115 @@
|
||||
# Audit Log — ExecutionId Universal Correlation (Design)
|
||||
|
||||
**Date:** 2026-05-21
|
||||
**Status:** Validated — ready for implementation planning.
|
||||
|
||||
## Problem
|
||||
|
||||
The audit `CorrelationId` column is overloaded with three incompatible meanings —
|
||||
`TrackedOperationId` for cached calls, `NotificationId` for notifications, the
|
||||
script-execution id for sync calls (added 2026-05-21), and request-local ids for
|
||||
inbound. It is `NULL` for sync one-shot calls. There is no single value that ties
|
||||
together *everything one script run (or inbound request) did*: a run that makes a
|
||||
sync API call, a cached call and a notification produces three unrelated
|
||||
correlation ids, and nothing links the cached call's lifecycle rows back to the
|
||||
run that launched them.
|
||||
|
||||
A single `CorrelationId` column cannot serve both scopes — the **operation
|
||||
lifecycle** (a cached call's `Submit→Attempted→Resolve`; a notification's
|
||||
`Send→Deliver`, which the Site Calls / Notifications "View audit history"
|
||||
drill-ins depend on) and the **execution trace** (all operations of one run).
|
||||
|
||||
## Decision
|
||||
|
||||
Add a dedicated, nullable **`ExecutionId`** column to the audit row. It identifies
|
||||
the originating **script execution** or **inbound API request**. Every audit row
|
||||
that execution produces carries the same `ExecutionId`. `CorrelationId` is left
|
||||
exactly as it is — it keeps the per-operation lifecycle meaning, so the existing
|
||||
operation drill-ins are unaffected.
|
||||
|
||||
Result: `WHERE ExecutionId = X` returns every audit row of one run — sync
|
||||
`ApiCall`/`DbWrite`, the whole cached-call lifecycle, `NotifySend`,
|
||||
`NotifyDeliver`, and the inbound row — across both the site and central tables.
|
||||
|
||||
`ScriptRuntimeContext` already holds a per-execution id (`_auditCorrelationId`,
|
||||
added 2026-05-21). That id becomes the `ExecutionId`; this work stamps it into the
|
||||
new column from every emitter and threads it to the two paths where the script
|
||||
context is not in scope.
|
||||
|
||||
### Considered and rejected
|
||||
|
||||
- **Overload `CorrelationId`** with the execution id everywhere — breaks the
|
||||
cached-call / notification "View audit history" drill-ins (they filter
|
||||
`CorrelationId` by `TrackedOperationId` / `NotificationId`), or forces them to
|
||||
show the whole run instead of the one operation.
|
||||
- **Stash the execution id in `Extra` JSON** — no schema change, but `Extra` is
|
||||
unindexed; filtering an audit table of this volume by it is unworkable.
|
||||
|
||||
## Schema changes (all additive, nullable — no backfill; pre-existing rows stay `NULL`)
|
||||
|
||||
| Where | Change |
|
||||
|---|---|
|
||||
| `ScadaLink.Commons` | `AuditEvent` record (and the site-local variant) gains `Guid? ExecutionId`. |
|
||||
| Central MS SQL `AuditLog` | new `ExecutionId uniqueidentifier NULL` column + index `IX_AuditLog_Execution (ExecutionId)`. EF migration — additive nullable column is a metadata-only `ALTER`, fast even on the monthly-partitioned table. |
|
||||
| Site SQLite `auditlog.db` `AuditLog` | new `ExecutionId TEXT NULL` column (`SqliteAuditWriter` schema + `MapRow`). |
|
||||
| gRPC `AuditEventDto` (`sitestream.proto`) | additive `execution_id` field; `AuditEventDtoMapper` maps it both directions. |
|
||||
| Central MS SQL `Notifications` | new `OriginExecutionId uniqueidentifier NULL` column — carries the originating run's id so the dispatcher can echo it onto `NotifyDeliver` audit rows. EF migration. |
|
||||
|
||||
`SiteCalls` needs no new column — the cached telemetry packet already carries the
|
||||
audit half, which now has `ExecutionId` directly.
|
||||
|
||||
## Emitter coverage — every audit row carries `ExecutionId`
|
||||
|
||||
| Emitter | `ExecutionId` source |
|
||||
|---|---|
|
||||
| Sync `ApiCall`, sync `DbWrite` | `ScriptRuntimeContext` execution id (in scope today) |
|
||||
| Cached call script-side rows (`CachedSubmit`, immediate `Attempted`/`CachedResolve`) | `ScriptRuntimeContext` execution id |
|
||||
| Cached call **S&F retry-loop** rows (`CachedCallLifecycleBridge`) | threaded through the store-and-forward buffered message → `CachedCallAttemptContext` → the bridge. This same threading also fixes the pre-existing `SourceScript = NULL` gap on those rows (identical boundary). |
|
||||
| `NotifySend` (site, script-side) | `ScriptRuntimeContext` execution id |
|
||||
| `NotifyDeliver` (central dispatch) | `Notifications.OriginExecutionId` — the id rides on `NotificationSubmit`, is persisted on the `Notifications` row, and the dispatcher stamps it on every `NotifyDeliver` row |
|
||||
| Inbound `InboundRequest` / `InboundAuthFailure` | request id minted once in `AuditWriteMiddleware` |
|
||||
|
||||
## Data flow
|
||||
|
||||
- **Site script run** — `ScriptRuntimeContext` generates the execution id (or is
|
||||
given one); every emitter it owns stamps `ExecutionId`.
|
||||
- **Buffered cached call** — the execution id rides on the S&F buffered message;
|
||||
the retry loop reconstructs it into `CachedCallAttemptContext`;
|
||||
`CachedCallLifecycleBridge` stamps it on the retry-loop audit rows.
|
||||
- **Notification** — the `NotifySend` row stamps it site-side; the id travels on
|
||||
`NotificationSubmit`, is stored as `Notifications.OriginExecutionId`, and the
|
||||
dispatcher stamps every `NotifyDeliver` row it emits.
|
||||
- **Inbound API request** — `AuditWriteMiddleware` mints a request id and stamps
|
||||
the inbound audit row.
|
||||
|
||||
## UI / CLI surface
|
||||
|
||||
- **Central UI Audit Log page** — `ExecutionId` added as a results-grid column
|
||||
(the grid already supports resize/reorder); an `ExecutionId` paste-filter in
|
||||
the filter bar; the page accepts `?executionId=<guid>`; a row drill-in
|
||||
"View this execution" → `/audit/log?executionId=<guid>`.
|
||||
- **CLI** — `scadalink audit query --execution-id <guid>`.
|
||||
- **ManagementService** — `/api/audit/query` and the export endpoint accept an
|
||||
`executionId` filter parameter.
|
||||
|
||||
## Compatibility
|
||||
|
||||
- Two additive nullable columns; additive proto field; additive message-contract
|
||||
fields — all version-compatible. No data backfill; historical rows keep
|
||||
`ExecutionId = NULL`.
|
||||
- `CorrelationId` semantics unchanged — every existing drill-in keeps working.
|
||||
|
||||
## Testing
|
||||
|
||||
- Repository: query-by-`ExecutionId`; migration smoke test.
|
||||
- Emitter unit tests: each emitter stamps `ExecutionId`; the cached-call lifecycle
|
||||
rows from one run share it; `NotifyDeliver` echoes `Notifications.OriginExecutionId`.
|
||||
- Integration: a script run that does a sync call + a cached call + a notification
|
||||
→ all resulting audit rows share one `ExecutionId` end-to-end.
|
||||
- Central UI: bUnit (grid column, filter, drill-in) + Playwright.
|
||||
|
||||
## Out of scope
|
||||
|
||||
- Bridging the inbound request id into the routed site script's execution
|
||||
(cross-cluster threading) — a separate future change.
|
||||
- Backfilling `ExecutionId` on historical audit rows.
|
||||
Reference in New Issue
Block a user