diff --git a/docs/plans/2026-05-21-audit-parent-executionid-design.md b/docs/plans/2026-05-21-audit-parent-executionid-design.md
new file mode 100644
index 0000000..5c29f19
--- /dev/null
+++ b/docs/plans/2026-05-21-audit-parent-executionid-design.md
@@ -0,0 +1,222 @@
+# Audit Log — Cross-Execution Correlation (`ParentExecutionId`) Design
+
+**Date:** 2026-05-21
+**Status:** Validated — ready for implementation planning.
+
+## Problem
+
+The Audit Log carries `ExecutionId` (`Guid?`) — a universal per-run correlation
+value stamped on every audit row, identifying the originating script execution
+or inbound API request. It is **per-execution and flat**: `WHERE ExecutionId = X`
+returns everything *one* run did, but nothing links an execution to the
+execution that *spawned* it. A call chain cannot be traced across the execution
+boundary.
+
+Two cross-execution cases exist:
+
+1. **Inbound API request → routed site script.** An inbound HTTP request runs an
+   inbound method script (`InboundScriptExecutor`, central) which calls
+   `Route.Call(scriptName, params)`; that sends a `RouteToCallRequest` to a site
+   instance, which runs `scriptName` as a fresh site-side execution. The inbound
+   request and the routed site script get two unrelated `ExecutionId`s.
+2. **Tag cascade.** Script A writes an attribute; the attribute change triggers
+   script B as a separate execution. A and B are unrelated.
+
+## Decision
+
+Add a dedicated, nullable **`ParentExecutionId`** (`Guid?`) column to the audit
+row. Every execution still gets its own fresh `ExecutionId` (unchanged). An
+execution *spawned by* another carries the spawner's `ExecutionId` in its
+`ParentExecutionId`; a top-level (tag/timer/inbound/un-bridged) execution leaves
+it null. Walking `ParentExecutionId → ExecutionId` recursively reconstructs the
+chain as a tree.
+
+**First cut — in scope:** case 1 only, the **inbound → routed-site-script
+bridge**. It is the most concrete case and the spawn point is an explicit,
+threadable RPC (`RouteToCallRequest`).
+
+**Out of scope:** case 2 (tag cascade) — the trigger is data-driven and
+decoupled; "which execution wrote the tag that triggered me" is not tracked
+anywhere today. Deferred as a follow-up. The `ParentExecutionId` model
+generalises to it with no schema change if that data is ever threaded.
+
+### Considered and rejected
+
+- **Reuse `ExecutionId`** — the routed script *adopts* the inbound request's
+  `ExecutionId` instead of generating its own. Cheaper (no new column) but
+  conflates two genuinely separate executions on two clusters, breaks the
+  invariant "one `ExecutionId` = one `ScriptRuntimeContext` run", and does not
+  generalise to tag cascade.
+- **Point `ParentExecutionId` at the root** (flatten the chain to two levels)
+  instead of the immediate spawner — simpler queries but loses intermediate
+  hops, needs a separately threaded root id, and does not generalise. Rejected
+  in favour of the immediate-spawner tree.
+
+## Architecture & data flow
+
+The id propagated is the **inbound API request's `ExecutionId`**. The chain:
+
+1. **Mint the inbound request id once, early.** Today `AuditWriteMiddleware`
+   mints a `Guid.NewGuid()` late, only for the inbound row's `ExecutionId`. Move
+   the mint to the HTTP entry and stash it on `HttpContext.Items`, so both the
+   middleware (writes the `InboundRequest` row at request end) and
+   `InboundScriptExecutor` (needs it *before* the script runs) read the same id.
+2. **Carry it on the routing RPC.** `RouteHelper.Call` builds a
+   `RouteToCallRequest`; an additive `ParentExecutionId` field is set from the
+   stashed inbound id. (`RouteHelper`'s own per-op GUID is a separate concern —
+   left alone.)
+3. **Site side: thread it into the routed script's context.** The site handler
+   for `RouteToCallRequest` passes it to a new optional `parentExecutionId` ctor
+   param on `ScriptRuntimeContext` (sibling to the existing `executionId`
+   param). The routed script still generates its **own** fresh `ExecutionId`.
+4. **Every emitter stamps `ParentExecutionId`** alongside `ExecutionId`.
+
+**Recursion (immediate-spawner tree).** A routed script that itself calls
+`Route.Call` threads its own `ExecutionId` onward, so a grandchild's
+`ParentExecutionId` points at its immediate spawner, not the root. Walk the tree
+recursively to reconstruct any depth.
+
+**The inbound request's own row** (`InboundRequest` / `InboundAuthFailure`) is
+top-level → `ParentExecutionId = NULL`. Only the routed site script and every
+row it produces carry the pointer.
+
+## Schema changes (all additive, nullable — no backfill; pre-existing rows stay `NULL`)
+
+| Where | Change |
+|---|---|
+| `ScadaLink.Commons` | `AuditEvent.ParentExecutionId` (`Guid?`); `RouteToCallRequest.ParentExecutionId` (`Guid?`); `Notification.OriginParentExecutionId` (`Guid?`); `NotificationSubmit.OriginParentExecutionId` (`Guid?`). |
+| Central MS SQL `AuditLog` | `ParentExecutionId uniqueidentifier NULL` column + partition-aligned index `IX_AuditLog_ParentExecution (ParentExecutionId)` (mirror `AddAuditLogExecutionId`). EF migration — additive nullable column is a metadata-only `ALTER`. |
+| Central MS SQL `Notifications` | `OriginParentExecutionId uniqueidentifier NULL` column + EF migration (mirror `AddNotificationOriginExecutionId`). |
+| Site SQLite `auditlog.db` `AuditLog` | `ParentExecutionId TEXT NULL` — added **via the idempotent `ALTER`-if-missing upgrade path** (per commit `5198b11`), never relying on `CREATE TABLE IF NOT EXISTS`. |
+| gRPC `AuditEventDto` (`sitestream.proto`) | additive `parent_execution_id` field (next free number); `AuditEventDtoMapper` maps it both directions (Guid ↔ string; empty string ↔ null). |
+| `ScriptRuntimeContext` | optional `parentExecutionId` ctor param + stored `_parentExecutionId` field. |
+
+`IX_AuditLog_ParentExecution` is load-bearing: the tree view's downward
+recursive join seeks on it, and it backs the `parentExecutionId` filter.
+
+`SiteCalls` needs no new column — the cached telemetry packet carries the audit
+half, which now has `ParentExecutionId` directly.
+
+## Emitter coverage — full (mirrors the `ExecutionId` rollout)
+
+Every audit row a routed-script run produces carries `ParentExecutionId`, so
+`WHERE ParentExecutionId = X` returns the routed run's complete trust-boundary
+footprint.
+
+| Emitter | `ParentExecutionId` source |
+|---|---|
+| Sync `ApiCall`, sync `DbWrite` | `ScriptRuntimeContext._parentExecutionId` (in scope) |
+| Cached call script-side rows (`CachedSubmit`, immediate `Attempted`/`CachedResolve`) | `ScriptRuntimeContext._parentExecutionId` |
+| Cached call **S&F retry-loop** rows (`CachedCallLifecycleBridge`) | threaded through the S&F buffered message → `CachedCallAttemptContext` → the bridge, as a sibling to the `ExecutionId` already threaded there |
+| `NotifySend` (site, script-side) | `ScriptRuntimeContext._parentExecutionId` |
+| `NotifyDeliver` (central dispatch) | `Notifications.OriginParentExecutionId` — rides on `NotificationSubmit`, persisted on the `Notifications` row, dispatcher stamps every `NotifyDeliver` row |
+| Inbound `InboundRequest` / `InboundAuthFailure` | `NULL` — inbound is top-level |
+
+The threading reuses the carry points the `ExecutionId` rollout already opened
+(S&F buffer, `NotificationSubmit` → `Notifications`); `ParentExecutionId` is a
+sibling field at each, not a new boundary.
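
The threading described under "Architecture & data flow" and in the emitter table above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual implementation: the type names are taken from this design, but every member shown here (the `Kind` field, the in-memory `auditSink`, `Emit`, `RouteCall`, `SiteRouteHandler.Handle`, and the two-argument `RouteHelper.Call`) is a hypothetical simplification, and the real types carry far more state.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-ins; the real ScadaLink types carry more fields.
public sealed record AuditEvent(string Kind, Guid ExecutionId, Guid? ParentExecutionId);
public sealed record RouteToCallRequest(string ScriptName, Guid? ParentExecutionId);

public static class RouteHelper
{
    // Step 2: the caller's ExecutionId rides on the RPC as ParentExecutionId.
    public static RouteToCallRequest Call(string scriptName, Guid callerExecutionId)
        => new(scriptName, callerExecutionId);
}

public sealed class ScriptRuntimeContext
{
    private readonly Guid _executionId;
    private readonly Guid? _parentExecutionId;
    private readonly List<AuditEvent> _auditSink; // assumed in-memory sink, for illustration only

    public ScriptRuntimeContext(List<AuditEvent> auditSink,
                                Guid? executionId = null,
                                Guid? parentExecutionId = null)
    {
        _auditSink = auditSink;
        _executionId = executionId ?? Guid.NewGuid(); // every execution keeps its own id
        _parentExecutionId = parentExecutionId;       // stays null for top-level executions
    }

    public Guid ExecutionId => _executionId;

    // Step 4: every emitter stamps both ids on the row.
    public void Emit(string kind)
        => _auditSink.Add(new AuditEvent(kind, _executionId, _parentExecutionId));

    // Recursion: a routed script that routes again passes its OWN ExecutionId onward.
    public RouteToCallRequest RouteCall(string scriptName)
        => RouteHelper.Call(scriptName, _executionId);
}

public static class SiteRouteHandler
{
    // Step 3: fresh ExecutionId for the routed script, parent taken from the request.
    public static ScriptRuntimeContext Handle(RouteToCallRequest request, List<AuditEvent> auditSink)
        => new(auditSink, parentExecutionId: request.ParentExecutionId);
}
```

In this sketch, handling `RouteHelper.Call("siteScript", inboundId)` yields a context whose rows carry `ParentExecutionId == inboundId` and a distinct `ExecutionId`; routing again from that context produces rows whose parent is the routed script's id rather than the inbound id, which is the immediate-spawner behaviour the design chooses.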
+
+## Recursive chain/tree view
+
+A new repository method `GetExecutionTreeAsync(Guid executionId)`:
+
+- **Walk up** to the root: iterative single-parent follow
+  (`SELECT TOP 1 ParentExecutionId FROM AuditLog WHERE ExecutionId = current
+  AND ParentExecutionId IS NOT NULL`) until no parent is found. Cheap, because
+  each execution has at most one parent.
+- **Walk down** from the root: recursive CTE joining
+  `ParentExecutionId = ancestor.ExecutionId`, seeking on
+  `IX_AuditLog_ParentExecution`. `MAXRECURSION` capped (e.g. 32) — chains are
+  shallow; the cap guards against corrupt/pathological data.
+- Returns a flat list of execution nodes: `ExecutionId`, `ParentExecutionId`,
+  row count, channels/statuses present, `SourceSiteId`/`SourceInstanceId`,
+  first/last `OccurredAtUtc`. The UI assembles the tree from the flat list.
+
+**UI.** New route `/audit/execution-tree?executionId=`, reached via a
+"View execution chain" drill-in from any audit row and from the `ExecutionId`
+column. Renders an expandable custom Blazor tree (no component frameworks); each
+node shows the execution summary; clicking a node filters the Audit Log grid to
+`?executionId=`. The tree is always rooted at the topmost ancestor, so the
+reader sees the full chain regardless of which row they entered from.
+
+Plus the cheaper navigation affordances: `ParentExecutionId` grid column (short
+form / monospace), a `ParentExecutionId` paste-filter, a `?parentExecutionId=`
+query param, and a "View parent execution" drill-in (links
+`?executionId=`).
+
+### Edge cases
+
+- **Parent with no rows of its own.** An execution that performed no
+  trust-boundary action emits no audit rows, yet a child still references it via
+  `ParentExecutionId`. The upward walk resolves the GUID but finds no rows for
+  that node → render it as a stub node ("execution with no audited actions").
+- **Purged parent.** A parent execution older than the 365-day central
+  retention has no rows → the upward walk stops there; the chain renders as far
+  as it resolves.
+- **Cycle guard.** The `ParentExecutionId` graph is acyclic by construction
+  (each execution is minted fresh and its parent always pre-exists), but
+  `MAXRECURSION` bounds the downward CTE against corrupt data.
+
+## CLI / ManagementService
+
+- CLI: `scadalink audit query --parent-execution-id `;
+  `AuditLogQueryFilter` gains a `ParentExecutionId` single-value filter
+  dimension (mirror `ExecutionId`).
+- ManagementService `/api/audit/query` + export endpoint and the CentralUI
+  export endpoints parse a `parentExecutionId` query param (lax-parse —
+  unparseable dropped).
+- The tree view's data path: `GetExecutionTreeAsync` is exposed however the
+  existing Audit Log page sources its grid data — mirror that path; add a
+  ManagementService endpoint only if the page goes through it.
+- **No CLI `audit tree` command in the first cut** — the tree is a UI forensic
+  affordance; the `--parent-execution-id` filter covers scripted use. Noted as a
+  possible follow-up.
+
+## Compatibility
+
+- Additive nullable columns; additive proto field; additive message-contract
+  fields — all version-compatible. No backfill; historical rows keep
+  `ParentExecutionId = NULL`.
+- `ExecutionId` and `CorrelationId` semantics unchanged — every existing
+  drill-in keeps working.
+
+## Failure handling
+
+- Audit-write failure NEVER aborts the user-facing action — unchanged invariant;
+  `ParentExecutionId` is just another field on the row.
+- Site `auditlog.db` schema change MUST use the idempotent `ALTER`-if-missing
+  path (commit `5198b11`); do not repeat the original `CREATE TABLE IF NOT
+  EXISTS` mistake.
+
+## Testing
+
+- Repository: query-by-`ParentExecutionId`; `GetExecutionTreeAsync` (multi-level
+  tree, stub-parent node, `MAXRECURSION` cap); migration smoke test.
+- Emitter unit tests: each emitter stamps `ParentExecutionId`; the cached-call
+  lifecycle rows from one routed run share it; `NotifyDeliver` echoes
+  `Notifications.OriginParentExecutionId`.
+- **Headline integration test:** an inbound API request that calls `Route.Call`
+  → the routed site script does a sync `ExternalSystem.Call`, a cached call, and
+  a `Notify.Send` → every resulting audit row (site + central) carries
+  `ParentExecutionId` = the inbound request's `ExecutionId`, while each has its
+  own distinct `ExecutionId`.
+- Central UI: bUnit (column renders, filter maps, query param parsed, tree
+  assembled from the flat list) + Playwright (drill-in → tree → node click
+  filters the grid).
+
+## Out of scope / follow-ups
+
+- **Tag cascade (case 2)** — deferred. If the attribute-write path ever carries
+  the writing execution's id into the triggered script's `ScriptRuntimeContext`,
+  the same `ParentExecutionId` column and tree view cover it with no schema
+  change.
+- CLI `audit tree` command — possible follow-up.
+- Backfilling `ParentExecutionId` on historical audit rows — not done.
+
+## Constraints
+
+- Additive everywhere — nullable columns, additive proto/message fields, no
+  backfill.
+- Never touch `infra/*`; `alog.md` is the locked v1 spec — do not modify it.
+- Site `auditlog.db` schema change MUST use the idempotent `ALTER`-if-missing
+  path (commit `5198b11`).
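
As an illustration of the walk-up / walk-down semantics in "Recursive chain/tree view" and its edge cases, the sketch below runs the same logic over an in-memory list of rows. It is a stand-in for reasoning about the expected results (for example when writing the repository tests), not the repository method itself: the real `GetExecutionTreeAsync` is meant to run in SQL, and `AuditRow`, `ExecutionNode`, `ExecutionTree.Build`, and the `Depth` field are hypothetical names introduced here. One deliberate difference: SQL Server raises an error when `MAXRECURSION` is exceeded, whereas this sketch simply stops descending.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-ins; the real data lives in the AuditLog table.
public sealed record AuditRow(Guid ExecutionId, Guid? ParentExecutionId);
public sealed record ExecutionNode(Guid ExecutionId, Guid? ParentExecutionId, int RowCount, int Depth);

public static class ExecutionTree
{
    public const int MaxDepth = 32; // mirrors the suggested MAXRECURSION cap

    public static IReadOnlyList<ExecutionNode> Build(IEnumerable<AuditRow> rows, Guid start)
    {
        var parentOf = new Dictionary<Guid, Guid?>();
        var rowCount = new Dictionary<Guid, int>();
        var childrenOf = new Dictionary<Guid, HashSet<Guid>>();
        foreach (var row in rows)
        {
            parentOf[row.ExecutionId] = row.ParentExecutionId;
            rowCount[row.ExecutionId] = rowCount.GetValueOrDefault(row.ExecutionId) + 1;
            if (row.ParentExecutionId is Guid parentId)
            {
                if (!childrenOf.TryGetValue(parentId, out var siblings))
                    childrenOf[parentId] = siblings = new HashSet<Guid>();
                siblings.Add(row.ExecutionId);
            }
        }

        // Walk up: follow the single parent pointer until it is null or the
        // parent has no rows of its own (stub or purged parent).
        var root = start;
        for (var hops = 0; hops < MaxDepth; hops++)
        {
            if (!parentOf.TryGetValue(root, out var up) || !up.HasValue) break;
            root = up.Value;
        }

        // Walk down: breadth-first from the root, bounded by MaxDepth, with a
        // visited set as a guard against corrupt (cyclic) data.
        var nodes = new List<ExecutionNode>();
        var seen = new HashSet<Guid>();
        var queue = new Queue<(Guid Id, int Depth)>();
        queue.Enqueue((root, 0));
        while (queue.Count > 0)
        {
            var (id, depth) = queue.Dequeue();
            if (depth > MaxDepth || !seen.Add(id)) continue;
            nodes.Add(new ExecutionNode(
                id,
                parentOf.GetValueOrDefault(id),  // null for the root or a stub
                rowCount.GetValueOrDefault(id),  // 0 rows marks a stub node
                depth));
            if (childrenOf.TryGetValue(id, out var kids))
                foreach (var kid in kids) queue.Enqueue((kid, depth + 1));
        }
        return nodes;
    }
}
```

Given rows for executions B (parent A) and C (parent B) and no rows for A, starting from C the sketch returns A as a zero-row stub root, then B, then C, which matches the "parent with no rows of its own" case above.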