docs(auditlog): ParentExecutionId cross-execution correlation design
docs/plans/2026-05-21-audit-parent-executionid-design.md
@@ -0,0 +1,222 @@
# Audit Log — Cross-Execution Correlation (`ParentExecutionId`) Design

**Date:** 2026-05-21
**Status:** Validated — ready for implementation planning.

## Problem

The Audit Log carries `ExecutionId` (`Guid?`) — a universal per-run correlation
value stamped on every audit row, identifying the originating script execution
or inbound API request. It is **per-execution and flat**: `WHERE ExecutionId = X`
returns everything *one* run did, but nothing links an execution to the
execution that *spawned* it. A call chain cannot be traced across the execution
boundary.

Two cross-execution cases exist:

1. **Inbound API request → routed site script.** An inbound HTTP request runs an
   inbound method script (`InboundScriptExecutor`, central) which calls
   `Route.Call(scriptName, params)`; that sends a `RouteToCallRequest` to a site
   instance, which runs `scriptName` as a fresh site-side execution. The inbound
   request and the routed site script get two unrelated `ExecutionId`s.
2. **Tag cascade.** Script A writes an attribute; the attribute change triggers
   script B as a separate execution. A and B are unrelated.

## Decision

Add a dedicated, nullable **`ParentExecutionId`** (`Guid?`) column to the audit
row. Every execution still gets its own fresh `ExecutionId` (unchanged). An
execution *spawned by* another carries the spawner's `ExecutionId` in its
`ParentExecutionId`; a top-level (tag/timer/inbound/un-bridged) execution leaves
it null. Walking `ParentExecutionId → ExecutionId` recursively reconstructs the
chain as a tree.

**First cut — in scope:** case 1 only, the **inbound → routed-site-script
bridge**. It is the most concrete case and the spawn point is an explicit,
threadable RPC (`RouteToCallRequest`).

**Out of scope:** case 2 (tag cascade) — the trigger is data-driven and
decoupled; "which execution wrote the tag that triggered me" is not tracked
anywhere today. Deferred as a follow-up. The `ParentExecutionId` model
generalises to it with no schema change if that data is ever threaded.

### Considered and rejected

- **Reuse `ExecutionId`** — the routed script *adopts* the inbound request's
  `ExecutionId` instead of generating its own. Cheaper (no new column) but
  conflates two genuinely separate executions on two clusters, breaks the
  invariant "one `ExecutionId` = one `ScriptRuntimeContext` run", and does not
  generalise to tag cascade.
- **Point `ParentExecutionId` at the root** (flatten the chain to two levels)
  instead of the immediate spawner — simpler queries but loses intermediate
  hops, needs a separately threaded root id, and does not generalise. Rejected
  in favour of the immediate-spawner tree.

## Architecture & data flow

The id propagated is the **inbound API request's `ExecutionId`**. The chain:

1. **Mint the inbound request id once, early.** Today `AuditWriteMiddleware`
   mints a `Guid.NewGuid()` late, only for the inbound row's `ExecutionId`. Move
   the mint to the HTTP entry and stash it on `HttpContext.Items`, so both the
   middleware (writes the `InboundRequest` row at request end) and
   `InboundScriptExecutor` (needs it *before* the script runs) read the same id.
2. **Carry it on the routing RPC.** `RouteHelper.Call` builds a
   `RouteToCallRequest`; an additive `ParentExecutionId` field is set from the
   stashed inbound id. (`RouteHelper`'s own per-op GUID is a separate concern —
   left alone.)
3. **Site side: thread it into the routed script's context.** The site handler
   for `RouteToCallRequest` passes it to a new optional `parentExecutionId` ctor
   param on `ScriptRuntimeContext` (sibling to the existing `executionId`
   param). The routed script still generates its **own** fresh `ExecutionId`.
4. **Every emitter stamps `ParentExecutionId`** alongside `ExecutionId`.

**Recursion (immediate-spawner tree).** A routed script that itself calls
`Route.Call` threads its own `ExecutionId` onward, so a grandchild's
`ParentExecutionId` points at its immediate spawner, not the root. Walk the tree
recursively to reconstruct any depth.
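
The threading described in the steps above can be sketched in a few lines. This is a language-neutral Python illustration, not the C# implementation: the class and field names echo the design, but `route_call`, `handle_route_to_call`, and the use of one context class for the inbound side are simplifying assumptions made for this example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RouteToCallRequest:
    script_name: str
    # Additive, optional field: the spawner's ExecutionId (None if absent).
    parent_execution_id: Optional[uuid.UUID] = None


@dataclass
class ScriptRuntimeContext:
    # Every run mints its own fresh ExecutionId.
    execution_id: uuid.UUID = field(default_factory=uuid.uuid4)
    # Stays None for a top-level execution.
    parent_execution_id: Optional[uuid.UUID] = None

    def route_call(self, script_name: str) -> RouteToCallRequest:
        # Hypothetical helper: thread THIS run's ExecutionId onward,
        # so the child points at its immediate spawner.
        return RouteToCallRequest(script_name, parent_execution_id=self.execution_id)


def handle_route_to_call(request: RouteToCallRequest) -> ScriptRuntimeContext:
    # Hypothetical site-side handler: fresh ExecutionId, parent from the RPC.
    return ScriptRuntimeContext(parent_execution_id=request.parent_execution_id)


inbound = ScriptRuntimeContext()  # top-level: no parent
child = handle_route_to_call(inbound.route_call("SiteScript"))
grandchild = handle_route_to_call(child.route_call("NestedScript"))
```

Here `grandchild.parent_execution_id` equals `child.execution_id`, not `inbound.execution_id`, which is the immediate-spawner behaviour chosen over the rejected point-at-root alternative.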

**The inbound request's own row** (`InboundRequest` / `InboundAuthFailure`) is
top-level → `ParentExecutionId = NULL`. Only the routed site script and every
row it produces carry the pointer.

## Schema changes (all additive, nullable — no backfill; pre-existing rows stay `NULL`)

| Where | Change |
|---|---|
| `ScadaLink.Commons` | `AuditEvent.ParentExecutionId` (`Guid?`); `RouteToCallRequest.ParentExecutionId` (`Guid?`); `Notification.OriginParentExecutionId` (`Guid?`); `NotificationSubmit.OriginParentExecutionId` (`Guid?`). |
| Central MS SQL `AuditLog` | `ParentExecutionId uniqueidentifier NULL` column + partition-aligned index `IX_AuditLog_ParentExecution (ParentExecutionId)` (mirror `AddAuditLogExecutionId`). EF migration — additive nullable column is a metadata-only `ALTER`. |
| Central MS SQL `Notifications` | `OriginParentExecutionId uniqueidentifier NULL` column + EF migration (mirror `AddNotificationOriginExecutionId`). |
| Site SQLite `auditlog.db` `AuditLog` | `ParentExecutionId TEXT NULL` — added **via the idempotent `ALTER`-if-missing upgrade path** (per commit `5198b11`), never relying on `CREATE TABLE IF NOT EXISTS`. |
| gRPC `AuditEventDto` (`sitestream.proto`) | additive `parent_execution_id` field (next free number); `AuditEventDtoMapper` maps it both directions (Guid ↔ string; empty string ↔ null). |
| `ScriptRuntimeContext` | optional `parentExecutionId` ctor param + stored `_parentExecutionId` field. |

`IX_AuditLog_ParentExecution` is load-bearing: the tree view's downward
recursive join seeks on it, and it backs the `parentExecutionId` filter.

`SiteCalls` needs no new column — the cached telemetry packet carries the audit
half, which now has `ParentExecutionId` directly.
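
The "Guid ↔ string; empty string ↔ null" mapping in the `AuditEventDto` row can be illustrated as below. This is a Python sketch of the rule only; the function names are hypothetical and the real `AuditEventDtoMapper` is not shown in this document. The empty string is a natural wire form for "no parent" because a proto3 `string` field defaults to empty when unset.

```python
import uuid
from typing import Optional


def guid_to_wire(value: Optional[uuid.UUID]) -> str:
    # null maps to the empty string on the wire.
    return "" if value is None else str(value)


def guid_from_wire(value: str) -> Optional[uuid.UUID]:
    # The empty string maps back to null; anything else is parsed as a GUID.
    return uuid.UUID(value) if value else None
```

A message from an older sender that never sets the field arrives with an empty string and therefore maps to a null `ParentExecutionId`, which matches the additive-compatibility goal.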

## Emitter coverage — full (mirrors the `ExecutionId` rollout)

Every audit row a routed-script run produces carries `ParentExecutionId`, so
`WHERE ParentExecutionId = X` returns the routed run's complete trust-boundary
footprint.

| Emitter | `ParentExecutionId` source |
|---|---|
| Sync `ApiCall`, sync `DbWrite` | `ScriptRuntimeContext._parentExecutionId` (in scope) |
| Cached call script-side rows (`CachedSubmit`, immediate `Attempted`/`CachedResolve`) | `ScriptRuntimeContext._parentExecutionId` |
| Cached call **S&F retry-loop** rows (`CachedCallLifecycleBridge`) | threaded through the S&F buffered message → `CachedCallAttemptContext` → the bridge, as a sibling to the `ExecutionId` already threaded there |
| `NotifySend` (site, script-side) | `ScriptRuntimeContext._parentExecutionId` |
| `NotifyDeliver` (central dispatch) | `Notifications.OriginParentExecutionId` — rides on `NotificationSubmit`, persisted on the `Notifications` row, dispatcher stamps every `NotifyDeliver` row |
| Inbound `InboundRequest` / `InboundAuthFailure` | `NULL` — inbound is top-level |

The threading reuses the carry points the `ExecutionId` rollout already opened
(S&F buffer, `NotificationSubmit` → `Notifications`); `ParentExecutionId` is a
sibling field at each, not a new boundary.

## Recursive chain/tree view

A new repository method `GetExecutionTreeAsync(Guid executionId)`:

- **Walk up** to the root: iterative single-parent follow
  (`SELECT TOP 1 ParentExecutionId FROM AuditLog WHERE ExecutionId = current AND
  ParentExecutionId IS NOT NULL`) until null. This is cheap because each
  execution has at most one parent.
- **Walk down** from the root: recursive CTE joining
  `ParentExecutionId = ancestor.ExecutionId`, seeking on
  `IX_AuditLog_ParentExecution`. `MAXRECURSION` capped (e.g. 32) — chains are
  shallow; the cap guards against corrupt/pathological data.
- Returns a flat list of execution nodes: `ExecutionId`, `ParentExecutionId`,
  row count, channels/statuses present, `SourceSiteId`/`SourceInstanceId`,
  first/last `OccurredAtUtc`. The UI assembles the tree from the flat list.
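
The two walks can be tried out against an in-memory database. The sketch below uses Python's `sqlite3` as a stand-in for the central MS SQL store, so it differs from the design in stated ways: `LIMIT 1` replaces `TOP 1`, and an explicit `Depth` column plays the role of the `MAXRECURSION` cap, which SQLite does not have. The two-column table and the function names are assumptions made for this example.

```python
import sqlite3

MAX_DEPTH = 32  # plays the role of the MAXRECURSION cap


def find_root(conn, execution_id):
    # Walk up: follow the single parent pointer until none is found.
    current = execution_id
    for _ in range(MAX_DEPTH):
        row = conn.execute(
            "SELECT ParentExecutionId FROM AuditLog "
            "WHERE ExecutionId = ? AND ParentExecutionId IS NOT NULL LIMIT 1",
            (current,),
        ).fetchone()
        if row is None:
            return current
        current = row[0]
    return current  # cap reached: treat the last resolved id as the root


def walk_down(conn, root_id):
    # Walk down: recursive CTE joining child.ParentExecutionId to the ancestor.
    # The root is seeded from the parameter, so a root with no rows still appears.
    # UNION (not UNION ALL) collapses the many audit rows of one execution.
    return conn.execute(
        """
        WITH RECURSIVE tree(ExecutionId, ParentExecutionId, Depth) AS (
            SELECT ?, NULL, 0
            UNION
            SELECT a.ExecutionId, a.ParentExecutionId, t.Depth + 1
            FROM AuditLog AS a
            JOIN tree AS t ON a.ParentExecutionId = t.ExecutionId
            WHERE t.Depth < ?
        )
        SELECT ExecutionId, ParentExecutionId, Depth
        FROM tree ORDER BY Depth, ExecutionId
        """,
        (root_id, MAX_DEPTH),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AuditLog (ExecutionId TEXT, ParentExecutionId TEXT)")
conn.execute("CREATE INDEX IX_AuditLog_ParentExecution ON AuditLog (ParentExecutionId)")
conn.executemany(
    "INSERT INTO AuditLog VALUES (?, ?)",
    [("root", None), ("child", "root"), ("child", "root"), ("grand", "child")],
)
root = find_root(conn, "grand")
nodes = walk_down(conn, root)
```

The per-node aggregates the design lists (row count, channels, first/last `OccurredAtUtc`) are left out here; they would be a `GROUP BY ExecutionId` over the rows the walk reaches.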

**UI.** New route `/audit/execution-tree?executionId=<guid>`, reached via a
"View execution chain" drill-in from any audit row and from the `ExecutionId`
column. Renders an expandable custom Blazor tree (no component frameworks); each
node shows the execution summary; clicking a node filters the Audit Log grid to
`?executionId=<node>`. The tree is always rooted at the topmost ancestor, so the
reader sees the full chain regardless of which row they entered from.

Plus the cheaper navigation affordances: `ParentExecutionId` grid column (short
form / monospace), a `ParentExecutionId` paste-filter, a `?parentExecutionId=`
query param, and a "View parent execution" drill-in (links
`?executionId=<parentId>`).

### Edge cases

- **Parent with no rows of its own.** An execution that performed no
  trust-boundary action emits no audit rows, yet a child still references it via
  `ParentExecutionId`. The upward walk resolves the GUID but finds no rows for
  that node → render it as a stub node ("execution with no audited actions").
- **Purged parent.** A parent execution older than the 365-day central
  retention has no rows → the upward walk stops there; the chain renders as far
  as it resolves.
- **Cycle guard.** The `ParentExecutionId` graph is acyclic by construction
  (each execution is minted fresh and its parent always pre-exists), but
  `MAXRECURSION` bounds the downward CTE against corrupt data.
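
Assembling the tree from the flat node list, including the stub-node case, might look like the sketch below. It is written in Python for illustration rather than as the Blazor component; the `Node` shape and `assemble_tree` are hypothetical, and the sketch treats `row_count == 0` as the stub marker.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    execution_id: str
    parent_execution_id: Optional[str] = None
    row_count: int = 0  # 0 marks a stub: referenced, but no audit rows found
    children: List["Node"] = field(default_factory=list)


def assemble_tree(flat_nodes: List[Node]) -> List[Node]:
    by_id = {n.execution_id: n for n in flat_nodes}
    # A parent that is referenced but absent from the list becomes a stub node.
    for n in flat_nodes:
        pid = n.parent_execution_id
        if pid is not None and pid not in by_id:
            by_id[pid] = Node(pid)
    roots = []
    for n in by_id.values():
        if n.parent_execution_id is None:
            roots.append(n)
        else:
            by_id[n.parent_execution_id].children.append(n)
    return roots


flat = [Node("child", "root", row_count=3), Node("grand", "child", row_count=1)]
roots = assemble_tree(flat)
```

From the rows alone, a parent that never audited anything and a parent whose rows were purged look the same in this sketch: both surface as a stub with no rows.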

## CLI / ManagementService

- CLI: `scadalink audit query --parent-execution-id <guid>`;
  `AuditLogQueryFilter` gains a `ParentExecutionId` single-value filter
  dimension (mirror `ExecutionId`).
- ManagementService `/api/audit/query` + export endpoint and the CentralUI
  export endpoints parse a `parentExecutionId` query param (lax-parse —
  unparseable dropped).
- The tree view's data path: `GetExecutionTreeAsync` is exposed however the
  existing Audit Log page sources its grid data — mirror that path; add a
  ManagementService endpoint only if the page goes through it.
- **No CLI `audit tree` command in the first cut** — the tree is a UI forensic
  affordance; the `--parent-execution-id` filter covers scripted use. Noted as a
  possible follow-up.
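
"Lax-parse, unparseable dropped" means a malformed `parentExecutionId` is ignored rather than rejected. A minimal Python sketch of that rule follows; the function name is hypothetical and the real endpoints presumably use the .NET query binding instead.

```python
import uuid
from typing import Optional
from urllib.parse import parse_qs


def parse_parent_execution_id(query_string: str) -> Optional[uuid.UUID]:
    # Missing or blank parameter: no filter.
    values = parse_qs(query_string).get("parentExecutionId")
    if not values:
        return None
    try:
        return uuid.UUID(values[0])
    except ValueError:
        # Unparseable value: drop the filter instead of failing the request.
        return None
```

With this behaviour a bad value widens the query to "no parent filter" rather than returning an error, so callers that need strictness would have to validate the GUID before sending it.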

## Compatibility

- Additive nullable columns; additive proto field; additive message-contract
  fields — all version-compatible. No backfill; historical rows keep
  `ParentExecutionId = NULL`.
- `ExecutionId` and `CorrelationId` semantics unchanged — every existing
  drill-in keeps working.

## Failure handling

- Audit-write failure NEVER aborts the user-facing action — unchanged invariant;
  `ParentExecutionId` is just another field on the row.
- Site `auditlog.db` schema change MUST use the idempotent `ALTER`-if-missing
  path (commit `5198b11`); do not repeat the original `CREATE TABLE IF NOT
  EXISTS` mistake.
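
The reason `CREATE TABLE IF NOT EXISTS` is not enough is that it is a no-op once the table exists, so a new column in the statement never reaches an already-deployed `auditlog.db`. An `ALTER`-if-missing step can be sketched with Python's `sqlite3` as below; `ensure_column` is a hypothetical helper for illustration, not the code from commit `5198b11`.

```python
import sqlite3


def ensure_column(conn, table, column, decl):
    # table/column/decl must be trusted constants: identifiers cannot be bound
    # as SQL parameters, so they are interpolated into the statement.
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1].lower() for row in conn.execute(f"PRAGMA table_info({table})")}
    if column.lower() not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AuditLog (ExecutionId TEXT)")  # pre-existing schema
ensure_column(conn, "AuditLog", "ParentExecutionId", "TEXT NULL")
ensure_column(conn, "AuditLog", "ParentExecutionId", "TEXT NULL")  # safe to repeat
columns = [row[1] for row in conn.execute("PRAGMA table_info(AuditLog)")]
```

Running the helper on every startup keeps fresh and upgraded databases on the same schema, and existing rows simply read `NULL` for the new column.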

## Testing

- Repository: query-by-`ParentExecutionId`; `GetExecutionTreeAsync` (multi-level
  tree, stub-parent node, `MAXRECURSION` cap); migration smoke test.
- Emitter unit tests: each emitter stamps `ParentExecutionId`; the cached-call
  lifecycle rows from one routed run share it; `NotifyDeliver` echoes
  `Notifications.OriginParentExecutionId`.
- **Headline integration test:** an inbound API request that calls `Route.Call`
  → the routed site script does a sync `ExternalSystem.Call`, a cached call, and
  a `Notify.Send` → every resulting audit row (site + central) carries
  `ParentExecutionId` = the inbound request's `ExecutionId`, while the routed
  run keeps its own `ExecutionId`, distinct from the inbound request's.
- Central UI: bUnit (column renders, filter maps, query param parsed, tree
  assembled from the flat list) + Playwright (drill-in → tree → node click
  filters the grid).

## Out of scope / follow-ups

- **Tag cascade (case 2)** — deferred. If the attribute-write path ever carries
  the writing execution's id into the triggered script's `ScriptRuntimeContext`,
  the same `ParentExecutionId` column and tree view cover it with no schema
  change.
- CLI `audit tree` command — possible follow-up.
- Backfilling `ParentExecutionId` on historical audit rows — not done.

## Constraints

- Additive everywhere — nullable columns, additive proto/message fields, no
  backfill.
- Never touch `infra/*`; `alog.md` is the locked v1 spec — do not modify it.
- Site `auditlog.db` schema change MUST use the idempotent `ALTER`-if-missing
  path (commit `5198b11`).