dohertj2/scadalink-design

Joseph Doherty 6be26e2813 docs(auditlog): ParentExecutionId cross-execution correlation design

2026-05-21 16:53:25 -04:00

Audit Log — Cross-Execution Correlation (`ParentExecutionId`) Design

Date: 2026-05-21

Status: Validated, ready for implementation planning.

Problem

The Audit Log carries ExecutionId (Guid?) — a universal per-run correlation value stamped on every audit row, identifying the originating script execution or inbound API request. It is per-execution and flat: WHERE ExecutionId = X returns everything one run did, but nothing links an execution to the execution that spawned it. A call chain cannot be traced across the execution boundary.

Two cross-execution cases exist:

1. Inbound API request → routed site script. An inbound HTTP request runs an inbound method script (InboundScriptExecutor, central) which calls Route.Call(scriptName, params); that sends a RouteToCallRequest to a site instance, which runs scriptName as a fresh site-side execution. The inbound request and the routed site script get two unrelated ExecutionIds.
2. Tag cascade. Script A writes an attribute; the attribute change triggers script B as a separate execution. A and B likewise get unrelated ExecutionIds.

Decision

Add a dedicated, nullable ParentExecutionId (Guid?) column to the audit row. Every execution still gets its own fresh ExecutionId (unchanged). An execution spawned by another carries the spawner's ExecutionId in its ParentExecutionId; a top-level (tag/timer/inbound/un-bridged) execution leaves it null. Walking ParentExecutionId → ExecutionId recursively reconstructs the chain as a tree.

First cut — in scope: case 1 only, the inbound → routed-site-script bridge. It is the most concrete case and the spawn point is an explicit, threadable RPC (RouteToCallRequest).

Out of scope: case 2 (tag cascade) — the trigger is data-driven and decoupled; "which execution wrote the tag that triggered me" is not tracked anywhere today. Deferred as a follow-up. The ParentExecutionId model generalises to it with no schema change if that data is ever threaded.

Considered and rejected

Reuse ExecutionId — the routed script adopts the inbound request's ExecutionId instead of generating its own. Cheaper (no new column) but conflates two genuinely separate executions on two clusters, breaks the invariant "one ExecutionId = one ScriptRuntimeContext run", and does not generalise to tag cascade.
Point ParentExecutionId at the root (flatten the chain to two levels) instead of the immediate spawner — simpler queries but loses intermediate hops, needs a separately threaded root id, and does not generalise. Rejected in favour of the immediate-spawner tree.

Architecture & data flow

The id propagated is the inbound API request's ExecutionId. The chain:

1. Mint the inbound request id once, early. Today AuditWriteMiddleware mints a Guid.NewGuid() late, only for the inbound row's ExecutionId. Move the mint to the HTTP entry and stash it on HttpContext.Items, so both the middleware (which writes the InboundRequest row at request end) and InboundScriptExecutor (which needs it before the script runs) read the same id.
2. Carry it on the routing RPC. RouteHelper.Call builds a RouteToCallRequest; an additive ParentExecutionId field is set from the stashed inbound id. (RouteHelper's own per-op GUID is a separate concern and is left alone.)
3. Site side: thread it into the routed script's context. The site handler for RouteToCallRequest passes it to a new optional parentExecutionId ctor param on ScriptRuntimeContext (sibling to the existing executionId param). The routed script still generates its own fresh ExecutionId.
4. Every emitter stamps ParentExecutionId alongside ExecutionId.

Recursion (immediate-spawner tree). A routed script that itself calls Route.Call threads its own ExecutionId onward, so a grandchild's ParentExecutionId points at its immediate spawner, not the root. Walk the tree recursively to reconstruct any depth.

The inbound request's own row (InboundRequest / InboundAuthFailure) is top-level, so its ParentExecutionId is NULL. Only the rows produced by the routed site script carry the pointer.
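The propagation rule above can be sketched in a few lines. This is an illustrative model only, not the ScadaLink implementation: the `Execution` class and its `route_call` and `audit_row` methods are hypothetical stand-ins for ScriptRuntimeContext, Route.Call, and an audit emitter.

```python
from dataclasses import dataclass, field
from typing import Optional
import uuid


@dataclass(frozen=True)
class Execution:
    """Hypothetical stand-in for one inbound request or script run."""
    name: str
    # None for a top-level execution; the immediate spawner's id otherwise.
    parent_execution_id: Optional[uuid.UUID] = None
    # Every execution mints its own fresh id, whether spawned or not.
    execution_id: uuid.UUID = field(default_factory=uuid.uuid4)

    def route_call(self, script_name: str) -> "Execution":
        # The spawner threads its OWN ExecutionId onward, never the root's.
        return Execution(script_name, parent_execution_id=self.execution_id)

    def audit_row(self, kind: str) -> dict:
        # Every emitted row is stamped with both ids.
        return {
            "Kind": kind,
            "ExecutionId": self.execution_id,
            "ParentExecutionId": self.parent_execution_id,
        }


inbound = Execution("inbound-request")        # top-level: parent stays None
child = inbound.route_call("SiteScriptA")     # routed site script
grandchild = child.route_call("SiteScriptB")  # routed script that routes again
```

Here `grandchild.parent_execution_id` equals `child.execution_id`, not `inbound.execution_id`, which is the immediate-spawner behaviour the design chooses over pointing at the root.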

Schema changes (all additive, nullable — no backfill; pre-existing rows stay `NULL`)

| Where | Change |
| --- | --- |
| `ScadaLink.Commons` | `AuditEvent.ParentExecutionId` (`Guid?`); `RouteToCallRequest.ParentExecutionId` (`Guid?`); `Notification.OriginParentExecutionId` (`Guid?`); `NotificationSubmit.OriginParentExecutionId` (`Guid?`). |
| Central MS SQL `AuditLog` | `ParentExecutionId uniqueidentifier NULL` column + partition-aligned index `IX_AuditLog_ParentExecution (ParentExecutionId)` (mirror `AddAuditLogExecutionId`). EF migration: an additive nullable column is a metadata-only `ALTER`. |
| Central MS SQL `Notifications` | `OriginParentExecutionId uniqueidentifier NULL` column + EF migration (mirror `AddNotificationOriginExecutionId`). |
| Site SQLite `auditlog.db` `AuditLog` | `ParentExecutionId TEXT NULL`, added via the idempotent `ALTER`-if-missing upgrade path (per commit `5198b11`), never relying on `CREATE TABLE IF NOT EXISTS`. |
| gRPC `AuditEventDto` (`sitestream.proto`) | additive `parent_execution_id` field (next free number); `AuditEventDtoMapper` maps it both directions (Guid ↔ string; empty string ↔ null). |
| `ScriptRuntimeContext` | optional `parentExecutionId` ctor param + stored `_parentExecutionId` field. |

IX_AuditLog_ParentExecution is load-bearing: the tree view's downward recursive join seeks on it, and it backs the parentExecutionId filter.

SiteCalls needs no new column — the cached telemetry packet carries the audit half, which now has ParentExecutionId directly.
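The Guid ↔ string mapping that the schema table gives for the gRPC DTO (empty string ↔ null) is small enough to sketch. The function names `guid_to_wire` and `guid_from_wire` are hypothetical; the real AuditEventDtoMapper is not shown in this document, so this only illustrates the stated convention.

```python
from typing import Optional
import uuid


def guid_to_wire(value: Optional[uuid.UUID]) -> str:
    """Null becomes the empty string; a Guid becomes its canonical text form."""
    return "" if value is None else str(value)


def guid_from_wire(value: str) -> Optional[uuid.UUID]:
    """The empty string becomes null; anything else is parsed as a Guid."""
    return uuid.UUID(value) if value else None
```

One consequence worth noting, assuming the field is a plain string: a sender that predates the new field leaves it empty, which the mapper reads back as null, matching the "pre-existing rows stay NULL" rule.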

Emitter coverage — full (mirrors the `ExecutionId` rollout)

Every audit row a routed-script run produces carries ParentExecutionId, so WHERE ParentExecutionId = X returns the complete trust-boundary footprint of the run (or runs) that execution X spawned directly.

| Emitter | `ParentExecutionId` source |
| --- | --- |
| Sync `ApiCall`, sync `DbWrite` | `ScriptRuntimeContext._parentExecutionId` (in scope) |
| Cached call script-side rows (`CachedSubmit`, immediate `Attempted`/`CachedResolve`) | `ScriptRuntimeContext._parentExecutionId` |
| Cached call S&F retry-loop rows (`CachedCallLifecycleBridge`) | threaded through the S&F buffered message → `CachedCallAttemptContext` → the bridge, as a sibling to the `ExecutionId` already threaded there |
| `NotifySend` (site, script-side) | `ScriptRuntimeContext._parentExecutionId` |
| `NotifyDeliver` (central dispatch) | `Notifications.OriginParentExecutionId`: rides on `NotificationSubmit`, is persisted on the `Notifications` row, and the dispatcher stamps it on every `NotifyDeliver` row |
| Inbound `InboundRequest` / `InboundAuthFailure` | `NULL` (inbound is top-level) |

The threading reuses the carry points the ExecutionId rollout already opened (S&F buffer, NotificationSubmit → Notifications); ParentExecutionId is a sibling field at each, not a new boundary.

Recursive chain/tree view

A new repository method GetExecutionTreeAsync(Guid executionId):

1. Walk up to the root: iterative single-parent follow (SELECT TOP 1 ParentExecutionId FROM AuditLog WHERE ExecutionId = current AND ParentExecutionId IS NOT NULL) until no parent resolves. Cheap, because each execution has at most one parent.
2. Walk down from the root: recursive CTE joining ParentExecutionId = ancestor.ExecutionId, seeking on IX_AuditLog_ParentExecution. MAXRECURSION is capped (e.g. 32): chains are shallow, and the cap guards against corrupt or pathological data.
3. Returns a flat list of execution nodes: ExecutionId, ParentExecutionId, row count, channels/statuses present, SourceSiteId/SourceInstanceId, first/last OccurredAtUtc. The UI assembles the tree from the flat list.
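The two walks can be tried end to end against an in-memory database. The sketch below is an approximation under stated assumptions: it uses SQLite through Python's `sqlite3` rather than central MS SQL, so `LIMIT 1` replaces `TOP 1` and an explicit `Depth` column replaces the MAXRECURSION hint. The names `make_db`, `find_root`, and `get_execution_tree` are hypothetical, and only the two correlation columns and a row count are modelled.

```python
import sqlite3

MAX_DEPTH = 32  # stand-in for the MAXRECURSION cap


def make_db(rows):
    """In-memory stand-in for AuditLog with only the two correlation columns."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE AuditLog (ExecutionId TEXT, ParentExecutionId TEXT)")
    db.execute(
        "CREATE INDEX IX_AuditLog_ParentExecution ON AuditLog (ParentExecutionId)"
    )
    db.executemany("INSERT INTO AuditLog VALUES (?, ?)", rows)
    return db


def find_root(db, execution_id):
    """Walk up: follow the single parent pointer until no parent resolves."""
    current = execution_id
    for _ in range(MAX_DEPTH):  # bounded, so corrupt cyclic data cannot loop forever
        row = db.execute(
            "SELECT ParentExecutionId FROM AuditLog "
            "WHERE ExecutionId = ? AND ParentExecutionId IS NOT NULL LIMIT 1",
            (current,),
        ).fetchone()
        if row is None:
            return current
        current = row[0]
    return current


def get_execution_tree(db, execution_id):
    """Walk down from the root; returns flat (ExecutionId, ParentExecutionId, rows)."""
    root = find_root(db, execution_id)
    return db.execute(
        """
        WITH RECURSIVE tree(ExecutionId, ParentExecutionId, Depth) AS (
            SELECT ?, NULL, 0
            UNION  -- UNION (not UNION ALL) collapses the many rows of one execution
            SELECT a.ExecutionId, a.ParentExecutionId, t.Depth + 1
            FROM AuditLog a JOIN tree t ON a.ParentExecutionId = t.ExecutionId
            WHERE t.Depth < ?
        )
        SELECT t.ExecutionId, t.ParentExecutionId,
               (SELECT COUNT(*) FROM AuditLog a
                 WHERE a.ExecutionId = t.ExecutionId) AS AuditRowCount
        FROM tree t
        ORDER BY t.Depth, t.ExecutionId
        """,
        (root, MAX_DEPTH),
    ).fetchall()
```

A root that has no rows of its own still appears in the result with a row count of 0. One behavioural difference from the real target: SQL Server fails the statement when MAXRECURSION is exceeded, whereas the depth column here silently stops descending.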

UI. New route /audit/execution-tree?executionId=<guid>, reached via a "View execution chain" drill-in from any audit row and from the ExecutionId column. Renders an expandable custom Blazor tree (no component frameworks); each node shows the execution summary; clicking a node filters the Audit Log grid to ?executionId=<node>. The tree is always rooted at the topmost ancestor, so the reader sees the full chain regardless of which row they entered from.
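Assembling the tree from the flat node list is a single pass over a dictionary. The sketch below is written in Python for illustration, although the real UI is Blazor. The `assemble_tree` name and the dictionary shape are hypothetical, and each node is reduced to `(execution_id, parent_execution_id, row_count)`.

```python
def assemble_tree(nodes):
    """Build nested nodes from a flat list of (id, parent_id, row_count) tuples."""
    by_id = {
        eid: {"executionId": eid, "rowCount": count, "children": []}
        for eid, _parent, count in nodes
    }
    roots = []
    for eid, parent, _count in nodes:
        # Attach to the parent when it is present in the list; otherwise treat
        # the node as a root (top-level, or its parent did not resolve).
        if parent is not None and parent != eid and parent in by_id:
            by_id[parent]["children"].append(by_id[eid])
        else:
            roots.append(by_id[eid])
    return roots
```

A node whose `rowCount` is 0 would be the natural candidate for rendering as a stub ("execution with no audited actions").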

Plus the cheaper navigation affordances: ParentExecutionId grid column (short form / monospace), a ParentExecutionId paste-filter, a ?parentExecutionId= query param, and a "View parent execution" drill-in (links ?executionId=<parentId>).

Edge cases

Parent with no rows of its own. An execution that performed no trust-boundary action emits no audit rows, yet a child still references it via ParentExecutionId. The upward walk resolves the GUID but finds no rows for that node → render it as a stub node ("execution with no audited actions").
Purged parent. A parent execution older than the 365-day central retention has no rows → the upward walk stops there; the chain renders as far as it resolves.
Cycle guard. The ParentExecutionId graph is acyclic by construction (each execution is minted fresh and its parent always pre-exists), but MAXRECURSION bounds the downward CTE against corrupt data.

CLI / ManagementService

CLI: scadalink audit query --parent-execution-id <guid>; AuditLogQueryFilter gains a ParentExecutionId single-value filter dimension (mirror ExecutionId).
ManagementService /api/audit/query + export endpoint and the CentralUI export endpoints parse a parentExecutionId query param (lax parse: an unparseable value is dropped rather than rejected).
The tree view's data path: GetExecutionTreeAsync is exposed however the existing Audit Log page sources its grid data — mirror that path; add a ManagementService endpoint only if the page goes through it.
No CLI audit tree command in the first cut — the tree is a UI forensic affordance; the --parent-execution-id filter covers scripted use. Noted as a possible follow-up.
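The lax-parse rule for the parentExecutionId query param can be illustrated as follows. This is a sketch of the stated behaviour only: `parse_parent_execution_id` is a hypothetical helper, and the real endpoints are .NET code that this document does not show.

```python
from typing import Optional
from urllib.parse import parse_qs
import uuid


def parse_parent_execution_id(query_string: str) -> Optional[uuid.UUID]:
    """Return the filter value, or None when it is absent or unparseable."""
    values = parse_qs(query_string).get("parentExecutionId", [])
    if not values:
        return None
    try:
        return uuid.UUID(values[0])
    except ValueError:
        # Lax parse: a malformed value is dropped, not turned into an error.
        return None
```

Under this reading, a request carrying a malformed value behaves as if the param had not been supplied.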

Compatibility

Additive nullable columns; additive proto field; additive message-contract fields — all version-compatible. No backfill; historical rows keep ParentExecutionId = NULL.
ExecutionId and CorrelationId semantics unchanged — every existing drill-in keeps working.

Failure handling

Audit-write failure NEVER aborts the user-facing action — unchanged invariant; ParentExecutionId is just another field on the row.
Site auditlog.db schema change MUST use the idempotent ALTER-if-missing path (commit 5198b11); do not repeat the original CREATE TABLE IF NOT EXISTS mistake.
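An ALTER-if-missing upgrade step might look like the sketch below, shown with Python's `sqlite3` for illustration. The `ensure_column` helper is hypothetical and is not the code from commit 5198b11; it only demonstrates the idea of checking `PRAGMA table_info` before issuing `ALTER TABLE ... ADD COLUMN`.

```python
import sqlite3


def ensure_column(db, table, column, ddl_type):
    """Add the column only if it is missing; return True when it was added.

    table, column and ddl_type are interpolated into SQL, so they must be
    trusted constants, never user input.
    """
    existing = {row[1].lower() for row in db.execute(f"PRAGMA table_info({table})")}
    if column.lower() in existing:
        return False  # already upgraded: running the step again is a no-op
    db.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")
    return True
```

The reason `CREATE TABLE IF NOT EXISTS` is not enough: on a site whose auditlog.db already has the table, that statement does nothing, so a column added only to the CREATE text never reaches existing databases.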

Testing

Repository: query-by-ParentExecutionId; GetExecutionTreeAsync (multi-level tree, stub-parent node, MAXRECURSION cap); migration smoke test.
Emitter unit tests: each emitter stamps ParentExecutionId; the cached-call lifecycle rows from one routed run share it; NotifyDeliver echoes Notifications.OriginParentExecutionId.
Headline integration test: an inbound API request that calls Route.Call → the routed site script does a sync ExternalSystem.Call, a cached call, and a Notify.Send → every resulting audit row (site + central) carries ParentExecutionId = the inbound request's ExecutionId, while each has its own distinct ExecutionId.
Central UI: bUnit (column renders, filter maps, query param parsed, tree assembled from the flat list) + Playwright (drill-in → tree → node click filters the grid).

Out of scope / follow-ups

Tag cascade (case 2) — deferred. If the attribute-write path ever carries the writing execution's id into the triggered script's ScriptRuntimeContext, the same ParentExecutionId column and tree view cover it with no schema change.
CLI audit tree command — possible follow-up.
Backfilling ParentExecutionId on historical audit rows — not done.

Constraints

Additive everywhere — nullable columns, additive proto/message fields, no backfill.
Never touch infra/*; alog.md is the locked v1 spec — do not modify it.
Site auditlog.db schema change MUST use the idempotent ALTER-if-missing path (commit 5198b11).
