test+docs(m5): M5.7 — de-date 2 EndToEnd purge tests (closes #52); document T3-T8 in Component-AuditLog/-CLI/README/CLAUDE

Tests: anchor SeedOccurredAt() to a fixed thresholdAnchor (2026-01-20) and compute RetentionDays dynamically (UtcNow - anchor + 1d) so the threshold always sits near Jan 20 2026, between the Jan-15 "old" seed (purged) and Apr-15/Jun-15 "kept" seeds. Seed dates stay within the explicit pf_AuditLog_Month boundary range (Jan 2026 – Dec 2027); relative-from-now offsets landed before 2026-01-01 (the catch-all partition, invisible to GetPartitionBoundariesOlderThanAsync). Both tests confirmed passing; all 284 AuditLog tests green.

Docs:
- Component-AuditLog.md: per-channel retention overrides (T3, PerChannelRetentionDays + bounded DELETE + AuditLogPurge:ChannelPurgeBatchSize); ParentExecutionId tag-cascade now spans alarm-triggered + nested CallScript/CallShared + inbound→routed (T4, "no further spawn points deferred"); per-node stuck KPIs for Notification Outbox + Site Call Audit (T6); T7 structured response-capture increments (request headers in Extra.requestHeaders, AuditInboundCeilingHits counter, per-method SkipBodyCapture); T8 CLI audit tree; T1 hash-chain + T2 Parquet explicitly marked deferred to v1.x.
- Component-CLI.md + README.md: document audit tree --execution-id <guid> and audit backfill-source-node --sentinel/--before/--batch with exact options verified against AuditCommands.cs; update Interactions to list new endpoints.
- CLAUDE.md: update audit-log design-decision bullets for T3 per-channel retention, T4 tag-cascade complete, T6 per-node KPIs, T7 inbound capture increments, T8 tree command; clarify T1/T2 remain deferred to v1.x.
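
Sketch of the RetentionDays arithmetic, for illustration only. It is not the test code: the names RetentionSketch, RetentionDaysFor, and ThresholdFor are hypothetical, and it assumes the purge actor derives its threshold as UtcNow minus RetentionDays.

```csharp
using System;

// Hypothetical helper; only the arithmetic mirrors the de-dated tests.
public static class RetentionSketch
{
    // Fixed anchor taken from the commit (2026-01-20 UTC).
    public static readonly DateTime ThresholdAnchor =
        new DateTime(2026, 1, 20, 0, 0, 0, DateTimeKind.Utc);

    // Whole days elapsed since the anchor, plus one day.
    public static int RetentionDaysFor(DateTime nowUtc) =>
        (int)(nowUtc - ThresholdAnchor).TotalDays + 1;

    // ASSUMPTION: the purge threshold is "now minus RetentionDays".
    public static DateTime ThresholdFor(DateTime nowUtc) =>
        nowUtc.AddDays(-RetentionDaysFor(nowUtc));
}
```

Under that assumption the threshold falls within the 24 hours before the anchor whenever the test runs (provided the run date is after the anchor), so it stays after the Jan-15 seed and before the Apr-15 seed.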
2026-06-16 22:26:09 -04:00
parent 1b63d6751f
commit 639e331db1
6 changed files with 320 additions and 127 deletions
@@ -163,14 +163,16 @@ Related repos cloned as sibling directories under `~/Desktop/` — referenced fo
 - Scope = script trust boundary: outbound API (sync + cached), outbound DB (sync + cached), notifications, inbound API. Framework/internal traffic is explicitly excluded.
 - One row per lifecycle event; cached calls produce 4+ rows per operation (`Submitted`, `Forwarded`, `Attempted`, `Delivered`/`Parked`/`Discarded`).
 - `ExecutionId` (`uniqueidentifier NULL`) is the universal per-run correlation value — every audit row emitted by one script execution / inbound request shares it; `CorrelationId` remains the per-operation lifecycle id (NULL for sync one-shots).
-- `ParentExecutionId` (`uniqueidentifier NULL`) is the cross-execution spawn pointer — every row of a spawned run carries the spawner's `ExecutionId`; first cut bridges the inbound API → routed-site-script case (the routed run records the inbound request's `ExecutionId`; the inbound row stays top-level / NULL); `IX_AuditLog_ParentExecution` backs the filter + the recursive execution-tree walk; tag cascade deferred.
+- `ParentExecutionId` (`uniqueidentifier NULL`) is the cross-execution spawn pointer — every row of a spawned run carries the spawner's `ExecutionId`; bridges inbound API → routed-site-script, alarm-triggered on-trigger scripts, and nested `CallScript`/`CallShared` invocations; `IX_AuditLog_ParentExecution` backs the filter + the recursive execution-tree walk. Tag-cascade coverage is complete as of M5.4 (T4) — no further spawn points are deferred.
 - Site SQLite hot-path first, then gRPC telemetry to central; ingest is idempotent on `EventId`; periodic reconciliation pull as fallback when telemetry is lost.
 - Cached operations: site emits a single additively-extended `CachedCallTelemetry` packet carrying both audit events and operational state; central writes `AuditLog` + `SiteCalls` in one transaction.
-- Payload cap 8 KB by default / 64 KB on error rows; auth headers redacted by default; SQL parameter values captured by default; per-target redaction opt-in.
+- Payload cap 8 KB by default / 64 KB on error rows; auth headers redacted by default; SQL parameter values captured by default; per-target redaction opt-in. Inbound API: full verbatim capture up to `InboundMaxBytes` (default 1 MiB); request headers stored in `Extra.requestHeaders` (post-redaction); per-method `SkipBodyCapture` flag suppresses bodies while still recording headers + metadata; `AuditInboundCeilingHits` counter surfaced on health snapshot. (M5.3 T7)
 - Audit-write failure NEVER aborts the user-facing action — audit is best-effort, the action's own success/failure path is authoritative.
-- 365-day central retention with monthly partition-switch purge; 7-day site SQLite retention with a hard `ForwardState` invariant (no row purged until forwarded or reconciled).
-- Append-only enforced via DB roles (writer role has INSERT only, no UPDATE/DELETE); hash-chain tamper evidence and Parquet archival are deferred to v1.x.
+- 365-day central retention with monthly partition-switch purge; per-channel retention overrides (`AuditLog:PerChannelRetentionDays`) expire rows earlier than the global window via a bounded, batched row DELETE on the purge actor's maintenance path — values must be shorter than the global window (M5.5 T3); 7-day site SQLite retention with a hard `ForwardState` invariant (no row purged until forwarded or reconciled).
+- Append-only enforced via DB roles (writer role has INSERT only, no UPDATE/DELETE); hash-chain tamper evidence (T1) and Parquet archival (T2) are deferred to v1.x — not shipped in M5.
 - Node-of-origin is captured alongside site-of-origin: `SourceNode` (`varchar(64)` NULL) on `AuditLog`, `Notifications`, and `SiteCalls` — `node-a`/`node-b` for site rows (qualified by `SourceSiteId`/`SourceSite`), `central-a`/`central-b` for central direct-write rows. Stamped at the writing node, carried verbatim through telemetry + reconciliation, and indexed via `IX_AuditLog_Node_Occurred (SourceNode, OccurredAtUtc)` on `AuditLog`.
+- Per-node stuck KPIs (M5.3 T6): Notification Outbox and Site Call Audit expose `PerNodeNotificationKpiRequest`/`PerNodeSiteCallKpiRequest` messages that group stuck/parked/delivered counts by `SourceNode`, surfacing per-node breakdowns on the Health dashboard.
+- `audit tree --execution-id <guid>` CLI command (M5.3 T8) + `GET /api/audit/tree` endpoint — resolves any node to its chain root and renders the full execution tree; backed by `IAuditLogRepository.GetExecutionTreeAsync`.
 - Central UI: new top-level **Audit** nav group + Audit Log page, with drill-ins from Notifications, Site Calls, External Systems, Inbound API Keys, Sites, and Instances.

 ### Security & Auth
@@ -158,16 +158,32 @@ is per-run and flat — `WHERE ExecutionId = X` returns everything one run did,
 nothing links a run to the run that *spawned* it. `ParentExecutionId` carries the
 spawning execution's `ExecutionId`: a spawned run still gets its own fresh
 `ExecutionId`, and every audit row it emits also carries the spawner's id in
-`ParentExecutionId`. The first cut bridges the **inbound API → routed-site-script**
-case: an inbound request runs a method script that calls `Route.Call`, routing to
-a site instance; the routed site script records the inbound request's
-`ExecutionId` as its `ParentExecutionId`, while the inbound `InboundRequest` row
-itself is top-level (`ParentExecutionId` NULL). The pointer always references the
-*immediate* spawner, so a routed run that itself routes onward threads its own
-`ExecutionId` — walking `ParentExecutionId → ExecutionId` recursively
-reconstructs the call chain as a tree of arbitrary depth. The tag-cascade case
-(an attribute write triggering another script) is **deferred** — the model
-generalises to it with no schema change once that spawn point is threaded.
+`ParentExecutionId`. The pointer always references the *immediate* spawner, so a
+run that itself spawns further runs threads its own `ExecutionId` — walking
+`ParentExecutionId → ExecutionId` recursively reconstructs the call chain as a
+tree of arbitrary depth.
+
+**Tag-cascade coverage (M5.4 T4):** `ParentExecutionId` threading now spans all
+known spawn points:
+
+- **Inbound API → routed site script** — an inbound request runs a method script
+  that calls `Route.Call`; the routed site script records the inbound request's
+  `ExecutionId` as its `ParentExecutionId`, while the inbound `InboundRequest` row
+  is top-level (`ParentExecutionId` NULL).
+- **Alarm-triggered on-trigger script** — when an alarm fires and its on-trigger
+  script runs (via `AlarmActor → AlarmExecutionActor`), the alarm context's
+  `ExecutionId` is carried as the run's `ParentExecutionId`. The alarm subsystem
+  currently has no Guid-typed firing id, so on-trigger runs are roots (NULL) in
+  practice; the wiring is in place for a future alarm `ExecutionId`.
+- **Nested `CallScript` / `CallShared` invocations** — when a script calls
+  `Instance.CallScript(...)` or a shared script via `CallShared`, the calling
+  execution's `ExecutionId` threads into the spawned run as its
+  `ParentExecutionId`, making deeply nested call chains visible as a tree.
+
+Attribute-write-triggered cascades (one tag change triggering another script via a
+tag subscription) are also wired: trigger-driven runs carry `ParentExecutionId =
+NULL` (top-level roots), and any nested `CallScript`/`CallShared` they perform
+chains as above. The schema is unchanged — no further tag-cascade work is deferred.

 ## The Site-Local `AuditLog` (SQLite)

@@ -268,7 +284,34 @@ operational `SiteCalls` shape for the dispatcher and UI.

 - **Default cap** — 8 KB for each of `RequestSummary` and `ResponseSummary`;
  raised to 64 KB on any error row (`Status IN ('Failed', 'Parked', 'Discarded')`).
-- **Inbound API exception.** For `Channel = ApiInbound`, `RequestSummary` and `ResponseSummary` are captured in full up to a per-body hard ceiling of 1 MiB (configurable via `AuditLog:InboundMaxBytes`; default 1 048 576 bytes; min 8 192; max 16 777 216). The 8 KiB / 64 KiB default/error caps that apply to other channels do not apply here. `PayloadTruncated = 1` is set only when the inbound ceiling is hit — verbatim capture is the normal case. The ceiling applies independently to each body. Header redaction and per-target body redactors still run before persistence.
+- **Inbound API exception.** For `Channel = ApiInbound`, `RequestSummary` and
+  `ResponseSummary` are captured in full up to a per-body hard ceiling of 1 MiB
+  (configurable via `AuditLog:InboundMaxBytes`; default 1 048 576 bytes; min
+  8 192; max 16 777 216). The 8 KiB / 64 KiB default/error caps that apply to
+  other channels do not apply here. `PayloadTruncated = 1` is set only when the
+  inbound ceiling is hit — verbatim capture is the normal case. The ceiling
+  applies independently to each body. Header redaction and per-target body
+  redactors still run before persistence.
+- **Inbound ceiling hits (M5.3 T7).** Each time the `InboundMaxBytes` ceiling
+  truncates a body, `IAuditInboundCeilingHitsCounter.Increment()` is called.
+  This counter is surfaced as `AuditInboundCeilingHits` on the central health
+  snapshot (alongside `CentralAuditWriteFailures` / `AuditRedactionFailure`) so
+  operators can detect persistently oversized payloads and raise the ceiling or
+  add per-target body redactors.
+- **Request headers in `Extra` (M5.3 T7).** For `Channel = ApiInbound`, the
+  `AuditWriteMiddleware` captures the inbound HTTP request headers (post-redaction
+  — `Authorization`, `X-API-Key`, `Cookie`, `Set-Cookie`, and the configured
+  `HeaderRedactList` are scrubbed before serialization) into the `Extra` JSON
+  column under the key `"requestHeaders"`. This makes the full header envelope
+  visible in the Audit Log UI's detail drawer and the CLI's `audit query` output
+  without widening the schema.
+- **Per-method `SkipBodyCapture` (M5.3 T7).** `PerTargetOverrides` now includes
+  a `SkipBodyCapture: true` flag. When set for an inbound API method, the audit
+  row is always emitted (headers, status, duration, actor, etc. are recorded) but
+  `RequestSummary` and `ResponseSummary` are left null. Use this for methods whose
+  payloads are structurally large or contain secrets not covered by body redactors.
+  Headers are still captured into `Extra.requestHeaders` (after redaction) even
+  when `SkipBodyCapture` is true.
 - **Truncation** — UTF-8 byte-safe; `PayloadTruncated = 1` when applied. Full
  bodies are never stored.
 - **HTTP headers** — `Authorization`, `Cookie`, `Set-Cookie`, `X-API-Key`, and
@@ -311,16 +354,33 @@ MS SQL for direct-write events). Unredacted secrets never persist.
 ## Retention & Purge

 - **Central:** 365-day default based on `OccurredAtUtc`, configurable via
-  `AuditLog:RetentionDays` (min 7, max 3650). Single global retention in v1 —
-  no per-channel overrides.
+  `AuditLog:RetentionDays` (min 30, max 3650).
 - **Partitioning:** monthly partitions on `OccurredAtUtc` from day one
-  (`pf_AuditLog_Month` / `ps_AuditLog_Month`). Purge is a partition switch;
-  there are no row-level deletes at central.
+  (`pf_AuditLog_Month` / `ps_AuditLog_Month`). The global partition switch is
+  channel-blind; it drops a whole month once every row in it is older than the
+  global window. There are no row-level deletes at central for the global purge.
 - **Purge actor:** `AuditLogPurgeActor` singleton on the active central node
  runs daily, switches out any partition whose latest `OccurredAtUtc` is older
-  than the retention window, and emits an `AuditLog:Purged` event (partition
-  range, rowcount, duration). A partition-maintenance step rolls forward each
-  month, creating the next month's partition ahead of time.
+  than the retention window, then applies any per-channel overrides (see below),
+  and emits an `AuditLog:Purged` event (partition range, rowcount, duration) per
+  switched partition. A partition-maintenance step rolls forward each month,
+  creating the next month's partition ahead of time.
+- **Per-channel retention overrides (M5.5 T3):** `AuditLog:PerChannelRetentionDays`
+  is a dictionary keyed by canonical channel name (`ApiOutbound`, `DbOutbound`,
+  `Notification`, `ApiInbound`) whose value is a retention window in days that
+  MUST be strictly shorter than the global `RetentionDays`. After the daily
+  partition switch-out, the purge actor runs a bounded, batched row DELETE
+  (`PurgeChannelOlderThanAsync`) for each channel whose override is shorter than
+  the global window — expiring rows of that channel earlier than the global
+  partition switch would. Overrides equal to or longer than the global window are
+  silently skipped (the global switch already covers them). The DELETE runs under
+  `scadabridge_audit_purger` (the maintenance role); the append-only writer role
+  is unaffected. Batch size is configurable via
+  `AuditLogPurge:ChannelPurgeBatchSize` (default 5000). Each channel override
+  runs in its own try/catch, mirroring the per-boundary error-isolation of the
+  partition switch-out loop. Values are validated to be in
+  `[30, RetentionDays]`; keys that are not a recognized `AuditChannel` enum name
+  are rejected at startup.
 - **Sites:** daily site job; default 7-day retention (configurable, min 1,
  max 90). Respects the hard `ForwardState` invariant — `Pending` rows are
  never purged on age alone.
@@ -340,10 +400,13 @@ MS SQL for direct-write events). Unredacted secrets never persist.
  **AuditExport** permission.
 - **Payload redaction at write.** See Payload Capture Policy. Unredacted
  secrets never persist; the safety net over-redacts on misconfiguration.
-- **Hash-chain tamper evidence — deferred to v1.x.** A future `RowHash` column,
-  computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will be
-  verifiable offline via `scadabridge audit verify-chain --month YYYY-MM`. Off by
-  default in v1.
+- **Hash-chain tamper evidence (T1) — deferred to v1.x.** A future `RowHash`
+  column, computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will
+  be verifiable offline via `scadabridge audit verify-chain --month YYYY-MM`. The
+  `verify-chain` CLI command is a no-op placeholder today. Off by default in v1.
+- **Parquet archival (T2) — deferred to v1.x.** Long-term cold storage of purged
+  monthly partitions as Parquet files (suitable for offline analytics) will be
+  added in a future milestone. T1 and T2 are not shipped as part of M5.
 - **Site SQLite security.** File permissions: read/write by the ScadaBridge
  service account only. Not backed up off-machine — site SQLite is a buffer,
  not a record.
@@ -355,11 +418,22 @@ Point-in-time, computed from the central `AuditLog` table; global and per-site.
 - **Audit volume** — events/min landing in the central `AuditLog`; global plus per-site sparkline.
 - **Audit error rate** — % of central `AuditLog` rows with `Status IN ('Failed', 'Parked', 'Discarded')` over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, permanent failures, parked deliveries) — NOT audit-writer health, which surfaces separately via `CentralAuditWriteFailures` and `AuditRedactionFailure`.
 - **Audit backlog** — sum of `Pending` site rows across sites; click drills into a per-site breakdown.
+- **`AuditInboundCeilingHits`** (M5.3 T7): rolling count of inbound API request/response bodies truncated by the `InboundMaxBytes` ceiling; surfaced on the central health snapshot alongside `CentralAuditWriteFailures`.
+
+**Per-node stuck KPIs (M5.3 T6):** Both [Notification Outbox](Component-NotificationOutbox.md)
+and [Site Call Audit](Component-SiteCallAudit.md) now expose a
+`PerNodeNotificationKpiRequest` / `PerNodeSiteCallKpiRequest` message pair that
+groups the existing stuck, parked, and delivered-last-interval counts by the
+`SourceNode` that emitted the original row. This surfaces per-node breakdowns on
+the Health dashboard tiles and the Notification Outbox / Site Calls pages,
+making it possible to identify a single misbehaving node (e.g., `site-a:node-b`)
+as the source of a spike rather than a site-wide problem. The existing global and
+per-site KPI shapes are unchanged; the per-node slice is additive.

 [Notification Outbox](Component-NotificationOutbox.md) and
-[Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected — they remain
-sourced from `Notifications` and `SiteCalls` respectively. Audit Log KPIs
-describe the audit table itself.
+[Site Call Audit](Component-SiteCallAudit.md) KPIs are otherwise unchanged: they
+remain sourced from `Notifications` and `SiteCalls` respectively. Audit Log KPIs
+describe the audit table itself.

 ## Configuration

@@ -370,21 +444,40 @@ component (Options pattern):
 "AuditLog": {
  "DefaultCapBytes": 8192,
  "ErrorCapBytes": 65536,
+  "InboundMaxBytes": 1048576,
  "HeaderRedactList": [ "Authorization", "Cookie", "Set-Cookie", "X-API-Key" ],
  "GlobalBodyRedactors": [
    { "Pattern": "\"password\"\\s*:\\s*\"[^\"]+\"", "Replacement": "\"password\":\"<redacted>\"" }
  ],
  "PerTargetOverrides": {
    "Weather/GetForecast": { "CapBytes": 4096 },
-    "PlantDB":             { "RedactSqlParamsMatching": "@apikey|@token" }
+    "PlantDB":             { "RedactSqlParamsMatching": "@apikey|@token" },
+    "HighVolumeMethod":    { "SkipBodyCapture": true }
  },
-  "RetentionDays": 365
+  "RetentionDays": 365,
+  "PerChannelRetentionDays": {
+    "ApiOutbound":  90,
+    "Notification": 180
+  }
 }
 ```

 `PerTargetOverrides` keys bind by External System / Inbound Method /
-Notification List / Database Connection name. `RetentionDays` is a single
-global value in v1; per-channel overrides are deferred to v1.x.
+Notification List / Database Connection name. `SkipBodyCapture: true` omits
+`RequestSummary`/`ResponseSummary` for that method while still capturing headers
+into `Extra.requestHeaders` and emitting the full audit row. `RetentionDays` is
+the global window; `PerChannelRetentionDays` specifies per-channel windows that
+are strictly shorter — any channel whose override equals or exceeds the global
+value is silently ignored (the global partition switch-out already governs it).
+
+The `AuditLogPurge` section controls the purge actor's cadence and the channel-purge batch size:
+
+```jsonc
+"AuditLogPurge": {
+  "IntervalHours": 24,
+  "ChannelPurgeBatchSize": 5000
+}
+```

 ## Ops Notes — Historical Null Columns

@@ -480,6 +573,8 @@ orphaned entries) and in the CLI's `audit tree` output.
  tiles (Volume, Error rate, Backlog) plus new health metrics:
  `SiteAuditBacklog`, `SiteAuditWriteFailures`, `SiteAuditTelemetryStalled`,
  `CentralAuditWriteFailures`, `AuditRedactionFailure`.
-- **[CLI (#19)](Component-CLI.md)** — new `scadabridge audit query`,
-  `scadabridge audit export`, and `scadabridge audit verify-chain` commands; same
-  permission requirements as the UI.
+- **[CLI (#19)](Component-CLI.md)** — `scadabridge audit query`,
+  `scadabridge audit export`, `scadabridge audit tree --execution-id <guid>`,
+  `scadabridge audit backfill-source-node --sentinel <s> --before <date>`, and
+  `scadabridge audit verify-chain` (no-op placeholder for the deferred hash-chain
+  feature); same permission requirements as the UI.
@@ -228,14 +228,17 @@ The new centralized Audit Log component (#23) is exposed via the `scadabridge au
 The `scadabridge audit` group targets the centralized Audit Log component (#23) and
 exposes the UI-equivalent operational audit surface. Permissions follow the same
 read-vs-export split the Central UI uses (see Component-AuditLog.md, Security &
-Tamper-Evidence, and Security & Auth #10): `audit query` and `audit verify-chain`
-require the `OperationalAudit` permission; `audit export` additionally requires
-`AuditExport`. The server enforces permission checks and returns HTTP 403 (CLI
-exit code 2) on denial.
+Tamper-Evidence, and Security & Auth #10): `audit query`, `audit tree`, and
+`audit verify-chain` require the `OperationalAudit` permission; `audit export`
+additionally requires `AuditExport`; `audit backfill-source-node` requires the
+`Admin` role (maintenance path only). The server enforces permission checks and
+returns HTTP 403 (CLI exit code 2) on denial.

 ```
 scadabridge audit query [--since <t>] [--until <t>] [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--target <t>] [--actor <a>] [--correlation-id <id>] [--execution-id <id>] [--parent-execution-id <id>] [--errors-only] [--page-size <n>] [--all]
 scadabridge audit export --since <t> --until <t> --format csv|jsonl|parquet --output <path> [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--target <t>] [--actor <a>]
+scadabridge audit tree --execution-id <guid> [--format table|json]
+scadabridge audit backfill-source-node --before <ISO-8601-UTC> [--sentinel <value>] [--batch <n>]
 scadabridge audit verify-chain --month <YYYY-MM>
 ```

@@ -247,6 +250,18 @@ scadabridge audit verify-chain --month <YYYY-MM>
  requested format (`csv`, `jsonl`, `parquet`) written to `--output`. The server
  streams rows rather than materializing them in memory; the CLI writes bytes
  through to disk. Supports the same scoping filters as `audit query`.
+- `audit tree --execution-id <guid>` (M5.3 T8) — renders the full execution-chain
+  tree for the given `ExecutionId`. The server resolves the root from any node in
+  the chain (walks `ParentExecutionId` to find the root, then traverses downward)
+  and returns all reachable executions with their summary row counts and first/last
+  occurred timestamps. Output format: `json` (default — structured tree suitable
+  for scripting) or `table` (human-readable indented tree). Requires
+  `OperationalAudit` permission. Backed by `GET /api/audit/tree?executionId=<guid>`.
+- `audit backfill-source-node --before <ISO-8601-UTC>` (M5.6 T5) — sets
+  `SourceNode` to a sentinel value (`--sentinel`, default `"unknown"`) on pre-feature
+  rows where `SourceNode IS NULL` and `OccurredAtUtc < --before`, in batches
+  (`--batch`, default 5000). Admin-only maintenance command. Idempotent.
+  Backed by `POST /api/audit/backfill-source-node`.
 - `audit verify-chain` — hash-chain verification for the named month.
  **No-op in v1**: the command is defined so the command tree is stable, but
  verification only becomes meaningful once the hash-chain ships (see
@@ -366,7 +381,7 @@ Configuration is resolved in the following priority order (highest wins):
 - **System.CommandLine**: Command-line argument parsing.
 - **Microsoft.AspNetCore.SignalR.Client**: SignalR client for the `debug stream` command's WebSocket connection.
 - **Management Service (#18)**: The CLI hits the central cluster via the existing HTTP Management API (`POST /management`), which dispatches to the ManagementActor. The `scadabridge audit` command group rides a parallel REST surface on the same Host (`GET /api/audit/query` and `GET /api/audit/export`), sharing HTTP Basic Auth with `/management` but bypassing the actor for read-only, keyset-paged / streaming workloads.
-- **Audit Log (#23)**: The `scadabridge audit query` and `audit export` subcommands target the centralized Audit Log component's REST endpoints (`GET /api/audit/query`, `GET /api/audit/export`) on the Host's Management API surface; `audit verify-chain` rides `POST /management` until hash-chain verification ships. Permission checks (`OperationalAudit`, `AuditExport`) are enforced server-side by `AuditEndpoints`.
+- **Audit Log (#23)**: The `scadabridge audit query`, `audit export`, `audit tree`, and `audit backfill-source-node` subcommands target the centralized Audit Log component's REST endpoints (`GET /api/audit/query`, `GET /api/audit/export`, `GET /api/audit/tree`, `POST /api/audit/backfill-source-node`) on the Host's Management API surface; `audit verify-chain` is a client-side no-op today (hash-chain deferred to v1.x). Permission checks (`OperationalAudit`, `AuditExport`, `Admin`) are enforced server-side by `AuditEndpoints`.

 ## Interactions

@@ -1269,15 +1269,18 @@ script-trust-boundary action: outbound API calls (sync + cached), outbound DB
 operations (sync + cached), notifications, and inbound API calls. This is distinct
 from the configuration-change audit trail exposed by [`audit-config`](#audit-config--configuration-change-audit-log).

-The subcommands map directly onto the `GET /api/audit/query` and
-`GET /api/audit/export` management endpoints. Filters and the result columns mirror
-the Central UI **Audit** page, so a CLI query and a UI query with the same filters
-return the same rows — CLI ↔ UI filter parity is intentional.
+The subcommands map directly onto the `GET /api/audit/query`,
+`GET /api/audit/export`, `GET /api/audit/tree`, and
+`POST /api/audit/backfill-source-node` management endpoints. Filters and the
+result columns mirror the Central UI **Audit** page, so a CLI query and a UI
+query with the same filters return the same rows — CLI ↔ UI filter parity is
+intentional.

-**Permissions.** Querying requires the `OperationalAudit` permission (roles `Admin`,
-`Audit`, or `AuditReadOnly`). Exporting requires the stricter `AuditExport` permission
-(roles `Admin` or `Audit`) — read access does *not* imply export access. A request
-without the required role returns exit code `2`.
+**Permissions.** Querying and tree traversal require the `OperationalAudit`
+permission (roles `Admin`, `Audit`, or `AuditReadOnly`). Exporting requires the
+stricter `AuditExport` permission (roles `Admin` or `Audit`) — read access does
+*not* imply export access. The `backfill-source-node` maintenance command requires
+the `Admin` role. A request without the required role returns exit code `2`.

 #### `audit query`

@@ -1342,6 +1345,46 @@ scadabridge --url <url> audit export --since <time> --until <time> --format <fmt
 > Implemented` — Parquet archival is deferred to v1.x (see `Component-AuditLog.md`).
 > Use `csv` or `jsonl`.

+#### `audit tree` (M5.3 T8)
+
+Display the full execution-chain tree for a given execution ID. The server walks
+`ParentExecutionId` to find the root, then traverses downward to collect all
+reachable executions in the chain.
+
+```sh
+scadabridge --url <url> audit tree --execution-id <guid> [--format table|json]
+```
+
+| Option | Required | Default | Description |
+|--------|----------|---------|-------------|
+| `--execution-id` | yes | — | Any `ExecutionId` in the chain (root or child) |
+| `--format` | no | `json` | Output format: `json` (structured tree) or `table` (indented tree) |
+
+The `--execution-id` can be any node in the chain — the server resolves the root
+automatically. With `--format table` the tree is printed as an indented text
+representation. With `--format json` (the default) a structured JSON tree is
+returned, suitable for scripting. Backed by `GET /api/audit/tree?executionId=<guid>`.
+Requires `OperationalAudit` permission.
+
+#### `audit backfill-source-node` (M5.6 T5)
+
+Set `SourceNode` to a sentinel value on pre-feature rows where `SourceNode IS NULL`
+and `OccurredAtUtc` is older than `--before`. Admin-only maintenance command.
+
+```sh
+scadabridge --url <url> audit backfill-source-node --before <ISO-8601-UTC> [--sentinel <value>] [--batch <n>]
+```
+
+| Option | Required | Default | Description |
+|--------|----------|---------|-------------|
+| `--before` | yes | — | ISO-8601 UTC datetime; only rows older than this date are eligible |
+| `--sentinel` | no | `unknown` | Value to write (must be non-empty) |
+| `--batch` | no | `5000` | Max rows updated per batch; controls transaction size |
+
+The command is idempotent — running it multiple times converges (only rows where
+`SourceNode IS NULL` are eligible; already-set rows are untouched). Backed by
+`POST /api/audit/backfill-source-node`. Requires `Admin` role.
+
 #### `audit verify-chain`

 Verify the audit log hash chain for a given month.
@@ -1354,11 +1397,11 @@ scadabridge --url <url> audit verify-chain --month <YYYY-MM>
 |--------|----------|---------|-------------|
 | `--month` | yes | — | Month to verify, `YYYY-MM` (e.g. `2026-05`) |

-> **v1 no-op.** Hash-chain tamper-evidence is not enabled in this release. The
-> subcommand validates the `--month` argument and prints a notice pointing at the
-> v1.x roadmap in `Component-AuditLog.md`; it exits `0` without contacting the server.
-> The command exists now so scripts and operator habits do not need to change when
-> tamper-evidence ships.
+> **v1 no-op.** Hash-chain tamper-evidence is not enabled in this release (T1
+> deferred to v1.x). The subcommand validates the `--month` argument and prints a
+> notice pointing at the v1.x roadmap in `Component-AuditLog.md`; it exits `0`
+> without contacting the server. The command exists now so scripts and operator
+> habits do not need to change when tamper-evidence ships.

 ---

@@ -285,21 +285,32 @@ public class AuditLogPurgeActorTests : TestKit, IClassFixture<MsSqlMigrationFixt
    {
        Skip.IfNot(_fixture.Available, _fixture.SkipReason);

-        // Today is ~2026-05-20 per the test environment. With RetentionDays =
-        // 60 the actor computes threshold ≈ 2026-03-21:
-        //   * Jan partition (MAX = Jan 15)  → older than threshold → PURGED
-        //   * Apr partition (MAX = Apr 15)  → newer than threshold → KEPT
+        // Seeds two rows within the defined pf_AuditLog_Month partition range (Jan 2026 –
+        // Dec 2027). RetentionDays is computed dynamically so the purge threshold always
+        // anchors near 2026-01-20, keeping the test date-independent:
+        //   old  row = Jan 15 2026 → Jan 15 < threshold ~Jan 20 → partition PURGED
+        //   kept row = Apr 15 2026 → Apr 15 > threshold ~Jan 20 → partition KEPT
+        //
+        // Using a fixed thresholdAnchor rather than "N months ago" avoids the problem
+        // of relative seeds landing before 2026-01-01 (the catch-all partition that
+        // GetPartitionBoundariesOlderThanAsync never returns).
+        var thresholdAnchor = new DateTime(2026, 1, 20, 0, 0, 0, DateTimeKind.Utc);
+        var retentionDays = (int)(DateTime.UtcNow - thresholdAnchor).TotalDays + 1;
+
+        var oldOccurred  = new DateTime(2026, 1, 15, 0, 0, 0, DateTimeKind.Utc);
+        var keptOccurred = new DateTime(2026, 4, 15, 0, 0, 0, DateTimeKind.Utc);
+
        var siteId = "purge-e2e-" + Guid.NewGuid().ToString("N").Substring(0, 8);
-        var janEvt = ScadaBridgeAuditEventFactory.Create(
+        var oldEvt = ScadaBridgeAuditEventFactory.Create(
            eventId: Guid.NewGuid(),
-            occurredAtUtc: new DateTime(2026, 1, 15, 0, 0, 0, DateTimeKind.Utc),
+            occurredAtUtc: oldOccurred,
            channel: AuditChannel.ApiOutbound,
            kind: AuditKind.ApiCall,
            status: AuditStatus.Delivered,
            sourceSiteId: siteId);
-        var aprEvt = ScadaBridgeAuditEventFactory.Create(
+        var keptEvt = ScadaBridgeAuditEventFactory.Create(
            eventId: Guid.NewGuid(),
-            occurredAtUtc: new DateTime(2026, 4, 15, 0, 0, 0, DateTimeKind.Utc),
+            occurredAtUtc: keptOccurred,
            channel: AuditChannel.ApiOutbound,
            kind: AuditKind.ApiCall,
            status: AuditStatus.Delivered,
@@ -308,8 +319,8 @@ public class AuditLogPurgeActorTests : TestKit, IClassFixture<MsSqlMigrationFixt
        await using (var seedContext = CreateMsSqlContext())
        {
            var seedRepo = new AuditLogRepository(seedContext);
-            await seedRepo.InsertIfNotExistsAsync(janEvt);
-            await seedRepo.InsertIfNotExistsAsync(aprEvt);
+            await seedRepo.InsertIfNotExistsAsync(oldEvt);
+            await seedRepo.InsertIfNotExistsAsync(keptEvt);
        }

        // Wire the actor's DI scope to the real repository against the
@@ -323,7 +334,7 @@ public class AuditLogPurgeActorTests : TestKit, IClassFixture<MsSqlMigrationFixt
        services.AddScoped<IAuditLogRepository, AuditLogRepository>();
        var sp = services.BuildServiceProvider();

-        var auditOptions = new AuditLogOptions { RetentionDays = 60 };
+        var auditOptions = new AuditLogOptions { RetentionDays = retentionDays };
        var purgeOptions = new AuditLogPurgeOptions
        {
            IntervalHours = 24,
@@ -337,13 +348,9 @@ public class AuditLogPurgeActorTests : TestKit, IClassFixture<MsSqlMigrationFixt
            Options.Create(auditOptions),
            NullLogger<AuditLogPurgeActor>.Instance)));

-        // The probe receives one AuditLogPurgedEvent per partition the actor
-        // purges per tick — other test runs that share the fixture DB may
-        // also leave behind eligible partitions, but this test creates its
-        // own fixture DB so the Jan-2026 partition is the only eligible one.
-        // Use FishForMessage to filter just in case, with a generous timeout
-        // because the real drop-and-rebuild dance against MSSQL routinely
-        // takes a couple of seconds on a busy dev container.
+        // Fish for the Jan-2026 partition boundary — the only eligible one in this
+        // fixture DB. The generous timeout covers the real drop-and-rebuild dance
+        // against MSSQL, which routinely takes a couple of seconds on a busy dev container.
        var janBoundary = new DateTime(2026, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        var matched = probe.FishForMessage<AuditLogPurgedEvent>(
            isMessage: m => m.MonthBoundary == janBoundary,
@@ -359,8 +366,8 @@ public class AuditLogPurgeActorTests : TestKit, IClassFixture<MsSqlMigrationFixt
            .Where(e => e.SourceSiteId == siteId)
            .ToListAsync();

-        Assert.DoesNotContain(rows, r => r.EventId == janEvt.EventId);
-        Assert.Contains(rows, r => r.EventId == aprEvt.EventId);
+        Assert.DoesNotContain(rows, r => r.EventId == oldEvt.EventId);
+        Assert.Contains(rows, r => r.EventId == keptEvt.EventId);
    }

    private ScadaBridgeDbContext CreateMsSqlContext() =>
@@ -140,10 +140,49 @@ WHERE  name = 'UX_AuditLog_EventId'
            NullLogger<AuditLogPurgeActor>.Instance)));
    }

-    private static (DateTime Jan, DateTime Feb, DateTime Mar) SeedOccurredAt() => (
-        new DateTime(2026, 1, 15, 0, 0, 0, DateTimeKind.Utc),
-        new DateTime(2026, 2, 15, 0, 0, 0, DateTimeKind.Utc),
-        new DateTime(2026, 3, 15, 0, 0, 0, DateTimeKind.Utc));
+    /// <summary>
+    /// Returns three seed timestamps and a computed <c>RetentionDays</c> value that
+    /// keep the purge outcome independent of the date on which the test runs.
+    /// </summary>
+    /// <remarks>
+    /// <para>
+    /// The partition function <c>pf_AuditLog_Month</c> has explicit boundaries only
+    /// for 2026-01-01 through 2027-12-01. Rows outside that range land in the
+    /// catch-all partitions, which have no <c>partition_range_values</c> entry and are
+    /// therefore never returned by
+    /// <see cref="IAuditLogRepository.GetPartitionBoundariesOlderThanAsync"/>.
+    /// All three seeds must therefore fall inside the defined boundary range.
+    /// </para>
+    /// <para>
+    /// To remain date-independent, this helper computes <c>RetentionDays</c> dynamically
+    /// so the purge threshold always lands near <b>2026-01-20</b>:
+    /// <code>
+    ///   RetentionDays = (int)(DateTime.UtcNow - new DateTime(2026, 1, 20, 0, 0, 0, DateTimeKind.Utc)).TotalDays + 1
+    /// </code>
+    /// This gives:
+    /// <list type="bullet">
+    ///   <item>Jan 15 2026 row → Jan 15 &lt; Jan 20 threshold → <b>PURGED</b>.</item>
+    ///   <item>Apr 15 / Jun 15 2026 rows → both after Jan 20 → <b>KEPT</b>.</item>
+    /// </list>
+    /// The threshold anchors to a fixed calendar point (~Jan 20 2026), so the
+    /// relationship holds for any future run date as long as the explicit partition
+    /// boundaries remain.
+    /// </para>
+    /// </remarks>
+    private static (DateTime Old, DateTime Mid, DateTime Recent, int RetentionDays) SeedOccurredAt()
+    {
+        // Anchor the threshold midway through January 2026 — strictly after the
+        // "old" seed (Jan 15) and strictly before the "mid" seed (Apr 15).
+        var thresholdAnchor = new DateTime(2026, 1, 20, 0, 0, 0, DateTimeKind.Utc);
+        var retentionDays = (int)(DateTime.UtcNow - thresholdAnchor).TotalDays + 1;
+
+        return (
+            Old:          new DateTime(2026, 1, 15, 0, 0, 0, DateTimeKind.Utc),   // in Jan-2026 partition → PURGED
+            Mid:          new DateTime(2026, 4, 15, 0, 0, 0, DateTimeKind.Utc),   // in Apr-2026 partition → KEPT
+            Recent:       new DateTime(2026, 6, 15, 0, 0, 0, DateTimeKind.Utc),   // in Jun-2026 partition → KEPT
+            RetentionDays: retentionDays
+        );
+    }

    // ---------------------------------------------------------------------
    // 1. EndToEnd_OldestPartition_PurgedViaActor_NewerKept
@@ -154,24 +193,23 @@ WHERE  name = 'UX_AuditLog_EventId'
    {
        Skip.IfNot(_fixture.Available, _fixture.SkipReason);

-        // Test date is ~2026-05-20 per environment. We want a threshold that
-        // sits strictly between Jan 15 (the Jan partition's MAX) and Feb 15
-        // (the Feb partition's MAX) so only the Jan-2026 partition is
-        // eligible for purge. RetentionDays = 100 gives a threshold of
-        // ~2026-02-09 — Jan 15 is older (purged), Feb 15 and Mar 15 are
-        // newer (kept). The window between Jan 15 and Feb 15 is wide enough
-        // (~30 days) to tolerate any plausible test-clock drift in CI.
+        // Seeds three rows in distinct calendar months. RetentionDays is computed
+        // dynamically so the purge threshold always lands near 2026-01-20 (see
+        // SeedOccurredAt() for the full rationale):
+        //   Old    = Jan 15 2026 → Jan 15 < threshold ~Jan 20 → PURGED
+        //   Mid    = Apr 15 2026 → Apr 15 > threshold ~Jan 20 → KEPT
+        //   Recent = Jun 15 2026 → Jun 15 > threshold ~Jan 20 → KEPT
        var siteId = "purge-e2e-" + Guid.NewGuid().ToString("N").Substring(0, 8);
-        var janEventId = Guid.NewGuid();
-        var febEventId = Guid.NewGuid();
-        var marEventId = Guid.NewGuid();
-        var (janOccurred, febOccurred, marOccurred) = SeedOccurredAt();
+        var oldEventId = Guid.NewGuid();
+        var midEventId = Guid.NewGuid();
+        var recentEventId = Guid.NewGuid();
+        var (oldOccurred, midOccurred, recentOccurred, retentionDays) = SeedOccurredAt();

        await using (var seedConn = _fixture.OpenConnection())
        {
-            await DirectInsertAsync(seedConn, janEventId, janOccurred, siteId);
-            await DirectInsertAsync(seedConn, febEventId, febOccurred, siteId);
-            await DirectInsertAsync(seedConn, marEventId, marOccurred, siteId);
+            await DirectInsertAsync(seedConn, oldEventId, oldOccurred, siteId);
+            await DirectInsertAsync(seedConn, midEventId, midOccurred, siteId);
+            await DirectInsertAsync(seedConn, recentEventId, recentOccurred, siteId);
        }

        // Wire the actor with a real EF context against the fixture DB.
@@ -190,15 +228,11 @@ WHERE  name = 'UX_AuditLog_EventId'
            IntervalHours = 24,
            IntervalOverride = TimeSpan.FromMilliseconds(100),
        };
-        var auditOptions = new AuditLogOptions { RetentionDays = 100 };
+        var auditOptions = new AuditLogOptions { RetentionDays = retentionDays };

        CreateActor(sp, purgeOptions, auditOptions);

-        // Wait for the actor's tick to purge the Jan-2026 partition.
-        // Concurrent test runs against the same fixture might also create
-        // eligible partitions, but each test class owns its own fixture DB
-        // (MsSqlMigrationFixture seeds a guid-named DB per class), so the
-        // Jan-2026 boundary is the only one this test can have produced.
+        // The Jan-2026 partition boundary is the only eligible one in this fixture DB.
        var janBoundary = new DateTime(2026, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        var matched = probe.FishForMessage<AuditLogPurgedEvent>(
            isMessage: m => m.MonthBoundary == janBoundary,
@@ -206,9 +240,7 @@ WHERE  name = 'UX_AuditLog_EventId'
        Assert.True(matched.RowsDeleted >= 1,
            $"Expected RowsDeleted >= 1 for Jan-2026 boundary; got {matched.RowsDeleted}.");

-        // Allow a brief settle in case the actor is mid-tick on Feb/Mar
-        // (it shouldn't be, since RetentionDays = 90 means only Jan is
-        // eligible, but the actor MAY re-enumerate quickly while we read).
+        // Allow a brief settle in case the actor re-enumerates quickly.
        await Task.Delay(TimeSpan.FromMilliseconds(500));

        await using var verify = CreateContext();
@@ -216,11 +248,10 @@ WHERE  name = 'UX_AuditLog_EventId'
            .Where(e => e.SourceSiteId == siteId)
            .ToListAsync();

-        // Jan removed; Feb + Mar untouched. Because the test owns the site
-        // id and the fixture DB, exact set membership is observable.
-        Assert.DoesNotContain(rows, r => r.EventId == janEventId);
-        Assert.Contains(rows, r => r.EventId == febEventId);
-        Assert.Contains(rows, r => r.EventId == marEventId);
+        // Old (Jan) removed; Mid (Apr) + Recent (Jun) untouched.
+        Assert.DoesNotContain(rows, r => r.EventId == oldEventId);
+        Assert.Contains(rows, r => r.EventId == midEventId);
+        Assert.Contains(rows, r => r.EventId == recentEventId);
    }

    // ---------------------------------------------------------------------
@@ -232,20 +263,19 @@ WHERE  name = 'UX_AuditLog_EventId'
    {
        Skip.IfNot(_fixture.Available, _fixture.SkipReason);

-        // Same shape as test 1 — purge the Jan-2026 partition and then
-        // assert the UX_AuditLog_EventId index is still present. The
-        // drop-and-rebuild dance briefly removes it inside its transaction
-        // (the SWITCH PARTITION step requires the non-aligned unique index
-        // to be absent), but step 5 rebuilds it before committing. Sanity-
-        // checking the post-COMMIT shape here documents the invariant in an
-        // assertable way.
+        // Same shape as test 1 — purge the Jan-2026 partition and then assert the
+        // UX_AuditLog_EventId index is still present. RetentionDays is computed
+        // dynamically so the threshold always lands near 2026-01-20 (see SeedOccurredAt()).
+        // The drop-and-rebuild dance briefly removes the index inside its transaction
+        // (the SWITCH PARTITION step requires the non-aligned unique index to be absent),
+        // but step 5 rebuilds it before committing.
        var siteId = "purge-uxidx-" + Guid.NewGuid().ToString("N").Substring(0, 8);
-        var janEventId = Guid.NewGuid();
-        var (janOccurred, _, _) = SeedOccurredAt();
+        var oldEventId = Guid.NewGuid();
+        var (oldOccurred, _, _, retentionDays) = SeedOccurredAt();

        await using (var seedConn = _fixture.OpenConnection())
        {
-            await DirectInsertAsync(seedConn, janEventId, janOccurred, siteId);
+            await DirectInsertAsync(seedConn, oldEventId, oldOccurred, siteId);
        }

        var services = new ServiceCollection();
@@ -265,7 +295,7 @@ WHERE  name = 'UX_AuditLog_EventId'
                IntervalHours = 24,
                IntervalOverride = TimeSpan.FromMilliseconds(100),
            },
-            new AuditLogOptions { RetentionDays = 90 });
+            new AuditLogOptions { RetentionDays = retentionDays });

        var janBoundary = new DateTime(2026, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        probe.FishForMessage<AuditLogPurgedEvent>(
@@ -287,18 +317,19 @@ WHERE  name = 'UX_AuditLog_EventId'
    {
        Skip.IfNot(_fixture.Available, _fixture.SkipReason);

-        // Seed + purge a Jan-2026 row, THEN exercise InsertIfNotExistsAsync
-        // twice for a fresh (May-2026) EventId. The second call must be a
-        // no-op (duplicate-key collision swallowed by the repository, per
-        // M2 Bundle A's race-fix) — which means the rebuilt
-        // UX_AuditLog_EventId unique index is functioning as intended.
+        // Seed + purge the Jan-2026 row, THEN exercise InsertIfNotExistsAsync twice for
+        // a fresh (May-2026) EventId. The second call must be a no-op (duplicate-key collision
+        // swallowed by the repository, per M2 Bundle A's race-fix) — which means the
+        // rebuilt UX_AuditLog_EventId unique index is functioning as intended.
+        // RetentionDays is computed dynamically so the threshold always lands near
+        // 2026-01-20 (see SeedOccurredAt()).
        var siteId = "purge-idem-" + Guid.NewGuid().ToString("N").Substring(0, 8);
-        var janEventId = Guid.NewGuid();
-        var (janOccurred, _, _) = SeedOccurredAt();
+        var oldEventId = Guid.NewGuid();
+        var (oldOccurred, _, _, retentionDays) = SeedOccurredAt();

        await using (var seedConn = _fixture.OpenConnection())
        {
-            await DirectInsertAsync(seedConn, janEventId, janOccurred, siteId);
+            await DirectInsertAsync(seedConn, oldEventId, oldOccurred, siteId);
        }

        var services = new ServiceCollection();
@@ -318,7 +349,7 @@ WHERE  name = 'UX_AuditLog_EventId'
                IntervalHours = 24,
                IntervalOverride = TimeSpan.FromMilliseconds(100),
            },
-            new AuditLogOptions { RetentionDays = 90 });
+            new AuditLogOptions { RetentionDays = retentionDays });

        var janBoundary = new DateTime(2026, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        probe.FishForMessage<AuditLogPurgedEvent>(
@@ -334,7 +365,7 @@ WHERE  name = 'UX_AuditLog_EventId'
        await Task.Delay(TimeSpan.FromMilliseconds(500));

        var freshEventId = Guid.NewGuid();
-        var freshOccurred = new DateTime(2026, 5, 15, 12, 0, 0, DateTimeKind.Utc);
+        var freshOccurred = new DateTime(2026, 5, 15, 12, 0, 0, DateTimeKind.Utc); // within partition range, well inside retention window
        var freshSite = "purge-idem-fresh-" + Guid.NewGuid().ToString("N").Substring(0, 8);
        var freshEvt = ScadaBridgeAuditEventFactory.Create(
            eventId: freshEventId,