merge: integrate WaitAsync/M5-audit (parallel session) with galaxy array-write + inbound-timeout fixes

2026-06-17 09:28:15 -04:00
parent bf2f481bb4 11534089b9
commit af54c8ad11
88 changed files with 7714 additions and 169 deletions
@@ -0,0 +1,150 @@
+# M5 — Audit Hardening (T3–T8) — Design
+
+**Status:** Approved (awaiting plan).
+**Worktree/branch:** `worktree-m5-audit-hardening` off `main` (`e77e209`).
+**Source:** Phase-2 milestone M5 from `docs/plans/2026-06-15-stillpending-completion-design.md`.
+
+## Goal
+
+Harden the centralized Audit Log with six independent, ready-to-build items. Two
+items originally listed under M5 — **T1 hash-chain tamper evidence** and **T2
+Parquet export** — remain **deferred to v1.x** (per CLAUDE.md's audit design
+decisions); their stubs (CLI `verify-chain` no-op, export `501`) stay unchanged.
+
+## Scope (in)
+
+T3 per-channel retention · T4 ParentExecutionId tag-cascade · T5 historical
+backfill (reframed) · T6 per-node stuck KPIs · T7 structured response-capture
+increments · T8 CLI `audit tree`.
+
+## Scope (out / deferred to v1.x)
+
+T1 hash-chain (no Hash/PrevHash columns, no real verify-chain), T2 Parquet
+export (the `501` gate stays). Reversing those deferrals is a separate decision.
+
+---
+
+## Items
+
+### T8 — CLI `audit tree` (smallest; reuses existing server walk + UI)
+The recursive execution-tree walk (`IAuditLogRepository.GetExecutionTreeAsync`,
+backed by `IX_AuditLog_ParentExecution`) and the Blazor `ExecutionTreePage`
+already exist; only an HTTP projection + CLI surface are missing.
+- **Server:** add `GET /api/audit/tree?executionId=…` in
+  `AuditEndpoints.MapAuditAPI` → `repo.GetExecutionTreeAsync` → serialize
+  `ExecutionTreeNode[]`.
+- **CLI:** add `audit tree --execution-id <guid> [--format table|json]` in
+  `AuditCommands` + an `AuditTreeHelpers` renderer (indented ASCII tree for
+  `table`; raw nodes for `json`), mirroring `AuditQueryHelpers`/`AuditExportHelpers`.
+- No schema change. **Tests:** endpoint returns the tree; CLI renders a
+  multi-level tree + handles not-found.
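+
+The `table` renderer described above could be sketched roughly as follows. This is a
+minimal illustration, not the actual `AuditTreeHelpers`: the `TreeNode` record and its
+fields are hypothetical stand-ins, since the real `ExecutionTreeNode` shape is not shown
+in this design, and cycles in the parent links are assumed not to occur.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+using System.Text;
+
+// Hypothetical node shape; the real ExecutionTreeNode may carry different fields.
+public sealed record TreeNode(Guid ExecutionId, Guid? ParentExecutionId, string Label);
+
+public static class TreeRenderSketch
+{
+    // Renders a flat node list as an indented ASCII tree, one line per node.
+    public static string Render(IReadOnlyList<TreeNode> nodes)
+    {
+        var ids = nodes.Select(n => n.ExecutionId).ToHashSet();
+
+        // Index children by parent id, ignoring parents that are not in the result set.
+        var children = nodes
+            .Where(n => n.ParentExecutionId.HasValue && ids.Contains(n.ParentExecutionId.Value))
+            .ToLookup(n => n.ParentExecutionId.Value);
+
+        // A root is any node with no parent, or whose parent is absent from the result set.
+        var roots = nodes.Where(n =>
+            n.ParentExecutionId is null || !ids.Contains(n.ParentExecutionId.Value));
+
+        var sb = new StringBuilder();
+        foreach (var root in roots) Append(sb, root, children, depth: 0);
+        return sb.ToString();
+    }
+
+    private static void Append(
+        StringBuilder sb, TreeNode node, ILookup<Guid, TreeNode> children, int depth)
+    {
+        sb.Append(' ', depth * 2).Append(depth == 0 ? "" : "+- ").AppendLine(node.Label);
+        foreach (var child in children[node.ExecutionId])
+            Append(sb, child, children, depth + 1);
+    }
+}
+```
+
+An empty node list renders as an empty string, which leaves the not-found message to the
+caller.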
+
+### T6 — Per-node stuck-count KPIs
+KPIs are per-site today; `SourceNode` is on the `Notification` and `SiteCalls`
+rows but not aggregated.
+- Add `ComputePerNodeKpisAsync` (group by `SourceNode`) parallel to the existing
+  `ComputePerSiteKpisAsync` in `NotificationOutboxRepository` and
+  `SiteCallAuditRepository`.
+- New `PerNode…KpiRequest`/`Response` message pair per actor; register in each
+  actor's `Receive<>`.
+- Surface a per-node breakdown on the existing KPI tiles
+  (`AuditKpiTiles`/`SiteCallKpiTiles`) — additive, behind the existing tiles.
+- **Tests:** repository grouping returns correct per-node counts (stuck/parked/
+  queue-depth); message round-trip.
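+
+As an in-memory illustration of the grouping, a per-node aggregation might look like the
+sketch below. The row shape, the status strings (`Stuck`, `Parked`, `Pending`), and the
+`"unknown"` bucket for rows without a `SourceNode` are all assumptions; the real
+`ComputePerNodeKpisAsync` would presumably express the same grouping as a database query.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+// Hypothetical row and result shapes, for illustration only.
+public sealed record OutboxRow(string? SourceNode, string Status);
+public sealed record PerNodeKpi(string SourceNode, int Stuck, int Parked, int QueueDepth);
+
+public static class PerNodeKpiSketch
+{
+    // Groups rows by SourceNode and counts each assumed status per node.
+    public static IReadOnlyList<PerNodeKpi> Compute(IEnumerable<OutboxRow> rows) =>
+        rows.GroupBy(r => r.SourceNode ?? "unknown")
+            .Select(g => new PerNodeKpi(
+                g.Key,
+                Stuck: g.Count(r => r.Status == "Stuck"),
+                Parked: g.Count(r => r.Status == "Parked"),
+                QueueDepth: g.Count(r => r.Status == "Pending")))
+            .OrderBy(k => k.SourceNode, StringComparer.Ordinal)
+            .ToList();
+}
+```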
+
+### T7 — Structured response-capture increments (no schema change)
+- **(a) Inbound request headers** → captured into the existing `Extra` JSON in
+  `AuditWriteMiddleware.EmitInboundAudit`, passed through the existing header
+  redactor (auth headers redacted by default).
+- **(b) `AuditInboundCeilingHits`** counter on `AuditCentralHealthSnapshot`
+  (alongside the existing failure counters), incremented when an inbound row
+  truncates (request or response hits `InboundMaxBytes`). Surfaced via the
+  health snapshot.
+- **(c) Per-method opt-out** of body capture: a `SkipBodyCapture` flag on
+  `PerTargetRedactionOverride`, checked in the capture pipeline so a noisy/
+  sensitive method can suppress body capture (headers + metadata still recorded).
+- **Tests:** request headers land in `Extra` and are redacted; ceiling-hit
+  increments the counter; opt-out suppresses body but keeps the row.
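+
+Item (a) could be sketched as below. The sensitive-header set and the `requestHeaders`
+key are assumptions made for illustration; the existing header redactor and the real
+layout of `Extra` are not shown in this design.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+using System.Text.Json;
+
+public static class HeaderCaptureSketch
+{
+    // Assumed default set; the real redactor's list is not shown in this design.
+    private static readonly HashSet<string> Sensitive =
+        new(StringComparer.OrdinalIgnoreCase) { "Authorization", "Cookie" };
+
+    // Redacts sensitive header values, then serializes the result as a JSON fragment.
+    // Note: ToDictionary throws on duplicate header names; a real capture path
+    // would need to merge or join repeated headers first.
+    public static string ToExtraJson(IEnumerable<KeyValuePair<string, string>> headers)
+    {
+        var redacted = headers.ToDictionary(
+            h => h.Key,
+            h => Sensitive.Contains(h.Key) ? "[REDACTED]" : h.Value,
+            StringComparer.OrdinalIgnoreCase);
+        return JsonSerializer.Serialize(new { requestHeaders = redacted });
+    }
+}
+```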
+
+### T4 — `ParentExecutionId` tag-cascade (touches the actor model — high-risk)
+Completes the execution tree beyond the inbound-API→routed-script case.
+- **Alarm on-trigger:** thread a `Guid? parentExecutionId` through
+  `AlarmActor.SpawnAlarmExecutionActor` → `AlarmExecutionActor` →
+  `ScriptRuntimeContext`, so an alarm-triggered script chains to its firing
+  context (the alarm's own execution id where one exists; otherwise a root).
+- **Nested `CallScript`/`CallShared`:** in `ScriptRuntimeContext`, pass **the
+  current run's `ExecutionId`** (not the inherited `_parentExecutionId`) as the
+  child invocation's `ParentExecutionId`, so `A → CallScript(B)` records B's
+  parent as A — a true multi-level tree.
+- **Timer/expression-trigger top-level runs** stay roots (no spawner) — unchanged.
+- **Tests:** alarm-triggered script row carries the expected parent; a 2-level
+  nested `CallScript` produces a chain A→B→C walkable by `GetExecutionTreeAsync`.
+- **Risk:** serialized actor state + correlation plumbing; covered by targeted
+  SiteRuntime actor tests + a tree-walk integration assertion.
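+
+The nested-call rule can be modeled in isolation. The class below is a simplified
+stand-in for `ScriptRuntimeContext` (hypothetical name and members, id plumbing only),
+meant to show why the child must receive the current run's id rather than the inherited
+parent id.
+
+```csharp
+using System;
+
+// Simplified stand-in for ScriptRuntimeContext; only the id plumbing is modeled.
+public sealed class RunContextSketch
+{
+    public Guid ExecutionId { get; } = Guid.NewGuid();
+    public Guid? ParentExecutionId { get; }
+    public string Script { get; }
+
+    public RunContextSketch(string script, Guid? parentExecutionId = null)
+    {
+        Script = script;
+        ParentExecutionId = parentExecutionId;
+    }
+
+    // The child's parent is THIS run's ExecutionId, not the inherited ParentExecutionId.
+    public RunContextSketch CallScript(string childScript) => new(childScript, ExecutionId);
+}
+```
+
+With `a = new RunContextSketch("A")`, `b = a.CallScript("B")`, and `c = b.CallScript("C")`,
+`c.ParentExecutionId` equals `b.ExecutionId` and `b.ParentExecutionId` equals
+`a.ExecutionId`. Passing the inherited parent instead would attach both B and C to
+whatever spawned A, flattening the tree.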
+
+### T3 — Per-channel retention overrides (one design wrinkle, resolved)
+Retention is a single global `RetentionDays`; the purge actor switches out whole
+month partitions by `OccurredAtUtc` (channel-blind).
+- Add `PerChannelRetentionDays` (`Dictionary<string,int>`, keyed by channel /
+  `Action` name) to `AuditLogOptions`, validated like the global value; a channel
+  override may only be **shorter** than the global window (a longer one has no effect,
+  because month-partition switch-out removes every row older than the global window).
+- **Mechanism (resolved):** after the coarse global partition purge, the purge
+  actor runs a **bounded row-level delete** for channels whose override is
+  shorter than global (`DELETE … WHERE Action=@channel AND OccurredAtUtc<@thr`,
+  batched). This runs from the **purge/maintenance path, not the writer role** —
+  the append-only invariant binds the writer/ingest role, not maintenance. The
+  **M2.10 CI grep-guard is widened** to allow the purge actor's single audited
+  deletion call site (an allow-list entry, not a blanket exemption).
+- **Tests:** a channel with a shorter override is purged earlier than the global;
+  channels without an override follow the global; the guard still rejects
+  UPDATE/DELETE everywhere except the sanctioned purge site.
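+
+The override validation might be sketched as follows. The class and method names are
+hypothetical, and the 30-day floor is an assumption taken from the `[30, global]` range
+quoted in the implementation plan; an override equal to the global value is accepted
+here, which may or may not match the final validator.
+
+```csharp
+using System;
+using System.Collections.Generic;
+
+public static class RetentionValidationSketch
+{
+    // Returns one message per invalid override; an empty list means the options are valid.
+    // minDays is an assumed floor, not a value confirmed by this design.
+    public static IReadOnlyList<string> Validate(
+        int globalRetentionDays, IReadOnlyDictionary<string, int> perChannel, int minDays = 30)
+    {
+        var errors = new List<string>();
+        foreach (var (channel, days) in perChannel)
+        {
+            if (string.IsNullOrWhiteSpace(channel))
+                errors.Add("Channel name must be non-empty.");
+            else if (days < minDays)
+                errors.Add($"{channel}: {days} is below the minimum of {minDays} days.");
+            else if (days > globalRetentionDays)
+                errors.Add($"{channel}: {days} exceeds the global window of {globalRetentionDays} days.");
+        }
+        return errors;
+    }
+}
+```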
+
+### T5 — Historical backfill (reframed per the computed-column reality)
+- **`SourceNode`** is a physical nullable column. For truly historical rows the
+  node-of-origin is **unknowable**, so the backfill sets a **configurable
+  sentinel** (default `"unknown"`) on `NULL` rows via a one-shot maintenance
+  command (run from the purge/maintenance path), rather than guessing a node.
+- **`ExecutionId`/`ParentExecutionId`** are **persisted computed columns derived
+  from `DetailsJson`**; backfilling them means mutating the JSON, which
+  append-only forbids. These are **documented as a runbook limitation** (pre-feature
+  rows stay NULL) — no code.
+- **Tests:** the SourceNode backfill sets the sentinel only on NULL rows within a
+  bounded range and is idempotent; documentation note added.
+
+---
+
+## Cross-cutting
+
+- **Shared seams:** `AuditLogOptions` (T3, T7), `AuditEndpoints.MapAuditAPI`
+  (T8), `AuditCommands` (T8), `AuditCentralHealthSnapshot` (T6, T7),
+  `IAuditLogRepository`/the KPI repositories (T6), the purge/maintenance role
+  (T3, T5). No AuditLog **schema** change in M5 (T1/T2 deferred).
+- **Append-only:** the only new mutations are T3's purge-role channel DELETE and
+  T5's purge-role sentinel UPDATE; both run on the maintenance path and both are
+  reflected in the CI guard's allow-list. Writer/ingest paths stay INSERT-only.
+
+## Testing strategy
+
+Per-item unit + targeted integration tests (above). T4 additionally gets a
+tree-walk integration assertion. Full-solution build + targeted suites at the
+integration step. No new infra dependency (Parquet deferred).
+
+## Sequencing
+
+Independent items, parallelizable by disjoint area:
+- **Wave A (parallel):** T8 (CLI+endpoint), T6 (KPI repos+actors+tiles), T7
+  (middleware+health+redaction-override) — disjoint projects.
+- **Wave B (parallel):** T4 (SiteRuntime actors — high-risk), T3 (AuditLog
+  options+purge actor+CI guard), T5 (purge-path backfill command + runbook).
+- **Wave C:** integration verification + docs (Component-AuditLog/-CLI, CLAUDE.md
+  KPI/retention notes, runbook).
+
+## Risks
+
+- **T4** actor-model correlation (serialized state) — targeted tests + tree-walk
+  assertion.
+- **T3** append-only tension — resolved via maintenance-role delete + CI-guard
+  allow-list; verify the guard still blocks all other DELETE/UPDATE.
+- **T5** node-of-origin unknowable — sentinel + documented limitation (no false
+  precision).
@@ -0,0 +1,92 @@
+# M5 — Audit Hardening (T3–T8) Implementation Plan
+
+> **For Claude:** executed via superpowers-extended-cc:subagent-driven-development in this session.
+
+**Goal:** Ship six independent audit-log hardening items (per-channel retention, ParentExecutionId tag-cascade, SourceNode backfill, per-node stuck KPIs, structured response-capture increments, CLI `audit tree`) without an AuditLog schema change.
+
+**Architecture:** Each item extends an existing seam identified in the survey. No new infra dependency (T1 hash-chain + T2 Parquet stay deferred to v1.x). Design: `docs/plans/2026-06-16-m5-audit-hardening-design.md`.
+
+**Tech Stack:** C#/.NET 10, EF Core (MS SQL), Akka.NET, Blazor Server, System.CommandLine, xUnit.
+
+**Conventions:** targeted builds/tests per task (`dotnet build <proj>`, `dotnet test --filter`); full-solution build only at integration (M5.7). Implementers do NOT create worktrees (already in `worktree-m5-audit-hardening`) and commit with pathspec form `git commit -m "..." -- <paths>` (retry on index.lock). Append-only invariant holds for writer/ingest paths; the only sanctioned mutations are T3's purge-role channel delete and T5's purge-role sentinel UPDATE, both reflected in the M2.10 CI-guard allow-list.
+
+---
+
+# Wave A — leverage-existing-infra (parallel; disjoint projects)
+
+### Task M5.1 (T8): CLI `audit tree` + tree endpoint
+**Classification:** standard · **~5 min** · **Parallelizable with:** M5.2, M5.3
+**Files:**
+- Modify: `src/ZB.MOM.WW.ScadaBridge.ManagementService/AuditEndpoints.cs` (`MapAuditAPI`, ~line 97) — add `GET /api/audit/tree?executionId=<guid>` → `IAuditLogRepository.GetExecutionTreeAsync(executionId)` → JSON `ExecutionTreeNode[]`; 400 on missing/invalid guid, empty array when no rows.
+- Create: `src/ZB.MOM.WW.ScadaBridge.CLI/Commands/AuditTreeHelpers.cs` — render `ExecutionTreeNode[]` as an indented ASCII tree (table) and as raw JSON (`--format json`), mirroring `AuditQueryHelpers`/`AuditExportHelpers`.
+- Modify: `src/ZB.MOM.WW.ScadaBridge.CLI/Commands/AuditCommands.cs` (`Build`, ~line 28) — add `BuildTree()`: `audit tree --execution-id <guid> [--format table|json]`, calls the new endpoint via the existing `ManagementHttpClient` pattern.
+- Test: ManagementService tests for the endpoint (multi-level tree + not-found); CLI tests for `AuditTreeHelpers` rendering.
+**AC:** `audit tree --execution-id <id>` prints the execution tree (root→children, indented); `--format json` emits the node array; the server walk reuses the existing `GetExecutionTreeAsync` (no new SQL). No schema change.
+
+### Task M5.2 (T6): Per-node stuck-count KPIs
+**Classification:** standard · **~5 min** · **Parallelizable with:** M5.1, M5.3
+**Files:**
+- Modify: `NotificationOutboxRepository` — add `ComputePerNodeKpisAsync` (group by `SourceNode`) parallel to `ComputePerSiteKpisAsync`.
+- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/...Repository` — same `ComputePerNodeKpisAsync`.
+- Modify: `NotificationOutboxActor.cs` (~line 1054) + `SiteCallAuditActor.cs` (~line 781) — add a `PerNode…KpiRequest`/`Response` message pair (in Commons messages) and a `Receive<>`/handler each.
+- Modify: CentralUI `AuditKpiTiles.razor` / `SiteCallKpiTiles.razor` (or the per-site KPI panel) — add an additive per-node breakdown.
+- Test: repository per-node grouping returns correct stuck/parked/queue-depth counts; actor message round-trip.
+**AC:** per-node stuck/parked counts available + surfaced; `SourceNode` already on both tables (no migration). Per-site KPIs unchanged.
+
+### Task M5.3 (T7): Structured response-capture increments
+**Classification:** standard · **~5 min** · **Parallelizable with:** M5.1, M5.2
+**Files:**
+- Modify: `src/ZB.MOM.WW.ScadaBridge.AuditLog/...AuditWriteMiddleware.cs` (`EmitInboundAudit`, ~line 246) — capture inbound **request headers** into the existing `Extra` JSON (through the existing header redactor; auth headers redacted by default).
+- Modify: `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditCentralHealthSnapshot.cs` — add an `AuditInboundCeilingHits` counter (+ its interface), incremented from the middleware when an inbound row truncates (`requestTruncated || responseTruncated`).
+- Modify: `src/ZB.MOM.WW.ScadaBridge.AuditLog/Configuration/PerTargetRedactionOverride.cs` — add a `SkipBodyCapture` flag; honor it in the capture pipeline (suppress body, keep headers + metadata + the row).
+- Test: request headers land in `Extra` and are redacted; ceiling-hit increments the counter; `SkipBodyCapture` suppresses body but still writes the row.
+**AC:** no schema change (uses `Extra` JSON + health snapshot); existing redaction behavior preserved.
+
+---
+
+# Wave B — actor model + maintenance (parallel; T5 after M5.1's CLI edits)
+
+### Task M5.4 (T4): ParentExecutionId tag-cascade
+**Classification:** high-risk (actor model + correlation) · **~5 min** · **Parallelizable with:** M5.5 (and M5.6)
+**Files:**
+- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/AlarmActor.cs` (`SpawnAlarmExecutionActor`, ~line 578) + `AlarmExecutionActor.cs` (ctor, ~line 90) — thread a `Guid? parentExecutionId` so alarm-triggered scripts chain to the firing context; pass it into the `ScriptRuntimeContext` (currently `null`).
+- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/ScriptRuntimeContext.cs` (`CallScript` ~line 394, `CallShared`) — pass **the current run's `_executionId`** (not the inherited `_parentExecutionId`) as the child invocation's `ParentExecutionId`, forming a true multi-level tree.
+- Test (`tests/.../SiteRuntime.Tests/`): an alarm-triggered script row carries the expected parent; a 2-level nested `CallScript` (A→B→C) is walkable via `GetExecutionTreeAsync` (or assert the emitted `ParentExecutionId` chain).
+**AC:** alarm/trigger-spawned and nested-call runs form a correct execution tree; top-level timer/expression-trigger runs stay roots; no regression to the inbound-API→routed-script path.
+
+### Task M5.5 (T3): Per-channel retention overrides
+**Classification:** high-risk (purge/deletion + CI guard) · **~5 min** · **Parallelizable with:** M5.4, M5.6
+**Files:**
+- Modify: `src/ZB.MOM.WW.ScadaBridge.AuditLog/Configuration/AuditLogOptions.cs` — add `Dictionary<string,int> PerChannelRetentionDays` (keyed by `Action`/channel name); validate in `AuditLogOptionsValidator.cs` (each override in `[30, global]`, shorter-than-global only).
+- Modify: `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogPurgeActor.cs` (`HandlePurgeTickAsync`, ~line 135) — after the global partition switch-out, for each channel with a shorter override, run a **bounded batched DELETE** (`WHERE Action=@channel AND OccurredAtUtc<@threshold`) via the purge/maintenance path.
+- Modify: the M2.10 CI grep-guard script — add an allow-list entry for the purge actor's single audited DELETE call site (do NOT blanket-exempt; the guard must still reject all other UPDATE/DELETE on AuditLog).
+- Test: a channel with a shorter override is purged earlier than global; un-overridden channels follow global; the CI guard still fails on a stray DELETE elsewhere.
+**AC:** per-channel retention works without violating writer-role append-only; the guard remains effective.
+
+### Task M5.6 (T5): SourceNode sentinel backfill + runbook
+**Classification:** small · **~4 min** · **Parallelizable with:** M5.4, M5.5 · **Depends on:** M5.1 (shares `AuditCommands.cs`)
+**Files:**
+- Create: a one-shot maintenance backfill (purge/maintenance path) that sets `SourceNode` to a configurable sentinel (default `"unknown"`) on `NULL` rows within a bounded `OccurredAtUtc` range; idempotent.
+- Modify: `src/ZB.MOM.WW.ScadaBridge.CLI/Commands/AuditCommands.cs` — add `audit backfill-source-node [--sentinel <s>] [--before <date>]` invoking it (after M5.1's `audit tree` is in, to avoid a concurrent edit to this file).
+- Modify/Create: a runbook note (`deploy/.../RUNBOOK.md` or the AuditLog component doc) documenting that `ExecutionId`/`ParentExecutionId` are computed from `DetailsJson` and CANNOT be backfilled under append-only (pre-feature rows stay NULL) — no false precision.
+- Test: backfill sets the sentinel only on NULL rows in range, is idempotent, and does not touch non-NULL rows.
+**AC:** SourceNode backfill is sanctioned maintenance (CI-guard allow-listed if it does UPDATE); the computed-id limitation is documented, not coded.
+
+---
+
+# Wave C — integration + docs
+
+### Task M5.7: Integration verification + docs
+**Classification:** high-risk (final integration reviewer) · **~5 min** · **Depends on:** M5.1–M5.6
+**Steps:**
+1. `dotnet build ZB.MOM.WW.ScadaBridge.slnx` (full solution).
+2. Targeted tests across AuditLog, ManagementService, CLI, NotificationOutbox/SiteCallAudit, SiteRuntime, CentralUI; run the CI grep-guard to confirm it still blocks stray UPDATE/DELETE.
+3. Docs: `docs/requirements/Component-AuditLog.md` (per-channel retention, per-node KPIs, response-capture increments, tag-cascade, `audit tree`), `Component-CLI.md` + CLI README (`audit tree`, `audit backfill-source-node`), CLAUDE.md audit notes (per-channel retention; tag-cascade now beyond inbound; per-node KPIs), and the runbook computed-id limitation.
+4. Commit; final integration review of the whole `1b7600f..HEAD` diff.
+**AC:** full build green; all targeted suites + CI guard green; docs reflect the six shipped items; no doc claims a deferred item shipped (T1/T2 remain deferred).
+
+---
+
+## Native tasks & dependencies
+
+Sub-tasks created as native tasks under umbrella #16 (M5). Edges: M5.6 ⟵ M5.1 (shared CLI file); M5.7 ⟵ M5.1–M5.6. Waves: A = {M5.1, M5.2, M5.3} parallel; B = {M5.4, M5.5, M5.6} parallel (M5.6 after M5.1); C = M5.7.
@@ -0,0 +1,13 @@
+{
+  "planPath": "docs/plans/2026-06-16-m5-audit-hardening.md",
+  "tasks": [
+    {"id": 119, "subject": "M5.1 (T8): CLI audit tree + tree endpoint", "status": "pending"},
+    {"id": 120, "subject": "M5.2 (T6): Per-node stuck-count KPIs", "status": "pending"},
+    {"id": 121, "subject": "M5.3 (T7): Structured response-capture increments", "status": "pending"},
+    {"id": 122, "subject": "M5.4 (T4): ParentExecutionId tag-cascade", "status": "pending"},
+    {"id": 123, "subject": "M5.5 (T3): Per-channel retention overrides", "status": "pending"},
+    {"id": 124, "subject": "M5.6 (T5): SourceNode sentinel backfill + runbook", "status": "pending", "blockedBy": [119]},
+    {"id": 125, "subject": "M5.7: M5 integration verification + docs", "status": "pending", "blockedBy": [119, 120, 121, 122, 123, 124]}
+  ],
+  "lastUpdated": "2026-06-16"
+}
@@ -0,0 +1,264 @@
+# Patch request — event-driven "wait for attribute change (with timeout)" script helper
+
+**Date:** 2026-06-17
+**Type:** Source enhancement (small, additive) to the SiteRuntime script surface
+**Why now:** the DELMIA/MES receiver re-implementation
+([`2026-06-17-delmia-mes-receiver-templates-design.md`](2026-06-17-delmia-mes-receiver-templates-design.md), §9 risk #1)
+currently has to **busy-poll** for the handshake completion flag. This spec describes the gap
+and a precise, patch-ready design for a host-provided `WaitAsync` helper so scripts can wait
+**event-driven** for a tag/attribute to reach a value, bounded by a timeout.
+
+> All file paths, line numbers, message records, and signatures below were read from source on
+> 2026-06-17. Treat line numbers as guides (they drift); the type/method names are the anchors.
+
+---
+
+## 1. The gap
+
+The receiver handshake (and any request/response tag interaction) needs to **wait until a
+data-sourced attribute reaches a value** — e.g. wait up to 30 s for `RecipeProcessedFlag == true`
+or `MoveInCompleteFlag == true` after setting the trigger flag.
+
+ScadaBridge's script surface today has **read** (`Attributes.GetAsync` / indexer) and **write**
+(`Attributes.SetAsync` / indexer), but **no "wait for value" primitive**. The only way to wait is
+a manual poll loop:
+
+```csharp
+// current workaround — every handshake script repeats this
+var deadline = DateTime.UtcNow.AddSeconds(30);
+while (DateTime.UtcNow < deadline && !CancellationToken.IsCancellationRequested)
+{
+    if ((bool?)(await Attributes.GetAsync("RecipeProcessedFlag")) == true) break;
+    await Task.Delay(200, CancellationToken);
+}
+```
+
+Why this is unsatisfactory:
+
+- **Latency** — completion is detected up to one poll interval late (200 ms here).
+- **Wasted work** — each iteration is an actor `Ask` (`GetAttributeRequest` round-trip to the
+  `InstanceActor`); N handshakes × M polls = a lot of needless messages.
+- **Boilerplate** — the same loop is copy-pasted into every handshake script, easy to get wrong
+  (forgetting `CancellationToken`, off-by-one on the deadline, not handling quality).
+- **No quality awareness** — the poll reads whatever value is cached regardless of OPC/MX quality.
+
+Crucially, **the data is already being pushed to the actor that owns it.** A data-sourced
+attribute's value arrives from the DCL and is applied in the `InstanceActor`, which then raises
+`AttributeValueChanged`. So an event-driven waiter is natural and removes the poll entirely.
+
+---
+
+## 2. Where the change goes (verified wiring)
+
+| Concern | Type / file | Notes |
+|---|---|---|
+| Change notification | `AttributeValueChanged(InstanceUniqueName, AttributePath, AttributeName, Value, Quality, Timestamp)` — `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Streaming/AttributeValueChanged.cs` | raised on **every** change |
+| **Single choke point** | `InstanceActor.HandleAttributeValueChanged(...)` — `src/…/SiteRuntime/Actors/InstanceActor.cs` | both static writes (`HandleSetStaticAttributeCore`) **and** DCL/subscription updates (`HandleTagValueUpdate` ← `TagValueUpdate`) funnel through here, then `PublishAndNotifyChildren` |
+| Owner of state | `InstanceActor` (`_attributes`, `_attributeQualities`, `_attributeTimestamps`) | **single-threaded** — registration + current-value check is atomic here |
+| Script read path | `AttributeAccessor` (`ScopeAccessors.cs`) → `ScriptRuntimeContext.GetAttribute` → `Ask<GetAttributeResponse>(GetAttributeRequest)` | the helper mirrors this |
+| Script globals build | `ScriptExecutionActor` (`src/…/SiteRuntime/Actors/ScriptExecutionActor.cs`) builds `ScriptRuntimeContext` (passes `instanceActor`, `self`, `_askTimeout`) and `ScriptGlobals` (`CancellationToken = cts.Token` from the per-script timeout) | **the script timeout token is NOT currently passed into `ScriptRuntimeContext`** — this patch must thread it in |
+| Helper idiom | `ScriptRuntimeContext` nested helpers (e.g. `ExternalSystemHelper`) — ctor deps stored as readonly fields, exposed via an on-demand property | follow this idiom |
+| Trust model | `ScriptTrustPolicy` (`src/…/ScriptAnalysis/`) | `System.Threading.Tasks` + `CancellationToken`/`CancellationTokenSource` are in `AllowedExceptions`; lambdas/`Func<>` are fine. **No trust change needed** — the wait runs in host code; the script just `await`s a provided method. |
+
+**Design principle:** do the wait **inside the `InstanceActor`** as a one-shot registered waiter,
+not in the script via polling. Because the actor is single-threaded and `HandleAttributeValueChanged`
+is the one place every change passes, a waiter that (a) checks the current value on registration and
+(b) is re-evaluated on each change **cannot miss the edge** between "read current" and "subscribe".
+
+---
+
+## 3. Proposed API (script-facing)
+
+Add to the `Attributes` accessor (`AttributeAccessor` in `ScopeAccessors.cs`), so scope/composition
+path resolution (`Resolve(name)`) applies just like get/set:
+
+```csharp
+// Wait until `name` equals targetValue (value-equality, codec-normalized). Returns true if matched
+// within the timeout, false if it timed out. Honors the script CancellationToken.
+Task<bool> Attributes.WaitAsync(string name, object? targetValue, TimeSpan timeout);
+
+// Predicate form — site-local template scripts only (predicate is an in-process delegate).
+Task<bool> Attributes.WaitAsync(string name, Func<object?, bool> predicate, TimeSpan timeout);
+
+// Optional richer overload that also returns the matched value + quality.
+Task<WaitResult> Attributes.WaitForAsync(string name, object? targetValue, TimeSpan timeout);
+// record WaitResult(bool Matched, object? Value, string? Quality, bool TimedOut);
+```
+
+> **Status:** IMPLEMENTED. `Attributes.WaitForAsync(...)` returns a `WaitResult`
+> (`readonly record struct WaitResult(bool Matched, object? Value, string? Quality, bool TimedOut)`
+> in Commons), populated on match (Value + Quality) and `Matched:false, TimedOut:true` on timeout.
+
+Return **bool** (not throw) for the common case — the handshake wants matched/timed-out, not an
+exception. The value-equality overload is the one the handshake needs and is the one that can also
+be exposed on the inbound/routed side (§6), because a value serializes and a delegate does not.
+
+Handshake, rewritten (replaces the §1 poll loop):
+
+```csharp
+await Attributes.SetAsync("RecipeDownloadFlag", true);                 // trigger
+var ok = await Attributes.WaitAsync("RecipeProcessedFlag", true, TimeSpan.FromSeconds(30));
+if (!ok) return new { Result = false, ResultText = "Timeout waiting for recipe to be processed" };
+return new {
+    Result     = (bool?)(await Attributes.GetAsync("RecipeProcessResult")) ?? false,
+    ResultText = (string?)(await Attributes.GetAsync("RecipeProcessResultText")) ?? ""
+};
+```
+
+```csharp
+await Attributes.SetAsync("MoveInFlag", true);
+var ok = await Attributes.WaitAsync("MoveInCompleteFlag", true, TimeSpan.FromSeconds(30));
+// … read MoveInSuccessfulFlag / MoveInErrorText / MoveInBatchID …
+```
+
+---
+
+## 4. Implementation outline (the patch)
+
+### 4.1 New messages (`src/ZB.MOM.WW.ScadaBridge.Commons/Messages/…`)
+```csharp
+// actor protocol (site-local; delegate is fine because messaging is in-process)
+public record WaitForAttributeRequest(
+    string  CorrelationId,
+    string  InstanceName,
+    string  AttributeName,            // already scope-resolved by the accessor
+    string? TargetValueEncoded,       // AttributeValueCodec.Encode(targetValue); null = "any change"
+    Func<object?, bool>? Predicate,   // local-only; null when TargetValueEncoded is used
+    TimeSpan Timeout,
+    DateTimeOffset OccurredAtUtc);
+
+public record WaitForAttributeResponse(
+    string CorrelationId,
+    bool   Matched,
+    object? Value,
+    string Quality,
+    bool   TimedOut,
+    string? ErrorMessage = null);
+
+// internal self-message used to fire the timeout
+public record WaitForAttributeTimeout(string CorrelationId);
+```
+
+### 4.2 `InstanceActor` (`src/…/SiteRuntime/Actors/InstanceActor.cs`)
+- Add a registry: `Dictionary<string, PendingWait> _attributeWaiters` keyed by `CorrelationId`, where
+  `PendingWait` holds the attribute name, the match test (decoded target value **or** predicate),
+  the original `Sender` (`IActorRef`), and the scheduled `ICancelable` timeout handle.
+- **Handle `WaitForAttributeRequest`:**
+  1. Build the match test (decode `TargetValueEncoded` via `AttributeValueCodec` → equality test, or
+     use `Predicate`).
+  2. **Fast path:** if the current `_attributes[name]` already satisfies the test, reply
+     `WaitForAttributeResponse(Matched: true, Value, Quality)` immediately and return.
+  3. Otherwise compute `effectiveTimeout = min(request.Timeout, requestDeadlineFromCaller)` (the
+     caller's `Ask` already carries the script token; see §4.3). Optionally cap the number of
+     concurrent waiters per instance (defensive; reply with `ErrorMessage` if exceeded).
+  4. Register the waiter and schedule the timeout:
+     `Context.System.Scheduler.ScheduleTellOnceCancelable(effectiveTimeout, Self, new WaitForAttributeTimeout(cid), Self)`,
+     storing the returned `ICancelable` (plain `ScheduleTellOnce` returns `void`). Capture `Sender`
+     now (it is invalid later).
+- **In `HandleAttributeValueChanged` (after state is updated):** iterate `_attributeWaiters` whose
+  attribute matches the changed `AttributeName`; for any whose test now passes, cancel its timeout,
+  reply `WaitForAttributeResponse(Matched: true, …)`, and remove it. (Iterate over a snapshot to
+  allow removal during enumeration.)
+- **Handle `WaitForAttributeTimeout`:** if still registered, reply
+  `WaitForAttributeResponse(Matched: false, TimedOut: true)` and remove.
+- Optional: a `quality == "Good"`-only mode (parameter on the request) if a handshake must ignore
+  Bad-quality transients.
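+
+The registry logic above can be modeled without Akka to make the check-then-register
+ordering concrete. This is a sketch under stated assumptions, not the `InstanceActor`
+code: the class name is hypothetical, `TaskCompletionSource` stands in for the reply to
+`Sender`, every call is assumed to arrive on one thread (as it would inside the actor),
+and timeout scheduling is left to the caller, who invokes `OnTimeout`.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+using System.Threading.Tasks;
+
+// Actor-free model of the waiter registry. Not thread-safe by design: the real
+// registry would rely on the actor's single-threaded message processing.
+public sealed class WaiterRegistrySketch
+{
+    private sealed record PendingWait(
+        string Attribute, Func<object?, bool> Test, TaskCompletionSource<bool> Reply);
+
+    private readonly Dictionary<string, object?> _attributes = new();
+    private readonly Dictionary<string, PendingWait> _waiters = new();
+
+    public Task<bool> Register(string correlationId, string attribute, Func<object?, bool> test)
+    {
+        // Fast path: the current value already satisfies the test.
+        if (_attributes.TryGetValue(attribute, out var current) && test(current))
+            return Task.FromResult(true);
+
+        var reply = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
+        _waiters[correlationId] = new PendingWait(attribute, test, reply);
+        return reply.Task;
+    }
+
+    public void OnAttributeChanged(string attribute, object? value)
+    {
+        _attributes[attribute] = value;
+
+        // Snapshot so entries can be removed while iterating.
+        foreach (var (id, wait) in _waiters.Where(w => w.Value.Attribute == attribute).ToList())
+        {
+            if (!wait.Test(value)) continue;
+            _waiters.Remove(id);
+            wait.Reply.TrySetResult(true);
+        }
+    }
+
+    public void OnTimeout(string correlationId)
+    {
+        // No-op if the waiter already matched and was removed.
+        if (_waiters.Remove(correlationId, out var wait))
+            wait.Reply.TrySetResult(false);
+    }
+}
+```
+
+Because `Register` and `OnAttributeChanged` never interleave, a value that changes before
+registration is seen by the fast path and one that changes afterwards is seen by the
+resolve loop.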
+
+> **Status:** IMPLEMENTED as an opt-in `requireGoodQuality` parameter on `WaitAsync`/`WaitForAsync`
+> (additive trailing `RequireGoodQuality` field on `WaitForAttributeRequest`, gated at both the
+> fast-path and resolve-loop match sites). Default `false` = quality-agnostic (matches on value only).
+
+### 4.3 `ScriptRuntimeContext` (`src/…/SiteRuntime/Scripts/ScriptRuntimeContext.cs`)
+- **Thread the script timeout token in.** Add a `CancellationToken scriptTimeoutToken` constructor
+  parameter (today only `_askTimeout` is available to helpers; the per-script `cts.Token` is **not**
+  passed). `ScriptExecutionActor` already has `cts.Token` — pass it when constructing the context.
+- Add a method that the accessor calls:
+  ```csharp
+  public async Task<bool> WaitAttribute(string name, string? targetValueEncoded,
+                                        Func<object?,bool>? predicate, TimeSpan timeout)
+  {
+      var cid = Guid.NewGuid().ToString();
+      var req = new WaitForAttributeRequest(cid, _instanceName, name, targetValueEncoded,
+                                            predicate, timeout, DateTimeOffset.UtcNow);
+      // Ask bounded by the script timeout token so a script-deadline abort cancels the await.
+      var resp = await _instanceActor.Ask<WaitForAttributeResponse>(
+                     req, timeout + _askTimeout /* small slack */, _scriptTimeoutToken);
+      return resp.Matched;
+  }
+  ```
+
+### 4.4 `ScriptExecutionActor` (`src/…/SiteRuntime/Actors/ScriptExecutionActor.cs`)
+- Pass `cts.Token` (the per-script timeout, created at the `new CancellationTokenSource(timeout)`
+  site) into the new `ScriptRuntimeContext` constructor parameter from §4.3.
+
+### 4.5 `AttributeAccessor` (`src/…/SiteRuntime/Scripts/ScopeAccessors.cs`)
+```csharp
+public Task<bool> WaitAsync(string key, object? targetValue, TimeSpan timeout)
+    => _ctx.WaitAttribute(Resolve(key), AttributeValueCodec.Encode(targetValue), null, timeout);
+
+public Task<bool> WaitAsync(string key, Func<object?, bool> predicate, TimeSpan timeout)
+    => _ctx.WaitAttribute(Resolve(key), null, predicate, timeout);
+```
+
+### 4.6 Trust model — no change
+`WaitAsync` is a host-provided async method; the wait/scheduling happens in host code. The script
+only `await`s it and may pass a `Func<>` (a normal closure, not reflection). `System.Threading.Tasks`
+and `CancellationToken` are already in `ScriptTrustPolicy.AllowedExceptions`. Verify the new helper
+type/members don't collide with `ForbiddenIdentifiers` (`dynamic`, `Activator`) — they don't.
+
+---
+
+## 5. Correctness notes
+
+- **No missed edge.** Registration (current-value check) and change-handling both run on the
+  `InstanceActor`'s single thread, so a value that flips between "set trigger" and "register waiter"
+  is caught by the fast-path check; a value that flips after registration is caught by
+  `HandleAttributeValueChanged`. The poll-loop and this design are both correct; this one is
+  event-driven and cheaper.
+- **Timeout is authoritative and self-cleaning.** The scheduled `WaitForAttributeTimeout` guarantees
+  the waiter is removed and the caller answered even if the value never changes. Match cancels the
+  scheduled timeout.
+- **Cancellation.** Bounding the helper `Ask` with the script timeout token means a script that hits
+  its own `ExecutionTimeoutSeconds` abandons the wait; pair with a best-effort cancel message to the
+  actor to evict the orphan waiter promptly (otherwise it self-evicts at its own timeout).
+- **Concurrency / re-entrancy.** Multiple waiters per instance are fine (keyed by `CorrelationId`).
+  Consider a per-instance cap as a guard against a script leaking waiters in a loop.
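+
+The register/change/timeout interplay described above can be sketched without Akka. This is a
+minimal, hypothetical model (`WaiterRegistry`, `PendingWait`, and the `Reply` callback are
+illustrative names, not the `InstanceActor` implementation), and it assumes every method runs on
+the owning actor's single thread. Cancelling the scheduled timeout on match is not modeled:
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+// Hypothetical, Akka-free model of the per-instance waiter registry.
+// All methods are assumed to be called from one thread (the actor's).
+public sealed class WaiterRegistry
+{
+    private sealed record PendingWait(string Attribute, Func<object?, bool> Test, Action<bool> Reply);
+
+    private readonly Dictionary<string, PendingWait> _waiters = new();
+    private readonly Dictionary<string, object?> _values = new();
+
+    public int Count => _waiters.Count;
+
+    // Fast path: answer at once if the current value already matches; otherwise register.
+    public void Register(string correlationId, string attribute, Func<object?, bool> test, Action<bool> reply)
+    {
+        if (_values.TryGetValue(attribute, out var current) && test(current))
+        {
+            reply(true);
+            return;
+        }
+        _waiters[correlationId] = new PendingWait(attribute, test, reply);
+    }
+
+    // Change path: store the new value, then resolve and remove every waiter it satisfies.
+    public void OnValueChanged(string attribute, object? value)
+    {
+        _values[attribute] = value;
+        var matched = _waiters
+            .Where(kv => kv.Value.Attribute == attribute && kv.Value.Test(value))
+            .ToList(); // snapshot so the dictionary can be mutated below
+        foreach (var (id, wait) in matched)
+        {
+            _waiters.Remove(id);
+            wait.Reply(true);
+        }
+    }
+
+    // Timeout path: remove the waiter and answer false; a no-op if it already matched.
+    public void OnTimeout(string correlationId)
+    {
+        if (_waiters.Remove(correlationId, out var wait))
+            wait.Reply(false);
+    }
+}
+```
+
+In this model a waiter leaves the registry on exactly one path (match or timeout), so a late
+timeout for an already-matched waiter finds nothing to remove and sends no second reply.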
+
+---
+
+## 6. Optional: inbound / routed variant
+
+For symmetry with `RouteTarget.GetAttributes` (`src/…/InboundAPI/RouteHelper.cs`), an inbound script
+could call `Route.To(code).WaitForAttribute(name, targetValue, timeout)`. Mirror the existing routed
+pattern: add `RouteToWaitForAttributeRequest/Response`, an `IInstanceRouter.RouteToWaitForAttributeAsync`
+method, and unpack it on the site comms actor into the same `WaitForAttributeRequest` to the
+`InstanceActor`. **Value-equality only** across the wire — a `Func<>` predicate cannot be serialized,
+so the routed form takes the encoded target value (the predicate overload stays site-local). This is
+optional: the receiver handshake runs **inside** the template script (site-local), so §3–§5 alone
+fully cover the DELMIA/MES use case.
+
+> **Status:** IMPLEMENTED. `Route.To(code).WaitForAttribute(name, targetValue, timeout)` is wired
+> end-to-end (`RouteToWaitForAttributeRequest/Response` → `IInstanceRouter` → `CommunicationService`
+> → `SiteCommunicationActor` → `DeploymentManagerActor` → `InstanceActor`), value-equality only
+> across the wire. NOT wired into the CentralUI Test-Run sandbox — that remains a follow-up.
+
+---
+
+## 7. Acceptance criteria
+
+1. A template script can `await Attributes.WaitAsync("Flag", true, TimeSpan.FromSeconds(30))` and it
+   returns `true` promptly when the data-sourced attribute reaches `true` (driven by a DCL update),
+   with no poll loop.
+2. Returns `false` (no throw) when the value never matches within the timeout.
+3. The wait is bounded by the script's own `ExecutionTimeoutSeconds` (a shorter script deadline wins).
+4. No `AttributeValueChanged` edge is missed across the register/change boundary (unit test: flip the
+   value in the same actor step as registration, and one step after).
+5. Waiters are removed on match and on timeout (no leak; assert registry empty afterward).
+6. Scope/composition path resolution works (`Children["DelmiaReceiver"]`-scoped wait resolves to the
+   composed child's attribute).
+7. Passes `ScriptAnalysis` trust validation unchanged.
+8. The DELMIA/MES handshake base scripts (design doc §4) compile and pass using `WaitAsync` in place
+   of the poll loop.
+
+Suggested tests: extend `InstanceActor` tests (waiter fast-path, change-match, timeout, removal) and
+the script-surface tests under `tests/…/SiteRuntime*`.
+```
@@ -0,0 +1,226 @@
+# WaitAsync Deferred Optional Items — Implementation Plan
+
+> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans (subagent-driven) to implement this plan task-by-task.
+
+**Goal:** Implement the three items deferred from the WaitAsync spec (`docs/plans/2026-06-17-waitfor-attribute-change-helper-spec.md`): §3 `WaitForAsync`/`WaitResult` richer overload, §4.2 quality-gated ("Good"-only) matching, and §6 inbound/routed `Route.To(...).WaitForAttribute` variant.
+
+**Architecture:** Builds on the shipped core (`b89d69a`→`04e97f4`). Two of the items (§3, §4.2) are site-local enrichments of the existing `Attributes` script surface + `InstanceActor` waiter; no new actor protocol shapes beyond an additive `RequireGoodQuality` field. The third (§6) mirrors the existing `Route.To(...).GetAttributes` cross-cluster path end-to-end (`RouteTarget` → `IInstanceRouter` → `CommunicationService` → `SiteCommunicationActor` → `DeploymentManagerActor` → `InstanceActor`), value-equality only across the wire, with the cluster Ask bounded by the *wait* timeout rather than the generic integration timeout.
+
+**Tech Stack:** C#/.NET 10, Akka.NET 1.5, xUnit + Akka.TestKit + NSubstitute.
+
+**Branch/worktree:** `waitfor-attr-helper` at `/Users/dohertj2/Desktop/ScadaBridge/.claude/worktrees/waitfor-attr-helper` (off local main; carries the core feature). Implementers do NOT create worktrees, commit in **pathspec form** (`git commit -m "…" -- <paths>`), do NOT push, and do NOT touch main. Targeted builds/tests per task; full-solution build only in WD-3.
+
+---
+
+## Naming / shared shapes
+
+- New script return type `WaitResult` (Commons): `public readonly record struct WaitResult(bool Matched, object? Value, string? Quality, bool TimedOut);`
+- `WaitForAttributeRequest` gains a trailing additive field `bool RequireGoodQuality = false` (site-local request). `RequireGoodQuality` semantics: a match requires the value test to pass **and** `string.Equals(quality, "Good", StringComparison.Ordinal)`.
+- Routed contract (value-equality only, no predicate, no quality flag across the wire — §6 says value-equality only): `RouteToWaitForAttributeRequest` / `RouteToWaitForAttributeResponse` (Commons `Messages/InboundApi`).
+- The `WaitForAttributeResponse.Quality` field is already `string?` (null on timeout/error).
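+
+The `RequireGoodQuality` rule can be sketched as a pure function. `WaitResult` is copied from the
+bullet above; `QualityGate.IsMatch` is a hypothetical helper name used only for illustration, not a
+member of the codebase:
+
+```csharp
+using System;
+
+// Shape taken from the plan text above.
+public readonly record struct WaitResult(bool Matched, object? Value, string? Quality, bool TimedOut);
+
+// Hypothetical helper: isolates the match rule so it can be reasoned about on its own.
+public static class QualityGate
+{
+    // A match needs the value test to pass and, when gated, the quality to be exactly "Good".
+    public static bool IsMatch(object? value, string? quality, Func<object?, bool> test, bool requireGoodQuality)
+    {
+        bool qualityOk = !requireGoodQuality
+            || string.Equals(quality, "Good", StringComparison.Ordinal);
+        return qualityOk && test(value);
+    }
+}
+```
+
+Under this sketch a target value that arrives with quality `"Bad"` (or `"good"`, since the
+comparison is ordinal and case-sensitive) does not match when the gate is on, and matches when the
+gate is off.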
+
+---
+
+## Execution waves
+
+- **Wave 1 (parallel, disjoint files):** WD-1 ∥ WD-2a. (2 concurrent committers; post-wave HEAD-presence check.)
+- **Wave 2:** WD-2b (after WD-2a).
+- **Wave 3:** WD-3 (after WD-1, WD-2a, WD-2b).
+
+WD-1 must add `RequireGoodQuality` ONLY as a **trailing defaulted** ctor param of `WaitForAttributeRequest`, so WD-2b's `new WaitForAttributeRequest(...)` (built in wave 2) compiles regardless.
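+
+Why a trailing defaulted parameter keeps the wave-2 call site compiling can be seen in a small
+sketch. The record below is a simplified stand-in that follows the argument order used in this
+plan; `CallSites` and its method names are hypothetical:
+
+```csharp
+using System;
+
+// Simplified stand-in for the real message type (illustrative only).
+public record WaitForAttributeRequest(
+    string CorrelationId, string InstanceUniqueName, string AttributeName,
+    string? TargetValueEncoded, Func<object?, bool>? Predicate, TimeSpan Timeout,
+    DateTimeOffset Timestamp, bool RequireGoodQuality = false);
+
+public static class CallSites
+{
+    // Pre-field call shape (7 positional args): still compiles, RequireGoodQuality is false.
+    public static WaitForAttributeRequest Legacy() =>
+        new("c1", "inst", "Flag", "true", null, TimeSpan.FromSeconds(30), DateTimeOffset.UtcNow);
+
+    // New call shape opts in by naming the trailing parameter.
+    public static WaitForAttributeRequest Gated() =>
+        new("c2", "inst", "Flag", "true", null, TimeSpan.FromSeconds(30), DateTimeOffset.UtcNow,
+            RequireGoodQuality: true);
+}
+```
+
+Placing the new parameter anywhere other than last, or without a default, would break the
+seven-argument call shape.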
+
+---
+
+### Task WD-1: Site-local `WaitForAsync` + `WaitResult` + quality-gated mode (§3 + §4.2)
+
+**Classification:** high-risk (modifies the `InstanceActor` single-threaded match evaluation + an additive message-contract field)
+**Estimated implement time:** ~5 min
+**Parallelizable with:** WD-2a
+
+**Files:**
+- Create: `src/ZB.MOM.WW.ScadaBridge.Commons/Types/WaitResult.cs`
+- Modify: `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/Instance/WaitForAttribute.cs` (add trailing `bool RequireGoodQuality = false` to `WaitForAttributeRequest`)
+- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/InstanceActor.cs` (thread `RequireGoodQuality` into `PendingWait` + both match sites)
+- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/ScriptRuntimeContext.cs` (add `WaitAttributeFull` returning `WaitResult`; add `requireGoodQuality` param)
+- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/ScopeAccessors.cs` (add `WaitForAsync` overloads + `requireGoodQuality` optional param on `WaitAsync`)
+- Test: `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/Actors/InstanceActorWaitForAttributeTests.cs` + `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/Scripts/ScopeAccessorTests.cs`
+
+**Steps (TDD):**
+
+1. **`WaitResult`** — add the readonly record struct above.
+
+2. **`WaitForAttributeRequest`** — add trailing `bool RequireGoodQuality = false`. Keep the `Func<>` predicate field as-is. Update the XML-doc.
+
+3. **`InstanceActor`** — add `bool RequireGoodQuality` to the `PendingWait` record. At BOTH match sites build the effective match as:
+   ```csharp
+   // fast-path (HandleWaitForAttribute): quality from _attributeQualities.GetValueOrDefault(name, <existing default>)
+   // resolve loop (ResolveMatchedWaiters): quality from changed.Quality
+   bool QualityOk(string? q) => !requireGoodQuality || string.Equals(q, "Good", StringComparison.Ordinal);
+   bool matched = QualityOk(quality) && test(value);   // keep test() inside its existing try/catch
+   ```
+   Store `RequireGoodQuality` on the `PendingWait` so the resolve loop knows it. Keep the throwing-predicate guard (the `QualityOk && test` must still be inside the existing try/catch). A fast-path quality failure under `requireGoodQuality` is just a non-match: register the waiter and schedule the timeout as normal (do NOT fast-reply matched).
+
+4. **`ScriptRuntimeContext`** — refactor: a private `Task<WaitForAttributeResponse> WaitInternal(name, encoded, predicate, timeout, requireGoodQuality)` that does the token-bounded `Ask` (keep the existing `AskTimeoutException → ...` handling; on AskTimeout return a synthetic `WaitForAttributeResponse(.., Matched:false, TimedOut:true)`). Then:
+   ```csharp
+   public async Task<bool> WaitAttribute(string name, string? enc, Func<object?,bool>? pred, TimeSpan t, bool requireGoodQuality = false)
+       => (await WaitInternal(name, enc, pred, t, requireGoodQuality)).Matched;
+   public async Task<WaitResult> WaitAttributeFull(string name, string? enc, Func<object?,bool>? pred, TimeSpan t, bool requireGoodQuality = false)
+   { var r = await WaitInternal(...); return new WaitResult(r.Matched, r.Value, r.Quality, r.TimedOut); }
+   ```
+   (Note: `WaitAttribute`'s existing `AskTimeoutException → return false` must be preserved — fold it into `WaitInternal` returning a non-matched/timed-out response, OR catch in both. Do NOT catch `OperationCanceledException`/`TaskCanceledException`.)
+
+5. **`AttributeAccessor`** — add `requireGoodQuality` optional param to both existing `WaitAsync` overloads, and add two `WaitForAsync` overloads:
+   ```csharp
+   public Task<WaitResult> WaitForAsync(string key, object? targetValue, TimeSpan timeout, bool requireGoodQuality = false)
+       => _ctx.WaitAttributeFull(Resolve(key), AttributeValueCodec.Encode(targetValue), null, timeout, requireGoodQuality);
+   public Task<WaitResult> WaitForAsync(string key, Func<object?,bool> predicate, TimeSpan timeout, bool requireGoodQuality = false)
+       => _ctx.WaitAttributeFull(Resolve(key), null, predicate, timeout, requireGoodQuality);
+   ```
+   XML-doc: `requireGoodQuality:true` ignores Bad/Uncertain-quality transients.
+
+6. **Tests** (extend existing files): (a) `WaitForAsync` returns a populated `WaitResult` on match (Value+Quality) and on timeout (`Matched:false, TimedOut:true`). (b) quality-gated: a value reaching target at **Bad** quality does NOT match when `requireGoodQuality:true` (stays pending → times out), but DOES match when `false`; and matches when it reaches target at Good quality. Cover both fast-path (already-at-target-but-Bad) and change-match. (c) scope resolution still applied for `WaitForAsync`.
+
+7. Build `Commons` + `SiteRuntime` + the SiteRuntime test project; run `--filter "FullyQualifiedName~WaitForAttribute|FullyQualifiedName~WaitAsync|FullyQualifiedName~WaitForAsync"` and the `~InstanceActor|~ScopeAccessor` regression filter. All green.
+
+8. Commit (pathspec).
+
+---
+
+### Task WD-2a: Routed contract + central path (§6, part 1)
+
+**Classification:** high-risk (cross-cluster message contract + `IInstanceRouter` surface)
+**Estimated implement time:** ~5 min
+**Parallelizable with:** WD-1
+
+**Files:**
+- Modify: `src/ZB.MOM.WW.ScadaBridge.Commons/Messages/InboundApi/RouteToInstanceRequest.cs` (add the two records)
+- Modify: `src/ZB.MOM.WW.ScadaBridge.InboundAPI/IInstanceRouter.cs` (add method)
+- Modify: `src/ZB.MOM.WW.ScadaBridge.InboundAPI/CommunicationServiceInstanceRouter.cs` (delegate)
+- Modify: `src/ZB.MOM.WW.ScadaBridge.InboundAPI/RouteHelper.cs` (`RouteTarget.WaitForAttribute`)
+- Modify: `src/ZB.MOM.WW.ScadaBridge.Communication/CommunicationService.cs` (`RouteToWaitForAttributeAsync` — **wait-timeout-aware** Ask)
+- Modify (compile-break fixes — interface gained a member): `tests/ZB.MOM.WW.ScadaBridge.AuditLog.Tests/Integration/ParentExecutionIdCorrelationTests.cs` (`BridgingInstanceRouter`) and the inline `IInstanceRouter` double in `tests/ZB.MOM.WW.ScadaBridge.InboundAPI.Tests/EndpointContentTypeTests.cs`
+- Test: `tests/ZB.MOM.WW.ScadaBridge.InboundAPI.Tests/RouteHelperTests.cs`
+
+**Steps (TDD):**
+
+1. **Commons records** (mirror `RouteToGetAttributes*`, value-equality only):
+   ```csharp
+   public record RouteToWaitForAttributeRequest(
+       string CorrelationId, string InstanceUniqueName, string AttributeName,
+       string? TargetValueEncoded, TimeSpan Timeout, DateTimeOffset Timestamp,
+       Guid? ParentExecutionId = null);
+   public record RouteToWaitForAttributeResponse(
+       string CorrelationId, bool Matched, object? Value, string? Quality, bool TimedOut,
+       bool Success, string? ErrorMessage, DateTimeOffset Timestamp);
+   ```
+   (`Success`/`ErrorMessage` = routing-level outcome, e.g. instance-not-found; `Matched`/`TimedOut`/`Value`/`Quality` = wait outcome.)
+
+2. **`IInstanceRouter`** — add `Task<RouteToWaitForAttributeResponse> RouteToWaitForAttributeAsync(string siteId, RouteToWaitForAttributeRequest request, CancellationToken cancellationToken);`. **Update all 3 implementers** (prod `CommunicationServiceInstanceRouter` + the 2 test doubles listed above; the test doubles can return a canned response / throw NotImplemented only if never exercised — prefer a sane canned response).
+
+3. **`CommunicationServiceInstanceRouter`** — delegate to `_communicationService.RouteToWaitForAttributeAsync(...)`.
+
+4. **`RouteHelper.RouteTarget`** — add (mirror `GetAttributes`, throw on `!Success`):
+   ```csharp
+   public async Task<bool> WaitForAttribute(string attributeName, object? targetValue, TimeSpan timeout, CancellationToken cancellationToken = default)
+   {
+       var token = Effective(cancellationToken);
+       var siteId = await ResolveSiteAsync(token);
+       var request = new RouteToWaitForAttributeRequest(Guid.NewGuid().ToString(), _instanceCode,
+           attributeName, AttributeValueCodec.Encode(targetValue), timeout, DateTimeOffset.UtcNow, _parentExecutionId);
+       var response = await _instanceRouter.RouteToWaitForAttributeAsync(siteId, request, token);
+       if (!response.Success) throw new InvalidOperationException(response.ErrorMessage ?? "Remote attribute wait failed");
+       return response.Matched;
+   }
+   ```
+   (`AttributeValueCodec` is in Commons.Types — add the using if needed.)
+
+5. **`CommunicationService.RouteToWaitForAttributeAsync`** — mirror `RouteToGetAttributesAsync` BUT bound the Ask by the wait timeout, not the generic integration timeout:
+   ```csharp
+   var envelope = new SiteEnvelope(siteId, request);
+   var askTimeout = request.Timeout + _options.IntegrationTimeout; // slack beyond the wait
+   return await GetActor().Ask<RouteToWaitForAttributeResponse>(envelope, askTimeout, cancellationToken);
+   ```
+
+6. **Test** (`RouteHelperTests`): with a substitute `IInstanceRouter` returning a canned `RouteToWaitForAttributeResponse(Matched:true,...)`, `Route.To("x").WaitForAttribute("Flag", true, 30s)` returns true; `Success:false` → throws `InvalidOperationException`; the encoded target equals `AttributeValueCodec.Encode(true)`.
+
+7. Build `Commons` + `InboundAPI` + `Communication` + the two affected test projects; run `--filter "FullyQualifiedName~RouteHelper"` + a build of AuditLog.Tests/InboundAPI.Tests to confirm the interface-addition compiles. Commit (pathspec).
+
+---
+
+### Task WD-2b: Site unpacking + handler (§6, part 2)
+
+**Classification:** high-risk (actor handler crossing into `InstanceActor`; Ask-timeout correctness)
+**Estimated implement time:** ~4 min
+**Parallelizable with:** none
+**blockedBy:** WD-2a
+
+**Files:**
+- Modify: `src/ZB.MOM.WW.ScadaBridge.Communication/Actors/SiteCommunicationActor.cs` (add `Receive<RouteToWaitForAttributeRequest>(msg => _deploymentManagerProxy.Forward(msg));` next to the other RouteTo forwards ~line 145)
+- Modify: `src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/DeploymentManagerActor.cs` (`Receive<RouteToWaitForAttributeRequest>(RouteInboundApiWaitForAttribute);` + handler)
+- Test: `tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/Actors/DeploymentManagerActorTests.cs`
+
+**Steps (TDD):**
+
+1. **`SiteCommunicationActor`** — add the `Receive`/Forward line.
+
+2. **`DeploymentManagerActor.RouteInboundApiWaitForAttribute`** — mirror `RouteInboundApiGetAttributes`:
+   ```csharp
+   private void RouteInboundApiWaitForAttribute(RouteToWaitForAttributeRequest request)
+   {
+       if (!_instanceActors.TryGetValue(request.InstanceUniqueName, out var instanceActor))
+       {
+           Sender.Tell(new RouteToWaitForAttributeResponse(request.CorrelationId, false, null, null, false,
+               false, $"Instance '{request.InstanceUniqueName}' not found on this site.", DateTimeOffset.UtcNow));
+           return;
+       }
+       var sender = Sender;
+       var inner = new WaitForAttributeRequest(request.CorrelationId, request.InstanceUniqueName,
+           request.AttributeName, request.TargetValueEncoded, null /*predicate*/, request.Timeout,
+           DateTimeOffset.UtcNow /*, RequireGoodQuality defaults false */);
+       // Ask bounded by the WAIT timeout + slack (NOT a fixed 30s).
+       instanceActor.Ask<WaitForAttributeResponse>(inner, request.Timeout + TimeSpan.FromSeconds(5))
+           .ContinueWith(t => t.IsCompletedSuccessfully
+               ? new RouteToWaitForAttributeResponse(request.CorrelationId, t.Result.Matched, t.Result.Value,
+                   t.Result.Quality, t.Result.TimedOut, true, null, DateTimeOffset.UtcNow)
+               : new RouteToWaitForAttributeResponse(request.CorrelationId, false, null, null, false, false,
+                   t.Exception?.GetBaseException().Message ?? "Attribute wait timed out", DateTimeOffset.UtcNow))
+           .PipeTo(sender);
+   }
+   ```
+   (`WaitForAttributeRequest` lives in Commons `Messages/Instance` — add the using. Build with both the trailing-`RequireGoodQuality` and pre-field signatures in mind; passing 7 positional args + default is fine.)
+
+3. **Test** (`DeploymentManagerActorTests`, mirror the routed get-attributes test): deploy/register an instance whose attribute already equals the target → `RouteToWaitForAttributeRequest` → `RouteToWaitForAttributeResponse(Success:true, Matched:true)`; unknown instance → `Success:false`.
+
+4. Build `Communication` + `SiteRuntime` + SiteRuntime test project; run `--filter "FullyQualifiedName~DeploymentManagerActor"`. Commit (pathspec).
+
+---
+
+### Task WD-3: Integration — docs + full verification
+
+**Classification:** standard
+**Estimated implement time:** ~4 min
+**Parallelizable with:** none
+**blockedBy:** WD-1, WD-2a, WD-2b
+
+**Files:**
+- Modify: `docs/plans/2026-06-17-waitfor-attribute-change-helper-spec.md` (mark §3 `WaitForAsync`/`WaitResult`, §4.2 quality-gated mode, and §6 routed variant as IMPLEMENTED; note Test-Run sandbox parity excluded)
+- Modify: `docs/requirements/Component-SiteRuntime.md` (script-surface note: `Attributes.WaitForAsync` + `requireGoodQuality`) and `docs/requirements/Component-InboundAPI.md` (`Route.To(...).WaitForAttribute`) — brief, only if those docs enumerate the script surface
+- (No new component, no migration, no docker config change)
+
+**Steps:**
+
+1. Update the spec doc + component docs as above.
+2. **Full-solution build:** `dotnet build ZB.MOM.WW.ScadaBridge.slnx` — 0 errors.
+3. **Targeted test sweep** across everything touched:
+   `dotnet test tests/ZB.MOM.WW.ScadaBridge.SiteRuntime.Tests/... --filter "FullyQualifiedName~WaitForAttribute|FullyQualifiedName~WaitAsync|FullyQualifiedName~WaitForAsync|FullyQualifiedName~DeploymentManagerActor"`,
+   `dotnet test tests/ZB.MOM.WW.ScadaBridge.InboundAPI.Tests/... --filter "FullyQualifiedName~RouteHelper"`,
+   and a build of `tests/ZB.MOM.WW.ScadaBridge.AuditLog.Tests` + `tests/ZB.MOM.WW.ScadaBridge.Communication.Tests` to confirm no compile/regression from the interface addition.
+4. `git diff` review; commit (pathspec).
+
+---
+
+## Out of scope (explicit)
+
+- Routed `WaitForAttribute` is NOT wired into the CentralUI Test-Run sandbox (`ISandboxInstanceGateway`/`SandboxInstanceGateway`); production inbound scripts get it. Follow-up if Test-Run parity is wanted.
+- No predicate or quality flag across the wire (§6 is value-equality only, per spec).
+- No docker redeploy (no cluster-runtime config change; additive script surface only).
@@ -0,0 +1,10 @@
+{
+  "planPath": "docs/plans/2026-06-17-waitfor-deferred-items.md",
+  "tasks": [
+    {"id": 1, "subject": "WD-1: site-local WaitForAsync + WaitResult + quality-gated mode (§3+§4.2)", "classification": "high-risk", "status": "pending", "parallelizableWith": [2]},
+    {"id": 2, "subject": "WD-2a: routed contract + central path (§6 part 1)", "classification": "high-risk", "status": "pending", "parallelizableWith": [1]},
+    {"id": 3, "subject": "WD-2b: site unpacking + DeploymentManager handler (§6 part 2)", "classification": "high-risk", "status": "pending", "blockedBy": [2]},
+    {"id": 4, "subject": "WD-3: integration — docs + full verification", "classification": "standard", "status": "pending", "blockedBy": [1, 2, 3]}
+  ],
+  "lastUpdated": "2026-06-17"
+}
@@ -158,16 +158,32 @@ is per-run and flat — `WHERE ExecutionId = X` returns everything one run did,
 nothing links a run to the run that *spawned* it. `ParentExecutionId` carries the
 spawning execution's `ExecutionId`: a spawned run still gets its own fresh
 `ExecutionId`, and every audit row it emits also carries the spawner's id in
-`ParentExecutionId`. The first cut bridges the **inbound API → routed-site-script**
-case: an inbound request runs a method script that calls `Route.Call`, routing to
-a site instance; the routed site script records the inbound request's
-`ExecutionId` as its `ParentExecutionId`, while the inbound `InboundRequest` row
-itself is top-level (`ParentExecutionId` NULL). The pointer always references the
-*immediate* spawner, so a routed run that itself routes onward threads its own
-`ExecutionId` — walking `ParentExecutionId → ExecutionId` recursively
-reconstructs the call chain as a tree of arbitrary depth. The tag-cascade case
-(an attribute write triggering another script) is **deferred** — the model
-generalises to it with no schema change once that spawn point is threaded.
+`ParentExecutionId`. The pointer always references the *immediate* spawner, so a
+run that itself spawns further runs threads its own `ExecutionId` — walking
+`ParentExecutionId → ExecutionId` recursively reconstructs the call chain as a
+tree of arbitrary depth.
+
+**Tag-cascade coverage (M5.4 T4):** `ParentExecutionId` threading now spans all
+known spawn points:
+
+- **Inbound API → routed site script** — an inbound request runs a method script
+  that calls `Route.Call`; the routed site script records the inbound request's
+  `ExecutionId` as its `ParentExecutionId`, while the inbound `InboundRequest` row
+  is top-level (`ParentExecutionId` NULL).
+- **Alarm-triggered on-trigger script** — when an alarm fires and its on-trigger
+  script runs (via `AlarmActor → AlarmExecutionActor`), the alarm context's
+  `ExecutionId` is carried as the run's `ParentExecutionId`. Currently, the alarm
+  subsystem has no Guid-typed firing id, so on-trigger runs are roots (NULL) in
+  practice, but the wiring is in place for a future alarm `ExecutionId`.
+- **Nested `CallScript` / `CallShared` invocations** — when a script calls
+  `Instance.CallScript(...)` or a shared script via `CallShared`, the calling
+  execution's `ExecutionId` threads into the spawned run as its
+  `ParentExecutionId`, making deeply nested call chains visible as a tree.
+
+Attribute-write-triggered cascades (one tag change triggering another script via a
+tag subscription) are also wired: trigger-driven runs carry `ParentExecutionId =
+NULL` (top-level roots), and any nested `CallScript`/`CallShared` they perform
+chains as above. The schema is unchanged — no further tag-cascade work is deferred.

 ## The Site-Local `AuditLog` (SQLite)

@@ -268,7 +284,34 @@ operational `SiteCalls` shape for the dispatcher and UI.

 - **Default cap** — 8 KB for each of `RequestSummary` and `ResponseSummary`;
  raised to 64 KB on any error row (`Status IN ('Failed', 'Parked', 'Discarded')`).
- **Inbound API exception.** For `Channel = ApiInbound`, `RequestSummary` and `ResponseSummary` are captured in full up to a per-body hard ceiling of 1 MiB (configurable via `AuditLog:InboundMaxBytes`; default 1 048 576 bytes; min 8 192; max 16 777 216). The 8 KiB / 64 KiB default/error caps that apply to other channels do not apply here. `PayloadTruncated = 1` is set only when the inbound ceiling is hit — verbatim capture is the normal case. The ceiling applies independently to each body. Header redaction and per-target body redactors still run before persistence.
+- **Inbound API exception.** For `Channel = ApiInbound`, `RequestSummary` and
+  `ResponseSummary` are captured in full up to a per-body hard ceiling of 1 MiB
+  (configurable via `AuditLog:InboundMaxBytes`; default 1 048 576 bytes; min
+  8 192; max 16 777 216). The 8 KiB / 64 KiB default/error caps that apply to
+  other channels do not apply here. `PayloadTruncated = 1` is set only when the
+  inbound ceiling is hit — verbatim capture is the normal case. The ceiling
+  applies independently to each body. Header redaction and per-target body
+  redactors still run before persistence.
+- **Inbound ceiling hits (M5.3 T7).** Every time the `InboundMaxBytes` ceiling
+  truncates a body, an `IAuditInboundCeilingHitsCounter.Increment()` call fires.
+  This counter is surfaced as `AuditInboundCeilingHits` on the central health
+  snapshot (alongside `CentralAuditWriteFailures` / `AuditRedactionFailure`) so
+  operators can detect persistently oversized payloads and raise the ceiling or
+  add per-target body redactors.
+- **Request headers in `Extra` (M5.3 T7).** For `Channel = ApiInbound`, the
+  `AuditWriteMiddleware` captures the inbound HTTP request headers (post-redaction
+  — `Authorization`, `X-API-Key`, `Cookie`, `Set-Cookie`, and the configured
+  `HeaderRedactList` are scrubbed before serialization) into the `Extra` JSON
+  column under the key `"requestHeaders"`. This makes the full header envelope
+  visible in the Audit Log UI's detail drawer and the CLI's `audit query` output
+  without widening the schema.
+- **Per-method `SkipBodyCapture` (M5.3 T7).** `PerTargetOverrides` now includes
+  a `SkipBodyCapture: true` flag. When set for an inbound API method, the audit
+  row is always emitted (headers, status, duration, actor, etc. are recorded) but
+  `RequestSummary` and `ResponseSummary` are left null. Use this for methods whose
+  payloads are structurally large or contain secrets not covered by body redactors.
+  Headers are still captured into `Extra.requestHeaders` (after redaction) even
+  when `SkipBodyCapture` is true.
 - **Truncation** — UTF-8 byte-safe; `PayloadTruncated = 1` when applied. Full
  bodies are never stored.
 - **HTTP headers** — `Authorization`, `Cookie`, `Set-Cookie`, `X-API-Key`, and
@@ -311,16 +354,33 @@ MS SQL for direct-write events). Unredacted secrets never persist.
 ## Retention & Purge

 - **Central:** 365-day default based on `OccurredAtUtc`, configurable via
-  `AuditLog:RetentionDays` (min 7, max 3650). Single global retention in v1 —
-  no per-channel overrides.
+  `AuditLog:RetentionDays` (min 30, max 3650).
 - **Partitioning:** monthly partitions on `OccurredAtUtc` from day one
-  (`pf_AuditLog_Month` / `ps_AuditLog_Month`). Purge is a partition switch;
-  there are no row-level deletes at central.
+  (`pf_AuditLog_Month` / `ps_AuditLog_Month`). The global partition switch is
+  channel-blind; it drops a whole month once every row in it is older than the
+  global window. There are no row-level deletes at central for the global purge.
 - **Purge actor:** `AuditLogPurgeActor` singleton on the active central node
  runs daily, switches out any partition whose latest `OccurredAtUtc` is older
-  than the retention window, and emits an `AuditLog:Purged` event (partition
-  range, rowcount, duration). A partition-maintenance step rolls forward each
-  month, creating the next month's partition ahead of time.
+  than the retention window, then applies any per-channel overrides (see below),
+  and emits an `AuditLog:Purged` event (partition range, rowcount, duration) per
+  switched partition. A partition-maintenance step rolls forward each month,
+  creating the next month's partition ahead of time.
+- **Per-channel retention overrides (M5.5 T3):** `AuditLog:PerChannelRetentionDays`
+  is a dictionary keyed by canonical channel name (`ApiOutbound`, `DbOutbound`,
+  `Notification`, `ApiInbound`) whose value is a retention window in days. An
+  override takes effect only when it is strictly shorter than the global `RetentionDays`. After the daily
+  partition switch-out, the purge actor runs a bounded, batched row DELETE
+  (`PurgeChannelOlderThanAsync`) for each channel whose override is shorter than
+  the global window — expiring rows of that channel earlier than the global
+  partition switch would. Overrides equal to or longer than the global window are
+  silently skipped (the global switch already covers them). The DELETE runs under
+  `scadabridge_audit_purger` (the maintenance role); the append-only writer role
+  is unaffected. Batch size is configurable via
+  `AuditLogPurge:ChannelPurgeBatchSize` (default 5000). Each channel override
+  runs in its own try/catch, mirroring the per-boundary error-isolation of the
+  partition switch-out loop. Values are validated to be in
+  `[30, RetentionDays]`; keys that are not a recognized `AuditChannel` enum name
+  are rejected at startup.
 - **Sites:** daily site job; default 7-day retention (configurable, min 1,
  max 90). Respects the hard `ForwardState` invariant — `Pending` rows are
  never purged on age alone.
@@ -340,10 +400,13 @@ MS SQL for direct-write events). Unredacted secrets never persist.
  **AuditExport** permission.
 - **Payload redaction at write.** See Payload Capture Policy. Unredacted
  secrets never persist; the safety net over-redacts on misconfiguration.
- **Hash-chain tamper evidence — deferred to v1.x.** A future `RowHash` column,
-  computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will be
-  verifiable offline via `scadabridge audit verify-chain --month YYYY-MM`. Off by
-  default in v1.
+- **Hash-chain tamper evidence (T1) — deferred to v1.x.** A future `RowHash`
+  column, computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will
+  be verifiable offline via `scadabridge audit verify-chain --month YYYY-MM`. The
+  `verify-chain` CLI command is a no-op placeholder today. Off by default in v1.
+- **Parquet archival (T2) — deferred to v1.x.** Long-term cold storage of purged
+  monthly partitions as Parquet files (suitable for offline analytics) will be
+  added in a future milestone. T1 and T2 are not shipped as part of M5.
 - **Site SQLite security.** File permissions: read/write by the ScadaBridge
  service account only. Not backed up off-machine — site SQLite is a buffer,
  not a record.
@@ -355,11 +418,22 @@ Point-in-time, computed from the central `AuditLog` table; global and per-site.
 - **Audit volume** — events/min landing in the central `AuditLog`; global plus per-site sparkline.
 - **Audit error rate** — % of central `AuditLog` rows with `Status IN ('Failed', 'Parked', 'Discarded')` over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, permanent failures, parked deliveries) — NOT audit-writer health, which surfaces separately via `CentralAuditWriteFailures` and `AuditRedactionFailure`.
 - **Audit backlog** — sum of `Pending` site rows across sites; click drills into a per-site breakdown.
+- **`AuditInboundCeilingHits`** (M5.3 T7) — rolling count of inbound API request or response bodies truncated by the `InboundMaxBytes` ceiling; surfaced on the central health snapshot alongside `CentralAuditWriteFailures`.
+
+**Per-node stuck KPIs (M5.3 T6):** Both [Notification Outbox](Component-NotificationOutbox.md)
+and [Site Call Audit](Component-SiteCallAudit.md) now expose a
+`PerNodeNotificationKpiRequest` / `PerNodeSiteCallKpiRequest` message pair that
+groups the existing stuck, parked, and delivered-last-interval counts by the
+`SourceNode` that emitted the original row. This surfaces per-node breakdowns on
+the Health dashboard tiles and the Notification Outbox / Site Calls pages,
+making it possible to identify a single misbehaving node (e.g., `site-a:node-b`)
+as the source of a spike rather than a site-wide problem. The existing global and
+per-site KPI shapes are unchanged; the per-node slice is additive.

 [Notification Outbox](Component-NotificationOutbox.md) and
-[Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected — they remain
-sourced from `Notifications` and `SiteCalls` respectively. Audit Log KPIs
-describe the audit table itself.
+[Site Call Audit](Component-SiteCallAudit.md) KPIs are otherwise unaffected:
+their operational dispatch figures remain sourced from `Notifications` and
+`SiteCalls` respectively. Audit Log KPIs describe the audit table itself.

 ## Configuration

@@ -370,21 +444,78 @@ component (Options pattern):
 "AuditLog": {
  "DefaultCapBytes": 8192,
  "ErrorCapBytes": 65536,
+  "InboundMaxBytes": 1048576,
  "HeaderRedactList": [ "Authorization", "Cookie", "Set-Cookie", "X-API-Key" ],
  "GlobalBodyRedactors": [
    { "Pattern": "\"password\"\\s*:\\s*\"[^\"]+\"", "Replacement": "\"password\":\"<redacted>\"" }
  ],
  "PerTargetOverrides": {
    "Weather/GetForecast": { "CapBytes": 4096 },
-    "PlantDB":             { "RedactSqlParamsMatching": "@apikey|@token" }
+    "PlantDB":             { "RedactSqlParamsMatching": "@apikey|@token" },
+    "HighVolumeMethod":    { "SkipBodyCapture": true }
  },
-  "RetentionDays": 365
+  "RetentionDays": 365,
+  "PerChannelRetentionDays": {
+    "ApiOutbound":  90,
+    "Notification": 180
+  }
 }
 ```

 `PerTargetOverrides` keys bind by External System / Inbound Method /
-Notification List / Database Connection name. `RetentionDays` is a single
-global value in v1; per-channel overrides are deferred to v1.x.
+Notification List / Database Connection name. `SkipBodyCapture: true` omits
+`RequestSummary`/`ResponseSummary` for that method while still capturing headers
+into `Extra.requestHeaders` and emitting the full audit row. `RetentionDays` is
+the global window; `PerChannelRetentionDays` specifies per-channel windows that
+are strictly shorter — any channel whose override equals or exceeds the global
+value is silently ignored (the global partition switch-out already governs it).
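
The override rule for `PerChannelRetentionDays` can be sketched as a small function. This is an illustration of the rule as stated, not the purge actor's code; `effective_retention_days` is a hypothetical name.

```python
def effective_retention_days(channel, global_days, per_channel):
    """Resolve the retention window for one channel.

    An override applies only when it is strictly shorter than the global
    window; an equal or longer override is ignored and the global value wins.
    """
    override = per_channel.get(channel)
    if override is not None and override < global_days:
        return override
    return global_days


# Values taken from the sample configuration above.
overrides = {"ApiOutbound": 90, "Notification": 180}
print(effective_retention_days("ApiOutbound", 365, overrides))
print(effective_retention_days("Notification", 365, overrides))
```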
+
+The `AuditLogPurge` section controls the purge actor's cadence and batch size:
+
+```jsonc
+"AuditLogPurge": {
+  "IntervalHours": 24,
+  "ChannelPurgeBatchSize": 5000
+}
+```
+
+## Ops Notes — Historical Null Columns
+
+### `SourceNode` backfill (M5.6 T5)
+
+`SourceNode` (`varchar(64)` NULL) is a physical column stamped on every row at
+write time. Rows ingested before M5.6 shipped have `SourceNode IS NULL` because
+the value was not populated until the feature landed. A one-time CLI command sets
+these to a configurable sentinel:
+
+```
+scadabridge audit backfill-source-node --before <ISO-8601-UTC> [--sentinel unknown] [--batch 5000]
+```
+
+The default sentinel is `"unknown"`. The true node-of-origin for pre-feature rows
+is **unknowable** retroactively — the emitting node is long gone from the telemetry
+pipeline. The sentinel makes that explicit rather than leaving the column NULL
+(which the Audit Log UI's Node filter already treats as "unresolved", but which
+an operator might mistake for a data-quality bug).
+
+The backfill runs via `POST /api/audit/backfill-source-node` (Admin role required)
+on the maintenance/purge path, not under the append-only
+`scadabridge_audit_writer` role. It is idempotent and can be re-run safely.
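
A rough sketch of why the batched backfill is idempotent, using an in-memory SQLite table purely as a runnable stand-in for the central store. Only the `SourceNode` and `OccurredAtUtc` column names come from this document; the table shape, the `backfill_source_node` helper, and the use of SQLite are assumptions for illustration.

```python
import sqlite3


def backfill_source_node(conn, before_utc, sentinel="unknown", batch=5000):
    """Set SourceNode to the sentinel on NULL rows older than before_utc.

    Works in batches and stops when a pass touches no rows. Rows that already
    carry a value (including the sentinel) no longer match the predicate, so
    running the function again changes nothing.
    """
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE AuditLog SET SourceNode = ?
            WHERE rowid IN (
                SELECT rowid FROM AuditLog
                WHERE SourceNode IS NULL AND OccurredAtUtc < ?
                LIMIT ?
            )
            """,
            (sentinel, before_utc, batch),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


def make_demo_db():
    # Hypothetical two-column table; timestamps are ISO-8601 UTC strings in
    # one fixed format, so plain string comparison orders them correctly.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE AuditLog (OccurredAtUtc TEXT, SourceNode TEXT)")
    conn.executemany(
        "INSERT INTO AuditLog VALUES (?, ?)",
        [
            ("2026-01-01T00:00:00Z", None),
            ("2026-02-01T00:00:00Z", None),
            ("2026-03-01T00:00:00Z", None),
            ("2026-03-15T00:00:00Z", "site-a:node-b"),
            ("2026-07-01T00:00:00Z", None),
        ],
    )
    conn.commit()
    return conn
```

The `NULL`-only predicate is what carries the idempotency: the update never overwrites a real node name, and a second run finds nothing left to change.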
+
+### `ExecutionId` and `ParentExecutionId` — cannot be backfilled
+
+`ExecutionId` and `ParentExecutionId` are **PERSISTED COMPUTED columns** derived
+from `DetailsJson`. The identifiers entered the payload in the same feature window
+that added the columns, so each value depends entirely on the JSON written at ingest time.
+
+The AuditLog append-only invariant **forbids mutating `DetailsJson`** — rows may
+only be inserted, never updated. Because backfilling the computed values would
+require rewriting the underlying `DetailsJson`, it is impossible under the
+append-only contract. Pre-feature rows carry `NULL` in both columns permanently.
+
+This is a documented limitation, not a defect. The NULL values are visible in the
+Audit Log UI's execution-tree drilldown (rows with no `ExecutionId` appear as
+orphaned entries) and in the CLI's `audit tree` output.

 ## Dependencies

@@ -442,6 +573,8 @@ global value in v1; per-channel overrides are deferred to v1.x.
  tiles (Volume, Error rate, Backlog) plus new health metrics:
  `SiteAuditBacklog`, `SiteAuditWriteFailures`, `SiteAuditTelemetryStalled`,
  `CentralAuditWriteFailures`, `AuditRedactionFailure`.
-- **[CLI (#19)](Component-CLI.md)** — new `scadabridge audit query`,
-  `scadabridge audit export`, and `scadabridge audit verify-chain` commands; same
-  permission requirements as the UI.
+- **[CLI (#19)](Component-CLI.md)** — `scadabridge audit query`,
+  `scadabridge audit export`, `scadabridge audit tree --execution-id <guid>`,
+  `scadabridge audit backfill-source-node --sentinel <s> --before <date>`, and
+  `scadabridge audit verify-chain` (no-op placeholder for the deferred hash-chain
+  feature); same permission requirements as the UI.
@@ -228,14 +228,17 @@ The new centralized Audit Log component (#23) is exposed via the `scadabridge au
 The `scadabridge audit` group targets the centralized Audit Log component (#23) and
 exposes the UI-equivalent operational audit surface. Permissions follow the same
 read-vs-export split the Central UI uses (see Component-AuditLog.md, Security &
-Tamper-Evidence, and Security & Auth #10): `audit query` and `audit verify-chain`
-require the `OperationalAudit` permission; `audit export` additionally requires
-`AuditExport`. The server enforces permission checks and returns HTTP 403 (CLI
-exit code 2) on denial.
+Tamper-Evidence, and Security & Auth #10): `audit query`, `audit tree`, and
+`audit verify-chain` require the `OperationalAudit` permission; `audit export`
+additionally requires `AuditExport`; `audit backfill-source-node` requires the
+`Admin` role (maintenance path only). The server enforces permission checks and
+returns HTTP 403 (CLI exit code 2) on denial.

 ```
 scadabridge audit query [--since <t>] [--until <t>] [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--target <t>] [--actor <a>] [--correlation-id <id>] [--execution-id <id>] [--parent-execution-id <id>] [--errors-only] [--page-size <n>] [--all]
 scadabridge audit export --since <t> --until <t> --format csv|jsonl|parquet --output <path> [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--target <t>] [--actor <a>]
+scadabridge audit tree --execution-id <guid> [--format table|json]
+scadabridge audit backfill-source-node --before <ISO-8601-UTC> [--sentinel <value>] [--batch <n>]
 scadabridge audit verify-chain --month <YYYY-MM>
 ```
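
The `audit tree` lookup resolves the root from any execution in a chain and then traverses downward. A minimal in-memory sketch of that walk follows, assuming each audit row reduces to an `(ExecutionId, ParentExecutionId)` pair; the `execution_tree` and `render` helpers are hypothetical and are not the `GetExecutionTreeAsync` implementation.

```python
from collections import defaultdict


def execution_tree(rows, execution_id):
    """Return (depth, execution_id) pairs for the whole chain, root first."""
    # Many audit rows can share one ExecutionId; keep one parent link per id.
    parent = {}
    for eid, pid in rows:
        parent.setdefault(eid, pid)

    children = defaultdict(list)
    for eid, pid in parent.items():
        if pid is not None:
            children[pid].append(eid)

    # Walk upward to the root; the seen-set guards against a cyclic link.
    root, seen = execution_id, {execution_id}
    while parent.get(root) is not None and parent[root] not in seen:
        root = parent[root]
        seen.add(root)

    # Walk downward depth-first so each child follows its parent.
    out, stack, visited = [], [(0, root)], set()
    while stack:
        depth, node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        out.append((depth, node))
        for child in sorted(children[node], reverse=True):
            stack.append((depth + 1, child))
    return out


def render(tree):
    """Indent each execution by its depth, similar in spirit to a table view."""
    return "\n".join("  " * depth + node for depth, node in tree)
```

Starting from a leaf or from the root yields the same tree, which matches the "any node in the chain" behaviour the command describes.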

@@ -247,6 +250,18 @@ scadabridge audit verify-chain --month <YYYY-MM>
  requested format (`csv`, `jsonl`, `parquet`) written to `--output`. The server
  streams rows rather than materializing them in memory; the CLI writes bytes
  through to disk. Supports the same scoping filters as `audit query`.
+- `audit tree --execution-id <guid>` (M5.3 T8) — renders the full execution-chain
+  tree for the given `ExecutionId`. The server resolves the root from any node in
+  the chain (walks `ParentExecutionId` to find the root, then traverses downward)
+  and returns all reachable executions with their summary row counts and first/last
+  occurred timestamps. Output format: `json` (default — structured tree suitable
+  for scripting) or `table` (human-readable indented tree). Requires
+  `OperationalAudit` permission. Backed by `GET /api/audit/tree?executionId=<guid>`.
+- `audit backfill-source-node --before <ISO-8601-UTC>` (M5.6 T5) — sets
+  `SourceNode` to a sentinel value (`--sentinel`, default `"unknown"`) on pre-feature
+  rows where `SourceNode IS NULL` and `OccurredAtUtc < --before`, in batches
+  (`--batch`, default 5000). Admin-only maintenance command. Idempotent.
+  Backed by `POST /api/audit/backfill-source-node`.
 - `audit verify-chain` — hash-chain verification for the named month.
  **No-op in v1**: the command is defined so the command tree is stable, but
  verification only becomes meaningful once the hash-chain ships (see
@@ -366,7 +381,7 @@ Configuration is resolved in the following priority order (highest wins):
 - **System.CommandLine**: Command-line argument parsing.
 - **Microsoft.AspNetCore.SignalR.Client**: SignalR client for the `debug stream` command's WebSocket connection.
 - **Management Service (#18)**: The CLI hits the central cluster via the existing HTTP Management API (`POST /management`), which dispatches to the ManagementActor. The `scadabridge audit` command group rides a parallel REST surface on the same Host (`GET /api/audit/query` and `GET /api/audit/export`), sharing HTTP Basic Auth with `/management` but bypassing the actor for read-only, keyset-paged / streaming workloads.
-- **Audit Log (#23)**: The `scadabridge audit query` and `audit export` subcommands target the centralized Audit Log component's REST endpoints (`GET /api/audit/query`, `GET /api/audit/export`) on the Host's Management API surface; `audit verify-chain` rides `POST /management` until hash-chain verification ships. Permission checks (`OperationalAudit`, `AuditExport`) are enforced server-side by `AuditEndpoints`.
+- **Audit Log (#23)**: The `scadabridge audit query`, `audit export`, `audit tree`, and `audit backfill-source-node` subcommands target the centralized Audit Log component's REST endpoints (`GET /api/audit/query`, `GET /api/audit/export`, `GET /api/audit/tree`, `POST /api/audit/backfill-source-node`) on the Host's Management API surface; `audit verify-chain` is a client-side no-op today (hash-chain deferred to v1.x). Permission checks (`OperationalAudit`, `AuditExport`, `Admin`) are enforced server-side by `AuditEndpoints`.

 ## Interactions

@@ -189,6 +189,7 @@ Inbound API scripts **cannot** call shared scripts directly — shared scripts a
 - `Route.To("instanceUniqueCode").GetAttributes("attr1", "attr2", ...)` — Read multiple attribute values in a **single call**, returned as a dictionary of name-value pairs.
 - `Route.To("instanceUniqueCode").SetAttribute("attributeName", value)` — Write a single attribute value on a specific instance at any site.
 - `Route.To("instanceUniqueCode").SetAttributes(dictionary)` — Write multiple attribute values in a **single call**, accepting a dictionary of name-value pairs.
+- `Route.To("instanceUniqueCode").WaitForAttribute("attributeName", targetValue, timeout)` — Wait, event-driven, until an attribute on a specific instance at any site reaches `targetValue` (value-equality only across the wire), bounded by `timeout`. Returns `true` if matched within the timeout, `false` if it timed out. The cluster call is bounded by the wait timeout rather than the generic integration timeout.
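
The wait semantics of `WaitForAttribute` (event-driven, value-equality match, `true` on match and `false` on timeout) can be modelled with a small asyncio sketch. This only illustrates the contract; the `AttributeWatcher` class is hypothetical and says nothing about how the cluster call is actually routed.

```python
import asyncio


class AttributeWatcher:
    """Holds one attribute value and wakes waiters whenever it changes."""

    def __init__(self, value=None):
        self._value = value
        self._changed = asyncio.Condition()

    async def set(self, value):
        async with self._changed:
            self._value = value
            self._changed.notify_all()

    async def wait_for(self, target, timeout):
        """Return True once the value equals target, False if timeout elapses."""

        async def until_match():
            async with self._changed:
                # No polling: the waiter sleeps until set() notifies it.
                await self._changed.wait_for(lambda: self._value == target)

        try:
            await asyncio.wait_for(until_match(), timeout)
            return True
        except asyncio.TimeoutError:
            return False


async def demo():
    watcher = AttributeWatcher(0)

    async def writer():
        await asyncio.sleep(0.01)
        await watcher.set(5)

    task = asyncio.create_task(writer())
    matched = await watcher.wait_for(5, timeout=1.0)
    await task
    timed_out = await watcher.wait_for(99, timeout=0.05)
    return matched, timed_out
```

The timeout wraps the whole wait, mirroring the statement that the call is bounded by the wait timeout rather than by a separate generic limit.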

 #### Input/Output
 - **Input parameters** are available as defined in the method definition.