# M5 — Audit Hardening (T3–T8) — Design

**Status:** Approved (awaiting plan).

**Worktree/branch:** `worktree-m5-audit-hardening` off `main` (`e77e209`).

**Source:** Phase-2 milestone M5 from `docs/plans/2026-06-15-stillpending-completion-design.md`.

## Goal

Harden the centralized Audit Log with six independent, ready-to-build items. Two items originally listed under M5 — **T1 hash-chain tamper evidence** and **T2 Parquet export** — remain **deferred to v1.x** (per CLAUDE.md's audit design decisions); their stubs (CLI `verify-chain` no-op, export `501`) stay unchanged.

## Scope (in)

T3 per-channel retention · T4 ParentExecutionId tag-cascade · T5 historical backfill (reframed) · T6 per-node stuck KPIs · T7 structured response-capture increments · T8 CLI `audit tree`.

## Scope (out / deferred to v1.x)

T1 hash-chain (no Hash/PrevHash columns, no real verify-chain), T2 Parquet export (the `501` gate stays). Reversing those deferrals is a separate decision.

---

## Items

### T8 — CLI `audit tree` (smallest; reuses existing server walk + UI)

The recursive execution-tree walk (`IAuditLogRepository.GetExecutionTreeAsync`, backed by `IX_AuditLog_ParentExecution`) and the Blazor `ExecutionTreePage` already exist; only an HTTP projection + CLI surface are missing.

- **Server:** add `GET /api/audit/tree?executionId=…` in `AuditEndpoints.MapAuditAPI` → `repo.GetExecutionTreeAsync` → serialize `ExecutionTreeNode[]`.
- **CLI:** add `audit tree --execution-id [--format table|json]` in `AuditCommands` + an `AuditTreeHelpers` renderer (indented ASCII tree for `table`; raw nodes for `json`), mirroring `AuditQueryHelpers`/`AuditExportHelpers`.
- No schema change. **Tests:** endpoint returns the tree; CLI renders a multi-level tree + handles not-found.

### T6 — Per-node stuck-count KPIs

KPIs are per-site today; `SourceNode` is on the `Notification` and `SiteCalls` rows but not aggregated.

- Add `ComputePerNodeKpisAsync` (group by `SourceNode`) parallel to the existing `ComputePerSiteKpisAsync` in `NotificationOutboxRepository` and `SiteCallAuditRepository`.
- New `PerNode…KpiRequest`/`Response` message pair per actor; register in each actor's `Receive<>`.
- Surface a per-node breakdown on the existing KPI tiles (`AuditKpiTiles`/`SiteCallKpiTiles`) — additive, behind the existing tiles.
- **Tests:** repository grouping returns correct per-node counts (stuck/parked/queue-depth); message round-trip.

### T7 — Structured response-capture increments (no schema change)

- **(a) Inbound request headers** → captured into the existing `Extra` JSON in `AuditWriteMiddleware.EmitInboundAudit`, passed through the existing header redactor (auth headers redacted by default).
- **(b) `AuditInboundCeilingHits`** counter on `AuditCentralHealthSnapshot` (alongside the existing failure counters), incremented when an inbound row truncates (request or response hits `InboundMaxBytes`). Surfaced via the health snapshot.
- **(c) Per-method opt-out** of body capture: a `SkipBodyCapture` flag on `PerTargetRedactionOverride`, checked in the capture pipeline so a noisy/sensitive method can suppress body capture (headers + metadata still recorded).
- **Tests:** request headers land in `Extra` and are redacted; ceiling-hit increments the counter; opt-out suppresses body but keeps the row.

### T4 — `ParentExecutionId` tag-cascade (touches the actor model — high-risk)

Completes the execution tree beyond the inbound-API→routed-script case.

- **Alarm on-trigger:** thread a `Guid? parentExecutionId` through `AlarmActor.SpawnAlarmExecutionActor` → `AlarmExecutionActor` → `ScriptRuntimeContext`, so an alarm-triggered script chains to its firing context (the alarm's own execution id where one exists; otherwise a root).
- **Nested `CallScript`/`CallShared`:** in `ScriptRuntimeContext`, pass **the current run's `ExecutionId`** (not the inherited `_parentExecutionId`) as the child invocation's `ParentExecutionId`, so `A → CallScript(B)` records B's parent as A — a true multi-level tree.
- **Timer/expression-trigger top-level runs** stay roots (no spawner) — unchanged.
- **Tests:** alarm-triggered script row carries the expected parent; a 2-level nested `CallScript` produces a chain A→B→C walkable by `GetExecutionTreeAsync`.
- **Risk:** serialized actor state + correlation plumbing; covered by targeted SiteRuntime actor tests + a tree-walk integration assertion.

### T3 — Per-channel retention overrides (one design wrinkle, resolved)

Retention is a single global `RetentionDays`; the purge actor switches out whole month partitions by `OccurredAtUtc` (channel-blind).

- Add `PerChannelRetentionDays` (`Dictionary`, keyed by channel / `Action` name) to `AuditLogOptions`, validated like the global value; a channel override may only be **shorter** than the global window (longer is meaningless under month-partition switch-out, which is governed by the largest retention).
- **Mechanism (resolved):** after the coarse global partition purge, the purge actor runs a **bounded row-level delete** for channels whose override is shorter than global (`DELETE … WHERE Action=@channel AND OccurredAtUtc<@thr`, batched). This runs from the **purge/maintenance path, not the writer role** — the append-only invariant binds the writer/ingest role, not maintenance. The **M2.10 CI grep-guard is widened** to allow the purge actor's single audited deletion call site (an allow-list entry, not a blanket exemption).
- **Tests:** a channel with a shorter override is purged earlier than the global; channels without an override follow the global; the guard still rejects UPDATE/DELETE everywhere except the sanctioned purge site.

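
The T3 rules above (overrides must be shorter than the global window, and only overridden channels get a row-level delete threshold) can be sketched in a language-neutral way. This is a minimal Python illustration, not the project's implementation: the field names mirror `AuditLogOptions`, but `validate`, `channel_purge_thresholds`, the "positive integer" check, and the channel names `ApiInbound` and `Notification` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AuditLogOptions:
    # Global retention window in days (RetentionDays in the design).
    retention_days: int
    # Per-channel overrides keyed by channel / Action name (PerChannelRetentionDays).
    per_channel_retention_days: dict = field(default_factory=dict)


def validate(options: AuditLogOptions) -> list:
    """Return a list of validation errors; an empty list means the options are valid."""
    errors = []
    if options.retention_days <= 0:  # assumed rule for the global value
        errors.append("RetentionDays must be positive")
    for channel, days in options.per_channel_retention_days.items():
        if days <= 0:
            errors.append(f"{channel}: override must be positive")
        elif days >= options.retention_days:
            # Longer (or equal) overrides are meaningless under partition switch-out.
            errors.append(f"{channel}: override must be shorter than RetentionDays")
    return errors


def channel_purge_thresholds(options: AuditLogOptions, now_utc: datetime) -> dict:
    """Map each overridden channel to the @thr cutoff for its row-level delete."""
    return {
        channel: now_utc - timedelta(days=days)
        for channel, days in options.per_channel_retention_days.items()
        if days < options.retention_days
    }


if __name__ == "__main__":
    opts = AuditLogOptions(365, {"ApiInbound": 30, "Notification": 90})
    now = datetime(2026, 7, 1, tzinfo=timezone.utc)
    print(validate(opts))
    for channel, threshold in channel_purge_thresholds(opts, now).items():
        print(channel, threshold.isoformat())
```

Channels without an override never appear in the threshold map, so they fall through to the coarse global partition purge only.
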

### T5 — Historical backfill (reframed per the computed-column reality)

- **`SourceNode`** is a physical nullable column. For truly historical rows the node-of-origin is **unknowable**, so the backfill sets a **configurable sentinel** (default `"unknown"`) on `NULL` rows via a one-shot maintenance command (run from the purge/maintenance path), rather than guessing a node.
- **`ExecutionId`/`ParentExecutionId`** are **persisted computed columns derived from `DetailsJson`**; backfilling them means mutating the JSON, which append-only forbids. These are **documented as a runbook limitation** (pre-feature rows stay NULL) — no code.
- **Tests:** the SourceNode backfill sets the sentinel only on NULL rows within a bounded range and is idempotent; documentation note added.

---

## Cross-cutting

- **Shared seams:** `AuditLogOptions` (T3, T7), `AuditEndpoints.MapAuditAPI` (T8), `AuditCommands` (T8), `AuditCentralHealthSnapshot` (T6, T7), `IAuditLogRepository`/the KPI repositories (T6), the purge/maintenance role (T3, T5). No AuditLog **schema** change in M5 (T1/T2 deferred).
- **Append-only:** the only new mutations are T3's purge-role channel DELETE and T5's purge-role sentinel UPDATE. Both run on the maintenance path and both are reflected in the CI guard's allow-list. Writer/ingest paths stay INSERT-only.

## Testing strategy

Per-item unit + targeted integration tests (above). T4 additionally gets a tree-walk integration assertion. Full-solution build + targeted suites at the integration step. No new infra dependency (Parquet deferred).

## Sequencing

Independent items, parallelizable by disjoint area:

- **Wave A (parallel):** T8 (CLI+endpoint), T6 (KPI repos+actors+tiles), T7 (middleware+health+redaction-override) — disjoint projects.
- **Wave B (parallel):** T4 (SiteRuntime actors — high-risk), T3 (AuditLog options+purge actor+CI guard), T5 (purge-path backfill command + runbook).
- **Wave C:** integration verification + docs (Component-AuditLog/-CLI, CLAUDE.md KPI/retention notes, runbook).

## Risks

- **T4** actor-model correlation (serialized state) — targeted tests + tree-walk assertion.
- **T3** append-only tension — resolved via maintenance-role delete + CI-guard allow-list; verify the guard still blocks all other DELETE/UPDATE.
- **T5** node-of-origin unknowable — sentinel + documented limitation (no false precision).
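
To make the T4 tree-walk assertion concrete, here is a small Python model of the nesting rule: each child invocation receives the calling run's own execution id as its parent, never the parent the caller inherited. It is a hedged sketch only. The class borrows the name `ScriptRuntimeContext` for readability, but its shape, the in-memory `log`, and the upward `walk_to_root` helper are hypothetical and do not reflect how `GetExecutionTreeAsync` is implemented.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScriptRun:
    # One audit row for a script run (illustrative shape, not the real schema).
    name: str
    execution_id: uuid.UUID
    parent_execution_id: Optional[uuid.UUID]


class ScriptRuntimeContext:
    def __init__(self, name, parent_execution_id=None, log=None):
        self.execution_id = uuid.uuid4()
        self._parent_execution_id = parent_execution_id
        self.log = log if log is not None else []
        self.log.append(ScriptRun(name, self.execution_id, parent_execution_id))

    def call_script(self, name):
        # Pass this run's own execution id, not the inherited parent,
        # so the child is recorded one level below the caller.
        return ScriptRuntimeContext(name, self.execution_id, self.log)


def walk_to_root(rows, execution_id):
    """Follow parent links upward and return the run names, leaf first."""
    by_id = {row.execution_id: row for row in rows}
    chain = []
    current = by_id.get(execution_id)
    while current is not None:
        chain.append(current.name)
        current = by_id.get(current.parent_execution_id)
    return chain


if __name__ == "__main__":
    a = ScriptRuntimeContext("A")   # top-level run: a root with no parent
    b = a.call_script("B")          # A -> CallScript(B)
    c = b.call_script("C")          # B -> CallScript(C)
    print(walk_to_root(a.log, c.execution_id))  # prints ['C', 'B', 'A']
```

Had `call_script` forwarded `self._parent_execution_id` instead, B and C would both attach to A's parent (here, none), and the walk from C would stop after a single node.
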