From 1b7600fcb353a40d51fa895d6961cd36996d1b4f Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Tue, 16 Jun 2026 21:10:21 -0400 Subject: [PATCH] =?UTF-8?q?docs(m5):=20design=20=E2=80=94=20audit=20harden?= =?UTF-8?q?ing=20T3-T8=20(T1=20hash-chain=20+=20T2=20Parquet=20stay=20defe?= =?UTF-8?q?rred)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../2026-06-16-m5-audit-hardening-design.md | 150 ++++++++++++++++++ 1 file changed, 150 insertions(+) create mode 100644 docs/plans/2026-06-16-m5-audit-hardening-design.md diff --git a/docs/plans/2026-06-16-m5-audit-hardening-design.md b/docs/plans/2026-06-16-m5-audit-hardening-design.md new file mode 100644 index 00000000..f2af7c4f --- /dev/null +++ b/docs/plans/2026-06-16-m5-audit-hardening-design.md @@ -0,0 +1,150 @@ +# M5 — Audit Hardening (T3–T8) — Design + +**Status:** Approved (awaiting plan). +**Worktree/branch:** `worktree-m5-audit-hardening` off `main` (`e77e209`). +**Source:** Phase-2 milestone M5 from `docs/plans/2026-06-15-stillpending-completion-design.md`. + +## Goal + +Harden the centralized Audit Log with six independent, ready-to-build items. Two +items originally listed under M5 — **T1 hash-chain tamper evidence** and **T2 +Parquet export** — remain **deferred to v1.x** (per CLAUDE.md's audit design +decisions); their stubs (CLI `verify-chain` no-op, export `501`) stay unchanged. + +## Scope (in) + +T3 per-channel retention · T4 ParentExecutionId tag-cascade · T5 historical +backfill (reframed) · T6 per-node stuck KPIs · T7 structured response-capture +increments · T8 CLI `audit tree`. + +## Scope (out / deferred to v1.x) + +T1 hash-chain (no Hash/PrevHash columns, no real verify-chain), T2 Parquet +export (the `501` gate stays). Reversing those deferrals is a separate decision. + +--- + +## Items + +### T8 — CLI `audit tree` (smallest; reuses existing server walk + UI) +The recursive execution-tree walk (`IAuditLogRepository.GetExecutionTreeAsync`, +backed by `IX_AuditLog_ParentExecution`) and the Blazor `ExecutionTreePage` +already exist; only an HTTP projection + CLI surface are missing. +- **Server:** add `GET /api/audit/tree?executionId=…` in + `AuditEndpoints.MapAuditAPI` → `repo.GetExecutionTreeAsync` → serialize + `ExecutionTreeNode[]`. +- **CLI:** add `audit tree --execution-id [--format table|json]` in + `AuditCommands` + an `AuditTreeHelpers` renderer (indented ASCII tree for + `table`; raw nodes for `json`), mirroring `AuditQueryHelpers`/`AuditExportHelpers`. +- No schema change. **Tests:** endpoint returns the tree; CLI renders a + multi-level tree + handles not-found. + +### T6 — Per-node stuck-count KPIs +KPIs are per-site today; `SourceNode` is on the `Notification` and `SiteCalls` +rows but not aggregated. +- Add `ComputePerNodeKpisAsync` (group by `SourceNode`) parallel to the existing + `ComputePerSiteKpisAsync` in `NotificationOutboxRepository` and + `SiteCallAuditRepository`. +- New `PerNode…KpiRequest`/`Response` message pair per actor; register in each + actor's `Receive<>`. +- Surface a per-node breakdown on the existing KPI tiles + (`AuditKpiTiles`/`SiteCallKpiTiles`) — additive, behind the existing tiles. +- **Tests:** repository grouping returns correct per-node counts (stuck/parked/ + queue-depth); message round-trip. + +### T7 — Structured response-capture increments (no schema change) +- **(a) Inbound request headers** → captured into the existing `Extra` JSON in + `AuditWriteMiddleware.EmitInboundAudit`, passed through the existing header + redactor (auth headers redacted by default). +- **(b) `AuditInboundCeilingHits`** counter on `AuditCentralHealthSnapshot` + (alongside the existing failure counters), incremented when an inbound row + truncates (request or response hits `InboundMaxBytes`). Surfaced via the + health snapshot. +- **(c) Per-method opt-out** of body capture: a `SkipBodyCapture` flag on + `PerTargetRedactionOverride`, checked in the capture pipeline so a noisy/ + sensitive method can suppress body capture (headers + metadata still recorded). +- **Tests:** request headers land in `Extra` and are redacted; ceiling-hit + increments the counter; opt-out suppresses body but keeps the row. + +### T4 — `ParentExecutionId` tag-cascade (touches the actor model — high-risk) +Completes the execution tree beyond the inbound-API→routed-script case. +- **Alarm on-trigger:** thread a `Guid? parentExecutionId` through + `AlarmActor.SpawnAlarmExecutionActor` → `AlarmExecutionActor` → + `ScriptRuntimeContext`, so an alarm-triggered script chains to its firing + context (the alarm's own execution id where one exists; otherwise a root). +- **Nested `CallScript`/`CallShared`:** in `ScriptRuntimeContext`, pass **the + current run's `ExecutionId`** (not the inherited `_parentExecutionId`) as the + child invocation's `ParentExecutionId`, so `A → CallScript(B)` records B's + parent as A — a true multi-level tree. +- **Timer/expression-trigger top-level runs** stay roots (no spawner) — unchanged. +- **Tests:** alarm-triggered script row carries the expected parent; a 2-level + nested `CallScript` produces a chain A→B→C walkable by `GetExecutionTreeAsync`. +- **Risk:** serialized actor state + correlation plumbing; covered by targeted + SiteRuntime actor tests + a tree-walk integration assertion. + +### T3 — Per-channel retention overrides (one design wrinkle, resolved) +Retention is a single global `RetentionDays`; the purge actor switches out whole +month partitions by `OccurredAtUtc` (channel-blind). +- Add `PerChannelRetentionDays` (`Dictionary`, keyed by channel / + `Action` name) to `AuditLogOptions`, validated like the global value; a channel + override may only be **shorter** than the global window (longer is meaningless + under month-partition switch-out, which is governed by the largest retention). +- **Mechanism (resolved):** after the coarse global partition purge, the purge + actor runs a **bounded row-level delete** for channels whose override is + shorter than global (`DELETE … WHERE Action=@channel AND OccurredAtUtc<@thr`, + batched). This runs from the **purge/maintenance path, not the writer role** — + the append-only invariant binds the writer/ingest role, not maintenance. The + **M2.10 CI grep-guard is widened** to allow the purge actor's single audited + deletion call site (an allow-list entry, not a blanket exemption). +- **Tests:** a channel with a shorter override is purged earlier than the global; + channels without an override follow the global; the guard still rejects + UPDATE/DELETE everywhere except the sanctioned purge site. + +### T5 — Historical backfill (reframed per the computed-column reality) +- **`SourceNode`** is a physical nullable column. For truly historical rows the + node-of-origin is **unknowable**, so the backfill sets a **configurable + sentinel** (default `"unknown"`) on `NULL` rows via a one-shot maintenance + command (run from the purge/maintenance path), rather than guessing a node. +- **`ExecutionId`/`ParentExecutionId`** are **persisted computed columns derived + from `DetailsJson`**; backfilling them means mutating the JSON, which + append-only forbids. These are **documented as a runbook limitation** (pre-feature + rows stay NULL) — no code. +- **Tests:** the SourceNode backfill sets the sentinel only on NULL rows within a + bounded range and is idempotent; documentation note added. + +--- + +## Cross-cutting + +- **Shared seams:** `AuditLogOptions` (T3, T7), `AuditEndpoints.MapAuditAPI` + (T8), `AuditCommands` (T8), `AuditCentralHealthSnapshot` (T6, T7), + `IAuditLogRepository`/the KPI repositories (T6), the purge/maintenance role + (T3, T5). No AuditLog **schema** change in M5 (T1/T2 deferred). +- **Append-only:** the only new deletion is T3's purge-role channel delete + + T5's purge-role sentinel UPDATE — both maintenance-path, both reflected in the + CI guard's allow-list. Writer/ingest paths stay INSERT-only. + +## Testing strategy + +Per-item unit + targeted integration tests (above). T4 additionally gets a +tree-walk integration assertion. Full-solution build + targeted suites at the +integration step. No new infra dependency (Parquet deferred). + +## Sequencing + +Independent items, parallelizable by disjoint area: +- **Wave A (parallel):** T8 (CLI+endpoint), T6 (KPI repos+actors+tiles), T7 + (middleware+health+redaction-override) — disjoint projects. +- **Wave B (parallel):** T4 (SiteRuntime actors — high-risk), T3 (AuditLog + options+purge actor+CI guard), T5 (purge-path backfill command + runbook). +- **Wave C:** integration verification + docs (Component-AuditLog/-CLI, CLAUDE.md + KPI/retention notes, runbook). + +## Risks + +- **T4** actor-model correlation (serialized state) — targeted tests + tree-walk + assertion. +- **T3** append-only tension — resolved via maintenance-role delete + CI-guard + allow-list; verify the guard still blocks all other DELETE/UPDATE. +- **T5** node-of-origin unknowable — sentinel + documented limitation (no false + precision).