Files

T

Joseph Doherty 1b7600fcb3 docs(m5): design — audit hardening T3-T8 (T1 hash-chain + T2 Parquet stay deferred)

2026-06-16 21:10:21 -04:00

8.0 KiB

Raw Blame History

M5 — Audit Hardening (T3–T8) — Design

Status: Approved (awaiting plan). Worktree/branch: worktree-m5-audit-hardening off main (e77e209). Source: Phase-2 milestone M5 from docs/plans/2026-06-15-stillpending-completion-design.md.

Goal

Harden the centralized Audit Log with six independent, ready-to-build items. Two items originally listed under M5 — T1 hash-chain tamper evidence and T2 Parquet export — remain deferred to v1.x (per CLAUDE.md's audit design decisions); their stubs (CLI verify-chain no-op, export 501) stay unchanged.

Scope (in)

T3 per-channel retention · T4 ParentExecutionId tag-cascade · T5 historical backfill (reframed) · T6 per-node stuck KPIs · T7 structured response-capture increments · T8 CLI audit tree.

Scope (out / deferred to v1.x)

T1 hash-chain (no Hash/PrevHash columns, no real verify-chain), T2 Parquet export (the 501 gate stays). Reversing those deferrals is a separate decision.

Items

T8 — CLI `audit tree` (smallest; reuses existing server walk + UI)

The recursive execution-tree walk (IAuditLogRepository.GetExecutionTreeAsync, backed by IX_AuditLog_ParentExecution) and the Blazor ExecutionTreePage already exist; only an HTTP projection + CLI surface are missing.

Server: add GET /api/audit/tree?executionId=… in AuditEndpoints.MapAuditAPI → repo.GetExecutionTreeAsync → serialize ExecutionTreeNode[].
CLI: add audit tree --execution-id <guid> [--format table|json] in AuditCommands + an AuditTreeHelpers renderer (indented ASCII tree for table; raw nodes for json), mirroring AuditQueryHelpers/AuditExportHelpers.
No schema change. Tests: endpoint returns the tree; CLI renders a multi-level tree + handles not-found.

T6 — Per-node stuck-count KPIs

KPIs are per-site today; SourceNode is on the Notification and SiteCalls rows but not aggregated.

Add ComputePerNodeKpisAsync (group by SourceNode) parallel to the existing ComputePerSiteKpisAsync in NotificationOutboxRepository and SiteCallAuditRepository.
New PerNode…KpiRequest/Response message pair per actor; register in each actor's Receive<>.
Surface a per-node breakdown on the existing KPI tiles (AuditKpiTiles/SiteCallKpiTiles) — additive, behind the existing tiles.
Tests: repository grouping returns correct per-node counts (stuck/parked/ queue-depth); message round-trip.

T7 — Structured response-capture increments (no schema change)

(a) Inbound request headers → captured into the existing Extra JSON in AuditWriteMiddleware.EmitInboundAudit, passed through the existing header redactor (auth headers redacted by default).
(b) AuditInboundCeilingHits counter on AuditCentralHealthSnapshot (alongside the existing failure counters), incremented when an inbound row truncates (request or response hits InboundMaxBytes). Surfaced via the health snapshot.
(c) Per-method opt-out of body capture: a SkipBodyCapture flag on PerTargetRedactionOverride, checked in the capture pipeline so a noisy/ sensitive method can suppress body capture (headers + metadata still recorded).
Tests: request headers land in Extra and are redacted; ceiling-hit increments the counter; opt-out suppresses body but keeps the row.

T4 — `ParentExecutionId` tag-cascade (touches the actor model — high-risk)

Completes the execution tree beyond the inbound-API→routed-script case.

Alarm on-trigger: thread a Guid? parentExecutionId through AlarmActor.SpawnAlarmExecutionActor → AlarmExecutionActor → ScriptRuntimeContext, so an alarm-triggered script chains to its firing context (the alarm's own execution id where one exists; otherwise a root).
Nested CallScript/CallShared: in ScriptRuntimeContext, pass the current run's ExecutionId (not the inherited _parentExecutionId) as the child invocation's ParentExecutionId, so A → CallScript(B) records B's parent as A — a true multi-level tree.
Timer/expression-trigger top-level runs stay roots (no spawner) — unchanged.
Tests: alarm-triggered script row carries the expected parent; a 2-level nested CallScript produces a chain A→B→C walkable by GetExecutionTreeAsync.
Risk: serialized actor state + correlation plumbing; covered by targeted SiteRuntime actor tests + a tree-walk integration assertion.

T3 — Per-channel retention overrides (one design wrinkle, resolved)

Retention is a single global RetentionDays; the purge actor switches out whole month partitions by OccurredAtUtc (channel-blind).

Add PerChannelRetentionDays (Dictionary<string,int>, keyed by channel / Action name) to AuditLogOptions, validated like the global value; a channel override may only be shorter than the global window (longer is meaningless under month-partition switch-out, which is governed by the largest retention).
Mechanism (resolved): after the coarse global partition purge, the purge actor runs a bounded row-level delete for channels whose override is shorter than global (DELETE … WHERE Action=@channel AND OccurredAtUtc<@thr, batched). This runs from the purge/maintenance path, not the writer role — the append-only invariant binds the writer/ingest role, not maintenance. The M2.10 CI grep-guard is widened to allow the purge actor's single audited deletion call site (an allow-list entry, not a blanket exemption).
Tests: a channel with a shorter override is purged earlier than the global; channels without an override follow the global; the guard still rejects UPDATE/DELETE everywhere except the sanctioned purge site.

T5 — Historical backfill (reframed per the computed-column reality)

SourceNode is a physical nullable column. For truly historical rows the node-of-origin is unknowable, so the backfill sets a configurable sentinel (default "unknown") on NULL rows via a one-shot maintenance command (run from the purge/maintenance path), rather than guessing a node.
ExecutionId/ParentExecutionId are persisted computed columns derived from DetailsJson; backfilling them means mutating the JSON, which append-only forbids. These are documented as a runbook limitation (pre-feature rows stay NULL) — no code.
Tests: the SourceNode backfill sets the sentinel only on NULL rows within a bounded range and is idempotent; documentation note added.

Cross-cutting

Shared seams: AuditLogOptions (T3, T7), AuditEndpoints.MapAuditAPI (T8), AuditCommands (T8), AuditCentralHealthSnapshot (T6, T7), IAuditLogRepository/the KPI repositories (T6), the purge/maintenance role (T3, T5). No AuditLog schema change in M5 (T1/T2 deferred).
Append-only: the only new deletion is T3's purge-role channel delete + T5's purge-role sentinel UPDATE — both maintenance-path, both reflected in the CI guard's allow-list. Writer/ingest paths stay INSERT-only.

Testing strategy

Per-item unit + targeted integration tests (above). T4 additionally gets a tree-walk integration assertion. Full-solution build + targeted suites at the integration step. No new infra dependency (Parquet deferred).

Sequencing

Independent items, parallelizable by disjoint area:

Wave A (parallel): T8 (CLI+endpoint), T6 (KPI repos+actors+tiles), T7 (middleware+health+redaction-override) — disjoint projects.
Wave B (parallel): T4 (SiteRuntime actors — high-risk), T3 (AuditLog options+purge actor+CI guard), T5 (purge-path backfill command + runbook).
Wave C: integration verification + docs (Component-AuditLog/-CLI, CLAUDE.md KPI/retention notes, runbook).

Risks

T4 actor-model correlation (serialized state) — targeted tests + tree-walk assertion.
T3 append-only tension — resolved via maintenance-role delete + CI-guard allow-list; verify the guard still blocks all other DELETE/UPDATE.
T5 node-of-origin unknowable — sentinel + documented limitation (no false precision).

8.0 KiB Raw Blame History Unescape Escape