8.0 KiB
M5 — Audit Hardening (T3–T8) — Design
Status: Approved (awaiting plan).
Worktree/branch: worktree-m5-audit-hardening off main (e77e209).
Source: Phase-2 milestone M5 from docs/plans/2026-06-15-stillpending-completion-design.md.
Goal
Harden the centralized Audit Log with six independent, ready-to-build items. Two
items originally listed under M5 — T1 hash-chain tamper evidence and T2
Parquet export — remain deferred to v1.x (per CLAUDE.md's audit design
decisions); their stubs (CLI verify-chain no-op, export 501) stay unchanged.
Scope (in)
T3 per-channel retention · T4 ParentExecutionId tag-cascade · T5 historical
backfill (reframed) · T6 per-node stuck KPIs · T7 structured response-capture
increments · T8 CLI audit tree.
Scope (out / deferred to v1.x)
T1 hash-chain (no Hash/PrevHash columns, no real verify-chain), T2 Parquet
export (the 501 gate stays). Reversing those deferrals is a separate decision.
Items
T8 — CLI audit tree (smallest; reuses existing server walk + UI)
The recursive execution-tree walk (IAuditLogRepository.GetExecutionTreeAsync,
backed by IX_AuditLog_ParentExecution) and the Blazor ExecutionTreePage
already exist; only an HTTP projection + CLI surface are missing.
- Server: add
GET /api/audit/tree?executionId=…inAuditEndpoints.MapAuditAPI→repo.GetExecutionTreeAsync→ serializeExecutionTreeNode[]. - CLI: add
audit tree --execution-id <guid> [--format table|json]inAuditCommands+ anAuditTreeHelpersrenderer (indented ASCII tree fortable; raw nodes forjson), mirroringAuditQueryHelpers/AuditExportHelpers. - No schema change. Tests: endpoint returns the tree; CLI renders a multi-level tree + handles not-found.
T6 — Per-node stuck-count KPIs
KPIs are per-site today; SourceNode is on the Notification and SiteCalls
rows but not aggregated.
- Add
ComputePerNodeKpisAsync(group bySourceNode) parallel to the existingComputePerSiteKpisAsyncinNotificationOutboxRepositoryandSiteCallAuditRepository. - New
PerNode…KpiRequest/Responsemessage pair per actor; register in each actor'sReceive<>. - Surface a per-node breakdown on the existing KPI tiles
(
AuditKpiTiles/SiteCallKpiTiles) — additive, behind the existing tiles. - Tests: repository grouping returns correct per-node counts (stuck/parked/ queue-depth); message round-trip.
T7 — Structured response-capture increments (no schema change)
- (a) Inbound request headers → captured into the existing
ExtraJSON inAuditWriteMiddleware.EmitInboundAudit, passed through the existing header redactor (auth headers redacted by default). - (b)
AuditInboundCeilingHitscounter onAuditCentralHealthSnapshot(alongside the existing failure counters), incremented when an inbound row truncates (request or response hitsInboundMaxBytes). Surfaced via the health snapshot. - (c) Per-method opt-out of body capture: a
SkipBodyCaptureflag onPerTargetRedactionOverride, checked in the capture pipeline so a noisy/ sensitive method can suppress body capture (headers + metadata still recorded). - Tests: request headers land in
Extraand are redacted; ceiling-hit increments the counter; opt-out suppresses body but keeps the row.
T4 — ParentExecutionId tag-cascade (touches the actor model — high-risk)
Completes the execution tree beyond the inbound-API→routed-script case.
- Alarm on-trigger: thread a
Guid? parentExecutionIdthroughAlarmActor.SpawnAlarmExecutionActor→AlarmExecutionActor→ScriptRuntimeContext, so an alarm-triggered script chains to its firing context (the alarm's own execution id where one exists; otherwise a root). - Nested
CallScript/CallShared: inScriptRuntimeContext, pass the current run'sExecutionId(not the inherited_parentExecutionId) as the child invocation'sParentExecutionId, soA → CallScript(B)records B's parent as A — a true multi-level tree. - Timer/expression-trigger top-level runs stay roots (no spawner) — unchanged.
- Tests: alarm-triggered script row carries the expected parent; a 2-level
nested
CallScriptproduces a chain A→B→C walkable byGetExecutionTreeAsync. - Risk: serialized actor state + correlation plumbing; covered by targeted SiteRuntime actor tests + a tree-walk integration assertion.
T3 — Per-channel retention overrides (one design wrinkle, resolved)
Retention is a single global RetentionDays; the purge actor switches out whole
month partitions by OccurredAtUtc (channel-blind).
- Add
PerChannelRetentionDays(Dictionary<string,int>, keyed by channel /Actionname) toAuditLogOptions, validated like the global value; a channel override may only be shorter than the global window (longer is meaningless under month-partition switch-out, which is governed by the largest retention). - Mechanism (resolved): after the coarse global partition purge, the purge
actor runs a bounded row-level delete for channels whose override is
shorter than global (
DELETE … WHERE Action=@channel AND OccurredAtUtc<@thr, batched). This runs from the purge/maintenance path, not the writer role — the append-only invariant binds the writer/ingest role, not maintenance. The M2.10 CI grep-guard is widened to allow the purge actor's single audited deletion call site (an allow-list entry, not a blanket exemption). - Tests: a channel with a shorter override is purged earlier than the global; channels without an override follow the global; the guard still rejects UPDATE/DELETE everywhere except the sanctioned purge site.
T5 — Historical backfill (reframed per the computed-column reality)
SourceNodeis a physical nullable column. For truly historical rows the node-of-origin is unknowable, so the backfill sets a configurable sentinel (default"unknown") onNULLrows via a one-shot maintenance command (run from the purge/maintenance path), rather than guessing a node.ExecutionId/ParentExecutionIdare persisted computed columns derived fromDetailsJson; backfilling them means mutating the JSON, which append-only forbids. These are documented as a runbook limitation (pre-feature rows stay NULL) — no code.- Tests: the SourceNode backfill sets the sentinel only on NULL rows within a bounded range and is idempotent; documentation note added.
Cross-cutting
- Shared seams:
AuditLogOptions(T3, T7),AuditEndpoints.MapAuditAPI(T8),AuditCommands(T8),AuditCentralHealthSnapshot(T6, T7),IAuditLogRepository/the KPI repositories (T6), the purge/maintenance role (T3, T5). No AuditLog schema change in M5 (T1/T2 deferred). - Append-only: the only new deletion is T3's purge-role channel delete + T5's purge-role sentinel UPDATE — both maintenance-path, both reflected in the CI guard's allow-list. Writer/ingest paths stay INSERT-only.
Testing strategy
Per-item unit + targeted integration tests (above). T4 additionally gets a tree-walk integration assertion. Full-solution build + targeted suites at the integration step. No new infra dependency (Parquet deferred).
Sequencing
Independent items, parallelizable by disjoint area:
- Wave A (parallel): T8 (CLI+endpoint), T6 (KPI repos+actors+tiles), T7 (middleware+health+redaction-override) — disjoint projects.
- Wave B (parallel): T4 (SiteRuntime actors — high-risk), T3 (AuditLog options+purge actor+CI guard), T5 (purge-path backfill command + runbook).
- Wave C: integration verification + docs (Component-AuditLog/-CLI, CLAUDE.md KPI/retention notes, runbook).
Risks
- T4 actor-model correlation (serialized state) — targeted tests + tree-walk assertion.
- T3 append-only tension — resolved via maintenance-role delete + CI-guard allow-list; verify the guard still blocks all other DELETE/UPDATE.
- T5 node-of-origin unknowable — sentinel + documented limitation (no false precision).