docs(m5): design — audit hardening T3-T8 (T1 hash-chain + T2 Parquet stay deferred)
@@ -0,0 +1,150 @@
# M5 — Audit Hardening (T3–T8) — Design

**Status:** Approved (awaiting plan).
**Worktree/branch:** `worktree-m5-audit-hardening` off `main` (`e77e209`).
**Source:** Phase-2 milestone M5 from `docs/plans/2026-06-15-stillpending-completion-design.md`.

## Goal

Harden the centralized Audit Log with six independent, ready-to-build items. Two
items originally listed under M5, **T1 hash-chain tamper evidence** and **T2
Parquet export**, remain **deferred to v1.x** (per CLAUDE.md's audit design
decisions); their stubs (CLI `verify-chain` no-op, export `501`) stay unchanged.

## Scope (in)

T3 per-channel retention · T4 `ParentExecutionId` tag-cascade · T5 historical
backfill (reframed) · T6 per-node stuck KPIs · T7 structured response-capture
increments · T8 CLI `audit tree`.

## Scope (out / deferred to v1.x)

T1 hash-chain (no Hash/PrevHash columns, no real verify-chain), T2 Parquet
export (the `501` gate stays). Reversing those deferrals is a separate decision.

---

## Items

### T8 — CLI `audit tree` (smallest; reuses existing server walk + UI)
The recursive execution-tree walk (`IAuditLogRepository.GetExecutionTreeAsync`,
backed by `IX_AuditLog_ParentExecution`) and the Blazor `ExecutionTreePage`
already exist; only an HTTP projection + CLI surface are missing.
- **Server:** add `GET /api/audit/tree?executionId=…` in
  `AuditEndpoints.MapAuditAPI` → `repo.GetExecutionTreeAsync` → serialize
  `ExecutionTreeNode[]`.
- **CLI:** add `audit tree --execution-id <guid> [--format table|json]` in
  `AuditCommands` + an `AuditTreeHelpers` renderer (indented ASCII tree for
  `table`; raw nodes for `json`), mirroring `AuditQueryHelpers`/`AuditExportHelpers`.
- No schema change. **Tests:** endpoint returns the tree; CLI renders a
  multi-level tree + handles not-found.

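The `table` renderer above can be sketched as follows. This is a language-neutral illustration in Python, not the project's C#; the node fields (`execution_id`, `parent_execution_id`, `label`) and the `|--` / `+--` glyphs are assumptions, since the design does not spell out the shape of `ExecutionTreeNode` or the exact ASCII layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExecutionTreeNode:
    # Hypothetical shape: the real ExecutionTreeNode fields are not given in this design.
    execution_id: str
    parent_execution_id: Optional[str]
    label: str


def render_tree(nodes: list[ExecutionTreeNode]) -> str:
    """Render a flat node list as an indented ASCII tree ('' when no nodes were found)."""
    known_ids = {n.execution_id for n in nodes}
    children: dict[Optional[str], list[ExecutionTreeNode]] = {}
    for n in nodes:
        # A node whose parent is not in the result set is treated as a root.
        key = n.parent_execution_id if n.parent_execution_id in known_ids else None
        children.setdefault(key, []).append(n)

    lines: list[str] = []

    def walk(node: ExecutionTreeNode, prefix: str, is_last: bool, is_root: bool) -> None:
        if is_root:
            lines.append(node.label)
            child_prefix = ""
        else:
            lines.append(prefix + ("+-- " if is_last else "|-- ") + node.label)
            child_prefix = prefix + ("    " if is_last else "|   ")
        kids = children.get(node.execution_id, [])
        for i, kid in enumerate(kids):
            walk(kid, child_prefix, i == len(kids) - 1, False)

    for root in children.get(None, []):
        walk(root, "", True, True)
    return "\n".join(lines)
```

An empty result renders as an empty string, which leaves the CLI free to print its own not-found message and pick the exit code.
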
### T6 — Per-node stuck-count KPIs
KPIs are per-site today; `SourceNode` is present on the `Notification` and
`SiteCalls` rows but is never aggregated.
- Add `ComputePerNodeKpisAsync` (group by `SourceNode`) parallel to the existing
  `ComputePerSiteKpisAsync` in `NotificationOutboxRepository` and
  `SiteCallAuditRepository`.
- New `PerNode…KpiRequest`/`Response` message pair per actor; register in each
  actor's `Receive<>`.
- Surface a per-node breakdown on the existing KPI tiles
  (`AuditKpiTiles`/`SiteCallKpiTiles`): additive, behind the existing tiles.
- **Tests:** repository grouping returns correct per-node counts
  (stuck/parked/queue-depth); message round-trip.

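A minimal sketch of the group-by-`SourceNode` aggregation, again in Python rather than the project's C#. The row shape, the status values, the `"(none)"` bucket for rows without a node, and the definition of queue depth as "every row not yet delivered" are all assumptions made for illustration; the real repositories would presumably express this as a SQL `GROUP BY`.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class OutboxRow:
    # Hypothetical row shape; only SourceNode is named in the design.
    source_node: Optional[str]
    status: str  # assumed values: "pending", "stuck", "parked", "delivered"


def compute_per_node_kpis(rows: Iterable[OutboxRow]) -> dict[str, dict[str, int]]:
    """Group rows by source node and count stuck, parked, and queued rows."""
    kpis: dict[str, dict[str, int]] = defaultdict(
        lambda: {"stuck": 0, "parked": 0, "queue_depth": 0}
    )
    for row in rows:
        # Rows with no recorded node are bucketed together instead of being dropped.
        bucket = kpis[row.source_node or "(none)"]
        if row.status == "stuck":
            bucket["stuck"] += 1
        elif row.status == "parked":
            bucket["parked"] += 1
        # Assumption: queue depth counts every row that has not been delivered.
        if row.status != "delivered":
            bucket["queue_depth"] += 1
    return dict(kpis)
```

Keeping a separate bucket for node-less rows means the per-node totals still add up to the per-site totals, which makes the breakdown easy to cross-check on the tiles.
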
### T7 — Structured response-capture increments (no schema change)
- **(a) Inbound request headers** → captured into the existing `Extra` JSON in
  `AuditWriteMiddleware.EmitInboundAudit`, passed through the existing header
  redactor (auth headers redacted by default).
- **(b) `AuditInboundCeilingHits`** counter on `AuditCentralHealthSnapshot`
  (alongside the existing failure counters), incremented when an inbound row
  truncates (request or response hits `InboundMaxBytes`). Surfaced via the
  health snapshot.
- **(c) Per-method opt-out** of body capture: a `SkipBodyCapture` flag on
  `PerTargetRedactionOverride`, checked in the capture pipeline so a
  noisy/sensitive method can suppress body capture (headers + metadata still recorded).
- **Tests:** request headers land in `Extra` and are redacted; ceiling-hit
  increments the counter; opt-out suppresses body but keeps the row.

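The three increments interact in one capture step, which the sketch below walks through in Python (not the project's C#). Everything except the names taken from the design is an assumption: the deny-list of header names, the `requestHeaders` key inside `Extra`, the `RequestBody`/`ResponseBody` row keys, and the choice to count one ceiling hit per row even when both bodies truncate.

```python
import json
from dataclasses import dataclass

REDACTED = "[REDACTED]"
# Assumed deny-list; the design only says auth headers are redacted by default.
DEFAULT_REDACTED_HEADERS = frozenset({"authorization", "proxy-authorization", "cookie"})


@dataclass
class HealthSnapshot:
    # Stand-in for AuditCentralHealthSnapshot; only the new counter is modelled.
    audit_inbound_ceiling_hits: int = 0


def redact_headers(headers: dict[str, str]) -> dict[str, str]:
    """Replace the values of sensitive headers, matching names case-insensitively."""
    return {
        name: REDACTED if name.lower() in DEFAULT_REDACTED_HEADERS else value
        for name, value in headers.items()
    }


def capture(body: bytes, max_bytes: int) -> tuple[str, bool]:
    """Return the (possibly truncated) body text and whether truncation happened."""
    truncated = len(body) > max_bytes
    return body[:max_bytes].decode("utf-8", errors="replace"), truncated


def build_inbound_row(headers, request_body, response_body, *,
                      inbound_max_bytes, skip_body_capture, health):
    # (a) Redacted request headers always land in Extra, even when bodies are skipped.
    row = {"Extra": json.dumps({"requestHeaders": redact_headers(headers)})}
    if skip_body_capture:
        # (c) Opt-out: the row is still produced, just without bodies.
        return row
    request_text, request_cut = capture(request_body, inbound_max_bytes)
    response_text, response_cut = capture(response_body, inbound_max_bytes)
    if request_cut or response_cut:
        # (b) Assumption: one increment per truncated row, not per truncated body.
        health.audit_inbound_ceiling_hits += 1
    row["RequestBody"] = request_text
    row["ResponseBody"] = response_text
    return row
```

Checking the opt-out before any body is read means a skipped method can never register a ceiling hit, so the counter reflects only rows whose bodies were actually captured.
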
### T4 — `ParentExecutionId` tag-cascade (touches the actor model — high-risk)
Completes the execution tree beyond the inbound-API→routed-script case.
- **Alarm on-trigger:** thread a `Guid? parentExecutionId` through
  `AlarmActor.SpawnAlarmExecutionActor` → `AlarmExecutionActor` →
  `ScriptRuntimeContext`, so an alarm-triggered script chains to its firing
  context (the alarm's own execution id where one exists; otherwise a root).
- **Nested `CallScript`/`CallShared`:** in `ScriptRuntimeContext`, pass **the
  current run's `ExecutionId`** (not the inherited `_parentExecutionId`) as the
  child invocation's `ParentExecutionId`, so `A → CallScript(B)` records B's
  parent as A, giving a true multi-level tree.
- **Timer/expression-trigger top-level runs** stay roots (no spawner); unchanged.
- **Tests:** alarm-triggered script row carries the expected parent; a 2-level
  nested `CallScript` produces a chain A→B→C walkable by `GetExecutionTreeAsync`.
- **Risk:** serialized actor state + correlation plumbing; covered by targeted
  SiteRuntime actor tests + a tree-walk integration assertion.

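The nested-call rule is easy to get subtly wrong, so here is a small Python model of it (not the project's C#). The class only borrows the `ScriptRuntimeContext` name; its fields, the `calls` map that drives which script calls which, and the in-memory `log` are hypothetical scaffolding. The one line that matters is the child being given the caller's own `execution_id`, not the caller's inherited parent.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class AuditRow:
    script: str
    execution_id: uuid.UUID
    parent_execution_id: Optional[uuid.UUID]


@dataclass
class ScriptRuntimeContext:
    # Hypothetical model of the runtime context; field names are illustrative.
    script: str
    parent_execution_id: Optional[uuid.UUID]
    log: list
    execution_id: uuid.UUID = field(default_factory=uuid.uuid4)

    def run(self, calls: dict[str, list[str]]) -> None:
        """Record this run, then invoke every script it is configured to call."""
        self.log.append(AuditRow(self.script, self.execution_id, self.parent_execution_id))
        for callee in calls.get(self.script, []):
            self.call_script(callee, calls)

    def call_script(self, callee: str, calls: dict[str, list[str]]) -> None:
        # The child's parent is THIS run's execution id, not self.parent_execution_id;
        # passing the inherited value would flatten A -> B -> C into siblings under A's parent.
        child = ScriptRuntimeContext(callee, parent_execution_id=self.execution_id, log=self.log)
        child.run(calls)
```

Running a root context for `A` with `calls = {"A": ["B"], "B": ["C"]}` yields three rows in which B's parent is A's execution id and C's parent is B's, which is the chain the tree walk is expected to follow.
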
### T3 — Per-channel retention overrides (one design wrinkle, resolved)
Retention is a single global `RetentionDays`; the purge actor switches out whole
month partitions by `OccurredAtUtc` (channel-blind).
- Add `PerChannelRetentionDays` (`Dictionary<string,int>`, keyed by channel /
  `Action` name) to `AuditLogOptions`, validated like the global value; a channel
  override may only be **shorter** than the global window (a longer override is
  meaningless, because the channel-blind partition switch-out removes those rows
  at the global window regardless).
- **Mechanism (resolved):** after the coarse global partition purge, the purge
  actor runs a **bounded row-level delete** for channels whose override is
  shorter than global (`DELETE … WHERE Action=@channel AND OccurredAtUtc<@thr`,
  batched). This runs from the **purge/maintenance path, not the writer role**:
  the append-only invariant binds the writer/ingest role, not maintenance. The
  **M2.10 CI grep-guard is widened** to allow the purge actor's single audited
  deletion call site (an allow-list entry, not a blanket exemption).
- **Tests:** a channel with a shorter override is purged earlier than the global;
  channels without an override follow the global; the guard still rejects
  UPDATE/DELETE everywhere except the sanctioned purge site.

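A sketch of the override validation and the batched row-level delete, using Python's built-in `sqlite3` as a stand-in for the real database. The function names, the positive-integer check (the design only says "validated like the global value"), the rejection of an override equal to the global window, and the batch size are assumptions. The `rowid IN (SELECT … LIMIT ?)` form is just a portable way to bound a delete in SQLite; the production statement would use whatever batching idiom the actual engine provides.

```python
import sqlite3


def validate_overrides(global_days: int, overrides: dict[str, int]) -> None:
    """Reject overrides that are non-positive or not strictly shorter than global."""
    if global_days <= 0:
        raise ValueError("RetentionDays must be positive")
    for channel, days in overrides.items():
        if days <= 0:
            raise ValueError(f"{channel}: retention must be positive")
        if days >= global_days:
            # Assumption: equal-to-global is rejected too, since it is a no-op.
            raise ValueError(f"{channel}: override must be shorter than the global window")


def purge_channel(conn: sqlite3.Connection, channel: str,
                  threshold_utc: str, batch_size: int = 1000) -> int:
    """Delete one channel's rows older than threshold_utc in bounded batches."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM AuditLog WHERE rowid IN ("
            " SELECT rowid FROM AuditLog"
            " WHERE Action = ? AND OccurredAtUtc < ? LIMIT ?)",
            (channel, threshold_utc, batch_size),
        )
        # Commit per batch so each delete is a short transaction.
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
```

Committing after every batch keeps each transaction short, so the maintenance delete is less likely to hold locks that block the ingest path for long.
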
### T5 — Historical backfill (reframed per the computed-column reality)
- **`SourceNode`** is a physical nullable column. For truly historical rows the
  node-of-origin is **unknowable**, so the backfill sets a **configurable
  sentinel** (default `"unknown"`) on `NULL` rows via a one-shot maintenance
  command (run from the purge/maintenance path), rather than guessing a node.
- **`ExecutionId`/`ParentExecutionId`** are **persisted computed columns derived
  from `DetailsJson`**; backfilling them means mutating the JSON, which the
  append-only invariant forbids. These are **documented as a runbook limitation**
  (pre-feature rows stay NULL), with no code.
- **Tests:** the SourceNode backfill sets the sentinel only on NULL rows within a
  bounded range and is idempotent; documentation note added.

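The `SourceNode` backfill is idempotent by construction, because the `IS NULL` predicate stops matching a row once the sentinel is written. The Python sketch below uses the built-in `sqlite3` module as a stand-in for the real database and is not the project's C#. Bounding the range by an integer `Id` column is an assumption (the design says "bounded range" without naming the key), as are the function name and its return value.

```python
import sqlite3


def backfill_source_node(conn: sqlite3.Connection, lo_id: int, hi_id: int,
                         sentinel: str = "unknown") -> int:
    """Set the sentinel on NULL SourceNode rows with lo_id <= Id <= hi_id.

    Returns the number of rows changed; a second run over the same range
    returns 0 because no NULL rows remain in it.
    """
    cur = conn.execute(
        "UPDATE AuditLog SET SourceNode = ?"
        " WHERE SourceNode IS NULL AND Id BETWEEN ? AND ?",
        (sentinel, lo_id, hi_id),
    )
    conn.commit()
    return cur.rowcount
```
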
---

## Cross-cutting

- **Shared seams:** `AuditLogOptions` (T3, T7), `AuditEndpoints.MapAuditAPI`
  (T8), `AuditCommands` (T8), `AuditCentralHealthSnapshot` (T6, T7),
  `IAuditLogRepository`/the KPI repositories (T6), the purge/maintenance role
  (T3, T5). No AuditLog **schema** change in M5 (T1/T2 deferred).
- **Append-only:** the only new mutations are T3's purge-role channel DELETE and
  T5's purge-role sentinel UPDATE; both run on the maintenance path and both are
  reflected in the CI guard's allow-list. Writer/ingest paths stay INSERT-only.

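The allow-list idea behind the CI guard can be illustrated with a small scanner. This Python sketch is not the actual M2.10 guard, whose implementation is not shown in this design: the regular expression, the per-file granularity of the allow-list, and the file names in the usage note are hypothetical.

```python
import re

# Matches UPDATE / DELETE FROM against the AuditLog table, with optional schema
# prefix and brackets; the trailing \b keeps names like AuditLogOptions from matching.
MUTATION = re.compile(
    r"\b(?:UPDATE|DELETE\s+FROM)\s+(?:\[?\w+\]?\.)?\[?AuditLog\b",
    re.IGNORECASE,
)


def find_violations(sources: dict[str, str],
                    allow_list: frozenset[str]) -> list[tuple[str, int]]:
    """Return (path, line number) for every AuditLog mutation outside the allow-list."""
    violations: list[tuple[str, int]] = []
    for path, text in sorted(sources.items()):
        if path in allow_list:
            # Sanctioned maintenance call sites are skipped by explicit entry only.
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            if MUTATION.search(line):
                violations.append((path, lineno))
    return violations
```

With an allow-list containing only a hypothetical `AuditLogPurgeActor.cs`, a `DELETE FROM AuditLog` in any other file is still reported, which is the property the T3 tests need to keep asserting after the guard is widened.
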
## Testing strategy

Per-item unit + targeted integration tests (above). T4 additionally gets a
tree-walk integration assertion. Full-solution build + targeted suites at the
integration step. No new infra dependency (Parquet deferred).

## Sequencing

Independent items, parallelizable by disjoint area:
- **Wave A (parallel):** T8 (CLI+endpoint), T6 (KPI repos+actors+tiles), T7
  (middleware+health+redaction-override); disjoint projects.
- **Wave B (parallel):** T4 (SiteRuntime actors, high-risk), T3 (AuditLog
  options+purge actor+CI guard), T5 (purge-path backfill command + runbook).
- **Wave C:** integration verification + docs (Component-AuditLog/-CLI, CLAUDE.md
  KPI/retention notes, runbook).

## Risks

- **T4** actor-model correlation (serialized state): targeted tests + tree-walk
  assertion.
- **T3** append-only tension: resolved via maintenance-role delete + CI-guard
  allow-list; verify the guard still blocks all other DELETE/UPDATE.
- **T5** node-of-origin unknowable: sentinel + documented limitation (no false
  precision).