Files
ScadaBridge/docs/plans/2026-06-16-m5-audit-hardening-design.md
T

151 lines
8.0 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# M5 — Audit Hardening (T3T8) — Design
**Status:** Approved (awaiting plan).
**Worktree/branch:** `worktree-m5-audit-hardening` off `main` (`e77e209`).
**Source:** Phase-2 milestone M5 from `docs/plans/2026-06-15-stillpending-completion-design.md`.
## Goal
Harden the centralized Audit Log with six independent, ready-to-build items. Two
items originally listed under M5 — **T1 hash-chain tamper evidence** and **T2
Parquet export** — remain **deferred to v1.x** (per CLAUDE.md's audit design
decisions); their stubs (CLI `verify-chain` no-op, export `501`) stay unchanged.
## Scope (in)
T3 per-channel retention · T4 ParentExecutionId tag-cascade · T5 historical
backfill (reframed) · T6 per-node stuck KPIs · T7 structured response-capture
increments · T8 CLI `audit tree`.
## Scope (out / deferred to v1.x)
T1 hash-chain (no Hash/PrevHash columns, no real verify-chain), T2 Parquet
export (the `501` gate stays). Reversing those deferrals is a separate decision.
---
## Items
### T8 — CLI `audit tree` (smallest; reuses existing server walk + UI)
The recursive execution-tree walk (`IAuditLogRepository.GetExecutionTreeAsync`,
backed by `IX_AuditLog_ParentExecution`) and the Blazor `ExecutionTreePage`
already exist; only an HTTP projection + CLI surface are missing.
- **Server:** add `GET /api/audit/tree?executionId=…` in
`AuditEndpoints.MapAuditAPI``repo.GetExecutionTreeAsync` → serialize
`ExecutionTreeNode[]`.
- **CLI:** add `audit tree --execution-id <guid> [--format table|json]` in
`AuditCommands` + an `AuditTreeHelpers` renderer (indented ASCII tree for
`table`; raw nodes for `json`), mirroring `AuditQueryHelpers`/`AuditExportHelpers`.
- No schema change. **Tests:** endpoint returns the tree; CLI renders a
multi-level tree + handles not-found.
### T6 — Per-node stuck-count KPIs
KPIs are per-site today; `SourceNode` is on the `Notification` and `SiteCalls`
rows but not aggregated.
- Add `ComputePerNodeKpisAsync` (group by `SourceNode`) parallel to the existing
`ComputePerSiteKpisAsync` in `NotificationOutboxRepository` and
`SiteCallAuditRepository`.
- New `PerNode…KpiRequest`/`Response` message pair per actor; register in each
actor's `Receive<>`.
- Surface a per-node breakdown on the existing KPI tiles
(`AuditKpiTiles`/`SiteCallKpiTiles`) — additive, behind the existing tiles.
- **Tests:** repository grouping returns correct per-node counts (stuck/parked/
queue-depth); message round-trip.
### T7 — Structured response-capture increments (no schema change)
- **(a) Inbound request headers** → captured into the existing `Extra` JSON in
`AuditWriteMiddleware.EmitInboundAudit`, passed through the existing header
redactor (auth headers redacted by default).
- **(b) `AuditInboundCeilingHits`** counter on `AuditCentralHealthSnapshot`
(alongside the existing failure counters), incremented when an inbound row
truncates (request or response hits `InboundMaxBytes`). Surfaced via the
health snapshot.
- **(c) Per-method opt-out** of body capture: a `SkipBodyCapture` flag on
`PerTargetRedactionOverride`, checked in the capture pipeline so a noisy/
sensitive method can suppress body capture (headers + metadata still recorded).
- **Tests:** request headers land in `Extra` and are redacted; ceiling-hit
increments the counter; opt-out suppresses body but keeps the row.
### T4 — `ParentExecutionId` tag-cascade (touches the actor model — high-risk)
Completes the execution tree beyond the inbound-API→routed-script case.
- **Alarm on-trigger:** thread a `Guid? parentExecutionId` through
`AlarmActor.SpawnAlarmExecutionActor``AlarmExecutionActor`
`ScriptRuntimeContext`, so an alarm-triggered script chains to its firing
context (the alarm's own execution id where one exists; otherwise a root).
- **Nested `CallScript`/`CallShared`:** in `ScriptRuntimeContext`, pass **the
current run's `ExecutionId`** (not the inherited `_parentExecutionId`) as the
child invocation's `ParentExecutionId`, so `A → CallScript(B)` records B's
parent as A — a true multi-level tree.
- **Timer/expression-trigger top-level runs** stay roots (no spawner) — unchanged.
- **Tests:** alarm-triggered script row carries the expected parent; a 2-level
nested `CallScript` produces a chain A→B→C walkable by `GetExecutionTreeAsync`.
- **Risk:** serialized actor state + correlation plumbing; covered by targeted
SiteRuntime actor tests + a tree-walk integration assertion.
### T3 — Per-channel retention overrides (one design wrinkle, resolved)
Retention is a single global `RetentionDays`; the purge actor switches out whole
month partitions by `OccurredAtUtc` (channel-blind).
- Add `PerChannelRetentionDays` (`Dictionary<string,int>`, keyed by channel /
`Action` name) to `AuditLogOptions`, validated like the global value; a channel
override may only be **shorter** than the global window (longer is meaningless
under month-partition switch-out, which is governed by the largest retention).
- **Mechanism (resolved):** after the coarse global partition purge, the purge
actor runs a **bounded row-level delete** for channels whose override is
shorter than global (`DELETE … WHERE Action=@channel AND OccurredAtUtc<@thr`,
batched). This runs from the **purge/maintenance path, not the writer role**
the append-only invariant binds the writer/ingest role, not maintenance. The
**M2.10 CI grep-guard is widened** to allow the purge actor's single audited
deletion call site (an allow-list entry, not a blanket exemption).
- **Tests:** a channel with a shorter override is purged earlier than the global;
channels without an override follow the global; the guard still rejects
UPDATE/DELETE everywhere except the sanctioned purge site.
### T5 — Historical backfill (reframed per the computed-column reality)
- **`SourceNode`** is a physical nullable column. For truly historical rows the
node-of-origin is **unknowable**, so the backfill sets a **configurable
sentinel** (default `"unknown"`) on `NULL` rows via a one-shot maintenance
command (run from the purge/maintenance path), rather than guessing a node.
- **`ExecutionId`/`ParentExecutionId`** are **persisted computed columns derived
from `DetailsJson`**; backfilling them means mutating the JSON, which
append-only forbids. These are **documented as a runbook limitation** (pre-feature
rows stay NULL) — no code.
- **Tests:** the SourceNode backfill sets the sentinel only on NULL rows within a
bounded range and is idempotent; documentation note added.
---
## Cross-cutting
- **Shared seams:** `AuditLogOptions` (T3, T7), `AuditEndpoints.MapAuditAPI`
(T8), `AuditCommands` (T8), `AuditCentralHealthSnapshot` (T6, T7),
`IAuditLogRepository`/the KPI repositories (T6), the purge/maintenance role
(T3, T5). No AuditLog **schema** change in M5 (T1/T2 deferred).
- **Append-only:** the only new deletion is T3's purge-role channel delete +
T5's purge-role sentinel UPDATE — both maintenance-path, both reflected in the
CI guard's allow-list. Writer/ingest paths stay INSERT-only.
## Testing strategy
Per-item unit + targeted integration tests (above). T4 additionally gets a
tree-walk integration assertion. Full-solution build + targeted suites at the
integration step. No new infra dependency (Parquet deferred).
## Sequencing
Independent items, parallelizable by disjoint area:
- **Wave A (parallel):** T8 (CLI+endpoint), T6 (KPI repos+actors+tiles), T7
(middleware+health+redaction-override) — disjoint projects.
- **Wave B (parallel):** T4 (SiteRuntime actors — high-risk), T3 (AuditLog
options+purge actor+CI guard), T5 (purge-path backfill command + runbook).
- **Wave C:** integration verification + docs (Component-AuditLog/-CLI, CLAUDE.md
KPI/retention notes, runbook).
## Risks
- **T4** actor-model correlation (serialized state) — targeted tests + tree-walk
assertion.
- **T3** append-only tension — resolved via maintenance-role delete + CI-guard
allow-list; verify the guard still blocks all other DELETE/UPDATE.
- **T5** node-of-origin unknowable — sentinel + documented limitation (no false
precision).