Files
ScadaBridge/docs/plans/2026-06-16-m5-audit-hardening-design.md
T

8.0 KiB
Raw Blame History

M5 — Audit Hardening (T3T8) — Design

Status: Approved (awaiting plan). Worktree/branch: worktree-m5-audit-hardening off main (e77e209). Source: Phase-2 milestone M5 from docs/plans/2026-06-15-stillpending-completion-design.md.

Goal

Harden the centralized Audit Log with six independent, ready-to-build items. Two items originally listed under M5 — T1 hash-chain tamper evidence and T2 Parquet export — remain deferred to v1.x (per CLAUDE.md's audit design decisions); their stubs (CLI verify-chain no-op, export 501) stay unchanged.

Scope (in)

T3 per-channel retention · T4 ParentExecutionId tag-cascade · T5 historical backfill (reframed) · T6 per-node stuck KPIs · T7 structured response-capture increments · T8 CLI audit tree.

Scope (out / deferred to v1.x)

T1 hash-chain (no Hash/PrevHash columns, no real verify-chain), T2 Parquet export (the 501 gate stays). Reversing those deferrals is a separate decision.


Items

T8 — CLI audit tree (smallest; reuses existing server walk + UI)

The recursive execution-tree walk (IAuditLogRepository.GetExecutionTreeAsync, backed by IX_AuditLog_ParentExecution) and the Blazor ExecutionTreePage already exist; only an HTTP projection + CLI surface are missing.

  • Server: add GET /api/audit/tree?executionId=… in AuditEndpoints.MapAuditAPIrepo.GetExecutionTreeAsync → serialize ExecutionTreeNode[].
  • CLI: add audit tree --execution-id <guid> [--format table|json] in AuditCommands + an AuditTreeHelpers renderer (indented ASCII tree for table; raw nodes for json), mirroring AuditQueryHelpers/AuditExportHelpers.
  • No schema change. Tests: endpoint returns the tree; CLI renders a multi-level tree + handles not-found.

T6 — Per-node stuck-count KPIs

KPIs are per-site today; SourceNode is on the Notification and SiteCalls rows but not aggregated.

  • Add ComputePerNodeKpisAsync (group by SourceNode) parallel to the existing ComputePerSiteKpisAsync in NotificationOutboxRepository and SiteCallAuditRepository.
  • New PerNode…KpiRequest/Response message pair per actor; register in each actor's Receive<>.
  • Surface a per-node breakdown on the existing KPI tiles (AuditKpiTiles/SiteCallKpiTiles) — additive, behind the existing tiles.
  • Tests: repository grouping returns correct per-node counts (stuck/parked/ queue-depth); message round-trip.

T7 — Structured response-capture increments (no schema change)

  • (a) Inbound request headers → captured into the existing Extra JSON in AuditWriteMiddleware.EmitInboundAudit, passed through the existing header redactor (auth headers redacted by default).
  • (b) AuditInboundCeilingHits counter on AuditCentralHealthSnapshot (alongside the existing failure counters), incremented when an inbound row truncates (request or response hits InboundMaxBytes). Surfaced via the health snapshot.
  • (c) Per-method opt-out of body capture: a SkipBodyCapture flag on PerTargetRedactionOverride, checked in the capture pipeline so a noisy/ sensitive method can suppress body capture (headers + metadata still recorded).
  • Tests: request headers land in Extra and are redacted; ceiling-hit increments the counter; opt-out suppresses body but keeps the row.

T4 — ParentExecutionId tag-cascade (touches the actor model — high-risk)

Completes the execution tree beyond the inbound-API→routed-script case.

  • Alarm on-trigger: thread a Guid? parentExecutionId through AlarmActor.SpawnAlarmExecutionActorAlarmExecutionActorScriptRuntimeContext, so an alarm-triggered script chains to its firing context (the alarm's own execution id where one exists; otherwise a root).
  • Nested CallScript/CallShared: in ScriptRuntimeContext, pass the current run's ExecutionId (not the inherited _parentExecutionId) as the child invocation's ParentExecutionId, so A → CallScript(B) records B's parent as A — a true multi-level tree.
  • Timer/expression-trigger top-level runs stay roots (no spawner) — unchanged.
  • Tests: alarm-triggered script row carries the expected parent; a 2-level nested CallScript produces a chain A→B→C walkable by GetExecutionTreeAsync.
  • Risk: serialized actor state + correlation plumbing; covered by targeted SiteRuntime actor tests + a tree-walk integration assertion.

T3 — Per-channel retention overrides (one design wrinkle, resolved)

Retention is a single global RetentionDays; the purge actor switches out whole month partitions by OccurredAtUtc (channel-blind).

  • Add PerChannelRetentionDays (Dictionary<string,int>, keyed by channel / Action name) to AuditLogOptions, validated like the global value; a channel override may only be shorter than the global window (longer is meaningless under month-partition switch-out, which is governed by the largest retention).
  • Mechanism (resolved): after the coarse global partition purge, the purge actor runs a bounded row-level delete for channels whose override is shorter than global (DELETE … WHERE Action=@channel AND OccurredAtUtc<@thr, batched). This runs from the purge/maintenance path, not the writer role — the append-only invariant binds the writer/ingest role, not maintenance. The M2.10 CI grep-guard is widened to allow the purge actor's single audited deletion call site (an allow-list entry, not a blanket exemption).
  • Tests: a channel with a shorter override is purged earlier than the global; channels without an override follow the global; the guard still rejects UPDATE/DELETE everywhere except the sanctioned purge site.

T5 — Historical backfill (reframed per the computed-column reality)

  • SourceNode is a physical nullable column. For truly historical rows the node-of-origin is unknowable, so the backfill sets a configurable sentinel (default "unknown") on NULL rows via a one-shot maintenance command (run from the purge/maintenance path), rather than guessing a node.
  • ExecutionId/ParentExecutionId are persisted computed columns derived from DetailsJson; backfilling them means mutating the JSON, which append-only forbids. These are documented as a runbook limitation (pre-feature rows stay NULL) — no code.
  • Tests: the SourceNode backfill sets the sentinel only on NULL rows within a bounded range and is idempotent; documentation note added.

Cross-cutting

  • Shared seams: AuditLogOptions (T3, T7), AuditEndpoints.MapAuditAPI (T8), AuditCommands (T8), AuditCentralHealthSnapshot (T6, T7), IAuditLogRepository/the KPI repositories (T6), the purge/maintenance role (T3, T5). No AuditLog schema change in M5 (T1/T2 deferred).
  • Append-only: the only new deletion is T3's purge-role channel delete + T5's purge-role sentinel UPDATE — both maintenance-path, both reflected in the CI guard's allow-list. Writer/ingest paths stay INSERT-only.

Testing strategy

Per-item unit + targeted integration tests (above). T4 additionally gets a tree-walk integration assertion. Full-solution build + targeted suites at the integration step. No new infra dependency (Parquet deferred).

Sequencing

Independent items, parallelizable by disjoint area:

  • Wave A (parallel): T8 (CLI+endpoint), T6 (KPI repos+actors+tiles), T7 (middleware+health+redaction-override) — disjoint projects.
  • Wave B (parallel): T4 (SiteRuntime actors — high-risk), T3 (AuditLog options+purge actor+CI guard), T5 (purge-path backfill command + runbook).
  • Wave C: integration verification + docs (Component-AuditLog/-CLI, CLAUDE.md KPI/retention notes, runbook).

Risks

  • T4 actor-model correlation (serialized state) — targeted tests + tree-walk assertion.
  • T3 append-only tension — resolved via maintenance-role delete + CI-guard allow-list; verify the guard still blocks all other DELETE/UPDATE.
  • T5 node-of-origin unknowable — sentinel + documented limitation (no false precision).