ScadaBridge/docs/plans/2026-06-16-m5-audit-hardening.md

M5 — Audit Hardening (T3–T8) Implementation Plan

For Claude: executed via superpowers-extended-cc:subagent-driven-development in this session.

Goal: Ship six independent audit-log hardening items (per-channel retention, ParentExecutionId tag-cascade, SourceNode backfill, per-node stuck KPIs, structured response-capture increments, CLI audit tree) without an AuditLog schema change.

Architecture: Each item extends an existing seam identified in the survey. No new infra dependency (T1 hash-chain + T2 Parquet stay deferred to v1.x). Design: docs/plans/2026-06-16-m5-audit-hardening-design.md.

Tech Stack: C#/.NET 10, EF Core (MS SQL), Akka.NET, Blazor Server, System.CommandLine, xUnit.

Conventions: targeted builds/tests per task (dotnet build <proj>, dotnet test --filter); full-solution build only at integration (M5.7). Implementers do NOT create worktrees (already in worktree-m5-audit-hardening) and commit with pathspec form git commit -m "..." -- <paths> (retry on index.lock). Append-only invariant holds for writer/ingest paths; the only sanctioned mutations are T3's purge-role channel delete and T5's purge-role sentinel UPDATE, both reflected in the M2.10 CI-guard allow-list.


Wave A — leverage-existing-infra (parallel; disjoint projects)

Task M5.1 (T8): CLI audit tree + tree endpoint

Classification: standard · ~5 min · Parallelizable with: M5.2, M5.3

Files:

  • Modify: src/ZB.MOM.WW.ScadaBridge.ManagementService/AuditEndpoints.cs (MapAuditAPI, ~line 97) — add GET /api/audit/tree?executionId=<guid> → IAuditLogRepository.GetExecutionTreeAsync(executionId) → JSON ExecutionTreeNode[]; 400 on a missing or invalid guid, empty array when no rows match.
  • Create: src/ZB.MOM.WW.ScadaBridge.CLI/Commands/AuditTreeHelpers.cs — render ExecutionTreeNode[] as an indented ASCII tree (table) and as raw JSON (--format json), mirroring AuditQueryHelpers/AuditExportHelpers.
  • Modify: src/ZB.MOM.WW.ScadaBridge.CLI/Commands/AuditCommands.cs (Build, ~line 28) — add BuildTree(): audit tree --execution-id <guid> [--format table|json], calls the new endpoint via the existing ManagementHttpClient pattern.
  • Test: ManagementService tests for the endpoint (multi-level tree + not-found); CLI tests for AuditTreeHelpers rendering.

AC: audit tree --execution-id <id> prints the execution tree (root→children, indented); --format json emits the node array; the server walk reuses the existing GetExecutionTreeAsync (no new SQL). No schema change.
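The indented rendering described for AuditTreeHelpers could look roughly like the sketch below. It assumes the endpoint returns a flat node array linked by parent id; `TreeNodeSketch`, its `Label` field, and `AuditTreeSketch.Render` are hypothetical names for illustration, and the real ExecutionTreeNode shape may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical node shape; the real ExecutionTreeNode may carry different fields.
public sealed record TreeNodeSketch(Guid ExecutionId, Guid? ParentExecutionId, string Label);

public static class AuditTreeSketch
{
    // Renders a flat node list as an indented tree, two spaces per depth level.
    // Assumes no node uses Guid.Empty as its own ExecutionId.
    public static string Render(IReadOnlyList<TreeNodeSketch> nodes)
    {
        var known = nodes.Select(n => n.ExecutionId).ToHashSet();

        // Group children under their parent. Roots, and nodes whose parent is
        // not in the result set, are grouped under Guid.Empty.
        var children = nodes.ToLookup(n =>
            (n.ParentExecutionId is Guid p && known.Contains(p)) ? p : Guid.Empty);

        var sb = new StringBuilder();

        void Walk(Guid parent, int depth)
        {
            foreach (var node in children[parent])
            {
                sb.Append(' ', depth * 2).Append(node.Label).Append('\n');
                Walk(node.ExecutionId, depth + 1);
            }
        }

        Walk(Guid.Empty, 0);
        return sb.ToString();
    }
}
```

An empty input yields an empty string, which lines up with the endpoint returning an empty array when no rows match.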

Task M5.2 (T6): Per-node stuck-count KPIs

Classification: standard · ~5 min · Parallelizable with: M5.1, M5.3

Files:

  • Modify: NotificationOutboxRepository — add ComputePerNodeKpisAsync (group by SourceNode) parallel to ComputePerSiteKpisAsync.
  • Modify: src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/...Repository — same ComputePerNodeKpisAsync.
  • Modify: NotificationOutboxActor.cs (~line 1054) + SiteCallAuditActor.cs (~line 781) — add a PerNode…KpiRequest/Response message pair (in Commons messages) and a Receive<>/handler each.
  • Modify: CentralUI AuditKpiTiles.razor / SiteCallKpiTiles.razor (or the per-site KPI panel) — add an additive per-node breakdown.
  • Test: repository per-node grouping returns correct stuck/parked/queue-depth counts; actor message round-trip.

AC: per-node stuck/parked counts available + surfaced; SourceNode already on both tables (no migration). Per-site KPIs unchanged.
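The per-node grouping can be sketched in memory as below. The status values, the row shape, and the definition of queue depth (every row not yet delivered) are assumptions made for illustration; the real ComputePerNodeKpisAsync would presumably express the same grouping as a server-side EF Core query.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical status values and row shape, for illustration only.
public enum OutboxStatusSketch { Pending, Stuck, Parked, Delivered }
public sealed record OutboxRowSketch(string? SourceNode, OutboxStatusSketch Status);
public sealed record NodeKpiSketch(string SourceNode, int Stuck, int Parked, int QueueDepth);

public static class PerNodeKpiSketch
{
    // Groups rows by SourceNode and counts stuck, parked, and undelivered rows per node.
    // Rows with a NULL SourceNode are bucketed under "unknown" (an assumption).
    public static IReadOnlyList<NodeKpiSketch> Compute(IEnumerable<OutboxRowSketch> rows) =>
        rows.GroupBy(r => r.SourceNode ?? "unknown")
            .Select(g => new NodeKpiSketch(
                g.Key,
                g.Count(r => r.Status == OutboxStatusSketch.Stuck),
                g.Count(r => r.Status == OutboxStatusSketch.Parked),
                g.Count(r => r.Status != OutboxStatusSketch.Delivered)))
            .OrderBy(k => k.SourceNode, StringComparer.Ordinal)
            .ToList();
}
```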

Task M5.3 (T7): Structured response-capture increments

Classification: standard · ~5 min · Parallelizable with: M5.1, M5.2

Files:

  • Modify: src/ZB.MOM.WW.ScadaBridge.AuditLog/...AuditWriteMiddleware.cs (EmitInboundAudit, ~line 246) — capture inbound request headers into the existing Extra JSON (through the existing header redactor; auth headers redacted by default).
  • Modify: src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditCentralHealthSnapshot.cs — add an AuditInboundCeilingHits counter (+ its interface), incremented from the middleware when an inbound row truncates (requestTruncated || responseTruncated).
  • Modify: src/ZB.MOM.WW.ScadaBridge.AuditLog/Configuration/PerTargetRedactionOverride.cs — add a SkipBodyCapture flag; honor it in the capture pipeline (suppress body, keep headers + metadata + the row).
  • Test: request headers land in Extra and are redacted; ceiling-hit increments the counter; SkipBodyCapture suppresses body but still writes the row.

AC: no schema change (uses Extra JSON + health snapshot); existing redaction behavior preserved.
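A minimal sketch of the capture behavior these bullets describe: headers pass through a redactor before landing in the Extra JSON, and SkipBodyCapture drops only the body. The redaction list, the `requestHeaders` key, and the `CapturedSketch` type are hypothetical; the existing header redactor and row model define the real ones.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical result shape: the Extra JSON plus the (possibly suppressed) body.
public sealed record CapturedSketch(string ExtraJson, string? Body);

public static class HeaderCaptureSketch
{
    // Hypothetical default redaction list; the existing redactor owns the real one.
    private static readonly HashSet<string> Redacted =
        new(StringComparer.OrdinalIgnoreCase) { "Authorization", "Cookie" };

    public static CapturedSketch Capture(
        IReadOnlyDictionary<string, string> headers, string? body, bool skipBodyCapture)
    {
        // Replace sensitive header values before anything is serialized.
        var safeHeaders = headers.ToDictionary(
            h => h.Key,
            h => Redacted.Contains(h.Key) ? "[REDACTED]" : h.Value);

        var extraJson = JsonSerializer.Serialize(
            new Dictionary<string, object> { ["requestHeaders"] = safeHeaders });

        // SkipBodyCapture suppresses only the body; headers and the row are still produced.
        return new CapturedSketch(extraJson, skipBodyCapture ? null : body);
    }
}
```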

Wave B — actor model + maintenance (parallel; T5 after M5.1's CLI edits)

Task M5.4 (T4): ParentExecutionId tag-cascade

Classification: high-risk (actor model + correlation) · ~5 min · Parallelizable with: M5.5 (and M5.6)

Files:

  • Modify: src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Actors/AlarmActor.cs (SpawnAlarmExecutionActor, ~line 578) + AlarmExecutionActor.cs (ctor, ~line 90) — thread a Guid? parentExecutionId so alarm-triggered scripts chain to the firing context; pass it into the ScriptRuntimeContext (currently null).
  • Modify: src/ZB.MOM.WW.ScadaBridge.SiteRuntime/Scripts/ScriptRuntimeContext.cs (CallScript ~line 394, CallShared) — pass the current run's _executionId (not the inherited _parentExecutionId) as the child invocation's ParentExecutionId, forming a true multi-level tree.
  • Test (tests/.../SiteRuntime.Tests/): an alarm-triggered script row carries the expected parent; a 2-level nested CallScript (A→B→C) is walkable via GetExecutionTreeAsync (or assert the emitted ParentExecutionId chain).

AC: alarm/trigger-spawned and nested-call runs form a correct execution tree; top-level timer/expression-trigger runs stay roots; no regression to the inbound-API→routed-script path.
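The id-threading rule for nested calls can be illustrated with a stripped-down stand-in for ScriptRuntimeContext. `RunContextSketch` is hypothetical and models only the parent/child id relationship, not the real constructor or CallScript signature.

```csharp
using System;

// Hypothetical stand-in for ScriptRuntimeContext; only the id-threading is modeled.
public sealed class RunContextSketch
{
    public Guid ExecutionId { get; } = Guid.NewGuid();
    public Guid? ParentExecutionId { get; }

    // Top-level runs (timer or expression triggers) pass no parent and stay roots.
    public RunContextSketch(Guid? parentExecutionId = null) =>
        ParentExecutionId = parentExecutionId;

    // A nested call parents the child to THIS run's id, not to the inherited
    // parent id, so A -> B -> C forms a three-level chain instead of a flat fan-out.
    public RunContextSketch CallScript() => new RunContextSketch(ExecutionId);
}
```

Passing the inherited parent id instead would make both B and C children of A's parent, which is the flattening this task is meant to remove.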

Task M5.5 (T3): Per-channel retention overrides

Classification: high-risk (purge/deletion + CI guard) · ~5 min · Parallelizable with: M5.4, M5.6

Files:

  • Modify: src/ZB.MOM.WW.ScadaBridge.AuditLog/Configuration/AuditLogOptions.cs — add Dictionary<string,int> PerChannelRetentionDays (keyed by Action/channel name); validate in AuditLogOptionsValidator.cs (each override in [30, global], shorter-than-global only).
  • Modify: src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogPurgeActor.cs (HandlePurgeTickAsync, ~line 135) — after the global partition switch-out, for each channel with a shorter override, run a bounded batched DELETE (WHERE Action=@channel AND OccurredAtUtc<@threshold) via the purge/maintenance path.
  • Modify: the M2.10 CI grep-guard script — add an allow-list entry for the purge actor's single audited DELETE call site (do NOT blanket-exempt; the guard must still reject all other UPDATE/DELETE on AuditLog).
  • Test: a channel with a shorter override is purged earlier than global; un-overridden channels follow global; the CI guard still fails on a stray DELETE elsewhere.

AC: per-channel retention works without violating writer-role append-only; the guard remains effective.
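The override range check could be sketched as follows. It treats the plan's [30, global] range as inclusive at both ends, which is an assumption, and `RetentionValidationSketch` is a hypothetical helper; the real rule would live in AuditLogOptionsValidator.cs.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RetentionValidationSketch
{
    // Floor taken from the plan's [30, global] range.
    public const int MinRetentionDays = 30;

    // Returns one error message per override that falls outside [30, global].
    // An empty result means every override is acceptable.
    public static IReadOnlyList<string> Validate(
        int globalRetentionDays,
        IReadOnlyDictionary<string, int> perChannelRetentionDays) =>
        perChannelRetentionDays
            .Where(kv => kv.Value < MinRetentionDays || kv.Value > globalRetentionDays)
            .Select(kv =>
                $"PerChannelRetentionDays['{kv.Key}'] = {kv.Value} is outside " +
                $"[{MinRetentionDays}, {globalRetentionDays}].")
            .ToList();
}
```

Rejecting values above the global setting keeps overrides from extending retention past the global partition switch-out, which would already have removed those rows.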

Task M5.6 (T5): SourceNode sentinel backfill + runbook

Classification: small · ~4 min · Parallelizable with: M5.4, M5.5 · Depends on: M5.1 (shares AuditCommands.cs)

Files:

  • Create: a one-shot maintenance backfill (purge/maintenance path) that sets SourceNode to a configurable sentinel (default "unknown") on NULL rows within a bounded OccurredAtUtc range; idempotent.
  • Modify: src/ZB.MOM.WW.ScadaBridge.CLI/Commands/AuditCommands.cs — add audit backfill-source-node [--sentinel <s>] [--before <date>] invoking it (after M5.1's audit tree is in, to avoid a concurrent edit to this file).
  • Modify/Create: a runbook note (deploy/.../RUNBOOK.md or the AuditLog component doc) documenting that ExecutionId/ParentExecutionId are computed from DetailsJson and CANNOT be backfilled under append-only (pre-feature rows stay NULL) — no false precision.
  • Test: backfill sets the sentinel only on NULL rows in range, is idempotent, and does not touch non-NULL rows.

AC: SourceNode backfill is sanctioned maintenance (CI-guard allow-listed if it does UPDATE); the computed-id limitation is documented, not coded.
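The backfill's idempotence follows from its predicate: it only touches rows where SourceNode is NULL, so a second run finds nothing left to update. The in-memory sketch below illustrates that property; `AuditRowSketch` and `SourceNodeBackfillSketch` are hypothetical, and the real backfill would presumably issue a bounded UPDATE with the same predicate through the purge/maintenance path.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical row shape carrying only the two columns the backfill inspects.
public sealed class AuditRowSketch
{
    public string? SourceNode { get; set; }
    public DateTime OccurredAtUtc { get; set; }
}

public static class SourceNodeBackfillSketch
{
    // Sets the sentinel on rows with a NULL SourceNode that occurred before the cutoff.
    // Returns the number of rows changed; a repeat run returns 0.
    public static int Backfill(
        IEnumerable<AuditRowSketch> rows, DateTime beforeUtc, string sentinel = "unknown")
    {
        var updated = 0;
        foreach (var row in rows)
        {
            if (row.SourceNode is null && row.OccurredAtUtc < beforeUtc)
            {
                row.SourceNode = sentinel;
                updated++;
            }
        }
        return updated;
    }
}
```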

Wave C — integration + docs

Task M5.7: Integration verification + docs

Classification: high-risk (final integration reviewer) · ~5 min · Depends on: M5.1–M5.6

Steps:

  1. dotnet build ZB.MOM.WW.ScadaBridge.slnx (full solution).
  2. Targeted tests across AuditLog, ManagementService, CLI, NotificationOutbox/SiteCallAudit, SiteRuntime, CentralUI; run the CI grep-guard to confirm it still blocks stray UPDATE/DELETE.
  3. Docs: docs/requirements/Component-AuditLog.md (per-channel retention, per-node KPIs, response-capture increments, tag-cascade, audit tree), Component-CLI.md + CLI README (audit tree, audit backfill-source-node), CLAUDE.md audit notes (per-channel retention; tag-cascade now beyond inbound; per-node KPIs), and the runbook computed-id limitation.
  4. Commit; final integration review of the whole 1b7600f..HEAD diff.

AC: full build green; all targeted suites + CI guard green; docs reflect the six shipped items; no doc claims a deferred item shipped (T1/T2 remain deferred).

Native tasks & dependencies

Sub-tasks created as native tasks under umbrella #16 (M5). Edges: M5.6 ⟵ M5.1 (shared CLI file); M5.7 ⟵ M5.1–M5.6. Waves: A = {M5.1, M5.2, M5.3} parallel; B = {M5.4, M5.5, M5.6} parallel (M5.6 after M5.1); C = M5.7.