test+docs(m5): M5.7 — de-date 2 EndToEnd purge tests (closes #52); document T3-T8 in Component-AuditLog/-CLI/README/CLAUDE

Tests: anchor SeedOccurredAt() to a fixed thresholdAnchor (2026-01-20) and compute RetentionDays dynamically (UtcNow - anchor + 1d) so the threshold always sits near Jan 20 2026, between the Jan-15 "old" seed (purged) and the Apr-15/Jun-15 "kept" seeds. Seed dates stay within the explicit pf_AuditLog_Month boundary range (Jan 2026 – Dec 2027); the previous relative-from-now offsets landed before 2026-01-01, in the catch-all partition, which is invisible to GetPartitionBoundariesOlderThanAsync. Both tests confirmed passing; all 284 AuditLog tests green.

Docs:
- Component-AuditLog.md: per-channel retention overrides (T3, PerChannelRetentionDays + bounded DELETE + AuditLogPurge:ChannelPurgeBatchSize); ParentExecutionId tag-cascade now spans alarm-triggered + nested CallScript/CallShared + inbound→routed (T4, "no further spawn points deferred"); per-node stuck KPIs for Notification Outbox + Site Call Audit (T6); T7 structured response-capture increments (request headers in Extra.requestHeaders, AuditInboundCeilingHits counter, per-method SkipBodyCapture); T8 CLI audit tree; T1 hash-chain + T2 Parquet explicitly marked deferred to v1.x.
- Component-CLI.md + README.md: document audit tree --execution-id <guid> and audit backfill-source-node --sentinel/--before/--batch with exact options verified against AuditCommands.cs; update Interactions to list new endpoints.
- CLAUDE.md: update audit-log design-decision bullets for T3 per-channel retention, T4 tag-cascade complete, T6 per-node KPIs, T7 inbound capture increments, T8 tree command; clarify T1/T2 remain deferred to v1.x.
2026-06-16 22:26:09 -04:00
parent 1b63d6751f
commit 639e331db1
6 changed files with 320 additions and 127 deletions
@@ -158,16 +158,32 @@ is per-run and flat — `WHERE ExecutionId = X` returns everything one run did,
 nothing links a run to the run that *spawned* it. `ParentExecutionId` carries the
 spawning execution's `ExecutionId`: a spawned run still gets its own fresh
 `ExecutionId`, and every audit row it emits also carries the spawner's id in
-`ParentExecutionId`. The first cut bridges the **inbound API → routed-site-script**
-case: an inbound request runs a method script that calls `Route.Call`, routing to
-a site instance; the routed site script records the inbound request's
-`ExecutionId` as its `ParentExecutionId`, while the inbound `InboundRequest` row
-itself is top-level (`ParentExecutionId` NULL). The pointer always references the
-*immediate* spawner, so a routed run that itself routes onward threads its own
-`ExecutionId` — walking `ParentExecutionId → ExecutionId` recursively
-reconstructs the call chain as a tree of arbitrary depth. The tag-cascade case
-(an attribute write triggering another script) is **deferred** — the model
-generalises to it with no schema change once that spawn point is threaded.
+`ParentExecutionId`. The pointer always references the *immediate* spawner, so a
+run that itself spawns further runs threads its own `ExecutionId` — walking
+`ParentExecutionId → ExecutionId` recursively reconstructs the call chain as a
+tree of arbitrary depth.
+
+**Tag-cascade coverage (M5.4 T4):** `ParentExecutionId` threading now spans all
+known spawn points:
+
+- **Inbound API → routed site script** — an inbound request runs a method script
+  that calls `Route.Call`; the routed site script records the inbound request's
+  `ExecutionId` as its `ParentExecutionId`, while the inbound `InboundRequest` row
+  is top-level (`ParentExecutionId` NULL).
+- **Alarm-triggered on-trigger script** — when an alarm fires and its on-trigger
+  script runs (via `AlarmActor → AlarmExecutionActor`), the alarm context's
+  `ExecutionId` is carried as the run's `ParentExecutionId`. Currently, the alarm
+  subsystem has no Guid-typed firing id, so on-trigger runs are roots (NULL) in
+  practice, but the wiring is in place for a future alarm `ExecutionId`.
+- **Nested `CallScript` / `CallShared` invocations** — when a script calls
+  `Instance.CallScript(...)` or a shared script via `CallShared`, the calling
+  execution's `ExecutionId` threads into the spawned run as its
+  `ParentExecutionId`, making deeply nested call chains visible as a tree.
+
+Attribute-write-triggered cascades (one tag change triggering another script via a
+tag subscription) are also wired: trigger-driven runs carry `ParentExecutionId =
+NULL` (top-level roots), and any nested `CallScript`/`CallShared` they perform
+chains as above. The schema is unchanged — no further tag-cascade work is deferred.
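
The recursive `ParentExecutionId → ExecutionId` walk described above can be sketched in a few lines. This is an illustrative Python model over an in-memory mapping, not the server's query: `parent_of`, `find_root`, and `build_tree` are hypothetical names, and the sample ids are made up.

```python
from collections import defaultdict


def find_root(execution_id, parent_of):
    """Follow ParentExecutionId upward until a run with no parent is reached."""
    seen = set()
    current = execution_id
    while parent_of.get(current) is not None:
        if current in seen:
            raise ValueError("cycle in ParentExecutionId chain")
        seen.add(current)
        current = parent_of[current]
    return current


def build_tree(root, parent_of):
    """Invert the parent pointers and expand the tree downward from root."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        if parent is not None:
            children[parent].append(child)

    def node(execution_id):
        kids = sorted(children[execution_id])
        return {"executionId": execution_id, "children": [node(k) for k in kids]}

    return node(root)


# Hypothetical chain: an inbound request routes to a site script,
# which makes two nested CallScript invocations.
parent_of = {
    "inbound-1": None,
    "routed-1": "inbound-1",
    "call-1": "routed-1",
    "call-2": "routed-1",
}
tree = build_tree(find_root("call-2", parent_of), parent_of)
```

Starting from any node in the chain yields the same tree, because the walk first climbs to the root and only then expands downward.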

 ## The Site-Local `AuditLog` (SQLite)

@@ -268,7 +284,34 @@ operational `SiteCalls` shape for the dispatcher and UI.

 - **Default cap** — 8 KB for each of `RequestSummary` and `ResponseSummary`;
  raised to 64 KB on any error row (`Status IN ('Failed', 'Parked', 'Discarded')`).
- **Inbound API exception.** For `Channel = ApiInbound`, `RequestSummary` and `ResponseSummary` are captured in full up to a per-body hard ceiling of 1 MiB (configurable via `AuditLog:InboundMaxBytes`; default 1 048 576 bytes; min 8 192; max 16 777 216). The 8 KiB / 64 KiB default/error caps that apply to other channels do not apply here. `PayloadTruncated = 1` is set only when the inbound ceiling is hit — verbatim capture is the normal case. The ceiling applies independently to each body. Header redaction and per-target body redactors still run before persistence.
+- **Inbound API exception.** For `Channel = ApiInbound`, `RequestSummary` and
+  `ResponseSummary` are captured in full up to a per-body hard ceiling of 1 MiB
+  (configurable via `AuditLog:InboundMaxBytes`; default 1 048 576 bytes; min
+  8 192; max 16 777 216). The 8 KiB / 64 KiB default/error caps that apply to
+  other channels do not apply here. `PayloadTruncated = 1` is set only when the
+  inbound ceiling is hit — verbatim capture is the normal case. The ceiling
+  applies independently to each body. Header redaction and per-target body
+  redactors still run before persistence.
+- **Inbound ceiling hits (M5.3 T7).** Every time the `InboundMaxBytes` ceiling
+  truncates a body, an `IAuditInboundCeilingHitsCounter.Increment()` call fires.
+  This counter is surfaced as `AuditInboundCeilingHits` on the central health
+  snapshot (alongside `CentralAuditWriteFailures` / `AuditRedactionFailure`) so
+  operators can detect persistently oversized payloads and raise the ceiling or
+  add per-target body redactors.
+- **Request headers in `Extra` (M5.3 T7).** For `Channel = ApiInbound`, the
+  `AuditWriteMiddleware` captures the inbound HTTP request headers (post-redaction
+  — `Authorization`, `X-API-Key`, `Cookie`, `Set-Cookie`, and the configured
+  `HeaderRedactList` are scrubbed before serialization) into the `Extra` JSON
+  column under the key `"requestHeaders"`. This makes the full header envelope
+  visible in the Audit Log UI's detail drawer and the CLI's `audit query` output
+  without widening the schema.
+- **Per-method `SkipBodyCapture` (M5.3 T7).** `PerTargetOverrides` now includes
+  a `SkipBodyCapture: true` flag. When set for an inbound API method, the audit
+  row is always emitted (headers, status, duration, actor, etc. are recorded) but
+  `RequestSummary` and `ResponseSummary` are left null. Use this for methods whose
+  payloads are structurally large or contain secrets not covered by body redactors.
+  Headers are still captured into `Extra.requestHeaders` (after redaction) even
+  when `SkipBodyCapture` is true.
 - **Truncation** — UTF-8 byte-safe; `PayloadTruncated = 1` when applied. Full
  bodies are never stored.
 - **HTTP headers** — `Authorization`, `Cookie`, `Set-Cookie`, `X-API-Key`, and
@@ -311,16 +354,33 @@ MS SQL for direct-write events). Unredacted secrets never persist.
 ## Retention & Purge

 - **Central:** 365-day default based on `OccurredAtUtc`, configurable via
-  `AuditLog:RetentionDays` (min 7, max 3650). Single global retention in v1 —
-  no per-channel overrides.
+  `AuditLog:RetentionDays` (min 30, max 3650).
 - **Partitioning:** monthly partitions on `OccurredAtUtc` from day one
-  (`pf_AuditLog_Month` / `ps_AuditLog_Month`). Purge is a partition switch;
-  there are no row-level deletes at central.
+  (`pf_AuditLog_Month` / `ps_AuditLog_Month`). The global partition switch is
+  channel-blind; it drops a whole month once every row in it is older than the
+  global window. There are no row-level deletes at central for the global purge.
 - **Purge actor:** `AuditLogPurgeActor` singleton on the active central node
  runs daily, switches out any partition whose latest `OccurredAtUtc` is older
-  than the retention window, and emits an `AuditLog:Purged` event (partition
-  range, rowcount, duration). A partition-maintenance step rolls forward each
-  month, creating the next month's partition ahead of time.
+  than the retention window, then applies any per-channel overrides (see below),
+  and emits an `AuditLog:Purged` event (partition range, rowcount, duration) per
+  switched partition. A partition-maintenance step rolls forward each month,
+  creating the next month's partition ahead of time.
+- **Per-channel retention overrides (M5.5 T3):** `AuditLog:PerChannelRetentionDays`
+  is a dictionary keyed by canonical channel name (`ApiOutbound`, `DbOutbound`,
+  `Notification`, `ApiInbound`) whose value is a retention window in days that
+  MUST be strictly shorter than the global `RetentionDays`. After the daily
+  partition switch-out, the purge actor runs a bounded, batched row DELETE
+  (`PurgeChannelOlderThanAsync`) for each channel whose override is shorter than
+  the global window — expiring rows of that channel earlier than the global
+  partition switch would. Overrides equal to or longer than the global window are
+  silently skipped (the global switch already covers them). The DELETE runs under
+  `scadabridge_audit_purger` (the maintenance role); the append-only writer role
+  is unaffected. Batch size is configurable via
+  `AuditLogPurge:ChannelPurgeBatchSize` (default 5000). Each channel override
+  runs in its own try/catch, mirroring the per-boundary error-isolation of the
+  partition switch-out loop. Values are validated to be in
+  `[30, RetentionDays]`; keys that are not a recognized `AuditChannel` enum name
+  are rejected at startup.
 - **Sites:** daily site job; default 7-day retention (configurable, min 1,
  max 90). Respects the hard `ForwardState` invariant — `Pending` rows are
  never purged on age alone.
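
The per-channel override step can be pictured with a small in-memory model. This is a sketch under stated assumptions, not the actor's implementation: the real purge is a SQL DELETE against the central store, and `effective_overrides`, `purge_channel_older_than`, and the row shape here are hypothetical stand-ins. Startup validation of keys and bounds is not modeled.

```python
from datetime import datetime, timedelta, timezone


def effective_overrides(global_days, per_channel_days):
    """Keep only overrides shorter than the global window; the rest are skipped."""
    return {ch: d for ch, d in per_channel_days.items() if d < global_days}


def purge_channel_older_than(rows, channel, cutoff, batch_size):
    """Delete rows of one channel older than cutoff, at most batch_size per pass.

    Mutates `rows` in place and returns the total number of rows deleted.
    """
    total = 0
    while True:
        doomed = [
            i for i, r in enumerate(rows)
            if r["channel"] == channel and r["occurred_at"] < cutoff
        ][:batch_size]
        if not doomed:
            return total
        for i in reversed(doomed):  # delete from the end so indexes stay valid
            del rows[i]
        total += len(doomed)


now = datetime(2026, 6, 16, tzinfo=timezone.utc)
rows = (
    [{"channel": "ApiOutbound", "occurred_at": now - timedelta(days=100)} for _ in range(5)]
    + [{"channel": "ApiOutbound", "occurred_at": now - timedelta(days=10)} for _ in range(2)]
    + [{"channel": "DbOutbound", "occurred_at": now - timedelta(days=100)} for _ in range(3)]
)
overrides = effective_overrides(365, {"ApiOutbound": 90, "Notification": 180, "ApiInbound": 365})
deleted = purge_channel_older_than(
    rows, "ApiOutbound", now - timedelta(days=overrides["ApiOutbound"]), batch_size=2
)
```

With a 90-day `ApiOutbound` override, only the five 100-day-old `ApiOutbound` rows are removed (in batches of two); the `DbOutbound` rows of the same age stay until the global partition switch-out, and the `ApiInbound` override equal to the global window is dropped from the effective set.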
@@ -340,10 +400,13 @@ MS SQL for direct-write events). Unredacted secrets never persist.
  **AuditExport** permission.
 - **Payload redaction at write.** See Payload Capture Policy. Unredacted
  secrets never persist; the safety net over-redacts on misconfiguration.
- **Hash-chain tamper evidence — deferred to v1.x.** A future `RowHash` column,
-  computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will be
-  verifiable offline via `scadabridge audit verify-chain --month YYYY-MM`. Off by
-  default in v1.
+- **Hash-chain tamper evidence (T1) — deferred to v1.x.** A future `RowHash`
+  column, computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will
+  be verifiable offline via `scadabridge audit verify-chain --month YYYY-MM`. The
+  `verify-chain` CLI command is a no-op placeholder today. Off by default in v1.
+- **Parquet archival (T2) — deferred to v1.x.** Long-term cold storage of purged
+  monthly partitions as Parquet files (suitable for offline analytics) will be
+  added in a future milestone. T1 and T2 are not shipped as part of M5.
 - **Site SQLite security.** File permissions: read/write by the ScadaBridge
  service account only. Not backed up off-machine — site SQLite is a buffer,
  not a record.
@@ -355,11 +418,22 @@ Point-in-time, computed from the central `AuditLog` table; global and per-site.
 - **Audit volume** — events/min landing in the central `AuditLog`; global plus per-site sparkline.
 - **Audit error rate** — % of central `AuditLog` rows with `Status IN ('Failed', 'Parked', 'Discarded')` over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, permanent failures, parked deliveries) — NOT audit-writer health, which surfaces separately via `CentralAuditWriteFailures` and `AuditRedactionFailure`.
 - **Audit backlog** — sum of `Pending` site rows across sites; click drills into a per-site breakdown.
+- **`AuditInboundCeilingHits`** (M5.3 T7) — rolling count of inbound API bodies (request or response) truncated by the `InboundMaxBytes` ceiling; surfaced on the central health snapshot alongside `CentralAuditWriteFailures`.
+
+**Per-node stuck KPIs (M5.3 T6):** Both [Notification Outbox](Component-NotificationOutbox.md)
+and [Site Call Audit](Component-SiteCallAudit.md) now expose a
+`PerNodeNotificationKpiRequest` / `PerNodeSiteCallKpiRequest` message pair that
+groups the existing stuck, parked, and delivered-last-interval counts by the
+`SourceNode` that emitted the original row. This surfaces per-node breakdowns on
+the Health dashboard tiles and the Notification Outbox / Site Calls pages,
+making it possible to identify a single misbehaving node (e.g., `site-a:node-b`)
+as the source of a spike rather than a site-wide problem. The existing global and
+per-site KPI shapes are unchanged; the per-node slice is additive.

 [Notification Outbox](Component-NotificationOutbox.md) and
-[Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected — they remain
-sourced from `Notifications` and `SiteCalls` respectively. Audit Log KPIs
-describe the audit table itself.
+[Site Call Audit](Component-SiteCallAudit.md) KPIs for operational dispatch are
+otherwise unaffected: they remain sourced from `Notifications` and `SiteCalls`
+respectively. Audit Log KPIs describe the audit table itself.

 ## Configuration

@@ -370,21 +444,40 @@ component (Options pattern):
 "AuditLog": {
  "DefaultCapBytes": 8192,
  "ErrorCapBytes": 65536,
+  "InboundMaxBytes": 1048576,
  "HeaderRedactList": [ "Authorization", "Cookie", "Set-Cookie", "X-API-Key" ],
  "GlobalBodyRedactors": [
    { "Pattern": "\"password\"\\s*:\\s*\"[^\"]+\"", "Replacement": "\"password\":\"<redacted>\"" }
  ],
  "PerTargetOverrides": {
    "Weather/GetForecast": { "CapBytes": 4096 },
-    "PlantDB":             { "RedactSqlParamsMatching": "@apikey|@token" }
+    "PlantDB":             { "RedactSqlParamsMatching": "@apikey|@token" },
+    "HighVolumeMethod":    { "SkipBodyCapture": true }
  },
-  "RetentionDays": 365
+  "RetentionDays": 365,
+  "PerChannelRetentionDays": {
+    "ApiOutbound":  90,
+    "Notification": 180
+  }
 }
 ```

 `PerTargetOverrides` keys bind by External System / Inbound Method /
-Notification List / Database Connection name. `RetentionDays` is a single
-global value in v1; per-channel overrides are deferred to v1.x.
+Notification List / Database Connection name. `SkipBodyCapture: true` omits
+`RequestSummary`/`ResponseSummary` for that method while still capturing headers
+into `Extra.requestHeaders` and emitting the full audit row. `RetentionDays` is
+the global window; `PerChannelRetentionDays` specifies per-channel windows that
+are strictly shorter — any channel whose override equals or exceeds the global
+value is silently ignored (the global partition switch-out already governs it).
+
+The `AuditLogPurge` section controls the purge actor's cadence and batch size:
+
+```jsonc
+"AuditLogPurge": {
+  "IntervalHours": 24,
+  "ChannelPurgeBatchSize": 5000
+}
+```

 ## Ops Notes — Historical Null Columns

@@ -480,6 +573,8 @@ orphaned entries) and in the CLI's `audit tree` output.
  tiles (Volume, Error rate, Backlog) plus new health metrics:
  `SiteAuditBacklog`, `SiteAuditWriteFailures`, `SiteAuditTelemetryStalled`,
  `CentralAuditWriteFailures`, `AuditRedactionFailure`.
- **[CLI (#19)](Component-CLI.md)** — new `scadabridge audit query`,
-  `scadabridge audit export`, and `scadabridge audit verify-chain` commands; same
-  permission requirements as the UI.
+- **[CLI (#19)](Component-CLI.md)** — `scadabridge audit query`,
+  `scadabridge audit export`, `scadabridge audit tree --execution-id <guid>`,
+  `scadabridge audit backfill-source-node --sentinel <s> --before <date>`, and
+  `scadabridge audit verify-chain` (no-op placeholder for the deferred hash-chain
+  feature); same permission requirements as the UI.
@@ -228,14 +228,17 @@ The new centralized Audit Log component (#23) is exposed via the `scadabridge au
 The `scadabridge audit` group targets the centralized Audit Log component (#23) and
 exposes the UI-equivalent operational audit surface. Permissions follow the same
 read-vs-export split the Central UI uses (see Component-AuditLog.md, Security &
-Tamper-Evidence, and Security & Auth #10): `audit query` and `audit verify-chain`
-require the `OperationalAudit` permission; `audit export` additionally requires
-`AuditExport`. The server enforces permission checks and returns HTTP 403 (CLI
-exit code 2) on denial.
+Tamper-Evidence, and Security & Auth #10): `audit query`, `audit tree`, and
+`audit verify-chain` require the `OperationalAudit` permission; `audit export`
+additionally requires `AuditExport`; `audit backfill-source-node` requires the
+`Admin` role (maintenance path only). The server enforces permission checks and
+returns HTTP 403 (CLI exit code 2) on denial.

 ```
 scadabridge audit query [--since <t>] [--until <t>] [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--target <t>] [--actor <a>] [--correlation-id <id>] [--execution-id <id>] [--parent-execution-id <id>] [--errors-only] [--page-size <n>] [--all]
 scadabridge audit export --since <t> --until <t> --format csv|jsonl|parquet --output <path> [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--target <t>] [--actor <a>]
+scadabridge audit tree --execution-id <guid> [--format table|json]
+scadabridge audit backfill-source-node --before <ISO-8601-UTC> [--sentinel <value>] [--batch <n>]
 scadabridge audit verify-chain --month <YYYY-MM>
 ```

@@ -247,6 +250,18 @@ scadabridge audit verify-chain --month <YYYY-MM>
  requested format (`csv`, `jsonl`, `parquet`) written to `--output`. The server
  streams rows rather than materializing them in memory; the CLI writes bytes
  through to disk. Supports the same scoping filters as `audit query`.
+- `audit tree --execution-id <guid>` (M5.3 T8) — renders the full execution-chain
+  tree for the given `ExecutionId`. The server resolves the root from any node in
+  the chain (walks `ParentExecutionId` to find the root, then traverses downward)
+  and returns all reachable executions with their summary row counts and first/last
+  occurred timestamps. Output format: `json` (default — structured tree suitable
+  for scripting) or `table` (human-readable indented tree). Requires
+  `OperationalAudit` permission. Backed by `GET /api/audit/tree?executionId=<guid>`.
+- `audit backfill-source-node --before <ISO-8601-UTC>` (M5.6 T5) — sets
+  `SourceNode` to a sentinel value (`--sentinel`, default `"unknown"`) on pre-feature
+  rows where `SourceNode IS NULL` and `OccurredAtUtc < --before`, in batches
+  (`--batch`, default 5000). Admin-only maintenance command. Idempotent.
+  Backed by `POST /api/audit/backfill-source-node`.
 - `audit verify-chain` — hash-chain verification for the named month.
  **No-op in v1**: the command is defined so the command tree is stable, but
  verification only becomes meaningful once the hash-chain ships (see
@@ -366,7 +381,7 @@ Configuration is resolved in the following priority order (highest wins):
 - **System.CommandLine**: Command-line argument parsing.
 - **Microsoft.AspNetCore.SignalR.Client**: SignalR client for the `debug stream` command's WebSocket connection.
 - **Management Service (#18)**: The CLI hits the central cluster via the existing HTTP Management API (`POST /management`), which dispatches to the ManagementActor. The `scadabridge audit` command group rides a parallel REST surface on the same Host (`GET /api/audit/query` and `GET /api/audit/export`), sharing HTTP Basic Auth with `/management` but bypassing the actor for read-only, keyset-paged / streaming workloads.
- **Audit Log (#23)**: The `scadabridge audit query` and `audit export` subcommands target the centralized Audit Log component's REST endpoints (`GET /api/audit/query`, `GET /api/audit/export`) on the Host's Management API surface; `audit verify-chain` rides `POST /management` until hash-chain verification ships. Permission checks (`OperationalAudit`, `AuditExport`) are enforced server-side by `AuditEndpoints`.
+- **Audit Log (#23)**: The `scadabridge audit query`, `audit export`, `audit tree`, and `audit backfill-source-node` subcommands target the centralized Audit Log component's REST endpoints (`GET /api/audit/query`, `GET /api/audit/export`, `GET /api/audit/tree`, `POST /api/audit/backfill-source-node`) on the Host's Management API surface; `audit verify-chain` is a client-side no-op today (hash-chain deferred to v1.x). Permission checks (`OperationalAudit`, `AuditExport`, `Admin`) are enforced server-side by `AuditEndpoints`.

 ## Interactions