test+docs(m5): M5.7 — de-date 2 EndToEnd purge tests (closes #52); document T3-T8 in Component-AuditLog/-CLI/README/CLAUDE

Tests: anchor SeedOccurredAt() to a fixed thresholdAnchor (2026-01-20) and compute
RetentionDays dynamically (UtcNow - anchor + 1d) so the threshold always sits near
Jan 20 2026, between the Jan-15 "old" seed (purged) and Apr-15/Jun-15 "kept" seeds.
Seed dates stay within the explicit pf_AuditLog_Month boundary range (Jan 2026 –
Dec 2027) — relative-from-now offsets landed before 2026-01-01 (the catch-all
partition, invisible to GetPartitionBoundariesOlderThanAsync). Both tests confirmed
passing; all 284 AuditLog tests green.
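
The dynamic RetentionDays arithmetic above can be sketched as follows. This is a minimal illustration, not the test code itself: the function names are hypothetical, and it assumes the elapsed time is rounded down to whole days before adding one.

```python
from datetime import datetime, timedelta, timezone

# Fixed anchor taken from the commit message; everything else is illustrative.
THRESHOLD_ANCHOR = datetime(2026, 1, 20, tzinfo=timezone.utc)


def retention_days_for(now: datetime, anchor: datetime = THRESHOLD_ANCHOR) -> int:
    """Whole days elapsed since the anchor, plus one (assumed integer rounding)."""
    return (now - anchor).days + 1


def purge_threshold(now: datetime, retention_days: int) -> datetime:
    """Rows older than this instant fall outside the retention window."""
    return now - timedelta(days=retention_days)


# Example clock value (hypothetical); any "now" after the anchor behaves the same way.
now = datetime(2026, 6, 16, 22, 26, tzinfo=timezone.utc)
days = retention_days_for(now)
threshold = purge_threshold(now, days)
```

Under that rounding assumption the threshold always lands within the 24 hours before the anchor, so it stays after the Jan-15 seed and before the Apr-15 seed regardless of when the test runs.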

Docs:
- Component-AuditLog.md: per-channel retention overrides (T3, PerChannelRetentionDays
  + bounded DELETE + AuditLogPurge:ChannelPurgeBatchSize); ParentExecutionId tag-cascade
  now spans alarm-triggered + nested CallScript/CallShared + inbound→routed (T4, "no
  further spawn points deferred"); per-node stuck KPIs for Notification Outbox +
  Site Call Audit (T6); T7 structured response-capture increments (request headers in
  Extra.requestHeaders, AuditInboundCeilingHits counter, per-method SkipBodyCapture);
  T8 CLI audit tree; T1 hash-chain + T2 Parquet explicitly marked deferred to v1.x.
- Component-CLI.md + README.md: document audit tree --execution-id <guid> and
  audit backfill-source-node --sentinel/--before/--batch with exact options verified
  against AuditCommands.cs; update Interactions to list new endpoints.
- CLAUDE.md: update audit-log design-decision bullets for T3 per-channel retention,
  T4 tag-cascade complete, T6 per-node KPIs, T7 inbound capture increments, T8 tree
  command; clarify T1/T2 remain deferred to v1.x.
Joseph Doherty
2026-06-16 22:26:09 -04:00
parent 1b63d6751f
commit 639e331db1
6 changed files with 320 additions and 127 deletions
+127 -32
@@ -158,16 +158,32 @@ is per-run and flat — `WHERE ExecutionId = X` returns everything one run did,
nothing links a run to the run that *spawned* it. `ParentExecutionId` carries the
spawning execution's `ExecutionId`: a spawned run still gets its own fresh
`ExecutionId`, and every audit row it emits also carries the spawner's id in
`ParentExecutionId`. The first cut bridges the **inbound API → routed-site-script**
case: an inbound request runs a method script that calls `Route.Call`, routing to
a site instance; the routed site script records the inbound request's
`ExecutionId` as its `ParentExecutionId`, while the inbound `InboundRequest` row
itself is top-level (`ParentExecutionId` NULL). The pointer always references the
*immediate* spawner, so a routed run that itself routes onward threads its own
`ExecutionId` — walking `ParentExecutionId → ExecutionId` recursively
reconstructs the call chain as a tree of arbitrary depth. The tag-cascade case
(an attribute write triggering another script) is **deferred** — the model
generalises to it with no schema change once that spawn point is threaded.
`ParentExecutionId`. The pointer always references the *immediate* spawner, so a
run that itself spawns further runs threads its own `ExecutionId` — walking
`ParentExecutionId → ExecutionId` recursively reconstructs the call chain as a
tree of arbitrary depth.
**Tag-cascade coverage (M5.4 T4):** `ParentExecutionId` threading now spans all
known spawn points:
- **Inbound API → routed site script** — an inbound request runs a method script
that calls `Route.Call`; the routed site script records the inbound request's
`ExecutionId` as its `ParentExecutionId`, while the inbound `InboundRequest` row
is top-level (`ParentExecutionId` NULL).
- **Alarm-triggered on-trigger script** — when an alarm fires and its on-trigger
script runs (via `AlarmActor → AlarmExecutionActor`), the alarm context's
`ExecutionId` is carried as the run's `ParentExecutionId`. The alarm subsystem
currently has no Guid-typed firing id, so on-trigger runs are roots (NULL) in
practice, but the wiring is in place for a future alarm `ExecutionId`.
- **Nested `CallScript` / `CallShared` invocations** — when a script calls
`Instance.CallScript(...)` or a shared script via `CallShared`, the calling
execution's `ExecutionId` threads into the spawned run as its
`ParentExecutionId`, making deeply nested call chains visible as a tree.
Attribute-write-triggered cascades (one tag change triggering another script via a
tag subscription) are also wired: trigger-driven runs carry `ParentExecutionId =
NULL` (top-level roots), and any nested `CallScript`/`CallShared` they perform
chains as above. The schema is unchanged — no further tag-cascade work is deferred.
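
The recursive `ParentExecutionId → ExecutionId` walk can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, rows are reduced to `(ExecutionId, ParentExecutionId)` pairs, and short strings stand in for GUIDs.

```python
from collections import defaultdict


def index_rows(rows):
    """Build parent and child lookups from (execution_id, parent_id) pairs."""
    parent_of = {}
    children_of = defaultdict(list)
    for execution_id, parent_id in rows:
        if execution_id in parent_of:
            continue  # many audit rows share one ExecutionId; index each run once
        parent_of[execution_id] = parent_id
        if parent_id is not None:
            children_of[parent_id].append(execution_id)
    return parent_of, children_of


def find_root(execution_id, parent_of):
    """Walk upward until a top-level run (NULL parent) or an unknown parent."""
    current = execution_id
    seen = {current}
    while True:
        parent = parent_of.get(current)
        if parent is None or parent not in parent_of or parent in seen:
            return current
        seen.add(parent)
        current = parent


def build_tree(node, children_of):
    """Traverse downward from a root, returning a nested dict."""
    kids = sorted(children_of.get(node, []))
    return {"executionId": node, "children": [build_tree(k, children_of) for k in kids]}


# Inbound request A routes to B, which makes two nested calls C and D.
rows = [("A", None), ("B", "A"), ("C", "B"), ("D", "B"), ("B", "A")]
parent_of, children_of = index_rows(rows)
tree = build_tree(find_root("C", parent_of), children_of)
```

Starting from any node in the chain yields the same tree, because the upward walk always ends at the same root.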
## The Site-Local `AuditLog` (SQLite)
@@ -268,7 +284,34 @@ operational `SiteCalls` shape for the dispatcher and UI.
- **Default cap** — 8 KB for each of `RequestSummary` and `ResponseSummary`;
raised to 64 KB on any error row (`Status IN ('Failed', 'Parked', 'Discarded')`).
- **Inbound API exception.** For `Channel = ApiInbound`, `RequestSummary` and `ResponseSummary` are captured in full up to a per-body hard ceiling of 1 MiB (configurable via `AuditLog:InboundMaxBytes`; default 1 048 576 bytes; min 8 192; max 16 777 216). The 8 KiB / 64 KiB default/error caps that apply to other channels do not apply here. `PayloadTruncated = 1` is set only when the inbound ceiling is hit — verbatim capture is the normal case. The ceiling applies independently to each body. Header redaction and per-target body redactors still run before persistence.
- **Inbound API exception.** For `Channel = ApiInbound`, `RequestSummary` and
`ResponseSummary` are captured in full up to a per-body hard ceiling of 1 MiB
(configurable via `AuditLog:InboundMaxBytes`; default 1 048 576 bytes; min
8 192; max 16 777 216). The 8 KiB / 64 KiB default/error caps that apply to
other channels do not apply here. `PayloadTruncated = 1` is set only when the
inbound ceiling is hit — verbatim capture is the normal case. The ceiling
applies independently to each body. Header redaction and per-target body
redactors still run before persistence.
- **Inbound ceiling hits (M5.3 T7).** Every time the `InboundMaxBytes` ceiling
  truncates a body, an `IAuditInboundCeilingHitsCounter.Increment()` call fires.
This counter is surfaced as `AuditInboundCeilingHits` on the central health
snapshot (alongside `CentralAuditWriteFailures` / `AuditRedactionFailure`) so
operators can detect persistently oversized payloads and raise the ceiling or
add per-target body redactors.
- **Request headers in `Extra` (M5.3 T7).** For `Channel = ApiInbound`, the
  `AuditWriteMiddleware` captures the inbound HTTP request headers after redaction
  (`Authorization`, `X-API-Key`, `Cookie`, `Set-Cookie`, and the headers in the
  configured `HeaderRedactList` are scrubbed before serialization) into the `Extra` JSON
column under the key `"requestHeaders"`. This makes the full header envelope
visible in the Audit Log UI's detail drawer and the CLI's `audit query` output
without widening the schema.
- **Per-method `SkipBodyCapture` (M5.3 T7).** `PerTargetOverrides` now includes
a `SkipBodyCapture: true` flag. When set for an inbound API method, the audit
row is always emitted (headers, status, duration, actor, etc. are recorded) but
`RequestSummary` and `ResponseSummary` are left null. Use this for methods whose
payloads are structurally large or contain secrets not covered by body redactors.
Headers are still captured into `Extra.requestHeaders` (after redaction) even
when `SkipBodyCapture` is true.
- **Truncation** — UTF-8 byte-safe; `PayloadTruncated = 1` when applied. Full
bodies are never stored.
- **HTTP headers** — `Authorization`, `Cookie`, `Set-Cookie`, `X-API-Key`, and
@@ -311,16 +354,33 @@ MS SQL for direct-write events). Unredacted secrets never persist.
## Retention & Purge
- **Central:** 365-day default based on `OccurredAtUtc`, configurable via
`AuditLog:RetentionDays` (min 7, max 3650). Single global retention in v1 —
no per-channel overrides.
`AuditLog:RetentionDays` (min 30, max 3650).
- **Partitioning:** monthly partitions on `OccurredAtUtc` from day one
(`pf_AuditLog_Month` / `ps_AuditLog_Month`). Purge is a partition switch;
there are no row-level deletes at central.
(`pf_AuditLog_Month` / `ps_AuditLog_Month`). The global partition switch is
channel-blind; it drops a whole month once every row in it is older than the
global window. There are no row-level deletes at central for the global purge.
- **Purge actor:** `AuditLogPurgeActor` singleton on the active central node
runs daily, switches out any partition whose latest `OccurredAtUtc` is older
than the retention window, and emits an `AuditLog:Purged` event (partition
range, rowcount, duration). A partition-maintenance step rolls forward each
month, creating the next month's partition ahead of time.
than the retention window, then applies any per-channel overrides (see below),
and emits an `AuditLog:Purged` event (partition range, rowcount, duration) per
switched partition. A partition-maintenance step rolls forward each month,
creating the next month's partition ahead of time.
- **Per-channel retention overrides (M5.5 T3):** `AuditLog:PerChannelRetentionDays`
is a dictionary keyed by canonical channel name (`ApiOutbound`, `DbOutbound`,
`Notification`, `ApiInbound`) whose value is a retention window in days that
MUST be strictly shorter than the global `RetentionDays`. After the daily
partition switch-out, the purge actor runs a bounded, batched row DELETE
(`PurgeChannelOlderThanAsync`) for each channel whose override is shorter than
the global window — expiring rows of that channel earlier than the global
partition switch would. Overrides equal to or longer than the global window are
silently skipped (the global switch already covers them). The DELETE runs under
`scadabridge_audit_purger` (the maintenance role); the append-only writer role
is unaffected. Batch size is configurable via
`AuditLogPurge:ChannelPurgeBatchSize` (default 5000). Each channel override
runs in its own try/catch, mirroring the per-boundary error-isolation of the
partition switch-out loop. Values are validated to be in
`[30, RetentionDays]`; keys that are not a recognized `AuditChannel` enum name
are rejected at startup.
- **Sites:** daily site job; default 7-day retention (configurable, min 1,
max 90). Respects the hard `ForwardState` invariant — `Pending` rows are
never purged on age alone.
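
The per-channel override rules above can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the stop condition of the batch loop (a batch that deletes fewer rows than requested) is an assumption, and `delete_batch` stands in for whatever `PurgeChannelOlderThanAsync` does against the database.

```python
MIN_RETENTION_DAYS = 30
KNOWN_CHANNELS = {"ApiOutbound", "DbOutbound", "Notification", "ApiInbound"}


def effective_overrides(retention_days, per_channel):
    """Validate overrides; return only those shorter than the global window."""
    active = {}
    for channel, days in per_channel.items():
        if channel not in KNOWN_CHANNELS:
            raise ValueError(f"unknown audit channel: {channel}")
        if not MIN_RETENTION_DAYS <= days <= retention_days:
            raise ValueError(f"{channel}: {days} outside [30, {retention_days}]")
        if days < retention_days:
            active[channel] = days  # equal-to-global overrides are skipped
    return active


def purge_in_batches(delete_batch, batch_size=5000):
    """Call delete_batch(batch_size) until a batch comes back short."""
    total = 0
    while True:
        deleted = delete_batch(batch_size)
        total += deleted
        if deleted < batch_size:
            return total


# Stand-in for the database: 12001 expired rows for one channel.
remaining = [12001]


def fake_delete(limit):
    taken = min(limit, remaining[0])
    remaining[0] -= taken
    return taken


active = effective_overrides(365, {"ApiOutbound": 90, "Notification": 365})
purged = purge_in_batches(fake_delete, 5000)
```

Keeping each DELETE bounded by the batch size limits how long any single statement holds locks on the central table.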
@@ -340,10 +400,13 @@ MS SQL for direct-write events). Unredacted secrets never persist.
**AuditExport** permission.
- **Payload redaction at write.** See Payload Capture Policy. Unredacted
secrets never persist; the safety net over-redacts on misconfiguration.
- **Hash-chain tamper evidence — deferred to v1.x.** A future `RowHash` column,
computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will be
verifiable offline via `scadabridge audit verify-chain --month YYYY-MM`. Off by
default in v1.
- **Hash-chain tamper evidence (T1) — deferred to v1.x.** A future `RowHash`
column, computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will
be verifiable offline via `scadabridge audit verify-chain --month YYYY-MM`. The
`verify-chain` CLI command is a no-op placeholder today. Off by default in v1.
- **Parquet archival (T2) — deferred to v1.x.** Long-term cold storage of purged
monthly partitions as Parquet files (suitable for offline analytics) will be
added in a future milestone. T1 and T2 are not shipped as part of M5.
- **Site SQLite security.** File permissions: read/write by the ScadaBridge
service account only. Not backed up off-machine — site SQLite is a buffer,
not a record.
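
The deferred hash-chain formula, `SHA-256(prev.RowHash || canonical(row))`, can be illustrated with the sketch below. Nothing here is shipped behaviour: the canonical form (sorted-key compact JSON), the all-zero seed for the first row, and every function name are assumptions made for the example.

```python
import hashlib
import json

GENESIS = b"\x00" * 32  # assumed seed for the first row in a partition


def canonical(row):
    """Assumed canonical form: compact JSON with sorted keys, UTF-8 encoded."""
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")


def chain_hashes(rows):
    """Return one hash per row, each covering the previous hash and the row."""
    prev = GENESIS
    hashes = []
    for row in rows:
        prev = hashlib.sha256(prev + canonical(row)).digest()
        hashes.append(prev)
    return hashes


def verify_chain(rows, stored_hashes):
    """Recompute the chain and compare it with the stored hashes."""
    return chain_hashes(rows) == list(stored_hashes)


rows = [{"id": 1, "status": "Delivered"}, {"id": 2, "status": "Failed"}]
stored = chain_hashes(rows)
tampered = [{"id": 1, "status": "Failed"}, rows[1]]
```

Because each hash folds in its predecessor, editing one row invalidates that row's hash and every hash after it in the partition.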
@@ -355,11 +418,22 @@ Point-in-time, computed from the central `AuditLog` table; global and per-site.
- **Audit volume** — events/min landing in the central `AuditLog`; global plus per-site sparkline.
- **Audit error rate** — % of central `AuditLog` rows with `Status IN ('Failed', 'Parked', 'Discarded')` over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, permanent failures, parked deliveries) — NOT audit-writer health, which surfaces separately via `CentralAuditWriteFailures` and `AuditRedactionFailure`.
- **Audit backlog** — sum of `Pending` site rows across sites; click drills into a per-site breakdown.
- **`AuditInboundCeilingHits`** (M5.3 T7) — rolling count of inbound API request and response bodies truncated by the `InboundMaxBytes` ceiling; surfaced on the central health snapshot alongside `CentralAuditWriteFailures`.
**Per-node stuck KPIs (M5.3 T6):** Both [Notification Outbox](Component-NotificationOutbox.md)
and [Site Call Audit](Component-SiteCallAudit.md) now expose a
`PerNodeNotificationKpiRequest` / `PerNodeSiteCallKpiRequest` message pair that
groups the existing stuck, parked, and delivered-last-interval counts by the
`SourceNode` that emitted the original row. This surfaces per-node breakdowns on
the Health dashboard tiles and the Notification Outbox / Site Calls pages,
making it possible to identify a single misbehaving node (e.g., `site-a:node-b`)
as the source of a spike rather than a site-wide problem. The existing global and
per-site KPI shapes are unchanged; the per-node slice is additive.
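
The per-node grouping can be sketched as below. This is an illustration only: the row shape (`source_node`, `state`) and the function name are hypothetical and do not reflect the actual message contracts.

```python
from collections import Counter


def per_node_counts(rows):
    """Group state counts (stuck, parked, delivered) by the emitting node."""
    counts = {}
    for row in rows:
        counts.setdefault(row["source_node"], Counter())[row["state"]] += 1
    return counts


rows = [
    {"source_node": "site-a:node-a", "state": "delivered"},
    {"source_node": "site-a:node-b", "state": "stuck"},
    {"source_node": "site-a:node-b", "state": "stuck"},
    {"source_node": "site-a:node-b", "state": "parked"},
]
kpis = per_node_counts(rows)
```

In this sample the stuck rows all come from `site-a:node-b`, which is the kind of single-node spike the per-node slice is meant to expose.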
[Notification Outbox](Component-NotificationOutbox.md) and
[Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected they remain
sourced from `Notifications` and `SiteCalls` respectively. Audit Log KPIs
describe the audit table itself.
[Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected for their
operational dispatch responsibilities — they remain sourced from `Notifications`
and `SiteCalls` respectively. Audit Log KPIs describe the audit table itself.
## Configuration
@@ -370,21 +444,40 @@ component (Options pattern):
"AuditLog": {
"DefaultCapBytes": 8192,
"ErrorCapBytes": 65536,
"InboundMaxBytes": 1048576,
"HeaderRedactList": [ "Authorization", "Cookie", "Set-Cookie", "X-API-Key" ],
"GlobalBodyRedactors": [
{ "Pattern": "\"password\"\\s*:\\s*\"[^\"]+\"", "Replacement": "\"password\":\"<redacted>\"" }
],
"PerTargetOverrides": {
"Weather/GetForecast": { "CapBytes": 4096 },
"PlantDB": { "RedactSqlParamsMatching": "@apikey|@token" }
"PlantDB": { "RedactSqlParamsMatching": "@apikey|@token" },
"HighVolumeMethod": { "SkipBodyCapture": true }
},
"RetentionDays": 365
"RetentionDays": 365,
"PerChannelRetentionDays": {
"ApiOutbound": 90,
"Notification": 180
}
}
```
`PerTargetOverrides` keys bind by External System / Inbound Method /
Notification List / Database Connection name. `RetentionDays` is a single
global value in v1; per-channel overrides are deferred to v1.x.
Notification List / Database Connection name. `SkipBodyCapture: true` omits
`RequestSummary`/`ResponseSummary` for that method while still capturing headers
into `Extra.requestHeaders` and emitting the full audit row. `RetentionDays` is
the global window; `PerChannelRetentionDays` specifies per-channel windows that
are strictly shorter — any channel whose override equals or exceeds the global
value is silently ignored (the global partition switch-out already governs it).
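
How the capture settings interact can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, per-target `CapBytes` overrides are left out, and `"Delivered"` is a placeholder for any non-error status.

```python
ERROR_STATUSES = {"Failed", "Parked", "Discarded"}

CFG = {"DefaultCapBytes": 8192, "ErrorCapBytes": 65536, "InboundMaxBytes": 1048576}


def cap_bytes(channel, status, cfg):
    """Pick the byte cap: inbound ceiling, error cap, or default cap."""
    if channel == "ApiInbound":
        return cfg["InboundMaxBytes"]
    if status in ERROR_STATUSES:
        return cfg["ErrorCapBytes"]
    return cfg["DefaultCapBytes"]


def truncate_utf8(text, limit):
    """Cut to at most `limit` bytes without splitting a multi-byte character."""
    data = text.encode("utf-8")
    if len(data) <= limit:
        return text, False
    # errors="ignore" drops the partial character left at the cut point
    return data[:limit].decode("utf-8", errors="ignore"), True


def capture_body(body, channel, status, cfg, skip_body_capture=False):
    """Return (summary, payload_truncated); summary is None when skipped."""
    if skip_body_capture or body is None:
        return None, False
    return truncate_utf8(body, cap_bytes(channel, status, cfg))


summary, truncated = capture_body("x" * 10000, "ApiOutbound", "Delivered", CFG)
```

The second element of the returned pair corresponds to the `PayloadTruncated` flag; a skipped body is not reported as truncated in this sketch.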
The `AuditLogPurge` section controls the purge actor's cadence and batch size:
```jsonc
"AuditLogPurge": {
"IntervalHours": 24,
"ChannelPurgeBatchSize": 5000
}
```
## Ops Notes — Historical Null Columns
@@ -480,6 +573,8 @@ orphaned entries) and in the CLI's `audit tree` output.
tiles (Volume, Error rate, Backlog) plus new health metrics:
`SiteAuditBacklog`, `SiteAuditWriteFailures`, `SiteAuditTelemetryStalled`,
`CentralAuditWriteFailures`, `AuditRedactionFailure`.
- **[CLI (#19)](Component-CLI.md)** — new `scadabridge audit query`,
`scadabridge audit export`, and `scadabridge audit verify-chain` commands; same
permission requirements as the UI.
- **[CLI (#19)](Component-CLI.md)** — `scadabridge audit query`,
`scadabridge audit export`, `scadabridge audit tree --execution-id <guid>`,
`scadabridge audit backfill-source-node --sentinel <s> --before <date>`, and
`scadabridge audit verify-chain` (no-op placeholder for the deferred hash-chain
feature); same permission requirements as the UI.
+20 -5
@@ -228,14 +228,17 @@ The new centralized Audit Log component (#23) is exposed via the `scadabridge au
The `scadabridge audit` group targets the centralized Audit Log component (#23) and
exposes the UI-equivalent operational audit surface. Permissions follow the same
read-vs-export split the Central UI uses (see Component-AuditLog.md, Security &
Tamper-Evidence, and Security & Auth #10): `audit query` and `audit verify-chain`
require the `OperationalAudit` permission; `audit export` additionally requires
`AuditExport`. The server enforces permission checks and returns HTTP 403 (CLI
exit code 2) on denial.
Tamper-Evidence, and Security & Auth #10): `audit query`, `audit tree`, and
`audit verify-chain` require the `OperationalAudit` permission; `audit export`
additionally requires `AuditExport`; `audit backfill-source-node` requires the
`Admin` role (maintenance path only). The server enforces permission checks and
returns HTTP 403 (CLI exit code 2) on denial.
```
scadabridge audit query [--since <t>] [--until <t>] [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--target <t>] [--actor <a>] [--correlation-id <id>] [--execution-id <id>] [--parent-execution-id <id>] [--errors-only] [--page-size <n>] [--all]
scadabridge audit export --since <t> --until <t> --format csv|jsonl|parquet --output <path> [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--target <t>] [--actor <a>]
scadabridge audit tree --execution-id <guid> [--format table|json]
scadabridge audit backfill-source-node --before <ISO-8601-UTC> [--sentinel <value>] [--batch <n>]
scadabridge audit verify-chain --month <YYYY-MM>
```
@@ -247,6 +250,18 @@ scadabridge audit verify-chain --month <YYYY-MM>
requested format (`csv`, `jsonl`, `parquet`) written to `--output`. The server
streams rows rather than materializing them in memory; the CLI writes bytes
through to disk. Supports the same scoping filters as `audit query`.
- `audit tree --execution-id <guid>` (M5.3 T8) — renders the full execution-chain
tree for the given `ExecutionId`. The server resolves the root from any node in
the chain (walks `ParentExecutionId` to find the root, then traverses downward)
and returns all reachable executions with their summary row counts and first/last
occurred timestamps. Output format: `json` (default — structured tree suitable
for scripting) or `table` (human-readable indented tree). Requires
`OperationalAudit` permission. Backed by `GET /api/audit/tree?executionId=<guid>`.
- `audit backfill-source-node --before <ISO-8601-UTC>` (M5.6 T5) — sets
`SourceNode` to a sentinel value (`--sentinel`, default `"unknown"`) on pre-feature
rows where `SourceNode IS NULL` and `OccurredAtUtc < --before`, in batches
(`--batch`, default 5000). Admin-only maintenance command. Idempotent.
Backed by `POST /api/audit/backfill-source-node`.
- `audit verify-chain` — hash-chain verification for the named month.
**No-op in v1**: the command is defined so the command tree is stable, but
verification only becomes meaningful once the hash-chain ships (see
@@ -366,7 +381,7 @@ Configuration is resolved in the following priority order (highest wins):
- **System.CommandLine**: Command-line argument parsing.
- **Microsoft.AspNetCore.SignalR.Client**: SignalR client for the `debug stream` command's WebSocket connection.
- **Management Service (#18)**: The CLI hits the central cluster via the existing HTTP Management API (`POST /management`), which dispatches to the ManagementActor. The `scadabridge audit` command group rides a parallel REST surface on the same Host (`GET /api/audit/query` and `GET /api/audit/export`), sharing HTTP Basic Auth with `/management` but bypassing the actor for read-only, keyset-paged / streaming workloads.
- **Audit Log (#23)**: The `scadabridge audit query` and `audit export` subcommands target the centralized Audit Log component's REST endpoints (`GET /api/audit/query`, `GET /api/audit/export`) on the Host's Management API surface; `audit verify-chain` rides `POST /management` until hash-chain verification ships. Permission checks (`OperationalAudit`, `AuditExport`) are enforced server-side by `AuditEndpoints`.
- **Audit Log (#23)**: The `scadabridge audit query`, `audit export`, `audit tree`, and `audit backfill-source-node` subcommands target the centralized Audit Log component's REST endpoints (`GET /api/audit/query`, `GET /api/audit/export`, `GET /api/audit/tree`, `POST /api/audit/backfill-source-node`) on the Host's Management API surface; `audit verify-chain` is a client-side no-op today (hash-chain deferred to v1.x). Permission checks (`OperationalAudit`, `AuditExport`, `Admin`) are enforced server-side by `AuditEndpoints`.
## Interactions