test+docs(m5): M5.7 — de-date 2 EndToEnd purge tests (closes #52); document T3-T8 in Component-AuditLog/-CLI/README/CLAUDE
Tests: anchor SeedOccurredAt() to a fixed thresholdAnchor (2026-01-20) and compute RetentionDays dynamically (UtcNow - anchor + 1d) so the threshold always sits near Jan 20 2026, between the Jan-15 "old" seed (purged) and the Apr-15/Jun-15 "kept" seeds. Seed dates stay within the explicit pf_AuditLog_Month boundary range (Jan 2026 – Dec 2027); relative-from-now offsets landed before 2026-01-01 (the catch-all partition, invisible to GetPartitionBoundariesOlderThanAsync). Both tests confirmed passing; all 284 AuditLog tests green.

Docs:
- Component-AuditLog.md: per-channel retention overrides (T3, PerChannelRetentionDays + bounded DELETE + AuditLogPurge:ChannelPurgeBatchSize); ParentExecutionId tag-cascade now spans alarm-triggered + nested CallScript/CallShared + inbound→routed (T4, "no further spawn points deferred"); per-node stuck KPIs for Notification Outbox + Site Call Audit (T6); T7 structured response-capture increments (request headers in Extra.requestHeaders, AuditInboundCeilingHits counter, per-method SkipBodyCapture); T8 CLI audit tree; T1 hash-chain + T2 Parquet explicitly marked deferred to v1.x.
- Component-CLI.md + README.md: document audit tree --execution-id <guid> and audit backfill-source-node --sentinel/--before/--batch with exact options verified against AuditCommands.cs; update Interactions to list new endpoints.
- CLAUDE.md: update audit-log design-decision bullets for T3 per-channel retention, T4 tag-cascade complete, T6 per-node KPIs, T7 inbound capture increments, T8 tree command; clarify T1/T2 remain deferred to v1.x.
@@ -158,16 +158,32 @@ is per-run and flat — `WHERE ExecutionId = X` returns everything one run did,
 nothing links a run to the run that *spawned* it. `ParentExecutionId` carries the
 spawning execution's `ExecutionId`: a spawned run still gets its own fresh
 `ExecutionId`, and every audit row it emits also carries the spawner's id in
-`ParentExecutionId`. The first cut bridges the **inbound API → routed-site-script**
-case: an inbound request runs a method script that calls `Route.Call`, routing to
-a site instance; the routed site script records the inbound request's
-`ExecutionId` as its `ParentExecutionId`, while the inbound `InboundRequest` row
-itself is top-level (`ParentExecutionId` NULL). The pointer always references the
-*immediate* spawner, so a routed run that itself routes onward threads its own
-`ExecutionId` — walking `ParentExecutionId → ExecutionId` recursively
-reconstructs the call chain as a tree of arbitrary depth. The tag-cascade case
-(an attribute write triggering another script) is **deferred** — the model
-generalises to it with no schema change once that spawn point is threaded.
+`ParentExecutionId`. The pointer always references the *immediate* spawner, so a
+run that itself spawns further runs threads its own `ExecutionId` — walking
+`ParentExecutionId → ExecutionId` recursively reconstructs the call chain as a
+tree of arbitrary depth.
+
+**Tag-cascade coverage (M5.4 T4):** `ParentExecutionId` threading now spans all
+known spawn points:
+
+- **Inbound API → routed site script** — an inbound request runs a method script
+that calls `Route.Call`; the routed site script records the inbound request's
+`ExecutionId` as its `ParentExecutionId`, while the inbound `InboundRequest` row
+is top-level (`ParentExecutionId` NULL).
+- **Alarm-triggered on-trigger script** — when an alarm fires and its on-trigger
+script runs (via `AlarmActor → AlarmExecutionActor`), the alarm context's
+`ExecutionId` is carried as the run's `ParentExecutionId`. Currently the alarm
+subsystem has no Guid-typed firing id so on-trigger runs are roots (NULL) in
+practice, but the wiring is in place for a future alarm `ExecutionId`.
+- **Nested `CallScript` / `CallShared` invocations** — when a script calls
+`Instance.CallScript(...)` or a shared script via `CallShared`, the calling
+execution's `ExecutionId` threads into the spawned run as its
+`ParentExecutionId`, making deeply nested call chains visible as a tree.
+
+Attribute-write-triggered cascades (one tag change triggering another script via a
+tag subscription) are also wired: trigger-driven runs carry `ParentExecutionId =
+NULL` (top-level roots), and any nested `CallScript`/`CallShared` they perform
+chains as above. The schema is unchanged — no further tag-cascade work is deferred.
 
 ## The Site-Local `AuditLog` (SQLite)
 
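The `ParentExecutionId` threading described in the hunk above can be illustrated with a small sketch. It assumes the audit rows have already been reduced to a mapping from each `ExecutionId` to its `ParentExecutionId` (`None` for top-level runs); the function names `find_root` and `build_tree` and the string ids are hypothetical and not taken from the ScadaBridge code base.

```python
from collections import defaultdict


def find_root(execution_id, parent_of):
    """Walk ParentExecutionId upward until a run with no parent is reached."""
    seen = set()
    current = execution_id
    while parent_of.get(current) is not None:
        if current in seen:
            # Defensive guard only: a parent pointer loop should never exist.
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        current = parent_of[current]
    return current


def build_tree(root, parent_of):
    """Invert the parent pointers and return the tree below `root`."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        if parent is not None:
            children[parent].append(child)

    def subtree(node):
        return {
            "execution_id": node,
            "children": [subtree(c) for c in sorted(children[node])],
        }

    return subtree(root)


# Hypothetical chain: inbound request -> routed site script -> nested CallScript.
parent_of = {"inbound": None, "routed": "inbound", "nested": "routed"}
tree = build_tree(find_root("nested", parent_of), parent_of)
```

Because each row stores only its immediate spawner, the root can be resolved from any node in the chain before traversing downward.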
@@ -268,7 +284,34 @@ operational `SiteCalls` shape for the dispatcher and UI.
 
 - **Default cap** — 8 KB for each of `RequestSummary` and `ResponseSummary`;
 raised to 64 KB on any error row (`Status IN ('Failed', 'Parked', 'Discarded')`).
-- **Inbound API exception.** For `Channel = ApiInbound`, `RequestSummary` and `ResponseSummary` are captured in full up to a per-body hard ceiling of 1 MiB (configurable via `AuditLog:InboundMaxBytes`; default 1 048 576 bytes; min 8 192; max 16 777 216). The 8 KiB / 64 KiB default/error caps that apply to other channels do not apply here. `PayloadTruncated = 1` is set only when the inbound ceiling is hit — verbatim capture is the normal case. The ceiling applies independently to each body. Header redaction and per-target body redactors still run before persistence.
+- **Inbound API exception.** For `Channel = ApiInbound`, `RequestSummary` and
+`ResponseSummary` are captured in full up to a per-body hard ceiling of 1 MiB
+(configurable via `AuditLog:InboundMaxBytes`; default 1 048 576 bytes; min
+8 192; max 16 777 216). The 8 KiB / 64 KiB default/error caps that apply to
+other channels do not apply here. `PayloadTruncated = 1` is set only when the
+inbound ceiling is hit — verbatim capture is the normal case. The ceiling
+applies independently to each body. Header redaction and per-target body
+redactors still run before persistence.
+- **Inbound ceiling hits (M5.3 T7).** Every time the `InboundMaxBytes` ceiling
+truncates a body an `IAuditInboundCeilingHitsCounter.Increment()` call fires.
+This counter is surfaced as `AuditInboundCeilingHits` on the central health
+snapshot (alongside `CentralAuditWriteFailures` / `AuditRedactionFailure`) so
+operators can detect persistently oversized payloads and raise the ceiling or
+add per-target body redactors.
+- **Request headers in `Extra` (M5.3 T7).** For `Channel = ApiInbound`, the
+`AuditWriteMiddleware` captures the inbound HTTP request headers (post-redaction
+— `Authorization`, `X-API-Key`, `Cookie`, `Set-Cookie`, and the configured
+`HeaderRedactList` are scrubbed before serialization) into the `Extra` JSON
+column under the key `"requestHeaders"`. This makes the full header envelope
+visible in the Audit Log UI's detail drawer and the CLI's `audit query` output
+without widening the schema.
+- **Per-method `SkipBodyCapture` (M5.3 T7).** `PerTargetOverrides` now includes
+a `SkipBodyCapture: true` flag. When set for an inbound API method, the audit
+row is always emitted (headers, status, duration, actor, etc. are recorded) but
+`RequestSummary` and `ResponseSummary` are left null. Use this for methods whose
+payloads are structurally large or contain secrets not covered by body redactors.
+Headers are still captured into `Extra.requestHeaders` (after redaction) even
+when `SkipBodyCapture` is true.
 - **Truncation** — UTF-8 byte-safe; `PayloadTruncated = 1` when applied. Full
 bodies are never stored.
 - **HTTP headers** — `Authorization`, `Cookie`, `Set-Cookie`, `X-API-Key`, and
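The cap selection and the "UTF-8 byte-safe" truncation described in the hunk above could look roughly like the sketch below. The byte values come from the documented defaults (8 KB, 64 KB, 1 MiB); the function names `cap_for` and `truncate_utf8`, and the `"Delivered"` status used in the usage lines, are assumptions made for illustration and may not match the real implementation.

```python
DEFAULT_CAP_BYTES = 8192        # default cap per summary
ERROR_CAP_BYTES = 65536         # raised cap on error rows
INBOUND_MAX_BYTES = 1048576     # per-body ceiling for ApiInbound
ERROR_STATUSES = {"Failed", "Parked", "Discarded"}


def cap_for(channel, status):
    """Pick the byte cap for one body according to the documented policy."""
    if channel == "ApiInbound":
        return INBOUND_MAX_BYTES
    if status in ERROR_STATUSES:
        return ERROR_CAP_BYTES
    return DEFAULT_CAP_BYTES


def truncate_utf8(text, cap_bytes):
    """Return (text, truncated) without splitting a multi-byte character."""
    data = text.encode("utf-8")
    if len(data) <= cap_bytes:
        return text, False
    # Cutting at a byte offset may leave a partial trailing sequence;
    # errors="ignore" drops those incomplete bytes instead of raising.
    return data[:cap_bytes].decode("utf-8", errors="ignore"), True


body, payload_truncated = truncate_utf8("héllo", 2)   # "é" is two bytes
cap = cap_for("ApiOutbound", "Failed")
```

Dropping the incomplete trailing bytes means the stored summary can be slightly shorter than the cap, but it always remains valid UTF-8.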
@@ -311,16 +354,33 @@ MS SQL for direct-write events). Unredacted secrets never persist.
 ## Retention & Purge
 
 - **Central:** 365-day default based on `OccurredAtUtc`, configurable via
-`AuditLog:RetentionDays` (min 7, max 3650). Single global retention in v1 —
-no per-channel overrides.
+`AuditLog:RetentionDays` (min 30, max 3650).
 - **Partitioning:** monthly partitions on `OccurredAtUtc` from day one
-(`pf_AuditLog_Month` / `ps_AuditLog_Month`). Purge is a partition switch;
-there are no row-level deletes at central.
+(`pf_AuditLog_Month` / `ps_AuditLog_Month`). The global partition switch is
+channel-blind; it drops a whole month once every row in it is older than the
+global window. There are no row-level deletes at central for the global purge.
 - **Purge actor:** `AuditLogPurgeActor` singleton on the active central node
 runs daily, switches out any partition whose latest `OccurredAtUtc` is older
-than the retention window, and emits an `AuditLog:Purged` event (partition
-range, rowcount, duration). A partition-maintenance step rolls forward each
-month, creating the next month's partition ahead of time.
+than the retention window, then applies any per-channel overrides (see below),
+and emits an `AuditLog:Purged` event (partition range, rowcount, duration) per
+switched partition. A partition-maintenance step rolls forward each month,
+creating the next month's partition ahead of time.
+- **Per-channel retention overrides (M5.5 T3):** `AuditLog:PerChannelRetentionDays`
+is a dictionary keyed by canonical channel name (`ApiOutbound`, `DbOutbound`,
+`Notification`, `ApiInbound`) whose value is a retention window in days that
+MUST be strictly shorter than the global `RetentionDays`. After the daily
+partition switch-out, the purge actor runs a bounded, batched row DELETE
+(`PurgeChannelOlderThanAsync`) for each channel whose override is shorter than
+the global window — expiring rows of that channel earlier than the global
+partition switch would. Overrides equal to or longer than the global window are
+silently skipped (the global switch already covers them). The DELETE runs under
+`scadabridge_audit_purger` (the maintenance role); the append-only writer role
+is unaffected. Batch size is configurable via
+`AuditLogPurge:ChannelPurgeBatchSize` (default 5000). Each channel override
+runs in its own try/catch, mirroring the per-boundary error-isolation of the
+partition switch-out loop. Values are validated to be in
+`[30, RetentionDays]`; keys that are not a recognized `AuditChannel` enum name
+are rejected at startup.
 - **Sites:** daily site job; default 7-day retention (configurable, min 1,
 max 90). Respects the hard `ForwardState` invariant — `Pending` rows are
 never purged on age alone.
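As a rough illustration of the per-channel override rules in the hunk above, the sketch below validates a `PerChannelRetentionDays` dictionary and computes the cutoff timestamp for each channel that needs a row-level purge. It assumes validation to `[30, RetentionDays]` happens first and that an override equal to the global window is then skipped; the function names are hypothetical, and the real `PurgeChannelOlderThanAsync` batching is not modelled here.

```python
from datetime import datetime, timedelta, timezone

KNOWN_CHANNELS = {"ApiOutbound", "DbOutbound", "Notification", "ApiInbound"}
MIN_RETENTION_DAYS = 30


def resolve_channel_overrides(retention_days, per_channel):
    """Validate overrides; return only those shorter than the global window."""
    active = {}
    for channel, days in per_channel.items():
        if channel not in KNOWN_CHANNELS:
            raise ValueError(f"unknown audit channel: {channel}")
        if not MIN_RETENTION_DAYS <= days <= retention_days:
            raise ValueError(
                f"{channel}: {days} outside [{MIN_RETENTION_DAYS}, {retention_days}]"
            )
        if days < retention_days:
            # Equal to the global window: the partition switch already covers it.
            active[channel] = days
    return active


def purge_cutoffs(now_utc, retention_days, per_channel):
    """Map each active channel to the OccurredAtUtc cutoff for its row DELETE."""
    overrides = resolve_channel_overrides(retention_days, per_channel)
    return {ch: now_utc - timedelta(days=d) for ch, d in overrides.items()}


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
cutoffs = purge_cutoffs(now, 365, {"ApiOutbound": 90, "Notification": 180})
```

With the sample configuration shown later in the diff, `ApiOutbound` rows older than 90 days and `Notification` rows older than 180 days would be eligible for the batched DELETE, while the other channels wait for the global partition switch.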
@@ -340,10 +400,13 @@ MS SQL for direct-write events). Unredacted secrets never persist.
 **AuditExport** permission.
 - **Payload redaction at write.** See Payload Capture Policy. Unredacted
 secrets never persist; the safety net over-redacts on misconfiguration.
-- **Hash-chain tamper evidence — deferred to v1.x.** A future `RowHash` column,
-computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will be
-verifiable offline via `scadabridge audit verify-chain --month YYYY-MM`. Off by
-default in v1.
+- **Hash-chain tamper evidence (T1) — deferred to v1.x.** A future `RowHash`
+column, computed per partition as `SHA-256(prev.RowHash || canonical(row))`, will
+be verifiable offline via `scadabridge audit verify-chain --month YYYY-MM`. The
+`verify-chain` CLI command is a no-op placeholder today. Off by default in v1.
+- **Parquet archival (T2) — deferred to v1.x.** Long-term cold storage of purged
+monthly partitions as Parquet files (suitable for offline analytics) will be
+added in a future milestone. T1 and T2 are not shipped as part of M5.
 - **Site SQLite security.** File permissions: read/write by the ScadaBridge
 service account only. Not backed up off-machine — site SQLite is a buffer,
 not a record.
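The deferred hash-chain formula `SHA-256(prev.RowHash || canonical(row))` can be sketched as follows. Since the feature is not shipped, everything beyond the formula is an assumption: the canonical form (sorted-key compact JSON), the all-zero seed for the first row of a partition, and the function names are hypothetical choices made only to keep the example runnable.

```python
import hashlib
import json

GENESIS_HASH = b"\x00" * 32  # assumed seed for the first row in a partition


def canonical(row):
    """Assumed canonical form: compact JSON with sorted keys, UTF-8 encoded."""
    return json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")


def chain_hashes(rows):
    """Compute RowHash for each row as SHA-256(prev.RowHash || canonical(row))."""
    prev = GENESIS_HASH
    hashes = []
    for row in rows:
        prev = hashlib.sha256(prev + canonical(row)).digest()
        hashes.append(prev)
    return hashes


def verify_chain(rows, stored_hashes):
    """Recompute the chain and compare it with the stored RowHash values."""
    return chain_hashes(rows) == list(stored_hashes)


rows = [{"id": 1, "status": "Delivered"}, {"id": 2, "status": "Failed"}]
stored = chain_hashes(rows)
ok = verify_chain(rows, stored)
```

Because each hash folds in the previous one, altering any row changes its own hash and every hash after it in the partition, which is what makes offline verification possible.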
@@ -355,11 +418,22 @@ Point-in-time, computed from the central `AuditLog` table; global and per-site.
 - **Audit volume** — events/min landing in the central `AuditLog`; global plus per-site sparkline.
 - **Audit error rate** — % of central `AuditLog` rows with `Status IN ('Failed', 'Parked', 'Discarded')` over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, permanent failures, parked deliveries) — NOT audit-writer health, which surfaces separately via `CentralAuditWriteFailures` and `AuditRedactionFailure`.
 - **Audit backlog** — sum of `Pending` site rows across sites; click drills into a per-site breakdown.
+- **`AuditInboundCeilingHits`** (M5.3 T7) — rolling count of inbound API responses truncated by the `InboundMaxBytes` ceiling; surfaced on the central health snapshot alongside `CentralAuditWriteFailures`.
 
+**Per-node stuck KPIs (M5.3 T6):** Both [Notification Outbox](Component-NotificationOutbox.md)
+and [Site Call Audit](Component-SiteCallAudit.md) now expose a
+`PerNodeNotificationKpiRequest` / `PerNodeSiteCallKpiRequest` message pair that
+groups the existing stuck, parked, and delivered-last-interval counts by the
+`SourceNode` that emitted the original row. This surfaces per-node breakdowns on
+the Health dashboard tiles and the Notification Outbox / Site Calls pages,
+making it possible to identify a single misbehaving node (e.g., `site-a:node-b`)
+as the source of a spike rather than a site-wide problem. The existing global and
+per-site KPI shapes are unchanged; the per-node slice is additive.
+
 [Notification Outbox](Component-NotificationOutbox.md) and
-[Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected — they remain
-sourced from `Notifications` and `SiteCalls` respectively. Audit Log KPIs
-describe the audit table itself.
+[Site Call Audit](Component-SiteCallAudit.md) KPIs are unaffected for their
+operational dispatch responsibilities — they remain sourced from `Notifications`
+and `SiteCalls` respectively. Audit Log KPIs describe the audit table itself.
 
 ## Configuration
 
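The per-node slice described in the hunk above amounts to grouping the existing counts by `SourceNode`. The sketch below shows that grouping over an in-memory list; the row shape (`source_node`, `state`) and the function name are hypothetical, and the real KPI queries presumably run against the `Notifications` and `SiteCalls` tables rather than Python dictionaries.

```python
from collections import defaultdict

KPI_STATES = ("stuck", "parked", "delivered")


def per_node_kpis(rows):
    """Group stuck / parked / delivered counts by the emitting SourceNode."""
    kpis = defaultdict(lambda: {state: 0 for state in KPI_STATES})
    for row in rows:
        if row["state"] in KPI_STATES:
            kpis[row["source_node"]][row["state"]] += 1
    return dict(kpis)


rows = [
    {"source_node": "site-a:node-a", "state": "delivered"},
    {"source_node": "site-a:node-b", "state": "stuck"},
    {"source_node": "site-a:node-b", "state": "stuck"},
    {"source_node": "site-a:node-b", "state": "parked"},
]
kpis = per_node_kpis(rows)
```

In this made-up sample the stuck rows all come from `site-a:node-b`, which is the kind of single-node spike the per-node breakdown is meant to expose.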
@@ -370,21 +444,40 @@ component (Options pattern):
 "AuditLog": {
 "DefaultCapBytes": 8192,
 "ErrorCapBytes": 65536,
+"InboundMaxBytes": 1048576,
 "HeaderRedactList": [ "Authorization", "Cookie", "Set-Cookie", "X-API-Key" ],
 "GlobalBodyRedactors": [
 { "Pattern": "\"password\"\\s*:\\s*\"[^\"]+\"", "Replacement": "\"password\":\"<redacted>\"" }
 ],
 "PerTargetOverrides": {
 "Weather/GetForecast": { "CapBytes": 4096 },
-"PlantDB": { "RedactSqlParamsMatching": "@apikey|@token" }
+"PlantDB": { "RedactSqlParamsMatching": "@apikey|@token" },
+"HighVolumeMethod": { "SkipBodyCapture": true }
 },
-"RetentionDays": 365
+"RetentionDays": 365,
+"PerChannelRetentionDays": {
+"ApiOutbound": 90,
+"Notification": 180
+}
 }
 ```
 
 `PerTargetOverrides` keys bind by External System / Inbound Method /
-Notification List / Database Connection name. `RetentionDays` is a single
-global value in v1; per-channel overrides are deferred to v1.x.
+Notification List / Database Connection name. `SkipBodyCapture: true` omits
+`RequestSummary`/`ResponseSummary` for that method while still capturing headers
+into `Extra.requestHeaders` and emitting the full audit row. `RetentionDays` is
+the global window; `PerChannelRetentionDays` specifies per-channel windows that
+are strictly shorter — any channel whose override equals or exceeds the global
+value is silently ignored (the global partition switch-out already governs it).
+
+`AuditLogPurge` section controls the purge actor cadence and batch size:
+
+```jsonc
+"AuditLogPurge": {
+"IntervalHours": 24,
+"ChannelPurgeBatchSize": 5000
+}
+```
 
 ## Ops Notes — Historical Null Columns
 
@@ -480,6 +573,8 @@ orphaned entries) and in the CLI's `audit tree` output.
 tiles (Volume, Error rate, Backlog) plus new health metrics:
 `SiteAuditBacklog`, `SiteAuditWriteFailures`, `SiteAuditTelemetryStalled`,
 `CentralAuditWriteFailures`, `AuditRedactionFailure`.
-- **[CLI (#19)](Component-CLI.md)** — new `scadabridge audit query`,
-`scadabridge audit export`, and `scadabridge audit verify-chain` commands; same
-permission requirements as the UI.
+- **[CLI (#19)](Component-CLI.md)** — `scadabridge audit query`,
+`scadabridge audit export`, `scadabridge audit tree --execution-id <guid>`,
+`scadabridge audit backfill-source-node --sentinel <s> --before <date>`, and
+`scadabridge audit verify-chain` (no-op placeholder for the deferred hash-chain
+feature); same permission requirements as the UI.
@@ -228,14 +228,17 @@ The new centralized Audit Log component (#23) is exposed via the `scadabridge au
 The `scadabridge audit` group targets the centralized Audit Log component (#23) and
 exposes the UI-equivalent operational audit surface. Permissions follow the same
 read-vs-export split the Central UI uses (see Component-AuditLog.md, Security &
-Tamper-Evidence, and Security & Auth #10): `audit query` and `audit verify-chain`
-require the `OperationalAudit` permission; `audit export` additionally requires
-`AuditExport`. The server enforces permission checks and returns HTTP 403 (CLI
-exit code 2) on denial.
+Tamper-Evidence, and Security & Auth #10): `audit query`, `audit tree`, and
+`audit verify-chain` require the `OperationalAudit` permission; `audit export`
+additionally requires `AuditExport`; `audit backfill-source-node` requires the
+`Admin` role (maintenance path only). The server enforces permission checks and
+returns HTTP 403 (CLI exit code 2) on denial.
 
 ```
 scadabridge audit query [--since <t>] [--until <t>] [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--target <t>] [--actor <a>] [--correlation-id <id>] [--execution-id <id>] [--parent-execution-id <id>] [--errors-only] [--page-size <n>] [--all]
 scadabridge audit export --since <t> --until <t> --format csv|jsonl|parquet --output <path> [--channel <c>] [--kind <k>] [--status <s>] [--site <s>] [--target <t>] [--actor <a>]
+scadabridge audit tree --execution-id <guid> [--format table|json]
+scadabridge audit backfill-source-node --before <ISO-8601-UTC> [--sentinel <value>] [--batch <n>]
 scadabridge audit verify-chain --month <YYYY-MM>
 ```
 
@@ -247,6 +250,18 @@ scadabridge audit verify-chain --month <YYYY-MM>
 requested format (`csv`, `jsonl`, `parquet`) written to `--output`. The server
 streams rows rather than materializing them in memory; the CLI writes bytes
 through to disk. Supports the same scoping filters as `audit query`.
+- `audit tree --execution-id <guid>` (M5.3 T8) — renders the full execution-chain
+tree for the given `ExecutionId`. The server resolves the root from any node in
+the chain (walks `ParentExecutionId` to find the root, then traverses downward)
+and returns all reachable executions with their summary row counts and first/last
+occurred timestamps. Output format: `json` (default — structured tree suitable
+for scripting) or `table` (human-readable indented tree). Requires
+`OperationalAudit` permission. Backed by `GET /api/audit/tree?executionId=<guid>`.
+- `audit backfill-source-node --before <ISO-8601-UTC>` (M5.6 T5) — sets
+`SourceNode` to a sentinel value (`--sentinel`, default `"unknown"`) on pre-feature
+rows where `SourceNode IS NULL` and `OccurredAtUtc < --before`, in batches
+(`--batch`, default 5000). Admin-only maintenance command. Idempotent.
+Backed by `POST /api/audit/backfill-source-node`.
 - `audit verify-chain` — hash-chain verification for the named month.
 **No-op in v1**: the command is defined so the command tree is stable, but
 verification only becomes meaningful once the hash-chain ships (see
@@ -366,7 +381,7 @@ Configuration is resolved in the following priority order (highest wins):
 - **System.CommandLine**: Command-line argument parsing.
 - **Microsoft.AspNetCore.SignalR.Client**: SignalR client for the `debug stream` command's WebSocket connection.
 - **Management Service (#18)**: The CLI hits the central cluster via the existing HTTP Management API (`POST /management`), which dispatches to the ManagementActor. The `scadabridge audit` command group rides a parallel REST surface on the same Host (`GET /api/audit/query` and `GET /api/audit/export`), sharing HTTP Basic Auth with `/management` but bypassing the actor for read-only, keyset-paged / streaming workloads.
-- **Audit Log (#23)**: The `scadabridge audit query` and `audit export` subcommands target the centralized Audit Log component's REST endpoints (`GET /api/audit/query`, `GET /api/audit/export`) on the Host's Management API surface; `audit verify-chain` rides `POST /management` until hash-chain verification ships. Permission checks (`OperationalAudit`, `AuditExport`) are enforced server-side by `AuditEndpoints`.
+- **Audit Log (#23)**: The `scadabridge audit query`, `audit export`, `audit tree`, and `audit backfill-source-node` subcommands target the centralized Audit Log component's REST endpoints (`GET /api/audit/query`, `GET /api/audit/export`, `GET /api/audit/tree`, `POST /api/audit/backfill-source-node`) on the Host's Management API surface; `audit verify-chain` is a client-side no-op today (hash-chain deferred to v1.x). Permission checks (`OperationalAudit`, `AuditExport`, `Admin`) are enforced server-side by `AuditEndpoints`.
 
 ## Interactions
 