docs(audit): fix Audit error rate semantics and CLI permission split

Bundle D code-review feedback on 0ae1a25 and e6f7a7f:

- Audit error rate (HealthMonitoring tile) was described as a combined view of CentralAuditWriteFailures + AuditRedactionFailure (writer health). Per alog.md §10.3 / §14.1 it is the operational error rate of audited operations: % of central AuditLog rows with Status not in (Success/Delivered/Enqueued) over a rolling 5-min window. Audit writer issues surface separately via the dedicated metrics.
- Audit volume description gains the spec-mandated 'events/min, global + per-site sparkline' shape.
- CLI: scadalink audit was claiming all three subcommands need both OperationalAudit and AuditExport. Per alog.md §11.2 / §15.1, read (query, verify-chain) needs OperationalAudit; bulk export additionally requires AuditExport. Restored the spec's split.
2026-05-20 08:30:42 -04:00
parent e6f7a7ff79
commit 72388a7616
2 changed files with 9 additions and 7 deletions
--- a/docs/requirements/Component-HealthMonitoring.md
+++ b/docs/requirements/Component-HealthMonitoring.md
@@ -85,9 +85,9 @@ Unlike the Notification Outbox, the Site Call Audit is **not a dispatcher** —

 The Audit Log spans both sites (hot-path append + telemetry forward) and central (direct-write + ingest + redaction). Its operational health surfaces as three new dashboard tiles grouped under **Audit**:

- **Audit volume** — rate of audit rows landing in the central `AuditLog` table over the last interval, sourced from the Audit Log component on the active central node.
- **Audit error rate** — combined view of `CentralAuditWriteFailures` and `AuditRedactionFailure` over the last interval; non-zero values warrant a check of the Audit Log component's own logs.
- **Audit backlog** — global aggregate of `SiteAuditBacklog` across reporting sites (count of `Pending` site-local audit rows, oldest pending age, on-disk bytes). The per-site tile also surfaces a warning badge when its `SiteAuditBacklog` crosses the configurable threshold or when `SiteAuditTelemetryStalled` is set.
+- **Audit volume** — events/min landing in the central `AuditLog` table, shown global plus per-site sparkline; sourced from the Audit Log component on the active central node.
+- **Audit error rate** — percent of central `AuditLog` rows with `Status` other than `Success` / `Delivered` / `Enqueued` over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, transient failures, parked deliveries, etc.) — NOT the audit writer's own health. Audit-writer issues surface separately via `CentralAuditWriteFailures` and `AuditRedactionFailure`.
+- **Audit backlog** — global aggregate of `SiteAuditBacklog` across reporting sites (count of `Pending` site-local audit rows, oldest pending age, on-disk bytes); click drills into a per-site breakdown. The per-site tile surfaces a warning badge when its `SiteAuditBacklog` crosses the configurable threshold or when `SiteAuditTelemetryStalled` is set.

 These tiles are **point-in-time** like the Notification Outbox and Site Call Audit KPI tiles — no time-series store; consistent with Health Monitoring's "current status only" philosophy. The site-scoped `SiteAuditBacklog` / `SiteAuditWriteFailures` / `SiteAuditTelemetryStalled` metrics arrive in the existing site health report; the central-scoped `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics are central-computed alongside the existing central KPIs.
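
Reviewer note: the corrected **Audit error rate** definition can be sketched as below. This is a minimal illustration of the arithmetic only, not the component's implementation. The `AuditRow` type, the `occurred_at` field name, the `Failed` status value in the usage example, and the choice to report 0% for an empty window are all assumptions made for the example; only the three non-error statuses and the 5-minute window come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Statuses that do NOT count as errors, per the tile description.
OK_STATUSES = frozenset({"Success", "Delivered", "Enqueued"})
# Rolling window length, per the tile description.
WINDOW = timedelta(minutes=5)


@dataclass(frozen=True)
class AuditRow:
    """Hypothetical stand-in for a central AuditLog row."""
    occurred_at: datetime  # assumed column name
    status: str


def audit_error_rate(rows, now):
    """Percent of rows in the last 5 minutes whose status is not OK."""
    cutoff = now - WINDOW
    recent = [r for r in rows if cutoff < r.occurred_at <= now]
    if not recent:
        # Assumption: an empty window is reported as 0% rather than "no data".
        return 0.0
    failed = sum(1 for r in recent if r.status not in OK_STATUSES)
    return 100.0 * failed / len(recent)
```

With four rows inside the window (one `Success`, one `Delivered`, one `Enqueued`, one with a hypothetical `Failed` status) and one older row outside it, the function yields 25.0. Writer-side counters such as `CentralAuditWriteFailures` play no part in this figure, which is the distinction this commit restores.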