docs(audit): add Audit Log health metrics and dashboard tiles

This commit is contained in:
Joseph Doherty
2026-05-20 08:26:03 -04:00
parent 61ec4161bf
commit 0ae1a254d7

View File

@@ -34,6 +34,11 @@ Site clusters (metric collection and reporting). Central cluster (aggregation an
| Notification Outbox queue depth | Notification Outbox (central) | Count of `Pending` + `Retrying` notifications — central-computed, not site-reported |
| Notification Outbox stuck count | Notification Outbox (central) | Count of `Pending` / `Retrying` notifications older than the configurable stuck-age threshold — central-computed, not site-reported |
| Notification Outbox parked count | Notification Outbox (central) | Count of `Parked` notifications — central-computed, not site-reported |
| `SiteAuditBacklog` | Audit Log (site) | Count of `Pending` rows in the site-local `AuditLog` plus oldest-pending-age plus on-disk bytes. A configurable threshold drives a Health dashboard warning on the affected site tile. |
| `SiteAuditWriteFailures` | Audit Log (site) | Count of failed hot-path audit appends at the site since the last health report. |
| `SiteAuditTelemetryStalled` | Audit Log (site) | Boolean flag set when reconciliation reports a non-draining site-local audit backlog over two consecutive cycles. |
| `CentralAuditWriteFailures` | Audit Log (central) | Count of central direct-write audit failures (Inbound API middleware, Notification Outbox dispatcher, and any other central direct writers) since the last interval. |
| `AuditRedactionFailure` | Audit Log (central) | Count of payload redactor errors (over-redacted payloads, safety-net hit) since the last interval. |
## Reporting Protocol
@@ -76,6 +81,16 @@ The Site Call Audit is a **central** component, so its KPIs — like the Notific
Unlike the Notification Outbox, the Site Call Audit is **not a dispatcher** — cached calls are delivered by each site's Store-and-Forward Engine, and the `SiteCalls` table is an eventually-consistent central mirror of site-owned status.
## Audit Log KPIs
The Audit Log spans both sites (hot-path append + telemetry forward) and central (direct-write + ingest + redaction). Its operational health surfaces as three new dashboard tiles grouped under **Audit**:
- **Audit volume** — rate of audit rows landing in the central `AuditLog` table over the last interval, sourced from the Audit Log component on the active central node.
- **Audit error rate** — combined view of `CentralAuditWriteFailures` and `AuditRedactionFailure` over the last interval; non-zero values warrant a check of the Audit Log component's own logs.
- **Audit backlog** — global aggregate of `SiteAuditBacklog` across reporting sites (count of `Pending` site-local audit rows, oldest pending age, on-disk bytes). The per-site tile also surfaces a warning badge when its `SiteAuditBacklog` crosses the configurable threshold or when `SiteAuditTelemetryStalled` is set.
These tiles are **point-in-time** like the Notification Outbox and Site Call Audit KPI tiles — no time-series store; consistent with Health Monitoring's "current status only" philosophy. The site-scoped `SiteAuditBacklog` / `SiteAuditWriteFailures` / `SiteAuditTelemetryStalled` metrics arrive in the existing site health report; the central-scoped `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics are central-computed alongside the existing central KPIs.
## Central Storage
- Health metrics are held **in memory** at the central cluster for display in the UI.
@@ -97,6 +112,7 @@ Unlike the Notification Outbox, the Site Call Audit is **not a dispatcher** —
- **Cluster Infrastructure (site)**: Provides node role status.
- **Notification Outbox (central)**: Provides central-computed outbox KPIs — queue depth, stuck count, parked count — for the headline dashboard tiles.
- **Site Call Audit (central)**: Provides central-computed cached-call KPIs — buffered count, parked count, failed/delivered (last interval), oldest pending age, stuck count — for the headline dashboard tiles.
- **Audit Log (#23)**: Provides the site-reported `SiteAuditBacklog` / `SiteAuditWriteFailures` / `SiteAuditTelemetryStalled` metrics (via the site health report) and the central-computed `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics, plus the central audit-row rate feeding the **Audit** dashboard tile group (Audit volume, Audit error rate, Audit backlog).
## Interactions