From 0ae1a254d71ff742f0deb4a895ab98a2dfb04d13 Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Wed, 20 May 2026 08:26:03 -0400 Subject: [PATCH] docs(audit): add Audit Log health metrics and dashboard tiles --- docs/requirements/Component-HealthMonitoring.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/docs/requirements/Component-HealthMonitoring.md b/docs/requirements/Component-HealthMonitoring.md index 548603d..190d33b 100644 --- a/docs/requirements/Component-HealthMonitoring.md +++ b/docs/requirements/Component-HealthMonitoring.md @@ -34,6 +34,11 @@ Site clusters (metric collection and reporting). Central cluster (aggregation an | Notification Outbox queue depth | Notification Outbox (central) | Count of `Pending` + `Retrying` notifications — central-computed, not site-reported | | Notification Outbox stuck count | Notification Outbox (central) | Count of `Pending` / `Retrying` notifications older than the configurable stuck-age threshold — central-computed, not site-reported | | Notification Outbox parked count | Notification Outbox (central) | Count of `Parked` notifications — central-computed, not site-reported | +| `SiteAuditBacklog` | Audit Log (site) | Count of `Pending` rows in the site-local `AuditLog` plus oldest-pending-age plus on-disk bytes. A configurable threshold drives a Health dashboard warning on the affected site tile. | +| `SiteAuditWriteFailures` | Audit Log (site) | Count of failed hot-path audit appends at the site since the last health report. | +| `SiteAuditTelemetryStalled` | Audit Log (site) | Boolean flag set when reconciliation reports a non-draining site-local audit backlog over two consecutive cycles. | +| `CentralAuditWriteFailures` | Audit Log (central) | Count of central direct-write audit failures (Inbound API middleware, Notification Outbox dispatcher, and any other central direct writers) since the last interval. | +| `AuditRedactionFailure` | Audit Log (central) | Count of payload redactor errors (over-redacted payloads, safety-net hit) since the last interval. | ## Reporting Protocol @@ -76,6 +81,16 @@ The Site Call Audit is a **central** component, so its KPIs — like the Notific Unlike the Notification Outbox, the Site Call Audit is **not a dispatcher** — cached calls are delivered by each site's Store-and-Forward Engine, and the `SiteCalls` table is an eventually-consistent central mirror of site-owned status. +## Audit Log KPIs + +The Audit Log spans both sites (hot-path append + telemetry forward) and central (direct-write + ingest + redaction). Its operational health surfaces as three new dashboard tiles grouped under **Audit**: + +- **Audit volume** — rate of audit rows landing in the central `AuditLog` table over the last interval, sourced from the Audit Log component on the active central node. +- **Audit error rate** — combined view of `CentralAuditWriteFailures` and `AuditRedactionFailure` over the last interval; non-zero values warrant a check of the Audit Log component's own logs. +- **Audit backlog** — global aggregate of `SiteAuditBacklog` across reporting sites (count of `Pending` site-local audit rows, oldest pending age, on-disk bytes). The per-site tile also surfaces a warning badge when its `SiteAuditBacklog` crosses the configurable threshold or when `SiteAuditTelemetryStalled` is set. + +These tiles are **point-in-time** like the Notification Outbox and Site Call Audit KPI tiles — no time-series store; consistent with Health Monitoring's "current status only" philosophy. The site-scoped `SiteAuditBacklog` / `SiteAuditWriteFailures` / `SiteAuditTelemetryStalled` metrics arrive in the existing site health report; the central-scoped `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics are central-computed alongside the existing central KPIs. + ## Central Storage - Health metrics are held **in memory** at the central cluster for display in the UI. @@ -97,6 +112,7 @@ Unlike the Notification Outbox, the Site Call Audit is **not a dispatcher** — - **Cluster Infrastructure (site)**: Provides node role status. - **Notification Outbox (central)**: Provides central-computed outbox KPIs — queue depth, stuck count, parked count — for the headline dashboard tiles. - **Site Call Audit (central)**: Provides central-computed cached-call KPIs — buffered count, parked count, failed/delivered (last interval), oldest pending age, stuck count — for the headline dashboard tiles. +- **Audit Log (#23)**: Provides the site-reported `SiteAuditBacklog` / `SiteAuditWriteFailures` / `SiteAuditTelemetryStalled` metrics (via the site health report) and the central-computed `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics, plus the central audit-row rate feeding the **Audit** dashboard tile group (Audit volume, Audit error rate, Audit backlog). ## Interactions