docs(requirements): add Site Call Audit component (#22)

Joseph Doherty
2026-05-19 11:32:00 -04:00
parent a08ad09514
commit 627c48c458


@@ -0,0 +1,117 @@
# Component: Site Call Audit
## Purpose
Provides central, queryable audit and operational visibility for cached calls
made by site scripts — `ExternalSystem.CachedCall()` and `Database.CachedWrite()`.
Each such call carries a `TrackedOperationId`; sites report lifecycle telemetry
to this component, which maintains a central audit record, computes KPIs, and
relays Retry/Discard actions back to the owning site.

This is the second centrally-hosted observability component for site
store-and-forward activity (the Notification Outbox is the first). Unlike the
Notification Outbox, Site Call Audit is **not a dispatcher** — it never delivers
anything. Cached calls are delivered by the site's Store-and-Forward Engine
against site-local external systems and databases, which central cannot reach.
## Location
Central cluster only. Runs as a singleton actor (`SiteCallAuditActor`) on the
active central node and is registered as component #22 in the Host role
configuration.
## Responsibilities
- Ingest cached-call lifecycle telemetry from sites into the central `SiteCalls`
table.
- Run periodic per-site reconciliation pulls so missed telemetry self-heals.
- Compute point-in-time KPIs (global and per-site) from the `SiteCalls` table.
- Relay operator Retry/Discard actions for parked cached calls to the owning
site over the command/control channel.
- Purge terminal audit rows after a configurable retention window.
## The `SiteCalls` Table
Lives in the central MS SQL configuration database — a sibling of the
`Notifications` table. One row per `TrackedOperationId`:
- **TrackedOperationId** — GUID, primary key. Generated site-side at call time.
- **SourceSite** — site that issued the call.
- **Kind** — `ExternalCall` or `DatabaseWrite`.
- **TargetSummary** — external system + method name, or database connection name.
- **Status** — `Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`.
- **RetryCount** — attempts so far.
- **LastError** — most recent error detail, if any.
- **Provenance** — source instance / script.
- **CreatedAtUtc**, **UpdatedAtUtc**, **TerminalAtUtc** — key timestamps.
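
As a rough illustration of the row shape listed above, the columns could be modelled as follows. This is a minimal sketch in Python, not the actual schema: the Python types, the snake_case field names, and the sample values are assumptions; only the column names and the enumerated `Kind` and `Status` values come from this document.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class Kind(Enum):
    EXTERNAL_CALL = "ExternalCall"
    DATABASE_WRITE = "DatabaseWrite"


class Status(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    FAILED = "Failed"
    DISCARDED = "Discarded"


@dataclass
class SiteCallRow:
    tracked_operation_id: UUID  # primary key, generated site-side at call time
    source_site: str
    kind: Kind
    target_summary: str
    status: Status
    retry_count: int
    provenance: str
    created_at_utc: datetime
    updated_at_utc: datetime
    last_error: Optional[str] = None  # assumed nullable: "if any"
    terminal_at_utc: Optional[datetime] = None  # assumed unset until terminal


# Hypothetical sample row for a freshly issued cached call.
now = datetime.now(timezone.utc)
row = SiteCallRow(
    tracked_operation_id=uuid4(),
    source_site="site-a",
    kind=Kind.EXTERNAL_CALL,
    target_summary="ErpSystem.PostOrder",
    status=Status.PENDING,
    retry_count=0,
    provenance="instance-1/script-7",
    created_at_utc=now,
    updated_at_utc=now,
)
```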
## Status Lifecycle
`Pending → Retrying → Delivered / Parked / Failed / Discarded`
- **Delivered** — succeeded. A cached call that succeeds on its first immediate
attempt is recorded directly as `Delivered`.
- **Parked** — transient retries exhausted; awaiting manual action.
- **Failed** — permanent failure (e.g. HTTP 4xx). The error was also returned
synchronously to the calling script; the record captures it.
- **Discarded** — an operator discarded a parked operation.

The site is the source of truth. The `SiteCalls` row is an eventually-consistent
mirror — never queried by scripts (`Tracking.Status()` is answered site-locally).
## Ingest & Idempotency
Telemetry ingestion is **insert-if-not-exists** keyed on `TrackedOperationId`,
then **upsert-on-newer-status**. The lifecycle is monotonic, so status only
advances and never regresses; duplicate (at-least-once) and out-of-order
telemetry are therefore harmless.
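
The ingest rule can be sketched with an in-memory dictionary standing in for the `SiteCalls` table. How "newer" is decided is not specified here, so the sketch assumes an ordering: a hypothetical `STATUS_RANK` map, with `retry_count` as a tie-breaker. The `ingest` function and the event dictionary shape are likewise illustrative.

```python
# Assumed ordering: a higher rank means "further along" in the lifecycle.
STATUS_RANK = {
    "Pending": 0,
    "Retrying": 1,
    "Parked": 2,
    "Delivered": 3,
    "Failed": 3,
    "Discarded": 3,
}


def _order(event: dict) -> tuple:
    # Assumption: within the same status, a higher retry count is newer.
    return (STATUS_RANK[event["status"]], event["retry_count"])


def ingest(table: dict, event: dict) -> bool:
    """Apply one telemetry event; return True if the table changed."""
    op_id = event["tracked_operation_id"]
    current = table.get(op_id)
    if current is None:
        table[op_id] = dict(event)  # insert-if-not-exists
        return True
    if _order(event) > _order(current):
        current.update(event)  # upsert-on-newer-status
        return True
    return False  # duplicate or stale event: ignored


table = {}
ingest(table, {"tracked_operation_id": "op-1", "status": "Retrying", "retry_count": 1})
ingest(table, {"tracked_operation_id": "op-1", "status": "Delivered", "retry_count": 2})
# A late duplicate of the earlier event arrives and is ignored.
ingest(table, {"tracked_operation_id": "op-1", "status": "Retrying", "retry_count": 1})
```

Because stale events are dropped rather than rejected, a sender can redeliver freely without coordinating with central.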
## Reconciliation
Because telemetry is best-effort, `SiteCallAuditActor` periodically — and on site
reconnect — pulls "all tracking rows changed since cursor X" from each site.
Gaps left by lost telemetry self-heal. Central converges to the site; the site
never depends on central.
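
A cursor-based pull along these lines might look like the sketch below. The site query contract is not defined in this document, so everything here is hypothetical: `FakeSite`, its integer change-log versions, `changed_since`, and `reconcile_site` are illustrative stand-ins.

```python
from typing import Callable, Dict, List, Tuple


class FakeSite:
    """Hypothetical site with an append-only change log of tracking rows."""

    def __init__(self) -> None:
        self.log: List[Tuple[int, dict]] = []  # (version, row)

    def record(self, row: dict) -> None:
        self.log.append((len(self.log) + 1, dict(row)))

    def changed_since(self, cursor: int) -> Tuple[List[dict], int]:
        rows = [row for version, row in self.log if version > cursor]
        next_cursor = self.log[-1][0] if self.log else cursor
        return rows, next_cursor


def apply_site_row(table: dict, row: dict) -> bool:
    """Central converges to the site: adopt the site's row if it differs."""
    op_id = row["tracked_operation_id"]
    if table.get(op_id) == row:
        return False
    table[op_id] = dict(row)
    return True


def reconcile_site(
    table: dict,
    cursors: Dict[str, int],
    site_name: str,
    pull: Callable[[int], Tuple[List[dict], int]],
) -> int:
    """Pull rows changed since the stored cursor; return how many changed."""
    rows, next_cursor = pull(cursors.get(site_name, 0))
    changed = sum(1 for row in rows if apply_site_row(table, row))
    cursors[site_name] = next_cursor  # advance only after rows are applied
    return changed
```

Running `reconcile_site` on a timer and again on site reconnect would cover both triggers described above.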
## Retry / Discard Relay
Parked cached calls live in the owning site's S&F buffer. Operator Retry/Discard
from the Central UI is relayed to that site as a `RetryParkedOperation` /
`DiscardParkedOperation` command over the command/control channel. The site
applies the change and emits telemetry reflecting the new state; central never
mutates the `SiteCalls` row directly. If the site is offline, the command fails
fast and the UI surfaces a "site unreachable" message.
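
The relay behaviour can be sketched as below. Only the command names `RetryParkedOperation` and `DiscardParkedOperation` come from this document; `SiteLink`, `SiteUnreachable`, `ParkedOperationCommand`, and `relay` are hypothetical stand-ins for the command/control channel. Note that the sketch never touches the `SiteCalls` table.

```python
from dataclasses import dataclass
from typing import List


class SiteUnreachable(Exception):
    """Raised when the owning site is offline (surfaced by the UI)."""


@dataclass(frozen=True)
class ParkedOperationCommand:
    name: str  # "RetryParkedOperation" or "DiscardParkedOperation"
    tracked_operation_id: str


class SiteLink:
    """Hypothetical stand-in for the command/control channel to one site."""

    def __init__(self, connected: bool) -> None:
        self.connected = connected
        self.sent: List[ParkedOperationCommand] = []

    def send(self, command: ParkedOperationCommand) -> None:
        self.sent.append(command)


COMMAND_NAMES = {
    "retry": "RetryParkedOperation",
    "discard": "DiscardParkedOperation",
}


def relay(action: str, tracked_operation_id: str, link: SiteLink) -> ParkedOperationCommand:
    """Forward an operator action to the owning site, failing fast if offline."""
    command = ParkedOperationCommand(COMMAND_NAMES[action], tracked_operation_id)
    if not link.connected:
        raise SiteUnreachable("site unreachable")
    link.send(command)
    # The row changes only later, when the site's own telemetry arrives.
    return command
```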
## KPIs
Point-in-time, computed from the `SiteCalls` table, global and per-source-site,
mirroring the Notification Outbox KPI shape:
- Buffered count (`Pending` + `Retrying`)
- Parked count
- Failed-last-interval
- Delivered-last-interval
- Oldest-pending age
- Stuck count — `Pending`/`Retrying` older than a configurable threshold
(default 10 minutes); display-only, no escalation.
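
The KPI list above could be computed from a set of rows roughly as follows. This is a sketch under stated assumptions: ages are measured from `CreatedAtUtc`, the "last interval" counts use `TerminalAtUtc`, the oldest-pending age spans both buffered statuses, and the interval length is supplied by the caller, since no default is given here. `compute_kpis` and the row dictionaries are illustrative.

```python
from datetime import datetime, timedelta
from typing import Iterable

BUFFERED = {"Pending", "Retrying"}


def compute_kpis(
    rows: Iterable[dict],
    now: datetime,
    interval: timedelta,
    stuck_after: timedelta = timedelta(minutes=10),  # default from this document
) -> dict:
    rows = list(rows)
    buffered = [r for r in rows if r["status"] in BUFFERED]
    # Assumption: age is measured from CreatedAtUtc.
    ages = [now - r["created_at_utc"] for r in buffered]

    def in_last_interval(status: str) -> int:
        # Assumption: the interval is measured against TerminalAtUtc.
        return sum(
            1
            for r in rows
            if r["status"] == status
            and r.get("terminal_at_utc") is not None
            and now - r["terminal_at_utc"] <= interval
        )

    return {
        "buffered": len(buffered),
        "parked": sum(1 for r in rows if r["status"] == "Parked"),
        "failed_last_interval": in_last_interval("Failed"),
        "delivered_last_interval": in_last_interval("Delivered"),
        "oldest_pending_age": max(ages, default=None),
        "stuck": sum(1 for age in ages if age > stuck_after),
    }
```

Per-site figures follow by passing only the rows whose `SourceSite` matches; the global figures use all rows.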
## Retention
Daily purge of terminal rows (`Delivered`, `Failed`, `Discarded`) after a
configurable window (default 365 days), matching the `Notifications` purge.
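
The purge rule can be sketched as a function over an in-memory table. It assumes the retention window is measured from `TerminalAtUtc`, which this document does not state explicitly; `purge` and the row dictionaries are illustrative.

```python
from datetime import datetime, timedelta

TERMINAL = {"Delivered", "Failed", "Discarded"}


def purge(
    table: dict,
    now: datetime,
    retention: timedelta = timedelta(days=365),  # default from this document
) -> int:
    """Delete terminal rows older than the retention window; return the count."""
    cutoff = now - retention
    expired = [
        op_id
        for op_id, row in table.items()
        if row["status"] in TERMINAL
        and row.get("terminal_at_utc") is not None
        and row["terminal_at_utc"] < cutoff  # assumption: measured from TerminalAtUtc
    ]
    for op_id in expired:
        del table[op_id]
    return len(expired)
```

Non-terminal rows such as `Parked` are never purged by this rule, however old they are.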
## Dependencies
- **Configuration Database**: hosts the `SiteCalls` table and its repository.
- **Central-Site Communication**: receives cached-call telemetry and reconciliation
responses; sends Retry/Discard commands.
- **Store-and-Forward Engine**: the site-side origin of cached-call telemetry and
the executor of relayed Retry/Discard commands.
- **Commons**: `TrackedOperationId`, status enum, telemetry message contracts.
## Interactions
- **Central UI**: the Site Calls page queries this component and issues
Retry/Discard actions.
- **Health Monitoring**: surfaces Site Call Audit KPI tiles on the dashboard.
- **Cluster Infrastructure**: hosts the `SiteCallAuditActor` singleton with
active/standby failover.