diff --git a/docs/requirements/Component-SiteCallAudit.md b/docs/requirements/Component-SiteCallAudit.md
new file mode 100644
index 0000000..33b5d8b
--- /dev/null
+++ b/docs/requirements/Component-SiteCallAudit.md
@@ -0,0 +1,117 @@
+# Component: Site Call Audit
+
+## Purpose
+
+Provides central, queryable audit and operational visibility for cached calls
+made by site scripts — `ExternalSystem.CachedCall()` and `Database.CachedWrite()`.
+Each such call carries a `TrackedOperationId`; sites report lifecycle telemetry
+to this component, which maintains a central audit record, computes KPIs, and
+relays Retry/Discard actions back to the owning site.
+
+This is the second centrally-hosted observability component for site
+store-and-forward activity (the Notification Outbox is the first). Unlike the
+Notification Outbox, Site Call Audit is **not a dispatcher** — it never delivers
+anything. Cached calls are delivered by the site's Store-and-Forward Engine
+against site-local external systems and databases, which central cannot reach.
+
+## Location
+
+Central cluster only. A singleton actor (`SiteCallAuditActor`) on the active
+central node. Registered as component #22 in the Host role configuration.
+
+## Responsibilities
+
+- Ingest cached-call lifecycle telemetry from sites into the central `SiteCalls`
+  table.
+- Run periodic per-site reconciliation pulls so missed telemetry self-heals.
+- Compute point-in-time KPIs (global and per-site) from the `SiteCalls` table.
+- Relay operator Retry/Discard actions for parked cached calls to the owning
+  site over the command/control channel.
+- Purge terminal audit rows after a configurable retention window.
+
+## The `SiteCalls` Table
+
+Lives in the central MS SQL configuration database — a sibling of the
+`Notifications` table. One row per `TrackedOperationId`:
+
+- **TrackedOperationId** — GUID, primary key. Generated site-side at call time.
+- **SourceSite** — site that issued the call.
+- **Kind** — `ExternalCall` or `DatabaseWrite`.
+- **TargetSummary** — external system + method name, or database connection name.
+- **Status** — `Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`.
+- **RetryCount** — attempts so far.
+- **LastError** — most recent error detail, if any.
+- **Provenance** — source instance / script.
+- **CreatedAtUtc**, **UpdatedAtUtc**, **TerminalAtUtc** — key timestamps.
+
+## Status Lifecycle
+
+`Pending → Retrying → Delivered / Parked / Failed / Discarded`
+
+- **Delivered** — succeeded. A cached call that succeeds on its first immediate
+  attempt is recorded directly as `Delivered`.
+- **Parked** — transient retries exhausted; awaiting manual action.
+- **Failed** — permanent failure (e.g. HTTP 4xx). The error was also returned
+  synchronously to the calling script; the record captures it.
+- **Discarded** — an operator discarded a parked operation.
+
+The site is the source of truth. The `SiteCalls` row is an eventually-consistent
+mirror — never queried by scripts (`Tracking.Status()` is answered site-locally).
+
+## Ingest & Idempotency
+
+Telemetry ingestion is **insert-if-not-exists** keyed on `TrackedOperationId`,
+then **upsert-on-newer-status**. The lifecycle is monotonic, so status only
+advances and never regresses; at-least-once and out-of-order telemetry are
+therefore harmless.
+
+## Reconciliation
+
+Because telemetry is best-effort, `SiteCallAuditActor` periodically — and on site
+reconnect — pulls "all tracking rows changed since cursor X" from each site.
+Gaps left by lost telemetry self-heal. Central converges to the site; the site
+never depends on central.
+
+## Retry / Discard Relay
+
+Parked cached calls live in the owning site's S&F buffer. Operator Retry/Discard
+from the Central UI is relayed to that site as a `RetryParkedOperation` /
+`DiscardParkedOperation` command over the command/control channel.
+The site applies the change and emits telemetry reflecting the new state;
+central never mutates the `SiteCalls` row directly. If the site is offline the
+command fails fast and the UI surfaces a "site unreachable" message.
+
+## KPIs
+
+Point-in-time, computed from the `SiteCalls` table, global and per-source-site,
+mirroring the Notification Outbox KPI shape:
+
+- Buffered count (`Pending` + `Retrying`)
+- Parked count
+- Failed-last-interval
+- Delivered-last-interval
+- Oldest-pending age
+- Stuck count — `Pending`/`Retrying` older than a configurable threshold
+  (default 10 minutes); display-only, no escalation.
+
+## Retention
+
+Daily purge of terminal rows (`Delivered`, `Failed`, `Discarded`) after a
+configurable window (default 365 days), matching the `Notifications` purge.
+
+## Dependencies
+
+- **Configuration Database**: hosts the `SiteCalls` table and its repository.
+- **Central–Site Communication**: receives cached-call telemetry and reconciliation
+  responses; sends Retry/Discard commands.
+- **Store-and-Forward Engine**: the site-side origin of cached-call telemetry and
+  the executor of relayed Retry/Discard commands.
+- **Commons**: `TrackedOperationId`, status enum, telemetry message contracts.
+
+## Interactions
+
+- **Central UI**: the Site Calls page queries this component and issues
+  Retry/Discard actions.
+- **Health Monitoring**: surfaces Site Call Audit KPI tiles on the dashboard.
+- **Cluster Infrastructure**: hosts the `SiteCallAuditActor` singleton with
+  active/standby failover.
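
Illustrative note on the "Ingest & Idempotency" section above: the rule "insert-if-not-exists, then upsert-on-newer-status" can be sketched in a few lines. This is a minimal in-memory sketch, not the component's implementation. The `STATUS_RANK` ordering, the `SiteCallRow` dataclass, and the `ingest` function are hypothetical names introduced here; the document only states that the lifecycle is monotonic, so the numeric ranks are an assumption.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional

# Assumed ordering (not given in the document): a higher rank means the
# operation is further along its lifecycle. Terminal statuses share the top rank.
STATUS_RANK = {
    "Pending": 0,
    "Retrying": 1,
    "Parked": 2,
    "Delivered": 3,
    "Failed": 3,
    "Discarded": 3,
}


@dataclass(frozen=True)
class SiteCallRow:
    """Hypothetical subset of the SiteCalls columns used by this sketch."""
    tracked_operation_id: str
    source_site: str
    status: str
    retry_count: int = 0
    last_error: Optional[str] = None


def ingest(table: Dict[str, SiteCallRow], event: SiteCallRow) -> bool:
    """Apply one telemetry event to the table; return True if a row changed."""
    current = table.get(event.tracked_operation_id)
    if current is None:
        # Insert-if-not-exists: the first event seen for this id creates the row.
        table[event.tracked_operation_id] = replace(event)
        return True
    if STATUS_RANK[event.status] > STATUS_RANK[current.status]:
        # Upsert-on-newer-status: only a strictly later status overwrites the row.
        table[event.tracked_operation_id] = replace(event)
        return True
    # Duplicate or stale (out-of-order) telemetry is dropped.
    return False


if __name__ == "__main__":
    rows: Dict[str, SiteCallRow] = {}
    # The Delivered event overtakes the earlier Retrying event in transit.
    ingest(rows, SiteCallRow("op-1", "site-a", "Delivered", retry_count=2))
    ingest(rows, SiteCallRow("op-1", "site-a", "Retrying", retry_count=1))
    print(rows["op-1"].status)
```

Under these assumed ranks, replaying the same event or delivering events out of order leaves the row at the most advanced status seen so far, which is the property the section relies on. How an operator Retry of a parked call should interact with a strictly advancing rank is not specified in the document, so the sketch does not model it.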