Component: Site Call Audit
Purpose
Provides central, queryable audit and operational visibility for cached calls
made by site scripts — ExternalSystem.CachedCall() and Database.CachedWrite().
Each such call carries a TrackedOperationId; sites report lifecycle telemetry
to this component, which maintains a central audit record, computes KPIs, and
relays Retry/Discard actions back to the owning site.
This is the second centrally-hosted observability component for site store-and-forward activity (the Notification Outbox is the first). Unlike the Notification Outbox, Site Call Audit is not a dispatcher — it never delivers anything. Cached calls are delivered by the site's Store-and-Forward Engine against site-local external systems and databases, which central cannot reach.
Location
Central cluster only. A singleton actor (SiteCallAuditActor) on the active
central node. Registered as component #22 in the Host role configuration.
Responsibilities
- Ingest cached-call lifecycle telemetry from sites into the central SiteCalls table.
- Run periodic per-site reconciliation pulls so missed telemetry self-heals.
- Compute point-in-time KPIs (global and per-site) from the SiteCalls table.
- Relay operator Retry/Discard actions for parked cached calls to the owning site over the command/control channel.
- Purge terminal audit rows after a configurable retention window.
The SiteCalls Table
Lives in the central MS SQL configuration database — a sibling of the
Notifications table. One row per TrackedOperationId:
- TrackedOperationId: GUID, primary key. Generated site-side at call time.
- SourceSite: site that issued the call.
- Kind: TrackedOperationKind enum, either ExternalCall or DatabaseWrite.
- TargetSummary: external system + method name for an ExternalCall; for a DatabaseWrite, only the database connection name. The SQL statement and table are deliberately excluded as a scoping choice.
- Status: Pending, Retrying, Delivered, Parked, Failed, Discarded.
- RetryCount: attempts so far.
- LastError: most recent error detail, if any.
- Provenance: source instance / script.
- CreatedAtUtc, UpdatedAtUtc, TerminalAtUtc: key timestamps.
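The column list above can be pictured as a simple record type. This is a minimal sketch only: the Python field types, the defaults, and the sample values (`site-a`, `HistorianDb`) are assumptions for illustration, since the source specifies only that the key is a GUID and that Kind is a two-valued enum.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class TrackedOperationKind(Enum):
    EXTERNAL_CALL = "ExternalCall"
    DATABASE_WRITE = "DatabaseWrite"


@dataclass
class SiteCallRow:
    tracked_operation_id: UUID                  # primary key, generated site-side
    source_site: str
    kind: TrackedOperationKind
    target_summary: str                         # system + method, or connection name
    status: str                                 # one of the six status names
    retry_count: int = 0
    last_error: Optional[str] = None
    provenance: str = ""
    created_at_utc: Optional[datetime] = None
    updated_at_utc: Optional[datetime] = None
    terminal_at_utc: Optional[datetime] = None  # assumed empty until a terminal status


# Hypothetical example row for a cached database write.
row = SiteCallRow(
    tracked_operation_id=uuid4(),
    source_site="site-a",                       # hypothetical site name
    kind=TrackedOperationKind.DATABASE_WRITE,
    target_summary="HistorianDb",               # connection name only, never the SQL
    status="Pending",
    created_at_utc=datetime.now(timezone.utc),
)
```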
Status Lifecycle
Pending → Retrying → Delivered / Parked / Failed / Discarded
- Pending: non-terminal. Buffered after a transient failure, awaiting its first retry.
- Retrying: non-terminal. Undergoing retry attempts.
- Delivered: terminal, success. A cached call that succeeds on its first immediate attempt is recorded directly as Delivered.
- Parked: non-terminal. Transient retries exhausted; awaiting manual action.
- Failed: terminal. Permanent failure (e.g. HTTP 4xx). The error was also returned synchronously to the calling script; the record captures it. Failed rows are not operator-actionable (see Retry / Discard Relay).
- Discarded: terminal, reached only by operator action on a Parked row. The row is kept (not deleted) so the table remains a complete audit record.
The site is the source of truth. The SiteCalls row is an eventually-consistent
mirror — never queried by scripts (Tracking.Status() is answered site-locally).
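The terminal and operator-actionable distinctions in this section can be captured in a few lines. The sketch below is illustrative: the enum name `SiteCallStatus` and both helper functions are hypothetical, and only the classification of each status comes from the text.

```python
from enum import Enum


class SiteCallStatus(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    FAILED = "Failed"
    DISCARDED = "Discarded"


# Terminal statuses as listed in this section; Parked is deliberately absent.
TERMINAL = frozenset({
    SiteCallStatus.DELIVERED,
    SiteCallStatus.FAILED,
    SiteCallStatus.DISCARDED,
})


def is_terminal(status: SiteCallStatus) -> bool:
    """True when no further lifecycle change is expected for the row."""
    return status in TERMINAL


def is_operator_actionable(status: SiteCallStatus) -> bool:
    """Only Parked rows accept Retry or Discard."""
    return status is SiteCallStatus.PARKED
```

Keeping Parked out of the terminal set matters for retention: a parked row is still waiting for an operator.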
Ingest & Idempotency
Telemetry ingestion is insert-if-not-exists keyed on TrackedOperationId,
then upsert-on-newer-status. The lifecycle is monotonic, so status only
advances and never regresses; at-least-once and out-of-order telemetry are
therefore harmless.
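A minimal in-memory sketch of insert-if-not-exists followed by upsert-on-newer-status is shown below. The numeric ranking in `STATUS_RANK`, the use of the retry count as a tie-breaker, and the `Telemetry` and `ingest` names are assumptions made for illustration; the source does not spell out how "newer" is decided, nor how an operator Retry of a Parked row is ordered.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Assumed ordering: higher rank means further along the lifecycle.
STATUS_RANK = {
    "Pending": 0,
    "Retrying": 1,
    "Parked": 2,
    "Delivered": 3,
    "Failed": 3,
    "Discarded": 3,
}


@dataclass
class Telemetry:
    tracked_operation_id: str
    status: str
    retry_count: int = 0
    last_error: Optional[str] = None


def _progress(t: Telemetry) -> Tuple[int, int]:
    # Compare by status rank first, then by retry count within the same status.
    return (STATUS_RANK[t.status], t.retry_count)


def ingest(table: Dict[str, Telemetry], incoming: Telemetry) -> bool:
    """Apply one telemetry message; return True if the table changed."""
    current = table.get(incoming.tracked_operation_id)
    if current is None:
        table[incoming.tracked_operation_id] = incoming   # insert-if-not-exists
        return True
    if _progress(incoming) > _progress(current):
        table[incoming.tracked_operation_id] = incoming   # upsert-on-newer-status
        return True
    return False                                          # duplicate or stale: ignored
```

With this shape, a late "Retrying" message arriving after "Delivered" is dropped, and a redelivered "Delivered" message is a no-op.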
From v1.x onward, the CachedCallTelemetry message additively carries the
AuditEvent content alongside the existing operational fields. Central's
AuditLogIngestActor (Audit Log #23) performs both the immutable AuditLog
insert and the SiteCalls upsert in a single transaction. Idempotency keys
remain EventId (for AuditLog) and TrackedOperationId (for SiteCalls).
See Component-AuditLog.md, Cached Operations —
Combined Telemetry, for the dual-write contract.
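The single-transaction dual write might look roughly like the following. This is a sketch under stated assumptions: SQLite stands in for the central MS SQL database, the two tables are reduced to a handful of columns, and `open_db`, `ingest_combined`, and the status ranking are hypothetical names and choices rather than the actual contract in Component-AuditLog.md.

```python
import sqlite3

# Assumed ordering used to decide whether an incoming status is newer.
STATUS_RANK = {"Pending": 0, "Retrying": 1, "Parked": 2,
               "Delivered": 3, "Failed": 3, "Discarded": 3}


def open_db() -> sqlite3.Connection:
    """Create an in-memory stand-in with simplified versions of both tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE AuditLog ("
                 "EventId TEXT PRIMARY KEY, TrackedOperationId TEXT, Detail TEXT)")
    conn.execute("CREATE TABLE SiteCalls ("
                 "TrackedOperationId TEXT PRIMARY KEY, Status TEXT)")
    conn.commit()
    return conn


def ingest_combined(conn: sqlite3.Connection, event_id: str,
                    tracked_operation_id: str, status: str, detail: str) -> None:
    # "with conn" commits both writes together, or rolls both back on error.
    with conn:
        # Immutable audit insert, idempotent on EventId.
        conn.execute("INSERT OR IGNORE INTO AuditLog VALUES (?, ?, ?)",
                     (event_id, tracked_operation_id, detail))
        # Operational upsert, idempotent on TrackedOperationId.
        row = conn.execute(
            "SELECT Status FROM SiteCalls WHERE TrackedOperationId = ?",
            (tracked_operation_id,)).fetchone()
        if row is None:
            conn.execute("INSERT INTO SiteCalls VALUES (?, ?)",
                         (tracked_operation_id, status))
        elif STATUS_RANK[status] > STATUS_RANK[row[0]]:
            conn.execute(
                "UPDATE SiteCalls SET Status = ? WHERE TrackedOperationId = ?",
                (status, tracked_operation_id))
```

Replaying the same message leaves one AuditLog row per EventId and one SiteCalls row per TrackedOperationId.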
Reconciliation
Because telemetry is best-effort, SiteCallAuditActor periodically — and on site
reconnect — pulls "all tracking rows changed since cursor X" from each site.
Gaps left by lost telemetry self-heal. Central converges to the site; the site
never depends on central.
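One way to picture a cursor-based reconciliation pull is sketched below. The integer `change_seq` cursor, the `TrackingRow` shape, and the `reconcile_site` function are assumptions; the source says only that central asks each site for rows changed since a cursor.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class TrackingRow:
    tracked_operation_id: str
    status: str
    change_seq: int   # hypothetical per-site counter that grows with every change


def reconcile_site(
    pull_changed_since: Callable[[int], List[TrackingRow]],
    apply_row: Callable[[TrackingRow], None],
    cursor: int,
) -> int:
    """Pull rows changed after `cursor`, apply them, and return the new cursor."""
    for row in pull_changed_since(cursor):
        apply_row(row)                       # should reuse the idempotent ingest path
        cursor = max(cursor, row.change_seq)
    return cursor
```

Because the applied rows go through the same idempotent ingest as live telemetry, pulling a row that central already holds changes nothing.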
Retry / Discard Relay
Parked cached calls live in the owning site's S&F buffer. Operator Retry/Discard
from the Central UI is relayed to that site as a RetryParkedOperation /
DiscardParkedOperation command over the command/control channel. The site
applies the change and emits telemetry reflecting the new state; central never
mutates the SiteCalls row directly. If the site is offline the command fails
fast and the UI surfaces a "site unreachable" message.
Only Parked rows are operator-actionable. Failed rows offer no Retry or
Discard: a permanent failure (e.g. HTTP 4xx) would simply fail again, and the
error was already returned synchronously to the calling script — there is
nothing for an operator to recover.
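The relay rules above can be sketched as a small guard function. Everything except the two command names is hypothetical here: the `ParkedOperationCommand` type, the `relay_action` signature, the `send` callback standing in for the command/control channel, and the returned strings.

```python
from dataclasses import dataclass
from typing import Callable

COMMAND_NAMES = {
    "retry": "RetryParkedOperation",
    "discard": "DiscardParkedOperation",
}


@dataclass(frozen=True)
class ParkedOperationCommand:
    name: str
    tracked_operation_id: str


def relay_action(action: str, row_status: str, tracked_operation_id: str,
                 site_online: bool,
                 send: Callable[[ParkedOperationCommand], None]) -> str:
    """Relay an operator action to the owning site; never touch the central row."""
    if row_status != "Parked":
        return "rejected: only Parked rows are actionable"
    if not site_online:
        return "site unreachable"            # fail fast, nothing is queued
    send(ParkedOperationCommand(COMMAND_NAMES[action], tracked_operation_id))
    # The central row changes only later, when the site emits telemetry.
    return "relayed"
```

Note that the function returns without updating any central state, which mirrors the rule that the site stays the source of truth.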
KPIs
Point-in-time, computed from the SiteCalls table, global and per-source-site,
mirroring the Notification Outbox KPI shape:
- Buffered count (Pending + Retrying)
- Parked count
- Failed-last-interval
- Delivered-last-interval
- Oldest-pending age
- Stuck count: Pending/Retrying rows older than a configurable threshold (default 10 minutes); display-only, no escalation.
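A rough sketch of how these KPIs could be derived from a snapshot of rows follows. Several choices are assumptions not stated in the source: the one-hour default for the "last interval", measuring age from CreatedAtUtc, counting interval KPIs by TerminalAtUtc, and restricting oldest-pending age to rows in the Pending status. The `Row` and `compute_kpis` names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass
class Row:
    source_site: str
    status: str
    created_at_utc: datetime
    terminal_at_utc: Optional[datetime] = None


def compute_kpis(rows: Iterable[Row], now: datetime, site: Optional[str] = None,
                 interval: timedelta = timedelta(hours=1),
                 stuck_after: timedelta = timedelta(minutes=10)) -> dict:
    """Point-in-time KPIs, global when `site` is None, otherwise per source site."""
    scoped = [r for r in rows if site is None or r.source_site == site]
    buffered = [r for r in scoped if r.status in ("Pending", "Retrying")]
    pending = [r for r in scoped if r.status == "Pending"]
    since = now - interval

    def recent(status: str) -> int:
        # Rows that reached `status` within the last interval.
        return sum(1 for r in scoped
                   if r.status == status
                   and r.terminal_at_utc is not None
                   and r.terminal_at_utc >= since)

    return {
        "buffered": len(buffered),
        "parked": sum(1 for r in scoped if r.status == "Parked"),
        "failed_last_interval": recent("Failed"),
        "delivered_last_interval": recent("Delivered"),
        "oldest_pending_age": max((now - r.created_at_utc for r in pending),
                                  default=None),
        "stuck": sum(1 for r in buffered if now - r.created_at_utc > stuck_after),
    }
```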
Retention
Daily purge of terminal rows (Delivered, Failed, Discarded) after a
configurable window (default 365 days), matching the Notifications purge.
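The purge rule can be sketched as a filter over rows. Measuring the window from TerminalAtUtc is an assumption (the source gives only the statuses and the default window), and the dictionary row shape and `purge_terminal` name are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List

TERMINAL = {"Delivered", "Failed", "Discarded"}


def purge_terminal(rows: List[Dict], now: datetime,
                   retention: timedelta = timedelta(days=365)) -> List[Dict]:
    """Return the rows that survive a purge run at time `now`."""
    cutoff = now - retention
    return [
        r for r in rows
        if not (
            r["status"] in TERMINAL
            and r["terminal_at_utc"] is not None
            and r["terminal_at_utc"] < cutoff
        )
    ]
```

Non-terminal rows such as Parked are never purged, however old they are, because they still await operator action.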
Dependencies
- Configuration Database: hosts the SiteCalls table and its repository.
- Central–Site Communication: receives cached-call telemetry and reconciliation responses; sends Retry/Discard commands.
- Store-and-Forward Engine: the site-side origin of cached-call telemetry and the executor of relayed Retry/Discard commands.
- Audit Log (#23): shares the CachedCallTelemetry packet. Each lifecycle transition (CachedEnqueued, CachedAttempt, CachedTerminal) carries an AuditEvent alongside the operational fields, and central's AuditLogIngestActor performs the AuditLog insert and the SiteCalls upsert in a single transaction (see Component-AuditLog.md, Cached Operations — Combined Telemetry).
- Commons: TrackedOperationId, status enum, telemetry message contracts.
Interactions
- Central UI: the Site Calls page queries this component and issues Retry/Discard actions.
- Health Monitoring: surfaces Site Call Audit KPI tiles on the dashboard.
- Cluster Infrastructure: hosts the SiteCallAuditActor singleton with active/standby failover.