# Component: Site Call Audit

## Purpose

Provides central, queryable audit and operational visibility for cached calls
made by site scripts — `ExternalSystem.CachedCall()` and `Database.CachedWrite()`.
Each such call carries a `TrackedOperationId`; sites report lifecycle telemetry
to this component, which maintains a central audit record, computes KPIs, and
relays Retry/Discard actions back to the owning site.

This is the second centrally-hosted observability component for site
store-and-forward activity (the Notification Outbox is the first). Unlike the
Notification Outbox, Site Call Audit is **not a dispatcher** — it never delivers
anything. Cached calls are delivered by the site's Store-and-Forward Engine
against site-local external systems and databases, which central cannot reach.

## Location

Central cluster only. A singleton actor (`SiteCallAuditActor`) on the active
central node. Registered as component #22 in the Host role configuration.

## Responsibilities

- Ingest cached-call lifecycle telemetry from sites into the central `SiteCalls`
  table.
- Run periodic per-site reconciliation pulls so missed telemetry self-heals.
- Compute point-in-time KPIs (global and per-site) from the `SiteCalls` table.
- Relay operator Retry/Discard actions for parked cached calls to the owning
  site over the command/control channel.
- Purge terminal audit rows after a configurable retention window.

## The `SiteCalls` Table

Lives in the central MS SQL configuration database — a sibling of the
`Notifications` table. One row per `TrackedOperationId`:

- **TrackedOperationId** — GUID, primary key. Generated site-side at call time.
- **SourceSite** — site that issued the call.
- **SourceNode** — the cluster node on which the call was issued (`node-a` /
  `node-b`, qualified by `SourceSite`). Nullable. Stamped site-side at submit
  time and carried verbatim through the combined `CachedCallTelemetry` packet,
  reconciliation pulls, and the central upsert.
- **Kind** — `TrackedOperationKind` enum: `ExternalCall` or `DatabaseWrite`.
- **TargetSummary** — external system + method name for an `ExternalCall`; for a
  `DatabaseWrite`, just the database connection name. The SQL statement and
  table are deliberately left out as a scoping choice.
- **Status** — `Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`.
- **RetryCount** — attempts so far.
- **LastError** — most recent error detail, if any.
- **Provenance** — source instance / script.
- **CreatedAtUtc**, **UpdatedAtUtc**, **TerminalAtUtc** — key timestamps.
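
As a reading aid, the columns listed above can be sketched as a typed record. This is a minimal Python sketch, not the project's schema or entity class. The Python names, every field type beyond what the list states (GUID key, nullable `SourceNode`), and the choice to leave `TerminalAtUtc` empty for non-terminal rows are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class TrackedOperationKind(Enum):
    EXTERNAL_CALL = "ExternalCall"
    DATABASE_WRITE = "DatabaseWrite"


class SiteCallStatus(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    FAILED = "Failed"
    DISCARDED = "Discarded"


@dataclass
class SiteCallRow:
    tracked_operation_id: UUID          # primary key, generated site-side
    source_site: str                    # site that issued the call
    source_node: Optional[str]          # "node-a" / "node-b"; nullable
    kind: TrackedOperationKind
    target_summary: str                 # never the SQL statement or table
    status: SiteCallStatus
    retry_count: int
    last_error: Optional[str]
    provenance: str                     # source instance / script
    created_at_utc: datetime
    updated_at_utc: datetime
    terminal_at_utc: Optional[datetime] = None  # assumed empty until terminal


# Hypothetical example values, for illustration only.
_now = datetime.now(timezone.utc)
example = SiteCallRow(
    tracked_operation_id=uuid4(),
    source_site="site-1",
    source_node="node-a",
    kind=TrackedOperationKind.DATABASE_WRITE,
    target_summary="HistorianDb",       # connection name only
    status=SiteCallStatus.PENDING,
    retry_count=0,
    last_error=None,
    provenance="instance-7/script-x",
    created_at_utc=_now,
    updated_at_utc=_now,
)
```

The site, node, connection, and provenance values in `example` are invented placeholders.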

## Status Lifecycle

`Pending → Retrying → Delivered / Parked / Failed / Discarded`

- **Pending** — non-terminal: buffered after a transient failure, awaiting its
  first retry.
- **Retrying** — non-terminal: undergoing retry attempts.
- **Delivered** — terminal, success. A cached call that succeeds on its first
  immediate attempt is recorded directly as `Delivered`.
- **Parked** — non-terminal: transient retries exhausted; awaiting manual action.
- **Failed** — terminal: permanent failure (e.g. HTTP 4xx). The error was also
  returned synchronously to the calling script; the record captures it. `Failed`
  rows are **not operator-actionable** — see Retry / Discard Relay.
- **Discarded** — terminal, reached **only by operator action** on a `Parked`
  row. The row is kept (not deleted) so the table remains a complete audit
  record.

The site is the source of truth. The `SiteCalls` row is an eventually-consistent
mirror — never queried by scripts (`Tracking.Status()` is answered site-locally).
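
The terminal and non-terminal groupings described in this section can be written down as a small classification helper. This is a sketch under assumptions: the enum member names, the set names, and the helper functions are illustrative and not taken from the project's Commons contracts.

```python
from enum import Enum


class SiteCallStatus(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    FAILED = "Failed"
    DISCARDED = "Discarded"


# Terminal statuses: no further lifecycle telemetry is expected.
TERMINAL = frozenset(
    {SiteCallStatus.DELIVERED, SiteCallStatus.FAILED, SiteCallStatus.DISCARDED}
)

# Non-terminal statuses still held in the site's store-and-forward buffer
# and still being attempted automatically.
BUFFERED = frozenset({SiteCallStatus.PENDING, SiteCallStatus.RETRYING})


def is_terminal(status: SiteCallStatus) -> bool:
    return status in TERMINAL


def is_buffered(status: SiteCallStatus) -> bool:
    return status in BUFFERED


def awaits_operator(status: SiteCallStatus) -> bool:
    # Parked is non-terminal but no longer retried automatically.
    return status is SiteCallStatus.PARKED
```

With this grouping, `Parked` sits in neither `TERMINAL` nor `BUFFERED`, which matches its description as non-terminal but waiting on manual action.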

## Ingest & Idempotency

Telemetry ingestion is **insert-if-not-exists** keyed on `TrackedOperationId`,
then **upsert-on-newer-status**. The lifecycle is monotonic, so status only
advances and never regresses; at-least-once and out-of-order telemetry are
therefore harmless.

From v1.x onward, the `CachedCallTelemetry` message additively carries the
`AuditEvent` content alongside the existing operational fields. Central's
`AuditLogIngestActor` (Audit Log #23) performs both the immutable `AuditLog`
insert and the `SiteCalls` upsert in a single transaction. Idempotency keys
remain `EventId` (for `AuditLog`) and `TrackedOperationId` (for `SiteCalls`).
See [Component-AuditLog.md](Component-AuditLog.md), Cached Operations —
Combined Telemetry, for the dual-write contract.
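
The insert-if-not-exists then upsert-on-newer-status rule from the start of this section can be illustrated with an in-memory table. This is a minimal sketch, not the actual ingest code. The numeric `STATUS_RANK` ordering is an assumption made for the example: the source states only that the lifecycle is monotonic, and it does not say how an operator-initiated retry of a `Parked` row is ordered, so that case is not modeled here.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical ordering: all terminal statuses share the highest rank.
STATUS_RANK: Dict[str, int] = {
    "Pending": 0,
    "Retrying": 1,
    "Parked": 2,
    "Delivered": 3,
    "Failed": 3,
    "Discarded": 3,
}


@dataclass(frozen=True)
class CallTelemetry:
    tracked_operation_id: str
    status: str
    retry_count: int = 0


def ingest(site_calls: Dict[str, CallTelemetry], msg: CallTelemetry) -> bool:
    """Apply one telemetry message; return True if the table changed."""
    current = site_calls.get(msg.tracked_operation_id)
    if current is None:
        # Insert-if-not-exists, keyed on TrackedOperationId.
        site_calls[msg.tracked_operation_id] = msg
        return True
    if STATUS_RANK[msg.status] > STATUS_RANK[current.status]:
        # Upsert-on-newer-status: only a strictly later status wins.
        site_calls[msg.tracked_operation_id] = msg
        return True
    # Duplicate or stale message: ignored, so redelivery is harmless.
    return False
```

Feeding the same message twice, or delivering `Retrying` after `Delivered`, leaves the stored row unchanged, which is the property that makes at-least-once and out-of-order delivery safe.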

## Reconciliation

Because telemetry is best-effort, `SiteCallAuditActor` periodically — and on site
reconnect — pulls "all tracking rows changed since cursor X" from each site.
Gaps left by lost telemetry self-heal. Central converges to the site; the site
never depends on central.
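
The "changed since cursor X" pull can be sketched as follows. The source does not define the cursor, so this example assumes a hypothetical per-site, monotonically increasing `version` number on each tracking row. The function and type names are also illustrative. A real implementation would presumably route pulled rows through the same idempotent upsert as live telemetry; the sketch simply overwrites.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TrackingRow:
    tracked_operation_id: str
    status: str
    version: int  # hypothetical site-local change counter used as the cursor


def rows_changed_since(site_rows: List[TrackingRow], cursor: int) -> List[TrackingRow]:
    """Site side: return rows changed after the cursor, oldest change first."""
    changed = [r for r in site_rows if r.version > cursor]
    return sorted(changed, key=lambda r: r.version)


def reconcile(
    central: Dict[str, TrackingRow], site_rows: List[TrackingRow], cursor: int
) -> int:
    """Central side: apply the pulled rows and return the advanced cursor."""
    for row in rows_changed_since(site_rows, cursor):
        central[row.tracked_operation_id] = row  # central converges to the site
        cursor = row.version
    return cursor
```

Calling `reconcile` again with the returned cursor is a no-op until the site records further changes, so the pull can safely run both on a timer and on reconnect.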

## Retry / Discard Relay

Parked cached calls live in the owning site's S&F buffer. Operator Retry/Discard
from the Central UI is relayed to that site as a `RetryParkedOperation` /
`DiscardParkedOperation` command over the command/control channel. The site
applies the change and emits telemetry reflecting the new state; central never
mutates the `SiteCalls` row directly. If the site is offline the command fails
fast and the UI surfaces a "site unreachable" message.

Only `Parked` rows are operator-actionable. `Failed` rows offer no Retry or
Discard: a permanent failure (e.g. HTTP 4xx) would simply fail again, and the
error was already returned synchronously to the calling script — there is
nothing for an operator to recover.
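
The gating rules in this section (only `Parked` rows are actionable, fail fast when the site is offline, never mutate the central row) can be sketched as a single relay function. The command names come from the text above. The `RelayResult` enum, the `site_online` flag, and the `send` callback are assumptions standing in for the real command/control channel.

```python
from enum import Enum
from typing import Callable


class RelayResult(Enum):
    SENT = "sent"
    NOT_ACTIONABLE = "not actionable"
    SITE_UNREACHABLE = "site unreachable"


COMMANDS = {
    "Retry": "RetryParkedOperation",
    "Discard": "DiscardParkedOperation",
}


def relay_action(
    action: str,
    row_status: str,
    site_online: bool,
    send: Callable[[str], None],
) -> RelayResult:
    """Relay an operator action to the owning site without touching SiteCalls."""
    if action not in COMMANDS:
        raise ValueError(f"unknown action: {action}")
    if row_status != "Parked":
        # Failed, Delivered, Discarded and buffered rows offer no Retry/Discard.
        return RelayResult.NOT_ACTIONABLE
    if not site_online:
        # Fail fast; the UI would surface "site unreachable".
        return RelayResult.SITE_UNREACHABLE
    send(COMMANDS[action])
    # The new state arrives later as telemetry emitted by the site.
    return RelayResult.SENT
```

Note that the function returns without changing any row: the updated status is expected to arrive through the normal telemetry path once the site has applied the command.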

## KPIs

Point-in-time, computed from the `SiteCalls` table, global and per-source-site,
mirroring the Notification Outbox KPI shape:

- Buffered count (`Pending` + `Retrying`)
- Parked count
- Failed-last-interval
- Delivered-last-interval
- Oldest-pending age
- Stuck count — `Pending`/`Retrying` older than a configurable threshold
  (default 10 minutes); display-only, no escalation.
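
The KPI list above can be computed from a snapshot of rows roughly as follows. This is a sketch under several assumptions that the source does not settle: the interval length (one hour here), measuring ages from `CreatedAtUtc`, counting last-interval KPIs by `TerminalAtUtc`, and restricting "oldest-pending" to rows in `Pending`. Only the 10-minute stuck default comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Row:
    source_site: str
    status: str
    created_at_utc: datetime
    terminal_at_utc: Optional[datetime] = None


def compute_kpis(
    rows: List[Row],
    now: datetime,
    interval: timedelta = timedelta(hours=1),      # assumed interval length
    stuck_after: timedelta = timedelta(minutes=10),
    site: Optional[str] = None,                    # None means global
) -> dict:
    scoped = [r for r in rows if site is None or r.source_site == site]
    buffered = [r for r in scoped if r.status in ("Pending", "Retrying")]

    def in_last_interval(status: str) -> int:
        return sum(
            1
            for r in scoped
            if r.status == status
            and r.terminal_at_utc is not None
            and now - r.terminal_at_utc <= interval
        )

    oldest = min(
        (r.created_at_utc for r in scoped if r.status == "Pending"), default=None
    )
    return {
        "buffered": len(buffered),
        "parked": sum(1 for r in scoped if r.status == "Parked"),
        "failed_last_interval": in_last_interval("Failed"),
        "delivered_last_interval": in_last_interval("Delivered"),
        "oldest_pending_age": None if oldest is None else now - oldest,
        "stuck": sum(1 for r in buffered if now - r.created_at_utc > stuck_after),
    }
```

Passing `site="..."` gives the per-source-site view, and omitting it gives the global one, so both shapes come from the same function.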

## Retention

Daily purge of terminal rows (`Delivered`, `Failed`, `Discarded`) after a
configurable window (default 365 days), matching the `Notifications` purge.
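
The purge rule can be sketched as a filter over rows. The 365-day default and the set of terminal statuses come from the text. Measuring the window from `TerminalAtUtc` is an assumption, as are the dictionary row shape and the function name.

```python
from datetime import datetime, timedelta
from typing import Dict, List

TERMINAL = frozenset({"Delivered", "Failed", "Discarded"})


def rows_after_purge(
    rows: List[Dict],
    now: datetime,
    retention: timedelta = timedelta(days=365),
) -> List[Dict]:
    """Return the rows that survive a purge run at `now`."""
    cutoff = now - retention

    def expired(row: Dict) -> bool:
        terminal_at = row.get("terminal_at_utc")
        return (
            row["status"] in TERMINAL
            and terminal_at is not None
            and terminal_at < cutoff   # assumed: age measured from TerminalAtUtc
        )

    # Non-terminal rows (Pending, Retrying, Parked) are never purged.
    return [row for row in rows if not expired(row)]
```

Under this reading, a `Parked` row stays in the table however old it is, because it is not terminal until an operator discards it or a retry resolves it.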

## Dependencies

- **Configuration Database**: hosts the `SiteCalls` table and its repository.
- **Central–Site Communication**: receives cached-call telemetry and reconciliation
  responses; sends Retry/Discard commands.
- **Store-and-Forward Engine**: the site-side origin of cached-call telemetry and
  the executor of relayed Retry/Discard commands.
- **Audit Log (#23)**: shares the `CachedCallTelemetry` packet — each lifecycle
  transition (`CachedEnqueued`, `CachedAttempt`, `CachedTerminal`) carries an
  `AuditEvent` alongside the operational fields, and central's
  `AuditLogIngestActor` performs the `AuditLog` insert and the `SiteCalls`
  upsert in a single transaction (see [Component-AuditLog.md](Component-AuditLog.md),
  Cached Operations — Combined Telemetry).
- **Commons**: `TrackedOperationId`, status enum, telemetry message contracts.

## Interactions

- **Central UI**: the Site Calls page queries this component and issues
  Retry/Discard actions.
- **Health Monitoring**: surfaces Site Call Audit KPI tiles on the dashboard.
- **Cluster Infrastructure**: hosts the `SiteCallAuditActor` singleton with
  active/standby failover.