
Joseph Doherty e7ed858920 docs(plans): design cached-call tracking with trackable IDs

Add TrackedOperationId handles to CachedCall/CachedWrite under a unified tracking model. New site-local tracking table is the status source of truth; new central Site Call Audit component (#22) mirrors status via telemetry, exposes KPIs and a Site Calls UI page with central->site Retry/Discard.

2026-05-19 11:26:37 -04:00


Cached Call Tracking — Design

Date: 2026-05-19
Status: Approved
Topic: Trackable IDs for cached external system calls and cached database writes

Problem

ExternalSystem.CachedCall() and Database.CachedWrite() are fire-and-forget: a script gets no handle back and cannot confirm delivery, and an operator cannot tie a parked store-and-forward (S&F) message to a known business operation. Notify.Send(), by contrast, already returns a trackable NotificationId. The goal is to give cached external/database calls the same first-class traceability, under a tracking model unified across all three store-and-forward producers.

Decision

Add a trackable ID to cached calls via Approach B — a sibling central component (Site Call Audit) plus shared tracking contracts in Commons. The Notification Outbox is left unchanged; unification lives in shared types and a consistent script API, not in a merged table or component.

Why a sibling, not a merged component

Delivery locality is the decisive constraint:

Notifications are central-delivered: sites store-and-forward them to the central cluster, which delivers via SMTP. The NotificationOutboxActor runs a dispatcher loop. Central becomes the source of truth after handoff.
Cached calls / DB writes are site-delivered: the external system or database often sits on the site's own network and is unreachable from central. The site's S&F Engine must always own delivery, and the site remains the source of truth for status. Central audit is an eventually-consistent mirror.

Merging both into one component (Approach A) would leave a single dispatcher loop that is live for notification rows and dormant for cached-call rows, hiding a real architectural difference. Approach B expresses the difference honestly while still giving scripts a unified ID model and Status() API.

Unified tracking model

`TrackedOperationId`

A GUID, defined in Commons, generated caller-side at the site at call time. It is both the tracking handle returned to the script and the idempotency key for telemetry sent to central. Notify.Send()'s existing NotificationId is the notification-domain name for this same type — no behavior change for notifications.

Script API

| Call | Returns |
| --- | --- |
| `ExternalSystem.CachedCall(system, method, params)` | `TrackedOperationId` |
| `Database.CachedWrite(name, sql, params)` | `TrackedOperationId` |
| `Notify.Send(...)` | `TrackedOperationId` (unchanged) |
| `Tracking.Status(id)` | unified status record (status, retry count, last error, key timestamps) |

Tracking.Status(id) is the unified accessor. Notify.Status(id) is retained as a thin alias for backward compatibility.
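To make the shape of the script API concrete, here is a minimal sketch in Python. It is not the real Script Runtime API. `cached_call`, `tracking_status`, `StatusRecord`, and the in-memory `_rows` dictionary are hypothetical stand-ins, and the system and method names are invented for illustration. The only points it shows are that the ID is generated caller-side at call time and that the status lookup is local.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StatusRecord:
    """Unified status record (subset): status, retry count, last error."""
    kind: str
    status: str
    retry_count: int = 0
    last_error: Optional[str] = None


# Hypothetical in-memory stand-in for the site-local tracking store.
_rows: Dict[uuid.UUID, StatusRecord] = {}


def cached_call(system: str, method: str, params: dict) -> uuid.UUID:
    """Stand-in for ExternalSystem.CachedCall: returns a TrackedOperationId."""
    op_id = uuid.uuid4()  # GUID generated caller-side, at call time
    _rows[op_id] = StatusRecord(kind="ExternalCall", status="Pending")
    return op_id


def tracking_status(op_id: uuid.UUID) -> Optional[StatusRecord]:
    """Stand-in for Tracking.Status(id): a purely local lookup."""
    return _rows.get(op_id)


# Illustrative call; "ERP" and "PostOrder" are made-up names.
op_id = cached_call("ERP", "PostOrder", {"orderId": 42})
record = tracking_status(op_id)
print(op_id, record.status)
```

The script keeps `op_id` and can poll the status later, which is the handle that fire-and-forget calls lacked.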

Status lifecycle

Pending → Retrying → Delivered / Parked / Failed / Discarded

Delivered — succeeded. A cached call that succeeds on its first immediate attempt goes straight here and never enters the S&F buffer.
Parked — transient retries exhausted; awaiting manual action.
Failed — permanent failure (e.g. HTTP 4xx). The error is also returned synchronously to the calling script, exactly as today; the record captures it. This is the one state beyond the notification lifecycle.
Discarded — operator discarded a parked operation.

There is no Forwarding state for cached calls — that exists only because notifications hand off to central. For cached calls, Tracking.Status(id) is always answered site-locally and authoritatively.
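The lifecycle above can be sketched as an enum plus a transition table. The state names come from this document. The exact set of allowed edges is an assumption, since the text names the states but does not enumerate every transition. In particular, the sketch assumes a first attempt can go straight from Pending to Delivered or Failed, and that an operator Retry moves Parked back to Retrying.

```python
from enum import Enum


class Status(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    FAILED = "Failed"
    DISCARDED = "Discarded"


# Assumed transition table; the edges are inferred, not specified.
ALLOWED = {
    Status.PENDING: {Status.RETRYING, Status.DELIVERED, Status.FAILED},
    Status.RETRYING: {Status.DELIVERED, Status.PARKED, Status.FAILED},
    Status.PARKED: {Status.RETRYING, Status.DISCARDED},  # operator Retry / Discard
    Status.DELIVERED: set(),
    Status.FAILED: set(),
    Status.DISCARDED: set(),
}


def can_transition(current: Status, new: Status) -> bool:
    """Return True if the assumed lifecycle permits current -> new."""
    return new in ALLOWED[current]
```

There is deliberately no Forwarding member, matching the statement that the state exists only for notifications.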

Site-side architecture

Site-local operation tracking table

A new SQLite table alongside the existing S&F buffer DB. One row per TrackedOperationId, created the moment the script issues the cached call, regardless of outcome:

Fields: kind, target summary (system+method, or DB name), status, retry count, last error, created/updated/terminal timestamps, source provenance (instance/script).
This table is the status record. The S&F buffer remains purely the retry mechanism; a buffered message references its TrackedOperationId.
Immediate success writes a terminal Delivered row directly here, with nothing placed in the S&F buffer.
Tracking.Status(id) reads this table — local, authoritative, available even when central is unreachable.
Retention: terminal rows purged after a configurable window (default 7 days; the site holds live operational state, central holds long-term audit).
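As an illustration of the table described above, the following sketch uses Python's built-in `sqlite3` module. The table name, column names, and helper functions are assumptions chosen for the example. The document lists the fields but does not fix a schema. It shows an immediate success written as a terminal Delivered row, and a retention purge driven by the terminal timestamp.

```python
import sqlite3
import uuid
from datetime import datetime, timedelta, timezone

# Hypothetical schema; column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tracked_operations (
    id              TEXT PRIMARY KEY,  -- TrackedOperationId (GUID as text)
    kind            TEXT NOT NULL,     -- e.g. ExternalCall or DatabaseWrite
    target_summary  TEXT NOT NULL,     -- system+method, or DB name
    status          TEXT NOT NULL,
    retry_count     INTEGER NOT NULL DEFAULT 0,
    last_error      TEXT,
    created_utc     TEXT NOT NULL,
    updated_utc     TEXT NOT NULL,
    terminal_utc    TEXT,              -- NULL while the operation is live
    source_instance TEXT,
    source_script   TEXT
);
"""


def utc_text(moment: datetime) -> str:
    """Fixed-width ISO-8601 text, so string comparison matches time order."""
    return moment.isoformat(timespec="microseconds")


def record_immediate_success(conn, kind, target, instance, script) -> str:
    """Write a terminal Delivered row; nothing enters the S&F buffer."""
    op_id = str(uuid.uuid4())
    ts = utc_text(datetime.now(timezone.utc))
    conn.execute(
        "INSERT INTO tracked_operations (id, kind, target_summary, status,"
        " created_utc, updated_utc, terminal_utc, source_instance, source_script)"
        " VALUES (?, ?, ?, 'Delivered', ?, ?, ?, ?, ?)",
        (op_id, kind, target, ts, ts, ts, instance, script),
    )
    return op_id


def purge_terminal(conn, cutoff_utc: str) -> int:
    """Delete terminal rows older than the cutoff; return how many were removed."""
    cur = conn.execute(
        "DELETE FROM tracked_operations"
        " WHERE terminal_utc IS NOT NULL AND terminal_utc < ?",
        (cutoff_utc,),
    )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
op_id = record_immediate_success(conn, "ExternalCall", "ERP.PostOrder", "instance-1", "script-a")
status = conn.execute(
    "SELECT status FROM tracked_operations WHERE id = ?", (op_id,)
).fetchone()[0]

# Default retention window from the text: 7 days.
cutoff = utc_text(datetime.now(timezone.utc) - timedelta(days=7))
purged = purge_terminal(conn, cutoff)
```

Because the row was just written, the 7-day purge removes nothing. Live rows have a NULL terminal timestamp and are never eligible.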

Telemetry to central

On every lifecycle transition (Pending → Retrying → Delivered/Parked/Failed/Discarded) the site emits a telemetry event over the existing site→central channel, carrying TrackedOperationId, kind, summary, status, retry count, last error, timestamps, and source site. Delivery is best-effort and at-least-once; events are idempotent on the ID.
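A telemetry event with the fields listed above might look like the following sketch. The class name, field names, and JSON encoding are assumptions for illustration. The real message contracts are to be defined in Commons, and the wire format is not specified here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class TrackingTelemetry:
    """Hypothetical telemetry payload; field names are illustrative."""
    tracked_operation_id: str  # also serves as the idempotency key
    kind: str
    summary: str
    status: str
    retry_count: int
    last_error: Optional[str]
    created_utc: str
    updated_utc: str
    terminal_utc: Optional[str]
    source_site: str


def encode(event: TrackingTelemetry) -> str:
    """Serialize an event to JSON (assumed encoding, for the example only)."""
    return json.dumps(asdict(event), sort_keys=True)


def decode(payload: str) -> TrackingTelemetry:
    """Rebuild an event from its JSON form."""
    return TrackingTelemetry(**json.loads(payload))


event = TrackingTelemetry(
    tracked_operation_id="3f2b6c1e-0000-4000-8000-000000000001",
    kind="DatabaseWrite",
    summary="HistorianDb",
    status="Retrying",
    retry_count=2,
    last_error="connection timed out",
    created_utc="2026-05-19T15:00:00+00:00",
    updated_utc="2026-05-19T15:02:00+00:00",
    terminal_utc=None,
    source_site="site-a",
)
round_tripped = decode(encode(event))
```

Sending the full current state in each event, rather than a delta, is what lets a receiver treat a repeated event as a no-op.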

Reconciliation

Because telemetry is best-effort, the central side periodically (and on reconnect) pulls "all tracking rows changed since cursor X" per site. Missed telemetry self-heals. The site never depends on central; central converges to the site.
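The "changed since cursor X" pull can be sketched as below. The document does not say what the cursor is. This example assumes a per-site, monotonically increasing change sequence number (`change_seq`), which is a hypothetical choice. A timestamp-based cursor would follow the same pattern.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrackingRow:
    op_id: str
    status: str
    change_seq: int  # hypothetical: bumped by the site on every row change


def rows_changed_since(
    site_rows: List[TrackingRow], cursor: int
) -> Tuple[List[TrackingRow], int]:
    """Return rows changed after the cursor, oldest first, plus the new cursor."""
    changed = sorted(
        (row for row in site_rows if row.change_seq > cursor),
        key=lambda row: row.change_seq,
    )
    new_cursor = changed[-1].change_seq if changed else cursor
    return changed, new_cursor


site_rows = [
    TrackingRow("op-1", "Delivered", change_seq=1),
    TrackingRow("op-2", "Parked", change_seq=3),
    TrackingRow("op-3", "Retrying", change_seq=2),
]

# Central last saw change 1, so it pulls everything after that.
first_pull, cursor = rows_changed_since(site_rows, cursor=1)
second_pull, cursor_after = rows_changed_since(site_rows, cursor)
```

The first pull returns `op-3` then `op-2` and advances the cursor to 3. The second pull returns nothing and leaves the cursor unchanged, so repeated pulls are cheap when nothing was missed.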

Carried-over rules (unchanged)

Tracking rows, like buffered messages, are not cleared on instance deletion.
Cached-call idempotency remains the caller's responsibility — a retry can still double-deliver.

Central — Site Call Audit component (new component #22)

`SiteCalls` table (central MS SQL)

Sibling of the Notifications table. One row per TrackedOperationId: source site, kind, target summary, status, retry count, last error, created/updated/terminal timestamps. Fed only by site telemetry and reconciliation pulls.

Ingestion is insert-if-not-exists, then upsert-on-newer-status. The lifecycle is monotonic, so status only advances, never regresses — making at-least-once and out-of-order telemetry harmless. Daily purge of terminal rows after a configurable window (default 365 days, mirroring Notifications).
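The ingestion rule can be illustrated with a small in-memory sketch. The numeric ranking of statuses is an assumption made for the example, and the dictionary stands in for the SiteCalls table. How an operator Retry (Parked back to Retrying) fits the monotonic rule is not specified in this document, and the sketch does not model it.

```python
from typing import Dict

# Assumed ordering: a higher rank means further along the lifecycle.
RANK = {
    "Pending": 0,
    "Retrying": 1,
    "Delivered": 2,
    "Parked": 2,
    "Failed": 2,
    "Discarded": 3,
}


def ingest(table: Dict[str, dict], event: dict) -> bool:
    """Insert if absent; otherwise update only on a strictly newer status.

    Returns True when the table changed, False when the event was ignored.
    """
    key = event["id"]
    current = table.get(key)
    if current is None:
        table[key] = dict(event)  # insert-if-not-exists
        return True
    if RANK[event["status"]] > RANK[current["status"]]:
        table[key] = dict(event)  # upsert-on-newer-status
        return True
    return False  # duplicate or stale event: harmless no-op


site_calls: Dict[str, dict] = {}
results = [
    ingest(site_calls, {"id": "op-1", "status": "Delivered", "retry_count": 1}),
    ingest(site_calls, {"id": "op-1", "status": "Retrying", "retry_count": 1}),   # late arrival
    ingest(site_calls, {"id": "op-1", "status": "Delivered", "retry_count": 1}),  # duplicate
]
```

Here the Delivered event arrives first, so the late Retrying event and the duplicate Delivered event are both ignored and the row stays Delivered.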

`SiteCallAuditActor`

Singleton on the active central node. Ingests telemetry, runs the periodic reconciliation pulls, computes KPIs, and relays Retry/Discard commands to sites.

It is not a dispatcher — the crucial difference from NotificationOutboxActor. Central has no path to a site's external systems or databases; this component is an audit sink, a query surface, and a command relay only.

KPIs

Point-in-time from the SiteCalls table, global and per-site, mirroring the Notification Outbox KPI shape: buffered count (Pending+Retrying), parked count, failed-last-interval, delivered-last-interval, oldest-pending age, and stuck count (Pending/Retrying older than a configurable threshold, default 10 minutes — display-only, no alerting).
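A rough sketch of the KPI computation follows, operating on plain dictionaries rather than the real SiteCalls table. The row shape, the one-hour interval, and the choice to measure "oldest-pending age" over all buffered rows (Pending and Retrying) from their creation time are assumptions. Only the 10-minute stuck threshold comes from the text.

```python
from datetime import datetime, timedelta, timezone

BUFFERED = ("Pending", "Retrying")


def compute_kpis(rows, now, interval=timedelta(hours=1),
                 stuck_after=timedelta(minutes=10)):
    """Compute point-in-time KPIs from a list of audit rows."""
    buffered = [r for r in rows if r["status"] in BUFFERED]
    ages = [now - r["created"] for r in buffered]

    def finished_recently(status):
        # Count rows that reached the given status within the interval.
        return sum(
            1 for r in rows
            if r["status"] == status
            and r["terminal"] is not None
            and now - r["terminal"] <= interval
        )

    return {
        "buffered": len(buffered),
        "parked": sum(1 for r in rows if r["status"] == "Parked"),
        "failed_last_interval": finished_recently("Failed"),
        "delivered_last_interval": finished_recently("Delivered"),
        "oldest_pending_age": max(ages) if ages else None,
        "stuck": sum(1 for age in ages if age > stuck_after),  # display-only
    }


now = datetime(2026, 5, 19, 12, 0, tzinfo=timezone.utc)
rows = [
    {"status": "Pending", "created": now - timedelta(minutes=2), "terminal": None},
    {"status": "Retrying", "created": now - timedelta(minutes=25), "terminal": None},
    {"status": "Parked", "created": now - timedelta(hours=3), "terminal": None},
    {"status": "Delivered", "created": now - timedelta(minutes=30),
     "terminal": now - timedelta(minutes=29)},
    {"status": "Failed", "created": now - timedelta(hours=5),
     "terminal": now - timedelta(hours=5)},
]
kpis = compute_kpis(rows, now)
```

Per-site figures would come from filtering `rows` by source site before calling the same function.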

Central→site command path (Retry / Discard)

Parked operations live in the site's S&F buffer, so Retry/Discard from the Central UI must travel down to the owning site:

New ClusterClient command/control messages, central→site: RetryParkedOperation(TrackedOperationId) and DiscardParkedOperation(TrackedOperationId), riding the existing per-site ClusterClient.
The site applies the command to its S&F buffer / tracking table, then emits normal telemetry reflecting the new state (Retrying, or Discarded).
Central never directly mutates the SiteCalls row. It sends the command and lets the resulting telemetry update the audit row — the site stays the single source of truth.
If the site is offline, the command fails fast and the UI surfaces a "site unreachable" message.
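The command path above can be sketched as follows. `Site`, `relay`, and `SiteUnreachable` are hypothetical stand-ins for the site's S&F engine, the ClusterClient send, and the fail-fast error. Only the two command names come from this document. The point of the example is that central relays the command and waits for telemetry rather than editing its own audit row.

```python
from typing import Dict, List


class SiteUnreachable(Exception):
    """Raised when the owning site cannot be reached (fail fast)."""


class Site:
    """Hypothetical stand-in for a site's tracking table and telemetry outbox."""

    TARGETS = {
        "RetryParkedOperation": "Retrying",
        "DiscardParkedOperation": "Discarded",
    }

    def __init__(self, online: bool = True) -> None:
        self.online = online
        self.rows: Dict[str, str] = {}   # TrackedOperationId -> status
        self.telemetry: List[dict] = []  # events waiting to go to central

    def apply(self, command: str, op_id: str) -> bool:
        if command not in self.TARGETS:
            raise ValueError(f"unknown command: {command}")
        if self.rows.get(op_id) != "Parked":
            return False  # only parked operations can be retried or discarded
        new_status = self.TARGETS[command]
        self.rows[op_id] = new_status
        self.telemetry.append({"id": op_id, "status": new_status})
        return True


def relay(site: Site, command: str, op_id: str) -> bool:
    """Central side: forward the command, never touch the audit row."""
    if not site.online:
        raise SiteUnreachable("site unreachable")
    return site.apply(command, op_id)


site = Site()
site.rows["op-1"] = "Parked"
central_audit = {"op-1": "Parked"}

accepted = relay(site, "DiscardParkedOperation", "op-1")
audit_before_telemetry = dict(central_audit)  # still Parked at this point

# The audit row changes only when the site's telemetry is ingested.
for event in site.telemetry:
    central_audit[event["id"]] = event["status"]
```

Between the relay and the telemetry ingest, the central row still reads Parked, which matches the edge case where the audit row is unchanged until confirming telemetry arrives.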

Central UI

New page — Site Calls — in the same nav group as the Notification Outbox page:

Covers cached calls only: ExternalCall + DatabaseWrite. Notifications keep their existing dedicated Notification Outbox page.
Queryable list filtered by site, kind, status, and time range. Columns: timestamp, site, kind, target summary, status badge, retry count, last error.
Retry / Discard actions on Parked rows, issuing the central→site commands above.
Headline KPI tiles on the Health dashboard alongside the existing Notification Outbox tiles. Stuck rows get a display-only badge — no escalation.
Custom Blazor Server + Bootstrap components, consistent with the rest of the Central UI.

Error handling & edge cases

Telemetry loss — reconciliation pull self-heals; central is explicitly eventually-consistent.
Out-of-order / duplicate telemetry — monotonic-status upsert keyed on TrackedOperationId makes both harmless.
Permanent failure on a cached call — error returned synchronously to the script (unchanged) and recorded as terminal Failed.
Site offline during Retry/Discard — command fails fast; UI says so; the audit row is unchanged until confirming telemetry arrives.
Cached-call double-delivery — still the caller's responsibility; the idempotency note stays in the ESG doc.
Instance deletion — tracking rows and buffered messages survive, per the existing S&F rule.

Affected documents

New: docs/requirements/Component-SiteCallAudit.md
Component-ExternalSystemGateway.md — CachedCall/CachedWrite return TrackedOperationId; Failed state; Tracking.Status.
Component-StoreAndForward.md — site-local tracking table, telemetry emission, reconciliation, TrackedOperationId on buffer entries.
Component-SiteRuntime.md — Script Runtime API: return types and Tracking.Status(id).
Component-Communication.md — telemetry channel and RetryParkedOperation/DiscardParkedOperation commands.
Component-Commons.md — TrackedOperationId, unified status enum, telemetry message contracts.
Component-ConfigurationDatabase.md — SiteCalls table, EF mapping, migration.
Component-CentralUI.md — new Site Calls page.
Component-HealthMonitoring.md — KPI tiles on the dashboard.
Component-NotificationService.md / Component-NotificationOutbox.md — note the shared TrackedOperationId model and Notify.Status alias.
README.md — component table updated to 22 components.
CLAUDE.md — component list and Key Design Decisions.

Out of scope

A CLI surface for site-local Retry/Discard (can be added later if needed).
Merging notifications into the Site Calls page or a unified outbox component.
Routing cached-call delivery through central.
