Add TrackedOperationId handles to CachedCall/CachedWrite under a unified tracking model. New site-local tracking table is the status source of truth; new central Site Call Audit component (#22) mirrors status via telemetry, exposes KPIs and a Site Calls UI page with central->site Retry/Discard.
Cached Call Tracking — Design
Date: 2026-05-19 Status: Approved Topic: Trackable IDs for cached external system calls and cached database writes
Problem
ExternalSystem.CachedCall() and Database.CachedWrite() are fire-and-forget: a
script gets no handle back, cannot confirm delivery, and an operator cannot tie a
parked S&F message to a known business operation. Notify.Send() already returns a
trackable NotificationId. The goal is to give cached external/database calls the
same first-class traceability, under a tracking model unified across all three
store-and-forward producers.
Decision
Add a trackable ID to cached calls via Approach B — a sibling central component
(Site Call Audit) plus shared tracking contracts in Commons. The Notification
Outbox is left unchanged; unification lives in shared types and a consistent script
API, not in a merged table or component.
Why a sibling, not a merged component
Delivery locality is the decisive constraint:
- Notifications are central-delivered: sites store-and-forward them to the central cluster, which delivers via SMTP. The NotificationOutboxActor runs a dispatcher loop. Central becomes the source of truth after handoff.
- Cached calls / DB writes are site-delivered: the external system or database often sits on the site's own network and is unreachable from central. The site's S&F Engine must always own delivery, and the site remains the source of truth for status. Central audit is an eventually-consistent mirror.
Merging both into one component (Approach A) would put a dispatcher loop that is live
for some rows and dormant for others into a single component, hiding a real
architectural difference. Approach B expresses the difference honestly while still
giving scripts a unified ID model and Status() API.
Unified tracking model
TrackedOperationId
A GUID, defined in Commons, generated caller-side at the site at call time. It is
both the tracking handle returned to the script and the idempotency key for telemetry
sent to central. Notify.Send()'s existing NotificationId is the notification-domain
name for this same type — no behavior change for notifications.
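As a minimal sketch of the dual role described above (Python is used only for illustration), the ID can be modelled as a random GUID minted once by the caller and reused as a de-duplication key on the receiving side. The helper names `new_tracked_operation_id` and `receive` are hypothetical and do not come from the design.

```python
import uuid
from typing import NewType

# Hypothetical alias: the design only states that the ID is a GUID defined in Commons.
TrackedOperationId = NewType("TrackedOperationId", uuid.UUID)


def new_tracked_operation_id() -> TrackedOperationId:
    """Mint the ID caller-side, at call time (random version-4 GUID)."""
    return TrackedOperationId(uuid.uuid4())


def receive(seen: set, op_id: TrackedOperationId) -> bool:
    """Idempotent receive: return True only the first time an ID is seen."""
    if op_id in seen:
        return False
    seen.add(op_id)
    return True
```

Because the same value is handed to the script and attached to telemetry, a receiver that has already seen the ID can drop a repeated event without any extra correlation data.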
Script API
| Call | Returns |
|---|---|
| ExternalSystem.CachedCall(system, method, params) | TrackedOperationId |
| Database.CachedWrite(name, sql, params) | TrackedOperationId |
| Notify.Send(...) | TrackedOperationId (unchanged) |
| Tracking.Status(id) | unified status record (status, retry count, last error, key timestamps) |
Tracking.Status(id) is the unified accessor. Notify.Status(id) is retained as a
thin alias for backward compatibility.
Status lifecycle
Pending → Retrying → Delivered / Parked / Failed / Discarded
- Delivered — succeeded. A cached call that succeeds on its first immediate attempt goes straight here and never enters the S&F buffer.
- Parked — transient retries exhausted; awaiting manual action.
- Failed — permanent failure (e.g. HTTP 4xx). The error is also returned synchronously to the calling script, exactly as today; the record captures it. This is the one state beyond the notification lifecycle.
- Discarded — operator discarded a parked operation.
There is no Forwarding state for cached calls — that exists only because
notifications hand off to central. For cached calls, Tracking.Status(id) is always
answered site-locally and authoritatively.
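The lifecycle can be sketched as an enum plus a transition table. This is an illustrative Python sketch under stated assumptions: the names `TrackedStatus`, `ALLOWED` and `can_transition` are hypothetical, and the exact set of permitted transitions (in particular Parked → Retrying on an operator Retry, and Pending going directly to a final state) is inferred from the descriptions in this document rather than specified by it.

```python
from enum import Enum


class TrackedStatus(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    FAILED = "Failed"
    DISCARDED = "Discarded"


S = TrackedStatus

# Assumed transition table; statuses missing from the keys have no outgoing edges.
ALLOWED = {
    S.PENDING: {S.RETRYING, S.DELIVERED, S.FAILED},
    S.RETRYING: {S.DELIVERED, S.PARKED, S.FAILED},
    S.PARKED: {S.RETRYING, S.DISCARDED},  # operator Retry / Discard
}


def can_transition(current: TrackedStatus, new: TrackedStatus) -> bool:
    """Return True if the assumed lifecycle permits moving current -> new."""
    return new in ALLOWED.get(current, set())
```

Note that there is deliberately no Forwarding member, matching the statement above that this state exists only for notifications.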
Site-side architecture
Site-local operation tracking table
A new SQLite table alongside the existing S&F buffer DB. One row per
TrackedOperationId, created the moment the script issues the cached call,
regardless of outcome:
- Fields: kind, target summary (system+method, or DB name), status, retry count, last error, created/updated/terminal timestamps, source provenance (instance/script).
- This table is the status record. The S&F buffer remains purely the retry mechanism; a buffered message references its TrackedOperationId.
- Immediate success writes a terminal Delivered row directly here, with nothing placed in the S&F buffer.
- Tracking.Status(id) reads this table: local, authoritative, available even when central is unreachable.
- Retention: terminal rows purged after a configurable window (default 7 days; the site holds live operational state, central holds long-term audit).
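The table described above might look roughly like the following. This is a sketch in Python using the standard `sqlite3` module, not the actual schema: the table name `tracked_operations`, the column names, the helper functions, and the choice of which statuses count as terminal are all assumptions made for illustration.

```python
import sqlite3
import uuid
from datetime import datetime, timedelta, timezone

# Assumption: these statuses are terminal; Parked still awaits an operator.
TERMINAL = {"Delivered", "Failed", "Discarded"}

SCHEMA = """
CREATE TABLE IF NOT EXISTS tracked_operations (
    id          TEXT PRIMARY KEY,            -- TrackedOperationId (GUID as text)
    kind        TEXT NOT NULL,               -- e.g. ExternalCall or DatabaseWrite
    target      TEXT NOT NULL,               -- system+method, or DB name
    status      TEXT NOT NULL,
    retry_count INTEGER NOT NULL DEFAULT 0,
    last_error  TEXT,
    created_at  TEXT NOT NULL,
    updated_at  TEXT NOT NULL,
    terminal_at TEXT,                        -- NULL until a terminal status
    source      TEXT NOT NULL                -- instance/script provenance
)
"""


def utc_now() -> str:
    return datetime.now(timezone.utc).isoformat(timespec="microseconds")


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.row_factory = sqlite3.Row
    db.execute(SCHEMA)
    return db


def create_row(db, kind, target, source, status="Pending") -> str:
    """Create the row when the script issues the call; returns the new ID."""
    op_id = str(uuid.uuid4())
    ts = utc_now()
    db.execute(
        "INSERT INTO tracked_operations "
        "(id, kind, target, status, created_at, updated_at, terminal_at, source) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (op_id, kind, target, status, ts, ts,
         ts if status in TERMINAL else None, source),
    )
    db.commit()
    return op_id


def status(db, op_id):
    """Local, authoritative read; returns None for an unknown ID."""
    row = db.execute(
        "SELECT * FROM tracked_operations WHERE id = ?", (op_id,)
    ).fetchone()
    return dict(row) if row is not None else None


def purge_terminal(db, older_than_days: int = 7) -> int:
    """Delete terminal rows older than the retention window; returns the count."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=older_than_days))
    cur = db.execute(
        "DELETE FROM tracked_operations "
        "WHERE terminal_at IS NOT NULL AND terminal_at < ?",
        (cutoff.isoformat(timespec="microseconds"),),
    )
    db.commit()
    return cur.rowcount
```

An immediate success would call `create_row(..., status="Delivered")`, which stamps `terminal_at` at creation and never touches the S&F buffer.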
Telemetry to central
On every lifecycle transition (Created → Retrying → Delivered/Parked/Failed/Discarded) the site emits a telemetry event over the existing site→central channel:
TrackedOperationId, kind, summary, status, retry count, last error, timestamps,
source site. Best-effort, at-least-once, idempotent on the ID.
Reconciliation
Because telemetry is best-effort, the central side periodically (and on reconnect) pulls "all tracking rows changed since cursor X" per site. Missed telemetry self-heals. The site never depends on central; central converges to the site.
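A cursor-based pull of this kind can be sketched as follows. This is illustrative Python only: `TrackingRow`, `changed_since` and `reconcile` are hypothetical names, and the integer `updated_at` change stamp is an assumption standing in for whatever cursor the real tracking table exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackingRow:
    op_id: str
    status: str
    updated_at: int  # assumed: a per-site, monotonically increasing change stamp


def changed_since(site_rows, cursor: int):
    """Site side: return rows changed after the cursor, oldest change first."""
    changed = [r for r in site_rows if r.updated_at > cursor]
    return sorted(changed, key=lambda r: r.updated_at)


def reconcile(site_rows, mirror: dict, cursor: int) -> int:
    """Central side: apply changed rows to the mirror and return the new cursor."""
    for row in changed_since(site_rows, cursor):
        mirror[row.op_id] = row
        cursor = row.updated_at
    return cursor
```

Running `reconcile` again with the returned cursor is a no-op until the site changes another row, which is what lets the pull run both periodically and on reconnect without extra bookkeeping.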
Carried-over rules (unchanged)
- Tracking rows, like buffered messages, are not cleared on instance deletion.
- Cached-call idempotency remains the caller's responsibility — a retry can still double-deliver.
Central — Site Call Audit component (new component #22)
SiteCalls table (central MS SQL)
Sibling of the Notifications table. One row per TrackedOperationId: source site,
kind, target summary, status, retry count, last error, created/updated/terminal
timestamps. Fed only by site telemetry and reconciliation pulls.
Ingestion is insert-if-not-exists, then upsert-on-newer-status. The lifecycle
is monotonic, so status only advances, never regresses — making at-least-once and
out-of-order telemetry harmless. Daily purge of terminal rows after a configurable
window (default 365 days, mirroring Notifications).
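The two-step ingestion can be sketched like this. It is a minimal Python illustration under explicit assumptions: an in-memory SQLite database stands in for central MS SQL, the column set is reduced, and "newer" is decided by comparing the site-side updated timestamp, since this document does not spell out how newer status is determined.

```python
import sqlite3


def open_audit_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")  # stand-in for the central database
    db.execute(
        "CREATE TABLE SiteCalls ("
        " id TEXT PRIMARY KEY, site TEXT NOT NULL, status TEXT NOT NULL,"
        " retry_count INTEGER NOT NULL, last_error TEXT,"
        " updated_at TEXT NOT NULL)"
    )
    return db


def ingest(db: sqlite3.Connection, event: dict) -> None:
    """Apply one telemetry event; duplicates and stale events change nothing."""
    # Step 1: insert-if-not-exists, keyed on the TrackedOperationId.
    db.execute(
        "INSERT OR IGNORE INTO SiteCalls"
        " (id, site, status, retry_count, last_error, updated_at)"
        " VALUES (:id, :site, :status, :retry_count, :last_error, :updated_at)",
        event,
    )
    # Step 2: overwrite only when the event is strictly newer than the stored row
    # (assumed ordering: fixed-format UTC timestamps compared as text).
    db.execute(
        "UPDATE SiteCalls SET status = :status, retry_count = :retry_count,"
        " last_error = :last_error, updated_at = :updated_at"
        " WHERE id = :id AND updated_at < :updated_at",
        event,
    )
    db.commit()
```

With this shape, replaying the same event or receiving an older one after a newer one leaves the audit row untouched.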
SiteCallAuditActor
Singleton on the active central node. Ingests telemetry, runs the periodic reconciliation pulls, computes KPIs, and relays Retry/Discard commands to sites.
It is not a dispatcher — the crucial difference from NotificationOutboxActor.
Central has no path to a site's external systems or databases; this component is an
audit sink, a query surface, and a command relay only.
KPIs
Point-in-time from the SiteCalls table, global and per-site, mirroring the
Notification Outbox KPI shape: buffered count (Pending+Retrying), parked count,
failed-last-interval, delivered-last-interval, oldest-pending age, and stuck count
(Pending/Retrying older than a configurable threshold, default 10 minutes —
display-only, no alerting).
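As a rough illustration of these KPIs, the sketch below computes them over in-memory rows. It is Python for illustration only, and several details are assumptions rather than part of the design: the function name `compute_kpis`, the row shape, measuring age from the created timestamp, taking the oldest-pending age over Pending and Retrying rows, and a one-hour default for the "last interval".

```python
from datetime import datetime, timedelta, timezone


def compute_kpis(rows, now,
                 stuck_after=timedelta(minutes=10),
                 interval=timedelta(hours=1)):
    """Point-in-time KPIs over rows shaped like
    {"status": str, "created_at": datetime, "terminal_at": datetime | None}."""
    buffered = [r for r in rows if r["status"] in ("Pending", "Retrying")]
    oldest = min((r["created_at"] for r in buffered), default=None)

    def finished_recently(status: str) -> int:
        # Rows that reached the given status within the trailing interval.
        return sum(
            1 for r in rows
            if r["status"] == status
            and r["terminal_at"] is not None
            and now - r["terminal_at"] <= interval
        )

    return {
        "buffered": len(buffered),
        "parked": sum(1 for r in rows if r["status"] == "Parked"),
        "failed_last_interval": finished_recently("Failed"),
        "delivered_last_interval": finished_recently("Delivered"),
        "oldest_pending_age": (now - oldest) if oldest is not None else None,
        "stuck": sum(1 for r in buffered if now - r["created_at"] > stuck_after),
    }
```

Passing only one site's rows gives the per-site figures; passing all rows gives the global ones.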
Central→site command path (Retry / Discard)
Parked operations live in the site's S&F buffer, so Retry/Discard from the Central UI must travel down to the owning site:
- New ClusterClient command/control messages, central→site: RetryParkedOperation(TrackedOperationId) and DiscardParkedOperation(TrackedOperationId), riding the existing per-site ClusterClient.
- The site applies the command to its S&F buffer / tracking table, then emits normal telemetry reflecting the new state (Retrying, or Discarded).
- Central never directly mutates the SiteCalls row. It sends the command and lets the resulting telemetry update the audit row, so the site stays the single source of truth.
- If the site is offline, the command fails fast and the UI surfaces a "site unreachable" message.
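The command path above can be sketched as a small in-process model. This is illustrative Python only: the `Site` class, `relay`, `drain_telemetry` and the `SiteUnreachable` exception are hypothetical stand-ins for the ClusterClient transport and the site's S&F engine; only the two command names come from the design.

```python
class SiteUnreachable(Exception):
    """Raised when the owning site cannot be reached."""


class Site:
    """Stand-in for a site: owns the tracking state and emits telemetry."""

    def __init__(self, online: bool = True):
        self.online = online
        self.tracking = {}  # op_id -> status (site-local source of truth)
        self.outbox = []    # telemetry events waiting to reach central

    def handle(self, command: str, op_id: str) -> bool:
        new_status = {
            "RetryParkedOperation": "Retrying",
            "DiscardParkedOperation": "Discarded",
        }[command]
        if self.tracking.get(op_id) != "Parked":
            return False  # only parked operations can be retried or discarded
        self.tracking[op_id] = new_status
        self.outbox.append((op_id, new_status))  # normal telemetry
        return True


def relay(site: Site, command: str, op_id: str) -> bool:
    """Central side: forward the command, failing fast if the site is offline."""
    if not site.online:
        raise SiteUnreachable("site unreachable")
    return site.handle(command, op_id)


def drain_telemetry(site: Site, audit: dict) -> None:
    """Central side: the audit mirror changes only when telemetry arrives."""
    while site.outbox:
        op_id, status = site.outbox.pop(0)
        audit[op_id] = status
```

The key property is that `relay` never writes to `audit`; the mirror moves only in `drain_telemetry`, so a command that fails or is rejected leaves the audit row exactly as it was.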
Central UI
New page — Site Calls — in the same nav group as the Notification Outbox page:
- Covers cached calls only: ExternalCall + DatabaseWrite. Notifications keep their existing dedicated Notification Outbox page.
- Queryable list filtered by site, kind, status, and time range. Columns: timestamp, site, kind, target summary, status badge, retry count, last error.
- Retry / Discard actions on Parked rows, issuing the central→site commands above.
- Headline KPI tiles on the Health dashboard alongside the existing Notification Outbox tiles. Stuck rows get a display-only badge, with no escalation.
- Custom Blazor Server + Bootstrap components, consistent with the rest of the Central UI.
Error handling & edge cases
- Telemetry loss: reconciliation pull self-heals; central is explicitly eventually-consistent.
- Out-of-order / duplicate telemetry: monotonic-status upsert keyed on TrackedOperationId makes both harmless.
- Permanent failure on a cached call: error returned synchronously to the script (unchanged) and recorded as terminal Failed.
- Site offline during Retry/Discard: command fails fast; UI says so; the audit row is unchanged until confirming telemetry arrives.
- Cached-call double-delivery: still the caller's responsibility; the idempotency note stays in the ESG doc.
- Instance deletion: tracking rows and buffered messages survive, per the existing S&F rule.
Affected documents
- New: docs/requirements/Component-SiteCallAudit.md
- Component-ExternalSystemGateway.md: CachedCall/CachedWrite return TrackedOperationId; Failed state; Tracking.Status.
- Component-StoreAndForward.md: site-local tracking table, telemetry emission, reconciliation, TrackedOperationId on buffer entries.
- Component-SiteRuntime.md: Script Runtime API (return types and Tracking.Status(id)).
- Component-Communication.md: telemetry channel and RetryParkedOperation / DiscardParkedOperation commands.
- Component-Commons.md: TrackedOperationId, unified status enum, telemetry message contracts.
- Component-ConfigurationDatabase.md: SiteCalls table, EF mapping, migration.
- Component-CentralUI.md: new Site Calls page.
- Component-HealthMonitoring.md: KPI tiles on the dashboard.
- Component-NotificationService.md / Component-NotificationOutbox.md: note the shared TrackedOperationId model and Notify.Status alias.
- README.md: component table updated to 22 components.
- CLAUDE.md: component list and Key Design Decisions.
Out of scope
- A CLI surface for site-local Retry/Discard (can be added later if needed).
- Merging notifications into the Site Calls page or a unified outbox component.
- Routing cached-call delivery through central.