# Cached Call Tracking — Design

**Date**: 2026-05-19
**Status**: Approved
**Topic**: Trackable IDs for cached external system calls and cached database writes

## Problem

`ExternalSystem.CachedCall()` and `Database.CachedWrite()` are fire-and-forget: a
script gets no handle back and cannot confirm delivery, and an operator cannot tie a
parked store-and-forward (S&F) message to a known business operation. `Notify.Send()`
already returns a trackable `NotificationId`. The goal is to give cached
external/database calls the same first-class traceability, under a tracking model
unified across all three store-and-forward producers.

## Decision

Add a trackable ID to cached calls via **Approach B: a sibling central component
(`Site Call Audit`) plus shared tracking contracts in Commons**. The Notification
Outbox is left unchanged; unification lives in shared types and a consistent script
API, not in a merged table or component.

### Why a sibling, not a merged component

Delivery locality is the decisive constraint:

- **Notifications** are *central-delivered*: sites store-and-forward them to the
  central cluster, which delivers via SMTP. The `NotificationOutboxActor` runs a
  dispatcher loop. Central becomes the source of truth after handoff.
- **Cached calls / DB writes** are *site-delivered*: the external system or database
  often sits on the site's own network and is unreachable from central. The site's
  S&F Engine must always own delivery, and the **site remains the source of truth**
  for status. Central audit is an eventually-consistent mirror.

Merging both into one component (Approach A) would put a dispatcher loop that is live
for some rows and dormant for others into a single component, hiding a real
architectural difference. Approach B expresses the difference honestly while still
giving scripts a unified ID model and `Status()` API.

## Unified tracking model

### `TrackedOperationId`

A GUID, defined in Commons, generated **caller-side at the site at call time**. It is
both the tracking handle returned to the script and the idempotency key for telemetry
sent to central. `Notify.Send()`'s existing `NotificationId` is the notification-domain
name for this same type, so there is no behavior change for notifications.

### Script API

| Call | Returns |
|---|---|
| `ExternalSystem.CachedCall(system, method, params)` | `TrackedOperationId` |
| `Database.CachedWrite(name, sql, params)` | `TrackedOperationId` |
| `Notify.Send(...)` | `TrackedOperationId` (unchanged) |
| `Tracking.Status(id)` | unified status record (status, retry count, last error, key timestamps) |

`Tracking.Status(id)` is the unified accessor. `Notify.Status(id)` is retained as a
thin alias for backward compatibility.

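The call-then-query flow in the table can be sketched as follows. This is a minimal, language-neutral illustration written in Python; the `FakeSite` class, its method names, and the `StatusRecord` field names are hypothetical stand-ins, not the actual script runtime API.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatusRecord:
    # Field names are illustrative; the design lists status, retry count,
    # last error, and key timestamps without fixing their names.
    status: str
    retry_count: int = 0
    last_error: Optional[str] = None


class FakeSite:
    """Hypothetical in-memory stand-in for the site script runtime."""

    def __init__(self):
        self._records = {}

    def cached_call(self, system, method, params):
        # The ID is generated caller-side, at call time, before any delivery.
        op_id = uuid.uuid4()
        self._records[op_id] = StatusRecord(status="Pending")
        return op_id

    def tracking_status(self, op_id):
        # Answered from site-local state only; central is never consulted.
        return self._records[op_id]


site = FakeSite()
op_id = site.cached_call("Erp", "PostOrder", {"orderId": 42})
record = site.tracking_status(op_id)
```

The point of the sketch is that the script holds a handle immediately, whether or not delivery has happened yet.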
### Status lifecycle

`Pending → Retrying → Delivered / Parked / Failed / Discarded`

- **Delivered**: succeeded. A cached call that succeeds on its first immediate
  attempt goes straight here and never enters the S&F buffer.
- **Parked**: transient retries exhausted; awaiting manual action.
- **Failed**: permanent failure (e.g. HTTP 4xx). The error is *also* returned
  synchronously to the calling script, exactly as today; the record captures it.
  This is the one state beyond the notification lifecycle.
- **Discarded**: operator discarded a parked operation.

There is no `Forwarding` state for cached calls; that state exists only because
notifications hand off to central. For cached calls, `Tracking.Status(id)` is always
answered site-locally and authoritatively.

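One way to read the lifecycle above is as a small state machine. The sketch below is an interpretation, not a specification: the `ALLOWED` transition table, including whether a retry can end in `Failed`, is an assumption inferred from the state descriptions and from the Retry/Discard commands described later in this document.

```python
from enum import Enum


class Status(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    FAILED = "Failed"
    DISCARDED = "Discarded"


# Assumed transition table; the design names the states but not every edge.
ALLOWED = {
    Status.PENDING: {Status.RETRYING, Status.DELIVERED, Status.FAILED},
    Status.RETRYING: {Status.DELIVERED, Status.PARKED, Status.FAILED},
    Status.PARKED: {Status.RETRYING, Status.DISCARDED},  # operator Retry / Discard
    Status.DELIVERED: set(),
    Status.FAILED: set(),
    Status.DISCARDED: set(),
}


def can_transition(current, new):
    """Return True if the assumed table permits moving from current to new."""
    return new in ALLOWED[current]
```

Under this reading, `Pending → Delivered` covers the immediate-success path that never touches the S&F buffer.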
## Site-side architecture

### Site-local operation tracking table

A new SQLite table alongside the existing S&F buffer DB. One row per
`TrackedOperationId`, created the moment the script issues the cached call,
regardless of outcome:

- Fields: kind, target summary (system+method, or DB name), status, retry count,
  last error, created/updated/terminal timestamps, source provenance
  (instance/script).
- This table is the **status record**. The S&F buffer remains purely the **retry
  mechanism**; a buffered message references its `TrackedOperationId`.
- Immediate success writes a terminal `Delivered` row directly here, with nothing
  placed in the S&F buffer.
- `Tracking.Status(id)` reads this table: local, authoritative, available even when
  central is unreachable.
- Retention: terminal rows purged after a configurable window (default 7 days; the
  site holds live operational state, central holds long-term audit).

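A possible shape for the tracking table, sketched with Python's built-in `sqlite3` module. The table name, column names, and timestamp format are assumptions chosen to match the field list above; the example values (`Erp.PostOrder`, `Line1/OnBatchEnd`) are invented for illustration.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Assumed schema: one column per field named in the design.
SCHEMA = """
CREATE TABLE tracked_operation (
    id             TEXT PRIMARY KEY,   -- TrackedOperationId (GUID)
    kind           TEXT NOT NULL,      -- 'ExternalCall' or 'DatabaseWrite'
    target_summary TEXT NOT NULL,      -- system+method, or DB name
    status         TEXT NOT NULL,
    retry_count    INTEGER NOT NULL DEFAULT 0,
    last_error     TEXT,
    created_utc    TEXT NOT NULL,
    updated_utc    TEXT NOT NULL,
    terminal_utc   TEXT,
    source         TEXT NOT NULL       -- instance/script provenance
)
"""


def record_immediate_success(db, kind, target, source):
    """Write a terminal Delivered row; nothing goes into the S&F buffer."""
    op_id = str(uuid.uuid4())
    ts = datetime.now(timezone.utc).isoformat()
    db.execute(
        "INSERT INTO tracked_operation "
        "(id, kind, target_summary, status, created_utc, updated_utc, terminal_utc, source) "
        "VALUES (?, ?, ?, 'Delivered', ?, ?, ?, ?)",
        (op_id, kind, target, ts, ts, ts, source),
    )
    return op_id


def status_of(db, op_id):
    """Site-local status lookup, the role Tracking.Status(id) plays."""
    return db.execute(
        "SELECT status, retry_count, last_error FROM tracked_operation WHERE id = ?",
        (op_id,),
    ).fetchone()


db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
op_id = record_immediate_success(db, "ExternalCall", "Erp.PostOrder", "Line1/OnBatchEnd")
```

Because the lookup only touches the local database, it keeps working when central is unreachable.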
### Telemetry to central

On every lifecycle transition (`Created → Retrying → Delivered/Parked/Failed/Discarded`)
the site emits a telemetry event over the existing site→central channel:
`TrackedOperationId`, kind, summary, status, retry count, last error, timestamps,
source site. Best-effort, at-least-once, idempotent on the ID.

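The best-effort emission described above might look like the following sketch. The `TrackingTelemetry` field names and the `BestEffortSender` class are hypothetical, and the event carries a single timestamp for brevity where the design calls for several.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class TrackingTelemetry:
    tracked_operation_id: str
    kind: str
    summary: str
    status: str
    retry_count: int
    last_error: Optional[str]
    updated_utc: str
    source_site: str


class BestEffortSender:
    """Hypothetical sender: transport failures are counted, never raised."""

    def __init__(self, transport):
        self._transport = transport
        self.dropped = 0

    def emit(self, event):
        try:
            self._transport(asdict(event))
        except Exception:
            # Best-effort: the site never blocks on central.
            self.dropped += 1


def broken_transport(_payload):
    raise ConnectionError("central unreachable")


event = TrackingTelemetry(
    "00000000-0000-0000-0000-000000000001", "ExternalCall", "Erp.PostOrder",
    "Retrying", 1, "timeout", "2026-05-19T12:00:00+00:00", "site-a",
)
received = []
BestEffortSender(received.append).emit(event)
down = BestEffortSender(broken_transport)
down.emit(event)
```

A dropped event is not an error on the site side; the reconciliation pull is what closes the gap.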
### Reconciliation

Because telemetry is best-effort, the central side periodically (and on reconnect)
pulls "all tracking rows changed since cursor X" per site. Missed telemetry
self-heals. The site never depends on central; central converges to the site.

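The cursor-based pull can be illustrated with a small sketch. The design does not say what the cursor is; here it is assumed to be a per-site, monotonically increasing change sequence number (`seq`), and the row shape is invented for the example.

```python
def changes_since(rows, cursor):
    """Site side: rows whose change sequence is past the cursor, oldest first."""
    return sorted((r for r in rows if r["seq"] > cursor), key=lambda r: r["seq"])


def reconcile(site_rows, mirror, cursor):
    """Central side: apply missed changes and return the advanced cursor."""
    for row in changes_since(site_rows, cursor):
        mirror[row["id"]] = row["status"]
        cursor = row["seq"]
    return cursor


site_rows = [
    {"id": "a", "status": "Delivered", "seq": 1},
    {"id": "b", "status": "Parked", "seq": 3},
    {"id": "c", "status": "Retrying", "seq": 2},
]
mirror = {"a": "Delivered"}  # telemetry for b and c was lost in transit
cursor = reconcile(site_rows, mirror, cursor=1)
```

Running the pull again with the returned cursor is a no-op, which is what makes a periodic schedule safe.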
### Carried-over rules (unchanged)

- Tracking rows, like buffered messages, are not cleared on instance deletion.
- Cached-call idempotency remains the caller's responsibility; a retry can still
  double-deliver.

## Central — Site Call Audit component (new component #22)

### `SiteCalls` table (central MS SQL)

Sibling of the `Notifications` table. One row per `TrackedOperationId`: source site,
kind, target summary, status, retry count, last error, created/updated/terminal
timestamps. Fed only by site telemetry and reconciliation pulls.

Ingestion is **insert-if-not-exists**, then **upsert-on-newer-status**. The lifecycle
is monotonic, so status only advances and never regresses, which makes at-least-once
and out-of-order telemetry harmless. Daily purge of terminal rows after a configurable
window (default 365 days, mirroring `Notifications`).

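The ingestion rule can be sketched as an ordering check on each incoming event. The `RANK` values and the use of retry count as a tie-breaker are assumptions; the design only states that status advances monotonically.

```python
# Assumed ordering of statuses; terminal outcomes share the highest rank.
RANK = {"Pending": 0, "Retrying": 1, "Parked": 2,
        "Delivered": 3, "Failed": 3, "Discarded": 3}


def order_key(row):
    """Newer means a higher status rank, or the same rank with more retries."""
    return (RANK[row["status"]], row["retry_count"])


def ingest(site_calls, event):
    """Apply one telemetry event; return True if the audit row changed."""
    op_id = event["id"]
    current = site_calls.get(op_id)
    if current is None:
        site_calls[op_id] = dict(event)  # insert-if-not-exists
        return True
    if order_key(event) > order_key(current):
        site_calls[op_id] = dict(event)  # upsert-on-newer-status
        return True
    return False  # duplicate or stale event: ignored


site_calls = {}
events = [
    {"id": "op-1", "status": "Retrying", "retry_count": 1},
    {"id": "op-1", "status": "Pending", "retry_count": 0},    # arrives late
    {"id": "op-1", "status": "Retrying", "retry_count": 2},
    {"id": "op-1", "status": "Delivered", "retry_count": 2},
    {"id": "op-1", "status": "Delivered", "retry_count": 2},  # duplicate
]
results = [ingest(site_calls, e) for e in events]
```

This ordering would reject a manual Retry that moves a row from `Parked` back to `Retrying`. How that case fits the monotonic rule is not specified in this document, so the sketch does not model it.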
### `SiteCallAuditActor`

Singleton on the active central node. Ingests telemetry, runs the periodic
reconciliation pulls, computes KPIs, and relays Retry/Discard commands to sites.

It is **not a dispatcher**, which is the crucial difference from
`NotificationOutboxActor`. Central has no path to a site's external systems or
databases; this component is an audit sink, a query surface, and a command relay only.

### KPIs

Point-in-time from the `SiteCalls` table, global and per-site, mirroring the
Notification Outbox KPI shape: buffered count (`Pending`+`Retrying`), parked count,
failed-last-interval, delivered-last-interval, oldest-pending age, and stuck count
(`Pending`/`Retrying` older than a configurable threshold, default 10 minutes;
display-only, no alerting).

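A rough sketch of how these KPIs could be derived from audit rows. Several details are assumptions not fixed by the design: the one-hour interval, measuring age and "stuck" from the created timestamp, treating "oldest pending" as covering both `Pending` and `Retrying`, and the row shape itself.

```python
from datetime import datetime, timedelta, timezone


def compute_kpis(rows, now, interval=timedelta(hours=1),
                 stuck_after=timedelta(minutes=10)):
    """Point-in-time KPIs over a list of audit rows (dicts)."""
    active = [r for r in rows if r["status"] in ("Pending", "Retrying")]
    since = now - interval

    def recent(status):
        return sum(1 for r in rows
                   if r["status"] == status and r["updated"] >= since)

    return {
        "buffered": len(active),
        "parked": sum(1 for r in rows if r["status"] == "Parked"),
        "failed_last_interval": recent("Failed"),
        "delivered_last_interval": recent("Delivered"),
        "oldest_pending_age": max((now - r["created"] for r in active), default=None),
        "stuck": sum(1 for r in active if now - r["created"] > stuck_after),
    }


now = datetime(2026, 5, 19, 12, 0, tzinfo=timezone.utc)


def row(status, created_min_ago, updated_min_ago):
    return {"status": status,
            "created": now - timedelta(minutes=created_min_ago),
            "updated": now - timedelta(minutes=updated_min_ago)}


rows = [row("Pending", 2, 2), row("Retrying", 30, 1), row("Parked", 180, 120),
        row("Delivered", 20, 15), row("Failed", 300, 300)]
kpis = compute_kpis(rows, now)
```

Filtering `rows` by site before calling `compute_kpis` would give the per-site variant of the same figures.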
## Central→site command path (Retry / Discard)

Parked operations live in the site's S&F buffer, so Retry/Discard from the Central UI
must travel down to the owning site:

- New ClusterClient command/control messages, central→site:
  `RetryParkedOperation(TrackedOperationId)` and
  `DiscardParkedOperation(TrackedOperationId)`, riding the existing per-site
  ClusterClient.
- The site applies the command to its S&F buffer / tracking table, then emits normal
  telemetry reflecting the new state (`Retrying`, or `Discarded`).
- Central never directly mutates the `SiteCalls` row. It sends the command and lets
  the resulting telemetry update the audit row, so the site stays the single source
  of truth.
- If the site is offline, the command fails fast and the UI surfaces a
  "site unreachable" message.

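The command relay can be sketched as two small functions, one per side. The message names come from the list above; everything else (`relay`, `site_apply`, `SiteUnreachable`, modelling the site link as a callable, and ignoring commands for rows that are not `Parked`) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryParkedOperation:
    tracked_operation_id: str


@dataclass(frozen=True)
class DiscardParkedOperation:
    tracked_operation_id: str


class SiteUnreachable(Exception):
    """Raised immediately when the owning site has no live link."""


def site_apply(tracking, command):
    """Site side: update local state and return the status to report via telemetry."""
    op_id = command.tracked_operation_id
    if tracking.get(op_id) != "Parked":
        return tracking.get(op_id)  # assumed: only Parked rows can be acted on
    is_retry = isinstance(command, RetryParkedOperation)
    tracking[op_id] = "Retrying" if is_retry else "Discarded"
    return tracking[op_id]


def relay(command, site_link):
    """Central side: forward only; the audit row is never mutated here."""
    if site_link is None:
        raise SiteUnreachable("site unreachable")  # fail fast for the UI
    return site_link(command)


tracking = {"op-1": "Parked", "op-2": "Parked", "op-3": "Delivered"}


def link(command):
    return site_apply(tracking, command)


retried = relay(RetryParkedOperation("op-1"), link)
discarded = relay(DiscardParkedOperation("op-2"), link)
```

Note that `relay` has no access to any central table, mirroring the rule that only telemetry from the site updates the `SiteCalls` row.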
## Central UI

New page, **Site Calls**, in the same nav group as the Notification Outbox page:

- Covers cached calls only: `ExternalCall` + `DatabaseWrite`. Notifications keep their
  existing dedicated Notification Outbox page.
- Queryable list filtered by site, kind, status, and time range. Columns: timestamp,
  site, kind, target summary, status badge, retry count, last error.
- Retry / Discard actions on `Parked` rows, issuing the central→site commands above.
- Headline KPI tiles on the Health dashboard alongside the existing Notification
  Outbox tiles. Stuck rows get a display-only badge, with no escalation.
- Custom Blazor Server + Bootstrap components, consistent with the rest of the
  Central UI.

## Error handling & edge cases

- **Telemetry loss**: reconciliation pull self-heals; central is explicitly
  eventually-consistent.
- **Out-of-order / duplicate telemetry**: monotonic-status upsert keyed on
  `TrackedOperationId` makes both harmless.
- **Permanent failure on a cached call**: error returned synchronously to the script
  (unchanged) and recorded as terminal `Failed`.
- **Site offline during Retry/Discard**: command fails fast; UI says so; the audit
  row is unchanged until confirming telemetry arrives.
- **Cached-call double-delivery**: still the caller's responsibility; the idempotency
  note stays in the External System Gateway (ESG) doc.
- **Instance deletion**: tracking rows and buffered messages survive, per the
  existing S&F rule.

## Affected documents

- **New**: `docs/requirements/Component-SiteCallAudit.md`
- `Component-ExternalSystemGateway.md`: `CachedCall`/`CachedWrite` return
  `TrackedOperationId`; `Failed` state; `Tracking.Status`.
- `Component-StoreAndForward.md`: site-local tracking table, telemetry emission,
  reconciliation, `TrackedOperationId` on buffer entries.
- `Component-SiteRuntime.md`: Script Runtime API return types and
  `Tracking.Status(id)`.
- `Component-Communication.md`: telemetry channel and
  `RetryParkedOperation`/`DiscardParkedOperation` commands.
- `Component-Commons.md`: `TrackedOperationId`, unified status enum, telemetry
  message contracts.
- `Component-ConfigurationDatabase.md`: `SiteCalls` table, EF mapping, migration.
- `Component-CentralUI.md`: new Site Calls page.
- `Component-HealthMonitoring.md`: KPI tiles on the dashboard.
- `Component-NotificationService.md` / `Component-NotificationOutbox.md`: note the
  shared `TrackedOperationId` model and `Notify.Status` alias.
- `README.md`: component table updated to 22 components.
- `CLAUDE.md`: component list and Key Design Decisions.

## Out of scope

- A CLI surface for site-local Retry/Discard (can be added later if needed).
- Merging notifications into the Site Calls page or a unified outbox component.
- Routing cached-call delivery through central.