diff --git a/docs/plans/2026-05-19-cached-call-tracking-design.md b/docs/plans/2026-05-19-cached-call-tracking-design.md
new file mode 100644
index 0000000..72175b1
--- /dev/null
+++ b/docs/plans/2026-05-19-cached-call-tracking-design.md
@@ -0,0 +1,217 @@
+# Cached Call Tracking — Design
+
+**Date**: 2026-05-19
+**Status**: Approved
+**Topic**: Trackable IDs for cached external system calls and cached database writes
+
+## Problem
+
+`ExternalSystem.CachedCall()` and `Database.CachedWrite()` are fire-and-forget: a
+script gets no handle back, cannot confirm delivery, and an operator cannot tie a
+parked store-and-forward (S&F) message to a known business operation.
+`Notify.Send()` already returns a trackable `NotificationId`. The goal is to give
+cached external/database calls the same first-class traceability, under a tracking
+model unified across all three store-and-forward producers.
+
+## Decision
+
+Add a trackable ID to cached calls via **Approach B — a sibling central component
+(`Site Call Audit`) plus shared tracking contracts in Commons**. The Notification
+Outbox is left unchanged; unification lives in shared types and a consistent script
+API, not in a merged table or component.
+
+### Why a sibling, not a merged component
+
+Delivery locality is the decisive constraint:
+
+- **Notifications** are *central-delivered*: sites store-and-forward them to the
+  central cluster, which delivers via SMTP. The `NotificationOutboxActor` runs a
+  dispatcher loop. Central becomes the source of truth after handoff.
+- **Cached calls / DB writes** are *site-delivered*: the external system or database
+  often sits on the site's own network and is unreachable from central. The site's
+  S&F Engine must always own delivery, and the **site remains the source of truth**
+  for status. Central audit is an eventually-consistent mirror.
+
+Merging both into one component (Approach A) would put a dispatcher loop that is live
+for some rows and dormant for others into a single component, hiding a real
+architectural difference. Approach B expresses the difference honestly while still
+giving scripts a unified ID model and `Status()` API.
+
+## Unified tracking model
+
+### `TrackedOperationId`
+
+A GUID, defined in Commons, generated **caller-side at the site at call time**. It is
+both the tracking handle returned to the script and the idempotency key for telemetry
+sent to central. `Notify.Send()`'s existing `NotificationId` is the notification-domain
+name for this same type — no behavior change for notifications.
+
+### Script API
+
+| Call | Returns |
+|---|---|
+| `ExternalSystem.CachedCall(system, method, params)` | `TrackedOperationId` |
+| `Database.CachedWrite(name, sql, params)` | `TrackedOperationId` |
+| `Notify.Send(...)` | `TrackedOperationId` (unchanged) |
+| `Tracking.Status(id)` | unified status record (status, retry count, last error, key timestamps) |
+
+`Tracking.Status(id)` is the unified accessor. `Notify.Status(id)` is retained as a
+thin alias for backward compatibility.
+
+### Status lifecycle
+
+`Pending → Retrying → Delivered / Parked / Failed / Discarded`
+
+- **Delivered** — succeeded. A cached call that succeeds on its first immediate
+  attempt goes straight here and never enters the S&F buffer.
+- **Parked** — transient retries exhausted; awaiting manual action.
+- **Failed** — permanent failure (e.g. HTTP 4xx). The error is *also* returned
+  synchronously to the calling script, exactly as today; the record captures it.
+  This is the one state beyond the notification lifecycle.
+- **Discarded** — operator discarded a parked operation.
+
+There is no `Forwarding` state for cached calls — that exists only because
+notifications hand off to central. For cached calls, `Tracking.Status(id)` is always
+answered site-locally and authoritatively.
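The first-attempt behaviour of the unified tracking model can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the project's implementation: `SiteTracker`, `TrackingRow`, `TransientError`, `PermanentError`, and the `send` callback are hypothetical names, the in-memory dict and list stand in for the site's status records and S&F buffer, and how the ID reaches the script when the call fails synchronously is not specified in this design, so the sketch does not model it.

```python
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


class TransientError(Exception):
    """Hypothetical: a failure worth retrying (e.g. a timeout)."""


class PermanentError(Exception):
    """Hypothetical: a failure that will never succeed (e.g. HTTP 4xx)."""


@dataclass
class TrackingRow:
    op_id: uuid.UUID
    target: str
    status: str = "Pending"
    retry_count: int = 0
    last_error: Optional[str] = None


class SiteTracker:
    """In-memory stand-in for the site's status records and S&F buffer."""

    def __init__(self) -> None:
        self.rows: Dict[uuid.UUID, TrackingRow] = {}
        self.buffer: List[uuid.UUID] = []  # operations awaiting retry

    def cached_call(self, target: str, send: Callable[[], None]) -> uuid.UUID:
        op_id = uuid.uuid4()  # generated caller-side, at call time
        row = TrackingRow(op_id, target)
        self.rows[op_id] = row  # a record exists regardless of outcome
        try:
            send()  # first, immediate attempt
        except TransientError as exc:
            row.last_error = str(exc)
            self.buffer.append(op_id)  # stays Pending; retries happen later
        except PermanentError as exc:
            row.status, row.last_error = "Failed", str(exc)
            raise  # the script still sees the error synchronously
        else:
            row.status = "Delivered"  # never enters the buffer
        return op_id

    def status(self, op_id: uuid.UUID) -> TrackingRow:
        return self.rows[op_id]  # answered locally, no call to central
```

A script-side caller would keep the returned ID and later pass it to `status`, which is the role `Tracking.Status(id)` plays in the table above.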
+
+## Site-side architecture
+
+### Site-local operation tracking table
+
+A new SQLite table alongside the existing S&F buffer DB. One row per
+`TrackedOperationId`, created the moment the script issues the cached call,
+regardless of outcome:
+
+- Fields: kind, target summary (system+method, or DB name), status, retry count,
+  last error, created/updated/terminal timestamps, source provenance
+  (instance/script).
+- This table is the **status record**. The S&F buffer remains purely the **retry
+  mechanism**; a buffered message references its `TrackedOperationId`.
+- Immediate success writes a terminal `Delivered` row directly here, with nothing
+  placed in the S&F buffer.
+- `Tracking.Status(id)` reads this table — local, authoritative, available even when
+  central is unreachable.
+- Retention: terminal rows purged after a configurable window (default 7 days; the
+  site holds live operational state, central holds long-term audit).
+
+### Telemetry to central
+
+On every lifecycle transition (`Pending → Retrying → Delivered/Parked/Failed/Discarded`)
+the site emits a telemetry event over the existing site→central channel:
+`TrackedOperationId`, kind, summary, status, retry count, last error, timestamps,
+source site. Best-effort, at-least-once, idempotent on the ID.
+
+### Reconciliation
+
+Because telemetry is best-effort, the central side periodically (and on reconnect)
+pulls "all tracking rows changed since cursor X" per site. Missed telemetry
+self-heals. The site never depends on central; central converges to the site.
+
+### Carried-over rules (unchanged)
+
+- Tracking rows, like buffered messages, are not cleared on instance deletion.
+- Cached-call idempotency remains the caller's responsibility — a retry can still
+  double-deliver.
+
+## Central — Site Call Audit component (new component #22)
+
+### `SiteCalls` table (central MS SQL)
+
+Sibling of the `Notifications` table.
+One row per `TrackedOperationId`: source site,
+kind, target summary, status, retry count, last error, created/updated/terminal
+timestamps. Fed only by site telemetry and reconciliation pulls.
+
+Ingestion is **insert-if-not-exists**, then **upsert-on-newer-status**. The lifecycle
+is monotonic, so status only advances, never regresses — making at-least-once and
+out-of-order telemetry harmless. Daily purge of terminal rows after a configurable
+window (default 365 days, mirroring `Notifications`).
+
+### `SiteCallAuditActor`
+
+Singleton on the active central node. Ingests telemetry, runs the periodic
+reconciliation pulls, computes KPIs, and relays Retry/Discard commands to sites.
+
+It is **not a dispatcher** — the crucial difference from `NotificationOutboxActor`.
+Central has no path to a site's external systems or databases; this component is an
+audit sink, a query surface, and a command relay only.
+
+### KPIs
+
+Point-in-time from the `SiteCalls` table, global and per-site, mirroring the
+Notification Outbox KPI shape: buffered count (`Pending`+`Retrying`), parked count,
+failed-last-interval, delivered-last-interval, oldest-pending age, and stuck count
+(`Pending`/`Retrying` older than a configurable threshold, default 10 minutes —
+display-only, no alerting).
+
+## Central→site command path (Retry / Discard)
+
+Parked operations live in the site's S&F buffer, so Retry/Discard from the Central UI
+must travel down to the owning site:
+
+- New ClusterClient command/control messages, central→site:
+  `RetryParkedOperation(TrackedOperationId)` and
+  `DiscardParkedOperation(TrackedOperationId)`, riding the existing per-site
+  ClusterClient.
+- The site applies the command to its S&F buffer / tracking table, then emits normal
+  telemetry reflecting the new state (`Retrying`, or `Discarded`).
+- Central never directly mutates the `SiteCalls` row.
+  It sends the command and lets
+  the resulting telemetry update the audit row — the site stays the single source of
+  truth.
+- If the site is offline, the command fails fast and the UI surfaces a
+  "site unreachable" message.
+
+## Central UI
+
+New page — **Site Calls** — in the same nav group as the Notification Outbox page:
+
+- Covers cached calls only: `ExternalCall` + `DatabaseWrite`. Notifications keep their
+  existing dedicated Notification Outbox page.
+- Queryable list filtered by site, kind, status, and time range. Columns: timestamp,
+  site, kind, target summary, status badge, retry count, last error.
+- Retry / Discard actions on `Parked` rows, issuing the central→site commands above.
+- Headline KPI tiles on the Health dashboard alongside the existing Notification
+  Outbox tiles. Stuck rows get a display-only badge — no escalation.
+- Custom Blazor Server + Bootstrap components, consistent with the rest of the
+  Central UI.
+
+## Error handling & edge cases
+
+- **Telemetry loss** — reconciliation pull self-heals; central is explicitly
+  eventually-consistent.
+- **Out-of-order / duplicate telemetry** — monotonic-status upsert keyed on
+  `TrackedOperationId` makes both harmless.
+- **Permanent failure on a cached call** — error returned synchronously to the script
+  (unchanged) and recorded as terminal `Failed`.
+- **Site offline during Retry/Discard** — command fails fast; UI says so; the audit
+  row is unchanged until confirming telemetry arrives.
+- **Cached-call double-delivery** — still the caller's responsibility; the idempotency
+  note stays in the ESG doc.
+- **Instance deletion** — tracking rows and buffered messages survive, per the
+  existing S&F rule.
+
+## Affected documents
+
+- **New**: `docs/requirements/Component-SiteCallAudit.md`
+- `Component-ExternalSystemGateway.md` — `CachedCall`/`CachedWrite` return
+  `TrackedOperationId`; `Failed` state; `Tracking.Status`.
+- `Component-StoreAndForward.md` — site-local tracking table, telemetry emission,
+  reconciliation, `TrackedOperationId` on buffer entries.
+- `Component-SiteRuntime.md` — Script Runtime API: return types and
+  `Tracking.Status(id)`.
+- `Component-Communication.md` — telemetry channel and
+  `RetryParkedOperation`/`DiscardParkedOperation` commands.
+- `Component-Commons.md` — `TrackedOperationId`, unified status enum, telemetry
+  message contracts.
+- `Component-ConfigurationDatabase.md` — `SiteCalls` table, EF mapping, migration.
+- `Component-CentralUI.md` — new Site Calls page.
+- `Component-HealthMonitoring.md` — KPI tiles on the dashboard.
+- `Component-NotificationService.md` / `Component-NotificationOutbox.md` — note the
+  shared `TrackedOperationId` model and `Notify.Status` alias.
+- `README.md` — component table updated to 22 components.
+- `CLAUDE.md` — component list and Key Design Decisions.
+
+## Out of scope
+
+- A CLI surface for site-local Retry/Discard (can be added later if needed).
+- Merging notifications into the Site Calls page or a unified outbox component.
+- Routing cached-call delivery through central.
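
As a closing illustration, the `SiteCalls` ingestion rule described earlier (insert-if-not-exists, then upsert-on-newer-status) might look roughly like the sketch below. It is a hedged Python sketch, not the approved implementation: `TelemetryEvent` and `ingest` are hypothetical names, a plain dict stands in for the central MS SQL table, and the document does not say how "newer" is decided, so the comparison on a site-stamped `updated_at` value is purely an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict
from uuid import UUID


@dataclass(frozen=True)
class TelemetryEvent:
    op_id: UUID            # the TrackedOperationId, used as the idempotency key
    site: str
    status: str
    retry_count: int
    updated_at: datetime   # assumed: stamped by the site, the source of truth


def ingest(table: Dict[UUID, TelemetryEvent], event: TelemetryEvent) -> bool:
    """Apply one telemetry event; return True if the audit row changed."""
    current = table.get(event.op_id)
    if current is None:
        table[event.op_id] = event  # insert-if-not-exists
        return True
    if event.updated_at > current.updated_at:
        table[event.op_id] = event  # upsert only when the event is newer
        return True
    return False  # duplicate or out-of-order event: ignored
```

Because the same function handles live telemetry and reconciliation pulls, replaying a batch of rows "changed since cursor X" is harmless under these assumptions: rows already seen are ignored and only newer states are applied.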