Cached Call Tracking Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans to implement this plan task-by-task.
Goal: Give cached external system calls and cached database writes a trackable TrackedOperationId, backed by a site-local tracking table and a new central Site Call Audit component, under a tracking model unified with Notify.Send.
Architecture: Approach B from the design doc — a sibling central component (Site Call Audit), not a merged outbox. The site stays the source of truth for cached-call status; central audit is an eventually-consistent mirror fed by best-effort telemetry plus a reconciliation pull. Delivery of cached calls remains site-local.
Tech Stack: This is a design-documentation change. "Implementation" means editing Markdown design documents under docs/requirements/, plus README.md and CLAUDE.md. No source code is touched. The authoritative design is docs/plans/2026-05-19-cached-call-tracking-design.md — read it before starting.
Working conventions (from CLAUDE.md):
- Edit documents in place; no copies or backups.
- Component docs follow: Purpose, Location, Responsibilities, design sections, Dependencies, Interactions.
- Keep cross-references accurate across all docs.
- Use git diff to review before committing.
Per-task workflow (replaces TDD for this docs project):
- Read the target file in full first.
- Make the edits described.
- Verify: run git diff <file> and confirm the change reads correctly and matches the design doc.
- Cross-reference check: run the grep given in the task; confirm no stale references.
- Commit with the given message.
Task 1: Create the Site Call Audit component document
Files:
- Create: docs/requirements/Component-SiteCallAudit.md
Step 1: Write the new component doc
Create the file following the standard component structure. Content:
# Component: Site Call Audit
## Purpose
Provides central, queryable audit and operational visibility for cached calls
made by site scripts — `ExternalSystem.CachedCall()` and `Database.CachedWrite()`.
Each such call carries a `TrackedOperationId`; sites report lifecycle telemetry
to this component, which maintains a central audit record, computes KPIs, and
relays Retry/Discard actions back to the owning site.
This is the second centrally-hosted observability component for site
store-and-forward activity (the Notification Outbox is the first). Unlike the
Notification Outbox, Site Call Audit is **not a dispatcher** — it never delivers
anything. Cached calls are delivered by the site's Store-and-Forward Engine
against site-local external systems and databases, which central cannot reach.
## Location
Central cluster only. A singleton actor (`SiteCallAuditActor`) on the active
central node. Registered as component #22 in the Host role configuration.
## Responsibilities
- Ingest cached-call lifecycle telemetry from sites into the central `SiteCalls`
table.
- Run periodic per-site reconciliation pulls so missed telemetry self-heals.
- Compute point-in-time KPIs (global and per-site) from the `SiteCalls` table.
- Relay operator Retry/Discard actions for parked cached calls to the owning
site over the command/control channel.
- Purge terminal audit rows after a configurable retention window.
## The `SiteCalls` Table
Lives in the central MS SQL configuration database — a sibling of the
`Notifications` table. One row per `TrackedOperationId`:
- **TrackedOperationId** — GUID, primary key. Generated site-side at call time.
- **SourceSite** — site that issued the call.
- **Kind** — `ExternalCall` or `DatabaseWrite`.
- **TargetSummary** — external system + method name, or database connection name.
- **Status** — `Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`.
- **RetryCount** — attempts so far.
- **LastError** — most recent error detail, if any.
- **Provenance** — source instance / script.
- **CreatedAtUtc**, **UpdatedAtUtc**, **TerminalAtUtc** — key timestamps.
## Status Lifecycle
`Pending → Retrying → Delivered / Parked / Failed / Discarded`
- **Delivered** — succeeded. A cached call that succeeds on its first immediate
attempt is recorded directly as `Delivered`.
- **Parked** — transient retries exhausted; awaiting manual action.
- **Failed** — permanent failure (e.g. HTTP 4xx). The error was also returned
synchronously to the calling script; the record captures it.
- **Discarded** — an operator discarded a parked operation.
The site is the source of truth. The `SiteCalls` row is an eventually-consistent
mirror — never queried by scripts (`Tracking.Status()` is answered site-locally).
## Ingest & Idempotency
Telemetry ingestion is **insert-if-not-exists** keyed on `TrackedOperationId`,
then **upsert-on-newer-status**. The lifecycle is monotonic, so status only
advances and never regresses; at-least-once and out-of-order telemetry are
therefore harmless.
## Reconciliation
Because telemetry is best-effort, `SiteCallAuditActor` periodically — and on site
reconnect — pulls "all tracking rows changed since cursor X" from each site.
Gaps left by lost telemetry self-heal. Central converges to the site; the site
never depends on central.
## Retry / Discard Relay
Parked cached calls live in the owning site's S&F buffer. Operator Retry/Discard
from the Central UI is relayed to that site as a `RetryParkedOperation` /
`DiscardParkedOperation` command over the command/control channel. The site
applies the change and emits telemetry reflecting the new state; central never
mutates the `SiteCalls` row directly. If the site is offline the command fails
fast and the UI surfaces a "site unreachable" message.
## KPIs
Point-in-time, computed from the `SiteCalls` table, global and per-source-site,
mirroring the Notification Outbox KPI shape:
- Buffered count (`Pending` + `Retrying`)
- Parked count
- Failed-last-interval
- Delivered-last-interval
- Oldest-pending age
- Stuck count — `Pending`/`Retrying` older than a configurable threshold
(default 10 minutes); display-only, no escalation.
## Retention
Daily purge of terminal rows (`Delivered`, `Failed`, `Discarded`) after a
configurable window (default 365 days), matching the `Notifications` purge.
## Dependencies
- **Configuration Database**: hosts the `SiteCalls` table and its repository.
- **Central–Site Communication**: receives cached-call telemetry and reconciliation
responses; sends Retry/Discard commands.
- **Store-and-Forward Engine**: the site-side origin of cached-call telemetry and
the executor of relayed Retry/Discard commands.
- **Commons**: `TrackedOperationId`, status enum, telemetry message contracts.
## Interactions
- **Central UI**: the Site Calls page queries this component and issues
Retry/Discard actions.
- **Health Monitoring**: surfaces Site Call Audit KPI tiles on the dashboard.
- **Cluster Infrastructure**: hosts the `SiteCallAuditActor` singleton with
active/standby failover.
Step 2: Verify
Run: git status --short and open the new file. (The file is still untracked at this point, so a plain git diff --stat will not list it.)
Expected: structure matches other Component-*.md files (Purpose → Interactions).
Step 3: Commit
git add docs/requirements/Component-SiteCallAudit.md
git commit -m "docs(requirements): add Site Call Audit component (#22)"
Task 2: Add shared tracking contracts to Commons
Files:
- Modify: docs/requirements/Component-Commons.md, sections REQ-COM-1 (data types) and REQ-COM-5 (message contracts)
Step 1: Edit the doc
In ### REQ-COM-1: Shared Data Type System, add TrackedOperationId as a shared
type: a GUID identifying any tracked store-and-forward operation
(CachedCall, CachedWrite, Notify.Send), generated caller-side at the site
at call time, doubling as the telemetry idempotency key. Note that the existing
NotificationId is the notification-domain name for this same concept.
Add a shared TrackedOperationStatus enum:
Pending, Retrying, Delivered, Parked, Failed, Discarded.
In ### REQ-COM-5: Cross-Component Message Contracts, add the cached-call
telemetry and command contracts (additive-only, per REQ-COM-5a):
- CachedCallTelemetry: TrackedOperationId, source site, Kind, target summary, status, retry count, last error, timestamps, provenance.
- CachedCallReconcileRequest / CachedCallReconcileResponse: cursor-based per-site pull of changed tracking rows.
- RetryParkedOperation / DiscardParkedOperation: central→site commands keyed by TrackedOperationId (generalize naming so they cover cached calls, not only legacy "parked message" wording).
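For orientation only, the shared types and the telemetry contract listed above could be pictured roughly as follows. This is a Python sketch of the shape, not the project's contract definitions: the plan itself touches no source code, and the snake_case field names, the `OperationKind` enum name, and the `new_tracked_operation_id` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class TrackedOperationStatus(Enum):
    """The six statuses named in this plan."""
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    FAILED = "Failed"
    DISCARDED = "Discarded"


class OperationKind(Enum):
    """Hypothetical enum name for the Kind field."""
    EXTERNAL_CALL = "ExternalCall"
    DATABASE_WRITE = "DatabaseWrite"


@dataclass(frozen=True)
class CachedCallTelemetry:
    """One lifecycle event reported site -> central (field names assumed)."""
    tracked_operation_id: UUID  # doubles as the idempotency key
    source_site: str
    kind: OperationKind
    target_summary: str
    status: TrackedOperationStatus
    retry_count: int
    last_error: Optional[str]
    created_at_utc: datetime
    updated_at_utc: datetime
    terminal_at_utc: Optional[datetime]
    provenance: str


def new_tracked_operation_id() -> UUID:
    """Generate the ID caller-side at call time (random GUID)."""
    return uuid4()
```

Making the message immutable (`frozen=True`) is one way to reflect that telemetry is a record of a past transition; whether the real contracts are immutable is not stated in the design.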
Step 2: Verify
Run: git diff docs/requirements/Component-Commons.md
Expected: additive only; no existing type or contract removed/renamed.
Step 3: Commit
git add docs/requirements/Component-Commons.md
git commit -m "docs(requirements): add TrackedOperationId and cached-call contracts to Commons"
Task 3: Update the Store-and-Forward Engine doc
Files:
- Modify: docs/requirements/Component-StoreAndForward.md, sections Responsibilities, Message Lifecycle, Persistence, Parked Message Management, Message Format
Step 1: Edit the doc
- Responsibilities / Persistence: introduce the site-local operation tracking table, a SQLite table alongside the S&F buffer DB, holding one row per TrackedOperationId for cached calls regardless of outcome. It is the status record; the S&F buffer remains only the retry mechanism. State that Tracking.Status(id) reads this table, that it is the source of truth, and that terminal rows are purged after a configurable window (default 7 days).
- Message Lifecycle: a cached call that succeeds on its first immediate attempt is written directly as a terminal Delivered tracking row and never enters the S&F buffer. A buffered cached-call message references its TrackedOperationId.
- Add a telemetry emission note: on every lifecycle transition the site emits CachedCallTelemetry to central (best-effort, at-least-once, idempotent on the ID) and responds to CachedCallReconcileRequest pulls.
- Parked Message Management: note that Retry/Discard of parked cached calls can be driven by central via RetryParkedOperation / DiscardParkedOperation, after which the site emits telemetry reflecting the new state.
- Message Format: add TrackedOperationId to the listed per-message fields.
Leave the notification category behavior unchanged.
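To make the tracking-table behavior described in this step concrete, here is a minimal Python sketch using the standard-library `sqlite3` module. The table name, column names, and the `attempt` callback are assumptions for illustration; the S&F buffer, the permanent-failure (`Failed`) path, and telemetry emission are deliberately left out.

```python
import sqlite3
import uuid
from datetime import datetime, timezone


def open_tracking_db(path=":memory:"):
    """Open (or create) a hypothetical site-local tracking table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS operation_tracking (
               tracked_operation_id TEXT PRIMARY KEY,
               kind                 TEXT NOT NULL,
               status               TEXT NOT NULL,
               retry_count          INTEGER NOT NULL DEFAULT 0,
               last_error           TEXT,
               created_at_utc       TEXT NOT NULL,
               updated_at_utc       TEXT NOT NULL,
               terminal_at_utc      TEXT
           )"""
    )
    return conn


def record_cached_call(conn, kind, attempt):
    """Try once, then write exactly one tracking row whatever the outcome.

    `attempt` is a stand-in callable: it returns None on success or an
    error string on a transient failure.
    """
    op_id = str(uuid.uuid4())
    now = datetime.now(timezone.utc).isoformat()
    error = attempt()
    if error is None:
        # First-attempt success: terminal row, never enters the buffer.
        status, terminal_at = "Delivered", now
    else:
        # Transient failure: a real engine would also enqueue for retry.
        status, terminal_at = "Pending", None
    conn.execute(
        "INSERT INTO operation_tracking VALUES (?, ?, ?, 0, ?, ?, ?, ?)",
        (op_id, kind, status, error, now, now, terminal_at),
    )
    conn.commit()
    return op_id


def tracking_status(conn, op_id):
    """Answer a status lookup from the local table only."""
    row = conn.execute(
        "SELECT status, retry_count, last_error, updated_at_utc "
        "FROM operation_tracking WHERE tracked_operation_id = ?",
        (op_id,),
    ).fetchone()
    if row is None:
        return None
    return dict(zip(("status", "retry_count", "last_error", "updated_at_utc"), row))
```

The point of the sketch is the separation of concerns: the tracking row exists for every call, while buffering is only needed for calls that did not succeed immediately.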
Step 2: Verify
Run: git diff docs/requirements/Component-StoreAndForward.md
Expected: cached-call and DB-write categories gain tracking; notification flow untouched.
Step 3: Commit
git add docs/requirements/Component-StoreAndForward.md
git commit -m "docs(requirements): add site-local tracking table and telemetry to Store-and-Forward"
Task 4: Update the External System Gateway doc
Files:
- Modify: docs/requirements/Component-ExternalSystemGateway.md, sections Cached Write, External System Call Modes, Call Timeout & Error Handling
Step 1: Edit the doc
- In ### Cached (Store-and-Forward) and ### Cached Write (Store-and-Forward): state that CachedCall / CachedWrite now return a TrackedOperationId. They are no longer "fire-and-forget" with no handle; replace that wording with "deferred-delivery, returns a tracking handle". Immediate success → terminal Delivered record; transient failure → buffered, Pending/Retrying.
- Permanent failure: the error is still returned synchronously to the script (unchanged) and recorded as a terminal Failed tracking record.
- Keep the idempotency note: duplicate delivery on retry is still the caller's responsibility.
- Add a one-line pointer that status is observable via Tracking.Status(id) and centrally via the Site Call Audit component.
Step 2: Verify
Run: grep -n "fire-and-forget\|TrackedOperationId" docs/requirements/Component-ExternalSystemGateway.md
Expected: "fire-and-forget" no longer describes cached calls; TrackedOperationId present.
Step 3: Commit
git add docs/requirements/Component-ExternalSystemGateway.md
git commit -m "docs(requirements): cached calls return TrackedOperationId in ESG"
Task 5: Update the Site Runtime Script Runtime API
Files:
- Modify: docs/requirements/Component-SiteRuntime.md, subsections ### External Systems, ### Notifications, ### Database Access under ## Script Runtime API
Step 1: Edit the doc
- In ### External Systems: ExternalSystem.CachedCall(...) now returns a TrackedOperationId; drop "fire-and-forget", say it returns a tracking handle.
- In ### Database Access: Database.CachedWrite(...) now returns a TrackedOperationId.
- Add the unified accessor Tracking.Status("trackedOperationId"), which returns a status record (status, retry count, last error, key timestamps) for any tracked operation, answered site-locally and authoritatively for cached calls.
- In ### Notifications: note that Notify.Status(...) is retained as a thin alias of Tracking.Status(...); Notify.Send returns a TrackedOperationId (the value historically called NotificationId).
Step 2: Verify
Run: git diff docs/requirements/Component-SiteRuntime.md
Expected: all three cached/async producers return TrackedOperationId; Tracking.Status documented.
Step 3: Commit
git add docs/requirements/Component-SiteRuntime.md
git commit -m "docs(requirements): add Tracking.Status and cached-call handles to Script Runtime API"
Task 6: Update the Central–Site Communication doc
Files:
- Modify: docs/requirements/Component-Communication.md, section ### 8. Remote Queries, and add a new pattern for cached-call telemetry
Step 1: Edit the doc
- Add a new communication pattern (e.g. ### 10. Cached Call Telemetry (Site → Central)): the site S&F Engine pushes CachedCallTelemetry on every lifecycle transition; best-effort, at-least-once, idempotent on TrackedOperationId; transport is ClusterClient command/control. Also describe the reconciliation pull (CachedCallReconcileRequest/Response) initiated by SiteCallAuditActor.
- In ### 8. Remote Queries (Central → Site): generalize the "Retry or discard parked messages" command line to also cover cached calls keyed by TrackedOperationId (RetryParkedOperation / DiscardParkedOperation).
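The reconciliation pull mentioned above ("rows changed since cursor X") can be illustrated with a small in-memory Python sketch. The design does not say what the cursor is; this sketch assumes a per-site, monotonically increasing change sequence number (`change_seq`), and the function names and batch limit are likewise hypothetical.

```python
from typing import Dict, List, Tuple


def changed_since(site_rows: List[dict], cursor: int, limit: int = 100) -> Tuple[List[dict], int]:
    """Site side: return up to `limit` rows changed after `cursor`, oldest first.

    Also returns the cursor to use for the next request.
    """
    changed = sorted(
        (r for r in site_rows if r["change_seq"] > cursor),
        key=lambda r: r["change_seq"],
    )[:limit]
    next_cursor = changed[-1]["change_seq"] if changed else cursor
    return changed, next_cursor


def reconcile(site_rows: List[dict], central: Dict[str, dict], cursor: int) -> int:
    """Central side: pull batches until the site reports nothing newer.

    The central mirror is simply overwritten here because the site is the
    source of truth; a fuller version would route rows through the same
    idempotent ingest rule used for pushed telemetry.
    """
    while True:
        batch, next_cursor = changed_since(site_rows, cursor)
        if not batch:
            return cursor
        for row in batch:
            central[row["id"]] = {
                "status": row["status"],
                "retry_count": row["retry_count"],
            }
        cursor = next_cursor
```

Because the cursor only moves forward after a batch has been applied, a pull that is interrupted can be repeated from the last stored cursor without losing rows; repeated rows are harmless since applying them is idempotent.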
Step 2: Verify
Run: grep -n "Telemetry\|RetryParkedOperation" docs/requirements/Component-Communication.md
Expected: new telemetry pattern and generalized command present.
Step 3: Commit
git add docs/requirements/Component-Communication.md
git commit -m "docs(requirements): add cached-call telemetry pattern to Communication"
Task 7: Update the Configuration Database doc
Files:
- Modify: docs/requirements/Component-ConfigurationDatabase.md, sections ## Database Schema (add a ### Site Calls subsection) and ## Scheduled Maintenance
Step 1: Edit the doc
- Under ## Database Schema, add a ### Site Calls subsection describing the SiteCalls table (columns per Task 1's "The SiteCalls Table" list), noting it is populated only by Site Call Audit telemetry/reconciliation, and that ingestion is insert-if-not-exists + upsert-on-newer-status.
- Under ## Scheduled Maintenance, add a ### SiteCalls Table Purge subsection mirroring the ### Notifications Table Purge wording: daily purge of terminal rows after a configurable window (default 365 days).
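The ingestion rule named above (insert-if-not-exists, then upsert-on-newer-status) can be sketched as follows. This uses Python's `sqlite3` purely as a stand-in for the central MS SQL database, with a trimmed column set. How "newer" is decided is not spelled out in this plan; the sketch assumes a status rank plus retry count, which is one possible reading of the monotonic lifecycle.

```python
import sqlite3

# Assumed ordering: later lifecycle stages outrank earlier ones.
STATUS_RANK = {
    "Pending": 0, "Retrying": 1, "Parked": 2,
    "Delivered": 3, "Failed": 3, "Discarded": 3,
}


def open_audit_db():
    """In-memory stand-in for the central SiteCalls table (subset of columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE site_calls (
               tracked_operation_id TEXT PRIMARY KEY,
               source_site          TEXT NOT NULL,
               status               TEXT NOT NULL,
               retry_count          INTEGER NOT NULL,
               last_error           TEXT
           )"""
    )
    return conn


def ingest(conn, msg):
    """Apply one telemetry message idempotently."""
    # Step 1: insert-if-not-exists, keyed on the tracked operation ID.
    conn.execute(
        "INSERT OR IGNORE INTO site_calls VALUES (?, ?, ?, ?, ?)",
        (msg["id"], msg["site"], msg["status"], msg["retry_count"], msg["last_error"]),
    )
    # Step 2: update only if the message is ahead of the stored row.
    status, retries = conn.execute(
        "SELECT status, retry_count FROM site_calls WHERE tracked_operation_id = ?",
        (msg["id"],),
    ).fetchone()
    incoming = (STATUS_RANK[msg["status"]], msg["retry_count"])
    stored = (STATUS_RANK[status], retries)
    if incoming > stored:
        conn.execute(
            "UPDATE site_calls SET status = ?, retry_count = ?, last_error = ? "
            "WHERE tracked_operation_id = ?",
            (msg["status"], msg["retry_count"], msg["last_error"], msg["id"]),
        )
    conn.commit()
```

With this rule a duplicate message changes nothing, and a late `Pending` arriving after `Delivered` is ignored, which is what makes at-least-once, out-of-order delivery tolerable.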
Step 2: Verify
Run: grep -n "SiteCalls" docs/requirements/Component-ConfigurationDatabase.md
Expected: schema subsection and purge subsection both present.
Step 3: Commit
git add docs/requirements/Component-ConfigurationDatabase.md
git commit -m "docs(requirements): add SiteCalls table and purge to Configuration Database"
Task 8: Update the Central UI doc
Files:
- Modify: docs/requirements/Component-CentralUI.md, section ## Workflows / Pages
Step 1: Edit the doc
Add a ### Site Calls (Deployment Role) page after the
### Notification Outbox (Deployment Role) section:
- Queryable list of cached calls (ExternalCall + DatabaseWrite only; notifications keep their own Notification Outbox page).
- Filters: site, kind, status, time range.
- Columns: timestamp, site, kind, target summary, status badge, retry count, last error.
- Retry / Discard actions on Parked rows; "site unreachable" handling when the owning site is offline.
- Custom Blazor Server + Bootstrap components, no third-party frameworks.
Step 2: Verify
Run: grep -n "Site Calls" docs/requirements/Component-CentralUI.md
Expected: new page section present, scoped to cached calls.
Step 3: Commit
git add docs/requirements/Component-CentralUI.md
git commit -m "docs(requirements): add Site Calls page to Central UI"
Task 9: Update the Health Monitoring doc
Files:
- Modify: docs/requirements/Component-HealthMonitoring.md (add a ## Site Call Audit KPIs section after ## Notification Outbox KPIs)
Step 1: Edit the doc
Add a ## Site Call Audit KPIs section mirroring ## Notification Outbox KPIs:
the dashboard surfaces Site Call Audit headline KPI tiles (buffered, parked,
failed-last-interval, delivered-last-interval, oldest-pending age, stuck count),
computed point-in-time by the Site Call Audit component, global and per-site.
Stuck is display-only.
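As a rough illustration of how these point-in-time KPIs could be derived from audit rows, here is a Python sketch over plain dictionaries. Several details are assumptions, not taken from the design: the length of the "last interval" (one hour here), measuring age and stuck-ness from the creation timestamp, and treating "oldest-pending" as the oldest `Pending` or `Retrying` row. Per-site figures would come from filtering the rows by source site first.

```python
from datetime import datetime, timedelta, timezone


def compute_kpis(rows, now, interval=timedelta(hours=1), stuck_after=timedelta(minutes=10)):
    """Compute headline KPI values from a list of audit rows."""
    buffered = [r for r in rows if r["status"] in ("Pending", "Retrying")]

    def reached_in_interval(row, status):
        # Terminal rows count if they became terminal within the interval.
        return (
            row["status"] == status
            and row["terminal_at_utc"] is not None
            and now - row["terminal_at_utc"] <= interval
        )

    oldest = min((r["created_at_utc"] for r in buffered), default=None)
    return {
        "buffered": len(buffered),
        "parked": sum(1 for r in rows if r["status"] == "Parked"),
        "failed_last_interval": sum(1 for r in rows if reached_in_interval(r, "Failed")),
        "delivered_last_interval": sum(1 for r in rows if reached_in_interval(r, "Delivered")),
        "oldest_pending_age": (now - oldest) if oldest is not None else None,
        # Display-only: buffered rows older than the threshold.
        "stuck": sum(1 for r in buffered if now - r["created_at_utc"] > stuck_after),
    }
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the calculation deterministic and easy to check against known rows.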
Step 2: Verify
Run: grep -n "Site Call Audit KPIs" docs/requirements/Component-HealthMonitoring.md
Expected: section present.
Step 3: Commit
git add docs/requirements/Component-HealthMonitoring.md
git commit -m "docs(requirements): add Site Call Audit KPIs to Health Monitoring"
Task 10: Note the shared model in Notification docs
Files:
- Modify: docs/requirements/Component-NotificationService.md, section ## Script API
- Modify: docs/requirements/Component-NotificationOutbox.md, section ## Purpose or ### Status Lifecycle
Step 1: Edit the doc
- Component-NotificationService.md, ## Script API: note that Notify.Send's NotificationId is a TrackedOperationId (shared Commons type) and Notify.Status is an alias of the unified Tracking.Status.
- Component-NotificationOutbox.md: add a sentence that the Notification Outbox and the Site Call Audit component share the TrackedOperationId tracking model and status lifecycle, but differ in delivery locality: the Notification Outbox delivers; Site Call Audit only audits.
Do not change any notification behavior.
Step 2: Verify
Run: git diff docs/requirements/Component-NotificationService.md docs/requirements/Component-NotificationOutbox.md
Expected: additive notes only, no behavior change.
Step 3: Commit
git add docs/requirements/Component-NotificationService.md docs/requirements/Component-NotificationOutbox.md
git commit -m "docs(requirements): note shared TrackedOperationId model in notification docs"
Task 11: Update the README component table
Files:
- Modify: README.md, component table and any architecture diagram component count
Step 1: Edit the doc
Add row 22 — Site Call Audit — to the component table:
"Central component auditing site cached calls (CachedCall/CachedWrite);
SiteCalls table, telemetry ingest, reconciliation, KPIs, central→site
Retry/Discard relay." Update any "21 components" count to 22.
Step 2: Verify
Run: grep -rn "21 component\|22 component" README.md
Expected: count reads 22; no stale "21".
Step 3: Commit
git add README.md
git commit -m "docs: add Site Call Audit to README component table"
Task 12: Update CLAUDE.md
Files:
- Modify: CLAUDE.md, sections ## Current Component List and ## Key Design Decisions
Step 1: Edit the doc
- Change the heading ## Current Component List (21 components) to (22 components) and add item 22, Site Call Audit, with a one-line description.
- Under ## Key Design Decisions, in ### Store-and-Forward (or ### UI & Monitoring), add bullets summarizing: cached calls return a TrackedOperationId; site-local tracking table is the status source of truth; new central Site Call Audit component mirrors status via best-effort telemetry + reconciliation; cached-call delivery stays site-local; unified Tracking.Status accessor; Failed terminal state for permanent failures.
Step 2: Verify
Run: grep -n "22 components\|Site Call Audit" CLAUDE.md
Expected: count is 22; component listed; design decisions present.
Step 3: Commit
git add CLAUDE.md
git commit -m "docs: record cached-call tracking in CLAUDE.md"
Task 13: Final cross-reference consistency pass
Files:
- Potentially any docs/requirements/Component-*.md, README.md, CLAUDE.md
Step 1: Sweep for stale or missing references
Run each and review:
grep -rn "fire-and-forget" docs/requirements/
grep -rn "21 component" README.md CLAUDE.md
grep -rln "Site Call Audit" docs/requirements/ README.md CLAUDE.md
grep -rn "TrackedOperationId" docs/requirements/
Expected: no "fire-and-forget" describing cached calls; no "21 component" left;
Site Call Audit referenced by its dependents (Communication, Configuration
Database, Central UI, Health Monitoring, Commons); TrackedOperationId used
consistently.
Step 2: Confirm new component's Dependencies/Interactions are reciprocated
Verify each component named in Component-SiteCallAudit.md Dependencies/Interactions
also references Site Call Audit where appropriate.
Step 3: Fix any gaps found, then commit
git add -A
git commit -m "docs(requirements): reconcile cross-references for Site Call Audit"
If no gaps are found, skip the commit and note the plan is complete.
Done
All cached-call tracking design changes are recorded. The design rationale lives
in docs/plans/2026-05-19-cached-call-tracking-design.md.