docs(plans): add cached-call tracking implementation plan

Joseph Doherty
2026-05-19 11:30:21 -04:00
parent e7ed858920
commit a08ad09514
2 changed files with 585 additions and 0 deletions

@@ -0,0 +1,566 @@
# Cached Call Tracking Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans to implement this plan task-by-task.
**Goal:** Give cached external system calls and cached database writes a trackable `TrackedOperationId`, backed by a site-local tracking table and a new central `Site Call Audit` component, under a tracking model unified with `Notify.Send`.
**Architecture:** Approach B from the design doc — a sibling central component (`Site Call Audit`), not a merged outbox. The site stays the source of truth for cached-call status; central audit is an eventually-consistent mirror fed by best-effort telemetry plus a reconciliation pull. Delivery of cached calls remains site-local.
**Tech Stack:** This is a design-documentation change. "Implementation" means editing Markdown design documents under `docs/requirements/`, plus `README.md` and `CLAUDE.md`. No source code is touched. The authoritative design is `docs/plans/2026-05-19-cached-call-tracking-design.md` — read it before starting.
**Working conventions (from `CLAUDE.md`):**
- Edit documents in place; no copies or backups.
- Component docs follow: Purpose, Location, Responsibilities, design sections, Dependencies, Interactions.
- Keep cross-references accurate across all docs.
- Use `git diff` to review before committing.
**Per-task workflow (replaces TDD for this docs project):**
1. Read the target file in full first.
2. Make the edits described.
3. **Verify**: run `git diff <file>` and confirm the change reads correctly and matches the design doc.
4. **Cross-reference check**: run the grep given in the task; confirm no stale references.
5. **Commit** with the given message.
---
### Task 1: Create the Site Call Audit component document
**Files:**
- Create: `docs/requirements/Component-SiteCallAudit.md`
**Step 1: Write the new component doc**
Create the file following the standard component structure. Content:
```markdown
# Component: Site Call Audit
## Purpose
Provides central, queryable audit and operational visibility for cached calls
made by site scripts — `ExternalSystem.CachedCall()` and `Database.CachedWrite()`.
Each such call carries a `TrackedOperationId`; sites report lifecycle telemetry
to this component, which maintains a central audit record, computes KPIs, and
relays Retry/Discard actions back to the owning site.
This is the second centrally-hosted observability component for site
store-and-forward activity (the Notification Outbox is the first). Unlike the
Notification Outbox, Site Call Audit is **not a dispatcher** — it never delivers
anything. Cached calls are delivered by the site's Store-and-Forward Engine
against site-local external systems and databases, which central cannot reach.
## Location
Central cluster only. A singleton actor (`SiteCallAuditActor`) on the active
central node. Registered as component #22 in the Host role configuration.
## Responsibilities
- Ingest cached-call lifecycle telemetry from sites into the central `SiteCalls`
table.
- Run periodic per-site reconciliation pulls so missed telemetry self-heals.
- Compute point-in-time KPIs (global and per-site) from the `SiteCalls` table.
- Relay operator Retry/Discard actions for parked cached calls to the owning
site over the command/control channel.
- Purge terminal audit rows after a configurable retention window.
## The `SiteCalls` Table
Lives in the central MS SQL configuration database — a sibling of the
`Notifications` table. One row per `TrackedOperationId`:
- **TrackedOperationId** — GUID, primary key. Generated site-side at call time.
- **SourceSite** — site that issued the call.
- **Kind** — `ExternalCall` or `DatabaseWrite`.
- **TargetSummary** — external system + method name, or database connection name.
- **Status** — `Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`.
- **RetryCount** — attempts so far.
- **LastError** — most recent error detail, if any.
- **Provenance** — source instance / script.
- **CreatedAtUtc**, **UpdatedAtUtc**, **TerminalAtUtc** — key timestamps.
## Status Lifecycle
`Pending → Retrying → Delivered / Parked / Failed / Discarded`
- **Delivered** — succeeded. A cached call that succeeds on its first immediate
attempt is recorded directly as `Delivered`.
- **Parked** — transient retries exhausted; awaiting manual action.
- **Failed** — permanent failure (e.g. HTTP 4xx). The error was also returned
synchronously to the calling script; the record captures it.
- **Discarded** — an operator discarded a parked operation.
The site is the source of truth. The `SiteCalls` row is an eventually-consistent
mirror — never queried by scripts (`Tracking.Status()` is answered site-locally).
## Ingest & Idempotency
Telemetry ingestion is **insert-if-not-exists** keyed on `TrackedOperationId`,
then **upsert-on-newer-status**. The lifecycle is monotonic, so status only
advances and never regresses; at-least-once and out-of-order telemetry are
therefore harmless.
## Reconciliation
Because telemetry is best-effort, `SiteCallAuditActor` periodically — and on site
reconnect — pulls "all tracking rows changed since cursor X" from each site.
Gaps left by lost telemetry self-heal. Central converges to the site; the site
never depends on central.
## Retry / Discard Relay
Parked cached calls live in the owning site's S&F buffer. Operator Retry/Discard
from the Central UI is relayed to that site as a `RetryParkedOperation` /
`DiscardParkedOperation` command over the command/control channel. The site
applies the change and emits telemetry reflecting the new state; central never
mutates the `SiteCalls` row directly. If the site is offline the command fails
fast and the UI surfaces a "site unreachable" message.
## KPIs
Point-in-time, computed from the `SiteCalls` table, global and per-source-site,
mirroring the Notification Outbox KPI shape:
- Buffered count (`Pending` + `Retrying`)
- Parked count
- Failed-last-interval
- Delivered-last-interval
- Oldest-pending age
- Stuck count — `Pending`/`Retrying` older than a configurable threshold
(default 10 minutes); display-only, no escalation.
## Retention
Daily purge of terminal rows (`Delivered`, `Failed`, `Discarded`) after a
configurable window (default 365 days), matching the `Notifications` purge.
## Dependencies
- **Configuration Database**: hosts the `SiteCalls` table and its repository.
- **Central-Site Communication**: receives cached-call telemetry and reconciliation
responses; sends Retry/Discard commands.
- **Store-and-Forward Engine**: the site-side origin of cached-call telemetry and
the executor of relayed Retry/Discard commands.
- **Commons**: `TrackedOperationId`, status enum, telemetry message contracts.
## Interactions
- **Central UI**: the Site Calls page queries this component and issues
Retry/Discard actions.
- **Health Monitoring**: surfaces Site Call Audit KPI tiles on the dashboard.
- **Cluster Infrastructure**: hosts the `SiteCallAuditActor` singleton with
active/standby failover.
```
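The ingest rule described under "Ingest & Idempotency" can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions, not the component's implementation: `STATUS_RANK`, `SiteCallRow`, and `ingest` are hypothetical names, and the source does not say how "newer" is decided. Here it is assumed to mean a higher lifecycle rank, with retry count as a tie-breaker.

```python
from dataclasses import dataclass

# Assumed ordering for "newer status". The design only states that the
# lifecycle is monotonic; the exact ranking below is an illustration.
STATUS_RANK = {
    "Pending": 0,
    "Retrying": 1,
    "Parked": 2,
    "Delivered": 3,
    "Failed": 3,
    "Discarded": 3,
}


@dataclass
class SiteCallRow:
    tracked_operation_id: str
    status: str
    retry_count: int = 0


def _order_key(row: SiteCallRow) -> tuple[int, int]:
    # Compare by lifecycle rank first, then by retry count (assumption).
    return (STATUS_RANK[row.status], row.retry_count)


def ingest(table: dict[str, SiteCallRow], event: SiteCallRow) -> bool:
    """Apply one telemetry event; return True if the table changed."""
    existing = table.get(event.tracked_operation_id)
    if existing is None:
        # Insert-if-not-exists, keyed on TrackedOperationId.
        table[event.tracked_operation_id] = event
        return True
    if _order_key(event) > _order_key(existing):
        # Upsert-on-newer-status: only forward transitions are applied.
        table[event.tracked_operation_id] = event
        return True
    # Duplicate or stale event: ignored, so redelivery is harmless.
    return False
```

With this shape, replaying the same event or delivering an older `Pending` event after `Retrying` leaves the stored row unchanged, which is the property the at-least-once, out-of-order telemetry relies on.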
**Step 2: Verify**
Run: `git status --short` (the new file is untracked, so `git diff --stat` will not list it until it is staged) and open the new file.
Expected: structure matches other `Component-*.md` files (Purpose → Interactions).
**Step 3: Commit**
```bash
git add docs/requirements/Component-SiteCallAudit.md
git commit -m "docs(requirements): add Site Call Audit component (#22)"
```
---
### Task 2: Add shared tracking contracts to Commons
**Files:**
- Modify: `docs/requirements/Component-Commons.md` — sections `REQ-COM-1` (data types), `REQ-COM-5` (message contracts)
**Step 1: Edit the doc**
In `### REQ-COM-1: Shared Data Type System`, add `TrackedOperationId` as a shared
type: a GUID identifying any tracked store-and-forward operation
(`CachedCall`, `CachedWrite`, `Notify.Send`), generated caller-side at the site
at call time, doubling as the telemetry idempotency key. Note that the existing
`NotificationId` is the notification-domain name for this same concept.
Add a shared `TrackedOperationStatus` enum:
`Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`.
In `### REQ-COM-5: Cross-Component Message Contracts`, add the cached-call
telemetry and command contracts (additive-only, per REQ-COM-5a):
- `CachedCallTelemetry`: `TrackedOperationId`, source site, `Kind`,
target summary, status, retry count, last error, timestamps, provenance.
- `CachedCallReconcileRequest` / `CachedCallReconcileResponse` — cursor-based
per-site pull of changed tracking rows.
- `RetryParkedOperation` / `DiscardParkedOperation` — central→site commands
keyed by `TrackedOperationId` (generalize naming so they cover cached calls,
not only legacy "parked message" wording).
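To make the shape of these contracts concrete, here is a hedged sketch of the shared types as plain data classes. The field names follow the `SiteCalls` column list from Task 1; everything else is an assumption for illustration, including the Python naming, the choice of a random (version 4) UUID for the GUID, and the split of "timestamps" into created/updated/terminal.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class TrackedOperationStatus(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    FAILED = "Failed"
    DISCARDED = "Discarded"


class CachedCallKind(Enum):
    EXTERNAL_CALL = "ExternalCall"
    DATABASE_WRITE = "DatabaseWrite"


@dataclass(frozen=True)
class CachedCallTelemetry:
    """One lifecycle event reported by a site (illustrative shape only)."""
    tracked_operation_id: uuid.UUID
    source_site: str
    kind: CachedCallKind
    target_summary: str
    status: TrackedOperationStatus
    retry_count: int
    last_error: Optional[str]
    provenance: str
    created_at_utc: datetime
    updated_at_utc: datetime
    terminal_at_utc: Optional[datetime] = None


def new_tracked_operation_id() -> uuid.UUID:
    """Generate the ID at the site at call time (UUID version is an assumption)."""
    return uuid.uuid4()
```

Making the message immutable (`frozen=True`) is a design choice of this sketch: a telemetry event describes a past transition, so nothing downstream should need to mutate it.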
**Step 2: Verify**
Run: `git diff docs/requirements/Component-Commons.md`
Expected: additive only; no existing type or contract removed/renamed.
**Step 3: Commit**
```bash
git add docs/requirements/Component-Commons.md
git commit -m "docs(requirements): add TrackedOperationId and cached-call contracts to Commons"
```
---
### Task 3: Update the Store-and-Forward Engine doc
**Files:**
- Modify: `docs/requirements/Component-StoreAndForward.md`, sections `Responsibilities`,
`Message Lifecycle`, `Persistence`, `Parked Message Management`, `Message Format`
**Step 1: Edit the doc**
- **Responsibilities / Persistence**: introduce the **site-local operation
tracking table** — a SQLite table alongside the S&F buffer DB, holding one row
per `TrackedOperationId` for cached calls regardless of outcome. It is the
status record; the S&F buffer remains only the retry mechanism. State that
`Tracking.Status(id)` reads this table, that it is the source of truth, and
that terminal rows are purged after a configurable window (default 7 days).
- **Message Lifecycle**: a cached call that succeeds on its first immediate
attempt is written directly as a terminal `Delivered` tracking row and never
enters the S&F buffer. A buffered cached-call message references its
`TrackedOperationId`.
- Add a **telemetry emission** note: on every lifecycle transition the site emits
`CachedCallTelemetry` to central (best-effort, at-least-once, idempotent on the
ID) and responds to `CachedCallReconcileRequest` pulls.
- **Parked Message Management**: note that Retry/Discard of parked cached calls
can be driven by central via `RetryParkedOperation`/`DiscardParkedOperation`,
after which the site emits telemetry reflecting the new state.
- **Message Format**: add `TrackedOperationId` to the listed per-message fields.
Leave the notification category behavior unchanged.
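The site-local tracking table can be pictured with a small SQLite sketch. It assumes a hypothetical table name (`tracked_operations`), a reduced column set, and helper names (`open_tracking_db`, `record_status`, `tracking_status`) that do not come from the design. It only illustrates "one row per `TrackedOperationId`, read site-locally".

```python
import sqlite3
import uuid
from datetime import datetime, timezone


def open_tracking_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the tracking DB and create the (hypothetical) table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS tracked_operations (
            tracked_operation_id TEXT PRIMARY KEY,
            kind TEXT NOT NULL,
            status TEXT NOT NULL,
            retry_count INTEGER NOT NULL DEFAULT 0,
            last_error TEXT,
            updated_at_utc TEXT NOT NULL
        )
        """
    )
    return conn


def record_status(conn, op_id, kind, status, retry_count=0, last_error=None):
    """Insert or update the single row for this operation."""
    conn.execute(
        """
        INSERT INTO tracked_operations VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT(tracked_operation_id) DO UPDATE SET
            status = excluded.status,
            retry_count = excluded.retry_count,
            last_error = excluded.last_error,
            updated_at_utc = excluded.updated_at_utc
        """,
        (op_id, kind, status, retry_count, last_error,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def tracking_status(conn, op_id):
    """Site-local read, standing in for Tracking.Status(id)."""
    row = conn.execute(
        "SELECT status, retry_count, last_error FROM tracked_operations "
        "WHERE tracked_operation_id = ?",
        (op_id,),
    ).fetchone()
    if row is None:
        return None
    return {"status": row[0], "retry_count": row[1], "last_error": row[2]}
```

A call that succeeds immediately would be written once as `Delivered`; a buffered call would be written as `Pending` and updated in place on each retry.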
**Step 2: Verify**
Run: `git diff docs/requirements/Component-StoreAndForward.md`
Expected: cached-call and DB-write categories gain tracking; notification flow untouched.
**Step 3: Commit**
```bash
git add docs/requirements/Component-StoreAndForward.md
git commit -m "docs(requirements): add site-local tracking table and telemetry to Store-and-Forward"
```
---
### Task 4: Update the External System Gateway doc
**Files:**
- Modify: `docs/requirements/Component-ExternalSystemGateway.md`, sections `Cached Write`,
`External System Call Modes`, `Call Timeout & Error Handling`
**Step 1: Edit the doc**
- `### Cached (Store-and-Forward)` and `### Cached Write (Store-and-Forward)`:
state that `CachedCall`/`CachedWrite` now return a `TrackedOperationId`. They
are no longer "fire-and-forget" with no handle — replace that wording with
"deferred-delivery, returns a tracking handle". Immediate success → terminal
`Delivered` record; transient failure → buffered, `Pending`/`Retrying`.
- Permanent failure: the error is still returned synchronously to the script
(unchanged) **and** recorded as a terminal `Failed` tracking record.
- Keep the idempotency note — duplicate delivery on retry is still the caller's
responsibility.
- Add a one-line pointer that status is observable via `Tracking.Status(id)` and
centrally via the Site Call Audit component.
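The three outcomes listed above map onto tracking records roughly as follows. This is a sketch under assumptions: `TransientError`, `PermanentError`, and `cached_call` are hypothetical names, the tracking store and buffer are plain in-memory containers, and the source does not say how a script learns the ID of a `Failed` record, so the sketch simply re-raises.

```python
import uuid
from typing import Callable


class TransientError(Exception):
    """Hypothetical: a failure worth retrying (e.g. a timeout)."""


class PermanentError(Exception):
    """Hypothetical: a failure that retrying cannot fix (e.g. HTTP 4xx)."""


def cached_call(send: Callable[[], None], tracking: dict, buffer: list) -> str:
    """Attempt delivery once and record the outcome under a new tracking ID."""
    op_id = str(uuid.uuid4())
    try:
        send()
    except TransientError as exc:
        # Buffered for retry; the caller still receives a tracking handle.
        tracking[op_id] = {"status": "Pending", "last_error": str(exc)}
        buffer.append(op_id)
    except PermanentError as exc:
        # Recorded as Failed and also surfaced synchronously to the script.
        tracking[op_id] = {"status": "Failed", "last_error": str(exc)}
        raise
    else:
        # Immediate success is terminal and never enters the buffer.
        tracking[op_id] = {"status": "Delivered", "last_error": None}
    return op_id
```

Note that every path writes a tracking record, which is what replaces the old "fire-and-forget" wording: even a call that never needs a retry leaves an observable trace.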
**Step 2: Verify**
Run: `grep -n "fire-and-forget\|TrackedOperationId" docs/requirements/Component-ExternalSystemGateway.md`
Expected: "fire-and-forget" no longer describes cached calls; `TrackedOperationId` present.
**Step 3: Commit**
```bash
git add docs/requirements/Component-ExternalSystemGateway.md
git commit -m "docs(requirements): cached calls return TrackedOperationId in ESG"
```
---
### Task 5: Update the Site Runtime Script Runtime API
**Files:**
- Modify: `docs/requirements/Component-SiteRuntime.md`, sections `### External Systems`,
`### Notifications`, `### Database Access` under `## Script Runtime API`
**Step 1: Edit the doc**
- `### External Systems`: `ExternalSystem.CachedCall(...)` now returns a
`TrackedOperationId`; drop "fire-and-forget", say it returns a tracking handle.
- `### Database Access`: `Database.CachedWrite(...)` now returns a
`TrackedOperationId`.
- Add the unified accessor `Tracking.Status("trackedOperationId")` — returns a
status record (status, retry count, last error, key timestamps) for any tracked
operation, answered site-locally and authoritatively for cached calls.
- `### Notifications`: note that `Notify.Status(...)` is retained as a thin alias
of `Tracking.Status(...)`; `Notify.Send` returns a `TrackedOperationId`
(the value historically called `NotificationId`).
**Step 2: Verify**
Run: `git diff docs/requirements/Component-SiteRuntime.md`
Expected: all three cached/async producers return `TrackedOperationId`; `Tracking.Status` documented.
**Step 3: Commit**
```bash
git add docs/requirements/Component-SiteRuntime.md
git commit -m "docs(requirements): add Tracking.Status and cached-call handles to Script Runtime API"
```
---
### Task 6: Update the Central-Site Communication doc
**Files:**
- Modify: `docs/requirements/Component-Communication.md`, section `### 8. Remote Queries`,
and add a new pattern for cached-call telemetry
**Step 1: Edit the doc**
- Add a new communication pattern (e.g. `### 10. Cached Call Telemetry (Site → Central)`):
the site S&F Engine pushes `CachedCallTelemetry` on every lifecycle transition;
best-effort, at-least-once, idempotent on `TrackedOperationId`; transport is
ClusterClient command/control. Also describe the reconciliation pull
(`CachedCallReconcileRequest`/`Response`) initiated by `SiteCallAuditActor`.
- `### 8. Remote Queries (Central → Site)`: generalize the "Retry or discard
parked messages" command line to also cover cached calls keyed by
`TrackedOperationId` (`RetryParkedOperation` / `DiscardParkedOperation`).
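The reconciliation pull can be sketched as a cursor-based exchange. The cursor type is not specified in the plan; this example assumes a hypothetical per-site, monotonically increasing change counter (`change_seq`), and the function names are illustrative only. A timestamp cursor would follow the same pattern.

```python
from typing import NamedTuple


class TrackingRow(NamedTuple):
    tracked_operation_id: str
    status: str
    change_seq: int  # hypothetical per-site change counter used as the cursor


def rows_changed_since(site_rows: list[TrackingRow], cursor: int) -> list[TrackingRow]:
    """Site side: answer a reconcile request with rows changed after the cursor."""
    changed = (r for r in site_rows if r.change_seq > cursor)
    return sorted(changed, key=lambda r: r.change_seq)


def reconcile(central: dict[str, TrackingRow], site_rows: list[TrackingRow], cursor: int) -> int:
    """Central side: apply the pulled rows and return the advanced cursor."""
    for row in rows_changed_since(site_rows, cursor):
        # Central converges to the site, which stays the source of truth.
        central[row.tracked_operation_id] = row
        cursor = row.change_seq
    return cursor
```

Running `reconcile` again with the returned cursor pulls nothing new, so repeating the pull on a timer or on site reconnect is safe.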
**Step 2: Verify**
Run: `grep -n "Telemetry\|RetryParkedOperation" docs/requirements/Component-Communication.md`
Expected: new telemetry pattern and generalized command present.
**Step 3: Commit**
```bash
git add docs/requirements/Component-Communication.md
git commit -m "docs(requirements): add cached-call telemetry pattern to Communication"
```
---
### Task 7: Update the Configuration Database doc
**Files:**
- Modify: `docs/requirements/Component-ConfigurationDatabase.md`, section `## Database Schema`
(add a `### Site Calls` subsection), `## Scheduled Maintenance`
**Step 1: Edit the doc**
- Under `## Database Schema`, add a `### Site Calls` subsection describing the
`SiteCalls` table (columns per Task 1's "The `SiteCalls` Table" list), noting
it is populated only by Site Call Audit telemetry/reconciliation, and that
ingestion is insert-if-not-exists + upsert-on-newer-status.
- Under `## Scheduled Maintenance`, add a `### SiteCalls Table Purge` subsection
mirroring the `### Notifications Table Purge` wording: daily purge of terminal
rows after a configurable window (default 365 days).
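The purge rule is simple enough to state as a filter. The sketch below assumes row age is measured from `TerminalAtUtc`, which the plan does not state explicitly, and uses hypothetical names (`TERMINAL_STATUSES`, `purge_terminal_rows`) over plain dictionaries rather than a real table.

```python
from datetime import datetime, timedelta, timezone

TERMINAL_STATUSES = {"Delivered", "Failed", "Discarded"}


def purge_terminal_rows(rows: list[dict], now: datetime, retention_days: int = 365) -> list[dict]:
    """Return the rows that survive the purge."""
    cutoff = now - timedelta(days=retention_days)

    def expired(row: dict) -> bool:
        # Only terminal rows with a terminal timestamp older than the cutoff go.
        terminal_at = row.get("terminal_at_utc")
        return (
            row["status"] in TERMINAL_STATUSES
            and terminal_at is not None
            and terminal_at < cutoff
        )

    return [row for row in rows if not expired(row)]
```

Non-terminal rows such as `Parked` are never removed by this filter, however old they are, since they still await operator action.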
**Step 2: Verify**
Run: `grep -n "SiteCalls" docs/requirements/Component-ConfigurationDatabase.md`
Expected: schema subsection and purge subsection both present.
**Step 3: Commit**
```bash
git add docs/requirements/Component-ConfigurationDatabase.md
git commit -m "docs(requirements): add SiteCalls table and purge to Configuration Database"
```
---
### Task 8: Update the Central UI doc
**Files:**
- Modify: `docs/requirements/Component-CentralUI.md`, section `## Workflows / Pages`
**Step 1: Edit the doc**
Add a `### Site Calls (Deployment Role)` page after the
`### Notification Outbox (Deployment Role)` section:
- Queryable list of cached calls (`ExternalCall` + `DatabaseWrite` only —
notifications keep their own Notification Outbox page).
- Filters: site, kind, status, time range.
- Columns: timestamp, site, kind, target summary, status badge, retry count,
last error.
- Retry / Discard actions on `Parked` rows; "site unreachable" handling when the
owning site is offline.
- Custom Blazor Server + Bootstrap components, no third-party frameworks.
**Step 2: Verify**
Run: `grep -n "Site Calls" docs/requirements/Component-CentralUI.md`
Expected: new page section present, scoped to cached calls.
**Step 3: Commit**
```bash
git add docs/requirements/Component-CentralUI.md
git commit -m "docs(requirements): add Site Calls page to Central UI"
```
---
### Task 9: Update the Health Monitoring doc
**Files:**
- Modify: `docs/requirements/Component-HealthMonitoring.md` — add a
`## Site Call Audit KPIs` section after `## Notification Outbox KPIs`
**Step 1: Edit the doc**
Add a `## Site Call Audit KPIs` section mirroring `## Notification Outbox KPIs`:
the dashboard surfaces Site Call Audit headline KPI tiles (buffered, parked,
failed-last-interval, delivered-last-interval, oldest-pending age, stuck count),
computed point-in-time by the Site Call Audit component, global and per-site.
Stuck is display-only.
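A point-in-time KPI computation over the mirrored rows might look like the sketch below. It covers only the count and age tiles; the two "last-interval" tiles are omitted because the interval length is not given here. Measuring age from `CreatedAtUtc`, and treating "oldest pending" as the oldest buffered row, are assumptions of this example, as is the function name `compute_kpis`.

```python
from datetime import datetime, timedelta, timezone


def compute_kpis(rows: list[dict], now: datetime,
                 stuck_after: timedelta = timedelta(minutes=10)) -> dict:
    """Compute a subset of the KPI tiles from a snapshot of rows."""
    # "Buffered" is Pending plus Retrying, per the KPI definition.
    buffered = [r for r in rows if r["status"] in ("Pending", "Retrying")]
    oldest = min((r["created_at_utc"] for r in buffered), default=None)
    return {
        "buffered": len(buffered),
        "parked": sum(1 for r in rows if r["status"] == "Parked"),
        "oldest_pending_age": None if oldest is None else now - oldest,
        # Stuck: still buffered after the threshold (default 10 minutes).
        "stuck": sum(1 for r in buffered if now - r["created_at_utc"] > stuck_after),
    }
```

Filtering `rows` by source site before calling the function would give the per-site variant of the same tiles.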
**Step 2: Verify**
Run: `grep -n "Site Call Audit KPIs" docs/requirements/Component-HealthMonitoring.md`
Expected: section present.
**Step 3: Commit**
```bash
git add docs/requirements/Component-HealthMonitoring.md
git commit -m "docs(requirements): add Site Call Audit KPIs to Health Monitoring"
```
---
### Task 10: Note the shared model in Notification docs
**Files:**
- Modify: `docs/requirements/Component-NotificationService.md`, section `## Script API`
- Modify: `docs/requirements/Component-NotificationOutbox.md`, section `## Purpose` or
`### Status Lifecycle`
**Step 1: Edit the doc**
- `Component-NotificationService.md` `## Script API`: note that `Notify.Send`'s
`NotificationId` is a `TrackedOperationId` (shared Commons type) and
`Notify.Status` is an alias of the unified `Tracking.Status`.
- `Component-NotificationOutbox.md`: add a sentence that the Notification Outbox
and the Site Call Audit component share the `TrackedOperationId` tracking
model and status lifecycle, but differ in delivery locality — the Notification
Outbox delivers; Site Call Audit only audits.
Do not change any notification behavior.
**Step 2: Verify**
Run: `git diff docs/requirements/Component-NotificationService.md docs/requirements/Component-NotificationOutbox.md`
Expected: additive notes only, no behavior change.
**Step 3: Commit**
```bash
git add docs/requirements/Component-NotificationService.md docs/requirements/Component-NotificationOutbox.md
git commit -m "docs(requirements): note shared TrackedOperationId model in notification docs"
```
---
### Task 11: Update the README component table
**Files:**
- Modify: `README.md` — component table and any architecture diagram component count
**Step 1: Edit the doc**
Add row 22 — **Site Call Audit** — to the component table:
"Central component auditing site cached calls (`CachedCall`/`CachedWrite`);
`SiteCalls` table, telemetry ingest, reconciliation, KPIs, central→site
Retry/Discard relay." Update any "21 components" count to 22.
**Step 2: Verify**
Run: `grep -rn "21 component\|22 component" README.md`
Expected: count reads 22; no stale "21".
**Step 3: Commit**
```bash
git add README.md
git commit -m "docs: add Site Call Audit to README component table"
```
---
### Task 12: Update CLAUDE.md
**Files:**
- Modify: `CLAUDE.md`, sections `## Current Component List`, `## Key Design Decisions`
**Step 1: Edit the doc**
- Change the heading `## Current Component List (21 components)` to `(22 components)`
and add item 22 — **Site Call Audit** — with a one-line description.
- Under `## Key Design Decisions`, in `### Store-and-Forward` (or `### UI & Monitoring`),
add bullets summarizing: cached calls return a `TrackedOperationId`; site-local
tracking table is the status source of truth; new central Site Call Audit
component mirrors status via best-effort telemetry + reconciliation; cached-call
delivery stays site-local; unified `Tracking.Status` accessor; `Failed` terminal
state for permanent failures.
**Step 2: Verify**
Run: `grep -n "22 components\|Site Call Audit" CLAUDE.md`
Expected: count is 22; component listed; design decisions present.
**Step 3: Commit**
```bash
git add CLAUDE.md
git commit -m "docs: record cached-call tracking in CLAUDE.md"
```
---
### Task 13: Final cross-reference consistency pass
**Files:**
- Potentially any `docs/requirements/Component-*.md`, `README.md`, `CLAUDE.md`
**Step 1: Sweep for stale or missing references**
Run each and review:
```bash
grep -rn "fire-and-forget" docs/requirements/
grep -rn "21 component" README.md CLAUDE.md
grep -rln "Site Call Audit" docs/requirements/ README.md CLAUDE.md
grep -rn "TrackedOperationId" docs/requirements/
```
Expected: no "fire-and-forget" describing cached calls; no "21 component" left;
Site Call Audit referenced by its dependents (Communication, Configuration
Database, Central UI, Health Monitoring, Commons); `TrackedOperationId` used
consistently.
**Step 2: Confirm new component's Dependencies/Interactions are reciprocated**
Verify each component named in `Component-SiteCallAudit.md` Dependencies/Interactions
also references Site Call Audit where appropriate.
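The reciprocation check can also be scripted rather than done by eye. The helper below is a hypothetical convenience, not part of the plan: it takes a directory and a list of document file names and reports which of them never mention a given phrase.

```python
from pathlib import Path


def files_missing_reference(doc_dir: Path, filenames: list[str],
                            needle: str = "Site Call Audit") -> list[str]:
    """Return the listed docs that never mention the needle."""
    missing = []
    for name in filenames:
        text = (doc_dir / name).read_text(encoding="utf-8")
        if needle not in text:
            missing.append(name)
    return missing
```

For example, calling it with `Path("docs/requirements")` and the file names modified in Tasks 2 to 9 should return an empty list once every dependent doc references the new component. A plain substring match is deliberately crude; a hit still needs a human read to confirm the reference is in the right section.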
**Step 3: Fix any gaps found, then commit**
```bash
git add -A
git commit -m "docs(requirements): reconcile cross-references for Site Call Audit"
```
If no gaps are found, skip the commit and note the plan is complete.
---
## Done
All cached-call tracking design changes are recorded. The design rationale lives
in `docs/plans/2026-05-19-cached-call-tracking-design.md`.

@@ -0,0 +1,19 @@
{
"planPath": "docs/plans/2026-05-19-cached-call-tracking.md",
"tasks": [
{"id": 6, "subject": "Task 1: Create Site Call Audit component doc", "status": "pending"},
{"id": 7, "subject": "Task 2: Add tracking contracts to Commons", "status": "pending", "blockedBy": [6]},
{"id": 8, "subject": "Task 3: Update Store-and-Forward doc", "status": "pending", "blockedBy": [6, 7]},
{"id": 9, "subject": "Task 4: Update External System Gateway doc", "status": "pending", "blockedBy": [6, 7]},
{"id": 10, "subject": "Task 5: Update Site Runtime Script Runtime API", "status": "pending", "blockedBy": [6, 7]},
{"id": 11, "subject": "Task 6: Update Communication doc", "status": "pending", "blockedBy": [6, 7]},
{"id": 12, "subject": "Task 7: Update Configuration Database doc", "status": "pending", "blockedBy": [6, 7]},
{"id": 13, "subject": "Task 8: Update Central UI doc", "status": "pending", "blockedBy": [6, 7]},
{"id": 14, "subject": "Task 9: Update Health Monitoring doc", "status": "pending", "blockedBy": [6, 7]},
{"id": 15, "subject": "Task 10: Note shared model in notification docs", "status": "pending", "blockedBy": [6, 7]},
{"id": 16, "subject": "Task 11: Update README component table", "status": "pending", "blockedBy": [6]},
{"id": 17, "subject": "Task 12: Update CLAUDE.md", "status": "pending", "blockedBy": [6]},
{"id": 18, "subject": "Task 13: Final cross-reference consistency pass", "status": "pending", "blockedBy": [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]}
],
"lastUpdated": "2026-05-19"
}