diff --git a/docs/plans/2026-05-19-cached-call-tracking.md b/docs/plans/2026-05-19-cached-call-tracking.md
new file mode 100644
index 0000000..fa6a790
--- /dev/null
+++ b/docs/plans/2026-05-19-cached-call-tracking.md
@@ -0,0 +1,566 @@
+# Cached Call Tracking Implementation Plan
+
+> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans to implement this plan task-by-task.
+
+**Goal:** Give cached external system calls and cached database writes a trackable `TrackedOperationId`, backed by a site-local tracking table and a new central `Site Call Audit` component, under a tracking model unified with `Notify.Send`.
+
+**Architecture:** Approach B from the design doc — a sibling central component (`Site Call Audit`), not a merged outbox. The site stays the source of truth for cached-call status; central audit is an eventually-consistent mirror fed by best-effort telemetry plus a reconciliation pull. Delivery of cached calls remains site-local.
+
+**Tech Stack:** This is a design-documentation change. "Implementation" means editing Markdown design documents under `docs/requirements/`, plus `README.md` and `CLAUDE.md`. No source code is touched. The authoritative design is `docs/plans/2026-05-19-cached-call-tracking-design.md` — read it before starting.
+
+**Working conventions (from `CLAUDE.md`):**
+- Edit documents in place; no copies or backups.
+- Component docs follow: Purpose, Location, Responsibilities, design sections, Dependencies, Interactions.
+- Keep cross-references accurate across all docs.
+- Use `git diff` to review before committing.
+
+**Per-task workflow (replaces TDD for this docs project):**
+1. Read the target file in full first.
+2. Make the edits described.
+3. **Verify**: run `git diff <file>` and confirm the change reads correctly and matches the design doc.
+4. **Cross-reference check**: run the grep given in the task; confirm no stale references.
+5. **Commit** with the given message.
+
+---
+
+### Task 1: Create the Site Call Audit component document
+
+**Files:**
+- Create: `docs/requirements/Component-SiteCallAudit.md`
+
+**Step 1: Write the new component doc**
+
+Create the file following the standard component structure. Content:
+
+```markdown
+# Component: Site Call Audit
+
+## Purpose
+
+Provides central, queryable audit and operational visibility for cached calls
+made by site scripts — `ExternalSystem.CachedCall()` and `Database.CachedWrite()`.
+Each such call carries a `TrackedOperationId`; sites report lifecycle telemetry
+to this component, which maintains a central audit record, computes KPIs, and
+relays Retry/Discard actions back to the owning site.
+
+This is the second centrally-hosted observability component for site
+store-and-forward activity (the Notification Outbox is the first). Unlike the
+Notification Outbox, Site Call Audit is **not a dispatcher** — it never delivers
+anything. Cached calls are delivered by the site's Store-and-Forward Engine
+against site-local external systems and databases, which central cannot reach.
+
+## Location
+
+Central cluster only. A singleton actor (`SiteCallAuditActor`) on the active
+central node. Registered as component #22 in the Host role configuration.
+
+## Responsibilities
+
+- Ingest cached-call lifecycle telemetry from sites into the central `SiteCalls`
+  table.
+- Run periodic per-site reconciliation pulls so missed telemetry self-heals.
+- Compute point-in-time KPIs (global and per-site) from the `SiteCalls` table.
+- Relay operator Retry/Discard actions for parked cached calls to the owning
+  site over the command/control channel.
+- Purge terminal audit rows after a configurable retention window.
+
+## The `SiteCalls` Table
+
+Lives in the central MS SQL configuration database — a sibling of the
+`Notifications` table. One row per `TrackedOperationId`:
+
+- **TrackedOperationId** — GUID, primary key. Generated site-side at call time.
+- **SourceSite** — site that issued the call.
+- **Kind** — `ExternalCall` or `DatabaseWrite`.
+- **TargetSummary** — external system + method name, or database connection name.
+- **Status** — `Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`.
+- **RetryCount** — attempts so far.
+- **LastError** — most recent error detail, if any.
+- **Provenance** — source instance / script.
+- **CreatedAtUtc**, **UpdatedAtUtc**, **TerminalAtUtc** — key timestamps.
+
+## Status Lifecycle
+
+`Pending → Retrying → Delivered / Parked / Failed / Discarded`
+
+- **Delivered** — succeeded. A cached call that succeeds on its first immediate
+  attempt is recorded directly as `Delivered`.
+- **Parked** — transient retries exhausted; awaiting manual action.
+- **Failed** — permanent failure (e.g. HTTP 4xx). The error was also returned
+  synchronously to the calling script; the record captures it.
+- **Discarded** — an operator discarded a parked operation.
+
+The site is the source of truth. The `SiteCalls` row is an eventually-consistent
+mirror — never queried by scripts (`Tracking.Status()` is answered site-locally).
+
+## Ingest & Idempotency
+
+Telemetry ingestion is **insert-if-not-exists** keyed on `TrackedOperationId`,
+then **upsert-on-newer-status**. The lifecycle is monotonic, so status only
+advances and never regresses; at-least-once and out-of-order telemetry are
+therefore harmless.
+
+## Reconciliation
+
+Because telemetry is best-effort, `SiteCallAuditActor` periodically — and on site
+reconnect — pulls "all tracking rows changed since cursor X" from each site.
+Gaps left by lost telemetry self-heal. Central converges to the site; the site
+never depends on central.
+
+## Retry / Discard Relay
+
+Parked cached calls live in the owning site's S&F buffer. Operator Retry/Discard
+from the Central UI is relayed to that site as a `RetryParkedOperation` /
+`DiscardParkedOperation` command over the command/control channel. The site
+applies the change and emits telemetry reflecting the new state; central never
+mutates the `SiteCalls` row directly. If the site is offline, the command fails
+fast and the UI surfaces a "site unreachable" message.
+
+## KPIs
+
+Point-in-time, computed from the `SiteCalls` table, global and per-source-site,
+mirroring the Notification Outbox KPI shape:
+
+- Buffered count (`Pending` + `Retrying`)
+- Parked count
+- Failed-last-interval
+- Delivered-last-interval
+- Oldest-pending age
+- Stuck count — `Pending`/`Retrying` older than a configurable threshold
+  (default 10 minutes); display-only, no escalation.
+
+## Retention
+
+Daily purge of terminal rows (`Delivered`, `Failed`, `Discarded`) after a
+configurable window (default 365 days), matching the `Notifications` purge.
+
+## Dependencies
+
+- **Configuration Database**: hosts the `SiteCalls` table and its repository.
+- **Central–Site Communication**: receives cached-call telemetry and reconciliation
+  responses; sends Retry/Discard commands.
+- **Store-and-Forward Engine**: the site-side origin of cached-call telemetry and
+  the executor of relayed Retry/Discard commands.
+- **Commons**: `TrackedOperationId`, status enum, telemetry message contracts.
+
+## Interactions
+
+- **Central UI**: the Site Calls page queries this component and issues
+  Retry/Discard actions.
+- **Health Monitoring**: surfaces Site Call Audit KPI tiles on the dashboard.
+- **Cluster Infrastructure**: hosts the `SiteCallAuditActor` singleton with
+  active/standby failover.
+```
+
+**Step 2: Verify**
+
+Run: `git status --short` (the new file is still untracked, so `git diff --stat` would not list it) and open the new file.
+Expected: structure matches other `Component-*.md` files (Purpose → Interactions).
+
+**Step 3: Commit**
+
+```bash
+git add docs/requirements/Component-SiteCallAudit.md
+git commit -m "docs(requirements): add Site Call Audit component (#22)"
+```
+
+---
+
+### Task 2: Add shared tracking contracts to Commons
+
+**Files:**
+- Modify: `docs/requirements/Component-Commons.md` — sections `REQ-COM-1` (data types), `REQ-COM-5` (message contracts)
+
+**Step 1: Edit the doc**
+
+In `### REQ-COM-1: Shared Data Type System`, add `TrackedOperationId` as a shared
+type: a GUID identifying any tracked store-and-forward operation
+(`CachedCall`, `CachedWrite`, `Notify.Send`), generated caller-side at the site
+at call time, doubling as the telemetry idempotency key. Note that the existing
+`NotificationId` is the notification-domain name for this same concept.
+
+Add a shared `TrackedOperationStatus` enum:
+`Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`.
+
+In `### REQ-COM-5: Cross-Component Message Contracts`, add the cached-call
+telemetry and command contracts (additive-only, per REQ-COM-5a):
+- `CachedCallTelemetry` — `TrackedOperationId`, source site, `Kind`,
+  target summary, status, retry count, last error, timestamps, provenance.
+- `CachedCallReconcileRequest` / `CachedCallReconcileResponse` — cursor-based
+  per-site pull of changed tracking rows.
+- `RetryParkedOperation` / `DiscardParkedOperation` — central→site commands
+  keyed by `TrackedOperationId` (generalize naming so they cover cached calls,
+  not only legacy "parked message" wording).
+
+**Step 2: Verify**
+
+Run: `git diff docs/requirements/Component-Commons.md`
+Expected: additive only; no existing type or contract removed/renamed.
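As a reading aid for the Task 2 contracts, here is a minimal Python sketch of the shared status enum and the `CachedCallTelemetry` message. It is illustrative only: the plan touches no source code, the real contracts would live in Commons in the project's own language, and the snake_case field names, the `Kind` enum class, `TERMINAL_STATUSES`, and the `new_pending` helper are assumptions made for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class TrackedOperationStatus(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    FAILED = "Failed"
    DISCARDED = "Discarded"


class Kind(Enum):
    EXTERNAL_CALL = "ExternalCall"
    DATABASE_WRITE = "DatabaseWrite"


# Statuses the plan purges as terminal. Parked is excluded because it still
# awaits operator action (Retry or Discard).
TERMINAL_STATUSES = frozenset({
    TrackedOperationStatus.DELIVERED,
    TrackedOperationStatus.FAILED,
    TrackedOperationStatus.DISCARDED,
})


@dataclass(frozen=True)
class CachedCallTelemetry:
    """Assumed field layout, mirroring the SiteCalls columns in Task 1."""
    tracked_operation_id: UUID
    source_site: str
    kind: Kind
    target_summary: str
    status: TrackedOperationStatus
    retry_count: int
    last_error: Optional[str]
    created_at_utc: datetime
    updated_at_utc: datetime
    terminal_at_utc: Optional[datetime]
    provenance: str


def new_pending(source_site: str, kind: Kind, target_summary: str,
                provenance: str) -> CachedCallTelemetry:
    """Hypothetical helper: first telemetry message for a freshly issued call."""
    now = datetime.now(timezone.utc)
    return CachedCallTelemetry(
        tracked_operation_id=uuid4(),  # generated site-side at call time
        source_site=source_site,
        kind=kind,
        target_summary=target_summary,
        status=TrackedOperationStatus.PENDING,
        retry_count=0,
        last_error=None,
        created_at_utc=now,
        updated_at_utc=now,
        terminal_at_utc=None,
        provenance=provenance,
    )
```

The frozen dataclass reflects that a telemetry message is a snapshot of one lifecycle transition; a later transition would be a new message carrying the same `tracked_operation_id`.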
+
+**Step 3: Commit**
+
+```bash
+git add docs/requirements/Component-Commons.md
+git commit -m "docs(requirements): add TrackedOperationId and cached-call contracts to Commons"
+```
+
+---
+
+### Task 3: Update the Store-and-Forward Engine doc
+
+**Files:**
+- Modify: `docs/requirements/Component-StoreAndForward.md` — `Responsibilities`,
+  `Message Lifecycle`, `Persistence`, `Parked Message Management`, `Message Format`
+
+**Step 1: Edit the doc**
+
+- **Responsibilities / Persistence**: introduce the **site-local operation
+  tracking table** — a SQLite table alongside the S&F buffer DB, holding one row
+  per `TrackedOperationId` for cached calls regardless of outcome. It is the
+  status record; the S&F buffer remains only the retry mechanism. State that
+  `Tracking.Status(id)` reads this table, that it is the source of truth, and
+  that terminal rows are purged after a configurable window (default 7 days).
+- **Message Lifecycle**: a cached call that succeeds on its first immediate
+  attempt is written directly as a terminal `Delivered` tracking row and never
+  enters the S&F buffer. A buffered cached-call message references its
+  `TrackedOperationId`.
+- Add a **telemetry emission** note: on every lifecycle transition the site emits
+  `CachedCallTelemetry` to central (best-effort, at-least-once, idempotent on the
+  ID) and responds to `CachedCallReconcileRequest` pulls.
+- **Parked Message Management**: note that Retry/Discard of parked cached calls
+  can be driven by central via `RetryParkedOperation`/`DiscardParkedOperation`,
+  after which the site emits telemetry reflecting the new state.
+- **Message Format**: add `TrackedOperationId` to the listed per-message fields.
+
+Leave the notification category behavior unchanged.
+
+**Step 2: Verify**
+
+Run: `git diff docs/requirements/Component-StoreAndForward.md`
+Expected: cached-call and DB-write categories gain tracking; notification flow untouched.
+
+**Step 3: Commit**
+
+```bash
+git add docs/requirements/Component-StoreAndForward.md
+git commit -m "docs(requirements): add site-local tracking table and telemetry to Store-and-Forward"
+```
+
+---
+
+### Task 4: Update the External System Gateway doc
+
+**Files:**
+- Modify: `docs/requirements/Component-ExternalSystemGateway.md` — `Cached Write`,
+  `External System Call Modes`, `Call Timeout & Error Handling`
+
+**Step 1: Edit the doc**
+
+- `### Cached (Store-and-Forward)` and `### Cached Write (Store-and-Forward)`:
+  state that `CachedCall`/`CachedWrite` now return a `TrackedOperationId`. They
+  are no longer "fire-and-forget" with no handle — replace that wording with
+  "deferred-delivery, returns a tracking handle". Immediate success → terminal
+  `Delivered` record; transient failure → buffered, `Pending`/`Retrying`.
+- Permanent failure: the error is still returned synchronously to the script
+  (unchanged) **and** recorded as a terminal `Failed` tracking record.
+- Keep the idempotency note — duplicate delivery on retry is still the caller's
+  responsibility.
+- Add a one-line pointer that status is observable via `Tracking.Status(id)` and
+  centrally via the Site Call Audit component.
+
+**Step 2: Verify**
+
+Run: `grep -n "fire-and-forget\|TrackedOperationId" docs/requirements/Component-ExternalSystemGateway.md`
+Expected: "fire-and-forget" no longer describes cached calls; `TrackedOperationId` present.
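The three first-attempt outcomes described in Task 4 Step 1 can be sketched as follows. This is a hedged illustration, not the gateway's implementation: `TransientError`, `PermanentError`, `TrackingRow`, the in-memory `tracking` dict and `buffer` list, and the `cached_call` function are all hypothetical stand-ins, and raising an exception is only one assumed way of returning a permanent error synchronously to the script.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional
from uuid import UUID, uuid4


class TransientError(Exception):
    """Hypothetical: a failure worth retrying (timeout, connection refused)."""


class PermanentError(Exception):
    """Hypothetical: a failure that retrying cannot fix (e.g. HTTP 4xx)."""


@dataclass
class TrackingRow:
    tracked_operation_id: UUID
    status: str
    last_error: Optional[str] = None


def cached_call(send: Callable[[], None],
                tracking: Dict[UUID, TrackingRow],
                buffer: List[UUID]) -> UUID:
    """Attempt delivery once and record the outcome in the tracking table."""
    op_id = uuid4()  # the handle is generated site-side at call time
    try:
        send()
    except TransientError as exc:
        # Transient failure: record as Pending and hand over to the S&F buffer.
        tracking[op_id] = TrackingRow(op_id, "Pending", str(exc))
        buffer.append(op_id)
    except PermanentError as exc:
        # Permanent failure: terminal Failed row AND the error reaches the script.
        tracking[op_id] = TrackingRow(op_id, "Failed", str(exc))
        raise
    else:
        # Immediate success: terminal Delivered row, never enters the buffer.
        tracking[op_id] = TrackingRow(op_id, "Delivered")
    return op_id
```

In this sketch only the transient branch touches the buffer, which matches the plan's statement that the buffer is the retry mechanism while the tracking table is the status record.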
+
+**Step 3: Commit**
+
+```bash
+git add docs/requirements/Component-ExternalSystemGateway.md
+git commit -m "docs(requirements): cached calls return TrackedOperationId in ESG"
+```
+
+---
+
+### Task 5: Update the Site Runtime Script Runtime API
+
+**Files:**
+- Modify: `docs/requirements/Component-SiteRuntime.md` — `### External Systems`,
+  `### Notifications`, `### Database Access` under `## Script Runtime API`
+
+**Step 1: Edit the doc**
+
+- `### External Systems`: `ExternalSystem.CachedCall(...)` now returns a
+  `TrackedOperationId`; drop "fire-and-forget", say it returns a tracking handle.
+- `### Database Access`: `Database.CachedWrite(...)` now returns a
+  `TrackedOperationId`.
+- Add the unified accessor `Tracking.Status("trackedOperationId")` — returns a
+  status record (status, retry count, last error, key timestamps) for any tracked
+  operation, answered site-locally and authoritatively for cached calls.
+- `### Notifications`: note that `Notify.Status(...)` is retained as a thin alias
+  of `Tracking.Status(...)`; `Notify.Send` returns a `TrackedOperationId`
+  (the value historically called `NotificationId`).
+
+**Step 2: Verify**
+
+Run: `git diff docs/requirements/Component-SiteRuntime.md`
+Expected: all three cached/async producers return `TrackedOperationId`; `Tracking.Status` documented.
+
+**Step 3: Commit**
+
+```bash
+git add docs/requirements/Component-SiteRuntime.md
+git commit -m "docs(requirements): add Tracking.Status and cached-call handles to Script Runtime API"
+```
+
+---
+
+### Task 6: Update the Central–Site Communication doc
+
+**Files:**
+- Modify: `docs/requirements/Component-Communication.md` — `### 8. Remote Queries`,
+  and add a new pattern for cached-call telemetry
+
+**Step 1: Edit the doc**
+
+- Add a new communication pattern (e.g. `### 10. Cached Call Telemetry (Site → Central)`):
+  the site S&F Engine pushes `CachedCallTelemetry` on every lifecycle transition;
+  best-effort, at-least-once, idempotent on `TrackedOperationId`; transport is
+  ClusterClient command/control. Also describe the reconciliation pull
+  (`CachedCallReconcileRequest`/`Response`) initiated by `SiteCallAuditActor`.
+- `### 8. Remote Queries (Central → Site)`: generalize the "Retry or discard
+  parked messages" command line to also cover cached calls keyed by
+  `TrackedOperationId` (`RetryParkedOperation` / `DiscardParkedOperation`).
+
+**Step 2: Verify**
+
+Run: `grep -n "Telemetry\|RetryParkedOperation" docs/requirements/Component-Communication.md`
+Expected: new telemetry pattern and generalized command present.
+
+**Step 3: Commit**
+
+```bash
+git add docs/requirements/Component-Communication.md
+git commit -m "docs(requirements): add cached-call telemetry pattern to Communication"
+```
+
+---
+
+### Task 7: Update the Configuration Database doc
+
+**Files:**
+- Modify: `docs/requirements/Component-ConfigurationDatabase.md` — `## Database Schema`
+  (add a `### Site Calls` subsection), `## Scheduled Maintenance`
+
+**Step 1: Edit the doc**
+
+- Under `## Database Schema`, add a `### Site Calls` subsection describing the
+  `SiteCalls` table (columns per Task 1's "The `SiteCalls` Table" list), noting
+  it is populated only by Site Call Audit telemetry/reconciliation, and that
+  ingestion is insert-if-not-exists + upsert-on-newer-status.
+- Under `## Scheduled Maintenance`, add a `### SiteCalls Table Purge` subsection
+  mirroring the `### Notifications Table Purge` wording: daily purge of terminal
+  rows after a configurable window (default 365 days).
+
+**Step 2: Verify**
+
+Run: `grep -n "SiteCalls" docs/requirements/Component-ConfigurationDatabase.md`
+Expected: schema subsection and purge subsection both present.
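The "insert-if-not-exists + upsert-on-newer-status" ingestion named in Task 7 could be sketched like this. Several things here are assumptions rather than the plan's design: an in-memory SQLite database stands in for the central MS SQL configuration database, the table is reduced to a few columns, and the `STATUS_RANK` ordering (with `RetryCount` as a tie-breaker) is one possible reading of "newer". The plan does not say how an operator Retry that moves a `Parked` row back to `Retrying` should be ordered, so this sketch does not handle that case.

```python
import sqlite3

# Assumed ordering of statuses; higher rank wins. Terminal statuses share a rank.
STATUS_RANK = {"Pending": 0, "Retrying": 1, "Parked": 2,
               "Delivered": 3, "Failed": 3, "Discarded": 3}


def open_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the central SiteCalls table."""
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE SiteCalls (
            TrackedOperationId TEXT PRIMARY KEY,
            SourceSite         TEXT NOT NULL,
            Status             TEXT NOT NULL,
            RetryCount         INTEGER NOT NULL,
            UpdatedAtUtc       TEXT NOT NULL
        )""")
    return db


def ingest(db, op_id, site, status, retry_count, updated_at_utc):
    """Apply one telemetry message; safe to call repeatedly or out of order."""
    # Step 1: insert if not exists, keyed on TrackedOperationId.
    db.execute("INSERT OR IGNORE INTO SiteCalls VALUES (?, ?, ?, ?, ?)",
               (op_id, site, status, retry_count, updated_at_utc))
    cur_status, cur_retry = db.execute(
        "SELECT Status, RetryCount FROM SiteCalls WHERE TrackedOperationId = ?",
        (op_id,)).fetchone()
    # Step 2: update only when the incoming state is strictly newer.
    if (STATUS_RANK[status], retry_count) > (STATUS_RANK[cur_status], cur_retry):
        db.execute(
            "UPDATE SiteCalls SET Status = ?, RetryCount = ?, UpdatedAtUtc = ? "
            "WHERE TrackedOperationId = ?",
            (status, retry_count, updated_at_utc, op_id))


def status_of(db, op_id):
    """Return (Status, RetryCount) for one tracked operation."""
    return db.execute(
        "SELECT Status, RetryCount FROM SiteCalls WHERE TrackedOperationId = ?",
        (op_id,)).fetchone()
```

Because a duplicate or late message never compares as strictly newer, replaying the same telemetry or delivering it out of order leaves the row unchanged, which is the property the plan relies on for at-least-once delivery.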
+
+**Step 3: Commit**
+
+```bash
+git add docs/requirements/Component-ConfigurationDatabase.md
+git commit -m "docs(requirements): add SiteCalls table and purge to Configuration Database"
+```
+
+---
+
+### Task 8: Update the Central UI doc
+
+**Files:**
+- Modify: `docs/requirements/Component-CentralUI.md` — `## Workflows / Pages`
+
+**Step 1: Edit the doc**
+
+Add a `### Site Calls (Deployment Role)` page after the
+`### Notification Outbox (Deployment Role)` section:
+- Queryable list of cached calls (`ExternalCall` + `DatabaseWrite` only —
+  notifications keep their own Notification Outbox page).
+- Filters: site, kind, status, time range.
+- Columns: timestamp, site, kind, target summary, status badge, retry count,
+  last error.
+- Retry / Discard actions on `Parked` rows; "site unreachable" handling when the
+  owning site is offline.
+- Custom Blazor Server + Bootstrap components, no third-party frameworks.
+
+**Step 2: Verify**
+
+Run: `grep -n "Site Calls" docs/requirements/Component-CentralUI.md`
+Expected: new page section present, scoped to cached calls.
+
+**Step 3: Commit**
+
+```bash
+git add docs/requirements/Component-CentralUI.md
+git commit -m "docs(requirements): add Site Calls page to Central UI"
+```
+
+---
+
+### Task 9: Update the Health Monitoring doc
+
+**Files:**
+- Modify: `docs/requirements/Component-HealthMonitoring.md` — add a
+  `## Site Call Audit KPIs` section after `## Notification Outbox KPIs`
+
+**Step 1: Edit the doc**
+
+Add a `## Site Call Audit KPIs` section mirroring `## Notification Outbox KPIs`:
+the dashboard surfaces Site Call Audit headline KPI tiles (buffered, parked,
+failed-last-interval, delivered-last-interval, oldest-pending age, stuck count),
+computed point-in-time by the Site Call Audit component, global and per-site.
+Stuck is display-only.
+
+**Step 2: Verify**
+
+Run: `grep -n "Site Call Audit KPIs" docs/requirements/Component-HealthMonitoring.md`
+Expected: section present.
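To make the point-in-time KPI definitions from Task 9 concrete, here is a small Python sketch over in-memory rows. It is an assumption-laden illustration: `SiteCallRow`, `Kpis`, and `compute_kpis` are hypothetical names, age is measured from `CreatedAtUtc`, "oldest-pending" is taken over both `Pending` and `Retrying` rows, and the two last-interval counters are omitted because the plan does not define the interval.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional

BUFFERED_STATUSES = ("Pending", "Retrying")


@dataclass(frozen=True)
class SiteCallRow:
    status: str
    created_at_utc: datetime


@dataclass(frozen=True)
class Kpis:
    buffered: int
    parked: int
    stuck: int
    oldest_pending_age: Optional[timedelta]


def compute_kpis(rows: Iterable[SiteCallRow], now: datetime,
                 stuck_after: timedelta = timedelta(minutes=10)) -> Kpis:
    """Compute headline KPIs for one scope (global or a single site)."""
    rows = list(rows)
    # Age of every row still waiting for delivery.
    ages = [now - r.created_at_utc for r in rows if r.status in BUFFERED_STATUSES]
    return Kpis(
        buffered=len(ages),
        parked=sum(1 for r in rows if r.status == "Parked"),
        # Stuck is display-only: buffered rows older than the threshold.
        stuck=sum(1 for age in ages if age > stuck_after),
        oldest_pending_age=max(ages, default=None),
    )
```

Per-site KPIs would follow by filtering the rows on source site before calling `compute_kpis`; the 10-minute default for `stuck_after` is the threshold the plan gives.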
+
+**Step 3: Commit**
+
+```bash
+git add docs/requirements/Component-HealthMonitoring.md
+git commit -m "docs(requirements): add Site Call Audit KPIs to Health Monitoring"
+```
+
+---
+
+### Task 10: Note the shared model in Notification docs
+
+**Files:**
+- Modify: `docs/requirements/Component-NotificationService.md` — `## Script API`
+- Modify: `docs/requirements/Component-NotificationOutbox.md` — `## Purpose` or
+  `### Status Lifecycle`
+
+**Step 1: Edit the doc**
+
+- `Component-NotificationService.md` `## Script API`: note that `Notify.Send`'s
+  `NotificationId` is a `TrackedOperationId` (shared Commons type) and
+  `Notify.Status` is an alias of the unified `Tracking.Status`.
+- `Component-NotificationOutbox.md`: add a sentence that the Notification Outbox
+  and the Site Call Audit component share the `TrackedOperationId` tracking
+  model and status lifecycle, but differ in delivery locality — the Notification
+  Outbox delivers; Site Call Audit only audits.
+
+Do not change any notification behavior.
+
+**Step 2: Verify**
+
+Run: `git diff docs/requirements/Component-NotificationService.md docs/requirements/Component-NotificationOutbox.md`
+Expected: additive notes only, no behavior change.
+
+**Step 3: Commit**
+
+```bash
+git add docs/requirements/Component-NotificationService.md docs/requirements/Component-NotificationOutbox.md
+git commit -m "docs(requirements): note shared TrackedOperationId model in notification docs"
+```
+
+---
+
+### Task 11: Update the README component table
+
+**Files:**
+- Modify: `README.md` — component table and any architecture diagram component count
+
+**Step 1: Edit the doc**
+
+Add row 22 — **Site Call Audit** — to the component table:
+"Central component auditing site cached calls (`CachedCall`/`CachedWrite`);
+`SiteCalls` table, telemetry ingest, reconciliation, KPIs, central→site
+Retry/Discard relay." Update any "21 components" count to 22.
+
+**Step 2: Verify**
+
+Run: `grep -rn "21 component\|22 component" README.md`
+Expected: count reads 22; no stale "21".
+
+**Step 3: Commit**
+
+```bash
+git add README.md
+git commit -m "docs: add Site Call Audit to README component table"
+```
+
+---
+
+### Task 12: Update CLAUDE.md
+
+**Files:**
+- Modify: `CLAUDE.md` — `## Current Component List`, `## Key Design Decisions`
+
+**Step 1: Edit the doc**
+
+- Change the heading `## Current Component List (21 components)` to `(22 components)`
+  and add item 22 — **Site Call Audit** — with a one-line description.
+- Under `## Key Design Decisions`, in `### Store-and-Forward` (or `### UI & Monitoring`),
+  add bullets summarizing: cached calls return a `TrackedOperationId`; site-local
+  tracking table is the status source of truth; new central Site Call Audit
+  component mirrors status via best-effort telemetry + reconciliation; cached-call
+  delivery stays site-local; unified `Tracking.Status` accessor; `Failed` terminal
+  state for permanent failures.
+
+**Step 2: Verify**
+
+Run: `grep -n "22 components\|Site Call Audit" CLAUDE.md`
+Expected: count is 22; component listed; design decisions present.
+
+**Step 3: Commit**
+
+```bash
+git add CLAUDE.md
+git commit -m "docs: record cached-call tracking in CLAUDE.md"
+```
+
+---
+
+### Task 13: Final cross-reference consistency pass
+
+**Files:**
+- Potentially any `docs/requirements/Component-*.md`, `README.md`, `CLAUDE.md`
+
+**Step 1: Sweep for stale or missing references**
+
+Run each and review:
+```bash
+grep -rn "fire-and-forget" docs/requirements/
+grep -rn "21 component" README.md CLAUDE.md
+grep -rln "Site Call Audit" docs/requirements/ README.md CLAUDE.md
+grep -rn "TrackedOperationId" docs/requirements/
+```
+Expected: no "fire-and-forget" describing cached calls; no "21 component" left;
+Site Call Audit referenced by its dependents (Communication, Configuration
+Database, Central UI, Health Monitoring, Commons); `TrackedOperationId` used
+consistently.
+
+**Step 2: Confirm new component's Dependencies/Interactions are reciprocated**
+
+Verify each component named in `Component-SiteCallAudit.md` Dependencies/Interactions
+also references Site Call Audit where appropriate.
+
+**Step 3: Fix any gaps found, then commit**
+
+```bash
+git add -A
+git commit -m "docs(requirements): reconcile cross-references for Site Call Audit"
+```
+
+If no gaps are found, skip the commit and note the plan is complete.
+
+---
+
+## Done
+
+All cached-call tracking design changes are recorded. The design rationale lives
+in `docs/plans/2026-05-19-cached-call-tracking-design.md`.
diff --git a/docs/plans/2026-05-19-cached-call-tracking.md.tasks.json b/docs/plans/2026-05-19-cached-call-tracking.md.tasks.json
new file mode 100644
index 0000000..c58bb72
--- /dev/null
+++ b/docs/plans/2026-05-19-cached-call-tracking.md.tasks.json
@@ -0,0 +1,19 @@
+{
+  "planPath": "docs/plans/2026-05-19-cached-call-tracking.md",
+  "tasks": [
+    {"id": 6, "subject": "Task 1: Create Site Call Audit component doc", "status": "pending"},
+    {"id": 7, "subject": "Task 2: Add tracking contracts to Commons", "status": "pending", "blockedBy": [6]},
+    {"id": 8, "subject": "Task 3: Update Store-and-Forward doc", "status": "pending", "blockedBy": [6, 7]},
+    {"id": 9, "subject": "Task 4: Update External System Gateway doc", "status": "pending", "blockedBy": [6, 7]},
+    {"id": 10, "subject": "Task 5: Update Site Runtime Script Runtime API", "status": "pending", "blockedBy": [6, 7]},
+    {"id": 11, "subject": "Task 6: Update Communication doc", "status": "pending", "blockedBy": [6, 7]},
+    {"id": 12, "subject": "Task 7: Update Configuration Database doc", "status": "pending", "blockedBy": [6, 7]},
+    {"id": 13, "subject": "Task 8: Update Central UI doc", "status": "pending", "blockedBy": [6, 7]},
+    {"id": 14, "subject": "Task 9: Update Health Monitoring doc", "status": "pending", "blockedBy": [6, 7]},
+    {"id": 15, "subject": "Task 10: Note shared model in notification docs", "status": "pending", "blockedBy": [6, 7]},
+    {"id": 16, "subject": "Task 11: Update README component table", "status": "pending", "blockedBy": [6]},
+    {"id": 17, "subject": "Task 12: Update CLAUDE.md", "status": "pending", "blockedBy": [6]},
+    {"id": 18, "subject": "Task 13: Final cross-reference consistency pass", "status": "pending", "blockedBy": [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]}
+  ],
+  "lastUpdated": "2026-05-19"
+}