scadalink-design/docs/plans/2026-05-19-cached-call-tracking.md
2026-05-19 11:30:21 -04:00

# Cached Call Tracking Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans to implement this plan task-by-task.
**Goal:** Give cached external system calls and cached database writes a trackable `TrackedOperationId`, backed by a site-local tracking table and a new central `Site Call Audit` component, under a tracking model unified with `Notify.Send`.
**Architecture:** Approach B from the design doc — a sibling central component (`Site Call Audit`), not a merged outbox. The site stays the source of truth for cached-call status; central audit is an eventually-consistent mirror fed by best-effort telemetry plus a reconciliation pull. Delivery of cached calls remains site-local.
**Tech Stack:** This is a design-documentation change. "Implementation" means editing Markdown design documents under `docs/requirements/`, plus `README.md` and `CLAUDE.md`. No source code is touched. The authoritative design is `docs/plans/2026-05-19-cached-call-tracking-design.md` — read it before starting.
**Working conventions (from `CLAUDE.md`):**
- Edit documents in place; no copies or backups.
- Component docs follow: Purpose, Location, Responsibilities, design sections, Dependencies, Interactions.
- Keep cross-references accurate across all docs.
- Use `git diff` to review before committing.
**Per-task workflow (replaces TDD for this docs project):**
1. Read the target file in full first.
2. Make the edits described.
3. **Verify**: run `git diff <file>` and confirm the change reads correctly and matches the design doc.
4. **Cross-reference check**: run the grep given in the task; confirm no stale references.
5. **Commit** with the given message.
---
### Task 1: Create the Site Call Audit component document
**Files:**
- Create: `docs/requirements/Component-SiteCallAudit.md`
**Step 1: Write the new component doc**
Create the file following the standard component structure. Content:
```markdown
# Component: Site Call Audit
## Purpose
Provides central, queryable audit and operational visibility for cached calls
made by site scripts — `ExternalSystem.CachedCall()` and `Database.CachedWrite()`.
Each such call carries a `TrackedOperationId`; sites report lifecycle telemetry
to this component, which maintains a central audit record, computes KPIs, and
relays Retry/Discard actions back to the owning site.
This is the second centrally-hosted observability component for site
store-and-forward activity (the Notification Outbox is the first). Unlike the
Notification Outbox, Site Call Audit is **not a dispatcher** — it never delivers
anything. Cached calls are delivered by the site's Store-and-Forward Engine
against site-local external systems and databases, which central cannot reach.
## Location
Central cluster only. A singleton actor (`SiteCallAuditActor`) on the active
central node. Registered as component #22 in the Host role configuration.
## Responsibilities
- Ingest cached-call lifecycle telemetry from sites into the central `SiteCalls`
table.
- Run periodic per-site reconciliation pulls so missed telemetry self-heals.
- Compute point-in-time KPIs (global and per-site) from the `SiteCalls` table.
- Relay operator Retry/Discard actions for parked cached calls to the owning
site over the command/control channel.
- Purge terminal audit rows after a configurable retention window.
## The `SiteCalls` Table
Lives in the central MS SQL configuration database — a sibling of the
`Notifications` table. One row per `TrackedOperationId`:
- **TrackedOperationId** — GUID, primary key. Generated site-side at call time.
- **SourceSite** — site that issued the call.
- **Kind** — `ExternalCall` or `DatabaseWrite`.
- **TargetSummary** — external system + method name, or database connection name.
- **Status** — `Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`.
- **RetryCount** — attempts so far.
- **LastError** — most recent error detail, if any.
- **Provenance** — source instance / script.
- **CreatedAtUtc**, **UpdatedAtUtc**, **TerminalAtUtc** — key timestamps.
## Status Lifecycle
`Pending → Retrying → Delivered / Parked / Failed / Discarded`
- **Delivered** — succeeded. A cached call that succeeds on its first immediate
attempt is recorded directly as `Delivered`.
- **Parked** — transient retries exhausted; awaiting manual action.
- **Failed** — permanent failure (e.g. HTTP 4xx). The error was also returned
synchronously to the calling script; the record captures it.
- **Discarded** — an operator discarded a parked operation.
The site is the source of truth. The `SiteCalls` row is an eventually-consistent
mirror — never queried by scripts (`Tracking.Status()` is answered site-locally).
## Ingest & Idempotency
Telemetry ingestion is **insert-if-not-exists** keyed on `TrackedOperationId`,
then **upsert-on-newer-status**. The lifecycle is monotonic, so status only
advances and never regresses; at-least-once and out-of-order telemetry are
therefore harmless.
## Reconciliation
Because telemetry is best-effort, `SiteCallAuditActor` periodically — and on site
reconnect — pulls "all tracking rows changed since cursor X" from each site.
Gaps left by lost telemetry self-heal. Central converges to the site; the site
never depends on central.
## Retry / Discard Relay
Parked cached calls live in the owning site's S&F buffer. Operator Retry/Discard
from the Central UI is relayed to that site as a `RetryParkedOperation` /
`DiscardParkedOperation` command over the command/control channel. The site
applies the change and emits telemetry reflecting the new state; central never
mutates the `SiteCalls` row directly. If the site is offline the command fails
fast and the UI surfaces a "site unreachable" message.
## KPIs
Point-in-time, computed from the `SiteCalls` table, global and per-source-site,
mirroring the Notification Outbox KPI shape:
- Buffered count (`Pending` + `Retrying`)
- Parked count
- Failed-last-interval
- Delivered-last-interval
- Oldest-pending age
- Stuck count — `Pending`/`Retrying` older than a configurable threshold
(default 10 minutes); display-only, no escalation.
## Retention
Daily purge of terminal rows (`Delivered`, `Failed`, `Discarded`) after a
configurable window (default 365 days), matching the `Notifications` purge.
## Dependencies
- **Configuration Database**: hosts the `SiteCalls` table and its repository.
- **Central–Site Communication**: receives cached-call telemetry and reconciliation
responses; sends Retry/Discard commands.
- **Store-and-Forward Engine**: the site-side origin of cached-call telemetry and
the executor of relayed Retry/Discard commands.
- **Commons**: `TrackedOperationId`, status enum, telemetry message contracts.
## Interactions
- **Central UI**: the Site Calls page queries this component and issues
Retry/Discard actions.
- **Health Monitoring**: surfaces Site Call Audit KPI tiles on the dashboard.
- **Cluster Infrastructure**: hosts the `SiteCallAuditActor` singleton with
active/standby failover.
```
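The Retry / Discard Relay rule in the component doc (central forwards, the site applies, the row changes only through later telemetry) can be sketched as follows. This is a minimal illustration, not the actor's real implementation: `RelaySketch`, `SiteUnreachable`, and the `connected_sites` map are hypothetical names, and only the two command names come from the plan.

```python
from dataclasses import dataclass
from typing import Callable, Dict
from uuid import UUID, uuid4


class SiteUnreachable(Exception):
    """Hypothetical error: the owning site has no live command/control link."""


@dataclass(frozen=True)
class RetryParkedOperation:
    tracked_operation_id: UUID


@dataclass(frozen=True)
class DiscardParkedOperation:
    tracked_operation_id: UUID


class RelaySketch:
    """Hypothetical stand-in for the relay part of SiteCallAuditActor."""

    def __init__(self) -> None:
        # Site name -> send function. A missing entry means the site is offline.
        self.connected_sites: Dict[str, Callable[[object], None]] = {}

    def relay(self, source_site: str, command: object) -> None:
        send = self.connected_sites.get(source_site)
        if send is None:
            # Fail fast, no queuing: the UI shows "site unreachable".
            raise SiteUnreachable(source_site)
        # Central only forwards. The SiteCalls row is not touched here;
        # it changes later, when the site emits telemetry for the new state.
        send(command)
```

Note that `relay` has no code path that writes to `SiteCalls`; that absence is the point of the design.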
**Step 2: Verify**
Run: `git status --short` and open the new file. (The file is still untracked at this point, so `git diff --stat` would not list it.)
Expected: structure matches other `Component-*.md` files (Purpose → Interactions).
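If a mechanical check is preferred over reading by eye, the "Purpose → Interactions" expectation can be approximated with a short script. This is a sketch under assumptions: the required section names are taken from the working conventions above, they are assumed to be level-2 headings, and the function names are made up for illustration.

```python
from typing import List

# Section order from the working conventions; design sections may sit in between.
REQUIRED_ORDER = ["Purpose", "Location", "Responsibilities", "Dependencies", "Interactions"]


def level2_headings(markdown: str) -> List[str]:
    """Return the text of every '## ' heading, skipping fenced code blocks."""
    headings: List[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith("```"):
            in_fence = not in_fence
        elif not in_fence and line.startswith("## "):
            headings.append(line[3:].strip())
    return headings


def follows_component_structure(markdown: str) -> bool:
    """True if all required sections are present and in the conventional order."""
    found = [h for h in level2_headings(markdown) if h in REQUIRED_ORDER]
    return found == REQUIRED_ORDER
```

Usage would be along the lines of `follows_component_structure(Path("docs/requirements/Component-SiteCallAudit.md").read_text(encoding="utf-8"))`.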
**Step 3: Commit**
```bash
git add docs/requirements/Component-SiteCallAudit.md
git commit -m "docs(requirements): add Site Call Audit component (#22)"
```
---
### Task 2: Add shared tracking contracts to Commons
**Files:**
- Modify: `docs/requirements/Component-Commons.md` — sections `REQ-COM-1` (data types), `REQ-COM-5` (message contracts)
**Step 1: Edit the doc**
In `### REQ-COM-1: Shared Data Type System`, add `TrackedOperationId` as a shared
type: a GUID identifying any tracked store-and-forward operation
(`CachedCall`, `CachedWrite`, `Notify.Send`), generated caller-side at the site
at call time, doubling as the telemetry idempotency key. Note that the existing
`NotificationId` is the notification-domain name for this same concept.
Add a shared `TrackedOperationStatus` enum:
`Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`.
In `### REQ-COM-5: Cross-Component Message Contracts`, add the cached-call
telemetry and command contracts (additive-only, per REQ-COM-5a):
- `CachedCallTelemetry`: `TrackedOperationId`, source site, `Kind`,
target summary, status, retry count, last error, timestamps, provenance.
- `CachedCallReconcileRequest` / `CachedCallReconcileResponse` — cursor-based
per-site pull of changed tracking rows.
- `RetryParkedOperation` / `DiscardParkedOperation` — central→site commands
keyed by `TrackedOperationId` (generalize naming so they cover cached calls,
not only legacy "parked message" wording).
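For orientation while editing, the shared types and the telemetry contract described above have roughly the following shape. This is a language-neutral sketch in Python, not the Commons definition: the snake_case field names and the `new_tracked_operation_id` helper are assumptions, while the enum values and the field list come from this plan.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class TrackedOperationStatus(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    FAILED = "Failed"
    DISCARDED = "Discarded"


class Kind(Enum):
    EXTERNAL_CALL = "ExternalCall"
    DATABASE_WRITE = "DatabaseWrite"


@dataclass(frozen=True)
class CachedCallTelemetry:
    tracked_operation_id: UUID          # also the idempotency key
    source_site: str
    kind: Kind
    target_summary: str
    status: TrackedOperationStatus
    retry_count: int
    last_error: Optional[str]
    provenance: str                     # source instance / script
    created_at_utc: datetime
    updated_at_utc: datetime
    terminal_at_utc: Optional[datetime] = None


def new_tracked_operation_id() -> UUID:
    """Hypothetical helper: the ID is generated at the site, at call time."""
    return uuid4()
```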
**Step 2: Verify**
Run: `git diff docs/requirements/Component-Commons.md`
Expected: additive only; no existing type or contract removed/renamed.
**Step 3: Commit**
```bash
git add docs/requirements/Component-Commons.md
git commit -m "docs(requirements): add TrackedOperationId and cached-call contracts to Commons"
```
---
### Task 3: Update the Store-and-Forward Engine doc
**Files:**
- Modify: `docs/requirements/Component-StoreAndForward.md`, sections `Responsibilities`,
`Message Lifecycle`, `Persistence`, `Parked Message Management`, `Message Format`
**Step 1: Edit the doc**
- **Responsibilities / Persistence**: introduce the **site-local operation
tracking table** — a SQLite table alongside the S&F buffer DB, holding one row
per `TrackedOperationId` for cached calls regardless of outcome. It is the
status record; the S&F buffer remains only the retry mechanism. State that
`Tracking.Status(id)` reads this table, that it is the source of truth, and
that terminal rows are purged after a configurable window (default 7 days).
- **Message Lifecycle**: a cached call that succeeds on its first immediate
attempt is written directly as a terminal `Delivered` tracking row and never
enters the S&F buffer. A buffered cached-call message references its
`TrackedOperationId`.
- Add a **telemetry emission** note: on every lifecycle transition the site emits
`CachedCallTelemetry` to central (best-effort, at-least-once, idempotent on the
ID) and responds to `CachedCallReconcileRequest` pulls.
- **Parked Message Management**: note that Retry/Discard of parked cached calls
can be driven by central via `RetryParkedOperation`/`DiscardParkedOperation`,
after which the site emits telemetry reflecting the new state.
- **Message Format**: add `TrackedOperationId` to the listed per-message fields.
Leave the notification category behavior unchanged.
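The site-local tracking table described in the bullets above could look like the following. This is a minimal sketch using Python's built-in `sqlite3`; the table name `operation_tracking`, the column subset, and the helper names are assumptions. The terminal set follows the component doc's purge rule (`Delivered`, `Failed`, `Discarded`), so `Parked` rows are never purged.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from uuid import uuid4

TERMINAL = ("Delivered", "Failed", "Discarded")

SCHEMA = """
CREATE TABLE IF NOT EXISTS operation_tracking (
    tracked_operation_id TEXT PRIMARY KEY,
    kind                 TEXT NOT NULL,
    status               TEXT NOT NULL,
    retry_count          INTEGER NOT NULL DEFAULT 0,
    last_error           TEXT,
    updated_at_utc       TEXT NOT NULL
)
"""


def _iso(moment: datetime) -> str:
    # Fixed-width UTC text, so string comparison matches time order.
    return moment.astimezone(timezone.utc).isoformat(timespec="microseconds")


def record(db, op_id, kind, status, now, retry_count=0, last_error=None):
    """Write the current state of one tracked operation (one row per ID)."""
    db.execute(
        "INSERT INTO operation_tracking VALUES (?, ?, ?, ?, ?, ?) "
        "ON CONFLICT(tracked_operation_id) DO UPDATE SET "
        "status = excluded.status, retry_count = excluded.retry_count, "
        "last_error = excluded.last_error, updated_at_utc = excluded.updated_at_utc",
        (op_id, kind, status, retry_count, last_error, _iso(now)),
    )


def purge_terminal(db, now, retention=timedelta(days=7)):
    """Delete terminal rows older than the retention window; return the count."""
    cur = db.execute(
        "DELETE FROM operation_tracking "
        "WHERE status IN (?, ?, ?) AND updated_at_utc < ?",
        (*TERMINAL, _iso(now - retention)),
    )
    return cur.rowcount
```

A call that succeeds on its first attempt is a single `record(..., "Delivered", ...)` with no buffer entry; a buffered call is recorded as `Pending` and updated on each transition.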
**Step 2: Verify**
Run: `git diff docs/requirements/Component-StoreAndForward.md`
Expected: cached-call and DB-write categories gain tracking; notification flow untouched.
**Step 3: Commit**
```bash
git add docs/requirements/Component-StoreAndForward.md
git commit -m "docs(requirements): add site-local tracking table and telemetry to Store-and-Forward"
```
---
### Task 4: Update the External System Gateway doc
**Files:**
- Modify: `docs/requirements/Component-ExternalSystemGateway.md`, sections `Cached Write`,
`External System Call Modes`, `Call Timeout & Error Handling`
**Step 1: Edit the doc**
- `### Cached (Store-and-Forward)` and `### Cached Write (Store-and-Forward)`:
state that `CachedCall`/`CachedWrite` now return a `TrackedOperationId`. They
are no longer "fire-and-forget" with no handle — replace that wording with
"deferred-delivery, returns a tracking handle". Immediate success → terminal
`Delivered` record; transient failure → buffered, `Pending`/`Retrying`.
- Permanent failure: the error is still returned synchronously to the script
(unchanged) **and** recorded as a terminal `Failed` tracking record.
- Keep the idempotency note — duplicate delivery on retry is still the caller's
responsibility.
- Add a one-line pointer that status is observable via `Tracking.Status(id)` and
centrally via the Site Call Audit component.
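The three first-attempt outcomes listed above map to tracking states as sketched below. This is illustrative only: the exception classes, `FirstAttempt`, and `first_attempt` are hypothetical, and how the real API hands the script both the synchronous error and the handle on permanent failure is not specified in this plan (here the result object simply carries both).

```python
from dataclasses import dataclass
from typing import Callable, Optional
from uuid import UUID, uuid4


class TransientError(Exception):
    """Hypothetical: e.g. a timeout or refused connection; worth retrying."""


class PermanentError(Exception):
    """Hypothetical: e.g. an HTTP 4xx response; retrying cannot help."""


@dataclass
class FirstAttempt:
    tracked_operation_id: UUID
    status: str                  # "Delivered", "Pending" or "Failed"
    buffered: bool               # True only when handed to store-and-forward
    error: Optional[str] = None


def first_attempt(send: Callable[[], None]) -> FirstAttempt:
    """Classify the immediate attempt of a cached call."""
    op_id = uuid4()              # generated at call time, before any delivery
    try:
        send()
    except TransientError as exc:
        # Buffered for retry; the tracking row starts as Pending.
        return FirstAttempt(op_id, "Pending", buffered=True, error=str(exc))
    except PermanentError as exc:
        # Terminal Failed record; never enters the buffer.
        return FirstAttempt(op_id, "Failed", buffered=False, error=str(exc))
    # Immediate success: terminal Delivered record, no buffering.
    return FirstAttempt(op_id, "Delivered", buffered=False)
```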
**Step 2: Verify**
Run: `grep -nE "fire-and-forget|TrackedOperationId" docs/requirements/Component-ExternalSystemGateway.md`
Expected: "fire-and-forget" no longer describes cached calls; `TrackedOperationId` present.
**Step 3: Commit**
```bash
git add docs/requirements/Component-ExternalSystemGateway.md
git commit -m "docs(requirements): cached calls return TrackedOperationId in ESG"
```
---
### Task 5: Update the Site Runtime Script Runtime API
**Files:**
- Modify: `docs/requirements/Component-SiteRuntime.md`, sections `### External Systems`,
`### Notifications`, `### Database Access` under `## Script Runtime API`
**Step 1: Edit the doc**
- `### External Systems`: `ExternalSystem.CachedCall(...)` now returns a
`TrackedOperationId`; drop "fire-and-forget", say it returns a tracking handle.
- `### Database Access`: `Database.CachedWrite(...)` now returns a
`TrackedOperationId`.
- Add the unified accessor `Tracking.Status("trackedOperationId")` — returns a
status record (status, retry count, last error, key timestamps) for any tracked
operation, answered site-locally and authoritatively for cached calls.
- `### Notifications`: note that `Notify.Status(...)` is retained as a thin alias
of `Tracking.Status(...)`; `Notify.Send` returns a `TrackedOperationId`
(the value historically called `NotificationId`).
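The accessor and its alias can be pictured as below. This is a sketch, not the Script Runtime API itself: the class shapes, the in-memory `store`, and returning `None` for an unknown ID are all assumptions; only the names `Tracking.Status`, `Notify.Status`, and the record's fields come from this plan.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class StatusRecord:
    status: str
    retry_count: int
    last_error: Optional[str]
    updated_at_utc: str


class Tracking:
    """Reads the site-local tracking store; makes no call to central."""

    def __init__(self, store: Dict[str, StatusRecord]) -> None:
        self._store = store

    def status(self, tracked_operation_id: str) -> Optional[StatusRecord]:
        # Behaviour for an unknown ID is an assumption of this sketch.
        return self._store.get(tracked_operation_id)


class Notify:
    """Only the status alias is shown; Send is out of scope here."""

    def __init__(self, tracking: Tracking) -> None:
        self._tracking = tracking

    def status(self, notification_id: str) -> Optional[StatusRecord]:
        # Thin alias: a NotificationId is a TrackedOperationId.
        return self._tracking.status(notification_id)
```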
**Step 2: Verify**
Run: `git diff docs/requirements/Component-SiteRuntime.md`
Expected: all three cached/async producers return `TrackedOperationId`; `Tracking.Status` documented.
**Step 3: Commit**
```bash
git add docs/requirements/Component-SiteRuntime.md
git commit -m "docs(requirements): add Tracking.Status and cached-call handles to Script Runtime API"
```
---
### Task 6: Update the Central–Site Communication doc
**Files:**
- Modify: `docs/requirements/Component-Communication.md`: update `### 8. Remote Queries`
and add a new pattern for cached-call telemetry
**Step 1: Edit the doc**
- Add a new communication pattern (e.g. `### 10. Cached Call Telemetry (Site → Central)`):
the site S&F Engine pushes `CachedCallTelemetry` on every lifecycle transition;
best-effort, at-least-once, idempotent on `TrackedOperationId`; transport is
ClusterClient command/control. Also describe the reconciliation pull
(`CachedCallReconcileRequest`/`Response`) initiated by `SiteCallAuditActor`.
- `### 8. Remote Queries (Central → Site)`: generalize the "Retry or discard
parked messages" command line to also cover cached calls keyed by
`TrackedOperationId` (`RetryParkedOperation` / `DiscardParkedOperation`).
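The reconciliation pull mentioned above ("rows changed since cursor X") can be sketched as a paged loop. The cursor's actual form is not specified in this plan; the per-site integer `change_seq`, the page `limit`, and both function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TrackingRow:
    tracked_operation_id: str
    status: str
    change_seq: int   # hypothetical per-site, monotonically increasing counter


def handle_reconcile_request(rows: List[TrackingRow], cursor: int,
                             limit: int = 100) -> Tuple[List[TrackingRow], int]:
    """Site side: return rows changed after `cursor`, plus the next cursor."""
    changed = sorted((r for r in rows if r.change_seq > cursor),
                     key=lambda r: r.change_seq)
    page = changed[:limit]
    next_cursor = page[-1].change_seq if page else cursor
    return page, next_cursor


def reconcile_site(site_rows: List[TrackingRow], central: Dict[str, str],
                   cursor: int) -> int:
    """Central side: pull pages until nothing is newer; return the new cursor."""
    while True:
        page, next_cursor = handle_reconcile_request(site_rows, cursor)
        if not page:
            return cursor
        for row in page:
            # Central converges to the site, never the other way round.
            central[row.tracked_operation_id] = row.status
        cursor = next_cursor
```

In the real component the pulled rows would presumably pass through the same idempotent ingest path as pushed telemetry rather than a plain overwrite.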
**Step 2: Verify**
Run: `grep -nE "Telemetry|RetryParkedOperation" docs/requirements/Component-Communication.md`
Expected: new telemetry pattern and generalized command present.
**Step 3: Commit**
```bash
git add docs/requirements/Component-Communication.md
git commit -m "docs(requirements): add cached-call telemetry pattern to Communication"
```
---
### Task 7: Update the Configuration Database doc
**Files:**
- Modify: `docs/requirements/Component-ConfigurationDatabase.md`, sections `## Database Schema`
(add a `### Site Calls` subsection) and `## Scheduled Maintenance`
**Step 1: Edit the doc**
- Under `## Database Schema`, add a `### Site Calls` subsection describing the
`SiteCalls` table (columns per Task 1's "The `SiteCalls` Table" list), noting
it is populated only by Site Call Audit telemetry/reconciliation, and that
ingestion is insert-if-not-exists + upsert-on-newer-status.
- Under `## Scheduled Maintenance`, add a `### SiteCalls Table Purge` subsection
mirroring the `### Notifications Table Purge` wording: daily purge of terminal
rows after a configurable window (default 365 days).
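The "insert-if-not-exists + upsert-on-newer-status" rule is easier to word correctly with a concrete model in mind. The sketch below uses SQLite as a stand-in for the central MS SQL database and a reduced column set. The `RANK` ordering, and using `RetryCount` as a tie-breaker, are assumptions: this plan states only that the lifecycle is monotonic, and does not say how a transition after an operator Retry of a `Parked` call should be ordered.

```python
import sqlite3

SCHEMA = """
CREATE TABLE SiteCalls (
    TrackedOperationId TEXT PRIMARY KEY,
    SourceSite         TEXT NOT NULL,
    Kind               TEXT NOT NULL,
    Status             TEXT NOT NULL,
    RetryCount         INTEGER NOT NULL,
    LastError          TEXT
)
"""

# Assumed ordering for "newer"; rank 3 marks the terminal states.
RANK = {"Pending": 0, "Retrying": 1, "Parked": 2,
        "Delivered": 3, "Failed": 3, "Discarded": 3}


def ingest(db, op_id, source_site, kind, status, retry_count=0, last_error=None):
    """Apply one telemetry message; return True if the table changed."""
    row = db.execute(
        "SELECT Status, RetryCount FROM SiteCalls WHERE TrackedOperationId = ?",
        (op_id,),
    ).fetchone()
    if row is None:
        # Insert-if-not-exists: first sighting of this ID.
        db.execute("INSERT INTO SiteCalls VALUES (?, ?, ?, ?, ?, ?)",
                   (op_id, source_site, kind, status, retry_count, last_error))
        return True
    old_status, old_retries = row
    if RANK[old_status] == 3:
        return False  # terminal rows are never overwritten
    if (RANK[status], retry_count) <= (RANK[old_status], old_retries):
        return False  # duplicate or out-of-order message: ignore
    db.execute(
        "UPDATE SiteCalls SET Status = ?, RetryCount = ?, LastError = ? "
        "WHERE TrackedOperationId = ?",
        (status, retry_count, last_error, op_id),
    )
    return True
```

Because duplicates and stale messages are dropped, replaying the same telemetry any number of times leaves the row unchanged.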
**Step 2: Verify**
Run: `grep -n "SiteCalls" docs/requirements/Component-ConfigurationDatabase.md`
Expected: schema subsection and purge subsection both present.
**Step 3: Commit**
```bash
git add docs/requirements/Component-ConfigurationDatabase.md
git commit -m "docs(requirements): add SiteCalls table and purge to Configuration Database"
```
---
### Task 8: Update the Central UI doc
**Files:**
- Modify: `docs/requirements/Component-CentralUI.md`, section `## Workflows / Pages`
**Step 1: Edit the doc**
Add a `### Site Calls (Deployment Role)` page after the
`### Notification Outbox (Deployment Role)` section:
- Queryable list of cached calls (`ExternalCall` + `DatabaseWrite` only —
notifications keep their own Notification Outbox page).
- Filters: site, kind, status, time range.
- Columns: timestamp, site, kind, target summary, status badge, retry count,
last error.
- Retry / Discard actions on `Parked` rows; "site unreachable" handling when the
owning site is offline.
- Custom Blazor Server + Bootstrap components, no third-party frameworks.
**Step 2: Verify**
Run: `grep -n "Site Calls" docs/requirements/Component-CentralUI.md`
Expected: new page section present, scoped to cached calls.
**Step 3: Commit**
```bash
git add docs/requirements/Component-CentralUI.md
git commit -m "docs(requirements): add Site Calls page to Central UI"
```
---
### Task 9: Update the Health Monitoring doc
**Files:**
- Modify: `docs/requirements/Component-HealthMonitoring.md` — add a
`## Site Call Audit KPIs` section after `## Notification Outbox KPIs`
**Step 1: Edit the doc**
Add a `## Site Call Audit KPIs` section mirroring `## Notification Outbox KPIs`:
the dashboard surfaces Site Call Audit headline KPI tiles (buffered, parked,
failed-last-interval, delivered-last-interval, oldest-pending age, stuck count),
computed point-in-time by the Site Call Audit component, global and per-site.
Stuck is display-only.
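To keep the KPI wording precise, it may help to see how the six tiles fall out of the rows. This is a sketch under assumptions: the one-hour interval, measuring "oldest pending" and "stuck" from `created_at_utc` across both `Pending` and `Retrying`, and the `SiteCall` and `kpis` names are all illustrative; the 10-minute stuck threshold is the default stated in the component doc.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class SiteCall:
    source_site: str
    status: str
    created_at_utc: datetime
    terminal_at_utc: Optional[datetime] = None


def kpis(rows: Iterable[SiteCall], now: datetime,
         interval: timedelta = timedelta(hours=1),
         stuck_after: timedelta = timedelta(minutes=10),
         site: Optional[str] = None) -> dict:
    """Point-in-time KPIs; pass `site` for the per-site view, None for global."""
    scoped = [r for r in rows if site is None or r.source_site == site]
    buffered = [r for r in scoped if r.status in ("Pending", "Retrying")]

    def recent(status: str) -> int:
        # Rows that reached `status` within the last interval.
        return sum(1 for r in scoped
                   if r.status == status and r.terminal_at_utc is not None
                   and now - r.terminal_at_utc <= interval)

    oldest = min((r.created_at_utc for r in buffered), default=None)
    return {
        "buffered": len(buffered),
        "parked": sum(1 for r in scoped if r.status == "Parked"),
        "failed_last_interval": recent("Failed"),
        "delivered_last_interval": recent("Delivered"),
        "oldest_pending_age": (now - oldest) if oldest is not None else None,
        "stuck": sum(1 for r in buffered if now - r.created_at_utc > stuck_after),
    }
```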
**Step 2: Verify**
Run: `grep -n "Site Call Audit KPIs" docs/requirements/Component-HealthMonitoring.md`
Expected: section present.
**Step 3: Commit**
```bash
git add docs/requirements/Component-HealthMonitoring.md
git commit -m "docs(requirements): add Site Call Audit KPIs to Health Monitoring"
```
---
### Task 10: Note the shared model in Notification docs
**Files:**
- Modify: `docs/requirements/Component-NotificationService.md`, section `## Script API`
- Modify: `docs/requirements/Component-NotificationOutbox.md`, section `## Purpose` or
`### Status Lifecycle`
**Step 1: Edit the doc**
- `Component-NotificationService.md` `## Script API`: note that `Notify.Send`'s
`NotificationId` is a `TrackedOperationId` (shared Commons type) and
`Notify.Status` is an alias of the unified `Tracking.Status`.
- `Component-NotificationOutbox.md`: add a sentence that the Notification Outbox
and the Site Call Audit component share the `TrackedOperationId` tracking
model and status lifecycle, but differ in delivery locality — the Notification
Outbox delivers; Site Call Audit only audits.
Do not change any notification behavior.
**Step 2: Verify**
Run: `git diff docs/requirements/Component-NotificationService.md docs/requirements/Component-NotificationOutbox.md`
Expected: additive notes only, no behavior change.
**Step 3: Commit**
```bash
git add docs/requirements/Component-NotificationService.md docs/requirements/Component-NotificationOutbox.md
git commit -m "docs(requirements): note shared TrackedOperationId model in notification docs"
```
---
### Task 11: Update the README component table
**Files:**
- Modify: `README.md` — component table and any architecture diagram component count
**Step 1: Edit the doc**
Add row 22 — **Site Call Audit** — to the component table:
"Central component auditing site cached calls (`CachedCall`/`CachedWrite`);
`SiteCalls` table, telemetry ingest, reconciliation, KPIs, central→site
Retry/Discard relay." Update any "21 components" count to 22.
**Step 2: Verify**
Run: `grep -nE "21 component|22 component" README.md`
Expected: count reads 22; no stale "21".
**Step 3: Commit**
```bash
git add README.md
git commit -m "docs: add Site Call Audit to README component table"
```
---
### Task 12: Update CLAUDE.md
**Files:**
- Modify: `CLAUDE.md`, sections `## Current Component List`, `## Key Design Decisions`
**Step 1: Edit the doc**
- Change the heading `## Current Component List (21 components)` to `(22 components)`
and add item 22 — **Site Call Audit** — with a one-line description.
- Under `## Key Design Decisions`, in `### Store-and-Forward` (or `### UI & Monitoring`),
add bullets summarizing: cached calls return a `TrackedOperationId`; site-local
tracking table is the status source of truth; new central Site Call Audit
component mirrors status via best-effort telemetry + reconciliation; cached-call
delivery stays site-local; unified `Tracking.Status` accessor; `Failed` terminal
state for permanent failures.
**Step 2: Verify**
Run: `grep -nE "22 components|Site Call Audit" CLAUDE.md`
Expected: count is 22; component listed; design decisions present.
**Step 3: Commit**
```bash
git add CLAUDE.md
git commit -m "docs: record cached-call tracking in CLAUDE.md"
```
---
### Task 13: Final cross-reference consistency pass
**Files:**
- Potentially any `docs/requirements/Component-*.md`, `README.md`, `CLAUDE.md`
**Step 1: Sweep for stale or missing references**
Run each and review:
```bash
grep -rn "fire-and-forget" docs/requirements/
grep -rn "21 component" README.md CLAUDE.md
grep -rln "Site Call Audit" docs/requirements/ README.md CLAUDE.md
grep -rn "TrackedOperationId" docs/requirements/
```
Expected: no "fire-and-forget" describing cached calls; no "21 component" left;
Site Call Audit referenced by its dependents (Communication, Configuration
Database, Central UI, Health Monitoring, Commons); `TrackedOperationId` used
consistently.
**Step 2: Confirm new component's Dependencies/Interactions are reciprocated**
Verify each component named in `Component-SiteCallAudit.md` Dependencies/Interactions
also references Site Call Audit where appropriate.
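The reciprocity check can also be scripted. The helper below is a minimal sketch: both function names are hypothetical, and the list of expected files has to be supplied by whoever runs it, based on the Dependencies/Interactions sections of `Component-SiteCallAudit.md`.

```python
from pathlib import Path
from typing import Dict, List


def load_docs(root: str) -> Dict[str, str]:
    """Read every Component-*.md under `root` into a name -> text map."""
    return {p.name: p.read_text(encoding="utf-8")
            for p in Path(root).glob("Component-*.md")}


def missing_back_references(docs: Dict[str, str], expected: List[str],
                            needle: str = "Site Call Audit") -> List[str]:
    """Return the expected docs that never mention `needle` (or do not exist)."""
    return [name for name in expected if needle not in docs.get(name, "")]
```

For example, `missing_back_references(load_docs("docs/requirements"), ["Component-CentralUI.md", "Component-HealthMonitoring.md"])` returns an empty list when both files mention the new component.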
**Step 3: Fix any gaps found, then commit**
```bash
git add -A
git commit -m "docs(requirements): reconcile cross-references for Site Call Audit"
```
If no gaps are found, skip the commit and note the plan is complete.
---
## Done
All cached-call tracking design changes are recorded. The design rationale lives
in `docs/plans/2026-05-19-cached-call-tracking-design.md`.