scadalink-design/docs/plans/2026-05-19-cached-call-tracking.md
2026-05-19 11:30:21 -04:00

Cached Call Tracking Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans to implement this plan task-by-task.

Goal: Give cached external system calls and cached database writes a trackable TrackedOperationId, backed by a site-local tracking table and a new central Site Call Audit component, under a tracking model unified with Notify.Send.

Architecture: Approach B from the design doc — a sibling central component (Site Call Audit), not a merged outbox. The site stays the source of truth for cached-call status; central audit is an eventually-consistent mirror fed by best-effort telemetry plus a reconciliation pull. Delivery of cached calls remains site-local.

Tech Stack: This is a design-documentation change. "Implementation" means editing Markdown design documents under docs/requirements/, plus README.md and CLAUDE.md. No source code is touched. The authoritative design is docs/plans/2026-05-19-cached-call-tracking-design.md — read it before starting.

Working conventions (from CLAUDE.md):

  • Edit documents in place; no copies or backups.
  • Component docs follow: Purpose, Location, Responsibilities, design sections, Dependencies, Interactions.
  • Keep cross-references accurate across all docs.
  • Use git diff to review before committing.

Per-task workflow (replaces TDD for this docs project):

  1. Read the target file in full first.
  2. Make the edits described.
  3. Verify: run git diff <file> and confirm the change reads correctly and matches the design doc.
  4. Cross-reference check: run the grep given in the task; confirm no stale references.
  5. Commit with the given message.

Task 1: Create the Site Call Audit component document

Files:

  • Create: docs/requirements/Component-SiteCallAudit.md

Step 1: Write the new component doc

Create the file following the standard component structure. Content:

# Component: Site Call Audit

## Purpose

Provides central, queryable audit and operational visibility for cached calls
made by site scripts — `ExternalSystem.CachedCall()` and `Database.CachedWrite()`.
Each such call carries a `TrackedOperationId`; sites report lifecycle telemetry
to this component, which maintains a central audit record, computes KPIs, and
relays Retry/Discard actions back to the owning site.

This is the second centrally-hosted observability component for site
store-and-forward activity (the Notification Outbox is the first). Unlike the
Notification Outbox, Site Call Audit is **not a dispatcher** — it never delivers
anything. Cached calls are delivered by the site's Store-and-Forward Engine
against site-local external systems and databases, which central cannot reach.

## Location

Central cluster only. A singleton actor (`SiteCallAuditActor`) on the active
central node. Registered as component #22 in the Host role configuration.

## Responsibilities

- Ingest cached-call lifecycle telemetry from sites into the central `SiteCalls`
  table.
- Run periodic per-site reconciliation pulls so missed telemetry self-heals.
- Compute point-in-time KPIs (global and per-site) from the `SiteCalls` table.
- Relay operator Retry/Discard actions for parked cached calls to the owning
  site over the command/control channel.
- Purge terminal audit rows after a configurable retention window.

## The `SiteCalls` Table

Lives in the central MS SQL configuration database — a sibling of the
`Notifications` table. One row per `TrackedOperationId`:

- **TrackedOperationId** — GUID, primary key. Generated site-side at call time.
- **SourceSite** — site that issued the call.
- **Kind** — `ExternalCall` or `DatabaseWrite`.
- **TargetSummary** — external system + method name, or database connection name.
- **Status** — `Pending`, `Retrying`, `Delivered`, `Parked`, `Failed`, `Discarded`.
- **RetryCount** — attempts so far.
- **LastError** — most recent error detail, if any.
- **Provenance** — source instance / script.
- **CreatedAtUtc**, **UpdatedAtUtc**, **TerminalAtUtc** — key timestamps.

## Status Lifecycle

`Pending → Retrying → Delivered / Parked / Failed / Discarded`

- **Delivered** — succeeded. A cached call that succeeds on its first immediate
  attempt is recorded directly as `Delivered`.
- **Parked** — transient retries exhausted; awaiting manual action.
- **Failed** — permanent failure (e.g. HTTP 4xx). The error was also returned
  synchronously to the calling script; the record captures it.
- **Discarded** — an operator discarded a parked operation.

The site is the source of truth. The `SiteCalls` row is an eventually-consistent
mirror — never queried by scripts (`Tracking.Status()` is answered site-locally).

## Ingest & Idempotency

Telemetry ingestion is **insert-if-not-exists** keyed on `TrackedOperationId`,
then **upsert-on-newer-status**. The lifecycle is monotonic, so status only
advances and never regresses; at-least-once and out-of-order telemetry are
therefore harmless.

## Reconciliation

Because telemetry is best-effort, `SiteCallAuditActor` periodically — and on site
reconnect — pulls "all tracking rows changed since cursor X" from each site.
Gaps left by lost telemetry self-heal. Central converges to the site; the site
never depends on central.

## Retry / Discard Relay

Parked cached calls live in the owning site's S&F buffer. Operator Retry/Discard
from the Central UI is relayed to that site as a `RetryParkedOperation` /
`DiscardParkedOperation` command over the command/control channel. The site
applies the change and emits telemetry reflecting the new state; central never
mutates the `SiteCalls` row directly. If the site is offline the command fails
fast and the UI surfaces a "site unreachable" message.

## KPIs

Point-in-time, computed from the `SiteCalls` table, global and per-source-site,
mirroring the Notification Outbox KPI shape:

- Buffered count (`Pending` + `Retrying`)
- Parked count
- Failed-last-interval
- Delivered-last-interval
- Oldest-pending age
- Stuck count — `Pending`/`Retrying` older than a configurable threshold
  (default 10 minutes); display-only, no escalation.

## Retention

Daily purge of terminal rows (`Delivered`, `Failed`, `Discarded`) after a
configurable window (default 365 days), matching the `Notifications` purge.

## Dependencies

- **Configuration Database**: hosts the `SiteCalls` table and its repository.
- **Central-Site Communication**: receives cached-call telemetry and reconciliation
  responses; sends Retry/Discard commands.
- **Store-and-Forward Engine**: the site-side origin of cached-call telemetry and
  the executor of relayed Retry/Discard commands.
- **Commons**: `TrackedOperationId`, status enum, telemetry message contracts.

## Interactions

- **Central UI**: the Site Calls page queries this component and issues
  Retry/Discard actions.
- **Health Monitoring**: surfaces Site Call Audit KPI tiles on the dashboard.
- **Cluster Infrastructure**: hosts the `SiteCallAuditActor` singleton with
  active/standby failover.

Step 2: Verify

Run: git status --short (the new file is untracked, so git diff --stat alone will not list it) and open the new file. Expected: structure matches other Component-*.md files (Purpose → Interactions).

Step 3: Commit

git add docs/requirements/Component-SiteCallAudit.md
git commit -m "docs(requirements): add Site Call Audit component (#22)"

Task 2: Add shared tracking contracts to Commons

Files:

  • Modify: docs/requirements/Component-Commons.md — sections REQ-COM-1 (data types), REQ-COM-5 (message contracts)

Step 1: Edit the doc

In ### REQ-COM-1: Shared Data Type System, add TrackedOperationId as a shared type: a GUID identifying any tracked store-and-forward operation (CachedCall, CachedWrite, Notify.Send), generated caller-side at the site at call time, doubling as the telemetry idempotency key. Note that the existing NotificationId is the notification-domain name for this same concept.

Add a shared TrackedOperationStatus enum: Pending, Retrying, Delivered, Parked, Failed, Discarded.

In ### REQ-COM-5: Cross-Component Message Contracts, add the cached-call telemetry and command contracts (additive-only, per REQ-COM-5a):

  • CachedCallTelemetry: TrackedOperationId, source site, Kind, target summary, status, retry count, last error, timestamps, provenance.
  • CachedCallReconcileRequest / CachedCallReconcileResponse — cursor-based per-site pull of changed tracking rows.
  • RetryParkedOperation / DiscardParkedOperation — central→site commands keyed by TrackedOperationId (generalize naming so they cover cached calls, not only legacy "parked message" wording).
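
For orientation only, the shared type, the status enum, and the telemetry contract listed above can be pictured as the following Python sketch. It is not part of the documentation edits and not the project's real contract definition: the field names, the `OperationKind` enum, and the `new_tracked_operation_id` helper are assumptions derived from the field list in this task and the `SiteCalls` columns in Task 1.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
import uuid


class TrackedOperationStatus(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"
    PARKED = "Parked"
    FAILED = "Failed"
    DISCARDED = "Discarded"


class OperationKind(Enum):  # hypothetical name for the "Kind" field
    EXTERNAL_CALL = "ExternalCall"
    DATABASE_WRITE = "DatabaseWrite"


@dataclass(frozen=True)
class CachedCallTelemetry:
    tracked_operation_id: uuid.UUID  # also the telemetry idempotency key
    source_site: str
    kind: OperationKind
    target_summary: str
    status: TrackedOperationStatus
    retry_count: int
    last_error: Optional[str]
    provenance: str
    created_at_utc: datetime
    updated_at_utc: datetime
    terminal_at_utc: Optional[datetime] = None


def new_tracked_operation_id() -> uuid.UUID:
    """Generate the ID caller-side, at the site, at call time."""
    return uuid.uuid4()


# Illustrative values only; the site and target names are made up.
_now = datetime.now(timezone.utc)
example = CachedCallTelemetry(
    tracked_operation_id=new_tracked_operation_id(),
    source_site="site-a",
    kind=OperationKind.EXTERNAL_CALL,
    target_summary="ExampleSystem.ExampleMethod",
    status=TrackedOperationStatus.PENDING,
    retry_count=0,
    last_error=None,
    provenance="example-instance/example-script",
    created_at_utc=_now,
    updated_at_utc=_now,
)
```

Keeping the contract immutable (`frozen=True`) mirrors the additive-only intent of REQ-COM-5a: a message is a value, and new fields are added with defaults rather than by changing existing ones.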

Step 2: Verify

Run: git diff docs/requirements/Component-Commons.md Expected: additive only; no existing type or contract removed/renamed.

Step 3: Commit

git add docs/requirements/Component-Commons.md
git commit -m "docs(requirements): add TrackedOperationId and cached-call contracts to Commons"

Task 3: Update the Store-and-Forward Engine doc

Files:

  • Modify: docs/requirements/Component-StoreAndForward.md (sections: Responsibilities, Message Lifecycle, Persistence, Parked Message Management, Message Format)

Step 1: Edit the doc

  • Responsibilities / Persistence: introduce the site-local operation tracking table — a SQLite table alongside the S&F buffer DB, holding one row per TrackedOperationId for cached calls regardless of outcome. It is the status record; the S&F buffer remains only the retry mechanism. State that Tracking.Status(id) reads this table, that it is the source of truth, and that terminal rows are purged after a configurable window (default 7 days).
  • Message Lifecycle: a cached call that succeeds on its first immediate attempt is written directly as a terminal Delivered tracking row and never enters the S&F buffer. A buffered cached-call message references its TrackedOperationId.
  • Add a telemetry emission note: on every lifecycle transition the site emits CachedCallTelemetry to central (best-effort, at-least-once, idempotent on the ID) and responds to CachedCallReconcileRequest pulls.
  • Parked Message Management: note that Retry/Discard of parked cached calls can be driven by central via RetryParkedOperation/DiscardParkedOperation, after which the site emits telemetry reflecting the new state.
  • Message Format: add TrackedOperationId to the listed per-message fields.

Leave the notification category behavior unchanged.
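
The cached-call rules above (every call gets a tracking row, only transient failures enter the buffer) can be pictured with a small Python sketch. It is illustrative only and not part of the edits: `SiteTracker`, `AttemptOutcome`, and the in-memory dict and list standing in for the SQLite tracking table and the S&F buffer are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import uuid


class AttemptOutcome(Enum):  # hypothetical classification of the first attempt
    SUCCESS = "success"
    TRANSIENT_FAILURE = "transient"
    PERMANENT_FAILURE = "permanent"


@dataclass
class TrackingRow:
    tracked_operation_id: str
    status: str
    retry_count: int = 0
    last_error: Optional[str] = None


class SiteTracker:
    def __init__(self) -> None:
        self.tracking = {}  # stand-in for the site-local tracking table
        self.buffer = []    # stand-in for the S&F buffer (IDs awaiting retry)

    def record_first_attempt(self, outcome: AttemptOutcome,
                             error: Optional[str] = None) -> str:
        op_id = str(uuid.uuid4())
        if outcome is AttemptOutcome.SUCCESS:
            # Immediate success: terminal row, never enters the buffer.
            row = TrackingRow(op_id, "Delivered")
        elif outcome is AttemptOutcome.TRANSIENT_FAILURE:
            # Transient failure: tracked as Pending and handed to the buffer.
            row = TrackingRow(op_id, "Pending", last_error=error)
            self.buffer.append(op_id)
        else:
            # Permanent failure: terminal row; the error also goes to the script.
            row = TrackingRow(op_id, "Failed", last_error=error)
        self.tracking[op_id] = row
        return op_id

    def status(self, op_id: str) -> Optional[TrackingRow]:
        """Site-local lookup, in the spirit of Tracking.Status(id)."""
        return self.tracking.get(op_id)
```

The point of the split is that the tracking table answers "what happened to this operation?" for every outcome, while the buffer only ever holds work that still needs a retry.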

Step 2: Verify

Run: git diff docs/requirements/Component-StoreAndForward.md Expected: cached-call and DB-write categories gain tracking; notification flow untouched.

Step 3: Commit

git add docs/requirements/Component-StoreAndForward.md
git commit -m "docs(requirements): add site-local tracking table and telemetry to Store-and-Forward"

Task 4: Update the External System Gateway doc

Files:

  • Modify: docs/requirements/Component-ExternalSystemGateway.md (sections: Cached Write, External System Call Modes, Call Timeout & Error Handling)

Step 1: Edit the doc

  • ### Cached (Store-and-Forward) and ### Cached Write (Store-and-Forward): state that CachedCall/CachedWrite now return a TrackedOperationId. They are no longer "fire-and-forget" with no handle — replace that wording with "deferred-delivery, returns a tracking handle". Immediate success → terminal Delivered record; transient failure → buffered, Pending/Retrying.
  • Permanent failure: the error is still returned synchronously to the script (unchanged) and recorded as a terminal Failed tracking record.
  • Keep the idempotency note — duplicate delivery on retry is still the caller's responsibility.
  • Add a one-line pointer that status is observable via Tracking.Status(id) and centrally via the Site Call Audit component.

Step 2: Verify

Run: grep -n "fire-and-forget\|TrackedOperationId" docs/requirements/Component-ExternalSystemGateway.md Expected: "fire-and-forget" no longer describes cached calls; TrackedOperationId present.

Step 3: Commit

git add docs/requirements/Component-ExternalSystemGateway.md
git commit -m "docs(requirements): cached calls return TrackedOperationId in ESG"

Task 5: Update the Site Runtime Script Runtime API

Files:

  • Modify: docs/requirements/Component-SiteRuntime.md (sections: ### External Systems, ### Notifications, ### Database Access, all under ## Script Runtime API)

Step 1: Edit the doc

  • ### External Systems: ExternalSystem.CachedCall(...) now returns a TrackedOperationId; drop "fire-and-forget", say it returns a tracking handle.
  • ### Database Access: Database.CachedWrite(...) now returns a TrackedOperationId.
  • Add the unified accessor Tracking.Status("trackedOperationId") — returns a status record (status, retry count, last error, key timestamps) for any tracked operation, answered site-locally and authoritatively for cached calls.
  • ### Notifications: note that Notify.Status(...) is retained as a thin alias of Tracking.Status(...); Notify.Send returns a TrackedOperationId (the value historically called NotificationId).

Step 2: Verify

Run: git diff docs/requirements/Component-SiteRuntime.md Expected: all three cached/async producers return TrackedOperationId; Tracking.Status documented.

Step 3: Commit

git add docs/requirements/Component-SiteRuntime.md
git commit -m "docs(requirements): add Tracking.Status and cached-call handles to Script Runtime API"

Task 6: Update the Central-Site Communication doc

Files:

  • Modify: docs/requirements/Component-Communication.md (section ### 8. Remote Queries; also add a new pattern for cached-call telemetry)

Step 1: Edit the doc

  • Add a new communication pattern (e.g. ### 10. Cached Call Telemetry (Site → Central)): the site S&F Engine pushes CachedCallTelemetry on every lifecycle transition; best-effort, at-least-once, idempotent on TrackedOperationId; transport is ClusterClient command/control. Also describe the reconciliation pull (CachedCallReconcileRequest/Response) initiated by SiteCallAuditActor.
  • ### 8. Remote Queries (Central → Site): generalize the "Retry or discard parked messages" command line to also cover cached calls keyed by TrackedOperationId (RetryParkedOperation / DiscardParkedOperation).
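
As a reading aid for the reconciliation pull described above, here is a minimal Python sketch of a cursor-based "rows changed since X" exchange. It is not part of the edits. The cursor's real shape is not specified in this plan, so the integer `change_seq` counter and the function names are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TrackingRow:
    tracked_operation_id: str
    status: str
    change_seq: int  # hypothetical per-site counter, bumped on every change


def rows_changed_since(site_rows: List[TrackingRow], cursor: int) -> List[TrackingRow]:
    """Site side: answer a reconcile request with rows changed after the cursor."""
    changed = [r for r in site_rows if r.change_seq > cursor]
    return sorted(changed, key=lambda r: r.change_seq)


def reconcile(site_rows: List[TrackingRow], central: Dict[str, str], cursor: int) -> int:
    """Central side: apply the site's rows and return the advanced cursor."""
    changed = rows_changed_since(site_rows, cursor)
    for row in changed:
        # Central converges to the site; the site is the source of truth.
        central[row.tracked_operation_id] = row.status
    return changed[-1].change_seq if changed else cursor
```

Because the cursor only advances past rows that were actually applied, a pull that is repeated after a lost response simply re-reads the same rows, which is harmless given idempotent ingest.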

Step 2: Verify

Run: grep -n "Telemetry\|RetryParkedOperation" docs/requirements/Component-Communication.md Expected: new telemetry pattern and generalized command present.

Step 3: Commit

git add docs/requirements/Component-Communication.md
git commit -m "docs(requirements): add cached-call telemetry pattern to Communication"

Task 7: Update the Configuration Database doc

Files:

  • Modify: docs/requirements/Component-ConfigurationDatabase.md (sections: ## Database Schema, adding a ### Site Calls subsection; ## Scheduled Maintenance)

Step 1: Edit the doc

  • Under ## Database Schema, add a ### Site Calls subsection describing the SiteCalls table (columns per Task 1's "The SiteCalls Table" list), noting it is populated only by Site Call Audit telemetry/reconciliation, and that ingestion is insert-if-not-exists + upsert-on-newer-status.
  • Under ## Scheduled Maintenance, add a ### SiteCalls Table Purge subsection mirroring the ### Notifications Table Purge wording: daily purge of terminal rows after a configurable window (default 365 days).
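
To make "insert-if-not-exists + upsert-on-newer-status" concrete, the following Python sketch uses an in-memory SQLite table as a stand-in for the central MS SQL `SiteCalls` table. It is illustrative only and not part of the edits. The `STATUS_RANK` ordering and the retry-count tie-break are assumptions: the design states only that status advances and never regresses, and it does not spell out how an operator Retry of a `Parked` row should be ranked.

```python
import sqlite3

# Assumed ordering for "newer"; terminal states share the highest rank.
STATUS_RANK = {
    "Pending": 0, "Retrying": 1, "Parked": 2,
    "Delivered": 3, "Failed": 3, "Discarded": 3,
}


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE SiteCalls ("
        " TrackedOperationId TEXT PRIMARY KEY,"
        " Status TEXT NOT NULL,"
        " RetryCount INTEGER NOT NULL)"
    )
    return conn


def current(conn: sqlite3.Connection, op_id: str):
    return conn.execute(
        "SELECT Status, RetryCount FROM SiteCalls WHERE TrackedOperationId = ?",
        (op_id,),
    ).fetchone()


def ingest(conn: sqlite3.Connection, op_id: str, status: str, retry_count: int) -> None:
    # Step 1: insert-if-not-exists, keyed on TrackedOperationId.
    conn.execute(
        "INSERT OR IGNORE INTO SiteCalls VALUES (?, ?, ?)",
        (op_id, status, retry_count),
    )
    # Step 2: update only when the incoming telemetry is newer than the stored row.
    stored_status, stored_retry = current(conn, op_id)
    incoming = (STATUS_RANK[status], retry_count)
    stored = (STATUS_RANK[stored_status], stored_retry)
    if incoming > stored:
        conn.execute(
            "UPDATE SiteCalls SET Status = ?, RetryCount = ?"
            " WHERE TrackedOperationId = ?",
            (status, retry_count, op_id),
        )
```

With this shape, a duplicate message is a no-op and a late `Pending` arriving after `Retrying` is ignored, which is what makes at-least-once, out-of-order telemetry safe to ingest.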

Step 2: Verify

Run: grep -n "SiteCalls" docs/requirements/Component-ConfigurationDatabase.md Expected: schema subsection and purge subsection both present.

Step 3: Commit

git add docs/requirements/Component-ConfigurationDatabase.md
git commit -m "docs(requirements): add SiteCalls table and purge to Configuration Database"

Task 8: Update the Central UI doc

Files:

  • Modify: docs/requirements/Component-CentralUI.md (section: ## Workflows / Pages)

Step 1: Edit the doc

Add a ### Site Calls (Deployment Role) page after the ### Notification Outbox (Deployment Role) section:

  • Queryable list of cached calls (ExternalCall + DatabaseWrite only — notifications keep their own Notification Outbox page).
  • Filters: site, kind, status, time range.
  • Columns: timestamp, site, kind, target summary, status badge, retry count, last error.
  • Retry / Discard actions on Parked rows; "site unreachable" handling when the owning site is offline.
  • Custom Blazor Server + Bootstrap components, no third-party frameworks.

Step 2: Verify

Run: grep -n "Site Calls" docs/requirements/Component-CentralUI.md Expected: new page section present, scoped to cached calls.

Step 3: Commit

git add docs/requirements/Component-CentralUI.md
git commit -m "docs(requirements): add Site Calls page to Central UI"

Task 9: Update the Health Monitoring doc

Files:

  • Modify: docs/requirements/Component-HealthMonitoring.md — add a ## Site Call Audit KPIs section after ## Notification Outbox KPIs

Step 1: Edit the doc

Add a ## Site Call Audit KPIs section mirroring ## Notification Outbox KPIs: the dashboard surfaces Site Call Audit headline KPI tiles (buffered, parked, failed-last-interval, delivered-last-interval, oldest-pending age, stuck count), computed point-in-time by the Site Call Audit component, global and per-site. Stuck is display-only.
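
The KPI list above can be read as a single point-in-time pass over the `SiteCalls` rows. The Python sketch below is illustrative only and not part of the edits. Several details are assumptions not fixed by this plan: the one-hour interval, measuring age from `CreatedAtUtc`, and taking "oldest-pending age" over both `Pending` and `Retrying` rows. Only the 10-minute stuck threshold comes from the design.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class SiteCallRow:
    source_site: str
    status: str
    created_at_utc: datetime
    terminal_at_utc: Optional[datetime] = None


def compute_kpis(rows: List[SiteCallRow], now: datetime,
                 interval: timedelta = timedelta(hours=1),        # assumed interval
                 stuck_after: timedelta = timedelta(minutes=10)) -> dict:
    buffered = [r for r in rows if r.status in ("Pending", "Retrying")]
    ages = [now - r.created_at_utc for r in buffered]

    def finished_recently(row: SiteCallRow, status: str) -> bool:
        return (row.status == status
                and row.terminal_at_utc is not None
                and now - row.terminal_at_utc <= interval)

    return {
        "buffered": len(buffered),
        "parked": sum(1 for r in rows if r.status == "Parked"),
        "failed_last_interval": sum(1 for r in rows if finished_recently(r, "Failed")),
        "delivered_last_interval": sum(1 for r in rows if finished_recently(r, "Delivered")),
        "oldest_pending_age": max(ages, default=None),
        "stuck": sum(1 for age in ages if age > stuck_after),  # display-only
    }
```

The per-site view is the same computation applied to the rows of one `source_site`; the global view passes all rows.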

Step 2: Verify

Run: grep -n "Site Call Audit KPIs" docs/requirements/Component-HealthMonitoring.md Expected: section present.

Step 3: Commit

git add docs/requirements/Component-HealthMonitoring.md
git commit -m "docs(requirements): add Site Call Audit KPIs to Health Monitoring"

Task 10: Note the shared model in Notification docs

Files:

  • Modify: docs/requirements/Component-NotificationService.md (section: ## Script API)
  • Modify: docs/requirements/Component-NotificationOutbox.md (section: ## Purpose or ### Status Lifecycle)

Step 1: Edit the doc

  • Component-NotificationService.md ## Script API: note that Notify.Send's NotificationId is a TrackedOperationId (shared Commons type) and Notify.Status is an alias of the unified Tracking.Status.
  • Component-NotificationOutbox.md: add a sentence that the Notification Outbox and the Site Call Audit component share the TrackedOperationId tracking model and status lifecycle, but differ in delivery locality — the Notification Outbox delivers; Site Call Audit only audits.

Do not change any notification behavior.

Step 2: Verify

Run: git diff docs/requirements/Component-NotificationService.md docs/requirements/Component-NotificationOutbox.md Expected: additive notes only, no behavior change.

Step 3: Commit

git add docs/requirements/Component-NotificationService.md docs/requirements/Component-NotificationOutbox.md
git commit -m "docs(requirements): note shared TrackedOperationId model in notification docs"

Task 11: Update the README component table

Files:

  • Modify: README.md — component table and any architecture diagram component count

Step 1: Edit the doc

Add row 22 — Site Call Audit — to the component table: "Central component auditing site cached calls (CachedCall/CachedWrite); SiteCalls table, telemetry ingest, reconciliation, KPIs, central→site Retry/Discard relay." Update any "21 components" count to 22.

Step 2: Verify

Run: grep -rn "21 component\|22 component" README.md Expected: count reads 22; no stale "21".

Step 3: Commit

git add README.md
git commit -m "docs: add Site Call Audit to README component table"

Task 12: Update CLAUDE.md

Files:

  • Modify: CLAUDE.md (sections: ## Current Component List, ## Key Design Decisions)

Step 1: Edit the doc

  • Change the heading ## Current Component List (21 components) to (22 components) and add item 22 — Site Call Audit — with a one-line description.
  • Under ## Key Design Decisions, in ### Store-and-Forward (or ### UI & Monitoring), add bullets summarizing: cached calls return a TrackedOperationId; site-local tracking table is the status source of truth; new central Site Call Audit component mirrors status via best-effort telemetry + reconciliation; cached-call delivery stays site-local; unified Tracking.Status accessor; Failed terminal state for permanent failures.

Step 2: Verify

Run: grep -n "22 components\|Site Call Audit" CLAUDE.md Expected: count is 22; component listed; design decisions present.

Step 3: Commit

git add CLAUDE.md
git commit -m "docs: record cached-call tracking in CLAUDE.md"

Task 13: Final cross-reference consistency pass

Files:

  • Potentially any docs/requirements/Component-*.md, README.md, CLAUDE.md

Step 1: Sweep for stale or missing references

Run each and review:

grep -rn "fire-and-forget" docs/requirements/
grep -rn "21 component" README.md CLAUDE.md
grep -rln "Site Call Audit" docs/requirements/ README.md CLAUDE.md
grep -rn "TrackedOperationId" docs/requirements/

Expected: no "fire-and-forget" describing cached calls; no "21 component" left; Site Call Audit referenced by its dependents (Communication, Configuration Database, Central UI, Health Monitoring, Commons); TrackedOperationId used consistently.

Step 2: Confirm new component's Dependencies/Interactions are reciprocated

Verify each component named in Component-SiteCallAudit.md Dependencies/Interactions also references Site Call Audit where appropriate.

Step 3: Fix any gaps found, then commit

git add -A
git commit -m "docs(requirements): reconcile cross-references for Site Call Audit"

If no gaps are found, skip the commit and note the plan is complete.


Done

All cached-call tracking design changes are recorded. The design rationale lives in docs/plans/2026-05-19-cached-call-tracking-design.md.