docs(requirements): add site-local tracking table and telemetry to Store-and-Forward

2026-05-19 11:42:20 -04:00
parent 5efbb9a985
commit 17ef5f85de
1 changed files with 27 additions and 3 deletions
--- a/docs/requirements/Component-StoreAndForward.md
+++ b/docs/requirements/Component-StoreAndForward.md
@@ -18,9 +18,11 @@ Site clusters only. The central cluster does not buffer messages.
 - Retry delivery per message according to the configured retry policy.
 - Park messages that exhaust their retry limit (dead-letter).
 - Persist buffered messages to local SQLite for durability.
+- Maintain a site-local **operation tracking table** holding one row per `TrackedOperationId` for cached calls (`ExternalCall` and `DatabaseWrite`) — the authoritative status record consulted by `Tracking.Status(id)`.
+- Emit cached-call lifecycle telemetry to the central Site Call Audit component on every status transition.
 - Replicate buffered messages to the standby node via application-level replication over Akka.NET remoting.
 - On failover, the standby node takes over delivery from its replicated copy.
- Respond to remote queries from central for parked message management (list, retry, discard).
+- Respond to remote queries from central for parked message management (list, retry, discard), including central-driven Retry/Discard of parked cached calls.

 ## Message Lifecycle

@@ -44,6 +46,8 @@ Attempt immediate delivery

 For notifications, "delivery" means forwarding the message to the central cluster via Central–Site Communication; "success" is central's ack, on which the message is cleared. Notifications do not park — they are retried at the fixed forward interval until central acks. Parking applies only to the external-system-call and cached-database-write categories.

+For the cached-call categories (`ExternalCall` and `DatabaseWrite`), the operation tracking table is the status record and the S&F buffer is purely the retry mechanism. A cached call that succeeds on its first immediate attempt is written directly as a terminal `Delivered` tracking row and never enters the S&F buffer. When immediate delivery fails transiently, the message is buffered and its tracking row moves to `Pending`/`Retrying`; the buffered message carries its `TrackedOperationId` so the tracking row and the retry record stay linked. On every tracking-table status transition the site emits `CachedCallTelemetry` to central.
+
 ## Retry Policy

 For the external-system-call and cached-database-write categories, retry settings are defined on the **source entity** (not per-message):
@@ -68,6 +72,22 @@ There is **no maximum buffer size**. Messages accumulate in the buffer until del
 - On failover, the new active node has a near-complete copy of the buffer. In rare cases, the most recent operations may not have been replicated (e.g., a message added or removed just before failover). This can result in a few **duplicate deliveries** (message delivered but remove not replicated) or a few **missed retries** (message added but not replicated). Both are acceptable trade-offs for the latency benefit.
 - On failover, the new active node resumes delivery from its local copy.

+### Operation Tracking Table
+
+Alongside the S&F buffer DB, each site node holds a **site-local operation tracking table** in SQLite. It carries one row per `TrackedOperationId` for cached calls (`ExternalCall` and `DatabaseWrite`), created the moment the script issues the cached call and kept regardless of outcome.
+
+- This table is the **status record**; the S&F buffer remains purely the **retry mechanism**. A buffered cached-call message references its `TrackedOperationId` back to its tracking row.
+- Each row records the operation kind (`TrackedOperationKind`), a target summary (external system + method, or database connection name), the unified `TrackedOperationStatus`, retry count, last error, source provenance (instance / script), and the created/updated/terminal UTC timestamps.
+- `Tracking.Status(id)` reads this table. For cached calls the **site is the authoritative source of truth** for status — the query is always answered site-locally, even when central is unreachable. The central Site Call Audit `SiteCalls` table is an eventually-consistent mirror.
+- A cached call that succeeds on its first immediate attempt writes a terminal `Delivered` row directly here, with nothing placed in the S&F buffer.
+- Terminal rows are purged after a configurable retention window (default 7 days) — the site holds live operational state; central holds long-term audit.
+
+Notifications are unaffected: they have no tracking table. Their `NotificationId` and status are owned by the central `Notifications` table, and their lifecycle continues to forward to central exactly as before.
+
+### Telemetry to Central
+
+On every tracking-table status transition, the site emits a `CachedCallTelemetry` message to the central Site Call Audit component over the site→central channel. Emission is best-effort, at-least-once, and idempotent on `TrackedOperationId`. Because telemetry is best-effort, the site also responds to `CachedCallReconcileRequest` reconciliation pulls — cursor-based per-site reads of tracking rows changed since a cursor — so any missed telemetry self-heals. The site never depends on central; central converges to the site.
+
 ## Parked Message Management

 - Parked messages remain stored at the site in SQLite.
@@ -75,13 +95,15 @@ There is **no maximum buffer size**. Messages accumulate in the buffer until del
 - Operators can:
  - **Retry** a parked message (moves it back to the retry queue).
  - **Discard** a parked message (removes it permanently).
- Store-and-forward messages are **not** automatically cleared when an instance is deleted. Pending and parked messages continue to exist and can be managed via the central UI.
+- For parked cached calls, Retry/Discard can be driven centrally: the Site Call Audit component relays `RetryParkedOperation` / `DiscardParkedOperation` commands (keyed by `TrackedOperationId`) down to the owning site. The site applies the command to its S&F buffer and tracking table, then emits `CachedCallTelemetry` reflecting the new state (`Retrying` or `Discarded`) — central never mutates its mirror row directly.
+- Store-and-forward messages are **not** automatically cleared when an instance is deleted. Pending and parked messages, and their tracking rows, continue to exist and can be managed via the central UI.

 ## Message Format

 Each buffered message stores:
 - **Message ID**: Unique identifier.
 - **Category**: External system call, notification, or cached database write.
+- **Tracked Operation ID**: For the cached-call categories, the `TrackedOperationId` linking the buffered message to its row in the operation tracking table. Not used by the notification category, which is tracked centrally via its `NotificationId`.
 - **Target**: External system name, the central cluster (for notifications), or database connection name.
 - **Payload**: Serialized message content (API method + parameters; notification list name + subject + body plus the locally generated `NotificationId` and source provenance; SQL + parameters).
 - **Retry Count**: Number of attempts so far.
@@ -94,7 +116,8 @@ Each buffered message stores:
 - **SQLite**: Local persistence on each node.
 - **Communication Layer**: Application-level replication to standby node; remote query handling from central; carries buffered notifications to the central cluster (ClusterClient) and receives central's acks.
 - **External System Gateway**: Delivers external system API calls.
- **Central–Site Communication**: The delivery target for the notification category — a buffered notification is forwarded to the central cluster over Central–Site Communication and cleared on central's ack.
+- **Central–Site Communication**: The delivery target for the notification category — a buffered notification is forwarded to the central cluster over Central–Site Communication and cleared on central's ack. Also carries `CachedCallTelemetry` and reconciliation responses to central, and receives `RetryParkedOperation` / `DiscardParkedOperation` commands.
+- **Site Call Audit**: The central audit mirror for cached calls — receives this engine's cached-call telemetry and reconciliation responses, and relays operator Retry/Discard of parked cached calls back as commands.
 - **Database Connections**: Delivers cached database writes.
 - **Site Event Logging**: Logs store-and-forward activity (queued, delivered, retried, parked).

@@ -103,4 +126,5 @@ Each buffered message stores:
 - **Site Runtime (Script Actors)**: Scripts submit messages to the buffer (external calls, notifications, cached DB writes).
 - **Communication Layer**: Handles parked message queries/commands from central; carries buffered notifications to the central cluster.
 - **Notification Outbox**: The central destination for the notification category — central ingests each forwarded notification into the `Notifications` table and acks the site, on which the engine clears the buffered message.
+- **Site Call Audit**: The central observability sibling for the cached-call categories — this engine emits `CachedCallTelemetry` on every tracking-table transition, answers `CachedCallReconcileRequest` pulls, and executes the `RetryParkedOperation` / `DiscardParkedOperation` commands it relays.
 - **Health Monitoring**: Reports buffer depth metrics, including the notification backlog covering the site→central forward leg.