From 17ef5f85deb89962bfc1143e33f63a7468f673cb Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Tue, 19 May 2026 11:42:20 -0400 Subject: [PATCH] docs(requirements): add site-local tracking table and telemetry to Store-and-Forward --- .../requirements/Component-StoreAndForward.md | 30 +++++++++++++++++-- 1 file changed, 27 insertions(+), 3 deletions(-) diff --git a/docs/requirements/Component-StoreAndForward.md b/docs/requirements/Component-StoreAndForward.md index 24297d5..b76fdb5 100644 --- a/docs/requirements/Component-StoreAndForward.md +++ b/docs/requirements/Component-StoreAndForward.md @@ -18,9 +18,11 @@ Site clusters only. The central cluster does not buffer messages. - Retry delivery per message according to the configured retry policy. - Park messages that exhaust their retry limit (dead-letter). - Persist buffered messages to local SQLite for durability. +- Maintain a site-local **operation tracking table** holding one row per `TrackedOperationId` for cached calls (`ExternalCall` and `DatabaseWrite`) — the authoritative status record consulted by `Tracking.Status(id)`. +- Emit cached-call lifecycle telemetry to the central Site Call Audit component on every status transition. - Replicate buffered messages to the standby node via application-level replication over Akka.NET remoting. - On failover, the standby node takes over delivery from its replicated copy. -- Respond to remote queries from central for parked message management (list, retry, discard). +- Respond to remote queries from central for parked message management (list, retry, discard), including central-driven Retry/Discard of parked cached calls. ## Message Lifecycle @@ -44,6 +46,8 @@ Attempt immediate delivery For notifications, "delivery" means forwarding the message to the central cluster via Central–Site Communication; "success" is central's ack, on which the message is cleared. Notifications do not park — they are retried at the fixed forward interval until central acks. Parking applies only to the external-system-call and cached-database-write categories. +For the cached-call categories (`ExternalCall` and `DatabaseWrite`), the operation tracking table is the status record and the S&F buffer is purely the retry mechanism. A cached call that succeeds on its first immediate attempt is written directly as a terminal `Delivered` tracking row and never enters the S&F buffer. When immediate delivery fails transiently, the message is buffered and its tracking row moves to `Pending`/`Retrying`; the buffered message carries its `TrackedOperationId` so the tracking row and the retry record stay linked. On every tracking-table status transition the site emits `CachedCallTelemetry` to central. + ## Retry Policy For the external-system-call and cached-database-write categories, retry settings are defined on the **source entity** (not per-message): @@ -68,6 +72,22 @@ There is **no maximum buffer size**. Messages accumulate in the buffer until del - On failover, the new active node has a near-complete copy of the buffer. In rare cases, the most recent operations may not have been replicated (e.g., a message added or removed just before failover). This can result in a few **duplicate deliveries** (message delivered but remove not replicated) or a few **missed retries** (message added but not replicated). Both are acceptable trade-offs for the latency benefit. - On failover, the new active node resumes delivery from its local copy. +### Operation Tracking Table + +Alongside the S&F buffer DB, each site node holds a **site-local operation tracking table** in SQLite. It carries one row per `TrackedOperationId` for cached calls (`ExternalCall` and `DatabaseWrite`), created the moment the script issues the cached call and kept regardless of outcome. + +- This table is the **status record**; the S&F buffer remains purely the **retry mechanism**. A buffered cached-call message references its `TrackedOperationId` back to its tracking row. +- Each row records the operation kind (`TrackedOperationKind`), a target summary (external system + method, or database connection name), the unified `TrackedOperationStatus`, retry count, last error, source provenance (instance / script), and the created/updated/terminal UTC timestamps. +- `Tracking.Status(id)` reads this table. For cached calls the **site is the authoritative source of truth** for status — the query is always answered site-locally, even when central is unreachable. The central Site Call Audit `SiteCalls` table is an eventually-consistent mirror. +- A cached call that succeeds on its first immediate attempt writes a terminal `Delivered` row directly here, with nothing placed in the S&F buffer. +- Terminal rows are purged after a configurable retention window (default 7 days) — the site holds live operational state; central holds long-term audit. + +Notifications are unaffected: they have no tracking table. Their `NotificationId` and status are owned by the central `Notifications` table, and their lifecycle continues to forward to central exactly as before. + +### Telemetry to Central + +On every tracking-table status transition, the site emits a `CachedCallTelemetry` message to the central Site Call Audit component over the site→central channel. Emission is best-effort, at-least-once, and idempotent on `TrackedOperationId`. Because telemetry is best-effort, the site also responds to `CachedCallReconcileRequest` reconciliation pulls — cursor-based per-site reads of tracking rows changed since a cursor — so any missed telemetry self-heals. The site never depends on central; central converges to the site. + ## Parked Message Management - Parked messages remain stored at the site in SQLite. @@ -75,13 +95,15 @@ There is **no maximum buffer size**. Messages accumulate in the buffer until del - Operators can: - **Retry** a parked message (moves it back to the retry queue). - **Discard** a parked message (removes it permanently). -- Store-and-forward messages are **not** automatically cleared when an instance is deleted. Pending and parked messages continue to exist and can be managed via the central UI. +- For parked cached calls, Retry/Discard can be driven centrally: the Site Call Audit component relays `RetryParkedOperation` / `DiscardParkedOperation` commands (keyed by `TrackedOperationId`) down to the owning site. The site applies the command to its S&F buffer and tracking table, then emits `CachedCallTelemetry` reflecting the new state (`Retrying` or `Discarded`) — central never mutates its mirror row directly. +- Store-and-forward messages are **not** automatically cleared when an instance is deleted. Pending and parked messages, and their tracking rows, continue to exist and can be managed via the central UI. ## Message Format Each buffered message stores: - **Message ID**: Unique identifier. - **Category**: External system call, notification, or cached database write. +- **Tracked Operation ID**: For the cached-call categories, the `TrackedOperationId` linking the buffered message to its row in the operation tracking table. Not used by the notification category, which is tracked centrally via its `NotificationId`. - **Target**: External system name, the central cluster (for notifications), or database connection name. - **Payload**: Serialized message content (API method + parameters; notification list name + subject + body plus the locally generated `NotificationId` and source provenance; SQL + parameters). - **Retry Count**: Number of attempts so far. @@ -94,7 +116,8 @@ Each buffered message stores: - **SQLite**: Local persistence on each node. - **Communication Layer**: Application-level replication to standby node; remote query handling from central; carries buffered notifications to the central cluster (ClusterClient) and receives central's acks. - **External System Gateway**: Delivers external system API calls. -- **Central–Site Communication**: The delivery target for the notification category — a buffered notification is forwarded to the central cluster over Central–Site Communication and cleared on central's ack. +- **Central–Site Communication**: The delivery target for the notification category — a buffered notification is forwarded to the central cluster over Central–Site Communication and cleared on central's ack. Also carries `CachedCallTelemetry` and reconciliation responses to central, and receives `RetryParkedOperation` / `DiscardParkedOperation` commands. +- **Site Call Audit**: The central audit mirror for cached calls — receives this engine's cached-call telemetry and reconciliation responses, and relays operator Retry/Discard of parked cached calls back as commands. - **Database Connections**: Delivers cached database writes. - **Site Event Logging**: Logs store-and-forward activity (queued, delivered, retried, parked). @@ -103,4 +126,5 @@ Each buffered message stores: - **Site Runtime (Script Actors)**: Scripts submit messages to the buffer (external calls, notifications, cached DB writes). - **Communication Layer**: Handles parked message queries/commands from central; carries buffered notifications to the central cluster. - **Notification Outbox**: The central destination for the notification category — central ingests each forwarded notification into the `Notifications` table and acks the site, on which the engine clears the buffered message. +- **Site Call Audit**: The central observability sibling for the cached-call categories — this engine emits `CachedCallTelemetry` on every tracking-table transition, answers `CachedCallReconcileRequest` pulls, and executes the `RetryParkedOperation` / `DiscardParkedOperation` commands it relays. - **Health Monitoring**: Reports buffer depth metrics, including the notification backlog covering the site→central forward leg.