docs(requirements): add site-local tracking table and telemetry to Store-and-Forward

Joseph Doherty
2026-05-19 11:42:20 -04:00
parent 5efbb9a985
commit 17ef5f85de

@@ -18,9 +18,11 @@ Site clusters only. The central cluster does not buffer messages.
- Retry delivery per message according to the configured retry policy.
- Park messages that exhaust their retry limit (dead-letter).
- Persist buffered messages to local SQLite for durability.
- Maintain a site-local **operation tracking table** holding one row per `TrackedOperationId` for cached calls (`ExternalCall` and `DatabaseWrite`) — the authoritative status record consulted by `Tracking.Status(id)`.
- Emit cached-call lifecycle telemetry to the central Site Call Audit component on every status transition.
- Replicate buffered messages to the standby node via application-level replication over Akka.NET remoting.
- On failover, the standby node takes over delivery from its replicated copy.
- Respond to remote queries from central for parked message management (list, retry, discard), including central-driven Retry/Discard of parked cached calls.
## Message Lifecycle
@@ -44,6 +46,8 @@ Attempt immediate delivery
For notifications, "delivery" means forwarding the message to the central cluster via CentralSite Communication; "success" is central's ack, on which the message is cleared. Notifications do not park — they are retried at the fixed forward interval until central acks. Parking applies only to the external-system-call and cached-database-write categories.
For the cached-call categories (`ExternalCall` and `DatabaseWrite`), the operation tracking table is the status record and the S&F buffer is purely the retry mechanism. A cached call that succeeds on its first immediate attempt is written directly as a terminal `Delivered` tracking row and never enters the S&F buffer. When immediate delivery fails transiently, the message is buffered and its tracking row moves to `Pending`/`Retrying`; the buffered message carries its `TrackedOperationId` so the tracking row and the retry record stay linked. On every tracking-table status transition the site emits `CachedCallTelemetry` to central.
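The cached-call path above can be sketched roughly as follows. This is an illustrative model only, not the engine's implementation: the `Site` class, the `TransientError` marker, the in-memory dictionaries, and the choice of `Pending` on first buffering versus `Retrying` after a failed retry are all assumptions made for the example. For brevity the sketch writes the tracking row once the immediate attempt resolves.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable
import uuid


class Status(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    DELIVERED = "Delivered"


class TransientError(Exception):
    """Hypothetical marker for a failure that is worth retrying."""


@dataclass
class Site:
    tracking: dict = field(default_factory=dict)   # TrackedOperationId -> Status
    buffer: list = field(default_factory=list)     # S&F retry records
    telemetry: list = field(default_factory=list)  # stand-in for CachedCallTelemetry

    def _transition(self, op_id: str, status: Status) -> None:
        # Every status change updates the tracking row and emits telemetry.
        self.tracking[op_id] = status
        self.telemetry.append((op_id, status))

    def submit(self, deliver: Callable[[], None]) -> str:
        """Attempt immediate delivery; buffer only on transient failure."""
        op_id = str(uuid.uuid4())
        try:
            deliver()
        except TransientError:
            # The buffered record carries the id that links it to its row.
            self.buffer.append({"tracked_operation_id": op_id, "deliver": deliver})
            self._transition(op_id, Status.PENDING)
        else:
            # First-attempt success: terminal row, nothing enters the buffer.
            self._transition(op_id, Status.DELIVERED)
        return op_id

    def retry_once(self) -> None:
        """Re-attempt the oldest buffered message, if any."""
        if not self.buffer:
            return
        record = self.buffer[0]
        op_id = record["tracked_operation_id"]
        try:
            record["deliver"]()
        except TransientError:
            self._transition(op_id, Status.RETRYING)
        else:
            self.buffer.pop(0)
            self._transition(op_id, Status.DELIVERED)
```

In this model the buffer never holds a message whose first attempt succeeded, and the telemetry list grows by exactly one entry per tracking-row transition.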
## Retry Policy
For the external-system-call and cached-database-write categories, retry settings are defined on the **source entity** (not per-message):
@@ -68,6 +72,22 @@ There is **no maximum buffer size**. Messages accumulate in the buffer until del
- On failover, the new active node has a near-complete copy of the buffer. In rare cases, the most recent operations may not have been replicated (e.g., a message added or removed just before failover). This can result in a few **duplicate deliveries** (message delivered but its removal not replicated) or a few **missed retries** (message added but not replicated). Both are acceptable trade-offs for the latency benefit.
- On failover, the new active node resumes delivery from its local copy.
### Operation Tracking Table
Alongside the S&F buffer DB, each site node holds a **site-local operation tracking table** in SQLite. It carries one row per `TrackedOperationId` for cached calls (`ExternalCall` and `DatabaseWrite`), created the moment the script issues the cached call and kept regardless of outcome.
- This table is the **status record**; the S&F buffer remains purely the **retry mechanism**. Each buffered cached-call message carries the `TrackedOperationId` of its tracking row, which links the two.
- Each row records the operation kind (`TrackedOperationKind`), a target summary (external system + method, or database connection name), the unified `TrackedOperationStatus`, retry count, last error, source provenance (instance / script), and the created/updated/terminal UTC timestamps.
- `Tracking.Status(id)` reads this table. For cached calls the **site is the authoritative source of truth** for status — the query is always answered site-locally, even when central is unreachable. The central Site Call Audit `SiteCalls` table is an eventually-consistent mirror.
- A cached call that succeeds on its first immediate attempt writes a terminal `Delivered` row directly here, with nothing placed in the S&F buffer.
- Terminal rows are purged after a configurable retention window (default 7 days) — the site holds live operational state; central holds long-term audit.
Notifications are unaffected: they have no tracking table. Their `NotificationId` and status are owned by the central `Notifications` table, and their lifecycle continues to forward to central exactly as before.
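One possible shape for the operation tracking table, with a site-local status read and the retention purge, is sketched below. The table name, column names, and the storage of timestamps as ISO-8601 UTC text are assumptions for illustration; only the set of recorded fields and the 7-day default come from the requirements above.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical schema: one row per TrackedOperationId.
SCHEMA = """
CREATE TABLE IF NOT EXISTS operation_tracking (
    tracked_operation_id TEXT PRIMARY KEY,
    kind            TEXT NOT NULL,   -- TrackedOperationKind
    target_summary  TEXT NOT NULL,   -- system + method, or connection name
    status          TEXT NOT NULL,   -- TrackedOperationStatus
    retry_count     INTEGER NOT NULL DEFAULT 0,
    last_error      TEXT,
    source_instance TEXT,
    source_script   TEXT,
    created_utc     TEXT NOT NULL,
    updated_utc     TEXT NOT NULL,
    terminal_utc    TEXT             -- NULL while the operation is still live
);
"""


def open_tracking_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def status(conn: sqlite3.Connection, op_id: str):
    """Answer a status query purely from the local table."""
    row = conn.execute(
        "SELECT status FROM operation_tracking WHERE tracked_operation_id = ?",
        (op_id,),
    ).fetchone()
    return row[0] if row else None


def purge_terminal(conn, now: datetime, retention=timedelta(days=7)) -> int:
    """Delete terminal rows older than the retention window; live rows stay."""
    cutoff = (now - retention).isoformat(timespec="seconds")
    cur = conn.execute(
        "DELETE FROM operation_tracking "
        "WHERE terminal_utc IS NOT NULL AND terminal_utc < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```

The text comparison in `purge_terminal` relies on every timestamp being written in the same UTC format, which is why the sketch fixes `timespec="seconds"`. Non-terminal rows are never purged, however old they are.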
### Telemetry to Central
On every tracking-table status transition, the site emits a `CachedCallTelemetry` message to the central Site Call Audit component over the site→central channel. Emission is best-effort, at-least-once, and idempotent on `TrackedOperationId`. Because telemetry is best-effort, the site also responds to `CachedCallReconcileRequest` reconciliation pulls — cursor-based per-site reads of tracking rows changed since a cursor — so any missed telemetry self-heals. The site never depends on central; central converges to the site.
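The self-healing behaviour can be illustrated with a small model. Everything here is a sketch under assumptions: the `version` field (a site-wide change sequence used as the reconcile cursor), the `SiteTracking` and `CentralMirror` classes, and the "newer version wins" rule are hypothetical. The requirements only state that telemetry is idempotent on `TrackedOperationId` and that reconciliation is cursor-based.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Telemetry:
    tracked_operation_id: str
    status: str
    version: int  # hypothetical site-wide change sequence number


class CentralMirror:
    """Eventually-consistent mirror: applying a message twice is harmless."""

    def __init__(self):
        self.rows = {}

    def apply(self, t: Telemetry) -> None:
        current = self.rows.get(t.tracked_operation_id)
        # Ignore duplicates and stale, out-of-order messages.
        if current is None or t.version > current.version:
            self.rows[t.tracked_operation_id] = t


class SiteTracking:
    """Authoritative site-side state; central converges to it."""

    def __init__(self):
        self.rows = {}
        self._seq = 0

    def transition(self, op_id: str, status: str) -> Telemetry:
        self._seq += 1
        t = Telemetry(op_id, status, self._seq)
        self.rows[op_id] = t
        return t  # the caller sends this best-effort; it may be lost

    def reconcile(self, cursor: int, limit: int = 100):
        """Return rows changed since `cursor`, plus the next cursor."""
        changed = sorted(
            (t for t in self.rows.values() if t.version > cursor),
            key=lambda t: t.version,
        )[:limit]
        next_cursor = changed[-1].version if changed else cursor
        return changed, next_cursor
```

Because the mirror applies reconcile results through the same `apply` path as live telemetry, a reconcile pull that overlaps with messages already received changes nothing, and one that covers lost messages fills the gap.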
## Parked Message Management
- Parked messages remain stored at the site in SQLite.
@@ -75,13 +95,15 @@ There is **no maximum buffer size**. Messages accumulate in the buffer until del
- Operators can:
- **Retry** a parked message (moves it back to the retry queue).
- **Discard** a parked message (removes it permanently).
- For parked cached calls, Retry/Discard can be driven centrally: the Site Call Audit component relays `RetryParkedOperation` / `DiscardParkedOperation` commands (keyed by `TrackedOperationId`) down to the owning site. The site applies the command to its S&F buffer and tracking table, then emits `CachedCallTelemetry` reflecting the new state (`Retrying` or `Discarded`) — central never mutates its mirror row directly.
- Store-and-forward messages are **not** automatically cleared when an instance is deleted. Pending and parked messages, and their tracking rows, continue to exist and can be managed via the central UI.
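A minimal sketch of how a site might apply the relayed parked-operation commands is shown below. The `ParkedOps` class, its method names, and the `Parked` status label are hypothetical; only the command keys (`TrackedOperationId`) and the resulting `Retrying` / `Discarded` states come from the text above.

```python
class ParkedOps:
    """Site-side handler: the site mutates its own state, then reports it."""

    def __init__(self):
        self.tracking = {}     # TrackedOperationId -> status
        self.parked = {}       # TrackedOperationId -> parked message
        self.retry_queue = []  # messages eligible for delivery again
        self.telemetry = []    # stand-in for CachedCallTelemetry emissions

    def _set(self, op_id, status):
        self.tracking[op_id] = status
        self.telemetry.append((op_id, status))

    def park(self, op_id, message):
        # "Parked" is an assumed status name for a retry-exhausted call.
        self.parked[op_id] = message
        self._set(op_id, "Parked")

    def retry_parked(self, op_id) -> bool:
        """Handle a RetryParkedOperation command."""
        message = self.parked.pop(op_id, None)
        if message is None:
            return False  # unknown or no longer parked: nothing to do
        self.retry_queue.append(message)
        self._set(op_id, "Retrying")
        return True

    def discard_parked(self, op_id) -> bool:
        """Handle a DiscardParkedOperation command."""
        if self.parked.pop(op_id, None) is None:
            return False
        # The message is removed; the tracking row stays as the record.
        self._set(op_id, "Discarded")
        return True
```

Returning `False` for an operation that is not parked makes a repeated command a no-op, so a relayed command that arrives twice does not emit misleading telemetry.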
## Message Format
Each buffered message stores:
- **Message ID**: Unique identifier.
- **Category**: External system call, notification, or cached database write.
- **Tracked Operation ID**: For the cached-call categories, the `TrackedOperationId` linking the buffered message to its row in the operation tracking table. Not used by the notification category, which is tracked centrally via its `NotificationId`.
- **Target**: External system name, the central cluster (for notifications), or database connection name.
- **Payload**: Serialized message content (API method + parameters; notification list name + subject + body plus the locally generated `NotificationId` and source provenance; SQL + parameters).
- **Retry Count**: Number of attempts so far.
@@ -94,7 +116,8 @@ Each buffered message stores:
- **SQLite**: Local persistence on each node.
- **Communication Layer**: Application-level replication to standby node; remote query handling from central; carries buffered notifications to the central cluster (ClusterClient) and receives central's acks.
- **External System Gateway**: Delivers external system API calls.
- **CentralSite Communication**: The delivery target for the notification category — a buffered notification is forwarded to the central cluster over CentralSite Communication and cleared on central's ack. Also carries `CachedCallTelemetry` and reconciliation responses to central, and receives `RetryParkedOperation` / `DiscardParkedOperation` commands.
- **Site Call Audit**: The central audit mirror for cached calls — receives this engine's cached-call telemetry and reconciliation responses, and relays operator Retry/Discard of parked cached calls back as commands.
- **Database Connections**: Delivers cached database writes.
- **Site Event Logging**: Logs store-and-forward activity (queued, delivered, retried, parked).
@@ -103,4 +126,5 @@ Each buffered message stores:
- **Site Runtime (Script Actors)**: Scripts submit messages to the buffer (external calls, notifications, cached DB writes).
- **Communication Layer**: Handles parked message queries/commands from central; carries buffered notifications to the central cluster.
- **Notification Outbox**: The central destination for the notification category — central ingests each forwarded notification into the `Notifications` table and acks the site, on which the engine clears the buffered message.
- **Site Call Audit**: The central observability sibling for the cached-call categories — this engine emits `CachedCallTelemetry` on every tracking-table transition, answers `CachedCallReconcileRequest` pulls, and executes the `RetryParkedOperation` / `DiscardParkedOperation` commands it relays.
- **Health Monitoring**: Reports buffer depth metrics, including the notification backlog covering the site→central forward leg.