# Component: Store-and-Forward Engine

## Purpose

The Store-and-Forward Engine provides reliable message delivery for outbound communications from site clusters. It buffers messages when the target system is unavailable, retries them according to configured policies, and parks messages that exhaust retries for manual review.

## Location

Site clusters only. The central cluster does not buffer messages.

## Responsibilities

- Buffer outbound messages when the target system is unavailable.
- Manage three categories of buffered messages:
  - External system API calls.
  - Notifications forwarded to the central cluster.
  - Cached database writes.
- Retry delivery per message according to the configured retry policy.
- Park messages that exhaust their retry limit (dead-letter).
- Persist buffered messages to local SQLite for durability.
- Maintain a site-local **operation tracking table** holding one row per `TrackedOperationId` for cached calls (`ExternalCall` and `DatabaseWrite`). This table is the authoritative status record consulted by `Tracking.Status(id)`.
- Emit cached-call lifecycle telemetry to the central Site Call Audit component on every status transition.
- Replicate buffered messages to the standby node via application-level replication over Akka.NET remoting.
- On failover, the standby node takes over delivery from its replicated copy.
- Respond to remote queries from central for parked message management (list, retry, discard), including central-driven Retry/Discard of parked cached calls.

## Message Lifecycle

```
Script submits message
    │
    ▼
Attempt immediate delivery
    │
    ├── Success → Remove from buffer
    │
    └── Failure → Buffer message
                      │
                      ▼
               Retry loop (per retry policy)
                      │
                      ├── Success → Remove from buffer + notify standby
                      │
                      └── Max retries exhausted → Park message
```

For notifications, "delivery" means forwarding the message to the central cluster via Central–Site Communication; "success" is central's ack, on which the message is cleared. Notifications do not park: they are retried at the fixed forward interval until central acks. Parking applies only to the external-system-call and cached-database-write categories.

For the cached-call categories (`ExternalCall` and `DatabaseWrite`), the operation tracking table is the status record and the S&F buffer is purely the retry mechanism. A cached call that succeeds on its first immediate attempt is written directly as a terminal `Delivered` tracking row and never enters the S&F buffer. When immediate delivery fails transiently, the message is buffered and its tracking row moves to `Pending`/`Retrying`; the buffered message carries its `TrackedOperationId` so the tracking row and the retry record stay linked. On every tracking-table status transition the site emits `CachedCallTelemetry` to central.
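The cached-call path described above can be sketched as follows. This is a minimal illustration in Python, not the engine's actual code: the `Engine` class, `TransientError`, and the in-memory dictionaries standing in for the tracking table, the S&F buffer, and the telemetry channel are all hypothetical. Permanent failures are not modelled; they simply propagate to the caller.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "Pending"
    DELIVERED = "Delivered"


class TransientError(Exception):
    """Hypothetical marker for a failure that is worth retrying."""


@dataclass
class Engine:
    tracking: dict = field(default_factory=dict)   # TrackedOperationId -> Status
    buffer: list = field(default_factory=list)     # (TrackedOperationId, payload)
    telemetry: list = field(default_factory=list)  # emitted (id, Status) transitions

    def _set_status(self, op_id, status):
        # Every tracking-row transition also emits telemetry.
        self.tracking[op_id] = status
        self.telemetry.append((op_id, status))

    def submit_cached_call(self, payload, deliver):
        op_id = str(uuid.uuid4())
        try:
            deliver(payload)  # immediate attempt
        except TransientError:
            # Buffer for retry; the buffered record carries the tracking id.
            self.buffer.append((op_id, payload))
            self._set_status(op_id, Status.PENDING)
        else:
            # First-attempt success: terminal row, nothing enters the buffer.
            self._set_status(op_id, Status.DELIVERED)
        return op_id
```

In this sketch a script that calls `submit_cached_call` always gets an id back, and can later look the id up in `tracking` whether or not the call was buffered.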

## Retry Policy

For the external-system-call and cached-database-write categories, retry settings are defined on the **source entity** (not per-message):
- **External systems**: Each external system definition includes max retry count and time between retries.
- **Cached database writes**: Each database connection definition includes max retry count and time between retries.

The **notification** category retries differently: it has no source-entity setting. The site→central forward uses a single fixed retry interval configured in the host `appsettings.json`. This interval is infrastructure config for reaching the central cluster, not a per-notification-list setting. It applies uniformly to every buffered notification regardless of its target list. A buffered notification is retried until central acks it; it is not parked on a retry limit (central, once reachable, owns delivery, retry, and parking from that point on).

The retry interval is **fixed** (not exponential backoff). A fixed interval is sufficient for the expected use cases.

**Note**: Only **transient failures** are eligible for store-and-forward buffering. For external system calls, transient failures are connection errors, timeouts, and HTTP 5xx responses. Permanent failures (HTTP 4xx) are returned directly to the calling script and are **not** queued for retry. This prevents the buffer from accumulating requests that will never succeed.
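The classification rule in the note can be expressed as a small predicate. The sketch below is illustrative only: `is_transient` and `should_park` are hypothetical names, the mapping of "connection errors and timeouts" onto Python's built-in `ConnectionError` and `TimeoutError` is an assumption, and so is the rule that a message parks once its retry count reaches the configured maximum.

```python
from typing import Optional


def is_transient(http_status: Optional[int] = None,
                 error: Optional[BaseException] = None) -> bool:
    """Return True when a failed external call is eligible for buffering."""
    if error is not None:
        # Connection errors and timeouts are retryable.
        return isinstance(error, (ConnectionError, TimeoutError))
    # HTTP 5xx is retryable; 4xx (and anything else) goes back to the script.
    return http_status is not None and 500 <= http_status <= 599


def should_park(retry_count: int, max_retries: int) -> bool:
    """Assumed rule: park once the retry budget of the source entity is used up."""
    return retry_count >= max_retries
```

For example, `is_transient(http_status=503)` is true and the call would be buffered, while `is_transient(http_status=404)` is false and the error would be returned to the calling script.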

## Buffer Size

There is **no maximum buffer size**. Messages accumulate in the buffer until delivery succeeds or retries are exhausted and the message is parked. Storage is bounded only by available disk space on the site node.

## Persistence

- Buffered messages are persisted to a **local SQLite database** on each site node.
- The active node persists locally and forwards each buffer operation (add, remove, park) to the standby node **asynchronously** via Akka.NET remoting. The active node does not wait for standby acknowledgment; this avoids adding latency to every script that buffers a message.
- The standby node applies the same operations to its own local SQLite database.
- On failover, the new active node has a near-complete copy of the buffer. In rare cases, the most recent operations may not have been replicated (e.g., a message added or removed just before failover). This can result in a few **duplicate deliveries** (message delivered but remove not replicated) or a few **missed retries** (message added but not replicated). Both are acceptable trade-offs for the latency benefit.
- On failover, the new active node resumes delivery from its local copy.
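The persist-then-forward pattern above can be sketched in Python. Everything here is an assumption made for illustration: the `buffer` table layout, the operation dictionaries, and the use of `queue.Queue` as a stand-in for the asynchronous Akka.NET remoting channel. The point is only that the active node commits locally and hands the operation off without waiting for the standby.

```python
import json
import queue
import sqlite3


def open_buffer_db(path=":memory:"):
    """Open a buffer database with a hypothetical, simplified schema."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS buffer ("
        " message_id TEXT PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    return db


def apply_op(db, op):
    """Apply one buffer operation (add, remove, park) to a local database."""
    kind, message_id = op["kind"], op["message_id"]
    if kind == "add":
        db.execute(
            "INSERT OR IGNORE INTO buffer VALUES (?, ?, 'Pending')",
            (message_id, json.dumps(op["payload"])),
        )
    elif kind == "remove":
        db.execute("DELETE FROM buffer WHERE message_id = ?", (message_id,))
    elif kind == "park":
        db.execute(
            "UPDATE buffer SET status = 'Parked' WHERE message_id = ?",
            (message_id,),
        )
    else:
        raise ValueError(f"unknown operation kind: {kind}")
    db.commit()


def active_apply(db, replication_queue, op):
    """Active node: persist locally, then forward without awaiting an ack."""
    apply_op(db, op)
    replication_queue.put_nowait(op)
```

The standby would drain the queue and call the same `apply_op` on its own database. Any operation still in flight at failover is exactly the window in which duplicate deliveries or missed retries can occur.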

### Operation Tracking Table

Alongside the S&F buffer DB, each site node holds a **site-local operation tracking table** in SQLite. It carries one row per `TrackedOperationId` for cached calls (`ExternalCall` and `DatabaseWrite`), created the moment the script issues the cached call and kept regardless of outcome.

- This table is the **status record**; the S&F buffer remains purely the **retry mechanism**. A buffered cached-call message references its `TrackedOperationId` back to its tracking row.
- Each row records the operation kind (`TrackedOperationKind`), a target summary (external system + method, or database connection name), the unified `TrackedOperationStatus`, retry count, last error, source provenance (instance / script), and the created/updated/terminal UTC timestamps.
- `Tracking.Status(id)` reads this table. For cached calls the **site is the authoritative source of truth** for status: the query is always answered site-locally, even when central is unreachable. The central Site Call Audit `SiteCalls` table is an eventually-consistent mirror.
- A cached call that succeeds on its first immediate attempt writes a terminal `Delivered` row directly here, with nothing placed in the S&F buffer.
- Terminal rows are purged after a configurable retention window (default 7 days). The site holds live operational state; central holds long-term audit.

Notifications are unaffected: they have no tracking table. Their `NotificationId` and status are owned by the central `Notifications` table, and their lifecycle continues to forward to central exactly as before.
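One possible shape for the tracking table and its retention purge is sketched below. The table name, column names, and the choice to store timestamps as ISO 8601 text are assumptions; only the set of recorded fields and the 7-day default come from the description above.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical schema; the real table and column names may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tracked_operations (
    tracked_operation_id TEXT PRIMARY KEY,
    kind            TEXT NOT NULL,   -- ExternalCall or DatabaseWrite
    target_summary  TEXT NOT NULL,
    status          TEXT NOT NULL,
    retry_count     INTEGER NOT NULL DEFAULT 0,
    last_error      TEXT,
    source_instance TEXT,
    source_script   TEXT,
    created_utc     TEXT NOT NULL,
    updated_utc     TEXT NOT NULL,
    terminal_utc    TEXT             -- NULL while the operation is still live
)
"""


def iso(dt: datetime) -> str:
    """Fixed-width UTC timestamp so that text comparison matches time order."""
    return dt.astimezone(timezone.utc).isoformat(timespec="microseconds")


def purge_terminal_rows(db, now, retention=timedelta(days=7)):
    """Delete terminal rows older than the retention window; live rows stay."""
    cur = db.execute(
        "DELETE FROM tracked_operations"
        " WHERE terminal_utc IS NOT NULL AND terminal_utc < ?",
        (iso(now - retention),),
    )
    db.commit()
    return cur.rowcount
```

Rows with a `NULL` terminal timestamp are never purged in this sketch, which matches the idea that the site keeps live operational state for as long as an operation is pending, retrying, or parked.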

### Telemetry to Central

On every tracking-table status transition, the site emits a `CachedCallTelemetry` message to the central Site Call Audit component over the site→central channel. Emission is best-effort, at-least-once, and idempotent on `TrackedOperationId`. Because telemetry is best-effort, the site also responds to `CachedCallReconcileRequest` reconciliation pulls (cursor-based per-site reads of tracking rows changed since a cursor), so any missed telemetry self-heals. The site never depends on central; central converges to the site.
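The combination of idempotent telemetry and cursor-based reconciliation might look like the following sketch. The `version` field (a per-site change counter used as the cursor), the function names, and the dictionary standing in for central's mirror table are all hypothetical; the source does not say what the cursor actually is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackingRow:
    tracked_operation_id: str
    status: str
    version: int  # hypothetical per-site change counter used as the cursor


def reconcile_pull(site_rows, cursor, limit=100):
    """Site side: return rows changed since `cursor`, plus the next cursor."""
    changed = sorted(
        (r for r in site_rows if r.version > cursor),
        key=lambda r: r.version,
    )[:limit]
    next_cursor = changed[-1].version if changed else cursor
    return changed, next_cursor


def apply_to_mirror(mirror, row):
    """Central side: upsert keyed by TrackedOperationId.

    Duplicate or stale rows are no-ops, so telemetry can be delivered
    more than once without corrupting the mirror.
    """
    current = mirror.get(row.tracked_operation_id)
    if current is None or row.version > current.version:
        mirror[row.tracked_operation_id] = row
```

Central would feed both pushed telemetry and pulled reconciliation rows through the same `apply_to_mirror` path, which is what lets a missed push be repaired by a later pull.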

## Parked Message Management

- Parked messages remain stored at the site in SQLite.
- The central UI can query sites for parked messages via the Communication Layer.
- Operators can:
  - **Retry** a parked message (moves it back to the retry queue).
  - **Discard** a parked message (removes it permanently).
- For parked cached calls, Retry/Discard can be driven centrally: the Site Call Audit component relays `RetryParkedOperation` / `DiscardParkedOperation` commands (keyed by `TrackedOperationId`) down to the owning site. The site applies the command to its S&F buffer and tracking table, then emits `CachedCallTelemetry` reflecting the new state (`Retrying` or `Discarded`). Central never mutates its mirror row directly.
- Store-and-forward messages are **not** automatically cleared when an instance is deleted. Pending and parked messages, and their tracking rows, continue to exist and can be managed via the central UI.
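A site-side handler for the centrally relayed commands could be sketched as below. The function name, the dictionary-based buffer and tracking table, and the `emit` callback are hypothetical. Resetting the retry count on Retry is also an assumption; the source only says the message moves back to the retry queue.

```python
def handle_parked_command(command, op_id, buffer, tracking, emit):
    """Apply a relayed Retry/Discard command to a parked cached call.

    Returns False when the id is unknown or the operation is not parked.
    """
    if command not in ("RetryParkedOperation", "DiscardParkedOperation"):
        raise ValueError(f"unknown command: {command}")
    if tracking.get(op_id) != "Parked":
        return False

    if command == "RetryParkedOperation":
        buffer[op_id]["status"] = "Retrying"
        buffer[op_id]["retry_count"] = 0  # assumption: the retry budget resets
        tracking[op_id] = "Retrying"
    else:
        del buffer[op_id]                 # removed permanently from the buffer
        tracking[op_id] = "Discarded"     # the tracking row itself is kept

    # The site reports its new state; central updates its mirror from this.
    emit(op_id, tracking[op_id])
    return True
```

The site mutates its own state first and then emits telemetry, so central only ever learns the outcome rather than deciding it.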

## Message Format

Each buffered message stores:
- **Message ID**: Unique identifier.
- **Category**: External system call, notification, or cached database write.
- **Tracked Operation ID**: For the cached-call categories, the `TrackedOperationId` linking the buffered message to its row in the operation tracking table. Not used by the notification category, which is tracked centrally via its `NotificationId`.
- **Target**: External system name, the central cluster (for notifications), or database connection name.
- **Payload**: Serialized message content (API method + parameters; notification list name + subject + body plus the locally generated `NotificationId` and source provenance; SQL + parameters).
- **Retry Count**: Number of attempts so far.
- **Created At**: Timestamp when the message was first queued.
- **Last Attempt At**: Timestamp of the most recent delivery attempt.
- **Status**: Pending, retrying, or parked.
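The field list above maps naturally onto a record type. The Python sketch below is one way to model it; the class, enum, and attribute names are illustrative, and the validation rule simply encodes the statement that only cached-call categories carry a `TrackedOperationId`.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Category(Enum):
    EXTERNAL_CALL = "ExternalCall"
    NOTIFICATION = "Notification"
    DATABASE_WRITE = "DatabaseWrite"


class BufferStatus(Enum):
    PENDING = "Pending"
    RETRYING = "Retrying"
    PARKED = "Parked"


@dataclass
class BufferedMessage:
    message_id: str
    category: Category
    target: str                  # system name, central cluster, or connection name
    payload: str                 # serialized content
    created_at: datetime
    tracked_operation_id: Optional[str] = None
    retry_count: int = 0
    last_attempt_at: Optional[datetime] = None
    status: BufferStatus = BufferStatus.PENDING

    def __post_init__(self):
        is_cached_call = self.category is not Category.NOTIFICATION
        if is_cached_call and self.tracked_operation_id is None:
            raise ValueError("cached-call messages need a TrackedOperationId")
        if not is_cached_call and self.tracked_operation_id is not None:
            raise ValueError("notifications do not carry a TrackedOperationId")
```

A notification message would be constructed without `tracked_operation_id`, since its `NotificationId` travels inside the payload instead.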

## Dependencies

- **SQLite**: Local persistence on each node.
- **Communication Layer**: Application-level replication to standby node; remote query handling from central; carries buffered notifications to the central cluster (ClusterClient) and receives central's acks.
- **External System Gateway**: Delivers external system API calls.
- **Central–Site Communication**: The delivery target for the notification category. A buffered notification is forwarded to the central cluster over Central–Site Communication and cleared on central's ack. Also carries `CachedCallTelemetry` and reconciliation responses to central, and receives `RetryParkedOperation` / `DiscardParkedOperation` commands.
- **Site Call Audit**: The central audit mirror for cached calls. It receives this engine's cached-call telemetry and reconciliation responses, and relays operator Retry/Discard of parked cached calls back as commands.
- **Database Connections**: Delivers cached database writes.
- **Site Event Logging**: Logs store-and-forward activity (queued, delivered, retried, parked).

## Interactions

- **Site Runtime (Script Actors)**: Scripts submit messages to the buffer (external calls, notifications, cached DB writes).
- **Communication Layer**: Handles parked message queries/commands from central; carries buffered notifications to the central cluster.
- **Notification Outbox**: The central destination for the notification category. Central ingests each forwarded notification into the `Notifications` table and acks the site, on which the engine clears the buffered message.
- **Site Call Audit**: The central observability sibling for the cached-call categories. This engine emits `CachedCallTelemetry` on every tracking-table transition, answers `CachedCallReconcileRequest` pulls, and executes the `RetryParkedOperation` / `DiscardParkedOperation` commands it relays.
- **Health Monitoring**: Reports buffer depth metrics, including the notification backlog covering the site→central forward leg.