docs(requirements): add cached-call telemetry pattern to Communication

2026-05-19 11:50:55 -04:00
parent 00ec265980
commit d43d43d795
1 changed files with 12 additions and 2 deletions
--- a/docs/requirements/Component-Communication.md
+++ b/docs/requirements/Component-Communication.md
@@ -122,7 +122,7 @@ Keepalive settings are configurable via `CommunicationOptions`:
  - Site event logs.
  - Instance debug snapshots (attribute values and alarm states).
 - Central can also send management commands:
-  - Retry or discard parked messages.
+  - Retry or discard parked messages and parked cached calls — central sends `RetryParkedOperation` / `DiscardParkedOperation` (keyed by `TrackedOperationId`) to the owning site, which applies the change to its S&F buffer and tracking table.

 ### 9. Notification Submission (Site → Central)
 - **Pattern**: Fire-and-forget with acknowledgment.
@@ -131,6 +131,14 @@ Keepalive settings are configurable via `CommunicationOptions`:
 - The `NotificationId` GUID — generated at the site — is the **idempotency key**. The handoff is at-least-once: a re-sent submission after a lost ack is harmless because central's insert-if-not-exists treats the duplicate as a no-op.
 - **Transport**: ClusterClient (site→central command/control), consistent with how other site→central messages are sent.

+### 10. Cached Call Telemetry (Site → Central)
+- **Pattern**: Fire-and-forget telemetry with a periodic reconciliation pull.
+- The site **Store-and-Forward Engine** emits a `CachedCallTelemetry` message to central on **every** cached-call lifecycle transition (`Created`/`Pending` → `Retrying` → `Delivered`/`Parked`/`Failed`/`Discarded`). The message carries the `TrackedOperationId`, source site, kind, target summary, status, retry count, last error, key timestamps, and source provenance.
+- Emission is **best-effort and at-least-once**, and ingest is **idempotent on `TrackedOperationId`**: central's Site Call Audit component ingests with insert-if-not-exists, then upsert-on-newer-status, so a re-sent or out-of-order event is harmless.
+- **Reconciliation pull**: because telemetry is best-effort, the central **Site Call Audit** component periodically — and on site reconnect — issues a `CachedCallReconcileRequest` to each site; the site replies with a `CachedCallReconcileResponse` carrying all tracking rows changed since a cursor. Any telemetry missed during a disconnect self-heals through this pull.
+- Central audit is an **eventually-consistent mirror** — the site's operation tracking table remains the source of truth for cached-call status (`Tracking.Status(id)` is always answered site-locally).
+- **Transport**: ClusterClient (site→central command/control), consistent with how other site→central messages are sent.
+
 ## Topology

 ```
@@ -182,6 +190,7 @@ Each request/response pattern has a default timeout that can be overridden in co
 | 5. Recipe/Command Delivery | 30 seconds | Fire-and-forget with ack |
 | 8. Remote Queries | 30 seconds | Querying parked messages or event logs |
 | 9. Notification Submission | 30 seconds | Fire-and-forget with ack; central acks after persisting the row |
+| 10. Cached Call Telemetry | 30 seconds | Applies to the reconciliation pull (request/response); telemetry emission itself is fire-and-forget |

 Timeouts use the Akka.NET **ask pattern**. If no response is received within the timeout, the caller receives a timeout failure.

@@ -237,6 +246,7 @@ The ManagementActor is registered at the well-known path `/user/management` on c
 - **Site Runtime**: Receives deployments, lifecycle commands, and artifact updates. Provides debug view data.
 - **Central UI**: Debug view requests and remote queries flow through communication.
 - **Health Monitoring**: Receives periodic health reports from sites.
- **Store-and-Forward Engine (site)**: Parked message queries/commands are routed through communication.
+- **Store-and-Forward Engine (site)**: Parked message queries/commands are routed through communication. It also emits `CachedCallTelemetry`, answers `CachedCallReconcileRequest` pulls, and receives relayed `RetryParkedOperation` / `DiscardParkedOperation` commands.
+- **Site Call Audit (central)**: Receives cached-call telemetry and reconciliation responses; issues reconciliation pulls and relays parked-operation Retry/Discard commands to sites through communication.
 - **Site Event Logging**: Event log queries are routed through communication.
 - **Management Service**: The ManagementActor is registered with ClusterClientReceptionist on central nodes. The CLI communicates with the ManagementActor via ClusterClient, which is a separate channel from inter-cluster remoting.
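
The idempotent ingest that section 10 describes (insert-if-not-exists, then upsert-on-newer-status) can be sketched as follows. This is a minimal illustration in Python, not the project's implementation: the class and field names are hypothetical, the store is an in-memory dictionary standing in for central's audit table, and "newer" is assumed to mean a strictly later site-side transition timestamp (`updated_at`), which the requirements text does not specify.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CachedCallTelemetry:
    # Hypothetical field names; the source only lists what the message carries.
    tracked_operation_id: str
    status: str
    retry_count: int
    updated_at: float  # ASSUMPTION: site-side time of the transition, used as the "newer" marker


class SiteCallAuditStore:
    """In-memory stand-in for central's audit table (hypothetical)."""

    def __init__(self) -> None:
        self._rows: Dict[str, CachedCallTelemetry] = {}

    def ingest(self, event: CachedCallTelemetry) -> bool:
        """Apply one telemetry event; return True if the stored row changed."""
        current = self._rows.get(event.tracked_operation_id)
        if current is None:
            # Insert-if-not-exists: first sighting of this operation.
            self._rows[event.tracked_operation_id] = event
            return True
        if event.updated_at > current.updated_at:
            # Upsert-on-newer-status: only a strictly newer transition wins.
            self._rows[event.tracked_operation_id] = event
            return True
        # Duplicate or stale (out-of-order) event: treated as a no-op.
        return False

    def status(self, tracked_operation_id: str) -> Optional[str]:
        row = self._rows.get(tracked_operation_id)
        return row.status if row is not None else None


store = SiteCallAuditStore()
pending = CachedCallTelemetry("op-1", "Pending", 0, 1.0)
retrying = CachedCallTelemetry("op-1", "Retrying", 1, 2.0)
delivered = CachedCallTelemetry("op-1", "Delivered", 1, 3.0)

# Deliver out of order and with a duplicate: the late "Retrying" event and the
# re-sent "Delivered" event both leave the stored row untouched.
results = [store.ingest(e) for e in (pending, delivered, retrying, delivered)]
```

The sketch compares timestamps instead of ranking statuses because a parked call can presumably move back to a retrying state after `RetryParkedOperation`, which a fixed status ordering would reject. Rows returned in a `CachedCallReconcileResponse` could plausibly go through the same `ingest` path, though the source does not say so.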