From d43d43d79526079946e01b78a3cb1227777585c3 Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Tue, 19 May 2026 11:50:55 -0400
Subject: [PATCH] docs(requirements): add cached-call telemetry pattern to Communication

---
 docs/requirements/Component-Communication.md | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/docs/requirements/Component-Communication.md b/docs/requirements/Component-Communication.md
index f6cdb84..9f30716 100644
--- a/docs/requirements/Component-Communication.md
+++ b/docs/requirements/Component-Communication.md
@@ -122,7 +122,7 @@ Keepalive settings are configurable via `CommunicationOptions`:
 - Site event logs.
 - Instance debug snapshots (attribute values and alarm states).
 - Central can also send management commands:
-  - Retry or discard parked messages.
+  - Retry or discard parked messages and parked cached calls — central sends `RetryParkedOperation` / `DiscardParkedOperation` (keyed by `TrackedOperationId`) to the owning site, which applies the change to its S&F buffer and tracking table.
 
 ### 9. Notification Submission (Site → Central)
 - **Pattern**: Fire-and-forget with acknowledgment.
@@ -131,6 +131,14 @@ Keepalive settings are configurable via `CommunicationOptions`:
 - The `NotificationId` GUID — generated at the site — is the **idempotency key**. The handoff is at-least-once: a re-sent submission after a lost ack is harmless because central's insert-if-not-exists treats the duplicate as a no-op.
 - **Transport**: ClusterClient (site→central command/control), consistent with how other site→central messages are sent.
 
+### 10. Cached Call Telemetry (Site → Central)
+- **Pattern**: Fire-and-forget telemetry with a periodic reconciliation pull.
+- The site **Store-and-Forward Engine** emits a `CachedCallTelemetry` message to central on **every** cached-call lifecycle transition (`Created`/`Pending → Retrying → Delivered`/`Parked`/`Failed`/`Discarded`). The message carries the `TrackedOperationId`, source site, kind, target summary, status, retry count, last error, key timestamps, and source provenance.
+- Emission is **best-effort and at-least-once**, **idempotent on `TrackedOperationId`** — central's Site Call Audit component ingests with insert-if-not-exists then upsert-on-newer-status, so a re-sent or out-of-order event is harmless.
+- **Reconciliation pull**: because telemetry is best-effort, the central **Site Call Audit** component periodically — and on site reconnect — issues a `CachedCallReconcileRequest` to each site; the site replies with a `CachedCallReconcileResponse` carrying all tracking rows changed since a cursor. Any telemetry missed during a disconnect self-heals through this pull.
+- Central audit is an **eventually-consistent mirror** — the site's operation tracking table remains the source of truth for cached-call status (`Tracking.Status(id)` is always answered site-locally).
+- **Transport**: ClusterClient (site→central command/control), consistent with how other site→central messages are sent.
+
 ## Topology
 
 ```
@@ -182,6 +190,7 @@ Each request/response pattern has a default timeout that can be overridden in co
 | 5. Recipe/Command Delivery | 30 seconds | Fire-and-forget with ack |
 | 8. Remote Queries | 30 seconds | Querying parked messages or event logs |
 | 9. Notification Submission | 30 seconds | Fire-and-forget with ack; central acks after persisting the row |
+| 10. Cached Call Telemetry | 30 seconds | Reconciliation pull is request/response; telemetry emission itself is fire-and-forget |
 
 Timeouts use the Akka.NET **ask pattern**. If no response is received within the timeout, the caller receives a timeout failure.
 
@@ -237,6 +246,7 @@ The ManagementActor is registered at the well-known path `/user/management` on c
 - **Site Runtime**: Receives deployments, lifecycle commands, and artifact updates. Provides debug view data.
 - **Central UI**: Debug view requests and remote queries flow through communication.
 - **Health Monitoring**: Receives periodic health reports from sites.
-- **Store-and-Forward Engine (site)**: Parked message queries/commands are routed through communication.
+- **Store-and-Forward Engine (site)**: Parked message queries/commands are routed through communication. Also emits `CachedCallTelemetry` and answers `CachedCallReconcileRequest` pulls, and receives relayed `RetryParkedOperation` / `DiscardParkedOperation` commands.
+- **Site Call Audit (central)**: Receives cached-call telemetry and reconciliation responses; issues reconciliation pulls and relays parked-operation Retry/Discard commands to sites through communication.
 - **Site Event Logging**: Event log queries are routed through communication.
 - **Management Service**: The ManagementActor is registered with ClusterClientReceptionist on central nodes. The CLI communicates with the ManagementActor via ClusterClient, which is a separate channel from inter-cluster remoting.
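
Reviewer note: the ingest rule described in section 10 (insert-if-not-exists, then upsert-on-newer-status) and the cursor-based reconciliation reply can be illustrated with a small, language-neutral sketch. This is not the project's implementation, which targets Akka.NET. The field names, the `SiteCallAuditMirror` class, the `rows_changed_since` helper, and the use of an `updated_at` timestamp as the "newer" criterion are all assumptions made for illustration; the patch lists the data carried by `CachedCallTelemetry` but does not define its schema or how "newer" is decided.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CachedCallTelemetry:
    # Illustrative subset of the fields named in the patch; names are assumed.
    tracked_operation_id: str
    site: str
    status: str
    retry_count: int
    updated_at: datetime  # assumed criterion for "newer status"


class SiteCallAuditMirror:
    """In-memory stand-in for central's audit table (hypothetical)."""

    def __init__(self) -> None:
        self._rows: Dict[str, CachedCallTelemetry] = {}

    def ingest(self, event: CachedCallTelemetry) -> bool:
        """Apply one event and return True if the stored row changed."""
        current = self._rows.get(event.tracked_operation_id)
        if current is None:
            # Insert-if-not-exists: first sighting of this operation.
            self._rows[event.tracked_operation_id] = event
            return True
        if event.updated_at > current.updated_at:
            # Upsert-on-newer-status: only a strictly newer event overwrites.
            self._rows[event.tracked_operation_id] = event
            return True
        # Duplicate or out-of-order event: harmless no-op.
        return False

    def status(self, tracked_operation_id: str) -> Optional[str]:
        row = self._rows.get(tracked_operation_id)
        return row.status if row is not None else None


def rows_changed_since(
    tracking_table: List[CachedCallTelemetry], cursor: datetime
) -> List[CachedCallTelemetry]:
    """Site side of a reconciliation pull: rows changed after the cursor."""
    return [row for row in tracking_table if row.updated_at > cursor]
```

Comparing timestamps rather than ranking lifecycle stages is a deliberate choice in this sketch: a parked operation can move back to `Retrying` after a `RetryParkedOperation` command, so the stages are not strictly monotonic. Rows returned by `rows_changed_since` can be fed through the same `ingest` method, which is why a reconciliation reply that overlaps with already-received telemetry does no harm.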