docs(requirements): add cached-call telemetry pattern to Communication

Joseph Doherty
2026-05-19 11:50:55 -04:00
parent 00ec265980
commit d43d43d795

@@ -122,7 +122,7 @@ Keepalive settings are configurable via `CommunicationOptions`:
- Site event logs.
- Instance debug snapshots (attribute values and alarm states).
- Central can also send management commands:
- - Retry or discard parked messages.
+ - Retry or discard parked messages and parked cached calls: central sends `RetryParkedOperation` / `DiscardParkedOperation` (keyed by `TrackedOperationId`) to the owning site, which applies the change to its S&F buffer and tracking table.
### 9. Notification Submission (Site → Central)
- **Pattern**: Fire-and-forget with acknowledgment.
@@ -131,6 +131,14 @@ Keepalive settings are configurable via `CommunicationOptions`:
- The `NotificationId` GUID — generated at the site — is the **idempotency key**. The handoff is at-least-once: a re-sent submission after a lost ack is harmless because central's insert-if-not-exists treats the duplicate as a no-op.
- **Transport**: ClusterClient (site→central command/control), consistent with how other site→central messages are sent.
+ ### 10. Cached Call Telemetry (Site → Central)
+ - **Pattern**: Fire-and-forget telemetry with a periodic reconciliation pull.
+ - The site **Store-and-Forward Engine** emits a `CachedCallTelemetry` message to central on **every** cached-call lifecycle transition (`Created`/`Pending` → `Retrying` → `Delivered`/`Parked`/`Failed`/`Discarded`). The message carries the `TrackedOperationId`, source site, kind, target summary, status, retry count, last error, key timestamps, and source provenance.
+ - Emission is **best-effort and at-least-once**, and **idempotent on `TrackedOperationId`**: central's Site Call Audit component ingests with insert-if-not-exists then upsert-on-newer-status, so a re-sent or out-of-order event is harmless.
+ - **Reconciliation pull**: because telemetry is best-effort, the central **Site Call Audit** component periodically, and on site reconnect, issues a `CachedCallReconcileRequest` to each site; the site replies with a `CachedCallReconcileResponse` carrying all tracking rows changed since a cursor. Any telemetry missed during a disconnect self-heals through this pull.
+ - Central audit is an **eventually-consistent mirror**: the site's operation tracking table remains the source of truth for cached-call status (`Tracking.Status(id)` is always answered site-locally).
+ - **Transport**: ClusterClient (site→central command/control), consistent with how other site→central messages are sent.
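
The ingest and reconciliation behaviour described in section 10 can be sketched roughly as follows. This is an illustrative in-memory model in Python, not the project's implementation: the class names `SiteCallAudit` and `SiteTrackingTable`, the `updated_at` field, and the rule that "newer status" means "strictly later `updated_at`" are all assumptions made for the example. The section only says that the message carries key timestamps and that the pull is driven by a cursor.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CachedCallTelemetry:
    tracked_operation_id: str  # idempotency key
    site: str
    status: str                # e.g. "Pending", "Retrying", "Delivered"
    retry_count: int
    updated_at: float          # hypothetical: time of the transition at the site


class SiteCallAudit:
    """Toy stand-in for the central audit mirror."""

    def __init__(self) -> None:
        self.rows: Dict[str, CachedCallTelemetry] = {}

    def ingest(self, event: CachedCallTelemetry) -> bool:
        """Insert-if-not-exists, then upsert only when the event is newer.

        Returns True if the mirror changed, False for duplicates or stale events.
        """
        current = self.rows.get(event.tracked_operation_id)
        if current is None or event.updated_at > current.updated_at:
            self.rows[event.tracked_operation_id] = event
            return True
        return False


class SiteTrackingTable:
    """Toy stand-in for the site's operation tracking table (source of truth)."""

    def __init__(self) -> None:
        self.rows: Dict[str, CachedCallTelemetry] = {}

    def record(self, event: CachedCallTelemetry) -> None:
        self.rows[event.tracked_operation_id] = event

    def changed_since(self, cursor: float) -> List[CachedCallTelemetry]:
        """Rows changed after the cursor, oldest first (the reconcile response)."""
        changed = [r for r in self.rows.values() if r.updated_at > cursor]
        return sorted(changed, key=lambda r: r.updated_at)


def reconcile(audit: SiteCallAudit, site: SiteTrackingTable, cursor: float) -> float:
    """Pull rows changed since the cursor, ingest them, return the new cursor."""
    rows = site.changed_since(cursor)
    for row in rows:
        audit.ingest(row)
    return max((r.updated_at for r in rows), default=cursor)


# Usage: the "Delivered" event is lost in transit; the pull heals the mirror.
site, audit = SiteTrackingTable(), SiteCallAudit()
pending = CachedCallTelemetry("op-1", "site-a", "Pending", 0, 1.0)
delivered = CachedCallTelemetry("op-1", "site-a", "Delivered", 1, 3.0)
site.record(pending)
audit.ingest(pending)
site.record(delivered)          # telemetry for this transition never arrives
next_cursor = reconcile(audit, site, cursor=1.0)
```

Because `ingest` ignores anything that is not strictly newer, the same code path can serve both live telemetry and reconciliation rows, which is what makes re-sent or out-of-order events harmless in this model.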
## Topology
```
@@ -182,6 +190,7 @@ Each request/response pattern has a default timeout that can be overridden in co
| 5. Recipe/Command Delivery | 30 seconds | Fire-and-forget with ack |
| 8. Remote Queries | 30 seconds | Querying parked messages or event logs |
| 9. Notification Submission | 30 seconds | Fire-and-forget with ack; central acks after persisting the row |
+ | 10. Cached Call Telemetry | 30 seconds | Reconciliation pull is request/response; telemetry emission itself is fire-and-forget |
Timeouts use the Akka.NET **ask pattern**. If no response is received within the timeout, the caller receives a timeout failure.
@@ -237,6 +246,7 @@ The ManagementActor is registered at the well-known path `/user/management` on c
- **Site Runtime**: Receives deployments, lifecycle commands, and artifact updates. Provides debug view data.
- **Central UI**: Debug view requests and remote queries flow through communication.
- **Health Monitoring**: Receives periodic health reports from sites.
- - **Store-and-Forward Engine (site)**: Parked message queries/commands are routed through communication.
+ - **Store-and-Forward Engine (site)**: Parked message queries/commands are routed through communication. Also emits `CachedCallTelemetry` and answers `CachedCallReconcileRequest` pulls, and receives relayed `RetryParkedOperation` / `DiscardParkedOperation` commands.
+ - **Site Call Audit (central)**: Receives cached-call telemetry and reconciliation responses; issues reconciliation pulls and relays parked-operation Retry/Discard commands to sites through communication.
- **Site Event Logging**: Event log queries are routed through communication.
- **Management Service**: The ManagementActor is registered with ClusterClientReceptionist on central nodes. The CLI communicates with the ManagementActor via ClusterClient, which is a separate channel from inter-cluster remoting.