Refine Communication Layer: timeouts, transport config, ordering, failure behavior
Add per-pattern message timeouts with sensible defaults (120s for deployments, 30s for queries/commands). Configure the Akka.NET transport heartbeat explicitly rather than relying on framework defaults. Document the per-site message ordering guarantee. Specify that in-flight messages on disconnect result in a timeout error (no central buffering) and that debug streams die on any disconnect.
@@ -82,6 +82,40 @@ Central Cluster
- Sites do **not** communicate with each other.
- All inter-cluster communication flows through central.

## Message Timeouts

Each request/response pattern has a default timeout that can be overridden in configuration:

| Pattern | Default Timeout | Rationale |
|---------|-----------------|-----------|
| 1. Deployment | 120 seconds | Script compilation at the site can be slow |
| 2. Instance Lifecycle | 30 seconds | Stopping and starting actors is fast |
| 3. System-Wide Artifacts | 120 seconds per site | Includes shared script recompilation |
| 4. Integration Routing | 30 seconds | An external system is waiting for the response; the Inbound API per-method timeout may cap this further |
| 5. Recipe/Command Delivery | 30 seconds | Fire-and-forget with ack |
| 8. Remote Queries | 30 seconds | Querying parked messages or event logs |

Timeouts use the Akka.NET **ask pattern**. If no response arrives within the timeout, the caller receives a timeout failure.
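
The default-plus-override lookup described above can be sketched as follows. This is an illustration only, written in Python rather than Akka.NET: `asyncio.wait_for` stands in for the ask pattern, and the dictionary keys, `resolve_timeout`, and `ask` are hypothetical names that do not come from the design.

```python
import asyncio
from typing import Awaitable, Dict, Optional

# Defaults copied from the table above. The keys are hypothetical identifiers.
DEFAULT_TIMEOUTS_S: Dict[str, float] = {
    "deployment": 120,
    "instance_lifecycle": 30,
    "system_wide_artifacts": 120,  # applied per site
    "integration_routing": 30,
    "recipe_command_delivery": 30,
    "remote_queries": 30,
}


def resolve_timeout(pattern: str, overrides: Optional[Dict[str, float]] = None) -> float:
    """Return the configured override for a pattern, else its default."""
    if overrides and pattern in overrides:
        return overrides[pattern]
    return DEFAULT_TIMEOUTS_S[pattern]


async def ask(response: Awaitable, pattern: str,
              overrides: Optional[Dict[str, float]] = None):
    """Wait for a response; raise asyncio.TimeoutError if none arrives in time."""
    timeout = resolve_timeout(pattern, overrides)
    return await asyncio.wait_for(response, timeout=timeout)
```

A caller that passes `{"remote_queries": 5}` as `overrides` would wait 5 seconds for a remote query and keep the defaults for every other pattern.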

## Transport Configuration

Akka.NET remoting provides built-in connection management and failure detection. The following transport-level settings are **explicitly configured** (not left to framework defaults) for predictable behavior:

- **Transport heartbeat interval**: Configurable interval at which heartbeat messages are sent over remoting connections (e.g., every 5 seconds).
- **Failure detection threshold**: Number of missed heartbeats before the connection is considered lost (e.g., 3 missed heartbeats = 15 seconds with a 5-second interval).
- **Reconnection**: Akka.NET remoting handles reconnection automatically. No custom reconnection logic is required.

These settings should be tuned for the expected network conditions between central and site clusters.
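
The relationship between the heartbeat interval and the failure detection threshold can be illustrated with a small model. This is a sketch of the arithmetic only, not Akka.NET's failure detector, which is configured through its own settings. `HeartbeatMonitor` and its fields are hypothetical names, and the 5-second and 3-heartbeat values are the examples from the list above.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Treats a connection as lost after a run of missed heartbeats."""

    interval_s: float = 5.0       # example heartbeat interval from the text
    missed_threshold: int = 3     # example threshold from the text
    last_heartbeat_s: float = 0.0

    @property
    def detection_window_s(self) -> float:
        # 3 missed heartbeats at a 5-second interval gives 15 seconds.
        return self.interval_s * self.missed_threshold

    def record_heartbeat(self, now_s: float) -> None:
        self.last_heartbeat_s = now_s

    def is_connection_lost(self, now_s: float) -> bool:
        return now_s - self.last_heartbeat_s >= self.detection_window_s
```

A shorter window detects a lost site sooner, but it is also more likely to flag a slow but healthy link, which is why both values are worth tuning per deployment.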

## Message Ordering

Akka.NET guarantees message ordering between a specific sender/receiver actor pair. The Communication Layer relies on this guarantee: messages to a given site are processed in the order they are sent. Callers do not need to handle out-of-order delivery.

## Connection Failure Behavior

- **In-flight messages**: When a connection drops while a request is in flight (e.g., a deployment was sent but no response was received), the Akka ask pattern times out and the caller receives a failure. There is **no automatic retry or buffering at central**; the engineer sees the failure in the UI and re-initiates the action. This is consistent with the design principle that central does not buffer messages.
- **Debug streams**: Any connection interruption (failover or network blip) kills the debug stream. The engineer must reopen the debug view in the Central UI to re-establish the subscription with a fresh snapshot. There is no auto-resume.
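
The debug-stream rule above can be modelled in a few lines. This is a hypothetical sketch, not the Communication Layer's implementation: `SiteConnection`, `DebugStream`, and `DebugStreamClosed` are invented names, and a plain dictionary stands in for the site's state snapshot.

```python
class DebugStreamClosed(Exception):
    """Raised when reading from a stream that a disconnect has killed."""


class DebugStream:
    def __init__(self, snapshot: dict):
        self.snapshot = dict(snapshot)  # state captured at subscribe time
        self.alive = True

    def read(self) -> dict:
        if not self.alive:
            raise DebugStreamClosed("stream ended; reopen the debug view")
        return self.snapshot


class SiteConnection:
    def __init__(self):
        self.state: dict = {}
        self._streams: list = []

    def open_debug_stream(self) -> DebugStream:
        # Every subscription starts from a fresh snapshot of current state.
        stream = DebugStream(self.state)
        self._streams.append(stream)
        return stream

    def on_disconnect(self) -> None:
        # Any interruption kills every open stream; nothing is resumed.
        for stream in self._streams:
            stream.alive = False
        self._streams.clear()
```

After `on_disconnect()`, an old stream raises on `read()`, and only a new call to `open_debug_stream()` yields data again, mirroring the engineer reopening the debug view.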

## Failover Behavior

- **Central failover**: The standby node takes over the Akka.NET cluster role. In-progress deployments are treated as failed. Sites reconnect to the new active central node.