From bd735de8c4476d2d829505f31382f7d2c36c42f2 Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Mon, 16 Mar 2026 08:04:06 -0400
Subject: [PATCH] Refine Communication Layer: timeouts, transport config, ordering, failure behavior

Add per-pattern message timeouts with sensible defaults (120s for deployments, 30s for queries/commands). Configure Akka.NET transport heartbeat explicitly rather than relying on framework defaults. Document per-site message ordering guarantee. Specify that in-flight messages on disconnect result in timeout error (no central buffering) and debug streams die on any disconnect.
---
 Component-Communication.md                    | 34 ++++++++++++++
 ...6-communication-layer-refinement-design.md | 47 +++++++++++++++++++
 2 files changed, 81 insertions(+)
 create mode 100644 docs/plans/2026-03-16-communication-layer-refinement-design.md

diff --git a/Component-Communication.md b/Component-Communication.md
index 68bb74f..d8888b9 100644
--- a/Component-Communication.md
+++ b/Component-Communication.md
@@ -82,6 +82,40 @@ Central Cluster
 - Sites do **not** communicate with each other.
 - All inter-cluster communication flows through central.
 
+## Message Timeouts
+
+Each request/response pattern has a default timeout that can be overridden in configuration:
+
+| Pattern | Default Timeout | Rationale |
+|---------|----------------|-----------|
+| 1. Deployment | 120 seconds | Script compilation at the site can be slow |
+| 2. Instance Lifecycle | 30 seconds | Stop/start actors is fast |
+| 3. System-Wide Artifacts | 120 seconds per site | Includes shared script recompilation |
+| 4. Integration Routing | 30 seconds | External system waiting for response; Inbound API per-method timeout may cap this further |
+| 5. Recipe/Command Delivery | 30 seconds | Fire-and-forget with ack |
+| 8. Remote Queries | 30 seconds | Querying parked messages or event logs |
+
+Timeouts use the Akka.NET **ask pattern**. If no response is received within the timeout, the caller receives a timeout failure.
+
+## Transport Configuration
+
+Akka.NET remoting provides built-in connection management and failure detection. The following transport-level settings are **explicitly configured** (not left to framework defaults) for predictable behavior:
+
+- **Transport heartbeat interval**: Configurable interval at which heartbeat messages are sent over remoting connections (e.g., every 5 seconds).
+- **Failure detection threshold**: Number of missed heartbeats before the connection is considered lost (e.g., 3 missed heartbeats = 15 seconds with a 5-second interval).
+- **Reconnection**: Akka.NET remoting handles reconnection automatically. No custom reconnection logic is required.
+
+These settings should be tuned for the expected network conditions between central and site clusters.
+
+## Message Ordering
+
+Akka.NET guarantees message ordering between a specific sender/receiver actor pair. The Communication Layer relies on this guarantee — messages to a given site are processed in the order they are sent. Callers do not need to handle out-of-order delivery.
+
+## Connection Failure Behavior
+
+- **In-flight messages**: When a connection drops while a request is in flight (e.g., deployment sent but no response received), the Akka ask pattern times out and the caller receives a failure. There is **no automatic retry or buffering at central** — the engineer sees the failure in the UI and re-initiates the action. This is consistent with the design principle that central does not buffer messages.
+- **Debug streams**: Any connection interruption (failover or network blip) kills the debug stream. The engineer must reopen the debug view in the Central UI to re-establish the subscription with a fresh snapshot. There is no auto-resume.
+
 ## Failover Behavior
 
 - **Central failover**: The standby node takes over the Akka.NET cluster role. In-progress deployments are treated as failed. Sites reconnect to the new active central node.
diff --git a/docs/plans/2026-03-16-communication-layer-refinement-design.md b/docs/plans/2026-03-16-communication-layer-refinement-design.md
new file mode 100644
index 0000000..0c869cc
--- /dev/null
+++ b/docs/plans/2026-03-16-communication-layer-refinement-design.md
@@ -0,0 +1,47 @@
+# Communication Layer Refinement — Design
+
+**Date**: 2026-03-16
+**Component**: Central–Site Communication (`Component-Communication.md`)
+**Status**: Approved
+
+## Problem
+
+The Communication Layer doc defined 8 message patterns clearly but lacked specification for timeouts, transport configuration, reconnection behavior, message ordering guarantees, and connection failure handling.
+
+## Decisions
+
+### Message Timeouts
+- **Per-pattern timeouts with sensible defaults**, overridable in configuration.
+- Deployment and system-wide artifacts: 120 seconds (script compilation can be slow).
+- Lifecycle commands, integration routing, recipe/command delivery, remote queries: 30 seconds.
+- Uses the Akka.NET ask pattern; timeout results in failure to caller.
+
+### Transport Configuration
+- **Akka.NET built-in reconnection** with explicitly configured transport heartbeat interval and failure detection threshold.
+- No custom reconnection logic — framework handles it.
+- Settings explicitly documented rather than relying on framework defaults, for predictability in a SCADA context.
+
+### Connection Failure Behavior
+- **In-flight messages get a timeout error** — caller retries manually. No buffering at central. Consistent with existing design principle.
+- Automatic retry rejected due to risk of duplicate processing (e.g., site may have applied a deployment before the connection dropped).
+
+### Message Ordering
+- **Per-site ordering guaranteed** — relies on Akka.NET's built-in per-sender/per-receiver ordering. No custom sequencing logic needed.
+
+### Debug Stream Interruption
+- **Stream dies on any disconnect** (failover or network blip). Engineer reopens the debug view manually.
+- Auto-resume rejected — adds complexity for a transient diagnostic tool.
+
+## Affected Documents
+
+| Document | Change |
+|----------|--------|
+| `Component-Communication.md` | Added 4 new sections: Message Timeouts, Transport Configuration, Message Ordering, Connection Failure Behavior |
+
+## Alternatives Considered
+
+- **Global timeout for all patterns**: Rejected — deployment involves compilation and needs more time than a simple query.
+- **Default Akka.NET transport settings**: Rejected — relying on undocumented defaults is risky for SCADA; explicit configuration ensures predictable behavior.
+- **Automatic retry of in-flight messages**: Rejected — risks duplicate processing and contradicts the no-buffering-at-central principle.
+- **No ordering guarantee**: Rejected — Akka.NET provides this for free; the design already implicitly relies on it.
+- **Auto-resume debug streams on reconnection**: Rejected — adds state tracking complexity for a transient diagnostic feature.
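
Reviewer note (not part of the patch): the timeout and heartbeat rules introduced here reduce to a small amount of arithmetic. The following is a minimal Python sketch of that arithmetic only, not Akka.NET code and not the project's implementation. The names `DEFAULT_TIMEOUTS_S`, `resolve_timeout`, `detection_window_s`, and the `cap_s` parameter are hypothetical; the numeric values come from the Message Timeouts table and the 5-second / 3-heartbeat example in the Transport Configuration section.

```python
from typing import Mapping, Optional

# Default ask timeouts in seconds, keyed by pattern number
# (values taken from the Message Timeouts table in the patch).
DEFAULT_TIMEOUTS_S = {
    1: 120,  # Deployment
    2: 30,   # Instance Lifecycle
    3: 120,  # System-Wide Artifacts (per site)
    4: 30,   # Integration Routing
    5: 30,   # Recipe/Command Delivery
    8: 30,   # Remote Queries
}


def resolve_timeout(
    pattern: int,
    overrides: Optional[Mapping[int, int]] = None,
    cap_s: Optional[int] = None,
) -> int:
    """Return the timeout for a pattern: a configured override if present,
    otherwise the default, optionally capped (hypothetical stand-in for the
    Inbound API per-method timeout mentioned for pattern 4)."""
    if pattern not in DEFAULT_TIMEOUTS_S:
        raise KeyError(f"pattern {pattern} has no request/response timeout")
    timeout = DEFAULT_TIMEOUTS_S[pattern]
    if overrides and pattern in overrides:
        timeout = overrides[pattern]
    if cap_s is not None:
        timeout = min(timeout, cap_s)
    return timeout


def detection_window_s(heartbeat_interval_s: float, missed_heartbeats: int) -> float:
    """Time after which a silent connection is considered lost."""
    return heartbeat_interval_s * missed_heartbeats


if __name__ == "__main__":
    print(resolve_timeout(1))                # default for Deployment
    print(resolve_timeout(2, {2: 45}))       # configuration override
    print(resolve_timeout(4, cap_s=10))      # capped by a shorter limit
    print(detection_window_s(5, 3))          # the patch's 5 s x 3 example
```

With the example transport values, a dropped connection would be detected after roughly 15 seconds, well inside both the 30-second and 120-second ask timeouts; whether the caller's ask fails at detection time or only when its own timeout expires is not specified by the patch, which states only that the caller receives a timeout failure.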