scadalink-design/docs/plans/2026-03-16-communication-layer-refinement-design.md
Joseph Doherty bd735de8c4 Refine Communication Layer: timeouts, transport config, ordering, failure behavior
Add per-pattern message timeouts with sensible defaults (120s for deployments, 30s
for queries/commands). Configure Akka.NET transport heartbeat explicitly rather than
relying on framework defaults. Document per-site message ordering guarantee. Specify
that in-flight messages on disconnect result in timeout error (no central buffering)
and debug streams die on any disconnect.
2026-03-16 08:04:06 -04:00

Communication Layer Refinement — Design

Date: 2026-03-16
Component: Central-Site Communication (Component-Communication.md)
Status: Approved

Problem

The Communication Layer doc defined 8 message patterns clearly but lacked specification for timeouts, transport configuration, reconnection behavior, message ordering guarantees, and connection failure handling.

Decisions

Message Timeouts

  • Per-pattern timeouts with sensible defaults, overridable in configuration.
  • Deployment and system-wide artifacts: 120 seconds (script compilation can be slow).
  • Lifecycle commands, integration routing, recipe/command delivery, remote queries: 30 seconds.
  • Requests use the Akka.NET ask pattern; if no reply arrives before the timeout elapses, the ask fails and the failure is returned to the caller.
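The timeout rules above can be sketched in a language-neutral way. This is a minimal illustration, not the Akka.NET API: the `Pattern` identifiers, `resolve_timeout`, and `ask` are hypothetical names chosen for the example, and `asyncio.wait_for` stands in for the ask pattern's timeout. Only the 120-second and 30-second defaults come from the decisions above.

```python
import asyncio
from enum import Enum


class Pattern(Enum):
    # Hypothetical identifiers; the design lists categories, not names.
    DEPLOYMENT = "deployment"
    SYSTEM_WIDE_ARTIFACTS = "system_wide_artifacts"
    LIFECYCLE_COMMAND = "lifecycle_command"
    INTEGRATION_ROUTING = "integration_routing"
    RECIPE_COMMAND_DELIVERY = "recipe_command_delivery"
    REMOTE_QUERY = "remote_query"


# Default timeouts in seconds, taken from the decisions above.
DEFAULT_TIMEOUTS = {
    Pattern.DEPLOYMENT: 120.0,
    Pattern.SYSTEM_WIDE_ARTIFACTS: 120.0,
    Pattern.LIFECYCLE_COMMAND: 30.0,
    Pattern.INTEGRATION_ROUTING: 30.0,
    Pattern.RECIPE_COMMAND_DELIVERY: 30.0,
    Pattern.REMOTE_QUERY: 30.0,
}


def resolve_timeout(pattern, overrides=None):
    """Return the configured override for a pattern, else its default."""
    return (overrides or {}).get(pattern, DEFAULT_TIMEOUTS[pattern])


async def ask(send, pattern, message, overrides=None):
    """Await a reply from `send`; fail to the caller if it takes too long."""
    timeout = resolve_timeout(pattern, overrides)
    try:
        return await asyncio.wait_for(send(message), timeout)
    except asyncio.TimeoutError:
        # No retry here: the failure is surfaced to the caller.
        raise TimeoutError(f"{pattern.value} timed out after {timeout}s") from None
```

A configuration override is just an entry in `overrides`, for example `resolve_timeout(Pattern.REMOTE_QUERY, {Pattern.REMOTE_QUERY: 10.0})` yields `10.0`, while patterns without an override keep their defaults.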

Transport Configuration

  • Akka.NET built-in reconnection with explicitly configured transport heartbeat interval and failure detection threshold.
  • No custom reconnection logic — framework handles it.
  • Settings explicitly documented rather than relying on framework defaults, for predictability in a SCADA context.
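To make the two settings concrete, here is a simplified deadline-style failure detector. It is a sketch of the general idea only, not Akka.NET's implementation or configuration syntax; the class name, field names, and the example values (2 s interval, 10 s tolerated pause) are assumptions, since this document does not state the chosen values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeadlineFailureDetector:
    """Hypothetical detector: a peer is considered failed once it has been
    silent for longer than heartbeat_interval + acceptable_pause."""

    heartbeat_interval: float  # seconds between expected heartbeats (assumed)
    acceptable_pause: float    # extra silence tolerated before failing (assumed)
    last_heartbeat: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        """Record that a heartbeat arrived at time `now` (in seconds)."""
        self.last_heartbeat = now

    def is_available(self, now: float) -> bool:
        """Return True while the peer is within its silence deadline."""
        if self.last_heartbeat is None:
            # Nothing observed yet: treat the peer as healthy.
            return True
        deadline = self.heartbeat_interval + self.acceptable_pause
        return now - self.last_heartbeat <= deadline
```

With `DeadlineFailureDetector(2.0, 10.0)` and a last heartbeat at `t = 100.0`, the peer is still available at `t = 112.0` and considered failed at `t = 112.5`. Writing the two numbers down explicitly is what makes the detection time predictable.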

Connection Failure Behavior

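  • Messages in flight when the connection drops fail with a timeout error at the caller, who decides whether to retry manually. Central does not buffer messages for later delivery, consistent with the existing no-buffering-at-central design principle.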
  • Automatic retry rejected due to risk of duplicate processing (e.g., site may have applied a deployment before the connection dropped).
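The duplicate-processing risk can be shown with a small simulation. Everything here is hypothetical (the `Site` class, `send_once`, `send_with_auto_retry`, and the boolean link states); it models a site that applies a deployment and then loses the connection before its acknowledgement reaches central.

```python
class Site:
    """Hypothetical site that records every deployment it applies."""

    def __init__(self):
        self.applied = []

    def deploy(self, deployment_id, connection_up):
        # The site applies the deployment first...
        self.applied.append(deployment_id)
        # ...then the acknowledgement is lost if the link has dropped.
        if not connection_up:
            raise TimeoutError("no acknowledgement received")
        return "ok"


def send_once(site, deployment_id, connection_up):
    """Chosen behavior: a timeout propagates to the caller, who decides."""
    return site.deploy(deployment_id, connection_up)


def send_with_auto_retry(site, deployment_id, link_states):
    """Rejected behavior: resend blindly after every timeout."""
    for connection_up in link_states:
        try:
            return site.deploy(deployment_id, connection_up)
        except TimeoutError:
            continue
    raise TimeoutError("all attempts timed out")
```

Calling `send_with_auto_retry(site, "d1", [False, True])` returns `"ok"` but leaves `site.applied == ["d1", "d1"]`: the deployment was applied twice. With `send_once`, the caller sees the `TimeoutError` after a single application and can check the site's state before deciding whether to resend.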

Message Ordering

  • Message ordering is guaranteed per site. This relies on Akka.NET's built-in ordering guarantee for each sender-receiver pair, so no custom sequencing logic is needed.

Debug Stream Interruption

  • Stream dies on any disconnect (failover or network blip). Engineer reopens the debug view manually.
  • Auto-resume rejected — adds complexity for a transient diagnostic tool.

Affected Documents

Document | Change
Component-Communication.md | Added 4 new sections: Message Timeouts, Transport Configuration, Message Ordering, Connection Failure Behavior

Alternatives Considered

  • Global timeout for all patterns: Rejected — deployment involves compilation and needs more time than a simple query.
  • Default Akka.NET transport settings: Rejected — relying on undocumented defaults is risky for SCADA; explicit configuration ensures predictable behavior.
  • Automatic retry of in-flight messages: Rejected — risks duplicate processing and contradicts the no-buffering-at-central principle.
  • No ordering guarantee: Rejected — Akka.NET provides this for free; the design already implicitly relies on it.
  • Auto-resume debug streams on reconnection: Rejected — adds state tracking complexity for a transient diagnostic feature.