scadalink-design/docs/plans/2026-03-16-communication-layer-refinement-design.md
Joseph Doherty bd735de8c4 Refine Communication Layer: timeouts, transport config, ordering, failure behavior
Add per-pattern message timeouts with sensible defaults (120s for deployments, 30s
for queries/commands). Configure Akka.NET transport heartbeat explicitly rather than
relying on framework defaults. Document per-site message ordering guarantee. Specify
that in-flight messages on disconnect result in timeout error (no central buffering)
and debug streams die on any disconnect.
2026-03-16 08:04:06 -04:00

# Communication Layer Refinement — Design
**Date**: 2026-03-16
**Component**: Central-Site Communication (`Component-Communication.md`)
**Status**: Approved
## Problem
The Communication Layer doc defined 8 message patterns clearly but did not specify timeouts, transport configuration, reconnection behavior, message ordering guarantees, or connection failure handling.
## Decisions
### Message Timeouts
- **Per-pattern timeouts with sensible defaults**, overridable in configuration.
- Deployment and system-wide artifacts: 120 seconds (script compilation can be slow).
- Lifecycle commands, integration routing, recipe/command delivery, remote queries: 30 seconds.
- Requests use the Akka.NET ask pattern; if no reply arrives within the timeout, the ask fails and the failure is returned to the caller.
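
The timeout decisions above can be sketched as a lookup of per-pattern defaults with configuration overrides, wrapped around a request that fails when no reply arrives in time. This is a minimal, language-neutral model in Python, not Akka.NET code: the pattern keys, `resolve_timeout`, and `ask` are hypothetical names introduced for illustration, and only the 120 s and 30 s values come from this design.

```python
import asyncio

# Default timeouts in seconds, taken from the decisions above.
# The pattern keys are hypothetical names chosen for this sketch.
DEFAULT_TIMEOUTS_S = {
    "deployment": 120.0,
    "system_wide_artifacts": 120.0,
    "lifecycle_command": 30.0,
    "integration_routing": 30.0,
    "recipe_command_delivery": 30.0,
    "remote_query": 30.0,
}


def resolve_timeout(pattern, overrides=None):
    """Return the configured override for a pattern, else its default."""
    if overrides and pattern in overrides:
        return overrides[pattern]
    return DEFAULT_TIMEOUTS_S[pattern]


async def ask(send, message, pattern, overrides=None):
    """Await a reply to `message`; raise TimeoutError to the caller on expiry."""
    timeout_s = resolve_timeout(pattern, overrides)
    try:
        return await asyncio.wait_for(send(message), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise TimeoutError(f"{pattern} timed out after {timeout_s}s") from None
```

Here `send` stands for any coroutine function that delivers the message and returns the reply. An entry in `overrides` replaces the default only for the pattern it names.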
### Transport Configuration
- **Akka.NET built-in reconnection** with explicitly configured transport heartbeat interval and failure detection threshold.
- No custom reconnection logic — framework handles it.
- Settings explicitly documented rather than relying on framework defaults, for predictability in a SCADA context.
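
One way to enforce "explicit rather than default" is to make the settings loader reject any configuration that omits a transport value. The sketch below is an illustrative Python model under that assumption. The field names `heartbeat_interval_s` and `failure_detection_threshold` are hypothetical and do not correspond to actual Akka.NET configuration keys, and no concrete values are implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransportSettings:
    # No default values on purpose: every field must be supplied explicitly.
    heartbeat_interval_s: float
    failure_detection_threshold: float


def load_transport_settings(config):
    """Build settings from a config mapping, rejecting any missing key."""
    required = ("heartbeat_interval_s", "failure_detection_threshold")
    missing = [key for key in required if key not in config]
    if missing:
        raise ValueError(f"transport settings must be explicit; missing: {missing}")
    return TransportSettings(
        heartbeat_interval_s=float(config["heartbeat_interval_s"]),
        failure_detection_threshold=float(config["failure_detection_threshold"]),
    )
```

A deployment whose configuration leaves out either value fails at startup instead of silently inheriting a framework default.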
### Connection Failure Behavior
- **In-flight messages get a timeout error**: the caller retries manually. Central does no buffering, consistent with the existing no-buffering-at-central design principle.
- Automatic retry rejected due to risk of duplicate processing (e.g., site may have applied a deployment before the connection dropped).
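
The duplicate-processing risk behind this decision can be shown with a small simulation. It is a hedged sketch, not the system's code: `Site`, `send_once`, and `send_with_auto_retry` are hypothetical names, and a lost reply is modeled by raising `TimeoutError` after the site has already done the work.

```python
class Site:
    """Hypothetical stand-in for a site that applies deployments."""

    def __init__(self):
        self.applied = []

    def deploy(self, deployment_id, reply_lost=False):
        # The site applies the deployment before any reply is sent.
        self.applied.append(deployment_id)
        if reply_lost:
            # Models a disconnect after the site did the work: the caller
            # never sees the reply, so its request ends in a timeout.
            raise TimeoutError("no reply before timeout")
        return "ok"


def send_once(site, deployment_id, reply_lost=False):
    """Chosen behavior: one attempt; a timeout is reported, never retried."""
    try:
        return site.deploy(deployment_id, reply_lost)
    except TimeoutError:
        return "timeout"


def send_with_auto_retry(site, deployment_id, reply_lost_first=True):
    """Rejected behavior: retry blindly after a timeout."""
    try:
        return site.deploy(deployment_id, reply_lost_first)
    except TimeoutError:
        return site.deploy(deployment_id)
```

With `send_once`, the caller sees a timeout and the site has applied the deployment exactly once, so a person can check the site's state before deciding to retry. With `send_with_auto_retry`, the same lost reply leads to the deployment being applied twice.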
### Message Ordering
- **Per-site ordering guaranteed** — relies on Akka.NET's built-in per-sender/per-receiver ordering. No custom sequencing logic needed.
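
What a per-sender/per-receiver ordering guarantee means can be modeled as one FIFO queue per pair: messages from one sender to one receiver are delivered in the order they were sent, and nothing is promised about interleaving across different pairs. The following is an illustrative Python model only. `PairwiseChannels` and the endpoint names used with it are assumptions for this sketch, not part of Akka.NET.

```python
from collections import defaultdict, deque


class PairwiseChannels:
    """One FIFO queue per (sender, receiver) pair."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def send(self, sender, receiver, message):
        self._queues[(sender, receiver)].append(message)

    def deliver_next(self, sender, receiver):
        """Pop the oldest undelivered message for this pair, or None."""
        queue = self._queues[(sender, receiver)]
        return queue.popleft() if queue else None
```

Because central talks to each site over its own pair, messages sent to a given site arrive in send order without any sequence numbers in the message payload.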
### Debug Stream Interruption
- **Stream dies on any disconnect** (failover or network blip). Engineer reopens the debug view manually.
- Auto-resume rejected — adds complexity for a transient diagnostic tool.
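
The "dies on disconnect, reopen manually" behavior keeps the stream nearly stateless, as the sketch below suggests. It is a minimal Python model under stated assumptions: `DebugStream`, `instance_id`, and `reopen` are hypothetical names, and the real debug view is not described in this document.

```python
class DebugStream:
    """Hypothetical debug stream for one instance."""

    def __init__(self, instance_id):
        self.instance_id = instance_id
        self.open = True
        self.events = []

    def push(self, event):
        """Record an event; events for a dead stream are dropped."""
        if self.open:
            self.events.append(event)
        return self.open

    def on_disconnect(self):
        # Any disconnect ends the stream; nothing is kept for resumption.
        self.open = False


def reopen(old_stream):
    """Manual reopen: a brand-new stream with no carried-over state."""
    return DebugStream(old_stream.instance_id)
```

Auto-resume would instead require tracking the last delivered event per stream across reconnections, which is the state-tracking cost this design avoids.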
## Affected Documents
| Document | Change |
|----------|--------|
| `Component-Communication.md` | Added 4 new sections: Message Timeouts, Transport Configuration, Message Ordering, Connection Failure Behavior |
## Alternatives Considered
- **Global timeout for all patterns**: Rejected — deployment involves compilation and needs more time than a simple query.
- **Default Akka.NET transport settings**: Rejected — relying on undocumented defaults is risky for SCADA; explicit configuration ensures predictable behavior.
- **Automatic retry of in-flight messages**: Rejected — risks duplicate processing and contradicts the no-buffering-at-central principle.
- **No ordering guarantee**: Rejected — Akka.NET provides this for free; the design already implicitly relies on it.
- **Auto-resume debug streams on reconnection**: Rejected — adds state tracking complexity for a transient diagnostic feature.