# Component: Central–Site Communication

## Purpose

The Communication component manages all messaging between the central cluster and site clusters using Akka.NET. It provides the transport layer for deployments, instance lifecycle commands, integration routing, debug streaming, health reporting, and remote queries (parked messages, event logs).

## Location

Both central and site clusters. Each side has communication actors that handle message routing.

## Responsibilities

- Establish and maintain Akka.NET remoting connections between central and each site cluster.
- Route messages between central and site clusters in a hub-and-spoke topology.
- Broker requests from external systems (via central) to sites and return responses.
- Support multiple concurrent message patterns (request/response, fire-and-forget, streaming).
- Detect site connectivity status for health monitoring.

## Communication Patterns

### 1. Deployment (Central → Site)

- **Pattern**: Request/Response.
- Central sends a flattened configuration to a site.
- Site Runtime receives the configuration, compiles scripts, creates or updates Instance Actors, and responds with success/failure.
- No buffering at central: if the site is unreachable, the deployment fails immediately.

### 2. Instance Lifecycle Commands (Central → Site)

- **Pattern**: Request/Response.
- Central sends disable, enable, or delete commands for specific instances.
- Site Runtime processes the command and responds with success/failure.
- If the site is unreachable, the command fails immediately (no buffering).

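The fail-fast rule shared by deployments and lifecycle commands can be illustrated with a small model. This is a language-neutral sketch in Python, not Akka.NET code: `CentralRouter`, `SiteUnreachable`, and the method names are hypothetical, and how connectivity is actually tracked is an assumption.

```python
class SiteUnreachable(Exception):
    """Raised when a request targets a site with no live connection."""


class CentralRouter:
    def __init__(self):
        # Hypothetical registry: site name -> callable that delivers a message.
        self._connected = {}

    def site_connected(self, site, deliver):
        self._connected[site] = deliver

    def site_disconnected(self, site):
        self._connected.pop(site, None)

    def send(self, site, message):
        deliver = self._connected.get(site)
        if deliver is None:
            # Nothing is queued for later delivery; the caller is told at once.
            raise SiteUnreachable(f"site {site!r} is unreachable")
        return deliver(message)
```

The point of the sketch is the absence of any queue: a request either reaches a connected site or fails back to the caller.
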
### 3. System-Wide Artifact Deployment (Central → All Sites)

- **Pattern**: Broadcast with per-site acknowledgment.
- When shared scripts, external system definitions, database connections, or notification lists are explicitly deployed, central sends them to all sites.
- Each site acknowledges receipt and reports success/failure independently.

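A broadcast with independent per-site results might look like the following. This is a language-neutral sketch in Python rather than Akka.NET code; `broadcast`, `send_to_site`, and `per_site_timeout_s` are hypothetical names, and `asyncio.wait_for` stands in for whatever request mechanism is used.

```python
import asyncio


async def broadcast(sites, send_to_site, artifact, per_site_timeout_s):
    """Send one artifact to every site; collect an independent result per site."""

    async def deliver(site):
        try:
            ack = await asyncio.wait_for(
                send_to_site(site, artifact), timeout=per_site_timeout_s
            )
            return site, {"ok": bool(ack), "error": None}
        except asyncio.TimeoutError:
            return site, {"ok": False, "error": "timeout"}
        except ConnectionError as exc:
            return site, {"ok": False, "error": str(exc)}

    # Sites are contacted concurrently; one failure does not affect the others.
    results = await asyncio.gather(*(deliver(site) for site in sites))
    return dict(results)
```

Because each site's outcome is captured separately, a slow or unreachable site shows up as its own failed entry instead of failing the whole broadcast.
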
### 4. Integration Routing (External System → Central → Site → Central → External System)

- **Pattern**: Request/Response (brokered).
- External system sends a request to central (e.g., MES requests machine values).
- Central routes the request to the appropriate site.
- Site reads values from the Instance Actor and responds.
- Central returns the response to the external system.

### 5. Recipe/Command Delivery (External System → Central → Site)

- **Pattern**: Fire-and-forget with acknowledgment.
- External system sends a command to central (e.g., recipe manager sends recipe).
- Central routes to the site.
- Site applies and acknowledges.

### 6. Debug Streaming (Site → Central)

- **Pattern**: Subscribe/stream with initial snapshot.
- Central sends a subscribe request for a specific instance (identified by unique name).
- Site requests a **snapshot** of all current attribute values and alarm states from the Instance Actor and sends it to central.
- Site then subscribes to the **site-wide Akka stream** filtered by the instance's unique name and forwards attribute value changes and alarm state changes to central.
  - Attribute value stream messages: `[InstanceUniqueName].[AttributePath].[AttributeName]`, value, quality, timestamp.
  - Alarm state stream messages: `[InstanceUniqueName].[AlarmName]`, state (active/normal), priority, timestamp.
- Central sends an unsubscribe request when the debug view closes. The site removes its stream subscription.
- The stream is session-based and temporary.

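The snapshot-then-stream sequence and the message naming can be sketched as below. This is a language-neutral illustration in Python, not the Akka stream implementation; all function and class names are hypothetical, and the prefix filter assumes instance unique names do not themselves contain a `.`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeChange:
    key: str        # [InstanceUniqueName].[AttributePath].[AttributeName]
    value: object
    quality: str
    timestamp: float


def attribute_key(instance, path, name):
    return f"{instance}.{path}.{name}"


def alarm_key(instance, alarm):
    return f"{instance}.{alarm}"


def belongs_to(instance, key):
    """True if a site-wide stream message belongs to the given instance."""
    # The trailing dot keeps "Pump1" from matching "Pump10.…".
    return key.startswith(instance + ".")


def debug_feed(instance, snapshot, site_stream):
    """Yield the initial snapshot first, then only this instance's changes."""
    yield ("snapshot", snapshot)
    for message in site_stream:
        if belongs_to(instance, message.key):
            yield ("change", message)
```

Sending the snapshot before any change messages gives the debug view a complete starting state, so later changes can be applied as deltas.
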
### 7. Health Reporting (Site → Central)

- **Pattern**: Periodic push.
- Sites periodically send health metrics (connection status, node status, buffer depth, script error rates, alarm evaluation error rates) to central.

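One possible shape for the periodic health message is shown below. This is a hedged sketch in Python: the field names follow the metrics listed above, but the types, the `HealthReport` class, and the `to_message` helper are assumptions, not a defined wire format.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class HealthReport:
    site: str
    connection_status: dict    # assumed: connection name -> status string
    node_status: dict          # assumed: node name -> status string
    buffer_depth: int
    script_error_rate: float
    alarm_evaluation_error_rate: float


def to_message(report):
    """Flatten a report into a plain dict suitable for sending to central."""
    return {"type": "health_report", **asdict(report)}
```
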
### 8. Remote Queries (Central → Site)

- **Pattern**: Request/Response.
- Central queries sites for:
  - Parked messages (store-and-forward dead letters).
  - Site event logs.
- Central can also send management commands:
  - Retry or discard parked messages.

## Topology

```
Central Cluster
  ├── Akka.NET Remoting → Site A Cluster
  ├── Akka.NET Remoting → Site B Cluster
  └── Akka.NET Remoting → Site N Cluster
```

- Sites do **not** communicate with each other.
- All inter-cluster communication flows through central.

## Message Timeouts

Each request/response pattern has a default timeout that can be overridden in configuration:

| Pattern | Default Timeout | Rationale |
|---------|----------------|-----------|
| 1. Deployment | 120 seconds | Script compilation at the site can be slow |
| 2. Instance Lifecycle | 30 seconds | Stopping and starting actors is fast |
| 3. System-Wide Artifacts | 120 seconds per site | Includes shared script recompilation |
| 4. Integration Routing | 30 seconds | External system is waiting for the response; the Inbound API per-method timeout may cap this further |
| 5. Recipe/Command Delivery | 30 seconds | Fire-and-forget with ack |
| 8. Remote Queries | 30 seconds | Querying parked messages or event logs |

Timeouts use the Akka.NET **ask pattern**. If no response is received within the timeout, the caller receives a timeout failure.

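The defaults, the override lookup, and the timeout behavior can be sketched as follows. This is a language-neutral illustration in Python rather than Akka.NET code: `DEFAULT_TIMEOUTS_S`, `resolve_timeout`, `ask`, and the pattern keys are hypothetical names, and `asyncio.wait_for` merely stands in for the ask pattern.

```python
import asyncio

# Default timeouts in seconds, keyed by hypothetical pattern names
# that mirror the table above.
DEFAULT_TIMEOUTS_S = {
    "deployment": 120,
    "instance_lifecycle": 30,
    "system_wide_artifacts": 120,  # applied per site
    "integration_routing": 30,
    "recipe_command_delivery": 30,
    "remote_queries": 30,
}


def resolve_timeout(pattern, overrides=None):
    """Return the configured override for a pattern, else its default."""
    if overrides and pattern in overrides:
        return overrides[pattern]
    return DEFAULT_TIMEOUTS_S[pattern]


async def ask(send, message, timeout_s):
    """Await a reply to `message`; report a failure if none arrives in time."""
    try:
        return await asyncio.wait_for(send(message), timeout=timeout_s)
    except asyncio.TimeoutError:
        # No retry and no buffering: the caller simply sees the failure.
        return {"ok": False, "error": f"timed out after {timeout_s}s"}
```

For example, `resolve_timeout("deployment")` yields the 120-second default, while passing `{"deployment": 300}` as `overrides` would raise it for a site with slow compilation.
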
## Transport Configuration

Akka.NET remoting provides built-in connection management and failure detection. The following transport-level settings are **explicitly configured** (not left to framework defaults) for predictable behavior:

- **Transport heartbeat interval**: Configurable interval at which heartbeat messages are sent over remoting connections (e.g., every 5 seconds).
- **Failure detection threshold**: Number of missed heartbeats before the connection is considered lost (e.g., 3 missed heartbeats = 15 seconds with a 5-second interval).
- **Reconnection**: Akka.NET remoting handles reconnection automatically. No custom reconnection logic is required.

These settings should be tuned for the expected network conditions between central and site clusters.

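The relationship between the two settings is simple arithmetic, shown here as a small Python model. `HeartbeatSettings` and its fields are hypothetical names used to illustrate the example figures above; they are not Akka.NET configuration keys, and the framework's own failure-detector settings are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatSettings:
    interval_s: float      # time between heartbeats
    missed_threshold: int  # missed heartbeats before the link is declared lost

    def detection_window_s(self) -> float:
        """Silence tolerated before the connection counts as lost."""
        return self.interval_s * self.missed_threshold

    def is_lost(self, seconds_since_last_heartbeat: float) -> bool:
        return seconds_since_last_heartbeat >= self.detection_window_s()


# The example from the text: 5-second interval, 3 missed heartbeats.
example = HeartbeatSettings(interval_s=5, missed_threshold=3)
```

With these values the detection window is 15 seconds. When tuning, note that a window longer than a request's timeout means callers may see timeouts before the connection is reported as lost.
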
## Message Ordering

Akka.NET guarantees message ordering between a specific sender/receiver actor pair. The Communication Layer relies on this guarantee: messages to a given site are processed in the order they are sent, so callers do not need to handle out-of-order delivery. The guarantee applies per sender/receiver pair, not across different senders.

## Connection Failure Behavior

- **In-flight messages**: When a connection drops while a request is in flight (e.g., deployment sent but no response received), the Akka ask pattern times out and the caller receives a failure. There is **no automatic retry or buffering at central**; the engineer sees the failure in the UI and re-initiates the action. This is consistent with the design principle that central does not buffer messages.
- **Debug streams**: Any connection interruption (failover or network blip) kills the debug stream. The engineer must reopen the debug view in the Central UI to re-establish the subscription with a fresh snapshot. There is no auto-resume.

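The debug-stream rule (any disconnect ends the session, and reopening starts from a fresh snapshot) can be modeled as below. This is a hedged Python sketch; `DebugSession`, `reopen`, and `fetch_snapshot` are hypothetical names, not part of the component's API.

```python
class DebugSession:
    """Central-side view of one debug subscription (hypothetical model)."""

    def __init__(self, instance, snapshot):
        self.instance = instance
        self.snapshot = snapshot
        self.active = True

    def on_disconnect(self):
        # Failover or network blip: the session ends and is never resumed.
        self.active = False

    def on_change(self, change):
        if not self.active:
            raise RuntimeError("debug stream closed; reopen the debug view")
        return change


def reopen(instance, fetch_snapshot):
    """Reopening means a new subscription that starts from a fresh snapshot."""
    return DebugSession(instance, fetch_snapshot(instance))
```

Treating a reopened view as a brand-new session avoids having to reconcile changes that were missed while the connection was down.
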
## Failover Behavior

- **Central failover**: The standby node takes over the Akka.NET cluster role. In-progress deployments are treated as failed. Sites reconnect to the new active central node.
- **Site failover**: The standby node takes over. The Deployment Manager singleton restarts and re-creates the Instance Actor hierarchy. Central detects the node change and reconnects. Ongoing debug streams are interrupted and must be re-established by the engineer.

## Dependencies

- **Akka.NET Remoting**: Provides the transport layer.
- **Cluster Infrastructure**: Manages node roles and failover detection.

## Interactions

- **Deployment Manager (central)**: Uses communication to deliver configurations, lifecycle commands, and system-wide artifacts, and to receive status.
- **Site Runtime**: Receives deployments, lifecycle commands, and artifact updates. Provides debug view data.
- **Central UI**: Debug view requests and remote queries flow through communication.
- **Health Monitoring**: Receives periodic health reports from sites.
- **Store-and-Forward Engine (site)**: Parked message queries/commands are routed through communication.
- **Site Event Logging**: Event log queries are routed through communication.