# Component: Central–Site Communication

## Purpose

The Communication component manages all messaging between the central cluster and site clusters using Akka.NET. It provides the transport layer for deployments, instance lifecycle commands, integration routing, debug streaming, health reporting, and remote queries (parked messages, event logs).

## Location

Both central and site clusters. Each side has communication actors that handle message routing.

## Responsibilities

- Resolve site addresses from the configuration database and maintain a cached address map.
- Establish and maintain cross-cluster connections using Akka.NET ClusterClient/ClusterClientReceptionist.
- Route messages between central and site clusters in a hub-and-spoke topology.
- Broker requests from external systems (via central) to sites and return responses.
- Support multiple concurrent message patterns (request/response, fire-and-forget, streaming).
- Detect site connectivity status for health monitoring.

## Communication Patterns

### 1. Deployment (Central → Site)

- **Pattern**: Request/Response.
- Central sends a flattened configuration to a site.
- The Site Runtime receives it, compiles scripts, creates/updates Instance Actors, and responds with success/failure.
- No buffering at central — if the site is unreachable, the deployment fails immediately.

### 2. Instance Lifecycle Commands (Central → Site)

- **Pattern**: Request/Response.
- Central sends disable, enable, or delete commands for specific instances.
- The Site Runtime processes the command and responds with success/failure.
- If the site is unreachable, the command fails immediately (no buffering).

### 3. System-Wide Artifact Deployment (Central → Site(s))

- **Pattern**: Broadcast with per-site acknowledgment (deploy to all sites), or targeted to a single site (per-site deployment).
- When shared scripts, external system definitions, database connections, data connections, notification lists, or SMTP configuration are explicitly deployed, central sends them to the target site(s).
- Each site acknowledges receipt and reports success/failure independently.

### 4. Integration Routing (External System → Central → Site → Central → External System)

- **Pattern**: Request/Response (brokered).
- An external system sends a request to central (e.g., an MES requests machine values).
- Central routes the request to the appropriate site.
- The site reads values from the Instance Actor and responds.
- Central returns the response to the external system.

### 5. Recipe/Command Delivery (External System → Central → Site)

- **Pattern**: Fire-and-forget with acknowledgment.
- An external system sends a command to central (e.g., a recipe manager sends a recipe).
- Central routes it to the site.
- The site applies the command and acknowledges.

### 6. Debug Streaming (Site → Central)

- **Pattern**: Subscribe/push with initial snapshot (no polling).
- A **DebugStreamBridgeActor** (one per active debug session) is created on the central cluster by the **DebugStreamService**. The bridge actor sends a `SubscribeDebugViewRequest` to the site via the `CentralCommunicationActor`. The site's `InstanceActor` stores the subscription's correlation ID and replies with an initial snapshot via the ClusterClient reply path.
- The site requests a **snapshot** of all current attribute values and alarm states from the Instance Actor and sends it back to the bridge actor (via the ClusterClient reply path, which works for immediate responses).
- For ongoing events, the InstanceActor wraps `AttributeValueChanged` and `AlarmStateChanged` in a `DebugStreamEvent(correlationId, event)` message and sends it to the local `SiteCommunicationActor`. The SiteCommunicationActor forwards it to central via its own ClusterClient (`ClusterClient.Send("/user/central-communication", event)`). The `CentralCommunicationActor` looks up the bridge actor by correlation ID and delivers the event. This follows the same site→central pattern as health reports.
- The bridge actor forwards received events to the consumer via callbacks (a Blazor component or the SignalR hub).
- Attribute value stream messages carry: `[InstanceUniqueName].[AttributePath].[AttributeName]`, value, quality, timestamp.
- Alarm state stream messages carry: `[InstanceUniqueName].[AlarmName]`, state (active/normal), priority, timestamp.
- Central sends an unsubscribe request when the debug session ends. The site removes its stream subscription and the bridge actor is stopped.
- The stream is session-based and temporary.

#### Central-Side Debug Stream Components

- **DebugStreamService**: Singleton service that manages debug stream sessions. Resolves an instance ID to its unique name and site, creates and tears down `DebugStreamBridgeActor` instances, and provides a clean API for both Blazor components and the SignalR hub.
- **DebugStreamBridgeActor**: One per active debug session. Acts as the Akka-level subscriber registered with the site's `InstanceActor`. Receives real-time `AttributeValueChanged` and `AlarmStateChanged` events from the site and forwards them to the consumer via callbacks.
- **DebugStreamHub**: SignalR hub at `/hubs/debug-stream` for external consumers (e.g., the CLI). Authenticates via Basic Auth + LDAP and requires the **Deployment** role. Server-to-client methods: `OnSnapshot`, `OnAttributeChanged`, `OnAlarmChanged`, `OnStreamTerminated`.

### 6a. Debug Snapshot (Central → Site)

- **Pattern**: Request/Response (one-shot, no subscription).
- Central sends a `DebugSnapshotRequest` (identified by instance unique name) to the site.
- The site's Deployment Manager routes it to the Instance Actor by unique name.
- The Instance Actor builds and returns a `DebugViewSnapshot` with all current attribute values and alarm states (same payload as the streaming initial snapshot).
- No subscription is created; no stream is established.
- Uses the 30-second `QueryTimeout`.

### 7. Health Reporting (Site → Central)

- **Pattern**: Periodic push.
- Sites periodically send health metrics (connection status, node status, buffer depth, script error rates, alarm evaluation error rates) to central.

### 8. Remote Queries (Central → Site)

- **Pattern**: Request/Response.
- Central queries sites for:
  - Parked messages (store-and-forward dead letters).
  - Site event logs.
  - Instance debug snapshots (attribute values and alarm states).
- Central can also send management commands:
  - Retry or discard parked messages.

## Topology

```
Central Cluster
├── ClusterClient → Site A Cluster (SiteCommunicationActor via Receptionist)
├── ClusterClient → Site B Cluster (SiteCommunicationActor via Receptionist)
└── ClusterClient → Site N Cluster (SiteCommunicationActor via Receptionist)

Site Clusters
└── ClusterClient → Central Cluster (CentralCommunicationActor via Receptionist)
```

- Sites do **not** communicate with each other.
- All inter-cluster communication flows through central.
- Both **CentralCommunicationActor** and **SiteCommunicationActor** are registered with their cluster's **ClusterClientReceptionist** for cross-cluster discovery.

## Site Address Resolution

Central discovers site addresses through the **configuration database**, not runtime registration:

- Each site record in the Sites table includes optional **NodeAAddress** and **NodeBAddress** fields containing base Akka addresses of the site's cluster nodes (e.g., `akka.tcp://scadalink@host:port`).
- The **CentralCommunicationActor** loads all site addresses from the database at startup and creates one **ClusterClient per site**, configured with both NodeA and NodeB as contact points.
- The address cache is **refreshed every 60 seconds** and **on-demand** when site records are added, edited, or deleted via the Central UI or CLI. ClusterClient instances are recreated when contact points change.
- When routing a message to a site, central sends via `ClusterClient.Send("/user/site-communication", msg)`. **ClusterClient handles failover between NodeA and NodeB internally** — there is no application-level NodeA preference/NodeB fallback logic.
- **Heartbeats** from sites serve **health monitoring only** — they are not a registration or address discovery mechanism.
- If no addresses are configured for a site, messages to that site are **dropped** and the caller's Ask times out.

### Site → Central Communication

- Site nodes configure a list of **CentralContactPoints** (both central node addresses) instead of a single `CentralActorPath`.
- The site creates a **ClusterClient** using the central contact points and sends heartbeats, health reports, and other messages via `ClusterClient.Send("/user/central-communication", msg)`.
- ClusterClient handles automatic failover between central nodes — if the active central node goes down, the site's ClusterClient reconnects to the standby node transparently.

## Message Timeouts

Each request/response pattern has a default timeout that can be overridden in configuration:

| Pattern | Default Timeout | Rationale |
|---------|-----------------|-----------|
| 1. Deployment | 120 seconds | Script compilation at the site can be slow |
| 2. Instance Lifecycle | 30 seconds | Stopping/starting actors is fast |
| 3. System-Wide Artifacts | 120 seconds per site | Includes shared script recompilation |
| 4. Integration Routing | 30 seconds | An external system is waiting for the response; the Inbound API per-method timeout may cap this further |
| 5. Recipe/Command Delivery | 30 seconds | Fire-and-forget with ack |
| 8. Remote Queries | 30 seconds | Querying parked messages or event logs |

Timeouts use the Akka.NET **ask pattern**. If no response is received within the timeout, the caller receives a timeout failure.
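The timeout behavior described above is Akka.NET's ask pattern, but the semantics are language-agnostic. Below is a minimal sketch in Python `asyncio` — not the actual implementation — where `DEFAULT_TIMEOUTS`, `ask`, and `AskTimeoutError` are hypothetical names introduced for illustration: the caller awaits a response and receives a timeout failure if none arrives within the pattern's default (or overridden) deadline.

```python
import asyncio

# Per-pattern default timeouts from the table above (seconds).
DEFAULT_TIMEOUTS = {
    "deployment": 120,
    "instance_lifecycle": 30,
    "system_wide_artifacts": 120,
    "integration_routing": 30,
    "recipe_command": 30,
    "remote_query": 30,
}

class AskTimeoutError(Exception):
    """Raised when no response arrives within the pattern's deadline."""

async def ask(send_request, pattern, timeout_overrides=None):
    """Send a request and await its response, failing after the
    pattern's default timeout (or a configured override)."""
    timeouts = {**DEFAULT_TIMEOUTS, **(timeout_overrides or {})}
    deadline = timeouts[pattern]
    try:
        return await asyncio.wait_for(send_request(), deadline)
    except asyncio.TimeoutError:
        # No buffering or retry at central: the caller sees the failure
        # and the engineer re-initiates the action.
        raise AskTimeoutError(f"{pattern} timed out after {deadline}s")
```

The design point the sketch illustrates: the timeout lives at the caller, so a dropped connection mid-request surfaces as a plain timeout failure rather than a hung operation.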
## Transport Configuration

Akka.NET remoting provides the underlying transport for both intra-cluster communication and ClusterClient connections. The following transport-level settings are **explicitly configured** (not left to framework defaults) for predictable behavior:

- **Transport heartbeat interval**: Configurable interval at which heartbeat messages are sent over remoting connections (e.g., every 5 seconds).
- **Failure detection threshold**: Number of missed heartbeats before the connection is considered lost (e.g., 3 missed heartbeats = 15 seconds with a 5-second interval).
- **Reconnection**: ClusterClient handles reconnection and failover between contact points automatically for cross-cluster communication. No custom reconnection logic is required.

These settings should be tuned for the expected network conditions between central and site clusters.

## Application-Level Correlation

All request/response messages include an application-level **correlation ID** to ensure correct pairing of requests and responses, even across reconnection events:

- Deployments include a **deployment ID** and **revision hash** for idempotency (see Deployment Manager).
- Lifecycle commands include a **command ID** for deduplication.
- Remote queries include a **query ID** for response correlation.

This provides protocol-level safety beyond Akka.NET's transport guarantees, which may not hold across disconnect/reconnect cycles.

## Message Ordering

Akka.NET guarantees message ordering between a specific sender/receiver actor pair. The Communication Layer relies on this guarantee — messages to a given site are processed in the order they are sent. Callers do not need to handle out-of-order delivery.

## ManagementActor and ClusterClient

The ManagementActor is registered at the well-known path `/user/management` on central nodes and advertised via **ClusterClientReceptionist**.
External tools (primarily the CLI) connect using Akka.NET ClusterClient, which contacts the receptionist to discover the ManagementActor. This is a separate ClusterClient usage from the inter-cluster connections used for central–site messaging — the CLI does not participate in cluster membership or affect the hub-and-spoke topology.

## Connection Failure Behavior

- **In-flight messages**: When a connection drops while a request is in flight (e.g., a deployment was sent but no response received), the Akka ask pattern times out and the caller receives a failure. There is **no automatic retry or buffering at central** — the engineer sees the failure in the UI and re-initiates the action. This is consistent with the design principle that central does not buffer messages.
- **Debug streams**: Any connection interruption (failover or network blip) kills the debug stream. The `DebugStreamBridgeActor` is stopped and the consumer is notified via `OnStreamTerminated`. The engineer must reopen the debug view to re-establish the subscription with a fresh snapshot. There is no auto-resume.

## Failover Behavior

- **Central failover**: The standby node takes over the Akka.NET cluster role. In-progress deployments are treated as failed. Site ClusterClients automatically reconnect to the standby central node via their configured contact points.
- **Site failover**: The standby node takes over. The Deployment Manager singleton restarts and re-creates the Instance Actor hierarchy. Central's per-site ClusterClient automatically reconnects to the surviving site node. Ongoing debug streams are interrupted and must be re-established by the engineer.

## Dependencies

- **Akka.NET Remoting + ClusterClient**: Provides the transport layer. ClusterClient/ClusterClientReceptionist are used for all cross-cluster messaging.
- **Cluster Infrastructure**: Manages node roles and failover detection.
- **Configuration Database**: Provides site node addresses (NodeAAddress, NodeBAddress) for address resolution.

## Interactions

- **Deployment Manager (central)**: Uses communication to deliver configurations, lifecycle commands, and system-wide artifacts, and to receive status.
- **Site Runtime**: Receives deployments, lifecycle commands, and artifact updates. Provides debug view data.
- **Central UI**: Debug view requests and remote queries flow through communication.
- **Health Monitoring**: Receives periodic health reports from sites.
- **Store-and-Forward Engine (site)**: Parked message queries/commands are routed through communication.
- **Site Event Logging**: Event log queries are routed through communication.
- **Management Service**: The ManagementActor is registered with ClusterClientReceptionist on central nodes. The CLI communicates with the ManagementActor via ClusterClient, a separate channel from inter-cluster remoting.
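The application-level correlation described earlier (deployment IDs, command IDs, query IDs) amounts to a pending-request table keyed by correlation ID. The sketch below is a minimal illustration in Python — `PendingRequests` is a hypothetical name, not a component of the system — showing how a response is paired with its request, and how a duplicate or stale response (e.g., one arriving after a disconnect/reconnect cycle) is discarded rather than delivered to the wrong caller.

```python
import uuid

class PendingRequests:
    """Pairs responses to requests by correlation ID, giving
    protocol-level safety across disconnect/reconnect cycles."""

    def __init__(self):
        # correlation_id -> description of the waiting request
        self._pending = {}

    def register(self, description):
        """Record an outbound request and return its correlation ID."""
        correlation_id = str(uuid.uuid4())
        self._pending[correlation_id] = description
        return correlation_id

    def complete(self, correlation_id, response):
        """Match an inbound response to its request.

        Returns (request, response) on a match; returns None for a
        duplicate or stale correlation ID, which the caller discards.
        """
        request = self._pending.pop(correlation_id, None)
        if request is None:
            return None  # duplicate or stale: discard
        return (request, response)
```

This mirrors what the `CentralCommunicationActor` does when it looks up a `DebugStreamBridgeActor` by correlation ID: an unknown ID simply means there is no longer a waiter for that response.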