feat: replace ActorSelection with ClusterClient for inter-cluster communication

Central and site clusters now communicate via ClusterClient/ClusterClientReceptionist instead of direct ActorSelection. Both CentralCommunicationActor and SiteCommunicationActor are registered with their cluster's receptionist. Central creates one ClusterClient per site using NodeA/NodeB contact points from the DB. Sites configure multiple CentralContactPoints for automatic failover between central nodes. ISiteClientFactory enables test injection.
2026-03-18 00:08:47 -04:00
parent e5eb871961
commit 4f22ca2b1f
15 changed files with 287 additions and 136 deletions
--- a/Component-Communication.md
+++ b/Component-Communication.md
@@ -11,7 +11,7 @@ Both central and site clusters. Each side has communication actors that handle m
 ## Responsibilities

 - Resolve site addresses from the configuration database and maintain a cached address map.
- Establish and maintain Akka.NET remoting connections between central and each site cluster.
+- Establish and maintain cross-cluster connections using Akka.NET ClusterClient/ClusterClientReceptionist.
 - Route messages between central and site clusters in a hub-and-spoke topology.
 - Broker requests from external systems (via central) to sites and return responses.
 - Support multiple concurrent message patterns (request/response, fire-and-forget, streaming).
@@ -75,25 +75,35 @@ Both central and site clusters. Each side has communication actors that handle m

 ```
 Central Cluster
-  ├── Akka.NET Remoting → Site A Cluster
-  ├── Akka.NET Remoting → Site B Cluster
-  └── Akka.NET Remoting → Site N Cluster
+  ├── ClusterClient → Site A Cluster (SiteCommunicationActor via Receptionist)
+  ├── ClusterClient → Site B Cluster (SiteCommunicationActor via Receptionist)
+  └── ClusterClient → Site N Cluster (SiteCommunicationActor via Receptionist)
+
+Site Clusters
+  └── ClusterClient → Central Cluster (CentralCommunicationActor via Receptionist)
 ```

 - Sites do **not** communicate with each other.
 - All inter-cluster communication flows through central.
+- Both **CentralCommunicationActor** and **SiteCommunicationActor** are registered with their cluster's **ClusterClientReceptionist** for cross-cluster discovery.

 ## Site Address Resolution

 Central discovers site addresses through the **configuration database**, not runtime registration:

- Each site record in the Sites table includes optional **NodeAAddress** and **NodeBAddress** fields containing the Akka remoting paths of the site's cluster nodes.
- The **CentralCommunicationActor** loads all site addresses from the database at startup and caches them in memory.
- The cache is **refreshed every 60 seconds** and **on-demand** when site records are added, edited, or deleted via the Central UI or CLI.
- When routing a message to a site, the actor **prefers NodeA** and **falls back to NodeB** if NodeA is unreachable.
+- Each site record in the Sites table includes optional **NodeAAddress** and **NodeBAddress** fields containing base Akka addresses of the site's cluster nodes (e.g., `akka.tcp://scadalink@host:port`).
+- The **CentralCommunicationActor** loads all site addresses from the database at startup and creates one **ClusterClient per site**, configured with both NodeA and NodeB as contact points.
+- The address cache is **refreshed every 60 seconds** and **on-demand** when site records are added, edited, or deleted via the Central UI or CLI. ClusterClient instances are recreated when contact points change.
+- When routing a message to a site, central tells that site's ClusterClient a `ClusterClient.Send("/user/site-communication", msg)` message. **ClusterClient handles failover between NodeA and NodeB internally**; there is no application-level NodeA preference/NodeB fallback logic.
 - **Heartbeats** from sites serve **health monitoring only** — they do not serve as a registration or address discovery mechanism.
 - If no addresses are configured for a site, messages to that site are **dropped** and the caller's Ask times out.

+### Site → Central Communication
+
+- Site nodes configure a list of **CentralContactPoints** (both central node addresses) instead of a single `CentralActorPath`.
+- The site creates a **ClusterClient** using the central contact points and sends heartbeats, health reports, and other messages by wrapping each one in `ClusterClient.Send("/user/central-communication", msg)` and telling it to that client.
+- ClusterClient handles automatic failover between central nodes — if the active central node goes down, the site's ClusterClient reconnects to the standby node transparently.
+
 ## Message Timeouts

 Each request/response pattern has a default timeout that can be overridden in configuration:
@@ -111,11 +121,11 @@ Timeouts use the Akka.NET **ask pattern**. If no response is received within the

 ## Transport Configuration

-Akka.NET remoting provides built-in connection management and failure detection. The following transport-level settings are **explicitly configured** (not left to framework defaults) for predictable behavior:
+Akka.NET remoting provides the underlying transport for both intra-cluster communication and ClusterClient connections. The following transport-level settings are **explicitly configured** (not left to framework defaults) for predictable behavior:

 - **Transport heartbeat interval**: Configurable interval at which heartbeat messages are sent over remoting connections (e.g., every 5 seconds).
 - **Failure detection threshold**: Number of missed heartbeats before the connection is considered lost (e.g., 3 missed heartbeats = 15 seconds with a 5-second interval).
- **Reconnection**: Akka.NET remoting handles reconnection automatically. No custom reconnection logic is required.
+- **Reconnection**: ClusterClient handles reconnection and failover between contact points automatically for cross-cluster communication. No custom reconnection logic is required.

 These settings should be tuned for the expected network conditions between central and site clusters.

@@ -135,7 +145,7 @@ Akka.NET guarantees message ordering between a specific sender/receiver actor pa

 ## ManagementActor and ClusterClient

-The ManagementActor is registered at the well-known path `/user/management` on central nodes and advertised via **ClusterClientReceptionist**. External tools (primarily the CLI) connect using Akka.NET ClusterClient, which contacts the receptionist to discover the ManagementActor. ClusterClient is a separate communication channel from the inter-cluster remoting used for central-site messaging — it does not participate in cluster membership or affect the hub-and-spoke topology.
+The ManagementActor is registered at the well-known path `/user/management` on central nodes and advertised via **ClusterClientReceptionist**. External tools (primarily the CLI) connect using Akka.NET ClusterClient, which contacts the receptionist to discover the ManagementActor. This is a separate ClusterClient usage from the inter-cluster ClusterClient connections used for central-site messaging — the CLI does not participate in cluster membership or affect the hub-and-spoke topology.

 ## Connection Failure Behavior

@@ -144,12 +154,12 @@ The ManagementActor is registered at the well-known path `/user/management` on c

 ## Failover Behavior

- **Central failover**: The standby node takes over the Akka.NET cluster role. In-progress deployments are treated as failed. Sites reconnect to the new active central node.
- **Site failover**: The standby node takes over. The Deployment Manager singleton restarts and re-creates the Instance Actor hierarchy. Central detects the node change and reconnects. Ongoing debug streams are interrupted and must be re-established by the engineer.
+- **Central failover**: The standby node takes over the Akka.NET cluster role. In-progress deployments are treated as failed. Site ClusterClients automatically reconnect to the standby central node via their configured contact points.
+- **Site failover**: The standby node takes over. The Deployment Manager singleton restarts and re-creates the Instance Actor hierarchy. Central's per-site ClusterClient automatically reconnects to the surviving site node. Ongoing debug streams are interrupted and must be re-established by the engineer.

 ## Dependencies

- **Akka.NET Remoting**: Provides the transport layer.
+- **Akka.NET Remoting + ClusterClient**: Provide the transport layer. ClusterClient/ClusterClientReceptionist is used for all cross-cluster messaging.
 - **Cluster Infrastructure**: Manages node roles and failover detection.
 - **Configuration Database**: Provides site node addresses (NodeAAddress, NodeBAddress) for address resolution.
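
As a rough illustration of the site-side contact points this commit describes, a site node's HOCON might look something like the fragment below. It is a sketch under assumptions, not the project's actual configuration: the host names `central-a`/`central-b` and port `8081` are placeholders, the `scadalink` system name is borrowed from the address example in the diff, and the source does not show how the project's `CentralContactPoints` setting maps onto Akka.NET's `akka.cluster.client.initial-contacts` key.

```
# Hypothetical site-node fragment; hosts, port, and system name are placeholders.
akka {
  cluster {
    client {
      # One entry per central node. Each entry is the base node address
      # plus the default receptionist path, /system/receptionist.
      initial-contacts = [
        "akka.tcp://scadalink@central-a:8081/system/receptionist",
        "akka.tcp://scadalink@central-b:8081/system/receptionist"
      ]
    }
  }
}
```

Listing both central nodes is what lets the client reach a receptionist on the standby node when the active node is down. The same shape would apply on central for each site's NodeA/NodeB pair, although the commit message indicates those clients are built in code from database values rather than from static configuration.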