feat: replace ActorSelection with ClusterClient for inter-cluster communication

Central and site clusters now communicate via ClusterClient/
ClusterClientReceptionist instead of direct ActorSelection. Both
CentralCommunicationActor and SiteCommunicationActor are registered
with their cluster's receptionist. Central creates one ClusterClient
per site using NodeA/NodeB contact points from the DB. Sites configure
multiple CentralContactPoints for automatic failover between central
nodes. ISiteClientFactory enables test injection.
Joseph Doherty
2026-03-18 00:08:47 -04:00
parent e5eb871961
commit 4f22ca2b1f
15 changed files with 287 additions and 136 deletions


@@ -11,7 +11,7 @@ Both central and site clusters. Each side has communication actors that handle m
## Responsibilities
- Resolve site addresses from the configuration database and maintain a cached address map.
- Establish and maintain Akka.NET remoting connections between central and each site cluster.
- Establish and maintain cross-cluster connections using Akka.NET ClusterClient/ClusterClientReceptionist.
- Route messages between central and site clusters in a hub-and-spoke topology.
- Broker requests from external systems (via central) to sites and return responses.
- Support multiple concurrent message patterns (request/response, fire-and-forget, streaming).
@@ -75,25 +75,35 @@ Both central and site clusters. Each side has communication actors that handle m
```
Central Cluster
├── Akka.NET Remoting → Site A Cluster
├── Akka.NET Remoting → Site B Cluster
└── Akka.NET Remoting → Site N Cluster
├── ClusterClient → Site A Cluster (SiteCommunicationActor via Receptionist)
├── ClusterClient → Site B Cluster (SiteCommunicationActor via Receptionist)
└── ClusterClient → Site N Cluster (SiteCommunicationActor via Receptionist)
Site Clusters
└── ClusterClient → Central Cluster (CentralCommunicationActor via Receptionist)
```
- Sites do **not** communicate with each other.
- All inter-cluster communication flows through central.
- Both **CentralCommunicationActor** and **SiteCommunicationActor** are registered with their cluster's **ClusterClientReceptionist** for cross-cluster discovery.
## Site Address Resolution
Central discovers site addresses through the **configuration database**, not runtime registration:
- Each site record in the Sites table includes optional **NodeAAddress** and **NodeBAddress** fields containing the Akka remoting paths of the site's cluster nodes.
- The **CentralCommunicationActor** loads all site addresses from the database at startup and caches them in memory.
- The cache is **refreshed every 60 seconds** and **on-demand** when site records are added, edited, or deleted via the Central UI or CLI.
- When routing a message to a site, the actor **prefers NodeA** and **falls back to NodeB** if NodeA is unreachable.
- Each site record in the Sites table includes optional **NodeAAddress** and **NodeBAddress** fields containing base Akka addresses of the site's cluster nodes (e.g., `akka.tcp://scadalink@host:port`).
- The **CentralCommunicationActor** loads all site addresses from the database at startup and creates one **ClusterClient per site**, configured with both NodeA and NodeB as contact points.
- The address cache is **refreshed every 60 seconds** and **on-demand** when site records are added, edited, or deleted via the Central UI or CLI. ClusterClient instances are recreated when contact points change.
- When routing a message to a site, central sends via `ClusterClient.Send("/user/site-communication", msg)`. **ClusterClient handles failover between NodeA and NodeB internally** — there is no application-level NodeA preference/NodeB fallback logic.
- **Heartbeats** from sites serve **health monitoring only** — they do not serve as a registration or address discovery mechanism.
- If no addresses are configured for a site, messages to that site are **dropped** and the caller's Ask times out.
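The refresh rule above (one client per site, recreated only when its contact points change, no client when no addresses are configured) can be sketched as a small language-neutral model. This is an illustration under stated assumptions, not the committed implementation: the helper names `contact_points` and `plan_refresh`, the site ids, and the `/system/receptionist` suffix (assumed to be the default receptionist path) are not taken from the source.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# Assumption: the receptionist runs at the default path on each site node.
RECEPTIONIST_SUFFIX = "/system/receptionist"


def contact_points(node_a: Optional[str], node_b: Optional[str]) -> FrozenSet[str]:
    """Build one site's contact-point set from its optional node addresses."""
    return frozenset(
        addr.strip().rstrip("/") + RECEPTIONIST_SUFFIX
        for addr in (node_a, node_b)
        if addr and addr.strip()
    )


def plan_refresh(
    current: Dict[str, FrozenSet[str]],
    site_rows: Dict[str, Tuple[Optional[str], Optional[str]]],
) -> dict:
    """Compare the contact points in use with freshly loaded site rows.

    current:   site id -> contact points of the existing client
    site_rows: site id -> (NodeAAddress, NodeBAddress) from the database
    """
    desired = {}
    for site_id, (node_a, node_b) in site_rows.items():
        points = contact_points(node_a, node_b)
        if points:  # sites without any address get no client
            desired[site_id] = points
    return {
        "create": sorted(s for s in desired if s not in current),
        "recreate": sorted(
            s for s in desired if s in current and desired[s] != current[s]
        ),
        "stop": sorted(s for s in current if s not in desired),
        "desired": desired,
    }
```

Running `plan_refresh` on each 60-second tick and on each on-demand trigger keeps unchanged sites on their existing client, so a refresh does not interrupt traffic to sites whose addresses did not change.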
### Site → Central Communication
- Site nodes configure a list of **CentralContactPoints** (both central node addresses) instead of a single `CentralActorPath`.
- The site creates a **ClusterClient** using the central contact points and sends heartbeats, health reports, and other messages via `ClusterClient.Send("/user/central-communication", msg)`.
- ClusterClient handles automatic failover between central nodes — if the active central node goes down, the site's ClusterClient reconnects to the standby node transparently.
## Message Timeouts
Each request/response pattern has a default timeout that can be overridden in configuration:
@@ -111,11 +121,11 @@ Timeouts use the Akka.NET **ask pattern**. If no response is received within the
## Transport Configuration
Akka.NET remoting provides built-in connection management and failure detection. The following transport-level settings are **explicitly configured** (not left to framework defaults) for predictable behavior:
Akka.NET remoting provides the underlying transport for both intra-cluster communication and ClusterClient connections. The following transport-level settings are **explicitly configured** (not left to framework defaults) for predictable behavior:
- **Transport heartbeat interval**: Configurable interval at which heartbeat messages are sent over remoting connections (e.g., every 5 seconds).
- **Failure detection threshold**: Number of missed heartbeats before the connection is considered lost (e.g., 3 missed heartbeats = 15 seconds with a 5-second interval).
- **Reconnection**: Akka.NET remoting handles reconnection automatically. No custom reconnection logic is required.
- **Reconnection**: ClusterClient handles reconnection and failover between contact points automatically for cross-cluster communication. No custom reconnection logic is required.
These settings should be tuned for the expected network conditions between central and site clusters.
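As a rough tuning aid, the relationship between the two settings can be written out. The sketch below only models the simplified arithmetic stated above (interval times missed heartbeats); it is not the framework's actual failure-detector algorithm, and the function name is hypothetical.

```python
def detection_window_seconds(heartbeat_interval_s: float, missed_heartbeats: int) -> float:
    """Approximate time before a connection is considered lost."""
    if heartbeat_interval_s <= 0:
        raise ValueError("heartbeat interval must be positive")
    if missed_heartbeats < 1:
        raise ValueError("threshold must be at least one missed heartbeat")
    return heartbeat_interval_s * missed_heartbeats


# The example from the text: 3 missed heartbeats at a 5-second interval.
window = detection_window_seconds(5, 3)  # 15 seconds
```

Comparing this window with the configured request timeouts shows whether a caller is likely to see an Ask timeout before or after the connection is declared lost.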
@@ -135,7 +145,7 @@ Akka.NET guarantees message ordering between a specific sender/receiver actor pa
## ManagementActor and ClusterClient
The ManagementActor is registered at the well-known path `/user/management` on central nodes and advertised via **ClusterClientReceptionist**. External tools (primarily the CLI) connect using Akka.NET ClusterClient, which contacts the receptionist to discover the ManagementActor. ClusterClient is a separate communication channel from the inter-cluster remoting used for central-site messaging — it does not participate in cluster membership or affect the hub-and-spoke topology.
The ManagementActor is registered at the well-known path `/user/management` on central nodes and advertised via **ClusterClientReceptionist**. External tools (primarily the CLI) connect using Akka.NET ClusterClient, which contacts the receptionist to discover the ManagementActor. This is a separate ClusterClient usage from the inter-cluster ClusterClient connections used for central-site messaging — the CLI does not participate in cluster membership or affect the hub-and-spoke topology.
## Connection Failure Behavior
@@ -144,12 +154,12 @@ The ManagementActor is registered at the well-known path `/user/management` on c
## Failover Behavior
- **Central failover**: The standby node takes over the Akka.NET cluster role. In-progress deployments are treated as failed. Sites reconnect to the new active central node.
- **Site failover**: The standby node takes over. The Deployment Manager singleton restarts and re-creates the Instance Actor hierarchy. Central detects the node change and reconnects. Ongoing debug streams are interrupted and must be re-established by the engineer.
- **Central failover**: The standby node takes over the Akka.NET cluster role. In-progress deployments are treated as failed. Site ClusterClients automatically reconnect to the standby central node via their configured contact points.
- **Site failover**: The standby node takes over. The Deployment Manager singleton restarts and re-creates the Instance Actor hierarchy. Central's per-site ClusterClient automatically reconnects to the surviving site node. Ongoing debug streams are interrupted and must be re-established by the engineer.
## Dependencies
- **Akka.NET Remoting**: Provides the transport layer.
- **Akka.NET Remoting + ClusterClient**: Remoting provides the transport layer. ClusterClient/ClusterClientReceptionist is used for all cross-cluster messaging.
- **Cluster Infrastructure**: Manages node roles and failover detection.
- **Configuration Database**: Provides site node addresses (NodeAAddress, NodeBAddress) for address resolution.