Joseph Doherty 9e97c1acd2 feat: replace site registration with database-driven site addressing

Central now resolves site Akka remoting addresses from the Sites DB table
(NodeAAddress/NodeBAddress) instead of relying on runtime RegisterSite
messages. Eliminates the race condition where sites starting before central
had their registration dead-lettered. Addresses are cached in
CentralCommunicationActor with 60s periodic refresh and on-demand refresh
when sites are added/edited/deleted via UI or CLI.

2026-03-17 23:13:10 -04:00


Component: Central–Site Communication

Purpose

The Communication component manages all messaging between the central cluster and site clusters using Akka.NET. It provides the transport layer for deployments, instance lifecycle commands, integration routing, debug streaming, health reporting, and remote queries (parked messages, event logs).

Location

Both central and site clusters. Each side has communication actors that handle message routing.

Responsibilities

Resolve site addresses from the configuration database and maintain a cached address map.
Establish and maintain Akka.NET remoting connections between central and each site cluster.
Route messages between central and site clusters in a hub-and-spoke topology.
Broker requests from external systems (via central) to sites and return responses.
Support multiple concurrent message patterns (request/response, fire-and-forget, streaming).
Detect site connectivity status for health monitoring.

Communication Patterns

1. Deployment (Central → Site)

Pattern: Request/Response.
Central sends a flattened configuration to a site.
Site Runtime receives the configuration, compiles scripts, creates or updates Instance Actors, and responds with success or failure.
No buffering at central — if the site is unreachable, the deployment fails immediately.

2. Instance Lifecycle Commands (Central → Site)

Pattern: Request/Response.
Central sends disable, enable, or delete commands for specific instances.
Site Runtime processes the command and responds with success/failure.
If the site is unreachable, the command fails immediately (no buffering).

3. System-Wide Artifact Deployment (Central → Site(s))

Pattern: Broadcast with per-site acknowledgment (deploy to all sites), or targeted to a single site (per-site deployment).
When shared scripts, external system definitions, database connections, data connections, notification lists, or SMTP configuration are explicitly deployed, central sends them to the target site(s).
Each site acknowledges receipt and reports success/failure independently.
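
The broadcast-with-acknowledgment pattern can be illustrated with a minimal sketch. It is not the project's implementation: `deploy_artifacts` and the `send` callable are hypothetical names, and sites are contacted sequentially here only to keep the example short. The point it shows is that each site's outcome is recorded independently, so one failing site does not hide the result of the others.

```python
from typing import Callable, Dict, Iterable


def deploy_artifacts(sites: Iterable[str],
                     send: Callable[[str], bool]) -> Dict[str, bool]:
    """Send artifacts to each target site and collect one result per site.

    `send` is a hypothetical stand-in for the request to a site. It returns
    True on a success acknowledgment, False on a failure acknowledgment,
    and raises if the site cannot be reached or the request times out.
    """
    results: Dict[str, bool] = {}
    for site in sites:
        try:
            results[site] = bool(send(site))
        except Exception:
            # An unreachable site is reported as failed; the loop continues
            # so the remaining sites still receive the artifacts.
            results[site] = False
    return results
```

A targeted per-site deployment is the same call with a single-element list of sites.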

4. Integration Routing (External System → Central → Site → Central → External System)

Pattern: Request/Response (brokered).
External system sends a request to central (e.g., MES requests machine values).
Central routes the request to the appropriate site.
Site reads values from the Instance Actor and responds.
Central returns the response to the external system.

5. Recipe/Command Delivery (External System → Central → Site)

Pattern: Fire-and-forget with acknowledgment.
External system sends a command to central (e.g., recipe manager sends recipe).
Central routes to the site.
Site applies and acknowledges.

6. Debug Streaming (Site → Central)

Pattern: Subscribe/stream with initial snapshot.
Central sends a subscribe request for a specific instance (identified by unique name).
Site requests a snapshot of all current attribute values and alarm states from the Instance Actor and sends it to central.
Site then subscribes to the site-wide Akka stream filtered by the instance's unique name and forwards attribute value changes and alarm state changes to central.
Attribute value stream messages: [InstanceUniqueName].[AttributePath].[AttributeName], value, quality, timestamp.
Alarm state stream messages: [InstanceUniqueName].[AlarmName], state (active/normal), priority, timestamp.
Central sends an unsubscribe request when the debug view closes. The site removes its stream subscription.
The stream is session-based and temporary.
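
The snapshot-then-stream sequence can be sketched as follows. This is an illustration under assumptions, not the actual site code: `StreamEvent`, `belongs_to`, and `debug_stream` are hypothetical names, the site-wide stream is modeled as a plain iterable, and instance unique names are assumed not to contain a dot, so that a prefix match on `name + "."` is unambiguous.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class StreamEvent:
    # Key is "[InstanceUniqueName].[AttributePath].[AttributeName]" for an
    # attribute change, or "[InstanceUniqueName].[AlarmName]" for an alarm.
    key: str
    payload: dict


def belongs_to(event_key: str, instance_name: str) -> bool:
    """Match on the name plus the separator so 'Pump1' does not match 'Pump10'."""
    return event_key.startswith(instance_name + ".")


def debug_stream(instance_name: str,
                 snapshot: Iterable[StreamEvent],
                 site_stream: Iterable[StreamEvent]) -> Iterator[StreamEvent]:
    """Yield the initial snapshot first, then only this instance's changes."""
    yield from snapshot
    for event in site_stream:
        if belongs_to(event.key, instance_name):
            yield event
```

Because the result is a generator, dropping it corresponds to the unsubscribe step: nothing is retained once the debug view closes.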

7. Health Reporting (Site → Central)

Pattern: Periodic push.
Sites periodically send health metrics (connection status, node status, buffer depth, script error rates, alarm evaluation error rates) to central.

8. Remote Queries (Central → Site)

Pattern: Request/Response.
Central queries sites for:
- Parked messages (store-and-forward dead letters).
- Site event logs.
Central can also send management commands:
- Retry or discard parked messages.

Topology

Central Cluster
  ├── Akka.NET Remoting → Site A Cluster
  ├── Akka.NET Remoting → Site B Cluster
  └── Akka.NET Remoting → Site N Cluster

Sites do not communicate with each other.
All inter-cluster communication flows through central.

Site Address Resolution

Central discovers site addresses through the configuration database, not runtime registration:

Each site record in the Sites table includes optional NodeAAddress and NodeBAddress fields containing the Akka remoting paths of the site's cluster nodes.
The CentralCommunicationActor loads all site addresses from the database at startup and caches them in memory.
The cache is refreshed every 60 seconds and on-demand when site records are added, edited, or deleted via the Central UI or CLI.
When routing a message to a site, the actor prefers NodeA and falls back to NodeB if NodeA is unreachable.
Heartbeats from sites serve health monitoring only — they do not serve as a registration or address discovery mechanism.
If no addresses are configured for a site, messages to that site are dropped and the caller's Ask times out.
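
The resolution rules above can be sketched in a few lines. This is a simplified model, not the CentralCommunicationActor itself: `SiteAddresses`, `SiteAddressCache`, `load_sites`, and `is_reachable` are hypothetical names, and the 60-second timer and the UI/CLI triggers are represented only by the `refresh` method they would call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class SiteAddresses:
    """Mirrors the optional NodeAAddress / NodeBAddress fields of a site record."""
    node_a: Optional[str] = None
    node_b: Optional[str] = None


class SiteAddressCache:
    """In-memory address map, loaded at startup and reloaded on refresh."""

    def __init__(self, load_sites: Callable[[], Dict[str, SiteAddresses]]):
        self._load_sites = load_sites  # hypothetical database query
        self._cache: Dict[str, SiteAddresses] = {}
        self.refresh()  # initial load at startup

    def refresh(self) -> None:
        # Replace the whole map so deleted sites disappear from the cache.
        self._cache = dict(self._load_sites())

    def resolve(self, site_id: str,
                is_reachable: Callable[[str], bool]) -> Optional[str]:
        """Prefer NodeA, fall back to NodeB, return None if neither is usable."""
        entry = self._cache.get(site_id)
        if entry is None:
            return None
        for address in (entry.node_a, entry.node_b):
            if address and is_reachable(address):
                return address
        return None
```

A `None` result corresponds to the last rule in the list: there is nowhere to send the message, so it is dropped and the caller's Ask eventually times out.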

Message Timeouts

Each request/response pattern has a default timeout that can be overridden in configuration:

Pattern	Default Timeout	Rationale
1. Deployment	120 seconds	Script compilation at the site can be slow
2. Instance Lifecycle	30 seconds	Stopping and starting actors is fast
3. System-Wide Artifacts	120 seconds per site	Includes shared script recompilation
4. Integration Routing	30 seconds	External system waiting for response; Inbound API per-method timeout may cap this further
5. Recipe/Command Delivery	30 seconds	Fire-and-forget with ack
8. Remote Queries	30 seconds	Querying parked messages or event logs

Timeouts use the Akka.NET ask pattern. If no response is received within the timeout, the caller receives a timeout failure.
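
As an illustration of defaults with configuration overrides, the sketch below models the table and the timeout behavior with Python's `asyncio` in place of the Akka.NET ask pattern. The dictionary keys, `timeout_for`, and `ask` are hypothetical names chosen for this example; only the default values come from the table.

```python
import asyncio
from typing import Awaitable, Dict, Optional

# Defaults from the table above, in seconds.
DEFAULT_TIMEOUTS_SECONDS: Dict[str, float] = {
    "deployment": 120,
    "instance_lifecycle": 30,
    "system_wide_artifacts": 120,  # applies per site
    "integration_routing": 30,
    "recipe_command_delivery": 30,
    "remote_queries": 30,
}


def timeout_for(pattern: str,
                overrides: Optional[Dict[str, float]] = None) -> float:
    """Return the configured timeout for a pattern, else its default."""
    merged = {**DEFAULT_TIMEOUTS_SECONDS, **(overrides or {})}
    return float(merged[pattern])


async def ask(request: Awaitable, pattern: str,
              overrides: Optional[Dict[str, float]] = None):
    """Await a response; raises asyncio.TimeoutError if none arrives in time."""
    return await asyncio.wait_for(request, timeout=timeout_for(pattern, overrides))
```

The caller sees a timeout as an exception and decides what to do next; nothing in this sketch retries or buffers the request.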

Transport Configuration

Akka.NET remoting provides built-in connection management and failure detection. The following transport-level settings are explicitly configured (not left to framework defaults) for predictable behavior:

Transport heartbeat interval: Configurable interval at which heartbeat messages are sent over remoting connections (e.g., every 5 seconds).
Failure detection threshold: Number of missed heartbeats before the connection is considered lost (e.g., 3 missed heartbeats = 15 seconds with a 5-second interval).
Reconnection: Akka.NET remoting handles reconnection automatically. No custom reconnection logic is required.

These settings should be tuned for the expected network conditions between central and site clusters.
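
The arithmetic behind the example values can be written out as a small sketch. It models only the simple missed-heartbeat rule described above, not Akka.NET's failure detector, and the function names are hypothetical.

```python
def detection_window_seconds(heartbeat_interval_s: float,
                             missed_heartbeats: int) -> float:
    """Silence, in seconds, after which the connection is considered lost."""
    if heartbeat_interval_s <= 0 or missed_heartbeats < 1:
        raise ValueError("interval must be positive and threshold at least 1")
    return heartbeat_interval_s * missed_heartbeats


def is_connection_lost(last_heartbeat_at: float, now: float,
                       heartbeat_interval_s: float,
                       missed_heartbeats: int) -> bool:
    """True once the time since the last heartbeat reaches the window."""
    window = detection_window_seconds(heartbeat_interval_s, missed_heartbeats)
    return (now - last_heartbeat_at) >= window
```

With the example settings (5-second interval, threshold of 3), `detection_window_seconds(5, 3)` gives the 15 seconds quoted above. A shorter interval detects failures sooner at the cost of more false positives on a slow or lossy link.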

Application-Level Correlation

All request/response messages include an application-level correlation ID to ensure correct pairing of requests and responses, even across reconnection events:

Deployments include a deployment ID and revision hash for idempotency (see Deployment Manager).
Lifecycle commands include a command ID for deduplication.
Remote queries include a query ID for response correlation.

This provides protocol-level safety beyond Akka.NET's transport guarantees, which may not hold across disconnect/reconnect cycles.
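
One way to picture application-level correlation is a table of pending requests keyed by ID. The sketch below is an assumption-laden illustration: `PendingRequests` and its methods are hypothetical, and a random UUID stands in for whatever format the deployment, command, and query IDs actually use.

```python
import uuid
from typing import Dict, Optional


class PendingRequests:
    """Tracks in-flight requests so responses can be paired by ID."""

    def __init__(self) -> None:
        self._pending: Dict[str, str] = {}

    def register(self, kind: str) -> str:
        """Record a new request and return its correlation ID."""
        correlation_id = str(uuid.uuid4())
        self._pending[correlation_id] = kind
        return correlation_id

    def complete(self, correlation_id: str) -> Optional[str]:
        """Return the request kind for a matching response.

        Returns None for an unknown, stale, or duplicate response, which
        the caller can then ignore instead of pairing it with the wrong
        request.
        """
        return self._pending.pop(correlation_id, None)
```

A response that arrives after a reconnect for a request that was already completed or abandoned finds no entry, so it is ignored instead of being mistaken for the answer to a newer request.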

Message Ordering

Akka.NET guarantees message ordering between a specific sender/receiver actor pair. The Communication Layer relies on this guarantee: messages that one sender actor sends to a given receiving actor at a site are processed in the order they were sent, so callers do not need to handle out-of-order delivery on that path.

ManagementActor and ClusterClient

The ManagementActor is registered at the well-known path /user/management on central nodes and advertised via ClusterClientReceptionist. External tools (primarily the CLI) connect using Akka.NET ClusterClient, which contacts the receptionist to discover the ManagementActor. ClusterClient is a separate communication channel from the inter-cluster remoting used for central-site messaging — it does not participate in cluster membership or affect the hub-and-spoke topology.

Connection Failure Behavior

In-flight messages: When a connection drops while a request is in flight (e.g., deployment sent but no response received), the Akka ask pattern times out and the caller receives a failure. There is no automatic retry or buffering at central — the engineer sees the failure in the UI and re-initiates the action. This is consistent with the design principle that central does not buffer messages.
Debug streams: Any connection interruption (failover or network blip) kills the debug stream. The engineer must reopen the debug view in the Central UI to re-establish the subscription with a fresh snapshot. There is no auto-resume.

Failover Behavior

Central failover: The standby node takes over the Akka.NET cluster role. In-progress deployments are treated as failed. Sites reconnect to the new active central node.
Site failover: The standby node takes over. The Deployment Manager singleton restarts and re-creates the Instance Actor hierarchy. Central detects the node change and reconnects. Ongoing debug streams are interrupted and must be re-established by the engineer.

Dependencies

Akka.NET Remoting: Provides the transport layer.
Cluster Infrastructure: Manages node roles and failover detection.
Configuration Database: Provides site node addresses (NodeAAddress, NodeBAddress) for address resolution.

Interactions

Deployment Manager (central): Uses communication to deliver configurations, lifecycle commands, and system-wide artifacts, and receive status.
Site Runtime: Receives deployments, lifecycle commands, and artifact updates. Provides debug view data.
Central UI: Debug view requests and remote queries flow through communication.
Health Monitoring: Receives periodic health reports from sites.
Store-and-Forward Engine (site): Parked message queries/commands are routed through communication.
Site Event Logging: Event log queries are routed through communication.
Management Service: The ManagementActor is registered with ClusterClientReceptionist on central nodes. The CLI communicates with the ManagementActor via ClusterClient, which is a separate channel from inter-cluster remoting.
