scadalink-design/Component-ClusterInfrastructure.md
2026-03-16 07:39:26 -04:00

Component: Cluster Infrastructure

Purpose

The Cluster Infrastructure component manages the Akka.NET cluster setup, active/standby node roles, failover detection, and the foundational runtime environment on which all other components run. It provides the base layer for both central and site clusters.

Location

Both central and site clusters.

Responsibilities

  • Bootstrap the Akka.NET actor system on each node.
  • Form a two-node cluster (active/standby) using Akka.NET Cluster.
  • Manage leader election and role assignment (active vs. standby).
  • Detect node failures and trigger failover.
  • Provide the Akka.NET remoting infrastructure for inter-node and inter-cluster communication.
  • Support cluster singleton hosting (used by the Site Runtime Deployment Manager singleton on site clusters).
  • Manage Windows service lifecycle (start, stop, restart) on each node.

Cluster Topology

Central Cluster

  • Two nodes forming an Akka.NET cluster.
  • One active node runs all central components (Template Engine, Deployment Manager, Central UI, etc.).
  • One standby node is ready to take over on failover.
  • Connected to MS SQL databases (Config DB, Machine Data DB).

Site Cluster (per site)

  • Two nodes forming an Akka.NET cluster.
  • One active node runs all site components (Site Runtime, Data Connection Layer, Store-and-Forward Engine, etc.).
  • The Site Runtime Deployment Manager runs as an Akka.NET cluster singleton on the active node, owning the full Instance Actor hierarchy.
  • One standby node receives replicated store-and-forward data and is ready to take over.
  • Connected to local SQLite databases (store-and-forward buffer, event logs, deployed configurations).
  • Connected to machines via data connections (OPC UA, custom protocol).

Failover Behavior

Detection

  • Akka.NET Cluster monitors node health through heartbeats exchanged between the two nodes.
  • If the active node stops responding and becomes unreachable, the standby node detects the failure and promotes itself to active.
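
The detection and promotion steps above can be pictured with a small model. This is a minimal sketch that assumes a fixed silence timeout; Akka.NET Cluster itself uses an adaptive (phi accrual) failure detector, so the fixed timeout is a deliberate simplification. `StandbyMonitor` and `timeout_s` are hypothetical names, not part of the design.

```python
from dataclasses import dataclass


@dataclass
class StandbyMonitor:
    """Simplified fixed-timeout heartbeat monitor run by the standby node (hypothetical)."""
    timeout_s: float          # assumed threshold; not a value taken from the design
    last_heartbeat_s: float   # time of the most recent heartbeat from the active node
    role: str = "standby"

    def on_heartbeat(self, now_s: float) -> None:
        # Record a heartbeat received from the active node.
        self.last_heartbeat_s = now_s

    def tick(self, now_s: float) -> str:
        # Promote to active once the active node has been silent longer than the timeout.
        if self.role == "standby" and now_s - self.last_heartbeat_s > self.timeout_s:
            self.role = "active"
        return self.role
```

Calling `tick` periodically on the standby node keeps it in the standby role while heartbeats arrive and switches it to active after a sufficiently long silence. Once promoted, the model never demotes itself, which matches the one-way takeover described above.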

Central Failover

  • The new active node takes over all central responsibilities.
  • In-progress deployments are treated as failed; engineers must retry them.
  • UI sessions may be interrupted; users reconnect to the new active node.
  • Central does no message buffering, so there is no state to recover beyond what is stored in MS SQL.
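
One way to picture the "treated as failed" rule is a pass over the deployment records when the new active node starts. This is only a sketch: the record shape, the status strings, and the function `fail_in_progress_deployments` are assumptions for illustration, and the design does not say where or how this status is stored.

```python
def fail_in_progress_deployments(deployments):
    """Return copies of the deployment records with interrupted ones marked failed."""
    recovered = []
    for record in deployments:
        record = dict(record)  # copy so the caller's records are not mutated
        if record["status"] == "in_progress":
            # Hypothetical status and reason values; an engineer must retry these.
            record["status"] = "failed"
            record["reason"] = "central failover; retry required"
        recovered.append(record)
    return recovered
```

Completed deployments pass through unchanged, which reflects the point that central has nothing to recover beyond what was already persisted.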

Site Failover

  • The new active node takes over:
    • Instance hosting: the Site Runtime Deployment Manager singleton restarts and re-creates the full Instance Actor hierarchy by reading deployed configurations from local SQLite. Each Instance Actor spawns its child Script and Alarm Actors.
    • Data collection: the Data Connection Layer re-establishes subscriptions as Instance Actors register their data source references.
    • Store-and-forward delivery: the buffer is already replicated locally.
  • Active debug view streams from central are interrupted; the engineer must re-open them.
  • Health reporting resumes from the new active node.
  • Alarm states are re-evaluated from incoming values (alarm state is in-memory only).
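
The rebuild on site failover can be sketched as a read of the local SQLite store followed by re-creating one object per deployed instance. This is a plain-Python model, not Akka.NET code: the `deployed_configs` table, its columns, the JSON layout, and the `InstanceActor` class are all hypothetical. It also shows why alarm states start out unevaluated, since they are held in memory only.

```python
import json
import sqlite3
from dataclasses import dataclass, field


@dataclass
class InstanceActor:
    """Stand-in for an Instance Actor and its children (hypothetical model)."""
    name: str
    scripts: list = field(default_factory=list)
    # Alarm state is in-memory only, so every alarm is unevaluated (None) after failover.
    alarm_states: dict = field(default_factory=dict)

    def on_value(self, alarm: str, value: float, limit: float) -> None:
        # Re-evaluate one alarm from an incoming value (simple high-limit check).
        self.alarm_states[alarm] = "active" if value > limit else "normal"


def rebuild_hierarchy(db: sqlite3.Connection) -> dict:
    """Re-create one InstanceActor per deployed configuration row."""
    instances = {}
    rows = db.execute(
        "SELECT instance_name, config_json FROM deployed_configs ORDER BY instance_name"
    )
    for name, config_json in rows:
        config = json.loads(config_json)
        instances[name] = InstanceActor(
            name=name,
            scripts=list(config.get("scripts", [])),
            alarm_states={alarm: None for alarm in config.get("alarms", [])},
        )
    return instances
```

In this model, an alarm stays at `None` until the first value arrives and `on_value` is called, which corresponds to alarm states being re-evaluated from incoming values rather than restored.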

Node Configuration

Each node is configured with:

  • Cluster seed nodes: Addresses of both nodes in the cluster.
  • Cluster role: Central or Site (plus site identifier for site clusters).
  • Akka.NET remoting: Hostname/port for inter-node and inter-cluster communication.
  • Local storage paths: SQLite database locations (site nodes only).
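
The four settings above could be captured in a small typed record with a consistency check. This is a sketch under assumptions: the field names and the `validate` rules are hypothetical, and the real configuration format (for example, the Akka.NET settings carried in the service configuration) is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeConfig:
    """The settings listed above; field names are hypothetical."""
    seed_nodes: tuple            # addresses of both nodes in the cluster
    cluster_role: str            # "Central" or "Site"
    site_id: Optional[str]       # site identifier, Site nodes only
    remoting_host: str
    remoting_port: int
    sqlite_dir: Optional[str]    # local storage path, Site nodes only

    def validate(self) -> list:
        """Return a list of problems; an empty list means the config is consistent."""
        problems = []
        if len(self.seed_nodes) != 2:
            problems.append("expected exactly two seed nodes")
        if self.cluster_role not in ("Central", "Site"):
            problems.append("cluster_role must be Central or Site")
        if self.cluster_role == "Site":
            if not self.site_id:
                problems.append("site nodes need a site identifier")
            if not self.sqlite_dir:
                problems.append("site nodes need a SQLite storage path")
        if not 1 <= self.remoting_port <= 65535:
            problems.append("remoting_port out of range")
        return problems
```

Checking the record at service start would surface a missing site identifier or storage path before the node attempts to join its cluster.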

Windows Service

  • Each node runs as a Windows service for automatic startup and recovery.
  • Service configuration includes Akka.NET cluster settings and component-specific configuration.

Platform

  • OS: Windows Server.
  • Runtime: .NET (Akka.NET).
  • Cluster: Akka.NET Cluster (application-level, not Windows Server Failover Clustering).

Dependencies

  • Akka.NET: Core actor system, cluster, remoting, and cluster singleton libraries.
  • Windows: Service hosting, networking.
  • MS SQL (central only): Database connectivity.
  • SQLite (sites only): Local storage.

Interactions

  • All components: Every component runs within the Akka.NET actor system managed by this infrastructure.
  • Site Runtime: The Deployment Manager singleton relies on Akka.NET cluster singleton support provided by this infrastructure.
  • Communication Layer: Built on top of the Akka.NET remoting provided here.
  • Health Monitoring: Reports node status (active/standby) as a health metric.