# Component: Cluster Infrastructure

## Purpose

The Cluster Infrastructure component manages the Akka.NET cluster setup, active/standby node roles, failover detection, and the foundational runtime environment on which all other components run. It provides the base layer for both central and site clusters.

## Location

Both central and site clusters.

## Responsibilities

- Bootstrap the Akka.NET actor system on each node.
- Form a two-node cluster (active/standby) using Akka.NET Cluster.
- Manage leader election and role assignment (active vs. standby).
- Detect node failures and trigger failover.
- Provide the Akka.NET remoting infrastructure for inter-cluster communication.
- Support cluster singleton hosting (used by the Site Runtime Deployment Manager singleton on site clusters).
- Manage Windows service lifecycle (start, stop, restart) on each node.

## Cluster Topology

### Central Cluster

- Two nodes forming an Akka.NET cluster.
- One active node runs all central components (Template Engine, Deployment Manager, Central UI, etc.).
- One standby node is ready to take over on failover.
- Connected to MS SQL databases (Config DB, Machine Data DB).

### Site Cluster (per site)

- Two nodes forming an Akka.NET cluster.
- One active node runs all site components (Site Runtime, Data Connection Layer, Store-and-Forward Engine, etc.).
- The Site Runtime Deployment Manager runs as an **Akka.NET cluster singleton** on the active node, owning the full Instance Actor hierarchy.
- One standby node receives replicated store-and-forward data and is ready to take over.
- Connected to local SQLite databases (store-and-forward buffer, event logs, deployed configurations).
- Connected to machines via data connections (OPC UA).

## Failover Behavior

### Detection

- Akka.NET Cluster monitors node health via heartbeat.
- If the active node becomes unreachable, the standby node detects the failure and promotes itself to active.

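
The heartbeat monitoring described above is tuned through the Akka.NET cluster failure detector settings. The fragment below is a minimal HOCON sketch that uses the defaults listed under Failure Detection Timing. Mapping the "failure detection threshold" onto `acceptable-heartbeat-pause` is an assumption, since this document does not name the exact keys it uses.

```hocon
akka.cluster.failure-detector {
  # Assumed mapping for "Heartbeat interval" (2 seconds).
  heartbeat-interval = 2s

  # Assumed mapping for "Failure detection threshold" (10 seconds):
  # how long heartbeats may be missing before the peer is suspected.
  acceptable-heartbeat-pause = 10s
}
```

Akka.NET's failure detector is statistical (phi accrual), so a node is marked unreachable at roughly, not exactly, the configured pause.
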
### Central Failover

- The new active node takes over all central responsibilities.
- In-progress deployments are treated as **failed**; engineers must retry.
- The UI session may be interrupted; users reconnect to the new active node.
- No message buffering at central; no state to recover beyond what's in MS SQL.

### Site Failover

- The new active node takes over:
  - The Deployment Manager singleton restarts and re-creates the full Instance Actor hierarchy by reading deployed configurations from local SQLite. Each Instance Actor spawns its child Script and Alarm Actors.
  - Data collection (Data Connection Layer re-establishes subscriptions as Instance Actors register their data source references).
  - Store-and-forward delivery (buffer is already replicated locally).
- Active debug view streams from central are interrupted; the engineer must re-open them.
- Health reporting resumes from the new active node.
- Alarm states are re-evaluated from incoming values (alarm state is in-memory only).

## Split-Brain Resolution

The system uses the Akka.NET **keep-oldest** split-brain resolver strategy:

- On a network partition, the node that has been in the cluster longest remains active. The younger node downs itself.
- **Stable-after duration**: 15 seconds. The cluster membership must remain stable (no changes) for 15 seconds before the resolver acts to down unreachable nodes. This prevents premature downing during startup or rolling restarts.
- **`down-if-alone = on`**: The keep-oldest resolver is configured with `down-if-alone` enabled. If the oldest node finds itself alone (no other reachable members), it downs itself rather than continuing as a single-node cluster. This prevents the oldest node from running in isolation during a network partition while the younger node also forms its own cluster.
- **Why keep-oldest**: With only two nodes, quorum-based strategies (static-quorum, keep-majority) cannot distinguish "one node crashed" from "network partition": both sides see fewer than quorum and both would down themselves, resulting in total cluster shutdown. Keep-oldest with `down-if-alone` provides safe singleton ownership: at most one node runs the cluster singleton at any time.

## Single-Node Operation

`akka.cluster.min-nr-of-members` must be set to **1**. After failover, only one node is running. If set to 2, the surviving node waits for a second member before allowing the Cluster Singleton (Site Runtime Deployment Manager) to start, blocking all data collection and script execution indefinitely.

## Failure Detection Timing

Configurable defaults for heartbeat and failure detection:

| Setting | Default | Description |
|---------|---------|-------------|
| Heartbeat interval | 2 seconds | Frequency of health check messages between nodes |
| Failure detection threshold | 10 seconds | Time without heartbeat before a node is considered unreachable |
| Stable-after (split-brain) | 15 seconds | Time cluster must be stable before resolver acts |
| **Total failover time** | **~25 seconds** | Detection (10s) + stable-after (15s) + singleton restart |

These values balance failover speed with stability: fast enough that data collection gaps are small, tolerant enough that brief network hiccups don't trigger unnecessary failovers.

## Dual-Node Recovery

If both nodes in a cluster fail simultaneously (e.g., site power outage):

1. **No manual intervention required.** Since both nodes are configured as seed nodes, whichever node starts first forms a new cluster. The second node joins when it starts.
2. **State recovery** (each node has its own local copy of all required data):
   - **Site clusters**: The Deployment Manager singleton reads deployed configurations from local SQLite and re-creates the full Instance Actor hierarchy.
     Store-and-forward buffers are already persisted locally. Alarm states re-evaluate from incoming data values.
   - **Central cluster**: All state is in MS SQL (configuration database). The active node resumes normal operation.
3. The keep-oldest resolver handles the "both starting fresh" case naturally: there is no pre-existing cluster to conflict with.

## Graceful Shutdown & Singleton Handover

When a node is stopped for planned maintenance (Windows Service stop), `CoordinatedShutdown` triggers a **graceful leave** from the cluster. This enables the Cluster Singleton (Site Runtime Deployment Manager) to hand over to the other node in seconds (limited by the hand-over retry interval) rather than waiting for the full failure detection timeout (~25 seconds).

Configuration required:

- `akka.coordinated-shutdown.run-by-clr-shutdown-hook = on`
- `akka.cluster.run-coordinated-shutdown-when-down = on`

The Host component wires CoordinatedShutdown into the Windows Service lifecycle (see REQ-HOST-9).

## Node Configuration

Each node is configured with:

- **Cluster seed nodes**: **Both nodes** are seed nodes: each node lists both itself and its partner. Either node can start first and form the cluster; the other joins when it starts. No startup ordering dependency.
- **Cluster role**: Central or Site (plus site identifier for site clusters).
- **Akka.NET remoting**: Hostname/port for inter-node and inter-cluster communication (default 8081 central, 8082 site).
- **gRPC port** (site nodes only): Dedicated HTTP/2 port for the SiteStreamGrpcServer (default 8083). Separate from the Akka remoting port; gRPC uses Kestrel, Akka uses its own TCP transport.
- **Local storage paths**: SQLite database locations (site nodes only).

## Windows Service

- Each node runs as a **Windows service** for automatic startup and recovery.
- Service configuration includes Akka.NET cluster settings and component-specific configuration.

## Platform

- **OS**: Windows Server.
- **Runtime**: .NET (Akka.NET).
- **Cluster**: Akka.NET Cluster (application-level, not Windows Server Failover Clustering).

## Dependencies

- **Akka.NET**: Core actor system, cluster, remoting, and cluster singleton libraries.
- **Windows**: Service hosting, networking.
- **MS SQL** (central only): Database connectivity.
- **SQLite** (sites only): Local storage.

## Interactions

- **All components**: Every component runs within the Akka.NET actor system managed by this infrastructure.
- **Site Runtime**: The Deployment Manager singleton relies on Akka.NET cluster singleton support provided by this infrastructure.
- **Communication Layer**: Built on top of the Akka.NET remoting provided here.
- **Health Monitoring**: Reports node status (active/standby) as a health metric.
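
As a consolidated illustration of the settings named in Split-Brain Resolution, Single-Node Operation, Graceful Shutdown & Singleton Handover, and Node Configuration, the fragment below sketches how one site node's HOCON might look. It is not the project's actual configuration. The actor system name (`cluster-system`), the host names, the `Site` role string, and the choice of the `dot-netty.tcp` transport are all assumptions. The `downing-provider-class` value is the split-brain resolver provider as it appears in the Akka.NET documentation and should be checked against the Akka.NET version in use.

```hocon
akka {
  actor.provider = cluster

  # Assumed transport; host name is hypothetical.
  remote.dot-netty.tcp {
    hostname = "site-node-a"
    port = 8082                      # default site remoting port
  }

  cluster {
    # Both nodes list both addresses, so either one can start first.
    seed-nodes = [
      "akka.tcp://cluster-system@site-node-a:8082",
      "akka.tcp://cluster-system@site-node-b:8082"
    ]
    roles = ["Site"]                 # hypothetical role name

    # Lets a lone survivor start the cluster singleton after failover.
    min-nr-of-members = 1

    # Assumed provider class; verify for the deployed Akka.NET version.
    downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
    split-brain-resolver {
      active-strategy = keep-oldest
      stable-after = 15s
      keep-oldest.down-if-alone = on
    }

    run-coordinated-shutdown-when-down = on
  }

  coordinated-shutdown.run-by-clr-shutdown-hook = on
}
```

The partner node would carry the same block with only `hostname` changed. A central node would use port 8081 and its own role.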