scadalink-design/AkkaDotNet/03-Cluster.md
Joseph Doherty de636b908b Add Akka.NET reference documentation
Notes and documentation covering actors, remoting, clustering, persistence,
streams, serialization, hosting, testing, and best practices for the Akka.NET
framework used throughout the ScadaLink system.
2026-03-16 09:08:17 -04:00

03 — Cluster (Akka.Cluster)

Overview

Akka.Cluster organizes our two SCADA nodes into a managed membership group with failure detection, role assignment, and coordinated lifecycle management. It is the backbone that makes the active/cold-standby failover model work — Cluster determines which node is "oldest" (and therefore active), detects when a node has failed, and triggers the standby to take over.

The 2-node cluster topology is the most architecturally challenging configuration in Akka.Cluster because traditional quorum-based decisions require a majority, and with 2 nodes there is no majority when one node is down.
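To make the no-majority problem concrete, here is a small illustrative sketch (plain Python, not Akka.NET code) of what a majority-based strategy decides at each cluster size:

```python
def majority(cluster_size: int) -> int:
    """Smallest number of members that constitutes a majority."""
    return cluster_size // 2 + 1

def partition_survives(cluster_size: int, reachable: int) -> bool:
    """Under a keep-majority strategy, a partition survives only if it
    can still see a majority of the configured cluster size."""
    return reachable >= majority(cluster_size)

# 3-node cluster, one node lost: the 2-node side has a majority and survives.
assert partition_survives(3, 2)
# 2-node cluster, one node lost: a lone node is never a majority of 2,
# so BOTH sides down themselves and no node survives the partition.
assert not partition_survives(2, 1)
```

This is why the strategies below avoid majority counting entirely and arbitrate on seniority or an external lease instead.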

When to Use

  • Always — the 2-node failover pair requires Cluster for membership management, failure detection, and coordinated role transitions
  • Cluster membership events drive the entire failover lifecycle

When Not to Use

  • Do not use Cluster for communication with equipment — that's the device actor layer
  • Do not add more than 2 nodes to a site cluster without revisiting the Split Brain Resolver configuration

Design Decisions for the SCADA System

Cluster Roles

Assign a single role to both nodes: scada-node. Role differentiation between active and standby is handled by Cluster Singleton (the oldest node becomes the singleton host), not by role assignment.

akka.cluster {
  roles = ["scada-node"]
}

Seed Nodes

With exactly 2 nodes and static IPs/hostnames, use static seed nodes. Both nodes list both addresses as seeds:

akka.cluster {
  seed-nodes = [
    "akka.tcp://scada-system@nodeA.scada.local:4053",
    "akka.tcp://scada-system@nodeB.scada.local:4053"
  ]
}

The first seed node listed must be the same on both nodes, because only the first seed node can bootstrap a new cluster by joining itself. On initial formation it therefore becomes the first, and hence "oldest", member, which determines the Singleton host. Note that seniority follows join order, not seed order: after a restart, a rejoining node comes back as the youngest member regardless of its position in the seed list.

Split Brain Resolver (SBR) — The Critical Decision

With only 2 nodes, quorum-based SBR strategies (majority, static-quorum) do not work because losing 1 of 2 nodes means no majority exists. The options for 2-node clusters:

Recommended: keep-oldest

The oldest node (the one that has been in the cluster longest) survives a partition. The younger node downs itself. When the failed/partitioned node comes back, it rejoins as the younger node.

akka.cluster {
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-oldest
    keep-oldest {
      down-if-alone = on  # If the oldest is truly alone (no other node visible), it downs itself too
    }
    stable-after = 15s  # Wait for cluster to stabilize before making decisions
  }
}

Why down-if-alone = on: keep-oldest by itself cannot distinguish "the younger node crashed" from "the oldest node is cut off from the network". An isolated oldest node that keeps running as the sole survivor may be on the wrong side of a network fault and unable to reach the equipment it is supposed to control. With down-if-alone = on, an oldest node that can see no other member downs itself, while the younger node, seeing that the oldest is alone on the other side of the partition, downs it and takes over. The trade-off: if the younger node genuinely crashes, the now-alone oldest downs itself too and the whole cluster stops. An external mechanism (Windows Service restart, watchdog) must then restart the nodes to reform the cluster.
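The per-partition decision can be sketched as a pure function. This is an illustrative Python model, not the Akka.NET SBR implementation; member seniority is reduced to a join-order number, lower meaning older:

```python
def keep_oldest_decision(all_members, reachable, down_if_alone=True):
    """One side's survival decision under a keep-oldest strategy.

    all_members: join-order numbers of every known cluster member.
    reachable:   the subset this side can still see, including itself.
    """
    oldest = min(all_members)
    unreachable = set(all_members) - set(reachable)
    if oldest in reachable:
        # This side holds the oldest member.
        if len(reachable) == 1 and down_if_alone:
            return "down-self"   # oldest is alone: assume it is the isolated one
        return "survive"
    # The oldest is on the other side of the partition.
    if len(unreachable) == 1 and down_if_alone:
        return "survive"         # oldest is alone over there and downs itself
    return "down-self"

# 2-node partition: nodeA has join order 1 (oldest), nodeB has join order 2,
# and each can see only itself.
assert keep_oldest_decision({1, 2}, {1}) == "down-self"  # A: oldest but alone
assert keep_oldest_decision({1, 2}, {2}) == "survive"    # B: takes over
```

Note that both sides reach consistent conclusions from their local view alone, which is exactly what makes the strategy safe without a quorum.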

Alternative: Lease-based SBR (see 15-Coordination.md)

If you have access to a shared resource (file share, database), a lease-based approach is more robust for 2-node clusters. Only the node that holds the lease survives a partition; the other downs itself. Because the decision is arbitrated by an external resource rather than by each node's local view, it avoids the scenario where the cluster ends with no surviving node.
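A minimal sketch of the lease idea, in illustrative Python rather than Akka.NET. The path and TTL are hypothetical, and a plain read-then-write is not atomic, so a production lease needs a compare-and-set primitive on the shared store (which is what pluggable lease implementations provide):

```python
import time

def try_acquire_lease(path: str, node_id: str, ttl_seconds: float = 30.0) -> bool:
    """Try to take (or refresh) a lease recorded in a file on a shared resource.

    The holder refreshes the lease periodically; a lease older than the
    TTL is considered expired and may be taken over.
    WARNING: read-then-write is racy; a real lease needs atomic
    compare-and-set on the shared store.
    """
    now = time.time()
    try:
        with open(path) as f:
            holder, stamp = f.read().split()
        if holder != node_id and now - float(stamp) < ttl_seconds:
            return False                      # another node holds a live lease
    except (FileNotFoundError, ValueError):
        pass                                  # no lease yet, or corrupt record
    with open(path, "w") as f:                # take or refresh the lease
        f.write(f"{node_id} {now}")
    return True
```

During a partition, each node keeps trying to refresh; the one that cannot reach the shared resource (or finds the other's lease still live) downs itself.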

Failure Detection Tuning

Cluster uses an accrual failure detector on top of Remoting heartbeats. For a SCADA system on a local network:

akka.cluster {
  failure-detector {
    heartbeat-interval = 1s
    threshold = 8.0
    acceptable-heartbeat-pause = 10s
    min-std-deviation = 100ms
  }
}

These settings provide failover detection within ~10-15 seconds. Do not make acceptable-heartbeat-pause shorter than 5s — .NET garbage collection can cause pauses that trigger false positives.
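The accrual detector's suspicion level (phi) makes these numbers easier to reason about. Below is an illustrative Python model assuming normally distributed heartbeat intervals; it is not Akka.NET's exact implementation, which also folds acceptable-heartbeat-pause into the expected interval (that is why detection takes on the order of 10-15 s rather than a couple of heartbeats):

```python
import math

def phi(time_since_last_ms: float, mean_ms: float, std_dev_ms: float) -> float:
    """Simplified phi-accrual suspicion level.

    phi = -log10(P(the heartbeat is merely late by this much)),
    assuming normally distributed heartbeat intervals.
    """
    y = (time_since_last_ms - mean_ms) / std_dev_ms
    # Probability that a heartbeat arrives even later than now:
    p_later = 1.0 - 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))
    return -math.log10(max(p_later, 1e-300))

# With 1s heartbeats and 100ms spread, suspicion crosses threshold = 8
# only after sustained silence, not after a single late heartbeat.
assert phi(1100, 1000, 100) < 8   # 100 ms late: barely suspicious
assert phi(1800, 1000, 100) > 8   # 800 ms of extra silence: marked unreachable
```

Raising the threshold or the acceptable pause therefore trades detection speed for resistance to GC pauses and transient network jitter.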

Cluster Event Subscription

The application should subscribe to cluster membership events to react to failover:

Cluster.Get(Context.System).Subscribe(Self, typeof(ClusterEvent.MemberUp),
    typeof(ClusterEvent.MemberRemoved),
    typeof(ClusterEvent.UnreachableMember),
    typeof(ClusterEvent.LeaderChanged));
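As a toy model of how these events drive failover, the sketch below (illustrative Python, not Akka.NET; node names and join-order numbers are hypothetical) tracks membership and derives the active node the same way Cluster Singleton placement does, by picking the oldest member:

```python
class MembershipTracker:
    """Reacts to member-up/removed events and derives the active node."""

    def __init__(self):
        self.members = {}                 # node name -> join order (lower = older)

    def member_up(self, name: str, join_order: int) -> None:
        self.members[name] = join_order

    def member_removed(self, name: str) -> None:
        self.members.pop(name, None)

    def active_node(self):
        # Mirrors Cluster Singleton placement: the oldest member hosts it.
        return min(self.members, key=self.members.get) if self.members else None

t = MembershipTracker()
t.member_up("nodeA", 1)
t.member_up("nodeB", 2)
assert t.active_node() == "nodeA"
t.member_removed("nodeA")                 # nodeA downed: failover
assert t.active_node() == "nodeB"
```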

Common Patterns

Graceful Shutdown

When performing planned maintenance on a node, leave the cluster gracefully before shutting down. This triggers an orderly handoff rather than a failure-detected removal:

await CoordinatedShutdown.Get(system).Run(CoordinatedShutdown.ClrExitReason.Instance);

Configure CoordinatedShutdown to leave the cluster:

akka.coordinated-shutdown.run-by-clr-shutdown-hook = on
akka.cluster.run-coordinated-shutdown-when-down = on

Health Check Actor

Create an actor that monitors cluster state and exposes it via a local HTTP endpoint (or Windows Event Log) for operations teams:

public class ClusterHealthActor : ReceiveActor
{
    public ClusterHealthActor()
    {
        var cluster = Cluster.Get(Context.System);
        // UnreachableMember is a reachability event, not an IMemberEvent,
        // so it must be subscribed to separately.
        cluster.Subscribe(Self, typeof(ClusterEvent.IMemberEvent),
            typeof(ClusterEvent.UnreachableMember));

        // The first message after subscribing is a snapshot of current state.
        Receive<ClusterEvent.CurrentClusterState>(state => LogInitialState(state));
        Receive<ClusterEvent.MemberUp>(msg => LogMemberUp(msg));
        Receive<ClusterEvent.MemberRemoved>(msg => LogMemberRemoved(msg));
        Receive<ClusterEvent.UnreachableMember>(msg => AlertUnreachable(msg));
    }
}

Auto-Restart with Windows Service

Wrap the Akka.NET application as a Windows Service. If the cluster ever downs itself completely (for example, the oldest node downing itself after finding itself alone with down-if-alone = on), the Windows Service restart mechanism will bring the nodes back and they'll reform the cluster.

Anti-Patterns

Ignoring SBR Configuration

Running a 2-node cluster without a Split Brain Resolver is dangerous. If a network partition occurs, both nodes will remain "up" independently, each believing it is the sole cluster member. Both will run the Singleton, and both will issue commands to equipment — violating the "no duplicate commands" requirement.

Over-Tuning Failure Detection

Making the failure detector extremely aggressive (e.g., acceptable-heartbeat-pause = 2s) causes flapping — nodes repeatedly marking each other as unreachable and then reachable. Each transition triggers Singleton handoff, which means device actors restart, connections drop, and commands may be lost.

Manual Downing

Never rely on manual downing (an operator clicking a button to remove a node) in a SCADA system. The failover must be automatic. Always configure SBR.

Multiple ActorSystems per Process

Do not create multiple ActorSystem instances in the same process. Each SCADA node should have exactly one ActorSystem that participates in exactly one cluster.

Configuration Guidance

Complete Cluster Configuration Block

akka.cluster {
  roles = ["scada-node"]
  seed-nodes = [
    "akka.tcp://scada-system@nodeA.scada.local:4053",
    "akka.tcp://scada-system@nodeB.scada.local:4053"
  ]

  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-oldest
    keep-oldest {
      down-if-alone = on
    }
    stable-after = 15s
  }

  failure-detector {
    heartbeat-interval = 1s
    threshold = 8.0
    acceptable-heartbeat-pause = 10s
  }

  min-nr-of-members = 1  # Allow single-node operation (after failover, only 1 node is up)

  run-coordinated-shutdown-when-down = on
}

akka.coordinated-shutdown {
  run-by-clr-shutdown-hook = on
  exit-clr = on
}

Important: min-nr-of-members = 1

Set this to 1, not 2. After a failover, only 1 node is running. If set to 2, the surviving node will wait for a second member before the Singleton starts — blocking all device communication.

References