# 03 — Cluster (Akka.Cluster)

## Overview

Akka.Cluster organizes our two SCADA nodes into a managed membership group with failure detection, role assignment, and coordinated lifecycle management. It is the backbone that makes the active/cold-standby failover model work — Cluster determines which node is "oldest" (and therefore active), detects when a node has failed, and triggers the standby to take over.

The 2-node cluster topology is the most architecturally challenging configuration in Akka.Cluster because traditional quorum-based decisions require a majority, and with 2 nodes there is no majority when one node is down.

## When to Use

- Always — the 2-node failover pair requires Cluster for membership management, failure detection, and coordinated role transitions
- Cluster membership events drive the entire failover lifecycle

## When Not to Use

- Do not use Cluster for communication with equipment — that's the device actor layer
- Do not add more than 2 nodes to a site cluster without revisiting the Split Brain Resolver configuration

## Design Decisions for the SCADA System

### Cluster Roles

Assign a single role to both nodes: `scada-node`. Role differentiation between active and standby is handled by Cluster Singleton (the oldest node becomes the singleton host), not by role assignment.

```hocon
akka.cluster {
  roles = ["scada-node"]
}
```

### Seed Nodes

With exactly 2 nodes and static IPs/hostnames, use static seed nodes. Both nodes list both addresses as seeds:

```hocon
akka.cluster {
  seed-nodes = [
    "akka.tcp://scada-system@nodeA.scada.local:4053",
    "akka.tcp://scada-system@nodeB.scada.local:4053"
  ]
}
```

List the seeds in the same order on both nodes. The first seed node has a special bootstrap role — it is the only node allowed to join itself to form a new cluster — but seed order does not by itself determine "oldest" status. Oldest status (which selects the Singleton host) is decided by join order: the node that joins the cluster first becomes oldest. The cluster leader is assigned separately and deterministically by Akka and need not coincide with the oldest member.
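
Conceptually, the singleton host is simply the member with the smallest join-order "up number". The sketch below models that selection over plain data (hypothetical addresses; this illustrates the semantics only and is not the Akka.Cluster API):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative model only: Akka.Cluster assigns each member a monotonically
// increasing "up number" as it joins; the member with the smallest up number
// is the oldest and therefore hosts the Cluster Singleton.
var members = new List<(string Address, int UpNumber)>
{
    ("akka.tcp://scada-system@nodeB.scada.local:4053", 2), // joined second
    ("akka.tcp://scada-system@nodeA.scada.local:4053", 1), // joined first
};

var oldest = members.OrderBy(m => m.UpNumber).First().Address;
Console.WriteLine(oldest); // nodeA joined first, so it is the oldest
```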
### Split Brain Resolver (SBR) — The Critical Decision

With only 2 nodes, quorum-based SBR strategies (majority, static-quorum) do not work because losing 1 of 2 nodes means no majority exists. The options for 2-node clusters:

**Recommended: `keep-oldest`**

The oldest node (the one that has been in the cluster longest) survives a partition. The younger node downs itself. When the failed/partitioned node comes back, it rejoins as the younger node.

```hocon
akka.cluster {
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-oldest
    keep-oldest {
      down-if-alone = on   # if the oldest is truly alone (no other node visible), it downs itself too
    }
    stable-after = 15s     # wait for the cluster view to stabilize before making decisions
  }
}
```

**Why `down-if-alone = on`:** In a 2-node cluster, an oldest node that can see no one cannot tell whether its peer crashed or whether it is itself the network-isolated side. With `down-if-alone = on`, the oldest downs itself in that situation instead of continuing to run the Singleton while cut off; the younger node, unable to see the oldest, also downs itself as `keep-oldest` requires. A full partition therefore takes both nodes down, and an external mechanism (Windows Service restart, watchdog) must restart them to reform the cluster — a deliberate trade of availability for safety against an isolated node issuing commands.
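
The survival rule can be summarized as a small decision table. This is an illustrative model of the behaviour described above, not the actual SBR implementation:

```csharp
using System;

// Illustrative model of keep-oldest survival in a 2-node cluster
// (not the actual Akka.Cluster SBR code).
static bool SurvivesPartition(bool isOldest, bool seesOtherNode, bool downIfAlone)
{
    if (seesOtherNode) return true;   // both nodes visible: no downing needed
    if (!isOldest) return false;      // younger side always downs itself
    return !downIfAlone;              // oldest alone survives only if down-if-alone = off
}

// With down-if-alone = on, a full partition downs BOTH nodes:
Console.WriteLine(SurvivesPartition(isOldest: true,  seesOtherNode: false, downIfAlone: true));  // False
Console.WriteLine(SurvivesPartition(isOldest: false, seesOtherNode: false, downIfAlone: true));  // False
```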

**Alternative: Lease-based SBR** (see [15-Coordination.md](./15-Coordination.md))

If you have access to a shared resource (file share, database), a lease-based approach is more robust for 2-node clusters. Only the node that holds the lease survives. This avoids the "both nodes down themselves" scenario.
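
A hedged sketch of what that configuration might look like — the `lease-majority` strategy and `lease-implementation` key come from the SBR documentation, but the provider path shown is an assumption; verify both against the lease provider chosen in 15-Coordination.md:

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = lease-majority
  lease-majority {
    # config path of the lease provider; provider-specific (illustrative value)
    lease-implementation = "akka.coordination.lease.kubernetes"
  }
}
```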
### Failure Detection Tuning

Cluster uses an accrual failure detector on top of Remoting heartbeats. For a SCADA system on a local network:

```hocon
akka.cluster {
  failure-detector {
    heartbeat-interval = 1s
    threshold = 8.0
    acceptable-heartbeat-pause = 10s
    min-std-deviation = 100ms
  }
}
```

These settings provide failover detection within ~10–15 seconds. Do not make `acceptable-heartbeat-pause` shorter than 5s — .NET garbage collection can cause pauses that trigger false positives.
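
As a rough back-of-envelope (an approximation, not a guarantee — actual timing depends on the accrual calculation and gossip dissemination): a node is suspected unreachable only after heartbeats have been missing for about `acceptable-heartbeat-pause` plus a few `heartbeat-interval`s, and with SBR the downing decision waits a further `stable-after` on top of that:

```csharp
using System;

// Rough failover-timing estimate from the settings above (approximation only)
double heartbeatInterval = 1.0;          // s
double acceptableHeartbeatPause = 10.0;  // s
double stableAfter = 15.0;               // s (SBR)

// unreachability is suspected after roughly the pause plus a couple of heartbeats
double detection = acceptableHeartbeatPause + 2 * heartbeatInterval;
// SBR only downs once the cluster view has been stable for stable-after
double downingDecision = detection + stableAfter;

Console.WriteLine($"detection ~{detection}s, downing decision ~{downingDecision}s");
```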
### Cluster Event Subscription

The application should subscribe to cluster membership events to react to failover:

```csharp
// Inside an actor — Context and Self are only available there:
Cluster.Get(Context.System).Subscribe(Self,
    typeof(ClusterEvent.MemberUp),
    typeof(ClusterEvent.MemberRemoved),
    typeof(ClusterEvent.UnreachableMember),
    typeof(ClusterEvent.LeaderChanged));
```

## Common Patterns

### Graceful Shutdown

When performing planned maintenance on a node, leave the cluster gracefully before shutting down. This triggers an orderly handoff rather than a failure-detected removal:

```csharp
await CoordinatedShutdown.Get(system).Run(CoordinatedShutdown.ClrExitReason.Instance);
```

Configure CoordinatedShutdown to leave the cluster:

```hocon
akka.coordinated-shutdown.run-by-clr-shutdown-hook = on
akka.cluster.run-coordinated-shutdown-when-down = on
```

### Health Check Actor

Create an actor that monitors cluster state and exposes it via a local HTTP endpoint (or Windows Event Log) for operations teams:

```csharp
public class ClusterHealthActor : ReceiveActor
{
    private readonly ILoggingAdapter _log = Context.GetLogger();

    public ClusterHealthActor()
    {
        var cluster = Cluster.Get(Context.System);
        // UnreachableMember is a reachability event, not an IMemberEvent,
        // so it must be subscribed to explicitly.
        cluster.Subscribe(Self,
            typeof(ClusterEvent.IMemberEvent),
            typeof(ClusterEvent.UnreachableMember));

        Receive<ClusterEvent.MemberUp>(msg => _log.Info("Member up: {0}", msg.Member));
        Receive<ClusterEvent.MemberRemoved>(msg => _log.Warning("Member removed: {0}", msg.Member));
        Receive<ClusterEvent.UnreachableMember>(msg => _log.Error("Member unreachable: {0}", msg.Member));
    }
}
```

### Auto-Restart with Windows Service

Wrap the Akka.NET application as a Windows Service. If both nodes down themselves during a partition (due to `down-if-alone = on`), the Windows Service restart mechanism will bring them back and they'll reform the cluster.

## Anti-Patterns

### Ignoring SBR Configuration

Running a 2-node cluster without a Split Brain Resolver is dangerous. If a network partition occurs, both nodes will remain "up" independently, each believing it is the sole cluster member. Both will run the Singleton, and both will issue commands to equipment — violating the "no duplicate commands" requirement.

### Over-Tuning Failure Detection

Making the failure detector extremely aggressive (e.g., `acceptable-heartbeat-pause = 2s`) causes flapping — nodes repeatedly marking each other as unreachable and then reachable. Each transition triggers Singleton handoff, which means device actors restart, connections drop, and commands may be lost.

### Manual Downing

Never rely on manual downing (an operator clicking a button to remove a node) in a SCADA system. The failover must be automatic. Always configure SBR.

### Multiple ActorSystems per Process

Do not create multiple `ActorSystem` instances in the same process. Each SCADA node should have exactly one `ActorSystem` that participates in exactly one cluster.

## Configuration Guidance

### Complete Cluster Configuration Block

```hocon
akka.cluster {
  roles = ["scada-node"]
  seed-nodes = [
    "akka.tcp://scada-system@nodeA.scada.local:4053",
    "akka.tcp://scada-system@nodeB.scada.local:4053"
  ]

  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-oldest
    keep-oldest {
      down-if-alone = on
    }
    stable-after = 15s
  }

  failure-detector {
    heartbeat-interval = 1s
    threshold = 8.0
    acceptable-heartbeat-pause = 10s
  }

  min-nr-of-members = 1   # allow single-node operation (after failover, only 1 node is up)

  run-coordinated-shutdown-when-down = on
}

akka.coordinated-shutdown {
  run-by-clr-shutdown-hook = on
  exit-clr = on
}
```

### Important: `min-nr-of-members = 1`

Set this to 1, not 2. After a failover, only 1 node is running. If set to 2, the surviving node will wait for a second member before the Singleton starts — blocking all device communication.

## References

- Official Documentation: <https://getakka.net/articles/clustering/cluster-overview.html>
- Split Brain Resolver: <https://getakka.net/articles/clustering/split-brain-resolver.html>
- Configuration Reference: <https://getakka.net/articles/configuration/modules/akka.cluster.html>
- Cluster Membership Events: <https://getakka.net/articles/clustering/cluster-overview.html>