# 05 — Cluster Singleton (Akka.Cluster.Tools)

## Overview

Cluster Singleton ensures that exactly one instance of a particular actor runs across the entire cluster at any time. The singleton always runs on the oldest cluster member that has the configured role. If that node goes down or leaves, the singleton is started on the next-oldest surviving node. This is the primary mechanism for implementing the active/cold-standby model in our SCADA system.

The "Device Manager" — the top-level actor that owns all device actors and manages equipment communication — runs as a Cluster Singleton. The active node hosts the singleton; the standby node runs a Singleton Proxy that can route messages to the active node's singleton.

## When to Use

- The Device Manager actor that spawns and supervises all device actors — this is the primary use case
- Any component where exactly-one semantics are required: alarm aggregator, command queue processor, historian writer
- Any coordination point that must not have duplicates during normal operation

## When Not to Use

- Do not make every actor a singleton — only top-level coordinators that genuinely need cluster-wide uniqueness
- Do not use a singleton for high-throughput work that could benefit from parallelism
- Individual device actors should not be singletons; they are children of the singleton Device Manager

## Design Decisions for the SCADA System

### Singleton: Device Manager Actor

The Device Manager is the singleton. On startup, it reads the site's device configuration, creates one device actor per machine, and manages their lifecycle. When failover occurs, the singleton restarts on the standby node, creating new device actors that reconnect to equipment.

```csharp
// Singleton registration via Akka.Hosting
builder.WithSingleton<DeviceManagerActor>(
    singletonName: "device-manager",
    propsFactory: (system, registry, resolver) =>
        resolver.Props<DeviceManagerActor>(),
    options: new ClusterSingletonOptions
    {
        Role = "scada-node",
        // The hand-over retry interval used during graceful migration is set
        // in HOCON (akka.cluster.singleton.hand-over-retry-interval);
        // see Configuration Guidance.
    });
```

### Singleton Proxy on Both Nodes

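Both nodes should have a Singleton Proxy. Even on the active node (which hosts the actual singleton), the proxy provides a stable `IActorRef` that other actors can use to send messages. This decouples message senders from knowing which node currently hosts the singleton. In Akka.Hosting, `WithSingleton` also creates a proxy by default (`createProxyToo: true`) and registers it under the same key, so an explicit `WithSingletonProxy` call is needed only on nodes that do not call `WithSingleton`.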

```csharp
builder.WithSingletonProxy<DeviceManagerActor>(
    singletonName: "device-manager",
    options: new ClusterSingletonOptions  // Akka.Hosting uses the same options type for the proxy
    {
        Role = "scada-node",
        BufferSize = 1000,  // Buffer messages while singleton is being handed over
    });
```
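As a minimal sketch of how a sender can obtain that stable reference, the class below resolves the proxy through Akka.Hosting's `IRequiredActor<T>`. The `CommandDispatcher` class and its `Dispatch` method are hypothetical names used only for illustration; `DeviceManagerActor` is the key type used in the registration above.

```csharp
using Akka.Actor;
using Akka.Hosting;

// Hypothetical service (not part of the original design) that sends
// messages to the Device Manager without knowing which node hosts it.
public sealed class CommandDispatcher
{
    private readonly IActorRef _deviceManager;

    // IRequiredActor<T> is resolved from the DI container; the ActorRef it
    // returns is whatever was registered under the DeviceManagerActor key,
    // which in this setup is the singleton proxy.
    public CommandDispatcher(IRequiredActor<DeviceManagerActor> deviceManager)
    {
        _deviceManager = deviceManager.ActorRef;
    }

    // Fire-and-forget send; the proxy buffers the message if the singleton
    // is currently being handed over.
    public void Dispatch(object command) => _deviceManager.Tell(command);
}
```

Because the sender only ever holds the proxy reference, nothing in this class has to change when the singleton moves to the other node.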

### Singleton Lifecycle During Failover

When the active node goes down:

1. Cluster failure detector marks the node as unreachable (~10–15 seconds with our config)
2. SBR downs the unreachable node
3. Cluster Singleton notices the singleton host is gone
4. Singleton starts on the next oldest (surviving) node
5. The new Device Manager reads device config, replays the Persistence journal for in-flight commands, and creates device actors
6. Device actors connect to equipment and resume tag subscriptions

**Total failover time:** roughly 20–40 seconds, the sum of failure detection, SBR downing, singleton startup, and equipment reconnection.

### Buffering During Hand-Over

During singleton migration (whether from failure or graceful shutdown), messages sent to the Singleton Proxy are buffered. Configure `BufferSize` to handle the expected message volume during the hand-over window:

```csharp
new ClusterSingletonOptions
{
    BufferSize = 1000,  // Buffer up to 1000 messages during hand-over
}
```

If the buffer is full, the proxy drops the oldest buffered message each time a new one arrives. For a SCADA system, 1000 is usually sufficient: commands arrive at human-operator speed, and tag updates will be re-sent by equipment once the new singleton subscribes.

## Common Patterns

### Singleton with Persistence

The Device Manager singleton should use Akka.Persistence to journal in-flight commands. When the singleton restarts on the standby node, it replays the journal to identify commands that were sent but not yet acknowledged:

```csharp
public class DeviceManagerActor : ReceivePersistentActor
{
    public override string PersistenceId => "device-manager-singleton";

    // On recovery, rebuild the pending command queue
    // On command: persist the command event, then send to device
    // On ack: persist the ack event, remove from pending queue
}
```

See [08-Persistence.md](./08-Persistence.md) for full details.
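The skeleton above could be filled in roughly as follows. This is a sketch under assumptions, not the project's actual implementation: the message and event records (`SendCommand`, `CommandAcked`, `CommandSent`, `CommandCompleted`) are hypothetical, and device actors are assumed to be children named after their device ID.

```csharp
using System.Collections.Generic;
using Akka.Actor;
using Akka.Persistence;

// Hypothetical commands (incoming messages), for illustration only.
public sealed record SendCommand(string CommandId, string DeviceId, string Payload);
public sealed record CommandAcked(string CommandId);

// Hypothetical events (what is written to the journal).
public sealed record CommandSent(string CommandId, string DeviceId, string Payload);
public sealed record CommandCompleted(string CommandId);

public class DeviceManagerActor : ReceivePersistentActor
{
    // Commands that were sent to a device but not yet acknowledged.
    private readonly Dictionary<string, CommandSent> _pending = new();

    public override string PersistenceId => "device-manager-singleton";

    public DeviceManagerActor()
    {
        // Recovery: replay events to rebuild the pending command queue.
        Recover<CommandSent>(e => { _pending[e.CommandId] = e; });
        Recover<CommandCompleted>(e => { _pending.Remove(e.CommandId); });
        Recover<RecoveryCompleted>(_ =>
        {
            // _pending now holds the commands that were in flight at failover.
            // What to do with them (resend, flag for an operator) is a design
            // decision outside this sketch.
        });

        // Command: persist first, then forward to the device actor.
        Command<SendCommand>(cmd =>
        {
            var evt = new CommandSent(cmd.CommandId, cmd.DeviceId, cmd.Payload);
            Persist(evt, e =>
            {
                _pending[e.CommandId] = e;
                // Assumes a child actor named after the device ID; if it does
                // not exist, the message ends up in dead letters.
                Context.Child(e.DeviceId).Tell(cmd);
            });
        });

        // Ack: persist the completion, then drop the command from the queue.
        Command<CommandAcked>(ack =>
        {
            Persist(new CommandCompleted(ack.CommandId), e =>
            {
                _pending.Remove(e.CommandId);
            });
        });
    }
}
```

The handlers use block-bodied lambdas on purpose: `ReceivePersistentActor` also has `Func<T, bool>` overloads, and an expression lambda that returns `bool` (such as `Remove`) would bind to those and could mark the message as unhandled.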

### Graceful Hand-Over

When performing planned maintenance, use `CoordinatedShutdown` to trigger a graceful singleton hand-over. The singleton on the old node stops, the proxy buffers messages, and the singleton starts on the new node — minimizing the gap in equipment communication.
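A planned shutdown could be triggered along these lines. The `MaintenanceShutdown` helper is a hypothetical name; `CoordinatedShutdown.Get(...).Run(...)` and `ClusterLeavingReason` are part of Akka.NET.

```csharp
using System.Threading.Tasks;
using Akka.Actor;

// Hypothetical helper for planned maintenance on the active node.
public static class MaintenanceShutdown
{
    // Runs the coordinated shutdown phases: the node leaves the cluster,
    // the singleton is handed over, and the ActorSystem is then terminated.
    public static Task LeaveGracefullyAsync(ActorSystem system)
    {
        return CoordinatedShutdown
            .Get(system)
            .Run(CoordinatedShutdown.ClusterLeavingReason.Instance);
    }
}
```

Awaiting the returned task before stopping the process gives the hand-over time to complete instead of falling back to failure detection.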

### Singleton Health Monitoring

Create a periodic self-check inside the singleton that publishes its status. The standby node's monitoring actor can watch for these heartbeats to provide early warning of issues:

```csharp
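// Inside the singleton (e.g. in PreStart). Keep the ICancelable in a field
// and call _healthTimer.Cancel() in PostStop so the schedule is stopped
// explicitly when this instance stops.
_healthTimer = Context.System.Scheduler.ScheduleTellRepeatedlyCancelable(
    TimeSpan.FromSeconds(5),   // initial delay
    TimeSpan.FromSeconds(5),   // interval
    Self,
    new HealthCheck(),
    ActorRefs.NoSender);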
```
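One way the standby side could watch for these heartbeats is sketched below. It assumes the singleton publishes its status through Distributed Pub/Sub, which is only one possible transport. The topic name, the `SingletonStatus` record, the `SingletonHealthMonitor` actor, and the 15-second timeout are all hypothetical.

```csharp
using System;
using Akka.Actor;
using Akka.Cluster.Tools.PublishSubscribe;
using Akka.Event;

// Hypothetical status message published by the singleton on each HealthCheck.
public sealed record SingletonStatus(string Node, DateTime TimestampUtc);

// Hypothetical monitor that runs on every node and warns when heartbeats stop.
public class SingletonHealthMonitor : ReceiveActor
{
    public const string Topic = "device-manager-health";

    private readonly ILoggingAdapter _log = Context.GetLogger();

    public SingletonHealthMonitor()
    {
        Receive<SubscribeAck>(_ => { });  // subscription confirmed
        Receive<SingletonStatus>(s =>
            _log.Debug("Singleton heartbeat from {0} at {1}", s.Node, s.TimestampUtc));

        // Fires whenever no message has arrived for the configured timeout.
        Receive<ReceiveTimeout>(_ =>
            _log.Warning("No singleton heartbeat received within the timeout"));
    }

    protected override void PreStart()
    {
        DistributedPubSub.Get(Context.System).Mediator.Tell(new Subscribe(Topic, Self));
        // Three missed 5-second heartbeats; the value is an assumption.
        Context.SetReceiveTimeout(TimeSpan.FromSeconds(15));
    }
}

// In the singleton's HealthCheck handler, the matching publish would be:
//   DistributedPubSub.Get(Context.System).Mediator.Tell(
//       new Publish(SingletonHealthMonitor.Topic,
//                   new SingletonStatus(nodeName, DateTime.UtcNow)));
// where nodeName is whatever identifier the singleton uses for its host.
```

Using the receive timeout keeps the monitor stateless: any heartbeat resets the timer, and a warning is logged only after a full quiet period.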

## Anti-Patterns

### Putting All Logic in the Singleton

The singleton should be a thin coordination layer. It owns the device actor hierarchy but does not process tag updates, execute commands, or handle alarms directly. Those are delegated to child actors. If the singleton actor becomes a bottleneck, the entire system stalls.

### Not Handling Singleton Restart

When the singleton restarts on the standby node, all child actors (device actors) are created fresh. Any in-memory state from the previous instance is gone. If the system assumes device actors persist across failover, it will fail. Design for restart: use Persistence for critical state, re-read configuration, and re-establish equipment connections.
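Designing for restart mostly means that everything the singleton needs is rebuilt in its startup path. A minimal sketch, assuming a hypothetical `IDeviceConfigSource` and `DeviceActor` (neither is defined in this document), and shown as a plain `ReceiveActor` to keep the example short:

```csharp
using System.Collections.Generic;
using Akka.Actor;

// Hypothetical configuration types, for illustration only.
public sealed record DeviceConfig(string DeviceId, string Endpoint);

public interface IDeviceConfigSource
{
    IReadOnlyList<DeviceConfig> Load();
}

// Hypothetical device actor; a real one would open the equipment
// connection in PreStart and resubscribe to tags.
public class DeviceActor : ReceiveActor
{
    public DeviceActor(DeviceConfig config)
    {
    }
}

// Sketch of the startup path of the Device Manager.
public class DeviceSupervisorSketch : ReceiveActor
{
    private readonly IDeviceConfigSource _source;

    public DeviceSupervisorSketch(IDeviceConfigSource source)
    {
        _source = source;
    }

    // Runs on every start, including the start on the standby node after
    // failover, so no in-memory state from the previous instance is assumed.
    protected override void PreStart()
    {
        foreach (var cfg in _source.Load())
        {
            // DeviceId is assumed to be a valid, unique actor name.
            Context.ActorOf(Props.Create(() => new DeviceActor(cfg)), cfg.DeviceId);
        }
    }
}
```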

### Ignoring the Buffer Overflow Scenario

If the Singleton Proxy buffer fills up during a long failover (e.g., a network partition where SBR takes time to act), the oldest buffered messages are dropped without any notification to the sender. For commands, this means lost instructions. Mitigate by persisting commands before sending them through the proxy and resending any that are not acknowledged.
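One possible shape for this mitigation is Akka.Persistence's at-least-once delivery support. The sketch below assumes the Device Manager replies with a `CommandConfirmed` message carrying the delivery ID; that reply, the other record types, and the `ReliableCommandSender` actor are hypothetical.

```csharp
using Akka.Actor;
using Akka.Persistence;

// Hypothetical protocol types, for illustration only.
public sealed record OperatorCommand(string DeviceId, string Payload);
public sealed record CommandEnvelope(long DeliveryId, OperatorCommand Command);
public sealed record CommandConfirmed(long DeliveryId);

// Hypothetical events journaled by the sender.
public sealed record CommandQueued(OperatorCommand Command);
public sealed record CommandDelivered(long DeliveryId);

public class ReliableCommandSender : AtLeastOnceDeliveryReceiveActor
{
    private readonly ActorPath _proxyPath;

    public ReliableCommandSender(IActorRef deviceManagerProxy)
    {
        _proxyPath = deviceManagerProxy.Path;

        // Persist the command before it is sent through the proxy.
        Command<OperatorCommand>(cmd =>
        {
            Persist(new CommandQueued(cmd), e => OnQueued(e));
        });

        // Persist the confirmation so the command is not resent after a restart.
        Command<CommandConfirmed>(ack =>
        {
            Persist(new CommandDelivered(ack.DeliveryId), e => OnDelivered(e));
        });

        // Replaying the same events on recovery restores unconfirmed deliveries.
        Recover<CommandQueued>(e => { OnQueued(e); });
        Recover<CommandDelivered>(e => { OnDelivered(e); });
    }

    public override string PersistenceId => "reliable-command-sender";

    // Deliver keeps resending until ConfirmDelivery is called for this ID.
    private void OnQueued(CommandQueued e)
    {
        Deliver(_proxyPath, deliveryId => new CommandEnvelope(deliveryId, e.Command));
    }

    private void OnDelivered(CommandDelivered e)
    {
        ConfirmDelivery(e.DeliveryId);
    }
}
```

With this approach a command dropped by a full proxy buffer is resent after the redelivery interval, so the receiver has to tolerate duplicates, for example by tracking command IDs it has already processed.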

## Configuration Guidance

```hocon
akka.cluster.singleton {
  # Actor name of the singleton child created by the manager
  singleton-name = "device-manager"
  # Only nodes with this role may host the singleton
  role = "scada-node"

  # Graceful hand-over: how often the new oldest node retries the
  # hand-over request to the previous oldest node
  hand-over-retry-interval = 5s
  # Lower bound on the number of hand-over retries
  min-number-of-hand-over-retries = 15
}

akka.cluster.singleton-proxy {
  singleton-name = "device-manager"
  role = "scada-node"
  buffer-size = 1000
  singleton-identification-interval = 1s
}
```

### Important: Single-Node Operation

After failover, only one node is running. The singleton must be able to start with just one cluster member. Ensure `akka.cluster.min-nr-of-members = 1` (set in [03-Cluster.md](./03-Cluster.md)).

## References

- Official Documentation: <https://getakka.net/articles/clustering/cluster-singleton.html>
- Cluster Tools Configuration: <https://getakka.net/articles/configuration/modules/akka.cluster.tools.html>