scadalink-design/AkkaDotNet/15-Coordination.md
Joseph Doherty de636b908b Add Akka.NET reference documentation
Notes and documentation covering actors, remoting, clustering, persistence,
streams, serialization, hosting, testing, and best practices for the Akka.NET
framework used throughout the ScadaLink system.
2026-03-16 09:08:17 -04:00

15 — Coordination (Akka.Coordination)

Overview

Akka.Coordination provides lease-based distributed locking primitives. A lease is a time-bounded lock that a node acquires from an external store. The Split Brain Resolver, Cluster Sharding, and Cluster Singleton can each be configured to use a lease, which prevents split-brain scenarios by ensuring that only the node holding the lease can act as the leader or singleton host.

In the SCADA system's 2-node topology, lease-based coordination addresses the fundamental challenge of 2-node split-brain resolution: without a third-party arbiter, two partitioned nodes cannot determine which should survive.

When to Use

  • If the keep-oldest SBR strategy with down-if-alone = on (see 03-Cluster.md) is insufficient — specifically, if the scenario where both nodes down themselves during a partition is unacceptable
  • When a shared resource (file share, database) is available to serve as the lease store
  • For stronger singleton guarantees — ensuring only the lease holder can run the Device Manager singleton

When Not to Use

  • If the site has no shared resource accessible by both nodes — a lease requires an external store
  • If the keep-oldest strategy with automatic Windows Service restart is acceptable for your availability requirements
  • If the added dependency on the lease store introduces more risk than it mitigates (lease store becomes a single point of failure)

Design Decisions for the SCADA System

Lease Store Options

For Windows Server deployments without cloud services:

Option A: SMB File Share Lease

If the site has a shared filesystem (NAS, Windows file server), a file-based lease can work. However, Akka.NET does not ship a file-based lease implementation — a custom one would need to be built.

Option B: SQL Server Lease (Where Available)

For sites with SQL Server, use a database-backed lease. This provides strong consistency guarantees:

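// Custom lease implementation using SQL Server (skeleton: SQL access not shown)
using System;
using System.Threading.Tasks;
using Akka.Coordination;
public class SqlServerLease : Lease
{
    public SqlServerLease(LeaseSettings settings) : base(settings) { }
    // Acquire: INSERT the lease row with optimistic concurrency
    // Heartbeat: UPDATE timestamp periodically while the lease is held
    public override Task<bool> Acquire() => throw new NotImplementedException();
    public override Task<bool> Acquire(Action<Exception> leaseLostCallback) => throw new NotImplementedException();
    // Release: DELETE the lease row
    public override Task<bool> Release() => throw new NotImplementedException();
    // Check: SELECT where timestamp is recent
    public override bool CheckLease() => throw new NotImplementedException();
}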
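The four operations named in the comments above (acquire, heartbeat, release, check) can be illustrated without a database. The following is a minimal sketch, assuming a hypothetical `InMemoryLeaseStore` that stands in for the SQL lease table. The class and method names are illustrative and are not part of Akka.NET, and the current time is passed in explicitly so the behaviour is deterministic.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for the lease table; not an Akka.NET type.
public sealed class InMemoryLeaseStore
{
    private readonly Dictionary<string, (string Owner, DateTime LastHeartbeat)> _rows = new();
    private readonly TimeSpan _heartbeatTimeout;

    public InMemoryLeaseStore(TimeSpan heartbeatTimeout) => _heartbeatTimeout = heartbeatTimeout;

    // Acquire: succeeds if no row exists, the row has expired, or the caller already owns it.
    public bool TryAcquire(string lease, string owner, DateTime now)
    {
        if (_rows.TryGetValue(lease, out var row)
            && row.Owner != owner
            && now - row.LastHeartbeat < _heartbeatTimeout)
        {
            return false; // another owner holds a live lease
        }
        _rows[lease] = (owner, now);
        return true;
    }

    // Heartbeat: refreshes the timestamp, but only for the current, unexpired owner.
    public bool TryHeartbeat(string lease, string owner, DateTime now)
    {
        if (!IsHeldBy(lease, owner, now)) return false;
        _rows[lease] = (owner, now);
        return true;
    }

    // Release: removes the row if the caller owns it.
    public bool Release(string lease, string owner)
    {
        if (_rows.TryGetValue(lease, out var row) && row.Owner == owner)
            return _rows.Remove(lease);
        return false;
    }

    // Check: true while the caller's last heartbeat is within the timeout.
    public bool IsHeldBy(string lease, string owner, DateTime now) =>
        _rows.TryGetValue(lease, out var row)
        && row.Owner == owner
        && now - row.LastHeartbeat < _heartbeatTimeout;
}
```

In a real SQL Server implementation, the read-then-write inside `TryAcquire` would need to be a single atomic statement or transaction. Otherwise two nodes could both observe an expired row and both believe they acquired the lease.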

Option C: Azure Blob Storage Lease (Akka.Coordination.Azure)

If the site has Azure connectivity, Akka.Coordination.Azure provides a production-ready lease implementation using Azure Blob Storage:

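// Configure the Azure lease and use it for SBR (sketch: verify type and method names
// against the installed Akka.Cluster.Hosting and Akka.Coordination.Azure versions)
akkaBuilder
    .WithClustering(new ClusterOptions
    {
        SplitBrainResolver = new LeaseMajorityOption { LeaseImplementation = new AzureLeaseOption() }
    })
    .WithAzureLease(connectionString); // connectionString: the site's Azure Storage connection string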

Current Recommendation:

For most on-premise SCADA sites without cloud access, use the keep-oldest SBR strategy without a lease, and rely on Windows Service auto-restart to recover from the "both nodes downed" scenario. Recovery takes longer (~12 minutes for service restart plus cluster reformation), but this approach avoids the lease store dependency.

If faster recovery or stronger guarantees are needed, implement a SQL Server-backed lease for sites that have SQL Server.

Lease-Based SBR Configuration

If a lease is available:

akka.cluster {
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = lease-majority
    lease-majority {
      lease-implementation = "custom-sql-lease"
      acquire-lease-delay-for-minority = 5s
    }
  }
}

Lease for Cluster Singleton

The Singleton can be configured to require a lease before starting. This provides an additional guarantee against duplicate singletons:

akka.cluster.singleton {
  use-lease = "custom-sql-lease"
  lease-retry-interval = 5s
}

Common Patterns

Lease Heartbeat

Leases are time-bounded. The holder must periodically renew (heartbeat) the lease. If the holder crashes without releasing, the lease expires after the heartbeat timeout, allowing another node to acquire it:

Heartbeat interval: 12s (default)
Heartbeat timeout: 120s (default)

This means after a node crash, it takes up to 120 seconds for the lease to expire and the standby to acquire it. Reduce the timeout for faster failover, but be cautious of network glitches causing false lease expiry:

custom-sql-lease {
  heartbeat-timeout = 30s
  heartbeat-interval = 5s
  lease-operation-timeout = 5s
}
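The effect of these values on failover time can be estimated with simple arithmetic. The sketch below is a back-of-envelope bound and does not describe Akka.NET internals. It assumes the holder crashes immediately after a successful heartbeat, the standby's retry lands just before expiry, and the final acquire takes the full operation timeout. Cluster failure detection and singleton hand-over time are ignored. `LeaseFailoverEstimate` is a hypothetical helper name.

```csharp
using System;

// Back-of-envelope helper; hypothetical, not part of Akka.NET.
public static class LeaseFailoverEstimate
{
    // Rough upper bound: the lease must expire, the standby must retry, and the acquire must complete.
    public static TimeSpan WorstCase(TimeSpan heartbeatTimeout, TimeSpan retryInterval, TimeSpan operationTimeout)
        => heartbeatTimeout + retryInterval + operationTimeout;
}
```

With the values above (heartbeat-timeout = 30s, a 5s retry interval, lease-operation-timeout = 5s) the estimate is 40 seconds. Keeping the 120s default timeout with the same retry and operation values gives 130 seconds.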

Lease + SBR Interaction

When using lease-based SBR in a 2-node cluster:

  1. Both nodes attempt to acquire the lease on startup
  2. Only one succeeds; that node becomes the lease holder
  3. During a partition, both nodes attempt to renew/acquire the lease
  4. The node that holds the lease survives; the other downs itself
  5. On the surviving node, the Singleton continues running
  6. No "both nodes down" scenario occurs (as long as the lease store is reachable)
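The outcome of the steps above depends on which node holds the lease and which nodes can still reach the lease store. The following is a simplified model of that decision table. It is not Akka.NET's SBR code, and `LeasePartitionModel` and `NodeDecision` are hypothetical names. It assumes the current holder keeps the lease whenever it can still reach the store, and that the other node can take over only after the holder has lost contact and the lease has expired.

```csharp
using System;

public enum NodeDecision { Survive, DownSelf }

// Hypothetical model of the lease-based decision during a 2-node partition.
public static class LeasePartitionModel
{
    // aIsHolder: true if node A held the lease when the partition began, false if node B did.
    // aReaches / bReaches: whether each node can still reach the lease store.
    public static (NodeDecision A, NodeDecision B) Resolve(bool aIsHolder, bool aReaches, bool bReaches)
    {
        // The holder renews if it can reach the store; otherwise the other node
        // may acquire the lease once it expires, provided it can reach the store.
        bool aWins = aIsHolder ? aReaches : (aReaches && !bReaches);
        bool bWins = !aWins && bReaches;
        return (aWins ? NodeDecision.Survive : NodeDecision.DownSelf,
                bWins ? NodeDecision.Survive : NodeDecision.DownSelf);
    }
}
```

At most one node survives in every case. If neither node can reach the store, both down themselves, which is the caveat in step 6.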

Anti-Patterns

Lease Store on One of the Two SCADA Nodes

Never host the lease store on one of the SCADA nodes themselves. If that node goes down, the lease store goes with it, and the surviving node cannot acquire the lease. The lease store must be on an independent resource.

Very Short Lease Timeouts

Setting heartbeat-timeout below 10 seconds risks false lease expiry during garbage collection pauses, network blips, or high CPU load. This would cause the singleton to stop unnecessarily.
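A pre-flight check at startup can catch this kind of misconfiguration before it reaches production. The following is a minimal sketch, assuming a hypothetical `LeaseTimingCheck` helper that is not part of Akka.NET. The 10-second floor comes from the guidance above. The rule that the interval should be at most half the timeout is an assumed rule of thumb that leaves room for one missed renewal, and it should be checked against the lease implementation actually in use.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pre-flight check for lease timing values; not part of Akka.NET.
public static class LeaseTimingCheck
{
    public static IReadOnlyList<string> Validate(TimeSpan heartbeatTimeout, TimeSpan heartbeatInterval)
    {
        var problems = new List<string>();
        if (heartbeatTimeout < TimeSpan.FromSeconds(10))
            problems.Add("heartbeat-timeout below 10s risks false lease expiry");
        // Assumed rule of thumb: allow at least two renewal attempts per timeout window.
        if (heartbeatInterval.Ticks * 2 > heartbeatTimeout.Ticks)
            problems.Add("heartbeat-interval should be at most half of heartbeat-timeout");
        return problems;
    }
}
```

Under these rules, both the defaults (120s timeout, 12s interval) and the tuned values in this document (30s timeout, 5s interval) pass without warnings.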

Assuming the Lease Prevents All Split-Brain

The lease only works if both nodes can reach the lease store. If the lease store itself is partitioned from one node, that node cannot acquire the lease and will down itself — even if it's otherwise healthy. Consider lease store availability as part of the system's overall availability design.

Configuration Guidance

Without Lease (Current Default)

No coordination configuration needed. Use keep-oldest SBR as described in 03-Cluster.md.

With SQL Server Lease (Future Enhancement)

custom-sql-lease {
  lease-class = "ScadaSystem.SqlServerLease, ScadaSystem"
  heartbeat-timeout = 30s
  heartbeat-interval = 5s
  lease-operation-timeout = 5s
}

akka.cluster.split-brain-resolver {
  active-strategy = lease-majority
  lease-majority {
    lease-implementation = "custom-sql-lease"
  }
}

References