scadalink-design/AkkaDotNet/09-DistributedData.md
Joseph Doherty de636b908b Add Akka.NET reference documentation
Notes and documentation covering actors, remoting, clustering, persistence,
streams, serialization, hosting, testing, and best practices for the Akka.NET
framework used throughout the ScadaLink system.
2026-03-16 09:08:17 -04:00

# 09 — Distributed Data (Akka.DistributedData)
## Overview
Akka.DistributedData shares state between cluster nodes using Conflict-Free Replicated Data Types (CRDTs). Writes can happen on any node and are merged automatically without coordination, providing eventual consistency. Data is replicated in memory — there is no external database dependency.
In our SCADA system, Distributed Data provides a lightweight, zero-infrastructure mechanism for sharing state between the active and standby nodes. Unlike Persistence (which journals events to disk), Distributed Data keeps state in memory and gossips it between nodes. This makes it ideal for state that should be visible on both nodes but does not need to survive a full cluster restart.
## When to Use
- Sharing pending command summaries between active and standby (so the standby knows what to reconcile on failover)
- Replicating device connection status across nodes (so the standby's monitoring actor has live status)
- Sharing alarm acknowledgment state (so the standby knows which alarms have been acknowledged)
- Any state that should be available on both nodes without the overhead of disk persistence
## When Not to Use
- State that must survive a full cluster restart (both nodes down simultaneously) — use Persistence instead
- Large datasets (hundreds of MB) — Distributed Data holds everything in memory and gossips the full state
- Strongly consistent state where both nodes must agree before proceeding — CRDTs provide eventual consistency only
- High-frequency updates (thousands per second on the same key) — gossip merging has overhead
## Design Decisions for the SCADA System
### Data Model: What to Replicate
Keep Distributed Data focused on small, operationally critical state:
| Key | CRDT Type | Purpose |
|---|---|---|
| `pending-commands` | `LWWDictionary<string, PendingCommand>` | In-flight commands (mirrors Persistence journal for fast standby access) |
| `device-status` | `LWWDictionary<string, DeviceStatus>` | Connection status per device |
| `alarm-acks` | `ORDictionary<string, Flag>` | Which alarms have been acknowledged |
| `system-config` | `LWWRegister<SiteConfiguration>` | Current site config (device list, thresholds) |
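The values replicated under these keys are plain serializable types. As a minimal sketch, assuming hypothetical field sets (none of these names are taken from the actual codebase), the entry shapes might look like:

```csharp
using System;

// Hypothetical entry shapes for the replicated dictionaries above.
// Field names and types are illustrative assumptions, not the real model.
public record PendingCommand(string CommandId, string DeviceId, string Payload, DateTime IssuedUtc);

public record DeviceStatus(string DeviceId, bool Connected, DateTime LastSeenUtc);
```

Keeping these records small matters: every entry is gossiped in full between nodes.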
### CRDT Type Selection
- **LWWRegister** (Last-Writer-Wins Register): For single-value state where the most recent write wins. Good for device status, configuration.
- **LWWDictionary**: For dictionaries where entries are independently updated. Good for pending commands (keyed by command ID).
- **ORDictionary** (Observed-Remove Dictionary): For dictionaries where entries can be added and removed concurrently. Good for alarm acknowledgments.
- **GCounter / PNCounter**: For monotonic counters (e.g., total commands issued). Rarely needed in SCADA.
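To make the last-writer-wins semantics concrete, here is a minimal, Akka-free sketch of an LWW register merge (the real Akka.NET implementation also breaks timestamp ties deterministically by node address; this toy version compares timestamps only):

```csharp
using System;

// Toy LWW register: on merge, the value with the newer timestamp survives.
// Merging is commutative and idempotent, so replicas converge regardless
// of the order in which gossip delivers updates.
public readonly record struct LwwRegister<T>(T Value, long Timestamp)
{
    public LwwRegister<T> Merge(LwwRegister<T> other) =>
        other.Timestamp > Timestamp ? other : this;
}
```

Because both nodes apply the same merge rule, concurrent writes to the same register converge to a single value once gossip has run, which is exactly the behavior relied on for device status and configuration.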
### Write and Read Consistency Levels
For a 2-node cluster:
- **WriteTo(1)** (WriteLocal): Write to the local node only. Fastest, but the standby won't see the update until the next gossip round.
- **WriteTo(2)** (WriteMajority on 2 nodes = WriteAll): Write to both nodes before confirming. Ensures the standby has the data immediately, but times out and fails if the standby is down.
**Recommendation for SCADA:**
- **Pending commands:** `WriteTo(1)` with frequent gossip. Speed matters when issuing commands; the standby gets the update within 1-2 gossip intervals.
- **Device status:** `WriteTo(1)`. Status is informational; eventual consistency is fine.
- **Alarm acknowledgments:** `WriteTo(2)` if both nodes are up, fall back to `WriteTo(1)` on failure. Alarm state is safety-critical.
```csharp
var replicator = DistributedData.Get(Context.System).Replicator;

// Write a pending command with local consistency (fast)
var key = new LWWDictionaryKey<string, string>("pending-commands");
replicator.Tell(Dsl.Update(key, LWWDictionary<string, string>.Empty, WriteLocal.Instance,
    dict => dict.SetItem(Cluster.Get(Context.System), commandId, serializedCommand)));

// Read device status with local consistency (fast). The reply arrives
// asynchronously as a GetSuccess (or NotFound) message sent to Self.
var statusKey = new LWWDictionaryKey<string, string>("device-status");
replicator.Tell(Dsl.Get(statusKey, ReadLocal.Instance));
```
### Gossip Interval
With 2 nodes on a local network, gossip propagation is nearly instant. The default 2-second gossip interval means the standby sees updates within ~2 seconds. For command-critical state, this is acceptable because Persistence provides the durability guarantee; Distributed Data is a convenience for the standby.
## Common Patterns
### Subscribe to Data Changes
The standby node's monitoring actor can subscribe to Distributed Data changes to maintain a live view of the active node's state:
```csharp
public class StandbyMonitorActor : ReceiveActor
{
    public StandbyMonitorActor()
    {
        var replicator = DistributedData.Get(Context.System).Replicator;
        var statusKey = new LWWDictionaryKey<string, string>("device-status");
        replicator.Tell(Dsl.Subscribe(statusKey, Self));

        // In Akka.NET the change notification arrives as a Changed message
        // (in the Akka.DistributedData namespace)
        Receive<Changed>(changed =>
        {
            var status = changed.Get(statusKey);
            UpdateDashboard(status);
        });
    }
}
```
### Expiry / Cleanup Pattern
Distributed Data does not have built-in TTL. For entries like pending commands that should be cleaned up after acknowledgment, the active node must explicitly remove them:
```csharp
// After command is acknowledged, remove it from the replicated dictionary
replicator.Tell(Dsl.Update(key, LWWDictionary<string, string>.Empty, WriteLocal.Instance,
    dict => dict.Remove(Cluster.Get(Context.System), commandId)));
```
### Fallback When Standby Is Down
Always handle the case where the standby node is unreachable. Use `WriteTo(1)` as a fallback:
```csharp
var writeConsistency = clusterHasTwoMembers
    ? (IWriteConsistency)new WriteTo(2, TimeSpan.FromSeconds(3))
    : WriteLocal.Instance;

// If the Update later fails with UpdateTimeout, retry with WriteLocal
// so the active node is never blocked by an unreachable standby.
```
## Anti-Patterns
### Storing Large or High-Cardinality Data
Distributed Data replicates everything to all nodes. Storing 500 devices × 50 tags × full value history would consume significant memory and gossip bandwidth. Store only summaries and current state, not history.
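A back-of-envelope calculation illustrates the gap (the per-entry size is an assumption for illustration, not a measured figure):

```csharp
using System;

// Rough memory estimate for replicated state; 200 bytes per serialized
// entry is an assumed average.
const int Devices = 500;
const int TagsPerDevice = 50;
const int BytesPerEntry = 200;
const int SamplesPerDayAt1Hz = 86_400;

long currentValuesBytes = (long)Devices * TagsPerDevice * BytesPerEntry;
long oneDayHistoryBytes = currentValuesBytes * SamplesPerDayAt1Hz;

Console.WriteLine($"Current values only: {currentValuesBytes / 1_048_576.0:F1} MiB");
Console.WriteLine($"One day of 1 Hz history: {oneDayHistoryBytes / 1_073_741_824.0:F1} GiB");
```

Current values alone come to a few megabytes and gossip comfortably; a single day of 1 Hz history is already hundreds of gigabytes, which is why history belongs in the historian database, not in Distributed Data.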
### Using Distributed Data as a Database
Distributed Data is ephemeral — it lives in memory only. If both nodes restart simultaneously (power failure, site-wide event), all Distributed Data is lost. Critical state that must survive this scenario belongs in Persistence or the historian database.
### Ignoring Merge Conflicts
CRDTs resolve conflicts automatically, but the resolution may not match business expectations. For example, with `LWWDictionary`, if both nodes update the same key simultaneously, the last writer wins. For command state, this is fine (only the active node writes). For alarm acknowledgments, ensure the business logic is idempotent — acknowledging an already-acknowledged alarm should be a no-op.
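What idempotent means here can be sketched without any Akka dependency (the class below is illustrative, not part of the codebase): acknowledgment is modeled as set membership, so replaying a duplicate or merged acknowledgment changes nothing.

```csharp
using System;
using System.Collections.Generic;

// Set-based acknowledgment bookkeeping: applying the same ack twice,
// or applying acks replayed from a merged replica, is a no-op.
public sealed class AlarmAcks
{
    private readonly HashSet<string> _acked = new();

    // Returns true only the first time the alarm is acknowledged.
    public bool Acknowledge(string alarmId) => _acked.Add(alarmId);

    public bool IsAcknowledged(string alarmId) => _acked.Contains(alarmId);
}
```

This mirrors how an `ORDictionary` entry behaves after a merge: duplicates collapse, so downstream handlers can ignore the possibility of gossip delivering the same acknowledgment twice.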
## Configuration Guidance
```hocon
akka.cluster.distributed-data {
  # Role for replication — both nodes must have this role
  role = "scada-node"

  # Gossip interval — how often state is pushed between nodes
  gossip-interval = 2s

  # Notification interval — how often subscribers are notified of changes
  notify-subscribers-interval = 500ms

  # Maximum delta size for gossip — increase if data volume is high
  max-delta-elements = 500

  # Durable storage (optional) — persist Distributed Data to disk.
  # Enable this for data that should survive single-node restarts.
  durable {
    keys = ["pending-commands", "alarm-acks"]
    lmdb {
      dir = "C:\\ProgramData\\SCADA\\ddata"
      map-size = 104857600 # 100 MB
    }
  }
}
```
### Durable Distributed Data
Akka.DistributedData supports optional durable storage (LMDB) that persists specified keys to disk. This means the data survives a single-node restart (but not a full cluster loss, since LMDB is node-local). This is a good middle ground for pending command state — it's faster than full Persistence recovery but still durable.
## References
- Official Documentation: <https://getakka.net/articles/clustering/distributed-data.html>
- CRDT Concepts: <https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type>