scadalink-design/AkkaDotNet/BestPracticesAndTraps.md
Joseph Doherty de636b908b Add Akka.NET reference documentation
Notes and documentation covering actors, remoting, clustering, persistence,
streams, serialization, hosting, testing, and best practices for the Akka.NET
framework used throughout the ScadaLink system.
2026-03-16 09:08:17 -04:00
# Best Practices and Traps to Avoid
This document consolidates cross-cutting best practices and common traps for building the SCADA system on Akka.NET. While each component document covers module-specific guidance, the issues here span multiple components or emerge from the interactions between them.
---
## Best Practices
### 1. Message Design
**Use immutable C# records for all messages.** Records provide value equality, `ToString()` for logging, and immutability by default. Every message that crosses an actor boundary — whether local, remote, or persisted — should be a record.
```csharp
// Good
public record SendCommand(string CommandId, string DeviceId, string TagName, object Value, DateTime Timestamp);

// Bad — mutable class with no equality semantics
public class SendCommand
{
    public string CommandId { get; set; }
    public Dictionary<string, object> Metadata { get; set; } // Mutable collection
}
```
**Keep messages small and focused.** A message should represent one thing: a command, an event, a query, or a response. Do not bundle unrelated data into a single message to "save roundtrips." Actors process messages sequentially — smaller messages mean faster processing and clearer intent.
**Never put IActorRef in persisted messages.** Actor references contain node addresses that are invalid after restart. Store logical identifiers (device ID, actor name) and resolve references at runtime via the `ActorRegistry` or `ActorSelection`.
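A minimal sketch of the distinction, assuming a hypothetical `DeviceCommandQueued` event; the resolution comment assumes the Device Manager owns the device actors as children:

```csharp
// Good — persisted event carries only the logical identifier
public record DeviceCommandQueued(string CommandId, string DeviceId);

// Bad — IActorRef embeds a node address that is stale after restart
public record DeviceCommandQueuedBad(string CommandId, IActorRef Device);

// At runtime, resolve the reference from the logical ID instead, e.g.:
// var device = Context.Child($"device-{evt.DeviceId}");
// if (device.IsNobody()) { /* re-create the child or look it up via ActorRegistry */ }
```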
### 2. Actor Hierarchy Design
**Model the domain, not the infrastructure.** The actor hierarchy should reflect the SCADA domain:
```
/user
  /device-manager (Singleton)
    /machine-001
    /machine-002
    /...
  /alarm-manager
    /alarm-processor-1
    /alarm-processor-2
  /historian-writer
```
Do not create actors for infrastructure concerns (one actor per database connection, one actor per thread). Use DI for infrastructure services; use actors for domain entities with state and lifecycle.
**One actor, one responsibility.** The Device Manager creates and supervises device actors — it does not process tag updates, evaluate alarm conditions, or write to the historian. Each of those concerns gets its own actor or actor subtree.
**Prefer flat-and-wide over deep-and-narrow.** A Device Manager with 500 direct child device actors is fine. A hierarchy where `DeviceManager → Zone → Group → Subgroup → Device` adds supervision overhead at every level. Only add hierarchy levels when you need different supervision strategies at each level.
### 3. Supervision Strategy
**Design supervision strategies explicitly for every parent actor.** The default strategy (restart on any exception) is rarely correct for a SCADA system. Think through each failure type:
| Exception Type | Strategy | Rationale |
|---|---|---|
| `CommunicationException` | Restart with backoff | Transient network issue; reconnection likely succeeds |
| `ConfigurationException` | Stop | Bad config won't fix itself on restart |
| `TimeoutException` | Restart | Equipment may be temporarily unresponsive |
| `SerializationException` | Resume | Bad message; skip it and continue |
| `OutOfMemoryException` | Escalate | Node-level problem; let the parent/system handle it |
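The table above can be expressed as a decider on the parent actor. This is a sketch: `CommunicationException` is assumed to be a domain-specific exception type, and the retry window values are illustrative:

```csharp
// Maps the failure-type table onto a OneForOneStrategy decider
protected override SupervisorStrategy SupervisorStrategy()
{
    return new OneForOneStrategy(
        maxNrOfRetries: 10,
        withinTimeMilliseconds: 60_000,
        localOnlyDecider: ex => ex switch
        {
            ConfigurationException => Directive.Stop,     // bad config won't fix itself
            SerializationException => Directive.Resume,   // skip the bad message, keep state
            TimeoutException       => Directive.Restart,  // equipment may recover
            CommunicationException => Directive.Restart,  // pair with BackoffSupervisor
            _                      => Directive.Escalate  // unknown — let the parent decide
        });
}
```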
**Use exponential backoff for restarts.** When a device actor fails repeatedly (equipment offline), exponential backoff prevents the actor from saturating the network with rapid reconnection attempts:
```csharp
var strategy = new OneForOneStrategy(
    maxNrOfRetries: -1,
    withinTimeRange: TimeSpan.MaxValue,
    decider: Decider.From(Directive.Restart));

// Combine with BackoffSupervisor
var backoffProps = BackoffSupervisor.Props(
    Backoff.OnFailure(
        childProps: Props.Create(() => new DeviceActor(config)),
        childName: $"device-{config.DeviceId}",
        minBackoff: TimeSpan.FromSeconds(3),
        maxBackoff: TimeSpan.FromSeconds(60),
        randomFactor: 0.2));
```
### 4. Failover Design
**Design every actor for restart.** The entire device actor subtree will be destroyed and recreated during failover. No actor should assume its state persists across restarts unless it explicitly uses Akka.Persistence. This means:
- Device actors re-read current state from equipment on startup
- The Device Manager replays its Persistence journal for in-flight commands
- Alarm state is reconstructed from Distributed Data (or re-evaluated from current tag values)
- Subscriptions are re-established, not carried over
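The first bullet can be sketched as a `PreStart` hook that treats every start as a cold start. This assumes a hypothetical `IProtocolAdapter` with a `ReadCurrentStateAsync` method and hypothetical `DeviceStateLoaded`/`DeviceStateLoadFailed` messages:

```csharp
// A device actor never assumes in-memory state survived a restart —
// it re-reads the current state from the equipment on every start
protected override void PreStart()
{
    _adapter.ReadCurrentStateAsync(_config.DeviceId)
        .PipeTo(Self,
            success: state => new DeviceStateLoaded(state),
            failure: ex => new DeviceStateLoadFailed(ex));
}
```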
**Test failover as a first-class scenario, not an afterthought.** Build MultiNodeTestRunner specs for failover early in development. The most dangerous bugs in a SCADA system are the ones that only appear during failover.
**Accept that failover has a gap.** With a cold standby, there will be a window (20–40 seconds) where no node is actively communicating with equipment. Design the system so this gap is safe: equipment should have local safety logic that does not depend on continuous SCADA commands. If the SCADA system is the sole safety mechanism, a cold standby may not be appropriate — consider a warm standby architecture.
### 5. Configuration Management
**Identical code, different config.** Both nodes in the pair should run the exact same compiled code. All node-specific differences (hostname, seed node order) come from `appsettings.json`. This prevents configuration drift bugs that only manifest in one node.
**Validate configuration at startup.** Before the ActorSystem starts, validate all configuration values. A misconfigured seed node address that silently fails to connect is worse than a loud startup crash:
```csharp
public static class ConfigValidator
{
    public static void Validate(SiteConfiguration config)
    {
        if (string.IsNullOrEmpty(config.NodeHostname))
            throw new ConfigurationException("NodeHostname is required");
        if (config.SeedNodes.Count < 2)
            throw new ConfigurationException("At least 2 seed nodes are required for failover pair");
        if (config.SeedNodes.All(s => !s.Contains(config.NodeHostname)))
            throw new ConfigurationException("This node's hostname must appear in the seed node list");
    }
}
```
### 6. Logging and Observability
**Use structured logging with correlation IDs.** Every command should carry a `CommandId` that flows through all actors and log entries. When diagnosing a failover issue, you need to trace a specific command's journey:
```csharp
_logger.LogInformation("Command {CommandId} dispatched to device {DeviceId} for tag {TagName}",
    command.CommandId, command.DeviceId, command.TagName);
```
**Log cluster membership events at Warning level.** In a SCADA system, cluster membership changes are operationally significant. An `UnreachableMember` event means failover may be imminent:
```csharp
Receive<ClusterEvent.UnreachableMember>(msg =>
    _logger.LogWarning("Node unreachable: {Address} — failover may initiate", msg.Member.Address));
```
**Monitor dead letters.** Dead letters in production indicate messages being sent to actors that no longer exist — often a symptom of failover timing issues or stale actor references. Subscribe to dead letters and log them at Warning level.
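A minimal dead-letter monitor looks like this — a plain `ReceiveActor` subscribed to the system event stream at startup:

```csharp
// Logs every dead letter at Warning level; start one instance per node
public class DeadLetterMonitor : ReceiveActor
{
    private readonly ILoggingAdapter _log = Context.GetLogger();

    public DeadLetterMonitor()
    {
        Receive<DeadLetter>(dl =>
            _log.Warning("Dead letter: {0} from {1} to {2}", dl.Message, dl.Sender, dl.Recipient));
    }

    protected override void PreStart() =>
        Context.System.EventStream.Subscribe(Self, typeof(DeadLetter));
}
```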
### 7. Performance
**Do not prematurely optimize.** Akka.NET handles millions of messages per second. A 500-device SCADA system with tens of tags per device updating every second produces on the order of 25,000 messages per second — well within the comfort zone of a single node. Optimize only after profiling identifies an actual bottleneck.
**Use `Tell`, not `Ask`, for internal actor communication.** `Ask` creates a temporary actor, allocates a `TaskCompletionSource`, and starts a timeout timer for every call. In the hot path (tag updates flowing from device actors to alarm processors), use `Tell` with reply-to patterns:
```csharp
// Good — fire and forget with reply via Tell
deviceActor.Tell(new GetDeviceState(replyTo: Self));

// Bad in hot paths — creates overhead per call
var state = await deviceActor.Ask<DeviceState>(new GetDeviceState(), TimeSpan.FromSeconds(5));
```
**Batch historian writes.** Writing each tag update individually to SQL Server creates excessive I/O. Use Akka.Streams `GroupedWithin` to batch updates (see [10-Streams.md](./10-Streams.md)).
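A batching pipeline can be sketched with Akka.Streams as below. This assumes a hypothetical `_historian.WriteBatchAsync` returning `Task<Done>` and a `TagUpdate` record; the buffer size and batch limits are illustrative:

```csharp
// Materializes to an IActorRef that device actors can Tell TagUpdates to;
// writes are flushed in batches of up to 500 or every 2 seconds
IActorRef historianEntry = Source.ActorRef<TagUpdate>(
        bufferSize: 10_000, overflowStrategy: OverflowStrategy.DropHead)
    .GroupedWithin(500, TimeSpan.FromSeconds(2))
    .SelectAsync(1, batch => _historian.WriteBatchAsync(batch))
    .To(Sink.Ignore<Done>())
    .Run(Context.Materializer());
```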
### 8. Testing Strategy
**Test at three levels:**
| Level | Tool | What to Test | Volume |
|---|---|---|---|
| Unit | Akka.TestKit | Individual actor behavior, message handling, state transitions | Many — fast, cheap |
| Integration | Akka.Hosting.TestKit | DI wiring, configuration loading, actor startup pipeline | Moderate |
| Distributed | MultiNodeTestRunner | Failover, SBR, singleton migration, command recovery | Few — slow, expensive |
**Create mock protocol adapters from day one.** The ability to test without real equipment is essential. Mock adapters should simulate both normal behavior (tag updates, command acks) and failure scenarios (connection drops, timeouts, garbled responses).
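One way to structure such a mock is an adapter actor whose failure mode is injected at construction. The `ReadTag`/`TagValueRead` messages match the ones used elsewhere in this document; the `MockBehavior` enum is an assumption of this sketch:

```csharp
public enum MockBehavior { Normal, Timeout, GarbledData }

// Stands in for a real protocol adapter in tests
public class MockProtocolAdapter : ReceiveActor
{
    public MockProtocolAdapter(MockBehavior behavior)
    {
        Receive<ReadTag>(msg =>
        {
            switch (behavior)
            {
                case MockBehavior.Normal:
                    Sender.Tell(new TagValueRead(msg.TagName, 42.0));
                    break;
                case MockBehavior.GarbledData:
                    Sender.Tell(new TagValueRead(msg.TagName, "\uFFFD\uFFFD"));
                    break;
                case MockBehavior.Timeout:
                    break; // never reply — forces the caller's timeout path
            }
        });
    }
}
```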
---
## Traps to Avoid
### Trap 1: The 2-Node Split Brain
**The problem:** With exactly 2 nodes, there is no majority. If the network partitions, each node sees itself as the sole survivor and runs the Cluster Singleton. Both nodes issue commands to equipment simultaneously.
**How it manifests:** Two Device Manager singletons running concurrently, each sending commands to the same equipment. Motors start and stop unpredictably. Safety-critical commands are duplicated.
**How to avoid it:** Configure the Split Brain Resolver with `keep-oldest` and `down-if-alone = on`. Accept that both nodes may down themselves during a true partition, and rely on Windows Service auto-restart to reform the cluster. Alternatively, implement a lease-based SBR with an external arbiter (see [15-Coordination.md](./15-Coordination.md)).
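The `keep-oldest` configuration looks like this in HOCON:

```hocon
akka.cluster {
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
  split-brain-resolver {
    active-strategy = keep-oldest
    keep-oldest {
      down-if-alone = on
    }
  }
}
```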
**How to detect it:** Monitor for `ClusterEvent.MemberUp` events where the cluster has 2 members that both believe they are leader. Log the singleton actor's lifecycle — if you see two "Singleton started" log entries in the same time window from different nodes, you have a split brain.
### Trap 2: Blocking Inside Actors
**The problem:** A device actor makes a synchronous network call (e.g., `opcClient.ReadValue()` without `await`) inside a `Receive` handler. This blocks the actor's dispatcher thread. If enough actors block simultaneously, the thread pool is exhausted and the entire ActorSystem stalls — including cluster heartbeats.
**How it manifests:** The cluster failure detector triggers because heartbeats stop being processed. The standby node marks the active node as unreachable, even though it's running — it's just deadlocked. Failover initiates unnecessarily, and the same blocking behavior occurs on the new active node.
**How to avoid it:** Use `ReceiveAsync` or `PipeTo` for all asynchronous operations. If a third-party library only offers synchronous APIs, wrap the call in `Task.Run` and `PipeTo` the result:
```csharp
Receive<ReadTag>(msg =>
{
    Task.Run(() => syncOnlyClient.ReadValue(msg.TagName))
        .PipeTo(Self,
            success: value => new TagValueRead(msg.TagName, value),
            failure: ex => new TagReadFailed(msg.TagName, ex));
});
```
### Trap 3: Singleton Starvation on Startup
**The problem:** After failover, the surviving node is alone. If `akka.cluster.min-nr-of-members = 2`, the Cluster waits for a second member before allowing the Singleton to start. No device communication occurs until the failed node is restarted.
**How it manifests:** Failover appears to succeed (cluster state shows 1 member), but the Device Manager never starts. Equipment is disconnected indefinitely.
**How to avoid it:** Set `akka.cluster.min-nr-of-members = 1`. The Singleton must be able to start on a single-node cluster.
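In HOCON:

```hocon
akka.cluster {
  min-nr-of-members = 1   # the singleton must start on a single surviving node
}
```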
### Trap 4: Persistence Journal Growth
**The problem:** The Device Manager persists every command event and never cleans up. Over weeks of operation, the SQLite journal grows to gigabytes. Recovery after failover takes minutes instead of seconds as the entire journal is replayed.
**How it manifests:** Failover time gradually increases. Eventually, the standby node takes so long to recover that operators assume it has failed and restart it, creating a cascading failure loop.
**How to avoid it:** Take periodic snapshots and delete old journal entries. Expire stale pending commands. Set up a maintenance task that monitors journal file size:
```csharp
// After every 100 persisted events
if (LastSequenceNr % 100 == 0)
{
    SaveSnapshot(_state);
}

Receive<SaveSnapshotSuccess>(success =>
{
    DeleteMessages(success.Metadata.SequenceNr);
    DeleteSnapshots(new SnapshotSelectionCriteria(success.Metadata.SequenceNr - 1));
});
```
### Trap 5: Serialization Mismatch After Deployment
**The problem:** A new version of the SCADA software changes a message type (adds a field, renames a property, changes a namespace). The Persistence journal contains events serialized with the old schema. On recovery, deserialization fails and the persistent actor crashes.
**How it manifests:** After a software update, the Device Manager singleton fails to start. The error log shows `SerializationException` during recovery. The system is down until someone manually fixes or clears the journal.
**How to avoid it:** Never modify existing persisted event types in a breaking way. Add new event versions alongside old ones and handle both during recovery (see [20-Serialization.md](./20-Serialization.md)). Test journal recovery with old-format events as part of the CI/CD pipeline:
```csharp
[Fact]
public void Should_recover_v1_command_events()
{
    // Seed the journal with V1 events
    // Start the persistent actor
    // Verify state is correctly recovered
}
```
### Trap 6: Equipment Reconnection Storm
**The problem:** After failover, 500 device actors start simultaneously on the standby node and all attempt to connect to their respective equipment at the same instant. This saturates the network and the equipment's connection capacity, causing most connections to fail. All 500 actors retry after the same backoff interval, creating synchronized waves.
**How it manifests:** After failover, only a handful of devices connect successfully. The rest cycle through connect → timeout → retry in lockstep waves. Full reconnection takes 10–15 minutes instead of 30 seconds.
**How to avoid it:** Stagger device actor startup. The Device Manager should create device actors in small batches with delays between them:
```csharp
private async Task StartDeviceActors(IReadOnlyList<DeviceConfig> configs)
{
    const int batchSize = 20;
    foreach (var batch in configs.Chunk(batchSize))
    {
        foreach (var config in batch)
        {
            var props = _adapterFactory.CreateAdapterProps(config, _resolver);
            Context.ActorOf(props, $"device-{config.DeviceId}");
        }
        await Task.Delay(TimeSpan.FromSeconds(1)); // Pause between batches
    }
}
```
Additionally, add random jitter to the `BackoffSupervisor`'s `randomFactor` so retrying actors don't synchronize.
### Trap 7: Distributed Data Loss on Full Cluster Restart
**The problem:** Distributed Data is in-memory by default. If both nodes go down simultaneously (power failure, site-wide event), all Distributed Data is lost. If the system relies on Distributed Data for pending command state without a Persistence backup, those commands are lost.
**How it manifests:** After a site-wide restart, the Device Manager has no record of in-flight commands. Commands that were sent but not yet acknowledged are neither retried nor flagged for operator review. Equipment may be in an inconsistent state.
**How to avoid it:** Use Distributed Data's durable LMDB storage for critical keys (see [09-DistributedData.md](./09-DistributedData.md)). Additionally, always use Akka.Persistence as the authoritative record for in-flight commands — Distributed Data is a convenience for the standby, not the source of truth.
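Durable storage is enabled per key pattern; the key name below is illustrative:

```hocon
akka.cluster.distributed-data.durable {
  keys = ["pending-commands*"]   # make only the critical keys durable
  lmdb.dir = "ddata"             # LMDB storage directory on each node
}
```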
### Trap 8: Forgetting CoordinatedShutdown
**The problem:** The SCADA Windows Service is stopped (for maintenance, update, etc.) by killing the process or stopping the service without graceful shutdown. The Cluster Singleton does not hand over cleanly; the standby detects the active node as unreachable (not gracefully left) and waits for the SBR timeout before taking over.
**How it manifests:** Planned maintenance causes the same 20–40 second failover gap as an unplanned crash. Operators learn to distrust the failover system and develop manual procedures that bypass it.
**How to avoid it:** Ensure the Windows Service wrapper triggers `CoordinatedShutdown`:
```csharp
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
    stoppingToken.Register(() =>
    {
        CoordinatedShutdown.Get(_actorSystem)
            .Run(CoordinatedShutdown.ClrExitReason.Instance)
            .Wait(TimeSpan.FromSeconds(30));
    });

    // Keep the hosted service alive until the host signals shutdown
    await Task.Delay(Timeout.Infinite, stoppingToken)
        .ContinueWith(_ => { }); // Swallow the expected cancellation
}
```
And configure Akka to respect the CLR shutdown hook:
```hocon
akka.coordinated-shutdown {
  run-by-clr-shutdown-hook = on
  exit-clr = on
}
akka.cluster.run-coordinated-shutdown-when-down = on
```
With graceful shutdown, the singleton migrates in seconds (limited by the hand-over retry interval) instead of the full failure detection timeout.
### Trap 9: Testing Only the Happy Path
**The problem:** Tests cover normal operation (device connects, tags update, commands succeed) but not failure scenarios (connection drops mid-command, equipment returns garbled data, failover during alarm escalation). The system works perfectly in the lab and fails unpredictably in production.
**How to avoid it:** For every happy-path test, write at least one failure-path test:
- Device connects → Device connection times out
- Command succeeds → Command times out, equipment never responds
- Tag updates normally → Tag update contains invalid data
- Singleton runs on Node A → Node A crashes during command processing
- Both nodes healthy → Network partition between nodes
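The second pair above can be sketched as an Akka.TestKit test inside a `TestKit`-derived class. `DeviceActor`, `CommandTimedOut`, and the adapter-injection constructor are assumptions of this sketch:

```csharp
[Fact]
public void Device_actor_reports_timeout_when_equipment_never_responds()
{
    var adapter = CreateTestProbe(); // stands in for the equipment-facing adapter
    var device = Sys.ActorOf(Props.Create(() => new DeviceActor(adapter.Ref)));

    device.Tell(new SendCommand("cmd-1", "machine-001", "StartPump", true, DateTime.UtcNow));

    adapter.ExpectMsg<SendCommand>();  // the command reached the adapter
    // Deliberately never reply — the device actor's internal timeout should fire
    ExpectMsg<CommandTimedOut>(TimeSpan.FromSeconds(5));
}
```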
### Trap 10: Overusing Ask in Actor-to-Actor Communication
**The problem:** Internal actor communication uses `Ask` (request-response) pervasively. Each `Ask` creates a temporary actor and a timeout. Under load, this creates thousands of temporary actors, increasing garbage collection pressure and memory consumption. Worse, if an actor is processing slowly, `Ask` timeouts cascade into `AskTimeoutException` storms that fill logs and trigger supervisor restarts.
**How it manifests:** Under moderate load, the system starts logging `AskTimeoutException` frequently. Device actors are restarted by supervisors that interpret the timeout exception as a failure. The restarts disconnect equipment, causing more timeouts, creating a cascading failure.
**How to avoid it:** Use `Tell` with explicit reply-to actor references for all internal communication. Reserve `Ask` for system boundaries where an external caller (ASP.NET controller, health check) needs a synchronous response from an actor:
```csharp
// Internal: Tell with reply
target.Tell(new RequestState(replyTo: Self));
Receive<StateResponse>(response => ProcessResponse(response));

// Boundary only: Ask from ASP.NET
public async Task<IActionResult> GetDeviceState(string deviceId)
{
    var state = await _deviceManager.Ask<DeviceState>(
        new GetDeviceState(deviceId), TimeSpan.FromSeconds(5));
    return Ok(state);
}
```
---
## Checklist: Before Going to Production
- [ ] Split Brain Resolver is configured and tested (not left on defaults)
- [ ] `min-nr-of-members = 1` so the singleton starts after failover
- [ ] All cross-boundary messages are serialization-tested
- [ ] Persistence journal cleanup is implemented (snapshots + deletion)
- [ ] Persisted event schema versioning strategy is documented
- [ ] Device actor startup is staggered to prevent reconnection storms
- [ ] CoordinatedShutdown is wired into the Windows Service lifecycle
- [ ] MultiNodeTestRunner specs cover failover, partition, and rejoin scenarios
- [ ] Dead letter monitoring is enabled
- [ ] Structured logging with command correlation IDs is in place
- [ ] Mock protocol adapters exist for all equipment protocols
- [ ] appsettings.json is validated at startup before the ActorSystem starts
- [ ] Both nodes have been tested as the surviving node (not just Node A)
- [ ] The failover gap duration is documented and accepted by operations
- [ ] Equipment has local safety logic that does not depend on continuous SCADA commands