# 18 — MultiNodeTestRunner (Akka.MultiNodeTestRunner)

## Overview

The MultiNodeTestRunner provides infrastructure for running distributed integration tests across multiple actor systems simultaneously. Each "node" in the test runs in its own process, with full cluster formation, network simulation, and coordinated test assertions. This is the tool for validating failover behavior, split-brain scenarios, and cluster membership transitions.

In the SCADA system, MultiNodeTestRunner is essential for validating the core availability guarantee: that the standby node correctly takes over device communication when the active node fails, without losing or duplicating commands.

## When to Use

- Testing failover scenarios (active node crash → standby takes over)
- Validating Split Brain Resolver behavior in the 2-node topology
- Testing Cluster Singleton migration (Device Manager moves to the standby)
- Verifying Distributed Data replication between nodes
- Testing graceful shutdown and rejoin sequences

## When Not to Use

- Unit testing individual actor logic — use TestKit
- Integration tests that only need DI — use Hosting.TestKit
- Performance or load testing — MultiNodeTestRunner adds significant overhead from process coordination

## Design Decisions for the SCADA System

### Key Failover Scenarios to Test

1. **Active node hard crash:** Kill the active node's process. Verify the standby detects the failure, acquires the singleton, and starts device actors.

2. **Active node graceful shutdown:** Initiate CoordinatedShutdown on the active node. Verify the singleton migrates cleanly with buffered messages preserved.

3. **Network partition (simulated):** Prevent the two nodes from communicating. Verify SBR correctly resolves the partition (one node survives, one downs itself).

4. **Rejoin after failure:** After failover, restart the failed node. Verify it joins the cluster as the new standby without disrupting the active node.

5. **Command in-flight during failover:** Send a command to the active node, then kill it before the command is acknowledged. Verify the new active node recovers the pending command from the Persistence journal.

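
The partition scenario follows the same spec skeleton. The sketch below is illustrative only: it assumes a `SplitBrainSpecConfig` exposing `First`/`Second` roles with the keep-oldest SBR settings shown later in this document, and it asserts only on the surviving side (the downed side's actor system may already be terminating).

```csharp
// Illustrative outline of the partition scenario. Assumes SplitBrainSpecConfig
// mirrors FailoverSpecConfig (two roles, keep-oldest SBR).
public class SplitBrainSpec : MultiNodeClusterSpec
{
    private readonly SplitBrainSpecConfig _config;

    private RoleName First => _config.First;
    private RoleName Second => _config.Second;

    public SplitBrainSpec() : this(new SplitBrainSpecConfig()) { }

    private SplitBrainSpec(SplitBrainSpecConfig config) : base(config, typeof(SplitBrainSpec))
    {
        _config = config;
    }

    [MultiNodeFact]
    public void Partition_should_be_resolved_by_keep_oldest()
    {
        // Both nodes form the cluster
        RunOn(() => Cluster.Join(GetAddress(First)), First, Second);
        AwaitMembersUp(2);
        EnterBarrier("cluster-formed");

        // Cut traffic in both directions; the TestConductor's control channel
        // is separate from the remoting transport, so barriers still work
        RunOn(() =>
        {
            TestConductor.Blackhole(First, Second, ThrottleTransportAdapter.Direction.Both).Wait();
        }, First);
        EnterBarrier("partitioned");

        // keep-oldest: First (the oldest) survives and removes Second;
        // Second is expected to down itself
        RunOn(() =>
        {
            AwaitAssert(() =>
                Assert.DoesNotContain(Cluster.State.Members,
                    m => m.Address == GetAddress(Second)),
                TimeSpan.FromSeconds(60));
        }, First);
    }
}
```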

### Test Structure

```csharp
public class FailoverSpec : MultiNodeClusterSpec
{
    private readonly FailoverSpecConfig _config;

    // Shortcuts to the roles defined in the spec config
    private RoleName First => _config.First;
    private RoleName Second => _config.Second;

    public FailoverSpec() : this(new FailoverSpecConfig()) { }

    private FailoverSpec(FailoverSpecConfig config) : base(config, typeof(FailoverSpec))
    {
        _config = config;
    }

    [MultiNodeFact]
    public void Active_node_failure_should_trigger_singleton_migration()
    {
        // Arrange: Both nodes join cluster
        RunOn(() => Cluster.Join(GetAddress(First)), First, Second);
        AwaitMembersUp(2);

        // Verify singleton is on the first (oldest) node
        RunOn(() =>
        {
            var singleton = Sys.ActorSelection("/user/device-manager");
            singleton.Tell(new Identify(1));
            var identity = ExpectMsg<ActorIdentity>();
            Assert.NotNull(identity.Subject);
        }, First);

        EnterBarrier("singleton-running");

        // Act: Kill the first node
        RunOn(() =>
        {
            TestConductor.Exit(First, 0).Wait();
        }, Second);

        // Assert: Singleton migrates to second node
        RunOn(() =>
        {
            AwaitAssert(() =>
            {
                var singleton = Sys.ActorSelection("/user/device-manager");
                singleton.Tell(new Identify(2));
                var identity = ExpectMsg<ActorIdentity>(TimeSpan.FromSeconds(30));
                Assert.NotNull(identity.Subject);
            }, TimeSpan.FromSeconds(60));
        }, Second);
    }
}
```

### Spec Configuration

```csharp
public class FailoverSpecConfig : MultiNodeConfig
{
    public RoleName First { get; }
    public RoleName Second { get; }

    public FailoverSpecConfig()
    {
        First = Role("first");
        Second = Role("second");

        CommonConfig = ConfigurationFactory.ParseString(@"
            akka.actor.provider = cluster
            akka.remote.dot-netty.tcp.port = 0
            akka.cluster {
                downing-provider-class = ""Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster""
                split-brain-resolver {
                    active-strategy = keep-oldest
                    keep-oldest.down-if-alone = on
                }
                min-nr-of-members = 1
            }
        ");
    }
}
```

## Common Patterns

### Barriers for Synchronization

Use `EnterBarrier` to synchronize test steps across nodes:

```csharp
// Both nodes reach this point before proceeding
EnterBarrier("cluster-formed");

// ... do work ...

EnterBarrier("work-complete");
```

### RunOn for Node-Specific Logic

Execute test logic on specific nodes:

```csharp
RunOn(() =>
{
    // This code runs only on the "first" node
    Cluster.Join(GetAddress(First));
}, First);

RunOn(() =>
{
    // This code runs only on the "second" node
    Cluster.Join(GetAddress(First));
}, Second);
```

### TestConductor for Failure Injection

The TestConductor controls node lifecycle and network simulation:

```csharp
// Kill a node
TestConductor.Exit(First, exitCode: 0).Wait();

// Simulate network partition (blackhole traffic)
TestConductor.Blackhole(First, Second, ThrottleTransportAdapter.Direction.Both).Wait();

// Restore network
TestConductor.PassThrough(First, Second, ThrottleTransportAdapter.Direction.Both).Wait();
```

### Timeout Handling

Multi-node tests involve network coordination and are inherently slower. Use generous timeouts:

```csharp
AwaitAssert(() =>
{
    // Assertion that may take time (singleton migration, failure detection)
}, max: TimeSpan.FromSeconds(60), interval: TimeSpan.FromSeconds(2));
```

## Anti-Patterns

### Testing Everything Multi-Node

Multi-node tests are slow (process startup, cluster formation, barrier synchronization). Only test scenarios that genuinely require multiple nodes: failover, partition handling, data replication. All other tests should use TestKit or Hosting.TestKit.

### Brittle Timing Assertions

Do not assert that failover completes in exactly N seconds. Timing varies with machine load, GC pauses, and CI environment. Use `AwaitAssert` with a generous maximum timeout.
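
TestKit's `Within` expresses the same idea as an upper bound rather than an exact duration; a minimal sketch:

```csharp
// Upper bound, not an exact duration: the test passes as long as the
// migration completes at any point within the 60-second window.
Within(TimeSpan.FromSeconds(60), () =>
{
    AwaitAssert(() =>
    {
        Sys.ActorSelection("/user/device-manager").Tell(new Identify(3));
        Assert.NotNull(ExpectMsg<ActorIdentity>().Subject);
    });
});
```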

### Forgetting Cleanup

Ensure all node processes are terminated after each test. The MultiNodeTestRunner handles this, but custom test infrastructure must clean up explicitly.

### Testing with Real Equipment

Multi-node tests should use mock protocol adapters, not real equipment connections. Equipment behavior during test-driven cluster failures could be unpredictable.
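
A stand-in adapter can be as simple as an actor that records commands and acknowledges them. The `MockProtocolAdapter` below and its message types are a hypothetical sketch, not part of the ScadaLink codebase; tests would swap it in for the real adapter via Props wiring.

```csharp
// Illustrative stand-in for a real protocol adapter: records commands
// instead of writing to equipment, and acknowledges so callers can
// assert on delivery during failover tests.
public sealed class MockProtocolAdapter : ReceiveActor
{
    public sealed record DeviceCommand(string DeviceId, string Payload);
    public sealed record CommandAck(string DeviceId);

    private readonly List<DeviceCommand> _received = new();

    public MockProtocolAdapter()
    {
        Receive<DeviceCommand>(cmd =>
        {
            _received.Add(cmd);                   // record instead of touching hardware
            Sender.Tell(new CommandAck(cmd.DeviceId));
        });
    }
}
```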

## Configuration Guidance

### Running Multi-Node Tests

```bash
# Using the Akka.MultiNodeTestRunner CLI
dotnet tool install --global Akka.MultiNodeTestRunner

# Run tests
mntr run ScadaSystem.MultiNode.Tests.dll
```

### CI/CD Integration

Multi-node tests require multiple processes on the same machine. Ensure the CI agent has sufficient resources and that ports are available (the test runner uses random ports).

### Test Project Structure

```
ScadaSystem.MultiNode.Tests/
  Specs/
    FailoverSpec.cs
    SplitBrainSpec.cs
    RejoinSpec.cs
    CommandRecoverySpec.cs
  Configs/
    FailoverSpecConfig.cs
    SplitBrainSpecConfig.cs
```

## References

- GitHub: <https://github.com/akkadotnet/Akka.MultiNodeTestRunner>
- Testing Actor Systems: <https://getakka.net/articles/actors/testing-actor-systems.html>