# 18 — MultiNodeTestRunner (Akka.MultiNodeTestRunner)

## Overview

The MultiNodeTestRunner provides infrastructure for running distributed integration tests across multiple actor systems simultaneously. Each "node" in the test runs in its own process, with full cluster formation, network simulation, and coordinated test assertions. This is the tool for validating failover behavior, split-brain scenarios, and cluster membership transitions.

In the SCADA system, MultiNodeTestRunner is essential for validating the core availability guarantee: that the standby node correctly takes over device communication when the active node fails, without losing or duplicating commands.

## When to Use

- Testing failover scenarios (active node crash → standby takes over)
- Validating Split Brain Resolver behavior in the 2-node topology
- Testing Cluster Singleton migration (Device Manager moves to the standby)
- Verifying Distributed Data replication between nodes
- Testing graceful shutdown and rejoin sequences

## When Not to Use

- Unit testing individual actor logic — use TestKit
- Integration tests that only need DI — use Hosting.TestKit
- Performance or load testing — MultiNodeTestRunner adds significant overhead from process coordination

## Design Decisions for the SCADA System

### Key Failover Scenarios to Test

1. **Active node hard crash:** Kill the active node's process. Verify the standby detects the failure, acquires the singleton, and starts device actors.

2. **Active node graceful shutdown:** Initiate CoordinatedShutdown on the active node. Verify the singleton migrates cleanly with buffered messages preserved.

3. **Network partition (simulated):** Prevent the two nodes from communicating. Verify SBR correctly resolves the partition (one node survives, one downs itself).

4. **Rejoin after failure:** After failover, restart the failed node. Verify it joins the cluster as the new standby without disrupting the active node.

5. **Command in-flight during failover:** Send a command to the active node, then kill it before the command is acknowledged. Verify the new active node recovers the pending command from the Persistence journal.

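
The partition scenario follows the same spec skeleton. The sketch below is illustrative only: it assumes a `SplitBrainSpecConfig` exposing `First`/`Second` roles with the keep-oldest SBR settings shown later in this document, and it asserts only on the surviving side (the downed side's actor system may already be terminating).

```csharp
// Illustrative outline of the partition scenario. Assumes SplitBrainSpecConfig
// mirrors FailoverSpecConfig (two roles, keep-oldest SBR).
public class SplitBrainSpec : MultiNodeClusterSpec
{
    private readonly SplitBrainSpecConfig _config;

    private RoleName First => _config.First;
    private RoleName Second => _config.Second;

    public SplitBrainSpec() : this(new SplitBrainSpecConfig()) { }

    private SplitBrainSpec(SplitBrainSpecConfig config) : base(config, typeof(SplitBrainSpec))
    {
        _config = config;
    }

    [MultiNodeFact]
    public void Partition_should_be_resolved_by_keep_oldest()
    {
        // Both nodes form the cluster
        RunOn(() => Cluster.Join(GetAddress(First)), First, Second);
        AwaitMembersUp(2);
        EnterBarrier("cluster-formed");

        // Cut traffic in both directions; the TestConductor's control channel
        // is separate from the remoting transport, so barriers still work
        RunOn(() =>
        {
            TestConductor.Blackhole(First, Second, ThrottleTransportAdapter.Direction.Both).Wait();
        }, First);
        EnterBarrier("partitioned");

        // keep-oldest: First (the oldest) survives and removes Second;
        // Second is expected to down itself
        RunOn(() =>
        {
            AwaitAssert(() =>
                Assert.DoesNotContain(Cluster.State.Members,
                    m => m.Address == GetAddress(Second)),
                TimeSpan.FromSeconds(60));
        }, First);
    }
}
```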

### Test Structure

```csharp
public class FailoverSpec : MultiNodeClusterSpec
{
    private readonly FailoverSpecConfig _config;

    // Shortcuts to the roles defined in the spec config
    private RoleName First => _config.First;
    private RoleName Second => _config.Second;

    public FailoverSpec() : this(new FailoverSpecConfig()) { }

    private FailoverSpec(FailoverSpecConfig config) : base(config, typeof(FailoverSpec))
    {
        _config = config;
    }

    [MultiNodeFact]
    public void Active_node_failure_should_trigger_singleton_migration()
    {
        // Arrange: Both nodes join cluster
        RunOn(() => Cluster.Join(GetAddress(First)), First, Second);
        AwaitMembersUp(2);

        // Verify singleton is on the first (oldest) node
        RunOn(() =>
        {
            var singleton = Sys.ActorSelection("/user/device-manager");
            singleton.Tell(new Identify(1));
            var identity = ExpectMsg<ActorIdentity>();
            Assert.NotNull(identity.Subject);
        }, First);

        EnterBarrier("singleton-running");

        // Act: Kill the first node
        RunOn(() =>
        {
            TestConductor.Exit(First, 0).Wait();
        }, Second);

        // Assert: Singleton migrates to second node
        RunOn(() =>
        {
            AwaitAssert(() =>
            {
                var singleton = Sys.ActorSelection("/user/device-manager");
                singleton.Tell(new Identify(2));
                var identity = ExpectMsg<ActorIdentity>(TimeSpan.FromSeconds(30));
                Assert.NotNull(identity.Subject);
            }, TimeSpan.FromSeconds(60));
        }, Second);
    }
}
```

### Spec Configuration

```csharp
public class FailoverSpecConfig : MultiNodeConfig
{
    public RoleName First { get; }
    public RoleName Second { get; }

    public FailoverSpecConfig()
    {
        First = Role("first");
        Second = Role("second");

        CommonConfig = ConfigurationFactory.ParseString(@"
            akka.actor.provider = cluster
            akka.remote.dot-netty.tcp.port = 0
            akka.cluster {
                downing-provider-class = ""Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster""
                split-brain-resolver {
                    active-strategy = keep-oldest
                    keep-oldest.down-if-alone = on
                }
                min-nr-of-members = 1
            }
        ");
    }
}
```

## Common Patterns

### Barriers for Synchronization

Use `EnterBarrier` to synchronize test steps across nodes:

```csharp
// Both nodes reach this point before proceeding
EnterBarrier("cluster-formed");

// ... do work ...

EnterBarrier("work-complete");
```

### RunOn for Node-Specific Logic

Execute test logic on specific nodes:

```csharp
RunOn(() =>
{
    // This code runs only on the "first" node
    Cluster.Join(GetAddress(First));
}, First);

RunOn(() =>
{
    // This code runs only on the "second" node
    Cluster.Join(GetAddress(First));
}, Second);
```

### TestConductor for Failure Injection

The TestConductor controls node lifecycle and network simulation:

```csharp
// Kill a node
TestConductor.Exit(First, exitCode: 0).Wait();

// Simulate network partition (blackhole traffic)
TestConductor.Blackhole(First, Second, ThrottleTransportAdapter.Direction.Both).Wait();

// Restore network
TestConductor.PassThrough(First, Second, ThrottleTransportAdapter.Direction.Both).Wait();
```

### Timeout Handling

Multi-node tests involve network coordination and are inherently slower. Use generous timeouts:

```csharp
AwaitAssert(() =>
{
    // Assertion that may take time (singleton migration, failure detection)
}, max: TimeSpan.FromSeconds(60), interval: TimeSpan.FromSeconds(2));
```

## Anti-Patterns

### Testing Everything Multi-Node

Multi-node tests are slow (process startup, cluster formation, barrier synchronization). Only test scenarios that genuinely require multiple nodes: failover, partition handling, data replication. All other tests should use TestKit or Hosting.TestKit.

### Brittle Timing Assertions

Do not assert that failover completes in exactly N seconds. Timing varies with machine load, GC pauses, and CI environment. Use `AwaitAssert` with a generous maximum timeout.
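
TestKit's `Within` expresses the same idea as an upper bound rather than an exact duration; a minimal sketch:

```csharp
// Upper bound, not an exact duration: the test passes as long as the
// migration completes at any point within the 60-second window.
Within(TimeSpan.FromSeconds(60), () =>
{
    AwaitAssert(() =>
    {
        Sys.ActorSelection("/user/device-manager").Tell(new Identify(3));
        Assert.NotNull(ExpectMsg<ActorIdentity>().Subject);
    });
});
```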

### Forgetting Cleanup

Ensure all node processes are terminated after each test. The MultiNodeTestRunner handles this, but custom test infrastructure must clean up explicitly.

### Testing with Real Equipment

Multi-node tests should use mock protocol adapters, not real equipment connections. Equipment behavior during test-driven cluster failures could be unpredictable.
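
A stand-in adapter can be as simple as an actor that records commands and acknowledges them. The `MockProtocolAdapter` below and its message types are a hypothetical sketch, not part of the ScadaLink codebase; tests would swap it in for the real adapter via Props wiring.

```csharp
// Illustrative stand-in for a real protocol adapter: records commands
// instead of writing to equipment, and acknowledges so callers can
// assert on delivery during failover tests.
public sealed class MockProtocolAdapter : ReceiveActor
{
    public sealed record DeviceCommand(string DeviceId, string Payload);
    public sealed record CommandAck(string DeviceId);

    private readonly List<DeviceCommand> _received = new();

    public MockProtocolAdapter()
    {
        Receive<DeviceCommand>(cmd =>
        {
            _received.Add(cmd);                   // record instead of touching hardware
            Sender.Tell(new CommandAck(cmd.DeviceId));
        });
    }
}
```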

## Configuration Guidance

### Running Multi-Node Tests

```bash
# Using the Akka.MultiNodeTestRunner CLI
dotnet tool install --global Akka.MultiNodeTestRunner

# Run tests
mntr run ScadaSystem.MultiNode.Tests.dll
```

### CI/CD Integration

Multi-node tests require multiple processes on the same machine. Ensure the CI agent has sufficient resources and that ports are available (the test runner uses random ports).

### Test Project Structure

```
ScadaSystem.MultiNode.Tests/
  Specs/
    FailoverSpec.cs
    SplitBrainSpec.cs
    RejoinSpec.cs
    CommandRecoverySpec.cs
  Configs/
    FailoverSpecConfig.cs
    SplitBrainSpecConfig.cs
```

## References

- GitHub: <https://github.com/akkadotnet/Akka.MultiNodeTestRunner>
- Testing Actor Systems: <https://getakka.net/articles/actors/testing-actor-systems.html>