# E2E Cluster & RAFT Tests Design
## Goal
Create a new test project NATS.E2E.Cluster.Tests with comprehensive end-to-end tests for cluster resilience, RAFT consensus, JetStream cluster replication, gateway failover, and leaf node failover. All tests run against real multi-process server instances (not in-memory mocks).
## Architecture
A single new xUnit 3 test project with shared fixtures per topology. It is kept separate from the existing fast NATS.E2E.Tests suite (78 tests, ~6s) because cluster failover tests are inherently slower (30s+ timeouts, node kill/restart cycles).
Dependencies: NATS.Client.Core, NATS.NKeys, xUnit 3, Shouldly — same as existing E2E project. No new NuGet packages.
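Concretely, the project file can mirror the existing E2E suite. A minimal sketch — the target framework, package names/versions, and relative paths below are assumptions, not taken from the repo:

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <!-- Same dependency set as NATS.E2E.Tests; versions are placeholders -->
    <PackageReference Include="xunit.v3" Version="*" />
    <PackageReference Include="Shouldly" Version="*" />
    <ProjectReference Include="../../src/NATS.Client.Core/NATS.Client.Core.csproj" />
    <ProjectReference Include="../../src/NATS.NKeys/NATS.NKeys.csproj" />
  </ItemGroup>
</Project>
```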
## Project Structure

```
tests/NATS.E2E.Cluster.Tests/
  NATS.E2E.Cluster.Tests.csproj
  Infrastructure/
    NatsServerProcess.cs        (copied from NATS.E2E.Tests, standalone utility)
    ThreeNodeClusterFixture.cs  (3-node full-mesh cluster)
    JetStreamClusterFixture.cs  (3-node cluster with JetStream enabled)
    GatewayPairFixture.cs       (2 independent clusters connected by gateway)
    HubLeafFixture.cs           (1 hub + 1 leaf node)
  ClusterResilienceTests.cs     (4 tests)
  RaftConsensusTests.cs         (4 tests)
  JetStreamClusterTests.cs      (4 tests)
  GatewayFailoverTests.cs       (2 tests)
  LeafNodeFailoverTests.cs      (2 tests)
```
## Fixtures
### ThreeNodeClusterFixture

- 3-node full-mesh cluster with monitoring enabled
- Each node: client port, cluster port, monitoring port (all ephemeral)
- `pool_size: 1` for deterministic routing
- Readiness: polls `/routez` on all nodes until `NumRoutes >= 2`
- Used by: ClusterResilienceTests
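For reference, a per-node server config along these lines satisfies the fixture's requirements. This is a sketch: the `<...>` placeholders stand in for the fixture's pre-allocated ephemeral ports, and the cluster name is arbitrary.

```
port: <client_port>          # client connections
http_port: <monitoring_port> # serves /routez, /varz, ...

cluster {
  name: e2e-cluster
  port: <cluster_port>
  pool_size: 1               # single route per neighbor for deterministic routing
  routes: [
    nats-route://127.0.0.1:<other_node_cluster_port>
    nats-route://127.0.0.1:<other_node_cluster_port>
  ]
}
```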
### JetStreamClusterFixture

- Extends the ThreeNodeClusterFixture pattern with `jetstream { store_dir }` config
- Temp store dirs per node, cleaned up on dispose
- Used by: RaftConsensusTests, JetStreamClusterTests
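The JetStream addition on top of the base node config is small; a sketch of what each node would add (note that clustered JetStream also requires a unique `server_name` per node — an implementation detail not called out above):

```
server_name: node-0          # must be unique per node for clustered JetStream

jetstream {
  store_dir: "<temp_dir_for_this_node>"   # deleted on fixture dispose
}
```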
### GatewayPairFixture

- 2 servers in separate "clusters" connected by gateway
- Readiness: polls `/gatewayz` until `NumGateways >= 1` on both
- Used by: GatewayFailoverTests
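A sketch of the gateway wiring for one side (mirrored on the other; the gateway names and placeholder ports are illustrative):

```
# server in cluster A
gateway {
  name: A
  port: <gwA_port>
  gateways: [
    { name: B, url: "nats://127.0.0.1:<gwB_port>" }
  ]
}
```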
### HubLeafFixture

- 1 hub server + 1 leaf server with a solicited connection
- Readiness: polls the hub's `/leafz` until `NumLeafs >= 1`
- Used by: LeafNodeFailoverTests
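The solicited leaf connection can be sketched as two config fragments (placeholder ports; the leaf dials the hub, so only the hub listens):

```
# hub: accepts leaf connections
leafnodes {
  port: <hub_leaf_port>
}

# leaf: solicits the connection to the hub
leafnodes {
  remotes: [
    { url: "nats-leaf://127.0.0.1:<hub_leaf_port>" }
  ]
}
```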
## KillAndRestart Primitive

Core infrastructure for all failover tests. Each fixture tracks its `NatsServerProcess[]` by index.

- `KillNode(int index)` — kills the process, waits for exit
- `RestartNode(int index)` — starts a new process on the same ports/config
- `WaitForFullMesh()` — polls `/routez` until the topology is restored
Ports are pre-allocated and reused across kill/restart cycles so cluster config remains valid.
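A self-contained sketch of what `WaitForFullMesh()` could look like, polling each node's monitoring endpoint (the `num_routes` field of `/routez`); the method name and parameters here are assumptions about the fixture surface:

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

static async Task WaitForFullMeshAsync(int[] monitorPorts, CancellationToken ct)
{
    using var http = new HttpClient();
    while (true)
    {
        ct.ThrowIfCancellationRequested();
        var meshed = true;
        foreach (var port in monitorPorts)
        {
            try
            {
                var json = await http.GetStringAsync($"http://127.0.0.1:{port}/routez", ct);
                using var doc = JsonDocument.Parse(json);
                // Full mesh with pool_size 1: every node has one route to each other node.
                if (doc.RootElement.GetProperty("num_routes").GetInt32() < monitorPorts.Length - 1)
                    meshed = false;
            }
            catch (HttpRequestException)
            {
                meshed = false; // node not reachable yet (e.g. mid-restart)
            }
        }
        if (meshed) return;
        await Task.Delay(200, ct); // 200ms polling, per the timeout strategy below
    }
}
```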
## Test Cases (16 total)
### ClusterResilienceTests.cs (ThreeNodeClusterFixture)

| Test | Validates |
|---|---|
| `NodeDies_TrafficReroutesToSurvivors` | Kill node 2, pub on node 0, sub on node 1 — messages delivered |
| `NodeRejoins_SubscriptionsPropagateAgain` | Kill node 2, restart, verify new subs on node 2 receive messages from node 0 |
| `AllRoutesReconnect_AfterNodeRestart` | Kill node 1, restart, poll `/routez` until full mesh restored |
| `QueueGroup_NodeDies_RemainingMembersDeliver` | Queue subs on nodes 1+2, kill node 2, node 1 delivers all |
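The first test in the table might be shaped like this. A sketch only: `Fixture.ClientUrl(i)` and `Fixture.KillNodeAsync(i)` are assumed fixture members, the subscribe/publish calls assume the NATS.Client.Core API as used by the existing E2E tests, and the fixed delay is a crude stand-in for a proper interest-propagation check:

```csharp
[Fact]
public async Task NodeDies_TrafficReroutesToSurvivors()
{
    using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));

    await using var subConn = new NatsConnection(new NatsOpts { Url = Fixture.ClientUrl(1) });
    await using var pubConn = new NatsConnection(new NatsOpts { Url = Fixture.ClientUrl(0) });

    var received = new TaskCompletionSource<string>();
    _ = Task.Run(async () =>
    {
        await foreach (var msg in subConn.SubscribeAsync<string>("e2e.reroute", cancellationToken: cts.Token))
        {
            received.TrySetResult(msg.Data!);
            break;
        }
    }, cts.Token);

    await Task.Delay(500, cts.Token); // crude wait for subject interest to propagate over routes
    await Fixture.KillNodeAsync(2);   // survivors: node 0 (publisher) and node 1 (subscriber)
    await pubConn.PublishAsync("e2e.reroute", "hello");

    (await received.Task.WaitAsync(cts.Token)).ShouldBe("hello");
}
```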
### RaftConsensusTests.cs (JetStreamClusterFixture)

| Test | Validates |
|---|---|
| `LeaderElection_ClusterFormsLeader` | Create R3 stream, verify one node reports as leader |
| `LeaderDies_NewLeaderElected` | Identify leader, kill it, verify new leader elected |
| `LogReplication_AllReplicasHaveData` | Publish to R3 stream, verify all replicas report same count |
| `LeaderRestart_RejoinsAsFollower` | Kill leader, wait for new, restart old, verify catchup |
### JetStreamClusterTests.cs (JetStreamClusterFixture)

| Test | Validates |
|---|---|
| `R3Stream_CreateAndPublish_ReplicatedAcrossNodes` | Create R3 stream, publish, verify replicas on 3 nodes |
| `R3Stream_NodeDies_PublishContinues` | Kill replica, continue publishing, verify on survivors |
| `Consumer_NodeDies_PullContinuesOnSurvivor` | Pull consumer, kill its node, verify consumer works elsewhere |
| `R3Stream_Purge_ReplicatedAcrossNodes` | Purge stream, verify all replicas report 0 |
### GatewayFailoverTests.cs (GatewayPairFixture)

| Test | Validates |
|---|---|
| `Gateway_Disconnect_Reconnects` | Kill gateway-B, restart, verify messaging resumes |
| `Gateway_InterestUpdated_AfterReconnect` | Sub on B, kill B, restart, resub, verify delivery from A |
### LeafNodeFailoverTests.cs (HubLeafFixture)

| Test | Validates |
|---|---|
| `Leaf_Disconnect_ReconnectsToHub` | Kill leaf, restart, verify `/leafz` shows reconnection |
| `Leaf_HubRestart_LeafReconnects` | Kill hub, restart, verify leaf reconnects and messaging resumes |
## Timeout Strategy

- Per-test: 30-second `CancellationTokenSource`
- Fixture readiness (full mesh, gateway, leaf): 30-second timeout, 200ms polling
- Node restart wait: reuse `WaitForTcpReadyAsync()` plus topology readiness checks
## Port Allocation

Same `AllocateFreePort()` pattern as the existing E2E infrastructure. Each fixture allocates all ports upfront and reuses them across kill/restart cycles.
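A common shape for this helper, for reference (a sketch; the existing implementation may differ). Note the inherent race between releasing a port and a server binding it, which the fixtures tolerate by allocating all ports once upfront:

```csharp
using System.Net;
using System.Net.Sockets;

static int AllocateFreePort()
{
    // Bind to port 0 so the OS assigns a free ephemeral port, then release it.
    var listener = new TcpListener(IPAddress.Loopback, 0);
    listener.Start();
    var port = ((IPEndPoint)listener.LocalEndpoint).Port;
    listener.Stop();
    return port;
}
```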