natsdotnet/docs/plans/2026-03-12-e2e-cluster-tests-design.md
2026-03-12 23:22:31 -04:00

E2E Cluster & RAFT Tests Design

Goal

Create a new test project NATS.E2E.Cluster.Tests with comprehensive end-to-end tests for cluster resilience, RAFT consensus, JetStream cluster replication, gateway failover, and leaf node failover. All tests run against real multi-process server instances (not in-memory mocks).

Architecture

Single new xUnit 3 test project with shared fixtures per topology. Separated from the existing fast NATS.E2E.Tests suite (78 tests, ~6s) because cluster failover tests are inherently slower (30s+ timeouts, node kill/restart cycles).

Dependencies: NATS.Client.Core, NATS.NKeys, xUnit 3, Shouldly — same as existing E2E project. No new NuGet packages.
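Because the dependency set matches the existing E2E project, the project file stays small. A sketch is below; the target framework, package versions, and relative project paths are placeholders to be copied from the existing NATS.E2E.Tests project, not confirmed values:

```xml
<!-- NATS.E2E.Cluster.Tests.csproj (sketch; versions and paths are placeholders) -->
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
    <IsPackable>false</IsPackable>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="xunit.v3" Version="1.*" />
    <PackageReference Include="Shouldly" Version="4.*" />
    <ProjectReference Include="..\..\src\NATS.Client.Core\NATS.Client.Core.csproj" />
    <ProjectReference Include="..\..\src\NATS.NKeys\NATS.NKeys.csproj" />
  </ItemGroup>
</Project>
```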

Project Structure

tests/NATS.E2E.Cluster.Tests/
  NATS.E2E.Cluster.Tests.csproj
  Infrastructure/
    NatsServerProcess.cs            (copied from NATS.E2E.Tests, standalone utility)
    ThreeNodeClusterFixture.cs      (3-node full-mesh cluster)
    JetStreamClusterFixture.cs      (3-node cluster with JetStream enabled)
    GatewayPairFixture.cs           (2 independent clusters connected by gateway)
    HubLeafFixture.cs               (1 hub + 1 leaf node)
  ClusterResilienceTests.cs         (4 tests)
  RaftConsensusTests.cs             (4 tests)
  JetStreamClusterTests.cs         (4 tests)
  GatewayFailoverTests.cs           (2 tests)
  LeafNodeFailoverTests.cs          (2 tests)

Fixtures

ThreeNodeClusterFixture

  • 3-node full-mesh cluster with monitoring enabled
  • Each node: client port, cluster port, monitoring port (all ephemeral)
  • pool_size: 1 for deterministic routing
  • Readiness: polls /routez on all nodes until NumRoutes >= 2
  • Used by: ClusterResilienceTests
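The per-node configuration the fixture generates could look like the following sketch (node 0 shown; the cluster name, ports, and peer addresses are illustrative, since the fixture substitutes ephemeral ports at startup):

```conf
# Sketch of node 0's config; all ports and the cluster name are
# placeholders chosen by the fixture at runtime.
port: 4222            # client port
http_port: 8222       # monitoring port (serves /routez)

cluster {
  name: e2e-cluster
  port: 6222          # route port
  pool_size: 1        # single route connection per peer for deterministic routing
  routes: [
    nats://127.0.0.1:6223    # node 1
    nats://127.0.0.1:6224    # node 2
  ]
}
```

With pool_size 1 and two peers, each node's /routez reports NumRoutes of 2 once the full mesh forms, which is exactly what the readiness poll checks.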

JetStreamClusterFixture

  • Extends ThreeNodeClusterFixture pattern with jetstream { store_dir } config
  • Temp store dirs per node, cleaned up on dispose
  • Used by: RaftConsensusTests, JetStreamClusterTests
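Relative to the plain cluster config, each node would additionally carry a JetStream block. A sketch, with the server name and store path as placeholders:

```conf
# Added to each node's config by JetStreamClusterFixture (sketch).
server_name: node0                      # unique per node; required for JetStream clustering
jetstream {
  store_dir: "/tmp/nats-e2e/node0"      # per-node temp dir, deleted on dispose
}
```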

GatewayPairFixture

  • 2 servers in separate "clusters" connected by gateway
  • Readiness: polls /gatewayz until NumGateways >= 1 on both
  • Used by: GatewayFailoverTests
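A sketch of the pair of configs (gateway names, ports, and URLs are illustrative):

```conf
# Cluster A's server (sketch)
port: 4222
http_port: 8222       # serves /gatewayz for the readiness poll
gateway {
  name: A
  port: 7222
  gateways: [
    { name: B, url: "nats://127.0.0.1:7223" }
  ]
}

# Cluster B's server (sketch)
port: 4223
http_port: 8223
gateway {
  name: B
  port: 7223
  gateways: [
    { name: A, url: "nats://127.0.0.1:7222" }
  ]
}
```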

HubLeafFixture

  • 1 hub server + 1 leaf server with solicited connection
  • Readiness: polls hub's /leafz until NumLeafs >= 1
  • Used by: LeafNodeFailoverTests
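The hub/leaf pair could be configured along these lines (ports and URLs are placeholders):

```conf
# Hub (sketch): accepts leaf node connections
port: 4222
http_port: 8222       # serves /leafz for the readiness poll
leafnodes {
  port: 7422
}

# Leaf (sketch): solicits a connection to the hub
port: 4223
leafnodes {
  remotes: [
    { url: "nats://127.0.0.1:7422" }
  ]
}
```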

KillAndRestart Primitive

Core infrastructure for all failover tests. Each fixture tracks NatsServerProcess[] by index.

  • KillNode(int index) — kills the process, waits for exit
  • RestartNode(int index) — starts new process on same ports/config
  • WaitForFullMesh() — polls /routez until topology restored

Ports are pre-allocated and reused across kill/restart cycles so cluster config remains valid.

Test Cases (16 total)

ClusterResilienceTests.cs (ThreeNodeClusterFixture)

  • NodeDies_TrafficReroutesToSurvivors — Kill node 2, publish on node 0, subscribe on node 1; messages still delivered
  • NodeRejoins_SubscriptionsPropagateAgain — Kill node 2, restart it, verify new subs on node 2 receive messages from node 0
  • AllRoutesReconnect_AfterNodeRestart — Kill node 1, restart it, poll /routez until the full mesh is restored
  • QueueGroup_NodeDies_RemainingMembersDeliver — Queue subs on nodes 1 and 2, kill node 2, node 1 delivers all messages

RaftConsensusTests.cs (JetStreamClusterFixture)

  • LeaderElection_ClusterFormsLeader — Create an R3 stream, verify exactly one node reports as leader
  • LeaderDies_NewLeaderElected — Identify the leader, kill it, verify a new leader is elected
  • LogReplication_AllReplicasHaveData — Publish to an R3 stream, verify all replicas report the same message count
  • LeaderRestart_RejoinsAsFollower — Kill the leader, wait for a new one, restart the old leader, verify it catches up

JetStreamClusterTests.cs (JetStreamClusterFixture)

  • R3Stream_CreateAndPublish_ReplicatedAcrossNodes — Create an R3 stream, publish, verify replicas on all 3 nodes
  • R3Stream_NodeDies_PublishContinues — Kill a replica, continue publishing, verify messages on the survivors
  • Consumer_NodeDies_PullContinuesOnSurvivor — Create a pull consumer, kill its node, verify the consumer keeps working elsewhere
  • R3Stream_Purge_ReplicatedAcrossNodes — Purge the stream, verify all replicas report 0 messages

GatewayFailoverTests.cs (GatewayPairFixture)

  • Gateway_Disconnect_Reconnects — Kill gateway B, restart it, verify messaging resumes across the gateway
  • Gateway_InterestUpdated_AfterReconnect — Subscribe on B, kill B, restart, resubscribe, verify delivery from A

LeafNodeFailoverTests.cs (HubLeafFixture)

  • Leaf_Disconnect_ReconnectsToHub — Kill the leaf, restart it, verify /leafz on the hub shows the reconnection
  • Leaf_HubRestart_LeafReconnects — Kill the hub, restart it, verify the leaf reconnects and messaging resumes

Timeout Strategy

  • Per-test: 30-second CancellationTokenSource
  • Fixture readiness (full mesh, gateway, leaf): 30-second timeout, 200ms polling
  • Node restart wait: reuse WaitForTcpReadyAsync() + topology readiness checks

Port Allocation

Same AllocateFreePort() pattern as existing E2E infrastructure. Each fixture allocates all ports upfront and reuses them across kill/restart cycles.