# E2E Cluster & RAFT Tests Design
## Goal
Create a new test project NATS.E2E.Cluster.Tests with comprehensive end-to-end tests for cluster resilience, RAFT consensus, JetStream cluster replication, gateway failover, and leaf node failover. All tests run against real multi-process server instances (not in-memory mocks).
## Architecture
A single new xUnit 3 test project with shared fixtures per topology. It is kept separate from the existing fast NATS.E2E.Tests suite (78 tests, ~6s) because cluster failover tests are inherently slower (30s+ timeouts, node kill/restart cycles).
Dependencies: NATS.Client.Core, NATS.NKeys, xUnit 3, Shouldly — same as existing E2E project. No new NuGet packages.
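Concretely, the project file can mirror the existing E2E suite. A minimal sketch — the target framework, package names/versions, and relative paths below are assumptions, not taken from the repo:

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <!-- Same dependency set as NATS.E2E.Tests; versions are placeholders -->
    <PackageReference Include="xunit.v3" Version="*" />
    <PackageReference Include="Shouldly" Version="*" />
    <ProjectReference Include="../../src/NATS.Client.Core/NATS.Client.Core.csproj" />
    <ProjectReference Include="../../src/NATS.NKeys/NATS.NKeys.csproj" />
  </ItemGroup>
</Project>
```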
## Project Structure

```
tests/NATS.E2E.Cluster.Tests/
  NATS.E2E.Cluster.Tests.csproj
  Infrastructure/
    NatsServerProcess.cs        (copied from NATS.E2E.Tests, standalone utility)
    ThreeNodeClusterFixture.cs  (3-node full-mesh cluster)
    JetStreamClusterFixture.cs  (3-node cluster with JetStream enabled)
    GatewayPairFixture.cs       (2 independent clusters connected by gateway)
    HubLeafFixture.cs           (1 hub + 1 leaf node)
  ClusterResilienceTests.cs     (4 tests)
  RaftConsensusTests.cs         (4 tests)
  JetStreamClusterTests.cs      (4 tests)
  GatewayFailoverTests.cs       (2 tests)
  LeafNodeFailoverTests.cs      (2 tests)
```
## Fixtures
### ThreeNodeClusterFixture

- 3-node full-mesh cluster with monitoring enabled
- Each node: client port, cluster port, monitoring port (all ephemeral)
- `pool_size: 1` for deterministic routing
- Readiness: polls `/routez` on all nodes until `NumRoutes >= 2`
- Used by: ClusterResilienceTests
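For reference, a per-node server config along these lines satisfies the fixture's requirements. This is a sketch: the `<...>` placeholders stand in for the fixture's pre-allocated ephemeral ports, and the cluster name is arbitrary.

```
port: <client_port>          # client connections
http_port: <monitoring_port> # serves /routez, /varz, ...

cluster {
  name: e2e-cluster
  port: <cluster_port>
  pool_size: 1               # single route per neighbor for deterministic routing
  routes: [
    nats-route://127.0.0.1:<other_node_cluster_port>
    nats-route://127.0.0.1:<other_node_cluster_port>
  ]
}
```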
### JetStreamClusterFixture

- Extends the ThreeNodeClusterFixture pattern with `jetstream { store_dir }` config
- Temp store dirs per node, cleaned up on dispose
- Used by: RaftConsensusTests, JetStreamClusterTests
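The JetStream addition on top of the base node config is small; a sketch of what each node would add (note that clustered JetStream also requires a unique `server_name` per node — an implementation detail not called out above):

```
server_name: node-0          # must be unique per node for clustered JetStream

jetstream {
  store_dir: "<temp_dir_for_this_node>"   # deleted on fixture dispose
}
```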
### GatewayPairFixture

- 2 servers in separate "clusters" connected by gateway
- Readiness: polls `/gatewayz` until `NumGateways >= 1` on both
- Used by: GatewayFailoverTests
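A sketch of the gateway wiring for one side (mirrored on the other; the gateway names and placeholder ports are illustrative):

```
# server in cluster A
gateway {
  name: A
  port: <gwA_port>
  gateways: [
    { name: B, url: "nats://127.0.0.1:<gwB_port>" }
  ]
}
```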
### HubLeafFixture

- 1 hub server + 1 leaf server with a solicited connection
- Readiness: polls the hub's `/leafz` until `NumLeafs >= 1`
- Used by: LeafNodeFailoverTests
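The solicited leaf connection can be sketched as two config fragments (placeholder ports; the leaf dials the hub, so only the hub listens):

```
# hub: accepts leaf connections
leafnodes {
  port: <hub_leaf_port>
}

# leaf: solicits the connection to the hub
leafnodes {
  remotes: [
    { url: "nats-leaf://127.0.0.1:<hub_leaf_port>" }
  ]
}
```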
## KillAndRestart Primitive

Core infrastructure for all failover tests. Each fixture tracks its `NatsServerProcess[]` by index.

- `KillNode(int index)` — kills the process, waits for exit
- `RestartNode(int index)` — starts a new process on the same ports/config
- `WaitForFullMesh()` — polls `/routez` until the topology is restored
Ports are pre-allocated and reused across kill/restart cycles so cluster config remains valid.
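A self-contained sketch of what `WaitForFullMesh()` could look like, polling each node's monitoring endpoint (the `num_routes` field of `/routez`); the method name and parameters here are assumptions about the fixture surface:

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

static async Task WaitForFullMeshAsync(int[] monitorPorts, CancellationToken ct)
{
    using var http = new HttpClient();
    while (true)
    {
        ct.ThrowIfCancellationRequested();
        var meshed = true;
        foreach (var port in monitorPorts)
        {
            try
            {
                var json = await http.GetStringAsync($"http://127.0.0.1:{port}/routez", ct);
                using var doc = JsonDocument.Parse(json);
                // Full mesh with pool_size 1: every node has one route to each other node.
                if (doc.RootElement.GetProperty("num_routes").GetInt32() < monitorPorts.Length - 1)
                    meshed = false;
            }
            catch (HttpRequestException)
            {
                meshed = false; // node not reachable yet (e.g. mid-restart)
            }
        }
        if (meshed) return;
        await Task.Delay(200, ct); // 200ms polling, per the timeout strategy below
    }
}
```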
## Test Cases (16 total)
### ClusterResilienceTests.cs (ThreeNodeClusterFixture)

| Test | Validates |
|---|---|
| `NodeDies_TrafficReroutesToSurvivors` | Kill node 2, pub on node 0, sub on node 1 — messages delivered |
| `NodeRejoins_SubscriptionsPropagateAgain` | Kill node 2, restart, verify new subs on node 2 receive messages from node 0 |
| `AllRoutesReconnect_AfterNodeRestart` | Kill node 1, restart, poll `/routez` until full mesh restored |
| `QueueGroup_NodeDies_RemainingMembersDeliver` | Queue subs on nodes 1+2, kill node 2, node 1 delivers all |
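The first test in the table might be shaped like this. A sketch only: `Fixture.ClientUrl(i)` and `Fixture.KillNodeAsync(i)` are assumed fixture members, the subscribe/publish calls assume the NATS.Client.Core API as used by the existing E2E tests, and the fixed delay is a crude stand-in for a proper interest-propagation check:

```csharp
[Fact]
public async Task NodeDies_TrafficReroutesToSurvivors()
{
    using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));

    await using var subConn = new NatsConnection(new NatsOpts { Url = Fixture.ClientUrl(1) });
    await using var pubConn = new NatsConnection(new NatsOpts { Url = Fixture.ClientUrl(0) });

    var received = new TaskCompletionSource<string>();
    _ = Task.Run(async () =>
    {
        await foreach (var msg in subConn.SubscribeAsync<string>("e2e.reroute", cancellationToken: cts.Token))
        {
            received.TrySetResult(msg.Data!);
            break;
        }
    }, cts.Token);

    await Task.Delay(500, cts.Token); // crude wait for subject interest to propagate over routes
    await Fixture.KillNodeAsync(2);   // survivors: node 0 (publisher) and node 1 (subscriber)
    await pubConn.PublishAsync("e2e.reroute", "hello");

    (await received.Task.WaitAsync(cts.Token)).ShouldBe("hello");
}
```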
### RaftConsensusTests.cs (JetStreamClusterFixture)

| Test | Validates |
|---|---|
| `LeaderElection_ClusterFormsLeader` | Create R3 stream, verify one node reports as leader |
| `LeaderDies_NewLeaderElected` | Identify leader, kill it, verify new leader elected |
| `LogReplication_AllReplicasHaveData` | Publish to R3 stream, verify all replicas report same count |
| `LeaderRestart_RejoinsAsFollower` | Kill leader, wait for new, restart old, verify catchup |
### JetStreamClusterTests.cs (JetStreamClusterFixture)

| Test | Validates |
|---|---|
| `R3Stream_CreateAndPublish_ReplicatedAcrossNodes` | Create R3 stream, publish, verify replicas on 3 nodes |
| `R3Stream_NodeDies_PublishContinues` | Kill replica, continue publishing, verify on survivors |
| `Consumer_NodeDies_PullContinuesOnSurvivor` | Pull consumer, kill its node, verify consumer works elsewhere |
| `R3Stream_Purge_ReplicatedAcrossNodes` | Purge stream, verify all replicas report 0 |
### GatewayFailoverTests.cs (GatewayPairFixture)

| Test | Validates |
|---|---|
| `Gateway_Disconnect_Reconnects` | Kill gateway-B, restart, verify messaging resumes |
| `Gateway_InterestUpdated_AfterReconnect` | Sub on B, kill B, restart, resub, verify delivery from A |
### LeafNodeFailoverTests.cs (HubLeafFixture)

| Test | Validates |
|---|---|
| `Leaf_Disconnect_ReconnectsToHub` | Kill leaf, restart, verify `/leafz` shows reconnection |
| `Leaf_HubRestart_LeafReconnects` | Kill hub, restart, verify leaf reconnects and messaging resumes |
## Timeout Strategy

- Per-test: 30-second `CancellationTokenSource`
- Fixture readiness (full mesh, gateway, leaf): 30-second timeout, 200ms polling
- Node restart wait: reuse `WaitForTcpReadyAsync()` plus topology readiness checks
## Port Allocation

Same `AllocateFreePort()` pattern as the existing E2E infrastructure. Each fixture allocates all ports upfront and reuses them across kill/restart cycles.
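A common shape for this helper, for reference (a sketch; the existing implementation may differ). Note the inherent race between releasing a port and a server binding it, which the fixtures tolerate by allocating all ports once upfront:

```csharp
using System.Net;
using System.Net.Sockets;

static int AllocateFreePort()
{
    // Bind to port 0 so the OS assigns a free ephemeral port, then release it.
    var listener = new TcpListener(IPAddress.Loopback, 0);
    listener.Start();
    var port = ((IPEndPoint)listener.LocalEndpoint).Port;
    listener.Stop();
    return port;
}
```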