docs: add E2E cluster & raft tests design document

Joseph Doherty
2026-03-12 23:22:31 -04:00
parent 95e9f0a92e
commit b5e1786ec2


# E2E Cluster & Raft Tests Design
## Goal
Create a new test project `NATS.E2E.Cluster.Tests` with comprehensive end-to-end tests for cluster resilience, Raft consensus, JetStream cluster replication, gateway failover, and leaf node failover. All tests run against real multi-process server instances (not in-memory mocks).
## Architecture
A single new xUnit 3 test project with shared fixtures per topology. It is kept separate from the existing fast `NATS.E2E.Tests` suite (78 tests, ~6s) because cluster failover tests are inherently slower (30s+ timeouts, node kill/restart cycles).
**Dependencies:** `NATS.Client.Core`, `NATS.NKeys`, `xUnit 3`, `Shouldly` — same as existing E2E project. No new NuGet packages.
## Project Structure
```
tests/NATS.E2E.Cluster.Tests/
  NATS.E2E.Cluster.Tests.csproj
  Infrastructure/
    NatsServerProcess.cs          (copied from NATS.E2E.Tests, standalone utility)
    ThreeNodeClusterFixture.cs    (3-node full-mesh cluster)
    JetStreamClusterFixture.cs    (3-node cluster with JetStream enabled)
    GatewayPairFixture.cs         (2 independent clusters connected by gateway)
    HubLeafFixture.cs             (1 hub + 1 leaf node)
  ClusterResilienceTests.cs       (4 tests)
  RaftConsensusTests.cs           (4 tests)
  JetStreamClusterTests.cs        (4 tests)
  GatewayFailoverTests.cs         (2 tests)
  LeafNodeFailoverTests.cs        (2 tests)
```
## Fixtures
### ThreeNodeClusterFixture
- 3-node full-mesh cluster with monitoring enabled
- Each node: client port, cluster port, monitoring port (all ephemeral)
- `pool_size: 1` for deterministic routing
- Readiness: polls `/routez` on all nodes until `NumRoutes >= 2`
- Used by: ClusterResilienceTests
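A per-node config along these lines would match the bullets above (a sketch, not the fixture's generated output — ports are hardcoded here for illustration, while the fixture allocates ephemeral ones):

```
# node-0.conf (illustrative; fixture substitutes ephemeral ports)
port: 4222                 # client port
http_port: 8222            # monitoring port (/routez, /varz, ...)

cluster {
  name: e2e
  port: 6222               # cluster route port
  pool_size: 1             # single route connection => deterministic routing
  routes: [
    nats://127.0.0.1:6223  # node 1
    nats://127.0.0.1:6224  # node 2
  ]
}
```

With three such nodes, each node's `/routez` reports `NumRoutes: 2` once the full mesh forms, which is exactly the readiness condition above.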
### JetStreamClusterFixture
- Extends ThreeNodeClusterFixture pattern with `jetstream { store_dir }` config
- Temp store dirs per node, cleaned up on dispose
- Used by: RaftConsensusTests, JetStreamClusterTests
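A sketch of the per-node additions over the plain cluster config (note that JetStream clustering requires a unique `server_name` per node; the store path shown is a placeholder for the fixture's temp dir):

```
# node-0 additions (illustrative)
server_name: e2e-js-0      # must be unique per node for JetStream clustering

jetstream {
  store_dir: "/tmp/nats-e2e-js/node-0"   # fixture uses a temp dir, deleted on dispose
}
```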
### GatewayPairFixture
- 2 servers in separate "clusters" connected by gateway
- Readiness: polls `/gatewayz` until `NumGateways >= 1` on both
- Used by: GatewayFailoverTests
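The gateway wiring could look like this (illustrative names and ports; each server is its own one-node "cluster"):

```
# server A (illustrative)
port: 4222
http_port: 8222
cluster { name: A, port: 6222 }
gateway {
  name: A
  port: 7222
  gateways: [
    { name: B, url: "nats://127.0.0.1:7223" }   # solicit B's gateway port
  ]
}
# server B mirrors this with name: B and a gateway entry pointing at A
```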
### HubLeafFixture
- 1 hub server + 1 leaf server with solicited connection
- Readiness: polls hub's `/leafz` until `NumLeafs >= 1`
- Used by: LeafNodeFailoverTests
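A minimal hub/leaf pair might be configured as follows (illustrative ports; the leaf solicits the connection, so only the hub's `/leafz` needs polling):

```
# hub.conf (illustrative)
port: 4222
http_port: 8222
leafnodes { port: 7422 }          # accept leaf connections

# leaf.conf (illustrative)
port: 4223
leafnodes {
  remotes: [ { url: "nats://127.0.0.1:7422" } ]   # solicited connection to hub
}
```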
## KillAndRestart Primitive
Core infrastructure for all failover tests. Each fixture tracks `NatsServerProcess[]` by index.
- `KillNode(int index)` — kills the process, waits for exit
- `RestartNode(int index)` — starts new process on same ports/config
- `WaitForFullMesh()` — polls `/routez` until topology restored
Ports are pre-allocated and reused across kill/restart cycles so cluster config remains valid.
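The primitive could be sketched like this (a design sketch, not final code — `NatsServerProcess`, `WaitForTcpReadyAsync`, and the helper names below are the identifiers this document assumes, with `GetRoutezAsync`/`PollUntilAsync` as hypothetical internals):

```csharp
public sealed partial class ThreeNodeClusterFixture
{
    private readonly NatsServerProcess?[] _nodes;                    // index-stable across restarts
    private readonly (int Client, int Cluster, int Monitor)[] _ports; // pre-allocated once

    public async Task KillNode(int index)
    {
        var node = _nodes[index] ?? throw new InvalidOperationException($"node {index} already dead");
        await node.KillAsync();          // terminate the process and wait for exit
        _nodes[index] = null;
    }

    public async Task RestartNode(int index)
    {
        // Same ports and config file, so the surviving nodes' route lists stay valid.
        _nodes[index] = await NatsServerProcess.StartAsync(ConfigPath(index));
        await WaitForTcpReadyAsync(_ports[index].Client);
    }

    public async Task WaitForFullMesh(CancellationToken ct)
    {
        // Poll /routez on every node until each reports NumRoutes >= 2.
        foreach (var (_, _, monitor) in _ports)
            await PollUntilAsync(async () =>
                (await GetRoutezAsync(monitor)).NumRoutes >= 2, ct);
    }
}
```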
## Test Cases (16 total)
### ClusterResilienceTests.cs (ThreeNodeClusterFixture)
| Test | Validates |
|------|-----------|
| `NodeDies_TrafficReroutesToSurvivors` | Kill node 2, pub on node 0, sub on node 1 — messages delivered |
| `NodeRejoins_SubscriptionsPropagateAgain` | Kill node 2, restart, verify new subs on node 2 receive messages from node 0 |
| `AllRoutesReconnect_AfterNodeRestart` | Kill node 1, restart, poll `/routez` until full mesh restored |
| `QueueGroup_NodeDies_RemainingMembersDeliver` | Queue subs on nodes 1+2, kill node 2, node 1 delivers all |
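As a shape for these tests, the first row might look like this (a hedged sketch: `ClientUrl` and `KillNode` are the fixture helpers this design assumes; the `NatsConnection`/`SubscribeAsync` usage follows the NATS.Client.Core async-enumerable API, and the publish is retried because interest propagation between survivors is asynchronous):

```csharp
[Fact]
public async Task NodeDies_TrafficReroutesToSurvivors()
{
    using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));

    await using var sub = new NatsConnection(new NatsOpts { Url = _fixture.ClientUrl(1) });
    await using var pub = new NatsConnection(new NatsOpts { Url = _fixture.ClientUrl(0) });

    await _fixture.KillNode(2);

    var received = new TaskCompletionSource<string>();
    _ = Task.Run(async () =>
    {
        await foreach (var msg in sub.SubscribeAsync<string>("e2e.test", cancellationToken: cts.Token))
            received.TrySetResult(msg.Data!);
    });

    // Retry until interest has propagated across the surviving route.
    while (!received.Task.IsCompleted)
    {
        await pub.PublishAsync("e2e.test", "hello", cancellationToken: cts.Token);
        await Task.Delay(100, cts.Token);
    }

    (await received.Task).ShouldBe("hello");
}
```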
### RaftConsensusTests.cs (JetStreamClusterFixture)
| Test | Validates |
|------|-----------|
| `LeaderElection_ClusterFormsLeader` | Create R3 stream, verify one node reports as leader |
| `LeaderDies_NewLeaderElected` | Identify leader, kill it, verify new leader elected |
| `LogReplication_AllReplicasHaveData` | Publish to R3 stream, verify all replicas report same count |
| `LeaderRestart_RejoinsAsFollower` | Kill leader, wait for new, restart old, verify catchup |
### JetStreamClusterTests.cs (JetStreamClusterFixture)
| Test | Validates |
|------|-----------|
| `R3Stream_CreateAndPublish_ReplicatedAcrossNodes` | Create R3 stream, publish, verify replicas on 3 nodes |
| `R3Stream_NodeDies_PublishContinues` | Kill replica, continue publishing, verify on survivors |
| `Consumer_NodeDies_PullContinuesOnSurvivor` | Pull consumer, kill its node, verify consumer works elsewhere |
| `R3Stream_Purge_ReplicatedAcrossNodes` | Purge stream, verify all replicas report 0 |
### GatewayFailoverTests.cs (GatewayPairFixture)
| Test | Validates |
|------|-----------|
| `Gateway_Disconnect_Reconnects` | Kill gateway-B, restart, verify messaging resumes |
| `Gateway_InterestUpdated_AfterReconnect` | Sub on B, kill B, restart, resub, verify delivery from A |
### LeafNodeFailoverTests.cs (HubLeafFixture)
| Test | Validates |
|------|-----------|
| `Leaf_Disconnect_ReconnectsToHub` | Kill leaf, restart, verify `/leafz` shows reconnection |
| `Leaf_HubRestart_LeafReconnects` | Kill hub, restart, verify leaf reconnects and messaging resumes |
## Timeout Strategy
- Per-test: 30-second `CancellationTokenSource`
- Fixture readiness (full mesh, gateway, leaf): 30-second timeout, 200ms polling
- Node restart wait: reuse `WaitForTcpReadyAsync()` + topology readiness checks
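The readiness checks above reduce to one polling helper (hypothetical name, matching the 30s budget and 200ms interval stated here):

```csharp
// Poll a condition every 200ms; the caller's 30s CancellationToken
// turns a missed deadline into a TaskCanceledException, failing the test.
static async Task PollUntilAsync(Func<Task<bool>> condition, CancellationToken ct)
{
    while (!await condition())
        await Task.Delay(TimeSpan.FromMilliseconds(200), ct);
}
```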
## Port Allocation
Same `AllocateFreePort()` pattern as existing E2E infrastructure. Each fixture allocates all ports upfront and reuses them across kill/restart cycles.
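For reference, `AllocateFreePort()` in this pattern is typically the bind-to-port-zero trick (a sketch of the common implementation, not necessarily the existing project's code):

```csharp
using System.Net;
using System.Net.Sockets;

// Ask the OS for an ephemeral port, then release it for the server to claim.
// Small race window between Stop() and server start, acceptable for tests.
static int AllocateFreePort()
{
    var listener = new TcpListener(IPAddress.Loopback, 0);
    listener.Start();
    var port = ((IPEndPoint)listener.LocalEndpoint).Port;
    listener.Stop();
    return port;
}
```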