diff --git a/docs/plans/2026-03-12-e2e-cluster-tests-design.md b/docs/plans/2026-03-12-e2e-cluster-tests-design.md
new file mode 100644
index 0000000..38250bd
--- /dev/null
+++ b/docs/plans/2026-03-12-e2e-cluster-tests-design.md
@@ -0,0 +1,116 @@

# E2E Cluster & RAFT Tests Design

## Goal

Create a new test project `NATS.E2E.Cluster.Tests` with comprehensive end-to-end tests for cluster resilience, RAFT consensus, JetStream cluster replication, gateway failover, and leaf node failover. All tests run against real multi-process server instances (not in-memory mocks).

## Architecture

Single new xUnit 3 test project with shared fixtures per topology. Separated from the existing fast `NATS.E2E.Tests` suite (78 tests, ~6s) because cluster failover tests are inherently slower (30s+ timeouts, node kill/restart cycles).

**Dependencies:** `NATS.Client.Core`, `NATS.NKeys`, `xUnit 3`, `Shouldly` — same as existing E2E project. No new NuGet packages.

## Project Structure

```
tests/NATS.E2E.Cluster.Tests/
  NATS.E2E.Cluster.Tests.csproj
  Infrastructure/
    NatsServerProcess.cs       (copied from NATS.E2E.Tests, standalone utility)
    ThreeNodeClusterFixture.cs (3-node full-mesh cluster)
    JetStreamClusterFixture.cs (3-node cluster with JetStream enabled)
    GatewayPairFixture.cs      (2 independent clusters connected by gateway)
    HubLeafFixture.cs          (1 hub + 1 leaf node)
  ClusterResilienceTests.cs    (4 tests)
  RaftConsensusTests.cs        (4 tests)
  JetStreamClusterTests.cs     (4 tests)
  GatewayFailoverTests.cs      (2 tests)
  LeafNodeFailoverTests.cs     (2 tests)
```

## Fixtures

### ThreeNodeClusterFixture
- 3-node full-mesh cluster with monitoring enabled
- Each node: client port, cluster port, monitoring port (all ephemeral)
- `pool_size: 1` for deterministic routing
- Readiness: polls `/routez` on all nodes until `NumRoutes >= 2`
- Used by: ClusterResilienceTests

### JetStreamClusterFixture
- Extends ThreeNodeClusterFixture pattern with `jetstream { store_dir }`
  config
- Temp store dirs per node, cleaned up on dispose
- Used by: RaftConsensusTests, JetStreamClusterTests

### GatewayPairFixture
- 2 servers in separate "clusters" connected by gateway
- Readiness: polls `/gatewayz` until `NumGateways >= 1` on both
- Used by: GatewayFailoverTests

### HubLeafFixture
- 1 hub server + 1 leaf server with solicited connection
- Readiness: polls hub's `/leafz` until `NumLeafs >= 1`
- Used by: LeafNodeFailoverTests

## KillAndRestart Primitive

Core infrastructure for all failover tests. Each fixture tracks `NatsServerProcess[]` by index.

- `KillNode(int index)` — kills the process, waits for exit
- `RestartNode(int index)` — starts new process on same ports/config
- `WaitForFullMesh()` — polls `/routez` until topology restored

Ports are pre-allocated and reused across kill/restart cycles so cluster config remains valid.

## Test Cases (16 total)

### ClusterResilienceTests.cs (ThreeNodeClusterFixture)

| Test | Validates |
|------|-----------|
| `NodeDies_TrafficReroutesToSurvivors` | Kill node 2, pub on node 0, sub on node 1 — messages delivered |
| `NodeRejoins_SubscriptionsPropagateAgain` | Kill node 2, restart, verify new subs on node 2 receive messages from node 0 |
| `AllRoutesReconnect_AfterNodeRestart` | Kill node 1, restart, poll `/routez` until full mesh restored |
| `QueueGroup_NodeDies_RemainingMembersDeliver` | Queue subs on nodes 1+2, kill node 2, node 1 delivers all |

### RaftConsensusTests.cs (JetStreamClusterFixture)

| Test | Validates |
|------|-----------|
| `LeaderElection_ClusterFormsLeader` | Create R3 stream, verify one node reports as leader |
| `LeaderDies_NewLeaderElected` | Identify leader, kill it, verify new leader elected |
| `LogReplication_AllReplicasHaveData` | Publish to R3 stream, verify all replicas report same count |
| `LeaderRestart_RejoinsAsFollower` | Kill leader, wait for new, restart old, verify catchup |

### JetStreamClusterTests.cs
(JetStreamClusterFixture)

| Test | Validates |
|------|-----------|
| `R3Stream_CreateAndPublish_ReplicatedAcrossNodes` | Create R3 stream, publish, verify replicas on 3 nodes |
| `R3Stream_NodeDies_PublishContinues` | Kill replica, continue publishing, verify on survivors |
| `Consumer_NodeDies_PullContinuesOnSurvivor` | Pull consumer, kill its node, verify consumer works elsewhere |
| `R3Stream_Purge_ReplicatedAcrossNodes` | Purge stream, verify all replicas report 0 |

### GatewayFailoverTests.cs (GatewayPairFixture)

| Test | Validates |
|------|-----------|
| `Gateway_Disconnect_Reconnects` | Kill gateway-B, restart, verify messaging resumes |
| `Gateway_InterestUpdated_AfterReconnect` | Sub on B, kill B, restart, resub, verify delivery from A |

### LeafNodeFailoverTests.cs (HubLeafFixture)

| Test | Validates |
|------|-----------|
| `Leaf_Disconnect_ReconnectsToHub` | Kill leaf, restart, verify `/leafz` shows reconnection |
| `Leaf_HubRestart_LeafReconnects` | Kill hub, restart, verify leaf reconnects and messaging resumes |

## Timeout Strategy

- Per-test: 30-second `CancellationTokenSource`
- Fixture readiness (full mesh, gateway, leaf): 30-second timeout, 200ms polling
- Node restart wait: reuse `WaitForTcpReadyAsync()` + topology readiness checks

## Port Allocation

Same `AllocateFreePort()` pattern as existing E2E infrastructure. Each fixture allocates all ports upfront and reuses them across kill/restart cycles.
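The port-allocation and full-mesh-readiness patterns above can be sketched in C#. This is a minimal sketch under assumptions: the `Routez` DTO and `MeshReadiness` helper are illustrative names, not part of the existing `NATS.E2E.Tests` surface; the `num_routes` field comes from the nats-server `/routez` monitoring payload.

```csharp
// Sketch only. AllocateFreePort mirrors the pattern the design names; the
// Routez DTO and MeshReadiness helper are illustrative assumptions.
using System.Net;
using System.Net.Http;
using System.Net.Http.Json;
using System.Net.Sockets;
using System.Text.Json.Serialization;
using System.Threading;
using System.Threading.Tasks;

public static class ClusterPorts
{
    // Bind to port 0 so the OS picks a free port, then release it for the
    // server to claim. Allocated once per fixture and reused across restarts.
    public static int AllocateFreePort()
    {
        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start();
        var port = ((IPEndPoint)listener.LocalEndpoint).Port;
        listener.Stop();
        return port;
    }
}

public static class MeshReadiness
{
    // Maps the num_routes field of each node's /routez monitoring endpoint.
    private sealed record Routez([property: JsonPropertyName("num_routes")] int NumRoutes);

    // Poll every node's /routez until each one reports routes to both peers.
    public static async Task WaitForFullMeshAsync(int[] monitoringPorts, CancellationToken ct)
    {
        using var http = new HttpClient();
        foreach (var port in monitoringPorts)
        {
            while (true)
            {
                ct.ThrowIfCancellationRequested();
                var routez = await http.GetFromJsonAsync<Routez>(
                    $"http://127.0.0.1:{port}/routez", ct);
                if (routez is { NumRoutes: >= 2 }) break; // full mesh for 3 nodes with pool_size: 1
                await Task.Delay(200, ct); // 200ms polling, per the timeout strategy
            }
        }
    }
}
```

The hard-kill/restart primitive would reuse these same ports, which is why `WaitForFullMeshAsync` can poll fixed monitoring addresses across restart cycles.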
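For reference, a per-node server config of the kind the fixtures would generate might look like the following. All ports, names, and paths are illustrative placeholders; only `pool_size: 1` and the `jetstream { store_dir }` block come from the design above.

```
# Node 0 of the 3-node full mesh (illustrative; the fixture substitutes
# ephemeral ports from AllocateFreePort() and a temp store_dir per node).
port: 14222          # client port
http_port: 18222     # monitoring port (/routez, /gatewayz, /leafz)

cluster {
  name: e2e-cluster
  port: 16222        # cluster (route) port
  pool_size: 1       # deterministic routing
  routes: [
    nats-route://127.0.0.1:16223
    nats-route://127.0.0.1:16224
  ]
}

# JetStreamClusterFixture only:
jetstream {
  store_dir: "/tmp/e2e-js-node0"
}
```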