Production-ready multi-node distributed PyTorch training setup using WSL2, NCCL, and torchrun.
Scale your deep learning workloads across multiple Windows machines with GPUs, leveraging NVIDIA's NCCL for high-performance GPU-to-GPU communication.
This project provides a complete, battle-tested solution for distributed PyTorch training across multiple physical machines:
- Platform: Windows 11 + WSL2 (Ubuntu)
- Backend: NCCL (GPU-accelerated) + Gloo (CPU fallback)
- Networking: WSL2 Mirrored Networking Mode
- GPUs: NVIDIA CUDA-enabled GPUs (RTX/GTX series)
- Rendezvous: PyTorch's native distributed launcher (`torchrun`)
```
┌─────────────────────────────────┐     ┌─────────────────────────────────┐
│ Node 01 (Master - Rank 0)       │     │ Node 02 (Worker - Rank 1)       │
│ ┌───────────────────────────┐   │     │ ┌───────────────────────────┐   │
│ │ Windows 11 + WSL2         │   │     │ │ Windows 11 + WSL2         │   │
│ │ ├─ Ubuntu 22.04           │   │     │ │ ├─ Ubuntu 22.04           │   │
│ │ ├─ PyTorch 2.x + CUDA     │   │     │ │ ├─ PyTorch 2.x + CUDA     │   │
│ │ ├─ NCCL 2.21.5            │   │     │ │ ├─ NCCL 2.21.5            │   │
│ │ └─ NVIDIA GPU (CUDA)      │   │     │ │ └─ NVIDIA GPU (CUDA)      │   │
│ └───────────────────────────┘   │     │ └───────────────────────────┘   │
│ 192.168.29.67                   │     │ 192.168.29.197                  │
└────────────────┬────────────────┘     └────────────────┬────────────────┘
                 │                                       │
                 └───────────────────┬───────────────────┘
                                     │
                      Local Network / Direct Ethernet
                          (10-50 GB/s with NCCL)
```
- ✅ Multi-node NCCL: True GPU-to-GPU communication across physical machines
- ✅ WSL2 Native: No Docker, no VM overhead - direct CUDA access
- ✅ Mirrored Networking: WSL2's latest networking mode for seamless connectivity
- ✅ Dual Backend: NCCL for performance, Gloo for compatibility (see the sketch below)
- ✅ Production Ready: Comprehensive troubleshooting and firewall configuration
- ✅ Easy Setup: Step-by-step guides for both master and worker nodes
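The dual-backend choice boils down to a few lines of setup: use NCCL when a CUDA device is visible and fall back to Gloo otherwise. A minimal sketch of that pattern (illustrative only, not the exact code in `train_torchrun.py`):

```python
import os

import torch
import torch.distributed as dist


def init_distributed() -> str:
    """Pick NCCL when a GPU is available, otherwise fall back to Gloo."""
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    # torchrun exports RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT,
    # which init_method="env://" reads automatically.
    dist.init_process_group(backend=backend, init_method="env://")
    if backend == "nccl":
        torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
    return backend


if __name__ == "__main__":
    print(f"Initializing with backend: {init_distributed()}")
    dist.destroy_process_group()
```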
```
dist-gpu-windows/
├── train_torchrun.py              # Main training script (NCCL backend)
├── worker.py                      # Standalone worker script
├── requirements.txt               # Python dependencies
│
├── Documentation
│   ├── NODE01_MASTER.md           # Complete Node 01 setup guide
│   ├── NODE02_WORKER.md           # Complete Node 02 setup guide
│   ├── QUICK_START.md             # Quick reference guide
│   ├── SOLUTION.md                # Architecture & design decisions
│   ├── WSL_MIRRORED_NETWORKING.md # Mirrored networking setup
│   └── WINDOWS_SETUP.md           # Windows-specific instructions
│
├── Launch Scripts
│   ├── run_node0.sh               # Launch master node
│   ├── run_node1.sh               # Launch worker node
│   ├── run_single_node.sh         # Single-node testing
│   └── run_train.sh               # Training launcher
│
├── Archive & Tests
│   ├── archive/                   # Previous implementations
│   ├── Test/                      # Test scripts and utilities
│   └── issues/                    # Issue tracking and solutions
│
└── Experimental
    └── L1-nbdistributed/          # Jupyter notebook experiments
```
| File | Purpose | When to Use |
|---|---|---|
| `train_torchrun.py` | Main distributed training script with NCCL | Multi-node GPU training (recommended) |
| `NODE01_MASTER.md` | Master node setup instructions | Setting up Node 01 |
| `NODE02_WORKER.md` | Worker node setup instructions | Setting up Node 02 |
| `QUICK_START.md` | Quick reference and troubleshooting | Fast setup & debugging |
| `WSL_MIRRORED_NETWORKING.md` | WSL2 networking configuration | Enabling mirrored mode |
System Requirements (Both Nodes):
- Windows 11 (Build 22621+ for mirrored networking)
- WSL2 with Ubuntu 22.04
- NVIDIA GPU (RTX/GTX series)
- NVIDIA CUDA on WSL driver
- Same local network or direct Ethernet connection
Software Requirements:
- Python 3.10+
- PyTorch 2.x with CUDA 12.4 support
- Git
On Both Node 01 and Node 02:
```powershell
wsl --install -d Ubuntu-22.04
```

```bash
git clone <your-repo-url> ~/dist-gpu-windows
cd ~/dist-gpu-windows

# Install dependencies
sudo apt update && sudo apt install -y build-essential python3 python3-pip python3-venv git

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install PyTorch with CUDA
pip install --upgrade pip setuptools wheel
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```

```powershell
# Create .wslconfig
@"
[wsl2]
networkingMode=mirrored
"@ | Out-File -FilePath "$env:USERPROFILE\.wslconfig" -Encoding ASCII -Force

# Restart WSL
wsl --shutdown
Start-Sleep -Seconds 30
wsl
```

```powershell
# Disable Windows Firewall for Private network (recommended for testing)
Set-NetFirewallProfile -Profile Private -Enabled False

# OR add specific rules
New-NetFirewallRule -DisplayName "PyTorch Distributed 29500" -Direction Inbound -LocalPort 29500 -Protocol TCP -Action Allow
New-NetFirewallRule -DisplayName "NCCL Communication Ports" -Direction Inbound -LocalPort 20000-40000 -Protocol TCP -Action Allow
```

- Norton/Avira: Disable firewall during testing
- This prevents blocking of NCCL GPU communication
Step 1: Get Node 01's IP (WSL on Node 01)

```bash
ip addr show eth0 | grep "inet "
# Example: 192.168.29.67
```

Step 2: Start Master Node (WSL on Node 01)
```bash
cd ~/dist-gpu-windows
source .venv/bin/activate

# Set NCCL environment variables
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO

# Launch master (replace IP with your Node 01 IP)
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
  --master_addr=192.168.29.67 \
  --master_port=29500 \
  train_torchrun.py
```

Step 3: Start Worker Node (WSL on Node 02)
```bash
cd ~/dist-gpu-windows
source .venv/bin/activate

# Set NCCL environment variables
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO

# Launch worker (use Node 01's IP)
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 \
  --master_addr=192.168.29.67 \
  --master_port=29500 \
  train_torchrun.py
```

Both nodes should show:
```
Initializing with backend: nccl
[rank X] world_size=2 device=cuda hostname=NODE-NAME
NCCL INFO Bootstrap : Using eth0:192.168.29.XX<0>
NCCL INFO Connected all rings
NCCL INFO Connected all trees
[rank X] gathered=[0, 1]
[rank X] barrier OK; shutting down
```
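Each of those log lines corresponds to a handful of `torch.distributed` calls. Below is a minimal sketch of a torchrun-launched script that prints similar output; it illustrates the flow and is not necessarily the exact contents of `train_torchrun.py`:

```python
import os
import socket

import torch
import torch.distributed as dist


def main():
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    print(f"Initializing with backend: {backend}")
    # torchrun provides MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, LOCAL_RANK.
    dist.init_process_group(backend=backend, init_method="env://")

    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    if backend == "nccl":
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
    else:
        device = torch.device("cpu")

    print(f"[rank {rank}] world_size={world_size} device={device.type} "
          f"hostname={socket.gethostname()}")

    # Each rank contributes its own rank id; all_gather collects all of them.
    mine = torch.tensor([rank], device=device)
    gathered = [torch.zeros_like(mine) for _ in range(world_size)]
    dist.all_gather(gathered, mine)
    print(f"[rank {rank}] gathered={[int(t.item()) for t in gathered]}")

    # Barrier makes sure every rank reached this point before shutdown.
    dist.barrier()
    print(f"[rank {rank}] barrier OK; shutting down")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```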
For complete setup instructions and troubleshooting:
| Guide | Description |
|---|---|
| NODE01_MASTER.md | Complete setup for master node (Rank 0) |
| NODE02_WORKER.md | Complete setup for worker node (Rank 1) |
| QUICK_START.md | Quick reference and common commands |
| WSL_MIRRORED_NETWORKING.md | WSL2 networking configuration |
| SOLUTION.md | Architecture and design decisions |
| Setting | Value | Description |
|---|---|---|
| Port | 29500 | Master rendezvous port |
| Backend | NCCL / Gloo | NCCL for GPU, Gloo for CPU fallback |
| Rendezvous | Native `torchrun` | Uses `--master_addr` and `--master_port` |
| Network Mode | WSL2 Mirrored | Direct host network access |
| Communication | TCP/IP | Over Ethernet (eth0) |
```bash
export NCCL_IB_DISABLE=1        # Disable InfiniBand (not available on consumer hardware)
export NCCL_P2P_DISABLE=1       # Disable peer-to-peer GPU access (for multi-node)
export NCCL_SOCKET_IFNAME=eth0  # Use Ethernet interface
export NCCL_DEBUG=INFO          # Enable debug logging
```

```bash
MASTER_ADDR=192.168.29.67  # Set via --master_addr
MASTER_PORT=29500          # Set via --master_port
WORLD_SIZE=2               # Total number of processes
RANK=0/1                   # Process rank (0=master, 1=worker)
LOCAL_RANK=0               # GPU index on local machine
```

```bash
python -c "import torch; print('PyTorch:', torch.__version__); print('CUDA:', torch.cuda.is_available()); print('NCCL:', torch.distributed.is_nccl_available())"
```

```bash
# From Node 02 to Node 01
ping 192.168.29.67

# Check if port is listening (on Node 01 after starting master)
ss -tulpn | grep 29500
```

```bash
# Your WSL IP should match Windows IP
ip addr show eth0 | grep "inet "
```

```bash
# Test on one machine with 1 GPU
torchrun --nproc_per_node=1 --nnodes=1 train_torchrun.py
```

NCCL (GPU-accelerated):
- Bandwidth: 10-50 GB/s (GPU-to-GPU direct)
- Latency: < 10μs (local network)
- Use Case: Production training, large models
- Requirements: WSL2 mirrored networking
Gloo (CPU fallback):
- Bandwidth: 1-10 GB/s (CPU-mediated)
- Latency: ~100μs
- Use Case: Development, compatibility testing
- Requirements: Standard WSL2 NAT networking
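The NCCL and Gloo figures above will vary with your NIC, GPU, and network. To measure your own setup, a rough `all_reduce` timing sketch like the one below can be launched on both nodes with the same `torchrun` arguments as the training script (a sketch under the assumption that an approximate throughput number is enough; it is not a vendor-grade benchmark):

```python
import os
import time

import torch
import torch.distributed as dist


def benchmark(backend: str, size_mb: int = 64, iters: int = 20) -> float:
    """Time all_reduce on a tensor of roughly size_mb MB; return approximate GB/s."""
    if backend == "nccl":
        device = torch.device("cuda", int(os.environ.get("LOCAL_RANK", "0")))
    else:
        device = torch.device("cpu")
    x = torch.ones(size_mb * 1024 * 1024 // 4, dtype=torch.float32, device=device)

    # Warm-up so connection setup is not counted.
    for _ in range(3):
        dist.all_reduce(x)
    if device.type == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    bytes_moved = x.numel() * 4 * iters
    return bytes_moved / elapsed / 1e9


if __name__ == "__main__":
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, init_method="env://")
    if backend == "nccl":
        torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
    gbps = benchmark(backend)
    if dist.get_rank() == 0:
        print(f"{backend} all_reduce throughput: {gbps:.2f} GB/s (approximate)")
    dist.destroy_process_group()
```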
- ✅ Best: Direct Ethernet cable (10 Gbps)
- ✅ Good: WiFi 6 on same router (1-2 Gbps)
- ⚠️ Avoid: WiFi with AP isolation enabled
Symptoms: `ping 192.168.29.67` times out

Causes & Fixes:

- Router AP Isolation:
  - Access router admin (`http://192.168.29.1` for Jio Fiber)
  - Disable "AP Isolation" or "Client Isolation"
  - Reboot router
- Windows Firewall: `Set-NetFirewallProfile -Profile Private -Enabled False`
- Antivirus Blocking (Norton/Avira):
  - Temporarily disable antivirus firewall
  - Add exceptions for Python and ports 20000-40000
- Alternative: Use direct Ethernet cable between laptops
Error: `The client socket has timed out after 60000ms`

Solutions:
- Ensure both nodes have WSL2 mirrored networking enabled
- Verify firewall is disabled: `Get-NetFirewallProfile`
- Check NCCL environment variables are set on both nodes
- Use `--master_addr` and `--master_port` (not `--rdzv_endpoint`)
Error: `socketStartConnect: Connect to IP<port> failed : Software caused connection abort`

Solutions:
- Disable Windows Firewall on both nodes (see Step 5)
- Disable antivirus (Norton, Avira, etc.)
- Verify mirrored networking: `ip addr show eth0 | grep "inet "`
- Add firewall rules for ports 20000-40000
Error: `CUDA: False` when checking PyTorch

Solutions:
- Install the NVIDIA CUDA on WSL driver from nvidia.com/cuda/wsl
- Verify with `nvidia-smi` (should show the GPU)
- Reinstall PyTorch with CUDA: `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124`
Error: `NCCL: False` when checking PyTorch
Solution: Install PyTorch with CUDA support (NCCL is included)
Error: WSL IP still shows `172.x.x.x` instead of the Windows IP

Solutions:
- Verify Windows version (Build 22621+ required): `[System.Environment]::OSVersion.Version`
- Check the `.wslconfig` location: `C:\Users\<YourUser>\.wslconfig`
- Ensure the file has the correct format (no BOM, ASCII encoding)
- Full WSL restart: `wsl --shutdown; Start-Sleep -Seconds 30; wsl`
Error: `OSError: [Errno 98] Address already in use`

Solutions:
- Find the process using the port: `sudo lsof -i :29500`
- Kill the process: `sudo kill -9 <PID>`
- Or use a different port (e.g., 29501)
- ✅ Both machines on same WiFi network or direct Ethernet
- ✅ Windows Firewall disabled for Private network
- ✅ Antivirus firewall disabled (Norton, Avira, etc.)
- ✅ WSL2 mirrored networking enabled (Build 22621+)
- ✅ Port 29500 accessible (rendezvous)
- ✅ Ports 20000-40000 accessible (NCCL communication)
- ✅ No VPN or proxy interference
- ✅ Router AP Isolation disabled
- Start Simple: Test single-node first, then multi-node
- Enable Mirrored Networking: Critical for NCCL to work
- Disable Firewalls: Start with all firewalls off, add rules later
- Check Connectivity: Ensure nodes can ping each other
- Monitor Logs: Use `NCCL_DEBUG=INFO` to diagnose issues
- Scale Gradually: 2 nodes → 3 nodes → N nodes
- ✅ Use direct Ethernet connection for lowest latency
- ✅ Disable power-saving on network adapters
- ✅ Use dedicated network interface for distributed training
- ✅ Monitor bandwidth with `iperf3` between nodes
- Batch size: Larger batches better utilize multi-GPU
- Gradient accumulation: Simulate larger batches (see the sketch below)
- Mixed precision: Use FP16 to reduce communication overhead
- Efficient collectives: Use `all_reduce` over `all_gather` when possible
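As a concrete illustration of the mixed-precision and gradient-accumulation tips, here is a generic DDP training-loop sketch; the model and data are placeholders and this is not code from the repository:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def train():
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    # Placeholder model; DDP averages gradients with all_reduce during backward().
    model = DDP(torch.nn.Linear(1024, 1024).to(device), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()  # FP16 to cut memory and communication volume
    accum_steps = 4                       # Simulate a 4x larger effective batch

    for step in range(100):
        x = torch.randn(32, 1024, device=device)  # Placeholder data
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model(x).pow(2).mean() / accum_steps
        # Gradients sync across nodes here; model.no_sync() can skip the sync
        # on intermediate accumulation steps to save bandwidth.
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    train()
```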
```bash
# GPU utilization
nvidia-smi -l 1

# Network traffic
iftop -i eth0

# NCCL performance test
nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```

Contributions are welcome! This project is the result of extensive troubleshooting and experimentation with WSL2 + NCCL multi-node setups.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-improvement`)
- Test on both master and worker nodes
- Commit your changes (`git commit -m 'Add amazing improvement'`)
- Push to the branch (`git push origin feature/amazing-improvement`)
- Open a Pull Request
MIT License - feel free to use this in your projects!
- PyTorch Distributed - Official documentation
- NVIDIA NCCL - High-performance GPU communication
- WSL2 Mirrored Networking - Microsoft WSL docs
- Community contributions and testing
Created by Amardeep Singh Sidhu (@thefirehacker)
Building AI solutions at @AIEdX & @bubblspace
bubblspace.com | @thefirehacker | LinkedIn

⭐ If this project helped you, consider giving it a star on GitHub!

Questions? Open an issue or reach out at contact@aiedx.com