
🚀 Distributed PyTorch Multi-Node GPU Training

Production-ready multi-node distributed PyTorch training setup using WSL2, NCCL, and torchrun.

Scale your deep learning workloads across multiple Windows machines with GPUs, leveraging NVIDIA's NCCL for high-performance GPU-to-GPU communication.

🎯 Overview

This project provides a complete, battle-tested solution for distributed PyTorch training across multiple physical machines:

  • 🖥️ Platform: Windows 11 + WSL2 (Ubuntu)
  • ⚡ Backend: NCCL (GPU-accelerated) + Gloo (CPU fallback)
  • 🔗 Networking: WSL2 Mirrored Networking Mode
  • 🎮 GPUs: NVIDIA CUDA-enabled GPUs (RTX/GTX series)
  • 📡 Rendezvous: PyTorch's native distributed launcher (torchrun)

Architecture

┌──────────────────────────────────┐      ┌──────────────────────────────────┐
│   Node 01 (Master - Rank 0)      │      │   Node 02 (Worker - Rank 1)      │
│  ┌────────────────────────────┐  │      │  ┌────────────────────────────┐  │
│  │  Windows 11 + WSL2         │  │      │  │  Windows 11 + WSL2         │  │
│  │  ├─ Ubuntu 22.04           │  │      │  │  ├─ Ubuntu 22.04           │  │
│  │  ├─ PyTorch 2.x + CUDA     │  │      │  │  ├─ PyTorch 2.x + CUDA     │  │
│  │  ├─ NCCL 2.21.5            │  │      │  │  ├─ NCCL 2.21.5            │  │
│  │  └─ NVIDIA GPU (CUDA)      │  │      │  │  └─ NVIDIA GPU (CUDA)      │  │
│  └────────────────────────────┘  │      │  └────────────────────────────┘  │
│         192.168.29.67            │      │         192.168.29.197           │
└──────────────────────────────────┘      └──────────────────────────────────┘
                 │                                         │
                 └────────────────────┬────────────────────┘
                                      │
                       Local Network / Direct Ethernet
                            (NCCL over TCP via eth0)

Key Features

✅ Multi-node NCCL: True GPU-to-GPU communication across physical machines
✅ WSL2 Native: No Docker, no VM overhead - direct CUDA access
✅ Mirrored Networking: WSL2's latest networking mode for seamless connectivity
✅ Dual Backend: NCCL for performance, Gloo for compatibility
✅ Production Ready: Comprehensive troubleshooting and firewall configuration
✅ Easy Setup: Step-by-step guides for both master and worker nodes

๐Ÿ“ Project Structure

dist-gpu-windows/
├── 📄 train_torchrun.py           # Main training script (NCCL backend)
├── 📄 worker.py                   # Standalone worker script
├── 📄 requirements.txt            # Python dependencies
│
├── 📚 Documentation
│   ├── NODE01_MASTER.md           # Complete Node 01 setup guide
│   ├── NODE02_WORKER.md           # Complete Node 02 setup guide
│   ├── QUICK_START.md             # Quick reference guide
│   ├── SOLUTION.md                # Architecture & design decisions
│   ├── WSL_MIRRORED_NETWORKING.md # Mirrored networking setup
│   └── WINDOWS_SETUP.md           # Windows-specific instructions
│
├── 🚀 Launch Scripts
│   ├── run_node0.sh               # Launch master node
│   ├── run_node1.sh               # Launch worker node
│   ├── run_single_node.sh         # Single-node testing
│   └── run_train.sh               # Training launcher
│
├── 📦 Archive & Tests
│   ├── archive/                   # Previous implementations
│   ├── Test/                      # Test scripts and utilities
│   └── issues/                    # Issue tracking and solutions
│
└── 🔬 Experimental
    └── L1-nbdistributed/          # Jupyter notebook experiments

Key Files

| File | Purpose | When to Use |
| --- | --- | --- |
| train_torchrun.py | Main distributed training script with NCCL | Multi-node GPU training (recommended) |
| NODE01_MASTER.md | Master node setup instructions | Setting up Node 01 |
| NODE02_WORKER.md | Worker node setup instructions | Setting up Node 02 |
| QUICK_START.md | Quick reference and troubleshooting | Fast setup & debugging |
| WSL_MIRRORED_NETWORKING.md | WSL2 networking configuration | Enabling mirrored mode |

🚀 Quick Start

Prerequisites

System Requirements (Both Nodes):

  • 💻 Windows 11 (Build 22621+ for mirrored networking)
  • 🐧 WSL2 with Ubuntu 22.04
  • 🎮 NVIDIA GPU (RTX/GTX series)
  • 🔧 NVIDIA CUDA on WSL driver
  • 🌐 Same local network or direct Ethernet connection

Software Requirements:

  • Python 3.10+
  • PyTorch 2.x with CUDA 12.4 support
  • Git

🎯 30-Second Setup

On Both Node 01 and Node 02:

1๏ธโƒฃ Install WSL2 (Windows PowerShell as Admin)

wsl --install -d Ubuntu-22.04

2๏ธโƒฃ Clone Repository (Inside WSL)

git clone <your-repo-url> ~/dist-gpu-windows
cd ~/dist-gpu-windows

3๏ธโƒฃ Setup Python Environment (WSL)

# Install dependencies
sudo apt update && sudo apt install -y build-essential python3 python3-pip python3-venv git

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install PyTorch with CUDA
pip install --upgrade pip setuptools wheel
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

4๏ธโƒฃ Enable WSL2 Mirrored Networking (Windows PowerShell as Admin)

# Create .wslconfig
@"
[wsl2]
networkingMode=mirrored
"@ | Out-File -FilePath "$env:USERPROFILE\.wslconfig" -Encoding ASCII -Force

# Restart WSL
wsl --shutdown
Start-Sleep -Seconds 30
wsl

5๏ธโƒฃ Configure Firewall (Windows PowerShell as Admin)

# Disable Windows Firewall for Private network (recommended for testing)
Set-NetFirewallProfile -Profile Private -Enabled False

# OR add specific rules
New-NetFirewallRule -DisplayName "PyTorch Distributed 29500" -Direction Inbound -LocalPort 29500 -Protocol TCP -Action Allow
New-NetFirewallRule -DisplayName "NCCL Communication Ports" -Direction Inbound -LocalPort 20000-40000 -Protocol TCP -Action Allow

6๏ธโƒฃ Disable Antivirus (Temporarily)

  • Norton/Avira: Disable firewall during testing
  • This prevents blocking of NCCL GPU communication

๐Ÿƒ Run Multi-Node Training

Step 1: Get Node 01's IP (WSL on Node 01)

ip addr show eth0 | grep "inet "
# Example: 192.168.29.67

Step 2: Start Master Node (WSL on Node 01)

cd ~/dist-gpu-windows
source .venv/bin/activate

# Set NCCL environment variables
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO

# Launch master (replace IP with your Node 01 IP)
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
  --master_addr=192.168.29.67 \
  --master_port=29500 \
  train_torchrun.py

Step 3: Start Worker Node (WSL on Node 02)

cd ~/dist-gpu-windows
source .venv/bin/activate

# Set NCCL environment variables
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO

# Launch worker (use Node 01's IP)
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 \
  --master_addr=192.168.29.67 \
  --master_port=29500 \
  train_torchrun.py

✅ Expected Output

Both nodes should show:

Initializing with backend: nccl
[rank X] world_size=2 device=cuda hostname=NODE-NAME
NCCL INFO Bootstrap : Using eth0:192.168.29.XX<0>
NCCL INFO Connected all rings
NCCL INFO Connected all trees
[rank X] gathered=[0, 1]
[rank X] barrier OK; shutting down
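For reference, the collective pattern behind this output boils down to a few lines of torch.distributed code. The sketch below is a simplified stand-in for train_torchrun.py (the actual script in the repo may differ in logging and argument handling): it initializes the process group, runs an all_gather, and hits a barrier before shutting down. Launch it with the same torchrun commands shown above.

# sketch_train.py - minimal torchrun-compatible script (illustrative sketch)
import os
import socket
import torch
import torch.distributed as dist

def main():
    # Prefer NCCL when a GPU is visible, otherwise fall back to Gloo
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    print(f"Initializing with backend: {backend}")

    # torchrun supplies MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, LOCAL_RANK
    dist.init_process_group(backend=backend)
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
    else:
        device = torch.device("cpu")
    print(f"[rank {rank}] world_size={world_size} device={device.type} hostname={socket.gethostname()}")

    # Simple collective: every rank contributes its rank id
    mine = torch.tensor([rank], device=device)
    gathered = [torch.zeros_like(mine) for _ in range(world_size)]
    dist.all_gather(gathered, mine)
    print(f"[rank {rank}] gathered={[int(t.item()) for t in gathered]}")

    dist.barrier()
    print(f"[rank {rank}] barrier OK; shutting down")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()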

📚 Detailed Guides

For complete setup instructions and troubleshooting:

| Guide | Description |
| --- | --- |
| 📖 NODE01_MASTER.md | Complete setup for master node (Rank 0) |
| 📖 NODE02_WORKER.md | Complete setup for worker node (Rank 1) |
| 📖 QUICK_START.md | Quick reference and common commands |
| 📖 WSL_MIRRORED_NETWORKING.md | WSL2 networking configuration |
| 📖 SOLUTION.md | Architecture and design decisions |

🔧 Configuration

Network Settings

| Setting | Value | Description |
| --- | --- | --- |
| Port | 29500 | Master rendezvous port |
| Backend | NCCL / Gloo | NCCL for GPU, Gloo for CPU fallback |
| Rendezvous | Native torchrun | Uses --master_addr and --master_port |
| Network Mode | WSL2 Mirrored | Direct host network access |
| Communication | TCP/IP | Over Ethernet (eth0) |

NCCL Environment Variables

export NCCL_IB_DISABLE=1         # Disable InfiniBand (not available on consumer hardware)
export NCCL_P2P_DISABLE=1        # Disable peer-to-peer GPU access (for multi-node)
export NCCL_SOCKET_IFNAME=eth0   # Use Ethernet interface
export NCCL_DEBUG=INFO           # Enable debug logging

Environment Variables (Auto-set by torchrun)

MASTER_ADDR=192.168.29.67   # Set via --master_addr
MASTER_PORT=29500           # Set via --master_port
WORLD_SIZE=2                # Total number of processes
RANK=0/1                    # Process rank (0=master, 1=worker)
LOCAL_RANK=0                # GPU index on local machine
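Inside the launched script these values can be read straight from the environment; a trivial check (not part of the repo's scripts) to confirm what torchrun injected:

import os

# Print the variables torchrun injected into this process's environment
for var in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "LOCAL_RANK"):
    print(f"{var}={os.environ.get(var)}")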

🧪 Testing & Validation

Verify CUDA and NCCL (WSL)

python -c "import torch; print('PyTorch:', torch.__version__); print('CUDA:', torch.cuda.is_available()); print('NCCL:', torch.distributed.is_nccl_available())"

Test Network Connectivity

# From Node 02 to Node 01
ping 192.168.29.67

# Check if port is listening (on Node 01 after starting master)
ss -tulpn | grep 29500

Verify WSL Mirrored Networking

# Your WSL IP should match Windows IP
ip addr show eth0 | grep "inet "

Single-Node Test (Before Multi-Node)

# Test on one machine with 1 GPU
torchrun --nproc_per_node=1 --nnodes=1 train_torchrun.py

📊 Performance

NCCL Backend (Recommended)

  • Bandwidth: bounded by the network link between the machines (roughly 0.1 GB/s on gigabit Ethernet or WiFi, ~1 GB/s on 10 GbE); multi-GB/s NCCL figures apply to intra-node NVLink/PCIe, not LAN traffic
  • Latency: typically ~100-500 μs on a local network
  • Use Case: Production training, large models
  • Requirements: WSL2 mirrored networking

Gloo Backend (Fallback)

  • Bandwidth: lower than NCCL on the same link (transfers are CPU-mediated)
  • Latency: ~100 μs or more (extra host-side copies)
  • Use Case: Development, compatibility testing
  • Requirements: Standard WSL2 NAT networking

To see what your own setup actually delivers rather than relying on nominal figures, see the all_reduce micro-benchmark sketched below.
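The following micro-benchmark is one way to measure effective all_reduce bandwidth on your own hardware. It is an illustrative sketch, not a script shipped in this repo, and is launched with the same torchrun commands as the training script:

# bench_allreduce.py (illustrative): rough all_reduce bandwidth probe
import os
import time
import torch
import torch.distributed as dist

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)
local_rank = int(os.environ.get("LOCAL_RANK", 0))
if backend == "nccl":
    torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank) if backend == "nccl" else torch.device("cpu")

x = torch.ones(64 * 1024 * 1024, dtype=torch.float32, device=device)  # 256 MB payload

dist.all_reduce(x)  # warm-up
if device.type == "cuda":
    torch.cuda.synchronize()

iters = 5
start = time.time()
for _ in range(iters):
    dist.all_reduce(x)
if device.type == "cuda":
    torch.cuda.synchronize()
elapsed = (time.time() - start) / iters

gigabytes = x.element_size() * x.numel() / 1e9
print(f"[rank {dist.get_rank()}] all_reduce {gigabytes:.2f} GB in {elapsed:.3f}s "
      f"(~{gigabytes / elapsed:.2f} GB/s effective)")

dist.destroy_process_group()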

Network Recommendations

  • ✅ Best: Direct Ethernet cable (10 Gbps)
  • ✅ Good: WiFi 6 on same router (1-2 Gbps)
  • ⚠️ Avoid: WiFi with AP isolation enabled

๐Ÿ› Troubleshooting

Common Issues & Solutions

โŒ Ping Fails Between Nodes

Symptoms: ping 192.168.29.67 times out

Causes & Fixes:

  1. Router AP Isolation:

    • Access router admin (http://192.168.29.1 for Jio Fiber)
    • Disable "AP Isolation" or "Client Isolation"
    • Reboot router
  2. Windows Firewall:

    Set-NetFirewallProfile -Profile Private -Enabled False
  3. Antivirus Blocking (Norton/Avira):

    • Temporarily disable antivirus firewall
    • Add exceptions for Python and ports 20000-40000
  4. Alternative: Use direct Ethernet cable between laptops

โŒ NCCL Connection Timeout

Error: The client socket has timed out after 60000ms

Solutions:

  1. Ensure both nodes have WSL2 mirrored networking enabled
  2. Verify firewall is disabled: Get-NetFirewallProfile
  3. Check NCCL environment variables are set on both nodes
  4. Use --master_addr and --master_port (not --rdzv_endpoint)

โŒ NCCL Connection Reset / Socket Error

Error: socketStartConnect: Connect to IP<port> failed : Software caused connection abort

Solutions:

  1. Disable Windows Firewall on both nodes (see Step 5)
  2. Disable antivirus (Norton, Avira, etc.)
  3. Verify mirrored networking: ip addr show eth0 | grep "inet "
  4. Add firewall rules for ports 20000-40000

โŒ CUDA Not Available

Error: CUDA: False when checking PyTorch

Solutions:

  1. Install NVIDIA CUDA on WSL driver from nvidia.com/cuda/wsl
  2. Verify with: nvidia-smi (should show GPU)
  3. Reinstall PyTorch with CUDA:
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

โŒ NCCL Not Available

Error: NCCL: False when checking PyTorch

Solution: Install PyTorch with CUDA support (NCCL is included)

โŒ WSL Mirrored Networking Not Working

Error: WSL IP still shows 172.x.x.x instead of Windows IP

Solutions:

  1. Verify Windows version: Build 22621+ required
    [System.Environment]::OSVersion.Version
  2. Check .wslconfig location: C:\Users\<YourUser>\.wslconfig
  3. Ensure file has correct format (no BOM, ASCII encoding)
  4. Full WSL restart:
    wsl --shutdown
    Start-Sleep -Seconds 30
    wsl

โŒ Port Already in Use

Error: OSError: [Errno 98] Address already in use

Solutions:

  1. Find process using port: sudo lsof -i :29500
  2. Kill the process: sudo kill -9 <PID>
  3. Or use a different port (e.g., 29501)

Network Requirements Checklist

  • ✅ Both machines on same WiFi network or direct Ethernet
  • ✅ Windows Firewall disabled for Private network
  • ✅ Antivirus firewall disabled (Norton, Avira, etc.)
  • ✅ WSL2 mirrored networking enabled (Build 22621+)
  • ✅ Port 29500 accessible (rendezvous; see the reachability check below)
  • ✅ Ports 20000-40000 accessible (NCCL communication)
  • ✅ No VPN or proxy interference
  • ✅ Router AP Isolation disabled
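To confirm the port items without launching a full training job, a short Python check from Node 02 can tell you whether the rendezvous port on Node 01 is reachable. This is a convenience sketch; the IP and port below are the example values used throughout this README, and the port only answers once the master torchrun process is running:

# check_port.py (illustrative): TCP reachability test from worker to master
import socket

MASTER_ADDR = "192.168.29.67"   # Node 01's IP (example value from this README)
MASTER_PORT = 29500             # rendezvous port

try:
    with socket.create_connection((MASTER_ADDR, MASTER_PORT), timeout=5):
        print(f"OK: {MASTER_ADDR}:{MASTER_PORT} is reachable")
except OSError as exc:
    print(f"FAILED: cannot reach {MASTER_ADDR}:{MASTER_PORT} ({exc})")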

🔄 Development Workflow

  1. Start Simple: Test single-node first, then multi-node
  2. Enable Mirrored Networking: Critical for NCCL to work
  3. Disable Firewalls: Start with all firewalls off, add rules later
  4. Check Connectivity: Ensure nodes can ping each other
  5. Monitor Logs: Use NCCL_DEBUG=INFO to diagnose issues
  6. Scale Gradually: 2 nodes → 3 nodes → N nodes

📈 Performance Tips

Optimize Network

  • ✅ Use direct Ethernet connection for lowest latency
  • ✅ Disable power-saving on network adapters
  • ✅ Use dedicated network interface for distributed training
  • ✅ Monitor bandwidth: iperf3 between nodes

Optimize Training

  • 🎯 Batch size: Larger batches better utilize multiple GPUs
  • 🎯 Gradient accumulation: Simulate larger batches without more GPU memory
  • 🎯 Mixed precision: Use FP16 to reduce communication overhead (see the sketch after this list)
  • 🎯 Efficient collectives: Prefer all_reduce over all_gather when possible
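As a rough illustration of how the gradient-accumulation and mixed-precision tips fit into a DDP loop, here is a hedged sketch; the model, data, and hyperparameters are placeholders and not part of this repo:

# ddp_amp_accum.py (illustrative): DDP + FP16 autocast + gradient accumulation
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank), device_ids=[local_rank])  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # loss scaling for FP16
accum_steps = 4                        # simulate a 4x larger effective batch

for step in range(100):                # placeholder data loop
    x = torch.randn(32, 1024, device=local_rank)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean() / accum_steps
    scaler.scale(loss).backward()      # DDP all_reduces gradients during backward
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)

dist.destroy_process_group()

For larger models, wrapping the intermediate accumulation steps in model.no_sync() avoids redundant gradient all_reduce calls; the same pattern runs unchanged under the torchrun launch commands shown earlier.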

Monitor Performance

# GPU utilization
nvidia-smi -l 1

# Network traffic
iftop -i eth0

# NCCL performance test
nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

๐Ÿค Contributing

Contributions are welcome! This project is the result of extensive troubleshooting and experimentation with WSL2 + NCCL multi-node setups.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-improvement)
  3. Test on both master and worker nodes
  4. Commit your changes (git commit -m 'Add amazing improvement')
  5. Push to the branch (git push origin feature/amazing-improvement)
  6. Open a Pull Request

๐Ÿ“ License

MIT License - feel free to use this in your projects!

๐Ÿ™ Acknowledgments


๐Ÿ’ Made with Love

Created by Amardeep Singh Sidhu (@thefirehacker)

Building AI solutions at @AIEdX & @bubblspace

๐ŸŒ bubblspace.com | ๐Ÿฆ @thefirehacker | ๐Ÿ’ผ LinkedIn


โญ If this project helped you, consider giving it a star on GitHub!

📧 Questions? Open an issue or reach out at contact@aiedx.com