
🚀 Distributed PyTorch Multi-Node GPU Training

Production-ready multi-node distributed PyTorch training setup using WSL2, NCCL, and torchrun.

Scale your deep learning workloads across multiple Windows machines with GPUs, leveraging NVIDIA's NCCL for high-performance GPU-to-GPU communication.

🎯 Overview

This project provides a complete, battle-tested solution for distributed PyTorch training across multiple physical machines:

  • 🖥️ Platform: Windows 11 + WSL2 (Ubuntu)
  • ⚡ Backend: NCCL (GPU-accelerated) + Gloo (CPU fallback)
  • 🔗 Networking: WSL2 Mirrored Networking Mode
  • 🎮 GPUs: NVIDIA CUDA-enabled GPUs (RTX/GTX series)
  • 📡 Rendezvous: PyTorch's native distributed launcher (torchrun)

Architecture

┌──────────────────────────────────┐      ┌──────────────────────────────────┐
│   Node 01 (Master - Rank 0)      │      │   Node 02 (Worker - Rank 1)      │
│  ┌────────────────────────────┐  │      │  ┌────────────────────────────┐  │
│  │  Windows 11 + WSL2         │  │      │  │  Windows 11 + WSL2         │  │
│  │  ├─ Ubuntu 22.04           │  │      │  │  ├─ Ubuntu 22.04           │  │
│  │  ├─ PyTorch 2.x + CUDA     │  │      │  │  ├─ PyTorch 2.x + CUDA     │  │
│  │  ├─ NCCL 2.21.5            │  │      │  │  ├─ NCCL 2.21.5            │  │
│  │  └─ NVIDIA GPU (CUDA)      │  │      │  │  └─ NVIDIA GPU (CUDA)      │  │
│  └────────────────────────────┘  │      │  └────────────────────────────┘  │
│         192.168.29.67            │      │         192.168.29.197           │
└──────────────────────────────────┘      └──────────────────────────────────┘
                 │                                         │
                 └────────────────────┬────────────────────┘
                                      │
                       Local Network / Direct Ethernet
                            (NCCL over TCP via eth0)

Key Features

✅ Multi-node NCCL: True GPU-to-GPU communication across physical machines
✅ WSL2 Native: No Docker, no VM overhead - direct CUDA access
✅ Mirrored Networking: WSL2's latest networking mode for seamless connectivity
✅ Dual Backend: NCCL for performance, Gloo for compatibility
✅ Production Ready: Comprehensive troubleshooting and firewall configuration
✅ Easy Setup: Step-by-step guides for both master and worker nodes

๐Ÿ“ Project Structure

dist-gpu-windows/
├── 📄 train_torchrun.py           # Main training script (NCCL backend)
├── 📄 worker.py                   # Standalone worker script
├── 📄 requirements.txt            # Python dependencies
│
├── 📚 Documentation
│   ├── NODE01_MASTER.md           # Complete Node 01 setup guide
│   ├── NODE02_WORKER.md           # Complete Node 02 setup guide
│   ├── QUICK_START.md             # Quick reference guide
│   ├── SOLUTION.md                # Architecture & design decisions
│   ├── WSL_MIRRORED_NETWORKING.md # Mirrored networking setup
│   └── WINDOWS_SETUP.md           # Windows-specific instructions
│
├── 🚀 Launch Scripts
│   ├── run_node0.sh               # Launch master node
│   ├── run_node1.sh               # Launch worker node
│   ├── run_single_node.sh         # Single-node testing
│   └── run_train.sh               # Training launcher
│
├── 📦 Archive & Tests
│   ├── archive/                   # Previous implementations
│   ├── Test/                      # Test scripts and utilities
│   └── issues/                    # Issue tracking and solutions
│
└── 🔬 Experimental
    └── L1-nbdistributed/          # Jupyter notebook experiments

Key Files

| File | Purpose | When to Use |
| --- | --- | --- |
| train_torchrun.py | Main distributed training script with NCCL | Multi-node GPU training (recommended) |
| NODE01_MASTER.md | Master node setup instructions | Setting up Node 01 |
| NODE02_WORKER.md | Worker node setup instructions | Setting up Node 02 |
| QUICK_START.md | Quick reference and troubleshooting | Fast setup & debugging |
| WSL_MIRRORED_NETWORKING.md | WSL2 networking configuration | Enabling mirrored mode |

🚀 Quick Start

Prerequisites

System Requirements (Both Nodes):

  • 💻 Windows 11 (Build 22621+ for mirrored networking)
  • 🐧 WSL2 with Ubuntu 22.04
  • 🎮 NVIDIA GPU (RTX/GTX series)
  • 🔧 NVIDIA CUDA on WSL driver
  • 🌐 Same local network or direct Ethernet connection

Software Requirements:

  • Python 3.10+
  • PyTorch 2.x with CUDA 12.4 support
  • Git

🎯 30-Second Setup

On Both Node 01 and Node 02:

1๏ธโƒฃ Install WSL2 (Windows PowerShell as Admin)

wsl --install -d Ubuntu-22.04

2๏ธโƒฃ Clone Repository (Inside WSL)

git clone <your-repo-url> ~/dist-gpu-windows
cd ~/dist-gpu-windows

3๏ธโƒฃ Setup Python Environment (WSL)

# Install dependencies
sudo apt update && sudo apt install -y build-essential python3 python3-pip python3-venv git

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install PyTorch with CUDA
pip install --upgrade pip setuptools wheel
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

4๏ธโƒฃ Enable WSL2 Mirrored Networking (Windows PowerShell as Admin)

# Create .wslconfig
@"
[wsl2]
networkingMode=mirrored
"@ | Out-File -FilePath "$env:USERPROFILE\.wslconfig" -Encoding ASCII -Force

# Restart WSL
wsl --shutdown
Start-Sleep -Seconds 30
wsl

5๏ธโƒฃ Configure Firewall (Windows PowerShell as Admin)

# Disable Windows Firewall for Private network (recommended for testing)
Set-NetFirewallProfile -Profile Private -Enabled False

# OR add specific rules
New-NetFirewallRule -DisplayName "PyTorch Distributed 29500" -Direction Inbound -LocalPort 29500 -Protocol TCP -Action Allow
New-NetFirewallRule -DisplayName "NCCL Communication Ports" -Direction Inbound -LocalPort 20000-40000 -Protocol TCP -Action Allow

6๏ธโƒฃ Disable Antivirus (Temporarily)

  • Norton/Avira: Disable firewall during testing
  • This prevents blocking of NCCL GPU communication

๐Ÿƒ Run Multi-Node Training

Step 1: Get Node 01's IP (WSL on Node 01)

ip addr show eth0 | grep "inet "
# Example: 192.168.29.67

Step 2: Start Master Node (WSL on Node 01)

cd ~/dist-gpu-windows
source .venv/bin/activate

# Set NCCL environment variables
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO

# Launch master (replace IP with your Node 01 IP)
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
  --master_addr=192.168.29.67 \
  --master_port=29500 \
  train_torchrun.py

Step 3: Start Worker Node (WSL on Node 02)

cd ~/dist-gpu-windows
source .venv/bin/activate

# Set NCCL environment variables
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO

# Launch worker (use Node 01's IP)
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 \
  --master_addr=192.168.29.67 \
  --master_port=29500 \
  train_torchrun.py

✅ Expected Output

Both nodes should show:

Initializing with backend: nccl
[rank X] world_size=2 device=cuda hostname=NODE-NAME
NCCL INFO Bootstrap : Using eth0:192.168.29.XX<0>
NCCL INFO Connected all rings
NCCL INFO Connected all trees
[rank X] gathered=[0, 1]
[rank X] barrier OK; shutting down
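For reference, the collective pattern behind this output boils down to a few lines of torch.distributed code. The sketch below is a simplified stand-in for train_torchrun.py (the actual script in the repo may differ in logging and argument handling): it initializes the process group, runs an all_gather, and hits a barrier before shutting down. Launch it with the same torchrun commands shown above.

# sketch_train.py - minimal torchrun-compatible script (illustrative sketch)
import os
import socket
import torch
import torch.distributed as dist

def main():
    # Prefer NCCL when a GPU is visible, otherwise fall back to Gloo
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    print(f"Initializing with backend: {backend}")

    # torchrun supplies MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, LOCAL_RANK
    dist.init_process_group(backend=backend)
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
    else:
        device = torch.device("cpu")
    print(f"[rank {rank}] world_size={world_size} device={device.type} hostname={socket.gethostname()}")

    # Simple collective: every rank contributes its rank id
    mine = torch.tensor([rank], device=device)
    gathered = [torch.zeros_like(mine) for _ in range(world_size)]
    dist.all_gather(gathered, mine)
    print(f"[rank {rank}] gathered={[int(t.item()) for t in gathered]}")

    dist.barrier()
    print(f"[rank {rank}] barrier OK; shutting down")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()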

📚 Detailed Guides

For complete setup instructions and troubleshooting:

| Guide | Description |
| --- | --- |
| 📖 NODE01_MASTER.md | Complete setup for master node (Rank 0) |
| 📖 NODE02_WORKER.md | Complete setup for worker node (Rank 1) |
| 📖 QUICK_START.md | Quick reference and common commands |
| 📖 WSL_MIRRORED_NETWORKING.md | WSL2 networking configuration |
| 📖 SOLUTION.md | Architecture and design decisions |

🔧 Configuration

Network Settings

| Setting | Value | Description |
| --- | --- | --- |
| Port | 29500 | Master rendezvous port |
| Backend | NCCL / Gloo | NCCL for GPU, Gloo for CPU fallback |
| Rendezvous | Native torchrun | Uses --master_addr and --master_port |
| Network Mode | WSL2 Mirrored | Direct host network access |
| Communication | TCP/IP | Over Ethernet (eth0) |

NCCL Environment Variables

export NCCL_IB_DISABLE=1         # Disable InfiniBand (not available on consumer hardware)
export NCCL_P2P_DISABLE=1        # Disable peer-to-peer GPU access (for multi-node)
export NCCL_SOCKET_IFNAME=eth0   # Use Ethernet interface
export NCCL_DEBUG=INFO           # Enable debug logging

Environment Variables (Auto-set by torchrun)

MASTER_ADDR=192.168.29.67   # Set via --master_addr
MASTER_PORT=29500           # Set via --master_port
WORLD_SIZE=2                # Total number of processes
RANK=0/1                    # Process rank (0=master, 1=worker)
LOCAL_RANK=0                # GPU index on local machine
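Inside the launched script these values can be read straight from the environment; a trivial check (not part of the repo's scripts) to confirm what torchrun injected:

import os

# Print the variables torchrun injected into this process's environment
for var in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "LOCAL_RANK"):
    print(f"{var}={os.environ.get(var)}")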

🧪 Testing & Validation

Verify CUDA and NCCL (WSL)

python -c "import torch; print('PyTorch:', torch.__version__); print('CUDA:', torch.cuda.is_available()); print('NCCL:', torch.distributed.is_nccl_available())"

Test Network Connectivity

# From Node 02 to Node 01
ping 192.168.29.67

# Check if port is listening (on Node 01 after starting master)
ss -tulpn | grep 29500

Verify WSL Mirrored Networking

# Your WSL IP should match Windows IP
ip addr show eth0 | grep "inet "

Single-Node Test (Before Multi-Node)

# Test on one machine with 1 GPU
torchrun --nproc_per_node=1 --nnodes=1 train_torchrun.py

📊 Performance

NCCL Backend (Recommended)

  • Bandwidth: bounded by the network link between the machines (roughly 0.1 GB/s on gigabit Ethernet or WiFi, ~1 GB/s on 10 GbE); multi-GB/s NCCL figures apply to intra-node NVLink/PCIe, not LAN traffic
  • Latency: typically ~100-500 μs on a local network
  • Use Case: Production training, large models
  • Requirements: WSL2 mirrored networking

Gloo Backend (Fallback)

  • Bandwidth: lower than NCCL on the same link (transfers are CPU-mediated)
  • Latency: ~100 μs or more (extra host-side copies)
  • Use Case: Development, compatibility testing
  • Requirements: Standard WSL2 NAT networking

To see what your own setup actually delivers rather than relying on nominal figures, see the all_reduce micro-benchmark sketched below.
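The following micro-benchmark is one way to measure effective all_reduce bandwidth on your own hardware. It is an illustrative sketch, not a script shipped in this repo, and is launched with the same torchrun commands as the training script:

# bench_allreduce.py (illustrative): rough all_reduce bandwidth probe
import os
import time
import torch
import torch.distributed as dist

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)
local_rank = int(os.environ.get("LOCAL_RANK", 0))
if backend == "nccl":
    torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank) if backend == "nccl" else torch.device("cpu")

x = torch.ones(64 * 1024 * 1024, dtype=torch.float32, device=device)  # 256 MB payload

dist.all_reduce(x)  # warm-up
if device.type == "cuda":
    torch.cuda.synchronize()

iters = 5
start = time.time()
for _ in range(iters):
    dist.all_reduce(x)
if device.type == "cuda":
    torch.cuda.synchronize()
elapsed = (time.time() - start) / iters

gigabytes = x.element_size() * x.numel() / 1e9
print(f"[rank {dist.get_rank()}] all_reduce {gigabytes:.2f} GB in {elapsed:.3f}s "
      f"(~{gigabytes / elapsed:.2f} GB/s effective)")

dist.destroy_process_group()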

Network Recommendations

  • ✅ Best: Direct Ethernet cable (10 Gbps)
  • ✅ Good: WiFi 6 on same router (1-2 Gbps)
  • ⚠️ Avoid: WiFi with AP isolation enabled

๐Ÿ› Troubleshooting

Common Issues & Solutions

โŒ Ping Fails Between Nodes

Symptoms: ping 192.168.29.67 times out

Causes & Fixes:

  1. Router AP Isolation:

    • Access router admin (http://192.168.29.1 for Jio Fiber)
    • Disable "AP Isolation" or "Client Isolation"
    • Reboot router
  2. Windows Firewall:

    Set-NetFirewallProfile -Profile Private -Enabled False
  3. Antivirus Blocking (Norton/Avira):

    • Temporarily disable antivirus firewall
    • Add exceptions for Python and ports 20000-40000
  4. Alternative: Use direct Ethernet cable between laptops

โŒ NCCL Connection Timeout

Error: The client socket has timed out after 60000ms

Solutions:

  1. Ensure both nodes have WSL2 mirrored networking enabled
  2. Verify firewall is disabled: Get-NetFirewallProfile
  3. Check NCCL environment variables are set on both nodes
  4. Use --master_addr and --master_port (not --rdzv_endpoint)

โŒ NCCL Connection Reset / Socket Error

Error: socketStartConnect: Connect to IP<port> failed : Software caused connection abort

Solutions:

  1. Disable Windows Firewall on both nodes (see Step 5)
  2. Disable antivirus (Norton, Avira, etc.)
  3. Verify mirrored networking: ip addr show eth0 | grep "inet "
  4. Add firewall rules for ports 20000-40000

โŒ CUDA Not Available

Error: CUDA: False when checking PyTorch

Solutions:

  1. Install NVIDIA CUDA on WSL driver from nvidia.com/cuda/wsl
  2. Verify with: nvidia-smi (should show GPU)
  3. Reinstall PyTorch with CUDA:
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

โŒ NCCL Not Available

Error: NCCL: False when checking PyTorch

Solution: Install PyTorch with CUDA support (NCCL is included)

โŒ WSL Mirrored Networking Not Working

Error: WSL IP still shows 172.x.x.x instead of Windows IP

Solutions:

  1. Verify Windows version: Build 22621+ required
    [System.Environment]::OSVersion.Version
  2. Check .wslconfig location: C:\Users\<YourUser>\.wslconfig
  3. Ensure file has correct format (no BOM, ASCII encoding)
  4. Full WSL restart:
    wsl --shutdown
    Start-Sleep -Seconds 30
    wsl

โŒ Port Already in Use

Error: OSError: [Errno 98] Address already in use

Solutions:

  1. Find process using port: sudo lsof -i :29500
  2. Kill the process: sudo kill -9 <PID>
  3. Or use a different port (e.g., 29501)

Network Requirements Checklist

  • ✅ Both machines on same WiFi network or direct Ethernet
  • ✅ Windows Firewall disabled for Private network
  • ✅ Antivirus firewall disabled (Norton, Avira, etc.)
  • ✅ WSL2 mirrored networking enabled (Build 22621+)
  • ✅ Port 29500 accessible (rendezvous; see the reachability check below)
  • ✅ Ports 20000-40000 accessible (NCCL communication)
  • ✅ No VPN or proxy interference
  • ✅ Router AP Isolation disabled
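To confirm the port items without launching a full training job, a short Python check from Node 02 can tell you whether the rendezvous port on Node 01 is reachable. This is a convenience sketch; the IP and port below are the example values used throughout this README, and the port only answers once the master torchrun process is running:

# check_port.py (illustrative): TCP reachability test from worker to master
import socket

MASTER_ADDR = "192.168.29.67"   # Node 01's IP (example value from this README)
MASTER_PORT = 29500             # rendezvous port

try:
    with socket.create_connection((MASTER_ADDR, MASTER_PORT), timeout=5):
        print(f"OK: {MASTER_ADDR}:{MASTER_PORT} is reachable")
except OSError as exc:
    print(f"FAILED: cannot reach {MASTER_ADDR}:{MASTER_PORT} ({exc})")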

🔄 Development Workflow

  1. Start Simple: Test single-node first, then multi-node
  2. Enable Mirrored Networking: Critical for NCCL to work
  3. Disable Firewalls: Start with all firewalls off, add rules later
  4. Check Connectivity: Ensure nodes can ping each other
  5. Monitor Logs: Use NCCL_DEBUG=INFO to diagnose issues
  6. Scale Gradually: 2 nodes → 3 nodes → N nodes

📈 Performance Tips

Optimize Network

  • ✅ Use direct Ethernet connection for lowest latency
  • ✅ Disable power-saving on network adapters
  • ✅ Use dedicated network interface for distributed training
  • ✅ Monitor bandwidth: iperf3 between nodes

Optimize Training

  • 🎯 Batch size: Larger batches better utilize multiple GPUs
  • 🎯 Gradient accumulation: Simulate larger batches without more GPU memory
  • 🎯 Mixed precision: Use FP16 to reduce communication overhead (see the sketch after this list)
  • 🎯 Efficient collectives: Prefer all_reduce over all_gather when possible
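As a rough illustration of how the gradient-accumulation and mixed-precision tips fit into a DDP loop, here is a hedged sketch; the model, data, and hyperparameters are placeholders and not part of this repo:

# ddp_amp_accum.py (illustrative): DDP + FP16 autocast + gradient accumulation
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank), device_ids=[local_rank])  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # loss scaling for FP16
accum_steps = 4                        # simulate a 4x larger effective batch

for step in range(100):                # placeholder data loop
    x = torch.randn(32, 1024, device=local_rank)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean() / accum_steps
    scaler.scale(loss).backward()      # DDP all_reduces gradients during backward
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)

dist.destroy_process_group()

For larger models, wrapping the intermediate accumulation steps in model.no_sync() avoids redundant gradient all_reduce calls; the same pattern runs unchanged under the torchrun launch commands shown earlier.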

Monitor Performance

# GPU utilization
nvidia-smi -l 1

# Network traffic
iftop -i eth0

# NCCL performance test
nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

๐Ÿค Contributing

Contributions are welcome! This project is the result of extensive troubleshooting and experimentation with WSL2 + NCCL multi-node setups.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-improvement)
  3. Test on both master and worker nodes
  4. Commit your changes (git commit -m 'Add amazing improvement')
  5. Push to the branch (git push origin feature/amazing-improvement)
  6. Open a Pull Request

๐Ÿ“ License

MIT License - feel free to use this in your projects!

๐Ÿ™ Acknowledgments


๐Ÿ’ Made with Love

Created by Amardeep Singh Sidhu (@thefirehacker)

Building AI solutions at @AIEdX & @bubblspace

๐ŸŒ bubblspace.com | ๐Ÿฆ @thefirehacker | ๐Ÿ’ผ LinkedIn


โญ If this project helped you, consider giving it a star on GitHub!

📧 Questions? Open an issue or reach out at contact@aiedx.com