Lightning-AI/pytorch-lightning

Multi-node DDP training hangs during initialization.

ElderWanng opened this issue · 11 comments

๐Ÿ› Bug

I'm trying to use all available compute resources to speed up training. My code works fine in single-node, multi-GPU mode (so I believe most of the DDP setup is correct).

But when it comes to multiple nodes, the code always stalls during DDP initialization. (It stops at 2/4; I guess one node initializes correctly but the other one gets stuck.)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
Traceback (most recent call last):
  File "/home/tw2112/codes/s2s/aux_with_neg_wiki/cool_test/test.py", line 74, in <module>
    trainer.fit(model)
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 863, in _run
    self.accelerator.setup_environment()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu.py", line 30, in setup_environment
    super().setup_environment()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 76, in setup_environment
    self.training_type_plugin.setup_environment()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 168, in setup_environment
    self.setup_distributed()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 253, in setup_distributed
    self.init_ddp_connection()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 323, in init_ddp_connection
    torch.distributed.init_process_group(
  File "/ext3/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 525, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/ext3/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 212, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=8, worker_count=3, timeout=0:30:00)



To Reproduce

I followed #8707 and borrowed the simple model from it; the problem still occurs:

test.py:

import logging
import os
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
import torchvision.transforms as transforms

import pytorch_lightning as ptl

def get_logger(name=__name__, level=logging.INFO):
    """Initializes python logger."""

    logger = logging.getLogger(name)
    logger.setLevel(level)

    # this ensures all logging levels get marked with the rank zero decorator
    # otherwise logs would get multiplied for each GPU process in multi-GPU setup
    for level in ("debug", "info", "warning", "error", "exception", "fatal", "critical"):
        setattr(logger, level, ptl.utilities.rank_zero_only(getattr(logger, level)))

    return logger


log = get_logger(__name__)

class CoolModel(ptl.LightningModule):

    def __init__(self):
        super(CoolModel, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def my_loss(self, y_hat, y):
        return F.cross_entropy(y_hat, y)

    def training_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        return {'loss': self.my_loss(y_hat, y)}

    def validation_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        return {'val_loss': self.my_loss(y_hat, y)}

    def validation_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        return {'avg_val_loss': avg_loss}

    def configure_optimizers(self):
        return [torch.optim.Adam(self.parameters(), lr=0.02)]

    def train_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    def val_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    def test_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)


from pytorch_lightning import Trainer

model = CoolModel()


trainer = Trainer(max_epochs=1, gpus=4, num_nodes=2, accelerator='ddp')

trainer.fit(model)

Due to my HPC cluster's setup, I have to use Singularity to load the conda environment instead of the module commands used in other posts.

run.slurm

#!/bin/bash
#SBATCH --output=./%j_%x.out
#SBATCH --error=./%j_%x.err
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=4
#SBATCH --export=ALL
#SBATCH --time=2-00:00:00
#SBATCH --gres=gpu:4
#SBATCH --mem=10G
#SBATCH --account=cds
#SBATCH -c 4



srun singularity exec --nv  --overlay $SCRATCH/overlay2/overlay-50G-10M.ext3:ro   /scratch/work/public/singularity/cuda11.1-cudnn8-devel-ubuntu18.04.sif /bin/bash -c "
cd
source /ext3/env.sh
conda activate
cd /home/tw2112/codes/s2s/aux_with_neg_wiki/cool_test
python test.py
"

Expected behavior

All processes start.

Environment

  • CUDA:
    - GPU:
    - NVIDIA Quadro RTX 8000
    - NVIDIA Quadro RTX 8000
    - NVIDIA Quadro RTX 8000
    - NVIDIA Quadro RTX 8000
    - available: True
    - version: 11.1
  • Packages:
    - numpy: 1.21.2
    - pyTorch_debug: False
    - pyTorch_version: 1.8.1
    - pytorch-lightning: 1.4.9
    - tqdm: 4.62.2
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.8.5
    - version: #1 SMP Fri Oct 16 13:38:49 EDT 2020

This is the environment output from one of the nodes.

Additional context

I'm new to slurm... maybe I made some stupid mistake...

cc @awaelchli @rohitgr7

Hey, please check our docs for how to run on slurm.

You are not setting the number of nodes in SLURM. It assumes 1 by default.
The number of nodes and ntasks-per-node settings in SLURM must match the num_nodes and gpus in the Trainer.
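
For the 2-node, 4-GPU-per-node job in this issue, the mapping would look roughly like this (a minimal sketch, not the poster's exact script):

# Sketch only: the SLURM directives on the left must agree with the
# Trainer arguments on the right for a 2-node x 4-GPU job.
#
#   #SBATCH --nodes=2            <->  num_nodes=2
#   #SBATCH --ntasks-per-node=4  <->  gpus=4
#   #SBATCH --gres=gpu:4
#
from pytorch_lightning import Trainer

trainer = Trainer(max_epochs=1, gpus=4, num_nodes=2, accelerator='ddp')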

stale commented

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

Did you manage to run it with my directions in the previous comment?

stale commented

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@awaelchli Hi, I have the same issue. Could you help me please?

@ElderWanng Hey, I have the same issue. Were you able to fix it? Any advice on how?

Hey, please check our docs for how to run on slurm.

You are not setting the number of nodes in SLURM. It assumes 1 by default. The number of nodes and ntasks-per-node settings in SLURM must match the num_nodes and gpus in the Trainer.

There appears to be another reason for the problem, but I am not able to find what it is. I am executing a dummy script:

'''
This code creates a simple one-hidden-layer neural network trained on a
dummy dataset. The network is trained with the Trainer from PyTorch Lightning.
'''

import os
import torch
import torch.nn as nn
import pytorch_lightning as pl

class Dataset(torch.utils.data.Dataset):
    def __init__(self):
        self.x = torch.arange(-1, 1, 0.02)
        self.y = 3 * self.x + torch.randn(self.x.size()) * 0.33

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return self.x.size(0)


# Create a simple neural network
class Net(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1, 10)
        self.fc2 = nn.Linear(10, 1)
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        return x

    def training_step(self, batch, _):
        x, y = batch
        y_hat = self(x)
        loss = nn.functional.mse_loss(y_hat, y)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)



# Create train and test dataloaders
train_loader = torch.utils.data.DataLoader(Dataset(), batch_size=64)

# Some prints that might be useful
print('SLURM_NTASKS =', os.environ['SLURM_NTASKS'])
print('SLURM_TASKS_PER_NODE =', os.environ['SLURM_TASKS_PER_NODE'])
print('SLURM_GPUS_PER_NODE =', os.environ['SLURM_GPUS_PER_NODE'])
print('SLURM_NNODES =', os.environ['SLURM_NNODES'])

# Create a model
model = Net()

# Create a trainer
trainer = pl.Trainer(max_epochs=10, gpus=2, num_nodes=4, strategy="ddp")

# Train the model
trainer.fit(model, train_loader)

And the code gets stuck:

Multiprocessing is handled by SLURM.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8

Just in case I was doing something wrong, I printed the environment variables shown in the code above, and the result was:

SLURM_NTASKS = 8
SLURM_TASKS_PER_NODE = 2(x4)
SLURM_GPUS_PER_NODE = volta:2
SLURM_NNODES = 4

As you can see, num_nodes equals the number of nodes requested in SLURM, and SLURM_TASKS_PER_NODE equals the gpus requested in the Trainer. If I remove the num_nodes argument from the Trainer call, the code runs, but not as I would expect: it only creates two processes (so I understand it is only using one node instead of the 4 requested).

Originally posted by @Naxel100 in #6206 (comment)

Hi, did you find a solution? I am having the same problem on the LSF platform.

Nope, in the end I ran the code on just one node.

The same thing happens to me. Did you try an older version of lightning?

Hi, I ran into the same issue recently and found a solution with help from @awaelchli. I would like to share my experience; hope it helps.
I'll go over the errors and solutions step by step.

Context

I tried to train a model with the "distributed data parallel" (DDP) strategy on a remote SLURM cluster. At the same time, I also wanted to load training data in parallel, and I needed to use Singularity as a container (for convenient code transfer between machines).


System:

  • torch: 1.13.0
  • cuda: 11.6
  • python: 3.9
  • pytorch-lightning: 1.8.6

  1. The SLURM configuration:
    #!/bin/bash
    #SBATCH --time=00:20:00
    #SBATCH --job-name=test_ddp
    #SBATCH --ntasks-per-node=4
    #SBATCH --cpus-per-task=8
    #SBATCH --gres=gpu:v100l:4
    #SBATCH --mem=0
    #SBATCH --nodes=1
    #SBATCH --output=./output/%x.out

Note: --ntasks-per-node must equal devices, and --nodes must equal num_nodes.


  2. How to execute with Singularity:
    srun --mpi=pmi2 singularity exec --nv -B /bind/path image.sif python test_ddp.py
    OR
    srun singularity exec --nv -B /bind/path image.sif python test_ddp.py

To run jobs across nodes with MPI requires:

  • Ensuring your MPI program is compiled using the OpenMPI installed inside your Singularity container.
  • Ideally the version of OpenMPI inside the container is version 3 or 4. Version 2 may or may not work. Version 1 will not work.
  • Ensure the MPI installation in the container can use the same process-management interface library as the MPI version on the cluster, e.g. PMI-2 or PMIx.
  • Ensuring your SLURM job script uses srun to run the MPI program, and have srun use a pmi library that is supported by the MPI implementation in the container. Do not use mpirun or mpiexec.

(quoted from https://docs.alliancecan.ca/wiki/Singularity#Running_MPI_programs_from_within_a_container)

  3. Trainer setting:
    Trainer(max_epochs=1, num_nodes=1, devices=4, accelerator='gpu', strategy='ddp')
    Note: --ntasks-per-node must equal devices, and --nodes must equal num_nodes.

  4. Dataloader setting:
    DataLoader(
        self.dataset,
        num_workers=self.args.num_workers,
        batch_size=self.args.configs.train.batch_size,
        shuffle=True,
        pin_memory=True,
        prefetch_factor=self.args_D.prefetch_factor,
    )
    

Note: num_workers <= cpus-per-task (see the sketch below).
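
Putting steps 1-4 together, a minimal end-to-end sketch (with a toy model of my own, not the actual training code) that matches the SLURM script above would be:

import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(1, 1)

    def training_step(self, batch, _):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

# Dummy regression data, just to have something to feed the loader.
x = torch.randn(1024, 1)
train_loader = DataLoader(
    TensorDataset(x, 3 * x),
    batch_size=64,
    shuffle=True,
    pin_memory=True,
    num_workers=8,   # <= --cpus-per-task=8 in the SLURM script above
)

# devices matches --ntasks-per-node=4, num_nodes matches --nodes=1.
trainer = pl.Trainer(max_epochs=1, accelerator='gpu', devices=4,
                     num_nodes=1, strategy='ddp')
trainer.fit(TinyModel(), train_loader)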


Bugs record

  1. Get stuck at:

    GPU available: True (cuda), used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs
    /opt/conda/lib/python3.9/site-packages/pytorch_lightning/loops/utilities.py:94: PossibleUserWarning: max_epochs was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
    rank_zero_warn(
    Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4

  • Solution: 1) set --ntasks-per-node = devices and --nodes = num_nodes, 2) launch the script via srun singularity exec as shown in step 2.
  2. Get stuck at data loading
  • Solution: make num_workers <= cpus-per-task
    Notes: The default value of cpus-per-task is 1, while num_workers in this setup is 2, so if cpus-per-task is not set explicitly, the job can easily get stuck.
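
If you want to guard against that last pitfall programmatically, a small helper along these lines (my own sketch, not part of Lightning) caps num_workers at what SLURM actually allocated:

import os

def safe_num_workers(requested):
    """Cap DataLoader workers at SLURM's --cpus-per-task.

    SLURM only exports SLURM_CPUS_PER_TASK when --cpus-per-task is set
    explicitly, so fall back to 1 (the SLURM default) otherwise.
    """
    cpus_per_task = int(os.environ.get("SLURM_CPUS_PER_TASK", 1))
    return min(requested, cpus_per_task)

# e.g. with #SBATCH --cpus-per-task=8, safe_num_workers(16) returns 8
print(safe_num_workers(16))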