Lightning-AI/pytorch-lightning

Multi-node DDP training hangs during initialization.

ElderWanng opened this issue · 11 comments

๐Ÿ› Bug

I'm trying to use all available compute resources to speed up training. My code works fine in single-node, multi-GPU mode (so I believe most of the DDP setup is correct).

But when it comes to multiple nodes, the code always stalls during DDP initialization. (It stops at 2/4; I guess one node initializes correctly but the other one gets stuck.)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
Traceback (most recent call last):
  File "/home/tw2112/codes/s2s/aux_with_neg_wiki/cool_test/test.py", line 74, in <module>
    trainer.fit(model)
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 863, in _run
    self.accelerator.setup_environment()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu.py", line 30, in setup_environment
    super().setup_environment()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 76, in setup_environment
    self.training_type_plugin.setup_environment()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 168, in setup_environment
    self.setup_distributed()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 253, in setup_distributed
    self.init_ddp_connection()
  File "/ext3/miniconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 323, in init_ddp_connection
    torch.distributed.init_process_group(
  File "/ext3/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 525, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/ext3/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 212, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=8, worker_count=3, timeout=0:30:00)



To Reproduce

I followed #8707 and borrowed the simple model from it; the problem still occurs:

test.py:

import logging
import os
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
import torchvision.transforms as transforms

import pytorch_lightning as ptl

def get_logger(name=__name__, level=logging.INFO):
    """Initializes python logger."""

    logger = logging.getLogger(name)
    logger.setLevel(level)

    # this ensures all logging levels get marked with the rank zero decorator
    # otherwise logs would get multiplied for each GPU process in multi-GPU setup
    for level in ("debug", "info", "warning", "error", "exception", "fatal", "critical"):
        setattr(logger, level, ptl.utilities.rank_zero_only(getattr(logger, level)))

    return logger


log = get_logger(__name__)

class CoolModel(ptl.LightningModule):

    def __init__(self):
        super(CoolModel, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def my_loss(self, y_hat, y):
        return F.cross_entropy(y_hat, y)

    def training_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        return {'loss': self.my_loss(y_hat, y)}

    def validation_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        return {'val_loss': self.my_loss(y_hat, y)}

    def validation_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        return {'avg_val_loss': avg_loss}

    def configure_optimizers(self):
        return [torch.optim.Adam(self.parameters(), lr=0.02)]

    def train_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    def val_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    def test_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)


from pytorch_lightning import Trainer

model = CoolModel()


trainer = Trainer(max_epochs=1, gpus=4, num_nodes=2, accelerator='ddp')

trainer.fit(model)

Due to my HPC cluster's setup, I have to use Singularity to load the conda environment instead of the module commands used in other posts.

run.slurm

#!/bin/bash
#SBATCH --output=./%j_%x.out
#SBATCH --error=./%j_%x.err
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=4
#SBATCH --export=ALL
#SBATCH --time=2-00:00:00
#SBATCH --gres=gpu:4
#SBATCH --mem=10G
#SBATCH --account=cds
#SBATCH -c 4



srun singularity exec --nv  --overlay $SCRATCH/overlay2/overlay-50G-10M.ext3:ro   /scratch/work/public/singularity/cuda11.1-cudnn8-devel-ubuntu18.04.sif /bin/bash -c "
cd
source /ext3/env.sh
conda activate
cd /home/tw2112/codes/s2s/aux_with_neg_wiki/cool_test
python test.py
"

Expected behavior

All processes start.

Environment

  • CUDA:
    - GPU:
    - NVIDIA Quadro RTX 8000
    - NVIDIA Quadro RTX 8000
    - NVIDIA Quadro RTX 8000
    - NVIDIA Quadro RTX 8000
    - available: True
    - version: 11.1
  • Packages:
    - numpy: 1.21.2
    - pyTorch_debug: False
    - pyTorch_version: 1.8.1
    - pytorch-lightning: 1.4.9
    - tqdm: 4.62.2
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.8.5
    - version: #1 SMP Fri Oct 16 13:38:49 EDT 2020

This is the environment output from one of the nodes.

Additional context

I'm new to slurm... maybe I made some stupid mistake...

cc @awaelchli @rohitgr7

Hey, please check our docs for how to run on slurm.

You are not setting the number of nodes in SLURM. It assumes 1 by default.
The number of nodes and ntasks-per-node settings in SLURM must match the num_nodes and gpus in the Trainer.
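
For the 2-node, 4-GPU-per-node job in this issue, the mapping would look roughly like this (a minimal sketch, not the poster's exact script):

# Sketch only: the SLURM directives on the left must agree with the
# Trainer arguments on the right for a 2-node x 4-GPU job.
#
#   #SBATCH --nodes=2            <->  num_nodes=2
#   #SBATCH --ntasks-per-node=4  <->  gpus=4
#   #SBATCH --gres=gpu:4
#
from pytorch_lightning import Trainer

trainer = Trainer(max_epochs=1, gpus=4, num_nodes=2, accelerator='ddp')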

stale commented

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

Did you manage to run it with my directions in the previous comment?

stale commented

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@awaelchli Hi, I have the same issue. Could you help me please?

@ElderWanng Hey, I have the same issue. Were you able to fix it? Any advice on how?

Hey, please check our docs for how to run on slurm.

You are not setting the number of nodes in SLURM. It assumes 1 by default. The number of nodes and ntasks-per-node settings in SLURM must match the num_nodes and gpus in the Trainer.

There appears to be another reason for the problem, but I am not able to find what it is. I am executing a dummy script:

'''
This code creates a simple one-hidden-layer neural network trained on a
dummy dataset. The network is trained with the Trainer from PyTorch Lightning.
'''

import os
import torch
import torch.nn as nn
import pytorch_lightning as pl

class Dataset(torch.utils.data.Dataset):
    def __init__(self):
        self.x = torch.arange(-1, 1, 0.02)
        self.y = 3 * self.x + torch.randn(self.x.size()) * 0.33

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return self.x.size(0)


# Create a simple neural network
class Net(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1, 10)
        self.fc2 = nn.Linear(10, 1)
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        return x

    def training_step(self, batch, _):
        x, y = batch
        y_hat = self(x)
        loss = nn.functional.mse_loss(y_hat, y)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)



# Create train and test dataloaders
train_loader = torch.utils.data.DataLoader(Dataset(), batch_size=64)

# Some prints that might be useful
print('SLURM_NTASKS =', os.environ['SLURM_NTASKS'])
print('SLURM_TASKS_PER_NODE =', os.environ['SLURM_TASKS_PER_NODE'])
print('SLURM_GPUS_PER_NODE =', os.environ['SLURM_GPUS_PER_NODE'])
print('SLURM_NNODES =', os.environ['SLURM_NNODES'])

# Create a model
model = Net()

# Create a trainer
trainer = pl.Trainer(max_epochs=10, gpus=2, num_nodes=4, strategy="ddp")

# Train the model
trainer.fit(model, train_loader)

And the code gets stuck:

Multiprocessing is handled by SLURM.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8

Just in case I was doing something wrong, I printed the environment variables shown in the code above, and the result was:

SLURM_NTASKS = 8
SLURM_TASKS_PER_NODE = 2(x4)
SLURM_GPUS_PER_NODE = volta:2
SLURM_NNODES = 4

As you can see, num_nodes equals the number of nodes requested in SLURM, and SLURM_TASKS_PER_NODE equals the gpus requested in the Trainer. If I remove the num_nodes argument from the Trainer call, the code runs, but not as I would expect: it only creates two processes (so I understand it is only using one node instead of the 4 requested).

Originally posted by @Naxel100 in #6206 (comment)

Hi, did you find a solution? I am having the same problem on the LSF platform.

Nope, in the end I ran the code on just one node.

The same thing happens to me. Did you try an older version of lightning?

Hi, I ran into the same issue recently and found a solution with help from @awaelchli. I would like to share my experience; hope it helps.
I'll go over the errors and solutions step by step.

Context

I tried to train a model with the "distributed data parallel" (DDP) strategy on a remote SLURM cluster. At the same time, I also wanted to load training data in parallel, and I needed to use Singularity as a container (for convenient code transfer between machines).


System:

  • torch: 1.13.0
  • cuda: 11.6
  • python: 3.9
  • pytorch-lightning: 1.8.6

  1. The SLURM configuration:
    #!/bin/bash
    #SBATCH --time=00:20:00
    #SBATCH --job-name=test_ddp
    #SBATCH --ntasks-per-node=4
    #SBATCH --cpus-per-task=8
    #SBATCH --gres=gpu:v100l:4
    #SBATCH --mem=0
    #SBATCH --nodes=1
    #SBATCH --output=./output/%x.out

Note: --ntasks-per-node must equal devices, and --nodes must equal num_nodes.


  2. How to execute with Singularity:
    srun --mpi=pmi2 singularity exec --nv -B /bind/path image.sif python test_ddp.py
    OR
    srun singularity exec --nv -B /bind/path image.sif python test_ddp.py

To run jobs across nodes with MPI requires:

  • Ensuring your MPI program is compiled using the OpenMPI installed inside your Singularity container.
  • Ideally the version of OpenMPI inside the container is version 3 or 4. Version 2 may or may not work. Version 1 will not work.
  • Ensure the MPI installation in the container can use the same process-management interface library as the MPI version on the cluster, e.g. PMI-2 or PMIx.
  • Ensuring your SLURM job script uses srun to run the MPI program, and have srun use a pmi library that is supported by the MPI implementation in the container. Do not use mpirun or mpiexec.

(quoted from https://docs.alliancecan.ca/wiki/Singularity#Running_MPI_programs_from_within_a_container)

  3. Trainer setting:
    Trainer(max_epochs=1, num_nodes=1, devices=4, accelerator='gpu', strategy='ddp')
    Note: --ntasks-per-node must equal devices, and --nodes must equal num_nodes.

  4. Dataloader setting:
    DataLoader(
        self.dataset,
        num_workers=self.args.num_workers,
        batch_size=self.args.configs.train.batch_size,
        shuffle=True,
        pin_memory=True,
        prefetch_factor=self.args_D.prefetch_factor,
    )
    

Note: num_workers <= cpus-per-task (see the sketch below).
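
Putting steps 1-4 together, a minimal end-to-end sketch (with a toy model of my own, not the actual training code) that matches the SLURM script above would be:

import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(1, 1)

    def training_step(self, batch, _):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

# Dummy regression data, just to have something to feed the loader.
x = torch.randn(1024, 1)
train_loader = DataLoader(
    TensorDataset(x, 3 * x),
    batch_size=64,
    shuffle=True,
    pin_memory=True,
    num_workers=8,   # <= --cpus-per-task=8 in the SLURM script above
)

# devices matches --ntasks-per-node=4, num_nodes matches --nodes=1.
trainer = pl.Trainer(max_epochs=1, accelerator='gpu', devices=4,
                     num_nodes=1, strategy='ddp')
trainer.fit(TinyModel(), train_loader)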


Bugs record

  1. Get stuck at:

    GPU available: True (cuda), used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs
    /opt/conda/lib/python3.9/site-packages/pytorch_lightning/loops/utilities.py:94: PossibleUserWarning: max_epochs was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
    rank_zero_warn(
    Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4

  • Solution: 1) set --ntasks-per-node = devices and --nodes = num_nodes, 2) launch the script via srun singularity exec as shown in step 2.
  2. Get stuck at data loading
  • Solution: make num_workers <= cpus-per-task
    Notes: The default value of cpus-per-task is 1, while num_workers in this setup is 2, so if cpus-per-task is not set explicitly, the job can easily get stuck.
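
If you want to guard against that last pitfall programmatically, a small helper along these lines (my own sketch, not part of Lightning) caps num_workers at what SLURM actually allocated:

import os

def safe_num_workers(requested):
    """Cap DataLoader workers at SLURM's --cpus-per-task.

    SLURM only exports SLURM_CPUS_PER_TASK when --cpus-per-task is set
    explicitly, so fall back to 1 (the SLURM default) otherwise.
    """
    cpus_per_task = int(os.environ.get("SLURM_CPUS_PER_TASK", 1))
    return min(requested, cpus_per_task)

# e.g. with #SBATCH --cpus-per-task=8, safe_num_workers(16) returns 8
print(safe_num_workers(16))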