lucidrains/magvit2-pytorch

Running multi-gpu training

joe-sht opened this issue · 5 comments

How do I run training on multiple GPUs? As far as I can see, training only runs on a single GPU.

I am also curious. The error I get is this:

Traceback (most recent call last):
  File "/root/research/suhail/magvit2/train.py", line 27, in <module>
    trainer = VideoTokenizerTrainer(
  File "/root/research/suhail/.venv/lib/python3.10/site-packages/pytorch_custom_utils/accelerate_utils.py", line 95, in __init__
    _orig_init(self, *args, **kwargs)
  File "<@beartype(magvit2_pytorch.trainer.VideoTokenizerTrainer.__init__) at 0x7f20aa90b910>", line 314, in __init__
  File "/root/research/suhail/.venv/lib/python3.10/site-packages/magvit2_pytorch/trainer.py", line 203, in __init__
    self.has_multiscale_discrs = self.model.has_multiscale_discrs
  File "/root/research/suhail/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'has_multiscale_discrs'

ChatGPT:
The error you're encountering indicates that the attribute has_multiscale_discrs is being accessed on an object of type DistributedDataParallel, which does not have that attribute. This is a common issue when using PyTorch's DistributedDataParallel (DDP) wrapper for distributed training. DDP wraps your model and replicates it across processes, managing the distribution of data and the synchronization of gradients. However, the wrapper only forwards the forward() call to the underlying model; custom attributes and methods defined on your model are not visible on the wrapper and have to be accessed through its .module attribute.

IIRC one needs to replace direct self.model.whatever accesses with something like (self.model.module if isinstance(self.model, (nn.DataParallel, nn.parallel.DistributedDataParallel)) else self.model).whatever, potentially via a helper function, when using PyTorch DDP.
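A minimal sketch of that helper pattern (illustrative only, not the trainer's actual code; the unwrap name is made up here):

import torch.nn as nn

def unwrap(model):
    # Hypothetical helper: DataParallel and DistributedDataParallel keep the
    # wrapped model under .module, so custom attributes must be read from there.
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        return model.module
    return model

# e.g. in the trainer, instead of self.model.has_multiscale_discrs:
# self.has_multiscale_discrs = unwrap(self.model).has_multiscale_discrs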

The code uses accelerate to handle DDP automatically.
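For reference, accelerate exposes the same unwrapping through Accelerator.unwrap_model; a minimal sketch under that assumption (the model here is a stand-in, not the repo's tokenizer):

from accelerate import Accelerator
import torch.nn as nn

accelerator = Accelerator()

model = nn.Linear(8, 8)                  # stand-in for the actual model
model = accelerator.prepare(model)       # becomes a DDP wrapper under a multi-gpu launch

inner = accelerator.unwrap_model(model)  # recovers the original nn.Module
# custom attributes such as has_multiscale_discrs live on `inner`, not on the wrapper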

Can you please share your command to do DDP?

Just refer to https://github.com/huggingface/accelerate.
For example, if you are using 2 GPUs:

accelerate launch --multi_gpu --num_processes 2 train.py --...
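Alternatively, a common workflow (not specific to this repo) is to run accelerate's interactive setup once, answer the multi-GPU questions there, and then launch without the extra flags:

accelerate config
accelerate launch train.py --...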

You may also find huggingface/accelerate#1239 helpful if you are running in a Slurm environment.