Running multi-gpu training
joe-sht opened this issue · 5 comments
How do I run training on multiple GPUs? As far as I can see, training only runs on a single GPU.
I am also curious. The error I get is this:
```
Traceback (most recent call last):
  File "/root/research/suhail/magvit2/train.py", line 27, in <module>
    trainer = VideoTokenizerTrainer(
  File "/root/research/suhail/.venv/lib/python3.10/site-packages/pytorch_custom_utils/accelerate_utils.py", line 95, in __init__
    _orig_init(self, *args, **kwargs)
  File "<@beartype(magvit2_pytorch.trainer.VideoTokenizerTrainer.__init__) at 0x7f20aa90b910>", line 314, in __init__
  File "/root/research/suhail/.venv/lib/python3.10/site-packages/magvit2_pytorch/trainer.py", line 203, in __init__
    self.has_multiscale_discrs = self.model.has_multiscale_discrs
  File "/root/research/suhail/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'has_multiscale_discrs'
```
ChatGPT:
The error indicates that the attribute `has_multiscale_discrs` is being accessed on an object of type `DistributedDataParallel`, which does not have that attribute. This is a common issue when using PyTorch's DistributedDataParallel (DDP) wrapper for distributed training. The DDP wrapper takes your model and replicates it across GPUs, managing the distribution of data and the gathering of results. However, it only forwards the standard `nn.Module` machinery (the forward call, parameters, buffers, submodules) to the underlying model; custom attributes set on your model are not reachable through the wrapper unless you access the wrapped module directly.
iirc one needs to replace direct `self.model.whatever` accesses with something like `(self.model.module if isinstance(self.model, (nn.DataParallel, nn.parallel.DistributedDataParallel)) else self.model).whatever` (potentially via a helper function) when using PyTorch DDP
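For illustration, a minimal sketch of such a helper (the `unwrap` name and the commented trainer usage below are hypothetical, not the repo's actual code):

```python
import torch.nn as nn

def unwrap(model: nn.Module) -> nn.Module:
    # DataParallel and DistributedDataParallel both keep the original model
    # under .module; custom attributes like has_multiscale_discrs live there.
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        return model.module
    return model

# hypothetical usage inside a trainer:
# self.has_multiscale_discrs = unwrap(self.model).has_multiscale_discrs
```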
The code uses accelerate to do DDP automatically.
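For context, this is roughly the Accelerate pattern such a trainer relies on (a generic sketch; the toy `nn.Linear` model and data here are placeholders, not the repo's code):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

model = nn.Linear(8, 8)                      # stand-in for the video tokenizer
optimizer = torch.optim.Adam(model.parameters())
dataloader = DataLoader(TensorDataset(torch.randn(32, 8)), batch_size=4)

# prepare() moves everything to the right device and, when launched with
# `accelerate launch --multi_gpu ...`, wraps the model in DistributedDataParallel
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# after prepare(), `model` may be a DDP wrapper, so custom attributes on the
# original module should be read via accelerator.unwrap_model(model)
inner_model = accelerator.unwrap_model(model)
```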
Can you please share your command to do DDP?
Just refer to https://github.com/huggingface/accelerate. For example, if you are using 2 GPUs:

```
accelerate launch --multi_gpu --num_processes 2 train.py --...
```
You may also find huggingface/accelerate#1239 helpful if you are running in a Slurm environment.