About training with multiple GPUs
boxuLibrary opened this issue · 4 comments
Hi, I've run into a problem. When I run the code with multiple GPUs distributed across different nodes on Slurm, I find that I cannot use the GPUs on different nodes. Does the code support distributed training across multiple nodes?
Hi @boxuLibrary, ideally this project should support multi-node training, since it is built on top of pytorch_lightning. However, I don't have an environment to test that; I can only confirm that this project works fine with multiple GPUs on the same node.
As previously mentioned, have you checked here to ensure all related parameters are correctly set? For example, in your sbatch script, make sure that
#SBATCH --nodes=4 # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=8 # This needs to match Trainer(devices=...)
Besides, make sure that you run this project with srun, as detailed here.
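For reference, here is a minimal sketch of what such an sbatch script could look like. The job name, GPU resource line, time limit, and the train.py entry point are placeholders for illustration only, not this project's actual launch command:

```bash
#!/bin/bash
# Minimal sketch: 4 nodes x 8 GPUs per node.
#SBATCH --job-name=multi_node_train   # placeholder job name
#SBATCH --nodes=4                     # must match Trainer(num_nodes=4)
#SBATCH --ntasks-per-node=8           # must match Trainer(devices=8)
#SBATCH --gres=gpu:8                  # request 8 GPUs per node
#SBATCH --time=24:00:00               # placeholder time limit

# Launch with srun so pytorch_lightning can read the SLURM environment
# variables and start one process per GPU across all nodes.
srun python train.py
```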
Oh, thank you for your reply. It is really helpful.
No worries @boxuLibrary. If you manage to make it work on multiple nodes, feel free to make a pull request (if some code needs to be modified) or leave some notes on this issue (I will pin it so that others can easily refer to it).
Hi @boxuLibrary, on another project I work on, I find it easy to train on multiple nodes with pytorch_lightning; just ensure that num_nodes and devices are correctly set, as indicated in my response above.
Besides, make sure that the nodes can 'ping' each other. In my case, I needed to set export NCCL_SOCKET_IFNAME=XXX in the sbatch script, where XXX is the interface name that can be obtained from ifconfig. You will encounter errors like NCCL WARN socketStartConnect: Connect to x.x.x.x failed : Software caused connection abort if the interface is not set properly.
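As a sketch, the relevant lines might look like this; the interface name eno1 is just an example, use whatever ifconfig reports on your nodes:

```bash
# List network interfaces and pick the one the nodes use to reach each
# other (e.g. an InfiniBand or 10GbE interface, not the loopback "lo").
ifconfig

# In the sbatch script, before the srun line. "eno1" is only an example
# interface name; replace it with the one reported on your cluster.
export NCCL_SOCKET_IFNAME=eno1
# Optional: print NCCL's connection setup to help diagnose socket errors.
export NCCL_DEBUG=INFO
```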
I will find time to test multi-node training on this project shortly.