About training with multiple GPUs
boxuLibrary opened this issue · 4 comments
Hi, I've run into a problem. When I run the code with multiple GPUs distributed across different nodes on Slurm, I find that I cannot use the GPUs on different nodes. Does the code support distributed training across multiple nodes?
Hi @boxuLibrary, ideally this project should support multi-node training, since it is built on top of pytorch_lightning. However, I don't have an environment to test that; I can only confirm that this project works fine with multiple GPUs on the same node.
As previously mentioned, have you checked here to ensure all related parameters are correctly set? For example, in your sbatch script, make sure that
#SBATCH --nodes=4 # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=8 # This needs to match Trainer(devices=...)
Besides, make sure that you run this project with srun, as detailed here.
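For reference, here is a minimal sketch of what such an sbatch script could look like. The job name, GPU resource line, time limit, and the train.py entry point are placeholders for illustration only, not this project's actual launch command:

```bash
#!/bin/bash
# Minimal sketch: 4 nodes x 8 GPUs per node.
#SBATCH --job-name=multi_node_train   # placeholder job name
#SBATCH --nodes=4                     # must match Trainer(num_nodes=4)
#SBATCH --ntasks-per-node=8           # must match Trainer(devices=8)
#SBATCH --gres=gpu:8                  # request 8 GPUs per node
#SBATCH --time=24:00:00               # placeholder time limit

# Launch with srun so pytorch_lightning can read the SLURM environment
# variables and start one process per GPU across all nodes.
srun python train.py
```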
Oh, thank you for your reply. It is really helpful.
No worries @boxuLibrary. If you manage to make it work on multiple nodes, feel free to make a pull request (if some code needs to be modified) or leave some notes on this issue (I will pin it so that others can easily refer to it).
Hi @boxuLibrary, on another project I work on, I find it easy to train on multiple nodes with pytorch_lightning; just ensure that num_nodes and devices are correctly set, as indicated in my response above.
Besides, make sure that the nodes can 'ping' each other. In my case, I needed to set export NCCL_SOCKET_IFNAME=XXX in the sbatch script, where XXX is the interface name that can be obtained from ifconfig. You will encounter errors like NCCL WARN socketStartConnect: Connect to x.x.x.x failed : Software caused connection abort if the interface is not set properly.
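As a sketch, the relevant lines might look like this; the interface name eno1 is just an example, use whatever ifconfig reports on your nodes:

```bash
# List network interfaces and pick the one the nodes use to reach each
# other (e.g. an InfiniBand or 10GbE interface, not the loopback "lo").
ifconfig

# In the sbatch script, before the srun line. "eno1" is only an example
# interface name; replace it with the one reported on your cluster.
export NCCL_SOCKET_IFNAME=eno1
# Optional: print NCCL's connection setup to help diagnose socket errors.
export NCCL_DEBUG=INFO
```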
I will find time to test multi-node training on this project shortly.