sony/ai-research-code

【NVC-Net】Multi-GPU training multiple models?

ColdFir5 opened this issue · 2 comments

When training on two GPUs, it appears to be training two models at the same time. Is this supposed to be the case?

For distributed training with multiple GPUs, you need to install nnabla-ext-cuda110-nccl2-mpi3-1-6 or nnabla-ext-cuda110-nccl2-mpi2-1-1 instead of nnabla-ext-cuda110.
I recommend using the Docker image nnabla/nnabla-ext-cuda-multi-gpu. If you cannot use Docker, set up the environment with apt and pip.

Note that:

  • nnabla-ext-cuda110 is the package for CUDA 11.0 on a single GPU.
  • nnabla-ext-cuda110-nccl2-mpi* are the packages for CUDA 11.0 on multiple GPUs.

If you have already installed nnabla-ext-cuda110 in your environment, please:

  1. Uninstall nnabla and nnabla-ext-cuda110.
  2. Install nnabla-ext-cuda110-nccl2-mpi3-1-6 or nnabla-ext-cuda110-nccl2-mpi2-1-1.
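To check which of these packages the current Python environment actually has, something like the following sketch may help. It only uses the standard library; the function names are made up for illustration, and the rule "a name containing nccl is a multi-GPU package" is an assumption based on the package names above.

```python
from importlib import metadata


def classify_nnabla_packages(names):
    """Split distribution names into multi-GPU (nccl) and other nnabla packages."""
    # Normalize case and underscores so "nnabla_ext_cuda110" matches too.
    found = sorted(
        {n.lower().replace("_", "-") for n in names
         if n and n.lower().startswith("nnabla")}
    )
    multi = [n for n in found if "nccl" in n]       # assumption: nccl => multi-GPU
    other = [n for n in found if "nccl" not in n]
    return {"multi_gpu": multi, "other": other}


def installed_distribution_names():
    """Names of every distribution visible to the current interpreter."""
    return [dist.metadata["Name"] for dist in metadata.distributions()]


if __name__ == "__main__":
    report = classify_nnabla_packages(installed_distribution_names())
    print("multi-GPU packages:", report["multi_gpu"] or "none")
    print("other nnabla packages:", report["other"] or "none")
```

If nnabla-ext-cuda110 still shows up under "other nnabla packages", the single-GPU package has not been uninstalled yet.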

To use this package, openmpi and libnccl2 must be installed in your environment.

In some environments, these can be installed simply with apt-get install openmpi-bin libnccl2.
In other environments, you may need to build openmpi from source; see this Dockerfile:
https://github.com/sony/nnabla-ext-cuda/blob/master/docker/release/Dockerfile.cuda-mpi#L50-L80
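As a rough sanity check that both prerequisites are visible, a sketch like this could be run before training. It is not part of nnabla; the function name is hypothetical, and the lookup can miss libraries installed in non-standard locations (for example an openmpi built from source under a custom prefix that is not on PATH).

```python
import ctypes.util
import shutil


def check_mpi_prerequisites():
    """Report whether mpirun and an NCCL shared library can be located."""
    return {
        # mpirun must be on PATH to launch the training processes.
        "mpirun": shutil.which("mpirun") is not None,
        # find_library searches the usual system library locations.
        "libnccl": ctypes.util.find_library("nccl") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_mpi_prerequisites().items():
        print(f"{name}: {'found' if ok else 'not found'}")
```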

Once the environment is set up,
mpirun -n 2 python main.py -c cudnn -d 0,1 will train NVC-Net on 2 GPUs.
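To see whether the processes really belong to one MPI job, each process can print its rank and the job size. The sketch below assumes Open MPI, which exports OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_RANK to the processes it launches; the function name is hypothetical, and this only confirms the mpirun launch, not that nnabla is synchronizing gradients between the GPUs.

```python
import os


def mpi_job_info(env=None):
    """Read the Open MPI job size and rank from environment variables."""
    env = os.environ if env is None else env
    return {
        # Open MPI sets these variables only for processes started by mpirun.
        "launched_by_mpirun": "OMPI_COMM_WORLD_SIZE" in env,
        "size": int(env.get("OMPI_COMM_WORLD_SIZE", "1")),
        "rank": int(env.get("OMPI_COMM_WORLD_RANK", "0")),
    }


if __name__ == "__main__":
    info = mpi_job_info()
    print(f"rank {info['rank']} of {info['size']} "
          f"(launched by mpirun: {info['launched_by_mpirun']})")
```

Saved under a hypothetical name such as check_mpi.py and run with mpirun -n 2 python check_mpi.py, it should report a size of 2 from both processes, with ranks 0 and 1.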

Thank you very much for this information. It's now working after manually installing those packages and libraries.