【NVC-Net】Multi-GPU training multiple models?

Question

ColdFir5 opened this issue 3 years ago · 2 comments

When training on multiple GPUs (2), it appears to be training 2 models at the same time. Is this supposed to be the case?

Answer 1 · 2022-03-01T01:23:11.000Z

For distributed training on multiple GPUs, you need to install nnabla-ext-cuda110-nccl2-mpi3-1-6 or nnabla-ext-cuda110-nccl2-mpi2-1-1 instead of nnabla-ext-cuda110.
I recommend using the Docker image nnabla/nnabla-ext-cuda-multi-gpu, but if you cannot use Docker, set up the environment with apt and pip.

Note that:

nnabla-ext-cuda110 is the package for CUDA 11.0 on a single GPU.
nnabla-ext-cuda110-nccl2-mpi* are the packages for CUDA 11.0 on multiple GPUs.

If you have already installed nnabla-ext-cuda110 in your environment, please:

1. Uninstall nnabla and nnabla-ext-cuda110.
2. Install nnabla-ext-cuda110-nccl2-mpi3-1-6 or nnabla-ext-cuda110-nccl2-mpi2-1-1.
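To see which variant is currently installed before swapping packages, a small check along these lines may help. It is only a sketch: the helper names `classify` and `installed_names` are made up for illustration, and the name patterns are inferred from the package names mentioned above, so they may not cover every release.

```python
import re
from importlib import metadata

# Patterns inferred from the package names in this thread (an assumption).
MULTI_GPU = re.compile(r"^nnabla-ext-cuda\d+-nccl2-mpi")
SINGLE_GPU = re.compile(r"^nnabla-ext-cuda\d+$")


def classify(names):
    """Return 'multi-gpu', 'single-gpu', 'both', or 'none' for package names."""
    # Normalize separators so 'nnabla_ext_cuda110' and 'nnabla-ext-cuda110' compare equal.
    normalized = [re.sub(r"[-_.]+", "-", name).lower() for name in names]
    has_multi = any(MULTI_GPU.match(name) for name in normalized)
    has_single = any(SINGLE_GPU.match(name) for name in normalized)
    if has_multi and has_single:
        return "both"
    if has_multi:
        return "multi-gpu"
    if has_single:
        return "single-gpu"
    return "none"


def installed_names():
    """List the names of all distributions visible to this interpreter."""
    names = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            names.append(name)
    return names


if __name__ == "__main__":
    print(classify(installed_names()))
```

A result of 'single-gpu' or 'both' would suggest the single-GPU package still needs to be uninstalled first.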

To use this package, OpenMPI and libnccl2 must be present in your environment.

In some environments, these can be installed simply with apt-get install openmpi-bin libnccl2.
In other environments, you may need to build OpenMPI from source; see this Dockerfile:
https://github.com/sony/nnabla-ext-cuda/blob/master/docker/release/Dockerfile.cuda-mpi#L50-L80
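As a quick sanity check that both prerequisites are visible, something like the following could be run first. It is a rough sketch using only the Python standard library; `check_prerequisites` is a hypothetical helper, and `find_library` may miss a library installed in a non-standard location, so a False result is a hint rather than proof.

```python
import shutil
from ctypes.util import find_library


def check_prerequisites():
    """Report whether mpirun is on PATH and an NCCL shared library can be located."""
    return {
        # shutil.which returns the executable's path, or None if it is not on PATH.
        "mpirun": shutil.which("mpirun") is not None,
        # find_library searches the usual system locations for libnccl.
        "nccl": find_library("nccl") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'not found'}")
```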

Once the environment is set up,
mpirun -n 2 python main.py -c cudnn -d 0,1 will train NVC-Net on 2 GPUs.
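For other GPU counts, the same command can be assembled programmatically. The sketch below assumes that the process count passed to -n should equal the number of device IDs passed to -d, which matches the 2-GPU example above but is not stated explicitly in this thread; `build_mpirun_command` is a hypothetical helper, not part of NVC-Net.

```python
import shlex


def build_mpirun_command(device_ids, script="main.py", context="cudnn"):
    """Build the mpirun argument list, one process per listed GPU (an assumption)."""
    if not device_ids:
        raise ValueError("at least one device id is required")
    devices = ",".join(str(device) for device in device_ids)
    return [
        "mpirun", "-n", str(len(device_ids)),
        "python", script,
        "-c", context,
        "-d", devices,
    ]


if __name__ == "__main__":
    # Print the command for GPUs 0 and 1 as a single shell-ready string.
    print(shlex.join(build_mpirun_command([0, 1])))
```

The returned list can be passed directly to subprocess.run if launching from Python is preferred.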

Answer 2 · 2022-03-01T02:41:25.000Z

Thank you very much for this information. It's now working after manually installing those packages and libraries.