【NVC-Net】Multi-GPU training multiple models?
ColdFir5 opened this issue · 2 comments
For distributed training with multiple GPUs, you need to install `nnabla-ext-cuda110-nccl2-mpi3-1-6` or `nnabla-ext-cuda110-nccl2-mpi2-1-1` instead of `nnabla-ext-cuda110`.
I recommend using the Docker image `nnabla/nnabla-ext-cuda-multi-gpu`, but if you cannot use Docker, set up the environment with apt and pip.
Note that:
- `nnabla-ext-cuda110` is the package for CUDA 11.0 on a single GPU.
- `nnabla-ext-cuda110-nccl2-mpi*` are the packages for CUDA 11.0 on multiple GPUs.
If you have already installed `nnabla-ext-cuda110` in your environment, please
- uninstall `nnabla` and `nnabla-ext-cuda110`
- install `nnabla-ext-cuda110-nccl2-mpi3-1-6` or `nnabla-ext-cuda110-nccl2-mpi2-1-1`.
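The two steps above can be sketched as a small shell helper. This is only an illustration: the function names `run` and `switch_to_multi_gpu` and the `DRY_RUN` variable are made up for this example, and it assumes `pip` is the installer in use. By default it only prints the commands, so nothing is changed until you set `DRY_RUN=0`.

```shell
#!/usr/bin/env bash
# Sketch: swap the single-GPU nnabla packages for a multi-GPU build.
# DRY_RUN defaults to 1, so commands are printed rather than executed.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"          # dry run: show the command only
    else
        "$@"               # real run: execute the command
    fi
}

switch_to_multi_gpu() {
    # Package names come from the thread; pass the mpi2-1-1 variant
    # as the first argument if that is the one you need.
    local pkg="${1:-nnabla-ext-cuda110-nccl2-mpi3-1-6}"
    run pip uninstall -y nnabla nnabla-ext-cuda110
    run pip install "$pkg"
}
```

Calling `switch_to_multi_gpu` prints the two pip commands for review; `DRY_RUN=0 switch_to_multi_gpu nnabla-ext-cuda110-nccl2-mpi2-1-1` would actually run them with the other package variant.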
To use this package, OpenMPI and libnccl2 must be present in your environment.
In some environments, they can be installed simply with `apt-get install openmpi-bin libnccl2`.
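Before installing anything, it may help to check whether the two prerequisites are already visible. The sketch below is a rough check, not an official diagnostic: the helper names are hypothetical, and it assumes that `mpirun` is on `PATH` once OpenMPI is installed and that libnccl2 shows up as `libnccl.so.2` in the `ldconfig -p` cache.

```shell
#!/usr/bin/env bash
# Sketch: report whether OpenMPI and libnccl2 appear to be installed.

have_cmd() {
    # True if the named command is found on PATH.
    command -v "$1" >/dev/null 2>&1
}

have_lib() {
    # True if the linker cache lists a library matching the pattern.
    # If ldconfig itself is unavailable, this reports "missing".
    ldconfig -p 2>/dev/null | grep -q "$1"
}

check_prereqs() {
    if have_cmd mpirun; then
        echo "openmpi: found"
    else
        echo "openmpi: missing (try: apt-get install openmpi-bin)"
    fi
    if have_lib 'libnccl\.so\.2'; then
        echo "libnccl2: found"
    else
        echo "libnccl2: missing (try: apt-get install libnccl2)"
    fi
}
```

Running `check_prereqs` prints one status line per prerequisite. A "missing" result for OpenMPI on a system where it was built from source may just mean its `bin` directory is not on `PATH`.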
In other environments, you may need to build OpenMPI from source; see this Dockerfile:
https://github.com/sony/nnabla-ext-cuda/blob/master/docker/release/Dockerfile.cuda-mpi#L50-L80
After the environment is set up,
`mpirun -n 2 python main.py -c cudnn -d 0,1`
will train NVC-Net with 2 GPUs.
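For a different number of GPUs, the command would presumably change in two places. The sketch below builds the command line from a device list, assuming one MPI process per GPU so that `-n` equals the number of ids passed to `-d`. That matches the 2-GPU example above but is not confirmed here for other counts, and `launch_cmd` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: derive the mpirun process count from a comma-separated device list.
# Assumption: one MPI process per GPU, as in the 2-GPU example above.

launch_cmd() {
    local devices="$1"     # e.g. "0,1"
    local n
    # Split on commas and count the non-empty entries.
    n=$(printf '%s' "$devices" | tr ',' '\n' | grep -c .)
    echo "mpirun -n $n python main.py -c cudnn -d $devices"
}

launch_cmd 0,1    # prints: mpirun -n 2 python main.py -c cudnn -d 0,1
```

The function only prints the command, so it can be reviewed before being run, for example with `eval "$(launch_cmd 0,1)"`.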
Thank you very much for this information. It is now working after manually installing those packages and libraries.