GpuArrayException: NCCL Collectives Fail with Multi-GPU Use
Nqabz opened this issue · 0 comments
Hi,
I recently switched to using the DGX containers. Training models on a single GPU works fine, but to scale training to very large datasets I want to move to multi-GPU training.
In my first simple test, I am observing that pygpu collectives fail even on simple all_reduce() and all_gather() test code. I tried testing with this script:
using the container pulled from the NVIDIA registry, nvcr.io/nvidia/theano:17.07, with Python 3.5.
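Since the script itself is not attached here, this is roughly what it does (a sketch with illustrative names, based on my reading of the pygpu collectives API, not the exact code; it requires pygpu built with NCCL support and one process per GPU):

```python
# Sketch of a minimal multi-process pygpu collectives test (the original
# gpu_comm_test.py is not attached; names below are illustrative).
import multiprocessing as mp

def gathered_size(src_size, ndev):
    # Defining property of ncclAllGather: every rank ends up with
    # ndev * count elements (its own chunk plus every peer's).
    return ndev * src_size

def worker(rank, ndev, comm_id, barrier):
    # pygpu is imported inside the child so each process binds its own GPU.
    import numpy as np
    import pygpu
    from pygpu import collectives

    ctx = pygpu.init("cuda%d" % rank)                 # one context per device
    cid = collectives.GpuCommCliqueId(context=ctx)
    cid.comm_id = comm_id                             # NCCL unique id from rank 0
    comm = collectives.GpuComm(cid, ndev, rank)

    src = pygpu.gpuarray.asarray(
        np.arange(4, dtype="float32") + rank, context=ctx)
    barrier.wait()                                    # enter the collective together
    res = comm.all_gather(src)                        # fails here with 3+ GPUs
    assert res.size == gathered_size(src.size, ndev)

def run_test(ndev=3):
    # Rank 0's process generates the NCCL clique id and shares it with peers.
    import pygpu
    from pygpu import collectives
    ctx = pygpu.init("cuda0")
    comm_id = collectives.GpuCommCliqueId(context=ctx).comm_id
    barrier = mp.Barrier(ndev)
    procs = [mp.Process(target=worker, args=(r, ndev, comm_id, barrier))
             for r in range(ndev)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

On a DGX box, `run_test(3)` reproduces the failure below; with `run_test(2)` it completes.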
Here is the error trace I get:
worker rank in gpu_comm: 1
master rank in gpu_comm: 0
worker rank in gpu_comm: 2
Testing all_gather
Traceback (most recent call last):
File "gpu_comm_test.py", line 108, in <module>
master()
File "gpu_comm_test.py", line 46, in master
test_sequence(MASTER, gpu_comm, barrier)
File "gpu_comm_test.py", line 56, in test_sequence
r = gpu_comm.all_gather(s.container.data)
File "pygpu/collectives.pyx", line 303, in pygpu.collectives.GpuComm.all_gather
File "pygpu/collectives.pyx", line 508, in pygpu.collectives.pygpu_make_all_gathered
Process Process-2:
File "pygpu/collectives.pyx", line 390, in pygpu.collectives.comm_all_gather
pygpu.gpuarray.GpuArrayException: b'ncclAllGather((void *)(src->ptr + offsrc), count, datatype, (void *)(dest->ptr + offdest), comm->c, ctx->s): invalid argument'
Traceback (most recent call last):
File "/opt/conda/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
self.run()
File "/opt/conda/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "gpu_comm_test.py", line 28, in worker
test_sequence(rank, gpu_comm, barrier)
File "gpu_comm_test.py", line 88, in test_sequence
gpu_comm.all_gather(s.container.data)
File "pygpu/collectives.pyx", line 303, in pygpu.collectives.GpuComm.all_gather
File "pygpu/collectives.pyx", line 508, in pygpu.collectives.pygpu_make_all_gathered
File "pygpu/collectives.pyx", line 390, in pygpu.collectives.comm_all_gather
pygpu.gpuarray.GpuArrayException: b'ncclAllGather((void *)(src->ptr + offsrc), count, datatype, (void *)(dest->ptr + offdest), comm->c, ctx->s): invalid argument'
Process Process-3:
Traceback (most recent call last):
File "/opt/conda/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
self.run()
File "/opt/conda/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "gpu_comm_test.py", line 28, in worker
test_sequence(rank, gpu_comm, barrier)
File "gpu_comm_test.py", line 88, in test_sequence
gpu_comm.all_gather(s.container.data)
File "pygpu/collectives.pyx", line 303, in pygpu.collectives.GpuComm.all_gather
File "pygpu/collectives.pyx", line 508, in pygpu.collectives.pygpu_make_all_gathered
File "pygpu/collectives.pyx", line 390, in pygpu.collectives.comm_all_gather
pygpu.gpuarray.GpuArrayException: b'ncclAllGather((void *)(src->ptr + offsrc), count, datatype, (void *)(dest->ptr + offdest), comm->c, ctx->s): invalid argument'
A similar error was reported in an earlier issue (NVIDIA/nccl#82), where the suggested fix was to install NCCL 1.3.4. I followed that suggestion in my DGX container and still run into the same error. The test works with two GPUs, but as soon as I move to 3+ GPUs it fails with the error above.
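For what it's worth, NCCL reports "invalid argument" when a collective is handed a bad buffer or count, so one thing I tried to rule out is a non-contiguous or empty source array. A quick sanity check (illustrative only; duck-typed so it runs on numpy arrays, and pygpu's GpuArray exposes a similar flags object) could look like:

```python
import numpy as np

def check_collective_operand(arr):
    """Flag conditions under which ncclAllGather is known to reject a
    buffer with 'invalid argument' (illustrative, not exhaustive)."""
    problems = []
    if arr.size == 0:
        problems.append("empty array (count == 0)")
    if not arr.flags["C_CONTIGUOUS"]:
        problems.append("not C-contiguous")
    return problems

# Example on host arrays; the same check would apply to the array passed
# to gpu_comm.all_gather (s.container.data in the trace above).
print(check_collective_operand(np.arange(6, dtype="float32")))  # → []
print(check_collective_operand(np.ones((4, 4))[:, ::2]))        # non-contiguous slice
```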
I appreciate any suggestions on this.
Thank you,
Nqabz