Theano/libgpuarray

GpuArrayException: nccl Collectives Fail with MultiGPU Use

Nqabz opened this issue

Nqabz commented

Hi,

I recently switched to using the DGX containers. Training models with a single GPU works well, but to scale up and train on very large datasets I wanted to move to multi-GPU training.

In my first simple test, I am observing that the pygpu collectives fail even on a simple all_reduce() / all_gather() test. I tried testing with this script:

gpu_comm_test.txt

I am using the container pulled from the NVIDIA registry, nvcr.io/nvidia/theano:17.07, with Python 3.5.
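
The attached script follows roughly the pattern below (a simplified sketch rather than the attachment itself; NDEV, make_array and the queue-based exchange of the clique id are just placeholders for how the script wires things up):

import multiprocessing as mp

import numpy as np
from pygpu import collectives, gpuarray

NDEV = 3  # number of GPUs taking part in the clique


def make_array(rank, ctx):
    # A small per-rank float32 vector living on that rank's GPU.
    return gpuarray.array(np.arange(4, dtype="float32") + rank, context=ctx)


def run(rank, id_queue):
    # Each process initialises its own device and context after the fork.
    ctx = gpuarray.init("cuda" + str(rank))
    clique_id = collectives.GpuCommCliqueId(context=ctx)
    if rank == 0:
        # Rank 0 generates the NCCL unique id and hands it to the workers.
        for _ in range(NDEV - 1):
            id_queue.put(bytes(clique_id.comm_id))
    else:
        clique_id.comm_id = bytearray(id_queue.get())
    # The communicator constructor is collective: every rank has to call it.
    gpu_comm = collectives.GpuComm(clique_id, NDEV, rank)

    src = make_array(rank, ctx)
    summed = gpu_comm.all_reduce(src, "sum")
    gathered = gpu_comm.all_gather(src)  # this is the call that raises below
    print(rank, np.asarray(summed), np.asarray(gathered))


if __name__ == "__main__":
    queue = mp.Queue()
    workers = [mp.Process(target=run, args=(rank, queue))
               for rank in range(1, NDEV)]
    for w in workers:
        w.start()
    run(0, queue)  # the parent process acts as rank 0 (the "master")
    for w in workers:
        w.join()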

Here is the error trace I get (the output from the master and worker processes is interleaved):

worker rank in gpu_comm: 1
master rank in gpu_comm: 0
worker rank in gpu_comm: 2

Testing all_gather
Traceback (most recent call last):
  File "gpu_comm_test.py", line 108, in <module>
    master()
  File "gpu_comm_test.py", line 46, in master
    test_sequence(MASTER, gpu_comm, barrier)
  File "gpu_comm_test.py", line 56, in test_sequence
    r = gpu_comm.all_gather(s.container.data)
  File "pygpu/collectives.pyx", line 303, in pygpu.collectives.GpuComm.all_gather
  File "pygpu/collectives.pyx", line 508, in pygpu.collectives.pygpu_make_all_gathered
Process Process-2:
  File "pygpu/collectives.pyx", line 390, in pygpu.collectives.comm_all_gather
pygpu.gpuarray.GpuArrayException: b'ncclAllGather((void *)(src->ptr + offsrc), count, datatype, (void *)(dest->ptr + offdest), comm->c, ctx->s): invalid argument'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "gpu_comm_test.py", line 28, in worker
    test_sequence(rank, gpu_comm, barrier)
  File "gpu_comm_test.py", line 88, in test_sequence
    gpu_comm.all_gather(s.container.data)
  File "pygpu/collectives.pyx", line 303, in pygpu.collectives.GpuComm.all_gather
  File "pygpu/collectives.pyx", line 508, in pygpu.collectives.pygpu_make_all_gathered
  File "pygpu/collectives.pyx", line 390, in pygpu.collectives.comm_all_gather
pygpu.gpuarray.GpuArrayException: b'ncclAllGather((void *)(src->ptr + offsrc), count, datatype, (void *)(dest->ptr + offdest), comm->c, ctx->s): invalid argument'
Process Process-3:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "gpu_comm_test.py", line 28, in worker
    test_sequence(rank, gpu_comm, barrier)
  File "gpu_comm_test.py", line 88, in test_sequence
    gpu_comm.all_gather(s.container.data)
  File "pygpu/collectives.pyx", line 303, in pygpu.collectives.GpuComm.all_gather
  File "pygpu/collectives.pyx", line 508, in pygpu.collectives.pygpu_make_all_gathered
  File "pygpu/collectives.pyx", line 390, in pygpu.collectives.comm_all_gather
pygpu.gpuarray.GpuArrayException: b'ncclAllGather((void *)(src->ptr + offsrc), count, datatype, (void *)(dest->ptr + offdest), comm->c, ctx->s): invalid argument'

A similar error was reported in an earlier issue (NVIDIA/nccl#82), where the suggested fix was to install NCCL 1.3.4. I followed that suggestion and installed it in my DGX container, but I am still running into the same error. The test works with two GPUs, but as soon as I move to three or more GPUs it fails with the error above.
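
In case it is relevant, here is a quick way to confirm which NCCL version a container actually exposes (assuming the headers land under /usr/include, which may not match every install):

import re

# Assumption: nccl.h was installed to /usr/include; adjust the path otherwise.
with open("/usr/include/nccl.h") as header:
    text = header.read()

version = {name: int(value) for name, value in
           re.findall(r"#define NCCL_(MAJOR|MINOR|PATCH)\s+(\d+)", text)}
print(version)  # e.g. {'MAJOR': 1, 'MINOR': 3, 'PATCH': 4} after installing 1.3.4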

I appreciate any suggestions on this.

Thank you,

Nqabz