j96w/DenseFusion

cublas runtime error

Closed this issue · 9 comments

When I run the training script, I get the following error.
How do I fix this?

----------Dataset loaded!---------<<<<<<<<
length of the training set: 2373
length of the testing set: 1336
number of sample points on mesh: 500
symmetry object list: [7, 8]
/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='elementwise_mean' instead.
warnings.warn(warning.format(ret))
2020-09-11 03:59:25,250 : Train time 00h 00m 00s, Training started
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py:1890: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py:1961: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
"See the documentation of nn.Upsample for details.".format(mode))
/usr/local/lib/python3.5/dist-packages/torch/nn/modules/upsampling.py:122: UserWarning: nn.Upsampling is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.Upsampling is deprecated. Use nn.functional.interpolate instead.")
/usr/local/lib/python3.5/dist-packages/torch/nn/modules/container.py:91: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
input = module(input)
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f7776454d30>>
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 399, in __del__
self._shutdown_workers()
File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
self.worker_result_queue.get()
File "/usr/lib/python3.5/multiprocessing/queues.py", line 345, in get
return ForkingPickler.loads(res)
File "/usr/local/lib/python3.5/dist-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
fd = df.detach()
File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/usr/lib/python3.5/multiprocessing/reduction.py", line 181, in recv_handle
return recvfds(s, 1)[0]
File "/usr/lib/python3.5/multiprocessing/reduction.py", line 152, in recvfds
msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(bytes_size))
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "./tools/train.py", line 237, in <module>
main()
File "./tools/train.py", line 140, in main
loss, dis, new_points, new_target = criterion(pred_r, pred_t, pred_c, target, model_points, idx, points, opt.w, opt.refine_start)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/root/dense_fusion/lib/loss.py", line 83, in forward
return loss_calculation(pred_r, pred_t, pred_c, target, model_points, idx, points, w, refine, self.num_pt_mesh, self.sym_list)
File "/root/dense_fusion/lib/loss.py", line 38, in loss_calculation
pred = torch.add(torch.bmm(model_points, base), points + pred_t)
RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:411

my envs:
Ubuntu 18.04
DenseFusion docker file
root@desktop:~/dense_fusion# cat /usr/local/cuda/version.txt
CUDA Version 9.0.176
torch version 0.4.1
torchvision version 0.2.2

Are you using your own dataset? If so, it is very possible that opt.num_objects = 21 in train.py does not correspond to your dataset.
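
One way to test this hypothesis before touching the GPU is to validate the object indices on the CPU. The sketch below is only an illustration: `check_object_indices` is a hypothetical helper, not part of DenseFusion, and it assumes the dataset yields zero-based integer class indices that must be smaller than `opt.num_objects`.

```python
def check_object_indices(indices, num_objects):
    """Raise ValueError if any class index falls outside [0, num_objects).

    Hypothetical helper for illustration; it is not part of DenseFusion.
    """
    # Collect every distinct index that is negative or too large.
    bad = sorted({int(i) for i in indices if not 0 <= int(i) < num_objects})
    if bad:
        raise ValueError(
            "object indices {} are out of range for num_objects={}".format(
                bad, num_objects
            )
        )
    return True


if __name__ == "__main__":
    # Assumed example values: 21 classes, indices taken from a few samples.
    print(check_object_indices([0, 5, 20], num_objects=21))
```

If the check raises, the dataset and `opt.num_objects` disagree, and an error of that kind would be easier to read on the CPU than as an opaque CUDA failure.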

I am training on the LineMOD dataset downloaded with the download script.

I am using a TITAN RTX.
According to this reference, my GPU seems to need CUDA 10.0 or higher. Is there any problem running your code with CUDA 10.0?
pytorch/pytorch#17334
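
A quick way to check for this kind of mismatch is to compare the GPU's compute capability with the CUDA version the installed PyTorch was built against. The following is a minimal sketch, not an official compatibility tool: the lookup table is deliberately small and lists only a few architectures, and `cuda_supports_gpu` and `report` are hypothetical helper names. The TITAN RTX is a Turing card with compute capability 7.5, which CUDA toolkits older than 10.0 cannot target.

```python
# Minimum CUDA toolkit version able to target a given compute capability.
# Illustrative subset only; extend it for other architectures as needed.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing, e.g. TITAN RTX
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (RTX 30 series)
}


def parse_version(text):
    """Turn a string such as '9.0.176' into a (major, minor) tuple."""
    return tuple(int(part) for part in text.split(".")[:2])


def cuda_supports_gpu(cuda_version, capability):
    """Return True/False, or None if the capability is not in the table."""
    required = MIN_CUDA_FOR_CAPABILITY.get(tuple(capability))
    if required is None:
        return None
    return parse_version(cuda_version) >= required


def report():
    """Print what the local PyTorch install reports, if torch is present."""
    try:
        import torch
    except ImportError:
        print("torch is not installed")
        return
    print("torch", torch.__version__, "built with CUDA", torch.version.cuda)
    if torch.version.cuda is None or not torch.cuda.is_available():
        print("no usable CUDA device reported by torch")
        return
    capability = torch.cuda.get_device_capability(0)
    print(torch.cuda.get_device_name(0), "capability", capability)
    print("supported:", cuda_supports_gpu(torch.version.cuda, capability))


if __name__ == "__main__":
    report()
```

With the versions reported in this thread, `cuda_supports_gpu("9.0.176", (7, 5))` evaluates to `False`, while passing `"10.0"` gives `True`.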

Hi @bgyooPtr, I still cannot fix this problem. How did you fix it?
I have checked issue 44, but I could not find the setup.py.

me too! I met the same problem!

Hello @oreo-lp, I also hit the error "RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:411". Could you share how you fixed it? Thank you very much!

Hi @oreo-lp @Destinycjk, same problem here. I solved it with the following steps:

  • Switch the Dockerfile to PyTorch 1.0.1 (the CUDA 10.0 build):
RUN pip3 install cffi_utils
RUN pip3 install https://download.pytorch.org/whl/cu100/torch-1.0.1.post2-cp35-cp35m-linux_x86_64.whl && \
    pip3 install torchvision==0.2.2.post3
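
After rebuilding the image, a small smoke test can confirm that batched matrix multiplication now runs on the GPU, since `torch.bmm` is the call that failed in the traceback above. This is a sketch under assumptions: `bmm_smoke_test` is a hypothetical helper, and the tensor shapes (a batch of 4, with 500 points to match the mesh sample count in the log) are illustrative rather than the exact shapes used in lib/loss.py.

```python
def bmm_smoke_test(batch=4, num_points=500):
    """Run one torch.bmm on the GPU.

    Returns None when torch or a CUDA device is unavailable, otherwise
    True if the result has the expected shape. Hypothetical helper.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    device = torch.device("cuda")
    # Illustrative shapes: (batch, points, 3) times (batch, 3, 3).
    points = torch.randn(batch, num_points, 3, device=device)
    rotations = torch.randn(batch, 3, 3, device=device)
    out = torch.bmm(points, rotations)
    # Wait for the kernel so that any CUDA error is raised here.
    torch.cuda.synchronize()
    return tuple(out.shape) == (batch, num_points, 3)


if __name__ == "__main__":
    print("bmm smoke test:", bmm_smoke_test())
```

If the GPU, CUDA, and PyTorch versions still do not match, this call would be expected to raise the same kind of runtime error instead of returning.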

Thanks very much for your reply! In fact, I have to use a higher PyTorch version because it needs to match my CUDA version. I solved the problem later. The main cause was still a version mismatch between the GPU, CUDA, and PyTorch. Anyway, thank you very much again!