Pointops - GPU DataParallel Error
L-Reichardt opened this issue · 1 comments
L-Reichardt commented
Hello, I integrated the model into another training loop, and it trains fine on a single GPU. However, when I use multi-GPU DataParallel, training stops with the following error:
ATen/native/cuda/IndexKernel.cu:91: index out of bounds
According to the traceback, the error is raised inside pointops queryandgroup.
Any suggestions what might cause that?
Error Message:
/opt/conda/conda-bld/pytorch_1656352464346/work/aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [30669,0,0], thread: [31,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback ...
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/Paper/model/pointtransformer_seg.py", line 162, in forward
p1, x1, o1 = self.enc1([p0, x0, o0])
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/Paper/model/pointtransformer_seg.py", line 116, in forward
x = self.relu(self.bn2(self.transformer2([p, x, o])))
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/Paper/model/pointtransformer_seg.py", line 26, in forward
x_k = pointops.queryandgroup(self.nsample, p, p, x_k, None, o, o, use_xyz=True) # (n, nsample, 3+c)
File "/home/Paper/lib/pointops/functions/pointops.py", line 91, in queryandgroup
grouped_xyz = xyz[idx.view(-1).long(), :].view(m, nsample, 3) # (m, nsample, 3)
RuntimeError: CUDA error: device-side assert triggered
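A device-side assert is reported asynchronously, so the Python line shown in the traceback is not always the one that failed; setting the environment variable CUDA_LAUNCH_BLOCKING=1 usually makes the trace point at the real call. To see which indices are out of range, a check like the one below could be placed just before the failing indexing line. This is only a debugging sketch: check_group_indices is a hypothetical helper, not part of pointops, and it assumes idx holds row indices into xyz.

```python
import torch


def check_group_indices(idx: torch.Tensor, num_points: int) -> None:
    """Hypothetical debugging helper (not part of pointops).

    Raises IndexError with a readable message if any entry of idx could not
    be used to index a tensor with num_points rows.
    """
    flat = idx.reshape(-1).long()
    if flat.numel() == 0:
        return
    # int(...) copies the value to the host, which forces a GPU sync.
    # Acceptable while debugging, too slow to leave in a training loop.
    lo, hi = int(flat.min()), int(flat.max())
    if lo < -num_points or hi >= num_points:
        raise IndexError(
            f"idx range [{lo}, {hi}] is invalid for {num_points} points"
        )
```

Inside queryandgroup this would be called as check_group_indices(idx, xyz.shape[0]) right before xyz is indexed, so the message shows how far the indices exceed the number of points available on that replica.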
L-Reichardt commented
Update: It works with DistributedDataParallel.
Since PyTorch officially recommends always using DistributedDataParallel over DataParallel, this is most likely a PyTorch DataParallel issue.
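One possible mechanism, offered as a guess rather than a confirmed diagnosis: DataParallel scatters every tensor input along dim 0. If p and x are packed point tensors of shape (n, ...) and o holds cumulative per-cloud offsets (an assumption about this data layout), each replica receives a slice of the points together with a slice of the offsets that still refers to positions in the full batch. The pure-Python sketch below simulates that split; the point counts are made up for illustration and scatter_dim0 is a hypothetical stand-in for the real scatter.

```python
def scatter_dim0(seq, num_replicas):
    """Hypothetical stand-in for a dim-0 scatter: contiguous chunks of
    size ceil(len / num_replicas)."""
    size = -(-len(seq) // num_replicas)  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]


# Made-up batch: four point clouds packed into one tensor of 400 points.
counts = [100, 120, 80, 100]
offsets = []
total = 0
for c in counts:
    total += c
    offsets.append(total)          # cumulative offsets: [100, 220, 300, 400]
points = list(range(total))        # placeholder for the packed (n, 3) coordinates

point_chunks = scatter_dim0(points, 2)
offset_chunks = scatter_dim0(offsets, 2)

report = []
for replica, (p, o) in enumerate(zip(point_chunks, offset_chunks)):
    # On a consistent replica the last offset equals the local point count.
    consistent = o[-1] == len(p)
    report.append((replica, len(p), o, consistent))
    print(f"replica {replica}: {len(p)} points, offsets {o}, consistent={consistent}")
```

In this simulation both replicas hold 200 points, but replica 0 sees offsets ending at 220 and replica 1 sees offsets ending at 400, so any neighbor index derived from those offsets can exceed the local point count. That would match an index-out-of-bounds assert in replica 0. DistributedDataParallel avoids this because each process builds its own batch, with offsets that match its own points.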