Can't run code on computers with multiple GPUs.
Closed this issue · 4 comments
ShqWW commented
Hello, I found another bug with the following code (using the prebuilt spconv installed via pip):
from torch import nn
import torch
import spconv.pytorch as spconv

if __name__ == '__main__':
    device = torch.device('cuda:1')
    net = spconv.SparseSequential(spconv.SparseConv2d(8, 16, 2, 2, 0),
                                  nn.BatchNorm1d(16),
                                  nn.ReLU())
    net.to(device)
    dense = spconv.ToDense()
    dense.to(device)
    features = torch.ones(10, 8).to(device)
    coors = 2 * torch.ones(10, 3).to(device).int()
    coors[0:5] = coors[0:5] + 1
    x = spconv.SparseConvTensor(features, coors, (8, 8), 10)
    y = net(x)
    y = dense(y)
    print(y.shape)
It reports an error:
Traceback (most recent call last):
File "/home/wsq2019/repo/traj/debug3.py", line 19, in <module>
y = net(x)
File "/home/wsq2019/anaconda3/envs/wsq/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wsq2019/anaconda3/envs/wsq/lib/python3.9/site-packages/spconv/pytorch/modules.py", line 137, in forward
input = module(input)
File "/home/wsq2019/anaconda3/envs/wsq/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wsq2019/anaconda3/envs/wsq/lib/python3.9/site-packages/spconv/pytorch/conv.py", line 367, in forward
res = ops.get_indice_pairs_implicit_gemm(
File "/home/wsq2019/anaconda3/envs/wsq/lib/python3.9/site-packages/spconv/pytorch/ops.py", line 465, in get_indice_pairs_implicit_gemm
SpconvOps.generate_conv_inds_mask_stage2(inds_tv,
ValueError: /tmp/pip-build-env-2x428bne/overlay/lib/python3.9/site-packages/cumm/include/tensorview/tensor.h(692)
start < end assert faild. start must small than end
Process finished with exit code 1
In particular, when I use the 0th GPU with device = torch.device('cuda:0'), there is no error. But when I use cuda:n (n > 0), I get the error shown above.
I have tested the code on several Linux computers with different CUDA versions, and the error is always the same.
FindDefinition commented
Currently you need to use torch.cuda.set_device
to set the active CUDA device in the current thread before calling any custom ops in spconv; this works well with regular distributed training (one process per GPU). See the sketch below.
I will use with torch.device
in each op in the next bug-fix release.
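For reference, a minimal sketch of this workaround applied to the repro script above (the only change is calling torch.cuda.set_device before any spconv op runs on the non-default device):

import torch
from torch import nn
import spconv.pytorch as spconv

if __name__ == '__main__':
    device = torch.device('cuda:1')
    # Workaround: make cuda:1 the active device for this thread
    # before any spconv custom op is called.
    torch.cuda.set_device(device)

    net = spconv.SparseSequential(spconv.SparseConv2d(8, 16, 2, 2, 0),
                                  nn.BatchNorm1d(16),
                                  nn.ReLU()).to(device)
    dense = spconv.ToDense().to(device)
    features = torch.ones(10, 8).to(device)
    coors = 2 * torch.ones(10, 3).to(device).int()
    coors[0:5] = coors[0:5] + 1
    x = spconv.SparseConvTensor(features, coors, (8, 8), 10)
    print(dense(net(x)).shape)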
ShqWW commented
Thank you!
ZXP-S-works commented
torch.cuda.set_device(device) works for my multi-GPU PC.
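For anyone hitting this in a distributed setup, a hedged sketch of the usual one-process-per-GPU pattern (assuming the script is launched with torchrun, which sets the LOCAL_RANK environment variable):

import os
import torch

# In a one-process-per-GPU launch, pin this process's thread to its
# assigned GPU before constructing any spconv modules or calling their ops.
local_rank = int(os.environ.get('LOCAL_RANK', 0))
torch.cuda.set_device(local_rank)
device = torch.device(f'cuda:{local_rank}')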