traveller59/spconv

Can't run code on computers with multiple GPUs.

Closed this issue · 4 comments

ShqWW commented

Hello, I found another bug with the following code (using the prebuilt spconv installed via pip):

from torch import nn
import torch
import spconv.pytorch as spconv
if __name__ == '__main__':
    device = torch.device('cuda:1')  # any index > 0 triggers the error

    net = spconv.SparseSequential(spconv.SparseConv2d(8, 16, 2, 2, 0),
                        nn.BatchNorm1d(16),
                        nn.ReLU())
    net.to(device)

    dense = spconv.ToDense()
    dense.to(device)

    features = torch.ones(10, 8).to(device)  # 10 active points, 8 channels each
    coors = 2*torch.ones(10, 3).to(device).int()  # int32 indices: (batch, y, x)
    coors[0:5] = coors[0:5]+1
    x = spconv.SparseConvTensor(features, coors, (8, 8), 10)  # spatial shape (8, 8), batch size 10
    y = net(x)
    y = dense(y)
    print(y.shape)

It reports the following error:

Traceback (most recent call last):
  File "/home/wsq2019/repo/traj/debug3.py", line 19, in <module>
    y = net(x)
  File "/home/wsq2019/anaconda3/envs/wsq/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wsq2019/anaconda3/envs/wsq/lib/python3.9/site-packages/spconv/pytorch/modules.py", line 137, in forward
    input = module(input)
  File "/home/wsq2019/anaconda3/envs/wsq/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wsq2019/anaconda3/envs/wsq/lib/python3.9/site-packages/spconv/pytorch/conv.py", line 367, in forward
    res = ops.get_indice_pairs_implicit_gemm(
  File "/home/wsq2019/anaconda3/envs/wsq/lib/python3.9/site-packages/spconv/pytorch/ops.py", line 465, in get_indice_pairs_implicit_gemm
    SpconvOps.generate_conv_inds_mask_stage2(inds_tv,
ValueError: /tmp/pip-build-env-2x428bne/overlay/lib/python3.9/site-packages/cumm/include/tensorview/tensor.h(692)
start < end assert faild. start must small than end

Process finished with exit code 1

Notably, when I use GPU 0 with device = torch.device('cuda:0'), there is no error. But when I use cuda:n (n>0), I get the error shown above.
I have tested the code on several Linux machines with different CUDA versions, and the error is always the same.

Currently you need to call torch.cuda.set_device to set the active CUDA device in the current thread before calling any custom ops in spconv. This already works well with regular distributed training (one process per GPU).
In the next bug-fix release, I will wrap each op in a with torch.cuda.device block.
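
For illustration, here is a minimal sketch of what scoping the active device around a single call could look like, using PyTorch's torch.cuda.device context manager. The helper name call_on_tensor_device is hypothetical, and this is not spconv's actual implementation. It only shows the general pattern of activating the input tensor's GPU for the duration of one call.

```python
import torch


def call_on_tensor_device(fn, tensor, *args, **kwargs):
    """Run fn(tensor, ...) with the tensor's GPU as the active CUDA device.

    Hypothetical helper, not part of spconv. CPU tensors are passed
    straight through, since there is no CUDA device to activate.
    """
    if tensor.is_cuda:
        # torch.cuda.device restores the previously active device on exit.
        with torch.cuda.device(tensor.device):
            return fn(tensor, *args, **kwargs)
    return fn(tensor, *args, **kwargs)
```

Unlike torch.cuda.set_device, the context manager only changes the active device inside the with block, so code that runs afterwards is unaffected.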

ShqWW commented

Thank you!

This seems similar to an issue in older versions of spconv (#35).

torch.cuda.set_device(device) works for my multi-GPU PC.
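
For anyone hitting the same error, here is a minimal sketch of the workaround applied to the script from the report. The function names activate_cuda_device and run_repro are only illustrative, and the CPU fallback is an assumption added so the script exits cleanly on machines without a second GPU. The spconv calls are the same ones used in the original code.

```python
import torch
from torch import nn


def activate_cuda_device(index: int) -> torch.device:
    """Make cuda:<index> the active CUDA device for the current thread.

    Returns the CPU device instead when CUDA is unavailable or the index
    is out of range, so the caller can decide whether to proceed.
    """
    if torch.cuda.is_available() and 0 <= index < torch.cuda.device_count():
        device = torch.device("cuda", index)
        torch.cuda.set_device(device)  # must happen before any spconv op runs
        return device
    return torch.device("cpu")


def run_repro(device: torch.device) -> torch.Size:
    """Run the network from the report on the given device."""
    import spconv.pytorch as spconv  # imported lazily; requires spconv

    net = spconv.SparseSequential(
        spconv.SparseConv2d(8, 16, 2, 2, 0),
        nn.BatchNorm1d(16),
        nn.ReLU(),
    ).to(device)
    dense = spconv.ToDense().to(device)

    features = torch.ones(10, 8, device=device)
    coors = 2 * torch.ones(10, 3, device=device).int()
    coors[0:5] = coors[0:5] + 1
    x = spconv.SparseConvTensor(features, coors, (8, 8), 10)
    return dense(net(x)).shape


if __name__ == "__main__":
    device = activate_cuda_device(1)
    if device.type == "cuda":
        print(run_repro(device))
    else:
        print("cuda:1 is not available on this machine")
```

The only functional change from the original script is the torch.cuda.set_device call, which runs before the network and tensors are created.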