traveller59/spconv

Point Cloud Sparse Tensor error with SubMConv3dLayer: SubMmerge_sort: failed on 2nd step: cudaErrorIllegalAddress: an illegal memory access was encountered

JamesMcCullochDickens opened this issue · 3 comments

I'm trying to write code to go from a point cloud to a sparse tensor, and then do a SubMConv3dLayer convolution on the resulting sparse tensor. Here is what I have:

from typing import Union, Tuple, Optional

import torch, torch.nn as nn
import spconv.pytorch as spconv


def pcs_to_sparse_tensor(pcs: torch.Tensor, grid_size: float, pc_range: Tuple[float, float, float, float, float, float],
                         device_id: Optional[int], pad: int = 0) -> spconv.SparseConvTensor:
    batch_size, num_points, C = pcs.shape
    for i in range(3):
        pcs[:, :, i] = torch.clamp(pcs[:, :, i], min=pc_range[i], max=pc_range[i+3])
    pc_xyz_min = torch.tensor([pc_range[0], pc_range[1], pc_range[2]]).view(1, 1, 3)
    displaced_coords = pcs-pc_xyz_min
    coords = torch.div(displaced_coords, grid_size, rounding_mode="trunc").int()
    coords = coords.reshape(-1, 3)
    feats = pcs.reshape(-1, 3)
    spatial_shape = []
    for i in range(3):
        spatial_shape.append(int(((pc_range[i+3]-pc_range[i])/grid_size) + pad))
    repeat_vals = torch.tensor([num_points for _ in range(batch_size)])
    batch_vals = torch.arange(0, batch_size, step=1)
    batch_idx = torch.repeat_interleave(batch_vals, repeat_vals)
    indices = torch.cat([batch_idx.unsqueeze(-1).int(), coords], dim=1).contiguous()
    if device_id is not None:
        feats = feats.to(device_id)
        indices = indices.to(device_id)
    sp_tensor = spconv.SparseConvTensor(
        features=feats,
        indices=indices,
        spatial_shape=spatial_shape,
        batch_size=batch_size
    )
    return sp_tensor



rand_pcs = torch.rand(3, 500, 3)
sp = pcs_to_sparse_tensor(rand_pcs, grid_size=0.01, device_id=2, pc_range=(-1., -1., -1., 1., 1., 1.))
layer = spconv.SubMConv3d(in_channels=3, out_channels=6,
                          kernel_size=3, indice_key="l1", padding=1,
                          stride=1, dilation=0).to(2)
output = layer(sp)

I'm getting the error:

"[Exception|implicit_gemm_pair]indices=torch.Size([1500, 4]),bs=3,ss=[300, 300, 300],algo=ConvAlgo.MaskImplicitGemm,ksize=[3, 3, 3],stride=[1, 1, 1],padding=[1, 1, 1],dilation=[0, 0, 0],subm=True,transpose=False
Traceback (most recent call last):
File "/home/jamesdickens/miniconda3/envs/pointcept/lib/python3.8/site-packages/spconv/pytorch/conv.py", line 408, in _conv_forward
raise e
File "/home/jamesdickens/miniconda3/envs/pointcept/lib/python3.8/site-packages/spconv/pytorch/conv.py", line 385, in _conv_forward
res = ops.get_indice_pairs_implicit_gemm(
File "/home/jamesdickens/miniconda3/envs/pointcept/lib/python3.8/site-packages/spconv/pytorch/ops.py", line 550, in get_indice_pairs_implicit_gemm
SpconvOps.sort_1d_by_key_allocator(pair_mask_tv[j],
RuntimeError: merge_sort: failed on 2nd step: cudaErrorIllegalAddress: an illegal memory access was encountered"

Any thoughts what I might be doing wrong? I'd be happy to give any other relevant details.

Edit: The error only occurs on GPU 2, not on GPU 0.

I seem to have fixed this by following another GitHub issue:
#387

torch.cuda.set_device(device_id)

does the trick.

I haven't tried this with distributed training yet, but all seems good now. I guess GPU 0 is hardcoded somewhere in the ops file? Not sure. Anyway, thanks for the great repo.
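
For anyone landing here later, this is roughly where the call goes. It is a minimal sketch, not a confirmed explanation of what spconv does internally: `select_cuda_device` is a hypothetical helper name, and the CPU fallback is only an assumption so the snippet runs on machines without a third GPU.

```python
import torch


def select_cuda_device(device_id):
    """Make `device_id` the current CUDA device and return it as a torch.device.

    Hypothetical helper. Falls back to the CPU when CUDA or the requested
    index is unavailable (an assumption so the sketch runs anywhere).
    """
    if device_id is None or not torch.cuda.is_available():
        return torch.device("cpu")
    if device_id >= torch.cuda.device_count():
        return torch.device("cpu")
    # The workaround from this thread: PyTorch's current device defaults to 0,
    # so switch it before creating any tensors or layers on another GPU.
    torch.cuda.set_device(device_id)
    return torch.device("cuda", device_id)


device = select_cuda_device(2)
points = torch.rand(3, 500, 3, device=device)
if device.type == "cuda":
    # After set_device, the current device matches the tensor's device.
    assert torch.cuda.current_device() == points.device.index
```

In the script above, that would mean calling the helper (or just `torch.cuda.set_device(2)`) before `pcs_to_sparse_tensor(...)` and before `.to(2)` on the layer.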