Point cloud to sparse tensor error with SubMConv3d: merge_sort: failed on 2nd step: cudaErrorIllegalAddress: an illegal memory access was encountered
JamesMcCullochDickens opened this issue · 3 comments
I'm trying to write code that goes from a point cloud to a sparse tensor, and then applies a SubMConv3d convolution to the resulting sparse tensor. Here is what I have:
from typing import Tuple, Optional

import torch
import spconv.pytorch as spconv


def pcs_to_sparse_tensor(pcs: torch.Tensor, grid_size: float,
                         pc_range: Tuple[float, float, float, float, float, float],
                         device_id: Optional[int], pad: int = 0) -> spconv.SparseConvTensor:
    batch_size, num_points, C = pcs.shape
    # Clamp each axis to its [min, max] from the point cloud range.
    for i in range(3):
        pcs[:, :, i] = torch.clamp(pcs[:, :, i], min=pc_range[i], max=pc_range[i + 3])
    # Shift points so the minimum corner of the range sits at the origin.
    pc_xyz_min = torch.tensor([pc_range[0], pc_range[1], pc_range[2]]).view(1, 1, 3)
    displaced_coords = pcs - pc_xyz_min
    # Quantize to integer voxel coordinates.
    coords = torch.div(displaced_coords, grid_size, rounding_mode="trunc").int()
    coords = coords.reshape(-1, 3)
    feats = pcs.reshape(-1, C)
    # Voxel grid extent per axis, using the matching min/max pair for each axis.
    spatial_shape = []
    for i in range(3):
        spatial_shape.append(int(((pc_range[i + 3] - pc_range[i]) / grid_size) + pad))
    # Batch index column: every point of cloud b gets batch index b.
    repeat_vals = torch.tensor([num_points for _ in range(batch_size)])
    batch_vals = torch.arange(0, batch_size, step=1)
    batch_idx = torch.repeat_interleave(batch_vals, repeat_vals)
    indices = torch.cat([batch_idx.unsqueeze(-1).int(), coords], dim=1).contiguous()
    if device_id is not None:
        feats = feats.to(device_id)
        indices = indices.to(device_id)
    sp_tensor = spconv.SparseConvTensor(
        features=feats,
        indices=indices,
        spatial_shape=spatial_shape,
        batch_size=batch_size
    )
    return sp_tensor
rand_pcs = torch.rand(3, 500, 3)
sp = pcs_to_sparse_tensor(rand_pcs, grid_size=0.01, device_id=2,
                          pc_range=(-1., -1., -1., 1., 1., 1.))
layer = spconv.SubMConv3d(in_channels=3, out_channels=6,
                          kernel_size=3, indice_key="l1", padding=1,
                          stride=1, dilation=0).to(2)
output = layer(sp)
I'm getting the error:
"[Exception|implicit_gemm_pair]indices=torch.Size([1500, 4]),bs=3,ss=[300, 300, 300],algo=ConvAlgo.MaskImplicitGemm,ksize=[3, 3, 3],stride=[1, 1, 1],padding=[1, 1, 1],dilation=[0, 0, 0],subm=True,transpose=False
Traceback (most recent call last):
File "/home/jamesdickens/miniconda3/envs/pointcept/lib/python3.8/site-packages/spconv/pytorch/conv.py", line 408, in _conv_forward
raise e
File "/home/jamesdickens/miniconda3/envs/pointcept/lib/python3.8/site-packages/spconv/pytorch/conv.py", line 385, in _conv_forward
res = ops.get_indice_pairs_implicit_gemm(
File "/home/jamesdickens/miniconda3/envs/pointcept/lib/python3.8/site-packages/spconv/pytorch/ops.py", line 550, in get_indice_pairs_implicit_gemm
SpconvOps.sort_1d_by_key_allocator(pair_mask_tv[j],
RuntimeError: merge_sort: failed on 2nd step: cudaErrorIllegalAddress: an illegal memory access was encountered"
Any thoughts on what I might be doing wrong? I'd be happy to provide any other relevant details.
Edit: the error only occurs on GPU 2, not on GPU 0.
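For reference, a quick check I added while debugging (my own snippet, not part of spconv) makes the mismatch visible: the sparse tensor's data lives on cuda:2, while PyTorch's current CUDA device is still device 0.

    # My own debugging snippet: confirm the device mismatch.
    print(sp.features.device)           # cuda:2 -- where the data lives
    print(sp.indices.device)            # cuda:2
    print(torch.cuda.current_device())  # 0 -- the default current CUDA device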
I seem to have fixed this issue by following another GitHub issue here:
#387
torch.cuda.set_device(device_id)
does the trick.
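Concretely, my script now looks roughly like this (a minimal sketch, assuming spconv allocates its temporary buffers on the current CUDA device):

    import torch
    import spconv.pytorch as spconv

    device_id = 2
    # Make the target GPU the current CUDA device *before* building the
    # sparse tensor or the layer, so spconv's internal allocations land on
    # the same device as the input tensors.
    torch.cuda.set_device(device_id)

    sp = pcs_to_sparse_tensor(rand_pcs, grid_size=0.01, device_id=device_id,
                              pc_range=(-1., -1., -1., 1., 1., 1.))
    layer = spconv.SubMConv3d(in_channels=3, out_channels=6,
                              kernel_size=3, indice_key="l1", padding=1,
                              stride=1, dilation=0).to(device_id)
    output = layer(sp)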
I haven't tried this with distributed training yet, but all seems good now. I guess the ops file allocates something on the current CUDA device (device 0 by default) rather than on the device of the input tensors? Not sure.
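If changing the global current device is undesirable, I'd guess (untested) that scoping the spconv calls with PyTorch's standard device context manager would work too, under the same assumption about where the temporary buffers are allocated:

    # Untested alternative: scope the current device instead of setting it
    # globally; torch.cuda.device() is a standard PyTorch context manager.
    with torch.cuda.device(device_id):
        output = layer(sp)

Anyway, thanks for the great repo.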