Cuda 11.1 - Coordinate manager
zgojcic opened this issue · 16 comments
Hi Chris,
I have stumbled onto the following problem when using ME 0.5.1 or 0.5.2 with Cuda 11.1:
File "/home/zgojcic/anaconda3/envs/rigid_3dsf/lib/python3.7/site-packages/MinkowskiEngine-0.5.1-py3.7-linux-x86_64.egg/MinkowskiEngine/MinkowskiConvolution.py", line 84, in forward
coordinate_manager._manager,
RuntimeError: /home/zgojcic/Documents/Rigid3DSceneFlow/MinkowskiEngine/src/convolution_gpu.cu:85, assertion (in_feat.size(0) == p_map_manager->size(in_key)) failed. Invalid in_feat size 0 != 5296
Note that the same code works perfectly fine with Cuda 10.2. I am sorry that I do not have a very compact working example, but the error occurs when running the code available in https://github.com/zgojcic/Rigid3DSceneFlow. For example when running the following evaluation:
python eval.py ./configs/eval/eval_lidar_kitti.yaml
If you actually want to run the code you also have to download the dataset, but it is very small (see the repo). If I can help you somehow or should provide more information, please let me know.
Best,
Zan
Diagnostic from one of the computers that I have used (I have observed the same error on three computers running either ME 0.5.1 or 0.5.2):
==========System==========
Linux-5.4.0-66-generic-x86_64-with-debian-buster-sid
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"
3.7.10 (default, Feb 26 2021, 18:47:35)
[GCC 7.3.0]
==========Pytorch==========
1.8.0
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 455.32.00
CUDA Version 11.1
VBIOS Version 88.00.41.00.18
Image Version G001.0000.01.04
==========NVCC==========
sh: 1: nvcc: not found
==========CC==========
CC=g++-7
/usr/bin/g++-7
g++-7 (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
==========MinkowskiEngine==========
0.5.1
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11010
CUDART version MinkowskiEngine is compiled: 11010
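A note for readers comparing the diagnostics in this thread: the "NVCC version" and "CUDART version" integers appear to follow the CUDA convention of major * 1000 + minor * 10, so 11010 would mean CUDA 11.1. Below is a minimal sketch of that decoding, assuming this encoding applies here. The helper name decode_cuda_version is made up for illustration and is not part of MinkowskiEngine.

```python
def decode_cuda_version(encoded: int) -> tuple:
    """Split an integer such as 11010 into (major, minor).

    Assumption: encoded = major * 1000 + minor * 10 (CUDA convention).
    """
    major = encoded // 1000
    minor = (encoded % 1000) // 10
    return major, minor


if __name__ == "__main__":
    # Values taken from the diagnostics posted in this thread.
    for value in (11010, 11000):
        major, minor = decode_cuda_version(value)
        print(f"{value} -> CUDA {major}.{minor}")
```

With this reading, a build reporting 11010 was compiled against CUDA 11.1 and one reporting 11000 against CUDA 11.0.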
The error says that you fed a zero-length feature matrix. You might want to put a breakpoint (import ipdb; ipdb.set_trace()) before the line that raised the error and make sure the inputs are what you expect.
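As an alternative to an interactive breakpoint, the same sanity check could be written as a small guard that runs right before the sparse tensor is built. This is only a sketch: check_sparse_inputs is a hypothetical helper, not a MinkowskiEngine function, and it relies only on the inputs exposing a .shape attribute (as torch tensors and numpy arrays do).

```python
def check_sparse_inputs(coords, feats):
    """Raise ValueError if coords/feats cannot describe the same points.

    Hypothetical helper; works with any object exposing `.shape`.
    Returns the number of rows when the inputs look consistent.
    """
    n_coords, n_feats = coords.shape[0], feats.shape[0]
    if n_feats == 0:
        raise ValueError("feature matrix is empty (0 rows)")
    if n_coords != n_feats:
        raise ValueError(
            f"row mismatch: {n_coords} coordinates vs {n_feats} features"
        )
    return n_coords
```

It would be called as check_sparse_inputs(coordinates, features) immediately before ME.SparseTensor(...), so a zero-length feature matrix is reported where it is created rather than later inside the convolution.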
Exactly the same code runs on the same computer if I use CUDA 10.2 with the same ME version, so I assume the problem comes from the combination of CUDA 11.1 and ME.
If I debug the code step by step, the error actually occurs earlier, when I construct the sparse tensor:
sinput1 = ME.SparseTensor(features=input_dict['sinput_s_F'].to(self.device),
                          coordinates=input_dict['sinput_s_C'].to(self.device))
the error message is:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): CUDA free failed: cudaErrorIllegalAddress: an illegal memory access was encountered
The inputs are generated with
coords_batch1, feats_batch1 = ME.utils.sparse_collate(coords=coords1, feats=feats1)
but the batch size is one and a single worker is used in the data loader. The dimension of the inputs is [5296,3] and [5296,4] respectively.
I have tried to create a minimal working example, but if I just pass random values to the sparse tensor it works without an error.
It would be great if you could prepare a self-contained script for debugging.
The following example should reproduce the problem. On my machine (the diagnostics are in the first post) it returns:
tensor(8188)
torch.Size([8188, 3])
torch.Size([8188, 3])
tensor(8188, device='cuda:0')
torch.Size([8188, 3])
torch.Size([0, 3])
I think that something goes wrong when the features are passed to ME.SparseTensor. For example, I also cannot call print(sinput1.F), because Python just hangs in that case. I hope this helps.
For reference, the same code with CUDA 10.2 returns:
tensor(8188)
torch.Size([8188, 3])
torch.Size([8188, 3])
tensor(8188, device='cuda:0')
torch.Size([8188, 3])
torch.Size([8188, 3])
Thank you in advance for your help
import torch
import MinkowskiEngine as ME
import numpy as np
pc_1 = np.random.rand(8192,3) * 20
voxel_size = 0.1
# Voxelization
_, sel1 = ME.utils.sparse_quantize(pc_1 / voxel_size, return_index=True)
# Select the voxelized points
pc_1 = pc_1[sel1,:]
# Get sparse indices
coords1 = np.floor(pc_1 / voxel_size)
# Use absolute features as input
feats1 = coords1.copy()
coords_batch1, feats_batch1 = ME.utils.sparse_collate(coords=[coords1], feats=[feats1])
sinput1 = ME.SparseTensor(features=feats_batch1,
                          coordinates=coords_batch1)
sinput1_cuda = ME.SparseTensor(features=feats_batch1.to('cuda'),
                               coordinates=coords_batch1.to('cuda'))
for b_idx in range(len(sinput1.decomposed_coordinates)):
    feat_s = sinput1.F[sinput1.C[:,0] == b_idx]
    print(sum(sinput1.C[:,0] == b_idx))
    print(sinput1.F.shape)
    print(feat_s.shape)
for b_idx in range(len(sinput1_cuda.decomposed_coordinates)):
    feat_s_cuda = sinput1_cuda.F[sinput1_cuda.C[:,0] == b_idx]
    print(sum(sinput1_cuda.C[:,0] == b_idx))
    print(sinput1_cuda.F.shape)
    print(feat_s_cuda.shape)
Hi! Any chance this issue could be looked at? I am using an NVIDIA 3000 series GPU which only runs CUDA 11 and therefore I cannot use Minkowski Engine.
Just FYI, the code snippet works on my Machine: MinkowskiEngine==0.5.0, Cuda 11.2, GeForce RTX 3090.
I ran the snippet on CUDA 11.2 (ME==0.5.1, RTX 3090) and still get the same error. ME==0.5.0 wouldn't compile.
Hmm, I can't replicate the error on the latest master with either CUDA 11.0 or CUDA 11.1.
My environments are
python -c "import MinkowskiEngine; MinkowskiEngine.print_diagnostics()"
==========System==========
Linux-5.4.0-67-generic-x86_64-with-glibc2.10
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
3.8.8 (default, Feb 24 2021, 21:46:12)
[GCC 7.3.0]
==========Pytorch==========
1.8.1
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 460.39
CUDA Version 11.2
VBIOS Version 90.02.2E.00.0C
Image Version G001.0000.02.04
==========NVCC==========
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
==========CC==========
CC=g++
/usr/bin/g++
g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
==========MinkowskiEngine==========
0.5.2
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11000
CUDART version MinkowskiEngine is compiled: 11000
and
==========System==========
Linux-5.4.0-67-generic-x86_64-with-glibc2.10
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
3.8.8 (default, Feb 24 2021, 21:46:12)
[GCC 7.3.0]
==========Pytorch==========
1.8.1
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 460.39
CUDA Version 11.2
VBIOS Version 90.02.2E.00.0C
Image Version G001.0000.02.04
==========NVCC==========
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
==========CC==========
CC=g++
/usr/bin/g++
g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
==========MinkowskiEngine==========
0.5.2
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11010
CUDART version MinkowskiEngine is compiled: 11010
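Since several posts paste print_diagnostics() output, a small parser can make it easier to see which sections differ between two environments. The sketch below assumes the "==========Name==========" header layout seen in the dumps above. The names parse_diagnostics and differing_sections are hypothetical and not part of MinkowskiEngine.

```python
import re

# A section header looks like "==========Pytorch==========".
HEADER = re.compile(r"^=+(?P<name>[^=]+)=+$")


def parse_diagnostics(text: str) -> dict:
    """Group diagnostic lines under their section header."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            current = match.group("name").strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def differing_sections(a: dict, b: dict) -> list:
    """Return the sorted names of sections whose contents differ."""
    return sorted(n for n in a.keys() | b.keys() if a.get(n) != b.get(n))
```

For the two environments above, this kind of comparison would flag only the NVCC and MinkowskiEngine sections, since the two dumps differ only in the CUDA toolkit used for compilation.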
Hey Chris, @chrischoy, I can still reproduce this error on an RTX 3090 GPU using the latest master, with the following environment:
==========System==========
Linux-5.8.0-44-generic-x86_64-with-glibc2.10
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
3.8.8 (default, Feb 24 2021, 21:46:12)
[GCC 7.3.0]
==========Pytorch==========
1.8.1
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 460.39
CUDA Version 11.2
VBIOS Version 94.02.26.88.3C
Image Version G001.0000.03.03
==========NVCC==========
/usr/local/cuda-11.1/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0
==========CC==========
CC=g++-7
/usr/bin/g++-7
g++-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
==========MinkowskiEngine==========
0.5.2
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11010
CUDART version MinkowskiEngine is compiled: 11010
I first removed the previous conda environment and created a new one with conda.
Then I ran the following commands:
conda install openblas-devel -c anaconda
conda install pytorch=1.8.1 torchvision cudatoolkit=11.1 -c pytorch -c conda-forge
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps --install-option="--blas_include_dirs=${CONDA_PREFIX}/include" --install-option="--blas=openblas"
python test.py
/home/ps/anaconda3/envs/rigid_3dsf/lib/python3.8/site-packages/MinkowskiEngine-0.5.2-py3.8-linux-x86_64.egg/MinkowskiEngine/__init__.py:36: UserWarning: The environment variable OMP_NUM_THREADS not set. MinkowskiEngine will automatically set OMP_NUM_THREADS=16. If you want to set OMP_NUM_THREADS manually, please export it on the command line before running a python script. e.g. export OMP_NUM_THREADS=12; python your_program.py. It is recommended to set it below 24.
  warnings.warn(
tensor(8188)
torch.Size([8188, 3])
torch.Size([8188, 3])
tensor(8188, device='cuda:0')
torch.Size([8188, 3])
torch.Size([0, 3])
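A side note on the OMP_NUM_THREADS warning in the output above, which is unrelated to the empty result: the warning only documents exporting the variable in the shell. One possible in-script equivalent is sketched below, under the assumption that setting the variable in os.environ before MinkowskiEngine is imported has the same effect. The helper name ensure_omp_threads and the default of 12 (borrowed from the warning's own example) are illustrative.

```python
import os


def ensure_omp_threads(default: int = 12) -> str:
    """Set OMP_NUM_THREADS only if it is unset; return the effective value."""
    return os.environ.setdefault("OMP_NUM_THREADS", str(default))


# Assumption: this has to run before `import MinkowskiEngine`,
# because the warning is emitted at import time.
print(ensure_omp_threads())
```

A value that is already exported in the shell is left untouched, so the command-line form from the warning still takes precedence.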
Sorry, I misread the issue. I assumed the cudaIllegalMemoryAccess was the problem. Yes, I was able to reproduce this error. Let me get back to you ASAP.
TLDR: This is an error in pytorch (v1.8.X + CUDA11.X) which affects many other custom C extension libraries.
On pytorch 1.8.1 + cuda 11.1
import MinkowskiEngine as ME
import torch
coordinates = torch.rand(8192,3) * 200
bcoords, bfeats = coordinates.cuda(), coordinates.cuda()
print(bcoords, bfeats) # without print, it works fine... print seems to be triggering something
ME.SparseTensor(bfeats, bcoords)
The full log for the above script with ME debug installation is
...
/home/chrischoy/projects/MinkowskiEngine/src/coordinate_map_gpu.cu:225 nm_threads 8192
/home/chrischoy/projects/MinkowskiEngine/src/coordinate_map_gpu.cu:227 nm_blocks 64
/home/chrischoy/projects/MinkowskiEngine/src/coordinate_map_gpu.cu:229 unused_key 4294967295
CUDA error 101 [/usr/local/cuda-11.1/include/cub/block/../iterator/../util_device.cuh, 471]: invalid device ordinal
CUDA error 101 [/usr/local/cuda-11.1/include/cub/device/dispatch/dispatch_reduce.cuh, 653]: invalid device ordinal
/home/chrischoy/projects/MinkowskiEngine/src/storage.cuh:80 Deallocating 2 gpu storage at 0x7fd1c62e0000
/home/chrischoy/projects/MinkowskiEngine/src/storage.cuh:80 Deallocating 0 gpu storage at 0
/home/chrischoy/projects/MinkowskiEngine/src/storage.cuh:80 Deallocating 8192 gpu storage at 0x7fd1c62e8200
/home/chrischoy/projects/MinkowskiEngine/src/storage.cuh:80 Deallocating 8192 gpu storage at 0x7fd1c62e0200
Traceback (most recent call last):
File "test330.py", line 7, in <module>
ME.SparseTensor(bfeats, bcoords)
File "/home/chrischoy/projects/MinkowskiEngine/MinkowskiEngine/MinkowskiSparseTensor.py", line 269, in __init__
coordinates, features, coordinate_map_key = self.initialize_coordinates(
File "/home/chrischoy/projects/MinkowskiEngine/MinkowskiEngine/MinkowskiSparseTensor.py", line 294, in initialize_coordinates
) = self._manager.insert_and_map(coordinates, *coordinate_map_key.get_key())
File "/home/chrischoy/projects/MinkowskiEngine/MinkowskiEngine/MinkowskiCoordinateManager.py", line 179, in insert_and_map
return self._manager.insert_and_map(coordinates, tensor_stride, string_id)
RuntimeError: after reduction step 1: cudaErrorInvalidDevice: invalid device ordinal
The invalid device ordinal error should not be triggered.
A related issue also occurs in other custom extension libraries with pytorch 1.8.x + CUDA 11.X.
This is a pytorch error which probably will be fixed in the next update. In the meantime, I'll update the readme and recommend
- pytorch 1.8.1 + CUDA 10.2
- pytorch 1.7.1 + CUDA 11.X
but not
- pytorch 1.8.1 + CUDA 11.X
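The recommendation above can be turned into a quick check of an installed environment. This is a minimal sketch, not an official MinkowskiEngine check: the function names are hypothetical, and the rule encodes only what this post states (pytorch 1.8.x together with CUDA 11.x is the combination to avoid).

```python
def parse_major_minor(version: str) -> tuple:
    """'1.8.1+cu111' -> (1, 8); '11.1' -> (11, 1)."""
    parts = version.split("+")[0].split(".")
    return int(parts[0]), int(parts[1])


def is_flagged_combination(torch_version: str, cuda_version: str) -> bool:
    """True for pytorch 1.8.x paired with CUDA 11.x."""
    torch_mm = parse_major_minor(torch_version)
    cuda_major = parse_major_minor(cuda_version)[0]
    return torch_mm == (1, 8) and cuda_major == 11


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed; nothing to check")
    else:
        cuda = torch.version.cuda  # None on CPU-only builds
        if cuda and is_flagged_combination(torch.__version__, cuda):
            print("pytorch 1.8.x + CUDA 11.x detected: see the note above")
        else:
            print("not the flagged combination")
```

Note that torch.version.cuda reports the CUDA version PyTorch was built with, which can differ from the "CUDA Version" shown by nvidia-smi.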
Are there any updates on this? I have an RTX 3090, which is only compatible with CUDA 11.1+.
Does pytorch1.9+cuda11.1 fix this problem? Thx.
I tried running the code given in the previous posts, and it runs fine, so I guess this is fixed.
==========System==========
Linux-5.8.0-49-generic-x86_64-with-glibc2.17
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
3.8.10 (default, Jun 4 2021, 15:09:15)
[GCC 7.5.0]
==========Pytorch==========
1.9.0
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 460.56
CUDA Version 11.2
VBIOS Version 88.00.4F.00.04
Image Version G500.0200.00.03
==========NVCC==========
/var/tmp/cuda-11.1/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
==========CC==========
/usr/bin/c++
c++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
==========MinkowskiEngine==========
0.5.4
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 11010
CUDART version MinkowskiEngine is compiled: 11010
Great! I wasn't sure it was solved. So I'll close the ticket since I got the confirmation that it's been resolved.