hustvl/HAIS

Why does the code remind me to buy a new GPU?

MrCrazyCrab opened this issue · 2 comments

I didn't change anything, I just used my own data. When I train the model, I hit this problem:
RuntimeError: CUDA out of memory. Tried to allocate 239.84 GiB (GPU 0; 10.76 GiB total capacity; 122.72 MiB already allocated; 9.77 GiB free; 142.00 MiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f1b3cbfc193 in /home/anaconda3/envs/py1_4/lib/python3.6/site-packages/torch/lib/libc10.so)

frame #11: std::vector<at::Tensor, std::allocator<at::Tensor> > spconv::getIndicePair<3u>(at::Tensor, long, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, long, long) + 0x518 (0x7f1b3987fda8 in /home/anaconda3/envs/py1_4/lib/python3.6/site-packages/spconv/libspconv.so)
frame #12: c10::detail::WrapRuntimeKernelFunctor_<...>::operator()(at::Tensor, long, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, long, long) + 0x1ff (0x7f1b39882e8f in /home/anaconda3/envs/py1_4/lib/python3.6/site-packages/spconv/libspconv.so)
frame #13: c10::guts::infer_function_traits_t<...>::return_type c10::detail::call_functor_with_args_from_stack_<c10::detail::WrapRuntimeKernelFunctor_<...>, true, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul, 8ul, 9ul, 10ul>(c10::detail::WrapRuntimeKernelFunctor_<...>, std::vector<c10::IValue, std::allocator<c10::IValue> >*, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul, 8ul, 9ul, 10ul>) + 0x193 (0x7f1b3988aae3 in

This is just a common out-of-memory problem. The point cloud scenes in your data are too large for the GPU memory. Try reducing the batch size or subsampling your point clouds.
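As a rough illustration of the subsampling suggestion, here is a minimal sketch (a hypothetical preprocessing helper, not part of HAIS) that randomly caps the number of points per scene before it reaches the voxelization step:

```python
import numpy as np

def random_subsample(points: np.ndarray, max_points: int = 250000,
                     seed: int = 0) -> np.ndarray:
    """Randomly keep at most max_points points of an (N, C) point cloud.

    Hypothetical helper: `points` holds xyz plus any extra channels
    (color, normals, labels). Keeping scenes close to ScanNet-sized
    point counts avoids the huge index-pair allocations inside spconv.
    """
    if points.shape[0] <= max_points:
        return points
    rng = np.random.default_rng(seed)
    idx = rng.choice(points.shape[0], size=max_points, replace=False)
    return points[idx]
```

Any per-point labels stored in separate arrays can be subsampled with the same `idx` so they stay aligned.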

@outsidercsy I have found out the reason. My data is probably denser than ScanNet. After I set the scale to 10, training works normally. But when I test the model, I hit this problem:

clusters_scale = 1 / ((clusters_coords_max - clusters_coords_min) / fullscale).max(1)[0] - 0.01  # (nCluster), float
RuntimeError: cannot perform reduction function max on tensor with no elements because the operation does not have an identity

What could cause this?
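That error usually means the tensor being reduced is empty: `clusters_coords_max - clusters_coords_min` has shape (0, 3) because the clustering stage produced no instance proposals for that scene, which can happen when the test-time scale no longer matches the settings the clustering radius was tuned for. A minimal defensive sketch (hypothetical, assuming `clusters_coords_max` and `clusters_coords_min` are (nCluster, 3) tensors) that skips the computation instead of crashing:

```python
import torch

def safe_cluster_scale(clusters_coords_max: torch.Tensor,
                       clusters_coords_min: torch.Tensor,
                       fullscale: float) -> torch.Tensor:
    """Per-cluster rescaling factor used before cluster voxelization.

    If no clusters were proposed (nCluster == 0), calling .max(1) would
    raise the "reduction on tensor with no elements" error, so return an
    empty tensor and let the caller skip this scene instead.
    """
    if clusters_coords_max.numel() == 0:
        return clusters_coords_max.new_zeros((0,))
    clusters_scale = 1 / ((clusters_coords_max - clusters_coords_min)
                          / fullscale).max(1)[0] - 0.01  # (nCluster,), float
    return clusters_scale
```

The more likely real fix is to keep the test-time scale consistent with the one used for training, so that the clustering step actually produces proposals.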