NVIDIA/MinkowskiEngine

CUDA OOM leads to unhandled thrust::system exception

evelkey opened this issue · 9 comments

Describe the bug

ME raises a C++ thrust::system::system_error exception that cannot be handled from Python and crashes the program. The error occurs non-deterministically during training (especially in long-running trainings, after a few days), and because it cannot be caught from Python it brings down the whole training pipeline.

As parallel_for is not used directly in the repo, most likely one of the functions in MinkowskiConvolution uses a Thrust built-in function that relies on it. This call should be wrapped with a THRUST_CHECK (analogous to CUDA_CHECK) to turn the failure into an exception that can be interpreted in Python.
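
For illustration, such a wrapper could look roughly like the sketch below (my illustration only, not the actual macro that was later added to gpu.cuh):

#include <thrust/system_error.h>
#include <stdexcept>
#include <string>

// Illustrative THRUST_CHECK-style wrapper (a sketch, not the real macro):
// convert a thrust::system_error thrown by the wrapped call into a
// std::runtime_error, which pybind11 translates into a Python RuntimeError.
#define THRUST_CHECK(expr)                                                   \
  try {                                                                      \
    expr;                                                                    \
  } catch (const thrust::system_error &e) {                                  \
    throw std::runtime_error(std::string("Thrust failure: ") + e.what());    \
  }

A Thrust call in the convolution path could then be wrapped as THRUST_CHECK(thrust::sort_by_key(...));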


To Reproduce

The problem is GPU dependent; the code below deterministically produces the error on a 16 GB Tesla V100 GPU. To reproduce on other GPUs (it mostly depends on VRAM size), adjust point_count in the code below.

import MinkowskiEngine as ME
import torch
import torch.nn as nn
from MinkowskiEngine import SparseTensor


class TestNet(ME.MinkowskiNetwork):
    def __init__(self, in_feat, out_feat, D, layers=80):
        super(TestNet, self).__init__(D)
        convs = [out_feat for _ in range(layers)]
        # plain Python list: the layers are not registered as submodules,
        # hence the custom cuda() override below
        self.convs = []
        prev = in_feat
        for outchannels in convs:
            layer = nn.Sequential(
                ME.MinkowskiConvolution(
                    in_channels=prev,
                    out_channels=outchannels,
                    kernel_size=3,
                    stride=2,
                    dilation=1,
                    bias=True,
                    dimension=D,
                ),
                ME.MinkowskiReLU(),
            )
            self.convs.append(layer)
            prev = outchannels
        self.relu = ME.MinkowskiReLU()

    def forward(self, x):
        temp = x
        for convlayer in self.convs:
            temp = convlayer(temp)
        return temp

    def cuda(self):
        super(TestNet, self).cuda()
        self.convs = [c.cuda() for c in self.convs]
        return self


point_count = 6000000  # tune this to the available VRAM to trigger the OOM
in_channels, out_channels, D = 2, 3, 3
coords, feats = (
    torch.randint(low=-1000, high=1000, size=(point_count, D + 1)).int().cuda(),
    torch.rand(size=(point_count, in_channels)).cuda(),
)
coords[:, 0] = 0  # first column is the batch index; put all points in batch 0

testnetwork = TestNet(in_channels, 32, 3).cuda()


for i in range(5):
    print(f"starting {i}")
    xt = SparseTensor(feats, coordinates=coords, device="cuda")
    torch.cuda.synchronize()
    print("run forward")
    res = testnetwork(xt)
    loss = res.F.sum()
    torch.cuda.synchronize()
    print("run backward")
    loss.backward()

Expected behavior

A thrust::system::system_error exception should be converted to a Python RuntimeError or MemoryError so that it can be caught with a try/except block in Python.


Server (running inside Nvidia Docker):

==========System==========
Linux-5.4.0-1047-aws-x86_64-with-glibc2.10
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"
3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0]
==========Pytorch==========
1.7.1
torch.cuda.is_available(): True
==========NVIDIA-SMI==========
/usr/bin/nvidia-smi
Driver Version 460.73.01
CUDA Version 11.2
VBIOS Version 88.00.4F.00.09
Image Version G503.0201.00.03
==========NVCC==========
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
==========CC==========
/usr/bin/c++
c++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

==========MinkowskiEngine==========
0.5.4 (master of 05/26/2021)
MinkowskiEngine compiled with CUDA Support: True
NVCC version MinkowskiEngine is compiled: 10020
CUDART version MinkowskiEngine is compiled: 10020

Thanks for the suggestion. I have added THRUST_CHECKs to the coordinate initialization functions. They will throw std::runtime_error, which pybind11 converts into a Python error.

Thank you, I merged the changes (and applied the wrapper to the same functions in the .cu files too). However, it does not solve the issue.

I implemented a custom pybind11 handler that catches all thrust::system_error exceptions, which makes it simple to handle Thrust exceptions from Python:

// added to pybind/minkowski.cu after PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
py::register_exception<thrust::system_error>(m, "ThrustException", PyExc_RuntimeError);

This raises MinkowskiEngine._C.ThrustException any time a thrust::system_error is thrown.

Then I re-compiled the code in debug mode, and the same error occurs:

...
/home/geza/workspace/MinkowskiEngine/src/coordinate_map_gpu.cu:1688 num threads 134555040
/home/geza/workspace/MinkowskiEngine/src/coordinate_map_gpu.cu:1711 direct_kernel_map finished
/home/geza/workspace/MinkowskiEngine/src/coordinate_map_gpu.cu:1723 Valid size: 10269573
thrust running: kernel_map.decompose()
/home/geza/workspace/MinkowskiEngine/src/kernel_map.cuh:314 Decomposing 10269573 elements
CUDA error 2 [/usr/local/cuda/include/thrust/system/cuda/detail/parallel_for.h, 143]: out of memory
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorMemoryAllocation: out of memory
[1]    9579 abort (core dumped)  python test.py

I also tested intentionally throwing an exception from the same function call, and it was correctly raised and I could catch it in Python.

Such a terminate call is usually produced when a noexcept function throws an exception, which leads to std::terminate(); it can be reproduced simply with:

void test_no_except() noexcept {
  throw thrust::system_error(thrust::error_code());
}

For testing, I removed all noexcept specifiers from src (except in robin_hood.h), but it did not solve the issue. Note that the third-party robin_hood library uses noexcept in 120 different places, for example in swappable.

I will keep investigating the issue, as it kills most of our trainings (we work with very high numbers of points). My best guess is that one of the third-party noexcept functions throws an exception.

Hmm, I couldn't reproduce the error with my Titan RTX. I tried changing the number of points from 10M to 220M, but no luck.

Anyway, I pushed another commit to cover the Thrust errors in kernel_map.decompose(). I didn't know that this could be memory intensive as well. The try/catch converts the Thrust error into a std::runtime_error, which pybind11 automatically converts to a Python error. So you do not need to create your own custom handler as long as you use THRUST_CHECK from gpu.cuh.

robin_hood.h is a hash table used only in CPU mode, so it is not related to this error.

I did the same with the decompose calls before, but it didn't solve the issue. I'll follow up with the core dump to see the traceback, because I'm missing where it all goes wrong. (Everything is guarded, yet I'm still getting a terminate, which should not be possible.)
Thanks for looking into it!

Ah sorry, I found that I didn't wrap sort_by_key in decompose in the previous commit. I amended the commit and force pushed. Could you try the latest master again?

Thanks for the update. I did a deep dive with gdb and valgrind, and there is something extremely strange. I modified the decompose call to this:

void decompose() {
    LOG_DEBUG("Decomposing", kernels.end() - kernels.begin(), "elements");
    // the memory space must be initialized first!
    // sort
    try {
      thrust::sort_by_key(thrust::device,            //
                          kernels.begin(),           // key begin
                          kernels.end(),             // key end
                          thrust::make_zip_iterator( // value begin
                              thrust::make_tuple(    //
                                  in_maps.begin(),   //
                                  out_maps.begin()   //
                                  )));
    }
    THRUST_CATCH;
...

This results in the same error as before:

/home/geza/workspace/MinkowskiEngine/src/coordinate_map_gpu.cu:1721 Valid size: 10267625
/home/geza/workspace/MinkowskiEngine/src/kernel_map.cuh:314 Decomposing 10267625 elements
CUDA error 2 [/usr/local/cuda/include/thrust/system/cuda/detail/parallel_for.h, 143]: out of memory
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: cudaErrorMemoryAllocation: out of memory
[1]    4024 abort (core dumped)  python test.py

I tried intentionally throwing a thrust::system_error there, and it was properly caught.

Traceback:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f7869189921 in __GI_abort () at abort.c:79
#2  0x00007f784a3b5892 in __gnu_cxx::__verbose_terminate_handler ()
    at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1618239181388/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007f784a3b3f69 in __cxxabiv1::__terminate (handler=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1618239181388/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x00007f784a3b33c7 in __cxa_call_terminate (ue_header=ue_header@entry=0x56010acfb430)
    at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1618239181388/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_call.cc:54
#5  0x00007f784a3b3bfa in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=0x56010acfb430, context=0x7ffc41c7eb40)
    at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1618239181388/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_personality.cc:677
#6  0x00007f784a2ffadc in _Unwind_RaiseException_Phase2 (exc=exc@entry=0x56010acfb430, context=context@entry=0x7ffc41c7eb40, frames_p=frames_p@entry=0x7ffc41c7ec30)
    at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1618239181388/work/.build/x86_64-conda-linux-gnu/src/gcc/libgcc/unwind.inc:64
#7  0x00007f784a2ffdda in _Unwind_RaiseException (exc=exc@entry=0x56010acfb430)
    at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1618239181388/work/.build/x86_64-conda-linux-gnu/src/gcc/libgcc/unwind.inc:136
#8  0x00007f784a3b4187 in __cxxabiv1::__cxa_throw (obj=0x56010acfb450, tinfo=0x7f77ecbca898 <typeinfo for thrust::system::system_error>, dest=0x7f77ec295cce <thrust::system::system_error::~system_error()>)
    at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1618239181388/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:90
#9  0x00007f77ec35386d in thrust::cuda_cub::throw_on_error (status=cudaErrorMemoryAllocation, msg=0x7f77ec607621 "parallel_for failed") at /usr/local/cuda/include/thrust/system/cuda/detail/util.h:213
#10 0x00007f77ec3df546 in thrust::cuda_cub::parallel_for<thrust::cuda_cub::par_t, thrust::cuda_cub::for_each_f<thrust::pointer<thrust::tuple<unsigned int, unsigned int, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::par_t, thrust::use_default, thrust::use_default>, thrust::detail::wrapped_function<thrust::detail::allocator_traits_detail::gozer, void> >, long> (policy=..., f=..., count=16215768) at /usr/local/cuda/include/thrust/system/cuda/detail/parallel_for.h:165
#11 0x00007f77ec3dd99e in thrust::cuda_cub::for_each_n<thrust::cuda_cub::par_t, thrust::pointer<thrust::tuple<unsigned int, unsigned int, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::par_t, thrust::use_default, thrust::use_default>, long, thrust::detail::allocator_traits_detail::gozer> (op=..., count=16215768, first=..., policy=...) at /usr/local/cuda/include/thrust/system/cuda/detail/for_each.h:79
#12 thrust::for_each_n<thrust::cuda_cub::par_t, thrust::pointer<thrust::tuple<unsigned int, unsigned int, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::par_t, thrust::use_default, thrust::use_default>, long, thrust::detail::allocator_traits_detail::gozer> (exec=..., first=..., 
    n=16215768, f=...) at /usr/local/cuda/include/thrust/detail/for_each.inl:71
#13 0x00007f77ec3dbfe4 in thrust::detail::allocator_traits_detail::destroy_range<thrust::detail::no_throw_allocator<thrust::detail::temporary_allocator<thrust::tuple<unsigned int, unsigned int, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::par_t> >, thrust::pointer<thrust::tuple<unsigned int, unsigned int, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::par_t, thrust::use_default, thrust::use_default>, long> (a=..., p=..., n=16215768) at /usr/local/cuda/include/thrust/detail/allocator/destroy_range.inl:137
#14 0x00007f77ec3c1a4c in thrust::detail::destroy_range<thrust::detail::no_throw_allocator<thrust::detail::temporary_allocator<thrust::tuple<unsigned int, unsigned int, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::par_t> >, thrust::pointer<thrust::tuple<unsigned int, unsigned int, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::par_t, thrust::use_default, thrust::use_default>, long> (a=..., p=..., n=16215768) at /usr/local/cuda/include/thrust/detail/allocator/destroy_range.inl:158
#15 0x00007f77ec3b6ec2 in thrust::detail::contiguous_storage<thrust::tuple<unsigned int, unsigned int, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::detail::no_throw_allocator<thrust::detail::temporary_allocator<thrust::tuple<unsigned int, unsigned int, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::par_t> > >::destroy (this=0x7ffc41c7f2c0, first=..., last=...)
    at /usr/local/cuda/include/thrust/detail/contiguous_storage.inl:313
#16 0x00007f77ec39c4da in thrust::detail::temporary_array<thrust::tuple<unsigned int, unsigned int, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::par_t>::~temporary_array (this=0x7ffc41c7f2c0, __in_chrg=<optimized out>) at /usr/local/cuda/include/thrust/detail/temporary_array.inl:160
#17 0x00007f77ec39a7f8 in thrust::detail::_trivial_sequence<thrust::zip_iterator<thrust::tuple<unsigned int*, unsigned int*, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type> >, thrust::cuda_cub::par_t, thrust::detail::integral_constant<bool, false> >::~_trivial_sequence (this=0x7ffc41c7f2c0, 
    __in_chrg=<optimized out>) at /usr/local/cuda/include/thrust/detail/trivial_sequence.h:62
#18 0x00007f77ec39a814 in thrust::detail::trivial_sequence<thrust::zip_iterator<thrust::tuple<unsigned int*, unsigned int*, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type> >, thrust::cuda_cub::par_t>::~trivial_sequence (this=0x7ffc41c7f2c0, __in_chrg=<optimized out>)
    at /usr/local/cuda/include/thrust/detail/trivial_sequence.h:83
#19 0x00007f77ec393c6b in thrust::cuda_cub::__smart_sort::smart_sort<thrust::detail::integral_constant<bool, true>, thrust::detail::integral_constant<bool, false>, thrust::cuda_cub::par_t, unsigned int*, thrust::zip_iterator<thrust::tuple<unsigned int*, unsigned int*, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type> >, thrust::less<unsigned int> > (compare_op=..., items_first=..., keys_last=0x7f75b87c9360, keys_first=0x7f75b49ed800, policy=...) at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1573
#20 thrust::cuda_cub::sort_by_key<thrust::cuda_cub::par_t, unsigned int*, thrust::zip_iterator<thrust::tuple<unsigned int*, unsigned int*, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type> >, thrust::less<unsigned int> > (policy=..., keys_first=0x7f75b49ed800, keys_last=0x7f75b87c9360, values=..., 
    compare_op=...) at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1666
#21 0x00007f77ec387bf9 in thrust::cuda_cub::sort_by_key<thrust::cuda_cub::par_t, unsigned int*, thrust::zip_iterator<thrust::tuple<unsigned int*, unsigned int*, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type> > > (policy=..., keys_first=0x7f75b49ed800, keys_last=0x7f75b87c9360, values=...)
    at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1734
#22 0x00007f77ec384e8d in thrust::sort_by_key<thrust::cuda_cub::par_t, unsigned int*, thrust::zip_iterator<thrust::tuple<unsigned int*, unsigned int*, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type> > > (exec=..., keys_first=0x7f75b49ed800, keys_last=0x7f75b87c9360, values_first=...)
    at /usr/local/cuda/include/thrust/detail/sort.inl:98
#23 0x00007f77ec376352 in minkowski::gpu_kernel_map<unsigned int, minkowski::detail::c10_allocator<char> >::decompose (this=0x7ffc41c7f8c0) at /home/geza/workspace/MinkowskiEngine/src/kernel_map.cuh:318
#24 0x00007f77ec36d0b9 in minkowski::CoordinateMapGPU<int, minkowski::detail::c10_allocator>::kernel_map (this=0x560113de9f50, out_map=..., kernel=..., 

So an exception is raised inside the guarded block, but it is not caught by the catch. We think it might have something to do with the thrust::detail::no_throw_allocator.
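
One possible reading of the traceback (my interpretation, not confirmed): frames #16 to #18 are destructors (~temporary_array, ~_trivial_sequence, ~trivial_sequence), and the throw in frame #9 happens inside code called from them. Destructors are implicitly noexcept in C++11, so an exception escaping one calls std::terminate() even inside a try/catch, which would match the __cxa_call_terminate in frame #4. A minimal illustration of that behavior:

#include <stdexcept>

struct Temp {
  // Destructors are implicitly noexcept(true); throwing from one terminates.
  ~Temp() { throw std::runtime_error("cleanup failed"); }
};

int main() {
  try {
    Temp t;
    // std::terminate() is called when t is destroyed, despite the try/catch
  } catch (const std::exception &) {
    // never reached
  }
  return 0;
}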

As a workaround, my idea is to check beforehand that there is enough free memory for the sort and only call it if there is, to prevent these OOM aborts; but that is only a hack and does not solve the real problem.
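
A minimal sketch of such a guard (my own idea, not MinkowskiEngine code; the 2x estimate of the sort's temporary storage is an assumption, not a measured bound):

#include <cuda_runtime.h>
#include <cstddef>
#include <stdexcept>

// Hypothetical pre-check before thrust::sort_by_key in decompose():
// query free VRAM and bail out with a catchable exception if it looks too low.
inline void check_sort_memory(std::size_t num_elements,
                              std::size_t bytes_per_element) {
  std::size_t free_bytes = 0, total_bytes = 0;
  if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess)
    throw std::runtime_error("cudaMemGetInfo failed");
  if (free_bytes < 2 * num_elements * bytes_per_element)
    throw std::runtime_error("not enough GPU memory for the kernel map sort");
}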

I can also reproduce the issue on an RTX 2070 with point_count = 2500000.

Hi Chris,

We've reproduced the issue in pure Thrust code, so it's not a problem with MinkowskiEngine. I raised the issue in the Thrust repo here: NVIDIA/thrust#1448. The issue is no longer present with CUDA 11.0+, so we are migrating our codebase over if possible. Thanks for looking into it.

Thanks @evelkey for the update.

I'll close the issue and add a note to the README.