BIGBALLON/PyTorch-CPP

Segmentation fault

yueyihua opened this issue · 9 comments

After loading my model and running inference on all images, a segmentation fault occurs. How can I resolve this problem? The error is as follows:
File: ./test-mnist/p2p/P2P-xunlei-unk-udp-2965_192.168.253.9_12345_113.100.16.221_12345_17.png, Label: p2p, Probability: 99.9866%
File: ./test-mnist/p2p/P2P-xunlei-unk-udp-3034_192.168.253.9_12345_180.166.38.134_0_17.png, Label: p2p, Probability: 99.9999%

Thread 1 "classifier" received signal SIGSEGV, Segmentation fault.
0x00007fffec81723e in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
Missing separate debuginfos, use: yum debuginfo-install libgcc-8.3.1-4.5.el8.x86_64 libgomp-8.3.1-4.5.el8.x86_64 libstdc++-8.3.1-4.5.el8.x86_64 zlib-1.2.11-10.el8.x86_64
(gdb) bt
#0 0x00007fffec81723e in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#1 0x00007fffec81c70b in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#2 0x00007fffec8492d0 in cudaStreamDestroy () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#3 0x00007fff68f0551d in cudnnDestroy () from /usr/local/lib/libtorch_cuda.so
#4 0x00007fff68207a05 in at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cudnnContext*, &at::native::(anonymous namespace)::createCuDNNHandle, &at::native::(anonymous namespace)::destroyCuDNNHandle>::~DeviceThreadHandlePool() () from /usr/local/lib/libtorch_cuda.so
#5 0x00007fff63e66677 in __cxa_finalize () from /lib64/libc.so.6
#6 0x00007fff65a68a83 in __do_global_dtors_aux () from /usr/local/lib/libtorch_cuda.so
#7 0x00007fffffffe130 in ?? ()
#8 0x00007ffff7de4106 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
Backtrace stopped: frame did not save the PC
(gdb)

@yueyihua can you provide a code snippet or more details for debugging? I guess something is wrong on the C++ side, not in CUDA/cuDNN.

@BIGBALLON I think it could be that resources are not being released, but I don't really know why. My code is as follows:
int main(int argc, const char *argv[])
{
    if (argc != 4) {
        std::cerr << "Usage: classifier <model-file> <label-file> <image-dir>"
                  << std::endl;
        return -1;
    }

    torch::jit::script::Module module = torch::jit::load(argv[1]);
    std::cout << "== Switch to GPU mode" << std::endl;
    // to GPU
    module.to(at::kCUDA);

    std::cout << "== Model loaded!\n";
    std::vector<std::string> labels;
    if (LoadImageNetLabel(argv[2], labels)) {
        std::cout << "== Label loaded! Let's try it\n";
    } else {
        std::cerr << "Please check your label file path." << std::endl;
        return -1;
    }

    cv::Mat image;
    std::vector<std::string> files;
    get_all_files(argv[3], files);
    for (const auto &file : files) {
        if (LoadImage(file, image)) {
            auto input_tensor = torch::from_blob(
                    image.data, {1, kIMAGE_SIZE, kIMAGE_SIZE, kCHANNELS});
            input_tensor = input_tensor.permute({0, 3, 1, 2});
            // standard ImageNet per-channel normalization
            input_tensor[0][0] = input_tensor[0][0].sub_(0.485).div_(0.229);
            input_tensor[0][1] = input_tensor[0][1].sub_(0.456).div_(0.224);
            input_tensor[0][2] = input_tensor[0][2].sub_(0.406).div_(0.225);

            // to GPU
            input_tensor = input_tensor.to(at::kCUDA);

            torch::Tensor out_tensor = module.forward({input_tensor}).toTensor();

            auto results = out_tensor.sort(-1, true);
            auto softmaxs = std::get<0>(results)[0].softmax(0);
            auto indexs = std::get<1>(results)[0];

            for (int i = 0; i < kTOP_K; ++i) {
                auto idx = indexs[i].item<int>();
                std::cout << " File: " << file << ", Label: " << labels[idx]
                          << ", Probability: " << softmaxs[i].item<float>() * 100.0f
                          << "%" << std::endl;
            }
        }
    }
    std::cout << "Before return, I'm OK!" << std::endl; // this log does get printed
    return 0;
}

Maybe check the I/O operations in get_all_files()?

@BIGBALLON this function just collects all image file names into a vector; I think there is no problem with it:
void get_all_files(const std::string &path, std::vector<std::string> &files)
{
    struct dirent *ptr;
    DIR *dir = opendir(path.c_str());
    if (dir == NULL)
        return;  // directory could not be opened
    while ((ptr = readdir(dir)) != NULL) {
        // skip hidden entries as well as "." and ".."
        if (ptr->d_name[0] == '.')
            continue;
        files.emplace_back(path + std::string("/") + std::string(ptr->d_name));
    }
    closedir(dir);
}

@BIGBALLON I used your code and model and hit the same problem after entering Q:
== Switch to GPU mode
[New Thread 0x7fff215ea700 (LWP 30485)]
[New Thread 0x7fff20de9700 (LWP 30486)]
== ResNet50 loaded!
== Label loaded! Let's try it
== Input image path: [enter Q to exit]
/home/yyh/test/PyTorch-CPP/pic/dog.jpg
[New Thread 0x7fff2d664700 (LWP 30487)]
[New Thread 0x7fff2ce63700 (LWP 30488)]
[New Thread 0x7fff29fff700 (LWP 30489)]
[New Thread 0x7fff297fe700 (LWP 30490)]
[New Thread 0x7fff28ffd700 (LWP 30491)]
[New Thread 0x7fff287fc700 (LWP 30492)]
[New Thread 0x7fff27ffb700 (LWP 30493)]
[New Thread 0x7fff03fff700 (LWP 30494)]
[New Thread 0x7fff0233d700 (LWP 30495)]
[New Thread 0x7fff01b3c700 (LWP 30496)]
[New Thread 0x7ffeeffff700 (LWP 30497)]
[New Thread 0x7ffeef7fe700 (LWP 30498)]
[New Thread 0x7ffeeeffd700 (LWP 30499)]
[New Thread 0x7ffeddfff700 (LWP 30500)]
[New Thread 0x7ffedd7fe700 (LWP 30501)]
[New Thread 0x7ffedcffd700 (LWP 30502)]
[New Thread 0x7ffedc7fc700 (LWP 30503)]
[New Thread 0x7ffedbffb700 (LWP 30504)]
[New Thread 0x7ffedb7fa700 (LWP 30505)]
[New Thread 0x7ffed54e9700 (LWP 30506)]
[New Thread 0x7ffed4ce8700 (LWP 30507)]
[New Thread 0x7ffebbfff700 (LWP 30508)]
[New Thread 0x7ffebb7fe700 (LWP 30509)]
[New Thread 0x7ffebaffd700 (LWP 30510)]
[New Thread 0x7ffeba7fc700 (LWP 30511)]
[New Thread 0x7ffeb9ffb700 (LWP 30512)]
[New Thread 0x7ffeb97fa700 (LWP 30513)]
[New Thread 0x7ffeb8ff9700 (LWP 30514)]
[New Thread 0x7ffeb87f8700 (LWP 30515)]
[New Thread 0x7ffeb7ff7700 (LWP 30516)]
[New Thread 0x7ffeb77f6700 (LWP 30517)]
[New Thread 0x7ffeb6ff5700 (LWP 30518)]
[New Thread 0x7ffeb67f4700 (LWP 30519)]
[New Thread 0x7ffeb5ff3700 (LWP 30520)]
[New Thread 0x7ffeb57f2700 (LWP 30521)]
[New Thread 0x7ffeb4ff1700 (LWP 30522)]
[New Thread 0x7ffeb47f0700 (LWP 30523)]
[New Thread 0x7ffeb3fef700 (LWP 30524)]
[New Thread 0x7ffeb37ee700 (LWP 30525)]
[New Thread 0x7ffeb2fed700 (LWP 30526)]
[New Thread 0x7ffeb27ec700 (LWP 30527)]
[New Thread 0x7ffeb1feb700 (LWP 30528)]
[New Thread 0x7ffeb17ea700 (LWP 30529)]
[New Thread 0x7ffeb0fe9700 (LWP 30530)]
[New Thread 0x7ffeb07e8700 (LWP 30531)]
[New Thread 0x7ffeaffe7700 (LWP 30532)]
== image size: [976 x 549] ==
== simply resize: [224 x 224] ==
============= Top-1 =============
Label: beagle
With Probability: 99.1227%
============= Top-2 =============
Label: Walker hound, Walker foxhound
With Probability: 0.469355%
============= Top-3 =============
Label: English foxhound
With Probability: 0.110916%
== Input image path: [enter Q to exit]
Q

Thread 1 "classifier" received signal SIGSEGV, Segmentation fault.
0x00007fffec81723e in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
Missing separate debuginfos, use: yum debuginfo-install libgcc-8.3.1-4.5.el8.x86_64 libgomp-8.3.1-4.5.el8.x86_64 libstdc++-8.3.1-4.5.el8.x86_64 zlib-1.2.11-10.el8.x86_64
(gdb) bt
#0 0x00007fffec81723e in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#1 0x00007fffec81c70b in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#2 0x00007fffec8492d0 in cudaStreamDestroy () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#3 0x00007fff68f0551d in cudnnDestroy () from /usr/local/lib/libtorch_cuda.so
#4 0x00007fff68207a05 in at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cudnnContext*, &at::native::(anonymous namespace)::createCuDNNHandle, &at::native::(anonymous namespace)::destroyCuDNNHandle>::~DeviceThreadHandlePool() () from /usr/local/lib/libtorch_cuda.so
#5 0x00007fff63e66677 in __cxa_finalize () from /lib64/libc.so.6
#6 0x00007fff65a68a83 in __do_global_dtors_aux () from /usr/local/lib/libtorch_cuda.so
#7 0x00007fffffffe0b0 in ?? ()
#8 0x00007ffff7de4106 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
Backtrace stopped: frame did not save the PC
(gdb)

Try commenting out all the ops in the main function; it seems something is wrong with your CUDA setup.
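For example, something like this stripped-down main (just a sketch, with a hypothetical hard-coded kMODEL_PATH placeholder) should show whether the crash comes from libtorch/CUDA initialization and teardown itself rather than from the image loop:

#include <torch/script.h>
#include <iostream>
#include <string>

int main()
{
    // hypothetical placeholder -- point this at your exported model
    const std::string kMODEL_PATH = "model.pt";

    // only load the module and move it to the GPU; no image I/O,
    // no forward pass, then return and see whether the segfault
    // still happens after main exits
    torch::jit::script::Module module = torch::jit::load(kMODEL_PATH);
    module.to(at::kCUDA);

    std::cout << "module loaded, returning from main" << std::endl;
    return 0;
}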

If I comment out all the ops in main, it's OK.

This problem is a libtorch bug. See pytorch/pytorch#38385.
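Since the backtrace shows the crash only happens after main has already returned (inside __cxa_finalize / _dl_fini, while the cuDNN handle pool is being torn down), the inference results themselves are fine. As a stopgap until the upstream fix, one workaround I'm considering (my own assumption, not taken from the linked issue) is to skip the C++ static destructors entirely by ending the process with std::_Exit after flushing output; the OS reclaims all GPU and host resources anyway:

#include <cstdlib>        // std::_Exit
#include <iostream>
#include <torch/script.h>

int main(int argc, const char *argv[])
{
    if (argc != 2) {
        std::cerr << "Usage: repro <model-file>" << std::endl;
        return -1;
    }

    torch::jit::script::Module module = torch::jit::load(argv[1]);
    module.to(at::kCUDA);
    // ... run inference exactly as before ...

    std::cout << "Before return, I'm OK!" << std::endl;
    std::cout.flush();  // make sure all pending output is written

    // Exit without running static destructors, so the cuDNN
    // handle-pool destructor in libtorch_cuda (where the segfault
    // occurs) is never invoked; the OS frees everything anyway.
    std::_Exit(0);
}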

@yueyihua that's great!