Segmentation fault
yueyihua opened this issue · 9 comments
After loading my model and running inference on all images, a segmentation fault occurs. How can I resolve this problem? My error is as follows:
File: ./test-mnist/p2p/P2P-xunlei-unk-udp-2965_192.168.253.9_12345_113.100.16.221_12345_17.png, Label: p2p, Probability: 99.9866%
File: ./test-mnist/p2p/P2P-xunlei-unk-udp-3034_192.168.253.9_12345_180.166.38.134_0_17.png, Label: p2p, Probability: 99.9999%
Thread 1 "classifier" received signal SIGSEGV, Segmentation fault.
0x00007fffec81723e in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
Missing separate debuginfos, use: yum debuginfo-install libgcc-8.3.1-4.5.el8.x86_64 libgomp-8.3.1-4.5.el8.x86_64 libstdc++-8.3.1-4.5.el8.x86_64 zlib-1.2.11-10.el8.x86_64
(gdb) bt
#0 0x00007fffec81723e in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#1 0x00007fffec81c70b in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#2 0x00007fffec8492d0 in cudaStreamDestroy () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#3 0x00007fff68f0551d in cudnnDestroy () from /usr/local/lib/libtorch_cuda.so
#4 0x00007fff68207a05 in at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cudnnContext*, &at::native::(anonymous namespace)::createCuDNNHandle, &at::native::(anonymous namespace)::destroyCuDNNHandle>::~DeviceThreadHandlePool() () from /usr/local/lib/libtorch_cuda.so
#5 0x00007fff63e66677 in __cxa_finalize () from /lib64/libc.so.6
#6 0x00007fff65a68a83 in __do_global_dtors_aux () from /usr/local/lib/libtorch_cuda.so
#7 0x00007fffffffe130 in ?? ()
#8 0x00007ffff7de4106 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
Backtrace stopped: frame did not save the PC
(gdb)
@yueyihua Can you provide the code snippet or more details for debugging? I guess something is wrong in the C++ code, not in CUDA/cuDNN.
@BIGBALLON I think it could be that resources are not being released, but I don't really know why. My code is as follows:
int main(int argc, const char *argv[])
{
    if (argc != 4) {
        std::cerr << "Usage: classifier <model-file> <label-file> <image-dir>"
                  << std::endl;
        return -1;
    }
    torch::jit::script::Module module = torch::jit::load(argv[1]);
    std::cout << "== Switch to GPU mode" << std::endl;
    // to GPU
    module.to(at::kCUDA);
    std::cout << "== Model loaded!\n";
    std::vector<std::string> labels;
    if (LoadImageNetLabel(argv[2], labels)) {
        std::cout << "== Label loaded! Let's try it\n";
    } else {
        std::cerr << "Please check your label file path." << std::endl;
        return -1;
    }
    cv::Mat image;
    std::vector<std::string> files;
    get_all_files(argv[3], files);
    for (const auto &file : files) {
        if (LoadImage(file, image)) {
            auto input_tensor = torch::from_blob(
                image.data, {1, kIMAGE_SIZE, kIMAGE_SIZE, kCHANNELS});
            input_tensor = input_tensor.permute({0, 3, 1, 2});
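            // (Assumption: LoadImage() returns a kIMAGE_SIZE x kIMAGE_SIZE float image
            // already scaled to [0, 1]; the per-channel normalization below uses the
            // standard ImageNet mean/std values.)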
            input_tensor[0][0] = input_tensor[0][0].sub_(0.485).div_(0.229);
            input_tensor[0][1] = input_tensor[0][1].sub_(0.456).div_(0.224);
            input_tensor[0][2] = input_tensor[0][2].sub_(0.406).div_(0.225);
            // to GPU
            input_tensor = input_tensor.to(at::kCUDA);
            torch::Tensor out_tensor = module.forward({input_tensor}).toTensor();
            auto results = out_tensor.sort(-1, true);
            auto softmaxs = std::get<0>(results)[0].softmax(0);
            auto indexs = std::get<1>(results)[0];
            for (int i = 0; i < kTOP_K; ++i) {
                auto idx = indexs[i].item<int>();
                std::cout << " File: " << file << ", Label: " << labels[idx]
                          << ", Probability: " << softmaxs[i].item<float>() * 100.0f
                          << "%" << std::endl;
            }
        }
    }
    std::cout << "Before return, I'm OK!" << std::endl;  // this log does get printed
    return 0;
}
Maybe check the I/O operations of get_all_files()?
@BIGBALLON This function just collects all image file names into a vector; I think it has no problems:
void get_all_files(const std::string &path, std::vector<std::string> &files)
{
    struct dirent *ptr;
    DIR *dir = opendir(path.c_str());
    while ((ptr = readdir(dir)) != NULL) {
        if (ptr->d_name[0] == '.')
            continue;
        files.emplace_back(path + std::string("/") + std::string(ptr->d_name));
    }
    closedir(dir);
}
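For reference, a slightly more defensive variant of this helper would at least rule out I/O problems such as an unreadable directory or stray subdirectory entries. This is only a sketch; the opendir() check and the d_type filter are additions for illustration, not part of the code above:

#include <dirent.h>

#include <iostream>
#include <string>
#include <vector>

void get_all_files(const std::string &path, std::vector<std::string> &files)
{
    DIR *dir = opendir(path.c_str());
    if (dir == nullptr) {
        std::cerr << "Cannot open directory: " << path << std::endl;
        return;
    }
    struct dirent *ptr;
    while ((ptr = readdir(dir)) != NULL) {
        if (ptr->d_name[0] == '.')
            continue;
        if (ptr->d_type != DT_REG)  // skip subdirectories, sockets, etc.
            continue;               // (note: d_type may be DT_UNKNOWN on some filesystems)
        files.emplace_back(path + "/" + ptr->d_name);
    }
    closedir(dir);
}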
@BIGBALLON I used your code and model and got the same problem after entering Q:
== Switch to GPU mode
[New Thread 0x7fff215ea700 (LWP 30485)]
[New Thread 0x7fff20de9700 (LWP 30486)]
== ResNet50 loaded!
== Label loaded! Let's try it
== Input image path: [enter Q to exit]
/home/yyh/test/PyTorch-CPP/pic/dog.jpg
[New Thread 0x7fff2d664700 (LWP 30487)]
[New Thread 0x7fff2ce63700 (LWP 30488)]
[New Thread 0x7fff29fff700 (LWP 30489)]
[New Thread 0x7fff297fe700 (LWP 30490)]
[New Thread 0x7fff28ffd700 (LWP 30491)]
[New Thread 0x7fff287fc700 (LWP 30492)]
[New Thread 0x7fff27ffb700 (LWP 30493)]
[New Thread 0x7fff03fff700 (LWP 30494)]
[New Thread 0x7fff0233d700 (LWP 30495)]
[New Thread 0x7fff01b3c700 (LWP 30496)]
[New Thread 0x7ffeeffff700 (LWP 30497)]
[New Thread 0x7ffeef7fe700 (LWP 30498)]
[New Thread 0x7ffeeeffd700 (LWP 30499)]
[New Thread 0x7ffeddfff700 (LWP 30500)]
[New Thread 0x7ffedd7fe700 (LWP 30501)]
[New Thread 0x7ffedcffd700 (LWP 30502)]
[New Thread 0x7ffedc7fc700 (LWP 30503)]
[New Thread 0x7ffedbffb700 (LWP 30504)]
[New Thread 0x7ffedb7fa700 (LWP 30505)]
[New Thread 0x7ffed54e9700 (LWP 30506)]
[New Thread 0x7ffed4ce8700 (LWP 30507)]
[New Thread 0x7ffebbfff700 (LWP 30508)]
[New Thread 0x7ffebb7fe700 (LWP 30509)]
[New Thread 0x7ffebaffd700 (LWP 30510)]
[New Thread 0x7ffeba7fc700 (LWP 30511)]
[New Thread 0x7ffeb9ffb700 (LWP 30512)]
[New Thread 0x7ffeb97fa700 (LWP 30513)]
[New Thread 0x7ffeb8ff9700 (LWP 30514)]
[New Thread 0x7ffeb87f8700 (LWP 30515)]
[New Thread 0x7ffeb7ff7700 (LWP 30516)]
[New Thread 0x7ffeb77f6700 (LWP 30517)]
[New Thread 0x7ffeb6ff5700 (LWP 30518)]
[New Thread 0x7ffeb67f4700 (LWP 30519)]
[New Thread 0x7ffeb5ff3700 (LWP 30520)]
[New Thread 0x7ffeb57f2700 (LWP 30521)]
[New Thread 0x7ffeb4ff1700 (LWP 30522)]
[New Thread 0x7ffeb47f0700 (LWP 30523)]
[New Thread 0x7ffeb3fef700 (LWP 30524)]
[New Thread 0x7ffeb37ee700 (LWP 30525)]
[New Thread 0x7ffeb2fed700 (LWP 30526)]
[New Thread 0x7ffeb27ec700 (LWP 30527)]
[New Thread 0x7ffeb1feb700 (LWP 30528)]
[New Thread 0x7ffeb17ea700 (LWP 30529)]
[New Thread 0x7ffeb0fe9700 (LWP 30530)]
[New Thread 0x7ffeb07e8700 (LWP 30531)]
[New Thread 0x7ffeaffe7700 (LWP 30532)]
== image size: [976 x 549] ==
== simply resize: [224 x 224] ==
============= Top-1 =============
Label: beagle
With Probability: 99.1227%
============= Top-2 =============
Label: Walker hound, Walker foxhound
With Probability: 0.469355%
============= Top-3 =============
Label: English foxhound
With Probability: 0.110916%
== Input image path: [enter Q to exit]
Q
Thread 1 "classifier" received signal SIGSEGV, Segmentation fault.
0x00007fffec81723e in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
Missing separate debuginfos, use: yum debuginfo-install libgcc-8.3.1-4.5.el8.x86_64 libgomp-8.3.1-4.5.el8.x86_64 libstdc++-8.3.1-4.5.el8.x86_64 zlib-1.2.11-10.el8.x86_64
(gdb) bt
#0 0x00007fffec81723e in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#1 0x00007fffec81c70b in ?? () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#2 0x00007fffec8492d0 in cudaStreamDestroy () from /usr/local/cuda-10.2/lib64/libcudart.so.10.2
#3 0x00007fff68f0551d in cudnnDestroy () from /usr/local/lib/libtorch_cuda.so
#4 0x00007fff68207a05 in at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cudnnContext*, &at::native::(anonymous namespace)::createCuDNNHandle, &at::native::(anonymous namespace)::destroyCuDNNHandle>::~DeviceThreadHandlePool() () from /usr/local/lib/libtorch_cuda.so
#5 0x00007fff63e66677 in __cxa_finalize () from /lib64/libc.so.6
#6 0x00007fff65a68a83 in __do_global_dtors_aux () from /usr/local/lib/libtorch_cuda.so
#7 0x00007fffffffe0b0 in ?? ()
#8 0x00007ffff7de4106 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
Backtrace stopped: frame did not save the PC
(gdb)
Try commenting out all the ops in the main function; it seems there is something wrong with your CUDA setup.
If I comment out all the ops in main, it's OK.
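In other words, the debugging step above bisects main() down to a minimal reproducer. A sketch of what that might look like (the model path and input shape are placeholders, not taken from this thread): if even this program crashes after return 0;, the fault is in libtorch's CUDA teardown rather than in the application code.

#include <torch/script.h>

int main()
{
    // Placeholder model path and input shape; adjust for the actual model.
    torch::jit::script::Module module = torch::jit::load("model.pt");
    module.to(at::kCUDA);
    auto input = torch::rand({1, 3, 224, 224}).to(at::kCUDA);
    module.forward({input});
    return 0;  // if the segfault reproduces, it happens after this line,
               // inside the CUDA/cuDNN static destructors shown in the backtrace
}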
This problem is a libtorch bug. See pytorch/pytorch#38385.
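For anyone hitting this before the upstream fix lands, one blunt workaround (my own suggestion, not something taken from pytorch/pytorch#38385) is to terminate the process before the exit-time static destructors in which cudnnDestroy()/cudaStreamDestroy() crash are run:

#include <cstdlib>
#include <iostream>

int main()
{
    // ... load the model and run inference exactly as in the snippet above ...

    std::cout << "Before return, I'm OK!" << std::endl;  // std::endl flushes stdout
    // Workaround (assumption): std::_Exit() terminates without running atexit
    // handlers or static destructors. The backtrace shows the crash happens inside
    // those destructors (__cxa_finalize -> cudnnDestroy -> cudaStreamDestroy).
    // Note that skipping destructors also skips any other cleanup they would do.
    std::_Exit(0);
}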