TimoSaemann/caffe-segnet-cudnn5

A problem when running make runtest

jamiesoung opened this issue · 6 comments

Issue summary

[----------] 1 test from LayerFactoryTest/2, where TypeParam = caffe::GPUDevice
[ RUN ] LayerFactoryTest/2.TestCreateLayer
*** Aborted at 1483801360 (unix time) try "date -d @1483801360" if you are using GNU date ***
PC: @ 0x7f3458fca962 (unknown)
*** SIGSEGV (@0x118) received by PID 1777 (TID 0x7f346b689800) from PID 280; stack trace: ***
@ 0x7f3459321390 (unknown)
@ 0x7f3458fca962 (unknown)
@ 0x7f3459cd67a5 caffe::BasePrefetchingDataLayer<>::~BasePrefetchingDataLayer()
@ 0x7f3459d99e09 caffe::DataLayer<>::~DataLayer()
@ 0x4ec5e8 caffe::LayerFactoryTest_TestCreateLayer_Test<>::TestBody()
@ 0x8f63d3 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x8f01ea testing::Test::Run()
@ 0x8f0338 testing::TestInfo::Run()
@ 0x8f0415 testing::TestCase::Run()
@ 0x8f162f testing::internal::UnitTestImpl::RunAllTests()
@ 0x8f1943 testing::UnitTest::Run()
@ 0x46dacd main
@ 0x7f3458f67830 (unknown)
@ 0x475509 _start
Makefile:526: recipe for target 'runtest' failed
make: *** [runtest] Segmentation fault (core dumped)

Steps to reproduce

make runtest -j16

Your system configuration

Operating system: Ubuntu 16.04
Compiler: GCC 5.3
CUDA version (if applicable): 8.0
CUDNN version (if applicable): v5.1
BLAS: ATLAS
Python or MATLAB version (for pycaffe and matcaffe respectively): Python 2.7

I cannot reproduce that error.
I tried it on 3 different machines and no error occurred:

Ubuntu 14.04, CUDA 8.0, Titan X (Pascal), cuDNN v.4 /v.5 /v.5.1, compiled with cmake and make
Ubuntu 14.04, CUDA 7.5, Titan X (Maxwell), cuDNN v.4 /v.5 /v.5.1, compiled with cmake and make
Ubuntu 16, CUDA 8.0, GTX 980, cuDNN v.5.1, compiled with cmake

Did you compile it with cmake or make? Did you change anything else in your Makefile.config besides uncommenting the cuDNN flag?
Can you test and train SegNet anyway, or which errors do you encounter?

Hello,
I have now run into this error while building on an AWS g2.2xlarge instance (Ubuntu 16.04, CUDA 8, cuDNN 5.1). I was able to make and pass all tests on another Ubuntu 16.04 machine, which has a K4000 GPU. My OS X El Capitan laptop with its measly GeForce GT 650M was also able to make and pass all tests.

I have used make the whole time.

In Makefile.config, nothing has changed except uncommenting the cuDNN flag and adding /usr/include/hdf5/serial to INCLUDE_DIRS.
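
For reference, a sketch of what those two changes typically look like in Makefile.config (based on the stock Makefile.config.example; exact paths can differ per system):

USE_CUDNN := 1
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /usr/include/hdf5/serial
# Depending on the system, the matching HDF5 library dir may also be needed:
# LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /usr/lib/x86_64-linux-gnu/hdf5/serial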

Running the SegNet-Tutorial basic version training gives me the following output:

~/SegNet-Tutorial/Models$ ~/caffe-segnet-cudnn5/build/tools/caffe train --solver ./segnet_basic_solver.prototxt
I0122 20:35:36.243996  3013 caffe.cpp:217] Using GPUs 0
I0122 20:35:36.531260  3013 caffe.cpp:222] GPU 0: GRID K520
F0122 20:35:36.669652  3013 solver_factory.hpp:76] Check failed: registry.count(type) == 1 (0 vs. 1) Unknown solver type: SGD (known types: )
*** Check failure stack trace: ***
    @     0x7f75cac2b5cd  google::LogMessage::Fail()
    @     0x7f75cac2d433  google::LogMessage::SendToLog()
    @     0x7f75cac2b15b  google::LogMessage::Flush()
    @     0x7f75cac2de1e  google::LogMessageFatal::~LogMessageFatal()
    @           0x41cd2a  train()
    @           0x417678  main
    @     0x7f75c7899830  __libc_start_main
    @           0x418dc9  _start
    @              (nil)  (unknown)
Aborted (core dumped)

Any ideas?

Does the same error occur when you compile it with cmake?
Have you compiled caffe (master branch) on this machine, and can you train with it successfully?

@TimoSaemann Thank you for the reply.
I have just tried with cmake and eventually got the exact same error in runtest.

The master branch compiles and seems to train with no problem. I used make to build it and ran the MNIST example with no issues (roughly the commands sketched below).
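
(For context, the MNIST run was the standard BVLC caffe LeNet example; assuming the stock tutorial scripts, it amounts to roughly the following, run from the caffe root:)

./data/mnist/get_mnist.sh         # download the MNIST data
./examples/mnist/create_mnist.sh  # convert it to LMDB
./examples/mnist/train_lenet.sh   # train LeNet with the bundled solver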

This time I've got some new output:

[----------] 12 tests from DataLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN      ] DataLayerTest/3.TestReadCropTrainLevelDB
*** Error in `/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin': free(): invalid pointer: 0x00007f86eae3a7a0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f86ea5527e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7f86ea55ae0a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f86ea55e98c]
/home/ubuntu/caffe-segnet-cudnn5/build/lib/libcaffe.so.1.0.0-rc3(_ZN5caffe24BasePrefetchingDataLayerIdED1Ev+0x37)[0x7f86f1492757]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(_ZN5caffe13DataLayerTestINS_9GPUDeviceIdEEE12TestReadCropENS_5PhaseE+0x8f6)[0xb23ab6]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x43)[0xde5923]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(_ZN7testing4Test3RunEv+0xba)[0xdde85a]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(_ZN7testing8TestInfo3RunEv+0x118)[0xdde9a8]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(_ZN7testing8TestCase3RunEv+0xe5)[0xddeab5]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0x22f)[0xde064f]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(_ZN7testing8UnitTest3RunEv+0x43)[0xde0973]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(main+0x17d)[0x891abd]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f86ea4fb830]
/home/ubuntu/caffe-segnet-cudnn5/build/test/test.testbin(_start+0x29)[0x8973b9]
======= Memory map: ========
00400000-00fcd000 r-xp 00000000 ca:01 813068                             /home/ubuntu/caffe-segnet-cudnn5/.build_release/test/test.testbin
011cc000-0122f000 r--p 00bcc000 ca:01 813068                             /home/ubuntu/caffe-segnet-cudnn5/.build_release/test/test.testbin
0122f000-01231000 rw-p 00c2f000 ca:01 813068                             /home/ubuntu/caffe-segnet-cudnn5/.build_release/test/test.testbin
01231000-01232000 rw-p 00000000 00:00 0
02212000-07380000 rw-p 00000000 00:00 0                                  [heap]
200000000-200100000 rw-s 36092000 00:06 395                              /dev/nvidiactl

This is followed by a long output that I don't understand, and finally the original error.

It might be failing because of the presence of multiple GPUs in your system. Try:

export CUDA_VISIBLE_DEVICES=0
echo $CUDA_VISIBLE_DEVICES   # should print 0

and then run make runtest -j8.

Try installing tcmalloc:

sudo apt-get install libtcmalloc-minimal4

Add it to the LD_PRELOAD variable, then run make runtest again:
export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"

You may need to add it to ~/.bashrc too, as sketched below.
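
One minimal way to make that persistent across shells, assuming the library really lives at the path used above (you can verify the path with dpkg -L libtcmalloc-minimal4):

# Append the preload line to your shell profile, then reload it
echo 'export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"' >> ~/.bashrc
source ~/.bashrc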