stephenyan1231/caffe-public

Error when running cifar100 examples

Opened this issue · 4 comments

Hi, Zhicheng,

I successfully built Caffe following your tutorial here: https://sites.google.com/site/homepagezhichengyan/home/hdcnn/code. However, when I run the cifar100 example in the 2nd step (./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh), I get a strange error, shown below. I suspect a problem with multiple GPUs, since I can run a single experiment on one GPU with Caffe's official example. Could you please give me some advice? Thank you very much!

I0302 18:59:01.165179 5920 caffe.cpp:105] Use GPUs with device IDs below
I0302 18:59:01.165335 5920 caffe.cpp:107] device id 0
I0302 18:59:01.165354 5920 caffe.cpp:107] device id 1
I0302 18:59:01.165369 5920 caffe.cpp:117] Starting Optimization
I0302 18:59:11.525671 5920 solver.cpp:77] Creating training net from net file: models/cifar100_NIN_float_crop_v2/train_val/train_test.prototxt
I0302 18:59:11.525739 5920 upgrade_proto.cpp:928] start ReadNetParamsFromTextFileOrDie
I0302 18:59:11.526916 5920 solver.cpp:80] create net
I0302 18:59:11.527045 5920 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0302 18:59:11.527104 5920 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0302 18:59:11.527258 5920 data_transformer.cpp:24] Loading mean file from: data/cifar100/float_mean.binaryproto
I0302 18:59:11.530076 5920 db.cpp:20] Opened leveldb examples/cifar100/cifar100-float-train-train-val-leveldb/cifar100-train-leveldb
I0302 18:59:11.530103 5920 data_manager.cpp:97] new database cursor
I0302 18:59:11.530953 5920 data_manager.cpp:99] new database transaction
*** Aborted at 1456963151 (unix time) try "date -d @1456963151" if you are using GNU date ***
PC: @ 0x7f9d86a53644 leveldb::(anonymous namespace)::MergingIterator::key()
*** SIGSEGV (@0x18) received by PID 5920 (TID 0x7f9d8c7339c0) from PID 24; stack trace: ***
@ 0x7f9d80d81670 (unknown)
@ 0x7f9d86a53644 leveldb::(anonymous namespace)::MergingIterator::key()
@ 0x7f9d86a3dc7e leveldb::(anonymous namespace)::DBIter::key()
@ 0x4fb102 caffe::db::LevelDBCursor::key()
@ 0x535fa1 caffe::DataManager<>::DataManager()
@ 0x5496b2 caffe::Net<>::InitDataManager()
@ 0x5676be caffe::Net<>::Init()
@ 0x567880 caffe::Net<>::Net()
@ 0x573eab caffe::Solver<>::InitTrainNet()
@ 0x574ebc caffe::Solver<>::Init()
@ 0x575046 caffe::Solver<>::Solver()
@ 0x4229f0 caffe::GetSolver<>()
@ 0x41c1f8 train()
@ 0x414091 main
@ 0x7f9d80d6db15 __libc_start_main
@ 0x41bd6d (unknown)
./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh: line 5: 5920 Segmentation fault (core dumped) GLOG_logtostderr=1 ./build/tools/caffe train --solver=models/cifar100_NIN_float_crop_v2/train_val/solver.prototxt
I0302 18:59:13.376158 6839 caffe.cpp:105] Use GPUs with device IDs below
I0302 18:59:13.376302 6839 caffe.cpp:107] device id 0
I0302 18:59:13.376323 6839 caffe.cpp:107] device id 1
I0302 18:59:13.376339 6839 caffe.cpp:117] Starting Optimization
I0302 18:59:24.075724 6839 solver.cpp:77] Creating training net from net file: models/cifar100_NIN_float_crop_v2/train_val/train_test.prototxt
I0302 18:59:24.075798 6839 upgrade_proto.cpp:928] start ReadNetParamsFromTextFileOrDie
I0302 18:59:24.076957 6839 solver.cpp:80] create net
I0302 18:59:24.077093 6839 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0302 18:59:24.077142 6839 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0302 18:59:24.077369 6839 data_transformer.cpp:24] Loading mean file from: data/cifar100/float_mean.binaryproto
I0302 18:59:24.080384 6839 db.cpp:20] Opened leveldb examples/cifar100/cifar100-float-train-train-val-leveldb/cifar100-train-leveldb
I0302 18:59:24.080411 6839 data_manager.cpp:97] new database cursor
I0302 18:59:24.081198 6839 data_manager.cpp:99] new database transaction
*** Aborted at 1456963164 (unix time) try "date -d @1456963164" if you are using GNU date ***
PC: @ 0x7f9b196d2644 leveldb::(anonymous namespace)::MergingIterator::key()
*** SIGSEGV (@0x18) received by PID 6839 (TID 0x7f9b1f3b29c0) from PID 24; stack trace: ***
@ 0x7f9b13a00670 (unknown)
@ 0x7f9b196d2644 leveldb::(anonymous namespace)::MergingIterator::key()
@ 0x7f9b196bcc7e leveldb::(anonymous namespace)::DBIter::key()
@ 0x4fb102 caffe::db::LevelDBCursor::key()
@ 0x535fa1 caffe::DataManager<>::DataManager()
@ 0x5496b2 caffe::Net<>::InitDataManager()
@ 0x5676be caffe::Net<>::Init()
@ 0x567880 caffe::Net<>::Net()
@ 0x573eab caffe::Solver<>::InitTrainNet()
@ 0x574ebc caffe::Solver<>::Init()
@ 0x575046 caffe::Solver<>::Solver()
@ 0x4229f0 caffe::GetSolver<>()
@ 0x41c1f8 train()
@ 0x414091 main
@ 0x7f9b139ecb15 __libc_start_main
@ 0x41bd6d (unknown)
./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh: line 8: 6839 Segmentation fault (core dumped) GLOG_logtostderr=1 ./build/tools/caffe train --solver=models/cifar100_NIN_float_crop_v2/train_val/solver_lr1.prototxt --snapshot=models/cifar100_NIN_float_crop_v2/train_val/cifar100_NIN_float_crop_v2_iter_100000.solverstate
I0302 18:59:25.059556 7868 caffe.cpp:105] Use GPUs with device IDs below
I0302 18:59:25.059707 7868 caffe.cpp:107] device id 0
I0302 18:59:25.059726 7868 caffe.cpp:107] device id 1
I0302 18:59:25.059741 7868 caffe.cpp:117] Starting Optimization
I0302 18:59:35.792002 7868 solver.cpp:77] Creating training net from net file: models/cifar100_NIN_float_crop_v2/train_val/train_test.prototxt
I0302 18:59:35.792084 7868 upgrade_proto.cpp:928] start ReadNetParamsFromTextFileOrDie
I0302 18:59:35.793494 7868 solver.cpp:80] create net
I0302 18:59:35.793649 7868 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0302 18:59:35.793747 7868 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0302 18:59:35.793915 7868 data_transformer.cpp:24] Loading mean file from: data/cifar100/float_mean.binaryproto
I0302 18:59:35.796743 7868 db.cpp:20] Opened leveldb examples/cifar100/cifar100-float-train-train-val-leveldb/cifar100-train-leveldb
I0302 18:59:35.796777 7868 data_manager.cpp:97] new database cursor
I0302 18:59:35.797546 7868 data_manager.cpp:99] new database transaction
*** Aborted at 1456963175 (unix time) try "date -d @1456963175" if you are using GNU date ***
PC: @ 0x7f10c8d4a644 leveldb::(anonymous namespace)::MergingIterator::key()
*** SIGSEGV (@0x18) received by PID 7868 (TID 0x7f10cea2a9c0) from PID 24; stack trace: ***
@ 0x7f10c3078670 (unknown)
@ 0x7f10c8d4a644 leveldb::(anonymous namespace)::MergingIterator::key()
@ 0x7f10c8d34c7e leveldb::(anonymous namespace)::DBIter::key()
@ 0x4fb102 caffe::db::LevelDBCursor::key()
@ 0x535fa1 caffe::DataManager<>::DataManager()
@ 0x5496b2 caffe::Net<>::InitDataManager()
@ 0x5676be caffe::Net<>::Init()
@ 0x567880 caffe::Net<>::Net()
@ 0x573eab caffe::Solver<>::InitTrainNet()
@ 0x574ebc caffe::Solver<>::Init()
@ 0x575046 caffe::Solver<>::Solver()
@ 0x4229f0 caffe::GetSolver<>()
@ 0x41c1f8 train()
@ 0x414091 main
@ 0x7f10c3064b15 __libc_start_main
@ 0x41bd6d (unknown)
./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh: line 11: 7868 Segmentation fault (core dumped) GLOG_logtostderr=1 ./build/tools/caffe train --solver=models/cifar100_NIN_float_crop_v2/train_val/solver_lr2.prototxt --snapshot=models/cifar100_NIN_float_crop_v2/train_val/cifar100_NIN_float_crop_v2_iter_115000.solverstate

You can run on a single GPU. To do this, open solver.prototxt and keep only the line 'device_id: 0'; remove the other line, 'device_id: 1'.
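
If you would rather script that edit than do it by hand, here is a minimal Python sketch. It assumes the solver file lists each GPU on its own `device_id: N` line, as described above. The helper name `keep_single_device` and the sample solver text (including the `net:` path) are made up for illustration and are not taken from the repo.

```python
import re


def keep_single_device(solver_text: str, device_id: int = 0) -> str:
    """Return the solver text with every `device_id:` line dropped
    except the one that names `device_id` (hypothetical helper)."""
    kept = []
    for line in solver_text.splitlines():
        match = re.match(r"\s*device_id\s*:\s*(\d+)", line)
        if match and int(match.group(1)) != device_id:
            continue  # skip device_id lines that name another GPU
        kept.append(line)
    return "\n".join(kept) + "\n"


# Illustrative solver fragment only; the real file has more fields.
example = """net: "models/example/train_test.prototxt"
device_id: 0
device_id: 1
base_lr: 0.1
"""

print(keep_single_device(example))
```

To use it on a real file, read the file into a string, pass it through the function, and write the result back, preferably after making a backup copy.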

Thanks for your reply. @stephenyan1984

Actually, I've tried your method, but it doesn't work, so I'd like to figure out where the problem is. Even if I use only one GPU, it still gives the following error:

[cliu@ycao-hadoop3 HD-CNN]$ ./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh
I0309 22:07:28.194317 7029 caffe.cpp:105] Use GPUs with device IDs below
I0309 22:07:28.194476 7029 caffe.cpp:107] device id 0
I0309 22:07:28.194507 7029 caffe.cpp:117] Starting Optimization
I0309 22:07:38.190814 7029 solver.cpp:77] Creating training net from net file: models/cifar100_NIN_float_crop_v2/train_val/train_test.prototxt
I0309 22:07:38.190878 7029 upgrade_proto.cpp:928] start ReadNetParamsFromTextFileOrDie
I0309 22:07:38.191999 7029 solver.cpp:80] create net
I0309 22:07:38.192157 7029 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0309 22:07:38.192248 7029 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0309 22:07:38.192414 7029 data_transformer.cpp:24] Loading mean file from: data/cifar100/float_mean.binaryproto
I0309 22:07:38.195355 7029 db.cpp:20] Opened leveldb examples/cifar100/cifar100-float-train-train-val-leveldb/cifar100-train-leveldb
I0309 22:07:38.195408 7029 data_manager.cpp:97] new database cursor
*** Aborted at 1457579258 (unix time) try "date -d @1457579258" if you are using GNU date ***
PC: @ 0x7f1234f941e0 (unknown)
*** SIGSEGV (@0x7f1234f941e0) received by PID 7029 (TID 0x7f123aa5b9c0) from PID 888750560; stack trace: ***
@ 0x7f122f0a9670 (unknown)
@ 0x7f1234f941e0 (unknown)
./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh: line 5: 7029 Segmentation fault (core dumped) GLOG_logtostderr=1 ./build/tools/caffe train --solver=models/cifar100_NIN_float_crop_v2/train_val/solver.prototxt
I0309 22:07:39.626389 8122 caffe.cpp:105] Use GPUs with device IDs below
I0309 22:07:39.626590 8122 caffe.cpp:107] device id 0
I0309 22:07:39.626626 8122 caffe.cpp:117] Starting Optimization
I0309 22:07:50.246525 8122 solver.cpp:77] Creating training net from net file: models/cifar100_NIN_float_crop_v2/train_val/train_test.prototxt
I0309 22:07:50.246598 8122 upgrade_proto.cpp:928] start ReadNetParamsFromTextFileOrDie
I0309 22:07:50.247752 8122 solver.cpp:80] create net
I0309 22:07:50.247891 8122 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0309 22:07:50.247942 8122 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0309 22:07:50.248123 8122 data_transformer.cpp:24] Loading mean file from: data/cifar100/float_mean.binaryproto
I0309 22:07:50.251242 8122 db.cpp:20] Opened leveldb examples/cifar100/cifar100-float-train-train-val-leveldb/cifar100-train-leveldb
I0309 22:07:50.251271 8122 data_manager.cpp:97] new database cursor
*** Aborted at 1457579270 (unix time) try "date -d @1457579270" if you are using GNU date ***
PC: @ 0x7fc6efd921e0 (unknown)
*** SIGSEGV (@0x7fc6efd921e0) received by PID 8122 (TID 0x7fc6f58599c0) from PID 18446744073438568928; stack trace: ***
@ 0x7fc6e9ea7670 (unknown)
@ 0x7fc6efd921e0 (unknown)
./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh: line 8: 8122 Segmentation fault (core dumped) GLOG_logtostderr=1 ./build/tools/caffe train --solver=models/cifar100_NIN_float_crop_v2/train_val/solver_lr1.prototxt --snapshot=models/cifar100_NIN_float_crop_v2/train_val/cifar100_NIN_float_crop_v2_iter_100000.solverstate
I0309 22:07:51.091190 8949 caffe.cpp:105] Use GPUs with device IDs below
I0309 22:07:51.091320 8949 caffe.cpp:107] device id 0
I0309 22:07:51.091341 8949 caffe.cpp:117] Starting Optimization
I0309 22:08:01.450184 8949 solver.cpp:77] Creating training net from net file: models/cifar100_NIN_float_crop_v2/train_val/train_test.prototxt
I0309 22:08:01.450248 8949 upgrade_proto.cpp:928] start ReadNetParamsFromTextFileOrDie
I0309 22:08:01.451361 8949 solver.cpp:80] create net
I0309 22:08:01.451508 8949 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0309 22:08:01.451594 8949 net.cpp:475] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0309 22:08:01.451751 8949 data_transformer.cpp:24] Loading mean file from: data/cifar100/float_mean.binaryproto
I0309 22:08:01.454588 8949 db.cpp:20] Opened leveldb examples/cifar100/cifar100-float-train-train-val-leveldb/cifar100-train-leveldb
I0309 22:08:01.454617 8949 data_manager.cpp:97] new database cursor
*** Aborted at 1457579281 (unix time) try "date -d @1457579281" if you are using GNU date ***
PC: @ 0x7f57302ad1e0 (unknown)
*** SIGSEGV (@0x7f57302ad1e0) received by PID 8949 (TID 0x7f5735d749c0) from PID 808112608; stack trace: ***
@ 0x7f572a3c2670 (unknown)
@ 0x7f57302ad1e0 (unknown)
./examples/cifar100/train_cifar100_NIN_float_crop_v2_train_val.sh: line 11: 8949 Segmentation fault (core dumped) GLOG_logtostderr=1 ./build/tools/caffe train --solver=models/cifar100_NIN_float_crop_v2/train_val/solver_lr2.prototxt --snapshot=models/cifar100_NIN_float_crop_v2/train_val/cifar100_NIN_float_crop_v2_iter_115000.solverstate

Any new ideas or suggestions? Thanks.

From the error message, you should check whether the leveldb database exists. Its path is 'examples/cifar100/cifar100-float-train-train-val-leveldb/cifar100-train-leveldb'.
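
A quick way to do that check is to look inside the directory and not only test that it exists. The sketch below uses only the Python standard library. It relies on the usual LevelDB on-disk layout: a `CURRENT` file, one or more `MANIFEST-*` files, table files ending in `.ldb` or `.sst`, and write-ahead `.log` files. The helper name `inspect_leveldb_dir` is made up for illustration. A clean report does not prove the records are readable. It only flags a missing, empty, or incomplete directory.

```python
from pathlib import Path


def inspect_leveldb_dir(path: str) -> dict:
    """Summarize the files found in a LevelDB directory (hypothetical helper)."""
    db = Path(path)
    report = {
        "exists": db.is_dir(),
        "has_current": False,
        "manifests": [],
        "data_files": [],
        "data_bytes": 0,
    }
    if not report["exists"]:
        return report
    report["has_current"] = (db / "CURRENT").is_file()
    report["manifests"] = sorted(p.name for p in db.glob("MANIFEST-*"))
    # Table files (.ldb/.sst) and write-ahead logs (.log) hold the records.
    data = [p for p in db.iterdir()
            if p.is_file() and p.suffix in {".ldb", ".sst", ".log"}]
    report["data_files"] = sorted(p.name for p in data)
    report["data_bytes"] = sum(p.stat().st_size for p in data)
    return report


# Path taken from the log output in this thread.
print(inspect_leveldb_dir(
    "examples/cifar100/cifar100-float-train-train-val-leveldb/cifar100-train-leveldb"))
```

If `exists` is false, or `data_bytes` is 0, the database is probably missing or empty. That would be consistent with a crash when the cursor's key is read, though this is a guess from the stack trace and not a confirmed diagnosis.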

Thanks for your advice. @stephenyan1984

Actually, I have checked that, and the folder exists. I also tried removing the repo and redoing all the steps, but it still doesn't work. I'm not sure whether the leveldb files you provide are broken. I used examples/cifar100/get_cifar100_float_train-train-val-leveldb.sh to fetch them.

I will try to generate my own leveldb files and see if that works. Thanks.