luyongxi/az-net

Segmentation fault


Hi, when I ran exp_voc_shared.sh in /experiments/scripts using "./exp_voc_shared.sh 0 voc.yml", I encountered a segmentation fault as shown below, but I don't know how to fix it.

./exp_voc_shared.sh: line 49: 14830 Segmentation fault (core dumped) ./tools/train_az_net.py --gpu $gpu_id --solver models/Pascal/VGG16/az-net/solver_"$prefix".prototxt --weights data/imagenet_models/VGG16.v2.caffemodel --imdb $trainset --cfg experiments/cfgs/$cfg_file --iters 160000 --exp $prefix

The following is my configuration information:

[screenshot of configuration]

It would be very kind of you to tell me how to fix this problem. Thanks.

Hi @lakeblue

This could have various causes, and the information you provided may not be enough to find out what is really happening. I would suggest that you find the specific line in the Python code that is causing this error.

In my experience, such errors are typically caused by a problem in loading the dataset. A segmentation fault suggests a problem on the Caffe side (pure Python code does not generate such an error, unless there is a bug in the interpreter). In particular, it is likely caused by feeding the Caffe solver with empty arrays. If that is the case, printing the output of the get_minibatch() function inside lib/az_data_layer/minibatch.py could help you find the source of the problem. Then you can trace the error back from there.

Once you have corrected the issue, always remember to clean the cache inside data/cache before trying the training procedure again, or the program may load the erroneous cached files rather than loading the data correctly.
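
If it helps, something along these lines should clear it (a minimal sketch only; the data/cache location and the .pkl extension are assumptions based on the usual layout of this kind of codebase, so adjust them to your checkout):

```python
import glob
import os

# Assumed cache location relative to the repository root; adjust to your checkout.
cache_dir = 'data/cache'

# Remove cached roidb pickles so stale image paths are not reused on the next run.
for path in glob.glob(os.path.join(cache_dir, '*.pkl')):
    print('Removing cached file: {}'.format(path))
    os.remove(path)
```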

Hi @lakeblue

From what you described, it seems unlikely to be an issue with Caffe itself. If the error occurs in self.solver.step(1), it is very likely caused by the input to the SGD update (which is the output of get_minibatch()).

"blobs" in get_minibatch() is a dictionary. I would suggest you to print out its contents. For example, you can try "print im_blob" or "print rois_blob". You may also want to see if their shapes are correct. Fro example, use "print im_blob.shape".

Hi @lakeblue

It seems the error is caused by not properly reading the image. You may want to check which image is causing the problem: it might be as simple as putting the images in the wrong folder. Also, remember to clean the contents of data/cache, as the cached image paths are absolute paths, so if for whatever reason you move the images to another folder it will cause an error in loading the image (and thus you see a null in im_blob).

Hi @lakeblue

My suggestion would be to check the following line:

im = cv2.imread(roidb[i]['image'])

From what you described, there is one image that is not being loaded properly. You can check which image is causing the problem, and then manually verify that the image actually exists at that path.
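
For instance, a temporary check like this around that line would reveal the offending path (just a sketch, reusing the variable names from the line above):

```python
import cv2

# cv2.imread() returns None (rather than raising) when the file is missing or unreadable,
# which later surfaces as an empty image blob and a crash inside the solver step.
im = cv2.imread(roidb[i]['image'])
if im is None:
    print('Failed to load image: {}'.format(roidb[i]['image']))
```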

Hi @lakeblue

The input to get_minibatch() should have length 1, as it is a minibatch. For the error message you are seeing, it is highly likely that the path in roidb[i]['image'] is wrong. Have you confirmed that the path is the correct absolute path to the image? You need full paths, rather than just a file name, to load images correctly in this part of the script.

Hi @lakeblue

Perhaps you can use os.path.isfile() to check if the path is actually correct. There is a chance that what is printed out appears to be correct, but the path is somehow wrong. Otherwise, you might want to check your OpenCV installation and see if there is something wrong there.
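
As a rough sketch, a check like this over the loaded roidb (assuming its entries carry an 'image' key, as in the snippets above) could be run before training starts:

```python
import os

# Sanity-check every image path in the roidb: it should be an absolute path
# to a file that actually exists on disk.
for i, entry in enumerate(roidb):
    path = entry['image']
    if not os.path.isabs(path):
        print('Entry {}: not an absolute path: {}'.format(i, path))
    if not os.path.isfile(path):
        print('Entry {}: file does not exist: {}'.format(i, path))
```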

Hi @lakeblue

Good to know. I am closing this thread as the problem has been resolved.