luyongxi/az-net

Segmentation fault


Hi, when I ran exp_voc_shared.sh in /experiments/scripts using "./exp_voc_shared.sh 0 voc.yml", I encountered a segmentation fault as shown below, but I don't know how to fix it.

./exp_voc_shared.sh: line 49: 14830 Segmentation fault (core dumped) ./tools/train_az_net.py --gpu $gpu_id --solver models/Pascal/VGG16/az-net/solver_"$prefix".prototxt --weights data/imagenet_models/VGG16.v2.caffemodel --imdb $trainset --cfg experiments/cfgs/$cfg_file --iters 160000 --exp $prefix

The following is my configuration information:

[screenshot of configuration]

It would be very kind of you to tell me how to fix this problem. Thanks.

Hi @lakeblue

This could have various causes, and the information you provided may not be enough to find out what is really happening. I would suggest that you find the specific line in the Python code that is causing this error.

In my experience, such errors are typically caused by a problem in loading the dataset. A segmentation fault suggests a problem on the Caffe side (pure Python code does not generate such an error, unless there is a bug in the interpreter). In particular, it is likely caused by feeding the Caffe solver with empty arrays. If that is the case, printing the output of the get_minibatch() function inside lib/az_data_layer/minibatch.py could help you find the source of the problem. Then you can trace the error back from there.

Once you have corrected the issue, always remember to clean the cache inside data/cache before trying the training procedure again, or the program may load the erroneous cached files rather than loading the data correctly.
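
If it helps, something along these lines should clear it (a minimal sketch only; the data/cache location and the .pkl extension are assumptions based on the usual layout of this kind of codebase, so adjust them to your checkout):

```python
import glob
import os

# Assumed cache location relative to the repository root; adjust to your checkout.
cache_dir = 'data/cache'

# Remove cached roidb pickles so stale image paths are not reused on the next run.
for path in glob.glob(os.path.join(cache_dir, '*.pkl')):
    print('Removing cached file: {}'.format(path))
    os.remove(path)
```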

Hi @lakeblue

From what you described, it seems unlikely to be an issue with Caffe itself. If the error occurs in self.solver.step(1), it is very likely caused by the input to the SGD update (which is the output of get_minibatch()).

"blobs" in get_minibatch() is a dictionary. I would suggest you to print out its contents. For example, you can try "print im_blob" or "print rois_blob". You may also want to see if their shapes are correct. Fro example, use "print im_blob.shape".

Hi @lakeblue

It seems the error is caused by not properly reading the image. You may want to check which image is causing the problem: it might be as simple as putting the images in the wrong folder. Also, remember to clean the contents of data/cache, as the cached image paths are absolute paths, so if for whatever reason you move the images to another folder it will cause an error in loading the image (and thus you see a null in im_blob).

Hi @lakeblue

My suggestion would be to check the following line:

im = cv2.imread(roidb[i]['image'])

From what you described, there is one image that is not being loaded properly. You can check which image is causing the problem, and then manually verify that the image actually exists at that path.
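
For instance, a temporary check like this around that line would reveal the offending path (just a sketch, reusing the variable names from the line above):

```python
import cv2

# cv2.imread() returns None (rather than raising) when the file is missing or unreadable,
# which later surfaces as an empty image blob and a crash inside the solver step.
im = cv2.imread(roidb[i]['image'])
if im is None:
    print('Failed to load image: {}'.format(roidb[i]['image']))
```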

Hi @lakeblue

The input to get_minibatch() should have length 1, as it is a minibatch. For the error message you are seeing, it is highly likely that the path in roidb[i]['image'] is wrong. Have you confirmed that the path is the correct absolute path to the image? You need full paths, rather than just a file name, to load images correctly in this part of the script.

Hi @lakeblue

Perhaps you can use os.path.isfile() to check if the path is actually correct. There is a chance that what is printed out appears to be correct, but the path is somehow wrong. Otherwise, you might want to check your OpenCV installation and see if there is something wrong there.
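
As a rough sketch, a check like this over the loaded roidb (assuming its entries carry an 'image' key, as in the snippets above) could be run before training starts:

```python
import os

# Sanity-check every image path in the roidb: it should be an absolute path
# to a file that actually exists on disk.
for i, entry in enumerate(roidb):
    path = entry['image']
    if not os.path.isabs(path):
        print('Entry {}: not an absolute path: {}'.format(i, path))
    if not os.path.isfile(path):
        print('Entry {}: file does not exist: {}'.format(i, path))
```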

Hi @lakeblue

Good to know. I am closing this thread as the problem has been resolved.