Unable to install caffe-future.
Closed this issue · 58 comments
Hi,
I found out about Caffe-future from the paper Fully Convolutional Neural Networks (I found the link in the model zoo). I am trying to work on a regression problem where the input to the CNN is a 256×256 image and the output that the CNN is supposed to produce is also a 256×256 image, so a version of caffe that supports fully convolutional networks would be extremely useful for me. In the original version of caffe I was getting an error when I tried setting the stride of a convolutional layer to a float value (for upsampling). I believe the caffe-future version supports float values for stride.
However, while trying to install caffe-future I am facing some issues. I am not sure if I am missing anything. Following is what I tried for installation:
First I cloned the git repository. After that I followed the instructions mentioned in future.sh
Mentioned below are the commands I ran and the outputs I got. The main issue I faced was with the command hub merge BVLC#1977, which gave the error: fatal: Couldn't find remote ref refs/heads/accum-grad
>>> git clone https://github.com/longjon/caffe.git
>>> cd caffe
>>> git checkout master
Already on 'master'
Your branch is up-to-date with 'origin/master'.
>>> git branch -D future
error: branch 'future' not found.
>>> git checkout -b future
Switched to a new branch 'future'
>>> hub merge https://github.com/BVLC/caffe/pull/1976
include/caffe/util/benchmark.hpp | 27 +-
include/caffe/util/coords.hpp | 61 +
include/caffe/util/cudnn.hpp | 128 +
include/caffe/util/db.hpp | 190 ++
include/caffe/util/device_alternate.hpp | 102 +
include/caffe/util/im2col.hpp | 22 +-
...
...
...
matlab/caffe/matcaffe_init.m | 11 +-
.../bvlc_alexnet/deploy.prototxt | 248 +-
models/bvlc_alexnet/readme.md | 25 +
.../bvlc_alexnet/solver.prototxt | 6 +-
.../bvlc_alexnet/train_val.prototxt | 296 ++-
models/bvlc_googlenet/deploy.prototxt | 2156 +++++++++++++++++
...
...
...
tools/test_net.cpp | 54 +-
tools/train_net.cpp | 34 +-
tools/upgrade_net_proto_binary.cpp | 17 +-
tools/upgrade_net_proto_text.cpp | 29 +-
430 files changed, 46179 insertions(+), 11932 deletions(-)
create mode 100644 .Doxyfile
create mode 100644 .travis.yml
create mode 100644 CMakeLists.txt
create mode 100644 cmake/ConfigGen.cmake
create mode 100644 cmake/Cuda.cmake
create mode 100644 cmake/Dependencies.cmake
...
...
...
create mode 100644 src/caffe/util/db.cpp
create mode 100644 src/gtest/CMakeLists.txt
create mode 100644 tools/CMakeLists.txt
create mode 100644 tools/caffe.cpp
delete mode 100644 tools/dump_network.cpp
create mode 100755 tools/extra/parse_log.py
>>> hub merge https://github.com/BVLC/caffe/pull/1977
fatal: Couldn't find remote ref refs/heads/accum-grad
>>> hub merge https://github.com/BVLC/caffe/pull/2086
From git://github.com/longjon/caffe
[new branch] python-net-spec -> longjon/python-net-spec
Auto-merging src/caffe/net.cpp
Removing src/caffe/layers/flatten_layer.cu
Auto-merging matlab/hdf5creation/demo.m
Removing matlab/caffe/read_cell.m
Removing matlab/caffe/print_cell.m
Removing matlab/caffe/prepare_batch.m
Removing matlab/caffe/matcaffe_init.m
Removing matlab/caffe/matcaffe_demo_vgg_mean_pix.m
Removing matlab/caffe/matcaffe_demo_vgg.m
Removing matlab/caffe/matcaffe_demo.m
Removing matlab/caffe/matcaffe_batch.m
Removing matlab/caffe/matcaffe.cpp
Removing matlab/caffe/ilsvrc_2012_mean.mat
Auto-merging include/caffe/vision_layers.hpp
CONFLICT (content): Merge conflict in include/caffe/vision_layers.hpp
Auto-merging include/caffe/neuron_layers.hpp
Auto-merging include/caffe/layer.hpp
Auto-merging include/caffe/common_layers.hpp
Auto-merging examples/net_surgery/bvlc_caffenet_full_conv.prototxt
Automatic merge failed; fix conflicts and then commit the result.
I am unable to compile caffe. Can someone please help me with this issue?
@aalok1969, the compilation error you're getting is from a conflict in the vision_layers header. Specifically, the definition of the CropLayer class got tangled up with the definition of the SPPLayer class, which was merged before @shelhamer submitted his crop-layer PR #1976.
I recently ran into the same problem as you and tried to resolve the conflict with PR #2 to @shelhamer's crop-layer branch. He hasn't responded yet. The PR is basically a merge of a long list of changes from BVLC:master to bring shelhamer:crop-layer up to date, plus 2 commits resolving the SPPLayer/CropLayer class definition conflict (8ebd41b and fa0cbb2). No logical or functional changes.
If you plan on checking out my PR, can you comment on whether you were able to reproduce the FCN experiments? Thanks.
Hi @kashefy, thanks a lot for your reply. I tried your version of caffe by running
git clone https://github.com/kashefy/caffe
I then compiled caffe and everything went smoothly.
But when I tried to train the network, I got the following errors. The error arises because in one of my convolutional layers the stride is a float value of 0.5; hence it gives the error Expected integer. I want to be able to set the stride to a float value in order to upscale the output. How can I do that?
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 69:13: Expected integer.
F0804 18:13:11.194710 31449 upgrade_proto.cpp:928] Check failed:
ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /home/aalok/caffe-kashefy/caffe/Over-exposure/MyNet_net.prototxt
*** Check failure stack trace: ***
@ 0x7fd198941daa (unknown)
@ 0x7fd198941ce4 (unknown)
@ 0x7fd1989416e6 (unknown)
@ 0x7fd198944687 (unknown)
@ 0x7fd198d8d1ae caffe::ReadNetParamsFromTextFileOrDie()
@ 0x7fd198d7c822 caffe::Solver<>::InitTrainNet()
@ 0x7fd198d7d713 caffe::Solver<>::Init()
@ 0x7fd198d7d8e6 caffe::Solver<>::Solver()
@ 0x40d790 caffe::GetSolver<>()
@ 0x407311 train()
@ 0x405891 main
@ 0x7fd197e53ec5 (unknown)
@ 0x405e3d (unknown)
@ (nil) (unknown)
Aborted (core dumped)
I am a bit new to caffe and GitHub, hence I didn't understand the earlier part of your reply. Can you elaborate a bit on what steps I should take to install caffe-future?
@aalok1993, thanks for taking the time to checkout my changes.
Re stride of 0.5: I don't think this is possible, given that the stride member of the ConvolutionLayer class is defined as an int. See vision_layers.hpp#68.
I'm still new to caffe myself and still trying to figure out how things are done. I was able to resolve some of the issues but still haven't figured out an end-to-end process for making things work. I have yet to train one of these FCNs successfully myself...
On how to upscale the output, I don't think you need to worry about floating-point stride values. The FCN models do something similar through the Deconvolution layer. It involves bilinear interpolation, but I'm a bit lost on the details. It might be worth looking up related posts in the caffe-users group. The implementation already exists in caffe; it's just a matter of figuring out usage.
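For what it's worth, the bilinear weight filler used to initialize such a Deconvolution layer computes an interpolation kernel from the kernel size alone. A minimal pure-Python sketch of that standard formula (the function name is mine, not caffe's):

```python
def bilinear_kernel(size):
    """2-D bilinear interpolation kernel of shape size x size.

    This is the usual formula for initializing upsampling
    (deconvolution) weights in FCN-style models.
    """
    factor = (size + 1) // 2
    # Kernel center: on a pixel for odd sizes, between pixels for even.
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    return [[(1 - abs(i - center) / factor) * (1 - abs(j - center) / factor)
             for j in range(size)]
            for i in range(size)]

# A 4x4 kernel with stride 2 upsamples by a factor of 2.
kernel = bilinear_kernel(4)
```

Each output channel of the deconvolution would get a copy of this kernel; with stride f and kernel size 2f, the layer performs factor-f bilinear upsampling.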
Re building caffe-future: My understanding is that the instructions in future.sh are sufficient. The merge conflict that was causing your build error arose because you were merging the PRs into BVLC:master and not longjon:master, which are not in sync at the moment. Did I get that right?
I'll try to respond with something more useful when I've figured out more.
I'm facing similar issues. Will update you guys if I find a solution myself. My next avenue is to check out other implementations of FCN using caffe. This is what I came up with:
I was able to train the FCN-32s model through fine-tuning successfully. My problem was that the weights of some layers of my fully convolutional VGG-16 variant were not being copied correctly. Please find more details under this topic in the caffe-users group. All-zero weights in these layers will only propagate zeros during training.
Don't you think the deconv. layer will upscale your features? If a stride of 1/2 is critical for your algorithm, maybe you can use the deconv. layer to upscale by a factor of 2 using nearest-neighbor interpolation (not sure about the details for this); a consecutive conv layer with stride 1 would then be equivalent to a 1/2 stride with only a single conv layer. Would this work for you?
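As a sanity check on the sizes, the standard output-size formulas show that a stride-2 deconvolution restores the resolution a stride-2 pooling halved; a stride-1 conv on top then keeps that size, which is the effect a hypothetical 1/2 stride would have. A sketch (the kernel/pad choices are illustrative):

```python
def conv_out(n, kernel, stride=1, pad=0):
    # Output size of a convolution / pooling layer.
    return (n + 2 * pad - kernel) // stride + 1

def deconv_out(n, kernel, stride=1, pad=0):
    # Output size of a deconvolution (transposed convolution) layer.
    return stride * (n - 1) + kernel - 2 * pad

pooled = conv_out(256, kernel=2, stride=2)                # 256 -> 128
restored = deconv_out(pooled, kernel=4, stride=2, pad=1)  # 128 -> 256
```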
@neurohn, yes please keep us updated. If it's more about concepts and less about the implementation it may be better to continue that discussion in the caffe-users group.
Hi @kashefy, thanks a lot for your reply. I was able to upscale using the deconvolution layer with a bilinear weight filler, but I am still facing lots of issues.
Initially I was getting a lot of NaNs and Infs in my weight parameters. I tried modifying the learning rates and this problem went away. (I wanted to ask: which parameters should I try modifying to solve this issue?)
After that, the issue I am facing is that when I take an image and pass it forward, most of the blob values come out as zeros and the final image I get is filled with zeros. Also, the weight parameters learned by the network become very large. Below I have described the various outputs in detail.
I am working on a regression problem where my input is a 256×256×3 image and the output is also a 256×256×3 image. In order to figure out the issue, I took a very small architecture (a toy example) which consists of a single convolutional layer, a ReLU layer, and a pooling layer followed by a deconvolution layer. Also, to make it simple, initially I am taking the output label equal to the input data, so currently my network works like an autoencoder. All it has to do is learn an approximation of the identity function, but it fails to do even that. Following are the prototxt files: deploy.prototxt, train_val.prototxt and solver.prototxt.
I trained the network for 1000 iterations and used the snapshot as my model. Following is the code and output, which describes what I obtain after 1000 iterations. (NOTE: I have done the training in GPU mode as well as CPU mode, but I get the same result in each case.)
Initializing caffe and Loading the network
caffe.set_mode_cpu()
net = caffe.Net('MyNet_deploy.prototxt', 'snapshots/MyNet_iter_1000.caffemodel', caffe.TEST)
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2,0,1))
net.blobs['data'].reshape(1,3,256,256)
net.blobs['data'].data[...] = transformer.preprocess('data', caffe.io.load_image('Train/data/0001.jpg'))
out = net.forward()
The blobs
[(k, v.data.shape) for k, v in net.blobs.items()]
[('data', (1, 3, 256, 256)),
('conv1_1', (1, 64, 256, 256)),
('pool1', (1, 64, 128, 128)),
('upsample2', (1, 3, 256, 256))]
The parameters
[(k, v[0].data.shape) for k, v in net.params.items()]
[('conv1_1', (64, 3, 3, 3)),
('upsample2', (64, 3, 4, 4))]
The conv layer weights
print net.params['conv1_1'][0].data
[[[[ -8.90221119 -19.70544052 -21.9944973 ]
[ -28.27580643 -44.15635681 -51.63126373]
[ -39.88535309 -59.30950165 -62.64734268]][[ -9.73268604 -21.16998863 -23.3067379 ]
[ -28.68981361 -45.53733826 -52.59268951]
[ -40.60289001 -60.55314255 -63.270298 ]][[ -7.46913862 -18.73158836 -20.8146286 ]
[ -26.17634583 -42.74364471 -49.60507965]
[ -37.86455536 -57.53972244 -60.60445023]]]...,
[[[-1756.36547852 -1774.34521484 -1799.48950195]
[-1785.36962891 -1828.19641113 -1854.27050781]
[-1797.99133301 -1837.64611816 -1851.94775391]][[-1765.79675293 -1784.0411377 -1808.77331543]
[-1794.91149902 -1837.94580078 -1863.44091797]
[-1807.38049316 -1847.3157959 -1861.21154785]][[-1588.7590332 -1605.53112793 -1629.23632812]
[-1617.13195801 -1658.57910156 -1683.2623291 ]
[-1629.62780762 -1667.82958984 -1681.2208252 ]]]
The deconv layer weights
print net.params['upsample2'][0].data
[[[[ -0.2453279 -0.42636055 -0.52841532 -0.63897181]
[ -0.75671118 -0.7169919 -0.82515067 -1.17307651]
[ -0.96557409 -0.9307059 -1.03437531 -1.36865413]
[ -1.08291376 -1.28496742 -1.37371349 -1.44269586]][[ -0.2445658 -0.42509246 -0.52721226 -0.63792503]
[ -0.75604206 -0.71580309 -0.82402509 -1.17214358]
[ -0.96510863 -0.9297061 -1.03346813 -1.36791492]
[ -1.08263409 -1.28413677 -1.37294734 -1.44208062]][[ -0.24634758 -0.42725337 -0.52933705 -0.64002675]
[ -0.75833076 -0.71846634 -0.82661229 -1.17464745]
[ -0.9673087 -0.93222517 -1.03590178 -1.37026465]
[ -1.08475745 -1.28656888 -1.37528908 -1.44430864]]]...,
[[[-83.92314148 -85.89565277 -86.49584961 -86.15866852]
[-86.43471527 -88.2796402 -88.8900528 -88.71788788]
[-87.11362457 -88.95469666 -89.52527618 -89.28121185]
[-86.79399109 -88.77192688 -89.2594986 -88.69790649]][[-83.90159607 -85.87146759 -86.47241974 -86.13829803]
[-86.41383362 -88.25655365 -88.86827087 -88.69919586]
[-87.09313202 -88.93185425 -89.50371552 -89.26304626]
[-86.77391815 -88.74938965 -89.23816681 -88.67989349]][[-84.18785858 -86.16223145 -86.7639389 -86.43247986]
[-86.70091248 -88.54869843 -89.16176605 -88.99497223]
[-87.38189697 -89.2256546 -89.79877472 -89.56040192]
[-87.06370544 -89.04449463 -89.53528595 -88.9778595 ]]]
The data blob
print net.blobs['data'].data
[[[[ 0.02745098 0.00784314 0.02745098 ..., 0.19607843 0.14509805
0.11764706]
[ 0. 0.13725491 0.66666669 ..., 0.93725491 0.89803922
0.90980393]
[ 0.3019608 0.87058824 0.99607843 ..., 0.9137255 0.89411765
0.90980393]
...,
[ 0.03921569 0.01960784 0.01176471 ..., 0.08235294 0.07843138
0.07843138]
[ 0. 0. 0. ..., 0.08235294 0.07843138
0.07450981]
[ 0.00392157 0. 0.00392157 ..., 0.07450981 0.07058824
0.07058824]][[ 0.03137255 0.01176471 0.03137255 ..., 0.21568628 0.16470589
0.13725491]
[ 0.00392157 0.14509805 0.67450982 ..., 0.95686275 0.91764706
0.92941177]
[ 0.30980393 0.87843138 1. ..., 0.93725491 0.91764706
0.93333334]
...,
[ 0.06666667 0.04705882 0.03921569 ..., 0.1254902 0.12156863
0.12156863]
[ 0.02352941 0.01960784 0.01568628 ..., 0.1254902 0.12156863
0.11764706]
[ 0.03137255 0.01568628 0.02352941 ..., 0.11764706 0.11372549
0.11372549]][[ 0.01176471 0. 0.01176471 ..., 0.19215687 0.14117648
0.11372549]
[ 0. 0.1254902 0.65490198 ..., 0.93333334 0.89411765
0.90588236]
[ 0.29803923 0.86666667 0.99215686 ..., 0.92156863 0.90196079
0.91764706]
...,
[ 0.03921569 0.01960784 0.01176471 ..., 0.10196079 0.09803922
0.09803922]
[ 0. 0. 0. ..., 0.10196079 0.09803922
0.09411765]
[ 0.00392157 0. 0. ..., 0.09411765 0.09019608
0.09019608]]]]
The conv1_1 blob
print net.blobs['conv1_1'].data
[[[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]][[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]][[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]]]
The pool1 blob
print net.blobs['pool1'].data
[[[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]][[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]][[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]]]
The upsample2 blob
print net.blobs['upsample2'].data
[[[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]][[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]][[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]]]
Some queries
As can be seen above, the outputs of conv1_1, pool1 and upsample2 are all filled with zeros. It seems like the net is learning to output a blank image irrespective of the input. Also, the weights learned by the FCN contain many large values. I am unable to understand what is causing these issues. Should I change some parameters to solve this problem? How should I solve the problem of large weights? Should I include a large weight decay?
I have run the training in both CPU and GPU mode, but in both cases I get the same result, so the problem does not seem to be GPU-related.
Also, I wanted to ask: which is a better weight filler for convolutional layers, xavier or gaussian? And for gaussian, how should one choose the value of std?
In the next reply I have included the various prototxt files. Could you please help me understand how to solve these issues? Thanks a lot.
Dear @kashefy
I cloned your caffe version just as @aalok1969 did (git clone https://github.com/kashefy/caffe) and then compiled it successfully. But while fine-tuning the AlexNet 32-stride model, I got an error saying that the crop layer is not defined. Please note that, after cloning your caffe, I didn't run future.sh as provided in longjon's caffe, as doing so creates conflicts.
Could you please advise how to make this PR work?
Dear @kashefy
By setting the weight decay parameter to a large value, I was able to make the weights smaller, but I still get zeros as the output of all the layers. I am not able to understand what is really causing this problem.
Dear @aalok1969 ,
could you please explain how you got @kashefy's caffe working using git clone https://github.com/kashefy/caffe? I cloned it and compiled, but then the Deconvolution layer was not found by caffe while running the fine-tuning. I didn't run the future.sh script after cloning @kashefy's caffe, as doing so caused PR conflicts.
Thanks.
Dear @atique81 ,
I had just performed : git clone https://github.com/kashefy/caffe
and then compiled caffe following the instructions in the following tutorial
This worked perfectly fine for me. (NOTE : I didn't run future.sh)
A sample deploy.prototxt for defining deconvolution layer can be seen here : deploy.prototxt and train_val.prototxt
Dear @aalok1969
thank you so much for your reply. I mistakenly mentioned the Deconvolution layer in my last reply, whereas the error generated while running @kashefy's caffe (after doing git clone https://github.com/kashefy/caffe without running future.sh) is a missing crop layer, which is actually PR #1976 and the first merge listed in future.sh.
Did you also use the crop layer as mentioned here https://gist.github.com/shelhamer/80667189b218ad570e82#file-train_val-prototxt-L559 ? If so, then I wonder how you could run the fcn fine-tuning from @kashefy 's caffe?
I won't be able to answer that, as I am working on a regression problem and not on segmentation. I didn't require the crop layer for my task.
@kashefy mentioned earlier that he was able to run the code for segmentation so he would be able to answer that.
Thanks a lot @aalok1969 . Waiting for @kashefy to reply...
Dear @kashefy ,
could you please explain how I can make all the merges work without any conflict to run the FCN-Semantic segmentation as given here (https://github.com/longjon/caffe/tree/future)? I have gone through your detailed post regarding this here (https://groups.google.com/forum/#!msg/caffe-users/3eIMYV0OlY8/zXrCDI3OBAAJ). But, I didn't understand the step 1. My problem is all that @aalok1969 was facing while running future.sh.
I would highly appreciate if you kindly reply.
@atique81, did you do git checkout with_crop after cloning my fork, or did you merge my PR? Without doing either of these you won't have the CropLayer class defined.
Hi @kashefy, I was trying to reproduce fcn-8s-pascal-deploy.txt. I followed this thread and checked out the with_crop branch of your fork, but caffe still does not recognize the CROP layer.
Part of my error message says:
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 96:21: Unknown enumeration value of "CROP" for field "type".
F0812 19:45:36.754561 6873 upgrade_proto.cpp:928] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: fcn-8s-pascal.prototxt
I'm wondering if you still use the CROP layer, or if you got away with other options. Thanks!
Edit: I made it work and I will come back later with more details.
Dear @kashefy ,
I highly appreciate your feedback. I have just done the following -
git clone https://github.com/kashefy/caffe
git checkout with_crop
But while running the fcn-32s-alexnet prototxt, it generated an error saying something like "reshape not set", which I managed to overcome by following this guideline (BVLC#2834).
Now it's running fine, but the loss seems to be jumping around a lot, though only 2000 iterations have passed (I am training on 1112 images from PASCAL VOC2011).
I will let you know the update once more iterations are finished.
Please let me know if I am still missing anything from your caffe version.
Thanks again for your wonderful support.
@atique81, glad to hear you're making progress. I didn't run into the reshape error. So far, I've only trained fcn-32s on the PASCAL-Context dataset by fine-tuning VGG-16 after making it fully convolutional.
Re loss: This confused me for a while; eventually I was able to see the loss drop after 200 iterations, so pretty early on in the training (from 600K to >100K). It dropped less drastically after that (taking several 10k iterations to drop by 10k). It might be worth running the eval.py script that comes with the pretrained models, but plugging in the weights from the snapshots, to gauge how well the network is doing through visual inspection.
Dear @kashefy ,
It's now at 6000 iterations, and the loss is jumping between 0.15 and 0.8. I hope it will become more stable once more iterations have passed.
Thanks for all your cordial help.
Hi @kashefy and @atique81, are you training with a single image every iteration or in mini-batches? If you are training in mini-batches, since the aspect ratios differ across images, did you write some code for data preparation? Right now I pad all images to 500×500 to make sure they are all the same size before processing them in batches, but I am wondering if there are any built-in functions in Caffe for this. I'm kind of new to Caffe and I'm still learning the basics. Thanks!
Hi @Eric-Phu ,
as per the guidelines provided in FCN semantic segmentation, I am training in mini-batches of size 1. That's why there's no need to resize the inputs. I am also very new to Caffe, but I guess if you have a look at the ImageNet tutorial (http://caffe.berkeleyvision.org/gathered/examples/imagenet.html), you will learn how to feed caffe resized inputs, or how input data layers can resize inputs automatically.
Thanks
Hi @atique81, thanks for your reply! Yeah, right now I'm sticking with the explicit resizing strategy from the ImageNet tutorial, though it's weird that I cannot use a batch size like 20 as in the original FCN paper. I posted the memory issue in the Google group; check it out if you'd like. BTW, what kind of speed do you get when training FCN-32s?
Dear @Eric-Phu ,
I am running FCN-32s on an Nvidia GeForce GTX 980 GPU with 4 GB memory. It's taking approximately 1.04 sec for one complete forward+backward pass.
Just curious how you resized your ground truth images, as unlike the training images, a simple interpolation method won't work for ground truth labels (it would create new class numbers).
Could you please elaborate on this?
Hi @atique81 , thanks for your reply!
Your speed is pretty nice. Mine is 5 sec per image on a Tesla K40c, which makes me wonder if I did something wrong. Did you set the group for deconvolution? I saw people posting about it. But whenever I set the group to 60 by adding group: 60, which is the same number as num_output, Caffe crashed. The error message is:
F0813 21:56:23.929918 11210 blob.cpp:455] Check failed: ShapeEquals(proto) shape mismatch (reshape not set)
You mentioned this above; I wonder if it's caused by adding group.
Actually, you do not need to resize images, since the images in PASCAL VOC have their longer side equal to 500. So all you have to do is pad the other side with mean RGB values, which at least is what I did. For the ground truth labels, you may want to pad with zero, which represents the background.
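The padding scheme described above (mean RGB for the image, class 0 for the label) can be sketched in plain Python on nested lists; real code would use numpy, and the mean value here is illustrative:

```python
def pad_to(image, target_h, target_w, fill):
    """Pad a H x W single-channel image (list of rows) on the bottom/right."""
    h, w = len(image), len(image[0])
    padded = [row + [fill] * (target_w - w) for row in image]
    padded += [[fill] * target_w for _ in range(target_h - h)]
    return padded

img = [[1, 2], [3, 4]]              # tiny 2x2 stand-in for one channel
data = pad_to(img, 4, 4, fill=104)  # pad image with a (hypothetical) channel mean
label = pad_to(img, 4, 4, fill=0)   # pad ground truth with background class 0
```

The same function is applied per channel with that channel's mean, and once with 0 for the label map so no new class IDs are introduced.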
Yes, I did. As far as I know (from this source: http://caffe.berkeleyvision.org/doxygen/classcaffe_1_1BilinearFiller.html), the deconvolution layer weights dimension should be C×1×K×K, where C is the num_output as well as the group value. Just check with that link.
I am not sure whether padding parts of images with mean RGB values and the corresponding parts of labels with background (class 0) will work or not. I needed to resize the PASCAL images and labels once, and someone advised me to resize the labels based on a voting principle instead of interpolation, which I was not sure of. That's why I went for single-image-batch training.
Thanks for pointing out the link. I checked it out. It's just weird that even though I used the exact same protobuf snippet and replaced factor with 64, it still does not work. I know this is too much to ask, but would you care to share your network's train_val.prototxt via a gist or something? Thanks!
About padding vs. interpolation, I'm actually not sure which is the right way to go, since the original paper did not talk about it either.
Hi @atique81, I found out why I could not make the group thing work: I did not re-run the net surgery code after making changes to train_val.prototxt. By using group, I can save a little bit of memory and it is much faster than before, although I still cannot use a batch size of 20.
Another gotcha from Caffe. Anyway, problem solved. Thanks!
Hi @Eric-Phu , nice to know that you solved it.
Hi @kashefy ,
now that I have been able to train the FCN 32-stride model on PASCAL VOC2011, I am trying to test the net on the PASCAL VOC2011 validation data. But unfortunately, caffe exits with insufficient memory. Please note that I trained the model with batch size 1, and during testing the batch size is also fixed to 1.
Could you please advise what is going wrong?
Thanks
Hi @kashefy
I am facing a problem while training.
All the outputs are coming out as zeros, and the testing error doesn't reduce at all. Do you know what might be causing this problem? I posted the detailed problem and the outputs of all the layers earlier in this thread. Thanks.
The longjon/caffe:future branch has been rebased on BVLC/caffe:master, so the merge conflicts that have been brought up should be settled.
@shelhamer Thanks for updating this! Question: does future.sh still need to be run? If so, it is still causing issues with the merge of the vision layers.
No, just check out longjon/caffe:future and use it as-is.
@aalok1993, when you modified the model, did you set a weight_filler? (The default is to set the weights to zero, a stupid default if I know one.) Check out gaussian, xavier...
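For reference, both fillers scale with the layer's fan-in (input channels × kernel height × kernel width). A rough sketch of the usual formulas in plain Python; the xavier scale matches my reading of caffe's filler and the gaussian std is the common He/MSRA heuristic, so treat both as assumptions:

```python
import math

def fan_in(channels, kh, kw):
    # Number of inputs feeding each output unit of a conv layer.
    return channels * kh * kw

def xavier_scale(n):
    # caffe's "xavier" filler samples Uniform(-s, s) with s = sqrt(3 / fan_in).
    return math.sqrt(3.0 / n)

def gaussian_std(n):
    # A common std heuristic for a gaussian filler under ReLUs (He/MSRA).
    return math.sqrt(2.0 / n)

n = fan_in(3, 3, 3)    # e.g. a 3x3 conv over 3 input channels -> 27
s = xavier_scale(n)    # uniform range scale
std = gaussian_std(n)  # gaussian std
```

The practical point: either way the filler breaks the all-zero symmetry, which an unset (zero) filler never does.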
@atique81 @kashefy I really think setting batch size to 1 is a big mistake. Remember, no image contains examples of all the classes, so the gradient will be skewed. If memory is the problem (and it is), you can use iter_size.
check out
https://groups.google.com/forum/#!topic/caffe-users/PMbycfbpKcY
I'm using iter_size = 20 and batch_size = 1
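The point of iter_size is that caffe accumulates gradients over iter_size forward/backward passes before applying one update, so iter_size = 20 with batch_size = 1 approximates a true batch of 20. A toy sketch of that equivalence on a one-parameter least-squares model (all names are illustrative):

```python
# Toy loss: mean of (w*x - y)^2 over samples; d/dw = 2*x*(w*x - y).
samples = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.0)]
w = 0.5

def grad(w, x, y):
    return 2 * x * (w * x - y)

# One batch of 4: average the per-sample gradients.
batch_grad = sum(grad(w, x, y) for x, y in samples) / len(samples)

# batch_size=1 with iter_size=4: accumulate over 4 single-sample passes,
# normalize once, then apply a single update.
accum = 0.0
for x, y in samples:
    accum += grad(w, x, y)
accum /= len(samples)

# Both paths produce the same update direction and magnitude.
```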
Quick question: does the loss layer treat "background" differently, or is it just another class?
@aivision2020 actually we've found batch_size == 1 to be effective when paired with high momentum. See the PASCAL Context FCN in the model zoo: https://gist.github.com/shelhamer/80667189b218ad570e82#file-readme-md
Eventually the arXiv paper will be updated with more comments on this.
@aalok1993
Dear aalok1993,
I want to see your .prototxt files, but when I click the link it goes wrong. Can you send the two files solver.prototxt and train_val.prototxt to the mail 14120452@bjtu.edu.cn? Thank you very much.
Hi, I sent you the 3 files to your mail.
OK, I got it. Thank you very much @aalok1993
Hi, I am wondering which step I did wrong, as I cannot run eval.py from FCN-32s. (https://gist.github.com/shelhamer/80667189b218ad570e82#file-readme-md)
I did git clone https://github.com/longjon/caffe/tree/future
git checkout future
make
...
Then I set up the image paths, but I still get an error saying unknown layer Crop when trying to run eval.py.
I checked caffe.proto under caffe_root/src/caffe/proto and I saw the crop setting in it.
Could anyone tell me which step I did wrong and how to fix it?
Thanks.
@bruceko Hi, I have met the same problem as you. I also checked caffe.proto, but I cannot find a crop_param like other layers have (e.g. sigmoid_param and softmax_param). So what do you see in caffe.proto that is related to the CropLayer?
I wonder whether I have mis-installed the future release. I only downloaded and unzipped caffe-future.zip, then used the Makefile.config as in other Caffe branches and ran make all.
Well, have you got your problem fixed? Do you have any suggestions? Thanks!
@Jianchao-ICT I haven't figured out how to solve the problem yet.
Jon posted how he added a new layer to Caffe in BVLC#684 and now you can find it at https://github.com/BVLC/caffe/wiki/Development.
I think there is no parameter for the crop layer, so you cannot find it. (You might only be able to find the ID for it in caffe.proto.) However, you can find that the crop layer has been defined in vision_layers.hpp.
@bruceko Thanks! Well, I have noticed that CropLayer has no parameters, and I also see the CROP entry in caffe.proto now. I just hope to get the problem fixed. Thank you for the nice links 👍
@bruceko In fact, I wonder whether I have installed caffe-future correctly. I just downloaded, unzipped and built caffe-future.zip without using future.sh (someone seemed to mention that it should be used).
@Jianchao-ICT I only tried the steps I posted to install caffe-future.
I skipped running future.sh because I got into some trouble last time. Since I have already checked the development guide, I just ignore that.
@bruceko I tried to run git clone https://github.com/longjon/caffe/tree/future in the Linux terminal, but the following error appears. Have you met with it?
Cloning into 'future'...
p11-kit: invalid config filename, will be ignored in the future: /etc/pkcs11/modules/gnome-keyring-module
fatal: repository 'https://github.com/longjon/caffe/tree/future/' not found
@Jianchao-ICT I'm using Ubuntu and I don't have such a problem. I think you are using another Linux system; you might want to look at this: http://forum.mepiscommunity.org/viewtopic.php?f=94&t=36357
@bruceko Well, it seems that my Linux is also Ubuntu?
lijianchao@cuda-server:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.5 LTS
Release: 12.04
Codename: precise
@kashefy Hi, I have read your detailed comments above. Now I am just trying to run the eval.py script of FCN-32s, and I encounter a problem which says that Crop is an unknown layer to Caffe. I checked the files related to the CropLayer and found nothing wrong. My problem is posted in this issue. Could you help me with it? Thanks!
@bruceko Hi, I have found the reason why, on my machine, Caffe reports CropLayer as unknown. The reason is that I have another, compiled caffe-master branch on my machine. In eval.py, when import caffe is executed, the caffe module of caffe-master is imported, so it cannot recognize the CropLayer. You may verify this by printing help(caffe) and checking the Path information. Anyway, I have just noticed it and am still trying to fix it.
@Jianchao-ICT Thanks for your information. I had the same problem: I installed several repos for Caffe and didn't make the distribution for them. I could run eval.py by changing the path in .bashrc, but I got some warnings from that.
export CAFFE_HOME=${HOME}/caffe
Change caffe to caffe-future or other folder you have.
Hope the solution I used could help you too.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 597011289
I am working on other stuff, so I might not be able to work on this problem together with you. I hope you get your results soon, and then you might be able to help me.
@bruceko Yes, I changed PYTHONPATH and eval.py works now. BTW, I think the warning message is simply due to the model files of FCN-32s being so large; there is nothing wrong with your code.
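An alternative to editing .bashrc is to pin the import inside the script itself: prepend the desired build's python directory to sys.path before import caffe runs. A sketch (the path below is hypothetical):

```python
import sys

# Hypothetical location of the caffe-future Python bindings.
future_python = '/home/user/caffe-future/python'

# Prepending makes "import caffe" resolve here rather than in some
# caffe-master checkout that is already on PYTHONPATH.
if future_python not in sys.path:
    sys.path.insert(0, future_python)

# import caffe  # would now load the caffe-future module
```

Printing caffe.__file__ after the import confirms which build was actually picked up.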
@longjon @shelhamer Any plan to merge to master with a PR?
Hi all and @aalok1993
I have the same problem: my training output is always zero and the training loss does not decrease.
Do you have any suggestion?
Thank you all,
Thuan
@shelhamer Hi, I encountered conflicts when I merged PR BVLC#2016; it says "Automatic merge failed". Should I fix the conflicts manually? Thanks in advance.
Equivalent code is already merged to master in github.com/BVLC/caffe, in case people here weren't aware.
Hey all,
Check out the fcn.berkeleyvision.org repo for master editions of the reference networks, weights, and code for learning, inference, and scoring.
Closing this issue since the future branch is now deprecated.