ImageNet example: train error does not decrease!
fangli1992 opened this issue · 23 comments
Hi,
I am training on ImageNet with the default configuration file, ImageNet.conf.
I am using the latest version of cxxnet, downloaded from https://github.com/dmlc/cxxnet, and I get results like this:
[1] train-error:0.999173 train-rec@1:0.00181927 train-rec@5:0.00576143 test-error:0.999 test-rec@1:0.00106 test-rec@5:0.0051
[2] train-error:0.998985 train-rec@1:0.000984172 train-rec@5:0.00498642 test-error:0.999 test-rec@1:0.0013 test-rec@5:0.00538
[3] train-error:0.998985 train-rec@1:0.00102632 train-rec@5:0.00492242 test-error:0.999 test-rec@1:0.00102 test-rec@5:0.00448
[4] train-error:0.998985 train-rec@1:0.000982611 train-rec@5:0.00496066 test-error:0.999 test-rec@1:0.00098 test-rec@5:0.00444
[5] train-error:0.998985 train-rec@1:0.00105441 train-rec@5:0.00507695 test-error:0.999 test-rec@1:0.00112 test-rec@5:0.00566
[6] train-error:0.998985 train-rec@1:0.000970124 train-rec@5:0.00502935 test-error:0.999 test-rec@1:0.00098 test-rec@5:0.0046
[7] train-error:0.998985 train-rec@1:0.00096466 train-rec@5:0.0049271 test-error:0.999 test-rec@1:0.00078 test-rec@5:0.005
[8] train-error:0.998985 train-rec@1:0.00104271 train-rec@5:0.00509178 test-error:0.999 test-rec@1:0.001 test-rec@5:0.00484
I found a similar issue, #84, but did not find a working answer there.
Here is some further information about my cxxnet setup and training (maybe it helps):
- I use the latest version of cxxnet, downloaded from https://github.com/dmlc/cxxnet.
- I did not modify the default conf file, except for adding 'shuffle = 1' and changing some paths.
# ImageNet.conf
data = train
iter = imgrec
# image_list = "../../NameList.train"
image_rec = "./data/train.bin"
# image_root = "../../data/resize256/"
image_mean = "models/image_net_mean.bin"
rand_crop=1
rand_mirror=1
shuffle = 1
iter = threadbuffer
iter = end
eval = test
iter = imgrec
# image_list = "../../NameList.test"
image_rec = "./data/val.bin"
# image_root = "../../data/resize256/"
image_mean = "models/image_net_mean.bin"
# no random crop and mirror in test
iter = end
...
...
- I trained LeNet on MNIST with a conf file that I converted from Caffe, and it works well (the default MNIST.conf works well too).
- I did not use cuDNN (USE_CUDNN = 0).
- I created the image_list_file in a format like this:
# for train.bin (of course, this line is not in image_list_file)
1 0 n01440764/n01440764_10026.JPEG
2 0 n01440764/n01440764_10027.JPEG
3 0 n01440764/n01440764_10029.JPEG
4 0 n01440764/n01440764_10040.JPEG
5 0 n01440764/n01440764_10042.JPEG
...
63341 48 n01695060/n01695060_6356.JPEG
63342 48 n01695060/n01695060_6360.JPEG
63343 48 n01695060/n01695060_6371.JPEG
63344 48 n01695060/n01695060_6389.JPEG
63345 48 n01695060/n01695060_64.JPEG
63346 48 n01695060/n01695060_6400.JPEG
63347 48 n01695060/n01695060_6403.JPEG
...
# for test.bin (val.bin in my conf; of course, this line is not in image_list_file)
1 65 ILSVRC2012_val_00000001.JPEG
2 970 ILSVRC2012_val_00000002.JPEG
3 230 ILSVRC2012_val_00000003.JPEG
4 809 ILSVRC2012_val_00000004.JPEG
5 516 ILSVRC2012_val_00000005.JPEG
6 57 ILSVRC2012_val_00000006.JPEG
- I use
./bin/im2rec image_list_file image_root_dir train.bin resize=256
to create the rec file.
- I also tried running an earlier version of cxxnet, downloaded from https://github.com/dmlc/cxxnet/tree/revert-208-fix_aug, but got the same output.
- I ran kaiming.conf, and the output was still very bad.
Please try another configuration file. The AlexNet conf file is out of date.
(Sent as an email reply to fangli1992's post of Sun, Aug 30, 2015, 20:43.)
Thanks, @antinucleon.
Is there something wrong with ImageNet.conf? I have checked the network configuration but did not find anything wrong.
The cause seems to be the random initialization. You can try the xavier
initialization method with a different seed.
(Sent as an email reply to fangli1992's comment of Sun, Aug 30, 2015, 20:53.)
Thanks, @antinucleon.
I still have no idea how to configure the seed for xavier. Should I read and modify the source code?
Or should I change it to gaussian, as in Caffe, and retry?
I am confused about why LeNet works well.
I also use the master version to train on the ImageNet data. I used kaiming.conf and Inception-BN.conf. Neither the train error rate nor the val error rate decreased.
My system is Windows Server.
@ommiissyu I see that the problem you submitted in April was closed by @winstywang in June. Do you have any ideas about this issue? My system is Ubuntu 14.04 LTS.
I have tried the gaussian initialization but got the same result.
@fangli1992 Have you tried xavier on kaiming.conf or googlenet? If it does not work well, try to add clip_gradient = 10 at the end of the config.
Thanks for your advice, @winstywang. I tried clip_gradient = 10 and got output like this:
[1] train-error:0.99599 train-rec@1:0.00401005 train-rec@5:0.00447365 test-error:0.999 test-rec@1:0.001 test-rec@5:0.005
[2] train-error:1 train-rec@1:0 train-rec@5:0 test-error:0.999 test-rec@1:0.001 test-rec@5:0.005
[3] train-error:1 train-rec@1:0 train-rec@5:0 test-error:0.999 test-rec@1:0.001 test-rec@5:0.005
[4] train-error:1 train-rec@1:0 train-rec@5:0 test-error:0.999 test-rec@1:0.001 test-rec@5:0.005
[5] train-error:1 train-rec@1:0 train-rec@5:0 test-error:0.999 test-rec@1:0.001 test-rec@5:0.005
[6] train-error:1 train-rec@1:0 train-rec@5:0 test-error:0.999 test-rec@1:0.001 test-rec@5:0.005
[7] train-error:1 train-rec@1:0 train-rec@5:0 test-error:0.999 test-rec@1:0.001 test-rec@5:0.005
I have not tried kaiming.conf or googlenet.conf, but @ommiissyu did; see #236.
As I replied above, try xavier initialization. I am not sure whether the
issue is caused by the Windows version.
(Sent as an email reply to fangli1992's comment of Tuesday, September 1, 2015.)
@fangli1992, in April I was using Linux, and we solved the problem there. These days I am using cxxnet on Windows, and I also see the error rate not decreasing.
Thanks for the info. Maybe it is related to rand_r on Windows. We are busy
developing MXNet, our next-generation dataflow tool; in MXNet we will use
C++11 to avoid the problem of inconsistent random numbers.
(Sent as an email reply to Leo Xiao's comment of Mon, Aug 31, 2015, 21:12.)
@winstywang OK, I am trying kaiming.conf with xavier and will post the result as soon as possible.
By the way, @antinucleon, I am using Ubuntu 14.04 LTS and still run into this problem.
@ommiissyu Could you please give me some further suggestions? In April, were you using cxxnet-v1?
Thanks a lot!
@fangli1992 Sorry, I won't have time to fix this in cxxnet, since my own
network always works well. Once MXNet is finished, cxxnet will be replaced
entirely.
(Sent as an email reply to fangli1992's comment of Mon, Aug 31, 2015, 21:25.)
@fangli1992, on Linux it works well. On Windows I get the same problem as you.
My result:
round 0:[ 10010] 11767 sec elapsed[1] train-rec@1:0.00114183 train-rec@5:0.00508788 val-rec@1:0.001 val-rec@5:0.005
round 1:[ 10010] 23840 sec elapsed[2] train-rec@1:0 train-rec@5:0 val-rec@1:0.001 val-rec@5:0.005
round 2:[ 10010] 35863 sec elapsed[3] train-rec@1:0 train-rec@5:0 val-rec@1:0.001 val-rec@5:0.005
round 3:[ 10010] 47897 sec elapsed[4] train-rec@1:0 train-rec@5:0 val-rec@1:0.001 val-rec@5:0.005
round 4:[ 10010] 59918 sec elapsed[5] train-rec@1:0 train-rec@5:0 val-rec@1:0.001 val-rec@5:0.005
round 5:[ 10010] 71943 sec elapsed[6] train-rec@1:0 train-rec@5:0 val-rec@1:0.001 val-rec@5:0.005
round 6:[ 10010] 83983 sec elapsed[7] train-rec@1:0 train-rec@5:0 val-rec@1:0.001 val-rec@5:0.005
Thanks, @antinucleon. Do you mean that cxxnet will be abandoned once MXNet is out? Could you please tell me more about MXNet? When will it be published?
@fangli1992 https://github.com/dmlc/mxnet https://mxnet.readthedocs.org/en/latest/ It is still ongoing and there is no exact timeline, but we are confident we will finish it soon.
@antinucleon @winstywang I recently ran a test, and I think the result will help find the bug.
I tried the raw images with iter=img; that is, I did not use a rec-format or bin-format file. Training was extremely slow, but the result was fine.
round 0:[ 5000] 21859 sec elapsed[1] train-error:0.988323 train-rec@1:0.0116774 train-rec@5:0.0427986 test-error:0.94268 test-rec@1:0.05732 test-rec@5:0.17228
round 1:[ 5000] 44233 sec elapsed[2] train-error:0.909059 train-rec@1:0.0909411 train-rec@5:0.23927 test-error:0.85578 test-rec@1:0.14422 test-rec@5:0.33428
round 2:[ 5000] 66558 sec elapsed[3] train-error:0.839233 train-rec@1:0.160767 train-rec@5:0.361801 test-error:0.79018 test-rec@1:0.20982 test-rec@5:0.43284
round 3:[ 5000] 88912 sec elapsed[4] train-error:0.787381 train-rec@1:0.212619 train-rec@5:0.438686 test-error:0.74626 test-rec@1:0.25374 test-rec@5:0.48924
round 4:[ 5000] 111285 sec elapsed[5] train-error:0.747077 train-rec@1:0.252923 train-rec@5:0.494042 test-error:0.71226 test-rec@1:0.28774 test-rec@5:0.53446
round 5:[ 5000] 133646 sec elapsed[6] train-error:0.713261 train-rec@1:0.286739 train-rec@5:0.536074 test-error:0.68064 test-rec@1:0.31936 test-rec@5:0.56912
round 6:[ 5000] 155997 sec elapsed[7] train-error:0.688767 train-rec@1:0.311233 train-rec@5:0.564116 test-error:0.67778 test-rec@1:0.32222 test-rec@5:0.573
round 7:[ 5000] 178354 sec elapsed[8] train-error:0.691887 train-rec@1:0.308113 train-rec@5:0.55988 test-error:0.66232 test-rec@1:0.33768 test-rec@5:0.58872
round 8:[ 5000] 200695 sec elapsed[9] train-error:0.665245 train-rec@1:0.334755 train-rec@5:0.589703 test-error:0.6387 test-rec@1:0.3613 test-rec@5:0.6173
round 9:[ 5000] 223037 sec elapsed[10] train-error:0.640461 train-rec@1:0.359539 train-rec@5:0.616859 test-error:0.62932 test-rec@1:0.37068 test-rec@5:0.62178
round 10:[ 5000] 245377 sec elapsed[11] train-error:0.621268 train-rec@1:0.378732 train-rec@5:0.637484 test-error:0.60522 test-rec@1:0.39478 test-rec@5:0.64822
round 11:[ 5000] 267710 sec elapsed[12] train-error:0.603612 train-rec@1:0.396388 train-rec@5:0.655111 test-error:0.59288 test-rec@1:0.40712 test-rec@5:0.66024
round 12:[ 5000] 290041 sec elapsed[13] train-error:0.588543 train-rec@1:0.411457 train-rec@5:0.670691 test-error:0.58648 test-rec@1:0.41352 test-rec@5:0.66906
I guess there must be something wrong with im2rec or the imgrec iterator.
@fangli1992
It seems that the list used to generate the rec file was not shuffled.
Shuffling it before running im2rec may help.
Setting shuffle in the iterator does not help with im2rec output, because it only shuffles within a page (about 3000 pictures).
So if the label stays the same over many consecutive examples, training will fall into overfitting.
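To make the page-local shuffle argument concrete, here is a small Python simulation. It is only a sketch under stated assumptions, not cxxnet's actual iterator: the page size of 3000 comes from the comment above, while the toy dataset (50 classes of 1300 images each, sorted by class), the batch size of 256, and the helper names are made up for illustration.

```python
import random


def page_local_shuffle(labels, page_size, rng):
    """Shuffle labels only inside consecutive pages of page_size items."""
    shuffled = []
    for start in range(0, len(labels), page_size):
        page = labels[start:start + page_size]  # slicing copies the page
        rng.shuffle(page)
        shuffled.extend(page)
    return shuffled


def mean_distinct_labels_per_batch(labels, batch_size):
    """Average number of different labels seen in one mini-batch."""
    counts = [
        len(set(labels[start:start + batch_size]))
        for start in range(0, len(labels), batch_size)
    ]
    return sum(counts) / len(counts)


rng = random.Random(0)

# Hypothetical toy list: 50 classes x 1300 images, sorted by class,
# like the class-ordered image list shown earlier in the thread.
sorted_labels = [label for label in range(50) for _ in range(1300)]

# Assumed model of the iterator's shuffle: pages of 3000 examples.
page_shuffled = page_local_shuffle(sorted_labels, page_size=3000, rng=rng)

# Shuffling the whole list, as one would do before running im2rec.
globally_shuffled = list(sorted_labels)
rng.shuffle(globally_shuffled)

page_mean = mean_distinct_labels_per_batch(page_shuffled, batch_size=256)
global_mean = mean_distinct_labels_per_batch(globally_shuffled, batch_size=256)
print("page-local shuffle:", page_mean, "distinct labels per batch")
print("global shuffle:    ", global_mean, "distinct labels per batch")
```

With a class-sorted list, each batch after a page-local shuffle still draws from only a handful of classes, whereas a global shuffle mixes most classes into every batch.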
I shuffled the list before generating the rec file. It works well for me now. Thank you @fangli1992 .
Thank you all, @superzrx @antinucleon @ommiissyu @winstywang. I followed @superzrx's advice, and now my AlexNet works well.
@fangli1992 @ommiissyu We have run into the same problem as you. Do you mean that you shuffled the list and then got good results? How should we shuffle the list? Are there any special requirements?
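One way to shuffle the list, as a minimal Python sketch. The function name shuffle_image_list and any file names are hypothetical, and it assumes the list has one image per line, as in the excerpt earlier in the thread. It reorders whole lines and leaves the index and label columns untouched; whether im2rec has any further requirements on the list is not stated in this thread.

```python
import random


def shuffle_image_list(src_path, dst_path, seed=0):
    """Write the lines of src_path to dst_path in random order.

    Each line (index, label, relative path) is kept intact; only the
    order of the lines changes. Blank lines are dropped.
    Returns the number of lines written.
    """
    with open(src_path) as src:
        lines = [line.rstrip("\n") for line in src if line.strip()]
    # A fixed seed makes the shuffled order reproducible.
    random.Random(seed).shuffle(lines)
    with open(dst_path, "w") as dst:
        for line in lines:
            dst.write(line + "\n")
    return len(lines)
```

The shuffled list can then be passed to the same im2rec command quoted in the original post, in place of the class-ordered list.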
@fangli1992 By the way, have you compiled cxxnet with ps-lite? We use it with kaiming.conf from the example folder, changing only the input file paths and the GPU number, and we get the same problem you described above.