ImageNet Example : train error does not decrease!

Question

fangli1992 opened this issue 9 years ago · 23 comments

fangli1992 commented 9 years ago

Hi,

I am training ImageNet using the default configuration file, ImageNet.conf.
I am using the latest version of cxxnet, downloaded from https://github.com/dmlc/cxxnet , and I get results like this:

[1] train-error:0.999173    train-rec@1:0.00181927  train-rec@5:0.00576143  test-error:0.999    test-rec@1:0.00106  test-rec@5:0.0051
[2] train-error:0.998985    train-rec@1:0.000984172 train-rec@5:0.00498642  test-error:0.999    test-rec@1:0.0013   test-rec@5:0.00538
[3] train-error:0.998985    train-rec@1:0.00102632  train-rec@5:0.00492242  test-error:0.999    test-rec@1:0.00102  test-rec@5:0.00448
[4] train-error:0.998985    train-rec@1:0.000982611 train-rec@5:0.00496066  test-error:0.999    test-rec@1:0.00098  test-rec@5:0.00444
[5] train-error:0.998985    train-rec@1:0.00105441  train-rec@5:0.00507695  test-error:0.999    test-rec@1:0.00112  test-rec@5:0.00566
[6] train-error:0.998985    train-rec@1:0.000970124 train-rec@5:0.00502935  test-error:0.999    test-rec@1:0.00098  test-rec@5:0.0046
[7] train-error:0.998985    train-rec@1:0.00096466  train-rec@5:0.0049271   test-error:0.999    test-rec@1:0.00078  test-rec@5:0.005
[8] train-error:0.998985    train-rec@1:0.00104271  train-rec@5:0.00509178  test-error:0.999    test-rec@1:0.001    test-rec@5:0.00484

I found a similar issue, #84, but did not find a working answer there.
Here is some further information about my cxxnet setup and training (maybe it helps):

I use the latest version of cxxnet, downloaded from https://github.com/dmlc/cxxnet .
I did not modify the default conf file except for adding 'shuffle = 1' and changing some paths.

# ImageNet.conf
data = train
iter = imgrec
#  image_list = "../../NameList.train"
  image_rec  = "./data/train.bin"
#  image_root = "../../data/resize256/"
  image_mean = "models/image_net_mean.bin"
  rand_crop=1
  rand_mirror=1
  shuffle = 1
iter = threadbuffer
iter = end

eval = test
iter = imgrec
#  image_list = "../../NameList.test"
  image_rec = "./data/val.bin"
#  image_root = "../../data/resize256/"
  image_mean = "models/image_net_mean.bin"
# no random crop and mirror in test
iter = end
...
...

I trained LeNet on MNIST with a conf file I converted from Caffe, and it works well (the default MNIST.conf works well too).
I did not use cuDNN (USE_CUDNN = 0).
I created the image_list_file in the following format:

# for train.bin (of course, this line is not in image_list_file)
1   0   n01440764/n01440764_10026.JPEG
2   0   n01440764/n01440764_10027.JPEG
3   0   n01440764/n01440764_10029.JPEG
4   0   n01440764/n01440764_10040.JPEG
5   0   n01440764/n01440764_10042.JPEG
...
63341   48  n01695060/n01695060_6356.JPEG
63342   48  n01695060/n01695060_6360.JPEG
63343   48  n01695060/n01695060_6371.JPEG
63344   48  n01695060/n01695060_6389.JPEG
63345   48  n01695060/n01695060_64.JPEG
63346   48  n01695060/n01695060_6400.JPEG
63347   48  n01695060/n01695060_6403.JPEG
...

# for test.bin(In my conf, it is val.bin. Of course, this line is not in image_list_file. )
1   65  ILSVRC2012_val_00000001.JPEG
2   970     ILSVRC2012_val_00000002.JPEG
3   230     ILSVRC2012_val_00000003.JPEG
4   809     ILSVRC2012_val_00000004.JPEG
5   516     ILSVRC2012_val_00000005.JPEG
6   57  ILSVRC2012_val_00000006.JPEG

I used ./bin/im2rec image_list_file image_root_dir train.bin resize=256 to create the rec file.
I also tried an earlier version of cxxnet, downloaded from https://github.com/dmlc/cxxnet/tree/revert-208-fix_aug , but got the same output.
I also ran kaiming.conf, and the output was still very bad.

Answer 1 · 2015-08-31T02:44:56.000Z

Please try another configuration file. The AlexNet conf file is out of date.

Answer 2 · 2015-08-31T02:53:22.000Z

Thanks, @antinucleon.
Is there something wrong with ImageNet.conf? I have checked the network configuration but did not find anything wrong.

Answer 3 · 2015-08-31T02:55:11.000Z

The cause seems to be the random initialization. You can try the xavier
initialization method with a different seed.

Answer 4 · 2015-08-31T03:19:37.000Z

Thanks, @antinucleon.
I still have no idea how to configure the seed for xavier. Should I read and modify the source code?
Or should I change it to gaussian, as in Caffe, and retry?
I am confused about why LeNet works well.

Answer 5 · 2015-08-31T08:59:49.000Z

I also use the master version to train on the ImageNet data. I used kaiming.conf and Inception-BN.conf. Neither the train error rate nor the val error rate decreased.

My system is Windows Server.

Answer 6 · 2015-08-31T12:27:06.000Z

@ommiissyu I see that you submitted this problem in April and that it was closed by @winstywang in June. Do you have any ideas about this issue? My system is Ubuntu 14.04 LTS.
I have tried gaussian initialization but got the same result.

Answer 7 · 2015-08-31T12:31:36.000Z

@fangli1992 Have you tried xavier on kaiming.conf or googlenet? If it does not work well, try adding clip_gradient = 10 at the end of the config.

Answer 8 · 2015-09-01T01:23:08.000Z

Thanks for your advice, @winstywang. I tried clip_gradient = 10 and got output like this:

[1]     train-error:0.99599     train-rec@1:0.00401005  train-rec@5:0.00447365  test-error:0.999        test-rec@1:0.001        test-rec@5:0.005
[2]     train-error:1   train-rec@1:0   train-rec@5:0   test-error:0.999        test-rec@1:0.001        test-rec@5:0.005
[3]     train-error:1   train-rec@1:0   train-rec@5:0   test-error:0.999        test-rec@1:0.001        test-rec@5:0.005
[4]     train-error:1   train-rec@1:0   train-rec@5:0   test-error:0.999        test-rec@1:0.001        test-rec@5:0.005
[5]     train-error:1   train-rec@1:0   train-rec@5:0   test-error:0.999        test-rec@1:0.001        test-rec@5:0.005
[6]     train-error:1   train-rec@1:0   train-rec@5:0   test-error:0.999        test-rec@1:0.001        test-rec@5:0.005
[7]     train-error:1   train-rec@1:0   train-rec@5:0   test-error:0.999        test-rec@1:0.001        test-rec@5:0.005

I have not tried kaiming.conf or googlenet.conf, but @ommiissyu did; see #236.
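
For reference, the test-rec@1 and test-rec@5 values in the log above are what uniform random guessing would give on the 1000-class ILSVRC2012 task, which suggests the network has not learned anything yet. A minimal Python sketch of that arithmetic (the function name chance_recall is made up for illustration and is not part of cxxnet):

```python
from fractions import Fraction


def chance_recall(num_classes: int, k: int) -> float:
    """Expected recall@k when k distinct classes are guessed uniformly at random."""
    return float(Fraction(min(k, num_classes), num_classes))


NUM_CLASSES = 1000  # ILSVRC2012 has 1000 classes

print(chance_recall(NUM_CLASSES, 1))  # 0.001
print(chance_recall(NUM_CLASSES, 5))  # 0.005
```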

Answer 9 · 2015-09-01T02:14:33.000Z

As replied above, try xavier initialization. I am not sure whether the
issue is caused by the Windows version.

Answer 10 · 2015-09-01T03:12:25.000Z

@fangli1992, in April I was using a Linux system, and we solved the problem there. But these days I am using cxxnet on Windows, and I get the same non-decreasing error rate.

Answer 11 · 2015-09-01T03:15:46.000Z

Thanks for the info. Maybe it is related to rand_r on Windows. We are busy
developing our next-generation dataflow tool, MXNet; in MXNet we will use
C++11 to avoid inconsistent random number generation.

Answer 12 · 2015-09-01T03:25:43.000Z

@winstywang OK, I am trying kaiming.conf with xavier and will post the result as soon as possible.
By the way, @antinucleon, I am using Ubuntu 14.04 LTS and still run into this problem.
@ommiissyu could you please give me some further suggestions? Were you using cxxnet-v1 in April?
Thanks a lot!

Answer 13 · 2015-09-01T04:36:24.000Z

@fangli1992 Sorry, I won't have time to fix this in cxxnet, because my own
network always works well. Once MXNet is finished, it will replace cxxnet
completely.

Answer 14 · 2015-09-01T04:52:29.000Z

@fangli1992, on Linux it works well. On Windows I get the same problem as you.

Answer 15 · 2015-09-01T04:54:50.000Z

My result:

round        0:[   10010] 11767 sec elapsed[1]  train-rec@1:0.00114183  train-rec@5:0.00508788  val-rec@1:0.001     val-rec@5:0.005
round        1:[   10010] 23840 sec elapsed[2]  train-rec@1:0   train-rec@5:0   val-rec@1:0.001 val-rec@5:0.005
round        2:[   10010] 35863 sec elapsed[3]  train-rec@1:0   train-rec@5:0   val-rec@1:0.001 val-rec@5:0.005
round        3:[   10010] 47897 sec elapsed[4]  train-rec@1:0   train-rec@5:0   val-rec@1:0.001 val-rec@5:0.005
round        4:[   10010] 59918 sec elapsed[5]  train-rec@1:0   train-rec@5:0   val-rec@1:0.001 val-rec@5:0.005
round        5:[   10010] 71943 sec elapsed[6]  train-rec@1:0   train-rec@5:0   val-rec@1:0.001 val-rec@5:0.005
round        6:[   10010] 83983 sec elapsed[7]  train-rec@1:0   train-rec@5:0   val-rec@1:0.001 val-rec@5:0.005

Answer 16 · 2015-09-01T11:27:01.000Z

Thanks @antinucleon. Do you mean that cxxnet will be abandoned once MXNet is out? Could you please tell me more about MXNet? When will it be published?

Answer 17 · 2015-09-01T16:58:09.000Z

@fangli1992 https://github.com/dmlc/mxnet https://mxnet.readthedocs.org/en/latest/ It is still in progress and there is no exact timeline, but we are confident we will finish it soon.

Answer 18 · 2015-09-06T03:36:47.000Z

@antinucleon @winstywang Over the past few days I ran a test, and I think the result helps narrow down the bug.
I trained on the raw images with iter=img, that is, without using a rec or bin file. Although training was extremely slow, the result was fine.

round        0:[    5000] 21859 sec elapsed[1]  train-error:0.988323    train-rec@1:0.0116774   train-rec@5:0.0427986   test-error:0.94268  test-rec@1:0.05732  test-rec@5:0.17228
round        1:[    5000] 44233 sec elapsed[2]  train-error:0.909059    train-rec@1:0.0909411   train-rec@5:0.23927 test-error:0.85578  test-rec@1:0.14422  test-rec@5:0.33428
round        2:[    5000] 66558 sec elapsed[3]  train-error:0.839233    train-rec@1:0.160767    train-rec@5:0.361801    test-error:0.79018  test-rec@1:0.20982  test-rec@5:0.43284
round        3:[    5000] 88912 sec elapsed[4]  train-error:0.787381    train-rec@1:0.212619    train-rec@5:0.438686    test-error:0.74626  test-rec@1:0.25374  test-rec@5:0.48924
round        4:[    5000] 111285 sec elapsed[5] train-error:0.747077    train-rec@1:0.252923    train-rec@5:0.494042    test-error:0.71226  test-rec@1:0.28774  test-rec@5:0.53446
round        5:[    5000] 133646 sec elapsed[6] train-error:0.713261    train-rec@1:0.286739    train-rec@5:0.536074    test-error:0.68064  test-rec@1:0.31936  test-rec@5:0.56912
round        6:[    5000] 155997 sec elapsed[7] train-error:0.688767    train-rec@1:0.311233    train-rec@5:0.564116    test-error:0.67778  test-rec@1:0.32222  test-rec@5:0.573
round        7:[    5000] 178354 sec elapsed[8] train-error:0.691887    train-rec@1:0.308113    train-rec@5:0.55988 test-error:0.66232  test-rec@1:0.33768  test-rec@5:0.58872
round        8:[    5000] 200695 sec elapsed[9] train-error:0.665245    train-rec@1:0.334755    train-rec@5:0.589703    test-error:0.6387   test-rec@1:0.3613   test-rec@5:0.6173
round        9:[    5000] 223037 sec elapsed[10]        train-error:0.640461    train-rec@1:0.359539    train-rec@5:0.616859    test-error:0.62932  test-rec@1:0.37068  test-rec@5:0.62178
round       10:[    5000] 245377 sec elapsed[11]        train-error:0.621268    train-rec@1:0.378732    train-rec@5:0.637484    test-error:0.60522  test-rec@1:0.39478  test-rec@5:0.64822
round       11:[    5000] 267710 sec elapsed[12]        train-error:0.603612    train-rec@1:0.396388    train-rec@5:0.655111    test-error:0.59288  test-rec@1:0.40712  test-rec@5:0.66024
round       12:[    5000] 290041 sec elapsed[13]        train-error:0.588543    train-rec@1:0.411457    train-rec@5:0.670691    test-error:0.58648  test-rec@1:0.41352  test-rec@5:0.66906

I guess there must be something wrong with im2rec or the imgrec iterator.

Answer 19 · 2015-09-07T02:43:58.000Z

@fangli1992
It seems that the list used to generate the rec file was not shuffled.
Shuffling the list before running im2rec may help.
Setting shuffle in the iterator does not help with a rec file built from an unshuffled list, because it only shuffles within a page (about 3000 images).
So if the label stays the same over a long run of consecutive examples, the network will overfit to those examples.
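
A minimal Python sketch of shuffling the list before running im2rec. It only reorders whole lines and leaves each line, including its index column, untouched; whether im2rec cares about the order of the index column is not stated in this thread, so treat that as an assumption. The function name shuffle_lines and the file names are made up for illustration.

```python
import random


def shuffle_lines(src_path: str, dst_path: str, seed: int = 0) -> int:
    """Write the non-empty lines of src_path to dst_path in random order.

    Returns the number of lines written. A fixed seed makes the order reproducible.
    """
    with open(src_path) as src:
        # Drop blank lines and make sure every kept line ends with a newline.
        lines = [
            line if line.endswith("\n") else line + "\n"
            for line in src
            if line.strip()
        ]
    random.Random(seed).shuffle(lines)
    with open(dst_path, "w") as dst:
        dst.writelines(lines)
    return len(lines)
```

Calling shuffle_lines("train.lst", "train_shuffled.lst") would then produce a list that can be passed to im2rec in place of the original one, for example ./bin/im2rec train_shuffled.lst image_root_dir train.bin resize=256.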

Answer 20 · 2015-09-09T02:46:45.000Z

I shuffled the list before generating the rec file. It works well for me now. Thank you @fangli1992 .

Answer 21 · 2015-09-09T08:29:50.000Z

Thank you all, @superzrx @antinucleon @ommiissyu @winstywang. I followed @superzrx's advice and now my AlexNet works well.

Answer 22 · 2015-09-21T02:36:12.000Z

@fangli1992 @ommiissyu We have run into the same problem. Do you mean that you shuffled the list and then got good results? How should we shuffle the list? Are there any special requirements?
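
One rough way to check whether a list is mixed well enough, given the point above that the iterator only shuffles within a page of about 3000 images, is to count how many distinct labels appear in each consecutive block of that size. This is a sketch under assumptions: the window of 3000 is taken from the estimate in this thread, the label is assumed to be the second whitespace-separated column as in the list format shown in the question, and the function name min_labels_per_window is hypothetical.

```python
def min_labels_per_window(list_path: str, window: int = 3000) -> int:
    """Smallest number of distinct labels in any full block of `window` lines.

    A trailing partial block is ignored unless the file is shorter than one window.
    """
    with open(list_path) as f:
        # Column 2 of each non-empty line is taken to be the class label.
        labels = [line.split()[1] for line in f if line.strip()]
    starts = range(0, max(len(labels) - window, 0) + 1, window)
    return min(len(set(labels[s:s + window])) for s in starts)
```

A list sorted by class will typically give a very small value (one or two labels per block), while a well-shuffled list should give a value in the hundreds for ImageNet.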

Answer 23 · 2015-09-21T02:45:43.000Z

@fangli1992 By the way, have you compiled cxxnet with ps-lite? We use it with kaiming.conf from the example folder, changing only the input file paths and the GPU number, and we get the same problem you described above.