About the learning rate decay
I have trained the encoder on the Cityscapes dataset using a batch size of 4 and a learning rate of 5e-4. I tested two values for the learning rate decay, and I do not understand why or how the learning rate decay affects the training process in the first 50 epochs when lrDecayEvery is 50. Here are the training and testing errors for the first 5 epochs when training the encoder with different learning rate decay values (all other hyperparameters are the same).
- learning rate decay = 5e-1
| Epoch | Testing error | Training error |
|-------|---------------|----------------|
| 1 | 1.936894622 | 2.197471235 |
| 2 | 1.847181412 | 1.922610026 |
| 3 | 1.803531181 | 1.842771126 |
| 4 | 1.746171906 | 1.799409299 |
| 5 | 1.754233467 | 1.771462634 |
- learning rate decay = 1e-7
| Epoch | Testing error | Training error |
|-------|---------------|----------------|
| 1 | 1.084342769 | 1.018340091 |
| 2 | 0.760894685 | 0.752971031 |
| 3 | 0.664953763 | 0.658853157 |
| 4 | 0.597714836 | 0.603430173 |
| 5 | 0.571140148 | 0.565291346 |
I'm not sure if the problem is how the weights are initialized each time training starts. But if you look into the train.lua file and find opt.learningRateDecay, you'll see this parameter doesn't do anything unless epoch % opt.lrDecayEvery == 0 (sorry, I don't know how to link the code). I think no effect from the learning rate decay should be expected until the 51st, 101st (etc.) epoch.
@AndraPetrovai lrDecayEvery, when set to 50, reduces the learning rate by a factor of learningRateDecay after every 50 epochs. This is accomplished by this line.
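For reference, here is a minimal, runnable sketch of that epoch-based schedule (the names opt.lrDecayEvery, opt.learningRateDecay and optimState.learningRate follow the thread; the actual line in train.lua may look slightly different):

```lua
-- Hypothetical sketch of the epoch-based decay discussed above;
-- the real train.lua line may differ in details.
local opt = { learningRate = 5e-4, learningRateDecay = 5e-1,
              lrDecayEvery = 50, maxepoch = 300 }
local optimState = { learningRate = opt.learningRate }

for epoch = 1, opt.maxepoch do
   if epoch % opt.lrDecayEvery == 0 then
      -- fires only at epochs 50, 100, ..., so epochs 1-49 are untouched
      optimState.learningRate = optimState.learningRate * opt.learningRateDecay
      print(string.format('epoch %3d: learning rate dropped to %.3e',
                          epoch, optimState.learningRate))
   end
end
```

As written, nothing changes before epoch 50, which is why a difference in the first 5 epochs is surprising.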
@wzhouuu @codeAC29 yes, exactly my thoughts. The learning rate should decay only after the 50th epoch, and it does; I checked. What I do not understand is why the testing and training errors are so different just from changing the learning rate decay, which in theory should not affect the training process in the first 50 epochs. It could be a weight initialization problem, but I find it strange that with a learning rate decay of 1e-7 the weights seem well initialized (training converges much faster), while with 5e-1 they seem poorly initialized (training converges very slowly). This happened every time I trained the encoder with both values, so it looks like a pattern.
These are the parameters I use.
-r,--learningRate (default 5e-4) learning rate
-d,--learningRateDecay (default 1e-7) learning rate decay (in # samples)
-w,--weightDecay (default 2e-4) L2 penalty on the weights
-m,--momentum (default 0.9) momentum
-b,--batchSize (default 4) batch size
--maxepoch (default 300) maximum number of training epochs
--plot (default true) plot training/testing error
--lrDecayEvery (default 50) Decay learning rate every X epoch by 1e-1
-t,--threads (default 8) number of threads
-i,--devid (default 1) device ID (if using CUDA)
--nGPU (default 2) number of GPUs you want to train on
--channels (default 3)
--dataset (default cs) dataset type: cv(CamVid)/cs(cityscapes)/su(SUN)
--imHeight (default 512) image height (360 cv/256 cs/256 su)
--imWidth (default 1024) image width (480 cv/512 cs/328 su)
--labelHeight (default 64) label height (45 cv/32 cs/32 su)
--labelWidth (default 128) label width (60 cv/64 cs/41 su)
Do you get approximately the same training and testing errors when training on Cityscapes with, for example, learning rate decay = 5e-1 and learning rate decay = 1e-7 (over the first 5 epochs, as in my example above)?
@AndraPetrovai actually we are using the adam optimizer, which also takes learningRateDecay as one of its parameters through optimState. Deleting this line might give you the desired output.
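That would explain the gap above: if adam anneals the learning rate at every parameter update, roughly as clr = lr / (1 + t * learningRateDecay) (my understanding of torch/optim's adam; treat the exact formula as an assumption), then a decay of 5e-1 shrinks the effective learning rate by orders of magnitude within the first epoch, while 1e-7 leaves it essentially unchanged. A rough standalone sketch, with iteration counts that assume ~2975 Cityscapes training images and a batch size of 4:

```lua
-- Assumed per-update annealing: clr = lr / (1 + t * learningRateDecay),
-- where t counts parameter updates, not epochs.
local lr = 5e-4
local updatesPerEpoch = 744  -- ~2975 images / batch size 4 (approximate)

for _, lrd in ipairs({5e-1, 1e-7}) do
   print(string.format('learningRateDecay = %g', lrd))
   for _, t in ipairs({1, 100, updatesPerEpoch, 5 * updatesPerEpoch}) do
      local clr = lr / (1 + t * lrd)
      print(string.format('  after %4d updates: effective lr = %.3e', t, clr))
   end
end
```

With learningRateDecay = 5e-1 the effective learning rate falls below 2e-6 after roughly one epoch, while with 1e-7 it stays at about 5e-4, which matches the convergence difference in the tables above.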
Got it. Thank you.