titu1994/keras-adabound

Using SGDM with lr=0.1 leads to no learning

Closed this issue · 10 comments

Thanks for sharing your Keras version of AdaBound. I found that when changing the optimizer from AdaBound to SGDM (lr=0.1), the ResNet doesn't learn at all, as the figure below shows.
[image: training curves]

I remember that the original paper uses SGDM (lr=0.1) for its comparisons, and I'm wondering how this could be.

Is this with Wide ResNet 34? I'm surprised it can train in less than 5 minutes per epoch.

I think you have an older version of the code. I changed the ResNet to match the one in the PyTorch codebase (the wide version, with a width factor of 4), even using the same batch norm momentum as PyTorch, then trained a ResNet 34 on Colab for over 22 hours to get to 92% without cropping augmentation.

In fact, the earlier versions can't train past 89%, even with over 500 epochs of fine-tuning and other tricks.

I pulled your latest code yesterday and just converted the .py files to a notebook for convenience. With no modifications other than a larger batch size, SGDM with lr = 0.1 leads to this problem.

I guess your code is based on the official Keras team's? I have tried their Wide ResNet 34 and it still doesn't work with SGDM (lr=0.1). However, it works well in the original PyTorch code. Could this be due to differences between Keras and PyTorch?

That's odd. My version of ResNet is basically a port of the one in PyTorch, keeping the batch norm momentum the same as well. There should be no obvious reason why one trains while the other does not.

When you say SGDM, you mean the regular SGD optimizer with momentum and Nesterov set, yes?

AdaBound starts with an initial lr of 0.001, and the bounds begin to clip close to 0.1 depending on the gamma. Maybe try SGD with a slightly lower learning rate, 0.05 or something.
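For reference, here's a minimal sketch of how those bounds tighten around final_lr = 0.1 over training, following the bound schedule from the reference AdaBound implementation (gamma = 1e-3 is its default there; in the actual optimizer the clipping is applied per parameter to the Adam-style step size):

```python
def adabound_bounds(step, final_lr=0.1, gamma=1e-3):
    # Both bounds converge to final_lr as training progresses, so
    # AdaBound transitions smoothly from Adam-like steps toward
    # SGD with lr = final_lr.
    lower = final_lr * (1 - 1 / (gamma * step + 1))
    upper = final_lr * (1 + 1 / (gamma * step))
    return lower, upper

for step in (1, 100, 1000, 10000):
    lo, hi = adabound_bounds(step)
    print(f"step {step:>5}: per-parameter lr clipped into [{lo:.4f}, {hi:.4f}]")
```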

Yes. SGDM is the regular SGD optimizer with momentum but no Nesterov. I'm doing research on optimizer comparisons and have been using Keras for all the performance tests.

Could you help me test SGDM with lr = 0.1 and momentum = 0.9 in your code? A few epochs would be enough, maybe 5 or 10. I just want to know whether I should keep using Keras or switch to PyTorch for further experiments. Your help would save me a lot of time. Thank you very much!
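For concreteness, this is the configuration being discussed (assuming the multi-backend Keras of that era, where the argument is named `lr` rather than `learning_rate`):

```python
from keras.optimizers import SGD

# SGDM as described above: plain momentum, no Nesterov.
optimizer = SGD(lr=0.1, momentum=0.9, nesterov=False)
```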

Sorry, I was busy with work for the past week.

Actually, I've been using Colab to train the models for this repo. It's quite convenient. I'll just share the notebook, and you can run it there with your modifications?

https://gist.github.com/titu1994/efa42c8ced1c055801fd74789ef108d2

You'll need to drop this into a folder alongside the resnet.py, adabound.py, and cifar10.py scripts, and change the Drive cd path. Sorry, I can't just share the Colab directory since it's inside my personal research folder.

Thanks for kindly sharing! May I ask what the authorization code is? I need it when I run the drive.mount cell.

PS: I've switched to PyTorch and got satisfactory results when testing the optimization method I designed.

The authorization code is a personal token you receive from Google when you run the notebook on Colab. It can't be shared.
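For anyone else hitting this, the mount cell is the standard Colab pattern; running it prints an auth link, and pasting the one-time code it returns completes the mount:

```python
from google.colab import drive

# Prints an authorization link; paste the one-time code it gives you.
drive.mount('/content/drive')
```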

It's interesting that the optimizers behave differently, but it probably has something to do with the model architecture or some other factor.

Yes. Actually, I've been doing optimization research using Keras. After this experiment, I've decided to use PyTorch, but I'll keep my Keras results because I really don't want to spend time re-tuning hyperparameters 😵 Thanks anyway!

Hey, here's an update. After switching to PyTorch and reading its source code, I found that PyTorch's SGDM implementation is slightly different from Keras'. In the original SGDM formulation, and in Keras, the gradient is multiplied by the learning rate before it is accumulated into the velocity. In PyTorch, the momentum buffer accumulates the raw gradient (multiplied by a constant 1), and the learning rate multiplies the whole buffer at update time. Maybe this can explain why the same hyperparameter setting leads to different results.
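To make the difference concrete, here's a minimal side-by-side sketch of the two update rules in plain Python (a simplification, not the actual library code):

```python
def keras_sgdm_step(w, grad, v, lr=0.1, momentum=0.9):
    # Keras-style: lr scales the gradient *inside* the velocity,
    # so the buffer stores lr-scaled gradient history.
    v = momentum * v - lr * grad
    return w + v, v

def pytorch_sgdm_step(w, grad, buf, lr=0.1, momentum=0.9):
    # PyTorch-style: the buffer accumulates raw gradients, and lr
    # scales the whole momentum buffer at update time.
    buf = momentum * buf + grad
    return w - lr * buf, buf
```

Worth noting: with a strictly constant lr the two rules are algebraically equivalent (the Keras `v` is just `-lr` times the PyTorch `buf`), so any behavioral gap would presumably come from the learning rate changing during training, e.g. a decay schedule rescaling PyTorch's accumulated momentum immediately while the Keras buffer still carries the old lr-scaled steps.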

Oh, that's very interesting! I'll try it out over the week and see if it works after rewriting PyTorch's SGDM in Keras.