May I use CLR for Adam optimizer?
xuzhang5788 opened this issue · 10 comments
From the paper and your implementation, your examples only use the SGD optimizer. I am wondering if I can use CLR with Adam or other optimizers. Many thanks.
I use CLR with Adam. I haven't had any issues with it.
@mdhimes which framework?
@MugheesAhmad Keras/TensorFlow. It should also work with PyTorch, though I haven't implemented it there.
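For anyone who wants a concrete starting point, here is a minimal sketch of the triangular CLR policy from the paper applied on top of Adam in Keras/TensorFlow. This is my own illustration, not this repo's implementation; the model, base_lr, max_lr, and step_size values are placeholder choices.

```python
import numpy as np
import tensorflow as tf

def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-3, step_size=2000):
    """Triangular CLR (Smith, 2015): oscillate linearly between base_lr and max_lr."""
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

class CLRCallback(tf.keras.callbacks.Callback):
    """Set the optimizer's learning rate at the start of every batch from the CLR schedule."""
    def __init__(self):
        super().__init__()
        self.iteration = 0

    def on_train_batch_begin(self, batch, logs=None):
        lr = triangular_clr(self.iteration)
        # Classic tf.keras idiom for overwriting the optimizer's learning-rate variable.
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, float(lr))
        self.iteration += 1

# Placeholder model, just to show the CLR callback and Adam coexisting.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy")
# model.fit(x_train, y_train, callbacks=[CLRCallback()])
```

On the PyTorch side, torch.optim.lr_scheduler.CyclicLR provides the same triangular policy; when pairing it with Adam you'll typically want cycle_momentum=False, since Adam doesn't expose a classic momentum hyperparameter for the scheduler to cycle.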
I may be a bit late here, but I'll add my two cents for the sake of a hopefully valuable addition for future readers.
Besides the practical side of things ("I haven't had any issues with it", see above), I would also argue conceptually that using CLR with Adam is perfectly fine.
At a very high level, Adam differs from classic SGD in that it (1) adapts the update per parameter (i.e., each parameter effectively gets its own step size), whereas classic SGD applies one global learning rate, and (2) uses momentum-like estimates, whereas classic SGD has no momentum.
Now, cyclical learning rates do nothing but move the learning rate back and forth between a higher and a lower value with the goal of escaping saddle points and, by consequence of the design, local minima as well.
Does this violate the conceptual improvements of Adam over SGD? Not in my opinion. Local optimization still takes place with respect to the current loss (irrespective of future learning rates), and CLR merely makes the momentum-driven steps smaller in one part of the cycle and a bit larger in the other (depending on where in the cycle you are).
Perhaps CLR thus even extends the conceptual improvements of the Adam optimizer, making it better still.
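To put a formula behind that argument (this is just the standard Adam update, nothing specific to this repo): with a cyclical schedule, only the global step size $\eta_t$ oscillates, while the per-parameter adaptation through the moment estimates is left untouched:

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \qquad
\theta_{t+1} = \theta_t - \eta_t \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected moment estimates and $\eta_t$ follows the triangular CLR schedule.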
Now, this should all be verified empirically and at scale, but let's hope this answers your question from a conceptual point of view as well. And if not yours, then the questions of others who find this issue in the future 😄
Just a warning about using Adam with CLR: I wouldn’t do the LR range test with Adam, the momentum will throw off the results and not give the best max and base LR.
Fair enough Robert. Any tips?
> Just a warning about using Adam with CLR: I wouldn’t do the LR range test with Adam, the momentum will throw off the results and not give the best max and base LR.
Can you comment more on this? Whenever I've done an LR range test with Adam, the results have been fairly consistent. I can understand why the determined max LR might not be the best due to the momentum, but I'm having a hard time seeing why the base LR would not be the best. From a plot of validation loss vs. learning rate, it has always been quite clear, and the result is consistent as long as the determined base LR is within the explored range (e.g., sweeping 1e-15 to 1e-1 and sweeping 1e-5 to 1e-3 both find the same base LR of, say, 1e-4). Perhaps you can share an example demonstrating that it doesn't find the best LR range?
@mdhimes That's a good point, there isn't any reason the base LR would differ too significantly when testing with Adam.
I've had bad results running my LR range test with SGD and then trying those learning rates with an Adam optimizer. In particular, I've had an LR of 0.001 work with plain Adam, and an SGD-based LR range test also conclude max_lr=0.001, but then had very unstable training with CLR + Adam using max_lr=0.001.
I haven't seen this looked at rigorously in papers (only blog posts doing a single run of CLR with Adam on one dataset, as opposed to Smith's work, which focused on SGD + CLR and some forms of regularization: https://arxiv.org/pdf/1708.07120, https://arxiv.org/pdf/1803.09820, and https://arxiv.org/pdf/1506.01186), so I'm not sure there is a consensus on combining CLR and Adam. In the meantime, the simple solution may just be to use Adam during the range test if you're set on using Adam + CLR during training.
@robert-giaquinto these are good points... Someone should do a thorough investigation of it, I think it'd make for a good paper. I'm sure there is some way to do the LR range test with Adam.
One thing I want to mention is that the LR range test discussed in Smith (2015), where accuracy vs. LR is plotted to determine the LR range, isn't very telling when using Adam in my experience. However, using validation loss vs. LR is usually quite clear, since you can see when things become unstable (the loss is roughly constant for tiny LRs, starts decreasing at some base LR, decreases smoothly until some slightly-too-large LR, and then blows up). I haven't done nearly a thorough enough investigation to conclude that the LR range test with Adam works when done in this manner, but I have observed that it leads to results that are better than using a constant LR with Adam. YMMV.
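For anyone who wants to reproduce that kind of plot, here is a rough sketch of a loss-vs-LR range test run with Adam in Keras. It is my own illustration, not this repo's code; the LR bounds and step count are placeholders, and it records per-batch training loss (the comment above uses validation loss, which you could evaluate less frequently instead).

```python
import numpy as np
import tensorflow as tf

class LRRangeTest(tf.keras.callbacks.Callback):
    """Sweep the learning rate exponentially from min_lr to max_lr over num_steps
    batches, recording (lr, loss) pairs for a loss-vs-LR plot."""
    def __init__(self, min_lr=1e-7, max_lr=1e-1, num_steps=1000):
        super().__init__()
        self.lrs = np.logspace(np.log10(min_lr), np.log10(max_lr), num_steps)
        self.history = []  # (lr, training loss) pairs
        self.step = 0

    def on_train_batch_begin(self, batch, logs=None):
        self.current_lr = float(self.lrs[min(self.step, len(self.lrs) - 1)])
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, self.current_lr)

    def on_train_batch_end(self, batch, logs=None):
        self.history.append((self.current_lr, (logs or {}).get("loss")))
        self.step += 1
        if self.step >= len(self.lrs):
            # Ask Keras to end training once the sweep is finished.
            self.model.stop_training = True

# Usage sketch: compile the model with Adam, call fit() with this callback for
# enough batches to cover the sweep, then plot history on a log-x axis and read
# off the base LR (where loss starts falling) and max LR (just before it diverges).
```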
I was also surprised that accuracy is so often shown in LR range test plots. Accuracy isn't a proper scoring rule; validation loss should be much more stable and informative.