juntang-zhuang/Adabelief-Optimizer

Results on ImageNet with tuning weight decay

XuezheMax opened this issue · 11 comments

I quickly ran some experiments on ImageNet with different weight decay rates.

Using AdamW with wd=1e-2 and keeping the other hyperparameters the same as reported in the AdaBelief paper, the average accuracy over 3 runs is 69.73%, much better than the corresponding result reported in the paper. I will keep updating results for other optimizers and weight decay rates.

It seems the effect of weight decay dominates the effect of the optimizer in this case. What learning rate schedule did you use? Does that influence the results?

I used the same lr schedule: decay by 0.1 at epochs 70 and 80.
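
For reference, here is a minimal PyTorch sketch of the setup described above (AdamW with decoupled wd=1e-2 and the step schedule at epochs 70/80 over 90 epochs). The learning rate of 1e-3 is only an assumed placeholder for illustration, not a value confirmed in this thread:

```python
import torch
import torchvision

# ResNet-18, as in the experiments discussed here.
model = torchvision.models.resnet18()

# AdamW applies decoupled weight decay; wd=1e-2 is the value used above.
# lr=1e-3 is an assumption, not taken from this thread.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Decay the learning rate by 0.1 at epochs 70 and 80, with 90 epochs in total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[70, 80], gamma=0.1)

for epoch in range(90):
    # ... one training pass over ImageNet goes here ...
    scheduler.step()
```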

Thanks for your feedback. Just curious, what hardware did you use? I'm quite surprised that you could finish 3 runs within 12 hours (since your earliest post on weight decay here). Typically one round of ImageNet training takes me 3 to 4 days with 4 GPUs.

I ran with 8 V100s (from AWS), and it took around 10 hours to complete the 90-epoch training.
One comment that might be useful for you: CPU memory is sometimes the bottleneck when running ImageNet experiments, since the dataset is very large.

Thanks for the suggestions and experiments; that might be the reason. I feel quite stuck experimenting with my 1080 GPU.

I ran experiments with Adam and RAdam on ResNet-18. I decoupled the weight decay for both, so they are actually AdamW and RAdamW. The lr schedule is the same as for AdaBelief: decaying by 0.1 at epochs 70 and 80, with 90 epochs of training in total. The implementation is from this repo.
Here are the updates (3 runs for each experiment):

| method | wd=1e-2 | wd=1e-4 |
|--------|---------|---------|
| AdamW  | 69.73   | 67.57   |
| RAdamW | 69.80   | 67.68   |

I think these results suggest that the ImageNet baselines need to be updated to use the same weight decay of 1e-2.

It surprises me that RAdam does not outperform Adam, since RAdam uses decoupled weight decay. Do you have any results for AdamW with a larger weight decay? Based on your results, I somewhat doubt whether decoupled weight decay is actually helpful. BTW, is the result reported in the Apollo paper achieved by Apollo or ApolloW?

Oh, sorry for the confusion. In my results above, Adam is actually AdamW. Without decoupling the weight decay, Adam works significantly worse than AdamW.
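
To make that distinction concrete, here is a minimal sketch (not the code from this repo) of a single parameter update in the coupled (Adam + L2) and decoupled (AdamW) forms, ignoring bias correction for brevity:

```python
import torch

def adam_l2_step(theta, grad, m, v, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    # Coupled decay: wd * theta is folded into the gradient, so it gets rescaled
    # by the second-moment preconditioner along with everything else.
    g = grad + wd * theta
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    return theta - lr * m / (v.sqrt() + eps), m, v

def adamw_step(theta, grad, m, v, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    # Decoupled decay: the adaptive step uses only the raw gradient, and the
    # decay is subtracted from the weights directly.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    return theta - lr * m / (v.sqrt() + eps) - lr * wd * theta, m, v
```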

For the results in the Apollo paper, I did not decouple the weight decay for Apollo. I tried ApolloW, but the performance is similar to Apollo.

Thanks a lot. I think your results suggest that weight decay was not properly set for the AdamW family, and the baselines need to be improved.

Looking at the literature, I found something weird: [1] also uses AdamW and sets the weight decay to 5e-2, which is also a large value, yet they achieve 67.93. Although the authors claim they performed a grid search, I'm not sure whether their grid includes 1e-2 as you used here. I'll take a more careful look later to see if some training details differ from yours.

BTW, regarding Apollo: is it because Apollo is scale-variant, so its weight decay behaves similarly to a decoupled weight decay, like in SGD? Any idea why Apollo is not influenced much by decoupling the weight decay?

[1] Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

I also tried wd=5e-2 for AdamW, and the results are even slightly better than with wd=1e-2. So I guess the models in [1] were not properly trained.

For Apollo and SGD, I think one possible reason that decoupled weight decay is not so influential is that they were not using second-order momentum. There is a new ICLR 2021 submission about stable weight decay in Adam; maybe we can get some ideas from it :-)
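
A small numerical sketch of why decoupling makes no difference without a second-moment preconditioner (a toy NumPy example, not from the Apollo or AdaBelief code): for plain SGD the two forms of weight decay produce exactly the same update, whereas for Adam the coupled term is rescaled per coordinate.

```python
import numpy as np

lr, wd = 0.1, 1e-2
theta = np.array([1.0, -2.0])
grad = np.array([0.3, 0.5])

# SGD with L2 regularization folded into the gradient ("coupled")
coupled = theta - lr * (grad + wd * theta)

# SGD with decoupled weight decay applied directly to the weights
decoupled = theta - lr * grad - lr * wd * theta

assert np.allclose(coupled, decoupled)  # identical updates for SGD

# With Adam, the coupled term (grad + wd * theta) is divided by sqrt(v_hat) + eps,
# so coordinates with large gradient variance receive less effective decay,
# and the two forms are no longer equivalent.
```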

> Thanks for your feedback. Just curious, what hardware did you use? I'm quite surprised that you could finish 3 runs within 12 hours (since your earliest post on weight decay here). Typically one round of ImageNet training takes me 3 to 4 days with 4 GPUs.

This could be reasonable. According to this benchmark, V100s are 5x faster than 1080 Tis.