Comparison of optimizers on some standard benchmark
saurabh-kataria opened this issue · 1 comment
Thanks for your contribution. Do you provide any guideline or comparison for these optimizers? Which one would be best for standard tasks like image classification?
I mostly just implemented these and tried a few combinations on some NLI tasks; nothing exhaustive, mostly sanity checks of how they run, how the learning rate changes, and so on. So I don't have a real comparison. For research, if you aren't focusing on optimizers, it is probably better to just go with Adam or AdamW because of how standard they are. For application purposes it's hard to say. Overall, I think:
(1) RAdam is becoming more popular, but it may be equivalent to a heuristic warmup schedule (https://arxiv.org/abs/1910.04209v1), and Adam with a properly tuned warmup may still be as good or better (see the warmup sketch after this list).
(2) Lookahead is probably a decent technique to use: it is general enough to wrap almost any optimizer, and it was published at NeurIPS. It does add some computation and memory overhead, so that is a trade-off to keep in mind. You can add it on top of whichever optimizer you end up using (see the Lookahead sketch below).
(3) AMSGrad tends to give mixed results in day-to-day use. Nostalgic Adam and PAdam both report improvements in their respective papers, so if you want to experiment you can try the newer one that claims to improve on the older one. But if you are short on time, I would just stick with good old Adam, or better AdamW, rather than experimenting with all of them; Nostalgic Adam and PAdam are alternatives to consider, and QHAdam may be fine too (see the Adam/AdamW/AMSGrad sketch below). The repo lets you "combine" techniques from different papers, but I don't have results for those combinations, so you would have to experiment yourself or just stick with the standard options.
(4) The other methods I didn't mention may not be as well established in the literature. You can experiment with them, but I don't really have anything to add beyond what their papers already show in the links.
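
Here is a minimal sketch of the "Adam plus tuned warmup" baseline I mean in (1). It just uses plain `torch.optim.Adam` with a linear warmup via `LambdaLR`; the model and `warmup_steps` value are placeholders you would tune for your task, not something from this repo.

```python
# Minimal sketch (assumptions: stand-in model, warmup_steps chosen arbitrarily):
# plain Adam with a linear learning-rate warmup, the heuristic that the paper
# above argues is roughly what RAdam is doing.
import torch

model = torch.nn.Linear(128, 10)           # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_steps = 1000                         # tune for your task
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),
)

# Inside the training loop:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```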
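For (2), this is only an illustrative Lookahead wrapper, not necessarily how this repo implements it; the class name and the `k`/`alpha` arguments follow the paper's notation. The idea is that the slow weights are interpolated toward the fast weights every `k` inner steps, with any inner optimizer (Adam here).

```python
# Illustrative Lookahead wrapper (not this repo's implementation).
import torch

class Lookahead:
    def __init__(self, base_optimizer, k=5, alpha=0.5):
        self.base = base_optimizer
        self.k = k
        self.alpha = alpha
        self.step_count = 0
        # Snapshot of the "slow" weights for every parameter.
        self.slow = [
            [p.detach().clone() for p in group["params"]]
            for group in base_optimizer.param_groups
        ]

    def zero_grad(self):
        self.base.zero_grad()

    def step(self):
        self.base.step()                    # fast update from the inner optimizer
        self.step_count += 1
        if self.step_count % self.k == 0:   # every k steps, update the slow weights
            for group, slow_group in zip(self.base.param_groups, self.slow):
                for p, slow_p in zip(group["params"], slow_group):
                    slow_p += self.alpha * (p.detach() - slow_p)
                    p.data.copy_(slow_p)    # reset fast weights to the slow weights

model = torch.nn.Linear(128, 10)            # stand-in model
optimizer = Lookahead(torch.optim.Adam(model.parameters(), lr=1e-3))
```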
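And for (3), the "just stick with Adam/AdamW" route is built into `torch.optim`, with AMSGrad available as a flag on both, so it is cheap to A/B test; the hyperparameters here are only example values.

```python
# Sketch of the standard options: AdamW, and Adam with the AMSGrad variant enabled.
import torch

model = torch.nn.Linear(128, 10)            # stand-in model

adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
adam_amsgrad = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
```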