kozistr/pytorch_optimizer

Modified AdaFactor by ViT paper


Modified AdaFactor by ViT paper:

here (Section 3.4):

Actually, I'm not sure whether this is already implemented, or whether it can be achieved by tweaking the parameters of the existing Adafactor optimizer. Please let me know if that's the case.

Adafactor optimizer. The above optimizer still induces a large memory overhead. Thus, we turn our attention to the Adafactor optimizer [35], which stores the second momentum using a rank-1 factorization. From a practical point of view, this results in negligible memory overhead. However, the Adafactor optimizer did not work out of the box, so we make the following modifications:

- We re-introduce the first momentum in half-precision, whereas the recommended setting does not use the first momentum at all.
- We disable scaling of the learning rate relative to weight norms, a feature that is part of Adafactor.
- Adafactor gradually increases the second momentum from 0.0 to 1.0 throughout the course of training. In our preliminary experiments, we found that clipping the second momentum at 0.999 (Adam's default value) results in better convergence, so we adopt it.

The resulting optimizer introduces only a 50% memory overhead on top of the space needed to store the model's parameters. We observe that both proposed optimizers perform on par with or slightly better than the original Adam optimizer. We are aware of other memory-efficient optimizers [32, 40]; we leave their exploration to future work.
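
For reference, here is a minimal sketch of what those three tweaks could look like in plain PyTorch. This is not pytorch_optimizer's actual API; the names `beta2_schedule` and `init_state` are hypothetical, and the 0.8 decay exponent is an assumption based on Adafactor's commonly used schedule `beta2_t = 1 - t^(-0.8)`:

```python
import torch


def beta2_schedule(step: int, decay: float = 0.8, cap: float = 0.999) -> float:
    """Adafactor-style increasing beta2, clipped at Adam's default 0.999."""
    return min(1.0 - float(step + 1) ** (-decay), cap)


def init_state(param: torch.Tensor) -> dict:
    """Per-parameter state: fp16 first momentum + rank-1 factored second momentum."""
    assert param.dim() == 2, "factored second momentum shown for 2D weights only"
    return {
        "step": 0,
        # re-introduced first momentum, kept in half precision (~50% extra memory)
        "exp_avg": torch.zeros_like(param, dtype=torch.float16),
        # row/column accumulators of the factored second momentum
        "row": torch.zeros(param.shape[0]),
        "col": torch.zeros(param.shape[1]),
    }


# With the 0.8 exponent, the 0.999 cap kicks in after roughly 5.6k steps,
# e.g. beta2_schedule(10_000) == 0.999.
```

Disabling the learning-rate scaling relative to weight norms would then just mean using the raw learning rate in the update step instead of multiplying it by the parameter's RMS.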

Thanks for it! Maybe the tweaked Adafactor isn't implemented yet. I'll check it later, thank you :)