Full Implementation Available + Disclaimer
OverLordGoldDragon commented
A complete implementation of AdamW, NadamW, and SGDW, with warm restarts, cosine annealing, and per-layer learning rate multipliers, is available here.
Regarding this repository's implementation, it has a major flaw: weight decay can only be set globally, for all layer weight matrices. This is rarely useful, since not all weights should be decayed (biases and normalization parameters, for example, are typically excluded) and most benefit from weight-specific treatment. Also, no warning is given against combining weight decay with an L2 penalty, which is important: applying both penalizes the weights twice and reintroduces the coupling between decay and the gradient update that decoupled weight decay was designed to remove, defeating the purpose of either.
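As a rough illustration of what per-weight decay means (this is a minimal sketch, not the linked implementation or this repository's API), here is a decoupled SGDW-style step where a `weight_decays` dict assigns each named parameter its own decay, and parameters left out of the dict (e.g. biases) are not decayed; all names and values are hypothetical:

```python
import numpy as np

def sgdw_step(params, grads, lr=1e-3, weight_decays=None):
    """One decoupled-weight-decay SGD (SGDW) step.

    `weight_decays` maps a parameter name to its decay value; parameters
    absent from the dict (e.g. biases) are not decayed. The decay is applied
    directly to the weights, separately from the gradient step, rather than
    being folded into the loss as an L2 penalty.
    """
    weight_decays = weight_decays or {}
    for name, w in params.items():
        w -= lr * grads[name]            # gradient step
        decay = weight_decays.get(name, 0.0)
        if decay:
            w -= lr * decay * w          # decoupled, per-weight decay

# Hypothetical usage: decay the kernel but not the bias.
params = {"dense/kernel": np.random.randn(4, 3), "dense/bias": np.zeros(3)}
grads  = {k: np.ones_like(v) for k, v in params.items()}
sgdw_step(params, grads, lr=0.01, weight_decays={"dense/kernel": 1e-2})
```

Note that no L2 term appears in the loss or gradients here; adding one on top of the explicit decay would shrink the same weights a second time through the gradient path.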
Feedback is welcome.