Full Implementation Available + Disclaimer
OverLordGoldDragon commented
A complete implementation of AdamW, NadamW, and SGDW, with warm restarts, cosine annealing, and per-layer learning rate multipliers, is available here.
Regarding this repository's implementation, it has a major flaw: weight decay can only be set globally, for all layer weight matrices. This is rarely useful, since not all weights should be decayed (biases and normalization parameters, for example, are typically excluded) and most benefit from weight-specific treatment. Also, no warning is given against combining weight decay with an L2 penalty, which is important: applying both penalizes the weights twice and reintroduces the coupling between decay and the gradient update that decoupled weight decay was designed to remove, defeating the purpose of either.
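As a rough illustration of what per-weight decay means (this is a minimal sketch, not the linked implementation or this repository's API), here is a decoupled SGDW-style step where a `weight_decays` dict assigns each named parameter its own decay, and parameters left out of the dict (e.g. biases) are not decayed; all names and values are hypothetical:

```python
import numpy as np

def sgdw_step(params, grads, lr=1e-3, weight_decays=None):
    """One decoupled-weight-decay SGD (SGDW) step.

    `weight_decays` maps a parameter name to its decay value; parameters
    absent from the dict (e.g. biases) are not decayed. The decay is applied
    directly to the weights, separately from the gradient step, rather than
    being folded into the loss as an L2 penalty.
    """
    weight_decays = weight_decays or {}
    for name, w in params.items():
        w -= lr * grads[name]            # gradient step
        decay = weight_decays.get(name, 0.0)
        if decay:
            w -= lr * decay * w          # decoupled, per-weight decay

# Hypothetical usage: decay the kernel but not the bias.
params = {"dense/kernel": np.random.randn(4, 3), "dense/bias": np.zeros(3)}
grads  = {k: np.ones_like(v) for k, v in params.items()}
sgdw_step(params, grads, lr=0.01, weight_decays={"dense/kernel": 1e-2})
```

Note that no L2 term appears in the loss or gradients here; adding one on top of the explicit decay would shrink the same weights a second time through the gradient path.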
Feedback is welcome.