GLambard/AdamW_Keras

Full Implementation Available + Disclaimer


A complete implementation of AdamW, NadamW, and SGDW, with warm restarts, cosine annealing, and per-layer learning rate multipliers, is available here.
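For reference, "cosine annealing with warm restarts" refers to the SGDR schedule of Loshchilov & Hutter. Below is a minimal sketch of that schedule only; it is independent of the linked implementation, and the function and parameter names (`cosine_annealing_with_restarts`, `first_cycle`, `cycle_mult`) are illustrative assumptions.

```python
import math

def cosine_annealing_with_restarts(step, lr_max=1e-3, lr_min=0.0,
                                   first_cycle=1000, cycle_mult=2):
    """Learning rate at `step` under SGDR-style cosine annealing.

    The LR decays from lr_max to lr_min over one cycle, then restarts;
    each cycle is `cycle_mult` times longer than the previous one.
    Parameter names are illustrative, not the linked repo's API.
    """
    cycle_len = first_cycle
    while step >= cycle_len:
        step -= cycle_len          # position within the current cycle
        cycle_len *= cycle_mult    # next cycle is longer
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / cycle_len))
```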

Regarding this repository's implementation, it has a major flaw: weight decay can only be set globally, for all layer weight matrices. This is rarely useful, since not all weights should be decayed, and those that should often need weight-specific treatment. Also, there is no warning against combining weight decay with an L2 penalty, which matters because using both defeats the purpose of either.
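To illustrate the weight-specific point, here is a minimal sketch of decoupled weight decay applied only to selected variables in a custom Keras training step, with no L2 penalty in the loss. This is not this repository's API nor the linked implementation's; `decay_rate`, `wants_decay`, and the "kernels only" rule are illustrative assumptions.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.MeanSquaredError()
decay_rate = 1e-4  # illustrative; decoupled decay applied outside the loss/gradient

def wants_decay(var):
    # Common heuristic: decay kernel matrices only; skip biases and BatchNorm params.
    return "kernel" in var.name

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        # No L2 (kernel_regularizer) term here: decay is handled separately below.
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # Decoupled weight decay: shrink only the selected weights after the Adam update.
    for var in model.trainable_variables:
        if wants_decay(var):
            var.assign_sub(decay_rate * var)
    return loss
```

Adding a kernel_regularizer L2 term on top of a step like this would penalize the same weights twice, through the gradient and again through the explicit decay, which is exactly the interaction the issue warns about.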

Feedback is welcome.