Question about the decoupled weight decay in Adam
milancurcic opened this issue · 2 comments
Currently, the relevant code is in `neural-fortran/src/nf/nf_optimizers.f90`, lines 188 to 190 at commit `6adc1c2`.
However, I'm looking at the paper and PyTorch docs again.
In the paper, in Algorithm 2, line 12, the decoupled weight decay term (our `self % weight_decay_decoupled * param`) is multiplied by the schedule multiplier.
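For reference, this is how I read that line of Algorithm 2 (with $\eta_t$ the schedule multiplier, $\alpha$ the learning rate, and $\lambda$ the weight decay):

$$\theta_t \leftarrow \theta_{t-1} - \eta_t \left( \alpha \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + \lambda \, \theta_{t-1} \right)$$

so the decay term is scaled by the schedule multiplier $\eta_t$, but not by the learning rate $\alpha$.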
In the PyTorch docs for AdamW, on the other hand, the weight decay term is multiplied by the learning rate.
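If I read their pseudocode correctly (with $\gamma$ the learning rate), the decay is applied as

$$\theta_t \leftarrow \theta_{t-1} - \gamma \, \lambda \, \theta_{t-1}$$

before the usual Adam step $\theta_t \leftarrow \theta_t - \gamma \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$.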
I looked at the Keras source and I don't even see where, or whether, the weight decay is used at all (??).
@Spnetic-5 do you also see the same discrepancy between the paper and the PyTorch docs as I do?
If yes, I suggest that we multiply it by the learning rate in our code as well. I trust that PyTorch implements it correctly more than I trust our interpretation of the paper (and papers have typos, of course).
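Here is a minimal, self-contained sketch of the difference on a single parameter; it is not the code at lines 188 to 190, and all names besides `weight_decay_decoupled` and `param`-like variables are placeholders I made up for illustration:

```fortran
! Illustrative only: compares one update step with and without the learning
! rate on the decoupled weight decay term. The Adam part of the step is a
! fixed stand-in value so that only the decay term differs between the two.
program weight_decay_comparison
  implicit none
  real, parameter :: learning_rate = 1.0e-3
  real, parameter :: weight_decay_decoupled = 1.0e-2
  real, parameter :: adam_step = 0.05  ! stand-in for alpha * m_hat / (sqrt(v_hat) + eps)
  real :: param_current, param_proposed

  param_current = 1.0
  param_proposed = 1.0

  ! Current form (decay term not scaled by the learning rate):
  param_current = param_current - adam_step &
    - weight_decay_decoupled * param_current

  ! Proposed form, following the PyTorch AdamW docs (decay term scaled by the
  ! learning rate):
  param_proposed = param_proposed - adam_step &
    - learning_rate * weight_decay_decoupled * param_proposed

  print *, 'without lr on decay term:', param_current
  print *, 'with lr on decay term:   ', param_proposed
end program weight_decay_comparison
```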
@milancurcic I first learned about AdamW from the Keras Optimizer documentation and implemented it based solely on the paper; I did not refer to the PyTorch docs or the Keras source, so I missed this.
Furthermore, yes, I can see in the PyTorch docs that they multiply by the learning rate as well. Since PyTorch is so widely used, I also think that is right. I will update our implementation and create a new PR. Thank you for bringing this to my attention!