Question about the decoupled weight decay in Adam
milancurcic opened this issue · 2 comments
Currently, the relevant code is in `neural-fortran/src/nf/nf_optimizers.f90`, lines 188 to 190 at commit `6adc1c2`.
However, I'm looking at the paper and PyTorch docs again.
In the paper, in Algorithm 2, line 12, the decoupled weight decay term (our `self % weight_decay_decoupled * param`) is multiplied by the schedule multiplier.
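For reference, this is how I read that line of Algorithm 2 (with $\eta_t$ the schedule multiplier, $\alpha$ the learning rate, and $\lambda$ the weight decay):

$$\theta_t \leftarrow \theta_{t-1} - \eta_t \left( \alpha \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + \lambda \, \theta_{t-1} \right)$$

so the decay term is scaled by the schedule multiplier $\eta_t$, but not by the learning rate $\alpha$.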
In the PyTorch docs for AdamW, on the other hand, the weight decay term is multiplied by the learning rate.
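If I read their pseudocode correctly (with $\gamma$ the learning rate), the decay is applied as

$$\theta_t \leftarrow \theta_{t-1} - \gamma \, \lambda \, \theta_{t-1}$$

before the usual Adam step $\theta_t \leftarrow \theta_t - \gamma \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$.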
I looked at the Keras source and I don't even see where, or whether, the weight decay is used at all (??).
@Spnetic-5 do you also see the same discrepancy between the paper and the PyTorch docs as I do?
If yes, I suggest that we multiply it by the learning rate in our code as well. I trust that PyTorch implements it correctly more than I trust our interpretation of the paper (and papers have typos, of course).
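Here is a minimal, self-contained sketch of the difference on a single parameter; it is not the code at lines 188 to 190, and all names besides `weight_decay_decoupled` and `param`-like variables are placeholders I made up for illustration:

```fortran
! Illustrative only: compares one update step with and without the learning
! rate on the decoupled weight decay term. The Adam part of the step is a
! fixed stand-in value so that only the decay term differs between the two.
program weight_decay_comparison
  implicit none
  real, parameter :: learning_rate = 1.0e-3
  real, parameter :: weight_decay_decoupled = 1.0e-2
  real, parameter :: adam_step = 0.05  ! stand-in for alpha * m_hat / (sqrt(v_hat) + eps)
  real :: param_current, param_proposed

  param_current = 1.0
  param_proposed = 1.0

  ! Current form (decay term not scaled by the learning rate):
  param_current = param_current - adam_step &
    - weight_decay_decoupled * param_current

  ! Proposed form, following the PyTorch AdamW docs (decay term scaled by the
  ! learning rate):
  param_proposed = param_proposed - adam_step &
    - learning_rate * weight_decay_decoupled * param_proposed

  print *, 'without lr on decay term:', param_current
  print *, 'with lr on decay term:   ', param_proposed
end program weight_decay_comparison
```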
@milancurcic I first learned about AdamW from the Keras Optimizer documentation and implemented it based solely on the paper; I did not refer to the PyTorch docs or the Keras source, so I missed this.
Furthermore, yes, I can see in the PyTorch docs that they multiply by the learning rate as well. Since PyTorch is so widely used, I also think that is right. I will update our implementation and create a new PR. Thank you for bringing this to my attention!