nabsabraham/focal-tversky-unet

Learning Rate Decay or Typo?


Hello, I was just reading your paper and came across the statement "Both models were optimized using stochastic gradient descent with momentum, using an initial learning rate at 0.01 which decays by 10⁻⁶ on every epoch." I was wondering whether the 1e-6 is really a learning rate decay, or whether it is actually a weight decay and the sentence contains a typo. I did not find either scenario spelled out anywhere else, so I would appreciate a clarification, because a decay of 1e-6 per epoch is nearly nothing: the learning rate would only drop from 0.01 to about 0.0099 over 100 epochs. Thanks a lot.
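
For reference, here is the back-of-the-envelope calculation I have in mind, as a rough sketch under either reading of "decays by 10⁻⁶ on every epoch" (a simple per-epoch subtraction, or the Keras-style `decay` argument). This is my own illustration, not code from this repository, and the "100 epochs" figure is just for the numbers above:

```python
# Two possible readings of "learning rate 0.01 decaying by 1e-6 every epoch".
lr0, decay, epochs = 0.01, 1e-6, 100

# Reading 1: subtract 1e-6 from the learning rate once per epoch.
lr_subtractive = lr0 - decay * epochs            # 0.0099

# Reading 2: Keras-style inverse-time decay, lr = lr0 / (1 + decay * t)
# (Keras counts t in update steps, but I use epochs here for illustration).
lr_inverse_time = lr0 / (1.0 + decay * epochs)   # ~0.0099999

print(lr_subtractive, lr_inverse_time)  # both are still ~0.01 after 100 epochs
```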