Why is warmup better than RAdam?
Closed this issue · 3 comments
brando90 commented
I've argued in LiyuanLucasLiu/RAdam#62 that, if warmup and RAdam are equivalent, using RAdam might be simpler. However, I'd be curious about the arguments in favour of warmup presented in this repo and the related paper.
What are the reasons to choose warmup instead of RAdam?
Tony-Y commented
It is because the untuned linear warmup works well and is easy to implement.
brando90 commented
> It is because the untuned linear warmup works well and is easy to implement.
But RAdam requires no tuning. Doesn't that make it better than warmup?
Tony-Y commented
The untuned warmup, whose schedule is determined by beta2, requires no tuning either.
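To illustrate the "easy to implement" and "no tuning" points, here is a minimal sketch of an untuned linear warmup factor. It assumes the rule of thumb that the warmup lasts 2 / (1 - beta2) steps, which is how I understand the related paper; the function names are hypothetical and this is not the repo's actual code.

```python
def warmup_period(beta2: float = 0.999) -> float:
    """Number of warmup steps implied by beta2 (assumed rule: 2 / (1 - beta2))."""
    if not 0.0 <= beta2 < 1.0:
        raise ValueError("beta2 must be in [0, 1)")
    return 2.0 / (1.0 - beta2)


def untuned_linear_warmup_factor(step: int, beta2: float = 0.999) -> float:
    """Multiplier in [0, 1] applied to the base learning rate at `step`.

    The factor rises linearly from 0 and is capped at 1 once the
    warmup period derived from beta2 has elapsed.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if not 0.0 <= beta2 < 1.0:
        raise ValueError("beta2 must be in [0, 1)")
    return min(1.0, 0.5 * (1.0 - beta2) * step)


if __name__ == "__main__":
    base_lr = 1e-3  # hypothetical base learning rate
    for step in (0, 500, 1000, 2000, 5000):
        lr = base_lr * untuned_linear_warmup_factor(step, beta2=0.999)
        print(step, lr)
```

With Adam's common default of beta2 = 0.999, the warmup spans roughly 2000 steps, so the factor is about 0.5 at step 1000 and stays at 1.0 afterwards. The only input is beta2, which is already an optimizer hyperparameter, so no separate warmup length has to be chosen.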