Training spec #2
Could you share more details about the training run from the issue comment that reports a loss of 3.2398?
For example: the scheduler, optimizer beta1 and beta2, dropout probability, gradient clipping, learning rate, warmup steps, layer normalization, and so on.
I only know a few general training tips for parameter configuration.
Thank you.
The other details are in my code, so please take a closer look at the project. Basically, the training specifications follow the original paper and the typical settings of other transformer-based models. I think this comment is sufficient. In practice, transformer-based models tend to train well without heavy hyperparameter tuning, thanks to the large scale of the dataset and the model.
Thank you!
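For readers who land here with the same question: the thread does not say which paper "the original paper" is, so the sketch below is only an illustration, not this project's configuration. It assumes the reference is "Attention Is All You Need" and reproduces the learning-rate schedule and optimizer settings published there (Adam with beta1 = 0.9, beta2 = 0.98, epsilon = 1e-9, 4000 warmup steps, dropout 0.1 for the base model). The names `noam_lr` and `ASSUMED_SPEC` are hypothetical; the project's code remains the authoritative source for its actual values.

```python
# Illustrative only: values taken from "Attention Is All You Need",
# NOT confirmed to be what this project uses. Check the project's code.
ASSUMED_SPEC = {
    "d_model": 512,        # base model width in the paper
    "warmup_steps": 4000,  # linear warmup length in the paper
    "adam_beta1": 0.9,
    "adam_beta2": 0.98,
    "adam_eps": 1e-9,
    "dropout": 0.1,        # base model dropout in the paper
}


def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate at a 1-indexed optimizer step.

    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    The rate rises linearly during warmup, then decays as 1/sqrt(step).
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    # Print the rate at a few steps to see the warmup and the decay.
    for s in (1, 1000, 4000, 16000, 100000):
        print(s, noam_lr(s, ASSUMED_SPEC["d_model"], ASSUMED_SPEC["warmup_steps"]))
```

The schedule peaks exactly at `warmup_steps`. Gradient clipping is not specified in that paper, so it is left out here rather than guessed.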