affjljoo3581/GPT2

Activation Function

Closed this issue · 3 comments

The paper Improving Language Understanding by Generative Pre-Training (GPT) says that GELU was used as the activation function.
Which activation function is used in this code?

Also, could you tell me the reason for adding Swish?

Yes, you're right. The original paper used GELU as the activation function. However, I came across this article and decided to use Swish instead of GELU, based on empirical results.
Although the article is not authoritative and does not report general results for other NLP tasks, I observed a significant performance improvement when I changed the model's activation to Swish.
In fact, this repository does not implement the GPT-2 model strictly. The purpose of this project is not to reproduce the results described in the paper; I wrote this code to train my own sentence-generation model. So I tried to implement the features from the paper as faithfully as possible, but there are some differences.
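
For reference, here is a minimal sketch of what swapping GELU for Swish in the position-wise feed-forward block might look like, assuming a PyTorch setup. The `Swish` module and `feed_forward` helper below are illustrative only and are not taken from this repository.

```python
import torch
import torch.nn as nn


class Swish(nn.Module):
    """Swish activation: x * sigmoid(beta * x). With beta = 1 this equals SiLU."""

    def __init__(self, beta: float = 1.0):
        super().__init__()
        self.beta = beta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)


def feed_forward(dims: int, rate: int = 4, use_swish: bool = True) -> nn.Sequential:
    """Position-wise feed-forward block with a configurable activation."""
    activation = Swish() if use_swish else nn.GELU()
    return nn.Sequential(
        nn.Linear(dims, dims * rate),
        activation,
        nn.Linear(dims * rate, dims),
    )


if __name__ == "__main__":
    x = torch.randn(2, 8, 512)
    print(feed_forward(512)(x).shape)  # torch.Size([2, 8, 512])
```

With a structure like this, switching between GELU and Swish is a one-line change, which makes it easy to compare the two activations empirically.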

Oh, I got it.
Your code helped me a lot in understanding how GPT works and how it is pretrained.
Thanks a lot.

Great! I hope my code is not just something to run, but also helps people understand and study Transformer models.