affjljoo3581/GPT2

GPT-2 implementation problem


"Hi, I am reading the GPT-2 paper and encountering a problem with the following phrase related to implementation:

'A modified initialization method is used to account for the accumulation on the residual path with model depth. We scale the weights of residual layers at initialization by a factor of 1/√N, where N is the number of residual layers.'

My question is this: we normalize after the accumulation (addition followed by normalization), so why do we need to scale the weights? Isn't the normalization already there to reduce the impact of the accumulation?"

Hi, thanks for your comment.

First of all, I am not the author of the GPT-2 paper, and this repository is an unofficial implementation of the model, so it would be better to ask the authors directly.

But as far as I know, many transformer models, including LLMs and ViTs, use pre-layernorm: the normalization is applied before each sublayer, not after the residual addition. The hidden states on the residual path are therefore never normalized, so each layer's hidden state norm is larger than that of the previous layer; simply put, more layers make a larger norm. Hence the residual weights are scaled down at initialization for training stability in the early stage.
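As a rough sketch of how that scaling can look in practice (PyTorch, with hypothetical names such as `n_layer` and `proj`; this is not this repository's actual code), the residual output projections can be initialized with a standard deviation scaled by 1/√N:

```python
import math

import torch.nn as nn

# Hypothetical GPT-2-small-like sizes, for illustration only.
n_layer = 12   # number of transformer blocks
d_model = 768  # hidden size

# Each block has two residual branches (attention and MLP),
# so N = 2 * n_layer residual layers feed the residual path.
num_residual_layers = 2 * n_layer

# Output projection of one residual branch (e.g. the attention output
# projection or the second MLP linear): use the usual 0.02 std, scaled
# down by 1/sqrt(N) so the residual sum does not blow up with depth.
proj = nn.Linear(d_model, d_model)
nn.init.normal_(proj.weight, mean=0.0, std=0.02 / math.sqrt(num_residual_layers))
nn.init.zeros_(proj.bias)
```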

You are right that the scaling would not be required with post-layernorm, where the LayerNorm is placed after the addition and re-normalizes the accumulated hidden states. But be aware that GPT-2 uses pre-layernorm.
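To make the difference between the two orderings concrete, here is a minimal sketch (again PyTorch, with hypothetical class names; not the actual block used in this repository):

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-layernorm: normalize *before* the sublayer; the residual
    stream itself is never normalized, so its norm grows with depth."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.ln(x))

class PostLNBlock(nn.Module):
    """Post-layernorm (original Transformer): normalize *after* the
    addition, so the accumulated residual is re-normalized every layer."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ln(x + self.sublayer(x))

# Toy usage: wrap the same MLP sublayer in both styles.
mlp = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
x = torch.randn(1, 16, 768)
y_pre = PreLNBlock(768, mlp)(x)    # residual norm keeps accumulating
y_post = PostLNBlock(768, mlp)(x)  # output is re-normalized here
```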