Position of residual connection in PreLN architecture is wrong
bilzard opened this issue · 1 comment
bilzard commented
In the current implementation, the residual connection in the feed-forward block is taken after the layer norm [1].
x0 = self.maybe_layer_norm(self.self_attn_layer_norm, x0, after=True)
residual = x0
However, according to the PreLN paper [2], the residual should be taken before the layer norm.
residual = x0
x0 = self.maybe_layer_norm(self.self_attn_layer_norm, x0, after=True)
bilzard commented
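Sorry, this was my misunderstanding. Since this layer norm call (after=True) is only applied in the post-norm architecture, it isn't a problem: in the PreLN configuration the call leaves x0 unchanged, so the residual is still the un-normalized input.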
x0 = self.maybe_layer_norm(self.self_attn_layer_norm, x0, after=True)
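For anyone who hits the same confusion, here is a minimal sketch of why the ordering is harmless. It assumes `maybe_layer_norm` follows the convention used in some Transformer codebases, where the `before`/`after` flags are combined with a `normalize_before` setting; the helper's real body isn't quoted in this thread, so the `Block` class, the `normalize_before` attribute, and the plain-Python `layer_norm` below are all hypothetical stand-ins.

```python
import math


def layer_norm(x, eps=1e-5):
    # Plain-Python stand-in for a layer norm over a list of floats
    # (no learned scale or bias).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


class Block:
    # Hypothetical stand-in for the layer discussed in this issue.
    def __init__(self, normalize_before):
        # True means PreLN, False means PostLN (assumed flag name).
        self.normalize_before = normalize_before

    def maybe_layer_norm(self, norm, x, before=False, after=False):
        # Assumed convention: exactly one of before/after is set, and the
        # norm runs only when the flag matches the configured architecture.
        assert before ^ after
        if after ^ self.normalize_before:
            return norm(x)
        return x


x0 = [1.0, 2.0, 4.0]

# PreLN: the after=True call is a no-op, so `residual = x0` on the next
# line still captures the un-normalized tensor.
pre_ln_out = Block(normalize_before=True).maybe_layer_norm(layer_norm, x0, after=True)

# PostLN: the after=True call normalizes the attention sub-layer output
# before it becomes the residual for the feed-forward block.
post_ln_out = Block(normalize_before=False).maybe_layer_norm(layer_norm, x0, after=True)
```

Under that assumption, `pre_ln_out` is the same object as `x0`, while `post_ln_out` is a normalized copy, which matches the conclusion above.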