LiyuanLucasLiu/Transformer-Clinic

Position of residual connection in PreLN architecture is wrong

bilzard opened this issue · 1 comment

In the current implementation, the residual connection in the feed-forward block is taken after the layer norm [1].

        x0 = self.maybe_layer_norm(self.self_attn_layer_norm, x0, after=True)
        residual = x0

However, according to the paper on the PreLN architecture [2], the residual should be taken before the layer norm.

        residual = x0
        x0 = self.maybe_layer_norm(self.self_attn_layer_norm, x0, after=True)

Sorry, this was my misunderstanding. Since this layer norm call (`after=True`) applies only to the post-norm architecture, it isn't a problem.

        x0 = self.maybe_layer_norm(self.self_attn_layer_norm, x0, after=True)
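
For anyone else who trips over this, here is a minimal sketch of how a `before`/`after` gated layer norm can behave. It is modeled on the fairseq-style helper and is an assumption about how `maybe_layer_norm` works, not a copy of this repository's code. The `normalize_before` argument and the `toy_layer_norm` function are hypothetical names used only for illustration.

```python
def toy_layer_norm(x):
    # Hypothetical stand-in for a real LayerNorm: only centers the values.
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def maybe_layer_norm(layer_norm, x, normalize_before, before=False, after=False):
    # Each call site is marked as exactly one of before/after.
    assert before ^ after
    # Assumed gating rule: pre-norm mode (normalize_before=True) runs only
    # the before=True call sites, post-norm mode runs only the after=True ones.
    if after ^ normalize_before:
        return layer_norm(x)
    return x


x0 = [1.0, 2.0, 3.0]

# Pre-norm: the after=True call is a no-op, so the residual taken
# right after it is still the un-normalized input.
residual = maybe_layer_norm(toy_layer_norm, x0, normalize_before=True, after=True)
print(residual)  # [1.0, 2.0, 3.0]

# Post-norm: the same call site actually normalizes.
print(maybe_layer_norm(toy_layer_norm, x0, normalize_before=False, after=True))  # [-1.0, 0.0, 1.0]
```

Under this assumption, assigning `residual = x0` after the `after=True` call is harmless in PreLN mode, because that call returns `x0` unchanged.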