Add&Norm layer is missing after each attention layer
Yizheng-Sun opened this issue · 0 comments
Yizheng-Sun commented
Hi,
According to the transformer architecture, there should be an Add&Norm layer after each attention layer. However, in the code in /docs/tutorials/transformer.ipynb, these Add&Norm layers appear to be missing.
Take the decoder layer as an example; the original code is:
```python
def call(self, x, context):
  x = self.causal_self_attention(x=x)
  x = self.cross_attention(x=x, context=context)

  # Cache the last attention scores for plotting later.
  self.last_attn_scores = self.cross_attention.last_attn_scores

  x = self.ffn(x)  # Shape `(batch_size, seq_len, d_model)`.
  return x
```
With the Add&Norm layers added, it would become:
```python
def call(self, x, context):
  x = self.add([x, self.causal_self_attention(x=x)])
  x = self.layer_norm1(x)
  x = self.add([x, self.cross_attention(x=x, context=context)])
  x = self.layer_norm2(x)

  # Cache the last attention scores for plotting later.
  self.last_attn_scores = self.cross_attention.last_attn_scores

  x = self.ffn(x)  # Shape `(batch_size, seq_len, d_model)`.
  return x
```
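For completeness, here is a minimal sketch of the `__init__` that would back the proposed `call` above. The attribute names `add`, `layer_norm1` and `layer_norm2` are just the ones I used in the snippet, and the sub-layer constructors are assumed to match the tutorial's `CausalSelfAttention`, `CrossAttention` and `FeedForward` classes defined earlier in the notebook:

```python
import tensorflow as tf

class DecoderLayer(tf.keras.layers.Layer):
  """Decoder layer with explicit Add & Norm, assuming the tutorial's
  CausalSelfAttention, CrossAttention and FeedForward classes."""

  def __init__(self, *, d_model, num_heads, dff, dropout_rate=0.1):
    super().__init__()
    # Attention and feed-forward sub-layers as in the tutorial (argument
    # names assumed from the notebook).
    self.causal_self_attention = CausalSelfAttention(
        num_heads=num_heads, key_dim=d_model, dropout=dropout_rate)
    self.cross_attention = CrossAttention(
        num_heads=num_heads, key_dim=d_model, dropout=dropout_rate)
    self.ffn = FeedForward(d_model, dff)

    # Explicit residual-add and layer-norm layers used by the call() above.
    self.add = tf.keras.layers.Add()
    self.layer_norm1 = tf.keras.layers.LayerNormalization()
    self.layer_norm2 = tf.keras.layers.LayerNormalization()
```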
I wonder whether those Add&Norm layers were left out on purpose.