IAAR-Shanghai/Awesome-Attention-Heads

Add new paper: Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers


Title: Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers
Head: Induction head, Retrieval head
Published: arXiv
Summary:

  • Innovation: Introduced the retrieval problem, a simple reasoning task that can only be solved by transformers with at least a certain number of layers (see the sketch after this list).
  • Tasks: Trained several transformers on a minimal formulation of the retrieval problem and analyzed the attention maps of the trained models.
  • Result: Transformers solve the task through a gradually emerging induction-head mechanism, aided by an implicit curriculum that progressively recruits additional heads.
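
The paper's exact task specification is not reproduced in this issue; the snippet below is a minimal sketch of one plausible key-value formulation of a retrieval problem, just to build intuition for the summary above. The function name, vocabulary sizes, and sequence layout are illustrative assumptions, not the paper's code.

```python
import random

# Hedged sketch: a toy key-value retrieval task. A sequence lists
# key-value pairs, then repeats one key as the query; the target is the
# value originally paired with that key. Solving it requires attending
# back to the matching key -- the kind of pattern an induction/retrieval
# head implements. All details here are assumptions for illustration.

def make_retrieval_example(num_pairs=4, num_keys=16, num_values=16, seed=None):
    rng = random.Random(seed)
    keys = rng.sample(range(num_keys), num_pairs)       # distinct keys
    values = [rng.randrange(num_values) for _ in keys]  # arbitrary values
    query = rng.choice(keys)                            # key to retrieve
    target = values[keys.index(query)]
    # Flatten into a token sequence: k1 v1 k2 v2 ... query -> target
    tokens = [t for pair in zip(keys, values) for t in pair] + [query]
    return tokens, target

if __name__ == "__main__":
    tokens, target = make_retrieval_example(seed=0)
    print("input :", tokens)
    print("target:", target)
```

Under this framing, a multi-hop variant (chaining one retrieval into the next) would plausibly demand additional layers, which is consistent with the paper's claim that the task requires a minimum depth.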