Add new paper: Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers
wyzh0912 commented
Title: Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers
Head: Induction head, Retrieval head
Published: arXiv
Summary:
- Innovation: Introduced the retrieval problem, a simple reasoning task that can be solved only by transformers with at least a certain minimum number of layers.
- Tasks: Trained several transformers on a minimal formulation of the retrieval problem and studied the attention maps of the trained models.
- Result: Transformers solve the task through a gradually emerging induction-head mechanism, aided by an implicit curriculum that progressively adds more heads.
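To make the summary above concrete, here is a toy sketch of a multi-hop retrieval task in the spirit of the paper's retrieval problem (this is an illustrative assumption, not the paper's exact formulation): the sequence lists key→value pairs where some values are themselves keys, forming a chain, and answering the final query requires following the chain for several hops, which plausibly requires stacked attention heads across layers.

```python
import random

def make_retrieval_example(num_pairs=8, hops=2, vocab=100, seed=0):
    """Toy d-hop retrieval example (illustrative; names and setup are assumptions).

    Builds `num_pairs` key->value pairs. The first `hops` pairs form a chain
    k0 -> k1 -> ... -> k_hops; the rest are distractors. The query token is
    the chain's start key; the answer is the end of the chain.
    """
    rng = random.Random(seed)
    symbols = rng.sample(range(vocab), num_pairs + 1)   # unique keys
    chain = symbols[:hops + 1]                          # k0, k1, ..., k_hops
    distractors = symbols[hops + 1:]                    # keys not on the chain
    pairs = [(chain[i], chain[i + 1]) for i in range(hops)]
    pairs += [(k, rng.choice(distractors)) for k in distractors]
    rng.shuffle(pairs)
    # Flatten pairs into a token sequence and append the query token.
    tokens = [t for kv in pairs for t in kv] + [chain[0]]
    return tokens, chain[-1]

tokens, answer = make_retrieval_example()
```

A model answering the query must do one key lookup per hop, so deeper chains would need more stacked lookup (induction-style) heads.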