IAAR-Shanghai/Awesome-Attention-Heads

Add new paper: Transformers on Markov Data: Constant Depth Suffices

Title: Transformers on Markov Data: Constant Depth Suffices
Head: Induction Head
Published: NeurIPS 2024
Summary:

  • Innovation: Proved that a transformer with a single head and three layers can represent the in-context conditional empirical distribution for kth-order Markov sources (see the sketch after this list).
  • Tasks: Analyzed the performance of low-depth Transformers trained on kth-order Markov sources.
  • Significant Result: Proved a conditional lower bound showing that attention-only transformers need Ω(log(k)) layers to represent kth-order induction heads, under an assumption on the realized attention patterns.
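
A minimal sketch (not from the paper) of the statistic the first bullet refers to: the in-context conditional empirical distribution that a kth-order induction head represents, computed here by brute-force counting over the context. The function name, the uniform fallback for unseen contexts, and the toy sequence are illustrative assumptions.

```python
from collections import Counter

def kth_order_empirical_dist(seq, k, alphabet):
    """Empirical distribution of the next symbol given the last k symbols,
    estimated in-context: the quantity a kth-order induction head represents."""
    suffix = tuple(seq[-k:])              # conditioning context: last k symbols
    counts = Counter()
    for t in range(k, len(seq)):          # scan every completed k-gram in the context
        if tuple(seq[t - k:t]) == suffix:
            counts[seq[t]] += 1           # symbol that followed this occurrence
    total = sum(counts.values())
    if total == 0:                        # context never seen before: uniform fallback (assumption)
        return {a: 1.0 / len(alphabet) for a in alphabet}
    return {a: counts[a] / total for a in alphabet}

# Toy example: binary sequence, second-order context (k = 2)
seq = [0, 1, 1, 0, 1, 1, 0, 1]
print(kth_order_empirical_dist(seq, k=2, alphabet=[0, 1]))  # {0: 0.0, 1: 1.0}
```

The paper's result is that representing this statistic does not require depth growing with k: three layers with a single head suffice, even though the brute-force count above makes the kth-order dependence explicit.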
