Implementation of the "Adaptive Attention Span in Transformers" paper.
Primary language: Jupyter Notebook