O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers


Paper

Link: https://arxiv.org/abs/2006.04862
Year: 2020

Summary

  • this paper proves that sparse Transformers with only O(n) attention connections are universal approximators: they can approximate any continuous sequence-to-sequence function, matching the expressive power of their dense counterparts (a rough illustration of the connection count is sketched below)
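
A minimal NumPy sketch (not from the paper) of what O(n) versus O(n²) connections per attention layer means: a dense mask lets every token attend to every other token, while a sparse mask with a fixed local window plus one global token keeps the total number of connections linear in the sequence length. The window width `w` and the single global token at index 0 are illustrative assumptions.

```python
import numpy as np

def dense_mask(n):
    # Dense attention: every token attends to every token -> n^2 connections.
    return np.ones((n, n), dtype=bool)

def sparse_mask(n, w=4):
    # Illustrative sparse attention: each token attends to a local window
    # of width w plus one "global" token (index 0) -> O(n) connections.
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        mask[i, lo:hi] = True   # local window around position i
        mask[i, 0] = True       # global token
    return mask

for n in (64, 256, 1024):
    print(n, dense_mask(n).sum(), sparse_mask(n).sum())
```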

Results

  • Except for the STAR and RANDOM patterns, the networks learn to copy the input sequences with four sparse attention layers


  • The STRIDED and STAR patterns achieve the best performance across all sparsity levels on the language modeling task
  • The STRIDED and FIXED patterns with the UNION configuration show the best scores on the translation task
  • In all tasks, the RANDOM pattern performs worse than the deterministic patterns, demonstrating the need for careful design of sparsity patterns (simplified versions of these patterns are sketched after this list)
  • Overall, the experiments suggest that the optimal sparsity pattern is heavily task-dependent; for example, the STAR pattern performs best on language modeling while struggling on the copy, translation, and BERT experiments
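
To make the sparsity patterns above concrete, here is a hedged NumPy sketch that builds simplified boolean attention masks for STRIDED, FIXED, STAR, and RANDOM patterns. The exact definitions used in the paper (stride values, per-head splits, causality, and the relay-node details of the STAR pattern) may differ; the parameters `stride` and `density`, and the choice of index 0 as the relay token, are illustrative assumptions.

```python
import numpy as np

def strided_mask(n, stride=8):
    # Simplified STRIDED pattern: attend to the previous `stride` positions
    # and to every stride-th earlier position.
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    local = (i - j >= 0) & (i - j < stride)
    strided = ((i - j) % stride == 0) & (j <= i)
    return local | strided

def fixed_mask(n, stride=8):
    # Simplified FIXED pattern: attend within the current block plus fixed
    # "summary" columns at the end of each block.
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    same_block = (i // stride == j // stride) & (j <= i)
    summary = (j % stride == stride - 1) & (j <= i)
    return same_block | summary

def star_mask(n):
    # Simplified STAR pattern: a relay token (index 0) attends to everything,
    # every token attends to the relay and to its immediate neighbours.
    mask = np.eye(n, dtype=bool)
    mask[:, 0] = True
    mask[0, :] = True
    idx = np.arange(n - 1)
    mask[idx + 1, idx] = True   # left neighbour
    mask[idx, idx + 1] = True   # right neighbour
    return mask

def random_mask(n, density=0.1, seed=0):
    # RANDOM pattern: each token attends to a random subset of positions.
    rng = np.random.default_rng(seed)
    mask = rng.random((n, n)) < density
    np.fill_diagonal(mask, True)
    return mask

n = 32
for name, m in [("STRIDED", strided_mask(n)), ("FIXED", fixed_mask(n)),
                ("STAR", star_mask(n)), ("RANDOM", random_mask(n))]:
    print(f"{name:8s} connections per token ~ {m.sum() / n:.1f}")
```

Printing the average number of connections per token shows that all four patterns stay far below the dense n connections per token, which is the regime the paper's universality result covers.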