O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers


Paper

Link: https://arxiv.org/abs/2006.04862
Year: 2020

Summary

  • this paper proves that sparse Transformers with only O(n) attention connections are universal approximators: they can approximate any continuous sequence-to-sequence function, matching the expressive power of their dense counterparts (a rough illustration of the connection count is sketched below)
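
A minimal NumPy sketch (not from the paper) of what O(n) versus O(n²) connections per attention layer means: a dense mask lets every token attend to every other token, while a sparse mask with a fixed local window plus one global token keeps the total number of connections linear in the sequence length. The window width `w` and the single global token at index 0 are illustrative assumptions.

```python
import numpy as np

def dense_mask(n):
    # Dense attention: every token attends to every token -> n^2 connections.
    return np.ones((n, n), dtype=bool)

def sparse_mask(n, w=4):
    # Illustrative sparse attention: each token attends to a local window
    # of width w plus one "global" token (index 0) -> O(n) connections.
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        mask[i, lo:hi] = True   # local window around position i
        mask[i, 0] = True       # global token
    return mask

for n in (64, 256, 1024):
    print(n, dense_mask(n).sum(), sparse_mask(n).sum())
```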

Results

  • Except for the STAR and RANDOM patterns, the networks learn to copy the input sequences with four sparse attention layers


  • The STRIDED and STAR patterns achieve the best performance across all sparsity levels on the language modeling task
  • The STRIDED and FIXED patterns with the UNION configuration show the best scores on the translation task
  • In all tasks, the RANDOM pattern performs worse than the deterministic patterns, demonstrating the need for careful design of sparsity patterns (simplified versions of these patterns are sketched after this list)
  • Overall, the experiments suggest that the optimal sparsity pattern is heavily task-dependent; for example, the STAR pattern performs best on language modeling while struggling on the copy, translation, and BERT experiments
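
To make the sparsity patterns above concrete, here is a hedged NumPy sketch that builds simplified boolean attention masks for STRIDED, FIXED, STAR, and RANDOM patterns. The exact definitions used in the paper (stride values, per-head splits, causality, and the relay-node details of the STAR pattern) may differ; the parameters `stride` and `density`, and the choice of index 0 as the relay token, are illustrative assumptions.

```python
import numpy as np

def strided_mask(n, stride=8):
    # Simplified STRIDED pattern: attend to the previous `stride` positions
    # and to every stride-th earlier position.
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    local = (i - j >= 0) & (i - j < stride)
    strided = ((i - j) % stride == 0) & (j <= i)
    return local | strided

def fixed_mask(n, stride=8):
    # Simplified FIXED pattern: attend within the current block plus fixed
    # "summary" columns at the end of each block.
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    same_block = (i // stride == j // stride) & (j <= i)
    summary = (j % stride == stride - 1) & (j <= i)
    return same_block | summary

def star_mask(n):
    # Simplified STAR pattern: a relay token (index 0) attends to everything,
    # every token attends to the relay and to its immediate neighbours.
    mask = np.eye(n, dtype=bool)
    mask[:, 0] = True
    mask[0, :] = True
    idx = np.arange(n - 1)
    mask[idx + 1, idx] = True   # left neighbour
    mask[idx, idx + 1] = True   # right neighbour
    return mask

def random_mask(n, density=0.1, seed=0):
    # RANDOM pattern: each token attends to a random subset of positions.
    rng = np.random.default_rng(seed)
    mask = rng.random((n, n)) < density
    np.fill_diagonal(mask, True)
    return mask

n = 32
for name, m in [("STRIDED", strided_mask(n)), ("FIXED", fixed_mask(n)),
                ("STAR", star_mask(n)), ("RANDOM", random_mask(n))]:
    print(f"{name:8s} connections per token ~ {m.sum() / n:.1f}")
```

Printing the average number of connections per token shows that all four patterns stay far below the dense n connections per token, which is the regime the paper's universality result covers.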