The transformer architecture revolutionized sequence transduction with its novel "Scaled Dot-Product Attention", outperforming previous recurrent and convolutional models while dispensing with recurrence and convolutions entirely. However, these transformers are prohibitively slow for very long sequences due to their quadratic, $\mathcal{O}(N^2)$, time and memory complexity in the sequence length $N$.
The self-attention mechanism makes it possible to capture dependencies between different positions in a sequence by simultaneously attending to all positions. The "Scaled Dot-Product Attention" proposed by Vaswani et al.1 maps an input sequence $x \in \mathbb{R}^{N \times F}$ of $N$ feature vectors to queries, keys, and values $Q = xW_Q$, $K = xW_K$, and $V = xW_V$, and computes

$$\mathrm{Attention}(x) = V' = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{D}}\right)V,$$

where $D$ is the dimensionality of the queries and keys.
Since the softmax is applied row-wise, the output for the $i$-th position can equivalently be written as a similarity-weighted average of the values,

$$V'_i = \frac{\sum_{j=1}^{N} \mathrm{sim}(Q_i, K_j)\, V_j}{\sum_{j=1}^{N} \mathrm{sim}(Q_i, K_j)},$$

where softmax attention corresponds to the exponential similarity $\mathrm{sim}(q, k) = \exp\!\left(\frac{q^{T} k}{\sqrt{D}}\right)$. Computing this for all $N$ positions requires $\mathcal{O}(N^2)$ operations.
The exponential similarity function above can be replaced with any map that is non-negative, in particular with any kernel $\mathrm{sim}(q, k) = \phi(q)^{T}\phi(k)$ defined by a feature map $\phi$. Katharopoulos et al.2 use the feature map $\phi(x) = \mathrm{elu}(x) + 1$.
Since matrix multiplication is associative, the numerator and denominator can then be rearranged so that the sums over the keys and values are computed only once and reused for every query:

$$V'_i = \frac{\phi(Q_i)^{T} \sum_{j=1}^{N} \phi(K_j) V_j^{T}}{\phi(Q_i)^{T} \sum_{j=1}^{N} \phi(K_j)}.$$

This reduces the time and memory complexity of attention from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$ in the sequence length.
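For illustration, here is a minimal sketch of this linearized (non-causal) attention in PyTorch using the $\mathrm{elu}(x) + 1$ feature map; the tensor shapes and the small epsilon added for numerical stability are our own choices rather than the reference implementation.

```python
import torch


def feature_map(x):
    # phi(x) = elu(x) + 1 is strictly positive, as required of the similarity.
    return torch.nn.functional.elu(x) + 1


def linear_attention(Q, K, V, eps=1e-6):
    """Non-causal linear attention for Q, K of shape (N, D) and V of shape (N, M)."""
    phi_Q = feature_map(Q)             # (N, D)
    phi_K = feature_map(K)             # (N, D)
    KV = phi_K.T @ V                   # (D, M): sum_j phi(K_j) V_j^T, computed once
    Z = phi_K.sum(dim=0)               # (D,):   sum_j phi(K_j)
    numer = phi_Q @ KV                 # (N, M)
    denom = (phi_Q @ Z).unsqueeze(-1)  # (N, 1)
    return numer / (denom + eps)


Q, K, V = (torch.randn(1024, 64) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # torch.Size([1024, 64])
```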
The transformer architecture can further be used to efficiently train autoregressive models by masking the attention computation such that a position cannot be influenced by the subsequent positions. Then the attention for the $i$-th position only sums over the first $i$ positions,

$$V'_i = \frac{\phi(Q_i)^{T} S_i}{\phi(Q_i)^{T} Z_i}, \qquad S_i = \sum_{j=1}^{i} \phi(K_j) V_j^{T}, \qquad Z_i = \sum_{j=1}^{i} \phi(K_j).$$
Note that $S_i$ and $Z_i$ can be computed from $S_{i-1}$ and $Z_{i-1}$ in constant time, so masked linear attention can be evaluated in a single pass over the sequence. This makes the linear transformer equivalent to a recurrent neural network.
The resulting RNN thus has two hidden states – the attention memory $S_i$ and the normalizer memory $Z_i$ – which are updated at each step as $S_i = S_{i-1} + \phi(K_i) V_i^{T}$ and $Z_i = Z_{i-1} + \phi(K_i)$.
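A sketch of this recurrent formulation (again using the $\mathrm{elu}(x) + 1$ feature map; the shapes and the epsilon are our own choices):

```python
import torch


def feature_map(x):
    return torch.nn.functional.elu(x) + 1


def causal_linear_attention_rnn(Q, K, V, eps=1e-6):
    """Masked (causal) linear attention computed as an RNN over positions."""
    N, D = Q.shape
    M = V.shape[1]
    S = torch.zeros(D, M)  # attention memory:  S_i = sum_{j<=i} phi(K_j) V_j^T
    Z = torch.zeros(D)     # normalizer memory: Z_i = sum_{j<=i} phi(K_j)
    outputs = []
    for i in range(N):
        q, k, v = feature_map(Q[i]), feature_map(K[i]), V[i]
        S = S + torch.outer(k, v)               # update attention memory
        Z = Z + k                               # update normalizer memory
        outputs.append((q @ S) / (q @ Z + eps))
    return torch.stack(outputs)                 # (N, M)
```

Each step does a constant amount of work regardless of the sequence length, which is what makes autoregressive inference with the linear transformer fast.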
To examine the claims regarding time complexity, we experimented with random sequences of different lengths and measured the time taken to compute the attention in each case. We also measured the total time taken in the forward and backward passes for each sequence.
Figure: total time taken for forward and backward passes vs. sequence length, and time taken for attention calculation vs. sequence length.
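A minimal sketch of how such a timing comparison can be set up; the sequence lengths, feature sizes, and use of `time.perf_counter` are our own choices rather than the exact benchmarking code used in the experiments.

```python
import time

import torch


def softmax_attention(Q, K, V):
    # Standard scaled dot-product attention: O(N^2) time and memory.
    scores = (Q @ K.T) / (Q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ V


for N in (256, 1024, 4096):
    Q, K, V = (torch.randn(N, 64) for _ in range(3))
    start = time.perf_counter()
    softmax_attention(Q, K, V)
    print(f"N={N}: softmax attention took {time.perf_counter() - start:.4f}s")
    # Timing the linear_attention sketch above the same way should show
    # roughly linear rather than quadratic growth in N.
```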
We trained a softmax transformer (using the attention from Vaswani et al.1) and a linear transformer (using the attention from Katharopoulos et al.2) for autoregressive image generation on the MNIST dataset3. We then compared the performance of the two models on image completion from occluded images.
Figure: example MNIST digits, their occluded versions, and the completions produced by the two models.
The linear transformer completed the occluded images considerably faster than the softmax transformer, as shown in the table below. To evaluate the quality of the generated images, we generated 600 images with each of the two models and compared the accuracy of a neural network trained on the MNIST dataset3 on these two sets. The accuracies were 87% and 86.3% on images generated by the linear and the softmax transformers respectively.
Number of Images Completed | Linear Transformer | Softmax Transformer |
---|---|---|
100 | 33.45s | 367.91s |
200 | 67.46s | 811.99s |
300 | 76.46s | 1248.15s |
Due to the compute requirements of large-scale training, fine-tuning is a widely adopted paradigm for adapting models pre-trained on general-domain data to particular tasks or domains. However, for large models even fine-tuning may be prohibitively expensive. Work from Li et al.4 and Aghajanyan et al.5 has demonstrated that models may be over-parameterized and may actually reside on a low intrinsic dimension. Inspired by these results, Hu et al.6 showed empirically that the change in weights during model adaptation also has a low intrinsic rank. This not only makes it possible to achieve comparable accuracies with faster fine-tuning (due to fewer trainable parameters) but also makes it possible to easily switch different adapters in and out during deployment. A weight update $\Delta W \in \mathbb{R}^{d \times k}$ is represented as a low-rank product $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and rank $r \ll \min(d, k)$; the pre-trained weights $W_0$ are frozen and only $A$ and $B$ are trained, so the adapted layer computes $W_0 x + BAx$ instead of $W_0 x$.
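A minimal sketch of such a low-rank adapter wrapped around a linear layer; the initialization and scaling follow the usual LoRA recipe, but the class itself is our own illustration rather than the code used in the experiments below.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer with a trainable low-rank update BA."""

    def __init__(self, base: nn.Linear, r: int, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # W_0 (and bias) stay frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # A: r x k, small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B: d x r, zero init so BA = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        # W_0 x + (alpha / r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Only $A$ and $B$ are trained, which adds $r(d + k)$ parameters per adapted $d \times k$ layer; switching adapters amounts to swapping these small matrices while the shared pre-trained weights remain untouched.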
We trained a base neural network on an image classification task on the MNIST dataset3. Our base model was composed of three linear layers which together had 55.1K trainable parameters, and it achieved a test accuracy of approximately 93.2%. We then created three variants of MNIST, namely Quantized MNIST, Rotated MNIST, and Inverted MNIST, illustrated below through an example.
Figure: an example MNIST digit alongside its Quantized, Rotated, and Inverted variants.
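For reference, a sketch of a base model consistent with the description above; the exact layer sizes are not stated, so the hidden widths of 64 below are our assumption (they give roughly 55.1K trainable parameters).

```python
import torch.nn as nn

# Assumed layer sizes: 784 -> 64 -> 64 -> 10 gives
# (784*64 + 64) + (64*64 + 64) + (64*10 + 10) = 55,050 ≈ 55.1K parameters.
base_model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)
```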
The accuracies of our base model on these modified datasets were approximately 85.58%, 12.38%, and 5.52% respectively.
We first fine-tuned our base model on the three datasets by modifying all 55.1K parameters. Our fine-tuned models achieved accuracies of approximately 93.57%, 91.97%, and 76.41% on Quantized MNIST, Rotated MNIST, and Inverted MNIST respectively. These form our baselines for the fine-tuned models.
We then fine-tuned our base model on the three datasets using LoRA with different values of the rank $r$. The accuracies and the number of trainable parameters for each $r$ are given in the table below; a sketch of this setup follows the table.
$r$ | Trainable Parameters | Quantized MNIST | Rotated MNIST | Inverted MNIST |
---|---|---|---|---|
1 | 1.1K | 91.20% | 37.53% | 16.23% |
2 | 2.1K | 91.30% | 49.01% | 17.92% |
4 | 4.2K | 91.42% | 69.10% | 16.32% |
8 | 8.4K | 91.39% | 77.49% | 32.19% |
16 | 16.8K | 91.72% | 86.95% | 62.26% |
32 | 33.6K | 92.31% | 89.50% | 68.06% |
64 | 67.2K | 93.19% | 90.41% | 71.88% |
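Continuing the sketches above (this builds on the hypothetical `base_model` and `LoRALinear` from the earlier snippets), LoRA fine-tuning amounts to wrapping each linear layer and training only the adapter parameters; the trainable-parameter count then matches the first rows of the table.

```python
import torch.nn as nn


def add_lora(model: nn.Sequential, r: int) -> nn.Sequential:
    # Wrap every linear layer with the LoRALinear adapter sketched earlier;
    # the frozen base weights are reused, only A and B are trainable.
    return nn.Sequential(
        *(LoRALinear(m, r) if isinstance(m, nn.Linear) else m for m in model)
    )


lora_model = add_lora(base_model, r=1)
trainable = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
print(trainable)  # (784+64) + (64+64) + (64+10) = 1050 ≈ 1.1K for r = 1
```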
We further explored whether LoRA can be used to train models from scratch, to observe whether models have an intrinsically low rank. We trained separate models on each of our three modified MNIST datasets as well as the original MNIST. The accuracies of these models are given below along with the respective ranks $r$ and numbers of trainable parameters.
$r$ | Trainable Parameters | MNIST | Quantized MNIST | Rotated MNIST | Inverted MNIST |
---|---|---|---|---|---|
1 | 1.1K | 56.79% | 23.50% | 25.91% | 22.21% |
2 | 2.1K | 71.90% | 37.48% | 43.96% | 45.81% |
4 | 4.2K | 84.44% | 64.87% | 62.60% | 69.67% |
8 | 8.4K | 89.12% | 77.96% | 82.39% | 83.11% |
16 | 16.8K | 92.64% | 88.2% | 86.76% | 87.38% |
32 | 33.6K | 93.98% | 90.13% | 90.62% | 90.25% |
64 | 67.2K | 94.85% | 91.66% | 91.85% | 86.01% |
We thank Dhruva Kashyap, one of the teaching assistants in UMC 203 at IISc and one of the last Emacs users, for his unwavering technical and emotional support throughout the preparation of this report. We are also extremely grateful to Professor Chiranjib Bhattacharyya for providing us with the opportunity to explore this topic through a graded term-paper in the course. The code for the experiments in section "Transformers are RNNs" was adapted from code provided by Katharopoulos et al.2 in colab.research.google.com/drive/1BV4OaWRAHYGeimO0cqHD86GfnfgUaKD4. We also used the provided pre-trained weights for both the softmax and linear transformers used in our experiments. The code for the experiments in section "Low-Rank Adaptation (LoRA) of Large Language Models" was adapted from github.com/sunildkumar/lora_from_scratch.
All of our code is released in this repository and can be used to reproduce the experiments mentioned above. You can visit this Colab notebook for experiments on the linear transformer, and this notebook for experiments on LoRA.
References
1. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
2. Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020. JMLR.org, 2020.
3. Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
4. Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations, 2018.
5. Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7319–7328, Online, August 2021. Association for Computational Linguistics.
6. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.