Integrating into a regular transformer
Akbarable opened this issue · 0 comments
Hi Alex, I am amazed by your repository! I had been working on implementing LongNet myself for a few days, but your implementation is brilliant.
I want to use your implementation of dilated attention (with/without flash); do you have any suggestions for me? I am having trouble integrating it into a simple single-layer encoder-decoder transformer trained to translate English sentences into another language. The model works fine with regular attention, but I am not sure how to swap in your version of dilated attention; a rough sketch of what I am attempting is below.
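For context, here is a toy sketch of the kind of swap I mean. The `SimpleDilatedSelfAttention` class is my own simplified stand-in for a single (segment length, dilation rate) pair, not your implementation, and it only shows where the module would plug into an encoder layer:

```python
import torch
import torch.nn as nn

class SimpleDilatedSelfAttention(nn.Module):
    """Toy dilated self-attention for one (segment_len, dilation) pair:
    split the sequence into segments, keep every r-th token in each segment,
    attend within that sparse segment, and scatter the outputs back.
    (A real LongNet layer mixes several such pairs; unselected positions
    here just pass through via the residual connection.)"""
    def __init__(self, dim, num_heads, segment_len=16, dilation=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.segment_len = segment_len
        self.dilation = dilation

    def forward(self, x):
        b, n, d = x.shape
        w, r = self.segment_len, self.dilation
        assert n % w == 0, "pad the sequence to a multiple of segment_len first"
        segs = x.reshape(b * (n // w), w, d)          # (batch * num_segments, w, d)
        sparse = segs[:, ::r, :]                      # keep every r-th token per segment
        out, _ = self.attn(sparse, sparse, sparse)    # attention within the sparse segment
        segs = segs.clone()
        segs[:, ::r, :] = out                         # scatter attended tokens back
        return segs.reshape(b, n, d)

class EncoderLayer(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        # this is the line where regular self-attention would be replaced
        self.self_attn = SimpleDilatedSelfAttention(dim, num_heads)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = x + self.self_attn(self.norm1(x))
        return x + self.ff(self.norm2(x))

x = torch.randn(2, 64, 128)            # (batch, seq_len, dim); seq_len divisible by segment_len
layer = EncoderLayer(dim=128, num_heads=4)
print(layer(x).shape)                  # torch.Size([2, 64, 128])
```

Is replacing the encoder's self-attention like this the right place to start, or does your module expect something different (e.g. particular sequence lengths, masking, or use in the decoder's cross-attention)?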
Please let me know if you have any suggestions, and bear with me because I am just scratching the surface when it comes to transformers and LLMs. If you have any advice or reading/tutorial suggestions for me, I'd be happy to take them. Thank you!