Integrating into a regular transformer
Akbarable opened this issue · 0 comments
Hi Alex, I am amazed by your repository! I had been working on implementing LongNet myself for a few days, but your implementation is brilliant.
I want to use your implementation of dilated attention (with/without flash); do you have any suggestions for me? I am having trouble integrating it into a simple single-layer encoder-decoder transformer trained to translate English sentences into another language. The model works fine with regular attention, but I am not sure how to swap in your version of dilated attention; a rough sketch of what I am attempting is below.
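For context, here is a toy sketch of the kind of swap I mean. The `SimpleDilatedSelfAttention` class is my own simplified stand-in for a single (segment length, dilation rate) pair, not your implementation, and it only shows where the module would plug into an encoder layer:

```python
import torch
import torch.nn as nn

class SimpleDilatedSelfAttention(nn.Module):
    """Toy dilated self-attention for one (segment_len, dilation) pair:
    split the sequence into segments, keep every r-th token in each segment,
    attend within that sparse segment, and scatter the outputs back.
    (A real LongNet layer mixes several such pairs; unselected positions
    here just pass through via the residual connection.)"""
    def __init__(self, dim, num_heads, segment_len=16, dilation=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.segment_len = segment_len
        self.dilation = dilation

    def forward(self, x):
        b, n, d = x.shape
        w, r = self.segment_len, self.dilation
        assert n % w == 0, "pad the sequence to a multiple of segment_len first"
        segs = x.reshape(b * (n // w), w, d)          # (batch * num_segments, w, d)
        sparse = segs[:, ::r, :]                      # keep every r-th token per segment
        out, _ = self.attn(sparse, sparse, sparse)    # attention within the sparse segment
        segs = segs.clone()
        segs[:, ::r, :] = out                         # scatter attended tokens back
        return segs.reshape(b, n, d)

class EncoderLayer(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        # this is the line where regular self-attention would be replaced
        self.self_attn = SimpleDilatedSelfAttention(dim, num_heads)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = x + self.self_attn(self.norm1(x))
        return x + self.ff(self.norm2(x))

x = torch.randn(2, 64, 128)            # (batch, seq_len, dim); seq_len divisible by segment_len
layer = EncoderLayer(dim=128, num_heads=4)
print(layer(x).shape)                  # torch.Size([2, 64, 128])
```

Is replacing the encoder's self-attention like this the right place to start, or does your module expect something different (e.g. particular sequence lengths, masking, or use in the decoder's cross-attention)?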
Please let me know if you have any suggestions, and bear with me because I am just scratching the surface when it comes to transformers and LLMs. If you have any advice or reading/tutorial suggestions for me, I'd be happy to take them. Thank you!