Paper and code
Closed this issue · 2 comments
king-menin commented
Sparse Multi-Head Attention (https://arxiv.org/abs/1904.10509) — is it implemented in deepspeedsparseselfattention.py or in attention.py?
ptillet commented
I think both work; they just have different APIs. The attention.py file has a torch-like interface, while deepspeedsparseselfattention.py was contributed by Microsoft for compatibility with the DeepSpeed library. I suspect the latter will be deprecated once it gets merged into the DeepSpeed repo. Hope this answers your question!
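For readers unfamiliar with the idea from the linked paper, block-sparse attention computes the usual scaled dot-product attention but only over a fixed set of blocks of the attention matrix. The sketch below is a dense PyTorch reference of that concept, not the actual API of either file discussed above; the function name, the `layout` argument, and the block size are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, layout, block=16):
    """Dense reference implementation of block-sparse attention.

    q, k, v: tensors of shape (batch, heads, seq, dim)
    layout:  boolean tensor of shape (heads, seq//block, seq//block);
             True marks an attention block that is kept.
    """
    b, h, s, d = q.shape
    # Expand the per-block layout into a full (heads, seq, seq) mask.
    mask = layout.repeat_interleave(block, dim=1).repeat_interleave(block, dim=2)
    # Standard scaled dot-product scores.
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5
    # Masked-out blocks get -inf so softmax assigns them zero weight.
    scores = scores.masked_fill(~mask.unsqueeze(0), float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Example: a block-diagonal layout (every query attends only to its own block).
b, h, s, d, blk = 2, 4, 64, 32, 16
q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
nb = s // blk
layout = torch.eye(nb, dtype=torch.bool).expand(h, nb, nb)
out = block_sparse_attention(q, k, v, layout, block=blk)
```

A real kernel (like the ones in this repo) never materializes the masked blocks at all, which is where the memory and speed savings come from; this dense version only mirrors the math.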
king-menin commented
ty!