openai/sparse_attention

Expected throughput?

cbockman opened this issue

Can you provide any insight into expected throughput, relative to a "base" transformer implementation?

I.e., if you consider two models with the same hidden size, # layers, etc., will the sparse_attention version run significantly slower (and if so, presumably because of recompute)?
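
For concreteness, here is roughly the kind of comparison I have in mind -- a toy NumPy timing harness that pits dense attention against a block-local variant at the same hidden size and sequence length. To be clear, this is not the code or kernels from this repo; the function names, the block-local pattern, and the block size below are placeholders I made up purely to illustrate the throughput question.

```python
# Toy throughput comparison: dense attention vs. a crude block-local "sparse" stand-in.
# NumPy only; not the blocksparse kernels from this repo.
import time
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(q, k, v):
    # q, k, v: [seq, head_dim]; full seq x seq attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def block_local_attention(q, k, v, block=64):
    # Each query block attends only within its own block -- a rough stand-in
    # for a fixed/strided sparse pattern, just for timing purposes.
    out = np.empty_like(v)
    for start in range(0, q.shape[0], block):
        s = slice(start, start + block)
        out[s] = dense_attention(q[s], k[s], v[s])
    return out

def tokens_per_sec(fn, q, k, v, iters=10):
    fn(q, k, v)  # warm-up
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(q, k, v)
    return iters * q.shape[0] / (time.perf_counter() - t0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq, head_dim = 4096, 64
    q, k, v = (rng.standard_normal((seq, head_dim), dtype=np.float32) for _ in range(3))
    print("dense  tok/s:", tokens_per_sec(dense_attention, q, k, v))
    print("sparse tok/s:", tokens_per_sec(block_local_attention, q, k, v))
```

Obviously the CPU/NumPy numbers mean nothing for the GPU kernels here; I'm only trying to pin down whether the real sparse implementation (with recompute, etc.) ends up faster or slower than a dense baseline at equal model size.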

Apologies if this was covered in the paper--I skimmed and didn't see it addressed.

I'm considering getting this up and running--extremely interesting--but would like a sense of whether there is a major throughput hit before doing so.

Thank you--very neat to see the successful evolution from https://openai.com/blog/block-sparse-gpu-kernels/.