Expected throughput?
cbockman opened this issue · 0 comments
cbockman commented
Can you provide any insight into expected throughput, relative to a "base" transformer implementation?
I.e., if you consider two models with the same hidden size, # layers, etc., will the sparse_attention version run significantly slower (if yes, presumably because of recompute)?
Apologies if this was covered in the paper--I skimmed and didn't see it addressed.
Am considering getting this up and running--extremely interesting--but would like a sense of whether there is a major throughput hit before doing so.
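For concreteness, something like the rough sketch below is the comparison I have in mind: same hidden size, dense attention vs. a blocked pattern that only does a fraction of the work. These are purely illustrative NumPy stand-ins (hypothetical `dense_attention` / `block_sparse_attention` functions, not this repo's kernels, and no recompute), just to frame the throughput question.

```python
# Rough, self-contained timing sketch with NumPy stand-ins -- not the real
# fused kernels. It only illustrates the kind of apples-to-apples comparison
# I'm asking about (same hidden size, dense vs. blocked attention).
import time
import numpy as np

def dense_attention(q, k, v):
    # Standard scaled dot-product attention, O(n^2) in sequence length.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def block_sparse_attention(q, k, v, block=64):
    # Toy local pattern: each block attends only within itself.
    # (Real block-sparse kernels fuse this; here it just mimics the reduced work.)
    n = q.shape[0]
    out = np.empty_like(v)
    for start in range(0, n, block):
        sl = slice(start, min(start + block, n))
        out[sl] = dense_attention(q[sl], k[sl], v[sl])
    return out

def time_fn(fn, *args, iters=10):
    fn(*args)  # warm-up
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters

n, d = 4096, 64
q, k, v = (np.random.randn(n, d).astype(np.float32) for _ in range(3))
print("dense :", time_fn(dense_attention, q, k, v))
print("sparse:", time_fn(block_sparse_attention, q, k, v))
```

Obviously the real question is what happens once the fused GPU kernels and recompute are in the picture, which a toy like this can't capture.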
Thank you--very neat to see successful evolution from https://openai.com/blog/block-sparse-gpu-kernels/.