microsoft/DeepSpeedExamples

Question about attention computation

Hi, thank you for the amazing demo and docs! I have a question about this section in ZeRO-Inference. It states: "Thus, our current implementation computes attention scores on CPU". Is there a detailed comparison of latency or throughput between GPU attention and CPU attention to support this decision? I am also curious about the implementation details of the CPU attention computation. Thank you!