Question about attention computation
Opened this issue · 0 comments
yuzhenmao commented
Hi, thank you for the amazing demo and docs! I have a question regarding this section in zero-inference. It is mentioned that "Thus, our current implementation computes attention scores on CPU". May I ask if there is a detailed comparison of latency or throughput between GPU attention and CPU attention to support this decision? I am also curious about the implementation details of the CPU attention computation. Thank you!
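For context, my mental model of the step in question is standard single-head scaled dot-product attention run on the CPU over a KV cache held in host memory. A minimal NumPy sketch of that math (the function name, shapes, and setup here are my own illustration, not the ZeRO-Inference implementation):

```python
import numpy as np

def cpu_attention(q, k, v):
    """Scaled dot-product attention for one head, computed on CPU.

    q: (seq_q, d) queries; k, v: (seq_k, d) cached keys/values.
    Shapes are illustrative, not taken from the ZeRO-Inference source.
    """
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d)               # (seq_q, seq_k) attention scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v                            # (seq_q, d) attended values

# Example: one decode-step query against a cache of 128 tokens.
rng = np.random.default_rng(0)
q = rng.standard_normal((1, 64))
k = rng.standard_normal((128, 64))
v = rng.standard_normal((128, 64))
out = cpu_attention(q, k, v)
```

My (possibly wrong) intuition is that keeping this step on CPU avoids moving the large KV cache to the GPU each step, trading compute speed for transfer bandwidth; a measured latency/throughput comparison would make that trade-off concrete.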