efeslab/Atom

kernel optimized for A100


Thank you for the great work and experiments. We want to test the throughput on an A100 with batch size = 16. Have you tried a kernel optimized for the A100, or is there something we can refer to?

Hi @lisuying214 ,

Thanks for your question! The kernels in Atom are specifically optimized for Ada GPUs. Their performance on the A100 will degrade a lot due to the A100's poor CUDA core throughput. I suspect optimizing the dequantization process in the Atom kernel will be crucial for A100 performance. Please refer to this recent work for A100 evaluations (https://arxiv.org/pdf/2405.04532).

@happierpig
Dear author, thanks for your reply and the recommended paper!
However, I checked the A100's CUDA core throughput: the A100's FP32/FP16 throughput is indeed worse than the RTX 4090's, but why not use Tensor Cores for the dequantization process in the Atom kernel? The Tensor Cores in both the A100 and RTX 4090 support FP32/FP16 and INT8.
Looking forward to your reply, thanks again!

@lisuying214 ,

As referenced in `__device__ __forceinline__ void dequant(...)`, the dequantization process in Atom is an element-wise operation: it needs an outer product of the scales, so it can't be mapped onto the tensor cores.
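For illustration only, here is a minimal CUDA sketch of that point (this is not Atom's actual kernel; the names `dequant_tile`, `act_scale`, and `weight_scale` are hypothetical). Each INT32 accumulator element gets its own scale formed as the outer product of a per-row activation scale and a per-column weight scale, so the work is a per-element multiply on CUDA cores rather than a matrix multiply that tensor cores could accelerate:

```cuda
#include <cstdint>
#include <cuda_fp16.h>

// Hypothetical sketch: dequantize an INT32 accumulator tile back to FP16
// after a low-precision tensor-core GEMM. Every output element is scaled by
// act_scale[row] * weight_scale[col], i.e. an outer product of scales.
template <int TILE_M, int TILE_N>
__device__ __forceinline__ void dequant_tile(
    const int32_t* __restrict__ acc,          // TILE_M x TILE_N INT32 partial sums
    const half*    __restrict__ act_scale,    // per-row activation scales
    const half*    __restrict__ weight_scale, // per-column weight scales
    half*          __restrict__ out)          // TILE_M x TILE_N FP16 output
{
    // One element per thread iteration for clarity; a real kernel would
    // vectorize this and interleave it with the MMA pipeline.
    for (int m = threadIdx.y; m < TILE_M; m += blockDim.y) {
        float s_row = __half2float(act_scale[m]);
        for (int n = threadIdx.x; n < TILE_N; n += blockDim.x) {
            float s = s_row * __half2float(weight_scale[n]); // outer-product scale
            out[m * TILE_N + n] =
                __float2half(s * static_cast<float>(acc[m * TILE_N + n]));
        }
    }
}
```

Because every output element needs its own row-times-column scale, there is no shared inner dimension to feed to an MMA instruction; the scaling stays on the CUDA cores, which is exactly where the A100 falls behind the RTX 4090.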