efeslab/Atom

kernel optimized for A100


Thank you for the great work and experiments. We want to test the throughput on an A100 with batch size = 16. Have you tried a kernel optimized for the A100, or is there something we can refer to?

Hi @lisuying214 ,

Thanks for your question! The kernels in Atom are specifically optimized for Ada GPUs. Their performance on the A100 will degrade a lot due to the A100's poor CUDA core throughput. I suspect optimizing the dequantization process in the Atom kernel will be crucial for A100 performance. Please refer to this recent work for A100 evaluations (https://arxiv.org/pdf/2405.04532).

@happierpig
Dear author, thanks for your reply and the recommended paper!
However, I checked the A100's CUDA core throughput: the A100's FP32/FP16 throughput is indeed worse than the RTX 4090's, but why not use Tensor Cores for the dequantization process in the Atom kernel? The Tensor Cores in both the A100 and RTX 4090 support FP32/FP16 and INT8.
Looking forward to your reply, thanks again!

@lisuying214 ,

As referenced in `__device__ __forceinline__ void dequant(...)`, the dequantization process in Atom is an element-wise operation: it needs an outer product of the scales, so it can't be mapped onto the tensor cores.
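For illustration only, here is a minimal CUDA sketch of that point (this is not Atom's actual kernel; the names `dequant_tile`, `act_scale`, and `weight_scale` are hypothetical). Each INT32 accumulator element gets its own scale formed as the outer product of a per-row activation scale and a per-column weight scale, so the work is a per-element multiply on CUDA cores rather than a matrix multiply that tensor cores could accelerate:

```cuda
#include <cstdint>
#include <cuda_fp16.h>

// Hypothetical sketch: dequantize an INT32 accumulator tile back to FP16
// after a low-precision tensor-core GEMM. Every output element is scaled by
// act_scale[row] * weight_scale[col], i.e. an outer product of scales.
template <int TILE_M, int TILE_N>
__device__ __forceinline__ void dequant_tile(
    const int32_t* __restrict__ acc,          // TILE_M x TILE_N INT32 partial sums
    const half*    __restrict__ act_scale,    // per-row activation scales
    const half*    __restrict__ weight_scale, // per-column weight scales
    half*          __restrict__ out)          // TILE_M x TILE_N FP16 output
{
    // One element per thread iteration for clarity; a real kernel would
    // vectorize this and interleave it with the MMA pipeline.
    for (int m = threadIdx.y; m < TILE_M; m += blockDim.y) {
        float s_row = __half2float(act_scale[m]);
        for (int n = threadIdx.x; n < TILE_N; n += blockDim.x) {
            float s = s_row * __half2float(weight_scale[n]); // outer-product scale
            out[m * TILE_N + n] =
                __float2half(s * static_cast<float>(acc[m * TILE_N + n]));
        }
    }
}
```

Because every output element needs its own row-times-column scale, there is no shared inner dimension to feed to an MMA instruction; the scaling stays on the CUDA cores, which is exactly where the A100 falls behind the RTX 4090.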