mit-han-lab/qserve

Questions about FP8 and H100

sijiac opened this issue · 1 comment

Hey, I saw that the paper only discusses INT4 and INT8 quantization. Was that because the A100 only has INT8 tensor cores?
I am curious why FP4 and FP8 were not considered, especially since the H100 has FP8 tensor cores.

Thanks!!

Hi @sijiac. Thanks for your interest in QServe!

Yes. We mainly discussed INT quantization in the paper since devices like the A100 only have INT8 tensor cores. For FP8, the optimizations in our quantization algorithms still apply, and the principle of "reducing (dequantization) overheads on CUDA cores" still holds for FP8 kernels. Note that INT8 operations are also supported on newer hardware such as the H100 and Google TPUs, and the theoretical throughput of INT8 tensor cores on the H100 is equivalent to that of FP8.
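
To illustrate the "dequantization overhead" point, here is a minimal NumPy sketch (not QServe's actual kernel; all function names are hypothetical). In a W4A8-style GEMM, the 4-bit weight codes must be expanded back to integers (subtracting zero points, applying scales) before the main integer multiply. On a GPU that expansion runs on CUDA cores, so it is pure overhead regardless of whether the tensor-core multiply uses INT8 or FP8:

```python
import numpy as np

def quantize_uint4(w):
    """Asymmetric per-channel 4-bit quantization: returns codes in [0, 15]
    plus the per-channel scale and zero point needed to undo it."""
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0
    zero = np.round(-lo / scale)
    q = np.clip(np.round(w / scale) + zero, 0, 15).astype(np.uint8)
    return q, scale, zero

def w4a8_matmul(x_q, x_scale, w_q, w_scale, w_zero):
    """Simulated W4A8 GEMM: the (w_q - w_zero) expansion stands in for the
    in-kernel dequantization work that lands on CUDA cores."""
    w_int = w_q.astype(np.int32) - w_zero.astype(np.int32)  # "CUDA-core" work
    acc = x_q.astype(np.int32) @ w_int.T                    # "tensor-core" work
    return acc * x_scale * w_scale.T                        # rescale to float

# Toy check against the full-precision result.
rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64)).astype(np.float32)
x = rng.standard_normal((4, 64)).astype(np.float32)
w_q, w_s, w_z = quantize_uint4(w)
x_s = np.abs(x).max() / 127.0
x_q = np.clip(np.round(x / x_s), -128, 127).astype(np.int8)
print(np.abs(w4a8_matmul(x_q, x_s, w_q, w_s, w_z) - x @ w.T).max())
```

The per-element expansion step is the same whether the subsequent matrix multiply runs on INT8 or FP8 tensor cores, which is why the kernel-level optimizations carry over to FP8.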