Weight-only int4 is slower than CUTLASS int4

Question

Opened this issue 7 months ago · 1 comment

https://github.com/ModelTC/lightllm/blob/main/lightllm/common/basemodel/triton_kernel/dequantize_gemm_int4.py

The kernel in the file above implements weight-only int4, but it runs at only 50% of the speed of CUTLASS int4. How can this be resolved?
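For readers unfamiliar with the term: in weight-only int4, the weights are stored as packed 4-bit integers and dequantized to floating point before the multiply, while the activations stay in floating point. The pure-Python sketch below only illustrates that arithmetic. It is not the lightllm Triton kernel, and the packing order (low nibble first), the per-group scale and zero point, and all function names are assumptions made for illustration.

```python
# Reference sketch only; NOT the lightllm Triton kernel.
# Assumed layout: two 4-bit weights per byte (low nibble first),
# one scale and one zero point per group of `group_size` weights.

def pack_int4(values):
    """Pack an even-length list of ints in [0, 15] into bytes."""
    assert len(values) % 2 == 0
    return bytes(
        (values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
        for i in range(0, len(values), 2)
    )

def unpack_int4(packed):
    """Inverse of pack_int4: two 4-bit values per byte."""
    out = []
    for b in packed:
        out.append(b & 0xF)  # low nibble
        out.append(b >> 4)   # high nibble
    return out

def dequantize(packed, scales, zeros, group_size):
    """w[i] = (q[i] - zero[g]) * scale[g], with g = i // group_size."""
    q = unpack_int4(packed)
    return [
        (v - zeros[i // group_size]) * scales[i // group_size]
        for i, v in enumerate(q)
    ]

def weight_only_int4_matvec(x, packed_cols, scales, zeros, group_size):
    """y[n] = sum_k x[k] * W[k, n], with each column of W stored as packed int4."""
    y = []
    for n, packed in enumerate(packed_cols):
        w = dequantize(packed, scales[n], zeros[n], group_size)
        y.append(sum(xk * wk for xk, wk in zip(x, w)))
    return y
```

The dequantization step runs inside the GEMM for every weight that is read, so it is extra work on top of the multiply itself.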

Answer 1 · 2024-03-19T06:06:14.000Z

You can compile your own operator and expose it to Python through a pybind interface, then modify the source code so that inference calls your operator instead of the current implementation.
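A minimal sketch of the replacement step, assuming the compiled operator is importable as a Python extension module. The module name `my_int4_ext` and function name `gemm` are hypothetical placeholders, not part of lightllm or CUTLASS; the helper falls back to the existing implementation when the extension has not been built.

```python
import importlib

def load_int4_gemm(fallback, module_name="my_int4_ext", func_name="gemm"):
    """Return the custom operator if the extension is available, else `fallback`.

    `module_name` and `func_name` are hypothetical; substitute the names
    your own pybind extension actually exports.
    """
    try:
        ext = importlib.import_module(module_name)
    except ImportError:
        # Extension not built or not installed: keep the existing kernel.
        return fallback
    # If the module exists but lacks the function, also keep the fallback.
    return getattr(ext, func_name, fallback)
```

At the call site you would pass the current implementation as `fallback` and call the returned function in its place, which presumes the custom operator accepts the same arguments as the one it replaces.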