ModelTC/lightllm

Weight-only int4 is slower than CUTLASS int4


https://github.com/ModelTC/lightllm/blob/main/lightllm/common/basemodel/triton_kernel/dequantize_gemm_int4.py

The kernel in the file above implements weight-only int4, but it runs at only 50% of the speed of CUTLASS int4. How can this be resolved?
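For context, a minimal pure-Python sketch of what "weight-only int4" means: the weights are stored as packed 4-bit integers and expanded back to floats right before the multiply, which is extra work on top of the GEMM itself. The packing layout here (two values per byte, low nibble first, one scale and zero point per group) and all function names are assumptions for illustration; they are not taken from the lightllm Triton kernel.

```python
# Illustrative sketch only. The layout (two 4-bit values per byte, low nibble
# first, per-group scale and zero point) is an assumption, not necessarily
# what lightllm's dequantize_gemm_int4.py uses.

def pack_int4(values):
    """Pack ints in [0, 15] into bytes, two values per byte (low nibble first)."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    if any(v < 0 or v > 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed):
    """Inverse of pack_int4: recover the list of 4-bit integers."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble
        out.append(byte >> 4)    # high nibble
    return out


def dequantize(packed, scales, zeros, group_size):
    """Expand packed int4 weights to floats: w = (q - zero) * scale per group."""
    q = unpack_int4(packed)
    return [
        (v - zeros[i // group_size]) * scales[i // group_size]
        for i, v in enumerate(q)
    ]


def dequant_dot(x, packed, scales, zeros, group_size):
    """Dot product of float activations with dequantized int4 weights."""
    w = dequantize(packed, scales, zeros, group_size)
    return sum(a * b for a, b in zip(x, w))
```

A fast kernel has to fuse the unpack and rescale steps into the matrix multiply so that they do not dominate the runtime, which is where implementations can differ a lot in speed.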

You can compile your own operator, expose it to Python through pybind, and then modify the source code so that inference calls your implementation instead of the current one.
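One way to wire this in without breaking installs that lack the compiled module is a small loader that prefers the extension and falls back to the existing code path. This is only a sketch: `my_int4_ext` and `int4_gemm` are hypothetical names for your own pybind module and its exported function, and `reference_matmul` is a stand-in for whatever implementation is currently called.

```python
import importlib


def reference_matmul(a, b):
    """Slow pure-Python matmul, standing in for the existing implementation."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def load_matmul(module_name="my_int4_ext", func_name="int4_gemm"):
    """Return the compiled operator if it can be imported, else the fallback.

    Both default names are hypothetical; use the names of your own
    pybind-built module and function.
    """
    try:
        ext = importlib.import_module(module_name)
    except ImportError:
        return reference_matmul
    return getattr(ext, func_name, reference_matmul)


# Resolve once at import time; call sites use `matmul` and never need to know
# which backend was picked.
matmul = load_matmul()
```

The compiled function has to accept the same arguments as the code it replaces, so check the call site in the inference path before swapping it in.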