FMInference/FlexLLMGen

question about quantization


Hi FlexGen team! I have a question about your quantization algorithm. Are you using the function run_float_quantization for int4/int8 compression? When I run the test (test_float_quantize), it fails because the returned params differ from the DeepSpeed version (the ref_out_tensor is the same). The DeepSpeed params can be used to recover the original float16 tensor, but the params from run_float_quantize cannot. Thanks!
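For context, here is a minimal sketch of what I mean by "recovering the float16 tensor from the params": a group-wise min-max quantize/dequantize round trip. The function names (quantize, dequantize) and the exact scheme (per-group min and scale along the last dimension) are my own assumptions for illustration, not necessarily what run_float_quantization does internally.

```python
# Minimal sketch (not FlexGen's actual code): group-wise min-max quantization,
# showing how the saved params (mn, scale) should let you reconstruct an
# approximation of the original float16 tensor.
import torch


def quantize(x: torch.Tensor, num_bits: int = 4, group_size: int = 64):
    """Quantize along the last dim in groups of `group_size` elements."""
    B = 2 ** num_bits - 1
    orig_shape = x.shape
    x = x.reshape(-1, group_size).float()
    mn = x.min(dim=1, keepdim=True).values
    mx = x.max(dim=1, keepdim=True).values
    scale = B / (mx - mn).clamp(min=1e-8)
    q = ((x - mn) * scale).round().clamp(0, B).to(torch.uint8)
    return q, mn, scale, orig_shape


def dequantize(q, mn, scale, orig_shape):
    """Recover a float16 approximation from the quantized tensor and params."""
    x = q.float() / scale + mn
    return x.reshape(orig_shape).half()


if __name__ == "__main__":
    w = torch.randn(128, 128, dtype=torch.float16)
    q, mn, scale, shape = quantize(w, num_bits=4, group_size=64)
    w_rec = dequantize(q, mn, scale, shape)
    # With a consistent (quantized tensor, params) pair, the reconstruction
    # error should be small; that is the behavior I see with the DeepSpeed
    # params but not with the ones returned by run_float_quantize.
    print((w - w_rec).abs().max())
```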