GATECH-EIC/ShiftAddNet

round_to_fixed function


Hi, thanks for the great work! I am very interested in it.

However, I am new to quantization and have some questions about the round_to_fixed function in deepshift.utils, lines 7-18.

In line 15, torch.floor(input/delta) rounds the fp32 input down to a 16-bit integer. In my opinion, the clamp should be applied right after that, restricting these integers to the range [min_val, max_val]. That is, lines 15-17 would change to the following:
rounded = torch.floor(input/delta)
rounded = torch.clamp(rounded, min_val, max_val)
rounded = rounded*delta
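
To make the comparison concrete, here is a minimal sketch of the two orderings as I understand them. It assumes the existing code multiplies by delta first and clamps afterwards; the function names and the toy values of delta, min_val and max_val are hypothetical and chosen only so the numbers are easy to follow, they are not taken from the repository.

```python
import torch

def clamp_after_rescale(x, delta, min_val, max_val):
    # Assumed shape of the existing code: floor to the grid, rescale,
    # then clamp the real-valued result to [min_val, max_val].
    rounded = torch.floor(x / delta) * delta
    return torch.clamp(rounded, min_val, max_val)

def clamp_before_rescale(x, delta, min_val, max_val):
    # Proposed ordering: clamp the integer codes to [min_val, max_val],
    # then rescale, so the output lies in [min_val*delta, max_val*delta].
    rounded = torch.floor(x / delta)
    rounded = torch.clamp(rounded, min_val, max_val)
    return rounded * delta

# Toy settings (hypothetical): 2 fraction bits, bounds of -4 and 3.
delta = 2.0 ** -2
min_val, max_val = -4.0, 3.0

x = torch.tensor([-5.0, -0.3, 0.6, 2.9, 10.0])
a = clamp_after_rescale(x, delta, min_val, max_val)
b = clamp_before_rescale(x, delta, min_val, max_val)
print(a)  # tensor([-4.0000, -0.5000,  0.5000,  2.7500,  3.0000])
print(b)  # tensor([-1.0000, -0.5000,  0.5000,  0.7500,  0.7500])
```

With these toy values the two versions agree for small inputs but saturate at different points: the first bounds the real value to [-4, 3], while the second bounds the integer code, so the real value ends up in [-1, 0.75].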

Could you comment on the difference between these two implementations? Thanks!

BTW, may I ask what the range of input is? If input is fp32, the range is very large. Is there an implicit assumption that input lies in [-1, 1]?