round_to_fixed function
Closed this issue · 1 comments
Hi, thanks for the great work!! And I am very interested in this work.
However, I am new to quantization and have some questions about the round_to_fixed function in deepshift.utils (lines 7-18).
On line 15, torch.floor(input/delta) rounds the fp32 input down to an integer number of delta steps (floor, so not strictly the nearest integer), which I understand is meant to fit in 16 bits. In my opinion, the clamp should be applied right after that, so that these integers are clamped to the range [min_val, max_val]. That is, lines 15-17 would change to the following:
rounded = torch.floor(input/delta)
rounded = torch.clamp(rounded, min_val, max_val)
rounded = rounded*delta
Could you comment on the difference between these two implementations? Thanks!!
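To make the comparison concrete, here is a minimal scalar sketch of the two orderings. It rests on several assumptions: the bit widths (FRACTION_BITS, INTEGER_BITS) are small made-up values chosen for readability, the way delta, min_val and max_val are derived from them is my guess rather than something taken from deepshift.utils, and the "scale, then clamp" version is only how I read the current lines 15-17. Plain math.floor and a hand-written clamp stand in for torch.floor and torch.clamp.

```python
import math

# Hypothetical bit widths, for illustration only (not the library defaults).
FRACTION_BITS = 2
INTEGER_BITS = 3

# Assumed definitions: step size and signed integer bounds.
delta = 2.0 ** -FRACTION_BITS              # 0.25
min_val = -(2.0 ** (INTEGER_BITS - 1))     # -4.0
max_val = 2.0 ** (INTEGER_BITS - 1) - 1    # 3.0


def clamp(x, lo, hi):
    """Scalar stand-in for torch.clamp."""
    return max(lo, min(x, hi))


def scale_then_clamp(x):
    """Floor to a multiple of delta, then clamp the resulting real value."""
    rounded = math.floor(x / delta) * delta
    return clamp(rounded, min_val, max_val)


def clamp_then_scale(x):
    """Floor to an integer code, clamp the code, then scale by delta
    (the ordering proposed in this issue)."""
    rounded = math.floor(x / delta)
    rounded = clamp(rounded, min_val, max_val)
    return rounded * delta


if __name__ == "__main__":
    for x in (0.3, 1.0, 2.6, -5.0):
        print(x, scale_then_clamp(x), clamp_then_scale(x))
```

Under these assumed values the two versions agree for small inputs such as 0.3 (both give 0.25) but diverge once the integer code leaves [min_val, max_val]: for 1.0 the first returns 1.0 and the second 0.75. Clamping after scaling bounds the output to [min_val, max_val], while clamping before scaling bounds it to [min_val*delta, max_val*delta], so the choice depends on whether min_val and max_val are meant as limits on the real value or on the integer code.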
BTW, may I ask what the expected range of input is? If input is fp32, the range is very large. Is there an implicit assumption that input lies in [-1, 1]?