Problems with floatToFloat10 and floatToFloat11

Question

nemerle opened this issue 5 years ago · 2 comments

Both of those functions have the following block of code:

// Too small to be represented as a normalized float, convert to denormalized value
UINT32 shift = 113 - f.field.exponent;
val = (0x800000U | f.field.mantissa) >> shift;

I'm not 100% sure, but shouldn't that 113 be some other number?
Clang sanitizer is complaining about shifts by 113 bits :)

Answer 1 · 2019-08-09T07:11:50.000Z

I don't remember anymore, to be honest, and I can't find the reference I used for implementing that.

But I think it might be correct. A 32-bit float has an 8-bit exponent, while the 10- and 11-bit floats have a 5-bit one, so the range they can cover is much smaller. If the number is very small, its exponent will be very small too, the number will not be representable as a 10- or 11-bit float, and the result should be zero. So for most of those small exponents the value should get shifted all the way down to 0.
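
For reference, here is a rough sketch of where the constant 113 plausibly comes from. It assumes the usual layout of the 10- and 11-bit formats (5-bit exponent with a bias of 15) and the float32 bias of 127. The names below are hypothetical and are not taken from the code in question.

```c
#include <assert.h>
#include <stdint.h>

enum {
    FLOAT32_EXPONENT_BIAS = 127,        /* IEEE 754 single precision */
    SMALL_FLOAT_EXPONENT_BIAS = 15,     /* assumed bias of the 5-bit exponent */
    SMALL_FLOAT_MIN_NORMAL_EXPONENT = 1 /* exponent field 0 is reserved for denormals */
};

/* Smallest float32 biased exponent that still maps to a normalized
 * small float: 127 - 15 + 1. */
static uint32_t min_normal_float32_exponent(void)
{
    return FLOAT32_EXPONENT_BIAS - SMALL_FLOAT_EXPONENT_BIAS
         + SMALL_FLOAT_MIN_NORMAL_EXPONENT;
}

/* Shift the snippet in the question would compute for a float32 biased
 * exponent below that threshold. */
static uint32_t denormal_shift(uint32_t exponent)
{
    assert(exponent < min_normal_float32_exponent());
    return min_normal_float32_exponent() - exponent;
}
```

Under these assumptions the shift ranges from 1 (exponent 112) up to 113 (exponent 0, i.e. zero or a float32 denormal), which matches the 113-bit shift the sanitizer reports.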

I guess we could limit the shift to 32, as shifting by more than that makes no difference. (Probably even less once you account for the float structure, but I don't feel like doing that math right now.)
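
A minimal sketch of what such a limit could look like, assuming the exponent and mantissa have already been extracted from the float32 bit pattern. The helper name is hypothetical. Two things are worth noting: in C and C++, shifting a 32-bit operand by 32 or more is undefined behavior (so a shift of exactly 32 would still trip the sanitizer), and the shifted value `0x800000U | mantissa` only has 24 significant bits, so any shift of 24 or more yields 0 anyway.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: computes the denormalized value from the snippet
 * in the question without ever shifting by the operand width or more. */
static uint32_t denormalized_value(uint32_t exponent, uint32_t mantissa)
{
    assert(exponent < 113u);       /* only the "too small" branch applies */
    assert(mantissa <= 0x7FFFFFu); /* float32 mantissa is a 23-bit field */

    uint32_t shift = 113u - exponent;      /* ranges from 1 to 113 */
    uint32_t value = 0x800000u | mantissa; /* implicit leading 1 + mantissa */

    /* All 24 significant bits are shifted out, so the result is 0. */
    if (shift >= 24u)
        return 0u;

    return value >> shift;
}
```

For shifts below 24 this gives the same result as the original expression. For larger shifts it returns the 0 that the original code intends, without relying on how a particular compiler or CPU handles an oversized shift count.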

Answer 2 · 2019-08-22T18:44:23.000Z

I'll close this unless we get definite proof this is an issue.