TheThirdOne/rars

Error in Floating Point Representation tool about subnormal numbers?

pacalet opened this issue · 0 comments

According to IEEE 754-2019:

  • $emin$ shall be $1 − emax$ for all formats (section 3.3, page 17),
  • when biased exponent $E = 0$ and trailing significand $T \neq 0$, the number is subnormal and the corresponding value is $v = (-1)^S \times 2^{emin} \times (0 + 2^{1-p} \times T)$ (section 3.4, page 19),
  • for the 32-bit binary format, $p = 24$ and $emax = 127$ (table 3.5, page 23).

As a consequence, for the 32-bit format the value of a subnormal number shall be $v = (-1)^S \times 2^{-126} \times (0 + 2^{-23} \times T)$. The Floating Point Representation tool apparently has a different interpretation and displays the equation $v = (-1)^S \times 2^{-127} \times 2^{-23} \times T$. The first screenshot below shows the tool for $S = 0, E = 0, T = 1$. Still, the tool displays the correct value in scientific notation, 1.4E-45 (that is, $2^{-126} \times 2^{-23} = 2^{-149}$), rather than the value its own equation implies, $v = 2^{-127} \times 2^{-23} = 2^{-150} \approx 7 \times 10^{-46}$.
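The discrepancy can be checked numerically. The following is a minimal Python sketch, not code taken from RARS: the function names `decode_binary32` and `hardware_value` are hypothetical and chosen only for illustration. It evaluates a 32-bit pattern with the formulas quoted above and compares the result with the value obtained by reinterpreting the same bits through `struct`.

```python
import struct

P = 24            # precision of binary32 (IEEE 754-2019, table 3.5)
EMAX = 127        # maximum exponent of binary32
EMIN = 1 - EMAX   # -126, per section 3.3
BIAS = EMAX

def decode_binary32(bits: int) -> float:
    """Evaluate a binary32 bit pattern with the IEEE 754 formulas (finite values only)."""
    s = (bits >> 31) & 0x1        # sign bit S
    e = (bits >> 23) & 0xFF       # biased exponent E
    t = bits & 0x7FFFFF           # trailing significand T
    if e == 0xFF:
        raise ValueError("infinities and NaNs are not handled in this sketch")
    sign = -1.0 if s else 1.0
    if e == 0:
        # Zero or subnormal: exponent is emin (-126), implicit leading bit is 0.
        return sign * 2.0 ** EMIN * (0 + 2.0 ** (1 - P) * t)
    # Normal: exponent is E - bias, implicit leading bit is 1.
    return sign * 2.0 ** (e - BIAS) * (1 + 2.0 ** (1 - P) * t)

def hardware_value(bits: int) -> float:
    """Reinterpret the same 32 bits as a float using the platform-independent struct codec."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

if __name__ == "__main__":
    smallest = 0x00000001                   # S = 0, E = 0, T = 1
    print(decode_binary32(smallest))        # about 1.4e-45, i.e. 2**-149
    print(hardware_value(smallest))         # same value
    print(2.0 ** -127 * 2.0 ** -23)         # about 7e-46, what the displayed equation implies
```

With the exponent $-126$ the formula agrees with the reinterpreted bits; with $-127$ the result is off by a factor of two.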

I suggest replacing $0 - 127$ with $-126$, $2^{-127}$ with $2^{-126}$ and, while we are at it, "denormalized" with "subnormal", the term used by the current standard. I have submitted a pull request and added a second screenshot showing the corrected display.

rars-fp

rars-fp-fixed