Write 128 bit floating point types and algebras
bdezonia opened this issue · 6 comments
Support 128-bit floats (IEEE) in software.
Note for anyone interested in working on this: the Float64 types and algebras can serve as a working template for how to design the classes and for which code needs to be implemented. Code from a permissively licensed library for 128-bit float support could perhaps be adapted to Java in zorbage.
Here is more info on the number format:
https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format
One slow-running but simple implementation idea is to write code that works entirely in BigDecimal and saves to and reads from a 128-bit bundle of bytes when required. The encoder just translates BigDecimal values of various ranges into the 128-bit format outlined above.
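To make the byte layout concrete, here is a minimal sketch of the reading direction only: turning 16 bytes of binary128 into an exact BigDecimal. The class name `Binary128Decoder` and the big-endian byte order are assumptions for illustration, not zorbage code. It follows the layout from the Wikipedia page above (1 sign bit, 15 exponent bits with bias 16383, 112 fraction bits) and simply rejects infinities and NaNs.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical helper, not part of zorbage: decodes a finite binary128 value.
final class Binary128Decoder {

    private static final int EXP_BIAS = 16383;     // binary128 exponent bias
    private static final int FRACTION_BITS = 112;  // explicit fraction bits

    // Assumes big-endian input: bytes[0] holds the sign bit and the top 7 exponent bits.
    static BigDecimal decode(byte[] bytes) {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("binary128 needs exactly 16 bytes");
        }
        boolean negative = (bytes[0] & 0x80) != 0;
        int exponent = ((bytes[0] & 0x7F) << 8) | (bytes[1] & 0xFF);
        // The remaining 14 bytes are the 112-bit fraction, read as an unsigned integer.
        BigInteger fraction = new BigInteger(1, Arrays.copyOfRange(bytes, 2, 16));

        if (exponent == 0x7FFF) {
            throw new ArithmeticException("infinity or NaN has no BigDecimal value");
        }

        BigInteger significand;
        int power; // the value is significand * 2^power
        if (exponent == 0) {
            // Zero or subnormal: no implicit leading 1 bit.
            significand = fraction;
            power = 1 - EXP_BIAS - FRACTION_BITS;
        } else {
            // Normal number: restore the implicit leading 1 bit.
            significand = fraction.setBit(FRACTION_BITS);
            power = exponent - EXP_BIAS - FRACTION_BITS;
        }

        BigDecimal magnitude;
        if (power >= 0) {
            magnitude = new BigDecimal(significand.shiftLeft(power));
        } else {
            // m / 2^n == (m * 5^n) / 10^n, which BigDecimal can hold exactly.
            BigInteger unscaled = significand.multiply(BigInteger.valueOf(5).pow(-power));
            magnitude = new BigDecimal(unscaled, -power);
        }
        return negative ? magnitude.negate() : magnitude;
    }
}
```

The decoded values carry a large scale, so compare them with `compareTo` rather than `equals`. The writing direction (BigDecimal to bytes) needs a rounding step to 113 significant bits and is not shown here.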
I have taken this approach in the zorbage-nifti file reader code. However, it can't represent NaNs and infinities. I think one could grab that code as a starting point for decoding 128-bit IEEE numbers, and then represent the numbers internally as a BigDecimal and a boolean. All the math could work in BigDecimal, with values only encoded/decoded when assigning to a data structure. The data structure could encode each value as 17 bytes: 16 for the IEEE encoding and 1 for the boolean. The boolean represents a denominator of 0 or 1. If the denominator is 1, the stored value is a real. If the denominator is 0, then the result is negative infinity if the numerator < 0, positive infinity if the numerator > 0, and NaN if the numerator == 0. For mathematical correctness the denominator might need to be 1, positive 0, or negative 0, and that would have to be tracked carefully.
Note that this approach is not perfect. One can make Float128Members whose in-memory representation is outside the max and min boundaries for the type. When saved to storage such a value is encoded as an infinity, but while in RAM it can take on all kinds of values. This approach is more accurate than the IEEE binary128 approach, but since it differs from it there are bound to be computations that, though correct, would surprise a float128 user.
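The numerator/denominator rule described here can be sketched as a small classification function. The names `ExtendedValueSketch`, `Kind`, and `classify` are made up for illustration and are not the Float128Member API. The boolean stands for "denominator is 1".

```java
import java.math.BigDecimal;

// Hypothetical illustration of the BigDecimal-plus-boolean idea; not zorbage code.
final class ExtendedValueSketch {

    enum Kind { FINITE, POS_INF, NEG_INF, NAN }

    // denomIsOne == true  -> denominator 1, the numerator is an ordinary real value
    // denomIsOne == false -> denominator 0, the numerator's sign selects inf or NaN
    static Kind classify(BigDecimal numerator, boolean denomIsOne) {
        if (denomIsOne) {
            return Kind.FINITE;
        }
        int sign = numerator.signum();
        if (sign < 0) {
            return Kind.NEG_INF;
        }
        if (sign > 0) {
            return Kind.POS_INF;
        }
        return Kind.NAN;
    }
}
```

This version does not distinguish a positive zero denominator from a negative one, so it cannot model signed-zero behavior.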
I am working on this code on the flt128 branch. Most of the real flt128 number code has been written, but not yet vec/mat/tens. It also has not yet been duplicated to complex, quat, and oct.
However, I have fixed the problem of values drifting out of bounds. We clamp() after every setV(), and the internal methods use setV().
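As a rough illustration of the clamp-after-set idea, here is a minimal sketch. It assumes that clamping means "a magnitude larger than the largest finite binary128 value becomes an infinity", which matches how out-of-range values are encoded when saved. The class `ClampSketch` and its fields are hypothetical and do not reflect the actual flt128 branch code.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical sketch of clamping on every set; not the zorbage implementation.
final class ClampSketch {

    // Largest finite binary128 value: (2^113 - 1) * 2^(16383 - 112).
    static final BigDecimal MAX_FINITE = new BigDecimal(
            BigInteger.ONE.shiftLeft(113).subtract(BigInteger.ONE).shiftLeft(16383 - 112));

    private BigDecimal num = BigDecimal.ZERO; // numerator
    private boolean denomIsOne = true;        // false means denominator 0 (inf or NaN)

    // Every assignment of a real value goes through here, so the value can never drift.
    void setV(BigDecimal v) {
        num = v;
        denomIsOne = true;
        clamp();
    }

    // Assumption: out-of-range magnitudes collapse to a signed infinity.
    private void clamp() {
        if (num.abs().compareTo(MAX_FINITE) > 0) {
            num = BigDecimal.valueOf(num.signum()); // keep only the sign: +1 or -1
            denomIsOne = false;
        }
    }

    boolean isFinite() { return denomIsOne; }
    boolean isPosInf() { return !denomIsOne && num.signum() > 0; }
    boolean isNegInf() { return !denomIsOne && num.signum() < 0; }
    BigDecimal numerator() { return num; }
}
```

This sketch only bounds the magnitude. It does not round the value to a 113-bit significand or flush tiny values to zero, so an in-range value can still carry more precision than a true binary128 would.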
This feature is complete, including real/complex/quaternion/octonion implementations for numbers/vectors/matrices/tensors. Holy cow, this was way more work than I expected. Closing.