milankl/SoftPosit.jl

conversion Float32 <-> Posit16

Closed this issue · 13 comments

https://twitter.com/ProjectPhysX/status/1444007905058070531?s=20 suggests a new conversion between Float32 and Posit16

Hi all,

I'm researching Posits as a potential storage format for the DDFs (density distribution functions) in the lattice Boltzmann method. Therefore I needed much faster conversion algorithms than were previously available, especially ones suitable for GPUs.
I wrote these conversion algorithms to be branchless, using just basic bit manipulation. They are much shorter than the existing SoftPosit conversion, and also much faster. Tests with them integrated into the LBM showed a 22x performance gain compared to the existing SoftPosit implementation.

Below is the validation plot, showing digits = -log10(fabs(log10(x_converted_back_and_forth/x))), proving the conversion is bitwise identical and handles rounding/overflow correctly. Note that I have ditched the infinity and NaR definitions to spare some operations.
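
As a quick sketch of that metric in Julia (here `roundtrip` is a hypothetical stand-in for any Float32 -> Posit16 -> Float32 conversion pair):

accuracy_digits(x, roundtrip) = -log10(abs(log10(roundtrip(x) / x)))

A lossless round-trip yields Inf; otherwise the value counts the matching decimal digits.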

The purpose of these algorithms really is to be as fast as possible for applications where lots of conversions are needed, for example compressing numbers for storage in memory.

Regards,
Moritz Lehmann

[validation plot: decimal digits of agreement for the back-and-forth conversion]

#include <stdlib.h> // for abs()

typedef unsigned int uint;
typedef unsigned short ushort;

// bit-reinterpretation helpers (type punning between float and 32-bit integers)
float as_float(const uint x) {
	return *(float*)&x;
}
uint as_uint(const float x) {
	return *(uint*)&x;
}
int as_int(const float x) {
	return *(int*)&x;
}

ushort float_to_p160(const float x) {
	const uint b = as_uint(x);
	const int e = ((b&0x7F800000)>>23)-127; // exponent-bias
	int m = (b&0x007FFFFF)>>9; // mantissa
	const int v = abs(e); // shift
	const int r = (e<0 ? 0x0002 : 0xFFFE)<<(13-v); // generate regime bits
	m = ((m>>(v-(e<0)))+1+(e<-13)*0x2)>>1; // rounding: add 1 after truncated position; in case of lowest numbers, saturate
	return (b&0x80000000)>>16 | (e>-16)*((r+m)&0x7FFF) | (e>13)*0x7FFF; // sign | regime+mantissa ("+" handles rounding overflow) | saturate
}
float p160_to_float(const ushort x) {
	const uint sr = (x>>14)&1; // sign of regime
	ushort t = x<<2; // remove sign and first regime bit
	t = sr ? ~t : t; // positive regime r>=0 : negative regime r<0
	const int r = 142-(as_int((float)t)>>23); // evil log2 bit hack to count leading zeros for regime
	const uint m = (x<<(r+10))&0x007FFFFF; // extract mantissa and bit-shift it in place
	const int rs = sr ? r : -r-1; // negative regime r<0 : positive regime r>=0
	return as_float((x&0x8000)<<16 | (r!=158)*((rs+127)<<23 | m)); // sign | regime | mantissa
}

ushort float_to_p161(const float x) {
	const uint b = as_uint(x);
	const int e = ((b&0x7F800000)>>23)-127; // exponent-bias
	int m = (b&0x007FFFFF)>>10; // mantissa
	const int ae = abs(e);
	const int v = ae>>1; // shift, ">>1" is the same as "/2"
	const int e2 = ae&1; // "&1" is the same as "%2"
	const int r = ((e<0 ? 0x0002 : 0xFFFE<<e2)+e2)<<(13-v-e2); // generate regime bits, merge regime+exponent and shift in place
	m = ((m>>(v-(e<0)*(1-e2)))+(e>-28)+(e<-26)*0x3)>>1; // rounding: add 1 after truncated position; in case of lowest numbers, saturate
	return (b&0x80000000)>>16 | (e>-31)*((r+m)&0x7FFF) | (e>26)*0x7FFF; // sign | regime+exponent+mantissa ("+" handles rounding overflow) | saturate
}
float p161_to_float(const ushort x) {
	const uint sr = (x>>14)&1; // sign of regime
	ushort t = x<<2; // remove sign and first regime bit
	t = sr ? ~t : t; // positive regime r>=0 : negative regime r<0
	const int r = 142-(as_int((float)t)>>23); // evil log2 bit hack to count leading zeros for regime
	const uint e = (x>>(12-r))&1; // extract exponent bit
	const uint m = (x<<(r+11))&0x007FFFFF; // extract mantissa and bit-shift it in place
	const int rs = (sr ? r : -r-1)<<1; // negative regime r<0 : positive regime r>=0, "<<1" is the same as "*2"
	return as_float((x&0x8000)<<16 | (r!=158)*((rs+e+127)<<23 | m)); // sign | regime+exponent | mantissa
}
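
For orientation, a note on the regime/exponent split that float_to_p161 relies on: with es = 1, a Float32 exponent e decomposes into a regime value k and one exponent bit e2 such that e = 2k + e2. A minimal Julia sketch (a hypothetical helper; the C code above works with abs(e) and handles the sign separately):

decompose(e::Int) = (e >> 1, e & 1) # (k, e2); arithmetic shift is floor division, so e == 2k + e2 holds for negative e too

julia> decompose(-3) # regime -2, exponent bit 1: 2*(-2) + 1 == -3
(-2, 1)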

EDIT: Flush to zero now works correctly.
EDIT 2: simplified the evil log2 bit hack part

Instead of int r = 158-(as_int((float)((uint)t<<16))>>23); I would write r = leading_zeros(t) in Julia. Is r the actual number of regime bits or does it exclude the first regime bit that you removed?

I just simplified that part in the algorithm again (I made t a ushort instead of an int to eliminate the (uint)t<<16).

r = leading_zeros(t) is equivalent to r = 158-(as_int((float)t)>>23); when t is a 32-bit integer.
In Julia you can do r = leading_zeros(t)-16 instead of r = 142-(as_int((float)t)>>23);.
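
A quick Julia check of that equivalence over the 16-bit range (the biased exponent field of Float32(t) encodes the position of t's highest set bit; t = 0 is excluded as the identity breaks there):

for t in UInt32(1):UInt32(0xffff)
    r_hack = 142 - (reinterpret(Int32, Float32(t)) >> 23) # evil log2 bit hack
    @assert r_hack == leading_zeros(t) - 16               # same count from the intrinsic
end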

r is the number of bits in the regime minus 1, so it excludes the regime termination bit.

Quick check: does your version work for negative numbers too? You don't seem to implement the two's complement, do you? And the rounding is just round-to-nearest, ties away from zero, right? Because I believe the posit standard implements ties-to-even, as for floats.

Okay, that's my take on this conversion. I don't need any branches (neither if nor ?), although I think it currently overflows where it shouldn't, and at the moment underflows occur at minpos/4 and not at minpos/2 (but I think SoftPosit never underflows in conversion???)

function Posit16_new(x::Float32)
    ui = reinterpret(UInt32,x)

    # REGIME AND EXPONENT BITS
    # exponent without the exponent bias 127
    e = reinterpret(Int32,(ui & 0x7f80_0000) >> 23) - Int32(127)
    abs_e = abs(e)                  # exponent magnitude
    signbit_e = signbit(e)          # sign of exponent
    n_regimebits = (abs_e >> 1) + 1 # number of regime bits
    exponent_bit = abs_e & 0x1      # exponent bit, 1 for odd exponents 3,5,...

    # combine regime bit and exponent and then arithmetic bitshift them in for e.g. 111110_e_....
    regime_exponent = reinterpret(Int32,0x8000_0000 | (exponent_bit << 29)) >> n_regimebits

    # for x < 1 use two's complement on the regime and exponent bits to flip them correctly
    regime_exponent = (regime_exponent ⊻ (signbit_e*0xffff_ffff)) + signbit_e
    regime_exponent &= 0x7fff_ffff  # remove any sign that results from arith bitshift

    # MANTISSA - isolate mantissa bits and shift into position
    mantissa = (ui & 0x007f_ffff) >> (n_regimebits-6+signbit_e*(exponent_bit-1))

    # add u/2 = 0x0000_7fff or 0x0000_8000 for tie to even
    u_half = 0x0000_7fff + ((mantissa >> 16) & 0x0000_0001)

    # combine regime, exponent and mantissa, round to nearest, tie to even
    p32 = (regime_exponent | mantissa) + u_half
    p16 = (p32 >> 16) % UInt16      # after +u_half round down via >>

    # combine with sign bit and two's complement for negative numbers
    signbitx = signbit(x)           # true or false
    sign = ((ui & 0x8000_0000) >> 16) % UInt16 # isolate sign bit
    p16 = sign | ((p16 ⊻ (signbitx*0xffff)) + signbitx) # two's complement

    return reinterpret(Posit16,p16)
end

It's currently written more for readability than for performance. On an i5 (Ice Lake) I get the following timings (Julia v1.6.3):

julia> using SoftPosit, BenchmarkTools
julia> A = rand(Float32,1000000)
julia> @btime Posit16_new.($A)     # new conversion in pure Julia
  7.020 ms (2 allocations: 1.91 MiB)
julia> @btime Posit16.($A)         # SoftPosit's C conversion
  89.033 ms (2 allocations: 1.91 MiB)

So that's a performance gain of about 12-13x. However, if I disable the last two's complement (so that only positive floats are correctly converted) I get to ~100x speedup:

julia> @btime Posit16_new.($A);
  866.962 μs (5 allocations: 1.91 MiB)

So there's clearly room for improvement. Will try and look at that next week. @ProjectPhysX what are your C timings?

My conversion works for negative numbers as well. I do the conversion only for positive numbers and then copy over the sign bit. Positive and negative numbers are completely symmetric this way.

For rounding I do round-to-nearest-even, i.e. add 1 to the leftmost bit of what is truncated from the mantissa, for example to 01010|10 I add 00000|10 to get 01011|00 and then I truncate to 01011.
This is equivalent to correct integer rounding if you treat the mantissa as an integer, except for edge cases where it is a tie. For example this rounds 0.4999 to 0, 0.5001 to 1, and 0.5000 to 1 as well. I think this is different from the floating-point and also the posit standard, but it is equally accurate and makes the implementation much simpler.
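
In integer terms this is "add half an ULP, then truncate"; a minimal Julia sketch with 2 truncated bits, matching the example above (note ties always round up here, which is discussed further below):

round_add_half(m) = (m + 0b10) >> 2 # 01010|10 + 00000|10 = 01011|00, truncated to 01011

julia> string(round_add_half(0b0101010), base=2)
"1011"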

As for timings, for P16_1 I get 54ns per back-and-forth conversion for SoftPosit and 9.5ns for my implementation, so a speedup of 5.8x. For P16_0 I don't have a reference, but the back-and-forth conversion takes 8.5ns. I have an 8700K at ~4.4GHz.
On the GPU, with the conversion embedded in a lattice Boltzmann algorithm, I get ~20x speedup on my Nvidia Titan Xp. Here branchless code makes a much larger difference, as all threads within a workgroup need to execute both branches if even a single thread diverges.

My conversion works for negative numbers as well. I do the conversion only for positive numbers and then copy over the sign bit. Positive and negative numbers are completely symmetric this way.

This is true for floats, but not for posits, which use two's complement for negative numbers. For example, maxpos is 0111...1 but -maxpos is 100...01. It seems that you are only interested in conversions, in which case you could change the definition of posits to a sign-magnitude representation. However, what do you do then with NaR? 100...0 would then be -0, and you'd need to steal another bit pattern for NaR.
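
To illustrate the two's complement: negating a posit is integer negation of its bit pattern (a sketch using the same reinterpret as the code in this thread):

neg(p::Posit16) = reinterpret(Posit16, -reinterpret(UInt16, p)) # two's complement negation

julia> reinterpret(UInt16, neg(reinterpret(Posit16, 0x7fff))) # maxpos -> -maxpos
0x8001

Note that NaR (0x8000) is its own negation under this rule.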

For rounding I do round-to-nearest-even, i.e. add 1 to the leftmost bit of what is truncated from the mantissa, for example to 01010|10 I add 00000|10 to get 01011|00 and then I truncate to 01011. This is equivalent to correct integer rounding if you treat the mantissa as an integer, except for edge cases where it is a tie. For example this rounds 0.4999 to 0, 0.5001 to 1, and 0.5000 to 1 as well. I think this is different from the floating-point and also the posit standard, but it is equally accurate and makes the implementation much simpler.

Note that with your method a (positive) tie like 00|10 is always rounded up to 01|00; for integers this corresponds to rounding 0.5 to 1, 1.5 to 2, 2.5 to 3, etc. For floats this means all ties are rounded away from zero (sign-magnitude representation); for posits (two's complement), all ties would be rounded up, as their ordering is reversed for negative numbers due to the two's complement. In order to remove this bias, tie-to-even is part of the float and also the posit standard (https://posithub.org/docs/posit_standard.pdf, 4.1). It sounds a bit irrelevant, but the chance of hitting a tie with float or posit arithmetic is actually surprisingly large due to their piecewise-linear distribution of representable numbers.
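
The add-half-then-truncate trick extends to ties-to-even cheaply by adding the LSB of the kept part to u/2, as Posit16_new above does; a standalone Julia sketch rounding a UInt32 to its top 16 bits:

function round_ties_even_16(p32::UInt32)
    u_half = 0x0000_7fff + ((p32 >> 16) & 0x0000_0001) # 0x8000 if the kept LSB is odd
    return ((p32 + u_half) >> 16) % UInt16             # the carry only propagates past a tie
end

julia> round_ties_even_16(0x00028000), round_ties_even_16(0x00038000) # both ties round to even
(0x0002, 0x0004)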

As for timings, for P16_1 I get 54ns per back-and-forth conversion for SoftPosit and 9.5ns for my implementation, so a speedup of 5.8x. For P16_0 I don't have a reference, but the back-and-forth conversion takes 8.5ns. I have an 8700K at ~4.4GHz. On the GPU, with the conversion embedded in a lattice Boltzmann algorithm, I get ~20x speedup on my Nvidia Titan Xp. Here branchless code makes a much larger difference, as all threads within a workgroup need to execute both branches if even a single thread diverges.

Does that mean you time a single back-and-forth conversion, or did you apply this to a whole array? Looks like with my implementation I could get below 1ns one way, but as always comparisons are difficult as I'm just using a 2GHz i5...

Okay, I got it down to

julia> A = randn(Float32,1000000);
julia> @btime Posit16_new.($A);
  850.299 μs (5 allocations: 1.91 MiB)

meaning 0.85ns per conversion (one way only atm), about 110x faster than SoftPosit. There was another shortcut in the last two's complement. Let me know whether you can get any faster.

Thanks for all your thoughts! Yes, I only want the sign-magnitude representation for my purposes, so I ditched NaR.
Rounding the ties up or down is also not relevant for my purposes; I use posits only as a compressed storage format and not for arithmetic.
I measured timings by timing about 30M back-and-forth conversions and dividing by 30M to get the time per conversion.

julia> using SoftPosit, BenchmarkTools

julia> function f(A)
           @inbounds for i in eachindex(A)
               A[i] = Float32_new(Posit16_new(A[i]))
           end
       end
f (generic function with 1 method)

julia> A = randn(Float32,1000000);
julia> @btime f($A);
  1.895 ms (0 allocations: 0 bytes)

Got the back & forth conversion (without memory allocation this time) down to <2ns on my MacBook Air (i5 10th gen, Ice Lake, 1.1GHz); I get similar results (2.8ns) on an older i7 (7th gen, Kaby Lake, 3.6GHz). The code is

function Float32_new(x::Posit16)
    ui = reinterpret(UInt16,x)
    signbitx = signbit(x)                # sign of number
    abs_p16 = (signbitx ? -ui : ui) << 1 # two's complement for negative, remove sign bit

    # determine sign of regime/exponent from the first regime bit: 1 for |x| >= 1, 0 for |x| < 1
    sign_exponent = reinterpret(Bool,((abs_p16 & 0x8000) >> 15) % UInt8)
    n_regimebits = sign_exponent ? leading_ones(abs_p16) : leading_zeros(abs_p16)

    # extract exponent and mantissa bits
    exponent_bit = ((abs_p16 << (n_regimebits+1)) & 0x8000) >> 15
    mantissa = ((abs_p16 << (n_regimebits + 2)) % UInt32) << 7

    # assemble float exponent from posit # of regime bits and exponent bit
    exponent = ((sign_exponent ? n_regimebits : -n_regimebits+1) << 1) + exponent_bit + 125

    # set exponent (and 1st mantissa bit) to NaN for NaR inputs
    # set exponent to 0 for zero(Posit16) input
    exponent = n_regimebits == 16 ? signbitx*0x7fc00000 : (exponent % UInt32) << 23

    # assemble sign, exponent and mantissa bits
    sign = (signbitx % UInt32) << 31 # isolate sign bit
    f32 = sign | exponent | mantissa # concatenate sign, exponent and mantissa
    return reinterpret(Float32,f32)
end
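
A round-trip sanity check (0x4000 is the standard posit16 bit pattern for 1.0; -2.5 is exactly representable in Posit16):

julia> Float32_new(reinterpret(Posit16, 0x4000))
1.0f0

julia> Float32_new(Posit16_new(-2.5f0))
-2.5f0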

Feel free to use this version or investigate further where the speed differences are!

I've just made the new conversion the default for SoftPosit v0.4 with #61. It still includes slight changes in the special cases of the rounding mode compared to the posit (draft) standard, which are now described in the README.md.

The no-overflow rounding mode

p16 = ((p32 >> 16) % UInt16) - (15 < n_regimebits < 64) # after +u_half round down via >>

made things a bit more expensive, as I had to prevent rounding to NaR for [maxpos(Posit16),floatmax(Float32)], but it's still below 3ns for the back & forth conversion.
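
Illustrative of the description above: large Float32 inputs now saturate at maxpos instead of rounding across to NaR, e.g.

julia> reinterpret(UInt16, Posit16(floatmax(Float32))) # maxpos, not NaR (0x8000)
0x7fff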

Closed with #61