float16 provides IEEE 754 half-precision floating-point numbers (binary16) with IEEE 754 default rounding for conversions.
All possible 4+ billion conversions between float16 and float32 are verified to be correct.
Conversions between float16 and float32:
- use IEEE 754-2008 "Round-to-Nearest RoundTiesToEven" when converting float32 to float16
- use lossless conversion when converting float16 to float32
- use zero allocations and take less than 3 ns/op (in pure Go) on a desktop CPU
This library is used by fxamacker/cbor v1.4+ and is ready for production use on supported platforms.
This float16 library produces correct results for all conversions between float16 and float32.
Current status:
- 100% of unit tests pass:
  - short mode (go test -short) tests around 65763 conversions in 0.005s.
  - normal mode (go test) tests all possible 4+ billion conversions in about 45s.
- 100% code coverage with both short mode and normal mode.
- conversions use zero allocations and take less than 3 ns/op (in pure Go) on a desktop.
- tested on amd64 but it should work on all little-endian platforms supported by Go.
Roadmap:
- add functions for fast batch conversions
- speed up the unit tests that verify all possible 4+ billion conversions
- test on additional platforms
Conversions from float16 to float32 are lossless. All 65536 possible float16 to float32 conversions (in pure Go) are confirmed to be correct.
Unit tests take a fraction of a second to check all 65536 expected values for float16 to float32 conversions.
Conversions from float32 to float16 use IEEE 754 default rounding ("Round-to-Nearest RoundTiesToEven"). All 4294967296 possible float32 to float16 conversions (in pure Go) are confirmed to be correct.
Unit tests in normal mode take about 35-55 seconds to check all 4+ billion expected values for float32 to float16 conversions.
Unit tests in short mode use a small subset (65763) of expected values and finish in under 1 second while still reaching 100% code coverage.
Float16 is a Go type with a simple API. The only state for each Float16 is a uint16.
package float16 // import "github.com/x448/float16"
// Exported types
type Float16 uint16
// Exported functions
Fromfloat32(f32 float32) Float16 // returns Float16 converted from f32 using IEEE 754 default rounding
Frombits(u16 uint16) Float16 // returns Float16 by casting uint16 to Float16
NaN() Float16 // returns IEEE 754 half-precision not-a-number
Inf(sign int) Float16 // returns IEEE 754 half-precision infinity according to sign
// Exported methods
(f Float16) Float32() float32 // returns float32 converted from f using lossless conversion
(f Float16) IsNaN() bool // returns true if f is not-a-number (NaN)
(f Float16) IsInf(sign int) bool // returns true if f is infinite according to sign (-1=NegInf, 0=Both, 1=PosInf)
(f Float16) IsFinite() bool // returns true if f is not infinite or NaN
(f Float16) IsNormal() bool // returns true if f is not zero, infinite, subnormal, or NaN
(f Float16) Signbit() bool // returns true if f is negative or negative zero
(f Float16) String() string // returns the string representation of f to satisfy fmt.Stringer interface
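Since the only state is a uint16, the classification methods above can be understood as bit-mask tests on the binary16 layout (1 sign bit, 5 exponent bits, 10 fraction bits). The sketch below uses a hypothetical local type half with unexported methods. It is an assumption about how such predicates can be written, not a copy of the library's code.

```go
package main

import "fmt"

// half is a hypothetical stand-in for a uint16-backed binary16 type.
type half uint16

const (
	signMask = 0x8000 // 1 sign bit
	expMask  = 0x7c00 // 5 exponent bits
	fracMask = 0x03ff // 10 fraction bits
)

// isNaN: exponent all ones and a non-zero fraction.
func (h half) isNaN() bool { return h&expMask == expMask && h&fracMask != 0 }

// isInf: exponent all ones and a zero fraction; sign selects which infinity
// (-1 = negative, 0 = either, 1 = positive).
func (h half) isInf(sign int) bool {
	if h&expMask != expMask || h&fracMask != 0 {
		return false
	}
	neg := h&signMask != 0
	return sign == 0 || (sign > 0 && !neg) || (sign < 0 && neg)
}

// isFinite: anything whose exponent is not all ones.
func (h half) isFinite() bool { return h&expMask != expMask }

// isNormal: exponent neither all zeros (zero, subnormal) nor all ones (Inf, NaN).
func (h half) isNormal() bool {
	e := h & expMask
	return e != 0 && e != expMask
}

// signbit: true for negative values, including negative zero.
func (h half) signbit() bool { return h&signMask != 0 }

func main() {
	for _, h := range []half{0x3c00, 0x0001, 0x8000, 0xfc00, 0x7e00} {
		fmt.Printf("0x%04x nan=%v inf=%v finite=%v normal=%v signbit=%v\n",
			uint16(h), h.isNaN(), h.isInf(0), h.isFinite(), h.isNormal(), h.signbit())
	}
}
```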
// Convert float32 to float16
pi := float32(math.Pi)
pi16 := float16.Fromfloat32(pi)
// Convert float16 to float32
pi32 := pi16.Float32()
Conversions (in pure Go) take around 3 ns/op for float16 to float32 as well as float32 to float16.
All functions have zero allocations except Float16.String().
FromFloat32pi-2 2.59ns ± 0% // speed using Fromfloat32() to convert float32 of math.Pi to Float16
ToFloat32pi-2 2.69ns ± 0% // speed using Float32() to convert float16 of math.Pi to float32
Frombits-2 0.36ns ± 8% // speed using Frombits() to cast uint16 to Float16
- Tested on Go 1.12 and Go 1.13 but it should also work with older versions.
- Tested on amd64 but it should also work on all little-endian platforms supported by Go.
Special thanks to Kathryn Long (starkat99) for creating half-rs, a very nice Rust implementation of float16.