fp16

Half-precision 16-bit floating point numbers.

DataView has APIs for getting and setting float64s and float32s. This library provides the analogous methods for float16s, and utilities for testing how a given float64 value will convert to float32 and float16 values. Conversion implements the IEEE 754 default rounding behavior ("Round-to-Nearest RoundTiesToEven").
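
As a rough illustration of the ties-to-even rule, the sketch below uses the standard `Math.fround` (float64 to float32) rather than fp16 itself. The variable names are only for this example. The README states that fp16 applies the same rounding mode when converting to float16.

```typescript
// Spacing between adjacent float32 values in [1, 2) is 2^-23.
const ulp = 2 ** -23

// Exactly halfway between 1 (even significand) and 1 + ulp (odd significand).
const tieBelow = 1 + 2 ** -24
// Exactly halfway between 1 + ulp (odd) and 1 + 2 * ulp (even).
const tieAbove = 1 + 3 * 2 ** -24

const roundedBelow = Math.fround(tieBelow) // 1
const roundedAbove = Math.fround(tieAbove) // 1 + 2 * ulp

// A value that is not a tie simply goes to the nearest float32,
// so 0.1 does not survive the conversion unchanged.
const tenthIsExact = Math.fround(0.1) === 0.1 // false
```

Both ties land on the neighbour whose last significand bit is zero, which is what "ties to even" means.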

NaN is always encoded as 0x7e00, which extends the pattern of how browsers serialize NaN in 32 bits and is the recommendation in the CBOR spec.
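
To make the "pattern" concrete, here is a small sketch that builds the quiet NaN bit pattern for a given format: sign bit clear, exponent all ones, and only the top fraction bit set. `quietNaNBits` is a hypothetical helper written for this example and is not part of fp16.

```typescript
// Bit pattern of the canonical quiet NaN for a binary floating-point format:
// sign 0, exponent field all ones, only the most significant fraction bit set.
function quietNaNBits(exponentBits: number, fractionBits: number): number {
  const exponentField = 2 ** exponentBits - 1
  // Plain arithmetic instead of bit shifts, so 32-bit results stay positive.
  return exponentField * 2 ** fractionBits + 2 ** (fractionBits - 1)
}

const float16NaN = quietNaNBits(5, 10).toString(16) // "7e00"
const float32NaN = quietNaNBits(8, 23).toString(16) // "7fc00000"
```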

This library is TypeScript-native, ESM-only, and has zero dependencies. It works in Node, the browser, and Deno.

Install

npm i fp16

Usage

Set a 16-bit float

declare function setFloat16(
  view: DataView,
  offset: number,
  value: number,
  littleEndian?: boolean,
): void

Get a 16-bit float

declare function getFloat16(
  view: DataView,
  offset: number,
  littleEndian?: boolean,
): number
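
For readers unfamiliar with the binary16 layout, the following is a minimal standalone sketch of decoding 16 raw bits into a number. It is not fp16's implementation, and `decodeFloat16Bits` is a hypothetical name. It assumes the standard IEEE 754 layout of 1 sign bit, 5 exponent bits (bias 15), and 10 fraction bits.

```typescript
// Illustrative decoder for IEEE 754 binary16; not taken from the fp16 source.
function decodeFloat16Bits(bits: number): number {
  const sign = bits & 0x8000 ? -1 : 1
  const exponent = (bits >>> 10) & 0x1f // 5 exponent bits, bias 15
  const fraction = bits & 0x03ff // 10 fraction bits

  // All-ones exponent: infinity when the fraction is zero, NaN otherwise.
  if (exponent === 0x1f) return fraction === 0 ? sign * Infinity : NaN
  // Zero exponent: signed zero and subnormals, in steps of 2^-24.
  if (exponent === 0) return sign * fraction * 2 ** -24
  // Normal values carry an implicit leading 1.
  return sign * (1 + fraction / 1024) * 2 ** (exponent - 15)
}

// Read the raw bits with the built-in DataView API, then decode them.
const view = new DataView(new ArrayBuffer(2))
view.setUint16(0, 0x3c00)
const one = decodeFloat16Bits(view.getUint16(0)) // 1
```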

Precision

In addition to methods for getting and setting float16s, fp16 exports two methods for testing how a given number value will convert to 32-bit and 16-bit values.

export const Precision = {
	Exact: 0,
	Inexact: 1,
	Underflow: 2,
	Overflow: 3,
} as const

export type Precision = typeof Precision[keyof typeof Precision]

declare function getFloat32Precision(value: number): Precision
declare function getFloat16Precision(value: number): Precision
  • Precision.Exact: conversion will not lose precision. The value is guaranteed to round-trip back to the same number value. Positive and negative zero, positive and negative infinity, and NaN all return exact. Values that can be represented losslessly as a subnormal value in the target format will return exact.
  • Precision.Overflow: the exponent of the given value is greater than the maximum exponent of the target size (127 for float32 or 15 for float16). Conversion is guaranteed to overflow to +/- Infinity.
  • Precision.Underflow: the exponent of the given value is less than the minimum exponent minus the number of fractional bits of the target size (-126 - 23 for float32 or -14 - 10 for float16). Conversion is guaranteed to underflow to +/- 0 or to the smallest signed subnormal value (+/- 2^-149 for float32 or +/- 2^-24 for float16).
  • Precision.Inexact: the exponent is within the target range, but precision bits will be lost during rounding. The value may round to +/- 0 but will never round to +/- Infinity.
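
The four categories can be sketched for the float32 case using only built-in APIs. This is an approximation written from the descriptions above, not the library's code. `exponentOf` and `sketchFloat32Precision` are hypothetical names, and edge cases may be classified differently by fp16 itself.

```typescript
// Local copy of the enum so the sketch is self-contained.
const Precision = { Exact: 0, Inexact: 1, Underflow: 2, Overflow: 3 } as const
type Precision = (typeof Precision)[keyof typeof Precision]

// Unbiased binary exponent of a float64, read from its raw bits.
// Subnormal float64 values report -1023 here, which is far below any threshold used below.
function exponentOf(value: number): number {
  const view = new DataView(new ArrayBuffer(8))
  view.setFloat64(0, value)
  // The 11 exponent bits sit directly below the sign bit of the high 32-bit word.
  return ((view.getUint32(0) >>> 20) & 0x7ff) - 1023
}

function sketchFloat32Precision(value: number): Precision {
  // NaN, signed zeros, infinities, and any value Math.fround leaves unchanged round-trip.
  if (Number.isNaN(value) || Math.fround(value) === value) return Precision.Exact
  const exponent = exponentOf(value)
  if (exponent > 127) return Precision.Overflow
  if (exponent < -126 - 23) return Precision.Underflow
  return Precision.Inexact
}

const half = sketchFloat32Precision(0.5) // Precision.Exact
const tenth = sketchFloat32Precision(0.1) // Precision.Inexact
const huge = sketchFloat32Precision(1e300) // Precision.Overflow
const tiny = sketchFloat32Precision(1e-300) // Precision.Underflow
```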

Note that the boundaries for overflow and underflow are not necessarily what you might expect: values with exponents just under the minimum exponent for a format map to subnormal values.
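
As a worked example of where those boundaries fall for float16, the arithmetic below derives the smallest subnormal, the smallest normal, and the largest finite value from the exponent range and fraction width quoted above. The `float16` object and the constant names are illustrative only.

```typescript
// Format parameters for float16, as quoted in the Precision descriptions.
const float16 = { minExponent: -14, maxExponent: 15, fractionBits: 10 }

// Smallest positive subnormal: 2^(-14 - 10) = 2^-24.
const smallestSubnormal = 2 ** (float16.minExponent - float16.fractionBits)
// Smallest positive normal value: 2^-14. Everything between the two is subnormal.
const smallestNormal = 2 ** float16.minExponent
// Largest finite value: (2 - 2^-10) * 2^15 = 65504.
const largestFinite = (2 - 2 ** -float16.fractionBits) * 2 ** float16.maxExponent
```

The range between `smallestSubnormal` and `smallestNormal` is why the underflow threshold sits ten exponent steps below the minimum exponent.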

Also note that fp16 treats all NaN values as identical, ignoring sign and signalling bits when decoding, and encoding every NaN value as 0x7e00. This means that not all 16-bit values will round-trip through setFloat16 and getFloat16.
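
To get a sense of how many 16-bit values are affected, this short sketch counts the bit patterns that IEEE 754 treats as NaN in binary16 (exponent field all ones, non-zero fraction). It does not call fp16, and the variable names are illustrative.

```typescript
// Count every 16-bit pattern whose exponent is all ones and whose fraction is non-zero.
let nanPatterns = 0
for (let bits = 0; bits < 0x10000; bits++) {
  const exponent = (bits >>> 10) & 0x1f
  const fraction = bits & 0x03ff
  if (exponent === 0x1f && fraction !== 0) nanPatterns++
}
// nanPatterns is 2046: 1023 non-zero fractions for each of the two signs.
```

Because every NaN is re-encoded as 0x7e00, at most one of those 2046 patterns can come back unchanged.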

Testing

Tests use AVA and live in the test directory.

npm run test

Tests cover decoding all 65536 possible 16-bit values, rounding behaviour, subnormal values, underflows, and overflows. More tests are always welcome.

Credits

This PDF was extremely helpful as a reference for understanding the float16 format, even though fp16 doesn't use the table-based approach it outlines.

The Golang github.com/x448/float16 package was used as a reference for implementing rounding. The test suite in tests/32to16.js was adapted from its test file float16_test.go.

Contributing

I don't expect to add features to this library or change any of the exported interfaces. If you encounter a bug or would like to add more tests, please open an issue to discuss it!

License

MIT © 2021 Joel Gustafson