oprecomp/flexfloat

compressed nan/inf representations used for new fp8? e4m3 e5m2


xloem commented

via https://dblalock.substack.com/p/2022-9-18-arxiv-roundup-reliable

FP8 Formats for Deep Learning
A group of NVIDIA, ARM, and Intel researchers got fp8 training working reliably, with only a tiny accuracy loss compared to fp16.

8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.

reducing the number of NaN/Inf encodings in fp1-4-3 (E4M3) down to just one mantissa bit string
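
For concreteness, here is a minimal C sketch of that difference, assuming the bit layouts from the paper (E5M2: 1 sign, 5 exponent, 2 mantissa bits; E4M3: 1 sign, 4 exponent, 3 mantissa bits). The `e5m2_*`/`e4m3_*` helpers are made-up names for illustration, not part of flexfloat's API; they only show the classification rules an emulator would need to model.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers (not flexfloat API) illustrating the special-value
 * rules of the two FP8 encodings described in the paper. */

/* E5M2: 1 sign, 5 exponent, 2 mantissa bits. IEEE 754-style specials:
 * exponent all ones => Inf when the mantissa is zero, NaN otherwise. */
static bool e5m2_is_inf(uint8_t bits) {
    return ((bits >> 2) & 0x1F) == 0x1F && (bits & 0x03) == 0;
}
static bool e5m2_is_nan(uint8_t bits) {
    return ((bits >> 2) & 0x1F) == 0x1F && (bits & 0x03) != 0;
}

/* E4M3: 1 sign, 4 exponent, 3 mantissa bits. No infinities at all, and the
 * only NaN mantissa pattern is all ones (S.1111.111); the remaining
 * exponent-all-ones codes encode ordinary values, extending the maximum
 * finite magnitude to 448. */
static bool e4m3_is_inf(uint8_t bits) {
    (void)bits;
    return false;
}
static bool e4m3_is_nan(uint8_t bits) {
    return (bits & 0x7F) == 0x7F;
}

int main(void) {
    /* 0x7C is +Inf in E5M2 but a finite 384.0 in E4M3;
     * 0x7E is NaN in E5M2 but the E4M3 maximum of 448.0;
     * 0x7F is NaN in both encodings. */
    const uint8_t samples[] = { 0x7C, 0x7E, 0x7F };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        uint8_t b = samples[i];
        printf("0x%02X  E5M2: inf=%d nan=%d | E4M3: inf=%d nan=%d\n",
               b, e5m2_is_inf(b), e5m2_is_nan(b),
               e4m3_is_inf(b), e4m3_is_nan(b));
    }
    return 0;
}
```

The asymmetry is the point of the question: E4M3 buys extra dynamic range by collapsing its NaN/Inf space to a single mantissa pattern, so its exponent-all-ones codes cannot be handled the way IEEE 754 (and E5M2) handle theirs.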

How much accuracy loss does this approach cause? They find that, across a huge array of models and tasks, the consistent answer is: not much, around 0 to 0.3% in accuracy/BLEU/perplexity.