Introduction

This project contains a library of math-related hardware units.

Right now, it contains only "Fpxx" units: floating point with a user-programmable number of exponent and mantissa bits.

Fpxx General
FpxxAdd
FpxxMul
FpxxDiv
FpxxSqrt
FpxxRSqrt
Math Related Literature

Getting Started

This library uses SpinalHDL.

If you want to run any of the code here, you first need to install it.

Installation instructions can be found here.

Once done, run ./run.sh to generate whichever unit you want to test. Edit this file if you want to run a different test. (All of this could be streamlined with a better Makefile...)

Then run make sim to run a test.

Fpxx

The Fpxx library supports floating point operations for which the exponent and mantissa sizes can be specified at compile time.
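
To make the idea of compile-time widths concrete, here is a minimal C++ sketch that decodes a bit pattern into a double. It assumes an IEEE-like layout (sign, exponent, mantissa, with a bias of 2^(E-1)-1 and a hidden leading one), which may not match the actual Fpxx encoding. The name `fpxx_to_double` is made up for this example and is not part of the library.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: decode a value with ExpBits exponent bits and
// MantBits mantissa bits, assuming an IEEE-like layout (an assumption,
// not necessarily the layout used by the Fpxx hardware).
template <int ExpBits, int MantBits>
double fpxx_to_double(uint32_t bits) {
    static_assert(ExpBits >= 2 && MantBits >= 1 && 1 + ExpBits + MantBits <= 32,
                  "sign + exponent + mantissa must fit in 32 bits");
    const uint32_t mant_mask = (1u << MantBits) - 1;
    const uint32_t exp_mask  = (1u << ExpBits) - 1;
    const uint32_t mant = bits & mant_mask;
    const uint32_t exp  = (bits >> MantBits) & exp_mask;
    const bool     sign = ((bits >> (MantBits + ExpBits)) & 1u) != 0;
    const int      bias = (1 << (ExpBits - 1)) - 1;

    double mag;
    if (exp == 0) {
        mag = 0.0;  // zero; denormals are treated as zero as well
    } else if (exp == exp_mask) {
        mag = mant ? std::numeric_limits<double>::quiet_NaN()
                   : std::numeric_limits<double>::infinity();
    } else {
        // 1.mantissa * 2^(exp - bias)
        mag = std::ldexp(1.0 + double(mant) / double(1u << MantBits),
                         int(exp) - bias);
    }
    return sign ? -mag : mag;
}
```

With these assumptions, `fpxx_to_double<8, 23>` decodes standard fp32 bit patterns, while `fpxx_to_double<6, 13>` would decode a 20-bit format.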

The primary use of this library is for FPGA projects that need floating point, but don't necessarily need all the features and precision of 32-bit standard floating point operations. By reducing the size of the mantissa and exponent, the hardware of some floating point operations can be made to map directly onto the hardware multipliers of the DSPs that are often present in today's FPGAs, and the maximum clock speed can be increased significantly.

For example, many FPGAs support 18x18 bit multiplications. By restricting the size of the mantissa, a single hardware multiplier may be sufficient to implement the core operation of a floating point multiplier.
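
The arithmetic behind this can be sketched in C++. The example assumes a 17-bit stored mantissa with a hidden leading one, so each significand is 18 bits wide and the product fits in 36 bits. The constants and function names are hypothetical and only illustrate the bit widths; they are not taken from the library.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kMantBits = 17;             // assumed stored mantissa width
constexpr int kSigBits  = kMantBits + 1;  // 18 bits including the hidden one

// Multiply two significands of the form 1.m. The inputs are 18 bits each,
// so the product needs at most 36 bits: one 18x18 unsigned multiplication.
uint64_t mul_significands(uint32_t mant_a, uint32_t mant_b) {
    assert(mant_a < (1u << kMantBits) && mant_b < (1u << kMantBits));
    const uint64_t sig_a = (1u << kMantBits) | mant_a;  // prepend hidden 1
    const uint64_t sig_b = (1u << kMantBits) | mant_b;
    return sig_a * sig_b;
}

// The product of two values in [1,2) lies in [1,4). If it is >= 2, shift
// right by one extra bit and report an exponent increment. The low bits
// are simply truncated (no rounding).
uint32_t normalize_product(uint64_t product, int &exp_inc) {
    exp_inc = int((product >> (2 * kSigBits - 1)) & 1u);  // bit 35 set: >= 2.0
    const int shift = exp_inc ? kSigBits : kSigBits - 1;
    return uint32_t(product >> shift) & ((1u << kMantBits) - 1);
}
```

For instance, 1.5 x 1.5 = 2.25 normalizes to mantissa 0.125 with an exponent increment of one, i.e. 1.125 x 2.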

Goals:

SpinalHDL

The code is written in SpinalHDL instead of Verilog or VHDL. This makes it much easier to write generic code with programmable widths and pipeline stages. It also cuts back on boilerplate code.

That said, it's almost trivial to generate the Verilog or VHDL for use in your own project. And if that's too much effort, a number of configurations are pre-generated and stored as Verilog and VHDL in the repository, so they can be copied straight into your own project.
Floating point support for all basic operations

At the minimum, add, multiply and divide should work with acceptable accuracy, whatever that means.

For additional operations (e.g. sqrt and 1/sqrt), accuracy may very well be completely unacceptable: depending on my use cases, a small lookup table could be sufficient and the library won't have a better solution.
User-programmable mantissa and exponent size

There are some limitations. For example, FpxxDiv currently requires an odd number of mantissa bits.
User-programmable size of various lookup tables or internal results

The user may want to specify a particular mantissa, but still restrict the precision for select operations when it's clear that the full precision won't be needed.

For example, one may want to use a 20-bit size mantissa in general, but restrict multiplications to 17 or 18 bits to map to a single FPGA DSP multiplier.

Similarly, the divide operation uses a lookup table. For certain input ranges, this lookup table may not need to be as large or as precise as recommended for maximum precision.

Where possible, the library provides knobs to play with this.
Support for NaN, Infinity, and sign checking

It's important that NaN and Infinity values get propagated through the pipeline, to avoid cases where these kinds of values alias into a real value. A NaN should be generated for operations such as asking for the square root of a negative number. Overflows or division by 0 will result in Infinity.
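
As a rough illustration of this kind of propagation, the C++ sketch below decides the class of a result from the classes of the operands. The `Kind` enum and both functions are hypothetical. The Inf x 0 = NaN rule follows the usual IEEE convention and is an assumption about how the hardware behaves.

```cpp
#include <cassert>

// Hypothetical operand classification; the real Fpxx flags may differ.
enum class Kind { Zero, Normal, Inf, NaN };

// Class of a * b. `overflow` says whether the exponent of the normal
// multiplication path overflowed.
Kind mul_kind(Kind a, Kind b, bool overflow) {
    if (a == Kind::NaN || b == Kind::NaN) return Kind::NaN;  // NaN propagates
    const bool a_inf = (a == Kind::Inf), b_inf = (b == Kind::Inf);
    const bool a_zero = (a == Kind::Zero), b_zero = (b == Kind::Zero);
    if ((a_inf && b_zero) || (a_zero && b_inf)) return Kind::NaN;  // assumed IEEE rule
    if (a_inf || b_inf) return Kind::Inf;                    // Infinity propagates
    if (a_zero || b_zero) return Kind::Zero;
    return overflow ? Kind::Inf : Kind::Normal;              // overflow gives Infinity
}

// Class of sqrt(a). `negative` is the sign bit of the operand.
Kind sqrt_kind(Kind a, bool negative) {
    if (a == Kind::NaN) return Kind::NaN;
    if (a == Kind::Zero) return Kind::Zero;  // sign of zero ignored in this sketch
    if (negative) return Kind::NaN;          // square root of a negative number
    return a;                                // Normal stays Normal, Inf stays Inf
}
```

Carrying a small class tag like this alongside the data through every pipeline stage is one way to keep special values from being mistaken for ordinary numbers.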
One result per cycle

The library is initially designed for use cases where one result is needed per clock cycle.
User-programmable pipeline depth

For each instance, the user can control the number of intermediate pipeline stages. This makes it possible to trade off between clock speed and pipeline latency.
C++ model

There is a C++ template class with an implementation of the Fpxx modules.

This can be very useful to first create a C++ proof of concept of your design before implementing it in hardware.

The goal is for the C++ model and the hardware model to be bit exact (though this might not always be the case.)
Testbench

A testbench with directed and random vectors is provided to compare the results of a model that has a 23-bit mantissa and 8-bit exponent against the standard IEEE fp32 operations of your PC.

The testbench ignores differences that are due to the limitations of the library (e.g. denormals, rounding differences etc.)
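
A comparison of this kind could look roughly like the C++ sketch below. It is not the testbench's actual code: `results_match` and the `max_ulp` tolerance are assumptions made for illustration. Denormals are flushed to zero on both sides, and a small difference in the last mantissa bits is tolerated to account for the missing rounding.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a float as its raw 32-bit pattern.
uint32_t float_bits(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Replace a denormal by a zero of the same sign.
float flush_denormal(float f) {
    return std::fpclassify(f) == FP_SUBNORMAL ? std::copysign(0.0f, f) : f;
}

// Hypothetical check: does `actual` match the fp32 reference `expected`,
// ignoring denormals and up to `max_ulp` units in the last place?
bool results_match(float expected, float actual, uint32_t max_ulp) {
    expected = flush_denormal(expected);
    actual   = flush_denormal(actual);
    if (std::isnan(expected) || std::isnan(actual))
        return std::isnan(expected) && std::isnan(actual);
    if (expected == actual) return true;  // also covers +0 == -0 and Inf == Inf
    if (std::isinf(expected) || std::isinf(actual)) return false;
    if (std::signbit(expected) != std::signbit(actual)) return false;
    // For finite values of the same sign, the distance between the raw bit
    // patterns equals the distance in units in the last place.
    const uint32_t e = float_bits(expected), a = float_bits(actual);
    return (e > a ? e - a : a - e) <= max_ulp;
}
```

A test loop would then call something like `results_match(a * b, model_result, 1)` for each vector, where `model_result` stands for the output of the unit under test.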

Non-goals:

Support for denormals

Denormals require quite a bit of additional logic for often little benefit. Support for them may be added later, but it's not there at this time.

When a denormal is encountered on an input, it is immediately clamped to zero. Denormal results are replaced by a zero as well.
(Correct) rounding

Rounding is a surprisingly expensive operation and hard to get really right. At this moment, it is not supported at all. This definitely has an impact on precision.
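
To show the difference, here is a small C++ sketch that reduces a wide significand by dropping low bits, once by plain truncation and once with round-to-nearest-even. Both function names are hypothetical. The rounding version is only there for comparison and is not something the library implements.

```cpp
#include <cassert>
#include <cstdint>

// Truncation: drop the low `drop` bits. This is free in hardware.
uint32_t truncate_bits(uint32_t value, int drop) {
    assert(drop > 0 && drop < 32);
    return value >> drop;
}

// Round to nearest, ties to even. This needs a comparison of the dropped
// bits plus an incrementer, and the increment can carry out of the top
// bit, in which case the result has to be renormalized.
uint32_t round_nearest_even(uint32_t value, int drop) {
    assert(drop > 0 && drop < 32);
    uint32_t kept = value >> drop;
    const uint32_t half = 1u << (drop - 1);
    const uint32_t rem  = value & ((1u << drop) - 1);
    if (rem > half || (rem == half && (kept & 1u))) ++kept;
    return kept;
}
```

Truncation can be off by almost one unit in the last place, always in the same direction, while rounding to nearest is off by at most half a unit.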
Correct handling of negative and positive zeros

For some operations, negative and positive zeros are dealt with correctly, but not all of them.

FpxxAdd

FpxxMul

FpxxDiv

FpxxSqrt

FpxxRSqrt

Math Related Literature

Reduced Precision Floating Point

Simplified Floating Point for DSP

Cornell student project with C code and Verilog.
Float Point Core generator

Create custom VHDL floating point cores of variable size.

Articles on two's complement floating point

TMS320C3x User Guide (1994)

Old user guide. Two's complement floating point section starts at page 4-4.
TMS320C3x User Guide (2004)

Current official version. Has typos. E.g. bottom of 5-35 is incorrect.
StackExchange question

Links to conversion code.

Division

Variable Precision Floating Point Division and Square Root

Very interesting presentation on how to create division and square root on FPGA.

The thesis about this presentation can be found here.

Another thesis implementing this kind of divider, with (bad) source code.
A Pipelined Divider with a Small Lookup Table

Paper that describes a divider similar to the one in the presentation above, but with a smaller lookup table and more multipliers.
Fast Division Algorithm with a Small Lookup Table

Paper that is referenced by the two papers above as the main inspiration for the LUT + 2 multipliers division operation.

Includes detailed mathematical derivation and error analysis.
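
As I understand it, the core idea of the LUT + 2 multipliers approach is to split the divisor Y into its top bits Yh and the remainder Yl, and approximate X/Y by X * (Yh - Yl) / Yh^2, with 1/Yh^2 coming from a table indexed by Yh. The C++ sketch below models this with doubles. It is an illustration of the principle under that assumption, not the FpxxDiv implementation, and `approx_divide` is a made-up name.

```cpp
#include <cassert>
#include <cmath>

// Approximate x / y for y in [1,2), using the top `m` fractional bits of y
// as the lookup table index.
double approx_divide(double x, double y, int m) {
    assert(y >= 1.0 && y < 2.0 && m > 0 && m < 30);
    const double scale = std::ldexp(1.0, m);          // 2^m
    const double yh = std::floor(y * scale) / scale;  // top bits of y
    const double yl = y - yh;                         // 0 <= yl < 2^-m
    const double lut = 1.0 / (yh * yh);  // in hardware: a table lookup
    // Exact: x / y = x * (yh - yl) / (yh^2 - yl^2). Dropping the yl^2 term
    // leaves two multiplications and a relative error below 2^(-2m).
    return (x * (yh - yl)) * lut;
}
```

With m = 8, for example, the table has 256 entries and the relative error stays below 2^-16, before accounting for the finite width of the table entries and multipliers.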

Square Root and Reciprocal Square Root

Matlab - Implement Fixed-Point Square Root Using Lookup Table

Matlab code for fixed point square root lookup table.
Variable Precision Floating Point Division and Square Root

Uses a combination of table lookup and a bunch of multipliers for square root. See same paper under the 'Division' section for related information.
Implementation of Single Precision Floating Point Square Root on FPGAs

Shows simple iterative implementation and pipelined version, both for integer-only and floating point.

FP32 version requires 15 pipeline stages instead of 24, because some stages are so small that they can be collapsed.

Does not use a lookup table or multiplier, just a bunch of adders.

Parallel-Array Implementations of A Non-Restoring Square Root Algorithm
An Optimized Square Root Algorithm for Implementation in FPGA Hardware

Seems to be equivalent to the previous one.
An Efficient Implementation of the Non Restoring Square Root Algorithm in Gate Level
Reciprocation, Square root, Inverse Square Root, and some Elementary Functions using Small Multipliers

Paper that is referenced by the papers above as the main inspiration for the LUT + multipliers approach.

Has a detailed mathematical derivation about how things work.
Methods of Computing Square Roots

Wikipedia.
Best Square Root Methods

Not very useful.
Simple Seed Architectures for Reciprocal and Square Root Reciprocal

Not very useful.
Fixed-Point Implementations of the Reciprocal, Square Root and Reciprocal Square Root Functions
Ask Hackaday: Computing Square Roots on FPGA?
Chebyshev Approximation and How It Can Help You Save Money, Win Friends, and Influence People - Jason Sachs
Fast interactive sqrt
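
Several of the references above compute the square root with adders only, producing one result bit per step. As a rough software model of that idea, here is the classic bit-by-bit integer square root in C++. It is the simpler restoring variant, chosen for brevity, and not the non-restoring algorithm from the papers. The name `isqrt32` is made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Integer square root (floor) using only shifts, adds, subtracts and
// compares. Each loop iteration decides one bit of the result.
uint32_t isqrt32(uint32_t n) {
    uint32_t root = 0;
    uint32_t bit = 1u << 30;  // highest power of four that fits in 32 bits

    // Skip leading bit pairs that cannot contribute to the result.
    while (bit > n) bit >>= 2;

    while (bit != 0) {
        if (n >= root + bit) {
            n -= root + bit;           // trial subtraction succeeded
            root = (root >> 1) + bit;  // set the current result bit
        } else {
            root >>= 1;                // current result bit stays zero
        }
        bit >>= 2;
    }
    return root;
}
```

In a pipelined hardware version, each iteration of the second loop would become its own stage.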

Leading Zero Counter (LZC) and Leading Zero Anticipator (LZA)

Modular Design of Fast Leading Zeros Counting Circuit

Very fast and low area regular leading zero counting implementation.
Stack Exchange Hierarchical Solution

Neat implementation, but apparently not nearly as area and speed efficient as the implementation of the previous bullet point. (See also this video.)
Leading-Zero Anticipatory Logic for High-Speed Floating Point Addition (1995)
Leading Zero Anticipation and Detection - A Comparison of Methods (2001)
Analysis and Implementation of a Novel Leading Zero Anticipation Algorithm for Floating Point Arithmetic Units (2001)
Hybrid LZA: A Near Optimal Implementation of the Leading Zero Anticipator (2009)
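
As a software reference for what a leading zero counter computes, here is a small C++ sketch that uses a binary search over halves of the word. It mirrors the hierarchical idea only loosely and is not taken from any of the references above. The name `leading_zeros32` is made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Count the number of zero bits above the most significant set bit.
// Each step checks whether the upper half of the remaining window is
// empty, and if so shifts the lower half up and adds the half width.
int leading_zeros32(uint32_t x) {
    if (x == 0) return 32;
    int n = 0;
    if ((x >> 16) == 0) { n += 16; x <<= 16; }
    if ((x >> 24) == 0) { n += 8;  x <<= 8;  }
    if ((x >> 28) == 0) { n += 4;  x <<= 4;  }
    if ((x >> 30) == 0) { n += 2;  x <<= 2;  }
    if ((x >> 31) == 0) { n += 1; }
    return n;
}
```

In a floating point adder, this count gives the left shift needed to renormalize the mantissa after a subtraction cancels leading bits.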

Sin/Cos Calculation

Computing sin & cos in hardware with synthesisable Verilog

tomverbeure/math