FPGA acceleration of arbitrary precision floating point computations.

Fast Arbitrary Precision Floating Point on FPGA

A detailed description of the approach implemented in this repository can be found in our FCCM'22 paper [1].

Introduction

This repository implements an arbitrary precision floating point multiplier and adder using Vitis HLS targeting XRT-enabled Xilinx FPGAs, and exposes them through a matrix multiplication primitive that allows running them at full throughput without becoming memory bound. The design is fully pipelined, yielding a peak multiply-accumulate (MAC) throughput equal to the clock frequency times the number of compute units instantiated.
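The throughput relation stated above can be written down as a one-line model. The following is a minimal sketch; the function name PeakMacsPerSecond and the numbers in the usage note are hypothetical and are not taken from the repository or the paper.

```cpp
#include <cstdint>

// Peak MAC throughput of a fully pipelined design: each compute unit
// completes one multiply-accumulate per clock cycle, so the total is the
// clock frequency multiplied by the number of compute units.
// Hypothetical helper for illustration; not part of the repository.
std::uint64_t PeakMacsPerSecond(std::uint64_t frequency_hz,
                                std::uint64_t compute_units) {
  return frequency_hz * compute_units;
}
```

For instance, an assumed configuration of 4 compute units clocked at 300 MHz would give PeakMacsPerSecond(300000000, 4) == 1200000000, i.e., 1.2 GMAC/s. The achievable frequency and the number of compute units that fit on a device depend on the configured bit width and must be determined from an actual build.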

Instantiations of the design on an Alveo U250 accelerator were shown to yield 2.0 GMAC/s of 512-bit matrix-matrix multiplication, an order of magnitude higher than a 36-core dual-socket Xeon node and corresponding to 375 CPU cores' worth of throughput [1].

Configuration

The hardware design is configured using CMake. The target Xilinx XRT-enabled platform must be specified with the APFP_PLATFORM parameter. The most important configuration parameters include:

  • The width used for the floating point representation is fixed at compile time using the APFP_BITS CMake parameter, out of which 63 bits will be used for the exponent, 1 bit will be used for the sign, and the remaining bits will be used for the mantissa. The value is currently expected to be a multiple of 512 so that it aligns with the memory interface width.
  • To scale the design beyond a single pipelined multiplier, the APFP_COMPUTE_UNITS parameter can be used to replicate the full kernel. Each instantiation will run a fully independent matrix multiplication unit. These units can collaborate on a single matrix multiplication operation (see host/TestMatrixMultiplication.cpp for an example).
  • The floating point multiplier uses Karatsuba decomposition to reduce the overall resource usage of the design. The decomposition bottoms out at APFP_MULT_BASE_BITS, after which it falls back on naive multiplication using DSPs as generated by the HLS tool. Similarly, APFP_ADD_BASE_BITS configures the number of bits to dispatch to the HLS tool's addition implementation; above this threshold, the addition is manually pipelined into multiple stages.
  • To avoid being memory bound, the matrix multiplication implementation is tiled using the approach described in our FPGA'20 paper [2]. The tile sizes are exposed through the APFP_TILE_SIZE_N and APFP_TILE_SIZE_M parameters. The highest arithmetic intensity is achieved when these two quantities are equal and maximized, but relatively small tile sizes are sufficient to overcome the memory bottleneck (e.g., 32x32). Larger tile sizes increase arithmetic intensity at the cost of BRAM usage and of potential overhead when the input matrix dimensions are not multiples of the tile size.
  • APFP_FREQUENCY can be used to change the maximum frequency targeted by the design. If unspecified, the default of the target platform will be used.
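To illustrate the Karatsuba decomposition mentioned in the list above, the following is a minimal software sketch on 32-bit operands. It is not the repository's HLS implementation: the function name KaratsubaMultiply and its signature are hypothetical, and the base_bits argument only plays the role that APFP_MULT_BASE_BITS plays in the hardware design.

```cpp
#include <cstdint>

// Software analogue of a Karatsuba multiplier (illustration only).
// `bits` is the nominal operand width used to choose the split point, and
// `base_bits` is the width at which recursion stops and a native multiply
// is used instead. Top-level operands must fit in 32 bits so that every
// intermediate value fits in 64 bits.
std::uint64_t KaratsubaMultiply(std::uint64_t a, std::uint64_t b, int bits,
                                int base_bits) {
  if (bits <= base_bits || bits < 2) {
    return a * b;  // Base case: fall back on the native multiplier
  }
  const int half = bits / 2;
  const std::uint64_t mask = (std::uint64_t{1} << half) - 1;
  // Split each operand as x = x1 * 2^half + x0
  const std::uint64_t a0 = a & mask, a1 = a >> half;
  const std::uint64_t b0 = b & mask, b1 = b >> half;
  // Three recursive multiplications instead of four
  const std::uint64_t z0 = KaratsubaMultiply(a0, b0, half, base_bits);
  const std::uint64_t z2 = KaratsubaMultiply(a1, b1, half, base_bits);
  const std::uint64_t z1 =
      KaratsubaMultiply(a0 + a1, b0 + b1, half, base_bits) - z0 - z2;
  // Recombine: a * b = z2 * 2^(2 * half) + z1 * 2^half + z0
  return (z2 << (2 * half)) + (z1 << half) + z0;
}
```

A call such as KaratsubaMultiply(x, y, 32, 8) recurses from 32 to 16 to 8 bits before using native multiplication. A lower base width replaces more multipliers with additions, while a higher one leaves more of the work to the native multiplier; in hardware this trade-off is between DSP usage and adder logic.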

For more details on how to configure the project to achieve high throughput, see our paper [1].
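The effect of the tile size parameters on arithmetic intensity can be estimated with a simple model. The sketch below assumes an outer-product-style tiling in which each step along the shared dimension loads one column of N elements and one row of M elements and performs N*M MACs on an N-by-M output tile; see [2] for the actual scheme. The function names are hypothetical, and PaddedSize assumes that a partially filled tile costs as much as a full one.

```cpp
#include <cstdint>

// MACs performed per matrix element loaded for an N x M output tile,
// under the outer-product assumption stated above (illustration only).
double ArithmeticIntensity(std::int64_t tile_n, std::int64_t tile_m) {
  return static_cast<double>(tile_n * tile_m) /
         static_cast<double>(tile_n + tile_m);
}

// Matrix dimension rounded up to the next multiple of the tile size.
// The difference to `size` is the work wasted on partially filled tiles,
// assuming they are processed at the cost of full tiles.
std::int64_t PaddedSize(std::int64_t size, std::int64_t tile) {
  return ((size + tile - 1) / tile) * tile;
}
```

Under this model, a 32x32 tile yields 16 MACs per element loaded, while a 64x16 tile with the same number of output elements yields only 12.8, which is why equal tile sizes are preferred. A dimension of 100 with a tile size of 32 rounds up to 128, so larger tiles can waste more work on matrices whose dimensions are not multiples of the tile size.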

Configuration and compilation

Please make sure you clone the repository with git clone --recursive or run git submodule update --init after cloning to check out dependencies.

The minimum commands necessary to configure and build the code are:

mkdir build
cd build
cmake ..  # Default parameters
make      # Builds software components
make hw   # Builds hardware accelerator

However, the accelerator should always be configured to match the target system using the parameters described in the previous section and in our paper [1]. The CMake configuration flow uses hlslib [3] to locate the Xilinx tools and expose hardware build targets.

Vitis, GMP, and MPFR must be available for the project to configure successfully.

Running the code

We provide example host code in host/TestMatrixMultiplication.cpp that runs the matrix multiplication accelerator on a randomized input. See the executable for usage. An example invocation could be:

./TestMatrixMultiplicationHardware hw 256 256 256

Installation

To install the project, including both the software interface components and the hardware accelerator itself (built with make hw), simply run make install. The location to install the project in is configured with the CMAKE_INSTALL_PREFIX parameter.

References

[1] Johannes de Fine Licht, Christopher A. Pattison, Alexandros Nikolaos Ziogas, David Simmons-Duffin, and Torsten Hoefler, "Fast Arbitrary Precision Floating Point on FPGA", in Proceedings of the 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM'22).

[2] Johannes de Fine Licht, Grzegorz Kwasniewski, and Torsten Hoefler, "Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis", in Proceedings of the 28th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'20).

[3] Johannes de Fine Licht and Torsten Hoefler, "hlslib: Software Engineering for Hardware Design", presented at the Fifth International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC'19).