cornell-zhang/allo

[Feature] Examples of how the llm pldi24-artifacts are generated


Is your feature request related to a problem? Please describe.
I am trying to retarget the llm artifacts to my own FPGA board. I'd like to regenerate the HLS code to try more aggressive quantization schemes.

Describe the solution you'd like
Please add some small examples of the advanced optimization techniques used in the pldi24-artifact repo (a rough sketch of the kind of example I am hoping for follows the list):

  • Mixed precision input/output for GEMM
  • Mixed precision activation/weight for GEMM
  • Mixed precision input/output for Softmax/Layernorm/Residual
  • Low-bit packing input/output for GEMM/Softmax
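
Something along these lines is what I have in mind for the first two items. This is only my guess at the API: I am assuming `allo.customize`, `allo.grid`/`allo.reduction`, and the arbitrary-precision `Int` type from `allo.ir.types`; the sizes and bitwidths are placeholders.

```python
import allo
from allo.ir.types import int8, int32, Int

M, N, K = 32, 32, 32  # placeholder sizes

# int8 activations, 4-bit weights, int32 accumulation/output:
# this covers both the input/output and activation/weight cases above.
def gemm(A: int8[M, K], W: Int(4)[K, N]) -> int32[M, N]:
    C: int32[M, N] = 0
    for i, j in allo.grid(M, N):
        for k in allo.reduction(K):
            C[i, j] += A[i, k] * W[k, j]
    return C

s = allo.customize(gemm)
print(s.build(target="vhls"))  # emit Vitis HLS code
```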

Additional context
For example, the softmax operator currently requires the same fp32 datatype for both input and output. However, the artifact code here contains a mixed-precision HLS implementation with input/output packing. I searched the Allo repo and could not find any reference showing how to generate such code.
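
Ideally I would be able to write something like the sketch below and have Allo emit the mixed-precision kernel. This is purely hypothetical on my side: I am assuming `allo.exp` exists and that `float16` is available in `allo.ir.types`, and I have left out max-subtraction for numerical stability to keep it short.

```python
import allo
from allo.ir.types import float16, float32

L = 64  # placeholder sequence length

# Hypothetical mixed-precision softmax: fp16 in, fp32 out.
def softmax(X: float16[L]) -> float32[L]:
    Y: float32[L] = 0.0
    total: float32 = 0.0
    for i in allo.grid(L):
        Y[i] = allo.exp(X[i])  # assuming allo.exp upcasts to fp32
    for i in allo.grid(L):
        total += Y[i]
    for i in allo.grid(L):
        Y[i] = Y[i] / total
    return Y

s = allo.customize(softmax)
print(s.build(target="vhls"))
```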

Hi @bibo-msft, thanks for raising the issue! The PLDI'24 artifact was not generated purely by Allo. There are some manual hacks in the kernels, and we are still working on automating the process.

Currently, we have a script for generating the Transformer kernels; please check out this page for instructions. This test case also shows a low-bit packing example for GEMM. You can change the bitwidths in the type parameters to generate different GEMM kernels; a rough sketch of that pattern is below.
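
To give a flavor of the packing pattern, here is a sketch along the lines of that test case. Note this is a simplified illustration rather than the exact test: it assumes bit-slicing on integer operands (`A[i, j][lo:hi]`) for unpacking, uses unsigned packed activations to sidestep sign extension, and the `build_packed_gemm` wrapper is just a hypothetical way to parameterize the bitwidth.

```python
import allo
from allo.ir.types import UInt, Int, int32

def build_packed_gemm(bits=4, M=32, N=32, K=32):
    factor = 32 // bits  # values packed per 32-bit word

    # A is packed along K: each UInt(32) word holds `factor` activations.
    # W keeps one signed low-bit weight per element for simplicity.
    def gemm(A: UInt(32)[M, K // factor], W: Int(bits)[K, N]) -> int32[M, N]:
        C: int32[M, N] = 0
        for i, j in allo.grid(M, N):
            for k in allo.reduction(K):
                # Unpack one activation via bit slicing (unsigned here;
                # signed values would need explicit sign handling).
                a: UInt(bits) = A[i, k // factor][(k % factor) * bits : ((k % factor) + 1) * bits]
                C[i, j] += a * W[k, j]
        return C

    return allo.customize(gemm)

s = build_packed_gemm(bits=4)
print(s.build(target="vhls"))
```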

We will provide additional examples of mixed precision kernels soon and will notify you once they are available. Please feel free to share any other suggestions you may have. Thank you!