trans-fat

An FPGA Accelerator for Transformer Inference

We accelerate a single BERT layer across two FPGAs, partitioned into four pipeline stages, and apply three levels of optimization using Vitis HLS, reporting runtimes for each. The accelerator implements a transformer layer of standard BERT dimensions, with a sequence length of 128 (which can be modified).
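
At a high level, the layer decomposes into four stages, two per FPGA. The sketch below shows one plausible structure; the stage boundaries, names, and per-FPGA split are illustrative assumptions, not the repository's actual code.

// Hypothetical four-stage decomposition of a BERT layer. Stage bodies
// are stubbed as pass-throughs to keep the sketch self-contained; the
// real stages would hold the matmuls, softmax, GELU, and layernorms.
#include <cstring>

constexpr int SEQ = 128;  // sequence length (modifiable, per the README)
constexpr int DIM = 768;  // hidden size (768 for BERT-base; an assumption here)

static void stage1_attention(const float *in, float *out) { std::memcpy(out, in, sizeof(float) * SEQ * DIM); }
static void stage2_attn_proj(const float *in, float *out) { std::memcpy(out, in, sizeof(float) * SEQ * DIM); }
static void stage3_ffn_expand(const float *in, float *out) { std::memcpy(out, in, sizeof(float) * SEQ * DIM); }
static void stage4_ffn_reduce(const float *in, float *out) { std::memcpy(out, in, sizeof(float) * SEQ * DIM); }

// Stages 1-2 could map to fpga1 and stages 3-4 to fpga2, with the host
// moving intermediate activations between the two boards.
void bert_layer(const float *in, float *out) {
#pragma HLS DATAFLOW
    static float s1[SEQ * DIM], s2[SEQ * DIM], s3[SEQ * DIM];
    stage1_attention(in, s1);
    stage2_attn_proj(s1, s2);
    stage3_ffn_expand(s2, s3);
    stage4_ffn_reduce(s3, out);
}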

Instructions

This repository is designed to run on a host node with at least two Xilinx U200s. The instructions provided are specific to the Pitt CRC fpga-n0 node; however, they may be adapted as needed for other nodes.

Dependencies

The required dependencies can be loaded using the following commands.

module load xilinx/vitis/2020.2
module load libfaketime
source /opt/xilinx/xrt/setup.sh

Building

All building is performed in the fpga/ directory. Navigate there and enter the following command.

faketime 'last year' make all TARGET=<hw, hw_emu, sw_emu> VERSION=<0, 1, 2, 3> PART=<fpga1, fpga2, all> JOBS=<# of jobs requested>

If building for hardware, the output artifacts will automatically be copied into builds/v#/fpga#/.
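
For example, a full hardware build of the most optimized version for both FPGAs with eight parallel jobs would be:

faketime 'last year' make all TARGET=hw VERSION=3 PART=all JOBS=8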

Running

To run all builds, enter make test VERSION=<0, 1, 2, 3> PART=all in the fpga/ directory.

Individual FPGA builds can be run directly using the host executable in the desired builds/ directory.
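
For example, to run the v3 build for fpga1 directly (the host binary name here is hypothetical; check the actual contents of the builds/ directory):

cd builds/v3/fpga1
./host <path to .xclbin>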

Optimization Versions

v0

  • None (unoptimized baseline)

v1

  • Linear layer tiling
  • Buffering of input and output data
  • Unrolling of multiplication inner loops (see the sketch after this list)
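
A minimal HLS-style sketch of the v1 ideas, assuming a 128x768x768 linear layer: tile the output columns, buffer tiles of the inputs and outputs on-chip, and unroll the inner multiply loop. The names and tile size are illustrative, not the repository's code.

constexpr int N = 128;  // rows (sequence length)
constexpr int K = 768;  // inner dimension
constexpr int M = 768;  // output columns
constexpr int T = 64;   // j-dimension tile width (assumption)

void linear_v1(const float *A, const float *B, float *C) {
    // Tile the output columns so each tile's working set fits on-chip.
    for (int j0 = 0; j0 < M; j0 += T) {
        float c_buf[N][T];                   // buffered output tile
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < T; ++j)
                c_buf[i][j] = 0.0f;

        for (int k = 0; k < K; ++k) {
            float b_buf[T];                  // buffered slice of B's row k
            for (int j = 0; j < T; ++j)
                b_buf[j] = B[k * M + j0 + j];
            for (int i = 0; i < N; ++i) {
                const float a = A[i * K + k];
                for (int j = 0; j < T; ++j) {
#pragma HLS UNROLL
                    c_buf[i][j] += a * b_buf[j];  // unrolled inner multiply
                }
            }
        }

        for (int i = 0; i < N; ++i)          // write the finished tile back
            for (int j = 0; j < T; ++j)
                C[i * M + j0 + j] = c_buf[i][j];
    }
}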

v2

  • Transpose A matmul input (see the sketch after this list)
  • Cache line of A.T
  • Increase tile size in j dimension
  • Unrolling of computation in attention heads
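
A sketch of the v2 access pattern: with A stored transposed (A_T is K x N), fixing k and sweeping i reads one contiguous line of A.T from DDR, which can be cached on-chip in a single burst. Tiling (with a wider j tile) and the per-head unrolling are omitted for brevity; sizes and names are assumptions.

constexpr int N2 = 128, K2 = 768, M2 = 768;

void linear_v2(const float *A_T, const float *B, float *C) {
    for (int i = 0; i < N2 * M2; ++i) C[i] = 0.0f;
    for (int k = 0; k < K2; ++k) {
        float a_line[N2];                 // cached line of A.T (contiguous in DDR)
        for (int i = 0; i < N2; ++i)
            a_line[i] = A_T[k * N2 + i];
        for (int i = 0; i < N2; ++i)
            for (int j = 0; j < M2; ++j)
#pragma HLS UNROLL factor=16
                C[i * M2 + j] += a_line[i] * B[k * M2 + j];
    }
}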

v3

  • Stream DDR inputs/outputs in linear layers (see the sketch below)
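
A sketch of the v3 idea: move linear-layer inputs and outputs through hls::stream FIFOs so that the DDR load, compute, and DDR store overlap under DATAFLOW instead of running on bulk buffers. This is illustrative, not the repository's code; the compute stage is stubbed.

#include <hls_stream.h>

constexpr int DEPTH = 128 * 768;  // one activation matrix

static void load(const float *ddr, hls::stream<float> &s) {
    for (int i = 0; i < DEPTH; ++i) s.write(ddr[i]);
}
static void compute(hls::stream<float> &in, hls::stream<float> &out) {
    for (int i = 0; i < DEPTH; ++i) out.write(in.read());  // real matmul goes here
}
static void store(hls::stream<float> &s, float *ddr) {
    for (int i = 0; i < DEPTH; ++i) ddr[i] = s.read();
}

void linear_streamed(const float *in, float *out) {
#pragma HLS DATAFLOW
    hls::stream<float> s_in("in"), s_out("out");
    load(in, s_in);
    compute(s_in, s_out);
    store(s_out, out);
}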

Results

Version   fpga1 (ms)   fpga2 (ms)   all (ms)
v0        4723.71      10950.90     15676.30
v1        274.98       120.91       397.45
v2        48.36        95.60        145.27
v3        35.03        71.76        110.99