An FPGA Accelerator for Transformer Inference
We accelerate a BERT layer across two FPGAs, partitioned into four pipeline stages. We apply three levels of optimization using Vitis HLS and report the resulting runtimes. The accelerator implements a transformer layer of standard BERT size with a default sequence length of 128, which can be modified.
This repository is designed to run on a host node with at least two Xilinx U200s. The instructions provided are specific to the Pitt CRC fpga-n0 node; however, they may be adapted as needed for other nodes.
The required dependencies can be loaded using the following commands.
module load xilinx/vitis/2020.2
module load libfaketime
source /opt/xilinx/xrt/setup.sh
All building is performed in the fpga/ directory. Navigate there and enter the following command.
faketime 'last year' make all TARGET=<hw, hw_emu, sw_emu> VERSION=<0, 1, 2, 3> PART=<fpga1, fpga2, all> JOBS=<# of jobs requested>
If building for hardware, the output artifacts will automatically be copied into /builds/v#/fpga#/.
To run all, enter make test VERSION=<0, 1, 2, 3> PART=all in the fpga/ directory.
Individual FPGA builds can be run directly using the host and executable in the desired builds/ directory.
- None
- Linear layer tiling
- Buffering of input and output data
- Unrolling of multiplication inner loops
- Transpose A matmul input
- Cache line of A.T
- Increase tile size in j dimension
- Unrolling of computation in attention heads
- Stream DDR inputs/outputs in linear layers
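
Several of the items above describe matmul restructurings. As a rough illustration only, the plain C++ sketch below combines three of them: tiling in the j dimension, reading A in transposed form with one line of A.T cached locally, and a fixed-bound inner loop suitable for unrolling. It is not taken from the repository. The function names, the `TILE_J` value, the row-major layout, and the reading of "cache line of A.T" as a local copy of one row of the transposed matrix are all assumptions. HLS pragmas are mentioned in comments only, so the code compiles with an ordinary C++ compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile width in the j (output column) dimension.
constexpr std::size_t TILE_J = 4;

// Plain triple-loop reference: C (M x N) = A (M x K) * B (K x N), row-major.
void matmul_reference(const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C,
                      std::size_t M, std::size_t K, std::size_t N) {
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                sum += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = sum;
        }
    }
}

// Same product, but A is supplied transposed (AT is K x M, row-major) and
// the output is computed one TILE_J-wide column tile at a time.
void matmul_at_tiled(const std::vector<float>& AT, const std::vector<float>& B,
                     std::vector<float>& C,
                     std::size_t M, std::size_t K, std::size_t N) {
    assert(AT.size() == K * M && B.size() == K * N && C.size() == M * N);
    std::vector<float> a_line(M);       // cached line of A.T (assumed on-chip)
    std::vector<float> acc(M * TILE_J); // output buffer for the current tile

    for (std::size_t j0 = 0; j0 < N; j0 += TILE_J) {
        const std::size_t jw = std::min(TILE_J, N - j0); // last tile may be partial
        std::fill(acc.begin(), acc.end(), 0.0f);

        for (std::size_t k = 0; k < K; ++k) {
            // Contiguous read: row k of A.T holds A[0..M-1][k].
            for (std::size_t i = 0; i < M; ++i) {
                a_line[i] = AT[k * M + i];
            }
            // Buffer the matching slice of B; unused lanes stay zero.
            float b_tile[TILE_J] = {};
            for (std::size_t j = 0; j < jw; ++j) {
                b_tile[j] = B[k * N + j0 + j];
            }
            for (std::size_t i = 0; i < M; ++i) {
                // Fixed trip count: in Vitis HLS an UNROLL pragma would
                // typically be placed on this loop.
                for (std::size_t j = 0; j < TILE_J; ++j) {
                    acc[i * TILE_J + j] += a_line[i] * b_tile[j];
                }
            }
        }
        // Write the finished tile back to the output matrix.
        for (std::size_t i = 0; i < M; ++i) {
            for (std::size_t j = 0; j < jw; ++j) {
                C[i * N + j0 + j] = acc[i * TILE_J + j];
            }
        }
    }
}
```

In this arrangement every read of A.T and B inside the k loop is contiguous, and the accumulators for a tile stay in a local buffer until the tile is complete. Widening `TILE_J` exposes more independent multiply-accumulates per cached line, which is one plausible motivation for increasing the tile size in the j dimension.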
Version | fpga1 latency (ms) | fpga2 latency (ms) | all latency (ms) |
---|---|---|---|
v0 | 4723.71 | 10950.90 | 15676.30 |
v1 | 274.98 | 120.91 | 397.45 |
v2 | 48.36 | 95.60 | 145.27 |
v3 | 35.03 | 71.76 | 110.99 |