

CGO 2019 AE: A Code Generator for High-Performance Tensor Contractions on GPUs

This document details the steps to reproduce the five figures in the experimental section (Figures 4, 5, 6, 7, and 8). Figures 4 and 5 compare the performance of our code generator with the NWChem code generator and TAL_SH on Nvidia P100 (Pascal) and V100 (Volta) GPUs, respectively. Figures 6 and 7 compare the performance of our approach with Facebook's Tensor Comprehensions (TC) on the P100 and V100, respectively, for single-precision tensor contractions in the SD2 function of CCSD(T). Figure 8 shows iterations vs. GFLOPS achieved by Tensor Comprehensions for the SD2_1 (bcdef-gdab-efgc) benchmark on the V100.

Amazon EC2 instance for evaluation

For easy evaluation we have set up an Amazon EC2 instance with an Nvidia V100 GPU. All required libraries and frameworks are already installed. The machine can be accessed using ssh (the ssh key is provided with the submission).

  • Please contact us to get access to the ssh key (we could not upload it to the submission site)
  • Download cogent3.pem (the ssh key) and run chmod 400 cogent3.pem to set the file permissions
  • ssh: ssh -i cogent3.pem ubuntu@ec2-13-59-110-214.us-east-2.compute.amazonaws.com
  • Root directory: /home/ubuntu/cgo2019-ae-draft/

COGENT (COde GENerator for Tensor Contractions)

The code generator (COGENT) is written in Python 3.5. COGENT outputs CUDA kernels.

Ensure that the CUDA_ARCH variable in the COGENT Makefile is set to match the target GPU architecture, then run make to build.
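For example, on the V100 EC2 instance a build might look like the sketch below (the CUDA_ARCH value depends on the target GPU, so treat this as an illustration rather than the only way to build):

# edit ./cogent/Makefile so that CUDA_ARCH matches the GPU
# (e.g., compute capability 7.0 for V100, 6.0 for P100), then:
cd cogent
make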

Benchmark #1 (Figures 4 and 5)

The script below runs the TCCG benchmark for Figures 4 and 5; a sample invocation is sketched after the list.

  • Script: ./cogent/bench_tccg.sh
  • Output: cogent_tccg_results.txt
  • Estimated runtime on P100: 5 minutes
  • Expected Results (P100): ./cogent/expect_p100_cogent_tccg_results.txt
  • Expected Results (V100): ./cogent/expect_v100_cogent_tccg_results.txt
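A typical run on the EC2 instance might look like the following sketch (it assumes the output file is written to the current directory; small GFLOPS variations between runs are expected, so diff is only a rough comparison):

cd cogent
./bench_tccg.sh
# compare against the V100 reference numbers
diff cogent_tccg_results.txt expect_v100_cogent_tccg_results.txt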

Benchmark #2 (Figures 6 and 7)

The script below runs the benchmarks used in the comparison with Tensor Comprehensions for Figures 6 and 7.

  • Script: ./cogent/bench_fb.sh
  • Output file location: cogent_fb_results.txt
  • Estimated runtime on P100: 5 minutes
  • Expected Results (P100): ./cogent/expect_p100_cogent_fb_results.txt
  • Expected Results (V100): ./cogent/expect_v100_cogent_fb_results.txt

Evaluating additional benchmarks using COGENT

  • COGENT accepts an expression describing the required tensor contraction and a representative problem size, as follows:
t3 [a,312,b,312,c,24] += sum(d,312) * t2 [b,d,a] * v2 [d,c];
  • The above expression contracts the two tensors t2 and v2 and stores the result in t3
  • d is the contraction dimension and a, b, and c are external dimensions
  • The representative problem sizes are specified after the indices in the output and sum (contraction) terms
  • Note: our current parser is whitespace sensitive; please follow the above whitespace format

The above input is available in the file ./cogent/input_strings/tccg/input_tcct_01.in
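To evaluate a new contraction, create an additional input file in the same format. The sketch below writes a hypothetical file my_contraction.in (the file name and location are illustrative); edit the indices and sizes as needed, keeping the whitespace format:

cat > ./cogent/input_strings/my_contraction.in << 'EOF'
t3 [a,312,b,312,c,24] += sum(d,312) * t2 [b,d,a] * v2 [d,c];
EOF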

NWChem

./nwchem-tccg/tccg-kernels.cu contains the kernels generated by NWChem's code generator. To build the NWChem kernels, set the CUDA_ARCH variable in the NWChem Makefile to match the GPU architecture, then run make; a combined build-and-run sketch appears after the benchmark list below.

  • Location: ./nwchem-tccg/

Benchmark #1 (Figures 4 and 5)

  • Script: ./nwchem-tccg/bench_tccg.sh
  • Output: nwchem_tccg_results.txt
  • Estimated runtime on P100: 10 minutes
  • Expected Results (P100): ./nwchem-tccg/expect_p100_nwchem_tccg_results.txt
  • Expected Results (V100): ./nwchem-tccg/expect_v100_nwchem_tccg_results.txt
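Putting the NWChem build and benchmark together, a session on the V100 instance might look like the sketch below (as above, CUDA_ARCH must be set in the Makefile before make, and the output location is assumed to be the current directory):

cd nwchem-tccg
# set CUDA_ARCH in the Makefile for the target GPU, then:
make
./bench_tccg.sh
diff nwchem_tccg_results.txt expect_v100_nwchem_tccg_results.txt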

TAL-SH

To run the TAL-SH benchmark (Figures 4 and 5), TAL-SH has to be installed. We have already installed TAL-SH on the Amazon EC2 machine.

How To Build TAL-SH

TAL-SH depends on the cuTT library, so cuTT has to be installed first.

Download TAL-SH from https://github.com/DmitryLyakh/TAL_SH. After building cuTT and before building TAL-SH, replace the test.cpp file in the TAL-SH directory with the test.cpp we provide (it contains the benchmarks). To build TAL-SH, modify the Makefile to set the following variables: TOOLKIT, BLASLIB, GPU_CUDA (set it to CUDA), GPU_SM_ARCH, WITH_CUTT (set it to YES), FOOL_CUDA, PATH_CUDA, and PATH_CUTT.

Below is a sample Makefile.

#Compiler: [GNU|PGI|INTEL|CRAY|IBM]:
export TOOLKIT ?= INTEL
...
#BLAS: [ATLAS|MKL|ACML|ESSL|NONE]:
export BLASLIB ?= MKL
#Nvidia GPU via CUDA: [CUDA|NOCUDA]:
export GPU_CUDA ?= CUDA
#Nvidia GPU architecture (two digits):
export GPU_SM_ARCH ?= 60
...
#Fast GPU tensor transpose (cuTT library): [YES|NO]:
export WITH_CUTT ?= YES
...
#Fool CUDA 7.0 with GCC > 4.9: [YES|NO]:
export FOOL_CUDA ?= YES
...
# CUDA (only set this if you build with CUDA):
export PATH_CUDA ?= /usr/local/cuda/9.2.88
...
# cuTT path (only set this if you use cuTT library):
export PATH_CUTT ?= /users/......./cutt
  • Location: ./tal-sh/
  • Sample Makefile: ./tal-sh/Makefile
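As a rough outline, the full build sequence is sketched below; the cuTT source location, BLAS library, and GPU_SM_ARCH are machine specific, the provided ./tal-sh/Makefile can be used as a template, and the provided test.cpp is assumed to live in ./tal-sh/:

# 1. build cuTT (the tensor-transpose library TAL-SH depends on)
cd cutt
make
cd ..
# 2. download TAL-SH and drop in the provided test.cpp (it contains the benchmarks)
git clone https://github.com/DmitryLyakh/TAL_SH
cp tal-sh/test.cpp TAL_SH/test.cpp
# 3. set TOOLKIT, BLASLIB, GPU_SM_ARCH, WITH_CUTT, PATH_CUDA, PATH_CUTT in the
#    TAL_SH Makefile (see the sample above), then build
cd TAL_SH
make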

Troubleshooting

  • If building TAL-SH results in a relocation-related error, add -Xcompiler -fPIC to CUDA_CFLAGS in cuTT's Makefile

Benchmark #1 (Figures 4 and 5)

  • Ensure that the test.cpp file was replaced with the provided version before building TAL-SH; the modified file contains the benchmarks. Ensure that the build was successful.
  • Reviewers using the Amazon EC2 instance can find the pre-built version at /home/ubuntu/cgo2019-ae-draft/tal-sh/build/TAL_SH
  • Run: ./test_talsh.x
  • Output: results.tsv
  • Estimated runtime on P100: 5 minutes
  • Expected Results (P100): ./tal-sh/expect_p100_tal-sh_tccg_results.txt
  • Expected Results (V100): ./tal-sh/expect_v100_tal-sh_tccg_results.txt
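On the EC2 instance, a complete TAL-SH run might look like the sketch below (paths are those listed above; results.tsv can then be compared manually against the expected-results file):

cd /home/ubuntu/cgo2019-ae-draft/tal-sh/build/TAL_SH
./test_talsh.x
# inspect the generated results and compare with the V100 reference
cat results.tsv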

Facebook's Tensor Comprehensions (TC)

Before running our scripts for FB's TC, please build FB's TC (see the installation instructions below). Note that the runtime of TC with tuning is around 7 hours on a P100 GPU. We have already installed TC on the Amazon EC2 machine.

How To Build Facebook's TC

conda install -y -c pytorch -c tensorcomp tensor_comprehensions
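After installation, a quick sanity check is to import the package from the same conda environment (a minimal sketch; it only verifies that the module loads):

python -c "import tensor_comprehensions"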

Benchmark #1

FB's TC with tuning (Figures 6, 7, and 8)

  • Script: ./fb-tc/fb-w-tuning/bench_fb_w_tuning.sh
  • Reviewers using the Amazon EC2 instance can simply run the script; the TC package and Python are preconfigured
  • Output: "fb_w_tuning.txt" and "fb_tuning_time_sd2_1.txt"
  • Estimated runtime on P100: 7+ hours
  • Expected output: TC tuning relies on random seed points, so the results will vary widely from run to run. The results reported in the paper are the average of 5 runs
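Since the tuning run takes several hours, it can be convenient to launch the script detached and monitor the log, for example (the log file name is illustrative):

nohup ./fb-tc/fb-w-tuning/bench_fb_w_tuning.sh > tuning.log 2>&1 &
tail -f tuning.log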

Benchmark #2

FB's TC without tuning (Figures 6 and 7)

  • Script: ./fb-tc/fb-wo-tuning/bench_fb_wo_tunning.sh
  • Reviewers using the Amazon EC2 instance can simply run the script; the TC package and Python are preconfigured
  • Output: fb_tccg_wo_tuning.txt
  • Estimated runtime on P100: 15 to 20 minutes
  • Expected Results (P100): ./fb-tc/fb-wo-tuning/expect_p100_fb_tccg_wo_tuning.txt
  • Expected Results (V100): ./fb-tc/fb-wo-tuning/expect_v100_fb_tccg_wo_tuning.txt

Troubleshooting

If the conda installation fails, or if the installation succeeds but the runs do not produce correct output, it is likely that some package versions conflict. Try the commands below to resolve the version conflicts:

  • conda install cudatoolkit=8.0
  • conda install -y -c pytorch pytorch=0.4.0 torchvision cuda90
  • conda install -y -c pytorch -c tensorcomp tensor_comprehensions

COPYRIGHT

All files in this archive that do not include a prior copyright notice are by default part of this tool and are copyright 2018 The Ohio State University.

MORE INFORMATION

For more information on how to add a new benchmark, see the docs/ folder or contact me at kim.4232@osu.edu.