

CGO 2019 AE: A Code Generator for High-Performance Tensor Contractions on GPUs

This document details the steps to reproduce the five figures in the experimental section (Figures 4, 5, 6, 7, and 8). Figures 4 and 5 compare the performance of our code generator with the NWChem code generator and TAL_SH on Nvidia P100 (Pascal) and V100 (Volta) GPUs, respectively. Figures 6 and 7 compare the performance of our approach with Facebook's Tensor Comprehensions (TC) on the P100 and V100, respectively, for single-precision tensor contractions in the SD2 function of CCSD(T). Figure 8 shows iterations vs. GFLOPS achieved by Tensor Comprehensions for the SD2_1 (bcdef-gdab-efgc) benchmark on the V100.

Amazon EC2 instance for evaluation

For easy evaluation we have set up an Amazon EC2 instance with an Nvidia V100 GPU. All required libraries and frameworks are already installed. The machine can be accessed using ssh (the ssh key is provided with the submission).

  • Please contact us to get access to the ssh key (we could not upload it to the submission site)
  • Download cogent3.pem (the ssh key) and run chmod 400 cogent3.pem to set the file permissions
  • ssh: ssh -i cogent3.pem ubuntu@ec2-13-59-110-214.us-east-2.compute.amazonaws.com
  • Root directory: /home/ubuntu/cgo2019-ae-draft/

COGENT (COde GENerator for Tensor Contractions)

The code generator (COGENT) is written in Python 3.5. COGENT outputs CUDA kernels.

Ensure that the CUDA_ARCH variable in the COGENT Makefile is set to match the target GPU architecture, then run make to build.
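For example, on the V100 EC2 instance a build might look like the sketch below (the CUDA_ARCH value depends on the target GPU, so treat this as an illustration rather than the only way to build):

# edit ./cogent/Makefile so that CUDA_ARCH matches the GPU
# (e.g., compute capability 7.0 for V100, 6.0 for P100), then:
cd cogent
make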

Benchmark #1 (Figures 4 and 5)

The script below runs the TCCG benchmark for Figures 4 and 5; a sample invocation is sketched after the list.

  • Script: ./cogent/bench_tccg.sh
  • Output: cogent_tccg_results.txt
  • Estimated runtime on P100: 5 minutes
  • Expected Results (P100): ./cogent/expect_p100_cogent_tccg_results.txt
  • Expected Results (V100): ./cogent/expect_v100_cogent_tccg_results.txt
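A typical run on the EC2 instance might look like the following sketch (it assumes the output file is written to the current directory; small GFLOPS variations between runs are expected, so diff is only a rough comparison):

cd cogent
./bench_tccg.sh
# compare against the V100 reference numbers
diff cogent_tccg_results.txt expect_v100_cogent_tccg_results.txt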

Benchmark #2 (Figures 6 and 7)

The script below runs the benchmarks used in the comparison with Tensor Comprehensions for Figures 6 and 7.

  • Script: ./cogent/bench_fb.sh
  • Output file location: cogent_fb_results.txt
  • Estimated runtime on P100: 5 minutes
  • Expected Results (P100): ./cogent/expect_p100_cogent_fb_results.txt
  • Expected Results (V100): ./cogent/expect_v100_cogent_fb_results.txt

Evaluating additional benchmarks using COGENT

  • COGENT accepts an expression describing the required tensor contraction and a representative problem size, as follows:
t3 [a,312,b,312,c,24] += sum(d,312) * t2 [b,d,a] * v2 [d,c];
  • The above expression contracts the two tensors t2 and v2 and stores the result in t3
  • d is the contraction dimension and a, b, and c are external dimensions
  • The representative problem sizes are specified after the indices in the output and sum (contraction) terms
  • Note: our current parser is whitespace sensitive; please follow the above whitespace format

The above input is available in the file ./cogent/input_strings/tccg/input_tcct_01.in
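To evaluate a new contraction, create an additional input file in the same format. The sketch below writes a hypothetical file my_contraction.in (the file name and location are illustrative); edit the indices and sizes as needed, keeping the whitespace format:

cat > ./cogent/input_strings/my_contraction.in << 'EOF'
t3 [a,312,b,312,c,24] += sum(d,312) * t2 [b,d,a] * v2 [d,c];
EOF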

NWChem

./nwchem-tccg/tccg-kernels.cu contains the kernels generated by NWChem's code generator. To build the NWChem kernels, set the CUDA_ARCH variable in the NWChem Makefile to match the GPU architecture, then run make; a combined build-and-run sketch appears after the benchmark list below.

  • Location: ./nwchem-tccg/

Benchmark #1 (Figures 4 and 5)

  • Script: ./nwchem-tccg/bench_tccg.sh
  • Output: nwchem_tccg_results.txt
  • Estimated runtime on P100: 10 minutes
  • Expected Results (P100): ./nwchem-tccg/expect_p100_nwchem_tccg_results.txt
  • Expected Results (V100): ./nwchem-tccg/expect_v100_nwchem_tccg_results.txt
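Putting the NWChem build and benchmark together, a session on the V100 instance might look like the sketch below (as above, CUDA_ARCH must be set in the Makefile before make, and the output location is assumed to be the current directory):

cd nwchem-tccg
# set CUDA_ARCH in the Makefile for the target GPU, then:
make
./bench_tccg.sh
diff nwchem_tccg_results.txt expect_v100_nwchem_tccg_results.txt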

TAL-SH

To run the TAL-SH benchmark (Figures 4 and 5), TAL-SH has to be installed. We have already installed TAL-SH on the Amazon EC2 machine.

How To Build TAL-SH

TAL-SH depends on the cuTT library, so cuTT has to be installed first.

Download TAL-SH from https://github.com/DmitryLyakh/TAL_SH. After building cuTT and before building TAL-SH, replace the test.cpp file in the TAL-SH directory with the test.cpp we provide (it contains the benchmarks). To build TAL-SH, modify the Makefile to set the following variables: TOOLKIT, BLASLIB, GPU_CUDA (set it to CUDA), GPU_SM_ARCH, WITH_CUTT (set it to YES), FOOL_CUDA, PATH_CUDA, and PATH_CUTT.

Below is a sample Makefile.

#Compiler: [GNU|PGI|INTEL|CRAY|IBM]:
export TOOLKIT ?= INTEL
...
#BLAS: [ATLAS|MKL|ACML|ESSL|NONE]:
export BLASLIB ?= MKL
#Nvidia GPU via CUDA: [CUDA|NOCUDA]:
export GPU_CUDA ?= CUDA
#Nvidia GPU architecture (two digits):
export GPU_SM_ARCH ?= 60
...
#Fast GPU tensor transpose (cuTT library): [YES|NO]:
export WITH_CUTT ?= YES
...
#Fool CUDA 7.0 with GCC > 4.9: [YES|NO]:
export FOOL_CUDA ?= YES
...
# CUDA (only set this if you build with CUDA):
export PATH_CUDA ?= /usr/local/cuda/9.2.88
...
# cuTT path (only set this if you use cuTT library):
export PATH_CUTT ?= /users/......./cutt
  • Location: ./tal-sh/
  • Sample Makefile: ./tal-sh/Makefile
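As a rough outline, the full build sequence is sketched below; the cuTT source location, BLAS library, and GPU_SM_ARCH are machine specific, the provided ./tal-sh/Makefile can be used as a template, and the provided test.cpp is assumed to live in ./tal-sh/:

# 1. build cuTT (the tensor-transpose library TAL-SH depends on)
cd cutt
make
cd ..
# 2. download TAL-SH and drop in the provided test.cpp (it contains the benchmarks)
git clone https://github.com/DmitryLyakh/TAL_SH
cp tal-sh/test.cpp TAL_SH/test.cpp
# 3. set TOOLKIT, BLASLIB, GPU_SM_ARCH, WITH_CUTT, PATH_CUDA, PATH_CUTT in the
#    TAL_SH Makefile (see the sample above), then build
cd TAL_SH
make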

Troubleshooting

  • If building TAL-SH results in a relocation-related error, add -Xcompiler -fPIC to CUDA_CFLAGS in cuTT's Makefile

Benchmark #1 (Figures 4 and 5)

  • Ensure that the test.cpp file was replaced with the provided version before building TAL-SH; the modified file contains the benchmarks. Ensure that the build was successful.
  • Reviewers using the Amazon EC2 instance can find the pre-built version at /home/ubuntu/cgo2019-ae-draft/tal-sh/build/TAL_SH
  • Run: ./test_talsh.x
  • Output: results.tsv
  • Estimated runtime on P100: 5 minutes
  • Expected Results (P100): ./tal-sh/expect_p100_tal-sh_tccg_results.txt
  • Expected Results (V100): ./tal-sh/expect_v100_tal-sh_tccg_results.txt
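On the EC2 instance, a complete TAL-SH run might look like the sketch below (paths are those listed above; results.tsv can then be compared manually against the expected-results file):

cd /home/ubuntu/cgo2019-ae-draft/tal-sh/build/TAL_SH
./test_talsh.x
# inspect the generated results and compare with the V100 reference
cat results.tsv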

Facebook's Tensor Comprehensions (TC)

Before running our scripts for FB's TC, please build FB's TC (see the installation instructions below). Note that the runtime of TC with tuning is around 7 hours on a P100 GPU. We have already installed TC on the Amazon EC2 machine.

How To Build Facebook's TC

conda install -y -c pytorch -c tensorcomp tensor_comprehensions
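After installation, a quick sanity check is to import the package from the same conda environment (a minimal sketch; it only verifies that the module loads):

python -c "import tensor_comprehensions"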

Benchmark #1

FB's TC with tuning (Figures 6, 7, and 8)

  • Script: ./fb-tc/fb-w-tuning/bench_fb_w_tuning.sh
  • Reviewers using the Amazon EC2 instance can simply run the script; the TC package and Python are preconfigured
  • Output: "fb_w_tuning.txt" and "fb_tuning_time_sd2_1.txt"
  • Estimated runtime on P100: 7+ hours
  • Expected output: TC tuning relies on random seed points, so the results will vary widely from run to run. The results reported in the paper are the average of 5 runs
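Since the tuning run takes several hours, it can be convenient to launch the script detached and monitor the log, for example (the log file name is illustrative):

nohup ./fb-tc/fb-w-tuning/bench_fb_w_tuning.sh > tuning.log 2>&1 &
tail -f tuning.log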

Benchmark #2

FB's TC without tuning (Figures 6 and 7)

  • Script: ./fb-tc/fb-wo-tuning/bench_fb_wo_tunning.sh
  • Reviewers using the Amazon EC2 instance can simply run the script; the TC package and Python are preconfigured
  • Output: fb_tccg_wo_tuning.txt
  • Estimated runtime on P100: 15 to 20 minutes
  • Expected Results (P100): ./fb-tc/fb-wo-tuning/expect_p100_fb_tccg_wo_tuning.txt
  • Expected Results (V100): ./fb-tc/fb-wo-tuning/expect_v100_fb_tccg_wo_tuning.txt

Troubleshooting

If the conda installation fails, or if the installation succeeds but the runs do not produce correct output, it is likely that some package versions conflict. Try the commands below to resolve the version conflicts:

  • conda install cudatoolkit=8.0
  • conda install -y -c pytorch pytorch=0.4.0 torchvision cuda90
  • conda install -y -c pytorch -c tensorcomp tensor_comprehensions

COPYRIGHT

All files in this archive that do not include a prior copyright notice are by default part of this tool and are copyright 2018 The Ohio State University.

MORE INFORMATION

For more information on how to add a new benchmark, see the docs/ folder or contact me at kim.4232@osu.edu.