GPU Drano is a static analysis tool for GPU Programs. One of the analyses in GPU Drano is for finding uncoalesced memory accesses in CUDA code. The other analysis supported by GPU Drano is analysis to prove block-size independence of GPU programs.
Modern GPUs bundle threads into warps. All threads in a warp perform operations in lockstep. Memory accesses to different memory locations can be coalesced into a single load/store if the memory is adjacent or close enough in memory. When the memory accessed by a warp is far apart, multiple load/stores are required to complete the memory transaction, and we say the access is uncoalesced.
A GPU kernel is said to be block-size independent, if modifying the block-size while keeping the total number of threads same, does not break the functionality of the program. This is essential for correct block-size tuning in GPU programs, which is often used to improve program performance.
We have also implemented a dynamic analysis to identify uncoalesced accesses, it's available in this repository.
GPU Drano is implemented as a compiler pass for LLVM using Google's open source CUDA
implementation: gpucc
.
Therefore, Drano is tightly coupled with LLVM.
Drano requires LLVM version 6.0 or later. It also requires CUDA toolkit (version 7.5 or later) from NVIDIA. It has been tested on Ubuntu 16.04 LTS, but should work with most existing Linux systems.
The static analysis itself does not require a GPU. However to execute the programs and to run dynamic anlaysis, an NVIDIA GPU and compatible drivers are required. Check NVIDIA's system requirements for more information.
The details of the algorithm and the design choices can be found in following papers:
The script installnrun.sh
included with the project details the steps required
to build and execute GPU Drano on an Ubuntu system. To run the script,
- Download GPU Drano and set
ROOT_DIR
to the path to the downloaded folder. - Run
sh installnrun.sh
This would automatically install GPU Drano on a Linux system. We further describe the installation steps for installing the uncoalesced access analysis here. The installation for block-size independence analysis (also refered to as block-size invariance analysis) can be done similarly.
-
Get LLVM source:
Ensure
subversion
is installed. Download the newest version of LLVM:
svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
-
Get Clang source:
Change your current working directory to
llvm/tools/
and check outclang
from the svn repository:
svn co http://llvm.org/svn/llvm-project/cfe/trunk clang
-
Add GPU Drano to LLVM:
Copy GPU Drano's
src/
folder into the directoryllvm/lib/Transforms/UncoalescedAnalysis
in your source code:
cp -r src/abstract-execution/* llvm/lib/Transforms/UncoalescedAnalysis
cp -r src/uncoalesced-analysis/* llvm/lib/Transforms/UncoalescedAnalysis
This will create a folder called llvm/lib/Transforms/UncoalescedAnalysis/
. We must
register our pass with the LLVM build system, cmake
. Therefore, append
add_subdirectory(UncoalescedAnalysis)
to llvm/lib/Transforms/CMakeLists.txt
Sample CMakeLists.txt file:
$> more CMakeLists.txt
add_subdirectory(Utils)
add_subdirectory(Instrumentation)
add_subdirectory(InstCombine)
add_subdirectory(Scalar)
add_subdirectory(IPO)
add_subdirectory(Vectorize)
add_subdirectory(Hello)
add_subdirectory(ObjCARC)
add_subdirectory(Coroutines)
add_subdirectory(UncoalescedAnalysis)
-
Build LLVM and GPU Drano:
From the root directory of Drano, create a
build/
directory. Then, change directory to thebuild/
directory. Ensure CMake is installed on the system. Execute the following commands here:
cmake ../llvm
make
That's cmake
with the path to LLVM directory (../llvm
). CMake configures
LLVM for your system. It should generate several files in your current working
directory (build/
). The command make
builds LLVM and Drano.
Ideally, use make -j N
where N is your number of cores to build with in
parallel as llvm takes a long time to build.
Notice: LLVM build takes a large amount of memory to compile, specifically
at the linking step. If you have < 8 gigabytes of RAM, LLVM may fail to build.
If so, you can rerun make
without the -j
option. This will only recompile the
failed parts of the build, and still save more time than not using -j
from
the start.
-
Install LLVM and GPU Drano Perform
sudo make install
. This should install the libraries and binaries in the default location (or the specified location). If installed locally, this guide assumes bash can find the command in it's PATH.The above is a quick start guide. If you're unfamiliar with LLVM you may find all details for installation at: http://llvm.org/docs/GettingStarted.html
The script installnrun.sh
briefly describes the process to build NVIDIA drivers,
toolkit and the SDK. The script has been commented out to avoid the risk of overriding
existing NVIDIA drivers. Note that we only need the toolkit and the SDK to run
the static analysis. We also need drivers and a working GPU to execute dynamic analysis
and CUDA programs themselves.
-
Install the NVIDIA drivers, CUDA toolkit and SDK: Please refer to the NVIDIA installtion guide for instructions.
In summary, if you use a recent and popular Linux distro you should be able to use the automatic download tool to install the required drivers and sdk.
If you already have the required drivers and SDK installed you may skip this step. It's probably not a good idea to override your existing drivers as this might render your system unusable.
-
You may require g++-multilib to install necessary libraries required by
clang
.
This step is optional. A simple CUDA program like "hello world" should now compile
using clang
or clang++
:
clang -x cuda helloWorld.cu
The -x cuda
option explicitly states the language. You may omit it as well and
clang will infer this as a CUDA program.
Notice: Your clang installation may need not be able to find several internal
LLVM functions, if so, you may need to include -lcudart
. You may also need to
point to the location of the cudart.so
.
Example:
clang -x cuda -L /usr/local/cuda/targets/x86_64-linux/lib/ -lcudart helloWorld.cu
The path to your cudart.so may be different depending on your system.
LLVM should have generated a shared object .so
file, called LLVMUncoalescedAnalysis.so
,
under the build/lib/
directory.
The LLVM complier clang
generates separate device (the gpu kernel code) and
host (code run on CPU) IR files when compiling CUDA programs. GPU Drano analyzes
the LLVM IR files for device code.
As an example, let's analyze the Rodinia
kernel code for the gaussian
benchmark.
We change directory into rodinia_3.1/cuda/gaussian/
.
First we have clang generate LLVM IR files for the code we are interested in:
clang++ -S -g -emit-llvm gaussian.cu
Notice we compile with debug symbol -g
, to keep the debug information about
source code locations in the generated IR. This is used to point source code locations
with potential uncoalesced accesses from LLVM IR.
The compilation generates two files:
gaussian-cuda-nvptx64-nvidia-cuda-sm_20.ll
gaussian.ll
We can then run the static analysis through LLVM's opt
by specifying the path to
the GPU Drano binary and specifying pass -interproc-uncoalesced-analysis
to be run. This
pass is an interprocedural analysis to detect uncoalesced accesses. It starts with
the analysis of the top-most function in the call-graph and then proceeds with the
analysis of its callees in a topological order. While analyzing a specific callee,
it considers the join of the call contexts of all its callers. To run an intraprocedural
analysis (that assumes all initial function arguments are independent of thread's id),
specify pass -uncoalesced-analysis
to be run, instead of -interproc-uncoalesced-analysis
.
opt -load ../../../build/lib/LLVMUncoalescedAnalysis.so -instnamer -interproc-uncoalesced-analysis < gaussian-cuda-nvptx64-nvidia-cuda-sm_20.ll > /dev/null 2> gpuDranoResults.txt
Notice opt
reads the IR file from standard input. opt
writes it's own uninteresting
output to standard out so we redirection it to /dev/null
GPU Drano's output is
written to standard error which may be redirected to a file.
To generate verbose analysis results (LLVM IR annotated with analysis info), run
opt
with an additional -debug-only=uncoalesced-analysis
flag.
The following command can be used to run the block-size independence analysis on a program.
opt -load ../../../build/lib/LLVMBlockSizeInvarianceAnalysis.so - -instnamer -always-inline -interproc-bsize-invariance-analysis < gaussian-cuda-nvptx64-nvidia-cuda-sm_20.ll > /dev/null 2> gpuDranoResults.txt
The generated results for uncoalesced access analysis reports all accesses that
might be potentially uncoalesced in each of the GPU kernels. For example, here
are the results for the analysis of
gaussian.cu
.
Analysis Results:
Function: _Z4Fan1PfS_ii
Uncoalesced accesses: #2
-- gaussian.cu:295:59
-- gaussian.cu:295:61
Analysis Results:
Function: _Z4Fan2PfS_S_iii
Uncoalesced accesses: #4
-- gaussian.cu:312:38
-- gaussian.cu:312:35
-- gaussian.cu:312:35
-- gaussian.cu:317:23
Each result item points to a potentially uncoalesced access in the source code.
For example, access to m_cuda
at line 295, column 59 in gaussian.cu at method
Fan1()
is uncoalesced.
Similarly, generated results for block-size independence analysis identify all kernels that are block-size independent!
Rodinia is popular benchmark suite of GPU programs consisting of 22 programs from different domains. We analyzed the suite and found 111 real uncoalesced accesses using static analysis. To reproduce the results, here are the steps involved:
-
Update CUDA and Drano configuration in
rodinia_3.1/common/make.config
. SetCUDA_DIR
andSDK_DIR
with NVIDIA toolkit and SDK paths. Updateopt
alias with path to GPU Drano binaryLLVMUncoalescedAnalysis.so
. -
Go to benchmarks directory
rodinia_3.1/cuda
. -
Compile benchmarks:
sh compile.sh
- Run GPU Drano on benchmarks (takes about 2 hours):
sh run-analysis.sh
Analysis of each benchmark generates results in its respective folder in log-files
named log_<filename>
.
- Summarize results:
./summarize-results.sh
Nvidia provides a set of CUDA samples which can be used for various applications. We analyzed the suite to identify block-size independent kernels in the samples. To reproduce the results, the steps are:
-
Update
HOME
variable inNVIDIA_CUDA-8.0_Samples/common/drano.mk
to the root GPU Drano directory. -
Install OpenGL (required for compiling a few benchmarks). On Ubuntu, following command can be used:
sudo apt-get install freeglut3-dev
-
Go to the directory
NVIDIA_CUDA-8.0_Samples
-
Compile benchmarks:
sh compile.sh
- Run GPU Drano on benchmarks (takes about 2 hours):
sh run-analysis.sh
Analysis of each benchmark generates results in its respective folder in log-files
named log_<filename>
.
- Summarize results:
./summarize-bsize-independence.sh