This project provides a source-to-source tool (Transpiler) for optimizing the use of the space-limited on-chip memories (L1 cache, shared memory) in your CUDA applications.
If you use or build on this tool, please cite the following papers.
- Throttling for L1 cache contention reduction (ICPP'19)
- Preloading in the shared memory by using memory-inactive threads (CCPE'20)
Below is an example workflow and configuration for getting and building the Transpiler.
Tested with the following setup
- Ubuntu 18.04, cmake-3.10.2, gcc/g++-5
sudo apt install gcc-5 g++-5 cmake
sudo apt install libboost-all-dev
- Benchmarks: PolyBench/GPU and Rodinia
Check out llvm, clang, and co-optimizer
llvm
git clone https://github.com/llvm-mirror/llvm
cd llvm
git checkout release_80
clang
cd tools
git clone https://github.com/llvm-mirror/clang
cd clang
git checkout release_80
co-optimizer
cd tools
git clone https://github.com/hjunkim/CO-Optimizer.git
Build them
- Add the CO-Optimizer repository to
llvm/tools/clang/tools/CMakeLists.txt
add_clang_subdirectory(CO-Optimizer)
cd ../../../../
mkdir build
cd build
cmake -G "Unix Makefiles" ../llvm
make -j 16
sudo make install
- Run the Transpiler
{bin} {cuda_program}.cu [Run Options] -- --cuda-device-only --cuda-path={path/to/cuda} --cuda-gpu-arch={sm_xx}

{bin}                 : ./build/bin/{throttling/preloading}
{cuda_program}.cu     : your target CUDA program
--cuda-device-only    : run analysis/translation on the device code only
--cuda-path=          : installed CUDA path (ex: /usr/local/cuda)
--cuda-gpu-arch=sm_xx : CUDA architecture (ex: Titan V, V100: sm_70)
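As an illustration, a hypothetical invocation of the throttling binary on a CUDA source file might look like the following (gemm.cu is a placeholder name, and the CUDA path and architecture are example values; substitute your own):

```shell
# Translate the device code of gemm.cu for a V100 (sm_70), using the
# CUDA toolkit installed under /usr/local/cuda.
# gemm.cu is a placeholder; replace it with your target CUDA program.
./build/bin/throttling gemm.cu -- --cuda-device-only \
    --cuda-path=/usr/local/cuda --cuda-gpu-arch=sm_70
```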
Throttling:
--csize=<int>  : L1 cache size of the GPU (default: 32 KB)
--nblks=<int>  : # of thread blocks per SM (default: 4 blks)
--tbsize=<int> : thread block size (default: 8 warps)
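For example, the throttling options could be combined as sketched below (foo.cu and all option values are assumed for illustration; per the usage line above, tool options go before the `--` separator):

```shell
# Describe the target GPU to the throttling pass: 64 KB L1 cache,
# 8 thread blocks per SM, and 8-warp thread blocks.
# foo.cu and the values shown are placeholders.
./build/bin/throttling foo.cu --csize=64 --nblks=8 --tbsize=8 -- \
    --cuda-device-only --cuda-path=/usr/local/cuda --cuda-gpu-arch=sm_70
```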
Preloading:
--prdsize=<string> : set preloading size (default: 1)
--tbsize=<string>  : set thread block size (default: 8 warps)
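Similarly, a sketch of a preloading invocation (foo.cu and the option values are placeholders chosen for illustration):

```shell
# Run the preloading pass with a preloading size of 2 and 8-warp
# thread blocks. foo.cu and the values shown are placeholders.
./build/bin/preloading foo.cu --prdsize=2 --tbsize=8 -- \
    --cuda-device-only --cuda-path=/usr/local/cuda --cuda-gpu-arch=sm_70
```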