
Code-level on-chip memory optimizer for your CUDA applications.

Primary language: C++. License: MIT.

CO-Optimizer: Code-level On-chip memory Optimizer

CO-Optimizer optimizes the use of space-limited on-chip memories (L1 cache, shared memory) in your CUDA applications.

If you use or build on this tool, please cite the following papers.

  • Throttling for L1 cache contention reduction (ICPP'19)
  • Preloading in the shared memory by using memory-inactive threads (CCPE'20)

Getting the Source Code and Building the On-chip Memory Optimizer

This is an example workflow and configuration for getting and building the optimizer.

  1. Tested with the following setup

    • Ubuntu 18.04, cmake-3.10.2, gcc/g++-5
    • sudo apt install gcc-5 g++-5 cmake
    • sudo apt install libboost-all-dev
    • Benchmarks: PolyBench/GPU and Rodinia
  2. Check out llvm, clang, and co-optimizer

    • llvm

      • git clone https://github.com/llvm-mirror/llvm
      • cd llvm
      • git checkout release_80
    • clang

      • cd tools
      • git clone https://github.com/llvm-mirror/clang
      • cd clang
      • git checkout release_80
    • co-optimizer

      • cd tools
      • git clone https://github.com/hjunkim/CO-Optimizer.git
  3. Build them

    • Register CO-Optimizer in llvm/tools/clang/tools/CMakeLists.txt by adding this line:
      • add_clang_subdirectory(CO-Optimizer)
    • cd ../../../.. && mkdir build && cd build
    • cmake -G "Unix Makefiles" ../llvm
    • make -j 16 && sudo make install
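
Steps 2 and 3 above can be collected into one script. The following is a minimal sketch, not part of the repository: the names CO_ROOT, CO_JOBS, CO_DRY_RUN, and the run helper are introduced here for illustration, and `git -C` / `make -C` stand in for the `cd` commands in the list. By default it only prints the commands; set CO_DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of steps 2 and 3 as one script (illustrative, not shipped with the repo).
CO_ROOT="${CO_ROOT:-$PWD}"        # assumption: directory that will hold llvm/ and build/
CO_JOBS="${CO_JOBS:-16}"          # parallel make jobs, 16 as in step 3
CO_DRY_RUN="${CO_DRY_RUN:-1}"     # 1 = print only, 0 = actually run

TOOLS="$CO_ROOT/llvm/tools/clang/tools"
REGISTER='add_clang_subdirectory(CO-Optimizer)'

# Print a command, and execute it only when CO_DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$CO_DRY_RUN" = "0" ]; then "$@"; fi
}

# Clone llvm and clang at release_80, add CO-Optimizer under clang/tools,
# append the registration line to CMakeLists.txt once, then configure and build.
build_co_optimizer() {
  run git clone https://github.com/llvm-mirror/llvm "$CO_ROOT/llvm" &&
  run git -C "$CO_ROOT/llvm" checkout release_80 &&
  run git clone https://github.com/llvm-mirror/clang "$CO_ROOT/llvm/tools/clang" &&
  run git -C "$CO_ROOT/llvm/tools/clang" checkout release_80 &&
  run git clone https://github.com/hjunkim/CO-Optimizer.git "$TOOLS/CO-Optimizer" &&
  run sh -c 'grep -qxF "$1" "$2" || echo "$1" >> "$2"' sh "$REGISTER" "$TOOLS/CMakeLists.txt" &&
  run mkdir -p "$CO_ROOT/build" &&
  run sh -c 'cd "$1" && cmake -G "Unix Makefiles" ../llvm' sh "$CO_ROOT/build" &&
  run make -C "$CO_ROOT/build" -j "$CO_JOBS" &&
  run sudo make -C "$CO_ROOT/build" install
}

build_co_optimizer
```

The chain stops at the first failing command, so a failed clone or checkout does not lead to a build of a half-prepared tree.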

Usage

Run

  • {bin} {cuda_program}.cu [Run Options] -- --cuda-device-only --cuda-path={path/to/cuda} --cuda-gpu-arch={sm_xx}
    • {bin} --- ./build/bin/throttling or ./build/bin/preloading
    • {cuda_program}.cu --- your target CUDA program
    • --cuda-device-only --- analyze and translate only the device code
    • --cuda-path= --- path of the installed CUDA toolkit (e.g., /usr/local/cuda)
    • --cuda-gpu-arch=sm_xx --- CUDA architecture (e.g., sm_70 for Titan V and V100)

Run Options

  • Throttling

    • --csize=<int>: L1 cache size of the GPU (default: 32 KB)
    • --nblks=<int>: number of thread blocks per SM (default: 4 blks)
    • --tbsize=<int>: thread block size (default: 8 warps)
  • Preloading

    • --prdsize=<string>: preloading size (default: 1)
    • --tbsize=<string>: thread block size (default: 8 warps)
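
To show how the pieces of the Run line fit together, here is a small sketch that assembles and prints a full command. The helper co_opt_cmd, the variables CUDA_DIR and GPU_ARCH, and the file name gemm.cu are placeholders introduced for this example; only the flags themselves come from the lists above. Run options go before the `--` separator and the CUDA flags after it.

```shell
#!/bin/sh
# Sketch: assemble a CO-Optimizer command line from the pieces listed above.
# co_opt_cmd, CUDA_DIR, GPU_ARCH, and gemm.cu are illustrative names only.
CUDA_DIR=/usr/local/cuda   # example CUDA path from the Run list
GPU_ARCH=sm_70             # Titan V / V100

# usage: co_opt_cmd <throttling|preloading> <file.cu> [run options...]
co_opt_cmd() {
  tool="$1"; src="$2"; shift 2
  case "$tool" in
    throttling|preloading) ;;
    *) echo "unknown tool: $tool" >&2; return 1 ;;
  esac
  cmd="./build/bin/$tool $src"
  # Run options, if any, are placed before the "--" separator.
  if [ "$#" -gt 0 ]; then cmd="$cmd $*"; fi
  echo "$cmd -- --cuda-device-only --cuda-path=$CUDA_DIR --cuda-gpu-arch=$GPU_ARCH"
}

co_opt_cmd throttling gemm.cu --nblks=4    # documented default, written out
co_opt_cmd preloading gemm.cu --prdsize=1  # documented default, written out
```

The helper only prints; to execute a printed line, run it from the directory that contains build/, adjusting CUDA_DIR and GPU_ARCH to the local installation.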