PPT-GPU

Setup for the SC'21 paper "Hybrid, scalable, trace-driven performance modeling of GPGPUs".

Set up environments for PPT-GPU (SC'21)

Configure Git

git config --global user.name "xxxx"
git config --global user.email "xxxx"
ssh-keygen -t rsa -C "xxxx"

Basic Dependencies

Get glib2.0 using

sudo apt-get update -y
sudo apt-get install libglib2.0-dev

If you see an error like this:

libglib2.0-dev : Depends: libglib2.0-0 (= 2.33.12+really2.32.4-5) but 2.42.1-1 is to be installed
			    Depends: libglib2.0-bin (= 2.33.12+really2.32.4-5) but 2.42.1-1 is to be installed

then install the pinned versions of the corresponding dependencies:

apt-get install libglib2.0-0=2.33.12+really2.32.4-5
apt-get install libglib2.0-bin=2.33.12+really2.32.4-5

gfortran is used for building mpich, and ninja is used for building LLVM:

apt-get install gfortran
apt-get install ninja-build
apt-get install re2c

Also, make sure nvcc -V reports a CUDA version >= 10.1. As for the driver version, it should be <= 450.36. In my case, I am using a GeForce GTX 1080 Ti card:

[screenshot: CUDA and driver versions on the GeForce GTX 1080 Ti]
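
A quick way to double-check both from the terminal (assuming nvidia-smi is available alongside the driver):

# CUDA toolkit version (should be >= 10.1)
nvcc -V

# Driver version and GPU model (driver should be <= 450.36)
nvidia-smi --query-gpu=driver_version,name --format=csv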

mpich

Get the mpich package from the website (http://www.mpich.org/downloads/), then build and install it.

cd mpich_src_dir
./configure -prefix=/opt/mpich
make && make install 

To add mpich to your PATH, add these lines to ~/.bashrc:

export MPI_ROOT=/opt/mpich
export PATH=$MPI_ROOT/bin:$PATH
export MANPATH=$MPI_ROOT/man:$MANPATH

Remember to source the file after editing it:

source ~/.bashrc

Then update the libmpich.so path in simian.py; if you installed mpich to /opt/mpich, no changes are needed.
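
A quick sanity check that the install works and that the library simian.py needs is where it expects (assuming the /opt/mpich prefix used above):

# The shared library loaded by simian.py
ls /opt/mpich/lib/libmpich.so

# Run a trivial 2-rank MPI job
mpiexec -n 2 hostname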

Trace Tool

Set the ARCH variable in the Makefile, then:

make clean && make
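
If you are unsure where ARCH lives, a quick way to find the line to edit and to confirm the build output (a sketch, assuming tracer.so is the output name used in the command below):

cd tracing_tool
grep -n "ARCH" Makefile   # edit this line to match your GPU architecture
ls tracer.so              # should exist after a successful make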

You can use the tracing tool in this way:

LD_PRELOAD=/ppt-gpu/PPT-GPU/tracing_tool/tracer.so  ./app.out

For example, in my apps directory:

[screenshot: listing of the apps directory]
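
If you do not have a test application handy, a minimal saxpy.cu such as the following works (this is only an assumed example, not a file shipped with PPT-GPU; its kernel signature matches the saxpy(int, float, float*, float*) kernel that appears later in the SASS output):

cat > saxpy.cu <<'EOF'
#include <cstdio>

// y[i] = a * x[i] + y[i]
__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f (expected 4.0)\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
EOF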

Compile the program using nvcc as normal (for SASS):

nvcc -arch=sm_60 saxpy.cu -o saxpy

then:

LD_PRELOAD=/ppt-gpu/PPT-GPU/tracing_tool/tracer.so  ./saxpy

The results look like this:

[screenshot: tracer output for saxpy]

LLVM Tool

Make sure to get LLVM from the git repository:

git clone --branch release/11.x https://github.com/llvm/llvm-project.git

DO NOT USE A PRE-BUILT LLVM VERSION.

To start building LLVM, go into the llvm-project main directory, then:

mkdir build && cd build
cmake  -DLLVM_ENABLE_PROJECTS=clang \
-DCMAKE_INSTALL_PREFIX=/opt/llvm-11.0 \
-DCMAKE_BUILD_TYPE=Release  \
-DLLVM_ENABLE_ASSERTIONS=ON \
-DLLVM_ENABLE_DOXYGEN=OFF -DLLVM_BUILD_DOCS=OFF \
-GNinja \
-DLLVM_INSTALL_BINUTILS_SYMLINKS=ON -DBUILD_SHARED_LIBS=ON \
../llvm

If everything is OK, build by simply typing:

ninja

This step might crash with a compiler error; that's OK. When an error occurs, just type ninja again.
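
Once the build finally succeeds, install into the prefix configured above (the steps below assume the toolchain is in /opt/llvm-11.0):

ninja install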

PPT-GPU's llvm_tool is NOT the same as cuda_flux, so copy some files from the original cuda_flux repository (https://github.com/UniHD-CEG/cuda-flux.git):

cp  cuda_flux/CMakeLists.txt  PPT-GPU/llvm_tool
cp  cuda_flux/mekong-utils/CMakeLists.txt PPT-GPU/llvm_tool/mekong-utils/

Before building, configure the toolchain:

export CC=/opt/llvm-11.0/bin/clang
export CXX=/opt/llvm-11.0/bin/clang++
export C_INCLUDE_PATH=/usr/local/cuda/include:/opt/llvm-11.0/include
export CPLUS_INCLUDE_PATH=/usr/local/cuda/include:/opt/llvm-11.0/include
export CUDA_PATH=/usr/local/cuda
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/opt/llvm-11.0/lib
export LIBRARY_PATH=/usr/local/cuda-10.1/lib64:/opt/llvm-11.0/lib
export LLVM_DIR=/opt/llvm-11.0/lib/cmake/llvm

Then build and install (run these inside PPT-GPU/llvm_tool):

mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/opt/cuda-flux ..
make && make install

To add cuda-flux to your PATH, add this line to ~/.bashrc:

export PATH=$PATH:/opt/cuda-flux/bin
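
After sourcing ~/.bashrc again, a quick check that the cuda-flux compiler wrapper is visible (clang_cf++ is the wrapper used in the compile commands below):

source ~/.bashrc
which clang_cf++
ls /opt/cuda-flux/bin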

In case of a link error about libcudart, create a symlink:

ln -s /usr/local/cuda/lib64/libcudart.so /usr/lib/libcudart.so

Now return to the application directory and use this clang wrapper to compile (for PTX):

clang_cf++ -O3 --cuda-gpu-arch=sm_60 -std=c++11 -lcudart saxpy.cu -o saxpy

The results look like this:

[screenshot: clang_cf++ compilation output]

Now run the application:

[screenshot: output of running the instrumented application]

You can see directories and files including app_config.py, memory_traces/, ptx_traces/, and sass_traces/, as described in the paper.

Reuse Distance Tool

Simply run the following commands:

cd reuse_distance_tool
make

PPT-GPU

Use the tool with the following commands (in the PPT-GPU directory):

  1. From SASS

NOTICE: All programs should be compiled with nvcc.

After compilation, run the application under the tracing tool as shown above. Once app_config.py, memory_traces/, and sass_traces/ are in the application directory, go to the PPT-GPU directory and execute:

mpiexec -n 2 python ppt.py --app dir_of_app/ --sass --config TITANV --granularity 2 

You may run into an error like this:

[screenshot: error about a SASS instruction missing from the latency table]

In that case, you have to add the latency of the missing instruction to the corresponding SASS instruction table in PPT-GPU/hardware/ISA.
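
One way to locate the right table is to list the per-architecture files and grep for the opcode named in the error, then add an entry with its latency next to a similar instruction (a sketch only; the exact file layout may differ between PPT-GPU versions):

# See which per-architecture ISA tables exist
ls hardware/ISA/

# Find where existing opcodes/latencies are defined (replace OPCODE with the one from the error)
grep -rni "OPCODE" hardware/ISA/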

  2. From PTX

NOTICE: All programs should be compiled with the LLVM tool (clang_cf++) first.

After compilation, run the program once to generate the PTX traces, then generate app_config.py and the memory traces with the tracing tool:

LD_PRELOAD=/ppt-gpu/PPT-GPU/tracing_tool/tracer.so  ./app.out

After you have app_config.py, memory_traces/, ptx_traces/, and sass_traces/ in the application directory, go to the PPT-GPU directory and execute:

mpiexec -n 1 python ppt.py --app dir_of_app/ --ptx --config TITANV --granularity 2 --kernel 1

Example

Here is bfs from the Rodinia benchmark suite; use PPT-GPU on this program as shown below.

The directory is laid out like this:

root@1597c1488c3f:/ppt-gpu/bfs# ls
bfs.cu  graph4096.txt  kernel2.cu  kernel.cu  Makefile  run

Firstly, compile bfs using clang_cf++:

root@1597c1488c3f:/ppt-gpu/bfs# cat Makefile

CC := clang_cf++
INCLUDE := $(CUDA_DIR)/include
SRC = bfs.cu
EXE = bfs
release: $(SRC)
         $(CC) -O3 -std=c++11 --cuda-gpu-arch=sm_60 $(SRC) -o $(EXE) -I$(INCLUDE) -L/usr/local/lib -lcudart
clean: $(SRC)
        rm -f $(EXE) $(EXE).linkinfo result.txt  *.bc *.out *.ptx

The output of make looks like:

root@1597c1488c3f:/ppt-gpu/bfs# make

clang_cf++ -O3 -std=c++11 --cuda-gpu-arch=sm_60 bfs.cu  -o bfs -I/include -L/usr/local/lib -lcudart
+ clang++ -Xclang -load -Xclang /opt/cuda-flux/lib/libcuda_flux_pass.so -finline-functions -O3 -std=c++11 --cuda-gpu-arch=sm_60 bfs.cu -o bfs -I/include -L/usr/local/lib -lcudart
clang-11: warning: Unknown CUDA version. cuda.h: CUDA_VERSION=11000. Assuming the latest supported version 10.1 [-Wunknown-cuda-version]
CUDA Flux: Instrumenting device code...
CUDA Flux: Module prefix: bfs.cu_5d02b493
clang version 11.1.0 (https://github.com/llvm/llvm-project.git 1fdec59bffc11ae37eb51a1b9869f0696bfd5312)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/llvm-11.0/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7.5.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/8
Selected GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7.5.0
Candidate multilib: .;@m64
Selected multilib: .;@m64
Found CUDA installation: /usr/local/cuda-11.0, version 11.0
clang-11: warning: Unknown CUDA version. cuda.h: CUDA_VERSION=11000. Assuming the latest supported version 10.1 [-Wunknown-cuda-version]
 (in-process)
 "/opt/llvm-11.0/bin/clang-11" -cc1 -triple nvptx64-nvidia-cuda -aux-triple x86_64-unknown-linux-gnu -emit-llvm -disable-free -main-file-name cuda_flux_drt1Qbsdb.cu -mrelocation-model static -mframe-pointer=all -fno-rounding-math -fno-verbose-asm -no-integrated-as -aux-target-cpu x86-64 -fcuda-is-device -mlink-builtin-bitcode /usr/local/cuda-11.0/nvvm/libdevice/libdevice.10.bc -target-feature +ptx70 -target-sdk-version=11.0 -target-cpu sm_60 -fno-split-dwarf-inlining -debugger-tuning=gdb -v -resource-dir /opt/llvm-11.0/lib/clang/11.1.0 -internal-isystem /opt/llvm-11.0/lib/clang/11.1.0/include/cuda_wrappers -internal-isystem /usr/local/cuda-11.0/include -include __clang_cuda_runtime_wrapper.h -c-isystem /usr/local/cuda/include -c-isystem /opt/llvm-11.0/include -c-isystem /usr/local/include -c-isystem /usr/include -cxx-isystem /usr/local/cuda/include -cxx-isystem /opt/llvm-11.0/include -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/backward -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0 -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/backward -internal-isystem /usr/local/include -internal-isystem /opt/llvm-11.0/lib/clang/11.1.0/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -internal-isystem /usr/local/include -internal-isystem /opt/llvm-11.0/lib/clang/11.1.0/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -std=c++11 -fdeprecated-macro -fno-dwarf-directory-asm -fno-autolink -fdebug-compilation-dir /ppt-gpu/bfs -ferror-limit 19 -fgnuc-version=4.2.1 -fcxx-exceptions -fexceptions -fcolor-diagnostics -vectorize-loops -vectorize-slp -o /tmp/cuda_flux_drt1Qbsdb.ll -x cuda /tmp/cuda_flux_drt1Qbsdb.cu
clang -cc1 version 11.1.0 based upon LLVM 11.1.0 default target x86_64-unknown-linux-gnu
ignoring nonexistent directory "/include"
ignoring nonexistent directory "/include"
ignoring duplicate directory "/usr/local/cuda/include"
ignoring duplicate directory "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0"
ignoring duplicate directory "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0"
ignoring duplicate directory "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0"
ignoring duplicate directory "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0"
ignoring duplicate directory "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/backward"
ignoring duplicate directory "/usr/local/include"
ignoring duplicate directory "/opt/llvm-11.0/lib/clang/11.1.0/include"
ignoring duplicate directory "/usr/include/x86_64-linux-gnu"
ignoring duplicate directory "/usr/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/cuda/include
 /opt/llvm-11.0/include
 /opt/llvm-11.0/lib/clang/11.1.0/include/cuda_wrappers
 /usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0
 /usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0
 /usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/backward
 /usr/local/include
 /opt/llvm-11.0/lib/clang/11.1.0/include
 /usr/include/x86_64-linux-gnu
 /usr/include
End of search list.
CUDA Flux: Working on kernel: _Z6KernelP4NodePiPbS2_S2_S1_i
CUDA Flux: BlockCount: 8
CUDA Flux: Working on kernel: _Z7Kernel2PbS_S_S_i
CUDA Flux: BlockCount: 4
CUDA Flux: instrumenting host code...
CUDA Flux: CUDA Version 11.0
passed before launches loop
CUDA Flux: Found BasicBlockCount for kernel _Z7Kernel2PbS_S_S_i: 4
Passed flux trac pointer creation
passed addition of tracPtr to addArgs
Passed cloneLaunchCall
Call creation serializeCountersFu with tracPtr
passed before launches loop
CUDA Flux: Found BasicBlockCount for kernel _Z6KernelP4NodePiPbS2_S2_S1_i: 8
Passed flux trac pointer creation
passed addition of tracPtr to addArgs
Passed cloneLaunchCall
Call creation serializeCountersFu with tracPtr

Then run the program, and ptx_traces/ will be generated:

(base) root@1597c1488c3f:/ppt-gpu/bfs# ll
total 428
......
drwxr-xr-x  2 root root   4096 Jun  5 14:31 ptx_traces/
......

Now run the tracing tool:

LD_PRELOAD=/ppt-gpu/PPT-GPU/tracing_tool/tracer.so  ./bfs graph4096.txt

Then app_config.py, memory_traces/, and sass_traces/ will be generated:

root@1597c1488c3f:/ppt-gpu/bfs# ll
total 452
......
-rw-r--r--  1 root root   3071 Jun  5 14:38 app_config.py
drwxr-xr-x  2 root root   4096 Jun  5 14:38 memory_traces/
drwxr-xr-x  2 root root   4096 Jun  5 14:31 ptx_traces/
drwxr-xr-x  2 root root   4096 Jun  5 14:38 sass_traces/
......

Now use PPT-GPU for PTX analysis: go back to the PPT-GPU directory and run:

mpiexec -n 1 python ppt.py --app /ppt-gpu/bfs/ --ptx --config TITANV --granularity 2 --kernel 1

Back in the bfs directory, there is now a file called kernel_1_PTX_g2.out:


- Total GPU computations is divided into 2048 thread block(s) running on 80 SM(s)

- Modeled SM-0 running 4 thread block(s):
        * allocated max active thread block(s): 4
        * allocated max active warps per thread block: 16

- Occupancy of SM-0:
        * Thread block Limit SM: 32
        * Thread block limit registers: 5
        * Thread block limit shared memory: 32
        * Thread block limit warps: 4
        * theoretical max active thread block(s): 4
        * theoretical max active warps per SM: 64
        * theoretical occupancy: 100 %
        * achieved active warps per SM: 34.71
        * achieved occupancy: 54.23 %

- Memory Performance:
        * unified L1 cache hit rate: 5.48 %
        * unified L1 cache hit rate for read transactions (global memory accesses): 6.99 %
        * L2 cache hit rate: 24.54 %

        * Global Memory Requests:
                ** GMEM read requests: 1170
                ** GMEM write requests: 136
                ** GMEM total requests: 1306

        * Global Memory Transactions:
                ** GMEM read transactions: 1170
                ** GMEM write transactions: 136
                ** GMEM total transactions: 1306

        * Global Memory Divergence:
                ** number of read transactions per read requests: 1.0 (3.12%)
                ** number of write transactions per write requests: 1.0 (3.12%)

        * L2 Cache Transactions (for global memory accesses):
                ** L2 read transactions: 1088
                ** L2 write transactions: 136
                ** L2 total transactions: 1224

        * DRAM Transactions (for global memory accesses):
                ** DRAM total transactions: 923

        * Total number of global atomic requests: 0
        * Total number of global reduction requests: 0
        * Global memory atomic and reduction transactions: 0

- Kernel cycles:
        * GPU active cycles (min): 12,048
        * GPU active cycles (max): 18,906
        * SM active cycles (sum): 1,512,480
        * SM elapsed cycles (sum): 1,512,480

- Warp instructions executed: 1,966,080
- Thread instructions executed: 62,914,560
- Instructions executed per clock cycle (IPC): 1.3
- Clock cycles per instruction (CPI):  0.769
- Total instructions executed per seconds (MIPS): 1559
- Kernel execution time: 1260.4 us

- Simulation Time:
        * Memory model: 1.875 sec, 00:00:01
        * Compute model: 0.165 sec, 00:00:00

And now let's run the SASS analysis.

Recompile the program using nvcc and run the tracing tool:

LD_PRELOAD=/ppt-gpu/PPT-GPU/tracing_tool/tracer.so  ./bfs graph4096.txt

Then, back in the PPT-GPU directory, run:

mpiexec -n 2 python ppt.py --app /home/test/Workloads/2mm/ --sass --config TITANV --granularity 2 --kernel 1

The output is:

kernel name: _Z5saxpyifPfS__clone

- Total GPU computations is divided into 2048 thread block(s) running on 80 SM(s)

- Modeled SM-0 running 4 thread block(s):
        * allocated max active thread block(s): 4
        * allocated max active warps per thread block: 16

- Occupancy of SM-0:
        * Thread block Limit SM: 32
        * Thread block limit registers: 5
        * Thread block limit shared memory: 32
        * Thread block limit warps: 4
        * theoretical max active thread block(s): 4
        * theoretical max active warps per SM: 64
        * theoretical occupancy: 100 %
        * achieved active warps per SM: 47.48
        * achieved occupancy: 74.19 %

- Memory Performance:
        * unified L1 cache hit rate: 5.48 %
        * unified L1 cache hit rate for read transactions (global memory accesses): 6.99 %
        * L2 cache hit rate: 19.4 %

        * Global Memory Requests:
                ** GMEM read requests: 1170
                ** GMEM write requests: 136
                ** GMEM total requests: 1306

        * Global Memory Transactions:
                ** GMEM read transactions: 1170
                ** GMEM write transactions: 136
                ** GMEM total transactions: 1306

        * Global Memory Divergence:
                ** number of read transactions per read requests: 1.0 (3.12%)
                ** number of write transactions per write requests: 1.0 (3.12%)

        * L2 Cache Transactions (for global memory accesses):
                ** L2 read transactions: 1088
                ** L2 write transactions: 136
                ** L2 total transactions: 1224

        * DRAM Transactions (for global memory accesses):
                ** DRAM total transactions: 986

        * Total number of global atomic requests: 0
        * Total number of global reduction requests: 0
        * Global memory atomic and reduction transactions: 0

- Kernel cycles:
        * GPU active cycles (min): 6,498
        * GPU active cycles (max): 8,778
        * SM active cycles (sum): 702,240
        * SM elapsed cycles (sum): 702,240

- Warp instructions executed: 557,056
- Thread instructions executed: 17,825,792
- Instructions executed per clock cycle (IPC): 0.793
- Clock cycles per instruction (CPI):  1.261
- Total instructions executed per seconds (MIPS): 951
- Kernel execution time: 585.2 us

- Simulation Time:
        * Memory model: 1.881 sec, 00:00:01
        * Compute model: 0.058 sec, 00:00:00