BallisticLA/RandLAPACK

`blaspp` goes into single-thread mode when used


I am adding GPU support to RandLAPACK using HAMR; my code is linked below. It appears both the CPU and GPU gemm are quite slow: the CPU gemm runs in single-threaded mode, and the GPU gemm is more than 10 times slower than calling cuBLAS directly. Both are invoked through blas::gemm from blaspp, which under the hood uses MKL and cuBLAS, respectively. I ran the tests provided by blaspp, and both the host and device gemm perform as I expect there: the host (MKL) gemm scales to 16 threads, and the device gemm has roughly the same performance as calling cuBLAS directly.

Could it be due to the build changes I made, or to something else? Everything is built with -O3, and no debug flags are enabled. Below are detailed instructions for reproducing the issue.
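For reference, the two calls being compared look roughly like this (a minimal sketch, not the actual benchmark code; matrix sizes, leading dimensions, and the Queue constructor arguments are placeholders):

#include <blas.hh>
#include <cstdint>

// Host gemm: dispatches to the CPU BLAS that blaspp was built against (MKL here).
void host_gemm(int64_t n, const double* A, const double* B, double* C)
{
    blas::gemm(blas::Layout::ColMajor, blas::Op::NoTrans, blas::Op::NoTrans,
               n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}

// Device gemm: the overload taking a blas::Queue dispatches to cuBLAS.
void device_gemm(int64_t n, const double* dA, const double* dB, double* dC)
{
    blas::Queue queue(0);  // device 0; extra constructor arguments vary by blaspp version
    blas::gemm(blas::Layout::ColMajor, blas::Op::NoTrans, blas::Op::NoTrans,
               n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n, queue);
    queue.sync();          // wait for the device gemm to finish
}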

Pre-built dependencies

Source dependencies

PROJ_ROOT is set to the parent folder where everything is:

PROJ_ROOT
    HAMR
    HAMR-build
    HAMR-install
    RandBLAS
    RandBLAS-build
    RandBLAS-install
    RandLAPACK
    RandLAPACK-build
    RandLAPACK-install
    blaspp
    blaspp-build
    blaspp-install
    lapackpp
    lapackpp-build
    lapackpp-install
    randlapackbuild
    random123
    Random123-install
[mengyibai@freddie:~/BallisticLA/RandLAPACK-build]$ echo $LD_LIBRARY_PATH
/data/mengyibai/intel/oneapi/mkl/2023.0.0/lib/intel64
[mengyibai@freddie:~/BallisticLA/RandLAPACK-build]$ echo $LIBRARY_PATH
/data/mengyibai/intel/oneapi/mkl/2023.0.0/lib/intel64
[mengyibai@freddie:~/BallisticLA/RandLAPACK-build]$ echo $CPATH
/data/mengyibai/intel/oneapi/mkl/2023.0.0/include

blaspp

Version: f8b5b04c9c25069 @ master

cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=/usr/bin/gcc-10 \
-DCMAKE_CXX_COMPILER=/usr/bin/g++-10 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-11.7/bin/nvcc \
-DCMAKE_INSTALL_PREFIX=${PROJ_ROOT}/blaspp-install  \
-DCMAKE_BINARY_DIR=${PROJ_ROOT}/blaspp-build \
-Dblas_threaded=yes \
-Dgpu_backend=cuda \
-Dbuild_tests=OFF  ../blaspp

lapackpp

Version: 17edf95ec1d11787b61d9a016f202e6b3d772cff @ master

cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=/usr/bin/gcc-10 \
-DCMAKE_CXX_COMPILER=/usr/bin/g++-10 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-11.7/bin/nvcc \
-DCMAKE_INSTALL_PREFIX=${PROJ_ROOT}/lapackpp-install  \
-DCMAKE_BINARY_DIR=${PROJ_ROOT}/lapackpp-build \
-Dgpu_backend=cuda \
-Dblaspp_DIR=${PROJ_ROOT}/blaspp-install/lib/blaspp \
-Dbuild_tests=OFF  ../lapackpp

random123

Use HEAD @ master

make prefix=${PROJ_ROOT}/random123-install install-include

HAMR

Use HEAD (bba3ee0b40de4ab437e7e4d2275c39f6b4397b67) @ master

cmake  \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=/usr/bin/gcc-10 \
-DCMAKE_CXX_COMPILER=/usr/bin/g++-10 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-11.7/bin/nvcc \
-DCMAKE_INSTALL_PREFIX=${PROJ_ROOT}/HAMR-install  \
-DCMAKE_BINARY_DIR=${PROJ_ROOT}/HAMR-build \
 -DHAMR_ENABLE_CUDA=True \
../HAMR

RandBLAS

Version: 1e8b475d8f7ec9021ea7998b9f4fd58e41bfe21e @ master

cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=/usr/bin/gcc-10 \
-DCMAKE_CXX_COMPILER=/usr/bin/g++-10 \
-DCMAKE_INSTALL_PREFIX=${PROJ_ROOT}/RandBLAS-install  \
-DCMAKE_BINARY_DIR=${PROJ_ROOT}/RandBLAS-build \
-Dblaspp_DIR=${PROJ_ROOT}/blaspp-install/lib/blaspp/ \
-DRandom123_DIR=${PROJ_ROOT}/random123-install/include/ \
 ../RandBLAS

RandLAPACK

The code is at https://github.com/YibaiMeng/RandLAPACK
Version: HEAD (e1b273bc5417fedd2f7f4d4652dfbe0ecd0960bd) @ rsvd-hamr

cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=/usr/bin/gcc-10 \
-DCMAKE_CXX_COMPILER=/usr/bin/g++-10 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-11.7/bin/nvcc \
-Dlapackpp_DIR=${PROJ_ROOT}/lapackpp-install/lib/lapackpp/ \
-DRandBLAS_DIR=${PROJ_ROOT}/RandBLAS-install/lib/cmake/ \
-DHamr_DIR=${PROJ_ROOT}/HAMR-install \
-DCMAKE_BINARY_DIR=${PROJ_ROOT}/RandLAPACK-build \
-DCMAKE_INSTALL_PREFIX=${PROJ_ROOT}/RandLAPACK-install \
../RandLAPACK

Reproduce the issue

Run

./bin/RandLAPACK_benchmark --gtest_filter=RsvdSpeed.GemmMicrobench

in the build dir. This test just runs gemm through blaspp on both the host and the device; you can use CUDA_VISIBLE_DEVICES to control which CUDA device is used, and then you can see the issue. An 8000 by 8000 gemm is about 2 * 8000^3 ≈ 1e12 floating point operations, so it should take at most about one second on the CPU and about 0.1 seconds on the GPU, excluding memory transfer and allocation, and the CPU gemm should run with multiple threads. However, it currently takes 360 seconds to run the CPU gemm and 2 seconds to run the GPU gemm.

Misc

Why does everything use nvcc? Because HAMR with GPU support needs nvcc to compile, even if no part of the code uses the GPU. (Please correct me if I'm wrong.)

Hi Yibai,

There are many things here, so this will be a somewhat scattered reply.

When using OpenMP for threading, one should set the OMP_NUM_THREADS environment variable to control the number of threads. Are you doing that?
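A quick way to check what the process actually sees is a small diagnostic like the following (a sketch, assuming the MKL and OpenMP setup described above):

#include <cstdio>
#include <omp.h>
#include <mkl.h>

int main()
{
    // Honors OMP_NUM_THREADS.
    std::printf("omp_get_max_threads() = %d\n", omp_get_max_threads());

    // Honors MKL_NUM_THREADS / OMP_NUM_THREADS; reports 1 if the sequential MKL library is linked.
    std::printf("mkl_get_max_threads() = %d\n", mkl_get_max_threads());
    return 0;
}

If the MKL count is 1 while the OpenMP count is 16, the sequential MKL library is likely being linked instead of the threaded one.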

CUDA_VISIBLE_DEVICES is probably not what you want. When there are multiple GPUs, you call cudaSetDevice and HAMR uses that device; this is how management across GPUs is implemented. If you do not have multiple GPUs, or are not trying to use multiple GPUs, then you do not need to do anything.
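For example, to target the second GPU (a minimal sketch):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Select GPU 1; allocations and kernels issued from this thread now target it.
    cudaSetDevice(1);

    int dev = -1;
    cudaGetDevice(&dev);
    std::printf("active CUDA device: %d\n", dev);
    return 0;
}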

The cudaDeviceSynchronize calls are probably not what you want since they stall the entire device. Use cudaStreamSynchronize, and only where you need it. Stream events are a better way to implement timers. See https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/
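A minimal event-based timing sketch along those lines (it assumes the work to be timed is enqueued on a stream named stream):

#include <cuda_runtime.h>

// Returns the elapsed time, in milliseconds, of whatever is enqueued on `stream`
// between the two event records.
float time_on_stream(cudaStream_t stream)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    // ... enqueue the gemm (or other device work) on `stream` here ...
    cudaEventRecord(stop, stream);

    cudaEventSynchronize(stop);  // waits for this stream's work, not the whole device

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}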

hamr only uses CUDA for GPU code, so if you're not using CUDA, simply turn it off in the build and it will not require nvcc.

A few suggestions about hamr usage with CUDA (a short sketch follows the list):

  1. Always specify a CUDA stream with both the CPU and CUDA allocators; the default CUDA stream synchronizes all streams and needs to be avoided. Currently the stream must be obtained from BLAS++/LAPACK++ because they do not have an API to set the stream. There is an open issue about this.
  2. Prefer to use buffer_allocator::cuda_host for CPU allocations that will be moved to/from the GPU. If you do not, CUDA will internally allocate a buffer and do an extra memcpy before moving the data.
  3. Prefer to use buffer_allocator::cuda_async for device allocations, because buffer_allocator::cuda synchronizes/stalls the device. Note that when you specify the stream, all operations are asynchronous, and you will have to add synchronization points before dereferencing pointers on the CPU.
  4. Prefer to use buffer_allocator::malloc over buffer_allocator::cpp for arrays of POD.
  5. When synchronizing, you only need to synchronize per stream (not per object). For example, when any number of hamr buffers and BLAS++/LAPACK++ use the same stream, synchronize that stream only once and all objects are synchronized.
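A rough sketch of points 1 and 5 (this assumes the CUDA-backend blas::Queue exposes its stream through a stream() accessor, which may differ across blaspp versions; hamr buffers would be constructed with that same stream):

#include <blas.hh>
#include <cstdint>
#include <cuda_runtime.h>

void sync_once(int64_t n, const double* dA, const double* dB, double* dC)
{
    blas::Queue queue(0);                  // one queue, i.e. one CUDA stream, for everything
    cudaStream_t stream = queue.stream();  // assumption: accessor name in the CUDA backend

    // Enqueue all device work on that single stream; the gemm goes through the queue,
    // and any hamr buffers should be given the same stream.
    blas::gemm(blas::Layout::ColMajor, blas::Op::NoTrans, blas::Op::NoTrans,
               n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n, queue);

    // One synchronization point for everything sharing the stream,
    // placed right before results are read on the CPU.
    cudaStreamSynchronize(stream);
}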

nvcc is a compiler wrapper; gcc (or whatever host compiler is in use) does the heavy lifting. If you think the issue is due to suboptimal compiler optimizations, then verify the flags used during compilation with make VERBOSE=1. Ideally the compiler flags are set by RandLAPACK and passed to hamr; however, I am not sure that is what is happening here.

Since the performance is different in the BLAS++ tests, I assume you have looked at the BLAS++ source code to see if they are doing something differently. If not, that might be insightful.

Hope this helps!

Found the problem; it is not RandLAPACK's fault. See this blaspp issue.