`blaspp` goes into single-thread mode when used
Closed this issue · 2 comments
I am adding GPU support to RandLAPACK using HAMR. My code is here. It appears both the CPU and GPU `gemm` are quite slow: the CPU `gemm` goes into single-thread mode, and the GPU `gemm` is more than 10 times slower than when cuBLAS is called directly. Both are invoked through `blas::gemm` from `blaspp`. Under the hood, they use MKL and cuBLAS, respectively. I ran the tests provided by `blaspp`, and both the host and device `gemm` perform as expected there: the host (MKL) `gemm` scales to 16 threads, and the device `gemm` has roughly the same performance as calling cuBLAS directly.
Could it be due to the build changes I made, or to some other issue? Everything is built with `-O3`, and no debug flags are enabled. Below are detailed instructions for reproducing the issue.
Pre-built dependencies
- CMake: installed the prebuilt binaries from its website: https://github.com/Kitware/CMake/releases/download/v3.25.1/cmake-3.25.1-linux-x86_64.tar.gz. Added `/bin` to `PATH`.
- MKL: installed using the offline installer from Intel's website: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-download.html?operatingsystem=linux&distributions=offline. Version: 2023.0.0. We use the script Intel provides, `source /path_to_mkl_dir/oneapi/mkl/2023.0.0/env/vars.sh`, to set MKL-related environment variables (`LD_LIBRARY_PATH`, `CPATH`, etc.).
Source dependencies
`PROJ_ROOT` is set to the parent folder containing everything:
PROJ_ROOT
HAMR
HAMR-build
HAMR-install
RandBLAS
RandBLAS-build
RandBLAS-install
RandLAPACK
RandLAPACK-build
RandLAPACK-install
blaspp
blaspp-build
blaspp-install
lapackpp
lapackpp-build
lapackpp-install
randlapackbuild
random123
Random123-install
[mengyibai@freddie:~/BallisticLA/RandLAPACK-build]$ echo $LD_LIBRARY_PATH
/data/mengyibai/intel/oneapi/mkl/2023.0.0/lib/intel64
[mengyibai@freddie:~/BallisticLA/RandLAPACK-build]$ echo $LIBRARY_PATH
/data/mengyibai/intel/oneapi/mkl/2023.0.0/lib/intel64
[mengyibai@freddie:~/BallisticLA/RandLAPACK-build]$ echo $CPATH
/data/mengyibai/intel/oneapi/mkl/2023.0.0/include
blaspp
Version: f8b5b04c9c25069 @ master
cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=/usr/bin/gcc-10 \
-DCMAKE_CXX_COMPILER=/usr/bin/g++-10 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-11.7/bin/nvcc \
-DCMAKE_INSTALL_PREFIX=${PROJ_ROOT}/blaspp-install \
-DCMAKE_BINARY_DIR=${PROJ_ROOT}/blaspp-build \
-Dblas_threaded=yes \
-Dgpu_backend=cuda \
-Dbuild_tests=OFF ../blaspp
lapackpp
Version: 17edf95ec1d11787b61d9a016f202e6b3d772cff @ master
cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=/usr/bin/gcc-10 \
-DCMAKE_CXX_COMPILER=/usr/bin/g++-10 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-11.7/bin/nvcc \
-DCMAKE_INSTALL_PREFIX=${PROJ_ROOT}/lapackpp-install \
-DCMAKE_BINARY_DIR=${PROJ_ROOT}/lapackpp-build \
-Dgpu_backend=cuda \
-Dblaspp_DIR=${PROJ_ROOT}/blaspp-install/lib/blaspp \
-Dbuild_tests=OFF ../lapackpp
random123
Use HEAD @ master
make prefix=${PROJ_ROOT}/random123-install install-include
HAMR
Use HEAD (bba3ee0b40de4ab437e7e4d2275c39f6b4397b67) @ master
cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=/usr/bin/gcc-10 \
-DCMAKE_CXX_COMPILER=/usr/bin/g++-10 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-11.7/bin/nvcc \
-DCMAKE_INSTALL_PREFIX=${PROJ_ROOT}/HAMR-install \
-DCMAKE_BINARY_DIR=${PROJ_ROOT}/HAMR-build \
-DHAMR_ENABLE_CUDA=True \
../HAMR
RandBLAS
Version: 1e8b475d8f7ec9021ea7998b9f4fd58e41bfe21e @ master
cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=/usr/bin/gcc-10 \
-DCMAKE_CXX_COMPILER=/usr/bin/g++-10 \
-DCMAKE_INSTALL_PREFIX=${PROJ_ROOT}/RandBLAS-install \
-DCMAKE_BINARY_DIR=${PROJ_ROOT}/RandBLAS-build \
-Dblaspp_DIR=${PROJ_ROOT}/blaspp-install/lib/blaspp/ \
-DRandom123_DIR=${PROJ_ROOT}/random123-install/include/ \
../RandBLAS
RandLAPACK
The code is on https://github.com/YibaiMeng/RandLAPACK
Version: head e1b273bc5417fedd2f7f4d4652dfbe0ecd0960bd @ rsvd-hamr
cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=/usr/bin/gcc-10 \
-DCMAKE_CXX_COMPILER=/usr/bin/g++-10 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-11.7/bin/nvcc \
-Dlapackpp_DIR=${PROJ_ROOT}/lapackpp-install/lib/lapackpp/ \
-DRandBLAS_DIR=${PROJ_ROOT}/RandBLAS-install/lib/cmake/ \
-DHamr_DIR=${PROJ_ROOT}/HAMR-install \
-DCMAKE_BINARY_DIR=${PROJ_ROOT}/RandLAPACK-build \
-DCMAKE_INSTALL_PREFIX=${PROJ_ROOT}/RandLAPACK-install \
../RandLAPACK
Reproduce the issue
Run `./bin/RandLAPACK_benchmark --gtest_filter=RsvdSpeed.GemmMicrobench` in the build dir. This test just does `gemm` with `blaspp` on both host and device. You can use `CUDA_VISIBLE_DEVICES` to control which CUDA device is used. Then you can see the issue: an 8000 by 8000 `gemm` should take at most one second on the CPU and about 0.1 seconds on the GPU, excluding memory transfer and allocation, and the CPU `gemm` should execute with multiple threads. However, it currently takes 360 seconds to run the CPU `gemm` and 2 seconds to run the GPU `gemm`.
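For reference, the microbenchmark essentially boils down to two calls like the following sketch (not the exact benchmark code; `m`, `n`, `k`, the pointers, and the `blas::Queue` constructor arguments are illustrative, and the Queue constructor signature varies across blaspp versions):

```cpp
// Host gemm: dispatches to the CPU BLAS (MKL here).
blas::gemm(blas::Layout::ColMajor, blas::Op::NoTrans, blas::Op::NoTrans,
           m, n, k, 1.0, A, m, B, k, 0.0, C, m);

// Device gemm: same overload name plus a queue; dispatches to cuBLAS.
// A_d, B_d, C_d are device pointers.
blas::Queue queue(0);  // device 0; check your blaspp version for the signature
blas::gemm(blas::Layout::ColMajor, blas::Op::NoTrans, blas::Op::NoTrans,
           m, n, k, 1.0, A_d, m, B_d, k, 0.0, C_d, m, queue);
queue.sync();  // wait for the device gemm before timing or reading C_d
```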
Misc
Why does everything use nvcc? Because HAMR with GPU support needs nvcc to compile, even if no part of the code uses the GPU. (Please correct me if I'm wrong.)
Hi Yibai,
There are many things here, so this will be a somewhat scattered reply.
When using OpenMP for threading, one should set `OMP_NUM_THREADS` to control the number of threads. Are you doing that?
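For example (assuming a bash-like shell; note that `MKL_NUM_THREADS` is MKL's own variable and, if set, takes precedence over `OMP_NUM_THREADS` for MKL calls):

```shell
# Pin the OpenMP/MKL thread count before launching the benchmark.
export OMP_NUM_THREADS=16
export MKL_NUM_THREADS=16
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```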
`CUDA_VISIBLE_DEVICES` is probably not what you want. When there are multiple GPUs, you call `cudaSetDevice` and HAMR uses that device. This is how management between GPUs is implemented. If you do not have multiple GPUs, or are not trying to use multiple GPUs, then you do not need to do anything.
The `cudaDeviceSynchronize` calls are probably not what you want, since they stall the entire device. Use `cudaStreamSynchronize`, and only where you need it. Stream events are a better way to implement timers. See https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/
hamr only uses CUDA for GPU code, so if you're not using CUDA, simply turn it off in the build and it will not require nvcc.
A few suggestions about hamr usage with CUDA:
- Always specify a CUDA stream with both the CPU and CUDA allocators; the default CUDA stream synchronizes all streams and needs to be avoided. Currently the stream must be obtained from BLAS++/LAPACK++ because they do not have an API to set the stream. There is an open issue about this.
- Prefer `buffer_allocator::cuda_host` for CPU allocations that will be moved to/from the GPU. If you do not, CUDA will internally allocate a buffer and do an extra memcpy before moving the data.
- Prefer `buffer_allocator::cuda_async` for device allocations, because `buffer_allocator::cuda` synchronizes/stalls the device. Note, when you specify the stream, all operations are asynchronous, and you will have to add synchronization points before dereferencing pointers on the CPU.
- Prefer `buffer_allocator::malloc` over `buffer_allocator::cpp` for arrays of POD.
- When synchronizing, you only need to synchronize per stream, not per object. For example, when any number of hamr buffers and BLAS++/LAPACK++ use the same stream, synchronize that stream only once and all objects are synchronized.
nvcc is a compiler wrapper; gcc (or whatever host compiler is in use) is what does the heavy lifting. If you think the issue is due to suboptimal compiler optimizations, then verify the flags used during compilation with `make VERBOSE=1`. Ideally the compiler flags are set by RandLAPACK and passed to hamr; however, I am not sure if that is what's happening here.
Since the performance is different in the BLAS++ tests, I assume you have looked at their source code to see if they are doing something differently. If not, that might be insightful.
Hope this helps!
Found the problem; it is not RandLAPACK's fault. See this blaspp issue.