eth-cscs/COSMA

COSMA cublas crash after job finished

yaoyi92 opened this issue · 3 comments

Dear COSMA developers,

I have been working at a GPU hackathon, testing COSMA (commit e034ddd) through the ScaLAPACK API with the FHIaims code on the Ascent cluster (the training cluster for Summit). With the COSMA GPU version, I got a 7x speedup for pzgemm (matrix size 3312 * 3312), comparing 36 Power CPU cores + 6 V100 GPUs against 36 Power CPU cores (36 MPI ranks with OMP_NUM_THREADS=1), which is great. However, I have also seen GPU errors after the job finishes, and they only appear when I link my code against COSMA. @kabicm suggests it could be something happening during the cleanup stage. Unfortunately, I no longer have access to the cluster, so I am not able to provide a minimal example that reproduces the error. We will try to get access to the Summit cluster later.

This is the error I saw.

error: GPU API call : invalid argument
terminate called after throwing an instance of 'std::runtime_error'
  what():  GPU ERROR

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
error: GPU API call : invalid argument
terminate called after throwing an instance of 'std::runtime_error'
  what():  GPU ERROR

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
error: GPU API call : invalid argument
terminate called after throwing an instance of 'std::runtime_error'
  what():  GPU ERROR

Here are my build, link, and job submission scripts. First, the script used to build COSMA:

set -e
module purge
module load gcc/7.4.0
module load essl
module load cuda/10.1.243
module load spectrum-mpi
module load netlib-lapack
module load netlib-scalapack
module load cmake


export CUDA_PATH=$CUDA_DIR

export CC=mpicc
export CXX=mpicxx
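# GPU (cuBLAS/Tiled-MM) backend, with ScaLAPACK taken from the loaded modules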
cmake -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=CUSTOM -DCMAKE_INSTALL_PREFIX=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/cosma/install_yy ..
make VERBOSE=0 -j 16
make install

The CMake settings used to build FHIaims and link it against the installed COSMA:

set(CMAKE_C_COMPILER "mpicc" CACHE STRING "")
set(CMAKE_C_FLAGS "-O2 -g -fbacktrace -mcpu=power9 -funwind-tables -fopenmp" CACHE STRING "")

set(CMAKE_Fortran_COMPILER "mpif90" CACHE STRING "")
set(CMAKE_Fortran_FLAGS "-O2 -g -fbacktrace -mcpu=power9 -ffree-line-length-none -funwind-tables -fopenmp" CACHE STRING "")
set(Fortran_MIN_FLAGS "-O0 -g -fbacktrace -ffree-line-length-none -funwind-tables -fopenmp" CACHE STRING "")

set(USE_CUDA ON CACHE BOOL "")
set(CMAKE_CUDA_FLAGS "-O2 -g -DAdd_ -arch=sm_70" CACHE STRING "")

set(USE_MPI ON CACHE BOOL "")
set(USE_SCALAPACK ON CACHE BOOL "")
set(USE_LIBXC OFF CACHE BOOL "")
set(USE_iPI OFF CACHE BOOL "")
set(USE_SPGLIB OFF CACHE BOOL "")

# point the build at the installed COSMA headers/libraries, plus ESSL, ScaLAPACK, LAPACK and CUDA
SET(INC_PATHS "/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/cosma/install_yy/include" CACHE STRING "")

set(LIB_PATHS "/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/cosma/install_yy/lib64 $ENV{OLCF_ESSL_ROOT}/lib64 $ENV{OLCF_CUDA_ROOT}/lib64" CACHE STRING "")
set(LIBS "cosma_pxgemm cosma costa_scalapack costa Tiled-MM scalapack essl lapack cublas cudart" CACHE STRING "")

The LSF job submission script:

#!/bin/bash

#BSUB -P GEN157
#BSUB -W 2:00
#BSUB -nnodes 1
#BSUB -alloc_flags gpumps
#BSUB -J aims-gw
#BSUB -o aims.%J
#BSUB -N yy244@duke.edu

module purge
module load gcc/7.4.0 spectrum-mpi/10.3.1.2-20200121  cuda/10.1.243 essl/6.1.0-2 netlib-lapack/3.8.0 netlib-scalapack/2.0.2
module load nsight-systems/2021.2.1.58

# COSMA GPU settings: disable pinned host memory, use a single GPU stream, cap tile sizes at 500
export COSMA_GPU_MEMORY_PINNING=OFF
export COSMA_GPU_STREAMS=1
export COSMA_GPU_MAX_TILE_M=500
export COSMA_GPU_MAX_TILE_N=500
export COSMA_GPU_MAX_TILE_K=500


#export LD_LIBRARY_PATH=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/cosma/install_yy/lib64

#bin=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/FHIaims_gw_gpu/build_gcc_cuda10_2/aims.210427.scalapack.mpi.x
#bin=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/FHIaims_gw_gpu/build_gcc_cuda10_3/aims.210427.scalapack.mpi.x
bin=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/FHIaims_gw_gpu/build_gcc_cuda10_cosma/aims.210427.scalapack.mpi.x

export OMP_NUM_THREADS=1

ulimit -s unlimited

jsrun -n 6 -a 6 -c 6 -g 1 -r 6 $bin > aims.out

Best wishes,
Yi

Dear Yi,

Thanks a lot for reporting this issue and for your feedback. We have found what was causing this problem and will fix it soon.

Cheers,
Marko

Dear Yi,

This is fixed starting from version v2.5.0. It was caused by the GPU devices not being set properly.
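For context, a minimal sketch of the kind of per-rank device selection involved (an illustration only, not the actual COSMA code):

#include <mpi.h>
#include <cuda_runtime.h>

// Bind each MPI rank to a GPU based on its node-local rank, so that later
// CUDA/cuBLAS calls (including those made during cleanup) target a valid device.
void select_gpu_for_this_rank() {
    MPI_Comm local_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &local_comm);
    int local_rank = 0;
    MPI_Comm_rank(local_comm, &local_rank);
    MPI_Comm_free(&local_comm);

    int n_devices = 0;
    cudaGetDeviceCount(&n_devices);
    if (n_devices > 0) {
        cudaSetDevice(local_rank % n_devices);  // round-robin ranks over the visible GPUs
    }
}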

Let us know if you observe any other problems.

I will now close this issue, but feel free to reopen it if the problem still persists.

Cheers,
Marko

Hi @kabicm,
I was able to redo the test, and the problem still exists with v2.5.0 and the master branch. Does COSMA use a wrapper over MPI_Finalize()? I noticed a similar issue on Summit with another code that wraps MPI_Finalize: LLNL/Caliper#392.
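For clarity, this is the kind of interception I mean (a generic PMPI sketch, not code taken from COSMA):

#include <mpi.h>

// A library can override MPI_Finalize via the PMPI profiling interface,
// run its own cleanup first, and then forward to the real implementation.
extern "C" int MPI_Finalize(void) {
    // e.g. release GPU buffers and cuBLAS handles here,
    // while the CUDA runtime is still guaranteed to be usable
    // library_cleanup();   // hypothetical cleanup hook
    return PMPI_Finalize();
}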

If that's the case, my question becomes: Is it possible to manually finalize COSMA?

A related question: after I call COSMA, does it keep holding GPU memory once the gemm calls return? Is it possible to control that GPU memory? I am looking for something like initializing and finalizing a COSMA environment around a certain code region, so that the GPU memory is freed once the region is left.
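To make the request concrete, something along these lines would be ideal (the function names below are hypothetical and do not exist in COSMA; they only illustrate the scoped usage I have in mind):

#include <mpi.h>

// Hypothetical hooks, for illustration only:
void cosma_gpu_acquire(MPI_Comm comm);  // hypothetical: allocate GPU buffers and handles
void cosma_gpu_release();               // hypothetical: free them again

void gw_section(MPI_Comm comm) {
    cosma_gpu_acquire(comm);
    // ... region with many pzgemm calls routed through COSMA ...
    cosma_gpu_release();  // GPU memory returned here, long before MPI_Finalize
}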

Best wishes,
Yi