eth-cscs/COSMA

COSMA cublas crash after job finished

yaoyi92 opened this issue · 3 comments

Dear COSMA developers,

I have been working at a GPU hackathon, testing COSMA (commit e034ddd) through the ScaLAPACK API with the FHIaims code on the Ascent cluster (the training cluster for Summit). With the COSMA GPU version, I got a 7x speedup for pzgemm (matrix size 3312 * 3312), comparing 36 Power CPU cores + 6 V100 GPUs against 36 Power CPU cores (36 MPI ranks with OMP_NUM_THREADS=1), which is great. However, I have also seen GPU errors after the job finishes, and they only appear when I link my code against COSMA. @kabicm suggests it could be something happening during the cleanup stage. Unfortunately, I no longer have access to the cluster, so I am not able to provide a minimal example that reproduces the error. We will try to get access to the Summit cluster later.

This is the error I saw.

error: GPU API call : invalid argument
terminate called after throwing an instance of 'std::runtime_error'
  what():  GPU ERROR

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
error: GPU API call : invalid argument
terminate called after throwing an instance of 'std::runtime_error'
  what():  GPU ERROR

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
error: GPU API call : invalid argument
terminate called after throwing an instance of 'std::runtime_error'
  what():  GPU ERROR

Here are my build, link, and job submission scripts. First, the script used to build COSMA:

set -e
module purge
module load gcc/7.4.0
module load essl
module load cuda/10.1.243
module load spectrum-mpi
module load netlib-lapack
module load netlib-scalapack
module load cmake


export CUDA_PATH=$CUDA_DIR

export CC=mpicc
export CXX=mpicxx
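# GPU (cuBLAS/Tiled-MM) backend, with ScaLAPACK taken from the loaded modules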
cmake -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=CUSTOM -DCMAKE_INSTALL_PREFIX=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/cosma/install_yy ..
make VERBOSE=0 -j 16
make install

The CMake settings used to build FHIaims and link it against the installed COSMA:

set(CMAKE_C_COMPILER "mpicc" CACHE STRING "")
set(CMAKE_C_FLAGS "-O2 -g -fbacktrace -mcpu=power9 -funwind-tables -fopenmp" CACHE STRING "")

set(CMAKE_Fortran_COMPILER "mpif90" CACHE STRING "")
set(CMAKE_Fortran_FLAGS "-O2 -g -fbacktrace -mcpu=power9 -ffree-line-length-none -funwind-tables -fopenmp" CACHE STRING "")
set(Fortran_MIN_FLAGS "-O0 -g -fbacktrace -ffree-line-length-none -funwind-tables -fopenmp" CACHE STRING "")

set(USE_CUDA ON CACHE BOOL "")
set(CMAKE_CUDA_FLAGS "-O2 -g -DAdd_ -arch=sm_70" CACHE STRING "")

set(USE_MPI ON CACHE BOOL "")
set(USE_SCALAPACK ON CACHE BOOL "")
set(USE_LIBXC OFF CACHE BOOL "")
set(USE_iPI OFF CACHE BOOL "")
set(USE_SPGLIB OFF CACHE BOOL "")

# point the build at the installed COSMA headers/libraries, plus ESSL, ScaLAPACK, LAPACK and CUDA
SET(INC_PATHS "/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/cosma/install_yy/include" CACHE STRING "")

set(LIB_PATHS "/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/cosma/install_yy/lib64 $ENV{OLCF_ESSL_ROOT}/lib64 $ENV{OLCF_CUDA_ROOT}/lib64" CACHE STRING "")
set(LIBS "cosma_pxgemm cosma costa_scalapack costa Tiled-MM scalapack essl lapack cublas cudart" CACHE STRING "")

The LSF job submission script:

#!/bin/bash

#BSUB -P GEN157
#BSUB -W 2:00
#BSUB -nnodes 1
#BSUB -alloc_flags gpumps
#BSUB -J aims-gw
#BSUB -o aims.%J
#BSUB -N yy244@duke.edu

module purge
module load gcc/7.4.0 spectrum-mpi/10.3.1.2-20200121  cuda/10.1.243 essl/6.1.0-2 netlib-lapack/3.8.0 netlib-scalapack/2.0.2
module load nsight-systems/2021.2.1.58

# COSMA GPU settings: disable pinned host memory, use a single GPU stream, cap tile sizes at 500
export COSMA_GPU_MEMORY_PINNING=OFF
export COSMA_GPU_STREAMS=1
export COSMA_GPU_MAX_TILE_M=500
export COSMA_GPU_MAX_TILE_N=500
export COSMA_GPU_MAX_TILE_K=500


#export LD_LIBRARY_PATH=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/cosma/install_yy/lib64

#bin=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/FHIaims_gw_gpu/build_gcc_cuda10_2/aims.210427.scalapack.mpi.x
#bin=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/FHIaims_gw_gpu/build_gcc_cuda10_3/aims.210427.scalapack.mpi.x
bin=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/FHIaims_gw_gpu/build_gcc_cuda10_cosma/aims.210427.scalapack.mpi.x

export OMP_NUM_THREADS=1

ulimit -s unlimited

jsrun -n 6 -a 6 -c 6 -g 1 -r 6 $bin > aims.out

Best wishes,
Yi

Dear Yi,

Thanks a lot for reporting this issue and for your feedback. We have found what was causing this problem and will fix it soon.

Cheers,
Marko

Dear Yi,

This is fixed starting from version v2.5.0. It was caused by the GPU devices not being set properly.
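For context, a minimal sketch of the kind of per-rank device selection involved (an illustration only, not the actual COSMA code):

#include <mpi.h>
#include <cuda_runtime.h>

// Bind each MPI rank to a GPU based on its node-local rank, so that later
// CUDA/cuBLAS calls (including those made during cleanup) target a valid device.
void select_gpu_for_this_rank() {
    MPI_Comm local_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &local_comm);
    int local_rank = 0;
    MPI_Comm_rank(local_comm, &local_rank);
    MPI_Comm_free(&local_comm);

    int n_devices = 0;
    cudaGetDeviceCount(&n_devices);
    if (n_devices > 0) {
        cudaSetDevice(local_rank % n_devices);  // round-robin ranks over the visible GPUs
    }
}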

Let us know if you observe any other problems.

I will now close this issue, but feel free to reopen it if the problem still persists.

Cheers,
Marko

Hi @kabicm,
I was able to redo the test, and the problem still exists with v2.5.0 and the master branch. Does COSMA use a wrapper over MPI_Finalize()? I noticed a similar issue on Summit with another code that wraps MPI_Finalize: LLNL/Caliper#392.
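For clarity, this is the kind of interception I mean (a generic PMPI sketch, not code taken from COSMA):

#include <mpi.h>

// A library can override MPI_Finalize via the PMPI profiling interface,
// run its own cleanup first, and then forward to the real implementation.
extern "C" int MPI_Finalize(void) {
    // e.g. release GPU buffers and cuBLAS handles here,
    // while the CUDA runtime is still guaranteed to be usable
    // library_cleanup();   // hypothetical cleanup hook
    return PMPI_Finalize();
}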

If that's the case, my question becomes: Is it possible to manually finalize COSMA?

A related question: after I call COSMA, does it keep holding GPU memory once the gemm calls return? Is it possible to control that GPU memory? I am looking for something like initializing and finalizing a COSMA environment around a certain code region, so that the GPU memory is freed once the region is left.
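To make the request concrete, something along these lines would be ideal (the function names below are hypothetical and do not exist in COSMA; they only illustrate the scoped usage I have in mind):

#include <mpi.h>

// Hypothetical hooks, for illustration only:
void cosma_gpu_acquire(MPI_Comm comm);  // hypothetical: allocate GPU buffers and handles
void cosma_gpu_release();               // hypothetical: free them again

void gw_section(MPI_Comm comm) {
    cosma_gpu_acquire(comm);
    // ... region with many pzgemm calls routed through COSMA ...
    cosma_gpu_release();  // GPU memory returned here, long before MPI_Finalize
}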

Best wishes,
Yi