ecrc/kblas-gpu

Error while testing library

Closed this issue · 6 comments

Hi

After installing kblas for arch 62 with CUDA 10.2 and running make in testing, I ran "./test_dtrmm -N 200:512" in /testing/bin, which gave me the following error:

side L, uplo L, trans N, diag N, db 512
    M     N     kblasTRMM_REC GF/s (ms)  kblasTRMM_CU GF/s (ms)  cublasTRMM GF/s (ms)  SP_REC   SP_CU   Error
====================================================================
  200   512   CUDA runtime error: no kernel image is available for execution on the device (209) in Xtrmm at blas_l3/Xtrmm.cu:479
CUBLAS error: execution failed (13) in test_trmm at blas_l3/test_trmm.ch:202

Am I doing something wrong? I wish to use kblas for doing batched svd. How do I use this library? I did get some warnings while making kblas, could that be the reason for this error?

Hi
This usually happens when the Gencode parameters used to compile the CUDA kernels don't match your GPU architecture.
I attached a video that walks through installing kblas-gpu from scratch on a Pascal-architecture GPU.
You can follow the steps and reach out to us if you still see such errors.

https://www.youtube.com/watch?v=jAWdo39M-xk

And here is a Google Doc that lists the locations of the required files:
https://docs.google.com/document/d/1UF-53VoZOz8uBhdC8ob9uwW6NmuqnAJwGqWuu6Rn584/edit?usp=sharing

Hope it helps.

To use batched-svd, I think you can check ./testing/batch_triangular/test_Xsvd_full_batch.cpp

Thank you very much @hongyx11. I followed the tutorial and was able to get the library running. I tried running test_dtrmm and it ran successfully. However, I am getting memory errors when I try to run test_dsvd_full_batch even for very small sizes of matrices. Do you know why this might be happening?

./test_dsvd_full_batch -N 100:512
batchCount    M     N     kblasSVD GF/s (ms)       Error
    4     100     512   1542792336   !!!! malloc_cpu failed for: h_Au

./test_dsvd_full_batch -N 10:10
batchCount    M     N     kblasSVD GF/s (ms)       Error
    4      10      10   714802320   !!!! malloc_cpu failed for: h_Au

./test_ssvd_full_batch -N 200:200
batchCount    M     N     kblasSVD GF/s (ms)       Error
    4     200     200   -1657445232   CUDA runtime error: out of memory (2) in test_Xsvd_full_batch at batch_triangular/test_Xsvd_full_batch.cpp:227
CUDA runtime error: out of memory (2) in test_Xsvd_full_batch at batch_triangular/test_Xsvd_full_batch.cpp:228
CUDA runtime error: out of memory (2) in Xset_pointer_4_core at batch_triangular/Xhelper_funcs.cuh:335
gpuKblasAssert: CUDA error batch_triangular/test_Xsvd_full_batch.cpp 257

./test_ssvd_full_batch -N 20:20
batchCount    M     N     kblasSVD GF/s (ms)       Error
    4      20      20   1326126224   !!!! malloc_cpu failed for: h_Au

./test_ssvd_full_batch -N 2:2
batchCount    M     N     kblasSVD GF/s (ms)       Error
    4       2       2   1879909520   !!!! malloc_cpu failed for: h_Au

./test_ssvd_full_batch -N 1:1
batchCount    M     N     kblasSVD GF/s (ms)       Error
    4       1       1   1176986768   CUDA runtime error: out of memory (2) in test_Xsvd_full_batch at batch_triangular/test_Xsvd_full_batch.cpp:228
CUDA runtime error: out of memory (2) in Xset_pointer_4_core at batch_triangular/Xhelper_funcs.cuh:335
gpuKblasAssert: CUDA error batch_triangular/test_Xsvd_full_batch.cpp 257
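
For scale, the input matrices in these runs are small. The sketch below estimates the raw storage for one batch, assuming dense storage of batchCount matrices of size M x N and ignoring any workspace the SVD routine or the test allocates (that workspace size is not known from this thread). The helper name batch_bytes is hypothetical and is not part of KBLAS.

```python
def batch_bytes(batch_count: int, m: int, n: int, elem_size: int) -> int:
    """Raw bytes needed to store batch_count dense m x n matrices."""
    return batch_count * m * n * elem_size


if __name__ == "__main__":
    # (label, batchCount, M, N, bytes per element) taken from the runs above;
    # 8 bytes for double precision (dsvd), 4 bytes for single precision (ssvd).
    runs = [
        ("dsvd 100x512", 4, 100, 512, 8),
        ("dsvd 10x10", 4, 10, 10, 8),
        ("ssvd 200x200", 4, 200, 200, 4),
    ]
    for label, batch_count, m, n, elem_size in runs:
        size = batch_bytes(batch_count, m, n, elem_size)
        print(f"{label}: {size} bytes ({size / 2**20:.3f} MiB)")
```

Even the largest of these inputs is under 2 MiB, so the input data alone is unlikely to exhaust host or device memory; the allocation sizes actually requested would need to be checked in the test source.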

Hi @rahulwankhede, what's the size of your GPU memory? Could you run nvidia-smi?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  On   | 00000000:01:00.0  On |                  N/A |
| 45%   23C    P8    N/A /  75W |    605MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1191      G   /usr/lib/xorg/Xorg                            28MiB |
|    0      1344      G   /usr/bin/gnome-shell                          47MiB |
|    0      1597      G   /usr/lib/xorg/Xorg                           204MiB |
|    0      1750      G   /usr/bin/gnome-shell                         252MiB |
|    0      2323      G   ...AAAAAAAAAAAAAAgAAAAAAAAA --shared-files    53MiB |
|    0     19664      G   /opt/teamviewer/tv_bin/TeamViewer             12MiB |
+-----------------------------------------------------------------------------+

My GPU memory is 4 GB. If it might be relevant, my arch_sm is 61 (I mistakenly said 62 in my original post and think installing for 62 was one of the reasons I was not able to run kblas). I have done the installation again this time for the correct sm. I did it without using or loading a module since I already have CUDA 10.2 as my default installation. GPU model GTX 1050 Ti. GCC version 7.5.0
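
On the architecture mix-up: nvcc selects its targets through flags of the form "-gencode arch=compute_XY,code=sm_XY", and code built only for sm_62 contains no kernel image that an sm_61 device can run, which is consistent with the "no kernel image is available" error in the original post. A minimal sketch of building that flag from a compute capability follows. The helper name gencode_flag is hypothetical (not part of KBLAS), and where the flag belongs in the KBLAS build configuration is not shown here.

```python
def gencode_flag(major: int, minor: int) -> str:
    """Return the nvcc -gencode flag for a single compute capability."""
    cc = f"{major}{minor}"
    return f"-gencode arch=compute_{cc},code=sm_{cc}"


if __name__ == "__main__":
    # The GTX 1050 Ti has compute capability 6.1 (sm_61).
    print(gencode_flag(6, 1))
```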

Hi @rahulwankhede ,

I don't have much insight into the memory usage of KBLAS. But as you can see, there is also a CPU malloc error. Maybe you also need to look into this.
We tested your input on a Pascal GPU with 8 GB of memory and it passed without error.
I also noticed that there are visualization jobs running on your GPU. Another option is to kill the processes on the GPU and try again. You are interested in the performance results of KBLAS, right? These processes will lower the performance. If not, you can use a few simple sequential SVDs instead.

Best,
Yuxi
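
One way to check how much GPU memory is free before rerunning the test is nvidia-smi's query mode. The sketch below parses the CSV output of "nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits", which prints one "used, total" line per GPU in MiB. The function names are hypothetical, and query_gpu_memory is defined but not called here so the example also runs on a machine without an NVIDIA driver.

```python
import subprocess
from typing import List, Tuple

QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text: str) -> List[Tuple[int, int, int]]:
    """Parse 'used, total' lines (MiB) into (used, total, free) tuples."""
    result = []
    for line in text.strip().splitlines():
        used_str, total_str = (field.strip() for field in line.split(","))
        used, total = int(used_str), int(total_str)
        result.append((used, total, total - used))
    return result


def query_gpu_memory() -> List[Tuple[int, int, int]]:
    """Run nvidia-smi and parse its output; requires an NVIDIA driver."""
    out = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_memory_csv(out.stdout)


if __name__ == "__main__":
    # Sample line built from the nvidia-smi table above: 605 MiB used of 4036 MiB.
    for used, total, free in parse_memory_csv("605, 4036\n"):
        print(f"used={used} MiB total={total} MiB free={free} MiB")
```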

Thanks @hongyx11. There seems to be something wrong with my installation. I'll try installing on a different GPU with more memory and see if it works. Thanks a lot for the help. Cheers!