Theano/libgpuarray

PyGPU tests fail with cuLinkAddData: CUDA_ERROR_UNKNOWN

Closed this issue · 7 comments

Hello,

I'm on Ubuntu 16.04 with a GeForce GTX 1080 and the CUDA 8.0 runtime installed.
I also installed cuDNN 5.1.
My CUDA libraries (and cuDNN) are all on the linker path.

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1080"
  CUDA Driver Version / Runtime Version          9.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 8114 MBytes (8508145664 bytes)
  (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1734 MHz (1.73 GHz)
  Memory Clock rate:                             5005 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1080
Result = PASS
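
The summary line above reports a CUDA 9.0 driver alongside a CUDA 8.0 runtime. Here is a minimal sketch for pulling those two versions out of the summary line so they can be compared at a glance. `parse_device_query_summary` is a hypothetical helper, not part of CUDA or libgpuarray. A newer driver with an older runtime is normally a supported combination, so a difference is only a hint about where to look, not a diagnosis.

```python
import re


def parse_device_query_summary(line):
    """Extract (driver, runtime) version tuples from deviceQuery's summary line.

    Hypothetical helper for illustration; it only parses text.
    """
    driver = re.search(r"CUDA Driver Version = (\d+)\.(\d+)", line)
    runtime = re.search(r"CUDA Runtime Version = (\d+)\.(\d+)", line)
    if driver is None or runtime is None:
        raise ValueError("not a deviceQuery summary line")
    return (tuple(int(part) for part in driver.groups()),
            tuple(int(part) for part in runtime.groups()))


# Summary line copied from the deviceQuery output in this report.
summary = ("deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, "
           "CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1080")

driver_version, runtime_version = parse_device_query_summary(summary)
if driver_version != runtime_version:
    print("driver %d.%d differs from runtime %d.%d"
          % (driver_version + runtime_version))
```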

I compiled libgpuarray manually, following the instructions:

mkdir Build
cd Build
# you can pass -DCMAKE_INSTALL_PREFIX=/path/to/somewhere to install to an alternate location
cmake .. -DCMAKE_BUILD_TYPE=Release # or Debug if you are investigating a crash
make
make install
cd ..
# This must be done after libgpuarray is installed as per instructions above.
python setup.py build
python setup.py install --user

Four of the eleven C unit tests fail:

DEVICE=cuda0 make test
Running tests...
Test project /home/user/development/libgpuarray/build
      Start  1: test_types
 1/11 Test  #1: test_types .......................   Passed    0.00 sec
      Start  2: test_util
 2/11 Test  #2: test_util ........................   Passed    0.00 sec
      Start  3: test_util_integerfactoring
 3/11 Test  #3: test_util_integerfactoring .......   Passed    0.48 sec
      Start  4: test_reduction
 4/11 Test  #4: test_reduction ...................   Passed    6.95 sec
      Start  5: test_array
 5/11 Test  #5: test_array .......................   Passed    3.46 sec
      Start  6: test_blas
 6/11 Test  #6: test_blas ........................***Failed    4.13 sec
      Start  7: test_elemwise
 7/11 Test  #7: test_elemwise ....................***Failed   23.39 sec
      Start  8: test_error
 8/11 Test  #8: test_error .......................   Passed    0.00 sec
      Start  9: test_buffer
 9/11 Test  #9: test_buffer ......................   Passed    4.20 sec
      Start 10: test_buffer_collectives
10/11 Test #10: test_buffer_collectives ..........***Failed    1.02 sec
      Start 11: test_collectives
11/11 Test #11: test_collectives .................***Failed    1.05 sec

64% tests passed, 4 tests failed out of 11

Total Test time (real) =  44.69 sec

The following tests FAILED:
	  6 - test_blas (Failed)
	  7 - test_elemwise (Failed)
	 10 - test_buffer_collectives (Failed)
	 11 - test_collectives (Failed)
Errors while running CTest
Makefile:127: recipe for target 'test' failed
make: *** [test] Error 8
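
For comparing runs before and after a rebuild, the failing test names can be pulled out of the CTest log mechanically. This is a rough sketch that assumes the result-line layout shown above. `failed_tests` is a hypothetical helper, and the sample is a shortened excerpt of the log.

```python
import re

# Matches CTest result lines such as:
#   " 6/11 Test  #6: test_blas ........................***Failed    4.13 sec"
RESULT_LINE = re.compile(r"Test\s+#\d+:\s+(\S+)\s+\.+\s*(\*\*\*Failed|Passed)")


def failed_tests(ctest_output):
    """Return the names of tests whose result line is not 'Passed'."""
    return [match.group(1)
            for match in RESULT_LINE.finditer(ctest_output)
            if match.group(2) != "Passed"]


# Shortened excerpt of the CTest output from this report.
sample = """\
      Start  5: test_array
 5/11 Test  #5: test_array .......................   Passed    3.46 sec
      Start  6: test_blas
 6/11 Test  #6: test_blas ........................***Failed    4.13 sec
      Start  7: test_elemwise
 7/11 Test  #7: test_elemwise ....................***Failed   23.39 sec
      Start  8: test_error
 8/11 Test  #8: test_error .......................   Passed    0.00 sec
"""

for name in failed_tests(sample):
    print(name)
```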

My .theanorc file looks like this:

[global]
floatX = float32
device = cuda0
mode = FAST_RUN

[blas]
ldflags = -lopenblas -lgfortran
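
As a quick sanity check, the file can be read back with Python's standard `configparser` to confirm which device Theano is being asked to use. This is a sketch, not how Theano itself loads its configuration. The text is inlined here rather than read from disk, and `uses_gpuarray_backend` is a name made up for illustration.

```python
import configparser

# Contents of the .theanorc shown above, inlined so the sketch is self-contained.
THEANORC_TEXT = """\
[global]
floatX = float32
device = cuda0
mode = FAST_RUN

[blas]
ldflags = -lopenblas -lgfortran
"""

config = configparser.ConfigParser()
config.read_string(THEANORC_TEXT)

device = config.get("global", "device")
ldflags = config.get("blas", "ldflags").split()

# Device names starting with "cuda" select the libgpuarray-based backend.
uses_gpuarray_backend = device.startswith("cuda")
print(device, ldflags, uses_gpuarray_backend)
```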

When I run the Python tests, many of them fail; the log is attached as a text file.
It seems to boil down to a problem with BLAS and some CUDA kernels not compiling,
most often failing with cuLinkAddData: CUDA_ERROR_UNKNOWN

log.txt

nouiz commented

OK, I recompiled; the C tests give the same output. The Python tests now give the following log, with more output:
log.txt

I reinstalled CUDA 9 and libnccl 2 and recompiled everything; all tests pass now.
One thing that bugged me, though: you have to leave the libgpuarray directory to run the tests. This tripped me up for a while.

nouiz commented

In this section:
To run the python tests, install pygpu, then move outside its directory and run this command

I'd suggest making it bold that you have to move outside the libgpuarray directory.

Also, this is very subtle, but the compilation instructions contain an easy-to-miss "cd" command:

python setup.py build
python setup.py install --user
cd
DEVICE="<test device>" python -c "import pygpu;pygpu.test()"
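
The "cd" matters presumably because Python puts the current directory first on its import path, so running `import pygpu` from inside the source tree picks up the unbuilt `pygpu` folder instead of the installed package. Here is a small sketch of a guard one could run before the tests. `shadows_installed_package` is a hypothetical helper, and the temporary directories only stand in for the source tree and some other location.

```python
import os
import tempfile


def shadows_installed_package(directory, package="pygpu"):
    """Return True if `directory` contains a regular package named `package`.

    Started from such a directory, `python -c "import <package>"` would load
    that local copy rather than the installed one.
    """
    init_file = os.path.join(directory, package, "__init__.py")
    return os.path.isfile(init_file)


with tempfile.TemporaryDirectory() as source_tree, \
        tempfile.TemporaryDirectory() as elsewhere:
    # Mimic a source checkout that contains a pygpu/ package folder.
    os.mkdir(os.path.join(source_tree, "pygpu"))
    open(os.path.join(source_tree, "pygpu", "__init__.py"), "w").close()

    in_source_tree = shadows_installed_package(source_tree)
    outside = shadows_installed_package(elsewhere)

print(in_source_tree, outside)
```

In practice the check would be called as `shadows_installed_package(os.getcwd())` right before launching the tests.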

nouiz commented

I added some bold in a new PR.