lisa-lab/pylearn2

[bug] cuda_convnet not found with mode=DebugMode

TNick opened this issue · 4 comments

Not sure if this is a Theano or a Pylearn2 bug. Here it goes:

On a fresh install (or after clearing the cached compilation results in ~/.theano), the following flags prevent me from training an MLP model:

import os
# Must be set before theano is imported; the flags are read at import time.
os.environ['THEANO_FLAGS'] = 'optimizer=None,mode=DebugMode,allow_gc=False,exception_verbosity=high,device=gpu,floatX=float32'

The process fails to build the CUDA code with this error:

/usr/bin/ld: cannot find -lcuda_convnet

Relevant frames:

  File "~/pylearn2/pylearn2/models/mlp.py", line 488, in __init__
    self._update_layer_input_spaces()
  File "~/pylearn2/pylearn2/models/mlp.py", line 553, in _update_layer_input_spaces
    layers[0].set_input_space(self.get_input_space())
  File "~/pylearn2/pylearn2/sandbox/rnn/models/mlp_hook.py", line 337, in outer
    return set_input_space(self, input_space)
  File "~/pylearn2/pylearn2/models/maxout.py", line 724, in set_input_space
    dummy_p = dummy_p.eval()
  File "~/theano/theano/gof/graph.py", line 413, in eval
    self._fn_cache[inputs] = theano.function(inputs, self)
  File "~/theano/theano/compile/function.py", line 266, in function
    profile=profile)
  File "~/theano/theano/compile/pfunc.py", line 511, in pfunc
    on_unused_input=on_unused_input)
  File "~/theano/theano/compile/function_module.py", line 1468, in orig_function
    defaults)
  File "~/theano/theano/compile/debugmode.py", line 2424, in create
    _fn, _i, _o = self.linker.make_thunk(input_storage=input_storage)
  File "~/theano/theano/gof/link.py", line 559, in make_thunk
    output_storage=output_storage)[:3]
  File "~/theano/theano/compile/debugmode.py", line 1686, in make_all
    output_storage=node_output_storage)
  File "~/theano/theano/gof/cc.py", line 1072, in make_thunk
    keep_lock=keep_lock)
  File "~/theano/theano/gof/cc.py", line 1014, in __compile__
    keep_lock=keep_lock)
  File "~/theano/theano/gof/cc.py", line 1441, in cthunk_factory
    key=key, lnk=self, keep_lock=keep_lock)
  File "~/theano/theano/gof/cmodule.py", line 1075, in module_from_key
    module = lnk.compile_cmodule(location)
  File "~/theano/theano/gof/cc.py", line 1353, in compile_cmodule
    preargs=preargs)
  File "~/theano/theano/sandbox/cuda/nvcc_compiler.py", line 423, in compile_str
    'for cmd', ' '.join(cmd))

The command that was issued:

/usr/local/cuda-6.5/bin/nvcc -shared -O3 -use_fast_math -arch=sm_30 -m64 
    -Xcompiler-fno-math-errno,-Wno-unused-label,-Wno-unused-variable,-Wno-write-strings,-DCUDA_NDARRAY_CUH=b0c165044f8fd1e0f58745b1202dab6e,-D NPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,-fPIC 
    -Xlinker -rpath,~/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.9-64/cuda_ndarray 
    -Xlinker -rpath,/usr/local/cuda-6.5/lib 
    -Xlinker -rpath,/usr/local/cuda-6.5/lib64 
    -I~/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.9-64/cuda_ndarray 
    -I/usr/local/cuda-6.5/include 
    -I~/pylearn2/pylearn2/sandbox/cuda_convnet/ 
    -I~/anaconda/lib/python2.7/site-packages/numpy/core/include 
    -I~/anaconda/include/python2.7 
    -I~/theano/theano/sandbox/cuda 
    -o ~/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.9-64/tmp08k3eU/9da75eac8cce435a25905399484c7f9d.so mod.cu 
    -L~/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.9-64/cuda_ndarray 
    -L~/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.9-64/cuda_convnet 
    -L~/anaconda/lib 
    -lpython2.7 -lcudart -lcublas -lcuda_ndarray -lcuda_convnet

and ls ~/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.9-64 returns:

cuda_ndarray  cutils_ext  __init__.py  lazylinker_ext  tmp_BgUSu  tmpGZiioy  tmpLzQSKt  tmpwIsazn  tmpxsY8_t

The linker evidently expects libcuda_convnet.so in the cuda_convnet directory, but neither the directory nor the library has been generated.
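A quick way to confirm that diagnosis is to check for the library directly. This is a minimal sketch, assuming the layout shown above (a cuda_convnet subdirectory of the compile directory holding libcuda_convnet.so); the helper name has_cuda_convnet_lib is hypothetical and is not part of Theano or Pylearn2.

```python
import os


def has_cuda_convnet_lib(compiledir):
    """Return True if <compiledir>/cuda_convnet/libcuda_convnet.so exists.

    Hypothetical helper for illustration only. `compiledir` is the
    platform-specific directory under ~/.theano (the one listed with ls above).
    """
    lib_path = os.path.join(
        os.path.expanduser(compiledir), "cuda_convnet", "libcuda_convnet.so"
    )
    return os.path.isfile(lib_path)
```

On the fresh install described here, this would return False for the compile directory listed above, since no cuda_convnet entry appears in the ls output.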

The following flags work:

import os
os.environ['THEANO_FLAGS'] = 'optimizer=fast_compile,exception_verbosity=high,device=gpu,floatX=float32'

After that I can go back to the initial set of flags, because there is now a cuda_convnet directory with a libcuda_convnet.so inside.
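The two-step workaround can be written down as a small flag selector. This is only a sketch: the two flag strings are copied from this report, while the function name choose_theano_flags and the cache_is_warm argument are hypothetical names introduced for illustration.

```python
# Flag strings copied verbatim from the report above.
DEBUG_FLAGS = ("optimizer=None,mode=DebugMode,allow_gc=False,"
               "exception_verbosity=high,device=gpu,floatX=float32")
WARMUP_FLAGS = ("optimizer=fast_compile,exception_verbosity=high,"
                "device=gpu,floatX=float32")


def choose_theano_flags(cache_is_warm):
    """Pick the flags for the next run (hypothetical helper).

    Use the non-DebugMode flags until libcuda_convnet.so has been built once,
    then switch to the DebugMode flags.
    """
    return DEBUG_FLAGS if cache_is_warm else WARMUP_FLAGS
```

The result would be assigned to os.environ['THEANO_FLAGS'] before theano is imported, with cache_is_warm set to False for the first run on a clean compile directory.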

I'm using the latest source for both Theano and Pylearn2.

Thanks for the report. It looks like a problem in DebugMode, which tries to compile the C thunk of the pooling node before calling the make_thunk method of the op itself. That method is supposed to call convnet_available() to ensure the library has already been compiled.
The long-term fix would be to get rid of the special linker used by DebugMode, and make it behave like the regular Mode.
Before that, I'll try to see if we can make DebugMode call make_thunk in the right place.
In the short term, the work-around you found (manually filling the cache by launching an experiment without DebugMode) is probably the best you can get.
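The ordering problem described in this comment can be illustrated with a toy model. None of the classes below are Theano or Pylearn2 code; every name is hypothetical, and the sketch only mimics the sequence of calls: a regular linker goes through the op's make_thunk (which builds the shared library first), while a DebugMode-like linker jumps straight to C compilation.

```python
class ToyLibraryCache(object):
    """Stands in for the compile directory that would hold libcuda_convnet.so."""

    def __init__(self):
        self.built = False


class ToyPoolOp(object):
    """Toy op whose C code links against a lazily built library."""

    def __init__(self, cache):
        self.cache = cache

    def ensure_library(self):
        # Plays the role of convnet_available(): build the library on first use.
        self.cache.built = True

    def compile_c_thunk(self):
        # Plays the role of the nvcc link step that passes -lcuda_convnet.
        if not self.cache.built:
            raise RuntimeError("cannot find -lcuda_convnet")
        return lambda: "ok"

    def make_thunk(self):
        # Op-level entry point: make sure the library exists, then compile.
        self.ensure_library()
        return self.compile_c_thunk()


def regular_linker(op):
    # Goes through the op's own make_thunk, so the library gets built.
    return op.make_thunk()


def debug_linker(op):
    # Skips make_thunk and compiles the C thunk directly.
    return op.compile_c_thunk()
```

In this toy, debug_linker fails on a fresh ToyLibraryCache and succeeds once regular_linker has run against the same cache, which mirrors the behaviour reported above.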

I tried to address that in Theano/Theano#2729; it was easier than I expected.
It would be great if you could test that branch and check that it solves your problem.
Thanks for the report!

I will, but my spot instance is rather busy right now.
As soon as they kick me out and I have to start a new one, I will test it.
Thanks.

PS
I see that there is a comment on your PR. Is it relevant to my test?

The PR is merged, so just update Theano and test again.

The comment was about an old comment that was moved in the PR. It doesn't affect you.

