[bug] cuda_convnet not found with mode=DebugMode
TNick opened this issue · 4 comments
Not sure if this is a Theano or a Pylearn2 bug. Here it goes:
On a fresh install (or after cleaning the cached compilation results in ~/.theano), the following flags prevent me from training an MLP model:
import os
os.environ['THEANO_FLAGS'] ='optimizer=None,mode=DebugMode,allow_gc=False,exception_verbosity=high,device=gpu,floatX=float32'
The process fails to build the CUDA code with this error:
/usr/bin/ld: cannot find -lcuda_convnet
Relevant frames:
File "~/pylearn2/pylearn2/models/mlp.py", line 488, in __init__
self._update_layer_input_spaces()
File "~/pylearn2/pylearn2/models/mlp.py", line 553, in _update_layer_input_spaces
layers[0].set_input_space(self.get_input_space())
File "~/pylearn2/pylearn2/sandbox/rnn/models/mlp_hook.py", line 337, in outer
return set_input_space(self, input_space)
File "~/pylearn2/pylearn2/models/maxout.py", line 724, in set_input_space
dummy_p = dummy_p.eval()
File "~/theano/theano/gof/graph.py", line 413, in eval
self._fn_cache[inputs] = theano.function(inputs, self)
File "~/theano/theano/compile/function.py", line 266, in function
profile=profile)
File "~/theano/theano/compile/pfunc.py", line 511, in pfunc
on_unused_input=on_unused_input)
File "~/theano/theano/compile/function_module.py", line 1468, in orig_function
defaults)
File "~/theano/theano/compile/debugmode.py", line 2424, in create
_fn, _i, _o = self.linker.make_thunk(input_storage=input_storage)
File "~/theano/theano/gof/link.py", line 559, in make_thunk
output_storage=output_storage)[:3]
File "~/theano/theano/compile/debugmode.py", line 1686, in make_all
output_storage=node_output_storage)
File "~/theano/theano/gof/cc.py", line 1072, in make_thunk
keep_lock=keep_lock)
File "~/theano/theano/gof/cc.py", line 1014, in __compile__
keep_lock=keep_lock)
File "~/theano/theano/gof/cc.py", line 1441, in cthunk_factory
key=key, lnk=self, keep_lock=keep_lock)
File "~/theano/theano/gof/cmodule.py", line 1075, in module_from_key
module = lnk.compile_cmodule(location)
File "~/theano/theano/gof/cc.py", line 1353, in compile_cmodule
preargs=preargs)
File "~/theano/theano/sandbox/cuda/nvcc_compiler.py", line 423, in compile_str
'for cmd', ' '.join(cmd))
the command that was issued:
/usr/local/cuda-6.5/bin/nvcc -shared -O3 -use_fast_math -arch=sm_30 -m64
-Xcompiler-fno-math-errno,-Wno-unused-label,-Wno-unused-variable,-Wno-write-strings,-DCUDA_NDARRAY_CUH=b0c165044f8fd1e0f58745b1202dab6e,-D NPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,-fPIC
-Xlinker -rpath,~/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.9-64/cuda_ndarray
-Xlinker -rpath,/usr/local/cuda-6.5/lib
-Xlinker -rpath,/usr/local/cuda-6.5/lib64
-I~/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.9-64/cuda_ndarray
-I/usr/local/cuda-6.5/include
-I~/pylearn2/pylearn2/sandbox/cuda_convnet/
-I~/anaconda/lib/python2.7/site-packages/numpy/core/include
-I~/anaconda/include/python2.7
-I~/theano/theano/sandbox/cuda
-o ~/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.9-64/tmp08k3eU/9da75eac8cce435a25905399484c7f9d.so mod.cu
-L~/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.9-64/cuda_ndarray
-L~/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.9-64/cuda_convnet
-L~/anaconda/lib
-lpython2.7 -lcudart -lcublas -lcuda_ndarray -lcuda_convnet
and ls ~/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.9-64
returns:
cuda_ndarray cutils_ext __init__.py lazylinker_ext tmp_BgUSu tmpGZiioy tmpLzQSKt tmpwIsazn tmpxsY8_t
It obviously expects libcuda_convnet.so in the cuda_convnet directory, but neither the directory nor the library is generated.
The following flags work:
import os
os.environ['THEANO_FLAGS'] ='optimizer=fast_compile,exception_verbosity=high,device=gpu,floatX=float32'
After that I can go back to the initial set of flags, because there is now a cuda_convnet directory with libcuda_convnet.so inside.
I'm using the latest source for both Theano and Pylearn2.
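A quick way to tell whether the cache has already been filled is to look for the library on disk. The snippet below is only a sketch: `convnet_lib_present` is a hypothetical helper (not part of Theano or Pylearn2), and the `cuda_convnet/libcuda_convnet.so` layout is assumed from the directory listing and linker flags above.

```python
import os


def convnet_lib_present(compiledir):
    """Return True if libcuda_convnet.so exists under compiledir/cuda_convnet.

    Hypothetical helper for illustration only; the directory layout is
    assumed from the listing in this report.
    """
    lib = os.path.join(os.path.expanduser(compiledir),
                       "cuda_convnet", "libcuda_convnet.so")
    return os.path.isfile(lib)
```

With Theano imported, the directory to pass would be `theano.config.compiledir`. If the helper returns False, running once with the non-DebugMode flags first should create the library.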
Thanks for the report. It looks like a problem in DebugMode, which tries to compile the C thunk of the pooling node before calling the make_thunk method of the op itself, which is supposed to call convnet_available() to ensure the library has already been compiled.
The long-term fix would be to get rid of the special linker used by DebugMode, and make it behave like the regular Mode.
Before that, I'll try to see if we can make DebugMode call make_thunk in the right place.
In the short term, the work-around you found (manually filling the cache by launching an experiment without DebugMode) is probably the best you can get.
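The ordering problem described above can be illustrated with a toy model. This is not Theano's code: `ToyPoolOp`, `ensure_lib`, `compile_c_thunk`, `regular_link` and `debug_link` are hypothetical names made up for the sketch, and the set of "built" libraries stands in for files in the compile directory.

```python
class ToyPoolOp(object):
    """Toy stand-in for an op whose C code links against a shared library."""

    def __init__(self, built_libs):
        # Set of library names that already exist "on disk".
        self.built_libs = built_libs

    def ensure_lib(self):
        # Plays the role of convnet_available(): build the library on demand.
        self.built_libs.add("cuda_convnet")

    def compile_c_thunk(self):
        # Plays the role of the link step that needs -lcuda_convnet.
        if "cuda_convnet" not in self.built_libs:
            raise RuntimeError("cannot find -lcuda_convnet")
        return lambda: "pooled"

    def make_thunk(self):
        # Regular path: make sure the library exists, then compile.
        self.ensure_lib()
        return self.compile_c_thunk()


def regular_link(op):
    # Regular mode goes through the op's own make_thunk.
    return op.make_thunk()


def debug_link(op):
    # Problematic ordering: compile the C thunk directly, skipping make_thunk.
    return op.compile_c_thunk()
```

In this toy, `debug_link` fails on a fresh `ToyPoolOp(set())`, but succeeds once `regular_link` has been called on the same op, which mirrors the work-around of filling the cache without DebugMode first.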
I tried to address that in Theano/Theano#2729; it was easier than I expected...
It would be great if you could test that branch and check that it solves your problem.
Thanks for the report!
I will, but my spot instance is rather busy right now.
As soon as they kick me out and I have to instantiate a new one, I will test it.
Thanks.
PS
I see that there is a comment on your PR. Is it relevant to my test?
The PR is merged, so just update Theano to test it again.
The comment was about an old comment that was moved in the PR. It doesn't affect you.