akanazawa/hmr

Demo Run Failed With Cudnn Error

Hi,

I tried to run the demo and got this error:

2019-08-11 18:09:17.244335: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-08-11 18:09:17.464974: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-08-11 18:09:18.180194: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-08-11 18:09:18.213845: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
File "/media/alon/hdd/Guardian/od/hmr/demo.py", line 151, in <module>
main(config.img_path, config.json_path)
File "/media/alon/hdd/Guardian/od/hmr/demo.py", line 136, in main
input_img, get_theta=True)
File "/media/alon/hdd/Guardian/od/hmr/src/RunModel.py", line 140, in predict
results = self.predict_dict(images)
File "/media/alon/hdd/Guardian/od/hmr/src/RunModel.py", line 166, in predict_dict
results = self.sess.run(fetch_dict, feed_dict)
File "/media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node Encoder_resnet/resnet_v2_50/conv1/Conv2D (defined at tmp/tmpvc6uDQ.py:12) ]]
[[add_2/_573]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node Encoder_resnet/resnet_v2_50/conv1/Conv2D (defined at tmp/tmpvc6uDQ.py:12) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node Encoder_resnet/resnet_v2_50/conv1/Conv2D:
Encoder_resnet/resnet_v2_50/Pad (defined at media/alon/hdd/Guardian/od/hmr/src/models.py:48)

Input Source operations connected to node Encoder_resnet/resnet_v2_50/conv1/Conv2D:
Encoder_resnet/resnet_v2_50/Pad (defined at media/alon/hdd/Guardian/od/hmr/src/models.py:48)

Original stack trace for u'Encoder_resnet/resnet_v2_50/conv1/Conv2D':
File "media/alon/hdd/Guardian/od/hmr/demo.py", line 151, in <module>
main(config.img_path, config.json_path)
File "media/alon/hdd/Guardian/od/hmr/demo.py", line 125, in main
model = RunModel(config, sess=sess)
File "media/alon/hdd/Guardian/od/hmr/src/RunModel.py", line 62, in __init__
self.build_test_model_ief()
File "media/alon/hdd/Guardian/od/hmr/src/RunModel.py", line 82, in build_test_model_ief
reuse=False)
File "media/alon/hdd/Guardian/od/hmr/src/models.py", line 48, in Encoder_resnet
scope='resnet_v2_50')
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/nets/resnet_v2.py", line 287, in resnet_v2_50
scope=scope)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/nets/resnet_v2.py", line 214, in resnet_v2
net = resnet_utils.conv2d_same(net, 64, 7, stride=2, scope='conv1')
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/nets/resnet_utils.py", line 146, in conv2d_same
scope=scope)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1159, in convolution2d
conv_dims=2)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1057, in convolution
outputs = layer.apply(inputs)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1479, in apply
return self.call(inputs, *args, **kwargs)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 537, in call
outputs = super(Layer, self).call(inputs, *args, **kwargs)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 634, in call
outputs = call_fn(inputs, *args, **kwargs)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/autograph/impl/api.py", line 146, in wrapper
), args, kwargs)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/autograph/impl/api.py", line 450, in converted_call
result = converted_f(*effective_args, **kwargs)
File "tmp/tmpvc6uDQ.py", line 12, in tf__call
outputs = ag__.converted_call('convolution_op', self, ag_.ConversionOptions(recursive=True, force_conversion=False, optional_features=(), internal_convert_user_code=True), (inputs, self.kernel), None)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/autograph/impl/api.py", line 356, in converted_call
return _call_unconverted(f, args, kwargs)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/autograph/impl/api.py", line 255, in _call_unconverted
return f(*args)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 1079, in call
return self.conv_op(inp, filter)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 635, in call
return self.call(inp, filter)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 234, in call
name=self.name)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 1953, in conv2d
name=name)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "media/alon/hdd/Guardian/od/hmr/venv_hmr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in init
self._traceback = tf_stack.extract_stack()

I got it working by inserting the GPU memory allocation lines suggested in this issue.
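
For anyone landing here, a minimal sketch of what that kind of workaround usually looks like. The linked issue is not reproduced here, so the exact lines are an assumption: this uses the standard TF 1.x `ConfigProto` GPU options (`allow_growth` and `per_process_gpu_memory_fraction`) through `tf.compat.v1`. The helper names `build_gpu_options` and `make_session` are hypothetical and not part of hmr.

```python
def build_gpu_options(allow_growth=True, memory_fraction=None):
    """Collect GPU memory settings in a plain dict (no TensorFlow needed)."""
    if memory_fraction is not None and not 0.0 < memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be in (0, 1]")
    options = {"allow_growth": bool(allow_growth)}
    if memory_fraction is not None:
        options["per_process_gpu_memory_fraction"] = float(memory_fraction)
    return options


def make_session(allow_growth=True, memory_fraction=None):
    """Create a session that does not grab all GPU memory up front."""
    # Imported lazily so the helper above can be used without TensorFlow.
    import tensorflow as tf

    config = tf.compat.v1.ConfigProto()
    for name, value in build_gpu_options(allow_growth, memory_fraction).items():
        # Both names are scalar fields of GPUOptions in the TF 1.x config proto.
        setattr(config.gpu_options, name, value)
    return tf.compat.v1.Session(config=config)
```

The traceback above shows demo.py calling `RunModel(config, sess=sess)`, so a session built this way could presumably be passed in at that point, for example `sess = make_session()` or `sess = make_session(memory_fraction=0.5)`.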

Does anyone have a better solution, or is that it?

Regards, Alon

Hello @alon1samuel

Did you manage to find a fix for the above error, by any chance? Sorry to ask, but I have been trying to get it working for the past week without much luck.

Thanks,
Vignesh

Hi, I don't remember whether I fixed it, or how.
From the error message, my guess is that the key part is "cuDNN failed to initialize".
I suspect it's not a problem with this repo, but one that would affect any TF model that uses conv2d layers (or similar).
So I would suggest checking your installation again with a simple model, like the one in this tutorial, to confirm that it works first.
Hope it helps!
Alonsh
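
A rough sketch of such a sanity check, independent of hmr and of the tutorial mentioned above: it runs a single tiny conv2d in a fresh graph and reports whether it executed. It assumes a TensorFlow build that exposes `tf.compat.v1`; the function name `check_conv2d` and the status strings are made up for illustration.

```python
def check_conv2d():
    """Run one small convolution and report (status, detail)."""
    try:
        import tensorflow as tf
    except ImportError as exc:
        return "no-tensorflow", str(exc)

    tf1 = tf.compat.v1
    graph = tf1.Graph()
    with graph.as_default():
        # One 8x8 RGB image and a 3x3 kernel with 4 output channels.
        images = tf1.ones([1, 8, 8, 3])
        kernel = tf1.ones([3, 3, 3, 4])
        output = tf1.nn.conv2d(images, kernel,
                               strides=[1, 1, 1, 1], padding="SAME")
    try:
        # Default session config on purpose: this tests the installation
        # itself, without any GPU memory workaround applied.
        with tf1.Session(graph=graph) as sess:
            result = sess.run(output)
    except tf.errors.OpError as exc:
        # A cuDNN initialization failure would surface here
        # (it is raised as UnknownError, a subclass of OpError).
        return "failed", str(exc)
    return "ok", result.shape


if __name__ == "__main__":
    print(check_conv2d())
```

If this fails with the same "Failed to get convolution algorithm" message, the problem is likely in the CUDA/cuDNN/TensorFlow setup or in GPU memory availability rather than in the demo code.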

Hi, thanks for your reply. I got it working the same way you did initially, by limiting GPU usage. I'll try the simple model and see whether I can figure out the underlying problem.

Thanks