google-research/pathdreamer

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D]

Raghvender1205 opened this issue · 4 comments

While running both the image generation and the video generation models on Colab, I run into this error even though I selected a GPU runtime with a MirroredStrategy. You can see the error in the screenshot below.
[Image: Screenshot 2021-11-29 185446]

Please help with this cuDNN error. At first glance, it seems to have something to do with the sample_noise=False argument.

Hi,

Does restarting the Colab runtime fix this? If not, can you share a version of your Colab so I can look at the full stack trace?

Hi,

I'm commenting on this issue because I am getting the same error, even after restarting the Colab several times. The full stack trace is as follows:

UnknownError                              Traceback (most recent call last)

[<ipython-input-9-9d6d82e064a9>](https://localhost:8080/#) in <module>()
     16   # The first step is trivially inferred from groundtruth information.
     17   add_to_mem = (frame_idx > 0)
---> 18   outputs = stoch_model(end_pos, add_preds_to_memory=add_to_mem, sample_noise=False)
     19   total_dist += dist
     20   predicted_data['distance'].append(total_dist.numpy())

17 frames

[/content/pathdreamer/models/pathdreamer_models.py](https://localhost:8080/#) in __call__(self, position, add_preds_to_memory, sample_noise, use_projected_pathdreamer_rgb, z)
    309         sample_noise=sample_noise,
    310         z=z,
--> 311         training=False)
    312     mu = mu[:, 0, ...]
    313     logvar = logvar[:, 0, ...]

[/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer.py](https://localhost:8080/#) in __call__(self, *args, **kwargs)
   1035         with autocast_variable.enable_auto_cast_variables(
   1036             self._compute_dtype_object):
-> 1037           outputs = call_fn(inputs, *args, **kwargs)
   1038 
   1039         if self._activity_regularizer:

[/content/pathdreamer/models/point_cloud_models.py](https://localhost:8080/#) in call(self, inputs, groundtruth_inputs, sample_noise, z, training)
    253         [pred_feat_tensor_merged, pred_depth_tensor_merged], axis=-1)
    254 
--> 255     hidden_spatial, skip = self.encoder(combined_input)
    256     if self.flatten:
    257       # Convert hidden to a (N, 1, 1, C) tensor for decoder.

[/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer.py](https://localhost:8080/#) in __call__(self, *args, **kwargs)
   1035         with autocast_variable.enable_auto_cast_variables(
   1036             self._compute_dtype_object):
-> 1037           outputs = call_fn(inputs, *args, **kwargs)
   1038 
   1039         if self._activity_regularizer:

[/content/pathdreamer/models/image_models.py](https://localhost:8080/#) in call(self, x, training)
     97            x: tf.Tensor,
     98            training=None) -> Tuple[tf.Tensor, List[tf.Tensor]]:
---> 99     out_x = self.block1(x)
    100     b1 = out_x
    101     out_x = self.maxpool(out_x)

[/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer.py](https://localhost:8080/#) in __call__(self, *args, **kwargs)
   1035         with autocast_variable.enable_auto_cast_variables(
   1036             self._compute_dtype_object):
-> 1037           outputs = call_fn(inputs, *args, **kwargs)
   1038 
   1039         if self._activity_regularizer:

[/usr/local/lib/python3.7/dist-packages/keras/engine/sequential.py](https://localhost:8080/#) in call(self, inputs, training, mask)
    367       if not self.built:
    368         self._init_graph_network(self.inputs, self.outputs)
--> 369       return super(Sequential, self).call(inputs, training=training, mask=mask)
    370 
    371     outputs = inputs  # handle the corner case where self.layers is empty

[/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py](https://localhost:8080/#) in call(self, inputs, training, mask)
    413     """
    414     return self._run_internal_graph(
--> 415         inputs, training=training, mask=mask)
    416 
    417   def compute_output_shape(self, input_shape):

[/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py](https://localhost:8080/#) in _run_internal_graph(self, inputs, training, mask)
    548 
    549         args, kwargs = node.map_arguments(tensor_dict)
--> 550         outputs = node.layer(*args, **kwargs)
    551 
    552         # Update tensor_dict.

[/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer.py](https://localhost:8080/#) in __call__(self, *args, **kwargs)
   1035         with autocast_variable.enable_auto_cast_variables(
   1036             self._compute_dtype_object):
-> 1037           outputs = call_fn(inputs, *args, **kwargs)
   1038 
   1039         if self._activity_regularizer:

[/usr/local/lib/python3.7/dist-packages/keras/layers/convolutional.py](https://localhost:8080/#) in call(self, inputs)
    247       inputs = tf.pad(inputs, self._compute_causal_padding(inputs))
    248 
--> 249     outputs = self._convolution_op(inputs, self.kernel)
    250 
    251     if self.use_bias:

[/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
    204     """Call target, and fall back on dispatchers if there is a TypeError."""
    205     try:
--> 206       return target(*args, **kwargs)
    207     except (TypeError, ValueError):
    208       # Note: convert_to_eager_tensor currently raises a ValueError, not a

[/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/nn_ops.py](https://localhost:8080/#) in convolution_v2(input, filters, strides, padding, data_format, dilations, name)
   1136       data_format=data_format,
   1137       dilations=dilations,
-> 1138       name=name)
   1139 
   1140 

[/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/nn_ops.py](https://localhost:8080/#) in convolution_internal(input, filters, strides, padding, data_format, dilations, name, call_from_convolution, num_spatial_dims)
   1266           data_format=data_format,
   1267           dilations=dilations,
-> 1268           name=name)
   1269     else:
   1270       if channel_index == 1:

[/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/nn_ops.py](https://localhost:8080/#) in _conv2d_expanded_batch(input, filters, strides, padding, data_format, dilations, name)
   2720         data_format=data_format,
   2721         dilations=dilations,
-> 2722         name=name)
   2723   return squeeze_batch_dims(
   2724       input,

[/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py](https://localhost:8080/#) in conv2d(input, filter, strides, padding, use_cudnn_on_gpu, explicit_paddings, data_format, dilations, name)
    930       return _result
    931     except _core._NotOkStatusException as e:
--> 932       _ops.raise_from_not_ok_status(e, name)
    933     except _core._FallbackException:
    934       pass

[/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py](https://localhost:8080/#) in raise_from_not_ok_status(e, name)
   6939   message = e.message + (" name: " + name if name is not None else "")
   6940   # pylint: disable=protected-access
-> 6941   six.raise_from(core._status_to_exception(e.code, message), None)
   6942   # pylint: enable=protected-access
   6943 

/usr/local/lib/python3.7/dist-packages/six.py in raise_from(value, from_value)

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D]

Thanks in advance!

Thanks for flagging this. This was due to our requirements.txt installing a version of TensorFlow that is no longer compatible with the cuDNN version on Colab.

To fix this, I've updated the Colabs to use requirements_colab.txt, which avoids overwriting the TF version of Colab. Can you give it a try, and let me know if it still doesn't work?
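
For anyone who wants to check for this kind of mismatch before running the model, here is a minimal diagnostic sketch. It is not part of the Pathdreamer repo. The helper name `describe_tf_runtime` is made up for illustration, and it assumes a TensorFlow 2.3+ build, where `tf.sysconfig.get_build_info()` is available. It reports the installed TF version, the CUDA/cuDNN versions that build was compiled against, and the GPUs TF can see.

```python
def describe_tf_runtime():
    """Summarize the installed TensorFlow build and the GPUs visible to it.

    Hypothetical helper for illustration; not part of the Pathdreamer code.
    """
    info = {
        "tf_installed": False,
        "tf_version": None,
        "cuda_version": None,
        "cudnn_version": None,
        "gpus": [],
    }
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return info

    info["tf_installed"] = True
    info["tf_version"] = tf.__version__

    # get_build_info() exists in TF 2.3+; older builds are skipped here.
    get_build_info = getattr(tf.sysconfig, "get_build_info", None)
    if get_build_info is not None:
        build = get_build_info()
        # These keys are only present on CUDA-enabled builds.
        info["cuda_version"] = build.get("cuda_version")
        info["cudnn_version"] = build.get("cudnn_version")

    info["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    return info


if __name__ == "__main__":
    for key, value in describe_tf_runtime().items():
        print(f"{key}: {value}")
```

If the TF version printed after installing the requirements differs from the one the fresh Colab runtime started with, the install step has probably replaced Colab's preinstalled TensorFlow, which is the situation described above.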

Hi,

Many thanks for the quick update. I can confirm that it's working like a charm now!

Best.