YapengTian/AVE-ECCV18

RuntimeError: cudnn RNN backward can only be called in training mode

Yu-Wu opened this issue · 10 comments

Yu-Wu commented

Hi,

Thanks for your great work.
However, I ran into a problem when running the code.
I followed the instructions, put the required files in the /data folder, and ran the training command.

My environment: PyTorch 0.3.1, CUDA 9.0, cuDNN 7.1.2.
Could you help me run the script correctly?

➜  ave_code git:(master) ✗ source activate pytorch0.3.1
(pytorch0.3.1) ➜  ave_code git:(master) ✗ python weak_supervised_main.py --train
/home/wuyu07/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
3517
=== Epoch {0}   Loss: {0.7096}  Running time: {2.684252}
0.06890547263681591
Traceback (most recent call last):
  File "weak_supervised_main.py", line 171, in <module>
    train(args)
  File "weak_supervised_main.py", line 93, in train
    loss.backward()
  File "/home/wuyu07/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/wuyu07/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cudnn RNN backward can only be called in training mode

YapengTian commented

I just set up a fresh machine and ran the code.
Environment: 0.3.1-py36_cuda9.1.85_cudnn7.0.5_2 pytorch [cuda91]
Results:
3517
=== Epoch {0} Loss: {0.7096} Running time: {2.906535}
0.08358208955223881
=== Epoch {1} Loss: {0.7093} Running time: {1.947218}
=== Epoch {2} Loss: {0.7089} Running time: {1.911076}
=== Epoch {3} Loss: {0.7080} Running time: {1.911666}
=== Epoch {4} Loss: {0.7067} Running time: {1.905607}
=== Epoch {5} Loss: {0.7043} Running time: {1.963768}
0.3875621890547264

I did not run into the same issue. Could you configure the same environment? The problem may come from the different cuDNN version.

BTW, there was a small typo in the previous code: "data/video_feature_noisy.h5" ==> "data/visual_feature_noisy.h5". I think you noticed that before.

Yu-Wu commented

Thanks for your quick reply.

Finally, I found the reason: I had been running your code in PyTorch 1.0 by mistake, instead of PyTorch 0.3.1.
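
For anyone hitting this on newer PyTorch: the error is raised when backward() runs through a cuDNN RNN whose forward pass happened in eval mode, because the cuDNN inference path saves none of the state backward needs. A minimal reproduction sketch, assuming PyTorch >= 0.4 and a CUDA GPU (the LSTM sizes are arbitrary):

import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=8, hidden_size=8).cuda()
rnn.eval()  # eval mode: cuDNN takes the inference path and saves no backward state

x = torch.randn(5, 1, 8, device="cuda", requires_grad=True)
out, _ = rnn(x)
out.sum().backward()  # RuntimeError: cudnn RNN backward can only be called in training mode

Calling rnn.train() before the forward pass (or disabling cuDNN) avoids it.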

Loading model parameters.
/usr/local/lib/python3.7/dist-packages/torchtext/data/field.py:197: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  return Variable(arr, volatile=not train), lengths
/usr/local/lib/python3.7/dist-packages/torchtext/data/field.py:198: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  return Variable(arr, volatile=not train)
/content/Seq2Sick/onmt/translate/Translator.py:48: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  def var(a): return Variable(a, volatile=True)
/content/Seq2Sick/onmt/modules/GlobalAttention.py:179: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  align_vectors = self.sm(align.view(batch*targetL, sourceL))
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py:119: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  input = module(input)
/content/Seq2Sick/onmt/translate/Translator.py:191: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  src.volatile = False
attack.py:64: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  output_a, attn, output_i= translator.getOutput(new_embedding, src, batch)
tensor(18.6335, device='cuda:0') 	 tensor(0., device='cuda:0')
tensor(999., device='cuda:0') 	 tensor(0., device='cuda:0')
Traceback (most recent call last):
  File "attack.py", line 312, in <module>
    main()
  File "attack.py", line 272, in main
    modifier, output_a, attn, new_word, output_i, CFLAG = attack(all_word_embedding, label_onehot, translator, src, batch, new_embedding, input_embedding, modifier, const, GROUP_LASSO, TARGETED, GRAD_REG, NN)
  File "attack.py", line 138, in attack
    loss.backward(retain_graph=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: cudnn RNN backward can only be called in training mode

@Yu-Wu @YapengTian Please help resolve this too; it is similar to your issue. See cmhcbb/Seq2Sick#2.
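
If downgrading is not an option (Seq2Sick needs gradients through the RNN at inference time, where train mode would re-enable dropout), one workaround is to run the forward pass with cuDNN disabled so the native RNN implementation, which supports backward in eval mode, is used instead. A minimal sketch; model, src, and compute_loss are placeholders, not the actual Seq2Sick names:

import torch

# Option 1: disable cuDNN globally before any forward pass.
torch.backends.cudnn.enabled = False

# Option 2: disable it only around the relevant step. The forward pass
# must be inside the block too, since the backward op is recorded at forward time.
with torch.backends.cudnn.flags(enabled=False):
    output = model(src)               # placeholder model and input
    loss = compute_loss(output)       # placeholder loss computation
    loss.backward(retain_graph=True)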

Could you check your PyTorch version? Please use PyTorch 0.3.1.
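
In case it helps, one way to pin it with conda (a sketch; availability of such an old build on the pytorch channel may vary):

conda create -n pytorch0.3.1 python=3.6
source activate pytorch0.3.1
conda install pytorch=0.3.1 cuda90 -c pytorch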

Still same error @YapengTian

You can try adding "net_model.train()" at the beginning of each epoch, inside the "for epoch in range(args.nb_epoch)" loop.
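
A sketch of where that line goes, assuming a standard loop; train_loader, optimizer, and criterion are placeholders rather than the script's actual names:

for epoch in range(args.nb_epoch):
    net_model.train()  # puts cuDNN RNNs in training mode so backward() is allowed
    for audio, video, labels in train_loader:          # placeholder data loader
        optimizer.zero_grad()                          # placeholder optimizer
        loss = criterion(net_model(audio, video), labels)
        loss.backward()                                # no longer raises the RuntimeError
        optimizer.step()
    net_model.eval()  # switch back if validation runs at the end of the epoch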

I know you must have tried to install that PyTorch version, but could you print the version out in case it was not installed successfully? (import torch; print(torch.__version__))


I solved this problem with your comment, thank you so much bro!