A problem about training.
Hello, thanks for your great work. However, when I run your train.py file, I get the error below. Could you help me solve this problem? Thanks!
Traceback (most recent call last):
  File "train.py", line 218, in <module>
    train()
  File "train.py", line 141, in train
    d_fake = generator_step(net_d, out2, net_loss, optimizer)
  File "/userhome/point-cloud/VRCNet/utils/train_utils.py", line 42, in generator_step
    total_gen_loss_batch.backward(torch.ones(torch.cuda.device_count()).cuda(), retain_graph=True, )
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 120, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    grad_tensors = _make_grads(tensors, grad_tensors)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 29, in _make_grads
    + str(out.shape) + ".")
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([1]) and output[0] has a shape of torch.Size([]).
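(For reference, the mismatch itself is easy to reproduce outside the repository: backward() on a 0-dimensional loss rejects a gradient tensor of shape [1]. A minimal standalone CPU sketch, not taken from the VRCNet code:)

import torch

# A 0-dimensional (scalar) loss, as produced when training on a single GPU.
loss = (torch.randn(4, requires_grad=True) ** 2).mean()
print(loss.shape)  # torch.Size([])

# Passing a gradient of shape [1] to a scalar output triggers the same
# "Mismatch in shape" RuntimeError as in the traceback above.
try:
    loss.backward(torch.ones(1))
except RuntimeError as err:
    print(err)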
Hi,
I am having the same issue. I use Python 3.8, PyTorch 1.8.1, and CUDA 11.
Does anyone have a fix for this?
Thanks
Hi,
I found that the issue may occur when training with only one GPU; you could try using two or more.
Best regards.
Hi,
Thanks for your input, it helped fix the issue. If you only have one GPU, PyTorch returns the loss as a scalar (a 0-dimensional tensor) rather than a vector of per-GPU losses, but torch.ones(torch.cuda.device_count()) always creates a 1-dimensional gradient tensor.
The edge case can be handled either by dropping the torch.ones(...) argument entirely or by squeezing it, e.g. by changing line 134 in train.py to:
net_loss.backward(torch.squeeze(torch.ones(torch.cuda.device_count())).cuda())
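To illustrate why the squeeze works (a standalone CPU sketch with .cuda() dropped so it runs anywhere; backward_with_ones is just an illustrative helper, not a function from the repository): with multiple GPUs, DataParallel gathers one loss per device, so the loss has shape [N] and torch.ones(N) matches it, while with one GPU the loss is 0-dimensional and squeezing torch.ones(1) yields a matching 0-dimensional gradient.

import torch

def backward_with_ones(loss: torch.Tensor, num_devices: int) -> None:
    # Mirrors the workaround above: torch.ones(num_devices) is 1-dimensional,
    # so squeeze it down to a scalar when there is only one device and the
    # loss is therefore 0-dimensional.
    grad = torch.squeeze(torch.ones(num_devices))
    loss.backward(grad, retain_graph=True)

# Multi-GPU case: DataParallel returns one loss value per device -> shape [N].
multi_gpu_loss = torch.randn(2, requires_grad=True) ** 2
backward_with_ones(multi_gpu_loss, num_devices=2)   # grad shape [2] matches [2]

# Single-GPU case: the loss is a plain scalar -> shape [].
single_gpu_loss = (torch.randn(4, requires_grad=True) ** 2).mean()
backward_with_ones(single_gpu_loss, num_devices=1)  # squeezed grad is 0-dim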
It helped, thanks!