ayukat1016/gan_sample

RuntimeError occurred during training

Closed this issue · 8 comments

Hi, I ran into an issue when running 'training.py':

RuntimeError: Given groups=4, weight of size [2048, 512, 3, 3], expected input[1, 8192, 4, 4] to have 2048 channels, but got 8192 channels instead

I used batch_size=4 and resolution=512 with the endless_summer dataset (Python 3.8 / PyTorch 1.8.1).

It seems to work fine with the default parameters (batch_size=16, resolution=32).
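For reference, the mismatch in that error can be reproduced with a minimal grouped-convolution sketch in PyTorch (the shapes below mirror the error message but are otherwise illustrative and not taken from the repository):

```python
import torch
import torch.nn as nn

# With groups=4, a Conv2d whose weight is [2048, 512, 3, 3] expects
# 4 * 512 = 2048 input channels; feeding 8192 channels raises the error above.
conv = nn.Conv2d(in_channels=2048, out_channels=2048,
                 kernel_size=3, padding=1, groups=4)
print(conv.weight.shape)                # torch.Size([2048, 512, 3, 3])

ok = torch.randn(1, 2048, 4, 4)
print(conv(ok).shape)                   # torch.Size([1, 2048, 4, 4])

bad = torch.randn(1, 8192, 4, 4)
try:
    conv(bad)
except RuntimeError as e:
    print(e)                            # "... expected input[1, 8192, 4, 4] to have 2048 channels, but got 8192 channels instead"
```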

Thank you.

I have confirmed that training.py works with Python 3.7 and PyTorch 1.7.
Could you check again with Python 3.7 and PyTorch 1.7?
Or try running it on Google Colab.

I already downgraded Python and PyTorch, and the same issue occurred again.
In addition, I really do not want to run it on Colab, because we already have 4x V100 GPUs.

Thank you.

Thank you for checking the operation with Python 3.7 and PyTorch 1.7.
I have just checked training.py on Google Colab and it is working fine.
Please check the attached images.
I started training from an already-trained model, so a good image is generated.

You said you have 4 V100s, but have you modified the source to make it work in a multi-GPU environment?
Try setting the number of images processed by one GPU to a multiple of 4.
Since that would be 4 images x 4 GPUs, I think the batch size should be at least 16.
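To illustrate why the per-GPU count matters, here is a minimal sketch using torch.nn.DataParallel, which splits the batch along dimension 0 across the visible GPUs (a generic example, not necessarily how gan_sample distributes its work):

```python
import torch
import torch.nn as nn

# Sketch only: DataParallel splits the batch across visible GPUs along dim 0,
# so batch_size=16 on 4 GPUs means 4 images per GPU. A batch_size of 4 would
# leave each GPU with only a single image.
model = nn.Conv2d(3, 64, 3, padding=1)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)      # replicate the module on each GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

batch = torch.randn(16, 3, 512, 512).to(device)
out = model(batch)                      # each GPU processes 16 / num_gpus images
print(out.shape)
```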

[attached screenshots]

I'll try it later. Thank you for your comments.
(Before, I tried with only 1 GPU by using CUDA_VISIBLE_DEVICES, batch size 4 for the 512 input size, and the source code was not modified yet.)
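For reference, restricting a run to a single GPU as mentioned above can be done by setting CUDA_VISIBLE_DEVICES before CUDA is initialized; a minimal sketch (the device index 0 is just an example):

```python
import os

# Must be set before torch initializes CUDA for it to take effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # 1 -- only the selected GPU is visible
```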

Again, I made a new conda env (Python 3.7, torch 1.7.0, and numpy 1.19.5) and tried '--batch_size 4 --resolution=512', and the error shown in the attached screenshots came out.
(This time, I changed 'import pickle' to 'import pickle5 as pickle' in base_layer.py after installing pickle5 in the new conda env.)

[attached screenshots]

Maybe the Colab environment is different from mine. Would you check your source code on a GPU machine or something?
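A slightly safer variant of the 'import pickle5 as pickle' change mentioned above is a fallback import, so base_layer.py still works where pickle5 is not installed (a sketch, not the repository's actual code):

```python
# pickle5 backports pickle protocol 5 to Python < 3.8; fall back to the
# standard library when it is not installed.
try:
    import pickle5 as pickle
except ImportError:
    import pickle
```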

I ran training.py in a non-Google-Colab environment.
The operating environment is EC2 on AWS, and the OS is Ubuntu 18.
training.py seems to be working properly.
I will attach images of my environment, so please refer to them.
By the way, the Python version is 3.7.4.

[attached screenshots of the environment]
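When comparing environments like this, a quick version dump on both machines can help; a generic sketch (not part of the repository):

```python
import sys
import numpy as np
import torch

# Print the versions that matter for this issue.
print("python:", sys.version.split()[0])
print("torch :", torch.__version__, "| CUDA:", torch.version.cuda)
print("numpy :", np.__version__)
print("GPUs  :", torch.cuda.device_count())
```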

Sorry, I figured out what the problem was.

As you already know, I first trained the model to the end with default values. Then, when I retrained the model with a different resolution and batch size, the above error occurred.

After that, I erased all the first training results(all pickles) and trained with a different batch size and resolution than the default, so the same error as before did not occur. In the end, when rerunning the training code, the most recently trained pickles seem to be loaded and the same error as before appears.

Setting the --is_restore_model option in training.py to False stops the previously trained model from being used.
Please try it, and attach an image here when a good image is generated.
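In other words, the error appears when a checkpoint pickled at one resolution/batch size is restored into a model built for another. A hedged sketch of that restore-or-rebuild logic (the helper and paths below are hypothetical, not the actual training.py code):

```python
import os
import pickle

def build_or_restore(build_model, checkpoint_path, is_restore_model):
    """Hypothetical helper: restore the last pickled model only when requested."""
    if is_restore_model and os.path.exists(checkpoint_path):
        # Restoring a pickle trained at a different resolution/batch size
        # yields layer shapes that no longer match, causing the RuntimeError above.
        with open(checkpoint_path, "rb") as f:
            return pickle.load(f)
    # Start from scratch (the effect of --is_restore_model False).
    return build_model()
```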