RuntimeError occurred during training
Closed this issue · 8 comments
Hi, I ran into an issue when running 'training.py':
RuntimeError: Given groups=4, weight of size [2048, 512, 3, 3], expected input[1, 8192, 4, 4] to have 2048 channels, but got 8192 channels instead
I used batch_size=4 and resolution=512 with the endless_summer dataset (Python 3.8 / PyTorch 1.8.1).
It works fine with the default parameters (batch_size=16, resolution=32).
Thank you.
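For reference, the mismatch in that message can be reproduced in isolation. The sketch below (standalone, not code from training.py) builds a grouped convolution with the same weight shape and feeds it the reported input:

```python
import torch
import torch.nn as nn

# With groups=4 and in_channels=2048, the weight has shape [2048, 512, 3, 3],
# matching the error message, so the layer expects 2048 input channels.
conv = nn.Conv2d(in_channels=2048, out_channels=2048,
                 kernel_size=3, padding=1, groups=4)
x = torch.randn(1, 8192, 4, 4)  # the tensor in the error has 8192 channels
conv(x)  # RuntimeError: ... expected input[1, 8192, 4, 4] to have 2048 channels
```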
I have confirmed that training.py works with Python 3.7 and PyTorch 1.7.
Could you check again with Python 3.7 and PyTorch 1.7?
Or try running it on Google Colab.
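For a quick sanity check of the environment, something like this prints the versions in use (assuming torch is installed in that env):

```python
import sys
import torch

# Confirm the interpreter and PyTorch versions in the active environment.
print(sys.version)        # expect 3.7.x
print(torch.__version__)  # expect 1.7.x
```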
I already downgraded Python and PyTorch, and the same issue occurred again.
Also, I would rather not run it on Colab, because we already have 4x V100 GPUs.
Thank you.
Thank you for checking the operation with Python 3.7 and PyTorch 1.7.
I've just checked training.py on Google Colab and it's working fine.
Please check the attached image.
I started training from a trained model, so a good image is generated.
You said you have 4 V100s; have you modified the source to make it work in a multi-GPU environment?
Try setting the number of images processed by one GPU to a multiple of 4.
Since it is 4 images x 4 GPUs, I think the batch size should be at least 16.
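To illustrate the batch-size advice above: if the model is wrapped in nn.DataParallel (an assumption; the thread does not show the multi-GPU code), the batch is scattered across the visible GPUs, so 16 images on 4 GPUs means 4 per GPU, while a batch of 4 leaves only 1 per GPU:

```python
import torch
import torch.nn as nn

# Toy model only; whether training.py uses nn.DataParallel is an assumption.
model = nn.DataParallel(nn.Conv2d(3, 64, kernel_size=3, padding=1)).cuda()
batch = torch.randn(16, 3, 512, 512).cuda()
out = model(batch)  # with 4 visible GPUs, each one processes 16 // 4 = 4 images
print(out.shape)    # torch.Size([16, 64, 512, 512])
```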
I'll try it later. Thank you for your comments.
(Before, I tried with only 1 GPU by using CUDA_VISIBLE_DEVICES and batch size 4 for the 512 input size, with the source code not modified yet.)
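For completeness, pinning the run to a single GPU can also be done from Python, equivalent to prefixing the command with CUDA_VISIBLE_DEVICES=0 (the variable must be set before torch initializes CUDA):

```python
import os

# Hide all but the first GPU from PyTorch; set this before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # -> 1
```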
Again, I made a new conda env (Python 3.7, torch 1.7.0 and numpy 1.19.5) and tried '--batch_size 4 --resolution=512'
(this time I changed 'import pickle' to 'import pickle5 as pickle' in base_layer.py after installing pickle5 in the new env),
but the same error came out. Maybe the Colab environment is different from mine. Would you check your source code on a GPU machine or something?
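The import change described above is often written with a fallback, so the file still works in environments without pickle5 (assuming base_layer.py only needs the standard pickle API):

```python
# Prefer pickle5 when available (e.g. on Python < 3.8), otherwise fall back
# to the built-in pickle module.
try:
    import pickle5 as pickle
except ImportError:
    import pickle
```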
Sorry, I figured out what the problem was.
As you already know, I first trained the model to the end with the default values. Then, when I retrained the model with a different resolution and batch size, the above error occurred.
After that, I erased all of the first training results (all pickles) and trained with a batch size and resolution different from the defaults, and the same error as before did not occur. In the end, when rerunning the training code, the most recently trained pickles seem to be loaded, and that is when the same error as before appears.
Setting the --is_restore_model option in training.py to False skips loading the previously trained model.
Please try it, and attach an image here when a good image is generated.
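A minimal sketch of how such a flag typically gates checkpoint loading; the actual argument parsing and restore logic in training.py may differ, and the boolean handling below is only an assumption:

```python
import argparse

parser = argparse.ArgumentParser()
# --batch_size and --resolution appear earlier in this thread; the way
# --is_restore_model is parsed here is illustrative only.
parser.add_argument("--is_restore_model", type=lambda s: s.lower() == "true", default=True)
parser.add_argument("--batch_size", type=int, default=16)
parser.add_argument("--resolution", type=int, default=32)
args = parser.parse_args()

if args.is_restore_model:
    # Restoring pickles saved from a run with a different resolution/batch size
    # rebuilds layers with the old channel counts, producing the mismatch above.
    print("restoring the most recently trained model")
else:
    print("training from scratch")
```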