thunil/TecoGAN

Training isn't starting with test case 3 or 4


Hi, I downloaded and prepared the dataset. When I choose option 3 or 4 to train the network, all it does is run one round of evaluation on the calendar dataset and then quit. Any help with this is appreciated. Here's my output from runGan.py 4:

Testing test case 4
Delete existing folder ex_FRVSR06-23-14/?(Y/N)
y
ex_FRVSR06-23-14_1/
Using TensorFlow backend.
Preparing train_data
[Config] Use random crop
[Config] Use random crop
[Config] Use random flip
Sequenced batches: 27610, sequence length: 10
Preparing validation_data
[Config] Use random crop
[Config] Use random crop
[Config] Use random flip
Sequenced batches: 2860, sequence length: 10
tData count = 27610, steps per epoch 27610
Finish building the network.
Scope generator:
Variable: generator/generator_unit/input_stage/conv/Conv/weights:0
Shape: [3, 3, 51, 64]
Variable: generator/generator_unit/input_stage/conv/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_1/conv_1/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_1/conv_1/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_1/conv_2/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_1/conv_2/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_2/conv_1/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_2/conv_1/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_2/conv_2/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_2/conv_2/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_3/conv_1/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_3/conv_1/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_3/conv_2/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_3/conv_2/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_4/conv_1/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_4/conv_1/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_4/conv_2/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_4/conv_2/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_5/conv_1/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_5/conv_1/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_5/conv_2/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_5/conv_2/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_6/conv_1/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_6/conv_1/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_6/conv_2/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_6/conv_2/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_7/conv_1/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_7/conv_1/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_7/conv_2/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_7/conv_2/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_8/conv_1/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_8/conv_1/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_8/conv_2/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_8/conv_2/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_9/conv_1/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_9/conv_1/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_9/conv_2/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_9/conv_2/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_10/conv_1/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_10/conv_1/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/resblock_10/conv_2/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/resblock_10/conv_2/Conv/biases:0
Shape: [64]
Variable: generator/generator_unit/conv_tran2highres/conv_tran1/Conv2d_transpose/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/conv_tran2highres/conv_tran1/Conv2d_transpose/biases:0
Shape: [64]
Variable: generator/generator_unit/conv_tran2highres/conv_tran2/Conv2d_transpose/weights:0
Shape: [3, 3, 64, 64]
Variable: generator/generator_unit/conv_tran2highres/conv_tran2/Conv2d_transpose/biases:0
Shape: [64]
Variable: generator/generator_unit/output_stage/conv/Conv/weights:0
Shape: [3, 3, 64, 3]
Variable: generator/generator_unit/output_stage/conv/Conv/biases:0
Shape: [3]
total size: 843587
Scope fnet:
Variable: fnet/autoencode_unit/encoder_1/conv_1/Conv/weights:0
Shape: [3, 3, 6, 32]
Variable: fnet/autoencode_unit/encoder_1/conv_1/Conv/biases:0
Shape: [32]
Variable: fnet/autoencode_unit/encoder_1/conv_2/Conv/weights:0
Shape: [3, 3, 32, 32]
Variable: fnet/autoencode_unit/encoder_1/conv_2/Conv/biases:0
Shape: [32]
Variable: fnet/autoencode_unit/encoder_2/conv_1/Conv/weights:0
Shape: [3, 3, 32, 64]
Variable: fnet/autoencode_unit/encoder_2/conv_1/Conv/biases:0
Shape: [64]
Variable: fnet/autoencode_unit/encoder_2/conv_2/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: fnet/autoencode_unit/encoder_2/conv_2/Conv/biases:0
Shape: [64]
Variable: fnet/autoencode_unit/encoder_3/conv_1/Conv/weights:0
Shape: [3, 3, 64, 128]
Variable: fnet/autoencode_unit/encoder_3/conv_1/Conv/biases:0
Shape: [128]
Variable: fnet/autoencode_unit/encoder_3/conv_2/Conv/weights:0
Shape: [3, 3, 128, 128]
Variable: fnet/autoencode_unit/encoder_3/conv_2/Conv/biases:0
Shape: [128]
Variable: fnet/autoencode_unit/decoder_1/conv_1/Conv/weights:0
Shape: [3, 3, 128, 256]
Variable: fnet/autoencode_unit/decoder_1/conv_1/Conv/biases:0
Shape: [256]
Variable: fnet/autoencode_unit/decoder_1/conv_2/Conv/weights:0
Shape: [3, 3, 256, 256]
Variable: fnet/autoencode_unit/decoder_1/conv_2/Conv/biases:0
Shape: [256]
Variable: fnet/autoencode_unit/decoder_2/conv_1/Conv/weights:0
Shape: [3, 3, 256, 128]
Variable: fnet/autoencode_unit/decoder_2/conv_1/Conv/biases:0
Shape: [128]
Variable: fnet/autoencode_unit/decoder_2/conv_2/Conv/weights:0
Shape: [3, 3, 128, 128]
Variable: fnet/autoencode_unit/decoder_2/conv_2/Conv/biases:0
Shape: [128]
Variable: fnet/autoencode_unit/decoder_3/conv_1/Conv/weights:0
Shape: [3, 3, 128, 64]
Variable: fnet/autoencode_unit/decoder_3/conv_1/Conv/biases:0
Shape: [64]
Variable: fnet/autoencode_unit/decoder_3/conv_2/Conv/weights:0
Shape: [3, 3, 64, 64]
Variable: fnet/autoencode_unit/decoder_3/conv_2/Conv/biases:0
Shape: [64]
Variable: fnet/autoencode_unit/output_stage/conv1/Conv/weights:0
Shape: [3, 3, 64, 32]
Variable: fnet/autoencode_unit/output_stage/conv1/Conv/biases:0
Shape: [32]
Variable: fnet/autoencode_unit/output_stage/conv2/Conv/weights:0
Shape: [3, 3, 32, 2]
Variable: fnet/autoencode_unit/output_stage/conv2/Conv/biases:0
Shape: [2]
total size: 1745506
The first run takes longer time for training data loading...
Save initial checkpoint, before any training
[testWhileTrain] step 0:
python3 main.py --output_dir ex_FRVSR06-23-14_1/train/ --summary_dir ex_FRVSR06-23-14_1/train/ --mode inference --num_resblock 10 --checkpoint ex_FRVSR06-23-14_1/model-0 --cudaID 0 --input_dir_LR ./LR/calendar/ --output_pre  --output_name 000000000 --input_dir_len 10
Using TensorFlow backend.
input shape: [1, 144, 180, 3]
output shape: [1, 576, 720, 3]
Finish building the network
Loading weights from ckpt model
Frame evaluation starts!!
Warming up 5
Warming up 4
Warming up 3
Warming up 2
Warming up 1
saving image 000000000_0001
saving image 000000000_0002
saving image 000000000_0003
saving image 000000000_0004
saving image 000000000_0005
saving image 000000000_0006
saving image 000000000_0007
saving image 000000000_0008
saving image 000000000_0009
saving image 000000000_0010
total time 1.9974193572998047, frame number 15


It quits after that without any errors.

Thanks @tom-doerr, I compared main.py and runGan.py against your repo and they are identical. But later it occurred to me that the cause could be my mixed-GPU environment: I have 2080 Tis and 1080 Tis in the same machine, and for some reason the job wasn't running on the 2080 Ti. I specified a 1080 Ti's ID instead and it worked. Thanks!
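
For anyone who hits the same symptom, a minimal sketch of that workaround: pin the process to a single, known-good GPU before TensorFlow initializes. The device index used here is only an example; pick the ID that nvidia-smi reports for the card you want, or pass it through the --cudaID option that runGan.py already forwards to main.py.

import os

# Make only one CUDA device visible (e.g. a 1080 Ti) before TensorFlow is imported,
# so the job cannot silently land on a GPU it fails to run on.
# The index "1" is an example; use the ID shown by nvidia-smi for your card.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf
from tensorflow.python.client import device_lib

# Sanity check: list the devices TensorFlow can actually see.
print(device_lib.list_local_devices())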

By the way @tom-doerr, did you run the training on multiple GPUs using your Docker image?