jgkwak95/SURF-GAN

Question about the speed of reproducing the code of this paper


In practice, reproducing the results takes much longer than the paper suggests.

The steps I took are as follows:
1. Download the img_align_celeba.zip dataset and unzip it into the specified directory.
2. Run the command "python train_surf.py --output_dir third --curriculum CelebA"
The results I got are as follows.

The output shows that completing training will take about 813 hours. I repeated the procedure and got the same estimate.

Hi! Thanks for your attention.

You don't have to worry about "total progress", just check the "stage" (iteration)!
The "total progress" above is from our baseline pi-GAN.
Just like you, when I first saw this in pi-GAN, I thought training would never end :(

In my experience, a total of 140,000 iterations (60,000 at 32x32 and 80,000 at 64x64) was enough to train our model at 64x64.
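
The stage schedule is defined in curriculum.py using pi-GAN-style milestone-iteration keys. As a rough sketch of how the two stages fit together (the keys and values below are placeholders, not the actual settings in the repo):

```python
# Illustrative pi-GAN-style curriculum for CelebA.
# All keys and values here are placeholders, NOT the actual settings in the repo.
CelebA = {
    # Stage 1: train at 32x32 until iteration 60,000
    0: {'img_size': 32, 'batch_size': 28, 'batch_split': 2,
        'gen_lr': 6e-5, 'disc_lr': 2e-4},
    # Stage 2: switch to 64x64 and train for another 80,000 iterations
    int(60e3): {'img_size': 64, 'batch_size': 14, 'batch_split': 4,
                'gen_lr': 6e-5, 'disc_lr': 2e-4},
    # Training ends here: 60,000 + 80,000 = 140,000 iterations in total
    int(140e3): {},
}
```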

-Jeong-gi

Hi! Thanks for your answer.

I have another question now that I am watching the iterations. When the first stage finished and training was about to enter the next stage, it threw this error:

RuntimeError: CUDA out of memory. Tried to allocate 336.00 MiB (GPU 0; 23.70 GiB total capacity; 18.86 GiB already allocated; 36.56 MiB free; 19.51 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

But the first stage ran fine. I have tried modifying the batch size but got no improvement. Each GPU has 24,000 MiB of memory, and the whole experiment used 5 GPUs.

Also, does 'a total of 140,000 iterations' mean I need to run 60,000 iterations (one stage) with the image size set to 32x32 and then 80,000 iterations with the image size set to 64x64?

Thanks again for your answer.

If you hit OOM, you need to reduce your per-GPU batch size in curriculum.py.
In addition, for multi-GPU training please refer to pi-GAN, because we modified some parts of the pi-GAN implementation for single-GPU training.
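
For example, assuming the pi-GAN-style curriculum sketched above (the key names may differ in the actual file), lowering the memory footprint of the 64x64 stage could look like:

```python
# Hypothetical edit in curriculum.py for the 64x64 stage (key names illustrative):
# a smaller batch_size reduces per-GPU memory, and a larger batch_split
# processes each batch in more, smaller chunks per optimizer step.
CelebA[int(60e3)]['batch_size'] = 7    # e.g. halve the batch at 64x64
CelebA[int(60e3)]['batch_split'] = 4   # split each batch into more sub-batches
```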

Also, does 'a total of 140,000 iterations' mean I need to run 60,000 iterations (one stage) with the image size set to 32x32 and then 80,000 iterations with the image size set to 64x64?

Yes.