mlpc-ucsd/Patch-DM

Estimated Number of Epochs Required for Training

Closed this issue · 6 comments

Hi Zheng and Mengqi,

Thank you for your amazing work and for making your code public! Could you kindly provide some insights on my experiments below?

I am training on a truncated CelebA dataset, which only has 5000 256x256 images, and I am using CLIP embeddings. My batch size is 96.

  1. How many epochs would it take for train.py to converge?
  2. How many epochs would it take for train_latent.py to converge?
  3. Should I run train.py or train_latent.py for longer when the generated images have grid-like artifacts?

For train.py, I am at 12M steps (batch size of 96 x number of iterations), and the sampled training images look pretty good:
Sampled:
[sampled training images]

Real:
[real training images]

However, the unconditionally generated images do not make sense. For train_latent.py, I am at 36M steps (batch size of 1 x number of iterations); I used a batch size of 96 (same as the original GitHub code).
Loss:
[loss curve]

Testing:
[generated test images]

I did not intend to close the issue. Any suggestions or insights would be appreciated! Thank you! @zh-ding @Mq-Zhang1

Hello, thanks for your interest. Can you share more details on how you train the latent stage? It looks a bit weird to me, since the images from training and testing should look similar. I can see from your results that the latent training stage doesn't learn the data latent distribution at all, so I'm wondering if there is a mismatch somewhere in this process.
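If it helps, one quick way to check for such a mismatch is to compare the statistics of the latents used during latent training against the latents that reach the decoder at sampling time. A minimal sketch, assuming you dump both sets as tensors somewhere convenient (the file names below are hypothetical):

import torch

# Hypothetical dumps: (N, latent_dim) tensors saved with torch.save from
# the latent training loop and the sampling code respectively.
train_latents = torch.load("train_latents.pt")
sampled_latents = torch.load("sampled_latents.pt")

# If the two distributions don't match (e.g. one set is normalized and the
# other is not), the means/stds will be far apart.
print("train  mean/std:", train_latents.mean().item(), train_latents.std().item())
print("sample mean/std:", sampled_latents.mean().item(), sampled_latents.std().item())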

BTW, why is the batch size only 1 for the latent training?

Hi @zh-ding,

Thank you so much for the prompt reply. Yes, I agree that the latent code was not sampled properly to match the training latent code distribution. Sorry for the confusion; I did not use a batch size of 1. The batch size during the latent stage was 128 (same as the original GitHub code).

For the latent training stage, I set model_path to the last checkpoint from the training stage.
I use the following command:
python train_latent.py --model_path ./checkpoints/exp_clip/last.ckpt --name train_latent_clip

I keep most of the code the same. However, I did modify the code around line 432 of experiment.py to suppress the error mentioned here (#8 (comment)):

if self.conf.train_mode.require_dataset_infer():
    # latent training mode: the batch carries no raw images
    imgs = None
    idx = None
else:
    imgs = batch['img']
    idx = batch['index']
    # only log samples when real images are available, so the
    # latent stage no longer hits the error from issue #8
    self.log_sample(x_start=imgs, step=self.global_step, idx=idx)

After the above modification, TensorBoard no longer logs images during the latent stage.

Thus, I used test.py for inference and generated those testing images.
The command I used for testing is as follows:
python test.py --batch_size 1 --patch_size 64 --output_dir ./output_images_clip --image_size 256x256 --img_num 5 --full_path ./checkpoints/train_latent_clip/last.ckpt

P.S. I wonder if the number of training images is simply not enough for the latent code sampler to learn the latent distribution. I only used 5000 256x256 CelebA images for training. Have you ever trained on around 5000 images with any luck? In your paper, I think the smallest dataset you used (Nature) has 21K images.

Please let me know if you have any insights or suggestions! I really appreciate it! Thank you.

Hi @WeiyunJiang,

Thank you for all the details provided!

The problem may be caused by conf.latent_znormalize being set to True, so the latent stage learns a normalized latent distribution instead of the original one. Adding conf.latent_znormalize = False in train_latent.py should fix the problem.
[image]
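Roughly, the override is just one extra line wherever train_latent.py builds its config; a minimal sketch (the constructor name below is a placeholder for whatever conf object the script already creates):

# In train_latent.py, right after the conf object is built
# (make_latent_conf() is a placeholder for the existing setup):
conf = make_latent_conf()
conf.latent_znormalize = False  # learn the raw latent distribution, not the z-normalized one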
We didn't experiment on smaller datasets. The approximate number of training epochs for the 21K Nature dataset is 1882, but at around epoch 235 we could already observe reasonable results (with the batch size set to 256).

I will fix this normalization config problem in the latent training and sampling code right away to make it clearer. Thanks for raising this! Please feel free to let me know if the results are still weird.

Hi @Mq-Zhang1 and @zh-ding,

Thanks again for the prompt response. After the fix, it works like a charm. THANK YOU! Fantastic work! :)