Has anyone successfully trained this model?

Question

Jihun999 opened this issue 10 months ago · 34 comments

I have been trying to train this model for a few days, but the reconstruction results are always abnormal. If anyone has trained it successfully, could you share some training tips?

Answer 1 · 2024-02-16T17:10:59.000Z

The reconstructed images look like solid-color images.

Answer 2 · 2024-02-20T02:35:59.000Z

Can you show the reconstruction images after training?

Answer 3 · 2024-02-20T05:57:44.000Z

It always looks like this image.

Answer 4 · 2024-02-20T11:38:31.000Z

@bridenmj How many epochs did you train for? Are you working on ImageNet pretraining?

Answer 5 · 2024-02-20T13:05:36.000Z

Yes, I'm working on ImageNet pretraining, and training has passed 12,000 steps. The output image always looks the same. I then tried LFQ in my own autoencoder, and training worked well, so it looks like something is wrong in the magvit2 model architecture.
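For readers unfamiliar with LFQ (lookup-free quantization), here is a minimal sketch of the quantization step as described in the MAGVIT-v2 paper: each latent dimension is quantized to -1 or +1 by its sign, and the token index is the integer whose bits mark the positive dimensions. This is plain Python for illustration only; the function name `lfq_quantize` is hypothetical and this is not the code from this repository.

```python
def lfq_quantize(latent):
    """Quantize one latent vector with lookup-free quantization (sketch).

    Returns the quantized codes (-1.0 or +1.0 per dimension) and the
    integer token index built from the sign bits.
    """
    # Each dimension is quantized independently by its sign.
    codes = [1.0 if x > 0 else -1.0 for x in latent]
    # Dimension i contributes 2**i to the index when it is positive.
    index = sum(1 << i for i, x in enumerate(latent) if x > 0)
    return codes, index


codes, index = lfq_quantize([0.3, -1.2, 0.7, -0.1])
print(codes, index)
```

The sign function has zero gradient almost everywhere, so training implementations typically pass gradients through with a straight-through estimator; that part is omitted here.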

Answer 6 · 2024-02-20T13:22:38.000Z

Actually, I reimplemented the model structure to align with the magvit2 paper. However, I find that the LFQ loss is negative and the reconstruction loss converges easily with or without the GAN. The reconstructed images are blurry but not a solid color. What about you? @Jihun999
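On the negative LFQ loss: a rough sketch of why this can happen, assuming the auxiliary loss has the entropy form from the MAGVIT-v2 paper (mean per-sample entropy minus the entropy of the averaged code distribution). The helper names and the `diversity_weight` parameter are hypothetical, and the real implementation may weight or compute the terms differently.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def entropy_aux_loss(probs, diversity_weight=1.0):
    """Mean per-sample entropy minus weighted entropy of the batch average.

    probs: list of per-sample distributions over codes (each sums to 1).
    """
    n = len(probs)
    k = len(probs[0])
    per_sample = sum(entropy(p) for p in probs) / n
    avg = [sum(p[j] for p in probs) / n for j in range(k)]
    return per_sample - diversity_weight * entropy(avg)


# Confident assignments that are spread evenly over four codes.
diverse = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.01, 0.97],
]
# Every sample picks the same code (collapsed codebook usage).
collapsed = [[0.97, 0.01, 0.01, 0.01]] * 4

loss_diverse = entropy_aux_loss(diverse)
loss_collapsed = entropy_aux_loss(collapsed)
print(loss_diverse, loss_collapsed)
```

With `diversity_weight=1.0`, entropy is concave, so this term is never positive: it is most negative when assignments are confident and codes are used evenly, and it approaches zero when usage collapses. Under that assumption, a negative value alone does not point to a bug.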

Answer 7 · 2024-02-25T13:07:00.000Z

Ok, I will reimplement the model first. Thank you for your comment.

Answer 8 · 2024-03-06T08:31:07.000Z

> Actually, I reimplemented the model structure to align with the magvit2 paper. However, I find that the LFQ loss is negative and the reconstruction loss converges easily with or without the GAN. The reconstructed images are blurry but not a solid color. What about you? @Jihun999

Hey, is it possible to share the code modification for model architecture alignment? Thanks a lot!

Answer 9 · 2024-04-25T13:59:37.000Z

Someone I know has trained it successfully.

Answer 10 · 2024-04-27T14:10:31.000Z

Wow, may I ask who did it?

Answer 11 · 2024-05-15T04:12:25.000Z

@RobertLuo1 @Jihun999 @lucidrains If you have successfully trained this model, would you be willing to share the pretrained weights and the modified model code?

Answer 12 · 2024-05-16T19:44:57.000Z

Hello there,
Thanks @lucidrains for your work! I have trained successfully on toy data (both images and video) with the code in this fork: https://github.com/vinyesm/magvit2-pytorch/blob/trainvideo/examples/train-on-video.py, using this video data: https://huggingface.co/datasets/mavi88/phys101_frames/tree/main. What seemed to fix the issue was to stop using accelerate (I only train on one GPU).

I tried MSE alone and then with the other losses, and also with and without the attend_space layers. All configurations work, but I did not try to tune hyperparameters.

Answer 13 · 2024-05-16T22:59:08.000Z

Thank you for sharing this, Marina! I'll see if I can find the bug, and if worst comes to worst, I can always rewrite the training code in PyTorch Lightning.

Answer 14 · 2024-06-17T06:21:39.000Z

Hi, we have recently put a lot of effort into training the Magvit2 tokenizer, and we have now open-sourced a tokenizer trained on ImageNet. Feel free to use it. The project page is https://github.com/TencentARC/Open-MAGVIT2. Thanks so much, @lucidrains, for your reference code and discussions!

Answer 15 · 2024-07-23T01:08:49.000Z

Hey @lucidrains, I trained a MAGVIT2 tokenizer without modifying your implementation of the accelerate framework. As others have experienced, I initially saw just a solid block in the results/sampled.x.gif files. However, upon loading the model weights from my most recent checkpoint, I was able to get pretty good reconstructions in a sample script that I wrote that performs inference without using the accelerate framework. Additionally, the reconstruction MSE scores were consistent with the ones observed in your training script. This means that whatever bug others are experiencing is not the result of flawed model training, but rather something going wrong with the gif rendering.

Note: the first file is the GIF saved in the results folder. The ground-truth frames have a weird colour scheme because I normalized the frame pixels to the range [-1, 1]. The second file is a reconstructed frame from my inference script. MSE was ~0.011 after training on a V100 for 5 hours.
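As an illustration of the normalization point above, here is a minimal sketch of mapping frames from [-1, 1] back to 8-bit pixel values before saving them, along with an MSE check between a frame and its reconstruction. It assumes frames are nested lists of floats; the function names are hypothetical, and this is not the repository's rendering code or a confirmed fix for the GIF issue.

```python
def denormalize_to_uint8(frame):
    """Map pixel values from [-1, 1] to integers in [0, 255].

    Values outside [-1, 1] are clamped so they cannot wrap around
    when written to an 8-bit image format.
    """
    return [
        [min(255, max(0, round((v + 1.0) * 127.5))) for v in row]
        for row in frame
    ]


def mse(a, b):
    """Mean squared error between two frames of the same shape."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


original = [[-1.0, 1.0], [0.5, -0.5]]
reconstruction = [[-0.9, 0.9], [0.5, -0.5]]

print(denormalize_to_uint8(original))
print(mse(original, reconstruction))
```

Comparing MSE in the normalized space, as in the comment above, keeps the number comparable with the value logged during training; the conversion to [0, 255] is only needed for display.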