lucidrains/magvit2-pytorch

train on video dataset

Y-ichen opened this issue · 4 comments

Thanks a lot for your implementation! Can this tokenizer be trained on a video dataset in the current version? I found that its recon_loss is very large and does not converge, and discr_loss does not converge either.

Here are the losses on the video dataset:

[screenshot: loss curves after 600 steps]

your x-axis, is that number of steps? you only did 600 training steps?

Yes, these are the results after only 600 training steps. I trained MAGVIT unconditionally on the UCF101 dataset.

During training, I noticed the initial recon_loss was very large (around 2e+4), so I checked the tensor value ranges used when computing recon_loss between the input video and the reconstructed video. I found the input video values were between 0.0 and 255.0, while the reconstructed video values were roughly between -1.0 and 1.0.
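
For reference, here is a minimal sketch of the rescaling I applied before feeding videos to the tokenizer (the helper name and shape convention are just for illustration, not from the repo):

```python
import torch

def normalize_video(video: torch.Tensor) -> torch.Tensor:
    # video: uint8 (or float) tensor with values in [0, 255],
    # e.g. shape (batch, channels, frames, height, width)
    # rescale to [-1, 1] so it matches the range of the reconstructed video
    return video.float() / 127.5 - 1.0

# quick check with a dummy batch of uint8 frames
video = torch.randint(0, 256, (1, 3, 16, 128, 128), dtype = torch.uint8)
video = normalize_video(video)
print(video.min().item(), video.max().item())  # roughly -1.0 and 1.0
```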

Therefore, I additionally normalized the data when loading videos, rescaling the tensor range to -1.0 to 1.0. With this, the initial recon_loss is around 0.3, but the discr_loss is still around 2.0, much larger than recon_loss. I'm not sure whether this imbalance affects training, so I shrank discr_loss a bit by setting a discr_weight of 0.1 to balance it against recon_loss (the initial losses then become roughly recon_loss = 0.3 and discr_loss = 0.2). Here are my new results after 3k training steps with these settings:
[screenshot: loss curves after 3k steps]
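
To be clear about what I mean by the weighting: it just scales the adversarial term before it is added to the total loss. A rough sketch with the initial values above (illustrative only; I'm not certain this matches the repo's actual loss code):

```python
import torch

# initial loss values observed above (illustrative)
recon_loss = torch.tensor(0.3)
discr_loss = torch.tensor(2.0)

discr_weight = 0.1  # down-weights the discriminator term

total_loss = recon_loss + discr_weight * discr_loss
print(total_loss.item())  # 0.3 + 0.1 * 2.0 ≈ 0.5
```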

I'm now retraining with the settings above - should I increase the training steps to at least 20k? And should I keep this normalization of the loaded video tensor range?

Fixed

how did you do it?