train on video dataset
Y-ichen opened this issue · 4 comments
Is your x-axis the number of training steps? Did you only train for 600 steps?
Yes, these are the results after only 600 training steps. I trained magvit unconditionally on the UCF101 dataset.
During training, I noticed the initial recon_loss value was very large (2e+4), so I checked the tensor value ranges used when computing recon_loss between the video and the reconstructed video. The video values were in [0.0, 255.0], while the reconstructed video values were roughly in [-1.0, 1.0].
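For reference, a quick way to confirm this kind of range mismatch is to print the min/max of both tensors right before the loss is computed. The tensor names and shapes below are hypothetical stand-ins, not taken from the actual training code:

```python
import torch

def check_range(name: str, t: torch.Tensor) -> None:
    # Print min/max so a scale mismatch (e.g. [0, 255] vs. [-1, 1]) stands out.
    print(f"{name}: min={t.min().item():.3f} max={t.max().item():.3f}")

# Dummy tensors standing in for a raw video batch (B, C, T, H, W) and the
# decoder output; in practice you would pass the real tensors instead.
check_range("video", torch.rand(2, 3, 16, 64, 64) * 255.0)
check_range("recon", torch.rand(2, 3, 16, 64, 64) * 2.0 - 1.0)
```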
Therefore, I additionally normalized the data when loading videos, rescaling the tensor range to [-1.0, 1.0]. With this, the initial recon_loss is around 0.3, but discr_loss is still around 2.0, much larger than recon_loss. I wasn't sure whether this imbalance would hurt training, so I scaled discr_loss down with a discr_weight of 0.1 to balance it against recon_loss (the initial losses then become roughly recon_loss = 0.3, discr_loss = 0.2). Here are my new results after 3k training steps with these settings:
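In case it helps anyone, here is a minimal sketch of the two changes described above. The rescaling is the standard linear map from [0, 255] to [-1, 1]; `discr_weight` mirrors the weighting mentioned here, but how it is actually wired into the trainer depends on the codebase, and the loss values below are placeholders for illustration:

```python
import torch

def normalize_video(video: torch.Tensor) -> torch.Tensor:
    # Map pixel values from [0.0, 255.0] to [-1.0, 1.0] so the input matches
    # the value range of the reconstructed video.
    return video / 127.5 - 1.0

# Down-weight the discriminator term so it starts on the same scale as
# recon_loss (0.1 * 2.0 = 0.2, vs. recon_loss around 0.3).
discr_weight = 0.1
recon_loss = torch.tensor(0.3)  # placeholder values, for illustration only
discr_loss = torch.tensor(2.0)
total_loss = recon_loss + discr_weight * discr_loss
```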
I'm retraining with these settings now. Should I increase the number of training steps to at least 20k? And is it correct to apply this normalization to the loaded video tensors?
Fixed
How did you do it?