TIGER-AI-Lab/ConsistI2V

Discussion on Computing Resources and Training Details


Hello, I am very interested in your work and really impressed with your demo. Could you share how many GPUs were used to train the diffusion model and how long training took? Additionally, the dataset is sampled from WebVid-10M, and I noticed that you sample only 16 frames from each video. How do you ensure that the sampled clips are sufficiently dynamic, and is this 16-frame sampling a tradeoff? Looking forward to your response!

Hi, the model was trained on 16 A100 80GB GPUs for two weeks. The 16 frames are sampled with a frame interval between 1 and 5, so that each clip spans a larger window (~2 seconds) of the original video. Please refer to the supplementary material of the paper for more details.
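
For readers who want to see what this kind of strided sampling could look like, here is a minimal sketch. It is not taken from the ConsistI2V codebase: the function name `sample_frame_indices`, its parameters, and the choice to draw the interval uniformly at random per clip are all assumptions made for illustration. Only the 16-frame clip length and the interval range of 1 to 5 come from the reply above.

```python
import random


def sample_frame_indices(num_video_frames, num_frames=16,
                         min_interval=1, max_interval=5, rng=None):
    """Pick evenly strided frame indices from a video (hypothetical helper).

    Assumption: the interval is drawn uniformly from
    [min_interval, max_interval], capped so the clip fits in the video.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    rng = rng or random.Random()

    # Largest stride for which num_frames strided frames still fit.
    max_fit = (num_video_frames - 1) // (num_frames - 1)
    if max_fit < min_interval:
        raise ValueError("video is too short for the requested clip")

    interval = rng.randint(min_interval, min(max_interval, max_fit))

    # Number of source frames covered by the clip, endpoints included.
    span = (num_frames - 1) * interval + 1
    start = rng.randint(0, num_video_frames - span)

    return [start + i * interval for i in range(num_frames)]


if __name__ == "__main__":
    # Example: a 300-frame source video, seeded for repeatability.
    print(sample_frame_indices(300, rng=random.Random(0)))
```

With an interval of 5, a 16-frame clip covers 76 source frames, whereas an interval of 1 covers only 16, so a larger interval stretches the same number of frames over a longer stretch of the video.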