About the training dataset
Kevin-1342 opened this issue · 1 comment
Kevin-1342 commented
Regarding the training dataset, may I ask how you collected tens of millions of clips? My initial understanding was that the label for a long video may not be suitable for short video clips. Many thanks.
aakashs commented
Yes, creating a text-to-video generator is a challenge: public text-video datasets are few and far between, and their clips typically have non-uniform lengths, low resolution, encoding artifacts, and motion blur.
Bootstrapping off a text-to-image foundation model takes advantage of existing knowledge from the more widely available text-image datasets and reframes text-to-video generation more narrowly as a problem of temporal understanding.
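
To make the bootstrapping idea concrete, here is a toy sketch in plain Python. It is for illustration only and is not necessarily how this project is implemented: `spatial_layer`, `TemporalMix`, and `video_block` are hypothetical names, and the "pretrained" layer is a stand-in function. It shows one common pattern, where a frozen per-frame image layer is followed by a new cross-frame layer initialized to the identity, so the video model starts out behaving exactly like the image model and only the temporal part has to be learned.

```python
from typing import List

Frame = List[float]  # hypothetical: one flattened feature vector per frame


def spatial_layer(frame: Frame) -> Frame:
    """Stand-in for a frozen, pretrained text-to-image layer (per frame)."""
    return [2.0 * x + 1.0 for x in frame]


class TemporalMix:
    """Hypothetical trainable layer that mixes features across frames.

    weights[t][s] is how much output frame t draws from input frame s.
    """

    def __init__(self, num_frames: int) -> None:
        # Identity initialization: each frame depends only on itself,
        # so the pretrained per-frame behavior is preserved at the start.
        self.weights = [
            [1.0 if t == s else 0.0 for s in range(num_frames)]
            for t in range(num_frames)
        ]

    def __call__(self, frames: List[Frame]) -> List[Frame]:
        dim = len(frames[0])
        n = len(frames)
        return [
            [sum(self.weights[t][s] * frames[s][d] for s in range(n))
             for d in range(dim)]
            for t in range(n)
        ]


def video_block(frames: List[Frame], temporal: TemporalMix) -> List[Frame]:
    """Apply the frozen image layer per frame, then mix across time."""
    per_frame = [spatial_layer(f) for f in frames]
    return temporal(per_frame)


frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
temporal = TemporalMix(num_frames=len(frames))
out = video_block(frames, temporal)
```

With the identity initialization, `out` matches what the image layer alone would produce for each frame; training would then adjust only `temporal.weights` so that frames inform one another.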