hotshotco/Hotshot-XL

About the training dataset

Kevin-1342 opened this issue · 1 comment

Regarding the training dataset, may I ask how you collected tens of millions of clips? My initial understanding was that the label for a long video may not be suitable for the short clips cut from it. Many thanks.

Yes, creating a text-to-video generator is a challenge: public text-video datasets are few and far between, and their clips typically have non-uniform lengths, low resolution, encoding artifacts, and motion blur.

Bootstrapping off a text-to-image foundation model takes advantage of the knowledge it has already learned from the more widely available text-image datasets, and reframes text-to-video generation more narrowly as a problem of temporal understanding.
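
To make the bootstrapping idea concrete, here is a minimal sketch of one common way to set it up: keep the parameters inherited from the text-to-image model frozen and train only the newly added temporal ones. This reply does not say whether the image weights were actually frozen, so treat that as an assumption. The `Param` class, the `freeze_pretrained` helper, and the parameter names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Hypothetical stand-in for a model parameter (illustration only)."""
    name: str
    pretrained: bool        # True if inherited from the text-to-image model
    trainable: bool = True


def freeze_pretrained(params):
    """Freeze inherited parameters and return the names left trainable.

    Assumption: only the newly added temporal parameters are trained.
    """
    for p in params:
        p.trainable = not p.pretrained
    return [p.name for p in params if p.trainable]


# Hypothetical parameter names, not taken from any real checkpoint.
params = [
    Param("spatial_attn.weight", pretrained=True),
    Param("spatial_conv.weight", pretrained=True),
    Param("temporal_attn.weight", pretrained=False),
]
trainable_names = freeze_pretrained(params)
```

Under this assumption the video data only has to teach the temporal parameters how frames relate to each other, since appearance is already covered by the image model.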