Recent advances in diffusion-based generative modeling have led to text-to-video (T2V) models that can generate high-quality videos conditioned on a text prompt. However, most of these T2V models produce single-scene video clips that depict an entity performing a particular action (e.g., 'a red panda climbing a tree'). Generating multi-scene videos is important because such videos are ubiquitous in the real world (e.g., 'a red panda climbing a tree' followed by 'the red panda sleeps on the top of the tree'). To generate multi-scene videos from a pretrained T2V model, we introduce the Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and the scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video on the representations of the first scene description (e.g., 'a red panda climbing a tree') and the second scene description (e.g., 'the red panda sleeps on the top of the tree'), respectively. As a result, the T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions while remaining visually consistent (e.g., in entity and background). Further, we finetune the pretrained T2V model on multi-scene video-text data using the TALC framework. The TALC-finetuned model outperforms the baseline methods by 15.5 points on the overall score, which averages visual consistency and text adherence under human evaluation.
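To make the time-aligned conditioning concrete, here is a minimal illustrative sketch of the idea, assuming frames are split into contiguous chunks (one chunk per scene) and each chunk is paired with the embedding of its own scene caption. The function and tensor names are placeholders, not the repo's actual implementation.

```python
import torch

def time_aligned_text_conditioning(frame_features, scene_text_embeds):
    """Illustrative TALC-style pairing (hypothetical helper, not the repo's API).

    frame_features:    (num_frames, d_visual) latents of the video being generated.
    scene_text_embeds: list of (seq_len, d_text) embeddings, one per scene caption.

    Frames are split into contiguous chunks, one chunk per scene, and each chunk
    is paired with its own scene description, so scene k attends to caption k
    instead of a single merged caption.
    """
    num_frames = frame_features.shape[0]
    num_scenes = len(scene_text_embeds)
    frame_chunks = torch.chunk(torch.arange(num_frames), num_scenes)

    pairs = []
    for frame_idx, text_embed in zip(frame_chunks, scene_text_embeds):
        # In the real model this pairing would feed the cross-attention layers of
        # the denoising U-Net; here we only return the (frames, caption) alignment.
        pairs.append((frame_features[frame_idx], text_embed))
    return pairs
```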
Examples
Scene 1: Superman is surfing on the waves.
Scene 2: Superman falls into the water.
Baseline (Merging Captions)
TALC (Ours)
Scene 1: Spiderman is surfing on the waves.
Scene 2: Darth Vader is surfing on the same waves.
Baseline (Merging Captions)
TALC (Ours)
Scene 1: A stuffed toy is lying on the road.
Scene 2: A person enters and picks the stuffed toy.
Baseline (Merging Captions)
TALC (Ours)
Scene 1: Red panda is moving in the forest.
Scene 2: The red panda spots a treasure chest.
Scene 3: The red panda finds a map inside the treasure chest.
Baseline (Merging Captions)
TALC (Ours)
Scene 1: A koala climbs a tree.
Scene 2: The koala eats the eucalyptus leaves.
Scene 3: The koala takes a nap.
We provide a sample command to generate multi-scene (n = 2) videos from the base ModelScopeT2V model using the TALC framework.
CUDA_VISIBLE_DEVICES=0 python inference.py --outfile test_scene.mp4 --model-name-path damo-vilab/text-to-video-ms-1.7b --talc --captions "koala is climbing a tree." "kangaroo is eating fruits."
In the above command, replacing --talc with --merge generates a separate video for each individual caption and outputs them as a single merged video.
To perform inference using the merging captions method, you can use:
CUDA_VISIBLE_DEVICES=0 python inference.py --outfile test_scene.mp4 --model-name-path damo-vilab/text-to-video-ms-1.7b --captions "koala is climbing a tree." "kangaroo is eating fruits."
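Conceptually, the merging-captions baseline conditions every frame of the video on a single prompt built from all scene descriptions. The sketch below is a rough standalone approximation using the public diffusers pipeline rather than the repo's inference.py, and it assumes the baseline simply concatenates the captions; how .frames is indexed depends on your diffusers version.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Rough approximation of the merging-captions baseline with the public diffusers
# pipeline; the repo's inference.py is the authoritative implementation.
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

captions = ["koala is climbing a tree.", "kangaroo is eating fruits."]
merged_prompt = " ".join(captions)  # every frame sees the same merged caption

result = pipe(merged_prompt, num_inference_steps=25, num_frames=16)
# Depending on the diffusers version, result.frames is either a list of frames or
# a batch of videos; adjust the indexing before exporting.
export_to_video(result.frames[0], "merged_captions.mp4")
```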
To generate multi-scene videos using the TALC-finetuned model, the command is:
CUDA_VISIBLE_DEVICES=4 python inference.py --outfile test_scene.mp4 --model-name-path talc_finetuned_modelscope_t2v --talc --captions "spiderman surfing in the ocean." "darth vader surfing in the ocean."
We also provide a file that maps each video segment to the number of video frames it contains, computed with OpenCV. This information is useful for finetuning.
{'video_segment': number_of_video_frames}
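As a rough sketch of how such a mapping can be computed with OpenCV (the directory, file names, and output path below are assumptions, not the repo's exact script):

```python
import json
import os

import cv2

def count_frames(video_path: str) -> int:
    """Return the number of frames in a video segment using OpenCV."""
    cap = cv2.VideoCapture(video_path)
    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    return num_frames

# Hypothetical directory of video segments; adapt the paths to your data layout.
segment_dir = "video_segments"
mapping = {
    fname: count_frames(os.path.join(segment_dir, fname))
    for fname in sorted(os.listdir(segment_dir))
    if fname.endswith(".mp4")
}

with open("segment_to_num_frames.json", "w") as f:
    json.dump(mapping, f, indent=2)
```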
Finetuning
We use Hugging Face Accelerate to finetune the model on multiple GPUs.
To set up Accelerate, run accelerate config in the terminal and use the following settings:
Set up the wandb directory by running wandb init in the terminal. If you want to disable wandb, uncomment os.environ["WANDB_DISABLED"] = "true" in train.py.
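For orientation, a typical Accelerate-wrapped training loop looks roughly like the sketch below. It is not the repo's train.py; the model, optimizer, and data are placeholders, and the wandb switch mirrors the option described above.

```python
import os

# Uncomment the next line to disable wandb logging, mirroring the switch in train.py.
# os.environ["WANDB_DISABLED"] = "true"

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder objects; in train.py these would be the T2V model, its optimizer,
# and the multi-scene video-text dataloader.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 8))
dataloader = DataLoader(dataset, batch_size=8)

accelerator = Accelerator()  # picks up the settings created by `accelerate config`
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    accelerator.backward(loss)  # handles mixed precision and multi-GPU gradient sync
    optimizer.step()
```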