Recent advances in diffusion-based generative modeling have led to text-to-video (T2V) models that can generate high-quality videos conditioned on a text prompt. However, most of these T2V models produce single-scene video clips that depict an entity performing a particular action (e.g., 'a red panda climbing a tree'). Generating multi-scene videos is important because such videos are ubiquitous in the real world (e.g., 'a red panda climbing a tree' followed by 'the red panda sleeps on the top of the tree'). To generate multi-scene videos from a pretrained T2V model, we introduce the Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and the scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video on the representations of the first scene description (e.g., 'a red panda climbing a tree') and the second scene description (e.g., 'the red panda sleeps on the top of the tree'), respectively. As a result, the T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions while remaining visually consistent (e.g., in entity and background). Further, we finetune the pretrained T2V model on multi-scene video-text data using the TALC framework. The TALC-finetuned model outperforms the baseline methods by 15.5 points on the overall score, which averages visual consistency and text adherence under human evaluation.
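To make the time-aligned conditioning concrete, here is a minimal illustrative sketch of the idea, assuming frames are split into contiguous chunks (one chunk per scene) and each chunk is paired with the embedding of its own scene caption. The function and tensor names are placeholders, not the repo's actual implementation.

```python
import torch

def time_aligned_text_conditioning(frame_features, scene_text_embeds):
    """Illustrative TALC-style pairing (hypothetical helper, not the repo's API).

    frame_features:    (num_frames, d_visual) latents of the video being generated.
    scene_text_embeds: list of (seq_len, d_text) embeddings, one per scene caption.

    Frames are split into contiguous chunks, one chunk per scene, and each chunk
    is paired with its own scene description, so scene k attends to caption k
    instead of a single merged caption.
    """
    num_frames = frame_features.shape[0]
    num_scenes = len(scene_text_embeds)
    frame_chunks = torch.chunk(torch.arange(num_frames), num_scenes)

    pairs = []
    for frame_idx, text_embed in zip(frame_chunks, scene_text_embeds):
        # In the real model this pairing would feed the cross-attention layers of
        # the denoising U-Net; here we only return the (frames, caption) alignment.
        pairs.append((frame_features[frame_idx], text_embed))
    return pairs
```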
Examples
Scene 1: Superman is surfing on the waves.
Scene 2: Superman falls into the water.
Baseline (Merging Captions)
TALC (Ours)
Scene 1: Spiderman is surfing on the waves.
Scene 2: Darth Vader is surfing on the same waves.
Baseline (Merging Captions)
TALC (Ours)
Scene 1: A stuffed toy is lying on the road.
Scene 2: A person enters and picks the stuffed toy.
Baseline (Merging Captions)
TALC (Ours)
Scene 1: Red panda is moving in the forest.
Scene 2: The red panda spots a treasure chest.
Scene 3: The red panda finds a map inside the treasure chest.
Baseline (Merging Captions)
TALC (Ours)
Scene 1: A koala climbs a tree.
Scene 2: The koala eats the eucalyptus leaves.
Scene 3: The koala takes a nap.
We provide a sample command to generate multi-scene (n = 2) videos from the base ModelScopeT2V model using the TALC framework.
CUDA_VISIBLE_DEVICES=0 python inference.py --outfile test_scene.mp4 --model-name-path damo-vilab/text-to-video-ms-1.7b --talc --captions "koala is climbing a tree." "kangaroo is eating fruits."
In the above command, replacing --talc with --merge generates a separate video for each individual caption and outputs them as a single merged video.
To perform inference using the merging captions method, you can use:
CUDA_VISIBLE_DEVICES=0 python inference.py --outfile test_scene.mp4 --model-name-path damo-vilab/text-to-video-ms-1.7b --captions "koala is climbing a tree." "kangaroo is eating fruits."
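Conceptually, the merging-captions baseline conditions every frame of the video on a single prompt built from all scene descriptions. The sketch below is a rough standalone approximation using the public diffusers pipeline rather than the repo's inference.py, and it assumes the baseline simply concatenates the captions; how .frames is indexed depends on your diffusers version.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Rough approximation of the merging-captions baseline with the public diffusers
# pipeline; the repo's inference.py is the authoritative implementation.
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

captions = ["koala is climbing a tree.", "kangaroo is eating fruits."]
merged_prompt = " ".join(captions)  # every frame sees the same merged caption

result = pipe(merged_prompt, num_inference_steps=25, num_frames=16)
# Depending on the diffusers version, result.frames is either a list of frames or
# a batch of videos; adjust the indexing before exporting.
export_to_video(result.frames[0], "merged_captions.mp4")
```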
To generate multi-scene videos using the TALC-finetuned model, the command is:
CUDA_VISIBLE_DEVICES=4 python inference.py --outfile test_scene.mp4 --model-name-path talc_finetuned_modelscope_t2v --talc --captions "spiderman surfing in the ocean." "darth vader surfing in the ocean."
We also provide a file that maps each video segment to the number of video frames it contains, computed with OpenCV. This information is useful for finetuning.
{'video_segment': number_of_video_frames}
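As a rough sketch of how such a mapping can be computed with OpenCV (the directory, file names, and output path below are assumptions, not the repo's exact script):

```python
import json
import os

import cv2

def count_frames(video_path: str) -> int:
    """Return the number of frames in a video segment using OpenCV."""
    cap = cv2.VideoCapture(video_path)
    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    return num_frames

# Hypothetical directory of video segments; adapt the paths to your data layout.
segment_dir = "video_segments"
mapping = {
    fname: count_frames(os.path.join(segment_dir, fname))
    for fname in sorted(os.listdir(segment_dir))
    if fname.endswith(".mp4")
}

with open("segment_to_num_frames.json", "w") as f:
    json.dump(mapping, f, indent=2)
```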
Finetuning
We use Hugging Face Accelerate to finetune the model on multiple GPUs.
To set up Accelerate, run accelerate config in the terminal and use the following settings:
Set up the wandb directory by running wandb init in the terminal. If you want to disable wandb, uncomment os.environ["WANDB_DISABLED"] = "true" in train.py.
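For orientation, a typical Accelerate-wrapped training loop looks roughly like the sketch below. It is not the repo's train.py; the model, optimizer, and data are placeholders, and the wandb switch mirrors the option described above.

```python
import os

# Uncomment the next line to disable wandb logging, mirroring the switch in train.py.
# os.environ["WANDB_DISABLED"] = "true"

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder objects; in train.py these would be the T2V model, its optimizer,
# and the multi-scene video-text dataloader.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 8))
dataloader = DataLoader(dataset, batch_size=8)

accelerator = Accelerator()  # picks up the settings created by `accelerate config`
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    accelerator.backward(loss)  # handles mixed precision and multi-GPU gradient sync
    optimizer.step()
```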