This is the official implementation of our proposed method for the TIA2V task. As a progressive development of our previous work TA2V, this paper combines text, image, and audio as composable conditions within a single diffusion model to generate more controllable and customized videos, an approach that generalizes across a wide range of datasets.
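As a rough sketch of what "composable conditions" means in practice, the snippet below fuses per-modality embeddings into a single conditioning tensor for a diffusion denoiser. The function name and the 512-dimensional embeddings are illustrative assumptions, not the repository's actual API.

```python
import torch

def fuse_conditions(text_emb, image_emb, audio_emb):
    """Hypothetical fusion of CLIP-style per-modality embeddings into one
    conditioning tensor; a learned projection would usually follow."""
    # Concatenate along the feature axis so the denoiser sees all three
    # modalities in a single conditioning vector.
    return torch.cat([text_emb, image_emb, audio_emb], dim=-1)

# Dummy 512-d embeddings for a batch of one.
cond = fuse_conditions(torch.randn(1, 512), torch.randn(1, 512), torch.randn(1, 512))
print(cond.shape)  # torch.Size([1, 1536])
```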
Demo samples: `video_2.mp4`, `video_30.mp4`, `video_42.mp4`, `video_45.mp4`
- Create the virtual environment and install the dependencies:

```bash
conda create -n tia python==3.9
conda activate tia
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
pip install pytorch-lightning==1.5.4 einops ftfy h5py imageio regex scikit-image scikit-video tqdm lpips blobfile mpi4py opencv-python-headless kornia termcolor pytorch-ignite visdom piq joblib av==10.0.0 matplotlib ffmpeg==4.2.2 pillow==9.5.0
pip install git+https://github.com/openai/CLIP.git wav2clip transformers
```
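After installation, a quick sanity check (a minimal sketch; it only imports the packages installed above) verifies the environment:

```python
# Sanity check for the `tia` environment.
import torch
import pytorch_lightning
import clip      # from the OpenAI CLIP repository
import wav2clip

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("pytorch-lightning", pytorch_lightning.__version__)
```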
- Create a `saved_ckpts` folder (e.g., `mkdir -p saved_ckpts`) to store the downloaded pretrained checkpoints.
We create two three-modality (text-audio-video) datasets; the examples below use URMP-VAT. Download links: coming soon.
- `data_path`: path to the dataset, default is `post_URMP`
- `text_emb_model`: model to encode text, choices: `bert`, `clip`
- `audio_emb_model`: model to encode audio, choices: `audioclip`, `wav2clip` (the example command below uses `beats`; see the embedding sketch after this list)
- `text_stft_cond`: load text-audio-video data
- `n_sample`: the number of videos to be sampled
- `run`: index for each run
- `resolution`: resolution at which to extract data
- `model_path`: the path of the pre-trained checkpoint
- `image_size`: the resolution used in the training process
- `in_channels`: the number of channels of the input videos/frames
- `diffusion_steps`: the number of denoising steps
- `noise_schedule`: choices: `cosine`, `linear`
- `num_channels`: base number of latent channels
- `num_res_blocks`: the number of ResNet blocks in the diffusion model
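For reference, the sketch below shows how CLIP text embeddings and wav2clip audio embeddings can be computed with the packages installed earlier. It illustrates the two encoder choices above and is not the repository's own preprocessing code; the prompt and dummy audio are placeholders.

```python
import clip
import numpy as np
import torch
import wav2clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text embedding with CLIP (one of the `text_emb_model` choices).
clip_model, _ = clip.load("ViT-B/32", device=device)
tokens = clip.tokenize(["a violin playing in a concert hall"]).to(device)
with torch.no_grad():
    text_emb = clip_model.encode_text(tokens)       # 512-d embedding

# Audio embedding with wav2clip (one of the `audio_emb_model` choices).
w2c_model = wav2clip.get_model()
audio = np.random.randn(16000).astype(np.float32)   # 1 s of dummy audio at 16 kHz
audio_emb = wav2clip.embed_audio(audio, w2c_model)  # 512-d embedding

print(text_emb.shape, audio_emb.shape)
```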
```bash
python scripts/sample_motion_optim.py --resolution 64 --batch_size 1 --diffusion_steps 4000 --noise_schedule cosine \
  --num_channels 64 --num_res_blocks 2 --class_cond False --model_path saved_ckpts/your_model.pt \
  --num_samples 50 --image_size 64 --learn_sigma True --text_stft_cond --audio_emb_model beats --data_path datasets/post_URMP \
  --in_channels 3 --clip_denoised True --run 0
```
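`--noise_schedule` controls how noise is spread across the `--diffusion_steps` timesteps. For intuition, here is a minimal sketch of the standard linear and cosine beta schedules (following the improved-DDPM formulation that these flag names suggest); it is a reference implementation, not code from this repository:

```python
import numpy as np

def get_betas(schedule: str, num_steps: int) -> np.ndarray:
    """Per-step noise variances for the `linear` and `cosine` choices."""
    if schedule == "linear":
        # Linearly spaced betas, rescaled so any step count matches the
        # original DDPM values defined for 1000 steps.
        scale = 1000 / num_steps
        return np.linspace(scale * 1e-4, scale * 0.02, num_steps)
    if schedule == "cosine":
        # alpha_bar(t) = cos((t/T + s) / (1 + s) * pi/2)^2, with s = 0.008
        s = 0.008
        t = np.arange(num_steps + 1) / num_steps
        alpha_bar = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
        return np.clip(1 - alpha_bar[1:] / alpha_bar[:-1], 0, 0.999)
    raise ValueError(f"unknown schedule: {schedule}")

print(get_betas("cosine", 4000)[:3])  # small early betas preserve detail longer
```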
You can also train the models on customized datasets. Here we provide the commands to train the content and motion parts individually.
- `save_dir`: path to save checkpoints
- `diffusion_steps`: the number of denoising steps
- `noise_schedule`: choices: `cosine`, `linear`
- `num_channels`: base number of latent channels
- `num_res_blocks`: the number of ResNet blocks in the diffusion model
- `class_cond`: whether to use class conditioning
- `image_size`: resolution of the videos/images
- `sequence_length`: the number of frames used in training
- `lr`: the learning rate
```bash
python scripts/train_content.py --num_workers 8 --gpus 1 --batch_size 1 --data_path datasets/post_URMP/ \
  --save_dir saved_ckpts/your_directory_path --resolution 64 --sequence_length 16 --text_stft_cond --diffusion_steps 4000 \
  --noise_schedule cosine --lr 5e-5 --num_channels 64 --num_res_blocks 2 --class_cond False --log_interval 50 \
  --save_interval 10000 --image_size 64 --learn_sigma True --in_channels 3
```
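For intuition, the content model is optimized with the standard denoising (epsilon-prediction) diffusion objective; the sketch below is generic, with a placeholder `model` and tensor shapes rather than the repository's classes:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, cond, alpha_bar):
    """One step of the standard DDPM epsilon-prediction loss."""
    b = x0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (b,), device=x0.device)
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))
    # Forward process sample: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
    return F.mse_loss(model(x_t, t, cond), noise)

# Toy usage with a dummy denoiser and a 1000-step linear schedule.
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bar = torch.cumprod(1 - betas, dim=0)
x0 = torch.randn(2, 3, 64, 64)
print(diffusion_loss(lambda x, t, c: torch.zeros_like(x), x0, None, alpha_bar))
```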
- `save_dir`: path to save checkpoints
- `diffusion_steps`: the number of denoising steps
- `noise_schedule`: choices: `cosine`, `linear`
- `num_channels`: base number of latent channels
- `num_res_blocks`: the number of ResNet blocks in the diffusion model
- `class_cond`: whether to use class conditioning
- `image_size`: resolution of the videos/images
- `sequence_length`: the number of frames used in training
- `model_path`: the path of the pre-trained content model
- `audio_emb_model`: model to encode audio, choices: `audioclip`, `wav2clip` (the example command below uses `beats`)
```bash
python scripts/train_temp.py --num_workers 8 --batch_size 1 --data_path datasets/post_URMP/ \
  --model_path saved_ckpts/your_content_model.pt --save_dir saved_ckpts/your_directory_path --resolution 64 \
  --sequence_length 16 --text_stft_cond --audio_emb_model beats --diffusion_steps 4000 --noise_schedule cosine \
  --num_channels 64 --num_res_blocks 2 --class_cond False --image_size 64 --learn_sigma True --in_channels 3 \
  --lr 5e-5 --log_interval 50 --save_interval 5000 --gpus 1
```
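Note that `train_temp.py` consumes the checkpoint produced by `train_content.py` via `--model_path`, so train the content model first; the motion stage then adds the audio condition chosen with `--audio_emb_model`.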
Our code is based on Latent-Diffusion.