
TIA2V: Video Generation Conditioned on Triple Modalities of Text-Image-Audio

This is the official implementation of our proposed TIA2V task. As a progressive development of our previous work TA2V, we combine text, image, and audio as composable conditions in a single diffusion model to generate more controllable and customized videos, and the method is designed to generalize across different kinds of datasets.
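As a rough illustration of the tri-modal conditioning idea, the sketch below encodes each modality with off-the-shelf encoders (CLIP for text and image, Wav2CLIP for audio) and fuses them into a single conditioning vector. This is only a minimal sketch under assumed choices, not the repo's actual conditioning pipeline; the concatenate-and-project fusion, the projection width, and all inputs are illustrative placeholders.

import numpy as np
import torch
from PIL import Image
import clip      # pip install git+https://github.com/openai/CLIP.git
import wav2clip  # pip install wav2clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
audio_model = wav2clip.get_model()

# Encode each modality into a 512-d embedding.
tokens = clip.tokenize(["a violinist playing a slow melody"]).to(device)
frame = preprocess(Image.new("RGB", (224, 224))).unsqueeze(0).to(device)  # placeholder first frame
audio = np.random.randn(16000).astype(np.float32)                         # placeholder 1 s waveform
with torch.no_grad():
    text_emb = clip_model.encode_text(tokens).float()    # (1, 512)
    image_emb = clip_model.encode_image(frame).float()   # (1, 512)
audio_emb = torch.from_numpy(wav2clip.embed_audio(audio, audio_model)).float().to(device)  # (1, 512)

# Hypothetical fusion: concatenate and project to one conditioning vector.
cond = torch.cat([text_emb, image_emb, audio_emb], dim=-1)  # (1, 1536)
proj = torch.nn.Linear(cond.shape[-1], 512).to(device)
cond = proj(cond)  # would be fed to the diffusion model as its condition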

(Figure: overview of the TIA2V model)

Examples

without SHR module

video_2.mp4
video_30.mp4
video_42.mp4
video_45.mp4

with SHR module

video_2.mp4
video_30.mp4
video_42.mp4
video_45.mp4

Setup

  1. Create the virtual environment
conda create -n tia python==3.9
conda activate tia
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
pip install pytorch-lightning==1.5.4 einops ftfy h5py imageio regex scikit-image scikit-video tqdm lpips blobfile mpi4py opencv-python-headless kornia termcolor pytorch-ignite visdom piq joblib av==10.0.0 matplotlib ffmpeg==4.2.2 pillow==9.5.0
pip install git+https://github.com/openai/CLIP.git wav2clip transformers
  2. Create a saved_ckpts folder to store the pretrained checkpoints.
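After the installs above, a quick sanity check (not repo-specific) that the pinned versions resolved correctly and CUDA is visible:

# Sanity check: confirm the pinned versions and CUDA visibility.
import torch, torchvision, torchaudio
import pytorch_lightning as pl

print("torch:", torch.__version__)              # expect 1.12.1
print("torchvision:", torchvision.__version__)  # expect 0.13.1
print("torchaudio:", torchaudio.__version__)    # expect 0.12.1
print("pytorch-lightning:", pl.__version__)     # expect 1.5.4
print("CUDA available:", torch.cuda.is_available())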

Datasets

We create a three-modality (text-image-audio) dataset named URMP-VAT.

Download pre-trained checkpoints

Coming soon.

Sampling Procedure

Sample Short Music Performance Videos

  • data_path: path to the dataset; default is post_URMP
  • text_emb_model: model to encode text; choices: bert, clip
  • audio_emb_model: model to encode audio; choices: audioclip, wav2clip, beats
  • text_stft_cond: load text-audio-video data
  • num_samples: the number of videos to sample
  • run: index for each run
  • resolution: resolution at which to extract data
  • model_path: path to the pre-trained checkpoint
  • image_size: the resolution used during training
  • in_channels: the number of channels of the input videos/frames
  • diffusion_steps: the number of denoising steps
  • noise_schedule: noise schedule; choices: cosine, linear
  • num_channels: base number of latent channels
  • num_res_blocks: the number of ResNet blocks in the diffusion model
python scripts/sample_motion_optim.py --resolution 64 --batch_size 1 --diffusion_steps 4000 --noise_schedule cosine \
--num_channels 64 --num_res_blocks 2 --class_cond False --model_path saved_ckpts/your_model.pt \
--num_samples 50 --image_size 64 --learn_sigma True --text_stft_cond --audio_emb_model beats --data_path datasets/post_URMP \
--in_channels 3 --clip_denoised True --run 0
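For convenience, here is a small wrapper (not part of the repo) around the command above that sweeps the run index over several independent sampling runs; every flag mirrors the documented command:

import subprocess

def sample(run: int, num_samples: int = 50,
           model_path: str = "saved_ckpts/your_model.pt") -> None:
    """Invoke the sampling script with the flags documented above."""
    cmd = [
        "python", "scripts/sample_motion_optim.py",
        "--resolution", "64", "--batch_size", "1",
        "--diffusion_steps", "4000", "--noise_schedule", "cosine",
        "--num_channels", "64", "--num_res_blocks", "2",
        "--class_cond", "False", "--model_path", model_path,
        "--num_samples", str(num_samples), "--image_size", "64",
        "--learn_sigma", "True", "--text_stft_cond",
        "--audio_emb_model", "beats", "--data_path", "datasets/post_URMP",
        "--in_channels", "3", "--clip_denoised", "True",
        "--run", str(run),
    ]
    subprocess.run(cmd, check=True)

for run in range(3):  # three independent runs, indexed by --run
    sample(run)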

Training Procedure

You can also train the models on custom datasets. Below are the commands to train the content and motion parts individually.

train content

  • save_dir: path to save checkpoints
  • diffusion_steps: the number of denoising steps
  • noise_schedule: noise schedule; choices: cosine, linear
  • num_channels: base number of latent channels
  • num_res_blocks: the number of ResNet blocks in the diffusion model
  • class_cond: whether to use class conditioning
  • image_size: resolution of videos/images
  • sequence_length: the number of frames used in training
  • lr: the learning rate
python scripts/train_content.py --num_workers 8 --gpus 1 --batch_size 1 --data_path datasets/post_URMP/ \
--save_dir saved_ckpts/your_directory_path --resolution 64 --sequence_length 16 --text_stft_cond --diffusion_steps 4000 \
--noise_schedule cosine --lr 5e-5 --num_channels 64 --num_res_blocks 2 --class_cond False --log_interval 50 \
--save_interval 10000 --image_size 64 --learn_sigma True --in_channels 3
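For intuition, --noise_schedule cosine corresponds to the squared-cosine alpha-bar schedule of Nichol & Dhariwal (2021). The self-contained sketch below shows how per-step betas are typically derived from it in common diffusion codebases; this repo may differ in detail:

import math
import numpy as np

def cosine_betas(num_steps: int, s: float = 0.008, max_beta: float = 0.999) -> np.ndarray:
    """Per-step betas chosen so that alpha_bar follows a squared cosine."""
    def alpha_bar(t: float) -> float:
        return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    betas = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return np.array(betas)

betas = cosine_betas(4000)    # matches --diffusion_steps 4000
print(betas[:3], betas[-3:])  # betas grow from near 0 toward max_beta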

train motion

  • save_dir: path to save checkpoints
  • diffusion_steps: the number of denoising steps
  • noise_schedule: noise schedule; choices: cosine, linear
  • num_channels: base number of latent channels
  • num_res_blocks: the number of ResNet blocks in the diffusion model
  • class_cond: whether to use class conditioning
  • image_size: resolution of videos/images
  • sequence_length: the number of frames used in training
  • model_path: path to the pretrained content model
  • audio_emb_model: model to encode audio; choices: audioclip, wav2clip, beats
python scripts/train_temp.py --num_workers 8 --batch_size 1 --data_path datasets/post_URMP/ \
--model_path saved_ckpts/your_content_model.pt --save_dir saved_ckpts/your_directory_path --resolution 64 \
--sequence_length 16 --text_stft_cond --audio_emb_model beats --diffusion_steps 4000 --noise_schedule cosine \
--num_channels 64 --num_res_blocks 2 --class_cond False --image_size 64 --learn_sigma True --in_channels 3 \
--lr 5e-5 --log_interval 50 --save_interval 5000 --gpus 1
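The two stages chain together through --model_path: the motion trainer consumes the checkpoint written by the content trainer. A small orchestration sketch (not part of the repo) that runs both stages in order; the content checkpoint filename is a placeholder:

import subprocess

# Flags shared by both documented training commands.
common = [
    "--num_workers", "8", "--batch_size", "1", "--gpus", "1",
    "--data_path", "datasets/post_URMP/", "--resolution", "64",
    "--sequence_length", "16", "--text_stft_cond",
    "--diffusion_steps", "4000", "--noise_schedule", "cosine",
    "--num_channels", "64", "--num_res_blocks", "2",
    "--class_cond", "False", "--image_size", "64",
    "--learn_sigma", "True", "--in_channels", "3",
    "--lr", "5e-5", "--log_interval", "50",
]

# Stage 1: train the content model.
subprocess.run(["python", "scripts/train_content.py", *common,
                "--save_dir", "saved_ckpts/content",
                "--save_interval", "10000"], check=True)

# Stage 2: train the motion model on top of the content checkpoint.
# The filename below is a placeholder; use the file that
# scripts/train_content.py actually writes into --save_dir.
subprocess.run(["python", "scripts/train_temp.py", *common,
                "--audio_emb_model", "beats",
                "--model_path", "saved_ckpts/content/your_content_model.pt",
                "--save_dir", "saved_ckpts/motion",
                "--save_interval", "5000"], check=True)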

Acknowledgements

Our code is based on Latent-Diffusion.