This is the official implementation of our proposed method for the TIA2V task. As a progressive development of our previous work TA2V, this paper combines text, image, and audio as composable conditions within a single diffusion model to generate more controllable and customized videos, an approach that generalizes across a wide range of datasets.
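As a rough sketch of what "composable conditions" means in practice, the snippet below fuses per-modality embeddings into a single conditioning tensor for a diffusion denoiser. The function name and the 512-dimensional embeddings are illustrative assumptions, not the repository's actual API.

```python
import torch

def fuse_conditions(text_emb, image_emb, audio_emb):
    """Hypothetical fusion of CLIP-style per-modality embeddings into one
    conditioning tensor; a learned projection would usually follow."""
    # Concatenate along the feature axis so the denoiser sees all three
    # modalities in a single conditioning vector.
    return torch.cat([text_emb, image_emb, audio_emb], dim=-1)

# Dummy 512-d embeddings for a batch of one.
cond = fuse_conditions(torch.randn(1, 512), torch.randn(1, 512), torch.randn(1, 512))
print(cond.shape)  # torch.Size([1, 1536])
```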
Demo samples: `video_2.mp4`, `video_30.mp4`, `video_42.mp4`, `video_45.mp4`
- Create the virtual environment and install the dependencies:

```bash
conda create -n tia python==3.9
conda activate tia
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
pip install pytorch-lightning==1.5.4 einops ftfy h5py imageio regex scikit-image scikit-video tqdm lpips blobfile mpi4py opencv-python-headless kornia termcolor pytorch-ignite visdom piq joblib av==10.0.0 matplotlib ffmpeg==4.2.2 pillow==9.5.0
pip install git+https://github.com/openai/CLIP.git wav2clip transformers
```
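After installation, a quick sanity check (a minimal sketch; it only imports the packages installed above) verifies the environment:

```python
# Sanity check for the `tia` environment.
import torch
import pytorch_lightning
import clip      # from the OpenAI CLIP repository
import wav2clip

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("pytorch-lightning", pytorch_lightning.__version__)
```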
- Create a `saved_ckpts` folder (e.g., `mkdir -p saved_ckpts`) to store the downloaded pretrained checkpoints.
We create two three-modality (text-audio-video) datasets; the examples below use URMP-VAT. Download links: coming soon.
- `data_path`: path to the dataset, default is `post_URMP`
- `text_emb_model`: model to encode text, choices: `bert`, `clip`
- `audio_emb_model`: model to encode audio, choices: `audioclip`, `wav2clip` (the example command below uses `beats`; see the embedding sketch after this list)
- `text_stft_cond`: load text-audio-video data
- `n_sample`: the number of videos to be sampled
- `run`: index for each run
- `resolution`: resolution at which to extract data
- `model_path`: the path of the pre-trained checkpoint
- `image_size`: the resolution used in the training process
- `in_channels`: the number of channels of the input videos/frames
- `diffusion_steps`: the number of denoising steps
- `noise_schedule`: choices: `cosine`, `linear`
- `num_channels`: base number of latent channels
- `num_res_blocks`: the number of ResNet blocks in the diffusion model
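For reference, the sketch below shows how CLIP text embeddings and wav2clip audio embeddings can be computed with the packages installed earlier. It illustrates the two encoder choices above and is not the repository's own preprocessing code; the prompt and dummy audio are placeholders.

```python
import clip
import numpy as np
import torch
import wav2clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text embedding with CLIP (one of the `text_emb_model` choices).
clip_model, _ = clip.load("ViT-B/32", device=device)
tokens = clip.tokenize(["a violin playing in a concert hall"]).to(device)
with torch.no_grad():
    text_emb = clip_model.encode_text(tokens)       # 512-d embedding

# Audio embedding with wav2clip (one of the `audio_emb_model` choices).
w2c_model = wav2clip.get_model()
audio = np.random.randn(16000).astype(np.float32)   # 1 s of dummy audio at 16 kHz
audio_emb = wav2clip.embed_audio(audio, w2c_model)  # 512-d embedding

print(text_emb.shape, audio_emb.shape)
```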
```bash
python scripts/sample_motion_optim.py --resolution 64 --batch_size 1 --diffusion_steps 4000 --noise_schedule cosine \
  --num_channels 64 --num_res_blocks 2 --class_cond False --model_path saved_ckpts/your_model.pt \
  --num_samples 50 --image_size 64 --learn_sigma True --text_stft_cond --audio_emb_model beats --data_path datasets/post_URMP \
  --in_channels 3 --clip_denoised True --run 0
```
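`--noise_schedule` controls how noise is spread across the `--diffusion_steps` timesteps. For intuition, here is a minimal sketch of the standard linear and cosine beta schedules (following the improved-DDPM formulation that these flag names suggest); it is a reference implementation, not code from this repository:

```python
import numpy as np

def get_betas(schedule: str, num_steps: int) -> np.ndarray:
    """Per-step noise variances for the `linear` and `cosine` choices."""
    if schedule == "linear":
        # Linearly spaced betas, rescaled so any step count matches the
        # original DDPM values defined for 1000 steps.
        scale = 1000 / num_steps
        return np.linspace(scale * 1e-4, scale * 0.02, num_steps)
    if schedule == "cosine":
        # alpha_bar(t) = cos((t/T + s) / (1 + s) * pi/2)^2, with s = 0.008
        s = 0.008
        t = np.arange(num_steps + 1) / num_steps
        alpha_bar = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
        return np.clip(1 - alpha_bar[1:] / alpha_bar[:-1], 0, 0.999)
    raise ValueError(f"unknown schedule: {schedule}")

print(get_betas("cosine", 4000)[:3])  # small early betas preserve detail longer
```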
You can also train the models on customized datasets. Here we provide the commands to train the content and motion parts individually.
- `save_dir`: path to save checkpoints
- `diffusion_steps`: the number of denoising steps
- `noise_schedule`: choices: `cosine`, `linear`
- `num_channels`: base number of latent channels
- `num_res_blocks`: the number of ResNet blocks in the diffusion model
- `class_cond`: whether to use class conditioning
- `image_size`: resolution of the videos/images
- `sequence_length`: the number of frames used in training
- `lr`: the learning rate
```bash
python scripts/train_content.py --num_workers 8 --gpus 1 --batch_size 1 --data_path datasets/post_URMP/ \
  --save_dir saved_ckpts/your_directory_path --resolution 64 --sequence_length 16 --text_stft_cond --diffusion_steps 4000 \
  --noise_schedule cosine --lr 5e-5 --num_channels 64 --num_res_blocks 2 --class_cond False --log_interval 50 \
  --save_interval 10000 --image_size 64 --learn_sigma True --in_channels 3
```
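For intuition, the content model is optimized with the standard denoising (epsilon-prediction) diffusion objective; the sketch below is generic, with a placeholder `model` and tensor shapes rather than the repository's classes:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, cond, alpha_bar):
    """One step of the standard DDPM epsilon-prediction loss."""
    b = x0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (b,), device=x0.device)
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))
    # Forward process sample: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
    return F.mse_loss(model(x_t, t, cond), noise)

# Toy usage with a dummy denoiser and a 1000-step linear schedule.
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bar = torch.cumprod(1 - betas, dim=0)
x0 = torch.randn(2, 3, 64, 64)
print(diffusion_loss(lambda x, t, c: torch.zeros_like(x), x0, None, alpha_bar))
```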
- `save_dir`: path to save checkpoints
- `diffusion_steps`: the number of denoising steps
- `noise_schedule`: choices: `cosine`, `linear`
- `num_channels`: base number of latent channels
- `num_res_blocks`: the number of ResNet blocks in the diffusion model
- `class_cond`: whether to use class conditioning
- `image_size`: resolution of the videos/images
- `sequence_length`: the number of frames used in training
- `model_path`: the path of the pre-trained content model
- `audio_emb_model`: model to encode audio, choices: `audioclip`, `wav2clip` (the example command below uses `beats`)
```bash
python scripts/train_temp.py --num_workers 8 --batch_size 1 --data_path datasets/post_URMP/ \
  --model_path saved_ckpts/your_content_model.pt --save_dir saved_ckpts/your_directory_path --resolution 64 \
  --sequence_length 16 --text_stft_cond --audio_emb_model beats --diffusion_steps 4000 --noise_schedule cosine \
  --num_channels 64 --num_res_blocks 2 --class_cond False --image_size 64 --learn_sigma True --in_channels 3 \
  --lr 5e-5 --log_interval 50 --save_interval 5000 --gpus 1
```
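Note that `train_temp.py` consumes the checkpoint produced by `train_content.py` via `--model_path`, so train the content model first; the motion stage then adds the audio condition chosen with `--audio_emb_model`.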
Our code is based on Latent-Diffusion.