This repository contains an unofficial implementation of the paper I2V-Adapter: A General Image-to-Video Adapter for Video Diffusion Models. Due to our limited computing resources (4 A100 80G GPUs in total), we trained the model only on a small subset of the WebVid-10M dataset, without the self-collected high-quality video clips used in the original paper, and we did not apply any data filtering strategies. If you have a robust model trained on a large amount of high-quality data and are willing to share it, feel free to open a pull request.
| Input Image | Official Samples (from the official project page) | Our Samples |
| --- | --- | --- |
- Release training script.
- Release inference script.
- Release unofficial pretrained weights.
You can set up the repository by running the following commands:
```bash
git clone https://github.com/xUhEngwAng/I2V-Adapter-Unofficial.git
cd I2V-Adapter-Unofficial

# The repository does not pin a Python version; 3.10 is an assumption.
conda create -n I2VAdapter python=3.10
conda activate I2VAdapter
pip install -r requirements.txt

# Download the pretrained base models (Git LFS is required).
git lfs install
git clone https://huggingface.co/SG161222/Realistic_Vision_V5.1_noVAE
git clone https://huggingface.co/guoyww/animatediff-motion-adapter-v1-5-2
git clone https://huggingface.co/h94/IP-Adapter
```
Before training, you should first download the corresponding video dataset. Take the widely used WebVid-10M dataset as an example: download the video files and the `.csv` annotations and place them under the `./data` directory. If you use a custom dataset, the dataset class in `src/data.py` should be modified accordingly.
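The actual dataset class lives in `src/data.py`; the sketch below only illustrates what a WebVid-style video dataset might look like. The class name, CSV columns (`videoid`, `name`), and frame-sampling scheme are assumptions for illustration, not the repository's exact implementation.

```python
import os
import pandas as pd
import torch
from torch.utils.data import Dataset
from torchvision.io import read_video


class WebVidDataset(Dataset):
    """Minimal WebVid-style dataset sketch (illustrative, not the repo's actual class).

    Assumes a CSV with `videoid` and `name` (caption) columns, as in the WebVid
    annotations, and video files stored as `<videoid>.mp4` under `video_dir`.
    """

    def __init__(self, csv_path, video_dir, num_frames=16, stride=4):
        self.annotations = pd.read_csv(csv_path)
        self.video_dir = video_dir
        self.num_frames = num_frames
        self.stride = stride

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        row = self.annotations.iloc[idx]
        video_path = os.path.join(self.video_dir, f"{row['videoid']}.mp4")

        # (T, H, W, C) uint8 frames; audio and metadata are ignored.
        frames, _, _ = read_video(video_path, pts_unit="sec", output_format="THWC")

        # Sample a fixed-length clip with a constant frame stride.
        indices = torch.arange(0, self.num_frames * self.stride, self.stride)
        indices = indices.clamp(max=frames.shape[0] - 1)
        clip = frames[indices]

        # Normalize to [-1, 1] and reorder to (T, C, H, W) for diffusion training.
        clip = clip.permute(0, 3, 1, 2).float() / 127.5 - 1.0

        return {"pixel_values": clip, "caption": str(row["name"])}
```

Since I2V-Adapter conditions generation on the first frame of each clip, make sure any custom dataset returns frames in temporal order.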
To train the I2V-Adapter modules, run:

```bash
accelerate launch ./src/train_image_to_video.py --task_name your_task_name --num_train_epochs 10 --checkpoint_epoch 2
```
You can also finetune the motion modules by passing in `--update_motion_modules`:

```bash
accelerate launch ./src/train_image_to_video.py --task_name your_task_name --num_train_epochs 10 --checkpoint_epoch 2 --update_motion_modules
```
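Both commands assume that Hugging Face Accelerate has already been configured on your machine. If it has not, run the interactive setup once, or pass the standard launcher options explicitly (the flags below are generic `accelerate` options, not repository-specific ones):

```bash
# One-time interactive setup (GPU count, mixed precision, etc.).
accelerate config

# Or specify the launcher options on the command line, e.g. 4 GPUs with fp16:
accelerate launch --num_processes 4 --mixed_precision fp16 \
    ./src/train_image_to_video.py --task_name your_task_name --num_train_epochs 10 --checkpoint_epoch 2
```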
As mentioned in the original AnimateDiff and PIA papers, you can also first finetune the base T2I model on individual frames of the video dataset. This can be accomplished by running:

```bash
accelerate launch ./src/train_text_to_image.py
```
The condition images and text prompts are given via `./data/WebVid-25K/I2VAdapter-eval.csv`; you can freely alter this file to provide your own condition images and prompts. Then run the following command:
```bash
python src/pipeline/pipeline_i2v_adapter.py --task_name I2VAdapter-25K-finetune --checkpoint_epoch 25
```
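For reference, the evaluation CSV is assumed to pair an image path with a text prompt per row, roughly as sketched below; the column names are purely illustrative, so check `./data/WebVid-25K/I2VAdapter-eval.csv` (or the parsing code in `src/pipeline/pipeline_i2v_adapter.py`) for the exact schema:

```csv
image_path,prompt
./data/examples/surf.png,"a surfer riding a large wave at sunset"
./data/examples/corgi.png,"a corgi running across a grassy field"
```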
This codebase is based on the diffusers library and AnimateDiff. The implementation of the first-frame similarity prior is inspired by PIA. We thank all the contributors of these repositories. Additionally, we would like to thank the authors of I2V-Adapter for their open research and foundational work, which inspired this unofficial implementation.