This repository is the official implementation of Tune-A-Video.
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Jay Zhangjie Wu,
Yixiao Ge,
Xintao Wang,
Stan Weixian Lei,
Yuchao Gu,
Wynne Hsu,
Ying Shan,
Xiaohu Qie,
Mike Zheng Shou
pip install -r requirements.txt
Installing xformers is highly recommended for better memory efficiency and speed on GPUs.
To enable xformers, set enable_xformers_memory_efficient_attention=True (the default).
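As a minimal sketch, the value of the flag above can be derived from whether xformers is actually installed. The helper name `xformers_available` is hypothetical and not part of this repository. Where the flag is ultimately set (config file or training argument) is not stated here, so the snippet only computes and prints the value.

```python
import importlib.util


def xformers_available() -> bool:
    """Return True if the xformers package can be found in this environment.

    Hypothetical helper for illustration; not part of the Tune-A-Video code.
    """
    return importlib.util.find_spec("xformers") is not None


# Only request memory-efficient attention when the package is installed.
enable_xformers_memory_efficient_attention = xformers_available()
print(f"enable_xformers_memory_efficient_attention={enable_xformers_memory_efficient_attention}")
```

If this prints `False`, either install xformers or set the flag to `False` explicitly before training.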
You can download the pre-trained Stable Diffusion models (e.g., Stable Diffusion v1-4):
git lfs install
git clone https://huggingface.co/CompVis/stable-diffusion-v1-4
Alternatively, you can use a personalized DreamBooth model (e.g., mr-potato-head):
git lfs install
git clone https://huggingface.co/sd-dreambooth-library/mr-potato-head
To fine-tune the text-to-image diffusion models for text-to-video generation, run this command:
accelerate launch train_tuneavideo.py --config="configs/man-surfing.yaml"
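If you want to start training from a Python script (for example, to loop over several configs), the command above can be assembled as an argument list. This is only a sketch: `build_train_command` is a hypothetical helper, and the only flag it uses is the `--config` option shown above.

```python
import shlex
from typing import List


def build_train_command(config_path: str, script: str = "train_tuneavideo.py") -> List[str]:
    """Assemble the `accelerate launch` command shown above as an argument list.

    Hypothetical helper for illustration; not part of the repository.
    """
    return ["accelerate", "launch", script, f"--config={config_path}"]


cmd = build_train_command("configs/man-surfing.yaml")
# Print the command in shell form so it can be checked before running.
print(shlex.join(cmd))
```

To actually launch training, pass the list to `subprocess.run(cmd, check=True)` from the repository root.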
Once the training is done, run inference:
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid
import torch

# Path to the model produced by the training step
model_id = "path-to-your-trained-model"
# Load the fine-tuned UNet in half precision and move it to the GPU
unet = UNet3DConditionModel.from_pretrained(model_id, subfolder='unet', torch_dtype=torch.float16).to('cuda')
# Build the pipeline from the base Stable Diffusion weights, using the fine-tuned UNet
pipe = TuneAVideoPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", unet=unet, torch_dtype=torch.float16).to("cuda")

# Generate an 8-frame, 512x512 video for a new prompt and save it as a GIF
prompt = "a panda is surfing"
video = pipe(prompt, video_length=8, height=512, width=512, num_inference_steps=50, guidance_scale=7.5).videos
save_videos_grid(video, f"{prompt}.gif")
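The inference code above uses the prompt itself as the output filename. When generating several prompts, some of which contain commas or periods, it may be more convenient to derive a filesystem-friendly name first. The following is a sketch under that assumption; `prompt_to_filename` is a hypothetical helper, not part of `tuneavideo`.

```python
import re


def prompt_to_filename(prompt: str, ext: str = "gif") -> str:
    """Turn a text prompt into a lowercase, hyphen-separated filename.

    Hypothetical helper for illustration; not part of the repository.
    """
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", prompt.strip().lower()).strip("-")
    # Fall back to a fixed name if the prompt had no usable characters.
    return f"{slug or 'sample'}.{ext}"


prompts = [
    "a panda is surfing",
    "Iron Man is surfing in the desert.",
    "a raccoon is surfing, cartoon style.",
]
for p in prompts:
    print(prompt_to_filename(p))
```

Inside a loop over prompts, the result could be passed to `save_videos_grid` in place of `f"{prompt}.gif"`.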
[Training] a man is surfing. | a panda is surfing. | Iron Man is surfing in the desert. | a raccoon is surfing, cartoon style.
sks mr potato head. | sks mr potato head, wearing a pink hat, is surfing. | sks mr potato head, wearing sunglasses, is surfing. | sks mr potato head is surfing in the forest.
@article{wu2022tuneavideo,
title={Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation},
author={Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
journal={arXiv preprint arXiv:2212.11565},
year={2022}
}