Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models

vid2vid-zero for Zero-Shot Video Editing

Wen Wang¹*, Kangyang Xie¹*, Zide Liu¹*, Hao Chen¹, Yue Cao², Xinlong Wang², Chunhua Shen¹

¹ZJU, ²BAAI


Hugging Face Demo


We propose vid2vid-zero, a simple yet effective method for zero-shot video editing. vid2vid-zero leverages off-the-shelf image diffusion models and does not require training on any video. At the core of our method are a null-text inversion module for text-to-video alignment, a cross-frame modeling module for temporal consistency, and a spatial regularization module for fidelity to the original video. Without any training, we leverage the dynamic nature of the attention mechanism to enable bi-directional temporal modeling at test time. Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos.
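To make the cross-frame idea concrete, here is a minimal sketch in pure Python. It is an illustration under stated assumptions, not the repository's implementation: the helper names (`softmax`, `attend`, `cross_frame_attention`) are hypothetical, tokens are used directly as queries, keys, and values with no learned projections, and the toy video is invented. It shows one reading of why no training is needed: attention accepts any number of keys, so the keys and values can be widened from a single frame to all frames (earlier and later) without adding parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax(
            [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        )
        # Weighted sum of the value vectors, one channel at a time.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def cross_frame_attention(frames):
    """Let every frame's tokens attend to the tokens of all frames.

    `frames` is a list of frames; each frame is a list of token vectors.
    Assumption: tokens act as queries, keys, and values directly.
    """
    all_tokens = [tok for frame in frames for tok in frame]
    return [attend(frame, all_tokens, all_tokens) for frame in frames]


# Toy video (made up): 3 frames, 2 tokens per frame, 2 channels per token.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.8, 0.2], [0.2, 0.8]],
]
edited = cross_frame_attention(video)
```

With a single frame, `cross_frame_attention` reduces to ordinary per-frame self-attention, which is why the same pre-trained attention layers can in principle be reused unchanged.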

Highlights

  • Video editing with off-the-shelf image diffusion models.

  • No training on any video.

  • Promising results in editing attributes, subjects, places, etc., in real-world videos.

News

  • [2023.4.12] Online Gradio Demo is available on Hugging Face.
  • [2023.4.11] Add Gradio Demo (runs locally).
  • [2023.4.9] Code released!

Installation

Requirements

pip install -r requirements.txt

Installing xformers is highly recommended for improved efficiency and speed on GPUs.
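A quick way to check whether xformers is importable in the current environment is sketched below. It only uses the standard library; `xformers_available` is a hypothetical helper name, and the snippet makes no claim about how the repository itself detects or enables xformers.

```python
import importlib.util


def xformers_available():
    """Return True if the `xformers` package can be found in this environment."""
    return importlib.util.find_spec("xformers") is not None


if xformers_available():
    print("xformers found")
else:
    print("xformers not found; installing it is recommended for GPU efficiency")
```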

Weights

Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. The pre-trained Stable Diffusion models can be downloaded from 🤗 Hugging Face (e.g., Stable Diffusion v1-4, v2-1). We use Stable Diffusion v1-4 by default.

Zero-shot testing

Simply run:

accelerate launch test_vid2vid_zero.py --config path/to/config

For example:

accelerate launch test_vid2vid_zero.py --config configs/car-moving.yaml
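To run several configs in sequence, the command above can be assembled programmatically. The sketch below assumes that configs are `.yaml` files in a `configs` directory (inferred from the example path); `build_command` and `commands_for` are hypothetical helper names, not part of the repository. It only prints the commands and does not execute them.

```python
import shlex
from pathlib import Path


def build_command(config_path):
    """Build the launch command shown above for a single config file."""
    return [
        "accelerate", "launch", "test_vid2vid_zero.py",
        "--config", str(config_path),
    ]


def commands_for(config_dir="configs"):
    """Build one command per .yaml file in `config_dir` (assumed layout)."""
    return [build_command(p) for p in sorted(Path(config_dir).glob("*.yaml"))]


# Print the commands; nothing is launched here.
for cmd in commands_for():
    print(shlex.join(cmd))
```

To actually launch each run, pass a command to `subprocess.run(cmd, check=True)` from the repository root.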

Gradio Demo

Launch the local demo built with gradio:

python app.py

Alternatively, use our online Gradio demo on Hugging Face.

Note that we disable Null-text Inversion and enable fp16 for faster demo response.

Examples

| Input Video | Output Video | Input Video | Output Video |
|---|---|---|---|
| "A car is moving on the road" | "A Porsche car is moving on the desert" | "A car is moving on the road" | "A jeep car is moving on the snow" |
| "A man is running" | "Stephen Curry is running in Time Square" | "A man is running" | "A man is running in New York City" |
| "A child is riding a bike on the road" | "a child is riding a bike on the flooded road" | "A child is riding a bike on the road" | "a lego child is riding a bike on the road" |
| "A car is moving on the road" | "A car is moving on the snow" | "A car is moving on the road" | "A jeep car is moving on the desert" |

Citation

@article{vid2vid-zero,
  title={Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models},
  author={Wang, Wen and Xie, Kangyang and Liu, Zide and Chen, Hao and Cao, Yue and Wang, Xinlong and Shen, Chunhua},
  journal={arXiv preprint arXiv:2303.17599},
  year={2023}
}

Acknowledgement

We thank the authors of Tune-A-Video, diffusers, and prompt-to-prompt.

Contact

We are hiring at all levels at the BAAI Vision Team, including full-time researchers, engineers, and interns. If you are interested in working with us on foundation models, visual perception, and multimodal learning, please contact Xinlong Wang (wangxinlong@baai.ac.cn) and Yue Cao (caoyue@baai.ac.cn).