Rerender A Video - Official PyTorch Implementation

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
Shuai Yang, Yifan Zhou, Ziwei Liu and Chen Change Loy
in SIGGRAPH Asia 2023 Conference Proceedings
Project Page | Paper | Supplementary Video | Input Data and Video Results

Abstract: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.

Features:

Temporal consistency: cross-frame constraints for low-level temporal consistency.
Zero-shot: no training or fine-tuning required.
Flexibility: compatible with off-the-shelf models (e.g., ControlNet, LoRA) for customized translation.

overview.mp4

Updates

[09/2023] Code is released.
[09/2023] Accepted to SIGGRAPH Asia 2023 Conference Proceedings!
[06/2023] Integrated into 🤗 Hugging Face. Enjoy the web demo!
[05/2023] This website is created.

TODO

Integrate into Diffusers.
~~Add Inference instructions in README.md.~~
~~Add Examples to webUI.~~
~~Add optional poisson fusion to the pipeline.~~
~~Add Installation instructions for Windows~~

Installation

Please make sure your installation path contains only English letters or _

Clone the repository. (Don't forget --recursive. Otherwise, please run git submodule update --init --recursive)

git clone git@github.com:williamyang1991/Rerender_A_Video.git --recursive
cd Rerender_A_Video

If you have already installed PyTorch with CUDA support, you can simply set up the environment with pip.

pip install -r requirements.txt

You can also create a new conda environment from scratch.

conda env create -f environment.yml
conda activate rerender

Run the installation script. The required models will be downloaded to ./models.

python install.py

You can run the demo with rerender.py

python rerender.py --cfg config/real2sculpture.json

Installation on Windows

Before running steps 1-4 above, you need to prepare the following:

Install CUDA
Install git
Install VS with Windows 10/11 SDK (for building deps/ebsynth/bin/ebsynth.exe)

Installation Fails?

In case building ebsynth fails, we provide our compiled ebsynth
KeyError: 'dataset': upgrade Gradio to the latest version (williamyang1991#14 (comment))

(1) Inference

WebUI (recommended)

python webUI.py

The Gradio app also allows you to flexibly change the inference options. Just try it for more details. (For WebUI, you need to download revAnimated_v11 and realisticVisionV20_v20 to ./models/ after Installation)

Upload your video, input the prompt, select the seed, and hit:

Run 1st Key Frame: only translate the first frame, so you can adjust the prompts/models/parameters to find your ideal output appearance before running the whole video.
Run Key Frames: translate all the key frames based on the settings of the first frame, so you can adjust the temporal-related parameters for better temporal consistency before running the whole video.
Run Propagation: propagate the key frames to other frames for full video translation
Run All: Run 1st Key Frame, Run Key Frames and Run Propagation

We provide abundant advanced options to play with

Using customized models

Using LoRA/Dreambooth/Finetuned/Mixed SD models
- Modify sd_model_cfg.py to add paths to the saved SD models
Using other controls from ControlNet (e.g., Depth, Pose)
- Add more options like control_type = gr.Dropdown(['HED', 'canny', 'depth'] here https://github.com/williamyang1991/Rerender_A_Video/blob/b6cafb5d80a79a3ef831c689ffad92ec095f2794/webUI.py#L690
- Add model loading options like elif control_type == 'depth': following https://github.com/williamyang1991/Rerender_A_Video/blob/b6cafb5d80a79a3ef831c689ffad92ec095f2794/webUI.py#L88
- Add model detectors like elif control_type == 'depth': following https://github.com/williamyang1991/Rerender_A_Video/blob/b6cafb5d80a79a3ef831c689ffad92ec095f2794/webUI.py#L122
- One example is given here

Advanced options for the 1st frame translation

Resolution related (Frame resolution, left/top/right/bottom crop length): crop the frame and resize its short side to 512.
ControlNet related:
- ControlNet strength: how well the output matches the input control edges
- Control type: HED edge or Canny edge
- Canny low/high threshold: low values for more edge details
SDEdit related:
- Denoising strength: repaint degree (low value to make the output look more like the original video)
- Preserve color: preserve the color of the original video
SD related:
- Steps: number of denoising steps
- CFG scale: how well the output matches the prompt
- Base model: base Stable Diffusion model (SD 1.5)
  - Stable Diffusion 1.5: official model
  - revAnimated_v11: a semi-realistic (2.5D) model
  - realisticVisionV20_v20: a photo-realistic model
- Added prompt/Negative prompt: supplementary prompts

Advanced options for the key frame translation

Key frame related
- Key frame frequency (K): Uniformly sample the key frame every K frames. Small value for large or fast motions.
- Number of key frames (M): The final output video will have K*M+1 frames with M+1 key frames.
Temporal consistency related
- Cross-frame attention:
  - Cross-frame attention start/end: When to apply cross-frame attention for global style consistency
  - Cross-frame attention update frequency (N): Update the reference style frame every N key frames. Should be large for long videos to avoid error accumulation.
- Shape-aware fusion: Check to use this feature
  - Shape-aware fusion start/end: When to apply shape-aware fusion for local shape consistency
- Pixel-aware fusion: Check to use this feature
  - Pixel-aware fusion start/end: When to apply pixel-aware fusion for pixel-level temporal consistency
  - Pixel-aware fusion strength: The strength to preserve the non-inpainting region. Use a small value to avoid error accumulation and a large value to avoid blurry textures.
  - Pixel-aware fusion detail level: The strength to sharpen the inpainting region. Use a small value to avoid error accumulation and a large value to avoid blurry textures.
  - Smooth fusion boundary: Check to smooth the inpainting boundary (avoid error accumulation).
- Color-aware AdaIN: Check to use this feature
  - Color-aware AdaIN start/end: When to apply AdaIN to make the video color consistent with the first frame

Advanced options for the full video translation

Gradient blending: apply Poisson Blending to reduce ghosting artifacts. This may slow down processing and increase flickering.
Number of parallel processes: multiprocessing to speed up the process. A large value (8) is recommended.

Command Line

We also provide a flexible script rerender.py to run our method.

Simple mode

Set the options via command line. For example,

python rerender.py --input videos/pexels-antoni-shkraba-8048492-540x960-25fps.mp4 --output result/man/man.mp4 --prompt "a handsome man in van gogh painting"

The script will run the full pipeline. A work directory will be created at result/man and the result video will be saved as result/man/man.mp4

Advanced mode

Set the options via a config file. For example,

python rerender.py --cfg config/van_gogh_man.json

The script will run the full pipeline. We provide some example configs in the config directory. Most options in the config are the same as those in WebUI. Please check the explanations in the WebUI section.

Specify customized models by setting sd_model in the config. For example:

{
  "sd_model": "models/realisticVisionV20_v20.safetensors"
}
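
If you want to try several base models with the same settings, one option is to copy an existing config and change only sd_model. The following is a minimal sketch using the standard json module; the helper name write_config_with_model and the output file name in the usage comment are hypothetical, and the only config key assumed is sd_model as shown above.

```python
import json
from pathlib import Path


def write_config_with_model(base_cfg: str, out_cfg: str, sd_model: str) -> dict:
    """Copy a JSON config, overriding only the "sd_model" entry."""
    cfg = json.loads(Path(base_cfg).read_text(encoding="utf-8"))
    cfg["sd_model"] = sd_model  # path to the customized SD checkpoint
    Path(out_cfg).write_text(json.dumps(cfg, indent=2), encoding="utf-8")
    return cfg


# Usage (the output file name is only an example):
# write_config_with_model(
#     "config/van_gogh_man.json",
#     "config/van_gogh_man_realistic.json",
#     "models/realisticVisionV20_v20.safetensors",
# )
```

The new file can then be passed to rerender.py with --cfg in the same way as the provided examples.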

Customize the pipeline

Similar to WebUI, we provide a three-step workflow: rerender the first key frame, then rerender all the key frames, and finally rerender the full video with propagation. To run only a single step, use the options -one, -nb and -nr:

Rerender the first key frame

python rerender.py --cfg config/van_gogh_man.json -one -nb

Rerender the full key frames

python rerender.py --cfg config/van_gogh_man.json -nb

Rerender the full video with propagation

python rerender.py --cfg config/van_gogh_man.json -nr
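
The three commands above can also be driven from a small wrapper script, for example to inspect the output between steps. This is only a sketch: the step labels and function names are hypothetical, and the flags are copied from the commands above rather than from the rerender.py source.

```python
import subprocess
import sys

# Flags per step, copied from the three commands listed above.
STEP_FLAGS = {
    "first_key_frame": ["-one", "-nb"],
    "key_frames": ["-nb"],
    "propagation": ["-nr"],
}


def build_command(cfg_path: str, step: str) -> list:
    """Build the rerender.py command line for a single pipeline step."""
    if step not in STEP_FLAGS:
        raise ValueError(f"unknown step: {step!r}")
    return [sys.executable, "rerender.py", "--cfg", cfg_path, *STEP_FLAGS[step]]


def run_steps(cfg_path: str, steps=("first_key_frame", "key_frames", "propagation")) -> None:
    """Run the selected steps in order, stopping at the first failure."""
    for step in steps:
        # check=True raises CalledProcessError if rerender.py exits with an error.
        subprocess.run(build_command(cfg_path, step), check=True)


# Usage, from the repository root:
# run_steps("config/van_gogh_man.json", steps=("first_key_frame",))
```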

Our Ebsynth implementation

We provide a separate Ebsynth python script video_blend.py with the temporal blending algorithm introduced in Stylizing Video by Example for interpolating style between key frames. It can work on your own stylized key frames independently of our Rerender algorithm.

Usage:

video_blend.py [-h] [--output OUTPUT] [--fps FPS] [--beg BEG] [--end END] [--itv ITV] [--key KEY]
                      [--n_proc N_PROC] [-ps] [-ne] [-tmp]
                      name

positional arguments:
  name             Path to input video

optional arguments:
  -h, --help       show this help message and exit
  --output OUTPUT  Path to output video
  --fps FPS        The FPS of output video
  --beg BEG        The index of the first frame to be stylized
  --end END        The index of the last frame to be stylized
  --itv ITV        The interval of key frame
  --key KEY        The subfolder name of stylized key frames
  --n_proc N_PROC  The max process count
  -ps              Use poisson gradient blending
  -ne              Do not run ebsynth (use previous ebsynth output)
  -tmp             Keep temporary output

For example, to run Ebsynth on video man.mp4,

Put the stylized key frames in videos/man/keys, one for every 10 frames (named as 0001.png, 0011.png, ...)
Put the original video frames in videos/man/video (named as 0001.png, 0002.png, ...).
Run Ebsynth on the first 101 frames of the video with poisson gradient blending and save the result to videos/man/blend.mp4 under FPS 25 with the following command:

python video_blend.py videos/man \
  --beg 1 \
  --end 101 \
  --itv 10 \
  --key keys \
  --output videos/man/blend.mp4 \
  --fps 25.0 \
  -ps
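
Before running the command, it can help to confirm that every expected key frame is present. The sketch below lists the file names implied by --beg, --end and --itv; it assumes 4-digit zero-padded names and that the last frame is itself a key frame, both inferred from the example naming above (0001.png, 0011.png, ...). The function names are hypothetical and not part of video_blend.py.

```python
from pathlib import Path


def expected_key_frames(beg: int, end: int, itv: int) -> list:
    """File names of key frames from beg to end (inclusive), every itv frames."""
    if itv < 1 or end < beg:
        raise ValueError("need itv >= 1 and end >= beg")
    return [f"{i:04d}.png" for i in range(beg, end + 1, itv)]


def missing_key_frames(key_dir: str, beg: int, end: int, itv: int) -> list:
    """Expected key frame files that do not exist in key_dir."""
    folder = Path(key_dir)
    return [name for name in expected_key_frames(beg, end, itv)
            if not (folder / name).is_file()]


# Usage for the example above:
# print(missing_key_frames("videos/man/keys", beg=1, end=101, itv=10))
```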

(2) Results

Key frame translation


white ancient Greek sculpture, Venus de Milo, light pink and blue background	a handsome Greek man	a traditional mountain in chinese ink wash painting	a cartoon tiger


a swan in chinese ink wash painting, monochrome	a beautiful woman in CG style	a clean simple white jade sculpture	a fluorescent jellyfish in the deep dark blue sea

Full video translation

Text-guided virtual character generation.

more_result_1.mp4

more_result_2.mp4

Video stylization and video editing.

more_result_3.mp4

Citation

If you find this work useful for your research, please consider citing our paper:

@inproceedings{yang2023rerender,
 title = {Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation},
 author = {Yang, Shuai and Zhou, Yifan and Liu, Ziwei and Loy, Chen Change},
 booktitle = {ACM SIGGRAPH Asia Conference Proceedings},
 year = {2023},
}

Acknowledgments

The code is mainly developed based on ControlNet, Stable Diffusion, GMFlow and Ebsynth.
