
[ECCV 2024] MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model.


🦄️ MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model (ECCV 2024)

Muyao Niu 1,2   Xiaodong Cun2,*   Xintao Wang2   Yong Zhang2   Ying Shan2   Yinqiang Zheng1,*  
1 The University of Tokyo   2 Tencent AI Lab   * Corresponding Author  

In European Conference on Computer Vision (ECCV) 2024

     

🔥🔥🔥 New Features/Updates

  • (2024.08.07) We have released the inference script for keypoint-based facial image animation! Please see here for more instructions.

  • (2024.07.15) We have released the training code for trajectory-based image animation! Please see here for more instructions.

  • MOFA-Video will appear at ECCV 2024! 🇮🇹🇮🇹🇮🇹

  • We have released the Gradio inference code and the checkpoints for Hybrid Controls! Please see here for more instructions.

  • A free online demo via HuggingFace Spaces is coming soon!

  • If you find this work interesting, please do not hesitate to give a ⭐!

📰 CODE RELEASE

  • (2024.05.31) Gradio demo and checkpoints for trajectory-based image animation
  • (2024.06.22) Gradio demo and checkpoints for image animation with hybrid control
  • (2024.07.15) Training scripts for trajectory-based image animation
  • (2024.08.07) Inference scripts and checkpoints for keypoint-based facial image animation
  • Training scripts for keypoint-based facial image animation

TL;DR

Image 🏞️ + Hybrid Controls 🕹️ = Videos 🎬🍿

[Demo videos: Trajectory + Landmark Control · Trajectory Control · Landmark Control]

Check the gallery of our project page for more visual results!

Introduction

We introduce MOFA-Video, a method that adapts motions from different domains to a frozen video diffusion model. By employing sparse-to-dense (S2D) motion generation and flow-based motion adaptation, MOFA-Video can animate a single image using various types of control signals, including trajectories, keypoint sequences, and their combinations.

During training, we generate sparse control signals through sparse motion sampling and then train different MOFA-Adapters to generate video via the pre-trained Stable Video Diffusion (SVD) model. During inference, different MOFA-Adapters can be combined to jointly control the frozen SVD.
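To make the idea of sparse motion sampling concrete, here is a minimal sketch, not the authors' training code. It assumes a dense flow field is already available and keeps the flow vectors at a few uniformly random pixels, producing a sparse control signal plus a mask of the kept locations. The function name `sample_sparse_motion`, the uniform random strategy, and the toy flow field are illustrative assumptions; the sampling actually used in MOFA-Video may differ.

```python
import random


def sample_sparse_motion(dense_flow, num_points, seed=None):
    """Keep the flow vectors at `num_points` random pixels and zero out the rest.

    dense_flow: H x W nested list of (dx, dy) tuples.
    Returns (sparse_flow, mask), both H x W nested lists.
    NOTE: uniform random sampling is an assumption made for illustration.
    """
    height, width = len(dense_flow), len(dense_flow[0])
    rng = random.Random(seed)
    # Choose distinct pixel indices in row-major order.
    chosen = rng.sample(range(height * width), num_points)
    sparse_flow = [[(0.0, 0.0)] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    for index in chosen:
        y, x = divmod(index, width)
        sparse_flow[y][x] = dense_flow[y][x]
        mask[y][x] = 1
    return sparse_flow, mask


# Toy 4x6 flow field: every pixel moves right by its column index.
dense = [[(float(x), 0.0) for x in range(6)] for _ in range(4)]
sparse, mask = sample_sparse_motion(dense, num_points=5, seed=0)
print(sum(map(sum, mask)))  # number of retained motion vectors
```

In this sketch, the sparse field and its mask stand in for the sparse control signal; a sparse-to-dense generator would then be trained to recover a dense motion field from them.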

🕹️ Image Animation with Hybrid Controls

1. Clone the Repository

git clone https://github.com/MyNiuuu/MOFA-Video.git
cd ./MOFA-Video

2. Environment Setup

The demo has been tested with CUDA 11.7.

cd ./MOFA-Video-Hybrid
conda create -n mofa python==3.10
conda activate mofa
pip install -r requirements.txt
pip install opencv-python-headless
pip install "git+https://github.com/facebookresearch/pytorch3d.git"

IMPORTANT: ⚠️⚠️⚠️ Use exactly Gradio 4.5.0, as pinned in requirements.txt; other versions may cause errors.
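If you want to confirm the Gradio version before launching the demo, a small check along these lines may help. This is a sketch rather than part of the repository: the helper names `gradio_version_ok` and `installed_gradio_version` are hypothetical, and the only fact taken from this README is the version number 4.5.0.

```python
from importlib import metadata

REQUIRED_GRADIO = "4.5.0"  # version stated in this README


def gradio_version_ok(installed, required=REQUIRED_GRADIO):
    """Return True only when the installed version string equals the pin exactly."""
    return installed is not None and installed.strip() == required


def installed_gradio_version():
    """Return the installed Gradio version, or None if Gradio is not installed."""
    try:
        return metadata.version("gradio")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_gradio_version()
    if gradio_version_ok(found):
        print(f"gradio {found} matches the pinned version")
    else:
        print(f"gradio {found} found; expected {REQUIRED_GRADIO}")
```

Run it inside the activated `mofa` environment so that it inspects the packages installed there.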

3. Downloading Checkpoints

  1. Download the checkpoint of CMP from here and put it into ./MOFA-Video-Hybrid/models/cmp/experiments/semiauto_annot/resnet50_vip+mpii_liteflow/checkpoints.

  2. Download the ckpts folder from the huggingface repo which contains necessary pretrained checkpoints and put it under ./MOFA-Video-Hybrid. You may use git lfs to download the entire ckpts folder:

    1. Download git lfs from https://git-lfs.github.com. It is commonly used for cloning repositories with large model checkpoints on HuggingFace.
    2. Execute git clone https://huggingface.co/MyNiuuu/MOFA-Video-Hybrid to download the complete HuggingFace repository, which currently only includes the ckpts folder.
    3. Copy or move the ckpts folder to the GitHub repository.

    NOTE: If you encounter the error git: 'lfs' is not a git command on Linux, you can try this solution, which worked in my case.

    Finally, the checkpoints should be organized as described in ./MOFA-Video-Hybrid/ckpt_tree.md.
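As a quick sanity check after downloading, the sketch below reports which of the two checkpoint directories named above are missing or empty. It is not part of the repository; the helper name `missing_checkpoint_dirs` is hypothetical, and the files expected inside each directory (listed in ckpt_tree.md) are deliberately not checked.

```python
from pathlib import Path

# Directories named in the download steps, relative to the repository root.
EXPECTED_DIRS = [
    "MOFA-Video-Hybrid/models/cmp/experiments/semiauto_annot/"
    "resnet50_vip+mpii_liteflow/checkpoints",
    "MOFA-Video-Hybrid/ckpts",
]


def missing_checkpoint_dirs(repo_root, expected=EXPECTED_DIRS):
    """Return the expected directories that are absent or empty under repo_root."""
    root = Path(repo_root)
    missing = []
    for relative in expected:
        directory = root / relative
        # A directory counts as present only if it exists and holds something.
        if not directory.is_dir() or not any(directory.iterdir()):
            missing.append(relative)
    return missing


if __name__ == "__main__":
    # Assumes the script is run from the root of the cloned MOFA-Video repository.
    for relative in missing_checkpoint_dirs("."):
        print(f"missing or empty: {relative}")
```

An empty output means both directories exist and contain at least one entry; compare their contents against ckpt_tree.md for a full check.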

4. Run Gradio Demo

Using audio to animate the facial part

cd ./MOFA-Video-Hybrid
python run_gradio_audio_driven.py

🪄🪄🪄 Once the Gradio interface launches, follow the instructions shown on the interface during inference!

Using reference video to animate the facial part

cd ./MOFA-Video-Hybrid
python run_gradio_video_driven.py

🪄🪄🪄 Once the Gradio interface launches, follow the instructions shown on the interface during inference!

💫 Trajectory-based Image Animation

Please see here for instructions.

Training your own MOFA-Adapter

Please see here for more instructions.

Citation

@article{niu2024mofa,
  title={MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model},
  author={Niu, Muyao and Cun, Xiaodong and Wang, Xintao and Zhang, Yong and Shan, Ying and Zheng, Yinqiang},
  journal={arXiv preprint arXiv:2405.20222},
  year={2024}
}

Acknowledgements

We sincerely appreciate the code release of the following projects: DragNUWA, SadTalker, AniPortrait, Diffusers, SVD_Xtend, Conditional-Motion-Propagation, and Unimatch.