We introduce DynamicPose, a simple and robust framework for animating human images, specifically designed for portrait animation driven by human pose sequences. In summary, our key contributions are as follows:
- Large-Scale Motion: Our model supports large-scale motion in diverse environments and generalizes well to non-realistic scenes, such as cartoons.
- High-Quality Video Generation: The model generates high-quality dance videos from a single photo, outperforming most open-source models in the same domain.
- Accurate Pose Alignment: We employ a high-accuracy pose detection algorithm together with a pose alignment algorithm, which lets us maintain pose accuracy while preserving the consistency of the human body's limbs as much as possible.
- Comprehensive Code Release: We will gradually release the code for data filtering, data preprocessing, data augmentation, and model training (DeepSpeed ZeRO-2; see the configuration sketch below), as well as optimized inference scripts.
We are committed to providing the complete source code for free and regularly updating DynamicPose. By open-sourcing this technology, we aim to drive advancements in the digital human field and promote the widespread adoption of virtual human technology across various industries. If you are interested in any of the modules, please feel free to email us to discuss further. Additionally, if our work can benefit you, we would greatly appreciate it if you could give us a star ⭐!
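While the training scripts are still pending, the sketch below shows what a typical DeepSpeed ZeRO Stage-2 configuration looks like. It is our placeholder for orientation only, not the configuration DynamicPose will ship with; all numeric values are assumptions.

```python
# Minimal DeepSpeed ZeRO Stage-2 config sketch (placeholder values only;
# the released training scripts will contain the authoritative settings).
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # assumed batch size
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 2,                        # ZeRO-2: shard optimizer states and gradients
        "overlap_comm": True,              # overlap gradient reduction with the backward pass
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},             # mixed precision is an assumption
}

with open("ds_zero2_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```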
- [08/28/2024] 🔥 Release DynamicPose project and pretrained models.
- [08/28/2024] 🔥 Release pose server and pose align algorithm.
- In the coming two weeks, we will release the ComfyUI integration and the Gradio demo of DynamicPose.
- Inference code and pretrained weights
- Pose server based on FastAPI and pose alignment algorithm.
- ComfyUI integration of DynamicPose.
- Huggingface Gradio demo.
- Data cleaning and preprocessing pipeline.
- Training scripts.
(Demo videos: testphoto8_4_1_768x512_3_1333.mov, testphoto26_1_1_768x512_3_1333.mov, testphoto27_1_1_768x512_3_1333.mov, testphoto38_1_2_768x512_3_1334.mov, rick_1_1_768x512_3_2213.mov, xm_4_1_768x512_3_2213.mov, testphoto92_4_1_768x512_3_1334.mov, testphoto4_3_1_768x512_3_1333.mov)
We recommend Python >= 3.10 and CUDA 11.7. Then build the environment as follows:
# [Optional] Create a virtual env
python -m venv .venv
source .venv/bin/activate
# Install with pip:
pip install -r requirements_min.txt
pip install --no-cache-dir -U openmim
mim install mmengine
mim install "mmcv>=2.0.1"
mim install "mmdet>=3.1.0"
mim install "mmpose>=1.1.0"
You can download the weights manually in the following steps:
- Download our trained weights, which include four parts: `denoising_unet.pth`, `reference_unet.pth`, `pose_guider.pth`, and `motion_module.pth`.
- Download the pretrained weights of the base models and other components (see the directory tree below for the expected contents): `stable-diffusion-v1-5`, `sd-vae-ft-mse`, the image encoder, and the DWPose ONNX models.
- Download the rtmpose weights (`rtmw-x_simcc-cocktail14_pt-ucoco_270e-384x288-f840f204_20231122.pth`, `rtmdet_m_8xb32-100e_coco-obj365-person-235e8209.pth`) and the corresponding scripts from the mmpose repository.
Finally, these weights should be organized as follows:
./pretrained_weights/
|-- rtmpose
| |--rtmw-x_simcc-cocktail14_pt-ucoco_270e-384x288-f840f204_20231122.pth
| |-- rtmw-x_8xb320-270e_cocktail14-384x288.py
| |-- rtmdet_m_640-8xb32_coco-person.py
| `-- rtmdet_m_8xb32-100e_coco-obj365-person-235e8209.pth
|-- DWPose
| |-- dw-ll_ucoco_384.onnx
| `-- yolox_l.onnx
|-- image_encoder
| |-- config.json
| `-- pytorch_model.bin
|-- denoising_unet.pth
|-- motion_module.pth
|-- pose_guider.pth
|-- reference_unet.pth
|-- sd-vae-ft-mse
| |-- config.json
| |-- diffusion_pytorch_model.bin
| `-- diffusion_pytorch_model.safetensors
`-- stable-diffusion-v1-5
|-- feature_extractor
| `-- preprocessor_config.json
|-- model_index.json
|-- unet
| |-- config.json
| `-- diffusion_pytorch_model.bin
`-- v1-inference.yaml
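To confirm that everything landed in the right place, this small check script (written by us; not part of the repository) walks the tree above and reports any missing file:

```python
# Check that ./pretrained_weights matches the directory tree shown above.
from pathlib import Path

ROOT = Path("./pretrained_weights")
EXPECTED = [
    "rtmpose/rtmw-x_simcc-cocktail14_pt-ucoco_270e-384x288-f840f204_20231122.pth",
    "rtmpose/rtmw-x_8xb320-270e_cocktail14-384x288.py",
    "rtmpose/rtmdet_m_640-8xb32_coco-person.py",
    "rtmpose/rtmdet_m_8xb32-100e_coco-obj365-person-235e8209.pth",
    "DWPose/dw-ll_ucoco_384.onnx",
    "DWPose/yolox_l.onnx",
    "image_encoder/config.json",
    "image_encoder/pytorch_model.bin",
    "denoising_unet.pth",
    "motion_module.pth",
    "pose_guider.pth",
    "reference_unet.pth",
    "sd-vae-ft-mse/config.json",
    "sd-vae-ft-mse/diffusion_pytorch_model.bin",
    "sd-vae-ft-mse/diffusion_pytorch_model.safetensors",
    "stable-diffusion-v1-5/feature_extractor/preprocessor_config.json",
    "stable-diffusion-v1-5/model_index.json",
    "stable-diffusion-v1-5/unet/config.json",
    "stable-diffusion-v1-5/unet/diffusion_pytorch_model.bin",
    "stable-diffusion-v1-5/v1-inference.yaml",
]

missing = [p for p in EXPECTED if not (ROOT / p).exists()]
if missing:
    print("Missing files:\n" + "\n".join(missing))
else:
    print("All expected weight files are present.")
```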
- Stage 1 image inference:
python -m scripts.pose2img --config ./configs/prompts/animation_stage1.yaml -W 512 -H 768
- Stage 2 video inference:
python -m scripts.pose2vid --config ./configs/prompts/animation_stage2.yaml -W 512 -H 784 -L 64
- You can refer to the format of the configs to add your own reference images or pose videos. First, extract the keypoints from the input reference images and the target pose videos for alignment (a rough sketch of the alignment idea follows the commands below):
python data_prepare/video2pose.py path/to/ref/images path/to/save/results image #image
python data_prepare/video2pose.py path/to/tgt/videos path/to/save/results video #video
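For intuition only: the alignment step rescales and shifts the driving keypoints so that the driven skeleton matches the reference subject's size and position. The snippet below is our simplified illustration of that idea, not the repository's actual pose-align implementation; it assumes keypoints are (x, y) pixel coordinates.

```python
# Simplified 2D pose alignment: map the driving keypoints' bounding box
# onto the reference keypoints' bounding box (illustrative only).
import numpy as np

def align_pose(driving_kpts: np.ndarray, reference_kpts: np.ndarray) -> np.ndarray:
    """driving_kpts, reference_kpts: (N, 2) arrays of (x, y) keypoints."""
    drv_min, drv_max = driving_kpts.min(axis=0), driving_kpts.max(axis=0)
    ref_min, ref_max = reference_kpts.min(axis=0), reference_kpts.max(axis=0)

    # Per-axis scale that maps the driving bounding box onto the reference box.
    scale = (ref_max - ref_min) / np.maximum(drv_max - drv_min, 1e-6)

    # Scale around the driving box center, then re-center on the reference box.
    drv_center = (drv_min + drv_max) / 2.0
    ref_center = (ref_min + ref_max) / 2.0
    return (driving_kpts - drv_center) * scale + ref_center
```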
This work also has some limitations, which are outlined below:
- When the input image features a profile face, the model is prone to generating distorted faces.
- When the background is complex, the model struggles to accurately distinguish between the human body region and the background region.
- When the input image features a person with objects attached to their hands, such as bags or phones, the model has difficulty deciding whether to include these objects in the generated output.
- We thank AnimateAnyone for their technical report, and we have drawn heavily on Moore-AnimateAnyone and diffusers.
- We also thank the open-source components we rely on, such as dwpose, Stable Diffusion, and rtmpose.
- code: The code of DynamicPose is released under the MIT License.
- other models: Other open-source models used in this project must comply with their own licenses, such as `stable-diffusion-v1-5`, `dwpose`, and `rtmpose`.
@software{DynamicPose,
author = {Yanqin Chen and Changhao Qiao and Bin Zou and Dejia Song},
title = {DynamicPose: An effective image-to-video framework for portrait animation driven by human pose sequences},
month = {August},
year = {2024},
url = {https://github.com/dynamic-X-LAB/DynamicPose}
}