AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animations
Huawei Wei, Zejun Yang, Zhisheng Wang
Tencent Games Zhiji, Tencent
Here we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image. You can also provide a video to achieve face reenactment.
- Our paper is now available on arXiv.
- Updated the code to generate pose_temp.npy for head pose control.
- We will release the audio2pose pre-trained weights for audio2video after further optimization. In the meantime, you can choose a head pose template in ./configs/inference/head_pose_temp as a substitute.
Demo videos: cxk.mp4, solo.mp4, Aragaki.mp4, num18.mp4, jijin.mp4, kara.mp4, lyl.mp4, zl.mp4
Prerequisites: Python >= 3.8 and <= 3.11, CUDA >= 11.3, ffmpeg, and git.
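Before installing, it can help to confirm that the interpreter and command-line tools match the prerequisites. The following is a minimal sketch using only the standard library; the helper names `python_version_ok` and `missing_tools` are hypothetical and not part of this repository, and the CUDA version is not checked here.

```python
import shutil
import sys


def python_version_ok(version=sys.version_info):
    """Return True when the interpreter is within the stated 3.8 to 3.11 range."""
    return (3, 8) <= tuple(version[:2]) <= (3, 11)


def missing_tools(tools=("ffmpeg", "git")):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    print("Python version OK:", python_version_ok())
    print("Missing tools:", missing_tools() or "none")
```

If ffmpeg is reported as missing right after installation, the Path environment variable is the first thing to re-check.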
Python and Git:
- Python 3.10.11: https://www.python.org/ftp/python/3.10.11/python-3.10.11-amd64.exe

Install ffmpeg for your operating system (https://www.geeksforgeeks.org/how-to-install-ffmpeg-on-windows/). Note: in step 4 of that guide, set the environment Path through the Windows system settings.
Give PowerShell unrestricted script access so that venv can work:
- Open an administrator PowerShell window.
- Type Set-ExecutionPolicy Unrestricted and answer A.
- Close the administrator PowerShell window.
git clone --recurse-submodules https://github.com/sdbds/AniPortrait-for-windows
Install with PowerShell by running install.ps1, or install-cn.ps1 (for Chinese users).
Loading a local safetensors or ckpt file is supported: set pretrained_base_model_path in configs/prompts/animation_facereenac.yaml to the path of your local SD1.5 model, for example "D:\\stablediffusion-webui\\models\\Stable-diffusion\\v1-5-pruned.ckpt".
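The doubled backslashes in the example path are there because a backslash is an escape character inside a double-quoted YAML string. As a small sketch of how to produce such a value from an ordinary Windows path, the snippet below relies on the fact that a JSON string literal is also a valid double-quoted YAML scalar; `yaml_quoted_path` is a hypothetical helper, not something shipped with the repository.

```python
import json


def yaml_quoted_path(path: str) -> str:
    """Return `path` as a double-quoted string with backslashes escaped."""
    # json.dumps doubles every backslash and wraps the result in double quotes.
    return json.dumps(path)


local_model = r"D:\stablediffusion-webui\models\Stable-diffusion\v1-5-pruned.ckpt"
print("pretrained_base_model_path:", yaml_quoted_path(local_model))
```

The printed line can be pasted into the config file in place of the existing pretrained_base_model_path entry.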
### Download weights
All the weights should be placed under the ./pretrained_weights directory. You can download the weights manually as follows:
1. Download our trained weights, which include five parts: denoising_unet.pth, reference_unet.pth, pose_guider.pth, motion_module.pth and audio2mesh.pt.
2. Download the pretrained weights of the base models and other components:
- StableDiffusion V1.5
- sd-vae-ft-mse
- image_encoder
- wav2vec2-base-960h
Finally, these weights should be organized as follows:
./pretrained_weights/
|-- image_encoder
| |-- config.json
| `-- pytorch_model.bin
|-- sd-vae-ft-mse
| |-- config.json
| |-- diffusion_pytorch_model.bin
| `-- diffusion_pytorch_model.safetensors
|-- stable-diffusion-v1-5
| |-- feature_extractor
| | `-- preprocessor_config.json
| |-- model_index.json
| |-- unet
| | |-- config.json
| | `-- diffusion_pytorch_model.bin
| `-- v1-inference.yaml
|-- wav2vec2-base-960h
| |-- config.json
| |-- feature_extractor_config.json
| |-- preprocessor_config.json
| |-- pytorch_model.bin
| |-- README.md
| |-- special_tokens_map.json
| |-- tokenizer_config.json
| `-- vocab.json
|-- audio2mesh.pt
|-- denoising_unet.pth
|-- motion_module.pth
|-- pose_guider.pth
`-- reference_unet.pth
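A quick way to catch a misplaced download is to compare the directory against the layout above. The sketch below checks a selection of the files from that tree (not every file), using only the standard library; `EXPECTED_FILES` and `missing_weights` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# A selection of files from the tree above; extend the list as needed.
EXPECTED_FILES = [
    "image_encoder/config.json",
    "image_encoder/pytorch_model.bin",
    "sd-vae-ft-mse/config.json",
    "stable-diffusion-v1-5/model_index.json",
    "stable-diffusion-v1-5/unet/config.json",
    "stable-diffusion-v1-5/unet/diffusion_pytorch_model.bin",
    "wav2vec2-base-960h/config.json",
    "wav2vec2-base-960h/pytorch_model.bin",
    "audio2mesh.pt",
    "denoising_unet.pth",
    "motion_module.pth",
    "pose_guider.pth",
    "reference_unet.pth",
]


def missing_weights(root="./pretrained_weights"):
    """Return the expected files (relative paths) that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All checked weight files are in place.")
```

Run it from the repository root so that the relative ./pretrained_weights path resolves correctly.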
Note: If you have already installed some of the pretrained models, such as StableDiffusion V1.5, you can specify their paths in the config file (e.g. ./configs/prompts/animation.yaml).
Here are the CLI commands for running the inference scripts:
Note that you can set -L to the desired number of frames to generate, for example -L 300.
python -m scripts.pose2vid --config ./configs/prompts/animation.yaml -W 512 -H 512
You can refer to the format of animation.yaml to add your own reference images or pose videos. To convert a raw video into a pose video (keypoint sequence), run the following command:
python -m scripts.vid2pose --video_path pose_video_path.mp4
python -m scripts.vid2vid --config ./configs/prompts/animation_facereenac.yaml -W 512 -H 512
Add source face videos and reference images in animation_facereenac.yaml.
python -m scripts.audio2vid --config ./configs/prompts/animation_audio.yaml -W 512 -H 512
Add audio files and reference images in animation_audio.yaml.
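The three inference commands above differ only in the script module and the config file, so they can be assembled by one small helper when scripting batch runs. This is a sketch under that assumption: `MODES` and `build_inference_command` are hypothetical names, and only the flags shown in the commands above (--config, -W, -H, -L) are used.

```python
import shlex
import sys

# Script module name -> config file, as paired in the commands above.
MODES = {
    "pose2vid": "./configs/prompts/animation.yaml",
    "vid2vid": "./configs/prompts/animation_facereenac.yaml",
    "audio2vid": "./configs/prompts/animation_audio.yaml",
}


def build_inference_command(mode, width=512, height=512, length=None,
                            python=sys.executable):
    """Return the argument list for one inference run; -L is added only if set."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODES)}")
    cmd = [python, "-m", f"scripts.{mode}", "--config", MODES[mode],
           "-W", str(width), "-H", str(height)]
    if length is not None:
        cmd += ["-L", str(length)]
    return cmd


# Show the command line for a 300-frame audio-driven run.
print(shlex.join(build_inference_command("audio2vid", length=300, python="python")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues on Windows.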
You can use this command to generate a pose_temp.npy for head pose control:
python -m scripts.generate_ref_pose --ref_video ./configs/inference/head_pose_temp/pose_ref_video.mp4 --save_path ./configs/inference/head_pose_temp/pose.npy
Extract keypoints from raw videos and write training json file (here is an example of processing VFHQ):
python -m scripts.preprocess_dataset --input_dir VFHQ_PATH --output_dir SAVE_PATH --training_json JSON_PATH
Update lines in the training config file:
data:
  json_path: JSON_PATH
Run command:
accelerate launch train_stage_1.py --config ./configs/train/stage1.yaml
Put the pretrained motion module weights mm_sd_v15_v2.ckpt (download link) under ./pretrained_weights.
Specify the stage1 training weights in the config file stage2.yaml, for example:
stage1_ckpt_dir: './exp_output/stage1'
stage1_ckpt_step: 30000
Run command:
accelerate launch train_stage_2.py --config ./configs/train/stage2.yaml
We first thank the authors of EMO; some of the images and audio clips in our demos are from EMO. Additionally, we would like to thank the contributors to the Moore-AnimateAnyone, magic-animate, animatediff and Open-AnimateAnyone repositories for their open research and exploration.
@misc{wei2024aniportrait,
title={AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animations},
author={Huawei Wei and Zejun Yang and Zhisheng Wang},
year={2024},
eprint={2403.17694},
archivePrefix={arXiv},
primaryClass={cs.CV}
}