The PyTorch implementation of our CVPR 2023 paper "DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation".
[Project] [Paper] [Video Demo]
- python 3.7.0
- pytorch 1.10.0
- pytorch-lightning 1.2.5
- torchvision 0.11.0
For more details, please refer to requirements.txt. We conduct the experiments with 8 NVIDIA 3090Ti GPUs.
Put the first stage model in ./models.
Please download the HDTF dataset for training and testing, and process it as follows.
Data Preprocessing:
- Set all videos to 25 fps.
- Extract the audio signals and facial landmarks.
- Put the processed data in ./data/HDTF, and construct the data directory as follows.
- Construct data_train.txt and data_test.txt as follows.
./data/HDTF:
|——data/HDTF
   |——images
      |——0_0.jpg
      |——0_1.jpg
      |——...
      |——N_M.jpg
   |——landmarks
      |——0_0.lms
      |——0_1.lms
      |——...
      |——N_M.lms
   |——audio_smooth
      |——0_0.npy
      |——0_1.npy
      |——...
      |——N_M.npy
./data/data_train(test).txt:
0_0
0_1
0_2
...
N_M
N is the total number of classes, and M is the class size.
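The list files can be generated from the directory layout above. The following is a minimal sketch, not part of this repository: the function names are hypothetical, and the train/test split by class index is an assumption, since the split used in the paper is not specified here. It keeps only the IDs that have an image, a landmark file, and an audio feature file.

```python
from pathlib import Path


def collect_sample_ids(root):
    """Return IDs such as "0_1" that have an image, a landmark file, and an audio file."""
    root = Path(root)
    ids = []
    for image in (root / "images").glob("*.jpg"):
        stem = image.stem
        has_landmarks = (root / "landmarks" / f"{stem}.lms").is_file()
        has_audio = (root / "audio_smooth" / f"{stem}.npy").is_file()
        if has_landmarks and has_audio:
            ids.append(stem)
    # Sort numerically by (class index, index within class) so that 0_10 follows 0_9.
    # Assumes every stem has the form "<int>_<int>"; anything else raises ValueError.
    return sorted(ids, key=lambda s: tuple(int(part) for part in s.split("_")))


def split_by_class(ids, test_classes):
    """Hypothetical split: IDs whose class index is in test_classes go to the test list."""
    train, test = [], []
    for sample_id in ids:
        class_index = int(sample_id.split("_")[0])
        (test if class_index in test_classes else train).append(sample_id)
    return train, test


def write_list(ids, path):
    """Write one sample ID per line, matching the format of data_train.txt."""
    Path(path).write_text("".join(f"{sample_id}\n" for sample_id in ids))
```

For example, `train, test = split_by_class(collect_sample_ids("./data/HDTF"), {0})` followed by two `write_list` calls would produce ./data/data_train.txt and ./data/data_test.txt, with class 0 held out for testing.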
Run the preprocessing as follows (replace the path with the location of your own videos):
python3 scripts/preprocess.py /mnt/hard3/rhs/intern/ops/sample*.webm
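For reference, the first two preprocessing steps (25 fps conversion and audio extraction) can be done with ffmpeg. The sketch below is an illustration under stated assumptions and does not describe what scripts/preprocess.py does internally: the function name, the output file names, and the 16 kHz mono audio format are all assumptions. Facial landmark extraction is not covered.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_commands(video_path, out_dir, fps=25, sample_rate=16000):
    """Build two ffmpeg commands: one to resample the video, one to extract the audio.

    sample_rate and the mono setting are assumptions, not values taken from the paper.
    """
    video_path = Path(video_path)
    out_dir = Path(out_dir)
    video_out = out_dir / f"{video_path.stem}_{fps}fps.mp4"
    audio_out = out_dir / f"{video_path.stem}.wav"

    # Re-encode the video at a constant frame rate using the fps filter.
    resample = ["ffmpeg", "-y", "-i", str(video_path),
                "-filter:v", f"fps={fps}", str(video_out)]
    # Drop the video stream (-vn) and write mono audio at the chosen sample rate.
    extract_audio = ["ffmpeg", "-y", "-i", str(video_path),
                     "-vn", "-ac", "1", "-ar", str(sample_rate), str(audio_out)]
    return resample, extract_audio


def run_commands(commands):
    """Run each command in order, raising CalledProcessError if ffmpeg fails."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling `run_commands(build_ffmpeg_commands("sample0.webm", "./out"))` requires ffmpeg on the PATH and an existing output directory.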
Training:
sh run.sh
Inference:
sh inference.sh
- DiffTalk models talking head generation as an iterative denoising process, which needs more time to synthesize a frame than most GAN-based approaches. This is a common problem of LDM-based works.
- The model is trained on the HDTF dataset, and it sometimes fails on identities from other datasets.
- When driving a portrait with more challenging cross-identity audio, the audio-lip synchronization of the synthesized video is slightly inferior to that under the self-driven setting.
- During inference, the network is also sensitive to the mask shape in z_T: the mask needs to cover the mouth region completely, and its shape must not leak any lip shape information.
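To illustrate the last point, a fixed rectangle is one mask shape that carries no lip shape information by construction. The sketch below is not the mask code used in this repository: the function names, the 0.5 fraction, and the convention that 1 marks hidden pixels are assumptions. The grid may be the image or the latent resolution, as long as the mouth landmarks are scaled to the same grid.

```python
import numpy as np


def lower_face_mask(height, width, top_fraction=0.5):
    """Binary mask with 1 where the region is hidden and 0 where it is kept.

    Everything below top_fraction of the height is hidden, independent of the lips.
    """
    mask = np.zeros((height, width), dtype=np.float32)
    start = int(height * top_fraction)
    mask[start:, :] = 1.0
    return mask


def mouth_is_covered(mask, mouth_points):
    """Check that every (x, y) mouth landmark falls inside the hidden region."""
    height, width = mask.shape
    for x, y in mouth_points:
        col, row = int(round(x)), int(round(y))
        inside = 0 <= row < height and 0 <= col < width
        if not inside or mask[row, col] != 1.0:
            return False
    return True
```

Running `mouth_is_covered` on the landmarks of each test portrait is a quick way to catch crops where the mouth extends above the masked area.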
This code is built upon the publicly available latent-diffusion code. We thank the authors of latent-diffusion for making their excellent work and code publicly available.
Please cite the following paper if you use this repository in your research.
@inproceedings{shen2023difftalk,
author={Shen, Shuai and Zhao, Wenliang and Meng, Zibin and Li, Wanhua and Zhu, Zheng and Zhou, Jie and Lu, Jiwen},
title={DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation},
booktitle={CVPR},
year={2023}
}