
(CVPR 2023) SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation


Wenxuan Zhang *,1,2, Xiaodong Cun *,2, Xuan Wang 3, Yong Zhang 2, Xi Shen 2,
Yu Guo 1, Ying Shan 2, Fei Wang 1

1 Xi'an Jiaotong University   2 Tencent AI Lab   3 Ant Group  

CVPR 2023


TL;DR: A realistic and stylized talking head video generation method from a single image and audio.

📋 Changelog

  • [2023.03.30]: New feature: with a reference video, our algorithm can generate videos with more natural eye blinking and some eyebrow movement.

  • [2023.03.29]: resize mode is online via python inference.py --preprocess resize! It produces a larger crop of the image, as discussed in OpenTalker#35.

  • [2023.03.29]: The local gradio demo is online! Run python app.py to start the demo. A new requirements.txt is used to avoid the bugs in librosa.

  • [2023.03.28]: The online demo is launched on Hugging Face Spaces, thanks AK!

  • [2023.03.22]: New feature: generating the 3D face animation from a single image. New applications built on it will follow.

  • [2023.03.22]: New feature: still mode, where only a small head motion is produced, via python inference.py --still.


Previous Changelogs

  • [2023.03.18]: Support expression intensity: you can now change the intensity of the generated motion with python inference.py --expression_scale 1.3 (some value > 1).

  • [2023.03.18]: Reorganized the data folders: you can now download the checkpoints automatically using bash scripts/download_models.sh.

  • [2023.03.18]: We have officially integrated GFPGAN for face enhancement; use python inference.py --enhancer gfpgan for better visual quality.

  • [2023.03.14]: Pinned the version of the joblib package to remove the errors when using librosa. Open In Colab is online!

  • [2023.03.06]: Fixed some bugs in the code and errors in installation.

  • [2023.03.03]: Release the test code for audio-driven single image animation!

  • [2023.02.28]: SadTalker has been accepted by CVPR 2023!

🎼 Pipeline



  • Generating 2D face from a single Image.
  • Generating 3D face from Audio.
  • Generating 4D free-view talking examples from audio and a single image.
  • Gradio/Colab Demo.
  • Full body/image Generation.
  • Training code for each component.
  • Audio-driven Anime Avatar.
  • Integrate ChatGPT for a conversation demo 🤔
  • Integrate with stable-diffusion-web-ui. (stay tuned!)

🔮 Installation

Dependency Installation

Manual Installation
git clone https://github.com/Winfredy/SadTalker.git
cd SadTalker 
conda create -n sadtalker python=3.8
source activate sadtalker
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
conda install ffmpeg
pip install dlib-bin # dlib-bin installs much faster than building dlib; alternatively: conda install dlib
pip install -r requirements.txt

### install GFPGAN for the enhancer
pip install git+https://github.com/TencentARC/GFPGAN
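
After the steps above, a quick sanity check of the environment can save a failed run later. The following is a minimal sketch, not part of the repository: it assumes the import names torch, torchvision, torchaudio, dlib, and gfpgan correspond to the packages installed above, and it only checks that they can be located and that ffmpeg is on PATH.

```python
import importlib.util
import shutil

# Import names assumed from the install commands above; adjust if yours differ.
REQUIRED_MODULES = ["torch", "torchvision", "torchaudio", "dlib", "gfpgan"]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def ffmpeg_available():
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing Python packages:", ", ".join(missing) if missing else "none")
    print("ffmpeg on PATH:", ffmpeg_available())
```

The check locates modules without importing them, so it does not confirm that CUDA works or that the installed versions match the ones pinned above.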

Docker Installation

A Docker image is also provided by @thegenerativegeneration on Docker Hub, which can be used directly as:

docker run --gpus "all" --rm -v $(pwd):/host_dir wawa9000/sadtalker \
    --driven_audio /host_dir/deyu.wav \
    --source_image /host_dir/image.jpg \
    --expression_scale 1.0 \
    --still \
    --result_dir /host_dir

Trained Models


You can run the following script to put all the models in the right place.

bash scripts/download_models.sh

Or download our pre-trained models from Google Drive or our GitHub release page, and then put them in ./checkpoints.

| Model | Description |
| :--- | :--- |
| checkpoints/auido2exp_00300-model.pth | Pre-trained ExpNet in SadTalker. |
| checkpoints/auido2pose_00140-model.pth | Pre-trained PoseVAE in SadTalker. |
| checkpoints/mapping_00229-model.pth.tar | Pre-trained MappingNet in SadTalker. |
| checkpoints/facevid2vid_00189-model.pth.tar | Pre-trained face-vid2vid model from the reproduction of face-vid2vid. |
| checkpoints/epoch_20.pth | Pre-trained 3DMM extractor in Deep3DFaceReconstruction. |
| checkpoints/wav2lip.pth | Highly accurate lip-sync model in Wav2lip. |
| checkpoints/shape_predictor_68_face_landmarks.dat | Face landmark model used in dlib. |
| checkpoints/BFM | 3DMM library file. |
| checkpoints/hub | Face detection models used in face alignment. |
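
If you download the models manually, it may help to confirm that everything listed above is in place before running inference. This is a minimal sketch, not a script shipped with the repository; the function name missing_checkpoints is hypothetical, and the file names are copied verbatim from the list above (including their original spelling).

```python
from pathlib import Path

# Names copied from the model list above; BFM and hub are directories.
EXPECTED = [
    "auido2exp_00300-model.pth",
    "auido2pose_00140-model.pth",
    "mapping_00229-model.pth.tar",
    "facevid2vid_00189-model.pth.tar",
    "epoch_20.pth",
    "wav2lip.pth",
    "shape_predictor_68_face_landmarks.dat",
    "BFM",
    "hub",
]


def missing_checkpoints(checkpoint_dir="./checkpoints"):
    """Return the expected entries that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing from ./checkpoints:")
        for name in missing:
            print("  " + name)
    else:
        print("All expected checkpoints found.")
```

The check only tests for existence; it does not verify file sizes or integrity.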

🔮 Inference Demo

Generating 2D face from a single Image

python inference.py --driven_audio <audio.wav> \
                    --source_image <video.mp4 or picture.png> \
                    --batch_size <default is 2; a larger value runs faster> \
                    --expression_scale <default is 1.0; a larger value makes the motion stronger> \
                    --result_dir <a folder to store results> \
                    --still <adding this flag produces less head motion> \
                    --preprocess <resize or crop the input image, default is crop> \
                    --enhancer <default is None, you can choose gfpgan or RestoreFormer> \
                    --ref_video <default is None, ref_video is used to provide more natural eyebrow movement and eye blinking>

| basic | w/ still mode | w/ exp_scale 1.3 | w/ gfpgan |

Please unmute the videos manually, as GitHub does not play their audio by default.
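
When running many inputs, the inference options listed above can be assembled programmatically. The following is a minimal sketch under stated assumptions: the helper names build_command and run_inference are hypothetical, the file paths in the usage note are placeholders, and only the flags documented above are used.

```python
import subprocess


def build_command(driven_audio, source_image, result_dir,
                  batch_size=2, expression_scale=1.0, still=False,
                  preprocess="crop", enhancer=None, ref_video=None):
    """Assemble the argument list for inference.py from the documented flags."""
    cmd = [
        "python", "inference.py",
        "--driven_audio", str(driven_audio),
        "--source_image", str(source_image),
        "--result_dir", str(result_dir),
        "--batch_size", str(batch_size),
        "--expression_scale", str(expression_scale),
        "--preprocess", preprocess,
    ]
    if still:
        cmd.append("--still")  # boolean flag, takes no value
    if enhancer is not None:
        cmd += ["--enhancer", enhancer]  # e.g. gfpgan or RestoreFormer
    if ref_video is not None:
        cmd += ["--ref_video", str(ref_video)]
    return cmd


def run_inference(**options):
    """Run inference.py in the current directory; raises on a non-zero exit."""
    subprocess.run(build_command(**options), check=True)
```

For example, run_inference(driven_audio="audio.wav", source_image="face.png", result_dir="results", still=True, enhancer="gfpgan") would launch a still-mode run with GFPGAN enhancement, assuming those placeholder files exist and the working directory is the repository root.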

| basic | w/ reference video | reference video |


If the reference video is shorter than the input audio, we will loop the reference video.
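
One simple way to picture this looping is modulo indexing over the reference frames. The sketch below is an illustration only: it assumes the loop restarts from the first reference frame, which may differ from what the repository actually does, and the function name looped_reference_indices is hypothetical.

```python
def looped_reference_indices(num_audio_frames, num_ref_frames):
    """Map each output frame to a reference-video frame, wrapping around.

    Assumption: looping restarts at the first reference frame.
    """
    if num_ref_frames <= 0:
        raise ValueError("reference video must contain at least one frame")
    return [i % num_ref_frames for i in range(num_audio_frames)]


if __name__ == "__main__":
    # A 3-frame reference stretched over 7 output frames.
    print(looped_reference_indices(7, 3))
```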

Generating 3D face from Audio

| Input | Animated 3d face |

Please unmute the videos manually, as GitHub does not play their audio by default.

More details on generating the 3D face can be found here.

Generating 4D free-view talking examples from audio and a single image

We use camera_yaw, camera_pitch, and camera_roll to control the camera pose. For example, --camera_yaw -20 30 10 means the camera yaw angle (in degrees) changes from -20 to 30 and then from 30 to 10.

python inference.py --driven_audio <audio.wav> \
                    --source_image <video.mp4 or picture.png> \
                    --result_dir <a folder to store results> \
                    --camera_yaw -20 30 10
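
To make the key-angle notation concrete, the sketch below expands a list such as -20 30 10 into one angle per frame by linear interpolation. It is an illustration under assumptions, not the repository's code: it assumes the key angles are spaced evenly over the video and interpolated linearly, and the function name camera_angle_per_frame is hypothetical.

```python
def camera_angle_per_frame(key_angles, num_frames):
    """Linearly interpolate key angles (degrees) into one angle per frame.

    Assumption: key angles are spread evenly over the whole video.
    """
    if not key_angles:
        raise ValueError("at least one key angle is required")
    if len(key_angles) == 1 or num_frames == 1:
        return [float(key_angles[0])] * num_frames
    segments = len(key_angles) - 1
    angles = []
    for i in range(num_frames):
        pos = i * segments / (num_frames - 1)  # position in [0, segments]
        k = min(int(pos), segments - 1)        # index of the segment start
        t = pos - k                            # fraction within the segment
        angles.append(key_angles[k] + t * (key_angles[k + 1] - key_angles[k]))
    return angles


if __name__ == "__main__":
    # Yaw moves from -20 to 30, then from 30 to 10, over five frames.
    print(camera_angle_per_frame([-20, 30, 10], 5))
```

With five frames, the yaw reaches 30 at the middle frame and ends at 10, matching the "from -20 to 30 and then from 30 to 10" description.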


🛎 Citation

If you find our work useful in your research, please consider citing:

@article{zhang2022sadtalker,
  title={SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation},
  author={Zhang, Wenxuan and Cun, Xiaodong and Wang, Xuan and Zhang, Yong and Shen, Xi and Guo, Yu and Shan, Ying and Wang, Fei},
  journal={arXiv preprint arXiv:2211.12194},
  year={2022}
}

💗 Acknowledgements

The Facerender code borrows heavily from zhanglonghao's reproduction of face-vid2vid and from PIRender. We thank the authors for sharing their wonderful code. In the training process, we also use the models from Deep3DFaceReconstruction and Wav2lip. We thank the authors for their wonderful work.

🥂 Related Works

📢 Disclaimer

This is not an official product of Tencent. This repository can only be used for personal/research/non-commercial purposes.