
Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait




Chaolong Yang 1,3* Kai Yao 2* Yuyao Yan 3 Chenru Jiang 4 Weiguang Zhao 1,3
Jie Sun 3 Guangliang Cheng 1 Yifei Zhang 5 Bin Dong 4 Kaizhu Huang 4

1 University of Liverpool   2 Ant Group   3 Xi’an Jiaotong-Liverpool University  
4 Duke Kunshan University   5 Ricoh Software Research Center  

Comparative videos

[Video: Comparative.mp4]

Demo

Gradio Demo: KDTalker.


Environment

KDTalker can run on a single RTX 4090 or RTX 3090.

1. Clone the code and prepare the environment

Note: Make sure your system has git, conda, and FFmpeg installed.

git clone https://github.com/chaolongy/KDTalker
cd KDTalker

# create env using conda
conda create -n KDTalker python=3.9
conda activate KDTalker

conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia

pip install -r requirements.txt
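Before (or after) running the steps above, the prerequisites mentioned in the note can be sanity-checked with a short standard-library script. This helper is a sketch and not part of the KDTalker repo:

```python
# Sketch: check that the tools KDTalker's setup assumes (git, conda, ffmpeg)
# are on PATH. Standard library only; this script is not part of the repo.
import shutil

def missing_tools(tools=("git", "conda", "ffmpeg")):
    """Return the subset of `tools` not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

absent = missing_tools()
if absent:
    print("Missing prerequisites:", ", ".join(absent))
else:
    print("All prerequisites found.")
```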

2. Download pretrained weights

First, download all LivePortrait pretrained weights from Google Drive. Unzip and place them in ./pretrained_weights, ensuring the directory structure is as follows:

pretrained_weights
├── insightface
│   └── models
│       └── buffalo_l
│           ├── 2d106det.onnx
│           └── det_10g.onnx
└── liveportrait
    ├── base_models
    │   ├── appearance_feature_extractor.pth
    │   ├── motion_extractor.pth
    │   ├── spade_generator.pth
    │   └── warping_module.pth
    ├── landmark.onnx
    └── retargeting_models
        └── stitching_retargeting_module.pth
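To confirm the unzipped weights match this layout, a small standard-library check can help. The file list below is transcribed from the tree above; the script itself is a sketch, not part of the repo:

```python
# Sketch: verify the pretrained_weights layout shown in the README tree.
# The EXPECTED list is transcribed from that tree; this script is not
# part of the KDTalker repo.
from pathlib import Path

EXPECTED = [
    "insightface/models/buffalo_l/2d106det.onnx",
    "insightface/models/buffalo_l/det_10g.onnx",
    "liveportrait/base_models/appearance_feature_extractor.pth",
    "liveportrait/base_models/motion_extractor.pth",
    "liveportrait/base_models/spade_generator.pth",
    "liveportrait/base_models/warping_module.pth",
    "liveportrait/landmark.onnx",
    "liveportrait/retargeting_models/stitching_retargeting_module.pth",
]

def missing_weights(root="./pretrained_weights"):
    """Return the expected weight files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]
```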

You can download the weights for the face detector, audio extractor and KDTalker from Google Drive. Put them in ./ckpts.

Alternatively, you can download all of the above weights from Hugging Face.

Inference

python inference.py -source_image ./example/source_image/WDA_BenCardin1_000.png -driven_audio ./example/driven_audio/WDA_BenCardin1_000.wav -output ./results/output.mp4
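To script this command (e.g. over many image/audio pairs), the argv list can be built in Python. `build_kdtalker_cmd` is a hypothetical helper, not part of the repo; the single-dash flags mirror the inference command above:

```python
# Sketch: wrap the KDTalker inference CLI so it can be scripted.
# build_kdtalker_cmd is a hypothetical helper (not in the repo); the
# single-dash flags mirror the README's inference command.
def build_kdtalker_cmd(source_image, driven_audio, output):
    """Build the argv list for one KDTalker inference run."""
    return [
        "python", "inference.py",
        "-source_image", str(source_image),
        "-driven_audio", str(driven_audio),
        "-output", str(output),
    ]

# To execute (from the repo root, with the KDTalker env active):
#   import subprocess
#   subprocess.run(build_kdtalker_cmd(
#       "./example/source_image/WDA_BenCardin1_000.png",
#       "./example/driven_audio/WDA_BenCardin1_000.wav",
#       "./results/output.mp4"), check=True)
```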

Contact

Our code is released under the CC-BY-NC 4.0 license and is intended solely for research purposes. If you have any questions or wish to use it for commercial purposes, please contact us at chaolong.yang@liverpool.ac.uk.

Citation

If you find this code helpful for your research, please cite:

@misc{yang2025kdtalker,
      title={Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait}, 
      author={Chaolong Yang and Kai Yao and Yuyao Yan and Chenru Jiang and Weiguang Zhao and Jie Sun and Guangliang Cheng and Yifei Zhang and Bin Dong and Kaizhu Huang},
      year={2025},
      eprint={2503.12963},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.12963}, 
}

Acknowledgements

We thank these works for their public code and generous help: SadTalker, LivePortrait, Wav2Lip, Face-vid2vid, etc.