This repository contains the code for the paper "Learning Landmarks Motion from Speech for Speaker-Agnostic 3D Talking Heads Generation" (link). The paper presents a novel approach that generates 3D talking heads from speech by first predicting the motion of facial landmarks. The code includes the implementation of the two models proposed in the paper, S2L and S2D. Check out some qualitative results in this video.
To run the code, you need to install the following dependencies:
- Python 3.8
- PyTorch-GPU 1.13.0
- Trimesh 3.22.1
- Librosa 0.9.2
- Hugging Face Transformers 4.6.1
- MPI-IS Mesh library for mesh rendering (link)
- Additional dependencies for running the demo: pysimplegui==4.60.5, sounddevice==0.4.6, soundfile==0.12.1
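A minimal environment setup, assuming a fresh Python 3.8 virtual environment and the PyPI package names below (the exact PyTorch build for your CUDA version may differ), could look like the following sketch:

```bash
# Sketch of an environment setup; versions follow the list above
python3.8 -m venv venv
source venv/bin/activate

pip install torch==1.13.0 trimesh==3.22.1 librosa==0.9.2 transformers==4.6.1
pip install pysimplegui==4.60.5 sounddevice==0.4.6 soundfile==0.12.1  # demo only

# The MPI-IS mesh library is not on PyPI: clone it and follow the build
# instructions in its repository
git clone https://github.com/MPI-IS/mesh.git
```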
- Clone the repository:
git clone https://github.com/FedeNoce/s2l-s2d.git
- Download the VOCASET dataset from here (Training Data, 8GB).
- Put the downloaded file into the "S2L/vocaset" and "S2D/vocaset" directories.
- To train S2L, preprocess the data by running "preprocess_voca_data.py" in the "S2L/vocaset" directory, then run "train_S2L.py".
- To train S2D, preprocess the data by running "Data_processing.py" in the "S2D" directory, then run "train_S2D.py". Example commands for both training pipelines are sketched after this list.
- Download the pretrained models from here and place them in the "S2L/Results" and "S2D/Results" directories.
- Run the GUI demo using "demo.py" (see the demo command sketch after this list).
- If you're interested, we also provide an updated version of the demo that reconstructs your face from a webcam photo via 3DMM fitting. Before running "demo_with_rec.py", download a file from here and place it in the "Rec/Values" directory.
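For reference, the training steps above might be run as follows; the exact working directories and the locations of the training scripts are assumptions based on the paths listed in the steps.

```bash
# S2L: preprocess VOCASET, then train
cd S2L/vocaset
python preprocess_voca_data.py
cd ..
python train_S2L.py

# S2D: preprocess, then train
cd ../S2D
python Data_processing.py
python train_S2D.py
```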
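With the pretrained models in "S2L/Results" and "S2D/Results" (and the extra file in "Rec/Values" for the webcam variant), the demos might be launched as sketched below; running them from the repository root is an assumption.

```bash
# GUI demo
python demo.py

# Webcam demo with 3DMM face fitting
python demo_with_rec.py
```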
If you use this code or find it helpful, please consider citing:
@misc{nocentini2023learning,
  title={Learning Landmarks Motion from Speech for Speaker-Agnostic 3D Talking Heads Generation},
  author={Federico Nocentini and Claudio Ferrari and Stefano Berretti},
  year={2023},
  eprint={2306.01415},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}