This repository contains the code for the paper "Learning Landmarks Motion from Speech for Speaker-Agnostic 3D Talking Heads Generation" (link). The paper presents a novel approach that generates 3D talking heads from speech by first predicting the motion of facial landmarks. The code includes the implementation of the two models proposed in the paper, S2L and S2D. Check out some qualitative results in this video.
To run the code, you need to install the following dependencies:
- Python 3.8
- PyTorch-GPU 1.13.0
- Trimesh 3.22.1
- Librosa 0.9.2
- Hugging Face Transformers 4.6.1
- MPI-IS Mesh library for mesh rendering (link)
- Additional dependencies for running the demo: pysimplegui==4.60.5, sounddevice==0.4.6, soundfile==0.12.1
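A minimal environment setup, assuming a fresh Python 3.8 virtual environment and the PyPI package names below (the exact PyTorch build for your CUDA version may differ), could look like the following sketch:

```bash
# Sketch of an environment setup; versions follow the list above
python3.8 -m venv venv
source venv/bin/activate

pip install torch==1.13.0 trimesh==3.22.1 librosa==0.9.2 transformers==4.6.1
pip install pysimplegui==4.60.5 sounddevice==0.4.6 soundfile==0.12.1  # demo only

# The MPI-IS mesh library is not on PyPI: clone it and follow the build
# instructions in its repository
git clone https://github.com/MPI-IS/mesh.git
```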
- Clone the repository:
git clone https://github.com/FedeNoce/s2l-s2d.git
- Download the VOCASET dataset from here (Training Data, 8GB).
- Put the downloaded file into the "S2L/vocaset" and "S2D/vocaset" directories.
- To train S2L, preprocess the data by running "preprocess_voca_data.py" in the "S2L/vocaset" directory, then run "train_S2L.py".
- To train S2D, preprocess the data by running "Data_processing.py" in the "S2D" directory, then run "train_S2D.py". Example commands for both training pipelines are sketched after this list.
- Download the pretrained models from here and place them in the "S2L/Results" and "S2D/Results" directories.
- Run the GUI demo using "demo.py" (see the demo command sketch after this list).
- If you're interested, we also provide an updated version of the demo that reconstructs your face from a webcam photo via 3DMM fitting. Before running "demo_with_rec.py", download a file from here and place it in the "Rec/Values" directory.
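For reference, the training steps above might be run as follows; the exact working directories and the locations of the training scripts are assumptions based on the paths listed in the steps.

```bash
# S2L: preprocess VOCASET, then train
cd S2L/vocaset
python preprocess_voca_data.py
cd ..
python train_S2L.py

# S2D: preprocess, then train
cd ../S2D
python Data_processing.py
python train_S2D.py
```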
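With the pretrained models in "S2L/Results" and "S2D/Results" (and the extra file in "Rec/Values" for the webcam variant), the demos might be launched as sketched below; running them from the repository root is an assumption.

```bash
# GUI demo
python demo.py

# Webcam demo with 3DMM face fitting
python demo_with_rec.py
```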
If you use this code or find it helpful, please consider citing:
@misc{nocentini2023learning,
  title={Learning Landmarks Motion from Speech for Speaker-Agnostic 3D Talking Heads Generation},
  author={Federico Nocentini and Claudio Ferrari and Stefano Berretti},
  year={2023},
  eprint={2306.01415},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}