Lip2Wav

Generate high-quality speech from only lip movements. This code is part of the paper Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis, published at CVPR'20.

[Paper] | [Project Page] | [Demo Video]


Highlights

  • First work to generate intelligible speech from only lip movements in unconstrained settings.
  • First multi-speaker lip-to-speech generation results.
  • Complete training code and pretrained models made available.
  • Inference code to generate results from the pre-trained models.
  • Code to calculate metrics reported in the paper is also made available.

You might also be interested in:

🎉 Lip-sync talking face videos to any speech using Wav2Lip: https://github.com/Rudrabha/Wav2Lip

Prerequisites

  • Python 3.7.4 (code has been tested with this version)
  • ffmpeg: sudo apt-get install ffmpeg
  • Install necessary packages using pip install -r requirements.txt
  • The face detection pre-trained model should be downloaded to face_detection/detection/sfd/s3fd.pth.
  • The speaker embedding pre-trained model at this link should be downloaded and placed at encoder/saved_models/pretrained.pt (in the linked folder, navigate to encoder/saved_models/pretrained.pt).
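
Both model paths are easy to get wrong; the short Python check below (an illustrative sketch, not part of the repository; the paths are the ones listed above) confirms they are in place before preprocessing:

# Illustrative sanity check for the pretrained model files (not part of this repo)
import os

required_files = [
    "face_detection/detection/sfd/s3fd.pth",   # face detection model
    "encoder/saved_models/pretrained.pt",      # speaker embedding model
]

for path in required_files:
    print(path, "->", "found" if os.path.isfile(path) else "MISSING")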

Getting the weights

Download the weights of our model trained on the LRW dataset.

Preprocessing the LRW dataset

The LRW dataset is organized as follows.

data_root (lrw/ in the below examples)
├── word1
|	├── train, val, test (3 splits)
|	|    ├── *.mp4, *.txt
├── word2
|	├── ...
├── ...

Preprocess the desired split (the test split in this example):

python preprocess.py --data_root lrw/ --preprocessed_root lrw_preprocessed/ --split test

# dump speaker embeddings in the same preprocessed folder
python preprocess_speakers.py --preprocessed_root lrw_preprocessed/

Additional options such as batch_size, the number of GPUs, and the split to use can also be set. You should get:

data_root (lrw_preprocessed/ in the above example)
├── word1
|	├── train, val, test (preprocessed splits)
|	|    ├── word1_00001, word1_00002...
|	|    |    ├── *.jpg, mels.npz, ref.npz 
├── word2
|	├── ...
├── ...
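
As a quick sanity check on the preprocessing output, the sketch below (illustrative only, not part of the repository) walks the layout shown above and counts samples that contain frames, mels.npz, and ref.npz:

# Illustrative check of the preprocessed layout shown above (not part of this repo)
import glob
import os

preprocessed_root = "lrw_preprocessed/"

total = complete = 0
for sample_dir in glob.glob(os.path.join(preprocessed_root, "*", "*", "*")):
    if not os.path.isdir(sample_dir):
        continue
    total += 1
    has_frames = bool(glob.glob(os.path.join(sample_dir, "*.jpg")))
    has_mel = os.path.isfile(os.path.join(sample_dir, "mels.npz"))
    has_ref = os.path.isfile(os.path.join(sample_dir, "ref.npz"))
    if has_frames and has_mel and has_ref:
        complete += 1

print(f"{complete}/{total} samples contain frames, mels.npz and ref.npz")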

Generating for the given test split

python complete_test_generate.py -d lrw_preprocessed/ -r lrw_test_results/ --checkpoint <path_to_checkpoint>

# A sample checkpoint_path can be found in hparams.py alongside the "eval_ckpt" param.

This will create:

lrw_test_results/
├── gts/  (ground-truth audio files)
|	├── *.wav
├── wavs/ (generated audio files)
|	├── *.wav
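
Every file in wavs/ has a ground-truth counterpart in gts/; the sketch below (illustrative, assuming matching filenames in the two folders and that a wav reader such as soundfile is installed) pairs a few of them and prints their durations:

# Illustrative look at generated vs. ground-truth audio (not part of this repo)
import glob
import os

import soundfile as sf  # assumed wav reader; librosa or scipy would also work

results_root = "lrw_test_results/"

for gt_path in sorted(glob.glob(os.path.join(results_root, "gts", "*.wav")))[:5]:
    gen_path = os.path.join(results_root, "wavs", os.path.basename(gt_path))
    if not os.path.isfile(gen_path):
        continue
    gt, sr_gt = sf.read(gt_path)
    gen, sr_gen = sf.read(gen_path)
    print(os.path.basename(gt_path),
          f"ground truth: {len(gt) / sr_gt:.2f}s @ {sr_gt} Hz,",
          f"generated: {len(gen) / sr_gen:.2f}s @ {sr_gen} Hz")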

Calculating the metrics

You can calculate the PESQ, ESTOI and STOI scores for the above generated results using score.py:

python score.py -r lrw_test_results/
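
For reference, the sketch below scores a single ground-truth/generated pair with the commonly used pystoi and pesq packages (the packages, the 16 kHz resampling, and the file name are assumptions for illustration; score.py may compute these differently internally):

# Illustrative STOI/ESTOI/PESQ computation for one pair (not necessarily what score.py does)
import librosa
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

sr = 16000  # wideband PESQ expects 16 kHz input (assumed evaluation rate)

gt, _ = librosa.load("lrw_test_results/gts/example.wav", sr=sr)   # "example.wav" is a placeholder name
gen, _ = librosa.load("lrw_test_results/wavs/example.wav", sr=sr)

# Trim both signals to a common length before scoring
n = min(len(gt), len(gen))
gt, gen = gt[:n], gen[:n]

print("STOI :", stoi(gt, gen, sr, extended=False))
print("ESTOI:", stoi(gt, gen, sr, extended=True))
print("PESQ :", pesq(sr, gt, gen, "wb"))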

Training

python train.py <name_of_run> --data_root Dataset/chem/

Additional arguments can be set directly or passed through --hparams; for details, run python train.py -h.

License and Citation

The software is licensed under the MIT License. Please cite the following paper if you use this code:

@InProceedings{Prajwal_2020_CVPR,
author = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
title = {Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis},
booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}

Acknowledgements

This repository is modified from this TTS repository. We thank the author for this wonderful code. The code for face detection has been taken from the face_alignment repository. We thank the authors for releasing their code and models.