Generate high-quality speech from only lip movements. This code is part of the paper "Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis", published at CVPR'20.
[Paper] | [Project Page] | [Demo Video]
- First work to generate intelligible speech from only lip movements in unconstrained settings.
- First multi-speaker lip-to-speech generation results.
- Complete training code and pretrained models made available.
- Inference code to generate results from the pre-trained models.
- Code to calculate metrics reported in the paper is also made available.
🎉 Lip-sync talking face videos to any speech using Wav2Lip: https://github.com/Rudrabha/Wav2Lip
- Python 3.7.4 (code has been tested with this version)
- ffmpeg:
sudo apt-get install ffmpeg
- Install necessary packages using
pip install -r requirements.txt
- Face detection pre-trained model should be downloaded to
face_detection/detection/sfd/s3fd.pth
- Speaker embeddings pre-trained model at this link (navigate to encoder/saved_models/pretrained.pt) should be downloaded to
encoder/saved_models/pretrained.pt
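As a quick sanity check before running anything, the two weight locations listed above can be verified with a short script. This is only a sketch: the helper name `missing_weights` is hypothetical and not part of the repository, and it assumes the script is run from the repository root.

```python
from pathlib import Path

# Paths taken from the setup steps above, relative to the repository root.
EXPECTED_WEIGHTS = [
    "face_detection/detection/sfd/s3fd.pth",
    "encoder/saved_models/pretrained.pt",
]


def missing_weights(repo_root="."):
    """Return the expected weight files that do not exist under repo_root.

    Hypothetical helper for illustration; not part of this repository.
    """
    root = Path(repo_root)
    return [rel for rel in EXPECTED_WEIGHTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("Missing pre-trained weights:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All pre-trained weights found.")
```

If either file is reported missing, download it to the listed path before preprocessing or inference.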
Download the weights of our model trained on the LRW dataset.
The LRW dataset is organized as follows.
data_root (lrw/ in the below examples)
├── word1
| ├── train, val, test (3 splits)
| | ├── *.mp4, *.txt
├── word2
| ├── ...
├── ...
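To confirm that a download matches the layout above, the videos in each split can be counted with a few lines of Python. This is a minimal sketch that assumes exactly the `<word>/<split>/*.mp4` structure shown; `count_videos` is a hypothetical helper, not a function from this repository.

```python
from collections import Counter
from pathlib import Path

# Split folder names as shown in the layout above.
SPLITS = ("train", "val", "test")


def count_videos(data_root):
    """Count *.mp4 files per split under data_root/<word>/<split>/.

    Hypothetical helper for illustration; not part of this repository.
    """
    counts = Counter()
    for word_dir in sorted(Path(data_root).iterdir()):
        if not word_dir.is_dir():
            continue  # skip stray files at the top level
        for split in SPLITS:
            split_dir = word_dir / split
            if split_dir.is_dir():
                counts[split] += sum(1 for _ in split_dir.glob("*.mp4"))
    return dict(counts)


if __name__ == "__main__":
    import sys

    # Usage: python count_videos.py lrw/
    if len(sys.argv) > 1:
        print(count_videos(sys.argv[1]))
```

A split that is missing from the result was not found under any word folder.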
python preprocess.py --data_root lrw/ --preprocessed_root lrw_preprocessed/ --split test
# dump speaker embeddings in the same preprocessed folder
python preprocess_speakers.py --preprocessed_root lrw_preprocessed/
Additional options, such as batch_size, the number of GPUs, and the split to use, can also be set. You should get:
data_root (lrw_preprocessed/ in the above example)
├── word1
| ├── train, val, test (preprocessed splits)
| | ├── word1_00001, word1_00002...
| | | ├── *.jpg, mels.npz, ref.npz
├── word2
| ├── ...
├── ...
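The preprocessed output can be checked against this layout before running inference. The sketch below only looks at file names (at least one `*.jpg` frame plus `mels.npz` and `ref.npz` per sample folder) and does not open the files; `incomplete_samples` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Per-sample files expected from the layout above, besides the *.jpg frames.
REQUIRED_FILES = ("mels.npz", "ref.npz")


def incomplete_samples(preprocessed_root, split="test"):
    """Return sample folders in the given split that lack frames or .npz files.

    Hypothetical helper for illustration; not part of this repository.
    """
    bad = []
    # Sample folders live at <root>/<word>/<split>/<sample>/
    for sample_dir in sorted(Path(preprocessed_root).glob(f"*/{split}/*")):
        if not sample_dir.is_dir():
            continue
        has_frames = any(sample_dir.glob("*.jpg"))
        has_npz = all((sample_dir / name).is_file() for name in REQUIRED_FILES)
        if not (has_frames and has_npz):
            bad.append(sample_dir)
    return bad


if __name__ == "__main__":
    for path in incomplete_samples("lrw_preprocessed/"):
        print("Incomplete sample:", path)
```

Samples without `ref.npz` most likely have not yet been through the speaker-embedding step.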
python complete_test_generate.py -d lrw_preprocessed/ -r lrw_test_results/ --checkpoint <path_to_checkpoint>
# A sample checkpoint path can be found in hparams.py, next to the "eval_ckpt" parameter.
This will create:
lrw_test_results/
├── gts/ (ground-truth audio files)
| ├── *.wav
├── wavs/ (generated audio files)
| ├── *.wav
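For custom analysis of the results, the ground-truth and generated files can be paired up as sketched below. Note the assumption: this sketch matches files in `gts/` and `wavs/` by identical file name, which the layout above suggests but does not state. `pair_results` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def pair_results(results_root):
    """Pair gts/*.wav with wavs/*.wav and report names found in only one folder.

    Assumption: a generated file and its ground truth share the same file name.
    Hypothetical helper for illustration; not part of this repository.
    """
    root = Path(results_root)
    gts = {p.name: p for p in (root / "gts").glob("*.wav")}
    gens = {p.name: p for p in (root / "wavs").glob("*.wav")}
    common = sorted(gts.keys() & gens.keys())
    pairs = [(gts[name], gens[name]) for name in common]
    unmatched = sorted(gts.keys() ^ gens.keys())
    return pairs, unmatched


if __name__ == "__main__":
    pairs, unmatched = pair_results("lrw_test_results/")
    print(len(pairs), "paired files;", len(unmatched), "unmatched")
```

Each returned pair is `(ground_truth_path, generated_path)`, ready to be loaded with any audio library.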
You can calculate the PESQ, ESTOI and STOI scores for the above generated results using score.py:
python score.py -r lrw_test_results/
python train.py <name_of_run> --data_root Dataset/chem/
Additional arguments can also be set or passed through --hparams. For details, run:
python train.py -h
The software is licensed under the MIT License. Please cite the following paper if you use this code:
@InProceedings{Prajwal_2020_CVPR,
author = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
title = {Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis},
booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}
The repository is modified from this TTS repository. We thank the author for this wonderful code. The code for Face Detection has been taken from the face_alignment repository. We thank the authors for releasing their code and models.