This code is part of the paper: A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild published at ACM Multimedia 2020.
[Paper] | [Project Page] | [Demo Video] | [Interactive Demo] | [Colab Notebook] | [ReSyncED] (coming soon)
- Lip-sync videos to any target speech with high accuracy. Try our interactive demo.
- Works for any identity, voice, and language. Also works for CGI faces and synthetic voices.
- Complete training code, inference code, and pretrained models are available.
- Or, quick-start with the Google Colab Notebook: Link
- Several new, reliable evaluation benchmarks and metrics released (see the evaluation/ folder of this repo).
- Code to calculate the metrics reported in the paper is also available.
All results from this open-source code or our demo website should be used for research/academic/personal purposes only. As the models are trained on the LRS2 dataset, any form of commercial use is strictly prohibited. Please contact us for all further queries.
- Python 3.5.2 (code has been tested with this version)
- ffmpeg:
sudo apt-get install ffmpeg
- Install necessary packages using
pip install -r requirements.txt
- The pre-trained face detection model should be downloaded to face_detection/detection/sfd/s3fd.pth. Alternative link if the above does not work.
Model | Description | Link to the model |
---|---|---|
Wav2Lip | Highly accurate lip-sync | Link |
Wav2Lip + GAN | Slightly inferior lip-sync, but better visual quality | Link |
Expert Discriminator | Weights of the expert discriminator | Link |
You can lip-sync any video to any audio:
python inference.py --checkpoint_path <ckpt> --face <video.mp4> --audio <an-audio-source>
The result is saved (by default) in results/result_voice.mp4. You can specify the output path as an argument, as with several other available options. The audio source can be any file supported by FFMPEG containing audio data: *.wav, *.mp3, or even a video file, from which the code will automatically extract the audio.
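If you prefer to launch inference from Python rather than the shell, the command above can be assembled as an argument list. This is a minimal sketch: the helper name `build_inference_command` and the example file paths are hypothetical placeholders, and only the three flags shown above are used.

```python
import shlex
import sys


def build_inference_command(checkpoint_path, face, audio):
    """Assemble the argument list for inference.py (hypothetical helper)."""
    return [
        sys.executable, "inference.py",
        "--checkpoint_path", checkpoint_path,
        "--face", face,
        "--audio", audio,
    ]


# Placeholder paths; substitute your own checkpoint, video, and audio files.
cmd = build_inference_command("checkpoints/wav2lip.pth", "input.mp4", "speech.wav")

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))

# To actually run it from the repo root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list instead of a single string avoids quoting problems when file names contain spaces.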
- Experiment with the --pads argument to adjust the detected face bounding box. This often leads to improved results. You might need to increase the bottom padding to include the chin region, e.g. --pads 0 20 0 0.
- Experiment with the --resize_factor argument to get a lower-resolution video. Why? The models are trained on faces at a lower resolution. You might get better, more visually pleasing results for 720p videos than for 1080p videos (in many cases, the latter works well too).
- The Wav2Lip model without GAN usually needs more experimenting with the above two arguments to get the best results, and can sometimes give you a better result as well.
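To make the two tips above concrete, here is a small sketch of what padding a face box and downscaling a frame amount to. Both functions are illustrative and not taken from this repo. The pad order (top, bottom, left, right) is an assumption that is consistent with `--pads 0 20 0 0` increasing the bottom padding, and integer division for the resize is likewise an assumption; run python inference.py --help to confirm the actual behaviour.

```python
def pad_box(box, pads, frame_w, frame_h):
    """Expand a face box (x1, y1, x2, y2) by pads, clamped to the frame.

    Assumption: pads are ordered (top, bottom, left, right).
    """
    x1, y1, x2, y2 = box
    top, bottom, left, right = pads
    return (
        max(0, x1 - left),
        max(0, y1 - top),
        min(frame_w, x2 + right),
        min(frame_h, y2 + bottom),
    )


def resized_dims(width, height, resize_factor):
    """Assumption: each side is shrunk by integer division."""
    return width // resize_factor, height // resize_factor


# A made-up face box in a 1280x720 frame, padded 20 px at the bottom.
padded = pad_box((100, 80, 200, 220), (0, 20, 0, 0), 1280, 720)
print(padded)                       # (100, 80, 200, 240)

# A 1080p frame with a resize factor of 2.
print(resized_dims(1920, 1080, 2))  # (960, 540)
```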
Our models are trained on LRS2. Training on other datasets might require small modifications to the code. Changes to FPS etc. would need significant code changes.
data_root (mvlrs_v1)
├── main, pretrain (we use only main folder in this work)
| ├── list of folders
| │ ├── five-digit numbered video IDs ending with (.mp4)
Place the LRS2 filelists (train, val, test) .txt files in the filelists/ folder.
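As a rough illustration of how such a filelist can be turned into per-video paths, consider the sketch below. The line format is an assumption (each line starts with a `<folder>/<video_id>` entry, optionally followed by other whitespace-separated tokens), and `read_filelist` is a hypothetical helper, not a function from this repo.

```python
import os
import tempfile


def read_filelist(filelist_path, data_root):
    """Return one path per non-empty line of a filelist.

    Assumption: each line starts with a '<folder>/<video_id>' entry relative
    to data_root; anything after the first whitespace is ignored.
    """
    paths = []
    with open(filelist_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            paths.append(os.path.join(data_root, line.split()[0]))
    return paths


# Demo with a throwaway filelist; the IDs below are made up.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "train.txt")
    with open(demo, "w") as f:
        f.write("folderA/00001\nfolderB/00042 extra\n\n")
    entries = read_filelist(demo, "lrs2_preprocessed")

print(entries)
```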
python preprocess.py --data_root data_root/main --preprocessed_root lrs2_preprocessed/
Additional options like batch_size and the number of GPUs to use in parallel can also be set.
preprocessed_root (lrs2_preprocessed)
├── list of folders
| ├── Folders with five-digit numbered video IDs
| │ ├── *.jpg
| │ ├── audio.wav
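Before training, it can help to confirm that preprocessing produced the layout shown above. The following sketch lists video folders that lack audio.wav or have no *.jpg frames. The helper name `find_incomplete_videos` is hypothetical, and the demo runs on a temporary directory rather than real LRS2 data.

```python
import glob
import os
import tempfile


def find_incomplete_videos(preprocessed_root):
    """List video folders lacking audio.wav or any *.jpg frame."""
    incomplete = []
    # Layout from the tree above: root / folder / video_id / {*.jpg, audio.wav}
    for video_dir in sorted(glob.glob(os.path.join(preprocessed_root, "*", "*"))):
        if not os.path.isdir(video_dir):
            continue
        has_audio = os.path.isfile(os.path.join(video_dir, "audio.wav"))
        has_frames = bool(glob.glob(os.path.join(video_dir, "*.jpg")))
        if not (has_audio and has_frames):
            incomplete.append(os.path.relpath(video_dir, preprocessed_root))
    return incomplete


# Demo on a temporary directory with one complete and one incomplete video.
with tempfile.TemporaryDirectory() as root:
    good = os.path.join(root, "folderA", "00001")
    bad = os.path.join(root, "folderA", "00002")
    os.makedirs(good)
    os.makedirs(bad)
    for name in ("0.jpg", "audio.wav"):
        open(os.path.join(good, name), "w").close()
    open(os.path.join(bad, "0.jpg"), "w").close()  # audio.wav is missing
    missing = find_incomplete_videos(root)

print(missing)
```

On a real run you would call `find_incomplete_videos("lrs2_preprocessed/")` and re-run preprocessing for anything it reports.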
There are two major steps: (i) Train the expert lip-sync discriminator, (ii) Train the Wav2Lip model(s).
You can download the pre-trained weights if you want to skip this step. To train the expert discriminator:
python color_syncnet_train.py --data_root lrs2_preprocessed/ --checkpoint_dir <folder_to_save_checkpoints>
You can either train the model without the additional visual quality discriminator (< 1 day of training) or use the discriminator (~2 days). For the former, run:
python wav2lip_train.py --data_root lrs2_preprocessed/ --checkpoint_dir <folder_to_save_checkpoints> --syncnet_checkpoint_path <path_to_expert_disc_checkpoint>
To train with the visual quality discriminator, run hq_wav2lip_train.py instead. The arguments for both files are similar. In both cases, you can also resume training. Run python wav2lip_train.py --help for more details. You can also set additional, less commonly used hyper-parameters at the bottom of the hparams.py file.
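The two training stages described above can be chained from a small driver script. The sketch below only builds the commands. The helper `training_commands` and all directory and checkpoint names are hypothetical, and it assumes hq_wav2lip_train.py accepts the same three flags as wav2lip_train.py, which the text above suggests but does not spell out.

```python
import sys


def training_commands(data_root, expert_dir, wav2lip_dir, expert_ckpt, use_gan=False):
    """Return the two training commands in the order they should run."""
    # Assumption: hq_wav2lip_train.py takes the same flags as wav2lip_train.py.
    stage2_script = "hq_wav2lip_train.py" if use_gan else "wav2lip_train.py"
    stage1 = [sys.executable, "color_syncnet_train.py",
              "--data_root", data_root,
              "--checkpoint_dir", expert_dir]
    stage2 = [sys.executable, stage2_script,
              "--data_root", data_root,
              "--checkpoint_dir", wav2lip_dir,
              "--syncnet_checkpoint_path", expert_ckpt]
    return [stage1, stage2]


# Placeholder directories and checkpoint name; substitute your own.
commands = training_commands(
    "lrs2_preprocessed/", "expert_ckpts/", "wav2lip_ckpts/",
    "expert_ckpts/expert.pth", use_gan=True,
)
for command in commands:
    print(" ".join(command))
    # To execute: import subprocess; subprocess.run(command, check=True)
```

Stage (ii) needs a checkpoint from stage (i), so the commands should run sequentially, or stage (i) can be skipped by pointing `expert_ckpt` at the downloaded pre-trained weights.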
Will be updated.
The software is licensed under the MIT License. Please cite the following paper if you use this code:
@misc{prajwal2020lip,
title={A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild},
author={K R Prajwal and Rudrabha Mukhopadhyay and Vinay Namboodiri and C V Jawahar},
year={2020},
eprint={2008.10010},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Parts of the code structure are inspired by this TTS repository. We thank the author for this wonderful code. The code for face detection has been taken from the face_alignment repository. We thank the authors for releasing their code and models.