Paper | Introduction | Preparation | Benchmark | Inference | Model zoo
Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot and Ethan Fetaya
Official implementation of LipVoicer, a lip-to-speech method. Given a silent video, we first predict the spoken text using a pre-trained lip-reading network. We then condition a diffusion model on the video and incorporate the predicted text through a classifier-guidance mechanism, where a pre-trained ASR serves as the classifier. LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, in-the-wild datasets with hundreds of unique speakers in their test sets and an unrestricted vocabulary.
The lip reading network used in LipVoicer is taken from the Visual Speech Recognition for Multiple Languages repository. The ASR system is adapted from Audio-Visual Efficient Conformer for Robust Speech Recognition.
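To make the guidance mechanism concrete, below is a minimal sketch of one reverse-diffusion step with classifier guidance, where the gradient of the ASR's log-likelihood of the lip-read text steers the sample. All names here (`denoiser`, `asr_log_prob`, the guidance scale `w`) are illustrative assumptions, not the repository's API:

```python
import torch

def guided_step(denoiser, asr_log_prob, x_t, t, video, text, w=1.0):
    """Illustrative reverse-diffusion step with ASR classifier guidance.

    denoiser     -- video-conditioned diffusion model returning the predicted
                    mean and noise scale for x_{t-1} (hypothetical interface)
    asr_log_prob -- pre-trained ASR scoring log p(text | x_t, t) (hypothetical)
    w            -- guidance scale; larger values track the lip-read text more closely
    """
    x_t = x_t.detach().requires_grad_(True)
    # Gradient of the ASR log-likelihood of the lip-read text w.r.t. the noisy input
    grad = torch.autograd.grad(asr_log_prob(x_t, t, text).sum(), x_t)[0]
    # Shift the denoiser's predicted mean towards samples the ASR transcribes as `text`
    mean, sigma = denoiser(x_t, t, video)
    guided_mean = mean + w * sigma ** 2 * grad
    return (guided_mean + sigma * torch.randn_like(x_t)).detach()
```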
- Clone the repository:
```bash
git clone https://github.com/yochaiye/LipVoicer.git
cd LipVoicer
```
- Install the required packages and ffmpeg:
```bash
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
cd ..
```
- Install `ibug.face_detection`:
```bash
git clone https://github.com/hhj1897/face_detection.git
cd face_detection
git lfs pull
pip install -e .
cd ..
```
- Install `ibug.face_alignment`:
```bash
git clone https://github.com/hhj1897/face_alignment.git
cd face_alignment
pip install -e .
cd ..
```
- Install the RetinaFace or MediaPipe face tracker
- Install ctcdecode for the ASR beam search:
```bash
git clone --recursive https://github.com/parlance/ctcdecode.git
cd ctcdecode
pip install .
cd ..
```
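After completing the steps above, a quick sanity check (ours, not part of the official instructions) is to confirm that the installed packages are importable:

```python
# Sanity check: each of these imports should succeed after the installation steps
import ibug.face_detection   # installed from hhj1897/face_detection
import ibug.face_alignment   # installed from hhj1897/face_alignment
import ctcdecode             # installed from parlance/ctcdecode

print("All preprocessing and decoding dependencies are importable")
```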
For training LipVoicer on the benchmark datasets, please download LRS2 or LRS3. In all subsequent steps, make sure to adhere to the dataset's directory structure.
Perform the following steps inside the LipVoicer directory:
- Extract the audio files from the videos (the audio files will be saved in WAV format):
```bash
python ...
```
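The exact extraction script is elided above. Purely as an illustration of what this step produces, the following sketch pulls a 16 kHz mono WAV out of every MP4 with ffmpeg; the directory layout and sampling rate are assumptions, not the repository's defaults:

```python
import subprocess
from pathlib import Path

VIDEO_DIR = Path("datasets/lrs3/videos")  # hypothetical layout
AUDIO_DIR = Path("datasets/lrs3/audio")   # hypothetical layout

for video in VIDEO_DIR.rglob("*.mp4"):
    wav = AUDIO_DIR / video.relative_to(VIDEO_DIR).with_suffix(".wav")
    wav.parent.mkdir(parents=True, exist_ok=True)
    # -vn drops the video stream; -ac 1 / -ar 16000 give 16 kHz mono audio
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video), "-vn", "-ac", "1", "-ar", "16000", str(wav)],
        check=True,
    )
```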
- Compute the log mel-spectrograms and save them:
```bash
cd dataloaders
python wav2mel.py dataset.audio_dir=<audio_dir>
cd ..
```
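For reference, computing a log mel-spectrogram amounts to the following sketch; the STFT and mel parameters here are illustrative guesses, and `wav2mel.py` defines the values LipVoicer actually uses:

```python
import librosa
import numpy as np

def log_mel(wav_path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
    # Illustrative parameters only; see dataloaders/wav2mel.py for the real ones
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Clamp before the log to avoid -inf on silent frames
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))
```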
If you wish to generate audio for all of the test videos of LRS2/LRS3, use the following command:
```bash
python generate_full_test_split.py generate.save_dir=<save_dir> \
    generate.lipread_text_dir=<lipread_text_dir> \
    dataset.dataset_path=<dataset_path> \
    dataset.audio_dir=<audio_dir> \
    dataset.mouthrois_dir=<mouthrois_dir>
```