Paper | Introduction | Preparation | Benchmark | Inference | Model zoo
Yochai Yemini, Aviv Shamsian, Lior Bracha, Sharon Gannot and Ethan Fetaya
Official implementation of LipVoicer, a lip-to-speech method. Given a silent video, we first predict the spoken text using a pre-trained lip-reading network. We then condition a diffusion model on the video and incorporate the predicted text through a classifier-guidance mechanism, where a pre-trained ASR serves as the classifier. LipVoicer outperforms multiple lip-to-speech baselines on LRS2 and LRS3, in-the-wild datasets with hundreds of unique speakers in their test sets and an unrestricted vocabulary.
The lip reading network used in LipVoicer is taken from the Visual Speech Recognition for Multiple Languages repository. The ASR system is adapted from Audio-Visual Efficient Conformer for Robust Speech Recognition.
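To make the guidance mechanism concrete, below is a minimal sketch of one reverse-diffusion step with classifier guidance, where the gradient of the ASR's log-likelihood of the lip-read text steers the sample. All names here (`denoiser`, `asr_log_prob`, the guidance scale `w`) are illustrative assumptions, not the repository's API:

```python
import torch

def guided_step(denoiser, asr_log_prob, x_t, t, video, text, w=1.0):
    """Illustrative reverse-diffusion step with ASR classifier guidance.

    denoiser     -- video-conditioned diffusion model returning the predicted
                    mean and noise scale for x_{t-1} (hypothetical interface)
    asr_log_prob -- pre-trained ASR scoring log p(text | x_t, t) (hypothetical)
    w            -- guidance scale; larger values track the lip-read text more closely
    """
    x_t = x_t.detach().requires_grad_(True)
    # Gradient of the ASR log-likelihood of the lip-read text w.r.t. the noisy input
    grad = torch.autograd.grad(asr_log_prob(x_t, t, text).sum(), x_t)[0]
    # Shift the denoiser's predicted mean towards samples the ASR transcribes as `text`
    mean, sigma = denoiser(x_t, t, video)
    guided_mean = mean + w * sigma ** 2 * grad
    return (guided_mean + sigma * torch.randn_like(x_t)).detach()
```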
- Clone the repository:
```bash
git clone https://github.com/yochaiye/LipVoicer.git
cd LipVoicer
```
- Install the required packages and ffmpeg:
```bash
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
cd ..
```
- Install `ibug.face_detection`:
```bash
git clone https://github.com/hhj1897/face_detection.git
cd face_detection
git lfs pull
pip install -e .
cd ..
```
- Install `ibug.face_alignment`:
```bash
git clone https://github.com/hhj1897/face_alignment.git
cd face_alignment
pip install -e .
cd ..
```
- Install the RetinaFace or MediaPipe face tracker
- Install ctcdecode for the ASR beam search:
```bash
git clone --recursive https://github.com/parlance/ctcdecode.git
cd ctcdecode
pip install .
cd ..
```
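After completing the steps above, a quick sanity check (ours, not part of the official instructions) is to confirm that the installed packages are importable:

```python
# Sanity check: each of these imports should succeed after the installation steps
import ibug.face_detection   # installed from hhj1897/face_detection
import ibug.face_alignment   # installed from hhj1897/face_alignment
import ctcdecode             # installed from parlance/ctcdecode

print("All preprocessing and decoding dependencies are importable")
```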
For training LipVoicer on the benchmark datasets, please download LRS2 or LRS3. In all subsequent steps, make sure to adhere to the dataset's directory structure.
Perform the following steps inside the LipVoicer directory:
- Extract the audio files from the videos (the audio files will be saved in WAV format):
```bash
python ...
```
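The exact extraction script is elided above. Purely as an illustration of what this step produces, the following sketch pulls a 16 kHz mono WAV out of every MP4 with ffmpeg; the directory layout and sampling rate are assumptions, not the repository's defaults:

```python
import subprocess
from pathlib import Path

VIDEO_DIR = Path("datasets/lrs3/videos")  # hypothetical layout
AUDIO_DIR = Path("datasets/lrs3/audio")   # hypothetical layout

for video in VIDEO_DIR.rglob("*.mp4"):
    wav = AUDIO_DIR / video.relative_to(VIDEO_DIR).with_suffix(".wav")
    wav.parent.mkdir(parents=True, exist_ok=True)
    # -vn drops the video stream; -ac 1 / -ar 16000 give 16 kHz mono audio
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video), "-vn", "-ac", "1", "-ar", "16000", str(wav)],
        check=True,
    )
```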
- Compute the log mel-spectrograms and save them:
```bash
cd dataloaders
python wav2mel.py dataset.audio_dir=<audio_dir>
cd ..
```
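For reference, computing a log mel-spectrogram amounts to the following sketch; the STFT and mel parameters here are illustrative guesses, and `wav2mel.py` defines the values LipVoicer actually uses:

```python
import librosa
import numpy as np

def log_mel(wav_path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
    # Illustrative parameters only; see dataloaders/wav2mel.py for the real ones
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Clamp before the log to avoid -inf on silent frames
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))
```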
If you wish to generate audio for all of the test videos of LRS2/LRS3, use the following command:
```bash
python generate_full_test_split.py generate.save_dir=<save_dir> \
    generate.lipread_text_dir=<lipread_text_dir> \
    dataset.dataset_path=<dataset_path> \
    dataset.audio_dir=<audio_dir> \
    dataset.mouthrois_dir=<mouthrois_dir>
```