Official PyTorch implementation for the following paper:
Intelligible Lip-to-Speech Synthesis with Speech Units
Jeongsoo Choi, Minsu Kim, Yong Man Ro
Interspeech 2023
[Paper] [Project]
conda create -y -n lip2speech python=3.10
conda activate lip2speech
git clone -b main --single-branch https://github.com/choijeongsoo/lip2speech-unit.git
cd lip2speech-unit
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt
git clone https://github.com/facebookresearch/fairseq.git
cd fairseq
git checkout afc77bd
pip install -e ./
cd ..
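After the install steps above, it can help to confirm that the pinned package versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical and not part of this repository, and the prefix match is a loose check, not a full version comparison.

```python
from importlib.metadata import PackageNotFoundError, version

# Version pins copied from the pip command above.
EXPECTED = {
    "torch": "1.13.1+cu116",
    "torchvision": "0.14.1+cu116",
    "torchaudio": "0.13.1",
}


def check_pins(expected, get_version=version):
    """Return {package: installed_version_or_None} for every pin that is not met.

    `check_pins` is an illustrative helper, not something this repository provides.
    """
    problems = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        # Prefix match so that a local suffix such as "+cu116" is tolerated.
        if installed is None or not installed.startswith(wanted):
            problems[name] = installed
    return problems


if __name__ == "__main__":
    for name, installed in check_pins(EXPECTED).items():
        print(f"{name}: expected {EXPECTED[name]}, found {installed}")
```

An empty result means all three pins are satisfied; any entry printed names a package that is missing or has a different version.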
- ${ROOT}/datasets/${DATASET}/audio: processed audio files
- ${ROOT}/datasets/${DATASET}/video: processed video files
- ${ROOT}/datasets/${DATASET}/label/*.tsv: training manifests
- reference: https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm/speech2unit
- 6th layer of HuBERT Base + KM200
- config
  - filter_length: 640
  - hop_length: 160
  - win_length: 640
  - n_mel_channels: 80
  - sampling_rate: 16000
  - mel_fmin: 0.0
  - mel_fmax: 8000.0
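The config values above fix the time and frequency resolution of the extracted features. The sketch below works out the derived quantities. It assumes the usual STFT reading of the names (`filter_length` as the FFT size, `hop_length` and `win_length` in samples), which the parameter names suggest but the README does not state; `MelConfig` and `derived_quantities` are illustrative names, not code from this repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MelConfig:
    # Values copied from the config above.
    filter_length: int = 640
    hop_length: int = 160
    win_length: int = 640
    n_mel_channels: int = 80
    sampling_rate: int = 16000
    mel_fmin: float = 0.0
    mel_fmax: float = 8000.0


def derived_quantities(cfg):
    """Compute resolution figures implied by the config (assumed STFT semantics)."""
    return {
        # One frame every hop_length samples.
        "frames_per_second": cfg.sampling_rate / cfg.hop_length,
        "hop_ms": 1000 * cfg.hop_length / cfg.sampling_rate,
        "window_ms": 1000 * cfg.win_length / cfg.sampling_rate,
        # One-sided spectrum of a real-valued FFT of size filter_length.
        "fft_bins": cfg.filter_length // 2 + 1,
        "nyquist_hz": cfg.sampling_rate / 2,
    }


if __name__ == "__main__":
    for key, value in derived_quantities(MelConfig()).items():
        print(f"{key}: {value}")
```

Under these assumptions the features run at 100 frames per second (10 ms hop, 40 ms window), and `mel_fmax` coincides with the Nyquist frequency of the 16 kHz audio.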
We provide sample data in 'datasets/lrs3' directory.
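Before training, the directory layout described above can be checked with a few lines of standard-library Python. This is a sketch only: `missing_entries` is a hypothetical helper, and it checks for the presence of the three locations, not for the validity of their contents.

```python
from pathlib import Path


def missing_entries(root, dataset):
    """List the expected dataset locations that are absent under root.

    Checks ${ROOT}/datasets/${DATASET}/{audio,video} and at least one
    label/*.tsv manifest. Illustrative helper, not part of this repository.
    """
    base = Path(root) / "datasets" / dataset
    missing = [sub for sub in ("audio", "video") if not (base / sub).is_dir()]
    label_dir = base / "label"
    if not (label_dir.is_dir() and any(label_dir.glob("*.tsv"))):
        missing.append("label/*.tsv")
    return missing


if __name__ == "__main__":
    # "." as ROOT and "lrs3" as DATASET are example values.
    print(missing_entries(".", "lrs3"))
```

An empty list means the audio and video directories exist and at least one manifest was found.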
Lip Reading Sentences 3 (LRS3)
1st stage | 2nd stage | STOI | ESTOI | PESQ | WER(%) |
---|---|---|---|---|---|
Multi-target Lip2Speech | Multi-input Vocoder | 0.552 | 0.354 | 1.31 | 50.4 |
Multi-target Lip2Speech | Multi-input Vocoder + augmentation | 0.543 | 0.351 | 1.28 | 50.2 |
Multi-target Lip2Speech + AV-HuBERT | Multi-input Vocoder + augmentation | 0.578 | 0.393 | 1.31 | 29.8 |
Lip Reading Sentences 2 (LRS2)
1st stage | 2nd stage | STOI | ESTOI | PESQ | WER(%) |
---|---|---|---|---|---|
Multi-target Lip2Speech | Multi-input Vocoder | | | | |
Multi-target Lip2Speech | Multi-input Vocoder + augmentation | 0.565 | 0.395 | 1.32 | 44.8 |
Multi-target Lip2Speech + AV-HuBERT | Multi-input Vocoder + augmentation | 0.585 | 0.412 | 1.34 | 35.7 |
We use the pre-trained AV-HuBERT Large (LRS3 + VoxCeleb2 (En)) model released by the AV-HuBERT project.
For inference, download the checkpoints and place them in the 'checkpoints' directory.
For training, run scripts/${DATASET}/train.sh in the 'multi_target_lip2speech' and 'multi_input_vocoder' directories.
For inference, run scripts/${DATASET}/inference.sh in the 'multi_target_lip2speech' and 'multi_input_vocoder' directories.
This repository is built using Fairseq, AV-HuBERT, ESPnet, and speech-resynthesis. We thank the authors for open-sourcing these projects.
If our work is useful for your research, please cite the following paper:
@article{choi2023intelligible,
title={Intelligible Lip-to-Speech Synthesis with Speech Units},
author={Jeongsoo Choi and Minsu Kim and Yong Man Ro},
journal={arXiv preprint arXiv:2305.19603},
year={2023},
}