Intelligible Lip-to-Speech Synthesis with Speech Units

Official PyTorch implementation for the following paper:

Intelligible Lip-to-Speech Synthesis with Speech Units
Jeongsoo Choi, Minsu Kim, Yong Man Ro
Interspeech 2023
[Paper] [Project]

Installation

conda create -y -n lip2speech python=3.10
conda activate lip2speech

git clone -b main --single-branch https://github.com/choijeongsoo/lip2speech-unit.git
cd lip2speech-unit

pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt

git clone https://github.com/facebookresearch/fairseq.git
cd fairseq
git checkout afc77bd
pip install -e ./
cd ..

Data Preparation

Video and Audio

reference: https://github.com/facebookresearch/av_hubert/tree/main/avhubert/preparation

${ROOT}/datasets/${DATASET}/audio for processed audio files
${ROOT}/datasets/${DATASET}/video for processed video files
${ROOT}/datasets/${DATASET}/label/*.tsv for training manifests

Speech Units

reference: https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm/speech2unit

6th layer of HuBERT Base + KM200

Speaker Embedding

reference: https://github.com/CorentinJ/Real-Time-Voice-Cloning

Mel-spectrogram

reference: https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/gslm/unit2speech/tacotron2

config

filter_length: 640
hop_length: 160
win_length: 640
n_mel_channels: 80
sampling_rate: 16000
mel_fmin: 0.0
mel_fmax: 8000.0

We provide sample data in 'datasets/lrs3' directory.

Model Checkpoints

Lip Reading Sentences 3 (LRS3)

1st stage	2nd stage	STOI	ESTOI	PESQ	WER(%)
Multi-target Lip2Speech	Multi-input Vocoder	0.552	0.354	1.31	50.4
Multi-target Lip2Speech	Multi-input Vocoder + augmentation	0.543	0.351	1.28	50.2
Multi-target Lip2Speech + AV-HuBERT	Multi-input Vocoder + augmentation	0.578	0.393	1.31	29.8

Lip Reading Sentences 2 (LRS2)

1st stage	2nd stage	STOI	ESTOI	PESQ	WER(%)
Multi-target Lip2Speech	Multi-input Vocoder
Multi-target Lip2Speech	Multi-input Vocoder + augmentation	0.565	0.395	1.32	44.8
Multi-target Lip2Speech + AV-HuBERT	Multi-input Vocoder + augmentation	0.585	0.412	1.34	35.7

We use the pre-trained AV-HuBERT Large (LRS3 + VoxCeleb2 (En)) model available from here.

For inference, download the checkpoints and place them in the 'checkpoints' directory.

Training

scripts/${DATASET}/train.sh

in 'multi_target_lip2speech' and 'multi_input_vocoder' directory

Inference

scripts/${DATASET}/inference.sh

in 'multi_target_lip2speech' and 'multi_input_vocoder' directory

Acknowledgement

This repository is built using Fairseq, AV-HuBERT, ESPnet, speech-resynthesis. We appreciate the open source of the projects.

Citation

If our work is useful for your research, please cite the following paper:

@article{choi2023intelligible,
      title={Intelligible Lip-to-Speech Synthesis with Speech Units},
      author={Jeongsoo Choi and Minsu Kim and Yong Man Ro},
      journal={arXiv preprint arXiv:2305.19603},
      year={2023},
}