[ICCV2023] Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

Speech2Lip

Official PyTorch implementation for the paper "Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video".

Project Page | Paper

Feel free to contact xzwu@eee.hku.hk if you have any questions about the code.

Prerequisites

  • Install the Python dependencies with:

    pip install -r requirements.txt
    
  • PyTorch3D (installed from source):

    git clone https://github.com/facebookresearch/pytorch3d.git
    cd pytorch3d/ && pip install -e .
    
  • Basel Face Model 2009

    • Place 01_MorphableModel.mat in preprocess/data_util/face_tracking/3DMM/
    • Convert the file:
      cd preprocess/data_util/face_tracking/
      python convert_BFM.py
      
  • FFmpeg is required to cut the video and combine the audio with the silent generated videos.
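
Before moving on, it can help to confirm that the prerequisites above are in place. The following is a minimal sketch, not part of the repository: the function name missing_prerequisites is hypothetical, and it assumes it is given the repository root. It only checks that ffmpeg is on the PATH, that pytorch3d is importable, and that 01_MorphableModel.mat sits in the directory named above.

```python
import importlib.util
import shutil
from pathlib import Path

# Directory named in the prerequisites for the Basel Face Model file.
BFM_DIR = Path("preprocess/data_util/face_tracking/3DMM")


def missing_prerequisites(repo_root="."):
    """Return descriptions of prerequisites that appear to be missing.

    Hypothetical helper for illustration; it does not verify versions.
    """
    problems = []
    # FFmpeg is needed to cut videos and to mux audio into generated videos.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    # PyTorch3D is installed from source with `pip install -e .`.
    if importlib.util.find_spec("pytorch3d") is None:
        problems.append("pytorch3d is not importable")
    # The Basel Face Model file must be placed manually before conversion.
    bfm_file = Path(repo_root) / BFM_DIR / "01_MorphableModel.mat"
    if not bfm_file.is_file():
        problems.append(f"missing {bfm_file.as_posix()}")
    return problems
```

Calling missing_prerequisites() from the repository root returns an empty list when all three checks pass; otherwise each entry names what still needs to be set up.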

Data Preprocessing

The source videos used in our experiments come from LSP and YouTube. In this example, we use May's video and provide the bash scripts. After data preprocessing, the training data is created in the dataset/may_face_crop_lip/ directory. To train on your own video, replace May's data and paths with your own.

  • Video preprocessing

    • Download the original video may.mp4. Refer to LSP for the URL and duration.
    • Convert to images:
      ffmpeg -i may.mp4 -q:v 2 -r 25 %05d.jpg
      
      • Place the images in dataset/may/images/.
      • Once the data preprocessing is complete, the directory dataset/may/ can be deleted.
    • Extract the audio audio.wav:
      ffmpeg -i may.mp4 -vn -acodec pcm_s16le -ar 16000 audio.wav
      
      • Place it in dataset/may_face_crop_lip/audio/.
    • For convenience, we provide the cropped video of May here.
  • Audio preprocessing

    • Extract the DeepSpeech features audio.npy:
      cd preprocess/deepspeech_features/
      bash extract_ds_features_may.sh
      
      • If successful, a file named audio.npy will be created in dataset/may_face_crop_lip/audio/.
  • Image preprocessing

    • [Only for data preprocessing] Download 79999_iter.pth and place it in preprocess/face_parsing/.
    • Generate all the files for training:
      cd preprocess/
      bash preprocess_may.sh
      
  • Configuration file

    • We offer a sample in configs/face_simple_configs/may/.
    • To train with your own data, modify the data-related items that are highlighted in the provided sample.
  • [Only for training] Sync expert network
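
The two ffmpeg invocations in the video preprocessing step can also be driven from Python when processing several videos. The sketch below is illustrative and not part of the repository: the helper names are hypothetical, and writing the frames directly into the target directory (rather than moving them afterwards) is an assumption. The ffmpeg flags and the paths are taken from the commands above.

```python
import subprocess
from pathlib import Path


def frame_extraction_cmd(video, image_dir, fps=25):
    """ffmpeg arguments that dump the video as numbered JPEG frames."""
    return ["ffmpeg", "-i", str(video), "-q:v", "2", "-r", str(fps),
            str(Path(image_dir) / "%05d.jpg")]


def audio_extraction_cmd(video, audio_dir, sample_rate=16000):
    """ffmpeg arguments that extract 16-bit PCM audio as audio.wav."""
    return ["ffmpeg", "-i", str(video), "-vn", "-acodec", "pcm_s16le",
            "-ar", str(sample_rate), str(Path(audio_dir) / "audio.wav")]


def run_video_preprocessing(video="may.mp4",
                            image_dir="dataset/may/images",
                            audio_dir="dataset/may_face_crop_lip/audio"):
    """Create the output directories, then run both ffmpeg commands."""
    for directory in (image_dir, audio_dir):
        Path(directory).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(frame_extraction_cmd(video, image_dir), check=True)
    subprocess.run(audio_extraction_cmd(video, audio_dir), check=True)
```

With may.mp4 in the working directory, run_video_preprocessing() writes the frames to dataset/may/images/ and audio.wav to dataset/may_face_crop_lip/audio/. The DeepSpeech and image preprocessing scripts are still run separately as described above.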

Train Speech2Lip

We use May's video as an example and provide the bash scripts.

  • Train with command:
    bash scripts/example/train_may.sh
    

Pretrained Models

  • Our pretrained models are available here.
  • To run inference, place the pretrained model model_may.pt in log/face_simple/may.

Inference

We use May's video as an example and provide the bash scripts.

  • For evaluation:
    • Generate images:
      bash scripts/example/inference_may.sh
      
      • We split the video into 90% train and 10% test sets.
      • Images are generated in rendering_result/may/example/postfusion.
    • Combine images into a video:
      ffmpeg -r 25 -i %05d.jpg -c:v libx264 -pix_fmt yuv420p output.mp4
      
    • Combine the video with the test audio:
      ffmpeg -i output.mp4 -i audio_test.wav -c:v copy -c:a aac -strict experimental output_with_audio.mp4
      
      • For the video demo, split the wav file into 90%/10% using ffmpeg, with the 10% used in inference.
      • We provide audio_test.wav as an example.
    • Evaluation metrics including PSNR, SSIM, CPBD, LMD and Sync score can be applied.
  • For any given audio:
    • Place the DeepSpeech features of the new audio, audio.npy, in dataset/may_face_crop_lip/audio_test/, then run:
    bash scripts/example/inference_new_audio_may.sh
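
The 90%/10% split of the wav file mentioned above can be sketched as follows. This is a hedged illustration, not the repository's own tooling: the function names and the output name audio_train.wav are hypothetical, and it assumes the first 90% of the audio is the training portion and the last 10% is the test portion. It reads the duration with Python's standard wave module and builds ffmpeg commands using the -t (duration) and -ss (start offset) options.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM wav file, computed as frames / sample rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def split_commands(wav_path, train_out, test_out, train_fraction=0.9):
    """Return the split time and two ffmpeg argument lists.

    Assumption: the leading `train_fraction` of the audio is the training
    part and the remainder is the test part.
    """
    split_at = wav_duration_seconds(wav_path) * train_fraction
    stamp = f"{split_at:.3f}"
    # -t keeps only the first `stamp` seconds of the input.
    train_cmd = ["ffmpeg", "-i", str(wav_path), "-t", stamp,
                 "-c", "copy", str(train_out)]
    # -ss skips the first `stamp` seconds and keeps the rest.
    test_cmd = ["ffmpeg", "-i", str(wav_path), "-ss", stamp,
                "-c", "copy", str(test_out)]
    return split_at, train_cmd, test_cmd
```

Each returned list can be passed to subprocess.run(cmd, check=True). The resulting test file plays the role of audio_test.wav in the muxing command above.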
    

Citation

If you find our work useful in your research, please consider citing our paper:

@inproceedings{wu2023speech2lip,
  title={Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video},
  author={Wu, Xiuzhe and Hu, Pengfei and Wu, Yang and Lyu, Xiaoyang and Cao, Yan-Pei and Shan, Ying and Yang, Wenming and Sun, Zhongqian and Qi, Xiaojuan},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={22168--22177},
  year={2023}
}

Acknowledgments

We use face-parsing.PyTorch to compute the head mask in the canonical space, DeepSpeech for audio feature extraction, and Wav2Lip for the sync expert network. We are also grateful to ADNeRF for their data preprocessing script.