This is an unofficial extension of the Wav2Lip: Accurately Lip-syncing Videos In The Wild repository. We use image super-resolution and face segmentation to improve the visual quality of lip-synced videos.
Our work is largely based on code from the following repositories:
- The Wav2Lip repository, which provides the core lip-sync model of our algorithm.
- The face-parsing.PyTorch repository, which provides the face segmentation model.
- The extremely useful BasicSR repository, which we use for super resolution.
- The face_alignment repository, which Wav2Lip heavily relies on for face detection.
Our algorithm consists of the following steps:
- Pretrain ESRGAN on a video containing speech of the target person.
- Apply the Wav2Lip model to the source video and target audio, as in the official Wav2Lip repository.
- Upsample the output of Wav2Lip with ESRGAN.
- Use BiSeNet face segmentation to modify only the relevant pixels in the video (a rough sketch of the last two steps is shown below).
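As a rough illustration of the last two steps, here is a minimal per-frame sketch. The `esrgan` and `bisenet` objects and their methods (`upsample`, `parse`) are hypothetical placeholders rather than the actual interfaces used in this repository:

```python
import cv2
import numpy as np

def postprocess_frame(original_frame, wav2lip_face, face_box, esrgan, bisenet):
    """Upsample the Wav2Lip face crop with ESRGAN and blend only face pixels back in."""
    x1, y1, x2, y2 = face_box

    # Upsample the low-resolution Wav2Lip output (hypothetical interface that
    # takes and returns an HxWx3 uint8 BGR image).
    sr_face = esrgan.upsample(wav2lip_face)
    sr_face = cv2.resize(sr_face, (x2 - x1, y2 - y1))

    # Segment the face with BiSeNet (hypothetical interface returning an HxW
    # label map) and build a mask of face pixels.
    labels = bisenet.parse(sr_face)
    mask = (labels > 0).astype(np.float32)[..., None]

    # Replace only the masked pixels, keeping the rest of the original frame untouched.
    result = original_frame.copy()
    region = result[y1:y2, x1:x2].astype(np.float32)
    blended = mask * sr_face.astype(np.float32) + (1.0 - mask) * region
    result[y1:y2, x1:x2] = blended.astype(np.uint8)
    return result
```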
You can learn more about the method in this article (in Russian).
Our approach is by no means flawless, and some of the frames it produces contain artifacts or odd mistakes. However, it can be used to lip-sync high-quality videos with plausible output.
The simplest way is to use our Google Colab demo. However, if you want to test the algorithm on your own machine, run the following commands. Note that you need Python 3 and CUDA installed.
- Clone this repository and install requirements:

  ```bash
  git clone https://github.com/Markfryazino/wav2lip-hq.git
  cd wav2lip-hq
  pip3 install -r requirements.txt
  ```
- Download all the `.pth` files from here and place them in the `checkpoints` folder. Apart from that, download the face detection model checkpoint:

  ```bash
  wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "face_detection/detection/sfd/s3fd.pth"
  ```
- Run the inference script:

  ```bash
  python inference.py \
      --checkpoint_path "checkpoints/wav2lip_gan.pth" \
      --segmentation_path "checkpoints/face_segmentation.pth" \
      --sr_path "checkpoints/esrgan_yunying.pth" \
      --face <path to source video> \
      --audio <path to source audio> \
      --outfile <desired path to output>
  ```
Although we provide a pre-trained ESRGAN checkpoint, its training dataset was quite modest, so the results may be unsatisfactory. Hence, it can be useful to finetune the model on your target video. One or two minutes of speech are usually enough.
To simplify finetuning the model, we provide a Colab notebook. You can also run the commands listed there on your own machine: namely, download the models, run inference while saving all the generated frames, resize them, and train ESRGAN.
Bear in mind that the procedure is quite time- and memory-consuming.
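For illustration, here is a minimal sketch of the data-preparation idea behind finetuning: take face frames saved from the target video and create low-/high-resolution training pairs by downscaling. The paths, crop size, and scale factor are assumptions for illustration, not the exact values used in the notebook.

```python
import os
import cv2

def make_sr_pairs(frames_dir, out_dir, hr_size=256, scale=4):
    """Build low-/high-resolution image pairs for ESRGAN finetuning (illustrative sketch)."""
    os.makedirs(os.path.join(out_dir, "hr"), exist_ok=True)
    os.makedirs(os.path.join(out_dir, "lr"), exist_ok=True)
    for name in sorted(os.listdir(frames_dir)):
        frame = cv2.imread(os.path.join(frames_dir, name))
        if frame is None:
            continue
        # High-resolution target: the saved face frame resized to a fixed size.
        hr = cv2.resize(frame, (hr_size, hr_size))
        # Low-resolution input: the same frame downscaled, mimicking the
        # quality of the raw Wav2Lip output.
        lr = cv2.resize(hr, (hr_size // scale, hr_size // scale),
                        interpolation=cv2.INTER_AREA)
        cv2.imwrite(os.path.join(out_dir, "hr", name), hr)
        cv2.imwrite(os.path.join(out_dir, "lr", name), lr)
```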