Project description will be added
High-quality Wav2Lip that can be trained on arbitrary datasets. In addition to the training and inference scripts, scripts for the required preprocessing steps are provided.
- Convert the audio sampling rate to 16000 Hz.
- Compute and save the mel-spectrogram for each audio.
- Convert the video frame rate to 25 fps.
- Extract and save raw frames (no face detection) from each video.
- Compute the offset between each audio and video pair by using the pretrained SyncNet. The offset values are needed for the sync-correction of the dataset.
- [Not provided] Estimate the face bounding box. Crop and save the bounding box region for each frame.
- Recommendation: use a high-performance face detection tool rather than S3FD (the one used here). I used InsightFace.
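Although the face-cropping script itself is not provided, the step amounts to cutting the detected bounding box out of each frame. Below is a minimal sketch; the `(x1, y1, x2, y2)` bounding-box format and the 10% margin are assumptions, and the detector call itself (e.g. InsightFace) is left out:

```python
import numpy as np

def crop_face(frame: np.ndarray, bbox, margin: float = 0.1) -> np.ndarray:
    """Crop the (x1, y1, x2, y2) box from an HxWxC frame, expanded by
    `margin` on each side and clipped to the image borders. (Illustrative
    helper, not part of this repository.)"""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = bbox
    mx = int((x2 - x1) * margin)
    my = int((y2 - y1) * margin)
    x1, y1 = max(0, x1 - mx), max(0, y1 - my)
    x2, y2 = min(w, x2 + mx), min(h, y2 + my)
    return frame[y1:y2, x1:x2]

frame = np.zeros((720, 1280, 3), dtype=np.uint8)
crop = crop_face(frame, (400, 200, 600, 400))  # 200x200 box, 10% margin
print(crop.shape)  # (240, 240, 3)
```

In practice the detector's output format varies by library, so the unpacking of `bbox` would need to match whatever tool you choose.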
Changes from the official implementation
- Arbitrary datasets can be used.
- Multi-GPU training is supported.
- To avoid a data-loading bottleneck, mel-spectrograms are computed and saved as `.npy` files beforehand. (Previously, the STFT was computed every time the `__getitem__` function was called.)
- The `FaceEncoder` of SyncNet takes a 48 x 48 lip-region image, rather than a 48 x 96 lower-half image (enabled by the `tighter_box` option).
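The mel-spectrogram precomputation described above can be sketched as follows: spectrograms are saved once as `.npy` files during preprocessing, so `__getitem__` only performs a cheap `np.load` instead of an STFT. The file layout and class name are illustrative, not the repository's actual code:

```python
import tempfile
from pathlib import Path

import numpy as np

class MelDataset:
    """Toy dataset that loads precomputed mel-spectrograms from .npy files."""

    def __init__(self, mel_dir: Path):
        self.paths = sorted(mel_dir.glob("*.npy"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # No STFT here: the spectrogram was computed once during preprocessing.
        return np.load(self.paths[idx])

# Simulate the preprocessing step: save one (80, 100) mel-spectrogram.
mel_dir = Path(tempfile.mkdtemp())
np.save(mel_dir / "clip0.npy", np.random.rand(80, 100).astype(np.float32))

ds = MelDataset(mel_dir)
print(len(ds), ds[0].shape)  # 1 (80, 100)
```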
pip install -r requirements.txt
To begin with, the audio files are resampled to a sampling rate of 16000 Hz. Then the STFT is applied to the resampled audio signals to obtain the corresponding mel-spectrograms.
cd scripts/preprocess
python process_audio.py
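What `process_audio.py` does can be illustrated with plain SciPy/NumPy: resample the waveform to 16000 Hz, then take the STFT. This is a simplified stand-in; the real script's parameters (hop length, number of mel bands, and the mel filterbank applied to the magnitude spectrum) are not reproduced here:

```python
import numpy as np
from scipy.signal import resample_poly, stft

orig_sr, target_sr = 44100, 16000
audio = np.random.randn(orig_sr)  # 1 second of dummy audio

# Resample to 16 kHz (44100 * 160 / 441 == 16000).
resampled = resample_poly(audio, up=160, down=441)

# Magnitude STFT; a mel filterbank would be applied to this in practice.
_, _, spec = stft(resampled, fs=target_sr, nperseg=400, noverlap=240)
mag = np.abs(spec)
print(resampled.shape, mag.shape)
```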
Since the video files downloaded from YouTube have different frame rates (FPS), this rate should be equalized. The `ffmpeg` command-line tool is used for frame rate conversion. The video length remains the same after conversion, so the audio doesn't have to be modified.
python process_video.py
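Under the hood, the frame-rate conversion comes down to an `ffmpeg` call like the one below. The file names are placeholders, and the command is only built and inspected here, not executed:

```python
def make_fps_cmd(src: str, dst: str, fps: int = 25) -> list[str]:
    """Build an ffmpeg command that re-encodes `src` at `fps` frames per
    second. `-r` before the output sets the output frame rate; the audio
    stream's duration is unchanged."""
    return ["ffmpeg", "-y", "-i", src, "-r", str(fps), dst]

cmd = make_fps_cmd("input.mp4", "output_25fps.mp4")
print(" ".join(cmd))  # ffmpeg -y -i input.mp4 -r 25 output_25fps.mp4
```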
The official SyncNet implementation and its pretrained checkpoint are used for sync-correction. All the dependencies should be installed before moving on to the next step.
git clone https://github.com/joonson/syncnet_python.git
cd syncnet_python
Two Python files (`get_offset.py` and `newSyncNetInstance.py`) in the `scripts/preprocess/sync-correction` directory need to be placed in the `syncnet_python` directory. The shift value that minimizes the SyncNet loss is selected as the offset; the offset obtained for each video is recorded in the `output/offset.csv` file. If the input videos are not separated into frame images, add the `--separate_frames` option at the end of the command.
python get_offset.py # --separate_frames
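Conceptually, the offset search evaluates the sync loss at a range of audio-video shifts and keeps the shift with the lowest loss. A minimal sketch, with a dummy quadratic loss standing in for the pretrained SyncNet:

```python
import numpy as np

def best_offset(loss_fn, shifts):
    """Return the shift (in frames) that minimizes the sync loss."""
    losses = np.array([loss_fn(s) for s in shifts])
    return shifts[int(np.argmin(losses))]

# Dummy loss: pretend the pair is best aligned at a shift of +3 frames.
dummy_loss = lambda s: (s - 3) ** 2

offset = best_offset(dummy_loss, list(range(-15, 16)))
print(offset)  # 3
```

The real `get_offset.py` computes the loss from SyncNet's audio and video embeddings; only the argmin-over-shifts structure is shown here.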
git clone https://github.com/yukyeongleee/Wav2Lip-HQ.git
cd Wav2Lip-HQ
python scripts/train_syncnet.py {run_id} # SyncNet training
python scripts/train_wav2lip.py {run_id} # Wav2Lip training
- Add dataset preprocessing scripts
- Add sync-correction scripts
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Contact us: ukok828@gmail.com
Project Link: https://github.com/yukyeongleee/Wav2Lip-HQ
- Rudrabha/Wav2Lip: the official Wav2Lip implementation
- joonson/syncnet_python: the official SyncNet implementation
- Innerverz-AI/CodeTemplate
- othneildrew/Best-README-Template