ObamaNet: Lip Sync from Audio
List of Contents
- Requirements
- Data Extraction
- Data Preprocessing
- Training Different Models
- Pretrained Model
- How to run an example
- Citations
- FAQs
Requirements
You can install the requirements by running the following command:
sudo pip3 install -r requirements.txt
The project is built for Python 3.5 and above. The main libraries are listed below:
- OpenCV (sudo pip3 install opencv-contrib-python)
- Dlib (sudo pip3 install dlib) with this file unzipped in the data folder
- Python Speech Features (sudo pip3 install python-speech-features)
For a complete list, refer to the requirements.txt file.
I used the tools below to extract and manipulate the data:
- ffmpeg (sudo apt-get install ffmpeg)
- YouTube-dl
Data Extraction
I extracted the data from YouTube using youtube-dl, which is perhaps the best YouTube downloader on Linux. The commands for extracting particular streams are given below.
- Subtitle Extraction
youtube-dl --sub-lang en --skip-download --write-sub --output '~/obamanet/data/captions/%(autonumber)s.%(ext)s' --batch-file ~/obamanet/data/obama_addresses.txt --ignore-config
- Video Extraction
youtube-dl --batch-file ~/obamanet/data/obama_addresses.txt -o '~/obamanet/data/videos/%(autonumber)s.%(ext)s' -f "best[height=720]" --autonumber-start 1
(Videos not available in 720p: 165)
- Video to Audio Conversion
python3 vid2wav.py
- Video to Images
ffmpeg -i 00001.mp4 -r 1/5 -vf scale=-1:720 images/00001-$filename%05d.bmp
To convert from BMP format to JPG format, use the following in the directory
mogrify -format jpg *.bmp
rm -rf *.bmp
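The video-to-audio step above is handled by vid2wav.py, whose contents are not shown in this README. As a rough sketch of what such a step could look like (not necessarily what the script does), the snippet below calls ffmpeg once per video. The function names and the data/videos and data/audios locations are assumptions made for illustration, and ffmpeg is assumed to be on the PATH.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, audio_dir):
    """Return the ffmpeg argument list that extracts a WAV track from one video.

    Hypothetical helper: the real vid2wav.py may use different options.
    """
    video_path = Path(video_path)
    wav_path = Path(audio_dir) / (video_path.stem + ".wav")
    # -y overwrites an existing output file, -vn drops the video stream.
    return ["ffmpeg", "-y", "-i", str(video_path), "-vn", str(wav_path)]


def convert_all(video_dir, audio_dir):
    """Convert every .mp4 in video_dir to a .wav of the same name in audio_dir."""
    Path(audio_dir).mkdir(parents=True, exist_ok=True)
    for video in sorted(Path(video_dir).glob("*.mp4")):
        # check=True raises CalledProcessError if ffmpeg exits with an error.
        subprocess.run(build_ffmpeg_command(video, audio_dir), check=True)
```

Calling convert_all("data/videos", "data/audios") from the repository root would then write one WAV file per downloaded video, under the assumed folder names.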
Copy the patched images into folder a and the cropped images into folder b, then combine and split them:
python3 tools/process.py --input_dir a --b_dir b --operation combine --output_dir c
python3 tools/split.py --dir c
You may use this pretrained model or train pix2pix from scratch using this dataset. Unzip the dataset into the pix2pix main directory.
python3 pix2pix.py --mode train --output_dir output --max_epochs 200 --input_dir c/train/ --which_direction AtoB
To run the pix2pix trained model
python3 pix2pix.py --mode test --output_dir test_out/ --input_dir c_test/ --checkpoint output/
To convert images to video
ffmpeg -r 30 -f image2 -s 256x256 -i %d-targets.png -vcodec libx264 -crf 25 ../targets.mp4
Pretrained Model
The pretrained model and a subset of the data are available here: Link
Download and extract the checkpoints and the data folders into the repository. The file structure should look as shown below.
obamanet
|
└─ data
| | audios
| | a2key_data
| ...
|
└─ checkpoints
| | output
| | model.h5
| ...
└─ train.py
└─ run.py
└─ run.sh
...
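Before running anything, it can help to confirm that the extracted folders landed where the tree above expects them. The following is a minimal sketch of such a check. The helper name missing_paths is hypothetical, the path list is read directly off the tree (entries hidden behind "..." are not checked), and model.h5 is assumed to sit directly under checkpoints as drawn.

```python
from pathlib import Path

# Paths read off the tree above; anything elided with "..." is not checked.
EXPECTED_PATHS = [
    "data/audios",
    "data/a2key_data",
    "checkpoints/output",
    "checkpoints/model.h5",
    "train.py",
    "run.py",
    "run.sh",
]


def missing_paths(repo_root):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


def report(repo_root="."):
    """Print a short summary of which expected paths are missing."""
    missing = missing_paths(repo_root)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected files and folders are present.")
```

Running report() from the repository root prints the entries that still need to be downloaded or extracted.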
Running sample wav file
Run the following command:
bash run.sh <relative_path_to_audio_wav_file>
Example:
bash run.sh data/audios/karan.wav
Feel free to experiment with different voices. However, the result will depend on how close your voice is to the subject we trained on.
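When trying your own recordings, a quick look at the file's basic properties can rule out simple format problems before you run the pipeline. The sketch below uses Python's standard wave module. The helper name describe_wav is hypothetical, and this README does not state which sample rate or channel count run.sh expects, so the snippet only reports values and does not validate them.

```python
import wave


def describe_wav(path):
    """Return basic properties of an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": frames / rate,
        }
```

For example, describe_wav("data/audios/karan.wav") returns a dictionary you can compare against the properties of your own recording.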
Citation
If you use this code for your research, please cite the paper it is based on, "ObamaNet: Photo-realistic lip-sync from text", as well as the amazing pix2pix repository by affinelayer.
Cite as arXiv:1801.01442v1 [cs.CV]