This repository contains the demo for the audio-to-video synchronisation network (SyncNet). The network can be used for audio-visual synchronisation tasks, including:
- Removing temporal lags between the audio and visual streams in a video;
- Determining who is speaking amongst multiple faces in a video.
The model can be used for research purposes under the Creative Commons Attribution license. Please cite the paper below if you make use of the software.
The following packages are required to run the SyncNet demo:
- python (2.7.12)
- pytorch (0.4.0)
- numpy (1.14.3)
- scipy (1.0.1)
- opencv-python (3.4.0), installed via the opencv-contrib-python package
- python_speech_features (0.6)
- cuda (8.0)
- ffmpeg (3.4.2)
In addition to the above, the following are required to run the full pipeline:
- tensorflow (1.2, 1.4)
- pyscenedetect (0.3.5); does not work with 0.4
The demo has been tested with the package versions shown above, but it may also work with other versions.
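Assuming a standard pip environment, the pure-Python dependencies can be installed along these lines (this is a sketch, not a script shipped with the repository; pytorch is best installed following the official instructions for your CUDA version, and ffmpeg must be available on the system path):

pip install numpy==1.14.3 scipy==1.0.1 python_speech_features==0.6 opencv-contrib-python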
SyncNet demo:
python demo_syncnet.py --videofile data/example.avi --tmp_dir /path/to/temp/directory
Check that this script returns:
AV offset: 4
Min dist: 6.742
Confidence: 10.447
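These three values summarise the distances between audio and video features over a range of candidate time shifts: in the SyncNet paper, the offset is taken as the shift with the smallest mean distance, and the confidence is the margin between the median and the minimum of those mean distances. The sketch below illustrates the idea on a made-up distance matrix; the variable names and random data are hypothetical and not taken from demo_syncnet.py.

```python
import numpy as np

# Hypothetical distance matrix: one row per video frame, one column per
# candidate audio-video shift in the range [-vshift, +vshift] frames.
# In the actual demo these distances come from SyncNet embeddings.
vshift = 15
dists = np.random.rand(100, 2 * vshift + 1)           # placeholder values

mean_dists = dists.mean(axis=0)                       # average distance per shift
min_idx = int(mean_dists.argmin())                    # best-matching shift

av_offset = vshift - min_idx                          # signed offset in frames (sign convention may differ)
min_dist = float(mean_dists[min_idx])                 # "Min dist"
confidence = float(np.median(mean_dists) - min_dist)  # "Confidence"

print('AV offset:  %d' % av_offset)
print('Min dist:   %.3f' % min_dist)
print('Confidence: %.3f' % confidence)
```

A higher confidence indicates a more reliable offset estimate; very low confidence usually means the audio and the visible face do not match.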
Full pipeline:
sh download_model.sh
python run_pipeline.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
python run_syncnet.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
python run_visualise.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
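To process many videos, the three scripts can be driven from a small wrapper. The following is a sketch only; the directory paths and the use of subprocess are assumptions, not part of the repository.

```python
import glob
import os
import subprocess

# Hypothetical batch driver: run the full pipeline on every .mp4 in a folder,
# using each file name (without extension) as the reference.
data_dir = '/path/to/output'
for videofile in sorted(glob.glob('/path/to/videos/*.mp4')):
    reference = os.path.splitext(os.path.basename(videofile))[0]
    for script in ('run_pipeline.py', 'run_syncnet.py', 'run_visualise.py'):
        subprocess.check_call(['python', script,
                               '--videofile', videofile,
                               '--reference', reference,
                               '--data_dir', data_dir])
```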
Outputs:
$DATA_DIR/pycrop/$REFERENCE/*.avi - cropped face tracks
$DATA_DIR/pywork/$REFERENCE/offsets.txt - audio-video offset values
$DATA_DIR/pyavi/$REFERENCE/video_out.avi - output video
Citation:
@InProceedings{Chung16a,
  author    = "Chung, J.~S. and Zisserman, A.",
  title     = "Out of time: automated lip sync in the wild",
  booktitle = "Workshop on Multi-view Lip-reading, ACCV",
  year      = "2016",
}