This library implements the end-to-end facial synthesis model described in this paper.
The models were originally hosted on Git LFS, but demand was so high that I reached the quota for free Git LFS storage. I have therefore moved the models to Google Drive; they can be found here. Place the model file(s) under sda/data/.
To install the library, run:
$ pip install .
To create animations you will need to instantiate the VideoAnimator class. You then provide an image and an audio clip (or the paths to the files) and a video will be produced.
The model has been trained on the GRID, TCD-TIMIT, CREMA-D and LRW datasets. The default model is GRID. To load another pretrained model, simply instantiate the VideoAnimator with the following arguments:
import sda
va = sda.VideoAnimator(gpu=0, model_path="crema")  # Instantiate the animator with the CREMA-D model
The models that are currently uploaded are:
- grid
- timit
- crema
- lrw
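For example, assuming the names above can be passed directly as model_path (as the "crema" example shows), the TCD-TIMIT model would be loaded as follows; this is a sketch, so check the repository for the exact names accepted:

import sda
va = sda.VideoAnimator(gpu=0, model_path="timit")  # assumed to load the TCD-TIMIT model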
An example of generating a video from the file paths:
import sda
va = sda.VideoAnimator(gpu=0)  # Instantiate the animator (default GRID model)
vid, aud = va("example/image.bmp", "example/audio.wav")  # Generate the video frames and audio
If the audio and image are already loaded, they can be passed directly along with the sampling rate:
import sda
import scipy.io.wavfile as wav
from PIL import Image
va = sda.VideoAnimator(gpu=0)  # Instantiate the animator
fs, audio_clip = wav.read("example/audio.wav")  # Read the audio and its sampling rate
still_frame = Image.open("example/image.bmp")  # Load the still image
vid, aud = va(still_frame, audio_clip, fs=fs)  # Pass the loaded image and audio directly
The generated video can be saved using the animator's save_video method:
va.save_video(vid, aud, "generated.mp4")
The encoders for audio and video are made available so that they can be used to produce features for classification tasks.
The audio encoder (which consists of an audio-frame encoder and an RNN) is provided along with a dictionary containing information such as the feature length (in seconds) required by the audio-frame encoder and the overlap between consecutive audio frames.
import sda
encoder, info = sda.get_audio_feature_extractor(gpu=0)  # Returns the encoder and a dictionary describing its expected input
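As a rough sketch of how the extractor might be used for feature extraction, the example below slices an audio clip into overlapping windows using the lengths reported in info and passes each window through the encoder. The dictionary keys ("feature_length" and "overlap"), the normalisation, and the expected input shape are assumptions rather than the library's documented API, so inspect info and the encoder before relying on this.

import numpy as np
import scipy.io.wavfile as wav
import torch
import sda

encoder, info = sda.get_audio_feature_extractor(gpu=0)

fs, audio_clip = wav.read("example/audio.wav")
audio_clip = audio_clip.astype(np.float32) / 32768.0  # normalise 16-bit PCM to [-1, 1] (assumed preprocessing)

# Hypothetical keys: print(info) to discover the real names and units.
window_sec = info["feature_length"]  # length of one audio frame in seconds (assumed key)
overlap_sec = info["overlap"]        # overlap between consecutive frames in seconds (assumed key)

window = int(window_sec * fs)
step = int((window_sec - overlap_sec) * fs)

features = []
with torch.no_grad():
    for start in range(0, len(audio_clip) - window + 1, step):
        chunk = audio_clip[start:start + window]
        # Assumed input shape (batch, channels, samples) on GPU 0; the encoder may expect otherwise.
        x = torch.from_numpy(chunk).view(1, 1, -1).cuda(0)
        features.append(encoder(x).cpu())

features = torch.cat(features)  # one feature tensor per audio window

The resulting features can then be fed to a downstream classifier, which is the use case mentioned above.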