
Realistic-Neural-Talking-Head-Models

Implementation of Few-Shot Adversarial Learning of Realistic Neural Talking Head Models (Egor Zakharov et al.). https://arxiv.org/abs/1905.08233

This repo is based on https://github.com/vincent-thevenin/Realistic-Neural-Talking-Head-Models

My changes to the original repo

Download the Caffe-trained version of VGG19 converted to PyTorch.

Because some layer names in the converted model do not match, change VGG19_caffe_weight_path in params.py to your path and run:

python change_vgg19_caffelayer_name.py
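The script itself is not reproduced here, but the core idea, remapping mismatched layer names in a saved state dict, can be sketched as follows. The mapping shown is illustrative only; the real script uses the converted model's actual layer names.

```python
from collections import OrderedDict

def rename_state_dict_keys(state_dict, key_map):
    """Return a copy of state_dict with keys renamed via key_map.

    Keys not present in key_map are kept unchanged, and insertion
    order is preserved (as torch.load'ed state dicts expect).
    """
    return OrderedDict((key_map.get(k, k), v) for k, v in state_dict.items())

# Illustrative mapping only -- not the actual names used by the script.
example_map = {"features.0.weight": "conv1_1.weight"}
```

The renamed dict can then be loaded with the usual load_state_dict call on the target model.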

Main code changes in loss_generator.py:

```python
# Per-channel means of the Caffe-trained models, in RGB order.
self.vgg19_caffe_RGB_mean = torch.FloatTensor([123.68, 116.779, 103.939]).view(1, 3, 1, 1).to(device)
self.vggface_caffe_RGB_mean = torch.FloatTensor([129.1863, 104.7624, 93.5940]).view(1, 3, 1, 1).to(device)

# Scale inputs from [0, 1] to [0, 255], subtract the means, reorder RGB -> BGR.
x_vgg19 = x * 255 - self.vgg19_caffe_RGB_mean
x_vgg19 = x_vgg19[:, [2, 1, 0], :, :]
x_hat_vgg19 = x_hat * 255 - self.vgg19_caffe_RGB_mean
x_hat_vgg19 = x_hat_vgg19[:, [2, 1, 0], :, :]
x_vggface = x * 255 - self.vggface_caffe_RGB_mean
x_vggface = x_vggface[:, [2, 1, 0], :, :]  # B RGB H W -> B BGR H W
x_hat_vggface = x_hat * 255 - self.vggface_caffe_RGB_mean
x_hat_vggface = x_hat_vggface[:, [2, 1, 0], :, :]  # B RGB H W -> B BGR H W
```
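The preprocessing can be illustrated in isolation on a single pixel. The helper below is my own minimal sketch (the means match the snippet above; the function name is hypothetical):

```python
# Per-channel RGB means of the Caffe-trained VGG19 (from the snippet above).
VGG19_RGB_MEAN = (123.68, 116.779, 103.939)

def caffe_preprocess(rgb, mean=VGG19_RGB_MEAN):
    """Scale a normalized RGB pixel (values in 0-1) to 0-255,
    subtract the per-channel mean, and reorder to BGR."""
    centered = [255.0 * c - m for c, m in zip(rgb, mean)]
    return centered[::-1]  # RGB -> BGR
```

For the VGGFace loss the same transform applies with the VGGFace means instead.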

Explanations

The VGG19 and VGGFace losses mentioned in the paper use Caffe-trained networks, so the input should be in the range 0-255 and in BGR channel order.

However, in the original repo, VGG19 and VGGFace take images in RGB order with values in 0-1, while the loss weights are kept the same as in the paper (vgg19_weight=1.5e-1, vggface_weight=2.5e-2). This makes these two loss terms very small compared to the others.

So either change the loss weights, or switch to the Caffe-pretrained models to balance the losses.

I chose the latter: I downloaded the Caffe version of VGG19 from https://github.com/jcjohnson/pytorch-vgg and compute the VGG loss on 0-255, BGR-ordered inputs.

Results

The following results are generated from the same person (id_08696) with different driving videos.

Click the images to view the video results on YouTube.

1. Feed-forward without fine-tuning

2. Fine-tuning for 100 epochs

As we can see, an identity gap exists in the feed-forward results, but it can be bridged by fine-tuning.

3. More results: