Adding adversarial loss and perceptual loss (VGGFace) to deepfakes' auto-encoder architecture.
Date | Update |
---|---|
2018-03-03 | Model architecture: Add a new notebook which contains an improved GAN architecture. The architecture is greatly inspired by XGAN and the MS-D neural network. |
2018-02-13 | Video conversion: Add a new video processing script using MTCNN for face detection. Faster detection with a configurable threshold value. No need for CUDA-supported dlib. (New notebook: v2_test_video_MTCNN) |
2018-02-10 | Video conversion: Add an optional (default `False`) histogram matching function for color correction into the video conversion pipeline. Set `use_color_correction = True` to enable this feature. (Updated notebooks: v2_sz128_train, v2_train, and v2_test_video) |
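Histogram matching maps the swapped face's intensity distribution onto the target frame's. Below is a minimal single-channel NumPy sketch of the idea; the notebooks' actual color-correction helper may differ, and `match_histogram` is an illustrative name, not a function from this repo.

```python
import numpy as np

def match_histogram(source, template):
    """Remap source pixel values so their CDF matches the template's CDF.

    Single-channel sketch; the notebooks apply color correction per channel.
    """
    s_values, s_idx, s_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True)
    t_values, t_counts = np.unique(template.ravel(), return_counts=True)
    s_cdf = np.cumsum(s_counts).astype(np.float64) / source.size
    t_cdf = np.cumsum(t_counts).astype(np.float64) / template.size
    # For each source CDF value, find the template value with the same CDF.
    mapped = np.interp(s_cdf, t_cdf, t_values)
    return mapped[s_idx].reshape(source.shape)

src = np.array([[0, 0], [255, 255]], dtype=np.uint8)
tmpl = np.array([[10, 10], [20, 20]], dtype=np.uint8)
print(match_histogram(src, tmpl))  # values are pulled into the template's range
```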
FaceSwap_GAN_v1_train.ipynb
- Script for training the version 1 GAN model.
- Video conversion functions are also included.
FaceSwap_GAN_v2_train.ipynb (recommended for training)
- Script for training the version 2 GAN model.
- Video conversion functions are also included.
FaceSwap_GAN_v2_test_video.ipynb
- Script for generating videos.
- Using the face_recognition module for face detection.
FaceSwap_GAN_v2_test_video_MTCNN.ipynb (recommended for video conversion)
- Script for generating videos.
- Using MTCNN for face detection. Does not require CUDA-supported dlib.
faceswap_WGAN-GP_keras_github.ipynb
- This notebook contains a class of GAN model using WGAN-GP.
- Perceptual loss is discarded for simplicity.
- The WGAN-GP model gave me results similar to the LSGAN model after a comparable number (~18k) of generator updates.
```python
gan = FaceSwapGAN()  # instantiate the class
gan.train(max_iters=10e4, save_interval=500)  # start training
```
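For reference, WGAN-GP's gradient penalty λ·(‖∇D(x̂)‖₂ − 1)² is evaluated at random interpolates between real and fake samples. The toy NumPy sketch below uses a linear critic whose gradient is known analytically, so no autodiff is needed; it illustrates the penalty term only and is not the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear critic D(x) = w . x, whose gradient w.r.t. x is simply w.
w = np.array([0.6, 0.8])  # ||w|| = 1, so the penalty should be ~0

def gradient_penalty(real, fake, lam=10.0):
    """WGAN-GP penalty lam * mean((||grad D(x_hat)|| - 1)^2)."""
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake   # random interpolates
    grads = np.tile(w, (real.shape[0], 1))    # analytic gradient of linear critic
    norms = np.linalg.norm(grads, axis=1)
    return lam * np.mean((norms - 1.0) ** 2)

real = rng.normal(size=(4, 2))
fake = rng.normal(size=(4, 2))
print(gradient_penalty(real, fake))  # ≈ 0.0, since ||w|| = 1
```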
FaceSwap_GAN_v2_sz128_train.ipynb
- Input and output images have a larger shape of (128, 128, 3).
- Minor updates on the architectures:
  - Add instance normalization to generators and discriminators.
  - Add an additional regression loss (MAE loss) on the 64x64 branch output.
- Not compatible with the `_test_video` and `_test_video_MTCNN` notebooks above.
dlib_video_face_detection.ipynb
- Detect/crop faces in a video using dlib's CNN model.
- Pack cropped face images into a zip file.
Training data: Face images are supposed to be in the `./faceA/` and `./faceB/` folders, one per target respectively. Face images can be of any size.
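Given that layout, a quick sanity check of the training folders might look like the following. `count_face_images` is a hypothetical helper for illustration, not part of the repo.

```python
from pathlib import Path

def count_face_images(folder):
    """Count image files in a training folder (images can be any size)."""
    exts = {".jpg", ".jpeg", ".png"}
    p = Path(folder)
    if not p.exists():
        return 0
    return sum(1 for f in p.glob("*") if f.suffix.lower() in exts)

print(count_face_images("./faceA"), count_face_images("./faceB"))
```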
Below are results showing trained models transforming Hinako Sano (佐野ひなこ) to Emi Takei (武井咲).
Source video: 佐野ひなことすごくどうでもいい話?(遊戯王)
Autoencoder based on deepfakes' script. It should be mentioned that the autoencoder (AE) result could be much better if trained longer.
- Improved output quality: Adversarial loss improves reconstruction quality of generated images.
- VGGFace perceptual loss: Perceptual loss makes eyeball direction more realistic and consistent with the input face.
- Smoothed bounding box (smoothed bbox): An exponential moving average of the bounding box position over frames is introduced to eliminate jitter on the swapped face.
- Version 1 features: Most of the features in version 1 are inherited, including perceptual loss and smoothed bbox.
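The smoothed-bbox idea can be sketched in a few lines. The `alpha` value below is illustrative, not the notebooks' actual setting.

```python
def smooth_bbox(prev, current, alpha=0.65):
    """Exponential moving average over bbox coords (x0, y0, x1, y1).

    alpha close to 1 weights the history more, suppressing frame-to-frame jitter.
    """
    if prev is None:           # first frame: nothing to smooth against
        return current
    return tuple(alpha * p + (1 - alpha) * c for p, c in zip(prev, current))

smoothed = None
for bbox in [(10, 10, 60, 60), (14, 9, 63, 61), (11, 12, 59, 58)]:
    smoothed = smooth_bbox(smoothed, bbox)
print(smoothed)  # jittery detections are pulled toward the running average
```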
- Unsupervised segmentation mask: The model learns a proper mask that helps with handling occlusion, eliminates artifacts on bbox edges, and produces a natural skin tone.
  - From left to right: source face, swapped face (before masking), swapped face (after masking).
  - From left to right: source face, swapped face (after masking), mask heatmap.
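Applying the learned mask amounts to alpha blending the generated face into the source frame. A minimal NumPy sketch follows; the function and variable names are illustrative, not the repo's.

```python
import numpy as np

def apply_mask(source, generated, mask):
    """Alpha-blend the generated face into the source using the learned mask.

    mask values lie in [0, 1]: 1 keeps the generated pixel, 0 keeps the source.
    """
    if mask.ndim == 2:                      # broadcast a 1-channel mask over RGB
        mask = mask[..., np.newaxis]
    return mask * generated + (1.0 - mask) * source

src = np.zeros((2, 2, 3))   # toy "source frame"
gen = np.ones((2, 2, 3))    # toy "generated face"
m = np.array([[1.0, 0.0], [0.5, 0.25]])
print(apply_mask(src, gen, m)[..., 0])  # blended values equal the mask here
```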
- Optional 128x128 input/output resolution: Increase input and output size from 64x64 to 128x128.
- Mask refinement: VGGFace ResNet50 is introduced for mask refinement (as the perceptual loss). The following figure shows generated masks before/after refinement. Input faces are from the CelebA dataset.
- Mask comparison: The following figure compares (i) generated masks with (ii) face segmentations from YuvalNirkin's FCN network. Surprisingly, the FCN sometimes fails to segment out face occlusions (see the 2nd and 4th rows).
- Face detection/tracking using MTCNN and a Kalman filter: More stable detection and smoother tracking.
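A bare-bones constant-position Kalman filter over a single bbox coordinate gives the flavor of the tracking step. The noise constants `q` and `r` are illustrative, not the repo's values.

```python
def kalman_1d(measurements, q=1e-3, r=0.5):
    """Filter a sequence of noisy 1D measurements (e.g. one bbox coordinate).

    q is process noise, r is measurement noise; larger r trusts the model more.
    """
    x, p = measurements[0], 1.0   # state estimate and its variance
    out = [x]
    for z in measurements[1:]:
        p += q                    # predict: variance grows between frames
        k = p / (p + r)           # Kalman gain
        x += k * (z - x)          # update toward the new measurement
        p *= (1 - k)
        out.append(x)
    return out

print(kalman_1d([100.0, 104.0, 99.0, 103.0, 101.0]))  # smoother than the input
```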
- V2.1 update: An improved architecture has been introduced to stabilize training. The architecture is greatly inspired by XGAN and the MS-D neural network.
  - In the v2.1 architecture, we add more discriminators/losses to the GAN. Specifically:
    - GAN loss for non-masked outputs: Add two more discriminators for the non-masked outputs.
    - Semantic consistency loss (XGAN): Use the cosine distance between embeddings of real faces and reconstructed faces.
    - Domain adversarial loss (XGAN): Encourage embeddings to lie in the same subspace.
    - (WIP) Frame loss: An L1 loss between the outputs of the current and previous frames, resulting in smoother transitions in the output video.
  - One `res_block` in the decoder is replaced by an MS-D network (default depth = 16) for output refinement.
    - This is a very inefficient implementation of the MS-D network.
  - Preview images are saved in the `./previews` folder.
  - (WIP) Random motion blur as data augmentation, reducing ghosting in the output video.
  - FCN8s for face segmentation is introduced to improve masking in video conversion (default `use_FCN_mask = False`).
    - To enable this feature, a Keras weights file should be generated through the Jupyter notebook provided in this repo.
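The semantic consistency term can be sketched as a mean cosine distance between encoder embeddings of real and reconstructed faces. This NumPy sketch is illustrative only; the actual loss lives in the notebook's Keras graph.

```python
import numpy as np

def cosine_distance(a, b, eps=1e-8):
    """Mean (1 - cosine similarity) between two batches of embeddings."""
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a_n * b_n, axis=1)))

emb_real = np.array([[1.0, 0.0], [0.0, 1.0]])     # toy "real face" embeddings
emb_recon = np.array([[1.0, 0.0], [1.0, 0.0]])    # toy "reconstruction" embeddings
print(cosine_distance(emb_real, emb_recon))  # ≈ 0.5: one pair aligned, one orthogonal
```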
- This is likely due to the input video's resolution being too high; modifying the parameters in step 13 or 14 will solve it.
  - First, increase `video_scaling_offset = 0` to 1 or higher.
  - If it doesn't help, set `manually_downscale = True`.
  - If the above still do not help, disable the CNN model for face detection:

    ```python
    def process_video(...):
        ...
        #faces = get_faces_bbox(image, model="cnn")  # Use CNN model
        faces = get_faces_bbox(image, model="hog")   # Use default HOG detector
    ```
- This illustration shows a very high-level, abstract (and not exactly faithful) flowchart of the denoising autoencoder algorithm. The objective functions look like this.
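Written out roughly, the generator objective combines the adversarial, reconstruction, and perceptual terms described above. The λ weights here are illustrative placeholders rather than the notebooks' exact coefficients, and the v2/v2.1 mask and consistency terms are omitted:

```latex
\mathcal{L}_G \;=\; \mathcal{L}_{adv}
  \;+\; \lambda_{rec}\,\lVert x - \hat{x} \rVert_1
  \;+\; \lambda_{perc}\,\mathcal{L}_{VGGFace}(x, \hat{x})
```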
- Set `audio=True` in the video-making cell:

  ```python
  output = 'OUTPUT_VIDEO.mp4'
  clip1 = VideoFileClip("INPUT_VIDEO.mp4")
  clip = clip1.fl_image(process_video)
  %time clip.write_videofile(output, audio=True)  # Set audio=True
  ```
- The default setting transforms face B to face A.
- To transform face A to face B, modify the following parameters depending on your current running notebook:
  - Change `path_abgr_A` to `path_abgr_B` in `process_video()` (step 13/14 of v2_train.ipynb and v2_sz128_train.ipynb).
  - Change `whom2whom = "BtoA"` to `whom2whom = "AtoB"` (step 12 of v2_test_video.ipynb).
- Keras 2
- TensorFlow 1.3
- Python 3
- OpenCV
- moviepy
- dlib (optional)
- face_recognition (optional)
Code borrows from tjwei, eriklindernoren, fchollet, keras-contrib and deepfakes. The generative network is adapted from CycleGAN. Weights and scripts of MTCNN are from FaceNet. Illustrations are from irasutoya.