
My BrainStation Capstone Project: Deep Learning for DeepFake Detection


What are DeepFakes?

Take a look at this GIF. Which one do you think is the original clip?


Source

Difficult, right? Especially if you don't know what movie this is from. Did you know it was generated by a computer? If you weren't aware, this is a DeepFake, an altered video produced by Artificial Intelligence (AI). And it's not limited to face swapping: any aspect of digital media is subject to 'DeepFaking'. A person's mouth movements can be adjusted to match any audio sample, for instance. DeepFakes also take a fraction of the time and cost it would take a person to produce the same results with CGI. This technology carries major implications for society; from spreading false information to fabricating evidence in court to sabotaging politics, it's not hard to imagine the many dangers of DeepFakes if left unchecked. Imagine if DeepFakes came out today of health officials and political figures declaring that the Covid-19 pandemic is over and everyone can go out and socialize. The consequences would be disastrous!

DeepFake Detection Challenge

In an effort to curb the emerging threat of DeepFakes, a Kaggle competition was launched in collaboration between Amazon, Facebook, Microsoft and the Partnership on AI, inviting enthusiasts of all backgrounds to compete for the best-performing DeepFake detection model. As someone who loves challenging problems and has genuine concerns about DeepFakes, I chose to tackle this challenge for my capstone project. Over the course of ~7 weeks, I performed an end-to-end Data Science workflow in which I obtained the data, processed it and trained several deep learning models to differentiate between real and fake videos. To date, my best model achieved 96% precision (of all predicted fakes, 96% were indeed fake) and 83% specificity (correctly identifying 83% of real videos) using just a single frame per video. The sections below summarize the tools used for this project and the steps I took to get there!

Resources


Tools Used

  • Bash
  • Python
  • GitHub
  • Google Colab

Python Packages:

  • Data Science: numpy, pandas
  • Plotting: matplotlib, seaborn
  • Machine Learning: tensorflow version 2.1, keras
  • Other: jupyter, os, imageio, pickle, h5py

Dataset

The competition dataset consisted of close to 500 GB of videos, each 10 seconds long at 30 frames per second (FPS). Due to computational and time limitations, I opted to use a pre-processed dataset consisting of 160x160 resolution images of faces extracted from the original videos. Credit goes to Hieu Phung for generating this dataset; information about the pre-processing workflow can be found here. I also only downloaded a subsample of the full dataset, consisting of 10,420 videos (as sets of extracted frames).

The dataset was split into several parts, each a zip file named 'deepfake-detection-faces-part-i-j.zip', where 'i-j' is the batch number. Each zip file contained a variable number of folders, with each folder holding the set of face images extracted from one unique video, plus a metadata.csv file with the ids and labels of the associated 'videos'. I created a bash script, unzip_batch.sh, to unzip a batch given its batch number and rename its metadata file to 'metadata_i-j.csv'. All metadata were then consolidated into a single metadata file with this simple bash command:

cat data/metadata*.csv > data/metadata.csv
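
unzip_batch.sh itself is not reproduced in this README; as a rough sketch, the unzip-and-rename step could also be done in Python along the following lines. The batch naming comes from the description above, while the assumption that each archive ships a top-level metadata.csv and the data/ paths are mine:

import glob
import shutil
import zipfile

import pandas as pd

def unzip_batch(batch_id, data_dir="data"):
    """Unzip one batch archive and tag its metadata file with the batch number."""
    archive = f"deepfake-detection-faces-part-{batch_id}.zip"
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(data_dir)
    # Assumed layout: each archive ships a generic metadata.csv at the top level.
    shutil.move(f"{data_dir}/metadata.csv", f"{data_dir}/metadata_{batch_id}.csv")

# Consolidate all per-batch metadata files (equivalent to the cat command above,
# but without repeating the CSV header for every batch).
parts = [pd.read_csv(path) for path in sorted(glob.glob("data/metadata_*.csv"))]
pd.concat(parts, ignore_index=True).to_csv("data/metadata.csv", index=False)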

Important aspects of this dataset (relevance explained later; a quick check is sketched after this list):

  • Imbalanced classes: ~88% labeled Fake, ~12% labeled Real
  • Multiple fakes derived from each original
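
Both points can be verified quickly from the consolidated metadata. A minimal sketch, assuming the file has 'label' and 'original' columns as in the original DFDC metadata (the pre-processed set's column names may differ):

import pandas as pd

meta = pd.read_csv("data/metadata.csv")

# Class balance: roughly 88% FAKE vs. 12% REAL.
print(meta["label"].value_counts(normalize=True))

# Multiple fakes per original: how many fakes point back to each source video?
print(meta.loc[meta["label"] == "FAKE", "original"].value_counts().describe())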

Exploratory Data Analysis and Data Cleaning


Filtering Based on Number of Frames

Since the original dataset comprised videos that are all 10 seconds long at 30 FPS, I expected the pre-processed dataset to contain 300 images, or frames, of faces for each video. While browsing the pre-processed data, however, I encountered cases where non-face regions had been misidentified as faces, as well as cases with more than one face per frame (the video featured more than one actor). I therefore investigated the number of frames across the dataset. Video filenames and their frame counts were obtained using a bash script, get_n_frames.sh, and written into a csv file (n_frames.csv). The distribution of frame counts was then plotted using Seaborn (see notebook).
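
get_n_frames.sh is not shown here; the same counts and plot can be sketched in Python as follows (the data/faces directory layout and the column names are assumptions):

import os

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Count the number of extracted face images in each video's folder.
records = []
for video_id in os.listdir("data/faces"):
    folder = os.path.join("data/faces", video_id)
    if os.path.isdir(folder):
        records.append({"video": video_id, "n_frames": len(os.listdir(folder))})

n_frames = pd.DataFrame(records)
n_frames.to_csv("n_frames.csv", index=False)

# Distribution of frames per video.
sns.histplot(n_frames["n_frames"], bins=50)
plt.xlabel("Number of extracted frames per video")
plt.show()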


There is some variability in the number of frames extracted from each video, with the majority of videos having between 200 and 400 extracted frames. There is also a noticeable group around 600 frames; these must be the videos that featured two actors.

There are a few explanations for the differing frame counts: videos with more frames picked up extra ones due to misidentified faces during pre-processing, while videos with fewer frames may have had gaps in which the actor's face was not detectable (e.g. the actor turned their head or moved out of view). These factors can pose problems during classification, since extra frames or gaps would result in an inconsistent sequence of images. So I removed these 'outliers' and kept just the videos with between 200 and 400 frames, since most of my data fell within this range (see notebook). Outlier names were exported to n_frame_outliers.txt and then used to move the matching directories into an archive folder:

xargs -a n_frame_outliers.txt mv -t data/archived/n_frame_outliers

If the above returns mv: cannot stat '<path>'$'\r': No such file or directory, the text file has Windows-style carriage returns; strip them with tr -d '\r' <n_frame_outliers.txt >n_frame_outliers_new.txt && mv n_frame_outliers_new.txt n_frame_outliers.txt

After these filtering steps, 8,537 videos remained in my dataset.

Extracting 30 Frames per video

Since I used Keras models, which take a fixed input shape, I needed all of my videos to have the same number of frames. I also questioned whether 30 frames per second were necessary, since many consecutive frames were nearly identical. Given my limited time and computational resources, I decided that reducing down to 3 frames per second (30 frames per video) was a reasonable compromise. Using a Bash script, I extracted every 10th frame per video, up to 30 frames. I also skipped frames that had multiple 'faces' to avoid misidentified frames.
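
The Bash script is not reproduced here; a rough Python sketch of the same frame-selection logic is below. The <frame_idx>_<face_idx>.jpg naming is a hypothetical stand-in for however the pre-processed dataset actually names its files:

import os
import shutil

def extract_30_frames(src_dir, dst_dir):
    """Copy every 10th frame from one video's folder, up to 30 frames,
    skipping frames where more than one 'face' was extracted."""
    os.makedirs(dst_dir, exist_ok=True)

    # Group files by frame index (hypothetical naming: <frame_idx>_<face_idx>.jpg).
    frames = {}
    for fname in os.listdir(src_dir):
        frame_idx = int(os.path.splitext(fname)[0].split("_")[0])
        frames.setdefault(frame_idx, []).append(fname)

    kept = 0
    for frame_idx in sorted(frames):
        if frame_idx % 10 != 0 or len(frames[frame_idx]) > 1:
            continue  # not every 10th frame, or multiple faces in this frame
        shutil.copy(os.path.join(src_dir, frames[frame_idx][0]), dst_dir)
        kept += 1
        if kept == 30:
            break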

Train-Validation-Test Split

The fact that there are multiple fake videos derived from each original presents a concern for the train-test split. If I were to perform random stratification, videos derived from the same original would be present in both the training and validation/testing sets. This is an issue because the similarity between videos originating from the same source could bias the model: it might classify a video more easily if it had learned from related videos during training. Hence, I ensured that each original and its derivatives were not separated during stratification. See here for the relevant code. 20% of my data (n=1,568) went into the test set, and the remainder was split 80/20 into the training (n=5,585) and validation (n=1,384) sets, respectively.
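
The linked code has the actual split; the sketch below only illustrates the grouping idea using pandas and numpy. The 'filename', 'label' and 'original' column names are assumptions, and stratification by label is omitted for brevity:

import numpy as np
import pandas as pd

meta = pd.read_csv("data/metadata.csv")

# Real videos form their own group; fakes are grouped under their source video,
# so an original and all of its derivatives stay on the same side of the split.
meta["group"] = meta["original"].fillna(meta["filename"])

rng = np.random.default_rng(42)
groups = meta["group"].unique()
rng.shuffle(groups)

n_test = int(0.2 * len(groups))
test_groups = set(groups[:n_test])
remaining = groups[n_test:]
val_groups = set(remaining[:int(0.2 * len(remaining))])

test_df = meta[meta["group"].isin(test_groups)]
val_df = meta[meta["group"].isin(val_groups)]
train_df = meta[~meta["group"].isin(test_groups | val_groups)]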

Building Deep Learning Models for DeepFake detection


Once all of the cleaning steps were done, I uploaded my data to my Google Drive so that I could access it from Google Colab. Neural networks for DeepFake detection were built and trained in a Colab notebook with GPU as the runtime type. Models were trained for 50 epochs unless specified otherwise. ModelCheckpoint was also used to save the weights that gave the best performance (lowest validation loss).
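
A sketch of the kind of Colab setup described above; the Drive paths are assumptions:

from google.colab import drive
from tensorflow.keras.callbacks import ModelCheckpoint

# Mount Google Drive so the cleaned dataset is accessible from the Colab runtime.
drive.mount("/content/drive")

# Keep only the weights from the epoch with the lowest validation loss.
checkpoint = ModelCheckpoint(
    "/content/drive/My Drive/deepfake/best_model.h5",  # hypothetical path
    monitor="val_loss",
    save_best_only=True,
)
# Later passed to model.fit(..., epochs=50, callbacks=[checkpoint]).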

Detection using a custom CNN with 1 Frame per Video

To get a baseline performance, I first framed this as an image classification problem by training models on just the 15th frame (the middle frame) for each video. The first model I used was a custom Convolutional Neural Network (CNN) with a relatively simple architecture of 6 convolutional layers, 3 pooling layers and 2 dense layers (not including output):
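
The architecture figure is not reproduced here; below is a minimal Keras sketch of a comparable model. Only the 6-convolutional / 3-pooling / 2-dense structure comes from the description above, while the filter counts and layer sizes are assumptions:

from tensorflow.keras import layers, models

def build_custom_cnn(input_shape=(160, 160, 3)):
    """Six conv layers, three pooling layers, two dense layers plus a sigmoid output."""
    model = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # binary output: fake vs. real
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model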


However, this model did not appear to learn during training. As seen in the figure below, the training and validation loss remained static, at least for the first 20 epochs. The training and validation accuracy did not improve either; the model flipped between predicting everything as fake (~88% accuracy) and everything as real (~12% accuracy) on the validation set. Adjusting the learning rate did not change this tendency.


This model achieved ~88% accuracy on the test set, but with a specificity (proportion of correctly classified real videos) of zero; again, it predicted every test video as fake. Another negative indication was its ROC AUC score of 0.5 (equivalent to random guessing). I took this as a sign that I needed deeper, more complex models, so I sought to apply transfer learning with the pre-trained ImageNet models that are available in the Keras package.

Detection using Transfer Learning with 1 Frame per Video

Keras has several built-in deep learning models that have been trained on millions of images from the ImageNet dataset. To apply transfer learning, I imported these models and modified the input shape and output layer to conform to my data. I also appended two dense layers before the output layer so that these models could learn features specific to my data:
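
As an illustration, here is a sketch of this transfer-learning setup with one of the built-in models. Xception is used only as an example; the actual base model, layer sizes and whether the base was frozen are assumptions:

from tensorflow.keras import layers, models
from tensorflow.keras.applications import Xception

def build_transfer_model(input_shape=(160, 160, 3)):
    # ImageNet-pretrained base, without its original classification head.
    base = Xception(weights="imagenet", include_top=False,
                    input_shape=input_shape, pooling="avg")
    base.trainable = False  # assumption: keep the pretrained weights frozen

    model = models.Sequential([
        base,
        layers.Dense(256, activation="relu"),  # two dense layers appended so the
        layers.Dense(64, activation="relu"),   # model can adapt to the face data
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model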


I noticed that these models tended to show signs of overfitting. While the training loss and accuracy appeared to improve as training progressed, the validation loss and accuracy did not, as illustrated below.


Still, these models were at least learning something. After trying several different built-in models, I was able to train one that achieved a precision of 0.96 (96% of predicted fakes were indeed fake) and a specificity of 0.83 (correctly identifying 83% of real videos)! However, this came with the tradeoff of a high false negative rate, i.e. a low recall of 0.58 (only 58% of fake videos were correctly identified). Moreover, the ROC AUC score was 0.71, indicating that there remained much room for improvement.
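
For reference, these metrics relate to the confusion matrix as follows; the sketch below assumes scikit-learn is available, even though it is not listed in the tools above:

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy example: 1 = fake, 0 = real.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.4, 0.8, 0.2, 0.6, 0.7, 0.1, 0.3])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)     # of predicted fakes, how many are truly fake
recall = tp / (tp + fn)        # of true fakes, how many were caught
specificity = tn / (tn + fp)   # of real videos, how many were recognized as real
auc = roc_auc_score(y_true, y_prob)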

Detection using Time Distributed CNN + Recurrent NN with 30 Frames per video

The next step involved moving beyond image classification to video classification. For this, I wrapped the built-in models in Keras' TimeDistributed layer and passed the per-frame outputs to a recurrent LSTM layer:
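
The architecture figure is again not reproduced; a minimal sketch of the idea is below. The base model, layer sizes and sequence handling are assumptions:

from tensorflow.keras import layers, models
from tensorflow.keras.applications import Xception

def build_video_model(frames=30, input_shape=(160, 160, 3)):
    base = Xception(weights="imagenet", include_top=False,
                    input_shape=input_shape, pooling="avg")
    base.trainable = False

    model = models.Sequential([
        # Apply the same CNN to each of the 30 frames independently.
        layers.TimeDistributed(base, input_shape=(frames, *input_shape)),
        # The LSTM then models the temporal sequence of per-frame features.
        layers.LSTM(64),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model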


The idea here is that the convolutions are applied to each frame individually, and the per-frame outputs are then consolidated by the LSTM layer, which takes into account the temporal sequence of the frames. However, with the time limit on free GPU usage on Google Colab, I could not train these models beyond 20 epochs. Of the few I tried, signs of overfitting were evident as well; so far I have not obtained a model that outperforms my best single-frame model. Adjusting the recurrent layers or training for more epochs could help in the future.

Concluding Remarks


Considering the potentially enormous ramifications of malicious DeepFakes, my best-performing model remains far from ideal. While false negatives (fakes misidentified as real) are perhaps more damaging than false positives (real videos misidentified as fake), minimizing both is incredibly important for the overarching goal of differentiating between authentic and doctored media. Generative Adversarial Networks (GANs), deep learning frameworks in which two networks (one generating fakes, one detecting them) compete against each other in a kind of evolutionary arms race, remain at the forefront of AI methodologies for both DeepFake generation and detection. Future steps could explore GANs or other cutting-edge AI frameworks to arrive at a more robust model.

Questions? Reach out to me on LinkedIn