Caution
This repo is under active development. No hyperparameter tuning has been performed yet, so the current architecture is not optimal for deepfake detection.
This repo is the implementation for the paper 2D3MF: Deepfake Detection using Multi Modal Middle Fusion.
```
.
├── assets              # Images for README.md
├── LICENSE
├── README.md
├── MODEL_ZOO.md
├── CITATION.cff
├── .gitignore
├── .github
# below is for the PyPI package marlin-pytorch
├── src                 # Source code for marlin-pytorch and audio feature extractors
├── tests               # Unit tests
├── requirements.lib.txt
├── setup.py
├── __init__.py
├── version.txt
# below is for the paper implementation
├── configs             # Configs for experiment settings
├── TD3MF               # 2D3MF model code
├── preprocess          # Preprocessing scripts
├── dataset             # Dataloaders
├── utils               # Utility functions
├── train.py            # Training script
├── evaluate.py         # Evaluation script
└── requirements.txt
```
Install 2D3MF from PyPI:

```bash
pip install 2D3MF
```
Sample code snippet for feature extraction:

```python
from TD3MF.classifier import TD3MF

ckpt = "ckpt/celebvhq_marlin_deepfake_ft/last-v72.ckpt"
model = TD3MF.load_from_checkpoint(ckpt)
features = model.feature_extraction("2D3MF_Datasets/test/SampleVideo_1280x720_1mb.mp4")
```
We provide some pretrained MARLIN checkpoints and configurations here.
Requirements:
- Python >= 3.7, < 3.12
- PyTorch ~= 1.11
- Torchvision ~= 0.12
- ffmpeg
Install PyTorch from the official website
Clone the repo and install the requirements:

```bash
git clone https://github.com/aiden200/2D3MF
cd 2D3MF
pip install -e .
```
Forensics++
We cannot offer the download script directly in our repository due to the dataset's terms of use. Please follow the instructions on the [Forensics++](https://github.com/ondyari/FaceForensics?tab=readme-ov-file) page to obtain the download script.

- FaceForensics++
  - The original source videos downloaded from YouTube: 38.5GB
  - All H.264 compressed videos, by compression rate factor:
    - raw/0: ~500GB
    - 23: ~10GB (which we use)
Please download the Forensics++ dataset. We use all of the lightly compressed (c23) original and altered videos from the three manipulation methods, i.e., the script invocation in the Forensics++ repository that ends with: `<output path> -d all -c c23 -t videos`

The script offers two servers, which can be selected by adding `--server <EU or CA>`. If the EU server is not working for you, you can also try EU2, which has been reported to work in some of those cases.
Once the first two steps are executed, you should have the following directory structure:
```
-- Parent_dir
   |-- manipulated_sequences
   |-- original_sequences
```
Since the Forensics++ dataset doesn't provide audio data, we need to extract the audio ourselves. First, run the script in the Forensics++ repository that ends with: `<Parent_dir from last step> -d original_youtube_videos_info`
Now you should have a directory with the following structure:
```
-- Parent_dir
   |-- manipulated_sequences
   |-- original_sequences
   |-- downloaded_videos_info
```
Please run the script from our repository:
```bash
python3 preprocess/faceforensics_scripts/extract_audio.py --dir [Parent_dir]
```
After this, you should have a directory with the following structure:
```
-- Parent_dir
   |-- manipulated_sequences
   |-- original_sequences
   |-- downloaded_videos_info
   |-- audio_clips
```
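For reference, the core operation is simply pulling the audio track out of a video with ffmpeg, roughly as sketched below. The file names and audio parameters (mono, 16 kHz) are assumptions for illustration only; use `extract_audio.py` for the actual pipeline and output layout.

```bash
# Minimal sketch with assumed paths and audio parameters:
# extract a mono 16 kHz WAV track from one video into audio_clips/.
ffmpeg -i input_video.mp4 -vn -ac 1 -ar 16000 audio_clips/input_video.wav
```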
- Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, Matthias Nießner. "FaceForensics++: Learning to Detect Manipulated Facial Images." In International Conference on Computer Vision (ICCV), 2019.
DFDC
Kaggle provides a nice and easy way to download the [DFDC dataset](https://www.kaggle.com/c/deepfake-detection-challenge/data).

DeepFakeTIMIT

We recommend downloading the data from the [DeepfakeTIMIT Zenodo record](https://zenodo.org/records/4068245).

FakeAVCeleb

We recommend requesting access to FakeAVCeleb via their [repo README](https://github.com/DASH-Lab/FakeAVCeleb).

RAVDESS

We recommend downloading the data from the [RAVDESS Zenodo record](https://zenodo.org/records/1188976).

We recommend using the following unified dataset structure:
```
2D3MF_Dataset/
├── DeepfakeTIMIT
│   ├── audio/*.wav
│   └── video/*.mp4
├── DFDC
│   ├── audio/*.wav
│   └── video/*.mp4
├── FakeAVCeleb
│   ├── audio/*.wav
│   └── video/*.mp4
├── Forensics++
│   ├── audio/*.wav
│   └── video/*.mp4
└── RAVDESS
    ├── audio/*.wav
    └── video/*.mp4
```
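If you want a quick sanity check that each dataset follows this layout, a throwaway shell snippet like the one below works. This is just a convenience sketch, not part of the repo.

```bash
# Convenience sketch (not part of the repo): count clips per dataset in the unified layout.
for d in 2D3MF_Dataset/*/; do
  videos=$(find "${d}video" -name '*.mp4' 2>/dev/null | wc -l)
  audios=$(find "${d}audio" -name '*.wav' 2>/dev/null | wc -l)
  echo "$(basename "$d"): ${videos} videos, ${audios} audio clips"
done
```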
Crop the face region from the raw video. Run:
```bash
python3 preprocess/preprocess_clips.py --data_dir [Dataset_Dir]
```
EfficientFace
Download the pre-trained EfficientFace model from here, under 'Pre-trained models'. In our experiments, we use the model pre-trained on AffectNet7, i.e., `EfficientFace_Trained_on_AffectNet7.pth.tar`. Please place it under the `pretrained` directory.
Run:
```bash
python preprocess/extract_features.py --data_dir /path/to/data --video_backbone [VIDEO_BACKBONE] --audio_backbone [AUDIO_BACKBONE]
```
[VIDEO_BACKBONE] can be replaced with one of the following:
- marlin_vit_small_ytf
- marlin_vit_base_ytf
- marlin_vit_large_ytf
- efficientface
[AUDIO_BACKBONE] can be replaced with one of the following:
- MFCC
- xvectors
- resnet
- emotion2vec
- eat
Optionally, add the `--Forensics` flag at the end if Forensics++ is the dataset being processed.

In our paper, we found that `eat` works best as the audio backbone; a concrete example is given below.
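For example, to pair a MARLIN video backbone with the EAT audio backbone recommended above (backbone names are taken from the lists above; the data path is a placeholder for your unified dataset directory):

```bash
python preprocess/extract_features.py --data_dir 2D3MF_Datasets --video_backbone marlin_vit_small_ytf --audio_backbone eat
```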
Split the data into train, val, and test sets. Run:

```bash
python preprocess/gen_split.py --data_dir /path/to/data --test 0.1 --val 0.1 --feat_type [AUDIO_BACKBONE]
```
Note that the pre-trained `video_backbone` and `audio_backbone` checkpoints can be downloaded from MODEL_ZOO.md.
Train and evaluate the 2D3MF model. Please use one of the configs in `config/*.yaml` as the config file.
```bash
python evaluate.py \
    --config /path/to/config \
    --data_path /path/to/CelebV-HQ \
    --num_workers 4 \
    --batch_size 16
```
```bash
python evaluate.py \
    --config /path/to/config \
    --data_path /path/to/dataset \
    --num_workers 4 \
    --batch_size 8 \
    --marlin_ckpt pretrained/marlin_vit_base_ytf.encoder.pt \
    --epochs 300
```
For example:

```bash
python evaluate.py --config config/celebvhq_marlin_deepfake_ft.yaml --data_path 2D3MF_Datasets --num_workers 4 --batch_size 1 --marlin_ckpt pretrained/marlin_vit_small_ytf.encoder.pt --epochs 300
```
Optionally, add `--skip_train --resume /path/to/checkpoint` to skip training and only evaluate an existing checkpoint.
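For instance, an evaluation-only run might look like the following. The checkpoint path is a placeholder (PyTorch Lightning writes checkpoints under `lightning_logs/` by default); the other flags mirror the example command above.

```bash
python evaluate.py --config config/celebvhq_marlin_deepfake_ft.yaml --data_path 2D3MF_Datasets --num_workers 4 --batch_size 1 --skip_train --resume lightning_logs/version_0/checkpoints/last.ckpt
```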
Set up a configuration file based on your hyperparameters and backbones. You can find an example config file under `config/`.
Explanation:

- `training_datasets` - list; can contain one or more of `"DeepfakeTIMIT"`, `"RAVDESS"`, `"Forensics++"`, `"DFDC"`, `"FakeAVCeleb"`
- `eval_datasets` - list; can contain one or more of `"DeepfakeTIMIT"`, `"RAVDESS"`, `"Forensics++"`, `"DFDC"`, `"FakeAVCeleb"`
- `learning_rate` - float, e.g. `1.00e-3`
- `num_heads` - int, number of attention heads
- `fusion` - str, choice of fusion type: `"mf"` for middle fusion, `"lf"` for late fusion
- `audio_positional_encoding` - bool, add audio positional encoding
- `hidden_layers` - int, number of hidden layers
- `lp_only` - bool, if true, perform inference from the video features only
- `audio_backbone` - str, one of `"MFCC"`, `"eat"`, `"xvectors"`, `"resnet"`, `"emotion2vec"`
- `middle_fusion_type` - str, one of `"default"`, `"audio_refuse"`, `"video_refuse"`, `"self_attention"`, `"self_cross_attention"`
- `modality_dropout` - float, modality dropout rate
- `video_backbone` - str, one of `"efficientface"`, `"marlin"`

A sample config is sketched below.
For hyperparameter grid search, see:
- `config/grid_search_config.py`
- the `--grid_search` flag
Run:

```bash
tensorboard --logdir=lightning_logs/
```

The dashboard should be hosted at http://localhost:6006/.
This project is under the CC BY-NC 4.0 license. See LICENSE for details.
Please cite our work!
Some of the model code is based on ControlNet/MARLIN. The middle fusion code is adapted from Self-attention fusion for audiovisual emotion recognition with incomplete data.
Our Audio Feature Extraction Models:
Our Video Feature Extraction Models: