Caution
This repo is under active development. No hyperparameter tuning has been performed yet, so the current architecture is not optimal for deepfake detection.
This repo is the implementation for the paper 2D3MF: Deepfake Detection using Multi Modal Middle Fusion.
```
.
├── assets              # Images for README.md
├── LICENSE
├── README.md
├── MODEL_ZOO.md
├── CITATION.cff
├── .gitignore
├── .github
# below is for the PyPI package marlin-pytorch
├── src                 # Source code for marlin-pytorch and audio feature extractors
├── tests               # Unit tests
├── requirements.lib.txt
├── setup.py
├── __init__.py
├── version.txt
# below is for the paper implementation
├── configs             # Configs for experiment settings
├── TD3MF               # 2D3MF model code
├── preprocess          # Preprocessing scripts
├── dataset             # Dataloaders
├── utils               # Utility functions
├── train.py            # Training script
├── evaluate.py         # Evaluation script
└── requirements.txt
```
Install 2D3MF from PyPI:

```bash
pip install 2D3MF
```
Sample code snippet for feature extraction:

```python
from TD3MF.classifier import TD3MF

ckpt = "ckpt/celebvhq_marlin_deepfake_ft/last-v72.ckpt"
model = TD3MF.load_from_checkpoint(ckpt)
features = model.feature_extraction("2D3MF_Datasets/test/SampleVideo_1280x720_1mb.mp4")
```
We provide some pretrained MARLIN checkpoints and configurations here.
Requirements:
- Python >= 3.7, < 3.12
- PyTorch ~= 1.11
- Torchvision ~= 0.12
- ffmpeg
Install PyTorch from the official website
Clone the repo and install the requirements:

```bash
git clone https://github.com/aiden200/2D3MF
cd 2D3MF
pip install -e .
```
Forensics++
We cannot offer the download script directly in our repository due to the dataset's terms of use. Please follow the instructions on the [Forensics++](https://github.com/ondyari/FaceForensics?tab=readme-ov-file) page to obtain the download script.

- FaceForensics++
  - The original source videos downloaded from YouTube: 38.5GB
  - All H.264 compressed videos, by compression rate factor:
    - raw/0: ~500GB
    - 23: ~10GB (which we use)
Please download the Forensics++ dataset. We use all of the lightly compressed (c23) original and altered videos from the three manipulation methods, i.e., the script invocation in the Forensics++ repository that ends with: `<output path> -d all -c c23 -t videos`

The script offers two servers, which can be selected by adding `--server <EU or CA>`. If the EU server is not working for you, you can also try EU2, which has been reported to work in some of those cases.
Once the first two steps are executed, you should have the following directory structure:
```
-- Parent_dir
   |-- manipulated_sequences
   |-- original_sequences
```
Since the Forensics++ dataset doesn't provide audio data, we need to extract the audio ourselves. First, run the script in the Forensics++ repository that ends with: `<Parent_dir from last step> -d original_youtube_videos_info`
Now you should have a directory with the following structure:
```
-- Parent_dir
   |-- manipulated_sequences
   |-- original_sequences
   |-- downloaded_videos_info
```
Please run the script from our repository:
```bash
python3 preprocess/faceforensics_scripts/extract_audio.py --dir [Parent_dir]
```
After this, you should have a directory with the following structure:
```
-- Parent_dir
   |-- manipulated_sequences
   |-- original_sequences
   |-- downloaded_videos_info
   |-- audio_clips
```
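For reference, the core operation is simply pulling the audio track out of a video with ffmpeg, roughly as sketched below. The file names and audio parameters (mono, 16 kHz) are assumptions for illustration only; use `extract_audio.py` for the actual pipeline and output layout.

```bash
# Minimal sketch with assumed paths and audio parameters:
# extract a mono 16 kHz WAV track from one video into audio_clips/.
ffmpeg -i input_video.mp4 -vn -ac 1 -ar 16000 audio_clips/input_video.wav
```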
- Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, Matthias Nießner. "FaceForensics++: Learning to Detect Manipulated Facial Images." In International Conference on Computer Vision (ICCV), 2019.
DFDC
Kaggle provides a nice and easy way to download the [DFDC dataset](https://www.kaggle.com/c/deepfake-detection-challenge/data).

DeepFakeTIMIT

We recommend downloading the data from the [DeepfakeTIMIT Zenodo record](https://zenodo.org/records/4068245).

FakeAVCeleb

We recommend requesting access to FakeAVCeleb via their [repo README](https://github.com/DASH-Lab/FakeAVCeleb).

RAVDESS

We recommend downloading the data from the [RAVDESS Zenodo record](https://zenodo.org/records/1188976).

We recommend using the following unified dataset structure:
```
2D3MF_Dataset/
├── DeepfakeTIMIT
│   ├── audio/*.wav
│   └── video/*.mp4
├── DFDC
│   ├── audio/*.wav
│   └── video/*.mp4
├── FakeAVCeleb
│   ├── audio/*.wav
│   └── video/*.mp4
├── Forensics++
│   ├── audio/*.wav
│   └── video/*.mp4
└── RAVDESS
    ├── audio/*.wav
    └── video/*.mp4
```
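If you want a quick sanity check that each dataset follows this layout, a throwaway shell snippet like the one below works. This is just a convenience sketch, not part of the repo.

```bash
# Convenience sketch (not part of the repo): count clips per dataset in the unified layout.
for d in 2D3MF_Dataset/*/; do
  videos=$(find "${d}video" -name '*.mp4' 2>/dev/null | wc -l)
  audios=$(find "${d}audio" -name '*.wav' 2>/dev/null | wc -l)
  echo "$(basename "$d"): ${videos} videos, ${audios} audio clips"
done
```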
Crop the face region from the raw video. Run:
```bash
python3 preprocess/preprocess_clips.py --data_dir [Dataset_Dir]
```
EfficientFace
Download the pre-trained EfficientFace model from here, under 'Pre-trained models'. In our experiments, we use the model pre-trained on AffectNet7, i.e., `EfficientFace_Trained_on_AffectNet7.pth.tar`. Please place it under the `pretrained` directory.
Run:
```bash
python preprocess/extract_features.py --data_dir /path/to/data --video_backbone [VIDEO_BACKBONE] --audio_backbone [AUDIO_BACKBONE]
```
[VIDEO_BACKBONE] can be replaced with one of the following:
- marlin_vit_small_ytf
- marlin_vit_base_ytf
- marlin_vit_large_ytf
- efficientface
[AUDIO_BACKBONE] can be replaced with one of the following:
- MFCC
- xvectors
- resnet
- emotion2vec
- eat
Optionally, add the `--Forensics` flag at the end if Forensics++ is the dataset being processed.

In our paper, we found that `eat` works best as the audio backbone; a concrete example is given below.
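For example, to pair a MARLIN video backbone with the EAT audio backbone recommended above (backbone names are taken from the lists above; the data path is a placeholder for your unified dataset directory):

```bash
python preprocess/extract_features.py --data_dir 2D3MF_Datasets --video_backbone marlin_vit_small_ytf --audio_backbone eat
```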
Split the data into train, val, and test sets. Run:

```bash
python preprocess/gen_split.py --data_dir /path/to/data --test 0.1 --val 0.1 --feat_type [AUDIO_BACKBONE]
```
Note that the pre-trained `video_backbone` and `audio_backbone` checkpoints can be downloaded from MODEL_ZOO.md.
Train and evaluate the 2D3MF model. Please use one of the configs in `config/*.yaml` as the config file.
```bash
python evaluate.py \
    --config /path/to/config \
    --data_path /path/to/CelebV-HQ \
    --num_workers 4 \
    --batch_size 16
```
```bash
python evaluate.py \
    --config /path/to/config \
    --data_path /path/to/dataset \
    --num_workers 4 \
    --batch_size 8 \
    --marlin_ckpt pretrained/marlin_vit_base_ytf.encoder.pt \
    --epochs 300
```
For example:

```bash
python evaluate.py --config config/celebvhq_marlin_deepfake_ft.yaml --data_path 2D3MF_Datasets --num_workers 4 --batch_size 1 --marlin_ckpt pretrained/marlin_vit_small_ytf.encoder.pt --epochs 300
```
Optionally, add `--skip_train --resume /path/to/checkpoint` to skip training and only evaluate an existing checkpoint.
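For instance, an evaluation-only run might look like the following. The checkpoint path is a placeholder (PyTorch Lightning writes checkpoints under `lightning_logs/` by default); the other flags mirror the example command above.

```bash
python evaluate.py --config config/celebvhq_marlin_deepfake_ft.yaml --data_path 2D3MF_Datasets --num_workers 4 --batch_size 1 --skip_train --resume lightning_logs/version_0/checkpoints/last.ckpt
```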
Set up a configuration file based on your hyperparameters and backbones. You can find an example config file under `config/`.
Explanation:

- `training_datasets` - list; can contain one or more of `"DeepfakeTIMIT"`, `"RAVDESS"`, `"Forensics++"`, `"DFDC"`, `"FakeAVCeleb"`
- `eval_datasets` - list; can contain one or more of `"DeepfakeTIMIT"`, `"RAVDESS"`, `"Forensics++"`, `"DFDC"`, `"FakeAVCeleb"`
- `learning_rate` - float, e.g. `1.00e-3`
- `num_heads` - int, number of attention heads
- `fusion` - str, choice of fusion type: `"mf"` for middle fusion, `"lf"` for late fusion
- `audio_positional_encoding` - bool, add audio positional encoding
- `hidden_layers` - int, number of hidden layers
- `lp_only` - bool, if true, perform inference from the video features only
- `audio_backbone` - str, one of `"MFCC"`, `"eat"`, `"xvectors"`, `"resnet"`, `"emotion2vec"`
- `middle_fusion_type` - str, one of `"default"`, `"audio_refuse"`, `"video_refuse"`, `"self_attention"`, `"self_cross_attention"`
- `modality_dropout` - float, modality dropout rate
- `video_backbone` - str, one of `"efficientface"`, `"marlin"`

A sample config is sketched below.
For hyperparameter grid search, see:
- `config/grid_search_config.py`
- the `--grid_search` flag
Run:

```bash
tensorboard --logdir=lightning_logs/
```

The dashboard should be hosted at http://localhost:6006/.
This project is under the CC BY-NC 4.0 license. See LICENSE for details.
Please cite our work!
Some of the model code is based on ControlNet/MARLIN. The middle fusion code is adapted from Self-attention fusion for audiovisual emotion recognition with incomplete data.
Our Audio Feature Extraction Models:
Our Video Feature Extraction Models: