davide-coccomini/Combining-EfficientNet-and-Vision-Transformers-for-Video-Deepfake-Detection

weights

Shiro-LK opened this issue · 1 comments

Hello,

Thank you for sharing this deepfakes model.

I have a question regarding the script efficient_vit.py.

It seems to load a checkpoint, "weights/final_999_DeepFakeClassifier_tf_efficientnet_b7_ns_0_23", but I haven't found it in the repo. Would it be possible to upload it?
Also, what is the difference between this checkpoint and efficient_vit.pth?

Best regards,

Hi Shiro, that checkpoint comes from one of the experiments we conducted during the research.

There are two different versions of our Efficient ViT:

  • EfficientNet B0 with ViT: Here an EfficientNet B0 pre-trained on ImageNet is combined with a Vision Transformer, and both are trained together on deepfakes. This architecture led to good results, which are reported in the paper. The efficient_vit.pth file refers to this architecture.
  • EfficientNet B7 with ViT: This is an attempt to make the architecture work with a larger EfficientNet (in this case B7). Because the model becomes too large, it is very difficult to train both the extractor (EfficientNet B7) and the Vision Transformer. To work around that, the EfficientNet B7 uses the weights available from Selim Seferbekov's repository on deepfake detection (https://github.com/selimsef/dfdc_deepfake_challenge) and is not trained together with the ViT.
    This is really an experimental approach that needs more work, since it did not reach good accuracy for a number of reasons, so I suggest you use the classic EfficientNet B0 + ViT model with the weights available in our repository.
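To make the difference between the two setups concrete, here is a minimal PyTorch sketch. It is not the code from efficient_vit.py: `TinyExtractor` and `TinyHead` are hypothetical stand-ins for EfficientNet and the Vision Transformer, chosen only so the example runs without downloading any weights. It shows joint training (as in the B0 variant) next to training with a frozen extractor (as in the B7 variant).

```python
import torch
from torch import nn


class TinyExtractor(nn.Module):
    """Hypothetical stand-in for an EfficientNet feature extractor."""

    def __init__(self, out_channels=8):
        super().__init__()
        self.conv = nn.Conv2d(3, out_channels, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        # (N, 3, H, W) -> (N, out_channels)
        return self.pool(self.conv(x)).flatten(1)


class TinyHead(nn.Module):
    """Hypothetical stand-in for the Vision Transformer classifier."""

    def __init__(self, in_features=8):
        super().__init__()
        self.fc = nn.Linear(in_features, 1)

    def forward(self, feats):
        return self.fc(feats)


def build(freeze_extractor):
    """Return (model, optimizer); optionally freeze the extractor."""
    extractor, head = TinyExtractor(), TinyHead()
    if freeze_extractor:
        # B7-style setup: the extractor keeps its loaded weights.
        for p in extractor.parameters():
            p.requires_grad = False
        extractor.eval()
    model = nn.Sequential(extractor, head)
    # Only parameters that still require gradients go to the optimizer.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=1e-3)
    return model, optimizer


def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


joint_model, joint_opt = build(freeze_extractor=False)    # B0-style
frozen_model, frozen_opt = build(freeze_extractor=True)   # B7-style

# One training step on random data with the frozen extractor.
x = torch.randn(4, 3, 16, 16)
y = torch.randint(0, 2, (4, 1)).float()
before = frozen_model[0].conv.weight.clone()
loss = nn.functional.binary_cross_entropy_with_logits(frozen_model(x), y)
frozen_opt.zero_grad()
loss.backward()
frozen_opt.step()
extractor_unchanged = torch.equal(before, frozen_model[0].conv.weight)

print(count_trainable(joint_model), count_trainable(frozen_model))
```

With a frozen extractor only the head receives gradients, which is what keeps the memory and compute cost manageable for a large backbone. Note that calling `model.train()` afterwards would put the extractor back into training mode, so layers such as batch normalization would need to be kept in eval mode explicitly.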