Koushik Srivatsan,
Muzammal Naseer,
Karthik Nandakumar
MBZUAI, UAE.
- 28-09-2023: Code released.
- We show that direct finetuning of a multimodal pre-trained ViT (e.g., CLIP image encoder) achieves better FAS generalizability without any bells and whistles.
- We propose a new approach for robust cross-domain FAS by grounding the visual representation in natural language semantics. This is realized by aligning the image representation with an ensemble of text prompts (describing the class) during finetuning; see the first sketch below.
- We propose a multimodal contrastive learning strategy, which encourages the model to learn more generalized features that bridge the FAS domain gap even with limited training data. This strategy leverages view-based image self-supervision and view-based cross-modal image-text similarity as additional constraints during the learning process; see the second sketch below.
- Extensive experiments on three standard protocols demonstrate that our method significantly outperforms state-of-the-art methods, achieving better zero-shot transfer performance than five-shot transfer of "adaptive ViTs".
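The language-guided alignment can be illustrated with a minimal sketch. This is not the released implementation: it assumes the openai `clip` package, and the prompt wordings and the `real_vs_spoof_logits` helper are illustrative placeholders, not the prompt ensemble used in the paper.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Illustrative prompt ensembles for the two FAS classes; the paper uses
# several natural-language descriptions per class.
real_prompts = [
    "This is an image of a real face",
    "This is a photo of a bonafide face",
]
spoof_prompts = [
    "This is an image of a spoof face",
    "This is a photo of a fake face",
]

def class_embedding(prompts):
    """Encode a prompt ensemble and average it into one text embedding."""
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    emb = emb.mean(dim=0)
    return emb / emb.norm()

# One normalized text embedding per class: shape (2, d).
text_features = torch.stack(
    [class_embedding(real_prompts), class_embedding(spoof_prompts)]
)

def real_vs_spoof_logits(images):
    """images: a batch of `preprocess`-ed tensors, shape (B, 3, 224, 224)."""
    img = model.encode_image(images.to(device))
    img = img / img.norm(dim=-1, keepdim=True)
    # Cosine similarity with each class embedding acts as the class logit;
    # during finetuning these logits would be trained with cross-entropy.
    return 100.0 * img @ text_features.T
```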
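The view-based constraints can likewise be sketched. This is a hedged approximation rather than the paper's exact losses: MSE is used for both consistency terms, and `model` and `text_features` are assumed to come from the previous sketch, with `view1`/`view2` produced by any stochastic augmentation pipeline of your choice.

```python
import torch
import torch.nn.functional as F

def view_consistency_losses(model, text_features, view1, view2):
    """view1, view2: two random augmented views of the same image batch."""
    z1 = model.encode_image(view1)
    z2 = model.encode_image(view2)
    z1 = z1 / z1.norm(dim=-1, keepdim=True)
    z2 = z2 / z2.norm(dim=-1, keepdim=True)

    # View-based image self-supervision: embeddings of two views of the
    # same image should agree.
    loss_ssl = F.mse_loss(z1, z2)

    # View-based cross-modal consistency: the image-text similarity
    # vectors of the two views should also agree.
    sim1 = 100.0 * z1 @ text_features.T
    sim2 = 100.0 * z2 @ text_features.T
    loss_mcl = F.mse_loss(sim1, sim2)

    # Both terms are added to the supervised alignment loss during finetuning.
    return loss_ssl, loss_mcl
```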
- Get Code
```bash
git clone https://github.com/koushiksrivats/FLIP.git
```
- Build Environment
```bash
cd FLIP
conda env create -f environment.yml
conda activate fas
```
Please refer to datasets.md for acquiring and pre-processing the datasets.
Please refer to run.md for training and evaluating the models.
Please refer to model_zoo.md for the pre-trained models.
Cross-domain performance under Protocol 1
Cross-domain performance under Protocol 2
Cross-domain performance under Protocol 3
Attention Maps on the spoof samples in MCIO datasets: Attention highlights are on the spoof-specific clues such as paper texture (M), edges of the paper (C), and moiré patterns (I and O).
Attention Maps on the spoof samples in WCS datasets: Attention highlights are on the spoof-specific clues such as screen edges/screen reflection (W), wrinkles in printed cloth (C), and cut-out eyes/nose (S).
If you use this work in your research or applications, please cite it using the following BibTeX:
```bibtex
@InProceedings{Srivatsan_2023_ICCV,
    author    = {Srivatsan, Koushik and Naseer, Muzammal and Nandakumar, Karthik},
    title     = {FLIP: Cross-domain Face Anti-spoofing with Language Guidance},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {19685-19696}
}
```
Our code is built on top of the few_shot_fas repository. We thank the authors for releasing their code.