CVPR24-SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection
Peng Qi, Zehong Yan, Wynne Hsu, Mong Li Lee
Misinformation is a prevalent societal issue due to its potential high risks. Out-Of-Context (OOC) misinformation, where authentic images are repurposed with false text, is one of the easiest and most effective ways to mislead audiences. Current methods focus on assessing image-text consistency but lack convincing explanations for their judgments, which are essential for debunking misinformation. While Multimodal Large Language Models (MLLMs) have rich knowledge and an innate capability for visual reasoning and explanation generation, they still lack sophistication in understanding and discovering subtle cross-modal differences. In this paper, we introduce SNIFFER, a novel multimodal large language model specifically engineered for OOC misinformation detection and explanation. SNIFFER employs two-stage instruction tuning on InstructBLIP. The first stage refines the model's concept alignment of generic objects with news-domain entities, and the second stage leverages OOC-specific instruction data generated by language-only GPT-4 to fine-tune the model's discriminatory powers. Enhanced by external tools and retrieval, SNIFFER not only detects inconsistencies between text and image but also utilizes external knowledge for contextual verification. Our experiments show that SNIFFER surpasses the original MLLM by over 40% and outperforms state-of-the-art methods in detection accuracy. SNIFFER also provides accurate and persuasive explanations, as validated by quantitative and human evaluations.
- Release inference demo code
- Release training and evaluation code
$ conda create -n lavis python=3.10
$ conda activate lavis
$ pip install torch==2.1.2 torchvision==0.16.2
$ pip install -r requirements.txt
Download the following pre-trained models and put them into llm-ckpt:
Download the NewsCLIPpings dataset and the corresponding evidence.
You can find the construction process of the instruction data in our paper. We also show demo data in /datasets.
Stage 1 (News Domain Alignment):
$ sh run_scripts/instructblip2/train/ddp_train_newsvqa_newsclip.sh
Stage 2 (Task-Specific Tuning):
$ sh run_scripts/instructblip2/train/ddp_train_factvqa_newsclip.sh
$ sh run_scripts/instructblip2/eval/ddp_eval_factvqa_newsclip.sh
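The paper reports results as detection accuracy. As a rough illustration of that metric only, the sketch below computes overall and per-class accuracy from binary predictions. The function name `detection_accuracy`, the label encoding (1 = out-of-context, 0 = not out-of-context), and the list-based input format are assumptions made for this example; they are not taken from the evaluation script, whose output format may differ.

```python
from typing import Dict, Sequence


def detection_accuracy(preds: Sequence[int], labels: Sequence[int]) -> Dict[str, float]:
    """Overall and per-class accuracy for binary OOC detection.

    Assumed encoding (hypothetical): 1 = out-of-context, 0 = not out-of-context.
    """
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and of equal length")

    pairs = list(zip(preds, labels))

    def acc(subset):
        # Fraction of (prediction, label) pairs that agree; NaN for an empty class.
        return sum(p == y for p, y in subset) / len(subset) if subset else float("nan")

    return {
        "all": acc(pairs),
        "ooc": acc([(p, y) for p, y in pairs if y == 1]),
        "not_ooc": acc([(p, y) for p, y in pairs if y == 0]),
    }


if __name__ == "__main__":
    print(detection_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

Reporting the two per-class numbers alongside the overall one makes it easier to see whether a model leans toward one label.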
The full pipeline comprises internal checking based on the MLLM module, plus external checking and combined reasoning based on the LLM module (which can be swapped for any other, stronger LLM).
$ python demo_inference.py
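To make the control flow of the three steps described above concrete, here is a minimal sketch of how internal checking, external checking, and combined reasoning might be wired together. It does not reproduce `demo_inference.py`: every name in it (`Verdict`, `run_pipeline`, the `stub_*` functions) is hypothetical, and the stubs are trivial placeholders standing in for the MLLM and LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    is_ooc: bool      # True if the image-caption pair is judged out-of-context
    explanation: str  # free-text justification


# Hypothetical signatures for the three pluggable steps.
InternalCheck = Callable[[str, str], Verdict]        # (image_path, caption)
ExternalCheck = Callable[[str, List[str]], Verdict]  # (caption, retrieved evidence)
Combine = Callable[[Verdict, Verdict], Verdict]


def run_pipeline(
    image_path: str,
    caption: str,
    evidence: List[str],
    internal_check: InternalCheck,
    external_check: ExternalCheck,
    combine: Combine,
) -> Verdict:
    """Run internal checking, external checking, then combined reasoning."""
    internal = internal_check(image_path, caption)
    external = external_check(caption, evidence)
    return combine(internal, external)


# --- Placeholder steps (assumptions; a real system would call models here) ---

def stub_internal(image_path: str, caption: str) -> Verdict:
    # Stand-in for the MLLM comparing image content with the caption.
    return Verdict(False, "internal: no visible image-text conflict (stub)")


def stub_external(caption: str, evidence: List[str]) -> Verdict:
    # Stand-in for the LLM comparing the caption with retrieved evidence:
    # here, simple word overlap counts as support.
    words = set(caption.lower().split())
    supported = any(words & set(e.lower().split()) for e in evidence)
    note = "overlaps with" if supported else "does not overlap with"
    return Verdict(not supported, f"external: evidence {note} the caption (stub)")


def stub_combine(internal: Verdict, external: Verdict) -> Verdict:
    # Stand-in for LLM-based combined reasoning: flag if either check flags.
    return Verdict(
        internal.is_ooc or external.is_ooc,
        internal.explanation + "; " + external.explanation,
    )


if __name__ == "__main__":
    verdict = run_pipeline(
        "img.jpg", "Flood in Jakarta", ["Wildfire near Athens"],
        stub_internal, stub_external, stub_combine,
    )
    print(verdict)
```

Passing the three steps in as callables mirrors the note that the LLM module is replaceable: a different reasoning backend only needs to match the `Combine` or `ExternalCheck` signature assumed here.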
If you make use of our work, please cite our paper.
@InProceedings{Qi_2024_CVPR,
author = {Qi, Peng and Yan, Zehong and Hsu, Wynne and Lee, Mong Li},
title = {SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {13052-13062}
}
We build our code on top of LAVIS. We sincerely thank the LAVIS team for their amazing work and well-structured code.