While Large Vision-Language Models (LVLMs) have advanced rapidly in recent years, the prevalent ‘hallucination’ problem has emerged as a significant bottleneck that hinders their real-world deployment. Existing methods mitigate this issue mainly from two perspectives. One approach leverages extra knowledge, such as robust instruction tuning of LVLMs with curated datasets or auxiliary analysis networks, which inevitably incurs additional costs. Another approach, known as contrastive decoding, induces hallucinations by manually disturbing the raw vision or instruction inputs and mitigates them by contrasting the outputs of the disturbed and original LVLMs. However, these approaches rely on empirical, holistic input disturbances and double the inference cost. To avoid these issues, we propose a simple yet effective method named Self-Introspective Decoding (SID). Our empirical investigation reveals that pretrained LVLMs can introspectively assess the importance of vision tokens based on preceding vision and text (both instruction and generated) tokens. We develop the Context and Text-aware Token Selection (CT2S) strategy, which preserves only unimportant vision tokens after the early layers of LVLMs to adaptively amplify text-informed hallucination during auto-regressive decoding. This approach ensures that the multimodal knowledge absorbed in the early layers induces multimodal contextual hallucinations rather than aimless ones. Subsequently, the amplified vision-and-text association hallucinations are subtracted from the original token logits, guiding LVLMs to decode faithfully. Extensive experiments show that SID generates less hallucinated and higher-quality text across various metrics, without extra knowledge and without much additional computation.
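
The two steps described above, keeping only the least important vision tokens and then subtracting the hallucination-amplified logits from the original ones, can be illustrated with a toy NumPy sketch. This is not the repository's implementation: the function names, the attention scores, and the `(1 + alpha)` / `alpha` weighting (a form commonly used in contrastive decoding) are assumptions made for illustration only.

```python
import numpy as np


def select_least_attended(attn_to_vision: np.ndarray, keep: int) -> np.ndarray:
    """Return the (sorted) indices of the `keep` vision tokens with the LOWEST scores.

    Hypothetical stand-in for CT2S: the scores are assumed to measure how much
    the preceding vision and text tokens attend to each vision token.
    """
    order = np.argsort(attn_to_vision, kind="stable")  # ascending importance
    return np.sort(order[:keep])


def contrastive_logits(orig: np.ndarray, amplified: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Subtract the hallucination-amplified logits from the original logits.

    The (1 + alpha) / alpha weighting is an assumption, not taken from this README.
    """
    return (1.0 + alpha) * orig - alpha * amplified


# Toy importance scores for five vision tokens.
attn = np.array([0.40, 0.05, 0.30, 0.10, 0.15])
kept = select_least_attended(attn, keep=2)

# Toy next-token logits over a three-word vocabulary.
orig = np.array([2.0, 1.0, 0.5])       # full vision input
amplified = np.array([1.0, 2.0, 0.5])  # only the unimportant vision tokens kept
adjusted = contrastive_logits(orig, amplified, alpha=1.0)
next_token = int(np.argmax(adjusted))
```

In this toy case the token favored by the hallucination-amplified branch (index 1) is pushed down, so greedy selection stays with the token supported by the full visual input.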
Self-Introspective Mechanism of pre-trained LVLMs. Retained vision tokens mainly focus on spuriously related regions informed by vision and text (both instruction and generated text).

Because SID is designed as an LVLM decoding strategy, the most convenient way to use it is to install our modified transformers package:
conda env create -f environment.yml
conda activate SID
python -m pip install -e transformers
After setting up the environment, you can use our code base directly to run three decoding-based hallucination alleviation methods for LVLMs, namely Vision Contrastive Decoding (VCD), Instruction Contrastive Decoding (ICD), and OPERA, as well as our SID:
python pope_eval.py --pope-type coco_adversarial --model llava-1.5 --use-cd --use-fast-v --sample --sample-greedy #SID_greedy
python pope_eval.py --pope-type coco_adversarial --model llava-1.5 --use-vcd --sample --sample-greedy #VCD_greedy
python pope_eval.py --pope-type coco_adversarial --model llava-1.5 --use-icd --sample --sample-greedy #ICD_greedy
python pope_eval.py --pope-type coco_adversarial --model llava-1.5 --beam 5 #Beam Search
python pope_eval.py --pope-type coco_adversarial --model llava-1.5 --beam 5 --opera #OPERA
The CHAIR metric utilizes the same configuration.
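
To run all five configurations in one go, the commands above can be assembled programmatically. The sketch below is only a convenience wrapper: the flag sets are copied from the commands listed above, while the method labels and the `build_command` helper are hypothetical names introduced here. It prints the commands rather than executing them.

```python
import shlex

# Flag sets copied from the commands above; the dictionary keys are illustrative labels.
METHOD_FLAGS = {
    "sid_greedy": ["--use-cd", "--use-fast-v", "--sample", "--sample-greedy"],
    "vcd_greedy": ["--use-vcd", "--sample", "--sample-greedy"],
    "icd_greedy": ["--use-icd", "--sample", "--sample-greedy"],
    "beam_search": ["--beam", "5"],
    "opera": ["--beam", "5", "--opera"],
}


def build_command(method: str, model: str = "llava-1.5",
                  pope_type: str = "coco_adversarial") -> list[str]:
    """Build the argument list for one POPE run (hypothetical helper)."""
    base = ["python", "pope_eval.py", "--pope-type", pope_type, "--model", model]
    return base + METHOD_FLAGS[method]


commands = {name: build_command(name) for name in METHOD_FLAGS}
for name, cmd in commands.items():
    print(f"{name}: {shlex.join(cmd)}")
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` from the repository root once the datasets and checkpoints are in place.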
We provide extensive evaluation metrics, including GPT-4V (`eval_utils/gpt4v_eval.py`), GPT-4 (`shr_eval.py`), POPE (`pope_eval.py`), and CHAIR (`eval_utils/chair_eval.py`).
The evaluations require the MSCOCO 2014, AOKVQA, GQA, and Visual Genome datasets. Please download them, using the provided scripts `dataset/download_cqa.py`, `dataset/download_ha_dpo.py`, and `dataset/download_visual_genome_v1.2.py` where applicable, and extract them into the data path.
You also need to prepare the following 7B base model checkpoints:
- Download the LLaVA-1.5 merged 7B model and specify its path in `eval_configs/llava-1.5_eval.yaml`.
- Download the Vicuna 7B v1.1 model and specify its path in `minigpt4/configs/models/blip2_instruct_vicuna7b.yaml`.
- Download the Shikra merged 7B model and specify its path in `eval_configs/shikra_eval.yaml`.

Argument | Example | Description |
---|---|---|
`--model` | `llava-1.5` | Specify the LVLM model. |
`--data-path` | `/path/to/dataset` | Path to the dataset file or folder. |
`--pope-type` | `coco_adversarial` | Type for POPE evaluation. |
`--sample` | `store_true` | Use the modified decoding strategy. |
`--sample-greedy` | `store_true` | Use CD with sampling and greedy decoding. |
`--beam` | `5` | Beam search number. |
`--opera` | `store_true` | Use OPERA. |
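
As a reading aid for the table, the sketch below shows how arguments of this shape are typically declared with `argparse`. It is a hypothetical mirror of the documented flags, not the parser used in `pope_eval.py`, and all default values are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the argument table (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Illustrative mirror of the documented arguments")
    parser.add_argument("--model", type=str, default="llava-1.5", help="Specify the LVLM model.")
    parser.add_argument("--data-path", type=str, default=None, help="Path to the dataset file or folder.")
    parser.add_argument("--pope-type", type=str, default="coco_adversarial", help="Type for POPE evaluation.")
    parser.add_argument("--sample", action="store_true", help="Use the modified decoding strategy.")
    parser.add_argument("--sample-greedy", action="store_true", help="Use CD with sampling and greedy decoding.")
    parser.add_argument("--beam", type=int, default=1, help="Beam search number.")
    parser.add_argument("--opera", action="store_true", help="Use OPERA.")
    return parser


# Parse the OPERA configuration shown earlier in this README.
args = build_parser().parse_args(
    ["--model", "llava-1.5", "--pope-type", "coco_adversarial", "--beam", "5", "--opera"]
)
```

Flags listed as `store_true` take no value: they are `False` unless present on the command line, whereas `--beam` expects an integer.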
This repo is based on the LVLM codebases of OPERA, VCD, and HA-DPO. Thanks for their excellent work!