Large MultiModal Model Hallucination

LMM hallucination😵 refers to occasional instances where LMMs generate content that appears plausible but deviates from or conflicts with the provided image. LMMs tend to rely more on their own parametric knowledge than on provided visual features, causing them to respond with guesses and generate multimodal hallucinations.

In the MLLM community, we've developed methods for detecting, evaluating, and mitigating hallucinations👍.

Awesome LMM Hallucination

Detecting

FDPO: Detecting and Preventing Hallucinations in Large Vision Language Models, (Gunjal et al. 2023)
HaELM: Evaluation and Analysis of Hallucination in Large Vision-Language Models, (Wang et al. 2023a)
- An automatic MLLM hallucination detection framework that trains an LLM to detect hallucinations
HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision-Language Models for Detailed Caption, (Zhai et al. 2023)

Evaluating

POPE: Evaluating Object Hallucination in Large Vision-Language Models, (Li et al. EMNLP 2023)
- Discriminative Task: Object Existence, 3k * 3 VQA pairs
- LLM-free
HaELM: Evaluation and Analysis of Hallucination in Large Vision-Language Models, (Wang et al. 2023a)
- Discriminative Task, 1500 VQA pairs
HallusionBench: An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models, (Liu et al. 2023)
- Image Reasoning Task, 200 VQA pairs
NOPE: Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models, (Lovenia et al.)
Bingo: Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges, (Cui et al.)
FaithScore: Evaluating Hallucinations in Large Vision-Language Models, (Jing et al.)
- Generative Task: Object Existence, Attribute, Relationship, 180 VQA pairs
- open-ended, fine-grained evaluation; needs other models to help with evaluation
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation, (Wang et al.)
- Discriminative Task: Object Existence, Attribute, Relationship
- Generative Task: Object Existence
- LLM-free
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models, (Villa et al.)
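
The discriminative benchmarks above (for example POPE and AMBER) ask yes/no questions about an image, so they can be scored without an LLM judge. The following is a minimal sketch of that kind of scoring, assuming answers have already been normalized to "yes" or "no" strings and that "yes" is the positive class. The function name `yes_no_metrics` and the exact set of returned keys are illustrative assumptions, not taken from any benchmark's released code.

```python
def yes_no_metrics(predictions, labels):
    """Score yes/no answers against ground-truth labels.

    predictions, labels: equal-length lists of "yes" / "no" strings.
    "yes" (the object is present) is treated as the positive class.
    This is a hypothetical helper, not an official benchmark scorer.
    """
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and of equal length")

    # Convert each (prediction, label) pair to booleans: True means "yes".
    pairs = [
        (p.strip().lower() == "yes", l.strip().lower() == "yes")
        for p, l in zip(predictions, labels)
    ]
    tp = sum(p and l for p, l in pairs)            # said yes, object present
    fp = sum(p and not l for p, l in pairs)        # said yes, object absent (hallucination)
    fn = sum(not p and l for p, l in pairs)        # said no, object present
    tn = sum(not p and not l for p, l in pairs)    # said no, object absent

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Fraction of questions answered "yes"; a high value suggests a bias toward affirming.
        "yes_ratio": (tp + fp) / len(pairs),
    }


if __name__ == "__main__":
    preds = ["yes", "yes", "no", "yes"]
    golds = ["yes", "no", "no", "no"]
    print(yes_no_metrics(preds, golds))
```

Reporting the share of "yes" answers alongside accuracy helps separate a model that recognizes objects from one that simply agrees with every question.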

Mitigating

LRV-Instruction: Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning, (Liu et al. ICLR2024)
- [dataset] proposes an instruction-tuning dataset that includes both positive and negative samples
- GAIVE: evaluation approach which uses GPT-4
LURE: Analyzing and Mitigating Object Hallucination in Large Vision-Language Models, (Zhou et al. ICLR2024)
- [post-hoc revision] train a revision model to detect and correct hallucinated objects in the base model’s response.
HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision-Language Models for Detailed Caption, (Zhai et al. 2023)
- CCEval, a GPT-4 assisted evaluation method tailored for detailed captioning
Woodpecker: Hallucination Correction for Multimodal Large Language Models, (Yin et al.)
- [revision] post-hoc correction
- needs other pre-trained visual models
LLaVA-RLHF: Aligning Large Multimodal Models with Factually Augmented RLHF, (Sun et al.)
- [RLHF-PPO] the first LMM trained with RLHF
- propose benchmark: MMHal-Bench
Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision, (Lee et al.)
- self-feedback: the model revises its response based on natural language feedback it generates itself
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data, (Yu et al.)
VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding, (Leng et al.)
- train-free
HA-DPO: Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
Mitigating Hallucination in Visual Language Models with Visual Supervision, (Chen et al.)
- construct a fine-grained vision instruction dataset, RAI-30k. It contains multi-modal conversations focusing on specific vision relations in an image.
- propose a new benchmark: RAHBench
- incorporating SAM in the vision instruction tuning process
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation, (Huang et al.)
FOHE: Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites, (Wang et al.)
- uses ChatGPT for post-hoc correction
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
- [RLHF-DPO] 1.4K preference data, natural language feedback
MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations, (Ben-Kish et al.)
- [RLHF]
HACL: Hallucination Augmented Contrastive Learning for Multimodal Large Language Model, (Jiang et al.)
Silkie: Preference Distillation for Large Visual Language Models, (Li et al.)
MMCot: Multimodal Chain-of-Thought Reasoning in Language Models, (Zhang et al.)
- [CoT]
KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning, (Mondal et al. AAAI 2024)
- [CoT]
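
Several of the mitigation methods above are training-free and act only at decoding time. As an illustration of that idea, here is a minimal sketch of one greedy contrastive decoding step: the next-token logits conditioned on the original image are contrasted with logits conditioned on a distorted copy, so tokens favored mainly by the language prior are pushed down. The combination rule `(1 + alpha) * clean - alpha * distorted`, the plausibility cutoff `beta`, and all function names are assumptions made for this sketch; consult the individual papers for their exact formulations.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_next_token(logits_clean, logits_distorted, alpha=1.0, beta=0.1):
    """Pick the next token id by contrasting two sets of logits.

    logits_clean:     next-token logits given the original image.
    logits_distorted: next-token logits given a distorted image.
    alpha:            contrast strength (alpha=0 reduces to plain greedy decoding).
    beta:             plausibility cutoff; tokens whose clean probability is below
                      beta * max(clean probability) are never selected.
    All parameters are hypothetical knobs for this sketch.
    """
    if len(logits_clean) != len(logits_distorted):
        raise ValueError("both logit vectors must cover the same vocabulary")

    p_clean = softmax(logits_clean)
    cutoff = beta * max(p_clean)

    best_id, best_score = None, -math.inf
    for token_id, (lc, ld) in enumerate(zip(logits_clean, logits_distorted)):
        if p_clean[token_id] < cutoff:
            continue  # implausible under the clean image; skip it
        score = (1 + alpha) * lc - alpha * ld
        if score > best_score:
            best_id, best_score = token_id, score
    return best_id


if __name__ == "__main__":
    clean = [2.0, 1.9, -5.0]      # token 0 narrowly leads with the real image
    distorted = [2.0, 0.5, 3.0]   # token 0 is equally likely without visual evidence
    print(contrastive_next_token(clean, distorted, alpha=0.0))
    print(contrastive_next_token(clean, distorted, alpha=1.0))
```

The plausibility cutoff matters because subtracting the distorted logits can otherwise promote tokens that the model considers very unlikely even with the real image.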

hitum-dev/awesome-Large-MultiModal-Hallucination
