CLIP-Guided Decoding

Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding: [Paper]

Ailin Deng, Zhirui Chen, Bryan Hooi


Run to install basic dependencies. (We recommend using conda or another virtual environment before running the setup.)

Models Installation

Install custom transformers after installing the models:

    cd dep/transformers_custom/transformers-4.31.0
    pip install -e .

Compared with the original code, the modifications are in src/generation/ and make generation return raw logits.
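
Raw logits can be converted into per-token log-probabilities and then summed into a sequence-level log-likelihood. The sketch below illustrates that conversion in plain Python. It is not the repository's code: `log_softmax` and `sequence_log_likelihood` are hypothetical helper names, and the input is assumed to be one list of logits per generated token.

```python
import math


def log_softmax(logits):
    """Convert one step of raw logits into log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_log_likelihood(step_logits, token_ids):
    """Sum the log-probability of each chosen token.

    step_logits: list of per-step logit lists (hypothetical layout).
    token_ids:   the token index chosen at each step.
    """
    if len(step_logits) != len(token_ids):
        raise ValueError("need exactly one token id per step of logits")
    return sum(log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids))
```

Dividing the result by `len(token_ids)` gives a length-normalized score, which makes candidates of different lengths easier to compare.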


We provide simple inference code in inference.ipynb.

For Evaluation

Note that evaluation requires MySQL and Java, which are dependencies of the pycocoevalcap package.

The COCO samples we tested can be accessed via link. The JSON files contain generated responses (with top-k sampling here) for different random seeds. The MSCOCO ID of each item in a JSON file is stored under the "image_id" key.
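
A minimal sketch of reading one of these files and grouping the responses by image follows. It assumes that each file holds a top-level JSON list of objects; only the "image_id" key is taken from the description above, and the helper names `group_by_image_id` and `load_responses` are hypothetical.

```python
import json
from collections import defaultdict


def group_by_image_id(items):
    """Group response items by their "image_id" value, preserving file order."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["image_id"]].append(item)
    return dict(grouped)


def load_responses(path):
    """Load a response file (assumed to be a JSON list) and group it by image."""
    with open(path, encoding="utf-8") as f:
        return group_by_image_id(json.load(f))
```

Calling `load_responses` once per seed file gives one dictionary per random seed, keyed by MSCOCO image ID.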

COCO Data Structure

Download the data from here. You can organize the downloaded data as follows:

        - COCO_val2014_000000358301.jpg
        - COCO_val2014_000000455735.jpg
        - ...
        - captions_val2014.json
        - instances_val2014.json
        - person_keypoints_val2014.json
        - ...

After data preparation, change the data_path in conf/mscoco_captions.yaml.
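
Before pointing data_path at the downloaded data, a quick sanity check can catch a wrong path early. The sketch below is an assumption-laden helper, not part of the repository: it searches recursively under the given directory because the exact folder nesting is not specified here, and the names `check_coco_dir` and `image_id_from_filename` are hypothetical.

```python
from pathlib import Path

# Annotation files named in the layout above.
EXPECTED_ANNOTATIONS = ("captions_val2014.json", "instances_val2014.json")


def image_id_from_filename(filename):
    """Extract the numeric ID, e.g. COCO_val2014_000000358301.jpg -> 358301."""
    return int(Path(filename).stem.rsplit("_", 1)[1])


def check_coco_dir(data_path):
    """Count val2014 images and report expected annotation files that are missing."""
    root = Path(data_path)
    if not root.is_dir():
        raise FileNotFoundError(f"data_path is not a directory: {root}")
    images = list(root.rglob("COCO_val2014_*.jpg"))
    missing = [
        name for name in EXPECTED_ANNOTATIONS
        if next(root.rglob(name), None) is None
    ]
    return {"num_images": len(images), "missing_annotations": missing}
```

An empty `missing_annotations` list and a non-zero `num_images` suggest that the directory is laid out as expected.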

Run Tests

See the parameter arguments in conf/mscoco_captions.yaml.

Possible Issues

  • (Evaluation) Increase the Java heap size in the pycocoeval package (to 16G, 32G, or 64G) to avoid out-of-memory errors when computing SPICE:
        # change '-Xmx8G' to '-Xmx16G' in
  • (Evaluation) When pycocoeval computes the BLEU/METEOR/ROUGE/SPICE metrics, it raises an assertion error, because it expects results for all COCO samples while we only need to evaluate a subset of the dataset. You can remove the assertion and assign imgIds with res.keys().
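
The idea behind the second workaround can be sketched on plain dictionaries: take the image IDs from the generated results and keep only the matching ground-truth entries. This is an illustration under assumptions, not pycocoeval's code. Here `gts` and `res` are assumed to map image IDs to lists of captions, and `restrict_to_generated` is a hypothetical helper.

```python
def restrict_to_generated(gts, res):
    """Limit evaluation to the images that actually have generated results.

    gts: dict mapping image ID -> reference captions (assumed layout).
    res: dict mapping image ID -> generated captions (assumed layout).
    Returns the image IDs taken from res and the matching subset of gts.
    """
    img_ids = list(res.keys())
    unknown = [i for i in img_ids if i not in gts]
    if unknown:
        # A generated ID without references cannot be scored.
        raise KeyError(f"no reference captions for image IDs: {unknown}")
    return img_ids, {i: gts[i] for i in img_ids}
```

Scoring then runs over `img_ids` only, so images without generated responses are never looked up.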


CHAIR metrics implementation:

MMVet Evaluation:


  title         = {Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding},
  author        = {Deng, Ailin and Chen, Zhirui and Hooi, Bryan},
  year          = {2024},
  journal       = {arXiv preprint arXiv:2402.15300},
  archivePrefix = {arXiv},
  eprint        = {2402.15300},