Quantitative evaluation of CLIP's cross-modal grounding abilities via an attention-based explainability method.
Powerful multimodal models such as CLIP combine vision and language to reliably align image-text pairs. However, it is unclear whether CLIP focuses on the right signals when aligning images and text. To answer this, we leverage a state-of-the-art attention-based explainability method called Transformer-MM-Explainability and quantify how well CLIP grounds linguistic concepts in images and visual concepts in text.
To this end, we use the Panoptic Narrative Grounding (PNG) benchmark proposed by Gonzalez et al., which provides fine-grained segmentation masks corresponding to parts of the narrative sentences.
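For intuition, the text-to-image grounding score boils down to comparing an explainability heatmap over the image against the ground-truth segmentation mask of a phrase via intersection-over-union (IoU). Below is a minimal illustrative sketch of one plausible way such a comparison could be computed; the `heatmap_iou` helper, the fixed 0.5 threshold, and the toy arrays are assumptions for illustration, not the exact implementation in `clip_grounding/evaluation/clip_on_png.py`.

```python
import numpy as np

def heatmap_iou(relevance: np.ndarray, mask: np.ndarray, threshold: float = 0.5) -> float:
    """IoU between a binarized relevance heatmap and a ground-truth segmentation mask.

    `relevance` is an HxW map in [0, 1] (e.g., min-max normalized explainability scores),
    `mask` is an HxW binary mask. NOTE: the fixed 0.5 threshold is an illustrative assumption.
    """
    pred = relevance >= threshold               # binarize the heatmap
    gt = mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / float(union) if union > 0 else 0.0

# Toy example: prediction covers pixels (0, 0) and (0, 1); ground truth covers
# (0, 1) and (0, 2) -> IoU = 1 / 3.
relevance = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [0.2, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
mask = np.zeros((4, 4), dtype=np.uint8)
mask[0, 1:3] = 1
print(heatmap_iou(relevance, mask))  # 0.333...
```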
Follow the steps provided here to create a conda environment and activate it.
Download the MSCOCO dataset (only validation images are required for this work) and its panoptic segmentation annotations by running:
bash setup/download_mscoco.sh
This should result in the following folder structure:
data/panoptic_narrative_grounding
├── __MACOSX
│   └── panoptic_val2017
├── annotations
│   ├── panoptic_segmentation
│   ├── panoptic_train2017.json
│   ├── panoptic_val2017.json
│   └── png_coco_val2017.json
└── images
    └── val2017

6 directories, 3 files
⌛ This step takes about 30 minutes (depending on your Internet connection).
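As a quick sanity check that the download completed, you can load the annotation files and count the entries. This is only an illustrative snippet: the paths follow the folder structure above, and it assumes the panoptic annotation file uses the standard COCO format with top-level `images` and `annotations` lists; the schema of the PNG narratives file is not assumed beyond it being valid JSON.

```python
import json
import os

ROOT = "data/panoptic_narrative_grounding"

# COCO panoptic annotations follow the standard COCO format with
# top-level "images", "annotations", and "categories" lists.
with open(os.path.join(ROOT, "annotations", "panoptic_val2017.json")) as f:
    panoptic = json.load(f)
print("val2017 images listed:", len(panoptic["images"]))
print("panoptic annotations:", len(panoptic["annotations"]))

# For the PNG narratives file we only check that it loads; see the
# Panoptic Narrative Grounding benchmark for its exact schema.
png_path = os.path.join(ROOT, "annotations", "png_coco_val2017.json")
with open(png_path) as f:
    narratives = json.load(f)
print("loaded", png_path, "with", len(narratives), "top-level entries")

# The validation images themselves live here:
print("images on disk:", len(os.listdir(os.path.join(ROOT, "images", "val2017"))))
```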
To run our code on samples from the PNG benchmark dataset, please run this notebook. It assumes that you have the conda environment set up as described above and the dataset downloaded.
🤗 Alternatively, check out a Huggingface spaces demo here.
To reproduce our results with the CLIP model on the Panoptic Narrative Grounding (PNG) benchmark, use the following procedure:
- Activate the conda environment and set `PYTHONPATH`. Make sure you are at the repo root:
  conda activate clip-grounding
  export PYTHONPATH=$PWD
- Run the evaluation script (a small wrapper that runs all three settings in sequence is sketched after this list):
  - CLIP (multi-modal): to run the evaluation with CLIP using both modalities, run
    python clip_grounding/evaluation/clip_on_png.py --eval_method clip
    This will save metrics in the `outputs/` folder; the resulting numbers are presented below.
  - CLIP (unimodal): to run a stronger baseline that uses only one modality of CLIP, run
    python clip_grounding/evaluation/clip_on_png.py --eval_method clip-unimodal
  - Random baseline: to run the baseline evaluation (with random attributions), run
    python clip_grounding/evaluation/clip_on_png.py --eval_method random
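If you want to run all three settings back to back, a small wrapper such as the one below can be used. It is not part of the repository; it simply shells out to the evaluation script with the `--eval_method` values documented above.

```python
# Convenience wrapper (not part of the repo): runs the evaluation script for
# all three settings using the documented --eval_method values.
import subprocess

EVAL_METHODS = ["clip", "clip-unimodal", "random"]

for method in EVAL_METHODS:
    print(f"Running evaluation with --eval_method {method}")
    subprocess.run(
        ["python", "clip_grounding/evaluation/clip_on_png.py", "--eval_method", method],
        check=True,  # stop immediately if a run fails
    )
```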
The cross-modal grounding results for different variants are summarized in the following table:
|                     | Random | CLIP-Unimodal | CLIP   |
|---------------------|--------|---------------|--------|
| Text-to-Image (IoU) | 0.2763 | 0.4310        | 0.4917 |
| Image-to-Text (IoU) | 0.2557 | 0.4570        | 0.5099 |
We'd like to thank the TAs, in particular, Jaap Jumelet and Tom Kersten, for useful initial discussions, and the course instructor Prof. Jelle Zuidema.
We greatly appreciate the open-sourced code/datasets/models from the following resources:
- Panoptic Narrative Grounding
- MS-COCO
- CLIP by OpenAI
- Transformer-MM-Explainability by Hila Chefer et al.