Quantitative evaluation of CLIP's cross-modal grounding abilities via an attention-based explainability method.
Powerful multimodal models such as CLIP combine vision and language to reliably align image-text pairs. However, it is unclear whether CLIP focuses on the right signals when aligning images and text. To answer this, we leverage a state-of-the-art attention-based explainability method called Transformer-MM-Explainability and quantify how well CLIP grounds linguistic concepts in images and visual concepts in text.
To this end, we use the Panoptic Narrative Grounding (PNG) benchmark proposed by Gonzalez et al., which provides fine-grained segmentation masks corresponding to parts of the narrative sentences.
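For intuition, the text-to-image grounding score boils down to comparing an explainability heatmap over the image against the ground-truth segmentation mask of a phrase via intersection-over-union (IoU). Below is a minimal illustrative sketch of one plausible way such a comparison could be computed; the `heatmap_iou` helper, the fixed 0.5 threshold, and the toy arrays are assumptions for illustration, not the exact implementation in `clip_grounding/evaluation/clip_on_png.py`.

```python
import numpy as np

def heatmap_iou(relevance: np.ndarray, mask: np.ndarray, threshold: float = 0.5) -> float:
    """IoU between a binarized relevance heatmap and a ground-truth segmentation mask.

    `relevance` is an HxW map in [0, 1] (e.g., min-max normalized explainability scores),
    `mask` is an HxW binary mask. NOTE: the fixed 0.5 threshold is an illustrative assumption.
    """
    pred = relevance >= threshold               # binarize the heatmap
    gt = mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / float(union) if union > 0 else 0.0

# Toy example: prediction covers pixels (0, 0) and (0, 1); ground truth covers
# (0, 1) and (0, 2) -> IoU = 1 / 3.
relevance = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [0.2, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
mask = np.zeros((4, 4), dtype=np.uint8)
mask[0, 1:3] = 1
print(heatmap_iou(relevance, mask))  # 0.333...
```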
Follow the steps provided here to create a conda environment and activate it.
Download the MSCOCO dataset (only validation images are required for this work) and its panoptic segmentation annotations by running:
bash setup/download_mscoco.sh
This should result in the following folder structure:
data/panoptic_narrative_grounding
├── __MACOSX
│   └── panoptic_val2017
├── annotations
│   ├── panoptic_segmentation
│   ├── panoptic_train2017.json
│   ├── panoptic_val2017.json
│   └── png_coco_val2017.json
└── images
    └── val2017

6 directories, 3 files
⌛ This step takes about 30 minutes (depending on your Internet connection).
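As a quick sanity check that the download completed, you can load the annotation files and count the entries. This is only an illustrative snippet: the paths follow the folder structure above, and it assumes the panoptic annotation file uses the standard COCO format with top-level `images` and `annotations` lists; the schema of the PNG narratives file is not assumed beyond it being valid JSON.

```python
import json
import os

ROOT = "data/panoptic_narrative_grounding"

# COCO panoptic annotations follow the standard COCO format with
# top-level "images", "annotations", and "categories" lists.
with open(os.path.join(ROOT, "annotations", "panoptic_val2017.json")) as f:
    panoptic = json.load(f)
print("val2017 images listed:", len(panoptic["images"]))
print("panoptic annotations:", len(panoptic["annotations"]))

# For the PNG narratives file we only check that it loads; see the
# Panoptic Narrative Grounding benchmark for its exact schema.
png_path = os.path.join(ROOT, "annotations", "png_coco_val2017.json")
with open(png_path) as f:
    narratives = json.load(f)
print("loaded", png_path, "with", len(narratives), "top-level entries")

# The validation images themselves live here:
print("images on disk:", len(os.listdir(os.path.join(ROOT, "images", "val2017"))))
```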
To run our code on samples from the PNG benchmark dataset, please run this notebook. It assumes that you have the conda environment set up as described above and the dataset downloaded.
🤗 Alternatively, check out a Huggingface spaces demo here.
To reproduce our results with the CLIP model on the Panoptic Narrative Grounding (PNG) benchmark, use the following procedure:
- Activate the conda environment and set `PYTHONPATH`. Make sure you are at the repo root:
  conda activate clip-grounding
  export PYTHONPATH=$PWD
- Run the evaluation script (a small wrapper that runs all three settings in sequence is sketched after this list):
  - CLIP (multi-modal): to run the evaluation with CLIP using both modalities, run
    python clip_grounding/evaluation/clip_on_png.py --eval_method clip
    This will save metrics in the `outputs/` folder; the resulting numbers are presented below.
  - CLIP (unimodal): to run a stronger baseline that uses only one modality of CLIP, run
    python clip_grounding/evaluation/clip_on_png.py --eval_method clip-unimodal
  - Random baseline: to run the baseline evaluation (with random attributions), run
    python clip_grounding/evaluation/clip_on_png.py --eval_method random
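If you want to run all three settings back to back, a small wrapper such as the one below can be used. It is not part of the repository; it simply shells out to the evaluation script with the `--eval_method` values documented above.

```python
# Convenience wrapper (not part of the repo): runs the evaluation script for
# all three settings using the documented --eval_method values.
import subprocess

EVAL_METHODS = ["clip", "clip-unimodal", "random"]

for method in EVAL_METHODS:
    print(f"Running evaluation with --eval_method {method}")
    subprocess.run(
        ["python", "clip_grounding/evaluation/clip_on_png.py", "--eval_method", method],
        check=True,  # stop immediately if a run fails
    )
```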
The cross-modal grounding results for different variants are summarized in the following table:
|                     | Random | CLIP-Unimodal | CLIP   |
|---------------------|--------|---------------|--------|
| Text-to-Image (IoU) | 0.2763 | 0.4310        | 0.4917 |
| Image-to-Text (IoU) | 0.2557 | 0.4570        | 0.5099 |
We'd like to thank the TAs, in particular, Jaap Jumelet and Tom Kersten, for useful initial discussions, and the course instructor Prof. Jelle Zuidema.
We greatly appreciate the open-sourced code/datasets/models from the following resources:
- Panoptic Narrative Grounding
- MS-COCO
- CLIP by OpenAI
- Transformer-MM-Explainability by Hila Chefer et al.