
Awesome Vision-Language Compositionality: a comprehensive curation of research papers from the literature.

MIT License

Awesome Vision-Language Compositionality


Welcome to Awesome Vision-Language Compositionality, an extensively curated collection of research papers and resources on compositional understanding in vision-language models (VLMs). This repository will serve as a comprehensive resource to keep up to date with the latest advancements and to provide an overarching view of the vision-language compositionality landscape.

We'd welcome contributions and feedback to continuously improve and expand this collection. 😊
How to contribute?


Our Works

🌟 Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality, EMNLP 2024. 🌟
[Paper] [Project Page] [Code]

TL;DR: We present a new fine-tuning framework that improves the compositional reasoning of CLIP without sacrificing its multi-modal capabilities.

🌟 Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition, CVPRW 2024. 🌟
[Paper] [Code]

TL;DR: We comprehensively curate VLMs and benchmarks for compositionality and recognition evaluation!


Repository Structure

Icon Glossary

  • πŸ—‚οΈ Dataset: New benchmarks or datasets for evaluating compositionality.
  • πŸ€– Model: New architectures or training methodologies for enhanced compositional understanding.
  • βš–οΈ Evaluation: Assessment metrics and benchmarks for compositional reasoning.

Compositionality in Image-Text Understanding

πŸ—‚οΈπŸ€– Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. (CV-Bench). [NeurIPS, 2024].
Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, Saining Xie.
[Paper] [Code] [HF Dataset]

πŸ—‚οΈ NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples. (NaturalBench). [NeurIPS, 2024].
Baiqi Li, Zhiqiu Lin, Wenxuan Peng, Jean de Dieu Nyandwi, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna, Graham Neubig, Deva Ramanan.
[Paper] [Code] [HF Dataset]

πŸ—‚οΈπŸ€– TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives. (TripletCLIP). [NeurIPS, 2024].
Maitreya Patel, Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, Yezhou Yang.
[Paper] [Code] [HF Dataset]

πŸ—‚οΈ ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs. (ConMe). [NeurIPS, 2024].
Irene Huang, Wei Lin, M. Jehanzeb Mirza, Jacob A. Hansen, Sivan Doveh, Victor Ion Butoi, Roei Herzig, Assaf Arbelle, Hilde Kuhene, Trevor Darrel, Chuang Gan, Aude Oliva, Rogerio Feris, Leonid Karlinsky.
[Paper] [Code] [HF Dataset]

πŸ—‚οΈπŸ€– VisMin: Visual Minimal-Change Understanding. (VisMin). [NeurIPS, 2024].
Rabiul Awal, Saba Ahmadi, Le Zhang, Aishwarya Agrawal.
[Paper] [HF Dataset]

πŸ—‚οΈ BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval. (BiVLC). [NeurIPS, 2024].
Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune.
[Paper] [Code] [HF Dataset]

🤖 Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality. (FSC-CLIP). [EMNLP, 2024].
Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, In So Kweon, Junmo Kim.
[Paper] [Code]

🤖 Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP. (SDS-CLIP). [EMNLP, 2024].
Samyadeep Basu, Shell Xu Hu, Maziar Sanjabi, Daniela Massiceti, Soheil Feizi.
[Paper]

🤖 Natural Language Inference Improves Compositionality in Vision-Language Models. (CECE). [arXiv, 2024].
Paola Cascante-Bonilla, Yu Hou, Yang Trista Cao, Hal Daumé III, Rachel Rudinger.
[Paper] [Code]

🤖 Locality Alignment Improves Vision-Language Models. [arXiv, 2024].
Ian Covert, Tony Sun, James Zou, Tatsunori Hashimoto.
[Paper] [Code]

πŸ—‚οΈ VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks. (VL-GLUE). [arXiv, 2024].
Shailaja Keyur Sampat, Mutsumi Nakamura, Shankar Kailas, Kartik Aggarwal, Mandy Zhou, Yezhou Yang, Chitta Baral.
[Paper] [Code]

πŸ—‚οΈ MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models. (MMComposition). [arXiv, 2024].
Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, Jiebo Luo.
[Paper] [Code]

🤖 Compositional Entailment Learning for Hyperbolic Vision-Language Models. (HyCoCLIP). [arXiv, 2024].
Avik Pal, Max van Spengler, Guido Maria D'Amely di Melendugno, Alessandro Flaborea, Fabio Galasso, Pascal Mettes.
[Paper]

πŸ—‚οΈπŸ€– The Hard Positive Truth about Vision-Language Compositionality. (HP+HN). [ECCV, 2024].
Amita Kamath, Cheng-Yu Hsieh, Kai-Wei Chang, Ranjay Krishna.
[Paper] [Code]

πŸ—‚οΈ Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment. (MismatchQuest). [ECCV, 2024].
Brian Gordon, Yonatan Bitton, Yonatan Shafir, Roopal Garg, Xi Chen, Dani Lischinski, Daniel Cohen-Or, Idan Szpektor.
[Paper] [Code] [HF Dataset]

βš–οΈ Removing Distributional Discrepancies in Captions Improves Image-Text Alignment. (LLaVA-score). [ECCV, 2024].
Yuheng Li, Haotian Liu, Mu Cai, Yijun Li, Eli Shechtman, Zhe Lin, Yong Jae Lee, Krishna Kumar Singh.
[Paper] [Code]

πŸ—‚οΈπŸ€– Evaluating Text-to-Visual Generation with Image-to-Text Generation. (VQAScore). [ECCV, 2024].
Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan.
[Paper] [Code] [Model] [HF Dataset]

πŸ—‚οΈβš–οΈ FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction. (FINEMATCH). [ECCV, 2024].
Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, Jiebo Luo.
[Paper] [Code]

πŸ—‚οΈ Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation. (D3). [ECCVW, 2024].
Manu Gaur, Darshan Singh S, Makarand Tapaswi.
[Paper] [Code]

πŸ—‚οΈ ColorSwap: A Color and Word Order Dataset for Multimodal Evaluation. (ColorSwap). [ACL Findings, 2024].
Jirayu Burapacheep, Ishan Gaur, Agam Bhatia, Tristan Thrush.
[Paper] [Code] [HF Dataset]

βš–οΈ An Examination of the Compositionality of Large Generative Vision-Language Models. (SADE). [NAACL, 2024].
Teli Ma, Rong Li, Junwei Liang.
[Paper] [Code]

βš–οΈ Diagnosing the Compositional Knowledge of Vision Language Models from a Game-Theoretic View. [ICML, 2024].
Jin Wang, Shichao Dong, Yapeng Zhu, Kelu Yao, Weidong Zhao, Chao Li, Ping Luo.
[Paper] [Code]

🤖 Revisiting the Role of Language Priors in Vision-Language Models. (VisualGPTScore). [ICML, 2024].
Zhiqiu Lin, Xinyue Chen, Deepak Pathak, Pengchuan Zhang, Deva Ramanan.
[Paper] [Code]

βš–οΈ Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition. [CVPRW, 2024].
Youngtaek Oh, Pyunghwan Ahn, Jinhyung Kim, Gwangmo Song, Soonyoung Lee, In So Kweon, Junmo Kim.
[Paper] [Code]

πŸ—‚οΈπŸ€– Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. (MMVP). [CVPR, 2024].
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie.
[Paper] [Code] [HF Dataset]

🤖 Compositional Chain-of-Thought Prompting for Large Multimodal Models. (CCoT). [CVPR, 2024].
Chancharik Mitra, Brandon Huang, Trevor Darrell, Roei Herzig.
[Paper] [Code]

πŸ—‚οΈ A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions. (DCI). [CVPR, 2024].
Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, Adriana Romero-Soriano.
[Paper] [Code]

πŸ—‚οΈπŸ€– Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding. (SPEC). [CVPR, 2024].
Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, Zuxuan Wu.
[Paper] [Code] [HF Dataset]

🤖 Iterated Learning Improves Compositionality in Large Vision-Language Models. (IL-CLIP). [CVPR, 2024].
Chenhao Zheng, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna.
[Paper]

🤖 Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding. (CE-CLIP). [CVPR, 2024].
Le Zhang, Rabiul Awal, Aishwarya Agrawal.
[Paper] [Code]

🤖 MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training. (MobileCLIP). [CVPR, 2024].
Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel.
[Paper] [Code] [HF Dataset]

🤖 Efficient Vision-Language Pre-training by Cluster Masking. [CVPR, 2024].
Zihao Wei, Zixuan Pan, Andrew Owens.
[Paper] [Code]

🤖 Building Vision-Language Models on Solid Foundations with Masked Distillation. (SF-CLIP). [CVPR, 2024].
Sepehr Sameni, Kushal Kafle, Hao Tan, Simon Jenni.
[Paper]

βš–οΈ Probing Conceptual Understanding of Large Visual-Language Models. (UnderstandingVisualTextModels). [CVPRW, 2024].
Madeline Schiappa, Raiyaan Abdullah, Shehreen Azad, Jared Claypoole, Michael Cogswell, Ajay Divakaran, Yogesh Rawat.
[Paper] [Code]

πŸ—‚οΈ EVil-Probe - a Composite Benchmark for Extensive Visio-Linguistic Probing (Evil-Probe). [LREC, 2024].
Marie Bexte, Andrea Horbach, Torsten Zesch.
[Paper] [Code]

πŸ—‚οΈπŸ€– CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples. (CounterCurate). [ACL Findings, 2024].
Jianrui Zhang, Mu Cai, Tengyang Xie, Yong Jae Lee.
[Paper] [Code]

🤖 ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions. (ContextBLIP). [ACL Findings, 2024].
Honglin Lin, Siyu Li, Guoshun Nan, Chaoyue Tang, Xueting Wang, Jingxin Xu, Rong Yankai, Zhili Zhou, Yutong Gao, Qimei Cui, Xiaofeng Tao.
[Paper] [Code]

πŸ—‚οΈ Do Vision-Language Models Understand Compound Nouns? (Compun). [NAACL, 2024].
Sonal Kumar, Sreyan Ghosh, S Sakshi, Utkarsh Tyagi, Dinesh Manocha.
[Paper] [Code]

🤖 ComCLIP: Training-Free Compositional Image and Text Matching. (ComCLIP). [NAACL, 2024].
Kenan Jiang, Xuehai He, Ruize Xu, Xin Eric Wang.
[Paper] [Code]

βš–οΈ Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers. (probing-resamplers). [NAACL, 2024].
Georgios Pantazopoulos, Alessandro Suglia, Oliver Lemon, Arash Eshghi.
[Paper] [Code]

βš–οΈ How and where does CLIP process negation?, [ALVR, 2024].
Vincent Quantmeyer, Pablo Mosteiro, Albert Gatt.
[Paper]

πŸ—‚οΈRainbow - A Benchmark for Systematic Testing of How Sensitive Visio-Linguistic Models are to Color Naming. (Rainbow). [EACL, 2024].
Marie Bexte, Andrea Horbach, Torsten Zesch.
[Paper] [Code]

🤖 Fine-tuning CLIP Text Encoders with Two-step Paraphrasing. (ParaCLIP). [EACL Findings, 2024].
Hyunjae Kim, Seunghyun Yoon, Trung Bui, Handong Zhao, Quan Tran, Franck Dernoncourt, Jaewoo Kang.
[Paper] [Code]

🤖 Diffusion Feedback Helps CLIP See Better. (DIVA). [arXiv, 2024].
Wenxuan Wang, Quan Sun, Fan Zhang, Yepeng Tang, Jing Liu, Xinlong Wang.
[Paper] [Code]

πŸ—‚οΈ SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations. (SUGARCREPE++). [arXiv, 2024].
Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, Hassan Sajjad.
[Paper] [Code]

πŸ—‚οΈ ColorFoil: Investigating Color Blindness in Large Vision and Language Models. (ColorFoil). [arXiv, 2024].
Ahnaf Mozib Samin, M. Firoz Ahmed, Md. Mushtaq Shahriyar Rafee.
[Paper] [Code]

πŸ—‚οΈ VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations. (VISLA). [arXiv, 2024].
Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, Hassan Sajjad.
[Paper] [Code]

πŸ—‚οΈπŸ€– Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations. (CoN-CLIP). [arXiv, 2024].
Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa, Richa Singh, Aparna Bharati.
[Paper] [Code]

🤖 Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples. [arXiv, 2024].
Philipp J. Rösch, Norbert Oswald, Michaela Geierhos, Jindřich Libovický.
[Paper]

🤖 CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models. (CLoVe). [arXiv, 2024].
Santiago Castro, Amir Ziai, Avneesh Saluja, Zhuoning Yuan, Rada Mihalcea.
[Paper] [Code]

🤖 Prompting Large Vision-Language Models for Compositional Reasoning. (KeyComp). [arXiv, 2024].
Timothy Ossowski, Ming Jiang, Junjie Hu.
[Paper] [Code]

🤖 FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos. (FiG-CLIP). [arXiv, 2024].
Darshan Singh S, Zeeshan Khan, Makarand Tapaswi.
[Paper] [Code]

🤖 Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations. (Structure-CLIP). [AAAI, 2024].
Yufeng Huang, Jiji Tang, Zhuo Chen, Rongsheng Zhang, Xinfeng Zhang, Weijie Chen, Zeng Zhao, Zhou Zhao, Tangjie Lv, Zhipeng Hu, Wen Zhang.
[Paper] [Code]

🤖 Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining. (GNM-CLIP). [WACV, 2024].
Ugur Sahin, Hang Li, Qadeer Khan, Daniel Cremers, Volker Tresp.
[Paper] [Code]

πŸ—‚οΈ COCO-Counterfactuals: Automatically Constructed Counterfactual Examples for Image-Text Pairs. (COCO-Counterfactuals). [NeurIPS D&B, 2023].
Tiep Le, Vasudev Lal, Phillip Howard.
[Paper] [Code] [HF Dataset]

πŸ—‚οΈ SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality. (SugarCrepe). [NeurIPS D&B, 2023].
Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, Ranjay Krishna.
[Paper] [Code]

πŸ—‚οΈ PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning. (PUG). [NeurIPS, 2023].
Florian Bordes, Shashank Shekhar, Mark Ibrahim, Diane Bouchacourt, Pascal Vincent, Ari S. Morcos.
[Paper] [Code]

πŸ—‚οΈ COLA: A Benchmark for Compositional Text-to-image Retrieval. (COLA). [NeurIPS, 2023].
Arijit Ray, Filip Radenovic, Abhimanyu Dubey, Bryan A. Plummer, Ranjay Krishna, Kate Saenko.
[Paper] [Code]

🤖 Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models. (DAC). [NeurIPS, 2023].
Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-Bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky.
[Paper] [Code]

🤖 Image Captioners Are Scalable Vision Learners Too. (CapPa). [NeurIPS, 2023].
Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer.
[Paper]

πŸ—‚οΈπŸ€– When and why vision-language models behave like bags-of-words, and what to do about it?. (ARO, NegCLIP). [ICLR, 2023].
Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, James Zou.
[Paper] [Code]

πŸ—‚οΈ What's "up" with vision-language models? Investigating their struggle with spatial reasoning. (WhatsUp). [EMNLP, 2023].
Amita Kamath, Jack Hessel, Kai-Wei Chang.
[Paper] [Code]

πŸ—‚οΈβš–οΈ Text encoders bottleneck compositionality in contrastive vision-language models. (ControlledLMCaps). [EMNLP, 2023].
Amita Kamath, Jack Hessel, Kai-Wei Chang.
[Paper] [Code]

πŸ—‚οΈ The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained Multimodal Models. (BLA). [EMNLP, 2023].
Xinyi Chen, Raquel FernΓ‘ndez, Sandro Pezzelle.
[Paper] [Code]

πŸ—‚οΈ When are Lemons Purple? The Concept Association Bias of Vision-Language Models. (CAB). [EMNLP, 2023].
Yutaro Yamada, Yingtian Tang, Yoyo Zhang, Ilker Yildirim.
[Paper]

🤖 Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality. (MosaiCLIP). [EMNLP, 2023].
Harman Singh, Pengchuan Zhang, Qifan Wang, Mengjiao Wang, Wenhan Xiong, Jingfei Du, Yu Chen.
[Paper]

🤖 Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs. (SGVL). [EMNLP, 2023].
Roei Herzig, Alon Mendelson, Leonid Karlinsky, Assaf Arbelle, Rogerio Feris, Trevor Darrell, Amir Globerson.
[Paper] [Code]

πŸ—‚οΈ Visual Spatial Reasoning. (VSR). [TACL, 2023].
Fangyu Liu, Guy Emerson, Nigel Collier.
[Paper] [Code]

πŸ—‚οΈπŸ€– Equivariant Similarity for Vision-Language Foundation Models. (EqBen). [ICCV, 2023].
Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang.
[Paper] [Code]

πŸ—‚οΈ Teaching CLIP to Count to Ten. (CountBench). [ICCV, 2023].
Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, Tali Dekel.
[Paper] [Code]

🤖 Going Beyond Nouns With Vision & Language Models Using Synthetic Data. (SyViC). [ICCV, 2023].
Paola Cascante-Bonilla, Khaled Shehada, James Seale Smith, Sivan Doveh, Donghyun Kim, Rameswar Panda, Gül Varol, Aude Oliva, Vicente Ordonez, Rogerio Feris, Leonid Karlinsky.
[Paper] [Code]

βš–οΈ Measuring Progress in Fine-grained Vision-and-Language Understanding. [ACL, 2023].
Emanuele Bugliarello, Laurent Sartran, Aishwarya Agrawal, Lisa Anne Hendricks, Aida Nematzadeh.
[Paper] [Code]

πŸ—‚οΈ CREPE: Can Vision-Language Foundation Models Reason Compositionally? (CREPE). [CVPR, 2023].
Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, Ranjay Krishna.
[Paper] [Code]

🤖 Teaching Structured Vision&Language Concepts to Vision&Language Models. (TSVLC). [CVPR, 2023].
Sivan Doveh, Assaf Arbelle, Sivan Harary, Rameswar Panda, Roei Herzig, Eli Schwartz, Donghyun Kim, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky.
[Paper] [Code]

πŸ—‚οΈ HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales. (HL Dataset). [INLG, 2023].
Michele Cafagna, Kees van Deemter, Albert Gatt.
[Paper] [Code]

πŸ—‚οΈ Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining? (SNARE). [arXiv, 2023].
Fei Wang, Liang Ding, Jun Rao, Ye Liu, Li Shen, Changxing Ding.
[Paper] [Code]

πŸ—‚οΈ Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?. (Predicate-Noun-Dependencies). [EMNLP, 2022].
Mitja Nikolaus, Emmanuelle Salin, Stephane Ayache, Abdellah Fourtassi, Benoit Favre.
[Paper] [Code]

πŸ—‚οΈβš–οΈ Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality. [EMNLP, 2022].
Anuj Diwan, Layne Berry, Eunsol Choi, David Harwath, Kyle Mahowald.
[Paper] [Code]

πŸ—‚οΈ VIPHY: Probing "Visible" Physical Commonsense Knowledge. (ViPhy). [EMNLP Findings, 2022].
Shikhar Singh, Ehsan Qasemi, Muhao Chen.
[Paper] [Code]

πŸ—‚οΈ VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations. (VL-Checklist). [EMNLP Demo, 2022].
Tiancheng Zhao, Tianqi Zhang, Mingwei Zhu, Haozhan Shen, Kyusong Lee, Xiaopeng Lu, Jianwei Yin.
[Paper] [Code]

πŸ—‚οΈ Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality. (Winoground). [CVPR, 2022].
Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, Candace Ross.
[Paper] [HF Dataset]

πŸ—‚οΈ VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena. (VALSE). [ACL, 2022].
Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, Albert Gatt.
[Paper] [Code]

πŸ—‚οΈ Image Retrieval from Contextual Descriptions. (ImageCoDe). [ACL, 2022].
Benno Krojer, Vaibhav Adlakha, Vibhav Vineet, Yash Goyal, Edoardo Ponti, Siva Reddy.
[Paper] [Code]

πŸ—‚οΈ Probing Image-Language Transformers for Verb Understanding. (SVO Probes). [ACL Findings, 2021].
Lisa Anne Hendricks, Aida Nematzadeh.
[Paper] [Code] [HF Dataset]

πŸ—‚οΈ Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks. (Counting-Probe). [MMSR, 2021].
Letitia Parcalabescu, Albert Gatt, Anette Frank, Iacer Calixto.
[Paper] [Code]

πŸ—‚οΈ FOIL it! Find One mismatch between Image and Language caption. (FOIL). [ACL, 2017].
Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurelie Herbelot, Moin Nabi, Enver Sangineto, Raffaella Bernardi.
[Paper] [Dataset]


Compositionality in Video-Text Understanding

πŸ—‚οΈ TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models. (TemporalBench). [arXiv, 2024].
Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang.
[Paper] [Code] [HF Dataset]

πŸ—‚οΈ Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos. (Vinoground). [arXiv, 2024].
Jianrui Zhang, Mu Cai, Yong Jae Lee.
[Paper] [Code] [HF Dataset]

πŸ—‚οΈπŸ€– VideoCon: Robust Video-Language Alignment via Contrast Captions. (VideoCon). [CVPR, 2024].
Hritik Bansal, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang, Aditya Grover.
[Paper] [Code] [Project] [HF Dataset] [HF Model]

πŸ—‚οΈ NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality. (NAVERO). [arXiv, 2024].
Chaofan Tao, Gukyeong Kwon, Varad Gunjal, Hao Yang, Zhaowei Cai, Yonatan Dukler, Ashwin Swaminathan, R. Manmatha, Colin Jon Taylor, Stefano Soatto.
[Paper]


Compositionality in Text to Image Generation

πŸ—‚οΈ ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty. (ConceptMix). [NeurIPS, 2024].
Xindi Wu, Dingli Yu, Yangsibo Huang, Olga Russakovsky, Sanjeev Arora.
[Paper] [Code]


Contributing

If you find any errors or would like to add papers, please feel free to contribute by contacting me, posting an issue, or submitting a pull request. For pull requests, please use the following Markdown format, including the <br /> tags:

**Paper Title.** *(Optional Method/Benchmark name or abbreviation).* [Conference/Journal, Year]. <br />
*Authors.*  <br />
[[Paper](link)] [[Code](link)] [[HF Dataset](link)]
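
For contributors who prefer to generate entries rather than type them by hand, the format above can be sketched as a small Python helper. This is only an illustration: `format_entry` and its parameters are hypothetical names that are not part of this repository, and the URLs in the example are placeholders, not real links.

```python
def format_entry(title, venue, year, authors, links, name=None):
    """Render one paper entry in the three-line contribution format.

    Hypothetical helper, not part of the repository. `links` is a list of
    (label, url) pairs such as ("Paper", "https://...").
    """
    # Line 1: bold title, optional italic method/benchmark name, venue and year.
    head = f"**{title.rstrip('.')}.**"
    if name:
        head += f" *({name}).*"
    head += f" [{venue}, {year}]. <br />"
    # Line 2: italic, comma-separated authors; two spaces precede <br /> as in the template.
    author_line = f"*{', '.join(authors)}.*  <br />"
    # Line 3: one double-bracketed Markdown link per resource, in the order given.
    link_line = " ".join(f"[[{label}]({url})]" for label, url in links)
    return "\n".join([head, author_line, link_line])


# Example with an entry from the list above; the URLs are placeholders.
print(format_entry(
    title="Visual Spatial Reasoning",
    venue="TACL",
    year=2023,
    authors=["Fangyu Liu", "Guy Emerson", "Nigel Collier"],
    links=[("Paper", "https://example.com/paper"), ("Code", "https://example.com/code")],
    name="VSR",
))
```

The icon prefix from the glossary (dataset, model, evaluation) is not part of the template above, so the sketch leaves it out.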