Awesome Multimodal Reasoning

Collection of papers and resources on how to unlock reasoning abilities under multimodal settings.

Animation from ViperGPT (Surís et al.)

Consider how difficult it would be to study from a book that lacks any figures, diagrams, or tables. We enhance our ability to learn when we combine different data modalities, such as vision, language, and audio [1]. Recently, large language models (LLMs) have achieved remarkable results on complex reasoning tasks by generating intermediate steps before deducing the answer, a technique known as chain-of-thought (CoT) reasoning [2] [3]. However, most research on CoT reasoning involves only the language modality. This is a collection of papers and resources on how to unlock these reasoning abilities in multimodal settings.
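As a rough illustration of the idea (not any particular paper's method), the sketch below shows a two-stage multimodal chain-of-thought pipeline: a vision-language model is first prompted to produce an intermediate rationale grounded in the image, and then answers the question conditioned on that rationale. `vlm_generate` is a placeholder for whatever vision-language model wrapper you use; it is an assumption, not a real library API.

```python
# Minimal sketch of two-stage multimodal chain-of-thought prompting.
# `vlm_generate` is a stand-in for your own vision-language model wrapper
# (image + text prompt in, text out); it is an assumption, not a real API.

def vlm_generate(image_path: str, prompt: str) -> str:
    """Placeholder: send the image and prompt to a VLM and return its text output."""
    raise NotImplementedError("plug in your vision-language model here")


def multimodal_cot(image_path: str, question: str) -> str:
    # Stage 1: elicit an intermediate rationale grounded in the visual input.
    rationale = vlm_generate(
        image_path,
        f"Question: {question}\n"
        "Describe the relevant visual evidence and reason step by step.",
    )
    # Stage 2: answer conditioned on the question plus the generated rationale.
    return vlm_generate(
        image_path,
        f"Question: {question}\nRationale: {rationale}\nTherefore, the answer is:",
    )
```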

Contents

  • Technique
    • End-to-end Models
    • Prompting & In-context Learning
    • Compositional & Symbolic Approach
  • Benchmark
  • Other Useful Resources
  • Contributing
  • Contributors

Technique

End-to-end Models

  1. Learning to Reason: End-to-End Module Networks for Visual Question Answering. ICCV 2017

    Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Kate Saenko. [Paper], 2017.4

  2. Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan. [Blog] [Paper], 2022.4

  3. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Preprint

    Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. [Paper] [Code], 2023.1

  4. Language Is Not All You Need: Aligning Perception with Language Models. Preprint

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei. [Paper], 2023.2

  5. Prismer: A Vision-Language Model with An Ensemble of Experts. Preprint

    Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, Anima Anandkumar. [Project] [Paper] [Code] [Demo], 2023.3

  6. PaLM-E: An Embodied Multimodal Language Model. Preprint

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence. [Project] [Paper], 2023.3

  7. GPT-4 Technical Report. Preprint

    OpenAI. [Blog] [Paper], 2023.3

  8. Visual Instruction Tuning. Preprint

    Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. [Project] [Paper] [Code] [Demo], 2023.4

  9. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. Preprint

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny. [Project] [Paper] [Code], 2023.4

  10. Otter: A Multi-Modal Model with In-Context Instruction Tuning. Preprint

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, Ziwei Liu. [Paper] [Code], 2023.5

  11. VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks. Preprint

    [Paper] [Code] [Demo], 2023.5

  12. Kosmos-2: Grounding Multimodal Large Language Models to the World. Preprint

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei. [Paper], 2023.6

  13. BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs. Preprint

    Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, Bingyi Kang. [Paper] [Code], 2023.7

  14. Augmenting CLIP with Improved Visio-Linguistic Reasoning. Preprint

    Samyadeep Basu, Maziar Sanjabi, Daniela Massiceti, Shell Xu Hu, Soheil Feizi. [Paper], 2023.7

  15. Med-Flamingo: a Multimodal Medical Few-shot Learner. Preprint

    Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Cyril Zakka, Yash Dalmia, Eduardo Pontes Reis, Pranav Rajpurkar, Jure Leskovec. [Paper] [Code], 2023.7

  16. Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities. Preprint

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou. [Paper] [Code], 2023.8

  17. Kosmos-2.5: A Multimodal Literate Model. Preprint

    Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, Shaoxiang Wu, Guoxin Wang, Cha Zhang, Furu Wei. [Paper], 2023.9

  18. Improved Baselines with Visual Instruction Tuning. Preprint

    Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee. [Project] [Paper] [Code], 2023.10

  19. G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model. Preprint

    Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong. [Paper], 2023.12

  20. Gemini: A Family of Highly Capable Multimodal Models. Preprint

    Gemini Team, Google. [Paper], 2023.12

  21. Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models. Preprint

    Yuqing Wang, Yun Zhao. [Paper], 2023.12

Prompting & In-context Learning

  1. Multimodal Few-Shot Learning with Frozen Language Models. NeurIPS 2021

    [Paper], 2021.6

  2. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. ICLR 2023

    Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, Pete Florence. [Project] [Paper] [Code], 2022.4

  3. Multimodal Chain-of-Thought Reasoning in Language Models. Preprint

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola. [Paper] [Code], 2023.2

  4. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. Preprint

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, Nan Duan. [Paper] [Code], 2023.3

  5. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action. Preprint

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang. [Project] [Paper] [Code] [Demo], 2023.3

  6. Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings. Preprint

    Daniel Rose, Vaishnavi Himakunthala, Andy Ouyang, Ryan He, Alex Mei, Yujie Lu, Michael Saxon, Chinmay Sonar, Diba Mirza, William Yang Wang. [Paper] [Code], 2023.5

  7. Link-Context Learning for Multimodal LLMs. Preprint

    Yan Tai, Weichen Fan, Zhao Zhang, Feng Zhu, Rui Zhao, Ziwei Liu. [Paper] [Code], 2023.8

  8. Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding. Preprint

    Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Martin Eisenschlos, Vincent Perot, Zifeng Wang, Lesly Miculicich, Yasuhisa Fujii, Jingbo Shang, Chen-Yu Lee, Tomas Pfister. [Paper], 2024.1

Compositional & Symbolic Approach

  1. Inferring and Executing Programs for Visual Reasoning. ICCV 2017

    Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick. [Project] [Paper] [Code], 2017.5

  2. Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding. NeurIPS 2018

    Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, Joshua B. Tenenbaum. [Project] [Paper] [Code], 2018.10

  3. Visual Programming: Compositional visual reasoning without training. CVPR 2023

    Tanmay Gupta, Aniruddha Kembhavi. [Project] [Paper] [Code], 2022.11

  4. ViperGPT: Visual Inference via Python Execution for Reasoning. Preprint

    Dídac Surís, Sachit Menon, Carl Vondrick. [Project] [Paper] [Code], 2023.3

  5. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace. Preprint

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang. [Paper] [Code], 2023.3

  6. Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models. Preprint

    Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Jianfeng Gao. [Project] [Paper] [Code], 2023.4

  7. Woodpecker: Hallucination Correction for Multimodal Large Language Models. Preprint

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, Enhong Chen. [Paper] [Code], 2023.10

  8. MM-VID: Advancing Video Understanding with GPT-4V(ision). Preprint

    Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang, Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, Ce Liu, Lijuan Wang. [Project] [Paper] [Demo], 2023.10

Benchmark

  • SCIENCEQA Multimodal multiple-choice questions on diverse science topics, with answers annotated with corresponding lectures and explanations.
  • ARO Systematically evaluates the ability of VLMs to understand different types of relationships, attributes, and order.
  • OK-VQA Visual question answering that requires methods which can draw upon outside knowledge to answer questions.
  • A-OKVQA Knowledge-based visual question answering benchmark.
  • NExT-QA Video question answering (VideoQA) benchmark to advance video understanding from describing to explaining the temporal actions.
  • GQA Compositional questions over real-world images.
  • VQA Questions about images that require an understanding of vision, language and commonsense knowledge.
  • VQAv2 2nd iteration of the Visual Question Answering Dataset (VQA).
  • TAG Questions that require understanding the textual cues in an image.
  • Bongard-HOI Visual reasoning benchmark on compositional learning of human-object interactions (HOIs) from natural images.
  • ARC General artificial intelligence benchmark, targeted at artificially intelligent systems that aim to emulate a human-like form of general fluid intelligence.

Other Useful Resources

Other Awesome Lists

  • LLM-Reasoning-Papers Collection of papers and resources on Reasoning in Large Language Models, including Chain-of-Thought, Instruction-Tuning, and others.
  • Chain-of-ThoughtsPapers A trend that started with "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models".
  • Prompt4ReasoningPapers Repository for the paper "Reasoning with Language Model Prompting: A Survey".
  • Deep-Reasoning-Papers Recent papers on neural-symbolic reasoning, logical reasoning, visual reasoning, planning, and other topics connecting deep learning and reasoning.

Contributing

  • Add a new paper or update an existing one, considering which category the work belongs to.
  • Use the same format as existing entries to describe the work (see the example below).
  • Link to the paper's abstract (use the /abs/ URL for arXiv papers).
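
For example, a new entry might look like the following (title, authors, venue, links, and date are placeholders):

```
  1. An Example Paper Title. Preprint

    First Author, Second Author, Third Author. [Paper] [Code], 2024.1
```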

Don't worry if you get something wrong; it will be fixed for you!

Contributors