Awesome-Multimodal-LLM-for-Math-STEM

🔥 Paper collections of multi-modal LLMs for Math/STEM/Code.

Table of Contents

  - Awesome Papers
  - MLLM Math/STEM Dataset
  - MLLM Math/STEM Benchmark
  - Contributors

Awesome Papers

  1. MAVIS: Mathematical Visual Instruction Tuning Preprint

    Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, Peng Gao, Hongsheng Li. [Paper], 2024.7

  2. COMET: “Cone of experience” enhanced large multimodal model for mathematical problem generation. Preprint

    Sannyuya Liu, Jintian Feng, Zongkai Yang, Yawei Luo, Qian Wan, Xiaoxuan Shen, Jianwen Sun. [Paper], 2024.7

  3. Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B: A Technical Report. Preprint

    Di Zhang, Xiaoshui Huang, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang. [Paper], 2024.6

  4. Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models. Preprint

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, Ranjay Krishna. [Paper], [Code], 2024.6

  5. TextSquare: Scaling up Text-Centric Visual Instruction Tuning. Preprint

    Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang. [Paper], 2024.4

  6. Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs. ACL 2024

    Victor Carbune, Hassan Mansoor, Fangyu Liu, Rahul Aralikatte, Gilles Baechler, Jindong Chen, Abhanshu Sharma. [Paper], 2024.3

  7. mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding. Preprint

    Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou. [Paper], 2024.3

  8. ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning. Preprint

    Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Min Dou, Botian Shi, Junchi Yan, Yu Qiao. [Paper], 2024.2

  9. InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions. Preprint

    Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki. [Paper], 2024.1

  10. G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model. Preprint

    Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong. [Paper], [Code], 2023.12

  11. mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model. Preprint

    Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, Fei Huang. [Paper], 2023.11

  12. Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning. Preprint

    Xingchen Zeng, Haichuan Lin, Yilin Ye, Wei Zeng. [Paper], [Code], 2024.7

  13. Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning. Preprint

    Wenwen Zhuang, Xin Huang, Xiantao Zhang, Jin Zeng. [Paper], 2024.8

  14. Diagram Formalization Enhanced Multi-Modal Geometry Problem Solver. Preprint

    Zeren Zhang, Jo-Ku Cheng, Jingyang Deng, Lu Tian, Jinwen Ma, Ziran Qin, Xiaokai Zhang, Na Zhu, Tuo Leng. [Paper], 2024.9

  15. Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends. Preprint

    Mirna Al-Shetairy, Hanan Hindy, Dina Khattab, Mostafa M. Aref. [Paper], 2024.10

  16. Improve Vision Language Model Chain-of-Thought Reasoning. Preprint

    Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, Yiming Yang. [Paper], 2024.10

  17. R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models. Preprint

    Linger Deng, Yuliang Liu, Bohan Li, Dongliang Luo, Liang Wu, Chengquan Zhang, Pengyuan Lyu, Ziyang Zhang, Gang Zhang, Errui Ding, Yingying Zhu, Xiang Bai. [Paper], 2024.10

  18. GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models. Preprint

    Aditya Sharma, Aman Dalmia, Mehran Kazemi, Amal Zouaq, Christopher J. Pal. [Paper], 2024.10

MLLM Math/STEM Dataset

| Name | Paper | Notes |
| --- | --- | --- |
| ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | A benchmark of ~21k multimodal multiple-choice questions covering diverse science topics. |
| CMM12K | COMET: “Cone of experience” enhanced large multimodal model for mathematical problem generation | A Chinese multimodal SFT dataset for math; not released. |
| SPIQA | SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers | Designed for interpreting complex figures and tables in scientific research articles across various domains of computer science. |
| InstructDoc | InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions | A collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format. |
| M-Paper | mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | Built by parsing LaTeX source files of high-quality papers. |
| DocStruct4M/DocReason25K | mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding | A high-quality instruction-tuning dataset based on publicly available datasets. |
| DocGenome | DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models | A structured document benchmark constructed by annotating 500K scientific documents from 153 disciplines on arXiv. |
| ArXivCap/ArXivQA | Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models | A figure-caption dataset of 6.4M images and 3.9M captions sourced from 572K arXiv papers, plus a QA dataset generated by prompting GPT-4V with scientific figures. |
| FigureQA | FigureQA: An Annotated Figure Dataset for Visual Reasoning | A visual reasoning corpus of over one million QA pairs grounded in over 100,000 images; the images are synthetic, scientific-style figures of five types: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts. |
| DVQA | DVQA: Understanding Data Visualizations via Question Answering | Tests many aspects of bar-chart understanding in a question-answering framework. |
| SciGraphQA | SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | A synthetic multi-turn QA dataset about academic graphs. |
| SciCap | SciCap: Generating Captions for Scientific Figures | A large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020, containing over 416k figures focused on graph plots. |
| FigCap | Figure Captioning with Reasoning and Sequence-Level Training | Generated based on FigureQA. |
| FigureSeer | FigureSeer: Parsing Result-Figures in Research Papers | - |
| UniChart | UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning | A large-scale chart corpus for pretraining, covering a diverse range of visual styles and topics. |
| MapQA | MapQA: A Dataset for Question Answering on Choropleth Maps | A large-scale dataset of ~800K question-answer pairs over ~60K map images. |
| TabMWP | Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning | Contains 38,431 open-domain grade-level problems requiring mathematical reasoning over both textual and tabular data. |
| CLEVR-Math | CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning | A multimodal math word-problem dataset consisting of simple word problems involving addition/subtraction. |
| GUICourse | GUICourse: From General Vision Language Model to Versatile GUI Agent | A suite of datasets for training vision-based GUI agents from general VLMs. |
| PIN-14M | PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents | 14 million samples derived from Chinese and English sources, tailored to include complex web and scientific content. |
| MathV360K | Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models | 40K high-quality images with QA pairs selected from 24 existing datasets, plus 320K newly synthesized pairs. |
| MMSci | MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension | A multimodal dataset collected from open-access scientific articles published in Nature Communications journals. |
| MAVIS-Caption/Instruct | MAVIS: Mathematical Visual Instruction Tuning | - |
| Geo170K | G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model | A multimodal geometry dataset constructed by exploiting geometric characteristics, building upon existing datasets. |
| SciOL/MuLMS-Img | SciOL and MuLMS-Img: Introducing A Large-Scale Multimodal Scientific Dataset and Models for Image-Text Tasks in the Scientific Domain | A pretraining corpus for multimodal models in the scientific domain. |
| PlotQA | PlotQA: Reasoning over Scientific Plots | 28.9 million QA pairs over 224,377 plots on data from real-world sources, with questions based on crowd-sourced templates. |
| ChartInstructionData | Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning | A dataset of 467K samples, including 108K table-chart pairs and 359K chart-QA pairs. |
| MMTab | Multimodal Table Understanding | A dataset for multimodal table understanding, based on 14 publicly available table datasets across 8 domains. |
| Multimodal Self-Instruct | Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model | An instruction dataset for eight visual scenarios: charts, tables, simulated maps, dashboards, flowcharts, relation graphs, floor plans, and visual puzzles. |
| GeoGPT4V | GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation | Leverages GPT-4 and GPT-4V to generate relatively basic geometry problems with aligned text and images. |
| InfiMM-WebMath-40B | InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning | Interleaved image-text documents comprising 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, extracted and filtered from CommonCrawl. |
| MultiMath-300K | MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | Spans K-12 levels with image captions and step-wise solutions. |
| MathVL | MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model | A fine-tuning dataset combining several public datasets with a curated Chinese dataset collected from K-12 education levels. |
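
A minimal sketch of how one of these datasets might be loaded with the Hugging Face `datasets` library, assuming the dataset has a Hub mirror; the repo id below is a hypothetical placeholder, so check each paper or repo for the official release.

```python
# Hypothetical example: "some-org/scienceqa" is a placeholder repo id, not an
# official release; substitute the identifier from the dataset's paper/repo.
from datasets import load_dataset

ds = load_dataset("some-org/scienceqa", split="train")
example = ds[0]
# Multimodal math/STEM sets typically pair an image with question text,
# answer choices, and a gold answer; exact field names vary per dataset.
print(example.keys())
```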

MLLM Math/STEM Benchmark

| Name | Paper | Notes |
| --- | --- | --- |
| GeoEval | GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving | A benchmark for evaluating MLLMs' capability to solve geometry problems. |
| Geometry3K | Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning | 3,002 geometry problems with dense annotation in formal language. |
| GEOS | Solving Geometry Problems: Combining Text and Diagram Interpretation | - |
| GeoQA | GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning | 4,998 geometric problems with corresponding annotated programs. |
| GeoQA+ | An Augmented Benchmark Dataset for Geometric Question Answering through Dual Parallel Text Encoding | Based on GeoQA; newly annotates 2,518 geometric problems with richer types and greater difficulty. |
| UniGeo | UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression | Contains 4,998 calculation problems and 9,543 proving problems. |
| PGPS9K | A Multi-Modal Neural Geometric Solver with Textual Clauses Parsed from Diagram | Labeled with both fine-grained diagram annotations and interpretable solution programs. |
| GeomVerse | GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning | A synthetic benchmark of geometry questions with difficulty controllable along multiple axes. |
| MathVista | MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | Designed to combine challenges from diverse mathematical and visual tasks. |
| OlympiadBench | OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems | An Olympiad-level bilingual multimodal scientific benchmark drawn from mathematics and physics competitions. |
| OlympicArena | OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI | Encompasses a wide range of disciplines spanning seven fields and 62 international Olympic competitions. |
| SciBench | SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models | College-level scientific problems sourced from instructional textbooks. |
| MMMU | MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | Evaluates multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. |
| CMMMU | CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark | A Chinese massive multi-discipline multimodal understanding benchmark evaluating LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. |
| MULTI | MULTI: Multimodal Understanding Leaderboard with Text and Images | Over 18,000 questions challenging MLLMs with tasks ranging from formula derivation to image detail analysis and cross-modality reasoning. |
| M3GIA | M3GIA: A Cognition Inspired Multilingual and Multimodal General Intelligence Ability Benchmark | A cognition-inspired benchmark for evaluating the multilingual and multimodal general intelligence abilities of MLLMs. |
| M3Exam | M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | Sourced from real, official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. |
| MathVerse | MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? | 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. |
| MATH-Vision | Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset | 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. |
| AI2D | A Diagram Is Worth A Dozen Images | Diagrams annotated with constituents and relationships: over 5,000 diagrams and 15,000 QA pairs. |
| IconQA | IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning | Targets answering questions in an icon-image context. |
| TQA | Are You Smarter Than A Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension | 1,076 lessons and 26,260 multimodal questions taken from middle school science curricula. |
| ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | A benchmark of ~21k multimodal multiple-choice questions covering diverse science topics. |
| ChartX | ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning | A multimodal evaluation set covering 18 chart types, 7 chart tasks, 22 disciplinary topics, and high-quality chart data. |
| PlotQA | PlotQA: Reasoning over Scientific Plots | 28.9 million question-answer pairs over 224,377 plots on data from real-world sources, with questions based on crowd-sourced templates. |
| Chart-to-text | Chart-to-Text: A Large-Scale Benchmark for Chart Summarization | Two datasets totaling 44,096 charts covering a wide range of topics and chart types. |
| ChartQA | ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning | 9.6K human-written questions plus 23.1K questions generated from human-written chart summaries. |
| OpenCQA | OpenCQA: Open-ended Question Answering with Charts | The goal is to answer an open-ended question about a chart with descriptive text. |
| ChartBench | ChartBench: A Benchmark for Complex Visual Reasoning in Charts | Assesses chart comprehension and data reliability through complex visual reasoning. |
| DocVQA | DocVQA: A Dataset for VQA on Document Images | 50,000 questions defined over 12,000+ document images. |
| InfoVQA | InfographicVQA | A diverse collection of infographics with question-answer annotations. |
| WTQ | Compositional Semantic Parsing on Semi-Structured Tables | 22,033 complex questions on Wikipedia tables. |
| TableFact | TabFact: A Large-scale Dataset for Table-based Fact Verification | 16k Wikipedia tables serving as evidence for 118k human-annotated natural language statements. |
| MM-Math | MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification | 5,929 open-ended middle school math problems with visual contexts and fine-grained classification. |
| MathCheck | Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist | A well-designed checklist for testing task generalization and reasoning robustness. |
| PuzzleVQA | PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns | 2,000 puzzle instances based on abstract patterns. |
| SMART-101 | Are Deep Neural Networks SMARTer than Second Graders? | Evaluates the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles. |
| AlgoPuzzleVQA | Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning | Evaluates capabilities in solving algorithmic puzzles. |
| ChartMimic | ChartMimic: Evaluating LMMs' Cross-Modal Reasoning Capability via Chart-to-Code Generation | Assesses visually grounded code-generation capabilities. |
| ChartSumm | ChartSumm: A Comprehensive Benchmark for Automatic Chart Summarization of Long and Short Summaries | - |
| MMCode | MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems | 3,548 questions and 6,620 images collected from real-world programming challenges harvested from 10 code-competition websites. |
| Design2Code | Design2Code: How Far Are We From Automating Front-End Engineering | A manually curated benchmark of 484 diverse real-world webpages. |
| Plot2Code | Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots | A comprehensive visual coding benchmark. |
| CharXiv | CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | A comprehensive evaluation suite of 2,323 natural, challenging, and diverse charts from arXiv papers. |
| We-Math | WE-MATH: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? | 6.5K visual math problems spanning 67 hierarchical knowledge concepts and 5 layers of knowledge granularity. |
| SceMQA | SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark | A benchmark for scientific multimodal question answering at the college-entrance level. |
| TheoremQA | TheoremQA: A Theorem-driven Question Answering dataset | Curated by domain experts; 800 high-quality questions covering 350 theorems from Math, Physics, EE&CS, and Finance. |
| NPHardEval4V | NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models | Built by converting textual question descriptions from NPHardEval into image representations. |
| MathScape | MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark | Evaluates photo-based math problem scenarios, assessing MLLMs' theoretical understanding and application ability through a categorical, hierarchical approach. |
| TableBench | TableBench: A Comprehensive and Complex Benchmark for Table Question Answering | 18 fields within four major categories of table question-answering capabilities. |
| GRAB | GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models | Synthetic; 2,170 questions covering four tasks and 23 graph properties. |
| LogicVista | LogicVista: A Benchmark for Evaluating Multimodal Logical Reasoning | Evaluates general logical cognition across 5 logical reasoning tasks encompassing 9 capabilities, using 448 multiple-choice questions. |
| CMM-Math | CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models | Over 28,000 high-quality samples featuring a variety of problem types with detailed solutions, across 12 grade levels from elementary to high school in China. |
| SWE-bench Multimodal | SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? | 617 task instances collected from 17 JavaScript libraries used for web interface design, diagramming, data visualization, syntax highlighting, and interactive mapping. |
| MMIE | MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | 20K meticulously curated multimodal queries spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. |
| MultiChartQA | MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Requires multi-hop reasoning to extract and integrate information from multiple charts; comprises 655 charts and 944 questions. |
| Sketch2Code | Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping | Evaluates automated conversion of rudimentary sketches into webpage prototypes; 731 sketches collected for 484 webpage screenshots. |
| PolyMath | PolyMath: A Challenging Multi-modal Mathematical Reasoning Benchmark | 5,000 manually collected high-quality images of cognitive textual and visual challenges across 10 categories, including pattern recognition, spatial reasoning, and relative reasoning. |
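
Most of the multiple-choice benchmarks above (e.g., ScienceQA, MMMU, MathVista) ultimately reduce to exact-match accuracy over predicted option letters. Below is a minimal sketch of that scoring loop; `model_answer` is a hypothetical stand-in for whatever MLLM is being evaluated, and the record fields are illustrative rather than any benchmark's actual schema.

```python
# Hedged sketch: `model_answer` and the record fields ("image", "question",
# "choices", "answer") are illustrative assumptions, not a benchmark's schema.
def normalize(ans: str) -> str:
    # Compare option letters case-insensitively, ignoring stray whitespace/periods.
    return ans.strip().strip(".").upper()

def accuracy(records, model_answer) -> float:
    """records: iterable of dicts; model_answer(image, question, choices) -> e.g. 'B'."""
    correct = total = 0
    for r in records:
        pred = model_answer(r["image"], r["question"], r["choices"])
        correct += normalize(pred) == normalize(r["answer"])
        total += 1
    return correct / max(total, 1)
```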

Contributors


If you have any questions about this opinionated list, do not hesitate to create an issue.