🔥 A collection of multi-modal LLMs for Math/STEM/Code.
- MAVIS: Mathematical Visual Instruction Tuning.
  Preprint
  Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, Peng Gao, Hongsheng Li. [Paper], 2024.7
- COMET: “Cone of experience” enhanced large multimodal model for mathematical problem generation.
  Preprint
  Sannyuya Liu, Jintian Feng, Zongkai Yang, Yawei Luo, Qian Wan, Xiaoxuan Shen, Jianwen Sun. [Paper], 2024.7
- Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B: A Technical Report.
  Preprint
  Di Zhang, Xiaoshui Huang, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang. [Paper], 2024.6
- Visual SKETCHPAD: Sketching as a Visual Chain of Thought for Multimodal Language Models.
  Preprint
  Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, Ranjay Krishna. [Paper], [Code], 2024.6
- TextSquare: Scaling up Text-Centric Visual Instruction Tuning.
  Preprint
  Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang. [Paper], 2024.4
- Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs.
  ACL 2024
  Victor Carbune, Hassan Mansoor, Fangyu Liu, Rahul Aralikatte, Gilles Baechler, Jindong Chen, Abhanshu Sharma. [Paper], 2024.3
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding.
  Preprint
  Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou. [Paper], 2024.3
- ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning.
  Preprint
  Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Min Dou, Botian Shi, Junchi Yan, Yu Qiao. [Paper], 2024.2
- InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions.
  Preprint
  Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki. [Paper], 2024.1
- G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model.
  Preprint
  Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong. [Paper], [Code], 2023.12
- mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model.
  Preprint
  Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, Fei Huang. [Paper], 2023.11
- Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning.
  Preprint
  Xingchen Zeng, Haichuan Lin, Yilin Ye, Wei Zeng. [Paper], [Code], 2024.7
- Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning.
  Preprint
  Wenwen Zhuang, Xin Huang, Xiantao Zhang, Jin Zeng. [Paper], 2024.8
- Diagram Formalization Enhanced Multi-Modal Geometry Problem Solver.
  Preprint
  Zeren Zhang, Jo-Ku Cheng, Jingyang Deng, Lu Tian, Jinwen Ma, Ziran Qin, Xiaokai Zhang, Na Zhu, Tuo Leng. [Paper], 2024.9
- Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends.
  Preprint
  Mirna Al-Shetairy, Hanan Hindy, Dina Khattab, Mostafa M. Aref. [Paper], 2024.10
- Improve Vision Language Model Chain-of-Thought Reasoning.
  Preprint
  Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, Yiming Yang. [Paper], 2024.10
- R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models.
  Preprint
  Linger Deng, Yuliang Liu, Bohan Li, Dongliang Luo, Liang Wu, Chengquan Zhang, Pengyuan Lyu, Ziyang Zhang, Gang Zhang, Errui Ding, Yingying Zhu, Xiang Bai. [Paper], 2024.10
- GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models.
  Preprint
  Aditya Sharma, Aman Dalmia, Mehran Kazemi, Amal Zouaq, Christopher J. Pal. [Paper], 2024.10

Datasets

Name | Paper | Notes |
---|---|---|
ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | A benchmark consisting of ~21k multimodal multiple-choice questions covering diverse science topics. |
CMM12K | COMET: “Cone of experience” enhanced large multimodal model for mathematical problem generation | A Chinese multimodal SFT dataset for math; not publicly released. |
SPIQA | SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers | Designed to interpret complex figures and tables within the context of scientific research articles across various domains of computer science |
InstructDoc | InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions | Collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format. |
M-Paper | mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | Built by parsing LaTeX source files of high-quality papers. |
DocStruct4M/DocReason25K | mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding | Based on publicly available datasets. A high-quality instruction tuning dataset. |
DocGenome | DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models | A structured document benchmark constructed by annotating 500K scientific documents from 153 disciplines on arXiv. |
ArXivCap/ArXivQA | Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models | A figure-caption dataset comprising 6.4M images and 3.9M captions, sourced from 572K ArXiv papers. A QA dataset generated by prompting GPT-4V based on scientific figures. |
FigureQA | FigureQA: An Annotated Figure Dataset for Visual Reasoning | A visual reasoning corpus of over one million QA pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts. |
DVQA | DVQA: Understanding Data Visualizations via Question Answering | A dataset that tests many aspects of bar chart understanding in a question answering framework. |
SciGraphQA | SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | A synthetic multi-turn QA dataset related to academic graphs. |
SciCap | SciCap: Generating Captions for Scientific Figures | A large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020, containing over 416k figures focused on graph plots. |
FigCap | Figure Captioning with Reasoning and Sequence-Level Training | Generated based on FigureQA |
FigureSeer | FigureSeer: Parsing Result-Figures in Research Papers | - |
UniChart | UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning | A large-scale chart corpus for pretraining, covering a diverse range of visual styles and topics. |
MapQA | MapQA: A Dataset for Question Answering on Choropleth Maps | A large-scale dataset of ~800K question-answer pairs over ~60K map images. |
TabMWP | Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning | A dataset containing 38,431 open-domain grade-level problems that require mathematical reasoning on both textual and tabular data |
CLEVR-Math | CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning | A multi-modal math word problems dataset consisting of simple math word problems involving addition/subtraction |
GUICourse | GUICourse: From General Vision Language Model to Versatile GUI Agent | A suite of datasets to train visual-based GUI agents from general VLMs |
PIN-14M | PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents | 14 million samples derived from Chinese and English sources, tailored to include complex web and scientific content. |
MathV360K | Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models | Selects 40K high-quality images with QA pairs from 24 existing datasets and synthesizes 320K new pairs. |
MMSci | MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension | A multimodal dataset collected from open-access scientific articles published in Nature Communications journals. |
MAVIS-Caption/Instruct | MAVIS: Mathematical Visual Instruction Tuning | - |
Geo170K | G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model | Utilizes geometric characteristics to construct a multi-modal geometry dataset, building upon existing datasets. |
SciOL/MuLMS-Img | SciOL and MuLMS-Img: Introducing A Large-Scale Multimodal Scientific Dataset and Models for Image-Text Tasks in the Scientific Domain | Pretraining corpus for multimodal models in the scientific domain. |
PlotQA | PlotQA: Reasoning over Scientific Plots | 28.9 million QA pairs over 224,377 plots, with data from real-world sources and questions based on crowd-sourced question templates. |
ChartInstructionData | Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning | A dataset of 467K samples, including 108K table-chart pairs and 359K chart-QA pairs. |
MMTab | Multimodal Table Understanding | Dataset for multimodal table understanding problem, based on 14 publicly available table datasets of 8 domains. |
Multimodal Self-Instruct | Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model | Instruction dataset for eight visual scenarios: charts, tables, simulated maps, dashboards, flowcharts, relation graphs, floor plans, and visual puzzles. |
GeoGPT4V | GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation | Leverages GPT-4 and GPT-4V to generate relatively basic geometry problems with aligned text and images. |
InfiMM-WebMath-40B | InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning | Interleaved image-text documents, comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, extracted and filtered from CommonCrawl. |
MultiMath-300K | MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | Spans K-12 levels with image captions and step-wise solutions. |
MathVL | MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model | A fine-tuning dataset combining several public datasets with a curated Chinese dataset collected from K12 education levels. |

Benchmarks

Name | Paper | Notes |
---|---|---|
GeoEval | GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving | A benchmark for evaluating MLLMs' capability to solve geometry problems. |
Geometry3K | Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning | Consisting of 3,002 geometry problems with dense annotation in formal language. |
GEOS | Solving Geometry Problems: Combining Text and Diagram Interpretation | - |
GeoQA | GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning | 4,998 geometric problems with corresponding annotated programs. |
GeoQA+ | An Augmented Benchmark Dataset for Geometric Question Answering through Dual Parallel Text Encoding | Based on GeoQA, with 2,518 newly annotated geometric problems of richer types and greater difficulty. |
UniGeo | UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression | Contains 4,998 calculation problems and 9,543 proving problems |
PGPS9K | A Multi-Modal Neural Geometric Solver with Textual Clauses Parsed from Diagram | Labeled with both fine-grained diagram annotation and interpretable solution program. |
GeomVerse | GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning | A synthetic benchmark of geometry questions with controllable difficulty levels along multiple axes |
MathVista | MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | A benchmark designed to combine challenges from diverse mathematical and visual tasks. |
OlympiadBench | OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems | An Olympiad-level bilingual multimodal scientific benchmark, from mathematics and physics competitions |
OlympicArena | OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI | Encompasses a wide range of disciplines spanning seven fields and 62 international Olympic competitions. |
SciBench | SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models | A benchmark for college-level scientific problems sourced from instructional textbooks. |
MMMU | MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | Designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. |
CMMMU | CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark | A new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. |
MULTI | MULTI: Multimodal Understanding Leaderboard with Text and Images | Includes over 18,000 questions, and challenges MLLMs with a variety of tasks, ranging from formula derivation to image detail analysis and cross-modality reasoning. |
M3GIA | M3GIA: A Cognition Inspired Multilingual and Multimodal General Intelligence Ability Benchmark | A cognition-inspired benchmark for evaluating the multilingual and multimodal general intelligence abilities of MLLMs. |
M3Exam | M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | Sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. |
MathVerse | MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? | 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. |
MATH-Vision | Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset | 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. |
AI2D | A Diagram Is Worth A Dozen Images | A dataset of diagrams with annotations of constituents and relationships for over 5,000 diagrams and 15,000 QAs. |
IconQA | IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning | A benchmark with the goal of answering a question in an icon image context. |
TQA | Are You Smarter Than A Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension | Includes 1,076 lessons and 26,260 multi-modal questions, taken from middle school science curricula. |
ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | A benchmark consisting of ~21k multimodal multiple-choice questions covering diverse science topics. |
ChartX | ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning | A multi-modal evaluation set covering 18 chart types, 7 chart tasks, 22 disciplinary topics, and high-quality chart data |
PlotQA | PlotQA: Reasoning over Scientific Plots | 28.9 million question-answer pairs over 224,377 plots, with data from real-world sources and questions based on crowd-sourced question templates. |
Chart-to-text | Chart-to-Text: A Large-Scale Benchmark for Chart Summarization | A large-scale benchmark with two datasets and a total of 44,096 charts covering a wide range of topics and chart types. |
ChartQA | ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning | A large-scale benchmark covering 9.6K human-written questions as well as 23.1K questions generated from human-written chart summaries. |
OpenCQA | OpenCQA: Open-ended Question Answering with Charts | The goal is to answer an open-ended question about a chart with descriptive texts. |
ChartBench | ChartBench: A Benchmark for Complex Visual Reasoning in Charts | A comprehensive benchmark designed to assess chart comprehension and data reliability through complex visual reasoning. |
DocVQA | DocVQA: A Dataset for VQA on Document Images | Consists of 50,000 questions defined on 12,000+ document images |
InfoVQA | InfographicVQA | Comprises a diverse collection of infographics along with question-answer annotations. |
WTQ | Compositional Semantic Parsing on Semi-Structured Tables | A dataset of 22,033 complex questions on Wikipedia tables. |
TableFact | TabFact : A Large-scale Dataset for Table-based Fact Verification | A large-scale dataset with 16k Wikipedia tables as the evidence for 118k human-annotated natural language statements. |
MM-Math | MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification | Consists of 5,929 open-ended middle school math problems with visual contexts, with fine-grained classification. |
MathCheck | Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist | A well-designed checklist for testing task generalization and reasoning robustness. |
PuzzleVQA | PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns | A collection of 2,000 puzzle instances based on abstract patterns. |
SMART-101 | Are Deep Neural Networks SMARTer than Second Graders? | Evaluates the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles. |
AlgoPuzzleVQA | Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning | Evaluates capabilities in solving algorithmic puzzles. |
ChartMimic | ChartMimic: Evaluating LMM’s Cross-Modal Reasoning Capability via Chart-to-Code Generation | Aimed at assessing the visually-grounded code generation capabilities. |
ChartSumm | ChartSumm: A Comprehensive Benchmark for Automatic Chart Summarization of Long and Short Summaries | - |
MMCode | MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems | Contains 3,548 questions and 6,620 images collected from real-world programming challenges harvested from 10 code competition websites. |
Design2Code | Design2Code: How Far Are We From Automating Front-End Engineering | A manually curated benchmark of 484 diverse real-world webpages. |
Plot2Code | Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots | A comprehensive visual coding benchmark. |
CharXiv | CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | A comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. |
We-Math | We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? | 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and 5 layers of knowledge granularity. |
SceMQA | SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark | A benchmark for scientific multimodal question answering at the college entrance level. |
TheoremQA | TheoremQA: A Theorem-driven Question Answering dataset | Curated by domain experts containing 800 high-quality questions covering 350 theorems from Math, Physics, EE&CS, and Finance. |
NPHardEval4V | NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models | Built by converting textual description of questions from NPHardEval to image representations. |
MathScape | MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark | Designed to evaluate photo-based math problem scenarios, assessing the theoretical understanding and application ability of MLLMs through a categorical hierarchical approach. |
TableBench | TableBench: A Comprehensive and Complex Benchmark for Table Question Answering | Including 18 fields within four major categories of table question answering capabilities. |
GRAB | GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models | A synthetic benchmark comprising 2,170 questions covering four tasks and 23 graph properties. |
LogicVista | LogicVista: A Benchmark for Evaluating Multimodal Logical Reasoning | Evaluates general logical cognition abilities across 5 logical reasoning tasks encompassing 9 different capabilities, using 448 multiple-choice questions. |
CMM-Math | CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models | Contains over 28,000 high-quality samples, featuring a variety of problem types with detailed solutions across 12 grade levels from elementary to high school in China. |
SWE-bench Multimodal | SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? | Contains 617 task instances collected from 17 JavaScript libraries used for web interface design, diagramming, data visualization, syntax highlighting, and interactive mapping. |
MMIE | MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | Comprises 20K meticulously curated multimodal queries spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. |
MultiChartQA | MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Requires multi-hop reasoning to extract and integrate information from multiple charts; comprises 655 charts and 944 questions. |
Sketch2Code | Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping | Evaluates the automated conversion of rudimentary sketches into webpage prototypes; 731 sketches collected for 484 webpage screenshots. |
PolyMath | PolyMath: A Challenging Multi-modal Mathematical Reasoning Benchmark | Comprises 5,000 manually collected high-quality images of cognitive textual and visual challenges across 10 distinct categories, including pattern recognition, spatial reasoning, and relative reasoning. |

If you have any questions about this opinionated list, do not hesitate to open an issue.