🔥 A collection of multi-modal LLMs for Math/STEM/Code.
- MAVIS: Mathematical Visual Instruction Tuning.
  Preprint
  Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, Peng Gao, Hongsheng Li. [Paper], 2024.7
- COMET: “Cone of experience” enhanced large multimodal model for mathematical problem generation.
  Preprint
  Sannyuya Liu, Jintian Feng, Zongkai Yang, Yawei Luo, Qian Wan, Xiaoxuan Shen, Jianwen Sun. [Paper], 2024.7
- Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B: A Technical Report.
  Preprint
  Di Zhang, Xiaoshui Huang, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang. [Paper], 2024.6
- Visual SKETCHPAD: Sketching as a Visual Chain of Thought for Multimodal Language Models.
  Preprint
  Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, Ranjay Krishna. [Paper], [Code], 2024.6
- TextSquare: Scaling up Text-Centric Visual Instruction Tuning.
  Preprint
  Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang. [Paper], 2024.4
- Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs.
  ACL 2024
  Victor Carbune, Hassan Mansoor, Fangyu Liu, Rahul Aralikatte, Gilles Baechler, Jindong Chen, Abhanshu Sharma. [Paper], 2024.3
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding.
  Preprint
  Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou. [Paper], 2024.3
- ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning.
  Preprint
  Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Min Dou, Botian Shi, Junchi Yan, Yu Qiao. [Paper], 2024.2
- InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions.
  Preprint
  Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki. [Paper], 2024.1
- G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model.
  Preprint
  Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong. [Paper], [Code], 2023.12
- mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model.
  Preprint
  Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan, Chenliang Li, Qi Qian, Ji Zhang, Fei Huang. [Paper], 2023.11
- Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning.
  Preprint
  Xingchen Zeng, Haichuan Lin, Yilin Ye, Wei Zeng. [Paper], [Code], 2024.7
- Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning.
  Preprint
  Wenwen Zhuang, Xin Huang, Xiantao Zhang, Jin Zeng. [Paper], 2024.8
- Diagram Formalization Enhanced Multi-Modal Geometry Problem Solver.
  Preprint
  Zeren Zhang, Jo-Ku Cheng, Jingyang Deng, Lu Tian, Jinwen Ma, Ziran Qin, Xiaokai Zhang, Na Zhu, Tuo Leng. [Paper], 2024.9
- Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends.
  Preprint
  Mirna Al-Shetairy, Hanan Hindy, Dina Khattab, Mostafa M. Aref. [Paper], 2024.10
- Improve Vision Language Model Chain-of-Thought Reasoning.
  Preprint
  Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, Yiming Yang. [Paper], 2024.10
- R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models.
  Preprint
  Linger Deng, Yuliang Liu, Bohan Li, Dongliang Luo, Liang Wu, Chengquan Zhang, Pengyuan Lyu, Ziyang Zhang, Gang Zhang, Errui Ding, Yingying Zhu, Xiang Bai. [Paper], 2024.10
- GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models.
  Preprint
  Aditya Sharma, Aman Dalmia, Mehran Kazemi, Amal Zouaq, Christopher J. Pal. [Paper], 2024.10

Datasets

Name | Paper | Notes |
---|---|---|
ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | A benchmark consisting of ~21k multimodal multiple-choice questions covering diverse science topics. |
CMM12K | COMET: “Cone of experience” enhanced large multimodal model for mathematical problem generation | A Chinese multimodal SFT dataset for math; not publicly released. |
SPIQA | SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers | Designed to interpret complex figures and tables within the context of scientific research articles across various domains of computer science |
InstructDoc | InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions | Collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format. |
M-Paper | mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | Built by parsing LaTeX source files of high-quality papers. |
DocStruct4M/DocReason25K | mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding | Based on publicly available datasets. A high-quality instruction tuning dataset. |
DocGenome | DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models | A structured document benchmark constructed by annotating 500K scientific documents from 153 disciplines on arXiv. |
ArXivCap/ArXivQA | Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models | A figure-caption dataset comprising 6.4M images and 3.9M captions, sourced from 572K ArXiv papers. A QA dataset generated by prompting GPT-4V based on scientific figures. |
FigureQA | FigureQA: An Annotated Figure Dataset for Visual Reasoning | A visual reasoning corpus of over one million QA pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts. |
DVQA | DVQA: Understanding Data Visualizations via Question Answering | A dataset that tests many aspects of bar chart understanding in a question answering framework. |
SciGraphQA | SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | A synthetic multi-turn QA dataset related to academic graphs. |
SciCap | SciCap: Generating Captions for Scientific Figures | A large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020, containing over 416k figures focused on graph plots. |
FigCap | Figure Captioning with Reasoning and Sequence-Level Training | Generated based on FigureQA |
FigureSeer | FigureSeer: Parsing Result-Figures in Research Papers | - |
UniChart | UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning | A large-scale chart corpus for pretraining, covering a diverse range of visual styles and topics. |
MapQA | MapQA: A Dataset for Question Answering on Choropleth Maps | A large-scale dataset of ~800K question-answer pairs over ~60K map images. |
TabMWP | Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning | A dataset containing 38,431 open-domain grade-level problems that require mathematical reasoning on both textual and tabular data |
CLEVR-Math | CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning | A multi-modal math word problems dataset consisting of simple math word problems involving addition/subtraction |
GUICourse | GUICourse: From General Vision Language Model to Versatile GUI Agent | A suite of datasets to train visual-based GUI agents from general VLMs |
PIN-14M | PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents | 14 million samples derived from Chinese and English sources, tailored to include complex web and scientific content. |
MathV360K | Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models | Selects 40K high-quality images with QA pairs from 24 existing datasets and synthesizes 320K new pairs. |
MMSci | MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension | A multimodal dataset collected from open-access scientific articles published in Nature Communications journals. |
MAVIS-Caption/Instruct | MAVIS: Mathematical Visual Instruction Tuning | - |
Geo170K | G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model | Utilizes geometric characteristics to construct a multi-modal geometry dataset, building upon existing datasets. |
SciOL/MuLMS-Img | SciOL and MuLMS-Img: Introducing A Large-Scale Multimodal Scientific Dataset and Models for Image-Text Tasks in the Scientific Domain | Pretraining corpus for multimodal models in the scientific domain. |
PlotQA | PlotQA: Reasoning over Scientific Plots | 28.9 million QA pairs over 224,377 plots, with data from real-world sources and questions based on crowd-sourced question templates. |
ChartInstructionData | Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning | A dataset of 467K samples, including 108K table-chart pairs and 359K chart-QA pairs. |
MMTab | Multimodal Table Understanding | Dataset for multimodal table understanding problem, based on 14 publicly available table datasets of 8 domains. |
Multimodal Self-Instruct | Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model | Instruction dataset for eight visual scenarios: charts, tables, simulated maps, dashboards, flowcharts, relation graphs, floor plans, and visual puzzles. |
GeoGPT4V | GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation | Leverages GPT-4 and GPT-4V to generate relatively basic geometry problems with aligned text and images. |
InfiMM-WebMath-40B | InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning | Interleaved image-text documents, comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, extracted and filtered from CommonCrawl. |
MultiMath-300K | MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | Spans K-12 levels with image captions and step-wise solutions. |
MathVL | MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model | A fine-tuning dataset combining several public datasets with a curated Chinese dataset collected from K12 education levels. |

Benchmarks

Name | Paper | Notes |
---|---|---|
GeoEval | GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving | A benchmark for evaluating MLLMs' capability to solve geometry problems. |
Geometry3K | Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning | Consisting of 3,002 geometry problems with dense annotation in formal language. |
GEOS | Solving Geometry Problems: Combining Text and Diagram Interpretation | - |
GeoQA | GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning | 4,998 geometric problems with corresponding annotated programs. |
GeoQA+ | An Augmented Benchmark Dataset for Geometric Question Answering through Dual Parallel Text Encoding | Based on GeoQA, with 2,518 newly annotated geometric problems of richer types and greater difficulty. |
UniGeo | UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression | Contains 4,998 calculation problems and 9,543 proving problems |
PGPS9K | A Multi-Modal Neural Geometric Solver with Textual Clauses Parsed from Diagram | Labeled with both fine-grained diagram annotation and interpretable solution program. |
GeomVerse | GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning | A synthetic benchmark of geometry questions with controllable difficulty levels along multiple axes |
MathVista | MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts | A benchmark designed to combine challenges from diverse mathematical and visual tasks. |
OlympiadBench | OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems | An Olympiad-level bilingual multimodal scientific benchmark, from mathematics and physics competitions |
OlympicArena | OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI | Encompasses a wide range of disciplines spanning seven fields and 62 international Olympic competitions. |
SciBench | SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models | A benchmark for college-level scientific problems sourced from instructional textbooks. |
MMMU | MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | Designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. |
CMMMU | CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark | A new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. |
MULTI | MULTI: Multimodal Understanding Leaderboard with Text and Images | Includes over 18,000 questions, and challenges MLLMs with a variety of tasks, ranging from formula derivation to image detail analysis and cross-modality reasoning. |
M3GIA | M3GIA: A Cognition Inspired Multilingual and Multimodal General Intelligence Ability Benchmark | A cognition-inspired benchmark for evaluating the multilingual and multimodal general intelligence abilities of MLLMs. |
M3Exam | M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | Sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. |
MathVerse | MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? | 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. |
MATH-Vision | Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset | 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. |
AI2D | A Diagram Is Worth A Dozen Images | A dataset of diagrams with annotations of constituents and relationships for over 5,000 diagrams and 15,000 QAs. |
IconQA | IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning | A benchmark with the goal of answering a question in an icon image context. |
TQA | Are You Smarter Than A Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension | Includes 1,076 lessons and 26,260 multi-modal questions, taken from middle school science curricula. |
ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | A benchmark consisting of ~21k multimodal multiple-choice questions covering diverse science topics. |
ChartX | ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning | A multi-modal evaluation set covering 18 chart types, 7 chart tasks, 22 disciplinary topics, and high-quality chart data |
PlotQA | PlotQA: Reasoning over Scientific Plots | 28.9 million question-answer pairs over 224,377 plots, with data from real-world sources and questions based on crowd-sourced question templates. |
Chart-to-text | Chart-to-Text: A Large-Scale Benchmark for Chart Summarization | A large-scale benchmark with two datasets and a total of 44,096 charts covering a wide range of topics and chart types. |
ChartQA | ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning | A large-scale benchmark covering 9.6K human-written questions as well as 23.1K questions generated from human-written chart summaries. |
OpenCQA | OpenCQA: Open-ended Question Answering with Charts | The goal is to answer an open-ended question about a chart with descriptive texts. |
ChartBench | ChartBench: A Benchmark for Complex Visual Reasoning in Charts | A comprehensive benchmark designed to assess chart comprehension and data reliability through complex visual reasoning. |
DocVQA | DocVQA: A Dataset for VQA on Document Images | Consists of 50,000 questions defined on 12,000+ document images |
InfoVQA | InfographicVQA | Comprises a diverse collection of infographics along with question-answer annotations. |
WTQ | Compositional Semantic Parsing on Semi-Structured Tables | A dataset of 22,033 complex questions on Wikipedia tables. |
TableFact | TabFact : A Large-scale Dataset for Table-based Fact Verification | A large-scale dataset with 16k Wikipedia tables as the evidence for 118k human-annotated natural language statements. |
MM-Math | MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification | Consists of 5,929 open-ended middle school math problems with visual contexts, with fine-grained classification. |
MathCheck | Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist | A well-designed checklist for testing task generalization and reasoning robustness. |
PuzzleVQA | PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns | A collection of 2,000 puzzle instances based on abstract patterns. |
SMART-101 | Are Deep Neural Networks SMARTer than Second Graders? | Evaluates the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles. |
AlgoPuzzleVQA | Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning | Evaluates capabilities in solving algorithmic puzzles. |
ChartMimic | ChartMimic: Evaluating LMM’s Cross-Modal Reasoning Capability via Chart-to-Code Generation | Aimed at assessing the visually-grounded code generation capabilities. |
ChartSumm | ChartSumm: A Comprehensive Benchmark for Automatic Chart Summarization of Long and Short Summaries | - |
MMCode | MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems | Contains 3,548 questions and 6,620 images collected from real-world programming challenges harvested from 10 code competition websites. |
Design2Code | Design2Code: How Far Are We From Automating Front-End Engineering | A manually curated benchmark of 484 diverse real-world webpages. |
Plot2Code | Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots | A comprehensive visual coding benchmark. |
CharXiv | CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | A comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. |
We-Math | We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? | 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and 5 layers of knowledge granularity. |
SceMQA | SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark | A benchmark for scientific multimodal question answering at the college entrance level. |
TheoremQA | TheoremQA: A Theorem-driven Question Answering dataset | Curated by domain experts containing 800 high-quality questions covering 350 theorems from Math, Physics, EE&CS, and Finance. |
NPHardEval4V | NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models | Built by converting textual description of questions from NPHardEval to image representations. |
MathScape | MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark | Designed to evaluate photo-based math problem scenarios, assessing the theoretical understanding and application ability of MLLMs through a categorical hierarchical approach. |
TableBench | TableBench: A Comprehensive and Complex Benchmark for Table Question Answering | Including 18 fields within four major categories of table question answering capabilities. |
GRAB | GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models | A synthetic benchmark comprising 2,170 questions covering four tasks and 23 graph properties. |
LogicVista | LogicVista: A Benchmark for Evaluating Multimodal Logical Reasoning | Evaluates general logical cognition abilities across 5 logical reasoning tasks encompassing 9 different capabilities, using 448 multiple-choice questions. |
CMM-Math | CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models | Contains over 28,000 high-quality samples, featuring a variety of problem types with detailed solutions across 12 grade levels from elementary to high school in China. |
SWE-bench Multimodal | SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? | Contains 617 task instances collected from 17 JavaScript libraries used for web interface design, diagramming, data visualization, syntax highlighting, and interactive mapping. |
MMIE | MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | Comprises 20K meticulously curated multimodal queries spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. |
MultiChartQA | MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Requires multi-hop reasoning to extract and integrate information from multiple charts; comprises 655 charts and 944 questions. |
Sketch2Code | Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping | Evaluates the automated conversion of rudimentary sketches into webpage prototypes; 731 sketches collected for 484 webpage screenshots. |
PolyMath | PolyMath: A Challenging Multi-modal Mathematical Reasoning Benchmark | Comprises 5,000 manually collected high-quality images of cognitive textual and visual challenges across 10 distinct categories, including pattern recognition, spatial reasoning, and relative reasoning. |

If you have any questions about this opinionated list, do not hesitate to open an issue.