
🤗 HF Repo • 📃 Paper

📖 MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification

MM-MATH is a benchmark designed to evaluate the mathematical reasoning capabilities of multimodal models, featuring problems that combine diagrams with step-by-step reasoning. All problems are calculation-based and use an open-ended format. Each problem is annotated along three dimensions: difficulty, grade level, and knowledge points, enabling a fine-grained, comprehensive evaluation.

We recognize the potentially high cost of model evaluation, particularly for diagrammatic math reasoning. We therefore adopt two fully automated evaluation methods: outcome evaluation and process evaluation. For outcome evaluation, the final answer of each MM-MATH problem is enclosed in \boxed{}, and the model's final answer is extracted in the same format for comparison. For process evaluation, we provide carefully designed prompts that ask GPT-4V to compare the model's solution process with the ground truth, identify the first error, and classify it.
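
Since outcome evaluation hinges on pulling the final answer out of \boxed{}, a minimal extraction helper might look like the sketch below (illustrative only, not the official evaluation script; the function name is our own):

```python
def extract_boxed_answer(response: str) -> str | None:
    """Return the content of the last \\boxed{...} in a model response.

    Scans forward from the macro and counts brace depth, so nested
    braces (e.g. \\frac{3}{2}) are handled. Returns None if no
    \\boxed{} is found.
    """
    start = response.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    content = []
    while i < len(response) and depth > 0:
        ch = response[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        content.append(ch)
        i += 1
    return "".join(content)

# The extracted string can then be normalized (strip whitespace, LaTeX
# spacing commands, etc.) and compared with the ground-truth answer.
print(extract_boxed_answer(r"The area is \boxed{\frac{3}{2}}"))  # \frac{3}{2}
```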

🖥️ Leaderboard

Here are the average scores (%) for outcome evaluation under the zero-shot setting, broken down by the three annotation dimensions: difficulty, grade level, and knowledge points. Human results are derived from exam scores.

Multimodal Models w/o Image

| Model | Easy | Medium | Hard | Grade 7 | Grade 8 | Grade 9 | Trans. | Shape | Func. | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Human | 90.7 | 81.9 | 47.6 | 85.6 | 73.7 | 77.9 | 81.1 | 83.2 | 77.5 | 80.4 |
| Gemini-Pro-V | 10.1 | 5.7 | 1.8 | 10.0 | 5.3 | 6.7 | 6.6 | 5.7 | 6.4 | 6.2 |
| Claude-3-Opus | 31.7 | 17.3 | 7.2 | 32.5 | 14.9 | 2.2 | 20.8 | 18.5 | 12.9 | 19.2 |
| GPT-4 | 37.0 | 20.3 | 7.2 | 38.7 | 17.1 | 26.2 | 23.3 | 21.4 | 18.1 | 22.5 |
| GPT-4V | 35.2 | 18.1 | 7.2 | 31.2 | 17.2 | 22.3 | 18.4 | 21.4 | 13.3 | 20.4 |
| GPT-4o | 41.4 | 23.9 | 3.6 | 35.0 | 23.9 | 30.5 | 22.8 | 29.7 | 19.4 | 27.6 |

Multimodal Models w/ Image

| Model | Easy | Medium | Hard | Grade 7 | Grade 8 | Grade 9 | Trans. | Shape | Func. | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Human | 90.7 | 81.9 | 47.6 | 85.6 | 73.7 | 77.9 | 81.1 | 83.2 | 77.5 | 80.4 |
| DeepSeek-VL-7B-Chat | 17.4 | 4.7 | 1.4 | 7.5 | 6.6 | 3.9 | 3.4 | 6.0 | 3.5 | 5.4 |
| Yi-34B-Chat | 12.9 | 5.0 | 1.5 | 21.3 | 5.6 | 3.5 | 5.0 | 7.6 | 3.8 | 6.5 |
| LLaVA-V1.6-34B | 8.8 | 5.4 | 1.8 | 12.6 | 6.5 | 4.2 | 4.0 | 6.5 | 3.8 | 5.8 |
| InternVL-4B-Chat-1.5 | 18.5 | 10.7 | 1.8 | 12.5 | 11.1 | 11.9 | 11.4 | 12.3 | 5.5 | 11.6 |
| Qwen-VL-Max | 14.5 | 11.2 | 3.6 | 16.2 | 1.1 | 11.3 | 11.0 | 12.5 | 10.5 | 11.4 |
| Gemini-Pro-V | 19.3 | 8.2 | 0.0 | 1.5 | 7.4 | 11.5 | 10.4 | 10.6 | 7.1 | 9.7 |
| Claude-3-Opus | 29.5 | 19.3 | 3.6 | 32.5 | 16.4 | 23.0 | 20.6 | 21.7 | 16.9 | 20.3 |
| GPT-4V | 37.8 | 21.2 | 1.8 | 28.7 | 17.9 | 28.0 | 22.2 | 24.7 | 19.5 | 23.1 |
| GPT-4o | 45.8 | 30.0 | 10.9 | 40.0 | 26.0 | 36.0 | 30.7 | 33.7 | 26.2 | 31.8 |

Data Format

All data in MM-MATH are standardized to the following format:

{
    "question": "The question statement, written in LaTeX.",
    "file_name": "The name of the question image in the image folder.",
    "solution": "The step-by-step solution to the question, written in LaTeX.",
    "year": "The grade level, annotated from the year of the examination.",
    "difficult": "The difficulty level, annotated based on examination scores.",
    "knowledge": "The knowledge points covered by the question, annotated by middle school teachers."
}
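
As a quick sanity check, the records can be loaded and paired with their images roughly as follows (a minimal sketch; the file name `mm_math.json` and folder name `images/` are assumptions, so check the HF release for the actual layout):

```python
import json
from pathlib import Path

DATA_FILE = Path("mm_math.json")  # hypothetical file name; see the HF repo for the actual release
IMAGE_DIR = Path("images")        # hypothetical folder holding the question images

# Load the benchmark and pair each question with its image and annotations.
with DATA_FILE.open(encoding="utf-8") as f:
    problems = json.load(f)

for item in problems[:3]:
    image_path = IMAGE_DIR / item["file_name"]
    print(item["year"], item["difficult"], item["knowledge"])
    print(item["question"][:80], "->", image_path)
```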