
Official repository for the paper "ChartInsights: Evaluating Multimodal Large Language Models for Low-Level Chart Question Answering" (EMNLP'24)


ChartInsights: Evaluating Multimodal Large Language Models for Low-Level Chart Question Answering

News

🎊 Our paper has been accepted by EMNLP 2024!

About ChartInsights

ChartInsights Overall & Evaluation Design

Figure 1: The Pipeline for Evaluating MLLMs with ChartInsights

In this paper, we aim to systematically investigate the capabilities of GPT-4o in addressing 10 low-level data analysis tasks. Our study seeks to answer the following critical questions, shedding light on the potential of MLLMs in performing detailed, granular analyses.

Q1: Impact of Textual Prompt Variations. What is the impact of different textual prompts on GPT-4o's output accuracy? This question assesses the baseline performance and capabilities of GPT-4o across different low-level tasks.

Q2: Impact of Visual Variations and Visual Prompts. How do different visual variations and visual prompts, such as alterations in color schemes, layout configurations (e.g., aspect ratio), and image quality, affect the performance of GPT-4o on low-level tasks?

Q3: Impact of Chain-of-Thought. Can we enhance the basic textual prompts in Q1 with a chain-of-thought-like approach?

Q4: Synergistic Effect of Visual and Textual Prompts. Can the combination of visual and textual prompts lead to enhanced performance of GPT-4o on low-level ChartQA tasks? This question explores the potential for achieving better results by integrating both prompt types.

Dataset Overview

ChartInsights contains 22,347 (chart, task, query, answer) samples across 7 chart types, covering 10 low-level data analysis tasks on charts. These tasks are well defined by the visualization and visual analysis community and fall into three groups: analysis, search, and query. The analysis group is the largest, accounting for 42.46% of the samples, and examines the reasoning power of multimodal large models on charts. Figure 2 shows the task distribution, and Figure 3 shows how the 10 low-level tasks map to the 7 chart types.

Figure 2: Task Distribution

Figure 3: Low-level Tasks vs. Chart Types

Dataset Construction

Pipeline for dataset construction

Figure 4: The Pipeline for ChartInsights Construction

Step1: Candidate Charts Selection. To comprehensively evaluate the ability of MLLMs on low-level data analysis tasks and to support detailed, extended experiments, the collected datasets (tabular data) and visualization charts must meet three requirements: first, the datasets should contain the original metadata of the charts; second, the charts should contain data labels; third, the datasets should include both simple and complex charts so that chart difficulty spans a reasonable range.

Step2: Low-Level Task Generation. Next, we design a set of 10 low-level tasks for the collected charts and group them into three categories: Analysis, Search, and Query. In total, we obtain 22,347 (chart, task, question, answer) samples.
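
For intuition, here is a minimal sketch of how a template-based question for one task (the extreme task) could be derived from a chart's underlying table. The function, templates, and field names are illustrative only, not the repository's actual generation code.

# Hypothetical sketch of template-based question generation for one task.
def generate_extreme_question(table: dict) -> dict:
    """Build an 'extreme' (maximum) question/answer pair from a chart's underlying table.

    `table` is assumed to map category labels to numeric values, e.g. {"A": 3, "B": 7}.
    """
    best_label = max(table, key=table.get)
    question = "Which category has the largest value in the chart?"
    return {"task": "extreme", "question": question, "answer": best_label}

print(generate_extreme_question({"2019": 12.4, "2020": 9.8, "2021": 15.1}))
# {'task': 'extreme', 'question': 'Which category has the largest value in the chart?', 'answer': '2021'}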

Step3: Question Type Variation. To better explore how different question types influence the interaction with MLLMs, we design 4 question types: Fill-in-the-Blank, Multiple-Choice, Yes-or-No, and Error Correction.
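
As an illustration, the sketch below rephrases one underlying fact in the four question formats. The format keys mirror those seen in the evaluation script (e.g., fill_the_blank), but the exact templates used in ChartInsights may differ.

# Illustrative only: one fact phrased in the four question types.
def vary_question_type(category: str, value: float) -> dict:
    wrong_value = round(value * 1.5, 2)  # distractor for multiple-choice / error correction
    return {
        "fill_the_blank": f"The value of {category} is ____.",
        "multiple_choice": f"What is the value of {category}? (A) {value} (B) {wrong_value}",
        "yes_or_no": f"Is the value of {category} equal to {value}? Answer yes or no.",
        "error_correction": f"The value of {category} is {wrong_value}. Point out and correct the error.",
    }

for q_type, question in vary_question_type("2021", 15.1).items():
    print(q_type, "->", question)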

Step4: Chart Variations. In this step, we carry out two sub-experiments: Varying the Chart Elements and Varying the Chart Quality. In the Varying the Chart Elements experiment, we modify the charts using 15 different methods; in the Varying the Chart Quality experiment, we add 6 different types of noise to the charts.
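
The sketch below shows how an image-quality degradation could be applied to a chart with Pillow. Gaussian blur and aggressive JPEG compression are two plausible examples chosen for illustration; the six noise types actually used in ChartInsights are described in the paper, and the released degraded charts are produced by the repository's own scripts.

# Minimal sketch of degrading chart quality with Pillow (illustrative, not the repo's script).
from PIL import Image, ImageFilter

def degrade_chart(in_path: str, out_path: str, blur_radius: float = 2.0, jpeg_quality: int = 30) -> None:
    img = Image.open(in_path).convert("RGB")
    blurred = img.filter(ImageFilter.GaussianBlur(radius=blur_radius))  # blur noise
    blurred.save(out_path, format="JPEG", quality=jpeg_quality)         # compression noise

# degrade_chart("charts/1.jpg", "charts_degraded/1.jpg")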

Step5: Visual Prompt Design. We consider three types of visual prompts. The first is hand-drawn annotation: we directly circle, in a handwriting style, the content in the chart that is highly relevant to the question, such as the values of the two elements mentioned in a reasoning question. The second is regular shapes: we use regular shapes (such as circles or rectangles) to label elements in the chart, which makes it easy to use the size of a shape to imply the ordering of elements; for example, three circles of different sizes correspond to the three values in an ordering task. The third is special design: we design effective visual prompts tailored to specific low-level tasks; for example, we use arrows to represent the monotonicity of the trend in the correlation task.
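
For the "regular shapes" style, a minimal sketch with Pillow is shown below: it overlays a rectangle and a circle on a chart image. The coordinates and colors are illustrative; the released visual-prompt charts are generated by the repository's own scripts.

# Minimal sketch of shape-based visual prompts (illustrative coordinates).
from PIL import Image, ImageDraw

def add_shape_prompts(in_path: str, out_path: str) -> None:
    img = Image.open(in_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.rectangle([100, 50, 180, 200], outline="red", width=3)   # highlight one bar
    draw.ellipse([220, 40, 260, 80], outline="blue", width=3)     # circle a data label
    img.save(out_path)

# add_shape_prompts("charts/1.jpg", "charts_prompted/1.jpg")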

Dataset Examples

Figure 5: Examples for ChartInsights

Environment Setup

Install the required Python libraries from requirements.txt:

git clone https://github.com/ChartInsight/chartinsight.git
cd chartinsight
pip install -r requirements.txt

Evaluation scripts on ChartInsights

A Quick Start About Overall Evaluation on MLLMs With ChartInsights

Here we use GPT-4o as an example. Before running the code below, obtain overall_test_qa_pairs, overall_test_annotations, and the charts from Dataset/Overall Evalution. Once you have all the files and charts, arrange them in the following layout:

├── test_qa_pairs.json
├── test_annotations.json
├── overall_evaluation.py
├── charts
    ├── 1.jpg
    ├── 2.jpg
    └── ...
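
The structure below is inferred from how overall_evaluation.py (shown next) reads the two JSON files; treat it as an approximate sketch and verify against the released data.

# Approximate record structure inferred from overall_evaluation.py.
example_annotation = {
    "id": 1,            # matched against qa_pair["image_index"]
    "image": "/1.jpg",  # appended to the "charts" directory to locate the chart
    "type": "bar",      # chart type
}

example_qa_pair = {
    "image_index": 1,
    "type": "extreme",  # low-level task category
    "pair_index": 0,
    "QA_pairs": [       # one dict per question format: {format_name: [question, answer]}
        {"fill_the_blank": ["The largest value in the chart is ____.", "15.1"]},
        # ... multiple-choice, yes-or-no, and error-correction entries follow
    ],
}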

overall_evaluation.py:

import os
import json
import base64
import requests
import time

api_key = 'YOUR API KEY'

## read relevant qa_pairs and annotations
with open("test_qa_pairs.json", 'r') as file:
    test_qa_pairs = json.load(file)
with open("test_annotations.json", 'r') as file:
    test_annotations = json.load(file)

def encode_image(image_path):
    # encode a chart image as a base64 string for the API payload
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {api_key}"
}

def get_gpt_4o_reply(image_url, final_question):
    # send one question plus the referenced chart image to GPT-4o and return the reply message
    base64_image = encode_image(image_url)
    payload = {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": final_question},
                    {
                        "type": "image_url",
                        # the charts are stored as JPEG files (1.jpg, 2.jpg, ...)
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                    },
                ],
            }
        ],
        "max_tokens": 1000
    }

    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
    gpt_4o_reply = response.json()['choices'][0]['message']
    return gpt_4o_reply

## main function
start_length = 0
total_questions = 0
annotation_id = 0

for _, each_annotation in enumerate(test_annotations[annotation_id:]):
    total_charts = []
    begin = time.time()
    image_index = each_annotation['image']
    # each_annotation['image'] is expected to include the leading separator (e.g. '/1.jpg')
    image_url = 'charts' + image_index
    image_type = each_annotation['type']
    index = each_annotation['id']

    for i, each_qa_pair in enumerate(test_qa_pairs[start_length:]):
        each_chart = {}
        each_chart['image_url'] = image_url
        each_chart['image_type'] = image_type
        task_category = each_qa_pair['type']
        each_chart['task_category'] = task_category
        if index == each_qa_pair['image_index']:
            start_length += 1
            total_questions += 4  # four question formats per QA pair
            print(f"{index}, {task_category}")
            for each_question_format in each_qa_pair['QA_pairs']:
                each_key = list(each_question_format.keys())[0]
                final_question = each_question_format[each_key][0]
                annotation = each_question_format[each_key][1]
                gpt_4o_reply = get_gpt_4o_reply(image_url, final_question)
                each_chart[each_key + ' question'] = final_question
                each_chart[each_key + ' annotation'] = annotation
                # the stored key name is kept as in the original script ('GPT-4v')
                each_chart[each_key + ' GPT-4v'] = gpt_4o_reply
            each_chart['start_length'] = start_length
            each_chart['annotation_id'] = annotation_id
            each_chart['pair_index'] = each_qa_pair['pair_index']
            total_charts.append(each_chart)
        else:
            break
    end = time.time()
    each_annotation_time = round(end - begin, 2)
    annotation_id += 1
    # save this annotation's results; the file naming is assumed to match the results/ layout below
    os.makedirs('results', exist_ok=True)
    with open(f'results/{annotation_id}.json', 'w') as f:
        json.dump(total_charts, f, indent=2)
    print(f"annotation {annotation_id} costs {each_annotation_time}s")

Textual Prompt

You can find the textual_evaluation.py code in Scripts. Before running the code, obtain textual_test_qa_pairs, the charts, and the tables from Dataset/Textual Prompts. Once you have all the files, charts, and tables, arrange them in the following layout:

├── textual_test_qa_pairs.json
├── textual_evaluation.py
├── charts
    ├── 1.jpg
    ├── 2.jpg
    └── ...
├── tables
    ├── 1.csv
    ├── 2.csv
    └── ...
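
One textual-prompt variant augments the question with the chart's underlying table. The sketch below is illustrative only (not the repository's textual_evaluation.py) and assumes the table files follow the tables/N.csv layout above.

# Minimal sketch of a table-augmented textual prompt.
from pathlib import Path

def build_table_augmented_prompt(question: str, table_path: str) -> str:
    table_text = Path(table_path).read_text()
    return f"{question}\n\nThe underlying data table of the chart is:\n{table_text}"

# prompt = build_table_augmented_prompt("Which year has the largest value?", "tables/1.csv")
# reply = get_gpt_4o_reply("charts/1.jpg", prompt)  # reuses the helper from the quick-start script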

Visual Prompt

You can find the visual_prompt_evaluation.py code in Scripts. Before running the code, obtain visual_prompt_test_qa_pairs and the charts from Dataset/Visual Prompt. Once you have all the files and charts, arrange them in the following layout:

├── visual_prompt_test_qa_pairs.json
├── visual_prompt_evaluation.py
├── charts
    ├── 1.jpg
    ├── 2.jpg
    └── ...

Vary the Chart Elements

You can find the varying_chart_element_evaluation.py code in Scripts. Before running the code, obtain varying_chart_element_test_qa_pairs.json, varying_chart_element_test_annotations.json, and the charts from Dataset/Vary Chart Element. Once you have all the files and charts, arrange them in the following layout:

├── varying_chart_element_test_qa_pairs.json
├── varying_chart_element_test_annotations.json
├── varying_chart_element_evaluation.py
├── charts
    ├── 1.jpg
    ├── 2.jpg
    └── ...

Vary the Chart Quality

You can find the varying_chart_quality_evaluation.py code in Scripts. Before running the code, obtain varying_chart_quality_test_qa_pairs.json, varying_chart_quality_test_annotations.json, and the charts from Dataset/Vary Chart Element. Once you have all the files and charts, arrange them in the following layout:

├── varying_chart_quality_test_qa_pairs.json
├── varying_chart_quality_test_annotations.json
├── varying_chart_quality_evaluation.py
├── charts
    ├── 1.jpg
    ├── 2.jpg
    └── ...

Analysis of Evaluation Results with GPT-4o

Overall Results Analysis

After you finish the overall experiments, you will obtain the result files listed below. To analyze them, take the relevant Python files from Evalution and arrange everything in the following layout:

├── results
    ├── 1.json
    ├── 2.json
    └── ...
├── Check_Answers.py
├── Combine.py
├── Overall_Evaluation.py

Step1: Run Combine.py to combine all the individual JSON result files into a single combined JSON file.

python Combine.py

After running the command above, you will get total_results.json in the results directory, like this:

├── results
    ├── 1.json
    ├── 2.json
    ├── ......
    ├── total_results.json 
├── Check_Answers.py
├── Combine.py
├── Overall_Evaluation.py
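
For reference, here is a minimal sketch of what such a combine step can look like, assuming each results/N.json holds a list of per-chart records (as produced by the quick-start script). It is not necessarily identical to the repository's Combine.py.

# Illustrative combine step: merge results/*.json into results/total_results.json.
import json
from pathlib import Path

combined = []
for path in sorted(Path("results").glob("*.json")):
    if path.name == "total_results.json":
        continue  # skip the output file on re-runs
    with open(path) as f:
        combined.extend(json.load(f))

with open("results/total_results.json", "w") as f:
    json.dump(combined, f, indent=2)

print(f"combined {len(combined)} records into results/total_results.json")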

Step2: Run Overall_Evaluation.py to calculate the accuracy of GPT-4o.

python Overall_Evaluation.py

Step3: After completing the two steps above, you will get 6 CSV files with the results. See Results/Accuracy_Table for an example.
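
The per-task accuracy computation is, in essence, a grouped mean over the combined records. The sketch below is a rough approximation, assuming total_results.json carries the task_category field written by the quick-start script and a hypothetical is_correct flag produced by answer checking; the real logic lives in the repository's Check_Answers.py and Overall_Evaluation.py.

# Rough sketch of per-task accuracy over total_results.json (is_correct is a hypothetical flag).
import json
from collections import defaultdict

with open("results/total_results.json") as f:
    records = json.load(f)

correct = defaultdict(int)
total = defaultdict(int)
for rec in records:
    task = rec["task_category"]
    total[task] += 1
    correct[task] += int(rec.get("is_correct", False))

for task in sorted(total):
    print(f"{task}: {100 * correct[task] / total[task]:.2f}% ({total[task]} questions)")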

Textual Prompt Results Analysis

After you finish the Textual Prompt experiments, you will obtain the result files listed below. To analyze them, take the relevant Python files from Evalution and arrange everything in the following layout:

├── results
    ├── 1.json
    ├── 2.json
    └── ...
├── Check_Answers.py
├── Combine.py
├── Textual_Prompt_Evaluation.py

Follow the three steps in the Overall Results Analysis section and you will get 6 CSV files.

Visual Prompt Results Analysis

After you finish the Visual Prompt experiments, you will obtain the result files listed below. To analyze them, take the relevant Python files from Evalution and arrange everything in the following layout:

├── results
    ├── 1.json
    ├── 2.json
    └── ...
├── Check_Answers.py
├── Combine.py
├── Visual_Prompt_Evaluation.py

Follow the three steps in the Overall Results Analysis section and you will get 6 CSV files.

Chart Element Results Analysis

After you finish the Chart Element experiments, you will obtain the result files listed below. To analyze them, take the relevant Python files from Evalution and arrange everything in the following layout:

├── results
    ├── 1.json
    ├── 2.json
    └── ...
├── Check_Answers.py
├── Combine.py
├── Chart_Element_Evaluation.py

Follow the three steps in the Overall Results Analysis section and you will get 6 CSV files.

Chart Quality Results Analysis

After you finish the Chart Quality experiments, you will obtain the result files listed below. To analyze them, take the relevant Python files from Evalution and arrange everything in the following layout:

├── results
    ├── 1.json
    ├── 2.json
    └── ...
├── Check_Answers.py
├── Combine.py
├── Image_Quality_Evaluation.py

Follow the three steps in the Overall Results Analysis section and you will get 6 CSV files.

Leaderboard

Results of advanced MLLMs on ChartInsights

| Overall | Model | Reasoning | Anomaly | Distribution | Correlation | Range | Order | Filter | Retrieval | Extreme | Cluster |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 🥇 69.17 | GPT-4o | 55.9 | 34.0 | 70.1 | 68.5 | 80.6 | 68.9 | 49.9 | 82.6 | 93.9 | 74.3 |
| 🥈 56.13 | GPT-4V | 35.17 | 19.53 | 53.43 | 59.62 | 70.04 | 41.92 | 44.32 | 67.59 | 88.66 | 72.87 |
| 🥉 51.655 | qwen-vl-max | 28.84 | 25.78 | 62.25 | 62.98 | 66.12 | 40.19 | 38.86 | 66.99 | 79.64 | 66.76 |
| 49.545 | Claude3(Haiku) | 33.04 | 8.98 | 42.65 | 46.15 | 60.44 | 26.15 | 39.96 | 62.3 | 75.08 | 66.76 |
| 48.4375 | ChatGLM-4V | 34.07 | 28.91 | 39.22 | 42.31 | 55.5 | 18.85 | 43.41 | 58.08 | 69.31 | 71.43 |
| 48.36 | Gemini_pro | 25.6 | 30.08 | 45.59 | 58.65 | 75.26 | 32.88 | 30.06 | 60.37 | 80.88 | 55.26 |
| 42.595 | qwen-vl-plus | 30.77 | 27.34 | 47.06 | 47.12 | 42.95 | 34.62 | 20.72 | 58.69 | 65.52 | 62.5 |
| 40.2325 | Sphinx-v2 | 30.02 | 28.91 | 37.75 | 36.06 | 25.78 | 23.46 | 36.74 | 49.72 | 66.31 | 45.31 |
| 38.51 | LLava-NEXT | 30.6 | 7.42 | 26.47 | 37.98 | 29.5 | 33.27 | 23.39 | 53.49 | 59.81 | 52.27 |
| 33.295 | mPLUG-Owl2 | 30.99 | 26.95 | 29.41 | 35.34 | 28.39 | 22.5 | 40.25 | 30.89 | 41.1 | 27.27 |
| 33.425 | qwen-VL-chat | 27.78 | 36.33 | 45.1 | 55.77 | 33.75 | 20 | 28.69 | 31.25 | 50.21 | 27.13 |
| 33.7525 | llava_vip | 28.77 | 6.64 | 34.8 | 30.29 | 21.87 | 35.77 | 40.42 | 42.19 | 38.29 | 33.81 |
| 32.3875 | ChartAssistant | 24.58 | 27.73 | 35.78 | 28.12 | 30.48 | 22.5 | 14.66 | 39.38 | 63.04 | 26.42 |
| 32.1875 | llava1.5 | 32.39 | 6.25 | 30.88 | 23.08 | 21.74 | 32.69 | 35.55 | 32.57 | 35.76 | 43.47 |
| 31.055 | OmniLMM-12B | 24.67 | 19.92 | 26.96 | 34.86 | 35.7 | 28.27 | 29.96 | 33.01 | 39.94 | 33.1 |
| 33.015 | MiniCPM-v2 | 19.54 | 55.08 | 33.33 | 56.49 | 24.87 | 16.73 | 36.31 | 37.94 | 52.4 | 31.96 |
| 29.4325 | cogvlm | 20.32 | 23.05 | 43.63 | 29.57 | 37.73 | 10.77 | 9.07 | 37.86 | 56.62 | 26.7 |
| 28.305 | Blip2 | 24.8 | 23.44 | 25 | 15.14 | 25.33 | 20.19 | 39.83 | 27.8 | 30.26 | 30.11 |
| 26.1875 | VisCPM | 28.44 | 46.09 | 33.33 | 51.92 | 22.98 | 6.35 | 25.13 | 15.75 | 32 | 29.55 |