M³CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought

[ArXiv] | [🤗HuggingFace] | [Website]

🌟 Any contributions via PRs, issues, emails or other methods are greatly appreciated.

🔥News

🎖️ Our work has been accepted to ACL 2024.
🔥 We have released the benchmark on [🤗HuggingFace].
🔥 The paper is also available on [ArXiv].
🔮 An interactive benchmark website with further exploration is available at [https://lightchen233.github.io/m3cot.github.io/].

💡 Motivation

Multi-modal Chain-of-Thought (MCoT) requires models to leverage knowledge from both textual and visual modalities for step-by-step reasoning, and it has been gaining increasing attention. Nevertheless, current MCoT benchmarks still face several challenges: (1) absence of visual modal reasoning, (2) single-step visual modal reasoning, and (3) missing domains, all of which hinder the development of MCoT. Motivated by this, we introduce a novel benchmark (M³CoT) that addresses these challenges and advances multi-domain, multi-step, and multi-modal CoT. We also conduct a thorough evaluation of a wide range of MCoT approaches on Vision Large Language Models (VLLMs). We find that current VLLMs still struggle to reason correctly on M³CoT, and that a large gap remains between existing VLLMs and human performance on M³CoT, despite their superior results on previous MCoT benchmarks. To our knowledge, this is the first meaningful step toward the multi-domain, multi-step, and multi-modal scenario in MCoT. We hope that M³CoT can serve as a valuable resource and a foundation for multi-domain, multi-step, multi-modal chain-of-thought research.

🎯 Installation

1. Dataset Preparation

Load Dataset from Huggingface

import datasets
dataset = datasets.load_dataset("LightChen2333/M3CoT")

Load Dataset from Google Drive

Please download the dataset from Here and place the unzipped contents in the data folder.

import datasets
dataset = datasets.load_dataset("data/m3cot.py")

We also encourage you to use our M3CoT class to manage and analyze the data. The class supports two initialization formats:

import datasets
from utils.data import M3CoT
dataset = datasets.load_dataset("data/m3cot.py")
prepared_dataset = M3CoT(dataset=dataset)

Or:

from utils.data import M3CoT
prepared_dataset = M3CoT(data_path="data")

2. Install from git

M3CoT requires Python>=3.10 and torch>=2.0.

git clone https://github.com/LightChen233/M3CoT.git && cd M3CoT/
pip install -r requirements.txt

3. Evaluation for reproduction

python evaluate.py --setting zero-shot \
                   --model gpt4v \
                   --prompt cot \
                   --metric_by topic

where --setting can be selected from [zero-shot, few-shot, tool-usage] and --metric_by can be selected from [topic, domain, all].

For zero-shot setting:

--model can be selected from [kosmos-2, cogvlm, gemini, gpt4v, instruct-blip-7b, instruct-blip-13b, llava-7b, llava-13b, openflamingo]
--prompt can be selected from [direct, cot, ccot, dsp]
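
To make the effect of --metric_by more concrete, the sketch below computes accuracy grouped by a field such as topic or domain. It is an illustration under our own assumptions, not the code in utils/metric.py: the function name accuracy_by and the prediction key are hypothetical, and the records are toy data.

```python
from collections import defaultdict


def accuracy_by(records, group_key):
    """Compute accuracy per group; group_key=None treats all records as one group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = "all" if group_key is None else r[group_key]
        total[group] += 1
        # "prediction" is a hypothetical key holding the model's chosen letter.
        correct[group] += int(r["prediction"] == r["answer"])
    return {g: correct[g] / total[g] for g in total}


# Toy records (hypothetical values, for illustration only).
records = [
    {"topic": "t1", "domain": "d1", "answer": "A", "prediction": "A"},
    {"topic": "t1", "domain": "d1", "answer": "B", "prediction": "C"},
    {"topic": "t2", "domain": "d1", "answer": "C", "prediction": "C"},
]

print(accuracy_by(records, "topic"))   # one accuracy value per topic
print(accuracy_by(records, "domain"))  # one accuracy value per domain
print(accuracy_by(records, None))      # a single overall accuracy
```

Grouping by a finer field (topic) exposes weaknesses that an overall score averages away, which is why the same results can be reported at several granularities.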

4. Evaluation for your results

python evaluate.py --setting custom \
                   --metric_path [JSONL_PATH]

Each line of the JSONL file must follow this format:

{
  "id": "[ID]",
  "choices": ["[CHOICE1]", "[CHOICE2]", ...],
  "answer": "A/B/C/...",
  "domain": "[DOMAIN]",
  "topic": "[TOPIC]",
  "messages": [
    "[QUESTION]",
    "[ANSWER]"
  ]
}
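
As a rough illustration, a result file in this shape could be checked before evaluation with a small script like the one below. This is a sketch and not part of the repository: the helper names validate_record and validate_jsonl are hypothetical, and the checks that answer is a single uppercase letter indexing into choices and that messages holds exactly two strings are our reading of the template above.

```python
import json
import string

REQUIRED_KEYS = ("id", "choices", "answer", "domain", "topic", "messages")


def validate_record(record) -> list[str]:
    """Return a list of problems found in one result record (empty if none)."""
    if not isinstance(record, dict):
        return ["record must be a JSON object"]
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in record]
    if problems:
        return problems

    choices = record["choices"]
    if not isinstance(choices, list) or not choices:
        problems.append("choices must be a non-empty list")
    else:
        # Assumption: the answer is a letter that indexes into choices.
        valid_letters = list(string.ascii_uppercase[: len(choices)])
        if record["answer"] not in valid_letters:
            problems.append(f"answer must be one of {valid_letters}")

    messages = record["messages"]
    # Assumption: messages is exactly [question, answer], both strings.
    if not (
        isinstance(messages, list)
        and len(messages) == 2
        and all(isinstance(m, str) for m in messages)
    ):
        problems.append("messages must be a list of two strings")
    return problems


def validate_jsonl(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to the problems found on each invalid line."""
    errors = {}
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors[line_no] = [f"invalid JSON: {exc.msg}"]
                continue
            problems = validate_record(record)
            if problems:
                errors[line_no] = problems
    return errors
```

Note that the template is spread over several lines for readability; in a JSONL file each record has to be serialized on a single line, for example with json.dumps(record).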

🖨️File Structure

root
├── data           # data folder where the dataset is loaded
├── experiment     # All experimental data
│   ├── zero-shot         # Experimental results under zero-shot setting. Subfolders are for each model, and each model folder contains the results of three prompts.
│   ├── few-shot          # Experimental results under few-shot setting.
│   └── tool-usage        # Experimental results under tool-usage setting.
├── utils          # Tool library folder
│   ├── common_tool.py    # Some common utility functions
│   ├── data.py           # Dataset loading class
│   ├── gemini_request.py # Gemini request tool
│   ├── image_tool.py     # Image processing functions
│   └── metric.py         # Metric calculation tool
├── scripts
│   ├── load_dataset.py   # Example script to load a dataset
│   └── parse_to_sqa_format.py   # Convert dataset to ScienceQA format
└── evaluate.py     # Evaluation script

✒️ Reference

If you find this project useful for your research, please consider citing the following paper:

@inproceedings{chen-etal-2024-m3cot,
    title = "M$^3$CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought",
    author = "Chen, Qiguang  and
      Qin, Libo  and
      Zhang, Jin  and
      Chen, Zhi  and
      Xu, Xiao  and
      Che, Wanxiang",
    booktitle = "Proc. of ACL",
    year = "2024",
}

📲 Contact

Please create GitHub issues here or email Qiguang Chen if you have any questions or suggestions.
