
MMInstruct

The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity".

The dataset is available on Hugging Face at 🤗 yuecao0119/MMInstruct.

Todo List

  • Data Engine.
  • Open Source Datasets.
  • Release the checkpoint.

Introduction

Vision-language supervised fine-tuning effectively enhances the performance of vision large language models (VLLMs), but existing visual instruction tuning datasets have the following limitations:

  1. Instruction Annotation Quality: Despite their strong performance, advanced VLLMs may generate instructions containing inaccuracies, such as hallucinations.
  2. Instruction and Image Diversity: Limited instruction types and a lack of diverse image data restrict the model's ability to generate varied and realistic outputs.

MMInstruct Dataset

To address these challenges, we created the MMInstruct dataset, featuring:

  • 973K instructions from 24 domains
  • Four instruction types: Judgement, Multiple-Choice, Long Visual Question Answering, and Short Visual Question Answering; an illustrative record is shown below.
(Figure: overview of the MMInstruct domains and instruction types.)
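
To make the four instruction types concrete, the sketch below shows one possible record layout for a single image. The field names, image path, and question/answer text are hypothetical illustrations, not the actual MMInstruct schema.

# A hypothetical record illustrating the four instruction types; the real
# MMInstruct field names and schema may differ.
sample = {
    "image": "food/pizza_001.jpg",  # illustrative path
    "domain": "food",
    "instructions": [
        {"type": "judgement",
         "question": "Is the pizza cut into slices? Answer yes or no.",
         "answer": "Yes."},
        {"type": "multiple-choice",
         "question": "Which topping is most visible? (A) pepperoni (B) olives (C) mushrooms",
         "answer": "A"},
        {"type": "long VQA",
         "question": "Describe the setting in which the pizza is served.",
         "answer": "The pizza rests on a wooden board in a dimly lit restaurant, ..."},
        {"type": "short VQA",
         "question": "How many slices are visible?",
         "answer": "Eight."},
    ],
}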

The open-source datasets on Hugging Face 🤗 yuecao0119/MMInstruct include the following subsets (a loading sketch follows the list):

  • caption_cn: 144K Chinese detailed image caption data generated using gpt-4-vision-preview.
  • caption_en: 18.2K English detailed image caption data generated using gpt-4-vision-preview.
  • qa_en: 216K instruction data points generated using gpt-3.5-turbo, including 161K multi-round long question-and-answer pairs and 55K manually corrected instruction data points covering 23 domains.
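
As a minimal sketch, these subsets can be loaded with the Hugging Face datasets library. The configuration and split names used here are assumptions based on the subset names above; check the dataset card for the exact layout.

# Minimal loading sketch, assuming the subsets are exposed as dataset
# configurations named like the subsets above; the exact names may differ.
from datasets import load_dataset

qa_en = load_dataset("yuecao0119/MMInstruct", "qa_en", split="train")
print(qa_en[0])  # inspect one instruction record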

We also expand MMInstruct with other open-source data (a simple mixing sketch follows the table), including:

Domain              Datasets
mathematics         GEOS; UniGeo; GeoQA+; Geometry3K; CLEVR-Math; Super-CLEVR; TabMWP
charts and plots    DVQA (100K); FigureQA
scientific figures  TQA
map charts          MapQA
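
Below is a rough sketch of how such sources could be merged into a single instruction-tuning file; the file names and record format are hypothetical and do not correspond to the repository's actual preprocessing scripts.

# Hypothetical mixing of MMInstruct with extra open-source instruction files;
# paths and JSON layout are illustrative only.
import json
import random

sources = ["mminstruct_qa_en.json", "clevr_math.json", "dvqa_100k.json"]
mixed = []
for path in sources:
    with open(path, encoding="utf-8") as f:
        mixed.extend(json.load(f))  # each file: a list of instruction records

random.shuffle(mixed)  # interleave domains before fine-tuning
with open("sft_mix.json", "w", encoding="utf-8") as f:
    json.dump(mixed, f, ensure_ascii=False)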

Data Engine

We developed an instruction generation data engine leveraging GPT-4V, GPT-3.5, and manual correction. This engine allows semi-automatic, low-cost, multi-domain instruction generation at 1/6 the cost of manual construction.
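
A minimal sketch of such a pipeline is shown below, assuming the OpenAI Python SDK. The model names and prompts are illustrative rather than the paper's exact settings, and the manual-correction step is only indicated by a comment.

# Sketch of a semi-automatic instruction generation pipeline: a vision model
# drafts a detailed caption, a cheaper text model turns it into instructions,
# and annotators correct the result offline. Prompts here are illustrative.
import base64
from openai import OpenAI

client = OpenAI()

def generate_instructions(image_path: str, domain: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    # Step 1: detailed caption from a vision-language model.
    caption = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    ).choices[0].message.content

    # Step 2: instruction generation from the caption with a text-only model.
    prompt = (f"Based on the following image description from the {domain} domain, "
              f"write judgement, multiple-choice, long VQA and short VQA "
              f"question-answer pairs.\n\n{caption}")
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

# Step 3 (not shown): annotators review and correct the generated instructions.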

(Figure: overview of the instruction generation data engine.)

Performance

(Figure: performance results.)

Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

@article{liu2024mminstruct,
  title={MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity},
  author={Liu, Yangzhou and Cao, Yue and Gao, Zhangwei and Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Tian, Hao and Lu, Lewei and Zhu, Xizhou and Lu, Tong and others},
  journal={arXiv preprint arXiv:2407.15838},
  year={2024}
}