The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity".
The dataset is available on Hugging Face at 🤗 yuecao0119/MMInstruct.
- Release the data engine.
- Release the open-source datasets.
- Release the model checkpoints.
Vision-language supervised fine-tuning effectively enhances the performance of vision large language models (VLLMs), but existing visual instruction tuning datasets have two limitations:
- Instruction Annotation Quality: Despite strong performance, advanced VLLMs may generate instructions with inaccuracies, such as hallucinations.
- Instruction and Image Diversity: Limited instruction types and a lack of diverse image data restrict the model's ability to generate varied and realistic outputs.
To address these challenges, we created the MMInstruct dataset, featuring:
- 973K instructions from 24 domains.
- Four instruction types: Judgement, Multiple-Choice, Long Visual Question Answering, and Short Visual Question Answering.
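As an illustration of how these four instruction types could be represented, the sketch below uses a hypothetical record layout (the actual field names in MMInstruct may differ):

```python
# Hypothetical record layout for the four MMInstruct instruction types.
# Field names ("image", "question", "answer", "type") are illustrative
# assumptions, not the dataset's actual schema.
INSTRUCTION_TYPES = {"judgement", "multiple-choice", "long-vqa", "short-vqa"}


def validate(record: dict) -> bool:
    """Check that a record carries an image reference, a question,
    an answer, and one of the four instruction types."""
    required = {"image", "question", "answer", "type"}
    return required <= record.keys() and record["type"] in INSTRUCTION_TYPES


sample = {
    "image": "images/0001.jpg",
    "question": "Is the traffic light in the picture green?",
    "answer": "Yes.",
    "type": "judgement",
}
print(validate(sample))  # True
```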
The open-source datasets on Hugging Face 🤗 yuecao0119/MMInstruct include:

- `caption_en`: 144K English detailed image caption data generated using `gpt-4-vision-preview`.
- `caption_cn`: 18.2K Chinese detailed image caption data generated using `gpt-4-vision-preview`.
- `qa_en`: 216K instruction data generated using GPT-3.5-turbo, including 161K multi-round long question-answer pairs and 55K manually corrected instruction entries covering 23 domains, as shown in the figure below.
We also expand MMInstruct with other open-source data, including:
| Domain | Dataset |
|---|---|
| mathematics | GEOS; UniGeo; GeoQA+; Geometry3k; CLEVR-Math; Super-CLEVR; TabMWP |
| charts and plots | DVQA (100K); FigureQA |
| scientific figures | TQA |
| map charts | MapQA |
We developed an instruction-generation data engine that combines GPT-4V, GPT-3.5, and manual correction. The engine enables semi-automatic, low-cost, multi-domain instruction generation at roughly one-sixth the cost of fully manual construction.
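The three-stage flow of the engine can be sketched as below. All function names and return values here are hypothetical stand-ins: the real engine queries GPT-4V for captions, GPT-3.5 for question-answer generation, and routes the results to human annotators.

```python
# Minimal sketch of the semi-automatic instruction-generation engine.
# Every function body is a stub standing in for an external step
# (GPT-4V call, GPT-3.5 call, human review); none of these names come
# from the MMInstruct codebase.

def caption_image(image_path: str) -> str:
    """Stand-in for a GPT-4V call returning a detailed image caption."""
    return f"A detailed description of {image_path}."


def generate_instructions(caption: str, n: int = 3) -> list[dict]:
    """Stand-in for a GPT-3.5 call expanding a caption into QA pairs."""
    return [
        {"question": f"Question {i} about: {caption}", "answer": "..."}
        for i in range(1, n + 1)
    ]


def manual_correction(pairs: list[dict]) -> list[dict]:
    """Stand-in for the human-review stage: keep only complete pairs."""
    return [p for p in pairs if p["question"] and p["answer"]]


def run_engine(image_path: str) -> list[dict]:
    """Caption -> generate -> correct, as in the pipeline described above."""
    caption = caption_image(image_path)
    return manual_correction(generate_instructions(caption))


print(len(run_engine("images/0001.jpg")))  # 3
```

The cost saving comes from the middle stage: model-generated drafts mean annotators only correct instructions rather than writing them from scratch.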
If this work is helpful for your research, please consider citing the following BibTeX entry.
@article{liu2024mminstruct,
title={MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity},
author={Liu, Yangzhou and Cao, Yue and Gao, Zhangwei and Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Tian, Hao and Lu, Lewei and Zhu, Xizhou and Lu, Tong and others},
journal={arXiv preprint arXiv:2407.15838},
year={2024}
}