The official implementation of the paper "Docopilot: Improving Multimodal Models for Document-Level Understanding".
- We construct **Doc-750K**, the first large-scale, high-quality dataset for document-level multimodal understanding, with 758K QA pairs covering 9 task types.
- We propose **Docopilot**, a native document-level VLM that outperforms existing methods and Gemini-1.5-Pro on MMLongBench-Doc, making it the closest open-source model to GPT-4o.
- **Docopilot** achieves much lower inference latency than RAG-based methods, and when combined with RAG, its performance further improves, showing that RAG effectively enhances its retrieval and reasoning.
- Release Evaluation Code
- Release Training Code
- Release Doc-750K
- Release Docopilot Checkpoints
Download Doc-750K (requires about 1.5 TB of disk space):
```shell
mkdir data
cd data
# pip install -U huggingface_hub
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/Doc-750K --local-dir Doc-750K --repo-type dataset
# unzip each images folder
cd Doc-750K/openreview
unzip images.zip
cd ../generated
unzip images.zip
cd ../arxivqa
unzip images.zip
cd ../scihub
unzip images.zip
```

Follow this link to prepare your own training data.
Notice: Put the metadata in a single JSON file, similar to `playground/Doc-750K.json`.
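As a rough illustration, the sketch below writes such a meta file. The field names (`root`, `annotation`, `data_augment`, `repeat_time`, `length`) follow the InternVL-style meta format and are assumptions; the dataset name and paths are placeholders, so check `playground/Doc-750K.json` for the exact schema.

```python
import json

# Hypothetical meta entry; field names assume the InternVL-style meta format
# and may differ from the actual playground/Doc-750K.json schema.
meta = {
    "my_doc_dataset": {
        "root": "data/my_doc_dataset/images",             # image folder
        "annotation": "data/my_doc_dataset/train.jsonl",  # QA annotation file
        "data_augment": False,
        "repeat_time": 1,
        "length": 10000,  # number of samples in the annotation file
    }
}

with open("playground/my_meta.json", "w") as f:
    json.dump(meta, f, indent=2)
```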
Our models are finetuned from InternVL2-2B and InternVL2-8B.
Please download the above model weights and place them in the pretrained/ folder.
| model name | type | download | size |
|---|---|---|---|
| InternVL2-2B | VLM | 🤗 HF link | 4.4 GB |
| InternVL2-8B | VLM | 🤗 HF link | 16 GB |
```shell
mkdir pretrained
cd pretrained/
# pip install -U huggingface_hub
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-2B --local-dir InternVL2-2B
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-8B --local-dir InternVL2-8B
```

To start training, run:

```shell
sh shell/slurm_train_example.sh
```

The released Docopilot checkpoints are listed below.

| model name | type | download | size |
|---|---|---|---|
| Docopilot-2B | VLM | 🤗 HF link | 4.4 GB |
| Docopilot-8B | VLM | 🤗 HF link | 16 GB |
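Since Docopilot is fine-tuned from InternVL2, loading a checkpoint presumably follows the standard InternVL2 `transformers` interface. The sketch below is an assumption based on that API: the `model.chat` call mirrors the InternVL2 model card, the repository name `OpenGVLab/Docopilot-8B` is a guess, and document images would need the image preprocessing utilities described there.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed HF repo name; check the released model card for the exact path.
path = "OpenGVLab/Docopilot-8B"

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Text-only query shown here; for multi-page document input, build
# `pixel_values` with the preprocessing helpers from the InternVL2 model card
# and pass them instead of None.
generation_config = dict(max_new_tokens=512, do_sample=False)
question = "Summarize the key contribution of this paper."
response = model.chat(tokenizer, None, question, generation_config)
print(response)
```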
If you find this work helpful in your research, please consider citing:
```bibtex
@inproceedings{duan2025docopilot,
  title={Docopilot: Improving Multimodal Models for Document-Level Understanding},
  author={Duan, Yuchen and Chen, Zhe and Hu, Yusong and Wang, Weiyun and Ye, Shenglong and Shi, Botian and Lu, Lewei and Hou, Qibin and Lu, Tong and Li, Hongsheng and others},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={4026--4037},
  year={2025}
}
```