PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

[📜 Paper] [🚀 Code] [🤗 HF Dataset] [📖 Project Page]

Please give us a star ⭐ to follow the latest updates.

[ArXiv] PDF-Wukong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
Xudong Xie*, Liang Yin*, Hao Yan*, Yang Liu*, Jing Ding, Minghui Liao, Yuliang Liu, Wei Chen, Xiang Bai

💡 Monkey series projects ✨

[CVPR'24] Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, Xiang Bai
arXiv | Source_code | Demo | Detailed Caption | Model Weight | Model Weight in Wisemodel | Demo in Wisemodel

TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai
arXiv | Source_code | Data | Model Weight

Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models
Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, Xiang Bai
arXiv | Source_code | Demo | Model Weight | Model Weight in Wisemodel

News

  • 2024.10.10 🚀 We release the paper PDF-Wukong.

Methodology

The overall structure of PDF-WuKong

The construction process of PaperPDF

Dataset

The statistics of PaperPDF.

Text-only and Image-only indicate that the QA pairs are generated from a single text paragraph or a single image extracted from the PDF, respectively. Image-text, Section, and Cross-paragraph indicate that the QA pairs are generated from a paragraph and its corresponding references, an entire section, or several non-contiguous paragraphs, respectively.

PaperPDF is publicly available on Hugging Face Datasets: PaperPDF.
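
For quick experimentation, a minimal sketch of loading the dataset with the Hugging Face datasets library is shown below; the repository id is a placeholder (an assumption), so substitute the actual id from the PaperPDF dataset page linked above.

python
# Minimal sketch: load PaperPDF with the Hugging Face `datasets` library.
from datasets import load_dataset

# Placeholder id (assumption): replace with the actual repository id from the
# PaperPDF dataset page linked above.
dataset_id = "PaperPDF"
ds = load_dataset(dataset_id)
print(ds)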

Repository Structure

The structure of this repository is as follows; a short sketch of reading the parsed XML files appears after the tree.

PaperPDF
│
├── Original PDFs                # Original PDF documents
│
├── filter.py                    # Code for filtering data based on rules
│
├── Parsed Data
│   ├── PaperPDF.py              # Code for extracting text and image information from XML documents
│   ├── pdf_xml                  # XML files generated by Grobid from the PDF documents   
│   └── pdf_figure              
│       ├── figure               # Extracted images from the PDF documents
│       └── data                 # Metadata of the images
│
├── Train  
│   ├── train_100w.jsonl         # The full training set (1,000,000 examples)
│   ├── train_50w.jsonl          # 500,000-example subset for ablation studies
│   └── train_10w.jsonl          # 100,000-example subset for ablation studies
│ 
└── Test
    └── test.jsonl               # The test set
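
Since pdf_xml contains Grobid TEI output, the sketch below shows one way to pull paragraph text out of a single TEI file. It is illustrative only, assuming the standard TEI namespace that Grobid emits; PaperPDF.py is the repository's actual extractor, and the file name in the usage comment is hypothetical.

python
# Minimal sketch: yield plain-text paragraphs from the <body> of a Grobid TEI
# file. Illustrative only; PaperPDF.py is the repository's actual extractor.
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # standard TEI namespace used by Grobid

def iter_paragraphs(xml_path):
    root = ET.parse(xml_path).getroot()
    body = root.find(f".//{TEI}body")
    if body is None:
        return
    for p in body.iter(f"{TEI}p"):
        text = "".join(p.itertext()).strip()
        if text:
            yield text

# Usage (hypothetical file name):
# for para in iter_paragraphs("Parsed Data/pdf_xml/1507.04291v1.tei.xml"):
#     print(para)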
      

Data Instances

For each instance in the dataset, the following fields are provided:

json
{
  "PDF name": "1507.04291v1",
  "Category": "single-text_img",
  "Query": "According to Table 1, which sections discuss TCB-included Chebyshev kernels for both position and velocity?",
  "Answer": ["Sections 5.3.3 and 5.3.4 discuss TCB-included Chebyshev kernels for both position and velocity.", "Sections 5.3.3."],
  "Evidence": {
    "Texts": [{"idx": 11, "Text": "The six SPK data types, listed in Table 1, for ephemerides of natural solar system bodies..."}],
    "Figures": [{"idx": 220, "Caption": "Table 1: Double precision kernel data types of interest.", "Figure": "1507.04291v1-Table1-1.png"}]
  }
}
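
Since each split is stored as JSON Lines, the sketch below shows a minimal way to iterate over instances; field names follow the example above, and the path matches the repository layout.

python
# Minimal sketch: stream instances from a JSONL split such as Test/test.jsonl.
import json

def load_jsonl(path):
    """Yield one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

for ex in load_jsonl("Test/test.jsonl"):
    print(ex["PDF name"], "|", ex["Category"])
    print("Q:", ex["Query"])
    print("A:", ex["Answer"][0])
    break  # show just the first instance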

Data Fields

  • PDF name: a string containing the name of the PDF document.
  • Category: a string representing the category of the query, which can be one of the following: single-text_only, single-image_only, multi-text_image, multi-section, multi-cross_paragraph.
  • Query: a string containing the question posed about the PDF document.
  • Answer: an array containing the two generated answers; the training and test sets use different prompts to generate the answers (see the Dataset Creation section of the dataset card for more details).
  • Evidence: an object containing the supporting texts and figures (if any) from the PDF document. A typed sketch of this schema follows the list.
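
For typed access, the sketch below mirrors the schema as a Python dataclass. It is inferred from the field descriptions above and is not part of the official dataset tooling; note that the example instance uses the category string "single-text_img", so category names in the data may differ slightly from the list above.

python
# Minimal sketch: the instance schema as a dataclass, inferred from the field
# descriptions above. Not part of the official dataset tooling.
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class PaperPDFInstance:
    pdf_name: str             # "PDF name"
    category: str             # "Category", e.g. "single-text_img"
    query: str                # "Query"
    answers: List[str]        # "Answer": the two generated answers
    evidence: Dict[str, Any]  # "Evidence": {"Texts": [...], "Figures": [...]}

    @classmethod
    def from_record(cls, rec: Dict[str, Any]) -> "PaperPDFInstance":
        return cls(
            pdf_name=rec["PDF name"],
            category=rec["Category"],
            query=rec["Query"],
            answers=rec["Answer"],
            evidence=rec["Evidence"],
        )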

Evaluation

Performance comparison with other commercial products on PaperPDF.

Performance comparison with other DocVLMs on single-page document understanding.

Performance comparison with other DocVLMs on multi-page document understanding.

Please refer to our paper for more details.

The training/evaluation code, checkpoints, and demo will be released soon.

Citing PDF-Wukong

If you wish to refer to the baseline results published here, please use the following BibTeX entry:

@article{xie2024pdfwukong,
  title={PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling},
  author={Xie, Xudong and Yin, Liang and Yan, Hao and Liu, Yang and Ding, Jing and Liao, Minghui and Liu, Yuliang and Chen, Wei and Bai, Xiang},
  year={2024},
  journal={arXiv preprint arXiv:2410.05970},
  url={https://arxiv.org/abs/2410.05970},
}

Copyright

The PDF-Wukong project is intended for non-commercial use only. For commercial inquiries, please contact Hao Yan at haoyan@hust.edu.cn.