A repository for collecting papers on document artificial intelligence and related multimodal large language models.
Continuously updated 🤗
Document AI
2024
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (Shanghai AI Lab,CUHK,THU,NJU,FDU,SenseTime) | 24.4.25 | arXiv | Code
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (Shanghai AI Lab,CUHK,THU,SenseTime) | 24.4.9 | arXiv | Code
- LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding (Alibaba,ZJU) | 24.4.8 | arXiv | Code
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding (Alibaba,RUC) | 24.3.19 | arXiv | Code
- TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document (HUST) | 24.3.7 | arXiv | Code
- HRVDA: High-Resolution Visual Document Assistant (Tencent YouTu Lab,USTC) | 24.2.29 | CVPR24
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models (Tencent YouTu Lab) | 24.2.29 | CVPR24
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (Shanghai AI Lab,CUHK,SenseTime) | 24.1.29 | arXiv | Code
- Small Language Model Meets with Reinforced Vision Vocabulary (MEGVII,UCAS,HUST) | 24.1.23 | arXiv | Code
2023
- DocLLM: A layout-aware generative language model for multimodal document understanding (JPMorgan AI Research) | 23.12.31 | arXiv
- Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models (MEGVII,UCAS,HUST) | 23.12.11 | arXiv | Code
- mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model (Alibaba) | 23.11.30 | arXiv | Code
- Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs (USTC) | 23.11.22 | arXiv | Code
- DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding (USTC,ByteDance) | 23.11.20 | arXiv
- Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models (HUST) | 23.11.11 | CVPR24 | Code
- mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration (Alibaba) | 23.11.07 | CVPR24 | Code
- Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation (SCUT) | 23.10.25 | arXiv | Code
- UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model (DAMO,RUC,ECNU) | 23.10.08 | arXiv | Code
- Kosmos-2.5: A Multimodal Literate Model (MSRA) | 23.9.20 | arXiv | Code
- BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions (UC San Diego) | 23.8.19 | AAAI24 | Code
- UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding (USTC,ByteDance) | 23.8.19 | arXiv
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding (DAMO) | 23.7.4 | arXiv | Code
- Document Understanding Dataset and Evaluation (DUDE) | 23.5.15 | arXiv | Website
- On the Hidden Mystery of OCR in Large Multimodal Models (HUST,SCUT,Microsoft) | 23.5.13 | arXiv | Code
- Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution (HUST) | 23.5.12 | arXiv | Code
- StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training (Baidu) | 23.03.01 | ICLR23 | Code
2022
- Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding (Huawei) | 22.12.19 | ACL23
- Unifying Vision, Text, and Layout for Universal Document Processing (Microsoft) | 22.12.05 | CVPR23 | Code
- ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding (Baidu) | 22.10.12 | arXiv | Code
- Unified Pretraining Framework for Document Understanding (Adobe) | 22.04.22 | NeurIPS21
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking (Microsoft) | 22.04.18 | ACM MM22 | Code
- XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding (Alibaba) | 22.3.14 | CVPR22 | Code (unofficial)
- DiT: Self-supervised Pre-training for Document Image Transformer (Microsoft) | 22.03.04 | ACM MM22 | Code
- Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark (Huawei) | 22.2.14 | NeurIPS22 | Code
2021
- LayoutReader: Pre-training of Text and Layout for Reading Order Detection (Microsoft) | 21.08.26 | EMNLP21 | Code
- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding (Microsoft) | 21.04.18 | arXiv | Code
- Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer (Applica) | 21.02.18 | ICDAR21 | Code
2020
- LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding (Microsoft) | 20.12.29 | ACL21 | Code
2019
- LayoutLM: Pre-training of Text and Layout for Document Image Understanding (Microsoft) | 19.12.31 | KDD20 | Code
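For readers who want to try one of the models listed above, here is a minimal sketch of loading a LayoutLM-family checkpoint through Hugging Face transformers for token classification on a document image. It is an illustration rather than code from any listed paper; the label count and input file are assumptions.

```python
# Minimal sketch: LayoutLMv3 token classification via Hugging Face transformers.
# The label count and input image path are illustrative assumptions.
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

# The base checkpoint's processor runs built-in OCR (requires pytesseract/tesseract).
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base")
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=7  # e.g. FUNSD-style entity labels (assumption)
)

image = Image.open("sample_form.png").convert("RGB")  # hypothetical input document
encoding = processor(image, return_tensors="pt")      # OCR words, boxes, and pixel values
outputs = model(**encoding)
predictions = outputs.logits.argmax(-1)               # one label id per token
```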
General Multimodal Large Language Models
2024
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (Shanghai AI Lab,SenseTime,THU,NJU,FDU,CUHK) | 24.04.25 | arXiv | Code
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (THU,NUS,UCAS) | 24.03.18 | arXiv | Code
2023
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (OpenGVLab,NJU,HKU,CUHK,THU,USTC,SenseTime) | 23.12.21 | CVPR24 | Code
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions (USTC,Shanghai AI Lab) | 23.11.28 | arXiv | Code
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning (KAUST,Meta) | 23.10.14 | arXiv | Code
- Improved Baselines with Visual Instruction Tuning (UWM,Microsoft) | 23.10.05 | arXiv | Code
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (Alibaba) | 23.08.24 | arXiv | Code
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action (Azure) | 23.05.20 | arXiv | Code
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning (Salesforce) | 23.05.11 | arXiv | Code
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality (DAMO) | 23.04.27 | arXiv | Code
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models (KAUST) | 23.04.20 | arXiv | Code
- Visual Instruction Tuning (UWM,Microsoft) | 23.04.17 | NeurIPS23 | Code
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (Salesforce) | 23.01.30 | arXiv | Code
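Many of the general-purpose models above have Hugging Face integrations. The sketch below queries BLIP-2 (the last entry in this block) with an image question; the checkpoint ID and image path are illustrative, and other listed models follow a similar processor/generate pattern.

```python
# Minimal sketch: visual question answering with BLIP-2 via transformers.
# Checkpoint and image path are illustrative assumptions.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

image = Image.open("chart.png").convert("RGB")  # hypothetical input image
prompt = "Question: what is shown in this image? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```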
Multimodal LLMs with Grounding
2024
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (CU,UCSB,Apple) | 24.04.11 | arXiv | Code
2023
- LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models (HKUST,SCUT,IDEA,CUHK) | 23.12.05 | arXiv | Code
- Ferret: Refer and Ground Anything Anywhere at Any Granularity (CU,Apple) | 23.10.11 | arXiv | Code
- BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs (ByteDance) | 23.07.17 | arXiv | Code
- Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic (SenseTime,BUAA,SJTU) | 23.06.27 | arXiv | Code
- Kosmos-2: Grounding Multimodal Large Language Models to the World (Microsoft) | 23.06.26 | arXiv | Code
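The grounding-oriented models return region-text pairs rather than plain captions. Below is a sketch of that interface using the transformers port of Kosmos-2 (the last entry above); the checkpoint ID and image path are assumptions, and the other models expose grounding through their own repositories.

```python
# Minimal sketch: grounded captioning with Kosmos-2 via transformers.
# Checkpoint and image path are illustrative assumptions.
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = Kosmos2ForConditionalGeneration.from_pretrained("microsoft/kosmos-2-patch14-224")

image = Image.open("street_scene.jpg").convert("RGB")  # hypothetical input image
prompt = "<grounding>An image of"                      # <grounding> asks the model to emit boxes
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Split the raw output into clean text plus (phrase, span, bounding boxes) tuples.
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)
```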
Multimodal LLMs for Video Understanding
2023
- TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding (PKU,Noah) | 23.12.04 | CVPR24 | Code
- Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models (PKU,PengCheng,Microsoft,FarReel) | 23.11.27 | arXiv | Code
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection (PKU,PengCheng) | 23.11.16 | arXiv | Code
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding (PKU,PengCheng) | 23.11.14 | arXiv | Code
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding (DAMO) | 23.06.05 | arXiv | Code