A repository for collecting papers on document artificial intelligence and related multimodal large language models.
Continuously updated 🤗
Document AI
2024
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (Shanghai AI Lab,CUHK,THU,NJU,FDU,SenseTime) | 24.4.25 | arXiv | Code
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (Shanghai AI Lab,CUHK,THU,SenseTime) | 24.4.9 | arXiv | Code
- LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding (Alibaba,ZJU) | 24.4.8 | arXiv | Code
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding (Alibaba,RUC) | 24.3.19 | arXiv | Code
- TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document (HUST) | 24.3.7 | arXiv | Code
- HRVDA: High-Resolution Visual Document Assistant (Tencent YouTu Lab,USTC) | 24.2.29 | CVPR24
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models (Tencent YouTu Lab) | 24.2.29 | CVPR24
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (Shanghai AI Lab,CUHK,SenseTime) | 24.1.29 | arXiv | Code
- Small Language Model Meets with Reinforced Vision Vocabulary (MEGVII,UCAS,HUST) | 24.1.23 | arXiv | Code
2023
- DocLLM: A layout-aware generative language model for multimodal document understanding (JPMorgan AI Research) | 23.12.31 | arXiv
- Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models (MEGVII,UCAS,HUST) | 23.12.11 | arXiv | Code
- mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model (Alibaba) | 23.11.30 | arXiv | Code
- Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs (USTC) | 23.11.22 | arXiv | Code
- DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding (USTC,ByteDance) | 23.11.20 | arXiv
- Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models (HUST) | 23.11.11 | CVPR24 | Code
- mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration (Alibaba) | 23.11.07 | CVPR24 | Code
- Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation (SCUT) | 23.10.25 | arXiv | Code
- UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model (DAMO,RUC,ECNU) | 23.10.08 | arXiv | Code
- Kosmos-2.5: A Multimodal Literate Model (MSRA) | 23.9.20 | arXiv | Code
- BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions (UC San Diego) | 23.8.19 | AAAI24 | Code
- UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding (USTC,ByteDance) | 23.8.19 | arXiv
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding (DAMO) | 23.7.4 | arXiv | Code
- Document Understanding Dataset and Evaluation (DUDE) | 23.5.15 | arXiv | Website
- On the Hidden Mystery of OCR in Large Multimodal Models (HUST,SCUT,Microsoft) | 23.5.13 | arXiv | Code
- Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution (HUST) | 23.5.12 | arXiv | Code
- StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training (Baidu) | 23.03.01 | ICLR23 | Code
2022
- Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding (Huawei) | 22.12.19 | ACL23
- Unifying Vision, Text, and Layout for Universal Document Processing (Microsoft) | 22.12.05 | CVPR23 | Code
- ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding (Baidu) | 22.10.12 | arXiv | Code
- Unified Pretraining Framework for Document Understanding (Adobe) | 22.04.22 | NeurIPS21
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking (Microsoft) | 22.04.18 | ACM MM22 | Code
- XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding (Alibaba) | 22.3.14 | CVPR22 | Code (unofficial)
- DiT: Self-supervised Pre-training for Document Image Transformer (Microsoft) | 22.03.04 | ACM MM22 | Code
- Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark (Huawei) | 22.2.14 | NeurIPS22 | Code
2021
- LayoutReader: Pre-training of Text and Layout for Reading Order Detection (Microsoft) | 21.08.26 | EMNLP21 | Code
- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding (Microsoft) | 21.04.18 | arXiv | Code
- Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer (Applica) | 21.02.18 | ICDAR21 | Code
2020
- LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding (Microsoft) | 20.12.29 | ACL21 | Code
2019
- LayoutLM: Pre-training of Text and Layout for Document Image Understanding (Microsoft) | 19.12.31 | KDD20 | Code
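For readers who want to try one of the models listed above, here is a minimal sketch of loading a LayoutLM-family checkpoint through Hugging Face transformers for token classification on a document image. It is an illustration rather than code from any listed paper; the label count and input file are assumptions.

```python
# Minimal sketch: LayoutLMv3 token classification via Hugging Face transformers.
# The label count and input image path are illustrative assumptions.
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

# The base checkpoint's processor runs built-in OCR (requires pytesseract/tesseract).
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base")
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=7  # e.g. FUNSD-style entity labels (assumption)
)

image = Image.open("sample_form.png").convert("RGB")  # hypothetical input document
encoding = processor(image, return_tensors="pt")      # OCR words, boxes, and pixel values
outputs = model(**encoding)
predictions = outputs.logits.argmax(-1)               # one label id per token
```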
General Multimodal Large Language Models
2024
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (Shanghai AI Lab,SenseTime,THU,NJU,FDU,CUHK) | 24.04.25 | arXiv | Code
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (THU,NUS,UCAS) | 24.03.18 | arXiv | Code
2023
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (OpenGVLab,NJU,HKU,CUHK,THU,USTC,SenseTime) | 23.12.21 | CVPR24 | Code
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions (USTC,Shanghai AI Lab) | 23.11.28 | arXiv | Code
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning (KAUST,Meta) | 23.10.14 | arXiv | Code
- Improved Baselines with Visual Instruction Tuning (UWM,Microsoft) | 23.10.05 | arXiv | Code
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (Alibaba) | 23.08.24 | arXiv | Code
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action (Azure) | 23.05.20 | arXiv | Code
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning (Salesforce) | 23.05.11 | arXiv | Code
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality (DAMO) | 23.04.27 | arXiv | Code
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models (KAUST) | 23.04.20 | arXiv | Code
- Visual Instruction Tuning (UWM,Microsoft) | 23.04.17 | NeurIPS23 | Code
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (Salesforce) | 23.01.30 | arXiv | Code
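Many of the general-purpose models above have Hugging Face integrations. The sketch below queries BLIP-2 (the last entry in this block) with an image question; the checkpoint ID and image path are illustrative, and other listed models follow a similar processor/generate pattern.

```python
# Minimal sketch: visual question answering with BLIP-2 via transformers.
# Checkpoint and image path are illustrative assumptions.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

image = Image.open("chart.png").convert("RGB")  # hypothetical input image
prompt = "Question: what is shown in this image? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```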
Multimodal LLMs with Grounding
2024
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (CU,UCSB,Apple) | 24.04.11 | arXiv | Code
2023
- LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models (HKUST,SCUT,IDEA,CUHK) | 23.12.05 | arXiv | Code
- Ferret: Refer and Ground Anything Anywhere at Any Granularity (CU,Apple) | 23.10.11 | arXiv | Code
- BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs (ByteDance) | 23.07.17 | arXiv | Code
- Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic (SenseTime,BUAA,SJTU) | 23.06.27 | arXiv | Code
- Kosmos-2: Grounding Multimodal Large Language Models to the World (Microsoft) | 23.06.26 | arXiv | Code
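The grounding-oriented models return region-text pairs rather than plain captions. Below is a sketch of that interface using the transformers port of Kosmos-2 (the last entry above); the checkpoint ID and image path are assumptions, and the other models expose grounding through their own repositories.

```python
# Minimal sketch: grounded captioning with Kosmos-2 via transformers.
# Checkpoint and image path are illustrative assumptions.
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = Kosmos2ForConditionalGeneration.from_pretrained("microsoft/kosmos-2-patch14-224")

image = Image.open("street_scene.jpg").convert("RGB")  # hypothetical input image
prompt = "<grounding>An image of"                      # <grounding> asks the model to emit boxes
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Split the raw output into clean text plus (phrase, span, bounding boxes) tuples.
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)
```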
Multimodal LLMs for Video Understanding
2023
- TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding (PKU,Noah) | 23.12.04 | CVPR24 | Code
- Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models (PKU,PengCheng,Microsoft,FarReel) | 23.11.27 | arXiv | Code
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection (PKU,PengCheng) | 23.11.16 | arXiv | Code
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding (PKU,PengCheng) | 23.11.14 | arXiv | Code
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding (DAMO) | 23.06.05 | arXiv | Code