LayTextLLM

A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

Introduction

LayTextLLM projects each bounding box to a single embedding and interleaves it with its text, avoiding the long-sequence problem caused by spelling out coordinates as plain text while leveraging the autoregressive nature of LLMs. LayTextLLM not only streamlines the interaction between layout and textual data but also delivers improved performance on Key Information Extraction (KIE) and Visual Question Answering (VQA) tasks.
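As a rough illustration of this idea, below is a minimal PyTorch sketch. The class name `BoxProjector`, the MLP shape, and the hidden size are assumptions for illustration, not the repository's actual implementation: each normalized 4-tuple box is projected to a single embedding in the LLM's hidden space and interleaved with the embeddings of the OCR tokens it contains.

```python
import torch
import torch.nn as nn

class BoxProjector(nn.Module):
    """Projects a normalized bounding box (x1, y1, x2, y2) to one LLM-sized embedding.

    Illustrative sketch; the actual projector architecture may differ.
    """
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(4, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (num_boxes, 4), coordinates normalized to [0, 1]
        return self.proj(boxes)  # (num_boxes, hidden_size)

def interleave(box_embeds, text_embeds_per_box):
    """Builds the input sequence [box_1, text_1..., box_2, text_2..., ...]."""
    chunks = []
    for box_embed, text_embeds in zip(box_embeds, text_embeds_per_box):
        chunks.append(box_embed.unsqueeze(0))  # one layout token per box
        chunks.append(text_embeds)             # the OCR text tokens inside that box
    return torch.cat(chunks, dim=0)            # (total_tokens, hidden_size)

# Toy example: two boxes containing 3 and 2 text tokens respectively.
hidden = 256  # real LLM hidden sizes are much larger, e.g. 4096 for Llama2-7B
projector = BoxProjector(hidden)
boxes = torch.tensor([[0.1, 0.1, 0.4, 0.2],
                      [0.1, 0.3, 0.6, 0.4]])
box_embeds = projector(boxes)
text_embeds = [torch.randn(3, hidden), torch.randn(2, hidden)]
seq = interleave(box_embeds, text_embeds)
print(seq.shape)  # torch.Size([7, 256]) -- 2 layout tokens + 5 text tokens
```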

Framework

Figure: Overview of the LayTextLLM framework.

Performance on Benchmarks

| Model | DocVQA (ANLS %) | VisualMRC (CIDEr) | VQA Avg | FUNSD (F-score %) | CORD (F-score %) | SROIE (F-score %) | KIE Avg |
|---|---|---|---|---|---|---|---|
| **Text** | | | | | | | |
| Llama2-7B-base | 34.0 | 182.7 | 108.3 | 25.6 | 51.9 | 43.4 | 40.3 |
| Llama2-7B-chat | 20.5 | 6.3 | 13.4 | 23.4 | 51.8 | 58.6 | 44.6 |
| **Text + Polys** | | | | | | | |
| Llama2-7B-base<sub>coor</sub> | 8.4 | 3.8 | 6.1 | 6.0 | 46.4 | 34.7 | 29.0 |
| Llama2-7B-chat<sub>coor</sub> | 12.3 | 28.0 | 20.1 | 14.4 | 38.1 | 50.6 | 34.3 |
| Davinci-003-175B<sub>coor</sub> | - | - | - | - | 92.6 | 95.8 | - |
| DocLLM [2] | 69.5* | 264.1* | 166.8 | 51.8* | 67.6* | 91.9* | 70.3 |
| LayTextLLM<sub>zero</sub> (Ours) | 65.5 | 200.2 | 132.9 | 47.2 | 77.2 | 83.7 | 69.4 |
| LayTextLLM<sub>vqa</sub> (Ours) | 75.6* | 179.5 | 127.6 | 52.6 | 70.7 | 79.3 | 67.5 |
| LayTextLLM<sub>all</sub> (Ours) | 77.2* | 277.8* | 177.6 | 64.0* | 96.5* | 95.8* | 85.4 |
Comparison with other OCR-based methods. * indicates that the training set of the corresponding benchmark was used during training.

Usage

Run the corresponding bash file for inference on the target dataset.

bash ./run/run_*.sh
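For example, to run inference on DocVQA (the script name `run_docvqa.sh` is illustrative; substitute whichever `run_*.sh` script matches your target dataset):

bash ./run/run_docvqa.sh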

Dataset

More dataset files can be accessed on Google Drive.