document-analysis

There are 168 repositories under the document-analysis topic.

opendatalab/MinerU
A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON formats.
Language: Python 43.9k 183 1.7k 3.6k
bytedance/Dolphin
The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.
Language: Python 5.8k 469
ucbepic/docetl
A system for agentic LLM-powered data processing and ETL
Language:Python2.8k 18 111302
UglyToad/PdfPig
Read and extract text and other content from PDFs in C# (port of PDFBox)
Language:C#2.2k 48 556283
AlibabaResearch/AdvancedLiterateMachinery
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
Language:C++1.8k 43 205198
NanoNets/docext
An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)
Language: Python 1.7k 130
tstanislawek/awesome-document-understanding
A curated list of resources for the Document Understanding (DU) topic
1.5k 37 2164
DocumindHQ/documind
Open-source platform for extracting structured data from documents using AI.
Language:JavaScript1.4k 11 1057
Yuliang-Liu/Curve-Text-Detector
This repository provides training and testing code, dataset, detection and recognition annotations, evaluation script, annotation tool, and ranking.
Language:Jupyter Notebook649 30 58158
ispras/dedoc
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser
Language:Python595 12 2844
wenwenyu/PICK-pytorch
Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
Language:Python568 22 114192
CybercentreCanada/assemblyline
AssemblyLine 4: File triage and malware analysis
Language:Python363 8 27718
jpWang/LiLT
Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)
Language:Python355 6 4741
pandora-analysis/pandora
Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results
Language:Python269 9 15741
lazyFrogLOL/llmdocparser
A package for parsing PDFs and analyzing their content using LLMs.
Language:Python267 3 39
masyagin1998/robin
RObust document image BINarization
Language:Python182 11 1139
ppaanngggg/yolo-doclaynet
YOLO models trained on DocLayNet - power your document intelligence with layout analysis
Language:Python134 3 519
chriswolfvision/local_adaptive_binarization
Local adaptive image binarization
Language:C++126 10 425
mirabdullahyaser/Retrieval-Augmented-Generation-Engine-with-LangChain-and-Streamlit
Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.
Language:Python125 3 459
anisha2102/docvqa
Document Visual Question Answering
Language:Python124 4 1525
aws-samples/amazon-textract-transformer-pipeline
Post-process Amazon Textract results with Hugging Face transformer models for document understanding
Language:Python99 25 1824
monniert/docExtractor
(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper
Language:Python88 7 2210
Xyntopia/pydoxtools
Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.
Language:Python84 5 214
abdur75648/UTRNet-High-Resolution-Urdu-Text-Recognition
UTRNet: High-Resolution Urdu Text Recognition In Printed Documents (ICDAR'23)
Language:Python56 6 1110
ZeningLin/ViBERTgrid-PyTorch
An unofficial PyTorch implementation of "Lin et al. ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents. ICDAR, 2021"
Language:Python53 4 115
JPLeoRX/detectron2-publaynet
Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset
Language:Python50 3 37
aws-solutions/enhanced-document-understanding-on-aws
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.
Language:JavaScript40 16 2017
ankanbhunia/AdverseBiNet
Improving Document Binarization via Adversarial Noise-Texture Augmentation (ICIP 2019)
Language:Python38 6 69
lin-tan/DocTer
For our ISSTA22 paper "DocTer: Documentation-Guided Fuzzing for Testing Deep Learning API Functions" by Danning Xie, Yitong Li, Mijung Kim, Hung Viet Pham, Lin Tan, Xiangyu Zhang, Mike Godfrey
37 3 04
AILab-UniFI/GNN-TableExtraction
Code for ICPR2022 paper: "Graph Neural Networks and Representation Embedding for table extraction in PDF Documents"
Language:Python35 5 65
microsoft/synthetic-rag-index
Service to import data from various sources and index it in AI Search. Increases data relevance and reduces final size by 90%+. Useful for RAG scenarios with LLM. Hosted in Azure with serverless architecture.
Language:Python33 3 47
retab-dev/retab
The developer starter pack for document processing
Language: Jupyter Notebook 33
BjornMelin/docmind-ai-llm
DocMind AI is a powerful, open-source Streamlit application leveraging LlamaIndex, LangGraph, and local Large Language Models (LLMs) via Ollama, LMStudio, llama.cpp, or vLLM for advanced document analysis. Analyze, summarize, and extract insights from a wide array of file formats—securely and privately, all offline.
Language: Python 29 1 0
CaseDrive/publaynet-models
Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset
Language:Python28 2 02
muhd-umer/pyramidtabnet
Official PyTorch implementation of PyramidTabNet: Transformer-based Table Recognition in Image-based Documents
Language:Python28 1 02
swapnil-ahlawat/Document_Layout_Analysis-MonkAI
DL models that take a document image file as input, locate the position of paragraphs, lines, images, etc. with their labels and confidence scores.
Language:Jupyter Notebook26 1 17
