document-analysis

There are 168 repositories under the document-analysis topic.

  • opendatalab/MinerU

    A high-quality, one-stop open-source data extraction tool that converts PDF to Markdown and JSON.

    Language:Python43.9k1831.7k3.6k
  • bytedance/Dolphin

    The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.

    Language:Python5.8k469
  • ucbepic/docetl

    A system for agentic LLM-powered data processing and ETL

    Language:Python2.8k18111302
  • UglyToad/PdfPig

    Read and extract text and other content from PDFs in C# (port of PDFBox)

    Language:C#2.2k48556283
  • AlibabaResearch/AdvancedLiterateMachinery

    A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

    Language:C++1.8k43205198
  • NanoNets/docext

    An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)

    Language:Python1.7k130
  • tstanislawek/awesome-document-understanding

    A curated list of resources on the topic of Document Understanding (DU)

  • DocumindHQ/documind

    Open-source platform for extracting structured data from documents using AI.

    Language:JavaScript1.4k111057
  • Yuliang-Liu/Curve-Text-Detector

    This repository provides training and test code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.

    Language:Jupyter Notebook6493058158
  • ispras/dedoc

    Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser

    Language:Python595122844
  • wenwenyu/PICK-pytorch

    Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)

    Language:Python56822114192
  • CybercentreCanada/assemblyline

    AssemblyLine 4: File triage and malware analysis

    Language:Python363827718
  • jpWang/LiLT

    Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)

    Language:Python35564741
  • pandora-analysis/pandora

    Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results

    Language:Python269915741
  • lazyFrogLOL/llmdocparser

    A package for parsing PDFs and analyzing their content using LLMs.

    Language:Python267339
  • masyagin1998/robin

    RObust document image BINarization

    Language:Python182111139
  • ppaanngggg/yolo-doclaynet

    YOLO models trained on DocLayNet - power your Document Intelligence with Layout Analysis

    Language:Python1343519
  • chriswolfvision/local_adaptive_binarization

    Local adaptive image binarization

    Language:C++12610425
  • mirabdullahyaser/Retrieval-Augmented-Generation-Engine-with-LangChain-and-Streamlit

    Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.

    Language:Python1253459
  • anisha2102/docvqa

    Document Visual Question Answering

    Language:Python12441525
  • aws-samples/amazon-textract-transformer-pipeline

    Post-process Amazon Textract results with Hugging Face transformer models for document understanding

    Language:Python99251824
  • monniert/docExtractor

    (ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper

    Language:Python8872210
  • Xyntopia/pydoxtools

    Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.

    Language:Python845214
  • abdur75648/UTRNet-High-Resolution-Urdu-Text-Recognition

    UTRNet: High-Resolution Urdu Text Recognition In Printed Documents (ICDAR'23)

    Language:Python5661110
  • ZeningLin/ViBERTgrid-PyTorch

    An unofficial PyTorch implementation of "Lin et al. ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents. ICDAR, 2021"

    Language:Python534115
  • JPLeoRX/detectron2-publaynet

    Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset

    Language:Python50337
  • aws-solutions/enhanced-document-understanding-on-aws

    Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.

    Language:JavaScript40162017
  • ankanbhunia/AdverseBiNet

    Improving Document Binarization via Adversarial Noise-Texture Augmentation (ICIP 2019)

    Language:Python38669
  • lin-tan/DocTer

    For our ISSTA22 paper "DocTer: Documentation-Guided Fuzzing for Testing Deep Learning API Functions" by Danning Xie, Yitong Li, Mijung Kim, Hung Viet Pham, Lin Tan, Xiangyu Zhang, Mike Godfrey

  • AILab-UniFI/GNN-TableExtraction

    Code for ICPR2022 paper: "Graph Neural Networks and Representation Embedding for table extraction in PDF Documents"

    Language:Python35565
  • microsoft/synthetic-rag-index

    Service to import data from various sources and index it in AI Search. Increases data relevance and reduces final size by 90%+. Useful for RAG scenarios with LLM. Hosted in Azure with serverless architecture.

    Language:Python33347
  • retab-dev/retab

    The developer starter pack for document processing

    Language:Jupyter Notebook33
  • BjornMelin/docmind-ai-llm

    DocMind AI is a powerful, open-source Streamlit application leveraging LlamaIndex, LangGraph, and local Large Language Models (LLMs) via Ollama, LMStudio, llama.cpp, or vLLM for advanced document analysis. Analyze, summarize, and extract insights from a wide array of file formats—securely and privately, all offline.

    Language:Python2910
  • CaseDrive/publaynet-models

    Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset

    Language:Python28202
  • muhd-umer/pyramidtabnet

    Official PyTorch implementation of PyramidTabNet: Transformer-based Table Recognition in Image-based Documents

    Language:Python28102
  • swapnil-ahlawat/Document_Layout_Analysis-MonkAI

    DL models that take a document image file as input, locate the position of paragraphs, lines, images, etc. with their labels and confidence scores.

    Language:Jupyter Notebook26117