document-parsing

There are 54 repositories under the document-parsing topic.

PaddlePaddle/PaddleOCR
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.
Language: Python
63.6k 497 10.2k 9.3k
docling-project/docling
Get your documents ready for gen AI
Language: Python
43.6k 185 1.4k 3.1k
Unstructured-IO/unstructured
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking and embedding.
Language: HTML
13.2k 68 1.2k 1.1k
run-llama/llama_cloud_services
Knowledge Agents and Management in the Cloud
Language: TypeScript
4.2k 25 567459
enoch3712/ExtractThinker
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
Language: Python
1.5k 20 173142
NanoNets/docstrange
Extract and convert data from any document, image, PDF, Word doc, PPT, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.
Language: Python
1k 5 1494
opendataloader-project/opendataloader-pdf
Safe, Open, High-Performance — PDF for AI
Language: Java
753 3 636
edenai/edenai-apis
Eden AI: simplify the use and deployment of AI technologies by providing a unique API that connects to the best possible AI engines
Language: Python
461 8 1468
harishdeivanayagam/rowfill
Open-source spreadsheet platform for deep research and document processing
Language: TypeScript
364 8 421
GiftMungmeeprued/document-parsers-list
A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each is tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.
164 6 14
AdemBoukhris457/Documents-Parsing-Lab
Jupyter notebooks testing different OCR models for document parsing (Dolphin, MonkeyOCR, Marker, Nanonets, ...)
Language: Jupyter Notebook
748
CycloneBoy/pdf_table
A Unified Toolkit for Deep Learning-Based Table Extraction
Language: Python
52 5 49
papercast-dev/papercast
A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, Semantic Scholar, or PDF with GROBID and LangChain, and listen to them as a podcast. Customize your own pipelines.
Language: Python
52 1 93
Unstructured-IO/community
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
29 23 298
Hyland/DocumentFilters
Document Filters is an SDK for applications like content indexing, e-discovery, data migration, and feeding data into AI/ML models by extracting data from unstructured sources. It gives the ability to perform deep inspection, data extraction, output manipulation, and conversion for virtually any type of document, in any programming language.
Language: C++
23 9 02
docling-project/docling4j
Docling4j brings Docling's document-understanding functionality to Java® projects
Language: Java
18 2 01
acenji/ats
Applicant Tracking System (ATS): A powerful platform leveraging generative AI and soft-match algorithms to analyze resumes against job descriptions. Built with React and Node.js, it streamlines hiring insights. Future plans include expanding to investor pitches and other structured documents.
Language: JavaScript
11 1 03
aimagelab/mugat
Official implementation of our ECCVW paper "μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context"
Language: Python
11 4 00
J-sephB-lt-n/pdf-bank-statement-parser
Tool for converting First National Bank (FNB) bank statement PDFs into useful structured data
Language: Python
5 1 14
baughmann/tikara
The metadata and text content extractor for almost every file type.
Language: Python
4 1 90
renswickd/document-parser-collection
A collection of document parsers and hands-on examples for constructing structured data for your RAG applications.
Language: Python
30
syw2014/langparse
LangParse is a universal document parsing and text chunking engine for LLM or Agent applications — Documents In, Knowledge Out.
3 1 0
ziming/laravel-docparser
Docparser OCR Package for PHP Laravel
Language: PHP
3 1 00
ajaycode/unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
Language: HTML
2 0 00
Anmol-Baranwal/doc-parsing
Python scripts to parse and structure invoice data from PDFs using OpenAI, Anthropic and Invofox APIs
Language: Python
2
Bharathyalagi/OCR-Document-parser
Smart OCR application built with Tesseract and Streamlit that extracts structured data from inputs
Language: Python
20
Kathan-max/RAG-Enhanced-Chatbot-with-LoRA-Fine-Tuning
Transform your documents into intelligent conversations. This open-source RAG chatbot combines semantic search with fine-tuned language models (LLaMA, Qwen2.5VL-3B) to deliver accurate, context-aware responses from your own knowledge base. Join our community!
Language: Python
2 0 00
rithulkamesh/docproc
Opinionated and Sophisticated Document Region Analyzer.
Language: Python
2 1 90
Anwarsha7/resumeparser
An intelligent resume parsing engine built with Python and NLP, aimed at automating the tedious task of sifting through resumes. It accurately extracts vital candidate information such as contact details, employment history, educational qualifications, and technical skills, making it an invaluable asset for recruitment and HR professionals.
Language: HTML
1
anyparser/anyparser_crewai
Supercharge your AI workflows by combining Anyparser’s advanced content extraction with Crew AI. With this integration, you can effortlessly leverage Anyparser’s document processing and data extraction tools within your Crew AI applications.
Language: Python
1 0 00
hftuner/clovaai-donut
Collection of notebooks for fine-tuning the Donut model for various visual document understanding tasks, using the Hugging Face Trainer.
Language: Jupyter Notebook
1
MegrezAI/LeapRAG
LeapRAG is an open-source platform that integrates advanced RAG technology with Google’s A2A protocol, enabling users to build context-aware, data-driven agents. These agents are automatically A2A-compliant and can be discovered and used by any compatible client without extra development.
Language: Python
1 2 00
Mouez-Yazidi/Multilingual-Invoice-Parsing-with-LLaMA-4
Combining OCR for text extraction with LLMs for accurate, efficient document structuring.
Language: Python
1 1 01
OneOffTech/parxyval
Evaluation framework for document parsing
Language: Python
10
PRITHIVSAKTHIUR/DocScope-R1
A powerful multi-modal AI application that combines three state-of-the-art vision-language models for comprehensive image and video analysis. DocScope-R1 provides OCR capabilities, detailed scene understanding, and video content analysis through an intuitive Gradio interface.
Language: Python
1 0 01
PRITHIVSAKTHIUR/dots.ocr-fix-demo
This Gradio application demonstrates the capabilities of the "dots.ocr" model, a powerful multilingual document parser.
Language: Jupyter Notebook
1
