document-parsing

There are 54 repositories under the document-parsing topic.

  • PaddlePaddle/PaddleOCR

    Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

    Language: Python
  • docling-project/docling

    Get your documents ready for gen AI

    Language: Python
  • Unstructured-IO/unstructured

    Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

    Language: HTML
  • run-llama/llama_cloud_services

    Knowledge Agents and Management in the Cloud

    Language: TypeScript
  • enoch3712/ExtractThinker

    ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

    Language: Python
  • NanoNets/docstrange

    Extract and convert data from any document, image, PDF, Word document, PPT, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

    Language: Python
  • opendataloader-project/opendataloader-pdf

    Safe, Open, High-Performance — PDF for AI

    Language: Java
  • edenai/edenai-apis

    Eden AI: simplify the use and deployment of AI technologies by providing a unique API that connects to the best possible AI engines

    Language: Python
  • harishdeivanayagam/rowfill

    Open-source spreadsheets platform for deep research and document processing

    Language: TypeScript
  • GiftMungmeeprued/document-parsers-list

    A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

  • AdemBoukhris457/Documents-Parsing-Lab

    Jupyter notebooks testing different OCR models for document parsing (Dolphin, MonkeyOCR, Marker, Nanonets, ...)

    Language: Jupyter Notebook
  • CycloneBoy/pdf_table

    A Unified Toolkit for Deep Learning-Based Table Extraction

    Language: Python
  • papercast-dev/papercast

    A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, PDF, with GROBID, LangChain, listen as podcast. Customize your own pipelines.

    Language: Python
  • Unstructured-IO/community

    Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

  • Hyland/DocumentFilters

    Document Filters is an SDK for applications like content indexing, e-discovery, data migration, and feeding data into AI/ML models by extracting data from unstructured sources. It gives the ability to perform deep inspection, data extraction, output manipulation, and conversion for virtually any type of document, in any programming language.

    Language: C++
  • docling-project/docling4j

    Docling4j brings the functionalities of Docling in document understanding to Java® projects

    Language: Java
  • acenji/ats

    Applicant Tracking System (ATS): A powerful platform leveraging generative AI and soft-match algorithms to analyze resumes against job descriptions. Built with React and Node.js, it streamlines hiring insights. Future plans include expanding to investor pitches and other structured documents.

    Language: JavaScript
  • aimagelab/mugat

    Official implementation of our ECCVW paper "μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context"

    Language: Python
  • J-sephB-lt-n/pdf-bank-statement-parser

    Tool for converting First National Bank (FNB) bank statement PDFs into useful structured data

    Language: Python
  • baughmann/tikara

    The metadata and text content extractor for almost every file type.

    Language: Python
  • renswickd/document-parser-collection

    A collection of document parsers and hands-on examples for constructing structured data for your RAG applications.

    Language: Python
  • syw2014/langparse

    LangParse is a universal document parsing and text chunking engine for LLM or Agent applications — Documents In, Knowledge Out.

  • ziming/laravel-docparser

    Docparser OCR Package for PHP Laravel

    Language: PHP
  • ajaycode/unstructured

    Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

    Language: HTML
  • Anmol-Baranwal/doc-parsing

    Python scripts to parse and structure invoice data from PDFs using OpenAI, Anthropic and Invofox APIs

    Language: Python
  • Bharathyalagi/OCR-Document-parser

    Smart OCR application built with Tesseract and Streamlit that extracts structured data from inputs

    Language: Python
  • Kathan-max/RAG-Enhanced-Chatbot-with-LoRA-Fine-Tuning

    Transform your documents into intelligent conversations. This open-source RAG chatbot combines semantic search with fine-tuned language models (LLaMA, Qwen2.5VL-3B) to deliver accurate, context-aware responses from your own knowledge base. Join our community!

    Language: Python
  • rithulkamesh/docproc

    Opinionated and Sophisticated Document Region Analyzer.

    Language: Python
  • Anwarsha7/resumeparser

    An intelligent resume parsing engine built with Python and NLP, aimed at automating the tedious task of sifting through resumes. It accurately extracts vital candidate information such as contact details, employment history, educational qualifications, and technical skills, making it an invaluable asset for recruitment and HR professionals.

    Language: HTML
  • anyparser/anyparser_crewai

    Supercharge your AI workflows by combining Anyparser’s advanced content extraction with Crew AI. With this integration, you can effortlessly leverage Anyparser’s document processing and data extraction tools within your Crew AI applications.

    Language: Python
  • hftuner/clovaai-donut

    Collection of notebooks for fine-tuning the Donut model for various visual document understanding tasks, using the Hugging Face Trainer.

    Language: Jupyter Notebook
  • MegrezAI/LeapRAG

    LeapRAG is an open-source platform that integrates advanced RAG technology with Google’s A2A protocol, enabling users to build context-aware, data-driven agents. These agents are automatically A2A-compliant and can be discovered and used by any compatible client without extra development.

    Language: Python
  • Mouez-Yazidi/Multilingual-Invoice-Parsing-with-LLaMA-4

    Combining OCR for text extraction with LLMs for accurate, efficient document structuring.

    Language: Python
  • OneOffTech/parxyval

    Evaluation framework for document parsing

    Language: Python
  • PRITHIVSAKTHIUR/DocScope-R1

    A powerful multi-modal AI application that combines three state-of-the-art vision-language models for comprehensive image and video analysis. DocScope-R1 provides OCR capabilities, detailed scene understanding, and video content analysis through an intuitive Gradio interface.

    Language: Python
  • PRITHIVSAKTHIUR/dots.ocr-fix-demo

    This Gradio application demonstrates the capabilities of the "dots.ocr" model, a powerful multilingual document parser.

    Language: Jupyter Notebook