document-parser

There are 58 repositories under the document-parser topic.

  • docling

    Get your documents ready for gen AI

    Language: Python · Stars: 38.8k
  • unstructured

    Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

    Language: HTML · Stars: 12.7k
  • AutoRAG

    AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

    Language: Python · Stars: 4.3k
  • llama_cloud_services

    Knowledge Agents and Management in the Cloud

    Language: TypeScript · Stars: 4.1k
  • so-novel

    Novel download | Web fiction download | Online novels

    Language: Java · Stars: 4.1k
  • open-parse

    Improved file parsing for LLMs

    Language: Python · Stars: 3.1k
  • deepdoctection

    A Repo For Document AI

    Language: Python · Stars: 3k
  • layra

    LAYRA, an enterprise-ready, out-of-the-box solution, unlocks next-generation intelligent systems powered by visual RAG and limitless visual multi-step agent workflow orchestration.

    Language: TypeScript · Stars: 807
  • docstrange

    Extract and convert data from any document (images, PDFs, Word documents, PowerPoint files, or URLs) into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

    Language: Python · Stars: 551
  • vision-parse

    Parse PDFs into markdown using Vision LLMs

    Language: Python · Stars: 428
  • document-parsers-list

    A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each is tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

  • marie-ai

    Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pipelines (GenAI, LLM, VLLM) into your applications, supporting various tasks such as document cleanup, optical character recognition (OCR), classification, splitting, named entity recognition, and form processing

    Language: Python · Stars: 73
  • papercast

    A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, Semantic Scholar, or PDF with GROBID and LangChain, and listen to them as a podcast. Customize your own pipelines.

    Language: Python · Stars: 52
  • opencv-text-deskew

    Tutorial on how to deskew (straighten) text images

    Language: Python · Stars: 52
  • bella-domify

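    Document Parser supporting PDF, TXT, DOC, DOCX, Markdown, and other file formats. It efficiently extracts and parses content and generates a standard document tree structure. Built-in PDF Parser, Text Parser, and Word Parser support intelligent applications such as RAG, knowledge bases, and full-text search.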

    Language: Python · Stars: 38
  • Invoiceable

    The invoice, document, and resume parser powered by AI.

    Language: Python · Stars: 38
  • graphlit

    Graphlit Platform

  • semantic-ai

    An open-source framework for a Retrieval-Augmented System (RAG) that uses semantic search to retrieve the expected results and generate human-readable conversational responses with the help of an LLM (Large Language Model).

    Language: Python · Stars: 21
  • smart-docs-parser

    An OCR-based document parser to extract information from identity document images

    Language: TypeScript · Stars: 21
  • Resume_Parsing

    Resume Parsing app to extract information using AI

    Language: Jupyter Notebook · Stars: 17
  • docling4j

    Docling4j brings Docling's document-understanding functionality to Java® projects

    Language: Java · Stars: 16
  • novalad

    Novalad offers a unified, centralized platform enabling organizations to extract meaningful data and perform advanced processing at high speed.

    Language: Jupyter Notebook · Stars: 16
  • graphlit-client-python

    Python client library for Graphlit Platform

    Language: Python · Stars: 14
  • df-extract

    DF Extract Lib

    Language: Python · Stars: 14
  • clearedge

    Build a RAG preprocessing pipeline

    Language: Jupyter Notebook · Stars: 12
  • docparser

    Extract text from your DOCX documents.

    Language: Python · Stars: 11
  • opendataloader-pdf

    Safe, Open, High-Performance: OpenDataLoader PDF for AI

    Language: Java · Stars: 7
  • docviz

    Advanced document contents extraction with multiple output formats

    Language: Python · Stars: 6
  • DrParser

    Dr.Parser 🩸📊 – AI-powered blood report parser that extracts and analyzes medical data from images/PDFs. Built with React, FastAPI, EasyOCR, and Gemini AI. 🚀 🔹 Local Setup Available | 🔹 Future Enhancements Planned | 🔹 Hackathon Project 👉 Clone, run, and explore the future of AI-driven healthcare!

    Language: Python · Stars: 4
  • docparser

    🧰 Tools to Upload/Parse Documents to 'docparser' and Retrieve Extracted Results

    Language: R · Stars: 4
  • AI-Powered-Resume-Evaluator

    An AI-powered resume evaluation app that compares a candidate’s resume with a job description using Google’s Gemini 1.5 Flash model to provide HR-style feedback and ATS-style match scoring through a simple, interactive Streamlit interface.

    Language: Python · Stars: 3
  • docpa

    A simple library that I use for web scraping. Uses htmlparser2 to parse the DOM.

    Language: TypeScript · Stars: 3
  • local-RAG-backend

    This is the backend for a RAG system that runs on Docker Compose. It registers documents in a wide range of file formats, which can be searched using the MCP server.

    Language: Python · Stars: 2
  • LeapRAG

    LeapRAG is an open-source platform that integrates advanced RAG technology with Google’s A2A protocol, enabling users to build context-aware, data-driven agents. These agents are automatically A2A-compliant and can be discovered and used by any compatible client without extra development.

    Language: Python · Stars: 2
  • Document_Parser_using_AI

    Parse documents using AI - any document converted to markdown suitable for RAG applications

    Language: Jupyter Notebook · Stars: 2
  • techStandards

    Download and parse technical standard documents

    Language: R · Stars: 2