document-parser

There are 66 repositories under the document-parser topic.

infiniflow/ragflow
RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to create a superior context layer for LLMs
Language:TypeScript67.3k 294 5.7k7.2k
docling-project/docling
Get your documents ready for gen AI
Language:Python43.2k 183 1.4k3.1k
Unstructured-IO/unstructured
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking and embedding.
Language:HTML13.1k 68 1.2k1.1k
freeok/so-novel
Novel downloads | Web fiction downloads | Online novels
Language:Java5.4k 13 171430
Marker-Inc-Korea/AutoRAG
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
Language:Python4.4k 32 651351
run-llama/llama_cloud_services
Knowledge Agents and Management in the Cloud
Language:TypeScript4.2k 25 562459
Filimoa/open-parse
Improved file parsing for LLMs
Language:Python3.1k 21 47138
deepdoctection/deepdoctection
A Repo For Document AI
Language:Python3k 20 200172
NanoNets/docstrange
Extract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.
Language:Python1k 5 1493
liweiphys/layra
LAYRA—an enterprise-ready, out-of-the-box solution—unlocks next-generation intelligent systems powered by visual RAG and limitless visual multi-step agent workflow orchestration.
Language:TypeScript885 16 3794
opendataloader-project/opendataloader-pdf
Safe, Open, High-Performance — PDF for AI
Language:Java748 4 536
iamarunbrahma/vision-parse
Parse PDFs into markdown using Vision LLMs
Language:Python441 5 2562
GiftMungmeeprued/document-parsers-list
A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.
163 6 14
marieai/marie-ai
Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pipelines (GenAI, LLM, VLLM) into your applications, supporting various tasks such as document cleanup, optical character recognition (OCR), classification, splitting, named entity recognition, and form processing
Language:Python76 3 1409
LianjiaTech/bella-domify
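Document Parser: supports PDF, TXT, DOC, DOCX, Markdown, and other file formats, efficiently extracts and parses content, and generates a standard document tree structure. Built-in PDF Parser, Text Parser, and Word Parser support intelligent applications such as RAG, knowledge bases, and full-text search.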
Language:Python56 0 09
JPLeoRX/opencv-text-deskew
Tutorial on how to deskew (straighten) text images
Language:Python52 3 215
papercast-dev/papercast
A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, PDF, with GROBID, LangChain, listen as podcast. Customize your own pipelines.
Language:Python52 1 93
InvoiceableAI/Invoiceable
The invoice, document, and resume parser powered by AI.
Language:Python40 1 33
graphlit/graphlit
Graphlit Platform
24 1 01
decisionfacts/semantic-ai
An open-source framework for Retrieval-Augmented System (RAG) that uses semantic search to retrieve the expected results and generates human-readable conversational responses with the help of an LLM (Large Language Model).
Language:Python21 2 01
urbanclap-engg/smart-docs-parser
An OCR based document parser to extract information from identity document images
Language:TypeScript21 2 37
docling-project/docling4j
Docling4j brings the functionalities of Docling in document understanding to Java® projects
Language:Java18 2 01
brazilian-code/Resume_Parsing
Resume Parsing app to extract information using AI
Language:Jupyter Notebook17 0 19
graphlit/graphlit-client-python
Python client library for Graphlit Platform
Language:Python16 3 02
novaladai/novalad
Novalad offers a unified, centralized platform enabling organizations to extract meaningful data and perform advanced processing at high speed.
Language:Jupyter Notebook160
decisionfacts/df-extract
DF Extract Lib
Language:Python14 1 00
Clearedge-AI/clearedge
Build a RAG preprocessing pipeline
Language:Jupyter Notebook12 2 00
has-abi/docparser
Extract text from your DOCX documents.
Language:Python11 1 02
privateai-com/docviz
Advanced document contents extraction with multiple output formats
Language:Python6
Gyanvir/DrParser
Dr.Parser 🩸📊 – AI-powered blood report parser that extracts and analyzes medical data from images/PDFs. Built with React, FastAPI, EasyOCR, and Gemini AI. 🚀 🔹 Local Setup Available | 🔹 Future Enhancements Planned | 🔹 Hackathon Project 👉 Clone, run, and explore the future of AI-driven healthcare!
Language:Python4 1 00
hrbrmstr/docparser
🧰 Tools to Upload/Parse Documents to 'docparser' and Retrieve Extracted Results
Language:R4 2 0
coderosh/docpa
A simple library that I use for web scraping. Uses htmlparser2 to parse dom.
Language:TypeScript3 1 0
connectaman/deepseek-ocr-multigpu-infer
Efficient multi-GPU OCR inference framework leveraging parallel processes for accelerated token throughput and faster batch processing. Designed for scalable, high-performance optical character recognition workloads using PyTorch. Supports dynamic GPU assignment, optimized resource utilization, and easy integration for large-scale image datasets.
Language:Python3
shijincai/fast360
The industry's first "Open Source OCR Arena," a free, no-login utility for one-click benchmarking of 7 top-tier models (Marker, MinerU, MonkeyOCR, Docling, Dolphin, OCRFlux, PP-StructureV3) on your PDF/image files, specializing in PDF-to-Markdown conversion.
3
Vetrivel07/AI-Powered-Resume-Evaluator
An AI-powered resume evaluation app that compares a candidate’s resume with a job description using Google’s Gemini model to provide HR-style feedback and an ATS-style match scoring through a simple and interactive Streamlit interface.
Language:Python3 0 0
CyrilDesch/SRAG
An Open-source Scala-based Hybrid RAG offering deep document understanding and audio processing. Built with a flexible architecture that lets you easily plug in different models or storage systems, stateless and scalable by design.
Language:Scala2
