document-processing

There are 249 repositories under the document-processing topic.

ucbepic/docetl
A system for agentic LLM-powered data processing and ETL
Language: Python 3k 26 134317
enoch3712/ExtractThinker
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
Language: Python 1.5k 20 173142
dhlab-epfl/dhSegment
Generic framework for historical document processing
Language: Python 379 27 50114
ucbepic/TWIX
TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the shared underlying visual template across documents
Language: Python 207 1 416
awslabs/project-lakechain
⚡ Cloud-native, AI-powered, document processing pipelines on AWS.
Language: TypeScript 185 11 3526
formkiq/formkiq-core
Open-source document management platform leveraging AWS managed services. RESTful API for document storage, processing, full-text search, and metadata management. Multi-tenant serverless architecture with auto-scaling... deployed entirely in your AWS account.
Language: Java 144 5 30424
Tele-AI/doc-ops-mcp
MCP server for seamless document format conversion and processing
Language: TypeScript 132 1 23
iamarunbrahma/pdf-to-markdown
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
Language: Python 101 3 28
awslabs/rhubarb
A Python framework for multi-modal document understanding with Amazon Bedrock
Language: Python 96 6 1312
parsee-ai/parsee-core
Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized on PDFs and HTML files. Extensive support of tabular data extraction and multimodal queries.
Language: Python 74 4 01
steindani/pandoc-include
An include filter for Pandoc
Language: Haskell 62 4 1120
Addepto/graph_builder
Open-source toolkit to extract structured knowledge graphs from documents and tables — power analytics, digital twins, and AI-driven assistants.
Language: Python 56 2 09
PSPDFKit/nutrient-document-engine-mcp-server
A Model Context Protocol (MCP) server implementation that exposes document processing capabilities through natural language, supporting both direct human interaction and AI agent tool calling.
Language: TypeScript 561
jmanhype/DSPy-Multi-Document-Agents
An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.
Language: Python 49 1 05
abdullahshafiq-20/ResumeTex
ResumeTex is an AI-powered tool that converts standard PDF resumes into professionally formatted LaTeX documents. This service helps you create elegant, structured resumes without needing to learn LaTeX syntax.
Language: JavaScript 415
aws-solutions/enhanced-document-understanding-on-aws
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.
Language: JavaScript 40 16 2019
cburschka/lyx
Unofficial mirror of git://git.lyx.org/lyx.git (updates daily. not affiliated with lyx.org.)
Language: C++ 39 6 07
seehiong/pdfusion
A powerful PDF processing engine that deconstructs documents into their core elements—text, images, and tables—and seamlessly reconstructs them into pristine, structured Markdown. Built with a React frontend and a robust Python (PyMuPDF) backend on Appwrite.
Language: Python 37 0 00
kili-technology/awesome-datasets
A comprehensive list of annotated training datasets classified by use case.
35 3 06
afrozas/proceedings
Semantic extraction from conference proceedings.
Language: Python 31 0 01
ucbepic/BARGAIN
Low-Cost LLM-Powered Data Processing with Theoretical Guarantees
Language: Python 293
autollama/autollama
Anthropic's Contextual Retrieval implementation with visual chunk comparison. Preview context enrichment before/after embedding.
Language: HTML 25 1 280
aws-samples/sample-document-processing-with-amazon-bedrock-data-automation
This repository contains examples for customers to get started using Amazon Bedrock Data Automation. The samples focus mainly on document processing use cases
Language: Jupyter Notebook 23 2 112
MBAigner/PDFSegmenter
This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified and returned. Tables are retrieved formatted as a CSV.
Language: Python 22 1 03
greed2411/tokyo
tokyo, a REST API, when given any type of document 📄, Identifies mime-type 🧐. Suggests extension 🦔. Alas Extracts text 💪.
Language: Clojure 18 1 00
AI-Engineering-Study-Group/docugent
AI-powered document intelligence platform for automated analysis, processing, and insights extraction from various document formats.
Language: Python 1712
OlegCheban/WaterMarkIt
A lightweight, framework-agnostic Java library for adding watermarks to various file types, including PDFs and images
Language: Java 17 1 1319
smart-models/Normalized-Semantic-Chunker
Cutting-edge tool that unlocks the full potential of semantic chunking
Language: Python 17
eklem/stopword-trainer
A module for creating stopword lists for any language, based on a set of documents.
Language: JavaScript 15 1 620
martin-papy/qdrant-loader
Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration.
Language: Python 155
thammuio/doc-genius-ai
DocGenius AI - Generative AI Chatbot for your Documents
Language: Python 14 2 06
felixdittrich92/docling-OCR-OnnxTR
OnnxTR OCR plugin for Docling
Language: Python 130
Swiftgum/swiftgum
ETL for RAG. Transform any source into LLM-ready markdown. Focus on your AI, not integrations.
Language: TypeScript 11 2 00
aws-samples/sample-for-multi-modal-document-to-json-with-sagemaker-ai
This open-source project delivers a complete pipeline for converting multi-page documents (PDFs/images) into structured JSON using Vision LLMs on Amazon SageMaker. The solution leverages the SWIFT Framework to fine-tune models specifically for document understanding tasks.
Language: Jupyter Notebook 10 1 02
vakharwalad23/mark-minion
The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications. Convert almost any web content into clean Markdown with intelligent AI processing.
Language: TypeScript 10 1 01
quarkiverse/quarkus-docling
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem
Language: Java 84
